Too slow to convert PDF document to HTML

rodrigo.rosas · November 10, 2015, 2:37pm

Hi, I just got a temporary license to test your product and tried to use the following instructions to convert a PDF document to HTML:

http://www.aspose.com/docs/display/pdfjava/PDF+to+HTML+-+Single+HTML+with+All+Resources+Embedded

http://www.colt.net/wp-content/uploads/2015/03/cdnp_018932.pdf

It seems it takes a very long time to convert page 9 from this document, after splitting the original document into pages.

Also, I’d like to know how to modify that code to split the document generating a separate HTML document per page.

But first of all I need to understand why this conversion is never finished.

I’m not really interested in the API. I just need a tool I can call from my non-Java application to convert PDF to HTML with good results, even in IE8. If you could provide complete code that I can compile, I’d appreciate it: it has been a long time since I last coded in Java, and I’d prefer not to spend much time setting up a Java development environment just to build such a tool. Basically, I need one tool to convert the document to a single, complete HTML document and another tool or option to convert it to multiple pages. If you could provide those, I’d really appreciate it.

Please let me know if there are any options that could speed up the PDF to HTML conversion and how to get page 9 from the above document to be converted.

codewarior · November 11, 2015, 12:39pm

Hi Rodrigo,

Thanks for contacting support.

When converting a PDF file to HTML, you can save the output as individual pages by adding the line newOptions.setSplitIntoPages(true); to your code. However, setSplitIntoPages(…) cannot be used when saving an HTML file with all resources embedded in it. To meet that requirement, first split the PDF document into individual pages and then save each page to HTML. That said, I have observed that the PDF to HTML conversion is taking too long. We have logged this problem as PDFNEWJAVA-35303 in our issue tracking system. We will look into it further and keep you posted on the status of the fix. Please be patient and allow us a little time.
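Until the slow conversion is fixed, one way to keep a single problem page (such as page 9 of the document above) from stalling a whole batch is to convert each page under a time limit and record the pages that do not finish. The following is only a minimal sketch in plain Java and is not part of Aspose.Pdf: `PageConversionRunner`, `PageConverter`, and `convertAll` are hypothetical names, and the actual per-page conversion (for example, copying one page into a new document and saving it as HTML) would be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PageConversionRunner {

    /** Hypothetical hook (not an Aspose.Pdf type): converts one page, numbered from 1. */
    public interface PageConverter {
        void convert(int pageNumber) throws Exception;
    }

    /**
     * Converts pages 1..pageCount one after another and returns the numbers of
     * the pages that threw an exception or did not finish within timeoutMillis.
     */
    public static List<Integer> convertAll(int pageCount, long timeoutMillis, PageConverter converter) {
        List<Integer> failed = new ArrayList<>();
        // Daemon threads in a cached pool: a conversion that ignores interruption
        // neither blocks the following pages nor keeps the JVM alive at exit.
        ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable, "page-converter");
            thread.setDaemon(true);
            return thread;
        });
        try {
            for (int page = 1; page <= pageCount; page++) {
                final int current = page;
                Future<Object> result = executor.submit(() -> {
                    converter.convert(current);
                    return null;
                });
                try {
                    result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Took too long: request interruption and move on to the next page.
                    result.cancel(true);
                    failed.add(current);
                } catch (ExecutionException e) {
                    // The conversion itself threw an exception.
                    failed.add(current);
                } catch (InterruptedException e) {
                    // The calling thread was interrupted: stop and report the rest as unconverted.
                    result.cancel(true);
                    for (int p = current; p <= pageCount; p++) {
                        failed.add(p);
                    }
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } finally {
            executor.shutdownNow();
        }
        return failed;
    }
}
```

A command-line wrapper could call `convertAll` with the document's page count and a converter that saves each page, then print the returned page numbers so the affected pages can be retried or reported. Note that `cancel(true)` only requests interruption; a conversion that does not respond to it keeps running in the background until the JVM exits.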

Furthermore, from your description, you need a tool for PDF to HTML conversion and are not much interested in using the Java API. If that is the case, please try ConversionApp from our sister company, GroupDocs.

rodrigo.rosas · November 11, 2015, 12:52pm

Thank you for your response.

ConversionApp is a web service, not a tool. I’m looking for a command-line tool I can use from Linux in an automated way. While not ideal, I can use the Java API to create such a tool, but I’m not really interested in a web service for this conversion, but thanks for pointing to it anyway.

Thanks for confirming that I have to split the PDF before using the API if I want to embed all resources in the page. I’m missing however a full example demonstrating how setSplitIntoPages would work in Java without embedding all resources…

Thanks,

Rodrigo.

codewarior · November 12, 2015, 3:23am

rodrigo.rosas: ConversionApp is a web service, not a tool. I’m looking for a command-line tool I can use from Linux in an automated way. While not ideal, I can use the Java API to create such a tool, but I’m not really interested in a web service for this conversion, but thanks for pointing to it anyway.

Hi Rodrigo,

To use the solution on a Linux platform, you need to use Aspose.Pdf for Java. However, please allow us a little time to resolve the issue reported earlier.

rodrigo.rosas: Thanks for confirming that I have to split the PDF before using the API if I want to embed all resources in the page. I’m missing however a full example demonstrating how setSplitIntoPages would work in Java without embedding all resources…

Please take a look at the following code snippet.

Java

// Load the source PDF
com.aspose.pdf.Document doc = new com.aspose.pdf.Document("c:/pdftest/Farag.pdf");
com.aspose.pdf.HtmlSaveOptions html = new com.aspose.pdf.HtmlSaveOptions();
// Write each PDF page to a separate HTML file
html.setSplitIntoPages(true);
doc.save("c:/pdftest/Farag.html", html);

rodrigo.rosas · November 12, 2015, 5:33am

I tried this snippet but it doesn’t work in IE8, which is still a requirement for us.

codewarior · November 13, 2015, 2:29am

Hi Rodrigo,

Can you please share some details about the issue you are facing in IE8? Are the contents not displayed properly, or is it some other problem? With these details we can look into the matter further.

rodrigo.rosas · November 13, 2015, 9:55am

Page 53 of the document I mentioned at the beginning of this thread is mostly broken in IE8 when I use the example you provided to split it into many pages. The background image is also not loaded.

If you download the mentioned PDF, remove page 9 from it, run the snippet you suggested, and view the resulting page in IE8, you’ll understand what I’m talking about.

codewarior · November 16, 2015, 11:53am

Hi Rodrigo,

Thanks for sharing the details.

To test the scenario, I extracted page 53 from the source file and converted it to HTML. The images appear properly when viewing the output in Internet Explorer. Can you please try viewing the attached HTML file at your end and share your findings?

Java

// Load the source PDF
com.aspose.pdf.Document doc = new com.aspose.pdf.Document("c:/pdftest/cdnp_018932.pdf");
com.aspose.pdf.HtmlSaveOptions html = new com.aspose.pdf.HtmlSaveOptions();
// Copy page 53 into a new, empty document
com.aspose.pdf.Document doc2 = new com.aspose.pdf.Document();
doc2.getPages().add(doc.getPages().get_Item(53));
// html.setSplitIntoPages(true);
// Save the single-page document as HTML
doc2.save("c:/pdftest/cdnp_018932.html", html);

rodrigo.rosas · November 16, 2015, 1:59pm

Same issue. I can view the document correctly in Chrome but I don’t see the background image in IE8.

codewarior · November 17, 2015, 12:04pm

Hi Rodrigo,

Thanks for the acknowledgement.

I have logged this problem as PDFNEWJAVA-35322 in our issue tracking system. We will look into it further and keep you updated on the status of the fix. Please be patient and allow us a little time. We are sorry for the inconvenience.

tilal.ahmad · February 15, 2016, 12:08am

Hi Rodrigo,

Thanks for your patience. Our product team has investigated the issue and found that it is not a bug. IE8 does not support SVG and other W3C standards that are used when converting PDF to HTML (see http://www.alphr.com/news/internet/224043/berners-lee-unhappy-with-ie8). Please use a recent browser such as Chrome, Mozilla Firefox, Safari, or IE 9+, which will resolve the issue.
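To see in advance which generated pages are likely to be affected in IE8, the HTML output can be scanned for SVG usage. The sketch below is a rough heuristic in plain Java, not an Aspose.Pdf feature: `SvgUsageCheck` and `usesSvg` are hypothetical names, and the patterns (inline `<svg>` elements, the `image/svg+xml` MIME type, and references to `.svg` files) are assumptions about how SVG might appear in the output rather than a description of what the converter actually emits.

```java
import java.util.regex.Pattern;

public class SvgUsageCheck {

    // Heuristic markers: an inline <svg> element, the SVG MIME type (as used in
    // data URIs or <object>/<embed> tags), or a reference to a file ending in .svg.
    private static final Pattern SVG_MARKERS = Pattern.compile(
            "<svg[\\s>]|image/svg\\+xml|\\.svg[\"'?#)\\s]",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the given HTML or CSS text appears to rely on SVG. */
    public static boolean usesSvg(String html) {
        return html != null && SVG_MARKERS.matcher(html).find();
    }
}
```

For example, `SvgUsageCheck.usesSvg("<div style=\"background-image: url('img/bg.svg')\">")` returns true, while a page that references only PNG images returns false. A page flagged this way would need a raster fallback or a newer browser to display correctly.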

Please feel free to contact us for any further assistance.

Best Regards,

aspose.notifier · April 17, 2016, 2:19am

The issues you have found earlier (filed as PDFNEWJAVA-35303) have been fixed in Aspose.Pdf for Java 11.4.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.