PDF convert to HTML - some words overlap

beckymc · August 2, 2013, 4:30pm

Hello,

We are evaluating Aspose.Pdf.Kit (we have licensed Aspose.Words for Java for a long time). The primary thing we’re interested in is converting PDF docs to HTML format (text only).

I downloaded aspose-pdf-kit-4.6.1-java and gave it a try. The resulting HTML doc contains the text very nicely formatted, but there are numerous places where the end of one phrase overlaps the beginning of the next phrase. The original PDF doc and the resulting HTML doc are attached. You can see the occurrences starting at the top of the HTML doc with the phone number overlapping the following word ‘Email’, and then the first bullet under Summary has ‘Java’ and ‘J2EE’ overlapping, etc. Looking at the HTML source it seems that the absolute left positioning of the phrases may not be correct in some cases.

Will this problem be fixed?

When will it be fixed?

My test code is below.

Thanks in advance-

Becky McElroy

______________________

public static void main(String[] args)
{
    if (args.length < 1)
    {
        System.out.println("Enter arg: path to pdf file");
        System.exit(1);
    }
    try
    {
        // create PdfExtractor object
        PdfExtractor extractor = new PdfExtractor();
        // bind input PDF file
        File f = new File(args[0]);
        extractor.bindPdf(new FileInputStream(f));
        // extract text
        extractor.extractText();
        // save extracted text as HTML (swaps the trailing "pdf" for "html")
        extractor.extractTextAsHTML(args[0].substring(0, args[0].length() - 3) + "html");
        // close PdfExtractor object
        extractor.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
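
A side note on the test code above: the output name is built by cutting the last three characters off args[0], which only works when the input ends in a three-letter extension such as "pdf". Below is a minimal sketch of a more defensive alternative using only the JDK. The names OutputNames and toHtmlPath are made up for illustration and are not part of any Aspose API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class OutputNames {
    /**
     * Returns the input path with its extension replaced by ".html".
     * If the file name has no extension, ".html" is appended instead.
     */
    public static String toHtmlPath(String inputPath) {
        Path path = Paths.get(inputPath);
        String fileName = path.getFileName().toString();
        int dot = fileName.lastIndexOf('.');
        // A dot at index 0 means a dot-file such as ".pdf", not an extension.
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        // resolveSibling keeps the original directory part, if there is one.
        return path.resolveSibling(base + ".html").toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtmlPath("resume.pdf"));
        System.out.println(toHtmlPath("notes"));
    }
}
```

With this helper, the call in the test code could pass toHtmlPath(args[0]) instead of the substring expression.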

codewarior · August 5, 2013, 6:00am

Hi Becky,

Thanks for using our products.

I have tested the scenario and I am able to reproduce the same problem. For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33646. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for your inconvenience.

P.S. Please note that Aspose.Pdf.Kit for Java has been merged into Aspose.Pdf for Java, and all of its classes are now in the com.aspose.pdf.facades package. The separate release of Aspose.Pdf.Kit for Java will be discontinued in Q3 this year. For further information, please see "Migration from Aspose.Pdf.Kit for Java".

beckymc · August 6, 2013, 12:00pm

Hello Team,

Thanks for your reply and opening the ticket.

Can someone on your side please provide the timeframe when PDFNEWJAVA-33646 will be fixed? We are evaluating other solutions as well and will need to make a decision soon. We would prefer to go with Aspose, as we’ve had a good experience in the past with responsiveness and support. We would need this issue fixed by end of September in order for us to be able to go with the Aspose solution – is that timeframe doable for your side?

Thank you.

Becky Mc

codewarior · August 6, 2013, 11:13pm

Hi Becky,

Since we have only recently become aware of this issue, we cannot share a timeline for its resolution until we have investigated it and identified the actual cause.

However, I have asked the development team to try to get this issue resolved before the end of September, and I hope we will be able to fix it in the specified time (it's not a promise, but we can try to get it fixed by then). As soon as we have made significant progress towards resolving this issue, we will be more than happy to update you on its status.

Please be patient and allow us a little time. Your patience and understanding are greatly appreciated.

beckymc · October 1, 2013, 10:59am

Hi,

Is there any update on this issue: PDFNEWJAVA-33646 ?

When can we expect the fix for that to be released?

Thank you,

Becky McElroy

codewarior · October 2, 2013, 12:01pm

Hi Becky,

Thanks for your patience.

The development team has been busy resolving other priority issues, and I am afraid the issue stated above is not yet resolved. However, I have asked the team to share a possible ETA for its resolution. As soon as I have that information, I will be more than happy to update you on the status of the correction. Please be patient and allow us a little more time.

codewarior · October 7, 2013, 4:04am

Hi Becky,

I am pleased to share that the issue reported earlier is resolved, and its fix will be included in the upcoming release of Aspose.Pdf for Java, which is planned for release within the current week. Please be patient and allow us a little time.

codewarior · October 14, 2013, 12:36pm

Hi Becky,

Starting from the upcoming release of Aspose.Pdf for Java 4.3.0, we are going to introduce a new feature to directly convert PDF files to HTML format. Please try using the following code snippet to accomplish this requirement.

[Java]

// load source PDF file
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/example-original-resume.pdf");
// save the file into HTML format
pdfDocument.save("c:/pdftest/example-original-resume.html", com.aspose.pdf.SaveFormat.Html);

aspose.notifier · October 14, 2013, 2:13pm

The issues you have found earlier (filed as PDFNEWJAVA-33646) have been fixed in Aspose.Pdf for Java 4.3.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

beckymc · October 14, 2013, 4:31pm

Hi Nayyer,

Thank you for the update. That is great news!

Kind regards,

Becky

beckymc · October 14, 2013, 4:51pm

Hi Nayyer,

Thank you & team for getting this fix into the release. I tried the new 4.3.0 download, and indeed the formatting has been fixed. Looks great when you open it with a browser!

But is there a way, with the new API, to extract the PDF text and convert it to formatted HTML all in one .html file? (no generated folder with the same name as the .html file, containing the .css)

The previous aspose-PDF-kit API supported it and that is exactly what we need - that is: extract text only from the PDF, omit images, preserve the text format, and output to one .html file . I didn’t find a way in the 4.3.0 API to do that.

Thanks in advance,

Becky

codewarior · October 21, 2013, 2:59am

Hi Becky,

Thanks for sharing your findings and sorry for the delayed response.

PDF to HTML conversion using the com.aspose.pdf.Document class is a new approach, and it creates a separate folder containing the extracted images and CSS information. However, we have already noticed that the extractTextAsHTML(…) method is missing from the com.aspose.pdf.facades.PdfExtractor class. For the sake of correction, we have already logged it in our issue tracking system as PDFNEWJAVA-33692. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for your inconvenience.

beckymc · April 28, 2014, 1:41pm

Hello -

I’m inquiring about the status of this ticket: PDFNEWJAVA-33692

The ticket is about providing the missing extractTextAsHTML() method that would generate the HTML in a single .html file.

Thanks in advance -
Becky McElroy

codewarior · April 28, 2014, 11:42pm

Hi Becky,

Thanks for your patience.

I am afraid the issue PDFNEWJAVA-33692 is not yet resolved. Nevertheless, I have asked the development team to share any possible ETA. As soon as I have some updates regarding its resolution, I will be more than happy to update you on the status of the correction. Please be patient and allow us a little time.

We are sorry for this delay and inconvenience.

beckymc · April 29, 2014, 6:18pm

Ok, thank you.

codewarior · June 9, 2014, 8:29am

Hi Becky,

Thanks for your patience.

We have further investigated the issue PDFNEWJAVA-33692 reported earlier, and I am afraid we cannot support the extractTextAsHTML(…) method. The current Aspose.Pdf for Java is an auto-ported version of Aspose.Pdf for .NET, so we might not be introducing extractTextAsHTML(…) in the com.aspose.pdf.facades package, because it is not present in Aspose.Pdf for .NET. Therefore, in order to accomplish your requirement, please try using the following code snippet.

[Java]

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("input.pdf");
com.aspose.pdf.HtmlSaveOptions saveOptions = new com.aspose.pdf.HtmlSaveOptions();
pdfDocument.save("output.html", saveOptions);
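
If the Document.save(…) route still writes the stylesheet into a separate folder, one possible workaround is to post-process the output and fold the CSS back into the page. The following is only a sketch in plain Java with no Aspose calls. It assumes the generated page references its stylesheets through <link … href="….css"> tags with relative, unencoded paths, which has not been verified against actual Aspose output. CssInliner and inline are hypothetical names, and images are not handled here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssInliner {
    // Matches <link ... href="something.css" ...>; the shape of the markup is an assumption.
    private static final Pattern LINK = Pattern.compile(
            "<link\\b[^>]*?href\\s*=\\s*\"([^\"]+\\.css)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the HTML with each linked stylesheet replaced by an inline style element. */
    public static String inline(String html, Path baseDir) throws IOException {
        Matcher m = LINK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Resolve the href relative to the folder that holds the HTML file.
            Path cssFile = baseDir.resolve(m.group(1));
            String css = new String(Files.readAllBytes(cssFile), StandardCharsets.UTF_8);
            String style = "<style type=\"text/css\">\n" + css + "\n</style>";
            // quoteReplacement stops '$' and '\' in the CSS being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(style));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Usage: java CssInliner output.html single-file.html
    public static void main(String[] args) throws IOException {
        Path htmlFile = Paths.get(args[0]).toAbsolutePath();
        String html = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        String merged = inline(html, htmlFile.getParent());
        Files.write(Paths.get(args[1]), merged.getBytes(StandardCharsets.UTF_8));
    }
}
```

The regular expression is deliberately simple; if the real output uses single-quoted attributes or @import rules, an HTML parser would be a safer choice than pattern matching.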

beckymc · June 12, 2014, 11:09am

Hello Nayyer,

Thanks for the update and solution suggestion. We’ll give it a try, and if we have issues we’ll open a new ticket at that time.

This ticket can be closed now.
Thank you for your support!
Kind regards,
Becky

codewarior · June 13, 2014, 9:05am

Hi Becky,

Sure. Please try the approach stated above, and if you encounter any issues or have any further queries, please feel free to contact us. We will be more than happy to help.