Convert PDF to HTML: the output HTML is messed up

Hello. I’m converting a PDF to HTML. My PDF is colored and multi-layered and, let’s say, has 3 pages. I choose the output HTML to be a single page, and then I find that the output HTML is not correct. On page 3, you can see text paragraphs from page 1; they overlap the content of page 3. If you look into the generated HTML code, you’ll see the code is messed up, i.e. HTML elements from page 1 are inserted randomly into other parts of the code.


I already set HtmlSaveOptions.SaveTransparentTexts and SaveShadowedTextsAsTransparentTexts to TRUE. But still got this problem. Please help! Thank you so much!

Hi Menghui,


Thanks for contacting support.

Please share your source PDF file and code snippet which you are using, so that we can test the scenario in our environment. We are sorry for your inconvenience.

Hello!

Thanks for replying! I’ve attached the PDF file. The file has 2 layers. The bottom layer should be covered by the top layer, but in the generated HTML the bottom layer shows up and overlaps the top layer (please see the attachment for what it looks like in HTML format). Below is the code I’m using; say pdfDocument is the PDF file I want to convert to HTML.
Document pdfDocument = …
pdfDocument.Optimize();
HtmlSaveOptions saveOptions = new HtmlSaveOptions();
saveOptions.CustomResourceSavingStrategy = new HtmlSaveOptions.ResourceSavingStrategy(CustomResourcesProcessing);
saveOptions.CustomStrategyOfCssUrlCreation = new HtmlSaveOptions.CssUrlMakingStrategy(CssUrlCreationCustomStrategy);
saveOptions.CustomCssSavingStrategy = new HtmlSaveOptions.CssSavingStrategy(CustomCssSavingProcessing);
saveOptions.SplitIntoPages = false;
saveOptions.SplitCssIntoPages = false;
saveOptions.HtmlMarkupGenerationMode = HtmlSaveOptions.HtmlMarkupGenerationModes.WriteOnlyBodyContent;
saveOptions.SaveShadowedTextsAsTransparentTexts = true;
saveOptions.SaveTransparentTexts = true;
pdfDocument.Save(htmlContent, saveOptions);

Is there anything wrong with my saveOptions? Or is this a bug with Aspose? Thanks a lot!

Hi Menghui,


Can you please share the complete code snippet so that we can replicate the exact scenario? We are sorry for this inconvenience.

Hello,
You can use this simple code snippet. Please change the local path at your convenience so that it can open the PDF I've attached. Then you will see the same problem of overlapping layers.

/* single html with css embedded */
Document doc = new Document(@"C:\Users\Work\Aspose\Test Document with Layers.pdf");
// Instantiate HTML Save options object
HtmlSaveOptions newOptions = new HtmlSaveOptions();
// Enable option to embed all resources inside the HTML
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
// This is just optimization for IE and can be omitted
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
// Output file path
string outHtmlFile = @"C:\Users\Work\Aspose\single.html";
doc.Save(outHtmlFile, newOptions);

Here is the error preview I've got using code above.

Hi Menghui,


Thanks for sharing the details.

I have tested the scenario and I am able to reproduce the same problem. I have logged it as PDFNEWNET-39258 in our issue tracking system so that it can be corrected. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.

Great! Please reach me any time once you have made any kind of progress. Thank you so much!

Hello! Could I know how long I still need to wait for this problem to be fixed? I’m so sorry to be pushy but our product is going to be released. The progress of this problem is essential to our design. Please share with me about the progress. Thank you!

Hi Menghui,


Thanks for your inquiry. We noticed the issue only recently, and it is still pending investigation in the queue along with other issues reported earlier. As soon as the investigation is complete, we will be in a good position to share an ETA. In the meantime, we have recorded your concern and asked our product team to investigate and share an ETA at their earliest. We will notify you as soon as we make significant progress toward resolving the issue.

We are sorry for the inconvenience caused.

Best Regards,

Be patient, young lad.

Hello! Thank you for your consideration! Would you be able to share any update now? Any kind of update is fine. I just want to know the progress. Thank you so much!

Hi Menghui,


Thanks for your patience. You may try the latest release, Aspose.Pdf for .NET 10.8.0, as we have resolved the major layout problems of the HTML in question in this release. Some minor issues have also been resolved, and the complete fix will be available in 10.9.0, planned for the start of October, 2010. We will notify you as soon as it is published.

Thanks for your patience and cooperation.

Best Regards,

Thank you! This update has resolved the overlapping issue. The visual performance is much better now. By the way, here are some minor issues I just found and could be improved in next version:

1. In some lines of the paragraph, the text spacing is incorrectly bigger than it should be.
2. In some small places, font size is a bit smaller than it should be.
3. The texts at the bottom layer are invisible now, which is very good. However, some of them could be selected by dragging the mouse.

Anyway, 10.8.0 is awesome. I appreciate the quick fix. Please notify me when 10.9.0 is released or any progress is made. Thanks!

Hi Menghui,


Thanks for your feedback. We have passed your findings on to our product team; they will look into them and address them accordingly.

Sure, we will notify you as soon as the fix is included in Aspose.Pdf for .NET 10.9.0 and it is published.

Thanks for your patience and cooperation.

Best Regards,

Hello. I raised 3 issues in my last comment. I’ve just fixed #3 by adjusting HtmlSaveOptions. I wonder whether #1 and #2 could also be resolved in 10.8.0 by adjusting HtmlSaveOptions or some other parameters. Could you take a look and share some advice? You could use the same documents and code snippet from the beginning of the post to reproduce it. Thanks!

Hi Menghui,


Thanks for the acknowledgement.

Aspose.Pdf for .NET 10.8.0 included some partial fixes and, as shared earlier, the detailed fix for the issues related to PDF to HTML will be included in the next release, Aspose.Pdf for .NET 10.9.0. We therefore suggest that you wait for the new release; once you have tested the scenario with it, you can share the details of any issues that still appear.

The issues you have found earlier (filed as PDFNEWNET-39258) have been fixed in Aspose.Pdf for .NET 10.9.0.



Hi, I’ve just upgraded to 10.9 and found that Aspose.Pdf no longer works. If I downgrade back to 10.8, I can use Aspose.Pdf again. It seems 10.9 decides that I have no license file, but I do have a license, since I could always use 10.8. So what’s the problem?

Hi Menghui,

Thanks for your feedback. I am afraid I am unable to understand exactly what issue you are facing with 10.9.0. Please also share your license file via email as suggested here, so we can look into it and guide you accordingly.

Moreover,

"1. In some lines of the paragraph, the text spacing is incorrectly bigger than it should be."

This issue is not fixed in 10.9.0. We have logged a ticket, PDFNEWNET-39371, in our issue tracking system for its rectification. We will notify you as soon as it is resolved.

"2. In some small places, font size is a bit smaller than it should be."

We are unable to notice the issue in the 10.9.0 output. Please share your code and input/output documents, along with a screenshot for reference.


"3. The texts at the bottom layer are invisible now, which is very good. However, some of them could be selected by dragging the mouse."

Please note that such unexpected behavior can theoretically occur when the following settings are used. Please confirm whether you are setting these properties; if you are not, please share your code so we can look into it further and guide you accordingly.

....

....

htmlOptions.SaveTransparentTexts = true;//transparent

htmlOptions.SaveShadowedTextsAsTransparentTexts = true;

....

....

Best Regards,

Hi! Thank you for the reply. I may share the license with you later when I have a chance. Let me clarify issue number 2: “2. In some small places, font size is a bit smaller than it should be.” Please see the attachment you just shared with me. In the output of both 10.8 and 10.9, you can see there are two lines of text “5/14/2014”, right? The first should look exactly the same as the second. But now the first 5/14/2014 is smaller than the second and has extra space in it. Let me know if my explanation is still not clear enough. Thanks a lot!