PDF to HTML - text over an image is lost

rcoston · February 26, 2019, 7:17pm

When converting a PDF to HTML, we sometimes have source files that combine an image with text, such as when a table of numbers is slightly skewed on the page. The PDF includes the image and also a layer of selectable text on top of it.

When we convert to HTML with Aspose.PDF, the image is included correctly, but the selectable text is missing.

Attached is a file exhibiting the behavior. For the two tables, the Aspose HTML version has an image, but no selectable text.

Is there a way to ensure that Aspose.PDF does not discard the text layer during conversion?

6th Sample - skewed tables.pdf (135.8 KB)

asad.ali · February 26, 2019, 10:29pm

@rcoston

Thanks for contacting support.

Please try the following code snippet with Aspose.PDF for .NET 19.2, and let us know if you still face any issues. The output HTML is attached for your reference.

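using Aspose.Pdf;

// dataDir is the path to the folder that holds the input PDF
Aspose.Pdf.Document document = new Aspose.Pdf.Document(dataDir + "6th Sample - skewed tables.pdf");
HtmlSaveOptions htmlOptions = new HtmlSaveOptions();
// Save text that is transparent in the PDF as transparent text in the HTML
htmlOptions.SaveTransparentTexts = true;
// Save text shadowed by other elements (e.g. OCR text behind an image) as transparent, selectable text
htmlOptions.SaveShadowedTextsAsTransparentTexts = true;
// Write a single HTML file instead of one file per page
htmlOptions.SplitIntoPages = false;
document.Save(dataDir + "6th Sample - skewed tables.html", htmlOptions);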

outputHTML.zip (243.1 KB)
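
To confirm that the text layer survived the conversion, one option is to strip the markup from the generated HTML and search what remains for a value you know appears in one of the tables. The following is only a rough sketch that uses the .NET base class library, not an Aspose.PDF feature. The class name HtmlTextCheck and its methods are hypothetical, and the check assumes that the value is not spaced or encoded differently in the output.

```csharp
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlTextCheck
{
    // Returns the text content of an HTML string with script/style
    // blocks and all tags removed, and HTML entities decoded.
    public static string ExtractText(string html)
    {
        string noBlocks = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1>",
            "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        string noTags = Regex.Replace(noBlocks, @"<[^>]+>", "");
        return WebUtility.HtmlDecode(noTags);
    }

    // Returns true if the expected string occurs in the text of the HTML file.
    public static bool ContainsText(string htmlPath, string expected)
    {
        string text = ExtractText(File.ReadAllText(htmlPath));
        return text.Contains(expected);
    }
}
```

For example, call HtmlTextCheck.ContainsText(dataDir + "6th Sample - skewed tables.html", value), where value is a number copied from one of the tables in the PDF. Because tags are removed without inserting spaces, a value that is split across several elements is still matched, but values separated by layout spacing may not be.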

rcoston · February 27, 2019, 1:33pm

Hi Asad,

Thank you for the assistance - that code snippet worked perfectly! We appreciate the quick response, as well.

Best,
Randall

asad.ali · February 27, 2019, 8:58pm

@rcoston

Thanks for your kind feedback.

Please keep using our API, and if you need further assistance, feel free to create a new post. We will be happy to help.