Conversion from PDF to HTML (.NET)

Helge_Lenuweit · April 9, 2018, 1:16pm

Dear Support Team,

I’m currently evaluating Aspose.PDF for .NET for a new project. For some of the tested PDFs, the resulting HTML carries no usable text. It seems that only the first character is used to fill all text positions, similar to an issue I found for the Java toolkit (PDFJAVA-35579).
Sample file (42.2 KB)

Another question: is there a way to detect a hidden text layer in an OCR’ed document?
It seems that in such cases SaveShadowedTextsAsTransparentTexts has to be set, but it is not clear beforehand whether it needs to be set or not.

Thanks and best regards

Helge

asad.ali · April 9, 2018, 8:49pm

@Helge_Lenuweit

Thanks for contacting support.

The option “SaveShadowedTextsAsTransparentTexts” needs to be set to true if the PDF contains a hidden layer of OCR-ed text that should also be saved into the HTML, so that it can be copied to the clipboard.
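A minimal sketch of setting this option during conversion, assuming the property is exposed on HtmlSaveOptions as a boolean; the file names are placeholders and not taken from the sample file above:

```csharp
using Aspose.Pdf;

class PdfToHtmlExample
{
    static void Main()
    {
        // "input.pdf" and "output.html" are placeholder paths (assumptions).
        Document document = new Document("input.pdf");

        HtmlSaveOptions options = new HtmlSaveOptions();
        // Ask the converter to keep the hidden (shadowed) OCR text layer
        // as transparent text in the generated HTML.
        options.SaveShadowedTextsAsTransparentTexts = true;

        document.Save("output.html", options);
    }
}
```

This only shows where the option is set; it does not detect whether a document actually contains a hidden text layer.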

We were able to replicate the issue in our environment using Aspose.PDF for .NET 18.4 and have logged it as PDFNET-44504 in our issue tracking system. We will look further into the details of the issue and keep you posted on the status of the fix. Please be patient and allow us a little time.

We are sorry for the inconvenience.