Images created from PDF pages have wrong text/font

tuandunguit · December 6, 2017, 3:22am

Hi,

I followed the developer guide to create PNG images from PDF pages.
It works well in general. However, for some scanned documents the resulting image is not good: the text/font is rendered incorrectly.
Please take a look. I have attached the PDF and the generated image.

Thanks so much

Scaned Document.pdf (511.6 KB)
Scaned Document.png (51.8 KB)

Farhan.Raza · December 6, 2017, 7:15am

@tuandunguit

Thank you for contacting support.

I have worked with the data you shared. The source PDF file, Scaned Document.pdf, contains embedded fonts that are not rendered correctly in the generated image. I have attached the fonts used in this PDF as ScannedDocument_Fonts.zip. Please reference these fonts with the line of code below when creating a PNG image.

FolderFontSource source = new FolderFontSource(@"D:\ScannedDocument_Fonts\");

I have attached the output PNG image, Scaned Document_out.png, for your reference.

ScannedDocument_Fonts.zip (57.3 KB)

I hope this will be helpful. Please let us know if you need any further assistance.

tuandunguit · December 6, 2017, 7:55am

@Farhan.Raza
Thanks for your quick reply.

I still have some questions and need your assistance:

If all the embedded fonts are installed on the host machine, we don’t need to specify a FolderFontSource. Is that correct?
If we receive a document from an external source, the embedded fonts are unknown, and there may be many of them.
We need to convert all such documents to images automatically, so is there a more complete solution?

Farhan.Raza · December 6, 2017, 12:35pm

@tuandunguit

Yes, this is correct.

I am afraid it may not be possible to use embedded fonts when they are unknown or too numerous. A feature request with ID PDFNET-43829 has therefore been logged in our issue management system to investigate whether we can support rendering a PDF file with embedded fonts. The issue ID has been linked with this thread, so you will receive a notification as soon as the issue is resolved.

tuandunguit · December 7, 2017, 6:24am

Thanks for your answer, @Farhan.Raza.
Just one more question: how did you extract the embedded fonts from the PDF, or where can we download them?

Farhan.Raza · December 7, 2017, 11:07am

@tuandunguit

There are several tools available online that claim to extract embedded fonts from a PDF file, but they are not as efficient as the Aspose.Pdf API. With Aspose.Pdf you can extract any embedded file, be it a font, a photo, or any other file. A simple approach is to convert the PDF file to an XML file, as in the code below:

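        // Requires: using Aspose.Pdf;
        // dataDir is the path of the folder that holds the input file.

        // Load the PDF document
        Document document = new Document(dataDir + "Sample.pdf");

        // Save it in the MobiXml format, which writes the XML output
        document.Save(dataDir + "Sample_out.xml", SaveFormat.MobiXml);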

I hope this will be helpful. Please let us know if you need any further assistance.

tuandunguit · December 8, 2017, 3:14am

Hi @Farhan.Raza,

Sorry, but how can I set the FolderFontSource while creating a PNG image? I can’t find any method for setting fonts on the Document or Page classes.
PS: I am following this guide to create the PNG: https://docs.aspose.com/display/pdfnet/Convert+PDF+Pages#ConvertPDFPages-ConvertAllPagestoPNGImages

Farhan.Raza · December 8, 2017, 6:13am

@tuandunguit

You need to set the FolderFontSource before creating the PNG image, using the line of code I shared earlier:

FolderFontSource source = new FolderFontSource(@"D:\ScannedDocument_Fonts\");

FolderFontSource(string) is the constructor of the FolderFontSource class. Please ensure you are using the latest version of the Aspose.Pdf API, i.e. Aspose.Pdf for .NET 17.11, in your environment.

Feel free to contact us if you need any further assistance.

tuandunguit · December 12, 2017, 1:52am

@Farhan.Raza
Sorry to ask again, but how do we use the ‘source’ variable? As far as I can see, we only declare it and nothing else in the code refers to it.
If possible, can you give me the source code you used to create your previous ‘Scanned Document_out.png’?

Thanks

Farhan.Raza · December 12, 2017, 6:04am

@tuandunguit

The FolderFontSource class represents a folder that contains font files. Its FolderFontSource.FolderPath property, which is set from the string argument passed to the constructor, specifies the path to that folder. So we do not need to refer to the ‘source’ variable explicitly, because the Aspose.Pdf API automatically loads the fonts from the specified source.

You may visit this documentation article for more details and a code snippet for creating PNG images from a PDF file.

I hope this will be helpful. Please share if I may help you further in this regard.
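
Editor's note: the snippet below is a minimal sketch of how a font folder and the PNG conversion might fit together. It is not the code used to produce the attached image. It assumes that the folder source is registered through FontRepository.Sources (in the Aspose.Pdf.Text namespace), which is how the Aspose.Pdf documentation describes adding custom font folders. The input and output paths and the 300 dpi resolution are hypothetical.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Text;

class PdfToPngWithFontFolder
{
    static void Main()
    {
        // Hypothetical locations; adjust them to your environment.
        string fontDir = @"D:\ScannedDocument_Fonts\";
        string inputPdf = @"D:\Scaned Document.pdf";
        string outputDir = @"D:\Output\";
        Directory.CreateDirectory(outputDir);

        // Assumption: the folder has to be added to the global font
        // repository before rendering so the library can search it.
        FontRepository.Sources.Add(new FolderFontSource(fontDir));

        Document pdfDocument = new Document(inputPdf);

        // 300 dpi is an arbitrary choice for this sketch.
        PngDevice pngDevice = new PngDevice(new Resolution(300));

        // Page numbers in Aspose.Pdf start at 1.
        for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
        {
            string outputPath = System.IO.Path.Combine(outputDir, "page_" + pageNumber + ".png");
            using (FileStream imageStream = new FileStream(outputPath, FileMode.Create))
            {
                // Render one page into the PNG stream.
                pngDevice.Process(pdfDocument.Pages[pageNumber], imageStream);
            }
        }
    }
}
```

Registering the folder once at startup, before any Document is rendered, keeps the conversion loop identical to the one in the PNG conversion guide.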

Farhan.Raza · August 8, 2018, 11:06am

@tuandunguit

We have further investigated PDFNET-43829 and would like to share that the input document includes all of its fonts as embedded fonts. When a font is embedded in a PDF document, the document contains all the data necessary to render it on any device, so there is no need to extract these fonts from the document or to connect a FolderFontSource.

This rule applies throughout Aspose.PDF, regardless of the output format or feature area (PDF, PDF/A, PDF to HTML, PDF to PNG, etc.). When a PDF document is converted into an image, Aspose.PDF takes the font description from the document, and if the font is embedded, it uses that font and does not look for another one anywhere else. For your reference, we have attached the resultant image generated with the latest version of the Aspose.PDF API: Scanned Document_18.7.png (455.5 KB)

So, the conclusions are:

When fonts are embedded in a document, they are already available to the renderer, so there is no need to install, extract, or otherwise supply them.
If a problem with rendered text is related to embedded fonts or font sources, we need a concrete document and a code snippet that reproduce the problem so that we can look for a solution.
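
Editor's note: before reporting such a problem, it may help to check which fonts a document uses and whether each one is embedded. The sketch below assumes a version of Aspose.Pdf for .NET that exposes Document.FontUtilities.GetAllFonts(), which may not exist in the releases mentioned in this thread. The input path is hypothetical.

```csharp
using System;
using Aspose.Pdf;

class ListPdfFonts
{
    static void Main()
    {
        // Hypothetical input path; adjust it to your environment.
        Document pdfDocument = new Document(@"D:\Scaned Document.pdf");

        // Assumption: FontUtilities.GetAllFonts() is available in your release.
        Aspose.Pdf.Text.Font[] fonts = pdfDocument.FontUtilities.GetAllFonts();

        foreach (Aspose.Pdf.Text.Font font in fonts)
        {
            // Print each font name and whether it is embedded in the file.
            Console.WriteLine(font.FontName + " (embedded: " + font.IsEmbedded + ")");
        }
    }
}
```

If every font is reported as embedded and the rendered text is still wrong, that output together with the document and the conversion code gives support a concrete case to reproduce.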