PdfExtractor extracts some characters as zero bytes (\0)

lukas.rada · April 27, 2017, 9:09am

Hello,

I am trying to extract text from a PDF with PdfExtractor (Aspose.Pdf, Version=11.8.0.0).

Code is attached.

PDF is attached.

The result is a string in which some characters are zero bytes instead of the regular characters of the text.

Result with errors (zero bytes):

" Př\0d\0luva\r\n\r\n\r\nKaždý z nás již někdy něco uvařil, na \0o\0 s\0 \0is\0ě shodn\0\0\0.\r\n\0d vař\0ní ča\0\0 až \0o složi\0á \0ídla z vybraných surovin. \0\0is\0u\0í s\0ovky kuchař\0k, k\0\0ré vá\0 \0ř\0sně \r\n\0oradí, \0ak na \0o. \0hňová kuchařka …"

The result with no errors should be:

" Předmluva \r\n\r\n\r\nKaždý z nás již někdy něco uvařil, na tom se jistě shodneme.

Od vaření čaje až po složitá jídla z vybraných surovin. Existují stovky kuchařek, které vám přesně poradí, jak na to. Ohňová kuchařka…"

Can you help me?

asad.ali · April 27, 2017, 4:04pm

Hi Lukáš.

Thanks for contacting support.

I am afraid you are using quite an old version of the API. We always recommend using the latest version, which is currently Aspose.Pdf for .NET 17.4.0. Nevertheless, I tried to extract text from your PDF document using the latest version of the API with the following code snippet and could not reproduce the issue. The extracted text was as expected. Please check the following code snippet:

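// dataDir is assumed to hold the path of the folder that contains the PDF
Document pdfDocument = new Document(dataDir + "test-example.pdf");
// Extract text in "Pure" formatting mode
TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
TextAbsorber ta = new TextAbsorber(textExtOptions);
// Run the absorber over all pages of the document
pdfDocument.Pages.Accept(ta);
string extractedtext = ta.Text; // <-- This returns correct output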

Please try the latest version of the API with the above approach. If you need any further assistance, please feel free to contact us.
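As a quick way to confirm whether an upgrade removes the problem, the extracted string can be checked for zero characters. The following is only a minimal sketch in plain C#: the class name ExtractionCheck and the method CountNulChars are hypothetical helpers, not part of Aspose.Pdf, and the sample string is modeled on the faulty output reported above.

```csharp
using System;
using System.Linq;

public static class ExtractionCheck
{
    // Hypothetical helper (not an Aspose.Pdf API):
    // counts the U+0000 characters in a string returned by a text extractor.
    public static int CountNulChars(string text)
    {
        return text.Count(c => c == '\0');
    }

    public static void Main()
    {
        // Fragment modeled on the faulty output from the first post.
        string sample = "P\u0159\0d\0luva";
        Console.WriteLine(CountNulChars(sample)); // prints 2
    }
}
```

In practice the same check could be applied to the string returned by the extraction code; a count of zero suggests the text came through without the substituted characters.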

Best Regards,

lukas.rada · April 28, 2017, 6:34am

You are right.

I’ve switched to the newest Aspose.Pdf and it works.

I’ve also tried rendering the PDF to an image with PngDevice (as you can see in the attachment). The text is correct, but the font is not preserved.

Is not preserving fonts a limitation of the trial version of Aspose.Pdf?

Thank you for your help.

lukas.rada · April 28, 2017, 7:54am

I’ve found that it is not a limitation of the trial version.

So the fonts are simply not preserved.

Can you help me with that?

Thank you.

asad.ali · April 28, 2017, 11:46am

Hi Lukáš,

Thanks for writing back.

lukas.rada:

I’ve also tried rendering the PDF to an image with PngDevice (as you can see in the attachment). The text is correct, but the font is not preserved.

We would really appreciate it if you could share the code snippet you are using to render the PDF with PngDevice. That way we can test the scenario in our environment and address it accordingly. We are sorry for the inconvenience.

Best Regards,

lukas.rada · May 2, 2017, 2:28am

It is a very simple example:

using (Document pdfDocument = new Document("example.pdf"))
{
    // Render at 150 DPI
    Resolution resolution = new Resolution(150);
    PngDevice pngDevice = new PngDevice(resolution);
    pngDevice.RenderingOptions.UseNewImagingEngine = true;
    // Page indexing in Aspose.Pdf starts at 1, so this renders the first page
    pngDevice.Process(pdfDocument.Pages[1], "example.png");
}

asad.ali · May 2, 2017, 8:08am

Hi Lukáš,

Thanks for sharing the code snippet. I tried converting your PDF to PNG and observed that the embedded fonts were not rendered in the resulting output. I have therefore logged an issue as PDFNET-42674 in our issue tracking system for investigation. We will look further into the details of the issue and keep you updated on the status of its correction. Please be patient and allow us a little time.

We are sorry for the inconvenience.

Best Regards,