Custom Font Encoding when Extracting PDF Text

seanJohnsonRSI · December 12, 2018, 2:18pm

We are using the TextAbsorber class to extract text out of some PDF reports. Most of the PDFs we have come across have been encoded using ANSI, but we recently came across one with custom font encodings. When viewing the PDF in Adobe, everything appears to be correct, but the TextAbsorber is not extracting the text in a usable way. Is there something that we can do to handle this using Aspose.Pdf?

I have provided some PDF property info (PDFProperties_1, PDFProperties_2) as well as a CodeSnippet. I cannot provide the PDF itself, as the information is proprietary.

This seems to be related to PDFJAVA-36721 from this post

Farhan.Raza · December 12, 2018, 7:06pm

@seanJohnsonRSI

Thank you for contacting support.

Please note that support for custom encoding is not available yet. As you have noticed, PDFJAVA-36721 is already logged as a feature request. Please also note that attachments are accessible only to the thread owner and Aspose staff. The source PDF document is required so that we can address your concerns efficiently.

seanJohnsonRSI · December 13, 2018, 1:39pm

It might be important to note that I’m using .Net. Sorry for not including that information earlier.

Farhan.Raza · December 13, 2018, 8:29pm

@seanJohnsonRSI

Thank you for the information.

The requested feature will be supported in both the .NET and Java versions. We will let you know as soon as any significant updates are available in this regard.

protstein · December 12, 2024, 9:21pm

Hi, was this feature ever supported? I am having the same issue.

asad.ali · December 12, 2024, 11:08pm

@protstein

We are afraid that this feature is not available yet. We will inform you via this forum thread as soon as the ticket is resolved.