Text extraction for PDFs with custom encoding

Spfa · April 26, 2018, 1:57pm

Hi,

We frequently encounter PDFs where text extraction returns garbage. Acrobat Reader displays them fine, but when you copy and paste some content or export it as text, you get garbage characters, and searching fails. As I understand it, this is caused by a missing or wrong CMap and/or a custom encoding. A publicly available example PDF can be found here: http://users.tpg.com.au/hufraser/PHONELIC.PDF
Acrobat Reader reports the fonts as Type 1 with custom encoding.

Is there a way with Aspose PDF (for Java) to detect PDFs with such problematic fonts? I saw a different topic from 2017 where you said that you were investigating making the font encoding available. Has this been implemented yet?
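Until the font encoding itself can be queried, one possible stopgap is to score the text after extraction. The following is only a minimal sketch under stated assumptions: it uses no Aspose API at all, the names `ExtractedTextCheck`, `suspiciousRatio` and `looksGarbled` are hypothetical, and the threshold is a guess that would need tuning. It only catches garbage that shows up as control, private-use, unassigned or replacement characters; it will not catch glyphs that are mapped to the wrong but otherwise ordinary letters.

```java
import java.util.Objects;

/** Heuristic check for text that looks like the result of a broken glyph-to-Unicode mapping. */
final class ExtractedTextCheck {

    private ExtractedTextCheck() {}

    /** Returns the fraction of non-whitespace code points that are unlikely in readable text. */
    static double suspiciousRatio(String text) {
        Objects.requireNonNull(text, "text");
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // layout whitespace says nothing about the encoding
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD                          // Unicode replacement character
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) { // unpaired surrogate
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** Flags text whose suspicious ratio exceeds the given threshold (an assumed, tunable value). */
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

For example, `ExtractedTextCheck.looksGarbled(extractedText, 0.3)` would flag text in which more than 30% of the non-whitespace characters are suspicious; the 0.3 is an arbitrary starting point, not a recommended value.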

For another PDF with custom encoding (which I can’t share), Aspose PDF exports no text when using MobiXmlSaveOptions, only image references and positions. It appears as if there is no mapping between the glyphs and the corresponding Unicode characters. Acrobat Reader reports the fonts as Type 3 with custom encodings. Surprisingly, Acrobat Reader can export the file as text, copy and paste works fine, and searching works.
The XML output of Aspose PDF looks like this:

<pdf2xml pages="8">
<title/>
<page width="792" height="612">
<text x="0" y="0" width="0" height="0"/>
<img x="32.64" y="565.683" width="5.278" height="5.038" src="pdf2xml_pic1.png"/>
<text x="0" y="0" width="0" height="0"/>
<img x="430" y="19.604" width="324.512" height="36.244" src="pdf2xml_pic2.jpg"/>
<text x="0" y="0" width="0" height="0"/>
<img x="232.32" y="110.286" width="15.354" height="15.594" src="pdf2xml_pic3.png"/>
[…]

Can you explain in which cases you produce such output instead of exporting text? And is there a way to detect this with Aspose PDF?
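Independently of any built-in detection, the symptom can be checked on the exported file itself. The sketch below assumes the XML has the shape shown above (`text` and `img` elements under `pdf2xml`) and uses only the JDK's DOM parser; the class and method names (`Pdf2XmlCheck`, `hasOnlyImageReferences`) are hypothetical and not part of Aspose PDF.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Checks an exported pdf2xml document for the "images but no text" symptom. */
final class Pdf2XmlCheck {

    private Pdf2XmlCheck() {}

    /** True if the XML has at least one img element and no text element with content. */
    static boolean hasOnlyImageReferences(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList texts = doc.getElementsByTagName("text");
            for (int i = 0; i < texts.getLength(); i++) {
                if (!texts.item(i).getTextContent().trim().isEmpty()) {
                    return false; // at least one text element carries real content
                }
            }
            return doc.getElementsByTagName("img").getLength() > 0;
        } catch (Exception e) {
            // Malformed or unreadable XML is reported to the caller as a bad argument.
            throw new IllegalArgumentException("Could not parse pdf2xml output", e);
        }
    }
}
```

A file flagged this way is only a candidate for manual review: a scanned PDF without any text layer would produce the same pattern.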

Farhan.Raza · April 26, 2018, 7:49pm

@Spfa

Thank you for contacting support.

We have investigated the sample PDF file you shared and were able to observe the garbage text problem. A ticket with ID PDFJAVA-37682 has therefore been logged in our issue management system for further investigation. The issue ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved. In addition, please share a link to the thread you are referring to, for our reference.

Regarding the PDF that exports XML with image references, we need the source file and the generated file to reproduce and investigate the issue in our environment. We understand your data security and privacy concerns, which is why attachments are accessible only to the thread owner and our staff. Please share the requested data so that we can investigate and help you out.

Furthermore, please also share your environment details (JDK/JRE details, OS details etc.) with us.

Spfa · May 2, 2018, 6:52am

Thanks for looking into this.

The other topic was “Get Font Type and Encoding Type from PDF File”.

My OS is Windows Server 2012, 64-bit, JRE 1.8.0_162.

I can’t share the 2nd PDF because I’m not the owner of it. I’d first have to get clearance for it, etc. I’ll try to find a publicly available example.

Spfa · May 2, 2018, 7:20am

I found a PDF that is a close match for the second problem described (only image references exported, but no text): https://latex.org/forum/download/file.php?id=2683&sid=df0925f724872260a2cb78d0279309b3 (LaTeX1.pdf, mentioned at https://latex.org/forum/viewtopic.php?t=10229 )

This PDF contains a single character R (the ‘R’ representing the real numbers in math). It can be copied and pasted, and Acrobat Reader can save it as text. But Aspose PDF only exports

I.e. no text.

Farhan.Raza · May 2, 2018, 11:08am

@Spfa

Thank you for sharing the requested details.

Please note that retrieving the custom encoding is not supported yet. The respective ticket, PDFJAVA-36721, has been linked with this thread so that you will receive a notification as soon as the feature is supported.

Regarding the image references exported instead of text in the generated XML file, we have been able to reproduce the issue in our environment. A ticket with ID PDFJAVA-37691 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.