We frequently encounter PDFs where text extraction returns garbage. Acrobat Reader displays them fine, but when you copy and paste content or export it as text, you get garbage characters. Searching fails. As far as I understand, this is caused by a missing or wrong CMap and/or a custom encoding. A publicly available example PDF can be found here: http://users.tpg.com.au/hufraser/PHONELIC.PDF
Acrobat Reader reports the fonts as Type 1 with custom encoding.
Is there a way with Aspose PDF for Java to detect PDFs with such problematic fonts? I saw a different topic from 2017 where you said that you're investigating how to make the font encoding available. Is this already implemented?
For another PDF (which I can't share) with custom encoding, Aspose PDF exports no text (using MobiXmlSaveOptions), only image references and positions. It appears as if there's no mapping between the glyphs and the corresponding Unicode characters. Acrobat Reader reports the fonts as Type 3 with custom encodings. Surprisingly, Acrobat Reader can export the file as text, and you can copy and paste text just fine. Searching works.
The XML output of Aspose PDF looks like this
We have investigated the sample PDF file you shared and can reproduce the garbage-text problem. A ticket with ID PDFJAVA-37682 has been logged in our issue management system for further investigation. The issue ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved. Please also share a link to the thread you are referring to, for our reference.
Regarding the PDF that exports XML with image references, we need the source and generated files to reproduce and investigate the issue in our environment. We understand your data security and privacy concerns, which is why attachments are accessible only to the thread owner and our staff. Please share the requested data so that we can investigate and help you out.
Furthermore, please also share your environment details (JDK/JRE details, OS details etc.) with us.
My OS is Windows Server 2012, with the 64-bit Java JRE 1.8.0_162.
I can't share the 2nd PDF because I'm not the owner of it. I'd first have to get clearance for it, etc. I'll try to find a publicly available example.
This PDF contains a single character R (the "R" representing the real numbers in math). It can be copied and pasted, and Acrobat Reader can save it as text. But Aspose PDF only exports
We would like to share with you that the feature of getting the custom encoding is not supported yet. The respective ticket ID, PDFJAVA-36721, has been linked with this thread so that you will receive a notification as soon as the feature is supported.
Regarding the image references that appear instead of text in the generated XML file, we have been able to reproduce the issue in our environment. A ticket with ID PDFJAVA-37691 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.
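
Until the font encoding can be queried directly, one possible workaround for the detection question is to score the text after it has been extracted by whatever means. The following is only a minimal sketch in plain Java and does not use any Aspose API: the class name `GarbledTextHeuristic`, its methods and the threshold are hypothetical, chosen for illustration. It assumes that a broken glyph-to-Unicode mapping shows up as control characters, private-use code points, unassigned code points or U+FFFD in the extracted string.

```java
// Hypothetical helper, not part of Aspose.PDF: scores already-extracted text.
final class GarbledTextHeuristic {

    private GarbledTextHeuristic() {}

    /** Fraction of non-whitespace code points that look like extraction debris (0.0 for empty text). */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // layout whitespace says nothing about the encoding
            }
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** True for code points that rarely occur in correctly mapped text. */
    static boolean isSuspicious(int cp) {
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE; // unpaired surrogate
    }

    /** Assumed decision rule: flag the text when the ratio exceeds the given threshold. */
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

A call such as `GarbledTextHeuristic.looksGarbled(extractedText, 0.3)` would then flag a document for manual review; the value 0.3 is an arbitrary starting point, not a tested recommendation. This heuristic has clear limits: a wrong mapping that still produces ordinary letters is not detected, and a document that yields no text at all (as in the Type 3 case above) scores 0.0 and has to be checked separately, for example by testing whether the extracted string is empty.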