Hi,
we frequently encounter PDFs where text extraction returns garbage. Acrobat Reader displays them fine, but when you copy and paste some content or export it as text, you get garbage characters, and searching fails. As far as I understand, this is caused by a missing or wrong ToUnicode CMap and/or a custom encoding. A publicly available example PDF can be found here: http://users.tpg.com.au/hufraser/PHONELIC.PDF
Acrobat Reader reports the fonts as Type 1 with custom encoding.
Is there a way with Aspose PDF (for Java) to detect PDFs with such problematic fonts? I saw a different topic from 2017 where you said that you're investigating making the font encoding available. Is this already implemented?
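In the meantime, the only workaround I can think of is a library-independent heuristic on the extracted text itself. Below is a minimal sketch of that idea. The class name, the method names and the threshold are my own assumptions, not part of any Aspose API. It counts code points that rarely occur in real text (private-use, unassigned, control, lone surrogates, U+FFFD). It will not catch the case where a custom encoding maps glyphs to valid but wrong letters, so it is at best a partial check.

```java
// Hypothetical helper, not part of Aspose PDF: scores already-extracted text.
final class GarbledTextHeuristic {

    // Returns the fraction of non-whitespace code points that look like
    // extraction garbage (private-use, unassigned, control, lone surrogate,
    // or the U+FFFD replacement character). Returns 0.0 for blank input.
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about the font mapping
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.CONTROL
                    || type == Character.SURROGATE) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    // The threshold is an arbitrary cut-off that would need tuning on real files.
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

We would call something like `GarbledTextHeuristic.looksGarbled(pageText, 0.3)` on the text of each page, however that text was extracted. A built-in way to inspect the font encoding would of course be far more reliable.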
For another PDF (which I can't share) with custom encoding, Aspose PDF exports no text (using MobiXmlSaveOptions), only image references and positions. It appears as if there's no mapping between the glyphs and the corresponding Unicode characters. Acrobat Reader reports the fonts as Type 3 with custom encodings. Surprisingly, Acrobat Reader can export this file as text, copy and paste works fine, and searching works.
The XML output of Aspose PDF looks like this:
<pdf2xml pages="8">
<title/>
<page width="792" height="612">
<text x="0" y="0" width="0" height="0"/>
<img x="32.64" y="565.683" width="5.278" height="5.038" src="pdf2xml_pic1.png"/>
<text x="0" y="0" width="0" height="0"/>
<img x="430" y="19.604" width="324.512" height="36.244" src="pdf2xml_pic2.jpg"/>
<text x="0" y="0" width="0" height="0"/>
<img x="232.32" y="110.286" width="15.354" height="15.594" src="pdf2xml_pic3.png"/>
[…]
Can you explain in which cases you produce such output instead of exporting text? And is there a way to detect this with Aspose PDF?
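As a stop-gap, we could post-process the exported XML ourselves and flag files where no text came out at all. Here is a minimal sketch using only the JDK's XML parser. The class and method names are hypothetical, and I am assuming that extracted text would normally appear as the content of the `text` elements, since my sample only shows empty ones.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Aspose PDF: inspects pdf2xml-style output.
final class Pdf2XmlTextCheck {

    // Returns true when the XML contains at least one <img> element and
    // every <text> element is blank, i.e. the export looks image-only.
    static boolean hasNoExtractedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList texts = doc.getElementsByTagName("text");
            for (int i = 0; i < texts.getLength(); i++) {
                // Assumption: real text would be the element's content.
                if (!texts.item(i).getTextContent().trim().isEmpty()) {
                    return false;
                }
            }
            return doc.getElementsByTagName("img").getLength() > 0;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse pdf2xml output", e);
        }
    }
}
```

This only tells us that the export produced no text. It does not explain why, which is what I would like to understand.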