Text extraction issue

jasonjun.ncs.sg · August 7, 2017, 5:45am

Hi Aspose support,

Please advise how to resolve the text extraction issue below, which was raised by IBM. Thanks.

There are many ways to create PDFs and include fonts, and the overriding problem with the sample documents is that you use embedded CID fonts. According to Oracle support, text cannot be reliably extracted from them because, with embedded CID fonts, no complete mapping from the font characters to Unicode is available. He explained that the font mapping is embedded, but only for the characters that are included in the PDF (so there’s only a partial map to work with), and that you use your own IDs for characters, so Search Export can’t determine what the equivalent Unicode character is. Identity-H, which your fonts use, is a horizontal-writing CID encoding commonly used for non-Latin text.

He did provide a couple of suggestions…

Turn off font embedding in the PDF publishing process; if you use standard system fonts with font embedding off, you will have far better success. For example, if you use Adobe PDF to generate PDF files, look under Configuration -> Publishing Options -> Adobe PDF -> Font embedding to turn off font embedding.
Use a full Unicode font, for example TrueType (but not CID).
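
To make the partial-mapping point above concrete, here is a minimal Java sketch. It is not Oracle's or Aspose's code: the class name, the character codes, and the map contents are all made up for illustration. It only shows why an extractor that has a code-to-Unicode map for some of a font's private character codes has to fall back to a placeholder (U+FFFD here) for the rest.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a toy decoder for font-private character codes.
class PartialCMapDemo {

    // Translate each code through the map; codes without an entry
    // become the Unicode replacement character.
    static String decode(int[] codes, Map<Integer, Character> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Character ch = toUnicode.get(code);
            sb.append(ch != null ? ch : '\uFFFD');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical partial map: the entry for code 4 is missing.
        Map<Integer, Character> toUnicode = new HashMap<>();
        toUnicode.put(1, 'T');
        toUnicode.put(2, 'e');
        toUnicode.put(3, 's');

        int[] codes = {1, 2, 3, 4}; // intended text: "Test"
        // Prints "Tes" followed by the replacement character.
        System.out.println(decode(codes, toUnicode));
    }
}
```

With a complete map every code resolves to a character, which is the situation the two suggestions above aim for.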

asad.ali · August 7, 2017, 1:36pm

@jasonjun.ncs.sg

Thanks for contacting support.

Aspose.Pdf for .NET offers different font embedding strategies when creating PDF documents. You can embed either the complete font or only a subset of it inside the PDF document. Regarding the issue you mentioned with CID fonts, would you please share a sample document that has an embedded CID font, along with the code snippet you are using to extract text from it?

This way we can observe the issue in our environment and address it accordingly.

jasonjun.ncs.sg · August 14, 2017, 7:21am

Hi, please find below replies from IBM support.
I have also uploaded sample documents. Thanks for your help.
New folder.zip (294.9 KB)

The source code provided is a single high-level convert2Pdf method with no details about the fonts applied. Possibly fonts are applied when the document object is created or saved:

com.aspose.words.Document doc = new com.aspose.words.Document(filePath + workingFileName);
doc.save(filePath + workingFileNameWoExt + RecordConstants.EXT_PDF, com.aspose.words.SaveFormat.PDF);

However, I can’t tell that from the code provided, nor can I tell which fonts are available.

So at this point there is nothing further we can offer the customer other than to reiterate that they need to change the fonts used to generate the PDF in order for text extraction to work. I can’t tell from the source code how they can do that.

The problem is the use of embedded CID fonts (see attached PDF_properties.bmp). There is no way for Oracle to reliably extract text from documents with embedded CID fonts because no complete mapping from the font characters to Unicode is available. The font mapping is embedded only partially, and unique IDs for characters are applied for the embedded fonts, so the Oracle Search Export utility we use for text extraction can’t derive the equivalent Unicode character. In their PDF generation code they need to turn off font embedding in the PDF publishing process and try using standard system fonts with font embedding off, for example a full Unicode font (e.g. TrueType) but not CID.
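
As a rough first check of whether a given PDF uses this kind of font, the Java sketch below counts a few PDF name tokens in the raw file bytes. It is only a heuristic under stated assumptions: the class and method names are made up, and PDFs that store their font dictionaries in compressed object streams will not show these tokens in plain text, so a count of zero is not conclusive. The tokens themselves (/Identity-H, /CIDFontType0, /CIDFontType2, /ToUnicode) are standard PDF names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Heuristic sketch: count font-related PDF name tokens in the raw bytes.
class PdfFontMarkerScan {

    // ISO-8859-1 maps every byte to exactly one char, so binary data
    // survives the conversion and plain-text tokens can be searched.
    static int count(byte[] pdfBytes, String marker) {
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int n = 0;
        int from = raw.indexOf(marker);
        while (from >= 0) {
            n++;
            from = raw.indexOf(marker, from + marker.length());
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfFontMarkerScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String[] markers = {"/Identity-H", "/CIDFontType0", "/CIDFontType2", "/ToUnicode"};
        for (String marker : markers) {
            System.out.println(marker + ": " + count(bytes, marker));
        }
    }
}
```

If /Identity-H appears but /ToUnicode does not, the file may lack the mapping an extractor needs; a PDF viewer's font properties dialog remains the more reliable source.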

There’s little more we can do on our end. If the customer has specific questions about which fonts to use, we can relay those to Oracle tech support; however, Oracle would likely need specifics on which font is in use and which fonts are available.

Note: There is an Oracle command line utility (exsimple) they can use to test PDF files and verify that the text produced for indexing will work with CPE/CSS. This may help them determine whether a specific font will work:

From their CPE server, copy the contents of the Oracle INSO directory (e.g. on Windows \Program Files\IBM\WebSphere\AppServer\profiles\AppSrv01\FileNet\server1\INSO\bin\sx-8-5-2-win-x86-64) to a temp directory. They should not change anything in their INSO directory!!!
Copy the PDF(s) to be tested to that temp directory then, from the command line, change to that directory and execute:

exsimple Aspose.pdf Aspose.txt sx.cfg (Windows)
or
./exsimple Aspose.pdf Aspose.txt sx.cfg (Linux, etc.)

where Aspose.pdf is their sample PDF and Aspose.txt is the file where the extracted text is saved; this is the text that CPE/CSS uses for indexing.

An export failure message or unreadable characters in the generated file indicate a problem with the font used in the PDF and that the file will not index properly.
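
The "unreadable characters" check can be roughly automated. The Java sketch below reads the text file written by exsimple and reports the share of characters that look unreadable (the U+FFFD replacement character, private-use or unassigned code points, and non-whitespace control characters). The class name, the assumption that the output file is UTF-8, and the 5% threshold are all illustrative choices, not anything specified by Oracle or IBM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: estimate how much of an extracted text file is unreadable.
class ExtractedTextCheck {

    // Returns the fraction of code points that look unreadable.
    // Empty text is treated as fully suspicious (nothing was extracted).
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 1.0;
        }
        int total = 0;
        int bad = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            boolean control = Character.isISOControl(cp) && !Character.isWhitespace(cp);
            if (cp == 0xFFFD || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED || control) {
                bad++;
            }
        }
        return (double) bad / total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ExtractedTextCheck <extracted.txt>");
            return;
        }
        // Assumption: the extracted text is UTF-8; malformed bytes decode
        // to U+FFFD and are therefore counted as suspicious.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        double ratio = suspiciousRatio(text);
        System.out.printf("suspicious characters: %.1f%%%n", ratio * 100);
        // The 5% threshold is arbitrary; tune it for your documents.
        System.out.println(ratio > 0.05 ? "likely a font mapping problem" : "text looks readable");
    }
}
```

For example, after running exsimple as described above, `java ExtractedTextCheck Aspose.txt` would print the ratio for that file. Opening the file and reading it is still the final check.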

asad.ali · August 14, 2017, 2:07pm

@jasonjun.ncs.sg

Thanks for sharing more details.

As I understand it, you are having an issue extracting text from PDF documents that have a CID font embedded in them. I have reviewed the documents you shared and found that a CID font was present in only one PDF file (Test test test test tes print as pdf.pdf); see the attached screenshot CID.png (8.5 KB).

In the shared screenshot, you may observe that the font details differ from those shown in your screenshot. Nevertheless, I tried to extract the text in the specified environment (IBM WebSphere Application Server, JDK 1.8, Aspose.Pdf for Java 17.7) and did not notice any issue or error. The text was extracted correctly.

Please check the following code snippet, which I used to extract the text.

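// Load the PDF in which the CID font was found
com.aspose.pdf.Document doc = new com.aspose.pdf.Document("Test test test test tes print as pdf.pdf");
// Collect the text of all pages with a TextAbsorber
com.aspose.pdf.TextAbsorber ta = new com.aspose.pdf.TextAbsorber();
doc.getPages().accept(ta);
// Print the extracted text
System.out.println(ta.getText());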

If my assumptions about the scenario are not correct, or if I am missing something, please share more details so that we can look into the matter and share our feedback accordingly.