Hi, please find below replies from IBM support.
I also uploaded sample documents: New folder.zip (294.9 KB). Thanks for your help.
The source code provided is a single high-level convert2Pdf method with no details about the fonts applied. Possibly fonts are applied when the document object is created or saved:
com.aspose.words.Document doc = new com.aspose.words.Document(filePath + workingFileName);
doc.save(filePath + workingFileNameWoExt + RecordConstants.EXT_PDF, com.aspose.words.SaveFormat.PDF);
However, I can’t tell from the code provided which fonts are applied or which fonts are available.
So at this point there is nothing further we can offer the customer, other than to reiterate that they need to change the fonts used to generate the PDF in order for text extraction to work. I can’t tell from the source code how they can do that.
The problem is the use of embedded CID fonts (see attached PDF_properties.bmp). There is no way for Oracle to reliably extract text from documents with embedded CID fonts, because a complete mapping from the font's characters to Unicode is not available. The font mapping is embedded only partially, and the embedded fonts use their own unique IDs for characters, so the Oracle Search Export utility we use for text extraction can’t derive the equivalent Unicode character. In their PDF generation code they need to turn off font embedding in the PDF publishing process. They should try standard system fonts with font embedding off, for example a full Unicode font (e.g. TrueType) rather than a CID font.
There’s little more we can do on our end. If the customer has specific questions about what fonts to use, we can relay those to Oracle tech support. However, Oracle would likely need specifics on what font is in use and what fonts are available.
Note: There is an Oracle command-line utility (exsimple) they can use to test PDF files and verify that the text produced for indexing will work with CPE/CSS. This may help them determine whether a specific font will work:
From their CPE server, copy the contents of the Oracle INSO directory (e.g. on Windows \Program Files\IBM\WebSphere\AppServer\profiles\AppSrv01\FileNet\server1\INSO\bin\sx-8-5-2-win-x86-64) to a temp directory. They should not change anything in their INSO directory!!!
Copy the PDF(s) to be tested to that temp directory. Then, from the command line, change to that directory and execute:
exsimple Aspose.pdf Aspose.txt sx.cfg (Windows)
./exsimple Aspose.pdf Aspose.txt sx.cfg (Linux, etc.)
where Aspose.pdf is their sample PDF and Aspose.txt is the file where the extracted text is saved. This is the text that CPE/CSS uses for indexing.
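If several PDFs need to be tested, the exsimple call above could be scripted. The following is only a rough sketch in Java, not something IBM or Oracle provided. The class name ExsimpleRunner, its methods, and the assumption that the Windows binary is called exsimple.exe are all mine. It relies only on the command line shown above (input PDF, output text file, sx.cfg) and makes no claim about what exsimple's exit codes mean.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ExsimpleRunner {

    // Builds the argument list for one exsimple call:
    // <executable> <input PDF> <output text file> sx.cfg
    static List<String> buildCommand(Path tempDir, String pdfName, boolean windows) {
        // Assumption: the Windows binary is named exsimple.exe.
        String exe = tempDir.toAbsolutePath()
                .resolve(windows ? "exsimple.exe" : "exsimple")
                .toString();
        // Aspose.pdf -> Aspose.txt
        String txtName = pdfName.replaceAll("(?i)\\.pdf$", "") + ".txt";
        return Arrays.asList(exe, pdfName, txtName, "sx.cfg");
    }

    // Runs exsimple with tempDir as the working directory and returns its exit code.
    static int run(Path tempDir, String pdfName) throws IOException, InterruptedException {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tempDir, pdfName, windows));
        pb.directory(tempDir.toFile()); // equivalent of "change to that directory"
        pb.inheritIO();                 // let any export failure message reach the console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java ExsimpleRunner <tempDir> <file.pdf> [more.pdf ...]");
            return;
        }
        Path tempDir = Paths.get(args[0]);
        for (int i = 1; i < args.length; i++) {
            System.out.println(args[i] + " -> exit code " + run(tempDir, args[i]));
        }
    }
}
```

The temp directory passed in should be the copy of the INSO directory, never the original one on the CPE server.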
An export failure message or unreadable characters in the generated file indicate a problem with the font used in the PDF and that the file will not index properly.
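Checking the generated text file by eye is the real test, but a quick automated first pass is possible. The sketch below is my own heuristic, not part of the Oracle or IBM tooling. It counts characters that usually signal a failed font mapping: the Unicode replacement character, private-use and unassigned code points, unpaired surrogates, and control characters other than line breaks, tabs, and form feeds. The class name ExtractedTextCheck, the 5% threshold, and the assumption that the output file is UTF-8 are all hypothetical, and the actual encoding depends on sx.cfg.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExtractedTextCheck {

    // Returns the fraction of code points that look like extraction garbage
    // (0.0 for empty text).
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int bad = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            boolean layoutChar = cp == '\n' || cp == '\r' || cp == '\t' || cp == '\f';
            if (cp == 0xFFFD                          // replacement character
                    || type == Character.PRIVATE_USE  // font-specific glyph IDs often land here
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE    // unpaired surrogate
                    || (type == Character.CONTROL && !layoutChar)) {
                bad++;
            }
        }
        return (double) bad / total;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java ExtractedTextCheck <extracted.txt>");
            return;
        }
        // Assumption: the extracted text is UTF-8; malformed bytes decode to U+FFFD.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        double ratio = suspiciousRatio(text);
        // Hypothetical threshold: flag the file if more than 5% of characters look wrong.
        System.out.printf("%s: %.1f%% suspicious characters%s%n",
                args[0], ratio * 100, ratio > 0.05 ? " (likely font problem)" : "");
    }
}
```

This will not catch text that is made of valid but wrong characters, so a file that passes should still be opened and read.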