Aspose.PDF for Java: missing or wrong characters when extracting Arabic text

I am developing a Java program on OS X that extracts text from a PDF file. The code is fairly straightforward: it iterates over the PDF page by page and extracts the text on each page, as shown below.

for (com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>) pdfDocument.getPages()) {
    PageContent pageContent = new PageContent();

    // Extract all text content
    if (extractText) {
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
        textFragmentAbsorber.getTextSearchOptions().setLimitToPageBounds(true);

        page.accept(textFragmentAbsorber);

        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
        // Loop through the fragments
        textFragmentCollection.forEach(textFragment -> {
            // Store the text of each fragment as a paragraph
            pageContent.addParagraphs(new Paragraph(textFragment.getText()));
        });
    }
}

It extracts most of the Arabic text correctly. Sometimes, however, a character is missing, as in this segment:

{
        "text": "ﻓﺼﻠﻴﺔ - ﺗﺼﺪر ﻋﻦ اﻟﻬﻴﺌﺔ اﻟﻮﻃﻨﻴﺔ \u0002دارة اﻟﻄﻮارئ وا\u0006زﻣﺎت واﻟﻜﻮارث"
      },

Note the control characters \u0002 and \u0006 in the text. My colleague, who is a native Arabic speaker, noted the following:

  • All the extracted words are accurate apart from any word containing these two characters: “لأ”. In most occurrences, the extractor returns these two characters in reverse order. That is, it extracts them as “أل”.
  • For the case where the extraction had unicode artifacts (“text”: “ﻓﺼﻠﻴﺔ - ﺗﺼﺪر ﻋﻦ اﻟﻬﻴﺌﺔ اﻟﻮﻃﻨﻴﺔ \u0002دارة اﻟﻄﻮارئ وا\u0006زﻣﺎت واﻟﻜﻮارث”), the same two characters from above “لأ” are not extracted, and the unicode artifacts appear in their position instead.
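
To make affected segments easier to find and report, a check along the following lines might help. This is only a minimal diagnostic sketch, not a fix for the extraction itself: the class name `ExtractedTextCheck` and both helper methods are hypothetical, and the sample string is a shortened stand-in modeled on the segment above, not the exact extracted text. It uses only the standard `java.text.Normalizer`, which maps Arabic presentation forms (such as U+FEE3) back to their base letters under NFKC.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting text returned by the extractor.
final class ExtractedTextCheck {

    // Returns the indices of C0 control characters (excluding tab, LF, CR),
    // such as the \u0002 and \u0006 seen in the extracted segment.
    static List<Integer> controlCharPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                positions.add(i);
            }
        }
        return positions;
    }

    // Maps Arabic presentation forms to their base letters using NFKC.
    // Control characters are left untouched, so they can still be detected afterwards.
    static String toBaseLetters(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Illustrative sample: waw, alef, U+0006, zain, meem (initial form),
        // alef (final form), teh.
        String segment = "\u0648\u0627\u0006\u0632\uFEE3\uFE8E\u062A";

        System.out.println("Control characters at: " + controlCharPositions(segment));
        System.out.println("Normalized: " + toBaseLetters(segment));
    }
}
```

Calling `controlCharPositions(textFragment.getText())` inside the extraction loop would flag every fragment in which a control character appears, which gives a list of affected words to compare against the original page.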

The PDF is rather large and is publicly available here: Megazine. I only tried to extract the first page.

Any comments are highly appreciated.

@paxlicense

Character recognition and rendering can depend on how a particular OS is configured. Nevertheless, we need to investigate this in detail.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44490

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.