Aspose.PDF for Java: extracted Arabic text has missing characters (replaced with wrong ones)

paxlicense · November 10, 2024, 12:19pm

I am developing a Java program on macOS that extracts text from a PDF file. The code is fairly straightforward: it iterates over the PDF page by page and extracts the text on each page, as shown below.

import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;

for (com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>) pdfDocument.getPages()) {
    PageContent pageContent = new PageContent();

    // Extract all text content
    if (extractText) {
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
        textFragmentAbsorber.getTextSearchOptions().setLimitToPageBounds(true);

        // Run the absorber over the current page
        page.accept(textFragmentAbsorber);

        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
        // Loop through the fragments and store the text of each one
        textFragmentCollection.forEach(textFragment -> {
            pageContent.addParagraphs(new Paragraph(textFragment.getText()));
        });
    }
}

It extracts most of the Arabic text, but occasionally a character is missing, e.g. in this segment:

{
        "text": "ﻓﺼﻠﻴﺔ - ﺗﺼﺪر ﻋﻦ اﻟﻬﻴﺌﺔ اﻟﻮﻃﻨﻴﺔ \u0002دارة اﻟﻄﻮارئ وا\u0006زﻣﺎت واﻟﻜﻮارث"
      },

Notice the Unicode characters \u0002 and \u0006. My colleague, who is a native Arabic speaker, noted the following:

All the extracted words are accurate apart from any word containing these two characters: “لأ”. The extractor in most occurrences of these two characters extracts them in reverse order. That is, it extracts them as “أل”.
For the case where the extraction had unicode artifacts (“text”: “ﻓﺼﻠﻴﺔ - ﺗﺼﺪر ﻋﻦ اﻟﻬﻴﺌﺔ اﻟﻮﻃﻨﻴﺔ \u0002دارة اﻟﻄﻮارئ وا\u0006زﻣﺎت واﻟﻜﻮارث”), the same two characters from above “لأ” are not extracted, and the unicode artifacts appear in their position instead.
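
To narrow down where the artifacts occur, a small diagnostic along the following lines may help. It is only a sketch: the class `ExtractionDiagnostics` and its method names are made up for illustration and are not part of Aspose.PDF. It uses only the Java standard library and works on strings that have already been extracted. The first method lists the positions of control characters such as \u0002 and \u0006; the second applies NFKC normalization, which folds Arabic presentation forms (the contextual glyph code points visible in the sample above) back to base letters so the output is easier to compare with the expected text. It does not recover the characters that were lost.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Aspose.PDF.
final class ExtractionDiagnostics {

    // Returns the index of every ISO control character (e.g. U+0002, U+0006),
    // ignoring the ordinary whitespace controls tab, LF and CR.
    static List<Integer> controlCharPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                positions.add(i);
            }
        }
        return positions;
    }

    // Folds compatibility characters, including Arabic presentation forms,
    // to their base letters using NFKC normalization.
    static String toBaseLetters(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Short stand-in for one extracted fragment containing an artifact.
        String fragment = "\uFE8D\uFEDF \u0002\u062F\u0627\u0631\u0629";
        System.out.println("Control characters at: " + controlCharPositions(fragment));
        System.out.println("Normalized: " + toBaseLetters(fragment));
    }
}
```

Calling `controlCharPositions(textFragment.getText())` inside the loop above would show which fragments are affected, which might make the issue easier to reproduce on a smaller input.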

paxlicense · November 10, 2024, 12:24pm

The PDF is rather large and is publicly available here: Megazine. I only tried to extract the first page.

Any comments are highly appreciated.

asad.ali · November 11, 2024, 11:29am

@paxlicense

Character recognition and rendering can depend on how a particular OS is configured. Nevertheless, we need to investigate this in detail.
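
Since OS configuration may play a role, it can help to attach the runtime environment to a report like this one. The following is a minimal sketch using only standard Java system properties; the class name `EnvironmentReport` is hypothetical and nothing here is specific to Aspose.PDF.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper for illustration: collects basic environment details.
final class EnvironmentReport {

    // Builds a multi-line summary of OS, JVM, default charset and locale.
    static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("os.name=").append(System.getProperty("os.name")).append('\n');
        sb.append("os.version=").append(System.getProperty("os.version")).append('\n');
        sb.append("java.version=").append(System.getProperty("java.version")).append('\n');
        sb.append("default.charset=").append(Charset.defaultCharset().name()).append('\n');
        sb.append("default.locale=").append(Locale.getDefault().toLanguageTag()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```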

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44490

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

Ethan111 · February 24, 2025, 3:01am

Following this thread. I have the same issue.

asad.ali · February 24, 2025, 9:24am

@Ethan111

Sure, we have recorded your concerns and will inform you as soon as the ticket is resolved. Please bear with us in the meantime.

We are sorry for the inconvenience.