While developing a Java program on OS X, I am extracting text from a PDF file with Aspose.PDF. The code is fairly straightforward: it iterates over the PDF page by page and extracts the text on each page, as shown below:
for (com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>) pdfDocument.getPages()) {
    PageContent pageContent = new PageContent();
    // Extract all text content
    if (extractText) {
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
        textFragmentAbsorber.getTextSearchOptions().setLimitToPageBounds(true);
        page.accept(textFragmentAbsorber);
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
        // Loop through the fragments
        textFragmentCollection.forEach(textFragment -> {
            // Store the text of each fragment as a paragraph
            pageContent.addParagraphs(new Paragraph(textFragment.getText()));
        });
    }
}
This extracts most of the Arabic text correctly, but occasionally a character is missing, e.g. in this segment:
{
"text": "ﻓﺼﻠﻴﺔ - ﺗﺼﺪر ﻋﻦ اﻟﻬﻴﺌﺔ اﻟﻮﻃﻨﻴﺔ \u0002دارة اﻟﻄﻮارئ وا\u0006زﻣﺎت واﻟﻜﻮارث"
},
Notice the \u0002 and \u0006 control characters. My colleague, who is a native Arabic speaker, made the following observations:
- All the extracted words are accurate apart from any word containing these two characters: “لأ”. The extractor in most occurrences of these two characters extracts them in reverse order. That is, it extracts them as “أل”.
- For the case where the extraction had unicode artifacts (“text”: “ﻓﺼﻠﻴﺔ - ﺗﺼﺪر ﻋﻦ اﻟﻬﻴﺌﺔ اﻟﻮﻃﻨﻴﺔ \u0002دارة اﻟﻄﻮارئ وا\u0006زﻣﺎت واﻟﻜﻮارث”), the same two characters from above “لأ” are not extracted, and the unicode artifacts appear in their position instead.
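To make the two observations above easier to reproduce, here is a minimal diagnostic sketch in plain Java. It makes no Aspose calls, and the class and method names are hypothetical. It reports where C0 control characters such as U+0002 and U+0006 occur in an extracted string, and it folds Arabic presentation forms (including the lam-alef ligature) back to base letters with NFKC so the letter order can be inspected. It only helps to locate and inspect the problem; it cannot recover the missing letters. My guess, and it is only a guess, is that the control characters are glyph codes that could not be mapped back to Unicode.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting extracted text; not part of Aspose.PDF.
final class ExtractedTextDiagnostics {

    /** Returns "U+XXXX@index" for every C0 control character other than tab, LF, and CR. */
    static List<String> findControlChars(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                hits.add(String.format("U+%04X@%d", (int) c, i));
            }
        }
        return hits;
    }

    /** Folds Arabic presentation forms to their base letters using NFKC normalization. */
    static String foldPresentationForms(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /** Renders a string as space-separated code points so the logical order is visible. */
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Illustrative sample: waw, alef, a stray control character, zain,
        // then meem and alef encoded as presentation forms, then teh.
        String extracted = "\u0648\u0627\u0006\u0632\uFEE3\uFE8E\u062A";
        System.out.println("Control characters: " + findControlChars(extracted));
        System.out.println("Before folding:     " + toCodePoints(extracted));
        System.out.println("After folding:      " + toCodePoints(foldPresentationForms(extracted)));

        // The lam-alef-hamza ligature folds to lam followed by alef with hamza above.
        System.out.println("Ligature folded:    " + toCodePoints(foldPresentationForms("\uFEF7")));
    }
}
```

Running `toCodePoints` on the fragments where the two letters come out reversed should show whether the text already holds them in the wrong logical order or whether it is only being displayed that way.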