Highlighted Japanese characters and numbers extracted in incorrect order

wwalker · March 15, 2022, 11:35pm

Hi there,

I am working on an extraction tool using Aspose.PDF for Java (version 21.12). The goal is to extract highlighted text from PDFs. I am using the following code to extract the text, ordering the highlights by their page Y coordinate so that they are in text order:

com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(is);
for (com.aspose.pdf.Page page : pdfDoc.getPages()) {
    // Keep only highlight annotations, sorted top-to-bottom by upper-right Y
    List<com.aspose.pdf.HighlightAnnotation> highlights = page.getAnnotations().stream()
        .filter(f -> f instanceof com.aspose.pdf.HighlightAnnotation)
        .map(f -> (com.aspose.pdf.HighlightAnnotation) f)
        .sorted((h1, h2) -> Double.compare(h2.getRect().getURY(), h1.getRect().getURY()))
        .collect(Collectors.toList());
    for (com.aspose.pdf.HighlightAnnotation annotation : highlights) {
        …
        text += " " + annotation.getMarkedText();
    }
}
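The ordering step on its own can be sketched without the PDF library. This is only an illustration under assumptions: `HighlightOrder`, `Highlight`, `inReadingOrder` and `joinText` are hypothetical names, not Aspose types or methods, and the left-to-right tie-break for highlights on the same line is an addition that the code above does not have. It relies on the default PDF coordinate system, where Y grows upward, so a larger upper-right Y means higher on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, independent of Aspose: orders highlights in reading order.
final class HighlightOrder {

    // Hypothetical stand-in for an annotation: its marked text plus two rectangle coordinates.
    static final class Highlight {
        final String text;
        final double llx; // left edge of the rectangle
        final double ury; // top edge of the rectangle

        Highlight(String text, double llx, double ury) {
            this.text = text;
            this.llx = llx;
            this.ury = ury;
        }
    }

    // Returns a new list sorted top-to-bottom (descending top edge),
    // then left-to-right (ascending left edge) for equal top edges.
    static List<Highlight> inReadingOrder(List<Highlight> highlights) {
        List<Highlight> sorted = new ArrayList<>(highlights);
        sorted.sort(Comparator.comparingDouble((Highlight h) -> h.ury).reversed()
                .thenComparingDouble(h -> h.llx));
        return sorted;
    }

    // Joins the marked text of all highlights in reading order, separated by single spaces.
    static String joinText(List<Highlight> highlights) {
        StringBuilder sb = new StringBuilder();
        for (Highlight h : inReadingOrder(highlights)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(h.text);
        }
        return sb.toString();
    }
}
```

Sorting like this only affects the order of whole annotations relative to each other; it cannot change the character order inside the text returned for one annotation.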

I have encountered an issue with Japanese characters and numbers: they are extracted in the wrong order within a single annotation (this is unrelated to the sorting mentioned above).

I have the following example.
This is the original PDF:
Screenshot 2022-03-15 at 16.13.25.jpg (67.5 KB)

And this is what annotation.getMarkedText() produces:

89~132% 菌体破砕ステップにおける菌体破砕効率の管理値は %に設定しています。この管理値
は、プロセス・バリデーションの前に実施したパイロットスケールにおける特性試験及び上市
スケールにおける製造実績を基に設定しています。

Has anyone encountered a similar issue? Is this a known limitation or bug?

tahir.manzoor · March 16, 2022, 5:52am

@wwalker

Could you please share your PDF file here for testing? We will investigate the issue and provide you with more information.

wwalker · March 16, 2022, 11:27am

Sure, here is my file
DOC-00336_removed_removed (1).pdf (592.8 KB)

tahir.manzoor · March 16, 2022, 6:39pm

@wwalker

We have reproduced the issue on our side and logged it in our issue tracking system as PDFJAVA-41414. You will be notified via this forum thread once it is resolved.

We apologize for the inconvenience.

aspose.notifier · June 5, 2023, 9:59pm

The issues you have found earlier (filed as PDFJAVA-41414) have been fixed in Aspose.PDF for Java 23.5.