Free Support Forum - aspose.com

Highlighted Japanese characters and numbers extracted in incorrect order

Hi there,

I am working on an extraction tool using Aspose.PDF for Java (version 21.12). The goal is to extract highlighted text from PDFs. I am using the following code to extract the text, ordering the highlights by page Y coordinate so that they are in text order:

com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(is);
String text = "";
for (com.aspose.pdf.Page page : pdfDoc.getPages()) {
    // Keep only highlight annotations, sorted top to bottom (descending upper-right Y).
    List<com.aspose.pdf.HighlightAnnotation> highlights = page.getAnnotations().stream()
        .filter(f -> f instanceof com.aspose.pdf.HighlightAnnotation)
        .map(f -> (com.aspose.pdf.HighlightAnnotation) f)
        .sorted((h1, h2) -> Double.compare(h2.getRect().getURY(), h1.getRect().getURY()))
        .collect(Collectors.toList());
    for (com.aspose.pdf.HighlightAnnotation annotation : highlights) {
        text += " " + annotation.getMarkedText();
    }
}
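The ordering step on its own can be sketched without any Aspose types. This is only an illustration under assumptions: `HighlightOrder`, `Highlight`, and `joinInReadingOrder` are hypothetical names, and the `llx`/`ury` fields stand in for the left X and upper Y of each annotation rectangle. It also adds a left-to-right tie-break that the code above does not have. It only orders separate highlights and does nothing about the order of characters inside one `getMarkedText()` result.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HighlightOrder {
    /** Hypothetical stand-in for one highlight: its text plus the rectangle's left X and upper Y. */
    static final class Highlight {
        final String text;
        final double llx;
        final double ury;

        Highlight(String text, double llx, double ury) {
            this.text = text;
            this.llx = llx;
            this.ury = ury;
        }
    }

    /** Top of the page first (larger Y in PDF coordinates), then left to right on ties. */
    static final Comparator<Highlight> READING_ORDER = (a, b) -> {
        int byY = Double.compare(b.ury, a.ury);
        return byY != 0 ? byY : Double.compare(a.llx, b.llx);
    };

    /** Sorts a copy of the highlights and joins their text with single spaces. */
    static String joinInReadingOrder(List<Highlight> highlights) {
        List<Highlight> sorted = new ArrayList<>(highlights);
        sorted.sort(READING_ORDER);
        StringBuilder out = new StringBuilder();
        for (Highlight h : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(h.text);
        }
        return out.toString();
    }
}
```

Sorting a copy leaves the caller's list untouched, and a `StringBuilder` avoids repeated string concatenation when a page has many highlights.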

I have encountered an issue with Japanese characters and numbers: the order in which they are extracted within a single annotation is wrong (this is unrelated to the sorting mentioned above).

I have the following example.
This is the original PDF:
Screenshot 2022-03-15 at 16.13.25.jpg (67.5 KB)

And this is what annotation.getMarkedText() produces

89~132% 菌体破砕ステップにおける菌体破砕効率の管理値は %に設定しています。この管理値

Has anyone encountered a similar issue? Is this a known limitation or bug?


Could you please share your PDF file here for testing? We will investigate the issue and provide you more information on it.

Sure, here is my file
DOC-00336_removed_removed (1).pdf (592.8 KB)


We have managed to reproduce the issue on our side and have logged it in our issue tracking system as PDFJAVA-41414 so that it can be corrected. You will be notified via this forum thread once it is resolved.

We apologize for the inconvenience.