Hi there,
I am working on an extraction tool using Aspose.PDF for Java (version 21.12). The goal is to extract highlighted text from PDFs. I am using the following code to extract the text, ordering the highlights by their page Y coordinate so that they come out in text order:
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(is);
for (com.aspose.pdf.Page page : pdfDoc.getPages()) {
    List<com.aspose.pdf.HighlightAnnotation> highlights = page.getAnnotations().stream()
        .filter(f -> f instanceof com.aspose.pdf.HighlightAnnotation)
        .map(f -> (com.aspose.pdf.HighlightAnnotation) f)
        .sorted((h1, h2) -> Double.compare(h2.getRect().getURY(), h1.getRect().getURY()))
        .collect(Collectors.toList());
    for (com.aspose.pdf.HighlightAnnotation annotation : highlights) {
        …
        text += " " + annotation.getMarkedText();
    }
}
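For reference, the ordering step on its own can be sketched without Aspose, roughly as below. This is only an illustration of sorting rectangles top to bottom (descending upper Y) with a left-to-right tie-break. `HighlightOrder` and `Box` are hypothetical names standing in for the annotation rectangles, not Aspose types, and the sketch says nothing about the order of text inside a single annotation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HighlightOrder {

    /** Hypothetical stand-in for an annotation rectangle (PDF coordinates, origin at bottom-left). */
    static final class Box {
        final String label;
        final double llx; // left edge
        final double ury; // top edge

        Box(String label, double llx, double ury) {
            this.label = label;
            this.llx = llx;
            this.ury = ury;
        }
    }

    /** Returns a new list sorted top to bottom, then left to right when top edges are equal. */
    static List<Box> readingOrder(List<Box> boxes) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(
            Comparator.comparingDouble((Box b) -> b.ury).reversed() // larger Y is higher on the page
                .thenComparingDouble(b -> b.llx));                  // tie-break on the left edge
        return sorted;
    }
}
```

The tie-break on the left edge is an addition that the original comparator does not have; it only matters when two highlights share the same top edge.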
I have encountered an issue with the order in which Japanese characters and numbers are extracted within a single annotation (this is unrelated to the sorting mentioned above).
I have the following example.
This is the original PDF:
Screenshot 2022-03-15 at 16.13.25.jpg (67.5 KB)
And this is what annotation.getMarkedText() produces:
89~132% 菌体破砕ステップにおける菌体破砕効率の管理値は %に設定しています。この管理値
は、プロセス・バリデーションの前に実施したパイロットスケールにおける特性試験及び上市
スケールにおける製造実績を基に設定しています。
Has anyone encountered a similar issue? Is this a known limitation or bug?