Aspose PDF TextFragment issue

vutuyen2636 · July 5, 2017, 8:43am

Hi Aspose,

I searched the attached PDF document for the text “to generate efficient machine code” and Aspose can’t find it, although the same search works in Adobe. My code is below:

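import java.io.FileInputStream;
import java.io.InputStream;

import com.aspose.pdf.Document;
import com.aspose.pdf.Page;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextSearchOptions;

// Class name is arbitrary; it only wraps the original main method.
public class TextSearchRepro {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("D:\\tmp\\Dat\\sherl_clean.pdf")) {
            Document document = new Document(in);
            // Plain-text search for the exact phrase (regular expressions disabled).
            TextFragmentAbsorber absorber = new TextFragmentAbsorber("to generate efficient machine code");
            boolean regularExpUsed = false;
            TextSearchOptions searchOption = new TextSearchOptions(regularExpUsed);
            absorber.setTextSearchOptions(searchOption);
            // Aspose.PDF collections are 1-based, so index 1 is the first page.
            Page firstPage = document.getPages().get_Item(1);
            firstPage.accept(absorber);
            System.out.println("Num of found text: " + absorber.getTextFragments().size());
            if (absorber.getTextFragments().size() > 0) {
                // Index 1 is the first matching fragment.
                TextFragment frag0 = absorber.getTextFragments().get_Item(1);
                System.out.println("Text in fragment: " + frag0.getText());
            } else {
                System.out.println("Can't find any text fragment");
            }
        }
    }
}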

It prints “Can’t find any text fragment”. If I change the search text to “to generate .* machine code” and set regularExpUsed to true, a fragment is found, but its text prints as “to generate ef?cient machine code” rather than the expected “to generate efficient machine code”.
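
One way to see what the “?” in “ef?cient” really is would be to dump the code points of the returned text. The sketch below is only a guess at the cause: it assumes the PDF stores “fi” as the single ligature character U+FB01, which the thread does not confirm. The class and method names (LigatureCheck, codePoints, foldLigatures) are made up for illustration, and the sample string stands in for the value returned by frag0.getText().

```java
import java.text.Normalizer;

public class LigatureCheck {
    // Lists every code point of s as U+XXXX, separated by spaces.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // NFKC normalization replaces compatibility characters, including
    // the "fi" ligature U+FB01, with their plain-letter equivalents.
    public static String foldLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumption: stand-in for frag0.getText(), with U+FB01 where the
        // console showed "?". The real character may be something else.
        String extracted = "to generate ef\uFB01cient machine code";

        System.out.println(codePoints(extracted));
        System.out.println(foldLigatures(extracted));
    }
}
```

If the dump does show U+FB01, the console “?” is probably just an encoding limitation of the terminal, and comparing the normalized strings on both sides would be a possible workaround until the search itself handles it. If it shows some other character, this guess does not apply.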

Please let me know if this is a bug. I’m using Aspose PDF for Java 17.5. Thank you.

sherl_clean.pdf (964.5 KB)

Regards,
Tuyen

imran.rafique · July 5, 2017, 12:50pm

@vutuyen2636,
We were able to reproduce the problem: the phrase you mention cannot be retrieved. It has been logged in our bug tracking system under ticket ID PDFJAVA-36882. We have linked your post to this ticket and will keep you informed of any updates.

Best Regards,
Imran Rafique