PDF search issue with special characters

savatham · October 18, 2023, 10:35am

Hi,

We need to search a PDF for a given piece of text, find its coordinates, and then annotate it. When the text contains special characters such as single or double quotes, the search fails to find it. The code we use to search and annotate the PDF is attached.
Please help us resolve this issue.

PDFAnnotator.zip (3.4 KB)

asad.ali · October 18, 2023, 5:19pm

@savatham

The problem may be related to regular expressions. Could you please share the sample PDF and the text that you want to highlight in it? We will try to add the highlight annotation in our environment and share our feedback with you.

savatham · October 19, 2023, 6:40am

Thanks for looking into this @asad.ali.

The PDF file is attached:
June-22-2023-10.08.38_Fast Foods CY - Copy.pdf (422.7 KB)

We are trying to search for the paragraph below and annotate it.

“We have audited the accompanying financial statements of (the “Company”), which comprise the balance sheets as
of December 31, 2019 and 2018, and the related statements of operations, changes in members’ (deficit)/equity and
cash flows for the years then ended, and the related notes to the financial statements.”

Please let me know if you have any further questions on this. Appreciate your help.

asad.ali · October 19, 2023, 6:34pm

@savatham

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-43222

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

asad.ali · February 8, 2024, 8:55pm

@savatham

You can use the following code snippet to get the required result:

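// Classes used here come from the com.aspose.pdf package; dataDir is the folder that holds the input file.
// Text to find; the double quotes are escaped only for the Java string literal.
String inputText = "We have audited the accompanying financial statements of (the \"Company\"), which comprise the balance sheets as of December 31, 2019 and 2018, and the related statements of operations, changes in members' (deficit)/equity and cash flows for the years then ended, and the related notes to the financial statements.";

// Escape regex metacharacters, then let every space match any run of whitespace (including line breaks).
String regexText = escapeRegex(inputText).replaceAll(" ", "\\\\s+");
Document document = new Document(dataDir + "inputforhighlight.pdf");
// Page indexes are 1-based; this snippet searches page 3 only.
Page page = document.getPages().get_Item(3);
// Passing true to TextSearchOptions makes the absorber treat the phrase as a regular expression.
TextFragmentAbsorber tfa = new TextFragmentAbsorber(regexText, new TextSearchOptions(true));
tfa.visit(page);
for (TextFragment textFragment : tfa.getTextFragments()) {
    System.out.println("Found Text!!");
    // Highlight the rectangle occupied by the matched fragment.
    HighlightAnnotation highlightAnnotation = new HighlightAnnotation(textFragment.getPage(), textFragment.getRectangle());
    page.getAnnotations().add(highlightAnnotation);
}
document.save(dataDir + "inputforhighlight_mod_" + BuildVersionInfo.AssemblyVersion + ".pdf");

// Prefix quotes and regex metacharacters with a backslash so that they are matched literally.
private static String escapeRegex(String input) {
    return input.replaceAll("([\"\'.*+?{}()\\^$|])", "\\\\$1");
}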
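
To see why the escaping step matters, the pattern-building part of the snippet above can be tried without any PDF library. The following is a minimal sketch using only java.util.regex; the class name SearchPatternDemo, the helper toSearchPattern, and the shortened sample strings are hypothetical and exist only for illustration. It assumes the text extracted from the page contains the same straight quotes as the search phrase.

```java
import java.util.regex.Pattern;

public class SearchPatternDemo {

    // Same escaping as in the snippet above: put a backslash in front of
    // quotes and regex metacharacters so they are matched literally.
    static String escapeRegex(String input) {
        return input.replaceAll("([\"\'.*+?{}()\\^$|])", "\\\\$1");
    }

    // Hypothetical helper: escape the phrase, then let every space match
    // any run of whitespace, so line breaks inside the paragraph are tolerated.
    static Pattern toSearchPattern(String phrase) {
        return Pattern.compile(escapeRegex(phrase).replaceAll(" ", "\\\\s+"));
    }

    public static void main(String[] args) {
        String phrase = "statements of (the \"Company\"), which comprise the balance sheets as of December 31, 2019";

        // Stand-in for page text: same words, but wrapped onto a second line.
        String pageText = "financial statements of (the \"Company\"), which comprise the balance sheets as\n"
                + "of December 31, 2019 and 2018";

        // Without escaping, the parentheses are read as a capturing group,
        // so the literal "(" in the page text is never matched.
        boolean naive = Pattern.compile(phrase.replaceAll(" ", "\\\\s+")).matcher(pageText).find();

        // With escaping, the parentheses and quotes are matched literally.
        boolean escaped = toSearchPattern(phrase).matcher(pageText).find();

        System.out.println("naive=" + naive + ", escaped=" + escaped);
        // prints "naive=false, escaped=true"
    }
}
```

Note that the character class in escapeRegex does not include `[`, `]`, or the backslash itself, so a phrase containing those characters would need them added to the class.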