TextFragmentAbsorber does not find sentences containing special characters

nathiya1 · July 6, 2021, 7:39am

My project requirement is as follows.
The user draws a rectangle on a PDF page to hide some sensitive content. The rectangle may include extra empty space around the text, so we have to remove that space and shrink the rectangle. To achieve this, we grab the text under the user-drawn rectangle, find the exact bounding rectangle of that text, and then resize the user's rectangle to fit the text.
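
The shrinking step described above amounts to taking the union of the bounding boxes of the matched text. A minimal sketch, assuming the boxes have already been collected (presumably from the rectangle each found fragment reports); `Box` and `union` are hypothetical names used for illustration and are not part of the Aspose API:

```java
import java.util.Arrays;
import java.util.List;

class RectangleShrinker {
    /** Hypothetical value type standing in for the library's rectangle class. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /** Returns the smallest box that encloses every box in the list. */
    static Box union(List<Box> boxes) {
        if (boxes.isEmpty()) {
            throw new IllegalArgumentException("no text fragments to enclose");
        }
        double llx = Double.POSITIVE_INFINITY;
        double lly = Double.POSITIVE_INFINITY;
        double urx = Double.NEGATIVE_INFINITY;
        double ury = Double.NEGATIVE_INFINITY;
        for (Box b : boxes) {
            llx = Math.min(llx, b.llx);
            lly = Math.min(lly, b.lly);
            urx = Math.max(urx, b.urx);
            ury = Math.max(ury, b.ury);
        }
        return new Box(llx, lly, urx, ury);
    }

    public static void main(String[] args) {
        // Made-up coordinates for two lines of text inside a larger user-drawn rectangle.
        List<Box> fragments = Arrays.asList(
                new Box(40, 384, 555, 396),
                new Box(36, 370, 300, 382));
        Box tight = union(fragments);
        System.out.println(tight.llx + ", " + tight.lly + ", " + tight.urx + ", " + tight.ury);
    }
}
```

The resulting box can then replace the user-drawn rectangle, optionally with a small padding so glyph edges are not clipped.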

I am using com.aspose.pdf.TextAbsorber to get the text under the user-drawn rectangle, and it returns the exact text.

Then I use com.aspose.pdf.TextFragmentAbsorber to search for that text and get its rectangle (coordinates) on the PDF page. This works for plain letters and digits, but I am facing issues in the cases below.

Case 1: When I search for text containing special characters such as (, ), ?, or +, the search is not supported and no fragments are returned.
Sample search text:

Fri, 16 Apr 2021 10:26:58 +0000
( for pdf and MS Office files)

Case 2: When the search text spans three lines and has an empty line between two lines of text, TextFragmentAbsorber does not find it.

Case 3: When the text has bullet points like those used in Word, the search does not work.

The sample PDF file is attached:
Sample pdf file.pdf (109.7 KB)

Kindly help me to resolve this issue.

Thanks,
Nathiya

asad.ali · July 6, 2021, 6:25pm

@nathiya1

We tested the scenario using version 21.6 of the API and TextFragmentAbsorber as shown below. We did not notice any issue, as the API was able to find both text instances:

TextFragmentAbsorber textFragmentAbsorberReplacement = new TextFragmentAbsorber("Fri, 16 Apr 2021 10:26:58 +0000", new TextSearchOptions(false));

Can you please share examples of how you are using TextFragmentAbsorber to extract text in the above scenarios? We will further test the scenario in our environment and address it accordingly.

nathiya1 · July 8, 2021, 11:54am

Hi @asad.ali,

Thanks for your quick response.
I changed my code to match yours, and Case 1 now works fine. Please suggest a solution for the other cases too. The inputs are below.

We originally used TextFragmentAbsorber as follows:
private static final String CASE_INSENSITIVE = "(?i)";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(CASE_INSENSITIVE + searchQuery, new TextSearchOptions(Boolean.TRUE));

I have now changed it as per your suggestion:
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(searchQuery, new TextSearchOptions(Boolean.FALSE));
With this change, Case 1 (special characters) is resolved.
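
A likely explanation for Case 1, assuming the regular-expression mode follows common regex conventions: the raw search text was being interpreted as a pattern, so `+`, `(`, `)` and `?` acted as operators instead of literal characters. The sketch below illustrates this with the standard `java.util.regex` package only, not with Aspose itself; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class RegexVersusLiteral {
    /** Searches pageText for query, treating query as a case-insensitive regular expression. */
    static boolean foundAsRegex(String query, String pageText) {
        return Pattern.compile("(?i)" + query).matcher(pageText).find();
    }

    /** Searches pageText for query, treating every character of query literally. */
    static boolean foundAsLiteral(String query, String pageText) {
        return pageText.contains(query);
    }

    public static void main(String[] args) {
        String text = "Fri, 16 Apr 2021 10:26:58 +0000";

        // As a regex, " +" means "one or more spaces", so the literal "+" in the text never matches.
        System.out.println(foundAsRegex(text, text));   // false

        // As plain text, the same query matches itself.
        System.out.println(foundAsLiteral(text, text)); // true
    }
}
```

This is consistent with the fix above: turning the regular-expression option off makes the special characters literal again.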

I am still facing issues with the other cases. The attached file highlights the samples given below.

Case 2:

Sample Text: (taken from page 2)
"Subject: Initial mail

Initial mail"

Rectangle Value : llx - 45,6 lly - 15, urx - 264, ury - 682

Case 3:

Sample Text: (taken from page 3)
“ The document appears to be corrupted and cannot be loaded. ( for pdf and MS Office files) - Handled if
we import from local and AMANDA attachments”

Rectangle Value : llx - 36, lly - 366, urx- 560, ury - 405

Sample pdf file(1).pdf (110.1 KB)

Thanks,
Nathiya

asad.ali · July 8, 2021, 10:40pm

@nathiya1

The shared PDF only has 2 pages. Can you please share the correct file on which these text values are present? We will further proceed to assist you accordingly.

nathiya1 · July 9, 2021, 5:09am

Sorry for the confusion. For Case 3, I took the text from the first page. You can refer to the same file.

Case 3:

Sample Text: (taken from page 1)
“ The document appears to be corrupted and cannot be loaded. ( for pdf and MS Office files) - Handled if
we import from local and AMANDA attachments”

Rectangle Value : llx - 36, lly - 366, urx- 560, ury - 405

asad.ali · July 9, 2021, 6:53pm

@nathiya1

It seems that you have already redacted the text in the shared PDF by placing a rectangle over it. Nevertheless, we used your previously shared file and observed that the API did not extract the multiline text from it. Hence, an issue has been logged as PDFJAVA-40682 in our issue tracking system for investigation. We will look into its details and keep you posted on the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

asad.ali · August 23, 2021, 6:56pm

@nathiya1

We have investigated the earlier logged ticket. Please use the code snippet below:

import com.aspose.pdf.Document;
import com.aspose.pdf.TextFragmentAbsorber;

// Literal parentheses are escaped as \\( and \\); the line break between the two lines is matched by (\r\n).
TextFragmentAbsorber textFragmentAbsorberReplacement = new TextFragmentAbsorber(
" The document appears to be corrupted and cannot be loaded. \\( for pdf and MS Office files\\) - Handled if (\r\n)we import from local and AMANDA attachments");

// Treat the search phrase as a regular expression.
textFragmentAbsorberReplacement.getTextSearchOptions().setRegularExpressionUsed(true);

final Document document = new Document("Sample pdf file.pdf");
document.getPages().accept(textFragmentAbsorberReplacement);
// Print the number of text fragments found.
System.out.println(textFragmentAbsorberReplacement.getTextFragments().size());
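
Because the search text in this use case comes from TextAbsorber at run time, the escaping shown in the snippet above would need to be generated rather than typed by hand. A minimal sketch of one way to do that: backslash-escape the regex metacharacters on each line and join the lines with the same `(\r\n)` separator used in the reply. `MultilinePatternBuilder`, `escape` and `toPattern` are hypothetical helpers, and the check in `main` uses `java.util.regex` only; whether Aspose accepts the generated pattern for every case (for example the blank line in Case 2) was not confirmed in this thread.

```java
import java.util.regex.Pattern;

class MultilinePatternBuilder {
    // Characters that act as operators in common regex dialects.
    private static final String META = "\\.[]{}()*+?^$|";
    // Separator copied from the reply above: a group containing CR LF.
    private static final String LINE_BREAK = "(\r\n)";

    /** Prefixes every regex metacharacter with a backslash so it matches literally. */
    static String escape(String literal) {
        StringBuilder sb = new StringBuilder();
        for (char c : literal.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Escapes each line of the selected text and joins the lines with LINE_BREAK. */
    static String toPattern(String selectedText) {
        String[] lines = selectedText.split("\\R"); // split on any line break
        StringBuilder pattern = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                pattern.append(LINE_BREAK);
            }
            pattern.append(escape(lines[i]));
        }
        return pattern.toString();
    }

    public static void main(String[] args) {
        String selected = "( for pdf and MS Office files) - Handled if\n"
                + "we import from local and AMANDA attachments";
        String pageText = "loaded. ( for pdf and MS Office files) - Handled if\r\n"
                + "we import from local and AMANDA attachments";
        String pattern = toPattern(selected);
        // Checks the generated pattern against sample text with java.util.regex.
        System.out.println(Pattern.compile(pattern).matcher(pageText).find());
    }
}
```

The generated string would then be passed to the TextFragmentAbsorber constructor with the regular-expression option enabled, as in the snippet above.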