Not able to Find text from pdf file in Java - Aspose.PDF for Java - Text contains line break

rabinintig · July 31, 2020, 6:32am

I am trying to find text in a PDF file, but the search fails for one string.
The text I am searching for is “2c9f80f373a331b80173a338a0db0007_SIGNATURE”.
I am able to find “2c9f80f373a331b80173a338a0db0007_TITLE”, but when part of the text wraps onto the next line, as it does with “2c9f80f373a331b80173a338a0db0007_SIGNATURE”, the code is not able to find it.
The “2c9f80f373a331b80173a338a0db0007_SIGNATURE” text is also hidden in my PDF file.
I have attached the code, the PDF, and the converted DOCX file for reference.
PFA
Test.zip (60.3 KB)

rabinintig · August 1, 2020, 5:29am

@asad.ali
Can you please help me with this?

asad.ali · August 3, 2020, 9:09am

@rabinintig

The text inside your PDF is stored as:

“2c9f80f373a331b80173a338a0db0007_SIGNATU
RE”

There is a line break in the text. Please use the following code snippet to find it:

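import com.aspose.pdf.*;

// dataDir is the folder that contains the PDF.
Document pdfDocument = new Document(dataDir + "newtest.pdf");
// \\s* tolerates the line break between "SIGNATU" and "RE"; \\b ends the match at a word boundary.
TextFragmentAbsorber absorber = new TextFragmentAbsorber("2c9f80f373a331b80173a338a0db0007_SIGNATU\\s*RE\\b");
// Passing true makes the absorber treat the search phrase as a regular expression.
absorber.setTextSearchOptions(new TextSearchOptions(true));
pdfDocument.getPages().accept(absorber);
TextFragmentCollection textFragmentCollection = absorber.getTextFragments();
// Print the number of matches found.
System.out.println(textFragmentCollection.size());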

rabinintig · August 5, 2020, 6:08am

Thanks @asad.ali
But I cannot use the “TextFragmentAbsorber absorber = new TextFragmentAbsorber("2c9f80f373a331b80173a338a0db0007_SIGNATU\\s*RE\\b");” code in my application, because the “2c9f80f373a331b80173a338a0db0007_SIGNATURE” value is dynamic and differs every time in my system.

Basically, I am adding some HTML data to the PDF file. But when I add the data, the “2c9f80f373a331b80173a338a0db0007_SIGNATURE” string breaks across two lines.
When it breaks across two lines, I am not able to find my string in the PDF file.
I have attached my whole code for your reference.
Please remember that all of the data is dynamic, so please do not suggest a solution like the one above.
PFA
Test (2).zip (125.5 KB)

asad.ali · August 5, 2020, 4:58pm

@rabinintig

If the target text spans two lines, we are afraid that the suggested approach is the only way to find, extract, or replace it. Text can only be searched in the form in which it was added to the PDF.
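
Since the search value is dynamic, a whitespace-tolerant pattern like the one in the earlier snippet could be generated at runtime instead of being written by hand. The following is only a minimal sketch and not part of the Aspose API: the class name `LineBreakTolerantPattern` and its `build` method are hypothetical, and it assumes the search value contains no characters outside the Basic Multilingual Plane. It uses only `java.util.regex`; the resulting string would then be passed to `TextFragmentAbsorber` together with `new TextSearchOptions(true)`, as in the snippet above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an Aspose class.
final class LineBreakTolerantPattern {

    // Builds a regex that matches `literal` even if whitespace (for example a
    // line break) was inserted between any two of its characters.
    static String build(String literal) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            if (i > 0) {
                regex.append("\\s*"); // optional whitespace between characters
            }
            char c = literal.charAt(i);
            // Escape anything that is not a letter, digit, or underscore so
            // that regex metacharacters in the value are matched literally.
            if (!Character.isLetterOrDigit(c) && c != '_') {
                regex.append('\\');
            }
            regex.append(c);
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String placeholder = "2c9f80f373a331b80173a338a0db0007_SIGNATURE";
        String regex = build(placeholder);

        // Simulated page text containing the line break reported in this thread.
        String pageText = "2c9f80f373a331b80173a338a0db0007_SIGNATU\nRE";
        System.out.println(Pattern.compile(regex).matcher(pageText).matches()); // prints true

        // Assumed usage with Aspose.PDF (not executed here):
        // TextFragmentAbsorber absorber = new TextFragmentAbsorber(regex);
        // absorber.setTextSearchOptions(new TextSearchOptions(true));
    }
}
```

Allowing `\s*` between every pair of characters means the match does not depend on where the renderer breaks the line. The trade-off is that the pattern also matches the same characters with spaces between them, so a trailing `\\b` can be appended if a word boundary is needed.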

As another workaround, you can try reducing the page margins of the output PDF (these are set at the time of the HTML-to-PDF conversion). With narrower margins the text should be rendered on a single line, and you will be able to find it from your code.

rabinintig · August 6, 2020, 3:38am

@asad.ali
Is there any other approach to find this kind of text inside a PDF?

rabinintig · August 6, 2020, 4:11am

@asad.ali
Also, if you check the “inputHtml.txt” file, the “2c9f80f373bcc6d80173bcf34c870011_SIGNATURE” value contains no space. So why does a space appear when I add it to the PDF? I am using only Aspose code to add the HTML.
So why am I not able to find the same data inside the PDF using Aspose code?
Please help me with this.

asad.ali · August 6, 2020, 8:26pm

@rabinintig

Would you kindly share the code snippet that you are using to convert your HTML into PDF? We will then investigate further and assist you accordingly.