Search text issue

Hi Aspose,


I’m using Aspose PDF for Java 17.2.0. I have an issue searching for text. For example:

public static void main(String[] args) throws IOException {
    System.out.println("Start");
    // Backslashes in a Java string literal must be escaped.
    try (InputStream in = new FileInputStream("D:\\tmp\\Tuyen\\aspose\\pdf-sample+copy.pdf")) {
        Document document = new Document(in);
        TextFragmentAbsorber absorber = new TextFragmentAbsorber("Document Format (PDF)");
        TextSearchOptions searchOption = new TextSearchOptions(true);
        absorber.setTextSearchOptions(searchOption);
        Page firstPage = document.getPages().get_Item(1);
        firstPage.accept(absorber);
        System.out.println("Num of found text: " + absorber.getTextFragments().size());
    }
}

The problem is that absorber.getTextFragments().size() returns 0, although “Document Format (PDF)” clearly appears on the first page of the attached document. I do get 1 text fragment if I search for “Document” only. Can you let me know how to search for “Document Format (PDF)” on page one using similar source code? Thanks.

Best Regards,
Tuyen

Hi Tuyen,

Thanks for contacting support.

The problem is that TextSearchOptions was constructed with true, which enables regular-expression matching. The API then treats the search string as a regular expression, and in regular-expression syntax the parentheses around “PDF” form a group rather than matching literal parenthesis characters. That is why the text fragment count is zero. Use TextSearchOptions(false) to tell the API to match the string exactly as given. Please check the following code snippet, in particular the TextSearchOptions(false) line.

try (InputStream in = new FileInputStream("D:\\tmp\\Tuyen\\aspose\\pdf-sample+copy.pdf")) {
    Document document = new Document(in);
    TextFragmentAbsorber absorber = new TextFragmentAbsorber("Document Format (PDF)");
    // false: match the search string literally, not as a regular expression
    TextSearchOptions searchOption = new TextSearchOptions(false);
    absorber.setTextSearchOptions(searchOption);
    Page firstPage = document.getPages().get_Item(1);
    firstPage.accept(absorber);
    System.out.println("Num of found text: " + absorber.getTextFragments().size());
}
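
To illustrate why the parentheses matter, here is a minimal sketch that uses only the standard java.util.regex package, not Aspose.PDF. It assumes the library's regular-expression mode treats parentheses as java.util.regex does. The class name, method names, and sample page text are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

public class RegexLiteralDemo {
    // Returns true if `query`, compiled as a regular expression, occurs in `text`.
    static boolean foundAsRegex(String text, String query) {
        return Pattern.compile(query).matcher(text).find();
    }

    // Returns true if `query`, taken as a literal string, occurs in `text`.
    static boolean foundAsLiteral(String text, String query) {
        return Pattern.compile(Pattern.quote(query)).matcher(text).find();
    }

    public static void main(String[] args) {
        // Hypothetical page text, standing in for the real PDF content.
        String pageText = "Portable Document Format (PDF)";
        String query = "Document Format (PDF)";

        // As a regex, "(PDF)" is a group matching "PDF" without parentheses.
        System.out.println(foundAsRegex(pageText, query));
        // As a literal, the parentheses are matched character by character.
        System.out.println(foundAsLiteral(pageText, query));
        // Escaping the parentheses makes the regex match them literally.
        System.out.println(foundAsRegex(pageText, "Document Format \\(PDF\\)"));
    }
}
```

If a regular-expression search is really needed, escaping the parentheses as in the last call is the usual approach; whether Aspose.PDF accepts the same escaping should be checked against its documentation.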

In case of any further assistance, please feel free to contact us.

Best Regards,

Hi Asad,


Thanks for your response. Unfortunately that does not help in my case.

Can you find a way to search for the text “www.groupe- t2i .com” on page 1 of the attached document? Thank you.

Best Regards,
Tuyen

Hi Tuyen,


Thanks for your inquiry. I have tested the scenario and managed to reproduce the issue: TextFragmentAbsorber does not find the text “www.groupe- t2i .com” in the PDF file you provided. To have it corrected, I have logged ticket PDFJAVA-36706 in our issue tracking system. We will look into it in detail and keep you updated on the status of its resolution in this forum thread. Please be patient and allow us a little time.

We are sorry for this inconvenience.

Best Regards,


The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan

@vutuyen2636

There are no spaces or other invisible characters in the URL on the page.

Adobe Acrobat actually finds the text with spaces: ‘www.groupe- t2i .com’. But we could not find any reason for this in the document. Aspose.PDF finds this text as ‘www.groupe-t2i.com’.

Please consider the following code with Aspose.PDF for Java 20.6:

InputStream in = new FileInputStream(dataDir + "OnePage.pdf");
Document document = new Document(in); 
TextFragmentAbsorber absorber = new TextFragmentAbsorber("www.groupe-t2i.com"); 
TextSearchOptions searchOption = new TextSearchOptions(true); //false value also works correctly. 
absorber.setTextSearchOptions(searchOption); 
Page firstPage = document.getPages().get_Item(1); 
firstPage.accept(absorber); 
System.out.println("Num of found text: " + absorber.getTextFragments().size());
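
If the search string was copied from Adobe Acrobat and therefore contains the extra spaces, one option is to strip whitespace from it before passing it to TextFragmentAbsorber. The following is only a sketch using standard Java string methods; the class and method names are hypothetical, and it assumes the extra characters are ordinary whitespace as matched by `\s`.

```java
public class WhitespaceNormalizeDemo {
    // Removes every whitespace character matched by \s (space, tab, line breaks) from s.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // Text as copied from Adobe Acrobat, with the extra spaces.
        String copied = "www.groupe- t2i .com";
        // Normalized form, suitable as the search string.
        System.out.println(stripWhitespace(copied));
    }
}
```

This only suits search strings such as URLs that contain no meaningful spaces; for ordinary phrases it would also remove the spaces between words.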