Search Text in PDF ignoring Case

n.b.vijayakumaraccen · February 12, 2016, 1:17am

Hi Team,

I have a requirement to search for text in a PDF while ignoring case, but I am unable to find a code snippet that ignores case while searching. Please send me a link or code snippet that achieves this. My current code is below:

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(searchKeyword);

//accept the absorber for all the pages

inputPdfDocument.Pages.Accept(textFragmentAbsorber);

//get the extracted text fragments

Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

Thanks,

Navaneethan V

codewarior · February 12, 2016, 1:30am

Hi Navaneethan,

Thanks for contacting support.

To match the string in both lowercase and uppercase, please try the following regular expression.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)Line", new TextSearchOptions(true));

For further details, please visit Search and get Text from all pages using Regular Expression

n.b.vijayakumaraccen · February 12, 2016, 1:37am

Where should the search text go? Suppose I am searching for the text “testtext”. Can you please share the code?

codewarior · February 12, 2016, 6:28pm

Hi Navaneethan,

If you need to search for the string testtext case-insensitively, please try the following regular expression.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)testtext", new TextSearchOptions(true));
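
As a rough illustration of what the (?i) inline flag does, here is a minimal sketch that uses Java's standard java.util.regex package rather than Aspose.PDF itself. It assumes the absorber applies ordinary regular-expression semantics when TextSearchOptions(true) is passed. The class name CaseInsensitiveDemo, the helper countMatches, and the sample text are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaseInsensitiveDemo {
    // Counts the non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "testtext TESTTEXT TestText";
        // Without the flag, only the exact-case occurrence is found.
        System.out.println(countMatches("testtext", text));     // prints 1
        // (?i) at the start switches on case-insensitive matching.
        System.out.println(countMatches("(?i)testtext", text)); // prints 3
    }
}
```

The flag is part of the pattern string, so it works the same way whether the pattern is a literal or built at run time.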

In case you encounter any issue, please share your source/input PDF file, so that we can test the scenario in our environment.

n.b.vijayakumaraccen · February 14, 2016, 11:05pm

Hi Team,

It is working fine. I need one more piece of help: along with ignoring case, I have to match only whole words while searching. Please let me know what I need to change.

Please find the code snippet below:

string searchTextValue = "(?i)" + searchKeyword + "";

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(searchTextValue, new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true));

//accept the absorber for all the pages

inputPdfDocument.Pages.Accept(textFragmentAbsorber);

//get the extracted text fragments

Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

tilal.ahmad · February 15, 2016, 10:36pm

Hi Navaneethan,

Thanks for your feedback. It is good to know that the suggested solution worked for you.

Furthermore, I am afraid I do not understand your “whole word text searching” requirement. We would appreciate it if you could share a sample document and the text string to search for; we will look into it and guide you accordingly.

We are sorry for the inconvenience.

Best Regards,

n.b.vijayakumaraccen · February 15, 2016, 11:05pm

Hi Team,

Thanks for the response. My requirement is simple. Suppose I am searching for the word “the” in the PDF:

1. It should return all text fragments with “the” or “THE” - Case insensitivity

2. It should not return text fragments with “FATHER” or “father” where “the” is part of the given text - whole word searching.

Looking forward to your code snippet as soon as possible.

Thanks,

Navaneethan V

tilal.ahmad · February 15, 2016, 11:52pm

Hi Navaneethan,

Thanks for the additional information. Please use the following regular expression to match the exact word case-insensitively; it will help you accomplish the task.

//create TextFragmentAbsorber object to find all the phrases matching the regular expression

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\bthe\b");

//set text search option to specify regular expression usage

TextSearchOptions textSearchOptions = new TextSearchOptions(true);

textFragmentAbsorber.TextSearchOptions = textSearchOptions;
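
To show what the \b word-boundary anchors change, here is a small sketch in plain Java (java.util.regex), not Aspose.PDF code. It assumes the absorber follows the usual regex meaning of \b. The class name WholeWordDemo, the helper findAll, and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordDemo {
    // Returns every non-overlapping match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The father saw the OTHER bird, THE end.";
        // Without \b, "the" is also found inside "father" and "OTHER".
        System.out.println(findAll("(?i)the", text));       // [The, the, the, THE, THE]
        // With \b on both sides, only standalone words match.
        // Java has no verbatim strings, so the backslash is doubled here.
        System.out.println(findAll("(?i)\\bthe\\b", text)); // [The, the, THE]
    }
}
```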

Please feel free to contact us for any further assistance.

Best Regards,

n.b.vijayakumaraccen · February 16, 2016, 12:14am

Hi Team,

I am not getting the desired results. None of the searches work; every one returns zero results. The code I am using is below:

string searchKeyword = "";//search value goes here

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\b"+searchKeyword+"\b");

//set text search option to specify regular expression usage

Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);

textFragmentAbsorber.TextSearchOptions = textSearchOptions;

//accept the absorber for all the pages

inputPdfDocument.Pages.Accept(textFragmentAbsorber);

//get the extracted text fragments

Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

n.b.vijayakumaraccen · February 16, 2016, 12:24am

It works when I hardcode the value. For example,

if my search keyword is “the”,

the code below works fine:

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\bthe\b");

The code below does not work and gives no results:

string keyword = "the";

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\b"+keyword+"\b");

tilal.ahmad · February 16, 2016, 10:45am

Hi Navaneethan,

Thanks for your feedback. If you want to pass the text as a variable, please use the following code snippet; it will help you accomplish the task.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\b" + keyword + @"\b");
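
The likely reason the earlier concatenation returned nothing: in C#, only a literal prefixed with @ keeps its backslashes, so the trailing "\b" without the prefix is a backspace character (code 8), not a regex word boundary. Java string literals treat "\b" the same way, which the sketch below demonstrates with java.util.regex. It is an illustration only; the class name EscapeDemo and its helpers are hypothetical, and it assumes the PDF text contains no backspace characters.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Mirrors the failing snippet: the trailing "\b" is a backspace character.
    static String brokenPattern(String keyword) {
        return "(?i)\\b" + keyword + "\b";
    }

    // Escapes the backslash so the regex engine sees the two characters \ and b.
    static String fixedPattern(String keyword) {
        return "(?i)\\b" + keyword + "\\b";
    }

    static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println((int) "\b".charAt(0)); // prints 8 (backspace)
        System.out.println("\\b".length());       // prints 2 (backslash + b)
        // The broken pattern requires a literal backspace after "the".
        System.out.println(containsMatch(brokenPattern("the"), "the end")); // prints false
        System.out.println(containsMatch(fixedPattern("the"), "the end"));  // prints true
    }
}
```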

Please feel free to contact us for any further assistance.

Best Regards,

Kushal.20 · June 11, 2019, 7:08am

@tilal.ahmad
Hi!
I am stuck on a similar issue.

I am using this, but it gives an error: “Syntax error on token "@", delete this token”.
I am working in Java.
Please help!

asad.ali · June 11, 2019, 3:15pm

@Kushal.20

Please try removing the ‘@’ from the expression. If the issue still persists, please share your sample PDF document and code snippet, and we will proceed to help you accordingly.
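
One caveat for Java: the language has no @ verbatim-string prefix, so after removing it each regex backslash has to be doubled ("\\b"). Below is a minimal sketch of building the pattern from a variable. It uses only java.util.regex, and it is an assumption, not something confirmed in this thread, that Aspose.PDF for Java accepts the \Q...\E quoting that Pattern.quote produces. The class name KeywordPatternDemo and its helpers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordPatternDemo {
    // Builds a case-insensitive, whole-word regex for an arbitrary keyword.
    // Pattern.quote makes regex metacharacters in the keyword match literally.
    static String wholeWordRegex(String keyword) {
        return "(?i)\\b" + Pattern.quote(keyword) + "\\b";
    }

    // Counts the non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "version 3.5 and 345";
        // Unquoted, the dot matches any character, so "345" is found too.
        System.out.println(countMatches("(?i)\\b3.5\\b", text));       // prints 2
        // Quoted, only the literal "3.5" is found.
        System.out.println(countMatches(wholeWordRegex("3.5"), text)); // prints 1
    }
}
```

Note that \b only behaves as a whole-word check when the keyword itself starts and ends with a letter, digit, or underscore.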

Kushal.20 · June 12, 2019, 6:12am

@asad.ali
I tried it, but it’s not working!
I am actually listing the number of docs containing the string that is searched.
Here is my code :

String strFind = "Test";
int count = 0;
File[] files = new File("E:\\").listFiles();
for (File file : files) {
    if (file.isFile()) {
        String folderName = file.getParent();
        String fileName = file.getName();
        String extensionName = fileName.substring(fileName.lastIndexOf("."));
        if (extensionName.equals(".pdf")) {
            //System.out.println("Processing document: " + fileName);
            com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(file.getAbsolutePath());
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\b" + strFind + @"\b"); // like 1999-2000
            TextSearchOptions textSearchOptions = new TextSearchOptions(true);
            textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
            // Accept the absorber for all pages of the document
            pdfDoc.getPages().accept(textFragmentAbsorber);
            // Get the extracted text fragments into collection
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
            for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
                // Compare string contents, not references
                if (!textFragment.getText().isEmpty()) {
                    count++;
                }
            }
            if (count > 0) {
                System.out.println("E:\\" + file.getName() + " || Count=" + count);
            }
            count = 0;
        }
    }
}

I want my search to be case-insensitive. I know one way is to use “?i”,
but I want to pass the text as a variable.
Please help!

asad.ali · June 12, 2019, 5:12pm

@Kushal.20

We are looking into it and will get back to you shortly.

Kushal.20 · June 13, 2019, 7:07am

@asad.ali
Okay, waiting for the response !

Kushal.20 · June 14, 2019, 9:51am

@asad.ali
Any Progress ?

asad.ali · June 14, 2019, 7:21pm

@Kushal.20

We tested the scenario in our environment using the following code snippet with one of our sample PDF documents and did not notice any issue. The API was able to find text based on regular expressions:

Document pdfDoc = new Document(dataDir + "sample.pdf");
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
String regex = "(?i)\\Small Demonstration\\b";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex, textSearchOptions);
pdfDoc.getPages().accept(textFragmentAbsorber);

TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
int count = 0;
for (TextFragment textFragment : textFragmentCollection) {
  count++;
  System.out.println(count + ". " + textFragment.getText());
}

Can you please share your sample PDF document with us, along with the text you want to extract? We will further test the scenario in our environment and address it accordingly.

Kushal.20 · June 17, 2019, 6:41am

This is working.
But I actually wanted this to take a variable instead of passing hard-coded text.
Also, (?i)\Small Demonstration\b is missing a ‘b’, I guess!
Well, I have done it successfully with the variable too:

String find = "(?i)\\b" + strFind + "\\b";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(find);
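
A side note on the pattern with the missing ‘b’: in a Java string literal, "\\S" becomes the regex token \S, which matches any single non-whitespace character. That pattern therefore found "Small Demonstration" only because "S" is itself a non-whitespace character, and it would not enforce a leading word boundary. The sketch below shows the difference with java.util.regex; the class name NonWhitespaceDemo and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

final class NonWhitespaceDemo {
    // True if regex matches anywhere inside text.
    static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String typo = "(?i)\\Small Demonstration\\b";      // \S + "mall Demonstration"
        String intended = "(?i)\\bSmall Demonstration\\b"; // whole phrase, word-bounded

        System.out.println(containsMatch(typo, "Small Demonstration"));         // prints true
        // \S accepts any non-whitespace character in place of the "S".
        System.out.println(containsMatch(typo, "xmall Demonstration"));         // prints true
        System.out.println(containsMatch(intended, "xmall Demonstration"));     // prints false
        // The intended pattern also rejects the phrase glued to another word.
        System.out.println(containsMatch(intended, "VerySmall Demonstration")); // prints false
    }
}
```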

I have got this working for me !
Anyways, Thanks a lot for your co-operation !

Farhan.Raza · June 17, 2019, 5:22pm

@Kushal.20

Thank you for your kind feedback.

We are glad to know that things are now working in your environment. Please feel free to contact us if you need any further assistance.