Search Text in PDF ignoring Case


#1

Hi Team,

I need to search for text in a PDF while ignoring case, but I cannot find a code snippet that does this. Could you please send me a link or a code snippet that achieves it? My current code is below.

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(searchKeyword);
//accept the absorber for all the pages
inputPdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

Thanks,
Navaneethan V

#2

Hi Navaneethan,


Thanks for contacting support.

To match the string in both lowercase and uppercase, please try the following regular expression.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)Line", new TextSearchOptions(true));

For further details, please see "Search and get Text from all pages using Regular Expression".

#3

Where should the search text go? Suppose I am searching for the text “testtext”. Can you please share the code?


#4

Hi Navaneethan,


If you need to search for the string testtext case-insensitively, please try the following regular expression.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)testtext", new TextSearchOptions(true));

In case you encounter any issue, please share your source/input PDF file, so that we can test the scenario in our environment.
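
As a rough illustration of what the (?i) prefix does, here is a minimal Java sketch that uses the standard java.util.regex package instead of Aspose.PDF. The class and method names (IgnoreCaseDemo, containsIgnoreCase) are made up for this example, and it assumes the search engine treats (?i) the same way java.util.regex does.

```java
import java.util.regex.Pattern;

public class IgnoreCaseDemo {
    // Returns true when the keyword occurs anywhere in the text, ignoring case.
    // The (?i) prefix switches on case-insensitive matching for the whole pattern.
    public static boolean containsIgnoreCase(String text, String keyword) {
        return Pattern.compile("(?i)" + keyword).matcher(text).find();
    }

    public static void main(String[] args) {
        // With (?i) the uppercase occurrence is found.
        System.out.println(containsIgnoreCase("Some TESTTEXT here", "testtext")); // true
        // Without (?i) the same search is case-sensitive and finds nothing.
        System.out.println(Pattern.compile("testtext").matcher("Some TESTTEXT here").find()); // false
    }
}
```

The same pattern string, "(?i)testtext", is what gets passed to TextFragmentAbsorber in the reply above.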

#5

Hi Team,

It is working fine, but I need help with one more thing. Along with ignoring case, the search also has to match whole words only. Please let me know what I need to change.

My current code snippet is below.

string searchTextValue = "(?i)" + searchKeyword + "";
Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(searchTextValue, new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true));
//accept the absorber for all the pages
inputPdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

#6

Hi Navaneethan,


Thanks for your feedback. It is good to know that the suggested solution worked for you.

However, I am afraid I do not understand your “whole word text searching” requirement. We would appreciate it if you could share a sample document and the text string to search for. We will look into it and guide you accordingly.

We are sorry for the inconvenience.

Best Regards,

#7

Hi Team,

Thanks for the response. My requirement is simple. Suppose I am searching for the word “the” in the PDF:

1. It should return all text fragments containing “the” or “THE” (case insensitivity).
2. It should not return text fragments such as “FATHER” or “father”, where “the” is only part of a longer word (whole-word searching).


Looking forward to your code snippet as soon as possible.

Thanks,
Navaneethan V

#8

Hi Navaneethan,


Thanks for the additional information. Please use the following regular expression to match an exact word case-insensitively. It should help you accomplish the task.

//create TextFragmentAbsorber object to find all the phrases matching the regular expression

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\bthe\b");

//set text search option to specify regular expression usage

TextSearchOptions textSearchOptions = new TextSearchOptions(true);

textFragmentAbsorber.TextSearchOptions = textSearchOptions;
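
To show what the \b word boundaries add on top of (?i), here is a small Java sketch using the standard java.util.regex package, not Aspose.PDF itself. The names WholeWordDemo and findWholeWord are hypothetical, and the sketch assumes the search engine handles \b the way java.util.regex does. Note that in Java source the backslash has to be doubled inside a string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordDemo {
    // Collects every case-insensitive, whole-word occurrence of the keyword.
    // \b matches only at a boundary between a word character and a non-word character.
    public static List<String> findWholeWord(String text, String keyword) {
        Matcher matcher = Pattern.compile("(?i)\\b" + keyword + "\\b").matcher(text);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "FATHER", "other" and "father" contain "the" but are skipped.
        System.out.println(findWholeWord("The FATHER saw the other father", "the")); // [The, the]
    }
}
```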

Please feel free to contact us for any further assistance.


Best Regards,


#9

Hi Team,

I am not getting the desired results. None of the searches work; every one returns zero results. The code I am using is below.

string searchKeyword = "";//search value goes here
Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\b"+searchKeyword+"\b");
//set text search option to specify regular expression usage
Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
//accept the absorber for all the pages
inputPdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

#10

It works when I hardcode the value. For example, if my search keyword is “the”:

The code below works fine:

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\bthe\b");

The code below does not work and returns no results:

string keyword = "the";
Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\b"+keyword+"\b");


#11

Hi Navaneethan,


Thanks for your feedback. If you want to pass the text as a variable, please use the following code snippet. Note the @ before the second string literal as well, so that "\b" is not read as an escape sequence.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\b" + keyword + @"\b");


Please feel free to contact us for any further assistance.

Best Regards,

#12

@tilal.ahmad
Hi!
I am stuck on a similar issue.

I am using this, but it gives an error that says “Syntax error on token "@", delete this token”.
I am working in Java.
Please help!


#13

@Kushal.20

Please try removing ‘@’ from the expression. If the issue still persists, please share your sample PDF document and code snippet, and we will help you further.
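
One detail worth keeping in mind when dropping the ‘@’: Java has no verbatim string literals, and "\b" in a Java string literal is the backspace character, just as it is in a regular (non-@) C# string. The regex word boundary therefore has to be written "\\b" in Java source. The sketch below uses only the standard java.util.regex package, not Aspose.PDF; the names EscapeDemo, broken and fixed are invented for illustration.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // "\b" is one character here: a backspace (U+0008), not a word boundary.
    public static String broken(String keyword) {
        return "(?i)\b" + keyword + "\b";
    }

    // "\\b" is two characters, a backslash and 'b', which the regex engine reads as a word boundary.
    public static String fixed(String keyword) {
        return "(?i)\\b" + keyword + "\\b";
    }

    public static void main(String[] args) {
        String text = "see THE end";
        System.out.println((int) broken("the").charAt(4));                      // 8 (backspace)
        System.out.println(Pattern.compile(broken("the")).matcher(text).find()); // false
        System.out.println(Pattern.compile(fixed("the")).matcher(text).find());  // true
    }
}
```

The broken pattern looks for literal backspace characters around the keyword, which would explain a search that silently returns zero results.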


#14

@asad.ali
I tried it, but it’s not working!
I am actually listing the documents that contain the searched string, along with the match count for each.
Here is my code:

String strFind = "Test";
int count = 0;
File[] files = new File("E:\\").listFiles();
for (File file : files) {
    if (file.isFile()) {
        String folderName = file.getParent();
        String fileName = file.getName();
        String extensionName = fileName.substring(fileName.lastIndexOf("."));

        if (extensionName.equals(".pdf")) {
            //System.out.println("Processing document: " + fileName);
            com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(file.getAbsolutePath());
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\b"+strFind+@"\b"); // like 1999-2000
            TextSearchOptions textSearchOptions = new TextSearchOptions(true);
            textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
            // Accept the absorber for first page of document
            pdfDoc.getPages().accept(textFragmentAbsorber);
            // Get the extracted text fragments into collection
            TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
            for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
                if (textFragment.getText() != "") {
                    count++;
                }
            }
            if (count > 0) {
                System.out.println("E:\\" + file.getName() + " || Count=" + count);
            }
            count = 0;
        }
    }
}

I want my search to be case INSENSITIVE. I know one way is to use “(?i)”,
but I want to pass the text as a variable.
Please help!


#15

@Kushal.20

We are looking into it and will get back to you shortly.


#16

@asad.ali
Okay, waiting for the response ! :slight_smile:


#17

@asad.ali
Any Progress ?


#18

@Kushal.20

We have tested the scenario in our environment using the following code snippet with one of our sample PDF documents and did not notice any issue. The API was able to find text based on regular expressions:

Document pdfDoc = new Document(dataDir + "sample.pdf");
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
String regex = "(?i)\\Small Demonstration\\b";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex, textSearchOptions);
pdfDoc.getPages().accept(textFragmentAbsorber);

TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
int count = 0;
for (TextFragment textFragment : textFragmentCollection) {
  count++;
  System.out.println(count + ". " + textFragment.getText());
}

Can you please share your sample PDF document with us, along with details of the text you want to extract? We will test the scenario further in our environment and address it accordingly.


#19

This is working.
But I actually wanted to use a variable instead of passing hard-coded text.
Also, the pattern "(?i)\\Small Demonstration\\b" is missing a ‘b’ after the first "\\", I guess!
Anyway, I have now got it working with the variable too:

String find = "(?i)\\b" + strFind + "\\b";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(find);

This is working for me now.
Anyway, thanks a lot for your co-operation! :slight_smile:


#20

@Kushal.20

Thank you for your kind feedback.

We are glad to know that things are now working in your environment. Please feel free to contact us if you need any further assistance.