Get total number of documents based on a particular phrase/word

awais.hafeez · June 11, 2019, 4:08am

Both Aspose.Words for Java and Aspose.PDF for Java APIs have Document classes. To avoid any conflicts, please create Document instances like this:

com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document();
com.aspose.words.Document wordDoc = new com.aspose.words.Document();

Hope this helps.

Kushal.20 · June 11, 2019, 7:19am

Yes, I had also thought of doing it this way.
Thanks, it is working now.
Thanks, @awais.hafeez!

I am now working further on this feature and will try to add some more functionality. If I get stuck anywhere, I’ll write to you!
Thanks once again for the support!

Kushal.20 · June 14, 2019, 6:31am

@awais.hafeez, hi!
I am working on the same feature that we discussed above. I can now get the list of documents that contain the searched keyword.
The next question is: what if some of the files are password protected?

I know how to open protected files, and I have even applied that in my search code. But not all files will have the same password, so we cannot supply a password for every file while searching, right?
I used the following code (for Word, say):

String strFind = "Test";
int count =0;
File[] files = new File("E:\\docs").listFiles();
for (File file : files) {
    if (file.isFile()) {
        String folderName = file.getParent();
        String fileName = file.getName();
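        // Guard against names without a dot and compare extensions in lower case.
        String extensionName = fileName.contains(".") ? fileName.substring(fileName.lastIndexOf(".")).toLowerCase() : "";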
        if (extensionName.equals(".doc") || extensionName.equals(".docx")) {
            //System.out.println("Processing document: " + fileName);
            String pass = "12345";
            FileFormatInfo fft = FileFormatUtil.detectFileFormat(file.getAbsolutePath());

            LoadOptions loadOps = new LoadOptions();
            loadOps.setPassword(pass);
            com.aspose.words.Document wordDoc = new
                    com.aspose.words.Document(file.getAbsolutePath(), loadOps);
            System.out.println("Opened Successfully with the Password:" + pass);


            FindReplaceOptions options = new FindReplaceOptions();
            ReplaceEvaluator callback = new ReplaceEvaluator();
            options.setReplacingCallback(callback);

            // Count case-insensitive matches; the ReplaceEvaluator callback skips each replacement.
            Pattern regex = Pattern.compile(strFind, Pattern.CASE_INSENSITIVE);
            wordDoc.getRange().replace(regex, strFind, options);
            int countWord = callback.mMatchNumber;
            if (countWord > 0) {
                System.out.println(file.getAbsolutePath() + " || Count=" + countWord);
            } else {
                System.out.println(fileName + " does not contain '" + regex + "'");
            }

        }
    }
}

But here only the password 12345 is tried. Consider a scenario where I have hundreds or even thousands of files, and nobody knows how many of them are protected. All I have is the option to search for a keyword and get the list of documents containing it. So what should I do in that case?

Is there any way for protected files to be read and scanned for that word, and listed if they contain it, just like normal files, without supplying any password?

I hope you understand my concern.
Thank you!

awais.hafeez · June 14, 2019, 11:39am

@Kushal.20,

Please ZIP and upload your sample password-protected Word document (along with the password string) here for testing. We will then investigate the scenario on our end and provide you with more information.

Kushal.20 · June 14, 2019, 12:43pm

@awais.hafeez
I hope you got my point.
I am not talking about a single file. There could be any number of files in the directory.
What I am doing now is getting the list of all the documents in my directory that contain the searched keyword. For now, the protected files are not searched and I just get an exception (invalid password). What I want is for the protected files to be scanned too, and returned in the result if they meet the criteria.
To achieve this, I tried the code that I already shared.

In that code, I supplied the one password that I knew, which belongs to one of the files.
But the files won’t all have the same password, right? Nor would I know the passwords for any of the files in the user’s directory (so I can’t even pass a password, as I did in the code above).
So the requirement is that password-protected files can be scanned without supplying passwords, since we cannot supply passwords for X number of files.

Anyway, as you requested, I am attaching the zip file, for which the password is 12345, though I don’t think it is required. Test Docx Protected.zip (14.4 KB)

I hope it is clearer now.
Waiting for a positive outcome.
Thanks for your co-operation!

Adnan.Ahmad · June 14, 2019, 11:13pm

@Kushal.20,

We are investigating this and will get back to you with feedback soon.

awais.hafeez · June 15, 2019, 5:13am

@Kushal.20,

Your document has no protection, but it is ‘encrypted’ with a password; that is why Microsoft Word asks for a password before opening it. You should simply specify the password when opening the encrypted document with Aspose.Words. If you do not know the password, I am afraid you will not be able to open or scan this document with Aspose.Words. Here is how you can load this document into the Aspose.Words DOM (document object model):

com.aspose.words.LoadOptions opts = new com.aspose.words.LoadOptions();
opts.setLoadFormat(LoadFormat.DOCX);
opts.setPassword("12345");
Document doc = new Document("E:\\temp\\Test Docx Protected\\Test Docx Protected.docx", opts);

Kushal.20 · June 17, 2019, 7:26am

Dear @awais.hafeez,
Thanks for the support!
But please read my last two queries once again. My issue is slightly different from what your response addresses.
I know my file is password protected, and I have already worked on opening it with a password.

This is my basic requirement right now.

I am restating it for your convenience.
I have written code that lists all the documents in my directory that contain the word I search for.
With this, I get all the files containing the searched word, but the password-protected files are not scanned.

To scan them, I need to pass a password.
But as a user of the application, how would I know it? I can simply type a word and search for all the documents where the word exists.
There could be any number of protected files in the directory, right?
They cannot all be expected to have the same password, so passing a single password won’t help either.

So my result list cannot include any of the protected files, even if they contain the word that I searched for!

I hope my third attempt at writing the query makes the point clear.
Waiting for your response.
Thanks!

awais.hafeez · June 17, 2019, 4:20pm

@Kushal.20,

As stated earlier, you will not be able to open or scan such password-protected (encrypted) Word documents with Aspose.Words unless you know the password. You can either process such files by providing the correct password or skip them by catching IncorrectPasswordException (i.e. use try-catch blocks).

try {
    Document doc = new Document("E:\\temp\\Test Docx Protected\\Test Docx Protected.docx");
} catch (com.aspose.words.IncorrectPasswordException ex) {
    System.out.println("As we do not know the password, Skip it.");
}
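The skip-on-failure approach above can be extended to a whole directory. The following is only a minimal sketch that uses the plain JDK: `TextLoader` and `LockedFileException` are hypothetical stand-ins (they are not Aspose types) for "load a document" and "the password is unknown", so that the loop can be shown without the library. The idea is to collect the skipped files and report them alongside the matches instead of silently dropping them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a batch scan that skips files it cannot open instead of aborting. */
final class SkippingScanner {

    /** Hypothetical stand-in for a library's "incorrect password" exception. */
    static final class LockedFileException extends Exception {
        LockedFileException(String message) {
            super(message);
        }
    }

    /** Hypothetical loader: returns a file's text, or throws if it is encrypted. */
    interface TextLoader {
        String load(String path) throws LockedFileException;
    }

    /** Outcome of a scan: files that matched and files that had to be skipped. */
    static final class Result {
        final List<String> matches = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    static Result scan(List<String> paths, TextLoader loader, String keyword) {
        Result result = new Result();
        for (String path : paths) {
            try {
                if (loader.load(path).contains(keyword)) {
                    result.matches.add(path);
                }
            } catch (LockedFileException ex) {
                // Password unknown: record the file and carry on with the rest.
                result.skipped.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake loader for demonstration only; a real one would open the document.
        TextLoader fake = path -> {
            if (path.endsWith("locked.docx")) {
                throw new LockedFileException("password required: " + path);
            }
            return "this is a Test document";
        };
        Result r = scan(Arrays.asList("a.docx", "locked.docx"), fake, "Test");
        System.out.println("Matches: " + r.matches);
        System.out.println("Skipped (password unknown): " + r.skipped);
    }
}
```

Showing the skipped list to the user at least makes it visible which documents were not searched, so passwords can be supplied for those individually if they are known.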

Kushal.20 · June 18, 2019, 7:07am

@awais.hafeez
That’s quite sad.
I want to scan all the documents in my directory for a word, but it seems I cannot deal with the protected files. Is there any suggestion, or any feature planned for an upcoming version, regarding this? I hope you understand that this excludes my protected documents from being scanned or considered at all.
Anyway, thanks a lot!

Kushal.20 · June 18, 2019, 7:12am

@awais.hafeez
On the same topic, that is, getting the list of documents that contain a word the user searched for: I am done with Word and Cells, and am now working on Slides (PPT).
I am almost there, but am facing a slight issue.

I think I am not getting the correct count for the words that I am searching for.
Here is my code:

public static void main(String[]args) throws Exception {

	com.aspose.slides.License license = new com.aspose.slides.License();
	license.setLicense("xyz");
	
	
	File[] files = new File("E:\\docs").listFiles();
	for (File file : files) {
		int count = 0;
	    if (file.isFile()) {
	        String fileName =  file.getName();
	        String extensionName = fileName.contains(".") ? fileName.substring(fileName.lastIndexOf(".")).toLowerCase() : "";
	        if (extensionName.equals(".ppt")  || extensionName.equals(".pptx")) 
	        {
				Presentation presentation = new Presentation(file.getAbsolutePath());
				presentation.joinPortionsWithSameFormatting();
				ITextFrame[] tb = SlideUtil.getAllTextFrames(presentation, true);
				String find = "sample";
				String strToFind= "(?i)\\b"+find+"\\b";
				for (int i = 0; i < tb.length; i++)
				{
					for (IParagraph ipParagraph : tb[i].getParagraphs()) 
					{	
						for (IPortion iPortion : ipParagraph.getPortions()) 
						{	System.out.println(iPortion.getText());
							if(iPortion.getText().contains(find))
							{ 	
								count++;
								//iPortion.setText(iPortion.getText().replaceAll(strToFind, strToReplace));
								//System.out.println("replaced");
							}
						}
					}
				}
				if(count > 0) {
				System.out.println("E:\\"+file.getName()+" || Count="+count);
				}
				
			} 
		}
	}
}

I believe the count for the word ‘sample’ should be 3, but I am getting 2.
Here is the file I am testing with: samplepptx.zip (395.0 KB)

Please look into it and help me resolve this. Thanks!

mudassir.fayyaz · June 18, 2019, 11:42am

@Kushal.20,

I have worked with the sample code you shared. One of your text portions contains multiple instances of the search string (sample). However, you are checking the text with the contains() method, which only reports whether the string occurs at least once, so such a portion is counted only once. I suggest making the following modification to your sample code.

for (IPortion iPortion : ipParagraph.getPortions())
{
    String text = iPortion.getText();
    System.out.println(text);
    int fromIndex = 0;
    // Count every occurrence in this portion, not just the first one.
    while ((fromIndex = text.indexOf(find, fromIndex)) != -1)
    {
        System.out.println("Found at index: " + fromIndex);
        count++;
        fromIndex++;
    }
}

I hope the shared information will be helpful.
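Note that indexOf() is case-sensitive and also matches inside longer words, whereas the original code builds (but never uses) the pattern "(?i)\\b" + find + "\\b". As a minimal JDK-only sketch of counting with such a pattern, the helper below (the class and method names are made up for illustration) counts case-insensitive whole-word matches; its input would be the string returned by iPortion.getText().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: count case-insensitive whole-word occurrences with java.util.regex. */
final class OccurrenceCounter {

    static int countWholeWord(String text, String word) {
        // Pattern.quote keeps characters such as '.' or '(' in the word literal.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        // Each successful find() advances past the previous match.
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Sample slide with a sample chart and more samples";
        // "Sample" and "sample" match; "samples" does not (no word boundary after "sample").
        System.out.println("Count = " + countWholeWord(text, "sample")); // prints "Count = 2"
    }
}
```

Whether whole-word or substring matching is the right choice depends on what the search feature is supposed to return; the indexOf() loop above counts substrings.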

Kushal.20 · June 19, 2019, 6:56am

@mudassir.fayyaz
It worked, thanks! I am still working on it, and if I face any other issue I will let you know.
Thanks a ton for the support!

Kushal.20 · June 19, 2019, 9:56am

@awais.hafeez
I have now completed ‘searching for a word and getting the list of documents containing it’ for all the file formats, viz. doc, ppt, pdf and xlsx (though I am still struggling with an issue in the ppt format).
My next step is multilingual search: if I search for characters in another language, say Japanese or Chinese, I should similarly get the documents containing them.
So, my first question: is it possible?
If yes, do I need to change the approach I have been following for all my formats, or can I carry on as before?

awais.hafeez · June 20, 2019, 4:09am

@Kushal.20,

Yes, it is possible. There should not be any problem when searching for non-English characters in a Word document with Aspose.Words. In case you have further inquiries or need any help, please let us know.

Kushal.20 · June 20, 2019, 9:14am

@awais.hafeez, actually I am talking about all the formats that I mentioned above. Is it good to go?
I tried some Chinese and Japanese characters; they were detected and it worked fine, though it was a very basic test.
I am working on this and will get back to you if I get stuck somewhere.
Thanks for the information!

Kushal.20 · June 20, 2019, 9:58am

@awais.hafeez
I am testing a sample Japanese document. For now, I am getting results for the Japanese characters that I search for. But, to my surprise, the English words in it aren’t detected.
I am attaching the test document: ActaMeteorSinica_WordTemplate.zip (58.2 KB)

Here is the code that I have written:

public class AsposeWord_FindDocuments {

public static void main(String[]args) throws Exception {
	
	File[] files = new File("E:\\docs").listFiles();

	for (File file : files) {
	    if (file.isFile()) {
	        String folderName = file.getParent();
	        String fileName =  file.getName();
	        String extensionName = fileName.contains(".") ? fileName.substring(fileName.lastIndexOf(".")).toLowerCase() : "";

	        if (extensionName.equals(".doc") || extensionName.equals(".docx")) {
	            System.out.println("Processing document: " + fileName);
	            Document doc = new Document(file.getAbsolutePath());
	            FindReplaceOptions options = new FindReplaceOptions();
	            ReplaceEvaluator callback = new ReplaceEvaluator();
	    		options.setReplacingCallback(callback);
	    		
	    		// Count case-insensitive matches; the ReplaceEvaluator callback below skips each replacement.
	    		Pattern regex = Pattern.compile("Drange", Pattern.CASE_INSENSITIVE);
	    		doc.getRange().replace(regex, "Drange", options);
	    		int count = callback.mMatchNumber;
	    		if(count > 0) {
	    			System.out.println(file.getAbsolutePath()+" || Count="+count);
	    		}
	    		else {
	    			System.out.println(fileName+" does not contain '" +regex+ "'");
	    		}
	    		// Save the output document.
	    		//doc.save("E:\\"+file.getName()+"_TestFile_out.doc");
	        }
	    }

	} 
}

}

class ReplaceEvaluator implements IReplacingCallback {

public int mMatchNumber;

public int replacing(ReplacingArgs e) throws Exception {
	  mMatchNumber++;
	  return ReplaceAction.SKIP;
	    
	
}

}

If I search for ‘ハ’ or any such character, I get some result (I did not check whether it is right or wrong), but when I search for ‘Drange’ or any normal English word, I get no response.
‘排列’ exists in the document and has 2 occurrences, but my code does not detect it, so I get no results. However, I have observed that if I search for just ‘排’ or ‘列’ as single characters, they are counted.
Also, I sometimes get this on my console: console.PNG (7.3 KB)

Can you please investigate this and help me get past it? Thanks in advance!

awais.hafeez · June 20, 2019, 1:10pm

@Kushal.20,

The following code returns the correct count (2 for each) for English and Chinese characters found in your shared document ‘ActaMeteorSinica_WordTemplate.docx’ when using the latest version of Aspose.Words for Java i.e. 19.6.

Document doc = new Document("E:\\temp\\ActaMeteorSinica_WordTemplate\\ActaMeteorSinica_WordTemplate.docx");

FindReplaceOptions options = new FindReplaceOptions();
Pattern regex = Pattern.compile("Drange", Pattern.CASE_INSENSITIVE);
int count = doc.getRange().replace(regex, "", options);

if(count > 0) {
    System.out.println("English Count = " + count);
}

regex = Pattern.compile("排列", Pattern.CASE_INSENSITIVE);
count = doc.getRange().replace(regex, "", options);

if(count > 0) {
    System.out.println("Chinese Count = " + count);
}
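For reference, the same kind of count can be checked on a plain string with the JDK alone. This is only a sketch of the matching logic, not of the Aspose.Words API: `PhraseCounter` and `countPhrase` are made-up names, and the sample sentence is invented. The phrase is quoted so that it is treated literally, and no word-boundary anchors are used, since Chinese and Japanese text usually has no spaces between words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: count literal, case-insensitive phrase matches in mixed-language text. */
final class PhraseCounter {

    static int countPhrase(String text, String phrase) {
        // Quote the phrase so regex metacharacters are matched literally;
        // UNICODE_CASE extends case-insensitive matching beyond ASCII.
        Pattern pattern = Pattern.compile(
                Pattern.quote(phrase),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "\u6392\u5217" is the phrase 排列, written with escapes so the
        // source file compiles the same way under any default encoding.
        String chinese = "\u6392\u5217";
        String text = "The Drange value is sorted (" + chinese + "), then DRANGE is "
                + chinese + " again.";
        System.out.println("English Count = " + countPhrase(text, "Drange"));
        System.out.println("Chinese Count = " + countPhrase(text, chinese));
    }
}
```

Unlike replacing with an empty string, counting this way leaves the searched text untouched, which matters if the same content is searched again for another phrase.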

Farhan.Raza · June 20, 2019, 10:12pm

@Kushal.20

Regarding your concerns about Aspose.Slides and Aspose.Cells: in general, there are no specific limitations for different languages, as long as the required fonts are installed, because they are needed for rendering. If you notice any issue, please feel free to create a separate topic with all the details.

Kushal.20 · June 21, 2019, 4:31am

@awais.hafeez
Okay, thanks!
As you suggested, I tried the latest version of Aspose.Words (aspose-words-19.6-jdk17.jar), but I am not getting any response.
Neither the English word nor the Chinese/Japanese ones give any result; nothing is printed on the console.

Here is the code (the same as you suggested):

public static void main(String[]args) throws Exception {
	
	File[] files = new File("E:\\docs").listFiles();

	for (File file : files) {
	    if (file.isFile()) {
	        String folderName = file.getParent();
	        String fileName =  file.getName();
	        String extensionName = fileName.contains(".") ? fileName.substring(fileName.lastIndexOf(".")).toLowerCase() : "";

	        if (extensionName.equals(".doc") || extensionName.equals(".docx")) {
	            //System.out.println("Processing document: " + fileName);
	            Document doc = new Document(file.getAbsolutePath());
	            FindReplaceOptions options = new FindReplaceOptions();
	   		
	    		Pattern regex = Pattern.compile("Drange", Pattern.CASE_INSENSITIVE);
	    		int count =doc.getRange().replace(regex, "", options);
	    		//int count = callback.mMatchNumber;
	    		if(count > 0) {
	    			//System.out.println("Folowing documnets contain the phrase : '"+regex+"'");
	    			System.out.println("E:\\"+file.getName()+" || Count="+count);
	    		}
	    		
	        }
	    }

	} 
}

And here is the console screenshot: console.PNG (34.0 KB)

Please look into this and advise what the possible mistakes or issues could be.