Get total number of documents based on a particular phrase/word


#42

@Kushal.20,

Regarding your question about Aspose.Words for Java, please check these Word documents (ActaMeteorSinica_WordTemplate.zip (63.0 KB)) and try running the following code. In addition to printing the counts to the console, it writes them to a blank Word document so that you can confirm that Aspose.Words counts the matches correctly:

Document doc = new Document("E:\\temp\\ActaMeteorSinica_WordTemplate\\ActaMeteorSinica_WordTemplate.docx");
DocumentBuilder builder = new DocumentBuilder();

FindReplaceOptions options = new FindReplaceOptions();
Pattern regex = Pattern.compile("Drange", Pattern.CASE_INSENSITIVE);
int count = doc.getRange().replace(regex, "", options);

if(count > 0) {
    System.out.println("English Count = " + count);
    builder.writeln("English Count = " + count);
}

regex = Pattern.compile("排列", Pattern.CASE_INSENSITIVE);
count = doc.getRange().replace(regex, "", options);

if(count > 0) {
    System.out.println("Chinese Count = " + count);
    builder.writeln("Chinese Count = " + count);
}

builder.getDocument().save("E:\\temp\\ActaMeteorSinica_WordTemplate\\output.docx");

#44

@awais.hafeez
I will give it a try and let you know about it ! Thanks !


#46

@awais.hafeez , Thanks !
I tried the above code, but I still get no output. The new document it generates is blank, and nothing is printed to the console.
Please help me get past this. Something may be wrong or missing.
The generated output file is here: ActaMeteorSinica_WordTemplate.zip (5.1 KB)

The following is the code that I tried (exactly the same as the code you told me to try):

public static void main(String[] args) throws Exception {

	com.aspose.slides.License license = new com.aspose.slides.License();
	license.setLicense("Java-License.lic");
	
	Document doc = new Document("E:\\docs\\ActaMeteorSinica_WordTemplate.docx");
	DocumentBuilder builder = new DocumentBuilder();

	FindReplaceOptions options = new FindReplaceOptions();
	
	Pattern regex = Pattern.compile("Drange", Pattern.CASE_INSENSITIVE);
	int count = doc.getRange().replace(regex, "", options);

	if(count > 0) {
	    System.out.println("English Count = " + count);
	    builder.writeln("English Count = " + count);
	}

	regex = Pattern.compile("排列", Pattern.CASE_INSENSITIVE);
	count = doc.getRange().replace(regex, "", options);

	if(count > 0) {
	    System.out.println("Chinese Count = " + count);
	    builder.writeln("Chinese Count = " + count);
	}

	builder.getDocument().save("E:\\ActaMeteorSinica_WordTemplate\\output.docx");
}

#47

@Kushal.20,

We are working on your query and will get back to you soon.


#48

I have sorted this, @awais.hafeez.
Actually, this required the license. I applied it and got it working.

But when I enter “Drange (姓)” as a single search query, it is not detected. This is where I am stuck now! Why isn’t “Drange (姓)” detected, when both parts are detected if entered separately?


#49

@Kushal.20,

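Please use the following pattern to find text that contains parentheses. Parentheses are regex metacharacters (they delimit a group), so they must be escaped to match literally: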

Pattern regex = Pattern.compile("Drange \\(姓\\)", Pattern.CASE_INSENSITIVE);
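When the search text comes from user input, escaping every metacharacter by hand is error-prone. As a minimal sketch, the standard `java.util.regex.Pattern.quote` method can treat the whole query as a literal. The class and method names below (`LiteralSearchPattern`, `literal`, `countMatches`) are hypothetical, and the example only uses plain strings; it is an assumption, not something verified here, that a pattern built this way behaves identically when passed to Aspose.Words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralSearchPattern {

    // Builds a case-insensitive pattern that matches the query text literally,
    // so characters such as ( ) . + * lose their regex meaning.
    static Pattern literal(String query) {
        return Pattern.compile(Pattern.quote(query),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Counts non-overlapping matches of the pattern in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Drange (姓) appears here, and drange (姓) again.";

        // Unescaped: "(姓)" is read as a group, so the pattern looks for "Drange 姓".
        Pattern unescaped = Pattern.compile("Drange (姓)", Pattern.CASE_INSENSITIVE);
        System.out.println(countMatches(unescaped, text)); // prints 0

        // Quoted: the parentheses are matched as literal characters.
        System.out.println(countMatches(literal("Drange (姓)"), text)); // prints 2
    }
}
```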


#50

@awais.hafeez
Okay, thanks ! :slight_smile:


#51

@awais.hafeez Hi! Hope you are doing well!
As I told you, I am working on the multilingual search, and it is going fine.
My next step is performance. I would like to know if I can use multi-threading in my search task.
Take the following code, for example, where I search for all .doc and .docx files containing the search query:

    com.aspose.words.License license = new com.aspose.words.License();
	license.setLicense("Java-License.lic");
	
	File[] files = new File("E:\\docs").listFiles();

	for (File file : files) {
	    if (file.isFile()) {
	        @SuppressWarnings("unused")
			String folderName = file.getParent();
	        String fileName =  file.getName();
	        String extensionName = fileName.substring(fileName.lastIndexOf("."));
	        if (extensionName.equals(".doc") || extensionName.equals(".docx")) {
	            //System.out.println("Processing document: " + fileName);
	            Document doc = new Document(file.getAbsolutePath());
	            FindReplaceOptions options = new FindReplaceOptions();
	           
	    		
	    		Pattern regex = Pattern.compile("sample", Pattern.CASE_INSENSITIVE);
	    		int count = doc.getRange().replace(regex, "", options);
	    		
	    		if(count > 0) {
	    			 System.out.println("E:\\"+file.getName());
	    			//System.out.println("E:\\"+file.getName()+" || Count="+count);
	    		}
	    		
	        }
	    }

	}

Could you please help me implement multi-threading here to improve the speed of the operation?

Thanks !


#52

@Kushal.20,

Please note that Aspose.Words for Java is thread-safe as long as only one thread works on a given Document at a time. Processing multiple documents simultaneously in different threads should work fine, and Aspose.Words imposes no limit on the number of threads or on the number of documents processed at the same time. For the threading itself, please refer to general resources on implementing multi-threading in Java.
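One possible way to apply the one-thread-per-document rule is a fixed thread pool with one task per file. The sketch below is an illustration only: `ParallelSearch`, `search`, and `fileContains` are hypothetical names, and the per-file check reads plain text files as a stand-in. In the scenario from this thread, the check passed as `matcher` is where each task would create and search its own `Document`, so that no `Document` is shared between threads.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelSearch {

    // Runs the matcher on every regular file in dir, one task per file,
    // and returns the files for which the matcher reported a hit.
    static List<Path> search(Path dir, Predicate<Path> matcher, int threads) throws Exception {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Path file : files) {
                // Each task touches only its own file (or its own Document).
                Callable<Boolean> task = () -> matcher.test(file);
                results.add(pool.submit(task));
            }
            List<Path> hits = new ArrayList<>();
            for (int i = 0; i < files.size(); i++) {
                if (results.get(i).get()) { // waits for that task to finish
                    hits.add(files.get(i));
                }
            }
            return hits;
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for the real per-document search: case-insensitive text check.
    static boolean fileContains(Path file, String term) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("docs");
        Files.write(dir.resolve("a.txt"), "A sample file".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.txt"), "Nothing relevant".getBytes(StandardCharsets.UTF_8));
        int threads = Runtime.getRuntime().availableProcessors();
        for (Path hit : search(dir, f -> fileContains(f, "sample"), threads)) {
            System.out.println(hit);
        }
    }
}
```

Sizing the pool to the number of available processors is only a starting point. Loading documents is partly I/O-bound, so the best thread count would need to be measured on the actual document set.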


#53

@awais.hafeez Hi !
Thanks for the last update !
As you know, I have been working on document search for quite a long time now, and things are working to a great extent. Thanks to you! :slight_smile:

Going a step further, the requirement now is performance: the search needs to be fast. At the moment it takes about a minute (60-64 seconds) to search a directory of approximately 1,000 documents. The real-world requirement can go up to 50,000 documents, which at the same rate would take around 50 minutes (50 × 60 s = 3,000 s). Nobody will wait that long watching a loader on the screen.

I am posting the complete search code that I have written so far. Kindly go through it and suggest the best approach.

public static void main(String[]args) throws Exception {
	
	com.aspose.pdf.License licensePdf = new com.aspose.pdf.License();
	licensePdf.setLicense("Aspose.Total.Java-License.lic");
	
	com.aspose.slides.License licenseSlide = new com.aspose.slides.License();
	licenseSlide.setLicense("Aspose.Total.Java-License.lic");
	
	com.aspose.words.License licenseDoc = new com.aspose.words.License();
	licenseDoc.setLicense("Aspose.Total.Java-License.lic");
	
	com.aspose.cells.License licenseCell = new com.aspose.cells.License();
	licenseCell.setLicense("Aspose.Total.Java-License.lic");
	
	String lang= "ENG";
	String strFind = "";
	int count =0;
	long startTime = System.currentTimeMillis();
	System.out.println("Start Time : "+startTime);
	File[] files = new File("E:\\docs").listFiles();
	if(!strFind.isEmpty()) {
	for (File file : files) {
	    if (file.isFile()) {
	    	String folderName = file.getParent();
	        String fileName =  file.getName();
	        String extensionName = fileName.substring(fileName.lastIndexOf("."));
	        if (extensionName.equals(".xlsx")  || extensionName.equals(".xls")) 
	        {
	        	int countCell = 0;
	        	//System.out.println("Processing document: " + fileName);
				Workbook workbook = new Workbook(file.getAbsolutePath());
				for(int i=0; i<workbook.getWorksheets().getCount(); i++) {
				Worksheet worksheet = workbook.getWorksheets().get(i);
				FindOptions opts = new FindOptions();
				Cell cell = null;
				do
				{
					cell = worksheet.getCells().find(strFind, cell, opts);
					if(cell!=null)
					{
						countCell++;
					}
				}
				while(cell!=null);
				}
				if(countCell > 0)
				  System.out.println("E:\\"+file.getName());
				//System.out.println("E:\\"+file.getName()+" || Count="+countCell);
			}
	        
	        else if(extensionName.equals(".doc") || extensionName.equals(".docx")) {
	            //System.out.println("Processing document: " + fileName);
	        	com.aspose.words.Document wordDoc = new com.aspose.words.Document(file.getAbsolutePath());
	            FindReplaceOptions options = new FindReplaceOptions();
	            //ReplaceEvaluator callback = new ReplaceEvaluator();
	    		//options.setReplacingCallback(callback);
	    		Pattern regex = Pattern.compile(strFind, Pattern.CASE_INSENSITIVE);
	    		int countWord = wordDoc.getRange().replace(regex, strFind, options);
	    		//int countWord = callback.mMatchNumber;
	    		if(countWord > 0) {
	    			  System.out.println("E:\\"+file.getName());
	    			//System.out.println("E:\\"+file.getName()+" || Count="+countWord);
	    		}
	    		
	        }
	        
	        else if(extensionName.equals(".pdf")) {
	            //System.out.println("Processing document: " + fileName);
	        	String find = "";

	        	// if(lang=="ENG") {
	        	//     find = "(?i)\\b"+strFind+"\\b";
	        	// }else {
	        	//     find = strFind;
	        	// }
	        	// Wrap plain ASCII queries in word boundaries; use other queries as-is.
	        	if(strFind.matches("^[a-zA-Z0-9_ !@#$&()\\-`.+,/\"]*$")){
	        		find = "(?i)\\b"+strFind+"\\b";
	        	}else {
	        		find = strFind;
	        	}
	        	com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(file.getAbsolutePath());
	        	TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(find); // like 1999-2000
	        	TextSearchOptions textSearchOptions = new TextSearchOptions(true);
	        	textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
	        	pdfDoc.getPages().accept(textFragmentAbsorber);
	        	TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
	        	for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
	        		// Compare string content, not references.
	        		if(!textFragment.getText().isEmpty()) {
	        			count++;
	        		}
	        	}
	        	if(count > 0) {
	        		System.out.println("E:\\"+file.getName());
	        		//System.out.println("E:\\"+file.getName()+" || Count="+count);
	        	}
	        	count=0;
	        }

	        else if(extensionName.equals(".ppt")  || extensionName.equals(".pptx")) {
	        	//System.out.println("Processing document: " + fileName);
	        	Presentation presentation = new Presentation(file.getAbsolutePath());
				presentation.joinPortionsWithSameFormatting();
				ITextFrame[] tb = SlideUtil.getAllTextFrames(presentation, true);
				for (int i = 0; i < tb.length; i++)
				{
					for (IParagraph ipParagraph : tb[i].getParagraphs()) 
					{	
						for (IPortion iPortion : ipParagraph.getPortions()) 
						{	
							if(iPortion.getText().toLowerCase().contains(strFind.toLowerCase()))
							{ 	
					            int fromIndex=0;
					            while ((fromIndex = iPortion.getText().toLowerCase().indexOf(strFind.toLowerCase(), fromIndex)) != -1 )
					            {
					                count++;
					                fromIndex++;
					            }
				               
							}
						}
					}
				}
				if(count > 0) 
				  System.out.println("E:\\"+file.getName());
				//System.out.println("E:\\"+file.getName()+" || Count="+count);
				count=0;
	        }
	        
		}
	}
}
	else 
    	System.out.println("Please enter a valid search query (Blank Query Entered)");
		long endTime = System.currentTimeMillis();
		System.out.println("End Time : "+endTime);
		System.out.println("Total Time : "+((endTime-startTime)/1000)+" seconds");
}

This single piece of code handles all the file formats. Kindly help me with this. Your suggestions are always welcome!

Also, here is an idea I have: what if I don’t need the counts? That is, I just search, and as soon as the first occurrence of the word is found in a document, the search stops and moves on to the next document. Hopefully this would help a little. What do you say?


#54

@Kushal.20,

Thanks for providing sample code and details.

Having looked at your code segments, I think your approach is fine for finding every occurrence of a word or text in the documents, if that is your ultimate requirement. Please note that scanning 50,000 files for every occurrence in each whole document will take some time.

Regarding your idea of stopping at the first occurrence: yes, this may reduce the time and improve performance, so you may give it a try and update your code accordingly.
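To illustrate the stop-at-first-occurrence idea in isolation, here is a minimal sketch that returns as soon as one piece of text matches, instead of counting every match. It assumes the text has already been extracted into strings (for example, the portion texts in the presentation loop of the posted code). `FirstMatchOnly` and `containsAny` are hypothetical names, and the product-specific ways of stopping a search early in each Aspose API are not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchOnly {

    // Returns true as soon as any chunk contains the query (case-insensitive).
    // Chunks after the first hit are never examined.
    static boolean containsAny(Iterable<String> chunks, String query) {
        Pattern pattern = Pattern.compile(Pattern.quote(query),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        for (String chunk : chunks) {
            if (pattern.matcher(chunk).find()) {
                return true; // early exit: no counting, no further scanning
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList(
                "nothing here", "a SAMPLE portion", "another sample");
        System.out.println(containsAny(chunks, "sample")); // prints true
        System.out.println(containsAny(chunks, "absent")); // prints false
    }
}
```

How much time this saves depends on where the time is actually spent. If most of it goes into loading each document rather than scanning its text, the gain from an early exit will be small.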


#55

@Amjad_Sahi Thanks for looking into this!
I know this works fine, but I wanted a solution or suggestion for making it faster,
because, as I mentioned, we cannot let a search run for 50 minutes.

Also, if the idea that I shared is workable, could you please help me restrict the search to the first occurrence only, for all the file formats? It would be really helpful for me.

Thanks !


#56

@Kushal.20,

Regarding Aspose.Cells, please follow up your thread:

Regarding Aspose.Words, Aspose.PDF and Aspose.Slides, we will check and get back to you soon.


#57

For Aspose.Words, please refer to the thread:

For Aspose.PDF, please follow up the thread:
https://forum.aspose.com/t/find-first-occurrence-of-any-text-and-stop/199291/2