Get total number of documents based on a particular phrase/word

Kushal.20 · May 29, 2019, 6:53am

Hello Team & Friends,
Is there a way to get the total number of documents in my directory that contain a specific word or phrase? For example, if I search for the word ‘request’, I should get the list, or the count, of all documents containing the word ‘request’.
I would be highly obliged if anyone could help me with this.

awais.hafeez · May 29, 2019, 11:40am

@Kushal.20,

For example, you can use the following code to load the .doc/.docx files in a directory and then search for a string inside each document.

string[] fileNames = Directory.GetFiles("E:\\Temp\\", "*.doc?", SearchOption.TopDirectoryOnly);
foreach (string fileName in fileNames)
{
    Document doc = new Document(fileName);
    // find keyword in Word document
    if (wordFound)
    {
        // Count this document
    }
}

To search a string inside a Word document, please check the following article:
Find and Replace

Hope, this helps.

Kushal.20 · May 29, 2019, 12:22pm

Yes, this is what I need to achieve. First of all, thank you, @awais.hafeez.
But this isn’t working in Java. Directory doesn’t exist, and I am not getting any correct import options.

awais.hafeez · May 30, 2019, 3:43am

@Kushal.20,

You can build on the following Java code to achieve what you are looking for:

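import java.io.File;
import com.aspose.words.Document;

File[] files = new File("E:\\Temp\\").listFiles();

// listFiles() returns null if the path is not a readable directory
if (files != null) {
    for (File file : files) {
        if (file.isFile()) {
            String fileName = file.getName();
            // lastIndexOf returns -1 for names without a dot, so guard before substring
            int dotIndex = fileName.lastIndexOf(".");
            String extensionName = dotIndex < 0 ? "" : fileName.substring(dotIndex);

            if (extensionName.equals(".doc") || extensionName.equals(".docx")) {
                System.out.println("Processing document: " + fileName);
                Document doc = new Document(file.getAbsolutePath());
            }
        }
    }
}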

Java article link:

Hope, this helps.
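
Once the text of a document is available as a plain Java String (how you extract it depends on the API for that format, so that step is left out here), counting the phrase needs nothing beyond the standard library. The following is only a minimal sketch; the class and method names are made up for illustration. It uses Pattern.quote so that a phrase such as "v1.0" is matched literally rather than as a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseCount {
    // Counts non-overlapping, case-insensitive occurrences of phrase in text.
    // Hypothetical helper: the text is assumed to have been extracted already.
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // an empty phrase would otherwise match at every position
        }
        // Pattern.quote treats the phrase literally, so "." or "(" are not regex syntax
        Pattern pattern = Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Request sent. The request (v1.0) failed.";
        System.out.println(countOccurrences(text, "request"));
    }
}
```

A document would then be counted as a match whenever the returned value is greater than zero.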

Kushal.20 · May 30, 2019, 10:42am

Done! That solves my purpose.
Thanks for this, @awais.hafeez!

One more thing: this is for Word only, right?
What if I remove " if (extensionName.equals(".doc") || extensionName.equals(".docx"))" so as to make it work for all file formats? Will it work?

awais.hafeez · May 30, 2019, 12:53pm

@Kushal.20,

Aspose.Words for Java only supports processing the file formats mentioned in the following article:

Supported Load Formats

You cannot use Aspose.Words to load XLS or XLSX file formats. To process .xls and .xlsx files, for example, you need to use the Aspose.Cells for Java API.

Aspose provides many APIs to process different file formats. Please check:

Kushal.20 · May 30, 2019, 12:59pm

@awais.hafeez
Yes, I am well aware of that. I am already exploring and using those APIs to decide whether to buy them.
But my concern is about this specific example.
Suppose I want to count and list the documents in my directory that contain a given string or phrase.
How should I handle that scenario, given that a directory can contain files in multiple formats?
Is there any way to do this in a single piece of code for all formats, rather than writing separate code for each file format using its particular API?

awais.hafeez · May 31, 2019, 2:57am

@Kushal.20,

You can make use of FileFormatUtil.detectFileFormat() method to detect if a particular file is suitable for Aspose.Words or not. Similarly, other APIs also provide similar methods. For example, the following class:

Hope, this helps.

Kushal.20 · June 3, 2019, 10:51am

What I am actually asking, @awais.hafeez, is this: a directory can contain many file formats, so do I have to write separate code for each format, or can it be done in a single piece of code?

awais.hafeez · June 3, 2019, 3:26pm

@Kushal.20,

To find a string inside a document/file, you need to use the separate Aspose API for each format. However, to filter different file formats inside a directory, you can simply add more conditions to the if statement or use a switch statement, e.g.

...
        
if (extensionName.equals(".doc") ||
        extensionName.equals(".docx") ||
        extensionName.equals(".xls") ||
        extensionName.equals(".xlsx") ||
        extensionName.equals(".ppt"))
// And so on
{
    
}
...
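
As an alternative to a growing if chain, the per-format code can be registered in a map keyed by extension, so the directory loop itself stays the same for every format. This is only a sketch under assumptions: FormatDispatcher and PhraseCounter are hypothetical names, and the lambdas in main are stand-ins for the real per-format search code written with Aspose.Words, Aspose.Cells and so on.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class FormatDispatcher {
    // Hypothetical callback: returns how often the phrase occurs in the file.
    interface PhraseCounter {
        int count(File file, String phrase) throws Exception;
    }

    private final Map<String, PhraseCounter> counters = new HashMap<>();

    // Registers one counter for one or more extensions such as ".doc", ".docx".
    void register(PhraseCounter counter, String... extensions) {
        for (String extension : extensions) {
            counters.put(extension.toLowerCase(Locale.ROOT), counter);
        }
    }

    // Returns the lower-cased extension including the dot, or "" if there is none.
    static String extensionOf(String fileName) {
        int dotIndex = fileName.lastIndexOf('.');
        return dotIndex < 0 ? "" : fileName.substring(dotIndex).toLowerCase(Locale.ROOT);
    }

    // Returns the match count, or -1 when no counter handles this extension.
    int countIn(File file, String phrase) {
        PhraseCounter counter = counters.get(extensionOf(file.getName()));
        if (counter == null) {
            return -1;
        }
        try {
            return counter.count(file, phrase);
        } catch (Exception e) {
            throw new IllegalStateException("Could not search " + file, e);
        }
    }

    public static void main(String[] args) {
        FormatDispatcher dispatcher = new FormatDispatcher();
        // Stand-in counters; real ones would load the file with the matching API.
        dispatcher.register((file, phrase) -> 1, ".doc", ".docx");
        dispatcher.register((file, phrase) -> 2, ".xls", ".xlsx");
        System.out.println(dispatcher.countIn(new File("a.docx"), "request"));
        System.out.println(dispatcher.countIn(new File("a.txt"), "request"));
    }
}
```

The search logic for each format still has to be written separately; the map only keeps the file loop in one place.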

Kushal.20 · June 5, 2019, 3:01pm

@awais.hafeez
Thanks !
I will certainly try this and let you know if I run into any problems.
But before that, could you do me a favour?
For the kind of task discussed in this topic, that is, getting the total number of documents based on a particular phrase/word, would GroupDocs be better? I mean, would it be easier and more suitable to use GroupDocs for such tasks?
Please help me with this ASAP, as it’s urgent and I need to decide on my buying options.
Thanks in advance

awais.hafeez · June 5, 2019, 4:21pm

@Kushal.20,

The native Aspose.Total for Java APIs only let you manipulate documents programmatically (without any GUI). GroupDocs.Total for Java provides a different feature set; for example, you can include ‘GroupDocs.Viewer for Java’ to visualize/display different document formats in your Java applications. I think, for this scenario, you should go for the native Aspose.Total for Java APIs. In case you have further inquiries or need any help, please let us know.

Kushal.20 · June 6, 2019, 9:07am

Okay, thanks for the information, @awais.hafeez
Now that we have successfully filtered the files in the directory that contain the searched word,
I would like to go a step further and get those result files as links, so that clicking one simply opens the file.

Could you please advise how to proceed with this?

awais.hafeez · June 7, 2019, 4:36am

@Kushal.20,

For example, you can use the following Aspose.Words’ code to insert Hyperlinks in Word document.

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.moveToDocumentEnd();
builder.getFont().setStyleIdentifier(StyleIdentifier.HYPERLINK);
builder.insertHyperlink("Aspose", "https://www.aspose.com/", false);
doc.save("E:\\temp\\awjava-19.5.docx");

Or you can insert the file as an embedded object inside Word document. Double clicking the embedded object will open the file with appropriate application:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.moveToDocumentEnd();
BufferedImage image = ImageIO.read(new File("E:\\temp\\Aspose.Words.png"));
Shape oleObject = builder.insertOleObject("E:\\temp\\in.docx", true, false, image);
doc.save("E:\\temp\\awjava-19.5.docx");

Hope, this helps in achieving what you are looking for.

Kushal.20 · June 7, 2019, 5:22am

@awais.hafeez, I think you misunderstood me.

I am asking about the very first query in this topic. My requirement of getting the list of documents that contain a searched word is now met. Next, I want this list of documents to appear as hyperlinks, so that clicking an entry opens the document.

I hope this is clearer; otherwise, please reply with your questions and I will try to explain further.

Thanks, @awais.hafeez. Waiting for the solution.

awais.hafeez · June 7, 2019, 1:02pm

@Kushal.20,

Your list of documents may contain various types of documents, e.g. DOCX, DOC, XLSX, PPT and so on. But in which file format do you want to store these hyperlinks: a DOCX file, a text file, or something else?

Kushal.20 · June 7, 2019, 1:10pm

@awais.hafeez,
I will actually be showing this list to users on an application screen, so I would want the data returned in a suitable format, maybe JSON. (This is my end requirement.)
For now, I just want to display it on the console, as I am still in the testing phase. The filenames I am already printing on the console should be displayed as hyperlinks there, or getting all of them in a text file would also do.

awais.hafeez · June 8, 2019, 5:12am

@Kushal.20,

The following code will print the absolute path, including the complete filename, of each file on the Java console. Pasting a line into the Windows Explorer address bar will open the file with its associated application. Hope this helps.

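File[] files = new File("E:\\Temp\\").listFiles();

// listFiles() returns null if the path is not a readable directory
if (files != null) {
    for (File file : files) {
        if (file.isFile()) {
            String fileName = file.getName();
            // Guard against names without a dot (lastIndexOf returns -1)
            int dotIndex = fileName.lastIndexOf(".");
            String extensionName = dotIndex < 0 ? "" : fileName.substring(dotIndex);

            if (extensionName.equals(".doc") || extensionName.equals(".docx")) {
                // Print the full path of each matching file
                System.out.println(file.getAbsolutePath());
            }
        }
    }
}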
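
Since the end requirement mentioned above is a JSON list of links, here is a rough sketch that turns each result file into a file: URI using File.toURI() and joins the URIs into a JSON array. The ResultLinks class and its methods are hypothetical; whether a file: URI is clickable depends on the console or IDE in use, and in a real application a JSON library would probably be a better choice than the hand-written escaping shown here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class ResultLinks {
    // Converts a file to an absolute file: URI; spaces become %20.
    static String toLink(File file) {
        return file.toURI().toString();
    }

    // Minimal JSON string escaping for quotes, backslashes and control characters.
    static String quote(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Builds a JSON array of strings, e.g. ["file:/E:/Temp/a.docx"].
    static String toJson(List<String> links) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < links.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(links.get(i)));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        // In the real loop, add only the files in which the phrase was found.
        links.add(toLink(new File("report.docx")));
        System.out.println(toJson(links));
    }
}
```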

Kushal.20 · June 10, 2019, 2:20pm

I will try it and let you know, @awais.hafeez.
I have now done this work for Word, Cells and Pdf, separately for each format.
Now I am merging them all together to make it work for all the files in the directory.
I was able to merge them, and the Cells and Word code is working fine, but the Pdf part is not working (although the same Pdf code runs fine on its own).
One observation I made is that when I comment out the code for Word, Pdf starts working. This may be due to the import of the Document class, because both APIs use Document, and we can import either the Pdf one or the Words one. I managed to work around that as well, but I am still not able to run the Pdf part.

Here goes my code :

	String strFind = "Test";
	int count =0;
	File[] files = new File("E:\\").listFiles();
	for (File file : files) {
	    if (file.isFile()) {
	        String fileName =  file.getName();
	        String extensionName = fileName.substring(fileName.lastIndexOf("."));
	        if (extensionName.equals(".xlsx")  || extensionName.equals(".xls")) 
	        {

				// System.out.println("Processing document: " + fileName);
				Workbook workbook = new Workbook(file.getAbsolutePath());
				Worksheet worksheet = workbook.getWorksheets().get(0);
				FindOptions opts = new FindOptions();
				Cell cell = null;
				int countCell = 0;
				//find each cell containing hello and replace it with blue color hello world in arial black font
				do
				{
					cell = worksheet.getCells().find(strFind, cell, opts);
					if(cell!=null)
					{
						countCell++;
					}
				}
				while(cell!=null);
				if(countCell > 0)
				System.out.println("E:\\"+file.getName()+" || Count="+countCell);
			}
	        
	        else if(extensionName.equals(".doc") || extensionName.equals(".docx")) {
	            //System.out.println("Processing document: " + fileName);
	            Document doc  = new Document(file.getAbsolutePath());
	            FindReplaceOptions options = new FindReplaceOptions();
	            ReplaceEvaluator callback = new ReplaceEvaluator();
	    		options.setReplacingCallback(callback);
	    		
	    		// We want the "your document" phrase to be highlighted.
	    		Pattern regex = Pattern.compile(strFind, Pattern.CASE_INSENSITIVE);
	    		doc.getRange().replace(regex, strFind, options);
	    		int countWord = callback.mMatchNumber;
	    		if(countWord > 0) {
	    			//System.out.println("Folowing documnets contain the phrase : '"+regex+"'");
	    			System.out.println("E:\\"+file.getName()+" || Count="+countWord);
	    		}
	    		else {
	    			System.out.println("No document containing '" +regex+ "' exists");
	    		}
	    		// Save the output document.
	    		//doc.save("E:\\"+file.getName()+"_TestFile_out.doc");
	        }
	        
	        else if(extensionName.equals(".pdf")) {
	            //System.out.println("Processing document: " + fileName);
	        	com.aspose.pdf.Document pdfDocument = new  com.aspose.pdf.Document();
	        	TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)this"); // like 1999-2000
	        	TextSearchOptions textSearchOptions = new TextSearchOptions(true);
				textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
				// Accept the absorber for first page of document
				pdfDocument.getPages().accept(textFragmentAbsorber);
				// Get the extracted text fragments into collection
				TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
				for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
					System.out.println("sysout : "+textFragment.getText());
					if(textFragment.getText() != "") {
					count++;
					}
				}
	    		if(count > 0) {
	    			System.out.println("E:\\"+file.getName()+" || Count="+count);
	    		}
	    		count=0;
	        }
	        
	        
		}
	}

Kindly , look into the issue and help me !

Kushal.20 · June 10, 2019, 2:50pm

I guess, I have found the issue. And, yes the issue lies in this only.

This uses import com.aspose.words.Document;
And, the problem with PDF Document creation starts
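
The Java rule behind this kind of clash is that only one class with a given simple name can be imported by that name; any second class with the same simple name has to be written with its full package name. The sketch below illustrates that rule with two JDK classes that are both called Date, purely as a stand-in. The assumption is that com.aspose.words.Document and com.aspose.pdf.Document behave the same way, since they are also two classes sharing one simple name.

```java
import java.util.Date; // only one class named Date can be imported by its simple name

class NameClashDemo {
    static String describe() {
        // Resolved through the import above
        Date utilDate = new Date(0L);
        // The second class with the same simple name must be fully qualified
        java.sql.Date sqlDate = new java.sql.Date(0L);
        return utilDate.getClass().getName() + " / " + sqlDate.getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(describe()); // prints "java.util.Date / java.sql.Date"
    }
}
```

The posted code already writes com.aspose.pdf.Document in full in the PDF branch, which matches this pattern. Separately, that branch constructs the PDF document without passing file.getAbsolutePath(), so it may also be worth checking whether the file is being loaded at all.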