@awais.hafeez Hi!
Thanks for the last update!
As you know, I have been working on document search for quite a long time now. It is working to a great extent, thanks to you!
Now, going a step further, the requirement is PERFORMANCE: the search needs to be fast. For example, it currently takes about a minute (60-64 seconds) to search a directory of approximately 1,000 documents. The real requirement can go up to 50,000 documents, which at the same rate would take around 50 minutes. Practically, no one will wait that long watching a loader on the screen.
So this is the requirement now. I am posting the complete search code that I have written so far. Kindly go through it and advise me on the best approach.
public static void main(String[] args) throws Exception {
com.aspose.pdf.License licensePdf = new com.aspose.pdf.License();
licensePdf.setLicense("Aspose.Total.Java-License.lic");
com.aspose.slides.License licenseSlide = new com.aspose.slides.License();
licenseSlide.setLicense("Aspose.Total.Java-License.lic");
com.aspose.words.License licenseDoc = new com.aspose.words.License();
licenseDoc.setLicense("Aspose.Total.Java-License.lic");
com.aspose.cells.License licenseCell = new com.aspose.cells.License();
licenseCell.setLicense("Aspose.Total.Java-License.lic");
String lang = "ENG";
String strFind = "";
int count = 0;
long startTime = System.currentTimeMillis();
System.out.println("Start Time : " + startTime);
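File[] files = new File("E:\\docs").listFiles();
if (files == null) files = new File[0]; // listFiles() returns null if the directory is missing or unreadable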
if (!strFind.isEmpty()) { // compare string content, not object references
for (File file : files) {
if (file.isFile()) {
String folderName = file.getParent();
String fileName = file.getName();
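int dotIndex = fileName.lastIndexOf('.');
// Files without an extension would otherwise make substring(-1) throw; lower-casing also matches ".PDF", ".DOCX", etc.
String extensionName = dotIndex >= 0 ? fileName.substring(dotIndex).toLowerCase() : "";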
if (extensionName.equals(".xlsx") || extensionName.equals(".xls")) {
int countCell = 0;
//System.out.println("Processing document: " + fileName);
Workbook workbook = new Workbook(file.getAbsolutePath());
for (int i = 0; i < workbook.getWorksheets().getCount(); i++) {
Worksheet worksheet = workbook.getWorksheets().get(i);
FindOptions opts = new FindOptions();
Cell cell = null;
do {
cell = worksheet.getCells().find(strFind, cell, opts);
if (cell != null) {
countCell++;
}
}
while (cell != null);
}
if (countCell > 0)
System.out.println(file.getAbsolutePath()); // the files live under E:\docs, not E:\
//System.out.println("E:\\"+file.getName()+" || Count="+countCell);
} else if (extensionName.equals(".doc") || extensionName.equals(".docx")) {
//System.out.println("Processing document: " + fileName);
com.aspose.words.Document wordDoc = new com.aspose.words.Document(file.getAbsolutePath());
FindReplaceOptions options = new FindReplaceOptions();
//ReplaceEvaluator callback = new ReplaceEvaluator();
//options.setReplacingCallback(callback);
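// Quote the search term so characters such as '.', '(' or '+' are matched literally
Pattern regex = Pattern.compile(Pattern.quote(strFind), Pattern.CASE_INSENSITIVE);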
int countWord = wordDoc.getRange().replace(regex, strFind, options);
//int countWord = callback.mMatchNumber;
if (countWord > 0) {
System.out.println(file.getAbsolutePath()); // the files live under E:\docs, not E:\
//System.out.println("E:\\"+file.getName()+" || Count="+countWord);
}
} else if (extensionName.equals(".pdf")) {
//System.out.println("Processing document: " + fileName);
String find = "";
// if(lang=="ENG") {
// find = "(?i)\\b"+strFind+"\\b";
// }else {
// find = strFind;
// }
if (strFind.matches("^[a-zA-Z0-9_ !@#$&()\\\\-`.+,/\\\"]*$")) {
find = "(?i)\\b" + strFind + "\\b";
} else {
find = strFind;
}
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(file.getAbsolutePath());
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(find); // like 1999-2000
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
pdfDoc.getPages().accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
if (!"".equals(textFragment.getText())) { // compare string content, not object references
count++;
}
}
if (count > 0) {
System.out.println(file.getAbsolutePath()); // the files live under E:\docs, not E:\
//System.out.println("E:\\"+file.getName()+" || Count="+count);
}
count = 0;
} else if (extensionName.equals(".ppt") || extensionName.equals(".pptx")) {
//System.out.println("Processing document: " + fileName);
Presentation presentation = new Presentation(file.getAbsolutePath());
presentation.joinPortionsWithSameFormatting();
ITextFrame[] tb = SlideUtil.getAllTextFrames(presentation, true);
for (int i = 0; i < tb.length; i++) {
for (IParagraph ipParagraph : tb[i].getParagraphs()) {
for (IPortion iPortion : ipParagraph.getPortions()) {
if (iPortion.getText().toLowerCase().contains(strFind.toLowerCase())) {
int fromIndex = 0;
while ((fromIndex = iPortion.getText().toLowerCase().indexOf(strFind.toLowerCase(), fromIndex)) != -1) {
count++;
fromIndex++;
}
}
}
}
}
if (count > 0)
System.out.println(file.getAbsolutePath()); // the files live under E:\docs, not E:\
//System.out.println("E:\\"+file.getName()+" || Count="+count);
count = 0;
}
}
}
} else
System.out.println("Please enter a valid search query (Blank Query Entered)");
long endTime = System.currentTimeMillis();
System.out.println("End Time : " + endTime);
System.out.println("Total Time : " + ((endTime - startTime) / 1000) + " seconds");
}
This is a single piece of code for all the file formats. Kindly help me with this. Your suggestions are always welcome!
Also, here is an idea I have: what if I don't need the counts? That is, I just search, and as soon as the first occurrence of the word is found in a document, the search stops and moves on to the next document. Hopefully that would help at least slightly. What do you think?
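To make the idea concrete, here is a rough sketch of the "stop at the first occurrence" logic, written against plain strings rather than the Aspose APIs. The class name FirstMatchSketch, the method containsTerm, and the idea of treating a document as a list of text "chunks" (cells, paragraphs, text fragments, slide portions) are all just assumptions for illustration, not anything from the libraries.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FirstMatchSketch {

    /** Counts how many chunks were examined, only to make the early exit visible. */
    static int chunksExamined = 0;

    /**
     * Returns true as soon as one chunk contains the term (case-insensitive).
     * "chunks" is a hypothetical stand-in for whatever text the document
     * library exposes: cells, paragraphs, text fragments, or slide portions.
     */
    static boolean containsTerm(Iterable<String> chunks, String term) {
        String needle = term.toLowerCase(Locale.ROOT); // lower-case the term once, not per chunk
        for (String chunk : chunks) {
            chunksExamined++;
            if (chunk != null && chunk.toLowerCase(Locale.ROOT).contains(needle)) {
                return true; // first hit found: skip the rest of this document
            }
        }
        return false; // no chunk contained the term
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for one document
        List<String> doc = Arrays.asList("Annual report", "Budget 1999-2000", "Appendix", "Index");
        boolean found = containsTerm(doc, "budget");
        // prints: found=true, chunks examined=2
        System.out.println("found=" + found + ", chunks examined=" + chunksExamined);
    }
}
```

In my code above, I think this would mean leaving the Excel do/while loop after the first non-null cell and breaking out of the PowerPoint loops on the first matching portion. I am not sure whether the Word replace call or the PDF absorber can be stopped part-way like this, so that part is an open question.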