Search multiple keyword in excel workbook having multiple sheet efficiently

I am aware of workSheet.Cells.Find() method to search a keyword at worksheet level.

Our requirement is, we are having a workbook let’s say having 34 sheets and user is passing 5000 approximately search terms to be searched in the given workbook. Suggest some efficient ways to achieve it, as excel allows searching at workbook level but I don’t see any such find method in Aspose.Cells which is at workbook level.

Right now, our approach is to iterate it sheet wise and within that again second iteration on search terms where we are calling workSheet.Cells.Find() method. But the problem is, it is searching 5000 search terms per sheet and is taking lots of time as we are having nested foreach.

So please suggest some efficient ways to solve this performance aspect of search.

@nraj
I’m afraid there’s currently no better way to search for data across the entire Workbook. If the keywords are strings of the same type, you can consider using regular expressions to improve performance, otherwise you can only match each keyword one by one in the Worksheet.

@nraj

We are afraid there is no better solution to provide better performance or convenience for your requirement about search function.

  1. For multiple search keys
    If those keys are string values, you may try the regex option to combine multiple keys and perform the search in one call. However, if there are too many keys, regex itself also may give poor performance too. And if the keys are different types, even we implement it in one api for user’s convenience, we have to check them one by one in a loop too.

  2. For the search at workbook level
    To implement the search function on workbook level, we also need to traverse all the worksheets one by one, and then all cells one by one. When one cell is found, we have to break the nested loops to return the cell to user. After the found cell was processed by user, then we need to build the nested loops again to search the next one. This process may cause many extra operations of building and breaking nested loops. So we are afraid such one api may provide much poorer performance than the solution that user calls Find method on cells(worksheet) in their own loop.

If you have some ideas or think your case should be able to get better performance, please provide us your sample code and data to show the performance issue. We will investigate it further to see whether we can help to improve it.

Can we perform the search operations parallelly? Like FindAll method can accept keywords/search terms in one go(list) and internally we can trigger searching for all the input keywords/search terms parallelly.

Like for example, something like
Parallel.ForEach(searchDto.SearchTerms, async term =>
{
//logic to search indepentently and return results if completed
});

@nraj ,

We will evaluate your requirement and back with our finds.

Also, if you just do search(not modify the workbook). You may try to instance multi-workbooks from same file in multi-threads, then do part of search in each thread at same time.

I have tried calling workSheet.Cells.Find() method inside Parallel.ForEach on search terms which is giving me “Object reference not set to an instance of an object” exception at workSheet.Cells.Find().

The exception occurrence is intermittent. You need to run the solution multiple times maybe to reproduce it or you can increase the load of search terms in search.json file which I have given with solution(bin folder).

Please find attached the solution zip
ExcelSearch.zip (894.1 KB)

Just a suggestion, implementing parallel searching for search terms passed as List<> within let’s say Aspose.Cells.findMultiple()(proposed) method and consolidating the results and returning will help and can enhance the performance.

@nraj
Thank you for your feedback. We will further investigate your issue. Once we have any new information, we will share it with you We will get back to you soon.

@nraj
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): CELLSNET-55546

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@nraj
When searching string values, cell’s formatted result is required to be built and checked. For performance we use global cache for formatting related data which do not support to be accessed by multiple threads concurrently. It may be the reason of the exception.

From your sample project, we think maybe you only need to check the original cell value of string type, without the need of formatted value or other types. If so, you may use LookInType.OriginalValues instead of LookInType.Values, and ValueTypeSensitive as true. Would you please change it in your code and test it again?

...
                            findOptions.LookInType = LookInType.OriginalValues;
                            findOptions.ValueTypeSensitive = true;
...

And searching original values with strict type also may give better performance.

I have tried with the above suggestion but it’s not working for this attached sheet
searchExcel.zip (26.1 KB)

when we are searching for 1983 whcih is dateTime which is in multiple cells like “B15”.

@nraj
Thank you for your feedback. We will further investigate your issue and inform you promptly of any updates.

@nraj
We are afraid searching the original value cannot work for date time because in fact the date time is formatted result, not the original value. For such kind of situation, the formatting feature is required. Unfortunately, currently the formatting feature does not support to be used in multiple threads concurrently.

We may investigate whether we can make the formatting feature work for multi-threads environment later. However, it is a complicated task so we cannot support it soon. You will be updated here when we get any progress for this feature.

Thank you so much for the update. Could you please let us know if this is feasible and what the approximate timeframe would be to support it sooner?

@nraj
Thank you for your feedback. We will further evaluate your issue. We will get back to you soon.

@nraj

There are two improvements for searching feature we will investigate for the feasibility(according to your requirements):

  1. Multiple keys for one Find() call(CELLSNET-55546)
    We think processing multiple keys in one call may give better performance than processing those keys one by one in multiple calls for the situation of searching in the formatted results of cells. So we also put this task into our future plan. However, we need to make more investigation to design the apis to make them flexible and convenient.

  2. Support to search cells by their formatted results in multi-threads(CELLSNET-55577)
    In fact this requirement requires us to make the formatting feature work for multi-threads environment. This feature may affect many other features so we need to conduct thorough research before we can start the work.

So, according to the complexity(high) and priority(normal) of those two tasks, we don’t think we can support them in near future. Currently we think we may try to support them in the second half of this year.

Since our product includes a feature that allows customers to search through a large list of keywords in Excel and receive results, we consider this to be of utmost importance. Currently, users are waiting for the results on the screen for a long period. Performance is an issue, and the end user experience is subpar when considering search performance. I’m requesting you to prioritize it as well so that you can deliver it as soon as feasible.

@nraj
Thank you for your feedback. Due to the complexity of its functionality, we may not be able to support it in the short term. Once there is an update, we will notify you immediately.

@nraj
Currently those two tickets are supported according to Free Support Policies. In this support queue there have been other tasks with same priority. We need to work on those tasks in the queue one by one.

As workaround, maybe you can implement the search function by yourself to process multiple keys. For example, traverse cells by GetEnumerator() and then traverse all keys to check whether current cell matches one of them.

The issues you have found earlier (filed as CELLSNET-55577) have been fixed in this update. This message was posted using Bugs notification tool by johnson.shi