Monitoring/aborting loading

Hi,

When loading a workbook, it is hard to predict how large the final in-memory model will be.
Note that we are already using MemorySetting.MEMORY_PREFERENCE, as well as loading specific sheet(s) only (LoadDataOption.SheetNames).

Sometimes, users of our product try to process very large workbooks (100MB or more), and we run the risk of running out of memory, and destabilizing the server. However, not all sheets are “large”, so we don’t want to restrict usage of our product only by file size.

It would be nice to have an event listener that can be registered during workbook loading, to keep track of how much content has been loaded so far, and optionally abort the load process. I’m thinking about something similar to com.aspose.cells.AbstractCalculationMonitor#beforeCalculate(). If the total number of cells loaded having real content exceeds a limit (100.000? 1.000.000?), I’d like to proactively abort the load process with an exception, rather than consuming all the heap of the server.

I hope the problem description is clear. Let me know if you have any questions, or a proposal for an API for this.

Kind regards,
Taras

@TarasTielkes,
Thanks for your query.
We have understood your requirement but we need to look into it more. We have logged the issue in our database for investigation. Once, we will have some news for you, we will update you in this topic.

This issue has been logged as

 CELLSNET-46606 - Event listener for memory size while loading huge files

@TarasTielkes,

We have investigated it a bit. There is one class to monitor the loading process: InterruptMonitor. You may check some conditions in another thread and call InterruptMonitor.Interrupt() method to terminate the process. However, you should keep it in mind that the interruption may not be so timely because API cannot perform the interruption check too frequently for performance consideration.

See the document with example code for your reference:

Hope, this helps.

Hi @ahsaniqbalsidiqui,

Thanks for the feedback, I was thinking also about the InterruptMonitor feature before I created the initial post.
However, it will be a bit hard to determine the conditions to interrupt loading from another thread, since it is hard to keep track of how much content has been loaded so far, and how much memory is retained by this.

Ideally, the loading process would provide a progress listener interface that would give callbacks upon loading data. This could be after each cell, or or a more coarse boundary - for example every 1000 cells loaded. Note that in our use cases, we are typically loading only a single sheet, so the granularity would need to be finer than sheet-level, in order to be able to abort a loading process that would otherwise pull in all the data for a large sheet.

For us, this would be a longer-term stability enhancement in our integration with Aspose Cells. We are happy you are investigating the possible options, and we will provide feedback based on the ideas you come up with.

Kind regards,
Taras

@TarasTielkes,

We have logged your feedback under the same issue and we will update you once any further information is available.

@TarasTielkes,

For the representation of loading process, it is much more complex than the one of formula calculations. As you mentioned, one difficulty is the determination of granularity (finer one may give timely response but also it will cause performance issue). And for memory cost, it also depends on many factors, not only cell count in the sheet. For example, some data is in global cache and may have been loaded before loading the cell objects and maybe the memory cost has exceeded your limit after loading it firstly.

So, we are afraid it is hard to implement an universal mechanism to measure the cost of loading template file for all users and all scenarios. It may be more feasible for you to determine the condition and measurement according to your application environment and specific scenario.

No matter the working manner of the monitor (in the same thread or a different one), the determination of interruption can only be made by your own logic. For InterruptMonitor, it is used in many processes, not only loading so we make it as the global property of Workbook. And for performance considerations for most users, there is no requirement of wrapping any progress information for it. Otherwise providing wrapped information with callback also may cause performance for most users, especially when the granularity is fine.

Hi @Amjad_Sahi,

Thanks for the detailed feedback. I am somewhat familiar with the Excel file formats (out of necessity, since we often have to diagnose performance bottlenecks for our product), so I can understand our points - they make sense.

One alternative I was thinking of was to provide LoadOptions with a maximum cells with values property.

During the data loading process, Aspose Cells could check if the number of loaded cells with real content exceeds the provided threshold, and if it does, throw an exception. Such an approach sounds relatively isolated, while still providing a safeguard against loading too much data. The check could also be done loosely, for instance after loading each 1000 cell values.

I am interested in your feedback.

Kind regards,
Taras

@TarasTielkes,

We have logged your comments for further analysis. We will write back here once any feedback is ready to share.

@TarasTielkes,

For your requirement, we think you can control it by LightCells APIs with more flexibility. We made an example according to your requirement, see the following sample code for your reference:
e.g
Sample code:

        LoadFilter filter = new LoadFilter();
        InterrupHandler handler = new InterrupHandler(limit, filter);
        Workbook wb = new Workbook(path, new LoadOptions() { LoadFilter = filter, LightCellsDataHandler = handler });
        if(handler.Exceeded) //handle the situation that the workbook has not been loaded fully
        ...

    class InterrupHandler : LightCellsDataHandler
    {
        private readonly int mCellLimit;
        private readonly LoadFilter mLoadFilter;
        private int mCellCount = 0;
        private bool mExceeded = false;

        /// <summary>
        /// 
        /// </summary>
        /// <param name="cellLimit">cell count limit for the whole Workbook</param>
        /// <param name="filter">the LoadFilter used to load one Workbook. If it is null, just throw exception to terminate the loading process</param>
        public InterrupHandler(int cellLimit, LoadFilter filter)
        {
            mCellLimit = cellLimit;
            mLoadFilter = filter;
        }

        public bool Exceeded{get { return mExceeded; }}

        public bool StartSheet(Worksheet sheet)
        {
            if (mExceeded)
            {
                return false;
            }
            return true;
        }
        public bool StartRow(int rowIndex)
        {
            return true;
        }

        public bool ProcessRow(Row row)
        {
            return true;
        }

        public bool StartCell(int column)
        {
            if (mExceeded)
            {
                return false;
            }
            mCellCount++;
            if (mCellCount > mCellLimit)
            {
                mExceeded = true;
                if (mLoadFilter == null)
                {
                    throw new Exception("Too large dataset, interrupt the process");
                }
                //instead of terminate the process by Exception, here you can reset the LoadFilter to ignore all remaining data.
                mLoadFilter.LoadDataFilterOptions = LoadDataFilterOptions.Structure;
                return false;
            }
            return true;
        }

        public bool ProcessCell(Cell cell)
        {
            return true; //keep the cell data in cells model.
        }
    }

Hi @Amjad_Sahi,

Thank you for the feedback. While I was aware of the light cells API, I did not realize it could be used for this purpose.
I will experiment with the approach you suggest, and will let you know the results.

Kind regards,
Taras

@TarasTielkes,

Sure, take your time to evaluate and implement the suggested approach in your scenarios. Hopefully it will suit your needs well.