Does Aspose load entire .xls into memory

deepu94duggu · January 7, 2022, 9:09am

Hi, We are planning to use Aspose for .xls report generation that could have huge data 100 MB to 1 GB. This report will keep getting appended throughout the day whenever the report job runs. We want to know how does Aspose deal with appending to 1 GB files. Does it load everything into memory, append and write it back ? Do you have any metrics ? Could you share some sample code to do it efficiently ?

amjad.sahi · January 7, 2022, 12:38pm

@deepu94duggu,

Generally, in normal mode, yes, it loads/writes into memory. 1GB data is huge, so surely, it will consume resources (memory, CPU, etc.).

By the way, Aspose.Cells also provides light weight model named LightCells for the purpose (writing and reading large data in Excel spreadsheet), For example, the LightCells API is useful for creating huge Excel spreadsheets quickly and efficiency. See the document for your reference:

Aspose.Cells provides the LightCells API, mainly designed to manipulate cell data one by one without building a complete data model block (using the Cell collection etc.) into memory. It works in an event-driven mode.

Moreover, if you are only concerned about memory, Aspose.Cells provides an option e.g., MemorySetting.MemoryPreference for memory performance consideration, so you may try it. Please see the topic for your complete reference on how to optimize memory usage while reading/rendering bigger files having large data sets etc.:

deepu94duggu · January 7, 2022, 1:43pm

I don’t see much difference on using the MemoryPreference flag.
With flag set :
image.png (14.4 KB)

Without flag:
image.png (8.3 KB)

This is a read operation of an xlsx file that is 30 MB on disk and 108 MB in a text file. All the data is simple text.

amjad.sahi · January 7, 2022, 2:31pm

@deepu94duggu,

Do you just load the file using Workbook constructor (while setting attribute(s) for LoadOptions) or read/retrieve data row/column wise in the sheet(s)? If you are reading data in different cells, you may try to use LightCells APIs.

Also, you may share your sample file(s) and sample code (runnable) to trace the issue. You may zip the resource files and upload to some file sharing service (e.g. dropbox, Google drive, etc.) and share the download link (without any credentials). We will check it soon.

johnson.shi · January 10, 2022, 1:40am

@deepu94duggu,
MemoryPreference mode mainly works for the scenario that there are large amount of cells with mixed data and the memory optimization is mainly works for data type such as bool, error, null, int, .etc. For cells with string values or formulas, memory is mainly used by string values and binary data which cannot be optimized further.

For your requirement and situation, if most of your cells are string values, MemoryPreference may not help. You may try LightCells Apis to write data into the file of destination directly. Or, we think if you can split your data into multiple workbooks, it should be much better for performance than creating one single file with large size. If you do need a single file for the report of one day, combining them at the end of the day should be better.

deepu94duggu · January 10, 2022, 7:36am

Yes, all of my data is string. Thanks for the detail. Does LightCells also have efficient append capability ?

johnson.shi · January 10, 2022, 8:14am

@deepu94duggu,

For your situation, LightCells should be used for Workbook.Save(). In the process of saving workbook, the appended data will be output the destination continuously. Such mechanism is sure helpful for performance, especially when there are large amount of cells need to be saved.

However, because the generation of your report takes long time(it seems should be all the day), so I am afraid you have to provide your own mechanism to control the progress. For example, in your implementation of LightCellsDataProvider, you need to pause the thread of saving workbook until there is new data coming for being appended to the workbook.

deepu94duggu · January 10, 2022, 8:24am

We have a scheduled job that will run every configured amount of time (say every 1 hr) which appends data to the existing report of that day (say Report_02102022). My question here is to perform the append, does LightCells again load entire existing Report (could be huge) to memory.

johnson.shi · January 10, 2022, 8:58am

@deepu94duggu,

We are not sure about your complete procedure of generating the reports. Do you mean you need to load existing(large) report and then append new data to it? If so, we are afraid all data of the existing report needs to be loaded into memory. What we talked about LightCells which may help to improve performance is to start work from an empty workbook, because the received cells data will be saved to destination directly, so no large data set(cells data) is kept in memory(except some binary data cache, such as caches for IO). But if you need to append data to the existing report, then we need to parse and gather all needed information of existing data, so firstly the existing report needs to be loaded completely.

deepu94duggu · January 10, 2022, 9:03am

Thanks for your prompt responses. We will evaluate and get back.