Out Of Memory with big files

atlassian · May 14, 2014, 2:35am

When reading a very big XLS file using Aspose.Cells 7.7.2 (for example, create a new spreadsheet file, go to the last row and column and add some value in that cell), we end up with OOMEs by just reading the file and iterating over the list of cells (not doing anything but getValue() on every Cell object)

It seems the Cell, Row and RowCollection objects are being lazy initialized (good) but retained via strong references (so they cannot be garbage collected) without reason (obviously bad).

Can this get fixed?

babar.raza · May 14, 2014, 2:51am

Hi Agnes,

Thank you for contacting Aspose support.

With recent releases of Aspose.Cells for Java API, we have provided the Memory Optimization Preferences that you may utilize in your application to reduce the overall memory cost while working with spreadsheets having large data sets. We would request you to please check the below linked technical article and give it a try on your end with the latest version of Aspose.Cells for Java 8.0.1 to see if it makes any difference in your particular scenario.

http://www.aspose.com/docs/display/cellsjava/Optimizing+Memory+Usage+while+Working+with+Big+Files+having+Large+Datasets

In case the problem persists, we will require your test application (covering the complete scenario for the sake of problem replication) and the template spreadsheet for our review.

babar.raza · May 14, 2014, 3:30am

Hi again,

Adding more to my previous response, if you only have to traverse all cells to read their values or properties when it is not empty, please use Cells.iterator()/RowCollection.iterator()/Row.iterator() to traverse all existing Row/Cell objects. This is the most efficient way to access all existing Row/Cell objects in one worksheet(Cells Collection).

If you have to use index or name of a cell to check and read cells for some special purpose, please use Cells.checkCell(int, int)/Cells.checkRow(int) instead of Cells.get(int, int)/RowCollection.get(int). The method get() will create one Cell/Row object and keep its data in the Cells collection if the required Cell/Row object has not been initialized. For your case, if you call get() method for all cell in a worksheet, the large amount of objects will surely use large amount of memory. However, the method checkXXX() will return null for those Cell/Row objects that have not been initialized, without those empty cells being created into Cells collection, memory cost will not become a problem.

atlassian · May 14, 2014, 4:34am

Hi,

Thanks for your response.

If you have to use index or name of a cell to check and read cells for some special purpose, please use Cells.checkCell(int, int)/Cells.checkRow(int) instead of Cells.get(int, int)/RowCollection.get(int).

Having other options to check the cells is nice, but I still don’t understand why the objects are strongly referenced when every one of them is independent of each other. In the example I proposed, almost every cell will be empty, but in a real example with lots of cells, that won’t be the case, so this will still be happening.

The method get() will create one Cell/Row object and keep its data in the Cells collection if the required Cell/Row object has not been initialized.

That’s the thing I don’t understand. Lazy initialization is ok, but after using the Cell object, and if I don’t have a reference to it, it should be garbage collected, not being kept by the library. If it wasn’t on memory before, there is no reason for that Cell to stay when I’m processing a different one.

I’ll try the new version and I’ll try using the methods you comment, but even if using checkCell()/checkRow() works, my program will still be using way more memory than necessary when processing the file, and this is worrisome.

babar.raza · May 14, 2014, 5:59am

Hi Agnes,

Please try the options discussed here on your end, and feed us back with your results. Thanks in advance.

atlassian · May 14, 2014, 10:30pm

I’m glad to say that after using 8.0.2 and enabling the new memory options, the cells are no longer retained in memory even after iterating over every single one of them.

I’m not sure what did the trick (the update or the memory flag) but I’ll figure it out later.

Thanks!

babar.raza · May 15, 2014, 3:16am

Hi Agnes,

Its good to know that you have moved past the exception.

Please note, the Memory Optimization Preferences are only available in recent releases of Aspose.Cells APIs. Moreover, starting from v8.0.0 we have changed the behavior of Row & Cell collections to improve the overall memory utilization, therefore if you are not seeing the OutOfMemoryException, it could be due to the both, that is; up-gradation to the latest version as well as setting the memory preferences.

Please check the following articles for your kind reference,