Parsing Large Xlsx Files

Sundarcj · May 17, 2019, 9:01am

Hi,
I am trying to parse 500MB XLSX files using the LightCellsDataHandler API. I am not creating any other objects, but my heap size still grows to 3GB.

Note: in the processCell method I am just returning false.

amjad.sahi · May 17, 2019, 12:38pm

@Sundarcj,

Thanks for your query.

If you are loading such a huge file, a certain amount of memory will be consumed for sure. Generally, 8-10 times the size of the file or more is used in memory (in normal mode), so this looks OK to me. By the way, when you open such a big file in MS Excel manually, it also consumes memory and takes a long time to load.

amjad.sahi · May 20, 2019, 7:18am

@Sundarcj,

We evaluated your issue further. When processing a common template file with LightCells, the memory cost should not be as large as you pointed out. Please send us the runnable code and template file, and we will investigate further and try to figure out your issue.

Sundarcj · June 12, 2019, 5:27am

Can you share your email ID so that I can share my test files to your drive?

Sundarcj · June 12, 2019, 5:29am

What about metadata detection? For example, I want to detect each cell's data type, which I can get from your API. But I need to know: do you flush out the metadata objects as you do with light cells?

amjad.sahi · June 12, 2019, 11:01am

@Sundarcj,

As requested earlier, please send us the runnable code and template file. Zip the project and template file, upload the archive to a file sharing service (e.g. Dropbox, Google Drive), and share the download link here. We will check it soon.

PS. We cannot receive such a huge file and project via email.

Sundarcj · June 13, 2019, 6:23pm

https://drive.google.com/open?id=1hREqynsvJk4VyugpXOxyuKykGBu_6A1P

ahsaniqbalsidiqui · June 14, 2019, 5:16am

@Sundarcj,
We are checking the data and will share our feedback with you here soon.

Sundarcj · June 14, 2019, 6:08am

Okay, thank you. One more query: I have a workbook with multiple sheets, total size around 600MB, and Aspose is unable to parse this file.

amjad.sahi · June 14, 2019, 9:05am

@Sundarcj,

Thanks for the template file and sample code.

After an initial test, I was able to reproduce the performance issue using your sample code with your template file (454MB). When parsing the large XLSX file in lightweight mode, memory usage goes high and the process takes a long time to complete; I even got “java.lang.OutOfMemoryError: Java heap space”. I have logged an investigation ticket with the ID “CELLSJAVA-42935” for your issue. Since the file is very large, it will certainly take more time and consume more resources to process. In any case, we will look into your issue soon.

Once we have an update on it, we will let you know.

amjad.sahi · June 14, 2019, 9:07am

This is a similar case. In any case, you may share the file and we will evaluate it as well.

Sundarcj · June 15, 2019, 4:22am

@Amjad_Sahi Thank you so much. How much time would it take?

amjad.sahi · June 15, 2019, 9:46am

@Sundarcj,

Since we have only just logged the issue, please allow us a little time (3-5 days or so) to evaluate it thoroughly before we can commit to an ETA or provide an update.

Once we have any new information, we will share it with you.

ahsaniqbalsidiqui · June 18, 2019, 7:38am

@Sundarcj,
There is a large number of global cached string values in the template file. In an XLSX file, such string values must be loaded entirely before the cell data of any sheet is processed. In our test with the given template file, loading those string values into the cells model requires at least 2GB of memory. To make the program run successfully with this file, we think the JVM needs at least 2.5-3GB of memory.
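
One rough way to gauge this before loading a file is to look at the shared strings part inside the XLSX package (an XLSX file is a ZIP archive). The following is a minimal sketch using only the JDK; it does not use Aspose.Cells. It assumes the strings are stored under the conventional part name xl/sharedStrings.xml, and the class name SharedStringsProbe is made up for illustration.

```java
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of any library.
public class SharedStringsProbe {

    // Assumption: the shared strings table lives at the conventional part name.
    private static final String SHARED_STRINGS_PART = "xl/sharedStrings.xml";

    /**
     * Returns the uncompressed size in bytes of the shared strings part,
     * or -1 if the part is absent or its size is not recorded.
     */
    public static long sharedStringsBytes(String xlsxPath) throws IOException {
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            ZipEntry entry = zip.getEntry(SHARED_STRINGS_PART);
            return entry == null ? -1L : entry.getSize();
        }
    }

    public static void main(String[] args) throws IOException {
        long bytes = sharedStringsBytes(args[0]);
        System.out.println("Shared strings part, uncompressed bytes: " + bytes);
    }
}
```

The reported number is only an indicator: the memory needed to hold the strings once loaded will differ from the size of the XML on disk. A very large value does suggest that most of the heap will go to the string table rather than to cell processing.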

Sundarcj · June 19, 2019, 4:53am

Okay, thank you. Can I give an HttpInputStream as the input parameter to the Aspose parser?

ahsaniqbalsidiqui · June 19, 2019, 8:44am

@Sundarcj,
An overload of the Workbook constructor accepts an InputStream, so if you can cast your HttpInputStream to it, you can pass it as the input parameter.
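
For illustration, here is a minimal sketch of obtaining such a stream with the JDK alone. The stream returned by java.net.URLConnection.getInputStream() is already typed as InputStream, so no cast is needed in that case. The class name RemoteWorkbookStream and the timeout values are assumptions made for this example, and the Workbook call appears only as a comment because it needs Aspose.Cells on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of any library.
public class RemoteWorkbookStream {

    /** Opens the resource at the given URL and returns it as a buffered InputStream. */
    public static InputStream open(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(10_000); // assumed value, in milliseconds
        conn.setReadTimeout(60_000);    // assumed value, in milliseconds
        return new BufferedInputStream(conn.getInputStream());
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = open(args[0])) {
            // With Aspose.Cells available, the stream would be handed over here:
            // Workbook workbook = new Workbook(in);

            // Without it, this sketch just counts the bytes received.
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
            System.out.println("Bytes read: " + total);
        }
    }
}
```

The try-with-resources block closes the stream once it has been consumed, which also releases the underlying connection.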