Benchmark durations to extract content from Excel files in java using Aspose Cells

teju90 · March 14, 2023, 2:34pm

Hi Team, I would like to know if there are any benchmarks made for the time taken to extract the content of excel files of various sizes in Java using Aspose cells. I tried extracting a ~50MB excel file and it took approximately 13 seconds (the file contains 500K rows and 14 columns).

And it took approximately 3 seconds to extract a 2MB file (which contains 20K rows and 13 columns).

Is this inline with the processing capacity of the Aspose cells? Or would you recommend any improvisations in the code to extract the content.

Please let me know.

Thanks
Ravi

teju90 · March 14, 2023, 2:36pm

code snippet.7z (755 Bytes)
Attaching the code snippet for reference.

amjad.sahi · March 14, 2023, 3:05pm

@teju90,

I checked and your code looks Ok.

Since you are extracting a huge data set from the sheet(s), so it will take little time for sure.
May be you could save each sheet in the workbook to separate CSV or text format, it might be little quicker.

johnson.shi · March 15, 2023, 3:28am

@teju90,

If you only need to export cell content without other objects in the Workbook, you may try LightCells and LoadFilter to improve the performance a bit(those apis mainly help for memory performance, generally better memory performance may decrease the time cost a bit too). For using LightCells, you may take this page as reference. And here is an example for you which simulates the logic in your provided code:

LoadOptions opts1 = new LoadOptions();
opts1.getLoadFilter().setLoadDataFilterOptions(LoadDataFilterOptions.CELL_DATA);
opts1.setLightCellsDataHandler(new ExportContent());
Workbook wb = new Workbook("CellsJava41811.xlsm", opts1);

    private static class ExportContent implements LightCellsDataHandler
    {
        private static final int limit = 0x01<<25;
        private final StringBuilder sb = new StringBuilder(limit);
        private int count = 0;
        public boolean startSheet(Worksheet sheet)
        {
            System.out.println("Before processing sheet[" + sheet.getIndex() + "], content length = " + count);
            String s = sheet.getName();
            count += s.length() + 1;
            if(count > limit)
            {
                return false;
            }
            sb.append(s);
            sb.append(',');
            return true;
        }
        public boolean startRow(int rowIndex) {return count < limit;}
        public boolean processRow(Row row) {return count < limit;}
        public boolean startCell(int columnIndex) {return count < limit;}
        public boolean processCell(Cell cell)
        {

            String s = cell.getStringValue(CellValueFormatStrategy.CELL_STYLE);
            count += s.length() + 1;
            if(count > limit)
            {
                return false;
            }
            sb.append(s);
            sb.append(',');
            return false;
        }
    }

teju90 · March 15, 2023, 5:51am

Thanks for the response @johnson.shi. Is this the streaming way of loading the data? Would this load the images (and any other non-text content) in the excel as well while reading? I am primarily concentrated on the cell content itself, but I am just curious to know how the memory footprint would be reduced by using this API.

Appreciate your help.

Thanks
Ravi

John.He · March 15, 2023, 6:13am

@teju90,
LightCells and LoadFilter are used to improve the performance of processing large files. Through inheritance, you can control which worksheets need to import data, and even accurately control the import of rows and cells.

teju90 · March 15, 2023, 6:18am

Got it. Thanks @John.He

John.He · March 15, 2023, 6:25am

@teju90,
You are welcome. If you have any questions, please feel free to contact us.

johnson.shi · March 15, 2023, 6:32am

@teju90,
With the example code we provided, only the cells data will be loaded, without any other objects such as images, shapes, …etc.

The difference of memory cost of this way and the normal way(load the workbook completely) varies from file to file.

Firstly, if there are lots of images in the template file, then the memory which should be used by data of those images will be saved.

Secondly, in the implementation of LightCellsDataHandler.processCell(Cell), we return false which means that the cell’s data will not be kept in cells mode, so no memory will be occupied after it is processed in this method. In such way if there are large amount of cells, you will get much better memory performance.

Anyways, we think you may test and compare two ways according to your requirement and specific template file to get the actual result about performance and then choose the proper one for your business.

teju90 · March 15, 2023, 6:57am

Thanks for more insights @johnson.shi

John.He · March 15, 2023, 7:02am

@teju90,
You are welcome.

teju90 · June 13, 2023, 6:09am

@John.He, one more question on this.

While using the LightCells, do I need to invoke workbook.dipose and workbook.closeAccessCacheOptions to release any cache held up in memory (if any) by the API? Or not needed?

Thanks
Ravi

John.He · June 13, 2023, 6:18am

@teju90,
Yes, you can call Workbook.Dispose to release resources. Workbook.CloseAccessCache and Workbook.StartAccessCache can also be used together. These two functions provide users with the ability to access data in batch mode with better performance.

teju90 · June 13, 2023, 8:19am

Thanks for the quick response @John.He

amjad.sahi · June 13, 2023, 9:01am

@teju90,

You are welcome.

teju90 · June 26, 2023, 2:21pm

hi @amjad.sahi, @John.He

just a follow up question on the above. Can it lead to a memory leak in the application in wb.close() is not called?

P.S: We are NOT using the StartAccessCache and CloseAccessCache apis.

Thanks
Ravi

johnson.shi · June 26, 2023, 2:43pm

@teju90,

Aspose.Cells for Java is a pure Java component. Workbook.close() is mainly for setting related objects as null recursively so that make it possible for GC to start to work earlier, especially when most of the memory available for the JVM has been used. It will not lead to memory leak even though you do not call it.

teju90 · June 26, 2023, 2:58pm

Got it, Thanks @johnson.shi

amjad.sahi · June 26, 2023, 5:16pm

@teju90,

You are come. Should you have further queries or comments, feel free to write us back.

teju90 · September 11, 2023, 1:54pm

Hi Team,

While using the API com.aspose.cells.LoadOptions for extracting the content of various excel file types, I see that it is occupying as much as 1GB just for a 4MB file as soon as the Workbook object is created.

Can you please explain why it is occupying so much of memory for such a small file?

Thanks
Ravi