Benchmark durations to extract content from Excel files in java using Aspose Cells

Hi Team, I would like to know if there are any benchmarks made for the time taken to extract the content of excel files of various sizes in Java using Aspose cells. I tried extracting a ~50MB excel file and it took approximately 13 seconds (the file contains 500K rows and 14 columns).

And it took approximately 3 seconds to extract a 2MB file (which contains 20K rows and 13 columns).

Is this inline with the processing capacity of the Aspose cells? Or would you recommend any improvisations in the code to extract the content.

Please let me know.

Thanks
Ravi

code snippet.7z (755 Bytes)
Attaching the code snippet for reference.

@teju90,

I checked and your code looks Ok.

Since you are extracting a huge data set from the sheet(s), so it will take little time for sure.
May be you could save each sheet in the workbook to separate CSV or text format, it might be little quicker.

@teju90,

If you only need to export cell content without other objects in the Workbook, you may try LightCells and LoadFilter to improve the performance a bit(those apis mainly help for memory performance, generally better memory performance may decrease the time cost a bit too). For using LightCells, you may take this page as reference. And here is an example for you which simulates the logic in your provided code:

LoadOptions opts1 = new LoadOptions();
opts1.getLoadFilter().setLoadDataFilterOptions(LoadDataFilterOptions.CELL_DATA);
opts1.setLightCellsDataHandler(new ExportContent());
Workbook wb = new Workbook("CellsJava41811.xlsm", opts1);

    private static class ExportContent implements LightCellsDataHandler
    {
        private static final int limit = 0x01<<25;
        private final StringBuilder sb = new StringBuilder(limit);
        private int count = 0;
        public boolean startSheet(Worksheet sheet)
        {
            System.out.println("Before processing sheet[" + sheet.getIndex() + "], content length = " + count);
            String s = sheet.getName();
            count += s.length() + 1;
            if(count > limit)
            {
                return false;
            }
            sb.append(s);
            sb.append(',');
            return true;
        }
        public boolean startRow(int rowIndex) {return count < limit;}
        public boolean processRow(Row row) {return count < limit;}
        public boolean startCell(int columnIndex) {return count < limit;}
        public boolean processCell(Cell cell)
        {

            String s = cell.getStringValue(CellValueFormatStrategy.CELL_STYLE);
            count += s.length() + 1;
            if(count > limit)
            {
                return false;
            }
            sb.append(s);
            sb.append(',');
            return false;
        }
    }
1 Like

Thanks for the response @johnson.shi. Is this the streaming way of loading the data? Would this load the images (and any other non-text content) in the excel as well while reading? I am primarily concentrated on the cell content itself, but I am just curious to know how the memory footprint would be reduced by using this API.

Appreciate your help.

Thanks
Ravi

@teju90,
LightCells and LoadFilter are used to improve the performance of processing large files. Through inheritance, you can control which worksheets need to import data, and even accurately control the import of rows and cells.

1 Like

Got it. Thanks @John.He

@teju90,
You are welcome. If you have any questions, please feel free to contact us.

@teju90,
With the example code we provided, only the cells data will be loaded, without any other objects such as images, shapes, …etc.

The difference of memory cost of this way and the normal way(load the workbook completely) varies from file to file.

Firstly, if there are lots of images in the template file, then the memory which should be used by data of those images will be saved.

Secondly, in the implementation of LightCellsDataHandler.processCell(Cell), we return false which means that the cell’s data will not be kept in cells mode, so no memory will be occupied after it is processed in this method. In such way if there are large amount of cells, you will get much better memory performance.

Anyways, we think you may test and compare two ways according to your requirement and specific template file to get the actual result about performance and then choose the proper one for your business.

1 Like

Thanks for more insights @johnson.shi

@teju90,
You are welcome.

@John.He, one more question on this.

While using the LightCells, do I need to invoke workbook.dipose and workbook.closeAccessCacheOptions to release any cache held up in memory (if any) by the API? Or not needed?

Thanks
Ravi

@teju90,
Yes, you can call Workbook.Dispose to release resources. Workbook.CloseAccessCache and Workbook.StartAccessCache can also be used together. These two functions provide users with the ability to access data in batch mode with better performance.

Thanks for the quick response @John.He

@teju90,

You are welcome.

hi @amjad.sahi, @John.He

just a follow up question on the above. Can it lead to a memory leak in the application in wb.close() is not called?

P.S: We are NOT using the StartAccessCache and CloseAccessCache apis.

Thanks
Ravi

@teju90,

Aspose.Cells for Java is a pure Java component. Workbook.close() is mainly for setting related objects as null recursively so that make it possible for GC to start to work earlier, especially when most of the memory available for the JVM has been used. It will not lead to memory leak even though you do not call it.

Got it, Thanks @johnson.shi

@teju90,

You are come. Should you have further queries or comments, feel free to write us back.

Hi Team,

While using the API com.aspose.cells.LoadOptions for extracting the content of various excel file types, I see that it is occupying as much as 1GB just for a 4MB file as soon as the Workbook object is created.

Can you please explain why it is occupying so much of memory for such a small file?

Thanks
Ravi