Read huge excel file

aabuhijleh · September 12, 2024, 3:10pm

I have a huge Excel file that contains 1 million rows but only 100 rows have data.
When I try to read it I get out of memory because the file loaded in memory.
Is there a way in Aspose cells to remove all the empty rows before or during reading the file content which would be very helpful and will prevent the out of memory?

Thanks

Professionalize.Discourse · September 12, 2024, 3:10pm

@aabuhijleh

To efficiently read a large Excel file with many empty rows using Aspose.Cells for Java, you can optimize memory usage and avoid loading unnecessary data into memory. Here are some strategies you can implement:

Use Memory Settings: You can set the memory preference to optimize memory usage when reading large datasets. This can help reduce the overall memory cost. Use the MemorySetting.MEMORY_PREFERENCE option instead of the default MemorySetting.NORMAL. This adjustment can save memory when working with large datasets.

Here’s an example of how to set the memory preference:
```
// Load the workbook with optimized memory settings
Workbook workbook = new Workbook("path/to/your/largefile.xlsx");
workbook.getSettings().setMemorySetting(MemorySetting.MEMORY_PREFERENCE);
```

Iterate Through Rows: Instead of loading the entire workbook into memory, you can iterate through the rows and only process those that contain data. This way, you can skip empty rows and minimize memory usage.

Here’s a code snippet to demonstrate this:

// Load the workbook
Workbook workbook = new Workbook("path/to/your/largefile.xlsx");
Worksheet worksheet = workbook.getWorksheets().get(0); // Access the first worksheet

// Iterate through rows
for (int i = 0; i < worksheet.getCells().getMaxDataRow(); i++) {
    Row row = worksheet.getCells().getRows().get(i);
    if (row.getCount() > 0) { // Check if the row has any data
        // Process the row
        for (int j = 0; j < row.getCount(); j++) {
            Cell cell = row.get(j);
            System.out.println(cell.getStringValue());
        }
    }
}

Consider Using LoadOptions: If you are only interested in specific data, you can use LoadOptions to specify which parts of the file to load, further reducing memory usage.

By implementing these strategies, you should be able to read your large Excel file without running into memory issues.

For more detailed information on optimizing memory usage while working with large datasets, you can refer to the Aspose documentation on this topic.

Sources:
[1]: Optimizing Memory Usage while Working with Big Files having Large Datasets

amjad.sahi · September 12, 2024, 5:23pm

@aabuhijleh,

Could you please provide us your large Excel file (please zip the Excel file prior sharing), we will check your issue soon.

aabuhijleh · September 12, 2024, 6:45pm

Below the file, I checked your API and found LightCellsDataHandler.
How can I get the rows with data without loading the entire file into memory?
Testfile.zip (4.3 MB)

Thanks

amjad.sahi · September 12, 2024, 7:20pm

@aabuhijleh,

Thanks for the template XLSX file.

I tested your scenario/case using the simplest lines of code and it works fine. I do not get out of memory or any other error. I am using latest version/fix: Aspose.Cells for Java v24.9 (please try it). I also tried to convert to PDF after reading the file into Aspose.Cells’ object model, it works fine too. The output PDF file (attached) is fine tuned.
e.g.,
Sample code:

Workbook workbook = new Workbook ("d:\\files\\Testfile.xlsx");
workbook.save("d:\\files\\out1.pdf");

out1.pdf (52.7 KB)

johnson.shi · September 13, 2024, 2:21am

@aabuhijleh

For using LightCells to load data, please see the document: Using LightCells API|Documentation. You may just keep those data you need in your implementation of LightCellsDataHandler.startCell(int columnIndex)(return false for all cells that you do not need).

If ignoring empty cells is enough for your situation, the better and easier way is just using LoadOptions with specified LoadFilter or LoadDataFilterOptions. Code example:

        LoadOptions lo = new LoadOptions();
        lo.getLoadFilter().setLoadDataFilterOptions(LoadDataFilterOptions.ALL & ~LoadDataFilterOptions.CELL_BLANK);
        new Workbook("Testfile.xlsx", lo);

By our test, loading your template file requires no more than 100M memory with above code. If we do not specified the LoadOptions with LoadDataFilterOptions, the required memory is about 200M.