Aspose Cells LightCells fails to read huge file

peeveen · November 8, 2018, 9:35am

Hi,
We’ve been trying out LightCells ability to read huge XLSX files (we have one that is 488MB … one million rows, 21 columns of various datatypes) and it instantly fails:

Java HotSpot™ 64-Bit Server VM warning: INFO: os::commit_memory(0x000000008d700000, 488636416, 0) failed; error=‘Not enough space’ (errno=12)

There is insufficient memory for the Java Runtime Environment to continue.

Native memory allocation (mmap) failed to map 488636416 bytes for committing reserved memory.

Writing in Kotlin, the code I’m using is simply:

            val loadOptions=LoadOptions()
            loadOptions.lightCellsDataHandler=MyLightCellsDataHandler()
            Workbook("hugeFile.xlsx",loadOptions)

The same file loads OK with Apache POI … which is annoying, because we’ve found Aspose to be generally much faster at processing smaller files.

Is there any way around this?

ahsaniqbalsidiqui · November 8, 2018, 5:11pm

@peeveen,

Thanks for your query.

Please share your sample file and console application(runnable) with us for our testing. We will reproduce the problem and provide our feedback after analysis.

peeveen · November 9, 2018, 10:11am

Hi,

The XLSX file is too large to attach here, so here’s a link to the file, shared in my Google Drive:
https://drive.google.com/file/d/1Nx04rnfCXQY8EGj9lkhGCMxyYlziP-Ip/view?usp=sharing

Here’s the test program I’m using. It’s written in Kotlin (1.3). I’m running it inside IntelliJ, in case that matters.

import com.aspose.cells.*

class LCDH: LightCellsDataHandler {
    override fun processCell(cell: Cell?): Boolean {
        return false
    }

    override fun processRow(row: Row?): Boolean {
        return true
    }

    override fun startSheet(worksheet: Worksheet?): Boolean {
        return worksheet?.index==0
    }

    override fun startCell(column: Int): Boolean {
        return true
    }

    override fun startRow(rowNumber: Int):Boolean {
        if(rowNumber%10000==0)
            println("Row $rowNumber reached.")
        return true
    }
}

fun main()
{
    val loadOptions=LoadOptions()
    loadOptions.lightCellsDataHandler=LCDH()
    Workbook("aspose_meellionRows.xlsx",loadOptions)
}

The process very quickly grabs almost 2 gigabytes of memory, then increases slowly to 2.1GB.

It works for a while, but then gets slower and slower, until …

...
Row 810000 reached.
Row 820000 reached.
Row 830000 reached.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.HashMap.resize(HashMap.java:704)
	at java.base/java.util.HashMap.putVal(HashMap.java:663)
	at java.base/java.util.HashMap.put(HashMap.java:612)
	at com.aspose.cells.zbuy.a(Unknown Source)
	at com.aspose.cells.zbuy.a(Unknown Source)
	at com.aspose.cells.zano.a(Unknown Source)
	at com.aspose.cells.zano.B(Unknown Source)
	at com.aspose.cells.zano.A(Unknown Source)
	at com.aspose.cells.zano.z(Unknown Source)
	at com.aspose.cells.zano.a(Unknown Source)
	at com.aspose.cells.zclz.a(Unknown Source)
	at com.aspose.cells.zd.d(Unknown Source)
	at com.aspose.cells.zd.p(Unknown Source)
	at com.aspose.cells.zd.a(Unknown Source)
	at com.aspose.cells.zanm.a(Unknown Source)
	at com.aspose.cells.Workbook.a(Unknown Source)
	at com.aspose.cells.Workbook.a(Unknown Source)
	at com.aspose.cells.Workbook.<init>(Unknown Source)
	at AsposeTestKt.main(AsposeTest.kt:31)
	at AsposeTestKt.main(AsposeTest.kt)

Process finished with exit code 1

If I replace the “return false” in processCell() with “return true”, the process dies around the 400,000 row mark.

I’ve run a similar test using the Apache POI libraries (XSSFSheetXMLHandler). The java.exe process never exceeds 200MB, although it is a bit slower than Aspose.

ahsaniqbalsidiqui · November 9, 2018, 5:30pm

@peeveen,

Thank you for sharing sample file and code. We are downloading this file and will provide our feedback after testing the issue using this sample file.

ahsaniqbalsidiqui · November 11, 2018, 7:39am

@peeveen,

We were able to observe the issue but we need to look into it more. We have logged the issue in our database for investigation and for a fix. Once, we will have some news for you, we will update you in this topic.

This issue has been logged as

CELLSJAVA-42748 - LightCells fails to load huge file

amjad.sahi · November 19, 2018, 8:39am

@peeveen,

We evaluated your issue further. The issue is due to the fact that the file data itself is too large. Currently we need to keep the original data of the file in memory while processing the workbook, so when processing larger file, then more memory will be required. We will optimize the process to improve memory performance. However, it is a complicated task and we need some time to finish it.

Once we have any new information, we will share it.

ahsaniqbalsidiqui · December 13, 2018, 8:01am

@peeveen,

This is to inform you that we have fixed your issue (logged earlier as “CELLSJAVA-42748”) now. We will soon provide you the fixed version after performing QA and incorporating other enhancements and fixes.

We have improved the memory usage for loading large template files. And it is strongly recommended to use file to instantiate Workbook instead of InputStream because using file can give better memory performance after the improvement made in new versions.

ahsaniqbalsidiqui · December 14, 2018, 8:20am

@peeveen,

Please try our latest version/fix: Aspose.Cells for Java v18.11.7:

aspose-cells-18.11.7.jar.zip (6.4 MB)

Your issue should be fixed in it.

Let us know your feedback.

aspose.notifier · December 22, 2018, 3:04pm

The issues you have found earlier (filed as CELLSJAVA-42748) have been fixed in Aspose.Cells for Java 18.12. You can also get the latest Aspose.Cells for Java version from Maven repos. with simple configurations. Please see the document for your reference: Installation|Documentation

This message was posted using BugNotificationTool from Downloads module by Amjad_Sahi

paypaluser · September 22, 2020, 9:43pm

Hi

I am also trying to convert big html file to excel using the following code. Would like to know whether this will not cause memory issue meaning it will not load the entire file in memory during conversion. Please confirm. I would prefer aspose to convert partial bytes and flush them to outputstream instead of loading the whole html file to convert.

  private File convertHtmlToExcel(File htmlFile) throws Exception{
    //Create html load options
    HtmlLoadOptions opts = new HtmlLoadOptions();
    opts.setAutoFitColsAndRows(true);
    opts.setConvertNumericData(false);  
    //Create workbook from your HTML string
    String fileName = htmlFile.getAbsolutePath()+".xlsx";
    File f = new File(fileName);
    try(BufferedOutputStream excelOut = new BufferedOutputStream(new FileOutputStream(f))) {
        Workbook wb = new Workbook(htmlFile.getAbsolutePath(), opts);
        wb.getWorksheets().get(0).setGridlinesVisible(true);
        //Save the workbook in output format
        wb.save(excelOut, new SpreadsheetML2003SaveOptions(SaveFormat.XLSX));
    }
    return f;
}

ahsaniqbalsidiqui · September 23, 2020, 12:39am

@paypaluser,
You may share your sample Html file for our reference. We will use it to reproduce the issue here and provide feedback accordingly.

paypaluser · September 23, 2020, 5:35am

There is no issue. This is just a general question if I am trying to convert a huge html file to excel using the code I mentioned above, will the aspose load the entire html file in memory to do conversion? If so is there a way to avoid loading the entire file in memory instead use streams to convert the file? The reason am asking is we have a huge file to convert. So wondering whats the best way to do this without having aspose to load the entire file in memory

ahsaniqbalsidiqui · September 23, 2020, 7:38am

@paypaluser,
For now, when you import html file to excel, the API reads from file stream. So API will not load the entire html to memory. Let us know if you notice any issue.

paypaluser · September 23, 2020, 8:11am

So you mean to say with the code snippet I shared earlier, since am using file to construct the workbook, it will use the file stream and so will not load the entire html in memory.

If I use the ByteArrayInputStream as follows, will it load the entire bytes in memory?
Workbook wb = new Workbook(new ByteArrayInputStream(htmlcontentBytes), opts);

amjad.sahi · September 23, 2020, 10:10am

@paypaluser,

For your requirements, we think the only available optimization is to use memory preference mode (see the document for your reference). If you are using InputStream, generally, all data in the stream should be loaded into memory before importing the template file because sometimes the stream needs to be accessed randomly.

paypaluser · September 23, 2020, 4:10pm

I am bit confused. In my code snippet I used file and in another snippet I used ByteStream. Are you saying in both cases it loads the entire file in memory? But your other colleague said that if we use the file for constructing workbook it will not load the entire file in memory. Please clarify

private File convertHtmlToExcel(File htmlFile) throws Exception{
//Create html load options
HtmlLoadOptions opts = new HtmlLoadOptions();
opts.setAutoFitColsAndRows(true);
opts.setConvertNumericData(false);  
//Create workbook from your HTML string
String fileName = htmlFile.getAbsolutePath()+".xlsx";
File f = new File(fileName);
try(BufferedOutputStream excelOut = new BufferedOutputStream(new FileOutputStream(f))) {
    Workbook wb = new Workbook(htmlFile.getAbsolutePath(), opts);
    wb.getWorksheets().get(0).setGridlinesVisible(true);
    //Save the workbook in output format
    wb.save(excelOut, new SpreadsheetML2003SaveOptions(SaveFormat.XLSX));
}
return f;

}

By the way I looked at the document you referred for memory preference mode. At the bottom it also Caution saying if we have more insert/delete there will be performance degrade. In my code snippet above all am doing is using aspose API to convert html file to Excel. Am not sure what’s happening inside that Api (whether it’s inserts/delete). So am not sure whether this is a good place to use memory preference mode. Please advise.

In the above code snippet, can I use the ByteStream to load the html and Write them back in ByteStream with memory mode set to PREFERENCE? Will it work? without using BufferedOutputStream. Is there any preferred method of loading the file and writing the file to improve memory foot print?

amjad.sahi · September 23, 2020, 5:35pm

@paypaluser,

Yes, commonly for MS Excel files, both options (using filepath and memory streams) require complete data in the workbook should be loaded into memory. However, for LightCells API it is different which is mainly designed to manipulate cell data one by one without building a complete data model block into memory, so maybe he meant by it. Anyways, for reading HTMLs (into Aspose.Cells object model), we will provide you complete (internal) mechanism for your reference soon.

Moreover, seeing your code segment, you are not violating the mentioned tasks (mentioned in “Caution”), so you can use MemoryPreference option. Please note, “Accessing Cells Randomly and Repeatedly” and “Inserting & Deleting Cells & Rows” refer to the tasks which you manually perform in code. Please note, in memory preference mode, data is read in some sort of compact form, so memory would be saved for sure.

paypaluser · September 23, 2020, 6:09pm

Thanks. So you mean to say even if I use BufferedStream it makes no difference . I can use any streams/ File to read/write. Only those memory preference makes the difference you are saying. With that said does this code looks good with memory config

private File convertHtmlToExcel(File htmlFile) throws Exception{
        //Create html load options
        HtmlLoadOptions opts = new HtmlLoadOptions();
        opts.setAutoFitColsAndRows(true);
        opts.setConvertNumericData(false);  //This is to avoid displaying the large number as scientific notattion
        opts.setMemorySetting(MemorySetting.MEMORY_PREFERENCE); // for large files
        //Create workbook from your HTML string
        String fileName = htmlFile.getAbsolutePath()+".xlsx";
        File f = new File(fileName);
        try(BufferedOutputStream excelOut = new BufferedOutputStream(new FileOutputStream(f))) {
            Workbook wb = new Workbook(htmlFile.getAbsolutePath(), opts);
            wb.getSettings().setMemorySetting(MemorySetting.MEMORY_PREFERENCE); // for large files
            wb.getWorksheets().get(0).setGridlinesVisible(true);
            //setting to utilize memory usage for large files
            Cells cells = wb.getWorksheets().get(0).getCells();
            cells.setMemorySetting(MemorySetting.MEMORY_PREFERENCE);
            //Save the workbook in output format
            wb.save(excelOut, new SpreadsheetML2003SaveOptions(SaveFormat.XLSX));
        }
        return f;
    }

By the way when you say memory would be saved, I hope saving memory will not have any impact to performance. We don’t want to compromise on performance while saving memory and so asking.

ahsaniqbalsidiqui · September 24, 2020, 7:25am

@paypaluser,
We are analyzing it and will write back here soon to share our feedback.

amjad.sahi · September 24, 2020, 10:27am

@paypaluser,

Generally, using filePath instead of streams is better for memory performance. For filePath, sometimes we can access its data randomly without loading entire file’s data into memory. To input common stream, we need to load all data into memory and rebuild it to make random access available. So commonly, we recommend users to use filePath instead of stream when it is possible. Please note, using filePath will not give you worse performance than using streams.

Seeing your final code (above), there is no need to set the memory preference mode again for every sheet after loading the workbook. When loading the workbook from template file with MemoryPreference mode, all cells’ memory mode will be MemoryPreference mode which is enough. So, you may remove unnecessary lines of code where you set memory preference mode.