Memory consumption when running large cell operations

Hi,

We have an application that uses Aspose.Cells to both clear and update cell values in spreadsheets. When we run large loops of these instructions - updating 4-6 million cells - we find the memory consumption grows to a working set of multiple GB.

My question is, is this normal? Is there a way to work with Aspose components so that memory usage is a little more modest?

The code we have looks like

sheet.Cells[iRow, iCol].PutValue(value)

where there are two for loops around this statement controlling the iRow and iCol variables.

Any thoughts or advice would be much appreciated - or general hints on how to get the best execution performance with Aspose when working with large amounts of data.

Thanks,

James

Hi James,

Well, when you update or fill millions of cells in a worksheet, you need a sufficient amount of memory, as it is a big operation to deal with. Aspose.Cells will handle it, but there is one thing you should make sure of for performance:

Populating the data first by row and then by column will certainly improve the overall performance to a certain extent and make your program run faster. See the document for reference:
http://www.aspose.com/documentation/.net-components/aspose.cells-for-.net/populate-data-first-by-row-then-by-column.html
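For example, a row-first loop like the following sketch (the bounds and the data array are placeholders) is preferable to iterating column-first:

//Preferred: populate by row first, then by column (rowCount, colCount and data are placeholders)
for (int row = 0; row < rowCount; row++)
{
    for (int col = 0; col < colCount; col++)
    {
        sheet.Cells[row, col].PutValue(data[row, col]);
    }
}

//Slower: populating column by column
//for (int col = 0; col < colCount; col++)
//    for (int row = 0; row < rowCount; row++)
//        sheet.Cells[row, col].PutValue(data[row, col]);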


In the latest versions/fixes, we have enhanced the performance of the product for handling big operations and working with larger files. I have attached the latest fix, v4.8.2.6, for you; kindly try it.

Thank you.

Thanks - we will try the new library now and report back shortly. Is there any advantage or benefit to saving the file partway through a multi-million-cell update (regarding memory consumption)?

The reason for asking is that in some other libraries I have used, the equivalent of PutValue() cached the updates in an in-memory list until they were flushed to disk. Once flushed to disk, this list was emptied, and additional updates started building it up again. I can see this is trading memory pressure for I/O pressure - but if you have more time than memory, would this offer any benefit?

(The example I am thinking of: during a 4-million-cell update, write to disk every 1 million cells.)

Thanks,

James

Hi,

We have tried the new library.

Huge memory consumption is still an issue for us.

I have attached sample code that imports a 74 MB CSV file into an Excel file; please use the test files supplied in the "Test-Files" folder.

Is there anything that can be done to reduce the memory demand?

The supplied code runs in a Windows service on our server; once the memory exceeds 1.6 GB we get a .NET CLR exception, even on a 64-bit machine.

I hope this helps for your investigation,

Cheers,

Hi,

Thanks for providing us with further details and the sample project.

We will look into your issue and get back to you soon.

Thank you.


Hi,

After analyzing your issue with your project: there are 124060 * 68 cells in the CSV file, and there will be 124060 * 68 * 2 cells in memory after copying the range. A workbook containing 124060 * 68 cells will consume about 400M of memory in this test project, so the two workbooks together consume about 800M.
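Roughly speaking, 124060 * 68 is about 8.4 million cells per workbook, so ~400M corresponds to something on the order of 50 bytes per cell; with both the source and target workbooks populated, that doubles to roughly 17 million cell objects and ~800M.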

I think you may try:

1) Could you import the CSV file into a workbook and apply some formatting to the cells directly?

2) Please clear the source workbook references after copying. Please check the following sample code.

//Copy cell data from the source range
targetRange.Copy(sourceRange);

//Release the workbook and range references so the GC can reclaim them
sheet = null;
sourceRange = null;
book = null;
GC.Collect();


Thank you.

Hi Amjad,

Thanks for your reply and suggestion.

I have tried assigning the range, sheet and book references to null. It frees the memory immediately, after a huge spike of memory consumption in the Performance graph.

However, our issue here is to keep the load on memory low.

I have attached the updated code zip here.

Any improvement on reducing memory demand would be very helpful here.

Cheers,

An additional thought:

Would it be better to construct two for loops (row and column) and do a series of PutValue() statements, saving the target file every [say] 10000 updates? Would this cause the in-memory buffer that holds cell changes to be flushed, thereby reducing the peak load on memory? I have seen this technique used in other systems and am wondering if this might be an approach we can use here.
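In code, the idea would be something like this (just a sketch to illustrate the question; the bounds, value source, batch size and save path are placeholders):

//Sketch only: save the target workbook every N updates, hoping this flushes the change buffer
int updates = 0;
for (int iRow = 0; iRow < rowCount; iRow++)
{
    for (int iCol = 0; iCol < colCount; iCol++)
    {
        sheet.Cells[iRow, iCol].PutValue(GetNextValue(iRow, iCol));

        if (++updates % 10000 == 0)
        {
            targetworkbook.Save(targetPath);   //does this release the cached changes?
        }
    }
}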

Thanks,

James

Hi,

Please try the new fix, 4.8.2.7. We have added a new method, Range.GetCellOrNull().

1) Test1-ImportCSV
Please remove all references to the source file after copying the data from it.

2) Test02-SaveLargeFile
Please check whether the source cell is null before copying the data (see the sketch after this list).
If you call Range[row, column] to get a cell, we have to create a Cell instance in memory.

3) Test03-Update
Sorry, we have no good idea for this one.

4) Test04-CopyCells
Please check whether the source cell is null before copying the data.
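For example, the copy loop could look something like this (a rough sketch; it assumes GetCellOrNull takes row/column offsets within the range and returns null for cells that were never instantiated):

//Sketch: copy cell by cell, skipping source cells that were never created
//(assumes GetCellOrNull(rowOffset, columnOffset) returns null for uninstantiated cells)
for (int r = 0; r < sourceRange.RowCount; r++)
{
    for (int c = 0; c < sourceRange.ColumnCount; c++)
    {
        Cell sourceCell = sourceRange.GetCellOrNull(r, c);
        if (sourceCell == null)
            continue;   //nothing to copy, and no Cell instance is created for it

        targetRange[r, c].PutValue(sourceCell.Value);
    }
}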

Please check the attached file.

Thank you.

Hi ,

Thanks for the reply. When importing large text (CSV) files, we are doing this by:

1. opening both the source and target files

2. creating a range in the source and target files

3. using the copy function on the range object - targetrange.Copy(sourcerange)

Is this the best approach when copying a large block of data from source to target? Given that we are not working with these ranges on a cell-by-cell basis, I am not sure how we should exploit the GetCellOrNull() method.

Are you suggesting we should move to a different approach - two nested loops copying cell by cell through the ranges, using this GetCellOrNull() method?

Thanks in advance for your help.

Regards,

James

Hi,

If you call targetrange.Copy(sourcerange), both the source and target files are held in memory together. I am afraid we have no good idea for this.

Could you import the CSV file into a workbook and apply some formatting to the cells?

For example:

private void Test01_ImportCSV()
{
    try
    {
        StartTest("Test01_ImportCSV");

        var book = new Workbook();
        Range targetRange = null;

        //Open the text file
        book.Open(txtSourceFile.Text, ',');
        WriteOutput("CSV file opened");

        //If the file couldn't be parsed, an exception is thrown
        if (book.Worksheets[0] == null) throw new Exception("Couldn't parse text file");

        //Create a range covering all of the text file data
        var sourceRange = book.Worksheets[0].Cells.CreateRange(0, 0, book.Worksheets[0].Cells.MaxDataRow + 1, book.Worksheets[0].Cells.MaxDataColumn + 1);
        WriteOutput("Opened CSV file's worksheet");

        if (sourceRange == null) throw new Exception("Couldn't create source range in text file");

        //Get the target workbook
        var targetworkbook = new Workbook();
        targetworkbook.Open(txtTargetFile.Text);
        WriteOutput("Opened target workbook");

        //Get the target sheet
        var sheet = targetworkbook.Worksheets[0];
        WriteOutput("Extracted target worksheet");

        targetRange = sheet.Cells.CreateRange(0, 0, book.Worksheets[0].Cells.MaxDataRow + 1, book.Worksheets[0].Cells.MaxDataColumn + 1);
        WriteOutput("Created target range");

        ////Copy cell data from the source range
        //targetRange.Copy(sourceRange);

        //Copy the styles only, instead of copying the cell data
        sourceRange.CopyStyle(targetRange);
        WriteOutput("Copied CSV data to target");

        //Save the source (CSV) workbook as the output file
        book.Save(txtTargetFile.Text);
        WriteOutput("Saved workbook");

        targetRange = null; targetworkbook = null;

        FinishTest();
    }
    catch (Exception ex)
    {
        HandleException("Importing a CSV file", ex);
    }
}


Thank you.

Thanks for your response.

Can you please tell me if the Aspose.Cells library releases memory when you do a "save file" operation on a target workbook?

The reason for asking is that I am thinking an alternative approach might be to do a whole series of GetValue(row, col) -> PutValue(row, col) calls instead of using the range copy, and save the file every [say] 10000 calls to PutValue. If the save operation releases memory (because the library no longer needs to cache the changes made by PutValue), then that might be a way for us to process large files.

We are still stuck with the problem that importing a 76 MB CSV file into an XLS file produces an 'out of memory' exception. The process is taking about 2 GB of RAM when this occurs, even though it is running on an 8 GB 64-bit machine.

Thanks,

James

Hi,

Well, it takes about 500M of memory when we run the following code:

Workbook workbook = new Workbook();
workbook.Open(@"F:\FileTemp\Very Big CSV File-73M.csv", ',');
// workbook.SaveOptions.ClearData = true;
workbook.Save(@"F:\FileTemp\dest.xls");

Could you share a sample project showing how you import the CSV and export it to an XLS file?

Thank you.

Hi,

Thanks again for your reply.

I think the primary difference between the source code in your previous post and what we need to do is that we use Aspose components to open both a source CSV file and a target Excel file, and then copy a large number of cells from the CSV to the Excel file. We find that when the CSV file is 76 MB and the target file is an empty workbook, the overall memory demand of the running process rises to 1.8 GB, and at some point an 'out of memory' exception is thrown (even though there is still free physical memory on the machine). We suspect the process is suffering from heap fragmentation, which would explain the failure despite the free memory, although we would need to run the CLR viewer to confirm this.

The attached zip file contains a test application and test data that we hope can be used to replicate this issue in your environment. In the testdata directory there is a 76 MB CSV file you can use as a source file; create an empty workbook for the target file. Select the first tab, "Test 1 - Importing large CSV files". Here you will see three buttons with different test variations, and you can click on them to view the code behind each. When you run these tests you will see the memory and time taken to complete them.

The first variation uses Aspose components to open both the source and target files, and then makes a single call to Copy to transfer cell data from source to target.

The second variation uses Aspose components to open both the source and target files, and then enters a row/column loop to do a series of Get/Put statements, saving the target file intermittently to reduce the memory demand of "unsaved changes".

The third variation uses an Aspose component to open the target file, but not the source file. Instead we open the source CSV data with a custom stream reader, and then enter a row/column loop to do a series of Get/Put statements, saving the target file intermittently to reduce the memory demand of "unsaved changes".
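For reference, the core of the third variation looks roughly like this (heavily simplified; the paths, delimiter handling and batch size are placeholders for what the attached project actually does):

//Sketch of variation 3: read the CSV with a plain StreamReader and push values with PutValue,
//saving the target file intermittently (requires System.IO; csvPath and targetPath are illustrative)
using (var reader = new StreamReader(csvPath))
{
    string line;
    int row = 0;
    while ((line = reader.ReadLine()) != null)
    {
        string[] fields = line.Split(',');
        for (int col = 0; col < fields.Length; col++)
        {
            sheet.Cells[row, col].PutValue(fields[col]);
        }

        row++;
        if (row % 10000 == 0)
        {
            targetworkbook.Save(targetPath);   //intermittent save to limit "unsaved changes"
        }
    }
}
targetworkbook.Save(targetPath);               //final save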

Our problem is that our production solution currently uses the technique exemplified in variation #1, and encounters the 'out of memory' error after 2 or 3 runs with the client's larger files. This is on a 64-bit, 8 GB machine with nothing else running on it.

Understandably, the third variation offers the best performance and scalability, and does not crash with the 76 MB file, even after re-running 50 times. We are applying this approach to our production solution this weekend so our clients can continue processing.

Our question to Aspose is: to what extent can these components be scaled? Are we simply processing file sizes that these components were not designed for? If so, can you offer any advice on alternative ways of processing large Excel files programmatically? Most of our clients in the financial services sector work with Excel and CSV files that are routinely in the 100-300 MB range, so we need to find a solution at some point. Our preference is to work with Aspose to get the components working well at these scales, because we like your API, documentation and support (grovel grovel :)) - but it's really up to you guys how far you want to take these components. I think we need certainty more than anything else right now.

If you are intending to increase the scalability of these components so that we can use them to process commercial file sizes - we have a couple of questions/thoughts:

- Is there anything that can be done to reduce the "inflation factor" when loading a file, i.e. where a 75 MB file takes 500 MB in memory?
- Could you consider adding methods for importing/exporting CSV data to and from ranges that use raw, high-performance C# code on one side of the transfer (instead of Aspose components on both sides)? Perhaps a better version of the test code we have in variation 3 of test 1 in the attached testing application.
- Would you be able to look at whether, under load, these components are fragmenting the heap and consequently triggering the 'out of memory' error?
- Would it be possible to flush a workbook's "unsaved changes" to a co-located temp file once the change list consumes a given amount of memory, i.e. once this list takes up [say] 500 MB, flush it to a temp file and free the memory? I understand that this will slow down the overall process, as it introduces file I/O, but the alternative is for the process to suffer from memory starvation and heap stress.

Thanks again for all your help,

James Simpson

Hi,

Thanks for providing us with this information.

We are looking into whether we can import the CSV file directly into a worksheet, as you do in ImportCSV_Variant3 in the project.

Thank you.

Hi James,


Please try the attached latest version of Aspose.Cells. We now support importing a CSV file directly into the cells of a worksheet.


Please check the following code:

Workbook workbook = new Workbook();
workbook.Open(@"Test-ExcelFile.xlsx");

Cells cells = workbook.Worksheets[0].Cells;

//Import the CSV data directly into the worksheet, starting at cell A1
cells.ImportCSV("CSV File 76 Mb.csv", ";", true, 0, 0);

Thank You & Best Regards,