Selectively reading sheets from an Excel dump

Hi,

I have an Excel dump with around 200 sheets which user can upload onto the application. I want to selectively parse a couple of sheets and then read the data into memory. Currently when I read the workbook, the whole dump is read leading to an escalation in memory usage. We have to support around 50 concurrent users uploading their data which requires close to around 1 GB of memory.

Is there anyway we can parse the sheets selectively?

Thanks,
Shashi

Note: Batch mode is not an option in our case as the users want to see the data in realtime.

Are you using Aspose.Cells for .Net or Aspose.Cells for Java?

I am using Aspose Cells for Java. Another requirement that I forgot to mention is exporting this workbook with the same number of sheets i.e. around 200 sheets. Is there anyway to create one sheet in memory at a time and dump that data to the disk without creating the whole workbook of 200 sheets in memory.

Thanks,
Shashi

Hi Laurence,

Any information on this will be highly appreciated. For now if you can just put some light on the feasibility then that would be great.

Thanks,
Shashi

Hi Shashi,

We will get back to you soon.

Thanks for being patient!

Hi Shashi,

Currently the feature of loading specific sheet only is not supported, we will look into it after completing some other important tasks on hand. But in the latest patch of Aspose.Cells for Java, we provides APIs for parsing cells in Event-Driven mode which can decrease memory cost effectively. In this mode, we only parse global data such as shared string, styles that denote cell's value type, and so on at initialization, the values of cell only be parsed when user request to process the workbook or worksheet, and the parsed cells are not retained in memory. Currently it only support cells with plain data, whose data type can be int, double, boolean, String and datetime. Advanced features such as formula, hyperlink, etc. are not supported. This feature is better suited to the situation when the template file mainly contains huge data set and user only need to access worksheet's cells one by one sequentially. Would you please try this feature of the attached version to check whether it can meet your requirement. Following code shows how to use the new API:

...
LightCells lightCells = LightCells.getInstance("t.xls", FileFormatType.EXCEL2003);
CellHandler handler = new CellHandlerSample();

lightCells.processWorkbook(handler);//lightCells.processWorksheet(0, handler);
...

class CellHandlerSample implements CellHandler
{
private int _sheetIndex;


/**start with a worksheet, check whether process it. If this handler need to refer to sheetIndex later in startCell() or process() method, that is, if the process need to know which worksheet is being processed, the implementation should retain the sheetIndex value here. */
public boolean startSheet(int sheetIndex)
{
_sheetIndex = sheetIndex;
return true;
}


/**start with a cell, check whether process it*/
public boolean startCell(int row, int column)
{
return true;
}


/**process current cell.
Note: In consideration of performance, the LightCell object maybe reused for all cells, so if it is needed to keep current cell's data for later use, please call LightCell.copy() and keep the cloned one*/
public void process(LightCell cell)
{
int row = cell.getRowIndex();
int col = cell.getColumnIndex();
int type = cell.getValueType();
Object val = cell.getValue();
System.out.println(_sheetIndex+"-"+row+"&"+col+":"+type+", "+val);
}
}

Hi Johnson,

Thanks for the quick reply. We want to make a copy of the entire worksheet in the memory and the requirement is to copy the formulae and the values. Is there any way to read the formulae, the worksheet name (instead of just the index) and the cell name along with the cell value?

Also how do I write a huge workbook (write one worksheet at a time) to the disk without creating the whole workbook in memory. This is a show stopper for us as its crashing the server. Please help.

Thanks,
Shashi

Hi Shashi,

1. you can get total sheets count of the workbook by LightCells.getSheetCount() and get sheet name by LightCells.getSheetName(int sheetIndex).

2. For cell name, you can get the rowIndex and columnIndex of the cell and then convert it into name by CellHelper.convertCellIndexToName().

3. For formula, current version does not support. We will look into this feature.

4. I am afraid currently you are unable to write worksheets one by one without building whole workbook until we support the feature of loading not the whole workbook but only specific worksheet.

Hi Johnson,

I understand that you are working on a number of high priority features but is there anyway you can support this feature? This is important as its crashing our server because of high heap usage.

Also, if you can think of any workarounds which we can use to achieve the same functionality without using the heap then please do suggest.

Thanks,
Shashi

Hi Shashi,

Does your Excel file contains many objects such as picture, comments, OleObject, and so on? If so you can use Workbook.loadData() to process the data, this method only load all cells data and formatting, ignore all the objects in file.

For LightCells API, do you mean if we provide cell's formula it can meet your current requirement? If so, I think we can provide cell's formula with its value(the value last calculated and saved in file, calculating formula at runtime is not supported) for Excel2007 XLSX format in about 1-2 days. For Excel97-2003, because formula needs more global data and sheet data be loaded and parsed initially, I am afraid we can't support it in few days.

Hi Johnson,

Thanks for the reply. Our Excel file only contains data and formulae. Hence loadData will not make a huge difference in our scenario. If you can provide only the formula string in Lighcells API we can manage for the time being. Excel 2007 XLSX formula is also ok.

Some questions:

  • Will this functionality work for Excel 2003 format without the formula parsing? This is because some workbooks dont contain formulas and we dont want all users to upgrade to Excel 2007.
  • Is there any API in Aspose through which we can check whether the workbook is a 2007 Excl format without reading it into memory?
This following requirement still goes unanswered:

Is there any way (workaround) that I can write a 200 sheet workbook to the disk without creating the whole workbook in memory?

Thanks,
Shashi


Hi Shashi,

For your question:

1. Will this functionality work for Excel 2003 format without the formula parsing? This is because some workbooks dont contain formulas and we dont want all users to upgrade to Excel 2007

yes, If you only need to process cell's value and value type without any formula, LightCells can work for both Excel2003 and Excel2007. If you want to get and process with cell's formula string, currently we can support it for Excel2007 XLSX file only.

2. Is there any API in Aspose through which we can check whether the workbook is a 2007 Excl format without reading it into memory?

No, there is no such API can check the file format for Excel. In common situation, I think the simple way is to check the extension of file name if it is sure saved with file name corresponding format.

3. Is there any way (workaround) that I can write a 200 sheet workbook to the disk without creating the whole workbook in memory?

No, currently it is not supported to create and save sheets one by one without building them as part of a Workbook object in memory.


Hi Johnson,

Thanks for the quick response. Excel 2007 xlsx format will be ok if we can read the cell formulae. Please let me know after you are done implementing this feature.

I did some tests on the Lightcells API and the results were very encouraging. This along with the formula fix should solve our memory issue. I also found the Light cells API to be blazing fast in parsing selective sheets as compared to traditional workbook API.

I have the following questions:

a) I have 2 sheets in a Workbook. 1 sheet has very less data around 20 cells and the other sheet has around 399980 cells. In the cell handler, I ignored the first sheet and parsed the second sheet. It took around 1.1 secs. In another test I ignored the second sheet and parsed only the first sheet. It took around 970 ms. I ran this test around 10-15 times and found the same results. There was not much difference in performance when I parse the 20 cells sheet v/s the huge 399980 cells sheet. Any reason for this (Is it that you go through all the cells but only construct objects for the ones where I want you to, thus showing up relatively low difference in time)?

b) In the LightCells API example that you provided, you mentioned that you can clone the LightCell using the copy() method. In what circumstances would you use this method?

Thanks,
Shashi

Hi Shashi,

Please try this attached patch, we have supported common formula string of cell for Excel2007 XLSX file format. But shared formula and array formula is currently not supported yet.

For your question:

a) the performance of different data set:

I think maybe your stats is the total time cost from building LightCells object to completing process all cells of one sheet. In fact, there is some global data shared by all sheets, such as String values, styles that denote value type, which we have to parse initially and retain in memory for later parsing cells, no matter many sheets or only one sheet has reference to it. . If there are many shared String and styles, it will cost longer time when build the LightCells, even though later you only need to process one sheet. I think it should be obviously different if you time only the processWorksheet() method for your two sheets.

b) for the Interface CellHandler, the method "process()" has one parameter which is a LightCell object, LightCells transferes it to user to do with current cell's data. the LigthCell object is reused in LightCells, that is, when LightCells encounters one cell, it will use the same LightCell object, only change the value of its field such as row, col, valueType and so on according to currently encounted cell.So, if the user need to hold reference to one cell's data out of its process scope, user should use LightCell.copy() to get one copy of LightCell and hold the cloned one for later use. The proper and neccessary place to call LightCell.copy() is in the method process().

Hi Johnson,

Thanks for the patch but it seems the functionality is not working properly. Please find attached one of the templates containing formula fields for your test. The Lightcells API lists certain formulas and skips certain ones (which are not shared or array type of formulas).

Also regarding the shared and array type of formulas, are you planning to support the retrieval? This is because without all type of formula support, this functionality will not be useful for us.

Test Code (I have listed down only three methods of the cell handler below):

public boolean startSheet(int i)
{
System.out.println("" + lightCells.getSheetName(i));

return true;
}

public boolean startCell(int i, int i1)
{
return true; //To change body of implemented methods use File | Settings | File Templates.
}

public void process(LightCell lightCell)
{
System.out.println("" + lightCell.getFormula() + " " + CellsHelper.convertCellIndexToName(lightCell.getRowIndex(), lightCell.getColumnIndex()));

}

Thanks,
Shashi

Hi Shashi,

After test with your file, I can't find the formulas that are not shared formula and ignored by LightCells. LightCells provides total 132 cells whose formula is not null(includes the start cell of the area of shared formula).Would you please point out which cell should give formula but LightCells does not do that?

For supporting shared formula and array formula, we will look into it soon.

Hi Johnson,

In the program that I sent, when you print out formula for all the cells, for cell D3 it shows =IF(D82=1,IF(D6>0,1,""),"") which is correct. Whereas for cell E3, the value should be =IF(E82=1,IF(E6>0,1,""),"") but the getFormula shows it as null. There are numerous other formulae which show up as null in the output.

I dont know the concept of what a shared formula means but then as I mentioned, if we dont get all the formula strings in that Excel Workbook, we can put this to little use in our application. Really appreciate the the effort that you are putting in bringing this functionality into the aspose library.

Thanks,
Shashi

Hi Shashi,

In fact you defined shared formula for the area "D3:AA3", D3 is the start cell of this area, just as I said, currently we only return the formula for this start cell. Anyway, we are looking into supporting this feature of shared formula, please keep an eye on this thread. Thank you.

Hi Johnson,

Any news on the above feature?

Thanks,
Shashi

Hi Shashi,

Currently our work of this new feature is suspended by some urgent bug. I am afraid we can't support this feature soon, hope we can support it about one week later. Thank you for being patient.