Questions about working with large excel files

I have a very large file (about 200k rows) All the questions below are about the large file im working with.


1>The article here shows how to work with large files, however it also says “For some situations, such as building a workbook with a large data set for cells, theMemorySetting.MemoryPreference option may optimize memory use and decrease the memory cost for the application. However, this option may degrade performance in some special cases such as accessing cells randomly and repeatedly”

If i open large excel file and read all the cells using loop, would that degrade the performance?

2>Is there any function available to export large excel file to SQL table without reading the cells?

2>If we open the file using file path, does Aspose brings the complete file in memory?
Workbook workbook = new Workbook(“C:\book1.xls”);

3>How to read just header row ( first row) without bringing the complete file in memory?

Hi,


1) If you read the cells data in sequence (e.g row wise), it should not decrease the performance. Also, please use the code segment and specify the MemoryPreference via LoadOptions when reading large file to optimize memory usage.
e.g
Sample code:

//Specify the LoadOptions
LoadOptions opt = new LoadOptions();
//Set the memory preferences
opt.MemorySetting = MemorySetting.MemoryPreference;

//Instantiate the Workbook
//Load the Big Excel file having large Data set in it
Workbook wb = new Workbook(“e:\test2\Book1.xlsx”, opt);



2) You may try Data exporting options provided by Aspose.Cells. For example, you may try ExportDataTable() method to fill a DataTable and then use your own ADO.NET codes to insert or update that data into your SQL Server database tables. Please see the document for your reference:
http://www.aspose.com/docs/display/cellsnet/Exporting+Data+from+Worksheets

3) Yes, sure it will load the complete dataset in the File into the memory.

4) I think you may try using our LightCells APIs. These are mainly designed to read or manipulate cells data one by one without building a complete data model block (using the Cell collection etc. that you use in normal mode) into memory. It works in an event-driven mode. See the document for your reference:
http://www.aspose.com/docs/display/cellsnet/Using+LightCells+API

Thank you.

  1. if im using LighCells API, is there any way to stop processing after certain row index. In the example below i want to stop processing after verifying header, which is at index 0
  2. Why cant i get cell value while processing row? ( See my comments in the code)

public class AsposeLight
{
public void Read(string file)
{
LoadOptions opts = new LoadOptions();
LightCellsDataHandlerVisitCells v = new LightCellsDataHandlerVisitCells();
opts.LightCellsDataHandler = v;
Workbook wb = new Workbook(file, opts);
}
}

class LightCellsDataHandlerVisitCells : LightCellsDataHandler
{
private int cellCount;
private int formulaCount;
private int stringCount;

internal LightCellsDataHandlerVisitCells()
{
cellCount = 0;
formulaCount = 0;
stringCount = 0;
}

public int CellCount
{
get { return cellCount; }
}

public int FormulaCount
{
get { return formulaCount; }
}

public int StringCount
{
get { return stringCount; }
}

public bool StartSheet(Worksheet sheet)
{
return true;
}

public bool StartRow(int rowIndex)
{
return rowIndex == 0;
}

public bool ProcessRow(Row row)
{
int colCount = 20;
for (int i = 0; i < colCount; i++)
{
// row[i] returns null here ??????
if (row[i].Value.ToString() == "Header" + i.ToString())
{
// Do some extra work
}
}

return false;
}

public bool ProcessCell(Cell cell)
{
return false;
}

public bool StartCell(int columnIndex)
{
return false;
}
}

Hi,

LightCells API reads row/cell objects from template file sequentially. When reading one Row record, it can only get row properties such as, row index, row style … etc. So in method ProcessRow(), the Row object only contains its properties and no cell objects could be read and filled into it. This method provides you an efficient way to determine how to do with this row before parsing cells in it. For example, you can return false when one row is hidden, then all the cells of this row will be skipped without parsing them.

If you need to determine whether continue or stop the process by cell’s content, you have to check it in ProcessCell() method. We think you can define one flag in your implementation of LightCellsDataHandler, after reading the header row and its cells, this flag can be set to true and in ProcessRow() you can check this flag, if it is true then return false to skip following rows one by one.

Thank you.

Below is my working implementation. I'm only processing cells if rowindex is 0, for all other rows it will return false, so no other rows & their cells will get processed. However the StartRow() method will still get call for all other rows. So if i have 200k rows, StartRow() will get call for all the 200k rows.

I just want stop processing completely when certain condition is true, is it doable? (I dont want to throw exception)


However, the API still calls StartRow() method for each row.

class LightCellsDataHandlerVisitCells : LightCellsDataHandler
{
internal LightCellsDataHandlerVisitCells()
{
}
public bool StartSheet(Worksheet sheet)
{
return true;
}

public bool StartRow(int rowIndex)
{
return rowIndex == 0;
// this method will get call for all the rows, is there any way to stop processing if rowindex > 0
}

public bool ProcessRow(Row row)
{
return true;
}

public bool StartCell(int columnIndex)
{
return true;
}

public bool ProcessCell(Cell cell)
{
if (cell.Value.ToString() == "Header" + cell.Column.ToString())
{

}
return false;
}
}

Hi,


Well, we asked you to use ProcessRow() method. This method provides you an efficient way to determine how to do with any row before parsing cells in it. If you need to determine whether continue or stop the process by cell’s content, you have to check it in ProcessCell() method. Anyways, we will check if we could provide you the sample code that you may refer to it.


Thank you.

Hi,

Furthermore, regarding calling StartRow(int) for all rows, we are afraid there is no way to exit the
process of doing with one sheet’s data currently. Time-consumed process is parsing row/cell data and build the data models, if StartRow(int) returns false, then all records for it will be skipped without parsing their data except the row/column index. So we don’t think it will cause performance issue even if StartRow(int) be called for all rows.

Thank you.