Improve performance of reading data from worksheet to DataTable with large number fo columns using Aspose.Cells for .NET in C#

Hi,

the performance when reading from worksheet to c# DataTable isn’t very good.
Do I have to change my code to improve the performance?

In my example I have 20000 rows with 250 columns. (Containing just simple numbers)
I have noticed that the performance is going more worse when adding further columns.

My code to read the DataTable from worksheet:

var workbook = new Aspose.Cells.Workbook(File);
var worksheet = workbook.Worksheets[0];
var dataTable = worksheet.Cells.ExportDataTable(0, 0, _rowCount, _columnCount, false);

Thank you very much

Igor

@igor.pfeifer,

Thanks for providing us sample code segment and details.

If you find memory related issue only, kindly try to use MemorySetting.MemoryPreference option to optimize memory use for cells data and decrease the overall memory cost, see the document for your reference:

Alternatively, we recommend you to use LightCells API for your task which is useful for reading/writing huge Excel spreadsheets data with it, it will demand less memory and you will get better performance and efficiency, see the document for your reference:

Let us know if the performance is increased by following the above approaches.

My issue is the duration of reading from worksheet to datatable, not the memory consumption.

Your example of using light cells api is very fast, but the cells of the datatable result are empty.
So I’ve tried to implement my own LightCellsDataHandler and setted all needed interface properties returning true.

Effect of this is that I get the complete datatable, but now it’s slow again as like without using LightCellsDataHandler.

How do I have to implement the LightCellsDataHandler to get the complete datatable inclusive cell values.

Here is my code:

LoadOptions opts = new LoadOptions();
opts.LightCellsDataHandler = new LightCellsDataHandlerVisitCells();
Workbook wb = new Workbook(File, opts);
var dataTable = wb.Worksheets[0].Cells.ExportDataTable(0, 0, _rowCount, _columnCount, false);



class LightCellsDataHandlerVisitCells : LightCellsDataHandler
{
    internal LightCellsDataHandlerVisitCells()
    {
   
    }

    public bool StartSheet(Worksheet sheet)
    {
        return true;
    }

    public bool StartRow(int rowIndex)
    {
        return true;
    }

    public bool ProcessRow(Row row)
    {
        return true;
    }

    public bool StartCell(int columnIndex)
    {
        return true;
    }

    public bool ProcessCell(Cell cell)
    {
        return true;
    }
}

@igor.pfeifer,

I think you may try to fill into DataTable (one record at a time) from the sheet cells in light mode. You may try to define your DataTable (with specified columns) in the implementation of LightCells APIs and then fill data into the datatable record by record if it makes any difference.

Ok I’ve created a DataTable and added empty columns and rows in advance.

After that I fill the datatable cells in the LightCellsDataHandler implementation, but it’s twice as slow (approx 1min) as the simple call of “worksheet.Cells.ExportDataTable(0, 0, _rowCount, _columnCount, false);” (approx 30sec)

// create datatable columns and rows
DataTable table = new DataTable("myTable");
for (int col = 0; col < 250; col++)
{
    table.Columns.Add(col.ToString());
}
for (int row = 0; row < 20000; row++)
{
    table.Rows.Add();
}

LoadOptions opts = new LoadOptions();
var cellsDataHandler = new LightCellsDataHandlerVisitCells(table);
opts.LightCellsDataHandler = cellsDataHandler;
Workbook wb = new Workbook(File, opts);

class LightCellsDataHandlerVisitCells : LightCellsDataHandler
{
private readonly DataTable _table;

public DataTable Table => _table;
public LightCellsDataHandlerVisitCells(DataTable table)
{
    _table = table;
}

public bool StartSheet(Worksheet sheet)
{
    return true;
}

public bool StartRow(int rowIndex)
{
    return true;
}

public bool ProcessRow(Row row)
{
    return true;
}

public bool StartCell(int columnIndex)
{
    return true;
}

public bool ProcessCell(Cell cell)
{
    var dataRow = _table.Rows[cell.Row];
    dataRow[cell.Column] = cell.Value;
    return true;
}
}

@igor.pfeifer,

Thanks for your code segment.

Well, it is because your code is incorrect regarding light cells. if ProcessCell() method returns true, it means the cell will also be kept in the cells data model in memory. Please change the ProcessCell method like following:
e.g
Sample code:

............
public bool ProcessCell(Cell cell)
{
    var dataRow = _table.Rows[cell.Row];
    dataRow[cell.Column] = cell.Value;
    return false;
} 
.........

Hope, this helps a bit.

The memory consumption is indeed lower than without using light cells impl, but the duration to load the data is near the same to the default call of “worksheet.Cells.ExportDataTable(…)” :frowning:

Do you think there is any another way to speed it up?

@igor.pfeifer,

How much time it takes when performing the task via light cells API and without it? Could you create a sample console application (runnable), zip the project and post us here to demonstrate it, we will check if we could enhance your sample code a bit. Also provide your template Excel file that you are using.

AsposeCellsPerformance.zip (704.1 KB)

worksheet.Cells.ExportDataTable(…) => round about 33-34 sec.

the LightCellsDataHandler implementation is about 37-38 sec, but with lower memory consumption.

An another excel lib (Winnovative) did this task in 11-12 sec, but it has issues at other place…

thanks in advance for reviewing my sample.

@igor.pfeifer,

Thanks for the sample project with template file and screenshot.

After an initial test, I am able to see the performance issue with LightCells APIs when filling data with many records into DataTable. I tested your sample code and got the performance difference. May be the light cells API is not the right API for your specific task. Anyways, I have logged an investigation ticket with an id “CELLSNET-45727” into our database. We will check if your sample needs some tweak or we need to enhance our APIs. Alternatively, we will check if we could propose you some better way to accomplish your task faster.

Once we have an update on it, we will let you know here.

ok, thank you very much

@igor.pfeifer,

We are afraid LightCells API is mainly used for decreasing the memory cost, it doesn’t help for CPU performance. However, you can still optimize your code a bit, it would work a bit faster:
e.g
Sample code:

...........
            DataTable table = new DataTable("myTable");
            for (int col = 0; col < 200; col++)
            {
                table.Columns.Add(col.ToString());
            }

            LoadOptions opts = new LoadOptions();
            var cellsDataHandler = new LightCellsDataHandlerVisitCells(table);
            opts.LightCellsDataHandler = cellsDataHandler;
...
    class LightCellsDataHandlerVisitCells : LightCellsDataHandler
    {
        private readonly DataTable _table;
        private DataRow _drow;

        public LightCellsDataHandlerVisitCells(DataTable table)
        {
            _table = table;
        }

        public bool StartSheet(Worksheet sheet)
        {
            return true;
        }

        public bool StartRow(int rowIndex)
        {
            _drow = _table.Rows.Add();
            return true;
        }

        public bool ProcessRow(Row row)
        {
            return true;
        }

        public bool StartCell(int columnIndex)
        {
            return true;
        }

        public bool ProcessCell(Cell cell)
        {
            // read just 20000 rows and 200 columns
            if (cell.Row >= Program.RowCount || cell.Column >= Program.ColumnCount)
                return false;

            _drow[cell.Column] = cell.Value;
            return false;
        }
    }
......... 

Hope, this helps a bit.