Feed an InputStream with LightCells one row at a time

tdesilva · July 16, 2015, 2:21pm

I’m currently using Aspose.Cells 7.2.0 (without LightCells) to read data
from spreadsheets into a class that converts the data to CSV and pushes
it into the buffer of an InputStream one row at a time. However, the
memory performance of this technique isn’t good enough for the scale of
the spreadsheets it needs to handle, so I’m looking into upgrading to
Aspose.Cells 8.5.0 and using LightCells. Is there a way that I can use
LightCells to do this while only keeping one row (or at
least, significantly fewer rows than the entire sheet) in memory at a
time? Ideally, I’d like LightCellsDataHandler.processRow() to be called
once each time the InputStream’s buffer is emptied, but I’m guessing
that all of the rows get queued for checking as soon as
LightCellsDataHandler.startSheet() returns true.

amjad.sahi · July 17, 2015, 3:19am

Hi,

Thanks for your query.

Please try our latest version (e.g v8.5.0 (latest)), we have done memory optimization in reading/ writing Excel files (internal modules) in recent versions. Aspose.Cells provides the LightCells API, mainly designed to manipulate cell data one by one without building a complete data model block (using the Cell collection etc.) into memory. It works in an event-driven mode. So, you should try LightCells APIs, see the document with examples for your complete reference:

http://www.aspose.com/docs/display/cellsjava/Using+LightCells+API

Thank you.

tdesilva · July 17, 2015, 2:23pm

I tried to refactor my class to use LightCells, but only the first row
was read. For context, I should explain what exactly my class does. It extends InputStream, and on read() calls when the buffer is empty, it will convert the next row of cells into a CSV-formatted string and load the buffer with it. The class takes parameters which restrict it to reading a certain range of rows and columns, and has a counter that it uses to keep track of which row will be processed next.

I implemented LightCellsDataHandler.startCell() so that it obeys the column constraints (that is, it returns whether the given cell is in the specified column range), but I made LightCellsDataHandler.startRow() just check if the given row number is equal to the one that my class’s counter says is next in the queue. This means that only one row should be read each time the sheet is processed. LightCellsDataHandler.processCell() stores the cell in an array, and the cells are pulled from the array when the buffer is emptied as described above. This would work if I could generate the events that tell the DataHandler to process a row myself, but I don’t think I can.

Is what I’m trying to do feasible with LightCells? Can I control which rows are in memory dynamically, or am I stuck with just telling it which rows I need at the start and leaving them there until I’m done with all of them? Can I get the DataHandler to reprocess the sheet without reopening the Workbook?

shakeel.faiz · July 18, 2015, 10:09am

Hi,

Thanks for your description of the issue at hand and using Aspose.Cells.

Please also provide us some sample code that we could run at our end. It will help us look into your requirements and understand them better.

Try to provide us simpler code because it clarifies the things more easily. We will investigate your issue and help you asap. Thanks for your cooperation.

tdesilva · July 20, 2015, 2:54pm

Here are the relevant methods:

@Override
public int read() throws IOException
{
    final int EOF = -1;

    //  Have we reached the end of the file?

    if (endOfFile)
    {
        logger.info(“read: already at EOF”);
        return EOF;
    }

    //  Is the next byte available?

    if (offset == buffer.length)
    {
        //  Try to read another row from the database.

        boolean nextRowExists = getNextRow();
        if (!nextRowExists)
        {
            logger.info(“read: all data has been read, transitioning to EOF”);
            close();
            return EOF;
        }

        //  Convert each value in the row to a string.

        String[] row = convertRowToStringArray();

        //  Convert the array of values to a row in the CSV stream buffer.

        String csv = CsvUtility.toCsvLine(row);
        setBuffer(csv);
    }

    //  Return the next byte from the buffer, and move past it.

    return (int) buffer[offset++];
}

@Override
protected String[] convertRowToStringArray() throws IOException
{
    ArrayList rowList = new ArrayList();

    for (Cell cell: cellList)
    {
        Object val = cell.getValue();

        // Determine how to format the cell’s value
        if (val != null)
        {
            if (val instanceof DateTime)
            {
                if (dateFormat == ExcelDateFormat.EXTRACT_DATE_STRING)
                {
                    rowList.add(cell.getDisplayStringValue());
                }
                else
                {
                    rowList.add(formatter.format(((DateTime) val).toDate()));
                }
            }
            else
            {
                rowList.add(val.toString());
            }
        }
        else
        {
            rowList.add("");
        }

    }
    return rowList.toArray(new String[rowList.size()]);
}

@Override
protected boolean getNextRow() throws IOException
{
    if (row < rowEnd)
    {
        row++;
        return true;
    }
    else
    {
        return false;
    }
}

@Override
public boolean startSheet(Worksheet worksheet)
{
    return worksheet.getName() == sheetName;
}

@Override
public boolean startRow(int i)
{
    return i == row;
}

@Override
public boolean processRow(Row row)
{
    cellList.clear();
    return true;
}

@Override
public boolean startCell(int i)
{
    return i >= columnStart && i <= columnEnd;
}

@Override
public boolean processCell(Cell cell)
{
    cellList.add(cell);
    return true;
}

I’d like each row to be processed as needed when convertRowToStringArray() is called.

shakeel.faiz · July 21, 2015, 6:19am

Hi,

Thanks for your posting and using Aspose.Cells.

1). “guessing that all of the rows get queued for checking as soon as LightCellsDataHandler.startSheet() returns true”

No, method startSheet() only provide some information of one worksheet that will be processed(such as the sheet index, name, …etc) and let user determine whether continue to process/parse its other data such as objects, cells data, …etc.

2). For your situation of extending InputStream and implement LightCellsDataHandler
at the meantime, we are afraid it cannot be achieved directly. Reading
Row/Cell one by one is driven by cells component itself, reading the InputStream
is driven by your program. To make them working together, maybe you
need multi-threads to queue data and synchronize the operators on the
data queue. For processing cells in one row, please do not gather the
cell objects transferred by the parameter of processCell() out of the scope of processCell() method. For LightCells,
we use some shared structure to build the cell object and the shared
structure will be modified for different cells. That is, some data of
the cell object may change at runtime when processing other cells. So,
the cell object can only work fine in the scope of processCell(), and the same for processRow() for Row object. Commonly it is recommended for users to process one cell only in processCell()
method, such as for your scenario, you can directly build the string
object of one cell in this method. If you do need to use the cell data
out of this method, you may cache the data of the cell by your own
logic, such as put the cell’s data such as value, formula, type …etc.
into an array instead of the cell object itself.<span style=“font-size:11.0pt;font-family:“Calibri”,“sans-serif”;mso-ascii-theme-font:minor-latin;mso-hansi-theme-font:minor-latin;mso-bidi-font-family:“Times New Roman”;mso-bidi-theme-font:minor-bidi;color:#1F497D”>

tdesilva · July 21, 2015, 1:22pm

Ok, thanks for the help. Two more questions: Is it possible to use methods like Cells.getMinDataRow() without processing the entire sheet? Is there a complete list of the properties that become available when startSheet(), startRow(), and startCell() are called?

shakeel.faiz · July 21, 2015, 8:56pm

Hi,

Thanks for your posting and using Aspose.Cells.

1). Is it possible to use methods like Cells.getMinDataRow() without processing the entire sheet?

No, you cannot use those APIs for statistics data with LightCells mode even after the whole worksheet has been processed. The reason is, with LightCells mode all cells data(such as the Row objects, Cell objects) will not be kept in memory after the process. So you cannot access those data anymore out of the scope of corresponding method for processing cell/row object. However, it is easy for user to gather those information in their implementation of LightCellsDataHandler. For example, use can record the row index in processCell() when the processed cell is not blank, then the last row index you got should be the maximum data row.

2). Is there a complete list of the properties that become available when startSheet(), startRow(), and startCell() are called?

It is hard to list all available properties/objects when calling startSheet(), startRow() and startCell(). When reading template file in LightCells mode, we just read data in the saved data sequence. So what properties/objects are available depends on what data was saved before cells data. Commonly information such as sheet index, name, active cell, default row height/column width, …etc. should be available.