Aspose.Cells: how to get the maximum row and column counts of an Excel file

humanhuman · August 31, 2023, 2:50am

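When I use POI to get the maximum row and column counts, some unusual files (ones with many blank rows) consume a lot of memory. With my sample file, for example, memory usage reached 1 GB just from initializing the XSSFWorkbook object. I tried xlsx-streamer, which uses only about 150 MB, but it cannot read xls files. Does Aspose.Cells offer a similar feature that can get the row and column counts with low memory usage? Ideally it would work for both xls and xlsx.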
新建文件夹.zip (269.5 KB)

John.He · August 31, 2023, 3:00am

@humanhuman,
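Aspose.Cells for Java can process all Excel files; it works with any Excel format, including xls, xlsx, xlsb, and xlsm. For the data range of a worksheet, please refer to the following document.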

johnson.shi · August 31, 2023, 4:13am

@humanhuman,

Yes, we think Aspose.Cells can meet your requirement with better performance. To gather statistics from template files, you may try our LightCells APIs, which do not keep all cell data in memory and so can perform better than other approaches.

Here is an example that demonstrates how to gather the maximum column index for every sheet in your template file; you may modify it to suit your requirement:

            Statistics s = new Statistics();
            LoadOptions opts = new LoadOptions();
            opts.LightCellsDataHandler = s;
            Workbook wb = new Workbook(template, opts);
            int[] m = s.MaxColumnIndices;
            ...
        class Statistics : LightCellsDataHandler
        {
            private int[] _maxCols;
            private int _sheetIndex = -1;
            private int _maxCurr = -1;

            public Statistics()
            {
            }

            public int[] MaxColumnIndices
            {
                get
                {
                    _maxCols[_sheetIndex] = _maxCurr;
                    return _maxCols;
                }
            }
            public bool StartSheet(Worksheet sheet)
            {
                if (_sheetIndex < 0)
                {
                    _maxCols = new int[sheet.Workbook.Worksheets.Count];
                    for (int i = _maxCols.Length - 1; i > -1; i--)
                    {
                        _maxCols[i] = -1;
                    }
                }
                else
                {
                    _maxCols[_sheetIndex] = _maxCurr;
                    _maxCurr = -1;
                }
                _sheetIndex = sheet.Index;
                return true;
            }
            public bool StartRow(int row)
            {
                return true;
            }
            public bool ProcessRow(Row row)
            {
                return true;
            }
            public bool StartCell(int col)
            {
                return true;
            }
            public bool ProcessCell(Cell cell)
            {
                if (cell.Column > _maxCurr && cell.Type != CellValueType.IsNull)
                {
                    _maxCurr = cell.Column;
                }
                return false;
            }
        }

And here is another document about using LightCells for your reference: Using LightCells API.

humanhuman · August 31, 2023, 9:27am

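Thank you very much for your example. I had been wondering how to get the blank rows and columns: calling APIs such as getMaxColumn directly only returns the rows and columns that have content. Your example matches what POI reports, which is the result I need, and both memory usage and elapsed time are less than half of POI's. Thank you very much.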

humanhuman · August 31, 2023, 9:27am

Thank you very much.

amjad.sahi · August 31, 2023, 9:37am

@humanhuman,

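You are welcome. Glad to know the sample code meets your needs well. If you have any further queries or issues, please feel free to write back to us; we will be happy to assist you soon.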

humanhuman · August 31, 2023, 9:39am

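I have put together a Java example that others with the same problem can refer to. At least in tests with my Excel file, compared with POI, memory usage dropped by 70% and elapsed time dropped by 50%.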

package com.example.demo;

import com.aspose.cells.Cell;
import com.aspose.cells.CellValueType;
import com.aspose.cells.LightCellsDataHandler;
import com.aspose.cells.Row;
import com.aspose.cells.Worksheet;

public class Statistics implements LightCellsDataHandler {
    private int[] maxCols;
    private int[] maxRows;
    private int maxCurrCol = -1;
    private int maxCurrRow = -1;
    private int sheetIndex = -1;

    public Statistics() {
    }

    public int[] getMaxColumnIndices() {
        maxCols[sheetIndex] = maxCurrCol;
        return maxCols;
    }

    public int[] getMaxRowIndices() {
        maxRows[sheetIndex] = maxCurrRow;
        return maxRows;
    }

    public boolean startSheet(Worksheet sheet) {
        if (sheetIndex < 0) {
            int pages = sheet.getWorkbook().getWorksheets().getCount();
            maxCols = new int[pages];
            maxRows = new int[pages];
            for (int i = pages - 1; i > -1; i--) {
                maxCols[i] = -1;
                maxRows[i] = -1;
            }
        } else {
            maxCols[sheetIndex] = maxCurrCol;
            maxCurrCol = -1;
            maxRows[sheetIndex] = maxCurrRow;
            maxCurrRow = -1;
        }
        sheetIndex = sheet.getIndex();
        return true;
    }

    public boolean startRow(int row) {
        return true;
    }

    public boolean processRow(Row row) {
        maxCurrRow = row.getIndex();
        return true;
    }

    public boolean startCell(int col) {
        return true;
    }

    public boolean processCell(Cell cell) {
        if (cell.getColumn() > maxCurrCol && cell.getType() != CellValueType.IS_NULL) {
            maxCurrCol = cell.getColumn();
        }
        return true;
    }
}

@Test
public void test1() throws Exception {
    // Direct access through the Cells API
    Workbook workbook = new Workbook("C:\\Users\\Administrator\\Downloads\\test.xlsx");
    Iterator iterator = workbook.getWorksheets().iterator();
    while (iterator.hasNext()) {
        // The raw Iterator returns Object, so a cast is required
        Worksheet worksheet = (Worksheet) iterator.next();
        int maxColumn = worksheet.getCells().getMaxColumn();
        int maxRow = worksheet.getCells().getMaxRow();
        System.out.println("Columns: " + maxColumn + "---Rows: " + maxRow);
    }

    // Using LightCellsDataHandler
    Statistics s = new Statistics();
    LoadOptions opts = new LoadOptions();
    opts.setLightCellsDataHandler(s);
    Workbook wb = new Workbook("C:\\Users\\Administrator\\Downloads\\test.xlsx", opts);
    int[] col = s.getMaxColumnIndices();
    int[] row = s.getMaxRowIndices();
    System.out.println("Columns: " + col[0] + "---Rows: " + row[0]);
}
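
For anyone who wants to reproduce a memory and time comparison like the one reported above, here is a rough sketch of a measurement helper. It uses only the JDK; the class name LoadBenchmark and its methods are hypothetical and are not part of Aspose.Cells or POI. It reports the change in used heap after the task finishes, not the peak during the task, so treat the numbers as approximate.

```java
class LoadBenchmark {
    /** One measurement: elapsed wall-clock time and approximate change in used heap. */
    static final class Result {
        final long elapsedMillis;
        final long usedHeapDeltaBytes;

        Result(long elapsedMillis, long usedHeapDeltaBytes) {
            this.elapsedMillis = elapsedMillis;
            this.usedHeapDeltaBytes = usedHeapDeltaBytes;
        }
    }

    /** Heap currently in use, as reported by the JVM. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the task once and reports elapsed time plus the change in used heap. */
    static Result measure(Runnable task) {
        System.gc(); // only a hint; the JVM may ignore it
        long heapBefore = usedHeap();
        long start = System.nanoTime();
        task.run();
        long elapsedNanos = System.nanoTime() - start;
        long heapAfter = usedHeap();
        return new Result(elapsedNanos / 1_000_000, heapAfter - heapBefore);
    }

    public static void main(String[] args) {
        // Placeholder workload; in practice this would be the workbook load under test.
        int[][] holder = new int[1][];
        Result r = measure(() -> holder[0] = new int[5_000_000]);
        System.out.println("elapsed ms: " + r.elapsedMillis
                + ", heap delta bytes: " + r.usedHeapDeltaBytes);
    }
}
```

To compare two loaders, wrap each load in its own measure(...) call, ideally in separate JVM runs so that one measurement does not disturb the other.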

amjad.sahi · August 31, 2023, 9:44am

@humanhuman,

Thank you for sharing the sample code. It can help others who are concerned about performance and memory usage.

humanhuman · September 1, 2023, 3:01am

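However, I have also found a problem: using LightCellsDataHandler gives inaccurate results for some files. I have updated my code to the latest version, and I am not sure whether I wrote something wrong. For my test file, the result should be 10 columns and 16 rows.
Getting the values directly through the API gives the correct result; using LightCellsDataHandler does not.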
zip.zip (888.8 KB)

humanhuman · September 1, 2023, 3:02am

This is my test code. The result is accurate when the content rows in the Excel file are not interrupted by blank rows.

John.He · September 1, 2023, 3:26am

@humanhuman,
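In your sample code, the increment performed while processing row data only counts the rows that contain data. If you want the maximum row and column, please use the following code: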

    public boolean processRow(Row row) {
        maxCurrRow = row.getIndex();
        return true;
    }

johnson.shi · September 1, 2023, 3:33am

@humanhuman,

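This depends on whether you are counting the total number of rows and columns, or finding the largest non-empty row and column indices. In your implementation: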

public boolean processRow(Row row) {
    maxCurrRow++;
    return true;
}

this means you are counting the number of rows that are defined in the template file. If you change it to

maxCurrRow = row.getIndex();

then what you get in the end is the index of the largest row defined in the file; this row may or may not be empty.
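
The difference between the two row strategies can be illustrated without any Excel file. The sketch below is plain Java with no Aspose.Cells types; the class RowTracking and its methods are hypothetical stand-ins, and the int arrays represent the indices of the rows for which a row callback would fire.

```java
class RowTracking {
    /** Mirrors maxCurrRow++ starting at -1: yields (number of defined rows - 1). */
    static int countingStrategy(int[] definedRowIndices) {
        int maxCurrRow = -1;
        for (int ignored : definedRowIndices) {
            maxCurrRow++;
        }
        return maxCurrRow;
    }

    /** Mirrors maxCurrRow = row.getIndex(): yields the index of the last defined row. */
    static int indexStrategy(int[] definedRowIndices) {
        int maxCurrRow = -1;
        for (int rowIndex : definedRowIndices) {
            maxCurrRow = rowIndex;
        }
        return maxCurrRow;
    }

    public static void main(String[] args) {
        int[] contiguous = {0, 1, 2, 3};  // no gaps between defined rows
        int[] withGaps = {0, 1, 5, 15};   // rows 2-4 and 6-14 are not defined

        System.out.println("contiguous: counting=" + countingStrategy(contiguous)
                + ", index=" + indexStrategy(contiguous));
        System.out.println("with gaps:  counting=" + countingStrategy(withGaps)
                + ", index=" + indexStrategy(withGaps));
    }
}
```

With contiguous rows both strategies agree; once there are gaps, the counting variant falls behind the real last row index, which is consistent with the inaccuracy reported for files containing blank rows.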

If what you need is the index of the last non-empty row, you can move the assignment to maxCurrRow into ProcessCell():

public boolean processCell(Cell cell) {
    if (cell.getType() != CellValueType.IS_NULL) {
        if(cell.getColumn() > maxCurrCol)
        {
            maxCurrCol = cell.getColumn();
        }
        maxCurrRow = cell.getRow();
    }
    return true;
}

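LightCells mode is streaming and data-driven: when a newly defined row is encountered in the template file, StartRow() and ProcessRow() are triggered; when a newly defined cell is encountered, StartCell() and ProcessCell() are triggered. If some blank rows and blank cells are not defined (do not appear) in the template file, they are not handled in the start and process callbacks either.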
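
As a rough illustration of this streaming behaviour, here is a small self-contained simulation in plain Java. It does not use Aspose.Cells: the CellHandler interface, the Stats class, and the stream() driver are all hypothetical, and a null cell value stands in for a cell that is defined in the file but has no content. Only rows and cells present in the map produce callbacks, which is the assumption being modelled.

```java
import java.util.Map;
import java.util.TreeMap;

class SparseSheetDemo {
    /** Hypothetical callback interface, loosely modeled on a streaming cell handler. */
    interface CellHandler {
        void onRow(int rowIndex);
        void onCell(int rowIndex, int colIndex, String value);
    }

    /** Tracks the last defined row and the last row/column that hold a value. */
    static class Stats implements CellHandler {
        int maxDefinedRow = -1;
        int maxNonEmptyRow = -1;
        int maxNonEmptyCol = -1;

        public void onRow(int rowIndex) {
            maxDefinedRow = rowIndex;
        }

        public void onCell(int rowIndex, int colIndex, String value) {
            if (value != null) {
                maxNonEmptyRow = rowIndex;
                maxNonEmptyCol = Math.max(maxNonEmptyCol, colIndex);
            }
        }
    }

    /** Drives the handler; rows and cells absent from the map never produce a callback. */
    static void stream(TreeMap<Integer, TreeMap<Integer, String>> rows, CellHandler handler) {
        for (Map.Entry<Integer, TreeMap<Integer, String>> row : rows.entrySet()) {
            handler.onRow(row.getKey());
            for (Map.Entry<Integer, String> cell : row.getValue().entrySet()) {
                handler.onCell(row.getKey(), cell.getKey(), cell.getValue());
            }
        }
    }

    /** Builds a sparse sheet (rows 0, 15 and 40 only) and streams it through Stats. */
    static Stats demo() {
        TreeMap<Integer, TreeMap<Integer, String>> rows = new TreeMap<>();
        rows.computeIfAbsent(0, k -> new TreeMap<>()).put(0, "header");
        rows.computeIfAbsent(0, k -> new TreeMap<>()).put(9, "last column");
        rows.computeIfAbsent(15, k -> new TreeMap<>()).put(2, "data");
        rows.computeIfAbsent(40, k -> new TreeMap<>()).put(0, null); // defined but empty
        Stats stats = new Stats();
        stream(rows, stats);
        return stats;
    }

    public static void main(String[] args) {
        Stats s = demo();
        System.out.println("max defined row:   " + s.maxDefinedRow);
        System.out.println("max non-empty row: " + s.maxNonEmptyRow);
        System.out.println("max non-empty col: " + s.maxNonEmptyCol);
    }
}
```

In this simulation rows 1 to 14 and 16 to 39 never reach the handler, and the defined-but-empty row 40 raises the "max defined row" without affecting the non-empty values, matching the distinction drawn in the reply.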