Cells Multithreading issue

Hi.

We are trying to improve perfomance of reading book’s model and we’ve tried to use multithreading. But it is failed.

Following example tries to reading values of cells in several threads. It is works when we try to work in one thread, but when we are increasing count of parallel working, chance to fail increased.

Is it possible to work same way with book?

   public static Stream<Arguments> multiThreading_source() {
        return IntStream.range(19, 98).mapToObj(value -> Arguments.of(19, value - 19 + 1));
    }

    @ParameterizedTest
    @MethodSource("multiThreading_source")
    void multiThreading(int from, int count) throws Exception {
        System.out.println("Creating model");
        Workbook workbook = new Workbook("E://in.xlsx");

        System.out.println("Reading");
        long start = System.currentTimeMillis();

        ExecutorService executorService = Executors.newFixedThreadPool(10);
        CompletionService<String> completionService = new ExecutorCompletionService<String>(executorService);

        Cells cells = workbook.getWorksheets().get(0).getCells();
        for (int i = 0; i < count; i++) {
            completionService.submit(new RowReader(cells, i + from, "#" + (i + 1)));
        }

        List<String> rows = new ArrayList<>(count);

        int received = 0;
        boolean errors = false;
        while (received < count && !errors) {
            Future<String> resultFuture = completionService.take(); //blocks if none available
            try {
                rows.add(resultFuture.get());
                received++;
            } catch (Exception e) {
                //log
                System.err.println(e);
                errors = true;
            }
        }
        System.out.println("Takes " + (System.currentTimeMillis() - start) + "ms");

        for (String row : rows) {
            System.out.println(row);
        }
        assertFalse(errors);
    }

in.zip (253.2 KB)

@makarovalv
Thanks for providing further details.
We will look into it and get back to you soon.

@makarovalv

Generally, one Workbook instance used in multi-threads is not supproted.
However, the definition of RowReader is missed in your shared code, please share us a runnable project to reproduce the issue.

To get cells’ value with multi-threads, you may get more details from the api MultiThreadReading and try it for you application.

I’m sorry. I’ve edited my code several times and lost it

 private static class RowReader implements Callable<String> {
        private final Cells cells;
        private final int row;
        private final String name;

        private RowReader(Cells cells, int row, String name) {
            this.cells = cells;
            this.name = name;
            this.row = row;
        }

        @Override
        public String call() {
            StringBuilder sb = new StringBuilder(name + " ");

            int maxColumn = cells.getMaxColumn();
            System.out.println("Row #" + row + "MaxColumn=" + maxColumn);

            for (int i = 0; i < maxColumn; i++) {
                try {
                    Object v = cells.get(row, i).getValue();
                    sb.append(v == null ? "null" : v.toString());
                } catch (Exception e) {
                    System.err.println("Cell (" + row + ", " + i + "):" + e.getMessage());
                    throw e;
                }
            }
            return sb.toString();
        }
    }

It looks like this property is working correctly.

Best regards

@makarovalv,

Glad to know the property works for you.

For performance consideration, we use some global caches for accessing cells data and we cannot synchronise them always because that will cause poor performance for most of users. So, it is hard for us to make all operations available for multi-threads.

However, some changes in using those APIs of our component still can reduce the risk of issues for multi-threads. For example, when you traversing cells in a range or a row, using one iterator got from Range.iterator() or Row.iterator() is much better than getting cells one by one by row/column indices. The iterator can give you better performance, and it also can reduce the risk of concurrency issues.

Thank you. Is it posiible to process different worksheets in different threads? Could it throw some issues?

@makarovalv,
If you just reading cell’s value, like the code you post, there should be no issue for multi-threads. But if you need other features, such as getting the formatted value of cells(such as Cell.getStringValue(), getDisplayStringValue(), …etc.), concurrency issue may occur because the style settings are cached in workbook level.

In real application, i need to read formatted values with cell’s formats.

So is there no way to process this action in several threads?

@makarovalv,

We are afraid, for such situation you have to process cells of one workbook in a single thread. If your operation is time-consumed and multi-threads may improve the performance, we think you may try to create new workbook for every sheet(copying every sheet to a new workbook) and then process those workbooks in different threads.

But even when i try to collect DisplayStringValue from Cells instance, i get error

If you collect DisplayStringValue from cells of the same workbook in multi-threads, it is very possible to get concurrency issue. It is because this function requires to use the global cache of the workbook to format cell’s value.

How do you suggest creating copies of workbook? When i tried to copy by byte[] it is very ofter throws the exception

java.util.concurrent.ExecutionException: com.aspose.cells.CellsException: File is corrupted

@Test
    void multiThreading() throws Exception {

// License initialization AsposeBook.init("");

        System.out.println("Creating model");
        Workbook workbook = new Workbook("E://in.xlsx");

        System.out.println("Reading");
        long start = System.currentTimeMillis();

        ExecutorService executorService = Executors.newFixedThreadPool(10);
        CompletionService<Void> completionService = new ExecutorCompletionService<>(executorService);

        Cells cells = workbook.getWorksheets().get(0).getCells();
        cells.setMultiThreadReading(true);

        int batchSize = Math.max(1, 10 / cells.getMaxDataColumn());

        Map<Integer, Map<Integer, String>> data = new ConcurrentHashMap<>();

        byte[] body;
        try (ByteArrayOutputStream bous = new ByteArrayOutputStream()) {
            workbook.save(bous, SaveFormat.XLSX);
            body = bous.toByteArray();
        }

        int count = 0;
        int row = 0;
        while (row <= cells.getMaxDataRow()) {
            completionService.submit(new WorkbookRowReader(body, 0, row, row + batchSize, data));
            row += batchSize;
            count++;
        }

        int received = 0;
        boolean errors = false;
        while (received < count && !errors) {
            Future<Void> resultFuture = completionService.take(); //blocks if none available
            try {
                resultFuture.get();
                received++;
            } catch (Exception e) {
                //log
                System.err.println(e);
                errors = true;
            }
        }
        System.out.println("Takes " + (System.currentTimeMillis() - start) + "ms");
        assertFalse(errors);
        assertEquals(cells.getMaxDataRow(), data.size());
    }

    private static class WorkbookRowReader implements Callable<Void> {
        private final byte[] body;
        private final int sheet;
        private final int from;
        private final int to;
        private final Map<Integer, Map<Integer, String>> context;

        private WorkbookRowReader(byte[] body, int sheet, int from, int to, Map<Integer, Map<Integer, String>> context) {
            this.body = body;
            this.sheet = sheet;
            this.from = from;
            this.to = to;
            this.context = context;
        }

        @Override
        public Void call() throws Exception {
            try (InputStream is = new ByteArrayInputStream(body)) {
                Workbook workbook = new Workbook(is);
                Cells cells = workbook.getWorksheets().get(sheet).getCells();
                int maxIndex = Math.min(to, cells.getMaxDataRow());
                for (int rowIdx = from; rowIdx < maxIndex; rowIdx++) {
                    Map<Integer, String> rowCells = context.computeIfAbsent(rowIdx, k -> new HashMap<>());
                    Row row = cells.checkRow(rowIdx);
                    Iterator<?> cellIterator = row.iterator();
                    int i = 0;
                    while (cellIterator.hasNext()) {
                        Cell cell = (Cell) cellIterator.next();
                        rowCells.put(i++, cell.getDisplayStringValue());
                    }
                }
            }
            return null;
        }
    }

Is it possible to provide Thread-safe manipulation with global cache?

We will investigate your code and try to test it and figure out the issue. For copying or splitting workbook into multiple:
If there is only one worksheet in your workbook(it seems your code is just processing such kind of situation), you may try Workbook.Copy() to get multiple copies of the Workbook. However, your code which is instantiating new Workbook from the data of template file should work fine too. So we need to try your code and do more test for the exception you reported.

If there are multiple worksheets in your workbook, you may traverse those worksheets and copy every one(by Worksheet.Copy()) to a new Workbook.

It is complicated to make the operations with global cache thread-safe. We are afraid it is hard for us to implement it soon. On the other hand, performance is always one of the most important factor for us to take into consideration when we investigate some kind of change for our implementation. Currently we are not sure whether we can support such kind of feature.

I’ve found that bottleneck-method is Cell.getDisplayFormat(). Is there some approach to improve perfomance this method?

@makarovalv,

We have tested the code in your post but cannot find the issue/exception. To make sure we are using the same code to run, please send us your java source file or project which can be compiled and run successfully, we will do more test. Your code looks just ok and we have executed it many times without issue.

For the performance of Cell.getDisplayFormat(), we think you are talking about the method Cell.getDisplayStringValue(). To get the display string value for a cell, we need to check many things such as the cell style, conditional formatting, table formatting, …etc and apply those available settings to the cell. So it is reasonable that this method needs quite amount of time and we are afraid currently we have no better solution to improve it.

I’ve found, that issue happens because this book contains a lot of conditional formatting. Is there is some method to make a snapshot of book (with calculated values and applied conditional formatting) to another book without formulas and condition formatting? Either to stop “revalidation” of it?

@makarovalv,

If you do not need the conditional formatting be applied to cells when you process their values, you may use Cell.getStringValue() instead of Cell.getDisplayStringValue().

If you do need the result including the influence of conditional formatting, we are afraid even a “snapshot” of the workbook gives no benefit for performance. It is because in you code in fact you process every cell only one time too. When creating the snapshot of the workbook, every cell also needs to be checked and processed at least one time.

In your real application, if you need to use the result(such as Cell.DisplayStringValue) many times repeatedly, we think it is easy for you to set the result string to another new workbook(your snapshot) and then use the new workbook later.

We also provide some other cache mechanisms for performance consideration to access cells data. Please see startAccessCache and closeAccessCache of Workbook object with the cache options AccessCacheOptions.

When using the cache many data models are required to be “read-only”, we think it is suitable for your scenario because you only need to read data without changing them. You may try it in your code:

      public Void call() throws Exception {
            try (InputStream is = new ByteArrayInputStream(body)) {
                Workbook workbook = new Workbook(is);
                workbook.startAccessCache(AccessCacheOptions.ALL);
                ...
                workbook.closeAccessCache(AccessCacheOptions.ALL);
            }
            return null;