Unexpected CPU hot spot in ZipOutputStream when opening XLSB in Java

TarasTielkes · September 5, 2019, 7:53pm

Hi,

Using Aspose Cells for Java, version 19.8.

Profiling the loading of XLSB files, we observe an unexpected CPU hot spot.
See below a “Flame graph” rendering of the sampled CPU data:
aspose-cells-19.8.png (42.0 KB)

During loading of XLSB, Aspose Cells seems to spend significant time compressing, which is not what we would expect (we would expect decompressing, not compressing).

Can you explain the two observed call paths leading to com.aspose.cells.a.f.zk#write(byte[], int, int) and com.aspose.cells.a.f.zk#a(com.aspose.cells.a.f.zi)?

Thanks in advance,
Taras

ahsaniqbalsidiqui · September 6, 2019, 1:05am

@TarasTielkes,
I tried to reproduce this issue using my own sample XLSB file but could not. Please share your sample file and code snippet with us for testing. We will reproduce the problem and provide our feedback on the above calls after analysis.

TarasTielkes · September 6, 2019, 6:23am

@ahsaniqbalsidiqui

Please find the test XLSB file attached:
EIOPA_SolvencyII_DPM_Dictionary_2.1.0.zip (395.5 KB)

The performance test simply loads the provided file in a loop.

ahsaniqbalsidiqui · September 6, 2019, 8:15am

@TarasTielkes,
I loaded it a hundred times with the code below but did not notice any problem.

for (int i = 1; i <= 100; i++)
{
    Workbook workbook = new Workbook("EIOPA_SolvencyII_DPM_Dictionary_2.1.0.xlsb");
}

Could you share your code snippet for our testing if you still face the problem with the latest version?

TarasTielkes · September 6, 2019, 8:45am

@ahsaniqbalsidiqui your code is sufficient to reproduce the problem.
If you have trouble reproducing the CPU profiling results, simply put a breakpoint in com.aspose.cells.a.f.zk#write(byte[], int, int). It will be repeatedly triggered during loading of the provided file.

I’d like to understand why Aspose Cells is compressing data during loading of XLSB.
The flamegraph from the original post will also provide you the complete call stack, starting from the Workbook constructor.

Kind regards,
Taras

TarasTielkes · September 6, 2019, 9:42am

Also note that a similar CPU hotspot does not occur when loading XLSX files; it is specific to the handling of the XLSB format by Aspose Cells.

amjad.sahi · September 6, 2019, 9:58am

@TarasTielkes,

As you know, an XLSX/XLSB file is an archive containing many parts of the workbook. While loading the template file, we do not always parse all of those entries. Some unparsed entries must be kept in memory for later use, such as re-saving the workbook or parsing those entries further for other processes. However, to limit memory usage, instead of keeping all data of the original file in memory, we keep only those unparsed entries and compress them into one data block.
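
To illustrate the idea described above, here is a minimal sketch that packs a few unparsed entries into one compressed in-memory block and reads one back later. It uses only java.util.zip and is not Aspose's actual implementation: the class name UnparsedEntryStore, the methods pack and read, and the part names are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
class UnparsedEntryStore {

    // Packs all unparsed entries into a single in-memory ZIP block.
    static byte[] pack(Map<String, byte[]> unparsed) throws IOException {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(block)) {
            for (Map.Entry<String, byte[]> e : unparsed.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                // The deflate work (and its CPU cost) happens in this write call.
                zip.write(e.getValue(), 0, e.getValue().length);
                zip.closeEntry();
            }
        }
        return block.toByteArray();
    }

    // Decompresses one entry from the block when it is needed later,
    // or returns null if the block has no entry with that name.
    static byte[] read(byte[] block, String name) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(block))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = zip.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return out.toByteArray();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> unparsed = new LinkedHashMap<>();
        // Hypothetical part names with placeholder payloads.
        unparsed.put("xl/printerSettings/printerSettings1.bin", new byte[64 * 1024]);
        unparsed.put("customXml/item1.xml", new byte[16 * 1024]);

        byte[] block = pack(unparsed);
        System.out.println("packed block size: " + block.length + " bytes");
        System.out.println("restored entry size: "
                + read(block, "customXml/item1.xml").length + " bytes");
    }
}
```

The trade-off is the one discussed in this thread: the block takes less memory than the raw entries, but every retained entry costs a ZipOutputStream.write call during loading, even if the workbook is never saved again.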

Thanks for your understanding.

TarasTielkes · September 6, 2019, 10:05am

Hi @Amjad_Sahi

From your explanation, it would be beneficial if I could indicate that I only want to read the file, so that Aspose Cells could both skip the compression effort and consume less memory. A very large part of our interactions with the Aspose API are read-only (i.e. only loading data, and not saving files).

I still wonder why this specific behavior only happens for XLSB, given that its internal structure is very similar to that of XLSX.

When we tried to optimize the speed of some of our batch processing flows, we expected to get a performance increase from switching from XLSX to XLSB. However, the bottleneck we see in the profiler causes our overall performance to degrade instead when switching from XLSX to XLSB, which is a bit disappointing.

Kind regards,
Taras

amjad.sahi · September 6, 2019, 11:36am

@TarasTielkes,

We will evaluate it and get back to you soon.

Yes, that is strange. Please give us a little time to evaluate it and provide our feedback.

amjad.sahi · September 10, 2019, 12:57pm

@TarasTielkes,

We need to investigate and evaluate your issue thoroughly. I have logged a ticket with the ID “CELLSJAVA-43002” for your issue. We will look into it soon.

Once we have an update on it, we will let you know.

ahsaniqbalsidiqui · September 12, 2019, 7:29am

@TarasTielkes,
This is to inform you that we have fixed your issue (logged earlier as “CELLSJAVA-43002”) now. We will soon provide you the fixed version after performing QA and incorporating other enhancements and fixes.

amjad.sahi · September 12, 2019, 9:24am

@TarasTielkes,

Please try our latest version/fix: Aspose.Cells for Java v19.8.6 (attached)

Your issue should be fixed in it.

Let us know your feedback.
aspose-cells-19.8.6-java.zip (6.7 MB)

TarasTielkes · September 12, 2019, 12:35pm

Hi @Amjad_Sahi,

The XLSB parsing performance of 19.8.6 is much better, good work.
In some of our test cases, the performance is close to double that of 19.8.0.

That said, in the performance profile data for 19.8.6, I still observe a fair amount of CPU being spent on compression. In one of my tests it’s around 7% now, which is much better than the roughly 22% observed using 19.8.0.

The remaining call path I see with the profiler is:

main  Runnable CPU usage on sample: 968ms
  java.util.zip.Deflater.deflateBytes(long, byte[], int, int, int) Deflater.java (native)
  java.util.zip.Deflater.deflate(byte[], int, int, int) Deflater.java:444
  java.util.zip.Deflater.deflate(byte[], int, int) Deflater.java:366
  java.util.zip.DeflaterOutputStream.deflate() DeflaterOutputStream.java:251
  java.util.zip.DeflaterOutputStream.write(byte[], int, int) DeflaterOutputStream.java:211
  java.util.zip.ZipOutputStream.write(byte[], int, int) ZipOutputStream.java:331
  com.aspose.cells.a.f.zk.write(byte[], int, int)
  com.aspose.cells.a.c.zab.a(zm, zm)
  com.aspose.cells.zrz.a(HashMap)
  com.aspose.cells.zapn.a(Workbook, LoadOptions, boolean)
  com.aspose.cells.zjp.a(zm)
  com.aspose.cells.zjp.a(String, zm, LoadOptions)
  com.aspose.cells.Workbook.a(String, LoadOptions)
  com.aspose.cells.Workbook.<init>(String)

It would be interesting to know the background of this remaining CPU hotspot, and whether we can somehow prevent it, for example by indicating that we are opening the workbook for reading only and do not need to save it later.

Kind regards,
Taras

amjad.sahi · September 12, 2019, 12:45pm

@TarasTielkes,

Good to know that XLSB parsing performance is improved now. I have logged your profiler trace and concerns against your issue into our database. We will evaluate and once we have an update on it, we will let you know.

amjad.sahi · September 16, 2019, 8:39am

@TarasTielkes,

As we said, when the existing content in the template file will be used again after loading (for example, to re-save the workbook), we need to keep the entries that were not completely parsed while loading. For performance considerations, we may consider providing an option for a “readonly” mode, but we need some time to investigate further.

Once we have an update on it, we will let you know.

TarasTielkes · September 16, 2019, 9:19pm

Hi @Amjad_Sahi,

Thanks for investigating this. I would assume a read-only mode would allow you to make a number of optimizations in terms of what to parse, and which data structures to retain in-memory.

I think such optimizations make a lot of sense: many scenarios served by Aspose Cells relate to data ingestion, where the source is in one of the formats you support, but is only parsed and never serialized again. I know that in many of our use-cases, we would be able to benefit from this. Once you have a build to try, I will be happy to repeat my profiling and share the results back with you.

Kind regards,
Taras

ahsaniqbalsidiqui · September 17, 2019, 12:35am

@TarasTielkes,
Thank you for your comments. We will work on this enhancement later and you will be notified once it is done. Please feel free to write back to us if you have any other query in this regard.

amjad.sahi · November 27, 2019, 9:32am

@TarasTielkes,

Please try our latest version/fix: Aspose.Cells for Java v19.11.3 (attached)

Your issue “CELLSJAVA-43002” should be fixed in it.
In the new fix/version, we provide a new option for this performance optimization: LoadOptions.KeepUnparsedData. Setting this property to false should fit your requirement.
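
As a rough sketch of how the option might be used from Java, assuming Aspose.Cells for Java v19.11.3 or later is on the classpath and that the KeepUnparsedData property is exposed through a setKeepUnparsedData(boolean) setter (the usual Java naming in this library, not confirmed in this thread). The class name ReadOnlyLoad is hypothetical and the file name is the sample shared earlier.

```java
import com.aspose.cells.LoadFormat;
import com.aspose.cells.LoadOptions;
import com.aspose.cells.Workbook;

// Hypothetical example class, for illustration only.
public class ReadOnlyLoad {
    public static void main(String[] args) throws Exception {
        LoadOptions options = new LoadOptions(LoadFormat.XLSB);
        // Assumed setter for the KeepUnparsedData property: do not retain
        // unparsed entries, since the workbook is only read and never re-saved.
        options.setKeepUnparsedData(false);

        Workbook workbook = new Workbook("EIOPA_SolvencyII_DPM_Dictionary_2.1.0.xlsb", options);
        System.out.println("Worksheets loaded: " + workbook.getWorksheets().getCount());
    }
}
```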

Let us know your feedback.
Aspose_Cells_Java_v19.11.3.zip (6.6 MB)

TarasTielkes · November 27, 2019, 7:57pm

Hi @Amjad_Sahi,

The new feature improves XLSB loading performance by ~5% for us.
This is a nice and welcome improvement, thank you.

Does this optimization affect XLSB only, or should it also affect other file types?
What are the effects if one tries to save a workbook that has been loaded with this option?

Kind regards,
Taras

ahsaniqbalsidiqui · November 28, 2019, 7:05am

@TarasTielkes,
Currently, this optimization affects XLSB and other OOXML file formats, such as XLSX, XLSM, etc. In the future, we may extend it to other file formats when required and possible.

If one tries to save a workbook that has been loaded with this option, some contents or settings in the original file may be lost. Sometimes MS Excel may show a warning and open the generated file in Protected View.