Faster render of sheets using Aspose Cells for Java

chetan.mishra · May 26, 2020, 10:42am

hi

We render some 100 cols by 17K rows worth of data in a single sheet and per sheet render time is really high. We use 17.12 but I have tried with latest version also - same effect.

We render cell by cell as that helps us have better control on data (if we have to put specific formatting)

Sample code :- (NOT ACTUAL - but does the same)

We have a source custom object “rawData” and a set of workbook styles “exportStyles”

Cells c=wrkSheet.getCells();
for(int i…)
{
for(int j;…)
{
if(rawData.get(i,j).getType().equalsIgnoreCase(“3Decimal”));
{
Double d=Double.parseDouble(rawData.get(i,j).getValue());
c.get(i,j).setValue(d);
Style cellStyle=exportStyles.get(“THREE_DECIMAL_STYLE”);
cellStyle.setCustom(“#,##0.000”);
c.get(i,j).setStyle(cellStyle);
}
else if (rawData.get(i,j).getType().equalsIgnoreCase(“2Decimal”))
{
Double d=Double.parseDouble(rawData.get(i,j).getValue());
c.get(i,j).setValue(d);
Style cellStyle=exportStyles.get(“TWO_DECIMAL_STYLE”);
cellStyle.setCustom(“#,##0.00”);
c.get(i,j).setStyle(cellStyle);
}
else if (rawData.get(i,j).getType().equalsIgnoreCase(“Integer_Typ”))
{
Integer intval=Integer.parseint(rawData.get(i,j).getValue());
c.get(i,j).setValue(intval);
Style cellStyle=exportStyles.get(“TWO_DECIMAL_STYLE”);
cellStyle.setCustom(“#,##0”);
c.get(i,j).setStyle(cellStyle);
}
}
}

Even If I remove the rawData from the picture and try to apply the values right out of DB or a file, the performance remains the same. So, I was sure that my custom object is not causing the issue.

I went through a below post (suggested to me earlier too) to render cells faster - render row by row - import one row as complete array at a time:

I see the performance bump up - my previous render time was 25 mins and was reduced to 2 mins BUT this benchmark is without this custom setting up of styles/custom formats.

So:
a) I was trying to look for a way if I do not have to follow this new approach - as have to make a lot of changes to my code to render the file differently.
Can I allocate a fixed amount of “cell (s)” ahead of time so that
c.get(i.j).setValue or c.get(i,j).setStyle do not take much time.

b) Setting the styling - the only way I see is to have cellRanges and then parse through one at a time and set the Style. IS there a faster way.

Memory is not a concern and does not even go to 20% of capacity also while running.

thanks
Chetan

amjad.sahi · May 26, 2020, 12:03pm

@chetan.mishra,

Thanks for your query.

I think you may try using LightCells API to read and write large data efficiently in Excel spreadsheets.
Aspose.Cells provides the LightCells APIs which is mainly designed to write or read cell’s data one by one without building a complete data model block into memory. This will surely increase performance and render data quickly. See the document for your reference:

Hope, this helps a bit.

chetan.mishra · May 26, 2020, 12:24pm

hi Amjad

My understanding is that LightcellsAPI is if we have memory constraints.
I do not have memory problems.

I just am trying to find the FASTEST way to render my file - trying to decrease my render time - thats all.

In my code already am doing:
cells.get(i,j).setValue() and cells.get(i,j).setStyle() to set the formatting.

This I believe takes longer time from the posts I saw.

From the post this is also what I saw:
a) “ImportDataTable” is fastest but that method is deprecated. So, what can I use

ICellsDataTable???

b) Will “importObjectArray” be faster than a)?

c) The only reason I render styling one cell at a time is because depending upon value a style is applied (I do not have more than 5-6 styles) but I just assign the right one.

cells.get(i.j).setValue and cells.get(i,j).setStyle

If I use a) or b) - do I still have to go cell by cell and then set the style.

I am looking for the FASTEST way to render my data and apply the styling.

Need your suggestions please.

amjad.sahi · May 26, 2020, 12:38pm

@chetan.mishra,

Well, LightCells API will process the task efficiently, it will take less amount of time to render the large data set for sure. So, you should give it a try for your scenario.

Regarding your second part, we recommend you to create your style objects with your desired formatting and then apply to your specific range of cells, rows/columns via using Range.applyStyle(), Row.applyStyle/Column.applyStyle() methods. See the document for your reference:

Also, see the document with example on what data importing options Aspose.Cells provides which you may use, it will enhance performance:

chetan.mishra · May 28, 2020, 11:05am

hi Amjad

LightCellsData Api says that it does not load the entire workbookcollection of cells into memory. So, then if I try to autofit it will not help. Is there a workaround there or will it work fine?

amjad.sahi · May 28, 2020, 12:11pm

@chetan.mishra,

Well, LightCells APIs are mainly designed for inserting/manipulating cells data one by one without building the complete data model block (Cell collection etc.) into memory. It works in an event-driven mode.In LightCells mode, when APIs of LightCellsDataProvider being called, part of the data of the resultant file has been created and some other data need to be gathered and created before creating cells data, such as the column settings including width, styles, and so on. I am afraid, in LightCells mode one cannot reset or change the settings for columns or rows in the implementation of LightCellsDataProvider. You have to set column width or row heights (via autofit operations) outside of the implementation, i.e., do you perform this task in the class that implements the light cells interface.

chetan.mishra · June 1, 2020, 8:07am

HI Amjad,

Good Morning!

Circling back to this…yes I do see some performance improvement with the lightcells api.

Understand that column operations are not permitted and things of that type as it proceeds in this order:

sheet -> Row -> Cell after cell

But, I do notice one big diff and I believe its to do with the way lightcells works… I notice that the file produced by lightcells is bigger in size - way bigger.

maybe its because - as characters are reused - e.g. say I have a value “chetan” or “0.12345” then in whole memory approach, it optimizes it and keeps only one entry. I see under xl folder

sharedString.xml - “chetan” or “0.12345” is there only one time.

In lightcells approach as it writes every cell back into the file, its not keeping track of previous strings/values being used. the xl folder - sharedString.xml does not have anything. Instead the sheet1.xml under xl->worksheets is bigger.

Can you please throw some light on this.

Thanks
Chetan

amjad.sahi · June 1, 2020, 10:27am

@chetan.mishra,

Did you implement LightCellsDataProvider and what is the returned value for IsGatherString()? Please note, if it returns true in your implementation, those string values will also be cached in the same way as you get with normal mode. So, your generated bigger sized file might be due to the fact. Please confirm what is IsGatherString returning results for your case. If you still find any issue or difference, kindly create a sample console Java program (runnable) and provide to reproduce the issue, we will check it soon.

chetan.mishra · June 1, 2020, 10:43am

Hi Amjad,

I took the implementation from the demo class. It had false. So, I set it as true. Then it worked.

I did not understand what you meant by (if it returns true) - we set it or it comes from somewhere.

image.png (1.3 KB)

Are there any examples of how this is set to true or false. Because from your words it appears that its not just supposed to be set to true but should return something.

amjad.sahi · June 1, 2020, 11:02am

@chetan.mishra,

Good to know that the suggested thing figures out your issue now.

I meant the following way which you have already done now.
i.e.,

.....
public boolean isGatherString() {
	return true;
}

(either you can use the line “return true;” or “return false;”)

chetan.mishra · June 1, 2020, 11:12am

ok Thanks Amjad… this seems to take care of my issue with the sample data. Will check with my main reports and tell you if I face a problem.

amjad.sahi · June 1, 2020, 11:14am

@chetan.mishra,

Sure, take your time to evaluate your scenario/ case using the suggested approach. Hopefully, you will see expected results.

chetan.mishra · June 1, 2020, 4:12pm

Hi Amjad,

Just few more q’s

When I write the file back - this is what we use

OoxmlSaveOptions opt = new OoxmlSaveOptions();
opt.setLightCellsDataProvider(mydataProvider);
CSVWB.save("//lightcells_output0.xlsx",opt);

As I mentioned I still need to do some column operations - like say setting col width or some cell merging operation - like say after lightcells renders the file - some kind of postprocessing.

One option is I reload the file back from disk into a workbook object but rather than that…I was trying to find can I do any other way so that I have the workbook object the way I want it to be - rendered with the sheets using lightcells api - ready for my set of postprocessing operations in memory as a workbook object.

Basically going back to this:
Does lightcells api only get invoked at the save point with the options?.
If it is so then no other option but I have to reload back the file after the save from disk like through a byteinputstream to do some kind of post processing.

The postprocessing cannot be eliminated because we add some company information in separate sheet with some logos and all, plus merging of cells, setting of col widths. It cannot be brought into the realm of cell by cell rendering using Lightcells api.

Thanks
Chetan

amjad.sahi · June 1, 2020, 6:05pm

@chetan.mishra,

You can perform all your post processing tasks ("…sheet with some logos and all, plus merging of cells, setting of col widths") before saving the file in light weight mode, you may do such tasks outside of the implementation of the light cells APIs.

chetan.mishra · June 2, 2020, 4:54am

Hi Amjad,

The setting of column widths, merging of the cells are to be done on the same sheet in which lightcells will render and its based on data.

My scenario is like this:

Meaning if the column contained the word “Age” on A4 then set it 0.5 inch wide. Now “Age” will only come after the file is rendered using lightcells.

Similarly, “Age” column can have 12,14,14,15,15,15,15,16 values in different rows.

so, in that case the rows with 14 value need to be merged. Similarly, with “15” value rows.

I was just trying to see if there was a way to get the file byte array after ooxmlsaveoptions are specified and the final file rendered.

Otherwise, worst scenario will have to read the file back from the same folder.

There are some data dependent things which I do in post process operation.

The reason why I was looking for the object back in memory is because - it will reduce the I/O time.

Worst case, will have to read the file back again.

Thanks and Regards
Chetan

amjad.sahi · June 2, 2020, 9:49am

@chetan.mishra,

Thanks for providing further details.

It is optimal that you manipulate the settings (column width, manipulate merged areas, …etc.) before saving the workbook with LightCells APIs, so if it is possible, you should pre-process the data (cell values) before transfer them to LightCellsDataProvider to gather those settings.

But if it is not possible for you, then you have to re-load the generated Excel file and change the settings (column width, manipulate merged areas, etc.) again. In short, if you cannot re-use the input data for cell values, we think there is no other way but to load the file again to achieve your requirements.

chetan.mishra · July 28, 2020, 4:34pm

HI Amjad/Ahsan,

I did not open a new post but continued here as this was originally for my question on Light Cells.

I am facing some issues still with file size even after setting the following:

public boolean isGatherString() ***
*** {
*** return true;***
*** }***

. The difference of my example code attached might be small (few KB) - But, in my actual code base, at times based on data, the size of the file changes by 2-3 Mb (Lightcells is more) at times. This is causing some issues with file limit sizes in place (in various places).

Do, you know by any chance any issues with lightcells or is this diff expected.

Thanks and Regards
ChetanUseLightCellsSample.zip (1.5 KB)
AsposeRegularFileGenerationandCompare.zip (664 Bytes)

amjad.sahi · July 28, 2020, 8:42pm

@chetan.mishra,

Thanks for the sample code (Java program).

Please note, Gathering string values will give you advantage only if there are many duplicated string values for the cells provided by this implementation. For example, in this situation gathering string will save more memory and will generate a smaller resultant file. But if there are many string values for the cells provided by LightCellsDataProvider but only few of them are identical/duplicate, then gathering string will surely consume more memory and cost more time. Also, in that case, it will not give advantage for the resultant file (file size as well).

Thanks for your understanding!

chetan.mishra · July 29, 2020, 9:11am

Hi Amjad,

Thanks for the explanation. I think I know why the file is more slightly when its rendered through LightCells.

You see previously I used to use (at the very start of this thread):
cells.get(x,y).putValue(value) to render my file.

CONS of this is : It takes time to traverse and find the cell. If not there, then allocate it and maintain reference.

PRO of that approach is: It only allocates the memory for the cells where I render the value.

image.png (7.5 KB)
(Attached snapshot from that post)

LIGHTCELLS approach - I have something like this (image below)
image.png (14.8 KB)

this is how my report is. Some headers (pageby items), row items and column items.

In LightCells as I understood I have to preform the layout of my report and pass it cell by cell to the api.

If you see the blue circles - please do not go by the size of them, they are the cells with no data (basically the blank area of the report).

I have circled some of the “MetricVal” denoting that there is no data for that combination. There can be certain combinations of row and dims (based on my project need) that may not have a value.
e.g rows has “tobacco consumption” , cols has age groups starting from age 3 and the metric is a sale number.

I believe this entire thing (the blanks) is getting added to reference - that is adding to weight of the file. It has no value but just “” in that cell (as it traverses row by row and for each row, traverses col by col).

In cells.get(x,y).putValue(Value) approach, based on that post, the memory does not get allocated to a cell that is not referenced.

In Lightcells then the question is:
Can I skip cell based on some condition or will it still add it.
I believe that cannot be controlled in nextCell() but in startCell() only.

Appreciate any light on this matter.

Thanks and Regards
Chetan

amjad.sahi · July 29, 2020, 10:47am

@chetan.mishra,

For your questions, commonly it should be done in nextCell() instead of startCell(). For example, if cell D2 should be empty without any settings, then in nextCell(), you may check:

if(rowIndex==1&&nextCol==3)
{
nextCol++;
return nextCol; //continue to process E2
}

here rowIndex/nextCol is some kind of vars in your implementation of LightCellsDataProvider to keep index of the row currently being processed and index of the column of next cell to be processed.

If you have to do it in StartCell(Cell cell) method, you should not update the given cell object. In this way this cell will also be ignored without being kept in memory or output to the generated file. We found your implementation in your previous post:

public int nextCell()
{

if (this.colIndex < (this.maxcol-1))
{
this.colIndex++;
return this.colIndex;
}
return -1;
}
with the scenario I decribed, user's code should be like:
public int nextCell()
{

if (this.colIndex < (this.maxcol-1))
{
this.colIndex++;
if(this.rowIndex==1 && this.colIndex==3)
{
this.colIndex++;
}
return this.colIndex;
}
return -1;
}