Performance, How to reduce overhead?

Hi, this questions is rather generic (how to get performance),
but I’ll be rather specific with what we’ve already tried/applied.

We’ve been using several different Aspose components (Cells, Words, Email) for quite some time now to our satisfaction. It has enabled us to grow our company to the point where performance starts to matter because of the sheer amount calculations we need to perform.

It is in this capacity that I am asking for some guidance as to assure we are getting the best out of your tools and we do not have any huge flaws or misunderstanding in our design. I am not reporting a problem, I do not have any reproductions.

In our typical design we are using Cells to perform calculations on a Workbook and then read out the results. The workbook is data-driven, which means we use a designer to link it to our datasource (Dataset). Some of our formulas are dynamic, meaning that they change with the number of rows in the datasource.

In code, we perform the following steps:

  1. Fetch the binary data(contents) for the Workbook.
    This step is performed only once.
  2. Create an instance of the workbook and put this in the local cache.
  3. Each time we need a workbook of the same type, fetch this workbook from cache.
    And use the ‘Copy’ method to initialize a new instance.
  4. Link a designer to the workbook.
  5. Use ‘SetDataSource’ to link the WorkBook to our DataSet.
  6. Call the process-method on the designer, I believe this is what makes our formulas grow and adds records to the worksheet.
  7. Use CalculateFormula on the document.
  8. Read the results of our formulas/worksheet and put them in some result object.
  9. Save the parsed workbook to a stream for logging.

Now we know these things:

  1. CalculateFormulas takes up about 80+ percent of all processing time. (About 5 seconds out of 6)
  2. The 'CalcStackSize’ property can have a huge impact on performance.
    A value of 250 turns out to be ~ 2 times as fast as a value of 9.
  3. We are already using: MemorySetting = MemorySetting.MemoryPreference
  4. We use ParsingFormulaOnOpen = false, since at the time we open the workbook, the designer is not even linked yet. And I doubt we can process the designer on a workbook before opening (copy) it); Perhaps the ‘RecalculateBeforeSave’ method could suit our needs.
  5. CalculateFormula can create a formula-tree to speed up the next calculations based on the 'CreateCalcChain’ property. Much like regular expressions. But we are unsure about dependencies between Designer.Process /*Perform */ and WorkBook.CalculateFormula.

What we do not know is:

  1. Is the calculation-chain copied when you make a copy?
  2. Is the chain reset when you apply the datasource?
  3. Can we take one Workbook, Calculate the formulas and then recycle it with a different datasource over and over, or is applying the datasource a destructive process?
  4. Does Designer.Process /*Perform */ destroy the calculation-chain?
  5. Is it absolutely necessary to use CalculateFormulas, or can the values be determined otherwise, through some clever interactions?
  6. What we think we want to achieve is to have an initialized WorkBook, where the contents have been read and the formulas are ready to go. Then stick that version in cache and create copies when needed. Then only add the rows using the designer. And then read the results.
    Is this even possible?

To summarize, since ‘CalculateFormulas’ is so expensive, we seek to reduce time spend in that method or reduce the number of times we need to call it.

Any feedback and or suggestions would be quite helpful.
Kind regards, Bas.

Side-node.
During performance/stress testing we found that both the constructor and the copy method are not thread-safe and need to be protected by a ‘lock(someobject)’ in order to avoid weird crashes like: ‘unexpected eof found’, "illegal character ’ ’ ".

We need to use the method like below, removing the lock will result in problems.
lock (StartUpLock)
{
result.Copy(cachedCopy);
return result;
}

@info.verne.nu

Thanks for considering Aspose APIs.

We have logged your points in our database for analysis. We will look into them and help you asap.

This issue has been logged as

  • CELLSNET-45759 - Performance, How to reduce overhead?

@info.verne.nu

For some points you talked about:

  1. CalcStackSize

Yes, this property will influence the performance and too small value will make the performance poor. This property and smaller size are used for some special and complex formulas which may cause StackOverflowException. Normally 100-200 should be proper for most environments and formulas. If you have formulas which need so small value such as 9, please send us the template file, we will check whether we can make special improvement for it.

  1. MemorySetting.MemoryPreference

This option is used to decrease the memory cost when the workbook has large data sets for cells. On the other hand, this option will degrade the performance a bit for other operations such as accessing the cells, calculating formulas, …etc.

  1. ParsingFormulaOnOpen

This option has no influence for processing the workbook designer with data source. And the option RecalculateBeforeSave takes effect only for the generated excel file when opening it by MS-Excel.

  1. CreateCalcChain

For your situation, the formulas(and/or its dependents) are dynamic. The created chain will have to be re-built when processing the designer. And creating the chain needs more time when comparing with calculating formulas only. So, we don’t think this property can help your situation. And we are afraid it may cause the performance worse. So the answer to your questions:

Is the calculation-chain copied when you make a copy?
No, we don’t copy the chain, and it has no help for your situation.

Is the chain reset when you apply the datasource?
Yes, if the chain has been built, it will be re-built because the dependents have been changed.

Does Designer.Perform destroy the calculation-chain?
Yes. We think “Designer.Perform” should be “Designer.Process”?


For your other questions:

  1. Is it absolutely necessary to use CalculateFormulas, or can the values be determined otherwise, through some clever interactions?

It depends on whether the formulas refer to the data which will be set/changed when you are processing the designer with data source. If the data that formulas refer to has been changed, we are afraid, the only way to get the refreshed result is re-calculate those formulas.

  1. What we think we want to achieve is to have an initialized Workbook, where the contents have been read and the formulas are ready to go. Then stick that version in cache and create copies when needed. Then only add the rows using the designer. And then read the results. Is this even possible?

We are not sure whether the formulas are all existing formulas in your template for the designer or exist in the smart-markers which will be expanded with the data source when processing the designer. If they are all existing formulas, we think what you said is possible. If those formulas do not refer to the changed data, then they even need not to be re-calculated after processing the designer with data source.


Any ways, we do not know what your template and designer are like, it will be better if you can provide the template file and processing example to us. If so we can make further investigation and try to give some suggestions.

Thank you for your swift and rather detailed reply.
We are working with many different templates, but in a typical case our formulas grow with the data, so I do not think we can get around calling CalculateFormulas().
I suppose it is already optimized to skip formulas that have not changed.

“We think “Designer.Perform” should be “Designer.Process”?”
Yes, I wrote that part from memory, stupid.

“formulas which need so small value such as 9”
We don’t, I was just testing the limits, to put in a nice graph.

"workbook has large data sets "
How many (tables * rows * columns) is considered large?

“If so we can make further investigation and try to give some suggestions”
Taken the new information into account, I think I’ll suggest to my co-workers that we just boot up some more machines.

@info.verne.nu

Thanks for your posting and using Aspose APIs.

We have logged your comment in our database for product team consideration and advice. Once, we will have some more news for you, we will update you asap.

@info.verne.nu

For your question:

How many (tables * rows * columns) is considered large?

Well, “Large” means the memory cost becomes trouble for user’s application because there are too many cells data in the workbook that even cause OutOfMemory error. User can determine whether they need to use this option according to their environments and applications. And the memory optimization is made for every cell data (value and settings for the cell), so the more cells with simple value (blank, numeric, boolean, error. For String and formulas, the optimization gives no help) in the workbook, the more advantages can be got by MemorySetting.MemoryPreference option. If there are enough memory and user pays more attention to the time cost, this option is not needed.

in a typical case our formulas grow with the data

For such case, we think the formulas exist in the smart markers and are expanded when filling data, just as you said, we are afraid there is no optimization can be made for CalculateFormulas().