Presently evaluating Aspose.Cells.dll for use in a web application. Doing some tests where it has to convert large HTML tables (>10,000 rows) to Excel format. Strange thing is, on one occasion (a unit test converting a big list of users from HTML to .xlsx), time complexity is linear (i.e. time doubles as number of rows doubles). In another test (an aspx running on IIS converting a big log report from HTML to .xlsx), time complexity is quadratic (time quadruples as number of rows doubles).
Does anybody have any experience with this? Are there any special parameters I should use to improve time complexity and/or memory footprint?
Hi Ruud,
Thank you for contacting Aspose support.
I am afraid, we require a sample application along with the input documents to replicate the said issue on our side. It will help us to properly investigate the matter, and then we will be in a better position to assist you in this regard. We also request you to use the same application against the latest version of
Aspose.Cells for .NET 8.9.0.3 to benchmark the processing on your end and share your results here.
- Operating system version & architecture
- .NET Framework version
- Locale/region of the machine
Hi Babar,
Thanks for the fast response. I managed to narrow down the problem to our use of colspan in HTML tables. Turns out the problem is very straightforward to reproduce. Consider the following table.
Colspan test
| Cell 1 |
Cell 2 |
| Text underneath |
Duplicate the line that starts with
20000, 40000, 60000 times, and measure how long it takes to convert the HTML to .xlsx format using Aspose.Cells.
20000: 1.6 seconds (inaccurate due to startup overhead of our unit test framework)
40000: 1.6 seconds
60000: 2.4 seconds
Now replace colspan="1" by colspan="2", and repeat the same tests.
20000: 45 seconds
40000: 194 seconds
60000: 557 seconds
The difference is dramatic. Time complexity appears to have gone from linear to quadratic. Double the number of records, and the amount of time quadruples.
Taking a wild guess: this happens because for every new row, column width is recalculated, taking into account all previous rows. Is there any way to postpone width calculation until the last row has been added, or bypass the automatic sizing altogether? I already tried adding style to every cell specifying a fixed width, but that didn't help.
Unfortunately, our customers were used to importing an even greater amount of records into Excel using HTML import; they will expect the same from an Aspose-based implementation.
Please let me know if you need more info to reproduce the problem.
Here's a minimal reproducible situation (Windows 10 x64, Visual Studio 2015). I created a C# unit test project with the following test class.
using Aspose.Cells;
using System.IO;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace AsposeColspan
{
[TestClass]
public class UnitTest1
{
[TestMethod]
public void Test20kRowsWithColspan1()
{
WriteHtmlThenConvertToXlsx(20000, 1);
}
[<span style="color: rgb(43, 145, 175);">TestMethod</span>]
<span style="color: blue;">public</span> <span style="color: blue;">void</span> Test20kRowsWithColspan2()
{
WriteHtmlThenConvertToXlsx(20000, 2);
}
<span style="color: blue;">private</span> <span style="color: blue;">void</span> WriteHtmlThenConvertToXlsx(<span style="color: blue;">int</span> rows, <span style="color: blue;">int</span> colspan)
{
<span style="color: blue;">using</span> (<span style="color: blue;">var</span> inputStream = <span style="color: blue;">new</span> <span style="color: rgb(43, 145, 175);">MemoryStream</span>())
{
<span style="color: blue;">using</span> (<span style="color: blue;">var</span> writer = <span style="color: blue;">new</span> <span style="color: rgb(43, 145, 175);">StreamWriter</span>(inputStream))
{
writer.WriteLine(<span style="color: rgb(163, 21, 21);">"<html>\n<head><title>Colspan test</title></head>\n<body>\n<table>\n"</span>);
<span style="color: blue;">for</span> (<span style="color: blue;">var</span> i = 0; i < rows; i++)
{
writer.WriteLine(<span style="color: blue;">string</span>.Format(<span style="color: rgb(163, 21, 21);">"<tr><td colspan=\"{0}\">Foo</td></tr>\n"</span>, colspan));
}
writer.WriteLine(<span style="color: rgb(163, 21, 21);">"</table>\n</body>\n</html>\n"</span>);
inputStream.Position = 0;
<span style="color: blue;">using</span> (<span style="color: blue;">var</span> outputStream = <span style="color: blue;">new</span> <span style="color: rgb(43, 145, 175);">MemoryStream</span>())
{
<span style="color: blue;">using</span> (<span style="color: blue;">var</span> workbook = <span style="color: blue;">new</span> <span style="color: rgb(43, 145, 175);">Workbook</span>(inputStream, <span style="color: blue;">new</span> <span style="color: rgb(43, 145, 175);">LoadOptions</span>(<span style="color: rgb(43, 145, 175);">LoadFormat</span>.Html)))
{
<span style="color: green;">workbook.Worksheets[0].AutoFitColumns();</span>
workbook.Save(outputStream, <span style="color: rgb(43, 145, 175);">SaveFormat</span>.Xlsx);
}
}
}
}
}
}
}
The first test takes less than 1 second to complete, the second one over 30 seconds.
I already tried outcommenting the AutoFitColumns line, but that didn’t change performance.
Hi Ruud,
Thank you for sharing the details and sample code. I have performed a few tests on my side against the latest version of Aspose.Cells for .NET 8.9.0.3, and I am able to notice that said issue. In order to further investigate the matter, I have logged this incident as CELLSNET-44625 in our bug tracking system. Please spare us little time to properly analyze the case and revert back with updates in this regard.
Hi,
Please try the scenario against the latest version of Aspose.Cells for .NET 8.9.0.4 (attached) and let us know your feedback.
Thanks for the updated assembly; that was fast! I did some more tests, this time with a slightly more representative table. It uses colspan in alternating rows, like this:
+-----+-----+-------+
| One | Two | Three |
+-----+-----+-------+
| Spanning | Bye |
+-----+-----+-------+
| One | Two | Three |
+-----+-----+-------+
| Spanning | Bye |
+-----+-----+-------+
To benchmark this using the test class I posted earlier, replace the WriteLine that is inside the 'for' loop by:
writer.WriteLine(string.Format("
| One |
Two |
Three |
\n
| Spanning |
Bye |
\n", colspan));
Colspan 1 completes in less than 2 seconds, colspan 2 takes almost 27 seconds. With 8.9.0.3, colspan 2 would take 69 seconds, so that's already a big improvement.
The bad news is, time complexity still seems quadratic when using colspan; clearly visible when I triple the number of rows:
- WriteHtmlThenConvertToXlsx(20000, 1); // 2 seconds
- WriteHtmlThenConvertToXlsx(20000, 2); // 27 seconds
- WriteHtmlThenConvertToXlsx(60000, 1); // 5 seconds
- WriteHtmlThenConvertToXlsx(60000, 2); // 455 seconds
I'm afraid I won't be able to sell that to my manager.
It really seems as if the whole table is being recalculated for every row added. Is there any way this can be bypassed or postponed until after the last row has been added?
I can imagine this requires quite some analysis and coding from your side. I hope you can give me a estimate on how soon this can be fixed. Coming Friday, I have to make a decision on which library to use to export our data in Excel format.
Hi,
Thanks for providing further details.
I have tested your scenario/ case with the updated line of code (as per your suggestion) using our latest version/fix: Aspose.Cells for .NET v8.9.0.4 . Well, I get different results on my end. Here is my complete test code (runnable):
e.g
Sample code:
…
var watch =
new System.Diagnostics.Stopwatch();
watch.Start();
WriteHtmlThenConvertToXlsx(20000, 1);
watch.Stop();
Console.WriteLine(“colspan = 1: rows = 20000:” + watch.Elapsed);
watch.Start();
WriteHtmlThenConvertToXlsx(20000, 2);
watch.Stop();
Console.WriteLine(“colspan = 2: rows = 20000:” + watch.Elapsed);
watch.Start();
WriteHtmlThenConvertToXlsx(60000, 1);
watch.Stop();
Console.WriteLine(“colspan = 1: rows = 60000:” + watch.Elapsed);
watch.Start();
WriteHtmlThenConvertToXlsx(60000, 2);
watch.Stop();
Console.WriteLine(“colspan = 2: rows = 60000:” + watch.Elapsed);
…
private static void WriteHtmlThenConvertToXlsx(
int rows,
int colspan)
{
using (var inputStream =
new MemoryStream())
{
using (var writer =
new StreamWriter(inputStream))
{
writer.WriteLine("\nColspan test\n\n\n");
for (var i = 0; i < rows; i++)
{
///writer.WriteLine(string.Format("<td colspan="{0}">Foo\n", colspan));
writer.WriteLine(
string.Format(“\n<td colspan=”{0}">Spanning\n", colspan));
}
writer.WriteLine("
\n\n\n");
inputStream.Position = 0;
using (var outputStream =
new MemoryStream())
{
using (var workbook =
new Workbook(inputStream,
new LoadOptions(LoadFormat.Html)))
{
///workbook.Worksheets[0].AutoFitColumns();
workbook.Save(outputStream, SaveFormat.Xlsx);
}
}
}
}
}
............
Output:
colspan = 1: rows = 20000:00:00:05.3342776
colspan = 2: rows = 20000:00:00:08.4431001
colspan = 1: rows = 60000:00:00:17.0401517
colspan = 2: rows = 60000:00:00:25.8763989
As you can see it does not take much time and efficient. I have conducted the test on my system with normal configurations (OS: Windows 8, CPU: i7, RAM: 8GB, VS.NET WinForm solution, etc.)
Please let us know if there is anything different for my pasted code to replicate the issue?
Thank you.
Hi Amjad,
Thanks for your reply. Those are numbers I’d love to see! Looking at your source code, I notice one remarkable difference; you outcommented AutoFitColumns. Now I did try that with 8.9.0.3 (and found little improvement), but forgot all about it when testing against 8.9.0.4; all my latest tests ran with AutoFitColumns.
If AutoFitColumns indeed makes such a difference, then living without it is something worth considering.
Due to holiday, I won’t be able to access my .NET development environment for a while (one of my colleagues may jump in, though). Could you please repeat your test with AutoFitColumns and see if the benchmark results match mine? Just to rule out the possibility that something else was seriously wrong in my test.
Hi,
I’m a colleague of Ruud and I took over the evaluation of Aspose from him.
I can confirm your findings. The quadratic durations that Ruud mentioned are indeed caused by the call to AutoFitColumns. I can reproduce the durations that you mention, and they are indeed fast and linear.
Thanks,
Ad
Hi Ad,
Thank you for the confirmation. As we have reopened the previously logged ticket, we will require some time to further analyze the case and get back to you with updates in this regard.
Hi,
We are pleased to inform you that the issue logged earlier as “CELLSNET-44625” has been fixed. We will soon provide the fix after performing QA and incorporating other enhancements and fixes.
Thank you.
Hi,
Thanks for your using Aspose.Cells.
Please download and try the latest fix: Aspose.Cells for .NET v8.9.0.5 and let us know your feedback.
Hi,
And, here is output console timings using the code segment with v8.9.0.5 which looks efficient and linear now:
colspan = 1: rows = 20000:00:00:02.5592600
colspan = 2: rows = 20000:00:00:04.6552064
colspan = 1: rows = 60000:00:00:10.5251295
colspan = 2: rows = 60000:00:00:16.7006286
Thank you.
Hi,
I can confirm that AutoFitColumns is much faster now and that it scales linearly with the data size in verion 8.9.0.5.
Thanks for the fast response.
Regards,
Ad
Hi,
Thanks for your feedback.
Good to know that your issue is sorted out by the new version/fix. Feel free to contact us any time if you have any issue, queries or comments, we will be happy to assist you soon.
Thank you.
The issues you have found earlier (filed as CELLSNET-44625) have been fixed in this update.
This message was posted using Notification2Forum from Downloads module by Aspose Notifier.