Cell value truncated at line break when importing HTML

I am converting HTML to XLSX. Some cells contain line breaks (<br />). These cells are imported inconsistently. I tested Aspose.Cells 16.12.0 with some HTML samples (see below) using the following code.

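using System;
using System.Globalization;
using System.Threading;
using Aspose.Cells;

class Program
{
    static void Main(string[] args)
    {
        // Pin the culture so the import does not depend on the machine's locale.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
        Console.Out.WriteLine("My Aspose Console");

        // Load the HTML sample and auto-fit the columns of the first sheet.
        HTMLLoadOptions opts = new HTMLLoadOptions(LoadFormat.Html);
        Workbook wb = new Workbook("HtmlInput.html", opts);
        wb.Worksheets[0].AutoFitColumns();

        // Save under a unique name so repeated runs do not overwrite each other.
        wb.Save(string.Format("out.{0}.xlsx", Guid.NewGuid()), SaveFormat.Xlsx);
        Console.Out.WriteLine("Done.");
        Console.In.ReadLine();
    }
}
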
Sample HTML #1: space between br and slash; table not wrapped in div.
<table><tr><td>One<br />Two<br />Three<br />Four<br />Five</td></tr></table>
Sample HTML #2: space between br and slash; table wrapped in div.
<div><table><tr><td>One<br />Two<br />Three<br />Four<br />Five</td></tr></table></div>
Sample HTML #3: no space between br and slash; table not wrapped in div.
<table><tr><td>One<br/>Two<br/>Three<br/>Four<br/>Five</td></tr></table>
Sample HTML #4: no space between br and slash; table wrapped in div.
<div><table><tr><td>One<br/>Two<br/>Three<br/>Four<br/>Five</td></tr></table></div>
Results:
  1. Cell value truncated at first line break; i.e. A1 = "One".
  2. Cell value truncated at first line break; i.e. A1 = "One".
  3. Cell value is one contiguous string without line breaks; i.e. A1 = "OneTwoThreeFourFive".
  4. Cell value is split up across multiple rows; i.e. A1 = "One", A2 = "Two", A3 = "Three", A4 = "Four", A5 = "Five".

To add to the confusion, the behavior totally changes when I add a rowspan (>1) to the TD tag. Then I get a merged cell (spanning the number of rows specified in the rowspan attribute) that does contain the complete value, including line breaks. This was clearly the result of a bugfix implemented in 16.12.0. The bugfix is a real improvement for cells with a rowspan, so please leave it like that. My current concern is with cells without a rowspan (or rowspan = 1); their behavior should be more in line with aforementioned bugfix.

Returning to the 4 samples above, in my opinion they should all behave the same. Personally, I’d vote for having the entire cell content in A1, including line breaks (in line with how Aspose currently handles rowspanned cells). But if you prefer to split across rows (like my sample #4 did; and also how Excel itself renders HTML), then that’s fine with me too.

Please note that the behavior of non-rowspanned cells is not something introduced in 16.12.0; I noticed the same behavior in 16.11.0. Version 8.9.0 had better behavior; at present I am stuck there out of fear of breaking existing reports in our application.

Hi Ruud,


Thank you for sharing the samples.

I have evaluated all four scenarios using Aspose.Cells for .NET 16.12.0 and was able to reproduce the issue for sample HTML 1, 2 and 3. Sample HTML 4 generates correct results when compared with how Excel 2010 renders the same HTML segment.

For samples 1 and 2, I have raised a single ticket, CELLSNET-45004, because both samples exhibit the same behaviour. For sample 3, I have raised ticket CELLSNET-45005. Please review the attached snapshots for your reference and allow us some time to analyze these cases properly; we will get back to you with updates.

@held1353

Thanks for using Aspose APIs.

This is to inform you that we have now fixed your issues (CELLSNET-45004 and CELLSNET-45005). We will provide the fix soon, after performing QA and including other enhancements and fixes.

@held1353

Please download and try the following fix. It should fix your issues.


Regarding CELLSNET-45004 and CELLSNET-45005, please note the following:

  1. When data containing a <br> tag is imported, it is placed in a single cell, with a line feed at each <br>.

  2. All four sample files now give the same result when imported via Aspose.Cells.


We used the following code to test it; you can try it as well.

C#

// filePath is not defined in the original snippet; it is assumed here to be the
// folder (with trailing separator) that holds the sample files.
string filePath = "./";
string[] files = { "sample1.html", "sample2.html", "sample3.html", "sample4.html" };

Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
Console.Out.WriteLine("My Aspose Console");

HTMLLoadOptions opts = new HTMLLoadOptions(Aspose.Cells.LoadFormat.Html);
opts.AutoFitColsAndRows = true;

for (int i = 0; i < files.Length; i++)
{
	Workbook wb = new Workbook(filePath + files[i], opts);

	// Produces e.g. "sample1.html_out.xlsx" in the same folder as the input.
	string outFile = filePath + Path.GetFileName(files[i]) + "_out.xlsx";
	wb.Save(outFile, SaveFormat.Xlsx);
}
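
To check the two points above without opening the saved files in Excel, one could read cell A1 back right after loading. The following is only a minimal sketch, not part of the test code used for the tickets: the class name BrImportCheck, the helper Import and the inline copies of the four samples are illustrative assumptions, and it presumes a version of Aspose.Cells that contains the fix. With the fix in place, each sample should report the full text in A1 and a last data row index of 0.

```csharp
using System;
using System.IO;
using Aspose.Cells;

static class BrImportCheck
{
    // Hypothetical helper: writes the HTML to a temporary file, loads it the
    // same way as the test code above, and returns the first sheet's cells.
    public static Cells Import(string html)
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".html");
        File.WriteAllText(path, html);
        try
        {
            HTMLLoadOptions opts = new HTMLLoadOptions(LoadFormat.Html);
            Workbook wb = new Workbook(path, opts);
            return wb.Worksheets[0].Cells;
        }
        finally
        {
            File.Delete(path);
        }
    }
}

class Program
{
    static void Main()
    {
        string cell = "<table><tr><td>One<br />Two<br />Three<br />Four<br />Five</td></tr></table>";
        string[] samples =
        {
            cell,                                        // sample 1
            "<div>" + cell + "</div>",                   // sample 2
            cell.Replace("<br />", "<br/>"),             // sample 3
            "<div>" + cell.Replace("<br />", "<br/>") + "</div>" // sample 4
        };

        for (int i = 0; i < samples.Length; i++)
        {
            Cells cells = BrImportCheck.Import(samples[i]);

            // Make line feeds visible in the console output.
            string a1 = cells["A1"].StringValue
                .Replace("\r", "\\r")
                .Replace("\n", "\\n");

            // MaxDataRow is the zero-based index of the last row holding data;
            // 0 means everything ended up in row 1.
            Console.WriteLine("Sample {0}: A1 = \"{1}\", last data row = {2}",
                i + 1, a1, cells.MaxDataRow);
        }
    }
}
```

If a sample still prints a last data row greater than 0, or an A1 value without line feeds, the installed version most likely predates the fix.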

The issues you reported earlier (filed as CELLSNET-45004 and CELLSNET-45005) have been fixed in Aspose.Cells for .NET 18.6. For installation instructions, please see the documentation article "Install Aspose Cells through NuGet".