Problem in colspan while converting HTML to Doc File

Arrk · June 13, 2011, 6:05am

Hello,
I have a problem with colspan when converting HTML to a DOC file.
If a cell has colspan="4", the converted document contains the same content repeated 4 times; likewise, colspan="2" produces two repeated copies.
I searched the Aspose forum for a solution and found the code below:

NodeCollection cells = doc.GetChildNodes(NodeType.Cell, true);
foreach(Cell cell in cells)
{
    // Check whether the cell is merged with the previous one.
    if (cell.CellFormat.HorizontalMerge == CellMerge.Previous ||
        cell.CellFormat.VerticalMerge == CellMerge.Previous)
    {
        // Remove content from the cell.
        cell.RemoveAllChildren();
    }
}

But I have one problem with this: I get the HTML from a database and store it in a string variable, so I cannot use the above code.
I am using Aspose 6.6. Is this fixed in a newer version?
Please help!

alexey.noskov · June 13, 2011, 7:45am

Hi
Thanks for your request. Could you please provide us with your HTML string and the output document produced on your side? We will check the issue and provide you with more information.
Best regards,

Arrk · June 13, 2011, 12:50pm

Hello,

I have attached the HTML code file.

In my .NET code I used the following to get the HTML value:

Document doc = new Document(filePath);

string htmlText = doc.GetText();

If you look at htmlText, the same description appears 4 times in the place where I set colspan="4". If you convert this to a Word file, only one copy is visible and the other 3 are hidden; if you copy and paste the content somewhere else, you can see the hidden values.

The main problem is with colspan: if you set colspan="3", you get three copies, and if you print the document the words overlap.

How can I solve this issue? I am using Aspose 6.6.

Thanks.

alexey.noskov · June 13, 2011, 1:54pm

Hi
Thank you for the additional information. Actually, your code does not get an HTML string; it just extracts plain text from the document. If you need the HTML string, you should use code like the following:

// Required namespaces: System.IO, System.Text, Aspose.Words
public string ConvertDocumentToHtml(Document doc)
{
    string html = string.Empty;
    // Save the document to a MemoryStream in HTML format.
    using (MemoryStream htmlStream = new MemoryStream())
    {
        doc.Save(htmlStream, SaveFormat.Html);
        // Get the HTML string.
        html = Encoding.UTF8.GetString(htmlStream.GetBuffer(), 0, (int) htmlStream.Length);
    }
    // There could be a BOM at the beginning of the string.
    // We should remove it from the string.
    // The length check avoids an exception if the string is empty.
    while (html.Length > 0 && html[0] != '<')
        html = html.Substring(1);
    return html;
}

This code converts your document to HTML and returns the HTML string.
Also, it is not quite clear to me why you cannot use the workaround you found. Just run that code right after loading the document.
Best regards,
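
A possible alternative to trimming characters until the first '<', as a sketch only: reading the stream through a StreamReader lets .NET skip a UTF-8 byte order mark on its own. The class and method names here are hypothetical and not part of Aspose.Words; the stream is assumed to be the MemoryStream that was filled by doc.Save(htmlStream, SaveFormat.Html).

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of Aspose.Words.
public static class HtmlStreamText
{
    // Reads the whole stream as UTF-8 text. StreamReader detects and
    // skips a leading byte order mark, so no manual trimming is needed.
    // Note: disposing the reader also closes the stream passed in.
    public static string ReadWithoutBom(MemoryStream htmlStream)
    {
        // Rewind, because saving leaves the position at the end.
        htmlStream.Position = 0;
        using (StreamReader reader = new StreamReader(htmlStream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Unlike the loop, this does not drop ordinary text that happens to precede the first tag; it only skips the byte order mark.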

Arrk · June 13, 2011, 9:32pm

Hello,

Thanks for your reply,

but this is not the reply I was expecting. I am not using an HTML document.
I get the HTML value from the database; I stored it in the document I attached here for privacy reasons.

I found the workaround, but it only solves the problem when the HTML code is stored in a document. In my case the HTML code comes from a database, so I want to eliminate the extra columns that Aspose generates, because Aspose does not support colspan.

Thanks.

alexey.noskov · June 14, 2011, 1:09am

Hi
Thank you for the additional information. But it is still not clear to me why you cannot use the workaround. What difference does it make whether you get the HTML from a file or from a database? For example, say you get your HTML from a database as a string. Then you insert this HTML into a document, and then you need to extract plain text from the document. In this case your code will look like this:

// Get HTML string. In your case you get it from database,
// in my case I get it from file.
string html = File.ReadAllText(@"Test001\HtmlCode.html");
// Create a document and insert HTML into the document.
Document doc = new Document();
// DocumentBuilder will help us to insert HTML.
DocumentBuilder builder = new DocumentBuilder(doc);
builder.InsertHtml(html);
// Here we use the workaround to remove content from the merged cells.
NodeCollection cells = doc.GetChildNodes(NodeType.Cell, true);
foreach(Cell cell in cells)
{
    // Check whether the cell is merged with the previous one.
    if (cell.CellFormat.HorizontalMerge == CellMerge.Previous ||
        cell.CellFormat.VerticalMerge == CellMerge.Previous)
    {
        // Remove content from the cell.
        cell.RemoveAllChildren();
    }
}
// Now we extract plain text from the document.
string plainText = doc.ToTxt();
// Print the extracted text.
Console.WriteLine(plainText);

Hope this helps.
Best regards,