Vertical & Horizontal merge in table cells

Hi, I have a problem extracting cell merge information from documents created in Word. The merge property is never set for any cell, even when I create a table in Word and merge some cells.

If this is not possible, is there some other way to read this information?

Hi
Thanks for your request. By design, the rows of a table in a Microsoft Word document are completely independent: each row can have any number of cells of any width. So if you imagine a first row with one wide cell and a second row with two narrow cells, the cell in the first row will appear horizontally merged when you look at the document. But it is not a merged cell; it is just a single wide cell. Another perfectly valid scenario is when the first row has two cells: the first cell has CellMerge.First and the second cell has CellMerge.Previous. In this case it is a merged cell. In both cases the visual appearance in MS Word is exactly the same, and both cases are valid.
In the second case you can easily determine whether cells are merged horizontally (using the HorizontalMerge property):
https://reference.aspose.com/words/net/aspose.words.tables/cellformat/horizontalmerge/
But in the first case it is not so easy.
For vertically merged cells you can use the VerticalMerge property. Please see the following link for more information:
https://reference.aspose.com/words/net/aspose.words.tables/cellformat/verticalmerge/
Best regards.

Hi
OK, the span information can be a bit hard to get from the cells. But since the information is revealed as colspans and rowspans when saving to HTML, it can’t be impossible to get. What I do now is save to HTML and parse it as XML to get the rowspan and colspan information for the cells. There must be a better way!

Regards

Hi
Thanks for your inquiry. I didn’t say this is impossible, but it is not as easy to achieve as it sounds. We use a special algorithm that calculates the width of the cells and decides whether a cell should have a colspan in the HTML output or not.
So parsing the HTML is the easiest way to determine whether cells have a colspan.
You can also try to create your own algorithm that calculates the colspan for each cell in the table.
Best regards.

Hi ,
I saw the discussion that you had with Hjalmar about the colspan problem with Aspose.
What I’m doing is parsing a Word document into a custom XML file using DocumentVisitor, and the big problem is the colspan, which is not detectable through Aspose.
I saw that you said: “Parsing of HTML is the easiest way to determine whether cells have colspan”.
Is it possible for me to use this approach with DocumentVisitor? If yes, how do I do it?
Could you please share your experience with me so I can resolve my problem?
Many thanks.
Robert

Hi Robert,

Thanks for your inquiry. As I wrote earlier in this thread, it is very difficult to determine the colspan, because rows in a Word table are completely independent and can contain any number of cells of any width.
When converting to HTML, Aspose.Words calculates this value using a complex algorithm. You can try to convert your document to HTML and then parse it. Unfortunately, there is no way to convert only a particular node to HTML, so you have to convert the whole document to HTML.
You can try to create your own algorithm to calculate the colspan, but it is a very complex task. First you should calculate the width of the whole table. Then you should calculate the minimum width of each cell, determine the maximum number of cells per row in the table, and then calculate the colspan of each cell.
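As a rough illustration of what such a calculation could look like, here is a minimal sketch. It is not the algorithm Aspose.Words uses internally, and it swaps the steps listed above for a simpler technique: it collects the right edge of every cell into a common column grid and counts how many grid columns each cell covers. The class name ColSpanSketch, the method ComputeColSpans and the tolerance parameter are hypothetical, and the sketch assumes every row starts at the same left position and that the cell widths have already been read from the document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of Aspose.Words.
public static class ColSpanSketch
{
    // rows[r][c] is the width of cell c in row r (for example, in points).
    // Returns the colspan of every cell on a grid built from all cell edges.
    public static int[][] ComputeColSpans(double[][] rows, double tolerance = 0.5)
    {
        // Collect the right edge of every cell in every row.
        var edges = new List<double>();
        foreach (double[] row in rows)
        {
            double x = 0;
            foreach (double width in row)
            {
                x += width;
                edges.Add(x);
            }
        }
        edges.Sort();

        // Treat edges that are closer than the tolerance as one grid line.
        var grid = new List<double>();
        foreach (double edge in edges)
        {
            if (grid.Count == 0 || edge - grid[grid.Count - 1] > tolerance)
                grid.Add(edge);
        }

        // A cell spans as many grid columns as there are grid lines between
        // its left edge (exclusive) and its right edge (inclusive).
        var result = new int[rows.Length][];
        for (int r = 0; r < rows.Length; r++)
        {
            result[r] = new int[rows[r].Length];
            double left = 0;
            for (int c = 0; c < rows[r].Length; c++)
            {
                double right = left + rows[r][c];
                int covered = grid.Count(g => g > left + tolerance && g <= right + tolerance);
                result[r][c] = Math.Max(1, covered); // a cell always spans at least one column
                left = right;
            }
        }
        return result;
    }
}
```

For the example from the start of this thread (one wide cell above two narrow cells), the wide cell gets a colspan of 2 and each narrow cell gets a colspan of 1. Real documents also have row indents, nested tables and vertically merged cells, which this sketch ignores.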
Best regards

There are two alternatives. The one that works best is to create a config file where you specify every table and where its colspans are. This takes time, but it’s the one I had to use. The alternative, which almost always works, is to use the built-in function in Aspose that saves the Word file to HTML, and then parse that file to get the colspans. This will work as long as the tables are normal; if the cells are weird in ways only MS Word can make them, then you’re in for solution 1.

Here is some sample code (if you aren’t familiar with LINQ, then you should be):

// Save the document to HTML and parse it as XML
using (MemoryStream docToHtml = new MemoryStream())
{
    document.Save(docToHtml, SaveFormat.Html);
    docToHtml.Position = 0;
    System.Xml.XmlReader xr = System.Xml.XmlReader.Create(docToHtml);
    XElement html = XElement.Load(xr);

    // Extract the colspan and rowspan information for each table
    tableDfn = new XElement("tables",
    (from tab in html.Descendants("table")
        select new XElement("table", (from tr in tab.Descendants("tr")
        select new XElement("tr", (from td in tr.Descendants("td")
        select new XElement("td", (from attr in td.Attributes()
        where attr.Name.LocalName.Contains("span")
        select attr))))))));
}

With this XML, just keep track of which table, row and cell you are in as you navigate the Word file, so you can query the XML to get the colspans.
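A minimal sketch of such a query, assuming the XML has the tables/table/tr/td shape produced by the code above and that you track zero-based table, row and cell indexes yourself. The class SpanLookup and the method GetSpan are hypothetical names used only for illustration:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of Aspose.Words.
public static class SpanLookup
{
    // Returns the value of the given span attribute ("colspan" or "rowspan")
    // for the cell at the zero-based indexes (tableIdx, rowIdx, cellIdx).
    // Returns 1 when the attribute is absent or the position does not exist.
    public static int GetSpan(XElement tables, int tableIdx, int rowIdx, int cellIdx, string name)
    {
        XElement cell = tables.Elements("table").ElementAtOrDefault(tableIdx)
            ?.Elements("tr").ElementAtOrDefault(rowIdx)
            ?.Elements("td").ElementAtOrDefault(cellIdx);
        XAttribute attr = cell?.Attribute(name);
        return attr == null ? 1 : (int)attr;
    }
}
```

For example, SpanLookup.GetSpan(tableDfn, 0, 0, 1, "colspan") reads the colspan of the second cell in the first row of the first table. One caveat: an HTML row omits the td elements that are covered by a rowspan from a row above, so the cell index in the HTML may not line up with the cell index in the Word row when vertically merged cells are involved.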

Good luck!

Is the right approach to save the Word document in HTML format, reload it, and parse it using DocumentVisitor?

Document doc = new Document("Document.doc");
doc.Save("Out.html");

doc = new Document("Out.html");
MyDocToTxtWriter myConverter = new MyDocToTxtWriter();
doc.Accept(myConverter); 

Thanks

Robert,

You do not need to save the document as HTML and then reload it. You can just determine the indexes of the current table and row, find the corresponding table and row in the HTML, and then determine the rowspan and colspan of each cell in that row.
Best regards.

Thanks Alexey,

Here is the XML file which is the result of your code:

<tables>
  <table>
    <tr>
      <td />
      <td colspan="9" />
    </tr>
    <tr>
      <td rowspan="2" />
      <td rowspan="2" />
      <td rowspan="2" />
      <td colspan="2" />
      <td rowspan="2" />
      <td rowspan="2" />
      <td rowspan="2" />
      <td rowspan="2" />
      <td rowspan="2" />
    </tr>
  </table>
</tables>

How do I keep this in step with the Word document and the table name? As you can see, there is no name for the table.
If I understood you correctly, you propose to move to the next table in the XML every time I meet a table in the Word document, and to move to the next row and cell every time I meet a new row and a new cell?
If that is what you mean, any hints on how to do it would be appreciated.
Thanks

Hi

Thanks for your inquiry. I created a simple example of how you can parse the HTML and determine the colspan and rowspan of each cell. I used the XmlDocument DOM, but you can change the code and use LINQ to get the necessary information. Here is my code:

// Open document
Document doc = new Document(@"Test013\in.doc");
// Create visitor
SpanVisitor visitor = new SpanVisitor(doc);
// Accept visitor
doc.Accept(visitor);

========================================================
Here is the code of the visitor:

public class SpanVisitor : DocumentVisitor
{
    /// <summary>
    /// Creates a new SpanVisitor instance.
    /// </summary>
    /// <param name="doc">The document to parse.</param>
    public SpanVisitor(Document doc)
    {
        // get collection of tables from the document
        mWordTables = doc.GetChildNodes(NodeType.Table, true);
        // Convert document to HTML
        // We will parse HTML to determine rowspan and colspan of each cell
        MemoryStream htmlStream = new MemoryStream();
        doc.SaveOptions.HtmlExportImagesFolder = Path.GetTempPath();
        doc.Save(htmlStream, SaveFormat.Html);
        doc.Save(@"Test013\out.html");
        // Load HTML into the XML document
        XmlDocument xmlDoc = new XmlDocument();
        htmlStream.Position = 0;
        xmlDoc.Load(htmlStream);
        // Get collection of tables in the HTML document
        XmlNodeList tables = xmlDoc.DocumentElement.SelectNodes("//table");
        foreach (XmlNode table in tables)
        {
            TableInfo tableInf = new TableInfo();
            // Get collection of rows in the table
            XmlNodeList rows = table.SelectNodes("tr");
            foreach (XmlNode row in rows)
            {
                RowInfo rowInf = new RowInfo();
                // Get collection of cells
                XmlNodeList cells = row.SelectNodes("td");
                foreach (XmlNode cell in cells)
                {
                    // Determine rowspan and colspan of the current cell
                    // (0 means the attribute is absent, i.e. the cell is not merged)
                    XmlAttribute colSpanAttr = cell.Attributes["colspan"];
                    XmlAttribute rowSpanAttr = cell.Attributes["rowspan"];
                    int colSpan = colSpanAttr == null ? 0 : Int32.Parse(colSpanAttr.Value);
                    int rowSpan = rowSpanAttr == null ? 0 : Int32.Parse(rowSpanAttr.Value);
                    CellInfo cellInf = new CellInfo(colSpan, rowSpan);
                    rowInf.Cells.Add(cellInf);
                }
                tableInf.Rows.Add(rowInf);
            }
            mTables.Add(tableInf);
        }
    }
    public override VisitorAction VisitCellStart(Aspose.Words.Tables.Cell cell)
    {
        // Determine index of current table
        int tabIdx = mWordTables.IndexOf(cell.ParentRow.ParentTable);
        // Determine index of current row
        int rowIdx = cell.ParentRow.ParentTable.IndexOf(cell.ParentRow);
        // And determine index of current cell
        int cellIdx = cell.ParentRow.IndexOf(cell);
        // Determine colspan and rowspan of current cell
        int colSpan = 0;
        int rowSpan = 0;
        if (tabIdx < mTables.Count &&
        rowIdx < mTables[tabIdx].Rows.Count &&
        cellIdx < mTables[tabIdx].Rows[rowIdx].Cells.Count)
        {
            colSpan = mTables[tabIdx].Rows[rowIdx].Cells[cellIdx].ColSpan;
            rowSpan = mTables[tabIdx].Rows[rowIdx].Cells[cellIdx].RowSpan;
        }
        Console.WriteLine("{0}.{1}.{2} colspan={3}\t rowspan={4}", tabIdx, rowIdx, cellIdx, colSpan, rowSpan);
        return VisitorAction.Continue;
    }
    private List<TableInfo> mTables = new List<TableInfo>();
    private NodeCollection mWordTables = null;
}

======================================================================
And here is the code of the helper classes:

/// <summary>
/// Helper class that contains a collection of RowInfo, one for each row.
/// </summary>
public class TableInfo
{
    public List<RowInfo> Rows
    {
        get { return mRows; }
    }
    private List<RowInfo> mRows = new List<RowInfo>();
}
/// <summary>
/// Helper class that contains a collection of CellInfo, one for each cell.
/// </summary>
public class RowInfo
{
    public List<CellInfo> Cells
    {
        get { return mCells; }
    }
    private List<CellInfo> mCells = new List<CellInfo>();
}
/// <summary>
/// Helper class that contains information about a cell. Currently this is only the colspan and rowspan.
/// </summary>
public class CellInfo
{
    public CellInfo(int colSpan, int rowSpan)
    {
        mColSpan = colSpan;
        mRowSpan = rowSpan;
    }
    public int ColSpan
    {
        get { return mColSpan; }
    }
    public int RowSpan
    {
        get { return mRowSpan; }
    }
    private int mColSpan = 0;
    private int mRowSpan = 0;
}

I hope this helps.
Best regards.

Hi,
I got your solution and it was working very well until I updated my Aspose version to the latest one.
Now when the code hits the following line:
doc.Save(htmlStream, SaveFormat.Html);
I get this error message:
Image file cannot be written to disk. When saving the document to a stream either HtmlExportImagesFolder should be specified or custom streams should be provided via HtmlExportImageSaving event handler. Please see documentation for details.
Could you please help?
Regards

Hi

Thanks for your inquiry. You just need to specify where images will be stored when converting the document to HTML. For example, see the following code:

// Specify the folder where images will be saved during export to HTML
doc.SaveOptions.HtmlExportImagesFolder = @"C:\Temp\images";
doc.Save(htmlStream, SaveFormat.Html);

Best regards.

Hi,
Unfortunately, it doesn’t work.
Look at your code above:

MemoryStream htmlStream = new MemoryStream();
doc.SaveOptions.HtmlExportImagesFolder = Path.GetTempPath();
doc.Save(htmlStream, SaveFormat.Html);

I even replaced Path.GetTempPath() with @"C:";
it doesn’t work either, with the same error message.
Try it yourself.
Thanks

Hi

Thanks for your request. This code works fine on my side.

Document doc = new Document(@"C:\Temp\in.doc");
// Specify the folder where images will be saved during export to HTML
doc.SaveOptions.HtmlExportImagesFolder = Path.GetTempPath();
MemoryStream htmlStream = new MemoryStream();
doc.Save(htmlStream, SaveFormat.Html);

Best regards.

Hi,
I tried again and it doesn’t work with version 6.3.0.0; however, I had kept the previous version, 6.0.1.0.
So I uninstalled 6.3.0.0 and reinstalled version 6.0.1.0, and it works as before.
Then I went back to 6.3.0.0, and it does not work at all with this version.
Please help.
PS: I attached my .doc test file to this message.
Thanks
Robert

Hi Robert,

I still cannot reproduce the problem on my side. Since you do not need the images when saving the document to HTML, maybe you should try code like the following:

public void Test089()
{
    Document doc = new Document(@"C:\Temp\I-C1-1.1_test_tout.doc");
    MemoryStream htmlStream = new MemoryStream();
    doc.SaveOptions.HtmlExportImageSaving += new ExportImageSavingEventHandler(SaveOptions_HtmlExportImageSaving);
    doc.Save(htmlStream, SaveFormat.Html);
}
void SaveOptions_HtmlExportImageSaving(object sender, ExportImageSavingEventArgs e)
{
    e.ImageStream = new MemoryStream();
}

Best regards.

Thanks Alexey, it works now.

Hi,
I am new to Aspose.
I have some tables in HTML files and would like to export them to CSV format. I tried the code snippet, but the XmlDocument and MemoryStream classes are not found.
Could you please help?
TIA
Sheeba

Hello
Thanks for your request. This question is not related to Aspose.Words. In your case you should just add the System.IO and System.Xml namespaces to your project.
Best regards,