Cell ExtractContent

pjcunningham · May 8, 2013, 7:47am

Hello Aspose,

See the code below. I’m trying to extract the contents of table cells into individual Document objects (to be stored in a database later). The output document is empty, although Debug.WriteLine shows the cell value!

static void Main(string[] args)
{
    Aspose.Words.License wordLicense = new Aspose.Words.License();
    wordLicense.SetLicense("Aspose.Words.lic");
    Document document = new Document(@"C:\temp\extractcell.docx");
    NodeCollection nodes = document.GetChildNodes(NodeType.Cell, false);
    foreach(Cell cell in nodes)
    {
        var text = cell.ToString(SaveFormat.Text).Trim();
        Debug.WriteLine(text);
        ArrayList extractedNodes = ExtractContent(cell.FirstChild, cell.LastChild, false);
        // Insert the content into a new separate document and save it to disk.
        Document dstDoc = GenerateDocument(document, extractedNodes);
        dstDoc.Save(string.Format(@"c:\temp\{0}.docx", Guid.NewGuid()));
    }
}

Regards, Paul

pjcunningham · May 8, 2013, 7:49am

Hello Aspose,

The content is also empty when the ExtractContent parameter isInclusive is true.

Regards, Paul.

pjcunningham · May 9, 2013, 5:11am

Hello Aspose,

There was an error in the above code: I failed to pass true as the second argument of the GetChildNodes call. Here is the corrected code, but now I’m getting back the whole cell (including the table) instead of the contents of the cell.

static void Main(string[] args)
{
    Aspose.Words.License wordLicense = new Aspose.Words.License();
    wordLicense.SetLicense("Aspose.Words.lic");
    Document document = new Document(@"C:\temp\extractcell.docx");
    NodeCollection nodes = document.GetChildNodes(NodeType.Cell, true);
    foreach(Cell cell in nodes)
    {
        var text = cell.ToString(SaveFormat.Text).Trim();
        Debug.WriteLine(text);
        ArrayList extractedNodes = ExtractContent(cell.FirstChild, cell.LastChild, true);
        // Insert the content into a new separate document and save it to disk.
        Document dstDoc = GenerateDocument(document, extractedNodes);
        dstDoc.Save(string.Format(@"c:\temp\{0}.docx", Guid.NewGuid()));
    }
}

Regards, Paul

tahir.manzoor · May 9, 2013, 5:22am

Hi Paul,

Thanks for your inquiry. Perhaps you are using an older version of Aspose.Words; with Aspose.Words v13.4.0, I am unable to reproduce this problem on my side. I suggest you upgrade to the latest version, v13.4.0, and let us know how it goes. I hope this helps.

I have attached the output document to this post for your reference.

pjcunningham · May 9, 2013, 6:02am

Hello Tahir,

I’ve fixed the problem using v13.3. The call to ExtractContent is superfluous: if I replace

ArrayList extractedNodes = ExtractContent(cell.FirstChild, cell.LastChild, true);
Document dstDoc = GenerateDocument(document, extractedNodes);

with

ArrayList extractedNodes = new ArrayList(cell.Paragraphs.ToArray());
Document dstDoc = GenerateDocument(document, extractedNodes);

then the cell content is correct. This doesn’t explain why the first method doesn’t work in v13.3, though. I expect the problem is either with cell.FirstChild, or there has been a change in ExtractContent for v13.4.
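A minimal sketch of the same idea without ExtractContent or GenerateDocument, assuming Aspose.Words for .NET is referenced. The class CellExport and the method CellToDocument are hypothetical names used only for illustration; they are not part of the library or its samples. As far as I can tell, cell.Paragraphs holds only the cell's immediate paragraphs, so this version walks all direct children of the cell and would also carry over a nested table.

```csharp
using Aspose.Words;
using Aspose.Words.Tables;

static class CellExport
{
    // Hypothetical helper (not from the Aspose samples): copies the direct
    // children of a cell into a new, otherwise empty document.
    public static Document CellToDocument(Cell cell)
    {
        Document dstDoc = new Document();
        // A new Document starts with one empty paragraph; remove it first.
        dstDoc.FirstSection.Body.RemoveAllChildren();

        // NodeImporter translates styles and lists between the two documents.
        NodeImporter importer = new NodeImporter(
            cell.Document, dstDoc, ImportFormatMode.KeepSourceFormatting);

        // isDeep = false: take only the cell's immediate children.
        foreach (Node child in cell.GetChildNodes(NodeType.Any, false))
        {
            Node imported = importer.ImportNode(child, true);
            dstDoc.FirstSection.Body.AppendChild(imported);
        }
        return dstDoc;
    }
}
```

Inside the foreach loop from the earlier post, this would be used as CellExport.CellToDocument(cell).Save(...) with the same file name pattern.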

Regards, Paul.

pjcunningham · May 9, 2013, 6:16am

Hello Tahir,

I’ve just installed v13.4 and, using the version of ExtractContent that comes with the samples for release v13.3, the problem is still there.

Can you attach the version of ExtractContent and its related methods that you used in your testing?

Regards, Paul Cunningham.

tahir.manzoor · May 13, 2013, 3:23am

Hi Paul,

Thanks for your inquiry. Please use the following ExtractContent method and let us know if you still face any problem. I hope this helps.

private ArrayList ExtractContent(Node startNode, Node endNode, bool isInclusive)
{
    // First check that the nodes passed to this method are valid for use.
    VerifyParameterNodes(startNode, endNode);
    // Create a list to store the extracted nodes.
    ArrayList nodes = new ArrayList();
    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;
    // Extract content based on block level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    // We will split the content of the first and last nodes depending on whether the marker nodes are inline.
    while (startNode.NodeType != NodeType.Paragraph && startNode.NodeType != NodeType.Table)
        startNode = startNode.ParentNode;
    while (endNode.NodeType != NodeType.Paragraph && endNode.NodeType != NodeType.Table)
        endNode = endNode.ParentNode;
    bool isExtracting = true;
    bool isStartingNode = true;
    bool isEndingNode = false;
    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block level nodes and split the first and last nodes when needed so paragraph formatting is retained.
    // This method is a little more complex than a regular extractor because it has to handle inline nodes, fields, bookmarks etc. to be really useful.
    while (isExtracting)
    {
        // Clone the current node and its children to obtain a copy.
        CompositeNode cloneNode = (CompositeNode)currNode.Clone(true);
        isEndingNode = currNode.Equals(endNode);
        if (isStartingNode || isEndingNode)
        {
            // We need to process each marker separately so pass it off to a separate method instead.
            if (isStartingNode)
            {
                ProcessMarker(cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                isStartingNode = false;
            }
            // This conditional needs to be separate because the block level start and end markers may be the same node.
            if (isEndingNode)
            {
                ProcessMarker(cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                isExtracting = false;
            }
        }
        else
            // Node is not a start or end marker, simply add the copy to the list.
            nodes.Add(cloneNode);

        // Move to the next node and extract it. If next node is null that means the rest of the content is found in a different section.
        if (currNode.NextSibling == null && isExtracting)
        {
            currNode = currNode.NextPreOrder(currNode.Document);
            while (currNode.NodeType != NodeType.Paragraph) // && currNode.NodeType != NodeType.Table)
                currNode = currNode.NextPreOrder(currNode.Document);
        }
        else
        {
            // Move to the next node in the body.
            currNode = currNode.NextSibling;
        }
    }
    // Return the nodes between the node markers.
    return nodes;
}

pjcunningham · May 13, 2013, 4:33am

Hello Tahir,

That’s fixed it.

Regards, Paul.

tahir.manzoor · May 14, 2013, 9:11am

Hi Paul,

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help.