Not extracting text between bookmarks properly when start of document is table

I have a document where I have placed a bookmark at the beginning of the document by using:

DocumentBuilder builder = new DocumentBuilder(doc);

builder.MoveToDocumentStart();
builder.StartBookmark(bookmarkName);

BookmarkEnd bmEnd = builder.EndBookmark(bookmarkName);
builder.MoveToDocumentEnd();
Paragraph para = builder.InsertParagraph();
para.ParagraphBreakFont.Size = 1;
para.AppendChild(bmEnd);

This should place the bookmark start at the beginning of the document and the bookmark end at the end. However, some documents have a table as the first item. When that is the case, the bookmark start ends up inside the first cell of the table, and the first couple of rows of the table are not extracted when we attempt to extract everything between the bookmark start and end. This is creating a large problem for us. How can we use the bookmark to extract everything in this scenario? Is there some way to get the bookmark start before the table?

Thanks,
Steven

Hi Steven,

Thanks for your inquiry. To ensure a timely and accurate response, please attach the following resources here for testing:

  • Your input Word document.
  • The output Word file that shows the undesired behavior.
  • A standalone console application (source code without compilation errors) that reproduces the problem on our end.

As soon as you have these ready, we’ll start investigating your issue and provide more information. Thanks for your cooperation.

PS: To attach these resources, please zip them and click the ‘Reply’ button, which takes you to the reply page. At the bottom of that page you can add attachments to your post by clicking the ‘Add/Update’ button.

Well, that seems like an awful lot of information to provide for a problem that I’m pretty sure someone has experienced before…

So instead of doing that, why don’t I share with you the workaround I found? The key is getting the bookmark outside of the table at the start of the document body. If the first item in the document is a table, then calling DocumentBuilder.MoveToDocumentStart() and DocumentBuilder.StartBookmark(bookmark) places the bookmark inside the first cell of the table. So instead, check to see if the first node is a Table node, and if it is, add a blank paragraph to the beginning of the document before calling MoveToDocumentStart():

if(Document.FirstSection.Body.FirstChild.NodeType == NodeType.Table)
    Document.FirstSection.Body.InsertBefore(new Paragraph(Document), Document.FirstSection.Body.FirstChild);

I think I can shrink the paragraph so it doesn’t mess with the document spacing as much.
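
Putting the workaround together, a minimal sketch might look like the following. The class and method names (BookmarkHelper, StartBookmarkBeforeLeadingTable) are hypothetical and not part of Aspose.Words. The sketch assumes the document has at least one section with a body, and it reuses the ParagraphBreakFont.Size = 1 trick from the first post to keep the extra paragraph small.

```csharp
using Aspose.Words;

// Hypothetical helper class; the names here are illustrative only.
public static class BookmarkHelper
{
    // Starts a bookmark at the very beginning of the document body.
    // If the body begins with a table, an empty paragraph is inserted in front
    // of it first, so the bookmark start lands before the table instead of
    // inside its first cell.
    public static BookmarkStart StartBookmarkBeforeLeadingTable(Document doc, string bookmarkName)
    {
        Body body = doc.FirstSection.Body;

        if (body.FirstChild != null && body.FirstChild.NodeType == NodeType.Table)
        {
            Paragraph spacer = new Paragraph(doc);
            // Shrink the paragraph break so the extra line takes up little space.
            spacer.ParagraphBreakFont.Size = 1;
            body.PrependChild(spacer);
        }

        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.MoveToDocumentStart();
        return builder.StartBookmark(bookmarkName);
    }
}
```

The bookmark end can then be placed exactly as in the first post. Because the bookmark start now sits in a paragraph that is a direct child of the body, the extraction starts at that paragraph and the table that follows it is picked up as a whole.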

While this works, I think the existing behavior of MoveToDocumentStart() when a table is the first item in the document is not desirable. If I want to go to the start of the document, I want to go immediately before the table, not into the first cell of the table. Even worse is the extraction when the bookmark start is in the first cell of the table. I wanted the bookmark to exist in the extracted nodes and it does not. The first several rows of the table are also not extracted. FYI, I am using the extraction method found here.

If you have a better idea of a workaround, I would love to hear it!

Thanks,
Steven

I added the template document PSHNotice that we use to initially create the document. I also attached the output document (with sensitive info changed to xxxxxxx). As you can see, the first several rows of the table are missing.

Hi Steven,

Thanks for sharing the details. Please get the code of ExtractContent from the Aspose.Words for .NET examples repository on GitHub.

Yes, you can use the workaround you shared to get the desired output. Alternatively, you may add the highlighted code snippet (the "if (currNode.NodeType == NodeType.Table)" block that follows the declaration of currNode) to the ExtractContent method to fix this issue.

Please let us know if you have any more queries.

public static ArrayList ExtractContent(Node startNode, Node endNode, bool isInclusive)
{
    // First check that the nodes passed to this method are valid for use.
    VerifyParameterNodes(startNode, endNode);
    // Create a list to store the extracted nodes.
    ArrayList nodes = new ArrayList();
    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;
    // Extract content based on block level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    // We will split the content of the first and last nodes depending on whether the marker nodes are inline.
    while (startNode.ParentNode.NodeType != NodeType.Body)
        startNode = startNode.ParentNode;
    while (endNode.ParentNode.NodeType != NodeType.Body)
        endNode = endNode.ParentNode;
    bool isExtracting = true;
    bool isStartingNode = true;
    bool isEndingNode = false;
    // The current node we are extracting from the document.
    Node currNode = startNode;
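    // If the start marker resolved to a table (for example, a bookmark start inside its first cell),
    // add the whole table to the result and continue extracting from the node that follows it.
    if (currNode.NodeType == NodeType.Table)
    {
        nodes.Add(currNode);
        currNode = currNode.NextSibling;
    }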
    // Begin extracting content. Process all block level nodes and specifically split the first and last nodes when needed so paragraph formatting is retained.
    // This method is a little more complex than a regular extractor because it has to handle inline nodes, fields, bookmarks, etc. to be really useful.
    while (isExtracting)
    {