Extract content from bookmark generates incorrect result using Java

Hi,
We are using Aspose.Words for Java 11.4.
The code below is used to get page-wise content using extractContent:
for (Bookmark srcBookmark : sourceDoc.getRange().getBookmarks())
{
    if (srcBookmark.getName().contains("b_Page_"))
    {
        Node startNode = srcBookmark.getBookmarkStart();
        Node endNode = srcBookmark.getBookmarkEnd();
        List nodes = extractContent(startNode, endNode, false);
        Document newDoc = generateDocument(sourceDoc, nodes);
    }
}

We are using the extractContent and generateDocument methods provided at:
Aspose.Words for Java|Documentation

Documents are attached.

The first page has two tables, but content is extracted only up to the first table.
Please suggest a possible solution.

Has the extractContent method been updated for the latest Aspose version? If so, please share the link.

Thanks.

Hi Sonali,

Thanks for your query. I have changed the bookmark position in your document. Please find it in the attachment and set the bookmark as shown in the attached dummy_extract_content_issue_updated.png file.

Please check BookmarkIssue.png for your original document. There, the BookmarkEnd node is inside a table cell. To extract the complete table, the BookmarkEnd node should come after the table node, as shown in dummy_extract_content_issue_updated.png.
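
To make the point about the BookmarkEnd position concrete, here is a minimal sketch of how one might detect that a marker sits inside a table by walking up its parent chain. SimpleNode and findAncestor are hypothetical stand-ins written for this illustration, not Aspose.Words classes; in Aspose.Words itself, Node.getAncestor should offer a comparable lookup.

```java
import java.util.ArrayList;
import java.util.List;

public class BookmarkEndLocator {
    /** Hypothetical stand-in for a document node; not the Aspose.Words Node class. */
    static final class SimpleNode {
        final String type;
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String type) {
            this.type = type;
        }

        /** Appends a child, sets its parent link, and returns the child for chaining. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return child;
        }
    }

    /** Walks up the parent chain and returns the nearest ancestor of the given type, or null. */
    static SimpleNode findAncestor(SimpleNode node, String type) {
        for (SimpleNode p = node.parent; p != null; p = p.parent) {
            if (p.type.equals(type)) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleNode body = new SimpleNode("Body");

        // Case 1: the bookmark end sits in a paragraph inside a table cell.
        SimpleNode table = body.add(new SimpleNode("Table"));
        SimpleNode cell = table.add(new SimpleNode("Row")).add(new SimpleNode("Cell"));
        SimpleNode endInCell = cell.add(new SimpleNode("Paragraph")).add(new SimpleNode("BookmarkEnd"));

        // Case 2: the bookmark end sits in a paragraph after the table.
        SimpleNode endAfterTable = body.add(new SimpleNode("Paragraph")).add(new SimpleNode("BookmarkEnd"));

        System.out.println(findAncestor(endInCell, "Table") != null);     // true
        System.out.println(findAncestor(endAfterTable, "Table") != null); // false
    }
}
```

When the lookup returns a table, the end marker is nested inside it, which is the situation shown in BookmarkIssue.png.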

Please let us know if you have any more queries.

Hi Tahir,

Thanks. Can you provide a solution that works with the bookmarks as they are?

We cannot change the bookmarks. We have no control over where the bookmarks start and end.

I guess the extractContent method needs modification to fetch content correctly when the bookmark start or end is inside a table cell.

We have another problem where a table spans several pages, and when we try to retrieve it page by page we get the whole table (the opposite scenario).

Please suggest how to handle a bookmark end that lies inside a table cell, so that content is retrieved up to that cell (neither truncated nor followed by extra table content).


Hi Sonali,

The sample code at the documentation link does not work for your scenario. I will share code for your scenario as soon as possible.

Hi Sonali,

Please use the following code snippet for your requirement. Hope this helps you. Please let us know if you have any more queries.

int i = 0; // Output file counter; it was not declared in the original snippet.
for (Bookmark srcBookmark : doc.getRange().getBookmarks())
{
    if (srcBookmark.getName().contains("b_Page_"))
    {
        Node startNode = srcBookmark.getBookmarkStart();
        Node endNode = srcBookmark.getBookmarkEnd();
        ArrayList nodes = extractContent(startNode, endNode, true);

        Document newDoc = generateDocument(doc, nodes);
        DocumentBuilder builder = new DocumentBuilder(newDoc);
        builder.moveToDocumentEnd();

        // Drop a trailing empty list item if the extracted content ends with one.
        if (newDoc.getLastSection().getBody().getLastParagraph().isListItem())
        {
            if (newDoc.getLastSection().getBody().getLastParagraph().toTxt().trim().equals(""))
                newDoc.getLastSection().getBody().getLastParagraph().remove();
        }

        // Remove the bookmark start nodes from the extracted document.
        NodeCollection bstart = newDoc.getChildNodes(NodeType.BOOKMARK_START, true);
        for (BookmarkStart node : (Iterable<BookmarkStart>) bstart)
        {
            node.remove();
        }

        // Remove the bookmark end nodes from the extracted document.
        NodeCollection bend = newDoc.getChildNodes(NodeType.BOOKMARK_END, true);
        for (BookmarkEnd node : (Iterable<BookmarkEnd>) bend)
        {
            node.remove();
        }

        newDoc.save("D:\\Data\\Customers\\AsposeOutPage" + i + ".doc");
        i++;
    }
}
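
One caution about the two removal loops above: if the collection returned by getChildNodes is a live view of the document (an assumption here; please check the NodeCollection documentation for your version), removing nodes while iterating over it can skip entries. A common way around this is to iterate over a snapshot instead. The sketch below shows the pattern on a plain java.util.List of strings; SnapshotRemoval and removeBookmarks are hypothetical names for illustration and use no Aspose.Words types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SnapshotRemoval {
    /** Removes every entry whose name starts with "Bookmark", iterating over a copy. */
    static void removeBookmarks(List<String> liveNodes) {
        // Take the snapshot first, so removals from the list cannot disturb the iteration.
        String[] snapshot = liveNodes.toArray(new String[0]);
        for (String node : snapshot) {
            if (node.startsWith("Bookmark")) {
                liveNodes.remove(node); // removes the first remaining occurrence
            }
        }
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(Arrays.asList(
                "BookmarkStart", "Run", "BookmarkEnd", "Table", "BookmarkStart", "Run"));
        removeBookmarks(nodes);
        System.out.println(nodes); // [Run, Table, Run]
    }
}
```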

Hi Tahir,
Thanks.

Passing isInclusive = true works only for this document:
ArrayList nodes = extractContent(startNode, endNode, true);

There is no need to remove the NodeType.BOOKMARK_START and NodeType.BOOKMARK_END nodes.

Passing isInclusive = true also affects extraction in other ways.

Also, we do not want to remove any bookmarks, as we have some processing based on bookmarks in the new document.


1. Can you explain the processMarker method used by extractContent in detail?
It works for some documents but fails for other, similar documents that differ only slightly in where the bookmark start and end lie in the document tree built by Aspose.

2. Why do we do the following every time, even when node.getParentNode() and cloneNode are not equal?
int indexDiff = node.getParentNode().getChildNodes().getCount() - cloneNode.getChildNodes().getCount();

// Child node count identical.
if (indexDiff == 0)
    node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node));
else
    node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node) - indexDiff);

3. Can you explain the purpose of the isSkip flag?

4. Can you suggest a better way of extracting content than the extractContent method?

Hi Sonali,

Please accept my apologies for the late response. We are working on your query and will update you as soon as possible.

Hi Sonali,

I have modified the processMarker method. Please use the following processMarker method in your code. Hope this helps you. Please let us know if you have any more queries.

private static void processMarker(CompositeNode cloneNode, ArrayList nodes, Node node, boolean isInclusive, boolean isStartMarker, boolean isEndMarker) throws Exception
{
    // If we are dealing with a block level node just see if it should be included and add it to the list.
    if (!isInline(node))
    {
        // Don't add the node twice if the markers are the same node.
        if (!(isStartMarker && isEndMarker))
        {
            if (isInclusive)
                nodes.add(cloneNode);
        }
        return;
    }

    // If a marker is a FieldStart node check if it's to be included or not.
    // We assume for simplicity that the FieldStart and FieldEnd appear in the same paragraph.
    if (node.getNodeType() == NodeType.FIELD_START)
    {
        // If the marker is a start node and is not to be included then skip to the end of the field.
        // If the marker is an end node and it is to be included then move to the end field so the field will not be removed.
        if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive))
        {
            while (node.getNextSibling() != null && node.getNodeType() != NodeType.FIELD_END)
                node = node.getNextSibling();
        }
    }

    // If either marker is part of a comment then to include the comment itself we need to move the pointer forward to the Comment
    // node found after the CommentRangeEnd node.
    if (node.getNodeType() == NodeType.COMMENT_RANGE_END)
    {
        while (node.getNextSibling() != null && node.getNodeType() != NodeType.COMMENT)
            node = node.getNextSibling();
    }

    // Find the corresponding node in our cloned node by index and return it.
    // If the start and end node are the same some child nodes might already have been removed. Subtract the
    // difference to get the right index.
    int indexDiff = node.getParentNode().getChildNodes().getCount() - cloneNode.getChildNodes().getCount();

    // Child node count identical.
    if (indexDiff == 0)
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node));
    else
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node) - indexDiff);

    // Remove the nodes up to/from the marker.
    boolean isSkip;
    boolean isProcessing = true;
    boolean isRemoving = isStartMarker;
    Node nextNode = cloneNode.getFirstChild();

    while (isProcessing && nextNode != null)
    {
        Node currentNode = nextNode;
        isSkip = false;

        if (currentNode.equals(node))
        {
            if (isStartMarker)
            {
                isProcessing = false;
                if (isInclusive)
                    isRemoving = false;
            }
            else
            {
                isRemoving = true;
                if (isInclusive)
                    isSkip = true;
            }
        }

        nextNode = nextNode.getNextSibling();
        if (isRemoving && !isSkip)
        {
            // Only bookmark markers are removed here; all other nodes are kept.
            if (currentNode.getNodeType() == NodeType.BOOKMARK_START || currentNode.getNodeType() == NodeType.BOOKMARK_END)
                currentNode.remove();
        }
    }

    // After processing, the composite node may have become empty. If it has, don't include it.
    if (!(isStartMarker && isEndMarker))
    {
        if (cloneNode.hasChildNodes())
            nodes.add(cloneNode);
        // Debug output; it runs whether or not the node was added and is safe to remove.
        System.out.println(cloneNode);
    }
}
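
To illustrate the index arithmetic and the isSkip, isRemoving and isProcessing flags used in processMarker, here is a minimal sketch that replays the same logic on a plain list of strings. It is only a model: MarkerTrimSketch, mapIndex and trim are hypothetical names, no Aspose.Words types are involved, and trim assumes that every node in the removing state is dropped, whereas the modified method above removes only bookmark markers. The subtraction in mapIndex assumes that any children missing from the clone were removed from its front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MarkerTrimSketch {
    /** Maps a child's index in the original parent to its index in a clone that lost leading children. */
    static int mapIndex(int indexInOriginal, int originalCount, int cloneCount) {
        int indexDiff = originalCount - cloneCount;
        return indexInOriginal - indexDiff; // indexDiff == 0 leaves the index unchanged
    }

    /** Replays the marker loop: trims before a start marker or after an end marker. */
    static List<String> trim(List<String> children, int markerIndex,
                             boolean isInclusive, boolean isStartMarker) {
        List<String> kept = new ArrayList<>();
        boolean isProcessing = true;
        boolean isRemoving = isStartMarker; // a start marker removes from the beginning
        int i = 0;
        while (isProcessing && i < children.size()) {
            boolean isSkip = false;
            if (i == markerIndex) {
                if (isStartMarker) {
                    isProcessing = false;                // stop once the start marker is reached
                    if (isInclusive) isRemoving = false; // keep the marker itself
                } else {
                    isRemoving = true;                   // everything after the end marker goes
                    if (isInclusive) isSkip = true;      // but spare the marker itself
                }
            }
            if (!(isRemoving && !isSkip)) {
                kept.add(children.get(i));
            }
            i++;
        }
        // Children after the point where processing stopped are left untouched.
        kept.addAll(children.subList(i, children.size()));
        return kept;
    }

    public static void main(String[] args) {
        // Original parent has 5 children, the clone has 3: index 3 maps to index 1.
        System.out.println(mapIndex(3, 5, 3)); // 1
        // Exclusive start marker "S": everything up to and including it is dropped.
        System.out.println(trim(Arrays.asList("A", "B", "S", "C", "D"), 2, false, true)); // [C, D]
        // Inclusive end marker "E": it is kept, everything after it is dropped.
        System.out.println(trim(Arrays.asList("A", "B", "E", "C", "D"), 2, true, false)); // [A, B, E]
    }
}
```

In this model, isSkip only ever protects the end marker itself when isInclusive is true; it has no effect on any other node.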

A post was split to a new topic: Extract content Bookmark is inside table