Extract the content of a table cell and save it into a new document using Java

Gptrnt · August 11, 2020, 1:08pm

Hi,

I am uploading a Word document that has some hidden marker words between its contents. These contents are inside table cells. I find the hidden words, convert them to bookmarks, extract the content between those bookmarks, and convert it to an HTML string. I then use this HTML string to create a new document for download. The problem is that the extracted content comes out together with the table.
I am attaching my sample code and the input, output, and expected_output documents: TableExtractIssue.zip (109.3 KB)

I am using the methods Aspose suggests for extracting content between bookmarks. My code works in the normal case. The only issue is when the content lies inside a table cell. Kindly check this issue.

Thank you

tahir.manzoor · August 11, 2020, 5:56pm

@Gptrnt

In your case, we suggest using the Node.GetAncestor method to get the parent Cell node of the BookmarkStart node. If it is not null, import the child nodes of the Cell into the new document. Please do not use the extractContent method if the bookmark is inside a table node.

You can use the NodeImporter.ImportNode method to import a node from one document into another.

Gptrnt · August 12, 2020, 5:27am

Can you show this with code?

tahir.manzoor · August 12, 2020, 4:04pm

@Gptrnt

Please use the following code example to extract the contents of the table cell that contains the bookmark and import them into a new document. Here, document.docx is the output document generated by your code. We have attached the input and output documents to this post for your reference.

Docs.zip (20.9 KB)

// Load the document produced by your code and create an empty destination document.
Document doc = new Document(MyDir + "document.docx");
Document dstDoc = new Document();
dstDoc.getFirstSection().getBody().removeAllChildren();
NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
Bookmark bookmark = doc.getRange().getBookmarks().get(0);

// Find the table cell that contains the start of the bookmark, if any.
Cell cell = (Cell) bookmark.getBookmarkStart().getAncestor(NodeType.CELL);
if (cell != null)
{
    for (Node node : (Iterable<Node>) cell.getChildNodes())
    {
        // Import through the NodeImporter so that the source formatting is kept.
        dstDoc.getFirstSection().getBody().appendChild(importer.importNode(node, true));
    }
}

System.out.println(bookmark.getName());
dstDoc.save(MyDir + "20.7.docx");

Gptrnt · August 17, 2020, 12:51pm

I tried your solution and it extracts the content. However, in the document I use certain hidden words, |p1| and |/p1|, to mark the content to extract. Those hidden words should not appear in the extracted document. I also replace these hidden words with a “Note1” bookmark, and I can see that in the extracted document as well. Neither of these should appear in the extracted document.

tahir.manzoor · August 17, 2020, 4:34pm

@Gptrnt

I attached the input and output documents in my previous post. Please make sure that you are using the code correctly. The hidden text is imported into the output document.

Gptrnt · August 18, 2020, 4:58am

Hi,
Yes, the hidden text is imported into the output document. But I don’t want the hidden text in the extracted document. I add the hidden text only so that I can extract the content between those words. The bookmark “Note1” also shows up in your output document, and I don’t want that in the extracted document either. I add that bookmark only for extracting the content.

tahir.manzoor · August 18, 2020, 4:31pm

@Gptrnt

You can use the Run.Font.Hidden property to check whether the text is hidden. You can remove the hidden text from the document using the following code snippet.

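// Iterate over a snapshot (toArray) so that removing runs does not disturb the iteration.
for (Node node : dstDoc.getChildNodes(NodeType.RUN, true).toArray())
{
    Run run = (Run) node;
    if (run.getFont().getHidden())
        run.remove();
}
dstDoc.save(MyDir + "20.8.docx");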
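
The “Note1” bookmark mentioned above could be dropped from the extracted document in a similar way. The following is only a minimal sketch, assuming the bookmark name is “Note1” as in the posts above; the class name BookmarkCleanup and the helper removeBookmark are hypothetical names used for illustration. Bookmark.remove removes only the bookmark markers, not the text between them.

```java
import com.aspose.words.Bookmark;
import com.aspose.words.Document;

public class BookmarkCleanup {
    // Hypothetical helper: removes the named bookmark's start and end markers
    // and keeps the text they enclose.
    // Returns true if a bookmark with that name was found.
    static boolean removeBookmark(Document doc, String name) throws Exception {
        Bookmark bookmark = doc.getRange().getBookmarks().get(name);
        if (bookmark == null) {
            return false; // nothing to remove
        }
        bookmark.remove();
        return true;
    }
}
```

It would be called on the destination document before saving, for example BookmarkCleanup.removeBookmark(dstDoc, "Note1").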

Gptrnt · August 20, 2020, 7:54am

I tried your solution. It removes the hidden words, but it leaves an extra line break where the hidden word was. I am attaching my sample code and the output, expected_output, and input documents: TableExtractIssue (2).zip (109.0 KB). In the output document you can see an extra blank line compared to my expected document. I want to remove it.

Thank you

tahir.manzoor · August 20, 2020, 4:16pm

@Gptrnt

In the getHtmlContentFromBookMark method, you are saving the document to HTML. The output document contains an empty paragraph at the end of the document. You can remove it using the Node.Remove method.
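
One way this could look is sketched below. It is not taken from the attached sample code: the class name TrailingParagraphCleanup and the helper removeTrailingEmptyParagraphs are hypothetical, and the sketch assumes that a paragraph counts as empty when it has no text and no shapes. It always leaves at least one paragraph in the body.

```java
import com.aspose.words.Body;
import com.aspose.words.Document;
import com.aspose.words.Node;
import com.aspose.words.NodeType;
import com.aspose.words.Paragraph;

public class TrailingParagraphCleanup {
    // Hypothetical helper: removes empty paragraphs from the end of the last
    // section's body and returns how many were removed.
    static int removeTrailingEmptyParagraphs(Document doc) throws Exception {
        Body body = doc.getLastSection().getBody();
        int removed = 0;
        Node last = body.getLastChild();
        // Stop at the first non-paragraph node, the first non-empty paragraph,
        // or when only one paragraph is left in the body.
        while (last instanceof Paragraph
                && body.getParagraphs().getCount() > 1
                && isEmpty((Paragraph) last)) {
            last.remove();
            removed++;
            last = body.getLastChild();
        }
        return removed;
    }

    // Assumption: "empty" means no visible text and no shapes such as images.
    private static boolean isEmpty(Paragraph paragraph) {
        return paragraph.getText().trim().isEmpty()
                && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0;
    }
}
```

It would be called on the document just before it is saved to HTML.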