Content Extraction and Removing Nested Merge Fields

vchau · June 5, 2024, 8:03pm

I have been struggling to remove the nested merge fields from the content extracted between a Start and an End field. I tried various approaches, but I still end up with one End merge field left over, and when I try to print the output as HTML I get an invalid documentModel error.

Here is what I have tried so far:

I used the helper method provided by the Aspose team to extract the content, and it worked great:

https://github.com/aspose-words/Aspose.Words-for-Java/blob/master/Examples/DocsExamples/Java/src/main/java/DocsExamples/Programming_with_documents/Contents_management/ExtractContentHelper.java

Here is my merge field Start and End setup. I would like to pull the content between the merge fields UserNoteStart and UserNoteEnd. The content could contain another pair of merge fields, UserNoteDescriptionStart and UserNoteDescriptionEnd, and I need to remove those nested merge fields from the content.

Here is my code.

@GetMapping("/getTags/details/{fieldName}")
    public TagTextResponse getTagsText(@PathVariable String fieldName) throws Exception {

        //Extract nested nodes
        String nestedFieldName = fieldName + "Description";

        Document document = new Document("TEST_Sample_Preview_Less.docx");
        DocumentBuilder builder = new DocumentBuilder(document);

        FieldStart startField = findFieldStart(document, fieldName+"Start");
        FieldStart endField = findFieldStart(document, fieldName+ "End");

        FieldStart nestedStartField = findFieldStart(document, nestedFieldName+"Start");
        FieldStart nestedEndField = findFieldStart(document, nestedFieldName+ "End");

        ArrayList<Node> extractedNestedNodes = ExtractContentHelper.extractContent(nestedStartField, nestedEndField, false);
        for(Node extractedNode: extractedNestedNodes) {
            log.info(extractedNode.toString(SaveFormat.HTML));
        }

        log.info("#####################");

        ArrayList<Node> extractedMainNodes = ExtractContentHelper.extractContent(startField, endField, false);
        for(Node extractedNode: extractedMainNodes) {
            removeNestedMergeField((CompositeNode) extractedNode, nestedFieldName+START, nestedFieldName+END);
            log.info(extractedNode.toString(SaveFormat.HTML));
        }

        return null;
    }

The removeNestedMergeField code:

private void removeNestedMergeField(CompositeNode extractedNode,  String nestedStartField, String nestedEndField) throws Exception {

        //log.info(extractedNode.toString(SaveFormat.TEXT));


        NodeCollection<FieldStart> fieldStarts = extractedNode.getChildNodes(NodeType.FIELD_START, true);


        //log.info(extractedNode.toString(SaveFormat.HTML));

        FieldStart descriptionStartField = null;
        FieldStart descriptionEndField = null;

        for (FieldStart fieldStart: fieldStarts) {
            //log.info(fieldStart.getField().getFieldCode());
            Field field = fieldStart.getField();
            if(field.getType() == FieldType.FIELD_MERGE_FIELD) {
                String fieldCode = field.getFieldCode();
                if (fieldCode.contains(nestedStartField)) {
                    descriptionStartField = fieldStart;
                } else if (fieldCode.contains(nestedEndField)) {
                    descriptionEndField = fieldStart;
                }
            }
        }

        if (descriptionStartField != null && descriptionEndField != null) {

            boolean isBetween = false;
            Node currentNode = extractedNode.getFirstChild();
            while (currentNode != null) {
                Node nextNode = currentNode.getNextSibling();
                if (currentNode == descriptionStartField) {
                    isBetween = true;
                }
                if (isBetween) {
                    currentNode.remove();
                }
                if (currentNode == descriptionEndField) {
                    Node parent = currentNode.getParentNode();
                    if (parent != null) {
                        currentNode.remove();
                    }
                    isBetween = false;
                }
                currentNode = nextNode;
            }

        }

        log.info(extractedNode.toString(SaveFormat.TEXT));

        log.info(extractedNode.toString(SaveFormat.HTML));

    }

It fails on the line that prints the HTML of the remaining node, so it looks like my document model is not clean after the node removal.

vchau · June 5, 2024, 8:52pm

TEST_Sample_Preview_Less.docx (25.1 KB)

Attached is the sample template

vchau · June 5, 2024, 8:55pm

I need to return the extracted content with the nested DescriptionStart and DescriptionEnd merge fields removed, so that the response can be passed to a rich text editor as HTML.

alexey.noskov · June 6, 2024, 4:29am

@vchau In your code you are trying to convert each extracted node to HTML. This is not quite correct, since a field start can belong to one extracted node while the field end belongs to another, so the DOM of an individual extracted node might be incomplete. I would suggest generating a separate document from the extracted nodes and then converting the whole document to HTML:

ArrayList<Node> extractedMainNodes = ExtractContentHelper.extractContent(startField, endField, false);
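// srcDoc is the source document the nodes were extracted from (named document in the question's code).
Document extractedDocument = ExtractContentHelper.generateDocument(srcDoc, extractedMainNodes);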
// Remove mergefields if required.
for (Field f : extractedDocument.getRange().getFields())
{
    if (f.getType() == FieldType.FIELD_MERGE_FIELD)
        f.remove();
}
// Get HTML
String html = extractedDocument.toString(SaveFormat.HTML);

vchau · June 6, 2024, 7:26am

I tried this, but it removes only the FIELD_MERGE_FIELD fields in the new document, not the content between them. I don't want that content in the response: the nested Start and End merge fields should be removed along with the content between them.

alexey.noskov · June 6, 2024, 7:30am

@vchau Could you please save the extracted content as DOCX document and attach it here for our reference? The provided code should remove all merge fields from the document.

vchau · June 6, 2024, 7:35am

For example, if the marker fields are set up like this:

«UserNoteStart»«UserNoteDescriptionStart»This is Description«UserNoteDescriptionEnd»This is formatted main text«UserNoteEnd»

I would like to pull the text content between UserNoteDescriptionStart and UserNoteDescriptionEnd, which is fairly easy using the extractContent helper function. But when I do the same for UserNoteStart and UserNoteEnd, I have to remove the inner UserNoteDescriptionStart and UserNoteDescriptionEnd along with their content “This is Description” and return HTML only for “This is formatted main text”.

vchau · June 6, 2024, 7:41am

Here is the document, attached. If you look at the first line, the text from the Description is still there.
ExtractedContentAsNewDocument.docx (8.8 KB)

alexey.noskov · June 6, 2024, 7:58am

@vchau Thank you for the additional information. The easiest way to achieve this is to wrap the content that should be removed in a bookmark and then remove the bookmark’s content. For example, see the following code:

Document doc = new Document("C:\\Temp\\in.docx");
// Get start and end merge fields.
FieldMergeField start = null;
FieldMergeField end = null;
for (Field f : doc.getRange().getFields())
{
    if (f.getType() == FieldType.FIELD_MERGE_FIELD)
    {
        FieldMergeField mf = (FieldMergeField)f;
        if (mf.getFieldName().equals("UserNoteStart"))
            start = mf;
        if (mf.getFieldName().equals("UserNoteEnd"))
            end = mf;
    }
}

// Extract content between mergefields.
ArrayList<Node> extractedNodes = ExtractContentHelper.extractContent(start.getEnd(), end.getStart(), false);
Document extractedDocument = ExtractContentHelper.generateDocument(doc, extractedNodes);

// Wrap content that should be removed to bookmark to make it easier to remove.
// Get start and end merge fields.
FieldMergeField removeStart = null;
FieldMergeField removeEnd = null;
for (Field f : extractedDocument.getRange().getFields())
{
    if (f.getType() == FieldType.FIELD_MERGE_FIELD)
    {
        FieldMergeField mf = (FieldMergeField)f;
        if (mf.getFieldName().equals("UserNoteDescriptionStart"))
            removeStart = mf;
        if (mf.getFieldName().equals("UserNoteDescriptionEnd"))
            removeEnd = mf;
    }
}
String tmpBkName = "tmpBkName";
removeStart.getStart().getParentNode().insertBefore(new BookmarkStart(extractedDocument, tmpBkName), removeStart.getStart());
removeEnd.getEnd().getParentNode().insertAfter(new BookmarkEnd(extractedDocument, tmpBkName), removeEnd.getEnd());
// Remove content inside bookmark.
extractedDocument.getRange().getBookmarks().get(tmpBkName).setText("");
// Remove tmp bookmark.
extractedDocument.getRange().getBookmarks().get(tmpBkName).remove();

extractedDocument.save("C:\\Temp\\out.docx");

vchau · June 6, 2024, 9:38pm

Thanks @alexey.noskov - It worked. The only problem I am seeing is that I would like to return HTML that uses regular HTML tags for formatting rather than inline styles.

For example :

<span style="font-family:Arial; font-weight:bold; letter-spacing:-0.1pt">days from when the notice was sent or (2) the date services will change)</span>

Should be

<span><b>days from when the notice was sent or (2) the date services will change)</b></span>

Any option in the HtmlSaveOptions?

alexey.noskov · June 7, 2024, 3:31am

@vchau I am afraid there is no way to output HTML without inline styles.
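
If semantic tags are still required, one possible workaround is to post-process the HTML string after it has been generated. The following is a minimal sketch of that idea and not an Aspose.Words feature. The class name `InlineStyleToTags` is hypothetical, and the code assumes spans of the exact form `<span style="...">text</span>` with a plain-text body, as in the example above. It keeps only bold and italic and discards every other style property; a real HTML parser would be more robust than a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineStyleToTags {
    // Matches a <span> whose only attribute is style and whose body contains no nested tags.
    private static final Pattern STYLED_SPAN =
            Pattern.compile("<span style=\"([^\"]*)\">([^<]*)</span>");

    // Replaces inline bold/italic styles with <b>/<i> tags and drops all other style properties.
    static String convert(String html) {
        Matcher m = STYLED_SPAN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Normalize the style value so "font-weight: bold" and "font-weight:bold" both match.
            String style = m.group(1).replace(" ", "").toLowerCase();
            String body = m.group(2);
            if (style.contains("font-weight:bold")) {
                body = "<b>" + body + "</b>";
            }
            if (style.contains("font-style:italic")) {
                body = "<i>" + body + "</i>";
            }
            // quoteReplacement keeps '$' and '\' in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement("<span>" + body + "</span>"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<span style=\"font-family:Arial; font-weight:bold\">bold text</span>";
        System.out.println(convert(html));
    }
}
```

The string returned by `extractedDocument.toString(SaveFormat.HTML)` could be passed through `convert` before it is sent to the rich text editor.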

vchau · June 20, 2024, 9:39pm

Thank you @alexey.noskov for your help on this. We have run into another challenge: what if we need to extract the content between the Start and End marker fields for each instance of those merge fields in the document? Let's say UserNoteStart and UserNoteEnd are used 3 times in the document with different content each time. We would like to extract the content of the 1st, 2nd and 3rd usage and record each one separately.
Right now the code loops and ends up getting the content between the last occurrence of those marker fields.

We could possibly pass something like UserNote1, UserNote2 from the source so that the backend retrieves the content from that occurrence of the marker fields.

alexey.noskov · June 21, 2024, 4:24am

@vchau Yes, the easiest way to achieve this is to give different names to the marker fields, for example by adding a counter: UserNoteStart1…UserNoteEnd1, UserNoteStart2…UserNoteEnd2, etc. In this case you will be able to distinguish the different notes. Alternatively, in the loop you can find the first occurrence of the marker fields, extract the content, and then move on to the next occurrence. The technique is the same.
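
The "move on to the next occurrence" idea can be sketched without any Aspose types: walk the merge field names in document order and close a pair each time an End marker follows a Start marker. This is only an illustration under assumptions. The names `MarkerPairs` and `pairOccurrences` are hypothetical, plain strings stand in for the merge fields, and the notes are assumed not to be nested inside one another. The resulting index pairs would then select the fields to pass to `ExtractContentHelper.extractContent`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MarkerPairs {
    // Returns {startIndex, endIndex} for each Start...End occurrence, in document order.
    // fieldNames stands in for the merge field names read from the document.
    static List<int[]> pairOccurrences(List<String> fieldNames, String startName, String endName) {
        List<int[]> pairs = new ArrayList<>();
        int openStart = -1; // index of the Start marker waiting for its End, or -1 if none
        for (int i = 0; i < fieldNames.size(); i++) {
            String name = fieldNames.get(i);
            if (name.equals(startName)) {
                openStart = i;
            } else if (name.equals(endName) && openStart >= 0) {
                pairs.add(new int[] {openStart, i});
                openStart = -1; // reset so the next occurrence starts fresh
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "UserNoteStart", "UserNoteDescriptionStart", "UserNoteDescriptionEnd", "UserNoteEnd",
                "UserNoteStart", "UserNoteEnd");
        for (int[] pair : pairOccurrences(names, "UserNoteStart", "UserNoteEnd")) {
            System.out.println(pair[0] + ".." + pair[1]);
        }
    }
}
```

Resetting the open Start after each pair is what stops the loop from keeping only the last occurrence.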

vchau · June 25, 2024, 7:13pm

Thanks @alexey.noskov - How should we apply the text or HTML stored under UserNoteStart1, UserNote2… to the following?

1st instance «UserNoteStart»«UserNoteDescriptionStart»This is Description«UserNoteDescriptionEnd»This is formatted main text«UserNoteEnd»

2nd instance «UserNoteStart»«UserNoteDescriptionStart»This is Description«UserNoteDescriptionEnd»This is formatted main text«UserNoteEnd»

Should we completely remove the marker fields and inject a new field named UserNote1 or UserNote2…, or inject the content of UserNote1 and UserNote2 at the position of the first «UserNoteStart», and so on? What would be the best way to achieve this?

We also have these marker fields inside an IF statement, so we need an approach that works within IF as well.

See image attached
Screenshot 2024-06-25 at 12.18.11 PM.png (75.5 KB)

vyacheslav.deryushev · June 25, 2024, 8:42pm

@vchau I think Alexey meant that you need to manually set different field names. If I understand you correctly, you need to get content from every UserNoteStart...UserNoteEnd in the document. You can do it in the following way:

Document doc = new Document("input.docx");
// Get start and end merge fields.
FieldMergeField start = null;
FieldMergeField end = null;
int i = 1;
for (Field f : doc.getRange().getFields()) {
    if (f.getType() == FieldType.FIELD_MERGE_FIELD) {
        FieldMergeField mf = (FieldMergeField) f;
        if (mf.getFieldName().equals("UserNoteStart")) {
            start = mf;
        }

        if (mf.getFieldName().equals("UserNoteEnd")) {
            end = mf;
        }
    }

    if (start != null && end != null) {
        Document extractedDocument = extractContent(doc, start, end);
        removeContent(extractedDocument);
        extractedDocument.save(getArtifactsDir() + String.format("out%d.docx", i));

        i++;
        start = null;
        end = null;
    }
}

private Document extractContent(Document doc, FieldMergeField start, FieldMergeField end) throws Exception {
    ArrayList<Node> extractedNodes = ExtractContentHelper.extractContent(start.getEnd(), end.getStart(), false);
    Document extractedDocument = ExtractContentHelper.generateDocument(doc, extractedNodes);

    return extractedDocument;
}

private void removeContent(Document extractedDocument) throws Exception {
    // Wrap content that should be removed to bookmark to make it easier to remove.
    // Get start and end merge fields.
    FieldMergeField removeStart = null;
    FieldMergeField removeEnd = null;
    for (Field f : extractedDocument.getRange().getFields()) {
        if (f.getType() == FieldType.FIELD_MERGE_FIELD) {
            FieldMergeField mf = (FieldMergeField) f;
            if (mf.getFieldName().equals("UserNoteDescriptionStart"))
                removeStart = mf;
            if (mf.getFieldName().equals("UserNoteDescriptionEnd"))
                removeEnd = mf;
        }
    }
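    // Nothing to remove if this note has no nested description markers.
    if (removeStart == null || removeEnd == null)
        return;
    String tmpBkName = "tmpBkName";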
    removeStart.getStart().getParentNode().insertBefore(new BookmarkStart(extractedDocument, tmpBkName), removeStart.getStart());
    removeEnd.getEnd().getParentNode().insertAfter(new BookmarkEnd(extractedDocument, tmpBkName), removeEnd.getEnd());
    // Remove content inside bookmark.
    extractedDocument.getRange().getBookmarks().get(tmpBkName).setText("");
    // Remove tmp bookmark.
    extractedDocument.getRange().getBookmarks().get(tmpBkName).remove();
}

If you need to update the field names using Aspose.Words, you can use the following code:

for (Field f : doc.getRange().getFields()) {
    if (f.getType() == FieldType.FIELD_MERGE_FIELD) {
        FieldMergeField mf = (FieldMergeField) f;
        if (mf.getFieldName().equals("UserNoteStart")) {
            mf.setFieldName("UserNoteStart1");
            mf.update();
            start = mf;
        }

        if (mf.getFieldName().equals("UserNoteEnd")) {
            mf.setFieldName("UserNoteEnd1");
            mf.update();
            end = mf;
        }
    }
}

You can also use an incrementing counter for the suffix instead of a fixed value.
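
As a rough illustration of the counter idea, the sketch below numbers each occurrence of the Start and End markers separately. It is written against plain strings rather than Aspose types, and the names `MarkerNumbering` and `numberMarkers` are hypothetical. It assumes the markers come in well-formed, non-nested pairs, so the n-th Start and the n-th End belong together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MarkerNumbering {
    // Appends an occurrence counter to every Start/End marker name and leaves other names untouched.
    static List<String> numberMarkers(List<String> fieldNames, String startName, String endName) {
        List<String> renamed = new ArrayList<>();
        int startCount = 0;
        int endCount = 0;
        for (String name : fieldNames) {
            if (name.equals(startName)) {
                renamed.add(name + (++startCount));
            } else if (name.equals(endName)) {
                renamed.add(name + (++endCount));
            } else {
                renamed.add(name);
            }
        }
        return renamed;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "UserNoteStart", "UserNoteEnd", "UserNoteStart", "UserNoteEnd");
        System.out.println(numberMarkers(names, "UserNoteStart", "UserNoteEnd"));
    }
}
```

In the Aspose.Words loop above, the computed name would be applied with `mf.setFieldName(...)` followed by `mf.update()` in place of the hard-coded "UserNoteStart1" and "UserNoteEnd1".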

Another option is to create a HashMap in which you collect all the merge fields, and then retrieve them by field name.
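
A minimal sketch of such a lookup is shown below. The helper is generic so it can be tried with plain strings. The names `FieldIndex` and `groupByName` are hypothetical, and a `LinkedHashMap` with a list per name is an assumption made here so that repeated markers keep their document order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class FieldIndex {
    // Groups items by name; each list keeps the order in which the items were encountered.
    static <T> Map<String, List<T>> groupByName(Iterable<T> fields, Function<? super T, String> nameOf) {
        Map<String, List<T>> byName = new LinkedHashMap<>();
        for (T field : fields) {
            byName.computeIfAbsent(nameOf.apply(field), key -> new ArrayList<>()).add(field);
        }
        return byName;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("UserNoteStart", "UserNoteEnd", "UserNoteStart", "UserNoteEnd");
        Map<String, List<String>> index = groupByName(names, name -> name);
        System.out.println(index.get("UserNoteStart").size());
    }
}
```

With Aspose.Words, the merge fields would first be filtered by type and cast to `FieldMergeField` as in the loops above, and `FieldMergeField::getFieldName` could serve as `nameOf`. The second note's start field would then be `index.get("UserNoteStart").get(1)`.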

vyacheslav.deryushev · June 25, 2024, 8:55pm

@vchau To insert content in place of the fields, you can use the following code, which is based on fields and bookmarks:

private void insertContent(Document extractedDocument, FieldMergeField start, FieldMergeField end) throws Exception {
    DocumentBuilder builder = new DocumentBuilder(extractedDocument);

    String tmpBkName = "tmpBkName";
    start.getStart().getParentNode().insertBefore(new BookmarkStart(extractedDocument, tmpBkName), start.getStart());
    end.getEnd().getParentNode().insertAfter(new BookmarkEnd(extractedDocument, tmpBkName), end.getEnd());
    // Remove content inside bookmark.
    Bookmark bookmark = extractedDocument.getRange().getBookmarks().get(tmpBkName);
    bookmark.setText("");

    builder.moveToBookmark(bookmark.getName(), true, true);
    builder.insertHtml("<b>Insert HTML in Word Document in Bookmark using C#</b>", true);
    builder.write("Text inside bookmark");
    // Remove tmp bookmark.
    extractedDocument.getRange().getBookmarks().get(tmpBkName).remove();
}