Extracting content inside the cell

Gptrnt · October 17, 2023, 8:59pm

Hi,

I am extracting content from a document using the content extraction methods. When the content to extract is inside a table cell, I need to extract several pieces from the same cell separately, but this is not working. I am attaching the sample code, Main.zip (5.8 KB), and the input file, input.docx (34.3 KB). I want to extract the content between each pair of hidden characters separately. Please help.

Thank you

alexey.noskov · October 18, 2023, 6:56am

@Gptrnt You can use the following code to extract content between the tags:

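// Imports needed by this snippet.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.aspose.words.*;

// Regular expressions that match start and end tags in the document.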
Pattern tagPattern = Pattern.compile("\\|/?[prtmf][0-9]{1,2}\\|");
Pattern startTagPattern = Pattern.compile("\\|([prtmf][0-9]{1,2})\\|");
Pattern endTagPattern = Pattern.compile("\\|/([prtmf][0-9]{1,2})\\|");

Document doc = new Document("C:\\Temp\\in.docx");
// Use a find/replace operation so that each tag is represented as a separate Run node.
// The code replaces each matched tag with itself ("$0" is the whole match),
// but after the replace operation the tag is represented as a separate Run.
FindReplaceOptions opt = new FindReplaceOptions();
opt.setUseSubstitutions(true);
doc.getRange().replace(tagPattern, "$0", opt);

// Now collect start and end tag Run nodes.
HashMap<String, Run> starts = new HashMap<String, Run>();
HashMap<String, Run> ends = new HashMap<String, Run>();
Iterable<Run> runs = doc.getChildNodes(NodeType.RUN, true);
for (Run r : runs)
{
    Matcher startMatcher = startTagPattern.matcher(r.getText());
    if (startMatcher.find())
    {
        starts.put(startMatcher.group(1), r);
        continue;
    }
    Matcher endMatcher = endTagPattern.matcher(r.getText());
    if (endMatcher.find())
    {
        ends.put(endMatcher.group(1), r);
        continue;
    }
}

// Now extract content between `p` tags.
for (String key : starts.keySet())
{
    if (key.startsWith("p"))
    {
        Run startRun = starts.get(key);
        Run endRun = ends.get(key);
        if (startRun != null && endRun != null)
        {
            ArrayList<Node> contentNodes = ExtractContentHelper.extractContent(startRun, endRun, false);
            Document contentDocument = ExtractContentHelper.generateDocument(doc, contentNodes);
            contentDocument.save("C:\\Temp\\" + key + ".docx");
        }
    }
}
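
To see what the three patterns actually match, here is a small sketch that uses only java.util.regex and no Aspose.Words calls. The sample cell text is an assumption inferred from the patterns (tags such as |p1| and |/p1|); the real content of input.docx is not shown in this thread, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagPatternDemo {
    // The same expressions as in the snippet above.
    static final Pattern TAG = Pattern.compile("\\|/?[prtmf][0-9]{1,2}\\|");
    static final Pattern START_TAG = Pattern.compile("\\|([prtmf][0-9]{1,2})\\|");
    static final Pattern END_TAG = Pattern.compile("\\|/([prtmf][0-9]{1,2})\\|");

    // Returns every start or end tag found in the text, in document order.
    static List<String> findTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find())
            tags.add(m.group());
        return tags;
    }

    // Returns the tag key (for example "p1") if the text contains a start tag, otherwise null.
    static String startKey(String text) {
        Matcher m = START_TAG.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Returns the tag key if the text contains an end tag, otherwise null.
    static String endKey(String text) {
        Matcher m = END_TAG.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical cell text with two tagged fragments in one cell.
        String cellText = "|p1|First part|/p1||p2|Second part|/p2|";
        System.out.println(findTags(cellText));   // [|p1|, |/p1|, |p2|, |/p2|]
        System.out.println(startKey("|p1|"));     // p1
        System.out.println(endKey("|/p1|"));      // p1
        System.out.println(startKey("|/p1|"));    // null
    }
}
```

The start pattern does not match an end tag because the slash is not in the character class, so a Run that contains only |/p1| goes into the ends map and never into starts. The start and end Runs are paired by the captured key, which is why both patterns capture the same group.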

Gptrnt · October 18, 2023, 1:52pm

Hi,
I have tried the above solution, but I am getting a NullPointerException on the line doc.getRange().replace(pattern, "$0", opt);. If I remove the line opt.setUseSubstitutions(true);, the program runs but produces an empty document.

alexey.noskov · October 18, 2023, 1:58pm

@Gptrnt Do you use the latest 23.9 version of Aspose.Words on your side? If not, please try using the latest version. The provided code works fine on my side with the sample document you have attached in the initial post.
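
A note on the empty output after removing opt.setUseSubstitutions(true): that option controls whether "$0" in the replacement is treated as "the whole match". The sketch below shows the same distinction with plain java.util.regex. It is only an analogy, not the Aspose.Words implementation, and it assumes that Aspose.Words inserts "$0" as literal text when substitutions are off. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubstitutionDemo {
    static final Pattern TAG = Pattern.compile("\\|/?[prtmf][0-9]{1,2}\\|");

    // "$0" is interpreted as a reference to the whole match, so every tag is replaced with itself.
    static String replaceWithSubstitution(String text) {
        return TAG.matcher(text).replaceAll("$0");
    }

    // "$0" is quoted and therefore inserted as two literal characters, so every tag is overwritten.
    static String replaceLiterally(String text) {
        return TAG.matcher(text).replaceAll(Matcher.quoteReplacement("$0"));
    }

    public static void main(String[] args) {
        String text = "|p1|Hello|/p1|";
        System.out.println(replaceWithSubstitution(text)); // |p1|Hello|/p1|
        System.out.println(replaceLiterally(text));        // $0Hello$0
    }
}
```

If the tags are overwritten in this way, the start and end patterns no longer find anything in the Runs, so there is nothing to extract. This would explain the empty result, but it does not explain the NullPointerException.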

Gptrnt · October 18, 2023, 2:06pm

@alexey.noskov I have tried the code with version 23.9, but it produces a lot of errors in the extract function, where the getChildNodes() method is used without specifying the node type. If I rewrite the call as getChildNodes(NodeType.ANY, true), the program runs successfully but creates an empty document.

alexey.noskov · October 18, 2023, 2:09pm

@Gptrnt You should replace getChildNodes() with getChildNodes(NodeType.ANY, false).
Here is my implementation of ExtractContentHelper:

import java.util.ArrayList;
import com.aspose.words.*;

public class ExtractContentHelper {

    //ExStart:CommonExtractContent
    public static ArrayList<Node> extractContent(Node startNode, Node endNode, boolean isInclusive)
    {
        // First, check that the nodes passed to this method are valid for use.
        verifyParameterNodes(startNode, endNode);

        // Create a list to store the extracted nodes.
        ArrayList<Node> nodes = new ArrayList<Node>();

        // If either marker is part of a comment, including the comment itself, we need to move the pointer
        // forward to the Comment Node found after the CommentRangeEnd node.
        if (endNode.getNodeType() == NodeType.COMMENT_RANGE_END && isInclusive)
        {
            Node node = findNextNode(NodeType.COMMENT, endNode.getNextSibling());
            if (node != null)
                endNode = node;
        }

        // Keep a record of the original nodes passed to this method to split marker nodes if needed.
        Node originalStartNode = startNode;
        Node originalEndNode = endNode;

        // Add the section where the start node is placed.
        nodes.add(startNode.getAncestor(NodeType.SECTION));

        // Extract content based on block-level nodes (paragraphs and tables). Traverse through parent nodes to find them.
        // We will split the first and last nodes' content, depending if the marker nodes are inline.
        startNode = getAncestorInBody(startNode);
        endNode = getAncestorInBody(endNode);

        boolean isExtracting = true;
        boolean isStartingNode = true;
        // The current node we are extracting from the document.
        Node currNode = startNode;

        // Begin extracting content. Process all block-level nodes and specifically split the first
        // and last nodes when needed, so paragraph formatting is retained.
        // Method is a little more complicated than a regular extractor as we need to factor
        // in extracting using inline nodes, fields, bookmarks, etc. to make it useful.
        while (isExtracting)
        {
            // Clone the current node and its children to obtain a copy.
            Node cloneNode = currNode.deepClone(true);
            boolean isEndingNode = currNode.equals(endNode);

            if (isStartingNode || isEndingNode)
            {
                // We need to process each marker separately, so pass it off to a separate method instead.
                // End should be processed at first to keep node indexes.
                if (isEndingNode)
                {
                    // !isStartingNode: don't add the node twice if the markers are the same node.
                    processMarker(cloneNode, nodes, originalEndNode, currNode, isInclusive,
                            false, !isStartingNode, false);
                    isExtracting = false;
                }

                // This check must be separate because the block-level start and end markers may be the same node.
                if (isStartingNode)
                {
                    processMarker(cloneNode, nodes, originalStartNode, currNode, isInclusive,
                            true, true, false);
                    isStartingNode = false;
                }
            }
            else
                // Node is not a start or end marker, simply add the copy to the list.
                nodes.add(cloneNode);

            // Move to the next node and extract it. If the next node is null,
            // the rest of the content is found in a different section.
            if (currNode.getNextSibling() == null && isExtracting)
            {
                // Move to the next section.
                Section nextSection = (Section)currNode.getAncestor(NodeType.SECTION).getNextSibling();
                nodes.add(nextSection.deepClone(true));
                currNode = nextSection.getBody().getFirstChild();
            }
            else
            {
                // Move to the next node in the body.
                currNode = currNode.getNextSibling();
            }
        }

        // For compatibility with the mode that uses inline bookmarks, add the next (empty) paragraph.
        if (isInclusive && originalEndNode == endNode && !originalEndNode.isComposite())
            includeNextParagraph(endNode, nodes);

        // Return the nodes between the node markers.
        return nodes;
    }
    //ExEnd:CommonExtractContent

    //ExStart:CommonGenerateDocument
    public static Document generateDocument(Document srcDoc, ArrayList<Node> nodes)
    {
        // Clone source document to preserve source styles.
        Document dstDoc = (Document)srcDoc.deepClone(false);

        // Import each node from the list into the new document. Keep the original formatting of the node.
        NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.USE_DESTINATION_STYLES);

        for (Node node : nodes)
        {
            if (node.getNodeType() == NodeType.SECTION)
            {
                Section srcSection = (Section)node;
                Section importedSection = (Section)importer.importNode(srcSection, false);
                importedSection.appendChild(importer.importNode(srcSection.getBody(), false));
                for (HeaderFooter hf : srcSection.getHeadersFooters())
                    importedSection.getHeadersFooters().add(importer.importNode(hf, true));

                dstDoc.appendChild(importedSection);
            }
            else
            {
                Node importNode = importer.importNode(node, true);
                dstDoc.getLastSection().getBody().appendChild(importNode);
            }
        }

        return dstDoc;
    }
    //ExEnd:CommonGenerateDocument

    //ExStart:CommonExtractContentHelperMethods
    private static void verifyParameterNodes(Node startNode, Node endNode)
    {
        // The order in which these checks are done is important.
        if (startNode == null)
            throw new IllegalArgumentException("Start node cannot be null");
        if (endNode == null)
            throw new IllegalArgumentException("End node cannot be null");

        if (!startNode.getDocument().equals(endNode.getDocument()))
            throw new IllegalArgumentException("Start node and end node must belong to the same document");

        if (startNode.getAncestor(NodeType.BODY) == null || endNode.getAncestor(NodeType.BODY) == null)
            throw new IllegalArgumentException("Start node and end node must be a child or descendant of a body");

        // Check the end node is after the start node in the DOM tree.
        // First, check if they are in different sections, then if they're not,
        // check their position in the body of the same section.
        Section startSection = (Section)startNode.getAncestor(NodeType.SECTION);
        Section endSection = (Section)endNode.getAncestor(NodeType.SECTION);

        int startIndex = startSection.getParentNode().indexOf(startSection);
        int endIndex = endSection.getParentNode().indexOf(endSection);

        if (startIndex == endIndex)
        {
            if (startSection.getBody().indexOf(getAncestorInBody(startNode)) >
                    endSection.getBody().indexOf(getAncestorInBody(endNode)))
                throw new IllegalArgumentException("The end node must be after the start node in the body");
        }
        else if (startIndex > endIndex)
            throw new IllegalArgumentException("The section of the end node must be after the section of the start node");
    }

    private static Node findNextNode(int nodeType, Node fromNode)
    {
        if (fromNode == null || fromNode.getNodeType() == nodeType)
            return fromNode;

        if (fromNode.isComposite())
        {
            Node node = findNextNode(nodeType, ((CompositeNode)fromNode).getFirstChild());
            if (node != null)
                return node;
        }

        return findNextNode(nodeType, fromNode.getNextSibling());
    }

    private static void processMarker(Node cloneNode, ArrayList<Node> nodes, Node node, Node blockLevelAncestor,
                                      boolean isInclusive, boolean isStartMarker, boolean canAdd, boolean forceAdd)
    {
        // If we are dealing with a block-level node, see if it should be included and add it to the list.
        if (node == blockLevelAncestor)
        {
            if (canAdd && isInclusive)
                nodes.add(cloneNode);
            return;
        }

        // If a marker is a FieldStart node check if it's to be included or not.
        // We assume for simplicity that the FieldStart and FieldEnd appear in the same paragraph.
        if (node.getNodeType() == NodeType.FIELD_START)
        {
            // If the marker is a start node and is not included, skip to the end of the field.
            // If the marker is an end node and is to be included, then move to the end field so the field will not be removed.
            if (isStartMarker && !isInclusive || !isStartMarker && isInclusive)
            {
                while (node.getNextSibling() != null && node.getNodeType() != NodeType.FIELD_END)
                    node = node.getNextSibling();
            }
        }

        // Support a case if the marker node is on the third level of the document body or lower.
        ArrayList<Node> nodeBranch = fillSelfAndParents(node, blockLevelAncestor);

        // Process the corresponding node in our cloned node by index.
        Node currentCloneNode = cloneNode;
        for (int i = nodeBranch.size() - 1; i >= 0; i--)
        {
            Node currentNode = nodeBranch.get(i);
            int nodeIndex = currentNode.getParentNode().indexOf(currentNode);
            currentCloneNode = ((CompositeNode)currentCloneNode).getChildNodes(NodeType.ANY, false).get(nodeIndex);

            removeNodesOutsideOfRange(currentCloneNode, isInclusive || (i > 0), isStartMarker);
        }

        // After processing, the composite node may be empty; in that case it is not added unless forceAdd is set.
        if (canAdd &&
                (forceAdd || ((CompositeNode)cloneNode).hasChildNodes()))
        {
            nodes.add(cloneNode);
        }
    }

    private static void removeNodesOutsideOfRange(Node markerNode, boolean isInclusive, boolean isStartMarker)
    {
        boolean isProcessing = true;
        boolean isRemoving = isStartMarker;
        Node nextNode = markerNode.getParentNode().getFirstChild();

        while (isProcessing && nextNode != null)
        {
            Node currentNode = nextNode;
            boolean isSkip = false;

            if (currentNode.equals(markerNode))
            {
                if (isStartMarker)
                {
                    isProcessing = false;
                    if (isInclusive)
                        isRemoving = false;
                }
                else
                {
                    isRemoving = true;
                    if (isInclusive)
                        isSkip = true;
                }
            }

            nextNode = nextNode.getNextSibling();
            if (isRemoving && !isSkip)
                currentNode.remove();
        }
    }

    private static ArrayList<Node> fillSelfAndParents(Node node, Node tillNode)
    {
        ArrayList<Node> list = new ArrayList<Node>();
        Node currentNode = node;

        while (currentNode != tillNode)
        {
            list.add(currentNode);
            currentNode = currentNode.getParentNode();
        }

        return list;
    }

    private static void includeNextParagraph(Node node, ArrayList<Node> nodes)
    {
        Paragraph paragraph = (Paragraph)findNextNode(NodeType.PARAGRAPH, node.getNextSibling());
        if (paragraph != null)
        {
            // Move to the first child to include paragraphs without content.
            Node markerNode = paragraph.hasChildNodes() ? paragraph.getFirstChild() : paragraph;
            Node rootNode = getAncestorInBody(paragraph);

            processMarker(rootNode.deepClone(true), nodes, markerNode, rootNode,
                    markerNode == paragraph, false, true, true);
        }
    }

    private static Node getAncestorInBody(Node startNode)
    {
        while (startNode.getParentNode().getNodeType() != NodeType.BODY)
            startNode = startNode.getParentNode();
        return startNode;
    }
    //ExEnd:CommonExtractContentHelperMethods
}
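
The reason getChildNodes(NodeType.ANY, false) is the right replacement is visible in processMarker: it takes the index of each node among its parent's immediate children and then uses that index to find the matching node in the clone. With true as the second argument, the collection contains all descendants, so the same index points at a different node. Below is a minimal sketch of the idea on a made-up tree. TreeNode, indexPath and resolve are hypothetical names for illustration and are not Aspose.Words classes.

```java
import java.util.ArrayList;
import java.util.List;

public class ClonePathDemo {
    // Hypothetical minimal tree node; it only stands in for an Aspose.Words Node.
    static class TreeNode {
        final String name;
        TreeNode parent;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String name) { this.name = name; }

        TreeNode add(TreeNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        TreeNode deepClone() {
            TreeNode copy = new TreeNode(name);
            for (TreeNode child : children)
                copy.add(child.deepClone());
            return copy;
        }
    }

    // Collects the index of each node among its parent's immediate children,
    // from the ancestor down to the marker.
    static List<Integer> indexPath(TreeNode marker, TreeNode ancestor) {
        List<Integer> path = new ArrayList<>();
        for (TreeNode n = marker; n != ancestor; n = n.parent)
            path.add(0, n.parent.children.indexOf(n));
        return path;
    }

    // Follows the same indexes through the clone, one level of immediate children at a time.
    static TreeNode resolve(TreeNode cloneRoot, List<Integer> path) {
        TreeNode current = cloneRoot;
        for (int index : path)
            current = current.children.get(index);
        return current;
    }

    // Builds table > row > (cell 0, cell 1 > paragraph > (text, |p1|)) and returns the |p1| node.
    static TreeNode buildSample() {
        TreeNode marker = new TreeNode("|p1|");
        TreeNode paragraph = new TreeNode("paragraph").add(new TreeNode("text")).add(marker);
        TreeNode row = new TreeNode("row")
                .add(new TreeNode("cell 0"))
                .add(new TreeNode("cell 1").add(paragraph));
        new TreeNode("table").add(row);
        return marker;
    }

    public static void main(String[] args) {
        TreeNode marker = buildSample();
        TreeNode table = marker;
        while (table.parent != null)
            table = table.parent;

        List<Integer> path = indexPath(marker, table);
        TreeNode inClone = resolve(table.deepClone(), path);
        System.out.println(path);          // [0, 1, 0, 1]
        System.out.println(inClone.name);  // |p1|
    }
}
```

If the lookup in the clone used a flat list of all descendants instead of immediate children, index 1 at the top level would already land on "cell 0" rather than on the second cell of the row. The wrong nodes would then be trimmed, which is consistent with the empty document reported above.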

Gptrnt · October 21, 2023, 8:41pm

Thank you so much. The above solution is working fine.