Aspose.Words: Find a search word and split the content between matches into new documents

aminzamani · October 21, 2013, 4:45pm

Hello,

We use Aspose.Words for Java.

For our customer we have the following case: a source document contains a special search string, possibly multiple times, that stands alone in one line (paragraph). The search string should be found by a regular expression. Only if it is found will content be extracted and copied into a new document. If the search string is not in the first line, everything from the beginning of the document up to the search string must be split off into its own document (the source document must not be modified). If the regex matches multiple times, multiple document parts are generated. Every match marks the beginning of a new document part. For example, when the search regex is “^finish$”, then whenever a paragraph consists of “finish”, the content is extracted up to the next paragraph that matches “finish”.

I have read a lot of the documentation. The most important part is “https://docs.aspose.com/words/java/extract-selected-content-between-nodes/”. So my question is: how do I find such start nodes for the case above?

I would be very grateful if you could provide the code. Please also note that really everything between the search strings must be extracted: images, tables, whatever a Word document can contain.

Thank you very much for your help!

Best regards
Amin

tahir.manzoor · October 22, 2013, 10:26am

Hi Amin,

Thanks for your inquiry. First of all, please note that Aspose.Words is quite different from Microsoft Word’s Object Model in that it represents the document as a tree of objects, more like an XML DOM tree. If you have worked with any XML DOM library, you will find Aspose.Words easy to understand and work with. When you load a Word document into Aspose.Words, it builds its DOM, and all document elements and formatting are simply loaded into memory. Please read the following articles for more information on the DOM:
https://docs.aspose.com/words/java/aspose-words-document-object-model/
https://docs.aspose.com/words/java/logical-levels-of-nodes-in-a-document/

You can achieve your requirement by implementing the IReplacingCallback interface. Please use the approach shared at the following documentation link to find a node.
https://docs.aspose.com/words/java/find-and-replace/

Once you have found the specific nodes, you can extract the contents between them by using the code example shared here:
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

If you face any issue, please share the following details for our reference. We will then provide you with more information about your query along with code.

Please supply us with the input document.
Please supply us with the expected document showing the desired behavior (you can create this document using Microsoft Word).

aminzamani · October 23, 2013, 12:38pm

Hi,

Thanks for your answer. I will test it as soon as possible and will let you know. In the meantime I have found another way to find special nodes with a certain text, and I have split the document into parts. Afterwards, when I merge all parts back into one document, I always get a new section after every part. Could you please tell me where the problem is? Thanks a lot!

I use the following code for merging:

public Document mergeDocuments(List<Document> splittedDocuments){
    Document res = null;
    try {
        res = createDoc();

    } catch (Exception e) {

        throw new DocumentMergerException("new document couldn’t been created",e);
    }
    for (Document splittedDocument : splittedDocuments) {
        try {
            appendAllNodes(res,splittedDocument.getChildNodes());
        } catch (Exception e) {
            throw new DocumentMergerException("couldn’t merge documents",e);
        }
    }

    return res;
}

private Document createDoc() throws Exception {
    // Create an "empty" document. Note that like in Microsoft Word,
    // the empty document has one section, body and one paragraph in it.
    Document doc = new Document();

    // This truly makes the document empty. No sections (not possible in
    // Microsoft Word).
    doc.removeAllChildren();
    return doc;
}

private Document appendAllNodes(Document doc, NodeCollection nodes)
        throws Exception {

    for (Node node : nodes) {
        appendNode(doc, node);
    }
    return doc;
}

private Document appendNode(Document doc, Node node) throws Exception {
    Node importNode = doc.importNode(node, true);
    doc.appendChild(importNode);
    return doc;
}
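
A note on the merge code above, as a minimal sketch rather than an answer from this thread: in Aspose.Words the child nodes of a Document are its Section nodes, so appending all child nodes of every part appends at least one section per part. The plain-Java model below has no Aspose.Words dependency. It represents a document as a list of sections and a section as a list of paragraph strings; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model: a document is a list of sections, a section is a list of paragraphs. */
final class MergeModelSketch {

    /** Mirrors appendAllNodes: every section of every part becomes its own section. */
    static List<List<String>> mergeKeepingSections(List<List<List<String>>> parts) {
        List<List<String>> merged = new ArrayList<>();
        for (List<List<String>> part : parts) {
            for (List<String> section : part) {
                merged.add(new ArrayList<>(section));
            }
        }
        return merged;
    }

    /** Alternative: pour the paragraphs of all parts into one single section. */
    static List<List<String>> mergeIntoOneSection(List<List<List<String>>> parts) {
        List<String> only = new ArrayList<>();
        for (List<List<String>> part : parts) {
            for (List<String> section : part) {
                only.addAll(section);
            }
        }
        List<List<String>> merged = new ArrayList<>();
        merged.add(only);
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> partA = Arrays.asList(Arrays.asList("a", "b"));
        List<List<String>> partB = Arrays.asList(Arrays.asList("c"));
        List<List<List<String>>> parts = Arrays.asList(partA, partB);
        System.out.println("sections when appending sections: " + mergeKeepingSections(parts).size());
        System.out.println("sections when appending paragraphs: " + mergeIntoOneSection(parts).size());
    }
}
```

The second variant trades away any section breaks that were present inside the parts, so it only fits when those breaks do not need to survive the merge.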

-------------------------------------------------

And this is the code that creates the split documents from the start and end nodes:

public Document createDocumentByStartEndNode(Node startNode, Node endNode, boolean isInclusive) throws Exception {
    // Pass the caller's isInclusive flag through instead of a hard-coded true.
    ArrayList extractContent = extractContent(startNode, endNode, isInclusive);
    Document generatedDocument = generateSplittedDocument(getDocument(), extractContent);
    return generatedDocument;
}

public static Document generateSplittedDocument(Document srcDoc, ArrayList nodes) throws Exception {
    // Create a blank document.
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.getFirstSection().getBody().removeAllChildren();

    // Import each node from the list into the new document. Keep the
    // original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

    for (Node node : (Iterable<Node>) nodes) {
        Node importNode = importer.importNode(node, true);
        dstDoc.getFirstSection().getBody().appendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}

public static ArrayList extractContent(Node startNode, Node endNode, boolean isInclusive) {
    // First check that the nodes passed to this method are valid for use.
    verifyParameterNodes(startNode, endNode);

    // Create a list to store the extracted nodes.
    ArrayList nodes = new ArrayList();

    // Keep a record of the original nodes passed to this method so we can
    // split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;

    // Extract content based on block level nodes (paragraphs and tables).
    // Traverse through parent nodes to find them.
    // We will split the content of first and last nodes depending if the
    // marker nodes are inline
    while (startNode.getParentNode().getNodeType() != NodeType.BODY)
        startNode = startNode.getParentNode();

    while (endNode.getParentNode().getNodeType() != NodeType.BODY)
        endNode = endNode.getParentNode();

    boolean isExtracting = true;
    boolean isStartingNode = true;
    boolean isEndingNode;
    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block level nodes and
    // specifically split the first and last nodes when needed so paragraph
    // formatting is retained.
    // Method is little more complex than a regular extractor as we need to
    // factor in extracting using inline nodes, fields, bookmarks etc as to
    // make it really useful.
    while (isExtracting) {
        // Clone the current node and its children to obtain a copy.
        CompositeNode cloneNode = (CompositeNode) currNode.deepClone(true);
        isEndingNode = currNode.equals(endNode);

        if (isStartingNode || isEndingNode) {
            // We need to process each marker separately so pass it off to a
            // separate method instead.
            if (isStartingNode) {
                processMarker(cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                isStartingNode = false;
            }

            // Conditional needs to be separate as the block level start and
            // end markers maybe the same node.
            if (isEndingNode) {
                processMarker(cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                isExtracting = false;
            }
        } else
            // Node is not a start or end marker, simply add the copy to the
            // list.
            nodes.add(cloneNode);

        // Move to the next node and extract it. If next node is null that
        // means the rest of the content is found in a different section.
        if (currNode.getNextSibling() == null && isExtracting) {
            // Move to the next section.
            Section nextSection = (Section) currNode.getAncestor(NodeType.SECTION).getNextSibling();
            currNode = nextSection.getBody().getFirstChild();
        } else {
            // Move to the next node in the body.
            currNode = currNode.getNextSibling();
        }
    }
    // Return the nodes between the node markers.
    return nodes;
}

Best regards
Amin

tahir.manzoor · October 24, 2013, 7:43am

Hi Amin,

Thanks for your inquiry. It would be great if you could share the following details for investigation purposes.

Please attach your input Word document.
Please create a standalone/runnable simple application (for example a Console Application Project) that demonstrates the code you used to generate your output document.
Please attach the output Word file that shows the undesired behavior.
Please attach your target Word document showing the desired behavior. You can use Microsoft Word to create your target Word document. I will investigate how you expect your final document to be generated.

Unfortunately, it is difficult to say what the problem is without the document(s) and a simplified application. We need your document(s) and a simple project to reproduce the problem. As soon as you get these pieces of information to us, we’ll start our investigation into your issue.

aminzamani · November 1, 2013, 9:55am

Hi,

Excuse me, but I find it very difficult to split a document into one or more new documents by a certain search word. Could you please be so kind as to share code here that shows how to split a document into new ones whenever a special search string is found? The condition is that the search string must stand alone, with no other text, inside a paragraph (and so inside a run). Whenever this string is found alone in a paragraph, a new document must be generated that contains everything up to the search string. The paragraph with the search string is not part of that document and is not copied into it, because it always marks the beginning of the next document, so it stands at the beginning of that next document. If the search string is not at the beginning of the source document, for example it is somewhere else inside the document, then everything from the beginning of the document up to the line/node before the search string must be copied. The next document part then starts with the found search string and runs up to the next occurrence of the search string, and so on (for more information see the description earlier in this thread).

So in fact it is very simple. Search string found? => Copy the document content (from the last found search string and its paragraph, or from the beginning of the document) up to the next search string paragraph; that next paragraph is not part of the new document.

I would very much appreciate the effort of providing the code here.

Thanks,
Amin

tahir.manzoor · November 3, 2013, 9:06am

Hi Amin,

Thanks for your inquiry. As I understand your query, you want to extract a document’s contents based on a specific search and save the extracted contents into a new document.

In your case, I suggest implementing the IReplacingCallback interface to find all instances of a particular word in the document. I have attached the input file with this post for your reference.

The following code example does the following:

Finds all instances of a particular word in the document
Inserts a bookmark at each match found
Extracts contents based on the inserted bookmarks

Hope this helps you. Please let us know if you have any more queries.

// Load in the document
Document doc = new Document(MyDir + "TestFile.doc");
// Insert a bookmark at the start of the document.
DocumentBuilder builder = new DocumentBuilder(doc);
builder.moveToDocumentStart();
builder.startBookmark("BM_0");
builder.endBookmark("BM_0");
// Find text and insert bookmark
Pattern regex = Pattern.compile("your document", Pattern.CASE_INSENSITIVE);
FindAndInsertBookmark obj = new FindAndInsertBookmark();
// obj.replaceText = "This line has been replaced with a new line";
doc.getRange().replace(regex, obj, true);
ArrayList bookmarks = new ArrayList();
for (int i = 0; i < doc.getRange().getBookmarks().getCount(); i++)
{
    if (doc.getRange().getBookmarks().get(i).getName().startsWith("BM_"))
        bookmarks.add(doc.getRange().getBookmarks().get(i));
}
builder.moveToDocumentEnd();
builder.startBookmark("BM_" + bookmarks.size());
builder.endBookmark("BM_" + bookmarks.size());
for (int i = 0; i < bookmarks.size() - 1; i++)
{
    BookmarkStart bStart = ((Bookmark)bookmarks.get(i)).getBookmarkStart();
    BookmarkEnd bEnd = ((Bookmark)bookmarks.get(i + 1)).getBookmarkEnd();
    ArrayList nodes = extractContent(bStart, bEnd, true);
    Document newdoc = generateDocument(doc, nodes);
    System.out.println(i);
    newdoc.save(MyDir + "Out_" + i + ".docx");
}

class FindAndInsertBookmark implements IReplacingCallback
{
    int i = 1;
    public int replacing(ReplacingArgs e) throws Exception
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();
        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run)currentNode, e.getMatchOffset());
        // This array is used to store all nodes of the match for further highlighting.
        ArrayList runs = new ArrayList();
        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while (
                (remainingLength > 0) &&
                        (currentNode != null) &&
                        (currentNode.getText().length() <= remainingLength))
        {
            runs.add(currentNode);
            remainingLength = remainingLength - currentNode.getText().length();
            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            }
            while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }
        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            splitRun((Run)currentNode, remainingLength);
            runs.add(currentNode);
        }
        DocumentBuilder builder = new DocumentBuilder((Document)currentNode.getDocument());
        builder.moveTo((Run)runs.get(0));
        builder.insertParagraph();
        builder.startBookmark("BM_"+ i);
        builder.endBookmark("BM_"+ i);
        builder.insertParagraph();
        i++;
        // Signal to the replace engine to do nothing because we have already done all what we wanted.
        return ReplaceAction.SKIP;
    }
    /**
     * Splits text of the specified run into two runs.
     * Inserts the new run just after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception
    {
        Run afterRun = (Run)run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}
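
The run handling in the callback above can be hard to follow, so here is a minimal sketch of the same idea in plain Java with no Aspose.Words dependency. It assumes that a paragraph is modelled as a mutable list of run texts and that non-run nodes are ignored; `RunSplitSketch` and `isolateMatch` are hypothetical names, not Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the run splitting done in the replacing callback. */
final class RunSplitSketch {

    /**
     * Rewrites the run list in place so that the match is covered by whole
     * runs, and returns {firstIndex, lastIndexExclusive} of those runs.
     */
    static int[] isolateMatch(List<String> runs, int runIndex, int offset, int matchLength) {
        int current = runIndex;
        // Text before the match stays behind in its own run.
        if (offset > 0) {
            String text = runs.get(current);
            runs.set(current, text.substring(0, offset));
            runs.add(current + 1, text.substring(offset));
            current++;
        }
        int first = current;
        int remaining = matchLength;
        // Consume runs that lie completely inside the match.
        while (remaining > 0 && current < runs.size()
                && runs.get(current).length() <= remaining) {
            remaining -= runs.get(current).length();
            current++;
        }
        // Text after the match is moved into a new run.
        if (remaining > 0 && current < runs.size()) {
            String text = runs.get(current);
            runs.set(current, text.substring(0, remaining));
            runs.add(current + 1, text.substring(remaining));
            current++;
        }
        return new int[] {first, current};
    }

    public static void main(String[] args) {
        // "finish" starts at offset 4 of the first run and ends inside the second run.
        List<String> runs = new ArrayList<>(Arrays.asList("say fin", "ish now"));
        int[] range = isolateMatch(runs, 0, 4, "finish".length());
        System.out.println(runs + " match runs: " + range[0] + ".." + range[1]);
    }
}
```

After the call, the runs at the returned index range contain exactly the matched text, which is the position where the callback inserts its bookmark.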

aminzamani · November 6, 2013, 1:35pm

Hi,
First of all, thank you very much for your help!

I have experimented a little with your code. I took your code without modification (by the way, could you show me how to implement the IReplacingCallback interface without marking the found word?). Your code only splits the document properly for the document you attached, but not for my document. I have attached my document. Please search for the word “finish” instead of “my document”.

Thank you very much for your help.

Amin

tahir.manzoor · November 7, 2013, 4:28am

Hi Amin,

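Thanks for your inquiry. I have modified one line of code in the FindAndInsertBookmark class, the one that constructs the DocumentBuilder (it now takes the document from the first matched run instead of from currentNode, which can be null at that point), and have managed to split the document for the search word ‘finish’. I have attached the output documents with this post for your reference.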

// Split the last run that contains the match if there is any text left.
if ((currentNode != null) && (remainingLength > 0))
{
    splitRun((Run)currentNode, remainingLength);
    runs.add(currentNode);
}
DocumentBuilder builder = new DocumentBuilder((Document)((Run)runs.get(0)).getDocument());
builder.moveTo((Run)runs.get(0));
builder.insertParagraph();
builder.startBookmark("BM_" + i);
builder.endBookmark("BM_" + i);
builder.insertParagraph();

*aminzamani:

I took your code without modification (by the way, could you show me how to implement the IReplacingCallback interface without marking the found word?).*

It would be great if you could share some more detail about this query. I will then provide you with more information on this along with code.

aminzamani · November 7, 2013, 5:39am

Hi,

Thanks very much for your code. I will try it later. These are the requirements:
Whenever the word “finish” is found, a new document must be generated that starts at the found word “finish” and runs up to the next occurrence of the word “finish”. The word “finish” must stand alone, without any other words, in a single line (paragraph). The word “finish” always marks the beginning of a new document. The next document starts at that next occurrence and runs up to the following one, as described before; it does not contain the occurrence of the NEXT word “finish”. As said, the document will be split into parts. We assign them later to other people, who modify them. When they are finished, we merge all the split documents back into one document.

Here is an example source document:
first doc line 1
first doc line 2
finish
second document line 1
second document line 2
finish
third document line 1
third document line 2
third document line 3

The following documents should be generated:
-----------------
first document content:
first doc line 1
first doc line 2
-------------------
second document content:
finish
second document line 1
second document line 2
------------------
third document content:
finish
third document line 1
third document line 2
third document line 3

As you can see, the word “finish” always marks the beginning of a new document. If the first line of the source document is not the word “finish”, then that first line is the start of the first new document. If the first line is the word “finish”, then it is the first line of the first new document.
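
The rule above can be stated compactly in code. This is only a minimal sketch in plain Java with no Aspose.Words dependency: it assumes that each paragraph is represented by its text, and `MarkerSplitSketch` and `split` are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java model of the splitting rule; paragraphs are represented as strings. */
final class MarkerSplitSketch {

    /**
     * Splits paragraphs into parts. A paragraph whose trimmed text matches the
     * marker pattern completely starts a new part and is kept as the first
     * paragraph of that part.
     */
    static List<List<String>> split(List<String> paragraphs, Pattern marker) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String paragraph : paragraphs) {
            boolean isMarker = marker.matcher(paragraph.trim()).matches();
            // Close the running part before a marker, unless nothing has been
            // collected yet (no empty first part when the marker comes first).
            if (isMarker && !current.isEmpty()) {
                parts.add(current);
                current = new ArrayList<>();
            }
            current.add(paragraph);
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "first doc line 1", "first doc line 2",
                "finish", "second document line 1", "second document line 2",
                "finish", "third document line 1", "third document line 2",
                "third document line 3");
        List<List<String>> parts = split(source, Pattern.compile("finish"));
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("Part " + i + ": " + parts.get(i));
        }
    }
}
```

Because `matches()` requires the whole trimmed paragraph to match, a paragraph such as “not finished yet” does not start a new part.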

That’s all. As mentioned, the split documents must later be merged back together into one document.

Thanks again for your help!

tahir.manzoor · November 8, 2013, 1:06am

Hi Amin,

Thanks for sharing the detail.

Please use the following code example to achieve your requirements. I have attached the output documents and FindAndInsertBookmark code with this post. Please get the extractContent code from here:
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

Hope this helps you. Please let us know if you have any more queries.

// Load in the document
Document doc = new Document(MyDir + "section1.doc");
// Insert a bookmark at the start of the document.
DocumentBuilder builder = new DocumentBuilder(doc);
// Move cursor to document start and insert bookmark
builder.moveToDocumentStart();
builder.startBookmark("BM_0");
builder.endBookmark("BM_0");
// Find text and insert bookmark 
Pattern regex = Pattern.compile("finish", Pattern.CASE_INSENSITIVE);
FindAndInsertBookmark obj = new FindAndInsertBookmark();
doc.getRange().replace(regex, obj, true);
// Add the inserted bookmarks starts with BM_ in an ArrayList
ArrayList bookmarks = new ArrayList();
for (int i = 0; i < doc.getRange().getBookmarks().getCount(); i++)
{
    if (doc.getRange().getBookmarks().get(i).getName().startsWith("BM_"))
        bookmarks.add(doc.getRange().getBookmarks().get(i));
}
// Move cursor to document end and insert bookmark
builder.moveToDocumentEnd();
builder.startBookmark("BM_" + bookmarks.size());
builder.endBookmark("BM_" + bookmarks.size());
// Extract contents between bookmarks and split the document
for (int i = 0; i < bookmarks.size() - 1; i++)
{
    BookmarkStart bStart = ((Bookmark)bookmarks.get(i)).getBookmarkStart();
    BookmarkEnd bEnd = ((Bookmark)bookmarks.get(i + 1)).getBookmarkEnd();
    ArrayList nodes = extractContent(bStart, bEnd, true);
    Document newdoc = generateDocument(doc, nodes);
    if (newdoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().equals("finish"))
        newdoc.getLastSection().getBody().getLastParagraph().remove();
    newdoc.save(MyDir + "Out_" + i + ".docx");
}

aminzamani · November 10, 2013, 7:10pm

Hi,

Thank you very much for your great help!

I have compared the input file that I gave you with the generated output file that you gave me. I have attached a screenshot. It shows both documents: the first is my input file as you got it from me, and the second is the output file you gave me (here the second one). As you can see, the page breaks are missing from the newly generated output file. For example, take a look at the input file after the first occurrence of the word “finish”. In the screenshot you see a blue line directly after the word “finish”. The second output file starts with “finish” (that is OK!), but where is the page break? The blue line is not in it -;( For our logic everything is needed, whatever it is.

I have also executed the code you gave me, but I do not get the same output files as you attached. The last character is always an “ENTER character” in every split document. See the attached zip “MyOut.zip”.

I would be very grateful if you could help, and thank you very much for your efforts!

Best regards,
Amin

tahir.manzoor · November 11, 2013, 4:10am

Hi Amin,

Thanks for sharing the detail.

*aminzamani:

I have compared the input file that I gave you with the generated output file that you gave me. I have attached a screenshot. It shows both documents: the first is my input file as you got it from me, and the second is the output file you gave me (here the second one). As you can see, the page breaks are missing from the newly generated output file. For example, take a look at the input file after the first occurrence of the word “finish”. In the screenshot you see a blue line directly after the word “finish”. The second output file starts with “finish” (that is OK!), but where is the page break? The blue line is not in it -;( For our logic everything is needed, whatever it is.*

I have modified the code according to your requirement. Please see the following code; the changes include the block that inserts a continuous section break when the first paragraph of a part is the search keyword. Hope this helps you. I have attached the code of the FindAndInsertBookmark class and of the generateDocument and extractContent methods with this post.

// Load in the document
Document doc = new Document(MyDir + "section1.doc");
// Insert a bookmark at the start of the document.
DocumentBuilder builder = new DocumentBuilder(doc);
// Move cursor to document start and insert bookmark
builder.moveToDocumentStart();
builder.startBookmark("BM_0");
builder.endBookmark("BM_0");
String searchKeyWord = "finish";
// Find text and insert bookmark 
Pattern regex = Pattern.compile(searchKeyWord, Pattern.CASE_INSENSITIVE);
FindAndInsertBookmark obj = new FindAndInsertBookmark();
doc.getRange().replace(regex, obj, true);
// Add the inserted bookmarks starts with BM_ in an ArrayList
ArrayList bookmarks = new ArrayList();
for (int i = 0; i < doc.getRange().getBookmarks().getCount(); i++)
{
    if (doc.getRange().getBookmarks().get(i).getName().startsWith("BM_"))
        bookmarks.add(doc.getRange().getBookmarks().get(i));
}
// Move cursor to document end and insert bookmark
builder.moveToDocumentEnd();
builder.startBookmark("BM_" + bookmarks.size());
builder.endBookmark("BM_" + bookmarks.size());
// Extract contents between bookmarks and split the document
for (int i = 0; i < bookmarks.size() - 1; i++)
{
    BookmarkStart bStart = ((Bookmark)bookmarks.get(i)).getBookmarkStart();
    BookmarkEnd bEnd = ((Bookmark)bookmarks.get(i + 1)).getBookmarkEnd();
    ArrayList nodes = extractContent2(bStart, bEnd, true);
    Document newdoc = generateDocument2(doc, nodes);
    if (newdoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().equals(searchKeyWord))
        newdoc.getLastSection().getBody().getLastParagraph().remove();
    if (newdoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().equals(searchKeyWord))
    {
        DocumentBuilder newbuilder = new DocumentBuilder(newdoc);
        newbuilder.moveTo(newdoc.getFirstSection().getBody().getFirstParagraph());
        newbuilder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
        newbuilder.getCurrentParagraph().remove();
    }
    newdoc.save(MyDir + "Out_" + i + ".docx");
}

*aminzamani:

I have also executed the code you gave me, but I do not get the same output files as you attached. The last character is always an “ENTER character” in every split document. See the attached zip “MyOut.zip”.*

Please use the code of the FindAndInsertBookmark class shared in the following post. I have also attached the same code again with this post. Please check the attached code.txt file.
https://forum.aspose.com/t/49565

aminzamani · November 11, 2013, 4:36am

Hi,

Thanks for your help and code!

I have used exactly the code you provided, but the result is the same: the page breaks are not visible. I have attached the result files. Could you please attach your result files for me?

Thanks for your help!

tahir.manzoor · November 11, 2013, 10:22am

Hi Amin,

Thanks for your inquiry. I have attached the output documents generated with the shared code to this post for your reference. The output documents that you shared are not correct. Perhaps you are not using the same code that I shared in my last post. I have attached the complete source code that I am using with this post.

Moreover, your input document contains a section break (continuous) after each ‘finish’ word. The output documents (attached to this post) also contain the continuous section break after each ‘finish’ word. Please see the attached image for detail.

Please use the attached code (FindandSplitDocument) and let us know how it goes on your side. Hope this helps you.

aminzamani · November 12, 2013, 1:46pm

Hi,

Fantastic, thanks a lot again! There is still one small problem: when the first line is the word “finish”, a null pointer exception occurs. If the first line contains the word “finish”, then this is the start of the first new document. The word “finish” always marks the start of a new document, regardless of whether it is in the first line of the document or not. If it is not in the first line (as in the previous tests), then the first line of the source document is the beginning of the first split document. Otherwise, if the first line is the word “finish”, then that is the start of the first document. In other words: the first line is always the start of the first split document.

I would be very grateful if you could help. Thanks a lot and best regards,
Amin

tahir.manzoor · November 13, 2013, 6:49am

Hi Amin,

Thanks for your feedback.

Could you please attach your input Word document here for which you are getting the exception? I will investigate the issue on my side and provide you more information.

aminzamani · November 18, 2013, 6:53am

Hi Tahir,

Sorry for answering so late; I was on vacation and have been back since today. Thanks for helping. I have attached the Word document: section1-finish-at-first-line.doc.

Thx,
Amin

tahir.manzoor · November 19, 2013, 12:31am

Hi Amin,

Thanks for sharing the document. Please check the following code; the changed line adds a null check on getFirstParagraph() before its text is compared with the search keyword. This change will solve the exception issue. Please let us know if you have any more queries.

// Extract contents between bookmarks and split the document
for (int i = 0; i < bookmarks.size() - 1; i++)
{
    BookmarkStart bStart = ((Bookmark)bookmarks.get(i)).getBookmarkStart();
    BookmarkEnd bEnd = ((Bookmark)bookmarks.get(i + 1)).getBookmarkEnd();
    ArrayList nodes = extractContent(bStart, bEnd, true);
    Document newdoc = generateDocument(doc, nodes);
    if (newdoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().equals(searchKeyWord))
        newdoc.getLastSection().getBody().getLastParagraph().remove();
    if (newdoc.getFirstSection().getBody().getFirstParagraph() != null && newdoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().equals(searchKeyWord))
    {
        DocumentBuilder newbuilder = new DocumentBuilder(newdoc);
        newbuilder.moveTo(newdoc.getFirstSection().getBody().getFirstParagraph());
        newbuilder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
        newbuilder.getCurrentParagraph().remove();
    }
    newdoc.save(MyDir + "Out_" + i + ".docx");
}

aminzamani · November 21, 2013, 6:55am

Hi,
I have tested the code, but the result for the given input file is not correct. I have attached the input file and the output. The first generated (split) output file is empty (=> Out_0.docx).

The result should be:
First file:
----------------
finish
1
Section 1
----------------
Second file:
----------------
finish
2
Section 2
----------------
Third file:
----------------
finish
3
Section 3
----------------
Fourth file:
----------------
finish

Thank you very much for your help!

Best regards,
Amin

tahir.manzoor · November 22, 2013, 4:48am

Hi Amin,

Thanks for your inquiry. You can solve this issue in two ways.

Do not insert the bookmark BM_0 at the start of the document if your document has the word ‘finish’ at the start.
Check the paragraph count of the extracted document. If the document has only one empty paragraph, do not save it. Please see the following code snippet.

Hope this helps you. Please let us know if you have any more queries.

newdoc.ensureMinimum();
if (newdoc.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 1)
{
    if (newdoc.getChild(NodeType.PARAGRAPH, 0, true).toString(SaveFormat.TEXT).trim().equals(""))
        continue;
}
newdoc.save(MyDir + "Out_" + i + ".docx");
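
To make the effect of the second option concrete, here is a minimal sketch in plain Java with no Aspose.Words dependency. It assumes each extracted part is modelled as a list of paragraph texts; the class and method names are hypothetical. A part with no paragraphs is treated as empty too, because ensureMinimum() would give such a document one empty paragraph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of skipping parts that contain nothing but one empty paragraph. */
final class EmptyPartFilterSketch {

    /** True when the part has no paragraphs or only a single blank one. */
    static boolean isEffectivelyEmpty(List<String> paragraphs) {
        return paragraphs.isEmpty()
                || (paragraphs.size() == 1 && paragraphs.get(0).trim().isEmpty());
    }

    /** Maps output file names to parts, skipping empty parts like the 'continue' above. */
    static Map<String, List<String>> partsToSave(List<List<String>> parts) {
        Map<String, List<String>> toSave = new LinkedHashMap<>();
        for (int i = 0; i < parts.size(); i++) {
            if (isEffectivelyEmpty(parts.get(i))) {
                continue; // nothing worth saving for this index
            }
            toSave.put("Out_" + i + ".docx", parts.get(i));
        }
        return toSave;
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList(
                Arrays.asList(""),
                Arrays.asList("finish", "1", "Section 1"),
                Arrays.asList("finish", "2", "Section 2"));
        System.out.println(new ArrayList<>(partsToSave(parts).keySet()));
    }
}
```

Note that skipping by index leaves a gap in the numbering: with an empty first part, the first saved file is Out_1.docx. If consecutive numbering matters, a separate counter for saved files would be needed.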