Aspose.Words: Find a search word and split the content between its occurrences into new documents

aminzamani · November 22, 2013, 6:04am

Hi,

Unfortunately it is not working. When I debug the code (with the document I sent to you), the following if statement is not true:
“if (newdoc.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 1)”

The expression “newdoc.getChildNodes(NodeType.PARAGRAPH, true).getCount()” returns 0 for the first document found.

On the other hand, splitting documents whose first word (on the first line) is not “finish” no longer works. :-( For that case I have attached a simple document called “first-line-not-splitting-string.docx”, along with the output file “Out_0.docx”.

Thank you very much for providing help.

Amin

aminzamani · November 22, 2013, 7:18am

Hi Tahir,
I guess the best way is to provide you with some test input files together with the correct output that should be generated (the split documents).

I have attached a zip file “tests-for-input-files-and-their-correct-output.zip” which contains 8 test folders. Every test folder contains an input doc file. I have manually created the correct output files for each input file. The code is correct only if the program generates the same output files. Please note that I have only tested with files which contain text, with no tables, images, or any of the other things that can be in a document. In reality it should also work with everything that can be inside a document. Thank you very much for your help.

Best regards,
Amin

tahir.manzoor · November 25, 2013, 1:40am

Hi Amin,

Thanks for your inquiry.

*aminzamani:

unfortunately it is not working. When I debug the code (it is the document I sent to you) following if statement is not true:
“if (newdoc.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 1)”

The statement “newdoc.getChildNodes(NodeType.PARAGRAPH, true).getCount()” returns 0 by the first founded document*

Please call the Document.ensureMinimum method as shown in the following code snippet. It makes sure the document contains at least one section with one paragraph, so the paragraph count will no longer be 0.

newdoc.ensureMinimum();
if (newdoc.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 1)
{
    if (newdoc.getChild(NodeType.PARAGRAPH, 0, true).toString(SaveFormat.TEXT).trim().equals(""))
        continue;
}
newdoc.save(MyDir + "Out_" + i + ".docx");

*aminzamani:

I guess the best way is to provide you some test input files with their correct output which should be generated (splitted documents).

I have attached a zip file “tests-for-input-files-and-their-correct-output.zip” which contains 8 test folder. Every test folder contains an input doc file. I have manually created the correct output files for the input file. Only if the programm generates the same output files then the code is correct. Please note, that I only have tested with files which contains text and no tables or images or all the things that can be in a document. In reality it also should work with everything which can be inside a document. Thank you very*

Thanks for sharing the documents. I have checked the input and output documents. You need to use the same approach shared in the ‘FindAndInsertBookmark’ class: find the word ‘finish’ and insert a bookmark. Once you have inserted a bookmark at each ‘finish’ word, you can easily extract the contents by using the approach (Extract Content from a Bookmark) described here:
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

In your case, you need to make a few modifications to the shared code according to your requirements. These modifications are needed after extracting the contents between bookmarks. For example, please see the code below: it removes the last paragraph of the document if its text is the word ‘finish’.

Hope this answers your query. Please let us know if you have any more queries.

// Extract contents between bookmarks and split the document
for (int i = 0; i < bookmarks.size() - 1; i++)
{
    BookmarkStart bStart = ((Bookmark)bookmarks.get(i)).getBookmarkStart();
    BookmarkEnd bEnd = ((Bookmark)bookmarks.get(i + 1)).getBookmarkEnd();
    ArrayList nodes = extractContent2(bStart, bEnd, true);
    Document newdoc = generateDocument2(doc, nodes);
    // Remove the trailing paragraph if it only holds the search word.
    // getLastParagraph() returns null when the body has no paragraphs.
    Paragraph lastPara = newdoc.getLastSection().getBody().getLastParagraph();
    if (lastPara != null && lastPara.toString(SaveFormat.TEXT).trim().equals(searchKeyWord))
        lastPara.remove();
    // If the document starts with the search word, insert a continuous section break there.
    Paragraph firstPara = newdoc.getFirstSection().getBody().getFirstParagraph();
    if (firstPara != null && firstPara.toString(SaveFormat.TEXT).trim().equals(searchKeyWord))
    {
        DocumentBuilder newbuilder = new DocumentBuilder(newdoc);
        newbuilder.moveTo(firstPara);
        newbuilder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
        newbuilder.getCurrentParagraph().remove();
    }
    // Guarantee at least one paragraph so the paragraph count below is never 0.
    newdoc.ensureMinimum();
    // Skip documents that consist of a single empty paragraph.
    if (newdoc.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 1)
    {
        if (newdoc.getChild(NodeType.PARAGRAPH, 0, true).toString(SaveFormat.TEXT).trim().equals(""))
            continue;
    }
    newdoc.save(MyDir + "Out_" + i + ".docx");
}

aminzamani · November 25, 2013, 5:22am

Hi Tahir,

thanks again for your advice. I have tried doing what you mentioned, but it is not working properly. I have attached my Java class which contains all the logic, so you can see that everything is implemented as described.

Inside the attached zip archive you will also find the file “seciton1-input.docx”. I have tested the new code only with this file and got no valid output. Inside the “res” folder you can see the result of splitting the file. Only 3 files are generated. The last file, which should contain only the word “finish”, does not exist. I have manually created the folder “expected-result” which contains the correct files that should be generated automatically. There you will see the file “Out_4.docx”, which I added and which is not generated automatically.

Also, if you look at the base file “seciton1-input.docx”, there are no section breaks, but the output files contain section breaks. The output file must not contain anything which is not in the input file! There are no section breaks in the input file!! The extracted content must be identical!

Thank you very much for finding a solution for this problem. Thanks a lot for your help.

Best regards,
Amin

aminzamani · November 26, 2013, 6:40am

Hi,
I was able to implement the code as you described, but it does not always work. I have attached an input file for which the page break is missing from the new document. When I use the two input files that you attached before, it works. Maybe it would be best if you attached the class “FindandSplitDocument.java” that you wrote, so I can be 100% sure that I have the version you used. Please also check whether it works for you with the attached input file “test1-input.doc”. In my case the page break is not inside the second output file.

By the way: why do we only set a bookmark if there is a page break in the input file? The answer seems to be that otherwise the page break is missing from the newly generated document. But everything between the start and end of the extracted content should be in the generated document. It seems possible to me that there is more data than just page breaks that we would have to add manually.

Do we have to add other elements manually to the generated documents, too? Is there no way to simply copy everything from the start to the end of the document? There is also another problem: we use Aspose.Words because we want to split the documents as described in this ticket. These documents will be passed to people by a workflow in an enterprise content management system. Many scientists will get the split documents and modify them. It is a very important project, so it is essential that everything between the search words in the source document ends up in the split document. Once people have edited the split documents, we will merge them back into one document. Therefore it is not acceptable for some parts to be missing.

The splitting and merging must be finished as soon as possible, because in 8 days we have to show it to our customer. I have to ensure that the splitting works with everything that is in the source document between the search words, so that it is also copied to the split documents. The last step is merging the split documents back into one document.

I appreciate your help very much!

Could you confirm that everything ends up in the split document? As mentioned, we put page breaks into the split documents manually. Are there other elements which must be added manually to the split documents, like this:

if (newdoc.getFirstSection().getBody().getFirstParagraph() != null &&
    newdoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().equals(searchKeyWord)
    && bStart.getBookmark().getName().startsWith("BM_S"))
{
    DocumentBuilder newbuilder = new DocumentBuilder(newdoc);
    newbuilder.moveTo(newdoc.getFirstSection().getBody().getFirstParagraph());
    newbuilder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
    newbuilder.getCurrentParagraph().remove();
}

Thanks for helping and best regards,
Amin

tahir.manzoor · November 26, 2013, 10:57am

Hi Amin,

Thanks for your inquiry.

*aminzamani:

I was able to implement the code as you described. But it is not always working. I have attached an input file. The page break is not in the new document. But when i use your input files that you have attached before (the 2 ones) it works. Maybe best is when you attach your class that you have coded “FindandSplitDocument.java” so I can be 100% sure that I have that one you used. Please also check if it works for you with the attached input file “test1-input.doc”. The page break in my case is not inside the second output file.*

Yes, in the case of test1-input.doc, the code does not insert the section break in the output documents. I forgot to add the following lines of code for the last ‘finish’ word.

. . . 
. . . 
BookmarkStart bStart = ((Bookmark)bookmarks.get(bookmarks.size() - 1)).getBookmarkStart();
ArrayList nodes = extractContent2(bStart, doc.getLastSection().getBody().getLastParagraph(), true);
Document newdoc = generateDocument2(doc, nodes);
if (newdoc.getFirstSection().getBody().getFirstParagraph() != null &&
newdoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().equals(searchKeyWord)
&& bStart.getBookmark().getName().startsWith("BM_S"))
{
    DocumentBuilder newbuilder = new DocumentBuilder(newdoc);
    newbuilder.moveTo(newdoc.getFirstSection().getBody().getFirstParagraph());
    newbuilder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
    newbuilder.getCurrentParagraph().remove();
}
newdoc.getRange().getBookmarks().clear();
newdoc.save(MyDir + "Out_" + bookmarks.size() + ".docx");

*aminzamani:

by the way: Why do we only set a bookmark if in the input file is a page break? The answer seems to be, because then the page break is not in the new generated document when not setting it. But everything should be in the generated document which is between the start and end of the extracted content. It seems for me possible that there could be some more data than only a page break that we add manually.*

Bookmarks are inserted for all ‘finish’ words. If the ‘finish’ paragraph contains a section break, the name of the inserted bookmark starts with ‘BM_S’. Please check the following lines of code.

if (((Run)runs.get(0)).getParentParagraph().getText().contains(ControlChar.SECTION_BREAK))
{
    builder.startBookmark("BM_S" + i);
    builder.endBookmark("BM_S" + i);
}
else
{
    builder.startBookmark("BM_" + i);
    builder.endBookmark("BM_" + i);
}

*aminzamani:

Do we have add other elements manually into the generated elements, too? Is there no way to easily copy everything from the start till to the end of the document? Also an other problem : We use aspose word because we want to split the documents as described inside this ticket. This documents will be given to someone by an workflow in an enterprise content management system. Many scientists will get the splitted documents and modify them. It is really a very important project and so very important that everything is inside the splitted document which was between the extracted content in the source document. Then when the people have edited the splitted documents we will merge it back to one document. Therefore it is not acceptable when some parts are not there.*

You can insert images, text, bookmarks, tables, etc. into the generated document, so your requirement can be met. Please read the following documentation links for reference.
https://docs.aspose.com/words/java/aspose-words-document-object-model/
https://docs.aspose.com/words/java/logical-levels-of-nodes-in-a-document/

Please check the code at the following documentation links.
https://docs.aspose.com/words/java/find-and-replace/
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

*aminzamani:

The splitting and merging must be finish as soon as possible because in 8 days we have to show it to our customer. I have to ensure that the splitting is working with everything which is in the source document between the search words and thus also copied to the splitted documents. The last step is merging the splitted documents back to one document.*

To merge the split documents back together, please use the Document.appendDocument method, which appends the specified document to the end of the current document. Please read the following documentation link.
https://docs.aspose.com/words/java/insert-and-append-documents/

*aminzamani:

Could you ensure / confirm that everything is inside the splitted document? As said for page breaks we put them manually into the splitted documents. Are there other elements which must be added manually to the splitted documents, like:*

The extractContent method works correctly, but it does not extract section breaks. That is the reason the section break is added separately after extracting the contents.

Please note that FindAndInsertBookmark and extractContent work fine. Regarding the FindReplaceTest method, it seems that all of your scenarios are covered in the shared code. However, you need to modify the code according to your requirements.

aminzamani · November 26, 2013, 11:17am

Hi,

thanks very much for your help. I have added the code you mentioned, but the section breaks are still not in the output file. I have attached the file test7-input.doc. No section breaks are in the split documents when they are generated. :-((

Every page break should be inserted into the split documents. Is there a way to search for all page breaks and insert them? Everything must be in the split document exactly as in the source document. If there is a page break between two characters, for example “ab”, then it must be the same in the generated document.

Thank you very much for your help.

Amin

aminzamani · November 27, 2013, 2:47am

Hello,

Could you be so kind as to show me how to insert ALL page breaks that are in the base document into the split documents as well?

I thank you very much,

Amin

tahir.manzoor · November 27, 2013, 10:45am

Hi Amin,

Thanks for your inquiry.

The extractContent method does not extract section breaks. I am working on your current scenario and will update you as soon as possible.

aminzamani · November 27, 2013, 11:16am

Hi Tahir,

I thank you very much!

best regards,
Amin

tahir.manzoor · November 28, 2013, 8:45am

Hi Amin,

Thanks for your inquiry. The extractContent method does not extract section breaks. In this case, I suggest the following solution. Hope this helps you.

Insert a bookmark at the end of each section (except the last one), with a name that starts with ‘BM_Break’.
After extracting the contents, insert a section break at the position of each inserted bookmark.

// Load in the document
Document doc = new Document(MyDir + "test7-input.doc");
// Create a DocumentBuilder for inserting the bookmarks.
DocumentBuilder builder = new DocumentBuilder(doc);
String searchKeyWord = "finish";
// Move cursor to document start and insert bookmark
builder.moveToDocumentStart();
// If the first paragraph's text is not equal to searchKeyWord, insert the bookmark
if (!doc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().equals(searchKeyWord))
{
    builder.startBookmark("BM_0");
    builder.endBookmark("BM_0");
}
// Find text and insert bookmark
Pattern regex = Pattern.compile(searchKeyWord, Pattern.CASE_INSENSITIVE);
FindAndInsertBookmark obj = new FindAndInsertBookmark();
doc.getRange().replace(regex, obj, true);
// Collect the inserted bookmarks whose names start with BM_ in an ArrayList
ArrayList bookmarks = new ArrayList();
for (int i = 0; i < doc.getRange().getBookmarks().getCount(); i++)
{
    if (doc.getRange().getBookmarks().get(i).getName().startsWith("BM_"))
        bookmarks.add(doc.getRange().getBookmarks().get(i));
}
// Move cursor to document end and insert bookmark
builder.moveToDocumentEnd();
builder.startBookmark("BM_" + bookmarks.size());
builder.endBookmark("BM_" + bookmarks.size());
// Insert a BM_Break bookmark in the last paragraph of every section except the last one
int bm = 1;
for (Section section : doc.getSections())
{
    if (doc.getLastSection().equals(section))
        continue;
    builder.moveTo(section.getBody().getLastParagraph());
    builder.startBookmark("BM_Break" + bm);
    builder.endBookmark("BM_Break" + bm);
    bm++;
}
// Extract contents between bookmarks and split the document
for (int i = 0; i < bookmarks.size() - 1; i++)
{
    BookmarkStart bStart = ((Bookmark)bookmarks.get(i)).getBookmarkStart();
    BookmarkEnd bEnd = ((Bookmark)bookmarks.get(i + 1)).getBookmarkEnd();
    ArrayList nodes = extractContent(bStart, bEnd.getParentNode().getPreviousSibling(), true);
    Document newdoc = generateDocument(doc, nodes);
    DocumentBuilder newbuilder = new DocumentBuilder(newdoc);
    for (Bookmark bookmark : newdoc.getRange().getBookmarks())
    {
        if (bookmark.getName().contains("BM_Break"))
        {
            newbuilder.moveTo(bookmark.getBookmarkEnd().getParentNode());
            newbuilder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
            newbuilder.getCurrentParagraph().remove();
        }
    }
    newdoc.getRange().getBookmarks().clear();
    newdoc.save(MyDir + "Out_" + i + ".docx");
}
BookmarkStart bStart = ((Bookmark)bookmarks.get(bookmarks.size() - 1)).getBookmarkStart();
ArrayList nodes = extractContent(bStart, doc.getLastSection().getBody().getLastParagraph(), true);
Document newdoc = generateDocument(doc, nodes);
DocumentBuilder newbuilder = new DocumentBuilder(newdoc);
for (Bookmark bookmark : newdoc.getRange().getBookmarks())
{
    if (bookmark.getName().contains("BM_Break"))
    {
        newbuilder.moveTo(bookmark.getBookmarkEnd().getParentNode());
        newbuilder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
        newbuilder.getCurrentParagraph().remove();
    }
}
newdoc.getRange().getBookmarks().clear();
newdoc.save(MyDir + "Out_" + bookmarks.size() + ".docx");

class FindAndInsertBookmark implements IReplacingCallback
{
    int i = 1;
    public int replacing(ReplacingArgs e) throws Exception
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.getMatchNode();
        // The first (and may be the only) run can contain text before the match,
        // in this case it is necessary to split the run.
        if (e.getMatchOffset() > 0)
            currentNode = splitRun((Run)currentNode, e.getMatchOffset());
        // This list stores all runs of the match; the bookmark is inserted at the first one.
        ArrayList runs = new ArrayList();
        // Find all runs that contain parts of the match string.
        int remainingLength = e.getMatch().group().length();
        while (
                (remainingLength > 0) &&
                        (currentNode != null) &&
                        (currentNode.getText().length() <= remainingLength))
        {
            runs.add(currentNode);
            remainingLength = remainingLength - currentNode.getText().length();
            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.getNextSibling();
            }
            while ((currentNode != null) && (currentNode.getNodeType() != NodeType.RUN));
        }
        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            splitRun((Run)currentNode, remainingLength);
            runs.add(currentNode);
        }
        DocumentBuilder builder = new DocumentBuilder((Document)((Run)runs.get(0)).getDocument());
        builder.moveTo((Run)runs.get(0));
        builder.startBookmark("BM_"+ i);
        builder.endBookmark("BM_"+ i);
        i++;
        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.SKIP;
    }
    /**
     * Splits text of the specified run into two runs.
     * Inserts the new run just after the specified run.
     */
    private Run splitRun(Run run, int position) throws Exception
    {
        Run afterRun = (Run)run.deepClone(true);
        afterRun.setText(run.getText().substring(position));
        run.setText(run.getText().substring(0, position));
        run.getParentNode().insertAfter(afterRun, run);
        return afterRun;
    }
}
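
The chunk boundaries that the listing above appears to produce can be checked against hand-made expected output with a small model that does not use Aspose.Words at all. The sketch below is only an illustration under stated assumptions: paragraphs are reduced to plain strings, a paragraph counts as a split point when its trimmed text equals the keyword (ignoring case), and the names KeywordSplitModel, splitAtKeyword, and merge are hypothetical. Section breaks, tables, and formatting are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the split; not part of Aspose.Words.
class KeywordSplitModel {
    /**
     * Splits paragraph texts into chunks. Every paragraph whose trimmed text
     * equals the keyword (ignoring case) starts a new chunk.
     */
    static List<List<String>> splitAtKeyword(List<String> paragraphs, String keyword) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String paragraph : paragraphs) {
            boolean isKeyword = paragraph.trim().equalsIgnoreCase(keyword);
            // A keyword paragraph closes the running chunk, unless nothing precedes it.
            if (isKeyword && !current.isEmpty()) {
                chunks.add(current);
                current = new ArrayList<>();
            }
            current.add(paragraph);
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    /** Concatenates the chunks again; a lossless split must reproduce the input. */
    static List<String> merge(List<List<String>> chunks) {
        List<String> merged = new ArrayList<>();
        for (List<String> chunk : chunks) {
            merged.addAll(chunk);
        }
        return merged;
    }
}
```

Under these assumptions, the paragraphs "intro", "finish", "a", "b", "finish" yield three chunks, the last of which holds only "finish", and merging the chunks gives back the original list. The same round-trip idea (split, merge, compare) is one possible way to confirm that nothing is lost in the real documents.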

aminzamani · December 5, 2013, 7:04am

Hi Tahir, thank you very much for your suggestion. I am now trying to implement the code. Could you please show me how to find out which type of section break was found? There are different types, such as “BreakType.SECTION_BREAK_CONTINUOUS” and “BreakType.SECTION_BREAK_NEW_PAGE”, and the user can use any of them. The code you wrote inserts a “BreakType.SECTION_BREAK_CONTINUOUS” for every section break that was found. But it is important to insert the same type of section break as the one found, not a fixed type.

By the way, when I look inside the class “BreakType”, I also see the following section break types:
BreakType.SECTION_BREAK_EVEN_PAGE
BreakType.SECTION_BREAK_NEW_COLUMN
BreakType.SECTION_BREAK_ODD_PAGE

Does the extractContent method add these section breaks, or do we have to set the section type manually here, too?

I thank you very much for your help!

Best regards,
Amin

tahir.manzoor · December 6, 2013, 3:02am

Hi Amin,

Thanks for your inquiry. A Section can have one Body and at most one HeaderFooter of each HeaderFooterType. Body and HeaderFooter nodes can be in any order inside a Section. Each section has its own set of properties that specify page size, orientation, margins, etc.

In your case, I suggest using the PageSetup.SectionStart property to get the type of section break for the given section. Please check the following code example for reference.

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.insertBreak(BreakType.SECTION_BREAK_CONTINUOUS);
builder.writeln("SECTION_BREAK_CONTINUOUS");
builder.insertBreak(BreakType.SECTION_BREAK_NEW_COLUMN);
builder.writeln("SECTION_BREAK_NEW_COLUMN");
builder.insertBreak(BreakType.SECTION_BREAK_ODD_PAGE);
builder.writeln("SECTION_BREAK_ODD_PAGE");
builder.insertBreak(BreakType.COLUMN_BREAK);
builder.writeln("COLUMN_BREAK");
// getSectionStart() returns one of the SectionStart constants for each section
for (Section section : doc.getSections())
{
    System.out.println(section.getPageSetup().getSectionStart());
}

aminzamani · December 6, 2013, 4:33am

Hi Tahir,

thanks for your reply.

I have tried what you wrote. Is there no constant to test whether the value returned by “section.getPageSetup().getSectionStart()” equals, for example, a continuous section break? How can I test this? The reason I ask is that “BreakType.SECTION_BREAK_NEW_PAGE” has a different value from the one returned by “section.getPageSetup().getSectionStart()” for the same kind of section.

Thank you for your reply and help.

Best regards,
Amin

aminzamani · December 6, 2013, 9:04am

Hi Tahir,

I have now solved the problem of carrying the section breaks of each type over to the new merged document. Thank you very much for your great help!

Best regards,
Amin

tahir.manzoor · December 6, 2013, 9:58am

Hi Amin,

Thanks for your feedback. It is nice to hear that you have solved your issue.

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

aminzamani · December 6, 2013, 11:04am

Hi Tahir,

thanks for your answer.

Yes, of course, I know that this returns a number, nothing more. I had thought there should be an Aspose constant that I could compare with the number returned by the method you mentioned, to find out what type of section was found. However, I know which number is a new-page section break and which is a continuous section break. Are there no constants in Aspose.Words? You don’t have to investigate, because the problem is fixed. I only thought it would be convenient to compare the section break number with an Aspose constant to know which type of section break was found.

Best regards,
Amin

aminzamani · December 6, 2013, 11:30am

Hi Tahir,

I have found the constant that I was searching for: SectionStart.NEW_PAGE

Problem solved. Thanks

Amin

tahir.manzoor · December 9, 2013, 12:31am

Hi Amin,

Thanks for your inquiry. Please check the SectionStart constants at the following documentation link. These constants specify the type of break at the beginning of a section.
https://reference.aspose.com/words/java/com.aspose.words/SectionStart
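
Since the value read from a section (SectionStart) and the value passed to insertBreak (BreakType) come from two different sets of constants, reproducing the original break type needs a small translation step. The sketch below shows one possible shape of that translation. It is self-contained and does not call Aspose.Words: the enums StartKind and BreakKind are stand-ins that only mirror the constant names, and the class and method names are hypothetical.

```java
// Plain-Java sketch of the SectionStart -> BreakType translation.
class SectionBreakMapping {
    /** Stand-in for the constant names of com.aspose.words.SectionStart. */
    enum StartKind { CONTINUOUS, NEW_COLUMN, NEW_PAGE, EVEN_PAGE, ODD_PAGE }

    /** Stand-in for the section-break constant names of com.aspose.words.BreakType. */
    enum BreakKind {
        SECTION_BREAK_CONTINUOUS,
        SECTION_BREAK_NEW_COLUMN,
        SECTION_BREAK_NEW_PAGE,
        SECTION_BREAK_EVEN_PAGE,
        SECTION_BREAK_ODD_PAGE
    }

    /** Returns the break kind that recreates a section starting the given way. */
    static BreakKind breakFor(StartKind start) {
        switch (start) {
            case CONTINUOUS: return BreakKind.SECTION_BREAK_CONTINUOUS;
            case NEW_COLUMN: return BreakKind.SECTION_BREAK_NEW_COLUMN;
            case NEW_PAGE:   return BreakKind.SECTION_BREAK_NEW_PAGE;
            case EVEN_PAGE:  return BreakKind.SECTION_BREAK_EVEN_PAGE;
            case ODD_PAGE:   return BreakKind.SECTION_BREAK_ODD_PAGE;
            default: throw new IllegalArgumentException("Unknown section start: " + start);
        }
    }
}
```

With the real library, the same switch would presumably compare the number returned by section.getPageSetup().getSectionStart() against the SectionStart constants and return the matching BreakType constant for insertBreak, instead of always using SECTION_BREAK_CONTINUOUS.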

Please let us know if you have any more queries.