Split document into multiple documents on special character


#1

Hi,
I need help splitting my document into 2 documents at a special character.

My document contains 6 pages.
At the start of every page (not counting the header) there is a “special line” in a smaller font and white color (press CTRL+A to see it). The line always starts with ¤¤¤

At page 1 the line reads:
¤¤¤D_5_20170118142159768biege_1_FD82449C-10FE-4D21-BF80-2764A01383DA

At page 4 another line is located.

I want to split my document into 2 documents (with 3 pages in each).
How is this possible?

Hope you can help :slight_smile:



#2
Hi Thomas,

Thanks for your inquiry. Please refer to the following articles:
Find and Replace
How to Extract Selected Content Between Nodes in a Document

Please use the following code example to split the shared document into two documents. Hope this helps you.

using Aspose.Words;
using Aspose.Words.Replacing;

Document doc = new Document(MyDir + "AsposeSplitTest.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

// Find the text and insert a bookmark ("BM_0") at the match.
doc.Range.Replace("¤¤¤D_5_20170118142159768biege_1_A3E79C04-783D-4658-ADF7-52177CD9276F", "", new FindReplaceOptions(new FindAndInsertBookmark()));
Bookmark bookmark = doc.Range.Bookmarks["BM_0"];

// Bookmark the contents from the start of the document to the matched node.
builder.MoveToDocumentStart();
builder.StartBookmark("FirstDoc");
builder.MoveToBookmark("BM_0", true, false);
builder.EndBookmark("FirstDoc");

// Bookmark the contents from the matched node to the end of the document.
builder.MoveToBookmark("BM_0", false, true);
builder.StartBookmark("SecondDoc");
builder.MoveToDocumentEnd();
builder.EndBookmark("SecondDoc");

// Split the document using the contents between the bookmarks.
Bookmark DocStart = doc.Range.Bookmarks["FirstDoc"];
Bookmark DocEnd = doc.Range.Bookmarks["SecondDoc"];

// Remove the contents of the first part of the document.
Document dstDoc1 = (Document)doc.Clone(true);
dstDoc1.Range.Bookmarks["FirstDoc"].Text = "";
dstDoc1.Save(MyDir + "First Document.docx");

// Remove the contents of the second part of the document.
Document dstDoc2 = (Document)doc.Clone(true);
dstDoc2.Range.Bookmarks["SecondDoc"].Text = "";
dstDoc2.Save(MyDir + "Second Document.docx");

public class FindAndInsertBookmark : IReplacingCallback
{
    // Number of matches seen so far; used to build the bookmark names BM_0, BM_1, ...
    int i;

    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        Paragraph para = (Paragraph)e.MatchNode.ParentNode;
        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);

        // Insert an empty bookmark at the matched node.
        builder.MoveTo(e.MatchNode);
        builder.StartBookmark("BM_" + i);
        builder.EndBookmark("BM_" + i);
        i++;

        // Signal to the replace engine to do nothing.
        return ReplaceAction.Skip;
    }
}
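
As a rough illustration of what the callback above does (count each match, give it a name, and leave the text untouched), here is the same pattern on a plain string. This is only a stand-in: it uses Regex.Replace from .NET instead of Aspose.Words, and the names MarkerNaming and NameMatches are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MarkerNaming
{
    // Returns one name ("BM_0", "BM_1", ...) per occurrence of marker in text.
    // The text itself is not modified, similar in spirit to ReplaceAction.Skip.
    public static List<string> NameMatches(string text, string marker)
    {
        var names = new List<string>();
        int i = 0;
        Regex.Replace(text, Regex.Escape(marker), m =>
        {
            names.Add("BM_" + i);
            i++;
            return m.Value; // keep the matched text as it was
        });
        return names;
    }

    public static void Main()
    {
        List<string> names = NameMatches("¤¤¤A page one ¤¤¤B page four", "¤¤¤");
        Console.WriteLine(string.Join(", ", names)); // prints "BM_0, BM_1"
    }
}
```

Each occurrence of the marker produces exactly one name, which is why searching for "¤¤¤" alone yields one bookmark per special line.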


#3

Hi,
Thanks for the example! :slight_smile:
I have a couple of things:

1) Instead of splitting on the whole “¤¤¤D_5_20170118142159768biege_1_A3E79C04-783D-4658-ADF7-52177CD9276F” string - I need only to split on "¤¤¤"

I'm guessing this is enough?
doc.Range.Replace("¤¤¤", "", new FindReplaceOptions(new FindAndInsertBookmark()));

2) The “Second document” contains 4 pages (one of them blank) - how can this be fixed, so both documents only contain the needed 3 pages?

3) I have documents with more than 2 pages (3,4,…100 maybe)
I have attached another test-document which should be split into 4 documents (contains 4x ¤¤¤)

Can you make an example doing this? Would be much appreciated :slight_smile:




#4
Hi Thomas,

Please use the following code example to achieve your requirements. This code example does the following:

1) Finds the text ¤¤¤ and inserts a bookmark whose name starts with "BM_".
2) Extracts the contents between two bookmarks, e.g. "BM_0" and "BM_1".

Please get the code of the GenerateDocument and ExtractContent methods from the Aspose.Words for .NET examples repository on GitHub.

using System.Collections;

Document doc = new Document(MyDir + "AsposeSplitTest2.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

// Find the text and insert a bookmark at each match.
doc.Range.Replace("¤¤¤", "", new FindReplaceOptions(new FindAndInsertBookmark()));

// Collect the names of the inserted bookmarks.
ArrayList bookmarks = new ArrayList();
foreach (Bookmark bookmark in doc.Range.Bookmarks)
{
    if (bookmark.Name.StartsWith("BM_"))
        bookmarks.Add(bookmark.Name);
}

// Add a closing bookmark at the end of the document so the last part has an end point.
// The parentheses make this an addition; "BM_" + bookmarks.Count + 1 would append "1" as text.
string endBookmarkName = "BM_" + (bookmarks.Count + 1);
builder.MoveToDocumentEnd();
builder.StartBookmark(endBookmarkName);
builder.EndBookmark(endBookmarkName);
bookmarks.Add(endBookmarkName);

// Extract the contents between each pair of consecutive bookmarks into its own document.
for (int i = 0; i < bookmarks.Count - 1; i++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks[bookmarks[i].ToString()].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks[bookmarks[i + 1].ToString()].BookmarkStart;

    ArrayList extractedNodes = Common.ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document doc2 = Common.GenerateDocument(doc, extractedNodes);
    doc2.Save(MyDir + "output" + i + ".docx");
}
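
The marker-to-marker extraction above can be pictured on a plain string: record where each marker starts, add the end of the text as a final boundary, and take the text between each pair of consecutive boundaries. The sketch below is only an analogy and does not use Aspose.Words; MarkerSplit and SplitAtMarkers are hypothetical names, and it assumes, like the loop above, that anything before the first marker is not part of any output.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerSplit
{
    // Splits text into parts that each begin at one occurrence of marker.
    // Text before the first marker is not returned.
    public static List<string> SplitAtMarkers(string text, string marker)
    {
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("marker must not be empty", nameof(marker));

        // Boundaries: the start offset of every marker...
        var starts = new List<int>();
        int pos = text.IndexOf(marker, StringComparison.Ordinal);
        while (pos >= 0)
        {
            starts.Add(pos);
            pos = text.IndexOf(marker, pos + marker.Length, StringComparison.Ordinal);
        }

        // ...plus the end of the text, so the last part has an end point.
        starts.Add(text.Length);

        // One part per pair of consecutive boundaries, hence Count - 1 iterations.
        var parts = new List<string>();
        for (int i = 0; i < starts.Count - 1; i++)
            parts.Add(text.Substring(starts[i], starts[i + 1] - starts[i]));
        return parts;
    }

    public static void Main()
    {
        List<string> parts = SplitAtMarkers("¤¤¤one|¤¤¤two|¤¤¤three", "¤¤¤");
        Console.WriteLine(parts.Count); // prints 3
    }
}
```

With n markers there are n + 1 boundaries and n parts, which is why the loop stops one short of the number of boundaries.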


If you want to remove the last empty page of the document, please use the following code example.

Document doc = new Document(MyDir + "input.docx");

// If the last section contains only a single empty paragraph, remove the whole section.
if (doc.LastSection.Body.Paragraphs.Count == 1 && doc.LastSection.Body.LastParagraph.ToString(SaveFormat.Text).Trim() == "")
    doc.LastSection.Remove();

// Get the last section.
Section lastSect = doc.LastSection;

// Remove empty paragraphs from the end of the document.
// The null check stops the loop if the section runs out of paragraphs.
while (lastSect.Body.LastParagraph != null && string.IsNullOrEmpty(lastSect.Body.LastParagraph.ToString(SaveFormat.Text).Trim()))
{
    lastSect.Body.LastParagraph.Remove();
}

doc.Save(MyDir + "Out.docx");
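
The cleanup loop above amounts to "drop blank items from the end until a non-blank one is found". A minimal sketch of that idea on a plain list of strings follows; it does not use Aspose.Words, the names TrailingBlankTrim and TrimTrailingBlanks are hypothetical, and the check for an empty list is an assumption added here so the loop cannot run past the first item.

```csharp
using System;
using System.Collections.Generic;

public static class TrailingBlankTrim
{
    // Removes blank (empty or whitespace-only) entries from the end of the list.
    // Blank entries in the middle of the list are kept.
    public static void TrimTrailingBlanks(List<string> paragraphs)
    {
        while (paragraphs.Count > 0 &&
               string.IsNullOrWhiteSpace(paragraphs[paragraphs.Count - 1]))
        {
            paragraphs.RemoveAt(paragraphs.Count - 1);
        }
    }

    public static void Main()
    {
        var paragraphs = new List<string> { "First", "", "Last", "", " \f" };
        TrimTrailingBlanks(paragraphs);
        Console.WriteLine(paragraphs.Count); // prints 3
    }
}
```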


split this topic #5

A post was split to a new topic: FindAndInsertBookmark code


#6

sanath-sample-test.zip (49.5 KB)

I tried the above solution with the Word file inside the attached zip. I want to split the document from the word “Country=” up to the next occurrence of “Country=”, so ideally the document should be split into 4 separate documents.

string dataDir = RunExamples.GetDataDir_WorkingWithDocument();
string fileName = "sanath-sample-test.docx";
Document doc = new Document(dataDir + fileName);

DocumentBuilder builder = new DocumentBuilder(doc);
doc.Range.Replace("Country=", "", new FindReplaceOptions(new FindAndInsertBookmark()));
ArrayList bookmarks = new ArrayList();
foreach (Bookmark bookmark in doc.Range.Bookmarks)
{
    if (bookmark.Name.StartsWith("BM_"))
        bookmarks.Add(bookmark.Name);
}
builder.MoveToDocumentEnd();
builder.StartBookmark("BM_" + bookmarks.Count + 1);
builder.EndBookmark("BM_" + bookmarks.Count + 1);
bookmarks.Add("BM_" + bookmarks.Count + 1);
for (int i = 0; i < bookmarks.Count; i++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks[bookmarks[i].ToString()].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks[bookmarks[i + 1].ToString()].BookmarkStart;
    ArrayList extractedNodes = Common.ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document doc2 = Common.GenerateDocument(doc, extractedNodes);
    doc2.Save(dataDir + "output" + i + ".docx");
}

Could you please help me find a solution for this?


#8

@sanathjs

Please add a space between “Country” and “=” as shown below to get the desired output.

doc.Range.Replace("Country =", "", new FindReplaceOptions(new FindAndInsertBookmark()));
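
Separately from the search string, two details in the code posted in #6 may be worth checking. The loop there runs while i < bookmarks.Count but reads bookmarks[i + 1], whereas the example in #4 stops at bookmarks.Count - 1. Also, "BM_" + bookmarks.Count + 1 is evaluated left to right, so the 1 is appended as text rather than added. The standalone sketch below shows both effects using plain strings in place of Aspose.Words bookmarks; the class and method names are hypothetical.

```csharp
using System;
using System.Collections;

public static class BookmarkLoopCheck
{
    // String concatenation runs left to right: "BM_" + 4 + 1 gives "BM_41".
    public static string ConcatenatedName(int count)
    {
        return "BM_" + count + 1;
    }

    // Parentheses force the addition first: "BM_" + (4 + 1) gives "BM_5".
    public static string AddedName(int count)
    {
        return "BM_" + (count + 1);
    }

    // Counts the (start, end) pairs visited when the loop stops at Count - 1.
    public static int CountPairs(ArrayList bookmarks)
    {
        int pairs = 0;
        for (int i = 0; i < bookmarks.Count - 1; i++)
        {
            string start = bookmarks[i].ToString();
            string end = bookmarks[i + 1].ToString();
            pairs++;
        }
        return pairs;
    }

    // Returns true if looping up to Count while reading [i + 1] goes out of range.
    public static bool FullLoopThrows(ArrayList bookmarks)
    {
        try
        {
            for (int i = 0; i < bookmarks.Count; i++)
            {
                string end = bookmarks[i + 1].ToString();
            }
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var bookmarks = new ArrayList { "BM_0", "BM_1", "BM_2", "BM_3", "BM_41" };
        Console.WriteLine(ConcatenatedName(4));        // prints "BM_41"
        Console.WriteLine(AddedName(4));               // prints "BM_5"
        Console.WriteLine(CountPairs(bookmarks));      // prints 4
        Console.WriteLine(FullLoopThrows(bookmarks));  // prints "True"
    }
}
```

The concatenated name is still consistent across the three places it is used, so it does not break the lookup by itself; the loop bound is the part that can raise an exception on the last iteration.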


#9

Thanks for the quick reply, much appreciated. Your solution is working for me now :slight_smile:


#10

@sanathjs

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.