Split document into multiple documents on special character

THTHO · January 18, 2017, 7:38am

Hi,
I need help splitting my document into 2 documents at a special character.

My document contains 6 pages.
At the start of every page (not counting the header) there is a “special line” in a smaller font and white color (press CTRL+A to see it). The line always starts with ¤¤¤

At page 1 the line reads:
¤¤¤D_5_20170118142159768biege_1_FD82449C-10FE-4D21-BF80-2764A01383DA

At page 4 another line is located.

I want to split my document into 2 documents (with 3 pages in each).
How is this possible?

Hope you can help

tahir.manzoor · January 19, 2017, 9:34am

Hi Thomas,

Thanks for your inquiry. Please refer to the following articles:
Find and Replace
How to Extract Selected Content Between Nodes in a Document

Please use the following code example to split the shared document into two documents. Hope this helps you.

Document doc = new Document(MyDir + "AsposeSplitTest.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
// Find the text and insert a bookmark.
doc.Range.Replace("¤¤¤D_5_20170118142159768biege_1_A3E79C04-783D-4658-ADF7-52177CD9276F", "", new FindReplaceOptions(new FindAndInsertBookmark()));
Bookmark bookmark = doc.Range.Bookmarks["BM_0"];
// Bookmark the contents from the start of the document to the matched node.
builder.MoveToDocumentStart();
builder.StartBookmark("FirstDoc");
builder.MoveToBookmark("BM_0", true, false);
builder.EndBookmark("FirstDoc");
// Bookmark the contents from the matched node to the end of the document.
builder.MoveToBookmark("BM_0", false, true);
builder.StartBookmark("SecondDoc");
builder.MoveToDocumentEnd();
builder.EndBookmark("SecondDoc");
// Split the document by extracting the contents between bookmarks
Bookmark DocStart = doc.Range.Bookmarks["FirstDoc"];
Bookmark DocEnd = doc.Range.Bookmarks["SecondDoc"];
// Remove the contents of first part of document
Document dstDoc1 = (Document)doc.Clone(true);
dstDoc1.Range.Bookmarks["FirstDoc"].Text = "";
dstDoc1.Save(MyDir + "First Document.docx");
// Remove the contents of second part of document
Document dstDoc2 = (Document)doc.Clone(true);
dstDoc2.Range.Bookmarks["SecondDoc"].Text = "";
dstDoc2.Save(MyDir + "Second Document.docx");

public class FindAndInsertBookmark : IReplacingCallback
{
    int i;
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        Paragraph para = (Paragraph)e.MatchNode.ParentNode;
        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);
        builder.MoveTo(e.MatchNode);
        builder.StartBookmark("BM_" + i);
        builder.EndBookmark("BM_" + i);
        i++;
        // Signal to the replace engine to do nothing
        return ReplaceAction.Skip;
    }
}

THTHO · January 20, 2017, 1:59am

Hi,
Thanks for the example!
I have a couple of things:

Instead of splitting on the whole “¤¤¤D_5_20170118142159768biege_1_A3E79C04-783D-4658-ADF7-52177CD9276F” string - I need only to split on “¤¤¤”

I'm guessing this is enough?
doc.Range.Replace("¤¤¤", "", new FindReplaceOptions(new FindAndInsertBookmark()));

The “Second document” contains 4 pages (one of them blank). How can this be fixed so that both documents contain only the needed 3 pages?
I have documents with more than 2 pages (3,4,…100 maybe)
I have attached another test-document which should be split into 4 documents (contains 4x ¤¤¤)

Can you make an example doing this? Would be much appreciated

tahir.manzoor · January 23, 2017, 1:19am

Hi Thomas,

Please use the following code example to achieve your requirements. This code example does the following:

Find the text ¤¤¤ and insert a bookmark whose name starts with “BM_”.
Extract the contents between two consecutive bookmarks, e.g. “BM_0” and “BM_1”.

Please get the code of GenerateDocument and ExtractContent methods from Aspose.Words for .NET examples repository at GitHub.

Document doc = new Document(MyDir + "AsposeSplitTest2.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
// Find the text and insert a bookmark.
doc.Range.Replace("¤¤¤", "", new FindReplaceOptions(new FindAndInsertBookmark()));
ArrayList bookmarks = new ArrayList();
foreach (Bookmark bookmark in doc.Range.Bookmarks)
{
    if (bookmark.Name.StartsWith("BM_"))
        bookmarks.Add(bookmark.Name);
}
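builder.MoveToDocumentEnd();
// Add a closing bookmark at the end of the document so the last part has an end point.
// The name is built once; "BM_" + bookmarks.Count + 1 would concatenate the digit 1 as text.
string endBookmarkName = "BM_" + bookmarks.Count;
builder.StartBookmark(endBookmarkName);
builder.EndBookmark(endBookmarkName);
bookmarks.Add(endBookmarkName);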
for (int i = 0; i < bookmarks.Count - 1; i++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks[bookmarks[i].ToString()].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks[bookmarks[i + 1].ToString()].BookmarkStart;
    ArrayList extractedNodes = Common.ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document doc2 = Common.GenerateDocument(doc, extractedNodes);
    doc2.Save(MyDir + "output" + i + ".docx");
}
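
The pairing logic in the loop above (each marker is paired with the next one, and a closing marker is added at the end of the document) can be tried out without Aspose.Words. The following is only a sketch on plain strings: `MarkerSplitter` and `SplitAtMarkers` are hypothetical names, and the document is modelled as a list of text lines rather than document nodes.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerSplitter
{
    // Groups lines into parts; each part starts at a line that begins with the marker.
    // Lines before the first marker are not returned, because the loop above
    // also starts extracting at the first marker bookmark.
    public static List<List<string>> SplitAtMarkers(IList<string> lines, string marker)
    {
        // Indexes of the marker lines play the role of the "BM_" bookmarks.
        var starts = new List<int>();
        for (int i = 0; i < lines.Count; i++)
        {
            if (lines[i].StartsWith(marker, StringComparison.Ordinal))
                starts.Add(i);
        }

        // Sentinel: the end of the input closes the last part.
        starts.Add(lines.Count);

        // Pair each start with the next one, stopping before the sentinel.
        var parts = new List<List<string>>();
        for (int i = 0; i < starts.Count - 1; i++)
        {
            var part = new List<string>();
            for (int j = starts[i]; j < starts[i + 1]; j++)
                part.Add(lines[j]);
            parts.Add(part);
        }
        return parts;
    }
}
```

Note that the outer loop runs to `starts.Count - 1`: the sentinel is only ever used as an end point, never as a start, which is the same reason the document loop stops at `bookmarks.Count - 1`.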

If you want to remove the last empty page of the document, please use the following code example.

Document doc = new Document(MyDir + "input.docx");
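// Remove the last section if it holds only one empty paragraph (never remove the only section).
if (doc.Sections.Count > 1 && doc.LastSection.Body.Paragraphs.Count == 1 && doc.LastSection.Body.LastParagraph.ToString(SaveFormat.Text).Trim() == "")
    doc.LastSection.Remove();
// Get the last section.
Section lastSect = doc.LastSection;
// Remove empty paragraphs from the end of the document; stop if no paragraph is left.
while (lastSect.Body.LastParagraph != null && string.IsNullOrEmpty(lastSect.Body.LastParagraph.ToString(SaveFormat.Text).Trim()))
{
    lastSect.Body.LastParagraph.Remove();
}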
doc.Save(MyDir + "Out.docx");

tahir.manzoor · October 9, 2019, 10:51am

A post was split to a new topic: FindAndInsertBookmark code

sanathjs · October 9, 2019, 2:02pm

sanath-sample-test.zip (49.5 KB)

I tried the above solution with the Word file inside the attached zip. I want to split the document from each occurrence of “Country=” up to the next occurrence of “Country=”, so ideally the document should be split into 4 separate documents.

string dataDir = RunExamples.GetDataDir_WorkingWithDocument();
string fileName = "sanath-sample-test.docx";
Document doc = new Document(dataDir + fileName);


DocumentBuilder builder = new DocumentBuilder(doc);
doc.Range.Replace("Country=", "", new FindReplaceOptions(new FindAndInsertBookmark()));
ArrayList bookmarks = new ArrayList();
foreach (Bookmark bookmark in doc.Range.Bookmarks)
{
    if (bookmark.Name.StartsWith("BM_"))
        bookmarks.Add(bookmark.Name);
}
builder.MoveToDocumentEnd();
builder.StartBookmark("BM_" + bookmarks.Count + 1);
builder.EndBookmark("BM_" + bookmarks.Count + 1);
bookmarks.Add("BM_" + bookmarks.Count + 1);
for (int i = 0; i < bookmarks.Count - 1; i++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks[bookmarks[i].ToString()].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks[bookmarks[i + 1].ToString()].BookmarkStart;
    ArrayList extractedNodes = Common.ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document doc2 = Common.GenerateDocument(doc, extractedNodes);
    doc2.Save(dataDir + "output" + i + ".docx");
}

Could you please help me find a solution for this?

tahir.manzoor · October 9, 2019, 5:55pm

@sanathjs

Please add a space between “Country” and “=” as shown below to get the desired output.

doc.Range.Replace("Country =", "", new FindReplaceOptions(new FindAndInsertBookmark()));
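
If the input documents are not consistent about that space, a regular expression that tolerates optional whitespace may be more robust than a literal string. This is a sketch that has not been tested against the attached file: `MarkerPattern` and `CountMarkers` are hypothetical helper names, used here only to show what the pattern matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkerPattern
{
    // Matches "Country", any amount of whitespace (including none), then "=".
    public static readonly Regex Country = new Regex(@"Country\s*=");

    // Counts how many markers the pattern finds in a piece of text.
    public static int CountMarkers(string text)
    {
        return Country.Matches(text).Count;
    }
}
```

Assuming the `Range.Replace(Regex, string, FindReplaceOptions)` overload is available in your Aspose.Words version, the pattern could be passed in place of the literal string: `doc.Range.Replace(MarkerPattern.Country, "", new FindReplaceOptions(new FindAndInsertBookmark()));`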

sanathjs · October 10, 2019, 5:33am

Thanks for the quick reply, much appreciated. Your solution is working for me now.

tahir.manzoor · October 10, 2019, 1:58pm

@sanathjs

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

sanathjs · October 17, 2019, 6:19am

When the document is split, it loses its original styles (paragraph fonts etc. are changed). How can I retain the original style of the document? Alternatively, how can I keep the original Word template when splitting, i.e. attach the same template to the split documents?

tahir.manzoor · October 17, 2019, 2:08pm

@sanathjs

Please ZIP and attach screenshots of the problematic sections of the output document, along with the problematic and expected output documents. We will then provide you with more information about your query.

sanathjs · October 18, 2019, 8:27am

test.zip (326.7 KB)

hi @tahir.manzoor,

I have attached screenshots of the original file before the split and of a file after the split, together with the source file and the split files. You can see that the paragraph font changes after splitting. Please have a look and let me know the solution as early as possible.

tahir.manzoor · October 18, 2019, 3:31pm

@sanathjs

We are investigating your issue and will get back to you soon.

tahir.manzoor · October 21, 2019, 5:06pm

@sanathjs

The style name of the shared paragraph is not set properly in your input document. You can check this by unzipping the document and inspecting document.xml. Please open the document in MS Word and clear the formatting. MS Word creates a new style for the shared paragraph.

Please use the following modified GenerateDocument method to get the desired output.

public static Document GenerateDocument(Document srcDoc, ArrayList nodes)
{
    Document dstDoc = srcDoc.Clone();

    dstDoc.RemoveAllChildren();
    dstDoc.EnsureMinimum();
    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);

    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}