Extracting formatted content between two bookmarks and saving it in a new Word document

I have a large Word document with many chapters. Each chapter is surrounded by a Start and an End bookmark. I need to break this large document up into many Word documents, each of which would contain a separate chapter. The content has graphs, images, and Visio graphs. The text is formatted and in some cases colored. In other words, I need to preserve the original appearance of the content in each of the new documents. Please let me know how I can do this. Thanks.

Hi
Thanks for your inquiry. Yes, of course you can achieve this using Aspose.Words. You can find code that deletes all content between bookmarks in the following thread
https://forum.aspose.com/t/102119
I think you can use the same technique to extract content between bookmarks.
Also, please provide your document for testing. If you need, I can create sample code for you.
Best regards.

Hi Alexey,
Thank you for your response. Could you please create sample code for me? Each chapter has a unique start and end bookmark. As I mentioned, I need to extract everything, including the images and the text formatting. I was concerned that the code for deleting the text between bookmarks would not care about the content and its formatting; it would just select the content between the bookmarks and erase it. That is why I thought to ask you for sample code. Also, I cannot send you the original document, and creating another one for testing without the original content would not do justice to its complexity. So I have to start with the code that we think is closest to what I need and do the testing here myself. Thank you for your help.

Hi
Thanks for your request. I will provide you with a code example today. It would be great to get your document for testing. You can remove all confidential content from the document or replace this content with dummy data.
Best regards.

Hi
Please see the following link
https://forum.aspose.com/t/107549
I think methods (MergeDocuments and SplitDocuments) could be useful in your case.
Best regards.

Hi Alexey,
Thank you for your response. I looked at the content of the link, but it did not make things any easier. What I am trying to do is replace Word with Aspose.Words, and our current function to split the document is pasted below. Maybe the best approach is to translate this Microsoft Word API code to the Aspose.Words API. Can you please take a look at it and try to translate it to Aspose? Please also note that Microsoft Word automatically inserts the reserved bookmarks “\StartofDoc” and “\EndofDoc”, which we would not have in Aspose.Words. This code is in VB6, which we are also converting to VB.NET. Thanks.

Function SplitDocument()

' ******************************************
' SPLIT RCVD FILE
' ******************************************

' Get RCVD file name
returnCheck = getRCVDFileName

' Get RCVD file path
returnCheck = getRCVDFilePath
If returnCheck = False Then
SplitDocument= False
Exit Function
End If

' Load Word (hidden)
LoadWord

' Number of link documents
docCount = mSplitDocs.Count

' Loop through all link documents
For SplitDocInd = 1 To docCount

' Open RCVD File
Set objDoc = objWord.Documents.Open(FileName:=mRCVDFilePath)

' Turn off revision tracking (visible on screen)
objWord.ActiveDocument.TrackRevisions = False
objWord.ActiveDocument.ShowRevisions = False

' Get document ID for section to be split off
docID1 = mSplitDocs.Item(SplitDocInd).docID

' Get next document ID, if any
If SplitDocInd + 1 <= docCount Then
docID2 = mSplitDocs.Item(SplitDocInd + 1).docID
Else
docID2 = 0
End If

' Count bookmarks
bookMarkCount = objWord.ActiveDocument.Bookmarks.Count
' First bookmark to find
firstbookmarkName = "b" & docID1

' Second bookmark to find
If docID2 > 0 Then
secondbookmarkName = "b" & docID2
End If

' See if first bookmark exists (at beginning of section)
If objWord.ActiveDocument.Bookmarks.Exists(firstbookmarkName) Then

' Get bookmark index (beginning of section)
Set firstbookmark = objWord.ActiveDocument.Bookmarks.Item(firstbookmarkName)
firstbookmark.Select
firstbookmarkIndex = objWord.Selection.BookmarkID

' Set range up to first bookmark
Set startRange = objWord.ActiveDocument.Range(objWord.ActiveDocument.Bookmarks("\StartOfDoc").Range.Start, objWord.ActiveDocument.Bookmarks(firstbookmark).Range.End)
If debugOn Then
Message = "Set start range in CreateLinkDocs."
userWarning
End If

' See if next bookmark exists
If SplitDocInd < docCount And _
objWord.ActiveDocument.Bookmarks.Exists(secondbookmarkName) Then

' Move selection to end of previous page
Set secondbookmark = objWord.ActiveDocument.Bookmarks.Item(secondbookmarkName)
secondbookmark.Select
objWord.Selection.GoTo What:=wdGoToLine, Which:=wdGoToPrevious, Count:=1, Name:=""

' Set range to end of document
Set endRange = objWord.ActiveDocument.Range(objWord.Selection.Range.Start, objWord.ActiveDocument.Bookmarks("\EndOfDoc").Range.End)

' Delete end range
endRange.Select
endRange.Delete
End If

' Delete beginning of document

startRange.Select
startRange.Delete

' Set markup properties

With objWord.Options
.InsertedTextMark = wdInsertedTextMarkUnderline
.InsertedTextColor = wdByAuthor
.DeletedTextMark = wdDeletedTextMarkStrikeThrough
.DeletedTextColor = wdByAuthor
.RevisedPropertiesMark = wdRevisedPropertiesMarkNone
.RevisedPropertiesColor = wdAuto
.RevisedLinesMark = wdRevisedLinesMarkOutsideBorder
.RevisedLinesColor = wdAuto
End With

' Turn on change tracking
objWord.ActiveDocument.TrackRevisions = True
objWord.ActiveDocument.ShowRevisions = True

' Get Link doc file path
fDir = mSplitDocs.Item(SplitDocInd).PathForDocumentsToBeSaved
fName = mSplitDocs.Item(SplitDocInd).fName
fPath = fDir & fName

' Save document to links directory
objWord.ActiveDocument.SaveAs FileName:=fPath, FileFormat:= _
wdFormatDocument, LockComments:=False, Password:="", AddToRecentFiles:= _
True, WritePassword:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts:= _
False, SaveNativePictureFormat:=False, SaveFormsData:=False, _
SaveAsAOCELetter:=False
' Close document
objWord.Documents.Close SaveChanges:=wdSaveChanges

' Go here if first bookmark not found
End If

' Continue loop
Next
' Get rid of Word objects
removeWord
' Normal exit
SplitDocument= True
Exit Function

Hi
Thanks for your inquiry. I think you can use the following code for extracting content between bookmarks.

public void Test165()
{
    // Open source document
    Document doc = new Document(@"Test165\in.doc");
    ExtractContent(doc, "start", "end", @"Test165\out.doc");
}
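/// <summary>
/// Method extracts content between bookmarks
/// </summary>
/// <param name="startBookmark">Start bookmark name</param>
/// <param name="endBookmark">End bookmark name</param>
/// <param name="outputFile">Filename of output file</param>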
private void ExtractContent(Document srcDoc, string startBookmark, string endBookmark, string outputFile)
{
    // Get start and end bookmarks from source document
    Bookmark start = srcDoc.Range.Bookmarks[startBookmark];
    Bookmark end = srcDoc.Range.Bookmarks[endBookmark];
    // If the start or end bookmark does not exist in the document then exit from the function
    if (start == null || end == null)
        return;
    // Get first Node in the selection
    Node startNode = start.BookmarkStart.ParentNode;
    while (startNode.ParentNode.NodeType != NodeType.Body)
        startNode = startNode.ParentNode;
    // Get last Node in the selection
    Node endNode = end.BookmarkStart.ParentNode;
    while (endNode.ParentNode.NodeType != NodeType.Body)
        endNode = startNode.ParentNode;
    // Create new document 
    Document dstDoc = new Document();
    Node currNode = startNode;
    // Copy content 
    while (!currNode.Equals(endNode))
    {
        Node dstNode = dstDoc.ImportNode(currNode, true, ImportFormatMode.KeepSourceFormatting);
        dstDoc.FirstSection.Body.AppendChild(dstNode);
        // If next node is null we should move to the next section
        if (currNode.NextSibling == null)
        {
            Section nextSection = (Section)currNode.GetAncestor(NodeType.Section).NextSibling;
            currNode = nextSection.Body.FirstChild;
        }
        else
        {
            // move to next node
            currNode = currNode.NextSibling;
        }
    }
    // Save output document
    dstDoc.Save(outputFile);
}

Hope this helps.
Best regards.

Hi Alexey,
Thanks for your response and the sample code. I have tried it and converted it to VB.NET. However, I am getting an endless loop in the following code:

'Get last Node in the selection 
Dim endNode As Node = [end].BookmarkStart.ParentNode
While endNode.ParentNode.NodeType <> NodeType.Body
endNode = startNode.ParentNode
End While

which is converted from the following C# code in your sample code:

// Get last Node in the selection
Node endNode = end.BookmarkStart.ParentNode;
while (endNode.ParentNode.NodeType != NodeType.Body)
endNode = startNode.ParentNode;

I think I have converted it correctly. Can you please let me know if the endless looping is due to the original code or due to the translation to VB.NET?
Thanks.

Hi again, (Please also see my previous post a few minutes ago)

Also, could you please tell me in words how the WHILE statement is working to find the start and end nodes? Thanks.

Hi again, (Please see my two previous posts a few minutes ago)

One issue I have is that the bookmark names are unique and not known to me in advance, so I cannot pass them to the Sub in the sample code. Is there any way to get a collection of all the bookmarks in the document, then loop through that collection and call your Sub with each pair of start and end bookmarks? Thanks.

Hi
Thanks for your inquiry. Here is VB.NET code:

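''' <summary>
''' Method extracts content between bookmarks
''' </summary>
''' <param name="startBookmark">Start bookmark name</param>
''' <param name="endBookmark">End bookmark name</param>
''' <param name="outputFile">Filename of output file</param>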
Sub ExtractContent(ByVal srcDoc As Document, ByVal startBookmark As String, ByVal endBookmark As String, ByVal outputFile As String)
'Get start and end bookmarks from source document
Dim startB As Bookmark = srcDoc.Range.Bookmarks(startBookmark)
Dim endB As Bookmark = srcDoc.Range.Bookmarks(endBookmark)
'If the start or end bookmark does not exist in the document then exit from the function
If (startB Is Nothing Or endB Is Nothing) Then
Exit Sub
End If
'Get first Node in the selection
Dim startNode As Node = startB.BookmarkStart.ParentNode
While (Not startNode.ParentNode.NodeType.Equals(NodeType.Body))
startNode = startNode.ParentNode
End While
'Get last Node in the selection
Dim endNode As Node = endB.BookmarkStart.ParentNode
While (Not endNode.ParentNode.NodeType.Equals(NodeType.Body))
endNode = startNode.ParentNode
End While
'Create new document 
Dim dstDoc As Document = New Document()
Dim currNode As Node = startNode
'Copy content 
While (Not currNode.Equals(endNode))
Dim dstNode As Node = dstDoc.ImportNode(currNode, True, ImportFormatMode.KeepSourceFormatting)
dstDoc.FirstSection.Body.AppendChild(dstNode)
'If next node is null we should move to the next section
If (currNode.NextSibling Is Nothing) Then
Dim nextSection As Section = CType(currNode.GetAncestor(NodeType.Section).NextSibling, Section)
currNode = nextSection.Body.FirstChild
Else
'move to next node
currNode = currNode.NextSibling
End If
End While
'Save output document
dstDoc.Save(outputFile)
End Sub

I use the paragraph or table where bookmark is inserted as a start/end node of the range.
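In words: each While loop starts at the node that contains the bookmark start and keeps stepping to ParentNode until it reaches a node whose parent is the Body, that is, the top-level paragraph or table holding the bookmark. For the end node, the loop has to step up from endNode itself (endNode = endNode.ParentNode). If it assigns startNode.ParentNode instead, endNode becomes the Body, the loop condition never changes, and the loop runs forever whenever the end bookmark is nested (for example inside a table cell). Below is a minimal sketch of the same walk as a reusable function; the module and function names (BodyWalk, GetBodyLevelAncestor) are hypothetical and not part of Aspose.Words.

```vbnet
Imports Aspose.Words

Module BodyWalk
    ' Hypothetical helper (not an Aspose.Words API): climbs from any node to the
    ' ancestor that is a direct child of a Body (a top-level paragraph or table).
    ' Returns Nothing when the node is not inside a Body at all, for example
    ' when it sits in a header or footer.
    Function GetBodyLevelAncestor(ByVal node As Node) As Node
        Dim current As Node = node
        ' Always step up from the node being tested, never from another variable,
        ' otherwise the loop condition can never change.
        While current.ParentNode IsNot Nothing AndAlso current.ParentNode.NodeType <> NodeType.Body
            current = current.ParentNode
        End While
        If current.ParentNode Is Nothing Then
            Return Nothing
        End If
        Return current
    End Function
End Module
```

With such a helper, the start and end nodes could be obtained as GetBodyLevelAncestor(startB.BookmarkStart) and GetBodyLevelAncestor(endB.BookmarkStart), with a check for Nothing before copying.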
Best regards.

You can use srcDoc.Range.Bookmarks to get collection of bookmarks in the document.
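For example, here is a minimal sketch of looping over that collection and collecting the bookmark names. The module and function names (BookmarkListing, ListChapterBookmarks) are hypothetical, and the "b" prefix in the usage note is only assumed from the naming in your VB6 code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Aspose.Words

Module BookmarkListing
    ' Hypothetical helper: returns the names of all bookmarks whose name starts
    ' with the given prefix, in the order the collection returns them.
    Function ListChapterBookmarks(ByVal doc As Document, ByVal prefix As String) As List(Of String)
        Dim names As New List(Of String)()
        For Each bk As Bookmark In doc.Range.Bookmarks
            If bk.Name.StartsWith(prefix, StringComparison.Ordinal) Then
                names.Add(bk.Name)
            End If
        Next
        Return names
    End Function
End Module
```

Calling ListChapterBookmarks(srcDoc, "b") would give the chapter bookmark names, which can then be passed in pairs to ExtractContent.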
Best regards.

Hi Alexey,
I have tried to use this code but have not been able to make it work. I am attaching a test file with 5 bookmarks in it. I have also been trying to create a self-contained Sub that uses While loops to go through the whole document and traverse the bookmarks. I originally started with the following code, which I got from your web site forum, but it has not been working either. Then I sent you a post, which you answered with a solution based on start and end nodes; that also did not work because of the endless loop it was falling into. My original code is as follows, and my test file is attached as well. Thanks.

Private Function SplitDoc(ByVal mRCVDFilePath As String, ByVal mSplitDocs As Collection) As Boolean
'Open master document
Dim master As Aspose.Words.Document
Dim doc As Aspose.Words.Document ': doc = null
Dim path As String : path = mRCVDFilePath 
Dim createNew As Boolean : createNew = False
Dim bookmarkName As String : bookmarkName = ""
Dim bookmarkNamePrev As String : bookmarkNamePrev = ""
Dim bookmark As Aspose.Words.Bookmark
Dim srcSection As Aspose.Words.Section
'Dim dstSection As Aspose.Words.Section
Dim fPath As String
Dim fDir As String
Dim fName As String
Dim SplitDocInd As Int16 : SplitDocInd = 0
Dim i As Boolean
'Loop through all section in the document
master = New Aspose.Words.Document("C:\inetpub\wwwroot\CNkk_master.doc")
'master = New Aspose.Words.Document(FileName:=mRCVDFilePath)
'Dim builder As DocumentBuilder = New DocumentBuilder(master)
'builder.MoveToBookmark("a100", True, True) '= True Then
'builder.Document.Range.Bookmarks.Item("a100")

master.TrackRevisions = False
For Each srcSection In master.Sections
'Check if section contains Bookmark with name Document_
'For Each bookmark In srcSection.Range.Bookmarks
For Each bookmark In master.Range.Bookmarks
If (bookmark.Name.StartsWith("b")) Then
bookmarkNamePrev = bookmarkName
bookmarkName = bookmark.Name
createNew = True
'Exit For
Else
createNew = False
End If

If (createNew) Then
'Save previous document
If Not (doc Is Nothing) Then
'If doc.Range.Bookmarks.Count > 0 Then
Try
doc.Range.Bookmarks(bookmarkNamePrev).Remove()
Catch
End Try
'End If
SplitDocInd = SplitDocInd + 1
fDir = mSplitDocs.Item(SplitDocInd).PathForDocumentsToBeSaved
fName = mSplitDocs.Item(SplitDocInd).fName
fPath = fDir & "KKAS_" & fName
doc.Save(fPath, SaveFormat.Doc)
End If
'Create new document
doc = New Aspose.Words.Document()
doc.FirstSection.Remove()
createNew = False
End If
If Not (doc Is Nothing) Then
'Append section - DOES NOT WORK - need to break the section by bookmark start and end
Dim dstSection As Node = doc.ImportNode(srcSection, True, ImportFormatMode.KeepSourceFormatting)
doc.Sections.Add(dstSection)
SplitDocInd = SplitDocInd + 1
If (srcSection.Equals(master.LastSection)) Then
'Save document if this is last section
Try
doc.Range.Bookmarks(bookmarkName).Remove()
Catch
End Try
'objDocAS.TrackRevisions = False
fDir = mSplitDocs.Item(SplitDocInd).PathForDocumentsToBeSaved
fName = mSplitDocs.Item(SplitDocInd).fName
fPath = fDir & "KKAS_" & fName
'doc.Save(fPath & Replace(Replace(Now().ToString, "/", ""), ":", "") & bookmarkName & ".doc")
doc.Save(fPath, SaveFormat.Doc)
End If
End If
Next
Next
End Function

Hi
Thank you for additional information. I modified my code and now it works fine. Please try using the following code:

Sub Main()
    Dim doc As Document = New Document("C:\Temp\CNkk_master.doc")
    SplitDocument(doc)
End Sub
Sub SplitDocument(ByVal srcDoc As Document)
    'Get collection of bookmarks
    Dim bkCollection As BookmarkCollection = srcDoc.Range.Bookmarks
    'Loop through all bookmarks and extract the content between each pair of consecutive bookmarks
    For bkIdx As Integer = 0 To bkCollection.Count - 1
        'Get start and end bookmarks from source document
        Dim startB As Bookmark = bkCollection(bkIdx)
        Dim endB As Bookmark = Nothing
        If (bkCollection.Count > bkIdx + 1) Then
            endB = bkCollection(bkIdx + 1)
        End If
        'Extract content between bookmarks
        ExtractContent(srcDoc, startB, endB, String.Format("C:\Temp\out_{0}.doc", bkIdx))
    Next
End Sub
''' <summary>
''' Method extracts content between bookmarks
''' </summary>
''' <param name="startB">Start bookmark</param>
''' <param name="endB">End bookmark</param>
''' <param name="outputFile">Filename of output file</param>
Sub ExtractContent(ByVal srcDoc As Document, ByVal startB As Bookmark, ByVal endB As Bookmark, ByVal outputFile As String)
    'If the start bookmark does not exist then exit from the function
    '(a missing end bookmark means "copy to the end of the document")
    If (startB Is Nothing) Then
        Exit Sub
    End If
    'Get first Node in the selection
    Dim startNode As Node = startB.BookmarkStart.ParentNode
    While (Not startNode.ParentNode.NodeType.Equals(NodeType.Body))
        startNode = startNode.ParentNode
    End While
    'Get last Node in the selection
    Dim endNode As Node
    If (endB Is Nothing) Then
        endNode = srcDoc.LastSection.Body.LastChild
    Else
        endNode = endB.BookmarkStart.ParentNode
    End If
    While (Not endNode.ParentNode.NodeType.Equals(NodeType.Body))
        endNode = endNode.ParentNode
    End While
    'Create new document
    Dim dstDoc As Document = CType(srcDoc.Clone(False), Document)
    'This is needed to import formatting of source document.
    dstDoc.Sections.Add(dstDoc.ImportNode(srcDoc.FirstSection, True, ImportFormatMode.KeepSourceFormatting))
    dstDoc.FirstSection.Body.RemoveAllChildren()
    Dim currNode As Node = startNode
    'Copy content (the end node itself is not copied)
    While (Not currNode.Equals(endNode))
        Dim dstNode As Node = dstDoc.ImportNode(currNode, True, ImportFormatMode.KeepSourceFormatting)
        dstDoc.FirstSection.Body.AppendChild(dstNode)
        'If next node is null we should move to the next section
        If (currNode.NextSibling Is Nothing) Then
            Dim nextSection As Section = CType(currNode.GetAncestor(NodeType.Section).NextSibling, Section)
            currNode = nextSection.Body.FirstChild
        Else
            'Move to next node
            currNode = currNode.NextSibling
        End If
    End While
    'Save output document
    dstDoc.Save(outputFile)
End Sub

I hope this could help.
Best regards.

Hi Alexey,
Thank you for your response and the new code. It is working much better now. I still have more unit testing to do; however, the first thing I have noticed is that I am losing the header and footer in the split documents. Would you please modify the code so that the header and footer (especially the header) are kept in the new document? Thanks.
P.S. Could you please respond so that I get it today. Thanks.

Hi
Thanks for your inquiry. The header and footer should be preserved. The following code creates a clone of the source document and removes only the body content, so the content of headers and footers should be preserved.

'Create new document 
Dim dstDoc As Document = CType(srcDoc.Clone(False), Document)
'This is needed to import formatting of source document.
dstDoc.Sections.Add(dstDoc.ImportNode(srcDoc.FirstSection, True, ImportFormatMode.KeepSourceFormatting))
dstDoc.FirstSection.Body.RemoveAllChildren()

Best regards.

Hi,
You are right. I ran just your code again, without my changes, and the output has the header and footer in it. What I had done was to insert the following code before saving the dstDoc file to remove the bookmarks, since the split files should not have any bookmarks. Do you know why my code removes the header and footer as well as the bookmark? Also, it seems that the last document loses one horizontal line before the last heavy horizontal line. This happens only in the last split document.

'Remove Bookmarks
Try
dstDoc.Range.Bookmarks(0).Remove()
Catch
End Try
'Save output document
dstDoc.Save(outputFile)

Hi
Thanks for your inquiry. I can’t reproduce the problem: the header is not removed from the output document. Regarding the last line, please add the following to the ExtractContent method.

If (endB Is Nothing) Then
endNode = srcDoc.LastSection.Body.LastChild
dstDoc.FirstSection.Body.AppendChild(dstDoc.ImportNode(endNode, True, ImportFormatMode.KeepSourceFormatting))
End If
'Remove Bookmarks
Try
dstDoc.Range.Bookmarks.Clear()
Catch
End Try
'Save output document
dstDoc.Save(outputFile)

Best regards.

Hi Alexey,

Thanks for your response. I have another issue: the hard page break that sat between two consecutive documents in the original source (master) document appears at the top of the split document. Could you please modify the code to remove this page break as well? I have attached a new document with page breaks inserted between consecutive documents. Thanks.

Hi, (please also see my previous post)

My previous post was about a test I had done with a document other than the test file I had sent you. I just ran a test with that file (uploaded with my last post), and it produced split files that have an extra blank page, with a page heading, at the end of each split file (except the last one). Could you please modify the code so that there is neither a page break before the start of the document nor an extra page at the end? Thank you very much for your help.