Free Support Forum - aspose.com

Extracting formatted content between two bookmarks and saving it in a new Word document

I have a large Word document with many chapters. Each chapter is surrounded by a start and an end bookmark. I need to break this large document up into many Word documents, each containing a separate chapter. The content has graphs, images, and Visio graphs. The text is formatted and in some cases colored. In other words, I need to preserve the original appearance of the content in each of the new documents. Please let me know how I can do this. Thanks.

Hi,

Thanks for your inquiry. Yes, of course you can achieve this using Aspose.Words. You can find code that deletes all content between bookmarks in the following thread

I think you can use the same technique to extract content between bookmarks.

Also, please send me your document for testing. If needed, I can create sample code for you.

Best regards.

Hi Alexey,

Thank you for your response. Could you please create sample code for me? Each chapter has a unique start and end bookmark. As I mentioned, I need to extract everything, including the images and the text formatting. I was concerned that the code for deleting the text between bookmarks would ignore the content and its formatting and simply select whatever lies between the bookmarks and erase it. That is why I thought to ask you for sample code. Also, I cannot send you the original document, and a test document without the original content would not do justice to its complexity. So I have to start with the code that we think is closest to what I need and do the testing here myself. Thank you for your help.

Hi,

Thanks for your request. I will provide you with a code example today. It would be great to get your document for testing. You can remove all confidential content from the document or replace it with dummy data.

Best regards.

Hi,

Please see the following link

I think the MergeDocuments and SplitDocuments methods could be useful in your case.

Best regards.

Hi Alexey,

Thank you for your response. I looked at the linked content, but it did not make things any easier. What I am trying to do is replace Word with Aspose.Words, and our current function to split the document is pasted below. Maybe the best approach is to translate this Microsoft Word API code to the Aspose.Words API. Can you please take a look at it and try to translate it? Please also note that Microsoft Word automatically inserts the reserved bookmarks "\StartOfDoc" and "\EndOfDoc", which we would not have in Aspose.Words. This code is in VB6, which we are also converting to VB.NET. Thanks.

Function SplitDocument()


' ******************************************
' SPLIT RCVD FILE
' ******************************************

' Get RCVD file name

returnCheck = getRCVDFileName

' Get RCVD file path

returnCheck = getRCVDFilePath
If returnCheck = False Then
SplitDocument= False
Exit Function
End If

' Load Word (hidden)

LoadWord

' Number of link documents

docCount = mSplitDocs.Count

' Loop through all link documents

For SplitDocInd = 1 To docCount

' Open RCVD File

Set objDoc = objWord.Documents.Open(FileName:=mRCVDFilePath)

' Turn off revision tracking (visible on screen)

objWord.ActiveDocument.TrackRevisions = False
objWord.ActiveDocument.ShowRevisions = False

' Get document ID for section to be split off

docID1 = mSplitDocs.Item(SplitDocInd).docID

' Get next document ID, if any

If SplitDocInd + 1 <= docCount Then
docID2 = mSplitDocs.Item(SplitDocInd + 1).docID
Else
docID2 = 0
End If

' Count bookmarks

bookMarkCount = objWord.ActiveDocument.Bookmarks.Count

' First bookmark to find

firstbookmarkName = "b" & docID1

' Second bookmark to find

If docID2 > 0 Then
secondbookmarkName = "b" & docID2
End If

' See if first bookmark exists (at beginning of section)

If objWord.ActiveDocument.Bookmarks.Exists(firstbookmarkName) Then

' Get bookmark index (beginning of section)

Set firstbookmark = objWord.ActiveDocument.Bookmarks.Item(firstbookmarkName)
firstbookmark.Select
firstbookmarkIndex = objWord.Selection.BookmarkID

' Set range up to first bookmark

Set startRange = objWord.ActiveDocument.Range(objWord.ActiveDocument.Bookmarks("\StartOfDoc").Range.Start, objWord.ActiveDocument.Bookmarks(firstbookmark).Range.End)
If debugOn Then
Message = "Set start range in CreateLinkDocs."
userWarning
End If

' See if next bookmark exists

If SplitDocInd < docCount And _
objWord.ActiveDocument.Bookmarks.Exists(secondbookmarkName) Then

' Move selection to end of previous page

Set secondbookmark = objWord.ActiveDocument.Bookmarks.Item(secondbookmarkName)
secondbookmark.Select
objWord.Selection.GoTo What:=wdGoToLine, Which:=wdGoToPrevious, Count:=1, Name:=""

' Set range to end of document

Set endRange = objWord.ActiveDocument.Range(objWord.Selection.Range.Start, objWord.ActiveDocument.Bookmarks("\EndOfDoc").Range.End)

' Delete end range

endRange.Select
endRange.Delete
End If

' Delete beginning of document

startRange.Select
startRange.Delete


' Set markup properties

With objWord.Options
.InsertedTextMark = wdInsertedTextMarkUnderline
.InsertedTextColor = wdByAuthor
.DeletedTextMark = wdDeletedTextMarkStrikeThrough
.DeletedTextColor = wdByAuthor
.RevisedPropertiesMark = wdRevisedPropertiesMarkNone
.RevisedPropertiesColor = wdAuto
.RevisedLinesMark = wdRevisedLinesMarkOutsideBorder
.RevisedLinesColor = wdAuto
End With


' Turn on change tracking

objWord.ActiveDocument.TrackRevisions = True
objWord.ActiveDocument.ShowRevisions = True

' Get Link doc file path

fDir = mSplitDocs.Item(SplitDocInd).PathForDocumentsToBeSaved
fName = mSplitDocs.Item(SplitDocInd).fName
fPath = fDir & fName


' Save document to links directory

objWord.ActiveDocument.SaveAs FileName:=fPath, FileFormat:= _
wdFormatDocument, LockComments:=False, Password:="", AddToRecentFiles:= _
True, WritePassword:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts:= _
False, SaveNativePictureFormat:=False, SaveFormsData:=False, _
SaveAsAOCELetter:=False

' Close document

objWord.Documents.Close SaveChanges:=wdSaveChanges

' Go here if first bookmark not found

End If

' Continue loop

Next

' Get rid of Word objects

removeWord

' Normal exit

SplitDocument= True
Exit Function

End Function

Hi,

Thanks for your inquiry. I think you can use the following code for extracting content between bookmarks.

public void Test165()
{
    // Open the source document
    Document doc = new Document(@"Test165\in.doc");

    ExtractContent(doc, "start", "end", @"Test165\out.doc");
}

/// <summary>
/// Extracts the content between two bookmarks.
/// </summary>
/// <param name="srcDoc">Source document</param>
/// <param name="startBookmark">Start bookmark name</param>
/// <param name="endBookmark">End bookmark name</param>
/// <param name="outputFile">Filename of output file</param>
private void ExtractContent(Document srcDoc, string startBookmark, string endBookmark, string outputFile)
{
    // Get the start and end bookmarks from the source document
    Bookmark start = srcDoc.Range.Bookmarks[startBookmark];
    Bookmark end = srcDoc.Range.Bookmarks[endBookmark];

    // If the start or end bookmark does not exist in the document, exit
    if (start == null || end == null)
        return;

    // Get the first node in the selection: climb until the parent is the Body
    Node startNode = start.BookmarkStart.ParentNode;
    while (startNode.ParentNode.NodeType != NodeType.Body)
        startNode = startNode.ParentNode;

    // Get the last node in the selection
    Node endNode = end.BookmarkStart.ParentNode;
    while (endNode.ParentNode.NodeType != NodeType.Body)
        // Note: as posted, this line assigns from startNode, so the loop never
        // terminates once it is entered (for example when the end bookmark sits
        // inside a table). It should read: endNode = endNode.ParentNode;
        endNode = startNode.ParentNode;

    // Create a new document
    Document dstDoc = new Document();
    Node currNode = startNode;

    // Copy content up to, but not including, endNode
    while (!currNode.Equals(endNode))
    {
        Node dstNode = dstDoc.ImportNode(currNode, true, ImportFormatMode.KeepSourceFormatting);
        dstDoc.FirstSection.Body.AppendChild(dstNode);

        if (currNode.NextSibling == null)
        {
            // No next sibling: move to the first node of the next section
            Section nextSection = (Section)currNode.GetAncestor(NodeType.Section).NextSibling;
            currNode = nextSection.Body.FirstChild;
        }
        else
        {
            // Move to the next node
            currNode = currNode.NextSibling;
        }
    }

    // Save the output document
    dstDoc.Save(outputFile);
}

Hope this helps.

Best regards.

Hi Alexey,

Thanks for your response and the sample code. I have tried it and converted it to VB.NET. However, I am getting an endless loop in the following code:

'Get last Node in the selection
Dim endNode As Node = [end].BookmarkStart.ParentNode
While endNode.ParentNode.NodeType <> NodeType.Body
    endNode = startNode.ParentNode
End While

which is converted from the following C# code in your sample code:

//Get last Node in the selection
Node endNode = end.BookmarkStart.ParentNode;
while (endNode.ParentNode.NodeType != NodeType.Body)
    endNode = startNode.ParentNode;

I think I have converted it correctly. Can you please let me know if the endless looping is due to the original code or due to the translation to VB.NET?

Thanks.

Hi again, (Please also see my previous post a few minutes ago)

Also, could you please tell me in words how the WHILE statement is working to find the start and end nodes? Thanks.

Hi again, (Please see my two previous posts a few minutes ago)

One issue I have is that the bookmark names are unique and not known to me in advance, so I cannot pass them to the Sub in the sample code. Is there any way to get a collection of all the bookmarks in the document, loop through it, and call your Sub with each pair of start and end bookmarks? Thanks.

Hi,

Thanks for your inquiry. Here is VB.NET code:

''' <summary>
''' Extracts the content between two bookmarks.
''' </summary>
''' <param name="srcDoc">Source document</param>
''' <param name="startBookmark">Start bookmark name</param>
''' <param name="endBookmark">End bookmark name</param>
''' <param name="outputFile">Filename of output file</param>
Sub ExtractContent(ByVal srcDoc As Document, ByVal startBookmark As String, ByVal endBookmark As String, ByVal outputFile As String)
    'Get the start and end bookmarks from the source document
    Dim startB As Bookmark = srcDoc.Range.Bookmarks(startBookmark)
    Dim endB As Bookmark = srcDoc.Range.Bookmarks(endBookmark)

    'If the start or end bookmark does not exist in the document, exit
    If (startB Is Nothing OrElse endB Is Nothing) Then
        Exit Sub
    End If

    'Get the first node in the selection: climb until the parent is the Body
    Dim startNode As Node = startB.BookmarkStart.ParentNode
    While (Not startNode.ParentNode.NodeType.Equals(NodeType.Body))
        startNode = startNode.ParentNode
    End While

    'Get the last node in the selection
    Dim endNode As Node = endB.BookmarkStart.ParentNode
    While (Not endNode.ParentNode.NodeType.Equals(NodeType.Body))
        'Note: as posted, this line still assigns from startNode, so the loop never
        'terminates once it is entered. It should read: endNode = endNode.ParentNode
        endNode = startNode.ParentNode
    End While

    'Create a new document
    Dim dstDoc As Document = New Document()
    Dim currNode As Node = startNode

    'Copy content up to, but not including, endNode
    While (Not currNode.Equals(endNode))
        Dim dstNode As Node = dstDoc.ImportNode(currNode, True, ImportFormatMode.KeepSourceFormatting)
        dstDoc.FirstSection.Body.AppendChild(dstNode)

        If (currNode.NextSibling Is Nothing) Then
            'No next sibling: move to the first node of the next section
            Dim nextSection As Section = CType(currNode.GetAncestor(NodeType.Section).NextSibling, Section)
            currNode = nextSection.Body.FirstChild
        Else
            'Move to the next node
            currNode = currNode.NextSibling
        End If
    End While

    'Save the output document
    dstDoc.Save(outputFile)
End Sub

I use the paragraph or table where the bookmark is inserted as the start/end node of the range.
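To put the While loops into words: starting from the node that holds the bookmark, the code keeps stepping up to the parent until the parent is the Body, so it ends on the paragraph or table that sits directly under the Body. The following is only a sketch of that idea in plain Python, not Aspose.Words code; the Node class and the body_level_ancestor helper are made up for illustration.

```python
class Node:
    """Toy stand-in for a document node (hypothetical, not the Aspose.Words Node class)."""

    def __init__(self, kind, parent=None):
        self.kind = kind          # e.g. "Body", "Table", "Paragraph"
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def body_level_ancestor(node):
    """Climb until the node's parent is the Body and return that node."""
    while node.parent is not None and node.parent.kind != "Body":
        # Reassign from the node being climbed. Assigning from a different
        # variable here is what makes the loop spin forever.
        node = node.parent
    return node


# A bookmark inside a table cell: Body > Table > Row > Cell > Paragraph
body = Node("Body")
table = Node("Table", body)
row = Node("Row", table)
cell = Node("Cell", row)
para = Node("Paragraph", cell)

print(body_level_ancestor(para).kind)  # Table
```

For a paragraph that already sits directly under the Body the loop body never runs, which may be why the faulty assignment only shows up when a bookmark is nested more deeply.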

Best regards.

You can use srcDoc.Range.Bookmarks to get the collection of bookmarks in the document.
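Once the bookmark collection is in hand, one way to drive the extraction is to pair each bookmark with the one that follows it and let the last bookmark run to the end of the document. The sketch below shows only that pairing step in plain Python. It assumes the collection yields bookmarks in document order, and the names "b100", "b200" and "b300" are invented for the example.

```python
from typing import List, Optional, Tuple


def bookmark_pairs(names: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each bookmark with the next one; the last one is paired with None."""
    pairs = []
    for i, start in enumerate(names):
        # None means "no following bookmark": extract to the end of the document
        end = names[i + 1] if i + 1 < len(names) else None
        pairs.append((start, end))
    return pairs


print(bookmark_pairs(["b100", "b200", "b300"]))
# [('b100', 'b200'), ('b200', 'b300'), ('b300', None)]
```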

Best regards.

Hi Alexey,

I have tried to use this code but have not been able to make it work. I am attaching a test file with 5 bookmarks in it. Also, I have been trying to create a self-contained Sub that uses While loops to go through the whole document and traverse the bookmarks. I originally started with the following code, which I got from your forum, but it has not been working either. Then I sent you a post, and you responded with a solution based on start and end nodes, which also did not work because it fell into an endless loop. My original code is as follows, and my test file is attached as well. Thanks.

Private Function SplitDoc(ByVal mRCVDFilePath As String, ByVal mSplitDocs As Collection) As Boolean

'Open master document

Dim master As Aspose.Words.Document
Dim doc As Aspose.Words.Document ': doc = null
Dim path As String : path = mRCVDFilePath
Dim createNew As Boolean : createNew = False
Dim bookmarkName As String : bookmarkName = ""
Dim bookmarkNamePrev As String : bookmarkNamePrev = ""
Dim bookmark As Aspose.Words.Bookmark
Dim srcSection As Aspose.Words.Section
'Dim dstSection As Aspose.Words.Section
Dim fPath As String
Dim fDir As String
Dim fName As String
Dim SplitDocInd As Int16 : SplitDocInd = 0
Dim i As Boolean

'Loop through all section in the document
master = New Aspose.Words.Document("C:\inetpub\wwwroot\CNkk_master.doc")

'master = New Aspose.Words.Document(FileName:=mRCVDFilePath)
'Dim builder As DocumentBuilder = New DocumentBuilder(master)
'builder.MoveToBookmark("a100", True, True) '= True Then
'builder.Document.Range.Bookmarks.Item("a100")


master.TrackRevisions = False

For Each srcSection In master.Sections

'Check if section contains Bookmark with name Document_
'For Each bookmark In srcSection.Range.Bookmarks
For Each bookmark In master.Range.Bookmarks
If (bookmark.Name.StartsWith("b")) Then
bookmarkNamePrev = bookmarkName
bookmarkName = bookmark.Name
createNew = True
'Exit For
Else
createNew = False
End If


If (createNew) Then
'Save previous document
If Not (doc Is Nothing) Then
'If doc.Range.Bookmarks.Count > 0 Then
Try
doc.Range.Bookmarks(bookmarkNamePrev).Remove()
Catch
End Try
'End If
SplitDocInd = SplitDocInd + 1
fDir = mSplitDocs.Item(SplitDocInd).PathForDocumentsToBeSaved
fName = mSplitDocs.Item(SplitDocInd).fName
fPath = fDir & "KKAS_" & fName
doc.Save(fPath, SaveFormat.Doc)
End If
'Create new document
doc = New Aspose.Words.Document()
doc.FirstSection.Remove()

createNew = False

End If

If Not (doc Is Nothing) Then

'Append section - DOES NOT WORK - need to break the section by bookmark start and end
Dim dstSection As Node = doc.ImportNode(srcSection, True, ImportFormatMode.KeepSourceFormatting)

doc.Sections.Add(dstSection)
SplitDocInd = SplitDocInd + 1

If (srcSection.Equals(master.LastSection)) Then
'Save document if this is last section
Try
doc.Range.Bookmarks(bookmarkName).Remove()
Catch
End Try
'objDocAS.TrackRevisions = False
fDir = mSplitDocs.Item(SplitDocInd).PathForDocumentsToBeSaved
fName = mSplitDocs.Item(SplitDocInd).fName
fPath = fDir & "KKAS_" & fName

'doc.Save(fPath & Replace(Replace(Now().ToString, "/", ""), ":", "") & bookmarkName & ".doc")
doc.Save(fPath, SaveFormat.Doc)
End If

End If

Next

Next

End Function

Hi,

Thank you for additional information. I modified my code and now it works fine. Please try using the following code:

Sub Main()
    Dim doc As Document = New Document("C:\Temp\CNkk_master.doc")
    SplitDocument(doc)
End Sub

Sub SplitDocument(ByVal srcDoc As Document)
    'Get the collection of bookmarks
    Dim bkCollection As BookmarkCollection = srcDoc.Range.Bookmarks

    'Loop through all bookmarks and extract the content between each consecutive pair
    For bkIdx As Integer = 0 To bkCollection.Count - 1
        'Get the start and end bookmarks from the source document
        Dim startB As Bookmark = bkCollection(bkIdx)
        Dim endB As Bookmark = Nothing
        If (bkCollection.Count > bkIdx + 1) Then
            endB = bkCollection(bkIdx + 1)
        End If

        'Extract the content between the bookmarks
        ExtractContent(srcDoc, startB, endB, String.Format("C:\Temp\out_{0}.doc", bkIdx))
    Next
End Sub

''' <summary>
''' Extracts the content between two bookmarks.
''' </summary>
''' <param name="srcDoc">Source document</param>
''' <param name="startB">Start bookmark</param>
''' <param name="endB">End bookmark</param>
''' <param name="outputFile">Filename of output file</param>
Sub ExtractContent(ByVal srcDoc As Document, ByVal startB As Bookmark, ByVal endB As Bookmark, ByVal outputFile As String)
    'If the start bookmark does not exist in the document, exit
    If (startB Is Nothing) Then
        Exit Sub
    End If

    'Get the first node in the selection: climb until the parent is the Body
    Dim startNode As Node = startB.BookmarkStart.ParentNode
    While (Not startNode.ParentNode.NodeType.Equals(NodeType.Body))
        startNode = startNode.ParentNode
    End While

    'Get the last node in the selection; without an end bookmark, use the last node of the document
    Dim endNode As Node
    If (endB Is Nothing) Then
        endNode = srcDoc.LastSection.Body.LastChild
    Else
        endNode = endB.BookmarkStart.ParentNode
    End If
    While (Not endNode.ParentNode.NodeType.Equals(NodeType.Body))
        endNode = endNode.ParentNode
    End While

    'Create a new document
    Dim dstDoc As Document = CType(srcDoc.Clone(False), Document)

    'This is needed to import the formatting of the source document
    dstDoc.Sections.Add(dstDoc.ImportNode(srcDoc.FirstSection, True, ImportFormatMode.KeepSourceFormatting))
    dstDoc.FirstSection.Body.RemoveAllChildren()

    Dim currNode As Node = startNode

    'Copy content up to, but not including, endNode
    While (Not currNode.Equals(endNode))
        Dim dstNode As Node = dstDoc.ImportNode(currNode, True, ImportFormatMode.KeepSourceFormatting)
        dstDoc.FirstSection.Body.AppendChild(dstNode)

        If (currNode.NextSibling Is Nothing) Then
            'No next sibling: move to the first node of the next section
            Dim nextSection As Section = CType(currNode.GetAncestor(NodeType.Section).NextSibling, Section)
            currNode = nextSection.Body.FirstChild
        Else
            'Move to the next node
            currNode = currNode.NextSibling
        End If
    End While

    'Save the output document
    dstDoc.Save(outputFile)
End Sub

I hope this could help.

Best regards.

Hi Alexey,

Thank you for your response and the new code. It is working much better now, thank you. I still have more unit testing to do; however, the first thing I have noticed so far is that I am losing the header and footer in the split document. Would you please modify the code so that the header and footer (especially the header) are kept in the new document? Thanks.

P.S. Could you please respond so that I get it today. Thanks.

Hi,

Thanks for your inquiry. The header and footer should be preserved. The following code creates a clone of the source document and removes only the body content, so the content of the headers and footers should be kept.

'Create a new document
Dim dstDoc As Document = CType(srcDoc.Clone(False), Document)

'This is needed to import the formatting of the source document
dstDoc.Sections.Add(dstDoc.ImportNode(srcDoc.FirstSection, True, ImportFormatMode.KeepSourceFormatting))
dstDoc.FirstSection.Body.RemoveAllChildren()
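The idea in the snippet above is to start from a copy of the source section that still carries its header and footer and to empty only its body before the extracted nodes are added. As a rough illustration in plain Python (a dictionary stands in for a section here; this is not how Aspose.Words stores sections, and make_shell is a made-up helper):

```python
import copy


def make_shell(section):
    """Return a copy of the section with its body emptied; everything else is kept."""
    shell = copy.deepcopy(section)
    shell["body"].clear()
    return shell


source = {
    "header": "Chapter report",
    "footer": "Page X of Y",
    "body": ["chapter 1 text", "chapter 2 text"],
}

shell = make_shell(source)
shell["body"].append("chapter 2 text")  # only the extracted range is added back
print(shell)
# {'header': 'Chapter report', 'footer': 'Page X of Y', 'body': ['chapter 2 text']}
```

The source section is left untouched because the shell is a deep copy.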

Best regards.

Hi,

You are right. I ran just your code again, without my changes, and the output has the header and footer in it. What I had done was insert the following code before saving the dstDoc file to remove the bookmarks, since the split files should not contain any bookmarks. Do you know why my code removes the header and footer as well as the bookmark? Also, it seems that the last document loses one horizontal line before the last heavy horizontal line. This happens only in the last split document.

'Remove Bookmarks
Try
    dstDoc.Range.Bookmarks(0).Remove()
Catch
End Try

'Save output document
dstDoc.Save(outputFile)

Hi,

Thanks for your inquiry. I can't reproduce the problem: the header is not removed from the output document. Regarding the last line, please add the following to the ExtractContent method.

If (endB Is Nothing) Then
    'The copy loop stops before endNode, so append the last node of the document explicitly
    endNode = srcDoc.LastSection.Body.LastChild
    dstDoc.FirstSection.Body.AppendChild(dstDoc.ImportNode(endNode, True, ImportFormatMode.KeepSourceFormatting))
End If

'Remove Bookmarks
Try
    dstDoc.Range.Bookmarks.Clear()
Catch
End Try

'Save output document
dstDoc.Save(outputFile)
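The missing last line appears to follow from the copy loop stopping before the end node: that is right when the end node is the next chapter's bookmark paragraph, but when there is no end bookmark the end node is the document's last node, and it has to be added separately. The sketch below illustrates this with plain Python lists standing in for section bodies. The node names are invented and assumed to be unique, and walk_until and extract are hypothetical helpers.

```python
def walk_until(sections, start, stop):
    """Copy nodes from start up to, but not including, stop (a half-open range)."""
    # Flattening the bodies plays the role of stepping into the next section
    flat = [node for body in sections for node in body]
    return flat[flat.index(start):flat.index(stop)]


def extract(sections, start, end=None):
    """Extract a chapter; without an end bookmark, include the document's last node."""
    if end is not None:
        return walk_until(sections, start, end)
    last = sections[-1][-1]
    return walk_until(sections, start, last) + [last]


sections = [["p1", "p2"], ["p3", "rule", "heavy rule"]]
print(extract(sections, "p1", "p3"))  # ['p1', 'p2']
print(extract(sections, "p3"))        # ['p3', 'rule', 'heavy rule']
```

Without the extra append, the second call would return only ['p3', 'rule'], which matches the symptom of the last split document losing its final line.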

Best regards.

Hi Alexey,

Thanks for your response. I have another issue: the hard page break that sat between two consecutive documents in the original source (master) document now appears at the top of the split document. Could you please modify the code to remove this page break as well? I have attached a new document with page breaks inserted between consecutive documents. Thanks.

Hi, (please also see my previous post)

My previous post was about a test I had done with a document other than the test file I had sent you. I have just run a test with that file (uploaded with my last post), and it produced split files that each have an extra blank page with a page heading at the end (except the last one). Could you please modify the code so that there is no page break before the start of the document and also no extra page at the end of the document? Thank you very much for your help.