I see there is a way using Aspose.Words to extract content from Word documents based on paragraph styles, bookmarks, etc. Is there a way to extract content based on page number? For example, if I wanted to extract the content from pages 1-3?
Thanks for your request. A Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically, there is no “Page” concept in a Word document.
Aspose.Words uses our own rendering engine to lay out documents into pages, and we plan to expose layout information. Your request has been linked to the appropriate issue. You will be notified as soon as this feature is supported.
Also, as a workaround, you can try using the PageNumberFinder class suggested by Adam in this thread:
Error 1 Cannot implicitly convert type ‘Aspose.Words.Fields.FieldStart’ to ‘Aspose.Words.Fields.Field’ PageFinder.cs 93 36 ParsingTest
Field fieldStart = builder.InsertField("PAGE", "1");
// Repeat for the end of the node as some nodes can span over more than one page.
builder.MoveTo(endNode);
builder.Font.Hidden = true;
Field fieldEnd = builder.InsertField("PAGE", "1");
// Store these fields in a pair along with the node they represent.
fieldList.Add(node, new FieldPair(fieldStart, fieldEnd));
Thanks for your request. The problem occurs because you are using an old version of Aspose.Words. Please try using the latest version of Aspose.Words. You can download it from here: https://releases.aspose.com/words/net/
Thank you. This has been helpful. I am still having a problem: when trying to import the nodes found using the pagefinder RetrieveAllNodesOnPage, if the node is anything other than a paragraph (e.g. a Table), I get a “cannot insert node of this type at this location” error.
What I really need to do is specify pages within a given document and have everything from those pages (headers, footers and content) cloned into a new document. Is this possible?
Here is a sample of what I have now (this page has a table on it, and it returns an error):
Document doc = new Document(@"c:\test.doc");
PageNumberFinder pageFinder = new PageNumberFinder(doc);
Document dstDoc = new Document();
NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KeepSourceFormatting);
// Just extract page 4.
for (int pageNum = 4; pageNum <= 4; pageNum++)
{
    List<Node> pageNodes = pageFinder.RetrieveAllNodesOnPage(pageNum, true);
    if (pageNodes != null)
    {
        foreach (Node paragraph in pageNodes)
        {
            Node importNode = importer.ImportNode(paragraph, true);
            dstDoc.FirstSection.Body.AppendChild(importNode);
        }
    }
}
Thank you! This looks much better, but I am still having some trouble getting the parsing to look the way it should. For example, depending on which pages I choose to extract, sometimes I end up with a blank page at the beginning and the end of the document, or the Headings don’t copy over, or the entire document is copied rather than just the pages I specified. I have attached a test template that I have been working with. If you try extracting pages 5-7, you will see an example.
I’m afraid I can’t reproduce the issue on my side; the generated document looks identical after the code is run. I have attached the output to this post.
Thanks for your request. Which version of Aspose.Words do you use for testing? Maybe the problem occurs because you are using an old version of Aspose.Words.
Thank you for additional information.
Yes, the link is right. The latest version of our product is 9.8.0.0.
Please wait a little longer, Adam will give you the code.
I’m afraid this is still a work in progress. I haven’t had time to finish coding it yet. I will look into doing this over the weekend. I appreciate your patience.
Please find attached an updated version of the PageNumberFinder class. This update includes a new method, SplitNodesAcrossPages, that you can use to extract pages into a separate document properly.
You can use code like the sample below to extract pages to an external document. The SplitNodesAcrossPages method splits any section whose content spans multiple pages into separate sections, one per page. You can then extract each page by extracting its section and inserting it into a new document.
Document doc = new Document("Document.docx");
// Set up the document that the pages will be copied to. Remove the empty section.
Document dstDoc = new Document();
dstDoc.RemoveAllChildren();
PageNumberFinder finder = new PageNumberFinder(doc);
// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);
// Copy all content including headers and footers from the specified pages into the destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPage(3, 5, NodeType.Section);
foreach (Section section in pageSections)
    dstDoc.AppendChild(section);
dstDoc.Save(dataDir + "Document Out.docx");
If you have any issues, please attach your document here for testing.
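One thing to watch for with the loop above: in Aspose.Words a node belongs to the document that created it, so appending a section taken from the source document directly to dstDoc can fail with a "created from a different document" error. A minimal sketch of one way around this, assuming the sections come from the PageNumberFinder call shown above, is to import each section into the destination first. The class and method names here (PageExtractSketch, CopySections) are hypothetical and only for illustration; ImportNode and AppendChild are the standard Aspose.Words calls.

```csharp
using System.Collections;
using Aspose.Words;

// Hypothetical helper for illustration; not part of Aspose.Words or PageNumberFinder.
static class PageExtractSketch
{
    // Copies the given source sections into a new, initially empty document.
    public static Document CopySections(IEnumerable sections)
    {
        Document dstDoc = new Document();
        // Remove the default empty section so only the copied pages remain.
        dstDoc.RemoveAllChildren();

        foreach (Section section in sections)
        {
            // Import creates a copy of the section that is owned by dstDoc.
            Node imported = dstDoc.ImportNode(section, true, ImportFormatMode.KeepSourceFormatting);
            dstDoc.AppendChild(imported);
        }

        return dstDoc;
    }
}
```

With this in place, the last two steps of the sample would become something like PageExtractSketch.CopySections(pageSections).Save(dataDir + "Document Out.docx"). This has not been tested against the attached PageNumberFinder class, so treat it as a starting point.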
I copied your code and replaced the old PageFinder.cs with the one you supplied. When I run the code, however, I am getting the following error: “The newChild was created from a different document than the one that created this node.” on this line: