Extract Content between Pages

I see there is a way using Aspose.Words to extract content from Word documents based on paragraph styles, bookmarks, etc. Is there a way to extract content based on page number? For example, what if I wanted to extract the content from pages 1-3?

Thank you!

Alicia Gontarek

Hi

Thanks for your request. A Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically there is no “Page” concept in a Word document.

Aspose.Words uses its own rendering engine to lay out documents into pages, and we plan to expose this layout information. Your request has been linked to the appropriate issue. You will be notified as soon as this feature is supported.

In the meantime, as a workaround you can try using the PageNumberFinder class suggested by Adam in this thread:

https://forum.aspose.com/t/58199

Best regards,

Thank you, but PageNumberFinder does not seem to be a recognized class…should I be adding another reference?

Hello

Thanks for your request. You can download this class here:

https://forum.aspose.com/t/58199

Best regards,

Thank you!

I am getting the following error:

Error 1 Cannot implicitly convert type ‘Aspose.Words.Fields.FieldStart’ to ‘Aspose.Words.Fields.Field’ PageFinder.cs 93 36 ParsingTest

Field fieldStart = builder.InsertField("PAGE", "1");
// Repeat for the end of the node as some nodes can span over more than one page.
builder.MoveTo(endNode);
builder.Font.Hidden = true;
Field fieldEnd = builder.InsertField("PAGE", "1");
// Store these fields in a pair along with the node they represent.
fieldList.Add(node, new FieldPair(fieldStart, fieldEnd));

Hi

Thanks for your request. The problem occurs because you are using an old version of Aspose.Words. Please try the latest version, which you can download here:
https://releases.aspose.com/words/net/

Best regards,

Thank you. This has been helpful. I am still having a problem, though: when I try to import the nodes returned by the PageNumberFinder method RetrieveAllNodesOnPage, if a node is anything other than a paragraph (for example, a Table), I get a “cannot insert node of this type at this location” error.

What I really need is to be able to specify pages within a given document and have everything from those pages (headers, footers, and content) cloned into a new document. Is this possible?

Here is a sample of what I have now (this page has a table on it, and it returns an error):

Document doc = new Document(@"c:\test.doc");
PageNumberFinder pageFinder = new PageNumberFinder(doc);
Document dstDoc = new Document();
NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KeepSourceFormatting);

//just extract page 4
for (int pageNum = 4; pageNum <= 4; pageNum++)
{
    List<Node> pageNodes = pageFinder.RetrieveAllNodesOnPage(pageNum, true);
    if (pageNodes != null)
        foreach (Node paragraph in pageNodes)
        {
            Node importNode = importer.ImportNode(paragraph, true);
            dstDoc.FirstSection.Body.AppendChild(importNode);
        }
}

Hi there,

Thanks for your inquiry.

Yes, this is possible. You just need to copy entire sections to a new document instead. Please try the code below.

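// Assumes doc is the source document loaded earlier.
PageNumberFinder finder = new PageNumberFinder(doc);

// Destination document; remove the default empty section.
Document pageDoc = new Document();
pageDoc.RemoveAllChildren();

NodeImporter importer = new NodeImporter(doc, pageDoc, ImportFormatMode.KeepSourceFormatting);

// Import every section found on pages 2-3 and append it to the new document.
foreach (Section section in finder.RetrieveAllNodesOnPage(2, 3, NodeType.Section))
    pageDoc.AppendChild(importer.ImportNode(section, true));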

Thanks,

Thank you! This looks much better, but I am still having some trouble getting the parsing to look the way it should. For example, depending on which pages I choose to extract, sometimes I end up with a blank page at the beginning and the end of the document, or the Headings don’t copy over, or the entire document is copied rather than just the pages I specified. I have attached a test template that I have been working with. If you try extracting pages 5-7, you will see an example.

Thank you for your help!!!

Alicia

Hi Alicia

Thanks for your inquiry.

I’m afraid I can’t reproduce the issue on my side. The generated document looks identical after the code is run. I have attached the output to this post.

Thanks,

Would this have anything to do with the version of Word I am using? I am using Word 2007.

Try extracting pages 3-4. That somehow produces 5 pages.

Hi

Thanks for your request. Which version of Aspose.Words do you use for testing? Maybe the problem occurs because you are using an old version of Aspose.Words.

Best regards,

Hi Alicia,

I suppose this is happening for the reason stated in my last post: the table on that page spans multiple pages.

I think I have a nice solution to this, I will provide you with some code within a day.

Thanks,

Thanks! I am using the version that I downloaded with the link provided:
https://releases.aspose.com/words/net/

Is this the correct version?

Hello,

Thank you for additional information.
Yes, the link is right. The latest version of our product is 9.8.0.0.
Please wait a little longer, Adam will give you the code.

Any luck with this?

Thanks for your help!

Hi there,

Thanks for your inquiry.

I’m afraid this is still a work in progress. I haven’t had time to finish coding it yet. I will look into it over the weekend. I appreciate your patience.

Thanks,

Hi Alicia,

Thanks for waiting.

Please find attached an updated version of the PageNumberFinder class. This update includes a new method, SplitNodesAcrossPages, that you can use to extract pages into a separate document properly.

You can use code like the example below to extract pages to an external document. The SplitNodesAcrossPages method splits any section whose content spans multiple pages into separate sections, one per page. You can then extract a page by taking its section and inserting it into a new document.

Document doc = new Document("Document.docx");
// Set up the document which pages will be copied to. Remove the empty section.
Document dstDoc = new Document();
dstDoc.RemoveAllChildren();

PageNumberFinder finder = new PageNumberFinder(doc);

// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);

// Copy all content including headers and footers from the specified pages into the destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPage(3, 5, NodeType.Section);

foreach (Section section in pageSections)
    dstDoc.AppendChild(section);

dstDoc.Save(dataDir + "Document Out.docx");

If you have any issues, please attach your document here for testing.

Thanks,

Thank you, Adam.

I copied your code and replaced the old PageFinder.cs with the one you supplied. When I run the code, however, I am getting the following error: “The newChild was created from a different document than the one that created this node.” on this line:

dstDoc.AppendChild(section)

Have I forgotten a step?
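
For reference, Aspose.Words raises this exception when a node that belongs to one document is appended to another without first being imported. Below is a minimal sketch of one likely fix, reusing the NodeImporter/ImportNode approach from earlier in this thread. It assumes the updated PageNumberFinder class attached above (it is not part of Aspose.Words itself) with the constructor and methods shown in Adam's post; the file names are placeholders.

```csharp
using System.Collections;
using Aspose.Words;

class ExtractPages
{
    static void Main()
    {
        // Placeholder input path.
        Document doc = new Document("Document.docx");

        // Destination document; remove the default empty section.
        Document dstDoc = new Document();
        dstDoc.RemoveAllChildren();

        // PageNumberFinder is the helper class attached in this thread (assumed API).
        PageNumberFinder finder = new PageNumberFinder(doc);
        finder.SplitNodesAcrossPages(true);

        ArrayList pageSections = finder.RetrieveAllNodesOnPage(3, 5, NodeType.Section);

        foreach (Section section in pageSections)
        {
            // Import the section into dstDoc first, so the appended node
            // is owned by the destination document rather than the source.
            Node imported = dstDoc.ImportNode(section, true, ImportFormatMode.KeepSourceFormatting);
            dstDoc.AppendChild(imported);
        }

        // Placeholder output path.
        dstDoc.Save("Document Out.docx");
    }
}
```

Document.ImportNode is used here for brevity; a NodeImporter constructed once outside the loop, as in the earlier reply, should work the same way and may be preferable when importing many nodes with shared styles.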