Extract Content between Pages

Hi

Thanks for your request. I think you should just use NodeImporter in this case to import the sections into the destination document:

Document doc = new Document("Document.docx");
// Set up the document that the pages will be copied to. Remove the empty section.
Document dstDoc = new Document();
dstDoc.RemoveAllChildren();

PageNumberFinder finder = new PageNumberFinder(doc);

// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);

// Copy all content including headers and footers from the specified pages into the destination document.
NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.UseDestinationStyles);
for (int page = 3; page <= 5; page++)
{
    List<Node> pageSections = finder.RetrieveAllNodesOnPage(page, true, NodeType.Section);
    foreach (Section section in pageSections)
    {
        dstDoc.AppendChild(importer.ImportNode(section, true));
    }
}

dstDoc.Save(dataDir + "Document Out.docx");

Best regards,

Can you provide me with any information as to when this feature will be supported? Is there a scheduled release date?

AndreyN:
Hi

Thanks for your request. A Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically there is no “Page” concept in a Word document.

Aspose.Words uses its own rendering engine to lay out documents into pages, and we plan to expose this layout information. Your request has been linked to the appropriate issue, and you will be notified as soon as this feature is supported.

As a workaround, you can try using the PageNumberFinder class suggested by Adam in this thread:

https://forum.aspose.com/t/58199

Best regards,

I was speaking in reference to the above post…

Hi

Thanks for your request. Unfortunately, the issue is not planned yet, so I cannot provide a reliable estimate for this feature. We will consider exposing the layout information of nodes in the future, but no timeframe is available yet.

Best regards,

The code provided seems to be working to my specifications, with just one flaw. The page numbers are being reset in the cloned documents. I need to retain the original page numbers. Is this possible?

Thank you so much for all of the assistance.

Hi there,

Thanks for your inquiry.

Could you please attach your input document and the code that reproduces the issue? I will take a closer look into this for you.

Thanks,

Here is the code and a sample:

public static Document ExtractContentBetweenPages(Document srcDoc, int fromPage, int toPage)
{
    // Set up the document that the pages will be copied to. Remove the empty section.
    Document dstDoc = new Document();
    dstDoc.RemoveAllChildren();
    PageNumberFinder finder = new PageNumberFinder(srcDoc);
    // Split nodes which are found across pages.
    finder.SplitNodesAcrossPages(true);
    // Copy all content including headers and footers from the specified pages into the destination document.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.UseDestinationStyles);
    for (int page = fromPage; page <= toPage; page++)
    {
        List<Node> pageSections = finder.RetrieveAllNodesOnPage(page, true, NodeType.Section);
        foreach (Section section in pageSections)
        {
            //dstDoc.AppendChild(section);
            dstDoc.AppendChild(importer.ImportNode(section, true));
        }
    }
    return dstDoc;
}

Thanks for this additional information.

This was a minor bug, which I have fixed. Please try downloading the class again.
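Separately from the fix in the class, here is a minimal sketch of one way to make an extracted document start its numbering at the original page. It assumes the page numbers come from PAGE fields and relies on the PageSetup.RestartPageNumbering and PageSetup.PageStartingNumber properties of the first section. The PageNumberingHelper class and its StartNumberingAt method are hypothetical names used only for illustration. They are not part of PageNumberFinder, and this is not necessarily what the fix does.

```csharp
using Aspose.Words;

// Hypothetical helper for illustration only; not part of PageNumberFinder.
public static class PageNumberingHelper
{
    // Makes the first section of the extracted document start numbering
    // at the given page number instead of 1.
    public static void StartNumberingAt(Document dstDoc, int firstPageNumber)
    {
        // Nothing was copied, so there is nothing to renumber.
        if (dstDoc.FirstSection == null)
            return;

        PageSetup pageSetup = dstDoc.FirstSection.PageSetup;
        pageSetup.RestartPageNumbering = true;
        pageSetup.PageStartingNumber = firstPageNumber;
    }
}
```

You could call PageNumberingHelper.StartNumberingAt(dstDoc, fromPage) just before the return statement in ExtractContentBetweenPages. Note that if a later copied section restarts page numbering itself, the numbers will restart from that section onward.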

Thanks,

Thank you for this, it seems to have fixed the problem!

I am seeing one other problem now, however. The code (as a whole) does not seem to work with DOCX files. I get the attached error when debugging.

Hi

Thanks for your request. Could you please attach a sample document that causes this problem?

Best regards,

A simple test document is attached. It produces the error that I included in the above post.

Hi there,

Thanks for your inquiry.

I can’t reproduce any problem on my side. Make sure that the values you pass to your method are within the valid page range (1 to 4 in the case of your document).
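As a rough sketch of such a check, the guard below compares the requested range against Document.PageCount before extracting. The PageRangeGuard class and its EnsureValid method are hypothetical names introduced here for illustration.

```csharp
using System;
using Aspose.Words;

// Hypothetical guard for illustration only.
public static class PageRangeGuard
{
    // Throws if the requested range does not fit inside the document.
    public static void EnsureValid(Document srcDoc, int fromPage, int toPage)
    {
        // PageCount reflects the page layout calculated by Aspose.Words.
        int pageCount = srcDoc.PageCount;

        if (fromPage < 1 || toPage > pageCount || fromPage > toPage)
        {
            throw new ArgumentOutOfRangeException(
                "fromPage",
                string.Format(
                    "Requested pages {0} to {1}, but the document has pages 1 to {2}.",
                    fromPage, toPage, pageCount));
        }
    }
}
```

Calling PageRangeGuard.EnsureValid(srcDoc, fromPage, toPage) at the top of ExtractContentBetweenPages would turn an out-of-range request into a clear error message.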

Thanks,

I am definitely using a valid page range. May I ask, are you running this code in a Windows Form or a Web form? Because it actually does work in a Windows form…however, I need it to work in a Web form. It doesn’t make any sense to me why the DOCX files don’t work in both…the code is identical. I must be missing SOMETHING.

Hi there,

Thanks for this additional information.

Could you please attach a quick sample application which reproduces the issue here? I will take a further look into this for you.

Thanks,

Attached is the web code and the test document

It seems you are using an older version of PageNumberFinder in your web application while you are using the newer one in your console application. Please make sure to use the new version found here: https://forum.aspose.com/t/77148

Thanks,

Thank you, that did help with the page numbers. However, now it seems to be ignoring the headers.

Attached is the current web code and test document

Hi there,

Thanks for this additional information.

This occurs because these headers are linked to the previous section. Since these sections are moved on their own to the new document, they no longer display the content from the previous section’s header. You will see that the footer is not linked, so it does not have this problem.

You can work around this by copying any linked headers and footers from the previous section.

You need to add the method below to the class and call it somewhere in the constructor:

/// <summary>
/// Creates a proper copy of any linked headers/footers in the sections of the document.
/// </summary>
private void CopyLinkedHeaderFooters()
{
    foreach (Section section in mOrigDoc)
    {
        if (section == mOrigDoc.FirstSection)
            continue;
        HeaderFooterCollection previousHeaderFooters = ((Section)section.PreviousSibling).HeadersFooters;

        foreach (HeaderFooter headerFooter in previousHeaderFooters)
        {
            if (section.HeadersFooters[headerFooter.HeaderFooterType] == null)
            {
                HeaderFooter newHeaderFooter = (HeaderFooter)previousHeaderFooters[headerFooter.HeaderFooterType].Clone(true);
                section.HeadersFooters.Add(newHeaderFooter);
            }
        }
    }
}

Thanks,

Thank you, this did get rid of the error. I finished the web application, but when I started testing on various documents, I began getting a different error.

I get “Cannot insert a node of this type at this location” on the following line:

Field fieldStart = builder.InsertField("PAGE", "1");

This happens no matter what pages I select to extract.

The problem is, due to the sensitive nature of the documents that I am testing, I cannot send you a sample.

Hi there,

Thanks for your inquiry.

I’m afraid that without the input document, it’s hard to know what the problem is. You can sanitize your document by replacing any confidential data with dummy data, but you need to make sure the issue is still reproducible with the modified document.

Thanks,