GetChildNodes() Node page numbers? (2 part question)

This is a two part question:

First,

When calling doc.GetChildNodes(NodeType.HeaderFooter, true), is there a way of finding out what page (or section) number a node in the returned NodeCollection belongs to? I have not been able to find documentation regarding this.

Secondly,
I am trying to extract the headers from a MS Word document. It contains a total of 5 headers in 9 sections (Several of the sections contain only footers, which we do not care about).

When I call doc.GetChildNodes(NodeType.HeaderFooter, true) it returns 3 nodes.

The first node is the header in section 1 (on page 1).

The second node is the header in section 2 (page 3).

The third node is the footer in section 2 (on page 3).

The method appears to return the first 3 nodes correctly, but it only returns 3 nodes. It does not return a node for any of the other headers (or footers) in the file. Did I miss something about the method call? Shouldn’t it be returning all of the HeaderFooter objects? or is there a way to extract the additional headers?

Hi John,
Thanks for your inquiry.
Sure, you can find the index of the section which a header footer belongs to by using the code below.
NodeCollection headersFooters = doc.GetChildNodes(NodeType.HeaderFooter, true);
int sectionIndex = doc.IndexOf(((HeaderFooter) headersFooters[0]).ParentSection);

Regarding the missing headers and footers, this is most likely the correct behavior. In the Aspose.Words model, when a section does not have a header or footer of a given type, that header or footer is linked to the previous section. If it were not linked, the section would have a HeaderFooter of that type, but it would be empty (no content).
If you want to work with the headers and footers of each section on their own, without worrying about some being linked, you can use the code below to copy linked headers and footers into each section. After running this code, every section will have its own headers and footers instead of links to previous sections.

private void CopyLinkedHeaderFooters(Document doc)
{
    foreach(Section section in doc.Sections)
    {
        if (section != doc.FirstSection)
        {
            Section previousSection = (Section) section.PreviousSibling;
            HeaderFooterCollection previousHeaderFooters = previousSection.HeadersFooters;
            if (!section.PageSetup.RestartPageNumbering)
            {
                section.PageSetup.RestartPageNumbering = true;
                section.PageSetup.PageStartingNumber = previousSection.PageSetup.PageStartingNumber + mFinder.PageSpan(previousSection) + 1;
            }
            foreach(HeaderFooter headerFooter in previousHeaderFooters)
            {
                if (section.HeadersFooters[headerFooter.HeaderFooterType] == null)
                {
                    HeaderFooter newHeaderFooter = (HeaderFooter) previousHeaderFooters[headerFooter.HeaderFooterType].Clone(true);
                    section.HeadersFooters.Add(newHeaderFooter);
                    newHeaderFooter.Range.UpdateFields();
                }
            }
        }
    }
}
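
To make the linking rule above concrete, here is a minimal sketch that models it without Aspose.Words. It is an illustration only: the class and method names are hypothetical, each section is reduced to a single header string, and null stands for "this section has no header of that type, so it is linked to the previous section".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model, not part of the Aspose.Words API.
public static class LinkedHeaderModel
{
    // ownHeaders[i] is the header text stored in section i, or null when the
    // section stores no header of that type (linked to the previous section).
    // An empty string means "unlinked, but with no content".
    public static List<string> ResolveEffectiveHeaders(IReadOnlyList<string> ownHeaders)
    {
        var effective = new List<string>(ownHeaders.Count);
        string inherited = null; // nothing to inherit before the first section
        foreach (string own in ownHeaders)
        {
            if (own != null)
                inherited = own; // this section overrides; later sections inherit it
            effective.Add(inherited);
        }
        return effective;
    }
}
```

Under this model, a document with many sections but only a few stored headers yields only a few HeaderFooter nodes, which is consistent with GetChildNodes returning fewer nodes than there are sections.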

Thanks,

Adam,

Thanks for the reply. The code that you posted makes sense, but you included a “mFinder” object that is not declared. Did you leave something out, or can you elaborate on what it is?

- Cheers,

Jbourds

Hi John,
Thanks for this additional information.
You’re right, there is some code missing here. This is because this code is copied and pasted from an existing function. I have filled in the missing code, so things should work correctly on your side now.
Please see the changes below.

public static void CopyLinkedHeadersFooters(Document doc)
{
    Dictionary <Section, int> pages = GetNumberOfPagesInSections(doc);
    foreach(Section section in doc.Sections)
    {
        if (section != doc.FirstSection)
        {
            Section previousSection = (Section) section.PreviousSibling;
            HeaderFooterCollection previousHeaderFooters = previousSection.HeadersFooters;
            if (!section.PageSetup.RestartPageNumbering)
            {
                section.PageSetup.RestartPageNumbering = true;
                section.PageSetup.PageStartingNumber = previousSection.PageSetup.PageStartingNumber + pages[previousSection];
            }
            foreach(HeaderFooter headerFooter in previousHeaderFooters)
            {
                if (section.HeadersFooters[headerFooter.HeaderFooterType] == null)
                {
                    HeaderFooter newHeaderFooter = (HeaderFooter) previousHeaderFooters[headerFooter.HeaderFooterType].Clone(true);
                    section.HeadersFooters.Add(newHeaderFooter);
                    newHeaderFooter.Range.UpdateFields();
                }
            }
        }
    }
}
public static Dictionary <Section, int> GetNumberOfPagesInSections(Document doc)
{
    DocumentBuilder builder = new DocumentBuilder(doc);
    Dictionary <Section, Field> sectionFields = new Dictionary <Section, Field> ();
    Dictionary <Section, int> sectionPages = new Dictionary <Section, int> ();
    foreach(Section section in doc.Sections)
    {
        // We have to find out how many pages there are in the section and what pages they are in the overall document.
        builder.MoveTo(section.Body.FirstParagraph);
        sectionFields.Add(section, builder.InsertField("SECTIONPAGES", null));
    }
    doc.UpdatePageLayout();
    doc.UpdateFields();
    foreach(KeyValuePair <Section, Field> pair in sectionFields)
    {
        sectionPages.Add(pair.Key, int.Parse(pair.Value.Result));
        pair.Value.Remove();
    }
    return sectionPages;
}

Thanks,

Adam,

Going back to the other part of my first question:

Is it possible to extract a page number from a word document?

I have not seen any documentation yet that specifies whether a page value can be extracted from a document given a particular object (HeaderFooter, Section, NodeCollection). For instance, I am trying to find the page that a header (or section) is located on. Is it possible, and if so, is there documentation on it?

Thank you very much for your help. I was able to extract a section number from a document for a given HeaderFooter Node. Your code on linked nodes was helpful.

You guys have been great with support!

Best,

- Jbourds

Hi John,

Thanks for your inquiry.

I’m afraid there is currently no way to access the page layout information of a node. I have linked your request to the appropriate issue. We will inform you as soon as it’s resolved. We have this feature on a roadmap, hopefully it will be implemented sometime this year.

In the meantime, as a workaround, you can still find the page number of a node by using the PAGE field. With this field you can “query” the layout engine for the page number where the field is located. This is the same technique I used in the previous code snippet with the SECTIONPAGES field.

public static int GetPageNumberOfSection(Section section)
{
    DocumentBuilder builder = new DocumentBuilder((Document)section.Document);
    builder.MoveTo(section.Body.FirstParagraph);
    Field pageField = builder.InsertField("PAGE");
    int pageNum = int.Parse(pageField.Result);
    pageField.Remove();
    return pageNum;
}

Thanks,

I’m having an issue with GetPageNumberOfSection. I’m not sure if the issue has to do with the documents that I’m testing, how I’m retrieving the document sections, or something else.

It looks like pageField.Result is returning an empty string (""), and because of that I’m getting an error when the GetPageNumberOfSection method tries to convert it to an int. I’m attaching the Word doc.

For this project I’m getting a node array of sections, then passing them to GetPageNumberOfSection as seen below. Should I be doing this another way? What is the best practice?

Node[] sectionNodeArray = doc.GetChildNodes(NodeType.Section, true).ToArray();
Section sectionNode = (Section)sectionNodeArray[index] ; 
int pageNumberOfSection = GetPageNumberOfSection(sectionNode);

Hi there,

Thanks for your inquiry.

You may need to insert the following two calls before retrieving the page number:

doc.UpdatePageLayout();
doc.UpdateFields();

Please let us know if that solves the issue.

Thanks,

I’ve added the two doc.Update methods on the line before I call GetPageNumberOfSection(). It did not appear to change anything. The pageField.Result call still returns an empty string and I still get an “Input string was not in a correct format” error when the string is converted to an int on the line:

int pageNum = int.Parse(pageField.Result); 

Is there anything else I may be missing before the GetPageNumberOfSection call, or anything I may need to add within the method?
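
Independent of why the field result is empty, the exception itself comes from int.Parse, which throws a FormatException for an empty or non-numeric string. A minimal sketch of a defensive parse is shown here; the helper name TryGetPageNumber is hypothetical and not part of Aspose.Words, and it only guards the conversion rather than fixing the empty result.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the Aspose.Words API.
public static class PageFieldParsing
{
    // Returns true only when fieldResult holds a positive integer.
    // Returns false for null, empty, or non-numeric text instead of throwing.
    public static bool TryGetPageNumber(string fieldResult, out int pageNumber)
    {
        return int.TryParse(fieldResult, NumberStyles.Integer,
                            CultureInfo.InvariantCulture, out pageNumber)
               && pageNumber > 0;
    }
}
```

A caller could pass pageField.Result to this helper and log or skip the section when it returns false, which makes an un-updated field easier to spot than an unhandled exception.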

Hey guys,

Have you thought of anything else that I may try? I really appreciate your help with this.

Cheers,

- Jbourds

I have attached the entire test class that I’m using for development. Please let me know if you think I should change something around. We are near the end of this project and are looking forward to completing it if we can retrieve the page number.

Thanks again,

- Jbourds

Hi John,
Thanks for this additional information.
Sorry, I wasn’t 100% clear, what I meant was to add that call inside the GetPageNumberOfSection method like below:

public static int GetPageNumberOfSection(Section section)
{
    Document doc = (Document)section.Document;
    DocumentBuilder builder = new DocumentBuilder(doc);
    builder.MoveTo(section.Body.FirstParagraph);
    // This overload inserts the field with a placeholder result and does not update it,
    // so pagination is not performed twice; the real value is computed by the calls below.
    Field pageField = builder.InsertField("PAGE", "1");
    doc.UpdatePageLayout();
    doc.UpdateFields();
    int pageNum = int.Parse(pageField.Result);
    pageField.Remove();
    return pageNum;
}

However as you may note it is quite slow to update page layout for each field inserted in the document. If this is a problem, then I suggest inserting all fields at once for each section and doing a single update.
The code to achieve this would be pretty much identical to the GetNumberOfPagesInSections with the only difference being that the PAGE field should be inserted instead of a SECTIONPAGES field.
Thanks,

It appears that I have successfully implemented the following functionality with your code and the Aspose API:

- Get the number of pages in each section

- Get the page number of the start of each section

- Return a value in a HeaderFooter node if its type matches HeaderFooterType.HeaderFirst

I’m running into two issues now on a Word document containing 37 pages (attached). The first is that the API is telling me that it contains 98 pages. The number of sections returned in my code (43) matches what I see in MS Word, but the page count seems way off. It may be worth noting that in MS Word it appears there are multiple sections per page, that sections are being skipped, or that sections are not visible. In Word I can see that section 5 appears to start on page 7, and on the next page (page 8) section 8 starts.

I am trying to map the print command (in the header) at the start of each section to each page in that section. So If a section starts on page 16 and contains 9 pages, I’m retrieving the print command from the header on page 16 and storing it in an array. Each entry in the array corresponds to a page value (array index 16 maps to page 16, index 24 maps to page 24).
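
The mapping described above can be sketched as a small array fill. This is an illustration under assumptions, not the code from the attached class: the names PrintCommandMap and AssignToPages are hypothetical, and index 0 of the array is left unused so that index N corresponds to page N.

```csharp
using System;

// Hypothetical helper illustrating the page-to-print-command mapping.
public static class PrintCommandMap
{
    // Writes command into pageCommands[startPage .. startPage + pageCount - 1].
    // The range is clamped to the array length so an over-long section cannot overflow.
    public static void AssignToPages(string[] pageCommands, int startPage, int pageCount, string command)
    {
        if (startPage < 1) throw new ArgumentOutOfRangeException(nameof(startPage));
        int endExclusive = Math.Min(startPage + pageCount, pageCommands.Length);
        for (int page = startPage; page < endExclusive; page++)
            pageCommands[page] = command;
    }
}
```

For the example in the paragraph above, a section that starts on page 16 and contains 9 pages fills indexes 16 through 24 and leaves the neighbouring entries untouched.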

Do you have any idea what’s going on with the page count? I’m counting the pages in each section using a Dictionary object, as you’ll see in my attached code. The code works fine (returns the correct number of pages in the document) on another test file, but not on this one.

The second issue that I’m having is that not all pages are being accounted for. I’m getting the starting page of each section and then finding out how many pages are in each section. Using this method I should be able to map the print command from the header at the beginning of each section to each page in that section. In the attached Word doc several pages (8 through 15) are unaccounted for. As I debug through the code, there are multiple sections (containing 1 page) on page 7, and then the next page in the following section after page 7 is page 16. It seems as though pages 8 through 15 are not in a section. I’m running into this issue on another test file as well.

I have uploaded the entire class that contains my testing methods. The method that does most of the work is well commented. If you have any questions please feel free to let me know.

Hi John,
Thanks for this additional information.
I managed to reproduce the issue with the missing pages. In some cases this isn’t a bug: an even-page or odd-page section break can skip a page, and since that page doesn’t appear in the document at all, it is effectively skipped.
I rewrote some of the code so the missing pages are included. At the same time I removed a lot of unnecessary field updates. Previously you were updating the page layout many times, which made the whole process very slow. It is much faster now. Also, with the reworked code you only need one set of PAGE fields, and not the SECTIONPAGES field, to find all the information you need. The number of pages in a section is calculated entirely from the PAGE fields.
Regarding your other issues, after this rework I was unable to spot any problems so hopefully they are fixed. Please integrate the changes to the code found below into your application and try again. Let me know if any results are incorrect now.

private static void SaveDocument(Document doc, string outputFile)
{
    using (var mem = new MemoryStream())
    {
        Dictionary<Section, int> sectionStartPageList = GetPageStartOfSections(doc);
        CopyLinkedHeadersFooters(doc, sectionStartPageList);
        doc.Save(mem, SaveFormat.Pdf);
        mem.Position = 0;
        var fileInfo = new PdfFileInfo(mem);
        string previousHeaderInfo = "";
        string currentHeaderInfo = "";
        int iCurrentPageNumber = 0;
        string emptyHeaderInfo = "";
        int startingPageNumberOfThisSection = 1;
        int numberOfPagesInThisSection = 1;
        int sectionCounter = 1; // debugging only
        // Gets a total page count for the document by adding the starting page number
        // of the last section and the number of pages in that section
        int diffTotalPages = doc.PageCount;
        // Array to store print commands. One print command per page.
        // If there isn't a new print command (in the header of the current section)
        // use the previous page's print command.
        string[] headerInfoArray = new string[diffTotalPages + 1];
        foreach (Section currentSection in doc.Sections)
        {
            numberOfPagesInThisSection = GetNumberOfPagesInSection(currentSection, sectionStartPageList);
            startingPageNumberOfThisSection = sectionStartPageList[currentSection]; // Aspose method. Not correct on unlinked objects
            currentHeaderInfo = GetHeaderInfo(currentSection); // Returns PRINT command or an empty string ""
            // See if the current header is equal to an empty string (i.e. no print command was found).
            int stringCompareValue = String.Compare(currentHeaderInfo, emptyHeaderInfo);
            // If the current header is empty (matches a blank string), use previous header info
            if (stringCompareValue == 0)
                currentHeaderInfo = previousHeaderInfo;
            // Add header info to the array starting at the index startingPageNumberOfThisSection
            // and continue adding for numberOfPagesInThisSection pages
            headerInfoArray = AddHeaderInfoToArray(headerInfoArray, numberOfPagesInThisSection, startingPageNumberOfThisSection, currentHeaderInfo);
            previousHeaderInfo = currentHeaderInfo;
            sectionCounter++; //debug only
        } // for each
        // Multiple pages were skipped
        // Created a method that replaced skipped pages
        headerInfoArray = FixNullHeaders(headerInfoArray);
        fileInfo.SaveNewInfo(outputFile);
    }
}
private static int GetNumberOfPagesInSection(Section currentSection, Dictionary<Section, int> sectionStartPageList)
{
    Document doc = (Document)currentSection.Document;
    int endPage = (currentSection == doc.LastSection) ? doc.PageCount + 1 : sectionStartPageList[(Section)currentSection.NextSibling];
    return endPage - sectionStartPageList[currentSection];
}
// Aspose code
public static void CopyLinkedHeadersFooters(Document doc, Dictionary<Section, int> pages)
{
    foreach (Section section in doc.Sections)
    {
        if (section != doc.FirstSection)
        {
            Section previousSection = (Section)section.PreviousSibling;
            HeaderFooterCollection previousHeaderFooters = previousSection.HeadersFooters;
            if (!section.PageSetup.RestartPageNumbering)
            {
                section.PageSetup.RestartPageNumbering = true;
                section.PageSetup.PageStartingNumber = previousSection.PageSetup.PageStartingNumber + GetNumberOfPagesInSection(previousSection, pages);
            }
            foreach (HeaderFooter headerFooter in previousHeaderFooters)
            {
                if (section.HeadersFooters[headerFooter.HeaderFooterType] == null)
                {
                    HeaderFooter newHeaderFooter = (HeaderFooter)previousHeaderFooters[headerFooter.HeaderFooterType].Clone(true);
                    section.HeadersFooters.Add(newHeaderFooter);
                }
            }
        }
    }
}
// Aspose method for getting the starting page number of each section
public static Dictionary<Section, int> GetPageStartOfSections(Document doc)
{
    DocumentBuilder builder = new DocumentBuilder(doc);
    Dictionary<Section, Field> sectionStartFields = new Dictionary<Section, Field>();
    Dictionary<Section, int> sectionStartPages = new Dictionary<Section, int>();
    foreach (Section section in doc.Sections)
    {
        // We have to find out how many pages there are in the currentSection and what pages they are in the overall document.
        Paragraph firstPara = section.Body.FirstParagraph;
        if (firstPara.HasChildNodes)
            builder.MoveTo(firstPara.FirstChild);
        else
            builder.MoveTo(firstPara);
        sectionStartFields.Add(section, builder.InsertField(@"PAGE \* Arabic", "1"));
    }
    doc.UpdatePageLayout();
    doc.UpdateFields();
    foreach (KeyValuePair<Section, Field> pair in sectionStartFields)
    {
        sectionStartPages.Add(pair.Key, int.Parse(pair.Value.Result));
        pair.Value.Remove();
    }
    return sectionStartPages;
}
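
The arithmetic behind GetNumberOfPagesInSection can be checked in isolation. The following is a sketch with made-up numbers, not data from the attached documents, and the names SectionPageMath and PagesPerSection are hypothetical: each section is treated as ending where the next one starts, and the last section as ending after the final page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that mirrors the start-page subtraction used above.
public static class SectionPageMath
{
    // startPages[i] is the 1-based page on which section i starts, in document order.
    // totalPages is the page count of the whole document.
    public static int[] PagesPerSection(IReadOnlyList<int> startPages, int totalPages)
    {
        var counts = new int[startPages.Count];
        for (int i = 0; i < startPages.Count; i++)
        {
            // A section ends where the next one starts; the last ends after the final page.
            int endExclusive = (i == startPages.Count - 1) ? totalPages + 1 : startPages[i + 1];
            counts[i] = endExclusive - startPages[i];
        }
        return counts;
    }
}
```

With start pages 1, 3, 7, 7, 16 and 20 pages in total, this gives counts of 2, 4, 0, 9 and 5, which sum to 20. Two things follow from the subtraction: a section that starts on the same page as the next one gets a count of 0, and any page between two start pages is attributed to the earlier section, so pages between 7 and 16 are no longer left unassigned.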

Thanks,

Adam,

Thank you for getting back to me. The updated code appears to do the trick. The only thing that I’ve noticed is on two of the test documents that end in blank pages. It seems that Aspose does not count those two pages in the page count.

The first: “1P00332633HAB Ren 0028 20120208_124.doc” shows 37 pages in MS Word, and only a page count of 35 using your API.

The second: “1P16188455HAB New 0001 20110607_126.doc” shows 27 pages in MS Word, and only a page count of 25 using your API.

Is this part of the API’s functionality, or is this something that you have not encountered before? I have attached the files mentioned above.

Thank you so much for your great support.

Cheers,

jBourds.

Hi John,

It’s great to hear things are working properly with the new code snippet. Regarding your last two issues, I can reproduce them on my side with your documents, but not with a simplified test document.

Furthermore if I try to resave your document using Microsoft Word it crashes. What program did you use to create these documents? Were they post processed by any tool at any time? Also does resaving using Microsoft Word on your side cause the same issue?

With this information I will take a closer look into the issue. Also it would be great if you could supply the password to turn off document protection.

Thanks,

Adam,
It seems that I am taking a huge performance hit on the doc.update methods. Everything else is running well. Is there another way to accomplish what we’re trying to do in this method, or a way to do it faster?

// Aspose method for getting the starting page number of each section
public static Dictionary <Section, int> GetPageStartOfSections(Document doc)
{
    DocumentBuilder builder = new DocumentBuilder(doc);
    Dictionary <Section, Field> sectionStartFields = new Dictionary <Section, Field> ();
    Dictionary <Section, int> sectionStartPages = new Dictionary <Section, int> ();
    foreach(Section section in doc.Sections)
    {
        // We have to find out how many pages there are in the currentSection and what pages they are in the overall document.
        Paragraph firstPara = section.Body.FirstParagraph;
        if (firstPara.HasChildNodes)
            builder.MoveTo(firstPara.FirstChild);
        else
            builder.MoveTo(firstPara);
        sectionStartFields.Add(section, builder.InsertField(@"PAGE \* Arabic", "1"));
    }
    doc.UpdatePageLayout(); //REALLY SLOW
    doc.UpdateFields(); //REALLY SLOW
    foreach(KeyValuePair <Section, Field> pair in sectionStartFields)
    {
        sectionStartPages.Add(pair.Key, int.Parse(pair.Value.Result));
        pair.Value.Remove();
    }
    return sectionStartPages;
}

Hi John,

Thanks for your inquiry.

I’m afraid not. The time spent in those methods is the time Aspose.Words needs to build the document layout in memory. If you remove any calls to the code above and simply call doc.UpdatePageLayout and doc.UpdateFields on the original document, it should take about the same amount of time as your custom code. Please let me know if this is not the case, as there may be some problem.

There is currently no way to reduce this time; Aspose.Words already renders such documents very fast.

Thanks,

Adam,
First off, thank you for all the great support.
It seems that we have run into an issue with pulling data from 2 word documents (out of about 50 test files so far). We are receiving the following error in your GetPageStartOfSections method when doc.UpdateFields() is called:
“Operation is not valid due to the current state of the object.”
Again, on all the other test files this does not happen, but on these two it’s an issue. I’m attaching the two test files and our source code. Please note that the class also references APIs from Tracker Software, but the error comes well before we get to their piece.
Thanks again,
jBourds


Hi there,

Thanks for your inquiry.

I’m afraid I couldn’t reproduce any issue using your attached code and documents when using Aspose.Words 11.2.0 or Aspose.Words 11.3.0. Could you please retest your application on either of those versions and see if you can reproduce the issue again?

Thanks,