Get Section Page Numbers from Word DOCX Documents using Aspose.Words for Java | Restart Page Numbering | Page Starting Number

PROCUREMENT2 · February 11, 2021, 12:44pm

Language: Java 11
Version: Aspose.Words 20.12
Problem: We are trying to find a way to determine section page numbers with Aspose.Words.

Explanation:
We calculate the page count of each section (see "Get Total Number of Pages in any Section of Word DOCX Document using Java | Page Count | SECTIONPAGES Field") and then use the PageSetup properties restartPageNumbering and pageStartingNumber to work out the correct section page number.
This approach runs into some problems.

Examples:

1.
We have 2 sections with 2 pages each, and the second section restarts page numbering at 1.
It means that:

Absolute page number |  Section page number
           1                    1
           2                    2
           3                    1
           4                    2

This one is easy to calculate. Let’s continue with another example.
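
The naive calculation for this first case can be sketched roughly as below. This is only an illustration, assuming every section starts on a new page; the class and method names are hypothetical, and the arrays stand in for values read from each section's page count and PageSetup (restartPageNumbering, pageStartingNumber).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NaiveSectionNumbering {
    /**
     * Maps absolute page number -> section page number.
     * Assumption: every section starts on a new page, so page counts simply add up.
     * Hypothetical helper, not part of Aspose.Words.
     */
    static Map<Integer, Integer> number(int[] pageCounts, boolean[] restart, int[] startNumbers) {
        Map<Integer, Integer> result = new LinkedHashMap<>();
        int absolute = 0;   // absolute page counter across the whole document
        int displayed = 0;  // last section page number handed out
        for (int s = 0; s < pageCounts.length; s++) {
            if (restart[s]) {
                // The section restarts numbering at its configured starting number.
                displayed = startNumbers[s] - 1;
            }
            for (int p = 0; p < pageCounts[s]; p++) {
                result.put(++absolute, ++displayed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two sections of two pages each; the second one restarts at 1.
        System.out.println(number(new int[] {2, 2}, new boolean[] {false, true}, new int[] {1, 1}));
    }
}
```

For the two-section document above this prints {1=1, 2=2, 3=1, 4=2}.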

2.
The document has 3 pages, with a continuous section break on the 2nd page; the 2nd section restarts page numbering at 1.

Absolute page number   |    Section page number
           1                         1
           2                         2
           3                         2

Each section contains 2 pages (4 in total), but the actual document has only 3 pages. Our implementation is therefore incorrect and would have to be adjusted for the section break type, which adds even more checks and node traversal.
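
One possible adjustment is sketched below. It rests on two assumptions taken from the table above: the page shared by both sections is counted once, and it keeps the number of the section it started in, while the new section's own counter already begins on that shared page. All names are hypothetical and the arrays stand in for per-section values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ContinuousSectionNumbering {
    /**
     * Maps absolute page number -> section page number, treating the first page
     * of a continuous section as shared with the previous section.
     * Hypothetical helper, not part of Aspose.Words.
     */
    static Map<Integer, Integer> number(int[] pageCounts, boolean[] continuous,
                                        boolean[] restart, int[] startNumbers) {
        Map<Integer, Integer> result = new LinkedHashMap<>();
        int absolute = 0;   // absolute page counter
        int displayed = 0;  // current section page number
        for (int s = 0; s < pageCounts.length; s++) {
            for (int p = 0; p < pageCounts[s]; p++) {
                boolean sharedPage = p == 0 && s > 0 && continuous[s];
                if (sharedPage) {
                    // Already recorded by the previous section; only the counter changes.
                    if (restart[s]) {
                        displayed = startNumbers[s];
                    }
                    continue;
                }
                if (p == 0 && restart[s]) {
                    displayed = startNumbers[s];
                } else {
                    displayed++;
                }
                result.put(++absolute, displayed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Second example: two sections of two pages, continuous break, restart at 1.
        System.out.println(number(new int[] {2, 2}, new boolean[] {false, true},
                new boolean[] {false, true}, new int[] {1, 1}));
    }
}
```

For the second example this prints {1=1, 2=2, 3=2}, which matches the table.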

3.
The document has 2 pages, with an odd page section break on the first page; the 2nd section continues page numbering from the previous section.
MS Word shows only 2 pages, but the second page is numbered 3. Word creates a hidden page that is not shown in the application.

Absolute page number   |    Section page number
           1                         1
           3                         3

Aspose returns a page count of 1 for each section, which gives 2 pages in total, while the document consists of 3.
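
The skipped number in this third case follows from a simple parity rule, sketched below. The enum and method are hypothetical stand-ins for the section start type, not Aspose.Words API, and the sketch only covers sections that continue numbering from the previous one.

```java
class SectionStartParity {
    // Hypothetical stand-in for the section start type of a section.
    enum Start { NEW_PAGE, ODD_PAGE, EVEN_PAGE }

    /**
     * Returns the number of the first page of a section that continues
     * numbering after a page numbered previousNumber.
     */
    static int firstPageNumber(int previousNumber, Start start) {
        int next = previousNumber + 1;
        boolean nextIsOdd = next % 2 != 0;
        if (start == Start.ODD_PAGE && !nextIsOdd) {
            return next + 1; // the even number belongs to the hidden page
        }
        if (start == Start.EVEN_PAGE && nextIsOdd) {
            return next + 1; // the odd number belongs to the hidden page
        }
        return next;
    }

    public static void main(String[] args) {
        // Third example: page 1, then an odd page section break.
        System.out.println(firstPageNumber(1, Start.ODD_PAGE));
    }
}
```

For the third example this prints 3: number 2 goes to the hidden page, so the visible pages are numbered 1 and 3.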

Adding all three documents in a zip: docx_examples.zip (45.7 KB)

Could you suggest how we can determine section page numbers in a correct and efficient way?

awais.hafeez · February 11, 2021, 3:05pm

@PROCUREMENT2,

You can build logic on the following code of Aspose.Words for Java to get the desired output:

String path = "C:\\Temp\\docx_examples\\";
String pattern = "*.doc?";

String[] fileNames = GetFiles(path, pattern);
for (String fileName : fileNames) {
    Document doc = new Document(path + fileName);
    System.out.println(doc.getOriginalFileName());

    for (int i = 0; i < doc.getPageCount(); i++) {
        Document pageDoc = doc.extractPages(i, 1);

        int pageFieldCounter = 0;
        for (Field field : pageDoc.getRange().getFields())
            if (field.getType() == FieldType.FIELD_PAGE)
                pageFieldCounter++;

        if (pageFieldCounter > 0) {

            DocumentBuilder builder = new DocumentBuilder(pageDoc);

            FieldPage pageField = (FieldPage) builder.insertField(FieldType.FIELD_PAGE, true);
            System.out.println(pageField.getResult());
        }
    }
    System.out.println("================");
}

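// Returns the names of the files in 'path' that match a simple wildcard pattern ('*' and '?').
// Requires java.io.File, java.io.FilenameFilter and java.util.regex.Pattern.
public static String[] GetFiles(final String path, final String searchPattern) {
    // Escape the literal dot first, then translate the wildcards into regex syntax.
    final Pattern re = Pattern.compile(
            searchPattern.replace(".", "\\.").replace("*", ".*").replace("?", ".?"));
    return new File(path).list(new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            return new File(dir, name).isFile() && re.matcher(name).matches();
        }
    });
}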
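
As an alternative to translating wildcards into a regular expression by hand, the JDK's own glob support can list the files. The following is a minimal sketch using only java.nio; the class and method names are hypothetical. Note that in this glob syntax `?` matches exactly one character, so the sketch uses `*.{doc,docx}` to match both extensions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class WordFileLister {
    /** Returns the sorted names of regular files in dir ending in .doc or .docx. */
    static List<String> listWordFiles(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.{doc,docx}")) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    names.add(entry.getFileName().toString());
                }
            }
        }
        Collections.sort(names); // directory iteration order is not guaranteed
        return names;
    }
}
```

Each returned name can then be resolved against the directory and passed to new Document(...) as in the loop above.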

PROCUREMENT2 · February 11, 2021, 3:45pm

Pretty neat code, @awais.hafeez, thanks for giving it so fast!
I am afraid I did not precisely express what I need.

I need to determine and extract the section page numbers.
Ideally I would collect them in a Map<Integer, Integer> that maps each absolute page number to its section page number.
For example:
First example: will result in Map.of(1,1,2,2,3,1,4,2)
Second example: Map.of(1,1,2,2,3,2)
Third example: I think it should give me Map.of(1,1,2,2,3,3) instead of Map.of(1,1,3,3)

In the meantime I have been investigating how to achieve this and found that each PageSetup has a sectionStart property, which gives more information on how to handle each section in the calculation.

I am afraid that relying on fields could fail when a section has no page numbers.
For example: Section (with page number) - Section (without page number) - Section (with page number)

awais.hafeez · February 12, 2021, 5:37am

@PROCUREMENT2,

Please try the following code that does not rely on existing PAGE fields in source Word documents:

String path = "C:\\Temp\\docx_examples\\";
String pattern = "*.doc?";

String[] fileNames = GetFiles(path, pattern);
for (String fileName : fileNames) {
    Document doc = new Document(path + fileName);
    System.out.println(doc.getOriginalFileName());

    Map<Integer, Integer> sectionPageNumbers = new HashMap<Integer, Integer>();

    for (int i = 0; i < doc.getPageCount(); i++) {
        Document pageDoc = doc.extractPages(i, 1);

        if (pageDoc.toString(SaveFormat.TEXT).trim().equals("") &&
                pageDoc.getChildNodes(NodeType.ANY, true).getCount() == 3) {
            // For an empty page, fall back to the absolute page number
            sectionPageNumbers.put(i + 1, i + 1);
        } else {
            DocumentBuilder builder = new DocumentBuilder(pageDoc);
            FieldPage pageField = (FieldPage) builder.insertField(FieldType.FIELD_PAGE, true);
            sectionPageNumbers.put(i + 1, Integer.parseInt(pageField.getResult()));
        }
    }

    // Use a typed entry instead of the raw Map.Entry type
    for (Map.Entry<Integer, Integer> me : sectionPageNumbers.entrySet())
        System.out.println(me.getKey() + ", " + me.getValue());

    System.out.println("================");
}

PROCUREMENT2 · February 12, 2021, 7:02am

Regarding efficiency: is calling doc.extractPages(i, 1) for a 100-page document more efficient than deep cloning each section and calculating the page numbers manually?

With this page field insert, it works for the provided documents!

However, there is a problem with documents whose pages are completely filled with content.
2_pages_full_content.zip (14.4 KB)
Although the document contains only 2 pages, inserting the field expands it to 3. This could lead to problems. Even adding a field to the header/footer could change the page count.

awais.hafeez · February 12, 2021, 10:05am

@PROCUREMENT2,

There should not be any performance issues when using the Document.extractPages method. To keep the inserted field from changing the layout, you can insert a hidden PAGE field, make the inserted PAGE field very small, or insert it in a floating textbox instead of inline with the text:

...
...
} else {
    DocumentBuilder builder = new DocumentBuilder(pageDoc);
    builder.getFont().setHidden(true);
    // builder.getFont().setSize(1); // or write a very small sized page field
    FieldPage pageField = (FieldPage) builder.insertField(FieldType.FIELD_PAGE, true);
    sectionPageNumbers.put(i + 1, Integer.parseInt(pageField.getResult()));
}
...
...

PROCUREMENT2 · February 12, 2021, 6:57pm

I have been trying this code, and so far everything looks good.

I found another case that bothers me:
Even_page_break_page_2.zip (11.9 KB)
The code returns 1/2/3/4/6, while the pages are 1/2/3/4/5. Any ideas?

Edit:

The same happens with an odd page section break, except that here the odd page numbers are reported twice.

public static DocumentBuilder builder() throws Exception {
    DocumentBuilder builder = new DocumentBuilder(new Document());
    builder.insertBreak(BreakType.PAGE_BREAK);
    builder.insertBreak(BreakType.SECTION_BREAK_ODD_PAGE);
    builder.insertBreak(BreakType.PAGE_BREAK);
    builder.insertBreak(BreakType.PAGE_BREAK);
    builder.insertBreak(BreakType.PAGE_BREAK);
    builder.insertBreak(BreakType.PAGE_BREAK);
    builder.insertBreak(BreakType.PAGE_BREAK);
    builder.insertBreak(BreakType.PAGE_BREAK);
    return builder;
}

Results in [1, 2, 3, 5, 5, 7, 7, 9, 9]

My assumption is that this issue is related to the behavior of document.extractPages(index, count).

awais.hafeez · February 13, 2021, 6:52am

@PROCUREMENT2,

Yes, this seems to be a problem with the extractPages method. We can see the same problem in “out_4.docx” when running the following simple Java code:

Document doc = new Document("C:\\Even_page_break_page_2\\Even_page_break_page_2.docx");

for (int i = 0; i < doc.getPageCount(); i++) {
    Document pageDoc = doc.extractPages(i, 1);
    pageDoc.save("C:\\Even_page_break_page_2\\out_" + i + ".docx");
}

We have logged this problem in our issue tracking system as WORDSNET-21820. We will look further into the details and keep you updated on the status of the linked issue. We apologize for the inconvenience.

aspose.notifier · March 11, 2021, 9:26am

The issues you found earlier (filed as WORDSNET-21820) have been fixed in the Aspose.Words for .NET 21.3 update and the Aspose.Words for Java 21.3 update.