Change pages only after comparison

Prakruth · September 2, 2025, 1:29pm

Hi,

import aspose.words as aw
from datetime import datetime

docA = aw.Document("DocumentA.docx")
docB = aw.Document("DocumentA_corrected.docx")

options = aw.comparing.CompareOptions()

options.ignore_formatting = True
options.ignore_headers_and_footers = True
options.ignore_case_changes = True
options.ignore_tables = True
options.ignore_fields = True
options.ignore_comments = True
options.ignore_textboxes = True
options.ignore_footnotes = True

docA.compare(docB, "user", datetime.today(), options)

So let's assume the original document has 5 pages and corrections are made on pages 1, 2, and 5.

After comparing, is there a way to save only these pages?
Can we do this dynamically for longer documents where we do not know which pages have changes?
Can you please help me with code?

Professionalize.Discourse · September 2, 2025, 1:30pm

@Prakruth

To save only the pages that have changes after comparing two documents, follow these steps:

Cause: You need to identify and save only the pages that have been modified between two documents.
Solution: Use the compare method to find changes, then collect the indices of the changed pages, and finally save those pages using PdfSaveOptions.

Here’s a code example to illustrate this process:

import com.aspose.words.*;

import java.util.Date;
import java.util.TreeSet;

public class CompareAndSaveChangedPages {
    public static void main(String[] args) throws Exception {
        // Load the original and corrected documents
        Document docA = new Document("DocumentA.docx");
        Document docB = new Document("DocumentA_corrected.docx");

        // Compare the documents; differences are recorded in docA as revisions
        docA.compare(docB, "user", new Date());

        // LayoutCollector maps document nodes to the pages they are laid out on
        LayoutCollector collector = new LayoutCollector(docA);

        // Hold the zero-based indices of changed pages, sorted and without duplicates
        TreeSet<Integer> changedPages = new TreeSet<>();

        // Iterate through revisions to find changed pages
        for (Revision revision : docA.getRevisions()) {
            // getStartPageIndex returns a 1-based page number; PageSet expects 0-based indices
            int pageNumber = collector.getStartPageIndex(revision.getParentNode());
            if (pageNumber > 0) {
                changedPages.add(pageNumber - 1);
            }
        }

        // Create PdfSaveOptions and set the pages to save
        PdfSaveOptions saveOptions = new PdfSaveOptions();
        saveOptions.setPageSet(new PageSet(changedPages.stream().mapToInt(i -> i).toArray()));

        // Save the document with only the changed pages
        docA.save("ChangedPages.pdf", saveOptions);
    }
}

Explanation:

The code compares two documents and identifies the pages with changes.
It collects the indices of the changed pages and sets them in the PdfSaveOptions.
Finally, it saves the document containing only the modified pages.

Next Steps:

You can run this code in your Java environment with the Aspose.Words library included. Ensure that the document paths are correct and that you have the necessary permissions to read and write files.

If you have any further questions or need additional assistance, feel free to ask!

Prakruth · September 2, 2025, 1:37pm

We are using Python. Can you please help us with Python code?

alexey.noskov · September 2, 2025, 2:35pm

@Prakruth As you may know, MS Word documents are flow documents by nature and have no “page” concept. Consumer applications reflow the document content into pages on the fly. Aspose.Words has its own layout engine to achieve this.
To achieve what you need, you have to determine the indices of the pages where changes are found. Changes are marked with revisions, and you can map them to pages using the LayoutCollector class.
If you are saving the output in a fixed-page format, like PDF or XPS, you can set the required page indices in PageSet. Otherwise, you need to use the Document.extract_pages method to extract the required pages as separate documents and then join them into one document. Both cases are demonstrated in the code below:

import datetime
import aspose.words as aw

v1 = aw.Document("C:\\Temp\\v1.docx")
v2 = aw.Document("C:\\Temp\\v2.docx")

# Compare documents.
v1.compare(v2, "AW", datetime.date.today())

# Create LayoutCollector to determine page indices with revisions.
collector = aw.layout.LayoutCollector(v1)

# Collect page indices to extract.
pages = []
for r in v1.revisions:
    # get_start_page_index returns a 1-based page number; PageSet and extract_pages use 0-based indices.
    page = collector.get_start_page_index(r.parent_node) - 1
    if page >= 0 and page not in pages:
        pages.append(page)
# Keep the pages in document order.
pages.sort()

# If the output is saved to fixed-page formats like PDF, you can specify page indices in save options.
pdf_save_opt = aw.saving.PdfSaveOptions()
pdf_save_opt.page_set = aw.saving.PageSet(pages)
v1.save("C:\\Temp\\out.pdf", pdf_save_opt)

# If the output must be saved in a flow format, like DOCX, the required pages have to be extracted.
tmp = v1.clone(False).as_document()
for i in pages:
    tmp.append_document(v1.extract_pages(i, 1), aw.ImportFormatMode.USE_DESTINATION_STYLES)
tmp.save("C:\\Temp\\out.docx")
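
When several changed pages are adjacent, it may be preferable to extract them in a single extract_pages call instead of one page at a time. Below is a minimal sketch of a plain Python helper for that; group_consecutive is a hypothetical name, not part of Aspose.Words, and it only assumes that pages holds zero-based page indices.

```python
def group_consecutive(pages):
    """Collapse zero-based page indices into (start_index, count) runs.

    Hypothetical helper, not part of Aspose.Words. Duplicates are dropped
    and the runs are returned in ascending page order.
    """
    runs = []
    for page in sorted(set(pages)):
        if runs and page == runs[-1][0] + runs[-1][1]:
            # The page directly follows the last run, so extend that run.
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            # Otherwise start a new run of length 1.
            runs.append((page, 1))
    return runs


# Possible use with the DOCX branch (assumes v1, tmp and pages already exist):
# for start, count in group_consecutive(pages):
#     tmp.append_document(v1.extract_pages(start, count),
#                         aw.ImportFormatMode.USE_DESTINATION_STYLES)
```

Each run maps directly onto the (index, count) arguments of extract_pages, so adjacent changed pages are appended as one piece.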

Prakruth · September 10, 2025, 10:18am

“Can not access ParentNode for a style revision. Use ParentStyle instead.” We see this error for certain files. Is there a generalized solution that covers all the information in a document?

alexey.noskov · September 10, 2025, 12:44pm

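@Prakruth You should skip style revisions when collecting pages. Try adding the following condition:

# Collect page indices to extract.
pages = []
for r in v1.revisions:
    # Style revisions have no parent node, so skip them.
    if r.revision_type != aw.RevisionType.STYLE_DEFINITION_CHANGE:
        # get_start_page_index returns a 1-based page number; convert it to a 0-based index.
        page = collector.get_start_page_index(r.parent_node) - 1
        if page >= 0 and page not in pages:
            pages.append(page)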
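
The collection step can also be wrapped in a small function that is independent of the document, which makes it easy to check with stand-in data. This is only a sketch: changed_page_indices, start_page_of and is_style_revision are hypothetical names, and it assumes the page lookup returns 0 for a node that cannot be mapped to a page.

```python
def changed_page_indices(revisions, start_page_of, is_style_revision):
    """Return sorted, unique zero-based indices of pages that contain revisions.

    Hypothetical helper, not part of Aspose.Words.
    revisions         -- any iterable of revision-like objects
    start_page_of     -- callable giving the 1-based start page of a revision
                         (assumed to return 0 when the node is not in the layout)
    is_style_revision -- callable returning True for revisions to skip
    """
    pages = set()
    for revision in revisions:
        if is_style_revision(revision):
            # Style revisions have no parent node, so there is no page to look up.
            continue
        page_number = start_page_of(revision)
        if page_number > 0:
            pages.add(page_number - 1)
    return sorted(pages)


# Possible use with the objects from the code above (assumes v1 and collector exist):
# pages = changed_page_indices(
#     v1.revisions,
#     lambda r: collector.get_start_page_index(r.parent_node),
#     lambda r: r.revision_type == aw.RevisionType.STYLE_DEFINITION_CHANGE,
# )
```

The resulting list can be passed to PageSet or used with extract_pages in the same way as pages in the earlier code.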