Page numbers are not rendered correctly in HTML output

AlpeshChaudhariDev · November 7, 2023, 1:12pm

Hello team,

I’m attempting to convert a Word document into HTML. However, before I perform this conversion, I am extracting each page one by one from the source document and appending them into a destination document. Then, I’m converting this new document to HTML. The problem is that in the HTML output, every page is displaying as “Page 1”. This is not matching the page numbering in the source document. How can I fix this issue?

Snippet:

// Verbatim string (@"...") so the backslash is not treated as an escape sequence.
string sourcePath = @"D:\Source.docx";
Aspose.Words.Document sourceDoc = new Aspose.Words.Document(sourcePath);
Aspose.Words.Document destDoc = new Aspose.Words.Document();

for (int i = 0; i < sourceDoc.PageCount; i++)
{
    Document page = sourceDoc.ExtractPages(i, 1);
    destDoc.AppendDocument(page, ImportFormatMode.KeepSourceFormatting);
}
destDoc.Save(@"D:\Output.html");

Attachments:
WordToHTML_Issue.zip (378.2 KB)

alexey.noskov · November 7, 2023, 1:24pm

@AlpeshChaudhariDev You should unlink the PAGE fields to get the desired result:

Document sourceDoc = new Document(@"C:\Temp\in.docx");
Document destDoc = new Document();

for (int i = 0; i < sourceDoc.PageCount; i++)
{
    Document page = sourceDoc.ExtractPages(i, 1);
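    // Update each PAGE field, then unlink it so the page number remains as static text
    // (the LINQ calls require "using System.Linq;").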
    page.Range.Fields.Where(f => f.Type == FieldType.FieldPage).ToList()
        .ForEach(f => { f.Update(); f.Unlink(); });

    destDoc.AppendDocument(page, ImportFormatMode.KeepSourceFormatting);
}
destDoc.Save(@"C:\Temp\out.html");

AlpeshChaudhariDev · November 7, 2023, 1:30pm

@alexey.noskov It’s working well now. Thanks for your quick response.

AlpeshChaudhariDev · September 9, 2024, 4:47am

Hi, I’m encountering an error with this snippet when trying to process some documents.

The exception message is:
Invalid document model. The operation cannot be completed.

Can you help me understand why this is happening and how I can resolve it?

Document :
sampleDoc.docx (1.2 MB)

alexey.noskov · September 9, 2024, 5:00am

@AlpeshChaudhariDev The problem occurs because, after the document is split into pages, some fields can be corrupted if one part of a field is on one page and the rest is on the next page. Usually this is not a problem, since such corrupted fields are corrected upon saving. But in your case the extracted page is not saved. You can modify the code as follows to force document model validation:

Document sourceDoc = new Document(@"C:\Temp\in.docx");
Document destDoc = new Document();

for (int i = 0; i < sourceDoc.PageCount; i++)
{
    Document page = sourceDoc.ExtractPages(i, 1);
    // Force document model validation by saving and opening the document.
    using (MemoryStream tmp = new MemoryStream())
    {
        page.Save(tmp, SaveFormat.Docx);
        tmp.Position = 0;
        page = new Document(tmp);
    }

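    // Update each PAGE field, then unlink it so the page number remains as static text
    // (the LINQ calls require "using System.Linq;").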
    page.Range.Fields.Where(f => f.Type == FieldType.FieldPage).ToList()
        .ForEach(f => { f.Update(); f.Unlink(); });

    destDoc.AppendDocument(page, ImportFormatMode.KeepSourceFormatting);
}
destDoc.Save(@"C:\Temp\out.html");

AlpeshChaudhariDev · September 9, 2024, 5:50am

If the issue arises because, after splitting the document into pages, certain fields may become corrupted when part of a field is on one page and the remainder on the next, would it be possible to invoke this process before the document is split into pages? Will the page numbers be generated correctly in that case?

alexey.noskov · September 9, 2024, 5:54am

@AlpeshChaudhariDev As you may know, MS Word documents are flow documents by nature and have no concept of a page. So it would be difficult to make sure the model is not corrupted after splitting into pages. I would therefore suggest using the workaround above to force document model validation after splitting.
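
If it helps, the per-page steps (extract, save and reopen to validate, then unlink PAGE fields) could be factored into one helper so the loop stays short. This is only a sketch: the names `PageSplitter` and `ExtractValidatedPage` are hypothetical and not part of Aspose.Words. Only the calls already used in this thread are real API.

```csharp
using System.IO;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Fields;

// Hypothetical helper class; not an Aspose.Words type.
static class PageSplitter
{
    // Extracts a single page (zero-based index) and prepares it for appending.
    public static Document ExtractValidatedPage(Document source, int pageIndex)
    {
        Document page = source.ExtractPages(pageIndex, 1);

        // Force document model validation by saving and reopening the page in memory.
        using (MemoryStream tmp = new MemoryStream())
        {
            page.Save(tmp, SaveFormat.Docx);
            tmp.Position = 0;
            page = new Document(tmp);
        }

        // Update each PAGE field, then unlink it so its result stays as static text.
        // ToList() takes a snapshot, because unlinking changes the field collection.
        foreach (Field field in page.Range.Fields
                     .Where(f => f.Type == FieldType.FieldPage).ToList())
        {
            field.Update();
            field.Unlink();
        }

        return page;
    }
}
```

With this in place, the loop body reduces to `destDoc.AppendDocument(PageSplitter.ExtractValidatedPage(sourceDoc, i), ImportFormatMode.KeepSourceFormatting);`.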

AlpeshChaudhariDev · September 9, 2024, 5:58am

Thanks for the suggestion.