Free Support Forum - aspose.com

Remove Blank Pages from Word Document with Headers Footers using C# .NET or Java

@tahir.manzoor, my Word document has a header and footer. When I convert it to PDF, the blank pages of the Word document retain the headers and footers. Because of this, the PDF check (page.IsBlank) does not recognize them as blank pages. Please suggest what can be done in this scenario.

@crshekharam,

You can build your logic on the following C# code, which uses only the Aspose.Words for .NET API to remove blank pages from a Word document:

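using System.Collections;
using Aspose.Words;

Document doc = new Document(@"C:\Temp\remove empty pages from word.docx");

// This list holds the zero-based numbers of blank pages.
// -1 is a sentinel that marks the position before the first page.
ArrayList emptyPageNumbers = new ArrayList();
emptyPageNumbers.Add(-1);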

// Extract each page as a separate Word document
int totalPages = doc.PageCount;
for (int i = 0; i < totalPages; i++)
{
    Document pageDoc = doc.ExtractPages(i, 1);

    // Get text representation of this Page
    string textOfPage = "";
    foreach (Section section in pageDoc.Sections)
        // Let's not consider the content of headers and footers
        textOfPage = textOfPage + section.Body.ToString(SaveFormat.Text);

    // If textOfPage is empty, the page is blank
    if (string.IsNullOrEmpty(textOfPage.Trim()))
        emptyPageNumbers.Add(i);
}
emptyPageNumbers.Add(totalPages);

// Concatenate documents with non-empty pages again
Document final_Document = (Document)doc.Clone(false);
final_Document.RemoveAllChildren();

for (int i = 1; i < emptyPageNumbers.Count; i++)
{
    int index = (int)emptyPageNumbers[i - 1] + 1;
    int count = (int)emptyPageNumbers[i] - index;

    if (count > 0)
        final_Document.AppendDocument(doc.ExtractPages(index, count), ImportFormatMode.KeepSourceFormatting);
}

final_Document.Save(@"C:\Temp\merged word Document with non-empty pages.docx");

@awais.hafeez, this works fine if the documents contain only text. If a page contains only images or shapes, that page is treated as blank and gets removed.

@crshekharam,

The following C# code should take images and shapes into account when removing blank pages from a Word document:

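using System.Collections;
using Aspose.Words;

Document doc = new Document(@"C:\Temp\remove empty pages from word.docx");

// This list holds the zero-based numbers of blank pages.
// -1 is a sentinel that marks the position before the first page.
ArrayList emptyPageNumbers = new ArrayList();
emptyPageNumbers.Add(-1);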

// Extract each page as a separate Word document
int totalPages = doc.PageCount;
for (int i = 0; i < totalPages; i++)
{
    Document pageDoc = doc.ExtractPages(i, 1);

    // Get text representation of this Page and total count of Shapes
    int shapeCount = 0;
    string textOfPage = "";
    foreach (Section section in pageDoc.Sections)
    {
        // Let's not consider the content of headers and footers
        textOfPage = textOfPage + section.Body.ToString(SaveFormat.Text);
        shapeCount += section.Body.GetChildNodes(NodeType.Shape, true).Count;
    }

    // If textOfPage is empty and the page contains no Shape nodes, consider the page blank
    if (string.IsNullOrEmpty(textOfPage.Trim()) && shapeCount == 0)
        emptyPageNumbers.Add(i);
}
emptyPageNumbers.Add(totalPages);

// Concatenate documents with non-empty pages again
Document final_Document = (Document)doc.Clone(false);
final_Document.RemoveAllChildren();

for (int i = 1; i < emptyPageNumbers.Count; i++)
{
    int index = (int)emptyPageNumbers[i - 1] + 1;
    int count = (int)emptyPageNumbers[i] - index;

    if (count > 0)
        final_Document.AppendDocument(doc.ExtractPages(index, count), ImportFormatMode.KeepSourceFormatting);
}

final_Document.Save(@"C:\Temp\merged word Document with non-empty pages.docx");
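
The index arithmetic in the concatenation loop above can be easier to follow in isolation. The following is a minimal sketch in plain C#, without Aspose.Words, of how the sentinel values -1 and totalPages turn a sorted list of blank page numbers into (index, count) pairs of the kind passed to ExtractPages. The class name PageRanges and the method name NonEmptyRanges are hypothetical and do not come from the Aspose API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class PageRanges
{
    // Returns (Index, Count) pairs covering every page that is NOT listed
    // in emptyPages. emptyPages must be zero-based and sorted ascending.
    public static List<(int Index, int Count)> NonEmptyRanges(IEnumerable<int> emptyPages, int totalPages)
    {
        // Sentinels: -1 sits before the first page, totalPages after the last one.
        var boundaries = new List<int> { -1 };
        boundaries.AddRange(emptyPages);
        boundaries.Add(totalPages);

        var ranges = new List<(int Index, int Count)>();
        for (int i = 1; i < boundaries.Count; i++)
        {
            // Pages strictly between two neighbouring boundaries are non-empty.
            int index = boundaries[i - 1] + 1;
            int count = boundaries[i] - index;
            if (count > 0)
                ranges.Add((index, count));
        }
        return ranges;
    }
}
```

For a six-page document whose pages 1 and 4 are blank, the boundaries are -1, 1, 4, 6, so the ranges are (0, 1), (2, 2) and (5, 1). Adjacent blank pages produce a count of zero and are skipped.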

It seems to me that this solution of figuring out which pages are not empty, and then creating a new document by concatenating them all together, is costly both in time and server resources. Can you confirm this is the best way to accomplish this, rather than simply deleting the pages on the fly when you see they are empty in the loop?

@deisenberg,

Your understanding is correct. However, please note that there is no concept of a page in an MS Word document. Pages are created on the fly when you open a Word document with MS Word.

I am coordinating with our team to get an answer to your query. We will update you as soon as the information is available.

@deisenberg The approach with the Document.ExtractPages method lets you make sure there are no empty pages. However, you are right that it is resource-intensive, not only because you create several instances of the Document object and then concatenate them, but also because document layout calculation is required.
Another approach is to remove explicit page breaks from the document. This is much less resource-intensive because it does not require document layout calculation, and it might help you get rid of blank pages. There are a few ways to set an explicit page break in a Word document:
  1. An explicit page break character: https://apireference.aspose.com/words/net/aspose.words/controlchar/fields/pagebreak
  2. The PageBreakBefore paragraph option: https://apireference.aspose.com/words/net/aspose.words/paragraphformat/properties/pagebreakbefore
  3. A section break: https://docs.aspose.com/words/net/working-with-sections/
It might also be a good idea to remove empty paragraphs from the end of the document; this might help to get rid of empty pages at the document end.
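
As a rough illustration of the page-break approach, here is a minimal sketch, assuming Aspose.Words for .NET is referenced. It clears the PageBreakBefore option, strips page break characters from runs, and makes section breaks continuous. The class name PageBreakCleaner and the method name RemoveExplicitPageBreaks are hypothetical. Note that this removes every explicit break in the document, including intentional ones, so it may need to be narrowed for real templates.

```csharp
using Aspose.Words;

// Hypothetical helper, for illustration only.
public static class PageBreakCleaner
{
    public static void RemoveExplicitPageBreaks(Document doc)
    {
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            // Clear the "Page break before" paragraph option.
            if (para.ParagraphFormat.PageBreakBefore)
                para.ParagraphFormat.PageBreakBefore = false;

            // Strip page break characters from the runs of this paragraph.
            foreach (Run run in para.Runs)
            {
                if (run.Text.Contains(ControlChar.PageBreak))
                    run.Text = run.Text.Replace(ControlChar.PageBreak, string.Empty);
            }
        }

        // Make section breaks continuous so they no longer start a new page.
        foreach (Section section in doc.Sections)
            section.PageSetup.SectionStart = SectionStart.Continuous;
    }
}
```

No layout calculation is triggered here, because the code only walks the document node tree and never asks for page numbers or a page count.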

@alexey.noskov, I tried this option of removing redundant page breaks and empty paragraphs. It worked well until shapes were encountered. When a document contains floating shapes alongside the text, i.e. shapes that are not inline with text (flow-chart type designs), those shapes get disturbed when we remove the blank paragraphs entered between arrows or text boxes. Refer to the images of the original shapes and their appearance after removal of the empty paragraphs between them:
image.png (2.9 KB)
image.png (2.8 KB)

@crshekharam Yes, the mentioned approach is not ideal in terms of accuracy. MS Word documents allow many different ways to produce blank pages, whether intended or redundant. The human factor also plays a significant role here. For example, one person may use blank paragraphs to move content to the next page, while another will use an explicit page break or the page break before option.
In any case, blank pages in documents are the exception rather than the rule, and it is normally not necessary to process every document to remove them. So, to make the code less resource-intensive, the explicit page break approach can be used as a check for whether the document might contain blank pages: the code can check whether the document contains explicit page breaks or long sequences of empty paragraphs.
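
Such a pre-check could look roughly like the sketch below. It is written in plain C# and works on paragraph texts supplied by the caller, so it makes no assumption about how the texts are read from the document. The names BlankPageHeuristic, MightContainBlankPages and emptyRunThreshold are hypothetical, and the threshold that counts as a "long sequence" is a guess that would need tuning for real templates. The form feed character is the one Aspose.Words exposes as ControlChar.PageBreak. The PageBreakBefore paragraph option is not visible in the text and would need a separate check.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class BlankPageHeuristic
{
    // Form feed, the character used for an explicit page break.
    private const char PageBreakChar = '\f';

    // Returns true if the texts contain a page break character or at least
    // emptyRunThreshold consecutive empty (whitespace-only) paragraphs.
    public static bool MightContainBlankPages(IEnumerable<string> paragraphTexts, int emptyRunThreshold)
    {
        int consecutiveEmpty = 0;
        foreach (string text in paragraphTexts)
        {
            if (text != null && text.IndexOf(PageBreakChar) >= 0)
                return true;

            if (string.IsNullOrWhiteSpace(text))
            {
                consecutiveEmpty++;
                if (consecutiveEmpty >= emptyRunThreshold)
                    return true;
            }
            else
            {
                // A paragraph with content ends the current run of empty ones.
                consecutiveEmpty = 0;
            }
        }
        return false;
    }
}
```

Only documents for which this returns true would then go through the more expensive page-by-page processing.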

Thank you for your response. In our case, we have Word document templates. Some have headers/footers with our logo, etc. and some do not.

We use Aspose.Words to add text to the template based on database information. For example, we replace the text {Your Full Name} with the name in the database using Document.Range.Replace(). Or we replace a bookmark called [Disclaimer] with a long paragraph, sometimes formatted with bullets or numbers, which is stored as RTF bytes and inserted using Builder.InsertDocument() from a Document stream.

After performing several of these actions you can imagine these inserts will push everything down the page. This sometimes causes blank pages at the end of the document which must be removed.

After we finish inserting/replacing fields, we convert the document to PDF and save it. The user is almost never given the option to save to a Word document for editing, because these are corporate templates which should not be altered.

Knowing that we have both Aspose.Words and Aspose.PDF, what is your recommendation for the best way (and fastest, and least resource intensive) of removing these blank pages at the end of the documents? We can do it in Word after all inserts are done, or in PDF after the document is converted. Again, the final document sometimes has our logo at the top, a header, a footer, and sometimes does not.

@deisenberg I would suggest a few things:

  1. Remove empty paragraphs at the document end. In most cases this will avoid blank pages at the end of the document.
  2. Remove explicit page breaks from the end of the document.
  3. If the last node of the document is a paragraph, enable the “Widow/Orphan Control” option for this paragraph. If the paragraph is not very long (3-6 lines), I would recommend enabling “Keep Lines Together” as well.
  4. If the last node is a table, enable the “Keep with Next” option for the paragraphs of at least the last 1-3 rows. This is important because a table cannot be the last document node: there must be at least one paragraph after a table. If you remove the empty paragraph after a table at the end of the document, an empty paragraph will be added automatically. With “Keep with Next” set in the table, part of the table moves together with that empty paragraph at the end of the document.

If you have control over the template document, you can do this using MS Word. Otherwise, you can process the document programmatically. For example, see the following code:
using Aspose.Words;
using Aspose.Words.Tables;

private void AvoidEmptyPagesAtDocumentEnd(Document doc)
{
    // 1. Remove trailing sections that contain no runs and no shapes.
    //    Keep at least one section so that doc.LastSection is never null below.
    while ((doc.Sections.Count > 1)
        && (doc.LastSection.Body.GetChildNodes(NodeType.Run, true).Count == 0)
        && (doc.LastSection.Body.GetChildNodes(NodeType.Shape, true).Count == 0))
        doc.LastSection.Remove();

    if (doc.LastSection == null)
        return;
    Body body = doc.LastSection.Body;

    // 2. Remove empty paragraphs at the end of the document.
    while ((body.LastChild != null)
        && (body.LastChild.NodeType == NodeType.Paragraph)
        && !body.LastParagraph.HasChildNodes)
        body.LastParagraph.Remove();

    // Nothing left to adjust if the body is now empty.
    if (body.LastChild == null)
        return;

    // 3. Set the Widow/Orphan control option for the last paragraph.
    if (body.LastChild.NodeType == NodeType.Paragraph)
        body.LastParagraph.ParagraphFormat.WidowControl = true;

    // 4. Enable the Keep with next option if the last node is a table.
    if (body.LastChild.NodeType == NodeType.Table)
    {
        Table lastTable = (Table)body.LastChild;
        NodeCollection rowParagraphs = lastTable.LastRow.GetChildNodes(NodeType.Paragraph, true);
        foreach (Paragraph para in rowParagraphs)
            para.ParagraphFormat.KeepWithNext = true;
    }
}

Of course, it would be great to take a look at your documents with blank pages at the end. This would allow us to determine the exact cause. But I am sure that blank pages at the end of the document can be eliminated without rendering the document or splitting it into pages.