Removing blank pages - existing code samples don't work

ksmith1 · January 23, 2019, 3:34am

We have a Word document that has empty pages in it that we would like to remove. We have tried solutions offered by Aspose in other posts, but they are not working.

We are also seeing behaviour where Aspose code for determining the page number for a node doesn’t always return the correct page number.

We have a solution for removing these blank pages that works, but it relies on LayoutCollector, as used in the GetNodesByPage examples (from other Aspose posts), working… which it doesn’t consistently.

Please find attached Aspose code we are trying and the document that has the blank pages.

aspose_code_example.zip (18.9 KB)

tahir.manzoor · January 23, 2019, 2:17pm

@ksmith1

Thanks for your inquiry. We are working on your query and will get back to you soon.

ksmith1 · January 23, 2019, 11:32pm

Please find attached another, more complex document (more indicative of the documents we create). The page to look at is page 28. When we use the GetNodesByPage() method from this forum (which uses LayoutCollector), the nodes returned for page 28 are not correct: the nodes from the following page are returned instead, which means page 28 is not flagged as blank.

Larger_Example.zip (96.4 KB)

tahir.manzoor · January 24, 2019, 4:37am

@ksmith1

Thanks for your inquiry. In your case, we suggest the following solution.

1. Split the document pages into separate documents using the PageSplitter utility. Please get the code from the Github repository.
2. Join the extracted documents, except those that have no text. You can get a document’s text using the Node.ToString(SaveFormat.Text) method and check whether it is empty.
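
The two steps above can be sketched roughly as follows. This is a minimal sketch, not the PageSplitter code itself: it swaps in Document.ExtractPages, which is available in later Aspose.Words for .NET releases, in place of the PageSplitter utility, and the file names are placeholders. Whether ToString(SaveFormat.Text) comes back empty for a page that still carries header or footer text is an assumption to verify against your own documents, and a page holding only an image would also be treated as empty here.

```csharp
using Aspose.Words;

class SplitAndRejoinSketch
{
    static void Main()
    {
        // Placeholder input path (assumption, not from the thread).
        Document source = new Document("input.docx");
        Document result = null;

        for (int i = 0; i < source.PageCount; i++)
        {
            // ExtractPages takes a zero-based page index and a page count.
            // It stands in for the PageSplitter utility in this sketch.
            Document page = source.ExtractPages(i, 1);

            // Skip pages whose plain-text export is empty or whitespace only.
            string text = page.ToString(SaveFormat.Text);
            if (string.IsNullOrWhiteSpace(text))
                continue;

            // Join the remaining pages back into one document.
            if (result == null)
                result = page;
            else
                result.AppendDocument(page, ImportFormatMode.KeepSourceFormatting);
        }

        if (result != null)
            result.Save("output.docx"); // placeholder output path
    }
}
```

Every page is extracted and re-imported, so this approach costs noticeably more time and memory than editing the document in place.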

Hope this helps you.

ksmith1 · January 24, 2019, 5:12am

This seems like a convoluted work around.

Can you tell me why LayoutCollector incorrectly returns nodes for the wrong page?

If there is an issue with LayoutCollector it would affect all Words users.

As it stands, using PageSplitter may be fine for a few documents, but we are processing thousands, and the additional resource utilization and time taken by PageSplitter aren’t viable.

We can fix the problem if LayoutCollector correctly identifies the page that nodes belong to.

tahir.manzoor · January 24, 2019, 3:31pm

@ksmith1

Thanks for your inquiry. We have logged a feature request in our issue tracking system as WORDSNET-18064 to remove empty pages from the document. You will be notified via this forum thread once this feature is available. We apologize for the inconvenience.

Could you please share the page number for which you are facing this issue? We will investigate the issue on our side and provide you more information.

ksmith1 · January 24, 2019, 9:48pm

Example.zip (213.2 KB)
In the attached file it is page 28. In the larger code base LayoutCollector seems to be wrong; however, after stripping it down to a console app, the LayoutCollector issue doesn’t present itself. What does happen in the attached console app is that nodes don’t get removed despite being detected as belonging to a blank page. I’ll try to reproduce the LayoutCollector issue, but would like your input on the nodes not being removed.

Also of note: the .doc file grows in size after processing, which is surprising.

ksmith1 · January 25, 2019, 12:10am

Splitter_Example.zip (204.7 KB)
In redacting the previously supplied document, the problem fixed itself (so in some cases a find-and-replace followed by a save corrects the internals of the document / object model).

However I implemented PageSplitter (which I note uses LayoutCollector) in a console app and ran a document through it and it suffers from the same issue (no doubt due to LayoutCollector).

If you run the attached console app with the attached Word document, the output has (a) no blank pages and (b) in place of the blank page, specifically page 6, the content of page 5. The expected result is that page 6 is blank, not the same as page 5.

Also, splitting the pages and then rejoining them creates problems, e.g. the table of contents is destroyed and is no longer a field that can be updated, and some formatting is lost for some reason.

Blank pages can be detected by checking whether PageSplitter produces two identical documents (which is a bug, as described above), but if LayoutCollector is broken there is no way to get the nodes for the page that is to be deleted… never mind that Remove() doesn’t seem to work on the nodes.

tahir.manzoor · January 25, 2019, 7:55am

@ksmith1

Thanks for sharing the detail. We are investigating this issue and will get back to you soon.

tahir.manzoor · January 25, 2019, 10:16am

@ksmith1

Thanks for your patience. In your case, the PageSplitter utility does not work. We suggest the following two workarounds.

Workaround 1.

In your document, the paragraph that contains the page break is on the 5th page, which leaves the 6th page empty. The code example below removes the empty page from the document. You need to call this code recursively until all empty pages are removed, because removing a page changes the page layout of the document.

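// Requires: using System.Collections; using System.Collections.Generic; using System.Linq;
//           using Aspose.Words; using Aspose.Words.Layout;
Document doc = new Document(MyDir + "Redacted_Splitter.docx");
NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);

// Collect every paragraph whose text contains a page break character.
List<Node> nodes = paras.Cast<Node>().Where(node => node.GetText().Contains(ControlChar.PageBreak)).ToList<Node>();
LayoutCollector lc = new LayoutCollector(doc);
foreach (var node in nodes)
{
    // Skip the last paragraph of a section.
    if (((Paragraph)node).IsEndOfSection)
        continue;
    int page = lc.GetStartPageIndex(node);

    // GetNodesByPage is the helper from the earlier forum posts.
    // If the page after the break has no nodes, that page is empty,
    // so remove the paragraph that holds the page break.
    ArrayList pagenodes = GetNodesByPage(page + 1, doc);
    if (pagenodes.Count == 0)
    {
        node.Remove();
    }

    lc.GetStartPageIndex(node);
}

// RemovePages is not defined in this post; per the note above, it is
// presumably the method that wraps this code (the recursive call).
RemovePages(doc);

doc.Save(MyDir + @"19.1.docx");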

In GetNodesByPage, please replace the following if condition

if (paraGraphPageStart == page || endOfSection)

with
if (paraGraphPageStart == page)
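
Because the layout changes after every removal, Workaround 1 can also be written as a loop that rebuilds the layout before each pass instead of recursing. The following is only a sketch under stated assumptions: RemoveEmptyPagesAfterBreaks and the isPageEmpty callback are hypothetical names introduced here, and the callback stands in for a check such as GetNodesByPage(page, doc).Count == 0 from the forum helper, which is not reproduced in this thread. Like the code above, it removes the whole paragraph that holds the page break.

```csharp
using System;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Layout;

static class EmptyPageRemovalSketch
{
    // Hypothetical helper: removes paragraphs whose page break is followed by
    // an empty page. isPageEmpty(doc, pageNumber) is supplied by the caller.
    public static int RemoveEmptyPagesAfterBreaks(Document doc, Func<Document, int, bool> isPageEmpty)
    {
        LayoutCollector collector = new LayoutCollector(doc);
        int removed = 0;
        bool changed;

        do
        {
            changed = false;

            // Discard stale layout data and rebuild it before each pass.
            collector.Clear();
            doc.UpdatePageLayout();

            var candidates = doc.GetChildNodes(NodeType.Paragraph, true)
                .OfType<Paragraph>()
                .Where(p => !p.IsEndOfSection && p.GetText().Contains(ControlChar.PageBreak))
                .ToList();

            foreach (Paragraph para in candidates)
            {
                // GetStartPageIndex is 1-based and returns 0 for unmapped nodes.
                int page = collector.GetStartPageIndex(para);
                if (page > 0 && isPageEmpty(doc, page + 1))
                {
                    para.Remove();
                    removed++;
                    changed = true;
                    break; // page numbers are stale now, so start a new pass
                }
            }
        } while (changed);

        return removed;
    }
}
```

A call might look like EmptyPageRemovalSketch.RemoveEmptyPagesAfterBreaks(doc, (d, page) => GetNodesByPage(page, d).Count == 0). Each pass removes at most one paragraph, so the loop ends once no page break is followed by an empty page.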

Workaround 2.

1. Convert the Word document to PDF using Aspose.Words.
2. Remove the empty pages from the PDF using Aspose.PDF.
3. Convert the PDF to DOCX using Aspose.PDF.

Below code example shows how to remove the empty pages from PDF.

Aspose.Pdf.Document inputDoc = new Aspose.Pdf.Document(MyDir + "input.pdf");
Aspose.Pdf.Document outputDoc = new Aspose.Pdf.Document();
foreach (var page in inputDoc.Pages)
{
    // Copy only the pages that are not blank; 0.01 is the fill threshold passed to IsBlank.
    if (!page.IsBlank(0.01d))
        outputDoc.Pages.Add(page);
}
outputDoc.Save("out.pdf");
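
For reference, the three steps of Workaround 2 could be chained roughly as below. This is a sketch only: the file names are placeholders, and it assumes both Aspose.Words and Aspose.PDF are referenced, with fully qualified type names because both libraries define Document and SaveFormat.

```csharp
class WordToPdfRoundTripSketch
{
    static void Main()
    {
        // 1. Word -> PDF with Aspose.Words (placeholder file names).
        var wordDoc = new Aspose.Words.Document("input.docx");
        wordDoc.Save("input.pdf", Aspose.Words.SaveFormat.Pdf);

        // 2. Copy only the non-blank pages into a new PDF with Aspose.PDF.
        var inputPdf = new Aspose.Pdf.Document("input.pdf");
        var outputPdf = new Aspose.Pdf.Document();
        foreach (Aspose.Pdf.Page page in inputPdf.Pages)
        {
            // 0.01 is the fill threshold used in the example above.
            if (!page.IsBlank(0.01d))
                outputPdf.Pages.Add(page);
        }

        // 3. PDF -> DOCX with Aspose.PDF.
        outputPdf.Save("output.docx", Aspose.Pdf.SaveFormat.DocX);
    }
}
```

The round trip passes through a fixed-layout format, so fields such as a table of contents and some formatting may not survive it unchanged.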

tahir.manzoor · March 11, 2019, 3:45am

@ksmith1

Please note that a Word document is a flow document and does not contain any information about how the document is laid out into pages and lines. Therefore, technically there is no “Page” concept in a Word document. So, we have closed WORDSNET-18064 as Won’t Fix. Please use the workarounds shared in my previous post.
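
To illustrate the point that page numbers come from the computed layout rather than from the document itself, here is a small sketch that asks LayoutCollector which pages each body paragraph lands on and then lists pages with no body paragraphs. The class name and file path are placeholders, and the result depends on the fonts and settings of the machine that builds the layout.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Layout;

class ParagraphsByPageSketch
{
    static void Main()
    {
        Document doc = new Document("input.docx"); // placeholder path
        LayoutCollector collector = new LayoutCollector(doc);

        // Page number -> body paragraphs that start on, end on, or span that page.
        var byPage = new SortedDictionary<int, List<Paragraph>>();

        foreach (Section section in doc.Sections)
        {
            foreach (Paragraph para in section.Body.GetChildNodes(NodeType.Paragraph, true).OfType<Paragraph>())
            {
                int start = collector.GetStartPageIndex(para);
                int end = collector.GetEndPageIndex(para);
                if (start == 0 || end == 0)
                    continue; // 0 means the node could not be mapped to a page

                for (int page = start; page <= end; page++)
                {
                    if (!byPage.TryGetValue(page, out List<Paragraph> list))
                        byPage[page] = list = new List<Paragraph>();
                    list.Add(para);
                }
            }
        }

        // Pages with no body paragraphs mapped to them are candidates for "blank".
        for (int page = 1; page <= doc.PageCount; page++)
        {
            int count = byPage.TryGetValue(page, out List<Paragraph> found) ? found.Count : 0;
            Console.WriteLine($"Page {page}: {count} body paragraph(s)");
        }
    }
}
```

If the document is modified after the layout has been built, calling collector.Clear() and doc.UpdatePageLayout() before querying again avoids reading stale page numbers.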