Re: How to remove blank pages in word files before appending them?

mtassinari · May 28, 2014, 4:39am

Hi,

I tried the code you gave me and, even though it does not produce any error, it doesn’t seem to work either.

I am attaching a .doc file which has been generated in the following way:

1. compile JasperReport template with Aspose.Words for JasperReports
2. convert to Document class with Aspose.Words for Java
3. remove blank pages
4. append to master document
5. repeat steps 1-4 for each sub-document
6. remove blank pages from master document too
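
The steps above can be sketched with a toy model. This is not Aspose code: plain lists of page texts stand in for Document objects, the class and method names are hypothetical, and steps 1-2 (compiling and converting) are assumed to have already produced the lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the merge pipeline: each "document" is a list of page texts.
// Plain lists stand in for Aspose.Words Document objects (an assumption made
// so the sketch runs without the library).
class MergePipelineSketch {

    // Steps 3 and 6: drop pages whose text is empty after trimming.
    static List<String> removeBlankPages(List<String> pages) {
        List<String> kept = new ArrayList<>();
        for (String page : pages) {
            if (!page.trim().isEmpty()) {
                kept.add(page);
            }
        }
        return kept;
    }

    // Steps 4 and 5: clean each sub-document, append it to the master,
    // then clean the master once more.
    static List<String> merge(List<List<String>> subDocuments) {
        List<String> master = new ArrayList<>();
        for (List<String> sub : subDocuments) {
            master.addAll(removeBlankPages(sub));
        }
        return removeBlankPages(master);
    }

    public static void main(String[] args) {
        List<List<String>> subs = Arrays.asList(
                Arrays.asList("Cover", "  \r\n"),
                Arrays.asList("", "Chapter 1", "Chapter 2"));
        System.out.println(merge(subs)); // prints [Cover, Chapter 1, Chapter 2]
    }
}
```

In the real pipeline the hard part is deciding what counts as a "page", since page boundaries only exist after layout. The model sidesteps that by treating pages as given.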

As you can see in the attached file, page 19 of 481 is empty, yet it was not removed by the code you suggested.

Here is how I reimplemented it, since I am working in Java:

    /**
     * Removes blank pages from a document.
     *
     * @param doc
     * @return
     * @throws Exception
     */
    public static Document removeBlankPages(Document doc) throws Exception {
        for (Section s : doc.getSections()) {
            if (s.toString(SaveFormat.TEXT).trim().isEmpty()) {
                s.remove();
            }
        }

        LayoutCollector lc = new LayoutCollector(doc);
        int pages = lc.getStartPageIndex(doc.getLastSection().getBody().getLastParagraph());

        for (int i = 1; i <= pages; ++i) {
            StringBuilder pageText = new StringBuilder();
            ArrayList<Paragraph> paragraphs = getParagraphsInPage(i, doc, lc);

            for (Paragraph p : paragraphs) {
                pageText.append(p.toString(SaveFormat.TEXT).trim());
            }

            if (pageText.toString().isEmpty()) {
                for (Paragraph p : paragraphs) {
                    p.remove();
                }
            }
        }

        return doc;
    }

    /**
     * Returns the list of paragraphs on a page.
     *
     * @param page
     * @param document
     * @param lc
     * @return
     * @throws Exception
     */
    @SuppressWarnings("unchecked")
    protected static ArrayList<Paragraph> getParagraphsInPage(int page, Document document, LayoutCollector lc) throws Exception {
        ArrayList<Paragraph> paragraphs = new ArrayList<>();
        NodeCollection<Paragraph> childNodes = document.getChildNodes(NodeType.PARAGRAPH, true);

        for (Paragraph p : childNodes) {
            if (lc.getStartPageIndex(p) == page || p.isEndOfSection()) {
                paragraphs.add(p);
            }
        }

        return paragraphs;
    }

Regards,
Matteo

mtassinari · May 27, 2014, 3:57am

I had the same problem, and your solutions worked for me too, thanks.

I would like to expand the original question: is there a way to programmatically search and remove empty pages inside a document?

tahir.manzoor · May 27, 2014, 12:16pm

Hi Matteo,

Thanks for your inquiry. Please use the following code example to achieve your requirements. I suggest reading the following documentation links for reference:
https://reference.aspose.com/words/net/aspose.words.layout/layoutcollector/
https://reference.aspose.com/words/net/aspose.words.layout/layoutenumerator/

Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "in.docx");
foreach (Section section in doc.Sections)
{
    if (section.ToString(SaveFormat.Text).Trim() == String.Empty)
        section.Remove();
}

//Get Paragraph nodes by page number
private ArrayList GetNodesByPage(int page, Document document)
{
    ArrayList nodes = new ArrayList();
    LayoutCollector lc = new LayoutCollector(document);
    foreach (Paragraph para in document.GetChildNodes(NodeType.Paragraph, true))
    {
        if (lc.GetStartPageIndex(para) == page || para.IsEndOfSection)
            nodes.Add(para);
    }
    return nodes;
}

tahir.manzoor · May 29, 2014, 2:29am

Hi Matteo,

Thanks for your inquiry. If you are using an older version of Aspose.Words, I suggest upgrading to the latest version (v14.4.1).

I have modified the code example. Please check the following highlighted code snippet. Hope this helps you. Please let us know if you have any more queries.

    protected static ArrayList<Paragraph> getParagraphsInPage(int page, Document document) throws Exception {
        ArrayList<Paragraph> paragraphs = new ArrayList<>();
        LayoutCollector lc = new LayoutCollector(document);
        NodeCollection<Paragraph> childNodes = document.getChildNodes(NodeType.PARAGRAPH, true);
        for (Paragraph p : childNodes) {
            if (lc.getStartPageIndex(p) == page) {
                paragraphs.add(p);
            }
        }
        return paragraphs;
    }
    private static void removePageBreaks(Document doc) throws Exception
    {
        // Retrieve all paragraphs in the document.
        NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
        // Iterate through all paragraphs
        for (Paragraph para : (Iterable<Paragraph>) paragraphs)
        {
            // If the paragraph has a page break before set then clear it.
            if (para.getParagraphFormat().getPageBreakBefore())
                para.getParagraphFormat().setPageBreakBefore(false);
            // Check all runs in the paragraph for page breaks and remove them.
            for (Run run : (Iterable<Run>) para.getRuns())
            {
                if (run.getText().contains(ControlChar.PAGE_BREAK))
                    run.setText(run.getText().replace(ControlChar.PAGE_BREAK, ""));
            }
        }
    }
    public static Document removeBlankPages(Document doc) throws Exception {
        LayoutCollector lc = new LayoutCollector(doc);
        for (Section s : doc.getSections()) {
            if (s.toString(SaveFormat.TEXT).trim().isEmpty()) {
                s.remove();
            }
        }
        int pages = lc.getStartPageIndex(doc.getLastSection().getBody().getLastParagraph());
        for (int i = 1; i <= pages; ++i) {
            StringBuilder pageText = new StringBuilder();
            ArrayList<Paragraph> paragraphs = getParagraphsInPage(i, doc);
            for (Paragraph p : paragraphs) {
                pageText.append(p.toString(SaveFormat.TEXT).trim());
                if (!pageText.toString().isEmpty())
                    break;
            }
            if (pageText.toString().isEmpty()) {
                for (Paragraph p : paragraphs) {
                    p.remove();
                }
            }
        }
        return doc;
    }

Document doc = new Document(MyDir + "DVR_3_3_5.doc");
removePageBreaks(doc);
removeBlankPages(doc);
doc.save(MyDir + "Out.doc");

sindhupriyak · November 30, 2016, 6:44am

I am not able to find LayoutCollector in the aspose-words-16.11.0-jdk16 jar. How can I fix it?

tahir.manzoor · December 1, 2016, 4:09am

Hi Sindhu,

Thanks for your inquiry. Please make sure that you are using Aspose.Words for Java 16.11.0. The com.aspose.words.LayoutCollector class exists in the latest version, Aspose.Words v16.11.0.
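
One way to confirm whether the class is actually visible at runtime is a small classpath probe. This is only a sketch: the helper class and method names are hypothetical, and the only detail taken from this thread is the fully qualified class name.

```java
// Checks whether a class can be loaded from the current classpath.
class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the reply above.
        String name = "com.aspose.words.LayoutCollector";
        System.out.println(name + " found: " + isOnClasspath(name));
    }
}
```

If this prints `false` while the jar is supposedly on the classpath, an older Aspose.Words jar or a missing dependency entry in the build is a likely culprit.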

mtassinari · September 21, 2017, 12:34pm

Hi,
is the method described in this thread still valid, or is there a better way to “trim” empty pages at the end of a document now?
Thanks.

tahir.manzoor · September 21, 2017, 5:03pm

@mtassinari,
Thanks for your inquiry. Yes, you can use the same method to remove the empty pages.

mtassinari · September 22, 2017, 8:16am

Hi,
I have implemented the following method:

private Document trimBlankPages(Document doc) throws Exception {
  if (doc != null) {
    for (Section s : doc.getSections()) {
      if (s.toString(SaveFormat.TEXT).trim().isEmpty()) {
        s.remove();
      }
    }

    LayoutCollector lc = new LayoutCollector(doc);
    if (lc != null) {
      Section last = doc.getLastSection();

      if (last != null) {
        Body body = last.getBody();

        if (body != null) {
          int pages = lc.getStartPageIndex(body.getLastParagraph());

          for (int i = 1; i <= pages; ++i) {
            StringBuilder pageText = new StringBuilder();
            java.util.List<Paragraph> paragraphs = getParagraphsInPage(i, doc);

            for (Paragraph p : paragraphs) {
              pageText.append(p.toString(SaveFormat.TEXT).trim());
            }

            if (pageText.toString().isEmpty()) {
              for (Paragraph p : paragraphs) {
                p.remove();
              }
            }
          }
        }
      }
    }
  }

  return doc;
}

private static java.util.List<Paragraph> getParagraphsInPage(int page, Document document) throws Exception {
  java.util.List<Paragraph> paragraphs = new ArrayList<>();
  LayoutCollector lc = new LayoutCollector(document);
  NodeCollection<Paragraph> childNodes = document.getChildNodes(NodeType.PARAGRAPH, true);

  for (Paragraph p : childNodes) {
    if (lc.getStartPageIndex(p) == page || p.isEndOfSection()) {
      paragraphs.add(p);
    }
  }

  return paragraphs;
}

However, when I use it on the attached document, there is still a blank page at the end. What am I doing wrong?
custom_test.zip (61.8 KB)

tahir.manzoor · September 22, 2017, 2:43pm

@mtassinari,
Thanks for your inquiry. You want to remove the last empty page from the document. Please use following code example to get the desired output.

Document document = new Document(MyDir + "custom_test.doc");
while (string.IsNullOrEmpty(document.LastSection.Body.LastParagraph.ToString(SaveFormat.Text).Trim()))
{
    int childnodes = document.LastSection.Body.LastParagraph.ChildNodes.Cast<Node>().Where(child => child.NodeType != NodeType.Run).ToList<Node>().Count;
    if (childnodes > 0)
        break;
    else
        document.LastSection.Body.LastParagraph.Remove();
}

document.Save(MyDir + "17.9.docx");

mtassinari · September 22, 2017, 2:56pm

@tahir.manzoor
I should have perhaps specified that we use Aspose.Words for Java.
How would I “translate” the code you gave me in Java?
The paragraph’s ChildNodes seems to be a NodeCollection, which I do not know how to iterate …
EDIT:
I tried “translating” it as:

Paragraph last = doc.getLastSection().getBody().getLastParagraph();
while (!Strings.isValid(last.toString(SaveFormat.TEXT))) {
	int counter = 0;
	NodeCollection<Node> childNodes = last.getChildNodes();
	for (Node child : childNodes) {
		if (child.getNodeType() != NodeType.RUN) {
			++counter;
		}
	}
	if (counter > 0) {
		break;
	}
	last.remove();
	last = doc.getLastSection().getBody().getLastParagraph();
}

where Strings.isValid(str) returns true if str != null and str.length() > 0, but still the result is unchanged.
EDIT 2:
Dumb me, it works! My mistake was that I didn’t place the .trim() after the .toString(SaveFormat.TEXT), just adding it solved the problem! Many thanks!
However, I wanted to ask for a clarification: after checking that the “string value” of the paragraph is not valid, what is that NodeCollection iteration for? Can’t we just remove the paragraph directly?
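
On the question above: one plausible reading of the loop, offered as an assumption rather than a confirmed answer, is that a paragraph can export as blank text while still holding a non-text child such as an image, and such a paragraph should not be deleted. The sketch below models that with hypothetical stand-in types (`Para`, `Kind`) instead of the Aspose classes. It also assumes that an empty paragraph exports as just a line break, which would explain why the check only worked once `.trim()` was added.

```java
import java.util.Arrays;
import java.util.List;

class EmptyParagraphCheck {

    // Hypothetical stand-in for the node types a paragraph can contain.
    enum Kind { RUN, SHAPE }

    // Hypothetical stand-in for a Paragraph: its exported text and its children.
    static final class Para {
        final String text;
        final List<Kind> children;

        Para(String text, List<Kind> children) {
            this.text = text;
            this.children = children;
        }
    }

    // Mirrors the counter loop: true if any child is something other than a run.
    static boolean hasNonRunChild(Para p) {
        for (Kind k : p.children) {
            if (k != Kind.RUN) {
                return true;
            }
        }
        return false;
    }

    // A paragraph is removable only if its text is blank AND it holds nothing but runs.
    static boolean safeToRemove(Para p) {
        return p.text.trim().isEmpty() && !hasNonRunChild(p);
    }

    public static void main(String[] args) {
        Para empty = new Para("\r\n", Arrays.asList(Kind.RUN));
        Para picture = new Para("\r\n", Arrays.asList(Kind.SHAPE));
        System.out.println("\r\n".length() > 0);   // true: untrimmed text looks non-empty
        System.out.println(safeToRemove(empty));   // true
        System.out.println(safeToRemove(picture)); // false: blank text, but it holds a shape
    }
}
```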

tahir.manzoor · September 22, 2017, 4:41pm

@mtassinari,
Please accept my apologies for your inconvenience. Please use the following Java code example to remove the last empty page of the document. Hope this helps you.
Document document = new Document(MyDir + "custom_test.doc");

while (document.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().equals(""))
{
    if (document.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (document.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH &&
            document.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.TABLE))
        break;
    document.getLastSection().getBody().getLastParagraph().remove();

    // If the current section becomes empty, we should remove it.
    if (!document.getLastSection().getBody().hasChildNodes())
        document.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!document.hasChildNodes())
        break;
}

document.save(MyDir + "output.doc");

mtassinari · September 26, 2017, 2:10pm

This code seems substantially different from the C# version. Are they actually equivalent?

tahir.manzoor · September 26, 2017, 4:18pm

@mtassinari,
Thanks for your inquiry. Yes, this code example is different from the C# example. However, you can get the desired output using this Java code example. Please let us know if you have any more queries.

mtassinari · September 27, 2017, 8:13am

I tried using the last code snippet you gave, but it does not seem to work properly, see the attached example: in before.doc at page 3 there is a page break between the title ALTRI ALLEGATI AL DOCUMENTO and the following table, which gets removed in after.doc.
However, I would only like to remove empty paragraphs/pages at the end of the document, like a sort of “trim” of the document itself.
trim_test.zip (23.7 KB)
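
To make the intended "trim" semantics concrete, here is a minimal sketch on a plain list of paragraph texts. It is a model, not Aspose code: the class and method names are hypothetical, and a form feed (`\f`) is assumed to stand in for a page break.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrailingTrimSketch {

    // Removes blank entries from the END of the list only; blank entries
    // in the middle are left untouched.
    static List<String> trimTrailingBlank(List<String> paragraphs) {
        List<String> result = new ArrayList<>(paragraphs);
        while (!result.isEmpty() && result.get(result.size() - 1).trim().isEmpty()) {
            result.remove(result.size() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "ALTRI ALLEGATI AL DOCUMENTO", "", "table row", "", "\f");
        // The blank entry after the title survives; the two trailing ones do not.
        System.out.println(trimTrailingBlank(doc).size()); // prints 3
    }
}
```

The loop stops at the first non-blank entry it meets from the end, which is the property the document-level code needs to preserve.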

mtassinari · September 27, 2017, 8:31am

Here is another example of the issue, where paragraphs are removed “in the middle” of the document.
another_trim_test.zip (32.9 KB)

tahir.manzoor · September 27, 2017, 3:20pm

@mtassinari,
Please accept my apologies for your inconvenience. Please use the following modified code example to remove the last empty paragraphs and pages from the document.

Document doc = new Document(MyDir + "input.doc");
while (doc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().equals(""))
{
    int counter = 0;
    NodeCollection<Node> childNodes = doc.getLastSection().getBody().getLastParagraph().getChildNodes();
    for (Node child : childNodes) {
        if (child.getNodeType() != NodeType.RUN) {
            ++counter;
        }
    }

    if (counter > 0) {
        break;
    }

    doc.getLastSection().getBody().getLastParagraph().remove();

    if (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH))
        break;

    // If the current section becomes empty, we should remove it.
    if (!doc.getLastSection().getBody().hasChildNodes())
        doc.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.hasChildNodes())
        break;
}

doc.save(MyDir + "output.doc");

mtassinari · September 28, 2017, 8:25am

Thanks, this seems to work properly!

mtassinari · October 6, 2017, 3:43pm

Hi,
sorry to bother you again, we are in the process of updating our templates from DOC to DOCX format, and I still seem to have an issue with “trimming”.
Here is the code I am currently testing:

private static Document trimBlankPages(Document doc) throws Exception {
	Section lastSection;
	Paragraph lastParagraph;
	Node previousSibling;

	while (
		doc.hasChildNodes()
		&& (lastSection = doc.getLastSection()) != null
		&& (lastParagraph = lastSection.getBody().getLastParagraph()) != null
		&& !Strings.isValid(lastParagraph.toString(SaveFormat.TEXT).trim())
	) {
		int counter = 0;
		NodeCollection<Node> childNodes = lastParagraph.getChildNodes();
		for (Node child : childNodes) {
			if (child.getNodeType() != NodeType.RUN) {
				++counter;
			}
		}

		if (counter > 0) {
			break;
		}

		lastParagraph.remove();

		if (
			(lastParagraph = lastSection.getBody().getLastParagraph()) != null
			&& (previousSibling = lastParagraph.getPreviousSibling()) != null
			&& previousSibling.getNodeType() != NodeType.PARAGRAPH
		) {
			break;
		}

		if (!lastSection.getBody().hasChildNodes()) {
			lastSection.remove();
		}
	}

	return doc;
}

However if you check the attached ZIP archive (trim_error.zip (25.8 KB)) you’ll notice that in template.docx there is an empty paragraph after ALTRI ALLEGATI AL DOCUMENTO, which is missing in merged.docx.
As previously stated, I’d like to only remove excess empty paragraphs or page breaks from the end of the document, but those in the middle should remain.
What am I doing wrong?

tahir.manzoor · October 7, 2017, 5:13pm

@mtassinari,
Thanks for your inquiry. We have tested the scenario using same code example and have not found the shared issue. Please use the following code example. We have attached the output DOCX with this post for your kind reference.
output.zip (12.7 KB)

Document doc = new Document(MyDir + "template.docx");
DataSet ds = new DataSet();
ds.readXml(MyDir + "data.xml");

doc.getMailMerge().executeWithRegions(ds);
doc.getRange().getBookmarks().get("_GoBack").remove();
while (doc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
{
    int counter = 0;
    NodeCollection<Node> childNodes = doc.getLastSection().getBody().getLastParagraph().getChildNodes();
    for (Node child : childNodes) {
        if (child.getNodeType() != NodeType.RUN) {
            ++counter;
        }
    }
    if (counter > 0) {
        break;
    }

    doc.getLastSection().getBody().getLastParagraph().remove();

    if (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH))
        break;

    // If the current section becomes empty, we should remove it.
    if (!doc.getLastSection().getBody().hasChildNodes())
        doc.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.hasChildNodes())
        break;
}

doc.save(MyDir + "output.docx");