Re: How to remove blank pages in word files before appending them?

tahir.manzoor · December 1, 2016, 4:09am

Hi Sindhu,

Thanks for your inquiry. Please make sure that you are using Aspose.Words for Java 16.11.0. The com.aspose.words.LayoutCollector class exists in the latest version, Aspose.Words v16.11.0.

mtassinari · September 21, 2017, 12:34pm

Hi,
is the method described in this thread still valid, or is there a better way to “trim” empty pages at the end of a document now?
Thanks.

tahir.manzoor · September 21, 2017, 5:03pm

@mtassinari,
Thanks for your inquiry. Yes, you can use the same method to remove the empty pages.

mtassinari · September 22, 2017, 8:16am

Hi,
I have implemented the following method:

private Document trimBlankPages(Document doc) throws Exception {
  if (doc != null) {
    for (Section s : doc.getSections()) {
      if (s.toString(SaveFormat.TEXT).trim().isEmpty()) {
        s.remove();
      }
    }

    LayoutCollector lc = new LayoutCollector(doc);
    if (lc != null) {
      Section last = doc.getLastSection();

      if (last != null) {
        Body body = last.getBody();

        if (body != null) {
          int pages = lc.getStartPageIndex(body.getLastParagraph());

          for (int i = 1; i <= pages; ++i) {
            StringBuilder pageText = new StringBuilder();
            java.util.List<Paragraph> paragraphs = getParagraphsInPage(i, doc);

            for (Paragraph p : paragraphs) {
              pageText.append(p.toString(SaveFormat.TEXT).trim());
            }

            if (pageText.toString().isEmpty()) {
              for (Paragraph p : paragraphs) {
                p.remove();
              }
            }
          }
        }
      }
    }
  }

  return doc;
}

private static java.util.List<Paragraph> getParagraphsInPage(int page, Document document) throws Exception {
  java.util.List<Paragraph> paragraphs = new ArrayList<>();
  LayoutCollector lc = new LayoutCollector(document);
  NodeCollection<Paragraph> childNodes = document.getChildNodes(NodeType.PARAGRAPH, true);

  for (Paragraph p : childNodes) {
    if (lc.getStartPageIndex(p) == page || p.isEndOfSection()) {
      paragraphs.add(p);
    }
  }

  return paragraphs;
}

However, when I run it on the attached document, the result still has a blank page at the end. What am I doing wrong?
custom_test.zip (61.8 KB)

tahir.manzoor · September 22, 2017, 2:43pm

@mtassinari,
Thanks for your inquiry. You want to remove the last empty page from the document. Please use the following code example to get the desired output.

Document document = new Document(MyDir + "custom_test.doc");
while (string.IsNullOrEmpty(document.LastSection.Body.LastParagraph.ToString(SaveFormat.Text).Trim()))
{
    int childnodes = document.LastSection.Body.LastParagraph.ChildNodes.Cast<Node>().Where(child => child.NodeType != NodeType.Run).ToList<Node>().Count;
    if (childnodes > 0)
        break;
    else
        document.LastSection.Body.LastParagraph.Remove();
}

document.Save(MyDir + "17.9.docx");

mtassinari · September 22, 2017, 2:56pm

@tahir.manzoor
I should have perhaps specified that we use Aspose.Words for Java.
How would I “translate” the code you gave me in Java?
The paragraph’s ChildNodes seems to be a NodeCollection which I do not know how to iterate …
EDIT:
I tried “translating” it as:

Paragraph last = doc.getLastSection().getBody().getLastParagraph();
while (!Strings.isValid(last.toString(SaveFormat.TEXT))) {
	int counter = 0;
	NodeCollection<Node> childNodes = last.getChildNodes();
	for (Node child : childNodes) {
		if (child.getNodeType() != NodeType.RUN) {
			++counter;
		}
	}
	if (counter > 0) {
		break;
	}
	last.remove();
	last = doc.getLastSection().getBody().getLastParagraph();
}

where Strings.isValid(str) returns true if str != null and str.length() > 0, but still the result is unchanged.
EDIT 2:
Dumb me, it works! My mistake was that I didn’t place the .trim() after the .toString(SaveFormat.TEXT), just adding it solved the problem! Many thanks!
However, I wanted to ask for a clarification: after checking that the “string value” of the paragraph is not valid, what is that NodeCollection iteration for? Can’t we just remove the paragraph directly?
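
The effect of the missing .trim() can be reproduced without Aspose.Words. The sketch below assumes that an empty paragraph exports as a bare line break ("\r\n"); that value is an assumption for illustration, and BlankCheck and isBlank are hypothetical names. isValid mirrors the Strings.isValid helper as described above.

```java
class BlankCheck {
    // Mirrors the Strings.isValid helper described above: non-null and non-empty.
    static boolean isValid(String str) {
        return str != null && str.length() > 0;
    }

    // Treats null, or text made up only of whitespace/control characters, as blank.
    static boolean isBlank(String str) {
        return str == null || str.trim().isEmpty();
    }

    public static void main(String[] args) {
        String exported = "\r\n"; // assumed text export of an empty paragraph
        System.out.println(isValid(exported));        // true: the length is 2
        System.out.println(isValid(exported.trim())); // false: nothing is left after trim()
        System.out.println(isBlank("\f\r\n"));        // true: trim() also strips the form feed
    }
}
```

String.trim() strips every leading and trailing character up to U+0020, so line breaks and the form feed character are removed along with ordinary spaces.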

tahir.manzoor · September 22, 2017, 4:41pm

@mtassinari,
Please accept my apologies for the inconvenience. Please use the following Java code example to remove the last empty page of the document. Hope this helps you.
Document document = new Document(MyDir + "custom_test.doc");

while (document.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().equals(""))
{
    if (document.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (document.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH &&
            document.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.TABLE))
        break;
    document.getLastSection().getBody().getLastParagraph().remove();

    // If the current section becomes empty, we should remove it.
    if (!document.getLastSection().getBody().hasChildNodes())
        document.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!document.hasChildNodes())
        break;
}

document.save(MyDir + "output.doc");

mtassinari · September 26, 2017, 2:10pm

This code seems substantially different from the C# version, are they actually equivalent?

tahir.manzoor · September 26, 2017, 4:18pm

@mtassinari,
Thanks for your inquiry. Yes, this code example is different from the C# example. However, you can get the desired output using this Java code example. Please let us know if you have any more queries.

mtassinari · September 27, 2017, 8:13am

I tried using the last code snippet you gave, but it does not seem to work properly. See the attached example: in before.doc, on page 3, there is a page break between the title ALTRI ALLEGATI AL DOCUMENTO and the following table, and it is removed in after.doc.
However, I would only like to remove empty paragraphs/pages at the end of the document, like a sort of “trim” of the document itself.
trim_test.zip (23.7 KB)
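
The "trim" behaviour requested here can be sketched on a simplified model in which the document is just a list of paragraph texts. This is only an illustration of the intended semantics, not Aspose.Words code: TrailingTrim and trimTrailingBlanks are hypothetical names, and "\f" stands in for a paragraph that holds nothing but a page break.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrailingTrim {
    // Removes blank entries from the end only and stops at the first entry with text,
    // so blank entries in the middle of the list are left untouched.
    static List<String> trimTrailingBlanks(List<String> paragraphs) {
        List<String> result = new ArrayList<>(paragraphs);
        while (!result.isEmpty() && result.get(result.size() - 1).trim().isEmpty()) {
            result.remove(result.size() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("Title", "\f", "Table text", "", "\f", "");
        // The page break between "Title" and "Table text" survives; the three trailing blanks do not.
        System.out.println(trimTrailingBlanks(doc).size()); // 3
    }
}
```

The loop only ever inspects the last element, which is what keeps content in the middle of the document safe.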

mtassinari · September 27, 2017, 8:31am

Here is another example of the issue, where paragraphs are removed “in the middle” of the document.
another_trim_test.zip (32.9 KB)

tahir.manzoor · September 27, 2017, 3:20pm

@mtassinari,
Please accept my apologies for your inconvenience. Please use the following modified code example to remove the last empty paragraphs and pages from the document.

Document doc = new Document(MyDir + "input.doc");
while (doc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().equals(""))
{
    int counter = 0;
    NodeCollection<Node> childNodes = doc.getLastSection().getBody().getLastParagraph().getChildNodes();
    for (Node child : childNodes) {
        if (child.getNodeType() != NodeType.RUN) {
            ++counter;
        }
    }

    if (counter > 0) {
        break;
    }

    doc.getLastSection().getBody().getLastParagraph().remove();

    if (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH))
        break;

    // If the current section becomes empty, we should remove it.
    if (!doc.getLastSection().getBody().hasChildNodes())
        doc.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.hasChildNodes())
        break;
}

doc.save(MyDir + "output.doc");
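
The counter loop in the code above treats the last paragraph as removable only when every one of its children is a run. One plausible reading is that a paragraph can have blank text and still hold content such as a shape or a comment. The stand-alone sketch below models that guard; Kind is a hypothetical stand-in for the node types, not the Aspose.Words NodeType constants, and the assumption that such a paragraph exports as blank text is not verified against the library.

```java
import java.util.Arrays;
import java.util.List;

class RemovableCheck {
    // Hypothetical stand-in for node kinds; not the Aspose.Words NodeType constants.
    enum Kind { RUN, SHAPE, COMMENT }

    // Counts children that are not runs, like the counter loop above.
    static int countNonRuns(List<Kind> children) {
        int counter = 0;
        for (Kind child : children) {
            if (child != Kind.RUN) {
                ++counter;
            }
        }
        return counter;
    }

    // A paragraph is dropped only if its text is blank and it holds nothing but runs.
    static boolean isRemovable(String text, List<Kind> children) {
        return text.trim().isEmpty() && countNonRuns(children) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isRemovable("\r\n", Arrays.asList(Kind.RUN)));             // true
        System.out.println(isRemovable("\r\n", Arrays.asList(Kind.RUN, Kind.SHAPE))); // false
    }
}
```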

mtassinari · September 28, 2017, 8:25am

Thanks, this seems to work properly!

mtassinari · October 6, 2017, 3:43pm

Hi,
sorry to bother you again, we are in the process of updating our templates from DOC to DOCX format, and I still seem to have an issue with “trimming”.
Here is the code I am currently testing:

private static Document trimBlankPages(Document doc) throws Exception {
	Section lastSection;
	Paragraph lastParagraph;
	Node previousSibling;

	while (
		doc.hasChildNodes()
		&& (lastSection = doc.getLastSection()) != null
		&& (lastParagraph = lastSection.getBody().getLastParagraph()) != null
		&& !Strings.isValid(lastParagraph.toString(SaveFormat.TEXT).trim())
	) {
		int counter = 0;
		NodeCollection<Node> childNodes = lastParagraph.getChildNodes();
		for (Node child : childNodes) {
			if (child.getNodeType() != NodeType.RUN) {
				++counter;
			}
		}

		if (counter > 0) {
			break;
		}

		lastParagraph.remove();

		if (
			(lastParagraph = lastSection.getBody().getLastParagraph()) != null
			&& (previousSibling = lastParagraph.getPreviousSibling()) != null
			&& previousSibling.getNodeType() != NodeType.PARAGRAPH
		) {
			break;
		}

		if (!lastSection.getBody().hasChildNodes()) {
			lastSection.remove();
		}
	}

	return doc;
}

However if you check the attached ZIP archive (trim_error.zip (25.8 KB)) you’ll notice that in template.docx there is an empty paragraph after ALTRI ALLEGATI AL DOCUMENTO, which is missing in merged.docx.
As previously stated, I’d like to only remove excess empty paragraphs or page breaks from the end of the document, but those in the middle should remain.
What am I doing wrong?

tahir.manzoor · October 7, 2017, 5:13pm

@mtassinari,
Thanks for your inquiry. We have tested the scenario using the same code example and could not reproduce the issue you reported. Please use the following code example. We have attached the output DOCX to this post for your reference.
output.zip (12.7 KB)

Document doc = new Document(MyDir + "template.docx");
DataSet ds = new DataSet();
ds.readXml(MyDir + "data.xml");

doc.getMailMerge().executeWithRegions(ds);
doc.getRange().getBookmarks().get("_GoBack").remove();
while (doc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
{
    int counter = 0;
    NodeCollection<Node> childNodes = doc.getLastSection().getBody().getLastParagraph().getChildNodes();
    for (Node child : childNodes) {
        if (child.getNodeType() != NodeType.RUN) {
            ++counter;
        }
    }
    if (counter > 0) {
        break;
    }

    doc.getLastSection().getBody().getLastParagraph().remove();

    if (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH))
        break;

    // If the current section becomes empty, we should remove it.
    if (!doc.getLastSection().getBody().hasChildNodes())
        doc.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.hasChildNodes())
        break;
}

doc.save(MyDir + "output.docx");

mtassinari · October 9, 2017, 7:29am

Your output is quite different from mine. I probably should have specified that these are the cleanup options used for the mail merge:

private static void merge(Document document, InputStream xmlStream, boolean withRegions) throws Exception {
	com.aspose.words.MailMerge mm = document.getMailMerge();
	mm.setFieldMergingCallback(new ImageMerge());
	mm.setTrimWhitespaces(true);

	com.aspose.words.net.System.Data.DataSet dataSet = new com.aspose.words.net.System.Data.DataSet();
	dataSet.readXml(xmlStream);

	if (withRegions) {
		mm.setMergeDuplicateRegions(true);

		mm.setCleanupOptions(
			MailMergeCleanupOptions.REMOVE_UNUSED_REGIONS
			| MailMergeCleanupOptions.REMOVE_EMPTY_TABLE_ROWS
			| MailMergeCleanupOptions.REMOVE_EMPTY_PARAGRAPHS
		);

		// run the merge
		mm.executeWithRegions(dataSet);
	}

	mm.setCleanupOptions(
		MailMergeCleanupOptions.REMOVE_UNUSED_FIELDS
		| MailMergeCleanupOptions.REMOVE_CONTAINING_FIELDS
		| MailMergeCleanupOptions.REMOVE_EMPTY_PARAGRAPHS
	);

	// run the merge
	mm.execute(dataSet.getTables().get(0));

	// clean up fields
	FieldsHelper.convertFieldsToStaticText(document, FieldType.FIELD_IF);
}

tahir.manzoor · October 9, 2017, 12:05pm

@mtassinari,
Please accept my apologies for the inconvenience. The code snippet shared in my previous post needs a small change. Please note the new position of the getLastParagraph().remove() call in the following modified code snippet.

doc.getRange().getBookmarks().get("_GoBack").remove();
while (doc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
{
    int counter = 0;
    NodeCollection<Node> childNodes = doc.getLastSection().getBody().getLastParagraph().getChildNodes();

    for (Node child : childNodes) {
        if (child.getNodeType() != NodeType.RUN) {
            ++counter;
        }
    }
    if (counter > 0) {
        break;
    }

    if (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH))
        break;

    doc.getLastSection().getBody().getLastParagraph().remove();

    // If the current section becomes empty, we should remove it.
    if (!doc.getLastSection().getBody().hasChildNodes())
        doc.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.hasChildNodes())
        break;
}

mtassinari · October 11, 2017, 7:12am

Hi again,
I am still having an issue with this, please see the attached example: another_trim_error.zip (220.9 KB)
As you can see, the merged file has a blank page at the end, which I do not want.
Please also notice that I need to implement a generic solution, and not something that works for this specific template only.
Can you help me? The functionality I’d need is to “trim” all empty paragraphs and page breaks from the end of the document.

tahir.manzoor · October 11, 2017, 10:44am

@mtassinari,
Thanks for your inquiry. Please use the following generic solution to remove empty paragraphs from the end of the document. Hope this helps you.

if (doc.getRange().getBookmarks().get("_GoBack") != null)
    doc.getRange().getBookmarks().get("_GoBack").remove();

while (doc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
{
    if(doc.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.SHAPE, true).getCount() > 0
            || doc.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.GROUP_SHAPE, true).getCount() > 0
            || doc.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.FORM_FIELD, true).getCount() > 0
            || doc.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.FOOTNOTE, true).getCount() > 0
            || doc.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.COMMENT, true).getCount() > 0
            )
        break;

    //Check if last paragraph contains the page break
    if(doc.getLastSection().getBody().getLastParagraph().isEndOfDocument())
    {
        doc.getLastSection().getBody().getLastParagraph().getRange().replace(ControlChar.PAGE_BREAK, "", new FindReplaceOptions());
    }

    if (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
            (doc.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH))
        break;

    doc.getLastSection().getBody().getLastParagraph().remove();

    // If the current section becomes empty, we should remove it.
    if (!doc.getLastSection().getBody().hasChildNodes())
        doc.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.hasChildNodes())
        break;
}

awais.hafeez · March 26, 2019, 2:59pm

A post was merged into an existing topic: Remove blank empty pages from Word document