How to Remove Empty Paragraph after Inserting HTML into Document using Java

Hi,

I am creating a Word document into which I have to insert HTML content (a string read from a file), wrapped between hidden start and end tags. The start tag sits directly against the HTML content, but an extra blank line appears between the HTML content and the end tag, and I do not want it. Please help me remove that extra blank line.
I am attaching my source code along with the actual and expected output: Extra space issue poc.zip (38.5 KB)

Please help me to find a solution for this issue.

Thank you

@Gptrnt

Please note that a minimal valid Body node must contain at least one Paragraph. So when a document is created from HTML, an empty paragraph exists at the end of the document. You can remove it with the following modified code before calling the DocumentBuilder.insertDocument method.

public static Document generateDocument(Document document) throws Exception {
    // dstDoc.protect(ProtectionType.READ_ONLY);
    // Create a builder for the destination document.
    DocumentBuilder builder = new DocumentBuilder(document);
    try {
        insertHiddenWord(builder, "t", false);
        // Load the HTML string into a temporary document.
        ByteArrayInputStream bais = new ByteArrayInputStream(description().getBytes());
        LoadOptions opts = new LoadOptions();
        opts.setLoadFormat(LoadFormat.HTML);
        Document tempDoc = new Document(bais, opts);
        // Remove the trailing paragraph only when it has no child nodes.
        Paragraph lastParagraph = tempDoc.getLastSection().getBody().getLastParagraph();
        if (!lastParagraph.hasChildNodes())
            lastParagraph.remove();
        builder.insertDocument(tempDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        insertHiddenWord(builder, "t", true);
    } catch (Exception e) {
        System.out.println("Error while insert html to the doc");
    }

    return document;
}

Hi,

My HTML content comes from the customer, so it may contain extra line breaks at the end.
If I run the given code, I may lose the last empty paragraph that the customer added. I want to produce the Word document with exactly the HTML that the customer provided.

@Gptrnt

In this case, we suggest that you use the DocumentBuilder.InsertHtml(String, HtmlInsertOptions) method to insert the HTML, passing HtmlInsertOptions.RemoveLastEmptyParagraph as the second parameter. This option removes the empty paragraph that is normally inserted after HTML that ends with a block-level element.

Hi,

The code you suggested does remove the extra line break at the end. But as you can see in my code, I insert the HTML with builder.insertDocument to avoid a line break at the start (between the start tag and the HTML). If I use your solution, the HTML starts with an extra blank line. I am attaching the updated source code with your solution and its output: Extra space issue poc (2).zip (27.5 KB)

@Gptrnt

Please share the input document that your customers use, along with the problematic output document. We will check your documents and write a code example that meets your requirement.

Hi,

I am uploading changed sample code that imports Word content as HTML. I am attaching the input document input.docx (15.1 KB), in which I have added two pieces of content wrapped in tags. The content between each start tag (|t1| or |t2|) and end tag (|/t1| or |/t2|) is extracted, converted to HTML, and stored in the database. The same HTML is later read back, turned into a document, and downloaded, so the downloaded document should be identical to the uploaded one. But my code generates an extra line break after the table: output.docx (8.7 KB) (please see the first case). If I use your solution above, the line break after the table in the second case, which I want to keep in the downloaded document, is removed. My input and output documents should be the same.

I am attaching the changed sample code: wrdHtmlWithReplacePoc.zip (49.5 KB)

@Gptrnt

Thanks for sharing the details. You can use the code example shared in my old post here.

This code example does not remove the extra line breaks that your customer added to the document. It removes only the last empty paragraph of the document.
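The rule described above (drop at most one trailing paragraph, and only if it is empty) can be sketched in plain Java. This is only a model of the logic, not Aspose.Words code: each string stands in for the text of one paragraph, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model (not the Aspose.Words API): one String per paragraph.
class TrailingParagraphModel {

    // Returns a copy with at most ONE trailing paragraph removed,
    // and only when that paragraph is blank after trimming.
    static List<String> removeLastIfEmpty(List<String> paragraphs) {
        List<String> result = new ArrayList<>(paragraphs);
        int last = result.size() - 1;
        if (last >= 0 && result.get(last).trim().isEmpty()) {
            result.remove(last);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed scenario: the customer's content ends with one blank
        // paragraph, and the HTML import appended one more.
        List<String> imported = Arrays.asList("Intro", "", "Table row", "", "");
        System.out.println(removeLastIfEmpty(imported));
    }
}
```

In this model the blank paragraph in the middle and the first of the two trailing blank paragraphs survive; only the final one is dropped.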

Hi,

I tried your solution; my output document is attached: output.docx (8.6 KB). In the second case of the input document there is a line break after the table, and it is missing from the output. I am attaching the sample code updated with your solution: wrdHtmlWithReplacePoc (2).zip (51.0 KB)

@Gptrnt

We have reviewed your code and noticed that the document generated after extracting contents does not contain the last empty paragraph. We are investigating this issue and will get back to you soon.

@Gptrnt

Please use the following modified methods to get the desired output. The modified code is marked with //Modified code... comments. We have attached the output document to this post for your reference: 21.9 output.docx (8.7 KB)

public static Document generateDocument(List<String> htmlList) throws Exception {
    Document document = new Document();
    DocumentBuilder builder = new DocumentBuilder(document);
    for (int i=0;i<2; i++){
        String html = htmlList.get(i);//.replace("-aw-import:ignore", "");
        System.out.println(html);
        try {
            insertHiddenWord(builder, "t" +( i + 1), false);
            ByteArrayInputStream bais = new ByteArrayInputStream(html.getBytes());
            LoadOptions opts = new LoadOptions();
            opts.setLoadFormat(LoadFormat.HTML);
            Document tempDoc = new Document(bais, opts);
            
            //Modified code...
            if (tempDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
                tempDoc.getLastSection().getBody().getLastParagraph().remove();
            //Modified code...
            builder.insertDocument(tempDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            
            insertHiddenWord(builder, "t" + (i + 1), true);
        } catch (Exception e) {
            System.out.println("Error while insert html to the doc");
        }
    }

    return document;
}

private static LinkedList<String> getHtmlContentFromBookMark(BookmarkCollection bookmarkCollection, Document document) {
    try {
        LinkedList<String> list = new LinkedList<>();
        for (int i=1;i<3;i++) {
            Bookmark bookmark = bookmarkCollection.get("t" + i);
            Node startNode = bookmark.getBookmarkStart();
            Node endNode = bookmark.getBookmarkEnd();
            //Modified code...
            ArrayList<Node> extractedNodes = extractContent(startNode, endNode, true);
            //Modified code...
            Document dstHTML = generateDocument(document, extractedNodes);
            HtmlSaveOptions saveOptions = new HtmlSaveOptions();
            saveOptions.setSaveFormat(SaveFormat.HTML);
            saveOptions.setExportImagesAsBase64(true);
            saveOptions.setExportListLabels(ExportListLabels.AS_INLINE_TEXT);
            list.add(dstHTML.toString(saveOptions));
        }
        return list;
    }catch (Exception e){
        System.out.println("error while fetching bookmark");
    }
    return null;
}
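
The two versions of the check in this thread differ: the first tests hasChildNodes(), while the later one tests whether the paragraph text is empty after trimming. A plain-Java sketch of that difference follows. It is a model only: a paragraph is represented as a list of run strings, and the class and method names are hypothetical stand-ins, not Aspose.Words types. How Aspose.Words represents a given HTML fragment is not shown here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model (not the Aspose.Words API): a paragraph is a list of runs.
class EmptyParagraphChecks {

    // Models the first check: does the paragraph contain any child at all?
    static boolean hasChildNodes(List<String> runs) {
        return !runs.isEmpty();
    }

    // Models the later check: is the combined text empty after trimming?
    static boolean isBlankText(List<String> runs) {
        return String.join("", runs).trim().isEmpty();
    }

    public static void main(String[] args) {
        List<String> noRuns = Collections.emptyList();
        List<String> whitespaceRun = Arrays.asList(" ");
        List<String> realText = Arrays.asList("Total", " 42");

        for (List<String> p : Arrays.asList(noRuns, whitespaceRun, realText)) {
            System.out.println(p + " hasChildNodes=" + hasChildNodes(p)
                    + " isBlankText=" + isBlankText(p));
        }
    }
}
```

In the model, a paragraph holding a single whitespace-only run has child nodes but blank text, so only the text-based check would treat it as empty.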

Hi Tahir,

Thank you for your help. It’s working perfectly.

@Gptrnt

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Hi,

In our system we use CKEditor in the UI, so when a document is created the HTML sometimes comes from CKEditor. My problem is this: if a table is the last item in the HTML, the downloaded document does not apply the paragraph "space after" following the table, so the content looks cramped with no spacing. If the table is in the middle, it renders properly. I am uploading the sample output.docx (8.8 KB) and attaching the sample code with one CKEditor HTML output: wrdHtmlWithReplacePoc (3).zip (41.6 KB)

Thank you

@Gptrnt

Please note that Aspose.Words mimics the behavior of MS Word. If you perform the same scenario using MS Word, you will get the same output.

The "space after" value is set on the paragraphs in the HTML. You can use the ParagraphFormat.SpaceAfter property as shown below to get the desired output. Hope this helps you.

Document tempDoc = new Document(bais, opts);
if(tempDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
    tempDoc.getLastSection().getBody().getLastParagraph().remove();

tempDoc.getLastSection().getBody().getLastParagraph().getParagraphFormat().setSpaceAfter(0.0);

Hi,

I tried the above solution and set a default "space after" value of 12, but I could not see any change in my output. I am attaching the output, output.docx (8.8 KB), and the sample code updated with the above solution: wrdHtmlWithReplacePoc (4).zip (41.4 KB)

Thank you

@Gptrnt

In your code, you are setting the "space after" value to 12.0. Please check the attached image for details.
space after.png (37.7 KB)

Please set it to 0.0 as suggested in my previous post.

Hi,
I have tried that as well, but the output is the same: output.docx (8.8 KB). I am attaching the sample code: wrdHtmlWithReplacePoc (5).zip (41.4 KB)

Thank you

@Gptrnt

Please check the attached screenshot. The paragraph "space after" is 0.0 for the desired paragraph. You can use the same approach to set the paragraph properties for all other paragraphs.
space after.png (73.3 KB)

Hi,

I think you have not understood my problem. In the first case of the output (between the hidden t1 tags), the table at the end has no paragraph spacing after it. You can see in output.docx (8.8 KB) that only the end of that table is cramped; everything else keeps its paragraph spacing. This happens only when the table is at the end. I want the table at the end to keep the same "space after" value as well.

Thank you