Extract Content from Word document and Convert it to HTML | Bookmark Paragraphs using Java | Ensure Removal of Blank Lines

Gptrnt · June 29, 2020, 9:49am

Hi,

I am extracting content from an uploaded Word document and converting it to HTML. If the extracted content has a table at the end, then when I recreate the Word file (which should match the imported document) from the converted HTML, an extra blank line appears at the end. The converted HTML itself also contains an extra P tag at the end. I am uploading the source along with the input and output: ExtraSpaceIssue.zip (147.1 KB)

Is there any setting I need to apply before converting to HTML?

awais.hafeez · June 30, 2020, 5:06am

@Gptrnt,

I have generated an output DOCX file by using the Java code you provided and attached it here for your reference:

output - 20.6.zip (14.0 KB)

Do you see the same problem in this output DOCX file? If yes, then please create and attach a comparison screenshot which highlights (encircles) the problematic area in this Aspose.Words 20.6 generated DOCX (with respect to expected document). We will then investigate the issue further and provide you more information.

Please also share this converted HTML file with us for further testing.
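
To check quickly whether the converted HTML itself ends with an empty paragraph right after the table, a string-level test can help before sharing the file. This is only a minimal sketch in plain Java, not part of Aspose.Words: the class and method names are hypothetical, and it assumes the empty paragraph is written as a `<p>` containing nothing but whitespace, non-breaking spaces, or `<span>` wrappers.

```java
import java.util.regex.Pattern;

final class TrailingParagraphCheck {
    // Markup treated as "empty" inside a paragraph: whitespace, non-breaking
    // spaces (raw or as entities) and <span> wrappers. This list is an assumption.
    private static final String EMPTY =
            "(?:\\s|\\u00A0|&nbsp;|&#xa0;|&#160;|</?span\\b[^>]*>)*";

    // An empty <p> directly after </table>, followed only by closing wrappers.
    private static final Pattern TRAILING = Pattern.compile(
            "</table>\\s*<p\\b[^>]*>" + EMPTY + "</p>\\s*"
                    + "(?:</div>\\s*)*(?:</body>\\s*</html>\\s*)?\\z",
            Pattern.CASE_INSENSITIVE);

    /** Returns true when the HTML ends with an empty paragraph right after a table. */
    static boolean endsWithEmptyParagraphAfterTable(String html) {
        return TRAILING.matcher(html).find();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>cell</td></tr></table><p><span>&#xa0;</span></p>";
        System.out.println(endsWithEmptyParagraphAfterTable(html));
    }
}
```

If the check reports true for the converted HTML but the input document has no paragraph after the table, the extra paragraph is already present in the HTML rather than being introduced when the DOCX is rebuilt.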

Gptrnt · June 30, 2020, 5:27am

I can see the issue in the attached file. I have circled the extra space in the screenshot below: space_Issue.png (187.8 KB)

awais.hafeez · June 30, 2020, 2:49pm

@Gptrnt,

One simple way to work around this problem is to post-process the final document and remove the empty Paragraphs that appear right after Tables, using the following code:

...
...
BookmarkCollection bookmarkCollection = document.getRange().getBookmarks();
HtmlSaveOptions saveOptions = htmlSaveOption();
Item item = SaveInItem(bookmarkCollection, document, saveOptions);
// Item item = createItem();
Document outputDocument = generateDocument(item);

for (Table table : (Iterable<Table>) outputDocument.getChildNodes(NodeType.TABLE, true)) {
    // The next sibling may be null or another Table, so check the type instead of casting blindly
    Node next = table.getNextSibling();
    if (next instanceof Paragraph && next.toString(SaveFormat.TEXT).trim().isEmpty())
        next.remove();
}

outputDocument.save("E:\\Temp\\ExtraSpaceIssue\\20.6.docx", SaveFormat.DOCX);
...
...

Gptrnt · July 1, 2020, 11:42am

I tried your code, but it also removes the empty space that I added myself. I want to remove only the extra space added by Aspose. My generated document should be exactly the same as the input document, but when a table is at the end, an extra line break is created that is not in the uploaded (input) document.

awais.hafeez · July 2, 2020, 7:04am

@Gptrnt,

Please see the following changes in your “addAgendaItemContent” method:

private static void addAgendaItemContent(DocumentBuilder builder, String htmlString, String field) throws Exception {
    try {
        if (!htmlString.equals("")) {
            ByteArrayInputStream bais = new ByteArrayInputStream(htmlString.getBytes());
            LoadOptions opts = new LoadOptions();
            opts.setLoadFormat(LoadFormat.HTML);
            Document tempDoc = new Document(bais, opts);

            // Remove the last paragraph of the imported fragment only if it is empty
            Paragraph lastPara = tempDoc.getLastSection().getBody().getLastParagraph();
            if (lastPara != null && lastPara.isEndOfSection() && lastPara.toString(SaveFormat.TEXT).trim().isEmpty())
                lastPara.remove();

            builder.insertDocument(tempDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        }
    } catch (Exception e) {
        builder.insertHtml(htmlString + "</p>", Constant.USE_BUILDER_FORMATTING);
        System.out.println("Error while creating document for inserting " + field);
        e.printStackTrace();
    }
}

Gptrnt · July 3, 2020, 5:52pm

Hi Awais,

I tried the solution, but it also deletes a space that the user added after a table. In this input document, input.zip (13.4 KB), you can see that the first table has two spaces added after the table and before |/p|, the second table has one space, and the last one has none.
But in the output, output.zip (13.0 KB), only one space is shown after the first table and none after the second. I want the output to be exactly the same as the input.

Thank you

awais.hafeez · July 4, 2020, 7:02am

@Gptrnt,

For this case, please discard the code from my previous post and try the following new code:

Document document = new Document("E:\\Temp\\ExtraSpaceIssue\\input.docx");
DocumentBuilder builder = new DocumentBuilder(document);

// Temporarily add Hidden Bookmarks to all Paragraphs in input document
int bm_idx = 0;
for (Paragraph para : (Iterable<Paragraph>) document.getChildNodes(NodeType.PARAGRAPH, true)) {
    builder.moveTo(para);
    builder.startBookmark("_bm_" + bm_idx);
    builder.endBookmark("_bm_" + bm_idx);
    bm_idx++;
}

FindReplaceOptions options = new FindReplaceOptions(FindReplaceDirection.BACKWARD);
AsposeReplaceCallBack replaceCallBack = new AsposeReplaceCallBack();
options.setReplacingCallback(replaceCallBack);
//extraction call back
Pattern pattern = Pattern.compile(Constant.EXTRACT_TOKEN_REGEX, Pattern.CASE_INSENSITIVE);
document.getRange().replace(pattern, Constant.EMPTY_STRING, options);

BookmarkCollection bookmarkCollection = document.getRange().getBookmarks();
HtmlSaveOptions saveOptions = htmlSaveOption();
Item item = SaveInItem(bookmarkCollection, document, saveOptions);

Document outputDocument = generateDocument(item);

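// Remove empty Paragraphs that carry no hidden bookmark, i.e. that were not in the input document.
// Iterate over a snapshot (toArray) because the node collection is live and nodes are removed in the loop.
for (Node node : outputDocument.getChildNodes(NodeType.PARAGRAPH, true).toArray()) {
    Paragraph para = (Paragraph) node;
    if (para.toString(SaveFormat.TEXT).trim().isEmpty()) {
        boolean isRemove = true;
        for (Bookmark bm : para.getRange().getBookmarks()) {
            if (bm.getBookmarkStart().getName().startsWith("_bm_")) {
                isRemove = false;
                break;
            }
        }
        if (isRemove)
            para.remove();
    }
}

// Remove all hidden bookmarks that we added; go backwards so that removals cannot skip entries
BookmarkCollection bookmarks = outputDocument.getRange().getBookmarks();
for (int i = bookmarks.getCount() - 1; i >= 0; i--) {
    if (bookmarks.get(i).getBookmarkStart().getName().startsWith("_bm_"))
        bookmarks.get(i).remove();
}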

outputDocument.save("E:\\Temp\\ExtraSpaceIssue\\20.6.docx", SaveFormat.DOCX);
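
The idea behind the code above can be reduced to plain Java with no Aspose.Words dependency: tag every paragraph of the input, then after the round trip drop only the blank paragraphs that carry no tag. The following is a rough sketch of that rule only. The `Para` class and the `prune` method are hypothetical stand-ins for illustration, and the sketch assumes the `_bm_` bookmarks survive the conversion to HTML and back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MarkerPruneSketch {
    /** Hypothetical stand-in for a paragraph: its text plus the bookmark it carries, if any. */
    static final class Para {
        final String text;
        final String bookmark; // null when the paragraph has no bookmark

        Para(String text, String bookmark) {
            this.text = text;
            this.bookmark = bookmark;
        }
    }

    static final String PREFIX = "_bm_";

    /** Keeps every paragraph except blank ones that lack a marker bookmark. */
    static List<Para> prune(List<Para> paragraphs) {
        List<Para> kept = new ArrayList<>();
        for (Para p : paragraphs) {
            boolean blank = p.text.trim().isEmpty();
            boolean marked = p.bookmark != null && p.bookmark.startsWith(PREFIX);
            if (!blank || marked) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Paragraphs as they might look after the round trip: the last one is
        // blank and has no marker, so it was not part of the input.
        List<Para> roundTripped = Arrays.asList(
                new Para("Heading", "_bm_0"),
                new Para("", "_bm_1"),   // blank line typed by the user: kept
                new Para("", null));     // blank line added on import: dropped
        System.out.println(prune(roundTripped).size());
    }
}
```

Blank paragraphs typed by the user keep their marker and stay in place, while a blank paragraph that appears only after the import has no marker and is removed.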