Extract Content & Preserve Font Name & Size during Converting Word to HTML using Java | CKEditor

Gptrnt · July 3, 2020, 4:00pm

Hi,
When I extract content and convert it to HTML, some paragraphs take the font name Times New Roman and font size 12, which are not in the uploaded document (in the uploaded document the font name is Blackadder ITC and the font size is 14). FontIssue.zip (201.5 KB)

Please check the source code, input document, and output document in the attachment. This issue is critical for us, so please provide a workaround as soon as possible.

Thank you

awais.hafeez · July 4, 2020, 10:01am

@Gptrnt,

The problem occurs because this particular “Template.docx” does not have the styles required to format the final document. I have attached a new Template document here for your reference:

template.zip (22.2 KB)

I have also modified “generateDocument” method. Please use the following code:

import java.util.ArrayList;

import com.aspose.words.Document;
import com.aspose.words.ImportFormatMode;
import com.aspose.words.Node;
import com.aspose.words.NodeImporter;

public static Document generateDocument(Document srcDoc, ArrayList<Node> nodes) throws Exception {
    // Clone the source document so that the destination carries all of its styles, then empty it.
    Document dstDoc = (Document) srcDoc.deepClone(true);
    dstDoc.removeAllChildren();
    dstDoc.ensureMinimum();

    // Remove the empty paragraph that ensureMinimum() added to the body.
    dstDoc.getFirstSection().getBody().removeAllChildren();

    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

    for (Node node : nodes) {
        Node importNode = importer.importNode(node, true);
        dstDoc.getFirstSection().getBody().appendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}

Gptrnt · July 6, 2020, 12:53pm

Hi Awais,

Your solution works well, but I have a concern about performance. The uploaded Word document may sometimes contain 300 to 500 pages. Will adding this code cause any performance delay? As per your code, I am creating a copy of the uploaded document and then removing all of its children.

awais.hafeez · July 7, 2020, 4:13am

@Gptrnt,

This step is required because the destination document must contain all the styles used by the nodes we are importing into it. It should not cause any noticeable performance problems. However, you can improve performance a bit by replacing these lines:

Document dstDoc = (Document) srcDoc.deepClone(true);
dstDoc.removeAllChildren();
dstDoc.ensureMinimum();

with

Document dstDoc = (Document) srcDoc.deepClone(false);
dstDoc.ensureMinimum();
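
To check the performance concern on a real 300 to 500 page document, one option is simply to time both variants. Below is a minimal sketch that uses only the JDK. The names `CloneTiming`, `ThrowingTask`, `timeMillis`, and `bestOf` are hypothetical and do not come from this thread or from Aspose.Words, and the `Thread.sleep` call is only a stand-in workload.

```java
final class CloneTiming {

    /** A task that may throw a checked exception, as Aspose.Words calls do. */
    interface ThrowingTask {
        void run() throws Exception;
    }

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(ThrowingTask task) {
        long start = System.nanoTime();
        try {
            task.run();
        } catch (Exception e) {
            // Wrap checked exceptions so callers do not need a throws clause.
            throw new RuntimeException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Runs the task several times and returns the smallest time, to reduce warm-up noise. */
    public static long bestOf(int runs, ThrowingTask task) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            best = Math.min(best, timeMillis(task));
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): replace the lambda body with one of the
        // two clone variants discussed above, for example srcDoc.deepClone(false).
        long ms = bestOf(3, () -> Thread.sleep(20));
        System.out.println("best of 3 runs: " + ms + " ms");
    }
}
```

Timing `bestOf(3, () -> srcDoc.deepClone(true))` against `bestOf(3, () -> srcDoc.deepClone(false))` on the largest expected document should show whether the difference matters in practice.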

Gptrnt · July 13, 2020, 12:00pm

Thank you so much
