Numbered and bulleted list numbering is lost after DOCX > HTML > DOCX conversion using Java

Hi,

I want to convert a Word document into an HTML string and store it in a database. The stored HTML string will be shown in the UI and also has to be inserted into another Word document. When converting to HTML, the numbered and bulleted lists do not come out properly, and there are some other issues as well. Can you please help me solve this?
Attaching the input doc
Sample_input_.pdf (550.1 KB)

I am also getting the following exception:

Exception in thread "main" java.lang.ClassCastException: com.aspose.words.BookmarkEnd cannot be cast to com.aspose.words.CompositeNode
	at Main.extractContent(Main.java:150)
	at Main.main(Main.java:55)

@Gptrnt

To ensure a timely and accurate response, please attach the following resources here for testing:

  • Your input Word document.
  • Please attach the output HTML file that shows the undesired behavior.
  • Please attach the expected output HTML file that shows the desired behavior.
  • Please create a simple Java application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you have these pieces of information ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

Hi,

Please check the attached files: sample.zip (7.5 MB)

@Gptrnt

Regarding the exception you are facing, please use HtmlSaveOptions as shown below when converting the node to HTML.

HtmlSaveOptions options = new HtmlSaveOptions();
options.setSaveFormat(SaveFormat.HTML);
options.setExportImagesAsBase64(true);

html[0] += node.toString(options);
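
For context, BookmarkEnd is a leaf node and not a CompositeNode, so casting every visited node to CompositeNode fails with the ClassCastException shown above. We have not seen line 150 of Main.java, so the following is only a minimal sketch of the general guard (check the type before casting). SketchNode, SketchComposite and SafeCastSketch are hypothetical stand-in classes for illustration, not Aspose.Words types.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a leaf node (hypothetical, not an Aspose.Words class).
class SketchNode {
    final String name;
    SketchNode(String name) { this.name = name; }
}

// Stand-in for a node that can hold children (hypothetical, not an Aspose.Words class).
class SketchComposite extends SketchNode {
    final List<SketchNode> children = new ArrayList<>();
    SketchComposite(String name) { super(name); }
    SketchComposite add(SketchNode child) { children.add(child); return this; }
}

class SafeCastSketch {
    // Collects node names depth-first, descending only into nodes that are composite.
    static List<String> collectNames(SketchNode node) {
        List<String> names = new ArrayList<>();
        names.add(node.name);
        if (node instanceof SketchComposite) {
            // The cast is safe here because the type was checked first.
            for (SketchNode child : ((SketchComposite) node).children) {
                names.addAll(collectNames(child));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        SketchComposite paragraph = new SketchComposite("Paragraph")
                .add(new SketchNode("BookmarkStart"))
                .add(new SketchNode("Run"))
                .add(new SketchNode("BookmarkEnd"));
        // Prints [Paragraph, BookmarkStart, Run, BookmarkEnd]
        System.out.println(collectNames(paragraph));
    }
}
```

In real code the same check would be applied to the Aspose.Words node before it is cast to CompositeNode.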

Regarding the list items issue, please use the following code example.

Document document = new Document(MyDir + "input_doc.docx");

// Export HTML with images embedded as Base64.
HtmlSaveOptions options = new HtmlSaveOptions();
options.setSaveFormat(SaveFormat.HTML);
options.setExportImagesAsBase64(true);

// Get the paragraph collection and pick out the heading paragraphs.
NodeCollection<Paragraph> paragraphColl = document.getChildNodes(NodeType.PARAGRAPH, true);
List<Paragraph> headings = getHeadingFromParagraph(paragraphColl);
List<String> html = new ArrayList<>();
for (int i = 0; i < headings.size(); i++) {
    Paragraph startNode = headings.get(i);
    if (headings.size() > (i + 1)) {
        Paragraph endNode = headings.get(i + 1);
        // Extract the nodes between two consecutive headings into a separate document,
        // then convert that whole document to HTML instead of converting node by node.
        ArrayList<Node> nodes = Extract_contents.extractContent(startNode, endNode, false);
        Document dstHTML = Extract_contents.generateDocument(document, nodes);
        html.add(dstHTML.toString(options));
    }
    // Note: the content after the last heading is not extracted by this loop.
}

// Rebuild a Word document from the HTML strings and save it as DOCX.
Document outputDocument = generateDocument(html);
outputDocument.save(MyDir + "20.4.docx", SaveFormat.DOCX);
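
Please note that the loop above produces a section only when a following heading exists, so the content under the last heading is skipped. If you need that content too, the section boundaries could be computed along the following lines. This is a minimal sketch using plain indices rather than Aspose.Words nodes; HeadingRanges, ranges, headingPositions and totalCount are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HeadingRanges {
    // Returns [start, end) index pairs, one per heading.
    // The last range runs from the last heading to totalCount.
    static List<int[]> ranges(List<Integer> headingPositions, int totalCount) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < headingPositions.size(); i++) {
            int start = headingPositions.get(i);
            int end = (i + 1 < headingPositions.size())
                    ? headingPositions.get(i + 1)
                    : totalCount; // last heading: run to the end of the content
            result.add(new int[] { start, end });
        }
        return result;
    }

    public static void main(String[] args) {
        // Headings at positions 0, 3 and 7 in a body of 10 items.
        // Prints 0..3, 3..7 and 7..10 on separate lines.
        for (int[] r : ranges(Arrays.asList(0, 3, 7), 10)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

Applied to the document, the final range would correspond to extracting from the last heading up to the last node of the body.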

Hi,

Thank you for your help. It’s working perfectly well.

@Gptrnt

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Hi,

My document may contain hidden content (it could be a paragraph, a table, etc.). I want to identify each hidden item and extract it. I am using Java and tried the code given in the documentation, but unfortunately I am getting an error with it. Can you please tell me the proper code?

@Gptrnt

Please iterate over Run nodes as shown below and remove the hidden formatting of text. Hope this helps you.

Document doc = new Document(MyDir + "in.docx");
// Visit every Run in the document and clear the hidden flag on hidden text.
for (Run run : (Iterable<Run>) doc.getChildNodes(NodeType.RUN, true))
{
    if (run.getFont().getHidden())
        run.getFont().setHidden(false);
}

HtmlSaveOptions options = new HtmlSaveOptions();
options.setSaveFormat(SaveFormat.HTML);
options.setExportImagesAsBase64(true);
doc.save(MyDir + "output.html", options);