Extract body from HTML

Hi,

I am extracting content from a Word file, importing the extracted content into a new document, and converting that document to an HTML string. Because of an issue with bullets, I am converting the document to HTML through a ByteArrayOutputStream.

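        Document dstHTML = generateDocument(document, extractedNodes);
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.setSaveFormat(SaveFormat.HTML);
        // Embed images in the HTML as Base64 instead of writing separate files.
        options.setExportImagesAsBase64(true);
        // Export list labels as HTML list tags.
        options.setExportListLabels(ExportListLabels.BY_HTML_TAGS);
        ByteArrayOutputStream docStream = new ByteArrayOutputStream();
        dstHTML.save(docStream, options);
        // Decode explicitly as UTF-8 (assumed to match the save options' default
        // encoding) rather than relying on the platform default charset.
        return docStream.toString("UTF-8");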

So the result is a full HTML file. This full HTML causes a problem in another task of mine, so I want only the body content from this HTML string. Is there any method in Aspose that returns only the body content of the HTML?

@Gptrnt

In your case, we suggest removing the headers and footers from the document before converting it to HTML. You can use the HeaderFooterCollection.clear() method to do this. Please check the following line of code.

dstHTML.getFirstSection().getHeadersFooters().clear();

If you still face the problem, please ZIP and attach the document generated by the generateDocument method along with the expected output HTML. We will then provide you with more information about your query.

Hi,

I tried your solution, but it is not working for me. As requested, I am attaching my sample code to reproduce the case. In the log output you can see the full HTML string. bodyFromHtml.zip (30.1 KB)

Thank you

@Gptrnt

Your input Word document does not contain a header or footer. Could you please ZIP and attach your problematic and expected output documents? We will then provide you with more information.

Hi,

I think you have misunderstood my issue. It is not specific to headers and footers. In the output I am getting a full HTML string, and the title tag in that HTML contains the imported file name, which is causing a problem for me. So I want to remove the title tag or make it empty, or get only the body content from the HTML string.

Thank you

@Gptrnt

You can use the BuiltInDocumentProperties.Title property to remove the document’s title by setting it to an empty string before saving. In Java, this is dstHTML.getBuiltInDocumentProperties().setTitle("").

If you still face a problem, please describe the problem you are facing with the document title. You can also read the HTML in Java and remove the title tag from the HTML string.
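
For the second option, post-processing the HTML string in plain Java, here is a minimal sketch that uses only the standard library. The class name HtmlBodyExtractor and the methods extractBody and emptyTitle are hypothetical helpers for illustration and are not part of Aspose.Words. The sketch assumes the HTML contains a single, well-formed body element and that no attribute value contains a ">" character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Aspose.Words API): post-processes an HTML string.
final class HtmlBodyExtractor {

    // Matches <body ...>...</body>; case-insensitive, and '.' also matches newlines.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*?)</body\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches the whole <title>...</title> element.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>.*?</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private HtmlBodyExtractor() {
    }

    // Returns the trimmed inner content of the first <body> element,
    // or the input unchanged if no <body> element is found.
    static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : html;
    }

    // Replaces the first <title> element with an empty one.
    static String emptyTitle(String html) {
        return TITLE.matcher(html).replaceFirst("<title></title>");
    }
}
```

With the code from the first post, the last line could then become something like return HtmlBodyExtractor.extractBody(docStream.toString("UTF-8")). Regular expressions are fragile on arbitrary HTML, so if a parser library such as jsoup is available in your project, Jsoup.parse(html).body().html() is likely the more robust choice.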

It's working fine. Thank you :smiley:

@Gptrnt

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.