Table content getting converted to image in HTML extracted from document

ashu_agrawal_sirionlabs_com · March 6, 2024, 6:57pm

Hi Team,

When I try to extract HTML from a Word document, some of the content is converted to images and becomes non-editable as a result. Please check.

Code :

public static void main(String[] args) throws Exception {
    com.aspose.words.License license = new com.aspose.words.License();
    license.setLicense("/home/saurabharora/Downloads/Aspose.Total.Product.Family.lic");

    Document document = new Document("/home/saurabharora/Downloads/document_test_image.docx");

    HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
    opts.setExportPageSetup(true);
    opts.setExportDocumentProperties(true);
    opts.setExportListLabels(ExportListLabels.BY_HTML_TAGS);
    opts.setExportImagesAsBase64(true);
    opts.setExportFontsAsBase64(true);
    opts.setExportHeadersFootersMode(ExportHeadersFootersMode.FIRST_PAGE_HEADER_FOOTER_PER_SECTION);
    opts.setCssStyleSheetType(CssStyleSheetType.EMBEDDED);
    opts.setExportTocPageNumbers(true);
    opts.setExportShapesAsSvg(false);
    opts.setExportRelativeFontSize(true);
    // opts.setExportPageMargins(true);
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    document.save(byteArrayOutputStream, opts);
    String html = byteArrayOutputStream.toString(StandardCharsets.UTF_8);
    System.out.println(html);
}

Document :
document_test_image.zip (55.2 KB)

Thanks

alexey.noskov · March 7, 2024, 5:41am

@ashu_agrawal_sirionlabs_com This is expected behavior, since the content in your MS Word document is inside textboxes. There is no meaningful way to export MS Word shapes to HTML, so shapes are exported as images. Alternatively, you can export them as SVG:

Document doc = new Document("C:\\Temp\\in.docx");
HtmlSaveOptions opt = new HtmlSaveOptions();
opt.setPrettyFormat(true);
opt.setExportShapesAsSvg(true);
doc.save("C:\\Temp\\out.html", opt);

In this case textbox shapes are exported as SVG.
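
To compare the two modes, one option is to count how many inline images and how many SVG elements end up in the saved HTML string. The following is only a rough sketch using plain Java regular expressions, not an Aspose.Words feature: the class name HtmlShapeCheck and its methods are hypothetical, and it assumes inline images appear as <img> tags with a data:image/... URI (as expected with setExportImagesAsBase64(true)) and that SVG shapes appear as inline <svg> elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Aspose.Words): inspects an HTML string
// and counts inline base64 images and inline SVG elements.
public class HtmlShapeCheck {
    // Assumption: inline images are <img> tags whose src is a data:image URI.
    private static final Pattern INLINE_IMG = Pattern.compile(
            "<img\\b[^>]*\\bsrc\\s*=\\s*[\"']data:image/[^\"']*[\"']",
            Pattern.CASE_INSENSITIVE);

    // Assumption: SVG shapes are written as inline <svg ...> opening tags.
    private static final Pattern SVG = Pattern.compile("<svg\\b", Pattern.CASE_INSENSITIVE);

    // Counts non-overlapping matches of the pattern in the given HTML.
    private static int count(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static int countInlineImages(String html) {
        return count(INLINE_IMG, html);
    }

    public static int countSvgElements(String html) {
        return count(SVG, html);
    }

    public static void main(String[] args) {
        // Small hand-written sample; in practice pass the HTML produced by document.save(...).
        String sample = "<p>text</p><img src=\"data:image/png;base64,AAAA\" />"
                + "<svg width=\"10\"></svg>";
        System.out.println("inline images: " + countInlineImages(sample));
        System.out.println("svg elements: " + countSvgElements(sample));
    }
}
```

Running this check on the output of both configurations (setExportShapesAsSvg(false) and setExportShapesAsSvg(true)) should show whether the textbox content moved from <img> tags to <svg> elements. A regex scan is a crude heuristic; an HTML parser would be more robust for anything beyond a quick check.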