Generate clean HTML export (no formatting)

fireflyinteractive · April 14, 2011, 1:44am

Hi – I’m trying to get Aspose.Words (for Java) to export clean HTML from a Word doc (and docx). By “clean” I mean no style markup at all, and the entire output needs to be plain ASCII. I’ve attached in a zip:

the source docx
the output I currently get from Aspose using the code below
the output I’d ideally like to generate (or at least get closer to)

So far it seems to do a fairly decent job of preserving the look of the original document, but that’s not quite what I’m after. Is it possible to:

a) configure it to discard all style information
b) ensure that only valid HTML markup characters are output (I get some of the bullet chars, and a lot of odd control char #160).
c) configure it to discard artificial spacing (long runs of &nbsp; etc)
d) configure it to NOT wrap everything in one or more <div> tags
e) configure it to correctly interpret ordered/unordered lists (or at least not attempt to replicate them by inserting spaces and tokens instead of <li>)

regarding (e) above, I would settle for having it generate plain paragraphs instead of trying to do this (inserting the letter ‘o’ in place of the bullet):

o list element 1
o list element 2

Basically all I want in my output is this:
tables,
headings (<h1>, <h2> etc)
lists (<ul><ol>)
and paragraphs/breaks (<p><br/>)

Is that currently possible?

here is the code I’m using to convert:

doc = javaLoader.create("com.aspose.words.Document").init("test1.docx", asposeLoadFormat.HTML, 'null');
doc.joinRunsWithSameFormatting();
// save options
so = doc.getSaveOptions();
so.setHtmlExportHeadersFooters(false);
// save
doc.save("test1.html", saveformat.Html);

many thanks!
CC

alexey.noskov · April 14, 2011, 5:20am

Hi
Thank you for your interest in Aspose.Words. Unfortunately, there is no direct way to generate this kind of HTML. However, since the HTML you need is quite simple, you can write your own HTML converter using DocumentVisitor:
https://reference.aspose.com/words/java/com.aspose.words/DocumentVisitor
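
To give a rough idea of the shape such a converter could take, here is a minimal, self-contained sketch of the visitor approach. It does not use the Aspose.Words classes: NodeVisitor, Para and HtmlWriter are hypothetical stand-ins invented for illustration, and the real DocumentVisitor callbacks and signatures should be taken from the reference linked above. The sketch only shows the idea of walking the document and emitting bare tags (headings, list items, paragraphs) while ignoring all formatting and writing non-ASCII characters as numeric entities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SimpleHtmlVisitorSketch {

    /** Hypothetical stand-in for a visitor base class; not the Aspose.Words API. */
    public interface NodeVisitor {
        void paragraphStart(Para p);
        void run(String text);
        void paragraphEnd(Para p);
    }

    /** Hypothetical paragraph node. headingLevel 0 means ordinary body text. */
    public static final class Para {
        final int headingLevel;
        final boolean listItem;
        final List<String> runs;

        public Para(int headingLevel, boolean listItem, String... runs) {
            this.headingLevel = headingLevel;
            this.listItem = listItem;
            this.runs = new ArrayList<>(Arrays.asList(runs));
        }

        /** Calls back into the visitor in document order. */
        public void accept(NodeVisitor v) {
            v.paragraphStart(this);
            for (String r : runs) {
                v.run(r);
            }
            v.paragraphEnd(this);
        }
    }

    /** Emits only h1..h6, ul/li and p tags; style information is never written. */
    public static final class HtmlWriter implements NodeVisitor {
        private final StringBuilder out = new StringBuilder();
        private boolean inList = false;

        @Override
        public void paragraphStart(Para p) {
            // Open or close the surrounding <ul> when list membership changes.
            if (p.listItem && !inList) {
                out.append("<ul>\n");
                inList = true;
            } else if (!p.listItem && inList) {
                out.append("</ul>\n");
                inList = false;
            }
            out.append('<').append(tagFor(p)).append('>');
        }

        @Override
        public void run(String text) {
            out.append(escape(text));
        }

        @Override
        public void paragraphEnd(Para p) {
            out.append("</").append(tagFor(p)).append(">\n");
        }

        /** Closes a list left open at the end of the document and returns the HTML. */
        public String finish() {
            if (inList) {
                out.append("</ul>\n");
                inList = false;
            }
            return out.toString();
        }

        private static String tagFor(Para p) {
            if (p.listItem) return "li";
            return p.headingLevel > 0 ? "h" + p.headingLevel : "p";
        }
    }

    /** Escapes markup characters and keeps the output plain ASCII. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '&') sb.append("&amp;");
            else if (cp == '<') sb.append("&lt;");
            else if (cp == '>') sb.append("&gt;");
            else if (cp == 0xA0 || cp < 0x20) sb.append(' ');   // nbsp and control chars become a space
            else if (cp > 0x7E) sb.append("&#").append(cp).append(';');
            else sb.append((char) cp);
        }
        return sb.toString();
    }

    public static String render(List<Para> doc) {
        HtmlWriter w = new HtmlWriter();
        for (Para p : doc) {
            p.accept(w);
        }
        return w.finish();
    }

    public static void main(String[] args) {
        List<Para> doc = Arrays.asList(
                new Para(1, false, "Title"),
                new Para(0, true, "list element 1"),
                new Para(0, true, "list element 2"),
                new Para(0, false, "a < b", "\u00A0ok\u2022"));
        System.out.print(render(doc));
    }
}
```

A real converter would also need callbacks for tables, rows and cells, and would have to tell ordered lists from unordered ones; those are left out here to keep the sketch short.
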
Please let me know if you need more assistance, I will be glad to help you.
Best regards,

fireflyinteractive · April 14, 2011, 6:40am

yikes it looks a lot more complicated than I’d hoped!
Thanks for the info

alexey.noskov · April 14, 2011, 7:22am

Unfortunately, I cannot suggest another way. The other option is to clean up the HTML produced by Aspose.Words, but I think that would be much more complicated than using DocumentVisitor.
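
For comparison, a rough sketch of what the clean-up route might look like, using only the Java standard library. The class name HtmlCleanupSketch and the list of attributes and wrapper tags it strips are assumptions chosen for illustration, not anything taken from the attached output. Regular expressions are fragile on real-world HTML (they can also match text outside tags), so a proper HTML parser would be safer; this only shows the kind of steps involved: drop style attributes, drop div/span/font wrappers, turn non-breaking spaces into ordinary spaces, collapse runs of spaces, and write remaining non-ASCII characters as numeric entities.

```java
import java.util.regex.Pattern;

public class HtmlCleanupSketch {

    // Presentation attributes to drop (an assumed list; extend as needed).
    private static final Pattern STYLE_ATTRS = Pattern.compile(
            "\\s+(?:style|class|lang|align)\\s*=\\s*(?:\"[^\"]*\"|'[^']*')",
            Pattern.CASE_INSENSITIVE);

    // Wrapper tags to remove while keeping their content.
    private static final Pattern WRAPPERS = Pattern.compile(
            "</?(?:div|span|font)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Non-breaking space as an entity or as the raw character U+00A0.
    private static final Pattern NBSP = Pattern.compile(
            "&nbsp;|&#160;|&#xa0;|\\u00A0", Pattern.CASE_INSENSITIVE);

    private static final Pattern SPACE_RUNS = Pattern.compile("[ \\t]{2,}");

    public static String clean(String html) {
        String s = STYLE_ATTRS.matcher(html).replaceAll("");
        s = WRAPPERS.matcher(s).replaceAll("");
        s = NBSP.matcher(s).replaceAll(" ");
        s = SPACE_RUNS.matcher(s).replaceAll(" ");
        return toAscii(s);
    }

    /** Replaces every character above 0x7E with a decimal numeric entity. */
    static String toAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp > 0x7E) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "<div><p style=\"margin:0pt\"><span style='font-family:Arial'>"
                + "Hello&#160;&#160;&#160;world \u2022</span></p></div>";
        System.out.println(clean(in));
    }
}
```

Note that this does nothing about lists rendered as paragraphs with a leading bullet token, which is the part that is hard to recover once the HTML has been written.
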
Best regards,