Hi, I’m trying to get Aspose.Words (for Java) to export clean HTML from a Word doc (and docx). By “clean” I mean no style markup at all, and the entire output needs to be plain ASCII. I’ve attached in a zip:
- the source docx
- the output I currently get from Aspose using the code below
- the output I’d ideally like to generate (or at least get closer to)
So far it seems to do a fairly decent job of preserving the look of the original document, but that’s not quite what I’m after. Is it possible to:
a) configure it to discard all style information
b) ensure that only valid HTML markup characters are output (I get some of the bullet chars, and a lot of odd control char #160).
c) configure it to discard artificial spacing (long runs of &nbsp; etc.)
d) configure it to NOT wrap everything in one or more <div> tags
e) configure it to correctly interpret ordered/unordered lists (or at least not attempt to replicate them by inserting spaces and tokens instead of <li>)
Regarding (e) above, I would settle for having it generate plain paragraphs instead of what it does now (inserting the letter ‘o’ in place of the bullet):
o list element 1
o list element 2
Basically all I want in my output is this:
tables,
headings (<h1>, <h2>, etc.),
lists (<ul>, <ol>),
and paragraphs/breaks (<p>, <br/>).
Is that currently possible?
Here is the code I’m using to convert:
doc = javaLoader.create("com.aspose.words.Document").init("test1.docx", asposeLoadFormat.HTML, 'null');
doc.joinRunsWithSameFormatting();
// save options
so = doc.getSaveOptions();
so.setHtmlExportHeadersFooters(false);
// save
doc.save("test1.html", saveformat.Html);
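If none of this turns out to be configurable, the fallback would be to post-process the saved HTML. A rough sketch of what (a) to (d) might look like in plain Java is below. It makes no Aspose calls; the class and method names are made up for illustration, and the regex-based approach is an assumption that will not cope with every HTML edge case (a real HTML parser would be more robust).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words: cleans up already-exported HTML.
final class HtmlCleanupSketch {

    // (a) style="..." and class="..." attributes, single- or double-quoted.
    private static final Pattern STYLE_ATTRS = Pattern.compile(
            "\\s+(?:style|class)\\s*=\\s*(?:\"[^\"]*\"|'[^']*')",
            Pattern.CASE_INSENSITIVE);

    // (d) opening and closing <div> tags; their content is kept.
    private static final Pattern DIV_TAGS = Pattern.compile(
            "</?div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // (c) runs of spaces and non-breaking spaces (raw char 160 or as an entity).
    private static final Pattern SPACE_RUNS = Pattern.compile(
            "(?:\\u00A0|&nbsp;|&#160;| )+");

    static String clean(String html) {
        String s = STYLE_ATTRS.matcher(html).replaceAll("");
        s = DIV_TAGS.matcher(s).replaceAll("");
        s = SPACE_RUNS.matcher(s).replaceAll(" ");

        // (b) replace every non-ASCII code point with a numeric character reference.
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"x\"><p style=\"color:red\">"
                + "caf\u00E9\u00A0\u00A0&nbsp;bar \u2022 item</p></div>";
        System.out.println(clean(html));
    }
}
```

This does nothing for (e): turning the fake bullets back into <ul>/<li> would need knowledge of the list structure that is only available inside the converter.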
Many thanks!
CC