Attempting to get Cleaner HTML out of a Word Doc

zhann · August 15, 2013, 1:53pm

I am having a hard time finding the information I need. After about 30 minutes of scouring the forums, I figured simply posting my question would be easier.

My problem is simple. I have a series of Word documents with simple text enhancements. I need to pull the text out of each document, convert it to HTML, then, say, store it in a DB table. The only enhancements I am concerned with are:

Bold
Italic
Lists (I am told they are always bulleted, but who knows)
Anchors

Opening the files is easy enough. Saving them as HTML is easy enough. However, at this point, there is so much garbage in the HTML that it is really a pain to decipher. For starters, everything is wrapped in span tags, the font differences are inline styles, and even the white space has funny tags around it.

Is there a way to use Aspose to produce even slightly better HTML?

My fallback option is to simply save them as is to HTML, then with the use of some crafty regular expressions rewrite them in a somewhat better format. However, if I don’t have to do that, it would save a great deal of time.
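For illustration, the regular-expression fallback described above might look roughly like the sketch below. It uses only java.util.regex, and everything in it is an assumption rather than anything Aspose provides: the class name WordHtmlCleaner is hypothetical, the spans are assumed not to be nested, and bold and italic are assumed to appear as inline font-weight:bold and font-style:italic styles on the span.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class WordHtmlCleaner {
    // A span with no nested span inside: attributes in group 1, content in group 2.
    private static final Pattern SPAN = Pattern.compile(
            "<span\\b([^>]*)>(.*?)</span>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BOLD = Pattern.compile(
            "font-weight\\s*:\\s*bold", Pattern.CASE_INSENSITIVE);
    private static final Pattern ITALIC = Pattern.compile(
            "font-style\\s*:\\s*italic", Pattern.CASE_INSENSITIVE);
    // Any opening tag, and a style/class attribute inside it.
    private static final Pattern TAG = Pattern.compile("<[a-zA-Z][^>]*>");
    private static final Pattern STYLE_OR_CLASS = Pattern.compile(
            "\\s+(?:style|class)\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    static String clean(String html) {
        // Pass 1: unwrap each span, keeping bold/italic as <b>/<i>.
        Matcher m = SPAN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String attrs = m.group(1);
            String text = m.group(2);
            if (ITALIC.matcher(attrs).find()) text = "<i>" + text + "</i>";
            if (BOLD.matcher(attrs).find()) text = "<b>" + text + "</b>";
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        // Pass 2: drop style and class attributes from the remaining tags.
        return stripAttributes(out.toString());
    }

    private static String stripAttributes(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = STYLE_OR_CLASS.matcher(m.group()).replaceAll("");
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling WordHtmlCleaner.clean(html) on the saved HTML string returns the reduced markup. Nested spans or attribute values containing ">" would break these patterns, which is the usual argument for a real HTML parser over regular expressions.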

Has anyone tried this before?

zhann · August 16, 2013, 8:28am

Judging by the deafening silence, I assume I am alone here. Guess it's time to revisit my love for Regular Expressions.

tahir.manzoor · August 16, 2013, 1:07pm

Hi there,

Thanks for your inquiry. Please note that Aspose.Words mimics the behavior of MS Word. Please convert your document to HTML using MS Word and check the HTML output. When HTML is processed, some features might be lost. You can find a list of the limitations of HTML export and import here:

https://docs.aspose.com/words/java/load-in-the-html-html-xhtml-mhtml-format/

https://docs.aspose.com/words/java/save-in-the-html-html-xhtml-mhtml-format/

You can also extract specific content from a document and save it in HTML format. Please read the following documentation on extracting content from a document.

https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

Could you please attach your input Word document here along with the expected output HTML for testing? I will investigate the issue on my side and provide you with more information.

zhann · August 16, 2013, 2:50pm

I see, that makes sense. I had just assumed your utilities had ways of cleaning up the Word documents a bit. The fact is, I have about 1000 different docs that I need to parse through, all written by different people. Word is notoriously bad at creating HTML, hence I hoped you had something in your suite.

Since the vast majority of the tags used by my team are very simple, I have already built a series of cleaning functions on top of yours which do most of what I need done. There are still some stray white-space areas, and some other little glitches to work through, but about 95% of what I need is ready.
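As a rough illustration of the stray white-space cleanup mentioned above, a final tidy-up pass could look like the sketch below. The class name HtmlWhitespaceTidy is hypothetical, and the sketch assumes that non-breaking spaces show up as &nbsp;, &#xa0;, or the raw U+00A0 character, and that attributes have already been removed so empty paragraphs appear as a bare <p></p> pair.

```java
// Hypothetical helper; the entity forms handled here are assumptions
// about what the exported HTML contains.
final class HtmlWhitespaceTidy {
    static String tidy(String html) {
        return html
                // Turn non-breaking spaces into ordinary spaces.
                .replace("&nbsp;", " ")
                .replace("&#xa0;", " ")
                .replace("\u00a0", " ")
                // Drop paragraphs that contain nothing but white space.
                .replaceAll("<p>\\s*</p>", "")
                // Collapse runs of spaces or tabs into a single space.
                .replaceAll("[ \\t]{2,}", " ")
                .trim();
    }
}
```

Replacing every non-breaking space changes the meaning of documents that rely on them for layout, so this only fits text where they are known to be noise.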

On that note, I appreciate the heads up. I think I can handle this using regular expressions and standard Java utils.

Thanks

tahir.manzoor · August 19, 2013, 3:37am

Hi there,

Thanks for your feedback. It would be great if you could attach your input Word document here along with the expected output HTML for testing. We will investigate how you want your final HTML output to be generated, and then provide you with more information along with code.