Extracting HTML from Word document one paragraph at a time

Hi,

We are now evaluating several tools to perform a task in our system and Aspose.Words is one of them.

I’m trying to build a demo to check if this is possible using Aspose.Words.

The goal of this task is to import a Word document and split it into “fields” in our database. These fields will be displayed in an HTML page and can be exported back to Word.

The task should do the following:

  1. Get a Word document (path or stream)
  2. Build an object tree of the document (each heading level and the text, images, and everything else under it)
  3. Convert each node to HTML

This tree will be saved in a database and each section in the document (heading and paragraph) will be displayed separately on a web page.

This task should be reversible, which means that I should be able to take the HTML pieces and build a new Word document from them.
Can this be done using Aspose.Words?
Thanks,
Omri

Hi Omri,

Thanks for your interest in the Aspose.Words for .NET API. Yes, you can meet all these requirements using Aspose.Words. A Word document can have fields, and when it is loaded into an Aspose.Words Document instance, these fields are represented by the classes in the Aspose.Words.Fields namespace.

  1. Please refer to the following article:
    https://docs.aspose.com/words/net/create-or-load-a-document/

  2. We have many examples in our GitHub repository. I suggest you take a look at the DocumentExplorer project:
    https://docs.aspose.com/words/net/aspose-words-document-object-model/

  3. You can convert any Node to HTML using the Node.ToString(SaveFormat.Html) method.

You can also load and save an entire document in a database:
https://docs.aspose.com/words/net/serialize-and-work-with-a-document-in-a-database/

Aspose.Words can also load HTML files/strings into its DOM. Please refer to the following supported load formats:
https://reference.aspose.com/words/net/aspose.words/loadformat/
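
For the reverse direction, one possible approach is to join the stored HTML pieces back into a single HTML string in document order and then load that string. The sketch below is library-independent Python and does not use any Aspose.Words API. The record shape (`title`, `level`, `body`, `children`) and the function name `render_html` are assumptions made up for illustration, and level 0 is assumed to mark a root container that has no heading of its own.

```python
from html import escape


def render_html(root):
    """Join a tree of stored sections into one HTML string, in document order."""
    parts = []

    def walk(section):
        if section["level"] > 0:                 # level 0 = hypothetical root, no heading
            tag = f"h{min(section['level'], 6)}"  # HTML only defines h1..h6
            parts.append(f"<{tag}>{escape(section['title'])}</{tag}>")
        parts.extend(section["body"])            # stored fragments, assumed to be valid HTML
        for child in section["children"]:        # depth-first keeps the original order
            walk(child)

    walk(root)
    return "<html><body>" + "".join(parts) + "</body></html>"
```

Heading titles are escaped because they are assumed to be stored as plain text, while body fragments are passed through unchanged because they are assumed to be HTML already.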

Please let us know if we can be of any further assistance.

Best regards,

Hi Awais,
Thank you very much for your answer.
I’ve started to work with the API and it seems to work well.
It seems that the paragraph separation is not exactly what I need.
I need a tree that represents the document’s logical structure, and the child nodes structure is not like that. I have 2 problems with that:

  1. Every line is a new paragraph, while I want to get all the lines under a specific heading in one node.
  2. Headings are not built as a tree (like the TOC in Word).
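
The structure described in these two points can be sketched independently of any library. The Python below is only an illustration, not Aspose.Words code: it assumes each paragraph has already been reduced to a `(level, text)` pair, where `level` is the heading level (1, 2, 3, ...) or 0 for an ordinary paragraph. How that level is read from the document depends on the tool used. The names `Section` and `build_tree` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One heading plus everything that sits directly under it."""
    title: str
    level: int                                     # 0 is reserved for the root
    body: list = field(default_factory=list)       # paragraphs under this heading
    children: list = field(default_factory=list)   # nested Section objects


def build_tree(paragraphs):
    """Group a flat list of (level, text) pairs into a heading tree."""
    root = Section(title="(document)", level=0)
    stack = [root]                  # path from the root to the current heading
    for level, text in paragraphs:
        if level == 0:
            stack[-1].body.append(text)   # body text joins the nearest open heading
            continue
        # A new heading closes every open heading at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(title=text, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Example input: two top-level headings, the first with a sub-heading.
tree = build_tree([
    (1, "Introduction"),
    (0, "First line."),
    (0, "Second line."),
    (2, "Scope"),
    (0, "Scope text."),
    (1, "Details"),
    (0, "Detail text."),
])
```

Here "First line." and "Second line." end up together in the body of "Introduction", and "Scope" becomes its child, which mirrors the TOC-like nesting. Text that appears before the first heading is kept in the body of the root.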

Do you have a solution for these issues?
Thanks,
Omri

Hi Omri,

Thanks for your inquiry. We need to understand exactly what you are trying to implement. Could you please attach your input document and the expected Word document here for our reference? You can create the expected document using Microsoft Word. We will examine its structure to see how you want the final output to be generated, investigate the scenario on our end, and provide you with more information.

Best regards,

Hi
Sorry for my late response, I was able to solve it by myself.

Thanks