HTML content and Word content

edegagne · December 20, 2007, 8:45am

I am evaluating Aspose.Words and have so far been pretty impressed with what I have been able to accomplish in a relatively short period.

I am working on a project that involves a CMS type system. Users can enter a new product via a web form, then internally we generate the folder structure for product content and then we generate a Word document skeleton that has the product Title, Description and Comments filled in (via bookmarks, the doc is generated from an existing Word Template).

The process is actually two-way: the CMS spits out a Word doc, the user takes that doc and fills in the rest of the information, then uploads the filled-in document through the CMS. The functionality I am building will extract all of the bookmarked info from the doc and insert it into the proper CMS folders/fields.

Here’s my issue.

In the CMS, some of the fields will be normal textboxes, but some can/will be HTML editor controls so the user can do some basic HTML formatting (bold, italic, lists, etc.).

On the Word document, a user can use the text formatting tools to do the same.

My issue is, how can I:

1.) Get the formatted content from the word doc and convert it into the HTML equivalent.
2.) Take HTML formatted content from the CMS and convert it into properly formatted Word content.

Any ideas?

Thanks in advance.

edegagne · December 20, 2007, 9:54am

I managed to figure out how to insert the HTML coming from the CMS into a bookmark (outbound).

if (docBuilder.MoveToBookmark("ci_course_desc"))
{
    // Remove default text inside of bookmark.
    bookMark = docReturnVal.Range.Bookmarks["ci_course_desc"];
    bookMark.Text = "";
    // Need to move back to the bookmark.
    docBuilder.MoveToBookmark("ci_course_desc");
    docBuilder.InsertHtml(xmlCourseDesc.InnerXml);
}

Notice that even though we already move to the bookmark in the if condition, we need to move back to it again after using the bookmark object; otherwise the docBuilder.InsertHtml() call throws an error.

Now it’s just a matter of inbound content extraction…

alexey.noskov · December 20, 2007, 11:47am

Hi
Thanks for your inquiry. I think that you can try using the Document.Save method to save a document in HTML format. For example, see the following code.

Document doc = new Document("in.doc");
doc.Save("out.html", SaveFormat.Html);

Also, you can attach your document here. I will investigate this and provide you with more information.
Best regards.

edegagne · December 23, 2007, 11:30am

What I really need to be able to do is not save the entire doc as HTML. I need to get at the Word formatted text of a bookmark’s content, then save/convert it to HTML before going to the CMS.

alexey.noskov · December 24, 2007, 5:41am

Hi
Thanks for your inquiry. I think that this code will help you.

private string GetHtmlFromBookmark(string bookmarkName, Document doc)
{
    Document doc1 = new Document();
    Bookmark mark = doc.Range.Bookmarks[bookmarkName];
    Node node = mark.BookmarkStart.ParentNode;
    Node endNode = mark.BookmarkEnd.ParentNode.NextSibling;
    while (!node.Equals(endNode))
    {
        if ((node as CompositeNode).ChildNodes.Contains(mark.BookmarkStart))
        {
            Node child = (node as CompositeNode).FirstChild;
            Node endChild = mark.BookmarkStart.NextSibling;
            while (!child.Equals(endChild))
            {
                child = child.NextSibling;
                child.PreviousSibling.Remove();
            }
        }
        if ((node as CompositeNode).ChildNodes.Contains(mark.BookmarkEnd))
        {
            Node child = mark.BookmarkEnd;
            while (!child.Equals(child.ParentNode.LastChild))
            {
                child = child.NextSibling;
                child.PreviousSibling.Remove();
            }
            child.Remove();
        }
        doc1.FirstSection.Body.AppendChild(doc1.ImportNode(node, true, ImportFormatMode.KeepSourceFormatting));
        node = node.NextSibling;
        if (node == null)
            break;
    }
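    // Save the temporary document as HTML into memory.
    MemoryStream stream = new MemoryStream();
    doc1.Save(stream, SaveFormat.Html);
    // ToArray returns only the bytes written; GetBuffer would also include unused buffer capacity.
    string html = Encoding.UTF8.GetString(stream.ToArray());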
    return html;
}

Best regards.

edegagne · December 26, 2007, 11:04am

Thank you so much Alexey, I will try this out today and let you know if it works out.

edegagne · January 1, 2008, 2:26pm

Is there a function in the API to get just clean HTML out of the bookmark?

I tried the above code and got back an HTML document, but short of running it through a CleanHTML routine, it’s not very useful. I just need the Word content of the bookmark, in HTML.

There are a lot of junk tags in the output HTML from the above routine. I would like to remove all of them and just get at an HTML equivalent of the Word formatted text.

romank · January 1, 2008, 2:44pm

What are the “junk tags” you are talking about?
There is no easy way in the API to just get the HTML content of a bookmark. Alexey’s code is an attempt to work around that. The reason we don’t have this in the API is that we have not yet decided how to resolve some technical issues:
a. A document is a tree of nodes; bookmark start and end are just “markers” and can be anywhere in the tree, so one might have to traverse a “jagged” tree fragment and still try to create valid HTML. For example, “close” a paragraph when the bookmark ends in the middle of a paragraph, etc.
b. What to do with images, if there is an image in the bookmark text.
So, in general, you can get HTML out of Aspose.Words. If you want to get it just for a bookmark, have a look at Alexey’s code again.

alexey.noskov · January 1, 2008, 2:55pm

Hi
Thanks for your inquiry. Unfortunately, there is no function in the API to get just clean HTML out of the bookmark. The thing is that Aspose.Words HTML import and export do not guarantee a full data roundtrip.
Best regards.

edegagne · January 2, 2008, 8:28am

Thanks for the replies.

The “junk” tags I was referring to are the well-known “junk” that Word is famous for injecting into a “Save As HTML” document.

I was able to use a combination of RegEx patterns to clean up the output from Alexey’s code sample. This thoroughly cleaned the HTML so that it is just the HTML from the bookmark content itself.

Thanks again.
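
For anyone following along, one piece of that kind of clean-up can be sketched as below. This is only an illustration, not the patterns actually used in this thread: it assumes the saved output is a complete HTML page and that only the markup inside the body element is wanted. The class and method names (HtmlFragment, ExtractBody) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlFragment
{
    // Returns the markup between <body ...> and </body>.
    // If no body element is found, the input is returned unchanged.
    public static string ExtractBody(string html)
    {
        Match m = Regex.Match(
            html,
            @"<body[^>]*>(.*?)</body>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value.Trim() : html;
    }
}
```

Calling HtmlFragment.ExtractBody on the string returned by GetHtmlFromBookmark would drop the html/head wrapper and leave only the body markup. A regular expression is a blunt tool for HTML, so an HTML parser may be the safer choice for anything beyond simple output.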

romank · January 2, 2008, 6:42pm

I agree that Microsoft Word outputs “junk” into HTML, but I thought you were using Aspose.Words. Aspose.Words does not rely on Microsoft Word; it outputs HTML, as well as all other formats, itself. I tried to avoid outputting “junk” into HTML when designing HTML export in Aspose.Words. That’s what got me confused. Are you not happy with the HTML produced by Aspose.Words or by Microsoft Word? If you don’t like the HTML produced by Aspose.Words, let me know exactly what you think is junk.

edegagne · January 8, 2008, 3:19pm

I think that what I wanted was something that was easier to implement.

What I need to do is extract the content from a bookmark (which could have Word formatting in it) and convert that to proper XHTML (bold text shows up as bold tags, and so on). This is quite perplexing, because the component allows me to InsertHtml that way, so why is there no reverse function?

But I tried out Alexey’s function (posted in a reply above). One of the issues I see is the creation of doc1 (the temp doc variable) to do the Save (as HTML) for the bookmark’s contents: if the original document object (passed in as a parameter) has any custom styles in its styles collection, they are not added to doc1.

So as I tried to do it myself, the first thing that stuck out was that Styles.Add doesn’t accept a Style object?!?

Why is that?

alexey.noskov · January 9, 2008, 5:17am

Hi
I think that you can solve your problem with custom styles using the Clone method. For example, see the following code snippet.

Document doc = new Document(@"444_106847_edegagne\in.doc");
Document tempDoc = (Document)doc.Clone(false);

Best regards.

edegagne · January 9, 2008, 7:58am

I thought about that too, but that won’t plug in well to the previous method you provided for extracting bookmark content from the originating document. In your GetHtmlFromBookmark method, you’re only creating the temp doc object to use as a repository for a single bookmark’s contents.

Let me try to simplify what I am doing:

1.) User goes into .ASPX page and creates a new course by entering a course name, choosing a language, and selecting a course type.
2.) User clicks save and the CMS creates the supporting folder structure (behind the scenes).
3.) User clicks “Download Course Doc” and is presented with a download of the course in its current state (which for a new course is all blank except the title/name). This document is created from an existing blank Word template.
NOTE: This document is basically a bunch of two column tables with the right side being the contents a user can enter, all these are bookmarks.
4.) User finishes filling out the Word document.
5.) User clicks upload course doc, selects the file, and uploads the doc.
6.) We iterate through all of the bookmarks to extract the user-entered info and save it into different content items in the CMS.

Basically, the company is using the Word document as a form entry tool (yeah I know, it’s silly, but they are for whatever reasons, attached to this method and can’t be talked out of it).

Because the CMS is XHTML/HTML based and the Word document is not, we use Alexey’s method to extract the bookmark content, save it as HTML (which is a whole separate issue) and update the content that the bookmark maps to.

The problem is that the client wants to be able to use Word styling to format the entered info, but the CMS needs to have the content in XHTML/HTML. We’ve tried creating RegEx patterns for replacing/removing certain HTML, but it’s a nightmare. Everything like bold, italic, etc. is placed in a style attribute string, which we can’t use; we need to convert it to the corresponding tags (bold, italic, etc.).

If it were not for the “styling” aspect of the requirements, this would be done already. Using the Aspose.Words component helped us achieve that easily (without having Word on the server or using the bloated Word API). We can take the Word doc and update all the content items from the bookmark content quite easily; right now we’re stripping all the HTML out with a series of RegEx replace routines. But going forward, we really need to allow the styling to happen as well.
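
To illustrate the style-attribute problem described above, here is a rough sketch of converting inline CSS into tags. It rests on several assumptions not confirmed in this thread: that the exported markup uses span elements with a style attribute as their only attribute, that the spans contain plain text only, and that b and i are acceptable target tags. The names InlineStyleConverter and ToTags are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineStyleConverter
{
    // Matches <span style="...">text</span> where the span holds plain text only.
    private static readonly Regex SpanWithStyle = new Regex(
        @"<span\s+style=""(?<style>[^""]*)""\s*>(?<text>[^<]*)</span>",
        RegexOptions.IgnoreCase);

    // Replaces each matching span with <b>/<i> wrappers based on its inline CSS.
    // Any other declarations in the style attribute (fonts, colors) are dropped.
    public static string ToTags(string html)
    {
        return SpanWithStyle.Replace(html, m =>
        {
            string style = m.Groups["style"].Value.Replace(" ", "").ToLowerInvariant();
            string result = m.Groups["text"].Value;

            if (style.Contains("font-style:italic"))
                result = "<i>" + result + "</i>";
            if (style.Contains("font-weight:bold"))
                result = "<b>" + result + "</b>";

            return result;
        });
    }
}
```

Spans that carry neither bold nor italic are reduced to their text, which matches the "strip everything else" approach described above but may be too aggressive for other content.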

alexey.noskov · January 9, 2008, 9:28am

Hi
Thank you for the additional information. It is not quite clear to me how I can help you.

Document tempDoc = (Document)doc.Clone(false);

This line of code will create a copy of the source document without its content. tempDoc will be an empty document that keeps the styles and formatting of the source document.
Best regards
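
As a sketch of how the cloned document could stand in for the new Document() in GetHtmlFromBookmark: the version below assumes that a document cloned with Clone(false) has no sections, so FirstSection would be null until a minimal structure is created. It uses Document.EnsureMinimum for that, which exists in current Aspose.Words releases but may not be available in the version discussed in this thread. TempDocFactory and CreateEmptyCopy are hypothetical names.

```csharp
using Aspose.Words;

public static class TempDocFactory
{
    // Creates an empty document that keeps the styles of the source document.
    public static Document CreateEmptyCopy(Document source)
    {
        // Clone(false) copies the document node itself but not its child nodes.
        Document tempDoc = (Document)source.Clone(false);

        // Assumption: the clone has no sections, so create the minimal
        // Section/Body/Paragraph structure before appending imported nodes.
        tempDoc.EnsureMinimum();

        return tempDoc;
    }
}
```

With this in place, doc1 in the extraction method could be obtained from TempDocFactory.CreateEmptyCopy(doc) so that doc1.FirstSection.Body is available for AppendChild.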

edegagne · January 9, 2008, 4:32pm

Alexey,

We’re a lot closer, but still a bit short. Here’s what we’re able to do:

1.) We are able to generate the document dynamically and add our custom styles.
2.) We are able to clone (using your technique with a couple of changes to the GetHTML method).
3.) We are able to take the now correct HTML with a custom style and persist it into the CMS.

Here’s what we’re not able to do:
1.) When we re-generate the document and use the InsertHtml method, we do not get the custom class assigned to any of the content (even though the CSS class is available in the dropdown in Word).

For example, we have 2 custom classes; TestQuestion & TestAnswer. When we created the custom classes in code, we set them as follows:

internal void CreateCourseScriptCss(ref Aspose.Words.Document pDocument)
{
    pDocument.Styles.Add(StyleType.Paragraph, "TestQuestion");
    pDocument.Styles["TestQuestion"].Font.Bold = true;

    pDocument.Styles.Add(StyleType.Paragraph, "TestAnswer");
    pDocument.Styles["TestAnswer"].Font.Bold = false;
    pDocument.Styles["TestAnswer"].Font.Italic = true;
}

When we extract and parse out the bookmark content we end up with the following:

This is a test question
This is a test answer

We then strip out the “Char” that is appended to the class so we end up with:

This is a test question
This is a test answer

This is what is stored in the CMS and will work for us in our presentation later.
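
The “Char” clean-up step can be sketched roughly as follows. The original markup did not survive in this post, so the shape of the class attributes here is an assumption, and ClassNameCleaner and StripCharSuffix are hypothetical names. Restricting the replacement to a known list of style names avoids touching class names that legitimately end in “Char”.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ClassNameCleaner
{
    // For each known style name, rewrites class="<name>Char" to class="<name>".
    public static string StripCharSuffix(string html, IEnumerable<string> styleNames)
    {
        foreach (string name in styleNames)
        {
            string pattern = @"class=""" + Regex.Escape(name) + @"Char""";
            string replacement = "class=\"" + name + "\"";

            // A MatchEvaluator is used so that characters such as '$' in a
            // style name are not treated as substitution patterns.
            html = Regex.Replace(html, pattern, m => replacement);
        }
        return html;
    }
}
```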

Our issue is strictly that when we regenerate the Word doc for further editing, we are not getting the CSS class assigned to the content (but the classes are available in the Word style dropdown).

Any ideas? It seems like we are very close to what we’re attempting here, a lot closer than a few days ago.

edegagne · January 10, 2008, 7:40am

As an addendum, I changed the function above (just the class names as a test) and there seems to be a bug in the Aspose.Words component (?).

No matter what custom style class name we use, the 1st one added dynamically always has “Char” appended to it when we get the HTML back from the GetHTMLFromBookmark() function. All of the other ones are referenced correctly. This doesn’t seem to be affected at all by different StyleType choices either.

Also, it’s kind of a source of confusion on our end as to why the Document.Styles.Add() method doesn’t accept a Style object?!?! Why is that?

alexey.noskov · January 10, 2008, 10:38am

Hi
Thanks for your explanation.

I think that your request is related to an existing issue.
Issue #3991 – Add overload for InsertHtml method.
Add new overload for InsertHtml method that will take 2 parameters.
InsertHtml(string html, MapStyles styles)
html is html string
styles is a key-value map with the key being the HTML tag name (even optionally tagName.className combination) and the value is a Word style name.
As I told you earlier, the main thing is that Aspose.Words HTML import and export do not guarantee a full data roundtrip.
The Styles.Add method doesn’t accept a Style object as a parameter because the Style class doesn’t have a public constructor. Because of this, you can’t create a Style without a document.
Best regards.
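
A short sketch of what this means in practice: since a Style cannot be constructed on its own, the usual pattern is to let the collection create it and then configure the object it returns. This assumes Styles.Add returns the newly created Style, as it does in current Aspose.Words releases; CourseStyles and AddCourseStyles are hypothetical names.

```csharp
using Aspose.Words;

public static class CourseStyles
{
    // Creates the two custom paragraph styles inside the given document.
    public static void AddCourseStyles(Document doc)
    {
        // The collection creates the style; configure the returned object.
        Style question = doc.Styles.Add(StyleType.Paragraph, "TestQuestion");
        question.Font.Bold = true;

        Style answer = doc.Styles.Add(StyleType.Paragraph, "TestAnswer");
        answer.Font.Bold = false;
        answer.Font.Italic = true;
    }
}
```

This also avoids looking each style up by name again after adding it.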

edegagne · January 10, 2008, 11:12am

Alexey,

Thanks for your reply. An overloaded InsertHtml() function wasn’t exactly what I was asking for, and I am not sure it would easily solve the issue.

What I am asking for is this:

If I send in a string of HTML into the InsertHTML() method and that value has one (or possibly more) span/p/div tags, etc with a class="" attribute, then why doesn’t the InsertHTML method automatically map the custom class if it is present in the styles collection?

We’re adding the custom classes to the document object way before we start inserting content into bookmarks, so the custom classes are indeed in the document before we place any content into it.

Between this issue and the issue with Protection Exceptions not working, we’re kind of in a bind on this. We are potentially looking at using this component via an enterprise license in our CMS product.

Thanks for your reply.

P.S. Were you able to duplicate the issue with “Char” being appended to the class name?

Edward DeGagne | Project Manager
ektron, inc.

romank · January 10, 2008, 8:34pm

“If I send in a string of HTML into the InsertHTML() method and that value has one (or possibly more) span/p/div tags, etc with a class="" attribute, then why doesn’t the InsertHTML method automatically map the custom class if it is present in the styles collection?”
It does not work because Aspose.Words HTML import and export support only inline CSS styles that are specified in the style attribute. We are currently working on full support for CSS styles (both embedded and external), including styles specified using the class attribute. This functionality will appear first in HTML export and several months later in HTML import. Sorry, I don’t see how we can help you right now with the way you want to use Aspose.Words; there is no easy workaround that we can implement. Therefore you can only wait for full CSS support, which will hopefully be available in 2008.
“Between this issue and the issue with Protection Exceptions not working, we’re kind of in a bind on this. We are potentially looking at using this component via an enterprise license in our CMS product.”
Protection for ranges in a document is not supported in Aspose.Words. We are planning to work on this feature this year too.
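
Given that import honors only inline styles in the style attribute, one possible stopgap on the caller’s side is to rewrite class attributes into inline style attributes before passing the HTML to InsertHtml. The sketch below is not an Aspose.Words feature and was not proposed in this thread. The class-to-CSS map is supplied by the caller, and ClassInliner and InlineClasses are hypothetical names. It would only apply direct formatting; it would not assign the named Word style to the inserted content.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ClassInliner
{
    // Replaces class="Name" with style="<css>" for every class found in the map.
    // Unknown classes are left untouched. Elements that already carry a style
    // attribute are not merged, so they would end up with two style attributes.
    public static string InlineClasses(string html, IDictionary<string, string> cssByClass)
    {
        return Regex.Replace(html, @"class=""(?<name>[^""]+)""", m =>
        {
            string css;
            return cssByClass.TryGetValue(m.Groups["name"].Value, out css)
                ? "style=\"" + css + "\""
                : m.Value;
        });
    }
}
```

For the two styles in this thread, the map might contain TestQuestion mapped to font-weight:bold and TestAnswer mapped to font-style:italic, mirroring the formatting set in CreateCourseScriptCss.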