Embedded CSS stylesheet questions

I have downloaded the new version of Aspose.Words and used it in the following code:

private string getSectionHtml(Section section, string title)
{
    string returnHtml;
    Document dummyDoc = new Document();
    dummyDoc.RemoveAllChildren();
    dummyDoc.AppendChild(dummyDoc.ImportNode(section, true, ImportFormatMode.KeepSourceFormatting));

    dummyDoc.BuiltInDocumentProperties.Title = title;

    dummyDoc.SaveOptions.ExportPrettyFormat = true;
    // This is to allow headings to appear to the left of main text. 
    dummyDoc.SaveOptions.HtmlExportAllowNegativeLeftIndent = true;
    dummyDoc.SaveOptions.HtmlExportHeadersFooters = false;

    dummyDoc.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.External;

    // Code here for obtaining HTML in a string. Note that images will still be written into files! 
    MemoryStream stream = new MemoryStream();

    // By default, images will be written into the system TEMP folder! control this by using the following option. 
    dummyDoc.SaveOptions.ExportImagesFolder = _imagesOriginDir;

    dummyDoc.Save(stream, SaveFormat.Html);

    // Rewind the stream to beginning. 
    stream.Position = 0;

    StreamReader reader = new StreamReader(stream);
    returnHtml = reader.ReadToEnd();

    // close the stream 
    stream.Close();

    return returnHtml;
}

When I run the code I get the following error:
“You are saving HTML to a stream and requesting a CSS style sheet to be written into a separate file. This is not supported. You need to either save HTML to a file or request CSS style sheet to be embedded.”

This is a problem because I need to save to a stream to get the HTML as a string, which I then add to a placeholder via code.
I then noticed that I only need the HTML between the body tags anyway, so I switched from the external to the embedded style sheet option and added code to strip the rest:

private string getSectionHtml(Section section, string title)
{
    string returnHtml;
    Document dummyDoc = new Document();
    dummyDoc.RemoveAllChildren();
    dummyDoc.AppendChild(dummyDoc.ImportNode(section, true, ImportFormatMode.KeepSourceFormatting));

    dummyDoc.BuiltInDocumentProperties.Title = title;

    dummyDoc.SaveOptions.ExportPrettyFormat = true;
    // This is to allow headings to appear to the left of main text. 
    dummyDoc.SaveOptions.HtmlExportAllowNegativeLeftIndent = true;
    dummyDoc.SaveOptions.HtmlExportHeadersFooters = false;

    dummyDoc.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.Embedded;

    // Code here for obtaining HTML in a string. Note that images will still be written into files! 
    MemoryStream stream = new MemoryStream();

    // By default, images will be written into the system TEMP folder! control this by using the following option. 
    dummyDoc.SaveOptions.ExportImagesFolder = _imagesOriginDir;

    dummyDoc.Save(stream, SaveFormat.Html);

    // Rewind the stream to beginning. 
    stream.Position = 0;

    StreamReader reader = new StreamReader(stream);
    returnHtml = reader.ReadToEnd();

    // close the stream 
    stream.Close();

    // Strip everything in the html before and after the <body> tags 
    // We won't require this html as it is already included when the Sitecore item is created. 
    returnHtml = TrimHtml(returnHtml, "<body>", "</body>");

    returnHtml = returnHtml.Replace(" class=\"Normal0\"", "");
    returnHtml = returnHtml.Replace(" class=\"ListBullet\"", "");

    return returnHtml;
}
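
The TrimHtml helper itself is not shown in the post. As a rough idea only, a helper of that shape might look like the sketch below. The class name HtmlTrimSketch, the parameter names and the fallback behaviour are all assumptions for illustration, and it assumes the markers appear literally in the HTML (an opening body tag that carries attributes would not match a plain marker).

```csharp
using System;

public static class HtmlTrimSketch
{
    // Hypothetical stand-in for the TrimHtml helper called above; the real one is not shown.
    // Returns the text between the first startMarker and the next endMarker,
    // or the input unchanged if either marker is missing.
    public static string TrimHtml(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0)
            return html;
        start += startMarker.Length;

        int end = html.IndexOf(endMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
            return html;

        // Keep only the content between the two markers.
        return html.Substring(start, end - start).Trim();
    }
}
```

Returning the input untouched when a marker is missing is a deliberate choice here, so that a change in the exported markup shows up as unexpected extra HTML rather than as an empty string.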

The only remaining issue is that the tags still contain class references, which I wanted to remove. String replacement, as above, is the only way I can see to do this.

This seemed to do the trick, and in initial testing all seemed well. But then I noticed that any tables in the HTML still had inline styles. Is there a way I can strip these table styles, in particular the font size and font family?

As always any advice or suggestions would be greatly appreciated.

This message was posted using Email2Forum by romank.

Hi
Thanks for your request. It seems that this occurs because table styles are not supported by Aspose.Words. I will consult with my colleague who works on this feature and provide you with more information.
Best regards,

Hello!
Thank you for your interest in Aspose.Words.
Disallowing a save to a stream with an external CSS style sheet is a deliberate design restriction. We tried many approaches, but all of them were ambiguous and of little practical use. The way images are handled could serve as a model for CSS, but we currently consider that design bad practice: a caller who requests saving to a stream generally wants to avoid creating any files. I'll be glad to discuss this if you have any ideas.
If your task is to retrieve part of the HTML, it has to be trimmed or extracted with regular expressions and similar means, so you are doing everything right here. We plan to support exporting individual nodes in the future, but many details are still open. Removing individual attributes and nodes from the output is also fine if it gets you what you need. These tasks are quite specific, so we cannot provide a one-call API for them.
Regarding table styles, I'm sorry to disappoint you: they are not supported in the product at all. In particular, we don't output them as definitions in the CSS style sheet. We plan to support table styles in the future. Currently all we output is the formatting of every cell or, strictly speaking, of the paragraphs inside cells. Do you need to strip away this formatting or to apply table styles? The first task can be done with HTML postprocessing, but the second has no workaround.
Please feel free to ask any further questions. You are the first person to give feedback on the embedded/external CSS feature, and your experience could be of great help in improving it.
Regards,

Thanks Klepus for your response.
“Do you need strip away this formatting or apply table styles?”
I want to be able to strip away this formatting.
“The first task can be done with HTML postprocessing …”
Do you mean that this HTML postprocessing can be done via the Aspose API, or with standard .NET Replace/Trim code as I am currently doing? If via Aspose, could you provide me with a code example?
As always I appreciate your help and advice on this.
Thanks,
Rodney.

Hello Rodney.
Thank you for the clarification.
Currently you are doing this:

returnHtml = returnHtml.Replace(" class=\"Normal0\"", ""); 
returnHtml = returnHtml.Replace(" class=\"ListBullet\"", "");

Such things can be worked around by changing styles in the document before conversion to HTML. For instance, the “Normal” style is never output as a class; it is simply omitted. You can assign it to some paragraphs in the document. But this could leave some cases uncovered and needs more investigation.
You can attach a document you are experimenting with and I can try it myself if you would like. In that case, please specify exactly what should be removed and what should be retained. string.Replace and regular expressions remain our fallback if something cannot be done via the API.
Best regards,
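
For the table formatting mentioned earlier in the thread (font size and font family written into style attributes), the string/regex fallback might be sketched roughly as follows. This is not an Aspose.Words API. The class name InlineStyleSketch and the method StripFontFormatting are hypothetical, and the sketch assumes style attributes are double-quoted and that no declaration value contains a semicolon.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineStyleSketch
{
    // Matches a whole double-quoted style attribute and captures its declarations.
    private static readonly Regex StyleAttribute =
        new Regex(" style=\"(?<css>[^\"]*)\"", RegexOptions.IgnoreCase);

    // Matches one font-family or font-size declaration, with its trailing semicolon if present.
    private static readonly Regex FontDeclaration =
        new Regex(@"\s*font-(family|size)\s*:[^;]*;?", RegexOptions.IgnoreCase);

    // Removes font-family and font-size from every style attribute in the HTML,
    // leaving all other declarations in place.
    public static string StripFontFormatting(string html)
    {
        return StyleAttribute.Replace(html, m =>
        {
            string css = FontDeclaration.Replace(m.Groups["css"].Value, "").Trim();
            // Drop the attribute entirely when nothing is left in it.
            return css.Length == 0 ? "" : " style=\"" + css + "\"";
        });
    }
}
```

Calling InlineStyleSketch.StripFontFormatting(returnHtml) after the stream has been read would leave other declarations, such as widths or borders, untouched.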

Thanks Klepus,
I have attached the example document which I am testing it with, which has example cases such as:
content here
We would like to show:
content here
Also as an alternative the document authors could define their own styles in Word. I tested this conversion to html with Aspose and this is the result (where “HtmlExportCssStyleSheetType = CssStyleSheetType.Embedded;”):
What we would like, because we already have our own stylesheets defined, is to show:
Any advice or suggestions you have on this would be greatly appreciated.
Regards,
Rodney.

Hello Rodney.
Thank you for providing additional information.
You can perform this replacement with code like the following:

public static void TestRodney_120948()
{
    const string fileName = "Rodney_120948.doc";
    Document doc = new Document(fileName);
    // Set some required options.
    // Remove "pretty format" once you have finished debugging.
    // Suggest where to save images in production version.
    doc.SaveOptions.HtmlExportCssStyleSheetType = CssStyleSheetType.Embedded;
    doc.SaveOptions.ExportPrettyFormat = true;
    doc.SaveOptions.ExportImagesFolder = Environment.CurrentDirectory;
    string html;
    // Save to stream
    using (MemoryStream streamHtml = new MemoryStream())
    {
        doc.Save(streamHtml, SaveFormat.Html);
        // Seek to the beginning so it can be read
        streamHtml.Seek(0, SeekOrigin.Begin);
        // Get all the original content to a string to postprocess it further
        using (StreamReader srHtml = new StreamReader(streamHtml))
            html = srHtml.ReadToEnd();
    }
    // Save original HTML to compare
    using (StreamWriter sw = new StreamWriter(fileName.Replace(".doc", "_orig.html")))
        sw.Write(html);
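    // Postprocess original HTML:
    // 1. Match the simple case, for instance:
    // <span style="font-weight:bold; ">some text</span>
    // 2. Match a style attribute that contains other properties along with font-weight, for instance:
    // <span style="[other properties] font-weight:bold; [other properties]">some text</span>
    html = Regex.Replace
    (
    html,
    "\\<span style=\"font-weight\\:bold; \"\\>(?<text>([^\\<]*|\\<br /\\>))\\</span\\>",
    "<b>${text}</b>"
    );
    html = Regex.Replace
    (
    html,
    "\\<span style=\"[^\"]*font-weight\\:bold; [^\"]*\"\\>(?<text>([^\\<]*|\\<br /\\>))\\</span\\>",
    "<b>${text}</b>"
    );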
    // Save modified HTML
    using (StreamWriter sw = new StreamWriter(fileName.Replace(".doc", "_modified.html")))
        sw.Write(html);
}

Please note that the regular expressions in the code make some assumptions and cannot be used universally. For instance, <br /> is handled as a special case here. If you would also like to transform “font-style:italic” to <i>, you will need more replacements. These two just cover the document that you attached and show the idea.
You might not like what this sample does with image file names. When saving to a stream, Aspose.Words inserts a GUID into every image file name to guarantee uniqueness. This can be handled by renaming the image files and replacing the names in the HTML; if that proves difficult, I will try to help with it as well. (The other way is to save to a file, load it again and rewrite it, but since you are working with streams, that does not apply here.)
Regarding the CSS formatting types: inline, embedded and external. With inline styles everything is straightforward: all formatting for every element is written to its style attribute. The embedded and external types generate CSS style definitions separately from the HTML code, but only MS Word styles are output as CSS classes. If an element has a style attribute, possibly in addition to a class reference, it means direct formatting was applied to the original in MS Word. In a well-designed document few or no elements carry a style attribute, and everything possible is moved into CSS. So if you don’t want font-family or anything else to appear inline, try using only MS Word styles without any direct formatting.
In the current implementation, MS Word paragraph styles can be applied to p and li elements, and character styles to span elements only. MS Word list and table styles are not supported, and we never output any MS Word styles to div elements. If your stylesheets use qualified names like div.MyStyle, change them to p.MyStyle, span.MyStyle or even .MyStyle (with a leading period).
Best regards,

Thanks Klepus I appreciate your efforts on this.
Your regular expressions have helped a lot. The following are the main ones that I have implemented to clean out these HTML tags:

// replace <span style="font-family:'...'; ...">some text</span>
// with some text
htmlReturn = Regex.Replace(htmlReturn, "\\<span style=\"font-family\\:\x27[^\x27]*\x27[^>]+\"\\>(?<text>(.*?))\\</span\\>", "${text}");
// replace <span style="... font-weight:bold; ...">some text</span>
// with <b>some text</b>
htmlReturn = Regex.Replace(htmlReturn, "\\<span style=\"[^\"]*font-weight\\:bold; [^\"]*\"\\>(?<text>(.*?))\\</span\\>", "<b>${text}</b>");
// replace <span style="... font-style:italic; ...">some text</span>
// with <i>some text</i>
htmlReturn = Regex.Replace(htmlReturn, "\\<span style=\"[^\"]*font-style\\:italic; [^\"]*\"\\>(?<text>(.*?))\\</span\\>", "<i>${text}</i>");

The html is now in a state which passes the requirements of our clients.
Once again, thank you very much for your help.
Regards,
Rodney.