We are getting undesired output when extracting bulleted lists from Word documents and saving them to HTML. I understand that the Aspose implementation of bullets uses the non-breaking space character, which is what we are seeing. However, there are far too many of these characters, so the gap between each bullet and its text is much too wide and the output is unacceptable.
Our solution begins with a Word document that resides in Microsoft SharePoint. The document contains content controls (StructuredDocumentTag). We open the document and loop through all the content controls and identify them by tag. The content control in question is a rich text control. I import the content from the StructuredDocumentTag into a temp document using the ImportNode method. The document is then saved as HTML.
Here are the HtmlSaveOptions being used:
HtmlSaveOptions BodyOptions = new HtmlSaveOptions(SaveFormat.Html);
BodyOptions.ImageSavingCallback = new HandleImageSaving(this);
BodyOptions.ImagesFolder = _ImageSettings.BodyImagesFolder;
BodyOptions.ImagesFolderAlias = _ImageSettings.ImagesFolderAlias;
BodyOptions.CssStyleSheetType = CssStyleSheetType.Inline;
I am then saving the document into a memory stream:
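using (MemoryStream msDoc = new MemoryStream())
{
    tempDoc.Save(msDoc, BodyOptions); // tempDoc = the temp document holding the imported content (variable name assumed here)
    msDoc.Position = 0; // rewind the stream before reading it back
    string BodyContent = new StreamReader(msDoc).ReadToEnd(); // return the stream contents as a string
}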
By the time we have the BodyContent string above, the extra spacing is already there. We extract the HTML as text because we only need to isolate the body tag, which we then do with the HtmlAgilityPack, but the extra spacing is present before that step.
My only remedy for this at the moment is to remove the characters via String.Replace. I’ve included a sample of the output below.
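For reference, the workaround looks roughly like the sketch below. It is a simplified illustration, not our exact code: the class and method names (NbspCleaner, CollapseNbspRuns) are made up for this post, and it assumes the non-breaking spaces appear in the HTML either as the literal U+00A0 character or as the entities &#xa0;, &#160; or &nbsp;. Instead of stripping every occurrence, it collapses each run of two or more down to a single one.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class NbspCleaner
{
    // Matches two or more consecutive non-breaking spaces, written either as
    // the literal character U+00A0 or as the entities &#xa0;, &#160; or &nbsp;.
    // Which of these forms actually appears in the output is an assumption.
    private static readonly Regex NbspRun = new Regex(
        @"(?:\u00A0|&#xa0;|&#160;|&nbsp;){2,}",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Replaces every such run with a single &#xa0; entity and leaves
    // isolated non-breaking spaces untouched.
    public static string CollapseNbspRuns(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }
        return NbspRun.Replace(html, "&#xa0;");
    }
}
```

It would be called on the string read back from the memory stream, e.g. BodyContent = NbspCleaner.CollapseNbspRuns(BodyContent);. This only hides the symptom after the save, so it does not explain why this one document produces so many of these characters.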
The curious thing is that this is happening to only a single document. I’ve attached this document for review. The workflow is that the user will enter information in the Body content control and save the file back to SharePoint, then we pull it down for processing.
HTML output containing too many space characters: