Document to HTML - extra symbols

When I convert a DOC file to HTML, extra symbols are added to the output. See the attached file ExtraSymbolsImg.doc: an extra symbol is inserted before the <HTML> tag. This distorts my HTML, and an extra line is inserted as a result.
I am using the following code to convert the attached document (22840Simple.doc). The resulting HTML is attached as 22840HTML.doc.

string html = "";
Document ThisDocument = new Document("C:/DocuMed/22840Simple.doc", LoadFormat.Doc, null);
using (MemoryStream stream = new MemoryStream())
{
    ThisDocument.Save(stream, SaveFormat.Html);
    html = Encoding.UTF8.GetString(stream.GetBuffer(), 0, (int)stream.Length);
}
return html;

Please advise.
P.S. The post below discusses a similar problem, but I see no resolution there.
https://forum.aspose.com/t/116153

Hi Melissa,

Thanks for your inquiry. The extra symbol is a BOM (Byte Order Mark). Please try using the following code to remove it from the string:

public string ConvertDocumentToHtml(Document doc)
{
    string html = string.Empty;
    // Save document to MemoryStream in Html format
    using(MemoryStream htmlStream = new MemoryStream())
    {
        doc.Save(htmlStream, SaveFormat.Html);
        // Get Html string
        html = Encoding.UTF8.GetString(htmlStream.GetBuffer(), 0, (int) htmlStream.Length);
    }
    // There could be BOM at the beginning of the string.
    // We should remove it from the string.
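    // Skip everything before the first '<'; this is also safe for an empty string.
    int start = html.IndexOf('<');
    if (start > 0)
        html = html.Substring(start);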
    return html;
}
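
To illustrate where the extra symbol comes from, here is a minimal, self-contained sketch that does not use Aspose.Words at all. It assumes the saved stream starts with the UTF-8 preamble (bytes EF BB BF), which Encoding.UTF8.GetString decodes to the character U+FEFF. The class BomDemo and the helper DecodeWithoutBom are hypothetical names used only for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Decodes UTF-8 bytes and drops a single leading U+FEFF (the decoded BOM), if present.
    public static string DecodeWithoutBom(byte[] buffer, int length)
    {
        string text = Encoding.UTF8.GetString(buffer, 0, length);
        return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
    }

    public static void Main()
    {
        using (MemoryStream stream = new MemoryStream())
        {
            // Simulate a saver that writes the UTF-8 preamble before the markup.
            byte[] preamble = Encoding.UTF8.GetPreamble();
            stream.Write(preamble, 0, preamble.Length);
            byte[] body = Encoding.UTF8.GetBytes("<html></html>");
            stream.Write(body, 0, body.Length);

            // Plain decoding keeps the BOM as the first character of the string.
            string raw = Encoding.UTF8.GetString(stream.GetBuffer(), 0, (int)stream.Length);
            string clean = DecodeWithoutBom(stream.GetBuffer(), (int)stream.Length);

            Console.WriteLine(raw[0] == '\uFEFF'); // the BOM survives decoding
            Console.WriteLine(clean[0] == '<');    // the cleaned string starts at the markup
        }
    }
}
```

Checking for U+FEFF explicitly is narrower than skipping to the first '<': it removes only the BOM and leaves any other leading text untouched.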

Best regards,

Thank you. My document had an extra line break added at the beginning, and this fixed it.
I have another question, though. With the same document, an extra line break is also added at the end. Every time I run the following code, another line break is added at the end of the document:

string HTML = "";
Document ThisDocument = new Document("C:/DocuMed/22840Basic.doc", LoadFormat.Doc, null);
// Get HTML
Document docClone = ThisDocument.Clone();
Bookmark mark = docClone.Range.Bookmarks["Body"];
foreach (Section ThisSection in docClone.Sections)
    ThisSection.HeadersFooters.Clear();
mark.Remove();
using(MemoryStream stream = new MemoryStream())
{
    docClone.Save(stream, SaveFormat.Html);
    HTML = Encoding.UTF8.GetString(stream.GetBuffer(), 0, (int) stream.Length);
}
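// Skip everything before the first '<' (e.g. a BOM); this is also safe for an empty string.
int start = HTML.IndexOf('<');
if (start > 0)
    HTML = HTML.Substring(start);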
// End GetHTML
// Build new document
DocumentBuilder DB = new DocumentBuilder(ThisDocument);
Bookmark bookmarkBody = ThisDocument.Range.Bookmarks["Body"];
bookmarkBody.Text = string.Empty;
DB.MoveToBookmark("Body", true, true);
DB.InsertHtml(HTML);
ThisDocument.Save("C:/DocuMed/22840Basic.doc");

Please advise.
P.S. Thank you for all your help! We are a document processing company, building up our website. I am finding Aspose to be a wonderful product, which fills our needs with processing documents via the web. I am especially satisfied with the service & customer support. You are always quick to answer and able to fix the problems.

Hi Melissa,

Thanks for your inquiry. Your document has one paragraph that contains the bookmark. When you insert HTML that contains other paragraphs, the paragraph with the bookmark is not removed, so it remains at the end of the document as an empty paragraph. If you would like to delete it, I suggest removing all empty paragraphs from the end of the document. Please see the following code:

// Remove empty paragraphs from the end of the document.
while (doc.LastSection.Body.LastParagraph != null &&
    string.IsNullOrEmpty(doc.LastSection.Body.LastParagraph.GetText().Trim()))
    doc.LastSection.Body.LastParagraph.Remove();

Hope this helps.
Best regards,

The issues you have found earlier (filed as WORDSNET-3087) have been fixed in this .NET update and this Java update.
