When I convert a DOC file to HTML, extra symbols are added to the output. See the attached file ExtraSymbolsImg.doc: an extra symbol is inserted before the <HTML> tag. This distorts my HTML and causes an extra line to be inserted.
I am using the following code to convert the attached document (22840Simple.doc). It produces the HTML in the attached 22840HTML.doc.
string html = "";
Document ThisDocument = new Document("C:/DocuMed/22840Simple.doc", LoadFormat.Doc, null);
using (MemoryStream stream = new MemoryStream())
{
    ThisDocument.Save(stream, SaveFormat.Html);
    html = Encoding.UTF8.GetString(stream.GetBuffer(), 0, (int)stream.Length);
}
return html;
Please advise.
P.S. See post here which discusses a similar problem - however, I see no resolution there. https://forum.aspose.com/t/116153
Thanks for your inquiry. The extra symbol is a BOM (Byte Order Mark) at the beginning of the string. Please try the following code to remove it:
public string ConvertDocumentToHtml(Document doc)
{
    string html = string.Empty;
    // Save document to MemoryStream in Html format
    using (MemoryStream htmlStream = new MemoryStream())
    {
        doc.Save(htmlStream, SaveFormat.Html);
        // Get Html string
        html = Encoding.UTF8.GetString(htmlStream.GetBuffer(), 0, (int)htmlStream.Length);
    }
    // There could be a BOM at the beginning of the string.
    // Remove every character before the first '<'; the length check
    // avoids an exception if the string contains no '<' at all.
    while (html.Length > 0 && html[0] != '<')
        html = html.Substring(1);
    return html;
}
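As a possible alternative to trimming characters one at a time, the sketch below decodes the stream with a StreamReader, which recognizes a leading UTF-8 BOM and leaves it out of the returned text. It uses only standard .NET classes and simulates the saved HTML with a hand-built byte array, so the class name BomDemo and the helper DecodeWithoutBom are hypothetical and are not part of Aspose.Words.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: decodes the stream as UTF-8 text.
    // StreamReader recognizes a leading UTF-8 BOM and does not return it.
    public static string DecodeWithoutBom(MemoryStream stream)
    {
        stream.Position = 0;
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Simulate saved output: UTF-8 BOM (EF BB BF) followed by markup.
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<html></html>");
        using (MemoryStream stream = new MemoryStream())
        {
            stream.Write(bom, 0, bom.Length);
            stream.Write(body, 0, body.Length);

            // Decoding the raw buffer keeps the BOM as the character U+FEFF.
            string raw = Encoding.UTF8.GetString(stream.GetBuffer(), 0, (int)stream.Length);
            Console.WriteLine(raw[0] == '\uFEFF');   // True

            // Reading through StreamReader drops the BOM.
            string clean = DecodeWithoutBom(stream);
            Console.WriteLine(clean[0] == '<');      // True
        }
    }
}
```

In the method above, the same idea would mean rewinding htmlStream after doc.Save and reading it with a StreamReader instead of calling Encoding.UTF8.GetString. Note that disposing the reader also closes the stream, so this should be the last use of it.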
Thank you. My document had an extra line break added in the beginning, which this fixed.
I have another question, though. The same document also gets an extra line break at the end. Every time I run the following code, an extra line break is added at the end of the document:
string HTML = "";
Document ThisDocument = new Document("C:/DocuMed/22840Basic.doc", LoadFormat.Doc, null);
// Get HTML
Document docClone = ThisDocument.Clone();
Bookmark mark = docClone.Range.Bookmarks["Body"];
foreach (Section ThisSection in docClone.Sections)
    ThisSection.HeadersFooters.Clear();
mark.Remove();
using (MemoryStream stream = new MemoryStream())
{
    docClone.Save(stream, SaveFormat.Html);
    HTML = Encoding.UTF8.GetString(stream.GetBuffer(), 0, (int)stream.Length);
}
while (HTML[0] != '<')
    HTML = HTML.Substring(1);
// End GetHTML
// Build new document
DocumentBuilder DB = new DocumentBuilder(ThisDocument);
Bookmark bookmarkBody = ThisDocument.Range.Bookmarks["Body"];
ThisDocument.Range.Bookmarks["Body"].Text = string.Empty;
DB.MoveToBookmark("Body", true, true);
DB.InsertHtml(HTML);
ThisDocument.Save("C:/DocuMed/22840Basic.doc");
Please advise.
P.S. Thank you for all your help! We are a document processing company, building up our website. I am finding Aspose to be a wonderful product, which fills our needs with processing documents via the web. I am especially satisfied with the service & customer support. You are always quick to answer and able to fix the problems.
Thanks for your inquiry. Your document has one paragraph that contains the bookmark. When you insert HTML that contains other paragraphs, the paragraph with the bookmark is not lost, so it appears at the end of the document. If you would like to delete this paragraph, I think you should delete all empty paragraphs from the end of the document. Please see the following code:
// Remove empty paragraphs from the end of the document.
while (doc.LastSection.Body.LastParagraph != null &&
       string.IsNullOrEmpty(doc.LastSection.Body.LastParagraph.GetText().Trim()))
    doc.LastSection.Body.LastParagraph.Remove();