Rich Text HTML Format Detection | Specifying UTF-8 or UTF-32 Encoding | C# .NET

In version 20.4.0 we could load an incomplete HTML fragment.

“<bold>this is bold HTML</bold>” would return “this is bold HTML” (text in bold).

Starting with version 20.5.0 and up to the latest version, we now get “<bold>this is bold HTML</bold>” (text not in bold and surrounded by the bold tags).

Is this a bug that will be fixed, or is this by design?

Thanks,
Kris Goossens
Remmicom

@krisg,

However, the following C# code for Aspose.Words for .NET produces exactly the same output on our end when run with versions 20.4 and 21.8:

Document doc = new Document(@"C:\temp\input.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

builder.InsertHtml("<bold>this is bold HTML</bold>"); // in both cases it gives an output which is not bold

doc.Save("C:\\temp\\awnet-20.4.docx");

Can you please provide your source HTML, the DOCX files generated by Aspose.Words, and the relevant piece of source code here for testing?

This is our example code:

using Aspose.Words;
using System;
using System.Globalization;
using System.IO;
using System.Text;

namespace AsposeWordsPartialHtmlBug
{
    class Program
    {
        static void Main(string[] args)
        {
            var license = new License();
            license.SetLicense("Aspose.Words.NET.lic");

            var richTextField = Convert("<bold>this is bold HTML</bold>");

            try
            {
                Assert(richTextField.Text);
            }
            catch (Exception e)
            {
                Console.ForegroundColor = ConsoleColor.Red;
                Console.WriteLine(e.Message);
                Console.Read();
                Environment.Exit(1);
            }
            Environment.Exit(0);
        }

        private static void Assert(string text)
        {
            if (text.Contains("Evaluation Only"))
            {
                throw new Exception("License is invalid or not setup correctly.");
            }
            if (text.Contains("<bold>"))
            {
                throw new Exception($"HTML string was not converted correctly.{Environment.NewLine}Text was: {text}");
            }
        }

        private static RichTextField Convert(string value)
        {
            //value = WorkAround(value);
            var htmlStringAsBytes = ConvertHtmlStringToBytes(value);
            return new RichTextField(GetText(htmlStringAsBytes), htmlStringAsBytes);
        }

        private static string WorkAround(string value)
        {
            return ContainsIgnoreCase(value, "<html>") ? value : $"<html><head></head><body>{value}</body></html>";
        }

        private static bool ContainsIgnoreCase(string value, string searchString)
        {
            return CultureInfo.InvariantCulture.CompareInfo.IndexOf(value, searchString, System.Globalization.CompareOptions.IgnoreCase) >= 0;
        }

        private static byte[] ConvertHtmlStringToBytes(string html)
        {
            byte[] result;

            if (string.IsNullOrEmpty(html))
            {
                result = Array.Empty<byte>();
            }
            else
            {
                var htmlBytes = Encoding.UTF32.GetBytes(html);

                using (var htmlStream = new MemoryStream(htmlBytes))
                {
                    var doc = new Document(htmlStream, new LoadOptions(LoadFormat.Html, "", ""));

                    using (var outStream = new MemoryStream())
                    {
                        doc.Save(outStream, SaveFormat.Docx);
                        result = outStream.ToArray();
                    }
                }
            }
            return result;
        }


        private static string GetText(byte[] data)
        {
            Document doc;

            using (var memStream = new MemoryStream(data))
            {
                doc = new Document(memStream);
            }

            return CreateUnformattedText(doc);
        }

        private static string CreateUnformattedText(Document doc)
        {
            return CreateText(doc, new Aspose.Words.Saving.TxtSaveOptions { PreserveTableLayout = true, SimplifyListLabels = true });
        }


        private static string CreateText(Document doc, Aspose.Words.Saving.SaveOptions options)
        {
            using (var ms = new MemoryStream())
            {
                doc.Save(ms, options);
                ms.Position = 0;

                return CreateText(ms);
            }
        }

        private static string CreateText(MemoryStream stream)
        {
            using (var sr = new StreamReader(stream))
            {
                var text = sr.ReadToEnd();

                return text.TrimEnd();
            }
        }
    }
}

@krisg,

However, we are getting a couple of compile-time errors in the following method:

private static RichTextField Convert(string value)
{
    //value = WorkAround(value);
    var htmlStringAsBytes = ConvertHtmlStringToBytes(value);
    return new RichTextField(GetText(htmlStringAsBytes), htmlStringAsBytes);
}

The type or namespace name ‘RichTextField’ could not be found (are you missing a using directive or an assembly reference?)

Can you please create a standalone simplified Console Application (source code without compilation errors) that helps us to reproduce this problem on our end and attach it here for our testing? Please do not include Aspose.Words DLL files in it to reduce the file size. Thanks for your cooperation.

OK, sorry, that class was missing. Here it is, and I have also uploaded a zip.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AsposeWordsPartialHtmlBug
{
    public class RichTextField
    {
        public RichTextField(string text, byte[] document)
        {
            Text = text;
            Document = document;
        }

        public string Text { get; set; }

        public byte[] Document { get; set; }
    }
}

AsposeWordsPartialHtmlBug.zip (5.0 KB)

@krisg,

We have logged this problem in our issue tracking system with ID WORDSNET-22633 so that it can be investigated further. We will look into the details of this problem and keep you updated on the status of the linked issue. We apologize for any inconvenience.

@krisg,

The problem occurs because the latest versions of Aspose.Words try to load a UTF-32 encoded stream using the UTF-8 encoding. As a workaround, you can:

  • either encode HTML in UTF-8 instead of UTF-32 by using Encoding.UTF8.GetBytes instead of Encoding.UTF32.GetBytes;
  • or specify the encoding explicitly by setting LoadOptions.Encoding = Encoding.UTF32.
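Both workarounds could be applied to the ConvertHtmlStringToBytes method from the example code roughly as follows. This is only a sketch: the class name HtmlFragmentConverter and the helper LoadAndSaveAsDocx are made up for illustration, and the empty-string check from the original method is omitted for brevity.

```csharp
using System.IO;
using System.Text;
using Aspose.Words;

// Hypothetical helper class; the name is illustrative and not part of Aspose.Words.
public static class HtmlFragmentConverter
{
    // Workaround 1: encode the HTML as UTF-8, so the bytes match the encoding
    // that the loader assumes for the stream.
    public static byte[] ConvertUsingUtf8(string html)
    {
        byte[] htmlBytes = Encoding.UTF8.GetBytes(html);
        var loadOptions = new LoadOptions(LoadFormat.Html, "", "");
        return LoadAndSaveAsDocx(htmlBytes, loadOptions);
    }

    // Workaround 2: keep the UTF-32 bytes, but tell the loader explicitly
    // which encoding the stream uses.
    public static byte[] ConvertUsingExplicitUtf32(string html)
    {
        byte[] htmlBytes = Encoding.UTF32.GetBytes(html);
        var loadOptions = new LoadOptions(LoadFormat.Html, "", "")
        {
            Encoding = Encoding.UTF32
        };
        return LoadAndSaveAsDocx(htmlBytes, loadOptions);
    }

    // Loads the HTML bytes with the given options and returns the document as DOCX bytes.
    private static byte[] LoadAndSaveAsDocx(byte[] htmlBytes, LoadOptions loadOptions)
    {
        using (var htmlStream = new MemoryStream(htmlBytes))
        using (var outStream = new MemoryStream())
        {
            var doc = new Document(htmlStream, loadOptions);
            doc.Save(outStream, SaveFormat.Docx);
            return outStream.ToArray();
        }
    }
}
```

Either method can replace the body of ConvertHtmlStringToBytes. The UTF-8 variant is the smaller change; the explicit-encoding variant is useful when the bytes come from a source whose encoding you cannot change.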

We will also inform you here as soon as WORDSNET-22633 is resolved.
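To see why the encoding mismatch described above leaves the tags unrecognized, the following sketch uses only the .NET base class library (no Aspose.Words calls) to decode UTF-32 bytes as if they were UTF-8. The class and method names are illustrative. It shows what the bytes look like to any UTF-8 reader and makes no claim about how Aspose.Words processes them internally.

```csharp
using System;
using System.Text;

// Illustrative demo; the names are not taken from the original project.
public static class EncodingMismatchDemo
{
    // Encodes the text as UTF-32 and then decodes those bytes as UTF-8,
    // reproducing the mismatch between writer and reader.
    public static string DecodeUtf32BytesAsUtf8(string html)
    {
        // Encoding.UTF32.GetBytes writes 4 bytes per character and no byte order mark,
        // so the stream itself carries no hint about its encoding.
        byte[] utf32Bytes = Encoding.UTF32.GetBytes(html);
        return Encoding.UTF8.GetString(utf32Bytes);
    }

    public static void Main()
    {
        string html = "<bold>this is bold HTML</bold>";
        string misread = DecodeUtf32BytesAsUtf8(html);

        // Each ASCII character is followed by three NUL characters,
        // so the literal tag text no longer appears in the decoded string.
        Console.WriteLine(misread.Contains("<bold>"));   // False
        Console.WriteLine(misread.Contains("<\0\0\0b")); // True
    }
}
```

Because every character is separated from the next by NUL characters, a reader that assumes UTF-8 does not see the sequence <bold> at all, which is why either encoding the HTML as UTF-8 or declaring UTF-32 explicitly avoids the problem.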

Ok, thank you


The issue you found earlier (filed as WORDSNET-22633) has been fixed in the Aspose.Words for .NET 21.11 update, which is also available on NuGet.