Aspose.Words: Converting a Word document to HTML - query on ‘chunking’ base64 encoded images to the recommended 76 character length

Aspose version: Aspose Words.NET version 23.6.0

We are using Aspose.Words to convert a Word Document (docx) to HTML by simply saving the loaded docx (loaded into Aspose.Words.Document) as HTML.

The document can contain embedded images which are exported as base64 into the HTML using the HtmlSaveOptions property ExportImagesAsBase64.

Is there a way to ensure the produced continuous base64 string is ‘chunked’ (split across multiple lines) so that each line does not exceed the recommended maximum length of 76 characters?

(The recommendation comes from RFC 2045, which defines the format of Internet message bodies. The generated HTML will be used as such a message body.)

If not, is there a hook, similar to IImageSavingCallback, that lets us ‘listen’ for the conversion and manipulate the base64 data ourselves?

Thanks in advance,

James

@JamesMurdoch

To achieve the desired chunking of base64 encoded images in the HTML output generated by Aspose.Words, you can utilize the IImageSavingCallback interface. This allows you to control how images are exported during the conversion process.

Here’s a general approach you can follow:

  1. Implement the IImageSavingCallback Interface: Create a class that implements this interface. In the ImageSaving method, you can manipulate the base64 string of the image.

  2. Chunk the Base64 String: In the ImageSaving method, after obtaining the base64 string, you can split it into chunks of 76 characters each and then reassemble it with line breaks.

  3. Set the Callback: When saving the document to HTML, set your custom callback to handle the image saving process.

Here’s a sample implementation:

using Aspose.Words;
using Aspose.Words.Saving;
using System;
using System.IO;
using System.Text;

public class CustomImageSavingCallback : IImageSavingCallback
{
    public void ImageSaving(ImageSavingArgs args)
    {
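        // Convert the image to base64.
        // NOTE: this relies on args.Image, args.ImageFormat and args.Base64Data.
        // The staff reply later in this thread states that there is currently
        // no way to achieve this, so treat these members as unverified.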
        using (MemoryStream ms = new MemoryStream())
        {
            args.Image.Save(ms, args.ImageFormat);
            string base64String = Convert.ToBase64String(ms.ToArray());

            // Chunk the base64 string
            StringBuilder chunkedBase64 = new StringBuilder();
            for (int i = 0; i < base64String.Length; i += 76)
            {
                int length = Math.Min(76, base64String.Length - i);
                chunkedBase64.AppendLine(base64String.Substring(i, length));
            }

            // Set the base64 string back to the args
            args.Base64Data = chunkedBase64.ToString();
        }
    }
}

// Usage
Document doc = new Document("input.docx");
HtmlSaveOptions options = new HtmlSaveOptions
{
    ExportImagesAsBase64 = true
};

options.ImageSavingCallback = new CustomImageSavingCallback();
doc.Save("output.html", options);

In this example, the CustomImageSavingCallback class handles the image saving process, converting images to base64 and chunking them into lines of 76 characters. This way, you can ensure compliance with RFC 2045 while exporting your Word document to HTML.

Feel free to adjust the implementation as needed for your specific requirements!

@JamesMurdoch Unfortunately, currently there is no way to achieve this.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-28134

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Ok, thanks for looking into this Alexey and letting us know.


@JamesMurdoch We have finished analyzing the issue and concluded that we should not implement this feature, because it would not solve your problem. You want to generate an HTML document that complies with the line length limit imposed by RFC 2045. However, base64-encoded images in HTML are embedded in the surrounding content, so that content shifts the image data and pushes lines past the limit even when the base64 data itself is wrapped.
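To illustrate the point about surrounding content, here is a minimal sketch that does not use Aspose.Words at all. The class name, the helper names, and the `<p>Logo: ...` markup are hypothetical and chosen only for the demonstration. It wraps a base64 payload at 76 characters with the standard .NET `Base64FormattingOptions.InsertLineBreaks` option, embeds it in an `<img>` tag, and measures the longest resulting line.

```csharp
using System;
using System.Linq;

public static class InlineImageLineLengthDemo
{
    // Builds an <img> tag whose base64 payload is wrapped at 76 characters,
    // roughly what the requested feature would have produced.
    // The surrounding markup is a made-up example.
    public static string BuildImgTag(byte[] imageBytes)
    {
        string wrapped = Convert.ToBase64String(
            imageBytes, Base64FormattingOptions.InsertLineBreaks);
        return "<p>Logo: <img src=\"data:image/png;base64," + wrapped
             + "\" alt=\"logo\" /></p>";
    }

    // Returns the length of the longest line, ignoring CR/LF terminators.
    public static int LongestLine(string text)
    {
        return text.Split('\n')
                   .Select(line => line.TrimEnd('\r').Length)
                   .Max();
    }
}
```

For any payload longer than one base64 line, `LongestLine(BuildImgTag(bytes))` comes out above 76, because the first base64 line is preceded by the tag and the `data:` prefix on the same line.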

Your problem should instead be solved at a higher level, by the program that embeds the generated HTML into emails. That program should encode the whole HTML document using the “quoted-printable” or “base64” encoding and split the lines of the encoded content appropriately.
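As a rough sketch of that higher-level approach, assuming the mail-building code has the saved HTML available as a string, the whole document can be base64-encoded with line breaks every 76 characters using only the .NET base class library. The class and method names below are hypothetical. A real mail library would normally do this itself once the part's transfer encoding is set to base64 or quoted-printable.

```csharp
using System;
using System.Text;

public static class HtmlMimeBodySketch
{
    // Encodes the complete HTML document (markup and inline images alike)
    // as base64, with a line break inserted after every 76 characters.
    public static string EncodeHtmlBody(string html)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(html);
        return Convert.ToBase64String(
            utf8, Base64FormattingOptions.InsertLineBreaks);
    }

    // Reverses EncodeHtmlBody; Convert.FromBase64String skips the
    // white-space characters introduced by the line breaks.
    public static string DecodeHtmlBody(string encoded)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
    }
}
```

The encoded text would be sent as the body of a MIME part declared with `Content-Type: text/html; charset=utf-8` and `Content-Transfer-Encoding: base64`. Because the wrapping is applied to the encoded bytes of the entire document, the line lengths inside the original HTML no longer matter.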

We are going to close this request with the “Won’t Fix” resolution.

The issues you have found earlier (filed as WORDSNET-28134) have been fixed in this Aspose.Words for .NET 25.5 update also available on NuGet.