(M)HTML to PDF conversion does not support Unicode

tech.s-und-n.de · July 7, 2015, 3:07am

Hello,

I’m trying to convert a .mht/.html file with Aspose.Pdf 10.5.0. These files may contain language-specific letters (e.g. German umlauts). In the resulting PDF, these characters are either removed or replaced with question marks.

HTML to PDF (as suggested in a bug report):

var pdf = new Pdf
{
    HtmlInfo = { CharSet = "UTF-8", CharsetApplyingLevelOfForce = HtmlInfo.CharsetApplyingForceLevel.EnforceUseAlways }
};
pdf.SetUnicode();

var section = pdf.Sections.Add();
var text = new Text(section, htmlString)
{
    IsHtmlTagSupported = true,
    IsHtml5Supported = true,
    TextInfo = { FontName = "Arial Unicode MS" },
    IfHtmlTagSupportedOverwriteHtmlFontNames = true
};
text.TextInfo.IsFontEmbedded = true;

section.Paragraphs.Add(text);

pdf.Save(pdfOutputPath);

MHTML to PDF:

using (var document = new Document(mhtmlFile, new MhtLoadOptions())
{
    PageInfo = { Margin = new Aspose.Pdf.MarginInfo(25, 20, 25, 25) }
})
{
    document.Save(pdfOutputPath, SaveFormat.Pdf);
}

Note: I couldn't find any way to add Unicode support for the 'Aspose.Pdf.Document.Document' class. Should it be auto-detected or is it missing?

I prefer the second (MHTML to PDF) approach.

tilal.ahmad · July 8, 2015, 12:37am

Hi there,

Thanks for your inquiry. I am afraid MHT to PDF conversion does not support defining an input encoding, so we have logged an enhancement ticket, PDFNEWNET-38977, in our issue tracking system for this purpose. We will notify you as soon as it is resolved.

However, you can define input encoding in HTML to PDF conversion as follows. Hopefully, it will help you to accomplish the task.

HtmlLoadOptions options = new HtmlLoadOptions();
options.InputEncoding = "UTF-8";

Aspose.Pdf.Document pdf = new Aspose.Pdf.Document("Htmlfile.html", options);
pdf.Save("output.pdf");

Best Regards,

tilal.ahmad · July 9, 2015, 10:55pm

Hi there,

In addition to the above reply, we would appreciate it if you could share your sample MHTML document here. It will help us to understand and address your issue exactly.

Best Regards,

tech.s-und-n.de · July 13, 2015, 8:50am

The application runs on a remote computer. Unfortunately I don’t have direct access, thus I cannot provide a file.

Using the HtmlLoadOptions with InputEncoding = “UTF-8” solves the encoding problem, but the images are not included in the PDF file.

Code:

var path = @"C:\temp\test";
var htmlFile = @"test.html";
var pdfFile = @"test.pdf";
var options = new HtmlLoadOptions(path)
{
    InputEncoding = "UTF-8"
};
var pdfDocument = new Document(Path.Combine(path, htmlFile), options);
pdfDocument.Save(Path.Combine(path, pdfFile));

All ‘embedded’ files are in a sub directory.

<img src="./(M)HTML to PDF conversion does not support Unicode - Aspose.Pdf Product Family - Forums_files/0b0d5b79-9f1a-4862-9a92-a5c852cc1e1b.gif" alt="Aspose Staff Member" style="border-width:0px;" />

The ‘src’ attribute may not start with ‘.’. You can save a full website with Google Chrome (it’s just an example, I don’t want to use the WebRequest API).

codewarior · July 14, 2015, 8:00am

Hi,

Thanks for sharing the details.

During our testing, we did not notice any issue related to image rendering when converting HTML files to PDF format. However, in order to replicate the issue you are facing, could you please save the page contents to disk, save the respective images in a directory, and share the resources with us so that we can look into this matter further?

tech.s-und-n.de · July 15, 2015, 3:32am

I’ve uploaded a sample project.

The error occurs if the file path uses ‘%20’ instead of spaces.

src="test file.png" → OK

src="test%20file.png" → not OK

After reading the documentation I could create a workaround:

var options = new HtmlLoadOptions(baseDirectory)
{
    InputEncoding = "UTF-8"
};
options.CustomLoaderOfExternalResources = (resourceUri) =>
{
    var fileBytes = System.IO.File.ReadAllBytes(Path.Combine(options.BasePath, Uri.UnescapeDataString(resourceUri)));
    return new LoadOptions.ResourceLoadingResult(fileBytes);
};
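
To make the difference between the two src values concrete, here is a minimal sketch of the decoding step the workaround relies on. It uses only `Uri.UnescapeDataString` and `Path.Combine` from the .NET base class library; `ResourcePathDemo` and `ResolveLocalPath` are hypothetical names chosen for illustration and are not part of Aspose.Pdf.

```csharp
using System;
using System.IO;

public static class ResourcePathDemo
{
    // Hypothetical helper (not an Aspose.Pdf API): maps an HTML 'src' value
    // to a file path below baseDirectory, decoding percent-escapes first.
    public static string ResolveLocalPath(string baseDirectory, string src)
    {
        // "test%20file.png" becomes "test file.png";
        // a name without percent-escapes passes through unchanged.
        string decoded = Uri.UnescapeDataString(src);
        return Path.Combine(baseDirectory, decoded);
    }
}
```

With this helper, both `test file.png` and `test%20file.png` resolve to the same file on disk, which is the behaviour I would expect from the built-in loader.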

In my opinion, the internal resource loader should do this. The PDF conversion works now. Nevertheless, I’m looking forward to Unicode encoding support for MHTML files. With the MHTML approach, I do not have to delete the external files afterwards.

codewarior · July 16, 2015, 6:54am

Hi,

Thanks for sharing the sample project.

I have tried to replicate the issue and have observed that the application hangs during the conversion process. Could you please double-check on your end?

tech.s-und-n.de · July 17, 2015, 7:44am

It works for me, but it took quite a while (~ 5 minutes) and the memory usage jumps up to 700 MB.

The license is not included (“C:\Aspose.Pdf.lic”). I’m using Visual Studio 2013 Premium.

I encountered another problem. If I’m using the workaround, web resources cannot be accessed. Is there a way to call the standard ‘LoaderOfExternalResources’?
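
Whether the standard loader can be called from a custom one is not answered in this thread. As a stopgap, the custom loader could branch on the kind of URI itself. The following is only a sketch under that assumption: `ResourceFetcher`, `IsWebResource` and `Load` are hypothetical names, the download uses the plain .NET `HttpClient` rather than anything from Aspose.Pdf, and protocol-relative URLs (`//host/file.png`) are not handled.

```csharp
using System;
using System.IO;
using System.Net.Http;

public static class ResourceFetcher
{
    // One shared client for all downloads (hypothetical helper, not an Aspose.Pdf API).
    private static readonly HttpClient Http = new HttpClient();

    // True only for absolute http/https URIs; relative names such as
    // "test%20file.png" are treated as local files.
    public static bool IsWebResource(string resourceUri)
    {
        Uri uri;
        return Uri.TryCreate(resourceUri, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    // Returns the raw bytes of a resource, from the web or from below basePath.
    public static byte[] Load(string basePath, string resourceUri)
    {
        if (IsWebResource(resourceUri))
        {
            // Blocking on the task keeps the sketch short; it assumes the
            // loader callback is synchronous.
            return Http.GetByteArrayAsync(resourceUri).GetAwaiter().GetResult();
        }

        // Local file: decode percent-escapes such as %20 before building the path.
        return File.ReadAllBytes(Path.Combine(basePath, Uri.UnescapeDataString(resourceUri)));
    }
}
```

Inside the workaround's lambda, the bytes passed to `LoadOptions.ResourceLoadingResult` would then come from `ResourceFetcher.Load(options.BasePath, resourceUri)` instead of the direct `File.ReadAllBytes` call.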

codewarior · July 21, 2015, 4:53pm

Hi,

Thanks for sharing the details.

I have tested the scenario and have observed that the HTML to PDF conversion takes too much time. For the sake of correction, I have logged it as PDFNEWNET-39058 in our issue tracking system. We will look further into the details of this problem and will keep you updated on the status of the correction. We are sorry for this inconvenience.

codewarior · July 21, 2015, 4:58pm

Hi,

I have also observed that images are not rendered in resultant PDF file. For the sake of correction, I have separately logged it as PDFNEWNET-39059. We will further look into this matter and will keep you posted on the status of correction. We are really sorry for this inconvenience.

tech.s-und-n.de · August 5, 2015, 10:46am

The MHT conversion throws another exception: "Absent or unexpected Content-Transfer-Encoding header’s value detected."

You can find the MHT file in the attachments.

Code:

var mhtOptions = new MhtLoadOptions();
using (var pdfDocument = new Document(Path.Combine(path, file), mhtOptions))
{
    pdfDocument.Save(Path.Combine(path, pdfFile));
}

codewarior · August 6, 2015, 9:57am

Hi,

Thanks for using our APIs.

I have tested the scenario and I am able to reproduce the same problem. For the sake of correction, I have logged it in our issue tracking system as PDFNEWNET-39154. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for the inconvenience.

tilal.ahmad · October 13, 2015, 2:50am

Hi there,

Thanks for your patience. In reference to PDFNEWNET-38977 we have investigated the issue:

Unfortunately, we found that supplying a single predefined encoding for MHT to PDF conversion does not make sense.

In general, an MHT file contains several parts, and each of them can be stored in its own encoding (the encodings can differ). Please open the attached document in any text editor.

You will see parts like these:

------=_NextPart_000_0000_01D0A4B0.B684CBE0
mime-version: 1.0
content-type: text/html;
    charset="utf-8"
content-transfer-encoding: quoted-printable
content-location:
…
------=_NextPart_000_0000_01D0A4B0.B684CBE0
mime-version: 1.0
content-type: text/html;
    charset="iso-8859-1"
content-transfer-encoding: quoted-printable
content-location: https://d1sojsgu0jwtb7.cloudfront.net/css/cc5dad54bd9c166d5a307792f7279b7b/player_embedded_mini.min.gz.css
…
------=_NextPart_000_0000_01D0A4B0.B684CBE0
mime-version: 1.0
content-type: text/html;
    charset="utf-8"
content-transfer-encoding: quoted-printable
content-location: https://www.spreaker.com/embed/player/mini?show_id=583825&autoplay=false

Please pay attention to the charset headers: they differ between parts of the same MHT file. Supplying a single encoding for the whole MHT therefore does not make sense: the content of each part must be processed according to the encoding defined in that part's header (as our code currently does). For this reason, we will not implement the feature.
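
To illustrate why the charset matters per part, here is a minimal sketch that decodes a quoted-printable body with the charset named in its own header. It is not Aspose.Pdf code: `MhtPartDecoding`, `DecodeQuotedPrintable` and `DecodePart` are hypothetical names, the sketch uses only the .NET base class library, and it assumes the encoded body itself is plain 7-bit ASCII.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MhtPartDecoding
{
    // Turns quoted-printable text into raw bytes (illustrative helper only).
    public static byte[] DecodeQuotedPrintable(string input)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c != '=')
            {
                bytes.Add((byte)c); // assumes a 7-bit ASCII body
                continue;
            }
            // Soft line break: "=" directly before LF or CRLF is dropped.
            if (i + 1 < input.Length && input[i + 1] == '\n') { i += 1; continue; }
            if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n') { i += 2; continue; }
            // "=XX": two hex digits encode one byte.
            if (i + 2 < input.Length && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
                continue;
            }
            bytes.Add((byte)'='); // malformed escape: keep the character as-is
        }
        return bytes.ToArray();
    }

    // Decodes one part body using the charset from that part's own header.
    public static string DecodePart(string quotedPrintableBody, string charset)
    {
        byte[] raw = DecodeQuotedPrintable(quotedPrintableBody);
        return Encoding.GetEncoding(charset).GetString(raw);
    }
}
```

For example, `DecodePart("M=C3=A4rz", "utf-8")` yields `März`, while the same bytes read as `iso-8859-1` yield `MÃ¤rz`. Forcing one encoding onto every part would garble whichever parts were written in a different one.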

However, we could implement a callback in the MhtLoadOptions class that is called while each part is processed and lets the customer force the encoding for that part on the fly. But this requires the user to understand not only MHT basics but also how implementations differ between MHT generators (IE, Mozilla Firefox's plugin, etc.), and it looks like overkill. (For example, the sample MHT contains more than 70 different parts, including images, JavaScript files, etc., and several of them are nested inside others, so implementing such a callback would be a nontrivial task on the customer side.)

Nevertheless, please confirm whether you are really interested in such an approach (a callback). If so, we can do that in the future, but please note that we cannot implement it in the near future.

We are sorry for the inconvenience caused.

Best Regards,

tech.s-und-n.de · October 15, 2015, 7:18am

Hello,

Thanks for the information. We need the possibility to use different charsets for each part. That’s not a rare case at all: it happens all the time when you save a website. The same goes for saving an Outlook e-mail as an MHT file. There will be multiple parts with charset="us-ascii" and charset="utf-8".

tilal.ahmad · October 16, 2015, 12:57am

Hi there,

Thanks for your feedback. We have logged an enhancement ticket, PDFNEWNET-39545, in our issue tracking system to implement a callback method that will allow users to force the encoding for each MHT part during conversion. We will notify you as soon as it is resolved.

Best Regards,

fshomou1968 · July 8, 2016, 12:05pm

Hi,

My HTML is:

<html>
<body>
<p>général</p>
</body>
</html>

I have this code:

var path = @"G:\xxxxxx";
var htmlFile = @"test.html";
var pdfFile = @"test.pdf";
var options3 = new HtmlLoadOptions(path)
{
    InputEncoding = "UTF-8"
};
var pdfDocument8 = new Document(Path.Combine(path, htmlFile), options3);
pdfDocument8.Save(Path.Combine(path, pdfFile));

I’m trying to convert an HTML file (with French letters) with Aspose.Pdf 10.6.0.0. In the resulting PDF (attached), these letters are replaced with question marks.

Thanks

codewarior · July 12, 2016, 4:06pm

fshomou1968: Hi,

My html is :
<html>
<body>
<p>général</p>
</body>
</html>
I have this code :
var path = @"G:\xxxxxx";
var htmlFile = @"test.html";
var pdfFile = @"test.pdf";
var options3 = new HtmlLoadOptions(path)
{
    InputEncoding = "UTF-8"
};
var pdfDocument8 = new Document(Path.Combine(path, htmlFile), options3);
pdfDocument8.Save(Path.Combine(path, pdfFile));
I’m trying to convert html (with french letters) file with Aspose.Pdf 10.6.0.0. The resulting PDF (attached) will replace them with question marks.

Thanks.

Hi Faris,
Thanks for using our APIs.
I tried the scenario using the latest release, Aspose.Pdf for .NET 11.8.0, in a Visual Studio 2010 project with .NET Framework 4.0, running on Windows 7 (x64), and I was unable to reproduce the issue. According to my observations, the text renders properly inside the PDF file. For your reference, I have also attached the output generated on my end.

[C#]
var htmlFile = @"c:/pdftest/import_font.html";
var options3 = new HtmlLoadOptions("c:/pdftest/")
{
    InputEncoding = "UTF-8"
};
var pdfDocument8 = new Document(htmlFile, options3);
pdfDocument8.Save("c:/pdftest/import_font_Converted.pdf");

aspose.notifier · July 19, 2022, 8:09pm

The issues you have found earlier (filed as PDFNET-39154) have been fixed in Aspose.PDF for .NET 22.7.