(M)HTML to PDF conversion does not support Unicode

Hello,


I’m trying to convert a .mht/.html file with Aspose.Pdf 10.5.0. These files may contain language-specific letters (e.g. German umlauts). The resulting PDF either drops these characters or replaces them with question marks.


HTML to PDF (as suggested in a bug report):

var pdf = new Pdf
{
    HtmlInfo =
    {
        CharSet = "UTF-8",
        CharsetApplyingLevelOfForce = HtmlInfo.CharsetApplyingForceLevel.EnforceUseAlways
    }
};
pdf.SetUnicode();

var section = pdf.Sections.Add();
var text = new Text(section, htmlString)
{
    IsHtmlTagSupported = true,
    IsHtml5Supported = true,
    TextInfo = {FontName = "Arial Unicode MS"},
    IfHtmlTagSupportedOverwriteHtmlFontNames = true
};
text.TextInfo.IsFontEmbedded = true;

section.Paragraphs.Add(text);

pdf.Save(pdfOutputPath);

MHTML to PDF:

using (var document = new Document(mhtmlFile, new MhtLoadOptions()) { PageInfo = { Margin = new Aspose.Pdf.MarginInfo(25, 20, 25, 25) } })
{
    document.Save(pdfOutputPath, SaveFormat.Pdf);
}

Note: I couldn't find any way to add Unicode support for the 'Aspose.Pdf.Document.Document' class. Should it be auto-detected or is it missing?
I prefer the second (MHTML to PDF) approach.

Hi there,


Thanks for your inquiry. I am afraid that MHT to PDF conversion does not support defining the input encoding, so we have logged an enhancement ticket, PDFNEWNET-38977, in our issue tracking system for this purpose. We will notify you as soon as it is resolved.

However, you can define the input encoding in HTML to PDF conversion as follows. Hopefully this will help you accomplish the task.

HtmlLoadOptions options = new HtmlLoadOptions();
options.InputEncoding = "UTF-8";
Aspose.Pdf.Document pdf = new Document("Htmlfile.html", options);
pdf.Save("output.pdf");


Best Regards,

Hi there,


In addition to the above reply, we would appreciate it if you could share your sample MHTML document here. It will help us understand and address your issue precisely.

Best Regards,

The application runs on a remote computer. Unfortunately I don’t have direct access, thus I cannot provide a file.

Using the HtmlLoadOptions with InputEncoding = "UTF-8" solves the encoding problem, but the images are not included in the PDF file.

Code:
var path = @"C:\temp\test";
var htmlFile = @"test.html";
var pdfFile = @"test.pdf";

var options = new HtmlLoadOptions(path)
{
    InputEncoding = "UTF-8"
};

var pdfDocument = new Document(Path.Combine(path, htmlFile), options);
pdfDocument.Save(Path.Combine(path, pdfFile));

All ‘embedded’ files are in a sub directory.
<img src="./(M)HTML to PDF conversion does not support Unicode - Aspose.Pdf Product Family - Forums_files/0b0d5b79-9f1a-4862-9a92-a5c852cc1e1b.gif" alt="Aspose Staff Member" style="border-width:0px;" />

The ‘src’ attribute may not start with ‘.’. You can save a full website with Google Chrome (it’s just an example, I don’t want to use the WebRequest API).

Hi,


Thanks for sharing the details.

During our testing, we did not notice any issue with image rendering when converting HTML files to PDF format. To help us replicate the issue you are facing, please save the page contents to disk together with the images in their directory and share these resources with us, so that we can look into the matter further.

I’ve uploaded a sample project.

The error occurs if the file path uses ‘%20’ instead of spaces.

src="test file.png" → OK
src="test%20file.png" → not OK



After reading the documentation I could create a workaround:

var options = new HtmlLoadOptions(baseDirectory)
{
    InputEncoding = "UTF-8"
};

options.CustomLoaderOfExternalResources = (resourceUri) =>
{
    var fileBytes = System.IO.File.ReadAllBytes(Path.Combine(options.BasePath, Uri.UnescapeDataString(resourceUri)));
    return new LoadOptions.ResourceLoadingResult(fileBytes);
};

In my opinion, the internal resource loader should do this itself. The PDF conversion works now. Nevertheless, I’m looking forward to Unicode encoding support for MHTML files. With the MHTML approach, I do not have to delete the external files.

Hi,


Thanks for sharing the sample project.

I have tried to replicate the issue and have observed that the application hangs during the conversion process. Could you please double-check at your end?

It works for me, but it took quite a while (~ 5 minutes) and the memory usage jumps up to 700 MB.

The license is not included (“C:\Aspose.Pdf.lic”). I’m using Visual Studio 2013 Premium.

I encountered another problem. If I’m using the workaround, web resources cannot be accessed. Is there a way to call the standard ‘LoaderOfExternalResources’?
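
One way to approach this is a loader that handles both cases itself. The following is a minimal sketch, not Aspose's own loader: `ResourceLoader`, `IsWebResource`, `ResolveLocalPath` and `Load` are hypothetical names, and only .NET base class library calls (`Uri`, `Path`, `File`, `WebClient`) are used. It assumes that absolute http/https URIs should be downloaded and that everything else is a (possibly percent-encoded) path relative to the base directory.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of Aspose.Pdf.
public static class ResourceLoader
{
    // True only for absolute http/https URIs; relative values such as
    // "test%20file.png" or "./sub/image.gif" are treated as local files.
    public static bool IsWebResource(string resourceUri)
    {
        Uri uri;
        return Uri.TryCreate(resourceUri, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    // Maps a relative 'src' value to a path under basePath,
    // decoding percent-escapes such as %20 first.
    public static string ResolveLocalPath(string basePath, string resourceUri)
    {
        return Path.Combine(basePath, Uri.UnescapeDataString(resourceUri));
    }

    // Downloads web resources and reads everything else from disk.
    public static byte[] Load(string basePath, string resourceUri)
    {
        if (IsWebResource(resourceUri))
        {
            // WebClient matches the .NET Framework versions discussed in this thread.
            using (var client = new WebClient())
            {
                return client.DownloadData(resourceUri);
            }
        }
        return File.ReadAllBytes(ResolveLocalPath(basePath, resourceUri));
    }
}
```

Assuming the delegate shape used in the workaround above, it could be plugged in as `options.CustomLoaderOfExternalResources = uri => new LoadOptions.ResourceLoadingResult(ResourceLoader.Load(options.BasePath, uri));`. Note that this replaces the built-in loader rather than calling it.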

Hi,


Thanks for sharing the details.

I have tested the scenario and have observed that the HTML to PDF conversion takes too much time. For the sake of correction, I have logged it as PDFNEWNET-39058 in our issue tracking system. We will look further into the details of this problem and will keep you updated on the status of the correction. We are sorry for this inconvenience.

Hi,


I have also observed that images are not rendered in the resultant PDF file. For the sake of correction, I have logged it separately as PDFNEWNET-39059. We will look further into this matter and will keep you posted on the status of the correction. We are really sorry for this inconvenience.

The MHT conversion throws another exception: "Absent or unexpected Content-Transfer-Encoding header’s value detected."

You can find the MHT file in the attachments.

Code:
var mhtOptions = new MhtLoadOptions();
using (var pdfDocument = new Document(Path.Combine(path, file), mhtOptions))
{
    pdfDocument.Save(Path.Combine(path, pdfFile));
}

Hi,

Thanks for using our APIs.

I have tested the scenario and I am able to reproduce the same problem. For the sake of correction, I have logged it in our issue tracking system as PDFNEWNET-39154. We will investigate this issue in detail and will keep you updated on the status of the correction.

We apologize for the inconvenience.

Hi there,


Thanks for your patience. In reference to PDFNEWNET-38977, we have investigated the issue.

Unfortunately, we found that supplying a single predefined encoding does not make sense for MHT to PDF conversion.

In general, an MHT file contains several parts, and each of them can be stored in its own encoding (the encodings can differ). Please open the attached document in any text editor.

There you can see parts such as these:

------=_NextPart_000_0000_01D0A4B0.B684CBE0
mime-version: 1.0
content-type: text/html;
    charset="utf-8"
content-transfer-encoding: quoted-printable
content-location:
…
------=_NextPart_000_0000_01D0A4B0.B684CBE0
mime-version: 1.0
content-type: text/html;
    charset="iso-8859-1"
content-transfer-encoding: quoted-printable
content-location: https://d1sojsgu0jwtb7.cloudfront.net/css/cc5dad54bd9c166d5a307792f7279b7b/player_embedded_mini.min.gz.css
…
------=_NextPart_000_0000_01D0A4B0.B684CBE0
mime-version: 1.0
content-type: text/html;
    charset="utf-8"
content-transfer-encoding: quoted-printable
content-location: https://www.spreaker.com/embed/player/mini?show_id=583825&autoplay=false

Please note the 'charset' headers: they differ between parts of the same MHT. Supplying a single encoding for the whole MHT therefore does not make sense; the content of each part must be processed according to the encoding defined in that part's own header (as our code currently does). For that reason, we will not implement the feature.
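
To illustrate what processing each part with its own encoding means, here is a minimal sketch using only the .NET base class library. It is not Aspose's implementation: `MhtPartCharset`, `GetCharset` and `DecodePart` are hypothetical names, the body bytes are assumed to be already transfer-decoded (for example from quoted-printable), and falling back to us-ascii when no charset is given is an assumption of this sketch.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Pdf.
public static class MhtPartCharset
{
    // Extracts the charset parameter from a Content-Type header value,
    // e.g. text/html; charset="iso-8859-1". Returns null when it is absent.
    public static string GetCharset(string contentType)
    {
        var match = Regex.Match(contentType ?? string.Empty,
            @"charset\s*=\s*""?([^"";\s]+)""?", RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Decodes the bytes of one part using that part's own charset.
    // Assumption: us-ascii is used when the header names no charset.
    public static string DecodePart(byte[] body, string contentType)
    {
        var charset = GetCharset(contentType) ?? "us-ascii";
        return Encoding.GetEncoding(charset).GetString(body);
    }
}
```

The same text is stored as different bytes depending on the part's charset: "gé" is `0x67 0xC3 0xA9` in utf-8 but `0x67 0xE9` in iso-8859-1. Decoding the utf-8 bytes as iso-8859-1 yields "gÃ©", which is why one encoding cannot be applied to all parts.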

However, we could implement a callback in the MhtLoadOptions class that is called while each part is processed and allows the customer to force the encoding for that part on the fly.
This requires the user to understand not only the basics of MHT but also how its implementations differ between MHT generators (IE, the Mozilla Firefox plugin, etc.), so it looks like overkill. (For example, the sample MHT contains more than 70 different parts, including images and JavaScript files, several of them nested inside others, so I guess that implementing such a callback would be a nontrivial task on the customer's side.)
Nevertheless, please confirm whether you are really interested in such a callback approach. If so, we can implement it in the future, but please note that we cannot do so in the near future.


We are sorry for the inconvenience caused.

Best Regards,

Hello,


Thanks for the information. We need the ability to use a different charset for each part. That is not a rare case at all; it happens all the time when you save a website. The same goes for saving an Outlook e-mail as an MHT file: there will be multiple parts with 'charset="us-ascii"' and 'charset="utf-8"'.

Hi there,


Thanks for your feedback. We have logged an enhancement ticket, PDFNEWNET-39545, in our issue tracking system to implement a callback method that will allow the user to force the encoding for each MHT part during conversion. We will notify you as soon as it is resolved.

Best Regards,

Hi,


My html is :

général


I have this code :
var path = @"G:\xxxxxx";
var htmlFile = @"test.html";
var pdfFile = @"test.pdf";
var options3 = new HtmlLoadOptions(path)
{
    InputEncoding = "UTF-8"
};
var pdfDocument8 = new Document(Path.Combine(path, htmlFile), options3);
pdfDocument8.Save(Path.Combine(path, pdfFile));


I’m trying to convert an HTML file containing French letters with Aspose.Pdf 10.6.0.0. The resulting PDF (attached) replaces them with question marks.

Thanks

Hi Faris,

Thanks for using our APIs.

I have tested the scenario using the latest release, Aspose.Pdf for .NET 11.8.0, in a Visual Studio 2010 project with .NET Framework 4.0 running on Windows 7 (x64), and I am unable to reproduce any issue. As per my observations, the text renders properly in the PDF file. For your reference, I have also attached the output generated at my end.

[C#]

var htmlFile = @"c:/pdftest/import_font.html";
var options3 = new HtmlLoadOptions("c:/pdftest/")
{
    InputEncoding = "UTF-8"
};
var pdfDocument8 = new Document(htmlFile, options3);
pdfDocument8.Save("c:/pdftest/import_font_Converted.pdf");

The issues you have found earlier (filed as PDFNET-39154) have been fixed in Aspose.PDF for .NET 22.7.