HTML with multibyte characters doesn't convert to PDF properly

When I create a PDF document from an HTML string, the multibyte characters are missing from the saved document. See the code below:

var pdf = new Aspose.Pdf.Generator.Pdf();
pdf.SetUnicode();
string html1 = "TEST_中文文档资料_TEXT";
pdf.ParseToPdf(html1);
//AsposePdfCreator.ConvertToWordDoc(pdf, html1, pdftype);
pdf.SetUnicode();
pdf.Save(@"c:\test2.pdf");

SetUnicode() does not help.
I am using Aspose.Pdf version 9.1.
Am I doing something wrong?
Thank you,
Alexei

Hi Alexei,

Thanks for your inquiry. Your query is related to Aspose.Pdf, so I am moving this thread to the Aspose.Pdf forum. My colleagues from the Aspose.Pdf team will reply to you shortly.

You may also use Aspose.Words to insert an HTML fragment or a whole HTML document into the Aspose.Words DOM and convert the final document to PDF. Please check the following code example for reference.


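Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
// insert the HTML string at the current builder position
builder.InsertHtml("TEST_中文文档资料_TEXT");
// MyDir is assumed to hold the output folder path
doc.Save(MyDir + "Out.pdf");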




Hi Alexei,


Thanks for contacting support.

To display multibyte or special characters in a PDF file, you need to use a font that supports Unicode characters, e.g. Arial Unicode MS. Please try the following code snippet to generate correct output. For your reference, I have also attached the resultant PDF generated on my end. We are sorry for the inconvenience.

[C#]

var pdf = new Aspose.Pdf.Generator.Pdf();
string html1 = "TEST_中文文档资料_TEXT";
// create text object
Aspose.Pdf.Generator.Text text1 = new Aspose.Pdf.Generator.Text(html1);
// indicate to render HTML tags inside PDF
text1.IsHtmlTagSupported = true;
// use TextInfo style
text1.UseTextInfoStyle = true;
// specify the font for PDF contents
text1.TextInfo.FontName = "Arial Unicode MS";
// add text paragraph to paragraphs collection of section object
pdf.Sections.Add().Paragraphs.Add(text1);
// embed font inside PDF file
pdf.SetUnicode();
// save PDF file
pdf.Save(@"c:\pdftest\SpecialCharacters_test2.pdf");

This solution is unacceptable for us.

It would mean a lot of work on our side to parse fonts in HTML files. When we switched to the Aspose libraries, we expected that any legitimate HTML code would be converted correctly. Aspose.Words does it seamlessly, without any work on our side.

Could you fix this issue on your side?

Thank you,

Alexei

Hi Alexei,


The code snippet shared in the earlier post is the correct approach for rendering non-English (Unicode) text in a PDF file.

Hi Nayyer,

I would disagree with your last answer.
We do exactly the same thing with both libraries: we provide the full HTML to Aspose.Words and to Aspose.Pdf.
Aspose.Words handles Unicode without any issues, but Aspose.Pdf has the Unicode issue described above.
Your solution doesn't work for us, as we don't want to do any HTML pre-processing before supplying it to the Aspose library.
Can Aspose.Pdf behave the same way as Aspose.Words?

Thank you!

Hi Alexei,


Aspose.Pdf for .NET and Aspose.Words for .NET are two separate APIs with different document rendering engines. Each API uses its own techniques to render objects in the target file format.

However, to render Unicode characters in a PDF file without specifying the font, you may try the Document Object Model (DOM) of the Aspose.Pdf namespace. With this approach, though, the HTML tags are not parsed; they appear in the output as literal HTML tags. To address this, we have already logged the requirement of "parsing HTML tags when using Aspose.Pdf namespace" in our issue tracking system as PDFNEWNET-35804. The development team is looking into this requirement, and we will keep you updated on the status of the fix. We are sorry for the inconvenience.


[C#]
// open an existing PDF document
Document doc = new Document("c:/pdftest/Paysage.pdf");
string html1 = "<div>TEST_中文文档资料_TEXT</div>";
// add the string as a text fragment on a new page
doc.Pages.Add().Paragraphs.Add(new Aspose.Pdf.Text.TextFragment(html1));
doc.Save("c:/pdftest/UniCodeTextDOM.pdf");

Yes, this may be a workaround, but we prefer to use the Pdf.Generator method below, and we expect that it will eventually convert all multibyte characters into a correct PDF file.

pdf.ParseToPdf(html);

Looking forward to seeing this feature implemented.

Thank you,

Alexei

Hi Alexei,


Thanks for sharing the details.

I have logged an investigation ticket in our issue tracking system as PDFNEWNET-36825. The development team will look into this matter and see whether the font/multibyte text problem can be fixed in the ParseToPdf(…) method. We will keep you updated on the status of the fix.

We apologize for the inconvenience.

The issues you have found earlier (filed as PDFNEWNET-35804) have been fixed in Aspose.Pdf for .NET 9.5.0.


This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

Hi Alexei,


Thanks for your patience. As stated above, PDFNEWNET-35804 is resolved, and you can now add an HTML string to a new or existing document using the new DOM approach. Please check the following documentation link for details. It will help you accomplish your requirements.
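For illustration, adding an HTML string through the DOM might look like the sketch below. It assumes that the `HtmlFragment` class of the Aspose.Pdf namespace is what the new DOM approach refers to; the `<div>` markup and the output path are placeholders, and whether the Chinese glyphs render can still depend on the fonts available on the machine.

```csharp
using Aspose.Pdf;

class HtmlFragmentSketch
{
    static void Main()
    {
        // create an empty PDF document and add one page to it
        Document doc = new Document();
        Page page = doc.Pages.Add();

        // HTML string with multibyte characters (the <div> markup is illustrative)
        string html1 = "<div>TEST_中文文档资料_TEXT</div>";

        // HtmlFragment is assumed to parse the markup rather than print the tags literally
        HtmlFragment fragment = new HtmlFragment(html1);
        page.Paragraphs.Add(fragment);

        // save PDF file (placeholder path)
        doc.Save(@"c:\pdftest\HtmlFragment_test.pdf");
    }
}
```

Unlike the TextFragment snippet earlier in the thread, this hands the whole HTML string to the library in one call, so no font pre-processing of the HTML should be needed on the caller's side.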


Please feel free to contact us for any further assistance.

Best Regards,

Hi Alexei,


Thanks for your patience. In reference to PDFNEWNET-36825, we want to update you that the Aspose.Pdf.Generator package will become obsolete soon. We are therefore fixing old issues and making new improvements in the new generator (Aspose.Pdf.Document). It is a more efficient and improved approach, so please use the new DOM approach to add an HTML string to a PDF. It will help you accomplish the task.

We are sorry for the inconvenience caused.

Best Regards,

The issues you have found earlier (filed as PDFNEWNET-36825) have been fixed in Aspose.Pdf for .NET 9.6.0.

