HTML with multibyte characters doesn't convert to PDF properly

abnovikov · April 24, 2014, 2:30pm

When I create a PDF document from an HTML string, multibyte characters are missing in the saved document. See the code below:

var pdf = new Aspose.Pdf.Generator.Pdf();
pdf.SetUnicode();
string html1 = "<div>TEST_中文文档资料_TEXT</div>";
pdf.ParseToPdf(html1);
//AsposePdfCreator.ConvertToWordDoc(pdf, html1, pdftype);
pdf.SetUnicode();
pdf.Save(@"c:\test2.pdf");

SetUnicode() does not help.

I am using Aspose.Pdf version 9.1.

Am I doing something wrong?

Thank you,

Alexei

tahir.manzoor · April 25, 2014, 3:09am

Hi Alexei,

Thanks for your inquiry. Your query is related to Aspose.Pdf, so I am moving this thread to the Aspose.Pdf forum. My colleagues from the Aspose.Pdf team will reply to you shortly.

You may also use Aspose.Words to insert an HTML fragment or a whole HTML document into the Aspose.Words DOM and convert the final document to PDF. Please check the following code example for reference.

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.InsertHtml("<div>TEST_中文文档资料_TEXT</div>");
doc.Save(MyDir + "Out.pdf");

codewarior · April 25, 2014, 5:12am

Hi Alexei,

Thanks for contacting support.

To display multibyte/special characters inside a PDF file, you need to use a font that supports Unicode characters, e.g. Arial Unicode MS. Please try the following code snippet to generate correct output. For your reference, I have also attached the resultant PDF generated on my end. We are sorry for the inconvenience.

Code:

var pdf = new Aspose.Pdf.Generator.Pdf();

string html1 = "<div>TEST_<font name='Arial'> 中文文档</font>资料_TEXT</div>";

// create text object
Aspose.Pdf.Generator.Text text1 = new Aspose.Pdf.Generator.Text(html1);

// indicate to render HTML tags inside PDF
text1.IsHtmlTagSupported = true;

// use TextInfo style
text1.UseTextInfoStyle = true;

// specify the font for PDF contents
text1.TextInfo.FontName = "Arial Unicode MS";

// add text paragraph to paragraphs collection of section object
pdf.Sections.Add().Paragraphs.Add(text1);

// embed font inside PDF file
pdf.SetUnicode();

// save PDF file
pdf.Save(@"c:\pdftest\SpecialCharacters_test2.pdf");

abnovikov · April 25, 2014, 3:06pm

This solution is unacceptable for us.

It would mean a lot of work on our side to parse fonts in HTML files. When we switched to Aspose libraries, we expected that any legitimate HTML code would be migrated correctly. Aspose.Words does it seamlessly, without any work on our side.

Could you fix this issue on your side?

Thank you,

Alexei

codewarior · April 27, 2014, 2:13pm

Hi Alexei,

The code snippet shared in my earlier post is the correct approach for rendering non-English (Unicode) text inside the PDF file.

velaskec · April 28, 2014, 10:36am

Hi Nayyer,

I would disagree with your last answer.

We are doing exactly the same thing with both libraries: providing full HTML to Aspose.Words and to Aspose.Pdf.

Aspose.Words works without any issues with Unicode, but Aspose.Pdf has the Unicode issue described above.

Your solution doesn’t work for us, as we don’t want to do any HTML pre-processing before supplying it to the Aspose library.

Can Aspose.Pdf behave the same way as Aspose.Words?

Thank you!

codewarior · April 28, 2014, 11:54pm

Hi Alexei,

Aspose.Pdf for .NET and Aspose.Words for .NET are two separate APIs and both have separate/different document rendering engines. Both APIs use individual techniques to render objects inside the targeted file format.

However, to render Unicode characters in a PDF file without specifying font information, you may try the Document Object Model (DOM) of the Aspose.Pdf namespace. Note that with this approach the HTML tags are not parsed; they appear in the output as literal HTML tags. To correct this, we have already logged the requirement “parsing HTML tags when using Aspose.Pdf namespace” in our issue tracking system as PDFNEWNET-35804. The development team is looking into the details of this requirement, and we will keep you updated on the status of a correction. We are sorry for this inconvenience.

Document doc = new Document("c:/pdftest/Paysage.pdf");
string html1 = "<div>TEST_中文文档资料_TEXT</div>";
doc.Pages.Add().Paragraphs.Add(new Aspose.Pdf.Text.TextFragment(html1));
doc.Save("c:/pdftest/UniCodeTextDOM.pdf");

abnovikov · April 29, 2014, 9:06am

Yes, this may be a workaround, but we prefer to use the Pdf.Generator method and expect that it will eventually render all multibyte characters correctly in the PDF file:

pdf.ParseToPdf(html);

Looking forward to seeing this feature implemented.

Thank you,

Alexei

codewarior · April 29, 2014, 11:03pm

Hi Alexei,

Thanks for sharing the details.

I have logged an investigation ticket in our issue tracking system as PDFNEWNET-36825. The development team will look into whether the font/multibyte text problem can be fixed in the ParseToPdf(..) method, and we will keep you updated on the status of a correction.

We apologize for any inconvenience caused.

aspose.notifier · August 7, 2014, 2:15am

The issues you have found earlier (filed as PDFNEWNET-35804) have been fixed in Aspose.Pdf for .NET 9.5.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

tilal.ahmad · August 26, 2014, 6:53am

Hi Alexei,

Thanks for your patience. As stated above, PDFNEWNET-35804 is resolved, and you can now add an HTML string to a new or existing document using the new DOM approach. Please check the following documentation link, which should help you accomplish your requirements.

Add HTML string using DOM approach.
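
For readers without the linked page at hand, here is a minimal sketch of what adding an HTML string through the DOM might look like. It assumes the Aspose.Pdf.HtmlFragment class, which is not named anywhere in this thread, and a release in which the PDFNEWNET-35804 fix is available (9.5.0 or later according to the notifier post above). The class name HtmlFragmentSketch and the output file name are hypothetical. Whether the CJK glyphs render correctly may still depend on a suitable font being installed on the machine.

```csharp
using Aspose.Pdf;

class HtmlFragmentSketch
{
    static void Main()
    {
        // Hypothetical output file name; adjust for your environment.
        string outputPath = "HtmlFragment_test.pdf";

        // Same sample markup as in the original report.
        string html = "<div>TEST_中文文档资料_TEXT</div>";

        // Create an empty document and add one page to it.
        Document doc = new Document();
        Page page = doc.Pages.Add();

        // Assumption: HtmlFragment parses the markup, so the tags are
        // interpreted rather than printed literally as with TextFragment.
        HtmlFragment fragment = new HtmlFragment(html);
        page.Paragraphs.Add(fragment);

        doc.Save(outputPath);
    }
}
```

To append to an existing file instead, the Document constructor can be given a path, as in the TextFragment example earlier in this thread.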

Please feel free to contact us for any further assistance.

Best Regards,

tilal.ahmad · September 3, 2014, 11:32am

Hi Alexei,

Thanks for your patience. Regarding PDFNEWNET-36825, please note that the Aspose.Pdf.Generator package will become obsolete soon, so we are fixing old issues and making new improvements in the new generator (Aspose.Pdf.Document). It is a more efficient and improved approach, so please use the new DOM approach for adding an HTML string to a PDF. It will help you accomplish the task.

We are sorry for the inconvenience caused.

Best Regards,

aspose.notifier · September 5, 2014, 2:34am

The issues you have found earlier (filed as PDFNEWNET-36825) have been fixed in Aspose.Pdf for .NET 9.6.0.