Tamil Text Renders Incorrectly in Output PDF | Enable OpenType Features using .NET

Jan_Kratzert · September 16, 2021, 5:06pm

I am using the following code to convert a DOCX file to PDF.

// Requires: using Aspose.Words;
Document doc = new Document("test.docx");
doc.Save("out.pdf", SaveFormat.Pdf);

The result is a PDF with the characters in the wrong order, as shown in the following image.
TamilCharsBug.JPG (40.1 KB)

The following ZIP file contains “test.docx” and “out.pdf”:
Tamil.zip (69.7 KB)

We also see this incorrect order when extracting the characters from the original test.docx file, and it is important for us that the order is correct there as well.

I think this topic may be related to this one: Bangla characters end up in the wrong order

tahir.manzoor · September 16, 2021, 6:33pm

@Jan_Kratzert

Please refer to the following article. You need to enable OpenType features, as shown below, to get the desired output.
Enable OpenType Features

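// Requires the Aspose.Words.Shaping.HarfBuzz NuGet package (using Aspose.Words.Shaping.HarfBuzz;).
Document doc = new Document(MyDir + "Test.docx");
// Enable OpenType features by setting the text shaper factory before saving.
doc.LayoutOptions.TextShaperFactory = HarfBuzzTextShaperFactory.Instance;
doc.Save(MyDir + "21.9.pdf");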

tahir.manzoor · September 17, 2021, 2:38pm

A post was split to a new topic: DOCX to PDF conversion issue with Bangla Characters

Jan_Kratzert · September 20, 2021, 12:53pm

OK, thanks for the answer. Is it recommended to enable this OpenType feature in general? Or does it have any effect on non-OpenType fonts?

Your link states: “Text shaping is only performed when exporting to PDF or XPS formats.” So I assume this won’t fix the incorrect order when extracting the text (symbol, font, and position). Am I right?

tahir.manzoor · September 20, 2021, 2:58pm

@Jan_Kratzert

Please note that extracting text from a document and rendering a document to PDF or XPS are different processes. You need to enable OpenType features when rendering a document to PDF, because OpenType provides better support for international languages and writing systems than PostScript and TrueType.

If you want to extract text from a document and save it to a flow format, e.g. DOCX or DOC, you do not need to enable OpenType features.
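
To illustrate the two separate processes described above, here is a minimal sketch rather than a definitive implementation. It assumes the Aspose.Words and Aspose.Words.Shaping.HarfBuzz NuGet packages are referenced and that "test.docx" is the sample file from the question; the class name and the output file names "out.txt" and "out.pdf" are placeholders chosen for illustration.

```csharp
using System.IO;
using Aspose.Words;
using Aspose.Words.Shaping.HarfBuzz; // separate NuGet package: Aspose.Words.Shaping.HarfBuzz

class TamilConversionSketch // hypothetical class name
{
    static void Main()
    {
        // Load the sample document from the question.
        Document doc = new Document("test.docx");

        // Process 1: text extraction. No text shaper is set for this step,
        // since shaping is only performed when exporting to PDF or XPS.
        string text = doc.ToString(SaveFormat.Text);
        File.WriteAllText("out.txt", text); // placeholder output name, written as UTF-8

        // Process 2: rendering to a fixed-page format. The shaper factory
        // is set before saving so that OpenType features are applied.
        doc.LayoutOptions.TextShaperFactory = HarfBuzzTextShaperFactory.Instance;
        doc.Save("out.pdf", SaveFormat.Pdf);
    }
}
```

The shaper is set only right before the PDF save to make it visible that the extraction step does not depend on it; setting it earlier should not change the extracted text, given that shaping applies only to PDF and XPS export.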