Asian PDF to image conversion makes typographical errors

Vorennor · April 21, 2015, 1:46am

Hi,

When a PDF containing Asian text set in vertical columns is converted to an image (we have only tested PDF to PNG conversion), the rendering engine makes typographical errors with characters that must take a different shape when the text is set in columns.

(typically: 「, （, 。, etc.)

These errors may seem harmless to a non-Asian reader, but they are very noticeable to Asian readers.

Please find attached:

news_no29.pdf: the original Japanese PDF

jpn diff.jpg: a screenshot comparison (original vs. PDF to PNG conversion) where some errors are outlined with a red rectangle.

Best Regards,

codewarior · April 22, 2015, 4:30am

Hi Christophe,

Thanks for using our APIs.

I have tested the scenario and was able to reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-38558 so that it can be corrected. We will investigate this issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

Vorennor · June 23, 2015, 7:08am

Dear Nayyer Shahbaz,

To this day, I haven't received any answer from you regarding PDFNEWNET-38558.

Do you know if this has been fixed?

Kind regards,

codewarior · June 24, 2015, 3:16am

Hi Christophe,

Thanks for your patience.

The reported issue PDFNEWNET-38558 is still pending review, as the team has been busy investigating and fixing previously reported issues. Nevertheless, as soon as we have definite news regarding its resolution, we will let you know.

Vorennor · February 18, 2016, 4:45am

Hi,

Is there anything new regarding this topic?

Regards,

Christophe

codewarior · February 18, 2016, 12:51pm

Hi Christophe,

Thanks for your patience.

I am afraid the issue reported earlier is still not resolved, as the product team has been busy fixing priority issues that were reported earlier. However, I have asked the product team to try to fit the issue into their schedule, and as soon as we have definite updates regarding its resolution, we will let you know.

We are sorry for this delay and inconvenience.

JohnOwens · June 23, 2016, 1:32am

Hi Team,

Can you give the customer an update on this issue please?

Many thanks

John

codewarior · June 23, 2016, 3:03pm

Hi Christophe,

Thanks for your patience.

The product team has started investigating the earlier reported issue, but I am afraid it is not yet resolved. However, I have asked them to share a current update and any possible ETA. As soon as we have the required information, we will let you know.

asad.ali · October 10, 2017, 10:56am

@Vorennor

Thanks for your patience.

The earlier logged issue PDFNET-38558 has been investigated. According to our product team's findings, if the problem symbol (a vertical line) is copied into another text editor, such as MS Word, it is displayed as a horizontal line. In addition, the line of text containing the problem symbol was decoded and tested with another font library, and there too the problem symbol was drawn as a horizontal line.

The root of the problem is that Aspose.Pdf and Acrobat obtain different Unicode values for the problem symbol (code 0x1ED3).
Aspose.Pdf gets U+30FC, whereas the value Acrobat uses is currently unknown. Aspose.Pdf's text decoding was implemented in accordance with the PDF specification, and no violations were found in the input PDF document. However, it seems that Adobe Acrobat uses another mechanism to decode the input content to Unicode, because it is very unlikely that it would choose a vertical-line glyph rather than a horizontal-line glyph for a common Unicode value that denotes a horizontal line.

It was also found that if the font containing the problem symbol is embedded in the document, the resulting image is correct. In that case no Unicode value is needed to find the glyph: a different mechanism is used, which maps the input codes from the PDF directly to glyphs in the font. The embedded-font result supports the idea of a Unicode collision in Acrobat: with the font embedded, the whole PDF document is converted correctly (with a vertical line in the corresponding place).
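To make the two lookup paths concrete, here is a toy Python sketch. It is not Aspose.Pdf or Acrobat code: the tables, glyph labels, and function names are hypothetical and exist only for illustration. Only the code 0x1ED3 and the value U+30FC are taken from the findings above.

```python
# Toy model of the two glyph lookup paths described above.
# All tables and names are illustrative assumptions, not real font data.

CODE = 0x1ED3  # character code of the problem symbol in the PDF (from the findings)

# Path 1: embedded font. The PDF code selects a glyph in the embedded font directly.
EMBEDDED_FONT_GLYPHS = {CODE: "vertical-line glyph"}

# Path 2: font not embedded. The code is first decoded to Unicode ...
CODE_TO_UNICODE = {CODE: "\u30fc"}  # U+30FC, the value Aspose.Pdf obtains
# ... and a substitute font is then asked for its default glyph for that code point.
SUBSTITUTE_FONT_CMAP = {"\u30fc": "horizontal-line glyph"}


def glyph_with_embedded_font(code):
    """Direct mapping: no Unicode value is involved."""
    return EMBEDDED_FONT_GLYPHS[code]


def glyph_without_embedded_font(code):
    """Indirect mapping: code -> Unicode -> glyph of a substitute font."""
    return SUBSTITUTE_FONT_CMAP[CODE_TO_UNICODE[code]]


print(glyph_with_embedded_font(CODE))     # vertical-line glyph
print(glyph_without_embedded_font(CODE))  # horizontal-line glyph
```

In this simplified model the vertical shape survives only on the direct path, because the detour through Unicode keeps the code point but not the information about which glyph variant was intended.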

So if it is possible to embed the problem font (Ryumin-Medium) in the document, please use this approach; the document will then be converted correctly. A common CJK font such as MS Gothic can also be used instead of Ryumin-Medium, but that font must likewise be embedded to get a correct image.
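As a rough aid for checking whether a font is embedded, the sketch below relies on the PDF specification's rule that an embedded font program is referenced from the font descriptor through a FontFile, FontFile2, or FontFile3 entry. It is a minimal sketch, not an Aspose.Pdf API: font dictionaries are modeled as plain Python dicts with indirect references already resolved, the function name is hypothetical, and the sample dictionaries are made up rather than extracted from news_no29.pdf. Reading the dictionaries from a real file would need a PDF library, which is not shown here.

```python
# Sketch: decide whether a PDF font dictionary refers to an embedded font program.
# Font dictionaries are modeled as plain dicts (assumption: references already resolved).
# Type 3 fonts, whose glyphs are defined inline in the PDF, are not handled.

FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(font):
    """Return True if the font or its descendant CIDFont has a font file entry."""
    # Composite (Type0) fonts keep the descriptor on the descendant CIDFont.
    candidates = [font] + list(font.get("/DescendantFonts", []))
    for candidate in candidates:
        descriptor = candidate.get("/FontDescriptor", {})
        if any(key in descriptor for key in FONT_FILE_KEYS):
            return True
    return False


# Hypothetical example dictionaries, for illustration only.
not_embedded = {
    "/Subtype": "/Type0",
    "/BaseFont": "/Ryumin-Medium",
    "/DescendantFonts": [{"/FontDescriptor": {"/FontName": "/Ryumin-Medium"}}],
}
embedded = {
    "/Subtype": "/Type0",
    "/BaseFont": "/ABCDEF+MS-Gothic",
    "/DescendantFonts": [{"/FontDescriptor": {"/FontFile2": b"..."}}],
}

print(is_embedded(not_embedded))  # False
print(is_embedded(embedded))      # True
```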

If the font cannot be embedded in the document, unfortunately this error cannot be corrected, because there is no way to determine Acrobat’s decoding logic for problem documents like ‘news_no29.pdf’.

Some experiments were made with the content: the problem symbol (vertical line) was copied and pasted elsewhere in the same document. Acrobat pasted the symbol as a horizontal line, with Unicode U+30FC (horizontal line). A vertical line was then obtained using Acrobat’s “make text direction vertical” option, and it was found that Acrobat started to use a new glyph and linked it to the same Unicode value (U+30FC, horizontal line). This is a collision: Acrobat’s logic uses the same Unicode value for two different glyphs, the vertical line and the horizontal line.

The correct decoding logic, however, is to use Unicode U+007C for the vertical line and U+30FC for the horizontal line.
Both code points are commonly available in fonts. So we have a collision: Acrobat decodes the same symbol differently for display and for copy/paste operations.
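For reference, the standard names of the two code points mentioned above can be looked up with Python's built-in `unicodedata` module. This is only a lookup aid; it says nothing about how Aspose.Pdf or Acrobat decode the file.

```python
import unicodedata

# Print the official Unicode character names of the two code points discussed above.
for ch in ("\u30fc", "\u007c"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
# U+007C VERTICAL LINE
```

U+30FC is the Japanese prolonged sound mark, drawn as a horizontal stroke in horizontal text. Japanese fonts typically also carry a separate vertical-writing glyph for it, which is consistent with the original report that some characters take a different shape in columns.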

This also shows that any copy/paste operation in Acrobat yields a horizontal line instead of a vertical line, and that only displaying the symbol produces a vertical line, which likewise looks like a collision. Acrobat may have good reasons for its current logic, but this decoding mechanism is currently unknown to Aspose.Pdf.

If you need any further assistance, please feel free to let us know.