Converting PDF documents with Unicode characters to HTML

petteraas · August 1, 2012, 6:19am

Hi!

I see that poor support for converting PDFs with Unicode characters is a known issue. This is a bit troublesome for me, as I can reproduce the error on Windows but not on Linux.

Any suggestions for a workaround or a fix on this?

All characters are converted correctly from Linux, but not from Windows.

Edit: Fixed. Windows uses a different charset by default, so I had to store the file as UTF-8 for the characters to appear correctly.
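
For readers hitting the same symptom, here is a minimal sketch of why the platform default matters. It only uses the JDK, and the class name `DefaultCharsetDemo` and the helper `roundTrip` are made up for illustration. ISO-8859-1 stands in for a legacy single-byte default such as the one typically found on Windows; on JDK versions before 18 the default charset follows the platform, which is an assumption about the setup described here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Encodes text with one charset and decodes it with another,
    // mimicking a file written with one encoding and read with a different one.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        return new String(text.getBytes(writeWith), readWith);
    }

    public static void main(String[] args) {
        // Unicode escapes for the Norwegian letters ae, o-slash, a-ring,
        // so the source file's own encoding does not matter.
        String text = "\u00E6\u00F8\u00E5";

        // Shows which charset the JVM would use when none is specified.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text));

        // Written with a legacy single-byte charset, read as UTF-8: the text is damaged.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8).equals(text));
    }
}
```

The second comparison fails because the single-byte encoding of these letters is not valid UTF-8, which matches the kind of garbling a browser shows when it reads such a file as UTF-8.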

codewarior · August 1, 2012, 8:01am

Hi Tor Henning,

Thanks for using our products.

As per my understanding, you are using the extractTextAsHTML(…) method of the PdfExtractor class in Aspose.Pdf.Kit for Java to extract the text of a PDF file containing Unicode characters and convert it into HTML format. I have tested the scenario using one of my sample PDF files containing Unicode characters and, as per my observations, the HTML file does not show the Unicode characters properly. Is this the issue you are facing?

Also, please share the source PDF document you are using, as it will help us resolve this issue appropriately. We are sorry for the inconvenience.

petteraas · August 1, 2012, 8:15am

Hi,

Yes. The issue was that I loaded the content as a string and then tried to save it to a file without specifying a charset. When I specified a charset, the problem went away.

So the issue is resolved.

codewarior · August 1, 2012, 2:03pm

Hi Tor Henning,

Your observation is correct. When saving a string containing Unicode characters, you need to specify the character set in which the HTML file will be saved. If you have any further queries, please feel free to contact us.
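
A minimal sketch of the fix discussed in this thread, assuming the extracted HTML is already available as a Java String. It uses only standard JDK classes; the class name `SaveHtmlAsUtf8`, the helper `saveHtml`, and the sample HTML are hypothetical, and the Aspose extraction call itself is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveHtmlAsUtf8 {
    // Writes the HTML text as UTF-8 bytes, so the result does not depend
    // on the platform default charset.
    static void saveHtml(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the string returned by the PDF extraction step.
        String html = "<html><head><meta charset=\"UTF-8\"></head>"
                + "<body>\u00E6\u00F8\u00E5</body></html>";

        Path out = Files.createTempFile("extracted", ".html");
        saveHtml(html, out);

        // Reading the file back with the same charset returns the original text.
        String back = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(back.equals(html));
    }
}
```

Declaring the same charset in the HTML itself, as the sample `meta` tag does, helps the browser decode the file the way it was written.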