Problem with accents when extracting text from pdf

tparassin · December 20, 2017, 3:16pm

Hi,
When I extract text from the attached PDF file, some French characters with accents are not extracted correctly:
“è” ==> “ì”
“ê” ==> “¤”
“à” ==> “Å”

Attached are the PDF file, the C# program, and the text result.
Thanks in advance for your support.
pdf.zip (121.4 KB)

Farhan.Raza · December 20, 2017, 7:17pm

@tparassin

Thank you for contacting support.

The issue is most likely caused by fonts missing from your environment. I tried to reproduce it, but only the first problem, “è” ==> “ì”, occurs here. I have attached the generated .txt file for your reference: original1_utf8_17.12.zip. Because of the missing fonts, I cannot even find “à” and “ê” when viewing the PDF file in Adobe Reader. Please install all the fonts used in this PDF document, or embed the used fonts in the PDF document as explained in this documentation article.

tparassin · December 21, 2017, 7:26am

I don’t understand your reply.
In my environment, I have no problem viewing the PDF file; I can see all the characters correctly in Adobe Reader (see attached file).
However, the characters I mentioned are not converted correctly in the text extraction.

Farhan.Raza · December 21, 2017, 10:51am

@tparassin

After working further with the data you shared, I was able to reproduce the issue in our environment. A ticket with ID PDFNET-43914 has been logged in our issue management system for further investigation and resolution. The issue ID has been linked to this thread, so you will receive a notification as soon as the issue is resolved.

We are sorry for the inconvenience.

Sam212111 · July 18, 2019, 8:11pm

Is there any update on this? I am having the same issue (sort of): my byte[] contains French characters, and the problem appears when it is used like this:
using (var pdfResult = new MemoryStream())
{
    // my_byte is a byte[] that contains French characters
    using (var termStream = new MemoryStream(my_byte))
    {
        Aspose.Pdf.Document doc = new Aspose.Pdf.Document(termStream);
        Page page = doc.Pages[1];
        doc.Save(pdfResult);
        retBytes = pdfResult.ToArray();
        var fileName = @"C:\Source\text.pdf";
        doc.Save(fileName); // this file has … instead of French characters
    }
}

please help…

Farhan.Raza · July 19, 2019, 8:36am

@Sam212111

Would you please share the source file and explain how you are loading the characters into the byte array, so that we can try to reproduce the issue in our environment and help you?
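
As background on why the loading step matters: the bytes that represent accented characters depend on the encoding used to produce them, and reading them back with a different encoding garbles the text. The following is a minimal sketch that uses only the .NET base class library, not Aspose.Pdf. The class and method names are hypothetical, and it is only an assumption that an encoding mismatch plays a role in the case described above.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of Aspose.Pdf.
public static class EncodingSketch
{
    // Encodes the text as UTF-8 bytes, then decodes those bytes
    // with the encoding named by decodeWith.
    public static string EncodeUtf8ThenDecode(string text, string decodeWith)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(decodeWith).GetString(bytes);
    }

    public static void Main()
    {
        // "è ê à", written with escapes so the source file encoding does not matter.
        string original = "\u00E8 \u00EA \u00E0";

        // Show the raw UTF-8 bytes in hexadecimal.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(original)));
        // prints C3-A8-20-C3-AA-20-C3-A0

        // Same encoding on both sides: the text survives the round trip.
        Console.WriteLine(EncodeUtf8ThenDecode(original, "utf-8") == original);      // True

        // Different encoding when decoding: the accented characters are garbled.
        Console.WriteLine(EncodeUtf8ThenDecode(original, "iso-8859-1") == original); // False
    }
}
```

Printing the bytes in hexadecimal, as the sketch does, is one way to check which encoding a byte array actually uses before it is passed to any library.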