Read special character from PDF

SistemisticaCCE · June 15, 2018, 3:07pm

Hello,

We are reading text from a PDF with TextAbsorber, and some characters are returned as ? (for example, ć).
Is there a way to specify which character set to use when extracting data?

Thank you,

asad.ali · June 15, 2018, 7:59pm

@Gimbo71

Thanks for contacting support.

There is no need to specify a character set when extracting text from a PDF document. However, would you please share your sample PDF document with us so that we can test the scenario in our environment and address it accordingly?

SistemisticaCCE · June 25, 2018, 9:30am

Hello,

Unfortunately, I can’t share that document because it is confidential, but I have attached a file containing the same character that gives us the problem.

Thank you in advance.

test PDF.pdf (27.5 KB)

asad.ali · June 25, 2018, 3:16pm

@Gimbo71

Thanks for sharing sample PDF document.

We tested the scenario using the following code snippet with Aspose.PDF for .NET 18.6 and were unable to reproduce the issue. The API extracted the character ć correctly. For your reference, the code snippet and the output .txt file are included below:

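// Load the PDF (dataDir is the folder that contains the sample file).
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(dataDir + "test PDF.pdf");
// Collect the text of every page with a TextAbsorber.
Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
// Write the extracted text to a .txt file (WriteAllText uses UTF-8 by default).
System.IO.File.WriteAllText(dataDir + "test PDF.txt", textAbsorber.Text);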

test PDF.zip (321 Bytes)

Would you please try the latest version of the API to extract text from the PDF document? If you still face the issue with the latest version, please share your environment details, i.e. OS name and version, application type, system locale/language settings, etc. We will test the scenario in our environment and address it accordingly.
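
If the ? still appears, it may help to check whether the character is lost during extraction or only when the text is displayed or saved. The following is a minimal sketch of such a check, not an official Aspose sample. It uses only standard .NET calls; the string `extracted` is a stand-in for `textAbsorber.Text`, and the output file name `out.txt` is a hypothetical placeholder.

```csharp
using System;
using System.IO;
using System.Text;

class EncodingCheck
{
    static void Main()
    {
        // Assumption: stand-in for textAbsorber.Text; replace with the real extracted string.
        string extracted = "ć";

        // Print the code point of each character.
        // U+0107 means ć was extracted correctly; U+003F means it is already a literal '?'.
        foreach (char c in extracted)
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }

        // Write the text with an explicit UTF-8 encoding (with BOM) so that
        // editors do not guess a legacy code page when opening the file.
        File.WriteAllText("out.txt", extracted, new UTF8Encoding(true));

        // If the text is printed to a console, switch the console to UTF-8 first;
        // a legacy console code page can replace unsupported characters with '?'.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(extracted);
    }
}
```

If the code points are correct but the character still shows as ?, the cause is more likely the encoding used when writing or displaying the text than the extraction itself.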