Text not being extracted from all PDFs

optimalosg · December 6, 2021, 11:05pm

Hello,

I am trying to perform language detection on a set of PDFs. The English one works fine, but others (Chinese, Korean, Russian, etc.) don’t work because no text is extracted from them. I’ve created a test program with example files to demonstrate the problem. The sample project can be downloaded from the following link: DocLanguage.zip (Dropbox)

Thanks

tahir.manzoor · December 7, 2021, 6:51am

@optimalosg

Please create a standalone console application (source code without compilation errors) that lets us reproduce the problem on our end, and attach it here for testing. We will investigate the issue and provide you with more information.

optimalosg · December 8, 2021, 2:16pm

@tahir.manzoor

Sorry, I forgot to add the code for the test program. I have updated the zip file with the source code; it is available from the link in the original post.

Thanks

tahir.manzoor · December 8, 2021, 6:10pm

@optimalosg

This is the expected behavior of Aspose.PDF. If you open the PDF in Adobe writer and try to extract the text, you will not be able to extract it there either. Please see the attached image for more detail.
image.png (147.8 KB)
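
For readers hitting the same situation, here is a minimal sketch of a guard that could sit between text extraction and language detection. It is not part of Aspose.PDF and is not the code from the sample project. It assumes the text has already been extracted into a plain string by whatever library is in use, and `dominant_script` is a hypothetical helper name. It returns `None` when the extraction produced no letters at all, which is the case described above, and otherwise gives a rough script label based on Unicode character names.

```python
import unicodedata
from collections import Counter
from typing import Optional


def dominant_script(text: str) -> Optional[str]:
    """Return a rough script label for `text`, or None if it has no letters.

    Hypothetical helper for illustration only; `text` is assumed to be
    whatever string the PDF library returned from extraction.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip whitespace, digits, punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            counts["CJK"] += 1
        elif name.startswith("HANGUL"):
            counts["Hangul"] += 1
        elif name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        else:
            counts["Other"] += 1
    if not counts:
        # Nothing usable was extracted, so language detection
        # has no input to work with.
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["Hello world", "Привет мир", "你好世界", "안녕하세요", "  \n"]:
        print(repr(sample), dominant_script(sample))
```

A `None` result separates "the PDF yielded no extractable text" from "the language detector failed", so the two cases can be reported differently.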