Read text from PDF

saurabhmauryabu · September 17, 2021, 5:23pm

Dear Support,

I am trying to read text from a PDF file to extract certain data. When the data is in paragraph form it is easy to extract, but when it is in tabular form it becomes hard to tell which row or column a given piece of text belongs to. I would like to know whether Aspose offers a solution for this.

image.jpg (95.1 KB)

Thanks,
Saurabh

mudassir.fayyaz · September 17, 2021, 9:53pm

@saurabhmauryabu

Can you please share the sample code you are using and the input file? We will investigate the issue on our end once we have that information.

saurabhmauryabu · September 27, 2021, 1:20pm

Hello @mudassir.fayyaz,

Sorry for the late reply.
Please find below the code snippet that we are using:

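// Load the PDF and pull the raw text of every page into one string
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(FilePath);
Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
extractedText = textAbsorber.Text;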

Attached is the sample file that we are using:
Sample.pdf (25.6 KB)

mudassir.fayyaz · September 27, 2021, 9:16pm

@saurabhmauryabu

Please try the code from the "Extract Table from Existing PDF Document" article and share your feedback.
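
For reference, here is a minimal sketch of the table-based approach, assuming the `TableAbsorber` class from Aspose.PDF for .NET. The file name `Sample.pdf`, the class name, the `FormatRow` helper and the tab-separated output are illustrative choices, so adjust them to your project.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TableExtractionSketch
{
    // Hypothetical helper: joins the cell texts of one row with tabs
    // so that column boundaries stay visible in the output.
    public static string FormatRow(IEnumerable<string> cells)
    {
        return string.Join("\t", cells);
    }

    public static void Main()
    {
        // "Sample.pdf" is a placeholder path; point it at your own file.
        Document pdfDocument = new Document("Sample.pdf");

        foreach (Page page in pdfDocument.Pages)
        {
            // A fresh absorber per page keeps each page's tables separate.
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);

            foreach (AbsorbedTable table in absorber.TableList)
            {
                foreach (AbsorbedRow row in table.RowList)
                {
                    List<string> cells = new List<string>();
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        // A cell can hold several text fragments; join them.
                        StringBuilder cellText = new StringBuilder();
                        foreach (TextFragment fragment in cell.TextFragments)
                        {
                            cellText.Append(fragment.Text);
                        }
                        cells.Add(cellText.ToString());
                    }
                    Console.WriteLine(FormatRow(cells));
                }
            }
        }
    }
}
```

Because each piece of text is read through its row and cell, the row and column it belongs to is known at the point of extraction, which plain `TextAbsorber` output does not give you.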

saurabhmauryabu · September 28, 2021, 4:29am

Hello @mudassir.fayyaz,

I have used the code given in Extract Table from PDF and am getting the output below. Notice that the entire text is not coming through in the output:
image.png (32.3 KB)

Please suggest how to get the entire text.

Thanks,
Saurabh

mudassir.fayyaz · September 28, 2021, 1:28pm

@saurabhmauryabu

Are you using the latest version with a valid license? I can extract the text fine on my end. You should apply the license before making any calls to API methods. If you do not have one, you can get a free 30-day temporary license to evaluate the API without any limitations. If you still face any issue, please share a sample application.
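
As a rough sketch of applying the license first, assuming the `License` class from Aspose.PDF for .NET: the file names `Aspose.PDF.lic` and `Sample.pdf` and the class name are placeholders. Without a license the API runs in evaluation mode with limitations, which could account for truncated output.

```csharp
using System;
using Aspose.Pdf;

public static class LicenseSketch
{
    public static void Main()
    {
        // "Aspose.PDF.lic" is a placeholder; use the path to your own license file.
        License license = new License();
        license.SetLicense("Aspose.PDF.lic");

        // Open documents and run extraction only after the license is set.
        Document pdfDocument = new Document("Sample.pdf");
        Console.WriteLine(pdfDocument.Pages.Count);
    }
}
```

Setting the license once at application start-up, before any `Document` is created, is usually enough for the whole process.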