How to extract text from PDF files?

Sathiya22 · November 12, 2020, 2:10pm

I tried extracting text as shown in the documentation, but the text is not extracted from my PDF files.
Can I get some help?

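import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;

Document pdfDocument = new Document("input.pdf");
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.getPages().accept(textAbsorber); // visit every page and collect its text
String extractedText = textAbsorber.getText();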

asad.ali · November 12, 2020, 11:52pm

@Sathiya22

Would you please share your sample PDF with us as well? We will test the scenario in our environment and address it accordingly.

Sathiya22 · November 13, 2020, 5:19am

samplePdf.pdf.zip (1.1 KB)

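import com.aspose.pdf.facades.PdfExtractor;

PdfExtractor pdfExtractor = new PdfExtractor();
pdfExtractor.bindPdf(fis); // fis: a file input stream opened on the PDF
pdfExtractor.extractText();
pdfExtractor.getText(fileName + ".txt"); // write the extracted text to a file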

The above code also does not work.

asad.ali · November 13, 2020, 9:24pm

@Sathiya22

We did not notice any issue while testing the scenario with Aspose.PDF for Java 20.10 on our side. Please check the console output in the attached image:

extractedtext.png (6.9 KB)

Would you kindly make sure that you are using the API with a valid license, or at least with a 30-day free temporary license? If you still face any issue, please let us know.

Sathiya22 · November 17, 2020, 5:07am

Only the first line gets extracted.

ConsoleOutput.png (9.7 KB)

asad.ali · November 17, 2020, 8:54pm

@Sathiya22

You can see in the console output that an evaluation version message is appended as well, which indicates that you are using the API without a license. As requested earlier, please try the API with a valid license, and if you still face any issue, please let us know.
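
As a quick way to confirm whether a run was unlicensed, the extracted text can be scanned for the evaluation message before it is used. The following is only a minimal sketch in plain Java: the class name `EvaluationCheck`, the method `containsMarker`, and the marker string `"Evaluation Only"` are assumptions made for illustration, since the exact wording of the message is not quoted in this thread. Pass in whatever text actually appears in your console output. Applying the license itself is normally done through the library's license class before any document is loaded, and that step is left out here because it needs the library and a license file.

```java
import java.util.Locale;

public class EvaluationCheck {

    /**
     * Returns true when the extracted text contains the given marker,
     * ignoring case. The marker is supplied by the caller because the
     * exact wording of the evaluation message is an assumption here.
     */
    public static boolean containsMarker(String extractedText, String marker) {
        if (extractedText == null || marker == null || marker.isEmpty()) {
            return false;
        }
        return extractedText.toLowerCase(Locale.ROOT)
                .contains(marker.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the result of textAbsorber.getText().
        String extractedText = "First line of the PDF\nEvaluation Only. (placeholder message)";

        // Hypothetical marker; replace it with the text seen in your own output.
        String marker = "Evaluation Only";

        if (containsMarker(extractedText, marker)) {
            System.out.println("Evaluation message found: the license was probably not applied.");
        } else {
            System.out.println("No evaluation message found in the extracted text.");
        }
    }
}
```

If the check reports the message, the license was most likely not loaded before the document was opened, so the license file path and the order of the calls are worth verifying first.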