Unable to extract all the text from PDF file containing images

manasiak · June 5, 2019, 9:21am

I want to extract all the text, line by line, from a PDF file. I can get all the text from a PDF that contains only text, but not from one that contains both images and text. Please find the attached PDF.

Test.pdf (83.0 KB)

I am using the code below to extract the text.

Aspose.Pdf.Document doc = new Aspose.Pdf.Document(@"Test.pdf");
foreach (Aspose.Pdf.Page pdfPage in doc.Pages)
{
    Aspose.Pdf.Text.TextSearchOptions options = new Aspose.Pdf.Text.TextSearchOptions(true);

    Aspose.Pdf.Text.TextFragmentAbsorber absorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
    pdfPage.Accept(absorber);
    Aspose.Pdf.Text.TextFragmentCollection collection = absorber.TextFragments;

    foreach (Aspose.Pdf.Text.TextFragment oneTextFragment in collection)
    {
        string text = oneTextFragment.Text; // not a row
        Console.WriteLine(String.Format("Extracted Text = '{0}'", text));
    }
}

asad.ali · June 5, 2019, 6:30pm

@manasiak

Thanks for contacting support.

You can use the TextAbsorber class to extract text in raw format from any PDF document. Please check the following code snippet:

TextAbsorber ta = new TextAbsorber();
ta.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
Document pdfDocument = new Document(dataDir + "Test (1).pdf");
pdfDocument.Pages.Accept(ta);
string text = ta.Text; // all text of PDF document

If you still face any issues, please feel free to let us know.
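
Since the question asks for line-by-line output, the single string returned by ta.Text can be split on line breaks afterwards. The following is only a minimal sketch: SplitLines is a hypothetical helper (not part of Aspose.PDF), and it assumes the extracted text separates lines with "\r\n", "\n", or "\r". The sample string stands in for ta.Text.

```csharp
using System;
using System.Collections.Generic;

public static class LineSplitter
{
    // Hypothetical helper, not part of Aspose.PDF: splits extracted text
    // into trimmed lines and drops the empty ones.
    public static List<string> SplitLines(string text)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(text))
        {
            return lines;
        }

        // Assumption: lines are separated by "\r\n", "\n", or "\r".
        string[] separators = { "\r\n", "\n", "\r" };
        foreach (string raw in text.Split(separators, StringSplitOptions.None))
        {
            string line = raw.Trim();
            if (line.Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Stand-in for the value of ta.Text from the snippet above.
        string sample = "First line\r\nSecond line\n\nThird line";
        foreach (string line in SplitLines(sample))
        {
            Console.WriteLine("Extracted Text = '{0}'", line);
        }
    }
}
```

In real use, you would pass ta.Text to SplitLines instead of the sample string.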

manasiak · June 6, 2019, 1:39am

Thanks for your reply. But with the above code I get only the first line of the PDF. Please check the screen capture below.
image.png (3.6 KB)

asad.ali · June 6, 2019, 12:38pm

@manasiak

This is due to a limitation of the trial version. Please set a valid license before using any functionality of the API. If you do not have one, you can apply for a free 30-day temporary license to evaluate the API without any restrictions.
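
As a rough sketch of what setting the license before extraction could look like: the file names "Aspose.Pdf.lic" and "Test.pdf" are placeholders (assumptions), and the code needs the Aspose.PDF for .NET package referenced in the project.

```csharp
using System;

public static class LicensedExtraction
{
    public static void Main()
    {
        // Apply the license once, before any other Aspose.PDF call.
        // Assumption: "Aspose.Pdf.lic" is a placeholder for your own license file path.
        var license = new Aspose.Pdf.License();
        license.SetLicense("Aspose.Pdf.lic");

        // Same extraction as in the earlier reply, now run under the license.
        var absorber = new Aspose.Pdf.Text.TextAbsorber();
        var pdfDocument = new Aspose.Pdf.Document("Test.pdf"); // placeholder path
        pdfDocument.Pages.Accept(absorber);
        Console.WriteLine(absorber.Text);
    }
}
```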