Extracting Text from PDF Page by Page (.Net)

Parthiban.j · April 14, 2010, 4:01am

Hi
I am evaluating Aspose.PDF.Kit for .NET. Our requirement is to read the text content of a PDF page by page. In the evaluation version I am not able to extract all the information, which matches the behavior described in your release notes. My question is this: suppose page 1 and page 3 contain text and page 2 contains only an image. We need to create an individual text file for each page, and if a page has no text, an empty text file should still be created. What would the code below produce in the licensed version? This is urgent, as we need to reach a decision on the Aspose API as soon as possible.

// create an instance of the PdfExtractor class
PdfExtractor extractor = new PdfExtractor();
// set the PDF file password
extractor.Password = "";
// bind the PDF file to the extractor object
extractor.BindPdf(inputFileName);
// extract all text from the PDF
extractor.ExtractText();
// save the extracted text in a single text file
extractor.GetText(Path.ChangeExtension(inputFileName, "txt"));
// the text of each page can also be saved to its own text file
int pageCounter = 1;
Console.WriteLine("Trying to extract notes from PDF");
while (extractor.HasNextPageText())
{
    Console.WriteLine("Extracting notes from PDF for page : " + pageCounter.ToString());
    // the backslash must be escaped inside a regular C# string literal
    extractor.GetNextPageText(Path.GetDirectoryName(inputFileName) + "\\PDFTEXT_" + (pageCounter++).ToString() + ".txt");
}

P.S. The input PDF is attached.

shahzadlatif · April 14, 2010, 7:20am

Hi Parthiban,

You can use the following code snippet to extract text page by page from a PDF file.

Aspose.Pdf.Kit.PdfExtractor extractor = new Aspose.Pdf.Kit.PdfExtractor();
// set the PDF file password
extractor.Password = "";
// bind the PDF file to the extractor object
extractor.BindPdf("testPDF.PDF");
// the text of each page can also be saved to its own text file
int pageCounter = 1;
//Console.WriteLine("Trying to extract notes from PDF");
while (extractor.HasNextPageText())
{
    //Console.WriteLine("Extracting notes from PDF for page : " + pageCounter.ToString());
    extractor.GetNextPageText(pageCounter.ToString() + ".txt");
    pageCounter += 1;
}

The sample output files are attached. Files 1.txt and 3.txt contain text, while 2.txt is empty because the second page contains only an image. If you want to test this at your end, you can get a temporary 30-day license from this link.
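
If it helps to confirm which pages came out without text after the extraction has run, the check can be sketched with plain System.IO calls. This is only a minimal sketch and not part of the Aspose API: the class name PageTextReport and the method FindPagesWithoutText are hypothetical, and it assumes the per-page files are named 1.txt, 2.txt, and so on in a single directory, as in the snippet above. It also assumes that a whitespace-only file should count as empty.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Aspose.Pdf.Kit): reports which
// per-page text files contain no text after extraction.
static class PageTextReport
{
    // Assumes files are named "<page>.txt" inside 'directory'.
    public static List<int> FindPagesWithoutText(string directory, int pageCount)
    {
        var emptyPages = new List<int>();
        for (int page = 1; page <= pageCount; page++)
        {
            string path = Path.Combine(directory, page + ".txt");
            // A missing or whitespace-only file is treated as "no text".
            if (!File.Exists(path) || File.ReadAllText(path).Trim().Length == 0)
            {
                emptyPages.Add(page);
            }
        }
        return emptyPages;
    }
}
```

For the three-page sample described above, calling PageTextReport.FindPagesWithoutText(outputDirectory, 3) would be expected to return a list containing only page 2, provided 2.txt really holds no characters other than whitespace.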

I hope this helps. If you have any further questions, please do let us know.
Regards,