Free Support Forum - aspose.com

Extract Text from PDF Pages | PDF to TXT or HTML Converter (C# .NET)

Convert PDF Pages, Paragraphs and Tables to Plain or Formatted HTML Text

Aspose.Words for .NET can extract text from PDF files programmatically and offline. When a PDF document is loaded, its text content is represented by Run nodes in the Aspose.Words for .NET DOM (Document Object Model). Similarly, a paragraph is represented by the Paragraph class, a PDF table by the Table class, and so on.

Plain Text Extraction from All the Pages of a PDF Document

Use the Document class to load a PDF from a file on disk or from a MemoryStream; Aspose.Words detects the PDF file format automatically. The Document.ToString method then returns the plain text representation of the entire PDF file, as in the following C# example:

using System;
using Aspose.Words;

Document doc = new Document("C:\\temp\\input.pdf");
// Get the plain text representation of the entire PDF file
string pdfText = doc.ToString(SaveFormat.Text);
Console.WriteLine(pdfText);

Alternatively, you can convert the PDF document to plain text (.txt file format) with the following code. The save format is determined from the file extension:

Document doc = new Document("C:\\temp\\input.pdf");
doc.Save("C:\\temp\\output.txt");
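
Loading from a MemoryStream is mentioned above but not shown. The following is a minimal sketch, assuming the PDF bytes are already in memory; here they are simply read from the same placeholder path used in the other examples, and in practice they might come from a database or a web request.

```csharp
using System;
using System.IO;
using Aspose.Words;

// Assumption: the PDF bytes come from somewhere in memory.
// Reading them from the placeholder path is only for illustration.
byte[] pdfBytes = File.ReadAllBytes("C:\\temp\\input.pdf");

using (MemoryStream stream = new MemoryStream(pdfBytes))
{
    // The Document(Stream) constructor detects the format from the stream content
    Document doc = new Document(stream);
    string pdfText = doc.ToString(SaveFormat.Text);
    Console.WriteLine(pdfText);
}
```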

Extract Text from a Particular PDF Page or Page Range Programmatically

The PdfLoadOptions class lets you specify additional options when loading a PDF file into a Document object. For example, the PdfLoadOptions.PageIndex and PdfLoadOptions.PageCount properties specify the 0-based index of the first page to read and the number of pages the engine should read from that page onward. Once the page or page range is loaded into the DOM, the following C# example extracts the text of the first three pages of the PDF; change the two values to read any other range.

using System;
using Aspose.Words;
using Aspose.Words.Loading;

// Read three pages, starting at the first page (0-based index)
PdfLoadOptions pdfLoadOptions = new PdfLoadOptions();
pdfLoadOptions.PageIndex = 0;
pdfLoadOptions.PageCount = 3;
Document doc = new Document("C:\\temp\\input.pdf", pdfLoadOptions);
string pdfText = doc.ToString(SaveFormat.Text);
Console.WriteLine(pdfText);
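
Because PageIndex is 0-based, reading a single page by its human-readable page number needs an offset of one. The sketch below wraps that in a helper; ExtractPageText is a hypothetical name introduced here for illustration, not part of Aspose.Words, and the snippet assumes a project that allows top-level statements (C# 9 or later).

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Loading;

// Hypothetical helper (not an Aspose.Words API): returns the plain text of one page.
// pageNumber is 1-based, the way a reader counts pages.
static string ExtractPageText(string pdfPath, int pageNumber)
{
    PdfLoadOptions loadOptions = new PdfLoadOptions();
    loadOptions.PageIndex = pageNumber - 1; // PageIndex is 0-based
    loadOptions.PageCount = 1;              // read only this one page
    Document doc = new Document(pdfPath, loadOptions);
    return doc.ToString(SaveFormat.Text);
}

// Example: print the text of the fifth page of the placeholder file
Console.WriteLine(ExtractPageText("C:\\temp\\input.pdf", 5));
```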

Extract HTML String Representation of Formatted Text from a PDF Paragraph or Table

A Paragraph or Table object can contain multiple Run nodes of text, so you can extract text either from the entire Paragraph or Table or from individual Runs (see the Run.Text property). For a whole node, Node.GetText returns plain text (including control characters such as paragraph breaks), while Node.ToString returns plain text when given SaveFormat.Text or HTML markup when given HtmlSaveOptions.

using Aspose.Words;
using Aspose.Words.Saving;
using Aspose.Words.Tables;

// Load PDF from file
Document doc = new Document("C:\\temp\\input.pdf");
// Obtain the third Paragraph of PDF
Paragraph pdfParagraph = doc.FirstSection.Body.Paragraphs[2];
// Obtain the first Table of PDF
Table pdfTable = doc.FirstSection.Body.Tables[0];

// Different ways to extract plain text from a whole Paragraph or Table
string toString_PlainText = pdfParagraph.ToString(SaveFormat.Text);
string getText_PlainText = pdfParagraph.GetText();
string table_PlainText = pdfTable.ToString(SaveFormat.Text);

// To get the HTML markup string of formatted text from a PDF Table or Paragraph
HtmlSaveOptions htmlSaveOptions = new HtmlSaveOptions(SaveFormat.Html);
htmlSaveOptions.PrettyFormat = true;
htmlSaveOptions.CssStyleSheetType = CssStyleSheetType.Inline;

string paragraph_To_Html = pdfParagraph.ToString(htmlSaveOptions);
string table_To_Html = pdfTable.ToString(htmlSaveOptions);
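
Extraction from individual Runs via the Run.Text property is mentioned in this section but not shown. A minimal sketch follows, assuming the document body contains at least one paragraph; it walks the Runs of the first paragraph and prints each Run's text together with one formatting attribute (bold) to show that formatting is available per Run.

```csharp
using System;
using Aspose.Words;

Document doc = new Document("C:\\temp\\input.pdf");

// Take the first paragraph of the body; it is null if the body has no paragraphs
Paragraph firstParagraph = doc.FirstSection.Body.FirstParagraph;

if (firstParagraph != null)
{
    // Each Run is a span of text that shares the same formatting
    foreach (Run run in firstParagraph.Runs)
    {
        Console.WriteLine("Text: " + run.Text + " | Bold: " + run.Font.Bold);
    }
}
```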