Search text inside PDF using .NET RegEx Class in Aspose.PDF for .NET

yehuda.alon · August 29, 2017, 9:19pm

Alternatively, how can I extract all paragraphs from a PDF document (or from a single page in it)?

imran.rafique · August 30, 2017, 2:44am

@yehuda.alon,
There are several ways to search for text in a PDF document, including regular expressions. Please refer to this help topic: Search and get Text from all pages using Regular Expression. You can also extract the complete text of a PDF document with the TextAbsorber class, as described in this help topic: Extract Text From All the Pages of a PDF Document. Please let us know if you have further questions or need more assistance.
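As a minimal sketch of the TextAbsorber approach mentioned above, the following extracts the text of every page into one string. It assumes the Aspose.PDF for .NET package is referenced; the class name ExtractAllText and the fallback file name input.pdf are placeholders for illustration only.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ExtractAllText
{
    public static void Main(string[] args)
    {
        // Placeholder path: pass your own PDF as the first argument
        string inFile = args.Length > 0 ? args[0] : "input.pdf";

        // Open the document
        Document document = new Document(inFile);

        // TextAbsorber collects the plain text of whatever it is applied to
        TextAbsorber textAbsorber = new TextAbsorber();

        // Apply the absorber to all pages of the document
        document.Pages.Accept(textAbsorber);

        // Print the extracted text
        Console.WriteLine(textAbsorber.Text);
    }
}
```

To limit extraction to one page, the absorber can be applied to that page instead, for example document.Pages[1].Accept(textAbsorber).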

Best Regards,
Imran Rafique

asad.ali · June 22, 2020, 6:13pm

@yehuda.alon

We would like to let you know that Aspose.PDF for .NET now supports searching text inside PDF documents with the .NET Regex class (System.Text.RegularExpressions.Regex). The following code snippet shows how to do this:

Search text using .NET RegEx Class

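using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// GetInputPath is a helper that is not defined in this snippet; substitute the full path to your own PDF file
string inFile = GetInputPath("Aspose.Pdf/PdfWithSeveralPages.pdf");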

// Create a Regex that matches every run of non-whitespace characters (that is, each word)
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[\S]+");

// Open document
Aspose.Pdf.Document document = new Aspose.Pdf.Document(inFile);

// Get the first page (the Pages collection is 1-based)
Page page = document.Pages[1];

// Create a TextFragmentAbsorber to find all matches of the regex
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;

// Accept the absorber for the page
page.Accept(textFragmentAbsorber);

// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// Loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine(textFragment.Text);
}
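To see what a pattern will match before running it against a PDF, it can help to try it on a plain string first. The sketch below uses only the .NET Regex class (no Aspose.PDF calls); the class name RegexTokenDemo, the helper FindAll, and the sample string are made up for illustration. It contrasts the catch-all pattern from the snippet above with a narrower date pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTokenDemo
{
    // Returns the text of every match of the pattern in the input, in order.
    public static string[] FindAll(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Value;
        }
        return results;
    }

    public static void Main()
    {
        // Made-up sample text standing in for the content of a PDF page
        string sample = "Invoice 2017-08-29 total: 42.50 USD";

        // [\S]+ matches each whitespace-delimited token
        // prints: Invoice | 2017-08-29 | total: | 42.50 | USD
        Console.WriteLine(string.Join(" | ", FindAll(@"[\S]+", sample)));

        // A narrower pattern that matches only ISO-style dates
        // prints: 2017-08-29
        Console.WriteLine(string.Join(" | ", FindAll(@"\d{4}-\d{2}-\d{2}", sample)));
    }
}
```

Once a pattern behaves as expected on sample strings, the same Regex object can be passed to the TextFragmentAbsorber constructor as in the snippet above.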