Alternatively, how can I extract all paragraphs from a PDF document (or from a single page in it)?
@yehuda.alon,
There are several ways to search for text in a PDF document, including with regular expressions. Please refer to this help topic: Search and get Text from all pages using Regular Expression. You can also extract the complete text of a PDF document with the TextAbsorber class, as described in this help topic: Extract Text From All the Pages of a PDF Document. Please let us know if you need any further assistance or have more questions.
Best Regards,
Imran Rafique
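As a rough illustration of the TextAbsorber approach mentioned in the reply above, a minimal sketch might look like the following. The file name "input.pdf" and the class name ExtractAllText are placeholders chosen for this example, not taken from the thread.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractAllText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (an assumption for this sketch)
        Document document = new Document("input.pdf");

        // TextAbsorber collects the text of whatever it is accepted on
        TextAbsorber textAbsorber = new TextAbsorber();

        // Whole document; for one page only, call
        // document.Pages[1].Accept(textAbsorber) instead (pages are 1-based)
        document.Pages.Accept(textAbsorber);

        // The extracted text is returned as a single string
        Console.WriteLine(textAbsorber.Text);
    }
}
```

This sketch returns the text as one string and does not split it into paragraphs; any paragraph grouping would have to be applied to the result afterwards.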
We would like to share that Aspose.PDF for .NET now supports searching text inside PDF documents with the .NET Regex class. The following code snippet shows how to do this:
Search text using the .NET Regex class
using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// GetInputPath is a helper that is not shown in this post; substitute the path to your own PDF
string inFile = GetInputPath("Aspose.Pdf/PdfWithSeveralPages.pdf");
// Create a Regex object that matches every run of non-whitespace characters (i.e. every word)
Regex regex = new Regex(@"[\S]+");
// Open the document
Document document = new Document(inFile);
// Get a particular page (page numbering starts at 1)
Page page = document.Pages[1];
// Create a TextFragmentAbsorber object to find all instances of the input regex
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
// Accept the absorber for the page
page.Accept(textFragmentAbsorber);
// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// Loop through the fragments and print the text of each match
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine(textFragment.Text);
}