Hello,
Hi Prasanth,
Thanks for using our products.
You can extract text from a PDF document with Aspose.Pdf in two ways: search for specific text (using plain text or a regular expression) on a single page or across the whole document, or retrieve the complete text of a single page, a range of pages, or the entire document. Please visit the following documentation links for more details and code snippets covering the different ways of extracting text.
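As a rough illustration of the two approaches described above, here is a minimal sketch. The file name "input.pdf", the search phrase, and the date pattern are placeholders I have assumed for the example; they are not taken from your document. It uses TextFragmentAbsorber for searching and TextAbsorber for complete text, so please treat it as a starting point rather than a definitive implementation.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class TextExtractionSketch
{
    static void Main()
    {
        // Placeholder path (assumption): point this at your own PDF file.
        Document doc = new Document("input.pdf");

        // 1) Search for a plain-text phrase across the whole document.
        //    "invoice" is a hypothetical phrase chosen for illustration.
        TextFragmentAbsorber phraseAbsorber = new TextFragmentAbsorber("invoice");
        doc.Pages.Accept(phraseAbsorber);
        foreach (TextFragment fragment in phraseAbsorber.TextFragments)
        {
            Console.WriteLine("Page {0}: {1}", fragment.Page.Number, fragment.Text);
        }

        // 2) Search with a regular expression on a single page.
        //    Page indexes are 1-based; the date pattern is a hypothetical example.
        TextFragmentAbsorber regexAbsorber = new TextFragmentAbsorber(@"\d{4}-\d{2}-\d{2}");
        regexAbsorber.TextSearchOptions = new TextSearchOptions(true); // true = treat phrase as regex
        doc.Pages[1].Accept(regexAbsorber);
        foreach (TextFragment fragment in regexAbsorber.TextFragments)
        {
            Console.WriteLine("Match: {0}", fragment.Text);
        }

        // 3) Get the complete text of the whole document.
        TextAbsorber textAbsorber = new TextAbsorber();
        doc.Pages.Accept(textAbsorber);
        Console.WriteLine(textAbsorber.Text);
    }
}
```

To get the complete text of a single page instead of the whole document, you can call Accept on that page (for example doc.Pages[1].Accept(textAbsorber)) rather than on doc.Pages.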
Please feel free to contact support in case you need any further assistance.
Thanks & Regards,
Rashid,
Hi Prasanth,
Thanks for your feedback. I am sorry to inform you that the required feature is currently not available in Aspose.Pdf for .NET. However, I have logged a new feature request as PDFNEWNET-33429 in our issue tracking system. Our development team is looking into this feature, and you will be updated via this forum thread once it is supported.
We apologize for the inconvenience.
Thanks & Regards,
Thanks for your patience.
We are pleased to inform you that the previously logged feature request PDFNET-33429 has been resolved, and the ability to extract paragraphs from a PDF document has been added in Aspose.PDF for .NET 18.1. Please upgrade to the latest version of the API and use the following code snippets to try the feature:
Sample # 1 - drawing borders around the sections and paragraphs of text on a PDF page:
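// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
// myDir is the path of the folder that holds the input and output files.
private static void OutlineSample()
{
    Document doc = new Document(myDir + "amblatt2013-10-05.pdf");
    Page page = doc.Pages[2];

    // Analyze the layout of the selected page only.
    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(page);

    // Only one page was visited, so there is a single markup entry.
    PageMarkup markup = absorber.PageMarkups[0];
    foreach (MarkupSection section in markup.Sections)
    {
        // Outline the section, then each paragraph inside it.
        DrawRectangleOnPage(section.Rectangle, page);
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            DrawPolygonOnPage(paragraph.Points, page);
        }
    }

    doc.Save(myDir + "amblatt2013-10-05_sections&paragraphs.pdf");
}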
// Strokes a green, 2-unit-wide rectangle on the page.
private static void DrawRectangleOnPage(Rectangle rectangle, Page page)
{
    page.Contents.Add(new Operator.GSave());
    page.Contents.Add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.Contents.Add(new Operator.SetRGBColorStroke(0, 1, 0));
    page.Contents.Add(new Operator.SetLineWidth(2));
    page.Contents.Add(
        new Operator.Re(rectangle.LLX,
            rectangle.LLY,
            rectangle.Width,
            rectangle.Height));
    page.Contents.Add(new Operator.ClosePathStroke());
    page.Contents.Add(new Operator.GRestore());
}
// Strokes a blue, 1-unit-wide closed polygon through the given points.
private static void DrawPolygonOnPage(Point[] polygon, Page page)
{
    page.Contents.Add(new Operator.GSave());
    page.Contents.Add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.Contents.Add(new Operator.SetRGBColorStroke(0, 0, 1));
    page.Contents.Add(new Operator.SetLineWidth(1));
    page.Contents.Add(new Operator.MoveTo(polygon[0].X, polygon[0].Y));
    for (int i = 1; i < polygon.Length; i++)
    {
        page.Contents.Add(new Operator.LineTo(polygon[i].X, polygon[i].Y));
    }
    // Return to the first point so that the outline is closed.
    page.Contents.Add(new Operator.LineTo(polygon[0].X, polygon[0].Y));
    page.Contents.Add(new Operator.ClosePathStroke());
    page.Contents.Add(new Operator.GRestore());
}
Sample # 2 - iterating through the paragraphs collection and getting the text from each paragraph:
// Requires: using System; using System.Collections.Generic; using System.Text;
//           using Aspose.Pdf; using Aspose.Pdf.Text;
private static void TextSample()
{
    Document doc = new Document(myDir + "amblatt2013-10-05.pdf");

    // Analyze the layout of every page in the document.
    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(doc);

    foreach (PageMarkup markup in absorber.PageMarkups)
    {
        int i = 1; // section counter, restarted on each page
        foreach (MarkupSection section in markup.Sections)
        {
            int j = 1; // paragraph counter, restarted in each section
            foreach (MarkupParagraph paragraph in section.Paragraphs)
            {
                // Join the fragments of each line, one line of text per row.
                StringBuilder paragraphText = new StringBuilder();
                foreach (List<TextFragment> line in paragraph.Lines)
                {
                    foreach (TextFragment fragment in line)
                    {
                        paragraphText.Append(fragment.Text);
                    }
                    paragraphText.Append("\r\n");
                }
                paragraphText.Append("\r\n");

                Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
                Console.WriteLine(paragraphText.ToString());
                j++;
            }
            i++;
        }
    }
}
If this approach does not meet your requirements or you face any issue, please feel free to contact us.