How to get Text paragraphs from a PDF

gowthampsrl · March 20, 2012, 6:23pm

Hello,

I am trying to extract existing paragraphs from a PDF. In the DOM API documentation I see there is a TextParagraph class, but I would like to get the text paragraphs from a page.

Could you help us with this requirement? I have attached a sample PDF.

Thanks

Prasanth.S

rashid.ali · March 20, 2012, 10:26pm

Hi Prasanth,

Thanks for using our products.

You can get text from a PDF document using Aspose.Pdf by searching for particular text (using plain text or regular expressions) on a single page or in the whole document, or you can get the complete text of a single page, a range of pages, or the complete document. Please visit the following documentation links for more details and code snippets covering the different ways of extracting text.

Working with Text

Working with Text (Facades)

Please feel free to contact support in case you need any further assistance.

Thanks & Regards,

gowthampsrl · March 20, 2012, 10:40pm

Rashid,

I am trying to get individual paragraphs from a page in a document.

Say, for example, I have a document with several pages, and page 1 contains different paragraphs (chunks of text). I need to get each paragraph individually.

One or more lines make a paragraph, and multiple paragraphs make a page. I would like to identify each paragraph so that I can apply some rules to that chunk of text.

I am not trying to get the complete text of a page or to search for particular text in a page.

Thanks

rashid.ali · March 21, 2012, 1:41am

Hi Prasanth,

Thanks for your feedback. I am sorry to inform you that the requested feature is currently not available in Aspose.Pdf for .NET. However, I have logged a new feature request as PDFNEWNET-33429 in our issue tracking system. Our development team is looking into this feature, and you will be updated via this forum thread once it is supported.

We apologize for the inconvenience.

Thanks & Regards,

asad.ali · January 28, 2018, 10:19pm

@gowthampsrl

Thanks for your patience.

We are pleased to inform you that the earlier logged feature request PDFNET-33429 has been resolved, and extraction of paragraphs from a PDF document has been added in Aspose.PDF for .NET 18.1. Please upgrade to the latest version and use the following code snippets to try the feature:

Sample # 1 - drawing borders around sections and paragraphs of text on a PDF page:

private static void OutlineSample()
{
    Document doc = new Document(myDir + "amblatt2013-10-05.pdf");
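    // The Pages collection is 1-based, so this is the second page.
    Page page = doc.Pages[2];

    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(page);

    // Only one page was visited, so PageMarkups holds a single entry.
    PageMarkup markup = absorber.PageMarkups[0];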

    foreach (MarkupSection section in markup.Sections)
    {
        DrawRectangleOnPage(section.Rectangle, page);
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            DrawPolygonOnPage(paragraph.Points, page);
        }
    }

    doc.Save(myDir + "amblatt2013-10-05_sections&paragraphs.pdf");
}

private static void DrawRectangleOnPage(Rectangle rectangle, Page page)
{
    page.Contents.Add(new Operator.GSave());
    page.Contents.Add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.Contents.Add(new Operator.SetRGBColorStroke(0, 1, 0));
    page.Contents.Add(new Operator.SetLineWidth(2));
    page.Contents.Add(
        new Operator.Re(rectangle.LLX,
            rectangle.LLY,
            rectangle.Width,
            rectangle.Height));
    page.Contents.Add(new Operator.ClosePathStroke());
    page.Contents.Add(new Operator.GRestore());
}

private static void DrawPolygonOnPage(Point[] polygon, Page page)
{
    page.Contents.Add(new Operator.GSave());
    page.Contents.Add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.Contents.Add(new Operator.SetRGBColorStroke(0, 0, 1));
    page.Contents.Add(new Operator.SetLineWidth(1));
    page.Contents.Add(new Operator.MoveTo(polygon[0].X, polygon[0].Y));
    for (int i = 1; i < polygon.Length; i++)
    {
        page.Contents.Add(new Operator.LineTo(polygon[i].X, polygon[i].Y));
    }
    page.Contents.Add(new Operator.LineTo(polygon[0].X, polygon[0].Y));
    page.Contents.Add(new Operator.ClosePathStroke());
    page.Contents.Add(new Operator.GRestore());
}

Sample # 2 - iterating through the paragraphs collection and getting text from each paragraph:

private static void TextSample()
{
    Document doc = new Document(myDir + "amblatt2013-10-05.pdf");

    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(doc);

    foreach (PageMarkup markup in absorber.PageMarkups)
    {
        int i = 1; // section counter, restarted on each page

        foreach (MarkupSection section in markup.Sections)
        {
            int j = 1; // paragraph counter, restarted in each section
            
            foreach (MarkupParagraph paragraph in section.Paragraphs)
            {
                StringBuilder paragraphText = new StringBuilder();

                foreach (List<TextFragment> line in paragraph.Lines)
                {
                    foreach (TextFragment fragment in line)
                    {
                        paragraphText.Append(fragment.Text);
                    }
                    paragraphText.Append("\r\n");
                }
                paragraphText.Append("\r\n");

                Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
                Console.WriteLine(paragraphText.ToString());

                j++;
            }
            i++;
        }
    }
}
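
If the goal is to apply rules to each paragraph rather than print it, the loop from Sample # 2 can be wrapped in a helper that returns the paragraph texts as a list. The following is only a sketch under some assumptions: ExtractParagraphs, MentionsKeyword, the class name ParagraphRules and the file name "input.pdf" are hypothetical names introduced for illustration, the using directives assume the absorber and markup types live in Aspose.Pdf.Text, and lines are joined with a space instead of "\r\n" so that each paragraph becomes a single string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text; // assumed namespace of ParagraphAbsorber and the markup types

static class ParagraphRules
{
    // Hypothetical helper: collects the text of every paragraph in the document,
    // using the same ParagraphAbsorber calls as Sample # 2.
    public static List<string> ExtractParagraphs(string pdfPath)
    {
        var paragraphs = new List<string>();
        Document doc = new Document(pdfPath);

        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.Visit(doc);

        foreach (PageMarkup markup in absorber.PageMarkups)
        {
            foreach (MarkupSection section in markup.Sections)
            {
                foreach (MarkupParagraph paragraph in section.Paragraphs)
                {
                    var text = new StringBuilder();
                    foreach (List<TextFragment> line in paragraph.Lines)
                    {
                        foreach (TextFragment fragment in line)
                        {
                            text.Append(fragment.Text);
                        }
                        text.Append(' '); // join the lines of one paragraph with a space
                    }
                    paragraphs.Add(text.ToString().Trim());
                }
            }
        }
        return paragraphs;
    }

    // Hypothetical example rule: case-insensitive keyword match on one paragraph.
    public static bool MentionsKeyword(string paragraph, string keyword)
    {
        return paragraph.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    static void Main()
    {
        // "input.pdf" and "invoice" are placeholders; substitute your own file and rule.
        foreach (string paragraph in ExtractParagraphs("input.pdf"))
        {
            if (MentionsKeyword(paragraph, "invoice"))
            {
                Console.WriteLine(paragraph);
            }
        }
    }
}
```

Keeping the extraction separate from the rule means further rules can be written as plain string checks without touching the Aspose-specific loop.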

If this approach does not meet your needs or you run into any issues, please feel free to contact us.