Extract different sections from PDF document

jdean2k6 · January 18, 2019, 12:56am

Greetings
I am trying to use Aspose.Pdf .Net to extract the table from this document, primarily the data within the Description, Qty, U/M, Rate and Total fields. How do I do this? Can I use the table extraction mechanism?
Also, say I wanted to extract the address from the document. How would I do that?

KAECY MORGAN 20722.pdf (182.7 KB)

Farhan.Raza · January 18, 2019, 9:52am

@jdean2k6

Thank you for contacting support.

Please see the documentation articles below for your reference. Alternatively, you can iterate through each row and cell of the tables found by a TableAbsorber object to extract the text from a table, as in the code that follows the links.

Extract Text from PDF

Search and Get Text from Pages of a PDF Document

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the PDF file.
Document pdfDocument = new Document(dataDir + "Test.pdf");
TableAbsorber absorber = new TableAbsorber();
// Detect tables on the first page (page numbering starts at 1).
absorber.Visit(pdfDocument.Pages[1]);
foreach (AbsorbedTable table in absorber.TableList)
{
  foreach (AbsorbedRow row in table.RowList)
  {
      foreach (AbsorbedCell cell in row.CellList)
      {
          // Each cell exposes the text fragments found inside it.
          TextFragmentCollection textFragmentCollection = cell.TextFragments;
          foreach (TextFragment fragment in textFragmentCollection)
          {
              Console.WriteLine(fragment.Text);
          }
      }
  }
}
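
For the address, which is not part of the table, one option is the region-based search described in the second article. The following is only a minimal sketch: the file name and the rectangle coordinates are placeholders (assumptions, not values measured from the attached invoice), so they would need to be adjusted to the area where the address actually appears.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class AddressRegionSketch
{
    static void Main()
    {
        // Hypothetical path; substitute the real invoice file.
        Document pdfDocument = new Document("Test.pdf");

        TextAbsorber absorber = new TextAbsorber();
        // Restrict extraction to text that lies inside the rectangle below.
        absorber.TextSearchOptions.LimitToPageBounds = true;
        // Placeholder coordinates in points: lower-left x, lower-left y,
        // upper-right x, upper-right y. The origin is the bottom-left
        // corner of the page.
        absorber.TextSearchOptions.Rectangle =
            new Aspose.Pdf.Rectangle(50, 600, 300, 720);

        // Run the absorber on the first page only.
        pdfDocument.Pages[1].Accept(absorber);

        Console.WriteLine(absorber.Text);
    }
}
```

This approach assumes the address sits in roughly the same position on every invoice. If the layout varies, the rectangle would have to be determined per document.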

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

jdean2k6 · January 19, 2019, 7:58am

Okay, I did that and it worked. Now I want to know how I would isolate the different sections. For example, if I want to extract only the Description column, what's the best way to do that? If you look at the document, it's an invoice, so the table could stretch onto another page, and I need an algorithm that can handle that. Just checking if it's possible.

Farhan.Raza · January 19, 2019, 5:33pm

@jdean2k6

You may search for text in a particular region of a page, as explained in the previously shared documentation article. Alternatively, you may iterate through each row and extract the text from its first cell, as per your requirements.
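
The row-by-row approach could be sketched as follows for an invoice that runs over several pages. This is a rough illustration under stated assumptions: it assumes Description is the first column (index 0) of every detected table, that a header row can be recognised by the literal text "Description", and that the file name is a placeholder. A fresh TableAbsorber is created for each page so that the result does not depend on how TableList behaves across repeated Visit calls.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class DescriptionColumnSketch
{
    // Joins the text fragments of one cell into a single string.
    static string CellText(AbsorbedCell cell)
    {
        List<string> parts = new List<string>();
        foreach (TextFragment fragment in cell.TextFragments)
        {
            parts.Add(fragment.Text);
        }
        return string.Join(" ", parts).Trim();
    }

    static void Main()
    {
        // Hypothetical path; substitute the real invoice file.
        Document pdfDocument = new Document("Test.pdf");

        // Assumption: Description is the first column of the table.
        const int columnIndex = 0;
        List<string> descriptions = new List<string>();

        // Walk every page so that a table continued on later pages is included.
        foreach (Page page in pdfDocument.Pages)
        {
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);

            foreach (AbsorbedTable table in absorber.TableList)
            {
                foreach (AbsorbedRow row in table.RowList)
                {
                    // Skip rows that have no cell at the chosen index.
                    if (row.CellList.Count <= columnIndex) continue;

                    string text = CellText(row.CellList[columnIndex]);

                    // Skip empty cells and header rows repeated on each page.
                    if (text.Length == 0) continue;
                    if (text.Equals("Description", StringComparison.OrdinalIgnoreCase)) continue;

                    descriptions.Add(text);
                }
            }
        }

        foreach (string description in descriptions)
        {
            Console.WriteLine(description);
        }
    }
}
```

If a page contains other tables besides the invoice lines, their first cells would be collected as well, so some extra filtering (for example by column count or by table position) may be needed.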