Extract different sections from PDF document

Greetings
I am trying to use Aspose.Pdf .Net to extract the table from this document, primarily the data within the Description, Qty, U/M, Rate and Total fields. How do I do this? Can I use the table extraction mechanism?
Also, say I wanted to extract the address from the document. How would I do that?

KAECY MORGAN 20722.pdf (182.7 KB)

@jdean2k6

Thank you for contacting support.

Please see the documentation articles below for reference. Alternatively, you can iterate through each row and cell of the tables found by a TableAbsorber object to extract the text from a table.

  • Extract Text from PDF

  • Search and Get Text from Pages of a PDF Document

    // Requires: using System; using Aspose.Pdf; using Aspose.Pdf.Text;
    // dataDir is the path of the folder that contains the PDF file.
    Document pdfDocument = new Document(dataDir + "Test.pdf");
    TableAbsorber absorber = new TableAbsorber();
    // Detect the tables on the first page (page numbers start at 1).
    absorber.Visit(pdfDocument.Pages[1]);
    foreach (AbsorbedTable table in absorber.TableList)
    {
        foreach (AbsorbedRow row in table.RowList)
        {
            foreach (AbsorbedCell cell in row.CellList)
            {
                // Each cell exposes the text fragments it contains.
                TextFragmentCollection textFragmentCollection = cell.TextFragments;
                foreach (TextFragment fragment in textFragmentCollection)
                {
                    Console.WriteLine(fragment.Text);
                }
            }
        }
    }
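
For the address, which sits outside the table, one option is the region-limited text extraction covered in the articles above. The following is only a minimal sketch: the file name invoice.pdf and the rectangle coordinates are assumptions, so they need to be adjusted to where the address actually appears on the page.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class AddressExtractor
{
    static void Main()
    {
        // Hypothetical file name; replace with the actual invoice path.
        Document pdfDocument = new Document("invoice.pdf");

        TextAbsorber absorber = new TextAbsorber();
        // Restrict extraction to the rectangle set below.
        absorber.TextSearchOptions.LimitToPageBounds = true;
        // Assumed coordinates (llx, lly, urx, ury) in points,
        // measured from the bottom-left corner of the page.
        absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(40, 650, 300, 750);

        // Run the absorber on the first page only.
        pdfDocument.Pages[1].Accept(absorber);
        Console.WriteLine(absorber.Text);
    }
}
```

A fixed rectangle only works if the address stays in the same position on every invoice of this layout.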
    

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

Okay, I did that and it worked. Now I want to know how to isolate the different sections. For example, if I want to extract only the Description column, what's the best way to do that? The document is an invoice, so the table could stretch onto another page, and I need an algorithm that can handle that. Just checking if it's possible.

@jdean2k6

You can search for text within a particular region of a page, as explained in the previously shared documentation article. Alternatively, you can iterate through each row and extract the text from the first cell, as per your requirements.
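
The row-by-row approach could be extended to a multi-page invoice roughly as follows. This is a sketch under several assumptions, not a definitive implementation: the file name invoice.pdf is hypothetical, the header cell is assumed to read exactly "Description", and continuation pages are assumed to keep the same column order even if the header row is not repeated. If a page contains other tables, their rows would need to be filtered out as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class DescriptionColumnExtractor
{
    // Joins all text fragments of one absorbed cell into a single string.
    static string CellText(AbsorbedCell cell)
    {
        StringBuilder text = new StringBuilder();
        foreach (TextFragment fragment in cell.TextFragments)
        {
            text.Append(fragment.Text);
        }
        return text.ToString().Trim();
    }

    static void Main()
    {
        // Hypothetical file name; replace with the actual invoice path.
        Document pdfDocument = new Document("invoice.pdf");
        List<string> descriptions = new List<string>();
        // Index of the Description column; unknown until a header row is seen.
        int column = -1;

        foreach (Page page in pdfDocument.Pages)
        {
            // Use a fresh absorber per page so TableList holds only that page's tables.
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);

            foreach (AbsorbedTable table in absorber.TableList)
            {
                foreach (AbsorbedRow row in table.RowList)
                {
                    List<string> cells = new List<string>();
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        cells.Add(CellText(cell));
                    }

                    // A header row fixes (or re-confirms) the column index and is skipped.
                    int header = cells.FindIndex(
                        c => c.Equals("Description", StringComparison.OrdinalIgnoreCase));
                    if (header >= 0)
                    {
                        column = header;
                        continue;
                    }

                    // Data row: keep the non-empty cell in the remembered column.
                    if (column >= 0 && column < cells.Count && cells[column].Length > 0)
                    {
                        descriptions.Add(cells[column]);
                    }
                }
            }
        }

        foreach (string description in descriptions)
        {
            Console.WriteLine(description);
        }
    }
}
```

Remembering the column index across pages is what lets rows on a continuation page be picked up when the header is not printed again.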