Extract Tables From PDF

I want C# code (.NET Framework) to extract all tables from a PDF, or at least to detect the tables in a PDF and number them one by one.

@HarshICE

Please check the following article(s) in the API documentation, which demonstrate how to add tables to a PDF file and how to extract them from one.

Also, please explain a bit more about how you want to number the tables one by one. If possible, share a sample of the expected output PDF so that we can give feedback accordingly.

We want to detect the existing tables in a PDF. We don’t want to add a new table. After detecting the tables, we want to number them like “Table-1, Table-2, …” and so on.

I am providing a sample document in which many tables are placed at random positions. We want to detect all tables in that PDF and number them.

PDFuploadcheck.pdf (461.6 KB)

@HarshICE

You can use the following code snippet to extract the tables from existing PDF documents:

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to be the folder that contains the input PDF.
Document pdfDocument = new Document(dataDir + "PDFuploadcheck.pdf");
foreach (Page page in pdfDocument.Pages)
{
    // Scan the current page for tables.
    TableAbsorber absorber = new TableAbsorber();
    absorber.Visit(page);
    foreach (AbsorbedTable table in absorber.TableList)
    {
        foreach (AbsorbedRow row in table.RowList)
        {
            foreach (AbsorbedCell cell in row.CellList)
            {
                // Each cell holds one or more text fragments.
                foreach (TextFragment fragment in cell.TextFragments)
                {
                    // Join the fragment's segments into one string.
                    string txt = "";
                    foreach (TextSegment seg in fragment.Segments)
                    {
                        txt += seg.Text;
                    }
                    Console.WriteLine(txt);
                }
            }
        }
    }
}

We tried the above code snippet with your PDF and noticed that the document has security applied, so its content could not be extracted. You can change the security settings of your PDF to allow content extraction and then try the code snippet again. Regarding the table numbers: do you want to add them to the PDF document as text, or do you simply want to manage them in code for your further operations?
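
For the first option (labels written into the PDF itself), a rough sketch might look like the following. It assumes that TableAbsorber actually detects the tables in the document, that AbsorbedTable.Rectangle gives each table's position on the page, and that there is free space just above each table. The output file name, the 5-point gap and the font size are placeholder choices for illustration only.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is assumed to be the folder that contains the input PDF.
Document pdfDocument = new Document(dataDir + "PDFuploadcheck.pdf");
int tableNumber = 0;

foreach (Page page in pdfDocument.Pages)
{
    // Detect the tables on the current page.
    TableAbsorber absorber = new TableAbsorber();
    absorber.Visit(page);

    TextBuilder builder = new TextBuilder(page);
    foreach (AbsorbedTable table in absorber.TableList)
    {
        tableNumber++;

        // Place the label just above the table's top-left corner.
        // The 5-point gap is an arbitrary illustrative value.
        TextFragment label = new TextFragment("Table-" + tableNumber);
        label.Position = new Position(table.Rectangle.LLX, table.Rectangle.URY + 5);
        label.TextState.FontSize = 10;

        builder.AppendText(label);
    }
}

// Hypothetical output file name.
pdfDocument.Save(dataDir + "PDFuploadcheck_numbered.pdf");
```

The counter runs across pages, so the numbering continues through the whole document instead of restarting on each page. If no tables are detected, the output is simply an unlabelled copy of the input.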

No, there is no security applied to my PDF. I can extract all of its text easily using the Aspose TextAbsorber class. I’m attaching a screenshot of my PDF’s security properties, where you can clearly see that there is no security. Regarding the table numbers, I want the output inside the PDF document.
In short, my input is the PDF I provided earlier, and I want a PDF as output in which all tables are marked with labels like “Table-1, Table-2, …”.

Thank you.

2020-12-23.png (255.8 KB)

@HarshICE

Thanks for getting back to us.

We are afraid that, at the moment, the API is unable to extract the tables from your PDF, for a reason we have not yet identified. We have logged this case under ticket ID PDFNET-49206 in our issue tracking system for further investigation. We will look into the details and keep you posted on the status of its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.