Extract Table from PDF into Excel or CSV file

Hello,
I need to extract only the table data from a given PDF into Excel or CSV files. Can you please share sample code that extracts tables from a PDF into an Excel/CSV file?

I tried your sample code, but it converts the entire document into Excel. However, I need to extract just the tables (columns/data) into an Excel/CSV file.

verizon-green-financing-framework-second-party-opinion.pdf (418.9 KB)

Appreciate your help! I am attaching a document for your reference.

@sanjaybk

We need to investigate this requirement. Can you please point out which specific table you want to extract as a separate CSV or Excel file? It would also help if you could share a sample of the expected output.

@asad.ali, Thanks for checking.
If you look at the PDF, there is a table starting on page 8 and continuing onto page 9. That is the table I'm looking for. I'm also attaching an expected output file.

AsposeTable.zip (7.4 KB)

@sanjaybk

An investigation ticket, PDFNET-52452, has been logged in our issue tracking system for further analysis. We will investigate the feasibility of your requirements and let you know as soon as the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

@sanjaybk

As a workaround, all text outside the tables (and all images) can be deleted before saving, so that the generated Excel or CSV document contains only the table content. Please see the code snippet below. Note that the tables on pages 8, 9, and 17 were extracted correctly, but there is an issue with the table on page 16 that will be fixed in version 22.11.

using Aspose.Pdf;
using Aspose.Pdf.Text;

// Load the source PDF
Document pdfDocument = new Document("Verizon_PDFNET_52452.pdf");
foreach (Page page in pdfDocument.Pages)
{
    page.Resources.Images.Clear(); // Remove images
    RemoveNonTableFragments(page); // Blank out text outside detected tables
}

ExcelSaveOptions options = new ExcelSaveOptions()
{
    MinimizeTheNumberOfWorksheets = true,
    // Set output format (XLSX by default)
    // Format = ExcelSaveOptions.ExcelFormat.CSV
};

pdfDocument.Save("Verizon_PDFNET_52452.xlsx", options);

...


public static void RemoveNonTableFragments(Page page)
{
    // Detect the tables and collect every text fragment on the page
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    TableAbsorber tableAbsorber = new TableAbsorber();
    tableAbsorber.Visit(page);
    textFragmentAbsorber.Visit(page);

    foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
    {
        // Blank out any fragment that does not overlap a detected table
        if (!IsThisFragmentInsideTable(tableAbsorber, textFragment))
        {
            textFragment.Text = "";
        }
    }
}

public static bool IsThisFragmentInsideTable(TableAbsorber tableAbsorber, TextFragment fragment)
{
    foreach (AbsorbedTable table in tableAbsorber.TableList)
    {
        // A non-null intersection means the fragment overlaps this table's bounding box
        if (fragment.Rectangle.Intersect(table.Rectangle) != null)
        {
            return true;
        }
    }

    return false;
}
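
For the case where only one table is needed (for example, the one on pages 8 and 9), a possible alternative is to read the cells straight from TableAbsorber and write the CSV yourself instead of saving the whole document. The following is only a rough sketch and is not the workaround from the reply above: the class name TableToCsvSketch, the helpers EscapeCsv and ExportTables, and the output file name are hypothetical, and it assumes the tables on those pages are detected correctly by TableAbsorber.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TableToCsvSketch
{
    // Hypothetical helper: quote a value if it contains a comma, quote, or line break.
    public static string EscapeCsv(string value)
    {
        if (value.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
        {
            return value;
        }
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Hypothetical helper: write every table detected on pages
    // firstPage..lastPage (1-based, inclusive) into a single CSV file.
    public static void ExportTables(string pdfPath, string csvPath, int firstPage, int lastPage)
    {
        using (Document pdfDocument = new Document(pdfPath))
        using (StreamWriter writer = new StreamWriter(csvPath, false, Encoding.UTF8))
        {
            for (int pageNumber = firstPage; pageNumber <= lastPage; pageNumber++)
            {
                // Use a fresh absorber per page so TableList holds only that page's tables
                TableAbsorber tableAbsorber = new TableAbsorber();
                tableAbsorber.Visit(pdfDocument.Pages[pageNumber]);

                foreach (AbsorbedTable table in tableAbsorber.TableList)
                {
                    foreach (AbsorbedRow row in table.RowList)
                    {
                        List<string> cells = new List<string>();
                        foreach (AbsorbedCell cell in row.CellList)
                        {
                            // Join all text fragments of the cell into one value
                            StringBuilder cellText = new StringBuilder();
                            foreach (TextFragment fragment in cell.TextFragments)
                            {
                                if (cellText.Length > 0)
                                {
                                    cellText.Append(' ');
                                }
                                cellText.Append(fragment.Text);
                            }
                            cells.Add(EscapeCsv(cellText.ToString()));
                        }
                        writer.WriteLine(string.Join(",", cells));
                    }
                }
            }
        }
    }
}
```

It could be called as TableToCsvSketch.ExportTables("Verizon_PDFNET_52452.pdf", "table_pages_8_9.csv", 8, 9). Because the rows from both pages are written one after another, a table that spans pages 8 and 9 ends up as one continuous block in the CSV, although any header row repeated on the second page would need to be removed separately.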