Exporting a Table object to Excel

sjamilCS · June 18, 2021, 6:10pm

The code is in C# (.NET Core):

var pdfDocument = new Aspose.Pdf.Document(fileName);
var absorber = new Aspose.Pdf.Text.TableAbsorber();

foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
{
    absorber.Visit(page);
    for (int idTable = 0; idTable < absorber.TableList.Count; idTable++)
    {
        Aspose.Pdf.Text.AbsorbedTable table = absorber.TableList[idTable];
    }
}

How do I export the table to Excel?

mudassir.fayyaz · June 19, 2021, 11:10am

@sjamilCS

There is no direct way to convert PDF tables to Excel worksheets. However, you can retrieve the tables from the PDF, iterate through their content, store the data in a data source (e.g. an array, an array list, a custom object, or a data table), and then import that data into an Excel worksheet using the Aspose.Cells for .NET API. Please refer to these help topics: "Manipulate Tables in PDF document" (Aspose.PDF API) and "Import Data into Worksheet" (Aspose.Cells API).
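
The approach described above can be sketched roughly as follows. This is a minimal illustration, not an official Aspose sample: it assumes both the Aspose.PDF and Aspose.Cells packages are referenced, the file names input.pdf and tables_out.xlsx are placeholders, and the helper names ReadTables and ToDataTable are made up for this example. Each detected table is written to its own worksheet.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class PdfTablesToExcel
{
    // Collects every table the absorber finds as rows of cell strings.
    // Helper name is illustrative, not part of the Aspose API.
    public static List<List<List<string>>> ReadTables(string pdfPath)
    {
        var tables = new List<List<List<string>>>();
        var pdfDocument = new Aspose.Pdf.Document(pdfPath);

        foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
        {
            // A fresh absorber per page keeps each TableList limited to that page.
            var absorber = new Aspose.Pdf.Text.TableAbsorber();
            absorber.Visit(page);

            foreach (Aspose.Pdf.Text.AbsorbedTable table in absorber.TableList)
            {
                var rows = new List<List<string>>();
                foreach (Aspose.Pdf.Text.AbsorbedRow row in table.RowList)
                {
                    var cells = new List<string>();
                    foreach (Aspose.Pdf.Text.AbsorbedCell cell in row.CellList)
                    {
                        // A cell may hold several text fragments; join them with spaces.
                        var parts = new List<string>();
                        foreach (Aspose.Pdf.Text.TextFragment fragment in cell.TextFragments)
                            parts.Add(fragment.Text);
                        cells.Add(string.Join(" ", parts));
                    }
                    rows.Add(cells);
                }
                tables.Add(rows);
            }
        }
        return tables;
    }

    // Stores one table in a DataTable (the intermediate "data source").
    // Short rows leave their trailing columns as DBNull.
    public static DataTable ToDataTable(List<List<string>> rows)
    {
        var dataTable = new DataTable();
        int columns = rows.Count == 0 ? 0 : rows.Max(r => r.Count);
        for (int c = 0; c < columns; c++)
            dataTable.Columns.Add("Column" + (c + 1), typeof(string));

        foreach (var row in rows)
        {
            DataRow dataRow = dataTable.NewRow();
            for (int c = 0; c < row.Count; c++)
                dataRow[c] = row[c];
            dataTable.Rows.Add(dataRow);
        }
        return dataTable;
    }

    public static void Main()
    {
        var tables = ReadTables("input.pdf");        // placeholder path
        var workbook = new Aspose.Cells.Workbook();  // starts with one empty worksheet

        for (int t = 0; t < tables.Count; t++)
        {
            var sheet = t == 0
                ? workbook.Worksheets[0]
                : workbook.Worksheets[workbook.Worksheets.Add()];

            DataTable data = ToDataTable(tables[t]);
            for (int r = 0; r < data.Rows.Count; r++)
                for (int c = 0; c < data.Columns.Count; c++)
                    sheet.Cells[r, c].PutValue(data.Rows[r][c] as string ?? "");
        }
        workbook.Save("tables_out.xlsx");            // placeholder path
    }
}
```

The cells are written one by one with PutValue to keep the sketch simple; the bulk import methods covered in the Aspose.Cells help topic mentioned above may be a better fit for large tables.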

sjamilCS · June 21, 2021, 3:13pm

Hi Mudassir,

Thanks for the quick reply. The export option exports the table correctly:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(_FileName);
// Instantiate ExcelSaveOptions object
ExcelSaveOptions excelSave = new ExcelSaveOptions { Format = ExcelSaveOptions.ExcelFormat.XLSX };

pdfDocument.Save("PDFToXLS_out.xlsx", excelSave);

This code exports everything from the PDF to Excel. Is there a way to export just a table from Excel?
Thanks,
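
One way to narrow this Excel export, short of isolating a single table, is to copy only the page that contains the table into a new document and save that instead. A minimal sketch, assuming the table sits on page 3 (a placeholder for this example) and using placeholder file names; everything else on that page is still exported:

```csharp
using Aspose.Pdf;

public static class SinglePageToExcel
{
    public static void Main()
    {
        var source = new Document("input.pdf");  // placeholder path
        var singlePage = new Document();         // empty target document

        // Page indexes are 1-based; page 3 is an assumption for this example.
        singlePage.Pages.Add(source.Pages[3]);

        // Same save options as in the post above, applied to the one-page document.
        var excelSave = new ExcelSaveOptions { Format = ExcelSaveOptions.ExcelFormat.XLSX };
        singlePage.Save("page3_out.xlsx", excelSave);
    }
}
```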

mudassir.fayyaz · June 21, 2021, 9:21pm

Please clarify whether you want to export a table from PDF to Excel or from Excel to PDF, because your latest message differs from your initial question.

sjamilCS · June 22, 2021, 4:48pm

My initial goal is still the same: I want to extract a table from a PDF.

I cannot seem to find a clean way to get the table out.
I tried to use
var absorber = new Aspose.Pdf.Text.TableAbsorber();
but it appears that Aspose.Pdf.Text.AbsorbedTable table = absorber.TableList[idTable]; does not always contain the correct information. I saw this with two different PDFs.

In addition, it appears the only clean method that worked was saving to Excel, but the Excel file then contains all of the same information as the PDF.
I also tried to print all the text to the console to verify whether anything is being read, but I get nothing useful.

// Create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);

string extractedText = textAbsorber.Text;

Console.WriteLine(extractedText);

Does this provide more insight?

mudassir.fayyaz · June 22, 2021, 10:38pm

@sjamilCS

Please share the source PDF file so that we can try to reproduce the issue on our end.

sjamilCS · June 23, 2021, 1:41pm

I will check with the end user. Is there internal support for this product? How does paid support work? How would I share the PDF?

mudassir.fayyaz · June 23, 2021, 8:25pm

@sjamilCS

Paid support issues are handled on an urgent basis and have the highest priority. Please note that paid support does not guarantee an immediate solution, but it does expedite the investigation so that an ETA can be provided. In other words, the investigation starts quickly once you report the issue through paid support.

You can share the PDF file here, or send it privately by clicking the name icon in my post and using the Message option. That sends a private message directly to me. Please let us know here as well once you have sent the private message.

sjamilCS · June 24, 2021, 6:10pm

Hi Mudassir,

Sorry for the delay in responding. I was discussing this matter with my supervisor. For the moment, let us stick with free support. Could you please tell me the steps to upload the test PDF? I do not see an upload button in the chat box.

Thanks,
Syed

mudassir.fayyaz · June 24, 2021, 9:46pm

@sjamilCS

You can upload the file in this thread using the upload button. If the file is larger than 9 MB, you can upload it to a file server and share the download link with us.

image.png (7.7 KB)

sjamilCS · June 25, 2021, 4:27pm

first-quarter-10q.pdf (763.7 KB)
Thanks for the screenshot. The file has been uploaded.

mudassir.fayyaz · June 25, 2021, 7:49pm

sjamilCS:

I also tried to print all the text to the console to verify whether anything is being read, but I get nothing useful.

// Create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);

string extractedText = textAbsorber.Text;

Console.WriteLine(extractedText);

Thank you for the file. This code extracts a lot of text from the 139 pages of the PDF document. Could you please summarize the issue and share the page number and screenshots so that we can help you accordingly?

sjamilCS · June 25, 2021, 9:07pm

Weird, I get no output.

static void Main(string[] args)
{
    StringBuilder iOutput = new StringBuilder();

    try
    {
        string txtOutput;
        string _FileName = @"C:\Users\SJamil.025862583557\Python\TestData\first-quarter-10q.pdf";
        // string _FileName = @"C:\Users\SJamil.025862583557\Python\TestData\foo.pdf";
        txtOutput = "I am going to try to parse the PDF: " + _FileName + "\n";
        Console.WriteLine(txtOutput);

        // Load source PDF document
        Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(_FileName);
        // Create TextAbsorber object to extract text
        TextAbsorber textAbsorber = new TextAbsorber();
        // Accept the absorber for all the pages
        pdfDocument.Pages.Accept(textAbsorber);

        string extractedText = textAbsorber.Text;

        Console.WriteLine(extractedText);
    }
    catch (Exception ex)
    {
        // Any exception message is collected here and printed after the try block
        iOutput.Append("\r\nError: " + ex.Message);
    }
    Console.WriteLine(iOutput);
}

Above is my code. Below is my output:
I am going to try to parse the PDF: C:\Users\SJamil.025862583557\Python\TestData\first-quarter-10q.pdf

Process finished with exit code 0.

mudassir.fayyaz · June 28, 2021, 11:43am

@sjamilCS

Your code is extracting a lot of text, but you are running into the evaluation limitations. In the evaluation version, you can process only four elements of any collection (for example, only 4 pages, 4 form fields, etc.). If you want to test Aspose.PDF without the evaluation limitations, you can request a 30-day Temporary License. Please refer to "How to get a Temporary License?"
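
For reference, a license file is typically applied once, before any document is loaded. A minimal sketch, assuming a valid license file has been obtained; the file name Aspose.PDF.lic is a placeholder for wherever that file actually lives:

```csharp
using System;

public static class LicenseSetup
{
    public static void Main()
    {
        // Placeholder path: point this at the .lic file you received.
        var license = new Aspose.Pdf.License();
        license.SetLicense("Aspose.PDF.lic");

        // Load the document only after the license has been set.
        var pdfDocument = new Aspose.Pdf.Document("first-quarter-10q.pdf");
        Console.WriteLine("Pages: " + pdfDocument.Pages.Count);
    }
}
```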

sjamilCS · June 28, 2021, 2:59pm

Thanks Mudassir, I will follow the link to request a temporary license.

sjamilCS · July 1, 2021, 4:57pm

I was able to make some progress with the code, but it appears I have hit another wall. Below is my code:

string txtOutput = "I am going to try to parse the PDF: " + fileName + "\n";
Console.WriteLine(txtOutput);
var pdfDocument = new Aspose.Pdf.Document(fileName);

TextAbsorber textAbsorber = new TextAbsorber();

// Accept the absorber for all the pages
// pdfDocument.Pages.Accept(textAbsorber);

// Get a certain page
// pdfDocument.Pages[1].Accept(textAbsorber);
pdfDocument.Pages[3].Accept(textAbsorber);
Console.WriteLine(textAbsorber.Text);

var absorber = new Aspose.Pdf.Text.TableAbsorber();

long cpt = 0;

Console.WriteLine("Begin");
foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
{
    absorber.Visit(page);
    // Tables
    for (int idTable = 0; idTable < absorber.TableList.Count; idTable++)
    {
        Aspose.Pdf.Text.AbsorbedTable table = absorber.TableList[idTable];

        // Rows
        for (int idRow = 0; idRow < table.RowList.Count; idRow++)
        // foreach (AbsorbedRow row in table.RowList)
        {
            Aspose.Pdf.Text.AbsorbedRow row = table.RowList[idRow];
            string cellText = "";

            // Cells
            foreach (Aspose.Pdf.Text.AbsorbedCell cell in row.CellList)
            {
                foreach (Aspose.Pdf.Text.TextFragment text in cell.TextFragments)
                {
                    cellText = cellText + text.Text + "\t";
                }
            }
            cpt = cpt + 1;
            Console.WriteLine(cpt.ToString() + " - " + cellText);
        }
    }
}

Console.WriteLine(textAbsorber.Text); outputs the tables without any issues.
When I output cellText = cellText + text.Text + "\t"; it shows a bunch of text rather than the actual table. first-quarter-10q.pdf is the PDF I am running this on.

mudassir.fayyaz · July 1, 2021, 11:05pm

@sjamilCS

This code also outputs thousands of lines. Please explain the problem in a bit more detail, with screenshots.

sjamilCS · July 2, 2021, 1:54pm

The first few lines of the table on page 3 are:
Summary Financial Data
Quarter ended
Mar 31, 2021
% Change from
($ in millions, except per share amounts)
Mar 31,
2021
Dec 31,
2020
Mar 31,
2020
Dec 31,
2020
Mar 31,
2020
Selected Income Statement Data
Total revenue $ 18,063 17,925 17,717 1 % 2
Noninterest expense 13,989 14,802 13,048 (5) 7
Pre-tax pre-provision profit (PTPP) (1) 4,074 3,123 4,669 30 (13)

But the table does show this data. The first few lines from the table extractor object are shown below:
Begin
1 - 4,133,571,501
2 - 4,133,571,501
3 - 4,133,571,501
4 - 4,133,571,501
5 - 4,133,571,501
6 - 4,133,571,501
7 - 4,133,571,501
8 - 4,133,571,501
9 - 4,133,571,501
10 - 4,133,571,501
11 - 4,133,571,501
12 - 4,133,571,501
13 - Wells Fargo & Company
14 - Consumer Banking and Lending Commercial Banking Corporate and Investment Banking Wealth and Investment Management Corporate Consumer and Small Business Banking Home Lending Credit Card Auto Personal Lending Middle Market Banking Asset-Based Lending and Leasing Banking Commercial Real Estate Markets Wells Fargo Advisors The Private Bank Corporate Treasury Enterprise Functions Investment Portfolio Affiliated venture capital and private equity partnerships Non-strategic businesses
15 - Consumer Banking and Lending
16 - Consumer and Small Business Banking Home Lending Credit Card Auto Personal Lending
17 - Commercial Banking

I hope that clears up my issue with the current code. It is detecting the wrong information: Console.WriteLine(textAbsorber.Text); shows the correct information, but the first table is detected incorrectly.
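
To compare the two absorbers like for like, it may help to run TableAbsorber on page 3 only, since the loop in the code above visits every page and so starts printing with whatever is detected on page 1. A minimal sketch; the file name is the one from this thread, FormatRow is a helper name made up for this example, and this only narrows the output. It does not change how the tables are detected.

```csharp
using System;
using System.Collections.Generic;

public static class PageThreeTables
{
    // Joins the cells of one row with tabs; fragments inside a cell are joined with spaces.
    public static string FormatRow(IEnumerable<IEnumerable<string>> cells)
    {
        var cellTexts = new List<string>();
        foreach (var fragments in cells)
            cellTexts.Add(string.Join(" ", fragments));
        return string.Join("\t", cellTexts);
    }

    public static void Main()
    {
        var pdfDocument = new Aspose.Pdf.Document("first-quarter-10q.pdf");
        var absorber = new Aspose.Pdf.Text.TableAbsorber();

        // Page indexes are 1-based; visit only the page that holds the table of interest.
        absorber.Visit(pdfDocument.Pages[3]);

        for (int idTable = 0; idTable < absorber.TableList.Count; idTable++)
        {
            Console.WriteLine("Table " + (idTable + 1));
            foreach (Aspose.Pdf.Text.AbsorbedRow row in absorber.TableList[idTable].RowList)
            {
                var cells = new List<List<string>>();
                foreach (Aspose.Pdf.Text.AbsorbedCell cell in row.CellList)
                {
                    var fragments = new List<string>();
                    foreach (Aspose.Pdf.Text.TextFragment fragment in cell.TextFragments)
                        fragments.Add(fragment.Text);
                    cells.Add(fragments);
                }
                Console.WriteLine(FormatRow(cells));
            }
        }
    }
}
```

Separating cells with tabs and fragments with spaces makes it easier to see where the absorber has placed the cell boundaries.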

mudassir.fayyaz · July 4, 2021, 11:44pm

@sjamilCS

I have been able to reproduce the issue on our end. A ticket with ID PDFNET-50149 has been created in our issue tracking system to investigate it further. This thread has been linked to the ticket so that you will be notified once the issue is fixed.