Table Extraction from PDF

sathish.sundaresan · October 25, 2023, 7:12am

Hi,

I’d like to buy Aspose.PDF, but I need some clarity first. TableAbsorber isn’t working on my PDF, although the same PDF converts to Excel perfectly. Please help me figure out what’s wrong so that I can fix it in my code.

thanks
M.S.Sathish

sergei.shibanov · October 25, 2023, 8:59am

@sathish.sundaresan
Please attach the source document and the code used, with a more detailed explanation of what does not work.

sathish.sundaresan · October 25, 2023, 9:34am

When there is a larger gap between two rows in the PDF, we need the Excel output to contain empty rows, or to keep the same cell padding, so that we can identify the rows correctly. Is there a way to do this, or an alternative, in the Aspose.PDF PDF-to-Excel conversion?

sergei.shibanov · October 25, 2023, 3:54pm

@sathish.sundaresan

Are you doing a PDF -> Excel conversion? Then why are we talking about TableAbsorber?
It would help if you could attach the document and the code snippet you used.

sathish.sundaresan · February 29, 2024, 8:26am

Hi sergei,

Please see the attachments for the documents and the C# code file we used with TableAbsorber. Please help me extract the table from the document.

05107_1 1.pdf (466.7 KB)
33333.pdf (106.7 KB)

ExtractMarkedtable.pdf (6.3 KB)

Thanks
M.S.Sathish
9176398138

sergei.shibanov · February 29, 2024, 6:35pm

@sathish.sundaresan
Thank you, I will study the information provided and write to you tomorrow.

sergei.shibanov · March 1, 2024, 4:39pm

@sathish.sundaresan
In document 05107_1 1.pdf, all characters are paths (i.e., drawn graphically rather than stored as text). As a result, the document as it stands cannot be converted to Excel and cannot be processed by TableAbsorber.

To extract tables from document 33333.pdf, I used the following code:

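using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the attached PDF files
var pdfDocument = new Document(dataDir + "33333.pdf");
var tableAbsorber = new TableAbsorber();

// Switch the absorber to the flow engine for table detection
tableAbsorber.UseFlowEngine = true;

// Visit the first page with the absorber (page numbering starts at 1)
tableAbsorber.Visit(pdfDocument.Pages[1]);

foreach (AbsorbedTable table in tableAbsorber.TableList)
{
    Console.WriteLine("_______ Table __________");
    foreach (AbsorbedRow row in table.RowList)
    {
        foreach (AbsorbedCell cell in row.CellList)
        {
            // Concatenate all text fragments of the cell between '|' delimiters
            string text = "|";
            foreach (TextFragment textFragment in cell.TextFragments)
            {
                text += textFragment.Text;
            }
            Console.Write(text + '|');
        }
        // Close the row with a separator line
        Console.WriteLine("---------------------------------------------------");
    }
}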

As you can see from the attached screenshot 1.png (6.7 KB), the result is correct.

sathish.sundaresan · March 4, 2024, 1:11pm

Thanks, sergei.shibanov. Let me check this solution; I will let you know if there are any issues.

sergei.shibanov · March 4, 2024, 2:38pm

@sathish.sundaresan
Yes, sure

sathish.sundaresan · March 7, 2024, 8:51am

Hi,

Thank you for the solution. It is now working properly, but I have another issue: when we use TableAbsorber, a cell is split into two because the data contains a space. Is there a property to handle this space, or to merge cells based on the gap or the column header value? Please check the attached document and let me know if you need anything else.
MicrosoftTeams-image (4).png (6.9 KB)
04129_2.pdf (204.6 KB)

Thanks
M.S.Sathish
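
Editor's note: the thread does not name a TableAbsorber property for merging split cells. One possible workaround is to post-process each absorbed row and join neighbouring cells whose horizontal gap is small. The sketch below is a minimal illustration of that idea and is not part of Aspose.PDF: `CellBox`, `CellMerger`, `MergeByGap`, and `maxGap` are hypothetical names. It assumes that each cell's left and right edges can be read from the absorbed cell's bounding rectangle and that its text is the concatenation of its text fragments; please verify both against your Aspose.PDF version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, library-independent view of one absorbed cell:
// its horizontal extent in page units plus its concatenated text.
public sealed record CellBox(double Left, double Right, string Text);

public static class CellMerger
{
    // Merges neighbouring cells of a single row when the horizontal gap
    // between them is at most maxGap. Cells are sorted left to right first.
    public static List<CellBox> MergeByGap(IEnumerable<CellBox> row, double maxGap)
    {
        var merged = new List<CellBox>();
        foreach (var cell in row.OrderBy(c => c.Left))
        {
            int last = merged.Count - 1;
            if (last >= 0 && cell.Left - merged[last].Right <= maxGap)
            {
                // Small gap: join the texts with one space and widen the box.
                var previous = merged[last];
                merged[last] = new CellBox(
                    previous.Left,
                    Math.Max(previous.Right, cell.Right),
                    (previous.Text + " " + cell.Text).Trim());
            }
            else
            {
                // Large gap (or first cell): keep it as a separate cell.
                merged.Add(cell);
            }
        }
        return merged;
    }
}
```

A suitable `maxGap` depends on the document; a value slightly larger than the width of a space in the table's font is a reasonable starting point, and it may need tuning per column.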

sathish.sundaresan · March 11, 2024, 5:21am

Hi sergei.shibanov,

Is there any update on my issue? Please let me know as soon as possible.

Thanks
M.S.Sathish

sergei.shibanov · March 11, 2024, 6:49am

@sathish.sundaresan
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56763

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

sathish.sundaresan · March 18, 2024, 11:41am

Hi Sergey.

Is there a way to check whether an absorbed table has a border or not? If so, how can we use that function in our code? Please use the same code that I previously gave.

Thank you
M.S. Sathish.

sergei.shibanov · March 18, 2024, 2:33pm

@sathish.sundaresan
TableAbsorber may use information about existing borders during its analysis, but it does not expose any of that information in its output.

sathish.sundaresan · March 19, 2024, 5:19am

Hi sergei,

I understand, but if you can indicate whether the table has a full border or a partial border, it will help us fine-tune our logic to extract the correct table from the absorber. Please verify and let me know which property tells whether the border is enabled or not.

thanks
M.S.Sathish

sergei.shibanov · March 19, 2024, 9:24am

@sathish.sundaresan
I talked to the development team: there is no such option at the moment.
I’ll create a task for them to provide table border information from AbsorbedCell. They said that if the task is filed for a user with a purchased license, they can implement it fairly quickly for the UseFlowEngine mode.

sergei.shibanov · March 19, 2024, 9:32am

@sathish.sundaresan
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56829

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

sathish.sundaresan · April 4, 2024, 9:24am

Hi,

We have acquired an Aspose.Total license, and I still need a solution to this problem. If you wish, please check my login information for the purchased license.

My Login name: parthiban.veerappan@ant.works

Thanks
M.S.Sathish