Table Extraction from PDF

Hi,

I’d like to buy Aspose.PDF, but I need some clarity first. TableAbsorber isn’t working on my PDF, although Aspose.PDF can convert the same file to Excel perfectly. Please help me figure out what’s wrong so that I can fix it in my code.

thanks
M.S.Sathish

@sathish.sundaresan
Please attach the source document and the code used, with a more detailed explanation of what does not work.

When converting to Excel, if there is a larger gap between two rows, we need either empty rows or the same cell padding in the output so that we can identify the rows correctly. Is there a way to do this, or an alternative, in the Aspose.PDF to Excel conversion?

@sathish.sundaresan

Are you doing a PDF -> Excel conversion? Why, then, are we talking about TableAbsorber?
It would help if you could attach the document and the code snippet you used.

Hi sergei,

Please see the attachments for the documents and the C# code file we used with TableAbsorber. Please help me extract the table from the document.

05107_1 1.pdf (466.7 KB)
33333.pdf (106.7 KB)

ExtractMarkedtable.pdf (6.3 KB)

Thanks
M.S.Sathish
9176398138

@sathish.sundaresan
Thank you, I will study the information provided and write to you tomorrow.

@sathish.sundaresan
In document 05107_1 1.pdf, all symbols are paths (i.e., drawn graphically rather than stored as text). As a result, the file as it stands cannot be converted to Excel and is not processed by TableAbsorber.

To extract tables from document 33333.pdf, I used the following code:

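// Requires: using System; using Aspose.Pdf; using Aspose.Pdf.Text;
// dataDir is the folder that contains the attached documents.
var pdfDocument = new Document(dataDir + "33333.pdf");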
var tableAbsorber = new TableAbsorber();

tableAbsorber.UseFlowEngine = true;

// Visit first page with absorber
tableAbsorber.Visit(pdfDocument.Pages[1]);

foreach (AbsorbedTable table in tableAbsorber.TableList)
{
    Console.WriteLine("_______ Table __________");
    foreach (AbsorbedRow row in table.RowList)
    {
        foreach (AbsorbedCell cell in row.CellList)
        {
            string text = "|";
            foreach (TextFragment textFragment in cell.TextFragments)
            {
                text += textFragment.Text;
            }
            Console.Write(text + '|');
        }
        // End the current row of cells before printing the row separator
        Console.WriteLine();
        Console.WriteLine("---------------------------------------------------");
    }
}

As you can see from the attached screenshot 1.png (6.7 KB), the output is correct.

Thanks, sergei.shibanov. Let me check this solution; I will let you know if there are any issues.

@sathish.sundaresan
Yes, sure

Hi,

Thank you for the solution. It is now working properly, but I have another issue: when we use TableAbsorber, a cell is split in two wherever the data contains a space. Is there a property to handle this space, or to merge cells depending on the gap or the column header value? Please check my attached document and let me know if anything else is needed.
MicrosoftTeams-image (4).png (6.9 KB)
04129_2.pdf (204.6 KB)

Thanks
M.S.Sathish

Hi sergei.shibanov,

Is there any update on my issue? Please let me know as soon as possible.

Thanks
M.S.Sathish

@sathish.sundaresan
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56763

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
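
Until the ticket is resolved, the split-cell problem described above could in principle be handled by post-processing the absorbed cells. The following is only a minimal sketch under stated assumptions, not an Aspose.PDF feature: `CellBox`, `CellMerger`, `MergeByGap`, and `maxGap` are hypothetical names introduced for illustration. It assumes each cell of a row has been reduced to its left and right x-coordinates plus its text (with Aspose.PDF these would presumably be read from the cell's rectangle), and it joins neighbouring cells whose horizontal gap is no larger than a threshold you choose.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one absorbed cell: horizontal extent plus text.
public sealed class CellBox
{
    public double Left { get; }
    public double Right { get; }
    public string Text { get; }

    public CellBox(double left, double right, string text)
    {
        Left = left;
        Right = right;
        Text = text;
    }
}

public static class CellMerger
{
    // Merges neighbouring cells of a single row when the horizontal gap
    // between them is at most maxGap (same units as Left/Right).
    public static List<CellBox> MergeByGap(IEnumerable<CellBox> row, double maxGap)
    {
        var merged = new List<CellBox>();
        foreach (var cell in row.OrderBy(c => c.Left))
        {
            if (merged.Count > 0 && cell.Left - merged[merged.Count - 1].Right <= maxGap)
            {
                // Gap is small enough: extend the previous cell and join the text.
                var last = merged[merged.Count - 1];
                merged[merged.Count - 1] = new CellBox(
                    last.Left,
                    Math.Max(last.Right, cell.Right),
                    last.Text + " " + cell.Text);
            }
            else
            {
                // Gap is a real column boundary: start a new cell.
                merged.Add(cell);
            }
        }
        return merged;
    }
}
```

The threshold is document-specific: it has to be larger than the space between words inside a cell but smaller than the narrowest gap between real columns, so it would need tuning against files such as 04129_2.pdf.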

Hi Sergey.

Is there a way to check whether an absorbed table has a border or not? If so, how can we use it in our code? Please base the example on the same code that I provided previously.

Thank you
M.S. Sathish.

@sathish.sundaresan
TableAbsorber may use information about existing borders during its analysis, but it does not expose any of that information through its API.

Hi sergei,

I understand, but if you could indicate whether the table has a full border or a partial border, it would help us fine-tune our logic for extracting the correct table from the absorber. Please check and let me know which property, if any, tells whether the border is present.

thanks
M.S.Sathish

@sathish.sundaresan
I talked to the development team: there is no such option at the moment.
I’ll create a task for them to expose table border information from AbsorbedCell. According to them, if the task is filed on behalf of a user with a purchased license, they can implement it fairly quickly for the UseFlowEngine mode.

@sathish.sundaresan
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56829

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Hi,

We have purchased an Aspose.Total license, and I still need a solution to this problem. If you wish, please verify the purchased license using my login information.

My Login name: parthiban.veerappan@ant.works

Thanks
M.S.Sathish