Optimizing TableAbsorber Performance and Handling PDF Tables

abhishekrai · March 1, 2024, 6:24am

Hello team,

We’ve been utilizing the TableAbsorber to extract tables from PDF pages. However, we’ve encountered performance issues when dealing with large tables on a single page.
Specifically, the call tableAbsorber.Visit(page) takes approximately 3-4 minutes to extract the table data for a single page.

Additionally, we’ve noticed that the TableAbsorber is unable to extract data from the PDF snippet provided below. Instead, we are getting the data as paragraphs.
Is there a specific limitation regarding table formatting, such as whether tables must be bordered or meet certain other criteria?

Any insights or recommendations would be greatly appreciated!
pdf-table-snip.PNG (198.8 KB)

asad.ali · March 1, 2024, 3:41pm

@abhishekrai

We need to investigate this scenario. Can you please share PDF files and the code snippet? Please share both PDF files where TableAbsorber is taking time to extract tables and where it is unable to extract tables. We will further proceed accordingly.

abhishekrai · March 11, 2024, 10:31am

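Code for TableAbsorber: below is the code snippet we are using to extract tables from PDF documents:

using System.Collections.Generic;
using System.Linq;
using System.Text;
using Aspose.Pdf.Text;
using Newtonsoft.Json.Linq;

// "page" is an Aspose.Pdf.Page taken from the already loaded document.
TableAbsorber tableAbsorber = new TableAbsorber();
tableAbsorber.Visit(page);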

foreach (AbsorbedTable table in tableAbsorber.TableList)
{
    List<JObject> rowList = new();
    int rowNum = 1;

    foreach (AbsorbedRow row in table.RowList)
    {
        List<JObject> cellList = new();
        int cellNum = 1;

        foreach (AbsorbedCell cell in row.CellList)
        {
            TextFragmentCollection textFragmentCollection = cell.TextFragments;
            StringBuilder txt = new();

            foreach (TextFragment fragment in textFragmentCollection)
            {
                foreach (TextSegment seg in fragment.Segments)
                {
                    txt.Append(seg.Text.Trim());
                }
            }

            if (!string.IsNullOrEmpty(txt.ToString()))
            {
                cellList.Add(new()
                {
                    ["Cell" + cellNum] = txt.ToString()
                });
                cellNum++;
            }
        }

        if (cellList.Any())
        {
            rowList.Add(new()
            {
                ["row" + rowNum] = JArray.FromObject(cellList)
            });
            rowNum++;
        }
    }

    if (rowList.Any())
    {
        JObject tableobj = new()
        {
            ["table"] = JArray.FromObject(rowList)
        };
    }
}

Please find attached the file from which we are not able to extract the tables.
pdfwithtables.pdf (190.5 KB)

For the second issue, where TableAbsorber takes a long time to extract tables:

Challenges:

Memory Usage: The PDF file size is approximately 15 MB, and during extraction we encounter System.OutOfMemoryException errors. This is especially problematic when dealing with large tables.
Time-Consuming: Some pages take 2-3 minutes to absorb the tables due to the presence of numerous tables on each page.
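
Since the slow document cannot be shared, per-page timings may still help narrow the problem down. The following is only a minimal sketch, not a confirmed fix: the class name TableTiming, the method LogPerPageTimings, and the pdfPath parameter are hypothetical, and the call to Page.FreeMemory() assumes that method is available in the installed Aspose.PDF for .NET version.

```csharp
using System;
using System.Diagnostics;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TableTiming
{
    // Logs how long TableAbsorber.Visit takes on each page of the PDF at pdfPath.
    public static void LogPerPageTimings(string pdfPath)
    {
        using (Document document = new Document(pdfPath))
        {
            foreach (Page page in document.Pages)
            {
                // A fresh absorber per page, so tables from earlier pages
                // are not kept alive by a shared TableList.
                TableAbsorber absorber = new TableAbsorber();

                Stopwatch stopwatch = Stopwatch.StartNew();
                absorber.Visit(page);
                stopwatch.Stop();

                Console.WriteLine(
                    $"Page {page.Number}: {absorber.TableList.Count} table(s), " +
                    $"{stopwatch.ElapsedMilliseconds} ms");

                // Assumption: Page.FreeMemory() exists in the installed version
                // and releases data cached for this page.
                page.FreeMemory();
            }
        }
    }
}
```

A log like this shows whether the time is concentrated in a few pages or spread evenly, which is information that can be reported without attaching the PDF itself.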

PDF Optimization Attempt:

To mitigate these issues, we applied PDF optimization techniques. While this improved performance, it introduced image compression, which is not compatible with Linux because it relies on System.Drawing.

Request for Assistance:

Unfortunately, we cannot share the specific PDF causing the delay due to compliance restrictions.
However, we seek advice on optimizing memory usage and reducing extraction time for large tables.
Is there an alternative approach or configuration that can help us achieve better results?

asad.ali · March 11, 2024, 2:36pm

@abhishekrai

Can you please also share the sample PDF that we could use to replicate and observe the second issue about time taken by the API?

abhishekrai · March 15, 2024, 6:49am

Due to compliance restrictions, we are not able to share that file. In the meantime, please look into the other issue, for which we have shared the code sample and PDF.

asad.ali · March 15, 2024, 5:23pm

@abhishekrai

We are checking it and will get back to you shortly.

asad.ali · March 15, 2024, 9:35pm

@abhishekrai

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56806

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

asad.ali · July 8, 2024, 10:40pm

@abhishekrai

To extract a table without borders, our alternative engine can be activated using the UseFlowEngine option. Please see the code snippet below:

Text.TableAbsorber absorber = new Text.TableAbsorber();
absorber.UseFlowEngine = true;
absorber.Visit(page);

Also, please note that since the table is a bit messy, our table recognition engine has divided it into several tables. Changes in formatting and large gaps often indicate new tables. For more accurately formatted tables, the results should be better.
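
To see how the page was divided, the detected tables can be listed one by one. This is a rough sketch that combines the UseFlowEngine option mentioned above with the TableList, RowList and CellList properties used earlier in this thread; the class FlowEngineReport and the method PrintTables are hypothetical names chosen only for illustration.

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class FlowEngineReport
{
    // Prints every table the flow engine detects on the page, one row per line.
    public static void PrintTables(Page page)
    {
        TableAbsorber absorber = new TableAbsorber();
        absorber.UseFlowEngine = true; // alternative engine for tables without borders
        absorber.Visit(page);

        int tableIndex = 1;
        foreach (AbsorbedTable table in absorber.TableList)
        {
            Console.WriteLine($"Table {tableIndex}: {table.RowList.Count} row(s)");

            foreach (AbsorbedRow row in table.RowList)
            {
                StringBuilder line = new StringBuilder();
                foreach (AbsorbedCell cell in row.CellList)
                {
                    // Join the text fragments of one cell with single spaces.
                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        line.Append(fragment.Text.Trim()).Append(' ');
                    }
                    line.Append("| "); // visual separator between cells
                }
                Console.WriteLine(line.ToString());
            }
            tableIndex++;
        }
    }
}
```

Comparing this output against the page makes it easier to tell where a formatting change or a large gap caused one visual table to be reported as several.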