Incorrect table detected by TableAbsorber

nnguyen9644 · June 21, 2022, 6:10pm

Hi, I’m trying to extract text and tables separately with ParagraphAbsorber and TableAbsorber from this file: page2TableStart.pdf (81.3 KB). As you can see from the file, the first page doesn’t contain any table. However, TableAbsorber detects a table there, with each line of the first page as a row.

Sample code that reproduces the issue:

            // Requires: using System; using Aspose.Pdf; using Aspose.Pdf.Text;
            Document pdfDocument = new(file);
            foreach (var page in pdfDocument.Pages)
            {
                // extract table
                TableAbsorber tableAbsorber = new();
                tableAbsorber.Visit(page);
                foreach (AbsorbedTable table in tableAbsorber.TableList)
                {
                    foreach (AbsorbedRow row in table.RowList)
                    {
                        foreach (AbsorbedCell cell in row.CellList)
                        {
                            foreach (TextFragment tf in cell.TextFragments)
                            {
                                Console.WriteLine(tf.Text);
                            }
                        }
                    }
                }

                // extract text
                // A fresh absorber per page, so PageMarkups[0] is always the current page
                ParagraphAbsorber paragraphAbsorber = new();
                paragraphAbsorber.Visit(page);
                foreach (MarkupSection section in paragraphAbsorber.PageMarkups[0].Sections)
                {
                    foreach (MarkupParagraph paragraph in section.Paragraphs)
                    {
                        Console.WriteLine(paragraph.Text);
                    }
                }
            }

Running the above code returns 1 table on the first page of the PDF file, with 20 rows corresponding to the 20 lines on that page, which it shouldn’t.

Additionally, since TableAbsorber detected that as a table, is it possible to check whether a table has an invisible border?
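To narrow the problem down, the shape of each detected table can be printed per page. The snippet below is only a minimal sketch: the file path is assumed to be the attachment above, and the single-cell check is a hypothetical heuristic, not an Aspose.PDF feature. Nothing in this thread confirms that the falsely detected table has one cell per row, so the check may need adjusting.

```csharp
using System;
using System.Linq;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Assumed path: the attachment from this post; adjust as needed.
Document pdfDocument = new Document("page2TableStart.pdf");

foreach (Page page in pdfDocument.Pages)
{
    TableAbsorber tableAbsorber = new TableAbsorber();
    tableAbsorber.Visit(page);

    foreach (AbsorbedTable table in tableAbsorber.TableList)
    {
        int rowCount = table.RowList.Count;

        // Widest row in cells; 0 if the table has no rows.
        int maxCells = table.RowList
            .Select(row => row.CellList.Count)
            .DefaultIfEmpty(0)
            .Max();

        // Heuristic (an assumption, not an Aspose API): if no row has more
        // than one cell, the "table" is probably just lines of plain text.
        bool plainTextSuspected = maxCells <= 1;

        Console.WriteLine(
            $"Page {page.Number}: {rowCount} rows, up to {maxCells} cells per row, " +
            $"plain text suspected: {plainTextSuspected}");
    }
}
```

If the table reported for the first page does show one cell per row, such tables could be skipped as a temporary workaround while the detection itself is investigated.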

asad.ali · June 21, 2022, 8:53pm

@nnguyen9644

We need to investigate this case further, and for that purpose an investigation ticket, PDFNET-51974, has been logged in our issue tracking system. We will look into its details and let you know as soon as it is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.