Hi, I’m trying to extract text and tables separately with ParagraphAbsorber and TableAbsorber from this file: page2TableStart.pdf (81.3 KB). As you can see from the file, the first page doesn’t contain any table. However, TableAbsorber detects a table there, treating each line on the first page as a row.
Sample code that reproduces the issue:
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// file is the path to the attached PDF
Document pdfDocument = new(file);
foreach (var page in pdfDocument.Pages)
{
    // extract table
    TableAbsorber tableAbsorber = new();
    tableAbsorber.Visit(page);
    foreach (AbsorbedTable table in tableAbsorber.TableList)
    {
        foreach (AbsorbedRow row in table.RowList)
        {
            foreach (AbsorbedCell cell in row.CellList)
            {
                foreach (TextFragment tf in cell.TextFragments)
                {
                    Console.WriteLine(tf.Text);
                }
            }
        }
    }

    // extract text
    // a new absorber per page, so PageMarkups[0] refers to the current page
    ParagraphAbsorber paragraphAbsorber = new();
    paragraphAbsorber.Visit(page);
    foreach (MarkupSection section in paragraphAbsorber.PageMarkups[0].Sections)
    {
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            Console.WriteLine(paragraph.Text);
        }
    }
}
Running the code above returns 1 table on the first page of the PDF file, with 20 rows corresponding to the 20 lines on that page, which shouldn’t happen.
Additionally, since TableAbsorber detected that as a table: is it possible to check whether a table has an invisible border?