TableAbsorber not detecting some headers

When using the TableAbsorber with the file linked below, not all header cells are detected as being in the table. Is this a bug in the absorber or is there anything specific in that table that would create that issue?

ChildrenJoinLegalNoPropertyWifeFilingParty-Page55.pdf

@lpperras

Please check the extracted text that we obtained in our environment with Aspose.PDF for .NET 24.6, using the code snippet below:

Document pdfDocument = new Document(dataDir + "ChildrenJoinLegalNoPropertyWifeFilingParty-Page55.pdf");
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
var t = textAbsorber.Text;
File.WriteAllText(dataDir + "extractedText.txt", t);

extractedText.zip (806 Bytes)

Can you please check whether any information is missing from it and let us know?

Yes, that looks accurate. My problem is with the TableAbsorber. Why aren’t "One Child", "Two Children", etc. part of the table? They were all parsed as being outside the table with version 24.6.

@lpperras

Would you kindly share the code snippet that you are using to extract the table data?

Of course:

using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

var pdfDocument = new Aspose.Pdf.Document(@"X:\Folder\ChildrenJoinLegalNoPropertyWifeFilingParty-Page55.pdf");

// Create a TableAbsorber object to find tables
TableAbsorber tableAbsorber = new TableAbsorber();

// Visit each page with the table absorber
foreach (var page in pdfDocument.Pages)
{
	tableAbsorber.Visit(page);
}

// Map each page number to the rectangles of the table cells found on that page
var tableAreas = new Dictionary<int, IList<Rectangle>>();

// Collect the rectangle of every cell in every detected table
foreach (var table in tableAbsorber.TableList)
{	
	foreach (var row in table.RowList)
	{
		foreach (var cell in row.CellList)
		{
			if (!tableAreas.ContainsKey(table.PageNum))
			{
				tableAreas.Add(table.PageNum, new List<Rectangle>());
			}
			
			tableAreas[table.PageNum].Add(cell.Rectangle);
		}
	}
}

// Now, use TextFragmentAbsorber for text extraction
var textFragmentAbsorber = new TextFragmentAbsorber();

// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

// Iterate over extracted text fragments
foreach (var textFragment in textFragmentAbsorber.TextFragments)
{
	// Check if the text fragment is outside the table areas
	bool isOutsideTables = true;
	if (tableAreas.ContainsKey(textFragment.Page.Number))
	{
		foreach (var tableArea in tableAreas[textFragment.Page.Number])
		{
			if (tableArea.Contains(new Point(textFragment.Rectangle.LLX, textFragment.Rectangle.LLY)) ||
				tableArea.Contains(new Point(textFragment.Rectangle.URX, textFragment.Rectangle.URY)))
			{
				isOutsideTables = false;
				break;
			}
		}
	}

	Console.WriteLine($"Page: {textFragment.Page.Number}, IsOutsideTable: {isOutsideTables}, Text: {textFragment.Text}");

	if (isOutsideTables)
	{
		// Process text fragment here (it's not part of a table)
	}
}

@lpperras

We are checking it and will get back to you shortly.

Any update on that?

@lpperras

Yes, we have tested the scenario with both the 24.6 and 24.7 versions of the API and were able to reproduce the behavior you described. It looks like it is caused by the structure of the table in the PDF. Could you please tell us whether you are facing this issue only with particular PDF files or with all files?

We have also logged an issue, PDFNET-57820, in our issue management system to investigate this behavior of the API further. The ticket has been linked to this forum thread, so you will receive a notification as soon as it is resolved. Please be patient and allow us some time.

We are sorry for the inconvenience.