I ran into a problem while extracting data from a table in a PDF. The table includes some merged cells, and Aspose.PDF does not read the data properly.
The problem is similar to the one described in this thread:
"Merged Table Can’t Read a Single Table Via Aspose.pdf TableAbsorber" (Aspose Free Support Forum)
01. T1CPHEV CDU单元电路图(20210127)-V2.5-BAT端子更新_5.pdf (116.3 KB)
@msdos41
Please attach a PDF document and a code snippet that show the problem.
@sergei.shibanov
I have uploaded the PDF in my main post, and here is my code.
Thank you for the support!
using System;
using System.IO;
using System.Text;
using Aspose.Pdf.Text;

public void ReadPagesDirectly(string pdf)
{
    // HH (24-hour) and ss are used so that two runs do not produce the same file name.
    string txtPath = Path.Combine(@"C:\Users\xj2ssf\Desktop\EDM_test", string.Format("{0}_{1}.txt", Path.GetFileNameWithoutExtension(pdf), DateTime.Now.ToString("yyyy_MM_dd_HH_mm_ss_fff")));
    // Dispose the document and the writer even if extraction throws.
    using (Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdf))
    using (StreamWriter sw = new StreamWriter(txtPath, true))
    {
        int index = 0;
        // Loop through the pages
        foreach (var page in pdfDocument.Pages)
        {
            index++;
            sw.WriteLine("Page: {0}", index);
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);
            int tableIndex = 0;
            foreach (var table in absorber.TableList)
            {
                tableIndex++;
                sw.WriteLine("======>Table {0}", tableIndex);
                foreach (AbsorbedRow row in table.RowList)
                {
                    // Loop through each cell in the row
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        var sb = new StringBuilder();
                        // Loop through the text fragments and their segments
                        foreach (TextFragment fragment in cell.TextFragments)
                        {
                            foreach (TextSegment seg in fragment.Segments)
                            {
                                // Debug output: font of each segment
                                Console.WriteLine(seg.TextState.Font.FontName);
                                sb.Append(seg.Text);
                            }
                        }
                        // Terminate each cell with ';'
                        sb.Append(";");
                        sw.Write(sb.ToString());
                    }
                    sw.WriteLine();
                }
            }
            sw.WriteLine("======================================================================");
            // Release page resources before moving to the next page
            page.Dispose();
        }
    }
}
@msdos41
Thank you for the submitted data. I will study this issue and write to you.
@msdos41
I added one line to your code:
TableAbsorber absorber = new TableAbsorber();
absorber.UseFlowEngine = true; // <------- This line added.
absorber.Visit(page);
With this change the result is better: the table is recognized as a single table and all of its strings are present. However, some cells are still split incorrectly (circled in red in the attached image With_UseFlowEngine.jpg (643.8 KB)). I will file a task with the development team about this.
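For reference, here is a minimal sketch of how the extraction loop might look with UseFlowEngine enabled. It uses only the Aspose.PDF calls that already appear in this thread. The class name TableDump and the helpers FormatRow, CellText and ReadRows are hypothetical names introduced for illustration, and the sketch assumes the Aspose.PDF for .NET package is referenced. It does not work around the incorrect cell splits mentioned above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TableDump
{
    // Hypothetical helper: terminates every cell text with ';',
    // matching the output format of the code posted earlier.
    public static string FormatRow(IEnumerable<string> cells)
    {
        return string.Concat(cells.Select(c => c + ";"));
    }

    // Hypothetical helper: concatenates the text of every segment in a cell.
    private static string CellText(AbsorbedCell cell)
    {
        var sb = new StringBuilder();
        foreach (TextFragment fragment in cell.TextFragments)
        {
            foreach (TextSegment segment in fragment.Segments)
            {
                sb.Append(segment.Text);
            }
        }
        return sb.ToString();
    }

    // Yields one formatted line per table row, page by page.
    public static IEnumerable<string> ReadRows(string pdfPath)
    {
        using (var document = new Document(pdfPath))
        {
            foreach (Page page in document.Pages)
            {
                // UseFlowEngine = true is the setting suggested in this thread.
                var absorber = new TableAbsorber { UseFlowEngine = true };
                absorber.Visit(page);

                foreach (AbsorbedTable table in absorber.TableList)
                {
                    foreach (AbsorbedRow row in table.RowList)
                    {
                        yield return FormatRow(row.CellList.Select(CellText));
                    }
                }
            }
        }
    }
}
```

A caller could then write the lines wherever needed, for example with File.WriteAllLines(txtPath, TableDump.ReadRows(pdf)). Keeping extraction separate from file output makes it easier to compare the results with and without UseFlowEngine.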
@msdos41
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-54289
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.