Identify merged cells from table in PDF

msdos41 · April 11, 2023, 2:48am

I ran into a problem while extracting data from a table in a PDF. The table includes some merged cells, and Aspose.PDF does not read the data properly.

The problem is similar to the one described in this thread:
Merged Table Can’t Read a Single Table Via Aspose.pdf TableAbsorber - Free Support Forum - aspose.com

01. T1CPHEV CDU单元电路图（20210127）-V2.5-BAT端子更新_5.pdf (116.3 KB)

sergei.shibanov · April 11, 2023, 3:01am

@msdos41
Please attach a PDF document and a code snippet that show the problem.

msdos41 · April 11, 2023, 3:15am

@sergei.shibanov
I have uploaded the PDF in my main post, and here is my code.

Thank you for the support!

// Requires: using System; using System.IO; using System.Text; using Aspose.Pdf.Text;
public void ReadPagesDirectly(string pdf)
{
    // Name the output file after the source PDF plus a timestamp (HH = 24-hour clock).
    string txtPath = Path.Combine(@"C:\Users\xj2ssf\Desktop\EDM_test", string.Format("{0}_{1}.txt", Path.GetFileNameWithoutExtension(pdf), DateTime.Now.ToString("yyyy_MM_dd_HH_mm_ss_fff")));

    // Dispose the document and the writer even if extraction throws.
    using (Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdf))
    using (StreamWriter sw = new StreamWriter(txtPath, true))
    {
        int index = 0;
        // Loop through the pages
        foreach (var page in pdfDocument.Pages)
        {
            index++;
            sw.WriteLine("Page: {0}", index);

            // Detect the tables on this page
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);

            int tableIndex = 0;
            foreach (var table in absorber.TableList)
            {
                tableIndex++;
                sw.WriteLine("======>Table {0}", tableIndex);

                foreach (AbsorbedRow row in table.RowList)
                {
                    // Loop through each cell in the row
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        var sb = new StringBuilder();
                        // Loop through the text fragments and concatenate their segments
                        foreach (TextFragment fragment in cell.TextFragments)
                        {
                            foreach (TextSegment seg in fragment.Segments)
                            {
                                // Debug output: font name of each segment
                                Console.WriteLine(seg.TextState.Font.FontName);
                                sb.Append(seg.Text);
                            }
                        }
                        // Cells are separated by ';', one table row per output line
                        sb.Append(";");
                        sw.Write(sb.ToString());
                    }
                    sw.WriteLine();
                }
            }
            sw.WriteLine("======================================================================");
            page.Dispose();
        }
    }
}

sergei.shibanov · April 11, 2023, 2:26pm

@msdos41
Thank you for the data you submitted. I will look into this issue and get back to you.

sergei.shibanov · April 11, 2023, 4:45pm

@msdos41
After I added the line

TableAbsorber absorber = new TableAbsorber();
absorber.UseFlowEngine = true;   // <------- This line added.
absorber.Visit(page);

the result is better: the table is recognized as a single table and all of its strings are present. However, some cells are split incorrectly (circled in red in the attached image With_UseFlowEngine.jpg (643.8 KB)). I will file a task with the development team about this.

sergei.shibanov · April 11, 2023, 5:00pm

@msdos41
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54289

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.