Aspose.PDF Java table/row/cell extraction issue

brissonp · November 12, 2025, 3:54pm

Hi,
I am using Aspose.PDF for Java version 25.9. I am trying to extract text from a PDF whose tables follow a table-row-cell structure, but the text of some cells does not end up in the proper cell. Here is the code sample:

        for (Page page : doc.getPages()) {
            TableAbsorber tableAbsorber = new TableAbsorber();
            tableAbsorber.visit(page);
            List<AbsorbedTable> tables = tableAbsorber.getTableList();
            for (AbsorbedTable table : tables) {
                StringBuilder tableContent = new StringBuilder();
                tableContent.append("<TABLE>");
                for (AbsorbedRow row : table.getRowList()) {
                    tableContent.append("<TR>");
                    for (AbsorbedCell cell : row.getCellList()) {
                        tableContent.append("<TD>");
                        TextFragmentCollection textCellFragments = cell.getTextFragments();
                        StringBuilder cellText = new StringBuilder();
                        for (TextFragment cellFragment : textCellFragments) {
                            // Collect the cell's fragments in cellText before writing them out
                            cellText.append(cellFragment.getText());
                        }
                        tableContent.append(cellText.toString());
                        tableContent.append("</TD>");
                    }
                    tableContent.append("</TR>\n");
                }
                tableContent.append("</TABLE>\n");    
                extractedContent.append(tableContent);
            }
        }

The output of the above code on the attached PDF is shown below (notice that the leading ‘N’ of a cell ends up at the end of the previous cell):

<TABLE>
<TR>
<TD/>
</TR>
<TR>
<TD>ABCDEF AB ABCDE ABCDEFGHIJK </TD>
<TD>NOMBRE DE LIGNE COMMERCIALE N</TD>
<TD>OMBRE DE LIGNE CELLULAIRE N</TD>
<TD>OMBRE DE MICROS </TD>
<TD>NOMBRE DE CAMÉRAS </TD>
</TR>
</TABLE>

Thanks for looking into this issue.
Aspose_Support_Table_v5.pdf (72.5 KB)

Professionalize.Discourse · November 12, 2025, 3:55pm

@brissonp

Summary
TableAbsorber locates the table in the PDF, but the first character of some cells is split off and appended to the preceding cell. This may be caused by the way the absorber handles merged cells or by the order of the text fragments it returns.

Why it happens

The PDF you provided contains header cells that span multiple columns. In some versions of Aspose.PDF for Java the absorber treats a merged cell as separate cells, which can create empty cells and shift the subsequent text fragments.
cell.getTextFragments() returns the fragments in the order they are stored in the PDF content stream, which is not always the visual (left‑to‑right) order. Concatenating them directly can therefore lose characters or place them in the wrong cell.
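
The ordering point above can be illustrated with a minimal sketch in plain Java. It does not call the Aspose API: `Fragment`, `lineOf`, and `joinInVisualOrder` are hypothetical names, and the coordinates are assumed to be each fragment's lower-left corner in PDF units (y growing upward). In real code those values would have to come from whatever position data your Aspose.PDF version exposes on a text fragment, which should be checked against its API reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class VisualOrderSketch {

    /** Hypothetical stand-in for a text fragment: its text and lower-left corner. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Maps a fragment's y to a line bucket; fragments in the same bucket count as one line. */
    static long lineOf(Fragment f, double lineHeight) {
        return Math.round(f.y / lineHeight);
    }

    /** Concatenates fragments top line first, then left to right, with a space between lines. */
    static String joinInVisualOrder(List<Fragment> fragments, double lineHeight) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // PDF y grows upward, so a larger bucket is higher on the page and must sort first.
        sorted.sort(Comparator
                .comparingLong((Fragment f) -> -lineOf(f, lineHeight))
                .thenComparingDouble(f -> f.x));

        StringBuilder out = new StringBuilder();
        Long previousLine = null;
        for (Fragment f : sorted) {
            long line = lineOf(f, lineHeight);
            if (previousLine != null && previousLine != line) {
                out.append(' '); // line break inside the cell becomes a single space
            }
            out.append(f.text);
            previousLine = line;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in a made-up content-stream order, not reading order.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment("OMBRE DE", 110, 700),
                new Fragment("MICROS", 100, 688),
                new Fragment("N", 100, 700));
        System.out.println(joinInVisualOrder(streamOrder, 12.0));
    }
}
```

With the sample data in `main`, the fragments are joined as "NOMBRE DE MICROS". The bucketing by `lineHeight` is deliberately crude: two fragments on the same visual line can land in different buckets if their y values straddle a rounding boundary, so the tolerance would need tuning for a real document.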

What to do next

Use the high‑level cell‑text API instead of manually iterating the fragments. AbsorbedCell provides a getText() method that internally sorts the fragments by their visual position and concatenates them.

for (AbsorbedCell cell : row.getCellList()) {
    String cellText = cell.getText(); // sorted and concatenated automatically
    tableContent.append("<TD>").append(cellText).append("</TD>");
}

Check the absorber settings. TableAbsorber includes options that influence how tables are extracted (e.g., handling of empty rows/columns and merged cells). Review the API documentation for methods such as setExtractAllRows, setExtractAllColumns, setExtractAllTextFragments, and setExtractMergedCells. Enabling the appropriate flags may improve the layout preservation.
Upgrade to a newer Aspose.PDF for Java version. Later releases (for example, 25.10 and newer) contain fixes and enhancements for TableAbsorber, especially regarding merged‑cell handling and fragment ordering. Updating the library often resolves the issue without code changes.
Validate the table visually (optional). You can render the PDF to HTML or an image to see how the cells are defined and confirm whether any cells are merged:

Document doc = new Document("file.pdf");
doc.save("output.html", SaveFormat.Html);

Test the changes. Run the extraction again after applying the steps above. The expected result is that each <TD> contains the correct text and the stray “N” no longer appears in the wrong column.
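
If a leading character still lands in the previous cell, one possible workaround (an assumption, not a technique confirmed by Aspose) is to ignore the cell assignment returned by the absorber and reassign each fragment of a row to a column by its horizontal position. The sketch below shows the idea in plain Java with hypothetical names (`Fragment`, `bucketByColumn`) and made-up coordinates. In real code, the column edges and fragment positions would have to be read from the absorbed cells and text fragments, using whichever accessors your library version actually provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnBucketSketch {

    /** Hypothetical stand-in for a text fragment: its text and the x of its left edge. */
    static final class Fragment {
        final String text;
        final double x;

        Fragment(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Assigns each fragment of one row to the right-most column whose left edge is
     * at or before the fragment's x. columnLeftEdges must be non-empty and ascending.
     */
    static List<String> bucketByColumn(List<Fragment> rowFragments, double[] columnLeftEdges) {
        if (columnLeftEdges.length == 0) {
            throw new IllegalArgumentException("at least one column edge is required");
        }
        List<Fragment> leftToRight = new ArrayList<>(rowFragments);
        leftToRight.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        StringBuilder[] cells = new StringBuilder[columnLeftEdges.length];
        for (int i = 0; i < cells.length; i++) {
            cells[i] = new StringBuilder();
        }
        for (Fragment f : leftToRight) {
            int column = 0; // fragments left of the first edge fall into the first column
            for (int i = 0; i < columnLeftEdges.length; i++) {
                if (f.x >= columnLeftEdges[i]) {
                    column = i;
                }
            }
            cells[column].append(f.text);
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder cell : cells) {
            result.add(cell.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        double[] edges = {50, 200, 350}; // made-up left edges of three columns
        List<Fragment> row = Arrays.asList(
                new Fragment("ABCDEF ", 52),
                new Fragment("N", 201),
                new Fragment("OMBRE DE LIGNE ", 208),
                new Fragment("N", 351),
                new Fragment("OMBRE DE MICROS", 358));
        System.out.println(bucketByColumn(row, edges));
    }
}
```

This handles only a single-line row with no merged cells. A cell spanning several columns would need its own edge handling, and multi-line cells would need the fragments ordered by line before they are concatenated.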

If the problem persists after trying these suggestions, please share a minimal reproducible PDF sample so the Aspose support team can investigate the specific table structure.

Note: The above recommendations are based on the documented behaviour of TableAbsorber in Aspose.PDF for Java. Specific method names and capabilities should be verified against the version you are using.

I could not fully verify this explanation against the retrieved sources. Please double-check the behaviour in your environment and confirm the exact SDK/version.

brissonp · November 13, 2025, 2:17pm

String cellText = cell.getText(); // sorted and concatenated automatically

This method does not exist on AbsorbedCell.

asad.ali · November 13, 2025, 6:38pm

@brissonp

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-45597

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.