Extract tables with merged cells from PDF using Aspose Java

THAMER.MECHARNIA · September 9, 2019, 2:35pm

Hello,
I tried to extract tables from a PDF file that contains merged cells, but I couldn’t get the correct results. Please find my source code below.

package aspose;

import com.aspose.pdf.*;

public class App 
{
    public static void main( String[] args )
    {
        Document doc = new Document("RIVP000C8E3B.pdf");

        try {
            TableAbsorber absorber = new TableAbsorber();
            for (Page pg : doc.getPages()) {
                // Detect the tables on the current page
                absorber.visit(pg);
                for (AbsorbedTable table : absorber.getTableList()) {
                    for (AbsorbedRow row : table.getRowList()) {
                        for (AbsorbedCell cell : row.getCellList()) {
                            // A cell can hold several text fragments, each made of segments
                            for (TextFragment tf : cell.getTextFragments()) {
                                for (TextSegment ts : tf.getSegments()) {
                                    System.out.println(ts.getText());
                                }
                            }
                        }
                    }
                }
            }
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }
}

Thank you for your help.
Appreciate
Thamer

asad.ali · September 9, 2019, 5:24pm

@THAMER.MECHARNIA

Could you please share your sample PDF document with us? We will test the scenario in our environment and address it accordingly.

THAMER.MECHARNIA · September 9, 2019, 7:54pm

Thank you very much for your reply. Please find attached an example of my file format and the original source code.
I want to extract all the tables in the file (without the text between them, the footer, the heading, or the page number). However, I got unexpected results, which are in the file included in sourceCode.zip.
Thank you again for helping me.
Appreciate.
Test.pdf (225.3 KB)
sourceCode.zip (218.7 KB)

asad.ali · September 9, 2019, 9:48pm

@THAMER.MECHARNIA

Thanks for sharing the requested files.

The API extracts a table from a PDF document in the way it was added at the time of PDF generation. We noticed that the PDF was created using MS Word, and the API was unable to extract text correctly from the table cells: the sequence of the extracted cells and their text was not correct.
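
Until the ticket is resolved, one possible workaround for the wrong cell sequence is to re-sort the extracted cells by their position on the page. The following is only a minimal sketch, not an Aspose API: `ReadingOrder`, `Cell`, and `rowTolerance` are hypothetical names, and it assumes you have already copied each cell's text and the top-left corner of its bounding box into plain values. A merged cell is placed according to its top-left corner, so it appears in the first row it spans.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReadingOrder {

    /** Hypothetical stand-in for an extracted cell: its text and top-left corner. */
    public static final class Cell {
        public final String text;
        public final double left;
        public final double top; // PDF y axis grows upwards: larger top = higher on the page

        public Cell(String text, double left, double top) {
            this.text = text;
            this.left = left;
            this.top = top;
        }
    }

    /**
     * Returns the cells ordered top to bottom, then left to right.
     * Cells whose top edges lie within rowTolerance of the first cell
     * of a row are treated as belonging to that row.
     */
    public static List<Cell> sort(List<Cell> cells, double rowTolerance) {
        // Highest cells first
        List<Cell> byTop = new ArrayList<>(cells);
        byTop.sort(Comparator.comparingDouble((Cell c) -> c.top).reversed());

        List<Cell> ordered = new ArrayList<>();
        List<Cell> row = new ArrayList<>();
        for (Cell c : byTop) {
            // A large enough drop in height starts a new row
            if (!row.isEmpty() && row.get(0).top - c.top > rowTolerance) {
                row.sort(Comparator.comparingDouble((Cell x) -> x.left));
                ordered.addAll(row);
                row.clear();
            }
            row.add(c);
        }
        // Flush the last row
        row.sort(Comparator.comparingDouble((Cell x) -> x.left));
        ordered.addAll(row);
        return ordered;
    }
}
```

Calling `ReadingOrder.sort(cells, 2.0)` on the collected cells of one table and printing `text` in the returned order gives a row-by-row listing. The tolerance value is a guess that would need tuning for the actual document.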

Therefore, we have logged an investigation ticket as PDFJAVA-38850 in our issue tracking system. We will look further into the details of this issue and keep you posted on the status of its resolution. Please be patient and spare us a little time.

We are sorry for the inconvenience.

THAMER.MECHARNIA · September 10, 2019, 8:13am

Thank you very much for your support, I will wait for your new results because I really want to use your API.
Best regards.

asad.ali · September 10, 2019, 9:43pm

@THAMER.MECHARNIA

The issue has just been logged in our issue tracking system and it has low priority. We will investigate it on a first-come, first-served basis and will let you know the investigation result. Please spare us a little time.

tampt · August 12, 2020, 2:15am

I can’t download source code. Can you help me please?
image.png (1.4 KB)

asad.ali · August 12, 2020, 6:58pm

@tampt

You are not the thread owner, which is why you are unable to download the source code. You can download it from here.

tampt · August 17, 2020, 3:01am

thank you very much!!!