Extract table from pdf

ishan.mehta065 · June 1, 2017, 3:36am

Hi team,

I’m trying to extract a table from a PDF file, but it is not working with TableAbsorber. The tables are identified, but the cell data is not returned. Please advise.

mudassir.fayyaz · June 1, 2017, 7:55am

Hi Ishan,

I have reviewed your requirements. Please let us know whether you are using Aspose.Imaging or Aspose.Pdf, and share the source file along with a working sample project so that we can investigate further and help you out.

Many Thanks,

ishan.mehta065 · June 1, 2017, 8:15am

Hi Mudassir,

I’m attaching the sample project and the source file; please find them in the attachment.

I need to extract the table from the attached PDF.

mudassir.fayyaz · June 1, 2017, 12:23pm

Hi Ishan,

Your query is related to Aspose.Pdf, so I am moving it to the Aspose.Pdf forum, where the relevant support team will assist you further.

Many Thanks,

asad.ali · June 2, 2017, 4:43am

Hi Ishan,

Thanks for contacting support.

I have tested the scenario in our environment, using the project you shared together with Aspose.Pdf for .NET 17.5, and was unable to reproduce the issue. The table data and cell values were extracted as expected. Please check the following code snippet and the attached screenshot of the output for your reference.

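// Assumes: using System; using Aspose.Pdf; using Aspose.Pdf.Text;
// dataDir is the folder that contains the input file.
Document pdfDocument = new Document(dataDir + "Max.pdf");

// Collect the tables found on the first page (pages are 1-based).
TableAbsorber absorber = new TableAbsorber();
absorber.Visit(pdfDocument.Pages[1]);

foreach (AbsorbedTable table in absorber.TableList)
{
    foreach (AbsorbedRow row in table.RowList)
    {
        foreach (AbsorbedCell cell in row.CellList)
        {
            // Print every text fragment of the cell, then a cell separator.
            foreach (TextFragment text in cell.TextFragments)
            {
                Console.Write(text.Text + " ");
            }
            Console.Write("|");
        }
        // Marks the end of a row.
        Console.WriteLine("-------------------------------------------");
    }
    // Marks the end of a table.
    Console.WriteLine("===========================================");
}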

If you are using an older version of the API, please update to the latest version, as this is always recommended. If you still face the issue with the latest version, please share more details about your environment (e.g. OS version, application type, target framework version). This will help us investigate the scenario in that specific environment and address it accordingly.

Best Regards,

ishan.mehta065 · June 2, 2017, 5:51am

Hi Asad,

I’m using the same version, but cell.TextFragments.Count is zero.

I tried some more files, but the problem is still the same.

Operating system: Windows 8

Application type: console application

Target framework: .NET Framework 4.5

Thanks and regards,

Ishan

asad.ali · June 2, 2017, 12:47pm

Hi Ishan,

Thanks for writing back and sharing more details.

I have tested the scenario again by running the project you shared earlier, and noticed that you were not setting a license before calling the API methods. This is why the count of TextFragments was zero. Please note that without a license you have very limited access to the content of a PDF document for processing.

When I set the license, the code executed fine and returned the correct count of TextFragments for the first cell of the table. Please check the attached screenshot, and try setting the license before using any method of the API, so that you have full access to all collections of the Aspose.Pdf.Document class. If you still face any issue, please feel free to let us know.
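For illustration, a minimal sketch of applying the license before running TableAbsorber could look like the following. The license file name "Aspose.Pdf.lic" and the input path "Max.pdf" are placeholders and should be adjusted to your own project; they are not taken from the shared sample.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class Program
{
    static void Main()
    {
        // Apply the license first, before any other Aspose.Pdf call.
        // "Aspose.Pdf.lic" is a placeholder path to your license file.
        Aspose.Pdf.License license = new Aspose.Pdf.License();
        license.SetLicense("Aspose.Pdf.lic");

        // "Max.pdf" is a placeholder path to the input document.
        Document pdfDocument = new Document("Max.pdf");

        TableAbsorber absorber = new TableAbsorber();
        absorber.Visit(pdfDocument.Pages[1]);

        // Report the number of text fragments per cell; if this prints
        // zero for cells that visibly contain text, check the license step.
        foreach (AbsorbedTable table in absorber.TableList)
        {
            foreach (AbsorbedRow row in table.RowList)
            {
                foreach (AbsorbedCell cell in row.CellList)
                {
                    Console.WriteLine("Fragments in cell: " + cell.TextFragments.Count);
                }
            }
        }
    }
}
```

The only change compared with the earlier snippet is the SetLicense call at the start of Main; the extraction loop itself stays the same.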

Best Regards,

ishan.mehta065 · June 12, 2017, 11:16pm

Hi Asad,

Thanks for the reply; that problem has been resolved.

I would also like to know whether Aspose.Pdf is unable to read tables from PDF files of version 1.3. I’m attaching the document.

Regards,

Ishan

asad.ali · June 13, 2017, 9:51am

Hi Ishan,

Thanks for your feedback.

It is good to know that your previous issue has been resolved. I have tested the scenario with the PDF document you shared recently and observed that TableAbsorber was not able to extract the table from it. I have therefore logged an issue as PDFNET-42893 in our issue tracking system for investigation. We will look into the details and keep you updated on the status of the fix. Please be patient and allow us a little time.

We are sorry for the inconvenience.

Best Regards,