Table extraction from PDF not working

imrankhanse · February 2, 2024, 10:59am

Here is the code. It is not working and shows me an error.

import aspose.pdf as pdf

#Load PDF file
pdfDocument = pdf.Document("/content/example2.pdf")
#Initialize TableAbsorber object
tableAbsorber =  pdf.text.TableAbsorber()
#Parse all the tables on first page
tableAbsorber.visit(pdfDocument.pages[1])
#Get a reference of the first table
absorbedTable = tableAbsorber.table_list[0]

#Iterate through all the rows in the table
for pdfTableRow in absorbedTable.row_list:
    #Iterate through all the columns in the row
    for pdfTableCell in pdfTableRow.cell_list:
        #Fetch the text fragments
        textFragmentCollection = pdfTableCell.text_fragments
        #Iterate through the text fragments
        for textFragment in textFragmentCollection:
            #Print the text
            print(textFragment.text)

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-179122f711c4> in <cell line: 10>()
      8 tableAbsorber.visit(pdfDocument.pages[1])
      9 #Get a reference of the first table
---> 10 absorbedTable = tableAbsorber.table_list[0]
     11 
     12 #Iterate through all the rows in the table

IndexError: list index out of range

asad.ali · February 2, 2024, 9:34pm

@imrankhanse

Can you please share the sample PDF for our reference as well? We will log an investigation ticket and share the ID with you.

imrankhanse · February 3, 2024, 5:32am

example2.pdf (45.3 KB)

Here is the example. However, the online UI demo works.

Jiyasharma · February 3, 2024, 5:46am

Did you try it again?

imrankhanse · February 3, 2024, 6:06am

No, I last tried one day ago. It shows the error I mentioned above.

asad.ali · February 3, 2024, 10:41am

@imrankhanse

We tested in our environment using the latest version of the API and noticed that the API was not able to extract any table from your PDF document. However, can you please share the link to the online app that you used and that returned the expected results? We need to investigate which code snippet and which API it uses behind the scenes.
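
Since no table is detected, table_list is empty and table_list[0] raises the IndexError shown above. A minimal sketch of a guard follows, assuming table_list behaves like a Python list (which the traceback suggests). The helper names first_table_or_none and table_to_rows are hypothetical, and a stand-in object replaces the real TableAbsorber so the sketch runs without Aspose.PDF installed.

```python
from types import SimpleNamespace


def first_table_or_none(absorber):
    """Return the first absorbed table, or None when no table was detected."""
    tables = absorber.table_list
    # An empty table_list is what makes table_list[0] raise IndexError.
    if len(tables) == 0:
        return None
    return tables[0]


def table_to_rows(table):
    """Collect the cell text as a list of rows, each a list of strings."""
    rows = []
    for row in table.row_list:
        cells = []
        for cell in row.cell_list:
            # A cell can hold several text fragments; join them into one string.
            cells.append("".join(fragment.text for fragment in cell.text_fragments))
        rows.append(cells)
    return rows


# Stand-in for a TableAbsorber that found nothing on the page (assumption:
# only the table_list attribute matters for this check).
empty_absorber = SimpleNamespace(table_list=[])

table = first_table_or_none(empty_absorber)
if table is None:
    print("No tables were detected on this page.")
else:
    print(table_to_rows(table))
```

In the original snippet, first_table_or_none(tableAbsorber) would be called right after tableAbsorber.visit(pdfDocument.pages[1]), in place of the direct table_list[0] access.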

imrankhanse · February 3, 2024, 11:03am

That’s weird. Here is the link.

asad.ali · February 3, 2024, 6:57pm

@imrankhanse

We apologize for the confusion. Please note that the online app does not actually extract tables: it converts the provided PDF document directly into Excel format (CSV/XLSX). If you want to achieve similar output, you can use the code snippet given in the article "Convert PDF to Excel in Python | Aspose.PDF for Python via .NET".
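
For orientation, the conversion approach can be sketched roughly as below. This is not copied from the article: it assumes the ExcelSaveOptions class and the Document.save(path, options) call are the ones the article uses, and the helper names xlsx_path_for and convert_pdf_to_xlsx are hypothetical. Please check the article for the exact code.

```python
from pathlib import Path


def xlsx_path_for(pdf_path):
    """Hypothetical helper: same location and name, with an .xlsx extension."""
    return str(Path(pdf_path).with_suffix(".xlsx"))


def convert_pdf_to_xlsx(pdf_path):
    """Convert a whole PDF to an Excel workbook and return the output path."""
    # Imported here so the path helper stays usable without Aspose.PDF installed.
    import aspose.pdf as pdf

    document = pdf.Document(pdf_path)
    # Assumption: ExcelSaveOptions is the save-options class for Excel output.
    save_options = pdf.ExcelSaveOptions()
    output_path = xlsx_path_for(pdf_path)
    document.save(output_path, save_options)
    return output_path


print(xlsx_path_for("example2.pdf"))
```

With the file from this thread, the call would be convert_pdf_to_xlsx("/content/example2.pdf"). Note that this converts the whole document rather than extracting individual tables.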

imrankhanse · February 4, 2024, 12:06pm

I tried this, but it crashed the Colab session and did not work. I tried all the examples.

asad.ali · February 4, 2024, 5:59pm

@imrankhanse

Do you also see an error message when Colab crashes? If so, can you please share it as well?