PDF to Excel delimiters

seanJohnsonRSI · December 13, 2018, 9:04pm

Using Aspose.PDF for .NET 18.9.1.

I’m currently working on extracting tables out of PDFs. I have had no luck with the TableAbsorber class; for a multitude of reasons, it’s just not working out (combining some columns, missing some columns, etc.). I tried converting the PDF to Excel instead, and that works better for me. The problem is that I’ll now have to parse the Excel file to essentially create a table object in .NET.

The Excel file looks like the results from the TextAbsorber, only you seem to be correctly delimiting the tables. Is there something I can use to get the correct delimiting of the PDF->Excel algorithm, but have it be in an object that I can just read through in code?

Farhan.Raza · December 14, 2018, 7:56am

@seanJohnsonRSI

Thank you for contacting support.

Would you please share a narrowed-down code snippet along with sample documents, and elaborate on your requirements a little more, so that we may investigate the scenario in our environment and help you out? Moreover, please make sure you are using Aspose.PDF for .NET 18.12 before sharing the requested data.

seanJohnsonRSI · December 18, 2018, 4:48pm

.Net 18.9.1 is the highest my license will allow. Maybe you can let me know if running the PDF through your environment produces different results for table extraction.

I would attach the sample PDF and result files, but apparently I’m not authorized to upload them (pdf, txt, and xlsx).

Below are the code snippets for extracting the tables and converting from PDF to Excel.

public void AsposeExtractTables(string fileName)
{
    NCCIReport parsedReport = new NCCIReport();
    License license = new License();
    license.SetLicense(Properties.Settings.Default.AsposePDFFilePath);

    this.PDFDoc = new Document(fileName);
    TableAbsorber tableAbsorber = new TableAbsorber();

    // Run the same absorber over every page of the document
    foreach (Page page in this.PDFDoc.Pages)
    {
        tableAbsorber.Visit(page);
    }

    StringBuilder tablesSB = new StringBuilder();

    // Write each absorbed row as one line of quoted, comma-separated text fragments
    foreach (AbsorbedTable table in tableAbsorber.TableList)
    {
        foreach (AbsorbedRow row in table.RowList)
        {
            foreach (AbsorbedCell cell in row.CellList)
            {
                foreach (TextFragment tf in cell.TextFragments)
                {
                    tablesSB.Append($"\"{tf.Text}\", ");
                }
            }
            tablesSB.AppendLine();
        }
    }

    string result = tablesSB.ToString();
}

public void AsposePDFToExcel(string fileName)
{
    NCCIReport parsedReport = new NCCIReport();
    License license = new License();
    license.SetLicense(Properties.Settings.Default.AsposePDFFilePath);

    this.PDFDoc = new Document(fileName);

    // Instantiate ExcelSaveOptions object
    ExcelSaveOptions excelsave = new ExcelSaveOptions();
    excelsave.MinimizeTheNumberOfWorksheets = true;

    // Save the output in XLS format
    this.PDFDoc.Save("PDFToXLS_out.xls", excelsave);
}

Farhan.Raza · December 18, 2018, 8:47pm

@seanJohnsonRSI

Thank you for sharing the code.

Kindly mention explicitly which columns are combined or missed with TableAbsorber. Moreover, the PDF to Excel conversion algorithm takes a different approach from extracting a table out of a PDF document. Alternatively, you may save the workbook to text or CSV format using Aspose.Cells for .NET and manipulate the data as per your requirements.

Furthermore, you may ZIP the respective files and share them with us for our reference.

seanJohnsonRSI · December 18, 2018, 9:00pm

I was asking about TableAbsorber, not TextAbsorber…

Farhan.Raza · December 18, 2018, 9:04pm

@seanJohnsonRSI

Sorry, we meant TableAbsorber. Our previous response has been updated as well.

seanJohnsonRSI · December 20, 2018, 2:19pm

Considering it won’t let me upload a text file, I guess you’ll just have to take my word for it. An example is that the TableAbsorber thinks there are 48 tables, when there are 8 (seen by a human). It’s not helpful. TableCount.png (12.2 KB)

If I go with the save workbook method, then I have to go PDF->Excel->CSV. That is a lot of overhead considering this functionality needs to be done on the fly. It’s not viable.

I know the TableAbsorber isn’t going to work in its current implementation. I’m really asking if there is an intermediary object or structure (table?) that gets created during your PDF->Excel method that I can use instead of having to parse an Excel document.

Farhan.Raza · December 20, 2018, 9:44pm

@seanJohnsonRSI

You may ZIP a file of any format and share it with us. We understand the problem you are facing, so we would like to request that you share the source PDF document with us so that we may reproduce the issue in our environment and work on resolving it. Moreover, we are afraid that no intermediary object from the PDF to Excel conversion can be accessed as a workaround for this scenario.
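
In the meantime, one possible stopgap is to keep the Excel output in memory rather than writing it to disk, and then read the cells back with Aspose.Cells for .NET. The following is only a minimal sketch under several assumptions: that a separately licensed Aspose.Cells is available, that its Workbook loader can auto-detect the format written by ExcelSaveOptions, and that the class and method names (PdfTableReader, ReadRows) are hypothetical placeholders. It has not been verified against version 18.9.1.

```csharp
using System.Collections.Generic;
using System.IO;
using Cells = Aspose.Cells;   // aliases avoid type-name clashes between the two libraries
using Pdf = Aspose.Pdf;

public static class PdfTableReader   // hypothetical helper, not part of either library
{
    // Converts the PDF to a spreadsheet in memory and returns every
    // worksheet row as a list of cell strings.
    public static List<List<string>> ReadRows(string pdfPath)
    {
        var rows = new List<List<string>>();

        using (var pdfDoc = new Pdf.Document(pdfPath))
        using (var buffer = new MemoryStream())
        {
            var options = new Pdf.ExcelSaveOptions
            {
                MinimizeTheNumberOfWorksheets = true
            };

            // Save to a stream instead of a file, then rewind for reading
            pdfDoc.Save(buffer, options);
            buffer.Position = 0;

            // Assumption: Aspose.Cells detects the spreadsheet format from the stream
            var workbook = new Cells.Workbook(buffer);

            foreach (Cells.Worksheet sheet in workbook.Worksheets)
            {
                Cells.Cells cells = sheet.Cells;

                // MaxDataRow / MaxDataColumn are -1 for an empty sheet,
                // in which case the loops below do not run
                for (int r = 0; r <= cells.MaxDataRow; r++)
                {
                    var row = new List<string>();
                    for (int c = 0; c <= cells.MaxDataColumn; c++)
                    {
                        row.Add(cells[r, c].StringValue);
                    }
                    rows.Add(row);
                }
            }
        }

        return rows;
    }
}
```

This still runs the full PDF to Excel conversion, so it removes only the file I/O and the CSV step, not the conversion cost itself.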

seanJohnsonRSI · December 20, 2018, 9:50pm

188419_Files.zip (85.2 KB)
You will find the sample PDF, the results from TableAbsorber, and the results from PDF to Excel.

Farhan.Raza · December 21, 2018, 8:44am

@seanJohnsonRSI

Thank you for sharing the requested data.

We have been able to reproduce the issue in our environment. A ticket with ID PDFNET-45839 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.