Aspose.PDF text extraction from table - Issue

avvr · October 31, 2018, 12:28pm

Hi,

We are planning to buy an Aspose.PDF license for our project.
We have obtained a temporary license to check whether it fits our requirements.
When extracting text from a table, we get only alternate columns.
Code and details are below:
.NET Core project
TableAbsorber absorber = new TableAbsorber();
absorber.Visit(pdfDocument.Pages[1]);
absorber.TableList --> returns only alternate columns, but we need all the columns

Please suggest a solution.

Farhan.Raza · October 31, 2018, 9:37pm

@avvr

Thank you for contacting support.

Would you please share the source PDF document with us and mention the page number and column names for which the problem occurs? Before sharing the requested data, please ensure that you are using Aspose.PDF for .NET 18.10 in your environment.

avvr · November 2, 2018, 7:54am

Hi Farhan,

Thanks for the response.
I confirm we are using Aspose.PDF for .NET 18.10 in our environment.

For security reasons we cannot share the exact document; however, to replicate the issue we have created the attached PDFs.

Please find the attached documents:
• Sampletable.pdf – plain table
• Sampletablewithstyles.pdf – table with styles (our production documents have styles)

Extraction behavior differs between these two files.

Currently, for our business requirement, we want to extract each word from the PDF with the following properties:
• FontColor
• FontFamily
• FontSize
• Bold
• Italic
• IsTableItem (is the word part of a table)
• ColumnNumber (if IsTableItem is true)
• RowNumber (if IsTableItem is true)
• Left
• Line number
• PageId
• PdfValue(text)
• Right
• Bottom
• Top
• Width
• Height

In the future, we may need to extract additional details, such as whether a word is part of a header or footer.

Please help us in this regard.

Regards,
Vinay Reddy
sampletable.pdf (182.6 KB)
sampletablewithstyles.pdf (334.1 KB)

Farhan.Raza · November 2, 2018, 5:40pm

@avvr

Thank you for sharing the PDF files.

We used the code snippet below and noticed that every row and each of its cells is extracted.

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the sample PDF.
Document pdfDocument = new Document(dataDir + "sampletablewithstyles.pdf");

// Detect tables on the first page (page numbering starts at 1).
TableAbsorber absorber = new TableAbsorber();
absorber.Visit(pdfDocument.Pages[1]);

// Walk every table, row, and cell, and print the text of each fragment.
foreach (AbsorbedTable table in absorber.TableList)
{
    foreach (AbsorbedRow row in table.RowList)
    {
        foreach (AbsorbedCell cell in row.CellList)
        {
            foreach (TextFragment fragment in cell.TextFragments)
            {
                Console.WriteLine(fragment.Text);
            }
        }
    }
}

Likewise, you may extract any of the properties exposed by the TextFragment class by adding the lines below inside the innermost loop of the code snippet above.

TextState state = fragment.TextState;
Console.WriteLine(state.Underline);
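
As a rough sketch only, the requested properties could be collected along the following lines. The row and column numbers here are simply counters over RowList and CellList in the order the absorber returns them, which is an assumption about how you want to number them. Note also that a TextFragment may contain more than one word, that the reported FontStyle may depend on how the font is stored in the PDF, and that the file path and the class name TableWordDump are placeholders.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class TableWordDump
{
    static void Main()
    {
        // Placeholder path: point this at your own document.
        Document pdfDocument = new Document("sampletablewithstyles.pdf");
        Page page = pdfDocument.Pages[1];

        TableAbsorber absorber = new TableAbsorber();
        absorber.Visit(page);

        foreach (AbsorbedTable table in absorber.TableList)
        {
            int rowNumber = 0;
            foreach (AbsorbedRow row in table.RowList)
            {
                rowNumber++;
                int columnNumber = 0;
                foreach (AbsorbedCell cell in row.CellList)
                {
                    columnNumber++;
                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        TextFragmentState state = fragment.TextState;
                        Rectangle rect = fragment.Rectangle;

                        // FontStyles is a flags enum, so test each bit separately.
                        bool bold = (state.FontStyle & FontStyles.Bold) != 0;
                        bool italic = (state.FontStyle & FontStyles.Italic) != 0;

                        Console.WriteLine("PageId={0} Row={1} Column={2} Text={3}",
                            page.Number, rowNumber, columnNumber, fragment.Text);
                        Console.WriteLine("  FontFamily={0} FontSize={1} FontColor={2} Bold={3} Italic={4}",
                            state.Font.FontName, state.FontSize, state.ForegroundColor, bold, italic);
                        // PDF coordinates: LLX/LLY is the lower-left corner, URX/URY the upper-right.
                        Console.WriteLine("  Left={0} Bottom={1} Right={2} Top={3} Width={4} Height={5}",
                            rect.LLX, rect.LLY, rect.URX, rect.URY, rect.Width, rect.Height);
                    }
                }
            }
        }
    }
}
```

Everything reached through absorber.TableList is a table item by construction, so IsTableItem would be true for these fragments. Text outside tables and line numbers are not covered by this sketch.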

We hope this helps. Please elaborate if you have any further queries.

avvr · November 5, 2018, 4:13am

Hi Farhan,

Thanks for the response.

We are getting different CellList counts from the attached PDFs.

Please find the attached screenshots:

tablewithoutstyles.png – CellList count is 8, fetched from sampletable.pdf
tablewithstyles.png – CellList count is 1, fetched from sampletablewithstyles.pdf

sampletable.pdf (182.6 KB)
tablewithstyles.png (82.9 KB)
sampletablewithstyles.pdf (334.1 KB)
tablwwithoutstyles.png (89.6 KB)

Regards,

Vinay Reddy

Farhan.Raza · November 5, 2018, 1:25pm

@avvr

Thank you for elaborating.

We have worked with the data you shared and have been able to reproduce the issue in our environment. A ticket with ID PDFNET-45634 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

avvr · December 3, 2018, 8:22am

Can you please provide an update? We need to evaluate and finalize the license purchase.

Farhan.Raza · December 3, 2018, 7:18pm

@avvr

Thank you for getting back to us.

We are afraid PDFNET-45634 is still pending investigation. We will update you as soon as a significant update is available in this regard. Please be patient and spare us some time.