Extract specific values from PDF with table structure

heylgreg · September 29, 2020, 8:45am

Hi,

I’m using Aspose.PDF for .NET 20.9.0.

I’m working on a POC with Aspose.PDF to extract specific information from a PDF. I have this kind of PDF:
PDF-Exemple.png (26.9 KB)

I would like to know if there is a way to retrieve the values in each row together with the corresponding header title.

I already found a way to extract each line using TextAbsorber. But when I do that, I lose the header that corresponds to each ‘X’ value. That’s my problem.

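using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using Aspose.Pdf.Text;

// Extract the raw text of page 2 (Pages is 1-based), then normalize each line.
var pdfDocument = new Aspose.Pdf.Document(new MemoryStream(Resource1.MyPDF));
var textAbsorber = new TextAbsorber();
pdfDocument.Pages[2].Accept(textAbsorber);
var text = textAbsorber.Text;
var lines = text
    .Replace("\n", "")
    .Split('\r')
    // Collapsing whitespace runs also discards the column alignment of each 'X'.
    .Select(e => Regex.Replace(e, @"\s+", " "))
    .ToArray();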

I also tried TableAbsorber, but I can’t use it in my case because the PDF table structure is messy.

How can I keep the corresponding header title for each “X” value? Is it possible with Aspose.PDF?

Thanks in advance,

asad.ali · September 29, 2020, 5:39pm

@heylgreg

We have checked the image you shared and yes, the table has a complex structure with multiple spanned cells. We need to investigate further whether the data can be extracted in the way you want, and for that purpose we need a sample PDF document. Would you kindly provide it so that we can test the scenario in our environment and address it accordingly?

heylgreg · September 30, 2020, 7:50am

@asad.ali

Thank you for your quick response!

Here is a sample PDF document: File-Example.pdf (156.4 KB)

I hope that this will help you.

Many thanks

asad.ali · September 30, 2020, 7:34pm

@heylgreg

We have logged an investigation ticket as PDFNET-48854 in our issue management system for your specific requirements. We will look further into its details and keep you informed about the status of its resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.
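
Until the ticket is resolved, one possible direction for the original question is to match each “X” to a header by horizontal position instead of by line text. The sketch below is not Aspose’s answer and has not been tried against File-Example.pdf. It only shows the matching step, on made-up coordinates. The names ColumnMatcher and NearestHeader are hypothetical. It assumes you can obtain a bounding box per piece of text, for example from the Rectangle of each TextFragment returned by a TextFragmentAbsorber, and that every “X” sits roughly under the header of its column.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnMatcher
{
    // Returns the header whose horizontal center is closest to the mark's center.
    // Each item is (text, centerX); centerX is assumed to be (LLX + URX) / 2
    // of the text's bounding box in page coordinates.
    public static string NearestHeader(
        IReadOnlyList<(string Text, double CenterX)> headers,
        (string Text, double CenterX) mark)
    {
        return headers
            .OrderBy(h => Math.Abs(h.CenterX - mark.CenterX))
            .First()
            .Text;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up header titles and coordinates, for illustration only.
        var headers = new List<(string Text, double CenterX)>
        {
            ("Title A", 120.0),
            ("Title B", 240.0),
            ("Title C", 360.0),
        };

        // Two 'X' marks found on different rows of the table.
        var marks = new List<(string Text, double CenterX)>
        {
            ("X", 238.5),
            ("X", 121.0),
        };

        foreach (var mark in marks)
        {
            // Prints "X -> Title B", then "X -> Title A".
            Console.WriteLine($"{mark.Text} -> {ColumnMatcher.NearestHeader(headers, mark)}");
        }
    }
}
```

Nearest-center matching is the simplest choice. With spanned header cells, testing whether the mark’s center falls inside the header’s horizontal extent may work better, and the row label could be found the same way using vertical positions.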