How can I extract cells from a table without column borders?

Ragnarokkr.Xia · August 25, 2021, 9:25am

I have a PDF containing a table that has no column borders and whose row borders are dashed lines.
How can I extract the text in its cells correctly?
Here is the sample: Sample.pdf (68.1 KB)

asad.ali · August 25, 2021, 9:04pm

@Ragnarokkr.Xia

Please try the code snippet from the API documentation article below to extract the table cell data. If you run into any issue, please let us know by sharing a screenshot and describing the problem.

Extract Data from Table
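
For readers without the article at hand, the approach it describes can be sketched roughly as follows: visit a page with a TableAbsorber, then walk TableList, RowList and CellList and read each cell's text fragments. This is a minimal sketch, not the article's exact code. It assumes the Aspose.PDF for .NET package is referenced and that the sample is saved locally as "Sample.pdf" (a hypothetical path).

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractTableCells
{
    static void Main()
    {
        // Assumption: the sample PDF sits next to the executable.
        Document document = new Document("Sample.pdf");

        foreach (Page page in document.Pages)
        {
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);

            // TableList is empty when no table is detected on the page.
            foreach (AbsorbedTable table in absorber.TableList)
            {
                foreach (AbsorbedRow row in table.RowList)
                {
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        // Concatenate all text segments found in this cell.
                        StringBuilder cellText = new StringBuilder();
                        foreach (TextFragment fragment in cell.TextFragments)
                        {
                            foreach (TextSegment segment in fragment.Segments)
                            {
                                cellText.Append(segment.Text);
                            }
                        }
                        Console.Write(cellText + "\t");
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}
```

If TableList comes back empty, the loops simply print nothing, so checking absorber.TableList.Count first is a quick way to tell whether a table was detected at all.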

Ragnarokkr.Xia · August 26, 2021, 2:27am

@asad.ali
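I have already tried the code below:

using Aspose.Pdf;
using Aspose.Pdf.Text;

// documentMemoryStream is a stream holding the PDF (Sample.pdf).
Document document = new Document(documentMemoryStream);
Page firstPage = document.Pages[1]; // Pages is 1-based

TableAbsorber tableAbsorber = new TableAbsorber
{
    TextSearchOptions = new TextSearchOptions(false) // false: no regular expressions
    {
        IgnoreShadowText = true,
        LimitToPageBounds = false,
        SearchForTextRelatedGraphics = true,
    }
};
TextAbsorber textAbsorber = new TextAbsorber();

textAbsorber.Visit(firstPage);  // extracts the plain text of the page
tableAbsorber.Visit(firstPage); // populates tableAbsorber.TableList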

The textAbsorber absorbed all the text on the page, while tableAbsorber found no tables, as shown below:
image.png (20.0 KB)

mudassir.fayyaz · August 26, 2021, 6:24pm

@Ragnarokkr.Xia

A ticket with ID PDFNET-50443 has been created in our issue tracking system to investigate the issue further on our end. This thread has been linked to the ticket so that you will be notified once the issue is fixed.

Ragnarokkr.Xia · August 27, 2021, 8:52am

This is related to:
How can I extract text by text block like it’s displayed in Acrobat DC? - Free Support Forum - aspose.com
More samples and details are included in that thread.

mudassir.fayyaz · August 27, 2021, 5:13pm

@Ragnarokkr.Xia

We will take care of your other concerns in that thread.