Truncated text when extracting text from PDF table with TableAbsorber

Gerhunt · September 16, 2020, 9:28am

Hello,
I have an issue when trying to extract text from a PDF document. The text is in the cells of a table.
I used the TableAbsorber to iterate through the tables, rows and cells, and I do get text, but in some cells the retrieved text is truncated.

My code:
Aspose.Pdf.Text.TableAbsorber absorber = new Aspose.Pdf.Text.TableAbsorber();
// Scan the first page for tables
absorber.Visit(pdfDocument.Pages[1]);
foreach (AbsorbedTable table in absorber.TableList)
{
    foreach (AbsorbedRow row in table.RowList)
    {
        foreach (AbsorbedCell cell in row.CellList)
        {
            // Print every text fragment found in the cell
            TextFragmentCollection textFragmentCollection = cell.TextFragments;
            foreach (TextFragment fragment in textFragmentCollection)
            {
                Console.WriteLine(fragment.Text);
            }
        }
    }
}

I have also uploaded the PDF I need to extract text from: list-gis-non-eu-countries-protected-in-eu_en.pdf (1.1 MB)

When extracting, for example, instead of getting the word “Bulqizë” from the first cell of the table, I get “zë”.
And the text in all the cells seems to be shattered into pieces.

Do you have any idea why this happens?

Regards,
Jerome

asad.ali · September 16, 2020, 8:44pm

@Gerhunt

We have tested the scenario in our environment and were able to reproduce a similar issue with Aspose.PDF for .NET 20.9. We have therefore logged it as PDFNET-48782 in our issue management system so that it can be fixed. We will look into the details and keep you informed about the status of the fix. Please be patient and give us some time.

We are sorry for the inconvenience.