Extract text from PDF using Aspose.PDF for .NET - Can't find words from PDF to Text

manel.gracia · October 28, 2020, 11:53am

Hello,
I use:
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(new TextEditOptions(TextEditOptions.NoCharacterAction.UseStandardFont));
info.Document.Pages[page].Accept(textFragmentAbsorber);

The resulting text is as follows:
textFragmentAbsorber.Text ==> " Fecha: 18/01/2012 Longitudinal, 6 Nº 117 Mercabarna"
But in textFragmentAbsorber.TextFragments the date is split across separate fragments:
TextFragments.item(9) == 1
TextFragments.item(10) == 8
TextFragments.item(11) == /01/
TextFragments.item(12) == 20

Why?

asad.ali · October 28, 2020, 9:47pm

@manel.gracia

The API extracts the text in the form in which it was added to the PDF document. In other words, text extraction depends on how the text is stored in the structure of the PDF file. Would you please share your sample PDF document with us so that we can test the scenario in our environment and address it accordingly?
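
To illustrate what "stored in pieces" means in practice, here is a minimal sketch of one possible workaround: joining neighbouring fragments that sit on the same baseline with almost no horizontal gap. It does not use Aspose.PDF directly. The `Piece` class, the `FragmentMerger.Merge` helper, and the `maxGap` and `lineTolerance` thresholds are all hypothetical names introduced for this example. The sketch assumes the fragments arrive in reading order and that each fragment's text and bounding box (for example `TextFragment.Text` and the `LLX`, `URX`, `LLY` values of `TextFragment.Rectangle`) have been copied into `Piece` objects beforehand.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the data copied from each extracted fragment:
// its text, left and right edges, and baseline (all in PDF points).
public sealed class Piece
{
    public string Text { get; }
    public double Left { get; }
    public double Right { get; }
    public double Baseline { get; }

    public Piece(string text, double left, double right, double baseline)
    {
        Text = text;
        Left = left;
        Right = right;
        Baseline = baseline;
    }
}

public static class FragmentMerger
{
    // Joins consecutive pieces that share a baseline and (almost) touch.
    // Assumes the pieces are already in reading order.
    // maxGap and lineTolerance are illustrative defaults, not tuned values.
    public static List<string> Merge(
        IEnumerable<Piece> pieces,
        double maxGap = 1.0,
        double lineTolerance = 2.0)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        Piece previous = null;

        foreach (var piece in pieces)
        {
            // A piece continues the current word when it is on the same line
            // and starts where the previous piece ended.
            bool joins = previous != null
                && Math.Abs(piece.Baseline - previous.Baseline) <= lineTolerance
                && Math.Abs(piece.Left - previous.Right) <= maxGap;

            if (!joins && current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }

            current.Append(piece.Text);
            previous = piece;
        }

        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }

        return words;
    }
}
```

With made-up coordinates, passing the four fragments from the post ("1", "8", "/01/", "20") as touching pieces on one baseline, followed by "Longitudinal," further to the right, would give two entries: "18/01/20" and "Longitudinal,". The thresholds would need adjusting to the font size of the actual document.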

manel.gracia · October 29, 2020, 8:48am

251717_f87165.pdf (499.5 KB)

manel.gracia · October 29, 2020, 8:49am

Thanks a lot

asad.ali · October 29, 2020, 6:25pm

@manel.gracia

We tested the scenario in our environment and, as per our observations, the API was extracting the text as expected. However, we have logged an investigation ticket as PDFNET-48965 in our issue tracking system to investigate further and determine how to make the API extract the text the way you want. We will look into the ticket details and keep you posted on the status of its resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.