TextAbsorber does not find text

Christophe.Percepied · February 9, 2020, 2:16pm

Hi,

I have problems when using TextAbsorber to extract text from a PDF document.

Some parts are not found.

In the provided sample, you’ll see that the document contains two text fields (<Cont.SE[Id:95650|Ali:G|Tai:P|Comm:Élodie FRANCON> and <Cont.SE[Id:94828|Ali:G|Tai:P|Comm:Rémi DAUBAN>) that are visible but not found by TextAbsorber.
Moreover, the text of these fields is not preserved when it is copied to the clipboard.

Other text fields work normally.

Is there any workaround for this?

Best regards,

Paul

Christophe.Percepied · February 9, 2020, 2:17pm

A3144-87760_essai-B.pdf (255.0 KB)

asad.ali · February 9, 2020, 8:09pm

@paul.fresquet

We were able to reproduce the issue while extracting text with Aspose.PDF for .NET 20.1 using the following code snippet:

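using Aspose.Pdf;
using Aspose.Pdf.Text;

// Extract text in Pure formatting mode; dataDir is the folder holding the sample PDF.
TextAbsorber ta = new TextAbsorber();
ta.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
Document pdfDocument = new Document(dataDir + "A3144-87760_essai-B.pdf");
pdfDocument.Pages.Accept(ta);
// The two fields reported above are missing from this string.
string text = ta.Text;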

We have logged the issue as PDFNET-47656 in our issue tracking system for further investigation. We will inform you as soon as it is resolved. Please be patient and spare us a little time.

We are sorry for the inconvenience.