PdfExtractor vs TextAbsorber

O-L-G-A · March 30, 2020, 11:10pm

Hi,

I am looking into extracting text from PDF files. What is the difference between PdfExtractor and TextAbsorber? I used both to extract the text from the same file, and they produce byte arrays of different lengths. What is the actual difference between the two classes, and what is stored in the extra bytes?

PdfExtractor
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(new ByteArrayInputStream(input));
extractor.extractText();
ByteArrayOutputStream out = new ByteArrayOutputStream();
extractor.getText(out);
byte[] text = out.toByteArray();

TextAbsorber
Document pdfDocument = new Document(new ByteArrayInputStream(input));
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
byte[] text = extractedText.getBytes(); // note: uses the platform default charset

Output:
PdfExtractor: text.length = 70
TextAbsorber: text.length = 35

NOTE:
I tried this with several PDF files, and in each case the byte array from PdfExtractor is twice as large as the one from TextAbsorber.

TestDoc.pdf (147.4 KB)

Adnan.Ahmad · March 31, 2020, 11:22am

@O-L-G-A,

Please note that PdfExtractor is a facade that uses TextAbsorber in its implementation. PdfExtractor can also extract images (it uses ImagePlacementAbsorber for that).

O-L-G-A · March 31, 2020, 11:29am

Two questions:

1. Given the simple example I attached, why is the extracted byte array in one case twice as large as in the other? It is simple text with only a few words.
2. Should I use TextAbsorber if I am only interested in text?

Using the attached file, when I strip all whitespace from the extracted text and save it to a DB, I get the following when I query it:
PdfExtractor: O l g a t e s t M e t e s t i n g …
TextAbsorber: OlgatestMetesting…
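
One possible explanation, not confirmed anywhere in this thread, is that the two byte arrays hold the same text in different encodings. Plain ASCII text takes one byte per character in UTF-8 but two bytes per character in UTF-16, and reading UTF-16 bytes as a single-byte charset puts a NUL character after every letter, which can show up as gaps like the ones above. The sketch below only illustrates that effect with the standard Java library; the class name, method names, and sample string are made up for illustration, and it does not call Aspose.PDF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the same text differs in size
// depending on the encoding, using only the Java standard library.
final class EncodingSizeDemo {

    // Number of bytes the text occupies when encoded as UTF-8.
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as UTF-16LE (no BOM).
    public static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // Encodes as UTF-16LE, then (wrongly) decodes as a single-byte charset.
    // For ASCII text, every letter ends up followed by a NUL character.
    public static String misreadUtf16AsLatin1(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sample = "Olga test"; // made-up sample, 9 ASCII characters
        System.out.println(utf8Length(sample));  // 9
        System.out.println(utf16Length(sample)); // 18
        // Replace NULs with spaces so the interleaving is visible on screen.
        System.out.println(misreadUtf16AsLatin1(sample).replace('\0', ' '));
    }
}
```

If this is what is happening, decoding the PdfExtractor bytes with the matching charset before stripping whitespace should give the same string as TextAbsorber. That is an assumption to check against the actual bytes.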

Adnan.Ahmad · April 1, 2020, 4:20am

@O-L-G-A,

Thanks for sharing further details.

We have logged an investigation ticket as PDFNET-47918 in our issue tracking system. We will look into the details and keep you posted on the status of its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

O-L-G-A · April 7, 2020, 1:44pm

PdfExtractor allows setting the encoding via extractText(Charset encoding).

How can I achieve the same with TextAbsorber? I want to be able to specify the encoding when extracting with TextAbsorber.

Adnan.Ahmad · April 8, 2020, 9:37am

@O-L-G-A,

We have logged an investigation ticket as PDFNET-47953 in our issue tracking system. We will investigate whether this is possible. Please be patient and spare us some time.

Adnan.Ahmad · May 5, 2020, 9:31pm

@O-L-G-A,

Please note that TextAbsorber has no special option for producing output text in a specified encoding. However, the System.Text.Encoding class provides tools for simple conversion between encodings.

Please consider the following code:

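using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;
// dataDir is assumed to hold the path of the folder that contains the PDF.
Encoding encoding1 = new UnicodeEncoding(); // UTF-16, little-endian
Encoding encoding2 = new ASCIIEncoding();   // non-ASCII characters become '?'
Document document = new Document(dataDir + "Lorem_ipsum.pdf");
TextAbsorber absorber = new TextAbsorber();
document.Pages.Accept(absorber);
File.WriteAllBytes(dataDir + "Lorem_ipsum_Unicode.txt", encoding1.GetBytes(absorber.Text));
File.WriteAllBytes(dataDir + "Lorem_ipsum_ASCII.txt", encoding2.GetBytes(absorber.Text));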
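
The snippet above is C#, while the original question used the Java API. A rough Java counterpart of the same idea, assuming the text has already been obtained as a String (for example from textAbsorber.getText() as in the first post), could look like the sketch below. It relies only on String.getBytes(Charset) from the standard library; the class name, helper method, file names, and placeholder text are hypothetical and not part of Aspose.PDF.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: converts already-extracted text to bytes in a chosen charset.
final class TextEncodingSketch {

    // Encodes the text with the given charset. Characters the charset cannot
    // represent are replaced with its default replacement byte ('?' for US-ASCII).
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder standing in for the String returned by textAbsorber.getText().
        String extractedText = "Lorem ipsum \u00e9";

        // Write into a fresh temporary directory so the sketch runs anywhere.
        Path dir = Files.createTempDirectory("extracted");
        Files.write(dir.resolve("text_utf16le.txt"),
                encode(extractedText, StandardCharsets.UTF_16LE));
        Files.write(dir.resolve("text_ascii.txt"),
                encode(extractedText, StandardCharsets.US_ASCII));
        System.out.println("Wrote files to " + dir);
    }
}
```

Passing the charset explicitly also avoids depending on the platform default, which is what the no-argument getBytes() uses.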