I am looking into extracting text from PDF files. What is the difference between PdfExtractor and TextAbsorber? I used both to extract the text from the same file, and they produce byte arrays of different lengths. What is the actual difference between the two classes/methods, and what is stored in the extra bytes?
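PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(new ByteArrayInputStream(input)); // input: byte[] holding the PDF
extractor.extractText();
ByteArrayOutputStream out = new ByteArrayOutputStream();
extractor.getText(out); // write the extracted text into the stream
byte[] text = out.toByteArray();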
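Document pdfDocument = new Document(new ByteArrayInputStream(input));
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.getPages().accept(textAbsorber); // run the absorber over all pages
String extractedText = textAbsorber.getText();
byte[] text = extractedText.getBytes(); // encodes with the platform default charset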
PdfExtractor: text.length = 70
TextAbsorber: text.length = 35
I tried this with several PDF files, and in each case the byte array from PdfExtractor is twice as large as the one from TextAbsorber.
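An exact 2:1 ratio is what one would see if the same text were encoded once with a two-byte-per-character charset (such as UTF-16) and once with a single-byte-per-character charset. Whether PdfExtractor actually writes UTF-16 is an assumption here, not something confirmed in this post. The sketch below uses only the Java standard library and a hypothetical 35-character sample string to show how such a difference could arise:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {

    // Returns how many bytes the text occupies when encoded with the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Hypothetical 35-character ASCII sample standing in for the extracted text.
        String sample = "The quick brown fox jumps over it!!";

        // One byte per ASCII character.
        System.out.println("UTF-8:    " + byteLength(sample, StandardCharsets.UTF_8));

        // Two bytes per character, no byte-order mark.
        System.out.println("UTF-16LE: " + byteLength(sample, StandardCharsets.UTF_16LE));
    }
}
```

If this is the cause, decoding the PdfExtractor bytes as UTF-16 and comparing the resulting String with the TextAbsorber result should show whether the two outputs contain the same text.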
TestDoc.pdf (147.4 KB)