All the data is not getting extracted

Shivam999 · August 8, 2019, 12:39pm

Hi Team,

I need to extract all the text from the attached PDF file; however, the data from the table is missing in the extracted file.
I have tried various examples from your GitHub repository, but unfortunately the result was the same.

import java.io.FileWriter;
import java.io.IOException;
import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;

public void extractTextBasedOnColumns() throws IOException {
    // String path = "PathToDir";
    // Instantiate a Document with the path of the input file
    Document pdfDocument = new Document("Do.pdf");
    // Collect every text fragment in the document
    TextFragmentAbsorber tfa = new TextFragmentAbsorber();
    pdfDocument.getPages().accept(tfa);
    TextFragmentCollection tfc = tfa.getTextFragments();
    for (TextFragment tf : (Iterable<TextFragment>) tfc) {
        // Reduce the font size to 70% of its original value
        tf.getTextState().setFontSize(tf.getTextState().getFontSize() * 0.7f);
    }
    // Temporarily save the modified file, then reload it
    pdfDocument.save("TempOutput.pdf");
    pdfDocument = new Document("TempOutput.pdf");
    // Extract the text from the reloaded document
    TextAbsorber textAbsorber = new TextAbsorber();
    pdfDocument.getPages().accept(textAbsorber);
    String extractedText = textAbsorber.getText();
    // Write the extracted text to a file; the writer is closed automatically
    try (FileWriter writer = new FileWriter("C:\\Newfolder\\Extracted_text.txt")) {
        writer.write(extractedText);
    }
}

Thanks,
Shivam

Do.pdf (409.9 KB)

Farhan.Raza · August 8, 2019, 5:14pm

@Shivam999

Thank you for contacting support.

We have investigated the file you shared and found that the API extracts all the text that can be selected in the PDF document. Adobe Acrobat likewise does not extract text from other content, such as comments. We have attached the generated files for your reference.
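
If it helps to confirm exactly which values are absent from the extracted text, the comparison can be scripted. The following is only a minimal sketch in plain Java and is not part of the Aspose.PDF API: the class name `ExtractionCheck`, the method `findMissing`, and the sample strings are all hypothetical, and in practice the extracted string would be the contents of your output file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractionCheck {

    // Collapse runs of whitespace so that line breaks inside the
    // extracted text do not hide a match.
    public static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Return the expected values that do not occur in the extracted text.
    public static List<String> findMissing(String extractedText, List<String> expectedValues) {
        String haystack = normalize(extractedText);
        List<String> missing = new ArrayList<>();
        for (String value : expectedValues) {
            if (!haystack.contains(normalize(value))) {
                missing.add(value);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the extracted file contents.
        String extracted = "Invoice 1001\nTotal\n  250.00";
        List<String> expected = Arrays.asList("Invoice 1001", "Total 250.00", "Due date");
        System.out.println(findMissing(extracted, expected)); // prints [Due date]
    }
}
```

Any value reported as missing that also cannot be selected in a PDF viewer is most likely not stored as text content in the document.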

To test the API at its full capacity, please make sure you are using it with a valid license, or apply for a free 30-day temporary license if you do not own one.

ExtractedText.zip