Hello!
I am trying to evaluate Aspose.PDF for Java.
How can I get text from a PDF as paragraphs?
Hi Jukka,
Thanks for your inquiry. You can extract text from a PDF document and preserve its formatting by using the “Pure” TextFormattingMode. Please check the following sample code. You can also extract text from a specified page region.
// open document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(myDir + "input.pdf");
// create TextAbsorber object to extract text
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
// set the text extraction mode (Raw: no formatting, Pure: preserve formatting)
textAbsorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);
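// accept the absorber for a single page (page 29 here; page numbers are 1-based)
// to process all pages instead, call pdfDocument.getPages().accept(textAbsorber)
pdfDocument.getPages().get_Item(29).accept(textAbsorber);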
// get the extracted text
String extractedText = textAbsorber.getText();
// create a writer and open the output file
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File(myDir + "Extracted_text_java.txt"));
// write the extracted text to the file
writer.write(extractedText);
// close the stream
writer.close();
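As a side note on the file-writing step above: FileWriter uses the platform default encoding, and the stream stays open if write() throws. A minimal sketch of one alternative, using only the standard library, is shown here. The class name ExtractedTextWriter, the method writeUtf8 and the sample string are hypothetical and only stand in for the text returned by textAbsorber.getText().

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Aspose.PDF.
final class ExtractedTextWriter {

    // Writes the text to the target file as UTF-8, replacing any existing content.
    // try-with-resources closes the writer even if write() throws.
    static void writeUtf8(Path target, String text) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the string returned by textAbsorber.getText().
        String extractedText = "First line\r\nSecond line";
        Path out = Files.createTempFile("Extracted_text_java", ".txt");
        writeUtf8(out, extractedText);
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

Fixing the encoding to UTF-8 is a design choice for this sketch; if the consumers of the text file expect another charset, pass that one instead.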
Please feel free to contact us for any further assistance.
Best Regards,
We would like to share with you that Aspose.PDF for Java now supports extracting text from a PDF document paragraph by paragraph. Please consider using the following code snippet:
// the classes below are in the com.aspose.pdf package
Document doc = new Document(dataDir + "temp--(3d-pdf).pdf");
// extract the whole document text first, for comparison
TextAbsorber ta = new TextAbsorber();
ta.visit(doc);
System.out.println(ta.getText());
// extract the text paragraph by paragraph
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            // each line of the paragraph is a list of text fragments
            for (java.util.List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                sb.append("\r\n");
            }
            // blank line between paragraphs
            sb.append("\r\n");
            System.out.println(sb);
        }
    }
}
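The innermost two loops of the snippet above (fragments into lines, lines into a paragraph) can be tried out without the library. The following is only a sketch under stated assumptions: plain strings stand in for the values of TextFragment.getText(), the nested list mirrors the shape returned by mp.getLines(), and ParagraphJoiner and joinParagraph are hypothetical names. Unlike the snippet above, it puts the line break between lines only, so the paragraph has no trailing blank lines.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF.
final class ParagraphJoiner {

    // Concatenates the fragments of each line, then joins the lines with CRLF.
    static String joinParagraph(List<List<String>> lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            if (i > 0) {
                sb.append("\r\n"); // separator between lines, none after the last one
            }
            for (String fragment : lines.get(i)) {
                sb.append(fragment);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two lines; the first one is split into two fragments.
        List<List<String>> paragraph = Arrays.asList(
                Arrays.asList("Aspose.PDF ", "for Java"),
                Arrays.asList("extracts paragraphs."));
        System.out.println(joinParagraph(paragraph));
    }
}
```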