Hello!
I am trying to evaluate Aspose.PDF for Java.
How can I get text from a PDF as paragraphs?
Hi Jukka,
// open document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(myDir+ "input.pdf");
// create TextAbsorber object to extract text
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
// set text extraction options: the formatting mode is either Raw (no formatting) or Pure (preserve formatting)
textAbsorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);
// accept the absorber for a single page (page 29 here);
// to process all pages, call pdfDocument.getPages().accept(textAbsorber) instead
pdfDocument.getPages().get_Item(29).accept(textAbsorber);
// get the extracted text
String extractedText = textAbsorber.getText();
// create a writer and open the file
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File(
        myDir + "Extracted_text_java.txt"));
// write the extracted text to the file
writer.write(extractedText);
// close the stream
writer.close();
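If the write throws, the writer above is never closed. A minimal sketch of the same file-writing step using try-with-resources, which closes the writer in either case. The class `TextSaver` and its `save`/`load` methods are hypothetical names chosen for illustration and are not part of Aspose.PDF. UTF-8 and the temp-directory path are assumptions too.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextSaver {
    // Hypothetical helper: writes the text as UTF-8 (an assumption);
    // try-with-resources closes the writer even if write() fails.
    public static void save(Path target, String text) {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: reads the whole file back as a UTF-8 string.
    public static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The temp directory stands in for myDir from the snippet above.
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "Extracted_text_java.txt");
        save(target, "sample extracted text");
        System.out.println(load(target));
    }
}
```

In the extraction code, `extractedText` would be passed as the second argument to `save`.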
Please feel free to contact us for any further assistance.
Best Regards,
We would like to share that Aspose.PDF for Java now supports extracting text from a PDF document paragraph by paragraph. Please consider using the following code snippet:
Document doc = new Document(dataDir + "temp--(3d-pdf).pdf");
TextAbsorber ta = new TextAbsorber();
ta.visit(doc);
System.out.println(ta.getText());
// collect the paragraph structure of the whole document
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            // each paragraph is a list of lines; each line is a list of text fragments
            for (java.util.List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                // end of line
                sb.append("\r\n");
            }
            // blank line after the paragraph
            sb.append("\r\n");
            System.out.println(sb);
        }
    }
}
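The line-joining step inside the innermost loops can be tried without a PDF at hand. This is a minimal sketch, assuming a paragraph is already available as a list of lines, each line being a list of plain strings that stand in for `TextFragment.getText()`. The class `ParagraphJoiner` and the method `joinParagraph` are hypothetical names, not part of Aspose.PDF. The sketch appends a Windows line break (`\r\n`) after each line and one more after the paragraph.

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphJoiner {
    // Hypothetical helper: concatenates the fragments of each line,
    // ends every line with "\r\n", and adds one extra "\r\n" after the paragraph.
    public static String joinParagraph(List<List<String>> lines) {
        StringBuilder sb = new StringBuilder();
        for (List<String> line : lines) {
            for (String fragment : line) {
                sb.append(fragment);
            }
            sb.append("\r\n");
        }
        sb.append("\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two lines; the first is split into two fragments.
        List<List<String>> paragraph = Arrays.asList(
                Arrays.asList("Hello, ", "world"),
                Arrays.asList("second line"));
        System.out.print(joinParagraph(paragraph));
    }
}
```

Note that the escape sequence uses backslashes: a string written as `"/r/n"` is four literal characters and does not produce a line break.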