Get text from a PDF as paragraphs using Aspose.PDF for Java

Hello!

I am evaluating Aspose.PDF for Java.
How can I get the text from a PDF as paragraphs?

Hi Jukka,


Thanks for your inquiry. You can extract text from a PDF document and preserve its formatting by using the “Pure” TextFormattingMode. Please check the following sample code. You can also extract text from a specified page region.

// open document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(myDir + "input.pdf");

// create TextAbsorber object to extract text
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

// set text extraction options: formatting mode is Raw (no formatting) or Pure (preserve formatting)
textAbsorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Pure);

// accept the absorber for a single page (page 29 here);
// to process all pages, call pdfDocument.getPages().accept(textAbsorber) instead
pdfDocument.getPages().get_Item(29).accept(textAbsorber);

// get the extracted text
String extractedText = textAbsorber.getText();

// create a writer and open the output file
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File(myDir + "Extracted_text_java.txt"));

// write the extracted text to the file
writer.write(extractedText);

// close the stream
writer.close();
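
If an exception is thrown between opening and closing the writer, the sample above leaves the file handle open. A minimal sketch of the same write step using try-with-resources and an explicit UTF-8 charset is shown here; the class name ExtractedTextWriter and its write method are hypothetical helpers, not part of Aspose.PDF, and the choice of UTF-8 is an assumption.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not an Aspose.PDF class): saves extracted text to a file.
final class ExtractedTextWriter {
    private ExtractedTextWriter() {}

    /** Writes text to target as UTF-8, creating the file or replacing its contents. */
    static void write(Path target, String text) throws IOException {
        // try-with-resources closes the writer even if write() throws
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}
```

With this helper, the last three statements of the sample could be replaced by a single call such as ExtractedTextWriter.write(java.nio.file.Paths.get(myDir, "Extracted_text_java.txt"), extractedText).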


Please feel free to contact us for any further assistance.


Best Regards,

@JukkaTervaskanto

We would like to share that Aspose.PDF for Java now supports extracting text from a PDF document paragraph by paragraph. Please consider using the following code snippet:

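// uses classes from the com.aspose.pdf package (for example: import com.aspose.pdf.*;)
Document doc = new Document(dataDir + "temp--(3d-pdf).pdf");

// extract and print the text of the whole document
TextAbsorber ta = new TextAbsorber();
ta.visit(doc);
System.out.println(ta.getText());

// collect the paragraph markup of every page
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            // each line of a paragraph is a list of text fragments
            for (java.util.List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                // end of line
                sb.append("\r\n");
            }
            // blank line after each paragraph
            sb.append("\r\n");
            System.out.println(sb);
        }
    }
}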
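
The innermost loops above concatenate the fragments of each line and end every line with a separator. As a rough sketch, that step can be isolated into a small helper that is easy to check on its own. Plain strings stand in for the text of each TextFragment so the example runs without the Aspose.PDF library; the class name ParagraphText and the method join are hypothetical.

```java
import java.util.List;

// Hypothetical helper (not an Aspose.PDF class): builds the text of one paragraph.
final class ParagraphText {
    private ParagraphText() {}

    /**
     * Concatenates the fragments of each line and appends lineSeparator
     * after every line, including the last one.
     */
    static String join(List<List<String>> lines, String lineSeparator) {
        StringBuilder sb = new StringBuilder();
        for (List<String> line : lines) {
            for (String fragment : line) {
                sb.append(fragment);
            }
            sb.append(lineSeparator);
        }
        return sb.toString();
    }
}
```

Passing System.lineSeparator() as the second argument gives platform-appropriate line endings, while "\r\n" forces Windows-style endings regardless of platform.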