How to extract only the body text from a document, without headers?

mariuszpala · August 1, 2014, 2:55am

Hi,

I’m currently using the following code to get the text from the document. Unfortunately, it also returns the headers. How can I get only the body text?

Document doc = new Document(fileName);

String text = doc.getText();

Thanks in advance for any help,

Mariusz

mariuszpala · August 1, 2014, 3:10am

Found the answer:

doc.getFirstSection().getBody().getText();

awais.hafeez · August 3, 2014, 12:08pm

Hi Mariusz,

Thanks for your inquiry. Please use the following code to achieve this:

String text = doc.getFirstSection().getBody().toString(SaveFormat.TEXT);

I hope this helps.

Best regards,

mariuszpala · August 8, 2014, 2:32pm

Thank you. What’s the difference between getText() and toString(SaveFormat.TEXT)?

awais.hafeez · August 10, 2014, 10:18pm

Hi Mariusz,

Thanks for your inquiry.

The CompositeNode.GetText method gets the text of the node and of all its children. The returned string includes all control and special characters, as described in the ControlChar class. ToString, by contrast, exports the node to the specified format. The following code shows the difference between calling the GetText and ToString methods on a node.

Document doc = new Document();

// Insert a dummy field into the document.
DocumentBuilder builder = new DocumentBuilder(doc);
builder.insertField("MERGEFIELD Field");

// getText() retrieves all field codes and special characters.
System.out.println("GetText() Result: " + doc.getText());

// toString() exports the node to the specified format. When converted to text, it does not
// include field codes or special characters, but it still contains some natural formatting
// characters such as paragraph markers.
// This is the same as "viewing" the document as if it were opened in a text editor.
System.out.println("ToString() Result: " + doc.toString(SaveFormat.TEXT));
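
To make the extra characters in the getText() output more concrete, here is a minimal plain-Java sketch that does not call Aspose.Words at all. The sample string is only an assumption about the kind of output getText() may produce for the merge field above. The class name FieldCodeStripper and the method stripFieldCodes are hypothetical. The three delimiter values are assumed to match the field characters listed in ControlChar, so in real code it would be safer to use those constants than to hard-code them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FieldCodeStripper {
    // Assumed values of the field start, separator and end characters (see ControlChar).
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /**
     * Drops field codes and field delimiters from a raw text string and keeps field results.
     * A field is assumed to look like: START code [SEPARATOR result] END, possibly nested.
     */
    public static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are in its code part, false in its result part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int codeDepth = 0; // number of enclosing fields whose code part we are currently inside

        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                openFields.push(true);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from its code part to its result part.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(false);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; it may have had no separator at all.
                if (!openFields.isEmpty() && openFields.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                out.append(c); // ordinary text or field result
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical raw text for a single merge field followed by a form feed.
        String raw = "\u0013MERGEFIELD Field\u0014\u00ABField\u00BB\u0015\f";
        System.out.println("Raw length:      " + raw.length());
        System.out.println("Stripped length: " + stripFieldCodes(raw).length());
        System.out.println("Stripped text:   " + stripFieldCodes(raw).trim());
    }
}
```

Other control characters, such as the trailing form feed in the sample, are left untouched by this sketch. If only clean text is needed, toString(SaveFormat.TEXT) as shown above is the simpler route; the sketch is only meant to make the difference visible.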

I hope this helps.

Best regards,

mariuszpala · August 13, 2014, 7:13am

Great, thank you for the clarification!

Regards,

Mariusz