How to get node text only


#1

In Aspose.Words for Java, is there a method to get the text of a node without any of the control characters? I just want the actual text that the reader of the document sees. Node.getText() seems to return all of the control characters around the actual text as well.


#2

Please check the following article in our wiki on the topic:

http://www.aspose.com/Wiki/default.aspx/Aspose.Words/ExtractingTextOnly.html

It does not include a Java code example, however. We will put together sample code for Java and post it shortly.

Best regards,


#3

Hi, dngan,

SaveFormat.Text is not public in Java at the moment, so you are better off implementing a DocumentVisitor. Here is a code snippet showing how to do this:

import com.aspose.words.*;
import org.junit.Test; // assumes JUnit 4; swap in your own test framework's @Test

import java.io.FileOutputStream;
import java.io.OutputStream;

public class TestTxtWriter
{
    @Test
    public void TxtWriterTest() throws Exception
    {
        TxtWriter txtWriter = new TxtWriter();
        Document doc = new Document("X:\\Aspose\\forum\\WordCount_Testing.doc");

        // Save to a txt file. try-with-resources closes the stream even if save() throws.
        try (OutputStream stream1 = new FileOutputStream("X:\\Aspose\\forum\\out1.txt"))
        {
            txtWriter.save(doc, stream1);
        }

        // Extract the text into a string.
        String text = txtWriter.getPlainText(doc);

        // Save the string to a file (getBytes() uses the platform default charset).
        try (OutputStream stream2 = new FileOutputStream("X:\\Aspose\\forum\\out2.txt"))
        {
            stream2.write(text.getBytes());
        }
    }
}

/**
 * Responsible for saving a document in text format.
 */
class TxtWriter extends DocumentVisitor
{
    TxtWriter()
    {
    }

    /**
     * Saves the document in plain text format.
     */
    public void save(Document document, OutputStream stream) throws Exception
    {
        String text = getPlainText(document);
        // getBytes() encodes with the platform default charset.
        stream.write(text.getBytes());

        // Not closing the stream here, as that is the client's responsibility.
        stream.flush();
    }

    /**
     * Gets the plain text of the node (a whole document or any node inside it).
     */
    public String getPlainText(Node node) throws Exception
    {
        // Reset the state so the same writer can be reused for several nodes.
        mIsSkipText = false;
        mBuilder = new StringBuilder();

        // Extract text from the node.
        node.accept(this);

        // Remove the remaining control characters.
        String text = mBuilder.toString();
        text = text.replace(ControlChar.LINE_BREAK, ControlChar.CR_LF);
        text = text.replace(ControlChar.ANNOTATION_REF, "");
        text = text.replace(ControlChar.FOOTNOTE_REF, "");
        text = text.replace(ControlChar.DRAWN_OBJECT, "");

        return text;
    }

    public int visitRun(Run run)
    {
        appendText(run.getText());
        return VisitorAction.CONTINUE;
    }

    // Text between a field start and its separator is the field code, so skip it.
    // Note: a single flag does not keep track of nested fields.
    public int visitFieldStart(FieldStart fieldStart)
    {
        mIsSkipText = true;
        return VisitorAction.CONTINUE;
    }

    // Text after the separator is the field result, which the reader sees.
    public int visitFieldSeparator(FieldSeparator fieldSeparator)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    // Also resume output at the field end, in case the field had no separator.
    public int visitFieldEnd(FieldEnd fieldEnd)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    // End every paragraph with a line break in the output.
    public int visitParagraphEnd(Paragraph paragraph)
    {
        appendText(ControlChar.CR_LF);
        return VisitorAction.CONTINUE;
    }

    private void appendText(String text)
    {
        if (!mIsSkipText)
            mBuilder.append(text);
    }

    private StringBuilder mBuilder;
    private boolean mIsSkipText;
}
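
If you would like to try the field-skipping idea without the library, here is a rough sketch that works on a raw string instead of a node tree. It is only an illustration, not part of Aspose.Words: the class name PlainTextSketch and the method stripControlChars are made up for this post, and it assumes the raw text marks fields with the usual Word characters 0x13 (field start), 0x14 (field separator) and 0x15 (field end), and line breaks with 0x0B. Unlike the visitor above, which uses a single flag, this sketch keeps a stack so that a field nested inside another field's code is also skipped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only; not an Aspose.Words class.
public class PlainTextSketch {
    // Assumed Word control characters for fields and manual line breaks.
    private static final char FIELD_START = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;
    private static final char LINE_BREAK = 0x0B;

    /** Returns the reader-visible text of a raw string with field markers. */
    public static String stripControlChars(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still inside its field code.
        Deque<Boolean> openFields = new ArrayDeque<>();
        // Number of open fields whose code part we are currently inside.
        int inCode = 0;

        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_START) {
                openFields.push(Boolean.TRUE);
                inCode++;
            } else if (c == FIELD_SEPARATOR) {
                // The innermost field moves from its code part to its result part.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    inCode--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator is still in its code part here.
                if (!openFields.isEmpty() && openFields.pop()) {
                    inCode--;
                }
            } else if (inCode == 0) {
                // Visible text: keep it, turning manual line breaks into CR LF.
                if (c == LINE_BREAK) {
                    out.append("\r\n");
                } else {
                    out.append(c);
                }
            }
        }
        return out.toString();
    }
}
```

For example, a string holding "Page ", a PAGE field whose result is "3", and " of 10" comes back as "Page 3 of 10". For real documents the visitor is still the better route, since it works on the node tree rather than on raw characters.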

Best regards,