How to get node text only

dngan · July 24, 2006, 1:36pm

In Aspose.Words for Java, is there a method to get the text of a node without any of the control characters? I just want the actual text that the reader of the document sees. Node.getText() seems to return all of the control characters around the actual text as well.

miklovan · July 25, 2006, 10:23am

Please check the following article in our wiki on the topic:

https://docs.aspose.com/words/net/work-with-text-document/

It lacks a Java code example, however. We will put together sample Java code and post it shortly.

Best regards,

Konstantin · July 26, 2006, 2:36am

Hi, dngan,

SaveFormat.Text is not public in the Java version at the moment, so you are better off implementing a DocumentVisitor. Here is a code snippet showing how to do this:

import java.io.FileOutputStream;
import java.io.OutputStream;

import com.aspose.words.*;

// @Test comes from your test framework (for example JUnit or TestNG).
public class TestTxtWriter
{
    @Test
    public void TxtWriterTest() throws Exception
    {
        TxtWriter txtWriter = new TxtWriter();
        Document doc = new Document("X:\\Aspose\\forum\\WordCount_Testing.doc");

        // Save to a text file; try-with-resources closes the stream even on failure.
        try (OutputStream stream1 = new FileOutputStream("X:\\Aspose\\forum\\out1.txt"))
        {
            txtWriter.save(doc, stream1);
        }

        // Get the text as a string.
        String text = txtWriter.getPlainText(doc);

        // Write the string to a file with an explicit encoding rather than the platform default.
        try (OutputStream stream2 = new FileOutputStream("X:\\Aspose\\forum\\out2.txt"))
        {
            stream2.write(text.getBytes("UTF-8"));
        }
    }
}

/**
 * Responsible for saving document in text format.
 */
class TxtWriter extends DocumentVisitor
{
    TxtWriter()
    {
    }

    /**
     * Saves the document in plain text format.
     */
    public void save(Document document, OutputStream stream) throws Exception
    {
        String text = getPlainText(document);
        // Use an explicit encoding instead of the platform default.
        stream.write(text.getBytes("UTF-8"));

        //Not closing stream here as it is the client's responsibility.
        stream.flush();
    }

    /**
     * Gets the plain text from the node.
     */
    public String getPlainText(Node node) throws Exception
    {
        mIsSkipText = false;
        mBuilder = new StringBuilder();

        //Extract text from the node.
        node.accept(this);

        //Remove remaining control characters
        String text = mBuilder.toString();
        text = text.replace(ControlChar.LINE_BREAK, ControlChar.CR_LF);
        text = text.replace(ControlChar.ANNOTATION_REF, "");
        text = text.replace(ControlChar.FOOTNOTE_REF, "");
        text = text.replace(ControlChar.DRAWN_OBJECT, "");

        return text;
    }

    public int visitRun(Run run)
    {
        appendText(run.getText());
        return VisitorAction.CONTINUE;
    }

    public int visitFieldStart(FieldStart fieldStart)
    {
        mIsSkipText = true;
        return VisitorAction.CONTINUE;
    }

    public int visitFieldSeparator(FieldSeparator fieldSeparator)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    public int visitFieldEnd(FieldEnd fieldEnd)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    public int visitParagraphEnd(Paragraph paragraph)
    {
        appendText(ControlChar.CR_LF);
        return VisitorAction.CONTINUE;
    }

    private void appendText(String text)
    {
        if (!mIsSkipText)
            mBuilder.append(text);
    }

    private StringBuilder mBuilder;
    private boolean mIsSkipText;
}
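
To see the skip logic in isolation, here is a minimal sketch that applies the same idea to a raw string instead of a document tree. It does not use Aspose.Words at all: the class name PlainTextSketch and the method toPlainText are hypothetical, and the character values (0x13 field start, 0x14 field separator, 0x15 field end, 0x0B line break, '\r' paragraph break) are assumed to match what ControlChar exposes.

```java
public class PlainTextSketch {
    // Assumed values of the Word control characters (not read from the Aspose library).
    private static final char FIELD_START = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;
    private static final char LINE_BREAK = 0x0B;
    private static final char PARAGRAPH_BREAK = '\r';

    /** Hypothetical helper: drops field codes and control characters from raw text. */
    public static String toPlainText(String raw) {
        StringBuilder out = new StringBuilder();
        boolean skip = false; // true while inside a field code
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case FIELD_START:
                    skip = true;   // field code begins: hide it
                    break;
                case FIELD_SEPARATOR:
                case FIELD_END:
                    skip = false;  // field result or normal text follows
                    break;
                case LINE_BREAK:
                case PARAGRAPH_BREAK:
                    if (!skip) out.append("\r\n");
                    break;
                default:
                    if (!skip) out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Page " + FIELD_START + " PAGE " + FIELD_SEPARATOR + "3"
                + FIELD_END + " of the report" + PARAGRAPH_BREAK;
        System.out.print(toPlainText(raw));
    }
}
```

As with the visitor above, a single boolean flag does not track nested fields, so a field placed inside another field's code would switch skipping off too early.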

Best regards,

alexey.noskov · May 7, 2022, 5:26am