Extraction of text from document

andyradchenko · October 16, 2012, 9:06am

I just got your library for evaluation and wrote some simple code that should extract all text from a Word document (which is my goal).
After testing this code on different documents, I found that it also outputs some additional text that is not visible in the document and is not hidden text either.
My first thought was that it was some internal, format-related information. After a short investigation it turned out to be field names, so I went through your documentation trying to find out how to exclude this information from the output text, but could not find anything that resolves the problem:
https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/
https://docs.aspose.com/words/net/working-with-fields

The line below also returns a string containing field names.

tmpNodeValue = node.toString(SaveFormat.TEXT);
// MERGEFIELD MyName \* MERGEFORMAT

My goal is to get only the real text, excluding functional data like this. Does the library offer a way to handle this?

tahir.manzoor · October 17, 2012, 7:01am

Hi Andrey,

Thanks for your inquiry. Could you please attach your Word documents here for testing? I will investigate the issue on my side and provide you with more information.

andyradchenko · October 19, 2012, 7:27am

Sure. Please find the one with fields attached. So, based on your response, is my understanding correct that the library should not include field-specific information when text is obtained with node.toString(SaveFormat.TEXT)?

tahir.manzoor · October 22, 2012, 1:52am

Hi Andrey,

Thanks for sharing the information. Please use the Run.Text property to get the text of each node. All text of the document is stored in runs of text, and a Run can only be a child of a Paragraph. Please see the attached screenshot and use the following code snippet for your requirement.

Document doc = new Document(MyDir + "test9.doc");
foreach (FieldStart fStart in doc.GetChildNodes(NodeType.FieldStart, true))
{
    string FieldCode = GetFieldCode(fStart);
    Console.WriteLine(FieldCode);
}

private static string GetFieldCode(Aspose.Words.Fields.FieldStart fieldStart)
{
    StringBuilder builder = new StringBuilder();
    for (Node node = fieldStart; node != null && node.NodeType != NodeType.FieldSeparator && node.NodeType != NodeType.FieldEnd; node = node.NextPreOrder(node.Document))
    {
        // Use the text only of Run nodes to avoid duplication.
        if (node.NodeType == NodeType.Run)
            builder.Append(node.GetText());
    }
    return builder.ToString();
}

Hope this helps you. Please let us know if you have any more queries.
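
As a complement to GetFieldCode above, which collects the code part of a field, here is a minimal sketch in plain Java that goes the other way: it drops field codes and keeps only the field results. It does not call Aspose.Words at all. It assumes the input is raw document text in which fields are delimited by the Word control characters U+0013 (field start), U+0014 (field separator) and U+0015 (field end), which is how text returned by Node.getText() is expected to look. The class name FieldTextFilter and the method visibleText are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldTextFilter {
    // Control characters that delimit fields in raw Word text (assumed input format).
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /** Returns the text with field codes removed and field results kept. */
    static String visibleText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still inside its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_START) {
                inCode.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost open field from "code" to "result".
                if (!inCode.isEmpty()) {
                    inCode.pop();
                    inCode.push(Boolean.FALSE);
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                }
            } else if (!inCode.contains(Boolean.TRUE)) {
                // Keep the character only if no enclosing field is in its code part.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Dear \u0013 MERGEFIELD MyName \\* MERGEFORMAT \u0014\u00ABMyName\u00BB\u0015, hello";
        System.out.println(visibleText(raw));
    }
}
```

The stack is there so that a field nested inside another field's code (for example a MERGEFIELD inside an IF field) stays hidden together with the outer code, while a field nested inside a result still contributes its own result.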

andyradchenko · October 23, 2012, 4:00am

I don’t think I got my question across.
Judging by your screenshot, the Run element contains the same text that I got with my test program:
MERGEFIELD MyName \* MERGEFORMAT
My question was: is there a way to retrieve and modify the plain text (“MyName”)? I don’t need Word-specific information. I just need the real information entered by the user.

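import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Iterator;

import com.aspose.words.*;

public class MainClass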
{
    public static void main (String [] args) throws Exception
    {
        final String fileName = "test9.doc";
        MainClass testInstance = new MainClass();
        testInstance.parseDocument(fileName);

    }
    private Document mDocument;
    private int counter = 0;
    public void parseDocument(String fileName) throws Exception {
        InputStream stream = new FileInputStream(fileName);
        mDocument = new Document(stream);
        FormFieldCollection formFieldCollection = mDocument.getRange().getFormFields();
        Iterator formFieldIterator = formFieldCollection.iterator();
        while (formFieldIterator.hasNext())
        {
            // The iterator is untyped, so each element must be cast explicitly.
            FormField formField = (FormField) formFieldIterator.next();
            DropDownItemCollection dropDownItemCollection = formField.getDropDownItems();
            if (dropDownItemCollection != null)
            {
                Iterator dropDownItemIterator = dropDownItemCollection.iterator();
                while (dropDownItemIterator.hasNext())
                {
                    // Drop-down items are plain strings; cast from the untyped iterator.
                    String dropDownItem = (String) dropDownItemIterator.next();
                    // System.out.println(String.format("DropDownItem #%d:%s", counter++, dropDownItem));
                }
            }

        }

        getTextFromNode(mDocument);
        stream.close();

    }

    public void getTextFromNode(Node node) throws Exception {
        if (node instanceof Run)
        {
            // System.out.println(String.format("Node #%d:%s", counter++, node.toString(SaveFormat.TEXT)));
        }
        if (node instanceof CompositeNode)
        {
            CompositeNode tempCastNode = (CompositeNode)node;
            Iterator iterator = tempCastNode.iterator();
            while (iterator.hasNext())
            {
                Node childNode = (Node)iterator.next();
                getTextFromNode(childNode);
            }
        }
    }
}
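
For the goal stated in this post, pulling "MyName" out of the field code text that the runs return, a small parser can be sketched in plain Java. This is only an illustration and is not part of Aspose.Words: it assumes the code string has the usual Word form `MERGEFIELD MyName \* MERGEFORMAT`, where the first token is the field type, the next one is the field name (optionally in double quotes), and switches begin with a backslash. The names FieldCodeParser, tokenize and mergeFieldName are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class FieldCodeParser {
    /** Splits a field code on whitespace, keeping double-quoted arguments together. */
    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : code.trim().toCharArray()) {
            if (c == '"') {
                quoted = !quoted; // quotes group words but are not part of the token
            } else if (Character.isWhitespace(c) && !quoted) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Returns the merge field name, or null if the code is not a named MERGEFIELD. */
    static String mergeFieldName(String code) {
        List<String> tokens = tokenize(code);
        if (tokens.size() < 2 || !tokens.get(0).equalsIgnoreCase("MERGEFIELD")) {
            return null;
        }
        String name = tokens.get(1);
        // A token starting with a backslash is a switch, not a name.
        return name.startsWith("\\") ? null : name;
    }

    public static void main(String[] args) {
        System.out.println(mergeFieldName(" MERGEFIELD MyName \\* MERGEFORMAT "));
    }
}
```

Returning null for anything that is not a MERGEFIELD keeps the caller in control of what to do with other field types such as DATE or PAGE.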

tahir.manzoor · October 23, 2012, 5:35am

Hi Andrey,

Thanks for sharing the information. You can get the real information entered by the user by implementing the IFieldMergingCallback interface. Please see the following documentation links for reference.
https://reference.aspose.com/words/java/com.aspose.words/IFieldMergingCallback
https://reference.aspose.com/words/java/com.aspose.words/FieldMergingArgs

Document doc = new Document(MyDir + "in.docx");
// Register the callback that is invoked for every merge field.
doc.getMailMerge().setFieldMergingCallback(new TextEnterByUser());
// Fill the fields in the document with user data.
doc.getMailMerge().execute(
    new String[] { "MyName", "company" },
    new Object[] { "James Bond", "MI5 Headquarters" });

public class TextEnterByUser implements IFieldMergingCallback {
    public void fieldMerging(FieldMergingArgs e) throws Exception{
        if (e.getFieldName().equals("MyName")) {
            System.out.println(e.getFieldValue());
        }
    }
    public void imageFieldMerging(ImageFieldMergingArgs args) throws Exception {
        // your code
    }
}

Hope this helps you. Please let us know if you have any more queries.