Issue with dynamic generation of "plain text" for Paragraph objects

Greetings,

I’m currently working on moving a set of tools from a JAXB based document processor to the Aspose.Words package, and I am running into some difficulties. Some of the tools require that a “plain text” version of a particular paragraph object be generated on-the-fly. By plain text I mean that everything that is relative is rendered (auto-numbering, etc.), and that only text that will be visible when printed to paper is stored (delete revisions, comments, etc. are not stored). The two main issues I’m facing are rendering the auto-numbering numbers, and properly filtering out all “non-visible” text.
For filtering out the text I have tried several techniques, including those posted on this forum. If I filter by NodeType most of the non-visible text is removed, however there are still some comments that appear in the text. If I use the DocumentVisitor I get the same results. Since the tools generate and delete paragraphs dynamically, I have not tried converting the document to text first, as this will de-synchronize the text with the Aspose objects.
I think the issue with comments appearing in the text is due to the following type of “comment run” found in document.xml:

pairing at least one object to a corresponding first modifier that modifies access to said object;

It may be possible that the above “comment run” is getting parsed into a Run object as if it were a “text run”. A method to determine the sub-type of a Run would be very helpful.
As for the auto-numbering, I’m not sure where to start. I see that there is all of the styling and formatting data for the list available in the Paragraph object. This is useful but there doesn’t seem to be a method that renders and returns the actual number. Any suggestions?
Any assistance is much appreciated,
PCR

Hi

Thanks for your inquiry.

  1. You can easily filter out unnecessary text using DocumentVisitor. For instance, the code provided in the following article filters out field codes.

https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

/**
 * Called when a FieldStart node is encountered in the document.
 */
public int visitFieldStart(FieldStart fieldStart)
{
    // In Microsoft Word, a field code (such as "MERGEFIELD FieldName") follows
    // after a field start character. We want to skip field codes and output field
    // result only, therefore we use a flag to suspend the output while inside a field code.
    // Note this is a very simplistic implementation and will not work very well
    // if you have nested fields in a document.
    mIsSkipText = true;
    return VisitorAction.CONTINUE;
}
/**
 * Called when a FieldSeparator node is encountered in the document.
 */
public int visitFieldSeparator(FieldSeparator fieldSeparator)
{
    // Once reached a field separator node, we enable the output because we are
    // now entering the field result nodes.
    mIsSkipText = false;
    return VisitorAction.CONTINUE;
}
/**
 * Called when a FieldEnd node is encountered in the document.
 */
public int visitFieldEnd(FieldEnd fieldEnd)
{
    // Make sure we enable the output when reached a field end because some fields
    // do not have field separator and do not have field result.
    mIsSkipText = false;
    return VisitorAction.CONTINUE;
}

You can use the same technique to filter comments. Your additional code will look like this:

public int visitCommentStart(Comment comment)
{
    // Once reached a start of comment, we disable output.
    mIsSkipText = true;
    return VisitorAction.CONTINUE;
}
public int visitCommentEnd(Comment comment)
{
    // Once reached a comment end, we enable the output.
    mIsSkipText = false;
    return VisitorAction.CONTINUE;
}
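
The comments in the field handlers above point out that a single flag does not cope with nested fields. Here is a rough sketch of one way around that. It is not the code from the linked article, and it works on a raw string rather than through a visitor. It assumes the raw text carries the usual Word field marker characters (U+0013 field start, U+0014 separator, U+0015 end); the class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Aspose.Words.
final class FieldCodeFilterSketch {
    // Assumed marker characters for field start, separator and end.
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    // Returns the text with field codes removed and field results kept.
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still in its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        int openCodes = 0; // number of open fields that are still in their code part
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_START) {
                inCode.push(Boolean.TRUE);
                openCodes++;
            } else if (c == FIELD_SEPARATOR) {
                // The innermost field switches from code to result.
                if (!inCode.isEmpty() && inCode.peek()) {
                    inCode.pop();
                    inCode.push(Boolean.FALSE);
                    openCodes--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator ends while still in its code part.
                if (!inCode.isEmpty() && inCode.pop()) {
                    openCodes--;
                }
            } else if (openCodes == 0) {
                // Visible only when no enclosing field is in its code part.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The stack is what makes nesting work: text is emitted only when none of the enclosing fields is still in its code part, so a field nested inside another field's code stays hidden.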

Also, I think the DocumentExplorer demo could be useful for you. It lets you easily inspect the document structure.
2. Unfortunately, there is no way to get list item numbers using Aspose.Words. These numbers are not stored in the document; MS Word calculates list numbers on the fly when it opens the document. However, you can try to create your own method for calculating list numbers.

I created sample code for another customer to achieve a similar task. Please see the following link to learn more.
https://forum.aspose.com/t/106128
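
To give an idea of what calculating list numbers by hand involves, here is a minimal sketch. It is not the code from the linked thread. It assumes every level starts at 1, uses plain decimal numbers, and restarts deeper levels whenever a shallower level advances; real list definitions can override all of these. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Aspose.Words.
final class ListNumberSketch {
    // One counter per list level; the number format placeholders cover nine levels.
    private final int[] counters = new int[9];

    // Advances the counter for the given 0-based level and returns
    // the numbers for levels 0..level.
    int[] next(int level) {
        counters[level]++;
        // Assumption: a shallower level that has not been seen yet counts as 1.
        for (int i = 0; i < level; i++) {
            if (counters[i] == 0) {
                counters[i] = 1;
            }
        }
        // Deeper levels restart after this item.
        Arrays.fill(counters, level + 1, counters.length, 0);
        return Arrays.copyOf(counters, level + 1);
    }

    // Renders the numbers in a fixed "1.2." style, purely for illustration.
    static String label(int[] numbers) {
        StringBuilder sb = new StringBuilder();
        for (int n : numbers) {
            sb.append(n).append('.');
        }
        return sb.toString();
    }
}
```

Feeding it the list level of each list paragraph in document order yields the numbers for that paragraph; a separate counter object would be needed for each list in the document.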
Hope this helps.
Best regards.

Thank you for the quick reply.

I’m currently porting to Java the code that you linked for processing List Labels and I’ve run into a method in ListLabelsExtractor that I’m not sure how to properly port.

/**
 * Returns the format of the list label (.NET format like {0}.{1}).
 *
 * @param lstLevel the list level
 * @return the format of the list label
 */
private String getLevelNumberFormat(ListLevel lstLevel)
{
    String format = lstLevel.getNumberFormat();
    format = format.replace("\x0000", "{0}");
    format = format.replace("\x0001", "{1}");
    format = format.replace("\x0002", "{2}");
    format = format.replace("\x0003", "{3}");
    format = format.replace("\x0004", "{4}");
    format = format.replace("\x0005", "{5}");
    format = format.replace("\x0006", "{6}");
    format = format.replace("\x0007", "{7}");
    format = format.replace("\x0008", "{8}");

    return format;
}

Any suggestions would be greatly appreciated. In the meantime I will try the DocumentVisitor solution that you suggested.
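
For reference, Java string literals have no `\x` escape. The C# literal `"\x0000"` is the single character U+0000, which Java writes as `"\u0000"`. Below is a sketch of how the replacement could be expressed, taking the number format as a plain string so it does not depend on the library. The class and method names are hypothetical, and the placeholder characters are assumed to be U+0000 through U+0008, as in the method above.

```java
// Hypothetical helper, not part of Aspose.Words.
final class LevelFormatSketch {
    // Replaces the level placeholder characters U+0000..U+0008
    // with indexed placeholders {0}..{8}.
    static String toIndexedFormat(String numberFormat) {
        String format = numberFormat;
        for (int level = 0; level <= 8; level++) {
            // (char) level is the control character whose code equals the level index.
            String placeholder = String.valueOf((char) level);
            format = format.replace(placeholder, "{" + level + "}");
        }
        return format;
    }
}
```

Inside the extractor this would be called as `toIndexedFormat(lstLevel.getNumberFormat())`.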

In reply to the DocumentVisitor solution that was suggested, the problem of comments appearing in text is still present. The xml I posted in the first post is an example of the type of comment that is present in the plain text.

I’ve attached the code for the visitor class below.

public class WordParagraphVisitor extends DocumentVisitor
{

    private String mBuilder;
    private boolean mIsSkipText;

    public WordParagraphVisitor()
    {
        mIsSkipText = false;
        mBuilder = new String();
    }

    public String getText()
    {
        return mBuilder.toString();
    }

    public int visitRun(Run run)
    {
        appendText(run.getText());

        return VisitorAction.CONTINUE;
    }

    public int visitFieldStart(FieldStart fieldStart)
    {
        mIsSkipText = true;
        return VisitorAction.CONTINUE;
    }

    public int visitFieldSeparator(FieldSeparator fieldSeparator)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    public int visitFieldEnd(FieldEnd fieldEnd)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    public int visitCommentStart(Comment comment)
    {
        mIsSkipText = true;
        return VisitorAction.CONTINUE;
    }

    public int visitCommentEnd(Comment comment)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    public int visitParagraphStart(Paragraph paragraph)
    {
        mIsSkipText = false;
        return VisitorAction.CONTINUE;
    }

    public int visitParagraphEnd(Paragraph paragraph)
    {
        appendText(ControlChar.CR_LF);
        return VisitorAction.CONTINUE;
    }

    public int visitHeaderFooterStart(HeaderFooter headerFooter)
    {
        return VisitorAction.SKIP_THIS_NODE;
    }

    private void appendText(String text)
    {
        if (!mIsSkipText)
            mBuilder += (text);
    }
}

Hi

Thanks for your request.

  1. I translated the ListLabelExtractor class to Java (see the attached files). Here is how you can use these classes:
Document doc = new Document("C:\\Temp\\in.doc");
// Create an object that inherits from the DocumentVisitor class.
MyDocToTxtWriter myConverter = new MyDocToTxtWriter();
// This is the well known Visitor pattern. Get the model to accept a visitor.
// The model will iterate through itself by calling the corresponding methods
// on the visitor object (this is called visiting).
// Note that every node in the object model has the Accept method so the visiting
// can be executed not only for the whole document, but for any node in the document.
doc.accept(myConverter);
// Once the visiting is complete, we can retrieve the result of the operation,
// that in this example, has accumulated in the visitor.
// Save text to the file
String text = myConverter.GetText();
FileWriter outFile = new FileWriter("C:\\Temp\\out.txt");
PrintWriter out = new PrintWriter(outFile);
// Write text to file
out.println(text);
out.close();

Hope this helps.
2. Could you please attach your sample document with comments here for testing? I will check it and provide you more information.

Best regards.

Thank you for the quick response.

I am unable to attach the document I had referenced in the first post due to confidentiality issues, however I have prepared and attached another document that reproduces the problem.

Much thanks for the continued help,

PCR

Hi

Thanks for your request. I modified MyDocToTxtWriter class, which I attached in my previous post. I just added the following 2 methods:

/**
 * Called when a CommentStart is encountered in the document.
 * Used to skip comment text.
 */
public int visitCommentStart(Comment comment)
{
    // Once reached a start of comment, we disable output.
    mIsSkipText = true;
    return VisitorAction.CONTINUE;
}
/**
 * Called when a CommentEnd is encountered in the document.
 * Used to skip comment text.
 */
public int visitCommentEnd(Comment comment)
{
    // Once reached a comment end, we enable the output.
    mIsSkipText = false;
    return VisitorAction.CONTINUE;
}

And added the following two lines at the beginning of visitParagraphStart method:
if (mIsSkipText)
    return VisitorAction.CONTINUE;

With these changes, comments are skipped in the output document. I attached the modified code here.
Best regards.

Thank you for the quick response.

The issue is still persistent even with the new code you have uploaded. However I seem to have discovered where the problem is occurring. In my code I am trying to use the DocumentVisitor to generate the plain text for each Paragraph object independently and not the entire document. The code posted does indeed work for the entire document, but it doesn’t work when used as a visitor for a single Paragraph object.

Do not worry about the functionality of the ListLabelsExtractor as I have already re-written it to work for a single paragraph. The only problem left is removing comments when using the DocumentVisitor posted on a Paragraph.

Any suggestions are much appreciated.

Hi

Thank you for the additional information. I think you can use the following method to determine whether a node is inside a Comment:

/**
 * Returns true if the node is inside a Comment (at any depth).
 */
private boolean isComment(Node node)
{
    Node parentComment = node.getAncestor(NodeType.COMMENT);
    return (parentComment != null);
}
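
Unlike the skip flag, this check does not depend on visitor state, so it gives the same answer whether the visit starts at the whole document or at a single paragraph. The sketch below illustrates the idea on a tiny stand-in node tree. `SketchNode` and `VisibleTextSketch` are hypothetical classes written only for this example; they are not the Aspose.Words types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document node; not the Aspose.Words Node class.
final class SketchNode {
    final String type; // e.g. "PARAGRAPH", "RUN", "COMMENT"
    final String text; // only used for "RUN" nodes
    SketchNode parent;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String type, String text) {
        this.type = type;
        this.text = text;
    }

    // Appends a child and returns this node so calls can be chained.
    SketchNode add(SketchNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // Walks up the parent chain; returns the nearest ancestor of that type, or null.
    SketchNode getAncestor(String ancestorType) {
        for (SketchNode n = parent; n != null; n = n.parent) {
            if (n.type.equals(ancestorType)) {
                return n;
            }
        }
        return null;
    }
}

final class VisibleTextSketch {
    // Collects run text below 'start', skipping any run that sits inside a comment.
    static String visibleText(SketchNode start) {
        StringBuilder out = new StringBuilder();
        collect(start, out);
        return out.toString();
    }

    private static void collect(SketchNode node, StringBuilder out) {
        if (node.type.equals("RUN") && node.getAncestor("COMMENT") == null) {
            out.append(node.text);
        }
        for (SketchNode child : node.children) {
            collect(child, out);
        }
    }
}
```

In this model the comment text is skipped even when the walk starts at the paragraph that lives inside the comment, because the decision is made per run from its ancestors rather than from events seen earlier in the walk.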

I added this method into the MyDocToTxtWriter. See the attached class. Hope this helps.
Best regards.

Thank you very much for all of the help, that last method worked wonders.

PCR