Aspose.Words Java API returns incorrect comment IDs (Bug)

Hi, the comment ID in a Word DOCX document is, for example:

<w:comment w:id="0" w:author="abc xyzzy" w:date="2022-06-07T16:04:55Z" w:initials="AB">

However, Comment.getId() from the Aspose API returns an incorrect number, such as 7. We observe this with documents that have bookmarks, but there could be other causes too.

Can Aspose please fix this critical bug so that the Comment.getId() matches what is present in OOXML of the DOCX file?

Thanks.

@kml2020 According to the API description: “The comment identifier allows to anchor a comment to a region of text in the document. The region must be demarcated using the CommentRangeStart and CommentRangeEnd object sharing the same identifier value as the Comment object. You would use this value when looking for the CommentRangeStart and CommentRangeEnd nodes that are linked to this comment.” The comment identifier in the Aspose.Words API is limited to what this description states; it has never been declared to be identical to the values recorded in the OOXML markup.
The comment identifier in OOXML markup written by MS Word serves the same purpose. Its value is not guaranteed to survive resaving, and it is guaranteed to change during certain operations (for example, adding a reply to the previous comment). The specific value of a comment identifier therefore has a narrow, limited purpose and cannot be stretched to cover other uses (for example, uniquely identifying a comment across arbitrary operations and/or document resaving).
Aspose.Words tries to keep comment ID generation identical to that of MS Word whenever possible, but because Aspose.Words functionality is in some respects wider than that of MS Word, this is not always possible. We therefore cannot guarantee that comment IDs in Aspose.Words match those in the OOXML markup.

@Vadim.Saltykov Our team ran into a similar problem while evaluating Aspose.WORD

We’re looking to build a web based document commenting solution (Java backend).
This involves reading, creating, modifying, deleting and resolving comments and replies.

That said, we have 3 layers of ids to contend with:

  1. HTML ids for the web interface (as returned by Aspose.WORD HTML via doc.save(OutputStream, HtmlSaveOptions)) observed as _cmntref?

  2. Aspose.WORD comment ids (as returned by doc.getChildNodes(NodeType.COMMENT, true) and comment.getId())

  3. The original comment ids in the OOXML (found in comments.xml)

Questions:

[A] In our app, when a user hovers over a comment anchor in HTML, we’d like to show them comment details (by linking 1 and 2). We observed that the _cmntref? comments are 1-based and the comment.getId() values are 0-based. Can we assume that the HTML comment ids formatted as _cmntref? will always correlate to comment.getId() with this offset of 1? If not, how do we bring in details in HTML for the comments?

[B] When a user attempts to modify text in a comment denoted in HTML as _cmntref6, is it safe to ‘find the comment id==5 in Aspose.WORD’ and update it? If yes, which API in com.aspose.words.Comment do we call to update comment text and persist it back into the DOCX? I assume a mutation to remove a comment from the DOCX would follow the same principle wherein we find the comment and call comment.remove()? If not, please let me know the best strategy to locate a comment and apply a mutation on it in the DOCX.

[C] Does this ^ apply to the latest version v22.5 or older versions too?

Thanks in advance!

@rogerC123 In general the observation is correct: the comment ID generator works like a 0-based counter that assigns the next ID by numbering the entire comment hierarchy of the document tree continuously. For HTML export the counter is 1-based. However, you use this approach at your own risk: it is guaranteed to work only for a flat hierarchy and relatively small documents, and it needs great care when combined with HTML and DOCX export/import and with editing the DOCX in MS Word in between.
Please note once again that comment IDs are guaranteed to serve only two functions: first, linking the comment itself to its text anchor (CommentRangeStart and CommentRangeEnd), and second, reflecting the comment order in the document.
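
The offset of 1 discussed here can be written down as a small helper. This is only a minimal sketch in plain Java with no Aspose.Words dependency: the class and method names are hypothetical, and the "n - 1" rule is the observed convention, which holds only for a flat comment hierarchy.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class CmntRefOffset {
    // Anchor names observed in the exported HTML: "_cmntref" followed by a 1-based number.
    private static final Pattern ANCHOR = Pattern.compile("_cmntref(\\d{1,9})");

    // Returns the 0-based ID that Comment.getId() would be expected to report
    // under the "offset of 1" assumption, or empty if the name is not such an anchor.
    static OptionalInt candidateCommentId(String anchorName) {
        Matcher m = ANCHOR.matcher(anchorName);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        int oneBased = Integer.parseInt(m.group(1));
        return oneBased >= 1 ? OptionalInt.of(oneBased - 1) : OptionalInt.empty();
    }
}
```

Under this assumption "_cmntref6" yields 5. The result is best treated as a candidate to verify (for example against the comment text or author) rather than as a guaranteed match.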

@Vadim.Saltykov What is the recommended best practice for performing comment mutations programmatically? Deleting a comment seems fairly straightforward. What about updates? For example, I’d like to edit a comment or resolve a comment.

@rogerC123 You can use any public API provided by Aspose.Words to work with comments. If your concern is how comment identifiers are preserved in practice, we recommend covering your scenarios with a large number of tests.
Also, please see our documentation to learn more about working with comments.

@Vadim.Saltykov Could you be more specific about what could cause this assumption to break? e.g. what’s an example of a non-flat hierarchy? and why would a large document cause this assumption to break?

I looked here and couldn’t find anything to edit or resolve an existing comment.
How would I accomplish this with Aspose?

Please note that answering this requires knowing the full functionality of your application. What do you do besides importing HTML documents? For example, you mentioned OOXML markup, so editing MS Word documents is presumably also involved, isn’t it? You need to create tests for all such cases of import/export and of adding/deleting comments, and you can use the number of comments before and after these operations as a reference. You can also write fully valid identifiers (for example, GUID-based) in place of the comment text and compare them with what comment.getId() returns.

For example, adding a reply to the previous comment will change the ID of the current comment, since replies are also comments with their own ID.
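
The effect can be illustrated with a toy model. The sketch below is plain Java and is not Aspose.Words code: the CommentIdModel and Node names are hypothetical, and it assumes replies are numbered immediately after their parent comment by one continuous 0-based counter, as described earlier in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the numbering described in this thread; not Aspose.Words code.
final class CommentIdModel {
    static final class Node {
        final String label;
        final List<Node> replies = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    // Numbers comments and their replies with one continuous 0-based counter.
    static Map<String, Integer> assignIds(List<Node> topLevelComments) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int[] next = {0};
        for (Node comment : topLevelComments) {
            number(comment, next, ids);
        }
        return ids;
    }

    private static void number(Node node, int[] next, Map<String, Integer> ids) {
        ids.put(node.label, next[0]++);
        for (Node reply : node.replies) {
            number(reply, next, ids);
        }
    }

    public static void main(String[] args) {
        Node first = new Node("first");
        Node second = new Node("second");
        List<Node> doc = List.of(first, second);

        System.out.println(assignIds(doc)); // "second" is numbered 1
        first.replies.add(new Node("reply to first"));
        System.out.println(assignIds(doc)); // the reply is numbered 1, "second" moves to 2
    }
}
```

In this model the ID of "second" changes although the comment itself was never touched, which is why an ID stored before the reply was added can point at a different comment afterwards.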

To edit a comment you first need to find it. You can do this as follows:

// Gets the first comment in the document (index 0). getChild returns a Node, so a cast is needed.
Comment comment = (Comment) doc.getChild(NodeType.COMMENT, 0, true);

And then do whatever you want with it.

// Adds reply to this comment.
comment.addReply("Author", new Date(), "Text comment");
// Removes itself from the parent.
comment.remove();
...

@Vadim.Saltykov
It is okay for these IDs to change when new comments or replies are added to the document.
This is not a problem for us.

We did, however, run into a problem with the following use case:

I have the following text in my document:

The quick brown fox jumped over the moon.

I add a comment C1 on the text brown.
I add another comment C2 on the text quick brown fox.
Therefore, these two comments overlap, with C1 nested inside C2.

The Aspose.WORD API returns these comments as…
id=1: C1
id=0: C2
This is correct and how I would expect it because C2’s context starts before C1

The Aspose.WORD generated HTML generates _cmntref comment markers ordered like this:
_cmntref1: C1
_cmntref2: C2

… which appears like this in the DOM …

    <a name="_cmntref2">
      <span style="font-family:Calibri">quick </span>
    </a>
    <a name="_cmntref1">
      <span style="font-family:Calibri">brown</span>
    </a>
    <span style="font-family:Calibri"> fox</span>
    <span style="-aw-comment-end:_cmntref2"></span>

id=0 maps to _cmntref2
id=1 maps to _cmntref1

The HTML _cmntref tags in this example are out-of-sync with what the Aspose generated comment ids are.

Also observed here ^ is that there is no -aw-comment-end:_cmntref1 tag

Is there a way for Aspose to preserve the consistency or mapping between ids returned via the Aspose.WORD api vs Aspose generated HTML content?

Without this, there’s no easy way for us to show comment details (returned by the Aspose.WORD api) and tie them to the HTML rendered version of the document in our web application.

@rogerC123 If you create the original DOCX document with Aspose.Words code, you can avoid this issue by adding the comment on ‘quick brown fox’ first and the comment on ‘brown’ second. In that case the HTML and DOCX identifier numbering will be identical. The reason is that in the DOCX format the order in which comments are defined in the markup does not matter; only the identifier values matter. HTML export, however, pays attention to this order and assigns IDs based on it. Unfortunately, this is one of the issues I mentioned above, and it arises because these pseudo-identifiers are not intended for such tasks. It is presumably not the only such issue. To avoid all of them, we suggest using full-value comment identifiers in the form of prefixes or postfixes on the comment text. Something like:
“[C0001] quick brown fox”
“[C0002] brown”
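
A prefix convention like this could be handled with a small helper. The following is a minimal sketch in plain Java, not Aspose.Words API: the CommentTag class and its methods are hypothetical, and the "[C" plus four digits plus "]" format is simply taken from the example above.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the "[C0001] text" convention; not part of Aspose.Words.
final class CommentTag {
    // A tag in square brackets at the start of the text, then the original comment text.
    private static final Pattern PREFIX =
            Pattern.compile("^\\[(C\\d{4})\\]\\s*(.*)$", Pattern.DOTALL);

    // Builds the prefixed comment text from a number in the range 0..9999.
    static String withTag(int number, String text) {
        if (number < 0 || number > 9999) {
            throw new IllegalArgumentException("number must be in 0..9999: " + number);
        }
        return String.format(Locale.ROOT, "[C%04d] %s", number, text);
    }

    // Extracts the tag (for example "C0002") if the text carries one.
    static Optional<String> tagOf(String commentText) {
        Matcher m = PREFIX.matcher(commentText.trim());
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Returns the comment text with the tag removed, or the text unchanged if there is no tag.
    static String textWithoutTag(String commentText) {
        Matcher m = PREFIX.matcher(commentText.trim());
        return m.matches() ? m.group(2) : commentText;
    }
}
```

The tag travels with the comment text through import and export, so it can be matched on both sides regardless of how the numeric IDs are renumbered. The text to tag and to parse would come from whatever API is used to read and write the comment body.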

Unfortunately, we don’t have the luxury to restrict our customers to only use our product which in turn exclusively uses Aspose.Words. Our customers can create comments directly in Word and then upload those documents into our system. Our system then presents the DOCX (along with its embedded comments) in a web interface (as Aspose.Words generated HTML).
What does it mean to add ‘full value comment identifiers’ like you mentioned?
Can you provide a source example?

You can import such a document and try to restore the comment order.

NodeCollection comments = doc.GetChildNodes(NodeType.Comment, true);
for (int i = 0; i < comments.Count - 1; i++)
{
    Comment cmt1 = comments[i] as Comment;
    Comment cmt2 = comments[i + 1] as Comment;
    if (cmt1.Id > cmt2.Id)
    {
        CompositeNode parent = cmt1.ParentNode;
        cmt1.Remove();
        parent.AppendChild(cmt1);
    }
}

Please consider the following code as a demonstration of the main idea.

Document doc = new Document("TestDoc_cmt.docx");

NodeCollection comments = doc.GetChildNodes(NodeType.Comment, true);
// Let's generate real identifiers.
int id = 0;
foreach (Comment cmt in comments)
{
    Run run = new Run(doc, "C" + id);
    run.Font.Hidden = true;
    cmt.FirstParagraph.InsertBefore(run, cmt.FirstParagraph.FirstChild);
    id++;
}
// Let's record the real identifier and the text of the first comment in the docx document.
string realIdDocx = ((Comment)comments[0]).FirstParagraph.FirstRun.Text;
string commentText = comments[0].GetText();
// Let's reload document from html.
doc.Save("TestDoc_cmt.html");
Document doc1 = new Document("TestDoc_cmt.html");

string realIdHtml = string.Empty;
comments = doc1.GetChildNodes(NodeType.Comment, true);
// Let's get the identifiers from the text.
foreach (Comment cmt in comments)
{
    if (cmt.GetText().Equals(commentText))
    {
        Run idRun = cmt.FirstParagraph.FirstChild as Run;
        if (idRun != null && idRun.Font.Hidden)
            realIdHtml = idRun.Text;
    }
}

Assert.AreEqual(realIdDocx, realIdHtml);

TestDoc_cmt.docx (10.6 KB)

Thank you.
This is helpful.
Let me check out the code and see if I have any questions.

@rogerC123 Please feel free to ask in case of any issues, we will be glad to help you.