Extract all formatted content from a Word document that has tracked changes using Java | Get number of revisions

Hi Team,
I need to extract all formatted content from a Word document that has tracked changes.
I am able to extract the revisions, but the result does not include the formatting details, such as bold, italic, or strikethrough…

I also need to get the page number using the code below.

Document doc = new Document(documentNameAndPath);
RevisionCollection revisionCollections = doc.getRevisions();
revisionCollections.getGroups();
for (Revision revision : revisionCollections)
{
	stopIteration++;
	JSONObject revisionDataJsonObject = new JSONObject();
	JSONObject formattedDataJsonObject = new JSONObject();

	String revisionText = revision.getParentNode().getText();
	revisionDataJsonObject.put("revisionText", revisionText);
	formattedDataJsonObject.put("revisionText", revisionText);

	String revisionAuthor = revision.getAuthor();
	revisionDataJsonObject.put("revisionAuthor", revisionAuthor);
	formattedDataJsonObject.put("revisionAuthor", revisionAuthor);
	String revisionPostedAt = "";
	SimpleDateFormat sdf = new SimpleDateFormat("dd-MMM-yyyy HH:mm");
	Date revisionDate = revision.getDateTime();
	sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
	if (revisionDate != null && !revisionDate.toString().isEmpty())
	{
		revisionPostedAt = sdf.format(revisionDate);
		revisionDataJsonObject.put("revisionDate", revisionPostedAt);
		formattedDataJsonObject.put("revisionDate", revisionPostedAt);
	}

	int revisionModeInt = revision.getRevisionType();
	String revisionMode = "";
	if (revisionModeInt == 0)
	{
		revisionMode = "INSERTION";
	}
	else if (revisionModeInt == 1)
	{
		revisionMode = "DELETION";
	}
	else if (revisionModeInt == 2)
	{
		revisionMode = "FORMAT_CHANGE";

	}
	else if (revisionModeInt == 3)
	{
		revisionMode = "STYLE_DEFINITION_CHANGE";
	}
	else if (revisionModeInt == 4)
	{
		revisionMode = "MOVING";
	}
	revisionDataJsonObject.put("revisionMode", revisionMode);
	formattedDataJsonObject.put("revisionMode", revisionMode);

	revisionJsonArray.add(0, formattedDataJsonObject);
	revisionJsonArray.add(1, revisionDataJsonObject);
}

Regards,
Mamtha.A.C.D.

@HAREEM_HCL_COM,

The following code should tell you which text in the document has format revisions and what those format changes actually are.

Document doc = new Document("E:\\temp\\in.docx");

for (Revision rev : doc.getRevisions()) {
    if (rev.getRevisionType() == RevisionType.FORMAT_CHANGE) {
        System.out.println(rev.getGroup().getText());
        com.aspose.words.Node node = rev.getParentNode();
        if (node.getNodeType() == NodeType.RUN) {
            Run run = (Run) node;
            System.out.println("Text --> " + run.getText());
        }
    }
} 
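
As a side note on the type check above: the first post turns the numeric revision type into a label through a chain of comparisons. A minimal, library-free sketch of keeping that code-to-name mapping in one place could look like the following. The class RevisionTypeNames and the method nameOf are hypothetical names, not part of Aspose.Words, and the integer codes are simply those listed in the first post rather than values checked against the API reference.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Aspose.Words.
final class RevisionTypeNames {

    // Codes copied from the first post in this thread (assumption: they
    // match the values returned by Revision.getRevisionType()).
    private static final Map<Integer, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put(0, "INSERTION");
        NAMES.put(1, "DELETION");
        NAMES.put(2, "FORMAT_CHANGE");
        NAMES.put(3, "STYLE_DEFINITION_CHANGE");
        NAMES.put(4, "MOVING");
    }

    // Returns the label for a code, or a visible marker for unknown codes.
    static String nameOf(int code) {
        String name = NAMES.get(code);
        return name != null ? name : "UNKNOWN(" + code + ")";
    }

    public static void main(String[] args) {
        System.out.println(nameOf(2));
        System.out.println(nameOf(9));
    }
}
```

Unlike the original chain, which leaves the label empty for an unrecognised code, this sketch returns a visible UNKNOWN marker so that unexpected values are easier to spot in the JSON output.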

Attachment: input-document.zip (9.8 KB)

Hi Team,
I need assistance with Java code.

How can I test whether the content was formatted or has tracked changes, in order to get the formatting details?

Where in the code should I check for the properties stated above, like …

Revision.ParentStyle property, Revision.StyleDefinitionChange revision type.

Regards,
Mamtha.A.C.D.

Hi Tahir,
I have a question. I need to extract only the formatted content in a Word document, along with the tracked changes.
How can I extract the details, for example that the font was bold, italic, or highlighted?

I am able to extract the revisions, but they do not return the page number or the formatting type…
Please assist.

Regards,
Mamtha.A.C.D.

Hi Tahir,
I need code to extract the formatted content and what formatting was applied to the content, in addition to the code above.

How do I get the author and revision details from the Run object?

Regards,
Mamtha.A.C.D.

@HAREEM_HCL_COM,

And to get the page number of that node, please try using the following code:

Document doc = new Document("E:\\temp\\in.docx");

Run run = null;
for (Revision rev : doc.getRevisions()) {
    if (rev.getRevisionType() == RevisionType.FORMAT_CHANGE) {
        System.out.println(rev.getGroup().getText());
        com.aspose.words.Node node = rev.getParentNode();
        if (node.getNodeType() == NodeType.RUN) {
            run = (Run) node;
            System.out.println("Formatted --> " + run.getText());
            break;
        }
    }
}

if (run != null){
    LayoutCollector collector = new LayoutCollector(doc);
    System.out.println(collector.getStartPageIndex(run));
} 

Hope, this helps.

Hi, I already have this code.

rev.getGroup().getText() does not exist.
The Revision object does not have a method named ‘getGroup()’.
In addition, I need to know what kind of formatting change was made, such as bold, italic, strikeThrough, …

Hi, please respond.
I need assistance with this.
I see there are classes like ParagraphFormat that could be used to check the format.
It would be better if the API could return what type of changes were made during track changes.

Also, these lines of code take an excessive amount of time:
LayoutCollector collector = new LayoutCollector(doc);
System.out.println(collector.getStartPageIndex(run));

It takes close to one second per iteration, that is, one second per revision, so 60 revisions would take a minute. I have around 40 revisions, which is

@HAREEM_HCL_COM,

Please upgrade to the latest version of Aspose.Words for Java i.e. 19.7. Hope, this helps.

You can create the LayoutCollector instance once, before the loop, and then call the getStartPageIndex method inside the loop. This should reduce the time.

Thanks for the update. I had already figured out that the instance should be created outside the loop. My concern now is extracting what type of formatting was applied to the revised text.
a) I am unable to fetch it. I need to know what kind of formatting change was made, such as bold, italic, or strikeThrough.

b) rev.getGroup().getText() does not exist.
The Revision object does not have a method named ‘getGroup()’.

c) The code below sometimes returns a wrong page number, one for which no page exists. How can I resolve this?

LayoutCollector collector = new LayoutCollector(doc);
System.out.println(collector.getStartPageIndex(run));

d) How do I use LayoutEnumerator to get the page index? I tried it and it always returned ‘1’. Please assist.

Hi, are you there? Please assist.
a) The page index is wrong.
b) How do I get what type of format change was performed on the content?

@HAREEM_HCL_COM,

It seems that you are using an old version of Aspose.Words for Java. We suggest you please upgrade to the latest version of Aspose.Words for Java i.e. 19.7. Hope, this helps in resolving all these issues.

Hi Team,
I have the latest JAR.
I have a major request.
While extracting the tracked changes from the Revision objects, there are multiple entries for the same change.
I would like these to be consolidated, as they look like duplicate entries.
For instance, the document we are working on shows 2900 revisions in the Review Pane,
but Aspose extracts around 4000 entries instead of 2900. As I understand it,
every single SAVE action is being treated as a tracked change, hence Aspose returns a count of 4000.
I need logic to consolidate the revision count, excluding duplicates.

How can I achieve this?

Hi Awais,
As reported earlier, LayoutCollector is returning wrong page numbers, for pages that do not exist.
It is not just returning a wrong number; it is somehow returning a page number that does not exist in the document.
Please assist.

I took the latest JAR, aspose-words-19.7-jdk17.jar, and it still returns wrong page numbers that do not exist.

Regards,
Mamtha.A.C.D.

@HAREEM_HCL_COM,

Please ZIP and upload your input Word document (you are getting this problem with) here for testing. We will then investigate the issues on our end and provide you more information.

Hi Team,
I am afraid we are unable to share the confidential document, as it belongs to our client.

There are 3 concerns:

  1. Page number - a wrong page number is returned in a few instances, and a few of the returned page numbers do not exist.

  2. Track changes or revisions are duplicated: multiple entries for the same or similar changes are returned.
    Below is the code snippet:
    RevisionCollection revisionCollections = doc.getRevisions();

  3. We need to consolidate these multiple entries into a single revision, because the extracted data is reviewed for the final edition of the document and the multiple entries cause problems when finalizing it.
    Please advise how to consolidate revisions with multiple identical entries into a single entry. The Review pane shows fewer revisions, whereas extracting from the revision collection returns about 1.5 times as many.

Please assist.
Regards,
Mamtha.A.C.D.

@HAREEM_HCL_COM,

Regarding 1 & 2, as requested earlier, please ZIP and upload your input Word document (you are getting this problem with) here for testing. Unfortunately, it is difficult to say what the problem is without the document. We need your document to reproduce the problem on our end. Please note that it is safe to attach files in the forum. If you attach your document here, only you and Aspose staff members can download it. You can also remove any sensitive information by replacing it with dummy data instead.

Regarding 3, please provide the following resources here for testing:

  • Your simplified input Word document
  • Aspose.Words 19.7 generated output document showing the undesired behavior
  • Your expected document showing the correct output. You can create expected document by using MS Word. Please also list the steps that you performed in MS Word to create the expected document.

As soon as you get these pieces of information ready, we will start investigation into your issue and provide you more information. Thanks for your cooperation.

Hi Team, I am attaching the document herewith. Kindly assist.

I am pretty sure Aspose is returning duplicates when extracting revisions. Further investigation shows that most of the duplicates are caused by Run nodes. I checked the parent node of each revision and found that, for every revision whose parent node is a Run, there is another entry without a parent node. Even after filtering out all revisions with a Run as the parent node, I still see duplicates.

The page numbers are also incorrect for documents with a larger number of pages, beyond 100.

The attached document shows 29 revisions, but Aspose returns 206 revisions. After excluding Runs, it returns 96 revisions. Why is there such a large difference?

Further, the page number issue could not be reproduced with the dummy document I am sharing. Maybe you could try it at your end.
JavaIntro-CommentsAndTC.zip (187.3 KB)

Regards,
Mamtha.A.C.D.

Hi Team, please update.

@HAREEM_HCL_COM,

Please see the following screenshot:

https://i.imgur.com/5fwsrll.png

MS Word 2019 says that there are 8 pages and 16 revision groups in your Word document. The following Aspose.Words for Java 19.7 code returns 252 revisions, 13 revision groups and 10 pages.

Document doc = new Document("E:\\temp\\JavaIntro-CommentsAndTC\\JavaIntro-CommentsAndTC.docx");
System.out.println(doc.getRevisions().getCount()); // 252
System.out.println(doc.getRevisions().getGroups().getCount()); //13
System.out.println(doc.getPageCount()); // 10

For the sake of corrections, we have logged the following issues in our issue tracking system:

WORDSNET-18982: Incorrect number of revisions returned
WORDSNET-18983: PageCount returned is incorrect

We will further look into the details of these problems and will keep you updated on the status. We apologize for your inconvenience.
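
In the meantime, one way to approximate a grouped view on the caller's side is to merge consecutive entries that share the same author and revision type. The sketch below is library-free and only illustrates that heuristic. RevisionConsolidator, RevisionEntry and consolidate are hypothetical names, the fields are assumed to have been copied out of each revision beforehand, and nothing in this thread says that this is how Aspose.Words or MS Word build their revision groups, so the resulting counts may differ from both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of Aspose.Words.
final class RevisionConsolidator {

    // Hypothetical value type holding fields copied from each revision.
    static final class RevisionEntry {
        final String author;
        final int type;
        final String text;

        RevisionEntry(String author, int type, String text) {
            this.author = author;
            this.type = type;
            this.text = text;
        }
    }

    // Merges each run of consecutive entries with the same author and type
    // into one entry whose text is the concatenation of the run.
    static List<RevisionEntry> consolidate(List<RevisionEntry> entries) {
        List<RevisionEntry> merged = new ArrayList<>();
        for (RevisionEntry e : entries) {
            int last = merged.size() - 1;
            if (last >= 0 && sameChange(merged.get(last), e)) {
                RevisionEntry prev = merged.get(last);
                merged.set(last, new RevisionEntry(prev.author, prev.type, prev.text + e.text));
            } else {
                merged.add(e);
            }
        }
        return merged;
    }

    // Assumption: author plus type is enough to treat two neighbours as one change.
    private static boolean sameChange(RevisionEntry a, RevisionEntry b) {
        return a.type == b.type && Objects.equals(a.author, b.author);
    }

    public static void main(String[] args) {
        List<RevisionEntry> raw = Arrays.asList(
                new RevisionEntry("Alice", 0, "Hello "),
                new RevisionEntry("Alice", 0, "world"),
                new RevisionEntry("Bob", 1, "old text"),
                new RevisionEntry("Alice", 0, "again"));
        for (RevisionEntry e : consolidate(raw)) {
            System.out.println(e.author + " | " + e.type + " | " + e.text);
        }
    }
}
```

Only neighbouring entries are merged, so a later change by the same author stays separate. A stricter key, for example one that also compares the revision date, would merge less aggressively.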

@HAREEM_HCL_COM,

Regarding WORDSNET-18983, it seems you are expecting Aspose.Words layout to match “Simple Markup” MS Word review option. The attached screenshot in my previous post shows that “Simple Markup” is chosen in MS Word.

This option is not stored in the document. The option is a viewing option in MS Word and it may affect the document layout and the number of pages in the document as displayed in MS Word. This is not the default reviewing option in MS Word, so the default Aspose.Words layout options do not match it.

In order to emulate MS Word simple markup in Aspose.Words’ layout, the following options should be set on Document.LayoutOptions.RevisionOptions before updating the layout or requesting the page count:

Document doc = new Document("E:\\Temp\\JavaIntro-CommentsAndTC.docx");

doc.getLayoutOptions().getRevisionOptions().setShowRevisionMarks(false);
doc.getLayoutOptions().getRevisionOptions().setShowRevisionBars(true);
doc.getLayoutOptions().getRevisionOptions().setShowOriginalRevision(false);

System.out.println(doc.getPageCount()); // shows page count as 8

So, please use layout options to emulate MS Word simple markup. Hope, this helps.