While extracting content from one cell of table it is also extracting content from another cell

ELSSAM_elsevier_com · June 6, 2024, 7:50am

We are using the Aspose.Words library with Java to extract MS Word content.
Currently we are using aspose-words 23.7 and things are working fine. But after upgrading to 24.5 (the latest version), the width and height reported for table content increase significantly. We use the “layoutEnumerator.getRectangle()” method to get the bounds of an entity.
We use the attached sample docx file for extracting content and its layout, and this is where we see the discrepancy described above after the version upgrade.

Could you please assist us with this?
simple_word_doc_with_table_content_style.docx (13.2 KB)

alexey.noskov · June 6, 2024, 8:39am

@ELSSAM_elsevier_com As far as I can see, the bounding box of the table is calculated properly using the latest 24.5 version of Aspose.Words. I have used the following code for testing:

import com.aspose.words.*;
import java.awt.Color;
import java.awt.geom.Rectangle2D;

Document doc = new Document("C:\\Temp\\in.docx");
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

// Calculate the table bounding box and draw a rectangle around it
// to make sure the rectangle is calculated properly.
Iterable<Table> tables = doc.getChildNodes(NodeType.TABLE, true);
for (Table t : tables)
{
    // Skip tables which are in headers/footers (LayoutCollector and LayoutEnumerator classes do not work with header/footer nodes)
    if (t.getAncestor(NodeType.HEADER_FOOTER) != null)
        continue;

    // Move LayoutEnumerator to the first row
    enumerator.setCurrent(collector.getEntity(t.getFirstRow().getFirstCell().getFirstParagraph()));
    while (enumerator.getType() != LayoutEntityType.ROW)
        enumerator.moveParent();

    // Get rectangle of the first row of the table.
    Rectangle2D first_rect = enumerator.getRectangle();

    // Do the same with last row
    enumerator.setCurrent(collector.getEntity(t.getLastRow().getFirstCell().getFirstParagraph()));
    while (enumerator.getType() != LayoutEntityType.ROW)
        enumerator.moveParent();

    // Get rectangle of the last row in the table.
    Rectangle2D last_rect = enumerator.getRectangle();
    // Union of the rectangles is the bounding box of the table.
    Rectangle2D result_rect = first_rect.createUnion(last_rect);

    // Create a shape.
    Shape shapeRect = new Shape(doc, ShapeType.RECTANGLE);
    shapeRect.setFilled(false);
    shapeRect.setStroked(true);
    shapeRect.getStroke().setColor(Color.RED);
    shapeRect.getStroke().setWeight(2);
    shapeRect.setWidth(result_rect.getWidth());
    shapeRect.setHeight(result_rect.getHeight());
    shapeRect.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
    shapeRect.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);
    shapeRect.setLeft(result_rect.getX());
    shapeRect.setTop(result_rect.getY());
    ((Paragraph)t.getNextSibling()).appendChild(shapeRect);
}

doc.save("C:\\temp\\out.docx");

The code draws the table's bounding box as a red rectangle over the table in the output document.
out.docx (10.7 KB)

ELSSAM_elsevier_com · June 12, 2024, 10:00am

Hello @alexey.noskov ,

As we don’t get the layout directly, we do the following:

1. Add a Bookmark for each cell of the table.
2. Extract layout and content later by providing the BookmarkStart.
3. Extract layout and content by providing the BookmarkEnd.
4. Combine the results of #2 and #3.

But when extracting layout and content by providing the BookmarkEnd node for Row1 and cell 1, it also extracts layout and content from Row2 and cell 2. This is not expected, and it changes the whole bounding box.
The condition we use for extraction is “layoutEnumerator.getType() == LayoutEntityType.SPAN”, and the method is “layoutEnumerator.moveNext()”.

alexey.noskov · June 12, 2024, 12:31pm

@ELSSAM_elsevier_com LayoutEntityType.SPAN is a portion of text, so the bounds of this entity do not occupy the whole cell. As you can see in the code provided above, LayoutEntityType.ROW is used to calculate the bounds of the first and the last row of the table. The union of these rectangles then gives the area occupied by the whole table.
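
The union step can be illustrated with plain geometry. The sketch below uses only `java.awt.geom.Rectangle2D`; the class name `TableBoundsSketch`, the helper `tableBounds` and all coordinates are hypothetical and are not taken from the sample document. It shows that the union of the first and last row rectangles covers the whole table, while a span inside a cell covers only a small part of it.

```java
import java.awt.geom.Rectangle2D;

public class TableBoundsSketch {
    // Union of the first-row and last-row rectangles. This equals the table
    // bounds as long as no intermediate row is wider than these two.
    static Rectangle2D tableBounds(Rectangle2D firstRow, Rectangle2D lastRow) {
        return firstRow.createUnion(lastRow);
    }

    public static void main(String[] args) {
        // Hypothetical layout values, in points.
        Rectangle2D firstRow = new Rectangle2D.Double(72.0, 90.0, 450.0, 20.0);
        Rectangle2D lastRow = new Rectangle2D.Double(72.0, 110.0, 450.0, 35.0);
        // A span is just a run of text inside one cell.
        Rectangle2D span = new Rectangle2D.Double(78.0, 93.0, 24.0, 14.0);

        Rectangle2D table = tableBounds(firstRow, lastRow);
        System.out.println("Table height: " + table.getHeight());       // 55.0
        System.out.println("Span inside table: " + table.contains(span)); // true
        System.out.println("Span width vs table width: "
                + span.getWidth() + " vs " + table.getWidth());
    }
}
```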

ELSSAM_elsevier_com · June 13, 2024, 7:33am

Hello @alexey.noskov ,

I have created a minimal POC to demonstrate the issue; it can be downloaded from here: “aspose_bounding_box_demo.zip - Google Drive”
I added comments in the “LayoutExtractor” constructor in the attached code which explain the issue.
Adding the same details here as well:

/*
  - With Aspose version 24.5
  - For the first cell only, allLineLayouts.get(1).getTokenLayouts().size() is 12, which includes tokens from Row 2 and cell 1. This is not expected.
  - With Aspose version 23.7, it does not extract content from Row 2 and cell 1, and allLineLayouts.get(1).getTokenLayouts().size() is less than 12.
*/

alexey.noskov · June 13, 2024, 6:25pm

@ELSSAM_elsevier_com Thank you for the additional information. Unfortunately, it is still not quite clear what the problem is. As far as I can see, Aspose.Words returns the correct layout and calculates correct bounding boxes of elements. I have modified the code provided above to check the bounding boxes of the cells in the table:

import com.aspose.words.*;
import java.awt.Color;
import java.awt.geom.Rectangle2D;

Document doc = new Document("C:\\Temp\\in.docx");
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

// Calculate the bounding box of each cell and draw a rectangle around it
// to make sure the rectangle is calculated properly.
Iterable<Table> tables = doc.getChildNodes(NodeType.TABLE, true);
for (Table t : tables)
{
    // Skip tables which are in headers/footers (LayoutCollector and LayoutEnumerator classes do not work with header/footer nodes)
    if (t.getAncestor(NodeType.HEADER_FOOTER) != null)
        continue;

    for (Row r : t.getRows())
    {
        for (Cell c : r.getCells())
        {
            enumerator.setCurrent(collector.getEntity(c.getFirstParagraph()));
            while (enumerator.getType() != LayoutEntityType.CELL)
                enumerator.moveParent();

            Rectangle2D cellRect = enumerator.getRectangle();

            // Create a shape.
            Shape shapeRect = new Shape(doc, ShapeType.RECTANGLE);
            shapeRect.setFilled(false);
            shapeRect.setStroked(true);
            shapeRect.getStroke().setColor(Color.RED);
            shapeRect.getStroke().setWeight(2);
            shapeRect.setWidth(cellRect.getWidth());
            shapeRect.setHeight(cellRect.getHeight());
            shapeRect.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
            shapeRect.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);
            shapeRect.setLeft(cellRect.getX());
            shapeRect.setTop(cellRect.getY());
            ((Paragraph)t.getNextSibling()).appendChild(shapeRect);
        }
    }
}

doc.save("C:\\temp\\out.docx");

out.docx (10.8 KB)

It looks like your code is collecting layout of nodes in the document and the problem is that collected data differs from the data returned by an old version, right? This might occur because internal document layout model has been changed due to some fixes or improvements made between the versions.

ELSSAM_elsevier_com · June 14, 2024, 6:14am

Hello @alexey.noskov,

I have now changed the issue title, since earlier the exact part of the code causing the issue was unknown.
Right now the main issue is: “While extracting content from only Row1 cell1, it also extracts content from Row2 cell1 after the version upgrade”.
Let me know if you need more details on this.

alexey.noskov · June 14, 2024, 7:41am

@ELSSAM_elsevier_com Unfortunately, I cannot reproduce the problem on my side. I have used the following simple code for testing:

Document doc = new Document("C:\\Temp\\in.docx");
Table table = doc.getFirstSection().getBody().getTables().get(0);
// Get content of the first cell in the first row.
System.out.println(table.getRows().get(0).getCells().get(0).toString(SaveFormat.TEXT).trim());

ELSSAM_elsevier_com · June 14, 2024, 8:34am

Hello @alexey.noskov ,

Why don’t you try the POC we shared to replicate the issue, and run/debug that code? I have also added comments in the code, as I mentioned in my earlier comment.

alexey.noskov · June 14, 2024, 5:21pm

@ELSSAM_elsevier_com Unfortunately, it is not quite clear how to reproduce the problem using the provided code. The code I provided above checks that the content of the first cell is extracted properly. So most likely there is a mistake in your code, not in Aspose.Words. Since you are more familiar with your code, you are in a better position to debug and resolve it.

ELSSAM_elsevier_com · June 18, 2024, 6:59am

Hello @alexey.noskov , as per your comment above,

“It looks like your code is collecting layout of nodes in the document and the problem is that collected data differs from the data returned by an old version, right? This might occur because internal document layout model has been changed due to some fixes or improvements made between the versions.”
Elsevier: If document model has changed, then shouldn’t it be fixed to make it work like before?

alexey.noskov · June 18, 2024, 7:13am

@ELSSAM_elsevier_com

Document model has not been changed. It looks like you have encountered changes in the internal document layout model. So, no, it should not be fixed, since the changes have been made to fix some other issues. Aspose.Words provides very limited layout information in the public API, since the full document layout model is too complicated.

ELSSAM_elsevier_com · July 1, 2024, 11:45am

@alexey.noskov Using the simple_word_doc_with_table_content_style.docx file, can you add a bookmark to each cell by iterating through each row, and print the LineBounds of each Bookmark using versions 22.8 and 24.5?

Layout dimensions for Bookmarks using 22.8
java.awt.geom.Rectangle2D$Float[x=77.65,y=93.4,w=24.018,h=20.9]
java.awt.geom.Rectangle2D$Float[x=281.45,y=93.4,w=24.674,h=20.9]
Line layouts count: 4
java.awt.geom.Rectangle2D$Float[x=281.45,y=93.4,w=24.674,h=20.9]
java.awt.geom.Rectangle2D$Float[x=406.1,y=93.4,w=22.014,h=20.9]
Line layouts count: 4
java.awt.geom.Rectangle2D$Float[x=406.1,y=93.4,w=22.014,h=20.9]
java.awt.geom.Rectangle2D$Float[x=522.75,y=93.4,w=0.0,h=20.9]
Line layouts count: 2
java.awt.geom.Rectangle2D$Float[x=77.65,y=114.8,w=93.316,h=13.799]
java.awt.geom.Rectangle2D$Float[x=281.45,y=114.8,w=58.037,h=20.9]
Line layouts count: 4
java.awt.geom.Rectangle2D$Float[x=281.45,y=114.8,w=58.037,h=20.9]
java.awt.geom.Rectangle2D$Float[x=406.1,y=114.8,w=84.482994,h=34.834]
Line layouts count: 6
java.awt.geom.Rectangle2D$Float[x=406.1,y=114.8,w=84.482994,h=34.834]
java.awt.geom.Rectangle2D$Float[x=522.75,y=114.8,w=0.0,h=34.834]
Line layouts count: 2

Layout dimensions for Bookmarks using 24.5
java.awt.geom.Rectangle2D$Float[x=77.65,y=93.4,w=24.018,h=20.9]
java.awt.geom.Rectangle2D$Float[x=281.45,y=93.4,w=24.674,h=20.9]
Line layouts count: 4
java.awt.geom.Rectangle2D$Float[x=281.45,y=93.4,w=24.674,h=20.9]
java.awt.geom.Rectangle2D$Float[x=406.1,y=93.4,w=22.014,h=20.9]
Line layouts count: 4
java.awt.geom.Rectangle2D$Float[x=406.1,y=93.4,w=22.014,h=20.9]
java.awt.geom.Rectangle2D$Float[x=77.65,y=114.8,w=93.316,h=13.799]
Line layouts count: 8
java.awt.geom.Rectangle2D$Float[x=77.65,y=114.8,w=93.316,h=13.799]
java.awt.geom.Rectangle2D$Float[x=281.45,y=114.8,w=58.037,h=20.9]
Line layouts count: 4
java.awt.geom.Rectangle2D$Float[x=281.45,y=114.8,w=58.037,h=20.9]
java.awt.geom.Rectangle2D$Float[x=406.1,y=114.8,w=84.482994,h=34.834]
Line layouts count: 6
java.awt.geom.Rectangle2D$Float[x=406.1,y=114.8,w=84.482994,h=34.834]
java.awt.geom.Rectangle2D$Float[x=72.0,y=150.134,w=0.0,h=20.9]
Line layouts count: 2

Both versions report the same bookmark start and end bounds, except for the 3rd column of the first and second rows.
The dimensions returned for the 3rd column differ between the old and new versions, even though both place the bookmark end at the same location (verified by adding the bookmark in the old version and checking it in the new version, and vice versa).

I noticed that you compared the first cell of the first row in the example above. Can you run the same comparison for the 3rd column of the first and second rows? The bookmark end is not pointing to the right place.
PS: I verified the same with basic Aspose.Words code that adds a bookmark and retrieves its line layout.
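
The difference can be located mechanically. The following is a minimal sketch that assumes the second rectangle printed for each bookmark is the bounds at the bookmark end; the class name `BoundsDiffSketch` and its helpers are hypothetical. It copies those six rectangles from each listing above and reports the positions where the two versions disagree.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoundsDiffSketch {
    static Rectangle2D r(float x, float y, float w, float h) {
        return new Rectangle2D.Float(x, y, w, h);
    }

    // Second rectangle of each pair, as printed with 22.8.
    static List<Rectangle2D> endsOld() {
        return Arrays.asList(
            r(281.45f, 93.4f, 24.674f, 20.9f),
            r(406.1f, 93.4f, 22.014f, 20.9f),
            r(522.75f, 93.4f, 0.0f, 20.9f),
            r(281.45f, 114.8f, 58.037f, 20.9f),
            r(406.1f, 114.8f, 84.482994f, 34.834f),
            r(522.75f, 114.8f, 0.0f, 34.834f));
    }

    // Second rectangle of each pair, as printed with 24.5.
    static List<Rectangle2D> endsNew() {
        return Arrays.asList(
            r(281.45f, 93.4f, 24.674f, 20.9f),
            r(406.1f, 93.4f, 22.014f, 20.9f),
            r(77.65f, 114.8f, 93.316f, 13.799f),
            r(281.45f, 114.8f, 58.037f, 20.9f),
            r(406.1f, 114.8f, 84.482994f, 34.834f),
            r(72.0f, 150.134f, 0.0f, 20.9f));
    }

    // Returns the zero-based indices at which the two lists disagree.
    static List<Integer> differingIndices(List<Rectangle2D> a, List<Rectangle2D> b) {
        List<Integer> diff = new ArrayList<>();
        for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
            if (!a.get(i).equals(b.get(i))) {
                diff.add(i);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // Indices 2 and 5 are the 3rd cell of row 1 and of row 2.
        System.out.println(differingIndices(endsOld(), endsNew())); // prints [2, 5]
    }
}
```

In 24.5 the third entry coincides with the start of row 2, cell 1 (x=77.65, y=114.8), which matches the observation that content from the next row is being picked up.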

vyacheslav.deryushev · July 1, 2024, 5:28pm

@ELSSAM_elsevier_com We will check the issue and get back to you.

vyacheslav.deryushev · July 8, 2024, 9:23am

@ELSSAM_elsevier_com
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27171

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

ELSSAM_elsevier_com · July 10, 2024, 7:24am

Thanks for your update @vyacheslav.deryushev.
We will wait for this issue to be fixed.

vyacheslav.deryushev · July 15, 2024, 10:07am

@ELSSAM_elsevier_com We have finished analyzing WORDSNET-27171 and have concluded that this is not a bug. This behavior was due to block-level bookmarks that were not supported and handled consistently, which sometimes caused exceptions in the code. But now they have been moved to the inline level.

In the code, the bookmark is inserted at block level, around the cell. This causes the end of the bookmark to fall into the first paragraph of the next cell. Cell 3 is the last cell in the row, so the end of the bookmark goes into the first paragraph of cell 4.

Before the change, the bookmark remained after cell 3, which, logically, is before the row break, therefore in the same row. This was favorable to the logic of the application, which expected the bookmark to appear after the cell. However, this arrangement was inconsistent because the bookmark was not embedded in the row, which caused problems.

To work around this problem, make sure bookmarks are inserted inline, inside the paragraph.
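
A minimal sketch of what inline insertion could look like is shown below. It is not the customer's code: the class name, the `cell_N` bookmark naming scheme and the file paths are assumptions for illustration, and it needs the Aspose.Words for Java library on the classpath. The idea is that both the BookmarkStart and the BookmarkEnd become children of paragraphs inside the same cell, so the end cannot fall into the next cell.

```java
import com.aspose.words.*;

public class InlineCellBookmarks {
    public static void main(String[] args) throws Exception {
        Document doc = new Document("C:\\Temp\\in.docx");
        Table table = doc.getFirstSection().getBody().getTables().get(0);

        int index = 0;
        for (Row row : table.getRows()) {
            for (Cell cell : row.getCells()) {
                // Make sure the cell has a paragraph to hold the bookmark nodes.
                cell.ensureMinimum();

                // Hypothetical naming scheme; bookmark names must be unique.
                String name = "cell_" + index++;

                // Inline placement: the start is the first child of the cell's
                // first paragraph, the end is the last child of its last paragraph.
                cell.getFirstParagraph().prependChild(new BookmarkStart(doc, name));
                cell.getLastParagraph().appendChild(new BookmarkEnd(doc, name));
            }
        }

        doc.save("C:\\Temp\\out_bookmarked.docx");
    }
}
```

With the bookmarks placed this way, the layout for a BookmarkEnd node is looked up inside the cell it belongs to, in the 3rd column as well.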

ELSSAM_elsevier_com · July 24, 2024, 9:06am

Thank you for the suggestion, will make the change and test it.

ELSSAM_elsevier_com · September 23, 2024, 4:52am

Thank you, this suggestion has worked for us!