Retrieve Paragraphs in order of appearance in the document

natasjadegroot · March 13, 2025, 3:12pm

Dear,

I wonder if there is a way to retrieve the paragraphs in a document in the order they appear in the document.

So first all paragraphs from page 1 (in display order), then the paragraphs from page 2, and so on.
The attached document is only a single page. With the code below, the paragraphs are NOT returned in the order in which they appear in the document.
I used the following code:

@Test
	public void retrieveAllParagraphsInOrder_Test() throws Exception {
		Document doc = new Document("documents/fr-doc.docx");
		LayoutCollector collector = new LayoutCollector(doc);
		LayoutEnumerator enumerator = new LayoutEnumerator(doc);
		
		assertNotNull(doc);
		NodeCollection allParagraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
		for (Paragraph p : (Iterable<Paragraph>) allParagraphs) {
			System.out.println(paragraphText(p));
		}
	}
	
	
	
	@SuppressWarnings({ "rawtypes", "unchecked" })
    private String paragraphText(final Paragraph par) {
        if (par == null) {
            return "";
        }

        NodeCollection runs = par.getChildNodes(NodeType.RUN, false);
        StringBuilder builder = new StringBuilder();

        for (Run run : (Iterable<Run>) runs) {
            builder.append(getText(run));
        }

        return builder.toString().trim();
    }
	
	private String getText(final Node node) {
        if (node == null) {
            return "";
        }

        String text;
        try {
            text = node.toString(SaveFormat.TEXT);
        } catch (Exception e) {
            text = node.getRange().getText();
        }

        return text;
    }

I know there are the LayoutCollector and LayoutEnumerator classes, but how would they work across multiple pages? The Rectangle returned by the LayoutEnumerator does not give any information about the page.

The code and document are attached.
Many thanks for your support

Regards
RetrieveParagraphs.zip (5.3 KB)

Professionalize.Discourse · March 13, 2025, 3:13pm

@natasjadegroot

To retrieve paragraphs in the order they appear in a document using Aspose.Words, you can utilize the LayoutCollector and LayoutEnumerator classes effectively. These classes allow you to traverse the layout entities of the document, which can help you maintain the visual order of paragraphs across multiple pages.

Here’s a sample approach to achieve this:

Initialize the Document and Layout Collector: Load your document and create instances of LayoutCollector and LayoutEnumerator.
Update Page Layout: Call the UpdatePageLayout method to ensure that the layout information is accurate.
Iterate Through Layout Entities: Use the LayoutEnumerator to traverse the layout entities in the document. You can check the type of each entity and collect paragraphs in the order they appear.

Here’s a code example demonstrating this approach:

import com.aspose.words.*;

public class RetrieveParagraphsInOrder {
    public static void main(String[] args) throws Exception {
        Document doc = new Document("path/to/your/document.docx");
        LayoutCollector layoutCollector = new LayoutCollector(doc);
        LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
        
        // Update the layout to get accurate page information
        doc.updatePageLayout();

        // Iterate through the layout entities
        layoutEnumerator.reset();
        while (layoutEnumerator.moveNext()) {
            if (layoutEnumerator.getType() == LayoutEntityType.SPAN) {
                Node node = layoutCollector.getEntity(layoutEnumerator.getCurrent());
                if (node.getNodeType() == NodeType.PARAGRAPH) {
                    Paragraph paragraph = (Paragraph) node;
                    System.out.println(paragraph.getText().trim());
                }
            }
        }
    }
}

Explanation:

LayoutEnumerator: This class allows you to traverse the layout entities in a document. By moving through the entities, you can access paragraphs in the order they are visually rendered.
LayoutEntityType.SPAN: This type represents text spans, which are parts of paragraphs. You can check if the current entity is a span and then retrieve the corresponding paragraph.
getEntity: This method retrieves the node associated with the layout entity, allowing you to check if it is a paragraph and print its text.

This method ensures that you retrieve paragraphs in the order they appear in the document, including across multiple pages.

For more detailed information on using LayoutCollector and LayoutEnumerator, you can refer to the Aspose documentation on layout entities and their traversal methods.

If you have any further questions or need additional assistance, feel free to ask!

natasjadegroot · March 13, 2025, 4:50pm

Professionalize.Discourse:

LayoutCollector layoutCollector = new LayoutCollector(doc);
        LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
        
        // Update the layout to get accurate page information
        doc.updatePageLayout();

        // Iterate through the layout entities
        layoutEnumerator.reset();
        while (layoutEnumerator.moveNext()) {
            if (layoutEnumerator.getType() == LayoutEntityType.SPAN) {
                Node node = layoutCollector.getEntity(layoutEnumerator.getCurrent());
                if (node.getNodeType() == NodeType.PARAGRAPH) {
                    Paragraph paragraph = (Paragraph) node;
                    System.out.println(paragraph.getText().trim());
                }
            }
        }

Dear,

The code you provided does not return the paragraphs in the order in which they appear in the document.
In fact, the code as provided does not even compile: LayoutEnumerator.getCurrent returns an Object,
while the LayoutCollector.getEntity method expects a Node.

After I fixed this and ran the code, nothing was printed to the console.

I would like to ask you to validate the code you provided on my submitted document. Then you can see the problem for yourself.

Many thanks.

natasjadegroot · March 13, 2025, 4:52pm

Also, your answer talks about the layout type SPAN, which only applies to text. What if there are images or tables in the document?

I would like to understand how I can retrieve all items of the document in the order in which they appear visually.

alexey.noskov · March 13, 2025, 8:18pm

@natasjadegroot doc.getChildNodes(NodeType.PARAGRAPH, true) returns paragraphs in the order they occur in the document object model, i.e. in document order. However, in your document the paragraphs are floating text frames, so their visual order does not match their position in the DOM. I am afraid there is no easy way to get paragraphs in visual order.
You are right, you can use LayoutCollector and LayoutEnumerator to get the actual location of nodes.

As for page information, the LayoutEnumerator does provide it: you can use the LayoutEnumerator.PageIndex property.

natasjadegroot · March 13, 2025, 9:32pm

Is there a way to retrieve the floating text frames? I could not find any method in Document to do so.

Furthermore, when I call LayoutEnumerator.getRectangle() after setting the enumerator's current entity to a multi-line paragraph, I get back X and Y coordinates plus a width and height.
The height I get back for the four-line paragraph is smaller than the height I get back for any of the one-line paragraphs in the document.
How can this be?
Maybe I’m missing something.
Many thanks for your support.

alexey.noskov · March 14, 2025, 5:38am

@natasjadegroot

The behavior is correct. The entity returned by the LayoutCollector.getEntity method for a Paragraph node is the paragraph break span, so the rectangle returned by LayoutEnumerator is the rectangle occupied by that paragraph break span.

To calculate the actual bounding box of a paragraph, you can wrap the paragraph in a bookmark, calculate the coordinates of the bookmark start and end, and take the union of these two rectangles. Something like this:

Document doc = new Document("C:\\Temp\\in.docx");

Iterable<Paragraph> paragraphs = (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true);
// Wrap each paragraph in a temporary bookmark.
int bookmarkIndex = 0;
for (Paragraph p : paragraphs)
{
    // Skip paragraphs in header/footer and in shapes.
    if (p.getAncestor(NodeType.HEADER_FOOTER) != null || p.getAncestor(NodeType.SHAPE) != null)
        continue;

    String bkName = "tmp_bk_" + bookmarkIndex++;
    p.prependChild(new BookmarkStart(doc, bkName));
    p.appendChild(new BookmarkEnd(doc, bkName));
}

LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);
for (Paragraph p : paragraphs)
{
    Bookmark wrappingBookmark = null;
    for (Bookmark bk : p.getRange().getBookmarks())
    {
        if (bk.getName().startsWith("tmp_bk_"))
        {
            wrappingBookmark = bk;
            break;
        }
    }

    if (wrappingBookmark == null)
        continue;

    enumerator.setCurrent(collector.getEntity(wrappingBookmark.getBookmarkStart()));
    Rectangle2D start = enumerator.getRectangle();

    enumerator.setCurrent(collector.getEntity(wrappingBookmark.getBookmarkEnd()));
    Rectangle2D end = enumerator.getRectangle();

    System.out.println(start);
    System.out.println(end);
    System.out.println("===============================");

    // Remove the temporary bookmark.
    wrappingBookmark.remove();
}
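
The snippet above prints the start and end rectangles separately. A minimal sketch of the union step, assuming both rectangles are on the same page and are java.awt.geom.Rectangle2D instances as in the code above; the class name ParagraphBounds, the helper boundingBox, and the coordinates are hypothetical and only illustrate the idea:

```java
import java.awt.geom.Rectangle2D;

public class ParagraphBounds {

    /**
     * Returns the smallest rectangle containing both the bookmark-start
     * and the bookmark-end rectangle (hypothetical helper, not part of Aspose.Words).
     */
    static Rectangle2D boundingBox(Rectangle2D start, Rectangle2D end) {
        return start.createUnion(end);
    }

    public static void main(String[] args) {
        // Made-up coordinates: start of the first line and end of the last line.
        Rectangle2D start = new Rectangle2D.Double(72.0, 100.0, 0.0, 14.0);
        Rectangle2D end = new Rectangle2D.Double(310.0, 142.0, 0.0, 14.0);

        Rectangle2D box = boundingBox(start, end);
        System.out.println("x=" + box.getX() + " y=" + box.getY()
                + " width=" + box.getWidth() + " height=" + box.getHeight());
    }
}
```

The vertical extent of the union covers all lines between the two bookmark positions. The horizontal extent only reflects where the first line starts and the last line ends, so it may be narrower than the widest line of the paragraph.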

You can determine whether a paragraph is a text frame using the FrameFormat.isFrame property:

Document doc = new Document("C:\\Temp\\in.docx");

Iterable<Paragraph> paragraphs = (Iterable<Paragraph>)doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph p : paragraphs)
{
    if (p.getFrameFormat().isFrame())
    {
        System.out.println("HorizontalPosition: " + p.getFrameFormat().getHorizontalPosition());
        System.out.println("VerticalPosition: " + p.getFrameFormat().getVerticalPosition());
        System.out.println("Width: " + p.getFrameFormat().getWidth());
        System.out.println("Height: " + p.getFrameFormat().getHeight());
        System.out.println("===========================");
    }
}

natasjadegroot · March 14, 2025, 9:27am

@alexey.noskov,

I understand what you are hinting at with the bookmarks.
However, how can this work with paragraphs spanning multiple pages, or with multi-page tables (which are in fact not uncommon)?
Take a paragraph starting close to the end of one page and ending close to the beginning of the next.
This would create a rectangle which is basically a complete page.

This rectangle would then be of little use for determining whether some element is above or below another element.

Thanks for advising.

alexey.noskov · March 14, 2025, 9:37am

@natasjadegroot You can use the LayoutEnumerator.PageIndex property to determine the index of the page on which the current entity is located.
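
One possible way to turn the page index and rectangle into a reading order is to sort by page first, then by the top coordinate, then by the left coordinate. The following is only a sketch under assumptions: the Placed class and sortVisually method are hypothetical, and the values are made up; in practice they would be filled from the page index and rectangle that LayoutEnumerator reports for the start of each paragraph. Keying on the start position means a paragraph that continues onto the next page is still ordered by where it begins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class VisualOrder {

    /** Hypothetical holder for the layout position where a paragraph starts. */
    static final class Placed {
        final String text;
        final int pageIndex; // page on which the paragraph starts
        final double top;    // Y coordinate of the start rectangle
        final double left;   // X coordinate of the start rectangle

        Placed(String text, int pageIndex, double top, double left) {
            this.text = text;
            this.pageIndex = pageIndex;
            this.top = top;
            this.left = left;
        }
    }

    /** Page first, then top to bottom, then left to right. */
    static final Comparator<Placed> READING_ORDER =
            Comparator.comparingInt((Placed p) -> p.pageIndex)
                    .thenComparingDouble(p -> p.top)
                    .thenComparingDouble(p -> p.left);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Placed> sortVisually(List<Placed> items) {
        List<Placed> sorted = new ArrayList<>(items);
        sorted.sort(READING_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Placed> items = Arrays.asList(
                new Placed("footer note", 1, 700.0, 72.0),
                new Placed("second page intro", 2, 80.0, 72.0),
                new Placed("title", 1, 60.0, 72.0),
                new Placed("right column", 1, 200.0, 320.0),
                new Placed("left column", 1, 200.0, 72.0));

        for (Placed p : sortVisually(items)) {
            System.out.println("page " + p.pageIndex + ": " + p.text);
        }
    }
}
```

Exact comparison of the top coordinate is a simplification. Items on the same visual line may differ by a fraction of a point, so a real implementation would probably compare with a small tolerance.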