Extract Images from Word document and also know the page no. on which the image is embedded/inserted

hitesh.khatri · September 13, 2024, 1:16pm

Our requirement is to extract images from a Word document and also determine the page number on which each image is embedded or inserted.

The issues I am facing are:

The Word document is a flow document, so the Aspose.Words SDK does not seem to provide the page number on which an image appears in the document.
I tried converting the Word document to PDF in memory, and I am now trying to do the same task with Aspose.PDF under a trial license.
I also tried other methods, such as inserting tag text in place of each image before converting the document to PDF and then extracting the images with their page numbers. However, the layout shifts and the text loses its page number context.

Any help with this would be appreciated.

Professionalize.Discourse · September 13, 2024, 1:17pm

@hitesh.khatri

To extract images from a Word document and determine the page number on which each image is embedded, you can use the LayoutCollector class in Aspose.Words. Here’s a step-by-step approach:

Load the Document: Start by loading your Word document using the Document class.
Use LayoutCollector: The LayoutCollector class builds the page layout of the document and maps each node to the page or pages it is laid out on.
Iterate Through Shapes: You can iterate through the shapes in the document to find images. Each image is represented as a Shape object.
Get Page Number: For each shape, you can use the LayoutCollector to get the page number where the shape (image) is located.
Extract Image Data: You can extract the image data from each shape and save it to a file or process it as needed.

Here’s a code example demonstrating this process in C#:

using Aspose.Words;
using Aspose.Words.Drawing;
using Aspose.Words.Layout;
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Load the document
        Document doc = new Document("YourDocument.docx");

        // Create a LayoutCollector to get layout information
        LayoutCollector layoutCollector = new LayoutCollector(doc);

        // Iterate through all shapes in the document
        int imageIndex = 0;
        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            // Check if the shape contains image data
            if (shape.HasImage)
            {
                // Get the 1-based index of the page where the image starts
                int pageNumber = layoutCollector.GetStartPageIndex(shape);
                Console.WriteLine($"Image found on page: {pageNumber}");

                // Save the image using the extension of its stored format; the running
                // index keeps several images on one page from overwriting each other
                string extension = FileFormatUtil.ImageTypeToExtension(shape.ImageData.ImageType);
                string imageFileName = $"Image_Page_{pageNumber}_{imageIndex++}{extension}";
                shape.ImageData.Save(imageFileName);
                Console.WriteLine($"Image saved as: {imageFileName}");
            }
        }
    }
}

Explanation of the Code:

The Document class is used to load the Word document.
The LayoutCollector is instantiated to gather layout information.
The code iterates through all shapes in the document, checking if they are images.
For each image, it retrieves the page number using GetStartPageIndex, which returns the 1-based index of the page on which the node begins.
Finally, it saves the image data to a specified file.

This approach should help you extract images along with their corresponding page numbers effectively. If you encounter any issues or need further assistance, feel free to ask!

Sources:

Rendering Shapes Separately from Document|Aspose.Words for .NET