Extract text from images in a Word document with Aspose.Words for Python via .NET

Hello,

Could I get help extracting text from images in a Word document and inserting it back into the document next to each image?

Please write it in Python with Aspose.Words for Python via .NET.

@ln22

To extract text from images in a Word document using Aspose.Words for Python via .NET and insert the extracted text next to the corresponding image, you can follow these steps:

  1. Load the Word Document: First, you need to load the Word document that contains the images.

  2. Extract Images and Text: Iterate through the shapes in the document to find images. For each image, you can use Optical Character Recognition (OCR) to extract text from the image.

  3. Insert Extracted Text: After extracting the text, you can insert it back into the document next to the image.

Here’s a sample code snippet demonstrating this process:

from aspose.words import Document
from aspose.words import NodeType
import pytesseract
from PIL import Image
import io

# Load the Word document
doc = Document("document.docx")

# Iterate through shapes in the document
for shape in doc.get_child_nodes(NodeType.SHAPE, True):
    if shape.has_image:
        # Extract image data
        image_data = shape.image_data.to_bytes()
        
        # Use PIL to open the image and perform OCR
        image = Image.open(io.BytesIO(image_data))
        extracted_text = pytesseract.image_to_string(image)
        
        # Insert the extracted text next to the image
        # Create a new paragraph for the text
        paragraph = shape.parent_paragraph
        paragraph.append_child(paragraph.document.create_paragraph())
        paragraph.append_child(paragraph.document.create_run(extracted_text))

# Save the modified document
doc.save("modified_document.docx")

Explanation:

  • Loading the Document: The Document class is used to load the Word document.
  • Extracting Images: The code iterates through all shapes in the document and checks if they contain images.
  • OCR Processing: The pytesseract library is used to perform OCR on the extracted image to get the text.
  • Inserting Text: A new paragraph is created next to the image to insert the extracted text.
  • Saving the Document: Finally, the modified document is saved.

Make sure you have the required libraries installed, including pytesseract and Pillow for image processing.

This approach allows you to efficiently extract text from images and place it back into the Word document.

I get the following error trying to run that code:

AttributeError: 'aspose.words.Node' object has no attribute 'has_image'

Can someone help me create working code?

@ln22 The AI-generated code is not quite correct. You can use the following code to get the image bytes of a shape:

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")
builder = aw.DocumentBuilder(doc)
for s in doc.get_child_nodes(aw.NodeType.SHAPE, True):
    # get_child_nodes returns generic Node objects; cast to Shape to reach has_image
    s = s.as_shape()
    if s.has_image:
        # Extract image data
        image_data = s.image_data.image_bytes
        # extract content from the image
        # .......

Extracting text from the image (that is, the OCR operation) is outside the scope of Aspose.Words. However, you can try Aspose.OCR for this part of the task.

I will try Aspose.OCR. Once I have the text, how can I insert it into the document directly after the image?

@ln22 You can use the same approach as suggested here:
https://forum.aspose.com/t/extracting-text-for-textboxes-and-shapes-with-aspose-words-for-python-via-net/300290/3

If the text is extracted from the image as a simple string, you can use the DocumentBuilder.write method to insert it.

So, like this?

builder.move_to(shape_as_shape)
builder.writeln()
while shape_as_shape.has_child_nodes:
    builder.current_paragraph.parent_node.insert_before(extracted_image_text, builder.current_paragraph)

@ln22 You can do it like this:

builder.move_to(shape_as_shape)
builder.writeln(text_extracted_from_image)
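
Putting the replies above together, a minimal sketch of the whole loop might look like the following. It is not an official Aspose sample: `recognize` is a hypothetical callable standing in for whichever OCR engine is used, and `write_text_at_images` is an illustrative name. The function relies only on the members already shown in this thread (`has_image`, `image_data.image_bytes`, `move_to`, `writeln`), so it takes the shapes and the builder as arguments.

```python
def write_text_at_images(shapes, builder, recognize):
    """Write OCR text into the document at each image shape.

    shapes    -- Shape objects, e.g. collected via node.as_shape() as shown above
    builder   -- a DocumentBuilder bound to the same document
    recognize -- hypothetical callable: image bytes -> extracted text (str)
    Returns the number of shapes for which text was written.
    """
    written = 0
    for shape in shapes:
        if not shape.has_image:
            continue
        text = recognize(shape.image_data.image_bytes).strip()
        if not text:
            continue  # nothing recognized, leave the document untouched
        builder.move_to(shape)
        builder.writeln(text)
        written += 1
    return written


# Hypothetical wiring with Aspose.Words, using only calls from this thread:
#   doc = aw.Document("in.docx")
#   builder = aw.DocumentBuilder(doc)
#   # Build the list first so the document is not modified while it is being walked.
#   shapes = [n.as_shape() for n in doc.get_child_nodes(aw.NodeType.SHAPE, True)]
#   write_text_at_images(shapes, builder, my_ocr_function)
#   doc.save("out.docx")
```

Whether the text ends up before or after the image depends on where `move_to` leaves the cursor for an inline shape, so it is worth checking the saved document and adjusting the cursor position if needed.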

builder.writeln(edited_extracted_text)
RuntimeError: Proxy error(ArgumentException): Cannot insert a node of this type at this location.

Sometimes I get this error when trying to write the text after the image. Would you know why?

@ln22 Could you please provide the code and the problematic string so that we can reproduce the problem? We will check the issue and provide more information.
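
When preparing such a reproduction, it can help to look at the exact characters in the OCR output before passing it to `writeln`. The sketch below is a diagnostic aid only: the thread does not establish the cause of the error, and it is purely an assumption that stray control characters in OCR output might play a role. `find_control_chars` and `strip_control_chars` are illustrative helper names, not part of Aspose.Words.

```python
import unicodedata


def find_control_chars(text):
    """Return (index, repr) pairs for control characters other than newline and tab."""
    return [
        (i, repr(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t"
    ]


def strip_control_chars(text):
    """Drop control characters other than newline and tab, then trim whitespace."""
    cleaned = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )
    return cleaned.strip()


if __name__ == "__main__":
    sample = "Total: 42\x0c"  # made-up OCR output ending in a form feed
    print(find_control_chars(sample))
    print(repr(strip_control_chars(sample)))
```

If the error still occurs with a cleaned string, the text itself is less likely to be the trigger, and the position of the shape in the document would be the next thing to include in the report.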