Free Support Forum - aspose.com

Read image text from docx file


#1

How can I read the text inside images in a DOCX file and write it to a TXT file?


#2

@rabin.samanta,

You can use the following Aspose.Words for .NET code to extract all images from a DOCX file:

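using Aspose.Words;
using Aspose.Words.Drawing;
using Aspose.Words.Rendering;
using Aspose.Words.Saving;

// Load the source document.
Document doc = new Document("E:\\temp\\input.docx");

int i = 0;
// Visit every shape in the document, including nested ones.
foreach (Shape img in doc.GetChildNodes(NodeType.Shape, true))
{
    // Only shapes that actually hold an image are rendered.
    if (img.HasImage)
    {
        ShapeRenderer renderer = img.GetShapeRenderer();

        // Save each image as a grayscale PNG with a running index in the file name.
        ImageSaveOptions opts = new ImageSaveOptions(SaveFormat.Png);
        opts.ImageColorMode = ImageColorMode.Grayscale;
        renderer.Save("E:\\temp\\img_" + i + ".png", opts);
        i++;
    }
}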

After that, you can use the Aspose.OCR for .NET API to extract the actual text from the image files. I have moved your thread to the Aspose.OCR Product Family forum, where you will be guided further.


#3

Capture12.PNG (70.3 KB)
This is the DOCX file I want to read (see the attached screenshot).

The expected output is:

Hi Rabin Samanta
(then the text from the image)
This is the file.


#4

@rabin.samanta,

Please ZIP and upload your actual Word document (.docx file) here for testing. We will then investigate the scenario on our end and provide you with more information.


#5

test (2).zip (26.6 KB)


#6

@rabin.samanta

Thanks for sharing the sample document.

We have tested the scenario in our environment and observed that the image in the .docx file is of low quality. Please note that, for better results with the API, the source image should be at least 300 DPI and the font size should be 12 pt or larger.

Furthermore, the fonts supported by the API include Arial, Times New Roman, Courier New, Verdana, Tahoma and Calibri in regular, bold and italic styles. Please try your scenario with a higher-DPI image, and if you still face any issue, feel free to contact us.
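To check whether an embedded image is likely to meet the 300 DPI guidance, one option is to compare its pixel width with the width it occupies on the page. The sketch below is only an illustration, not an official recipe: the `EffectiveDpi` helper and the class name are hypothetical, the input path is reused from the earlier snippet, and it assumes `Shape.Width` is reported in points (1/72 inch).

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Drawing;

class ImageDpiCheck
{
    // Hypothetical helper: pixels across divided by inches across (72 points = 1 inch).
    static double EffectiveDpi(int widthPixels, double widthPoints)
    {
        return widthPixels / (widthPoints / 72.0);
    }

    static void Main()
    {
        // Path reused from the extraction snippet earlier in this thread.
        Document doc = new Document("E:\\temp\\input.docx");

        int i = 0;
        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            // Skip shapes without an image or without a usable width.
            if (!shape.HasImage || shape.Width <= 0)
                continue;

            // Pixel dimensions of the image stored inside the document.
            ImageSize size = shape.ImageData.ImageSize;
            double dpi = EffectiveDpi(size.WidthPixels, shape.Width);

            string note = dpi < 300 ? " (below 300 DPI)" : "";
            Console.WriteLine("Image {0}: {1}x{2} px, about {3:F0} DPI as placed{4}",
                i, size.WidthPixels, size.HeightPixels, dpi, note);
            i++;
        }
    }
}
```

For scale, 12 pt text at 300 DPI is 12 / 72 × 300 = 50 pixels tall, so an image whose characters are much smaller than that is likely to fall short of the guidance above.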


#7

@asad.ali
If the image is below 300 DPI, how can I solve that?


#8

@rabin.samanta

Thanks for your inquiry.

If the image is blurred and its DPI is too low, you may try applying correction filters to the images before text extraction. However, we also tested the scenario with correction filters applied, and the output text was still not correct. Hence, we have logged an investigation ticket, OCR-576, in our issue tracking system.
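As a rough illustration of pre-processing an image before OCR, the sketch below enlarges an extracted PNG with bicubic interpolation using the standard `System.Drawing` classes. This is a stand-in technique, not the Aspose.OCR correction filters mentioned above; the scale factor, class name and file names are assumptions, and enlarging cannot restore detail that the original image never had, so it may not improve the recognized text.

```csharp
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;

class UpscaleForOcr
{
    static void Main()
    {
        // Assumed enlargement factor; choose it so text reaches a readable pixel height.
        const int scale = 3;

        // Assumed input: one of the PNG files produced by the extraction snippet.
        using (Image source = Image.FromFile("E:\\temp\\img_0.png"))
        using (Bitmap scaled = new Bitmap(source.Width * scale, source.Height * scale))
        {
            // Record 300 DPI in the output file's metadata.
            scaled.SetResolution(300f, 300f);

            using (Graphics g = Graphics.FromImage(scaled))
            {
                // Bicubic interpolation gives smoother edges than the default mode.
                g.InterpolationMode = InterpolationMode.HighQualityBicubic;
                g.DrawImage(source, new Rectangle(0, 0, scaled.Width, scaled.Height));
            }

            scaled.Save("E:\\temp\\img_0_scaled.png", ImageFormat.Png);
        }
    }
}
```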

We will look further into the details of the issue and keep you posted on its rectification status. Please be patient and spare us a little time.

We are sorry for the inconvenience.