Unable to get all the text from an image

terrence123 · June 14, 2021, 12:03pm

I am using the code below to get the list of images in a PDF and then extract the text from each image,
but I am getting just a few gibberish characters per image.

string file = @"D:\MP09.pdf";
Document pdfDocument = new Document(file);
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
pdfDocument.Pages.Accept(abs);

foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
{
    string slResult = "";
    XImage ximage = imagePlacement.Image;
    Console.Out.WriteLine("image width:/height" + imagePlacement.Rectangle.Width + "/" + imagePlacement.Rectangle.Height);
    AsposeOcr libOcr = new AsposeOcr();

    using (MemoryStream ms = new MemoryStream())
    {
        ximage.Save(ms);
        ms.Position = 0;
        slResult = libOcr.RecognizeImage(ms);
    }
    Console.WriteLine(slResult);
}
Please advise what the issue is here.
MP09.pdf (1.4 MB)

asad.ali · June 14, 2021, 7:07pm

@terrence123

We are testing the scenario and will get back to you shortly.

asad.ali · June 15, 2021, 5:37pm

@terrence123

We were able to notice the issue in our environment while testing the scenario with Aspose.OCR for .NET 21.5. Therefore, we have logged an issue as OCRNET-361 in our issue tracking system. We will further look into its details and keep you posted with its rectification status. Please be patient and spare us some time.

We are sorry for the inconvenience.

terrence123 · June 29, 2021, 8:44am

It's been two weeks.
What's the update on this?

asad.ali · June 30, 2021, 5:28am

@terrence123

Sadly, the earlier logged ticket has not been investigated yet. Please note that it was logged under the free support model and will be resolved on a first-come, first-served basis. As soon as its investigation is complete, we will be able to share an ETA with you in this forum thread. Please be patient and give us some time.

We apologize for the inconvenience.

asad.ali · July 6, 2021, 7:07pm

@terrence123

The PDF file contains both raw text and images: the first page is text content, and the other pages are images. Your code, however, extracts only the images. To view the content of the PDF file, please use our service: Parse PDF | Online and Free | Aspose.PDF

To test the OCR result, please use our service: Free Online PDF OCR - Convert PDF to Text

Please try this code:

            string file = @"MP09.pdf";
            string totalResult = String.Empty;

            Document pdfDocument = new Document(file);

            TextAbsorber textAbsorber = new TextAbsorber();
            pdfDocument.Pages.Accept(textAbsorber);
            string extractedText = textAbsorber.Text;
            Console.WriteLine(extractedText);
            totalResult += extractedText;



            ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
            pdfDocument.Pages.Accept(abs);
            // int i = 0;

            foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
            {
                string slResult = "";
                XImage ximage = imagePlacement.Image;
                Console.Out.WriteLine("image width:/ height" + imagePlacement.Rectangle.Width + "/" + imagePlacement.Rectangle.Height);
                AsposeOcr libOcr = new AsposeOcr();

                using (MemoryStream ms = new MemoryStream())
                {
                    ximage.Save(ms);
                    ms.Position = 0;
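                    // To check the extracted images, uncomment this block (and "int i = 0;" above)
                    // to dump each image to disk before OCR:
                    // using (FileStream fs = new FileStream("D://img" + (i++).ToString() + ".jpg", FileMode.Create))
                    // {
                    //     fs.Write(ms.ToArray());
                    // }
                    slResult = libOcr.RecognizeImage(ms);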
                }
                Console.WriteLine(slResult);
                totalResult += slResult;
            }

            File.WriteAllText("D://resultPdfOcr.txt", totalResult);

The recognition result is in the attached file: resultPdfOcr.zip (2.9 KB)