I am using the code below to get the list of images in a PDF and then extract the text from each image,
but I am getting just a few gibberish characters per image.
string file = @"D:\MP09.pdf";
Document pdfDocument = new Document(file);
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
pdfDocument.Pages.Accept(abs);
foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
{
    string slResult = "";
    XImage ximage = imagePlacement.Image;
    Console.Out.WriteLine("image width:/height" + imagePlacement.Rectangle.Width + "/" + imagePlacement.Rectangle.Height);
    AsposeOcr libOcr = new AsposeOcr();
    using (MemoryStream ms = new MemoryStream())
    {
        ximage.Save(ms);
        ms.Position = 0;
        slResult = libOcr.RecognizeImage(ms);
    }
    Console.WriteLine(slResult);
}
Please advise what the issue is here.
MP09.pdf (1.4 MB)
@terrence123
We are testing the scenario and will get back to you shortly.
@terrence123
We were able to reproduce the issue in our environment while testing the scenario with Aspose.OCR for .NET 21.5. Therefore, we have logged it as OCRNET-361 in our issue tracking system. We will look further into the details and keep you posted on the status of the fix. Please be patient and give us some time.
We are sorry for the inconvenience.
It's been two weeks.
What's the update on this?
@terrence123
Unfortunately, the ticket logged earlier has not been investigated yet. Please note that it was logged under the free support model and will be resolved on a first-come, first-served basis. As soon as the investigation is complete, we will share an ETA with you in this forum thread. Please be patient and give us some time.
We apologize for the inconvenience.
@terrence123
The PDF file contains both raw text and images: the first page is text content, and the other pages are images. Your code extracts only the images, so the text on the first page is never read. To view the PDF content, please use our service: Parse PDF | Online and Free | Aspose.PDF
To test the OCR result, you can use our service: Free Online PDF OCR - Convert PDF to Text
Please try this code:
// Namespaces assumed for Aspose.PDF and Aspose.OCR for .NET.
using System;
using System.IO;
using Aspose.OCR;
using Aspose.Pdf;
using Aspose.Pdf.Text;

string file = @"MP09.pdf";
string totalResult = String.Empty;
Document pdfDocument = new Document(file);

// Extract the raw text first (in this file, page 1 holds real text).
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
string extractedText = textAbsorber.Text;
Console.WriteLine(extractedText);
totalResult += extractedText;

// Collect the images (the remaining pages) and run OCR on each one.
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
pdfDocument.Pages.Accept(abs);
AsposeOcr libOcr = new AsposeOcr(); // one OCR engine instance, reused for every image
// int i = 0;
foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
{
    string slResult = "";
    XImage ximage = imagePlacement.Image;
    Console.Out.WriteLine("image width/height: " + imagePlacement.Rectangle.Width + "/" + imagePlacement.Rectangle.Height);
    using (MemoryStream ms = new MemoryStream())
    {
        ximage.Save(ms);
        ms.Position = 0;
        // To check the extracted images, uncomment this line and "int i" above:
        // File.WriteAllBytes(@"D:\img" + (i++) + ".jpg", ms.ToArray());
        slResult = libOcr.RecognizeImage(ms);
    }
    Console.WriteLine(slResult);
    totalResult += slResult;
}
File.WriteAllText(@"D:\resultPdfOcr.txt", totalResult);
The recognition result is in the attached file: resultPdfOcr.zip (2.9 KB)
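If the OCR output still looks wrong, it can help to inspect exactly what is passed to the recognizer. The commented-out "to check images" step in the code above could be written as a small helper, sketched here. `ImageDump` and its `Save` method are hypothetical names, not part of Aspose.PDF or Aspose.OCR. The sketch uses only `System.IO` and defaults to a `.jpg` extension because the original snippet did.

```csharp
using System;
using System.IO;

// Hypothetical debugging helper; not part of any Aspose library.
public static class ImageDump
{
    // Writes the entire contents of the stream to <directory>/img<index><extension>,
    // then rewinds the stream so it can still be handed to the OCR call afterwards.
    // Returns the path of the file that was written.
    public static string Save(MemoryStream ms, string directory, int index, string extension = ".jpg")
    {
        Directory.CreateDirectory(directory); // no-op if the directory already exists
        string path = Path.Combine(directory, "img" + index + extension);
        File.WriteAllBytes(path, ms.ToArray()); // ToArray copies all bytes regardless of Position
        ms.Position = 0;
        return path;
    }
}
```

Inside the loop, a call such as ImageDump.Save(ms, @"D:\ocr-debug", i++) placed after ximage.Save(ms) and before libOcr.RecognizeImage(ms) would write one file per image, with the directory name being only an example. Opening those files shows whether the extracted images are readable before OCR is attempted.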