ASPOSE.OCR reads only one character from pdf image

Anil1995 · June 12, 2021, 7:27am

I have a question. I am using the code below to read text from an image extracted from a PDF file, but it returns only a single line, and I want to read all the text in the file. Here is my code, taken from the Aspose.OCR sample.

AsposeOcr libVar = new AsposeOcr();
var slResult = libVar.RecognizeLine(@"D:\ExtractText\Pdf_mages\637590973339132665_out.JPG");

slResult contains only one character, "s".

Please help me. Thank you

asad.ali · June 14, 2021, 6:19pm

@Anil1995

Could you please make sure that you are using the API with a valid license or a free 30-day temporary license? If the issue persists, please share your sample image with us so that we can test the scenario in our environment and address it accordingly.

Anil1995 · June 28, 2021, 1:40pm

Hi Asad, I have attached the image file that I want to read. Please check and confirm. Thank you.
637605038128338394_out.jpg (347.9 KB)

asad.ali · June 30, 2021, 5:22am

@Anil1995

The RecognizeLine() method is intended for images that contain a single line of text. To recognize the text of a whole image, please use the line below and let us know if you face any issue:

var slResult = libVar.RecognizeImage("637605038128338394_out.jpg");

Anil1995 · June 30, 2021, 1:52pm

Hi @asad.ali, I am using the following code that you provided:
var slResult = libVar.RecognizeImage("637605038128338394_out.jpg");
but it returns the following text:

BalatiOnly CGtdvilASOSGPDECOpDight202-2024SDOSPyL
Senator Justo S. Quitugua, M.Ed.
AGENDA
'. Old Business
a. Referred Legislation

************* Trial Licenses *************

I want to read all the text in that image file, but it returns only a few lines. Can you please check and confirm?
I have attached the file I want to read. Thanks in advance.
637606771327100809_out.jpg (283.9 KB)

asad.ali · July 1, 2021, 12:46pm

@Anil1995

This looks like the evaluation-mode limitation that applies when no valid license is set, as we were able to extract all the text from the image with a license. As shared in our earlier response, please set the license before using the API and let us know if you still face any issues. You can obtain a free temporary license from the link shared in our previous response.
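
As a quick check on your side, a small helper along the following lines can flag output that was produced in evaluation mode. This is only a sketch: the class and method names are hypothetical, and the marker text is simply the one that appears in the output you posted, so other versions may word it differently.

```csharp
using System;

public static class OcrTrialCheck
{
    // Marker text as it appeared in the OCR output posted in this thread.
    // Assumption: other versions may word or pad this marker differently.
    private const string TrialMarker = "Trial License";

    // Hypothetical helper: returns true when the recognized text
    // contains the trial marker (case-insensitive match).
    public static bool LooksLikeTrialOutput(string recognizedText)
    {
        if (string.IsNullOrEmpty(recognizedText))
        {
            return false;
        }
        return recognizedText.IndexOf(TrialMarker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Calling OcrTrialCheck.LooksLikeTrialOutput(slResult) right after recognition makes it easier to tell a licensing problem apart from a recognition problem.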

Anil1995 · July 2, 2021, 5:23am

Thanks @asad.ali for your quick response. We will definitely go with a valid license. I have one more question: I want to keep my file in a memory stream and read the text from there. I used the Aspose code from the documentation, but it does not work. Can you please check and confirm? Below is my code:

PdfExtractor pdfExtractor = new PdfExtractor();
pdfExtractor.BindPdf(pdffile);

pdfExtractor.ExtractText();

MemoryStream tempMemoryStream = new MemoryStream();
pdfExtractor.GetText(tempMemoryStream);

using (MemoryStream ms = new MemoryStream())
using (FileStream file1 = new FileStream(@"D:\ExtractText\Pdf_mages\637606771327100809_out.JPG", FileMode.Open, FileAccess.Read))
{
    file1.CopyTo(ms);
    var Result = libVar.RecognizeImage(@"D:\ExtractText\Pdf_mages\637606771327100809_out.JPG");
    Result = libVar.RecognizeImage(ms);
    Console.WriteLine(Result);
}

objConverter.Close();
Console.ReadKey();

How can I store the data in a memory stream and read it back from the memory stream? Please share the code. Thank you.

asad.ali · July 2, 2021, 6:20pm

@Anil1995

Please try using the below code snippet and let us know in case you still face any issue:

Document pdfDocument = new Document(file);
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
pdfDocument.Pages.Accept(abs);

foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
{
    string slResult = "";
    XImage ximage = imagePlacement.Image;
    AsposeOcr libOcr = new AsposeOcr();

    using (MemoryStream ms = new MemoryStream())
    {
        ximage.Save(ms);
        ms.Position = 0;
        slResult = libOcr.RecognizeImage(ms);
    }
    Console.WriteLine(slResult);
}
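
Note the ms.Position = 0 line: after data is written or copied into a MemoryStream, its position sits at the end, so anything that reads from the current position sees no data. That may be why the earlier attempt with file1.CopyTo(ms) returned nothing, although this is an assumption, since we did not debug that snippet. The standalone sketch below shows the effect using only the .NET base library; the class and method names are made up for illustration.

```csharp
using System.IO;
using System.Text;

public static class StreamRewindDemo
{
    // Writes text into a MemoryStream, optionally rewinds it,
    // then reads whatever is available from the current position.
    public static string WriteThenRead(string text, bool rewind)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            ms.Write(bytes, 0, bytes.Length); // position is now at the end

            if (rewind)
            {
                ms.Position = 0; // move back to the start before reading
            }

            using (StreamReader reader = new StreamReader(ms, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

StreamRewindDemo.WriteThenRead("abc", false) returns an empty string, while StreamRewindDemo.WriteThenRead("abc", true) returns "abc".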

Anil1995 · July 28, 2021, 6:36am

Hi @asad.ali
I have implemented the code you provided, but I am not getting the expected output.
I have attached screenshots of what appears in the output window (again, it returns only a few garbled characters). Please check.
Eagerly waiting for your reply. I have also attached the PDF that I want to read.
Thank you.
MP09.pdf (1.4 MB)
Screenshot (69).png (30.7 KB)
Screenshot (68).png (30.7 KB)
Screenshot (67).png (30.8 KB)

asad.ali · July 28, 2021, 9:37pm

@Anil1995

It still looks like you are using an evaluation copy of the API. As requested earlier, please apply for a free 30-day temporary license and set it before using the API. Please also use version 21.6 of the API, as it is the latest release, and let us know if you still face any issues.

Anil1995 · July 29, 2021, 7:57am

Hi @asad.ali

I got the temporary license, but whenever I try to use it in my code I get an error on line 87, at the RecognizeImage method call. The code is attached. Please help. Thank you.
Code.PNG (14.3 KB)

asad.ali · July 29, 2021, 7:33pm

@Anil1995

You are trying to call the RecognizeImage method on the License class. Please initialize and set the license separately at the start of the method, and then use the code snippet shared earlier, as shown below:

Aspose.OCR.License olicense = new Aspose.OCR.License();
olicense.SetLicense("Aspose.OCR.NET.lic");
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(file);
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
pdfDocument.Pages.Accept(abs);

foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
{
    string slResult = "";
    XImage ximage = imagePlacement.Image;
    AsposeOcr libOcr = new AsposeOcr();

    using (MemoryStream ms = new MemoryStream())
    {
        ximage.Save(ms);
        ms.Position = 0;
        slResult = libOcr.RecognizeImage(ms);
    }
    Console.WriteLine(slResult);
}

Anil1995 · July 30, 2021, 6:17am

Hi @asad.ali, thanks for your quick reply.

I used the code you provided, but I am getting the result shown in the attachment. Why does it read only one page out of four? I want to read all the text from the images within the PDF. Can you please look into this? Thank you in advance.
Screenshot (74).png (92.0 KB)

asad.ali · July 30, 2021, 9:39pm

@Anil1995

Are you also setting a license for Aspose.PDF? Please note that Aspose.PDF cannot extract or process more than 4 elements in the trial version. Please use a temporary license for Aspose.PDF as well and try again. We tested the case at our end and did not notice any issue: the API was able to extract text from all pages of the PDF.

If you still want to use Aspose.PDF without a license, please use the code snippet below, and you will notice that the API extracts the text of only 4 pages:

string file = @"MP09.pdf";
Document pdfDocument = new Document(file);

foreach (Page page in pdfDocument.Pages)
{
    // Render the whole page to an image at 300 DPI
    Aspose.Pdf.Devices.PngDevice pngDevice = new Aspose.Pdf.Devices.PngDevice(new Aspose.Pdf.Devices.Resolution(300));
    // PngDevice writes PNG data, so the file gets a .png extension
    string imagePath = "output" + page.Number + ".png";
    pngDevice.Process(page, imagePath);

    // Run OCR on the rendered page image
    AsposeOcr libOcr = new AsposeOcr();
    string slResult = libOcr.RecognizeImage(imagePath);
    Console.WriteLine(slResult);
}
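
If you do go with licenses for both products, setting them could look roughly like the sketch below. Aspose.Pdf.License is set the same way as Aspose.OCR.License. The LicenseSetup class name and the license file paths you pass in are placeholders, not something taken from your project, and the code needs references to both the Aspose.PDF and Aspose.OCR packages.

```csharp
public static class LicenseSetup
{
    // Applies both licenses once, before any Aspose.PDF or Aspose.OCR call.
    // Both paths are placeholders; point them at your own license files.
    public static void Apply(string pdfLicensePath, string ocrLicensePath)
    {
        // License for rendering and extracting from the PDF
        Aspose.Pdf.License pdfLicense = new Aspose.Pdf.License();
        pdfLicense.SetLicense(pdfLicensePath);

        // License for the OCR engine
        Aspose.OCR.License ocrLicense = new Aspose.OCR.License();
        ocrLicense.SetLicense(ocrLicensePath);
    }
}
```

Call LicenseSetup.Apply(...) once at the start of the program, before creating the Document or the AsposeOcr object.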

Anil1995 · August 3, 2021, 5:41am

Hi @asad.ali
I am using the code snippet you provided, but I am not getting the proper result. Please see the code below.

Aspose.OCR.License olicense = new Aspose.OCR.License();
olicense.SetLicense(@"C:\Users\OPTLPTP217\Downloads\Aspose.OCR.NET.lic");
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdffile);
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
pdfDocument.Pages.Accept(abs);

foreach (ImagePlacement imagePlacement in abs.ImagePlacements)
{
    string slResult = "";
    XImage ximage = imagePlacement.Image;
    AsposeOcr libOcr = new AsposeOcr();

    using (MemoryStream ms = new MemoryStream())
    {
        ximage.Save(ms);
        ms.Position = 0;
        slResult = libOcr.RecognizeImage(ms);
    }
    Console.WriteLine(slResult);
}
Console.ReadLine();

But even with the temporary license, it returns the same result that I shared earlier. Thank you.

asad.ali · August 3, 2021, 3:42pm

@Anil1995

PDF files can be structured in entirely different ways. If a PDF contains several images alongside textual content and you need the text from the images only, or if each page consists of a single image, you can use the code snippet based on ImagePlacementAbsorber.

In the other case, where you want the text of a complete page but the page content is a mixture of images and text, you need to convert the whole page into a single image and then perform OCR on it.

The PDF file you shared with us falls into the second case. We used the code snippet below (also shared in our previous response) to extract text from it and did not notice any issue. The API was able to extract the complete text from the converted images:

string file = @"MP09.pdf";
Document pdfDocument = new Document(file);

foreach (Page page in pdfDocument.Pages)
{
    // Render the whole page to an image at 300 DPI
    Aspose.Pdf.Devices.PngDevice pngDevice = new Aspose.Pdf.Devices.PngDevice(new Aspose.Pdf.Devices.Resolution(300));
    // PngDevice writes PNG data, so the file gets a .png extension
    string imagePath = "output" + page.Number + ".png";
    pngDevice.Process(page, imagePath);

    // Run OCR on the rendered page image
    AsposeOcr libOcr = new AsposeOcr();
    string slResult = libOcr.RecognizeImage(imagePath);
    Console.WriteLine(slResult);
}

Please try using this code and let us know if you face any issues.
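
Since you asked earlier about working from a memory stream, the same page-by-page approach can also be done without writing image files to disk. The following is a minimal sketch rather than a tested solution for your file: the InMemoryPageOcr class name is made up for illustration, and it assumes the stream overloads PngDevice.Process(page, stream) and RecognizeImage(MemoryStream) are available in the versions you have installed.

```csharp
using System;
using System.IO;
using Aspose.OCR;

public static class InMemoryPageOcr
{
    // Renders each page to a PNG held in memory and runs OCR on it.
    // Set the Aspose.PDF and Aspose.OCR licenses before calling this.
    public static void Run(string pdfPath)
    {
        Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdfPath);
        AsposeOcr libOcr = new AsposeOcr();
        Aspose.Pdf.Devices.PngDevice pngDevice =
            new Aspose.Pdf.Devices.PngDevice(new Aspose.Pdf.Devices.Resolution(300));

        foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
        {
            using (MemoryStream ms = new MemoryStream())
            {
                pngDevice.Process(page, ms); // render the page into the stream
                ms.Position = 0;             // rewind before handing it to OCR
                string slResult = libOcr.RecognizeImage(ms);

                Console.WriteLine("Page " + page.Number + ":");
                Console.WriteLine(slResult);
            }
        }
    }
}
```

A call such as InMemoryPageOcr.Run(@"MP09.pdf") prints the recognized text page by page, which also makes it easier to see which page a weak result comes from.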

Anil1995 · August 5, 2021, 12:25pm

@asad.ali Thanks for your reply. I have implemented the code you provided. Attached are screenshots of what I get in the result window. For pages 1 and 2, OCR reads all the characters, but for pages 3 and 4 it reads only a few of them. Can you please look into it? Below is my code.
Screenshot (84).png (47.7 KB)
Screenshot (83).png (32.9 KB)

Aspose.OCR.License olicense = new Aspose.OCR.License();
olicense.SetLicense(@"C:\Users\OPTLPTP217\Downloads\Aspose.OCR.NET.lic");
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdffile);
foreach (Page page in pdfDocument.Pages)
{
    Aspose.Pdf.Devices.PngDevice pngDevice = new Aspose.Pdf.Devices.PngDevice(new Aspose.Pdf.Devices.Resolution(300));
    pngDevice.Process(page, "output" + page.Number + ".jpg");
    AsposeOcr libOcr = new AsposeOcr();
    string slResult = "";
    slResult = libOcr.RecognizeImage("output" + page.Number + ".jpg");
    Console.WriteLine(slResult);
}

asad.ali · August 5, 2021, 8:04pm

@Anil1995

Please check the attached screenshot of the results we obtained at our end while testing the scenario using both Aspose.OCR and Aspose.PDF with valid licenses.
extractedtextpage34.png (34.3 KB)
Can you please point out in the screenshot if you notice any anomaly or missing results?

lion.brotzky · August 6, 2021, 2:43pm

homedepot_crop.jpg (955.4 KB)
OCR_results.png (8.3 KB)
Asad, I am experiencing the same problem.
I have a valid license file, Aspose.Total.Java.lic, and I set it this way:

com.aspose.ocr.License license = new com.aspose.ocr.License();
license.setLicense("C:\\licenses\\Aspose.Total.Java.lic");

I then run recognition (the methods are named differently in Java):

AsposeOCR ocr = new AsposeOCR();
String result = ocr.RecognizePage("C:\\images\\homedepot_crop.jpg");

The results are terrible; see the attached files. When I run the same image through your online service, the results are fair.
I am trying to figure out what I am doing wrong.

asad.ali · August 6, 2021, 8:11pm

@lion.brotzky

We are collecting the information related to Aspose.OCR online App and will get back to you shortly.