Convert Scanned PDF to Excel - URGENT

dewan.ishi · August 6, 2020, 5:33am

Hi Team,

I have a scanned PDF that I am trying to convert to Excel. I am pretty new to Aspose.OCR and not sure how to use it. I tried to find documentation, but nothing I found was clear.

I tried the following code. Is it the right way to do this?
If it is, the problem is that the OCR engine (OcrEngine) is not being recognized.

    static void Main(string[] args)
    {
        Aspose.OCR.License license = new Aspose.OCR.License();
        license.SetLicense(@"C:\Licenses\Aspose20.lic");

        //Create an instance of Document to load the PDF
        Document pdfDocument = new Document(@"C:\PDFs\ScannedForAspose.pdf");

        //Create an instance of OcrEngine for recognition
        Aspose.OCR.OcrEngine ocrEngine = new Aspose.OCR.OcrEngine();

        //Iterate over the pages of PDF
        for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
        {
            //Creating a MemoryStream to hold the image temporarily
            using (MemoryStream imageStream = new MemoryStream())
            {
                //Create Resolution object
                Resolution resolution = new Resolution(300);

                //Create a JpegDevice that renders at the resolution above
                JpegDevice jpegDevice = new JpegDevice(resolution);

                //Convert a particular page and save the image to stream
                jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
                imageStream.Position = 0;
                ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

                //Perform OCR operation on one page at a time
                if (ocrEngine.Process())
                {
                    Console.WriteLine(ocrEngine.Text);
                }
            }
        }
    }

asad.ali · August 6, 2020, 9:16pm

@dewan.ishi

Please use Aspose.OCR for .NET version 20.7 along with the following code snippet:

AsposeOcr api = new AsposeOcr();
// Recognize image
string result = api.RecognizeImage(imageStream);
// Display the recognized text
Console.WriteLine(result);

imageStream in the above code is the same stream in which Aspose.PDF for .NET stores the page image.

dewan.ishi · August 12, 2020, 2:06pm

@asad.ali
Hi,

When I made the above change, I got the following error:

System.IO.FileNotFoundException
HResult=0x80070002
Message=Could not find file 'C:\Users\abcd\source\repos\ConsoleApp4\ConsoleApp4\bin\Debug\netcoreapp3.1\System.IO.MemoryStream'.

I am not sure what it is looking for here.

Also, I had another question: can I convert the scanned PDF to a text file?

asad.ali · August 12, 2020, 6:53pm

@dewan.ishi

Would you kindly share a sample console application that reproduces the issue you are facing? We will test it in our environment and address the issue accordingly.
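
In the meantime, one thing that may be worth checking (a guess based only on the error text, not a confirmed diagnosis): the path in the message ends in System.IO.MemoryStream, which is exactly what MemoryStream.ToString() returns. That would be consistent with the string form of the stream, rather than the stream itself, reaching a method that expects a file path. The sketch below uses only System.IO, and the class name StreamPathCheck is made up for illustration. It shows how such a path can arise, assuming no file with that name exists in the working directory.

```csharp
using System;
using System.IO;

class StreamPathCheck
{
    static void Main()
    {
        using (MemoryStream imageStream = new MemoryStream())
        {
            // MemoryStream does not override ToString(), so this is just the type name.
            string asText = imageStream.ToString();
            Console.WriteLine(asText); // System.IO.MemoryStream

            // Treated as a relative path, it is resolved against the current
            // directory (often the bin output folder when run from the IDE).
            Console.WriteLine(Path.GetFullPath(asText));

            // Opening it as a file fails with FileNotFoundException,
            // assuming no file with that name exists there.
            try
            {
                File.OpenRead(asText).Dispose();
            }
            catch (FileNotFoundException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
}
```

If something similar is happening in your project, passing the MemoryStream object itself, rewound with imageStream.Position = 0, may be worth trying.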

Does your scanned PDF already contain a layer of hidden text? If it does not, sadly it cannot be directly converted to a text file.