[.NET] Using Aspose.OCR on a scanned PDF without saving the PDF images to local disk

robert.stovall · July 31, 2015, 3:12pm

Hey friends,

I’m trying to use .NET Aspose.OCR on the pages of a scanned PDF file. The PDF is scanned in as pages of images. I’ve read on other posts that the best way to do this is to use Aspose.PDF to extract the images, save them, then use Aspose.OCR on those saved images. Is there any way to do this using memory streams so I don’t have to save the individual pages to my local machine? That doesn’t seem like a very scalable solution.

babar.raza · August 1, 2015, 4:14am

Hi Robert,

Thank you for contacting Aspose support.

Yes, it is possible to store the image temporarily in an instance of MemoryStream and perform the OCR operation on it. Please see the following code for a better understanding.

C#

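//Namespaces assumed from the classes used below
using System;
using System.IO;
using Aspose.OCR;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

//Create an instance of Document to load the PDF
Document pdfDocument = new Document("D:/Disclosure.pdf");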

//Create an instance of OcrEngine for recognition
OcrEngine ocrEngine = new OcrEngine();

//Iterate over the pages of PDF
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    //Create a MemoryStream to hold the page image temporarily
    using (MemoryStream imageStream = new MemoryStream())
    {
        //Create Resolution object (300 DPI)
        Resolution resolution = new Resolution(300);

        //Create JPEG device with the specified attributes (Resolution, Quality)
        //where Quality is in the range [0-100] and 100 is the maximum
        JpegDevice jpegDevice = new JpegDevice(resolution, 100);

        //Convert the current page to an image and save it to the stream
        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);

        //Rewind the stream so that it can be read from the beginning
        imageStream.Position = 0;

        //Set Image property of OcrEngine to the stream obtained from the previous step
        ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

        //Perform OCR operation on one page at a time
        if (ocrEngine.Process())
        {
            Console.WriteLine(ocrEngine.Text);
        }
    }
}