Hi, I am using Aspose.OCR to perform OCR on a PDF. After the OCR completes, I save the recognized text to a Word document and then finally to a PDF, because I need the result back as an OCRed PDF. However, I lose the original formatting when I perform this operation.
Here is my current code:
foreach (var file in allfiles)
{
    FileInfo f = new FileInfo(file);
    var pdfDocument = new Aspose.Pdf.Document(f.FullName);
    // Create an instance of OcrEngine for recognition
    var ocrEngine = new Aspose.OCR.OcrEngine();
    // Word document that collects the recognized text of every page
    Aspose.Words.Document doc = new Aspose.Words.Document();
    Aspose.Words.DocumentBuilder builder = new Aspose.Words.DocumentBuilder(doc);
    // Aspose.Pdf page indexes are 1-based
    for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
    {
        // Create a MemoryStream to hold the page image temporarily
        using (var imageStream = new System.IO.MemoryStream())
        {
            // Create a Resolution object with the DPI value
            var resolution = new Aspose.Pdf.Devices.Resolution(300);
            // Create a JPEG device with the specified attributes (Resolution, Quality),
            // where Quality is in [0-100] and 100 is the maximum
            var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(resolution, 100);
            // Convert the current page and save the image to the stream
            jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
            imageStream.Position = 0;
            // Set the Image property of OcrEngine to the stream obtained in the previous step
            ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imageStream, Aspose.OCR.ImageStreamFormat.Jpg);
            // Perform the OCR operation on one page at a time
            if (ocrEngine.Process())
            {
                var pages = ocrEngine.Pages;
                foreach (Page page in pages)
                {
                    // Only the plain text is written out; layout and fonts are not carried over
                    builder.Writeln(page.PageText.ToString());
                    builder.InsertBreak(BreakType.PageBreak);
                }
            }
        }
    }
    // Save the collected text as a PDF named after the source file
    var fileName = Path.GetFileNameWithoutExtension(f.ToString()) + "_OCRed.pdf";
    doc.Save(fileName, SaveFormat.Pdf);
}
Please advise what I should do to retain the formatting.