Free Support Forum - aspose.com

Format being Lost in Aspose.OCR

Hi, I am using Aspose.OCR to perform OCR on a PDF. After the OCR completes, I save the result to a Word document and then to a PDF, because I need the result back as an OCRed PDF. However, I lose the formatting when I perform this operation.

Here is my solution:

        foreach (var file in allfiles)
        {
            FileInfo f = new FileInfo(file);

            var pdfDocument = new Aspose.Pdf.Document(f.FullName);
            //Create an instance of OcrEngine for recognition
            var ocrEngine = new Aspose.OCR.OcrEngine();

            Aspose.Words.Document doc = new Aspose.Words.Document();
            Aspose.Words.DocumentBuilder builder = new Aspose.Words.DocumentBuilder(doc);
            for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
            {

                //Creating a MemoryStream to hold the image temporarily
                using (var imageStream = new System.IO.MemoryStream())
                {
                    //Create Resolution object with DPI value
                    var resolution = new Aspose.Pdf.Devices.Resolution(300);

                    //Create JPEG device with specified attributes (Width, Height, Resolution, Quality)
                    //where Quality [0-100], 100 is Maximum
                    var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(resolution, 100);

                    //Convert a particular page and save the image to stream
                    jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);

                    imageStream.Position = 0;

                    //Set Image property of OcrEngine to the stream obtained from previous step
                    ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imageStream, Aspose.OCR.ImageStreamFormat.Jpg);

                    //Perform OCR operation on one page at a time
                    if (ocrEngine.Process())
                    {
                        
                        var pages = ocrEngine.Pages;
                        foreach (Page page in pages)
                        {
                            builder.Writeln(page.PageText.ToString());
                            builder.InsertBreak(BreakType.PageBreak);
                        }
                    }
                   
                }
            }
            //Build the output name from the source file name and save the recognized text as a PDF
            var fileName = Path.GetFileNameWithoutExtension(f.Name) + "_OCRed.pdf";
            doc.Save(fileName, SaveFormat.Pdf);
        }

Please advise what I should do to retain the formatting.

@ShahidBhat

Thank you for contacting support.

Would you please share the source and generated files, along with a screenshot showing the difference in formatting, so that we can investigate further and help you out?

Hi Farhan,

Here are the source and generated PDFs; please have a look.
Home_OCRed.pdf (37.6 KB)
Home.pdf (105.5 KB)

@ShahidBhat

Thank you for sharing the requested data.

You are noticing the difference because the Aspose.OCR API extracts only the text, not the formatting or position information for that text. A ticket with ID OCR-588 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.
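
To illustrate the limitation described in the reply above, here is a minimal sketch in plain C#. It does not use any Aspose API: the `RecognizedWord` type, its `X`/`Y` fields, and the sample values are all hypothetical and exist only for this illustration. It shows that once recognized words are reduced to a single string, nothing about where they sat on the page is left for the document writer to reproduce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration only; it is NOT part of Aspose.OCR.
class RecognizedWord
{
    public string Text;
    public int X;   // left edge on the page image, in pixels (assumed unit)
    public int Y;   // top edge on the page image, in pixels (assumed unit)
}

static class Program
{
    // Collapses positioned words into one string, discarding the coordinates.
    static string ToPlainText(IEnumerable<RecognizedWord> words)
    {
        return string.Join(" ", words.Select(w => w.Text));
    }

    static void Main()
    {
        // Made-up sample: a centered heading and a right-aligned date below it.
        var words = new List<RecognizedWord>
        {
            new RecognizedWord { Text = "Invoice", X = 900, Y = 120 },
            new RecognizedWord { Text = "2018-05-01", X = 1900, Y = 300 },
        };

        // Only the characters survive; X and Y are gone, so anything that
        // receives this string can only lay it out as ordinary flowing text.
        Console.WriteLine(ToPlainText(words));
    }
}
```

The code in the question is in the same position: `builder.Writeln(page.PageText.ToString())` receives only a string per page, so the saved PDF contains the recognized text as plain paragraphs rather than the original layout.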