Creating a Searchable PDF from a Scanned PDF File

Hi Everyone,
We are testing the Aspose C# SDK to create a searchable PDF from a scanned PDF file, but we are getting only the OCR text result. Is there any way to achieve this using the SDK? We can only see the OCR to Image SDK in the list.

@Cengizkurban

Creating a Searchable PDF from a Scanned PDF with Aspose.PDF for .NET

You can turn a scanned (image‑only) PDF into a searchable PDF by combining two Aspose products:

| Product | Role |
| --- | --- |
| Aspose.PDF for .NET | Holds the final PDF document and can embed hidden text layers. |
| Aspose.OCR for .NET | Performs optical character recognition on each scanned page. |

Below is a complete, step‑by‑step example (C#) that:

  1. Loads the scanned PDF.
  2. Renders each page to an image (required for OCR).
  3. Runs OCR on the image to obtain the text.
  4. Adds the original image back to a new PDF page.
  5. Overlays the OCR‑generated text as hidden (searchable) text.
  6. Saves the result as a searchable PDF.

1. Install the required NuGet packages

Install-Package Aspose.PDF
Install-Package Aspose.OCR

2. Sample code

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;
using Aspose.OCR;
using Aspose.OCR.Models;

class SearchablePdfFromScannedPdf
{
    static void Main()
    {
        // -----------------------------------------------------------------
        // 1️⃣ Load the scanned PDF (image‑only)
        // -----------------------------------------------------------------
        string scannedPdfPath = @"C:\Input\ScannedDocument.pdf";
        Document scannedPdf = new Document(scannedPdfPath);

        // -----------------------------------------------------------------
        // 2️⃣ Create a new PDF that will hold the searchable result
        // -----------------------------------------------------------------
        Document searchablePdf = new Document();

        // -----------------------------------------------------------------
        // 3️⃣ Process each page
        // -----------------------------------------------------------------
        // The OCR engine is created once and reused for every page.
        var ocrEngine = new AsposeOcr();

        for (int pageIdx = 1; pageIdx <= scannedPdf.Pages.Count; pageIdx++)
        {
            // ----- a) Render the page to an image at 300 DPI -----
            // JpegDevice(int, int) takes a pixel width and height, so the
            // resolution is passed as a Resolution object instead.
            var rasterizer = new Aspose.Pdf.Devices.JpegDevice(
                new Aspose.Pdf.Devices.Resolution(300));
            using (var imageStream = new System.IO.MemoryStream())
            {
                rasterizer.Process(scannedPdf.Pages[pageIdx], imageStream);
                imageStream.Position = 0;

                // ----- b) Run OCR on the rendered image -----
                var ocrInput = new OcrInput(InputType.SingleImage);
                ocrInput.Add(imageStream);
                var ocrResults = ocrEngine.Recognize(ocrInput);
                string pageText = ocrResults.Count > 0
                    ? ocrResults[0].RecognitionText
                    : string.Empty;

                // ----- c) Add a new page to the result PDF -----
                var resultPage = searchablePdf.Pages.Add();

                // ----- d) Insert the original scanned image back onto the page -----
                var image = new Aspose.Pdf.Image
                {
                    ImageStream = new System.IO.MemoryStream(imageStream.ToArray())
                };
                resultPage.Paragraphs.Add(image);

                // ----- e) Overlay the OCR text as hidden text -----
                // The text is placed at (0,0) with a very small font size and an
                // invisible rendering mode: it is not displayed but stays searchable.
                var hiddenFragment = new TextFragment(pageText)
                {
                    Position = new Position(0, 0)
                };
                hiddenFragment.TextState.Font = FontRepository.FindFont("Helvetica");
                hiddenFragment.TextState.FontSize = 0.01f; // virtually invisible
                hiddenFragment.TextState.RenderingMode = TextRenderingMode.Invisible;

                resultPage.Paragraphs.Add(hiddenFragment);
            }
        }

        // -----------------------------------------------------------------
        // 4️⃣ Save the searchable PDF
        // -----------------------------------------------------------------
        string outputPath = @"C:\Output\SearchableDocument.pdf";
        searchablePdf.Save(outputPath);

        Console.WriteLine($"Searchable PDF created at: {outputPath}");
    }
}

What the code does

| Step | Explanation |
| --- | --- |
| Render page to image | JpegDevice (or PngDevice) rasterizes the scanned page because Aspose.OCR works on raster images. |
| Run OCR | The OCR engine extracts the textual content from the image. |
| Add original image | The rasterized image is re‑added to the new PDF page so the visual appearance stays the same. |
| Add hidden text | A TextFragment with RenderingMode = Invisible (or a very tiny font) is placed on the same page. PDF viewers can index this text, making the PDF searchable. |
| Save | The final document is a normal PDF that looks like the scanned one but is searchable. |

3. Important notes

| Topic | Guidance |
| --- | --- |
| Image quality | Use a resolution of 300 DPI (or higher) for the rasterized image; this gives the OCR engine enough detail. |
| Language support | Set the recognition language on the OCR engine before calling Recognize() if your document is not in English. |
| Performance | For large PDFs, consider processing pages in parallel (Parallel.For) and reusing a single OCR engine instance. |
| Alternative | If you have Aspose.PDF Cloud, the same operation can be performed via the OCR API – see the cloud docs for POST /pdf/{fileName}/ocr. |
| License | Ensure you have a valid commercial license for both Aspose.PDF and Aspose.OCR; otherwise the output will contain evaluation watermarks. |
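
To make the image quality note concrete, here is a small sketch of how rendering resolution translates into pixel dimensions. It relies only on the fact that default PDF user space has 72 units per inch; the class name RasterSize, the helper ToPixels, and the A4 page size of 595 x 842 points are illustrative assumptions and not part of any Aspose API.

```csharp
using System;

static class RasterSize
{
    // Default PDF user space has 72 points per inch,
    // so pixels = points / 72 * dpi (rounded to the nearest pixel).
    public static int ToPixels(double points, int dpi) =>
        (int)Math.Round(points / 72.0 * dpi);

    static void Main()
    {
        // Assumed page size: A4, roughly 595 x 842 points.
        const double widthPt = 595, heightPt = 842;

        foreach (int dpi in new[] { 100, 300 })
        {
            Console.WriteLine(
                $"{dpi} DPI -> {ToPixels(widthPt, dpi)} x {ToPixels(heightPt, dpi)} px");
        }
        // Prints:
        // 100 DPI -> 826 x 1169 px
        // 300 DPI -> 2479 x 3508 px
    }
}
```

At 100 DPI each character is covered by roughly a third as many pixels in each direction as at 300 DPI, which is why the rendering resolution is worth checking first when recognition quality is poor.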

If you encounter any issues (e.g., missing fonts, language configuration, or performance concerns), feel free to provide the error details and we’ll help you troubleshoot. Happy coding! :rocket:

Hi,
We used the below code and still have issues.

ss.pdf is my original file. When I convert it to a searchable PDF, the result is ss_ocr.pdf and, as you can see, the conversion did not happen. It also uses a lot of resources and takes a long time for some reason. Can you please help with this one too?

ss.pdf (986.9 KB)

ss_ocr.pdf (864.6 KB)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
using Aspose.Pdf;
using Aspose.OCR;
using Aspose.Pdf.Devices;
using System.IO;
using Aspose.Pdf.Text;

namespace asposeocrsearchabele
{
    class Program
    {
        static void Main(string[] args)
        {
            string inputPdf = @"C:\YesPDF\data\localweb\uploads\input\ss.pdf";
            string outputPdf = @"C:\YesPDF\data\localweb\uploads\input\ss_ocr.pdf";

            string tempDir = Path.Combine(Path.GetTempPath(), "PdfOcrTemp_" + Guid.NewGuid());
            Directory.CreateDirectory(tempDir);

            try
            {
                // OCR engine
                AsposeOcr ocrEngine = new AsposeOcr();

                // Open the PDF file
                Document pdfDoc = new Document(inputPdf);

                // Final PDF
                Document resultPdf = new Document();

                int pageIndex = 1;
                foreach (Page pdfPage in pdfDoc.Pages)
                {
                    // Save the PDF page as a 300 DPI PNG
                    string imagePath = Path.Combine(tempDir, $"page_{pageIndex}.png");
                    Resolution resolution = new Resolution(100);
                    PngDevice pngDevice = new PngDevice(resolution);
                    using (FileStream fs = new FileStream(imagePath, FileMode.Create))
                    {
                        pngDevice.Process(pdfPage, fs);
                    }

                    // Create the input for OCR
                    OcrInput ocrInput = new OcrInput(InputType.SingleImage);
                    ocrInput.Add(imagePath);

                    // Run the OCR
                    var ocrResults = ocrEngine.Recognize(ocrInput);

                    string recognizedText = string.Empty;
                    if (ocrResults != null && ocrResults.Count > 0)
                        recognizedText = ocrResults[0].RecognitionText;

                    // Create a new page
                    Page newPage = resultPdf.Pages.Add();

                    // Add the original image
                    Image img = new Image
                    {
                        File = imagePath,
                        FixWidth = pdfPage.Rect.Width,
                        FixHeight = pdfPage.Rect.Height
                    };
                    newPage.Paragraphs.Add(img);

                    // Add the OCR text to the PDF page (selectable)
                    TextFragment tf = new TextFragment(recognizedText);
                    newPage.Paragraphs.Add(tf);

                    pageIndex++;
                }

                // Save the PDF after OCR
                resultPdf.Save(outputPdf);

                Console.WriteLine("✅ OCR completed! Output: " + outputPdf);
            }
            catch (Exception ex)
            {
                Console.WriteLine("An error occurred: " + ex.Message);
            }
            finally
            {
                // Clean up temporary files
                try { Directory.Delete(tempDir, true); } catch { }
            }

            Console.WriteLine("Program finished.");
        }
    }
}

@Cengizkurban

We have checked your document and noticed that the images contain text in different directions. We have logged an investigation ticket as OCRNET-1131 in our issue tracking system to analyze this scenario from the Aspose.OCR perspective, because Aspose.OCR alone supports generating searchable PDFs from a scanned PDF document.

We will look into the details of the logged ticket and keep you posted with the status of its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

Hi,
But this is not only about a vertical PDF. We are having the same issue with all scanned documents. Would it be possible for you to look at our code and let us know what we are doing wrong? It should work on an ordinary scanned PDF, right?

@Cengizkurban

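Below is a code snippet that uses only the Aspose.OCR API to convert a scanned PDF document into a searchable PDF:

  // Create the OCR engine.
  AsposeOcr api = new AsposeOcr();
  // Pass the scanned PDF as the recognition input.
  OcrInput ocrInput = new(InputType.PDF);
  ocrInput.Add("file.pdf");
  // Recognize all pages and save the result as a searchable PDF.
  OcrOutput output = api.Recognize(ocrInput);
  output.Save("ResultFile.pdf", Aspose.OCR.SaveFormat.Pdf);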
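
To check whether a result such as ResultFile.pdf actually contains a text layer, one option is to extract its text with the TextAbsorber class from Aspose.PDF. The following is a minimal sketch, assuming Aspose.PDF for .NET is installed and that the file sits in the working directory; the class name TextLayerCheck is only illustrative.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class TextLayerCheck
{
    static void Main()
    {
        // Assumed path: the file written by the Save call in the snippet above.
        using (Document doc = new Document("ResultFile.pdf"))
        {
            // TextAbsorber collects the extractable text of every page it visits.
            TextAbsorber absorber = new TextAbsorber();
            doc.Pages.Accept(absorber);

            string text = absorber.Text ?? string.Empty;
            Console.WriteLine(string.IsNullOrWhiteSpace(text)
                ? "No extractable text found: the PDF is probably still image-only."
                : $"Extracted {text.Length} characters: the PDF has a text layer.");
        }
    }
}
```

Running the same check on the scanned input should report no extractable text, which makes it a quick way to compare the two files without opening a viewer.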