Creating a Searchable PDF from a Scanned PDF File

Hi everyone,
We are testing the Aspose C# SDK to create a searchable PDF from a scanned PDF file, but we only get the OCR text as a result. Is there any way to achieve this with the SDK? We can only see the OCR to Image SDK in the list.

@Cengizkurban

Creating a Searchable PDF from a Scanned PDF with Aspose.PDF for .NET

You can turn a scanned (image‑only) PDF into a searchable PDF by combining two Aspose products:

| Product | Role |
|---|---|
| Aspose.PDF for .NET | Holds the final PDF document and can embed hidden text layers. |
| Aspose.OCR for .NET | Performs optical character recognition on each scanned page. |

Below is a complete, step‑by‑step example (C#) that:

  1. Loads the scanned PDF.
  2. Renders each page to an image (required for OCR).
  3. Runs OCR on the image to obtain the text.
  4. Adds the original image back to a new PDF page.
  5. Overlays the OCR‑generated text as hidden (searchable) text.
  6. Saves the result as a searchable PDF.

1. Install the required NuGet packages

Install-Package Aspose.PDF
Install-Package Aspose.OCR

2. Sample code

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;
using Aspose.OCR;
using Aspose.OCR.Models;

class SearchablePdfFromScannedPdf
{
    static void Main()
    {
        // -----------------------------------------------------------------
        // 1️⃣ Load the scanned PDF (image‑only)
        // -----------------------------------------------------------------
        string scannedPdfPath = @"C:\Input\ScannedDocument.pdf";
        Document scannedPdf = new Document(scannedPdfPath);

        // -----------------------------------------------------------------
        // 2️⃣ Create a new PDF that will hold the searchable result
        // -----------------------------------------------------------------
        Document searchablePdf = new Document();

        // -----------------------------------------------------------------
        // 3️⃣ Process each page
        // -----------------------------------------------------------------
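        // Create the rasterizer and the OCR engine once and reuse them for every page.
        // Resolution(300) renders at 300 DPI; JpegDevice(300, 300) would instead
        // request a 300 x 300 pixel image.
        var rasterizer = new Aspose.Pdf.Devices.JpegDevice(new Aspose.Pdf.Devices.Resolution(300));

        // NOTE: AsposeOcr, OcrInput and RecognitionSettings are the type names used by
        // recent Aspose.OCR for .NET releases; older releases expose a different API,
        // so check the API reference for your installed version.
        var ocrEngine = new AsposeOcr();

        for (int pageIdx = 1; pageIdx <= scannedPdf.Pages.Count; pageIdx++)
        {
            var sourcePage = scannedPdf.Pages[pageIdx];

            using (var imageStream = new System.IO.MemoryStream())
            {
                // ----- a) Render the page to an image (Aspose.PDF rasterization) -----
                rasterizer.Process(sourcePage, imageStream);
                imageStream.Position = 0;

                // ----- b) Run OCR on the rendered image -----
                var ocrInput = new OcrInput(InputType.SingleImage);
                ocrInput.Add(imageStream);
                var ocrResults = ocrEngine.Recognize(ocrInput, new RecognitionSettings());
                string pageText = ocrResults[0].RecognitionText;

                // ----- c) Add a page with the source page's size and no margins -----
                var resultPage = searchablePdf.Pages.Add();
                resultPage.SetPageSize(sourcePage.Rect.Width, sourcePage.Rect.Height);
                resultPage.PageInfo.Margin = new Aspose.Pdf.MarginInfo(0, 0, 0, 0);

                // ----- d) Put the scanned image back, scaled to fill the page -----
                var image = new Aspose.Pdf.Image
                {
                    ImageStream = new System.IO.MemoryStream(imageStream.ToArray()),
                    FixWidth = sourcePage.Rect.Width,
                    FixHeight = sourcePage.Rect.Height
                };
                resultPage.Paragraphs.Add(image);

                // ----- e) Overlay the OCR text as invisible text -----
                // Line breaks are collapsed so the text fits in one fragment. The text is
                // searchable but not aligned with the words in the image. Helvetica only
                // covers Latin text; pick a Unicode font for other scripts.
                var hiddenFragment = new TextFragment(pageText.Replace("\r", " ").Replace("\n", " "))
                {
                    Position = new Position(0, 0)
                };
                hiddenFragment.TextState.Font = FontRepository.FindFont("Helvetica");
                hiddenFragment.TextState.FontSize = 1;
                hiddenFragment.TextState.Invisible = true;

                // TextBuilder writes at an absolute position, so the text stays on this
                // page instead of flowing onto a new page after the full-page image.
                new TextBuilder(resultPage).AppendText(hiddenFragment);
            }
        }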

        // -----------------------------------------------------------------
        // 4️⃣ Save the searchable PDF
        // -----------------------------------------------------------------
        string outputPath = @"C:\Output\SearchableDocument.pdf";
        searchablePdf.Save(outputPath);

        Console.WriteLine($"Searchable PDF created at: {outputPath}");
    }
}

What the code does

| Step | Explanation |
|---|---|
| Render page to image | JpegDevice (or PngDevice) rasterizes the scanned page because the OCR engine works on raster images. |
| Run OCR | The OCR engine extracts the textual content from the image. |
| Add original image | The rasterized image is re‑added to the new PDF page so the visual appearance stays the same. |
| Add hidden text | A TextFragment marked as invisible is placed on the same page. PDF viewers can index this text, which makes the PDF searchable. |
| Save | The final document is a normal PDF that looks like the scanned one but can be searched. Because the hidden text is not positioned word by word, search hits will not be highlighted at the exact location in the image. |
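
If the page‑by‑page overlay is more control than you need, recent Aspose.OCR for .NET releases can, as far as we know, read the scanned PDF directly and save the recognition results as a searchable PDF. The sketch below is a minimal illustration under that assumption: the `OcrInput`, `Recognize` and `SaveMultipageDocument` members and the `Language.Eng` value are taken from those recent releases, and the file paths are placeholders. Please verify the names against the API reference for your installed version.

```csharp
using System;
using Aspose.OCR;

class DirectSearchablePdfSketch
{
    static void Main()
    {
        // Placeholder paths; adjust them to your environment.
        string scannedPdfPath = @"C:\Input\ScannedDocument.pdf";
        string outputPath = @"C:\Output\SearchableDocument.pdf";

        var ocrEngine = new AsposeOcr();

        // Feed the scanned PDF to the OCR engine; each page yields one result.
        var input = new OcrInput(InputType.PDF);
        input.Add(scannedPdfPath);

        // Assumption: Language.Eng selects English; use another member for other languages.
        var settings = new RecognitionSettings { Language = Language.Eng };

        var results = ocrEngine.Recognize(input, settings);

        // Assumption: SaveFormat.Pdf keeps the page image and adds a searchable text layer.
        AsposeOcr.SaveMultipageDocument(outputPath, SaveFormat.Pdf, results);

        Console.WriteLine($"Searchable PDF created at: {outputPath}");
    }
}
```

This route needs only the Aspose.OCR package. The Aspose.PDF approach remains useful when you want to control how the page image and text layer are assembled.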

3. Important notes

| Topic | Guidance |
|---|---|
| Image quality | Rasterize at 300 DPI or higher so the OCR engine has enough detail. |
| Language support | Set the recognition language before running OCR if your document is not in English. In recent Aspose.OCR releases this is done through RecognitionSettings.Language; check the API reference for the member names in your version. |
| Performance | For large PDFs, create the rasterizer and the OCR engine once instead of once per page. Pages can also be processed in parallel (Parallel.For), but confirm that the objects you share between threads are thread‑safe in your version, or use one instance per worker. |
| Alternative | If you use Aspose.PDF Cloud rather than the on‑premises library, check the cloud documentation for an OCR endpoint; confirm the path POST /pdf/{fileName}/ocr there before relying on it. |
| License | Apply a valid license for both Aspose.PDF and Aspose.OCR; without one the libraries run in evaluation mode and the output will contain evaluation watermarks. |
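
To illustrate the performance note, here is a minimal sketch of order‑preserving parallel processing. It uses only the .NET base library; `recognizePage` is a hypothetical stand‑in for "rasterize page N and run OCR on it", and `maxWorkers` is an illustrative parameter. Whether a single Document or OCR engine may be shared across threads is not confirmed here, so treat that as something to test with your version.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelOcrSketch
{
    // recognizePage is a hypothetical delegate: given a 1-based page number,
    // it returns the recognized text for that page.
    public static string[] RecognizeAllPages(int pageCount, Func<int, string> recognizePage, int maxWorkers)
    {
        var pageTexts = new string[pageCount];

        // Cap the worker count: OCR is memory- and CPU-heavy.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };

        // Each iteration writes only its own slot, so no locking is needed and
        // the results stay in page order regardless of completion order.
        Parallel.For(1, pageCount + 1, options, pageNumber =>
        {
            pageTexts[pageNumber - 1] = recognizePage(pageNumber);
        });

        return pageTexts;
    }

    static void Main()
    {
        // Dummy recognizer used only to show the call shape.
        string[] texts = RecognizeAllPages(3, n => $"text of page {n}", maxWorkers: 2);
        Console.WriteLine(string.Join(" | ", texts));
    }
}
```

Collect the texts first and build the output PDF afterwards on a single thread, so that only the recognition step runs concurrently.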

If you encounter any issues (e.g., missing fonts, language configuration, or performance concerns), feel free to provide the error details and we’ll help you troubleshoot. Happy coding!