Existing PDF to PDF with Text Layer

jayoffice365 · October 8, 2018, 7:55pm

Hi,

There are loads of articles on the forums covering this topic, but most are old and none are particularly helpful.

We have a simple requirement to take an existing PDF document with no text layer, OCR it and save it back to a PDF document.

We currently do this by breaking the PDF down into individual image files, then using tesseract to OCR them and save the result as a PDF doc… but this is slow and CPU intensive… we shouldn’t need to do this!

We have code which transforms the existing PDF into a multipage TIFF, and we OCR that TIFF with Aspose.OCR… can you outline, with sample code, how we can convert the TIFF file to PDF and embed the OCR’d text?

Thanks

muhammadahmad · October 8, 2018, 8:53pm

@jayoffice365,

You may use the code snippet given below to save the OCR output of a TIFF file to PDF.

using System;
using Aspose.OCR;
using Aspose.Words;

// Run OCR over every page of the multipage TIFF
OcrEngine ocrEngine = new OcrEngine();
ocrEngine.Image = ImageStream.FromFile("SampleText.tif");
ocrEngine.ProcessAllPages = true;

Aspose.Words.Document doc = new Aspose.Words.Document();
Aspose.Words.DocumentBuilder builder = new Aspose.Words.DocumentBuilder(doc);
if (ocrEngine.Process())
{
    // Retrieve the list of Pages
    Page[] pages = ocrEngine.Pages;

    // Iterate over the list of Pages
    for (int i = 0; i < pages.Length; i++)
    {
        // Display the recognized text from each Page
        Console.WriteLine(pages[i].PageText);
        builder.Writeln(pages[i].PageText.ToString());

        // Break between pages, but not after the last one,
        // so the PDF does not end with a blank page
        if (i < pages.Length - 1)
        {
            builder.InsertBreak(BreakType.PageBreak);
        }
    }
}

// Save the recognized text as a PDF document
doc.Save("Sample.pdf", SaveFormat.Pdf);

You may also run OCR directly on a PDF file. For more information and sample code, please visit the link given below.
Performing OCR on PDF Documents

We hope this answers your question. Please feel free to reach out to us if additional information is required.