Creating searchable PDFs (OCR)

BSchwab · March 5, 2018, 1:19pm

Hello,

Is it possible to OCR text in an existing PDF and create a new PDF with the recognized text, so that it becomes searchable?

imran.rafique · March 5, 2018, 6:22pm

You can check whether the source PDF contains only text or also includes images. The PdfExtractor class helps with this; please refer to this help topic: Find whether PDF file contains images or text only

Furthermore, to convert a non-searchable PDF file (a scanned-image PDF) into a searchable PDF document, please try the following code snippet with Tesseract.

C#

// Requires: using System.Diagnostics; using System.IO; using Aspose.Pdf;
Document doc = new Document("D:/Downloads/input.pdf");
// Run OCR through the callback so the recognized text is added to the document
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/pdf_searchable.pdf");
//********************* CallBackGetHocr method ***********************//
static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    // Save the image to disk so Tesseract can read it
    img.Save(dir + "ocrtest.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: <input image> <output base name> hocr
    info.Arguments = dir + "ocrtest.jpg " + dir + "out hocr";
    using (Process p = Process.Start(info))
    {
        p.WaitForExit();
    }
    // Note: newer Tesseract releases write out.hocr instead of out.html;
    // adjust the file name to match the installed version
    return File.ReadAllText(dir + "out.html");
}
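
For orientation, the hOCR returned by the callback is HTML in which each recognized word carries its pixel bounding box in a `title` attribute; this position data is what lets the text line up with the scanned image. Below is a minimal sketch that lists the words and boxes from Tesseract-style hOCR. The names `HocrSketch` and `ReadWords` are hypothetical, and the regular expression assumes the attribute order Tesseract writes (`class` before `title`); it is not a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HocrSketch
{
    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>word</span>
    static readonly Regex WordPattern = new Regex(
        @"<span[^>]*class=['""]ocrx_word['""][^>]*title=['""]bbox (\d+) (\d+) (\d+) (\d+)[^'""]*['""][^>]*>(.*?)</span>",
        RegexOptions.Singleline);

    // Returns each recognized word with its bounding box in image pixels
    public static List<(string Text, int X0, int Y0, int X1, int Y1)> ReadWords(string hocr)
    {
        var words = new List<(string Text, int X0, int Y0, int X1, int Y1)>();
        foreach (Match m in WordPattern.Matches(hocr))
        {
            // Strip nested markup such as <strong> or <em> around the word
            string text = Regex.Replace(m.Groups[5].Value, "<[^>]+>", "").Trim();
            words.Add((text,
                int.Parse(m.Groups[1].Value), int.Parse(m.Groups[2].Value),
                int.Parse(m.Groups[3].Value), int.Parse(m.Groups[4].Value)));
        }
        return words;
    }
}
```

Calling `HocrSketch.ReadWords` on the string read from the Tesseract output file is a quick way to check that the OCR step produced words at all before the result is handed back to `Convert`.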

BSchwab · March 6, 2018, 8:02am

Hi, thanks for the code snippet. It seems to work, but only for some PDF files.
For PDF files created with Microsoft Word (containing just a single picture), nothing happens. Why?
See attached file “test.pdf”: test.pdf (26.4 KB)

imran.rafique · March 6, 2018, 2:33pm

@BSchwab,

There may be an issue with the Tesseract software package. You can use the Aspose.OCR API to detect the text in images. Please modify the code as follows:
C#

static string CallBackGetHocr(System.Drawing.Image img)
{
    string dataDir = @"C:\Pdf\test701\";
    img.Save(dataDir + "ocrtest.jpg");
    // Initialize an instance of OcrEngine
    Aspose.OCR.OcrEngine ocrEngine = new Aspose.OCR.OcrEngine();

    // Set the Image property by loading the image from file path location or an instance of MemoryStream 
    ocrEngine.Image = Aspose.OCR.ImageStream.FromFile(dataDir + "ocrtest.jpg");

    // Process the image
    ocrEngine.Process();            
    string text = ocrEngine.Text.ToString();
    return text;
}

BSchwab · March 7, 2018, 7:41am

Hello,
I think there is a problem with the “Convert” method. With “test.pdf”, the “CallBackGetHocr” method is never reached, so modifying the CallBackGetHocr method does not help.
test.pdf (26.4 KB)

Is “CallBackGetHocr” the only way to add hOCR to a PDF file?

imran.rafique · March 7, 2018, 6:42pm

@BSchwab,

Please try the following code example:
C#

string dataDir = @"C:\Pdf\test705\";
Document doc = new Document(dataDir + "test.pdf");

// Create ImagePlacementAbsorber object to perform image placement search
ImagePlacementAbsorber abs = new ImagePlacementAbsorber();
// Accept the absorber for the first page
doc.Pages[1].Accept(abs);
// Take the first ImagePlacement found on the page
ImagePlacement imagePlacement = abs.ImagePlacements[1];
// Get the image using ImagePlacement object
XImage xImage = imagePlacement.Image;
FileStream outputImage = new FileStream(dataDir + "ocrtest.jpg", FileMode.Create);
// Save output image
xImage.Save(outputImage, ImageFormat.Jpeg);
outputImage.Close();

// Initialize an instance of OcrEngine
Aspose.OCR.OcrEngine ocrEngine = new Aspose.OCR.OcrEngine();
// Set the Image property by loading the image from file path location or an instance of MemoryStream 
ocrEngine.Image = Aspose.OCR.ImageStream.FromFile(dataDir + "ocrtest.jpg");

// Process the image
ocrEngine.Process();
string text = ocrEngine.Text.ToString();

// Create RedactionAnnotation instance for specific page region
RedactionAnnotation annot = new RedactionAnnotation(doc.Pages[1], imagePlacement.Rectangle);
annot.FillColor = Aspose.Pdf.Color.White;
annot.BorderColor = Aspose.Pdf.Color.Yellow;
annot.Color = Aspose.Pdf.Color.White;
// Text to be printed on redact annotation
//annot.OverlayText = text;
           
// Add annotation to annotations collection of first page
doc.Pages[1].Annotations.Add(annot);
// Flatten the annotation and redact the page contents (i.e. remove text and images
// under the redacted annotation)
annot.Redact();
FloatingBox box = new FloatingBox();

TextFragment fragment = new TextFragment(text);
//fragment.Margin = new MarginInfo(0, 20, 0, 0);
fragment.TextState.HorizontalAlignment = HorizontalAlignment.Justify;

box.Paragraphs.Add(fragment);
//box.Left = imagePlacement.Rectangle.LLX;
doc.Pages[1].Paragraphs.Add(box);
doc.Save(dataDir + "Output.pdf");

This is the output PDF: Output.pdf (21.1 KB)

imran.rafique · March 7, 2018, 7:13pm

@BSchwab,

We managed to replicate the XML format error in our environment. We have logged an investigation under the ticket ID PDFNET-44340 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.

BSchwab · March 8, 2018, 7:19am

Hi,
thanks for the code, but the output.pdf looks completely different (see here: compare.jpg (136.6 KB) )
It seems that the original picture is invisible and the recognized text is visible, but in the wrong position.

Thanks for the ticket ID, I'm looking forward to an update.

imran.rafique · March 8, 2018, 4:41pm

@BSchwab,

You can compute the position of the floating box on the PDF page from the image rectangle coordinates, the page coordinates and the page margins. Please modify the code as follows:
C#

FloatingBox box = new FloatingBox();
TextFragment fragment = new TextFragment(text);
fragment.TextState.HorizontalAlignment = HorizontalAlignment.Left;
box.Paragraphs.Add(fragment);
box.Top = doc.Pages[1].Rect.URY - imagePlacement.Rectangle.URY - doc.Pages[1].PageInfo.Margin.Top;
doc.Pages[1].PageInfo.Margin.Left = imagePlacement.Rectangle.LLX;
doc.Pages[1].Paragraphs.Add(box);
doc.Save(dataDir + "Output.pdf");
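
The arithmetic behind `box.Top` can be checked in isolation. PDF rectangles are measured from the bottom-left corner of the page, so a rectangle's URY is its distance from the bottom edge, whereas the snippet above treats `Top` as an offset measured downward from the top margin. A minimal sketch with plain numbers follows; `OverlayPosition` and `TopOffset` are hypothetical names, and the reading of `Top` as relative to the top margin is an assumption taken from the snippet.

```csharp
using System;

static class OverlayPosition
{
    // pageUpperY:  URY of the page rectangle (page height when LLY is 0)
    // imageUpperY: URY of the image rectangle, measured from the page bottom
    // topMargin:   top page margin, assumed to be where the Top offset starts
    public static double TopOffset(double pageUpperY, double imageUpperY, double topMargin)
    {
        // Distance from the top margin down to the top edge of the image
        return pageUpperY - imageUpperY - topMargin;
    }
}
```

For example, on a page 842 points high with a 72 point top margin, an image whose top edge is at URY 700 gives `OverlayPosition.TopOffset(842, 700, 72)`, which is 70.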

BSchwab · March 9, 2018, 8:03am

Thanks for the approach, but I'm sorry, that makes no sense to me.
You can't use it that way in a professional environment. Creating a searchable PDF must not change the layout in any way. ‘Real’ PDF files are also much more complex than “test.pdf”.
I will wait for an update on the ticket.

imran.rafique · March 9, 2018, 5:42pm

@BSchwab,

The floating box accepts absolute positioning rather than flow layout (Top left to Bottom right) on the page. As a workaround, Top, Left, Bottom and Right properties can be used to adjust the position of a floating box.

We will let you know once the linked ticket ID PDFNET-44340 is resolved.

BSchwab · July 27, 2018, 11:20am

Hi, it seems that the problem still persists?

Farhan.Raza · July 27, 2018, 7:13pm

@BSchwab

Thank you for getting back to us.

We are afraid this ticket is still pending investigation owing to previously logged and critical issues in the queue. It will be scheduled in its due turn. We have recorded your concerns and will let you know as soon as any significant updates are available. We appreciate your patience and understanding in this regard.

aspose.notifier · February 7, 2019, 4:46pm

The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan

rabin.samanta · February 13, 2019, 9:03am

How can this be done in Java?

Farhan.Raza · February 13, 2019, 7:13pm

@rabin.samanta

Please visit Converting non searchable PDF to searchable PDF document for your kind reference.

rabin.samanta · February 15, 2019, 6:47am

@Farhan.Raza
thanks

salemantulsa · January 8, 2020, 3:18am

Do you have a C# sample code?

asad.ali · January 8, 2020, 7:26pm

@salemantulsa

To convert a non-searchable PDF file (a scanned-image PDF) into a searchable PDF document in C#, please try the following code snippet with Tesseract.

C#

// Requires: using System.Diagnostics; using System.IO; using Aspose.Pdf;
Document doc = new Document("D:/Downloads/input.pdf");
// Run OCR through the callback so the recognized text is added to the document
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/pdf_searchable.pdf");
//********************* CallBackGetHocr method ***********************//
static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    // Save the image to disk so Tesseract can read it
    img.Save(dir + "ocrtest.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: <input image> <output base name> hocr
    info.Arguments = dir + "ocrtest.jpg " + dir + "out hocr";
    using (Process p = Process.Start(info))
    {
        p.WaitForExit();
    }
    // Note: newer Tesseract releases write out.hocr instead of out.html;
    // adjust the file name to match the installed version
    return File.ReadAllText(dir + "out.html");
}
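
One detail that may be worth guarding against: depending on the Tesseract version, the hOCR output for the base name `out` is written as either `out.hocr` or `out.html`. A minimal sketch of a lookup that accepts both is shown below; `TesseractOutput` and `FindHocrFile` are hypothetical helper names, not part of Aspose.PDF or Tesseract.

```csharp
using System;
using System.IO;

static class TesseractOutput
{
    // Returns the path of the hOCR file written for the given output base
    // name, or null if Tesseract produced neither out.hocr nor out.html
    public static string FindHocrFile(string outputBase)
    {
        foreach (string extension in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + extension;
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

In the callback, `File.ReadAllText(TesseractOutput.FindHocrFile(@"E:\Data\out"))` could then replace the hard-coded file name, with a null check added to report a failed OCR run.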

salemantulsa · January 9, 2020, 4:12am

How do I do this in VB.net?