Merge OCR text to original PDF

simon.fairey · February 21, 2020, 12:43am

Hi

We have some scanned PDFs that are OCR’d by a separate process, which returns the words and bounding-box positions of the recognised text.

Is there a way to merge this data into the text layer for each page of the original PDF?

Thanks

Simon

asad.ali · February 21, 2020, 10:53am

@simon.fairey

Thanks for contacting support.

Aspose.PDF provides a way to create searchable PDFs using external OCR utilities. The following C# code shows how to achieve it:

using System.Diagnostics;
using System.IO;

public static string ConvertPDFToSearchable(string file)
{
    // The callback receives a page image and returns its hOCR markup.
    Aspose.Pdf.Document doc = new Aspose.Pdf.Document(file);
    doc.Convert(CallBackGetHocr);
    string output = file + "-Version4.pdf";
    doc.Save(output);
    return output;
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"C:\Temp\";
    img.Save(dir + "ocrtest.jpg");

    // Run Tesseract with the "hocr" config so it emits hOCR rather than plain text.
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files\Tesseract-OCR\tesseract.exe");
    info.WorkingDirectory = @"C:\Program Files\Tesseract-OCR";
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"C:\Temp\ocrtest.jpg C:\Temp\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Recent Tesseract builds write out.hocr in hocr mode; older ones write out.html.
    return File.ReadAllText(dir + "out.hocr");
}

For your particular case, would you kindly share a sample PDF along with the extracted text file? We will test the scenario in our environment and address it accordingly.

simon.fairey · February 21, 2020, 4:08pm

This seems Tesseract-specific. The output I have is from Azure Cognitive Services in JSON format; how should I generate output that works with the above example?

Thanks

asad.ali · February 21, 2020, 8:47pm

@simon.fairey

Regretfully, your requirement cannot be fulfilled with the current feature set of the API. However, we do intend to investigate its feasibility in detail, which is why we need a sample PDF along with the extracted text file from your side.

simon.fairey · February 24, 2020, 5:11pm

Ah, I didn’t realise hOCR was a format, so in theory I just need to convert what I have to hOCR.

Thanks
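
As a rough illustration of the conversion Simon describes, the sketch below builds a minimal single-page hOCR document from words and their bounding boxes. The `OcrWord` type and `HocrBuilder` class are hypothetical names introduced here. They are not part of Aspose.PDF or of any OCR service, and mapping the service's JSON onto `OcrWord` is left out because the thread does not show that JSON. It assumes boxes are given in pixels from the top-left corner of the page image, as hOCR expects. Whether Aspose.PDF accepts this reduced structure (page, line, and word elements only) is not confirmed in the thread and would need testing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;

// Hypothetical container for one recognised word. Coordinates are pixels
// measured from the top-left corner of the page image.
public sealed class OcrWord
{
    public string Text;
    public int Left, Top, Right, Bottom;
}

public static class HocrBuilder
{
    // Builds a minimal single-page hOCR document from words grouped into lines.
    public static string Build(IEnumerable<IReadOnlyList<OcrWord>> lines, int pageWidth, int pageHeight)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        sb.AppendLine("<html xmlns=\"http://www.w3.org/1999/xhtml\">");
        sb.AppendLine("<head><meta name=\"ocr-capabilities\" content=\"ocr_page ocr_line ocrx_word\"/></head>");
        sb.AppendLine("<body>");
        sb.AppendLine($"<div class=\"ocr_page\" id=\"page_1\" title=\"bbox 0 0 {pageWidth} {pageHeight}\">");

        int lineNo = 0, wordNo = 0;
        foreach (var line in lines)
        {
            if (line.Count == 0) continue;
            lineNo++;

            // The line box is the union of its word boxes.
            int l = line.Min(x => x.Left), t = line.Min(x => x.Top);
            int r = line.Max(x => x.Right), b = line.Max(x => x.Bottom);
            sb.AppendLine($"<span class=\"ocr_line\" id=\"line_{lineNo}\" title=\"bbox {l} {t} {r} {b}\">");

            foreach (var w in line)
            {
                wordNo++;
                // Escape the text so characters such as & and < keep the markup valid.
                string text = WebUtility.HtmlEncode(w.Text);
                sb.AppendLine($"<span class=\"ocrx_word\" id=\"word_{wordNo}\" title=\"bbox {w.Left} {w.Top} {w.Right} {w.Bottom}\">{text}</span>");
            }
            sb.AppendLine("</span>");
        }

        sb.AppendLine("</div>");
        sb.AppendLine("</body>");
        sb.AppendLine("</html>");
        return sb.ToString();
    }
}
```

With something like this in place, the callback passed to `doc.Convert` could return `HocrBuilder.Build(lines, img.Width, img.Height)` instead of launching Tesseract, provided the boxes are expressed in the pixel space of the image the callback receives. If the OCR output uses other units, such as inches or points, the coordinates would need scaling first.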