Hi Venkataravi,
Thanks for contacting support.
As per my understanding, you are able to convert a non-searchable PDF file to image format using Aspose.Pdf for Java, but you are facing an issue while creating a searchable PDF from the image files. Can you please share whether the problem occurs while performing OCR on the extracted images, or while converting the image files back to PDF format? (I think the latter won’t help, because converting images back to PDF will produce a non-searchable PDF file.)
Besides this, when using Aspose.Pdf for .NET, you can use Aspose.Pdf in collaboration with another OCR application that supports the HOCR standard. The free Google Tesseract OCR can be used for this purpose, so a non-searchable PDF can be converted to a searchable PDF document as described below. You can install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list. After installation, you will have the tesseract.exe console application.
Below is a usage example:
C#
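using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

class Program
{
    public static void Main()
    {
        // Load the non-searchable PDF; the callback supplies HOCR text for each page image
        Document doc = new Document("Input.pdf");
        doc.Convert(CallBackGetHocr);
        doc.Save("output.pdf");
    }

    private static string CallBackGetHocr(System.Drawing.Image img)
    {
        // Working folder for the temporary image and the tesseract output (must already exist)
        string dir = @"c:\PdfTest\";
        img.Save(dir + "test.jpg");

        // Run: tesseract <image> <output base name> hocr
        ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        info.Arguments = dir + "test.jpg " + dir + "out hocr";
        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
        }

        // Read the HOCR markup; depending on the Tesseract version,
        // the file may be named out.hocr instead of out.html
        using (StreamReader streamReader = new StreamReader(dir + "out.html"))
        {
            return streamReader.ReadToEnd();
        }
    }
}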
However, to use this solution in Java, you can try using jtesseract. For more information, you may consider looking over the following links:
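As a rough illustration of the Java route, the sketch below does not use jtesseract at all; it simply launches the same tesseract console application from Java, mirroring the C# callback above. The class name TesseractHocr and its two methods are hypothetical names chosen for this example, it assumes Java 11 or later with tesseract available on the PATH, and the output file extension is an assumption that depends on the installed Tesseract version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of any Aspose or Tesseract API.
final class TesseractHocr {

    // Builds the same command line as the C# example:
    //   tesseract <image> <outputBase> hocr
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "hocr");
    }

    // Runs tesseract on the image and returns the HOCR markup as a string.
    static String imageToHocr(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore console output
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // Assumption: the result is written to <outputBase>.hocr or, with
        // older Tesseract releases, to <outputBase>.html.
        Path hocr = Path.of(outputBase + ".hocr");
        Path html = Path.of(outputBase + ".html");
        Path result = Files.exists(hocr) ? hocr : html;
        return Files.readString(result, StandardCharsets.UTF_8);
    }
}
```

For example, TesseractHocr.imageToHocr(Path.of("c:\\pdftest\\test.jpg"), Path.of("c:\\pdftest\\out")) would return the HOCR text for one page image. How that text is then fed back into a PDF on the Java side is not covered by this sketch.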
Besides this, we have an API named Aspose.OCR for Java, which performs OCR over images on the Java platform. So, as per your requirement, you can first convert the pages of the PDF file to image format using the instructions specified in Convert PDF pages to JPEG Image. Once the image files are generated, you can perform OCR on them using Aspose.OCR for Java. For further details, please visit Performing OCR on an Image. However, when using this approach, the output is saved as a plain text file.
I am afraid the current release of Aspose.OCR for Java does not support performing OCR on an image and saving the output in HTML format, but I have asked my colleague to look further into this requirement. We will update you with the required information soon. We are sorry for the inconvenience.