Convert a TIFF file to a searchable PDF file

atulkalohatechnology · August 6, 2014, 10:56am

Hi Team,

I have converted a TIFF file to PDF using the Aspose.Pdf DLL, but when I open the converted PDF and search for text with “Ctrl+F”, I get a “text not found” message even though the text is present in the file. This means the created PDF is not searchable. I want to convert the TIFF file to PDF with OCR functionality. Could I get C# code to convert a TIFF file to a searchable PDF?

I have also tried the code below, which the Aspose team posted for creating searchable PDFs, but it does not work for me.

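using System.Diagnostics;   // Process, ProcessStartInfo
using System.IO;            // StreamReader
using Aspose.Pdf;           // Document

public void Main()
{
    Document doc = new Document("Input.pdf");
    // Convert invokes the callback for the images in the document and
    // uses the hOCR markup it returns to make the text searchable.
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}

private string CallBackGetHocr(System.Drawing.Image img)
{
    // Save the image to disk so that Tesseract can read it.
    string dir = @"c:\PdfTest\";
    img.Save(dir + "test.jpg");

    // Run Tesseract on the saved image and request hOCR output.
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Read the hOCR result. Some Tesseract versions write out.hocr
    // instead of out.html; adjust the path to match.
    using (StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html"))
    {
        return streamReader.ReadToEnd();
    }
}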

I have also downloaded the Tesseract application.

Thanks,
Atul kadam

tilal.ahmad · August 7, 2014, 3:05am

Hi Atul,

Thanks for your inquiry. The suggested code takes a PDF file as input and creates a searchable PDF document using the Google OCR tool (Tesseract). You may convert your TIFF image to PDF first and then pass that PDF to the mentioned code; it should work. If you find any issue, please share the error message and your source TIFF. We will test the scenario at our end and guide you accordingly.
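
A minimal sketch of the TIFF-to-PDF step is given below. It assumes a version of Aspose.Pdf for .NET that exposes `Document`, `Page`, and `Aspose.Pdf.Image` with a `File` property, and a single-page TIFF. The file names are hypothetical and chosen only so that the output matches the `Input.pdf` expected by the code above.

```csharp
using Aspose.Pdf;

class TiffToPdf
{
    static void Main()
    {
        // Hypothetical paths, used only for illustration.
        string tiffPath = "Input.tif";
        string pdfPath = "Input.pdf";

        // Create an empty PDF document with one page.
        Document doc = new Document();
        Page page = doc.Pages.Add();

        // Place the TIFF on the page as an image paragraph.
        Aspose.Pdf.Image image = new Aspose.Pdf.Image();
        image.File = tiffPath;
        page.Paragraphs.Add(image);

        // This PDF contains only the image, so it is not searchable yet.
        doc.Save(pdfPath);
    }
}
```

The resulting `Input.pdf` can then be opened with `new Document("Input.pdf")` and passed through `doc.Convert(CallBackGetHocr)` to add the searchable text.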

Best Regards,

atulkalohatechnology · August 7, 2014, 3:41am

Hello Tilal,

I first converted my TIFF file to PDF using the TIFF-image-to-PDF code, and then used the above code to turn the converted PDF into a searchable PDF. The code does not give any error, but the resulting PDF is not searchable: when I search for any word using Ctrl+F, I get a “text not found” message even though the text is present in the PDF file. I have attached the TIFF that I need to convert into a searchable PDF. Please also let us know whether there is another way to convert a TIFF to a searchable PDF without using the “Google OCR tool”.

If Aspose has a direct way to convert a TIFF file to a searchable PDF, please post the code; that would be very helpful for us.

Thanks,

Atul kadam

tilal.ahmad · August 8, 2014, 12:51am

Hi Atul,

Thanks for your feedback. I am afraid I could not reproduce any issue when creating a searchable PDF document from the TIFF image you shared. Please find the attached sample project for this purpose. Hopefully it will help you accomplish the task.

Moreover, I am afraid Aspose.OCR is currently not mature enough to serve this purpose. Our development team is working hard to improve it. As soon as the issues in Aspose.OCR are fixed, we will be able to create searchable PDF documents without any third-party tool. We are sorry for the inconvenience.

Best Regards,

atulkalohatechnology · August 8, 2014, 2:34am

Hello Tilal,

Can you please provide the sample project you used to convert the TIFF file to a searchable PDF?

Thanks

Atul kadam

tilal.ahmad · August 8, 2014, 2:48am

Hi Atul,

Sorry, I forgot to attach the sample project. It is now attached to my post above.

Best Regards,

atulkalohatechnology · August 8, 2014, 6:32am

Hello Tilal,

Thanks, your solution is working for me. However, I will consider it a workaround because it relies on a third-party tool. I am still waiting for Aspose.OCR to fix the searchable PDF issue so that we can get rid of the third-party tool and use Aspose.OCR directly to convert TIFF to searchable PDF.

Thanks

Atul kadam

codewarior · August 8, 2014, 2:56pm

Hi Atul,

We have logged an enhancement ticket, OCR-33801, in the issue tracking system of Aspose.OCR for .NET. It covers performing OCR over TIFF or other image files and returning the result as HTML/XHTML so that the formatting of the image contents is preserved. Once the HTML/XHTML is generated, you may use either Aspose.Pdf for .NET or Aspose.Words for .NET to convert the HTML/XHTML file to PDF format. The respective team is working hard on supporting this feature, and as soon as we have definite news regarding its implementation, we will let you know. Please be patient and allow us a little time.