Convert an image to a searchable PDF file using the Google Tesseract OCR exe

Hi

I am using VS2012 with VB and Aspose.Pdf, downloaded as an individual API: Aspose.Pdf.dll version 9.4.0.0 from http://www.aspose.com/community/files/51/.net-components/aspose.pdf-for-.net/default.aspx (Aspose.Pdf for .NET 9.4.0, DLLs only). I use only the DLL in the net2.0 folder, not the 3.0 or 4.0 ones.

My problem is that I can't use the CallBackGetHocr function in VB; see the forum post here.

Please first solve the CallBackGetHocr problem for VB. I attach a sample VB project.

I can't attach the full project zip here, so I have already sent it via Gmail. Here I attach only the .aspx, .aspx.vb and web.config files. Please use Aspose.Pdf DLL version 9.4.0.0, downloaded from the link mentioned above.

Regards
Aravind

Hi


I am converting all image files to PDF files, but I need one more feature. The images contain words, so after converting the images to PDF I need to search for a word in the PDF, but currently the search shows no results.
Note: the search for the word in the output PDF file is done in .NET.

Please provide sample code. I attach a sample image file that was converted to a PDF file.


Please reply ASAP.
Regards
Aravind

Hi Aravind,


Thanks for your inquiry. I'm afraid searchable PDF is currently not supported with Aspose components, as Aspose.Ocr is not yet mature enough. We are facing some issues with text recognition accuracy and text coordinates. Our development team is working hard to fix these issues and is investigating some new algorithms for the purpose.

As a workaround, you can create a searchable PDF document from an image using Aspose.Pdf together with another OCR application that supports the hOCR standard. You can use the free Google Tesseract OCR for this. As a first step, please convert your image to PDF by following this documentation link; you can then convert it into a searchable PDF document as described below.

Please install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list; after that you will have the tesseract.exe console application.

Below you can see a usage example:

[C#]

// Requires: using System.Diagnostics; using System.IO; using Aspose.Pdf;

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest\";
    // Save the image handed over by Aspose.Pdf so that tesseract can read it from disk.
    img.Save(dir + "test.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: input image, output base name, and the "hocr" config.
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Read back the hOCR markup that tesseract wrote.
    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

public void Main()
{
    Document doc = new Document("Input.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}
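
A slightly more defensive version of the Tesseract call in the callback could look like the sketch below. It assumes tesseract is on the PATH. The TesseractRunner class and its method names are illustrative only and are not part of Aspose.Pdf or Tesseract. It quotes both paths so that folders containing spaces work, checks the exit code, and looks for either an .hocr or an .html output file, on the assumption that the extension depends on the installed Tesseract version.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractRunner
{
    // Builds the command line "<image> <outputBase> hocr", quoting both paths
    // so that directories containing spaces still work.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Runs tesseract and returns the hOCR markup it produced.
    // Assumption: "tesseract" can be resolved through the PATH.
    public static string RunHocr(string imagePath, string outputBase)
    {
        var info = new ProcessStartInfo("tesseract", BuildArguments(imagePath, outputBase))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
        }
        // Assumption: depending on the version, the output is <outputBase>.hocr or <outputBase>.html.
        foreach (var ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate))
                return File.ReadAllText(candidate);
        }
        throw new FileNotFoundException("No hOCR output found for " + outputBase);
    }
}
```

Inside CallBackGetHocr, the image would still be saved first, and the method would then return TesseractRunner.RunHocr with the saved image path and an output base name in the same folder.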


Please feel free to contact us for any further assistance.

Best Regards,

Hi

Can you provide the above code in VB, and tell me which Tesseract OCR package I need to download from http://code.google.com/p/tesseract-ocr/downloads/list? Are you kidding me? That page has more than 150 files and they all have the same name. Please provide a screenshot showing which one needs to be downloaded. I attach a sample here.

And this function

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest\";
    img.Save(dir + "test.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

public void Main()
{
    Document doc = new Document("Input.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}


In the first step you say I need to convert the image file to PDF, but which file do you use?

Is Input.pdf converted from the image file using the Aspose tool?

And what about the test.jpg file? Is test.jpg the image file, and is Input.pdf converted from test.jpg? Then why does test.jpg need to be used again?



Note: please provide the code in VB and specify which Tesseract OCR package I need to download. I attach a screenshot here; I downloaded the file marked with the red rectangle, but when I open it I don't see any exe file.



Regards


Aravind


Hi Aravind

Aravindb:
Hi
Can you provide the above code in VB, and tell me which Tesseract OCR package I need to download from http://code.google.com/p/tesseract-ocr/downloads/list? Are you kidding me? That page has more than 150 files and they all have the same name. Please provide a screenshot showing which one needs to be downloaded. I attach a sample here.

Please pay attention to the Summary + Labels column on that page; it contains the package details. A screenshot is attached here for reference.

Please find sample VB code.

Private Shared Sub Main(args As String())
    Dim doc As New Document("Input.pdf")
    doc.Convert(CallBackGetHocr)
    doc.Save("output.pdf")
End Sub

Private Function CallBackGetHocr(img As System.Drawing.Image) As String
    Dim dir As String = "c:\PdfTest\"
    img.Save(dir & Convert.ToString("test.jpg"))
    Dim info As New ProcessStartInfo("tesseract")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "c:\pdftest\test.jpg c:\pdftest\out hocr"
    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()
    Dim streamReader As New StreamReader("c:\pdftest\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()
    Return text
End Function


Aravindb:

In the first step you say I need to convert the image file to PDF, but which file do you use?

Is Input.pdf converted from the image file using the Aspose tool?

And what about the test.jpg file? Is test.jpg the image file, and is Input.pdf converted from test.jpg? Then why does test.jpg need to be used again?



Note: please provide the code in VB and specify which Tesseract OCR package I need to download. I attach a screenshot here; I downloaded the file marked with the red rectangle, but when I open it I don't see any exe file.



Regards


Aravind



Yes, we used Aspose.Pdf to convert the image to a PDF file (Input.pdf). The rest of the code is Tesseract-related: Tesseract needs an image file as its argument for OCR, so the callback saves the image it receives as test.jpg before running Tesseract. You may check the Tesseract documentation for related queries.

Best Regards,

Hi

I have a problem passing arguments to the CallBackGetHocr function; please see the screenshot. How do you pass the function to doc.Convert(CallBackGetHocr) without an argument? CallBackGetHocr takes one argument, but none is passed. How? Please check the VB and C# code and also my screenshot.

http://prntscr.com/46ukbe

Hi Aravind,


Thanks for your inquiry. Please note that Convert(CallBackGetHocr) passes CallBackGetHocr as a callback procedure; Aspose.Pdf calls it and supplies the image argument. We would appreciate it if you could share your sample project here, so that we can investigate it and provide you with more information accordingly.
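
The callback idea can be illustrated with a small, self-contained C# sketch that does not use Aspose.Pdf at all. ConvertEachPage and FakeHocr are hypothetical stand-ins for Document.Convert and CallBackGetHocr, and a string stands in for the image. The point is only that the method name is passed without parentheses and the receiving code supplies the argument later.

```csharp
using System;

static class CallbackDemo
{
    // Stand-in for a library method such as Document.Convert: it receives a
    // callback and decides itself when, and with which argument, to call it.
    public static string ConvertEachPage(Func<string, string> callback)
    {
        return callback("page-1-image");
    }

    // Stand-in for CallBackGetHocr: takes one argument and returns text.
    public static string FakeHocr(string pageImage)
    {
        return "<hocr for " + pageImage + ">";
    }

    public static void Main()
    {
        // No parentheses after FakeHocr: the method itself is passed, not its result.
        Console.WriteLine(ConvertEachPage(FakeHocr));
    }
}
```

C# converts the bare method name to a delegate implicitly. VB does not, and requires the AddressOf operator in the same position.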

We are sorry for the inconvenience caused.

Best Regards,

Hi Aravind,


Thanks for sharing additional information. We are looking into it and will get back to you soon.

Best Regards,

Hi Aravind,

We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 9.4.0, we have managed to reproduce the reported issue with VB code and logged it in our bug tracking system as PDFNEWNET-37283 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.

Please feel free to contact us for any further assistance.

Best Regards,

Hi

Hello support team, is there any update on this question?
The bug tracking number is PDFNEWNET-37283, for converting an image to a searchable PDF: the CallBackGetHocr function is not supported in VB.


Regards
Aravind

Hi Aravind,


Thanks for your patience.

As we have only recently been able to reproduce this issue, the development team requires a little time to investigate and figure out the reasons for this problem. Nevertheless, as soon as we have made some definite progress towards its resolution, we will be more than happy to update you on the status of the correction.

We humbly request that you be patient and spare us a little time.

Hi

Hello support team, is there any update on this question?
The bug tracking number is PDFNEWNET-37283, for converting an image to a searchable PDF: the CallBackGetHocr function is not supported in VB.


Regards
Aravind

Hi Aravind,


Thanks for your patience. Please note that the AddressOf keyword needs to be used for a callback in VB. Please check the following code snippet; it will help you to accomplish the task.

Imports System.Diagnostics
Imports Aspose.Pdf

Sub Main()
    Dim license As New Aspose.Pdf.License()
    license.SetLicense("Aspose.Total.lic")
    Dim doc As New Document("E:/Data/test.pdf")
    ' AddressOf passes the function itself as the callback instead of calling it.
    doc.Convert(AddressOf CallBackGetHocr)
    doc.Save("E:/Data/searcable_output.pdf")
End Sub

Private Function CallBackGetHocr(ByVal img As System.Drawing.Image) As String
    Dim dir As String = "E:\Data\"
    ' Save the image so that tesseract can read it from disk.
    img.Save(dir & Convert.ToString("ocrtest.jpg"))
    Dim info As New ProcessStartInfo("C:\Program Files (x86)\Tesseract-OCR\tesseract.exe")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "E:\data\ocrtest.jpg E:\data\out hocr"
    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()
    ' Read back the hOCR markup that tesseract wrote.
    Dim streamReader As New IO.StreamReader("E:\data\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()
    Return text
End Function

Please feel free to contact us for any further assistance.


Best Regards,

Hi,
In .NET, can we use a DLL instead of the exe? Could I get some sample code for that?
Thanks in advance.



string dir = @"c:\PdfTest\";
img.Save(dir + "test.jpg");
ProcessStartInfo info = new ProcessStartInfo(@"exe");
info.WindowStyle = ProcessWindowStyle.Hidden;
info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
Process p = new Process();
p.StartInfo = info;
p.Start();
p.WaitForExit();
StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
string text = streamReader.ReadToEnd();
streamReader.Close();
return text;

Hi Aravind,


Thanks for your inquiry. Yes, you can use the Tesseract-OCR .NET wrapper. You can add a reference to the Tesseract DLL from the NuGet gallery and use it as follows. Hopefully it will help you to accomplish the task.

Furthermore, please note that the Tesseract DLL adds some extra information to the OCR HTML, which causes a format issue. We have already logged a ticket, PDFNET-41118, to fix the issue. However, as a workaround, we can remove the extra information from the OCR text with the following regex.

// Requires: using System; using System.IO; using Tesseract;

private static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    string text;
    try
    {
        img.Save(dir + "ocrtest.jpg");
        using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
        {
            using (var pix = Pix.LoadFromFile(dir + "ocrtest.jpg"))
            {
                using (var tesPage = engine.Process(pix))
                {
                    // Write the hOCR output for the processed page to disk.
                    using (StreamWriter writer = new StreamWriter(dir + "out.html"))
                    {
                        writer.Write(tesPage.GetHOCRText(0, true));
                    }
                }
            }
        }
        using (StreamReader streamReader = new StreamReader(dir + "out.html"))
        {
            text = streamReader.ReadToEnd();
            // Workaround: strip the word-confidence entries that cause the format issue.
            text = System.Text.RegularExpressions.Regex.Replace(text, @"; x_wconf \d+", "");
        }
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
    return text;
}
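
To show what the regex workaround does in isolation, here is a minimal sketch. The HocrCleanup class name and the sample markup are made up for illustration; only the pattern itself comes from the code above.

```csharp
using System.Text.RegularExpressions;

static class HocrCleanup
{
    // Removes the "; x_wconf NN" word-confidence entries from hOCR title attributes.
    public static string StripWordConfidence(string hocr)
    {
        return Regex.Replace(hocr, @"; x_wconf \d+", "");
    }
}
```

For example, passing <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 87'>Hello</span> to StripWordConfidence returns <span class='ocrx_word' title='bbox 36 92 96 116'>Hello</span>, so the bounding box is kept and only the confidence value is dropped.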


Best Regards,


@ARAVINDDSRC

Thanks for your patience.

We are pleased to inform you that the earlier reported issue PDFNET-37346 has been resolved in the latest version, Aspose.Pdf for .NET 17.11. We have updated our internal converter to accept hOCR files generated by Tesseract 3.0.4 (hocrtess304.txt); hOCR files without a namespace declaration (org.txt) are also accepted by the latest version.

We have used the following code for testing:

public static void test()
{
    using (var pdf = new Document(@"orgtopdf.pdf"))
    {
        pdf.Convert(CallBackGetHocr);
        pdf.Save(@"41118_out.pdf");
    }
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    return File.ReadAllText(@"hocrtess304.txt"); //or org.txt
}

Results:
41118_1_out.pdf (with org.txt.zip)
41118_2_out.pdf (with hocrtess304.txt.zip)

Please try using the latest release version and in case you face any issue, please feel free to contact us.