Convert image to searchable pdf file using google ocr exe

Aravindb · July 28, 2014, 12:32pm

Hi

I am using VS2012 with VB. I downloaded Aspose.Pdf as an individual API, i.e. Aspose.Pdf.dll version 9.4.0.0, from http://www.aspose.com/community/files/51/.net-components/aspose.pdf-for-.net/default.aspx

From the Aspose.Pdf for .NET 9.4.0 (DLLs only) package I use only the DLL inside the net2.0 folder; I am not using the 3.0 or 4.0 builds.

My problem is that I can't use the CallBackGetHocr function in VB. Here is the forum post:

http://prntscr.com/4773cw

Please first solve the CallBackGetHocr problem in VB. I have attached a sample project in VB.

I can't attach the full project zip here; I have already sent it via Gmail. Here I attach only the .aspx, .aspx.vb and web.config files. Please use Aspose.Pdf DLL version 9.4.0.0; I mentioned above where I downloaded it.

Regards

Aravind

Aravindb · July 24, 2014, 6:08am

Hi

I am converting image files to PDF files, but I need one more feature. The images contain words, so after converting an image to PDF I need to be able to search for those words in the PDF. Currently the search shows no results.

Note: I need to search for a word in the output PDF file from .NET.

Please provide sample code. I have attached a sample image file and the PDF converted from it.

Please reply as soon as possible.

Regards

Aravind

tilal.ahmad · July 24, 2014, 8:29am

Hi Aravind,

Thanks for your inquiry. I’m afraid searchable PDF is currently not supported by Aspose components, as Aspose.Ocr is not quite mature yet. We are facing some issues with text recognition accuracy and with the coordinates of the recognized text. Our development team is working hard to fix these issues and is investigating some new algorithms for the purpose.

As a workaround, you can create a searchable PDF document from an image using Aspose.Pdf together with another OCR application that supports the hOCR standard. You can use the free Google Tesseract OCR for the purpose. As a first step, please convert your image to PDF by following this documentation link; you can then convert that PDF into a searchable PDF document as described below.

Please install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list and after that, you will have the tesseract.exe console application.

Below you can see a usage example:

[C#]

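using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

// Callback that Aspose.Pdf invokes with an image; it returns the hOCR markup for that image.
private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest\";
    // Save the image to disk so that tesseract.exe can read it.
    img.Save(dir + "test.jpg");
    // Runs: tesseract <input image> <output base name> hocr
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Read the hOCR markup that Tesseract wrote to out.html.
    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

public void Main()
{
    // Input.pdf is the PDF created from the image in the first step.
    Document doc = new Document("Input.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}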
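
One practical caveat, offered as a hedged sketch rather than a confirmed detail of this setup: the callback assumes that Tesseract writes its hOCR output to out.html, while some later Tesseract releases are reported to write out.hocr instead, so please verify this against your installed version. The helper below (the class name HocrOutput and method name Find are hypothetical, not part of Aspose.Pdf or Tesseract) simply checks for both extensions before the file is read.

```csharp
using System.IO;

public static class HocrOutput
{
    // Returns the path of the hOCR file found for the given output base name
    // (for example @"c:\pdftest\out"), or null if neither candidate exists.
    // Assumption: the file is named <base>.hocr or <base>.html.
    public static string Find(string outputBase)
    {
        foreach (string extension in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + extension;
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

In the callback, the StreamReader could then be opened on HocrOutput.Find(@"c:\pdftest\out") after checking the result for null.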

Please feel free to contact us for any further assistance.

Best Regards,

Aravindb · July 24, 2014, 11:29am

Hi

Can you provide the above code in VB, and tell me which Tesseract OCR package I need to download? Are you kidding me? This page has more than 150 files, and they all have the same name! Please provide a screenshot showing which one to download. I have attached a sample.

And this function:

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest\";
    img.Save(dir + "test.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

public void Main()
{
    Document doc = new Document("Input.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}

In the first step, you say that I need to convert the image file to PDF. What file do you use? Is Input.pdf converted from an image file using the Aspose tool? And what about the test.jpg file? Is test.jpg the image file, and has Input.pdf already been converted from test.jpg? Then why do we need to use test.jpg again?

Note: Please provide the code in VB and specify which Tesseract OCR package needs to be downloaded. I have attached a screenshot; I downloaded the file marked with the red rectangle, but when I open it I do not see any exe file.

Regards

Aravind

tilal.ahmad · July 25, 2014, 2:42am

Hi Aravind

Aravindb: Hi

Can you provide the above code in VB and which tesseract OCR should be downloaded? Are you kidding me? This page has more than 150 files, all with the same name. Please provide a screenshot specifying which one to download. A sample is attached here for reference.

Please pay attention to the Summary + Labels column on the download page; it contains the package details. A screenshot is attached here for reference.

Aravindb: And this function

Please find the sample VB code.

Private Shared Sub Main(args As String())
    Dim doc As New Document("Input.pdf")
    doc.Convert(CallBackGetHocr)
    doc.Save("output.pdf")
End Sub

Private Function CallBackGetHocr(img As System.Drawing.Image) As String
    Dim dir As String = "c:\PdfTest\"
    img.Save(dir & "test.jpg")
    Dim info As New ProcessStartInfo("tesseract")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "c:\pdftest\test.jpg c:\pdftest\out hocr"
    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()
    Dim streamReader As New StreamReader("c:\pdftest\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()
    Return text
End Function

Aravindb: In the first step, you say that I need to convert the image file to PDF. What file do you use? Is Input.pdf converted from an image file using the Aspose tool? And what about the test.jpg file? Is test.jpg the image file, and has Input.pdf already been converted from test.jpg? Then why do we need to use test.jpg again?

Yes, we have used Aspose.Pdf to convert an image to a PDF file (Input.pdf). The rest of the code relates to Tesseract, which needs an image file as its input argument for OCR; that is why the callback saves the image as test.jpg. You may check the Tesseract documentation for related queries.

Best Regards,

Aravindb · July 28, 2014, 3:45am

Hi

I have a problem passing arguments to the CallBackGetHocr function; please see the screenshot. How do you pass the function to doc.Convert(CallBackGetHocr) without an argument? CallBackGetHocr takes one argument, but none is passed. How does that work? Please check the VB and C# code and my screenshot.

http://prntscr.com/46ukbe

tilal.ahmad · July 28, 2014, 11:06am

Hi Aravind,

Thanks for your inquiry. Please note that Convert(CallBackGetHocr) takes the procedure as a callback: you pass the function itself, and Aspose.Pdf calls it and supplies the image argument. We would appreciate it if you could share your sample project here, so that we can investigate it and provide more information accordingly.
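
As a rough illustration of the callback idea in C#, here is a minimal sketch that does not use Aspose.Pdf at all. The delegate HocrCallback, the method FakeConvert and the class CallbackDemo are hypothetical stand-ins invented for this example; they only mimic the shape of passing a function to Convert.

```csharp
// Hypothetical stand-in for the delegate type that a Convert-style method accepts.
public delegate string HocrCallback(string imageName);

public static class CallbackDemo
{
    // Hypothetical stand-in for Document.Convert: the library, not the caller,
    // supplies the argument when it invokes the callback.
    public static string FakeConvert(HocrCallback callback)
    {
        return callback("page1.jpg");
    }

    // Matches the delegate signature: one string in, one string out.
    public static string GetHocr(string imageName)
    {
        return "<hocr for " + imageName + ">";
    }

    public static string Run()
    {
        // No parentheses and no argument here: this passes a reference
        // to GetHocr, it does not call it.
        return FakeConvert(GetHocr);
    }
}
```

Calling CallbackDemo.Run() returns "<hocr for page1.jpg>", which shows that the argument was provided inside FakeConvert and not at the point where the function was passed.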

We are sorry for the inconvenience caused.

Best Regards,

tilal.ahmad · July 29, 2014, 12:06pm

Hi Aravind,

Thanks for sharing additional information. We are looking into it and will get back to you soon.

Best Regards,

tilal.ahmad · August 1, 2014, 12:15am

Hi Aravind,

We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 9.4.0, we have managed to reproduce the reported issue with VB code and logged it in our bug tracking system as PDFNEWNET-37283 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.

Please feel free to contact us for any further assistance.

Best Regards,

Aravindb · August 6, 2014, 9:32pm

Hi

Hello support team, is there any update on this question?

The bug tracking number is PDFNEWNET-37283, for converting an image to a searchable PDF: the CallBackGetHocr function is not supported in VB.

Regards

Aravind

codewarior · August 7, 2014, 3:07pm

Hi Aravind,

Thanks for your patience.

As we have recently noticed this issue, the development team requires some time to investigate and figure out the reasons for this problem. However, as soon as we make some definite progress towards its resolution, we would be more than happy to update you with the status of the correction.

Our humble request is to please be patient and give us some time.

Aravindb · August 14, 2014, 2:17am

Hi

Hello support team, is there any update on this question?

The bug tracking number is PDFNEWNET-37283, for converting an image to a searchable PDF: the CallBackGetHocr function is not supported in VB.

Regards

Aravind

tilal.ahmad · August 15, 2014, 12:21am

Hi Aravind,

Thanks for your patience. Please note that the AddressOf keyword is required to pass a callback in VB. Please check the following code snippet; it should help you accomplish the task.

Imports System.Diagnostics
Imports Aspose.Pdf

Sub Main()
    Dim license As New Aspose.Pdf.License()
    license.SetLicense("Aspose.Total.lic")

    Dim doc As New Document("E:/Data/test.pdf")
    ' AddressOf passes a reference to the function; Aspose.Pdf calls it with the image.
    doc.Convert(AddressOf CallBackGetHocr)
    doc.Save("E:/Data/searchable_output.pdf")
End Sub

Private Function CallBackGetHocr(ByVal img As System.Drawing.Image) As String
    Dim dir As String = "E:\Data\"
    ' Save the image to disk so that tesseract.exe can read it.
    img.Save(dir & "ocrtest.jpg")

    ' Runs: tesseract.exe <input image> <output base name> hocr
    Dim info As New ProcessStartInfo("C:\Program Files (x86)\Tesseract-OCR\tesseract.exe")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "E:\data\ocrtest.jpg E:\data\out hocr"

    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()

    ' Read the hOCR markup that Tesseract wrote to out.html.
    Dim streamReader As New IO.StreamReader("E:\data\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()

    Return text
End Function

Please feel free to contact us for any further assistance.

Best Regards,

aravinddsrc1 · February 16, 2017, 11:57pm

Hi,

In .NET, can we use a DLL instead of the exe? Could I get some sample code for that?

thanks in advance

string dir = @"c:\PdfTest\";
img.Save(dir + "test.jpg");
ProcessStartInfo info = new ProcessStartInfo(@"exe");
info.WindowStyle = ProcessWindowStyle.Hidden;
info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
Process p = new Process();
p.StartInfo = info;
p.Start();
p.WaitForExit();
StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
string text = streamReader.ReadToEnd();
streamReader.Close();
return text;

tilal.ahmad · February 19, 2017, 8:42pm

Hi Aravind,

Thanks for your inquiry. Yes, you can use the Tesseract-OCR .NET wrapper. You can add a reference to the Tesseract DLL from the NuGet gallery and use it as follows. Hopefully, it will help you accomplish the task.

Furthermore, please note that the Tesseract DLL adds some extra information to the OCR HTML that causes format issues. We have already logged a ticket, PDFNET-41118, to fix the issue. However, as a workaround, you can remove the extra information from the OCR text with the following regex.

using System;
using System.IO;
using Tesseract;

private static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    string text;

    try
    {
        // Save the image so that the Tesseract engine can load it from disk.
        img.Save(dir + "ocrtest.jpg");

        using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
        {
            using (var pix = Pix.LoadFromFile(dir + "ocrtest.jpg"))
            {
                using (var tesPage = engine.Process(pix))
                {
                    using (StreamWriter writer = new StreamWriter(dir + "out.html"))
                    {
                        // Write the recognized page as hOCR (XHTML) markup.
                        writer.Write(tesPage.GetHOCRText(0, true));
                    }
                }
            }
        }

        using (StreamReader streamReader = new StreamReader(dir + "out.html"))
        {
            text = streamReader.ReadToEnd();
            // Workaround: strip the "; x_wconf NN" entries that cause format issues.
            text = System.Text.RegularExpressions.Regex.Replace(text, @"; x_wconf \d+", "");
        }
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }

    return text;
}
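
To make the effect of the regex above concrete, here is a minimal, self-contained sketch. The class name HocrCleaner, the method name StripWordConfidence and the sample hOCR span are hypothetical and only for illustration; the pattern itself is the one used in the code above.

```csharp
using System.Text.RegularExpressions;

public static class HocrCleaner
{
    // Removes every "; x_wconf NN" entry (a semicolon, a space, the literal
    // "x_wconf", a space and one or more digits) from the given hOCR text.
    public static string StripWordConfidence(string hocr)
    {
        return Regex.Replace(hocr, @"; x_wconf \d+", "");
    }
}
```

For example, a word span with title='bbox 36 92 96 116; x_wconf 87' comes back with title='bbox 36 92 96 116', and text that has no x_wconf entry is returned unchanged.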

Best Regards,

asad.ali · November 14, 2017, 7:37am

@ARAVINDDSRC

Thanks for your patience.

We are pleased to inform you that the earlier reported issue PDFNET-37346 has been resolved in the latest version, Aspose.Pdf for .NET 17.11. We have updated our internal converter to accept hOCR files generated by Tesseract 3.0.4 (hocrtess304.txt); hOCR files without a namespace declaration (org.txt) are also accepted by the latest version.

We have used the following code for testing:

public static void test()
{
    using (var pdf = new Document(@"orgtopdf.pdf"))
    {
        pdf.Convert(CallBackGetHocr);
        pdf.Save(@"41118_out.pdf");
    }
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    return File.ReadAllText(@"hocrtess304.txt"); //or org.txt
}

Results:
41118_1_out.pdf (with org.txt.zip)
41118_2_out.pdf (with hocrtess304.txt.zip)

Please try using the latest release version and in case you face any issue, please feel free to contact us.