Hi Aravind,
Thanks for your inquiry. I'm afraid that creating a searchable PDF is not currently supported by Aspose components, because Aspose.Ocr is not yet mature. We are facing some issues with text recognition accuracy and with text coordinates. Our development team is working to fix these issues and is investigating new algorithms for the purpose.
As a workaround, you can create a searchable PDF document from an image using Aspose.Pdf together with another OCR application that supports the hOCR standard. You can use the free Google Tesseract OCR for the purpose. As a first step, please convert your image to PDF by following this documentation link; you can then convert it into a searchable PDF document as described below.
Please install Google Tesseract OCR on your computer from http://code.google.com/p/tesseract-ocr/downloads/list. After that, you will have the tesseract.exe console application.
Below you can see a usage example:
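[C#]
using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

// Callback passed to Document.Convert: receives an image and returns its hOCR markup.
private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest\";
    // Save the image to disk so that tesseract.exe can read it.
    img.Save(dir + "test.jpg");
    // Run: tesseract <input image> <output base name> hocr
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Read back the hOCR file that Tesseract wrote.
    using (StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html"))
    {
        return streamReader.ReadToEnd();
    }
}

public void Main()
{
    // Input.pdf is the PDF that was created from the image in the first step.
    Document doc = new Document("Input.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}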
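The process-launching part of the callback above can be made a little more defensive. The following is only a sketch under stated assumptions, not part of the Aspose sample: the class and method names (TesseractRunner, BuildStartInfo, RunAndReadHocr) are hypothetical, and the fallback between ".hocr" and ".html" output files assumes that the output extension differs between Tesseract versions. It quotes the paths so that folders with spaces work, checks the exit code, and then reads whichever output file exists.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class TesseractRunner
{
    // Builds the start info for the command line: tesseract "<image>" "<outputBase>" hocr
    public static ProcessStartInfo BuildStartInfo(string tesseractExe, string imagePath, string outputBase)
    {
        return new ProcessStartInfo
        {
            FileName = tesseractExe,
            // Quote both paths so that directories containing spaces are handled.
            Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr",
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Runs Tesseract and returns the hOCR markup it produced.
    public static string RunAndReadHocr(string tesseractExe, string imagePath, string outputBase)
    {
        using (Process p = Process.Start(BuildStartInfo(tesseractExe, imagePath, outputBase)))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
            {
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
            }
        }

        // Assumption: depending on the Tesseract version, the file is <outputBase>.hocr or <outputBase>.html.
        foreach (string extension in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + extension;
            if (File.Exists(candidate))
            {
                return File.ReadAllText(candidate);
            }
        }
        throw new FileNotFoundException("No hOCR output found for " + outputBase);
    }
}
```

Inside CallBackGetHocr, the body after img.Save could then be reduced to a single call such as TesseractRunner.RunAndReadHocr("tesseract", @"c:\pdftest\test.jpg", @"c:\pdftest\out").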
Please feel free to contact us for any further assistance.
Best Regards,
Hi
Can you provide the above code in VB, and tell me which Tesseract OCR package I need to download? Are you kidding me? The download page has more than 150 files, and they all have the same name! Please provide a screenshot of what needs to be downloaded. I have attached a sample.
And regarding this function (the CallBackGetHocr and Main code quoted above):
In the first step, you say that I need to convert the image file to PDF. Which file do you use? Is Input.pdf converted from an image file using the Aspose tool? And what about test.jpg? If test.jpg is the image file and Input.pdf was already converted from test.jpg, why do we need test.jpg again?
Note: Please provide the code in VB and specify which Tesseract OCR package needs to be downloaded. I have attached a screenshot; I downloaded the file marked with the red rectangle, but when I open it I do not see any exe file.
Regards
Aravind
Hi Aravind
Aravindb: Hi
Can you provide the above code in VB and which tesseract OCR should be downloaded? Are you kidding me? This page has more than 150 files, all with the same name. Please provide a screenshot specifying which one to download. A sample is attached here for reference.
Please look at the Summary + Labels column next to each link; it contains the package details. A screenshot is attached here for reference.
Aravindb: And this function (CallBackGetHocr and Main, quoted above)
Please find the sample VB code.
Private Shared Sub Main(args As String())
    Dim doc As New Document("Input.pdf")
    doc.Convert(CallBackGetHocr)
    doc.Save("output.pdf")
End Sub

Private Function CallBackGetHocr(img As System.Drawing.Image) As String
    Dim dir As String = "c:\PdfTest\"
    img.Save(dir & "test.jpg")
    Dim info As New ProcessStartInfo("tesseract")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "c:\pdftest\test.jpg c:\pdftest\out hocr"
    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()
    Dim streamReader As New StreamReader("c:\pdftest\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()
    Return text
End Function
Aravindb: In the first step, you specify the need to convert an image file to a PDF. What file do you use?
Input.pdf is converted from an image file using the Aspose tool.
And what about the test.jpg file? test.jpg is an image file, and input.pdf is converted from test.jpg file. Then why do you need to use the test.jpg file again?
Note: Please provide this in VB language and specify which tesseract OCR is needed for download. Here, I attach one file, and download what I specify by the red rectangle box. If I open it, I don’t see any exe file.
Regards,
Aravind
Yes, we used Aspose.Pdf to convert an image to a PDF file (input.pdf). The rest of the code is Tesseract-related: Tesseract needs an image file as its argument for OCR, which is why the image passed to the callback is saved as test.jpg. For related questions, please check the Tesseract documentation.
Best Regards,
Hi
I have a problem passing arguments to the CallBackGetHocr function; please see the screenshot. How do you pass the function to doc.Convert(CallBackGetHocr) without an argument? CallBackGetHocr takes one argument, but none is passed. Please check the VB and C# code and my screenshot.
http://prntscr.com/46ukbe
Hi Aravind,
We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 9.4.0, we have managed to reproduce the reported issue with VB code and logged it in our bug tracking system as PDFNEWNET-37283 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.
Please feel free to contact us for any further assistance.
Best Regards,
Hi Aravind,
Thanks for your patience.
As we have recently noticed this issue, the development team requires some time to investigate and figure out the reasons for this problem. However, as soon as we make some definite progress towards its resolution, we would be more than happy to update you with the status of the correction.
Our humble request is to please be patient and give us some time.
Hi Aravind,
Thanks for your patience. Please note that the AddressOf keyword is required to pass a callback in VB. Please check the following code snippet; it should help you accomplish the task.
Imports System.Diagnostics
Imports Aspose.Pdf

Sub Main()
    Dim license As New Aspose.Pdf.License()
    license.SetLicense("Aspose.Total.lic")
    Dim doc As New Document("E:/Data/test.pdf")
    ' AddressOf passes the function itself as the callback instead of calling it.
    doc.Convert(AddressOf CallBackGetHocr)
    doc.Save("E:/Data/searcable_output.pdf")
End Sub

Private Function CallBackGetHocr(ByVal img As System.Drawing.Image) As String
    Dim dir As String = "E:\Data\"
    ' Save the image to disk so that tesseract.exe can read it.
    img.Save(dir & "ocrtest.jpg")
    Dim info As New ProcessStartInfo("C:\Program Files (x86)\Tesseract-OCR\tesseract.exe")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "E:\data\ocrtest.jpg E:\data\out hocr"
    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()
    ' Read back the hOCR file that Tesseract wrote.
    Dim streamReader As New IO.StreamReader("E:\data\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()
    Return text
End Function
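For readers wondering why the callback appears without arguments in the C# call: the method is not being called at that point, it is being handed over as a delegate, and the receiver supplies the argument later. C# allows the bare method name here, while VB requires AddressOf. The sketch below illustrates the mechanism only; HocrCallback, ConvertAll, and FakeHocr are hypothetical stand-ins and are not the Aspose.Pdf API.

```csharp
using System;
using System.Collections.Generic;

public static class CallbackDemo
{
    // Hypothetical stand-in for the delegate type that a Convert-style method expects.
    public delegate string HocrCallback(string imageName);

    // Hypothetical stand-in for Document.Convert: it receives the method itself
    // and supplies the argument each time it invokes the callback.
    public static List<string> ConvertAll(IEnumerable<string> imageNames, HocrCallback callback)
    {
        var results = new List<string>();
        foreach (string name in imageNames)
        {
            results.Add(callback(name));
        }
        return results;
    }

    // A method whose signature matches HocrCallback.
    public static string FakeHocr(string imageName)
    {
        return "<hocr for " + imageName + ">";
    }

    public static void Main()
    {
        // No parentheses and no argument: the method name is converted to a delegate.
        List<string> output = ConvertAll(new[] { "page1.jpg", "page2.jpg" }, FakeHocr);
        foreach (string line in output)
        {
            Console.WriteLine(line);
        }
    }
}
```

Writing FakeHocr("page1.jpg") in the call to ConvertAll would instead run the method immediately and pass its string result, which does not match the delegate parameter.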
Please feel free to contact us for any further assistance.
Best Regards,
Hi Aravind,
Thanks for your inquiry. Yes, you can use the Tesseract OCR .NET wrapper. Add a reference to the Tesseract DLL from the NuGet gallery and use it as follows. Hopefully, it will help you accomplish the task.
Furthermore, please note that the Tesseract DLL adds some extra information to the OCR HTML output, which causes format issues. We have already logged ticket PDFNET-41118 to fix the issue. As a workaround, you can remove the extra information from the OCR text with the regex used in the following code.
using System.IO;
using System.Text.RegularExpressions;
using Tesseract;

private static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    string text;
    // ...
    // Save the image so that it can be loaded as a Pix.
    img.Save(dir + "ocrtest.jpg");
    using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
    using (var pix = Pix.LoadFromFile(dir + "ocrtest.jpg"))
    using (var tesPage = engine.Process(pix))
    using (StreamWriter writer = new StreamWriter(dir + "out.html"))
    {
        // Write the hOCR output of the recognized page to disk.
        writer.Write(tesPage.GetHOCRText(0, true));
    }
    using (StreamReader streamReader = new StreamReader(dir + "out.html"))
    {
        text = streamReader.ReadToEnd();
        // Workaround: strip the x_wconf entries that cause the format issues.
        text = Regex.Replace(text, @"; x_wconf \d+", "");
    }
    return text;
}
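To see what the regex workaround does, here is a small self-contained sketch. The sample span is a hypothetical hOCR fragment written for illustration, not output captured from Tesseract, and the HocrCleanup class name is likewise an assumption. The pattern is the one from the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HocrCleanup
{
    // Removes "; x_wconf <number>" entries from hOCR title attributes.
    public static string StripWordConfidence(string hocr)
    {
        return Regex.Replace(hocr, @"; x_wconf \d+", "");
    }

    public static void Main()
    {
        // Hypothetical hOCR word element with a bounding box and a word confidence.
        string sample = "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 87'>Hello</span>";
        Console.WriteLine(StripWordConfidence(sample));
        // Prints: <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116'>Hello</span>
    }
}
```

Only the confidence entry is removed; the bounding box coordinates, which are what position the text in the PDF, are left untouched.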
Best Regards,
@ARAVINDDSRC
Thanks for your patience.
We are pleased to inform you that the earlier reported issue PDFNET-37346 has been resolved in the latest version, Aspose.Pdf for .NET 17.11. We have updated our internal converter to accept hOCR files generated by Tesseract 3.0.4 (hocrtess304.txt). hOCR files without a namespace declaration (org.txt) are also accepted by the latest version.
We have used the following code for testing:
public static void test()
{
    using (var pdf = new Document(@"orgtopdf.pdf"))
    {
        pdf.Convert(CallBackGetHocr);
        pdf.Save(@"41118_out.pdf");
    }
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    return File.ReadAllText(@"hocrtess304.txt"); // or org.txt
}
Results:
41118_1_out.pdf (with org.txt.zip)
41118_2_out.pdf (with hocrtess304.txt.zip)
Please try the latest release version and, if you face any issue, feel free to contact us.