Hi
Hi Aravind,
// Requires: using System.Diagnostics; using System.IO; using Aspose.Pdf;
private string CallBackGetHocr(System.Drawing.Image img)
{
    // Save the page image where Tesseract can read it (note the trailing backslash).
    string dir = @"c:\PdfTest\";
    img.Save(dir + "test.jpg");
    // Run the tesseract executable (it must be resolvable, e.g. on PATH) and request hOCR output.
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Read back the hOCR file that Tesseract wrote.
    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}
public void Main()
{
    Document doc = new Document("Input.pdf");
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}
Hi
In the first step you said the image file needs to be converted to PDF, but which file did you use?
Was Input.pdf converted from an image file using the Aspose tool?
And what about test.jpg? If test.jpg is the image file and Input.pdf was converted from test.jpg, why is test.jpg needed again?
Note: please provide the code in VB and specify which Tesseract OCR package I need to download. I have attached a file showing what I downloaded (marked with a red rectangle); when I open it, I do not see any exe file.
Regards
Aravind
Aravindb: Hi, can you provide the above code in VB, and which Tesseract OCR file do I need to download from http://code.google.com/p/tesseract-ocr/downloads/list ? Are you kidding me? That page has more than 150 files and they all have the same name. Please provide a screenshot showing which one to download. I have attached a sample.
Aravindb: And this function (the C# CallBackGetHocr and Main code quoted above)
Private Shared Sub Main(args As String())
    Dim doc As New Document("Input.pdf")
    doc.Convert(CallBackGetHocr)
    doc.Save("output.pdf")
End Sub
Private Function CallBackGetHocr(img As System.Drawing.Image) As String
    Dim dir As String = "c:\PdfTest\"
    img.Save(dir & Convert.ToString("test.jpg"))
    Dim info As New ProcessStartInfo("tesseract")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "c:\pdftest\test.jpg c:\pdftest\out hocr"
    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()
    Dim streamReader As New StreamReader("c:\pdftest\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()
    Return text
End Function
Hi
I have a problem passing arguments to the CallBackGetHocr function; please see the screenshot. How do you pass the function to doc.Convert(CallBackGetHocr) without an argument? CallBackGetHocr takes one argument, but none is passed. Please check the VB and C# code and my screenshot.
http://prntscr.com/46ukbe
Hi Aravind,
We are sorry for the inconvenience caused. While testing the scenario with the latest version of Aspose.Pdf for .NET 9.4.0, we have managed to reproduce the reported issue with VB code and logged it in our bug tracking system as PDFNEWNET-37283 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.
Please feel free to contact us for any further assistance.
Best Regards,
Hi
Hi Aravind,
As we have only recently noticed this issue, the development team requires a little time to investigate and determine the cause of the problem. Nevertheless, as soon as we have made some definite progress towards its resolution, we will be happy to update you on the status of the correction.
We kindly ask you to be patient and spare us a little time.
Hi
Hi Aravind,
' Requires: Imports System.Diagnostics and a reference to Aspose.Pdf (Imports Aspose.Pdf)
Sub Main()
    Dim license As New Aspose.Pdf.License()
    license.SetLicense("Aspose.Total.lic")
    Dim doc As New Document("E:/Data/test.pdf")
    ' AddressOf turns the function into the delegate that Convert expects.
    doc.Convert(AddressOf CallBackGetHocr)
    doc.Save("E:/Data/searcable_output.pdf")
End Sub
Private Function CallBackGetHocr(ByVal img As System.Drawing.Image) As String
    ' Save the page image where Tesseract can read it.
    Dim dir As String = "E:\Data\"
    img.Save(dir & "ocrtest.jpg")
    ' Run tesseract.exe by its full path and request hOCR output.
    Dim info As New ProcessStartInfo("C:\Program Files (x86)\Tesseract-OCR\tesseract.exe")
    info.WindowStyle = ProcessWindowStyle.Hidden
    info.Arguments = "E:\data\ocrtest.jpg E:\data\out hocr"
    Dim p As New Process()
    p.StartInfo = info
    p.Start()
    p.WaitForExit()
    ' Read back the hOCR file that Tesseract wrote.
    Dim streamReader As New IO.StreamReader("E:\data\out.html")
    Dim text As String = streamReader.ReadToEnd()
    streamReader.Close()
    Return text
End Function
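On the earlier question of how CallBackGetHocr can be passed without its argument: the method name is handed over as a delegate, and the receiver supplies the argument when it calls back. VB needs AddressOf for this, while C# converts the method name implicitly. A minimal C# sketch of that mechanism, assuming a hypothetical delegate type `PageCallback` and a hypothetical `ConvertPage` method standing in for Aspose's callback type and Document.Convert (neither name comes from the Aspose API):

```csharp
using System;

class DelegateSketch
{
    // Hypothetical stand-in for the callback type that Document.Convert expects.
    delegate string PageCallback(string pageName);

    // The callback takes one argument, just like CallBackGetHocr takes an image.
    static string Describe(string pageName)
    {
        return "hOCR for " + pageName;
    }

    // Hypothetical receiver: it decides which argument to pass when it invokes the callback.
    static string ConvertPage(PageCallback callback)
    {
        return callback("page-1");
    }

    static void Main()
    {
        // Passing the method name creates a delegate; no argument is supplied here.
        Console.WriteLine(ConvertPage(Describe)); // prints "hOCR for page-1"
    }
}
```

In the same way, doc.Convert receives only the function and passes each page image to it itself.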
Please feel free to contact us for any further assistance.
Best Regards,
Hi Aravind,
// Requires: using System; using System.IO; using Tesseract; (the Tesseract .NET wrapper)
private static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    string text;
    try
    {
        // Save the page image so the Tesseract engine can load it from disk.
        img.Save(dir + "ocrtest.jpg");
        using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
        using (var pix = Pix.LoadFromFile(dir + "ocrtest.jpg"))
        using (var tesPage = engine.Process(pix))
        using (StreamWriter writer = new StreamWriter(dir + "out.html"))
        {
            // Write the recognized page as hOCR markup.
            writer.Write(tesPage.GetHOCRText(0, true));
        }
        using (StreamReader streamReader = new StreamReader(dir + "out.html"))
        {
            text = streamReader.ReadToEnd();
            // Strip the per-word confidence entries from the hOCR output.
            text = System.Text.RegularExpressions.Regex.Replace(text, @"; x_wconf \d+", "");
        }
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
    return text;
}
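The Regex.Replace call in the callback above removes the `x_wconf` word-confidence entries from the hOCR output before the text is returned; the thread does not say why this is needed. A minimal sketch of that step in isolation, assuming a hand-written hOCR fragment (the sample span and the helper name `Strip` are illustrative, not real Tesseract output or part of any library):

```csharp
using System;
using System.Text.RegularExpressions;

class StripWordConfidence
{
    // Same pattern as in the callback: drops "; x_wconf NN" from title attributes.
    static string Strip(string hocr)
    {
        return Regex.Replace(hocr, @"; x_wconf \d+", "");
    }

    static void Main()
    {
        // Hypothetical hOCR word span, written by hand for illustration.
        string sample = "<span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 93'>Hello</span>";
        // Prints: <span class='ocrx_word' title='bbox 10 20 110 50'>Hello</span>
        Console.WriteLine(Strip(sample));
    }
}
```

The bounding box is kept and only the confidence entry is dropped, so the word positions are unaffected.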
@ARAVINDDSRC
Thanks for your patience.
We are pleased to inform you that the earlier reported issue PDFNET-37346 has been resolved in the latest version, Aspose.Pdf for .NET 17.11. We have updated our internal converter to accept hOCR files generated by Tesseract 3.0.4 (hocrtess304.txt); hOCR files without a namespace declaration (org.txt) are also accepted by the latest version.
We have used the following code for testing:
public static void test()
{
    using (var pdf = new Document(@"orgtopdf.pdf"))
    {
        pdf.Convert(CallBackGetHocr);
        pdf.Save(@"41118_out.pdf");
    }
}
static string CallBackGetHocr(System.Drawing.Image img)
{
    return File.ReadAllText(@"hocrtess304.txt"); // or org.txt
}
Results:
41118_1_out.pdf (with org.txt.zip)
41118_2_out.pdf (with hocrtess304.txt.zip)
Please try using the latest release version and in case you face any issue, please feel free to contact us.