Hi Cody,
The issues you have found earlier (filed as PDFNEWNET-40674) have been fixed in Aspose.Pdf for .NET 16.12.0.
I am still unable to get a searchable PDF with 16.12.0. I have attached my input and output.
Hi Cody,
Document doc = new Document();
Page page = doc.Pages.Add();
Aspose.Pdf.Image image = new Aspose.Pdf.Image();
image.File = "D:/Downloads/in.tif";
page.Paragraphs.Add(image);
MemoryStream ms = new MemoryStream();
doc.Save(ms);
doc = new Document(ms);
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/tiftopdf_searchable.pdf");
.....
static string CallBackGetHocr(System.Drawing.Image img)
{
    // Save the page image to disk so tesseract.exe can read it.
    string dir = @"E:\Data\";
    img.Save(dir + "ocrtest.jpg");
    // Tesseract v3.02: run OCR and write the hOCR result to E:\data\out.html.
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Read the hOCR markup back and return it to Aspose.Pdf.
    StreamReader streamReader = new StreamReader(@"E:\data\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}
I found my problem: the hOCR output needs to be XHTML, not just HTML. But now I’m having another issue. I have attached my newly generated XHTML (renamed to .rtf). When I call doc.Convert(CallBackGetHocr); I get the attached error.
I was using HTML and it would never work. You sent me a rar file with a .html file that was in XHTML format. So I changed my engine to output using it instead.
The project I am using doesn’t call an installed copy of tesseract the way you do. I’m using this NuGet package (Tesseract 3.0.1 on NuGet Gallery). I’m on this version because every later version was built with Visual Studio 2015, which I don’t have yet. Do you think the problem is in how it writes its output? It looks pretty similar to the one you supplied.
I stripped the license out, but here is the project I’m using.
Hi Cody,
StreamReader streamReader = new StreamReader(@"E:\data\phototest.html");
string text = streamReader.ReadToEnd();
streamReader.Close();
text = System.Text.RegularExpressions.Regex.Replace(text, @"; x_wconf \d+", "");
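For anyone reading this later, here is a minimal sketch of what that replacement does to a single hOCR word element. The class and method names (HocrCleaner, StripWordConfidence) are hypothetical and only for illustration; the pattern is the same one used above. It assumes the confidence value appears as "; x_wconf NN" inside the title attribute, which is how it looked in the output discussed in this thread.

```csharp
using System.Text.RegularExpressions;

public static class HocrCleaner
{
    // Hypothetical helper: removes every "; x_wconf <digits>" fragment
    // from the hOCR markup and leaves the rest of the text unchanged.
    public static string StripWordConfidence(string hocr)
    {
        return Regex.Replace(hocr, @"; x_wconf \d+", "");
    }
}
```

With this helper, a title attribute such as title='bbox 36 92 96 116; x_wconf 87' becomes title='bbox 36 92 96 116', so only the bounding box remains. The cleaned string is what the callback would return in place of the raw file contents.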
This regex fix works! Thank you!