Searchable PDF C#

tilal.ahmad · July 26, 2016, 12:33am

Hi Cody,

Thanks for your inquiry. Yes, I was referring to PDFNEWNET-40674 (the searchable PDF issue) in my post above.

Best Regards,

chammond523 · August 22, 2016, 2:50pm

Hello again,

Has this issue moved any since I last checked in?

codewarior · August 23, 2016, 1:02pm

Hi Cody,

Thanks for your patience.

The product team has recently started investigating the issue reported earlier, but I am afraid it is not yet resolved. As soon as we have a definite update, we will let you know.

tilal.ahmad · October 13, 2016, 4:09am

Hi Cody,

Thanks for your inquiry. I am afraid the investigation of the issue is still not complete. We are coordinating with the product team and will notify you as soon as an update is available.

Thanks for your patience and cooperation.

Best Regards,

aspose.notifier · December 6, 2016, 9:04pm

The issues you have found earlier (filed as PDFNEWNET-40674) have been fixed in Aspose.Pdf for .NET 16.12.0.


chammond523 · December 12, 2016, 7:50am

I am still unable to get a searchable PDF with 16.12.0. I have attached my input and output.

EDIT: As well as the HOCR HTML file (renamed to .rtf to allow upload).

tilal.ahmad · December 13, 2016, 9:10am

Hi Cody,

Thanks for your inquiry. I have tested the scenario again with Aspose.Pdf for .NET 16.12.0 and was unable to reproduce the issue. Please find the sample code and output files for reference.

// Place the TIFF image on a new PDF page
Document doc = new Document();
Page page = doc.Pages.Add();
Aspose.Pdf.Image image = new Aspose.Pdf.Image();
image.File = "D:/Downloads/in.tif";
page.Paragraphs.Add(image);

// Save to memory and reload the document before converting it
MemoryStream ms = new MemoryStream();
doc.Save(ms);
doc = new Document(ms);

// Convert to a searchable PDF using the HOCR text returned by the callback
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/tiftopdf_searchable.pdf");

.....

static string CallBackGetHocr(System.Drawing.Image img)
{
    // Write the page image to disk so tesseract.exe can read it
    string dir = @"E:\Data\";
    img.Save(dir + "ocrtest.jpg");

    // V3.02
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";

    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Read the HOCR output written by Tesseract and return it
    StreamReader streamReader = new StreamReader(@"E:\data\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}
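
The callback above passes unquoted paths to tesseract.exe, which only works while the paths contain no spaces. As a minimal sketch, the argument string could be built by a small helper that quotes each path. `TesseractCli` and `BuildHocrStartInfo` are hypothetical names used for illustration only; they are not part of Aspose.Pdf or Tesseract.

```csharp
using System.Diagnostics;

public static class TesseractCli
{
    // Hypothetical helper: builds the start info for "tesseract <image> <outputBase> hocr".
    // Each path is wrapped in double quotes so that spaces in the path do not split the argument.
    public static ProcessStartInfo BuildHocrStartInfo(string exePath, string imagePath, string outputBase)
    {
        return new ProcessStartInfo(exePath)
        {
            Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr",
            UseShellExecute = false, // start the executable directly, without the shell
            CreateNoWindow = true    // do not open a console window
        };
    }
}
```

After `p.WaitForExit()`, checking `p.ExitCode` before reading the output file is one way to notice a failed OCR run early.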

Best Regards,

chammond523 · December 13, 2016, 12:17pm

I found my problem: the HOCR needs to be XHTML (not just HTML). But now I’m having another issue. I have attached my newly generated XHTML (renamed to .rtf). When I call doc.Convert(CallBackGetHocr); I get the attached error.

tilal.ahmad · December 14, 2016, 12:03pm

Hi Cody,

Thanks for your feedback. I have reproduced the reported issue with your shared XHTML. However, could you please explain why you are using XHTML, given that the PDF document converts successfully to a searchable PDF using the HTML generated by Tesseract-OCR with Aspose.Pdf for .NET 16.12.0? This will help us log and further investigate the issue.

Best Regards,

chammond523 · December 14, 2016, 1:28pm

I was using HTML and it would never work. You sent me a rar file with a .html file that was in XHTML format, so I changed my engine to output XHTML instead.

From the .html you sent me:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
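
Assuming, as I found above, that the callback output has to be XHTML, a quick sanity check is to see whether the HOCR text parses as well-formed XML before handing it to doc.Convert. This is only a sketch: `HocrCheck` and `IsWellFormedXml` are made-up names, and nothing in this thread confirms that this is the check Aspose.Pdf performs internally.

```csharp
using System.IO;
using System.Xml;

public static class HocrCheck
{
    // Hypothetical helper: returns true when the HOCR text parses as well-formed XML.
    public static bool IsWellFormedXml(string hocr)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip the DOCTYPE instead of rejecting it
            XmlResolver = null                    // never try to download the external DTD
        };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(hocr), settings))
            {
                // Reading to the end forces the whole document to be parsed.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Plain HTML with an unclosed tag such as `<br>` fails this check, while the XHTML header quoted above followed by properly closed elements passes.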

tilal.ahmad · December 15, 2016, 11:42am

Hi Cody,

Thanks for your feedback. I shared the sample HTML generated by Tesseract-OCR using the code shared above, and it does appear to contain XHTML. Please share your sample code and OCR software version. It would be helpful if you could share a sample console project for investigation. We will look into it further and guide you accordingly.

We are truly sorry for the inconvenience.

Best Regards,

chammond523 · December 19, 2016, 10:04am

The project I am using doesn’t call an installed copy of Tesseract the way you do. I’m using this NuGet package (Tesseract 3.0.1). I’m using this version of the package because every version past this was built using Visual Studio 2015, which I don’t have yet. Do you think the problem is in how it’s outputting? It looks pretty similar to the one you supplied.

tilal.ahmad · December 20, 2016, 9:48am

Hi Cody,

Thanks for sharing the details. However, as requested above, please share your sample code or sample project here; it will help us replicate and address the exact issue.

We are sorry for the inconvenience.

Best Regards,

chammond523 · December 20, 2016, 10:51am

I stripped the license out, but here is the project I’m using.

tilal.ahmad · December 21, 2016, 10:07am

Hi Cody,

Thanks for sharing the sample project. We are looking into it and will update you soon.

Best Regards,

tilal.ahmad · January 3, 2017, 10:35am

Hi Cody,

Thanks for your patience. It seems your Tesseract DLL version emits some additional information that Aspose.Pdf is unable to parse, so it throws the exception. We have logged ticket PDFNET-42084 for its investigation and rectification. As a workaround, you may run the following regex replacement over the whole HOCR text; it will produce a valid searchable PDF.

StreamReader streamReader = new StreamReader(@"E:\data\phototest.html");
string text = streamReader.ReadToEnd();
streamReader.Close();
text = System.Text.RegularExpressions.Regex.Replace(text, @"; x_wconf \d+", "");
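
To illustrate what the replacement does, here is a minimal sketch that wraps the same regex in a helper and applies it to one HOCR word element. The `HocrCleanup` class name and the sample span are assumptions made for illustration; the actual output of your Tesseract build may differ.

```csharp
using System.Text.RegularExpressions;

public static class HocrCleanup
{
    // Hypothetical helper: removes the "; x_wconf NN" word-confidence entries
    // from HOCR title attributes, using the same pattern as the workaround above.
    public static string StripWordConfidence(string hocr)
    {
        return Regex.Replace(hocr, @"; x_wconf \d+", "");
    }
}

// Illustrative input (assumed shape of a word element):
//   <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 87'>This</span>
// After StripWordConfidence:
//   <span class='ocrx_word' title='bbox 36 92 96 116'>This</span>
```

Inside the callback, the cleaned text can be returned directly, for example `return HocrCleanup.StripWordConfidence(text);`.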

We are sorry for the inconvenience.

Best Regards,

chammond523 · January 18, 2017, 2:40pm

This regex fix works! Thank you!

codewarior · January 19, 2017, 1:48pm

Hi Cody,

Thanks for the acknowledgement.

We are glad to hear that your problem is resolved. Please continue using our API, and if you have any further queries, please feel free to contact us.