Searchable PDF C#

Hi Cody,


Thanks for your inquiry. Yes, I was referring to PDFNEWNET-40674 (the searchable PDF issue) in my post above.

Best Regards,

Hello again,


Has this issue moved any since I last checked in?

Hi Cody,


Thanks for your patience.

The product team has recently started investigating the earlier reported issue, but I am afraid it is not yet resolved. As soon as we have a definite update, we will let you know.

Hi Cody,


Thanks for your inquiry. I am afraid the investigation of the issue is not yet complete. We are coordinating with the product team and will notify you as soon as an update is available.

Thanks for your patience and cooperation.

Best Regards,

The issues you have found earlier (filed as PDFNEWNET-40674) have been fixed in Aspose.Pdf for .NET 16.12.0.


This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

I am still unable to get a searchable PDF with 16.12.0. I have attached my input and output.


EDIT: As well as the HOCR HTML file (renamed to .rtf to allow upload).

Hi Cody,


Thanks for your inquiry. I have tested the scenario again with Aspose.Pdf for .NET 16.12.0 and was unable to reproduce any issue. Please find the sample code and output files for reference.

// Requires: using System.Diagnostics; using System.IO; using Aspose.Pdf;

// Place the TIFF image on a new PDF page.
Document doc = new Document();
Page page = doc.Pages.Add();
Aspose.Pdf.Image image = new Aspose.Pdf.Image();
image.File = "D:/Downloads/in.tif";
page.Paragraphs.Add(image);

// Save to a stream and reload the document before conversion.
MemoryStream ms = new MemoryStream();
doc.Save(ms);
doc = new Document(ms);

// Run OCR through the callback, then save the searchable PDF.
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/tiftopdf_searchable.pdf");

.....

static string CallBackGetHocr(System.Drawing.Image img)
{
    // Write the page image to disk so Tesseract can read it.
    string dir = @"E:\Data\";
    img.Save(dir + "ocrtest.jpg");

    // Tesseract v3.02
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";

    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Tesseract writes the hOCR result to out.html; return its text.
    StreamReader streamReader = new StreamReader(@"E:\data\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}


Best Regards,

I found my problem: the HOCR needs to be XHTML (not just HTML). But now I’m having another issue. I have attached my newly generated XHTML (renamed to .rtf). When I call doc.Convert(CallBackGetHocr); I get the attached error.

Hi Cody,


Thanks for your feedback. I have noticed the reported issue with your shared XHTML. However, we would appreciate it if you could explain why you are using XHTML, since the PDF document converts successfully to a searchable PDF using the HTML generated by Tesseract-OCR with Aspose.Pdf for .NET 16.12.0. This will help us log and further investigate the issue.

Best Regards,

I was using HTML and it would never work. You sent me a rar file with a .html file that was in XHTML format. So I changed my engine to output using it instead.


From the .html you sent me:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

Hi Cody,


Thanks for your feedback. The sample HTML I shared was generated by Tesseract-OCR using the code shared above, and it does appear to contain XHTML. Please share your sample code and OCR software version. It would be helpful if you could share a sample console project for investigation. We will look into it further and guide you accordingly.

We are truly sorry for the inconvenience.

Best Regards,

My project doesn’t call an installed copy of Tesseract the way yours does. I’m using this NuGet package (Tesseract 3.0.1). I’m using this version because every later version was built with Visual Studio 2015, which I don’t have yet. Do you think the problem is in how it generates its output? It looks pretty similar to the one you supplied.

Hi Cody,


Thanks for sharing the details. However, as requested above, please share your sample code or sample project here. It will help us replicate and address the issue exactly.

We are sorry for the inconvenience.

Best Regards,

I stripped the license out, but here is the project I’m using.

Hi Cody,


Thanks for sharing the sample project. We are looking into it and will update you soon.

Best Regards,

Hi Cody,


Thanks for your patience. It seems your version of the Tesseract DLL is emitting some additional information that Aspose.Pdf is unable to parse, so it throws the exception. We have logged ticket PDFNET-42084 for investigation and rectification. As a workaround, you may run the following regex over the whole HOCR text; it will produce a valid searchable PDF.

StreamReader streamReader = new StreamReader(@"E:\data\phototest.html");
string text = streamReader.ReadToEnd();
streamReader.Close();
text = System.Text.RegularExpressions.Regex.Replace(text, @"; x_wconf \d+", "");
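
To make the workaround easier to reuse, here is a minimal sketch that wraps the same regex in a helper which could be called on the HOCR text before it is returned from the callback. The class name HocrCleaner and the method name StripWordConfidence are hypothetical and are not part of Aspose.Pdf or Tesseract. The sketch assumes that the "; x_wconf NN" entries inside the title attributes are the only additional information that needs to be removed.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.Pdf or Tesseract).
public static class HocrCleaner
{
    // Removes every "; x_wconf NN" entry from the HOCR text,
    // using the same pattern as the workaround above.
    public static string StripWordConfidence(string hocr)
    {
        return Regex.Replace(hocr, @"; x_wconf \d+", "");
    }
}
```

For example, a word element with title='bbox 36 92 96 116; x_wconf 87' would come back with title='bbox 36 92 96 116', and text without any x_wconf entries would be returned unchanged.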


We are sorry for the inconvenience.

Best Regards,

This regex fix works! Thank you!

Hi Cody,


Thanks for the acknowledgement.

We are glad to hear that your problem is resolved. Please continue using our API, and if you have any further queries, please feel free to contact us.