Images to Searchable Text PDF

iwagbe · January 27, 2017, 2:46am

Hello,

We need to convert images to searchable (text) PDFs. According to this 2013 topic,

Convert scanned PDF (Image) to searchable PDF (Text)

it was not possible without destroying the formatting. It is now 2017 and I thought this had improved, but I am disappointed with the result.

I am uploading all the files here.

1- The first attachment is my temp image (an invoice).

2- The second is my result using Aspose.OCR and Aspose.Pdf.

Here is my code:

       Aspose.OCR.License license = new Aspose.OCR.License();
       license.SetLicense("Aspose.Total.lic");
       Aspose.Pdf.License licenseforPdf = new Aspose.Pdf.License();
       licenseforPdf.SetLicense("Aspose.Total.lic");

       OcrEngine ocrEngine = new OcrEngine();
       Aspose.Pdf.Generator.Pdf pdf1 = new Aspose.Pdf.Generator.Pdf();

       String searchFolder = txtDirectoryPath.Text;
       var filters = new String[] { "jpg", "jpeg", "tif", "png", "gif", "tiff", "bmp" };
       var images = GetFilesFrom(searchFolder, filters, false);

       foreach (var image in images)
       {
           ocrEngine.Image = ImageStream.FromFile(image);

           if (ocrEngine.Process())
           {
               Aspose.Pdf.Generator.Section sec1 = pdf1.Sections.Add();

               // Create a new text paragraph and pass the text to its constructor as argument
               Aspose.Pdf.Generator.Text t2 = new Aspose.Pdf.Generator.Text(ocrEngine.Text.ToString());
               sec1.Paragraphs.Add(t2);

               pdf1.Save(Path.Combine(searchFolder, "Result", Path.GetFileName(image) + ".Pdf"));

               // Display the recognized text
               Console.WriteLine(ocrEngine.Text);
               Console.WriteLine(ocrEngine.Text.PartsInfo[0].Box);
           }
           else
           {
               Console.WriteLine("Error in file " + Path.GetFileName(image));
           }
       }

       MessageBox.Show("Completed");

(As far as I can see, I can only get the output as plain text.)

3- I tried the ABBYY FineReader online service; the result is in the third attachment.

As you can see, the ABBYY FineReader result is perfect.

I need to know whether you offer this kind of service, or will in the near future. If not, we will look for another solution, because the Aspose.OCR result is completely unusable for us.

iwagbe · January 27, 2017, 5:15am

I've solved the problem by using the Tesseract engine. Version 3.03 or later produces PDF output directly, without ruining the layout.

I think it uses iTextSharp behind the scenes. If I can find a wrapper for .NET, I will adapt it to use Aspose instead of iTextSharp for the PDF conversion.
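
Since PDF output depends on the installed Tesseract version, a caller might want to check the version before choosing an output format. The following is only a minimal sketch: it assumes the banner printed by `tesseract -v` begins with something like `tesseract 3.03` (depending on the release, that banner may be written to standard error rather than standard output), and the class and method names are hypothetical, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

static class TesseractVersion
{
    // Extracts major.minor from a banner such as "tesseract 3.03".
    // The banner format is an assumption; returns null if nothing matches.
    public static Version Parse(string banner)
    {
        Match m = Regex.Match(banner ?? "", @"tesseract\s+v?(\d+)\.(\d+)",
                              RegexOptions.IgnoreCase);
        if (!m.Success)
            return null;

        // "03" parses to 3, so "3.03" becomes Version(3, 3).
        return new Version(int.Parse(m.Groups[1].Value),
                           int.Parse(m.Groups[2].Value));
    }

    // True when the banner reports 3.03 or later, the threshold quoted in this thread.
    public static bool SupportsPdfOutput(string banner)
    {
        Version v = Parse(banner);
        return v != null && v >= new Version(3, 3);
    }
}
```

With this helper, a banner reporting 3.02 would lead the caller to fall back to `hocr` output, while 3.03 or later would allow the `pdf` parameter.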

tilal.ahmad · February 1, 2017, 7:48am

Hi Ertan,

Thanks for your feedback. It is good to know that you have managed to resolve the issue on your own.

Furthermore, to convert an image to a searchable PDF document, you can use Tesseract OCR together with Aspose.Pdf, as follows.

[C#]

using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

// Callback used by Aspose.Pdf: runs Tesseract on the page image and returns the hOCR markup
static string CallBackGetHocr(System.Drawing.Image img)
{
    // Trailing separator is required so the image is saved inside E:\Data
    string dir = @"E:\Data\";
    img.Save(dir + "ocrtest.jpg");

    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: input image, output base name, output format
    info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";

    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Read the hOCR result that Tesseract wrote to out.html
    StreamReader streamReader = new StreamReader(@"E:\data\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();

    return text;
}

static void Main(string[] args)
{
    Aspose.Pdf.License license = new Aspose.Pdf.License();
    license.SetLicense("E:/Data/AsposeLicense/asposetotal/Aspose.Total.lic");

    // Place the source image on a new PDF page
    Document doc = new Document();
    Page page = doc.Pages.Add();
    Aspose.Pdf.Image image = new Aspose.Pdf.Image();
    image.File = "E:/Data/invoice13.jpg";
    page.Paragraphs.Add(image);

    // Save and reload the document before running the conversion
    MemoryStream ms = new MemoryStream();
    doc.Save(ms);
    doc = new Document(ms);

    // Add the recognized text layer using the hOCR callback
    doc.Convert(CallBackGetHocr);
    doc.Save("E:/Data/invoice13.jpg_output.pdf");
}

Please feel free to contact us for any further assistance.

Best Regards,

iwagbe · February 2, 2017, 9:08am

Hello Tilal,

Thank you for your reply. My code is similar, but if you use 'hocr' as the parameter, the layout is ruined. If you pass 'pdf' directly as the parameter instead, I get what I want.

info.Arguments = @"E:\data\ocrtest.jpg E:\data\out pdf";

You also don't need Aspose.Pdf for the HTML-to-PDF conversion; the result is already a PDF file.

Warning: you need at least Tesseract 3.03, as noted in my earlier post; otherwise the 'pdf' parameter does not work!
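
The direct-to-PDF call described above could be wrapped roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: `TesseractPdf`, `BuildArguments`, `Convert`, and the `tesseractExe` parameter are hypothetical names introduced for illustration, and the only Tesseract behavior relied on is the `<image> <outputbase> pdf` argument form shown in this thread.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractPdf
{
    // Builds the argument string "<image> <outputBase> pdf", quoting both paths
    // so that directories containing spaces survive command-line parsing.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" pdf";
    }

    // Runs Tesseract and returns the path of the PDF it is expected to write.
    // tesseractExe is a hypothetical parameter: pass the path of your own install.
    public static string Convert(string tesseractExe, string imagePath, string outputBase)
    {
        var info = new ProcessStartInfo(tesseractExe, BuildArguments(imagePath, outputBase))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
        }

        // Tesseract appends the ".pdf" extension to the output base name itself.
        string pdfPath = outputBase + ".pdf";
        if (!File.Exists(pdfPath))
            throw new FileNotFoundException("Expected output was not created", pdfPath);
        return pdfPath;
    }
}
```

With the paths used in this thread, `TesseractPdf.Convert(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe", @"E:\data\ocrtest.jpg", @"E:\data\out")` would be expected to leave the searchable PDF at `E:\data\out.pdf`. Checking the exit code and the output file makes an unsupported `pdf` parameter on an older install fail loudly instead of silently.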

tilal.ahmad · February 3, 2017, 8:15am

Hi Ertan,

Thanks for sharing your feedback. However, the scenario works as expected with the hocr parameter on my end. I am using the code shared above with Tesseract 3.02.

Best Regards,