Convert images to PDF Text + Image

rmcartaxo · March 2, 2020, 4:31pm

Hello,

I’m looking for an API that can convert an image, in TIFF or PDF format, to a PDF Text + Image file, performing the necessary OCR on the image.

The output should be a PDF in which I can select the text and which can be processed by an index / search engine.

It is also important that this component can produce PDF/A formats.

From what I’ve analyzed, Aspose has APIs for PDF and OCR processing, so I would like to know if the platform can address the need I described.

Thanks in advance.

Adnan.Ahmad · March 2, 2020, 9:32pm

@rmcartaxo,

Thanks for contacting support.

Please have a look at the Convert Image to PDF article and let us know whether it covers your case. Please also share a sample that illustrates your requirements so that we can investigate further and help you out.

rmcartaxo · March 3, 2020, 3:00pm

Hello @Adnan.Ahmad,

Thanks for your reply.

From my understanding of the link you shared, it addresses the part about converting an image to PDF format.

But what I need to achieve is:

Perform OCR on an image and extract the text, which I assume your OCR component can do;
Convert the image to PDF, with the feature you mentioned in the article;
Convert the PDF containing the image to a PDF of image + text, using the text extracted by the OCR process.

Is this possible with Aspose components?

Thanks,
Ricardo Cartaxo

asad.ali · March 3, 2020, 8:13pm

@rmcartaxo

Thanks for writing back.

Yes, you can perform OCR on an image using the Aspose.OCR API.

As for the requirement above, Aspose.PDF offers a way to generate searchable PDF documents using the external Tesseract utility. To convert a non-searchable PDF file (a scanned-image PDF) to a searchable PDF document in C#, please try the following code snippet with Tesseract.

C#

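using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

Document doc = new Document("D:/Downloads/input.pdf");
// Convert passes each page image to the callback and uses the returned hOCR as the text layer.
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/pdf_searchable.pdf");
//********************* CallBackGetHocr method ***********************//
static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    string imagePath = Path.Combine(dir, "ocrtest.jpg");
    string outputBase = Path.Combine(dir, "out");
    img.Save(imagePath);
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Quote both paths so they survive spaces; "hocr" asks Tesseract for hOCR output.
    info.Arguments = $"\"{imagePath}\" \"{outputBase}\" hocr";
    using (Process p = new Process())
    {
        p.StartInfo = info;
        p.Start();
        p.WaitForExit();
    }
    // Depending on the Tesseract version, the result is written as out.hocr or out.html.
    string hocrPath = File.Exists(outputBase + ".hocr") ? outputBase + ".hocr" : outputBase + ".html";
    return File.ReadAllText(hocrPath);
}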
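
The original question also asked about PDF/A output. The lines below are only a minimal sketch of that step, assuming the Aspose.PDF for .NET Document.Convert overload that takes a log file path, a PdfFormat value and a ConvertErrorAction. The file paths are placeholders, and PDF/A-1b is just one possible target format; please check the result against your own compliance requirements.

```csharp
using Aspose.Pdf;

// Placeholder paths; the input is assumed to be the searchable PDF saved by the snippet above.
Document pdfaDoc = new Document("E:/Data/pdf_searchable.pdf");

// Convert to PDF/A-1b. Conversion messages are written to the XML log file, and
// ConvertErrorAction.Delete removes elements that cannot be made compliant.
pdfaDoc.Convert("E:/Data/pdfa_log.xml", PdfFormat.PDF_A_1B, ConvertErrorAction.Delete);

pdfaDoc.Save("E:/Data/pdf_searchable_pdfa.pdf");
```

Running the PDF/A conversion after the OCR step means the text layer is already in place when the compliance conversion happens.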