Convert PDF file to OCR PDF File

meinstei · September 12, 2019, 2:42pm

Is it possible to generate a new OCR'd PDF from an existing PDF file? What solution should we implement?
Thanks

Farhan.Raza · September 12, 2019, 10:05pm

Thank you for contacting support.

Could you please elaborate on your requirements a little more and share sample files, if any? You may also go through the product documentation for reference.

meinstei · September 16, 2019, 6:56am

Hi Raza,
For example, we have a PDF file whose text is not searchable. We want to convert it into a PDF with searchable text. Is it possible to use Aspose.PDF or Aspose.OCR to convert a PDF into a searchable PDF?

Thanks

Farhan.Raza · September 16, 2019, 4:15pm

@meinstei

You can convert a non-searchable PDF file to a searchable PDF document. Please try the following code snippet, which uses Tesseract.

C#

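// Requires: using System.Diagnostics; using System.IO; using Aspose.Pdf;
Document doc = new Document("D:/Downloads/input.pdf");
// Convert hands each image to be recognized to CallBackGetHocr and expects hOCR markup back.
doc.Convert(CallBackGetHocr);
doc.Save("E:/Data/pdf_searchable.pdf");
//********************* CallBackGetHocr method ***********************//
static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    // Save the image to disk so that tesseract.exe can read it.
    img.Save(dir + "ocrtest.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: <input image> <output base name, no extension> hocr
    info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Older Tesseract builds name the hOCR output out.html; newer ones write out.hocr instead.
    StreamReader streamReader = new StreamReader(@"E:\data\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}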

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

meinstei · September 19, 2019, 11:12am

Thanks!

Which solution do we have to integrate? ASPOSE.PDF + ASPOSE.OCR + TESSERACT?

Is it compatible with a Red Hat Linux 7 server running JRE OpenJDK?
What about licensing and pricing?
What server architecture is needed to support the solution (CPU, RAM, disk space, multiprocessing)?

Farhan.Raza · September 19, 2019, 9:20pm

@meinstei

You may test the same code snippet with your files and then decide whether you want to use Tesseract or another OCR engine for extracting the text. You can then use whichever you choose in combination with the Aspose.PDF API. Please also visit the documentation of the respective APIs, where the Getting Started section covers system requirements and related information.

Moreover, you can test the APIs at full capacity by applying a free 30-day temporary license. You can request a temporary license and evaluate the latest version of the API against your requirements. Once you complete the evaluation, you can purchase a license to keep using the API features. If you face any issue, please feel free to let us know.

Furthermore, you may contact our sales team or create a post in the Purchase Forum for any sales-related inquiry.

yjyj1990 · July 20, 2021, 7:47am

Hi, I have a question about this issue.
When I try this code snippet, the @"E:\data\out.html" part throws an error because
I haven't created the html file and it isn't created automatically.
What am I missing here?
For your reference, when I use a .hocr file instead of out.html, it throws a format mismatch error.
When is this html file supposed to be created, and by whom?

asad.ali · July 20, 2021, 7:12pm

@yjyj1990

Tesseract itself creates the output file (.txt, .html, or .hocr, depending on the arguments and the Tesseract version) in the specified location. Please try modifying the code snippet as below:

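static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"E:\Data\";
    // Save the image to disk so that tesseract.exe can read it.
    img.Save(dir + "ocrtest.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // The output base is given without an extension; the trailing "hocr" config
    // makes Tesseract write E:\data\out.hocr (older releases write out.html).
    // Without "hocr", Tesseract would write plain text to a .txt file instead.
    info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    return File.ReadAllText(@"E:\data\out.hocr");
}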
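
If you are not sure which extension your Tesseract build uses, the lookup can be made tolerant of both. The following is only a rough sketch, not part of the Aspose API: the class TesseractHocr and its methods BuildArguments, FindOutput and Run are hypothetical names made up for illustration, and it assumes the hOCR file ends in either .hocr or .html. It also checks the exit code and writes to a unique temp file instead of a fixed E:\data path.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper (not an Aspose or Tesseract API).
static class TesseractHocr
{
    // Builds "<image> <outputBase> hocr". The output base has no extension;
    // Tesseract appends one itself. Paths are quoted in case they contain spaces.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Returns the hOCR file found for the given output base, trying ".hocr"
    // first and ".html" second. Returns null if neither file exists.
    public static string FindOutput(string outputBase)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate))
                return candidate;
        }
        return null;
    }

    // Runs tesseract.exe on an image file and returns the hOCR markup.
    public static string Run(string tesseractExe, string imagePath)
    {
        // Unique output base so repeated calls never read a stale file.
        string outputBase = Path.Combine(
            Path.GetTempPath(), "ocr_" + Guid.NewGuid().ToString("N"));

        var info = new ProcessStartInfo(tesseractExe, BuildArguments(imagePath, outputBase))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException(
                    "Tesseract exited with code " + p.ExitCode);
        }

        string hocrPath = FindOutput(outputBase);
        if (hocrPath == null)
            throw new FileNotFoundException(
                "Tesseract produced no hOCR file for " + outputBase);

        try
        {
            return File.ReadAllText(hocrPath);
        }
        finally
        {
            File.Delete(hocrPath); // remove the temporary hOCR file
        }
    }
}
```

Inside CallBackGetHocr you would then save the image to a temporary file and return TesseractHocr.Run(pathToTesseractExe, pathToSavedImage). If Tesseract fails, this surfaces an exception with the exit code rather than a missing-file error later on.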