OCR non-searchable PDF and extract pages

ismael · May 3, 2013, 12:46pm

Hello Java-OCR Tech Support,

I was exchanging emails with your Sales group but before I buy your software, I need to make sure that it can do the job. Please provide a technical solution to my requirements using your products.

I have to OCR, search, and extract pages from non-searchable PDFs, and I need to do this in a Java servlet running in Tomcat. My requirements are:

1. I have to OCR a non-searchable PDF document (multiple pages) and look for some strings.

2. I need to pull out the pages where these strings occur and store them in a new PDF file.

I see that you have these products 'Java-OCR' and 'Java-PDF'. Based on the description, 'Java-OCR' can only work on BMP files. Does your Java-OCR now support the PDF file format? If not, can I pull out each page using Java-PDF, convert it into BMP, and then run the OCR to locate the strings?

Thank you.

tilal.ahmad · May 6, 2013, 11:50am

Hi Ismael,

Thanks for your interest in Aspose. Yes, you are right: it currently supports BMP as the input file format and English as the language. It can recognize the Arial, Times New Roman, and Tahoma fonts in regular, bold, and italic styles. Recognition accuracy for large font sizes, i.e. 32 pt and above, is 90%; smaller font sizes have lower accuracy. Our development team is working on a major revamp of the Aspose.OCR API to improve performance and add support for smaller font sizes, new fonts, and new languages.

You can convert a PDF document to a BMP image with Aspose.Pdf and then run OCR on the resulting image with Aspose.OCR.
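As a rough illustration of the page-selection step only, here is a minimal Java sketch of how the pages containing the search strings could be picked out once each page has been converted and OCR'd. It does not call Aspose.Pdf or Aspose.OCR: `PageTextSource` is a hypothetical interface standing in for "render the page to BMP, then OCR it", and the class and method names are made up for this example. The matching is case-insensitive and collapses whitespace, on the assumption that OCR output may break a phrase across lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PageSearchSketch {

    /**
     * Hypothetical stand-in for "render page to BMP, then OCR it".
     * Not part of any real library; implement it with whichever tools you use.
     */
    public interface PageTextSource {
        int pageCount();
        String textOfPage(int pageNumber); // 1-based page number
    }

    /** Returns the 1-based numbers of pages whose OCR text contains any search string. */
    public static List<Integer> findMatchingPages(PageTextSource source, List<String> searchStrings) {
        List<Integer> matches = new ArrayList<>();
        for (int page = 1; page <= source.pageCount(); page++) {
            String text = normalize(source.textOfPage(page));
            for (String s : searchStrings) {
                String needle = normalize(s);
                // Skip empty search strings, which would otherwise match every page.
                if (!needle.isEmpty() && text.contains(needle)) {
                    matches.add(page);
                    break; // one hit is enough to keep the page
                }
            }
        }
        return matches;
    }

    /** Lower-cases and collapses whitespace so line breaks in OCR output do not hide a match. */
    static String normalize(String s) {
        return s == null ? "" : s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }
}
```

The returned page numbers would then be handed to whichever PDF library performs the extraction into a new file; that step is left out here because it depends on the library's actual API.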

Please feel free to contact us for any further assistance.

Best Regards,

ismael · May 6, 2013, 1:15pm

Tilal,

I tried your sample code to convert a multi-page PDF into BMP files. I have a typical PDF containing a form with 9 pages. It took about 10 minutes to create the BMP files for the first 4 pages (the eval copy can process only up to 4 pages). This is not a viable solution for me because it is too slow. Is this a normal speed?

I wanted to test the whole process from PDF -> BMP -> OCR -> Text. However, your evaluation copy puts a solid black block over most of the page, so I can't test BMP -> OCR. I want to run OCR on the BMP file generated by your product to check the accuracy. Is there a way to get a version that doesn't put a solid block on the page so I can test the whole process?

Thank you.

ismael · May 6, 2013, 8:18pm

I tried the Java.OCR project and the performance is unsatisfactory at best. When I used Aspose's sample BMP file, it found the letters. However, in this sample BMP file, the letters are the size of an elephant so it had to find the letters.

I tried to OCR my own BMP file, which I believe has a typical font size, and it found only one letter.

Are customers of this product happy with its performance?

Thank you.

tilal.ahmad · May 7, 2013, 8:09am

Hi Ismael,

Thanks for your feedback.

We apologize for the inconvenience caused. As I've already mentioned above, there is a lot of room for improvement in Aspose.OCR. Aspose.OCR is still an early-stage product and doesn't quite meet expectations. Our development team is working hard to improve the performance and capabilities of the product. We are working on some new algorithms, and once they are in place we hope to release a significantly improved version.

I've linked your request to the related issue OCR-29048 and will notify you as soon as it is resolved.

Best regards,

awais.hafeez · March 29, 2018, 5:23am

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.