OCR is terrible; free Python libraries are significantly better

I’m trying to convert scanned PDFs to text, and it’s been a terrible experience using Aspose OCR. We have the Total license; does this give us better support?

I can’t share the actual files because they’re legal documents. I can say that Aspose misses entire paragraphs of text on pages where Tesseract might miss a word here or there. It really is a completely different experience between the two.

Here’s a snippet of my code. I’ve been experimenting with the preprocessing filters, but they don’t seem to do anything. Also, changing the image resolution to 600 or even 1200 makes the results worse; 300 is the best so far, and 150 was worse too.

Summary of the code:

// Load the scanned PDF and render each page to a JPEG at the chosen DPI
Document pdfDocument = new Document(stream);
JpegDevice jpegDevice = new JpegDevice(new Resolution(1200));
for (Page page : pdfDocument.getPages()) {
    ByteArrayOutputStream jpegStream = new ByteArrayOutputStream();
    jpegDevice.process(page, jpegStream);
    ...
    // Configure the OCR engine for English text
    AsposeOCR ocrEngine = new AsposeOCR();
    RecognitionSettings settings = new RecognitionSettings();
    settings.setLanguage(Language.Eng);
    // Preprocessing: deskew, contrast correction, 2x upscale
    PreprocessingFilter filters = new PreprocessingFilter();
    filters.add(PreprocessingFilter.AutoSkew());
    filters.add(PreprocessingFilter.ContrastCorrection());
    filters.add(PreprocessingFilter.Scale(2));

    // Pass the rendered page image to the OCR engine
    ByteArrayInputStream inputImageStream = new ByteArrayInputStream(jpegStream.toByteArray());

    OcrInput ocrInput = new OcrInput(InputType.SingleImage, filters);
    ocrInput.add(inputImageStream);
    ArrayList<RecognitionResult> result = ocrEngine.Recognize(ocrInput, settings);
    System.out.println(result.get(0).recognitionText);
}
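
For reference, the pixel sizes implied by the resolution values mentioned above can be worked out directly, since a PDF point is 1/72 inch. The following is only a sketch of that arithmetic in plain Java. The class name RenderSize, the US Letter page size (612 x 792 pt), and the idea that the scale filter multiplies width and height by its factor are all assumptions for illustration. Nothing in this thread confirms that image size is what degrades recognition.

```java
// Sketch: pixel dimensions of a rendered PDF page at several DPI values.
// Assumptions (not from the thread): a US Letter page (612 x 792 pt) and
// a scale filter that multiplies width and height by its factor.
final class RenderSize {
    static final double POINTS_PER_INCH = 72.0; // default PDF user-space unit

    // Convert a length in PDF points to pixels at the given DPI.
    static long toPixels(double points, int dpi) {
        return Math.round(points / POINTS_PER_INCH * dpi);
    }

    // Megapixels after rendering at dpi and scaling both axes by scale.
    static double megapixels(double widthPt, double heightPt, int dpi, int scale) {
        long w = toPixels(widthPt, dpi) * scale;
        long h = toPixels(heightPt, dpi) * scale;
        return (w * h) / 1_000_000.0;
    }

    public static void main(String[] args) {
        double widthPt = 612, heightPt = 792; // hypothetical page size
        int scale = 2;                         // mirrors Scale(2) in the snippet
        for (int dpi : new int[] {150, 300, 600, 1200}) {
            System.out.printf("%4d DPI: %d x %d px, %.1f MP after %dx scale%n",
                    dpi, toPixels(widthPt, dpi), toPixels(heightPt, dpi),
                    megapixels(widthPt, heightPt, dpi, scale), scale);
        }
    }
}
```

Under these assumptions, a Letter page is 2550 x 3300 px at 300 DPI and 10200 x 13200 px at 1200 DPI, before any scale filter is applied.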

I was able to create an example comparing the two, but it doesn’t do justice to how bad the results are for our legal documents.

obi.pdf (179.6 KB)

Aspose OCR:
ITDID N[T TAKE
.c.E@t [
LDNG ID REALIZE
THERE WAS A DISTURBANGE
IN THE F[RGE.

THE AGT0R N0TICED S0MEIHING WAS ASKEW as soon as he stepped onto
the Obi-Wan Kenobi soundstage. ““I came round the set, and it was just
this ring of people standing around.””
Not quite sure what all the commotion was about, a confused
McGregor took his position in the frame-a look of puzzlement on his
face not seen since the nefarious Count Dooku dropped aSith Lordtruth
bomb on the Jedi Master back on Geonosis. ““I had the cameras behind
me looking down this street, and behind the cameras were 100 people
standing there,”” recalls McGregor. "“They’re usually in places doing
work, notjust standing. I couldn’t quite work out what was happening.”

Tesseract OCR:
IT DID NOT TAKE
EWAN McGREGOR

LONG TO REALIZE
THERE WAS A DISTURBANGE
IN THE FORCE.

THE ACTOR NOTICED SOMETHING WAS ASKEW as soon as he stepped onto
the Obi-Wan Kenobi soundstage. “I came round the set, and it was just
this ring of people standing around.”

Not quite sure what all the commotion was about, a confused
McGregor took his position in the frame—a look of puzzlement on his
face not seen since the nefarious Count Dooku dropped a Sith Lord truth
bomb on the Jedi Master back on Geonosis. “I had the cameras behind
me looking down this street, and behind the cameras were 100 people
standing there,” recalls McGregor. “They’re usually in places doing
work, not just standing, I couldn’t quite work out what was happening.”
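
To put a number on the difference between the two outputs above, one option is a character error rate (edit distance divided by reference length) against a hand-corrected reference. The sketch below is plain Java and does not call Aspose or Tesseract. The class and method names (OcrCompare, normalize, levenshtein, charErrorRate) are hypothetical, and the reference string is assumed to be the correct reading of one headline line.

```java
// Sketch: character error rate (CER) between a reference text and OCR output.
// All names here are illustrative; no OCR library is involved.
final class OcrCompare {
    // Collapse runs of whitespace so line-break differences are not counted.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Levenshtein edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Edit distance divided by the reference length, after normalization.
    static double charErrorRate(String reference, String ocrOutput) {
        String ref = normalize(reference);
        String out = normalize(ocrOutput);
        if (ref.isEmpty()) return out.isEmpty() ? 0.0 : 1.0;
        return (double) levenshtein(ref, out) / ref.length();
    }

    public static void main(String[] args) {
        String reference = "IN THE FORCE.";     // assumed correct text
        String asposeLine = "IN THE F[RGE.";    // line from the Aspose output
        String tesseractLine = "IN THE FORCE."; // line from the Tesseract output
        System.out.printf("Aspose CER:    %.3f%n", charErrorRate(reference, asposeLine));
        System.out.printf("Tesseract CER: %.3f%n", charErrorRate(reference, tesseractLine));
    }
}
```

Running the same comparison over whole pages would need a manually verified reference transcript for each page, which this sketch does not provide.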

@Lowzee

We are checking it and will get back to you shortly.

@Lowzee

We do apologize for the poor results that you got using the API. We are constantly working on improving the recognition quality of the API.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-839

We will let you know once we have some updates about ticket resolution.