Dear support,
I would like to add a searchable text layer to an existing PDF document. I have thoroughly read your tutorial on using Tesseract OCR to extract text from an image. What prevents me from running the example attached to issue PDFNEWJAVA-33678 is the following error from Aspose:
tesseract src/main/resources/test-files/test.jpg src/main/resources/test-files/out hocr
Exception in thread "main" class com.aspose.pdf.internal.ms.System.z75: Unknown char: ;
com.aspose.pdf.internal.p584.z6.m1(Unknown Source)
com.aspose.pdf.internal.p584.z6.m1(Unknown Source)
com.aspose.pdf.internal.p584.z6.m1(Unknown Source)
com.aspose.pdf.internal.ms.System.z61.m1(Unknown Source)
com.aspose.pdf.internal.ms.System.z43.m12(Unknown Source)
com.aspose.pdf.internal.p88.z5.m1(Unknown Source)
com.aspose.pdf.internal.p88.z5.m1(Unknown Source)
com.aspose.pdf.internal.p88.z5.m1(Unknown Source)
com.aspose.pdf.ADocument.convert(Unknown Source)
com.aspose.pdf.Document.convert(Unknown Source)
ch.mimacom.zurich.AsposePDFScanner.scanText(AsposePDFScanner.java:80)
ch.mimacom.zurich.Starter.main(Starter.java:10)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:497)
com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
And an excerpt from the HTML (hOCR) output generated by Tesseract:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
Roland und Stephan Aebi, p.Adr. Aspa AG, Amerbachslrasse 72, 4007 Basel
I am using Tesseract 3.04 and Aspose PDF 10.9.0. I am evaluating OCR tools for purchase by my company, so any help with this is greatly appreciated.