Hi,
We are testing Aspose.OCR for Java with a view to including it in our ERP solution.
We use a simple piece of code to try to extract the text from a PDF file.
The performance is very bad!
Is the poor performance because I am using the trial version?
My test computer has 8 GB of RAM and an i7 processor, and I am using Java 1.7.
Best regards
Jose
The trial version only limits the result that is displayed. It has no effect on the performance of the API. Please share the sample PDF file that you are using at your end. We will evaluate it and update you about our findings.
Hi Ikram
Thanks.
I attach the PDF file; it is very simple.
I also include the source code and the execution times.
1.- com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(FileNamePdf);
takes many seconds
2.- jpegDevice.process( pdfDocument.getPages().get_Item(1), imageStream);
takes many minutes
3.- OcrEngine.setImage(ImageStream.fromFile(FileJPG));
takes many minutes
4.- OcrEngine.getText();
takes many, many minutes.
Thanks Ikram
public String GetText( String FileNamePdf, String Idioma ) {
    com.aspose.ocr.OcrEngine OcrEngine = new com.aspose.ocr.OcrEngine();
    com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(FileNamePdf);
    String Buffer = "";
    String FileJPG = FileNamePdf + ".jpg";
    try {
        com.aspose.pdf.devices.Resolution resolution = new com.aspose.pdf.devices.Resolution(300);
        com.aspose.pdf.devices.JpegDevice jpegDevice = new com.aspose.pdf.devices.JpegDevice(resolution, 100);
        // Render the first page to a JPEG file; the stream is closed
        // before the OCR engine reads the file back from disk
        try (java.io.OutputStream imageStream = new java.io.FileOutputStream(FileJPG)) {
            jpegDevice.process(pdfDocument.getPages().get_Item(1), imageStream);
        }
        // Perform OCR operation on one page at a time
        OcrEngine.setImage(ImageStream.fromFile(FileJPG));
        if (OcrEngine.process()) {
            Buffer = "" + OcrEngine.getText();
        }
        System.out.println("Result: " + Buffer);
    } catch (Exception e) {
        // Report the failure instead of silently swallowing it
        e.printStackTrace();
    }
    return Buffer;
}
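The per-step times quoted above ("many seconds", "many minutes") would be easier to compare as measured numbers. The following is only a minimal sketch of one way to collect them, using nothing but the Java standard library. The class StepTimer and its methods are hypothetical names invented for this illustration and are not part of any Aspose API. The steps in main are placeholders (short sleeps) standing in for the real Aspose calls. Anonymous classes are used instead of lambdas so that the code also compiles on Java 1.7.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of Aspose): records how long each named step takes.
final class StepTimer {
    // LinkedHashMap keeps the steps in the order in which they were run
    private final Map<String, Long> elapsedMillis = new LinkedHashMap<String, Long>();

    // Runs the step, stores its wall-clock duration in milliseconds, returns its result
    <T> T time(String label, Callable<T> step) throws Exception {
        long start = System.nanoTime();
        try {
            return step.call();
        } finally {
            // Recorded even if the step throws, so failed steps are timed too
            elapsedMillis.put(label, (System.nanoTime() - start) / 1000000L);
        }
    }

    Map<String, Long> results() {
        return elapsedMillis;
    }

    public static void main(String[] args) throws Exception {
        StepTimer timer = new StepTimer();

        // Placeholder for: new com.aspose.pdf.Document(FileNamePdf)
        timer.time("load PDF", new Callable<String>() {
            @Override
            public String call() throws Exception {
                Thread.sleep(20);
                return "document";
            }
        });

        // Placeholder for: OcrEngine.process() followed by OcrEngine.getText()
        timer.time("OCR", new Callable<String>() {
            @Override
            public String call() throws Exception {
                Thread.sleep(30);
                return "recognized text";
            }
        });

        // Print one line per step with its measured duration
        for (Map.Entry<String, Long> entry : timer.results().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue() + " ms");
        }
    }
}
```

To use it with the method above, each of the four numbered calls would be wrapped in its own timer.time(...) call in place of the placeholders, which gives one measured duration per step to include in the report.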
JL2017112915119584315440_2017-FA-01-0003.PDF (21.4 KB)
2017112915119584315440_2017-FA-01-0003.PDF (21.4 KB)
We have investigated the issue and found that the data in the PDF is in tabular format. Please note that the current implementation does not support extracting data from tables. An issue for reading data in tabular format has been logged in our system with ID OCRNET-2941, and the issue ID has been linked to this thread. You will be notified automatically in this forum thread as soon as an update is available.