Multiple Recognize operations always return the result of the first page

GerdRein · June 26, 2021, 12:45pm

Hello, I think this is a bug.
What I am doing:
1.) Create one JPG for every page of a PDF (I checked that the result is OK).
2.) Convert every image to a single-page PDF. Here I tried two different ways:
a)
private ArrayList<String> convertImagesToPdfs(ArrayList<String> imageFiles)
{
    ArrayList<String> pdfFiles = new ArrayList<>();
    AsposeOCR api;
    String imageFile;
    String outFile;
    RecognitionResult res;
    for (int i = 0; i < imageFiles.size(); i++)
    {
        try
        {
            imageFile = imageFiles.get(i);
            System.out.println("OCR of " + imageFile);
            outFile = imageFile + (i + 1) + ".pdf";
            // A fresh API instance and fresh settings for every page
            api = new AsposeOCR();
            RecognitionSettings set = new RecognitionSettings();
            set.setDetectAreas(false);
            set.setLanguage(Language.Deu);
            set.setAutoSkew(true);
            // Recognize one image and save its result directly as a PDF
            res = api.RecognizePage(imageFile, set);
            res.save(outFile, Format.Pdf);
            System.out.println("Adding " + outFile);
            pdfFiles.add(outFile);
        } catch (Exception e)
        {
            e.printStackTrace();
        }
    }
    return pdfFiles;
}
b)
private ArrayList<String> convertImagesToPdfs(ArrayList<String> imageFiles)
{
    ArrayList<String> pdfFiles = new ArrayList<>();
    AsposeOCR api;
    String imageFileDir;
    String outFile;
    ArrayList<RecognitionResult> res;
    RecognitionResult resOne;
    if (imageFiles.size() > 0)
    {
        // All page images are in the same directory as the first one
        File f = new File(imageFiles.get(0));
        imageFileDir = f.getParent();
        try
        {
            api = new AsposeOCR();
            RecognitionSettings set = new RecognitionSettings();
            set.setDetectAreas(false);
            set.setLanguage(Language.Deu);
            set.setAutoSkew(true);
            // Recognize every image in the directory in one call
            res = api.RecognizeMultiplePages(imageFileDir, set);
            for (int i = 0; i < res.size(); i++)
            {
                resOne = res.get(i);
                outFile = imageFileDir + "\\" + (i + 1) + ".pdf";
                resOne.save(outFile, Format.Pdf);
                System.out.println("Adding " + outFile);
                pdfFiles.add(outFile);
            }
        } catch (Exception e)
        {
            e.printStackTrace();
        }
    }
    return pdfFiles;
}
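
Both variants build the output path by string concatenation, and variant b) hard-codes a backslash as the separator. As a side note, here is a minimal sketch of deriving "<image directory>/<page number>.pdf" with java.nio.file instead. The class and method names (OutputNames, pdfPathFor) are made up for illustration and are not part of Aspose.OCR.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class OutputNames {
    // Hypothetical helper: places "<pageNumber>.pdf" next to the given image file,
    // using the platform's own path separator.
    static Path pdfPathFor(String imageFile, int pageNumber) {
        Path image = Paths.get(imageFile).toAbsolutePath();
        return image.resolveSibling(pageNumber + ".pdf");
    }

    public static void main(String[] args) {
        // Prints an absolute path ending in "pages/1.pdf" (or "pages\1.pdf" on Windows)
        System.out.println(pdfPathFor("pages/1.Jpeg", 1));
    }
}
```

The resulting Path can be turned into a String with toString() wherever a file name is expected.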

Result: All single-page PDFs are created, but every PDF is identical: each one shows the text of the first page. It seems to me that the RecognitionResult is always the same.
I am using aspose-ocr-21.5.
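
One way to test the suspicion that every RecognitionResult is the same is to compare the recognized texts before anything is saved. The sketch below is only an illustration under assumptions: it takes the per-page texts as plain strings (how they are read from each RecognitionResult is left out), and the names DuplicateTextCheck and pagesMatchingFirst are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DuplicateTextCheck {
    // Returns the 1-based numbers of all pages (from page 2 on)
    // whose recognized text is identical to the text of page 1.
    static List<Integer> pagesMatchingFirst(List<String> pageTexts) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 1; i < pageTexts.size(); i++) {
            if (pageTexts.get(i).equals(pageTexts.get(0))) {
                matches.add(i + 1);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample texts standing in for real OCR output
        List<String> texts = List.of("Seite eins", "Seite zwei", "Seite eins");
        System.out.println(pagesMatchingFirst(texts)); // prints [3]
    }
}
```

An empty list would suggest that recognition itself returns different text per page and that the problem lies in a later step.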

asad.ali · June 28, 2021, 10:21am

@GerdRein

Could you please also share the sample source files for our reference? We will test the scenario in our environment and address it accordingly.

GerdRein · June 28, 2021, 10:57am

Yes, I can.
The input file: applikationsnotiz.pdf (204.6 KB)
The first 2 created JPG files (of 8): 1.Jpeg (1.1 MB), 2.Jpeg (951.9 KB)
The resulting first 2 single-page PDFs (created from the 2 images): 1.pdf (4.7 KB), 2.pdf (8.9 KB)
Important: The second PDF is bigger than the first, and so on, up to the eighth PDF: 8.pdf (31.0 KB).
But I always see the same (first) page.
Both methods I wrote (see above) give the same result.

Also, Acrobat Reader shows me an error, but if I open the PDF in Chrome, there is no error.
Regards, Gerd

GerdRein · June 28, 2021, 1:45pm

Supplement to my previous post.
Today I tested the following:
1.) Extract all pages from the PDF into single JPGs (as before).
2.) OCR every single JPG and write the recognition result to a text file.
It works!
If I write (in a loop) every recognition result into a newly created DOCX, followed by a page break: it works!
If I save the complete DOCX as a PDF: it works!

So it seems to me that the only problem occurs when saving the RecognitionResult directly to a PDF.
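
The text-file test described above can be sketched roughly as follows. This is not the exact code that was used: the OCR step is passed in as a function from image path to recognized text so that the example runs without the OCR library, and the names TextFileWorkaround, writePageTexts and recognizeText are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class TextFileWorkaround {
    // Runs the given recognizer on every image and writes the text of
    // page n to "<outDir>/<n>.txt". Returns the written files in page order.
    static List<Path> writePageTexts(List<String> imageFiles,
                                     Function<String, String> recognizeText,
                                     Path outDir) throws IOException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < imageFiles.size(); i++) {
            String text = recognizeText.apply(imageFiles.get(i));
            Path out = outDir.resolve((i + 1) + ".txt");
            Files.write(out, text.getBytes(StandardCharsets.UTF_8));
            written.add(out);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ocr-text");
        // The lambda is a stand-in for the real OCR call
        List<Path> files = writePageTexts(
                List.of("1.Jpeg", "2.Jpeg"), name -> "text of " + name, dir);
        System.out.println(files);
    }
}
```

In real use, the function argument would wrap the actual recognition call and return the recognized text of one image.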

asad.ali · June 30, 2021, 5:26am

@GerdRein

We have logged this issue as OCRJAVA-135 in our issue tracking system for further investigation. We will look into the details and let you know as soon as the ticket is resolved. Please be patient and give us some time.

We are sorry for the inconvenience.

asad.ali · July 6, 2021, 7:03pm

@GerdRein

We have fixed the earlier logged issue in Aspose.OCR for Java 21.5, and creating PDF documents now works perfectly for both options. Please update the Maven dependency (delete the old packages and download the new ones from Maven).

In release 21.7 (to be published in July), the ability to create multi-page PDFs will be implemented.