Seeking Assistance with Landscape PDF OCR in Aspose

corixAGmst · November 23, 2023, 1:09pm

Dear Aspose Community,

Our company is currently in the process of evaluating a new OCR solution for our software and we have encountered an issue while testing Aspose OCR for Java.

Our tests have shown that when we run OCR on a PDF in portrait orientation, the text recognition works as expected (within the limitations of the test license/version). However, when we apply the same code to a PDF in landscape orientation, the OCR does not seem to work at all. Only a few random (and often incorrect) characters are recognized, if any.

We have attempted different approaches and tested with various PDF files in both orientations, but the issue persists.

Is there a specific setting that needs to be configured to recognize PDFs in landscape orientation? Any guidance or information on this topic would be greatly appreciated, as I have been unable to find any relevant information elsewhere.

Thank you in advance for your assistance.

Best regards, Manuel

asad.ali · November 23, 2023, 9:06pm

@corixAGmst

Would you kindly share one of the sample PDFs that causes the issue during OCR? We will test the scenario in our environment and address it accordingly.

corixAGmst · November 24, 2023, 8:49am

scanned_landscape_pdfs.zip (294,8 KB)

@asad.ali

I have attached a ZIP file with the two scanned PDFs we used to test landscape documents.

Thank you for your answer and tests.

asad.ali · November 24, 2023, 7:23pm

@corixAGmst

We have opened an investigation ticket, OCRJAVA-345, in our issue tracking system for this case. We will look into it further and let you know as soon as the ticket is resolved. Please allow us some time.

corixAGmst · December 5, 2023, 11:07am

@asad.ali

I’m following up on investigation ticket OCRJAVA-345. Could you please provide an update on the progress of this case? We are considering purchasing the product and would appreciate any information on when we might expect a resolution. Thank you for your time.

asad.ali · December 5, 2023, 9:02pm

@corixAGmst

The images in these PDFs are rotated by 90 degrees, but the rotation is applied through the PDF transformation matrix rather than stored in the image itself. Unfortunately, we have no functionality that can extract the rotation matrix from a PDF, and our skew corrector can only detect a -90 degree rotation.
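
For context, here is a minimal Java sketch of how a rotation angle can be read off a PDF transformation matrix once you have its values. It assumes the matrix has the standard form [a b c d e f] with a pure rotation (optionally scaled, no skew or mirroring), so that a = s*cos(t) and b = s*sin(t). The class and method names are hypothetical, and obtaining the matrix from the PDF is outside this sketch, since Aspose.OCR does not expose it.

```java
public class PdfMatrixRotation {

    /**
     * Returns the rotation encoded in a PDF matrix [a b c d e f], snapped to
     * the nearest multiple of 90 and normalized to 0, 90, 180 or 270.
     * Only a and b are needed when the matrix is a (scaled) pure rotation;
     * the angle is counterclockwise in PDF user space.
     */
    public static int rotationDegrees(double a, double b) {
        double degrees = Math.toDegrees(Math.atan2(b, a));
        int snapped = (int) Math.round(degrees / 90.0) * 90;
        // Normalize negative angles such as -90 into the 0..359 range
        return ((snapped % 360) + 360) % 360;
    }

    public static void main(String[] args) {
        // Matrix [0 1 -1 0 0 0] is a 90 degree rotation
        System.out.println(rotationDegrees(0, 1));   // prints 90
        // Matrix [0 -1 1 0 0 0] is a -90 degree rotation, reported as 270
        System.out.println(rotationDegrees(0, -1));  // prints 270
    }
}
```

A result of 90 or 270 would indicate that the page image needs a quarter turn before recognition.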

So there are two approaches that can help you in this case.

The first is to use a preprocessing filter to rotate the PDF pages by 90 degrees before recognition. However, this currently also gives a poor result: the rotation does not change the image dimensions, so the rotated page is cropped to the original width. This will be fixed in the next release.

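// C# (Aspose.OCR for .NET)
AsposeOcr api = new AsposeOcr();

// Rotate every page by 90 degrees before recognition
PreprocessingFilter filter = new PreprocessingFilter
{
    PreprocessingFilter.Rotate(90)
};

// imgPath is the path to the PDF file
OcrInput ocrInput = new OcrInput(InputType.PDF, filter);
ocrInput.Add(imgPath);

RecognitionSettings set = new RecognitionSettings();
List<RecognitionResult> result = api.Recognize(ocrInput, set);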
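
Until the cropping is fixed, one possible workaround is to rotate the rendered page image yourself and pass the rotated image to OCR. The following is a minimal Java sketch using only the standard library; the class and method names are hypothetical and it is not part of the Aspose API. The key point is that the output image swaps width and height, so no part of the page is cut off.

```java
import java.awt.image.BufferedImage;

public class PageRotation {

    /**
     * Rotates an image by 90 degrees. The result has the source height as
     * its width and the source width as its height, so nothing is cropped.
     */
    public static BufferedImage rotate90(BufferedImage src, boolean clockwise) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                if (clockwise) {
                    // Top-left of the source ends up at the top-right
                    dst.setRGB(h - 1 - y, x, rgb);
                } else {
                    // Top-left of the source ends up at the bottom-left
                    dst.setRGB(y, w - 1 - x, rgb);
                }
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // A landscape image of 300x200 pixels becomes a portrait image
        BufferedImage landscape = new BufferedImage(300, 200, BufferedImage.TYPE_INT_RGB);
        BufferedImage portrait = rotate90(landscape, true);
        System.out.println(portrait.getWidth() + "x" + portrait.getHeight()); // prints 200x300
    }
}
```

Whether the pages need a clockwise or counterclockwise turn depends on how the scans were rotated, so try both directions on a sample page.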

The other way is to use Aspose.PDF to render each page to an image and run OCR on the images:

// C# (Aspose.OCR for .NET together with Aspose.PDF for .NET)
AsposeOcr api = new AsposeOcr();

// Render pages at 300 DPI
Resolution resolution = new Resolution(300);
PngDevice imageDevice = new PngDevice(resolution);
Document pdfDocument = new Document(pdfPath);

// Aspose.PDF page indexes start at 1
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    // The using block disposes the stream, so no explicit Close() is needed
    using (MemoryStream ms = new MemoryStream())
    {
        // Convert a particular page and save the image to stream
        imageDevice.Process(pdfDocument.Pages[pageCount], ms);

        // Recognize the rendered page image
        OcrInput input = new OcrInput(InputType.SingleImage);
        input.Add(ms);
        var recognResult = api.Recognize(input, new RecognitionSettings { });
    }
}

corixAGmst · December 8, 2023, 9:33am

Thanks a lot for your investigation and your feedback.

The second approach is not completely clear to us. Could you please provide a complete code solution in Java for the approach that uses Aspose.PDF, so that we can see how the recognized text is brought back into the PDF document?

Input: a PDF document with, say, 5 pages, of which pages 2 and 3 are in landscape mode

Output: a PDF document containing the pages in the same portrait/landscape mode with recognized text

That would be very helpful for us. If we cannot handle such documents, then we would probably need to search for other products.

Thanks in advance for any additional help.

asad.ali · December 8, 2023, 4:52pm

@corixAGmst

Please try the code snippet below:

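// Imports assumed by this snippet
import com.aspose.ocr.*;
import com.aspose.pdf.Document;
import com.aspose.pdf.devices.PngDevice;
import com.aspose.pdf.devices.Resolution;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;

// Initialize OCR License Instance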
String ocrLicenseFile = "Aspose.OCR.Product.Family.lic";
com.aspose.ocr.License.setLicense(ocrLicenseFile);

// Initialize PDF License Instance
com.aspose.pdf.License pdfLicense = new com.aspose.pdf.License();
try {
    pdfLicense.setLicense("Aspose.Total.Product.Family.lic");
} catch (Exception e) {
    e.printStackTrace();
}

AsposeOCR api = new AsposeOCR();
RecognitionSettings settings = new RecognitionSettings();
settings.setDetectAreasMode(DetectAreasMode.PHOTO);

PreprocessingFilter filters = new PreprocessingFilter();
//filters.add(PreprocessingFilter.AutoSkew());

OcrInput input = new OcrInput(InputType.SingleImage, filters);

// Use Aspose.PDF to convert pages into images (file refers to the source PDF)
Document pdfDocument = new Document(file);
Resolution resolution = new Resolution(300);
PngDevice pngDevice = new PngDevice(resolution);

for (int pageCount = 1; pageCount <= pdfDocument.getPages().size(); pageCount++) {
    java.io.ByteArrayOutputStream outputBinImageFile = new java.io.ByteArrayOutputStream();
    // Convert a particular page and save the image to stream
    pngDevice.process(pdfDocument.getPages().get_Item(pageCount), outputBinImageFile);
    ByteArrayInputStream in = new ByteArrayInputStream(outputBinImageFile.toByteArray());
    // Put stream into OCR object for recognition
    input.add(in);
    outputBinImageFile.close();
    in.close();
}

// Recognize all pages
ArrayList<RecognitionResult> results = api.Recognize(input, settings);

// Print results (PrintResultShort is a helper method not included in this snippet)
for (RecognitionResult res : results) {
    PrintResultShort(res);
}

input.clear();

// Save results in needed format
AsposeOCR.SaveMultipageDocument("D://java.pdf", Format.Pdf, results);
AsposeOCR.SaveMultipageDocument("D://javaWithoutImg.pdf", Format.PdfNoImg, results);