Convert PDF from Tesseract with OCR Overlay

scomag · January 4, 2023, 9:52am

Hi,
We are trying to convert a PDF with OCR overlay from Tesseract (v5.2.0) to PDF/A-3B.
After the conversion, the OCR layer is gone.

We are using Aspose.PDF for Java v22.12.

Our code:

    public void convert(InputStream inputPdf, OutputStream outputPdf, Optional<String> hocr) {
        Document pdfDoc = new Document(inputPdf);
        // If hOCR markup was supplied, hand it to Aspose through the hOCR callback
        if (hocr.isPresent()) {
            pdfDoc.convert(bufferedImage -> hocr.get());
        }
        //pdfDoc.validate(new PdfFormatConversionOptions(PdfFormat.PDF_A_3B));
        // Convert the document to PDF/A-3B and write the result
        PdfFormatConversionOptions pdfConvertOptions = new PdfFormatConversionOptions(PdfFormat.PDF_A_3B);
        pdfDoc.convert(pdfConvertOptions);
        pdfDoc.save(outputPdf);
    }

Example Input:
Tesseract-Result.pdf (23.1 KB)

Example Output from Aspose:
Aspose-result.pdf (27.8 KB)
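
To compare the two files independently of Aspose, a rough check along the following lines may help. It is only a sketch under stated assumptions: it assumes the OCR overlay is stored as invisible text (text rendering mode 3, the `3 Tr` operator in a content stream) and that streams are either uncompressed or Flate-compressed. The names `InvisibleTextCheck`, `hasInvisibleText` and `inflateOrRaw` are made up for illustration, and the byte scan is a heuristic, not a real PDF parser.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper: heuristic scan of a PDF for invisible text ("3 Tr").
class InvisibleTextCheck {

    // Matches the text rendering mode operator with operand 3 (invisible text).
    private static final Pattern INVISIBLE = Pattern.compile("(^|\\s)3\\s+Tr(\\s|$)");

    // Returns true if any stream in the file appears to set text rendering mode 3.
    static boolean hasInvisibleText(byte[] pdf) {
        // ISO-8859-1 maps bytes 1:1 to chars, so string indexes equal byte offsets.
        String raw = new String(pdf, StandardCharsets.ISO_8859_1);
        int pos = 0;
        while (true) {
            int start = raw.indexOf("stream", pos);
            if (start < 0) return false;
            int end = raw.indexOf("endstream", start + 6);
            if (end < 0) return false;
            // Skip the end-of-line marker that follows the "stream" keyword.
            int dataStart = start + 6;
            if (dataStart < end && raw.charAt(dataStart) == '\r') dataStart++;
            if (dataStart < end && raw.charAt(dataStart) == '\n') dataStart++;
            byte[] decoded = inflateOrRaw(Arrays.copyOfRange(pdf, dataStart, end));
            String text = new String(decoded, StandardCharsets.ISO_8859_1);
            if (INVISIBLE.matcher(text).find()) return true;
            pos = end + "endstream".length();
        }
    }

    // Tries Flate decompression; falls back to the raw bytes if the data is not zlib.
    static byte[] inflateOrRaw(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) break; // truncated input or preset dictionary needed
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return data; // assumption: stream is uncompressed or uses another filter
        } finally {
            inflater.end();
        }
    }
}
```

Calling `InvisibleTextCheck.hasInvisibleText(Files.readAllBytes(path))` on both the input and the output file gives a quick indication of whether the invisible text layer is still present. Streams using other filters are not decoded by this sketch, so a `false` result is a hint rather than proof.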

Thanks
Didi

asad.ali · January 4, 2023, 7:32pm

@scomag

An issue has been logged as PDFJAVA-42365 in our issue tracking system for further investigation. We will look into the details and keep you posted on the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

stud0r · February 10, 2023, 2:17pm

Dear all

Can you reproduce it, or do you need more information?
Do you have an update on this issue?

asad.ali · February 10, 2023, 11:01pm

@stud0r

We are afraid that the investigation of the earlier logged ticket could not be completed because other issues logged before it are still pending in the queue. Nevertheless, your concerns have been recorded and will be considered during the ticket investigation. We will inform you as soon as we make some progress towards a fix. Please spare us some time.

We are sorry for the inconvenience.