Convert PDF with HOCR to PDF/A-3B

scomag · January 4, 2023, 10:05am

Hi,
We are trying to convert a PDF with custom HOCR information to PDF/A-3B.
When searching in the result PDF, the positioning of the cursor is wrong.

Our code:

    public void convert(InputStream inputPdf, OutputStream outputPdf, Optional<String> hocr) {
        Document pdfDoc = new Document(inputPdf);
        if (hocr.isPresent()) {
            // The lambda ignores the image it receives and always returns the supplied hOCR.
            pdfDoc.convert(bufferedImage -> hocr.get());
        }
        //pdfDoc.validate(new PdfFormatConversionOptions(PdfFormat.PDF_A_3B));
        // Convert to PDF/A-3B after the hOCR text layer has been added, then save.
        PdfFormatConversionOptions pdfConvertOptions = new PdfFormatConversionOptions(PdfFormat.PDF_A_3B);
        pdfDoc.convert(pdfConvertOptions);
        pdfDoc.save(outputPdf);
    }

Example:
Input PDF:
Custom-Input.pdf (353.2 KB)

Input HOCR (zipped):
Custom-Input.hocr.7z (5.0 KB)

Result PDF:
Custom-Result.pdf (443.0 KB)

Thanks
Didi

asad.ali · January 4, 2023, 8:32pm

@scomag

An issue has been logged in our issue management system as PDFJAVA-42366 for further analysis of this case. We will look into the details and keep you posted on the status of a fix. Please bear with us in the meantime.

We are sorry for the inconvenience.

stud0r · February 10, 2023, 2:15pm

Dear all

Do you have an update on this issue?
Can you reproduce it, or do you need more information?

asad.ali · February 10, 2023, 10:58pm

@stud0r

We are afraid that the ticket logged earlier has not yet been resolved. It was logged under the free support model and will be investigated and resolved on a first-come, first-served basis. We will inform you as soon as the ticket is resolved. Please bear with us in the meantime.

We are sorry for the inconvenience.

asad.ali · April 5, 2023, 3:07pm

@scomag

This is not a bug in Aspose.PDF. The supplied hOCR was created incorrectly and contains wrong positions for the text. As a test, we used Tesseract to generate hOCR (image.tesseract.hocr) for the image received in CallBackGetHocr. After adding that result to the PDF, the text matches its positions on the image. image.tesseract.zip (6.0 KB)
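
For anyone who wants to check a custom hOCR file before handing it to the callback, here is a minimal sketch of a sanity check. It relies only on the hOCR convention that each element carries a "bbox x0 y0 x1 y1" entry in its title attribute, in pixels from the top-left corner of the image. The class HocrBoundsCheck and the method findOutOfBounds are hypothetical names made up for this illustration; they are not part of Aspose.PDF or Tesseract. The check only catches gross errors (boxes that are inverted or fall outside the image) and will not detect every scaling or offset mismatch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of Aspose.PDF or Tesseract.
class HocrBoundsCheck {

    // hOCR stores boxes in the title attribute as "bbox x0 y0 x1 y1".
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(-?\\d+)\\s+(-?\\d+)\\s+(-?\\d+)\\s+(-?\\d+)");

    /** Returns every bbox that is inverted or lies partly outside a width x height image. */
    static List<int[]> findOutOfBounds(String hocr, int width, int height) {
        List<int[]> bad = new ArrayList<>();
        Matcher m = BBOX.matcher(hocr);
        while (m.find()) {
            int x0 = Integer.parseInt(m.group(1));
            int y0 = Integer.parseInt(m.group(2));
            int x1 = Integer.parseInt(m.group(3));
            int y1 = Integer.parseInt(m.group(4));
            boolean inverted = x1 < x0 || y1 < y0;
            boolean outside = x0 < 0 || y0 < 0 || x1 > width || y1 > height;
            if (inverted || outside) {
                bad.add(new int[] {x0, y0, x1, y1});
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Made-up sample data: two words on an assumed 100 x 50 pixel image.
        String hocr = "<span class='ocrx_word' title='bbox 10 10 90 30'>Hello</span>"
                    + "<span class='ocrx_word' title='bbox 120 10 200 30'>World</span>";
        for (int[] b : findOutOfBounds(hocr, 100, 50)) {
            System.out.println("Suspicious bbox: " + b[0] + " " + b[1] + " " + b[2] + " " + b[3]);
        }
        // prints "Suspicious bbox: 120 10 200 30"
    }
}
```

Inside the callback, the width and height could be taken from bufferedImage.getWidth() and bufferedImage.getHeight(), so that the hOCR is compared against the exact image it is supposed to describe.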