Problem Converting to PDF/A-2U after OCR


#1

Hi, I am having an issue converting PDF output from Tesseract-OCR to the PDF/A-2U format. After the conversion, the output file does not comply with PDF/A when I view it.
Also, I cannot select any words in the output file after the conversion. Does the conversion make the PDF unsearchable?
Thanks.

Code used to convert the PDF that Tesseract-OCR generated from an image file:
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(fullFilePathName);
PdfFormatConversionOptions opts = new PdfFormatConversionOptions("output.log", PdfFormat.PDF_A_2U, ConvertErrorAction.Delete);
pdfDoc.convert(opts);
PdfFormatConversionOptions options = new PdfFormatConversionOptions( PdfFormat.v_1_7 );
pdfDoc.validate(options);
pdfDoc.save(fullFilePathName);
pdfDoc.close();

I have attached the log file from the conversion and the output file.
PDF A conversion.zip (31.0 KB)


#2

@dustin00

Thank you for contacting support.

Would you please also share the source PDF document generated by Tesseract, so that we can try to replicate the problem while converting it to PDF/A and assist you accordingly?


#3

Hi, please find attached the PDF that Tesseract-OCR produced from a GIF file. To add on, I tested using Aspose.PDF 19.3. Previously, when I tested using Aspose.PDF 17.9, Aspose was able to convert the PDF to PDF/A-2U. Thanks.
Gif sample file.pdf (10.1 KB)


#4

@dustin00

Thank you for sharing the requested data.

We have been able to reproduce the issue in our environment. A ticket with ID PDFJAVA-38699 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.


#5

Hi @Farhan.Raza, thanks for the response. Could you advise whether the PDF/A-2U output can remain a searchable PDF after the conversion? Currently, no words can be found when searching the converted file. Or is this issue also covered by the ticket?
Thanks.


#6

@dustin00

Please note that the ticket will be resolved as per the specifications of the PDF/A-2U format. The output will match that of a file converted to this format with Adobe Acrobat, because the Aspose.PDF API mimics the behavior of that application.
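
For anyone following this thread who wants a quick sanity check of a converted file, below is a minimal sketch that uses only the Java standard library, not the Aspose API. It scans the file bytes for the `pdfaid:part` and `pdfaid:conformance` entries that a PDF/A file carries in its XMP metadata, and reports which level the file claims. The class name `PdfAClaimCheck` and the method `declaredLevel` are made up for illustration. This only shows what the file declares; it does not validate compliance (a dedicated validator such as veraPDF is needed for that), and it says nothing about whether the text layer is searchable.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF or any other library.
class PdfAClaimCheck {
    // XMP may store the values as elements (<pdfaid:part>2</pdfaid:part>)
    // or as attributes (pdfaid:part="2"); both forms are matched here.
    private static final Pattern PART =
            Pattern.compile("pdfaid:part\\s*(?:=\\s*[\"']|>)\\s*(\\d)");
    private static final Pattern CONFORMANCE =
            Pattern.compile("pdfaid:conformance\\s*(?:=\\s*[\"']|>)\\s*([A-Za-z])");

    // Returns the declared level such as "2U", or null if no claim is found.
    static String declaredLevel(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to one char, so binary content is preserved.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Matcher part = PART.matcher(text);
        Matcher conformance = CONFORMANCE.matcher(text);
        if (part.find() && conformance.find()) {
            return part.group(1) + conformance.group(1).toUpperCase(Locale.ROOT);
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PdfAClaimCheck <file.pdf>");
            return;
        }
        String level = declaredLevel(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(level == null
                ? "no PDF/A claim found in XMP metadata"
                : "file declares PDF/A-" + level);
    }
}
```

Running it against the converted output should report a declared level of `2U` if the conversion at least wrote the PDF/A identification; a missing claim would be consistent with the viewer not recognizing the file as PDF/A.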