Problem Converting to PDF/A-2U after OCR


#1

Hi, I am having an issue converting PDF output from Tesseract-OCR to the PDF/A-2U format. After the conversion, the output file is reported as not complying with PDF/A when I view it.
Also, I cannot select any words in the output file after the conversion. Does the conversion make the PDF unsearchable?
Thanks.

Code used to convert the PDF that Tesseract-OCR produced from an image file:
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(fullFilePathName);
PdfFormatConversionOptions opts = new PdfFormatConversionOptions("output.log", PdfFormat.PDF_A_2U, ConvertErrorAction.Delete);
pdfDoc.convert(opts);
PdfFormatConversionOptions options = new PdfFormatConversionOptions( PdfFormat.v_1_7 );
pdfDoc.validate(options);
pdfDoc.save(fullFilePathName);
pdfDoc.close();

I have attached the conversion log file and the output file.
PDF A conversion.zip (31.0 KB)
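
For anyone reproducing this, a quick way to see which PDF/A level a file claims is to look for the `pdfaid:part` and `pdfaid:conformance` entries in its XMP metadata. The sketch below is not part of Aspose.PDF and is not a validator: it only reports the declared level, and a file can carry the claim and still fail validation. It assumes the XMP packet is stored uncompressed in the file, and the class name `PdfAClaimSniffer` is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports the PDF/A level a file declares in its XMP metadata.
final class PdfAClaimSniffer {

    // Matches both XMP serializations: the attribute form pdfaid:part="2"
    // and the element form <pdfaid:part>2</pdfaid:part>.
    private static Pattern field(String name) {
        return Pattern.compile(
            "pdfaid:" + name + "\\s*(?:=\\s*[\"']|>)\\s*([A-Za-z0-9]+)");
    }

    private static final Pattern PART = field("part");
    private static final Pattern CONFORMANCE = field("conformance");

    /** Returns a label such as "PDF/A-2U" if both entries are found, else empty. */
    static Optional<String> declaredLevel(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data decodes safely.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Matcher part = PART.matcher(text);
        Matcher conformance = CONFORMANCE.matcher(text);
        if (part.find() && conformance.find()) {
            return Optional.of(
                "PDF/A-" + part.group(1) + conformance.group(1).toUpperCase());
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfAClaimSniffer <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(declaredLevel(bytes).orElse("no PDF/A claim found"));
    }
}
```

Running it against the file before and after conversion shows whether the conversion step wrote a PDF/A-2U claim at all, which helps separate "the conversion did nothing" from "the conversion ran but the viewer still rejects the file".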


#2

@dustin00

Thank you for contacting support.

Could you please also share the source PDF document generated by Tesseract, so that we can try to replicate the problem when converting it to PDF/A and assist you accordingly?


#3

Hi, please find attached the PDF that Tesseract-OCR produced from a GIF file. To add on, I tested with Aspose.PDF 19.3. When I previously tested with Aspose.PDF 17.9, the PDF was converted to PDF/A-2U successfully. Thanks.
Gif sample file.pdf (10.1 KB)


#4

@dustin00

Thank you for sharing the requested data.

We have been able to reproduce the issue in our environment. A ticket with ID PDFJAVA-38699 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.


#5

Hi @Farhan.Raza, thanks for the response. Could you advise whether the output can remain a searchable PDF after conversion to PDF/A-2U? Currently, no words can be found by searching the converted file. Or is this issue also covered by this ticket?
Thanks.


#6

@dustin00

Please note that the ticket will be resolved as per the specifications of the PDF/A-2U format. The result will match that of a file converted to this format with Adobe Acrobat, because the Aspose.PDF API mimics the behavior of that application.


#7

Hi, thanks for the explanation.
I have encountered a similar issue: some PDF files also cannot be converted to PDF/A-2U with Aspose.PDF 19.3. I have attached the log file, source file, and output file. Please advise whether this is the same issue as the image-derived PDFs that fail to convert to PDF/A-2U.
failed pdf a-2u log and source and output files.zip (265.7 KB)
Thanks.


#8

@dustin00

Thank you for elaborating further.

We have logged another ticket with ID PDFJAVA-38729 in our issue management system for further investigation. We will let you know as soon as an update is available.


#9

Hi, just checking whether there are any updates on the progress of these tickets. Thanks.


#10

@dustin00

Please note that the issue has been logged under the free support model and will be investigated on a first-come, first-served basis. Therefore, it may take some months to resolve. As soon as we have definite updates regarding the ticket resolution, we will let you know.

Furthermore, we also offer a paid support model, where issues are resolved on an urgent basis and have priority over issues logged under the free support model. You may check our Paid Support options for your reference.


#11

@dustin00

While investigating this ticket, we noticed that the attached input documents differ from the output document: the input documents do not contain the “RESTRICTED” watermark. We suspect that the problem could be related to this watermark. Could you please tell us more about the watermark, for our reference?


#12

Apologies for the late reply. The “RESTRICTED” watermark is added by our application after the documents are converted to PDF and before the conversion to PDF/A; it is a standard step in our document conversion process. Are there any updates on the investigation?
Thanks.

Below is the code for adding the watermark
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);

// Configure a semi-transparent, centered text stamp
com.aspose.pdf.TextStamp textStamp = new com.aspose.pdf.TextStamp(watermarkText);
textStamp.setOpacity(0.3f);
textStamp.setWidth(500);
textStamp.setHeight(100);
textStamp.setHorizontalAlignment(com.aspose.pdf.HorizontalAlignment.Center);
textStamp.setVerticalAlignment(com.aspose.pdf.VerticalAlignment.Center);
textStamp.setRotate(0);
textStamp.getTextState().setFont(FontRepository.findFont("Arial"));
textStamp.getTextState().setFontSize(50.0F);
textStamp.getTextState().setFontStyle(com.aspose.pdf.FontStyles.Italic);
textStamp.getTextState().setForegroundColor(com.aspose.pdf.Color.getLightGray());

// Add the stamp to every page of the PDF file (page indexes start at 1)
com.aspose.pdf.PageCollection pages = pdfDocument.getPages();
if (null != pages) {
	for (int pageCounter = 1; pageCounter <= pages.size(); pageCounter++) {
		com.aspose.pdf.Page page = pages.get_Item(pageCounter);
		if (null != page) {
			page.addStamp(textStamp);
		}
	}
}
pdfDocument.save();
pdfDocument.close();

#13

@dustin00

Thank you for elaborating further.

We have recorded your feedback and will continue the investigation. We will let you know as soon as a further update is available.