Convert PDF to PDF/A-2U after OCR using Aspose.PDF for Java - output is not compliant

dustin00 · July 8, 2019, 3:15am

Hi, I am having an issue converting the PDF output from Tesseract-OCR to the PDF/A-2U format. After the conversion, the output file is not reported as PDF/A compliant when I open it in a viewer.
Also, I cannot select any words in the output file after the conversion. Does the conversion make the PDF unsearchable?
Thanks.

Code used to convert the PDF that Tesseract-OCR produced from the image file:
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(fullFilePathName);
// convert to PDF/A-2U, logging problems to output.log and deleting objects that cannot be converted
PdfFormatConversionOptions opts = new PdfFormatConversionOptions("output.log", PdfFormat.PDF_A_2U, ConvertErrorAction.Delete);
pdfDoc.convert(opts);
// validate the document against the PDF 1.7 format
PdfFormatConversionOptions options = new PdfFormatConversionOptions(PdfFormat.v_1_7);
pdfDoc.validate(options);
// overwrite the source file with the result
pdfDoc.save(fullFilePathName);
pdfDoc.close();
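
One quick way to see what a viewer is reacting to is to check whether the saved file declares a PDF/A level at all. The following is only a minimal sketch using the Java standard library, not Aspose.PDF: it assumes the XMP metadata packet is stored uncompressed in the file and looks for the `pdfaid:part` and `pdfaid:conformance` entries of the PDF/A identification schema. The class and method names (`PdfAClaimCheck`, `claimedPdfALevel`) are hypothetical. It reports only what the file claims, so it cannot replace a real validator.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfAClaimCheck {
    // XMP can serialize a property as an element (<pdfaid:part>2</pdfaid:part>)
    // or as an attribute (pdfaid:part="2"); both forms are accepted here.
    private static final Pattern PART =
            Pattern.compile("pdfaid:part\\s*(?:>|=\\s*[\"'])\\s*(\\d)");
    private static final Pattern CONFORMANCE =
            Pattern.compile("pdfaid:conformance\\s*(?:>|=\\s*[\"'])\\s*([A-Za-z])");

    /**
     * Returns the declared level, e.g. "2U" for PDF/A-2u,
     * or null when no PDF/A claim is found in the given text.
     */
    static String claimedPdfALevel(String pdfText) {
        Matcher part = PART.matcher(pdfText);
        Matcher conformance = CONFORMANCE.matcher(pdfText);
        if (!part.find() || !conformance.find()) {
            return null;
        }
        return part.group(1) + conformance.group(1).toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java PdfAClaimCheck <file.pdf>");
            return;
        }
        // ISO-8859-1 maps every byte to one char, so binary content is not mangled.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String level = claimedPdfALevel(new String(bytes, StandardCharsets.ISO_8859_1));
        System.out.println(level == null
                ? "No PDF/A claim found in the metadata"
                : "File claims PDF/A-" + level);
    }
}
```

If this reports no claim for the converted file, the conversion did not write the PDF/A identification metadata. If it reports 2U, the file at least declares the level, and any remaining compliance problems have to be found with a validator.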

I have attached the conversion log file and the output file.
PDF A conversion.zip (31.0 KB)

Farhan.Raza · July 8, 2019, 11:51am

@dustin00

Thank you for contacting support.

Would you please also share the source PDF document generated by Tesseract, so that we can try to replicate the problem when converting it to PDF/A and assist you accordingly?

dustin00 · July 9, 2019, 4:36am

Hi, please find attached the PDF that Tesseract-OCR produced from a GIF file. To add on, I tested using Aspose.PDF 19.3. Previously, when I tested using Aspose.PDF 17.9, the same kind of PDF converted to PDF/A-2U successfully. Thanks.
Gif sample file.pdf (10.1 KB)

Farhan.Raza · July 9, 2019, 1:02pm

@dustin00

Thank you for sharing the requested data.

We have been able to reproduce the issue in our environment. A ticket with ID PDFJAVA-38699 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

dustin00 · July 10, 2019, 3:12am

Hi @Farhan.Raza, thanks for the response. Could you advise whether the output can remain a searchable PDF after conversion to PDF/A-2U? Currently, no words can be found when searching the converted file. Or is this issue also covered by this ticket?
Thanks.

Farhan.Raza · July 10, 2019, 1:42pm

@dustin00

Please note that the ticket will be resolved as per the specifications of the PDF/A-2U format. The result will match a file converted to this format with Adobe Acrobat, because the Aspose.PDF API mimics the behavior of that application.

dustin00 · July 24, 2019, 3:46am

Hi, thanks for the explanation.
I have encountered another, similar issue: some other PDF files also fail to convert to PDF/A-2U when using Aspose.PDF 19.3. I have attached the log file and source file. Please advise whether this is the same issue as the image-based files not converting to PDF/A-2U.
failed pdf a-2u log and source and output files.zip (265.7 KB)
Thanks.

Farhan.Raza · July 24, 2019, 2:18pm

@dustin00

Thank you for elaborating it further.

We have logged another ticket with ID PDFJAVA-38729 in our issue management system for further investigation. We will let you know as soon as any update is available in this regard.

dustin00 · August 14, 2019, 10:58am

Hi, just checking whether there are any updates on the progress of these tickets. Thanks.

Farhan.Raza · August 15, 2019, 11:57am

@dustin00

Please note that the issues have been logged under the free support model and will be investigated on a first-come, first-served basis. Therefore, they may take some months to resolve. As soon as we have definite updates regarding ticket resolution, we will let you know.

Furthermore, we also offer a paid support model, where issues are resolved on an urgent basis and have priority over issues logged under the free support model. You may check our Paid Support options for your reference.

Farhan.Raza · September 7, 2019, 12:17am

@dustin00

While investigating this ticket, we noticed that the attached input documents differ from the output document: the input documents do not contain the “RESTRICTED” watermark. We suspect that the problem could be related to this watermark. Would you please tell us more about how the watermark is added, for our reference?

dustin00 · September 26, 2019, 9:37am

Apologies for the late reply. The “RESTRICTED” watermark is added after the documents are converted to PDF and before the conversion to PDF/A, as it is a step in our application's document conversion process. Are there any updates on the investigation?
Thanks.

Below is the code for adding the watermark:
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);

// configure the semi-transparent text stamp used as the watermark
com.aspose.pdf.TextStamp textStamp = new com.aspose.pdf.TextStamp(watermarkText);
textStamp.setOpacity(0.3f);
textStamp.setWidth(500);
textStamp.setHeight(100);
textStamp.setHorizontalAlignment(com.aspose.pdf.HorizontalAlignment.Center);
textStamp.setVerticalAlignment(com.aspose.pdf.VerticalAlignment.Center);
textStamp.setRotate(0);
textStamp.getTextState().setFont(com.aspose.pdf.FontRepository.findFont("Arial"));
textStamp.getTextState().setFontSize(50.0F);
textStamp.getTextState().setFontStyle(com.aspose.pdf.FontStyles.Italic);
textStamp.getTextState().setForegroundColor(com.aspose.pdf.Color.getLightGray());

// add the stamp to every page of the PDF file (page indexes start at 1)
com.aspose.pdf.PageCollection pages = pdfDocument.getPages();
for (int pageNumber = 1; pageNumber <= pages.size(); pageNumber++) {
	pages.get_Item(pageNumber).addStamp(textStamp);
}
// save back to the file the document was loaded from
pdfDocument.save();
pdfDocument.close();

Farhan.Raza · September 26, 2019, 8:32pm

@dustin00

Thank you for elaborating further.

We have recorded your feedback and will continue the investigation. We will let you know as soon as any further update is available.

dustin00 · February 20, 2020, 8:08am

Hi, I would like to enquire if there have been any updates to this issue?
Thanks.

asad.ali · February 20, 2020, 5:25pm

@dustin00

Regretfully, the issues are not yet resolved due to other high-priority issues. We will inform you as soon as we have definite news on their resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.

dustin00 · July 2, 2020, 5:53am

Hi, it has been a while since the previous update. Please advise whether there are any updates on this issue.
Thanks.

asad.ali · July 2, 2020, 2:09pm

@dustin00

We are afraid that the earlier logged tickets are not yet resolved, as they require more time to be fully investigated. We will inform you as soon as we have additional updates regarding their resolution. Please give us some time.

We are sorry for the inconvenience.

bckoh_ncs_com_sg · September 9, 2020, 10:18am

Hi @asad.ali, is there any update on fixes for the issues above? Thanks.

asad.ali · September 9, 2020, 5:56pm

@bckoh_ncs_com_sg

The earlier logged tickets are still under investigation. As soon as the investigation is complete, we will share updates with you. Please give us some time.

We apologize for the delay and inconvenience caused.