Issue when testing out PDF to PDF/A-2U

I am testing the conversion of PDF to PDF/A-2U.
When I use the convert method of Aspose.Pdf for Java (16.12) to convert a PDF to the PDF/A-2U format, the words in the PDF/A-2U file are spaced out and overlap each other. I have attached screenshots to show the issue.

I have tried this with PDFs produced by Aspose.Words for Java (16.4), using the save method on a docx file, and by Aspose.Email for Java (4.7), using the msg to mhtml to pdf route.

However, when converting a PDF produced from a doc file, or other PDF files, there is no such issue.
Is converting from PDF the only way to get to the PDF/A-2U format, or can docx or msg files be converted to PDF/A-2U directly?
How can I fix this issue?
Please advise.
Thanks,
Dustin.

PDF converted from Docx file
image.png (11.7 KB)
After converting to PDF/A-2U format
image.png (22.0 KB)

@dustin00,
Kindly send us a sample of your source PDF document. We will investigate and share our findings with you. Your response is awaited.

Aspose.Words API cannot convert a Word document to PDF/A-2U. We are confirming whether an MSG file can be converted to PDF/A-2U and will let you know soon.

Best Regards,
Imran Rafique

@dustin00,
There is no direct way to convert an MSG file into PDF format. You can convert an MSG file to MHT with Aspose.Email API, and then convert the MHT document to PDF/A-2U with Aspose.Pdf API.

Best Regards,
Imran Rafique

Hi Imran,

Thanks for the response.
words docx.pdf (42.5 KB)
words docs pdf a.pdf (106.7 KB)

msg .pdf (71.4 KB)
msg pdf a.pdf (136.1 KB)

Do you require the original files?
Can you provide a code sample showing how to convert an MHT document to PDF/A-2U directly?
And can we convert Word documents to PDF/A-2U directly too?

Thanks.

@dustin00,
We have tested your source PDF documents with the latest version 17.7 of Aspose.Pdf for Java API and the output PDF/A-2U documents are fine.

This is the output PDF: wordsdocxA2U_Out17.7.pdf (45.3 KB)

This is the output PDF: msgA2U_Out17.7.pdf (74.1 KB)

We do not require your original files.

[Java]

import com.aspose.pdf.*;

// load the MHT document
MhtLoadOptions mhtOpts = new MhtLoadOptions();
Document document = new Document("C:\\temp\\Input.mht", mhtOpts);
// convert MHT to PDF/A-2U; conversion problems are written to the log file
PdfFormatConversionOptions opts = new PdfFormatConversionOptions("C:\\temp\\outLogmht.txt", PdfFormat.PDF_A_2U, ConvertErrorAction.Delete);
document.convert(opts);
document.save("C:\\temp\\Output.pdf");

Aspose.Words API can import Word documents but cannot save them in the PDF/A-2U format.

Best Regards,
Imran Rafique

Thank you. Will test with the 17.7 version then.

Hi, just a general enquiry: is there any roadmap to implement direct conversion/save from office documents or HTML to the PDF/A-2U format?
Thanks.

@dustin00,
You can convert an HTML document to PDF/A-2U by calling the following code. Kindly list the office file formats you need, and we will assist you accordingly.

[Java]

// load HTML document
HtmlLoadOptions optsLoad = new HtmlLoadOptions("base path here");
Document document = new Document("html file path here", optsLoad);
// convert HTML to PDF/A-2U
PdfFormatConversionOptions opts = new PdfFormatConversionOptions("c:\\temp\\outLog.txt", PdfFormat.PDF_A_2U, ConvertErrorAction.Delete);
document.convert(opts);
document.save("c:\\temp\\outFile.pdf"); 

Best Regards,
Imran Rafique

Hi, may I inquire whether other formats (xls and xlsx for Excel documents, ppt and pptx for PowerPoint documents) can be converted to PDF/A-2U format directly, or is it the same as for Word documents, where we have to convert to PDF first and then to PDF/A-2U?

Also, I have tested using version 17.7 of Aspose.Pdf and the issue mentioned earlier does not occur.

Thanks.

@dustin00,
There is no direct way to convert Excel and PowerPoint documents to PDF/A-2U. You can convert Excel and PowerPoint documents to PDF with the Aspose.Cells and Aspose.Slides APIs, and then use Aspose.Pdf API to convert the PDF to PDF/A-2U.
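
To summarise the routes mentioned so far in this thread, here is a minimal sketch in plain Java. It calls no Aspose API at all; the class `PdfA2uRoute`, the `Route` enum and `routeFor` are hypothetical names used only for illustration. It simply looks up, by file extension, which intermediate step a file needs before Aspose.Pdf can convert it to PDF/A-2U.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any Aspose API: it only encodes the
// conversion routes described in this thread.
class PdfA2uRoute {

    enum Route {
        DIRECT,        // load with Aspose.Pdf, then convert to PDF/A-2U
        VIA_WORDS,     // Aspose.Words -> PDF, then Aspose.Pdf -> PDF/A-2U
        VIA_CELLS,     // Aspose.Cells -> PDF, then Aspose.Pdf -> PDF/A-2U
        VIA_SLIDES,    // Aspose.Slides -> PDF, then Aspose.Pdf -> PDF/A-2U
        VIA_EMAIL_MHT  // Aspose.Email -> MHT, then Aspose.Pdf -> PDF/A-2U
    }

    private static final Map<String, Route> ROUTES = new HashMap<>();
    static {
        for (String ext : new String[] {"pdf", "mht", "html"}) ROUTES.put(ext, Route.DIRECT);
        for (String ext : new String[] {"doc", "docx"}) ROUTES.put(ext, Route.VIA_WORDS);
        for (String ext : new String[] {"xls", "xlsx"}) ROUTES.put(ext, Route.VIA_CELLS);
        for (String ext : new String[] {"ppt", "pptx"}) ROUTES.put(ext, Route.VIA_SLIDES);
        ROUTES.put("msg", Route.VIA_EMAIL_MHT);
    }

    // Returns the route for a file name, based on its extension (case-insensitive).
    static Route routeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + fileName);
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Route route = ROUTES.get(ext);
        if (route == null) {
            throw new IllegalArgumentException("No known route for: " + fileName);
        }
        return route;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.docx", "sheet.xlsx", "mail.msg", "page.html"}) {
            System.out.println(name + " -> " + routeFor(name));
        }
    }
}
```

Formats not discussed in this thread are rejected with an exception rather than guessed at.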

It is nice to hear from you that the problem has been resolved.

Hi,

Thanks for the information.

  1. Could you advise on how to convert a txt file to a PDF file? I tried using the code from this page
    https://docs.aspose.com/display/pdfjava/Converting+Text+File+to+PDF
    but the Pdf, Section and Text classes cannot be resolved to a type. I am currently using Aspose.Pdf 17.9 to test.

  2. I found that the PDF version seems to default to 1.5 or 1.4 after converting from Word/Excel/PowerPoint file formats. With the method you provided for converting HTML files, the PDF version is 1.7.
    Is there a way to set the PDF version as well as the PDF/A-2U format in the PdfFormatConversionOptions? Or to set the PDF version when converting the Word/Excel/PowerPoint files to PDF?

  3. I have tested converting images to PDF by first converting the image file (png/jpg/tiff/bmp) to a searchable PDF using Google Tesseract-OCR. Then the PDF is converted to the PDF/A-2U format. However, after converting to the PDF/A-2U format, the words in the PDF cannot be highlighted. As I am using a trial version, the watermark can be highlighted, so I am unsure whether the PDF has become unsearchable or not. This does not occur for the other file formats such as doc, docx, xls, pptx, html, msg, etc. I have attached a pair of the PDFs.
    Please advise on this issue.

From Tesseract : jpeg-pdf.pdf (456.6 KB)
After converting to PDF/A-2U : Jpeg PDF A 2U.pdf (497.4 KB)

Thanks.

@dustin00,

The com.aspose.pdf.generator package is a legacy approach. Please use the new DOM approach and try the TextFragment class to convert a text file to a PDF document. Please refer to this help topic: Convert text file to PDF format.

The Document class offers a validate member to change the PDF version. Please try the following code:

[Java]

// load HTML document
HtmlLoadOptions optsLoad = new HtmlLoadOptions("base path here");
Document document = new Document("html file path here", optsLoad);
// convert HTML to PDF/A-2U
PdfFormatConversionOptions opts = new PdfFormatConversionOptions("c:\\temp\\outLog.txt", PdfFormat.PDF_A_2U, ConvertErrorAction.Delete);
document.convert(opts);
System.out.println(document.getVersion());
document.validate("C:\\temp\\outlog.log", PdfFormat.v_1_7);
document.save("c:\\temp\\outFile.pdf"); 
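
If you want to check the version of a saved file independently of the API, the following is a minimal sketch in plain Java that reads the `%PDF-x.y` header at the start of the file. The class `PdfHeaderVersion` and its methods are hypothetical names, not part of Aspose.Pdf. Note the assumption: it reports only the header version, and a PDF may override that with a `/Version` entry in its document catalog, which this sketch does not inspect.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Pdf: reads the version declared
// in the "%PDF-x.y" header of a PDF file.
class PdfHeaderVersion {

    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    // Returns the header version (for example "1.7"), or null if no header is found.
    static String parse(byte[] leadingBytes) {
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
        String head = new String(leadingBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Reads at most the first 1024 bytes of the file and parses the header from them.
    static String read(Path pdf) throws IOException {
        byte[] buf = new byte[1024];
        int total = 0;
        try (InputStream in = Files.newInputStream(pdf)) {
            int n;
            while (total < buf.length && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        }
        return parse(java.util.Arrays.copyOf(buf, total));
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more PDF paths on the command line.
        for (String arg : args) {
            System.out.println(arg + " -> " + read(Paths.get(arg)));
        }
    }
}
```

Running it against the PDF produced by Aspose.Words, Aspose.Cells or Aspose.Slides and against the converted output lets you compare the two header versions side by side.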

We have tested your source PDF (jpeg-pdf.pdf) with the latest version 17.9 of Aspose.Pdf for Java API and managed to replicate the problem of not highlighting the target word after search. It has been logged under the ticket ID PDFJAVA-37135 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.

@dustin00

Thanks for your patience.

We are pleased to inform you that the earlier reported issue PDFJAVA-37135 has been resolved in the latest version, Aspose.Pdf for Java 17.11.

Please try using the latest release version and in case you face any issue, please feel free to contact us.