Text encoding in PDF/A

010101 · August 16, 2023, 6:58am

I got a question from an archivist regarding which encoding is used when converting a Word file to PDF/A.
As I understand it, there are several alternatives (PDFDocEncoding, UTF-16BE and UTF-8).

The archivist insisted that UTF-8 should be used to ensure that the text can be retrieved in the future.

Is UTF-8 used, or is there any way I can make the conversion use it?

This is how the conversion is done at the moment:

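// bais: InputStream holding the Word file; baos: OutputStream receiving the PDF
Document doc = new Document(bais);
PdfSaveOptions pso = new PdfSaveOptions();
// Request PDF/A-1a compliance for the output
pso.setCompliance(PdfCompliance.PDF_A_1_A);
doc.save(baos, pso);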

Konstantin.Kornilov · August 16, 2023, 8:39am

@010101 Aspose.Words uses WinAnsiEncoding for Latin text and Identity-H encoding for other Unicode text. The Unicode mapping is specified in a ToUnicode CMap, which in turn uses UTF-16BE (as required by the specification). In specific cases where there is no direct CID->Unicode mapping, Aspose.Words uses ActualText in a marked content sequence, which also uses UTF-16BE. This way of encoding text is allowed by the PDF/A-1 specification.
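
To illustrate why UTF-16BE in the ToUnicode mapping does not put text retrieval at risk, here is a minimal sketch using only the Java standard library. It is not Aspose.Words code and it does not read a PDF; the class and method names are hypothetical. It shows the hex form that UTF-16BE values take and that text recovered from them can be re-encoded as UTF-8 without loss, since both are encodings of the same Unicode code points.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class Utf16BeRoundTrip {

    // Encode text as UTF-16BE and render the bytes as uppercase hex.
    public static String toUtf16BeHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    // Decode a hex string of UTF-16BE bytes back into a Java String.
    public static String fromUtf16BeHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Euro sign (U+20AC) followed by an emoji outside the BMP (U+1F600),
        // written with escapes so the source file encoding does not matter.
        String sample = "\u20AC\uD83D\uDE00";

        String hex = toUtf16BeHex(sample);
        System.out.println("UTF-16BE hex: " + hex);

        // Recover the text from the UTF-16BE bytes and re-encode it as UTF-8.
        String recovered = fromUtf16BeHex(hex);
        byte[] utf8 = recovered.getBytes(StandardCharsets.UTF_8);
        String viaUtf8 = new String(utf8, StandardCharsets.UTF_8);

        System.out.println("Round trip through UTF-8 is lossless: " + sample.equals(viaUtf8));
    }
}
```

A tool that extracts text from the PDF is free to write it out as UTF-8, so the encoding used inside the file does not limit the encoding of the retrieved text.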

010101 · August 16, 2023, 2:06pm

Thanks for the answer.

Would it be different if PDF/A-3 or PDF/A-4 were used?

Konstantin.Kornilov · August 16, 2023, 2:22pm

@010101 No, text encoding in Aspose.Words PDF output does not depend on the selected compliance level. Also, the requirements for text encoding have not changed in PDF/A-3 and PDF/A-4 compared to PDF/A-1, which implies that the PDF community finds the requirements suitable for long-term archiving.