I got a question from an archivist regarding which encoding is used when converting a Word file to PDF/A.
As I understand it, there are several alternatives (PDFDocEncoding, UTF-16BE and UTF-8).
The archivist insisted that UTF-8 should be used to ensure that the text can be retrieved in the future.
Is UTF-8 used, or is there any way I can make the conversion use it?
This is how the conversion is done at the moment:
Document doc = new Document(bais);
PdfSaveOptions pso = new PdfSaveOptions();
pso.setCompliance(PdfCompliance.PDF_A_1_A);
doc.save(baos, pso);
@010101 Aspose.Words uses WinAnsiEncoding for Latin text and Identity-H encoding for other Unicode text. The Unicode mapping is specified in a ToUnicode CMap, which in turn uses UTF-16BE (as required by the specification). In specific cases where there is no direct CID-to-Unicode mapping, Aspose.Words also uses ActualText in a marked-content sequence, which likewise uses UTF-16BE. This way of encoding text is allowed by the PDF/A-1 specification.
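To make the UTF-16BE point concrete, here is a minimal sketch in plain Java (no Aspose.Words calls) that prints the UTF-16BE bytes of a string as hex, which is the form Unicode values take inside a ToUnicode CMap. The class and method names (`ToUnicodeHex`, `toUtf16BeHex`) are hypothetical and exist only for this illustration; this is not how Aspose.Words builds its CMaps internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Aspose.Words.
final class ToUnicodeHex {

    // Returns the UTF-16BE bytes of the text as upper-case hex digits,
    // two digits per byte and without a byte order mark.
    static String toUtf16BeHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is one UTF-16 code unit.
        System.out.println(toUtf16BeHex("\u00E9"));       // 00E9
        // U+1F600 lies outside the BMP and needs a surrogate pair.
        System.out.println(toUtf16BeHex("\uD83D\uDE00")); // D83DDE00
    }
}
```

Both UTF-8 and UTF-16BE are encodings of the same Unicode code points, so either one lets a reader recover the original characters. The choice affects only the byte representation stored in the file.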
@010101 No, the text encoding in Aspose.Words PDF output does not depend on the selected compliance level. Also, the text encoding requirements have not changed in PDF/A-3 and PDF/A-4 compared to PDF/A-1, which implies that the PDF community finds them suitable for long-term archiving.