Convert PDF to PDF/A-2A not valid

sergei.shibanov · October 14, 2024, 5:41pm

@hasanirmak
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58359,PDFNET-58361,PDFNET-58360

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

hasanirmak · October 15, 2024, 1:39pm

Hi Sergei
I have checked it with the variant you suggested and get the following results:

Aspose PDF version 24.7: I get 11 out of 107 documents as PDF_A_2A valid.

With Aspose PDF version 24.10: I get 11 out of 107 documents as PDF_A_2A valid.

I can see that there are various errors in the xml report, but with ConvertErrorAction.Delete it doesn’t seem to have been able to fix or delete them, and there are still very few PDF_A_2A valid documents to be seen.

Do you have any other suggestions to get better results?

Best regards

sergei.shibanov · October 15, 2024, 2:50pm

@hasanirmak

A possible reason may be the lack of fonts. In short: the original document may not describe any of the fonts used, but the PDF/A document must have a font description.
You can see that this is happening in the reports. This situation (lack of font description in the document) corresponds to lines like:

<Problem Severity="Error" Clause="6.3.4" ObjectID="1151" Page="32" Convertable="False">Font 'ArialMT' is not embedded</Problem>

If Convertable="True", the font description was found; if Convertable="False", it was not.
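
To see quickly whether unfixable font errors dominate a report, lines like the one quoted above can be filtered programmatically. The following is a minimal sketch and not part of Aspose.PDF: the class and method names are hypothetical, and it assumes the report is well-formed XML whose Problem elements carry a Convertable attribute and a message starting with "Font", as in the sample line. Nothing is assumed about the root element name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper (not an Aspose API): scans a conversion report for
// font problems that the converter marked as not convertable.
public static class ConversionReport
{
    public static List<string> UnfixableFontProblems(string reportXml)
    {
        XDocument report = XDocument.Parse(reportXml);

        return report
            .Descendants()
            // Match on the local name so an XML namespace, if any, is ignored.
            .Where(e => e.Name.LocalName == "Problem")
            // Keep only problems flagged Convertable="False".
            .Where(p => string.Equals((string)p.Attribute("Convertable"), "False",
                                      StringComparison.OrdinalIgnoreCase))
            .Select(p => p.Value.Trim())
            // Assumption: font problems start with "Font", as in the sample line.
            .Where(message => message.StartsWith("Font ", StringComparison.Ordinal))
            .Distinct()
            .OrderBy(message => message, StringComparer.Ordinal)
            .ToList();
    }
}
```

If the returned list is non-empty, the fonts it names are the ones that need a substitution or an external font source before a valid PDF/A file can be produced.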

You can see ways to solve this problem in one of my answers in this forum thread. PDF conversion loosing data - #8
(where I provide the development team’s answer)

I will wait for your answer.

hasanirmak · October 15, 2024, 2:56pm

Hi Sergei

When I try to open the suggested link, I get the following: Oops! That page is private.


sergei.shibanov · October 15, 2024, 3:03pm

@hasanirmak
Sorry, I forgot that this is a private topic.
I will also provide the response from the development team here.

The source document has two problems that prevent its conversion to PDF/A-2b. First, it has XMP metadata associated with its pages that contains some non-standard properties. This problem will be addressed in version 24.4: the page metadata will be updated to contain only standard entries.

Second, the document doesn’t contain a definition for the font “MinionPro-Regular”, therefore it isn’t possible to make a valid PDF/A document unless the missing font or its substitution is provided. To create a valid PDF/A-2b document beginning with the 24.4 version the customer may use one of the following techniques:

Use a default substitution font instead (all text that uses MinionPro-Regular will be changed to Times New Roman):

using System;
using System.IO;
using Aspose.Pdf;

var tempDoc = new Aspose.Pdf.Document(dataDir + "Produksjonsformat.pdf");
tempDoc.Form.Type = Aspose.Pdf.Forms.FormType.Standard;
var options = new PdfFormatConversionOptions(PdfFormat.PDF_A_2B);

// Replace the inaccessible MinionPro-Regular with the default substitution font (Times New Roman)
options.FontEmbeddingOptions.UseDefaultSubstitution = true;

tempDoc.Convert(options);
if (!tempDoc.Validate(new MemoryStream(), PdfFormat.PDF_A_2B))
    Console.WriteLine("not validate");

tempDoc.Save(dataDir + "Produksjonsformat-converted.pdf");

Assign a user-selected accessible font (one installed on the customer’s system) to replace the missing font (see Convert PDF to PDF/A formats|Aspose.PDF for .NET for details):

using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Replace the inaccessible MinionPro-Regular with the user-chosen font
FontRepository.Substitutions.Add(new SimpleFontSubstitution("MinionPro-Regular", "Arial"));

var tempDoc = new Aspose.Pdf.Document(dataDir + "Produksjonsformat.pdf");
tempDoc.Form.Type = Aspose.Pdf.Forms.FormType.Standard;
var options = new PdfFormatConversionOptions(PdfFormat.PDF_A_2B);

tempDoc.Convert(options);
if (!tempDoc.Validate(new MemoryStream(), PdfFormat.PDF_A_2B))
    Console.WriteLine("not validate");

tempDoc.Save(dataDir + "Produksjonsformat-converted.pdf");

Provide the external font definition for the MinionPro-Regular font if it’s not installed in the system:

using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Add the folder containing the MinionPro-Regular font definition to the list of font sources
FontRepository.Sources.Add(new FolderFontSource("path_to_the_folder_with_the_font"));

var tempDoc = new Aspose.Pdf.Document(dataDir + "Produksjonsformat.pdf");
tempDoc.Form.Type = Aspose.Pdf.Forms.FormType.Standard;
var options = new PdfFormatConversionOptions(PdfFormat.PDF_A_2B);

tempDoc.Convert(options);
if (!tempDoc.Validate(new MemoryStream(), PdfFormat.PDF_A_2B))
    Console.WriteLine("not validate");

tempDoc.Save(dataDir + "Produksjonsformat-converted.pdf");

hasanirmak · October 15, 2024, 3:04pm

Hi again

I also searched the forum for “PDF conversion loosing data” and found one relevant thread, which manipulates the document in order to make it PDF/A valid.

In our case the document can't be manipulated; it should be converted as it is.

Best regards

sergei.shibanov · October 15, 2024, 3:25pm

@hasanirmak
First of all, you need to check whether the conversion errors are related to fonts, as I described in the previous post.

hasanirmak · October 25, 2024, 3:01pm

Hi Sergei

I applied the font suggestions using the following code and was able to validate a large number of the documents as PDF/A-2A.



FontRepository.Sources.Add(new FolderFontSource(ConfigHelper.FontPath));
foreach (var font in ConfigHelper.ListofFonts)
{
    // For now, font.Value is Arial for all of the fonts
    FontRepository.Substitutions.Add(new SimpleFontSubstitution(font.Key, font.Value));
}

var temp_stream = new MemoryStream();
var result = new ConvertedResult();
try
{
    var documentStream = org_doc.InMemoryStream;
    temp_doc = new Document(documentStream);
    temp_doc.Form.Type = AsposeForm.FormType.Standard;

    var options = new PdfFormatConversionOptions(settings.PdfSaveSettings.PdfFormat);
   
    options.FontEmbeddingOptions.UseDefaultSubstitution = true;
    options.ErrorAction = ConvertErrorAction.Delete;
    options.LogFileName = $"{org_doc.OutPutPath}\\{org_doc.Name}_report.xml";
    bool isValidated = temp_doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);
    int count=0;
    while (isValidated == false && count < 2)
    {
        temp_doc.Convert(options);
        //temp_doc.Convert($"{org_doc.OutPutPath}\\{org_doc.Name}_report.xml", settings.PdfSaveSettings.PdfFormat, ConvertErrorAction.Delete);
        isValidated = temp_doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);
        count++;
    }
    using (var resultPdfStream = new MemoryStream())
    {
        temp_doc.Save(resultPdfStream);
        var valid_Message = "";
        isValidated = temp_doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);
        if (isValidated)
        {
            valid_Message = $" It is {settings.PdfSaveSettings.PdfFormat} valid after {count} iterations.";
            RunConvert.pdfAValid++;
        }
        else
        {
            valid_Message = $" It is not valid after {count} iterations.";
        }
        result.ConvertedDocument = resultPdfStream.ToArray();
        result.Result = Result.Successful;
        result.Message = $"PDF Conversion was successful. {valid_Message}";
    }

For some of them, however, it is not valid the first time, and I have to pass it through the converter twice until it is finally valid.

I converted 107 documents. Three of them are not converted at all and generate errors (see the error file); some others are converted but, despite the font adjustment, are not PDF/A-2A valid (see the report file).

Based on your example, why do I not get a valid PDF/A-2A document on the first pass, but only after the second conversion? (This affected 31 of 92 documents.)

How can I bring the remaining invalid documents into a valid state, including those where there is a permission problem?

Best regards
Error.zip (577.3 KB)

report.zip (11.0 KB)

sergei.shibanov · October 25, 2024, 3:09pm

@hasanirmak
I have already finished work for this week and unfortunately will not be able to answer today. I will study the issue on Monday and write to you.

sergei.shibanov · October 28, 2024, 4:58pm

@hasanirmak
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58481

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

sergei.shibanov · October 28, 2024, 5:08pm

@hasanirmak
I checked the conversion of the three documents in the Errors archive.
I tested on Windows, in a .NET 6 project, with Aspose.Pdf version 24.10, using the following code:

var doc = new Document(dataDir + "02. OpenText Archive Server 10.1.1 Storage Platforms Release Notes.pdf");
doc.Convert(dataDir + "report.xml", PdfFormat.PDF_A_2A, ConvertErrorAction.Delete);
bool isValidated = doc.Validate(new MemoryStream(), PdfFormat.PDF_A_2A);
Console.WriteLine("isValidated : " + isValidated);
doc.Save(dataDir + "02.pdf");

For two files, the result passes Adobe Preflight validation, which we use as our reference. For the document that does not pass the check after conversion (“02. OpenText Archive Server 10.1.1 Storage Platforms Release Notes.pdf”), I created task PDFNET-58481 for the development team. I used only one pass for the conversion.

Please check whether the given code works properly in your environment (for two documents that I converted).

hasanirmak · October 31, 2024, 8:26am

Hi Sergei

We are using .NET Framework 4.8 with Aspose.Pdf versions 24.7 and 24.10. Can you please check this setup as well and see whether the problem is reproducible?
We can’t switch the whole project to Core in a hurry.

Best regards

sergei.shibanov · October 31, 2024, 11:10am

@hasanirmak
Since you mention .NET Framework 4.8: please note that the latest versions of the library support only .NET Framework 4.8.1, so I recommend downloading the corresponding library from here: Download .NET Component DLL to Process PDF | Aspose.PDF API
For old Net Framework.png (46.2 KB)

However, in this case, when I created an empty .NET Framework 4.8 project and added the library via NuGet, conversion of the two documents mentioned was successful.
I will attach them to make sure that we are talking about the same documents.
01. OpenText Archive Server 10.1.1 Release Notes.pdf (194.2 KB)

01.OpenText_Runtime_and_Core_Services&_Directory_Services_10.2.1_Release_Notes.pdf (240.1 KB)

hasanirmak · January 30, 2025, 9:25am

Hi Sergei

I have now updated the solution to 25.1 using the recommended Aspose.PDF for .NET Framework 4.0 25.1 (DLLs only), but I still get errors with some PDF files, including the one mentioned in your answer.

The following error occurs during conversion and results in an exception:
{“Unable to cast object of type ‘#=zjtuHkLJyRmgrxygqkbXXLDRbFimV’ to type ‘#=zlb5r82XT9HdpaZOfEWz57niCk3MG’.”}|System.InvalidCastException|

Here are some examples with the same Error:
01.OpenText_Runtime_and_Core_Services&_Directory_Services_10.2.1_Release_Notes.pdf (240.1 KB)

01. OpenText Archive Server 10.1.1 Release Notes.pdf (194.2 KB)

02. OpenText Archive Server 10.1.1 Storage Platforms Release Notes.pdf (209.2 KB)

Here is the code I use to convert and save the document:

public static ConvertedResult ConvertDirectPathToPDFA(BaseDocument org_doc, AsposeSettings settings)
 {
     var result = new ConvertedResult();
     try
     {
         var doc = new Document(org_doc.FullPath);
         doc.Convert($"{org_doc.OutPutPath}\\{org_doc.Name}_report.xml", settings.PdfSaveSettings.PdfFormat, settings.PdfASaveSettings.ErrorAction);
         bool isValidated = doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);
         var temp_doc = org_doc.OutPutPath + "\\" + org_doc.Name + ".pdf" ;
         
         var valid_Message = "";
         if (isValidated)
         {
             valid_Message = $" It is {settings.PdfSaveSettings.PdfFormat} valid.";
             RunConvert.pdfAValid++;
         }
         else
         {
             valid_Message = $" It is not valid.";
         }

         doc.Save(temp_doc);
         result.ConvertedDocument = GetMemoryStream(temp_doc).ToArray();
         result.Result = Result.Successful;
         result.Message = $"PDF Conversion was successful. {valid_Message}";
         
     }
     catch (Exception ex)
     {
         result.Result = Result.Error;
         result.Message = "ConvertToPDF was not successful";
         Logger.Error(ex);
     }
     return result;
 }

I get the following error message while converting:
System.InvalidCastException: ‘Unable to cast object of type ‘#=zjtuHkLJyRmgrxygqkbXXLDRbFimV’ to type ‘#=zlb5r82XT9HdpaZOfEWz57niCk3MG’.’

if I validate before converting:

 bool isValidated = temp_doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);
 temp_doc.Convert(options);

I get the following error message:
System.InvalidCastException: ‘Unable to cast object of type ‘#=zjtuHkLJyRmgrxygqkbXXLDRbFimV’ to type ‘#=zlb5r82XT9HdpaZOfEWz57niCk3MG’.’

if I validate after converting:

 temp_doc.Convert(options);
 bool isValidated = temp_doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);

Important: We have also realised that it works if just AsposePdf.dll is used in the project, but not in combination with other Aspose products.

Best regards

hasanirmak · January 30, 2025, 10:09am

I have also noticed that the validation result differs depending on whether I validate after the conversion or after the save.

Case 1: validating after converting

var doc = new Document(org_doc.FullPath);
doc.Convert($"{org_doc.OutPutPath}\\{org_doc.Name}_report.xml", settings.PdfSaveSettings.PdfFormat, settings.PdfASaveSettings.ErrorAction);
bool isValidated = doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);

var temp_doc = org_doc.OutPutPath + "\\" + org_doc.Name + ".pdf";

var valid_Message = "";
if (isValidated)
{
    valid_Message = $" It is {settings.PdfSaveSettings.PdfFormat} valid.";
    RunConvert.pdfAValid++;
}
else
{
    valid_Message = $" It is not valid.";
}

doc.Save(temp_doc);
result.ConvertedDocument = GetMemoryStream(temp_doc).ToArray();
result.Result = Result.Successful;
result.Message = $"PDF Conversion was successful. {valid_Message}";

Case 2: validating after doc.Save

var doc = new Document(org_doc.FullPath);
doc.Convert($"{org_doc.OutPutPath}\\{org_doc.Name}_report.xml", settings.PdfSaveSettings.PdfFormat, settings.PdfASaveSettings.ErrorAction);

var temp_doc = org_doc.OutPutPath + "\\" + org_doc.Name + ".pdf";

doc.Save(temp_doc);
bool isValidated = doc.Validate(new MemoryStream(), settings.PdfSaveSettings.PdfFormat);

var valid_Message = "";
if (isValidated)
{
    valid_Message = $" It is {settings.PdfSaveSettings.PdfFormat} valid.";
    RunConvert.pdfAValid++;
}
else
{
    valid_Message = $" It is not valid.";
}
result.ConvertedDocument = GetMemoryStream(temp_doc).ToArray();
result.Result = Result.Successful;
result.Message = $"PDF Conversion was successful. {valid_Message}";

The conversion results for cases 1 and 2, with the same two documents, look like this:

case 1:
Sucessful=> File Name: 00-Submitting Indexer sizing requests - iManage Support.pdf -Message: PDF Conversion was successful. It is PDF_A_2A valid. -Time: 13 seconds.

Sucessful=> File Name: 00-Work 10 Indexer Powered by RAVN - Study Guide.pdf -Message: PDF Conversion was successful. It is not valid. -Time: 1 seconds.

case 2:

Sucessful=> File Name: 00-Submitting Indexer sizing requests - iManage Support.pdf -Message: PDF Conversion was successful. It is not valid. -Time: 14 seconds.

Sucessful=> File Name: 00-Work 10 Indexer Powered by RAVN - Study Guide.pdf -Message: PDF Conversion was successful. It is PDF_A_2A valid. -Time: 2 seconds.

Why do I get different results, and which makes more sense? Should I validate only after converting?

Best regards

sergei.shibanov · January 30, 2025, 3:06pm

@hasanirmak
We are looking into your question and will write to you shortly.

sergei.shibanov · January 31, 2025, 5:04am

@hasanirmak

In many cases, the PDF document is only fully formed when the Save() method is called, so the validation result obtained after calling Save() can be considered more accurate.
For performance, you can save to memory:

doc.Save(new MemoryStream());

Unfortunately, there is nothing new regarding the tasks already created.

Regarding your post about throwing an exception during conversion - please create a request in a separate topic so as not to overload the current one.

hasanirmak · January 31, 2025, 7:41am

Thanks, Sergei, for the validation suggestion.

Regarding the other issue, I will open another ticket.

Best regards

aspose.notifier · March 24, 2025, 6:41pm

The issues you have found earlier (filed as PDFNET-58359,PDFNET-58361) have been fixed in Aspose.PDF for .NET 25.3.

sergei.shibanov · April 4, 2025, 10:54am

@hasanirmak
Task PDFNET-58360 is closed.

Document.Validate() should be used after calling Document.Save(). This is not obvious and should be mentioned in the documentation, and possibly taken into account in further changes to the library.

The code

var doc = new Document(dataDir + "11-SecurityBoost.pdf");
doc.Convert(dataDir + "report.xml", PdfFormat.PDF_A_2A, ConvertErrorAction.Delete);
bool isValidated = doc.Validate(new MemoryStream(), PdfFormat.PDF_A_2A);
Console.WriteLine("isValidated : " + isValidated);
doc.Save(dataDir + "11_out.pdf");
bool isValidatedAfterSave = doc.Validate(new MemoryStream(), PdfFormat.PDF_A_2A);
Console.WriteLine("isValidated after Save: " + isValidatedAfterSave);

outputs

isValidated : False
isValidated after Save: True
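
The ordering recommended in this thread (convert, save, then validate) can be wrapped in a small helper. This is only a sketch under stated assumptions: PdfAConverter and ConvertAndValidate are hypothetical names, the Aspose.PDF calls are the ones already used in the posts above, and saving to a MemoryStream follows the earlier suggestion to save to memory for performance.

```csharp
using System.IO;
using Aspose.Pdf;

// Hypothetical helper (not an Aspose API): converts a PDF to the requested
// PDF/A format and validates only after the document has been saved.
public static class PdfAConverter
{
    public static bool ConvertAndValidate(
        string inputPath, string reportPath, PdfFormat format, out byte[] converted)
    {
        using (var doc = new Document(inputPath))
        using (var output = new MemoryStream())
        {
            // Write the conversion log to reportPath and drop elements that cannot be fixed.
            doc.Convert(reportPath, format, ConvertErrorAction.Delete);

            // Save first: per the thread, the document is often fully formed only here.
            doc.Save(output);

            // Validate after Save(); the validation log itself is discarded.
            bool isValid = doc.Validate(new MemoryStream(), format);

            converted = output.ToArray();
            return isValid;
        }
    }
}
```

A caller would pass PdfFormat.PDF_A_2A as the format, write the returned bytes wherever the converted file is needed, and treat a false result as a signal to inspect the XML report.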