Optimized PDF to PDF/A 1b conversion problems

cmarsura · June 1, 2015, 5:05am

Hi Tilal,

I am continuing to evaluate Aspose.Pdf for .NET 10.4.0.

I have slightly modified my code to convert files to PDF/A-1b while trying to produce the smallest possible file size.

The code used is the following:

static bool ConvetToPdfA(string infile, string outfile)
{
    bool result = false;

    // Remove any stale output from a previous run
    if (File.Exists(outfile))
    {
        File.Delete(outfile);
    }

    Aspose.Pdf.Document document = new Aspose.Pdf.Document(infile);
    try
    {
        // Optimize before conversion, keeping fonts embedded
        document.Optimize();
        Document.OptimizationOptions optimizationOptions = Document.OptimizationOptions.All();
        optimizationOptions.UnembedFonts = false;
        document.OptimizeResources(optimizationOptions);

        // Convert to PDF/A-1b; the conversion log is written next to the output file
        result = document.Convert(outfile + ".log.xml", Aspose.Pdf.PdfFormat.PDF_A_1B, Aspose.Pdf.ConvertErrorAction.Delete);
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine("Caught exception " + ex.ToString());
    }

    // Save only when the conversion succeeded
    if (result)
    {
        document.Save(outfile);
    }

    Console.Error.WriteLine("Conversion " + (result ? "successful" : "failed") + "\n");
    return result;
}

The three attached files were not converted. Some lines in the conversion report show that the files are not convertible because "Font 'font name' is not embedded". Adobe Acrobat Pro DC, by contrast, converts these files successfully.

Is there a way to force the library to embed the fonts only partially during conversion? I tried iterating over all fonts in the pages before calling Convert(), without success: the fonts are fully embedded, causing the file to grow.

Furthermore, I noticed that the previously uploaded file js_api_reference.pdf is converted to PDF/A-1b successfully and passes Adobe preflight validation, but its size grows from less than 4 MB to 14 MB. Is there a way to obtain a PDF/A file of a more reasonable size? I looked through the documentation but did not find any relevant hints.

Regards.

Carlo

codewarior · June 2, 2015, 4:40am

Hi Carlo,

Thanks for using our APIs.

I have tested the scenario and I am able to reproduce the problem: when 4-fil.pdf is converted to PDF/A-1b compliance, the resultant file is not correct. I have logged it in our issue tracking system as PDFNEWNET-38791. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for your inconvenience.

codewarior · June 2, 2015, 5:09am

Hi Carlo,

During further testing, I have also managed to reproduce the problem with the other two files, which are not being converted to PDF/A-1b format either. I have logged them separately as

PDFNEWNET-38792 - Registro Croce Rep 5941-4095.pdf
PDFNEWNET-38793 - RR immobiliare rep63-52.pdf

Furthermore, in order to reduce file size, OptimizationOptions is the correct approach, and it should be applied before converting the file to PDF/A format. Once the document is converted to PDF/A, no further document manipulation can be performed. We are sorry for your inconvenience.
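
To check whether optimizing before conversion actually helps for a given file, it can be useful to compare the input and output sizes after each run. The following is only a minimal sketch using standard .NET file APIs; the SizeReport class and GrowthRatio method are hypothetical names introduced here for illustration and are not part of Aspose.Pdf.

```csharp
using System;
using System.IO;

static class SizeReport
{
    // Returns the output size as a multiple of the input size
    // (for example, 3.5 means the output is 3.5 times larger).
    public static double GrowthRatio(string inputPath, string outputPath)
    {
        long inputBytes = new FileInfo(inputPath).Length;
        long outputBytes = new FileInfo(outputPath).Length;

        if (inputBytes == 0)
        {
            throw new ArgumentException("Input file is empty.", "inputPath");
        }

        return (double)outputBytes / inputBytes;
    }

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: SizeReport <input.pdf> <output.pdf>");
            return;
        }

        double ratio = GrowthRatio(args[0], args[1]);
        Console.WriteLine("Output is {0:F2}x the size of the input", ratio);
    }
}
```

Running it against the original PDF and the converted PDF/A file gives a quick figure to compare between runs with different optimization settings.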

cmarsura · June 3, 2015, 2:18am

Thank you for the prompt response, Nayyer.

Regarding my question on font embedding: is there a way to embed only the characters actually used, instead of the full character set, before converting to PDF/A?

Best.

Carlo

codewarior · June 4, 2015, 1:45am

Hi Carlo,

Thanks for sharing the details.

Currently the complete font is embedded inside the PDF document. However, I am coordinating with the development team to see whether only a subset of the font can be embedded inside the PDF file. We will keep you posted with our findings.

codewarior · June 4, 2015, 2:23am

Hi Carlo,

Besides embedding the complete font, you can also embed only a subset of the font. Please try using the following code snippet.

//Open the document
Document doc = new Document("c:/pdftest/LineColorIssue.pdf");

//Iterate through all the pages
foreach (Page page in doc.Pages)
{
    if (page.Resources.Fonts != null)
    {
        foreach (Aspose.Pdf.Text.Font pageFont in page.Resources.Fonts)
        {
            //Check if the font is already marked as a subset
            if (!pageFont.IsSubset)
            {
                pageFont.IsSubset = true;
            }
        }
    }

    //Check for the Form objects
    foreach (XForm form in page.Resources.Forms)
    {
        if (form.Resources.Fonts != null)
        {
            foreach (Aspose.Pdf.Text.Font formFont in form.Resources.Fonts)
            {
                //Check if the font is already marked as a subset
                if (!formFont.IsSubset)
                {
                    formFont.IsSubset = true;
                }
            }
        }
    }
}

//Save the document
doc.Save("c:/pdftest/CL_ExtractoCCValor_Subset-outut.pdf");

cmarsura · June 4, 2015, 10:42am

Thank you, Nayyer. I had completely misunderstood that property.

Carlo

codewarior · June 8, 2015, 2:11am

Hi Carlo,

Thanks for the acknowledgement. Please continue using our APIs, and in the event of any further query, please feel free to contact us.

cmarsura · June 8, 2015, 7:44am

Yes Nayyer, I am continuing to test the library.

While trying to reduce the final PDF/A size by forcing the library to embed only font subsets, using the code you suggested, I found some problems. It seems the library assigns the wrong font when I enable the code that sets the IsSubset property to true.

Below is the code used:

static bool ConvetToPdfA(string infile, string outfile)
{
    bool result = false;

    // Remove any stale output from a previous run
    if (File.Exists(outfile))
    {
        File.Delete(outfile);
    }

    Aspose.Pdf.Document document = new Aspose.Pdf.Document(infile);
    try
    {
        document.Optimize();
        Document.OptimizationOptions optimizationOptions = Document.OptimizationOptions.All();
        optimizationOptions.UnembedFonts = false;
        document.OptimizeResources(optimizationOptions);

        //////////////////////////// BEGIN SUBSETTING FONTS CODE
        foreach (Aspose.Pdf.Page page in document.Pages)
        {
            if (page.Resources.Fonts != null)
            {
                foreach (Aspose.Pdf.Text.Font pageFont in page.Resources.Fonts)
                {
                    //Mark the font as a subset if it is not one already
                    if (!pageFont.IsSubset)
                        pageFont.IsSubset = true;
                }
            }

            //Check for the Form objects
            foreach (XForm form in page.Resources.Forms)
            {
                if (form.Resources.Fonts != null)
                {
                    foreach (Aspose.Pdf.Text.Font formFont in form.Resources.Fonts)
                    {
                        //Mark the font as a subset if it is not one already
                        if (!formFont.IsSubset)
                            formFont.IsSubset = true;
                    }
                }
            }
        }
        //////////////////////////// END SUBSETTING FONTS CODE

        result = document.Convert(outfile + ".log.xml", Aspose.Pdf.PdfFormat.PDF_A_1B, Aspose.Pdf.ConvertErrorAction.Delete);
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine("Caught exception " + ex.ToString());
    }

    // Save only when the conversion succeeded
    if (result)
    {
        document.Save(outfile);
    }

    Console.Error.WriteLine("Conversion " + (result ? "successful" : "failed") + "\n");
    return result;
}

To see the differences quickly, compile the code above and convert the attached files with it; then comment out the code between the 'SUBSETTING FONTS CODE' markers, recompile, and convert the same files, saving the output to another folder.

Opening the files with and without subsetted fonts, you can see the differences immediately.

For example, take VC_BZ_1411113_1E1EC.pdf: the first line of text, 'Autonome Provinz Bozen - Südtirol', correctly uses the TimesNewRomanBoldItalic font in one file but 'TimesNewRoman' in the other. In addition, the text has lost both its bold and italic attributes, and the total number of embedded fonts differs (10 vs. 6).

To see the font name of a piece of text I usually use PDF-XChange Viewer portable: open the PDF, select the text, right-click it, and choose 'Text properties...' > Formatting.
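
The same comparison could also be made programmatically by listing the fonts of each output file. The following is only a rough sketch: it assumes that the FontName and IsEmbedded properties of Aspose.Pdf.Text.Font are available in the library version in use (IsSubset is the property used in the code above), and the FontLister class name is hypothetical. It inspects page-level font resources only, not fonts inside Form objects.

```csharp
using System;
using Aspose.Pdf;

static class FontLister
{
    // Prints every page-level font resource with its embedding flags,
    // so two converted files can be compared side by side.
    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: FontLister <file.pdf>");
            return;
        }

        Document document = new Document(args[0]);
        int pageNumber = 0;

        foreach (Page page in document.Pages)
        {
            pageNumber++;
            if (page.Resources.Fonts == null)
            {
                continue;
            }

            foreach (Aspose.Pdf.Text.Font font in page.Resources.Fonts)
            {
                // FontName and IsEmbedded are assumed to exist in this version
                Console.WriteLine("page {0}: {1} (embedded: {2}, subset: {3})",
                    pageNumber, font.FontName, font.IsEmbedded, font.IsSubset);
            }
        }
    }
}
```

Running it once on each output folder's copy of a file and comparing the two listings should show which fonts were replaced.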

Best regards.

Carlo

codewarior · June 9, 2015, 6:14am

Hi Carlo,

Thanks for sharing the details.

I have tested the scenario and have observed that when VC_BZ_1411113_1E1EC.pdf and LF_BZ_1340174_33FAC-1.pdf are optimized for file size and then converted to PDF/A format, the resultant file contents get corrupted and an error message appears when the files are viewed. I have logged this in our issue tracking system as PDFNEWNET-38840.

However, when performing the same operation on 1_soppressa.pdf, I have observed that text formatting is lost in the resultant file when the IsSubset property is used. I have logged this separately as PDFNEWNET-38841 in our issue tracking system.

We will investigate these issues in detail and will keep you updated on the status of a correction. We apologize for your inconvenience.