PDF to PDF/A 3 B results in formatting errors

Hello,



I have problems converting PDF files to PDF/A 3B. Especially the formatting of tables is often weird. Sometimes one or more words of a sentence are moved up or down, or moved over other text. Sometimes hyperlinks are shown as a blue block; it seems that the text color is now used as the background color as well.

This happens mostly with HTML and Word files that are converted to PDF (using Aspose.Words, which works fine).



Environment: .NET 3.5, tested with Aspose.Pdf for .NET 11.4.0

bea.grosse-venhaus:




I have problems converting PDF files to PDF/A 3B. Especially the formatting of tables is often weird. Sometimes one or more words of a sentence are moved up or down, or moved over other text.
Hi Bea,

Thanks for contacting support.

I have tested the scenario using Aspose.Words for .NET 15.4.0 and Aspose.Pdf for .NET 11.4.0 and have managed to reproduce the text formatting issue when converting the Besprechung.pdf file to PDF/A_3b format. I have logged this problem as PDFNEWNET-40443 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the fix. Please be patient and give us a little time. We are sorry for this inconvenience.


bea.grosse-venhaus:

Sometimes hyperlinks are shown as a blue block; it seems that the text color is now used as the background color as well.

This happens mostly with HTML and Word files that are converted to PDF (using Aspose.Words, which works fine).
I have also tried converting the EMail_2016-145.html file to PDF/A_3b format and was unable to notice any issue. For your reference, I have attached the output generated on my end.

[C#]

// Load the HTML file with Aspose.Words and save it as a regular PDF.
Aspose.Words.Document worddoc = new Aspose.Words.Document(@"C:\pdftest\EMail_2016-145.html", new Aspose.Words.LoadOptions(Aspose.Words.LoadFormat.Html, "", ""));
worddoc.Save(@"C:\pdftest\EMail_2016-145.pdf", Aspose.Words.SaveFormat.Pdf);

// Open the PDF with Aspose.Pdf and convert it to PDF/A-3b.
// The first argument of Convert is the path of the conversion log file.
var pdfDocument = new Aspose.Pdf.Document(@"C:\pdftest\EMail_2016-145.pdf");
pdfDocument.Convert(@"C:\pdftest\Besprechung_Source.txt", Aspose.Pdf.PdfFormat.PDF_A_3B, Aspose.Pdf.ConvertErrorAction.Delete);
pdfDocument.Save(@"C:\pdftest\EMail_2016-145_PDF_A_3b.pdf");

Thank you very much for your fast reply.


We have many scenarios in which we convert documents to PDF/A 3B. The one I have problems with is the conversion of e-mails in an Outlook add-in.

When the e-mail is generated using the RTF format, an RTF document of the e-mail is created and afterwards converted.

When the e-mail is generated using the HTML format, the HTML body of the e-mail is saved as an HTML file and all referenced files are saved in their original format in a SharePoint library.
To convert the HTML file, CustomLoaderOfExternalResources (a LoadOptions.ResourceLoadingStrategy delegate) is used to load all referenced files from the correct folder in the correct SharePoint library:

private byte[] ConvertHtmlToPdf(byte[] fileBytes, SPWeb referateWeb, string folderUrl)
{
    using (MemoryStream memoryStream = new MemoryStream(fileBytes))
    {
        Aspose.Pdf.HtmlLoadOptions htmlOptions = new Aspose.Pdf.HtmlLoadOptions(folderUrl);
        htmlOptions.CustomLoaderOfExternalResources = uri =>
        {
            string fileNameSrc = uri.Replace("cid:", "");
            fileNameSrc = fileNameSrc.Substring(0, fileNameSrc.IndexOf('@'));

            // referateWeb is an SPWeb object
            SPFile file = referateWeb.GetFile(Path.Combine(folderUrl, fileNameSrc));
            if (file.Exists)
            {
                FilesToDelete.Add(file);
                LoadOptions.ResourceLoadingResult result = new LoadOptions.ResourceLoadingResult(file.OpenBinary());
                return result;
            }
            return new LoadOptions.ResourceLoadingResult(new byte[] { });
        };
        Document htmlDocument = new Document(memoryStream, htmlOptions) { IgnoreCorruptedObjects = true };
        htmlDocument.OptimizeResources(Document.OptimizationOptions.All());
        using (MemoryStream documentStream = new MemoryStream())
        {
            htmlDocument.Save(documentStream);
            return documentStream.ToArray();
        }
    }
}
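
A side note on the resource loader in this method: the file name is taken from everything before the first '@' in the "cid:" reference, so a reference without an '@' would make Substring throw, because IndexOf returns -1. The following is only a minimal sketch of a more defensive parse. CidHelper and ExtractFileName are hypothetical names that are not part of Aspose or of the code posted here, and unlike the posted code the sketch only strips a leading "cid:" prefix rather than every occurrence.

```csharp
using System;

// Hypothetical helper (an assumption for illustration, not an Aspose API).
public static class CidHelper
{
    // Returns the file-name part of a "cid:" reference, e.g.
    // "cid:image001.png@01D16A2B" -> "image001.png".
    public static string ExtractFileName(string uri)
    {
        if (string.IsNullOrEmpty(uri))
        {
            return string.Empty;
        }

        const string prefix = "cid:";

        // Strip the scheme only when it is a prefix of the reference.
        string name = uri.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
            ? uri.Substring(prefix.Length)
            : uri;

        // Cut off the "@..." suffix if present; otherwise keep the name unchanged.
        int atIndex = name.IndexOf('@');
        return atIndex >= 0 ? name.Substring(0, atIndex) : name;
    }
}
```

Inside the loader, the two parsing lines could then be replaced by a single call such as `string fileNameSrc = CidHelper.ExtractFileName(uri);`, assuming the referenced files are stored under the name that precedes the '@'.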

Hi Bea,


Thanks for sharing the details.

Please share all the resource files that cause problems when you convert them to PDF/A_3 format. However, in the code snippet above, it appears that you are converting an HTML file to PDF and applying an optimization technique. I cannot find any PDF/A conversion code in the post above.

I attached the class in which the conversion(s) take place, as well as test documents.

To sum it up:
Besides the formatting problems that occur when converting DOCX files to PDF/A 3B (which you confirmed), we also have problems when converting RTF and HTML files to PDF/A 3B. These files are created within an Outlook add-in, saved in a SharePoint library and afterwards converted.

The formatting problems occur in the DOCX, RTF and HTML files.
Images in RTF files can get blown up.

The process is as follows :

RTF-E-Mail
-> RTF-file
-> PDF (shared class file)
-> PDF/A 3B (shared class file) Formatting is wrong, pictures can get blown up

HTML-E-Mail
-> HTML-file of body
-> PDF (with previous method I shared) Formatting is wrong (especially in tables), hyperlinks can be blue
-> PDF/A 3B (shared class file)


Hi Bea,


Thanks for sharing the resource files.

The conversion issues related to the Besprechung.docx file are already logged in our issue tracking system as PDFNEWNET-40443. I am still working on replicating the issues related to the other files and will get back to you soon.

bea.grosse-venhaus:

To sum it up:
Besides the formatting problems that occur when converting DOCX files to PDF/A 3B (which you confirmed), we also have problems when converting RTF and HTML files to PDF/A 3B. These files are created within an Outlook add-in, saved in a SharePoint library and afterwards converted.

The formatting problems occur in the DOCX, RTF and HTML files.
Images in RTF files can get blown up.

The process is as follows :

RTF-E-Mail
-> RTF-file
-> PDF (shared class file)
-> PDF/A 3B (shared class file) Formatting is wrong, pictures can get blown up
Hi Bea,

Thanks for your patience. I have tested the scenario using the EMail_2016-78.rtf file and have managed to reproduce the issues stated above. I have logged them as PDFNEWNET-40523 in our issue tracking system.

bea.grosse-venhaus:
HTML-E-Mail
-> HTML-file of body
-> PDF (with previous method I shared) Formatting is wrong (especially in tables), hyperlinks can be blue
-> PDF/A 3B (shared class file)
I have tested the scenario and observed that some special characters are rendered in the PDF file when converting the HTML file to PDF format using Aspose.Words, so I have informed my colleagues from the respective team, who will look further into this matter and reply accordingly.

I have also noticed a file Ursprungsmail WG Neue Pressemitteilung - Martin Lohse wird neuer Wissenschaftlicher Vorstand des MDC_PDF3B008.pdf in the earlier attachments, but its source file is missing. Can you please share the input file and some details about the issues you are facing with this document, so that we can look further into this scenario? We are sorry for this inconvenience.

bea.grosse-venhaus:
HTML-E-Mail
-> HTML-file of body
-> PDF (with previous method I shared) Formatting is wrong (especially in tables), hyperlinks can be blue
-> PDF/A 3B (shared class file)
Hi Bea,

In a separate attempt at converting the HTML file to PDF/A_3b format, the text at the bottom of the resultant file is garbled (characters are overlapping). I have logged this problem as PDFNEWNET-40524 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the fix. Please be patient and give us a little time. We are sorry for this inconvenience.

Hi Bea,


Thanks for your inquiry. We have converted your “EMail_2016-145.html” file to PDF format using both the latest version of Aspose.Words for .NET 16.2.0 and MS Word 2016. We have attached these PDF documents here for your reference. The only difference between these two PDF files is highlighted in attached screenshot. We have logged this issue in our bug tracking system. The ID of this issue is WORDSNET-13352. Your thread has also been linked to this issue and you will be notified as soon as it is resolved. Sorry for the inconvenience.

Best regards,

Thanks for your reply. The customer uses MS Office 2010, I don’t know if this makes the difference…

Next problem: I just tried to convert an MSG file to PDF with the newest DLLs, but now the Concatenate() method does not work anymore; the “target” MemoryStream is 0 bytes (it works with PDF 11.4.0, .NET 3.5). I use the method to create one PDF for the e-mail message and all attached files.

AND… also in PDF documents some pictures get blown up…

Hi Bea,

Can you please also share the input email (MSG) file and your complete code to reproduce this issue at our end?

Best Regards,

I already did that when I started this thread

bea.grosse-venhaus:
AND... also in PDF documents some pictures get blown up...
Hi Bea,

Thanks for sharing the resource files.

I have tested the scenario and have managed to reproduce the same problem: the resultant file is not PDF/A_3b compliant, and the image on the title page is blown up. I have logged it as PDFNEWNET-40638 in our issue tracking system. We will look further into the details of this problem and keep you posted on the status of the fix. Please be patient and give us a little time. We are sorry for this inconvenience.

The issues you have found earlier (filed as PDFNET-40524, PDFNET-40638) have been fixed in Aspose.PDF for .NET 19.10.

The issue you have found earlier (filed as WORDSNET-13352) has been fixed in the Aspose.Words for .NET 23.8 update, which is also available on NuGet.