PDF generated from docx no longer bit-wise identical

fhoeben · September 28, 2018, 11:29am

When I was using the evaluation version of Aspose.Words for Java 18.4, I could generate a PDF (PDF/A) from a docx using mail merge that was bit-wise the same on every run (as long as I used the same docx and the same input).

When I upgrade to 18.5 or higher (I tried 18.5, 18.6, 18.8 and 18.9), this is no longer the case: the contents of the PDF change on each run.

I had a very nice unit test that verified that my custom IMailMergeDataSource still produced exactly the same output. But that will no longer work.

Will I see the same behaviour when I purchase a license, or is this difference limited to the evaluation version?
If the licensed version’s behaviour is also changed: is there a way for me to get the old behaviour?

Thanks

tahir.manzoor · September 28, 2018, 3:07pm

@fhoeben

Thanks for your inquiry. To ensure a timely and accurate response, please attach the following resources here for testing:

Your input Word document.
Please attach the output Word file that shows the undesired behavior.
Please attach the expected output Word file that shows the desired behavior.
Please create a simple Java application ( source code without compilation errors ) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you have these pieces of information ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

fhoeben · September 29, 2018, 11:08am

Find attached a Java class and .docx document.
With 18.4 the program prints “OK”; with 18.9 I get the exception indicating that the PDFs are not identical.

sample-pdf.zip (23.8 KB)

tahir.manzoor · September 29, 2018, 2:22pm

@fhoeben

We have tested the scenario and reproduced the same issue on our side. We have logged this problem in our issue tracking system as WORDSNET-17522. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

tahir.manzoor · October 14, 2018, 5:16pm

@fhoeben

Thanks for your patience. The issue you are facing is not actually a bug in Aspose.Words, so we have closed it (WORDSNET-17522) as ‘Not a Bug’.

This is the expected behavior: Aspose.Words generates a unique PDF File Identifier when saving to PDF. This feature was introduced in Aspose.Words v18.5.

fhoeben · October 15, 2018, 5:45am

Can you explain what the purpose of such a unique identifier is?

For us, at the moment, having the PDFs not be completely identical is very inconvenient. Is there a way to disable/prevent generation of this identifier?

Alternatively, is this identifier stored at a fixed location, so that we could ignore it and compare the rest of the file?
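
A minimal sketch of that second idea, assuming the identifier is stored the way the PDF specification describes it: as an `/ID` array of two hex strings in the trailer (or cross-reference stream) dictionary. The class and method names are hypothetical, nothing here uses an Aspose.Words API, and whether this is the only run-dependent field in the generated file is an assumption that would need checking against real output.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper: compares two PDFs while ignoring the trailer /ID entry.
final class PdfIdMasker {

    // Matches an entry of the form: /ID [<hex> <hex>]
    // Assumption: both identifier strings are written as hex strings.
    private static final Pattern FILE_ID = Pattern.compile(
            "/ID\\s*\\[\\s*<[0-9A-Fa-f]*>\\s*<[0-9A-Fa-f]*>\\s*\\]");

    // Returns a copy of the PDF bytes with the identifier values blanked out.
    static byte[] maskFileId(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char and back,
        // so binary stream data survives the round trip unchanged.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        String masked = FILE_ID.matcher(text).replaceAll("/ID [<> <>]");
        return masked.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True if the two files are identical apart from the /ID entry.
    static boolean sameIgnoringFileId(byte[] first, byte[] second) {
        return Arrays.equals(maskFileId(first), maskFileId(second));
    }
}
```

In a unit test, the two generated files could be read with `Files.readAllBytes` and passed to `sameIgnoringFileId`. If other fields also vary between runs, they would need similar masking.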

tahir.manzoor · October 15, 2018, 10:08am

@fhoeben

Thanks for your inquiry. In Aspose.Words 18.5 we fixed an issue related to searching for words in a PDF. Here are the details of the issue:

The problem appears because of the Acrobat “Fast Find” function. It caches the content of documents in order to increase search speed. The PDF document “File Identifier” is used to distinguish the documents in the cache. In old versions, Aspose.Words built the “File Identifier” based only on the document info. Thus documents with the same info had the same identifier, and two different documents with the same “File Identifier” break “Fast Find”, which causes the search issue.

If we disable/prevent generation of this identifier, the text search feature will not be available when you open the PDF in a viewer.

fhoeben · October 15, 2018, 10:42am

If I understand correctly, you fixed the incorrect caching by ensuring documents are never cached (even if their contents are the same). It sounds like you should have put a document hash in the identifier instead of a unique ID. That way, two identical documents (which Acrobat can cache and treat as the same) would have the same identifier and be cached properly, and search would work just fine. My test would then also still work, as two PDFs with identical content should get the same hash, and therefore the same file identifier, and be bitwise the same.

So then my issue report basically becomes: use a hash (e.g. MD5 or SHA) on the PDF content to calculate the PDF file identifier instead of using a random number.
Does this make sense?
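
Roughly, the suggestion could look like the sketch below. This is not Aspose.Words code; the class and method names are hypothetical, and it assumes the hash is computed over the PDF bytes before the identifier itself is written into the file.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: derive a stable file identifier from document content.
final class ContentBasedFileId {

    // Returns an uppercase hex MD5 digest of the given bytes.
    // Assumption: pdfContentWithoutId is the serialized PDF minus the /ID entry.
    static String fileIdFor(byte[] pdfContentWithoutId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(pdfContentWithoutId);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02X", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Identical content then always yields the same identifier, while documents that differ in content but share the same document info get different ones.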

tahir.manzoor · October 15, 2018, 2:59pm

@fhoeben

Thanks for your inquiry. We have logged a feature request as WORDSNET-17579 (Add property in PdfSaveOptions to disable the insertion of “File Identifier”) in our issue tracking system. You will be notified via this forum thread once this feature is available. We apologize for your inconvenience.

fhoeben · October 15, 2018, 5:28pm

Thanks for creating the feature request.

Can you add the following, in this request or in a separate one (since I understand that having an identifier is better for Acrobat)? I would actually prefer to have an identifier that is based on the entire content of the document but is ‘stable’ (i.e. the same document content gets the same identifier). I imagine that calculating a hash over the PDF content (without the identifier added) would give you such an identifier. This would cover both scenarios: Acrobat’s find, and its caching, work correctly AND saving the same document twice results in a bitwise identical file.

tahir.manzoor · October 16, 2018, 5:07am

@fhoeben

Thanks for your inquiry. We have logged a new feature request as WORDSNET-17582 in our issue tracking system. We will check the possibility of implementation of this feature. We will inform you via this forum thread once there is an update available on it.

tahir.manzoor · November 13, 2018, 5:59pm

@fhoeben

We would like to update you that we have closed the issues (WORDSNET-17579 and WORDSNET-17582) with a “Won’t Fix” resolution.