PDF Comparison Issues – Cropped Images and Misaligned Content

Hi Team,

We are experiencing issues while performing PDF-to-PDF comparisons using Aspose. The generated output file has cropped images and misaligned content, which impacts the accuracy of the comparison.

To help troubleshoot, I have attached the following files for your review:

PythonCode_compare.zip (contains the script used for comparison)
output_pdf_test_001 1.pdf (output file with issues)
PDF_File_2_updated_v2.pdf (one of the input files)
PDF_File1_Updated_v1.pdf (another input file)

Could you please:

  1. Review the attached files to identify the root cause of the cropping and misalignment.
  2. Provide any recommendations to resolve these issues and improve the accuracy of the comparison.
  3. Let us know if specific settings or adjustments are needed in the code or library configuration.

PythonCode_compare.zip (4.1 KB)

ouput_pdf_test_001 1.pdf (1.1 MB)

PDF_File_2_updated_v2.pdf (924.3 KB)

PDF_File1_Updated_v1.pdf (864.3 KB)

Regards,
Munish Singla

@munish.singla Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When loading a PDF document, Aspose.Words converts the fixed-page document structure into the flow Document Object Model. Unfortunately, this conversion does not guarantee 100% fidelity. I am afraid there is no way to fully preserve the PDF document layout after a PDF->Aspose.Words DOM->PDF roundtrip.
Regarding PDF document comparison using Aspose.Words: although two PDF documents might look the same visually, their internal structure might differ. This leads to different DOMs being built by Aspose.Words and, as a result, to differences being reported in the document comparison.
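
For readers following along, the roundtrip described above can be sketched roughly as follows. This is a minimal illustration, not the script from the attached zip: the function name `compare_pdfs`, the default author string, and the file paths are hypothetical, and it assumes the `aspose-words` package for Python is installed. Each step is commented with where the fixed-page to flow conversion happens.

```python
from datetime import datetime
from pathlib import Path


def compare_pdfs(path_a: str, path_b: str, out_path: str,
                 author: str = "comparison") -> int:
    """Compare two PDFs through the Aspose.Words DOM; return the revision count.

    Hypothetical helper for illustration only.
    """
    # Fail early on missing inputs, before the library is loaded.
    for p in (path_a, path_b):
        if not Path(p).is_file():
            raise FileNotFoundError(p)

    # Imported lazily so this module can be imported without aspose-words.
    import aspose.words as aw

    # Loading converts each fixed-page PDF into the flow document model.
    # This is the step where fidelity can be lost.
    doc_a = aw.Document(path_a)
    doc_b = aw.Document(path_b)

    # compare() records the differences as revisions in doc_a.
    doc_a.compare(doc_b, author, datetime.now())
    revision_count = doc_a.revisions.count

    # Saving renders the flow model back to PDF, so the layout of the
    # output may differ from either input.
    doc_a.save(out_path)
    return revision_count
```

Because both inputs pass through the flow model before they are compared, any cropping or misalignment introduced on load will also appear in the saved output.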

Hi @alexey.noskov

Thank you for your detailed explanation regarding the limitations of PDF-to-PDF comparisons using Aspose.Words.

I have a few follow-up questions:

  1. Is there any way to perform PDF-to-PDF comparisons effectively using Aspose, even if it involves additional configurations or tools?
  2. Can we identify specific types of PDF documents that are not supported or prone to issues during the conversion and comparison process?
  3. To enhance the user experience, are there any steps or best practices we can take to improve the accuracy of PDF-to-PDF comparisons?
  4. Is there a way to pre-validate PDF documents to determine if they are likely to result in poor comparison results, so we can proactively stop the comparison and notify the user?

Your insights will help us provide a better experience to our users and address their concerns effectively.

Best regards,
Munish

@munish.singla

  1. I am afraid there is currently no way to improve PDF comparison quality. We have a feature request, WORDSNET-24926, to provide a way to compare PDF documents without loading them into the Aspose.Words DOM. I have linked your topic to this feature request. We will keep you informed and let you know once it is implemented.

  2. No, unfortunately, there is no way to identify problematic PDF documents.

  3. I am afraid there are no recommendations.

  4. No, unfortunately, there is no way to pre-validate PDF documents.

Hi @alexey.noskov

Thank you for your updates.

I have a few more questions:

  1. Should we use the Python library or the Java library?
  2. Which API receives updates faster?
  3. Currently, there is no support for PDF-to-PDF comparison in the Java API, correct? If so, is there a timeline for this feature?
  4. Will these issues with PDF-to-PDF comparison also be addressed in the Python library? If yes, when can we expect fixes for those?

Looking forward to your response.

@munish.singla

  1. The .NET, Python and Java versions of Aspose.Words provide the same set of features, with small exceptions. For example, callbacks are not yet supported in the Python version, while loading PDF documents is not supported in the Java version.

  2. Our main product is the .NET version of Aspose.Words, and it gets updates first. The code is then ported to Java, which takes about one to two weeks. In parallel, a special build of the .NET version is wrapped into the Python version, which usually takes about a week. So all Aspose.Words versions get updates almost simultaneously.

  3. Yes, you are right, Aspose.Words for Java does not support loading PDF documents. Unfortunately, there are currently no estimates for when this feature will be available in the Java version.

  4. All fixes are made in the main .NET version of Aspose.Words and then ported to the other products. So once a fix is done in the main version, it will also be available in the Python version. I am afraid there are no estimates for when comparison of PDF documents without loading them into the Aspose.Words DOM will be available.

Hi @alexey.noskov We are also looking for PDF comparison using Aspose.Words. We already use a similar architecture for document conversion, so a Java version would be helpful for us. Could you please tell us if there is a roadmap for PDF comparison?
Additionally, are there any usage limits for document comparison, e.g. a size limit or a maximum number of pages it can support?
Looking forward to your response.
Thank you.

@abhishek.sonkar I am afraid there are currently no estimates for either issue: loading PDF documents into the Aspose.Words DOM in Java, or comparing PDF documents without loading them into the Aspose.Words DOM.
As mentioned above, loading PDF documents is currently supported only in the .NET and Python versions of Aspose.Words.

In that case, @alexey.noskov, this contradicts your documentation for Aspose.Words for Java:

https://docs.aspose.com/words/java/supported-document-formats/

Here we can see that you also support PDF for comparison.

@abhishek.sonkar There is no contradiction. Aspose.Words for Java supports the PDF document format only for export, as marked in the table on the page linked above.

My bad, @alexey.noskov.
Thank you for the quick response and for the correction.
We are thinking of buying a license for document comparison, but that will also depend on whether we will be able to use it for PDF comparison in the near future.
Thank you.

@abhishek.sonkar I am afraid we cannot promise that PDF document loading and comparison will be available in the Java version anytime soon.