Document comparison - Using Aspose library

Hi,

I am trying to compare two documents, which could be PDF-PDF, PDF-DOCX, or DOCX-DOCX.

DOCX-DOCX comparison works well because it returns revisions, which are helpful when I highlight the differences in my document.

PDF-PDF and PDF-DOCX comparisons aren’t that accurate for complex documents, so I am trying to convert the PDF to DOCX using the Aspose.PDF library and then do the comparison. During conversion, the structure of the document changes a lot, which causes incorrect differences.

Attaching the code used to do the conversion:


            InputStream pdfInputStream = pdfFile.getInputStream();
            com.aspose.pdf.License lic = new com.aspose.pdf.License();
            lic.setLicense("/path/to/Aspose.PDFProductFamily.lic");
            com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfInputStream);

            // Set the save options for DOCX
            com.aspose.pdf.DocSaveOptions saveOptions = new com.aspose.pdf.DocSaveOptions();
            saveOptions.setFormat(com.aspose.pdf.DocSaveOptions.DocFormat.DocX);
            saveOptions.setMode(com.aspose.pdf.DocSaveOptions.RecognitionMode.EnhancedFlow);
            saveOptions.setRecognizeBullets(true);                           // keep real lists
            saveOptions.setRelativeHorizontalProximity(2.5F);                // fix multi-column docs
//            saveOptions.setIm(true);

            // Save the PDF as a DOCX file
            ByteArrayOutputStream docxOutputStream = new ByteArrayOutputStream();
            pdfDocument.save(docxOutputStream, saveOptions);
            // Wrap the converted DOCX bytes so Aspose.Words can read them
            ByteArrayInputStream docxInputStream = new ByteArrayInputStream(docxOutputStream.toByteArray());

            // Load the converted DOCX into Aspose.Words Document
            com.aspose.words.Document doc = new com.aspose.words.Document(docxInputStream);

            // Normalize the document
            doc.joinRunsWithSameFormatting();


            // Save the normalized DOCX to a ByteArrayOutputStream
            ByteArrayOutputStream normalizedDocxStream = new ByteArrayOutputStream();
            doc.save(normalizedDocxStream, com.aspose.words.SaveFormat.DOCX);

Is there a better way to do the conversion so that the structure remains accurate for comparison?

Thank you!

@Likitha_T

To improve the accuracy of your PDF to DOCX conversion using the Aspose library, you can adjust the DocSaveOptions settings. Here are some effective strategies to enhance document structure:

  1. Use Different Recognition Modes: Experiment with various recognition modes. For example, RecognitionMode.Flow may provide better results for certain document types compared to EnhancedFlow.

  2. Set the UseAntiAliasing Property: Enabling anti-aliasing can enhance the visual quality of the converted document, which may help maintain the layout.

  3. Adjust the RelativeHorizontalProximity: Fine-tune this value based on your specific document layout. A smaller value can improve the handling of multi-column documents.

  4. Consider Using setIm: If you haven’t already, enable the setIm option to improve image handling and placement within the document.

  5. Post-Processing with Aspose.Words: After conversion, you can further manipulate the DOCX document using Aspose.Words. This allows you to adjust styles or formatting to better match the original PDF.

Here’s a modified version of your code snippet incorporating these suggestions:

// Set the save options for DOCX
com.aspose.pdf.DocSaveOptions saveOptions = new com.aspose.pdf.DocSaveOptions();
saveOptions.setFormat(com.aspose.pdf.DocSaveOptions.DocFormat.DocX);
saveOptions.setMode(com.aspose.pdf.DocSaveOptions.RecognitionMode.Flow); // Try Flow mode
saveOptions.setRecognizeBullets(true);
saveOptions.setRelativeHorizontalProximity(2.0F); // Adjust this value
saveOptions.setUseAntiAliasing(true); // Enable anti-aliasing
saveOptions.setIm(true); // Enable image handling

// Save the PDF as a DOCX file
ByteArrayOutputStream docxOutputStream = new ByteArrayOutputStream();
pdfDocument.save(docxOutputStream, saveOptions);

After implementing these adjustments, test the output to determine if the document structure improves for your comparison needs. If you have any further questions or need additional assistance, please feel free to ask!
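One rough way to test whether the remaining differences come from conversion noise rather than real content changes is to compare normalized paragraph text from both documents. The sketch below is plain Java and is not part of any Aspose API: it assumes the paragraph text has already been extracted from each DOCX into a list of strings, and the class and method names (ParagraphNormalizer, normalize, sameText) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an Aspose class: cleans up paragraph text
// so that layout-only differences do not show up as content changes.
final class ParagraphNormalizer {

    // Collapses whitespace runs, maps non-breaking spaces to plain spaces,
    // trims each paragraph, and drops paragraphs that end up empty.
    static List<String> normalize(List<String> paragraphs) {
        List<String> out = new ArrayList<>();
        for (String p : paragraphs) {
            String cleaned = p.replace('\u00A0', ' ').replaceAll("\\s+", " ").trim();
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    // Compares the running text only, so a paragraph that the converter
    // split or merged differently still counts as equal.
    static boolean sameText(List<String> a, List<String> b) {
        return String.join(" ", normalize(a)).equals(String.join(" ", normalize(b)));
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for text extracted from two DOCX files.
        List<String> original = List.of("Hello   world", "", "Second line");
        List<String> converted = List.of("Hello world Second", "line");
        System.out.println("Same running text: " + sameText(original, converted));
    }
}
```

If sameText reports equality while the revision-based comparison still shows many changes, the differences are most likely structural (paragraph breaks, spacing) rather than textual, which helps narrow down which conversion settings to tune.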

Are setUseAntiAliasing() and setIm() methods available in the aspose-pdf-25.4-jdk17.jar version?
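A quick way to check this locally, independent of the documentation, might be a small reflection probe run with the jar on the classpath. This is only a sketch using standard Java reflection: MethodProbe and hasPublicMethod are hypothetical names, and the class name com.aspose.pdf.DocSaveOptions is taken from the snippets above.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical helper: reports whether a class on the classpath exposes
// a public method with the given name (any parameter list).
final class MethodProbe {

    static boolean hasPublicMethod(String className, String methodName) {
        try {
            // Load without running static initializers; we only inspect the signature list.
            Class<?> type = Class.forName(className, false, MethodProbe.class.getClassLoader());
            return Arrays.stream(type.getMethods())
                         .map(Method::getName)
                         .anyMatch(methodName::equals);
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing or could not be linked: treat as "not available".
            return false;
        }
    }

    public static void main(String[] args) {
        String target = "com.aspose.pdf.DocSaveOptions";
        for (String name : new String[] {"setUseAntiAliasing", "setIm", "setMode"}) {
            System.out.println(name + " -> " + hasPublicMethod(target, name));
        }
    }
}
```

Note that the probe also prints false for every name when the Aspose.PDF jar is not on the classpath, so setMode (which the code above already calls) serves as a sanity check that the class was actually found.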

@Likitha_T,

To assess your issues effectively, please zip and attach your sample files (e.g., input PDF documents, output DOCX and PDF files, etc.). Additionally, provide screenshots to highlight the problematic areas and discrepancies between the original document and the converted document. We will review your issues soon.