Hi,
I am trying to compare two documents, which can be any of these pairs: PDF-PDF, PDF-DOCX, or DOCX-DOCX.
DOCX-DOCX comparison works well because it returns revisions, which are helpful when I highlight the changes in my document.
PDF-PDF and PDF-DOCX comparisons aren’t that accurate for complex documents, so I am converting the PDF to DOCX with the Aspose.PDF library and then running the comparison. During conversion, however, the structure of the document changes a lot, which produces incorrect differences.
Here is the code used to do the conversion:
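// Requires java.io.InputStream, java.io.ByteArrayInputStream and java.io.ByteArrayOutputStream
// pdfFile is the uploaded PDF
InputStream pdfInputStream = pdfFile.getInputStream();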
com.aspose.pdf.License lic = new com.aspose.pdf.License();
lic.setLicense("/path/to/Aspose.PDFProductFamily.lic");
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfInputStream);
// Set the save options for DOCX
com.aspose.pdf.DocSaveOptions saveOptions = new com.aspose.pdf.DocSaveOptions();
saveOptions.setFormat(com.aspose.pdf.DocSaveOptions.DocFormat.DocX);
saveOptions.setMode(com.aspose.pdf.DocSaveOptions.RecognitionMode.EnhancedFlow);
saveOptions.setRecognizeBullets(true); // keep real lists
saveOptions.setRelativeHorizontalProximity(2.5F); // fix multi-column docs
// saveOptions.setIm(true);
// Save the PDF as a DOCX file
ByteArrayOutputStream docxOutputStream = new ByteArrayOutputStream();
pdfDocument.save(docxOutputStream, saveOptions);
// Wrap the converted DOCX bytes in an input stream
ByteArrayInputStream docxInputStream = new ByteArrayInputStream(docxOutputStream.toByteArray());
// Load the converted DOCX into an Aspose.Words Document
com.aspose.words.Document doc = new com.aspose.words.Document(docxInputStream);
// Normalize the document
doc.joinRunsWithSameFormatting();
// Save the normalized DOCX to a ByteArrayOutputStream
ByteArrayOutputStream normalizedDocxStream = new ByteArrayOutputStream();
doc.save(normalizedDocxStream, com.aspose.words.SaveFormat.DOCX);
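To show the kind of false difference I mean, here is a minimal sketch in plain Java. It does not call any Aspose API: the class TextNormalizer and its methods are hypothetical names made up for illustration, and it assumes the paragraph text of both documents has already been extracted as strings. It only flattens layout noise (line wraps, doubled spaces, non-breaking spaces, end-of-line hyphenation) so that two paragraphs with the same wording compare as equal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any Aspose API: it flattens layout-only
// differences so that only real wording changes remain.
class TextNormalizer {

    // Collapse layout noise in one paragraph of extracted text.
    static String normalize(String paragraph) {
        return paragraph
                .replace("\u00AD", "")    // drop soft hyphens
                .replace('\u00A0', ' ')   // non-breaking space -> plain space
                // Re-join a word split by a hyphen at a line break.
                // Caveat: this also merges genuinely hyphenated words that wrap.
                .replaceAll("(\\p{L})-\\s*\\n\\s*(\\p{L})", "$1$2")
                .replaceAll("\\s+", " ")  // collapse whitespace runs, including line breaks
                .trim();
    }

    // Normalize every paragraph and skip the ones that end up empty.
    static List<String> normalizeAll(List<String> paragraphs) {
        List<String> out = new ArrayList<>();
        for (String p : paragraphs) {
            String n = normalize(p);
            if (!n.isEmpty()) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same sentence, as it might come out of a PDF conversion vs. a native DOCX.
        String fromPdf = "The quick brown fox jumps over the la-\nzy dog.";
        String fromDocx = "The quick  brown fox jumps\u00A0over the lazy dog.";
        System.out.println(normalize(fromPdf).equals(normalize(fromDocx))); // prints "true"
    }
}
```

This only hides whitespace-level noise. It would not help with reordered paragraphs, split tables, or text boxes, which is the structural part I am asking about.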
Is there a better way to do the conversion so that the structure remains accurate for comparison?
Thank you!