Line/Paragraph Spacing for Word to PDF conversion

When converting legal documents from Word to PDF, we’re getting line and/or paragraph spacing issues, and we cannot determine the cause or find a workaround in the Word document. The result is a different number of pages between the DOCX and the PDF. Pages break at different paragraphs, so headers and page breaks do not fall where the user expects them. Exact or very similar PDF output is essential for legal documents; even the page count matters a great deal in legal applications.

After users upload a DOCX file, we save it with code similar to the following snippet, with no further document manipulation:

import com.aspose.words.*;
import java.io.File;
import java.io.FileInputStream;

File inputFile = new File(filePath);
File resultFile = new File(filePath.replace(".docx", ".pdf"));
System.out.println("Creating new PDF document: " + resultFile.getAbsolutePath());

try (FileInputStream fis = new FileInputStream(inputFile))
{
	Document wordDocument = new Document(fis);

	// Compression, downsampling, and PDF/A settings for the output file.
	PdfSaveOptions saveOptions = new PdfSaveOptions();
	saveOptions.setTextCompression(PdfTextCompression.FLATE);
	saveOptions.setImageCompression(PdfImageCompression.JPEG);
	saveOptions.getDownsampleOptions().setDownsampleImages(true);
	saveOptions.getDownsampleOptions().setResolution(144);
	saveOptions.setJpegQuality(90);
	saveOptions.setCompliance(PdfCompliance.PDF_A_1_B);
	saveOptions.setMemoryOptimization(true);
	saveOptions.setTempFolder(new File(System.getProperty("java.io.tmpdir")).getAbsolutePath());

	// Pass saveOptions to save(); otherwise the settings above are ignored.
	wordDocument.save(resultFile.getAbsolutePath(), saveOptions);
}
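
A side note on the snippet above: filePath.replace(".docx", ".pdf") replaces every occurrence of ".docx" in the path and is case-sensitive, so a directory name containing ".docx" or a file named "Brief.DOCX" would produce an unexpected result path. This is unrelated to the spacing problem, but a minimal sketch of a safer derivation, using only the Java standard library, could look like this. The class and method names (PdfPathHelper, toPdfPath) are hypothetical and not part of Aspose.Words.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class PdfPathHelper {

    // Hypothetical helper: swaps only a trailing ".docx" (any letter case)
    // of the file name for ".pdf" and leaves the directory part untouched.
    public static Path toPdfPath(Path docxPath) {
        String fileName = docxPath.getFileName().toString();
        String lower = fileName.toLowerCase(Locale.ROOT);
        String base = lower.endsWith(".docx")
                ? fileName.substring(0, fileName.length() - ".docx".length())
                : fileName; // no ".docx" suffix: just append ".pdf"
        return docxPath.resolveSibling(base + ".pdf");
    }

    public static void main(String[] args) {
        // Example with one of the file names from this thread.
        System.out.println(toPdfPath(Paths.get("uploads", "20-0681 Nelson.docx")));
    }
}
```

The returned Path can be passed to the save call via toString(), in place of resultFile.getAbsolutePath().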

I have created a fully standalone sample project and am attaching the source code, the source document, the PDF converted by Aspose.Words, and the PDF that Word produces with Save as PDF. I would expect all of these to have the same paragraphs on the same pages and the same number of pages.

Also, the example document I’m uploading has all fonts (Century Schoolbook) embedded, so I don’t believe font substitution is the issue. I have also tried removing the PDF/A conversion code, and the spacing problem remains.

wordconvert.zip (2.7 KB)
20-0681 Nelson.docx (247.8 KB)
20-0681 Nelson (Aspose).pdf (115.2 KB)
20-0681 Nelson (Word).pdf (158.2 KB)

@jshannon I have reproduced and logged the issue as WORDSNET-24517. We will keep you informed and let you know once it has been resolved.
In the meantime, I suggest the following code as a workaround.

// Opening Docx...
...

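// Get the primary footer of the first section.
HeaderFooter headersFooter = input.getFirstSection().getHeadersFooters().getByHeaderFooterType(HeaderFooterType.FOOTER_PRIMARY);
// The paragraph at index 1 is the one being replaced.
Paragraph para = headersFooter.getParagraphs().get(1);
// Create a new centered paragraph and insert it right after the old one.
Paragraph newPara = new Paragraph(input);
newPara.getParagraphFormat().setAlignment(ParagraphAlignment.CENTER);
headersFooter.insertAfter(newPara, para);
// Move all content from the old paragraph into the new one, then drop the old one.
while (para.getFirstChild() != null)
    newPara.appendChild(para.getFirstChild());
para.remove();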

...
// Saving Pdf...

Please accept our apologies for the inconvenience.

@Vadim.Saltykov thanks for the quick reply! That workaround works for the document I uploaded. However, I’m attaching two more documents from our customer where the workaround does not work. I’m guessing it’s because the headers/footers are set up differently. Are these the same issue? Can you take a look and see whether there is a way to work around this consistently?

McKinneyword.docx (37.5 KB)
Docketing - Docketing - Docketing Notice - 2022-05-10T132910.584.docx (81.7 KB)

@jshannon The issue is caused by incorrectly determined boundaries of a framed paragraph in the footer. The proposed workaround moves the page number field into a regular paragraph without a frame. In “McKinneyword.docx” the framed paragraph is in the second section, so for that document you need to change input.getFirstSection() to input.getSections().get(1). There are no frames in “Docketing - Docketing - Docketing Notice - 2022-05-10T132910.584.docx”, so it needs no additional processing.

@Vadim.Saltykov I tried your suggestion with “McKinneyword.docx” and it’s not working for me. I have 19 pages in Word and 20 pages in the PDF. I also manually modified the Word file and removed the frame, just using the page number in the footer. The line spacing is still off.

Also, if there are no frames in “Docketing - Docketing - Docketing Notice - 2022-05-10T132910.584.docx”, is there a different problem? The lines and pages in the converted PDF do not match what I see in Word itself. There’s an extra paragraph on the last page (page 5) of the Aspose-converted PDF.

Can you try reproducing it with the program I sent and these 2 files?

@jshannon I changed the code to search for problematic paragraphs in all headers and footers. Please try the following code; the output files I get with it are attached below.

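Document input = new Document("input.docx");

// Check every header and footer in every section for framed paragraphs.
for (Section section : input.getSections())
{
    for (HeaderFooter headerFooter : section.getHeadersFooters())
    {
        // Iterate over a snapshot, because the loop body inserts and removes paragraphs.
        for (Node node : headerFooter.getParagraphs().toArray())
        {
            Paragraph para = (Paragraph) node;
            if (para.getFrameFormat().isFrame())
            {
                // Move the content into a new centered paragraph without a frame.
                Paragraph newPara = new Paragraph(input);
                newPara.getParagraphFormat().setAlignment(ParagraphAlignment.CENTER);
                headerFooter.insertAfter(newPara, para);
                while (para.getFirstChild() != null)
                    newPara.appendChild(para.getFirstChild());
                para.remove();
            }
        }
    }
}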

Docketing.pdf (104.3 KB)
McKinneyword.pdf (84.7 KB)

@Vadim.Saltykov thanks again for the help. If you look at page 5 of the McKinneyword.pdf output, it doesn’t line up with the Word document, and every page after it is misaligned as well. I’m attaching screenshots of pages 5 and 6, but none of the pages line up after page 5.

McKinneyword-Page5.png (120.9 KB)
McKinneyword-Page6.png (148.6 KB)

@jshannon Apparently, this is a separate issue that has nothing to do with the paragraph frame in the footer. I have logged it as WORDSNET-24526. We will keep you informed and let you know once it has been resolved. Please accept our apologies for the inconvenience.

@Vadim.Saltykov thank you so much for your time. Our customers will be very anxious to get these issues resolved.

The issues you found earlier (filed as WORDSNET-24526) have been fixed in the Aspose.Words for Java 23.1 update.

The issues you found earlier (filed as WORDSNET-24517) have been fixed in the Aspose.Words for Java 23.5 update.