Summary
When you call OleFormat.save() on a PDF OLE object embedded in a Word file, Aspose.Words writes the whole OLE container (including the OLE header) to disk. If you later treat that file as a plain PDF and merge it with other PDFs, the OLE header bytes are interpreted as PDF content, which results in pages full of garbled characters.
Why it happens
OLE container vs. pure PDF stream – OleFormat.save() stores the full OLE package. The file therefore begins with the OLE compound-file signature (D0 CF 11 E0 …) instead of the %PDF- marker that a PDF parser expects, so the parser misreads everything that follows.
Embedded PDF is inside the OLE data stream – The actual PDF bytes live in one of the streams inside the OLE container (for objects embedded by Acrobat this is typically a stream named CONTENTS), so the file written by save() is not itself a PDF. Extracting that stream alone avoids the extra header.
Aspose.Words behaviour – The forum posts we have show that the library currently treats PDF OLE objects as generic OLE packages, which matches the observed corruption. (The evidence does not explicitly mention a method to extract the raw PDF, so the next step is based on the documented API.)
What to do next
Extract the embedded PDF bytes directly
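// ole is an instance of com.aspose.words.OleFormat
// Compare from the constant side in case no extension is suggested for this object
if (".pdf".equalsIgnoreCase(ole.getSuggestedExtension())) {
    // Assumption: the PDF sits in the OLE stream named "CONTENTS" (typical for Acrobat objects).
    // getOleEntry() returns the bytes of one named stream, or null if that stream does not exist.
    byte[] pdfBytes = ole.getOleEntry("CONTENTS");
    if (pdfBytes != null) {
        attachmentPdfs.add(pdfBytes);
    }
}
Why: getOleEntry() returns the content of a single named stream inside the OLE object, so the container header is left out. getRawData() also exists, but it returns the object's raw OLE data and may therefore still begin with the OLE signature; check the first bytes before relying on it.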
(Optional) Validate / clean the extracted PDF
// Write to a temporary file for manual inspection
Files.write(Paths.get("temp.pdf"), pdfBytes);
// Or load with Aspose.PDF to be sure the stream is well‑formed
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(new ByteArrayInputStream(pdfBytes));
ByteArrayOutputStream cleaned = new ByteArrayOutputStream();
pdfDoc.save(cleaned);
attachmentPdfs.set(attachmentPdfs.size() - 1, cleaned.toByteArray()); // replace with cleaned stream
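Why: Re‑saving with Aspose.PDF can repair minor inconsistencies in an otherwise readable PDF. It cannot recover bytes that are not a PDF at all: if the stream still starts with the OLE signature, loading is expected to fail, which is itself a useful signal that the extraction step returned the wrong data.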
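A cheap structural check before loading or merging can catch the OLE-header case early. The following is a minimal sketch that uses only the Java standard library; the class PdfSanity and its methods are hypothetical names introduced for illustration, not part of Aspose. It assumes a plain PDF starts with %PDF- and carries an %%EOF marker within its last 1024 bytes, which holds for typical files but is not a full validity test.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not an Aspose class): quick structural check on extracted bytes.
final class PdfSanity {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    private static final int TAIL_WINDOW = 1024; // where readers usually look for %%EOF

    // True if data starts with %PDF- and has %%EOF somewhere in its last 1024 bytes.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (indexOf(data, HEADER, 0, HEADER.length) != 0) {
            return false; // e.g. the bytes still begin with the OLE signature
        }
        int from = Math.max(0, data.length - TAIL_WINDOW);
        return indexOf(data, EOF_MARKER, from, data.length) >= 0;
    }

    // First index in [from, to) at which pattern fits entirely, or -1 if absent.
    static int indexOf(byte[] data, byte[] pattern, int from, int to) {
        for (int i = from; i + pattern.length <= to; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }
}
```

In the flow above, a call such as PdfSanity.looksLikePdf(pdfBytes) could gate the attachmentPdfs.add(...) call, so that anything failing the check is logged and skipped rather than merged.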
Merge the PDFs (example using Aspose.PDF)
com.aspose.pdf.Document finalPdf = new com.aspose.pdf.Document();
for (byte[] part : attachmentPdfs) {
    com.aspose.pdf.Document partDoc = new com.aspose.pdf.Document(new ByteArrayInputStream(part));
    finalPdf.getPages().add(partDoc.getPages());
}
finalPdf.save("Result.pdf");
Why: Merging clean PDF streams avoids the OLE‑header corruption.
Consider upgrading Aspose.Words
Newer releases often contain fixes for OLE handling. If you are on 24.11, try the latest version and re‑test the extraction.
Additional troubleshooting
Open the original Word document in Word and verify that the embedded PDF opens correctly.
Inspect the first few bytes of a file saved with OleFormat.save(). You should see the OLE signature (D0 CF 11 E0). This confirms that the header is the source of the corruption.
If the extracted bytes still yield a corrupted PDF, re‑save the stream with Aspose.PDF as described in the optional validation step above.
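The byte inspection mentioned above can be sketched with the standard library alone. This is an illustration under stated assumptions, not an Aspose API: the class MagicBytes and its methods are hypothetical names, and the path given on the command line is whichever file you wrote with OleFormat.save(). It prints the first eight bytes in hex and reports whether they match the OLE compound-file signature or the %PDF- marker.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not an Aspose class): identifies a file by its leading bytes.
final class MagicBytes {
    // Signature found at the start of an OLE compound file.
    static final byte[] OLE_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // A plain PDF begins with the ASCII text "%PDF-".
    static final byte[] PDF_SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    static String classify(byte[] head) {
        if (startsWith(head, OLE_SIGNATURE)) {
            return "OLE container";
        }
        if (startsWith(head, PDF_SIGNATURE)) {
            return "PDF";
        }
        return "unknown";
    }

    // Formats bytes as space-separated two-digit uppercase hex.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java MagicBytes <file>");
            return;
        }
        byte[] head;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            head = in.readNBytes(8); // requires Java 11 or later
        }
        System.out.println(toHex(head) + " -> " + classify(head));
    }
}
```

A result of "OLE container" for a file that was meant to be merged as a PDF points to the header as the cause of the garbled pages; "PDF" means the problem lies elsewhere.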
Note: The available evidence (forum posts) confirms that PDF OLE objects are saved as full OLE containers, which produces the garbled pages described above. It does not show a method for extracting the raw PDF, and this explanation could not be fully verified against the retrieved sources. Check the OleFormat class in the official Aspose.Words for Java API reference for the exact method names available in your version, and confirm the behaviour in your environment and the release notes for your SDK version before applying this guidance.