Aspose.Words for Java: garbled pages when saving a PDF attachment

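Original requirement: convert a Word document to PDF, and also convert the attachments of various types embedded in the Word document to PDF and append them after it.
Problem: when an attachment is already in PDF format, the saved result contains many garbled pages.
Partial code: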
aspose-words: 24.11
aspose-cells: 25.1
OS: Oracle Linux Server 8.5

List<byte[]> attachmentPdfs = new ArrayList<>();

Document document = loadDocument(word);
NodeCollection shapes = document.getChildNodes(NodeType.SHAPE, true);

for (Shape shape : (Iterable<Shape>) shapes) {
    OleFormat ole = shape.getOleFormat();
    if (ole == null) {
        continue; // not an OLE object (e.g. a picture or text box)
    }
    String ext = ole.getSuggestedExtension();
    switch (ext.toLowerCase()) {
        case ".xls", ".xlsx", ".xlsm", ".xlsb" -> {
            byte[] pdfData = asposeExcelService.ole2PdfMem(ole.getRawData());
            // Collect the converted workbook, as the Word branch does
            attachmentPdfs.add(pdfData);
        }
        case ".doc", ".docx", ".docm" -> {
            // Extract the embedded Word file and load it as a separate document
            ByteArrayOutputStream ole_output = new ByteArrayOutputStream();
            ole.save(ole_output);

            InputStream oleInputStream = new ByteArrayInputStream(ole_output.toByteArray());
            Document ole_doc = new Document(oleInputStream);

            PdfSaveOptions saveOptions = new PdfSaveOptions();
            saveOptions.setCompliance(PdfCompliance.PDF_17);
            saveOptions.setUpdateFields(false);
            ByteArrayOutputStream pdf_output = new ByteArrayOutputStream();
            ole_doc.save(pdf_output, saveOptions);
            attachmentPdfs.add(pdf_output.toByteArray());
        }
        case ".pdf" -> {
            logger.info("PDF attachment detected, extension: {}", ext);
            // Debug: the appended PDF contains garbled pages
            ole.save("debug" + ext);
        }
        default -> {
            logger.warn("Unhandled file type: {}", ext);
        }
    }
}

Samples:
debug.pdf (7.9 MB)
5.doc.zip (6.1 MB)

@ztthu

Summary
When you call OleFormat.save() on a PDF OLE object embedded in a Word file, Aspose.Words writes the whole OLE container (including the OLE header) to disk. If you later treat that file as a plain PDF and merge it with other PDFs, the OLE header bytes are interpreted as PDF content, which results in pages full of garbled characters.

Why it happens

  1. OLE container vs. pure PDF stream – OleFormat.save() stores the full OLE package. The first bytes are the OLE signature (D0 CF 11 E0 …). A PDF parser does not expect this header, so it renders nonsense.
  2. Embedded PDF is inside the OLE data stream – The actual PDF bytes are available through the OLE data stream, not through save(). Extracting the raw stream avoids the extra header.
  3. Aspose.Words behaviour – The forum posts we have show that the library currently treats PDF OLE objects as generic OLE packages, which matches the observed corruption. (The evidence does not explicitly mention a method to extract the raw PDF, so the next step is based on the documented API.)
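
Whether a given byte array is a bare PDF or an OLE container can be checked without any Aspose call. The following is a minimal sketch in plain Java; the class and method names (HeaderSniffer, classify) are hypothetical and not part of any Aspose API. It compares the leading bytes against the OLE compound-file signature and against the "%PDF-" marker that normally opens a PDF file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Aspose.Words.
final class HeaderSniffer {
    // Signature that opens an OLE compound file (D0 CF 11 E0 A1 B1 1A E1).
    private static final byte[] OLE_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Marker that normally opens a PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns "pdf", "ole-container" or "unknown" based on the leading bytes only.
    static String classify(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "pdf";
        }
        if (startsWith(data, OLE_MAGIC)) {
            return "ole-container";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(sample)); // prints "pdf"
    }
}
```

Running classify on the bytes written by ole.save(...) and on the bytes returned by ole.getRawData() should show which of the two, if either, carries the container header in your version. The check only looks at the first bytes, so it says nothing about whether the rest of the PDF is intact.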

What to do next

  1. Extract the embedded PDF bytes directly
// ole is an instance of com.aspose.words.OleFormat
if (".pdf".equalsIgnoreCase(ole.getSuggestedExtension())) {
    // Assumption: getRawData() yields the embedded file's bytes for this object.
    // Confirm that they begin with "%PDF" before relying on them.
    byte[] pdfBytes = ole.getRawData();
    attachmentPdfs.add(pdfBytes);
}

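Why: the aim is to obtain only the embedded file's bytes, without the OLE header. The exact accessor should be confirmed against the API reference; getRawData(), which the original code already uses for Excel objects, is one candidate. Verify that the returned bytes start with %PDF before merging.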

  2. (Optional) Validate / clean the extracted PDF
// Write to a temporary file for manual inspection
Files.write(Paths.get("temp.pdf"), pdfBytes);
// Or load with Aspose.PDF to be sure the stream is well-formed
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(new ByteArrayInputStream(pdfBytes));
ByteArrayOutputStream cleaned = new ByteArrayOutputStream();
pdfDoc.save(cleaned);
// Replace the last collected entry with the cleaned stream
attachmentPdfs.set(attachmentPdfs.size() - 1, cleaned.toByteArray());

Why: Re‑saving with Aspose.PDF can repair minor inconsistencies and guarantees a valid PDF.
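
Before handing the bytes to Aspose.PDF, a cheap structural check can catch truncated extractions. The sketch below, in plain Java, looks for the "%%EOF" marker near the end of the data. The names PdfTrailerCheck and hasTrailer are hypothetical, and the 1024-byte search window is an assumption based on what PDF readers commonly tolerate, not a rule taken from this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aspose.Words or Aspose.PDF.
final class PdfTrailerCheck {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumption: the marker is expected within the last 1024 bytes.
    private static final int WINDOW = 1024;

    // Offset of the last occurrence of marker at or after 'from', or -1 if absent.
    static int lastIndexOf(byte[] data, byte[] marker, int from) {
        for (int i = data.length - marker.length; i >= from; i--) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // True when "%%EOF" appears inside the trailing window of the data.
    static boolean hasTrailer(byte[] data) {
        if (data == null) {
            return false;
        }
        int from = Math.max(0, data.length - WINDOW);
        return lastIndexOf(data, EOF_MARKER, from) >= 0;
    }
}
```

A false result suggests the extracted bytes were cut short or carry trailing non-PDF data, which is worth ruling out before blaming the merge step. A true result does not prove the file is valid; loading it with a PDF library remains the stronger test.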

  3. Merge the PDFs (example using Aspose.PDF)
com.aspose.pdf.Document finalPdf = new com.aspose.pdf.Document();
for (byte[] part : attachmentPdfs) {
    com.aspose.pdf.Document partDoc = new com.aspose.pdf.Document(new ByteArrayInputStream(part));
    finalPdf.getPages().add(partDoc.getPages());
}
finalPdf.save("Result.pdf");

Why: Merging clean PDF streams avoids the OLE‑header corruption.

  4. Consider upgrading Aspose.Words
  • Newer releases often contain fixes for OLE handling. If you are on 24.11, try the latest version and re-test the extraction.
  5. Additional troubleshooting
  • Open the original Word document in Word and verify that the embedded PDF opens correctly.
  • Inspect the first few bytes of a file saved with OleFormat.save(). If you see the OLE signature (D0 CF 11 E0), the header is the source of the corruption.
  • If the extracted bytes still yield a corrupted PDF, repeat step 2 to re-save the stream with Aspose.PDF.

Note: The evidence we have (forum posts) confirms that PDF OLE objects are saved as full OLE containers, leading to the described garbled pages. The retrieved evidence does not show which OleFormat accessor returns the bare embedded bytes, so the accessor used in step 1 is unverified. Please refer to the official Aspose.Words Java documentation for the exact method name and usage in your version.

Warning: I could not fully verify this explanation against the retrieved sources. Please double-check the behaviour in your environment and confirm the exact SDK/version.

SDK version unclear; refer to the official release notes before applying this guidance.

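@ztthu Unfortunately, this problem is hard to pin down. With the code and documents provided, we cannot reproduce it. The .doc file does contain a PDF with garbled content, but the content of that PDF does not come from the provided .doc file. Please try creating a simple console application that works with these files and produces the PDF with garbled content.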