When converting a PDF to Markdown text with version 24.8, lists are not converted to Markdown format

foaj · December 10, 2024, 1:32am

The code is as follows:

try (ByteArrayOutputStream docxStream = new ByteArrayOutputStream()) {
    // Step 1: Open PDF stream using Document class of Aspose.Pdf
    try (Document pdfDoc = new Document(download)) {
        // Step 2: Convert PDF to DOCX by using save method of Aspose.Pdf, but write to ByteArrayOutputStream
        pdfDoc.save(docxStream, SaveFormat.DocX);

        // Prepare ByteArrayInputStream from the DOCX bytes for Aspose.Words
        try (InputStream docxInputStream = new ByteArrayInputStream(docxStream.toByteArray());
             ByteArrayOutputStream markdownStream = new ByteArrayOutputStream()) {
            // Step 3: Load DOCX stream by using Document class of Aspose.Words
            PdfLoadOptions pdfLoadOptions = new PdfLoadOptions();
            pdfLoadOptions.setSkipPdfImages(true);
            com.aspose.words.Document wordDoc = new com.aspose.words.Document(docxInputStream, pdfLoadOptions);

            MarkdownSaveOptions markdownSaveOptions = new MarkdownSaveOptions();
            markdownSaveOptions.setExportImagesAsBase64(true);
            markdownSaveOptions.setUpdateFields(true);
            markdownSaveOptions.setSaveFormat(com.aspose.words.SaveFormat.MARKDOWN);
            // Step 4: Save the document to MARKDOWN format using Save method and set MARKDOWN as SaveFormat
            wordDoc.save(markdownStream, markdownSaveOptions);

            return GPTStringUtils.removeBase64(markdownStream.toString("UTF-8"));
        }
    }
} catch (Exception e) {
    e.printStackTrace();
    return null;
}
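
The snippet above ends by calling `GPTStringUtils.removeBase64`, whose source is not shown in this thread. For readers who want to reproduce the pipeline, here is a minimal sketch of what such a helper might do, assuming it only strips Markdown images embedded as base64 data URIs. The class name `Base64ImageStripper` and the regular expression are illustrative assumptions, not the original implementation.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for GPTStringUtils.removeBase64 (the real helper is not shown in the thread).
final class Base64ImageStripper {
    // Matches a Markdown image whose target is an inline data URI,
    // e.g. ![alt](data:image/png;base64,AAAA...)
    private static final Pattern INLINE_IMAGE =
            Pattern.compile("!\\[[^\\]]*\\]\\(data:[^;)]+;base64,[A-Za-z0-9+/=\\s]*\\)");

    // Returns the Markdown text with every inline base64 image removed.
    static String removeBase64(String markdown) {
        if (markdown == null) {
            return null;
        }
        return INLINE_IMAGE.matcher(markdown).replaceAll("");
    }

    public static void main(String[] args) {
        String md = "- item one\n![](data:image/png;base64,iVBORw0KGgo=)\n- item two\n";
        System.out.print(removeBase64(md));
    }
}
```

Images that point at a file or URL are left untouched; only `data:` targets are removed.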

vyacheslav.deryushev · December 10, 2024, 9:07am

@foaj Could you provide the input file you are using?

foaj · December 10, 2024, 9:26am

开标情况记录表_2102985593.pdf (11.4 KB)

Sure, this is the file. After running the code above, the text I get looks like this:

vyacheslav.deryushev · December 10, 2024, 12:38pm

@foaj It looks like Aspose.Pdf saves the data to DOCX as a set of frames, which may be causing the problem. Try updating the PDF code as follows:

com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document("input.pdf");

com.aspose.pdf.DocSaveOptions saveOptions = new com.aspose.pdf.DocSaveOptions();
saveOptions.setFormat(com.aspose.pdf.DocSaveOptions.DocFormat.DocX);
saveOptions.setMode(com.aspose.pdf.DocSaveOptions.RecognitionMode.Flow);
saveOptions.setRelativeHorizontalProximity(2.5f);
saveOptions.setRecognizeBullets(true);

pdfDoc.save("output.docx", saveOptions);
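
To check quickly whether bullet recognition made a difference, one option is to count the list items in the Markdown that is eventually produced. The following is a rough sketch using only the Java standard library (Java 11 or later for `String.lines()`); the class name `MarkdownListCheck` and the pattern are assumptions for illustration and are not part of any Aspose API.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a heuristic count of Markdown list items in converted text.
final class MarkdownListCheck {
    // A list item: optional indentation, then "-", "*", "+" or a number followed by "." or ")",
    // then whitespace and at least one non-blank character.
    private static final Pattern LIST_ITEM =
            Pattern.compile("^\\s*(?:[-*+]|\\d+[.)])\\s+\\S.*$");

    // Counts the lines that look like bulleted or numbered list items.
    static long countListItems(String markdown) {
        return markdown.lines()
                .filter(line -> LIST_ITEM.matcher(line).matches())
                .count();
    }

    public static void main(String[] args) {
        String md = "- a\n1. b\ntext\n| x | y |\n  * c\n";
        System.out.println(countListItems(md));
    }
}
```

A count of zero on a document that visibly contains lists suggests the bullets were not recognized. The check is only a heuristic: a thematic break written as `* * *`, for example, would be counted as a list item.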

foaj · December 11, 2024, 1:22am

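Thank you very much. After adjusting the code as you suggested, I do see Markdown formatting, but it does not look quite accurate: the separators are not in the middle of the text.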

vyacheslav.deryushev · December 11, 2024, 7:03am

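@foaj Thank you for reporting this issue. We have opened the following new ticket in our internal issue tracking system and will deliver the fix according to the terms mentioned in our Free Support Policies: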

Issue ID(s): WORDSNET-27672

If you need priority support and direct access to our Paid Support management team, you can obtain Paid Support Services.

foaj · December 11, 2024, 7:24am

OK, thank you.

vyacheslav.deryushev · January 20, 2025, 8:52am

@foaj Please check the following code to get a more correct result:

com.aspose.words.Document doc = new com.aspose.words.Document("input.docx");

MarkdownSaveOptions saveOptions = new MarkdownSaveOptions();
saveOptions.setExportAsHtml(MarkdownExportAsHtml.TABLES);

doc.save("output.md", saveOptions);

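Unfortunately, we cannot provide a more accurate result, because Aspose.PDF does not produce fully correct tables in its DOCX output: the input DOCX file uses images as the table grid.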

aspose.notifier · March 19, 2025, 3:18pm

The issues you have found earlier (filed as WORDSNET-27672) have been fixed in this Aspose.Words for Java 25.3 update.