Does Aspose.PDF PDF-to-DOC conversion support OCR of the document content?

wangjianye · June 4, 2024, 1:57am

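Hello, we are testing the Aspose.PDF conversion feature and found that when the source is a scanned document, the conversion does not apply OCR automatically.

Our goal is to preserve the tables, images, and paragraphs of the PDF and reproduce them as completely as possible in the new DOC file.
Thank you.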

asad.ali · June 4, 2024, 10:29am

@wangjianye

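Could you please provide the following information for our reference so that we can proceed accordingly?

A sample input file
The sample code snippet you are using
The generated output and the expected output
A screenshot of the issue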

wangjianye · June 4, 2024, 10:52am

提取自《电力工程基本术语标准》 GBT50297-2018(2).pdf (3.2 MB)

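Hello, this is the document we are testing with. At the moment the pages are carried over into the DOC as images. We would like OCR to handle cases like this as far as possible so that the original text can be recovered.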

asad.ali · June 4, 2024, 6:30pm

@wangjianye

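Thank you for sharing the sample document. Could you also share the sample code snippet you use to perform the operation? We will then be able to test the scenario in our environment and address it accordingly.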

wangjianye · June 5, 2024, 1:16am

// Requires: using System.IO; using System.Windows.Forms; using Aspose.Pdf;
string sFilePath = "";
using (OpenFileDialog openFileDialog = new OpenFileDialog())
{
    openFileDialog.Title = "Select the proposal PDF file";
    // Filter syntax is "description|pattern"; the pattern needs the * wildcard.
    openFileDialog.Filter = "PDF files (*.pdf)|*.pdf";
    openFileDialog.FilterIndex = 1;

    if (openFileDialog.ShowDialog() != DialogResult.OK)
        return;

    sFilePath = openFileDialog.FileName;
}

string sDirPath = Path.GetDirectoryName(sFilePath);

Document pdfDocument = new Document(sFilePath);

// Save the PDF in DOCX format
DocSaveOptions docSaveOpt = new DocSaveOptions();
docSaveOpt.Mode = DocSaveOptions.RecognitionMode.Flow;
docSaveOpt.RecognizeBullets = true;
docSaveOpt.Format = DocSaveOptions.DocFormat.DocX;

// Write output.docx next to the input file
pdfDocument.Save(Path.Combine(sDirPath, "output.docx"), docSaveOpt);

Hello, this is our test code.

asad.ali · June 5, 2024, 9:42am

@wangjianye

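As we understand the scenario, you want to perform OCR on a scanned PDF document and obtain output text with the same formatting. We tried to do this with Aspose.OCR for .NET, but the output PDF contains garbage characters: the Chinese characters are not recognized correctly. We used the following code snippet: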

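// Requires: using System; using System.Collections.Generic;
// dataDir is the folder that holds the sample PDF (defined elsewhere).
try
{
    Aspose.OCR.AsposeOcr api = new Aspose.OCR.AsposeOcr();

    // Load the scanned PDF as OCR input
    Aspose.OCR.OcrInput ocrInputPdf = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.PDF);
    ocrInputPdf.Add(dataDir + "提取自《电力工程基本术语标准》 GBT50297-2018(2).pdf");

    // Recognize Chinese text using document-style area detection
    List<Aspose.OCR.RecognitionResult> resultPdf = api.Recognize(
        ocrInputPdf,
        new Aspose.OCR.RecognitionSettings
        {
            DetectAreasMode = Aspose.OCR.DetectAreasMode.DOCUMENT,
            Language = Aspose.OCR.Language.Chi
        });

    // Save once with the page images and once with the recognized text only
    Aspose.OCR.AsposeOcr.SaveMultipageDocument(dataDir + "searchablePdf.pdf", Aspose.OCR.SaveFormat.Pdf, resultPdf);
    Aspose.OCR.AsposeOcr.SaveMultipageDocument(dataDir + "searchablePdfNoImg.pdf", Aspose.OCR.SaveFormat.PdfNoImg, resultPdf);
}
catch (Exception)
{
    // Rethrow without resetting the stack trace
    throw;
}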

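We have logged this issue in our issue tracking system as OCRNET-852 for further analysis. We will investigate the details and keep you informed about the status of the fix. Please be patient and give us some time.

We apologize for the inconvenience.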

asad.ali · June 20, 2024, 7:04pm

@wangjianye

We would like to inform you that this issue has been resolved in version 24.6.0.
The ability to add a custom font has been added:

SaveMultipageDocument(string fullFileName, SaveFormat saveFormat, List<RecognitionResult> results, string embeddedFontPath = null)
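The pieces from this thread can be combined into a rough two-step sketch: run OCR with the new `embeddedFontPath` argument to produce a searchable PDF, then feed that PDF to the original `DocSaveOptions` conversion. This is only an illustration built from the calls already shown above. The class name `ScannedPdfToDocx`, the method name, its parameters, and the font file you pass in are hypothetical, and the thread does not confirm how well tables and layout survive in the resulting DOCX.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.OCR;

// Hypothetical helper, not part of either Aspose library.
public static class ScannedPdfToDocx
{
    // scannedPdfPath: the scanned input PDF
    // fontPath: a font file that covers Chinese glyphs (assumption: you supply one)
    // outputDir: folder for the intermediate PDF and the final DOCX
    public static void Convert(string scannedPdfPath, string fontPath, string outputDir)
    {
        // Step 1: OCR the scanned PDF, using the same settings as the staff snippet.
        AsposeOcr api = new AsposeOcr();
        OcrInput input = new OcrInput(InputType.PDF);
        input.Add(scannedPdfPath);

        List<RecognitionResult> results = api.Recognize(
            input,
            new RecognitionSettings
            {
                DetectAreasMode = DetectAreasMode.DOCUMENT,
                Language = Language.Chi
            });

        // Pass the font through the embeddedFontPath parameter added in 24.6.0.
        string searchablePdf = Path.Combine(outputDir, "searchablePdf.pdf");
        AsposeOcr.SaveMultipageDocument(searchablePdf, SaveFormat.Pdf, results, fontPath);

        // Step 2: convert the searchable PDF to DOCX with the original options.
        // Aspose.Pdf types are fully qualified to avoid clashing with Aspose.OCR.SaveFormat.
        using (Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(searchablePdf))
        {
            Aspose.Pdf.DocSaveOptions options = new Aspose.Pdf.DocSaveOptions
            {
                Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow,
                RecognizeBullets = true,
                Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX
            };
            pdfDocument.Save(Path.Combine(outputDir, "output.docx"), options);
        }
    }
}
```

If the page images get in the way of the flow conversion, the `SaveFormat.PdfNoImg` variant from the earlier snippet may be worth trying as the intermediate file instead.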