Extracting text from a PDF takes a very long time. Is there any way to speed it up?

jcing · October 15, 2018, 7:25am

The PDF file is just over 9 MB, and after waiting 3 hours the processing still had not finished.

    Document pdfDocument = new Document(outPath + PDF_FILENAME);
    PageCollection pageCollection = pdfDocument.getPages();
    if (pageCollection == null) {
        System.out.println("is null!");
        return;
    }
    System.out.printf("run2.1... \n");

    PrintWriter out = new PrintWriter(outFile);
    try {
        System.out.printf("run2.2... \n");
        // Run a single absorber over every page of the document at once.
        TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
        pageCollection.accept(textAbsorber);
        System.out.printf("run2.3... \n");
        // Execution stalls here: "run2.4" is never printed.
        String extractedText = textAbsorber.getText();
        System.out.printf("run2.4... \n");
        out.println(extractedText);
        mResult.put(TEXT_STATUS_KEY, SUCCESS);
        mResult.put(TEXT_MESSAGE_KEY, TEXT_FILENAME);
    } finally {
        pdfDocument.close();
        out.flush();
        out.close();
    }

It prints run2.3... and then never gets past textAbsorber.getText()!

Farhan.Raza · October 15, 2018, 12:09pm

@jcing

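Thank you for contacting support.

Could you upload the source PDF file to Google Drive, Dropbox, or a similar service so that we can try to reproduce and investigate the issue in our environment? Before sharing the requested data, please make sure you are using Aspose.PDF for Java 18.9.1.

Please also share your environment details, including the JDK/JRE version, operating system details, and so on.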

jcing · October 15, 2018, 2:39pm

dbc83203bb9b4eb1e637867f9f272d4f.pdf.zip (8.4 MB)

jcing · October 15, 2018, 2:42pm

JDK: openjdk version "1.8.0_171"
OS: Linux version 4.15.0-29deepin-generic

jcing · October 16, 2018, 2:02am

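Last night I ran 7 servers, each with 32 GB of RAM and a 24-core CPU, to process PDFs. In about 12 hours they got through only around 800 files!
I have more than 50,000 PDFs that need to be converted in bulk. At this speed, it feels like the whole project is going to fall through.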

jcing · October 16, 2018, 2:05am

image.png (319.3 KB)
The CPU is at 100%, but no results come out. This is driving me crazy!

Farhan.Raza · October 16, 2018, 5:21am

@jcing

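Thank you for contacting support.

We have noticed the issue with TextFragmentAbsorber and have logged a ticket with ID PDFJAVA-38063 in our issue management system for further investigation and resolution. The ticket ID has been linked to this thread so that you will be notified as soon as the ticket is resolved.

In addition, please describe in detail how you are transferring the PDF files and encountering the speed problem, so that we can investigate it and help you.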

jcing · October 16, 2018, 9:20am

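> In addition, please describe in detail how you are transferring the PDF files and encountering the speed problem, so that we can investigate it and help you.

I don't really understand what you are asking here.
What does "how you are transferring the PDF files" mean?
"Encountering the speed problem"? It is as slow as a tortoise!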

jcing · October 16, 2018, 2:55pm

image.png (47.0 KB)
CPU usage is over 1000, and it has been processing for 101 hours!

Farhan.Raza · October 16, 2018, 9:54pm

@jcing

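Please describe the problem in detail and share a sample application, so that we can try to reproduce and investigate it in our environment.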

Farhan.Raza · October 25, 2018, 8:49am

@jcing

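Regarding PDFJAVA-38063, a problem has been found in the heap memory usage of the text extraction algorithm. We are investigating it further.

In the meantime, below is a workaround that extracts the text page by page. It takes 1 minute 20 seconds on our machine.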

    String PDF_FILENAME = "dbc83203bb9b4eb1e637867f9f272d4f";
    Document pdfDocument = new Document(myDir + PDF_FILENAME + ".pdf");
    PageCollection pageCollection = pdfDocument.getPages();
    if (pageCollection == null) {
        System.out.println("is null!");
        return;
    }
    System.out.printf("run2.1... \n");
    int pagesCount = pageCollection.size();
    PrintWriter out = new PrintWriter(myDir + PDF_FILENAME + "_text.txt");
    try {
        System.out.printf("run2.2... \n");
        StringBuilder extractedText = new StringBuilder();
        // Page numbers are 1-based.
        for (int i = 1; i <= pagesCount; i++) {
            Page p = pdfDocument.getPages().get_Item(i);
            // Use a fresh absorber for each page instead of one absorber for the whole document.
            TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));

            p.accept(textAbsorber);
            System.out.printf("run2.3... for page: %d%n", p.getNumber());
            extractedText.append(textAbsorber.getText());
            // Dispose of the document and reopen it before moving on to the next page.
            pdfDocument.dispose();
            pdfDocument = new Document(myDir + PDF_FILENAME + ".pdf");
        }
        System.out.printf("run2.4... \n");
        out.println(extractedText.toString());
    } finally {
        pdfDocument.close();
        out.flush();
        out.close();
    }
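
For a batch like the 50,000 files mentioned earlier in this thread, one option would be to run the per-file work on a thread pool with a time limit per file, so that a single stuck document is recorded and skipped rather than stalling the whole run. The following is only a minimal sketch in plain Java, not something from the Aspose.PDF API: `BatchRunner`, `runAll`, and the `processFile` callback are hypothetical names, and the callback is assumed to wrap the page-by-page extraction for one file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Hypothetical helper for illustration; it is not part of Aspose.PDF. */
final class BatchRunner {

    /**
     * Runs processFile on every file using a fixed-size thread pool and
     * returns one status per file: "ok", "timeout", "failed: ..." or "interrupted".
     */
    public static Map<String, String> runAll(List<String> files,
                                             Consumer<String> processFile,
                                             int threads,
                                             long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, Future<?>> futures = new LinkedHashMap<>();
        Map<String, String> status = new LinkedHashMap<>();
        try {
            // Queue every file; at most `threads` of them run at the same time.
            for (String file : files) {
                futures.put(file, pool.submit(() -> processFile.accept(file)));
            }
            for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
                try {
                    // The limit is counted from the moment we start waiting on this
                    // file, so it bounds the waiting time, not the exact run time.
                    entry.getValue().get(timeoutSeconds, TimeUnit.SECONDS);
                    status.put(entry.getKey(), "ok");
                } catch (TimeoutException e) {
                    // cancel(true) only interrupts the worker thread. Code that never
                    // checks for interruption keeps running and keeps its thread busy.
                    entry.getValue().cancel(true);
                    status.put(entry.getKey(), "timeout");
                } catch (ExecutionException e) {
                    // The callback threw; record the cause and continue with the rest.
                    status.put(entry.getKey(), "failed: " + e.getCause());
                } catch (InterruptedException e) {
                    // Restore the flag; later waits will then also report "interrupted".
                    Thread.currentThread().interrupt();
                    status.put(entry.getKey(), "interrupted");
                }
            }
        } finally {
            // Interrupt whatever is still running and discard queued work.
            pool.shutdownNow();
        }
        return status;
    }
}
```

A caller would pass a callback that opens one PDF, extracts its text page by page, and writes the output file, then log or retry the entries reported as `timeout`. Because an interrupt cannot force a busy library call to stop, running each file in a separate process may be the more robust choice when a hard limit is needed.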

jcing · October 25, 2018, 11:42am

That works! Thank you! It is much faster!

jcing · October 25, 2018, 11:47am

    Document pdfDocument = new Document("/data/1.pdf");
    pdfDocument.setEmbedStandardFonts(true);
    HtmlSaveOptions saveOptions = new HtmlSaveOptions();
    saveOptions.setFixedLayout(true);
    saveOptions.setSplitIntoPages(true);
    saveOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;
    saveOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
    saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
    pdfDocument.save(mHtmlPath.getAbsolutePath() + "/out.html", saveOptions);

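This code is also very slow. Is there a way to speed it up with the same approach as above, exporting one page at a time and then recreating the Document?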

jcing · October 25, 2018, 1:30pm

    Document pdfDocument = new Document(outPath + PDF_FILENAME);
    PageCollection pageCollection = pdfDocument.getPages();
    if (pageCollection == null) {
        System.out.println("is null!");
        return;
    }
    int pagesCount = pageCollection.size();
    for (int i = 1; i <= pagesCount; i++) {
        Page p = pdfDocument.getPages().get_Item(i);
        Document tmpDoc = new Document();
        tmpDoc.setEmbedStandardFonts(true);
        tmpDoc.getPages().add(p);
        HtmlSaveOptions saveOptions = new HtmlSaveOptions();
        saveOptions.setFixedLayout(true);
        saveOptions.setSplitIntoPages(true);
        saveOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;
        saveOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
        saveOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        tmpDoc.save(mHtmlPath.getAbsolutePath() + "/" + Integer.toString(i) + ".html", saveOptions);
        pdfDocument.close();
        pdfDocument.dispose();
        tmpDoc.dispose();
        pdfDocument = new Document(outPath + PDF_FILENAME);
    }

jcing · October 25, 2018, 1:31pm

It basically works now! This speed is acceptable!

Farhan.Raza · October 25, 2018, 11:11pm

@jcing

Thank you for your feedback.

We are glad to know that it is now working properly in your environment.

aspose.notifier · December 2, 2018, 8:27pm

The issues you have found earlier (filed as PDFJAVA-38063) have been fixed in Aspose.PDF for Java 18.11.

jcing · December 8, 2018, 1:16pm

OK, thanks! I will try it tomorrow!