Error when saving a PDF as text!

jcing · July 24, 2018, 3:24pm

Document pdfDocument = new Document("/data/temp/1.pdf");
pdfDocument.save(outFile, SaveFormat.TeX);
Error message:
class com.aspose.pdf.internal.ms.System.I01: Specified node 'com.aspose.pdf.internal.l154I.I11@4346808' should has Pen or Brush for it wrapper node
com.aspose.pdf.internal.l1521.I7.lI(Unknown Source)
com.aspose.pdf.internal.l152I.I01.lif(Unknown Source)
com.aspose.pdf.internal.l1521.I7.ll(Unknown Source)
com.aspose.pdf.internal.l152I.I1.ll(Unknown Source)
com.aspose.pdf.internal.l152I.I1.ll(Unknown Source)
com.aspose.pdf.internal.l152I.I1.ll(Unknown Source)
com.aspose.pdf.internal.l152I.II.ll(Unknown Source)
com.aspose.pdf.internal.l152I.I1.ll(Unknown Source)
com.aspose.pdf.internal.l152I.II.ll(Unknown Source)
com.aspose.pdf.internal.l152I.I1.ll(Unknown Source)
com.aspose.pdf.internal.l152I.I0l.ll(Unknown Source)
com.aspose.pdf.internal.l152I.I1.ll(Unknown Source)
com.aspose.pdf.internal.l152l.Il.lif(Unknown Source)
com.aspose.pdf.internal.l152l.Il.lif(Unknown Source)
com.aspose.pdf.internal.l152l.Il.lif(Unknown Source)
com.aspose.pdf.I221.lif(Unknown Source)
com.aspose.pdf.I221.lif(Unknown Source)
com.aspose.pdf.ADocument.lI(Unknown Source)
com.aspose.pdf.ADocument.lI(Unknown Source)
com.aspose.pdf.ADocument.ll(Unknown Source)
com.aspose.pdf.Document.ll(Unknown Source)
com.aspose.pdf.ADocument.lif(Unknown Source)
com.aspose.pdf.ADocument.save(Unknown Source)
com.aspose.pdf.Document.save(Unknown Source)

asad.ali · July 24, 2018, 6:28pm

@jcing

We have replied to your similar query in this forum thread. You can follow up there or post your reply here.

jcing · July 25, 2018, 12:56am

I followed the reply, but the problem persists. Runtime environment:
java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
Java launch command:
java -Dfile.encoding=UTF-8 -Xms512m -Xmx2048m -classpath PdfConver.jar:commons-io-2.4.jar:aspose-pdf-18.6.jar Main /data/xiangdang/xiangdang_storage_pdf/tmp/df57ce14ed0125688a4c807d4c91a1ef.pdf
OS version:
It fails on both CentOS Linux release 7.4.1708 and Alpine Linux (the one it will finally run on).

As requested, I changed the code to:
LaTeXSaveOptions saveOptions = new LaTeXSaveOptions();
pdfDocument.save(outFile, saveOptions);

I also tried the following approach:
PageCollection pageCollection = pdfDocument.getPages();
if (pageCollection == null) {
    System.out.println("is null!");
    return;
}
System.out.println("run 2.2");
PrintWriter out = new PrintWriter(outFile);
try {
    System.out.println("run 2.3");
    pdfDocument.setEmbedStandardFonts(true);
    System.out.println("run 2.4");
    pdfDocument.optimizeResources();
    System.out.println("run 2.5");
    TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
    System.out.println("run 2.6");
    pageCollection.accept(textAbsorber);
    System.out.println("run 2.7");
    String extractedText = textAbsorber.getText();
    System.out.println("run 2.8");
    System.out.printf(extractedText);

    out.println(extractedText);

    mResult.put(TEXT_STATUS_KEY, SUCCESS);
    mResult.put(TEXT_MESSAGE_KEY, TEXT_FILENAME);
} finally {
    System.out.println("run 2.9");
    pdfDocument.close();
    System.out.println("run 2.10");
    out.flush();
    out.close();
}
This also throws an error! 1.pdf.zip (9.9 MB)

jcing · July 25, 2018, 2:48am

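I tried a different PDF with pdfDocument.save(outFile, new LaTeXSaveOptions());
The output is not what I want. I only want to extract the text from the PDF; I don't need styles, colors, and so on. See the attachment: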
1.txt.zip (1011.9 KB)
I then switched to the following approach:
PageCollection pageCollection = pdfDocument.getPages();
pdfDocument.setEmbedStandardFonts(true);
pdfDocument.optimizeResources();
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
pageCollection.accept(textAbsorber);
String extractedText = textAbsorber.getText();

An exception is thrown during conversion:
java.util.UnknownFormatConversionException: Conversion = '的'

Here is the PDF file:
1.pdf.zip (9.0 MB)

asad.ali · July 25, 2018, 10:33am

@jcing

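We regret the confusion in understanding the original scenario. Since you shared SaveFormat.TeX in your first post, we were under the impression that you wanted to generate a LaTeX file. Now that you have confirmed that you want to generate a TXT file from the PDF, we will share our feedback shortly.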

jcing · July 25, 2018, 11:24am

An exception is thrown during conversion:
java.util.UnknownFormatConversionException: Conversion = '的'
How can I fix this?

asad.ali · July 25, 2018, 11:54am

@jcing

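Please note that SaveFormat.TeX represents the LaTeX format and cannot be used for .txt or plain-text files. To generate a .txt file from a PDF, please use the following code snippet. The output .txt file is also attached for your reference.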

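import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextExtractionOptions;
import java.io.FileNotFoundException;
import java.io.PrintWriter;

// dataDir is the folder that contains the input PDF.
Document doc = new Document(dataDir + "1.pdf");
// Extract the text using the Raw formatting mode.
TextAbsorber absorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));
doc.getPages().accept(absorber);
String text = absorber.getText();
// Write the extracted text to a .txt file.
try (PrintWriter out = new PrintWriter(dataDir + "output.txt")) {
	out.println(text);
} catch (FileNotFoundException e) {
	// The output file could not be created or opened.
	e.printStackTrace();
}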
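
Note that the snippet above writes the text with println and never passes it to printf. The java.util.UnknownFormatConversionException reported earlier is most likely raised by the System.out.printf(extractedText) call in the earlier code rather than by the extraction itself: printf treats its first argument as a format string, so a % in the extracted text that is followed by 的 is read as an invalid conversion. Below is a minimal sketch that reproduces this with the JDK alone. The class name, the method name, and the sample string (a percent sign followed by the character 的, written as \u7684) are hypothetical and only for illustration.

```java
import java.util.UnknownFormatConversionException;

public class PercentDemo {
    // Returns true if the text is rejected when it is used as a format string.
    // String.format parses the format string the same way System.out.printf does.
    static boolean failsAsFormat(String text) {
        try {
            String.format(text);
            return false;
        } catch (UnknownFormatConversionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for absorber.getText(): "100%" followed by U+7684.
        String extractedText = "100%\u7684";

        // Reports whether the text would break printf(extractedText).
        System.out.println(failsAsFormat(extractedText));

        // Safe alternatives: print the text as data, not as a format string.
        System.out.println(extractedText);
        System.out.printf("%s%n", extractedText);
    }
}
```

If the text has to go through printf, passing it as an argument to a fixed "%s" format keeps any % characters in the PDF content from being interpreted.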

output.zip (163.5 KB)
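
Please also note that new PrintWriter(fileName) encodes the text with the platform default charset, which can differ between systems. Whether this affects your environment is an assumption on our side, not a confirmed cause. If the output file shows garbled Chinese characters, a sketch along the following lines writes the text as UTF-8 explicitly, using only the JDK. The class name, method names, temporary file, and sample string are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8TextWriter {
    // Writes the text as UTF-8, regardless of the platform default charset.
    static void writeUtf8(Path target, String text) {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
            out.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the text to a temporary file and reads it back as UTF-8.
    // Used here only to check that the content survives the round trip.
    static String roundTrip(String text) {
        try {
            Path target = Files.createTempFile("output", ".txt");
            try {
                writeUtf8(target, text);
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for absorber.getText().
        String text = "100%\u7684";
        System.out.println(roundTrip(text).startsWith(text));
    }
}
```

In real use, writeUtf8 would be called with the desired output path and the string returned by absorber.getText().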

If you still face any issue while using the suggested code snippet, please feel free to let us know.

jcing · July 29, 2018, 1:54am

Thanks. It doesn't feel very stable, though: when I tried it on a different machine, the error did not occur!

asad.ali · July 29, 2018, 11:53am

@jcing

Thank you for getting back to us.

If you experience any issue with the suggested approach, please describe the issue to us.