Aspose.PDF & Aspose.Words for Java: parsing document structure

changeo · August 17, 2021, 11:43am

How can I parse out the hierarchical structure of a document?
This is how I parse the DOCX:

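Document doc = new Document(path + filename + "." + suffix);
// List labels are only computed once this has been called.
doc.updateListLabels();
for (Section section : doc.getSections().toArray()) {
    for (Paragraph para : section.getBody().getParagraphs().toArray()) {
        // Only list items carry a numbering label; other paragraphs get an empty one.
        String label = para.getListFormat().isListItem() ? para.getListLabel().getLabelString() : "";
        System.out.println(label + " " + para.getText().trim());
    }
}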

And this is how I parse the PDF:

Document doc = new Document(path+filename+"."+suffix);
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
doc.getPages().accept(textFragmentAbsorber);
String content1 = textFragmentAbsorber.getText();
System.out.println(content1);

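The parsed content comes out flat, with no hierarchy (for example: section 1.1.1 belongs to section 1.1, and section 1.1 belongs to section 1).
Is there a ready-made method that can parse out the hierarchy, or does this need some custom development?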

asad.ali · August 17, 2021, 6:40pm

@changeo

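Could you share a sample document and a screenshot of the results you are getting, for our reference? We will test the scenario further in our environment and address it accordingly.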

changeo · August 18, 2021, 1:42am

@asad.ali

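Hello,
Here are my files:
详细设计说明书模板.docx (32.1 KB)
详细设计说明书模板.pdf (448.9 KB)
Printed output from parsing the Word file (excerpt):
image.png (3.9 KB)

Printed output from parsing the PDF (excerpt):
image.png (5.4 KB)

As you can see, the relationship between paragraph 7 and paragraphs 7.1 and 7.2 cannot be obtained directly from the structure.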
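
Since the labels themselves already encode the nesting (7.1 sits under 7), one possible workaround is to derive each paragraph's parent from its label string. The sketch below is plain Java with no Aspose calls. `LabelHierarchy` and its methods are hypothetical names, and it assumes the labels are dotted decimal numbers (such as the strings printed by the DOCX loop), possibly with a trailing dot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Aspose: derives parent/child links from dotted labels.
class LabelHierarchy {

    // Strips whitespace and one trailing dot, so "7.1." and "7.1" compare equal.
    static String normalize(String label) {
        String s = label.trim();
        return s.endsWith(".") ? s.substring(0, s.length() - 1) : s;
    }

    // "7.1.2" -> "7.1"; a top-level label such as "7" or "7." has no parent (null).
    static String parentLabel(String label) {
        String clean = normalize(label);
        int lastDot = clean.lastIndexOf('.');
        return lastDot < 0 ? null : clean.substring(0, lastDot);
    }

    // Maps every label to the labels directly beneath it, preserving document order.
    static Map<String, List<String>> childrenByParent(List<String> labels) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (String raw : labels) {
            String label = normalize(raw);
            children.putIfAbsent(label, new ArrayList<>());
            String parent = parentLabel(label);
            if (parent != null) {
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(label);
            }
        }
        return children;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("7", "7.1", "7.2", "7.2.1", "8");
        childrenByParent(labels).forEach((parent, kids) ->
                System.out.println(parent + " -> " + kids));
    }
}
```

This only works when the numbering is present in the label text; paragraphs without a label (plain body text) would have to be attached to the most recent labelled paragraph by the caller.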

asad.ali · August 18, 2021, 9:46pm

@changeo

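Please try the following code to extract text from the PDF with formatting; we tested it in our environment and the results were better. A screenshot is attached for your reference.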

Document doc = new Document(dataDir + "详细设计说明书模板.pdf");
TextAbsorber ta = new TextAbsorber();
ta.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
doc.getPages().accept(ta);
String content1 = ta.getText();
System.out.println(content1);

textextraction.png (8.1 KB)
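
Text extracted from a PDF is still a plain string, so any hierarchy has to be inferred from the text itself. As a rough sketch, assuming headings in the extracted text start with a dotted number followed by whitespace, the nesting depth can be read with a regular expression. `HeadingScanner` and its methods are hypothetical names, not Aspose API, and ordinary lines that happen to start with a number (for example "2021 annual report") would be misdetected, so real documents will likely need extra filtering.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses heading depth from numbered lines in plain extracted text.
class HeadingScanner {

    // A dotted number ("7", "7.1", "7.1.2"), optional trailing dot, whitespace, then a title.
    private static final Pattern NUMBERED =
            Pattern.compile("^\\s*(\\d+(?:\\.\\d+)*)\\.?\\s+(\\S.*)$");

    // Returns 1 for "7 ...", 2 for "7.1 ...", and 0 if the line is not a numbered heading.
    static int depthOf(String line) {
        Matcher m = NUMBERED.matcher(line);
        return m.matches() ? m.group(1).split("\\.").length : 0;
    }

    // Re-renders the text with headings indented by depth and body lines under the last heading.
    static String indent(String text) {
        StringBuilder out = new StringBuilder();
        int current = 0; // depth of the most recent heading seen so far
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            int depth = depthOf(line);
            if (depth > 0) {
                current = depth;
            }
            // Headings sit at (depth - 1) indent levels; body text goes one level deeper.
            int level = depth > 0 ? depth - 1 : current;
            for (int i = 0; i < level; i++) {
                out.append("  ");
            }
            out.append(line.trim()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String extracted = "7 Interface design\nOverview text\n7.1 External interfaces\n7.2 Internal interfaces\n";
        System.out.print(indent(extracted));
    }
}
```

In practice the `content1` string from the snippet above would be passed to `indent`. Headings written without a space after the number will not match this pattern and would need a looser expression.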

We are looking into the query from the Aspose.Words side and will update you shortly.

awais.hafeez · August 19, 2021, 6:16am

@changeo,

You can use the following Aspose.Words for Java code to parse the content under each table of contents (TOC field) entry in a Word DOCX document.

Document doc = new Document("C:\\temp\\word.docx");
doc.updateListLabels();

DocumentBuilder builder = new DocumentBuilder();
builder.getFont().setName("SimSun");

ArrayList listOfParagraphs = new ArrayList();
for (Field field : (Iterable<Field>) doc.getRange().getFields()) {
    if (field.getType() == (FieldType.FIELD_HYPERLINK)) {
        FieldHyperlink hyperlink = (FieldHyperlink) field;
        if (hyperlink.getSubAddress() != null && hyperlink.getSubAddress().startsWith("_Toc")) {
            Paragraph tocItem = (Paragraph) field.getStart().getAncestor(NodeType.PARAGRAPH);

            Bookmark bm = doc.getRange().getBookmarks().get(hyperlink.getSubAddress());
            // Skip TOC links whose target bookmark is missing from the document
            if (bm == null) continue;
            // Get the paragraph this TOC item points to
            Paragraph pointer = (Paragraph) bm.getBookmarkEnd().getAncestor(NodeType.PARAGRAPH);
            listOfParagraphs.add(pointer);
        }
    }
}

for (int i = 0; i < listOfParagraphs.size(); i++) {
    Paragraph startPara = (Paragraph) listOfParagraphs.get(i);
    // For the last entry endPara stays null, so the walk below runs to the end
    // of the body and the final paragraph is not dropped.
    Paragraph endPara = null;

    if (i + 1 < listOfParagraphs.size())
        endPara = (Paragraph) listOfParagraphs.get(i + 1);

    for (Node node = startPara; node != endPara && node != null; node = node.getNextSibling())
        builder.writeln(node.toString(SaveFormat.TEXT).trim());

    builder.insertBreak(BreakType.PAGE_BREAK);
}

builder.getDocument().save("C:\\temp\\awjava-21.8.docx");
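
The code above yields one flat chunk per TOC entry. To nest those chunks into a tree, one option is a stack keyed on heading level. The sketch below is plain Java and independent of Aspose. `OutlineNode` and `OutlineBuilder` are hypothetical names, and the integer level for each heading is an assumption: it would have to come from wherever the document exposes it, for example the heading style or outline level of each target paragraph.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical tree node: one heading plus the headings nested beneath it.
class OutlineNode {
    final int level;      // 1 = top-level heading; 0 is reserved for the synthetic root
    final String title;
    final List<OutlineNode> children = new ArrayList<>();

    OutlineNode(int level, String title) {
        this.level = level;
        this.title = title;
    }
}

class OutlineBuilder {

    // Builds a tree from headings given in document order; levels[i] belongs to titles[i].
    static OutlineNode build(int[] levels, String[] titles) {
        OutlineNode root = new OutlineNode(0, "ROOT");
        Deque<OutlineNode> stack = new ArrayDeque<>();
        stack.push(root);
        for (int i = 0; i < levels.length; i++) {
            // Clamp to 1 so the synthetic root (level 0) is never popped.
            OutlineNode node = new OutlineNode(Math.max(1, levels[i]), titles[i]);
            // Pop until the top of the stack is strictly shallower: that node is the parent.
            while (stack.peek().level >= node.level) {
                stack.pop();
            }
            stack.peek().children.add(node);
            stack.push(node);
        }
        return root;
    }

    // Prints the tree with two spaces of indentation per nesting level.
    static void print(OutlineNode node, String indent) {
        for (OutlineNode child : node.children) {
            System.out.println(indent + child.title);
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) {
        int[] levels = {1, 2, 2, 3, 1};
        String[] titles = {"7", "7.1", "7.2", "7.2.1", "8"};
        print(build(levels, titles), "");
    }
}
```

A heading that skips a level (for example a level 3 directly after a level 1) is simply attached to the nearest shallower heading, which is a design choice of this sketch and may not suit every document.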