How can I deal with overlapping nodes?

Madecho · November 4, 2024, 5:42am

I’m processing a cover file that is laid out with a table.
However, I found that my code handles the same content twice.
Here’s my code:

      private LinkedHashMap[] getZhEnCoverVal(Document analysedDoc, CoverFormDTO coverFormDTO) {
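        // Title, degree-related subtitle, training unit, applicant, discipline, supervisor, supervisor title, spine title, spine applicant, date
        // var1     var2        var3  var4 var5   var6        var7
        // Training unit, applicant, supervisor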
        LinkedHashMap[] array = new LinkedHashMap[1];

        LinkedHashMap<String, String> zhMap = new LinkedHashMap();
        Map<String, String> zhData = initZhData();

        Node[] nodesArray = analysedDoc.getSections().get(0).getBody().getChildNodes(NodeType.ANY, true).toArray();
        Paragraph lastParagraph = null;
        StringBuilder collectedText  = new StringBuilder();

        for (Node node : nodesArray) {
            if (node instanceof Paragraph paragraph) {
                String text = paragraph.getText().trim().replace(" ", "");
                if (text.isEmpty()) {
                    continue;
                }
                // todo
                /**
                 *    NodeType: 5 is table, 8 is paragraph
                 *    The sentence (申请清华大学工商管理硕士专业学位论文)
                 *    also goes through the logic below, so the thesis title,
                 *    which was correct at first, gets overwritten
                 */

                // Collect the paragraph text here
                collectedText.append(paragraph.getText().trim()).append(ControlChar.PARAGRAPH_BREAK);

                if (isChineseApplyForThesis(text)) {
                    if (lastParagraph != null) {
                        zhData.put("标题", collectedText.toString().trim());
                        zhData.put("书脊标题", collectedText.toString().trim());
                    }
                }
                // Process the Chinese text
                processChineseText(text, zhData);
                lastParagraph = paragraph;

            } else if (node instanceof Table table) {
                for (int i = 0; i < table.getRows().getCount(); i++) {
                    Row row = table.getRows().get(i);
                    String text = row.getText().trim().replace(" ", "");
                    if (text.isEmpty()) {
                        continue;
                    }
                    // The row just above the "申请xxxxx论文" line is the thesis title
                    if (isChineseApplyForThesis(text)) {
                        if (i > 0) {
                            zhData.put("标题", table.getRows().get(i - 1).getText().trim());
                            zhData.put("书脊标题", table.getRows().get(i - 1).getText().trim());
                        }
                    }
                    processChineseText(text, zhData);
                 
                }
            }
        }
        List<Map> zhList = coverFormDTO.getZh();
        zhMap = fillMapWithData(zhList, zhData, zhMap);
        array[0] = zhMap;

        return array;
    }

    private LinkedHashMap fillMapWithData(List<Map> List, Map<String, String> data, LinkedHashMap returnMap) {
        for (Map map : List) {
            String varName = (String) map.get("key");
            String name = (String) map.get("label");
            String value = data.get(name);
            returnMap.put(varName, value);
        }
        return returnMap;
    } 

    public class CoverFormDTO {

    private List<Map> zh;
     }

Here’s the test cover:
cover_1.docx (38.3 KB)
For example, the text “申请清华大学工商管理硕士专业学位论文” is processed twice by the code above: once as a paragraph and once as a table row.
How should I fix my code?

alexey.noskov · November 4, 2024, 6:09am

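@Madecho The behavior is expected, since you are iterating over all nodes in your document, including the descendants of tables. Content in table cells is also represented using paragraphs, so the same text is reached once through its paragraph and once through its table row. Please see our documentation to learn more about the Aspose.Words Document Object Model: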
https://docs.aspose.com/words/net/aspose-words-document-object-model/

I would suggest using DocumentVisitor to iterate through all nodes in your document.
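
To illustrate why a deep traversal reports table text twice, here is a minimal, self-contained sketch. It does not use Aspose.Words: `SimpleNode`, `descendants`, `collect`, and `sampleBody` are hypothetical stand-ins that only mimic the shape of a document tree. The second mode skips paragraphs that have a table ancestor. In Aspose.Words the analogous check would presumably be something like `paragraph.getAncestor(NodeType.TABLE) != null`; please verify that against the API reference before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapDemo {

    // Hypothetical stand-in for a document node; NOT the Aspose.Words Node class.
    static class SimpleNode {
        final String type;   // "BODY", "TABLE", "ROW", "CELL" or "PARAGRAPH"
        final String text;   // only paragraphs carry their own text in this model
        final List<SimpleNode> children = new ArrayList<>();
        SimpleNode parent;

        SimpleNode(String type, String text) {
            this.type = type;
            this.text = text;
        }

        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        // Text of this node followed by the text of all its descendants.
        String fullText() {
            StringBuilder sb = new StringBuilder(text);
            for (SimpleNode c : children) {
                sb.append(c.fullText());
            }
            return sb.toString();
        }

        // True if any ancestor of this node has the given type.
        boolean hasAncestor(String ancestorType) {
            for (SimpleNode p = parent; p != null; p = p.parent) {
                if (p.type.equals(ancestorType)) {
                    return true;
                }
            }
            return false;
        }
    }

    // All descendants in document order, similar in spirit to a deep child-node query.
    static List<SimpleNode> descendants(SimpleNode root) {
        List<SimpleNode> out = new ArrayList<>();
        for (SimpleNode c : root.children) {
            out.add(c);
            out.addAll(descendants(c));
        }
        return out;
    }

    // Collects paragraph text and table-row text; optionally skips paragraphs inside tables.
    static List<String> collect(SimpleNode body, boolean skipParagraphsInTables) {
        List<String> seen = new ArrayList<>();
        for (SimpleNode n : descendants(body)) {
            if (n.type.equals("PARAGRAPH")) {
                if (skipParagraphsInTables && n.hasAncestor("TABLE")) {
                    continue; // this text is handled through its table row instead
                }
                seen.add("paragraph: " + n.fullText());
            } else if (n.type.equals("TABLE")) {
                for (SimpleNode row : n.children) {
                    seen.add("row: " + row.fullText());
                }
            }
        }
        return seen;
    }

    // Body with one top-level paragraph and a one-row, one-cell table.
    static SimpleNode sampleBody() {
        SimpleNode cell = new SimpleNode("CELL", "").add(new SimpleNode("PARAGRAPH", "Thesis title"));
        SimpleNode row = new SimpleNode("ROW", "").add(cell);
        SimpleNode table = new SimpleNode("TABLE", "").add(row);
        return new SimpleNode("BODY", "")
                .add(new SimpleNode("PARAGRAPH", "Intro"))
                .add(table);
    }

    public static void main(String[] args) {
        SimpleNode body = sampleBody();
        System.out.println(collect(body, false)); // table text appears as a row and as a paragraph
        System.out.println(collect(body, true));  // table text appears only as a row
    }
}
```

In the first mode the cell text is collected both as a row and as a paragraph, which mirrors the overwrite described in the question. The second mode keeps the single deep loop and adds only the ancestor check.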

Madecho · November 4, 2024, 6:16am

Alexey, could you give me a small example that uses DocumentVisitor and avoids the problem we discussed above?

alexey.noskov · November 4, 2024, 6:24am

@Madecho Sure, you can use the following code:

Document doc = new Document("C:\\Temp\\in.docx");
MyDocumentVisitor visitor = new MyDocumentVisitor();
doc.accept(visitor);

private static class MyDocumentVisitor extends DocumentVisitor
{
    // True while the visitor is inside a table; paragraphs there are already covered by their row.
    private boolean mIgnoreParagraph = false;

    @Override
    public int visitParagraphStart(Paragraph paragraph) throws Exception {
        // Print only paragraphs that are outside of tables.
        if (!mIgnoreParagraph)
            System.out.println(paragraph.toString(SaveFormat.TEXT));
        return VisitorAction.CONTINUE;
    }

    @Override
    public int visitRowStart(Row row) throws Exception {
        // Table content is printed once, row by row.
        System.out.println(row.toString(SaveFormat.TEXT));
        return VisitorAction.CONTINUE;
    }

    @Override
    public int visitTableStart(Table table) throws Exception {
        // Entering a table: stop printing individual paragraphs.
        mIgnoreParagraph = true;
        return VisitorAction.CONTINUE;
    }

    @Override
    public int visitTableEnd(Table table) throws Exception {
        // Leaving the table: resume printing paragraphs.
        mIgnoreParagraph = false;
        return VisitorAction.CONTINUE;
    }
}