Unusual text appears when extracting paragraphs from a docx document

yjsdfsdf · August 8, 2023, 6:58am

I use the following code to extract the paragraphs from a docx document and found some abnormal text inside the paragraphs. How should I deal with it? See the attached picture.

Document document = new Document("aa.doc");
NodeCollection childNodes = document.getChildNodes(NodeType.PARAGRAPH, true);
for (int i = 0; i < childNodes.getCount(); i++) {
    String text = childNodes.get(i).getText();
    System.out.println(text);
}

Downloads.zip (359.8 KB)

coderthiyagarajan1980 · August 8, 2023, 1:23pm

can you try like this?

denis.shvydkiy · August 8, 2023, 2:35pm

@yjsdfsdf, the text marked in your screenshot is part of the table of contents. You can filter out the table of contents elements by style name:

Document document = new Document("aa.doc");
for (Paragraph para : (Iterable<Paragraph>)document.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (para.getParagraphFormat().getStyleName().contains("TOC"))
        continue;

    String text = para.getText();
    System.out.println(text);
}

yjsdfsdf · August 14, 2023, 6:20am

The document contains hyperlinks, and the text I extracted contains characters I don't want. How can I get rid of them? Please refer to the attachment.
Downloads.zip (30.0 KB)

yjsdfsdf · August 14, 2023, 6:23am

In this document, the table of contents entries are filtered out because of their TOC style, but I actually need to extract the table of contents text as well. How can I do this? Thank you.

denis.shvydkiy · August 14, 2023, 4:09pm

@yjsdfsdf, the text of TOC entries can be extracted using the following code:

Document document = new Document("a.docx");

for (Field field : (Iterable<Field>)document.getRange().getFields()) {
    if (field.getType() == FieldType.FIELD_HYPERLINK) {
        FieldHyperlink hyperlink = (FieldHyperlink) field;
        String subAddress = hyperlink.getSubAddress();
        if (subAddress != null && subAddress.startsWith("_Toc"))
        {
            String text = hyperlink.getDisplayResult();
            System.out.println(text);
        }
    }
}

yjsdfsdf · August 16, 2023, 12:53am

What I actually want is to extract the paragraphs from the docx document, but some of the extracted paragraphs contain TOC content and similar text. My goal is still to get the paragraph text exactly as it appears on the page. How should I do that?

alexey.noskov · August 16, 2023, 5:47am

@yjsdfsdf In your code you are using the Paragraph.getText() method, which returns text including special characters. If you need only the visible text, without special characters and field codes, use the Paragraph.toString method instead:

String text = para.toString(SaveFormat.TEXT);
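
To illustrate why the getText() output looks odd, here is a minimal sketch in plain Java with no Aspose.Words dependency. It assumes the raw text marks each field with the control characters \u0013 (field start), \u0014 (field separator) and \u0015 (field end), with the field code sitting between start and separator. The class name FieldCodeStripper, the method stripFieldCodes and the sample string are hypothetical and only for illustration. This is not how the library works internally, and toString(SaveFormat.TEXT) remains the recommended way.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: shows how field codes hide inside raw paragraph text.
class FieldCodeStripper {
    private static final char FIELD_START = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Returns only the text outside field codes (field results are kept).
    static String stripFieldCodes(String raw) {
        StringBuilder visible = new StringBuilder();
        // One flag per open field: true while we are inside its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (char c : raw.toCharArray()) {
            if (c == FIELD_START) {
                inCode.push(true);
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from its code part to its result part.
                if (!inCode.isEmpty()) {
                    inCode.pop();
                    inCode.push(false);
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                }
            } else if (!inCode.contains(true)) {
                // Visible only if no enclosing field is still in its code part.
                visible.append(c);
            }
        }
        return visible.toString();
    }

    public static void main(String[] args) {
        // Assumed shape of one TOC entry: a HYPERLINK field wrapping a PAGEREF field.
        String raw = "\u0013 HYPERLINK \\l \"_Toc1\" \u0014Introduction\t"
                + "\u0013 PAGEREF _Toc1 \\h \u00141\u0015\u0015";
        System.out.println(stripFieldCodes(raw));
    }
}
```

In the sample, the HYPERLINK and PAGEREF instructions are dropped, while the entry title and page number, which are field results, are kept.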

yjsdfsdf · August 16, 2023, 6:30am

Thank you very much.