I use the following code to extract the paragraphs from a docx document, and I found some unexpected text inside the paragraphs. How can I deal with it? Please see the attached screenshot.
Document document = new Document("aa.doc");
NodeCollection childNodes = document.getChildNodes(NodeType.PARAGRAPH, true);
for (int i = 0; i < childNodes.getCount(); i++) {
    String text = childNodes.get(i).getText();
    System.out.println(text);
}
Downloads.zip (359.8 KB)
@yjsdfsdf, the text marked in your screenshot is part of the table of contents. You can filter out the table of contents entries by style name:
Document document = new Document("aa.doc");
for (Paragraph para : (Iterable<Paragraph>) document.getChildNodes(NodeType.PARAGRAPH, true)) {
    // Skip paragraphs formatted with a TOC style.
    if (para.getParagraphFormat().getStyleName().contains("TOC"))
        continue;
    String text = para.getText();
    System.out.println(text);
}
The document contains hyperlinks, and the text I extracted contains characters I don't want. How can I get rid of them? Please refer to the attachment.
Downloads.zip (30.0 KB)
In this document, the table of contents entries are filtered out because of their TOC style. In fact, I do need to extract the text of the table of contents. How can I do that? Thank you.
@yjsdfsdf, the text of the TOC entries can be extracted using the following code:
Document document = new Document("a.docx");
for (Field field : (Iterable<Field>) document.getRange().getFields()) {
    if (field.getType() == FieldType.FIELD_HYPERLINK) {
        FieldHyperlink hyperlink = (FieldHyperlink) field;
        // TOC entries are hyperlinks that point to "_Toc..." bookmarks.
        String subAddress = hyperlink.getSubAddress();
        if (subAddress != null && subAddress.startsWith("_Toc")) {
            String text = hyperlink.getDisplayResult();
            System.out.println(text);
        }
    }
}
What I actually want is to extract the paragraph text from the docx document, but the extracted paragraphs include TOC field content and similar artifacts. My goal is still to get each paragraph exactly as the text appears on the page. How should I do that?
@yjsdfsdf In your code you are using the Paragraph.getText() method, which returns text with special characters. If you need only the visible text, without special characters and field codes, use the Paragraph.toString method instead:
String text = para.toString(SaveFormat.TEXT);
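To illustrate why getText() output looks "abnormal" for TOC and hyperlink paragraphs, here is a minimal, library-free sketch of what removing field codes from such raw text involves. It assumes the raw text marks fields with the control characters U+0013 (field start), U+0014 (field separator) and U+0015 (field end), which is my understanding of the Aspose.Words ControlChar constants. The class FieldCodeStripper and the method visibleText are hypothetical names used only for illustration; in real code, para.toString(SaveFormat.TEXT) remains the recommended approach.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only; it is not part of Aspose.Words.
final class FieldCodeStripper {
    // Assumed field delimiters found in raw paragraph text.
    static final char FIELD_START = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    // Returns the text with field codes and field delimiters removed,
    // keeping only field results and ordinary text.
    static String visibleText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while inside its code part,
        // FALSE once its separator has been seen (result part).
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_START) {
                inCode.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                    inCode.push(Boolean.FALSE);
                }
            } else if (c == FIELD_END) {
                if (!inCode.isEmpty()) {
                    inCode.pop();
                }
            } else if (!inCode.contains(Boolean.TRUE)) {
                // Keep the character only if no enclosing field is in its code part.
                out.append(c);
            }
        }
        // trim() also drops the trailing paragraph break character.
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Made-up raw text shaped like a TOC entry: a HYPERLINK field
        // whose result contains a nested PAGEREF field.
        String raw = "\u0013 HYPERLINK \\l \"_Toc1\" \u0014Introduction\t"
                + "\u0013 PAGEREF _Toc1 \\h \u00141\u0015\u0015\r";
        System.out.println(visibleText(raw));
    }
}
```

The sample string is invented for the demonstration; real documents may contain other control characters (for example cell or page break markers) that this sketch does not handle.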