Hi,
I want to extract the text from a .DOC file, count its words and characters, and then save it as .TXT. I tested a .DOC file that contains some bullet characters. The saved .TXT file correctly includes the bullets, but they are missing from the string returned by doc.getText(), as shown in the screenshot below.
Why are the results different? Is it possible to get the same result in both the file and the string?
Regards,
Rapeepan
test_bullet.zip (18.4 KB)
Screenshot from 2020-06-12 16-12-00.png (42.2 KB)
com.aspose.words.Document doc = new com.aspose.words.Document(sFileNameInput);
System.out.println(doc.getText());
com.aspose.words.TxtSaveOptions options = new com.aspose.words.TxtSaveOptions();
options.setEncoding(java.nio.charset.Charset.forName("UTF-8"));
options.setSaveFormat(com.aspose.words.SaveFormat.TEXT);
doc.save(sFileNameOutput,options);
@rcomniscien
Please call the Document.updateListLabels method and then use the Node.toString method, as shown below, to get the desired output.
Document doc = new Document(MyDir + "test_bullet.docx");
doc.updateListLabels();
System.out.println(doc.toString(SaveFormat.TEXT));
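If the goal is a .TXT file that matches the string exactly, one option is to write that same string to disk yourself instead of producing the file separately. Below is a minimal sketch in plain Java with no Aspose calls. The class name Utf8TextFile and its methods are hypothetical helpers, and the sample string only stands in for the value returned by doc.toString(SaveFormat.TEXT). Note that this writes UTF-8 without a byte order mark.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Aspose.Words.
final class Utf8TextFile {
    private Utf8TextFile() {}

    // Writes the text as UTF-8 bytes, replacing any existing file content.
    static void write(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole file back and decodes it as UTF-8.
    static String read(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    // Writes the text to a temporary file, reads it back, and deletes the file.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("test_bullet", ".txt");
            try {
                write(tmp, text);
                return read(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the string returned by doc.toString(SaveFormat.TEXT).
        String extracted = "\u2022 First item\r\n\u2022 Second item\r\n";
        System.out.println("File matches string: " + roundTrip(extracted).equals(extracted));
    }
}
```

Because the file is produced from the very string you count, the two cannot drift apart.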
Moreover, you can use the following code example to extract the label of each list paragraph, either as a numeric value or as a String.
Document doc = new Document(getMyDir() + "Rendering.docx");
doc.updateListLabels();
int listParaCount = 1;
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    // Check whether the paragraph is a list item. In this document the list uses plain Arabic numerals,
    // which start at three and end at six.
    if (paragraph.getListFormat().isListItem()) {
        System.out.println(MessageFormat.format("Paragraph #{0}", listParaCount));
        // This is the text we get when we output this node to text format.
        // The list label is not included in this output. Trim any paragraph formatting characters.
        String paragraphText = paragraph.toString(SaveFormat.TEXT).trim();
        System.out.println("Exported Text: " + paragraphText);
        ListLabel label = paragraph.getListLabel();
        // This is the position of the paragraph within the current level of the list. If the list has
        // multiple levels, this tells us the position on that particular level.
        System.out.println("Numerical Id: " + label.getLabelValue());
        // Combine them to include the list label with the text in the output.
        System.out.println("List label combined with text: " + label.getLabelString() + " " + paragraphText);
        listParaCount++;
    }
}
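For the word and character counts mentioned in the question, here is a minimal sketch in plain Java that works on any extracted string. The class name TextStats and its methods are hypothetical, not part of Aspose.Words. It assumes a "word" is any run of non-whitespace characters, so a standalone bullet is counted as a word; adjust that rule if your definition differs.

```java
// Hypothetical helper, not part of Aspose.Words.
final class TextStats {
    private TextStats() {}

    // Counts whitespace-separated tokens. A lone bullet such as "\u2022" counts as one word.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Counts Unicode code points, including spaces, tabs and line breaks.
    static int countCharacters(String text) {
        return text.codePointCount(0, text.length());
    }

    // Counts Unicode code points, skipping anything Character.isWhitespace reports as whitespace.
    static long countNonWhitespaceCharacters(String text) {
        return text.codePoints().filter(cp -> !Character.isWhitespace(cp)).count();
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the string returned by doc.toString(SaveFormat.TEXT).
        String extracted = "\u2022 First item\r\n\u2022 Second item\r\n";
        System.out.println("Words: " + countWords(extracted));
        System.out.println("Characters: " + countCharacters(extracted));
        System.out.println("Characters without whitespace: " + countNonWhitespaceCharacters(extracted));
    }
}
```

Counting code points rather than char values keeps the result correct if a list uses a symbol outside the Basic Multilingual Plane. Non-breaking spaces are not treated as separators by this sketch.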