Bullets with Unicode characters are lost by Document.getText using Java

Hi,

I want to extract text from a .DOC file to count words and characters, and then save it as .TXT. I tested a .DOC file that contains some bullet characters. The saved .TXT file includes the bullets correctly, but the bullets are missing from the text returned by doc.getText(), as shown in the picture below.
Why are the results different? Is it possible to get the same result in both the file and the string?

Regards,
Rapeepan

test_bullet.zip (18.4 KB)
Screenshot from 2020-06-12 16-12-00.png (42.2 KB)

com.aspose.words.Document doc = new com.aspose.words.Document(sFileNameInput);
System.out.println(doc.getText());
com.aspose.words.TxtSaveOptions options = new com.aspose.words.TxtSaveOptions();
options.setEncoding(java.nio.charset.Charset.forName("UTF-8"));
options.setSaveFormat(com.aspose.words.SaveFormat.TEXT);
doc.save(sFileNameOutput,options);

@rcomniscien

Please call the Document.updateListLabels method and then use the Node.toString method, as shown below, to get the desired output.

Document doc = new Document(MyDir + "test_bullet.docx");
doc.updateListLabels();
System.out.println(doc.toString(SaveFormat.TEXT)); 
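Once the text, including bullets, is available as a String, the word and character counts mentioned in the question could be computed with plain Java. The following is only a minimal sketch under stated assumptions: the class name TextStats and both methods are hypothetical helpers, not part of Aspose.Words. A "word" is assumed to be a whitespace-separated token containing at least one letter or digit (so a lone bullet symbol is not counted), and "characters" are assumed to be Unicode code points excluding whitespace. Adjust these rules if your counting convention differs.

```java
public final class TextStats {
    private TextStats() {}

    /**
     * Counts whitespace-separated tokens that contain at least one letter or digit.
     * Assumption: a lone bullet symbol such as "\u2022" is not a word.
     */
    public static int countWords(String text) {
        int words = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words++;
            }
        }
        return words;
    }

    /**
     * Counts Unicode code points, skipping whitespace such as spaces and line breaks.
     * Bullet symbols are counted as characters.
     */
    public static int countCharacters(String text) {
        return (int) text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by doc.toString(SaveFormat.TEXT)
        String sample = "\u2022 First item\r\n\u2022 Second item\r\n";
        System.out.println("Words: " + countWords(sample));
        System.out.println("Characters: " + countCharacters(sample));
    }
}
```

Counting code points rather than String.length() keeps characters outside the Basic Multilingual Plane from being counted twice.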

Moreover, you can use the following code example to extract the label of each paragraph in a list as a value or a String.

Document doc = new Document(getMyDir() + "Rendering.docx");
doc.updateListLabels();
int listParaCount = 1;

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    // Check whether the paragraph is a list item. In our document the list uses plain Arabic numbers,
    // which start at three and end at six
    if (paragraph.getListFormat().isListItem()) {
        System.out.println(MessageFormat.format("Paragraph #{0}", listParaCount));

        // This is the text we get when we output this node to text format.
        // The list labels are not included in this text output. Trim any paragraph formatting characters
        String paragraphText = paragraph.toString(SaveFormat.TEXT).trim();
        System.out.println("Exported Text: " + paragraphText);

        ListLabel label = paragraph.getListLabel();
        // This gets the position of the paragraph in the current level of the list. If the list has multiple levels,
        // this tells us the paragraph's position on that particular level
        System.out.println("Numerical Id: " + label.getLabelValue());

        // Combine them together to include the list label with the text in the output
        System.out.println("List label combined with text: " + label.getLabelString() + " " + paragraphText);

        listParaCount++;
    }
}