Issue reading the level numbers from a DOCX that has only levels, not headers

I am having an issue reading the level numbers from a DOCX file that has only levels, not headers.
I am able to read the level text, but not the level numbers. I am using the trial version of Aspose.Words for Java.
Can you please suggest how to read the level numbers?

Thank you

@madhusudhangovindu,

That is because the number is not really stored in the document. It is generated as the items are being rendered.

So this is not a limitation of the trial version.
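
To illustrate what "generated while rendering" means, here is a minimal library-free sketch of how labels such as 1, 1.1, 1.2 can be derived from nothing but the nesting level of each list paragraph. The class and method names are hypothetical, it assumes plain decimal numbering that starts at 1 and never skips a level, and it is not how Aspose.Words or Word implement it internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives "1", "1.1", "1.2", "2" style labels from levels only.
class RenderedNumbers {
    // levels: the 1-based nesting level of each list paragraph, in document order.
    // Assumption: a level is never skipped (no jump from level 1 straight to level 3).
    static List<String> labelsFor(int[] levels, int maxLevels) {
        int[] counters = new int[maxLevels];
        List<String> labels = new ArrayList<>();
        for (int level : levels) {
            counters[level - 1]++;
            // Every deeper level restarts once an item at this level appears.
            for (int i = level; i < maxLevels; i++) {
                counters[i] = 0;
            }
            // Join the counters of levels 1..level with dots.
            StringBuilder label = new StringBuilder();
            for (int i = 0; i < level; i++) {
                if (i > 0) {
                    label.append('.');
                }
                label.append(counters[i]);
            }
            labels.add(label.toString());
        }
        return labels;
    }

    public static void main(String[] args) {
        // Prints [1, 1.1, 1.1.1, 1.2, 2, 2.1]
        System.out.println(labelsFor(new int[] {1, 2, 3, 2, 1, 2}, 9));
    }
}
```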

The document does have numbers. Can you please look at the attached document and advise?
I am using the code below to read the level numbers. I get the text, but I get 0s in place of the level numbers.

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getListLabel().getLabelValue() + "-> " +paragraph.getText());
}

TestDoc.docx (45.1 KB)

@madhusudhangovindu,

I made a solution in C#:

// Requires: using System; using System.Linq; using Aspose.Words;
private void Logic(Document doc)
{
    var countByLevel = new int[9];

    // All list paragraphs whose style name contains "Level" (for example "Level 1").
    var levelParagraphs = doc.GetChildNodes(NodeType.Paragraph, true)
        .Cast<Paragraph>()
        .Where(p => p.ParagraphFormat.StyleName.Contains("Level") && p.ListFormat.IsListItem);

    foreach (Paragraph paragraph in levelParagraphs)
    {
        // The level is taken from the style name: "Level 2" -> 2.
        string styleName = paragraph.ParagraphFormat.StyleName;
        int level = Convert.ToInt32(styleName.Replace("Level ", ""));
        string text = paragraph.ToString(SaveFormat.Text);

        countByLevel[level - 1]++;

        // Restart the counters of all deeper levels, so a jump from level 3 back to level 1 also resets level 2.
        for (int i = level; i < countByLevel.Length; i++)
        {
            countByLevel[i] = 0;
        }

        Console.WriteLine($"Number: {GetStringNumber(level, countByLevel)} - Text: {text}");
    }

    Console.ReadKey();
}

private string GetStringNumber(int level, int[] countLevels)
{
    // Join the counters of levels 1..level with dots, for example "1.2.1".
    return string.Join(".", countLevels.Take(level));
}

Solution in Java:

public void logic(Document doc) throws Exception
{
    var countByLevel = new int[9];

    for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray())
    {
        var paragraph = (com.aspose.words.Paragraph) node;
        String styleName = paragraph.getParagraphFormat().getStyleName();
        if (styleName.contains("Level") && paragraph.getListFormat().isListItem())
        {
            // The level is taken from the style name: "Level 2" -> 2.
            int level = Integer.parseInt(styleName.replace("Level ", ""));
            String text = paragraph.toString(SaveFormat.TEXT);

            countByLevel[level - 1]++;

            // Restart the counters of all deeper levels, so a jump from level 3 back to level 1 also resets level 2.
            for (int i = level; i < countByLevel.length; i++)
            {
                countByLevel[i] = 0;
            }

            System.out.println("Number: " + getStringNumber(level, countByLevel) + " Text: " + text);
        }
    }
}

private String getStringNumber(int level, int[] countLevels)
{
    // Join the counters of levels 1..level with dots, for example "1.2.1".
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < level; i++)
    {
        if (i > 0)
        {
            result.append('.');
        }
        result.append(countLevels[i]);
    }

    return result.toString();
}

@madhusudhangovindu You should simply call Document.updateListLabels() before accessing the Paragraph.ListLabel property.

doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getListLabel().getLabelValue() + "-> " +paragraph.getText());
}

FYI @carlos.molina

Hi Alexey
The output is coming out as below.

But it should be as below.

If we have a bullet list such as (a), (b) … under 1 or 1.1, the provided solution still returns 1 for those items.
Can you please help?

@madhusudhangovindu Sure, you should use LabelString property instead of LabelValue:

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getListLabel().getLabelString() + "-> " +paragraph.toString(SaveFormat.TEXT).trim());
}

Thank you so much for your help Alexey, the solution is working.
But when the style is Heading, the solution does not work. Can you please help?

@madhusudhangovindu Could you please attach the problematic document here for testing? We will check it and provide you more information.

Hi Alexey, it looks like it's working. I will get back to you if I need anything else.

Hi Alexey, when Level 1 is a Heading, as in the attached document, the level number comes back as part of the text, without calling paragraph.getListLabel().getLabelString().

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getText());
}

During document extraction I cannot be sure which kind of document I will be getting.

So is there a generic way to get the label number along with the text?

Test Doc with Heading.docx (34.1 KB)

@madhusudhangovindu paragraph.getListLabel().getLabelString() returns nothing because the heading paragraphs are not actually list items, so they do not have list labels. The numbers before the text in these paragraphs are plain text. You can add a condition to check whether the paragraph is a list item:

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.isListItem())
        System.out.println(paragraph.getListLabel().getLabelString() + "-> " + paragraph.toString(SaveFormat.TEXT).trim());
}

Ok Alexey, got it.

When I have heading paragraphs like the ones in the document I shared, how can I extract the data from 1.1 and 1.2 if the data spans multiple lines?

Thank you.

@madhusudhangovindu You can parse the paragraph text. For example, see the following code:

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    String paragraphText = paragraph.toString(SaveFormat.TEXT).trim();
    // In non-list paragraphs the typed number is separated from the text by a tab.
    int tabIndex = paragraphText.indexOf('\t');
    if (paragraph.isListItem())
        System.out.println(paragraph.getListLabel().getLabelString() + "-> " + paragraphText);
    else if (tabIndex > 0)
        System.out.println(paragraphText.substring(0, tabIndex) + "-> " + paragraphText.substring(tabIndex + 1));
}
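
The code above treats anything before the first tab as a number, so an ordinary paragraph that happens to contain a tab would also match. A stricter variant could check that the prefix really looks like a dotted number. The sketch below is library-free; the class name, method name, and the regular expression are assumptions chosen for illustration, not part of Aspose.Words.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: splits "1.1<TAB>Some text" into its typed number and its text.
class TabLabelParser {
    // Assumption: typed numbers look like "1", "1.", "1.2" or "1.2.3".
    private static final Pattern DOTTED_NUMBER = Pattern.compile("\\d+(\\.\\d+)*\\.?");

    // Returns {label, body}, or empty when the text does not start with a dotted number and a tab.
    static Optional<String[]> split(String paragraphText) {
        String text = paragraphText.trim();
        int tabIndex = text.indexOf('\t');
        if (tabIndex <= 0) {
            return Optional.empty();
        }
        String label = text.substring(0, tabIndex);
        if (!DOTTED_NUMBER.matcher(label).matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {label, text.substring(tabIndex + 1).trim()});
    }

    public static void main(String[] args) {
        split("1.1\tScope of work").ifPresent(p -> System.out.println(p[0] + "-> " + p[1]));
    }
}
```

In the loop above, the result of split(paragraphText) could replace the plain indexOf check in the else branch.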

Thank you for your quick response. If my document is in the below format:

  1. Some Text
    1.1 child of some text
    1.1.1 child of child of some text
    1.2 second child of some text
    1.2.1 child of second child of some text

then how can I identify that 1.1 and 1.2 are children of 1,
that 1.1.1 is a child of 1.1,
and that 1.2.1 is a child of 1.2…

in both heading paragraphs and level paragraphs?

Can you please help?

Thanks

@madhusudhangovindu If all the paragraphs are list items, you can detect whether they belong to the same list. But in your case, where some of the paragraphs are list items and others just imitate numbering, it is necessary to analyze the textual content and the order of the paragraphs in the document. Aspose.Words only provides you with a tool to read the document content; analyzing the content is another task. I am afraid this task is out of Aspose.Words scope.
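
As a rough idea of what such text analysis could look like once the number has been extracted as a string, the parent of a dotted number can be found by dropping its last segment. This is only a library-free sketch with hypothetical names, and it assumes the numbers are plain dotted decimals such as 1.2.1; labels like (a) or mixed formats would need different handling.

```java
import java.util.Optional;

// Hypothetical helper: finds the parent of a dotted number by dropping its last segment.
class LabelParents {
    // "1.2.1" -> "1.2", "1.1" -> "1", "1" or "1." -> empty (top level).
    static Optional<String> parentOf(String label) {
        String normalized = label.trim();
        // Ignore a trailing dot, as in "1."
        if (normalized.endsWith(".")) {
            normalized = normalized.substring(0, normalized.length() - 1);
        }
        int lastDot = normalized.lastIndexOf('.');
        if (lastDot < 0) {
            return Optional.empty();
        }
        return Optional.of(normalized.substring(0, lastDot));
    }

    public static void main(String[] args) {
        for (String label : new String[] {"1.", "1.1", "1.1.1", "1.2", "1.2.1"}) {
            System.out.println(label + " -> parent: " + parentOf(label).orElse("(none)"));
        }
    }
}
```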

Hi Alexey, if we have a document in which all paragraphs are list paragraphs,
then how can I identify that 1.1 and 1.2 are children of 1,
that 1.1.1 is a child of 1.1,
and that 1.2.1 is a child of 1.2…

@madhusudhangovindu In this case the paragraphs belong to the same list (see the Paragraph.getListFormat().getList() property), but have different levels (see the Paragraph.getListFormat().getListLevelNumber() property).
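
Given the level number of each list paragraph in document order, the parent of an item is the nearest preceding item with a smaller level. The sketch below shows that idea on plain integers so it can run without Aspose.Words. The class and method names are hypothetical, the levels are assumed to be 0-based values collected from getListLevelNumber(), and all paragraphs are assumed to belong to the same list.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical helper: finds the parent of each list item from the level numbers alone.
class ListHierarchy {
    // levels[i] is the level of the i-th list paragraph in document order.
    // Returns parents[i] = index of the parent paragraph, or -1 for top-level items.
    static int[] parentIndexes(int[] levels) {
        int[] parents = new int[levels.length];
        Deque<Integer> open = new ArrayDeque<>(); // indexes of the current chain of ancestors
        for (int i = 0; i < levels.length; i++) {
            // Items at the same or a deeper level cannot be ancestors of item i.
            while (!open.isEmpty() && levels[open.peek()] >= levels[i]) {
                open.pop();
            }
            parents[i] = open.isEmpty() ? -1 : open.peek();
            open.push(i);
        }
        return parents;
    }

    public static void main(String[] args) {
        // Levels for the items 1, 1.1, 1.1.1, 1.2, 1.2.1
        int[] levels = {0, 1, 2, 1, 2};
        // Prints [-1, 0, 1, 0, 3]
        System.out.println(Arrays.toString(parentIndexes(levels)));
    }
}
```

Here index 3 (item 1.2) gets parent 0 (item 1), and index 4 (item 1.2.1) gets parent 3 (item 1.2).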

Hi Alexey, when I read the Word document and print paragraph.getText(), it prints the footer data first and only then the document's paragraph data. How can I skip the footer data?

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getText());
}

@madhusudhangovindu You can use code like the following to skip paragraphs from header/footer:

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {

    // Skip paragraphs from header/footer
    if(paragraph.getAncestor(NodeType.HEADER_FOOTER)!=null)
        continue;
            
    System.out.println(paragraph.getText());
}

Also, I would suggest considering DocumentVisitor to iterate over the document’s nodes. In this case it is easier to track the document’s structure:
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/#extract-content-using-documentvisitor