Issue reading the level numbers from a docx that has only levels, not headings

madhusudhangovindu · February 20, 2023, 3:15pm

I am having an issue reading the level numbers from a docx that has only levels, not headings.
I am able to read the level text, but not the level numbers. I am using the trial version of Aspose.Words for Java.
Can you please suggest how to read the level numbers?

Thank you

carlos.molina · February 20, 2023, 3:25pm

@madhusudhangovindu,

That is because the number is not really there. It is generated as the items are being rendered.

So it is not a limitation of the trial version.

madhusudhangovindu · February 20, 2023, 4:13pm

The document does have numbers. Please see the attached document and advise.
I am using the code below to read the level numbers. I get the text, but I get 0s in place of the level numbers.

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getListLabel().getLabelValue() + "-> " +paragraph.getText());
}

TestDoc.docx (45.1 KB)

carlos.molina · February 20, 2023, 6:04pm

@madhusudhangovindu,

Here is a solution in C#:

// Requires: using System; using System.Linq; using Aspose.Words;
private void Logic(Document doc)
{
    var countByLevel = new int[9];

    // All paragraphs whose style name contains "Level" (for example "Level 1").
    var levelParagraphs = doc.GetChildNodes(NodeType.Paragraph, true).Cast<Paragraph>().Where(p => p.ParagraphFormat.StyleName.Contains("Level"));
    foreach (Paragraph paragraph in levelParagraphs)
    {
        if (paragraph.ListFormat.IsListItem)
        {
            // Assumes style names of the form "Level N", with N between 1 and 9.
            string styleName = paragraph.ParagraphFormat.StyleName;
            int level = Convert.ToInt32(styleName.Replace("Level ", ""));
            string text = paragraph.ToString(SaveFormat.Text).Trim();

            countByLevel[level - 1]++;
            // Restart the counters of all deeper levels, not only the previous one.
            Array.Clear(countByLevel, level, countByLevel.Length - level);

            Console.WriteLine($"Number: {GetStringNumber(level, countByLevel)} - Text: {text}");
        }
    }

    Console.ReadKey();
}

private string GetStringNumber(int level, int[] countLevels)
{
    string result = "";
    for (int i = 0; i < level; i++)
    {
        if (string.IsNullOrWhiteSpace(result))
        {
            result += countLevels[i];
        }
        else
        {
            result += $".{countLevels[i]}";
        }
    }

    return result;
}

Solution in Java:

public void Logic(Document doc) throws Exception
{
    var countByLevel = new int[9];
    var paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true).toArray();

    for (Node node : paragraphs)
    {
        var paragraph = (com.aspose.words.Paragraph)node;
        // Only list items whose style name contains "Level" (for example "Level 1").
        if (paragraph.getParagraphFormat().getStyleName().contains("Level")
            && paragraph.getListFormat().isListItem())
        {
            // Assumes style names of the form "Level N", with N between 1 and 9.
            String styleName = paragraph.getParagraphFormat().getStyleName();
            int level = Integer.parseInt(styleName.replace("Level ", ""));
            String text = paragraph.toString(SaveFormat.TEXT).trim();

            countByLevel[level - 1]++;
            // Restart the counters of all deeper levels, not only the previous one.
            java.util.Arrays.fill(countByLevel, level, countByLevel.length, 0);

            System.out.println("Number: " + GetStringNumber(level, countByLevel) + " Text: " + text);
        }
    }
}

private String GetStringNumber(int level, int[] countLevels)
{
    String result = "";
    for (int i = 0; i < level; i++)
    {
        if (result.isEmpty()) // compare content, not references
        {
            result += countLevels[i];
        }
        else
        {
            result += "." + countLevels[i];
        }
    }

    return result;
}
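
The counting idea used above can be tried without Aspose.Words at all. The following is a minimal sketch in plain Java, assuming decimal numbering that starts at 1, at most nine levels, and no skipped levels (a skipped level would show up as 0). The class name LevelNumbering and its method next are hypothetical names chosen for this illustration. Every time an item is counted, the counters of all deeper levels are restarted.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; it is not part of Aspose.Words.
class LevelNumbering {
    // One counter per list level (assumption: at most nine levels).
    private final int[] counters = new int[9];

    /** Counts one item at the given 1-based level and returns a label such as "1.2.1". */
    String next(int level) {
        if (level < 1 || level > counters.length) {
            throw new IllegalArgumentException("level must be between 1 and " + counters.length);
        }
        counters[level - 1]++;
        // Restart every deeper level so the next child starts again at 1.
        Arrays.fill(counters, level, counters.length, 0);

        StringBuilder label = new StringBuilder();
        for (int i = 0; i < level; i++) {
            if (i > 0) {
                label.append('.');
            }
            label.append(counters[i]);
        }
        return label.toString();
    }
}
```

Feeding it the levels 1, 2, 3, 1, 2 in that order yields the labels 1, 1.1, 1.1.1, 2 and 2.1.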

alexey.noskov · February 21, 2023, 5:50am

@madhusudhangovindu You should simply call Document.updateListLabels before accessing the Paragraph.ListLabel property.

doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getListLabel().getLabelValue() + "-> " +paragraph.getText());
}

FYI @carlos.molina

madhusudhangovindu · February 21, 2023, 6:11am

Hi Alexey
The output is coming as below.
.

but it should be as below

If we have a bullet list such as (a), (b), … under 1 or 1.1, the provided solution still returns 1.
Can you please help?

alexey.noskov · February 21, 2023, 6:18am

@madhusudhangovindu Sure, you should use the LabelString property instead of LabelValue:

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getListLabel().getLabelString() + "-> " +paragraph.toString(SaveFormat.TEXT).trim());
}

madhusudhangovindu · February 21, 2023, 6:56am

Thank you so much for your help, Alexey. The solution is working.
But when the paragraph style is a Heading, the solution does not work. Can you please help?

alexey.noskov · February 21, 2023, 6:57am

@madhusudhangovindu Could you please attach the problematic document here for testing? We will check it and provide you with more information.

madhusudhangovindu · February 21, 2023, 8:09am

Hi Alexey, it looks like it's working. I will get back to you if I need anything else.

madhusudhangovindu · February 21, 2023, 9:38am

Hi Alexey, when Level 1 uses a Heading style, as in the attached document, the level number is printed even without calling paragraph.getListLabel().getLabelString().

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getText());
}

During document extraction I cannot be sure which kind of document I will get.

So is there a generic way to get the label number along with the text?

Test Doc with Heading.docx (34.1 KB)

alexey.noskov · February 21, 2023, 9:44am

@madhusudhangovindu paragraph.getListLabel().getLabelString() returns nothing because heading paragraphs are not actually list items, so they do not have list labels. The numbers before the text in these paragraphs are plain text. You can add a condition to check whether the paragraph is a list item:

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.isListItem())
        System.out.println(paragraph.getListLabel().getLabelString() + "-> " + paragraph.toString(SaveFormat.TEXT).trim());
}

madhusudhangovindu · February 21, 2023, 9:56am

Ok Alexey, got it.

When I have a heading paragraph like the one I shared, how can I extract the data from 1.1 and 1.2 if the data spans multiple lines?

Thank you.

alexey.noskov · February 21, 2023, 10:19am

@madhusudhangovindu You can parse the paragraph text. For example, see the following code:

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    String paragraphText = paragraph.toString(SaveFormat.TEXT).trim();
    int tabIndex = paragraphText.indexOf("\t");
    if (paragraph.isListItem())
        // Real list item: the label is generated by the list.
        System.out.println(paragraph.getListLabel().getLabelString() + "-> " + paragraphText);
    else if (tabIndex > 0)
        // Typed-in number: everything before the first tab is treated as the label.
        System.out.println(paragraphText.substring(0, tabIndex) + "-> " + paragraphText.substring(tabIndex + 1).trim());
}
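
When the number is typed into the paragraph text, the split can also be done with a regular expression, which covers a space as well as a tab after the number. This is only a sketch in plain Java: the class LabelSplitter and its method split are hypothetical names, and it assumes labels made of digits separated by dots (such as 1, 1.2 or 1.2.1, optionally followed by a dot). Text that merely starts with a number, for example a year, would be misread as a label.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of Aspose.Words.
class LabelSplitter {
    // Dotted decimal label, optional trailing dot, whitespace, then the rest of the text.
    // DOTALL lets the text part span several lines.
    private static final Pattern NUMBERED =
            Pattern.compile("^(\\d+(?:\\.\\d+)*)\\.?\\s+(.+)$", Pattern.DOTALL);

    /** Returns {label, text}; the label is empty when the text has no leading number. */
    static String[] split(String paragraphText) {
        String trimmed = paragraphText.trim();
        Matcher matcher = NUMBERED.matcher(trimmed);
        if (matcher.matches()) {
            return new String[] { matcher.group(1), matcher.group(2).trim() };
        }
        return new String[] { "", trimmed };
    }
}
```

For example, LabelSplitter.split("1.1\tScope of work") gives the label 1.1 and the text "Scope of work", while a paragraph without a leading number comes back with an empty label.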

madhusudhangovindu · February 21, 2023, 1:45pm

Thank you for your quick response. If my document is in the below format:

Some Text
1.1 child of some text
1.1.1 child of child of some text
1.2 second child of some text
1.2.1 child of second child of some text

then how can I identify that 1.1 and 1.2 are children of 1,
that 1.1.1 is a child of 1.1,
and that 1.2.1 is a child of 1.2…

This applies to both heading paragraphs and level paragraphs.

Can you please help.

thanks

alexey.noskov · February 21, 2023, 2:45pm

@madhusudhangovindu If all the paragraphs are list items, you can detect whether they belong to the same list. But in your case, where some paragraphs are list items and others just imitate numbering, you need to analyze the textual content and the order of the paragraphs in the document. Aspose.Words only provides you with a tool to read the document content; analyzing the content is another task. I am afraid this task is out of Aspose.Words scope.
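
As one possible starting point for that textual analysis, once a label string such as 1.2.1 is available (whether generated by a list or typed in), its parent label can be derived from the text alone by dropping the last component. This is a sketch in plain Java, assuming dot-separated labels; the class LabelParent and its methods are hypothetical names and not part of Aspose.Words.

```java
// Hypothetical helper for illustration; it is not part of Aspose.Words.
class LabelParent {
    /** Returns the parent label ("1.2" for "1.2.1"), or "" for a top-level label. */
    static String parentOf(String label) {
        int lastDot = label.lastIndexOf('.');
        return lastDot < 0 ? "" : label.substring(0, lastDot);
    }

    /** Returns how many components the label has ("1.2.1" has three). */
    static int depthOf(String label) {
        return label.isEmpty() ? 0 : label.split("\\.").length;
    }
}
```

With the labels from the example, parentOf("1.1") and parentOf("1.2") both give 1, and parentOf("1.2.1") gives 1.2. Labels in other formats, such as (a) or 1), would need their own handling.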

madhusudhangovindu · February 21, 2023, 5:17pm

Hi Alexey, if we have a document where all paragraphs are list paragraphs,
then how can I identify that 1.1 and 1.2 are children of 1,
that 1.1.1 is a child of 1.1,
and that 1.2.1 is a child of 1.2…

alexey.noskov · February 22, 2023, 5:37am

@madhusudhangovindu In this case the paragraphs belong to the same list (see the Paragraph.getListFormat().getList() property), but have different levels (see the Paragraph.getListFormat().getListLevelNumber() property).
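
Given the level number of each item in document order, the parent of every item can be found with a stack of open ancestors. The sketch below is plain Java and works on an array of levels, so it makes no Aspose.Words calls. It assumes all items belong to the same list and that levels are numbered from 0, which is how getListLevelNumber() is understood to report them. The class OutlineParents and its method parentIndexes are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration; it is not part of Aspose.Words.
class OutlineParents {
    /**
     * levels[i] is the level of the i-th list item in document order.
     * Returns, for each item, the index of its parent item, or -1 for top-level items.
     */
    static int[] parentIndexes(int[] levels) {
        int[] parents = new int[levels.length];
        Deque<Integer> open = new ArrayDeque<>(); // indexes of the current item's ancestors
        for (int i = 0; i < levels.length; i++) {
            // Close every item that is at the same level or deeper than the current one.
            while (!open.isEmpty() && levels[open.peek()] >= levels[i]) {
                open.pop();
            }
            parents[i] = open.isEmpty() ? -1 : open.peek();
            open.push(i);
        }
        return parents;
    }
}
```

For the items 1, 1.1, 1.1.1, 1.2, 1.2.1 the levels are 0, 1, 2, 1, 2 and the result is -1, 0, 1, 0, 3: both 1.1 and 1.2 point to item 1, 1.1.1 points to 1.1, and 1.2.1 points to 1.2.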

madhusudhangovindu · February 22, 2023, 7:22am

Hi Alexey, when I read the Word document and print paragraph.getText(), it prints the footer data first and only then the document's paragraph data. How can I skip the footer data?

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    System.out.println(paragraph.getText());
}

alexey.noskov · February 22, 2023, 7:36am

@madhusudhangovindu You can use code like the following to skip paragraphs from header/footer:

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {

    // Skip paragraphs from header/footer
    if (paragraph.getAncestor(NodeType.HEADER_FOOTER) != null)
        continue;

    System.out.println(paragraph.getText());
}

Also, I would suggest using DocumentVisitor to iterate over the document’s nodes. It makes the document’s structure easier to track:
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/#extract-content-using-documentvisitor