Extracting all Headings


#1

Hello,
I am evaluating Aspose, and one of our use cases is to extract all headings from a document. How can I do that?

A.	Introduction 

Some intro from SK
• After the intro
▪ After the intro introd

  1. Headline Act 2
    Normal 2
    • Bullet 2
    ▪ Sq Bullet 2
    (a) Headline Act 3
    Normal 3 headline
    • Bullet 3
    ▪ Sq Bullet 3
    (i) Headline 4
    Normal 4

So from the text above I want to extract:
Heading 1: Introduction
Heading 2: Headline Act 2
Heading 3: Headline Act 3

Here is the code I was playing with, which gives me the whole document, not just the headings:

// Walks every section and prints every paragraph (so it dumps the whole document)
SectionCollection sectionCollection = docCon.getSections();
for (Section section : sectionCollection) {
    Body sectionBody = section.getBody();
    System.out.println("sectionBody :" + sectionBody);
    for (Paragraph paragraph : sectionBody.getParagraphs()) {
        System.out.println("sectionBody :" + sectionBody + ", node type: " + paragraph.getNodeType() + ", paragraph :" + paragraph.getText());
        NodeCollection paraNodeCollection = paragraph.getChildNodes();
        for (Node node : paraNodeCollection.toArray()) {
            // System.out.println("----------------- node type: " + node.getNodeType() + " node: " + node.getText());
        }
    }
}


#2

@aspose1212,

You can build on the following code to achieve what you are looking for:

for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_1 ||
            para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_2 ||
            para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_3 /* and so on*/) {

        System.out.println(para.toString(SaveFormat.TEXT) + " <-- this is a heading para");
    }
}
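
If you need every heading level, the `/* and so on */` chain can probably be collapsed into a range check. The sketch below is only an illustration: `HeadingSketch`, `Para`, `headingLevel` and `extractHeadings` are hypothetical stand-ins (not Aspose.Words classes) so that it runs on its own, and it assumes the heading style identifiers are consecutive integers from 1 to 9, which you should verify against the library. In real code the identifier would come from `para.getParagraphFormat().getStyleIdentifier()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in types for illustration only; these are NOT Aspose.Words classes.
final class HeadingSketch {

    // Assumption: heading style identifiers are the consecutive integers 1..9.
    // Check this against the library's StyleIdentifier constants before relying on it.
    static final int HEADING_1 = 1;
    static final int HEADING_9 = 9;

    /** Hypothetical paragraph: a style identifier plus its text. */
    static final class Para {
        final int styleId;
        final String text;

        Para(int styleId, String text) {
            this.styleId = styleId;
            this.text = text;
        }
    }

    /** Returns 1..9 for a heading style identifier, or 0 for anything else. */
    static int headingLevel(int styleId) {
        if (styleId >= HEADING_1 && styleId <= HEADING_9) {
            return styleId - HEADING_1 + 1;
        }
        return 0;
    }

    /** Collects "Heading N: text" lines for headings up to and including maxLevel. */
    static List<String> extractHeadings(List<Para> paragraphs, int maxLevel) {
        List<String> out = new ArrayList<String>();
        for (Para p : paragraphs) {
            int level = headingLevel(p.styleId);
            if (level >= 1 && level <= maxLevel) {
                out.add("Heading " + level + ": " + p.text.trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the sample document from the question (0 = a non-heading style here).
        List<Para> doc = Arrays.asList(
                new Para(1, "Introduction"),
                new Para(0, "Some intro from SK"),
                new Para(2, "Headline Act 2"),
                new Para(0, "Normal 2"),
                new Para(3, "Headline Act 3"),
                new Para(4, "Headline 4"));
        for (String line : extractHeadings(doc, 3)) {
            System.out.println(line);
        }
    }
}
```

With `maxLevel` set to 3, the sample yields the three lines asked for in the question and skips "Headline 4".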


#4

Thanks, it works in most cases, but it failed when the document had tracked changes/comments.
Say I have a heading called “Introduction” and have added the comment “nice job”; using the code above, I then get “Introductionnice job”. How can I extract just the heading text without pulling in the tracked-change/comment text?

I realize attaching a sample document will help, but it is a bit hard (not allowed) to attach documents on external sites from within our firm. Hope you will be able to reproduce this with my description above?


#5

@aspose1212,

You can simply remove comments from Heading Paragraphs before getting their text:

for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_1 ||
            para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_2 ||
            para.getParagraphFormat().getStyleIdentifier() == StyleIdentifier.HEADING_3 /* and so on*/) {

        para.getChildNodes(NodeType.COMMENT, true).clear();
        System.out.println(para.toString(SaveFormat.TEXT) + " <-- this is a heading para");
    }
}
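
One thing to keep in mind: `clear()` removes the comment nodes from the in-memory document, which matters if you save it afterwards. If you would rather leave the document untouched, an alternative is to build the heading text only from content that sits outside comments. The sketch below shows that idea on a hypothetical stand-in model (`HeadingTextSketch`, `Node`, `Kind` and `textWithoutComments` are made up for illustration and are not Aspose.Words classes). Mapping it onto the real node tree, for example by skipping runs that have a comment ancestor, is an assumption that has not been tested here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in node model for illustration only; NOT the Aspose.Words node classes.
final class HeadingTextSketch {

    enum Kind { RUN, COMMENT }

    /** Hypothetical node: a RUN carries text, a COMMENT carries child nodes. */
    static final class Node {
        final Kind kind;
        final String text;
        final List<Node> children;

        private Node(Kind kind, String text, List<Node> children) {
            this.kind = kind;
            this.text = text;
            this.children = children;
        }

        static Node run(String text) {
            return new Node(Kind.RUN, text, new ArrayList<Node>());
        }

        static Node comment(Node... children) {
            return new Node(Kind.COMMENT, "", Arrays.asList(children));
        }
    }

    /** Concatenates run text and skips every comment subtree; the input is not modified. */
    static String textWithoutComments(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            if (n.kind == Kind.COMMENT) {
                continue; // drop the comment and everything nested inside it
            }
            sb.append(n.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The case from the question: heading "Introduction" with the comment "nice job".
        List<Node> heading = Arrays.asList(
                Node.run("Introduction"),
                Node.comment(Node.run("nice job")));
        System.out.println(textWithoutComments(heading)); // prints: Introduction
    }
}
```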

#6

Thanks, that worked!