Extract all paragraphs from the document excluding table paragraphs

shapov · July 9, 2009, 8:47am

I am using Aspose.Words to extract text from a Word document. So far I am getting all the paragraphs in the document using par_col = ((Body)doc.getSections().get(2).getBody()).getParagraphs(); This works well, but I want to retrieve all the paragraphs in the document EXCLUDING paragraphs that are inside any tables. Any help is appreciated.

Thank you.
Alex Shapovalov

alexey.noskov · July 9, 2009, 9:08am

Hi Alex,

Thanks for your inquiry. I think you can use DocumentVisitor to do that.

In this case, you can determine whether a paragraph is inside a table or not.
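To make the visitor idea concrete, here is a minimal sketch that does not use Aspose.Words at all. The `Node`, `Visitor`, and `OutsideTableCollector` types are hypothetical stand-ins invented for illustration; the real DocumentVisitor class has its own method names and signatures, which this thread does not show. The sketch only demonstrates the technique: count table starts and ends during the walk, and keep a paragraph only while the counter is zero.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model for illustration only; none of these types are Aspose.Words classes.
class VisitorSketch {
    enum Kind { BODY, TABLE, PARAGRAPH }

    static class Node {
        final Kind kind;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String text) { this.kind = kind; this.text = text; }

        Node add(Node child) { children.add(child); return this; }

        // Depth-first walk that reports the start and end of every node.
        void accept(Visitor v) {
            v.start(this);
            for (Node child : children) child.accept(v);
            v.end(this);
        }
    }

    interface Visitor {
        void start(Node n);
        void end(Node n);
    }

    // Collects paragraph text only while the walk is outside every table.
    static class OutsideTableCollector implements Visitor {
        private int tableDepth = 0;
        final List<String> paragraphs = new ArrayList<>();

        public void start(Node n) {
            if (n.kind == Kind.TABLE) tableDepth++;
            else if (n.kind == Kind.PARAGRAPH && tableDepth == 0) paragraphs.add(n.text);
        }

        public void end(Node n) {
            if (n.kind == Kind.TABLE) tableDepth--;
        }
    }

    static List<String> paragraphsOutsideTables(Node root) {
        OutsideTableCollector collector = new OutsideTableCollector();
        root.accept(collector);
        return collector.paragraphs;
    }

    // Body with one paragraph, one table holding a paragraph, and one more paragraph.
    static Node sampleDocument() {
        return new Node(Kind.BODY, "")
            .add(new Node(Kind.PARAGRAPH, "Intro"))
            .add(new Node(Kind.TABLE, "")
                .add(new Node(Kind.PARAGRAPH, "Cell text")))
            .add(new Node(Kind.PARAGRAPH, "Outro"));
    }

    public static void main(String[] args) {
        System.out.println(paragraphsOutsideTables(sampleDocument())); // prints [Intro, Outro]
    }
}
```

A depth counter rather than a boolean flag keeps the result correct when tables are nested inside other tables.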

Alternatively, you can get all paragraphs from the document and exclude the paragraphs that are inside tables during further processing. For example, see the following code:

// Open document
Document doc = new Document("C:\\Temp\\in");

// Get all paragraphs
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);

// Loop through all paragraphs
for (int parIdx = 0; parIdx < paragraphs.getCount(); parIdx++)
{
    // Get paragraph
    Paragraph par = (Paragraph)paragraphs.get(parIdx);

    // Check if paragraph is child of table.
    // And exclude such paragraphs from further processing.
    if (par.getAncestor(NodeType.TABLE) != null)
        continue;

    // Do something useful..............
    // .................................
}

Hope this helps.

Best regards.

shapov · July 9, 2009, 9:10am

if (par.getAncestor(NodeType.TABLE) != null)

This is what I was looking for, excellent response. And also lightning-fast as usual.

Thank you.
Alex S.

shapov · July 9, 2009, 9:35am

I have a follow-up question. This is in reference to the question I had in this thread.

Since I was unable to export those grouped shapes as an image, I want to at least disregard the text inside.

Thank you.

alexey.noskov · July 9, 2009, 10:40am

You can use this code:

if (par.getAncestor(NodeType.SHAPE) != null)

Hope this helps.

Best regards

shapov · July 9, 2009, 12:11pm

That was my first thought, however since the shapes are grouped, it combines the text from each text box into a single paragraph, and that paragraph returns null for this test:

if (par.getAncestor(NodeType.SHAPE) != null)

I think you can reproduce this behavior with my sample file posted in the other thread.

Thank you.

alexey.noskov · July 9, 2009, 1:18pm

Hi Alex,

Try also using

if (par.getAncestor(NodeType.GROUP_SHAPE) != null)

Best regards.

shapov · July 10, 2009, 9:04am

Alexey, thank you for your response, however using NodeType.GROUP_SHAPE yielded no results. Here’s the debug print that I am using to test this:

id: 100
style id: 0
empty?: 645
style name: Normal
isHeading: false
isInTable: null
isInShape: null
isInGroupShape: null
parentType: 3

As you can tell, everything about this paragraph is average. I can’t seem to find a way to get a hook on this paragraph.

Thank You.

alexey.noskov · July 10, 2009, 9:30am

Hi Alex,

Could you please attach a sample document here and show me simple code that will allow me to reproduce the problem?

Best regards.

shapov · July 10, 2009, 10:12am

Attached is an example file. This is the piece of code I use to parse it:

NodeCollection par_col = doc.getChildNodes(NodeType.PARAGRAPH, true);

for (int counter = 0; counter < par_col.getCount(); counter++)
{
    Paragraph par = (Paragraph)par_col.get(counter);

    // Skip empty paragraphs, paragraphs inside table cells, and list items
    if (par.toTxt().trim().length() > 0 && !par.isInCell() && !par.isListItem())
    {
        if (Global.debug)
        {
            outputStream.write("************************\n");
            outputStream.write("id: " + counter + "\n");
            outputStream.write("style id: " + par.getParagraphFormat().getStyleIdentifier() + "\n");
            outputStream.write("empty?: " + par.toTxt().trim().length() + "\n");
            outputStream.write("style name: " + par.getParagraphFormat().getStyleName() + "\n");
            outputStream.write("isHeading: " + par.getParagraphFormat().isHeading() + "\n");
            outputStream.write("isInTable: " + par.getAncestor(NodeType.TABLE) + "\n");
            outputStream.write("isInShape: " + par.getAncestor(NodeType.SHAPE) + "\n");
            outputStream.write("isInGroupShape: " + par.getAncestor(NodeType.GROUP_SHAPE) + "\n");
            outputStream.write("parentType: " + par.getParentNode().getNodeType() + "\n");
            outputStream.write("------------------------\n");
            outputStream.write(par.toTxt().trim() + "\n");
            outputStream.write("************************\n");
        }
        else
        {
            outputStream.write(par.toTxt().trim() + "\n");
        } // end if (Global.debug)
    } // end if
} // end for

Using this code, paragraph #2 contains the text from every text box combined into one paragraph.

Please let me know what you think.

alexey.noskov · July 10, 2009, 11:04am

Hi Alex,

I think I understand what is going wrong on your side. You use the par.toTxt() method to get the text of your paragraphs, but shapes are also children of paragraphs in Word documents, so par.toTxt() also returns the text of the shapes (the text of all paragraphs inside each shape). To remove this text, you should remove the shapes from the document.

// Remove Group shapes
NodeCollection groupShapes = doc.getChildNodes(NodeType.GROUP_SHAPE, true);
while (groupShapes.getCount() != 0)
    groupShapes.removeAt(0);
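The hierarchy described above also explains why the earlier ancestor checks returned null. The sketch below models it with hypothetical stand-in types (`Node`, `Kind`, `subtreeText`, `ownRunText` are invented for illustration and are not Aspose.Words API). It assumes a tree of Body > Paragraph > GroupShape > Shape > Paragraph, where the outer paragraph is the parent of the group shape, so looking upward from it finds no shape, while reading its whole subtree still picks up the text box content.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model for illustration only; none of these types are Aspose.Words classes.
class ShapeTextSketch {
    enum Kind { BODY, PARAGRAPH, RUN, GROUP_SHAPE, SHAPE }

    static class Node {
        final Kind kind;
        final String text;   // only RUN nodes carry text in this model
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String text) { this.kind = kind; this.text = text; }

        Node add(Node child) { child.parent = this; children.add(child); return this; }

        // Walks upward only, so it never sees nodes nested inside this one.
        Node getAncestor(Kind wanted) {
            for (Node n = parent; n != null; n = n.parent) {
                if (n.kind == wanted) return n;
            }
            return null;
        }

        // Text of the whole subtree, including text boxes nested in shapes.
        String subtreeText() {
            StringBuilder sb = new StringBuilder(text);
            for (Node child : children) sb.append(child.subtreeText());
            return sb.toString();
        }

        // Text of direct RUN children only, so shape content is skipped.
        String ownRunText() {
            StringBuilder sb = new StringBuilder();
            for (Node child : children) {
                if (child.kind == Kind.RUN) sb.append(child.text);
            }
            return sb.toString();
        }
    }

    // Builds Body > Paragraph > (Run, GroupShape > Shape > Paragraph > Run)
    // and returns the outer paragraph.
    static Node sampleParagraph() {
        Node inner = new Node(Kind.PARAGRAPH, "").add(new Node(Kind.RUN, "box text"));
        Node group = new Node(Kind.GROUP_SHAPE, "")
            .add(new Node(Kind.SHAPE, "").add(inner));
        Node outer = new Node(Kind.PARAGRAPH, "")
            .add(new Node(Kind.RUN, "body text "))
            .add(group);
        new Node(Kind.BODY, "").add(outer);
        return outer;
    }

    public static void main(String[] args) {
        Node outer = sampleParagraph();
        System.out.println(outer.getAncestor(Kind.GROUP_SHAPE)); // prints null
        System.out.println(outer.subtreeText());                 // prints body text box text
        System.out.println(outer.ownRunText());                  // prints body text
    }
}
```

In this model, reading only a paragraph's direct runs is an alternative to deleting the shapes, which leaves the document unmodified. Whether the Aspose.Words version in use offers an equivalent call is not confirmed in this thread.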

Best regards.

shapov · July 10, 2009, 11:26am

Once again, thank you for your very quick and more importantly very helpful response.

Thank you.
Alex S.