Question about ppt vs pptx text extraction

czaloumis · March 9, 2010, 5:58pm

I'm currently working with some older Java code that extracts text from a PPT file, and I'm in the middle of creating a new class that will extract text from a PPTX file. The older code extracts text in two ways:

1. shape.getTextFrame().getText(); // By shapes

2. ((TextHolder) holder).getText(); // By placeholder

From what I've found in the documentation, this is done differently for a PPTX file. Is the code below all I will need, or is there a way to replicate the two methods of extracting text (by shapes and by placeholder) as it is done for a PPT file?

for (int j = 0; j < shapesCount; j++) {
    ShapeEx shape = shps.get(j);

    if (shape.isTextHolder()) {
        AutoShapeEx ashp = (AutoShapeEx) shape;

        if (ashp.getTextFrame() != null)
            mpTHA.handleText(TS_PLACEHOLDER, ashp.getTextFrame().getText(), null);

        //texts.add(shape.getTextFrame().getText());
    }
}

Thanks,

Jesse Majcher

jmajcher@qumu.com

rblaettler · March 10, 2010, 1:23pm

I would actually also appreciate an example for Text Extraction out of a PPTX file. Seems like you changed around a few things.

I’m using the .NET version by the way.

mudassir.fayyaz · March 10, 2010, 3:53pm

Hi Remy,

Thanks for your interest in Aspose.Slides.

Please use the code snippet below to extract text from placeholders and text frames in slide shapes. For reference, I have uploaded the sample PPTX file as well.

[C#]

PresentationEx presentation = new PresentationEx("D://ppt//Test.pptx");

for (int index = 0; index < presentation.Slides.Count; index++)
{
    SlideEx slide = presentation.Slides[index];
    ShapesEx shps = slide.Shapes;

    foreach (ShapeEx shp in shps)
    {
        if (shp.Placeholder != null)
        {
            String Text = ((AutoShapeEx)shp).TextFrame.Text;
        }
        else
        {
            if (shp is AutoShapeEx)
            {
                AutoShapeEx ashp = (AutoShapeEx)shp;
                TextFrameEx tf = ashp.TextFrame;
                String Text = tf.Text;
            }
        }
    }
}
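One thing to watch in a loop like this is the unchecked cast in the placeholder branch: if a placeholder turns out not to be an auto shape (a picture placeholder, for instance), the cast throws. The following is only a minimal Java sketch of the type-check-before-cast idea. `Shape`, `AutoShape`, and `PictureFrame` are hypothetical stand-ins written for this illustration and are not the Aspose.Slides classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PlaceholderCastSketch {
    // Hypothetical stand-in types; NOT the Aspose.Slides API.
    public static class Shape {
        public final boolean placeholder;
        public Shape(boolean placeholder) { this.placeholder = placeholder; }
    }

    public static class AutoShape extends Shape {
        public final String text; // null models a shape without a text frame
        public AutoShape(boolean placeholder, String text) {
            super(placeholder);
            this.text = text;
        }
    }

    public static class PictureFrame extends Shape {
        public PictureFrame(boolean placeholder) { super(placeholder); }
    }

    // Collects text from every shape that can hold it, placeholder or not.
    public static List<String> extractTexts(List<Shape> shapes) {
        List<String> texts = new ArrayList<>();
        for (Shape shape : shapes) {
            // Check the type first so a non-text placeholder is skipped
            // instead of causing a ClassCastException.
            if (!(shape instanceof AutoShape)) {
                continue;
            }
            String text = ((AutoShape) shape).text;
            if (text != null) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Shape> shapes = Arrays.asList(
                new AutoShape(true, "Title"),
                new PictureFrame(true),
                new AutoShape(false, null),
                new AutoShape(false, "Body"));
        System.out.println(extractTexts(shapes)); // prints [Title, Body]
    }
}
```

With the type check in place, the placeholder and non-placeholder branches collapse into one path, since both read the text frame the same way.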

We are sorry for the delayed response.

rblaettler · March 16, 2010, 1:26pm

Thanks for the example code, but I think it fails for a few cases. Here is my version, along with my own little PowerPoint test file:

static private int CountInPowerPointEx(MemoryStream stream)
{
    // PresentationEx instead of Presentation, some weird Aspose quirk.
    PresentationEx pres = new PresentationEx(stream);
    StringBuilder text = new StringBuilder();

    foreach (SlideEx fstSlide in pres.Slides)
    {
        // text.Append(fstSlide.HeaderFooter.HeaderText); not necessary for pptx files

        // Iterate through all shapes and try to get the text from each one.
        foreach (ShapeEx shp in fstSlide.Shapes)
        {
            text.Append(GetTextFromPowerPointShapesEx(shp));
        }
    }

    return CountWordsInString(text.ToString());
}

static private string GetTextFromPowerPointShapesEx(ShapeEx shp)
{
    StringBuilder text = new StringBuilder();

    if (shp is Aspose.Slides.Pptx.AutoShapeEx)
    {
        AutoShapeEx ashp = shp as AutoShapeEx;
        TextFrameEx tf = ashp.TextFrame;

        if (tf != null)
        {
            text.Append(tf.Text.Replace("\r", " ") + " "); //for some reason, the linebreak \r disappears after the Append
        }
    }
    else if (shp is Aspose.Slides.Pptx.GroupShapeEx)
    {
        // Grouped shapes can be nested, so recurse into each child.
        GroupShapeEx gshp = shp as GroupShapeEx;

        foreach (ShapeEx shpex in gshp.Shapes)
        {
            text.Append(GetTextFromPowerPointShapesEx(shpex));
        }
    }
    else if (shp is Aspose.Slides.Pptx.TableEx)
    {
        // Tables keep their text in the text frame of each cell.
        TableEx tbl = shp as TableEx;

        foreach (RowEx row in tbl.Rows)
        {
            CellListEx cellList = row as CellListEx;

            foreach (CellEx cell in cellList)
            {
                TextFrameEx tf = cell.TextFrame;

                if (tf != null)
                {
                    text.Append(tf.Text.Replace("\r", " ") + " "); //for some reason, the linebreak \r disappears after the Append
                }
            }
        }
    }

    return text.ToString();
}//GetTextFromPowerPointShapesEx
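
Since the original question was about Java, here is a rough Java sketch of the same recursive approach: plain text shapes, grouped shapes, and table cells are walked, `\r` is replaced with a space, and the words are counted. It only illustrates the traversal. `Shape`, `TextShape`, `GroupShape`, `TableShape`, and `countWords` are hypothetical stand-ins invented for this example (the thread does not show `CountWordsInString`), and none of them are Aspose.Slides classes.

```java
import java.util.Arrays;
import java.util.List;

public class ShapeTextSketch {
    // Hypothetical stand-in types; NOT the Aspose.Slides API.
    public interface Shape {}

    public static final class TextShape implements Shape {
        public final String text; // null models a shape without a text frame
        public TextShape(String text) { this.text = text; }
    }

    public static final class GroupShape implements Shape {
        public final List<Shape> children;
        public GroupShape(Shape... children) { this.children = Arrays.asList(children); }
    }

    public static final class TableShape implements Shape {
        public final String[][] cells; // a null cell models a missing text frame
        public TableShape(String[][] cells) { this.cells = cells; }
    }

    // Appends the text of one shape, recursing into groups and walking table cells.
    public static void collectText(Shape shape, StringBuilder out) {
        if (shape instanceof TextShape) {
            appendNormalized(((TextShape) shape).text, out);
        } else if (shape instanceof GroupShape) {
            for (Shape child : ((GroupShape) shape).children) {
                collectText(child, out);
            }
        } else if (shape instanceof TableShape) {
            for (String[] row : ((TableShape) shape).cells) {
                for (String cell : row) {
                    appendNormalized(cell, out);
                }
            }
        }
        // Any other shape kind is skipped in this sketch.
    }

    // Replaces paragraph breaks with spaces so adjacent words do not merge.
    static void appendNormalized(String text, StringBuilder out) {
        if (text == null) {
            return;
        }
        out.append(text.replace('\r', ' ')).append(' ');
    }

    // Assumed word counter: splits on runs of whitespace.
    public static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        List<Shape> slide = Arrays.asList(
                new TextShape("Title\rSubtitle"),
                new GroupShape(new TextShape("inside group")),
                new TableShape(new String[][] {{"a", "b"}, {"c d", null}}));

        StringBuilder sb = new StringBuilder();
        for (Shape shape : slide) {
            collectText(shape, sb);
        }
        System.out.println(countWords(sb.toString())); // prints 8
    }
}
```

Keeping the traversal in one recursive method means nested groups need no special handling, which is the case a flat loop over the slide's shapes would miss.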