Question about ppt vs pptx text extraction

I'm currently working with some older Java code that extracts text from a PPT file, and I'm in the middle of creating a new class that will extract text from a PPTX file. The older code extracts text in two ways:

1. shape.getTextFrame().getText(); // By shapes

2. ((TextHolder) holder).getText(); // By placeholder

From what I've found in the documentation, this is done differently for a PPTX file. Is the code below all I will need, or is there a way to replicate the two methods of extracting text (by shapes and by placeholder) as is done for a PPT file?

for (int j = 0; j < shapesCount; j++) {
    ShapeEx shape = shps.get(j);
    if (shape.isTextHolder()) {
        AutoShapeEx ashp = (AutoShapeEx) shape;
        if (ashp.getTextFrame() != null)
            mpTHA.handleText(TS_PLACEHOLDER, ashp.getTextFrame().getText(), null);
        //texts.add(shape.getTextFrame().getText());
    }
}

Thanks,

Jesse Majcher

jmajcher@qumu.com



I would also appreciate an example of text extraction from a PPTX file. It seems you changed a few things.

I’m using the .NET version by the way.

Hi Remy,

Thanks for your interest in Aspose.Slides.

Please use the code snippet below to extract text from placeholders and text frames in slide shapes. For reference, I have uploaded the sample PPTX file as well.

[C#]

PresentationEx presentation = new PresentationEx("D://ppt//Test.pptx");

for (int index = 0; index < presentation.Slides.Count; index++)
{
    SlideEx slide = presentation.Slides[index];
    ShapesEx shps = slide.Shapes;

    foreach (ShapeEx shp in shps)
    {
        if (shp.Placeholder != null)
        {
            String Text = ((AutoShapeEx)shp).TextFrame.Text;
        }
        else
        {
            if (shp is AutoShapeEx)
            {
                AutoShapeEx ashp = (AutoShapeEx)shp;
                TextFrameEx tf = ashp.TextFrame;
                String Text = tf.Text;
            }
        }
    }
}

We are sorry for the delayed response.

Thanks for the example code, but I think it fails in a few cases. Here is my version and my own little PowerPoint test file:


static private int CountInPowerPointEx(MemoryStream stream)
{
    // PresentationEx instead of Presentation, some weird Aspose quirk.
    PresentationEx pres = new PresentationEx(stream);

    StringBuilder text = new StringBuilder();

    foreach (SlideEx fstSlide in pres.Slides)
    {
        // text.Append(fstSlide.HeaderFooter.HeaderText); not necessary for pptx files

        // iterate through all shapes and try to get the text from the shapes
        foreach (ShapeEx shp in fstSlide.Shapes)
        {
            text.Append(GetTextFromPowerPointShapesEx(shp));
        }
    }
    return CountWordsInString(text.ToString());
}


static private string GetTextFromPowerPointShapesEx(ShapeEx shp)
{
    StringBuilder text = new StringBuilder();

    if (shp is Aspose.Slides.Pptx.AutoShapeEx)
    {
        AutoShapeEx ashp = shp as AutoShapeEx;

        if (ashp.TextFrame != null)
        {
            TextFrameEx tf = ashp.TextFrame;

            if (tf != null)
            {
                text.Append(tf.Text.Replace("\r", " ") + " "); // for some reason, the linebreak \r disappears after the Append
            }
        }
    }
    else if (shp is Aspose.Slides.Pptx.GroupShapeEx)
    {
        GroupShapeEx gshp = shp as GroupShapeEx;

        foreach (ShapeEx shpex in gshp.Shapes)
        {
            text.Append(GetTextFromPowerPointShapesEx(shpex));
        }
    }
    else if (shp is Aspose.Slides.Pptx.TableEx)
    {
        TableEx tbl = shp as TableEx;

        foreach (RowEx row in tbl.Rows)
        {
            CellListEx cellList = row as CellListEx;

            foreach (CellEx cell in cellList)
            {
                TextFrameEx tf = cell.TextFrame;

                if (tf != null)
                {
                    text.Append(tf.Text.Replace("\r", " ") + " "); // for some reason, the linebreak \r disappears after the Append
                }
            }
        }
    }

    return text.ToString();
} // GetTextFromPowerPointShapesEx
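
For anyone porting this approach back to Java (the language of the original question), the recursive traversal above can be sketched independently of the library. The sketch below is only an illustration under stated assumptions: TextShape, GroupShape and TableShape are hypothetical stand-ins, not Aspose.Slides classes, and countWords is a guess at what the unshown CountWordsInString helper might do. It shows the same three cases (plain text shape, group of shapes, table of cells), with "\r" replaced by a space and null text skipped.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in types for illustration only; these are NOT Aspose.Slides classes.
class ShapeTextSketch {
    interface Shape {}

    // Plays the role of an auto shape; text is null when there is no text frame.
    static final class TextShape implements Shape {
        final String text;
        TextShape(String text) { this.text = text; }
    }

    // Plays the role of a group shape holding child shapes.
    static final class GroupShape implements Shape {
        final List<Shape> children;
        GroupShape(Shape... children) { this.children = Arrays.asList(children); }
    }

    // Plays the role of a table: rows of cell text, null for a cell without text.
    static final class TableShape implements Shape {
        final List<List<String>> rows;
        TableShape(List<List<String>> rows) { this.rows = rows; }
    }

    // Appends text followed by a space, turning "\r" line breaks into spaces.
    static void appendText(StringBuilder out, String text) {
        if (text != null) {
            out.append(text.replace("\r", " ")).append(' ');
        }
    }

    // Recursively collects text from a shape, its children, or its table cells.
    static String extractText(Shape shape) {
        StringBuilder out = new StringBuilder();
        if (shape instanceof TextShape) {
            appendText(out, ((TextShape) shape).text);
        } else if (shape instanceof GroupShape) {
            for (Shape child : ((GroupShape) shape).children) {
                out.append(extractText(child));
            }
        } else if (shape instanceof TableShape) {
            for (List<String> row : ((TableShape) shape).rows) {
                for (String cell : row) {
                    appendText(out, cell);
                }
            }
        }
        return out.toString();
    }

    // Hypothetical word counter: splits on runs of whitespace.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        Shape slide = new GroupShape(
            new TextShape("Title\rSubtitle"),
            new GroupShape(new TextShape("nested text")),
            new TableShape(Arrays.asList(
                Arrays.asList("a", null),
                Arrays.asList("b c", "d"))));
        String text = extractText(slide);
        System.out.println(text);             // Title Subtitle nested text a b c d
        System.out.println(countWords(text)); // 8
    }
}
```

With the real library, the instanceof checks would target the actual shape, group and table classes, and the null check would guard the text frame rather than a plain string.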