Free Support Forum - aspose.com

Question about ppt vs pptx text extraction

I'm currently working with some older Java code that extracts text from a PPT file. I'm in the middle of creating a new class that will extract text from a PPTX file. The older code extracts text in two ways:

1. shape.getTextFrame().getText(); // By shapes

2. ((TextHolder) holder).getText(); // By placeholder

From what I've found in the documentation, this is done differently for a PPTX file. Is the code below all I need, or is there a way to replicate the two methods of extracting text (by shape and by placeholder) as is done for a PPT file?

for (int j = 0; j < shapesCount; j++) {
    ShapeEx shape = shps.get(j);
    if (shape.isTextHolder()) {
        AutoShapeEx ashp = (AutoShapeEx) shape;
        if (ashp.getTextFrame() != null)
            mpTHA.handleText(TS_PLACEHOLDER, ashp.getTextFrame().getText(), null);
        //texts.add(shape.getTextFrame().getText());
    }
}

Thanks,

Jesse Majcher

jmajcher@qumu.com


I would also appreciate an example of text extraction from a PPTX file. It seems a few things have changed.

I’m using the .NET version by the way.

Hi Remy,

Thanks for your interest in Aspose.Slides.

Please use the code snippet below to extract text from placeholders and text frames in slide shapes. For reference, I have uploaded the sample PPTX file as well.

[C#]

PresentationEx presentation = new PresentationEx("D://ppt//Test.pptx");

for (int index = 0; index < presentation.Slides.Count; index++)
{
    SlideEx slide = presentation.Slides[index];
    ShapesEx shps = slide.Shapes;

    foreach (ShapeEx shp in shps)
    {
        if (shp.Placeholder != null)
        {
            // Text held by a placeholder shape
            String Text = ((AutoShapeEx)shp).TextFrame.Text;
        }
        else
        {
            if (shp is AutoShapeEx)
            {
                // Text held by the text frame of an ordinary shape
                AutoShapeEx ashp = (AutoShapeEx)shp;
                TextFrameEx tf = ashp.TextFrame;
                String Text = tf.Text;
            }
        }
    }
}

We are sorry for the delayed response.

Thanks for the example code, but I think it fails in a few cases. Here is my version, along with my own small PowerPoint test file:


static private int CountInPowerPointEx(MemoryStream stream)
{
    //PresentationEx instead of Presentation, some weird aspose quirk.
    PresentationEx pres = new PresentationEx(stream);

    StringBuilder text = new StringBuilder();

    foreach (SlideEx fstSlide in pres.Slides)
    {
        //text.Append(fstSlide.HeaderFooter.HeaderText); not necessary for pptx files

        //iterate through all shapes and try to get the text from the shapes
        foreach (ShapeEx shp in fstSlide.Shapes)
        {
            text.Append(GetTextFromPowerPointShapesEx(shp));
        }
    }
    return CountWordsInString(text.ToString());
}


static private string GetTextFromPowerPointShapesEx(ShapeEx shp)
{
    StringBuilder text = new StringBuilder();

    if (shp is Aspose.Slides.Pptx.AutoShapeEx)
    {
        AutoShapeEx ashp = shp as AutoShapeEx;
        TextFrameEx tf = ashp.TextFrame;

        if (tf != null)
        {
            text.Append(tf.Text.Replace("\r", " ") + " "); //for some reason, the linebreak \r disappears after the Append
        }
    }
    else if (shp is Aspose.Slides.Pptx.GroupShapeEx)
    {
        GroupShapeEx gshp = shp as GroupShapeEx;

        foreach (ShapeEx shpex in gshp.Shapes)
        {
            text.Append(GetTextFromPowerPointShapesEx(shpex));
        }
    }
    else if (shp is Aspose.Slides.Pptx.TableEx)
    {
        TableEx tbl = shp as TableEx;

        foreach (RowEx row in tbl.Rows)
        {
            CellListEx cellList = row as CellListEx;

            foreach (CellEx cell in cellList)
            {
                TextFrameEx tf = cell.TextFrame;

                if (tf != null)
                {
                    text.Append(tf.Text.Replace("\r", " ") + " "); //for some reason, the linebreak \r disappears after the Append
                }
            }
        }
    }

    return text.ToString();
}//GetTextFromPowerPointShapesEx
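
For anyone following along in Java, here is a rough sketch of the same traversal pattern: text shapes are read directly, groups are walked recursively, and tables are walked cell by cell. The types below (Shape, TextShape, Group, Table) are hypothetical stand-ins written only for this illustration. They are not Aspose.Slides classes, so the real API calls would need to be substituted. The countWords helper is also an assumption, since CountWordsInString is not shown in the post above.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: Shape, TextShape, Group and Table are hypothetical
// stand-ins, NOT Aspose.Slides classes.
class ShapeTextSketch {

    interface Shape {}

    // Plays the role of a shape with a text frame; null text means "no text frame".
    static final class TextShape implements Shape {
        final String text;
        TextShape(String text) { this.text = text; }
    }

    // Plays the role of a group shape that nests other shapes.
    static final class Group implements Shape {
        final List<Shape> children;
        Group(Shape... children) { this.children = Arrays.asList(children); }
    }

    // Plays the role of a table; each cell is reduced to its text.
    static final class Table implements Shape {
        final List<List<String>> rows;
        Table(List<List<String>> rows) { this.rows = rows; }
    }

    // Appends the text of one shape, recursing into groups and walking table cells.
    static void collect(Shape shape, StringBuilder out) {
        if (shape instanceof TextShape) {
            appendNormalized(((TextShape) shape).text, out);
        } else if (shape instanceof Group) {
            for (Shape child : ((Group) shape).children) {
                collect(child, out);
            }
        } else if (shape instanceof Table) {
            for (List<String> row : ((Table) shape).rows) {
                for (String cell : row) {
                    appendNormalized(cell, out);
                }
            }
        }
    }

    // Replaces each line break (\r\n, \r or \n) with a space, then adds a
    // trailing space so adjacent shapes do not run together.
    static void appendNormalized(String text, StringBuilder out) {
        if (text == null) {
            return;
        }
        out.append(text.replaceAll("\\r\\n|\\r|\\n", " ")).append(' ');
    }

    // Collects the text of every top-level shape on a slide.
    static String extract(List<Shape> shapes) {
        StringBuilder out = new StringBuilder();
        for (Shape shape : shapes) {
            collect(shape, out);
        }
        return out.toString();
    }

    // Hypothetical word counter: counts whitespace-separated tokens.
    static int countWords(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        List<Shape> slide = Arrays.asList(
            new TextShape("Title\rSubtitle"),
            new Group(new TextShape("inside group"), new TextShape(null)),
            new Table(Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c\nd"))));
        System.out.println(countWords(extract(slide))); // prints 8
    }
}
```

Normalizing all three line-break forms rather than only \r is a design choice of this sketch, so that text pasted from different sources is counted the same way.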