Can't extract all text from slides after upgrading to aspose.slides 18.3

nuix · April 11, 2018, 7:43am

Hi,

We’ve run into a problem where we can no longer extract all of the text from a slide. This was working in aspose-slides 16.9.0 but doesn’t seem to work in 18.2 or 18.3.

Do we now need to use a different API to extract the text?
Or is this a regression?

Here’s how we’re getting our results:

* Download the Aspose.Slides for Java examples from GitHub (aspose-slides/Aspose.Slides-for-Java)
* From the zip below, add the .java file and the .ppt file

Observed: With aspose-slides 16.9.0, we see the following output:
Example
Powerpoint

text1

text2

text3

text4

Last slide


goal3

goal2

goal1

But, if we adjust pom.xml to aspose-slides 18.2 or 18.3, then we only see this output – it’s missing some of the text fields:

Example 
Powerpoint



Last slide
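
To pin down exactly which strings disappear, the two outputs above can be compared mechanically. This is only a minimal sketch, with the lines copied by hand from the outputs above; `MissingTextReport` and `missing` are names made up for this example and are not part of Aspose.Slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for this thread only; not an Aspose.Slides class.
final class MissingTextReport {

    // Returns the non-blank lines of 'before' that do not appear in 'after'.
    static List<String> missing(List<String> before, List<String> after) {
        List<String> remaining = new ArrayList<>();
        for (String line : after) {
            remaining.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : before) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // remove(Object) consumes one match, so repeated lines are counted individually.
            if (!remaining.remove(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Lines copied from the 16.9.0 output above (blank lines omitted).
        List<String> with169 = Arrays.asList(
            "Example", "Powerpoint", "text1", "text2", "text3", "text4",
            "Last slide", "goal3", "goal2", "goal1");
        // Lines copied from the 18.2 / 18.3 output above.
        List<String> with183 = Arrays.asList("Example ", "Powerpoint", "Last slide");

        for (String line : missing(with169, with183)) {
            System.out.println("missing: " + line);
        }
    }
}
```

For the outputs listed above this reports the four `text` strings and the three `goal` strings as missing.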

For reference, here’s the java source that we’re using:

package com.aspose;

import com.aspose.slides.*;

public class DumpAllText {
    public static void main(String[] args) {
        com.aspose.slides.License slidesLicence = new com.aspose.slides.License();
        slidesLicence.setLicense(AsposeUtils.getLicenceData());

        //ExStart:EndParaGraph
        // Instantiate a Presentation class that represents the legacy .ppt file
        Presentation presentation = new Presentation("test.ppt");

        ISlideCollection slides = presentation.getSlides();
        for(ISlide slide : slides) {
            for(ITextFrame textFrame : SlideUtil.getAllTextBoxes(slide)) {
                for(IParagraph paragraph : textFrame.getParagraphs()) {
                    for(IPortion portion : paragraph.getPortions()) {
                        System.out.println(portion.getText());
                    }
                }
            }
        }
    }
    //ExEnd:EndParaGraph
}

If we use SlideUtil.getAllTextFrames, it gives similar results – most of the textboxes in this .ppt are skipped over in aspose-slides 18 (but they’re picked up in 16.9.0).

What can we do to work around this?

DumpAllText.zip (5.7 KB)

mudassir.fayyaz · April 11, 2018, 10:52am

@nuix,

I have looked at your presentation. To extract all of the text, you need to traverse every slide and each of its shapes. Please try the following sample code on your end.

public static void GetText()
{

    String path = "C:\\Aspose Data\\DumpAllText\\";
    String presName = "test.ppt";
    Presentation pres = new Presentation(path+presName);
    ISlideCollection slides = pres.getSlides();

    ISlide slide = null;
    IShape shape = null;
    for (int i = 0; i < slides.size(); i++)
    {
        slide = slides.get_Item(i);

        for (int j = 0; j < slide.getShapes().size(); j++)
        {
            shape = slide.getShapes().get_Item(j);

            // if (shape.getPlaceholder() != null)
            if (shape instanceof AutoShape)
            {
                if (((IAutoShape) shape).getTextFrame() != null)
                {
                    ExtractFonts(((IAutoShape) shape).getTextFrame());
                }
            }
            else if (shape instanceof LegacyDiagram)
            {
                LegacyDiagram legacy = (LegacyDiagram) shape;

                ISmartArt smart = legacy.convertToSmartArt();
                for (ISmartArtNode node : smart.getAllNodes())
                {
                    if (node.getTextFrame() != null)
                    {
                        ExtractFonts(node.getTextFrame());
                    }
                }
            }
            else if (shape instanceof SmartArt)
            {
                ISmartArt smart = (ISmartArt) shape;
                for (ISmartArtNode node : smart.getAllNodes())
                {
                    if (node.getTextFrame() != null)
                    {
                        ExtractFonts(node.getTextFrame());
                    }
                }
            }
            else if (shape instanceof Table)
            {
                ITable table = (ITable) shape;
                for (int u = 0; u < table.getRows().size(); u++)
                {
                    for (int v = 0; v < table.getColumns().size(); v++)
                    {
                        ICell cell = table.get_Item(v, u);
                        if (cell.getTextFrame() != null)
                        {
                            ExtractFonts(cell.getTextFrame());
                        }
                    }
                }
            }
        }
    }
}

public static void ExtractFonts(ITextFrame tf2)
{
    // Despite its name, this helper prints the text of every portion in the frame.
    for (int k = 0; k < tf2.getParagraphs().getCount(); k++)
    {
        IParagraph paragraph = tf2.getParagraphs().get_Item(k);
        for (int n = 0; n < paragraph.getPortions().getCount(); n++)
        {
            IPortion portion = paragraph.getPortions().get_Item(n);
            System.out.println("Portion Text: " + portion.getText());
        }
    }
}
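
For anyone who wants to see the shape of this traversal without the library on the classpath, here is a rough sketch of the same dispatch-by-shape-type idea using stand-in types. Everything in it (`ShapeTextCollector`, `TextShape`, `DiagramShape`, `TableShape`, `GroupShape`) is hypothetical and is not the Aspose.Slides API. The `GroupShape` branch is an assumption added for illustration: the sample code above does not descend into nested containers, so similar handling may be needed if a presentation uses them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in model for illustration only; none of these are Aspose.Slides types.
final class ShapeTextCollector {

    interface Shape {}

    // Carries its own text, like an auto shape with a text frame (null = no text frame).
    static final class TextShape implements Shape {
        final String text;
        TextShape(String text) { this.text = text; }
    }

    // Text lives in child nodes, like a SmartArt or a converted legacy diagram.
    static final class DiagramShape implements Shape {
        final List<String> nodeTexts;
        DiagramShape(String... nodeTexts) { this.nodeTexts = Arrays.asList(nodeTexts); }
    }

    // Text lives in cells, like a table; each inner list is one row.
    static final class TableShape implements Shape {
        final List<List<String>> rows;
        TableShape(List<List<String>> rows) { this.rows = rows; }
    }

    // Hypothetical container of nested shapes; not covered by the sample code above.
    static final class GroupShape implements Shape {
        final List<Shape> children;
        GroupShape(Shape... children) { this.children = Arrays.asList(children); }
    }

    // Collects text from every shape, in the order the shapes are listed.
    static List<String> collect(List<Shape> shapes) {
        List<String> out = new ArrayList<>();
        for (Shape shape : shapes) {
            collectInto(shape, out);
        }
        return out;
    }

    private static void collectInto(Shape shape, List<String> out) {
        if (shape instanceof TextShape) {
            String text = ((TextShape) shape).text;
            if (text != null) {
                out.add(text);
            }
        } else if (shape instanceof DiagramShape) {
            out.addAll(((DiagramShape) shape).nodeTexts);
        } else if (shape instanceof TableShape) {
            for (List<String> row : ((TableShape) shape).rows) {
                out.addAll(row);
            }
        } else if (shape instanceof GroupShape) {
            // Recurse so that text nested inside containers is not skipped.
            for (Shape child : ((GroupShape) shape).children) {
                collectInto(child, out);
            }
        }
        // Any other shape type falls through and contributes no text.
    }

    public static void main(String[] args) {
        List<Shape> slide = Arrays.asList(
            new TextShape("title"),
            new DiagramShape("node1", "node2"),
            new TableShape(Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c", "d"))),
            new GroupShape(new TextShape("nested")));
        for (String text : collect(slide)) {
            System.out.println(text);
        }
    }
}
```

The point of the sketch is the final fall-through: any shape type without its own branch yields no text, which is one way text can go missing from a generic "get all text" helper.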

nuix · April 12, 2018, 4:55am

Thanks mudassir, this seems to work perfectly. Great stuff.

In this case, should we consider the following methods deprecated or problematic?

* SlideUtil.getAllTextFrames
* SlideUtil.getAllTextBoxes(slide)

The documentation seems to indicate that they will retrieve all of the text frames on a slide, but this doesn’t seem to be the case.

Adnan.Ahmad · April 12, 2018, 1:51pm

@nuix,

I have reviewed your comments. I regret to inform you that SlideUtil.getAllTextBoxes(slide) has some restrictions. We have created an internal investigation ticket for this. In the meantime, please use the shared sample code as a workaround.