TextExtractionArrangingMode.Arranged does not seem to work on PPTX

Aspose.Slides for Java v17.6

I’m trying to convert PowerPoint files to TXT using PresentationFactory.getInstance().getPresentationText(…)
When I convert PPT files, Aspose.Slides arranges the text correctly. However, with PPTX files the extracted text isn’t arranged. Please see the source code below and the files attached to this topic. How can I fix this?

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;

import com.aspose.slides.IPresentationText;
import com.aspose.slides.ISlideText;
import com.aspose.slides.PresentationFactory;
import com.aspose.slides.TextExtractionArrangingMode;

public class TestPPTXToTXT {

public static void main(String[] args) throws Exception {
    convertPowerPointToTXT("E:\\Download\\powerpoint\\Test1.ppt", "E:\\Download\\powerpoint\\Test1.txt");
    convertPowerPointToTXT("E:\\Download\\powerpoint\\Test2.pptx", "E:\\Download\\powerpoint\\Test2.txt");
}

private static void convertPowerPointToTXT(String fileFrom, String fileTo) throws IOException {
    IPresentationText presentationText =
            PresentationFactory.getInstance().getPresentationText(fileFrom, TextExtractionArrangingMode.Arranged);

    ISlideText[] slidesText = presentationText.getSlidesText();
    StringBuilder stringBuilder = new StringBuilder();

    // Append the extracted text of each slide, in slide order
    for (int i = 0; i < slidesText.length; i++) {
        stringBuilder.append(slidesText[i].getText());
    }

    FileUtils.writeStringToFile(FileUtils.getFile(fileTo), stringBuilder.toString(), StandardCharsets.UTF_8);
}

}
powerpoint.zip (253.0 KB)


This Topic is created by codewarior using the Email to Topic plugin.


@harry.ardimedia,

I have reviewed your requirement. Aspose.Slides extracts the raw text from the text frames of slides, and it has no internal mechanism for organizing that text. The text extracted from the PPTX file is actually what it should be. I have created an enhancement issue with ID SLIDESJAVA-36466 to investigate why extraction from PPT differs from extraction from PPTX. This thread has been linked to the issue so that you will be notified automatically once it is fixed.

Many Thanks,

Mudassir Fayyaz

@mudassir.fayyaz

From TextExtractionArrangingMode | Aspose.Slides for Java API Reference

Arranged
The text is positioned in the same order as on the slide

Unarranged
The raw text with no respect to position on the slide

So, when I use Arranged with PPT it works fine, but it doesn't work for PPTX.
Just to make sure we are on the same page: the text extracted from the PPTX file is not what it should be. It is extracted as Unarranged even though I set Arranged. Am I right?
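
To make the difference between the two modes quoted above concrete, here is a minimal sketch in plain Java. It does not use Aspose.Slides and is not how the library implements extraction; TextBox, unarranged and arranged are hypothetical names introduced only for illustration. It assumes "Arranged" means reading text boxes top to bottom and then left to right, and "Unarranged" means reading them in storage order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ArrangedTextSketch {

    /** Hypothetical stand-in for a text box on a slide; not an Aspose.Slides type. */
    static final class TextBox {
        final double x;     // distance from the left edge of the slide
        final double y;     // distance from the top edge of the slide
        final String text;

        TextBox(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** "Unarranged": concatenate the text in storage order, one box per line. */
    static String unarranged(List<TextBox> boxes) {
        StringBuilder sb = new StringBuilder();
        for (TextBox box : boxes) {
            sb.append(box.text).append('\n');
        }
        return sb.toString();
    }

    /** "Arranged" (assumed meaning): order by top edge, then by left edge. */
    static String arranged(List<TextBox> boxes) {
        List<TextBox> sorted = new ArrayList<>(boxes); // leave the input untouched
        sorted.sort(Comparator.comparingDouble((TextBox b) -> b.y)
                .thenComparingDouble(b -> b.x));
        return unarranged(sorted);
    }

    public static void main(String[] args) {
        // Storage order deliberately differs from the visual order on the slide.
        List<TextBox> boxes = Arrays.asList(
                new TextBox(10, 200, "Footer"),
                new TextBox(10, 10, "Title"),
                new TextBox(10, 100, "Body"));

        System.out.print("Unarranged:\n" + unarranged(boxes));
        System.out.print("Arranged:\n" + arranged(boxes));
    }
}
```

If the PPTX output matches the storage order rather than the visual order, that is consistent with the Arranged setting having no effect for that format.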

@cpatricio76 ,

Yes, your observation is right. I have added this information to our issue tracking system so that our product team can consider it when working on the issue.

Many Thanks,

Mudassir Fayyaz

@mudassir.fayyaz

It looks like SLIDESJAVA-36466 is not fixed in the new version of Aspose.Slides v17.7 released today (31/07).
Is it possible to provide a patch as soon as it is fixed?

@cpatricio76,

I have noted your comments. This issue is tentatively scheduled to be resolved in Aspose.Slides 17.9. I have also asked our product team to resolve it as soon as possible. We will share good news with you soon.

Best Regards,