Converting PPTX to Text with Line Separators Using Aspose.Slides for Java

Ramyarao · June 2, 2022, 11:04am

Hi,

Could you please let me know if there is a way to convert a PPTX file to text (with line separators) using Aspose.Slides?

I tried the approach from the "Extract Text from Presentation" article in the Aspose.Slides documentation on the attached file, but the line separators in the output text file are not at the expected places.

pptx.zip (395.7 KB)

Thanks

andrey.potapov · June 2, 2022, 1:45pm

@Ramyarao,
Thank you for your question.

I looked at the raw data of the presentation you provided and saw that some of the text portions have slightly different formatting. As a result, when you read the text portion by portion, the lines look broken. Please try using the IParagraph.getText method to extract entire paragraphs, like this:

var text = paragraph.getText();

API Reference: IParagraph.getText method

If this way does not suit you, please describe the expected result in more detail.

Ramyarao · June 6, 2022, 11:42am

Hi,

We tried the code from the linked article (Extract Text from Presentation, Aspose.Slides Documentation) with some modifications. As you suggested, we now use paragraph.getText() instead of port.getText(), collect the text in a StringBuilder, and finally save it to a file with a line separator after every slide (we added a stringbuilder.append("\n") call while iterating through the slides).

But for the attached input file, we expect some data to appear on a single line for the first slide, like below:
AWS Lambda Function URLs Amazon API Gateway
Resource Lambda (with Function URL) API Gateway + Lambda
API Type Support HTTP HTTP, REST, WebSocket

But it appears on multiple lines:
AWS Lambda Function URLs
Amazon API Gateway
Resource
Lambda (with Function URL)
API Gateway + Lambda
API Type Support
HTTP
HTTP, REST, WebSocket

Similarly, for the second slide we expect the data to come in the correct sequence (see the attached expected output.txt).

Also, we noticed that there is an unknown character in the output txt file:

image.png (1.7 KB)

test3.zip (63.5 KB)

Please let us know if there is a way to get the expected results.

Thanks in advance

Ramyarao · June 6, 2022, 12:02pm

Or please let us know if there is a direct API in Aspose.Slides for Java to convert PPTX to text.

andrey.potapov · June 6, 2022, 2:34pm

@Ramyarao,
The code example in that article searches all text frames and retrieves their content. As far as I can see, you are merging text in the table rows. Therefore, using the SlideUtil.getAllTextBoxes method is probably not suitable for you. You have to manually iterate through all the shapes, analyze the shape type, and implement your own behavior for extracting/merging text.

import com.aspose.slides.*;

var presentation = new Presentation("test3.pptx");
try {
    for (var slide : presentation.getSlides()) {
        for (var shape : slide.getShapes()) {
            if (shape instanceof ITable) {
                // Read text items from cells and merge them by rows.
            }
        }
    }
} finally {
    // Release the presentation's resources even if extraction throws.
    presentation.dispose();
}
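
The "merge them by rows" step could look roughly like the sketch below. It is plain Java and assumes the cell texts have already been collected into a list of rows; how they are read from the ITable is left out. The names TableRowMerger, normalizeCell and mergeRows are hypothetical. Joining cells with a single space is an assumption based on the expected output posted earlier in the thread.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Aspose.Slides.
final class TableRowMerger {
    // Collapses every whitespace run inside a cell into one space.
    // In Java regex, \s also matches line breaks and the vertical tab (code 11).
    static String normalizeCell(String cellText) {
        return cellText.replaceAll("\\s+", " ").trim();
    }

    // Joins the non-empty cells of each row with a space, one output line per row.
    static String mergeRows(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream()
                        .map(TableRowMerger::normalizeCell)
                        .filter(cell -> !cell.isEmpty())
                        .collect(Collectors.joining(" ")))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Cell texts taken from the first slide's output quoted above.
        List<List<String>> rows = List.of(
                List.of("Resource", "Lambda (with Function URL)", "API Gateway + Lambda"),
                List.of("API Type Support", "HTTP", "HTTP, REST, WebSocket"));
        System.out.println(mergeRows(rows));
    }
}
```

Text from shapes that are not tables would still need its own handling, for example appending each paragraph on a separate line as before.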

As for the unknown character: that string contains two non-printable characters with code 11 (vertical tab) at the end. You can remove such characters as shown below:

var cleanText = text.replaceAll("\\p{C}", "");
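
One caveat that may be worth checking: in Java's regex engine, the \p{C} category also covers '\n', '\r' and '\t'. If the pattern is applied to the fully assembled output rather than to each paragraph, it would strip the line separators as well. The sketch below contrasts the two behaviors. The class and method names are hypothetical, and replacing the vertical tab with a space (instead of deleting it) is only one possible choice.

```java
// Hypothetical helper, not part of Aspose.Slides.
final class ControlCharCleanup {
    // Removes every character in Unicode category C, as in the snippet above.
    // This includes '\n', '\r' and '\t', not only the vertical tab.
    static String stripAllControl(String text) {
        return text.replaceAll("\\p{C}", "");
    }

    // Narrower alternative: replaces only the vertical tab (code 11) with a space
    // and leaves line separators untouched.
    static String replaceVerticalTabs(String text) {
        return text.replace('\u000B', ' ');
    }

    public static void main(String[] args) {
        String sample = "HTTP, REST,\u000BWebSocket\nnext line";
        System.out.println(stripAllControl(sample));     // line break is removed too
        System.out.println(replaceVerticalTabs(sample)); // line break is kept
    }
}
```

Applying the cleanup to each paragraph's text before appending your own "\n" avoids the problem with either variant.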