Challenges in Extracting Text from Slides: Dealing with Split Lines

We are running into two problems when extracting text from slides with Aspose.Slides:

  1. Garbage text is appended to the extracted text.
  2. The text is segmented incorrectly, which produces line breaks inside sentences. For example, a sentence like “The quick brown fox jumped over the fence, and chased the sheep” is split as:

“The quick brown fox jumped
over the fence, and chased the
sheep”

CODE SNIPPET

import com.aspose.slides.*;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public static String ExtractTextFromPptx(String pptxFilePath) {
    // Full path where the extracted text will be saved
    String fullPath = "/tmp/output_aspose.txt";

    Presentation pres = null;
    try {
        // Loading the presentation
        pres = new Presentation("/tmp/" + pptxFilePath);

        // try-with-resources closes the writer even if extraction fails
        try (PrintWriter writerPath = new PrintWriter(new FileWriter(fullPath))) {
            // Extracting all text frames from the presentation (true = include master slides)
            ITextFrame[] textFramesPPTX = SlideUtil.getAllTextFrames(pres, true);

            // Looping through the extracted text frames
            for (ITextFrame textFrame : textFramesPPTX) {
                // Looping through paragraphs in the current text frame
                for (IParagraph para : textFrame.getParagraphs()) {
                    // Looping through portions in the current paragraph to get text
                    for (IPortion port : para.getPortions()) {
                        // Writing each portion to the output file on its own line
                        writerPath.write(port.getText().trim() + "\n");
                    }
                }
            }
        }
    } catch (IOException e) {
        System.err.println("Error occurred while working with files: " + e.getMessage());
        return "";
    } catch (Exception e) {
        // Handling other exceptions
        System.err.println("An error occurred: " + e);
        return "";
    } finally {
        // Releasing the resources held by the presentation
        if (pres != null) {
            pres.dispose();
        }
    }
    // Returning the file path of the saved text file as an indication of success
    return fullPath;
}

PPT SAMPLE SLIDE

ppt_zip.zip (15.5 KB)

OUTPUT (TEXT EXTRACTED)

Click to edit the title text format

Click to edit the outline text format
Second Outline Level
Third Outline Level
Fourth Outline Level
Fifth Outline Level
Sixth Outline Level
Seventh Outline Level
Hello world
The quick brown fox jumped over the fence, a
nd chased the ship
. But whenever we fight we fight as community and not as an individual
However the design made for this experiment is not accurate as needed, but it is important to hang in there and try again:
Why is it important to start again?
But it is difficult then hit the honest button again.
Hence in conclusion I would like to say best of luck for your journey.

@cacglo,
Thank you for contacting support.

It looks like you are talking about this part:

Click to edit the title text format

Click to edit the outline text format
Second Outline Level
Third Outline Level
Fourth Outline Level
Fifth Outline Level
Sixth Outline Level
Seventh Outline Level

You can set the withMasters parameter in the getAllTextFrames method to false, then the “garbage” text will not be returned.

ITextFrame[] textFramesPPTX = SlideUtil.getAllTextFrames(pres, false);

Extract Text from Presentation|Aspose.Slides Documentation

In your code example, the text is retrieved portion by portion. In the sample presentation, the text “nd chased the ship” is bold, so its formatting differs from the surrounding text; it is therefore stored in a separate text portion and retrieved separately (see xml.png (60.1 KB)). To get the whole text of a paragraph in one piece, use para.getText().
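
To illustrate why the sentence was split, here is a minimal sketch that does not use Aspose.Slides at all: plain lists of strings stand in for paragraphs and their portions, and the class and method names are hypothetical. The sample strings are copied from the output above, on the assumption that those three lines come from one paragraph. With the real API, the joined result should correspond to what para.getText() returns.

```java
import java.util.List;

public class ParagraphTextSketch {

    // Mirrors the loop in the question: every portion is trimmed and written on its own line.
    static String linePerPortion(List<List<String>> paragraphs) {
        StringBuilder out = new StringBuilder();
        for (List<String> portions : paragraphs) {
            for (String portion : portions) {
                out.append(portion.trim()).append("\n");
            }
        }
        return out.toString();
    }

    // Joins the portions of each paragraph first, so a formatting change no longer splits a sentence.
    static String linePerParagraph(List<List<String>> paragraphs) {
        StringBuilder out = new StringBuilder();
        for (List<String> portions : paragraphs) {
            out.append(String.join("", portions).trim()).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in data: one paragraph made of three portions (the middle one is bold in the slide).
        List<List<String>> paragraphs = List.of(List.of(
                "The quick brown fox jumped over the fence, a",
                "nd chased the ship",
                ". But whenever we fight we fight as community and not as an individual"));

        System.out.print(linePerPortion(paragraphs));   // three lines, one per portion
        System.out.print(linePerParagraph(paragraphs)); // one line for the whole paragraph
    }
}
```

Note that the portions are joined without trimming each one first; trimming individual portions would drop the spaces between differently formatted words.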

Thank you, both issues have been resolved.

However, I see some special characters in the extracted text. Please find the screenshot attached:
Screenshot from 2024-05-08 16-32-33.png (5.0 KB)

Is there any check I can use to resolve this?

@cacglo,
Your sample presentation does not contain the text displayed in the screenshot. Could you kindly share the presentation file you used? We will then check the problem on our end.

@andrey.potapov

Please find the sample presentation attached:
ppt_zip.zip (15.1 KB)

@cacglo,
Thank you for the sample presentation. I’ve reproduced the problem you described.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): SLIDESJAVA-39459

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.