Zero and Asterisk Symbols Appear when Extracting Text from PPT File in Java

lucy.hq · September 29, 2021, 9:00am

Hi,
We are extracting text from a PPT file using Aspose.Slides, but we have run into some problems.
Here is my code:

// extract text from slide
    ISlideCollection slides = presentation.getSlides();
    StringBuilder textBuilder = new StringBuilder();
    for (ISlide slide : slides) {
        ITextFrame[] textFrames = SlideUtil.getAllTextBoxes(slide);
        if (textFrames != null && textFrames.length > 0) {
            for (int index = 0; index < textFrames.length; index++) {
                for (IParagraph paragraph : textFrames[index].getParagraphs()) {
                    for (IPortion portion : paragraph.getPortions()) {
                        textBuilder.append(portion.getText()).append("\n");
                    }
                }
            }
        }
    }

// extract text from master slide and layout slide
    IMasterSlideCollection masters = presentation.getMasters();
    if (masters != null && masters.size() > 0) {
        for (IMasterSlide masterSlide : masters) {
            getContentTextFromSlide(masterSlide, textBuilder);
            IMasterLayoutSlideCollection masterLayoutSlides = masterSlide.getLayoutSlides();
            if (masterLayoutSlides != null && masterLayoutSlides.size() > 0) {
                for (ILayoutSlide masterLayoutSlide : masterLayoutSlides) {
                    getContentTextFromSlide(masterLayoutSlide, textBuilder);
                }
            }
        }
    }

    private void getContentTextFromSlide(IBaseSlide slide, StringBuilder textBuilder) {
        ITextFrame[] textFrames = SlideUtil.getAllTextBoxes(slide);
        if (textFrames != null && textFrames.length > 0) {
            for (int index = 0; index < textFrames.length; index++) {
                for (IParagraph paragraph : textFrames[index].getParagraphs()) {
                    for (IPortion portion : paragraph.getPortions()) {
                        textBuilder.append(portion.getText()).append("\n");
                    }
                }
            }
        }
    }

1. There is no “0” in my PPT slides, but a “0” appears in the extracted slide text.
2. When I extract text from the master slides and layout slides, “‹#›” is extracted as “*”.

This is my PPT file:
testfile.zip (661.8 KB)

andrey.potapov · September 29, 2021, 2:22pm

@lucy.hq,
Thank you for the issue description.

Your presentation contains three invisible objects with the text “0” in the top right corner of each slide. You can find them with the Find tool. zero.jpg (195.3 KB)

I reproduced the problem with “*” being extracted in place of “‹#›” and logged it in our tracking system as SLIDESJAVA-38640. Our development team will investigate this case. You will be notified when the problem is resolved.

andrey.potapov · September 30, 2021, 3:54pm

@lucy.hq,

As a temporary workaround, you can use the following code snippet for this case:

private static void getContentTextFromSlide(IBaseSlide slide, StringBuilder textBuilder) {
    ITextFrame[] textFrames = SlideUtil.getAllTextBoxes(slide);
    if (textFrames != null && textFrames.length > 0) {
        for (int index = 0; index < textFrames.length; index++) {
            for (IParagraph paragraph : textFrames[index].getParagraphs()) {
                for (IPortion portion : paragraph.getPortions()) {
                    String text = portion.getText();
                    // Slide-number fields are extracted as "*"; substitute a placeholder instead.
                    if (portion.getField() != null && "slidenum".equals(portion.getField().getType().getInternalString())) {
                        text = "<#>";
                    }
                    textBuilder.append(text).append("\r\n");
                }
            }
        }
    }
}

API Reference: IPortion Interface, IField Interface
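
The substitution rule in the workaround above can be tried without the library. The following is only a minimal sketch: `Portion` is a hypothetical stand-in class, not an Aspose.Slides type, and its `fieldType` string plays the role of the value that the workaround reads through `getField().getType().getInternalString()`. The sample texts are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java sketch of the workaround's substitution rule.
 * Portion is a hypothetical stand-in, NOT an Aspose.Slides type.
 */
final class SlideNumberPlaceholder {

    /** Hypothetical stand-in for a text portion: its text plus an optional field type. */
    static final class Portion {
        final String text;
        final String fieldType; // null when the portion is not a field

        Portion(String text, String fieldType) {
            this.text = text;
            this.fieldType = fieldType;
        }
    }

    /** Returns the placeholder for slide-number fields, otherwise the portion's own text. */
    static String resolveText(Portion portion) {
        // Constant-first equals() stays safe when fieldType is null.
        if ("slidenum".equals(portion.fieldType)) {
            return "<#>";
        }
        return portion.text;
    }

    /** Appends one line per portion, terminated with "\r\n" as in the workaround. */
    static String extract(List<Portion> portions) {
        StringBuilder textBuilder = new StringBuilder();
        for (Portion portion : portions) {
            textBuilder.append(resolveText(portion)).append("\r\n");
        }
        return textBuilder.toString();
    }

    public static void main(String[] args) {
        List<Portion> portions = Arrays.asList(
                new Portion("Sample title text", null),  // ordinary text, kept as-is
                new Portion("*", "slidenum"));           // slide-number field, replaced
        System.out.print(extract(portions));
    }
}
```

Only portions whose field type is `slidenum` are replaced; plain text and other field types pass through unchanged, so the placeholder string can be swapped for “‹#›” or anything else without touching the rest of the loop.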

aspose.notifier · November 25, 2021, 2:54pm

The issues you have found earlier (filed as SLIDESJAVA-38640) have been fixed in this update.