Aspose.Slides Java - Text Extraction Superscript not working


#1

Hi,

We are currently extracting text from PPTX files and have just found out that superscript and subscript text is converted to normal text before it is stored in a database. Is there a way to preserve superscript and subscript formatting, to keep the integrity of the text?

We are using PHP 7.3 in an NGINX & Tomcat environment. Code Snippet below.

        # Body of the loop over slides; $sl is the zero-based slide index.
        {
            $slideStrippedCount = $sl + 1;
            $textFrames = $SlideUtil->getAllTextBoxes($pres->getSlides()->get_Item($sl));

            foreach ($textFrames as $textFrame)
            {
                $ParaCollection = $textFrame->getParagraphs();
                # Convert the Java count to a PHP int
                $ParaCount = (int)(string)$ParaCollection->getCount();

                for ($i = 0; $i < $ParaCount; $i++)
                {
                    $Para = $ParaCollection->get_Item($i);

                    $PortCollection = $Para->getPortions();
                    $PortCount = (int)(string)$PortCollection->getCount();

                    for ($j = 0; $j < $PortCount; $j++)
                    {
                        $Port = $PortCollection->get_Item($j);

                        # Get the portion's text (plain text only, no formatting)
                        $sSlideContent = $Port->getText();

                        $dDateNow = date("Y-m-d H:i:s");

                        # One database record per portion
                        $SlideRecord_item = array(
                            "slideStrippedCount" => $slideStrippedCount,
                            "iPresentationID" => $iPresentationID,
                            "sSlideContent" => $sSlideContent,
                            "sFullFolderPPTPathFile" => $sFullFolderPPTPathFile,
                            "dDateNow" => $dDateNow
                        );

                        array_push($slidecontent_arr["records"], $SlideRecord_item);
                    }
                }
            }
        }

Many thanks


#2

@JediJide,

I have reviewed your requirements. Your code extracts only the text of each portion, so the API returns the text only. Whether text is superscript or subscript is determined by the Escapement property of the PortionFormat class for the respective portion. So, this is not an issue.


#3

@mudassir.fayyaz,

Thanks for getting back to me. I have worked through the examples you provided. Whilst they are helpful, none of them is close enough to what we need.

All we want is for any instance of superscript or subscript to be preserved during text extraction. There are many examples of setting Escapement, which is fine if we want to create new superscript or subscript text. What we are looking for is a way to preserve superscript and subscript wherever they occur in the body of the slides.

Many thanks,


#4

@JediJide,

I have understood your requirements. In your case, you are extracting text at the portion level and assuming that the text properties will be preserved in the extracted text. In fact, you are extracting only the text and not any of the properties associated with the portion or paragraph. Like font color, font, boldness and so on, Escapement is one of those properties. If you want to retain the Escapement or other properties along with the extracted text, you need to extract them separately and store them as well. Later on, when you use the extracted text, you can set the text along with the preserved properties. I hope this information is helpful.
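
As a rough illustration of keeping the Escapement value next to each portion's text and reapplying it later, here is a minimal Java sketch. It does not call Aspose.Slides itself: `StoredPortion` and `toMarkup` are hypothetical names, the `<sup>`/`<sub>` tags are only one possible target format, and the sketch assumes that a positive Escapement means superscript and a negative one means subscript.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: rebuild marked-up text from stored (text, escapement) pairs. */
class SuperscriptMarkup {

    /** One extracted portion: its text plus the Escapement value saved with it. */
    static final class StoredPortion {
        final String text;
        final float escapement;

        StoredPortion(String text, float escapement) {
            this.text = text;
            this.escapement = escapement;
        }
    }

    /** Joins the portions of one paragraph, wrapping raised or lowered runs in tags. */
    static String toMarkup(List<StoredPortion> portions) {
        StringBuilder out = new StringBuilder();
        for (StoredPortion p : portions) {
            if (p.escapement > 0) {          // assumed: positive = superscript
                out.append("<sup>").append(p.text).append("</sup>");
            } else if (p.escapement < 0) {   // assumed: negative = subscript
                out.append("<sub>").append(p.text).append("</sub>");
            } else {                         // zero or NaN: plain text
                out.append(p.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<StoredPortion> paragraph = Arrays.asList(
                new StoredPortion("E = mc", 0f),
                new StoredPortion("2", 30f),
                new StoredPortion(" and H", 0f),
                new StoredPortion("2", -25f),
                new StoredPortion("O", 0f));
        // prints: E = mc<sup>2</sup> and H<sub>2</sub>O
        System.out.println(toMarkup(paragraph));
    }
}
```

The sketch leaves out escaping of characters such as `<` in the text, which would be needed if the result is rendered as HTML.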


#5

@mudassir.fayyaz - Ok, thanks. I’ll work through a solution and share the code later if successful.

Thanks,


#6

@JediJide,

You are welcome.


#7

@mudassir.fayyaz - No luck. We’ve tried different methods.

Can you please show us an example in any language? Java would do.

Thanks,


#8

@JediJide,

As I shared earlier, there is no option other than to preserve the Escapement value of each portion when extracting its text. You need to store the text as well as its respective Escapement value. I have already shared the Escapement property with you as well.

float portionEscapement = iPortion.getPortionFormat().getEscapement();
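
To show how the float returned by `getEscapement()` might be interpreted before it is stored, here is a small self-contained Java sketch. `EscapementClassifier`, `ScriptKind` and `classify` are hypothetical names, not part of Aspose.Slides. The sketch assumes, based on my reading of the API documentation, that the value ranges from -100 (subscript) to 100 (superscript) and that `Float.NaN` means the value is not set on the portion itself.

```java
/** Hypothetical helper: turn an Escapement value into a label that can be stored with the text. */
class EscapementClassifier {

    enum ScriptKind { SUPERSCRIPT, SUBSCRIPT, NORMAL }

    static ScriptKind classify(float escapement) {
        if (Float.isNaN(escapement)) {
            // Assumed: NaN = not set locally; treated as normal text in this sketch.
            return ScriptKind.NORMAL;
        }
        if (escapement > 0) {
            return ScriptKind.SUPERSCRIPT;   // assumed: positive = raised text
        }
        if (escapement < 0) {
            return ScriptKind.SUBSCRIPT;     // assumed: negative = lowered text
        }
        return ScriptKind.NORMAL;
    }

    public static void main(String[] args) {
        // In real code the argument would come from
        // iPortion.getPortionFormat().getEscapement().
        System.out.println(classify(30f));
        System.out.println(classify(-25f));
        System.out.println(classify(Float.NaN));
    }
}
```

The resulting label (or the raw float) could be stored as an extra field in each record next to the portion text. If formatting inherited from a layout or master matters, the locally set value may not be enough, so it is worth checking the Aspose.Slides documentation on effective portion format values.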


#9

@mudassir.fayyaz - I had the wrong expectations. I was expecting the extraction to simply preserve the superscript as displayed in the PPTX, and I wasn’t expecting to have to store the escapement values myself. The call did return the escapement values, and it does make sense now.

Thanks,


#10

@JediJide,

Actually, this is not an issue. When you call getText() on a portion, as your code does, you extract only the text that belongs to that portion of the paragraph.

Unfortunately, it is not possible for the associated properties to be extracted automatically when only the text is extracted. I have created an issue with ID SLIDESJAVA-37753 as a feature request to investigate the possibility of serializing a portion or paragraph inside a text frame. If this becomes possible, you could serialize the entire portion for the selected text, and all properties associated with that portion would remain intact. However, the issue has only just been added to our issue tracking system, and our team still has to investigate whether the requested support can be implemented. In the meantime, the only available option is the one I suggested earlier.


#11

@mudassir.fayyaz - Noted, and thanks for the support.