TextFragment's setText to an EMPTY STRING alters the TextSegmentCollection of the Fragment

athota · June 11, 2019, 5:08am

TextFragment’s setText to an EMPTY STRING deletes all TextSegments that constitute the TextFragment and leaves a single TextSegment with an EMPTY STRING.

If we iterate over the TextSegments and call setText on each one, performance takes a hit: each call takes around 20-30 ms, so a TextFragment containing 50 TextSegments takes 50 * 20-30 ms = 1,000-1,500 ms.

The performance of the SetText on TextFragment is good but our use case requires the original TextSegments intact after the SetText.

**Our Ask**: Can a method (perhaps called resetSegmentCollectionText()) be provided on TextFragment that keeps the TextSegmentCollection intact while emptying the text of the TextSegments?

asad.ali · June 11, 2019, 2:55pm

@athota

Thanks for contacting support.

Yes, you are right about the case. The API clears the TextSegmentCollection once its parent TextFragment is set to empty string.

This requirement needs investigation, and we have to analyse whether it is feasible. Would you please provide your sample PDF document along with a sample code snippet? We will then log a ticket in our system and share its ID with you.

athota · June 13, 2019, 11:10am

Hi Asad

Thanks for the reply.

I am including a test Java class for your convenience, along with a PDF document containing some representative text that we generally have in our use case. Please note that the text is typically a string containing characters from multiple Unicode scripts (Chinese, Japanese, Korean, etc.).

input-cjk.pdf (250.9 KB)
ResetSegmentDemo.zip (854 Bytes)

From our tests, a string in our use case, which translates to one TextFragment, consists of 40-200 TextSegments.

As mentioned before, we are forced to operate at the level of TextSegments since we are not able to keep the TextSegmentCollection of the TextFragment intact when we do a setText on the TextFragment.

The problem with this is that it takes 15 ms per TextSegment to setText to an EMPTY STRING. So, with 100 TextSegments on average per TextFragment, that is 100 * 15 ms = 1.5 seconds for each string, which is prohibitively slow for our use case, which could involve 1,000-16,000 such strings in a single PDF.
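As a rough illustration of how the figures above scale, here is a minimal back-of-the-envelope sketch. It assumes the averages quoted in this post (100 segments per string, 15 ms per setText call) hold uniformly; the class and method names are hypothetical and not part of any Aspose API.

```java
import java.time.Duration;

final class SetTextCostEstimate {
    // Estimated total time = strings * segments per string * cost per segment.
    static Duration estimate(long strings, long segmentsPerString, long millisPerSegment) {
        return Duration.ofMillis(strings * segmentsPerString * millisPerSegment);
    }

    public static void main(String[] args) {
        // One string of 100 segments at 15 ms each.
        System.out.println(estimate(1, 100, 15));       // prints PT1.5S
        // Lower and upper bounds on the number of strings per PDF.
        System.out.println(estimate(1_000, 100, 15));   // prints PT25M
        System.out.println(estimate(16_000, 100, 15));  // prints PT6H40M
    }
}
```

Under those assumptions, a single PDF would need roughly 25 minutes to 6 hours 40 minutes for the setText calls alone.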

As mentioned in my previous comment, we are looking for

An improvement in the performance of the setText method on the TextSegment

AND / OR

A new method in the TextFragment API - resetTextSegmentsText() which keeps the TextSegmentCollection of the TextFragment intact while emptying the text of the segments.

to address our concerns with the performance.

Thanks
Aditya

asad.ali · June 13, 2019, 7:21pm

@athota

Thanks for providing the details.

We have logged an enhancement request as PDFJAVA-38627 in our issue tracking system to implement your requirements. We will look into the details of the ticket and investigate its feasibility. As soon as we make some progress towards resolving the ticket, we will let you know. Please be patient and spare us a little time.

We are sorry for the inconvenience.

athota · June 14, 2019, 12:28pm

Thanks for the reply, Asad. Could you please let me know the timeline for this to get addressed ?

asad.ali · June 14, 2019, 7:42pm

@athota

The ticket has been logged under the free support model, where issues have low priority and are resolved on a first-come, first-served basis. The resolution time of the ticket depends on how many priority issues are in the queue. We will keep you posted in case we make significant progress towards implementing the requested enhancement. Please spare us a little time.

We are sorry for the inconvenience.

athota · June 17, 2019, 6:10am

Hi Asad,

What are the options for getting this expedited ?
If it is a paid support model, what would be the expected timeline for addressing this? We would need at least a ballpark estimate.

We are currently using V18.5.

Farhan.Raza · June 17, 2019, 5:11pm

@athota

Please note that paid support tickets are definitely resolved sooner than free support tickets. Regarding an ETA for PDFJAVA-38627, we have logged your concerns and will let you know as soon as an update is available.

athota · June 18, 2019, 7:10am

Thanks Farhan.

We would be very interested in seeing this addressed. We would appreciate if you could provide information regarding the feasibility of the fix after a preliminary investigation by your dev team as soon as you can.

If the answer is a YES for the feasibility, we would look into the next steps for getting it addressed faster.

asad.ali · June 18, 2019, 7:01pm

@athota

Sure, we will definitely share our feedback once the investigation of your requirements is completed. We have recorded your concerns and will consider them during analysis. As soon as we have some definite news about the analysis, we will share it with you in this forum thread. Please spare us a little time.

athota · June 26, 2019, 5:02am

Hi Asad

Just checking if there was any progress made in this regard - checking the feasibility of the fix.

Thanks
Aditya

aspose.notifier · June 26, 2019, 4:43pm

The issues you have found earlier (filed as PDFJAVA-38627) have been fixed in Aspose.PDF for Java 19.6.

athota · June 28, 2019, 6:01am

Hi Asad

Thanks for the reply. I have tested our use case with the 19.6 artifact.

It is great to see an overall improvement of 1.9 times in the performance of the TextSegment’s setText operation for our use case, through the newly added setTextSuppressedUpdate(…) method.

However, the absolute performance is still a concern to us, and we would appreciate it if you could look into it further. Here is the setText performance (averaged) for a particular use case of ours, where we have 36 setText operations:

BEFORE: 13.30555 ms
AFTER : 5.63888 ms

As you can see, it is taking 5.63 ms for each text segment. Is there anything that can be done to improve this performance ?

OR alternatively

is it possible to provide a method on the TextFragment itself… as mentioned in my previous comments

A new method in the TextFragment API - resetTextSegmentsText() which keeps the TextSegmentCollection of the TextFragment intact while emptying the text of the segments.

Please note that our concern about the absolute performance of the setText operation stems from the fact that our use case has several hundred thousands of SetText operations to be performed to process a single PDF.

Thanks in advance !
Aditya

asad.ali · June 28, 2019, 5:41pm

@athota

Thanks for your feedback.

We would like to share some more details about our investigation and the ticket resolution. The time grows with every setText/setFont/setFontSize (and other changes) because each time data is updated in the document's operators collection, the collection gets serialized and de-serialized. For cases where you just want to change the text of every found text segment, we have implemented methods that avoid serializing and de-serializing on every iteration; the update is processed only once, when the document is saved.
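The general idea of deferring an expensive update until save time can be sketched with a toy model. This is not Aspose's internal implementation; the class, its methods (setTextEager, setTextDeferred, save) and the string join used as a stand-in for serialization are all assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class DeferredUpdateSketch {
    private final List<String> segments = new ArrayList<>();
    private String serialized = "";   // stand-in for the serialized operator data
    private boolean dirty = false;    // true when changes have not been serialized yet
    int serializations = 0;           // how many times the expensive step ran

    DeferredUpdateSketch(List<String> initial) {
        segments.addAll(initial);
    }

    // Eager variant: every change pays for a full serialization.
    void setTextEager(int index, String text) {
        segments.set(index, text);
        serialize();
    }

    // Deferred variant: record the change and postpone the expensive step.
    void setTextDeferred(int index, String text) {
        segments.set(index, text);
        dirty = true;
    }

    // Pending changes are serialized once, at save time.
    void save() {
        if (dirty) {
            serialize();
        }
    }

    private void serialize() {
        serialized = String.join("\n", segments);
        serializations++;
        dirty = false;
    }

    // Empties every segment in a fresh model and reports how often serialization ran.
    static int countSerializations(int segmentCount, boolean deferred) {
        List<String> initial = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            initial.add("segment " + i);
        }
        DeferredUpdateSketch doc = new DeferredUpdateSketch(initial);
        for (int i = 0; i < segmentCount; i++) {
            if (deferred) {
                doc.setTextDeferred(i, "");
            } else {
                doc.setTextEager(i, "");
            }
        }
        doc.save();
        return doc.serializations;
    }

    public static void main(String[] args) {
        System.out.println("eager:    " + countSerializations(50, false));
        System.out.println("deferred: " + countSerializations(50, true));
    }
}
```

In this toy model, emptying 50 segments triggers 50 serializations in the eager variant and a single one in the deferred variant, which is the kind of saving the SuppressedUpdate methods are described as aiming for.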

On our side, the time previously taken to set the text and font for all segments in the first fragment was 649-740 ms. With the SuppressedUpdate methods, it dropped to 312-339 ms, which is more than twice as fast.

Please replace the methods as follows:

segment.setText(""); -> segment.setTextSuppressedUpdate("");
segment.getTextState().setFont(arial); -> segment.getTextState().setFontSuppressedUpdate(arial);
segment.getTextState().setFontSize((segment.getTextState().getFontSize() - 1)); -> segment.getTextState().setFontSizeSuppressedUpdate((segment.getTextState().getFontSize() - 1));

Complete code:

        Document pdfDocument = new Document(dataDir+"input-cjk.pdf");
        TextFragmentAbsorber visitor = new TextFragmentAbsorber("(?s)\\Q**\\E.*?!", new TextSearchOptions(true));
        visitor.getTextReplaceOptions().setReplaceAdjustmentAction(TextReplaceOptions.ReplaceAdjustment.None);
        pdfDocument.getPages().accept(visitor);
        TextFragmentCollection textFragments = visitor.getTextFragments();

        System.out.println("Number of fragments : " + textFragments.size() + "\n");

        textFragments.forEach(fragment -> System.out.println("Fragment : " + fragment.getText()));
        Font arial = FontRepository.findFont("Arial");

        Instant start = Instant.now();
        for (TextFragment fragment : textFragments) {

            TextSegmentCollection segments = fragment.getSegments();
            System.out.println("Number of segments on fragment " + fragment.getText() + ": " + segments.size());

            for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
                TextSegment segment = iterator.next();
                Instant s1 = Instant.now();
                System.out.println("BEFORE : Segment : " + segment.getText());
//                segment.setText("");
                segment.setTextSuppressedUpdate("");
                System.out.println("AFTER : Segment : " + segment.getText());
                System.out.println("Time Taken to set text on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
            }

            for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
                TextSegment segment = iterator.next();
                Instant s1 = Instant.now();
                System.out.println("BEFORE : Segment : " + segment.getText());
//                segment.getTextState().setFont(arial);
//                segment.getTextState().setFontSize((segment.getTextState().getFontSize() - 1));
                segment.getTextState().setFontSuppressedUpdate(arial);
                segment.getTextState().setFontSizeSuppressedUpdate((segment.getTextState().getFontSize() - 1));
                System.out.println("AFTER : Segment : " + segment.getText());
                System.out.println("Time Taken to set font and font size on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
            }
        }
        System.out.println("Time Taken to set text and set font on all segments in the first fragment : " + Duration.between(start, Instant.now()).toMillis() + " ms");
        pdfDocument.save(dataDir+"input-cjk_version19.6_SuppressedUpdate.pdf");

You can also perform the font change as a mass operation, which improves overall performance. If the last loop in the code above is changed as follows, the result is obtained in 220-262 ms:

        Document pdfDocument = new Document(dataDir+"input-cjk.pdf");
        TextFragmentAbsorber visitor = new TextFragmentAbsorber("(?s)\\Q**\\E.*?!", new TextSearchOptions(true));
        visitor.getTextReplaceOptions().setReplaceAdjustmentAction(TextReplaceOptions.ReplaceAdjustment.None);
        pdfDocument.getPages().accept(visitor);
        TextFragmentCollection textFragments = visitor.getTextFragments();

        System.out.println("Number of fragments : " + textFragments.size() + "\n");

        textFragments.forEach(fragment -> System.out.println("Fragment : " + fragment.getText()));
        Font arial = FontRepository.findFont("Arial");

        Instant start = Instant.now();
        for (TextFragment fragment : textFragments) {

            TextSegmentCollection segments = fragment.getSegments();
            System.out.println("Number of segments on fragment " + fragment.getText() + ": " + segments.size());

            for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
                TextSegment segment = iterator.next();
                Instant s1 = Instant.now();
                System.out.println("BEFORE : Segment : " + segment.getText());
//                segment.setText("");
                segment.setTextSuppressedUpdate("");
                System.out.println("AFTER : Segment : " + segment.getText());
                System.out.println("Time Taken to set text on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
            }

            // Perform mass operation
            visitor.applyForAllFragments(arial);
            for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
                TextSegment segment = iterator.next();
                Instant s1 = Instant.now();
                System.out.println("BEFORE : Segment : " + segment.getText());
//                segment.getTextState().setFont(arial);
//                segment.getTextState().setFontSize((segment.getTextState().getFontSize() - 1));
//                segment.getTextState().setFontSuppressedUpdate(arial);
                segment.getTextState().setFontSizeSuppressedUpdate((segment.getTextState().getFontSize() - 1));
                System.out.println("AFTER : Segment : " + segment.getText());
                System.out.println("Time Taken to set font and font size on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
            }
        }
        System.out.println("Time Taken to set text and set font on all segments in the first fragment : " + Duration.between(start, Instant.now()).toMillis() + " ms");
        pdfDocument.save(dataDir+"input-cjk_version19.6_SuppressedUpdate.pdf");

Furthermore, we have recorded your feedback and concerns and will let you know in case we have further updates and feedback to share with you.

athota · July 1, 2019, 4:54am

Thanks Asad. I had already looked at the 19.6 release notes and the API changes, and made the changes in the same way as in the example you posted. I too saw close to a 2x improvement in performance, as stated in my prior comment.

Thank you for the detailed example. Time taken to setFont is not that big of a concern for us at this point. Our concern is more on the absolute performance numbers of the setText operation on the TextSegment which still is around 5 - 8 ms even after the impressive improvements made in v19.6. I am hoping that you guys would be able to improve on this.

Alternatively, as mentioned in my previous comment, it would also be sufficient for our use case if you guys could provide a resetText method on the TextFragment that only resets the text of the segments in the TextSegmentCollection, without altering the number of segments in the collection, and does so within tens of ms. I am assuming that this is what you referred to in the last line of your previous comment, and I will wait for your update on the same.
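Until such a method exists, the requested semantics (empty every segment's text, keep the segment count unchanged) can be approximated with a small helper. The sketch below is hypothetical and deliberately library-independent: the setter is passed in as a parameter, and a StringBuilder stands in for TextSegment so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

final class ResetSegmentsSketch {
    // Empties every segment through the supplied setter and returns how many
    // segments were visited. The collection itself is never modified, so the
    // number of segments stays the same.
    static <S> int clearAll(Iterable<S> segments, BiConsumer<S, String> setText) {
        int visited = 0;
        for (S segment : segments) {
            setText.accept(segment, "");
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        // StringBuilder is only a stand-in for a real segment type.
        List<StringBuilder> segments = new ArrayList<>();
        segments.add(new StringBuilder("**"));
        segments.add(new StringBuilder("some text"));
        segments.add(new StringBuilder("!"));

        int visited = clearAll(segments, (sb, text) -> {
            sb.setLength(0);
            sb.append(text);
        });
        System.out.println(visited + " segments cleared, "
                + segments.size() + " still in the collection");
    }
}
```

With Aspose.PDF for Java 19.6, the call would presumably look like clearAll(fragment.getSegments(), TextSegment::setTextSuppressedUpdate), assuming TextSegmentCollection can be iterated as TextSegment elements (as the iterator usage in the sample code suggests). This only keeps the collection intact; it still costs one call per segment.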

Thanks in advance !! Appreciate your support.

asad.ali · July 1, 2019, 4:53pm

@athota

We will surely consider your original requirements and further investigate their feasibility. We will let you know in case of additional updates. Please spare us a little time.

athota · July 5, 2019, 10:32am

Hi @asad.ali

Just following up to check if you guys were able to determine the feasibility of providing a resetTextSegmentCollectionText() method on the TextFragment to address our performance concerns ?

This is turning out to be the decider for us - We were able to get our PDF editing requirements functionally addressed with the impressive functionality provided by AsposePdf API but the performance is playing the spoilsport here and unfortunately even a potential deal breaker. We are looking forward to your update in this regard.

Thanks in advance !
@athota

asad.ali · July 5, 2019, 9:13pm

@athota

As shared earlier, we are currently investigating your requirements, and as soon as we have some share-worthy results, we will share them with you in this forum thread. We highly appreciate your patience and cooperation in this regard. Please spare us a little time.

We are sorry for the inconvenience.

athota · July 10, 2019, 7:28am

Hi @asad.ali

We had an internal discussion regarding this and there has been a push to explore other options if we do not realize some progress in a limited time frame. I urge you to at least provide an update on the feasibility of a performance fix as soon as you can (if not the actual fix itself) so that we can decide our next course of action.

Thanks
@athota

asad.ali · July 10, 2019, 5:29pm

@athota

Thanks for getting back to us.

We would like to share with you that we have investigated your requirements and decided to implement a fast resetTextSegments() method. The enhancement has been logged under the ticket ID PDFJAVA-38701 in our issue tracking system, and we will work on its implementation. We will let you know as soon as we have additional updates. Please spare us a little time.

We are sorry for the inconvenience.