The issues you have found earlier (filed as PDFJAVA-38627) have been fixed in Aspose.PDF for Java 19.6.
Hi Asad
Thanks for the reply. I have tested our use case with the 19.6 artifact.
It is great to see an overall 1.9x improvement in the performance of the TextSegment’s setText operation for our use case, thanks to the newly added setTextSuppressedUpdate(…) method.
However, the absolute performance is still a concern to us, and we would appreciate it if you could look into it further. The averaged setText performance for one particular use case of ours, which involves 36 setText operations, is:
BEFORE: 13.30555 ms
AFTER : 5.63888 ms
As you can see, it still takes 5.63 ms for each text segment. Is there anything that can be done to improve this performance?
Alternatively, is it possible to provide a method on the TextFragment itself, as mentioned in my previous comments? That is, a new method in the TextFragment API, resetTextSegmentsText(), which keeps the TextSegmentCollection of the TextFragment intact while emptying the text of the segments.
Please note that our concern about the absolute performance of the setText operation stems from the fact that our use case requires several hundred thousand setText operations to process a single PDF.
Thanks in advance !
Aditya
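To put the per-call figures in perspective, the arithmetic can be sketched as below. The operation count of 300,000 is only an assumption standing in for "several hundred thousand"; the per-call times are the averages quoted above. With that assumed count, the setText calls alone would take roughly 28 minutes at 5.63888 ms each, versus over an hour at 13.30555 ms each.

```java
public class SetTextProjection {
    /** Total time in seconds for `operations` calls at `msPerOperation` each. */
    static double projectedSeconds(long operations, double msPerOperation) {
        return operations * msPerOperation / 1000.0;
    }

    public static void main(String[] args) {
        double beforeMs = 13.30555; // averaged per-call time reported before 19.6
        double afterMs = 5.63888;   // averaged per-call time reported with 19.6
        long operations = 300_000;  // ASSUMPTION: stand-in for "several hundred thousand"

        System.out.printf("Projected setText time before: %.0f s%n",
                projectedSeconds(operations, beforeMs));
        System.out.printf("Projected setText time after:  %.0f s%n",
                projectedSeconds(operations, afterMs));
    }
}
```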
Thanks for your feedback.
We would like to share some more details about our investigation and the ticket resolution. The time grows with every setText/setFont/setFontSize call (and other changes) because, every time data is updated, the document's operators collection gets serialized and de-serialized. For cases where you just want to change the text of every found text segment, we have implemented methods that avoid serializing and de-serializing on every iteration; the changes are processed only once, when the document is saved.
On our side, the time previously taken to set the text and the font for all segments in the first fragment was 649-740 ms. After switching to the SuppressedUpdate methods, it dropped to 312-339 ms, which is more than twice as fast.
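To illustrate the explanation above, here is a toy model in plain Java. It is not Aspose.PDF's implementation: the OperatorStore class, its fields and the round-trip step are all made up for illustration, and only the method names setText and setTextSuppressedUpdate mirror the real API. It shows why postponing the serialize/de-serialize step until save turns one round trip per change into a single round trip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only; NOT how Aspose.PDF is implemented internally. */
public class DeferredUpdateSketch {

    /** Hypothetical stand-in for a document's operators collection. */
    static class OperatorStore {
        private final List<String> operators = new ArrayList<>();
        private boolean dirty = false;
        int roundTrips = 0; // counts the expensive serialize/de-serialize steps

        OperatorStore(int size) {
            for (int i = 0; i < size; i++) {
                operators.add("text-" + i);
            }
        }

        /** Eager path: every change pays for a full round trip. */
        void setText(int index, String text) {
            operators.set(index, text);
            roundTrip();
        }

        /** Deferred path: record the change and postpone the round trip. */
        void setTextSuppressedUpdate(int index, String text) {
            operators.set(index, text);
            dirty = true;
        }

        /** Flushes pending changes once, as saving the document would. */
        void save() {
            if (dirty) {
                roundTrip();
                dirty = false;
            }
        }

        private void roundTrip() {
            // Serialize to one string and parse it back, standing in for the costly step.
            String serialized = String.join("\n", operators);
            String[] parsed = serialized.split("\n", -1); // -1 keeps empty entries
            operators.clear();
            operators.addAll(Arrays.asList(parsed));
            roundTrips++;
        }
    }

    /** Round trips needed when every segment is emptied eagerly. */
    static int eagerRoundTrips(int segments) {
        OperatorStore store = new OperatorStore(segments);
        for (int i = 0; i < segments; i++) {
            store.setText(i, "");
        }
        store.save();
        return store.roundTrips;
    }

    /** Round trips needed when updates are suppressed until save. */
    static int deferredRoundTrips(int segments) {
        OperatorStore store = new OperatorStore(segments);
        for (int i = 0; i < segments; i++) {
            store.setTextSuppressedUpdate(i, "");
        }
        store.save();
        return store.roundTrips;
    }

    public static void main(String[] args) {
        System.out.println("Eager round trips:    " + eagerRoundTrips(36));
        System.out.println("Deferred round trips: " + deferredRoundTrips(36));
    }
}
```

The real library clearly does more work per call than this model suggests, since the measured gain was about 2x rather than proportional to the number of segments.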
Please replace the methods as follows:
segment.setText("");
-> segment.setTextSuppressedUpdate("");
segment.getTextState().setFont(arial);
-> segment.getTextState().setFontSuppressedUpdate(arial);
segment.getTextState().setFontSize((segment.getTextState().getFontSize() - 1));
-> segment.getTextState().setFontSizeSuppressedUpdate((segment.getTextState().getFontSize() - 1));
Complete code:
Document pdfDocument = new Document(dataDir+"input-cjk.pdf");
TextFragmentAbsorber visitor = new TextFragmentAbsorber("(?s)\\Q**\\E.*?!", new TextSearchOptions(true));
visitor.getTextReplaceOptions().setReplaceAdjustmentAction(TextReplaceOptions.ReplaceAdjustment.None);
pdfDocument.getPages().accept(visitor);
TextFragmentCollection textFragments = visitor.getTextFragments();
System.out.println("Number of fragments : " + textFragments.size() + "\n");
textFragments.forEach(fragment -> System.out.println("Fragment : " + fragment.getText()));
Font arial = FontRepository.findFont("Arial");
Instant start = Instant.now();
for (TextFragment fragment : textFragments) {
    TextSegmentCollection segments = fragment.getSegments();
    System.out.println("Number of segments on fragment " + fragment.getText() + ": " + segments.size());
    for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
        TextSegment segment = iterator.next();
        Instant s1 = Instant.now();
        System.out.println("BEFORE : Segment : " + segment.getText());
        // segment.setText("");
        segment.setTextSuppressedUpdate("");
        System.out.println("AFTER : Segment : " + segment.getText());
        System.out.println("Time Taken to set text on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
    }
    for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
        TextSegment segment = iterator.next();
        Instant s1 = Instant.now();
        System.out.println("BEFORE : Segment : " + segment.getText());
        // segment.getTextState().setFont(arial);
        // segment.getTextState().setFontSize((segment.getTextState().getFontSize() - 1));
        segment.getTextState().setFontSuppressedUpdate(arial);
        segment.getTextState().setFontSizeSuppressedUpdate((segment.getTextState().getFontSize() - 1));
        System.out.println("AFTER : Segment : " + segment.getText());
        System.out.println("Time Taken to set font and font size on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
    }
}
System.out.println("Time Taken to set text and set font on all segments in the first fragment : " + Duration.between(start, Instant.now()).toMillis() + " ms");
pdfDocument.save(dataDir+"input-cjk_version19.6_SuppressedUpdate.pdf");
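A side note on measuring: Duration.toMillis() returns whole milliseconds, so per-segment timings of a few milliseconds printed by the code above are coarse, and the println calls inside the timed region add their own cost. A minimal sketch of averaging over many calls with System.nanoTime() follows. The averageMillis helper and the dummy workload are hypothetical; in real use the Runnable would wrap the segment call being measured.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AverageTimingSketch {
    /**
     * Runs the operation `repetitions` times and returns the mean duration
     * of one run in milliseconds, keeping the fractional part.
     */
    static double averageMillis(Runnable operation, int repetitions) {
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) {
            operation.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / repetitions;
    }

    public static void main(String[] args) {
        // Dummy workload standing in for a call such as segment.setTextSuppressedUpdate("").
        AtomicInteger calls = new AtomicInteger();
        double average = averageMillis(calls::incrementAndGet, 36);
        System.out.printf("Average over %d calls: %.5f ms%n", calls.get(), average);
    }
}
```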
Also, you can perform a mass operation for the font change, which improves the overall performance. If the last loop in the above code is changed in the following way, the result is obtained in 220-262 ms:
Document pdfDocument = new Document(dataDir+"input-cjk.pdf");
TextFragmentAbsorber visitor = new TextFragmentAbsorber("(?s)\\Q**\\E.*?!", new TextSearchOptions(true));
visitor.getTextReplaceOptions().setReplaceAdjustmentAction(TextReplaceOptions.ReplaceAdjustment.None);
pdfDocument.getPages().accept(visitor);
TextFragmentCollection textFragments = visitor.getTextFragments();
System.out.println("Number of fragments : " + textFragments.size() + "\n");
textFragments.forEach(fragment -> System.out.println("Fragment : " + fragment.getText()));
Font arial = FontRepository.findFont("Arial");
Instant start = Instant.now();
for (TextFragment fragment : textFragments) {
    TextSegmentCollection segments = fragment.getSegments();
    System.out.println("Number of segments on fragment " + fragment.getText() + ": " + segments.size());
    for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
        TextSegment segment = iterator.next();
        Instant s1 = Instant.now();
        System.out.println("BEFORE : Segment : " + segment.getText());
        // segment.setText("");
        segment.setTextSuppressedUpdate("");
        System.out.println("AFTER : Segment : " + segment.getText());
        System.out.println("Time Taken to set text on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
    }
    // Perform mass operation
    visitor.applyForAllFragments(arial);
    for (Iterator<TextSegment> iterator = segments.iterator(); iterator.hasNext(); ) {
        TextSegment segment = iterator.next();
        Instant s1 = Instant.now();
        System.out.println("BEFORE : Segment : " + segment.getText());
        // segment.getTextState().setFont(arial);
        // segment.getTextState().setFontSize((segment.getTextState().getFontSize() - 1));
        // segment.getTextState().setFontSuppressedUpdate(arial);
        segment.getTextState().setFontSizeSuppressedUpdate((segment.getTextState().getFontSize() - 1));
        System.out.println("AFTER : Segment : " + segment.getText());
        System.out.println("Time Taken to set font and font size on segment : " + Duration.between(s1, Instant.now()).toMillis() + " ms");
    }
}
System.out.println("Time Taken to set text and set font on all segments in the first fragment : " + Duration.between(start, Instant.now()).toMillis() + " ms");
pdfDocument.save(dataDir+"input-cjk_version19.6_SuppressedUpdate.pdf");
Furthermore, we have recorded your feedback and concerns and will let you know in case we have further updates and feedback to share with you.
Thanks Asad. I had looked at the 19.6 release notes and the API changes already and made the changes in the same way as in the example you posted. I too saw close to a 2X improvement in performance, as stated in my prior comment.
Thank you for the detailed example. Time taken to setFont is not that big of a concern for us at this point. Our concern is more on the absolute performance numbers of the setText operation on the TextSegment which still is around 5 - 8 ms even after the impressive improvements made in v19.6. I am hoping that you guys would be able to improve on this.
Alternatively, as mentioned in my previous comment, it would also be sufficient for our use case if you guys could provide a resetText method on the TextFragment which only resets the text set on the TextSegmentCollection, without altering the number of segments in the collection, and does so within tens of ms. I am assuming that this is what you referred to in the last line of your previous comment and will wait for your update on the same.
Thanks in advance !! Appreciate your support.
We will surely consider your original requirements and will further investigate the feasibility. We will let you know in case of additional updates. Please spare us a little time.
Hi @asad.ali
Just following up to check if you guys were able to determine the feasibility of providing a resetTextSegmentCollectionText() method on the TextFragment to address our performance concerns ?
This is turning out to be the decider for us - We were able to get our PDF editing requirements functionally addressed with the impressive functionality provided by AsposePdf API but the performance is playing the spoilsport here and unfortunately even a potential deal breaker. We are looking forward to your update in this regard.
Thanks in advance !
@athota
As shared earlier, we are currently investigating your requirements, and as soon as we have some share-worthy results, we will share them with you in this forum thread. We highly appreciate your patience and cooperation in this regard. Please spare us a little time.
We are sorry for the inconvenience.
Hi @asad.ali
We had an internal discussion regarding this and there has been a push to explore other options if we do not realize some progress in a limited time frame. I urge you to at least provide an update on the feasibility of a performance fix as soon as you can (if not the actual fix itself) so that we can decide our next course of action.
Thanks
@athota
Thanks for getting back to us.
We would like to share with you that we have investigated your requirements and decided to implement a fast resetTextSegments() method. The enhancement has been logged under the ticket ID PDFJAVA-38701 in our issue tracking system, and we will surely work on its implementation. We will let you know as soon as we have some additional updates regarding its implementation. Please spare us a little time.
We are sorry for the inconvenience.
Thanks for the update @asad.ali. Glad to hear that.
Just to reiterate on our expectations from the proposed resetTextSegments method of the TextFragment – we want it to just reset the text segments’ text to an empty string while ensuring that the number of the text segments in the underlying TextSegmentCollection of the TextFragment is left intact.
Thanks
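Until such a method exists, the requested behaviour can be approximated on the caller's side by looping over the segments with the suppressed-update setter. The sketch below uses a hypothetical Segment interface and FakeSegment class as stand-ins so that it runs without the library; with Aspose.PDF the loop body would call setTextSuppressedUpdate("") on each TextSegment from fragment.getSegments(). It still pays the per-call cost discussed in this thread, so it is not the fast path planned under PDFJAVA-38701, only an illustration of the expected contract: same number of segments, empty text.

```java
import java.util.ArrayList;
import java.util.List;

public class ResetSegmentsSketch {

    /** Hypothetical stand-in for the TextSegment methods used here. */
    interface Segment {
        void setTextSuppressedUpdate(String text);
        String getText();
    }

    /** Minimal in-memory segment, for demonstration only. */
    static class FakeSegment implements Segment {
        private String text;

        FakeSegment(String text) {
            this.text = text;
        }

        @Override
        public void setTextSuppressedUpdate(String text) {
            this.text = text;
        }

        @Override
        public String getText() {
            return text;
        }
    }

    /** Empties the text of every segment; never adds or removes a segment. */
    static void resetSegmentsText(Iterable<? extends Segment> segments) {
        for (Segment segment : segments) {
            segment.setTextSuppressedUpdate("");
        }
    }

    /** Builds a list of fake segments from the given texts. */
    static List<FakeSegment> segmentsOf(String... texts) {
        List<FakeSegment> segments = new ArrayList<>();
        for (String text : texts) {
            segments.add(new FakeSegment(text));
        }
        return segments;
    }

    public static void main(String[] args) {
        List<FakeSegment> segments = segmentsOf("**", "Hello", "!");
        resetSegmentsText(segments);
        System.out.println("Segments kept: " + segments.size());
    }
}
```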
@athota
Sure, we have recorded your concerns and will definitely consider them during feature implementation.
Could you also provide a release timeline for this feature implementation even if tentative ? Would help us plan at our end.
Thanks
@athota
We regret that we cannot provide any ETA or tentative timeline for the ticket to be resolved, as it has low priority and will be taken care of on a first-come, first-served basis. However, we have recorded your concerns and will definitely consider them while implementing the feature. As soon as we have any update regarding feature availability, we will share it with you. Please spare us some time.
The ticket is planned to be processed after the release of Aspose.PDF for Java 19.7. Once work on the ticket has started, we will be in a position to share an expected timeline for this feature to be available. We really appreciate your cooperation and patience in this matter. Please spare us some time.
We are sorry for the inconvenience.
Hi @asad.ali
Could you please let us know the latest on PDFJAVA-38701 ? Can we expect its delivery in this month’s release ?
Thanks
@athota
The investigation is still underway. Please note that the required feature needs the implementation of new modules and modifications to existing internal components of the API. Hence, the investigation is taking some time. We will let you know as soon as it is completed and share an ETA with you as soon as we have some results. Please spare us a little time.
Hi @asad.ali
I see that the status of the linked ticket PDFJAVA-38701 has changed to Feedback. Could you please elaborate on what we can expect ?
Thanks
@athota
The issue status is currently Feedback because the required feature needs internal implementations, and separate tasks on which the main ticket depends have been created for them. Feedback means that this ticket is in a waiting queue for feedback from those internal tasks. As soon as the ticket is resolved, we will let you know. Please spare us a little time.