Removing all text from a page with TextFragmentCollection is too slow

Hi Nayyer,

It may be worth mentioning that the same slow speed occurs when taking the TextFragments from another PDF and adding them to a new PDF, something like this:

for (int i = 1; i <= totPages; i++)
{
    if (bw.CancellationPending)
        break;
    if (ReAddText)
    {
        try
        {
            //Aspose.Pdf.Text.TextParagraphA textParagraphAbsorber = new Aspose.Pdf.Text.TextParagraph();
            Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
            //accept the absorber for this page
            pdfDocument.Pages[i].Accept(textFragmentAbsorber);
            //get the extracted text fragments
            Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            string info2 = info;
            int totalTF = textFragmentCollection.Count;
            int cTF = 0;
            foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
            {
                if (bw.CancellationPending)
                    break;
                try
                {
                    //Create TextBuilder object
                    Aspose.Pdf.Text.TextBuilder textBuilder = new Aspose.Pdf.Text.TextBuilder(newDocument.Pages[i]);
                    //Append the text fragment to the PDF page
                    textBuilder.AppendText(textFragment);
                }
                catch { }
            }
        }
        catch { }
    }
}

Also, it may be worth mentioning that I’m doing the “job” of transferring the TextFragments from one document to another in a background worker (BackgroundWorker / C#).

Basically, what I’m trying to do here is flatten the PDF pages without their text and then add the text to the new PDF. That way the PDF keeps its search/select functionality, while the background of each page is flattened into an image representation of all PDF objects on that page except the text. Maybe there is a better way to do this than the one I’m trying above?
I’m using the BackgroundWorker so that I can display some progress to the user.

Thanks a lot for your support,
Razvan

Hi Razvan,

Thanks for sharing the details.

I have tested the scenario and reproduced the same issue. I have logged it separately as PDFNEWNET-37882 so that it can be fixed. We will look further into the details of this problem and keep you updated on the status of the fix. Please bear with us in the meantime.

Hi Razvan,

As I understand it, you need to keep the text searchable while flattening every other element on the page. A suitable approach is to extract the text from the file, copy it to another Document instance, and convert the PDF pages (all elements except text) to image format. However, since text extraction currently takes too long, you will need to wait until we optimize the API's text extraction performance.

Hi,

Any news here?

I am using the evaluation version of Aspose.PDF now.
However, I am facing an issue quite similar to the one reported here, which has seemingly gone unresolved for years.

I need to decide whether to purchase this product.
I’d like to know as soon as possible whether there is a way to update the text “reasonably quickly” by simply assigning another string, as follows:

textFragment.Text = updatedText;

So far this is reportedly “too slow”, which matches what I am experiencing right now with the latest version.

If there is a better solution (such as paragraph-based manipulation), please advise.

@KDSSHO,

Please note that paragraph-based manipulations are supported. Please refer to this help topic: Extract Paragraph from PDF. Kindly send us all details of the scenario, including the source PDF and code. We will investigate your scenario in our environment and share our findings with you.

Thank you for your prompt reply.

Please find the details below. I am using the sample app explained in this topic. Please take a look at the attached source PDF files as well.

The part I described as slow is the following:

//set old fragment text to empty string
textFragment.Text = String.Empty;

The whole process of replacing the texts with translated texts, as explained above, takes 20 seconds.
If I comment out one line,

// textFragment.Text = String.Empty;
it finishes in only 2 seconds.

So I’d like to know whether assigning a value to textFragment.Text can be made faster.
I truly appreciate your help!

Thank you very much. I have already checked it out and am testing it now.
But it is slow, too, if I empty textFragment.Text, and fast if I don’t.

@KDSSHO,

We were able to reproduce the slow performance when setting the text of a text fragment to an empty string. It has been logged under the ticket ID PDFNET-44715 (input PDF: 2017112144212.pdf) in our issue tracking system. We have linked your post to this ticket and will keep you informed of any updates.

Thank you. I’ll keep an eye on this topic.

@KDSSHO,

In reference to the linked ticket ID PDFNET-37868, we have implemented a feature in version 18.7 of the Aspose.PDF for .NET API to remove all text items. Please use TextFragmentAbsorber.RemoveAllText(Page page) to remove text from a single page of a PDF document, and TextFragmentAbsorber.RemoveAllText(Document document) to remove text from the whole document.
C#

Document doc = new Document(inFile);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.RemoveAllText(doc);
doc.Save(outFile);
sw.Stop();
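
For the page-level overload named above, a minimal sketch might look like the following. It assumes Aspose.PDF for .NET 18.7 or later; the class name, method name, and pageNumber parameter are illustrative and not part of the library.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class PageTextRemoval
{
    // Removes the text from one page only; other pages are left untouched.
    // PageTextRemoval and RemoveTextFromPage are hypothetical names.
    public static void RemoveTextFromPage(Document doc, int pageNumber)
    {
        // Aspose.PDF page collections are 1-based
        Page page = doc.Pages[pageNumber];
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        // Page overload of RemoveAllText, as described in the reply above
        absorber.RemoveAllText(page);
    }
}
```

The caller would still load the Document and call Save afterwards, as in the snippet above.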

In reference to the linked ticket ID PDFNET-44715, please do not set empty text, because doing so invokes a number of checks and text position adjustment operations. Please use this alternative way of removing existing text.
C#

Stopwatch watch = new Stopwatch();
watch.Start();
Document doc = new Document(myDir + "2017112144212.pdf");
foreach (Page page in doc.Pages)
{
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+", new TextSearchOptions(true));
    //accept the absorber for all the pages
    page.Accept(textFragmentAbsorber);
    //get the text fragments
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    Dictionary<Position, TextFragment> updatedFragments = new Dictionary<Position, TextFragment>();
    //prepare updated fragments
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        if (textFragment.Text == " " || textFragment.Text == "")
            continue;
        //create new text fragment with updated string that contains all necessary newline markers
        //string updatedText = TranslateText(textFragment.Text, "ja|en");
        string updatedText = textFragment.Text + "updated";
        TextFragment updatedFragment = new TextFragment(updatedText);

        //set new text fragment properties if necessary
        updatedFragment.TextState.Font = FontRepository.FindFont("MS UI Gothic");
        updatedFragment.TextState.FontSize = textFragment.TextState.FontSize;
        updatedFragment.TextState.LineSpacing = 0.5f;

        Position position = new Position(textFragment.Position.XIndent, textFragment.Position.YIndent
                          - updatedFragment.TextState.FontSize -updatedFragment.TextState.LineSpacing);
        updatedFragments.Add(position, updatedFragment);
    }

    //remove existing text
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);

    //add updated fragments
    foreach (var entry in updatedFragments)
    {
        //create TextParagraph object
        TextParagraph par = new TextParagraph();
        //set paragraph position
        par.Position = entry.Key;
        // Specify word wrapping mode
        par.FormattingOptions.WrapMode = TextFormattingOptions.WordWrapMode.ByWords;
        //add new TextFragment to paragraph
        par.AppendLine(entry.Value);

        //add the TextParagraph using TextBuilder
        TextBuilder textBuilder = new TextBuilder(page);
        textBuilder.AppendParagraph(par);
    }
}

// Save resulting PDF document.
doc.Save(myDir + "2017112144212_updated_op.pdf");
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);

In reference to the ticket ID PDFNET-37864: changing the text of a large number of text fragments invokes a number of checks and text position adjustment operations, which are essential in text editing scenarios. The difficulty is that we cannot determine in advance how many text fragments will be removed in the scenario. To remove all text from PDF pages, we recommend a different approach.
C#

Document pdfDocument = new Document(myDir + "slow_textFragment.pdf");
for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    Page page = pdfDocument.Pages[i];
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);    
}
pdfDocument.Save(myDir + "slow_textFragment_op_removed.pdf", Aspose.Pdf.SaveFormat.Pdf);

Hi Imran

Has Aspose.PDF for .NET 18.7.0 been released already?

As of June 26, 2018, 3:00 PM JST, I can only update Aspose.PDF to 18.6.1, which lacks some of the functionality used in the above example, so I cannot build that code.

Aspose.Pdf.Operator doesn’t have TextShowOperator().
operatorSelector.Selected (System.Collections.Generic.IList<Aspose.Pdf.Operator>) is not compatible with ‘System.Collections.ICollection’

and so on.
Unfortunately, I cannot test your sample now. When will it be usable?

@KDSSHO,

Version 18.7 of the Aspose.PDF for .NET API will be released next month, in July 2018. We will notify you once it is published.

I’m looking forward to it! Thanks.

@KDSSHO,

All descendants of Aspose.Pdf.Operator have been moved to the Aspose.Pdf.Operators namespace in version 18.6 of the Aspose.PDF for .NET API. Please refer to this code example: Remove All Text From PDF Document

Oops, yes, I remember now. It is written in the release notes, so I should have noticed it. I’m sorry I didn’t.

I’ll wait, then, for solutions to the other functionality that doesn’t work yet.

@KDSSHO,

Sure, we will keep you informed regarding the available updates.

The issues you have found earlier (filed as PDFNET-37868) have been fixed in Aspose.PDF for .NET 18.7.

@KDSSHO

In addition to the above notification: we used the following code snippet to remove all text from a PDF document, and the operation took less than 2 seconds to complete.

Document doc = new Document(inFile);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.RemoveAllText(doc);
doc.Save(outFile);
sw.Stop();

Hi Asad,

I downloaded 18.7 and tested the above snippet, but it does not seem to work.
I can’t build these 2 lines.

list.AddRange(operatorSelector.Selected);
page.Contents.Delete(list);

@KDSSHO

You can simply modify your code snippet as follows, in order to get it working correctly.

page.Contents.Delete(operatorSelector.Selected);

Furthermore, please also check the code snippet that we shared in our earlier response. We have introduced a new method, TextFragmentAbsorber.RemoveAllText(), which removes text from a PDF document efficiently. You may also use this approach to remove all text from your PDF documents.
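
Putting the two corrections from this thread together (the Aspose.Pdf.Operators namespace and passing operatorSelector.Selected straight to Delete), the operator-based removal might be sketched as follows. This assumes version 18.7 as discussed here, and later versions may differ. The class and method names are illustrative only.

```csharp
using Aspose.Pdf;

public static class TextOperatorRemoval
{
    // Deletes every text-showing operator from each page of the document.
    // TextOperatorRemoval and RemoveTextOperators are hypothetical names.
    public static void RemoveTextOperators(Document pdfDocument)
    {
        // Aspose.PDF page collections are 1-based
        for (int i = 1; i <= pdfDocument.Pages.Count; i++)
        {
            Page page = pdfDocument.Pages[i];
            // Since 18.6 the operator classes live in Aspose.Pdf.Operators
            OperatorSelector operatorSelector =
                new OperatorSelector(new Aspose.Pdf.Operators.TextShowOperator());
            // Collect all text-showing operators on the page
            page.Contents.Accept(operatorSelector);
            // Pass the selection directly; no ArrayList copy is needed
            page.Contents.Delete(operatorSelector.Selected);
        }
    }
}
```

As in the earlier snippets, the caller loads the Document first and saves it after the call.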

Hi Asad,

Thank you very much. I was able to fix it just as you described!
I’ll check RemoveAllText(), too.