Removing all text from a page with TextFragmentCollection is too slow

razvar · November 26, 2014, 10:08am

Hello,

I’m trying to remove all text from a page using the following code, but it’s too slow on some pages that contain a lot of text/links. Is there a faster way to just remove the text from a PDF? Also, is there any way to add the textFragment to another PDF before I remove it, for example by copying the fragment to another variable and adding that to the second PDF document? My attempt at this has failed…

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
pdfDocument.Pages[i].Accept(textFragmentAbsorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
{
    //Create TextBuilder object
    //Aspose.Pdf.Text.TextBuilder textBuilder = new Aspose.Pdf.Text.TextBuilder(newDocument.Pages[i]);
    //Append the text fragment to the new PDF page
    //textBuilder.AppendText(textFragment); //doesn’t work, it doesn’t append text to the new PDF document… maybe because it gets cleared below?

    textFragment.Text = ""; //too slow when removing each fragment on all pages
}

Thank you,
Razvan

codewarior · November 27, 2014, 6:24am

Hi Razvan,

In order to remove each TextFragment, the API has to parse the complete PDF file and replace each individual TextFragment with a blank string. However, in order for us to test the scenario, could you please share the source file? We are sorry for the inconvenience.

razvar · December 1, 2014, 2:23am

Hi Nayyer,

Thank you. I’ve sent you the PDF document using the forum functionality “Contact->Send codewarior an email”.

Thank you again,
Razvan

codewarior · December 1, 2014, 7:52am

Hi Razvan,

Thanks for sharing the resource file.

I have tested the scenario and can confirm that the text replace feature takes too much time. So that it can be corrected, I have logged it in our issue tracking system as PDFNEWNET-37864. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for the inconvenience.

razvar · December 1, 2014, 10:23am

Hi Nayyer,

Thanks a lot. It looks like there are too many TextFragments to loop through on some pages; it may be a fault in the document. A useful feature might be a way to select all the text on a page and remove it. I know you can use the absorber to get the entire page text in one go by page rectangle. I guess it would be useful, after getting that text from the rectangle, to also be able to remove it.

Thanks again for your support,
Razvan

codewarior · December 2, 2014, 3:58am

Hi Razvan,

Thanks for sharing the details.

I have observed that when using TextAbsorber to extract PDF contents, the process completes in a few seconds, but a similar approach is missing for removing page contents from a PDF file. So that it can be implemented, I have logged this requirement in our issue tracking system under the New Features list as PDFNEWNET-37868. We will investigate this requirement in detail and will keep you updated on its status.

We apologize for the inconvenience.

razvar · December 3, 2014, 8:40am

Hi Nayyer,

It may be worth mentioning that the same slow speed occurs when taking the TextFragments from another PDF and adding them to a new PDF, something like this:

for (int i = 1; i <= totPages; i++)
{
    if (bw.CancellationPending)
        break;
    if (ReAddText)
    {
        try
        {
            //Aspose.Pdf.Text.TextParagraphA textParagraphAbsorber = new Aspose.Pdf.Text.TextParagraph();
            Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
            //accept the absorber for this page
            pdfDocument.Pages[i].Accept(textFragmentAbsorber);
            //get the extracted text fragments
            Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            string info2 = info;
            int totalTF = textFragmentCollection.Count;
            int cTF = 0;
            foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
            {
                if (bw.CancellationPending)
                    break;
                try
                {
                    //Create TextBuilder object
                    Aspose.Pdf.Text.TextBuilder textBuilder = new Aspose.Pdf.Text.TextBuilder(newDocument.Pages[i]);
                    //Append the text fragment to the PDF page
                    textBuilder.AppendText(textFragment);
                }
                catch { }
            }
        }
        catch { }
    }
}

Also, it may be worth mentioning that I’m doing the “job” of transferring the TextFragments from one document to another in a background worker (BackgroundWorker / C#).

Basically, what I’m trying to do here is flatten the PDF pages without the text, and later add the text to the new PDF so that it keeps the search/select functionality, while the background of each page is flattened/converted to an image representation of all PDF objects on that page except the text. Maybe there is a better way to do this than the one I’m trying above?
I’m using the BackgroundWorker so that I can display some progress to the user.

Thanks a lot for your support,
Razvan

codewarior · December 4, 2014, 1:52pm

Hi Razvan,

Thanks for sharing the details.

I have tested the scenario and have managed to reproduce the same issue. So that it can be corrected, I have logged it separately as PDFNEWNET-37882. We will look further into the details of this problem and will keep you updated on the status of a correction. Please be patient and allow us a little time.

codewarior · December 4, 2014, 2:08pm

razvar:

Basically, what I’m trying to do here is flatten the PDF pages without the text, and later add the text to the new PDF so that it keeps the search/select functionality, while the background of each page is flattened/converted to an image representation of all PDF objects on that page except the text. Maybe there is a better way to do this than the one I’m trying above?
I’m using the BackgroundWorker so that I can display some progress to the user.

Hi Razvan,

As I understand it, you need to keep the text searchable and want to flatten every other element on the page. The suitable approach is to extract the text from the file, add/copy it to another Document instance, and convert the PDF pages (all elements except text) to image format. But since the text extraction feature is taking too much time, you will need to wait until we optimize the API performance for text extraction.

KDSDEV · May 16, 2018, 8:49am

Hi,

Any news here?

I am now using the evaluation version of Aspose.PDF, and I am facing an issue quite similar to the one reported here, which seemingly has not been resolved for years.

I need to decide whether to purchase this product, so I’d like to know as soon as possible whether there is a way to update the text “reasonably quickly” by simply assigning another string, as follows:

textFragment.Text = updatedText;

So far this is reportedly “too slow”, and that is exactly what I am experiencing right now with the latest version.

If there is a better solution (such as paragraph-based manipulation), please advise me.

imran.rafique · May 16, 2018, 2:17pm

@KDSSHO,

Please note that paragraph-based manipulations are supported. Please refer to this help topic: Extract Paragraph from PDF. Kindly send all details of the scenario, including the source PDF and code. We will investigate your scenario in our environment and share our findings with you.

KDSDEV · May 17, 2018, 8:54am

Thank you for your prompt reply.

Please find it below. I use the sample app explained on this topic. Please take a look at the attached source PDF files as well.

The part I described as slow is the following:

//set old fragment text to empty string
textFragment.Text = String.Empty;

The whole process of replacing the texts with translated texts, as explained above, takes 20 seconds.
If I comment out this one line,

// textFragment.Text = String.Empty;

it finishes in only 2 seconds.

So I’d like to know whether it can be faster when I assign something to textFragment.Text.
I truly appreciate your help!

Thank you very much. I have already checked it out and am testing it now.
But it is slow, too, if I empty textFragment.Text, and fast if I don’t.

imran.rafique · May 17, 2018, 3:36pm

@KDSSHO,

We were able to reproduce the slow performance when the text of a text fragment is set to an empty string. It has been logged under the ticket ID PDFNET-44715 (input PDF: 2017112144212.pdf) in our issue tracking system. We have linked your post to this ticket and will keep you informed of any available updates.

KDSDEV · May 18, 2018, 5:20am

Thank you. I’ll keep an eye on this topic.

imran.rafique · June 24, 2018, 5:26pm

@KDSSHO,

In reference to the linked ticket ID PDFNET-37868, we have implemented a feature in version 18.7 of Aspose.PDF for .NET API to remove all text items. Please use TextFragmentAbsorber.RemoveAllText(Page page) to remove text from the page of PDF document, and TextFragmentAbsorber.RemoveAllText(Document document) to remove text from the whole document.
C#

Document doc = new Document(inFile);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.RemoveAllText(doc);
doc.Save(outFile);
sw.Stop();

In reference to the linked ticket ID PDFNET-44715, please do not set empty text, because it invokes a number of checks and text position adjustment operations. Please use this alternative way of removing existing text.
C#

Stopwatch watch = new Stopwatch();
watch.Start();
Document doc = new Document(myDir + "2017112144212.pdf");
foreach (Page page in doc.Pages)
{
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+", new TextSearchOptions(true));
    //accept the absorber for the current page
    page.Accept(textFragmentAbsorber);
    //get the text fragments
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    Dictionary<Position, TextFragment> updatedFragments = new Dictionary<Position, TextFragment>();
    //prepare updated fragments
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        if (textFragment.Text == " " || textFragment.Text == "")
            continue;
        //create new text fragment with updated string that contains all necessary newline markers
        //string updatedText = TranslateText(textFragment.Text, "ja|en");
        string updatedText = textFragment.Text + "updated";
        TextFragment updatedFragment = new TextFragment(updatedText);

        //set new text fragment properties if necessary
        updatedFragment.TextState.Font = FontRepository.FindFont("MS UI Gothic");
        updatedFragment.TextState.FontSize = textFragment.TextState.FontSize;
        updatedFragment.TextState.LineSpacing = 0.5f;

        Position position = new Position(textFragment.Position.XIndent, textFragment.Position.YIndent
                          - updatedFragment.TextState.FontSize -updatedFragment.TextState.LineSpacing);
        updatedFragments.Add(position, updatedFragment);
    }

    //remove existing text
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);

    //add updated fragments
    foreach (var entry in updatedFragments)
    {
        //create TextParagraph object
        TextParagraph par = new TextParagraph();
        //set paragraph position
        par.Position = entry.Key;
        //specify word wrapping mode
        par.FormattingOptions.WrapMode = TextFormattingOptions.WordWrapMode.ByWords;
        //add new TextFragment to paragraph
        par.AppendLine(entry.Value);

        //add the TextParagraph using TextBuilder
        TextBuilder textBuilder = new TextBuilder(page);
        textBuilder.AppendParagraph(par);
    }
}

// Save resulting PDF document.
doc.Save(myDir + "2017112144212_updated_op.pdf");
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);

In reference to the ticket ID PDFNET-37864, changing the text of a large number of text fragments invokes a number of checks and text position adjustment operations. These are essential in text editing scenarios. The difficulty is that we cannot determine in advance how many text fragments will be removed. We recommend using another approach for this scenario to remove all text from PDF pages.
C#

Document pdfDocument = new Document(myDir + "slow_textFragment.pdf");
for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    Page page = pdfDocument.Pages[i];
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);    
}
pdfDocument.Save(myDir + "slow_textFragment_op_removed.pdf", Aspose.Pdf.SaveFormat.Pdf);

KDSDEV · June 26, 2018, 6:09am

Hi Imran

Is Aspose.Pdf for .Net 18.7.0 released already?

As of June 26, 2018, 3:00 PM JST, I can only update Aspose.PDF to 18.6.1, which lacks some of the functionality used in the above example, so I cannot build that code.

Aspose.Pdf.Operator doesn’t have TextShowOperator().
operatorSelector.Selected (System.Collections.Generic.IList<Aspose.Pdf.Operator>) is not compatible with ‘System.Collections.ICollection’

and so on.
I regret that I cannot test your sample now. When will it be available?

imran.rafique · June 26, 2018, 2:01pm

@KDSSHO,

Version 18.7 of the Aspose.PDF for .NET API will be released next month, in July 2018. We will notify you once it is published.

KDSDEV · June 27, 2018, 2:28am

I’m looking forward to it! Thanks.

imran.rafique · June 27, 2018, 2:59am

@KDSSHO,

All descendants of Aspose.Pdf.Operator have been moved to the Aspose.Pdf.Operators namespace in version 18.6 of the Aspose.PDF for .NET API. Please refer to this code example: Remove All Text From PDF Document
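
For readers on 18.6 or later, the earlier removal loop might look roughly like the sketch below once the relocated operator type is used. This is not the linked example itself. The class name RemoveTextSketch and the helper RemoveTextOperators are hypothetical, and the sketch assumes that page.Contents.Delete accepts the IList<Operator> returned by OperatorSelector.Selected directly, which would also avoid the ArrayList conversion error reported above.

```csharp
using Aspose.Pdf;

static class RemoveTextSketch
{
    // Hypothetical helper: deletes every text-showing operator from every page
    // and returns how many operators were removed.
    public static int RemoveTextOperators(Document document)
    {
        int removed = 0;

        // Page collections in Aspose.PDF are 1-indexed.
        for (int i = 1; i <= document.Pages.Count; i++)
        {
            Page page = document.Pages[i];

            // From 18.6 on, the operator types live in the Aspose.Pdf.Operators namespace.
            OperatorSelector selector =
                new OperatorSelector(new Aspose.Pdf.Operators.TextShowOperator());

            // Collect all text-showing operators in the page content stream.
            page.Contents.Accept(selector);
            removed += selector.Selected.Count;

            // Assumption: Delete takes the IList<Operator> as-is, so no ArrayList copy is needed.
            page.Contents.Delete(selector.Selected);
        }

        return removed;
    }
}
```

A caller would load a Document, pass it to RemoveTextOperators, and then call Save on it, as in the examples earlier in this topic.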

KDSDEV · June 27, 2018, 4:16am

Oops, yes, I remember that they have. It’s written in the release notes, so I should have noticed it. I’m sorry I didn’t.

Then I’ll wait for solutions to the other functionality that doesn’t work yet.