I’m trying to remove all text from a page using the following code, but it’s too slow on pages that contain a lot of text or links. Is there a faster way to just remove the text from a PDF? Also, is there any way to add the TextFragment to another PDF before I remove it, for example by copying the fragment to another variable and adding that to the second PDF document? My attempt at this has failed.
Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
pdfDocument.Pages[i].Accept(textFragmentAbsorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
{
    //Create TextBuilder object
    //Aspose.Pdf.Text.TextBuilder textBuilder = new Aspose.Pdf.Text.TextBuilder(newDocument.Pages[i]);
    //Append the text fragment to the new PDF page
    //textBuilder.AppendText(textFragment); //doesn't work: the text is not appended to the new PDF document, maybe because it gets cleared below?
    textFragment.Text = ""; //too slow when removing each fragment on every page
}
In order to remove each TextFragment, the API has to parse the complete PDF file and replace each individual TextFragment with a blank character. However, in order for us to test the scenario, please share the resource file. We are sorry for the inconvenience.
I have tested the scenario and was able to observe that the text replace feature takes too much time. For the sake of correction, I have logged it in our issue tracking system as PDFNEWNET-37864. We will investigate this issue in detail and will keep you updated on the status of a correction.
Thanks a lot. It looks like there are too many TextFragments to loop through on some pages; it may be a fault in the document. A useful feature would be a way to select all the text on a page and remove it. I know you can use the absorber to get the entire page text in one go by page rectangle. It would be useful if, after getting the text from that rectangle, it were also possible to remove it.
I have observed that when using TextAbsorber to extract PDF contents, the process completes in a few seconds, but a similar approach is missing for removing page contents from a PDF file. For the sake of implementation, I have logged this requirement in our issue tracking system under the New Features list as PDFNEWNET-37868. We will investigate this requirement in detail and will keep you updated on its status.
It may be worth mentioning that the same slowness occurs when taking the TextFragments from one PDF and adding them to a new PDF, something like this:
for (int i = 1; i <= totPages; i++)
{
    if (bw.CancellationPending)
        break;
    if (ReAddText)
    {
        try
        {
            //Aspose.Pdf.Text.TextParagraphA textParagraphAbsorber = new Aspose.Pdf.Text.TextParagraph();
            Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
            //accept the absorber for this page
            pdfDocument.Pages[i].Accept(textFragmentAbsorber);
            //get the extracted text fragments
            Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            string info2 = info;
            int totalTF = textFragmentCollection.Count;
            int cTF = 0;
            foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
            {
                if (bw.CancellationPending)
                    break;
                try
                {
                    //Create TextBuilder object
                    Aspose.Pdf.Text.TextBuilder textBuilder = new Aspose.Pdf.Text.TextBuilder(newDocument.Pages[i]);
                    //Append the text fragment to the PDF page
                    textBuilder.AppendText(textFragment);
                }
                catch { }
            }
        }
        catch { }
    }
}
Also, it may be worth mentioning that I’m doing the “job” of transferring the TextFragments from one document to another in a background worker (BackgroundWorker / C#).
Basically, what I’m trying to do here is flatten the PDF pages without their text and later add the text to the new PDF, so that it keeps the search/select functionality while the background of each page is flattened, that is, converted to an image representation of all PDF objects on that page except the text. Maybe there is a better way to do this than the one I’m trying above?
I’m using the BackgroundWorker so that I can display some progress to the user.
Hi Razvan,
Thanks for sharing the details.
I have tested the scenario and have managed to reproduce the same issue. For the sake of correction, I have logged it separately as PDFNEWNET-37882. We will look further into the details of this problem and will keep you updated on the status of a correction. Please be patient and allow us a little time.
Hi Razvan,
As I understand it, since you need to keep the text searchable and want to flatten every other element on the page, the suitable approach is to extract the text from the file, add/copy it to another Document instance, and convert the PDF pages (all elements except text) to image format. However, since text extraction is currently taking too much time, you will need to wait until we optimize the API's text extraction performance.
I am using the evaluation version of Aspose.PDF now.
I am facing an issue quite similar to the one reported here, which has seemingly not been resolved for years.
I need to decide whether or not to purchase this product.
I’d like to know as soon as possible whether there is a way to update the text “reasonably quickly” by simply assigning another string, as follows:
textFragment.Text = updatedText;
which is reportedly “too slow”, just as I am experiencing right now with the latest version.
If there is a better solution (such as paragraph-based manipulation), please advise me.
Please note that paragraph-based manipulations are supported. Please refer to this help topic: Extract Paragraph from PDF. Kindly send all details of your scenario, including the source PDF and code. We will investigate it in our environment and share our findings with you.
We were able to observe the slow performance when setting the text of a text fragment to an empty string. It has been logged under the ticket ID PDFNET-44715 (input PDF: 2017112144212.pdf) in our issue tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.
In reference to the linked ticket ID PDFNET-37868, we have implemented a feature in version 18.7 of the Aspose.PDF for .NET API to remove all text items. Please use TextFragmentAbsorber.RemoveAllText(Page page) to remove text from a single page of a PDF document, and TextFragmentAbsorber.RemoveAllText(Document document) to remove text from the whole document. For example, in C#:
Document doc = new Document(inFile);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.RemoveAllText(doc);
doc.Save(outFile);
sw.Stop();
In reference to the linked ticket ID PDFNET-44715, please do not set empty text, because doing so invokes a number of checks and text position adjustment operations. Please use this alternative way of removing existing text (C#):
Stopwatch watch = new Stopwatch();
watch.Start();
Document doc = new Document(myDir + "2017112144212.pdf");
foreach (Page page in doc.Pages)
{
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+", new TextSearchOptions(true));
    //accept the absorber for the current page
    page.Accept(textFragmentAbsorber);
    //get the text fragments
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    Dictionary<Position, TextFragment> updatedFragments = new Dictionary<Position, TextFragment>();
    //prepare updated fragments
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        if (textFragment.Text == " " || textFragment.Text == "")
            continue;
        //create new text fragment with updated string that contains all necessary newline markers
        //string updatedText = TranslateText(textFragment.Text, "ja|en");
        string updatedText = textFragment.Text + "updated";
        TextFragment updatedFragment = new TextFragment(updatedText);
        //set new text fragment properties if necessary
        updatedFragment.TextState.Font = FontRepository.FindFont("MS UI Gothic");
        updatedFragment.TextState.FontSize = textFragment.TextState.FontSize;
        updatedFragment.TextState.LineSpacing = 0.5f;
        Position position = new Position(
            textFragment.Position.XIndent,
            textFragment.Position.YIndent - updatedFragment.TextState.FontSize - updatedFragment.TextState.LineSpacing);
        updatedFragments.Add(position, updatedFragment);
    }
    //remove existing text
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);
    //add updated fragments
    foreach (var entry in updatedFragments)
    {
        //create TextParagraph object
        TextParagraph par = new TextParagraph();
        //set paragraph position
        par.Position = entry.Key;
        //specify word wrapping mode
        par.FormattingOptions.WrapMode = TextFormattingOptions.WordWrapMode.ByWords;
        //add new TextFragment to paragraph
        par.AppendLine(entry.Value);
        //add the TextParagraph using TextBuilder
        TextBuilder textBuilder = new TextBuilder(page);
        textBuilder.AppendParagraph(par);
    }
}
// Save resulting PDF document.
doc.Save(myDir + "2017112144212_updated_op.pdf");
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);
In reference to the ticket ID PDFNET-37864: changing the text of a large number of text fragments invokes a number of checks and text position adjustment operations, which are essential in text-editing scenarios. The difficulty is that we cannot determine in advance how many text fragments will be removed. We recommend a different approach for removing all text from PDF pages (C#):
Document pdfDocument = new Document(myDir + "slow_textFragment.pdf");
for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    Page page = pdfDocument.Pages[i];
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);
}
pdfDocument.Save(myDir + "slow_textFragment_op_removed.pdf", Aspose.Pdf.SaveFormat.Pdf);
As of June 26, 2018, 3:00 PM JST, I can only update Aspose.PDF to 18.6.1, which lacks some of the functionality used in the above example, so I cannot build that code.
Aspose.Pdf.Operator doesn’t have TextShowOperator().
operatorSelector.Selected (System.Collections.Generic.IList<Aspose.Pdf.Operator>) is not compatible with ‘System.Collections.ICollection’
and so on.
I regret that I can not test your sample now. When is this sample going to be available?
All descendants of Aspose.Pdf.Operator have been moved to the Aspose.Pdf.Operators namespace in version 18.6 of the Aspose.PDF for .NET API. Please refer to this code example: Remove All Text From PDF Document
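For readers hitting the same build errors, here is a minimal sketch of the operator-based removal adapted to the relocated namespace. It assumes Aspose.PDF for .NET 18.6 or later, that TextShowOperator is available as Aspose.Pdf.Operators.TextShowOperator, and that page.Contents.Delete accepts the IList<Operator> returned by operatorSelector.Selected. The file names are placeholders that do not come from this thread, and the exact overloads should be checked against the installed version.

```csharp
using Aspose.Pdf;

class RemoveAllTextSketch
{
    static void Main()
    {
        // Placeholder paths (assumptions, not taken from the thread).
        string inFile = "input.pdf";
        string outFile = "input_text_removed.pdf";

        Document pdfDocument = new Document(inFile);

        // Aspose.PDF page collections are 1-based.
        for (int i = 1; i <= pdfDocument.Pages.Count; i++)
        {
            Page page = pdfDocument.Pages[i];

            // Assumed location of the operator class after the 18.6 namespace move.
            OperatorSelector operatorSelector =
                new OperatorSelector(new Aspose.Pdf.Operators.TextShowOperator());

            // Collect every text-showing operator in the page content stream.
            page.Contents.Accept(operatorSelector);

            // Selected is a generic IList<Operator>; it is passed to Delete directly
            // rather than copied into a non-generic ArrayList, which caused the
            // ICollection incompatibility reported above.
            page.Contents.Delete(operatorSelector.Selected);
        }

        pdfDocument.Save(outFile);
    }
}
```

Passing the selected operators straight to Delete avoids the ArrayList.AddRange call from the earlier samples. On 18.7 or later, TextFragmentAbsorber.RemoveAllText mentioned above is the simpler option.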