It may be worth mentioning that the same slow speed occurs when taking the TextFragments from another PDF and adding them to a new PDF, something like this:
for (int i = 1; i <= totPages; i++)
{
    if (bw.CancellationPending)
        break;
    if (ReAddText)
    {
        try
        {
            //Aspose.Pdf.Text.TextParagraphA textParagraphAbsorber = new Aspose.Pdf.Text.TextParagraph();
            Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
            //accept the absorber for this page
            pdfDocument.Pages[i].Accept(textFragmentAbsorber);
            //get the extracted text fragments
            Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
            string info2 = info;
            int totalTF = textFragmentCollection.Count;
            int cTF = 0;
            foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
            {
                if (bw.CancellationPending)
                    break;
                try
                {
                    //Create TextBuilder object
                    Aspose.Pdf.Text.TextBuilder textBuilder = new Aspose.Pdf.Text.TextBuilder(newDocument.Pages[i]);
                    //Append the text fragment to the PDF page
                    textBuilder.AppendText(textFragment);
                }
                catch { }
            }
        }
        catch { }
    }
}
Also, it may be worth mentioning that I’m doing the “job” of transferring the TextFragments from one document to another in a background worker (BackgroundWorker / C#).
Basically, what I’m trying to do here is flatten the PDF pages without their text, and then add the text to the new PDF, so that it keeps the PDF’s search/select functionality while the background of each page is flattened, i.e. converted to an image representation of all PDF objects on that page except the text. Maybe there is a better way to do this than the one I’m trying above?
I’m using the BackgroundWorker so that I can display some progress to the user.
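For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of progress reporting and cancellation with BackgroundWorker. It is not the poster's actual code: the page count, the `Thread.Sleep` placeholder for the per-page work, and the class name `ProgressDemo` are all hypothetical, and no Aspose.PDF calls are involved.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

class ProgressDemo
{
    static void Main()
    {
        const int totPages = 5; // hypothetical page count
        var done = new ManualResetEventSlim(false);

        var bw = new BackgroundWorker
        {
            WorkerReportsProgress = true,      // required before calling ReportProgress
            WorkerSupportsCancellation = true  // required before calling CancelAsync
        };

        bw.DoWork += (sender, e) =>
        {
            var worker = (BackgroundWorker)sender;
            for (int i = 1; i <= totPages; i++)
            {
                if (worker.CancellationPending)
                {
                    e.Cancel = true; // surfaces as e.Cancelled in RunWorkerCompleted
                    break;
                }
                Thread.Sleep(10); // placeholder for the real per-page work
                worker.ReportProgress(i * 100 / totPages, i);
            }
        };

        // In a WinForms/WPF app these handlers run on the UI thread; in a console
        // app they run on thread-pool threads, so their ordering is not guaranteed.
        bw.ProgressChanged += (sender, e) =>
            Console.WriteLine("Page {0} of {1} ({2}%)", e.UserState, totPages, e.ProgressPercentage);

        bw.RunWorkerCompleted += (sender, e) =>
        {
            Console.WriteLine(e.Cancelled ? "Cancelled" : "Finished");
            done.Set();
        };

        bw.RunWorkerAsync();
        done.Wait(); // keep the console process alive until the worker completes
    }
}
```

Calling `bw.CancelAsync()` from another thread sets `CancellationPending`, which the loop checks once per page.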
Hi Razvan,
Thanks for sharing the details.
I have tested the scenario and managed to reproduce the issue. So that it can be corrected, I have logged it separately as PDFNEWNET-37882. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and spare us a little time.
Hi Razvan,
As per my understanding, since you need to keep the text searchable and want to flatten every other element on the page, the suitable approach is to extract the text from the file, add/copy it to another Document instance, and convert the PDF pages (all elements except text) to image format. However, since text extraction is currently taking too much time, you will need to wait until we optimize the API’s text extraction performance.
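The text-copy step of this approach might be sketched as follows. It uses only calls that already appear in this thread (TextFragmentAbsorber, Page.Accept, TextBuilder.AppendText). The one change, which is an assumption and not something confirmed here, is that the TextBuilder is created once per page instead of once per fragment. The helper name `CopyText` is hypothetical, and whether this affects the slowdown tracked as PDFNEWNET-37882 is not established in this thread.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

static class TextCopy
{
    // Hypothetical helper: copies text fragments page by page from source to target.
    // Assumes target already has at least as many pages as source
    // (for example, the flattened, image-only version of the same document).
    public static void CopyText(Document source, Document target)
    {
        for (int i = 1; i <= source.Pages.Count; i++) // Aspose page indexes start at 1
        {
            // Collect the text fragments of the current source page
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            source.Pages[i].Accept(absorber);

            // Assumption: one TextBuilder per page rather than per fragment
            TextBuilder builder = new TextBuilder(target.Pages[i]);
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                builder.AppendText(fragment);
            }
        }
    }
}
```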
I am using the evaluation version of Aspose.PDF now.
But I am facing an issue quite similar to the one reported here, which has seemingly not been resolved for years.
I need to decide whether or not to purchase this product.
I’d like to know as soon as possible whether there is a way to “reasonably quickly” update the text by simply assigning another string, as follows:
textFragment.Text = updatedText;
This has so far been reported as “too slow”, which is exactly what I am experiencing right now with the latest version.
If there is a better solution (such as paragraph-based manipulation), please advise me.
Please note that paragraph-based manipulations are supported. Please refer to this help topic: Extract Paragraph from PDF. Kindly send all details of the scenario, including the source PDF and code. We will investigate your scenario in our environment and share our findings with you.
We were able to observe the slow performance when setting the text of a text fragment to an empty string. It has been logged under the ticket ID PDFNET-44715 (input PDF: 2017112144212.pdf) in our issue tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates.
In reference to the linked ticket ID PDFNET-37868, we have implemented a feature in version 18.7 of the Aspose.PDF for .NET API to remove all text items. Please use TextFragmentAbsorber.RemoveAllText(Page page) to remove text from a page of a PDF document, and TextFragmentAbsorber.RemoveAllText(Document document) to remove text from the whole document.
C#
Document doc = new Document(inFile);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.RemoveAllText(doc);
doc.Save(outFile);
sw.Stop();
In reference to the linked ticket ID PDFNET-44715, please do not set empty text, because doing so invokes a number of checks and text position adjustment operations. Please use this alternative way of removing existing text.
C#
Stopwatch watch = new Stopwatch();
watch.Start();
Document doc = new Document(myDir + "2017112144212.pdf");
foreach (Page page in doc.Pages)
{
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+", new TextSearchOptions(true));
    //accept the absorber for the current page
    page.Accept(textFragmentAbsorber);
    //get the text fragments
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    Dictionary<Position, TextFragment> updatedFragments = new Dictionary<Position, TextFragment>();
    //prepare updated fragments
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        if (textFragment.Text == " " || textFragment.Text == "")
            continue;
        //create new text fragment with updated string that contains all necessary newline markers
        //string updatedText = TranslateText(textFragment.Text, "ja|en");
        string updatedText = textFragment.Text + "updated";
        TextFragment updatedFragment = new TextFragment(updatedText);
        //set new text fragment properties if necessary
        updatedFragment.TextState.Font = FontRepository.FindFont("MS UI Gothic");
        updatedFragment.TextState.FontSize = textFragment.TextState.FontSize;
        updatedFragment.TextState.LineSpacing = 0.5f;
        Position position = new Position(textFragment.Position.XIndent, textFragment.Position.YIndent
            - updatedFragment.TextState.FontSize - updatedFragment.TextState.LineSpacing);
        updatedFragments.Add(position, updatedFragment);
    }
    //remove existing text
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);
    //add updated fragments
    foreach (var entry in updatedFragments)
    {
        //create TextParagraph object
        TextParagraph par = new TextParagraph();
        //set paragraph position
        par.Position = entry.Key;
        //specify word wrapping mode
        par.FormattingOptions.WrapMode = TextFormattingOptions.WordWrapMode.ByWords;
        //add new TextFragment to paragraph
        par.AppendLine(entry.Value);
        //add the TextParagraph using TextBuilder
        TextBuilder textBuilder = new TextBuilder(page);
        textBuilder.AppendParagraph(par);
    }
}
//save resulting PDF document
doc.Save(myDir + "2017112144212_updated_op.pdf");
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);
In reference to the ticket ID PDFNET-37864, the scenario is that changing the text of a multitude of text fragments invokes a number of checks and text position adjustment operations, which are essential in text editing scenarios. The difficulty is that we cannot determine how many text fragments will be removed in the scenario. We recommend using another approach for this scenario to remove all text from PDF pages.
C#
Document pdfDocument = new Document(myDir + "slow_textFragment.pdf");
for (int i = 1; i <= pdfDocument.Pages.Count; i++)
{
    Page page = pdfDocument.Pages[i];
    OperatorSelector operatorSelector = new OperatorSelector(new Operator.TextShowOperator());
    System.Collections.ArrayList list = new System.Collections.ArrayList();
    page.Contents.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
    page.Contents.Delete(list);
}
pdfDocument.Save(myDir + "slow_textFragment_op_removed.pdf", Aspose.Pdf.SaveFormat.Pdf);
As of June 26, 2018, 3:00 PM JST, I can only update Aspose.PDF to 18.6.1, which lacks some of the functionality used in the example above. I cannot build that code.
Aspose.Pdf.Operator doesn’t have TextShowOperator().
operatorSelector.Selected (System.Collections.Generic.IList<Aspose.Pdf.Operator>) is not compatible with ‘System.Collections.ICollection’,
and so on.
I regret that I cannot test your sample now. When is this sample going to be available?
All descendants of Aspose.Pdf.Operator have been moved under the Aspose.Pdf.Operators namespace in version 18.6 of the Aspose.PDF for .NET API. Please refer to this code example: Remove All Text From PDF Document
In addition to the above, we used the following code snippet to remove all text from a PDF document, and it took less than 2 seconds to complete the operation.
Document doc = new Document(inFile);
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.RemoveAllText(doc);
doc.Save(outFile);
sw.Stop();
You can simply modify your code snippet as follows, in order to get it working correctly.
page.Contents.Delete(operatorSelector.Selected);
Furthermore, please also check the code snippet that we shared in our earlier response. We have introduced a new method, TextFragmentAbsorber.RemoveAllText(), which removes text from a PDF document efficiently. You may also use this approach to remove all text from your PDF documents.
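Putting the two corrections from this reply together, the operator-based removal for version 18.6 and later might look like the sketch below. It assumes that TextShowOperator kept its class name after the move to the Aspose.Pdf.Operators namespace, and the file names (taken from the earlier example, without the myDir prefix) and the class name `RemoveTextOperators` are placeholders. It has not been verified against a specific release in this thread.

```csharp
using Aspose.Pdf;

class RemoveTextOperators
{
    static void Main()
    {
        // Placeholder input file; replace with your own path
        Document pdfDocument = new Document("slow_textFragment.pdf");

        for (int i = 1; i <= pdfDocument.Pages.Count; i++) // Aspose page indexes start at 1
        {
            Page page = pdfDocument.Pages[i];

            // Assumption: the operator class now lives in Aspose.Pdf.Operators (18.6+)
            OperatorSelector operatorSelector =
                new OperatorSelector(new Aspose.Pdf.Operators.TextShowOperator());

            // Select all text-showing operators on the page
            page.Contents.Accept(operatorSelector);

            // Pass the selected operators directly; no ArrayList copy is needed
            page.Contents.Delete(operatorSelector.Selected);
        }

        pdfDocument.Save("slow_textFragment_op_removed.pdf");
    }
}
```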