We want to remove the text from a PDF document. The following code doesn’t work as expected:
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = "";
}
Dear sir, I wrote this code on a button click, but it is not working. You can test it on any PDF file in which text is selectable.
If your goal is to remove text from a PDF, note that clearing the Text property of the TextFragment objects only changes the document in memory: the document has to be loaded from a real path and saved afterwards for the change to take effect. On some files this approach may still leave text behind.
Here is an updated version of your code:
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;
public void RemoveTextFromPdf(string pdfPath)
{
    // Load the PDF document from the given path
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdfPath);
    foreach (Page page in pdfDocument.Pages)
    {
        // Collect the text fragments of the current page
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
        page.Accept(textFragmentAbsorber);
        // Clear the text of every fragment found on the page
        foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
        {
            textFragment.Text = string.Empty;
        }
    }
    // Save the result next to the source file (the output name is an arbitrary choice)
    pdfDocument.Save(Path.ChangeExtension(pdfPath, ".textremoved.pdf"));
}
You can call this method and pass the path of your PDF file to remove the text. Ensure you have the necessary using directives and that the Aspose.PDF library is properly referenced in your project.
To replace every TextFragment inside the PDF document, you first need to search for each TextFragment using a regular expression and then replace it with an empty string. However, when using 2.pdf, I am afraid not all TextFragments are being replaced. To have this corrected, I have logged the problem as PDFNEWNET-36510 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.
[C#]
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"c:/pdftest/2.pdf");
// search all TextFragments
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
pdfDocument.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = "";
}
pdfDocument.Save("c:/pdftest/TextRemoved.pdf");
The issues you have found earlier (filed as PDFNET-36510) have been fixed in Aspose.PDF for .NET 18.5. This message was posted using BugNotificationTool by asad.ali
Adding to our previous responses, please check the following code snippet, which uses a much faster approach to remove text from a PDF document:
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(myDir + @"2.pdf");
// Used text showing operators
Operator[] operators = new Operator[]
{
    new Operator.ShowText(),
    new Operator.SetGlyphsPositionShowText(new List<GlyphPosition>()),
    new Operator.MoveToNextLineShowText(),
    new Operator.SetSpacingMoveToNextLineShowText(0,0,""),
};
foreach (Page page in pdfDocument.Pages)
{
    // Collect every text-showing operator found on the page, then delete them all at once
    ArrayList list = new ArrayList();
    OperatorCollection pageOperators = page.Contents;
    foreach (Operator op in operators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(op);
        pageOperators.Accept(operatorSelector);
        list.AddRange(operatorSelector.Selected);
    }
    pageOperators.Delete(list);
}
pdfDocument.Save(myDir + "TextRemoved_operators_18.4.pdf");
The main change in the latest version is the Aspose.Pdf.Operators namespace. Please check the following updated code snippet:
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(myDir + @"2.pdf");
// Used text showing operators
Operator[] operators = new Operator[]
{
    new Aspose.Pdf.Operators.ShowText(),
    new Aspose.Pdf.Operators.SetGlyphsPositionShowText(new List<Aspose.Pdf.Operators.GlyphPosition>()),
    new Aspose.Pdf.Operators.MoveToNextLineShowText(),
    new Aspose.Pdf.Operators.SetSpacingMoveToNextLineShowText(0,0,""),
};
foreach (Page page in pdfDocument.Pages)
{
    ArrayList list = new ArrayList();
    OperatorCollection pageOperators = page.Contents;
    foreach (Operator op in operators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(op);
        pageOperators.Accept(operatorSelector);
        list.AddRange(operatorSelector.Selected);
    }
    pageOperators.Delete(list);
}
pdfDocument.Save(myDir + "TextRemoved_operators_18.4.pdf");
One more question: is there any way to use a regular expression, or something else, to remove only specific text rather than all text from the given PDF document?
Yes, it is possible. Please consider using the following code snippet:
var textFragmentAbsorber = new TextFragmentAbsorber(@"(?<TM>\w*ex\w*)");
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
Document pdfDocument = new Document(dataDir + "test.pdf");
pdfDocument.Pages.Accept(textFragmentAbsorber); // you can also loop through pages to search text on page level for low memory consumption
var textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    textFragment.Text = String.Empty;
}
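Before running such a pattern against a document, it can help to check what it matches on plain text. The following is a minimal sketch that uses only the standard .NET Regex class, not Aspose.PDF; the class name RegexPreview, the method PreviewMatches and the sample sentence are made up for illustration. How the pattern behaves on text extracted from a real PDF may differ, since that depends on how the text is segmented in the file.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexPreview
{
    // Returns every substring of 'text' matched by 'pattern', in order of appearance.
    // Hypothetical helper for illustration only; it does not touch any PDF.
    public static string[] PreviewMatches(string pattern, string text)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For example, calling RegexPreview.PreviewMatches(@"(?<TM>\w*ex\w*)", "This text is an example; next page.") returns the words "text", "example" and "next", that is, every word containing "ex". Words without "ex" are left alone, which is what makes the pattern suitable for removing only specific text.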
I was trying to remove text from a PDF and I get an exception when deleting operators; see the exception in the attached image. image.png (3.9 KB)
The file that causes the exception: orig_6.pdf (270.6 KB)
and the code:
private static readonly Operator[] TextOperators =
{
    new Aspose.Pdf.Operators.ShowText(),
    new Aspose.Pdf.Operators.SetGlyphsPositionShowText(new List<Operators.GlyphPosition>()),
    new Aspose.Pdf.Operators.MoveToNextLineShowText(),
    new Aspose.Pdf.Operators.SetSpacingMoveToNextLineShowText(0, 0, string.Empty)
};
private static void RemoveTextFromPdfPage(Page pdfDocumentPage)
{
    List<Operator> operators = new List<Operator>();
    OperatorCollection pageOperators = pdfDocumentPage.Contents;
    foreach (Operator @operator in TextOperators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(@operator);
        pageOperators.Accept(operatorSelector);
        operators.AddRange(operatorSelector.Selected);
    }
    pageOperators.Delete(operators);
}
Then I tried using the method that sets text fragments to empty strings. Unfortunately, that method has its own bugs too: it leaves some selectable text behind. Try the source file I am uploading: orig_1.pdf (674.7 KB)
Use the file with this code:
private static void RemoveTextFromPdfPage(Page pdfDocumentPage)
{
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
    textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
    pdfDocumentPage.Accept(textFragmentAbsorber);
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        textFragment.Text = string.Empty;
    }
}
The outcome is a file that still has selectable words in the table: “digits” and “N/A”. The page numbers (“1/7”) are also not removed, although the left side of the page is cleared. There is a similar issue at the top of the page, where “en-English” is left selectable while the “NL-Netherlands” fragment to its left was replaced. See the output file I am receiving: result_textFragments.pdf (675.4 KB)
Also, I have just checked the other pages of the same file, and the rest of them also have “en-English” at the top of the page and the page number/page count at the bottom left selectable.
We tested the scenario using Aspose.PDF for .NET 20.9 and were able to notice the exception in our environment. Therefore, we have logged it as PDFNET-48740 in our issue tracking system. We will further look into its details and keep you informed with the status of its correction. Please be patient and spare us some time.
The recommended approach to remove the text from a PDF document is to use operators. We tried the same code snippet with this file of yours and version 20.9, and we were unable to notice any issue, as all text was removed successfully. Please check the attached PDF document for your reference. TextRemoved_operators_20.9.pdf (674.1 KB)
Yes, reporting the issue in paid support would escalate its priority and it will have precedence over the issues logged under free support model. If you have a paid support subscription, you can create a ticket there with the reference of issue ID so that it can be escalated accordingly.
Good day to you, dear Ali!
I used the code below to remove all text from several pages of a PDF document:
Operator[] operators = new Operator[]
{
    new ShowText(),
    new SetGlyphsPositionShowText(new List<GlyphPosition>()),
    new MoveToNextLineShowText(),
    new SetSpacingMoveToNextLineShowText(0,0,""),
};
List list = new List();
OperatorCollection pageOperators = page.Contents;
foreach (Operator op in operators)
{
    OperatorSelector operatorSelector = new OperatorSelector(op);
    pageOperators.Accept(operatorSelector);
    list.AddRange(operatorSelector.Selected);
}
pageOperators.Delete(list);
It works well for 13 of the 18 pages of the PDF document. The code is stored in a separate procedure. But after the 13th page the following error appears:
System.ArgumentException: An element with such a key already exists
(An entry with this key already exists - translation from my language)
at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
at System.Collections.Generic.SortedList`2.Add(TKey key, TValue value)
at Aspose.Pdf.OperatorCollection.#=zNaG7xQiZGBPB(IList`1 #=zmexNZBA=)
at Aspose.Pdf.OperatorCollection.#=zZOBOkPM=(IList`1 #=zmexNZBA=, #=zK9S$bkg= #=zCJf6AvQ=)
at Aspose.Pdf.OperatorCollection.Delete(IList`1 list)
at Pdf2DwgA.SelectPDFs.RemoveTextFromPage(Page OldPage) in C:\C# AutoCAD\V2019\PDF2DWG2019\Pdf2DwgA\SelectPDFs.cs:line 251
SelectPDFs.cs is the name of my program file. We will soon obtain a new version of Aspose.PDF for .NET; will this bug be fixed in it?
I beg your pardon, but there was a mistake in the code I posted:
Not
List list = new List();
but
List<Operator> list = new List<Operator>();