Remove all text from PDF document except images using Aspose.PDF for .NET

echemistry · September 16, 2011, 2:30pm

I am wondering whether there is a way to remove all text from a PDF file. I only need the vector images. Since I cannot extract vector images from a PDF file directly, removing the text instead would help a lot. Thanks.

shahzadlatif · September 17, 2011, 2:11pm

Hi Tony,

To remove all the text from a PDF file, you may try our new merged Aspose.Pdf for .NET 6.2.0. You need to replace every text fragment with the empty string, i.e. "". Please use the following sample to remove all the text from the PDF file:

using Aspose.Pdf;
using Aspose.Pdf.Text;

// open the document
Document pdfDocument = new Document("input.pdf");

// create a TextFragmentAbsorber object to find all text fragments
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();

// accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    // replace the text with an empty string
    textFragment.Text = "";
}

// save the modified document
pdfDocument.Save("output.pdf");

I hope this helps. If you have any further questions, please do let us know.
Regards,

echemistry · September 17, 2011, 3:58pm

Your example code does not work with the attached PDF. I confirmed that I tried version 6.2.0.0.

shahzadlatif · September 19, 2011, 5:52am

Hi Tony,

I have reproduced this problem at my end and logged it as PDFNEWNET-30701 in our issue tracking system. Our team will look into this issue and you’ll be updated via this forum thread once it is resolved.

We’re sorry for the inconvenience.
Regards,

codewarior · February 29, 2012, 3:20pm

Hi Tony,

Thanks for your patience.

I am pleased to share that the issue reported earlier has been fixed. Please try the latest release, Aspose.Pdf for .NET 6.7.0. For your reference, I have also attached the resulting PDF, which I generated using the code snippet already shared in this thread. If you still face a similar problem or have any further questions, please feel free to contact us.

asad.ali · June 3, 2020, 9:09pm

@echemistry

We would like to share with you that Aspose.PDF for .NET now offers a much faster way to delete all text from a PDF document. Please check the following snippet in order to achieve that:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(myDir + @"2.pdf");

// Text-showing operators to look for (PDF operators Tj, TJ, ' and ")
// Note: the List below is reproduced as posted; its generic type argument appears to have been lost.
Operator[] operators = new Operator[]
{
    new Operator.ShowText(),
    new Operator.SetGlyphsPositionShowText(new List()),
    new Operator.MoveToNextLineShowText(),
    new Operator.SetSpacingMoveToNextLineShowText(0,0,""),
};

foreach (Page page in pdfDocument.Pages)
{
    // Collect every text-showing operator found on this page
    ArrayList list = new ArrayList();
    OperatorCollection pageOperators = page.Contents;

    foreach (Operator op in operators)
    {
        OperatorSelector operatorSelector = new OperatorSelector(op);
        pageOperators.Accept(operatorSelector);
        list.AddRange(operatorSelector.Selected);
    }

    // Delete the collected operators from the page contents
    pageOperators.Delete(list);
}
pdfDocument.Save(myDir + "TextRemoved_operators_18.4.pdf");