Saving PDF Document after removing content throws a NullReferenceException

I am trying to use Aspose.PDF to remove all invisible text layers from a document. These invisible text layers are the result of a previous OCR operation. I need to remove them before reprocessing the document through OCR - otherwise, I will get a “Double OCR” problem, where all searchable text appears twice.

I previously tried using the TextFragmentAbsorber to find invisible text fragments and then setting each text fragment to an empty string. Unfortunately, this was incredibly slow - sometimes over 5 minutes per page, which is triggering my process timeout.

Therefore, I am trying to loop through the page.Contents collection to search for BT (Begin Text) and ET (End Text) operators, capturing all operators in-between, looking for a 3 Tr operator (text rendering mode invisible), and then calling page.Contents.Delete() if a 3 Tr is found.
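To make the scan logic concrete, here is a minimal sketch of it. It models each operator as a plain string instead of using Aspose's operator objects, so the class and method names (InvisibleTextScanner, FindInvisibleTextBlocks, RemoveBlocks) are hypothetical and nothing here is Aspose API. It only illustrates finding BT ... ET blocks that contain 3 Tr and deleting them back to front so that earlier indices stay valid.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: operators are modeled as plain strings such as "BT" or "3 Tr".
public static class InvisibleTextScanner
{
    // Returns 0-based inclusive (Start, End) ranges of BT..ET blocks that contain "3 Tr".
    public static List<(int Start, int End)> FindInvisibleTextBlocks(IReadOnlyList<string> ops)
    {
        var ranges = new List<(int Start, int End)>();
        int start = -1;          // index of the currently open BT, or -1 if none
        bool invisible = false;  // true once "3 Tr" is seen inside the open block

        for (int i = 0; i < ops.Count; i++)
        {
            string op = ops[i].Trim();
            if (op == "BT")
            {
                start = i;
                invisible = false;
            }
            else if (start >= 0 && op == "3 Tr")
            {
                invisible = true;
            }
            else if (start >= 0 && op == "ET")
            {
                if (invisible) ranges.Add((start, i));
                start = -1;
            }
        }
        return ranges;
    }

    // Returns a copy of ops with the given ranges removed, last range first,
    // so that removing one block never shifts the indices of blocks still pending.
    public static List<string> RemoveBlocks(IReadOnlyList<string> ops, List<(int Start, int End)> ranges)
    {
        var result = new List<string>(ops);
        for (int r = ranges.Count - 1; r >= 0; r--)
        {
            result.RemoveRange(ranges[r].Start, ranges[r].End - ranges[r].Start + 1);
        }
        return result;
    }
}
```

One caveat with this approach: the text rendering mode is part of the graphics state, so a 3 Tr set in one text object stays in effect for later text objects until it is changed or restored. A scan that only looks for 3 Tr inside each individual BT ... ET block can therefore miss invisible text, or remove a block that switches back to a visible mode partway through.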

Unfortunately, when I call document.Save() at the end of this loop, I receive a NullReferenceException. The stack trace is of little use because it’s obfuscated. Nonetheless, here it is:

   at #=zAeY$NuMBAmJjsmkfApRKezBh_F5l.#=zmk7mtyM=(#=zFmA20cmyVm5QZ$0jFMrqaP3Ohjh1 #=zokgkXC$dh0fR)
   at #=zUVbXLpeH7qSc7llNlK7ncoQ=.#=zjDiJry0=(#=zFmA20cmyVm5QZ$0jFMrqaP3Ohjh1 #=zokgkXC$dh0fR)
   at #=ziUwdsE_RoM4kdjkRFk_Rzj0=.#=zjDiJry0=(#=zFmA20cmyVm5QZ$0jFMrqaP3Ohjh1 #=zokgkXC$dh0fR)
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zxvr9L7oLdT_YTPOMeNkBlgpqFRN1ea9CjVFOi_LWbv5b(Object #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zAGHy1h3L4dZimT3OgWmf$8An0AKE(Object #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=z0$FZgFgQgCzvKRYGKt8qNOQ=(MethodBase #=zTPv_1Iw=, Boolean #=zAzPicws=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=z9HP2gw4N_YbowQMB0f3Kg5Y=(#=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U= #=zTPv_1Iw=, #=qpERxZKoT7cBo5CypqeVCEPDzmOQ5qjrD6ZryeJocp0I= #=zAzPicws=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zqbECXdJ_twxtStaTJrkqBjs=(Boolean #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zxvr9L7oLdT_YTPOMeNkBlgpqFRN1ea9CjVFOi_LWbv5b(Object #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zAGHy1h3L4dZimT3OgWmf$8An0AKE(Object #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zRJFnc6X$WY2agkoIHOO9GJ5cFrOH00wqCw==()
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=z68Lp8NA4xKwE3uDHLNIkPlRN6HpOeehKYg==(#=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U= #=zTPv_1Iw=, #=qpERxZKoT7cBo5CypqeVCEPDzmOQ5qjrD6ZryeJocp0I= #=zAzPicws=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zqbECXdJ_twxtStaTJrkqBjs=(Boolean #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zxvr9L7oLdT_YTPOMeNkBlgpqFRN1ea9CjVFOi_LWbv5b(Object #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zAGHy1h3L4dZimT3OgWmf$8An0AKE(Object #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zRJFnc6X$WY2agkoIHOO9GJ5cFrOH00wqCw==()
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zKVFlxpW_AznOJx4FIuZsUjlykL6jArealxNr2H$IpDHE(Object #=zTPv_1Iw=, UInt32 #=zAzPicws=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zqbECXdJ_twxtStaTJrkqBjs=(Boolean #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zqbECXdJ_twxtStaTJrkqBjs=(Boolean #=zTPv_1Iw=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zyYmXx0gb$vFPKAEmHxnFbpOuUq2cxWg4Qw==(Object[] #=zTPv_1Iw=, Type[] #=zAzPicws=, Type[] #=zW264f5Q=, Object[] #=z9pl$FKo=)
   at #=qmh6bZuxa7yeSwR5TaJdU3Ygunr$5vEJqcbqnyBH9v5U=.#=zTWbaTmvPyC8zNf0Z3a9tY1ZuVKLLfhCgsUkW0y0OnvNe(Stream #=zTPv_1Iw=, String #=zAzPicws=, Object[] #=zW264f5Q=)
   at Aspose.Pdf.Document.#=zgv0R6rzJJFzN(Stream #=z4pqNpB4=, SaveOptions #=zDrH$9qVYCr8I)
   at Aspose.Pdf.Document.#=zGCQNpK7iFIXB(String #=z8FuxuEK8Mixe)
   at Aspose.Pdf.Document.Save(String outputFileName)

@aphilyaw7

You’re seeing a NullReferenceException after modifying the page content streams, which can leave the PDF’s internal structure inconsistent and cause the save operation to fail.

Use the built‑in hidden‑data sanitizer

Aspose.PDF provides a sanitizer that removes hidden objects, including text drawn with the invisible rendering mode (operator 3 Tr). It works on the object model, so the PDF structure remains valid and the document can be saved without error.

using Aspose.Pdf;
using Aspose.Pdf.Security.HiddenDataSanitization;

// Load the PDF that contains the OCR‑generated invisible text
var doc = new Document(@"C:\input\ocred.pdf");

// Configure the sanitizer to remove all hidden content
var options = HiddenDataSanitizationOptions.All();   // removes hidden text, XObjects, etc.
options.ConvertPagesToImages = false;               // keep pages as PDF
options.ImageDpi = 200;                             // required by the class, not used here

var sanitizer = new HiddenDataSanitizer(options);
sanitizer.Sanitize(doc);

// Save the cleaned PDF
doc.Save(@"C:\output\cleaned.pdf");

The HiddenDataSanitizer walks the PDF objects, detects the invisible‑text operators and removes the corresponding content safely [4].

Alternative approaches

  • OptimizeResources – you can also call document.OptimizeResources with options such as RemoveUnusedObjects = true to clean up unused streams, which may help eliminate hidden content [4].
  • Manual stream editing – if you must edit the content stream yourself, clearing the entire stream (e.g., page.Contents.Clear()) and rebuilding it can avoid the inconsistency that leads to the NRE, but this requires careful reconstruction of the page content.

By using the sanitizer (or the optimizer) you should be able to drop the invisible OCR text layer and save the PDF without encountering the NullReferenceException.

I have tried many different things in order to remove the existing OCR layers. I really don’t understand why it’s so difficult to DELETE content from an existing PDF. The following suggestions will not work for me for various reasons:

  • Use the TextFragmentAbsorber and set all invisible text fragments to string.Empty. Unfortunately, this is WAY too slow, and it is causing my five-minute timeout to trigger.

  • Flatten the PDF. Unfortunately, if this is a mixed PDF with native text and scanned text, then this rasterizes the entire file. I don’t want to force native text to go through OCR (because OCR is totally not perfect), and this also increases the file size.

  • Use OCRmyPDF with the --redo-ocr option and set the Tesseract timeout to 0 so that it skips the actual OCR operation. This didn’t work for me.

Well, it turns out that I forgot to set the Aspose license file in my code - stupid mistake.
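For anyone who lands here with the same exception, this is a minimal sketch of applying the license before any document is loaded. The license file name is a placeholder (an assumption, not from my actual setup), and the input and output paths are just the ones used earlier in this thread.

```csharp
using Aspose.Pdf;

// Apply the license once at startup, before creating any Document.
var license = new License();
license.SetLicense("Aspose.PDF.lic"); // placeholder: point this at your own license file

// Load, modify, and save as before.
var doc = new Document(@"C:\input\ocred.pdf");
// ... remove the invisible text operators here ...
doc.Save(@"C:\output\cleaned.pdf");
```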

Still, better error messages would be appreciated!