I’m reaching out to explore solutions for scanning text and performing OCR on PDF documents to redact PII. We are looking to process approximately 1 million PDFs every day, and I am keen to understand the best practices and tools available to handle this volume efficiently on .NET 8.
For now I am running a trial with .NET 8 to redact SSNs from PDFs, using the solution provided here: Redact PDF using C# | Aspose.PDF
Could you please check and let me know?
@Aniruddha003
To effectively redact Social Security Numbers (SSNs) from PDF documents using Aspose.PDF in .NET 8, follow these best practices:
- Identify the Pattern: Use a regular expression to identify SSNs in the format XXX-XX-XXXX. The pattern can be defined as @"\d{3}-\d{2}-\d{4}".
- Load the PDF Document: Start by loading your PDF document using the Document class.
- Use TextFragmentAbsorber: Create a TextFragmentAbsorber to search for the SSN pattern in the document.
- Apply Redaction: Once you identify the text fragments that match the SSN pattern, apply redaction to those fragments.
Here’s a sample code snippet to illustrate the process:
using Aspose.Pdf;
using Aspose.Pdf.Text;
// Load the PDF document
Document pdfDocument = new Document("input.pdf");
// Define the pattern for SSNs
string pattern = @"\d{3}-\d{2}-\d{4}";
// Create a TextFragmentAbsorber to search for text
TextFragmentAbsorber absorber = new TextFragmentAbsorber(pattern);
pdfDocument.Pages.Accept(absorber);
// Iterate through the text fragments and apply redaction
foreach (TextFragment textFragment in absorber.TextFragments)
{
    textFragment.SetText("REDACTED"); // Replace with redaction text
}
// Save the redacted PDF
pdfDocument.Save("redacted.pdf");
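The SSN pattern can also be checked on its own, independent of any PDF library, before it is handed to a text search. The sketch below uses only System.Text.RegularExpressions; the sample text is hypothetical. It compares the pattern as written above with a variant that adds \b word boundaries, which should stop digits embedded in a longer number from being reported as an SSN.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Pattern as given above, and a stricter variant with word boundaries.
var loose = new Regex(@"\d{3}-\d{2}-\d{4}");
var strict = new Regex(@"\b\d{3}-\d{2}-\d{4}\b");

// Hypothetical sample text: one real-looking SSN and one longer account number.
string text = "SSN: 123-45-6789, account 1234-56-78901";

string[] looseHits = loose.Matches(text).Select(m => m.Value).ToArray();
string[] strictHits = strict.Matches(text).Select(m => m.Value).ToArray();

// The loose pattern also picks up a fragment inside the account number.
Console.WriteLine("Loose:  " + string.Join(", ", looseHits));
// The strict pattern reports only the standalone SSN.
Console.WriteLine("Strict: " + string.Join(", ", strictHits));
```

If the strict pattern matches plain strings here but finds nothing in a particular PDF, the regular expression itself is probably not the cause, and the text extracted from that PDF is the next thing to inspect.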
Additional Considerations:
- Batch Processing: If you’re processing a large volume of PDFs (like 1 million daily), implement batch processing to handle multiple files efficiently.
- Testing: Always test your redaction process on sample documents to ensure that sensitive information is properly redacted and that no unintended data is exposed.
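For a sense of scale, 1 million files per day works out to roughly 11.6 files per second sustained (1,000,000 / 86,400). A minimal sketch of bounded parallel batch processing is shown below. RedactFile is a hypothetical placeholder that only copies the file; in a real pipeline it would hold the load, search, redact and save steps. The temp folders and file names are also assumptions made for the demo, and the right degree of parallelism has to be found by measurement.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Demo setup: a few placeholder files in temp folders (hypothetical paths).
string inputDir = Path.Combine(Path.GetTempPath(), "redact-in-" + Guid.NewGuid().ToString("N"));
string outputDir = Path.Combine(Path.GetTempPath(), "redact-out-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(inputDir);
Directory.CreateDirectory(outputDir);
for (int i = 0; i < 5; i++)
    File.WriteAllText(Path.Combine(inputDir, $"doc{i}.pdf"), "placeholder");

int succeeded = 0, failed = 0;

// Cap the concurrency so CPU and memory use stay predictable.
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

Parallel.ForEach(Directory.EnumerateFiles(inputDir, "*.pdf"), options, inputFile =>
{
    try
    {
        RedactFile(inputFile, Path.Combine(outputDir, Path.GetFileName(inputFile)));
        Interlocked.Increment(ref succeeded);
    }
    catch (Exception e)
    {
        // One bad file should not stop the whole batch.
        Interlocked.Increment(ref failed);
        Console.WriteLine($"Error for file {inputFile}: {e.Message}");
    }
});

Console.WriteLine($"Succeeded: {succeeded}, failed: {failed}");

// Hypothetical placeholder for the per-file redaction routine: it only copies the file.
static void RedactFile(string inputFile, string outputFile)
{
    File.Copy(inputFile, outputFile, overwrite: true);
}
```

Each file is handled independently with its own try/catch, so a failure is logged and counted without interrupting the rest of the batch.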
Next Steps:
- Implement the provided code in your .NET application.
- Test the functionality with various PDF documents to ensure accuracy.
- If you encounter any issues or need further assistance, please reach out for clarification or support.
Feel free to ask if you need more information or help with specific aspects of the implementation!
Okay, using the above snippet I am getting an error that the “SetText” method does not exist.
I am using library version 25.8.0, see below.
image.png (51.4 KB)
@Aniruddha003
You can simply use the TextFragment.Text property to set the text. However, you mentioned in your first post that you are exploring a solution to apply redaction annotations in PDF documents. Can you please share what kind of issues you have been facing while trying to achieve it? Please share the sample code that you have used along with a sample PDF document. We will test the scenario in our environment and address it accordingly.
For the issue, see the previously attached image.
On the other hand, I am now finding the SSN first and then applying redaction. This works for a plain text search, but the SSN pattern is not found when I use the same approach.
We need to find SSNs and redact them, so could you please check and let me know?
Below is the code snippet.
using Aspose.Pdf;
using Aspose.Pdf.Text;
using Aspose.Pdf.Annotations;
using System.Text.RegularExpressions;
StartReduction();

static void StartReduction()
{
    // Verbatim strings so the backslashes in the Windows paths are not treated as escapes
    var inputPath = @"C:\Aniruddha\Temp 2";
    var outputPath = @"C:\Aniruddha\Test_Aspose";
    var filename = "2389757441.pdf";
    try
    {
        // Load the PDF document
        Document pdfDocument = new Document(Path.Combine(inputPath, filename));
        string ssnPattern = @"\b\d{3}-\d{2}-\d{4}\b";
        var ssnRegex = new Regex(ssnPattern, RegexOptions.IgnoreCase);
        bool ssnFound = false;
        // Loop through all pages
        foreach (Page page in pdfDocument.Pages)
        {
            //TextFragmentAbsorber absorber = new TextFragmentAbsorber(ssnRegex, new TextSearchOptions(true));
            TextFragmentAbsorber absorber = new TextFragmentAbsorber("important");
            page.Accept(absorber);
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                RedactionAnnotation redaction = new RedactionAnnotation(page, fragment.Rectangle);
                page.Annotations.Add(redaction);
                ssnFound = true;
            }
        }
        if (ssnFound)
        {
            // Apply redactions and save the document
            pdfDocument.Save(Path.Combine(outputPath, filename));
        }
    }
    catch (Exception e)
    {
        Console.WriteLine($"Error for file {filename}: {e.Message}");
    }
}
@Aniruddha003
Will you please share your sample PDF file for our reference as well? We will test the scenario in our environment and address it accordingly.
I cannot share the original file, but I created a sample. The code does find and redact the SSN in the sample file; however, it is not able to redact the same in our original file.
See the attached sample file.
Sample SSN.pdf (44.4 KB)
@Aniruddha003
Please check the sample code snippet below, which we tested with the latest version of the API. The output generated by the code is also attached for your reference. Please verify in the output that the SSN was redacted.
var document = new Document(dataDir + "Sample SSN.pdf");
string ssnPattern = @"\b\d{3}-\d{2}-\d{4}\b";
var ssnRegex = new Regex(ssnPattern, RegexOptions.IgnoreCase);
// Loop through all pages
foreach (Page page in document.Pages)
{
    TextFragmentAbsorber absorber = new TextFragmentAbsorber(ssnRegex, new TextSearchOptions(true));
    page.Accept(absorber);
    foreach (TextFragment fragment in absorber.TextFragments)
    {
        RedactionAnnotation redaction = new RedactionAnnotation(page, fragment.Rectangle);
        page.Annotations.Add(redaction);
        redaction.Redact();
    }
}
document.Save(dataDir + "redacted_SSN.pdf");
redacted_SSN.pdf (46.4 KB)