Editing the pdf file [masking some text in the existing pdf file]

pdftest · March 3, 2009, 4:08am

I have a PDF file that contains account details along with credit card numbers. I need to replace the credit card number with "X", that is, mask the credit card number without altering the PDF file format. Is it possible to do this?

I had some conversation on live chat and got the following response:

PdfContentEditor editor = new PdfContentEditor();
editor.BindPdf(inputPath + "text.pdf");
editor.ReplaceText("Pdf", "WordPpt");
editor.Save(outputPath + "replace.pdf");

With the above code we find the particular text and replace it.

But the scenario is:

The PDF will contain text like this:
Credit Card Number: 1234 5678 0987 6754

and we have to replace it as follows:
Credit Card Number: xxxx xxxx xxxx xxxx

Can we search the PDF for the text "Credit Card Number:" and, when it is found, replace the text that follows with X's as above? If so, could you please guide me with a code sample?

The links I referred to are:

http://www.aspose.com/documentation/file-format-components/aspose.pdf.kit-for-.net-and-java/aspose.pdf.kit.pdfcontenteditor.replacetext.html

http://www.aspose.com/documentation/file-format-components/aspose.pdf.kit-for-.net-and-java/extract-text-from-pdf-document.html

shahzadlatif · March 3, 2009, 6:34am

Hi,

Thank you very much for considering Aspose.

I'm looking into your requirement and will update you as soon as possible.

Regards,

codewarior · March 3, 2009, 6:49am

Hello Nilesh,

Thanks for considering Aspose.
We have a component named Aspose.Pdf.Kit, which provides the capability to edit and manipulate existing PDF documents. As per your requirement, you can use the ReplaceText method of the PdfContentEditor class, for example to replace "1234 5678 0987 6754" with "xxxx xxxx xxxx xxxx". If you are using Aspose.Pdf.Kit for Java, it also provides a class named PdfSearcher, which can search for text within a particular region, but it only returns the location of the text string. To replace the text you still need to use the PdfContentEditor class. PdfSearcher is only present in the Java version.

Please try the ReplaceText method, and in case you face any problem, please share the PDF file. I've marked this forum thread as private so that no one other than Aspose staff can access the file.

For more information on searching text in a PDF, please see: PdfSearcher

FYI, ReplaceText and PdfSearcher are Beta features. They may not be well supported for some PDF files, and we may not be able to fix problems in a short time.

pdftest · March 3, 2009, 7:28am

No, that is not quite it. I will have a PDF file with data like this:

Name: ABC Kumar

Account Jan - Feb

....

Credit Card Number: 1234 5678 1919 1919

...

There will be a lot of PDF files coming in, and we will certainly not know the credit card numbers in advance. So what we want is to search the PDF for the prefix "Credit Card Number" and, when it is found, replace the digits that follow it with xxxx xxxx xxxx xxxx.

Is this possible?

pdftest · March 3, 2009, 7:33am

alternatively, you can reach me at 91 9916052500

shahzadlatif · March 3, 2009, 7:55am

Hi,

I have been looking into this issue, and would like to add something to Nayyer's response.

According to your requirement, you want to replace a dynamic string with a specified string of characters. In fact, this would not be possible by using ReplaceText feature; nevertheless, you can try a workaround if that helps:

1. Get the text from a particular page into a string variable using the ExtractText method of the PdfExtractor class

2. Find the specific pattern in that string programmatically. You might use a regular expression or something similar

3. Replace that text with your required string 'XXXX XXXX XXXX XXXX'

4. Delete the older page

5. Insert new page at this old position

This is a little overhead, but I hope this might help. If you have any further questions, please do let us know.
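Steps 2 and 3 above can be sketched with plain .NET regular expressions, independent of any PDF library. This is a minimal sketch, not Aspose code: the class name CardMasker and the exact pattern are assumptions made for illustration. It assumes the label reads "Credit Card Number:" and that the number is four groups of four digits, optionally separated by spaces or hyphens.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Aspose.Pdf.Kit.
public static class CardMasker
{
    // Label, then four groups of four digits, optionally separated
    // by a single space or hyphen (assumed layout of the number).
    private static readonly Regex CardPattern = new Regex(
        @"(?<label>Credit Card Number:\s*)(?<number>\d{4}(?:[ -]?\d{4}){3})");

    // Returns the text with every digit of each matched card number
    // replaced by 'X'; the label and the separators are kept as-is.
    public static string Mask(string text)
    {
        return CardPattern.Replace(text, m =>
            m.Groups["label"].Value +
            Regex.Replace(m.Groups["number"].Value, @"\d", "X"));
    }
}
```

For example, CardMasker.Mask("Credit Card Number: 1234 5678 0987 6754") gives "Credit Card Number: XXXX XXXX XXXX XXXX". Text that does not contain the label is returned unchanged.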

Regards,

pdftest · March 4, 2009, 1:25pm

We have been having a tough time reaching a solution. Let me explain what we have been trying.

We used the ExtractText method and then called the GetText method, passing it a stream object, for which we created a FileStream. We then want to search for the credit card pattern in that FileStream, but we were not able to retrieve the text from it: it reports end of file every time we call ReadLine() on it.

We would be really grateful if you could share a code snippet that matches our requirement. Once we have the results, we will be looking to acquire a license for your product.

Also, I was unable to understand the purpose of points 4 and 5 in the above post.

The code that we are using is:

//Instantiate PdfExtractor object

PdfExtractor extractor = new PdfExtractor();

//Set Password for input PDF file

extractor.Password = "";

//Bind the input PDF document to extractor

extractor.BindPdf("C:\\pdftest\\WebApplication1\\" + "prod_eob.pdf");

//Extract text from the input PDF document

extractor.ExtractText();

string path= "C:\\pdftest\\WebApplication1\\" + "prod_eob1.txt";

FileStream f = new FileStream(path, FileMode.Create);

extractor.GetText(f);

StreamReader reader = new StreamReader(f);

string mainReportText = reader.ReadToEnd();

mainReportText = mainReportText.Trim();

In the above code, reader.ReadToEnd() does not return anything.

shahzadlatif · March 4, 2009, 10:25pm

Hi,

I'm working on this issue. Please spare us some time so we could resolve it.

We appreciate your patience.

Regards,

shahzadlatif · March 4, 2009, 11:11pm

Hi,

Please add f.Seek(0,0) after the extractor.GetText(f) line as given below:

extractor.GetText(f);

f.Seek(0,0);

It works at my end; I hope it'll work at your end too.
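The likely reason this helps: after data has been written to a stream, its position sits at the end, so a reader attached to the same stream finds nothing left to read until the position is moved back to the start. The sketch below shows the effect with a MemoryStream in place of the FileStream; the class and method names are hypothetical and no Aspose call is involved. SeekOrigin.Begin is the named form of the second 0 in f.Seek(0,0).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; illustrates stream position only.
public static class StreamRewindDemo
{
    // Writes text to a stream, optionally rewinds it, then reads it back.
    public static string WriteThenRead(string text, bool rewind)
    {
        using (var stream = new MemoryStream())
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            stream.Write(bytes, 0, bytes.Length);   // position is now at the end

            if (rewind)
            {
                stream.Seek(0, SeekOrigin.Begin);   // move back to the first byte
            }

            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                return reader.ReadToEnd();          // reads from the current position
            }
        }
    }
}
```

WriteThenRead("abc", false) returns an empty string, while WriteThenRead("abc", true) returns "abc".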

If you still find any issue please do let us know.

Regards,

pdftest · March 5, 2009, 12:49am

I get the error "Exception of type 'System.OutOfMemoryException' was thrown." when I try to process a PDF of size 9 MB.

I also tested with a smaller PDF [147 KB], and it works for that.

The code is below. [For simple testing I created a regular expression to match the dd/dd/dd string pattern in the PDF.]

//Instantiate PdfExtractor object

PdfExtractor extractor = new PdfExtractor();

//Set Password for input PDF file

extractor.Password = "";

//Bind the input PDF document to extractor

extractor.BindPdf("C:\\pdftest\\WebApplication1\\" + "Test.pdf");

//Extract text from the input PDF document

extractor.ExtractText();

string path = "C:\\pdftest\\WebApplication1\\" + "Test.txt";

FileStream f = new FileStream(path, FileMode.Create);

extractor.GetText(f);

f.Seek(0, 0);

StreamReader reader = new StreamReader(f);

string mainReportText = reader.ReadToEnd();

mainReportText = mainReportText.Trim();

f.Close();

string pattern = @"((\d{2})/(\d{2})/(\d{2}))";

//[-+]((0[0-9]|1[0-3]):([03]0|45)|14:00)

//@"^(\d{2}/)(\d{2}/)(\d{2}/)$";

//@"^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7][\d\s-]{15}$";

Regex match = new Regex(pattern);

//return match.IsMatch(num);

Match m = match.Match(mainReportText);

Response.Write(m.Value);

//Save the extracted text to a text file

//extractor.GetText("C:\\pdftest\\WebApplication1\\" + "prod_eob.txt");

PdfContentEditor editor = new PdfContentEditor();

editor.BindPdf("C:\\pdftest\\WebApplication1\\" + "Test.pdf");

editor.ReplaceText(m.Value, "xx/xx/xx");

editor.Save("C:\\pdftest\\WebApplication1\\" + "replace2.pdf");

Also, the code below does not work. It gives the error "file .pdf is being used by another process". This is because I am trying to replace multiple texts in a loop.

PdfContentEditor editor = new PdfContentEditor();

editor.BindPdf("C:\\pdftest\\WebApplication1\\" + "new1.pdf");

foreach (Match mat in match3.Matches(mainReportText))

{

editor.ReplaceText(mat.Value.Substring(mat.Value.IndexOf('-') + 1, 4), "XXXX");

}

But if I replace a single text as below, it works:

PdfContentEditor editor = new PdfContentEditor();

editor.BindPdf("C:\\pdftest\\WebApplication1\\" + "new1.pdf");

Match mat = match3.Match(mainReportText);

editor.ReplaceText(mat.Value.Substring(mat.Value.IndexOf('-') + 1, 4), "XXXX");

pdftest · March 5, 2009, 4:25am

Here is the complete code. See the comments that say which block is working and which is not.

//Instantiate PdfExtractor object

PdfExtractor extractor = new PdfExtractor();

//Set Password for input PDF file

extractor.Password = "";

//Bind the input PDF document to extractor

extractor.BindPdf("C:\\pdftest\\WebApplication1\\" + "new1.pdf");

//Extract text from the input PDF document

extractor.ExtractText();

string path = "C:\\pdftest\\WebApplication1\\" + "Test2.txt";

FileStream f = new FileStream(path, FileMode.Create);

extractor.GetText(f);

f.Seek(0, 0);

StreamReader reader = new StreamReader(f);

string mainReportText = reader.ReadToEnd();

reader.Close();

f.Close();

f.Dispose();

string pattern3 = @"CO\s*((\d{5})-(\d{4}))";

Regex match3 = new Regex(pattern3);

PdfContentEditor editor = new PdfContentEditor();

editor.BindPdf("C:\\pdftest\\WebApplication1\\" + "new1.pdf");

//This works for replacing single match

Match mat = match3.Match(mainReportText);

editor.ReplaceText(mat.Value.Substring(mat.Value.IndexOf('-') + 1, 4), "XXXX");

//This does not work for replacing multiple matches

//foreach (Match mat in match3.Matches(mainReportText))

//{

// editor.ReplaceText(mat.Value.Substring(mat.Value.IndexOf('-') + 1, 4), "XXXX");

//}

editor.Save("C:\\pdftest\\WebApplication1\\" + "new.pdf");

shahzadlatif · March 5, 2009, 4:30am

Hi,

Thank you very much for sharing the code with us. I'm looking into the issue and will update you as soon as possible.

Regards,

pdftest · March 5, 2009, 5:13am

Also, I have attached a sample PDF. This is the actual PDF file, and I am unable to read the data from it. When I use the earlier code, it reads data only from the first page, whereas I actually need to read the data on the 7th page, where the credit card number is stored.

shahzadlatif · March 6, 2009, 1:51am

Hi,

After further investigation, I have found that you are facing three issues:

1. Couldn't read text from page number 7

Please try adding the following two lines before extracting the text. It works fine at my end.

extractor.StartPage = 7;
extractor.EndPage = 7;

2. File is in use error

In order to troubleshoot this issue, I'll need the file containing the credit card number. The file shared doesn't contain any credit card number, so I'm unable to reproduce the issue at my end.

3. Out of memory exception

To reproduce this issue at my end, please share the same 9 MB file you're having the problem with.

We really appreciate your patience and cooperation.

Regards,

pdftest · March 6, 2009, 3:24am

For point #1:

I used the code below, but the only text that ends up in the mainReportText variable is:

Warning:This is the evaluation version of Aspose.Pdf.Kit. Just one line has been extracted. Please purchase your license to extract text correctly.Extracted with Aspose.Pdf.Kit Copyright 2002-2008 Aspose Pty Ltd.

SOLO Payment Options

The code is:

//Instantiate PdfExtractor object

PdfExtractor extractor = new PdfExtractor();

//Set Password for input PDF file

extractor.Password = "";

//Bind the input PDF document to extractor

extractor.BindPdf("C:\\pdftest\\WebApplication1\\" + "Sample.pdf");

extractor.StartPage = 7;

extractor.EndPage = 7;

//Extract text from the input PDF document

extractor.ExtractText();

string path = "C:\\pdftest\\WebApplication1\\" + "Sample.txt";

FileStream f = new FileStream(path, FileMode.Create);

extractor.GetText(f);

f.Seek(0, 0);

StreamReader reader = new StreamReader(f);

string mainReportText = reader.ReadToEnd();

reader.Close();

f.Close();

f.Dispose();

string pattern3 = @"Credit Card Number:\s*((\d{16}))";//@"CO\s*((\d{5})-(\d{4}))";//@"CO\s*((\d{5})-(\d{4}))";

Regex match3 = new Regex(pattern3);

PdfContentEditor editor = new PdfContentEditor();

editor.BindPdf("C:\\pdftest\\WebApplication1\\" + "Sample.pdf");

//This works for replacing single match

Match mat = match3.Match(mainReportText);

editor.ReplaceText(mat.Value.Substring(mat.Value.IndexOf(':') + 1, 16), "XXXXXXXXXXXXXXXX");

//This does not work for replacing multiple matches

//foreach (Match mat in match3.Matches(mainReportText))

//{

// editor.ReplaceText(mat.Value.Substring(mat.Value.IndexOf('-') + 1, 4), "XXXX");

//}

editor.Save("C:\\pdftest\\WebApplication1\\" + "Sample1.pdf");
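One detail in the snippet above is worth checking regardless of the license issue. Substring(IndexOf(':') + 1, 16) starts counting right after the colon, so the result begins with the whitespace and drops the last digit(s); and when the regex finds nothing, mat.Value is an empty string and the Substring call throws. A hedged alternative is to read the capture group and test Match.Success first. The sketch below uses only System.Text.RegularExpressions; the class name CardNumberFinder is hypothetical, and the pattern is the one from the post, reduced to a single capture group.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Aspose.Pdf.Kit.
public static class CardNumberFinder
{
    // Returns the 16 digits that follow the label,
    // or null when the text contains no such match.
    public static string FindCardNumber(string text)
    {
        Match m = Regex.Match(text, @"Credit Card Number:\s*(\d{16})");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The value returned by FindCardNumber could then be passed to editor.ReplaceText, and the call skipped when the result is null.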

My objective: on page 7 of Sample.pdf, towards the end of the page, there is the text "Credit Card Number: _________________". I want to display it as "Credit Card Number: XXXXXXXXXXXXXXX".

Can you please look into this. If we are unable to resolve this today, the client might close off the project. Please respond ASAP. Please let me know if you need any inputs from my end.

shahzadlatif · March 6, 2009, 4:12am

Hi,

You're getting this message because this is an evaluation version. If you want to test without any restrictions then you'll have to get a temporary license for 30 days. Please follow the link to get a temporary license.

I hope once you have a license to evaluate without restrictions you'll be able to achieve your objective.

If you still find any issue please do let us know.

Regards,

pdftest · March 6, 2009, 4:39am

That's true, but we are also getting another message, which means it reads only some of the information from the PDF. Please look carefully at the message below, shown in red.

Warning:This is the evaluation version of Aspose.Pdf.Kit. Just one line has been extracted. Please purchase your license to extract text correctly.Extracted with Aspose.Pdf.Kit Copyright 2002-2008 Aspose Pty Ltd.

SOLO Payment Options

pdftest · March 6, 2009, 4:51am

Since we have time constraint, can you please do the following.

In the sample.pdf, in page 7 towards the end of page there is text like:

Credit Card Number: _________________

I want to display it as:

Credit Card Number: XXXXXXXXXXXXXXX

Can you please modify the code in my earlier post to edit the PDF this way? Please respond ASAP, and let me know if you need any input from my end.

shahzadlatif · March 6, 2009, 4:52am

Hi,

As I can see from the provided PDF, this is the 'one line' being extracted from page 7; the text in black and bold is the warning message. Once you have a temporary or permanent license, you'll have the complete functionality.

Please correct me if we're not on the same page.

Regards,