Replace text between two strings in a PDF document with ***

ortasa · April 20, 2017, 6:25am

Good morning,

I am working in C#.

I have a pdf document that can be one or multiple pages. I need to anonymize some of the text in it (replace the chars that are not blank space with ‘*’).

I need to anonymize everything between start and end string.

The start and end string can be different one from the other.
The start and end string can appear multiple times in a document and I should treat each pair separately.
The start and end texts can start at one page and end at the other page.

I also need to be able to anonymize from a start string until the end of the document.

Can you advise me on how to implement this with Aspose.Pdf?

Thanks in advance,

Ortal

imran.rafique · April 20, 2017, 7:00pm

Hi Ortal,

Thank you for contacting support. You can replace a text phrase in the whole PDF document. You may also apply checks to identify the start and end strings while replacing a text phrase. Please refer to this help topic: Replace Text in All Pages of PDF Document

Kindly download and try the latest version, 17.4.0, of Aspose.Pdf for .NET API. You can get a 30-day temporary license for testing purposes from the purchase portal (recommended). Its option is available in step 4. Please also refer to this help topic: Apply License to Aspose.Pdf for .NET API

ortasa · April 22, 2017, 11:52pm

Good morning,

Thank you for the rapid reply.

After reading all the documents I have additional questions:

How can I find the end position in a pdf document?
How can I replace text between start and end position (between my start and end string)?

Thanks in advance,

Ortal

codewarior · April 23, 2017, 12:48pm

Hi Ortal,

Thanks for contacting support.

The best approach to define the bounds for text replacement is to use a regular expression. For further information, please see: Replace Text Based on a Regular Expression
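To illustrate the regular-expression idea on its own, here is a minimal sketch that works on a plain in-memory string with System.Text.RegularExpressions rather than on a PDF. The names `Anonymizer` and `MaskBetween` are hypothetical, and the way Aspose.Pdf's absorber evaluates a similar pattern against page text may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Anonymizer
{
    // Hypothetical helper: replaces every non-whitespace character found
    // between each start/end pair with '*'. Works on a plain string, not a PDF.
    public static string MaskBetween(string text, string start, string end)
    {
        // Lookbehind and lookahead keep the markers themselves out of the match.
        // Regex.Escape makes any metacharacters in the markers literal.
        string pattern = "(?<=" + Regex.Escape(start) + ").*?(?=" + Regex.Escape(end) + ")";

        // Singleline lets '.' cross line breaks, so a pair may span several lines.
        return Regex.Replace(
            text,
            pattern,
            m => Regex.Replace(m.Value, @"\S", "*"),
            RegexOptions.Singleline);
    }
}
```

Blank characters inside the matched region are left untouched, which matches the stated goal of masking only non-blank characters. Note that when the start and end strings are identical, the closing marker of one pair also satisfies the lookbehind for the next match, so consecutive regions may be masked.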

ortasa · April 30, 2017, 4:30am

Good morning,

Thank you, “Replace Text Based on a Regular Expression” worked.

I have an additional question: how can I add a new line in the text I replaced?

I tried adding Environment.NewLine, but it did not work: textFragment.Text = startText + " *** " + Environment.NewLine + endText;

Thanks in advance,

Ortal

ortasa · April 30, 2017, 12:13pm

Hi,

I have another question:
Although my regular expression finds only the text between the start string and the end string (or from the start string to the end of the document), when I change the text to ***, the start and end text are deleted as well.

For example:
endText = “approved by”
startText = “approved by”
regular expression = (?<=“approved by”)(\w)*((.|(\r\n))*?)[ \t]*(?=“approved by”)
Text = " report text approved by ortal approved by"

The text in textFragment.Text is “ortal”.
But after I set textFragment.Text = "***", the report text becomes: " report text ***".

My code:

I have byte[] pdfDocumentByte, string endText and string startText as input.

// Create a new Aspose.Pdf.Document from the input bytes
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(new MemoryStream(pdfDocumentByte));
string regular = string.Empty;

if (string.IsNullOrEmpty(endText))
{
    regular = string.Format(@"(?<={0})(\w)*((.|(\r\n))*?).*$", startText);
}
else
{
    regular = string.Format(@"(?<={0})(\w)*((.|(\r\n))*?)[ \t]*(?={1})", startText, endText);
}

// Create TextAbsorber object to find all the phrases matching the regular expression
Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(regular);

// Set text search option to specify regular expression usage
Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);

textFragmentAbsorber.TextSearchOptions = textSearchOptions;

// Accept the absorber for a single page
pdfDocument.Pages[1].Accept(textFragmentAbsorber);

// Get the extracted text fragments
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
{
    // Update text and other properties
    textFragment.Text = " *** ";
}

Can you help me understand why the start and end text are changed although they are not part of textFragment.Text?

Thanks,

Ortal
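One thing that may be worth checking in the pattern above: `string.Format` inserts `startText` and `endText` verbatim, so characters such as `(`, `.` or `?` in them would be interpreted as regex syntax. Below is a minimal sketch of a pattern builder that escapes both strings, shown with plain .NET `Regex`. The names `PatternBuilder` and `Build` are hypothetical, and the simplified pattern is an assumption for illustration, not the exact expression used in the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternBuilder
{
    // Hypothetical helper: builds a pattern matching the text after startText,
    // up to endText or, when endText is empty, up to the end of the input.
    public static string Build(string startText, string endText)
    {
        string start = Regex.Escape(startText);
        if (string.IsNullOrEmpty(endText))
        {
            // [\s\S]* matches any character, including line breaks, to the end.
            return "(?<=" + start + @")[\s\S]*";
        }
        // The lazy quantifier stops at the first occurrence of endText.
        return "(?<=" + start + @")[\s\S]*?(?=" + Regex.Escape(endText) + ")";
    }
}
```

For example, `Regex.Match(input, PatternBuilder.Build("Total (net):", "End."))` treats the parentheses and the final dot as literal characters rather than as a group and a wildcard.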

codewarior · May 1, 2017, 5:14am

Hi Ortal,

Thanks for contacting support.

I have tested the scenario and managed to reproduce the problem: the new-line character is ignored during text replacement. I have logged it as PDFNET-42669 in our issue tracking system. We will look further into the details and keep you posted on the status of the fix. Please be patient and allow us a little time. We are sorry for the inconvenience.

codewarior · May 1, 2017, 2:56pm

Hi Ortal,

Thanks for contacting support.

Can you please share the sample PDF file so that we can test the scenario in our environment? We are sorry for the inconvenience.

ortasa · May 3, 2017, 6:52am

Hi,

Attached 2 examples.

Thanks,

Ortal

codewarior · May 3, 2017, 2:15pm

Hi Ortal,

Thanks for contacting support.

I am afraid I am unable to find any attachments in this thread. Can you please double-check at your end?

ortasa · May 4, 2017, 2:09am

Good morning,

I attached it as pdf.

Have a good day,

Ortal

ortasa · May 4, 2017, 2:14am

Below is a link to my Google Drive:

https://drive.google.com/file/d/0B57yScOCOODFbDZqQXRNMUNLVHc/view?usp=sharing

ortasa · May 4, 2017, 6:50am

Good afternoon,

I have written a more detailed explanation of my problems (attached as a zip file, also available on Google Drive:

https://drive.google.com/file/d/0B57yScOCOODFUndSMjV1NTJoSm8/view?usp=sharing ).

Inside it there is a Word file for each problem, plus PDF examples from before and after.

Problem 1: TextState.FontStyle does not affect the found TextFragment. The underline remains after setting textFragment.TextState.FontStyle = Aspose.Pdf.Text.FontStyles.Regular.
Problem 2: When the start text is underlined, changing the text deletes the start text although it was not part of the regular expression.
Problem 3: Not all of the text found by the regular expression is replaced when using textFragment.Text =.

Thanks,

Ortal

codewarior · May 4, 2017, 3:05pm

Hi Ortal,

Thanks for sharing the sample files.

We are working on testing the scenarios in our environment and will keep you updated with our findings.

ortasa · May 8, 2017, 1:26am

Good morning,

Do you have any updates?

Thanks,

Ortal

codewarior · May 8, 2017, 12:48pm

Hi Ortal,

Thanks for your patience.

I have tested the scenario using Aspose.Pdf for .NET 17.4.0 with the same code snippets and, as per my observations, the text is not being replaced at all. Please take a look at the following regular expressions which I have used and confirm whether we are using them correctly. It also appears that you have been using an older release, so can you please confirm which version of the API you are using? This information will help us investigate these scenarios further in our environment.

For your reference, I have also attached the output files generated at my end.

string regular = @"(?<=<‘viewPPSStudies’)(\w) * ((.| (\r\n))*?).*$";
string regular2 = @"(?<=< Conclusion :)(\w) * ((.| (\r\n))*?).*$";
string regular3 = @"(?<={<'RepeatingView2'})(\w)*((.|(\r\n))*?)[ \t]*(?={<'viewPPSStudies' not found})";

ortasa · May 9, 2017, 7:49am

Good afternoon,

I am currently working with 8.4.0.0.

After replacing the DLL with the latest version and changing a namespace:

Old:

Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);

New:

Aspose.Pdf.Text.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(true);

The code compiled.

The latest version performed worse than 8.4.0.0:

When I used the following regular expression: (?<=<'RepeatingView2')(\w)*((.|(\r\n))*?)[ \t]*(?=<'viewPPSStudies')

textFragment.Text value was: “not found in STND_REPORT_ITEMS.dwc><'viewPPSProcedures' not found in STND_REPORT_ITEMS.dwc>”

When I tried to replace the text with textFragment.Text = " *** ";

I got the following exception:

“Exception thrown: 'System.IndexOutOfRangeException' in Aspose.Pdf.dll

Additional information: At most 4 elements (for any collection) can be viewed in evaluation mode.”

When I used the following regular expression: “(?<=Conclusion)(\w)*((.|(\r\n))*?).*$”

The text found in textFragment.Text runs only to the end of the line, not to the end of the document.

I would appreciate your help in solving these issues, or the problems with version 8.4.0.0.

Waiting for your reply,

Ortal
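For context on why a match can stop at the end of a line: in plain .NET regular expressions, `.` does not match `\n` unless `RegexOptions.Singleline` is set. The sketch below shows the difference on an in-memory string. Whether Aspose.Pdf's absorber exposes or honors this option is not confirmed here, so treat it as background only. `DotDemo` and `After` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Hypothetical helper: returns everything that follows the marker,
    // as matched by ".*" under the given regex options.
    public static string After(string text, string marker, RegexOptions options)
    {
        string pattern = "(?<=" + Regex.Escape(marker) + ").*";
        return Regex.Match(text, pattern, options).Value;
    }
}
```

With `RegexOptions.None` the call returns only the rest of the line that contains the marker; with `RegexOptions.Singleline` it returns the rest of the input, line breaks included.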

ortasa · May 9, 2017, 8:55am

Good afternoon,

I would like to change approach.

Forget my code. Can you help me write code that:

1. replace all the text between

Start : <'RepeatingView2'

End: <'viewPPSStudies' not

To: ***

2. replace all the text after Conclusion:

To: ***

Attached pdf.

Thanks,

Ortal

codewarior · May 9, 2017, 4:02pm

Hi Ortal,

Thanks for sharing the details.

Since you have been using quite an old release, there have been many changes in the API structure. To simplify the approach, the TextOptions namespace was removed.

“Additional information: At most 4 elements (for any collection) can be viewed in evaluation mode.”

The issue may be occurring because your current license does not support upgrading to the latest release. So before you upgrade your subscription, if you need to test the latest release without any limitations, you may consider requesting a 30-day temporary license. For more information, please visit Get a temporary license.

When I used the following regular expression: “(?<=Conclusion)(\w)*((.|(\r\n))*?).*$”
The text found in textFragment.Text runs only to the end of the line, not to the end of the document.

Can you please recheck the scenario after initializing the upgraded license? The issue might be due to trial mode limitations.

We apologize for the inconvenience.

imran.rafique · May 9, 2017, 4:58pm

Hi Ortal,

Thank you for the details. Please use the source code as below. We have also attached an output PDF to this reply.

[.NET, C#]

using Aspose.Pdf;
using Aspose.Pdf.Text;

// Open the document
Document pdfDocument = new Document(@"C:\Pdf\test26\input.pdf");
// Start and end strings that bound the text to replace
String from = "<'RepeatingView2'";
String till = "<'viewPPSStudies' not";
// Create a TextFragmentAbsorber that searches with a regular expression (TextSearchOptions(true))
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(from + "((.|\n)*)" + till, new TextSearchOptions(true));
// Accept the absorber for all pages of the document
pdfDocument.Pages.Accept(textFragmentAbsorber);
// Get the extracted text fragments into a collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// Loop through the text fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    // The match includes the start and end strings, so write them back around the mask
    textFragment.Text = from + "***" + till;
}
// Save the modified document
pdfDocument.Save(@"C:\Pdf\test26\output.pdf");