In a PDF document, replace the text between two strings with ***

Good morning,


I am working in C#.
I have a PDF document that can be one or multiple pages. I need to anonymize some of the text in it (replace every character that is not a blank space with ‘*’).
  1. I need to anonymize everything between a start string and an end string.
  • The start and end strings can differ from each other.
  • The start and end strings can appear multiple times in a document, and I should treat each pair separately.
  • The start text can be on one page and the end text on another page.
  2. I need to anonymize from a start string until the end of the document.

    Can you advise me on how to implement this with Aspose.Pdf?
    Thanks in advance,
    Ortal
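
The masking rule described in this post (every non-blank character between a start and an end string becomes `*`, each pair handled separately, or masking to the end when no end string is given) can be sketched on a plain string. This is only an illustration of the rule, not Aspose.Pdf code: `TextMasker` and `MaskBetween` are hypothetical names, and applying the result to a PDF would still require the library's own text search and replacement.

```csharp
using System;
using System.Text;

public static class TextMasker
{
    // Hypothetical helper: masks every non-whitespace character that lies
    // strictly between each start/end pair. If endText is null or empty, or
    // no end follows a start, masking runs to the end of the text.
    public static string MaskBetween(string text, string startText, string endText)
    {
        if (string.IsNullOrEmpty(startText))
            throw new ArgumentException("startText must not be empty", nameof(startText));

        var result = new StringBuilder(text);
        int pos = 0;
        while (pos < text.Length)
        {
            int s = text.IndexOf(startText, pos, StringComparison.Ordinal);
            if (s < 0) break;

            int begin = s + startText.Length;
            int stop = string.IsNullOrEmpty(endText)
                ? -1
                : text.IndexOf(endText, begin, StringComparison.Ordinal);
            bool openEnded = stop < 0;
            if (openEnded) stop = text.Length;

            // Keep whitespace so the layout of the text is preserved.
            for (int i = begin; i < stop; i++)
                if (!char.IsWhiteSpace(text[i])) result[i] = '*';

            if (openEnded) break;
            pos = stop + endText.Length; // continue after this pair
        }
        return result.ToString();
    }
}
```

Because the search resumes after each end string, a document containing several pairs has each pair masked on its own, and the start and end strings themselves are left untouched.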

    Hi Ortal,

    Thank you for contacting support. You can replace a text phrase in the whole PDF document. You may also apply checks to identify the start and end strings while replacing a text phrase. Please refer to this help topic: Replace Text in All Pages of PDF Document

    Kindly download and try the latest version, 17.4.0, of the Aspose.Pdf for .NET API. You can get a 30-day temporary license for testing purposes from the purchase portal (recommended). Its option is available in step 4. Please also refer to this help topic: Apply License to Aspose.Pdf for .NET API

    Good morning,

    Thank you for the rapid reply.

    After reading all the documents I have additional questions:
    • How can I find the end position in a PDF document?
    • How can I replace text between start and end position (between my start and end string)?
    Thanks in advance,
    Ortal

    Hi Ortal,

    Thanks for contacting support.

    The best approach to define the bounds for text replacement is to use a regular expression. Please visit the following link for further information: Replace Text Based on a Regular Expression

    Good morning,

    Thank you, “Replace Text Based on a Regular Expression” worked.

    I have an additional question: how can I add a new line in the text I replaced?

    I tried adding Environment.NewLine, but it did not work: textFragment.Text = startText + " *** " + Environment.NewLine + endText;

    Thanks in advance,

    Ortal

    Hi,

    I have another question:
    My regular expression finds only the text between the start string and the end string (or from the start string to the end of the document). However, when I change the text to ***, the start and end texts are deleted as well.

    For example:
    endText = "approved by"
    startText = "approved by"
    regular expression = (?<=approved by)(\w)*((.|(\r\n))*?)[ \t]*(?=approved by)
    Text = " report text approved by ortal approved by"

    The text in textFragment.Text is "ortal".
    But after I set textFragment.Text = "***", the report text becomes: " report text ***".

    My code:

    I have byte[] pdfDocumentByte, string endText and string startText as input.

    // Create a new Aspose.Pdf.Document from the input bytes (MemoryStream requires System.IO)
    Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(new MemoryStream(pdfDocumentByte));
    string regular = string.Empty;
    
    if (string.IsNullOrEmpty(endText))
    {
        // No end text: match everything that follows the start text
        regular = string.Format(@"(?<={0})(\w)*((.|(\r\n))*?).*$", startText);
    }
    else
    {
        // Match the text between the start text and the end text
        regular = string.Format(@"(?<={0})(\w)*((.|(\r\n))*?)[ \t]*(?={1})", startText, endText);
    }
    
    // Create a TextFragmentAbsorber object to find all the phrases matching the regular expression
    Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(regular);
    
    // Set the text search option to specify regular expression usage
    Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
    
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    
    // Accept the absorber for a single page (the first page)
    pdfDocument.Pages[1].Accept(textFragmentAbsorber);
    
    // Get the extracted text fragments
    Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
    {
        // Replace the matched text
        textFragment.Text = " *** ";
    }
    
    

    Can you help me understand why the start and end texts are changed although they are not part of textFragment.Text?

    Thanks,

    Ortal
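
For comparison, here is how lookbehind and lookahead behave in plain .NET `Regex`, outside of any PDF library. The sketch builds a pattern from the start and end strings with `Regex.Escape`, so that characters such as `.` or `(` in the delimiters are taken literally, and shows that the delimiters are not part of the match. `BetweenPattern` and `Build` are hypothetical names, and `[\s\S]` is swapped in for `(.|(\r\n))` as a simpler way to match across line breaks. Whether `TextFragmentAbsorber` handles lookarounds the same way when it rewrites the page is not established in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BetweenPattern
{
    // Hypothetical helper: builds a pattern that matches only the text
    // between startText and endText, excluding the delimiters themselves.
    // With no endText, the pattern matches to the end of the input.
    public static string Build(string startText, string endText)
    {
        string start = Regex.Escape(startText);
        if (string.IsNullOrEmpty(endText))
            return "(?<=" + start + ")[\\s\\S]*";
        return "(?<=" + start + ")[\\s\\S]*?(?=" + Regex.Escape(endText) + ")";
    }
}
```

On the example from this post, `Regex.Replace(" report text approved by ortal approved by", BetweenPattern.Build("approved by", "approved by"), " *** ")` keeps both delimiters and replaces only what lies between them.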

    Hi Ortal,

    Thanks for contacting support.

    I have tested the scenario and managed to reproduce the problem: the new line character is ignored during text replacement. I have logged it as PDFNET-42669 in our issue tracking system so that it can be corrected. We will look further into the details of this problem and keep you posted on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.

    Hi Ortal,

    Thanks for contacting support.

    Can you please share the sample PDF file so that we can test the scenario in our environment? We are sorry for this inconvenience.

    Hi,


    Attached 2 examples.

    Thanks,
    Ortal

    Hi Ortal,


    Thanks for contacting support.

    I am afraid I am unable to find attachments in this thread. Can you please double-check at your end?

    Good morning,


    I attached it as pdf.

    Have a good day,
    Ortal

    Below is a link to my Google Drive:

    https://drive.google.com/file/d/0B57yScOCOODFbDZqQXRNMUNLVHc/view?usp=sharing

    Good afternoon,

    I made a more detailed explanation of my problems (attached zip file and link to Google Drive):

    https://drive.google.com/file/d/0B57yScOCOODFUndSMjV1NTJoSm8/view?usp=sharing

    Inside it there is a Word file for each problem and PDF examples for before and after.
    1. Problem 1: TextState.FontStyle does not affect the found TextFragment. The underline remains after setting textFragment.TextState.FontStyle = Aspose.Pdf.Text.FontStyles.Regular.
    2. Problem 2: When the start text is underlined, changing the text deletes the start text, although it was not part of the regular expression match.
    3. Problem 3: Not all the text found by the regular expression is replaced when setting textFragment.Text.
    Thanks,
    Ortal

    Hi Ortal,


    Thanks for sharing the sample files.

    We are working on testing the scenarios in our environment and will keep you updated with our findings.

    Good morning,
    Do you have any updates?
    Thanks,
    Ortal

    Hi Ortal,

    Thanks for your patience.

    I have tested the scenario using Aspose.Pdf for .NET 17.4.0 with the same code snippets and, as per my observations, the text is not being replaced at all. Please take a look at the following regular expressions, which I have used, and confirm whether we are using them correctly. It also appears that you have been using an older release, so can you please confirm which version of the API you have been using? This information will help us investigate these scenarios further in our environment.

    For your reference, I have also attached the output files generated over my end.

    string regular = @"(?<=<‘viewPPSStudies’)(\w) * ((.| (\r\n))*?).*$";
    string regular2 = @"(?<=< Conclusion :)(\w) * ((.| (\r\n))*?).*$";
    string regular3 = @"(?<={<'RepeatingView2'})(\w)*((.|(\r\n))*?)[ \t]*(?={<'viewPPSStudies' not found})";
    

    Good afternoon,

    I am currently working with 8.4.0.0.

    After replacing the DLL with the latest version and changing a namespace:

    Old:

    Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
    

    New:

    Aspose.Pdf.Text.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(true);
    

    The code compiled.

    The latest version performed worse than 8.4.0.0:

    1. When I used the following regular expression: (?<=<'RepeatingView2')(\w)*((.|(\r\n))*?)[ \t]*(?=<'viewPPSStudies')

    textFragment.Text value was: “not found in STND_REPORT_ITEMS.dwc><'viewPPSProcedures' not found in STND_REPORT_ITEMS.dwc>”

    When I tried to replace the text : textFragment.Text = " *** ";

    I get the following exception:

    “Exception thrown: 'System.IndexOutOfRangeException' in Aspose.Pdf.dll
    
    Additional information: At most 4 elements (for any collection) can be viewed in evaluation mode.”
    
    2. When I used the following regular expression: “(?<=Conclusion)(\w)*((.|(\r\n))*?).*$”

    The text that textFragment.Text found is only until the end of the line and not until the end of the document.

    I would appreciate your help in solving these issues or the problems with version 8.4.0.0.

    Waiting for your rapid reply,

    Ortal

    Good afternoon,

    I would like to change my approach.

    Forget my code. Can you help me write code that:

    1. replaces all the text between

    Start : <'RepeatingView2'

    End: <'viewPPSStudies' not

    To: ***

    2. replaces all the text after Conclusion:

    To: ***

    Attached pdf.

    Thanks,

    Ortal

    Hi Ortal,

    Thanks for sharing the details.

    Since you have been using quite an old release, there have been a lot of changes in the API structure. In order to simplify the approach, the TextOptions namespace was removed.

    “Additional information: At most 4 elements (for any collection) can be viewed in evaluation mode.”

    The issue may be occurring because your current license does not support an upgrade to the latest release. So before you upgrade your subscription, in case you need to test the latest release without any limitations, you may consider requesting a 30-day temporary license. For more information, please visit Get a temporary license.

    When I used the following regular expression: “(?<=Conclusion)(\w)*((.|(\r\n))*?).*$”
    The text that textFragment.Text found is only until the end of the line and not until the end of the document.

    Can you please recheck the scenario after initializing the upgraded license? The issue might be due to trial mode limitations.

    We apologize for your inconvenience.

    Hi Ortal,

    Thank you for the details. Please use the source code as below. We have also attached an output PDF to this reply.

    [.NET, C#]

    using Aspose.Pdf;
    using Aspose.Pdf.Text;

    // Open the document
    Document pdfDocument = new Document(@"C:\Pdf\test26\input.pdf");
    // Start and end strings that bound the text to replace
    String from = "<'RepeatingView2'";
    String till = "<'viewPPSStudies' not";
    // Create a TextFragmentAbsorber that uses a regular expression to find
    // everything from the start string through the end string
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(from + "((.|\n)*)" + till, new TextSearchOptions(true));
    // Accept the absorber for all pages of the document
    pdfDocument.Pages.Accept(textFragmentAbsorber);
    // Get the extracted text fragments into a collection
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    // Loop through the text fragments
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        // The match includes the start and end strings, so write them back around the mask
        textFragment.Text = from + "***" + till;
    }
    // Save the updated document
    pdfDocument.Save(@"C:\Pdf\test26\output.pdf");
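
One point worth checking against the original requirement that each start/end pair be treated separately: in plain .NET `Regex`, the greedy group `((.|\n)*)` runs from the first start string to the last end string, while the lazy form `((.|\n)*?)` stops at the nearest end string. The sketch below only counts matches on an ordinary string to show the difference. `PairCount` and `Count` are hypothetical names, and the thread does not confirm that the absorber's regex search follows the same greedy and lazy semantics.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PairCount
{
    // Hypothetical helper: counts how many start...end spans the pattern
    // finds, using either the greedy or the lazy form of the middle group.
    public static int Count(string text, string startText, string endText, bool lazy)
    {
        string middle = lazy ? "((.|\n)*?)" : "((.|\n)*)";
        string pattern = Regex.Escape(startText) + middle + Regex.Escape(endText);
        return Regex.Matches(text, pattern).Count;
    }
}
```

For the input `"A 1 B x A 2 B"` with start `A` and end `B`, the greedy form finds a single span covering the whole string, so the text between the two pairs would be replaced as well. The lazy form finds the two pairs separately.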