TextFragmentAbsorber does not match the regex when the tag contains a colon (:)

Hi Team,
I’m trying to remove tags from a document using the attached code. After adding a colon to the tag (shown below) and modifying the regex as shown below, the TextFragment is no longer returned.

I’ve tested with the following files:
Scenario 1: I upload a DOC file, convert it to PDF, and then try to retrieve the tag from that PDF using the TextFragmentAbsorber. The tag is not found.

Scenario 2: I convert an HTML file containing the tag to PDF and then retrieve the tag from that PDF. The tag is found.

foreach (TextFragment textFragment in textFragmentAbsorberNewTags.TextFragments)

(?i)<<\d{1,2}[,][a-zA-Z1-9 "”“,‘‘’`æøå.:]*[,][a-zA-Z0-9 ]*[,][ ]*\d{1,2}[Oo, ]*>>

<<1,” Please enter your name: ”, AN25,1>>

removesignofftag.zip (51.3 KB)

@NPSwaroop

It sounds like you’re having trouble with the regex pattern when using the TextFragmentAbsorber to extract tags that include a colon. Here’s a structured approach to resolve the issue:

Cause: The regex pattern may not be correctly accounting for the colon, or the text extraction process might differ between document types.

Solution:

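  1. Regex Pattern Adjustment: Make sure the character class that covers the prompt text includes the colon. A colon is not a regex metacharacter, so it does not need escaping. Each character class also needs a quantifier such as * so that it can match more than one character. For example: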

    (?i)<<\d{1,2}[,][a-zA-Z0-9 \"”“,‘‘’`æøå.:]*[,][a-zA-Z0-9 ]*[,][ ]*\d{1,2}[Oo, ]*>>
    
  2. TextSearchOptions: Ensure that the TextSearchOptions for the TextFragmentAbsorber is set to recognize that you’re using a regular expression. Set the IsRegularExpressionUsed property to true.

  3. Testing with Different Document Types: Since the extraction works with HTML files but not with DOC files, verify that the DOC file is being converted to PDF correctly and that the text is preserved.

  4. Debugging: Implement logging to capture the text fragments processed by the absorber. This will help determine if the issue lies with the regex or the text extraction.
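To separate the two possible causes, the pattern can be checked against the sample tags with plain .NET Regex, outside Aspose.PDF. The following is a minimal sketch under that assumption: TagPattern and ContainsTag are illustrative names, and the two sample strings are the tags quoted in this thread. If both checks succeed, the pattern itself handles the colon, and the difference more likely lies in the text produced by the DOC-to-PDF conversion.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the reply above, written as a verbatim string so backslashes need no doubling
// (inside a verbatim string, "" stands for one literal double quote).
const string TagPattern =
    @"(?i)<<\d{1,2}[,][a-zA-Z0-9 ""”“,‘‘’`æøå.:]*[,][a-zA-Z0-9 ]*[,][ ]*\d{1,2}[Oo, ]*>>";

// Illustrative helper: true when the pattern is found anywhere in the given text.
bool ContainsTag(string text) => Regex.IsMatch(text, TagPattern);

// Sample tags copied from this thread.
string withoutColon = "<<1,” Please enter your name ”, AN25,1>>";
string withColon = "<<1,” Please enter your name: ”, AN25,1>>";

Console.WriteLine(ContainsTag(withoutColon)); // True
Console.WriteLine(ContainsTag(withColon));    // True
```

This only exercises the regular expression. It says nothing about how the absorber splits or normalizes text, so a match here does not guarantee a match in the PDF.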

Next Steps:

  • Implement the suggested regex adjustments and ensure the TextSearchOptions are correctly configured.
  • Test the extraction with both HTML and DOC files to compare results.
  • Add logging to monitor the text fragments being captured.

Here’s a minimal code snippet to illustrate how to set up the TextFragmentAbsorber with the adjusted regex:

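// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
// The trailing [Oo, ]* is optional, as in the pattern above, so tags ending in "1>>" can match
var absorber = new TextFragmentAbsorber("(?i)<<\\d{1,2}[,][a-zA-Z0-9 \"”“,‘‘’`æøå.:]*[,][a-zA-Z0-9 ]*[,][ ]*\\d{1,2}[Oo, ]*>>");
// Treat the search phrase as a regular expression
absorber.TextSearchOptions.IsRegularExpressionUsed = true;

// Run the absorber over all pages ("document" is an already loaded Aspose.Pdf.Document)
document.Pages.Accept(absorber);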

// Process the extracted text fragments
foreach (TextFragment textFragment in absorber.TextFragments)
{
    Console.WriteLine(textFragment.Text);
}
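If the absorber still returns nothing for the DOC-based PDF, it may help to log which characters the extracted text really contains. One possibility, not confirmed in this thread, is that the conversion emits a look-alike character (for example a full-width colon or a non-breaking space) that the character class does not list. The sketch below uses a hypothetical helper, DescribeNonAscii, and the string literals are stand-ins for text extracted from the PDF.

```csharp
using System;
using System.Linq;

// Hypothetical helper: lists every non-ASCII character in the text with its code point.
string DescribeNonAscii(string text) =>
    string.Join(" ", text.Where(c => c > 127).Select(c => $"'{c}'=U+{(int)c:X4}"));

// Stand-in strings; in practice, pass the text extracted from the converted PDF.
Console.WriteLine(DescribeNonAscii("name: ”"));  // ASCII colon: only the curly quote is listed
Console.WriteLine(DescribeNonAscii("name： ”")); // a full-width colon (U+FF1A) is listed as well
```

Any code point that shows up here but is missing from the character class would explain why the match fails for one file and not the other.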

Try these adjustments and let me know if you need further assistance!

I’ve tried the suggested code, but the issue remains.

With the tag below in the PDF, I do get the TextFragment:

<<1,” Please enter your name ”, AN25,1>>

With the tag below in the PDF, no TextFragment is returned:

<<1,” Please enter your name:”, AN25,1>>

The issue got resolved, thanks.
