RegEx pattern matching in TextFragmentAbsorber

gregglinnell · April 5, 2018, 5:09pm

I am trying to use TextFragmentAbsorber to identify elements of text on a PDF page that I need to work with.

Each element is surrounded by {S} and {E}.

An example is {S}Section 1{E} - {S}First Section Title{E}

Using TextFragmentAbsorber with a regular expression, I want to pull back a list of the 2 items, i.e.
{S}Section 1{E}
{S}First Section Title{E}

I have set my search string to: {S}(.*){E}

However it returns 1 item, which is the whole row, i.e.
{S}Section 1{E} - {S}First Section Title{E}

What do I need to set my search string to, so that it pulls back the 2 items and not the whole string?

Thanks in advance for your help.
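
For illustration, the behaviour described above can be reproduced with plain System.Text.RegularExpressions, outside Aspose.PDF. The sketch below is a minimal example, assuming TextFragmentAbsorber follows ordinary .NET quantifier semantics; the class and method names are hypothetical, and the braces are escaped here only to keep the pattern unambiguous.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class TagRegexDemo
{
    // Returns the text of every match of the pattern in the input, in order.
    public static List<string> FindAll(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }
}

// Usage, with the row from the question:
//   string row = "{S}Section 1{E} - {S}First Section Title{E}";
//   TagRegexDemo.FindAll(row, @"\{S\}(.*)\{E\}");   // greedy: 1 match, the whole row
//   TagRegexDemo.FindAll(row, @"\{S\}(.*?)\{E\}");  // lazy: 2 matches, one per tag pair
```

The greedy `.*` runs to the last {E} on the row, while the lazy `.*?` stops at the first {E} it can reach, which is why only the lazy form yields the two separate items.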

Farhan.Raza · April 5, 2018, 8:38pm

@gregglinnell

Thank you for contacting support.

Please share a narrowed-down code snippet along with the source PDF file so that we can investigate the issue in our environment.

imran.rafique · April 5, 2018, 9:20pm

@gregglinnell,

In addition to the above reply, please try this regular expression: "{S}\w*\s?\w*\s?\w*{E}". If this does not help, please send us your source PDF document.

gregglinnell · April 6, 2018, 9:05am

I tried that, but it doesn’t find any matches. Please see attached PDF example.

Code I have tried is as below:
Document doc = new Document(path);

// Create a TextFragmentAbsorber to find all matches of the search pattern
//TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("{S}(.*){E}");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("{S}\\w*\\s?\\w*\\s?\\w*{E}");

// Treat the search string as a regular expression
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;

// Accept the absorber for all the pages
doc.Pages.Accept(textFragmentAbsorber);

// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (TextFragment text in textFragmentCollection)
{
    // Split the fragment text on character code 1 (U+0001)
    string[] items = text.Text.Split((char)1);

    if (items.Length == 4)
    {
        // items[1] is the target page number, items[2] is the visible text
        text.Text = items[2];
        LinkAnnotation annotation = new LinkAnnotation(text.Page, text.Rectangle);
        annotation.Border = new Border(annotation);
        annotation.Border.Width = 0;
        annotation.Action = new GoToAction(new XYZExplicitDestination(Convert.ToInt32(items[1]), 0, 0, 0));
        text.Page.Annotations.Add(annotation);
    }
}

// Save
doc.Save(path);

gregglinnell · April 6, 2018, 9:06am

636586054611670463.pdf (27.4 KB)

imran.rafique · April 6, 2018, 7:14pm

@gregglinnell,

There is a problem recognizing the character displayed as a box with a question mark when using regular expressions. To address this issue, ticket ID PDFNET-44491 has been logged in our issue tracking system. We have linked your post to this ticket and will keep you informed of any updates.

The regular expression "{S}\s([\/\w]\s[\/\w])?([\/\w]\s[\/\w]\s[\/\w])?[\/\w]?\s{E}" can retrieve all 7 matching text strings if the character displayed as a box with a question mark is a white space.

gregglinnell · April 9, 2018, 5:01pm

I have made a slight change to the tags, and unfortunately your suggestion does not work.

If I use “{S}.+?{E}” then it does work, but not if the tags go over a line break.

See attached document. It should match 19 items, but only matches 18.

gregglinnell · April 9, 2018, 5:01pm

test3.pdf (28.7 KB)

imran.rafique · April 9, 2018, 9:44pm

@gregglinnell,

Please try this regular expression: “{S}(.|\n)+?{E}”. It matches 19 items.
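
The reason the alternation helps can be shown in plain .NET, independent of Aspose.PDF: by default `.` matches any character except `\n`, so `.+?` cannot cross a line break, whereas `(.|\n)+?` can. The sketch below is a minimal example with a hypothetical helper and made-up sample text; the RegexOptions.Singleline variant is standard .NET behaviour, and whether TextFragmentAbsorber lets you pass such options is not covered in this thread.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class LineBreakDemo
{
    // Counts how many times the pattern matches the input.
    public static int CountMatches(string input, string pattern,
                                   RegexOptions options = RegexOptions.None)
    {
        return Regex.Matches(input, pattern, options).Count;
    }
}

// Usage, with made-up text where the second tag pair wraps over a line break:
//   string text = "{S}1|Short{E} {S}2|Wraps over\na line{E}";
//   LineBreakDemo.CountMatches(text, @"\{S\}.+?\{E\}");                           // 1
//   LineBreakDemo.CountMatches(text, @"\{S\}(.|\n)+?\{E\}");                      // 2
//   LineBreakDemo.CountMatches(text, @"\{S\}.+?\{E\}", RegexOptions.Singleline);  // 2
```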

gregglinnell · April 10, 2018, 8:32am

Thanks Imran. That does now find all 19 items.

However, it has thrown up another issue: when I remove the tags, the third line of the table gets messed up.

My code is now:

        string dataDir = "C:\\Bundling\\";
        Document doc = new Document(dataDir + "test3.pdf");

        // Create a TextFragmentAbsorber to find all matches of the search pattern
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"{S}(.|\n)+?{E}");

        textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;

        // Accept the absorber for all the pages
        doc.Pages.Accept(textFragmentAbsorber);

        // Get the extracted text fragments
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

        foreach (TextFragment text in textFragmentCollection)
        {
            string[] items = text.Text.Split('|');

            if (items.Length == 2)
            {
                items[0] = items[0].Replace("{S}", "");
                items[1] = items[1].Replace("{E}", "");

                text.Text = items[1];

                if (text.Text != "")
                {
                    LinkAnnotation annotation = new LinkAnnotation(text.Page, text.Rectangle);
                    annotation.Border = new Border(annotation);
                    annotation.Border.Width = 0;
                    annotation.Action = new GoToAction(new XYZExplicitDestination(Convert.ToInt32(items[0]), 0, 0, 0));
                    text.Page.Annotations.Add(annotation);
                }
            }
        }

        // Save
        doc.Save(dataDir + "Output.pdf");

I will attach what Output.pdf now looks like.

Thanks for your help with this.

Regards

Gregg
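
As a side note on the tag handling in the code above, the page number and title can be pulled out of a fragment without relying on Convert.ToInt32, which throws on non-numeric input. The sketch below is a minimal example, assuming the tag layout implied by that code ("{S}", a page number, "|", a title, "{E}"); the class and method names are hypothetical, and unlike the Split('|') check it allows a "|" inside the title.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class TagPayload
{
    // Parses "{S}<page>|<title>{E}". Returns false if the text does not have
    // that shape, the page is not an integer, or the title is empty.
    public static bool TryParse(string fragmentText, out int page, out string title)
    {
        page = 0;
        title = string.Empty;
        const string start = "{S}";
        const string end = "{E}";

        if (fragmentText == null ||
            fragmentText.Length < start.Length + end.Length ||
            !fragmentText.StartsWith(start, StringComparison.Ordinal) ||
            !fragmentText.EndsWith(end, StringComparison.Ordinal))
        {
            return false;
        }

        // Text between the opening and closing tags
        string body = fragmentText.Substring(
            start.Length, fragmentText.Length - start.Length - end.Length);

        // The first '|' separates the page number from the title
        int bar = body.IndexOf('|');
        if (bar < 0 || !int.TryParse(body.Substring(0, bar), out page))
        {
            return false;
        }

        title = body.Substring(bar + 1);
        return title.Length > 0;
    }
}
```

A caller would create the link annotation only when TryParse returns true, using `page` for the destination and `title` as the replacement text. This does not address the displaced text itself.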

gregglinnell · April 10, 2018, 8:32am

Output.pdf (57.3 KB)

imran.rafique · April 10, 2018, 7:34pm

@gregglinnell,

We managed to replicate the problem of displaced text in our environment. It has been logged under ticket ID PDFNET-44515 in our bug tracking system. We have linked your post to this ticket and will keep you informed of any updates. As a workaround, you can set the horizontal position of the problematic text relative to the horizontal position of the date in the second row.

gregglinnell · April 10, 2018, 9:15pm

Could you please give me an example code change for the workaround?

Thanks again

imran.rafique · April 11, 2018, 6:45am

@gregglinnell,

We have tried changing the rectangle position of the problematic text, but that does not work either. We will let you know once significant progress has been made on the linked ticket ID PDFNET-44515. We are sorry for the inconvenience.

imran.rafique · May 7, 2018, 12:52am

@gregglinnell,

In reference to the linked ticket ID PDFNET-44491, the character displayed as a box with a question mark is not a white space. According to the 'ToUnicode' entry in the font description in the source PDF document, the unreadable character is U+0001. Please try the following C# code:

Document doc = new Document(myDir + "636586054611670463.pdf");
// Create a TextFragmentAbsorber to find all matches of the search pattern
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"{S}(.*?){E}");
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
// Accept the absorber for all the pages
doc.Pages.Accept(textFragmentAbsorber);
// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
Console.WriteLine("{0} fragments found:", textFragmentCollection.Count);
foreach (TextFragment text in textFragmentCollection)
{ Console.WriteLine(text.Text); }
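
Why the earlier `\w` and `\s` based patterns found nothing can be checked in plain .NET: U+0001 is a control character, so it is matched by neither `\w` nor `\s`, but it is matched by `.`. The sketch below is a minimal example; the helper name and the sample string are hypothetical, with the sample modelled on the Split((char)1) code posted earlier rather than taken from the PDF.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class ControlCharDemo
{
    // True if the pattern matches anywhere in the input.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }
}

// Usage:
//   ControlCharDemo.Matches("\u0001", @"^\w$");  // false: not a word character
//   ControlCharDemo.Matches("\u0001", @"^\s$");  // false: not white space
//   ControlCharDemo.Matches("\u0001", @"^.$");   // true: '.' matches it
//
//   // Assumed tag layout with U+0001 as the separator:
//   string tag = "{S}\u00013\u0001Title\u0001{E}";
//   ControlCharDemo.Matches(tag, @"\{S\}\w*\s?\w*\s?\w*\{E\}");  // false
//   ControlCharDemo.Matches(tag, @"\{S\}(.*?)\{E\}");            // true
```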

gregglinnell · May 8, 2018, 8:10am

I already have it finding all the items using @"{S}(.|\n)+?{E}".

It is PDFNET-44515 that is the more pressing issue now. Any news on that one?

Regards

imran.rafique · May 8, 2018, 4:48pm

@gregglinnell,

The linked ticket ID PDFNET-44515 is not resolved yet. It could take time because there are other high-priority tickets in the queue. We also recommend that clients post their critical issues (or ticket IDs) in the paid support forum. Please refer to this link: Aspose support options

aspose.notifier · June 20, 2022, 8:46pm

The issues you have found earlier (filed as PDFNET-44515) have been fixed in Aspose.PDF for .NET 22.6.