In addition to the above reply, please try the regular expression “{S}\w*\s?\w*\s?\w*{E}”. If this does not help, kindly send us your source PDF document.
I tried that, but it doesn’t find any matches. Please see attached PDF example.
The code I tried is below:
Document doc = new Document(path);
// Create a TextFragmentAbsorber object to find all instances of the input search phrase
//TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("{S}(.*){E}");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("{S}\\w*\\s?\\w*\\s?\\w*{E}");
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
// Accept the absorber for all the pages
doc.Pages.Accept(textFragmentAbsorber);
// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment text in textFragmentCollection)
{
    // Split the matched marker on the U+0001 separator character
    string[] items = text.Text.Split((char)1);
    if (items.Length == 4)
    {
        // items[1] is the target page number, items[2] is the visible link text
        text.Text = items[2];
        LinkAnnotation annotation = new LinkAnnotation(text.Page, text.Rectangle);
        annotation.Border = new Border(annotation);
        annotation.Border.Width = 0;
        annotation.Action = new GoToAction(new XYZExplicitDestination(Convert.ToInt32(items[1]), 0, 0, 0));
        text.Page.Annotations.Add(annotation);
    }
}
// Save
doc.Save(path);
There is a problem recognizing the character that is displayed as a box with a question mark when regular expressions are used. To address this issue, ticket PDFNET-44491 has been logged in our issue tracking system. We have linked your post to this ticket and will keep you informed of any updates.
The regular expression “{S}\s([\/\w]\s[\/\w])?([\/\w]\s[\/\w]\s[\/\w])?[\/\w]?\s{E}” can retrieve all 7 matching text strings, provided that the character displayed as a box with a question mark is a white space.
We managed to replicate the problem of displaced text in our environment. It has been logged as ticket PDFNET-44515 in our bug tracking system. We have linked your post to this ticket and will keep you informed of any updates. As a workaround, you can set the horizontal position of the problematic text relative to the horizontal position of the date in the second row.
We have tried changing the rectangle position of the problematic text, but that does not work either. We will let you know once significant progress has been made on the linked ticket PDFNET-44515. We are sorry for the inconvenience.
In reference to the linked ticket PDFNET-44491, the character displayed as a box with a question mark is not a white space. According to the ‘ToUnicode’ entry in the font description in the source PDF document, the unreadable character is U+0001. Please try the following C# code:
Document doc = new Document(myDir + "636586054611670463.pdf");
// Create a TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"{S}(.*?){E}");
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
// Accept the absorber for all the pages
doc.Pages.Accept(textFragmentAbsorber);
// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
Console.WriteLine("{0} fragments found:", textFragmentCollection.Count);
foreach (TextFragment text in textFragmentCollection)
{
    Console.WriteLine(text.Text);
}
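To illustrate why the earlier patterns found nothing, here is a minimal sketch that uses only System.Text.RegularExpressions and does not touch Aspose.PDF. The sample string, the class name MarkerDemo, and its helper methods are hypothetical and for illustration only. The sample assumes the marker in the PDF has the form {S}, page number, link text, {E}, separated by U+0001, which is what the four-part Split check in the original loop suggests.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkerDemo
{
    // Hypothetical sample (an assumption, not taken from the real PDF):
    // page number and link text separated by the U+0001 control character.
    public const string Sample = "{S}\u00013\u0001Chapter One\u0001{E}";

    // Returns true if the pattern matches anywhere in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Finds the first {S}...{E} marker and splits it on U+0001,
    // the same way the original loop splits text.Text.
    public static string[] SplitParts(string input)
    {
        Match m = Regex.Match(input, @"{S}(.*?){E}");
        return m.Success ? m.Value.Split('\u0001') : new string[0];
    }

    public static void Main()
    {
        // U+0001 is a control character, so neither \w nor \s consumes it.
        Console.WriteLine(Matches(@"{S}\w*\s?\w*\s?\w*{E}", Sample));
        // The dot matches any character except a newline, including U+0001.
        Console.WriteLine(Matches(@"{S}(.*?){E}", Sample));
        Console.WriteLine(SplitParts(Sample).Length);
    }
}
```

With this sample, the \w and \s based pattern should report no match, while {S}(.*?){E} should match and split into four parts. If the real document follows the same layout, the fragments found by the absorber can be passed through the original link-creation loop unchanged.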
The linked ticket PDFNET-44515 is not resolved yet. It could take time because there are other high-priority tickets in the queue. In addition, we recommend that clients post their critical issues (or ticket IDs) in the paid support forum. For details, please see: Aspose support options