Get rectangle of text based on offset

Hello, we are using an AI tool to find PII entries in a PDF. Here is how it works: first we extract the text from a PDF page:

var textAbsorber = new TextAbsorber()
{
    ExtractionOptions =
        new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving)
};
page.Accept(textAbsorber);

Then we call the API with that text:
client.api.call(textAbsorber.Text)
and it returns a response with text offsets, for example:
startoffset: 10, endoffset: 20, text: "Alexandr", type: "Name"

Could you help us map these offsets back to the PDF and get the rectangle of that text?

@grinaypps

A text rectangle in Aspose.PDF consists of four values: LLX, LLY, URX, and URY. Whether you can get a rectangle from the values you already have depends on how you are extracting or calculating those values. Another way to get the rectangle of text inside a PDF is to use the TextFragmentAbsorber class. You can search for text in the PDF with it and then read the rectangle of each absorbed text fragment.

Document pdfDocument = new Document(dataDir + "SearchAndGetTextFromAll.pdf");
// Create a TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("text");
// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);
// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// Loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    var rect = textFragment.Rectangle;
}

I know how to extract the rectangle of a text fragment, but this doesn’t solve the issue I mentioned.
Let’s imagine we have the line
“Alexandr had made some changes on Alexandr’s machine”. The name Alexandr may be mentioned on the page many times, but I need to highlight only the exact match and convert it to coordinates.
The PII tool finds the entity name Alexandr and identifies it from context at offset 0-8.
A TextFragmentAbsorber will find every occurrence of “Alexandr”, and I can’t tell which TextFragment belongs to that offset. Does that make sense?
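
One possible workaround, sketched here under assumptions rather than as a confirmed Aspose.PDF technique: count how many occurrences of the term start before the reported offset in the extracted text, and use that count to pick the matching fragment. The class and method names (OffsetMatcher, OccurrenceIndex) are hypothetical. The sketch also assumes that TextFragmentAbsorber returns its fragments in the same order in which the occurrences appear in TextAbsorber.Text, which should be verified on real documents.

```csharp
using System;

// Hypothetical helper, not part of Aspose.PDF.
public static class OffsetMatcher
{
    // Returns the zero-based index of the (non-overlapping) occurrence of
    // `term` that starts exactly at `beginOffset` in `text`,
    // or -1 if no counted occurrence starts at that offset.
    public static int OccurrenceIndex(string text, string term, int beginOffset)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(term))
        {
            return -1;
        }

        int index = 0;
        int pos = text.IndexOf(term, StringComparison.Ordinal);

        // Walk forward over earlier occurrences, counting each one.
        while (pos >= 0 && pos < beginOffset)
        {
            index++;
            pos = text.IndexOf(term, pos + term.Length, StringComparison.Ordinal);
        }

        return pos == beginOffset ? index : -1;
    }
}
```

For the sentence above, OccurrenceIndex(text, "Alexandr", 0) gives 0 and OccurrenceIndex(text, "Alexandr", 34) gives 1 (with a straight apostrophe in the text). The result could then be used to select a single fragment from TextFragmentAbsorber.TextFragments instead of looping over all of them, keeping in mind that Aspose.PDF collections are generally 1-based and that the ordering assumption may not hold for every layout.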

@grinaypps

Would you please share the following information so that we can test the case in our environment and try to produce a sample code snippet that fulfils your requirements?

  • Sample PDF document
  • Offset values for text that you obtained using your own method/implementation

We will try to find the text in the PDF using shared offset values and let you know about our feedback.

In this document our tool finds “zoom.us” at a certain place, and I want to highlight the exact place where it was found. However, if I search by the original text, it finds 3 occurrences and the resulting document is incorrect.
document.pdf (65.9 KB)
resultdocument.pdf (66.9 KB)

var document = new Aspose.Pdf.Document("document.pdf");

var textAbsorber = new TextAbsorber();
textAbsorber.Visit(document.Pages[1]);
// Some find logic here, based on the text extracted into textAbsorber.Text; the result looks like the list below
var entries = new List<dynamic>()
{
    new
    {
        BeginOffset = 518,
        EndOffset = 525,
        Type = "WEB"
    }
};

foreach (var entry in entries)
{
    var originalText = textAbsorber.Text.Substring(entry.BeginOffset, entry.EndOffset - entry.BeginOffset);

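    // Searching by the text alone matches every occurrence on the page, not only the one at the offset
    var textAbsorber2 = new TextFragmentAbsorber(originalText);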
    document.Pages[1].Accept(textAbsorber2);
    
    foreach (var textFragment in textAbsorber2.TextFragments)
    {
        var highlight = new HighlightAnnotation(document.Pages[1], textFragment.Rectangle)
        {
            Color = Color.Blue,
            Title = "PII"
        };
        document.Pages[1].Annotations.Add(highlight);
    }
}

document.Save("resultdocument.pdf");

@grinaypps

One more question: are you displaying the PDF on screen, and are these offset values calculated by your tool on the basis of screen size/display?

We are, but in this case it doesn’t matter. What I need is the correct Rectangle of this text, the same value we would get from a TextFragment’s Rectangle.

@grinaypps

The coordinate system inside a PDF works differently from coordinates on screen. Offset values collected by clicking with the mouse cannot simply be converted into Rectangle values using Aspose.PDF. Also, Aspose.PDF follows a coordinate system where (0,0) is the bottom-left corner of the page. The values of X and Y are in points.
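
To illustrate the difference, here is a minimal sketch of converting a screen-style box (origin at the top-left, units in pixels) into LLX, LLY, URX, URY values in points with a bottom-left origin. It relies only on the fact that one point is 1/72 inch. The class and method names are hypothetical, and the sketch assumes an unrotated page whose visible area starts at (0,0), with the render DPI known.

```csharp
using System;

// Hypothetical helper, not part of Aspose.PDF.
public static class PdfCoordinates
{
    // Converts a box given in top-left-origin pixels into PDF points
    // with a bottom-left origin. `pageHeightPoints` is the page height
    // in points; `dpi` is the resolution the page was rendered at.
    public static (double Llx, double Lly, double Urx, double Ury) FromTopLeftPixels(
        double x, double y, double width, double height,
        double pageHeightPoints, double dpi)
    {
        double scale = 72.0 / dpi; // points per pixel

        double llx = x * scale;
        double urx = (x + width) * scale;

        // The Y axis is flipped: the top edge on screen becomes the upper Y in the PDF.
        double ury = pageHeightPoints - y * scale;
        double lly = pageHeightPoints - (y + height) * scale;

        return (llx, lly, urx, ury);
    }
}
```

For example, on a 792-point-high page rendered at 96 DPI, a box at pixel (96, 96) that is 192 pixels wide and 48 pixels high maps to LLX 72, LLY 684, URX 216, URY 720.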

Nevertheless, we will investigate your requirements and try to come up with a method to achieve them. For this purpose, an investigation ticket, PDFNET-52173, has been logged in our issue tracking system. We will let you know as soon as the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

Thanks, will wait.
