Calculating the position of a sub-part of a TextFragment

Good afternoon,

I have an issue with TextFragment rectangles. I need to know the precise position of an element in my text, so I use rectangles. Getting the rectangle around each TextFragment works well:

Llx = textFragment.Rectangle.LLX,
Lly = textFragment.Rectangle.LLY,
Urx = textFragment.Rectangle.URX,
Ury = textFragment.Rectangle.URY,

My problem is this: the TextFragment breakdown is not fine-grained enough. For example, the text value could be:
‘ebatur aut falsum, pro liquido’.
But I would like to be able to find the rectangle around ‘pro liquido’ only, or ‘falsum’ only, and so on.

I have tried to retrieve the segments of each TextFragment, but each segment is identical to its TextFragment.

Is it possible to retrieve the Rectangle of a single word, please?

Thank you very much,
Best regards,

@BlackSea

To retrieve the rectangle around a specific word within a TextFragment, you can utilize the TextSegment class, which represents individual segments of text within a TextFragment. Each TextSegment has its own rectangle that defines its position on the page.

Here’s how you can achieve this:

  1. Use a TextFragmentAbsorber to extract the text fragments from your PDF document.
  2. Iterate through each TextFragment and then through its TextSegment collection.
  3. For each TextSegment, you can access its rectangle properties to get the precise position.

Here’s a code example demonstrating this approach:

using System; // needed for Console.WriteLine
using Aspose.Pdf;
using Aspose.Pdf.Text;

namespace Aspose.Pdf.Examples
{
    public class GetTextSegmentRectangle
    {
        public static void Run()
        {
            // Load the PDF document
            Document document = new Document("input.pdf");

            // Create a TextFragmentAbsorber to find all text
            TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber();
            document.Pages.Accept(textAbsorber);

            // Iterate through the text fragments
            foreach (TextFragment textFragment in textAbsorber.TextFragments)
            {
                // Iterate through the text segments
                foreach (TextSegment textSegment in textFragment.Segments)
                {
                    // Check if the segment text matches the word you are looking for
                    if (textSegment.Text.Equals("pro liquido")) // Change this to your target word
                    {
                        // Get the rectangle of the specific text segment
                        // Rectangle coordinates are doubles, so they cannot be assigned to float without a cast
                        double llx = textSegment.Rectangle.LLX;
                        double lly = textSegment.Rectangle.LLY;
                        double urx = textSegment.Rectangle.URX;
                        double ury = textSegment.Rectangle.URY;

                        // Output the rectangle coordinates
                        Console.WriteLine($"Rectangle for '{textSegment.Text}': LLX={llx}, LLY={lly}, URX={urx}, URY={ury}");
                    }
                }
            }
        }
    }
}

In this example, replace "pro liquido" with the specific word you want to find. The code will print the rectangle coordinates for that word, allowing you to determine its precise position on the page.

This method should help you accurately retrieve the rectangle for individual words within a TextFragment. If you have any further questions or need additional assistance, feel free to ask!

Thank you very much for your answer. However, I have already tried this method and, as I said, the text of the TextFragment and of its TextSegment is the same:

foreach (TextFragment textFragment in paragraph.Fragments)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        _logger.Debug($"Text fragment : {textFragment.Text}");
        _logger.Debug($"Text segment : {textSegment.Text}");
    }
}

The result is:

Text fragment : Dum haec in oriente aguntur, Arelate hiemem agens Constantius post theatralis ludos atque 
Text segment : Dum haec in oriente aguntur, Arelate hiemem agens Constantius post theatralis ludos atque 
Text fragment : circenses ambitioso editos apparatu diem sextum idus Octobres, qui imperii eius annum tricensimum 
Text segment : circenses ambitioso editos apparatu diem sextum idus Octobres, qui imperii eius annum tricensimum 
Text fragment : terminabat, insolentiae pondera gravius librans, siquid dubium defer
Text segment : terminabat, insolentiae pondera gravius librans, siquid dubium defer
Text fragment : ebatur aut falsum, pro liquido 
Text segment : ebatur aut falsum, pro liquido 
Text fragment : accipiens et conperto, inter alia excarni?catum Gerontium Magnentianae comitem partis exulari 
Text segment : accipiens et conperto, inter alia excarni?catum Gerontium Magnentianae comitem partis exulari 
Text fragment : maerore multavit.
Text segment : maerore multavit.
Text fragment :  
Text segment : 

Do you have another solution, please? :slight_smile:
Best regards

@BlackSea
Could you provide the original document so we can investigate this issue in our environment?

Sure, this is the document:
Keyword.pdf (186.7 KB)

I am trying to get the exact Rectangle of the following word group: “sextum idus Octobres” on page 2, paragraph 1.1.

Thank you very much,
Best regards

@BlackSea
Thank you, I’ll check this issue and get back to you as soon as possible.


@BlackSea
I looked into this briefly. If you only need a specific word combination, the following seems to work:

textAbsorber = new TextFragmentAbsorber("pro liquido");
document.Pages.Accept(textAbsorber);

Compared with the parameterless constructor, the output differs as follows (the first line is the whole fragment, the second is the searched phrase only):

Rectangle for 'ebatur aut falsum, pro liquido ': LLX=377,59, LLY=551,3500000095368, URX=513,6359195299149, URY=563,4939999675751


Rectangle for 'pro liquido': LLX=462,41031970691677, LLY=551,3500000095368, URX=511,08567953872677, URY=563,4939999675751
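
As a fuller sketch of the phrase search, the snippet below applies the same idea to the word group asked about earlier in the thread. It assumes the attached Keyword.pdf sits in the working directory and that the phrase lies on a single line of page 2; the class name PhraseRectangleExample is made up for illustration.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical example class; the file name and page number are assumptions
// taken from the thread, not verified against the actual document.
public static class PhraseRectangleExample
{
    public static void Main()
    {
        // Load the PDF attached in the thread
        Document document = new Document("Keyword.pdf");

        // Search for the exact phrase instead of absorbing every fragment
        TextFragmentAbsorber absorber = new TextFragmentAbsorber("sextum idus Octobres");

        // Restrict the search to page 2 (the Pages collection is 1-based)
        document.Pages[2].Accept(absorber);

        // Each match is returned as its own TextFragment with its own rectangle
        foreach (TextFragment fragment in absorber.TextFragments)
        {
            Aspose.Pdf.Rectangle rect = fragment.Rectangle;
            Console.WriteLine(
                $"Rectangle for '{fragment.Text}': LLX={rect.LLX}, LLY={rect.LLY}, URX={rect.URX}, URY={rect.URY}");
        }
    }
}
```

If the phrase occurs more than once on the page, the loop prints one rectangle per occurrence, so the caller still has to pick the occurrence it wants.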

There is no option to split a fragment into words. However, you can modify the solution I suggested and pass a regular expression that searches for whole words separated by whitespace or newlines:

// Regex that matches individual words, so each word is returned as its own fragment.
// I used a pattern suggested by AI; take it with a grain of salt, although it seems to work.
TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber(new Regex(@"\b[\w'-]+\b"));
// Enable regular-expression search
textAbsorber.TextSearchOptions = new TextSearchOptions(true);

You can use Microsoft's regular expression documentation to adapt the pattern to your needs.
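
Before handing a pattern to the absorber, it can help to check what it matches on a plain string. The sketch below runs the same pattern over one of the lines from the log earlier in the thread, using only System.Text.RegularExpressions (no Aspose involved); WordSplitter and SplitWords are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out the word pattern on plain text
public static class WordSplitter
{
    // Same pattern as passed to the TextFragmentAbsorber above
    private static readonly Regex WordPattern = new Regex(@"\b[\w'-]+\b");

    // Returns every substring of the line that the pattern matches, in order
    public static List<string> SplitWords(string line)
    {
        var words = new List<string>();
        foreach (Match match in WordPattern.Matches(line))
        {
            words.Add(match.Value);
        }
        return words;
    }

    public static void Main()
    {
        // Prints ebatur, aut, falsum, pro and liquido, one per line
        foreach (string word in SplitWords("ebatur aut falsum, pro liquido "))
        {
            Console.WriteLine(word);
        }
    }
}
```

Because every word is matched separately, a group such as "pro liquido" comes back as two matches; for a multi-word group, searching for the literal phrase is likely the simpler route.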

Thank you for your answer :slight_smile:
