I have an issue with TextFragment rectangles. I need to know the precise position of an element in my text, so I use rectangles: I get the rectangle around each TextFragment, and that works well.
My problem is that the TextFragment breakdown is not fine-grained enough. For example, the text value could be:
‘ebatur aut falsum, pro liquido’.
But I would like to find the rectangle around ‘pro liquido’ only, or ‘falsum’ only, and so on.
I have tried to retrieve the segments of each TextFragment, but each segment is the same as its TextFragment.
Is it possible to retrieve the Rectangle of a single word, please?
To retrieve the rectangle around a specific word within a TextFragment, you can utilize the TextSegment class, which represents individual segments of text within a TextFragment. Each TextSegment has its own rectangle that defines its position on the page.
Here’s how you can achieve this:
Use a TextFragmentAbsorber to extract the text fragments from your PDF document.
Iterate through each TextFragment and then through its TextSegment collection.
For each TextSegment, you can access its rectangle properties to get the precise position.
Here’s a code example demonstrating this approach:
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;
namespace Aspose.Pdf.Examples
{
    public class GetTextSegmentRectangle
    {
        public static void Run()
        {
            // Load the PDF document
            Document document = new Document("input.pdf");
            // Create a TextFragmentAbsorber to find all text
            TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber();
            document.Pages.Accept(textAbsorber);
            // Iterate through the text fragments
            foreach (TextFragment textFragment in textAbsorber.TextFragments)
            {
                // Iterate through the text segments
                foreach (TextSegment textSegment in textFragment.Segments)
                {
                    // Check if the segment text matches the text you are looking for
                    if (textSegment.Text.Equals("pro liquido")) // Change this to your target word
                    {
                        // Get the rectangle of the specific text segment
                        // (the Rectangle coordinates are doubles, not floats)
                        double llx = textSegment.Rectangle.LLX;
                        double lly = textSegment.Rectangle.LLY;
                        double urx = textSegment.Rectangle.URX;
                        double ury = textSegment.Rectangle.URY;
                        // Output the rectangle coordinates
                        Console.WriteLine($"Rectangle for '{textSegment.Text}': LLX={llx}, LLY={lly}, URX={urx}, URY={ury}");
                    }
                }
            }
        }
    }
}
In this example, replace "pro liquido" with the specific word you want to find. The code will print the rectangle coordinates for that word, allowing you to determine its precise position on the page.
This method should help you accurately retrieve the rectangle for individual words within a TextFragment. If you have any further questions or need additional assistance, feel free to ask!
Thank you very much for your answer. However, I have already tried this method, and as I said, the text of the TextFragment and of its TextSegment is the same:
foreach (TextFragment textFragment in paragraph.Fragments)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        _logger.Debug($"Text fragment : {textFragment.Text}");
        _logger.Debug($"Text segment : {textSegment.Text}");
    }
}
The result is:
Text fragment : Dum haec in oriente aguntur, Arelate hiemem agens Constantius post theatralis ludos atque
Text segment : Dum haec in oriente aguntur, Arelate hiemem agens Constantius post theatralis ludos atque
Text fragment : circenses ambitioso editos apparatu diem sextum idus Octobres, qui imperii eius annum tricensimum
Text segment : circenses ambitioso editos apparatu diem sextum idus Octobres, qui imperii eius annum tricensimum
Text fragment : terminabat, insolentiae pondera gravius librans, siquid dubium defer
Text segment : terminabat, insolentiae pondera gravius librans, siquid dubium defer
Text fragment : ebatur aut falsum, pro liquido
Text segment : ebatur aut falsum, pro liquido
Text fragment : accipiens et conperto, inter alia excarni?catum Gerontium Magnentianae comitem partis exulari
Text segment : accipiens et conperto, inter alia excarni?catum Gerontium Magnentianae comitem partis exulari
Text fragment : maerore multavit.
Text segment : maerore multavit.
Text fragment :
Text segment :
Do you have another solution, please?
Best regards
@BlackSea
I looked into this briefly. If you only need a specific word combination, the following seems to work:
textAbsorber = new TextFragmentAbsorber("pro liquido");
document.Pages.Accept(textAbsorber);
In that case, the output differs from that of the default constructor as follows:
Rectangle for 'ebatur aut falsum, pro liquido ': LLX=377,59, LLY=551,3500000095368, URX=513,6359195299149, URY=563,4939999675751
Rectangle for 'pro liquido': LLX=462,41031970691677, LLY=551,3500000095368, URX=511,08567953872677, URY=563,4939999675751
There is no option to split a fragment into words. However, you can modify the solution I suggested and pass a regular expression that matches whole words separated by whitespace or newlines:
// Requires: using System.Text.RegularExpressions;
// Regex that matches individual words
// (it was suggested by an AI tool, so take it with a grain of salt, although it seems to work)
TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber(new Regex(@"\b[\w'-]+\b"));
// Enable regex search
textAbsorber.TextSearchOptions = new TextSearchOptions(true);
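Combined with the rectangle output from earlier in the thread, the word-level search could be sketched as below. This is a minimal sketch that I have not run against your file: "input.pdf" is a placeholder path, the class name WordRectangles is made up for illustration, and it assumes that each regex match comes back as its own TextFragment, which is what the 'pro liquido' output above suggests.

```csharp
using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical example class; only the Aspose calls already shown in this thread are used.
public static class WordRectangles
{
    public static void Main()
    {
        // Placeholder path, as in the earlier example
        Document document = new Document("input.pdf");

        // Word-level regex search, as in the snippet above
        TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber(new Regex(@"\b[\w'-]+\b"));
        textAbsorber.TextSearchOptions = new TextSearchOptions(true);
        document.Pages.Accept(textAbsorber);

        // Assumption: every regex match is returned as a separate TextFragment
        foreach (TextFragment word in textAbsorber.TextFragments)
        {
            Aspose.Pdf.Rectangle rect = word.Rectangle;
            Console.WriteLine($"Rectangle for '{word.Text}': LLX={rect.LLX}, LLY={rect.LLY}, URX={rect.URX}, URY={rect.URY}");
        }
    }
}
```

To keep only one word, such as 'falsum', compare word.Text inside the loop before printing.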
You can consult Microsoft's regular expression documentation to adapt the pattern to your needs.
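Before running a modified pattern against the PDF, it may help to check what it matches on a plain string. The sketch below uses only System.Text.RegularExpressions, with no Aspose calls. The class name WordPatternCheck and the helper SplitWords are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out a word pattern on plain text.
public static class WordPatternCheck
{
    // Same pattern as in the post above: runs of word characters, apostrophes
    // or hyphens that start and end on a word boundary.
    private static readonly Regex WordPattern = new Regex(@"\b[\w'-]+\b");

    // Returns every match of the pattern in the given line, in order.
    public static List<string> SplitWords(string line)
    {
        var words = new List<string>();
        foreach (Match match in WordPattern.Matches(line))
        {
            words.Add(match.Value);
        }
        return words;
    }

    public static void Main()
    {
        // Sample fragment text taken from this thread
        foreach (string word in SplitWords("ebatur aut falsum, pro liquido"))
        {
            Console.WriteLine(word);
        }
    }
}
```

For the sample line this yields the five words ebatur, aut, falsum, pro and liquido. The comma is dropped because it is not in the character class.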