Does the PdfExtractor API provide coordinates for extracted text?

GaneshKelam · June 21, 2024, 10:16am

We are able to extract text from PDF documents using either TextAbsorber or PdfExtractor.
We have our own UI, which displays the PDF page as an image with the extracted text below it in a separate panel.
Now we would like to highlight a word or line whenever the user hovers over a particular word or selects a whole line.
For this, we need the bounding rectangle of each word/line (x, y, width, and height) relative to the top-left corner of the page.
Does PdfExtractor provide the rectangular coordinates of each word it extracts from a line? We also need the coordinates of each individual line in a paragraph.
Please provide code samples showing how to achieve this.

asad.ali · June 21, 2024, 7:59pm

@GaneshKelam

You can get every text fragment on a PDF page using the TextFragmentAbsorber class, which provides the Rectangle of the extracted text as well as its X, Y position, as in the code snippet below:

// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
// dataDir is the folder that contains the input file
Document doc = new Document(dataDir + "sample.pdf");
Page page = doc.Pages[1]; // page numbering starts at 1

TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber();
// To search for a specific line or word instead, pass the text to the constructor:
// TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber("Some line or word inside PDF document");
page.Accept(textAbsorber);

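foreach (var text in textAbsorber.TextFragments)
{
    // Bounding rectangle: lower-left (LLX, LLY) and upper-right (URX, URY) corners
    var llx = text.Rectangle.LLX;
    var lly = text.Rectangle.LLY;
    var urx = text.Rectangle.URX;
    var ury = text.Rectangle.URY;
    // Position of the fragment on the page
    var posX = text.Position.XIndent;
    var posY = text.Position.YIndent;
}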
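
The original question asks for x, y, width, and height relative to the top-left corner of the page, while the rectangle above is given as lower-left and upper-right corners. PDF coordinates are normally measured in points (72 per inch) from the bottom-left corner, so the Y axis has to be flipped and the values scaled to the rendered image. The following is only a minimal sketch of that arithmetic, not part of Aspose.PDF: `CoordinateConverter` and `ToTopLeftPixels` are hypothetical names, and the sketch assumes an unrotated page whose origin is at (0, 0), a known page height in points, and a known DPI for the rendered image.

```csharp
using System;

// Hypothetical helper for illustration; it is not part of the Aspose.PDF API.
public static class CoordinateConverter
{
    // Default PDF user space uses 72 points per inch.
    private const double PointsPerInch = 72.0;

    // Converts a bottom-left-origin rectangle in points into a
    // top-left-origin box in pixels of an image rendered at the given DPI.
    // Assumes an unrotated page whose lower-left corner is at (0, 0).
    public static (double X, double Y, double Width, double Height) ToTopLeftPixels(
        double llx, double lly, double urx, double ury,
        double pageHeightPoints, double dpi)
    {
        double scale = dpi / PointsPerInch;

        double x = llx * scale;                       // left edge is unchanged apart from scaling
        double y = (pageHeightPoints - ury) * scale;  // distance from the top of the page to the top of the box
        double width = (urx - llx) * scale;
        double height = (ury - lly) * scale;

        return (x, y, width, height);
    }
}

public static class Program
{
    public static void Main()
    {
        // Example values only: a fragment on a 792 pt tall page rendered at 144 DPI.
        var box = CoordinateConverter.ToTopLeftPixels(72, 700, 172, 712, 792, 144);
        Console.WriteLine($"x={box.X}, y={box.Y}, width={box.Width}, height={box.Height}");
    }
}
```

With the example values, the scale factor is 2, so the box comes out as x = 144, y = 160, width = 200, height = 24. In the loop above, the four corner values would be passed in for each fragment; if the page is rotated or its visible area does not start at (0, 0), the offsets would need to be accounted for before flipping the Y axis.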