Extracting text with coordinates

chennabasappa.c · May 14, 2015, 7:43am

Hello,

I want to extract text together with its coordinates from a PDF document. For example, if the document contains "Table of Content" on the first page, I want to get the text "Table" along with its x-indent and y-indent.

Using PdfExtractor I can get the content.

Using TextFragmentAbsorber I can get the coordinates.

Is there any way to get the content together with its coordinates?

tilal.ahmad · May 14, 2015, 9:51am

Hi Chenna,

Thanks for your inquiry. You can get both the text and its coordinates from a PDF document using TextFragmentAbsorber. Please check the following documentation link; it should help you accomplish the task.

Search and get Text segments from PDF document.

Please feel free to contact us for any further assistance.

Best Regards,

chennabasappa.c · May 15, 2015, 1:17am

Thanks for the response. I tried to extract the content with coordinates using the code from the link above, but the TextFragments count is 0 for all of my files.
Please let me know how to extract the content with coordinates.

chennabasappa.c · May 15, 2015, 1:29am

I have attached the sample PDF document that I used to extract the content with coordinates.

tilal.ahmad · May 15, 2015, 1:00pm

Hi Chenna,

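Thanks for your inquiry. Please check the following code snippet, which gets the text and its coordinates from the PDF document. It searches all pages with a regular expression that matches each run of non-whitespace characters, so every word comes back as its own fragment with a position. Hopefully, it will help you accomplish the task.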

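//needs the System, Aspose.Pdf and Aspose.Pdf.Text namespaces (TextSearchOptions may sit in a different namespace in older releases)
//open document; myDir is the path of the folder that holds the PDF
Document pdfDocument = new Document(myDir + "Table+of+content.pdf");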

//create TextFragmentAbsorber object to find all the phrases matching the regular expression
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");

//set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;

//accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
    Console.WriteLine("Position : {0} ", textFragment.Position);
    Console.WriteLine("LLX : {0} ", textFragment.Position.XIndent);
    Console.WriteLine("LLY : {0} ", textFragment.Position.YIndent);
    Console.WriteLine("URX : {0} ", textFragment.Position.XIndent + textFragment.Rectangle.Width);
    Console.WriteLine("URY : {0} ", textFragment.Position.YIndent + textFragment.Rectangle.Height);
}
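
To see why the pattern above returns one fragment per word, here is a minimal sketch that applies the same regular expression to a plain string with the standard .NET Regex class. It does not use Aspose.Pdf at all; the class name WordPatternSketch and the method SplitWords are hypothetical names chosen for illustration, and the sketch assumes the absorber treats this simple pattern the way .NET Regex does.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Pdf.
public static class WordPatternSketch
{
    // Same pattern that the snippet above passes to TextFragmentAbsorber.
    public const string Pattern = @"[\S]+";

    // Returns every run of non-whitespace characters in the input, in order.
    public static string[] SplitWords(string line)
    {
        return Regex.Matches(line, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

Calling WordPatternSketch.SplitWords("Table of Content") yields the three strings "Table", "of" and "Content", which mirrors the word-by-word fragments reported by the loop above.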

Please feel free to contact us for any further assistance.

Best Regards,

chennabasappa.c · May 18, 2015, 6:36am

Thanks for the response. It’s working. I am able to get the text with coordinates.

tilal.ahmad · May 18, 2015, 9:42am

Hi Chenna,

Thanks for your feedback. It is good to know that you have managed to meet your requirement.

Please keep using our API and feel free to raise any question or concern. We will be more than happy to extend our support.

Best Regards,