How to extract Text and other information

smeverts · October 17, 2018, 11:49am

We have a requirement to extract not only the text from the PDF but also the following:

Character level Extraction
Bounding box coordinates for each character, XIndent, YIndent, Height, Width
Line number on the page
Extract the text by line number, i.e. in order from top to bottom (when prototyping this, we seem to be getting the footer data first) …

How could we accomplish these requirements using Aspose.Pdf?

Farhan.Raza · October 17, 2018, 7:50pm

Thank you for contacting support.

We would like to share with you that Aspose.PDF for .NET does not include character-level information. Any paragraph in a PDF document consists of TextFragments and TextSegments, which often contain one or more words. However, limited information about the characters in each TextSegment can be accessed through segment.Characters, as explained in "Highlight each character in PDF document".

Moreover, you can follow "Search and Get Text from All the Pages of PDF Document" and retrieve the respective properties as per your requirements. As for line numbers, text in a PDF document is not stored by line number but in the form of paragraphs, which can be extracted as described in "Extract Text from PDF document in Paragraphs form".
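
Since line numbers are not stored in the PDF, one possible workaround is to derive them from the extracted positions. The following is only a minimal, library-agnostic sketch in Python and does not use the Aspose.PDF API: CharBox, group_into_lines and the 2-point tolerance are hypothetical names and values chosen for illustration. It assumes you have already collected one box per character, with y measured from the bottom of the page, so that larger y values are higher up.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Hypothetical record for one extracted character (not an Aspose type)."""
    text: str
    x: float       # left edge, in points
    y: float       # bottom edge, in points, measured from the bottom of the page
    width: float
    height: float


def group_into_lines(chars, tolerance=2.0):
    """Cluster boxes with nearly equal y into lines, ordered top to bottom."""
    lines = []  # each entry: [reference_y, [CharBox, ...]]
    # Larger y means higher on the page, so sort descending to read top to bottom.
    for ch in sorted(chars, key=lambda c: -c.y):
        if lines and abs(lines[-1][0] - ch.y) <= tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append([ch.y, [ch]])
    # Within a line, order characters left to right and number lines from 1.
    return [
        (number, "".join(c.text for c in sorted(members, key=lambda c: c.x)))
        for number, (_, members) in enumerate(lines, start=1)
    ]


sample = [
    CharBox("p", 72.0, 40.0, 6.0, 10.0),    # footer, listed first on purpose
    CharBox("1", 78.0, 40.0, 6.0, 10.0),
    CharBox("H", 72.0, 700.0, 7.0, 12.0),   # body text near the top of the page
    CharBox("i", 79.0, 700.5, 3.0, 12.0),
]
for number, text in group_into_lines(sample):
    print(number, text)
```

With this sample the footer characters end up on line 2 even though they come first in the input, which is the ordering asked for in the question. A fixed tolerance is a simplification; multi-column layouts or mixed font sizes would need a more careful grouping rule.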

Furthermore, please note that the basic measuring unit in the Aspose.PDF API is the point, where 1 inch = 72 points. The origin of all Aspose.PDF objects (images, text, stamps, rectangles, pages, etc.) is the bottom-left corner (0,0); for example, in the case of page dimensions, the bottom-left corner of the page is (0,0) and the top-right corner is (page width, page height).
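
To illustrate the units and the origin described above, here is a small Python sketch of the arithmetic only; it does not call Aspose.PDF. The function names are hypothetical, and treating XIndent and YIndent as offsets from the top-left corner of the page is an assumption about what the question intends.

```python
POINTS_PER_INCH = 72.0


def points_to_inches(points):
    """Convert a length in points to inches (1 inch = 72 points)."""
    return points / POINTS_PER_INCH


def to_top_left(llx, lly, width, height, page_height):
    """Convert a box given by its lower-left corner (bottom-left origin)
    into (x_indent, y_indent, width, height) measured from the top-left
    corner of the page. All values are in points."""
    x_indent = llx
    # Distance from the top of the page down to the top edge of the box.
    y_indent = page_height - (lly + height)
    return (x_indent, y_indent, width, height)


# Example: a 612 x 792 point page (8.5 x 11 inches) and a character box
# whose lower-left corner is at (72, 700).
page_width, page_height = 612.0, 792.0
print(points_to_inches(page_width), points_to_inches(page_height))
print(to_top_left(72.0, 700.0, 7.0, 12.0, page_height))
```

Only the vertical coordinate changes in the conversion; x, width and height carry over unchanged because both conventions measure x from the left edge.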

We hope this will be helpful. Please feel free to contact us if you need any further assistance.