Enumerating text rects

bendavid · November 8, 2010, 9:52am

Hello,

I want to extract text from an existing PDF document, together with the location of each piece of text.

I am able to extract text from an existing PDF file using the PDFExtractor. What I am missing is the ability to get the coordinates of each text segment as a rectangle. Is this possible using your library?

I am willing to use extractTextInRectangle, but in order to get all the rects, I need some API to do the enumeration. Is it possible?
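As a rough illustration of what is being asked: without an enumeration API, one coarse workaround would be to tile the page with a grid and probe every cell with a rectangle-based extraction call. The sketch below shows only that idea. Rect, RegionProbe and the extractor callback are hypothetical names introduced here for illustration and are not part of Aspose.Pdf.Kit; the callback stands in for whatever rectangle-based call the library provides, whose exact signature is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical rectangle type (not an Aspose class): origin plus size, in page units. */
final class Rect {
    final double x, y, width, height;

    Rect(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/** Hypothetical helper: tiles a page with a grid and probes each cell for text. */
final class RegionProbe {

    /** Splits a page of the given size into rows x cols equally sized cells. */
    static List<Rect> grid(double pageWidth, double pageHeight, int rows, int cols) {
        List<Rect> cells = new ArrayList<>();
        double w = pageWidth / cols;
        double h = pageHeight / rows;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                cells.add(new Rect(c * w, r * h, w, h));
            }
        }
        return cells;
    }

    /**
     * Calls the extractor on every cell and keeps the cells that returned text.
     * The extractor is an assumption: it stands in for a rectangle-based call
     * such as extractTextInRectangle.
     */
    static Map<Rect, String> probe(List<Rect> cells, Function<Rect, String> extractor) {
        Map<Rect, String> hits = new LinkedHashMap<>();
        for (Rect cell : cells) {
            String text = extractor.apply(cell);
            if (text != null && !text.trim().isEmpty()) {
                hits.put(cell, text);
            }
        }
        return hits;
    }
}
```

A grid like this only locates text to the granularity of a cell and can split a word across two cells, which is why a real API that enumerates the text rectangles would be preferable.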

Thanks,

Shay

shahzadlatif · November 9, 2010, 5:15am

Hi Shay,

I would like to share with you that Aspose.Pdf.Kit for Java currently only allows you to extract either all the text or the text inside a rectangle that you specify. I’m afraid that extracting text along with the coordinates where it resides on the page is not currently supported; however, if you could share some more details about your requirement, our team might try to provide such a feature in a future version.

Moreover, could you please elaborate on the following statement:
I am willing to use extractTextInRectangle, but in order to get all the rects, I need some API to do the enumeration. Is it possible?

We’re sorry for the inconvenience and look forward to helping you out.
Regards,

bendavid · November 9, 2010, 6:29am

Latif,

I want to achieve "pdf2word" functionality, that is, to create an editable version of the document while preserving its look as much as possible.

To achieve this, I want to be able to enumerate all the objects (text, images) on a specific page and get their positions together with their metadata (text, font, size, etc.).
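To make the requirement concrete, the sketch below shows one possible shape for the enumerated objects: each object carries a bounding box, and text objects additionally carry their string, font name and font size. Every type here (PageObject, TextObject, ImageObject, PageModel) is hypothetical and invented for illustration; none of them is an Aspose class. The ordering assumes the default PDF coordinate system, where the origin is at the bottom-left of the page and y grows upward.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical base type: anything placed on a page, with its bounding box. */
abstract class PageObject {
    final double x, y, width, height; // lower-left corner and size, in PDF points

    PageObject(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/** Hypothetical text run with the metadata mentioned above. */
final class TextObject extends PageObject {
    final String text;
    final String fontName;
    final double fontSize;

    TextObject(double x, double y, double width, double height,
               String text, String fontName, double fontSize) {
        super(x, y, width, height);
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
    }
}

/** Hypothetical image placement holding the raw image bytes. */
final class ImageObject extends PageObject {
    final byte[] data;

    ImageObject(double x, double y, double width, double height, byte[] data) {
        super(x, y, width, height);
        this.data = data;
    }
}

/** Hypothetical container for everything found on one page. */
final class PageModel {
    private final List<PageObject> objects = new ArrayList<>();

    void add(PageObject object) {
        objects.add(object);
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right (y grows upward). */
    List<PageObject> inReadingOrder() {
        List<PageObject> sorted = new ArrayList<>(objects);
        sorted.sort(Comparator.comparingDouble((PageObject o) -> o.y)
                .reversed()
                .thenComparingDouble(o -> o.x));
        return sorted;
    }
}
```

With a model like this, a converter could walk the objects in reading order and emit the corresponding paragraphs and images into the target document.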

Thanks,

Shay

shahzadlatif · November 9, 2010, 11:09pm

Hi Shay,

Kindly share whether you want Aspose.Pdf.Kit to convert a PDF file to a Word document, or whether you just want to get a list of all the objects in the PDF file along with their properties so that you can process them at your end.

We look forward to helping you out.
Regards,

bendavid · November 10, 2010, 1:10am

Latif,

My vision is to be able to extract all objects from a specific page, including text, images, and hyperlinks, along with their locations and all their properties.

I want to apply some changes to those objects and then rebuild a document from my modified objects.

I would be able to use either Aspose.Pdf or Aspose.Words to reconstruct the modified document, depending on the required file format.

-Shay

shahzadlatif · November 11, 2010, 12:48am

Hi Shay,

I’m sorry to share with you that this kind of functionality is not currently available; however, I have logged a new feature request as PDFKITJAVA-21486 in our issue tracking system. Our team will investigate it in detail, and you’ll be updated via this forum thread once it is supported in the future.

We’re sorry for the inconvenience. If you have any further questions, please do let us know.
Regards,