How to extract paragraphs from PDF document using Aspose.PDF for .NET

ontardev · May 31, 2007, 3:48pm

One more question to developers.

It is very easy, as I can see now, to extract a single page and save it into a new PDF file. My next task is to extract the content of this page, e.g. images (a pretty easy task, indeed) and text. I need formatted text, e.g. paragraphs; Aspose.Pdf.Kit, as far as I can see, can retrieve plain text only. So I tried to use the Aspose.Pdf namespace.

What is the best way to retrieve formatted text from an existing file?

Should I use classes from both the Pdf (such as the Pdf document class) and Pdf.Kit namespaces? How do I bind a Pdf document to an existing PDF file or stream?

AdeelTaseer · May 31, 2007, 8:28pm

Hi,

Thank you for considering Aspose.

I am afraid that this feature is not supported by Aspose.Pdf.Kit. You can’t extract formatted text; only plain text without formatting can be extracted.

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

ontardev · June 1, 2007, 9:11am

I can get all the information about formatted text (e.g. paragraphs, fonts, colors) by using Aspose.Slides, and the only reason we purchased Aspose.Total was to perform similar operations on Adobe and Word formats.

The Pdf document class MUST have a way to bind to PDF files in order to provide real manipulation of PDF elements…

Extraction of plain text and images by itself has very little value for real development; you can’t generate a web page without information about the paragraphs, fonts, borders etc. in the original document. Aspose.Slides has all these features, and we had no doubts about the existence of similar functionality for Adobe files…

It is VERY sad, indeed. Will you have this functionality in the near future? When?

In that case, I can start to think about trickier ways to extract formatted text.

As far as I can guess, there is a way to get an XML file representing the content of a PDF file.

Can you recommend any way to generate this XML file with Aspose tools?

Of course, I can export the Adobe file to XML format manually… but can I create a Pdf document object from the exported file?

GeorgieYuan · June 1, 2007, 11:38am

Hi,
We are working hard on this issue and are trying to get more detailed information now, but it may take a long time to improve this function.
For this reason, exporting detailed information to an XML file is not feasible at the moment.

Any further questions are welcome.

wcling33 · September 22, 2011, 2:21pm

It is now 2011, and I was wondering whether there is now support for reading formatted text from a PDF using any of the Aspose PDF APIs for Java.

The task we need to perform is reading the PDF and creating fragments for the paragraphs, images, tables, headers and footers.
In reading the PDF and detecting a paragraph we then want to convert this to a formatted HTML string.
In reading the PDF and detecting a table we then want to convert this to a formatted HTML string.
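
To make the paragraph case concrete, here is a minimal Java sketch of the conversion step only. It assumes the paragraph has already been detected and split into runs of text that share the same formatting. `TextRun` and `ParagraphHtml` are hypothetical names used for illustration; they are not Aspose types, and how the runs are obtained from the PDF is not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical model (not an Aspose type): text sharing one font and style. */
final class TextRun {
    final String text;
    final String fontName;
    final double fontSizePt;
    final boolean bold;

    TextRun(String text, String fontName, double fontSizePt, boolean bold) {
        this.text = text;
        this.fontName = fontName;
        this.fontSizePt = fontSizePt;
        this.bold = bold;
    }
}

/** Hypothetical helper that renders one detected paragraph as an HTML string. */
final class ParagraphHtml {
    /** Escapes the characters that have special meaning in HTML text and attributes. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps each run in a styled <span> and the whole paragraph in a <p>. */
    static String toHtml(List<TextRun> runs) {
        StringBuilder sb = new StringBuilder("<p>");
        for (TextRun r : runs) {
            sb.append("<span style=\"font-family:").append(escape(r.fontName))
              .append(";font-size:")
              .append(String.format(Locale.ROOT, "%.1f", r.fontSizePt)).append("pt")
              .append(r.bold ? ";font-weight:bold" : "")
              .append("\">")
              .append(escape(r.text))
              .append("</span>");
        }
        return sb.append("</p>").toString();
    }

    public static void main(String[] args) {
        List<TextRun> paragraph = Arrays.asList(
            new TextRun("Total: ", "Arial", 12, true),
            new TextRun("5 < 7", "Arial", 12, false));
        System.out.println(toHtml(paragraph));
    }
}
```

A table could be handled along the same lines by emitting `<tr>` and `<td>` elements per detected row and cell, again assuming the detection itself is provided by the PDF library.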

Are these functions supported, and to what level (text with or without formatting, etc.)?
Any links to sample code or methods would be very helpful.

My company already has licenses for the Aspose (Java) Word and Excel APIs and is looking to purchase and integrate the PDF API. Also, which PDF API do you recommend we use: aspose.pdf or aspose.pdf.kit?

Wayne Clingingsmith
Trintech.com

shahzadlatif · September 23, 2011, 4:59am

Hi Wayne,

We have provided a similar feature (PDFKITJAVA-6024) in our upcoming version, which will be published this week. In this release, we have provided functionality to convert the text to HTML and also to get the formatting information of the text. Please wait for this release and see whether it helps in your scenario. You’ll be notified via this forum thread once it is published.

Regards,

aspose.notifier · September 24, 2011, 8:55am

The issues you have found earlier (filed as 6024) have been fixed in this update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.

asad.ali · May 27, 2020, 10:39pm

@ontardev

We would like to share with you that you can now extract entire paragraphs from PDF documents using Aspose.PDF for .NET.