PDF - load pdf to get structure/paragraphs

Hi,

With the code below, we can create a PDF structure such as Page - Sections - Paragraphs - Text/Image, etc.

Pdf pdf1 = new Pdf();
aspose.pdf.Section sec1 = pdf1.getSections().add();
sec1.getParagraphs().add(new aspose.pdf.Text(sec1, "paragraph 1 "));


But how can we retrieve this structure from a loaded PDF document? We load the document as below:
com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(new FileInputStream(file));

Is there any other way to load a PDF so as to retrieve sections, paragraphs, text, images, etc., and loop through all nodes starting from the Document node?

Thank you.
–Sonali

Hi Sonali,

Thanks for contacting support.

As you have understood, the aspose.pdf package lets you create a PDF file in a structured manner (i.e. create a Pdf object that contains one or more Section objects, where each Section contains one or more paragraph objects). Similarly, the com.aspose.pdf package lets you create as well as manipulate existing PDF files in a structured manner (i.e. retrieve/manipulate elements of a PDF file): Document represents the PDF file and contains one or more Page objects. Each Page has a Paragraphs collection, where Image, Text, Annotation, etc. are paragraph-level elements. Please visit the following link for further details on

In case I have not properly understood your requirement or you have any further queries, please share some more details.

Hi Nayyer,

Thanks for looking into the query.

Below are a few more queries:

1. With com.aspose.pdf.Document, can we call document.getPages().get_item(1).getParagraphs()?

It looks like the getParagraphs() method no longer exists.

2. Suppose that on one page I have inserted:

Text1, then Image1, then Text2, then Image2, then Text3.

With com.aspose.pdf, suppose I have somehow found Text2. Now I want to delete the image 'Image2' immediately following this text.

So, based on Text2, how can I get the index of Image2 in order to delete it?

Can we get object IDs for text and images and delete particular objects directly from the PDF, irrespective of object type?

3. Suppose an image points to a web URL. Is there any method in com.aspose.pdf.XImage to get the hyperlink associated with the image?

4. We found that if an image has a web URL, it comes through as a LinkAnnotation. Is there any method on the annotation to get this image or the image location?

Thanks.

-Sonali

sonaliag1:
1. With com.aspose.pdf.Document, can we call document.getPages().get_item(1).getParagraphs()?

It looks like the getParagraphs() method no longer exists.

Hi Sonali,

Thanks for contacting support.

I have tested the scenario using the latest hotfix of Aspose.Pdf for Java 4.5.1 and, as per my observations, the getParagraphs(..) method exists.

sonaliag1:
2. Suppose that on one page I have inserted:

Text1, then Image1, then Text2, then Image2, then Text3.

With com.aspose.pdf, suppose I have somehow found Text2. Now I want to delete the image 'Image2' immediately following this text.

So, based on Text2, how can I get the index of Image2 in order to delete it?

Can we get object IDs for text and images and delete particular objects directly from the PDF, irrespective of object type?

Images are saved in the Images collection and can be retrieved using the getImages(..) method, whereas text can be accessed using the TextAbsorber class. Images have their own indexing within that collection, and it has no relation to the text present in the PDF.
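
Since the two collections are indexed independently, one possible workaround is to correlate text and images by position on the page instead of by index. The following is only a minimal sketch of that idea and does not call any Aspose API: the Box class and the indexOfImageBelow helper are hypothetical names, and it assumes you have already obtained bounding boxes for the text and for each image by whatever means your library version offers. It also assumes PDF user-space coordinates (origin at the bottom-left, y growing upward) and a simple single-column layout.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.Pdf: picks the image that sits
// directly below a piece of text, using bounding boxes only.
final class ImageAfterText {

    /** Axis-aligned box in PDF user space (origin bottom-left, y grows upward). */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /**
     * Returns the 0-based index (into the given list) of the image whose top
     * edge is closest below the bottom edge of the text, or -1 if no image
     * lies below the text.
     */
    static int indexOfImageBelow(Box text, List<Box> images) {
        int best = -1;
        double bestGap = Double.MAX_VALUE;
        for (int i = 0; i < images.size(); i++) {
            // Non-negative gap means the image starts at or below the text.
            double gap = text.lly - images.get(i).ury;
            if (gap >= 0 && gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Box text2 = new Box(72, 500, 300, 520);
        List<Box> images = Arrays.asList(
                new Box(72, 600, 272, 700),   // Image1: above Text2
                new Box(72, 380, 272, 480));  // Image2: directly below Text2
        System.out.println(indexOfImageBelow(text2, images)); // prints 1
    }
}
```

The returned value is an index into the list you pass in, so mapping it back to the position in the page's image collection is left to the caller.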

I am working on other queries and will get back to you soon.

Hi,

Any update? Please update us with whatever has been completed so far; you may continue further to complete the rest.

Alternatively, can you let us know the approach you are taking to achieve this?

Thank you.

-Sonali

sonaliag1:
3. Suppose an image points to a web URL. Is there any method in com.aspose.pdf.XImage to get the hyperlink associated with the image?
Hi Sonali,

I am afraid Aspose.Pdf for Java currently does not support getting the hyperlink/URL associated with an image. However, I have logged this requirement as PDFNEWJAVA-34031 in our issue tracking system. We will look further into the details of this requirement and will keep you posted on its status. Please be patient and spare us a little time.

sonaliag1:
4. We found that if an image has a web URL, it comes through as a LinkAnnotation. Is there any method on the annotation to get this image or the image location?
Hi Sonali,

Thanks for contacting support.

All hyperlinks are represented as LinkAnnotations, and I am afraid you cannot retrieve the image associated with a hyperlink. However, as stated in my earlier post, I have logged a requirement to retrieve the URL/hyperlink associated with an Image object.

In case you need information about the location where an image appears in the PDF file, please follow the instructions specified in Get the resolution and dimensions of embedded images
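
Until the logged requirement is addressed, one possible interim approach is to compare the rectangle of each link annotation with the rectangle of each image and treat the image with the largest overlap as the one the link covers. This is only a sketch under stated assumptions: it does not use any Aspose API, the Rect class and the imageUnderLink helper are hypothetical names, and it assumes you have already read the annotation rectangle and the image rectangles in the same page coordinate system.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.Pdf: matches a link rectangle to
// the image rectangle it overlaps most.
final class LinkImageMatcher {

    /** Axis-aligned rectangle given by lower-left and upper-right corners. */
    static final class Rect {
        final double llx, lly, urx, ury;

        Rect(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /** Area of the intersection of two rectangles, or 0 if they do not overlap. */
    static double overlapArea(Rect a, Rect b) {
        double w = Math.min(a.urx, b.urx) - Math.max(a.llx, b.llx);
        double h = Math.min(a.ury, b.ury) - Math.max(a.lly, b.lly);
        return (w > 0 && h > 0) ? w * h : 0.0;
    }

    /**
     * Returns the 0-based index of the image with the largest overlap with
     * the link rectangle, or -1 if the link overlaps no image at all.
     */
    static int imageUnderLink(Rect link, List<Rect> images) {
        int best = -1;
        double bestArea = 0.0;
        for (int i = 0; i < images.size(); i++) {
            double area = overlapArea(link, images.get(i));
            if (area > bestArea) {
                bestArea = area;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Rect link = new Rect(70, 378, 274, 482);
        List<Rect> images = Arrays.asList(
                new Rect(72, 600, 272, 700),   // no overlap with the link
                new Rect(72, 380, 272, 480));  // fully inside the link rectangle
        System.out.println(imageUnderLink(link, images)); // prints 1
    }
}
```

Overlap matching is a heuristic: a link that merely sits near an image, or that spans several images, may need an extra threshold or a manual check.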