Is there a named TextFragment or TextSegment?

tn77 · March 22, 2012, 2:31pm

I’m new to the PDF API and Aspose.Pdf in particular, so I don’t know how to get to certain parts of a document, or if it is even available or possible.

I see the TextFragmentAbsorber class and the TextFragment within it, but this appears to just get raw text.

Can you tell me: is there a way for the document to be created so that a particular piece of text has a name associated with it? And then how/if I can access that name via Aspose.Pdf?

If this is possible, it will change the way my company does certain things, as people can create PDFs with specific pieces of parseable text that will make other processes easier to accomplish.

Thank you

codewarior · March 23, 2012, 8:54am

Hello Tim,

Thanks for your interest in our products.

I am pleased to share that you can add text paragraphs to a PDF document and assign a particular ID to each of them, but I am afraid the API does not currently support extracting a text paragraph based on the ID or name associated with it. Please note that in order to add a text paragraph along with its ID, you need to use the Aspose.Pdf.Generator namespace. The Text object is a paragraph-level element, and you can assign a particular ID to it. For more information, please visit [Assign ID to Paragraph](http://docs.aspose.com/display/pdfnet/Assign+ID+to+Paragraph)

You can use a regular expression to extract text from a PDF document. For this purpose, you need to use the Aspose.Pdf namespace. For more information, please visit Search and Get Text From All Pages Using Regular Expression
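As a rough illustration of the regular-expression approach, here is a minimal sketch based on the TextFragmentAbsorber class mentioned earlier in this thread. It assumes a recent Aspose.Pdf for .NET release in which TextSearchOptions lives in the Aspose.Pdf.Text namespace (older releases may place it elsewhere); the RegexSearch class, the FindMatches helper, the input.pdf path, and the sample pattern are all made up for this example.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class RegexSearch
{
    // Hypothetical helper: returns the text of every fragment, on any page,
    // that matches the given regular expression.
    public static List<string> FindMatches(Document document, string pattern)
    {
        TextFragmentAbsorber absorber = new TextFragmentAbsorber(pattern);

        // Passing 'true' asks the absorber to treat the search phrase
        // as a regular expression rather than as literal text.
        absorber.TextSearchOptions = new TextSearchOptions(true);

        // Run the absorber over all pages of the document.
        document.Pages.Accept(absorber);

        List<string> matches = new List<string>();
        foreach (TextFragment fragment in absorber.TextFragments)
        {
            matches.Add(fragment.Text);
        }
        return matches;
    }

    public static void Main()
    {
        // "input.pdf" is a placeholder path, not a file from this thread.
        Document document = new Document("input.pdf");

        // Sample pattern: two groups of four digits separated by a hyphen.
        foreach (string match in FindMatches(document, @"\d{4}-\d{4}"))
        {
            Console.WriteLine(match);
        }
    }
}
```

If the people authoring the PDFs agree on a recognizable text convention (a fixed label or prefix, for instance), a pattern like this can stand in for a named fragment until ID-based extraction is available.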

PS: please note that I have logged the requirement of extracting text based on a particular ID as PDFNEWNET-33444 in our issue tracking system. We will look further into the details of this requirement and see how we can implement this feature. Please be patient and allow us a little time. We apologize for the inconvenience.

tn77 · March 26, 2012, 8:49am

You say that I cannot get a paragraph by its ID. That’s fine, but can I iterate over all the paragraphs in a document, and will each paragraph object then contain the ID? That’s really all I need.

Do you have a sample document with paragraphs and IDs?

codewarior · March 27, 2012, 8:38am

Hello Tim,

Thanks for contacting support.

Please note that you can currently extract the text of all paragraphs present in a PDF document but, as I stated earlier, extracting a paragraph based on a particular ID is not yet supported. The feature is already logged in our issue tracking system, and as soon as it becomes available, we will be more than happy to update you on its status. Meanwhile, you may consider visiting the following link for information on Extract Text from all the Pages using Text Device

Regarding your other request for a sample PDF file with paragraphs and IDs, please find attached the PDF file generated with the following code snippet.

[C#]

// requires a reference to Aspose.Pdf and: using Aspose.Pdf.Generator;
// instantiate the PDF object
Pdf pdf = new Pdf();
// create a section inside the PDF
Aspose.Pdf.Generator.Section sec = pdf.Sections.Add();
// create a sample text paragraph
Text para1 = new Aspose.Pdf.Generator.Text("Text Paragraphs with ID Text1");
// assign an ID to the text paragraph
para1.ID = "Text1";
// add the paragraph to the paragraphs collection of the section object
sec.Paragraphs.Add(para1);

// create a second sample text paragraph
Text para2 = new Aspose.Pdf.Generator.Text("Text Paragraphs with ID Text2");
// assign an ID to the text paragraph
para2.ID = "Text2";
// add the paragraph to the paragraphs collection of the section object
sec.Paragraphs.Add(para2);

// save the resultant PDF file
pdf.Save("D:/pdftest/Pdf_2_Paragraphs.pdf");