Support for getting information from an existing PDF document

becky_bai · December 1, 2006, 1:51pm

Hi,

I am not yet very familiar with Aspose.Pdf. I have the impression that it mainly handles creating new PDF documents.

Does it support manipulating existing documents? For example, is it possible to do a word count on the entire document (or on each page)? As another example, can I extract all the text content of a PDF, and extract comments/notes from an existing PDF?

Thanks !

forever · December 1, 2006, 5:45pm

The features you need are supported by Aspose.Pdf.Kit, so I have moved this post to this forum.

Please refer to PdfExtractor. Word counting and text extraction are supported, but comment/note extraction is not supported yet.

becky_bai · April 13, 2007, 12:27pm

Is there an estimate of when notes/comments extraction will be supported?

Thanks!

seawolf · April 13, 2007, 9:52pm

Hi Becky_Bai,

Thank you for considering Aspose products. Notes/comments extraction will be supported in the next hotfix version. Could you also tell me which comment properties you need, for example, rectangle, contents, creation date, popup flag, etc.?

becky_bai · April 18, 2007, 9:10am

I will need to retrieve the following annotations:

1. Notes created with Adobe Acrobat's Note tool (the balloon).

2. Free text created with Adobe's Text Box tool.

Also, is it possible to retrieve text elements by page?

Thanks!

becky_bai · April 20, 2007, 1:37pm

Any estimate on the features I described in my last post?
Thanks!

seawolf · April 20, 2007, 5:39pm

Hi, you can download the new DLL of Aspose.Pdf.Kit 2.4.1. In PdfContentEditor, ExtractAnnotations() supports extracting the content of annotations of the specified types from an existing PDF document. The supported annotation types currently include “Text”, “Highlight”, “Squiggly”, “Strikeout” and “Underline”. Please try it, and if you have any other questions, please don't hesitate to let me know.
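Since only the five types listed above are supported at this point, it can help to check a requested type list before calling the extractor. The following is a minimal sketch that uses no Aspose API at all: `AnnotationTypeFilter` and `Unsupported` are hypothetical names introduced for illustration, and the case-sensitive comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Pdf.Kit.
public static class AnnotationTypeFilter
{
    // Types named in the post above as supported by ExtractAnnotations() in 2.4.1.
    private static readonly string[] Supported =
        { "Text", "Highlight", "Squiggly", "Strikeout", "Underline" };

    // Returns the requested types that are NOT in the supported list.
    // The comparison is case-sensitive (an assumption, not confirmed by the thread).
    public static List<string> Unsupported(IEnumerable<string> requested)
    {
        List<string> result = new List<string>();
        foreach (string type in requested)
        {
            if (Array.IndexOf(Supported, type) < 0)
            {
                result.Add(type);
            }
        }
        return result;
    }
}
```

Passing a list such as "Text" and "FreeText" would return only "FreeText", which flags a type that the extractor is not documented to handle here.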

becky_bai · April 25, 2007, 9:08am

Thanks, I will try it shortly.

One other question: we need the ability to do a word count per page, or to retrieve text elements per page. Is this going to be implemented in the near future?

AdeelTaseer · April 25, 2007, 9:44am

Hi,

It is difficult to get a word count for some languages (such as Chinese), so we have no short-term plans to support this feature. We only support extracting text from the whole PDF file. We don't recommend it, but if you want to use this feature, please refer to:

If you need a workaround, split the PDF into multiple single-page PDFs and then extract the text from each file, so that you can count the words extracted from each page.
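The counting step of this workaround could be sketched as follows, assuming the text of each single-page PDF has already been extracted into a string (the extraction call is not shown because its exact API is not given in this thread). `PageWordCounter` and its methods are hypothetical names. Splitting on whitespace also illustrates the limitation mentioned above: a run of Chinese characters without spaces is counted as a single word.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Pdf.Kit.
public static class PageWordCounter
{
    // Counts whitespace-delimited tokens in the text of one page.
    public static int CountWords(string pageText)
    {
        if (string.IsNullOrEmpty(pageText))
        {
            return 0;
        }
        // A null separator array makes Split use white-space characters as delimiters.
        string[] tokens = pageText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length;
    }

    // Returns one count per page, in the same order as the input.
    public static int[] CountWordsPerPage(IList<string> pageTexts)
    {
        int[] counts = new int[pageTexts.Count];
        for (int i = 0; i < pageTexts.Count; i++)
        {
            counts[i] = CountWords(pageTexts[i]);
        }
        return counts;
    }
}
```

Here `pageTexts[i]` is meant to hold the text extracted from the i-th single-page file produced by the split.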

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

becky_bai · May 15, 2007, 4:38pm

We don't actually need a word count per page, and we understand the problem with counting Asian characters. However, it is important for us to provide the extracted text per page, or at least a boolean indicating whether any text element exists on a page. Can this feature be implemented?
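Until per-page extraction is available, the boolean described above could be derived from text obtained through the single-page split workaround. This is a minimal sketch under that assumption; `PageTextCheck` and `HasText` are hypothetical names, and the input is the already-extracted text of one page.

```csharp
using System;

// Hypothetical helper, not part of Aspose.Pdf.Kit.
public static class PageTextCheck
{
    // True when the extracted page text contains at least one non-whitespace character.
    public static bool HasText(string pageText)
    {
        return pageText != null && pageText.Trim().Length > 0;
    }
}
```

A page that yields only whitespace or line breaks is treated as having no text, which is a design choice rather than something the thread specifies.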

Thanks!

forever · May 16, 2007, 3:40am

We will investigate this issue to see if we can support extracting text per page.

becky_bai · May 16, 2007, 8:01am

Could you let me know the estimated date of this feature if you are planning to support it?

forever · May 16, 2007, 8:26am

I have discussed this with the developers, and we think this feature is not difficult to support. We will give you an ETA for the feature soon.

becky_bai · May 16, 2007, 10:30am

I tried the annotation extraction function, but I can't extract popup balloon notes or free text notes. Please see the two attached PDF files. These two types of annotations are the ones we want to extract.

Thanks!

AdeelTaseer · May 16, 2007, 11:30am

Hi,

I have checked the file named “File4_TextNotes.pdf” and found that its annotations are extracted with the following code:

// Requires: using System; using System.Collections; using Aspose.Pdf.Kit;
PdfContentEditor editor = new PdfContentEditor();
string TestPath = @"D:\AsposeTest\TestData\";
editor.BindPdf(TestPath + "File4_FreeTextNotes.pdf");

// Annotation types to extract; 1 and 2 are presumably the start and end pages.
string[] annotType = { "Text", "Highlight" };
ArrayList annotList = editor.ExtractAnnotations(1, 2, annotType);

for (int i = 0; i < annotList.Count; i++)
{
    // Each annotation is returned as a Hashtable of named parts.
    Hashtable currentNode = (Hashtable)annotList[i];
    object partValue = null;

    // First pass: print the simple string-valued parts.
    foreach (string partName in currentNode.Keys)
    {
        partValue = currentNode[partName];
        if (partValue is string)
        {
            Console.WriteLine(partName + ":" + currentNode[partName].ToString());
        }
    }

    // Second pass: print the nested parts, which are Hashtables themselves.
    foreach (string partName in currentNode.Keys)
    {
        partValue = currentNode[partName];
        if (partValue is Hashtable)
        {
            Console.WriteLine(partName);
            Hashtable hashTable = (Hashtable)partValue;
            if (partName.Equals("contents-richtext"))
            {
                // Print the rich-text value, skipping its first 21 characters.
                Console.WriteLine(hashTable["Rc"].ToString().Substring(21));
            }
            else
            {
                foreach (string name in hashTable.Keys)
                {
                    Console.WriteLine(name + ":" + hashTable[name].ToString());
                }
            }
        }
    }
}
Console.ReadKey(false);

I have also checked the file named "File4_FreeTextNotes.pdf" and found that it does not contain a note type we support; it contains a Text Box annotation. I will discuss this issue with the developers and will let you know as soon as a solution is found.

Thanks.

becky_bai · May 16, 2007, 2:10pm

I tried your code; however, annotList.Count is 0 for both files. Are you using the .NET 2.0 version of the assembly?

Thanks.

AdeelTaseer · May 16, 2007, 2:19pm

Hi,

Yes, I am using .NET 2.0 with Aspose.Pdf version 3.4.3.0. Please check with the latest version.

Thanks.

becky_bai · May 16, 2007, 3:31pm

I am using Aspose.Pdf.Kit 2.5.0.0. Aspose.Pdf wouldn't be needed for extracting annotations, right? It is odd that the annotation ArrayList still comes back empty.

Would you mind trying another file for me? I have attached it. Thank you!

AdeelTaseer · May 16, 2007, 8:20pm

Hi,

Yes, I am sorry; Aspose.Pdf.Kit is what is used to extract annotations. I have reproduced the error. It was working in version 2.4.2.0, but there are some problems with the latest version. I will discuss this with the developers and we will try to fix it as soon as possible. Sorry for the inconvenience.

Thanks.

GeorgieYuan · May 16, 2007, 9:13pm

Hi,

We now plan to support extracting text per page, and we hope it will be available within a week.

Best regards.