Support for getting information from an existing PDF document

Hi,

I'm not yet very familiar with Aspose.PDF. I have the impression that it mainly handles creating new PDF documents.

Does it support manipulating existing documents? For example, is it possible to do a word count on the entire document (or on each page)? Another example: can I extract all the text content of a PDF, and extract comments/notes from an existing PDF?

Thanks !

The features you need are supported by Aspose.Pdf.Kit, so I have moved this post to this forum.

Please refer to PdfExtractor. Word counting and text extraction are supported, but comment/note extraction is not supported yet.

Is there an estimate of when notes/comments extraction will be supported?

Thanks!

Hi becky_bai,

Thank you for considering Aspose products. Notes/comments extraction will be supported in the next hotfix version. I would also like to know which comment information you need, for example rectangle, contents, creation date, popup flag, etc.

I will need to retrieve the following annotations:

1. Notes created by Adobe Acrobat's Note tool (the balloon)

2. Free text created using Adobe's Textbox tool.

Also, is it possible to retrieve text elements by page?

Thanks!

Any estimate on the features I described in my last post?
Thanks!

Hi, you can download the new DLL of Aspose.Pdf.Kit 2.4.1. In PdfContentEditor.cs, ExtractAnnotations() supports extracting the content of annotations of the specified types from an existing PDF document. The supported annotation types currently include “Text”, “Highlight”, “Squiggly”, “Strikeout” and “Underline”. Please try it, and if you have any other questions, don't hesitate to let me know.

Thanks, I will try it shortly.

One other question: we need the ability to do a word count per page, or to retrieve text elements per page. Is this going to be implemented in the near future?

Hi,

It is difficult to get a word count for some languages (such as Chinese), so we have no plans to support this feature in the short term. We only support extracting text from the whole PDF file. We don't recommend it, but if you want to use this feature then please refer to:

If you need a workaround, split the PDF into multiple single-page PDFs and then extract the text from each file, so you can count the text extracted from each page.
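
The counting step of this workaround might look like the sketch below. It assumes the text of each single-page PDF has already been extracted into a string; the splitting and extraction calls are not shown, and the class name PageWordCount and the sample pageTexts array are hypothetical, used only for illustration. It counts whitespace-delimited words, so it is not meaningful for languages such as Chinese.

```csharp
using System;

public static class PageWordCount
{
    // Counts whitespace-delimited words in one page's extracted text.
    // Returns 0 for null or empty text.
    public static int CountWords(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }
        // A null separator array makes Split break on any whitespace.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    public static void Main()
    {
        // Hypothetical sample data: one string per single-page PDF,
        // standing in for text extracted from each split file.
        string[] pageTexts = { "Hello world", "", "one two  three" };

        for (int i = 0; i < pageTexts.Length; i++)
        {
            int words = CountWords(pageTexts[i]);
            bool hasText = words > 0;
            Console.WriteLine("Page " + (i + 1) + ": " + words + " words, has text: " + hasText);
        }
    }
}
```

The same loop also gives a simple per-page flag for whether any text exists on a page.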

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

We don't actually need a word count per page, and we understand the problem with counting Asian characters. However, it is important for us to get the extracted text per page, or at least a boolean indicating whether any text element exists on a page. Can this feature be implemented?

Thanks!

We will investigate this issue to see if we can support extracting text per page.

Could you let me know the estimated date of this feature if you are planning to support it?

I have discussed this with the developers and we think this feature is not difficult to support. We will give you an ETA for the feature soon.

I tried the annotation extraction function, but I can't extract popup balloon notes or free text notes. Please see the two attached PDF files. These two types of annotations are the ones we want to extract.

Thanks!

Hi,

  1. I have checked the file named “File4_TextNotes.pdf” and found that the annotations are extracted with the following code:
// Requires: using System; using System.Collections; using Aspose.Pdf.Kit;
PdfContentEditor editor = new PdfContentEditor();
string TestPath = @"D:\AsposeTest\TestData\";
editor.BindPdf(TestPath + "File4_FreeTextNotes.pdf");

// Annotation types to extract, from page 1 to page 2.
string[] annotType = { "Text", "Highlight" };
ArrayList annotList = editor.ExtractAnnotations(1, 2, annotType);

for (int i = 0; i < annotList.Count; i++)
{
    Hashtable currentNode = (Hashtable)annotList[i];
    object partValue = null;

    // First pass: print the simple string-valued properties.
    foreach (string partName in currentNode.Keys)
    {
        partValue = currentNode[partName];
        if (partValue is string)
        {
            Console.WriteLine(partName + ":" + currentNode[partName].ToString());
        }
    }

    // Second pass: print the nested Hashtable-valued properties.
    foreach (string partName in currentNode.Keys)
    {
        partValue = currentNode[partName];
        if (partValue is Hashtable)
        {
            Console.WriteLine(partName);
            Hashtable hashTable = (Hashtable)partValue;
            if (partName.Equals("contents-richtext"))
            {
                // Skip the first 21 characters of the rich-text value.
                Console.WriteLine(hashTable["Rc"].ToString().Substring(21));
            }
            else
            {
                foreach (string name in hashTable.Keys)
                {
                    Console.WriteLine(name + ":" + hashTable[name].ToString());
                }
            }
        }
    }
}
Console.ReadKey(false);
  2. I have checked the file named "File4_FreeTextNotes.pdf" and found that it does not contain a note type we support; it is a Text Box. I will discuss this issue with the developers and will let you know as soon as a solution is found.

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

I tried your code; however, the annotList count is 0 for both files. Are you using the .NET 2.0 version of the assembly?

Thanks.

Hi,

Yes, I am using .NET 2.0 with Aspose.Pdf version 3.4.3.0. Please check with the latest version.

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

I am using Aspose.Pdf.Kit 2.5.0.0. Aspose.Pdf shouldn't be needed for extracting annotations, right? It is odd that the annotation ArrayList still comes back empty.

Would you mind trying another file for me? I have attached it. Thank you!

Hi,

Yes, I am sorry, Aspose.Pdf.Kit is what is used to extract annotations. I have reproduced the error. It was working in version 2.4.2.0 but has some problems in the latest version. I will discuss this with the developers and we will try to fix it as soon as possible. Sorry for the inconvenience.

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

Hi,

We now plan to support extracting text per page, and we hope it will be available within a week.

Best regards.