Support for getting information from existing pdf document

becky_bai · May 17, 2007, 8:11am

Thanks for your prompt responses!

This is a question about how to efficiently use Aspose.Pdf and Aspose.Pdf.Kit. In my app, I need to do a couple of things:

1. I need to check if the file is encrypted. (I use Aspose.Pdf.PdfFileInfo)

2. I need to extract text and attachments. (I use Aspose.Pdf.Kit.PdfExtractor)

3. I need to extract annotations. (I use Aspose.Pdf.Kit.PdfContentEditor)

This means I will need to load the same file 3 times, which could lead to performance issues if we need to process a lot of PDFs.

Is there a way to improve this? For example, could PdfExtractor also extract annotations and report whether the file is encrypted?

Thanks!

AdeelTaseer · May 17, 2007, 9:28am

Hi,

It is possible to combine these functions, but doing so would add complexity. We have tried to keep things simple so that developers can learn and use our library painlessly. I will discuss your concerns with the developers, and if we plan to offer some of these functions together in one class, we will let you know. For now, please use the classes as they are. Thanks for the suggestion.

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

ken · May 17, 2007, 8:07pm

Hi, Becky
After discussing the problem in detail, we concluded that we can’t support merging several functions into one, since this would not improve performance.

However, you could use the stream parameter of the functions, since this opens the file only once. You should reset the position of the input stream before every operation, like this:

inputStream.Position = 0;
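
A minimal sketch of that pattern, using only System.IO and a stand-in consumer rather than the actual Aspose classes. RunFromStart and CountBytes are hypothetical names introduced for illustration; in real use the consumer would be whichever call accepts the stream, and the stream could be built once with new MemoryStream(File.ReadAllBytes(path)).

```csharp
using System;
using System.IO;

public static class StreamReuseSketch
{
    // Rewinds the shared stream, then hands it to one consumer.
    public static T RunFromStart<T>(Stream input, Func<Stream, T> consumer)
    {
        // Without this reset the consumer starts where the previous one stopped.
        input.Position = 0;
        return consumer(input);
    }

    // Hypothetical stand-in consumer: counts the bytes left to read.
    public static long CountBytes(Stream input)
    {
        long count = 0;
        while (input.ReadByte() != -1)
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // Four in-memory bytes stand in for the PDF content loaded once.
        using (MemoryStream shared = new MemoryStream(new byte[] { 1, 2, 3, 4 }))
        {
            long first = RunFromStart<long>(shared, CountBytes);
            long second = RunFromStart<long>(shared, CountBytes);
            long withoutReset = CountBytes(shared); // stream is already at its end

            Console.WriteLine("{0} {1} {2}", first, second, withoutReset); // prints "4 4 0"
        }
    }
}
```

The last call shows what happens when the reset is skipped: the consumer sees an exhausted stream and reads nothing.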

Best regards.

seawolf · May 17, 2007, 8:50pm

Hi, I have reproduced this error using Aspose.Pdf.Kit 2.5.0.0, and I will fix the bug within two days. The next hotfix will support extracting “FreeText” annotations.

forever · May 21, 2007, 2:22am

Please download hotfix 3.4.4.0.

seawolf · May 21, 2007, 8:34am

Hi, becky_bai.

Please download Aspose.Pdf.Kit for .NET Hotfix 2.5.1.0 to extract annotations.

becky_bai · May 23, 2007, 8:33am

Thank you! The annotation extraction is working!

becky_bai · May 23, 2007, 9:41am

What we want to do is check whether any text elements exist on a page. I tried the ExtractText() function, but it takes a while to run if the PDF is rather large. Is there (or could there be) a more efficient way to check for text on a page?

Another question: the ExtractText() function throws an exception when extracting text from the attached PDF. Can you see why?

Thank you!

Becky

AdeelTaseer · May 23, 2007, 10:58am

Hi,

This is the only way to extract text from a PDF page right now. We are working on per-page text extraction. At the moment there is no property that tells you whether a page contains text.

About the second issue, I have reproduced this error. We will try to fix it soon.

Thanks.

GeorgieYuan · May 31, 2007, 7:46pm

Hi,

I regret to say that we have run into some technical problems that cannot be solved quickly. We can’t give an ETA for the ExtractText bug at this time.

Best Regards.

becky_bai · June 11, 2007, 10:11pm

I used PdfExtractor.ExtractText() to extract text from a PDF that has only one sentence in it, and it took about 20-30 seconds. Is the performance of this method a known issue? The PDF I tested is attached.

Thanks.

nparis · June 11, 2007, 10:19pm

That will definitely be too much of a performance hit for us. Will you be able to come up with a solution for this within the next 2 weeks?

Thanks

forever · June 11, 2007, 10:37pm

I tested this pdf and the text is extracted within one second. Are you sure you are using the latest version of Aspose.Pdf.Kit?

becky_bai · June 12, 2007, 11:28am

Thanks for your reply.

I made a mistake by testing it in debug mode.

Is extracting text per page going to be supported anytime soon? Also, is the bug I reported in an earlier post still being investigated (ExtractText() throws an exception on the example file I gave you)?

AdeelTaseer · June 12, 2007, 12:31pm

Hi,

We certainly plan to support this in the near future, but per-page text extraction is still in development. You can try it now, although it has a few limitations. You can use it like this:

PdfExtractor m_pdfExtractor = new PdfExtractor();

m_pdfExtractor.BindPdf(@"D:\AsposeTest\File1_NonSearch.pdf");

**m_pdfExtractor.StartPage = 1;**

**m_pdfExtractor.EndPage = 1;**

m_pdfExtractor.ExtractText();

As for the second issue, the bug, our developers are working hard to find its root cause. As Georgie already said, it is difficult to give an ETA for this problem, but I will check again and get back to you.

Thanks.

GeorgieYuan · June 12, 2007, 11:17pm

Hi,

We will provide a .NET 2.0 version of Aspose.Pdf.Kit that supports extracting text per page before tomorrow.

The ExtractText bug with PDF files that don’t contain text has not been fixed yet. We are working hard to solve this problem, but we cannot give an ETA now.

GeorgieYuan · June 13, 2007, 12:48am

Hi,

The attachment is a .NET 2.0 version of Aspose.Pdf.Kit. Please try it.

Best Regards.

nparis · June 13, 2007, 10:00am

Great, we will test this out later today and let you know. Thanks

nparis · June 14, 2007, 12:44pm

We tested the new DLL, and here is what we found:

1. Documents with text:

It seems to work well for getting the text. However, PdfExtractor.HasNextPageText() only seems to work if you extract the text for the current page. Is this true? That is, it seems we should be able to do the following:

// Starting at 0 because we want to know whether pages 1 through pageCount have text
for (int i = 0; i < pageCount; i++)
{
    extractor.StartPage = i;
    extractor.EndPage = i;

    bool nextPageHasText = extractor.HasNextPageText();
}

but this only seems to work if we call ExtractText() before calling HasNextPageText(). We have cases where we only want to know whether there is text and don’t need to extract it. Let me know if I am just setting it up incorrectly.

2. Documents with no text:

We have some documents that have no text, but Extract() and GetText() return “blanks” and HasNextPageText() returns true. I have attached an example.
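
Until that is resolved, a caller-side check could treat whitespace-only output as “no text”. This is a minimal sketch, assuming the extracted text is already available as a string; HasVisibleText is a hypothetical helper introduced here and is not part of Aspose.Pdf.Kit.

```csharp
using System;

public static class TextPresenceSketch
{
    // Returns true only if the string contains at least one non-whitespace character.
    // Trim() with no arguments strips spaces, tabs, and line breaks from both ends.
    public static bool HasVisibleText(string extracted)
    {
        return extracted != null && extracted.Trim().Length > 0;
    }

    public static void Main()
    {
        Console.WriteLine(HasVisibleText("   \r\n\t ")); // prints "False"
        Console.WriteLine(HasVisibleText(" Hello "));    // prints "True"
    }
}
```

This does not avoid the cost of extraction; it only keeps blank results from being counted as text.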

AdeelTaseer · June 14, 2007, 7:53pm

Hi,

I didn’t see the attached example. Please attach it again. Meanwhile, I will discuss the first issue with the developers.

Thanks.
