Support for getting information from an existing PDF document

Thanks for your prompt responses!

This is a question about how to efficiently use Aspose.Pdf and Aspose.Pdf.Kit. In my app, I need to do a couple of things:

1. I need to check if the file is encrypted. (I use Aspose.Pdf.PdfFileInfo)

2. I need to extract text and attachments. (I use Aspose.Pdf.Kit.PdfExtractor)

3. I need to extract annotations. (I use Aspose.Pdf.Kit.PdfContentEditor)

This means I will need to load the same file 3 times, which could lead to performance issues if we need to process a lot of PDFs.

Is there a way that this can be improved? For example, being able to extract annotations via PdfExtractor, and being able to see if file is encrypted via PdfExtractor as well?

Thanks!

Hi,

It is possible to combine them, but doing so would add complexity. We have tried to keep the API simple so that developers can learn and use our library without pain. In any case, I will discuss your concerns with the developers, and if we plan to offer some of these functions together in one class, we will let you know. For now, please use the classes as they are. Thanks for the suggestion.

Thanks.

Adeel Ahmad
Support Developer
Aspose Changsha Team
http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

Hi, Becky
After discussing the problem in detail, we found that we can’t support it by merging several functions into one, since this will not improve the performance.

However, you can use the stream parameter of the functions, since this way the file is opened only once. You must reset the position of the input stream before every operation, like this:

inputStream.Position = 0
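
As a minimal sketch of this pattern (not Aspose code): the helper below rewinds one seekable stream before each operation. The names StreamReuseSketch and RunAll are hypothetical, and the delegates only read a byte; in real use each delegate would pass the stream to whichever stream-accepting Aspose call you need. It also assumes that no operation closes the stream.

```csharp
using System;
using System.IO;

static class StreamReuseSketch
{
    // Runs every operation against the same stream, rewinding it first
    // so that each reader starts at the beginning of the data.
    public static void RunAll(Stream input, params Action<Stream>[] operations)
    {
        if (!input.CanSeek)
            throw new ArgumentException("The stream must be seekable.", "input");

        foreach (Action<Stream> operation in operations)
        {
            input.Position = 0; // reset before every operation
            operation(input);
        }
    }

    static void Main()
    {
        // Stand-in bytes for illustration; in practice the stream could be
        // new MemoryStream(File.ReadAllBytes(path)) so the file is read once.
        byte[] bytes = new byte[] { 0x25, 0x50, 0x44, 0x46 }; // "%PDF"
        using (MemoryStream inputStream = new MemoryStream(bytes))
        {
            RunAll(inputStream,
                delegate(Stream s) { Console.WriteLine(s.ReadByte()); },
                delegate(Stream s) { Console.WriteLine(s.ReadByte()); });
        }
    }
}
```

Both delegates read the same first byte, which shows that the second operation sees the stream from the start rather than from where the first one stopped.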

Best regards.

Hi, I have reproduced this error using Aspose.Pdf.Kit 2.5.0.0, and I will fix this bug within two days. In the next hotfix, we will support extracting “FreeText” annotations.

Please download hotfix 3.4.4.0.

Hi, becky_bai.

Please download the Aspose.Pdf.Kit for .NET Hotfix 2.5.1.0 to extract annotations.

Thank you! The annotation extraction is working!

What we want to achieve is to check whether text elements exist on a page. I tried the ExtractText() function, but it takes a while to run if the PDF is rather large. Is there (or could there be) a more efficient way to check whether text exists on a page?

Another question: the ExtractText() function throws an exception when extracting text from the attached PDF. Can you see why?

Thank you!

Becky

Hi,

This is the only way to extract text from a PDF page right now. We are working on the text-per-page issue. At the moment there is no property that indicates whether a page contains text.

About the second issue, I have reproduced this error. We will try to fix it soon.

Thanks.

Adeel Ahmad

Hi,

I regret to say that we have run into some technical problems that cannot be solved in a short time. We can’t give an ETA for the ExtractText bug right now.

Best Regards.

I was using PdfExtractor.ExtractText() to extract text from a PDF that has only one sentence in it, and it took about 20-30 seconds. Is the performance of this method a known issue? The PDF I tested is attached.

Thanks.

That will definitely be too much of a performance hit for us. Are you guys going to be able to come up with a solution for this within the next 2 weeks?

Thanks

I tested this pdf and the text is extracted within one second. Are you sure you are using the latest version of Aspose.Pdf.Kit?

Thanks for your reply.

I made a mistake by testing it in debug mode.

Is extracting text per page going to be supported anytime soon? Is the bug I sent you in an earlier post still being investigated (the one where ExtractText() throws an exception on the example file I gave you)?

Hi,

We certainly plan to support this in the near future, but right now extracting text per page is still in development. You can try it, but it has a few limitations at the moment. You can use it like this:

PdfExtractor m_pdfExtractor = new PdfExtractor();
m_pdfExtractor.BindPdf(@"D:\AsposeTest\File1_NonSearch.pdf");
// Limit extraction to page 1 by setting StartPage and EndPage
m_pdfExtractor.StartPage = 1;
m_pdfExtractor.EndPage = 1;
m_pdfExtractor.ExtractText();

About the second problem, the bug: our developers are working hard to find its root cause. As Georgie already said, it is difficult to give an ETA for this problem, but I will confirm again and get back to you.

Thanks.

Adeel Ahmad

Hi,

We will provide a .NET 2.0 version of Aspose.Pdf.Kit that supports extracting text per page before tomorrow.

The ExtractText bug with PDF files that don’t contain text has not been fixed yet. We are working hard to solve this problem, but we cannot give an ETA now.


Hi,

Attached is a .NET 2.0 version of Aspose.Pdf.Kit. Please try it.

Best Regards.

Great, we will test this out later today and let you know. Thanks

We tested the new DLL and here is what we found:

1. Documents with text:

It seems to be working well for getting the text. It seems, though, that PdfExtractor.HasNextPageText() only works if you extract the text for the current page. Is this true? That is, it seems we should be able to do the following:

// Starting at 0 because we want to know whether pages 1 through pageCount have text
for (int i = 0; i < pageCount; i++)
{
    extractor.StartPage = i;
    extractor.EndPage = i;

    bool nextPageHasText = extractor.HasNextPageText();
}

but this only seems to work if we call ExtractText() before calling HasNextPageText(). We have cases where we only want to know whether there is text but don't need to extract it. Let me know if I am just setting it up incorrectly.

2. Documents with no text:

We have some documents that have no text, but Extract() and GetText() are returning “blanks” and HasNextPageText() is returning true. I have attached an example.

Hi,

I didn’t see the attached example. Please attach it again. Meanwhile, I will discuss the possible first-page problem with the developers.

Thanks.

Adeel Ahmad