Issue reading pages from a large PDF

AA.Engineering · September 25, 2014, 11:32am

I have been having issues parsing large PDFs (200-400 MB). In this case they are a series of textbooks with a lot of images baked in.

This issue is difficult to reproduce. According to the stack trace I have (attached), the thread is stuck in a FileStream call made by Aspose.

I used the simple Aspose.Pdf.Document(pdfPath) constructor to create the pdfDocument, then extract the text of one page like this:

// Extract the raw text of a single page.
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions.FormattingMode = Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Raw;

// Pages is 1-based, hence the "plus one" offset.
this.pdfDocument.Pages[pageOffsetPlusOne].Accept(textAbsorber);
return textAbsorber.Text;

While this is running, memory usage sometimes gets very high. See memory_leak*.png attached.

As of right now I do not have permission to host this file. If that changes I will see what I can do. I have also attached a transcript of a chat I had with Tilal Ahmad about this issue.

Has anyone else had these issues? I first noticed the problem in 8.7.0. As a test I moved to 9.6.0 and have not seen it since. However, testing is still in its early phases and, as I said above, the problem is difficult to reproduce. What I would really like is a way to set a timeout on the Accept(TextAbsorber) call, if that is possible, with an exception or some other indication that a timeout occurred.

I do still have some older Ghostscript code in the mix that reads from the same file. Does that cause any known issues?
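
Failing a built-in option, a caller-side timeout could be sketched roughly as follows. This is an assumption-laden sketch, not an Aspose feature: TimeoutRunner, RunWithTimeout and ExtractPageText are hypothetical names, and the wrapper only stops waiting. It does not abort the underlying call, which keeps running on its thread-pool thread until it finishes on its own.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not part of Aspose.Pdf): waits a bounded time for a
// blocking call and reports a timeout to the caller.
static class TimeoutRunner
{
    public static T RunWithTimeout<T>(Func<T> work, TimeSpan timeout)
    {
        // Run the blocking work on a thread-pool thread.
        Task<T> task = Task.Run(work);

        // Wait returns true if the task completed within the timeout.
        // If the work itself threw, Wait rethrows it as an AggregateException.
        if (task.Wait(timeout))
        {
            return task.Result;
        }

        // The work is NOT cancelled here; it continues in the background.
        throw new TimeoutException(
            "Operation did not complete within " + timeout + ".");
    }
}

// Possible usage, assuming ExtractPageText (hypothetical) wraps the
// TextAbsorber snippet above:
//
//   string text = TimeoutRunner.RunWithTimeout(
//       () => ExtractPageText(pageOffsetPlusOne),
//       TimeSpan.FromMinutes(2));
```

Because the abandoned call may still hold the file and a lot of memory, this only helps the caller detect and log the hang; it would not by itself free whatever resource the call is blocked on.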

Thanks

codewarior · September 26, 2014, 8:53am

Hi Kent,

Thanks for contacting support.

As requested by Tilal, in order to determine the actual cause of this issue, we need the source PDF file so that we can replicate the problem in our environment. Once you have approval, please upload the PDF document to a free FTP or file-hosting service and we will test the scenario using this document.

Furthermore, please share some details regarding your point that the “issue is difficult to reproduce”. Do you mean that it does not occur every time, or that the API takes too long before the error appears?

We are sorry for this inconvenience.

PS: During our testing, we have easily managed to create and manipulate PDF files of up to 1 GB, so in this particular scenario the problem appears to be related to the structure and complexity of the source PDF file.

AA.Engineering · September 29, 2014, 3:21pm

First, let me give a better picture of our program.

Our program is multithreaded. We have two threads: one that uses Ghostscript only and one that uses both Aspose.Pdf and Ghostscript. We hope to move everything to Aspose, but that is not done yet.

Both threads operate on the same file; however, read operations are synchronized. There are no write operations to the input PDF.

Despite our efforts to synchronize access to the file, I think the root cause may be the garbage collector's unsynchronized nature combined with the fact that Ghostscript opens and closes the file every time it performs an operation. My guess is that some resource Aspose.Pdf needs in order to continue the read has not yet been released by the garbage collector.

If we let the program sit for a long time, the call stack I posted unblocks and proceeds. I suspect that is because the garbage collector eventually ran and released the resource(s).
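
One way to test that guess would be to make sure every reader we control opens the file for shared reading and releases its handle deterministically instead of leaving it to the garbage collector. A minimal sketch, assuming our own code opens the stream (SharedRead and WithSharedRead are hypothetical names; whether Aspose.Pdf or Ghostscript open the file the same way internally is not something I know):

```csharp
using System;
using System.IO;

// Hypothetical helper: opens the PDF read-only, allows other readers,
// and closes the handle as soon as the callback returns.
static class SharedRead
{
    public static T WithSharedRead<T>(string path, Func<Stream, T> read)
    {
        // FileShare.Read lets other handles open the file for reading
        // at the same time; writers are still blocked.
        using (var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            return read(stream);
        } // The handle is released here, not whenever the GC runs.
    }
}

// Possible usage:
//
//   long length = SharedRead.WithSharedRead(pdfPath, s => s.Length);
```

If another handle has the file open without read sharing, the FileStream constructor throws an IOException immediately, which would at least turn a silent hang into a visible error.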

Regardless, I will send you a private message (email) shortly with the link to the file in question.

Thanks

codewarior · September 30, 2014, 2:14pm

Hi Kent,

Thanks for sharing the details.

We would appreciate it if you could share a sample application that helps us replicate this issue in our environment. We are really sorry for the inconvenience.