Re: How to limit memory usage when extracting text from large PDFs?

jason.saunders · March 28, 2014, 7:43am

Has there been any change in this issue? I am facing exactly the same problem.

codewarior · March 29, 2014, 11:37pm

Hi Jason,

Thanks for contacting support.

I am afraid the problem described above is not yet resolved. Nevertheless, can you please share the source PDF file you are using so that we can investigate the problem further in our environment? We are sorry for the inconvenience.

jason.saunders · April 17, 2014, 2:47am

How can I send you the example PDF privately? It is a customer document which I don’t want to share publicly.

codewarior · April 17, 2014, 7:06am

Hi Jason,

To send us the document directly, please follow the instructions given in "How to send a license?".

jason.saunders · April 17, 2014, 7:45am

I have now sent you a link to an example PDF as per instructions above.

Thanks

tilal.ahmad · April 18, 2014, 4:25am

Hi Jason,

Thanks for sharing the details. We are looking into the issue and will get back to you soon.

Best Regards,

codewarior · April 18, 2014, 4:52am

Hi Jason,

I am afraid I have not yet received any document or link via email. Can you please double-check at your end?

jason.saunders · April 22, 2014, 10:57am

I definitely sent it, but have sent it again.

jason.saunders · April 23, 2014, 2:33am

This is now a critical problem. Can it be treated as a priority, please?

Also, we are currently using version 7.9 of Aspose.Pdf.dll, and our upgrade period has expired. Once the fix is done, how will we receive it?

codewarior · April 23, 2014, 6:10am

Hi Jason,

Thanks for sharing the resource file. I am working on testing this scenario in my environment and will keep you posted with my findings.

codewarior · April 27, 2014, 3:29am

Hi Jason,

Sorry for the delayed response.

I have tested the scenario and observed that memory utilization increases by 600MB. I have logged this problem as PDFNEWNET-36809 in our issue tracking system so that it can be corrected. We will look further into the details and keep you updated on the status of the fix. Please be patient and allow us a little time. We are sorry for this inconvenience.

jason.saunders · May 19, 2014, 4:11am

Has there been any progress with this problem?

I do not agree with “observed that memory utilization increases by 600MB”. A simple test with the example PDF I sent you shows the memory usage increasing after every page, and by page 20 it's up to 6GB, e.g.

    // Needs: using System; using System.Diagnostics; using Aspose.Pdf.Text;
    private static string GetPdfText(string path)
    {
        Stopwatch stopWatch = new Stopwatch();
        Aspose.Pdf.Document doc = new Aspose.Pdf.Document(path);
        int totalPages = doc.Pages.Count;
        TextAbsorber textAbsorber = new TextAbsorber();

        // Pages are indexed from 1. The same Document instance is used for every page.
        for (int i = 1; i <= totalPages; i++)
        {
            stopWatch.Restart();
            textAbsorber.Visit(doc.Pages[i]);
            stopWatch.Stop();
            // Log the per-page extraction time and the current managed heap size.
            Console.WriteLine(String.Format("Page #{0} time:{1}ms memory:{2}MB", i, stopWatch.ElapsedMilliseconds, GC.GetTotalMemory(false) / (1024 * 1024)));
        }

        return textAbsorber.Text;
    }

output:

    Page #1 time:7487ms memory:534MB
    Page #2 time:6500ms memory:805MB
    Page #3 time:7754ms memory:1151MB
    Page #4 time:9124ms memory:1507MB
    Page #5 time:7536ms memory:1822MB
    Page #6 time:5583ms memory:2023MB
    …
    Page #20 time:118610ms memory:6198MB

jason.saunders · May 19, 2014, 4:59am

I realised there is a simple workaround: reopen the doc inside the loop that processes each page. The TextAbsorber object still receives all the text across all pages, e.g.

    // Needs: using System; using System.Diagnostics; using Aspose.Pdf.Text;
    private static string GetPdfText(string path)
    {
        Stopwatch stopWatch = new Stopwatch();
        Aspose.Pdf.Document doc = new Aspose.Pdf.Document(path);
        int totalPages = doc.Pages.Count;
        TextAbsorber textAbsorber = new TextAbsorber();

        for (int i = 1; i <= totalPages; i++)
        {
            // Reopen the PDF to deallocate memory held by the previous Document instance.
            doc = new Aspose.Pdf.Document(path);

            stopWatch.Restart();
            textAbsorber.Visit(doc.Pages[i]);
            stopWatch.Stop();
            Console.WriteLine(String.Format("Page #{0} time:{1}ms memory:{2}MB", i, stopWatch.ElapsedMilliseconds, GC.GetTotalMemory(false) / (1024 * 1024)));
        }

        return textAbsorber.Text;
    }

However, it's obviously not very efficient to reopen the PDF on every iteration of the loop, especially when the PDF can contain hundreds of pages.
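
A possible middle ground, sketched here under assumptions and not tested against this PDF, would be to reopen the document once per batch of pages instead of once per page. The helper below only computes the 1-based page ranges. `PageBatcher`, `Ranges` and `batchSize` are names made up for this illustration and are not part of Aspose.Pdf. Going by the output above, memory grows by roughly 300MB per page, so the batch size would cap the peak at about that much for each page in a batch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.Pdf): splits pages 1..totalPages
// into consecutive batches so the Document can be reopened once per batch.
public static class PageBatcher
{
    // Yields (first, last) 1-based inclusive page ranges of at most batchSize pages.
    public static IEnumerable<Tuple<int, int>> Ranges(int totalPages, int batchSize)
    {
        if (batchSize < 1)
            throw new ArgumentOutOfRangeException("batchSize");

        int first = 1;
        while (first <= totalPages)
        {
            // The last batch may be shorter than batchSize.
            int last = (totalPages - first < batchSize) ? totalPages : first + batchSize - 1;
            yield return Tuple.Create(first, last);
            first = last + 1;
        }
    }
}

// Intended use inside GetPdfText, with only the calls already shown above
// (the batch size of 10 is an arbitrary assumption):
//
//   foreach (var range in PageBatcher.Ranges(totalPages, 10))
//   {
//       doc = new Aspose.Pdf.Document(path);   // reopen once per batch
//       for (int i = range.Item1; i <= range.Item2; i++)
//           textAbsorber.Visit(doc.Pages[i]);
//   }
```

A batch size of 1 behaves like the per-page workaround, and a batch size equal to totalPages behaves like the original loop with a single open, so the value trades reopen cost against peak memory.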

codewarior · May 20, 2014, 4:31am

Hi Jason,

Thanks for your patience.

The development team has been busy resolving other priority issues, and I am afraid the problem described above is not yet resolved. As soon as we have any update regarding its resolution, I will be more than happy to let you know. Please be patient and allow us a little time. We are sorry for this delay and inconvenience.

Meanwhile, I have also shared the workaround with the development team, as it might be helpful in rectifying this problem.

aspose.notifier · May 7, 2016, 3:18pm

The issue you found earlier (filed as PDFNEWNET-36809) has been fixed in Aspose.Pdf for .NET 11.6.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.