Re: How to limit memory usage when extracting text from large PDFs?

Has there been any change in this issue? I am facing exactly the same problem.

Hi Jason,


Thanks for contacting support.

I am afraid the problem described above is not yet resolved. Could you please share the source PDF file you are using so that we can investigate the problem further in our environment? We are sorry for the inconvenience.

How can I send you the example PDF privately? It is a customer document which I don’t want to share publicly.

Hi Jason,


To send us the document directly, please follow the instructions in "How to send a license?".

I have now sent you a link to an example PDF as per instructions above.


Thanks

Hi Jason,


Thanks for sharing the details. We are looking into the issue and will get back to you soon.

Best Regards,

Hi Jason,


I am afraid I have not yet received any document or link via email. Could you please double-check at your end?

I definitely sent it, but have sent it again.

This is now a critical problem. Can it be treated as a priority, please?


Also, we are currently using version 7.9 of Aspose.Pdf.dll, and our upgrade period has expired. Once the fix is released, how will we receive it?

Hi Jason,


Thanks for sharing the resource file. I am working on testing this scenario in my environment and will keep you posted with my findings.

Hi Jason,


Sorry for the delayed response.

I have tested the scenario and have observed that memory utilization increases by 600MB. For the sake of correction, I have logged this problem as PDFNEWNET-36809 in our issue tracking system. We will look further into the details of this problem and will keep you updated on the status of the correction. Please be patient and allow us a little time. We are sorry for this inconvenience.

Has there been any progress with this problem?


I do not agree with “observed that memory utilization increases by 600MB”. A simple test with the example PDF I sent you shows the memory usage increasing after every page, and by page 20 it is up to 6GB, e.g.

// Requires: using System; using System.Diagnostics;
// plus the Aspose.Pdf namespace that contains TextAbsorber.
private static string GetPdfText(string path)
{
    Stopwatch stopWatch = new Stopwatch();
    Aspose.Pdf.Document doc = new Aspose.Pdf.Document(path);
    int totalPages = doc.Pages.Count;
    TextAbsorber textAbsorber = new TextAbsorber();

    // Page indices in doc.Pages start at 1.
    for (int i = 1; i <= totalPages; i++)
    {
        stopWatch.Restart();
        textAbsorber.Visit(doc.Pages[i]);
        stopWatch.Stop();

        // Log the per-page extraction time and the current managed heap size.
        Console.WriteLine(String.Format("Page #{0} time:{1}ms memory:{2}MB", i, stopWatch.ElapsedMilliseconds, GC.GetTotalMemory(false) / (1024 * 1024)));
    }

    return textAbsorber.Text;
}

output:

Page #1 time:7487ms memory:534MB
Page #2 time:6500ms memory:805MB
Page #3 time:7754ms memory:1151MB
Page #4 time:9124ms memory:1507MB
Page #5 time:7536ms memory:1822MB
Page #6 time:5583ms memory:2023MB
Page #20 time:118610ms memory:6198MB

I realised there is a simple workaround - reopen the doc in the loop that’s processing each page. The TextAbsorber object still receives all the text across all pages - e.g.


// Requires: using System; using System.Diagnostics;
// plus the Aspose.Pdf namespace that contains TextAbsorber.
private static string GetPdfText(string path)
{
    Stopwatch stopWatch = new Stopwatch();
    Aspose.Pdf.Document doc = new Aspose.Pdf.Document(path);
    int totalPages = doc.Pages.Count;
    TextAbsorber textAbsorber = new TextAbsorber();

    // Page indices in doc.Pages start at 1.
    for (int i = 1; i <= totalPages; i++)
    {
        // Reopen the PDF so the memory held by the previous Document instance can be reclaimed.
        doc = new Aspose.Pdf.Document(path);

        stopWatch.Restart();
        textAbsorber.Visit(doc.Pages[i]);
        stopWatch.Stop();

        // Log the per-page extraction time and the current managed heap size.
        Console.WriteLine(String.Format("Page #{0} time:{1}ms memory:{2}MB", i, stopWatch.ElapsedMilliseconds, GC.GetTotalMemory(false) / (1024 * 1024)));
    }

    return textAbsorber.Text;
}

However, it's obviously not very efficient to reopen the PDF on every iteration of the loop, especially when the PDF can contain hundreds of pages.
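
One possible middle ground is to reopen the PDF once per batch of pages instead of once per page. The following is only a sketch and has not been measured against the example PDF. The class name PdfTextBatchExtractor and the pagesPerBatch parameter are hypothetical. It also assumes that Aspose.Pdf.Document implements IDisposable in the version in use, that TextAbsorber lives in the Aspose.Pdf.Text namespace, and that the TextAbsorber keeps the text it has already collected after the Document is disposed.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

internal static class PdfTextBatchExtractor
{
    // Hypothetical helper: reopens the PDF once per batch of pages rather than
    // once per page, trading some peak memory for fewer reopen operations.
    public static string GetPdfText(string path, int pagesPerBatch = 10)
    {
        if (pagesPerBatch < 1)
        {
            throw new ArgumentOutOfRangeException("pagesPerBatch");
        }

        // Open the document once just to read the page count, then release it.
        // Assumption: Document implements IDisposable in the installed version.
        int totalPages;
        using (Document probe = new Document(path))
        {
            totalPages = probe.Pages.Count;
        }

        // A single absorber accumulates text across all Document instances,
        // as in the per-page workaround above.
        TextAbsorber textAbsorber = new TextAbsorber();

        // Page indices in doc.Pages start at 1.
        for (int first = 1; first <= totalPages; first += pagesPerBatch)
        {
            int last = Math.Min(first + pagesPerBatch - 1, totalPages);

            // A fresh Document per batch, so the memory built up while visiting
            // these pages can be reclaimed before the next batch starts.
            using (Document doc = new Document(path))
            {
                for (int i = first; i <= last; i++)
                {
                    textAbsorber.Visit(doc.Pages[i]);
                }
            }
        }

        return textAbsorber.Text;
    }
}
```

With pagesPerBatch set to 1 this behaves like the per-page workaround. Larger values reopen the file less often but let more memory build up within each batch, so a suitable value would have to be found by testing against the actual documents.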

Hi Jason,


Thanks for your patience.

The development team has been busy resolving other priority issues, and I am afraid the problem described above is not yet resolved. As soon as we have an update on its resolution, I will let you know the status of the correction. Please be patient and allow us a little time. We are sorry for this delay and inconvenience.

Meanwhile, I have also shared the workaround information with the development team, as it might be helpful in rectifying this problem.

The issues you have found earlier (filed as PDFNEWNET-36809) have been fixed in Aspose.Pdf for .NET 11.6.0.

