System.OutOfMemoryException when extracting text from some PDF files

cnielsen · November 19, 2014, 8:44am

Hello,

We would like to report a problem we encounter when extracting text from PDF files using Aspose.PDF in .NET. We have noticed that some PDF files cause memory usage to grow until .NET throws an OutOfMemoryException. This is a serious problem since we use this functionality on a server with many other components running and a lot of memory available.

We have tried using the newest version currently available (9.8.0).

The PDF file that causes this issue is attached.

The issue can be reproduced using this code sample:

[C#]
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.InteractiveFeatures.Annotations;
using Aspose.Pdf.Text.TextOptions;
using System.IO;
using System.Text;

namespace AsposePdfTest
{
    class Program
    {
        static void Main(string[] args)
        {
            Document pdfDocument = new Document("Example1.pdf");

            string extractedText = "";

            foreach (Page pdfPage in pdfDocument.Pages)
            {
                using (MemoryStream textStream = new MemoryStream())
                {
                    TextDevice textDevice = new TextDevice();

                    // Extract the text in "Pure" formatting mode.
                    TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);

                    textDevice.ExtractionOptions = textExtOptions;

                    // Write the text of the current page into the stream.
                    textDevice.Process(pdfPage, textStream);

                    textStream.Close();

                    // Append this page's text; a plain assignment would keep only the last page.
                    extractedText += Encoding.Unicode.GetString(textStream.ToArray());
                }
            }

            File.WriteAllText("ExtractedText.txt", extractedText, Encoding.Unicode);
        }
    }
}

Regards,
Christoffer

codewarior · November 20, 2014, 3:43am

Hi Christoffer,

Thanks for contacting support. I tested the scenario using Aspose.Pdf for .NET 9.8.0 in a Visual Studio 2012 application targeting .NET Framework 4.0, running on Windows 7 (x64) with a 3.4 GHz Intel CPU and 8 GB of RAM, and I was unable to reproduce the issue. The text is extracted from the input PDF file.

For your reference, I have also attached the text file containing the extracted text. Can you please share some further details that can help us replicate the problem in our environment? We apologize for the inconvenience.

[C#]

Document pdfDocument = new Document("c:/pdftest/Example1.pdf");
string extractedText = "";

foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        TextDevice textDevice = new TextDevice();
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        textStream.Close();
        // Append this page's text; a plain assignment would keep only the last page.
        extractedText += Encoding.Unicode.GetString(textStream.ToArray());
    }
}

File.WriteAllText("c:/pdftest/Example1_ExtractedText.txt", extractedText, Encoding.Unicode);

cnielsen · November 20, 2014, 10:12am

Thank you for the quick reply.

We noticed the problem described in this post while investigating an issue with a process that grows indefinitely. The process is a SQL Job running a .NET Console Application that uses Aspose.PDF to extract text from a number of PDF files.

We have reproduced the problem using the code sample in this post with one of the PDF files that is suspected of causing the issue we are investigating.

We are able to reproduce the problem in Visual Studio if we start the application with debugging (F5) or if we run the code from a unit test. If we start the application without debugging (Ctrl+F5), the exception is not thrown.

codewarior · November 21, 2014, 4:13am

Hi Christoffer,

Thanks for sharing the feedback.

I have tested the scenario again using Aspose.Pdf for .NET in a Windows-based application developed in Visual Studio 2012, using the F5 key to start debugging, and I am still unable to reproduce the problem. Can you please share a sample project that can help us replicate the problem in our environment? We apologize for the inconvenience.

cnielsen · November 24, 2014, 6:15am

Hi Nayyer,

I have attached the sample project that I used to replicate the problem. I have tested on a Windows 7 64-bit workstation with 8 GB RAM and on a Windows 7 workstation with 16 GB RAM.

The problem occurs only when the target platform of the project is set to x86, not when it is set to x64. I have been able to reproduce it using Visual Studio Premium 2012, Visual Studio Premium 2013 and Visual Studio Express 2013.

In my experience, the issue does not always occur in Visual Studio 2012 Premium. Sometimes starting some other applications or rebooting the workstation before debugging the application makes the issue appear again.
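
One plausible reason the platform target matters: a 32-bit (x86) process is limited to roughly 2 GB of user address space by default, regardless of how much RAM the workstation has. Below is a minimal sketch for confirming which kind of process is running and how much memory it holds. `ProcessDiagnostics` and `Describe` are hypothetical names used for illustration only; they are not part of the sample project or of Aspose.PDF.

```csharp
using System;
using System.Diagnostics;

static class ProcessDiagnostics
{
    // Hypothetical helper: returns a one-line summary of the current
    // process bitness and its memory use at the time of the call.
    public static string Describe()
    {
        // Is64BitProcess is false when the project runs as x86.
        string bitness = Environment.Is64BitProcess ? "64-bit" : "32-bit";

        // Managed heap size as seen by the garbage collector (no forced collection).
        long managedBytes = GC.GetTotalMemory(false);

        // Private bytes include unmanaged allocations as well.
        long privateBytes;
        using (Process current = Process.GetCurrentProcess())
        {
            privateBytes = current.PrivateMemorySize64;
        }

        return string.Format(
            "{0} process, managed heap {1:N0} bytes, private bytes {2:N0}",
            bitness, managedBytes, privateBytes);
    }
}
```

Calling `Console.WriteLine(ProcessDiagnostics.Describe());` before and after `textDevice.Process(...)` for each page would show how memory grows in the x86 build compared with the x64 build.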

Thanks,
Christoffer

codewarior · November 25, 2014, 1:24am

Hi Christoffer,

Thanks for sharing the project.

I have tested the scenario and I am able to reproduce the same problem. So that it can be fixed, I have logged it in our issue tracking system as PDFNEWNET-37828. We will investigate this issue in detail and will keep you updated on the status of the fix.

We apologize for the inconvenience.

asad.ali · August 19, 2018, 7:50pm

@cnielsen

Thanks for your patience.

It is possible to avoid the exception by using TextFormattingMode.MemorySaving:

Document pdfDocument = new Document(myDir + "Example1.pdf");
string extractedText = "";
foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        TextDevice textDevice = new TextDevice();
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving);
        textDevice.ExtractionOptions = textExtOptions;
        textDevice.Process(pdfPage, textStream);
        textStream.Close();
        // Append this page's text; a plain assignment would keep only the last page.
        extractedText += Encoding.Unicode.GetString(textStream.ToArray());
    }
}
File.WriteAllText(myDir + "Example1_ExtractedText.txt", extractedText);

Example1_ExtractedText.zip (3.5 KB)

Please note that the extracted text (Example1_ExtractedText.txt) is not readable. Correct text extraction is not possible for this document because the fonts it uses do not contain a ToUnicode entry and also have non-standard glyph names. (See chapter 9.10.2 of the PDF reference.)
Adobe Acrobat is also unable to extract readable text.

Please use Aspose.PDF for .NET 18.8 with the code snippet above, and feel free to let us know if you need further assistance.

aspose.notifier · February 7, 2019, 6:00pm

The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan