Attempting to upload the file (removed) causes an Exception of type ‘System.OutOfMemoryException’ to be thrown. Stack trace below.
Exception of type ‘System.OutOfMemoryException’ was thrown.
at System.IO.MemoryStream.set_Capacity(Int32 value)
at System.IO.MemoryStream.EnsureCapacity(Int32 value)
at System.IO.MemoryStream.Write(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
at System.IO.StreamWriter.Write(Char[] buffer, Int32 index, Int32 count)
at System.IO.TextWriter.WriteLine(String value)
at .(Stream , Encoding )
at .()
at …ctor(List`1 , Rectangle , TextExtractionOptions )
at .(TextExtractionOptions )
at Aspose.Pdf.Text.TextAbsorber.( , Boolean )
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.PageCollection.Accept(TextAbsorber visitor)
at CMS.BusinessLayer.ContentFileManager.ExtractPdfContent(Content fileContent, CmsFile oFile) in C:\Dev\Master\IrmsWeb\src\Cms\CMS.BusinessLayer\Content\ContentFileManager.vb:line 886
We tested the scenario using the following code snippet with Aspose.PDF for .NET 18.12 and were unable to reproduce the exception.
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document("D:\\Enoxaparin MDV 451429A.pdf");
TextAbsorber ta = new TextAbsorber();
pdfDocument.Pages.Accept(ta);
Would you please share the complete code snippet you are using when the issue occurs? Please also share your environment details so that we can test the scenario in our environment and address it accordingly. We tested in the following environment: Windows 10 EN x64, Console App x64 in Debug mode, .NET Core 2.1, Visual Studio 2017 Community Edition, with 8GB of RAM installed.
Thank you for your quick response! Here is the requested information:
Complete Code Snippet
Private Shared Sub ExtractPdfContent(ByVal fileContent As Content, ByVal oFile As CmsFile)
Dim inFile As String = Nothing
inFile = oFile.FilePath & fileContent.FileName
Dim impersonateUser As OBA.Core.Security.SecureAccessUser = OBA.Core.Security.SecureAccessUser.GetSecureAccessUser()
Using New OBA.Core.Security.Impersonator(impersonateUser.UserName, impersonateUser.Domain, impersonateUser.Password)
Try
'open document
Dim doc As New Aspose.Pdf.Document(inFile)
'create TextAbsorber object to extract text
Dim textAbsorber As New TextAbsorber()
'accept the absorber for all the pages
doc.Pages.Accept(textAbsorber)
'get the extracted text
Dim extractedText As String = textAbsorber.Text
fileContent.Text = extractedText
fileContent.IsCustomDocumentText = False
Catch ex As System.IO.IOException
If ex.Message.StartsWith("Wrong text extracting, please check your pdf") Then
If ContentManager.AllowDocumentTextEditForContent() Then
SessionFeedback.SetFeedback(Resource.Resource.ID_PROBLEMEXTRACTINGTEXT, SessionFeedback.FeebackMode.Information)
Else
SessionFeedback.SetFeedback(Resource.Resource.ID_FILETEXTNOTEXTRACTED, SessionFeedback.FeebackMode.Information)
End If
End If
Catch ex As Exception
Throw
End Try
End Using
End Sub
Environment Details
Windows 10 EN x64
ASP.NET x64 Debug Mode
Microsoft .NET Framework 4.7.1
Visual Studio 2017 Professional Edition
12GB of RAM installed.
We tested the scenario again in a configuration similar to the one you shared and were not able to replicate the issue. Please note that we need to replicate the issue on our side in order to address it. Would you please share a sample application that reproduces the issue? We will test it in our environment and address it accordingly.
We have identified that the Accept method call is responsible for the large performance impact. On further investigation, we got mixed results when trying to replicate the timeout, which seemingly depended on where we executed the code (local vs server). Nonetheless, even when the call succeeded, the performance was unacceptable.
Also, we found a related thread which indicated that this issue has recently been addressed. Is that the case?
That said, do you have any suggestions on how we can improve the performance of the code snippet provided?
The other issue, in the post you linked, was related to huge operator collections on particular pages of the document. We have already improved TextAbsorber to handle such large documents. If it helps, you may use TextFormattingMode.MemorySaving in TextExtractionOptions when initializing TextAbsorber. It is almost the same as ‘Raw’ mode but works slightly faster and uses less memory.
Please initialize TextAbsorber as follows:
TextAbsorber absorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
You may additionally reduce memory consumption by processing the document page by page and manually calling Dispose on each processed page object.
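// Use the memory-saving formatting mode during extraction
TextAbsorber absorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
using (Aspose.Pdf.Document doc = new Aspose.Pdf.Document(myDir + "input.pdf"))
{
    foreach (Page page in doc.Pages)
    {
        page.Accept(absorber);
        // Release the processed page before moving on to the next one
        page.Dispose();
    }
}
// The using block disposes doc, so no explicit doc.Dispose() call is needed
string text = absorber.Text;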
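Since your original snippet is in VB.NET, the per-page approach above might look roughly as follows there. This is only a sketch: the module name PerPageExtractionSketch and the helper ExtractTextPerPage are hypothetical names used for illustration and are not part of your code or of the API, and the impersonation and error handling from your method are left out. It uses only the calls already shown in this thread.

```vbnet
Imports Aspose.Pdf.Text

Module PerPageExtractionSketch
    ' Hypothetical helper (not part of the original code): returns the text
    ' of the PDF at inFile, visiting one page at a time.
    Function ExtractTextPerPage(ByVal inFile As String) As String
        ' Memory-saving formatting mode, as suggested above
        Dim options As New TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving)
        Dim absorber As New TextAbsorber(options)

        ' Using disposes the document when the block ends
        Using doc As New Aspose.Pdf.Document(inFile)
            ' Aspose.Pdf.Page is fully qualified to avoid a clash with
            ' System.Web.UI.Page in an ASP.NET project
            For Each page As Aspose.Pdf.Page In doc.Pages
                page.Accept(absorber)
                ' Release the processed page before moving on to the next one
                page.Dispose()
            Next
        End Using

        Return absorber.Text
    End Function
End Module
```

In your ExtractPdfContent method, the result could then be assigned with fileContent.Text = ExtractTextPerPage(inFile) inside the existing Try block.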
If you still face any issues, please let us know and we will assist you further.
I have implemented the suggested changes and can confirm that performance has improved (and the System.OutOfMemoryException is no longer thrown).
It is good to know that your issue has been resolved by implementing the suggested approach. Please keep using our API, and if you face any issues, please feel free to contact us.