Hi team,
I am trying to extract the text content of a 100 MB file using Aspose.Words with the code below:
public override string GetFileContent(string filePath)
{
    string extractedText;
    // Open document
    try
    {
        var doc = new Aspose.Words.Document(filePath);
        extractedText = doc.ToString(SaveFormat.Text);
    }
    catch (Exception ex)
    {
        extractedText = "[Error]" + ex.ToString();
    }
    return extractedText;
}
When the line var doc = new Aspose.Words.Document(filePath); executes, CPU and memory usage spike sharply.
I tried to reduce memory usage by reading the file in chunks, using the code below:
public override string GetFileContent(string filePath)
{
    StringBuilder extractedText = new StringBuilder();
    const int bufferSize = 4 * 1024 * 1024; // 4 MB chunks (adjust based on your needs)
    try
    {
        // Open the file stream for reading
        using (var fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        {
            byte[] buffer = new byte[bufferSize];
            int bytesRead;
            // Read the file in chunks
            while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Create a memory stream for the current chunk
                using (var chunkStream = new MemoryStream(buffer, 0, bytesRead))
                {
                    // Load the document from the chunk stream
                    var doc = new Aspose.Words.Document(chunkStream);
                    //extractedText.Append(doc.ToString(SaveFormat.Text));
                    // Process the document and extract text
                    foreach (Aspose.Words.Section section in doc.Sections)
                    {
                        foreach (Aspose.Words.Paragraph paragraph in section.Body.Paragraphs)
                        {
                            extractedText.Append(paragraph.GetText());
                        }
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        return "[Error] " + ex.Message;
    }
    return extractedText.ToString();
}
This approach reduces memory usage somewhat, but CPU utilization remains high.
Can you suggest any alternatives? I am using version 23.7 of Aspose.Total.