Converting Large XLS/XLSX files to Text for searching

I have a product that converts files to Text for searching. Currently we are testing Aspose to see if it will address this challenge in a more efficient way but we are running into some serious performance issues.


Here’s the overview of how the solution works.

1. Product is built in C# .NET
2. Runs as a Windows Service or Console Application
3. Iterates through a file system and grabs specific file types (XLS/XLSX for example)
4. Converts these files to a temp file in text format and runs a search against them.
5. Generates the results.

Below is the code currently being used for conversion. When we hit files in excess of 300MB when running a Parallel.ForEach we see Aspose hanging up for hours or even days. We are seeing 6K+ hard faults/s on a system with over 24GB of RAM. Very little CPU utilization.

Questions:
1. What should be modified to reduce the amount of time/memory utilized to convert a 400mb simple text fields XLSX file to a text file?

2. Is there a better/more efficient way to get the data into text format and write it to a temp file instead of using the array method below?

3. Any other suggestions? We are seeing similar behavior on PDF conversions from time to time.

//CODE BELOW

p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 9.5px Consolas; color: #e3e3e3; background-color: #282828} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 9.5px Consolas; color: #e3e3e3; background-color: #282828; min-height: 11.0px}

public static string ConvertExcelFileToTextString(string i_ExcelFilePath)

{

Aspose.Cells.License lic = new Aspose.Cells.License();

lic.SetLicense(".\\Component\\Aspose.Total.lic");


// Load your source workbook

Workbook workbook = new Workbook(i_ExcelFilePath);


//0-byte array

byte[] workbookData = new byte[0];


// Text save options. You can use any type of separator

TxtSaveOptions opts = new TxtSaveOptions(Aspose.Cells.SaveFormat.CSV);

opts.Separator = '\t';


// Copy each worksheet data in text format inside workbook data array

for (int idx = 0; idx < workbook.Worksheets.Count; idx++)

{

// Save the active worksheet into text format

MemoryStream ms = new MemoryStream();

workbook.Worksheets.ActiveSheetIndex = idx;

workbook.Save(ms, opts);


// Save the worksheet data into sheet data array

ms.Position = 0;

byte[] sheetData = ms.ToArray();


// Combine this worksheet data into workbook data array

byte[] combinedArray = new byte[workbookData.Length + sheetData.Length];

Array.Copy(workbookData, 0, combinedArray, 0, workbookData.Length);

Array.Copy(sheetData, 0, combinedArray, workbookData.Length, sheetData.Length);


workbookData = combinedArray;

}


string OutputTextTempFilePath = Path.GetTempFileName();


// Save entire workbook data into file

ZlpIOHelper.WriteAllBytes(OutputTextTempFilePath, workbookData);


string Result = ZlpIOHelper.ReadAllText(OutputTextTempFilePath);


ZlpIOHelper.DeleteFile(OutputTextTempFilePath);


return (Result);



Hi,


Thanks for your posting and using Aspose.Cells.

In order to save memory consumption, please save your worksheets one by one in .txt files on disk. For example, your first three sheets are saved on disk like a1.txt, a2.txt, a3.txt etc.

Then combine all of these text files via a batch command like

copy /b a1.txt+a2.txt+a3.txt merged.txt

It will combine all the files and you will get a merged.txt which will have all the textual data of your entire workbook (or all of your worksheets).

You can also save loading time of your excel file via


Once, your file is loaded, you will use the same approach as discussed above.