RemoveDuplicates doesn't work or takes too much time

My excel file has only 1 column and over 150 thousand records.
I wanto sort and remove duplicate records.
source.zip (1.5 MB)

button 1:
I call Aspose.cells.RemoveDuplicates.
It takes too much time.
I don’t get the result.

button 2:
I remove duplicate records by myself.
It takes about 50 seconds.
Can it be faster?

Aspose.Cells for .Net.
Version:20.6
Visual Studio Community 2019

private void button1_Click(object sender, EventArgs e)
{
Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        wb.DataSorter.AddKey(0, Aspose.Cells.SortOrder.Ascending);
        wb.DataSorter.Sort(cells, 1, 0, RowNumbers, ColNumbers);
        cells.RemoveDuplicates(1, 0, RowNumbers, ColNumbers);
        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss")+"dest.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        MessageBox.Show("Time elapsed:" + elapse.ToString());
    }

    private void button2_Click(object sender, EventArgs e)
    {
        Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        wb.DataSorter.AddKey(0, Aspose.Cells.SortOrder.Ascending);
        wb.DataSorter.Sort(cells, 1, 0, RowNumbers, ColNumbers);

        for (int r = 1; r < RowNumbers; ++r)
        {
            string v1 = cells[r, 0].StringValue;
            int r2 = r + 1;
            for (; r2 < RowNumbers; ++r2)
            {
                string v2 = cells[r2, 0].StringValue;
                if (v1 != v2) break;
            }
            if (r + 1 < r2)
            {
                cells.DeleteRows(r + 1, r2 - r - 1);
                RowNumbers -= r2 - r - 1;
            }
        }

        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "dest.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        MessageBox.Show("Time elapsed:" + elapse.ToString());
    }

@fangrk.sz,

Thanks for the sample code segment and template file.

After an initial test, I am able to reproduce the issue as you mentioned by using your sample code and template file. I found Worksheet.Cells.RemoveDuplicates does not work or it is taking too much time. I waited for sometime and then stopped the process. I have logged a ticket with an id “CELLSNET-47433” for your issue. We will look into it to figure it out soon.

Once we have an update on it, we will let you know here.

Thanks.
Hoping good news soon.

RemoveDuplicatesTest.zip (6.9 MB)

I found that DeleteRow(s) takes a low efficiency.

button2:
I remove the duplicated rows by DeleteRows.
In my demo, the process takes about 50 secondes.

button3:
I copy the result records to a new excel file.
It takes only 5 seconds.

private void button2_Click(object sender, EventArgs e)
{
Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        wb.DataSorter.AddKey(0, Aspose.Cells.SortOrder.Ascending);
        wb.DataSorter.Sort(cells, 1, 0, RowNumbers, ColNumbers);

        for (int r = 1; r < RowNumbers; ++r)
        {
            string v1 = cells[r, 0].StringValue;
            int r2 = r + 1;
            for (; r2 < RowNumbers; ++r2)
            {
                string v2 = cells[r2, 0].StringValue;
                if (v1 != v2) break;
            }
            if (r + 1 < r2)
            {
                cells.DeleteRows(r + 1, r2 - r - 1);
                RowNumbers -= r2 - r - 1;
            }
        }

        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_button2.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        MessageBox.Show("Time elapsed:" + elapse.ToString());
    }

    private void button3_Click(object sender, EventArgs e)
    {
        Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        wb.DataSorter.AddKey(0, Aspose.Cells.SortOrder.Ascending);
        wb.DataSorter.Sort(cells, 1, 0, RowNumbers, ColNumbers);

        Aspose.Cells.Workbook destWB = new Aspose.Cells.Workbook();
        Aspose.Cells.Cells destCells = destWB.Worksheets[0].Cells;
        int destRow = -1;
        destCells.CopyRow(cells, 0, ++destRow);

        for (int r = 1; r < RowNumbers;)
        {
            destCells.CopyRow(cells, r, ++destRow);
            string v1 = cells[r, 0].StringValue;
            int r2 = r + 1;
            for (; r2 < RowNumbers; ++r2)
            {
                string v2 = cells[r2, 0].StringValue;
                if (v1 != v2) break;
            }
            r = r2;
        }

        destWB.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_button3.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        MessageBox.Show("Time elapsed:" + elapse.ToString());
    }

@fangrk.sz,
We are looking into this issue and will share our feedback soon.

@fangrk.sz,

Thanks for sharing manual way to remove duplicates. We will check if your code segment needs some tweak to minimize the time cost. But since we will fix the original issue (first) for Cells.RemoveDuplicates, so you might not require your own code segment for the task.

Please spare us little time, we will try to fix your issue soon.

@fangrk.sz,
When Cells.RemoveDuplicates is called , we could not know whether the data is sorted, so we have to iterate all rows to find duplicate rows.
There are 156 thousands of rows and only about 30 thousands of duplicate rows in your template file, it takes more time to iterate rows.
We need more time to improve performance for this feature.

RemoveDuplicate.zip (5.9 MB)

Remove duplicate without sorted records.
I deleted most of the records in the excel file, only keeping the top 50K.

1: call aspose.cells.RemoveDuplicate
2: hashset and DeleteRow
3: hashset and CopyRow

It seems that Copy is more efficient than Delete。And I hope the 20.7 version can brings a pleasant surprise.
Thank you.

Code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using Aspose.Cells;

namespace RemoveDuplicate
{
class Program
{
static private void Aspose_RemoveDuplicate()
{
Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        cells.RemoveDuplicates(1, 0, RowNumbers, ColNumbers);
        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_AsposeRemoveDuplicate.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        Console.WriteLine("Aspose_RemoveDuplicate's done. Time elapsed:" + elapse.ToString());
    }
    static private void DeleteRow_HashSet()
    {
        Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        HashSet<string> IsDuplicate = new HashSet<string>();
        for (int r = 1; r < RowNumbers;)
        {
            string v1 = cells[r, 0].StringValue;
            if (IsDuplicate.Contains(v1))
            {
                cells.DeleteRow(r);
                --RowNumbers;
            }
            else
            {
                IsDuplicate.Add(v1);
                ++r;
            }
        }

        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_DeleteRowHashSet.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        Console.WriteLine("DeleteRow_HashSet's done. Time elapsed:" + elapse.ToString());
    }
    static private void CopyRow_HashSet()
    {
        Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        Aspose.Cells.Workbook destWB = new Aspose.Cells.Workbook();
        Aspose.Cells.Cells destCells = destWB.Worksheets[0].Cells;

        HashSet<string> IsDuplicate = new HashSet<string>();

        int destRow = -1;
        destCells.CopyRow(cells, 0, ++destRow);

        for (int r = 1; r < RowNumbers; ++r)
        {
            string v1 = cells[r, 0].StringValue;
            if (IsDuplicate.Contains(v1))
            {
            }
            else
            {
                IsDuplicate.Add(v1);
                destCells.CopyRow(cells, r, ++destRow);
            }
        }

        destWB.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_CopyRowHashSet.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        Console.WriteLine("CopyRowHashSet's done. Time elapsed:" + elapse.ToString());
    }
    static void Main(string[] args)
    {
        Aspose_RemoveDuplicate();
        DeleteRow_HashSet();
        CopyRow_HashSet();
    }
}

}

run:
E:>cd E:\RemoveDuplicate\bin\Release

E:\RemoveDuplicate\bin\Release>removeduplicate
Aspose_RemoveDuplicate’s done. Time elapsed:96.984
DeleteRow_HashSet’s done. Time elapsed:3.978
CopyRowHashSet’s done. Time elapsed:0.536

E:\RemoveDuplicate\bin\Release>removeduplicate
Aspose_RemoveDuplicate’s done. Time elapsed:95.635
DeleteRow_HashSet’s done. Time elapsed:3.773
CopyRowHashSet’s done. Time elapsed:0.532

@fangrk.sz,
We have noted this information and will consider it while working on this issue.

@fangrk.sz,
Please try our latest version/fix: Aspose.Cells for .NET v20.6.4:

Aspose.Cells20.6.4 For .Net2_AuthenticodeSigned.Zip (5.4 MB)
Aspose.Cells20.6.4 For .Net4.0.Zip (5.4 MB)

Your issue should be fixed in it.

Let us know your feedback.

ConsoleApp1.zip (6.8 MB)

I have tried your new dll file, see attachment。

E:\ConsoleApp1\bin\Release>ConsoleApp1.exe
ver:20.6.4
Aspose_RemoveDuplicate’s done. Time elapsed:260.968
DeleteRow_HashSet’s done. Time elapsed:51.262
CopyRow_HashSet’s done. Time elapsed:1.933

class Program
{
static private void Aspose_RemoveDuplicate()
{
Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start(); Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook(“source.xlsx”);
Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
int RowNumbers = cells.MaxDataRow + 1;
int ColNumbers = cells.MaxDataColumn + 1;
cells.RemoveDuplicates(1, 0, RowNumbers, ColNumbers);
wb.Save(DateTime.Now.ToString(“yyyy-MM-dd HH.mm.ss”) + “_AsposeRemoveDuplicate.xlsx”);
sw.Stop();
double elapse = sw.ElapsedMilliseconds / 1000.0;
Console.WriteLine(“Aspose_RemoveDuplicate’s done. Time elapsed:” + elapse.ToString());
}
static private void DeleteRow_HashSet()
{
Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        HashSet<string> IsDuplicate = new HashSet<string>();
        for (int r = 1; r < RowNumbers;)
        {
            string v1 = cells[r, 0].StringValue;
            if (IsDuplicate.Contains(v1))
            {
                cells.DeleteRow(r);
                --RowNumbers;
            }
            else
            {
                IsDuplicate.Add(v1);
                ++r;
            }
        }

        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_DeleteRowHashSet.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        Console.WriteLine("DeleteRow_HashSet's done. Time elapsed:" + elapse.ToString());
    }
    static private void CopyRow_HashSet()
    {
        Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook("source.xlsx");
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        Aspose.Cells.Workbook destWB = new Aspose.Cells.Workbook();
        Aspose.Cells.Cells destCells = destWB.Worksheets[0].Cells;

        HashSet<string> IsDuplicate = new HashSet<string>();

        int destRow = -1;
        destCells.CopyRow(cells, 0, ++destRow);

        for (int r = 1; r < RowNumbers; ++r)
        {
            string v1 = cells[r, 0].StringValue;
            if (IsDuplicate.Contains(v1))
            {
            }
            else
            {
                IsDuplicate.Add(v1);
                destCells.CopyRow(cells, r, ++destRow);
            }
        }

        destWB.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_CopyRowHashSet.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        Console.WriteLine("CopyRow_HashSet's done. Time elapsed:" + elapse.ToString());
    }
    static void Main(string[] args)
    {
        Console.WriteLine("ver:"+Aspose.Cells.CellsHelper.GetVersion());
        Aspose_RemoveDuplicate();
        DeleteRow_HashSet();
        CopyRow_HashSet();
    }
}

@fangrk.sz,
We have observed this behavior and logged it in our database for further investigation. We will write back soon to share our feedback.

@fangrk.sz,
For the performance of new fix(20.6.4), time cost should be less than 20% of the old versions.

For most use cases, commonly there are multiple columns with multiple kinds of cell values, so HashSet is not the suitable solution for us to implement Cells.RemoveDuplicates() and it is impossible to get the same performance with this special case.

For checking duplicated records, the cost is n*n(count of cells data). For your second “source.xlsx” the dataset is more than 3 times of the old one, so the time cost with the new source.xlsx will be about 10 times of the old one and we think the time cost of 260s is reasonable.

So, we are afraid we cannot improve the performance of this API further currently. For your special case, if you need much better performance, you should implement the logic by yourself instead of using Cells.RemoveDuplicates(). As the code shows, HashSet may be one suitable solution.

RemoveDuplicate.zip (7.2 MB)

I rewrote the code and didn’t use containers like HashSet.

A new auxiliary column is added to mark the original row order (1,2,3…).

Then sort by the target column. After sorting, find out duplicate records and set -1 in the auxiliary column.

Then sort the auxiliary columns. Those records marked -1 will be at the top, and the reserved data will remain in the original order.

Finally, remove the auxiliary columns and the duplicate rows.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using Aspose.Cells;

namespace RemoveDuplicate
{
class Program
{
static private void Aspose_RemoveDuplicate(string xlsxName)
{
Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook(xlsxName);
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;

        Console.WriteLine("Aspose_RemoveDuplicate.");
        Console.WriteLine("File:{0}\tRowCount:{1}", xlsxName, RowNumbers);

        cells.RemoveDuplicates(1, 0, RowNumbers, ColNumbers);
        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_AsposeRemoveDuplicate.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        Console.WriteLine("Done. RowCount:{0}. Time elapsed:{1}", cells.MaxDataRow + 1, elapse.ToString());
        Console.WriteLine();
    }
    static private void MrFang_RemoveDuplicate(string xlsxName)
    {
        Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        Aspose.Cells.Workbook wb = new Aspose.Cells.Workbook(xlsxName);
        Aspose.Cells.Cells cells = wb.Worksheets[0].Cells;
        int RowNumbers = cells.MaxDataRow + 1;
        int ColNumbers = cells.MaxDataColumn + 1;
        Console.WriteLine("Mr.Fang_RemoveDuplicate.");
        Console.WriteLine("File:{0}\tRowCount:{1}", xlsxName,RowNumbers);

        cells[0, ColNumbers].PutValue("raw_order");

        for(int r=1;r< RowNumbers; ++r)
        {
            cells[r, ColNumbers].PutValue(r);
        }
        wb.DataSorter.AddKey(0, SortOrder.Ascending);
        wb.DataSorter.Sort(cells, 1, 0, RowNumbers, ColNumbers+1 );

        string v1 = cells[1, 0].StringValue;
        int dup_count = 0;
        for (int r = 2; r < RowNumbers;)
        {
            string v2;
            while ((v2=cells[r,0].StringValue)==v1)
            {
                cells[r, ColNumbers].PutValue(-1);
                ++r;
                ++dup_count;                    
            }
            v1 = v2;
            ++r;
        }
        wb.DataSorter.Clear();
        wb.DataSorter.AddKey(ColNumbers, SortOrder.Ascending);
        wb.DataSorter.Sort(cells, 1, 0, RowNumbers, ColNumbers+1 );
        cells.DeleteColumn(ColNumbers);
        cells.DeleteRows(1, dup_count);
        
        wb.Save(DateTime.Now.ToString("yyyy-MM-dd HH.mm.ss") + "_MrFang_RemoveDuplicate.xlsx");
        sw.Stop();
        double elapse = sw.ElapsedMilliseconds / 1000.0;
        Console.WriteLine("Done. RowCount:{0}. Time elapsed:{1}" , cells.MaxDataRow + 1, elapse.ToString());
        Console.WriteLine();
    }
    
    static void Main(string[] args)
    {
        Console.WriteLine("ver:{0}", Aspose.Cells.CellsHelper.GetVersion());
        Console.WriteLine();
        Aspose_RemoveDuplicate("source-small.xlsx");
        Aspose_RemoveDuplicate("source-big.xlsx");
        MrFang_RemoveDuplicate("source-small.xlsx");
        MrFang_RemoveDuplicate("source-big.xlsx");
        Console.ReadKey();
    }
}

}

E:\RemoveDuplicate\bin\Release>RemoveDuplicate
ver:20.6.4

Aspose_RemoveDuplicate.
File:source-small.xlsx RowCount:50576
Done. RowCount:37444. Time elapsed:19.127

Aspose_RemoveDuplicate.
File:source-big.xlsx RowCount:156039
Done. RowCount:128106. Time elapsed:217.957

Mr.Fang_RemoveDuplicate.
File:source-small.xlsx RowCount:50576
Done. RowCount:37444. Time elapsed:2.013

Mr.Fang_RemoveDuplicate.
File:source-big.xlsx RowCount:156039
Done. RowCount:128106. Time elapsed:8.296

@fangrk.sz,
We are analyzing this code and will share our feedback after detailed analysis.

@fangrk.sz,
Thanks for your example code. Though the code and scenario (only one column and only string values) is not suitable for common cases, the thought did help us to improve the performance further. We will provide new fix with more improvements soon.

@fangrk.sz,
This is to inform you that we have fixed your issue (logged earlier as “CELLSNET-47433”) now. We will soon provide you the fixed version after performing QA and incorporating other enhancements and fixes.

@fangrk.sz,

Please try our latest version/fix: Aspose.Cells for .NET v20.6.6 (attached)

Your issue should be fixed in it.

Let us know your feedback.
Aspose.Cells20.6.6 For .Net2_AuthenticodeSigned.Zip (5.4 MB)
Aspose.Cells20.6.6 For .Net4.0.Zip (5.4 MB)

I have tried this fixed version. Perfect! Thank you.

@angrk.sz,

Good to know that your issue is sorted out by the new version/fix. Feel free to contact us any time if you need further help or have some other issue or queries, we will be happy to assist you soon.