Significant lock contention in parallelized Scene.Open with fbx files

bortos · July 30, 2021, 9:19am

I have a custom application that processes many scenes, all saved as fbx files.

I’ve noticed that performance is limited with no apparent bottlenecks - and this doesn’t change even when I increase parallelism.

After installing JetBrains dotTrace I am able to profile my application and get a good breakdown on what each thread is doing.

From what I’ve gathered the significant majority of time is spent on lock contention within one static class:
internal static class \u0023\u003DqKzSFbguYx3NkUQbRcVH3CP5GWWsnrNIs5bdQqA0ZiME\u003D

When I debug the workers all seem to be stuck on this Monitor.Enter bit:

image.png (317.4 KB)

See the following dotTrace screenshot:

image.png (445.2 KB)

I will endeavour to create a reproducible project that demonstrates the issue.

asad.ali · July 30, 2021, 9:41pm

@bortos

It would really be appreciated if you can share a sample project to demonstrate the issue that you mentioned. We will surely investigate the case by testing the scenario in our environment and address it accordingly.

bortos · August 2, 2021, 5:45am

Hi Asad

I’ve made a simple console app that exhibits some of the performance issues I’m encountering, though not to the full extent seen in my app. I think the cause is the same Monitor.Enter mentioned earlier in this thread.

Please see the OneDrive link to the zipped dotnet project:
https://1drv.ms/u/s!ArCE2vd0EvhJhW58cnDF84DOGaVm?e=KzadFe

You will need to fix up the Aspose.3D license file.
The console app runs without any console args and defaults to 1 worker. Adding --parallel 12 to the args increases the worker count to 12. The app also loads the target file up front and keeps the bytes in memory to rule out IO.
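
For readers without access to the zipped project, the measurement described above can be sketched roughly as follows. This is not the code from the linked project: the class name `OpenBenchmark`, its methods, and the `open` callback are assumptions made for illustration. In the real app the callback would wrap the Aspose.3D load call (presumably something like `s => new Scene().Open(s)`).

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class OpenBenchmark
{
    // Reads "--parallel N" from the args; falls back to 1 worker.
    public static int ParseWorkers(string[] args)
    {
        int i = Array.IndexOf(args, "--parallel");
        return i >= 0 && i + 1 < args.Length
               && int.TryParse(args[i + 1], out int n) && n > 0 ? n : 1;
    }

    // Starts `workers` threads. Each one calls `open` `samples` times on its own
    // read-only MemoryStream over the shared bytes (no disk IO inside the timed
    // region) and reports its average seconds per call.
    public static double[] Run(byte[] fileBytes, int workers, int samples, Action<Stream> open)
    {
        var tasks = Enumerable.Range(0, workers).Select(_ => Task.Factory.StartNew(() =>
        {
            var sw = new Stopwatch();
            for (int i = 0; i < samples; i++)
            {
                using var ms = new MemoryStream(fileBytes, writable: false);
                sw.Start();
                open(ms); // hypothetical stand-in for the scene load
                sw.Stop();
            }
            return sw.Elapsed.TotalSeconds / samples;
        }, TaskCreationOptions.LongRunning)).ToArray();

        Task.WaitAll(tasks);
        return tasks.Select(t => t.Result).ToArray();
    }
}
```

Calling `OpenBenchmark.Run(bytes, OpenBenchmark.ParseWorkers(args), 12, open)` returns one average per worker, which corresponds to the "Worker N average scene open time" lines in the results.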

Here are the results on my work PC (16-core, 32-thread AMD Ryzen 9 3950X) without the debugger attached:

1 thread:

Total average over 1 workers was 1.0 seconds from 12 samples

From this we can establish that the file takes about a second to load…

2 threads (roughly 5% CPU):

Worker 0 average scene open time: 1.5 seconds for 12 samples
Worker 1 average scene open time: 1.5 seconds for 12 samples
Total average over 2 workers was 1.5 seconds from 24 samples

6 threads:

Worker 0 average scene open time: 3.7 seconds for 12 samples
Worker 1 average scene open time: 3.7 seconds for 12 samples
Worker 2 average scene open time: 3.8 seconds for 12 samples
Worker 3 average scene open time: 3.6 seconds for 12 samples
Worker 4 average scene open time: 3.7 seconds for 12 samples
Worker 5 average scene open time: 3.6 seconds for 12 samples
Total average over 6 workers was 3.7 seconds from 72 samples

12 threads:

Worker 0 average scene open time: 11.2 seconds for 12 samples
Worker 1 average scene open time: 10.7 seconds for 12 samples
Worker 2 average scene open time: 11.3 seconds for 12 samples
Worker 3 average scene open time: 11.3 seconds for 12 samples
Worker 4 average scene open time: 11.2 seconds for 12 samples
Worker 5 average scene open time: 11.2 seconds for 12 samples
Worker 6 average scene open time: 11.3 seconds for 12 samples
Worker 7 average scene open time: 10.7 seconds for 12 samples
Worker 8 average scene open time: 11.3 seconds for 12 samples
Worker 9 average scene open time: 11.3 seconds for 12 samples
Worker 10 average scene open time: 11.3 seconds for 12 samples
Worker 11 average scene open time: 11.3 seconds for 12 samples
Total average over 12 workers was 11.2 seconds from 144 samples

16 threads:

Total average over 16 workers was 12.2 seconds from 192 samples

32 threads (overall CPU usage ranges between 13% and 22%; hard to get a good reading):

Total average over 32 workers was 32.3 seconds from 384 samples

image.png (31.3 KB)

As you can see, the same file takes longer and longer to load with each additional thread running. This does not appear to be limited to fbx files, as the load time for a glb file also increased.
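
To make the scaling problem concrete, the averages can be converted into aggregate throughput (workers divided by average seconds per scene). The sketch below only redoes arithmetic on the totals reported above; the class and member names are made up for illustration.

```csharp
using System;
using System.Linq;

public static class Throughput
{
    // (workers, average seconds per scene) copied from the "Total average" lines above.
    public static readonly (int Workers, double AvgSeconds)[] Reported =
    {
        (1, 1.0), (2, 1.5), (6, 3.7), (12, 11.2), (16, 12.2), (32, 32.3)
    };

    // Scenes completed per second across all workers combined.
    public static double ScenesPerSecond(int workers, double avgSecondsPerScene)
        => workers / avgSecondsPerScene;

    // One formatted line per run, e.g. for printing with Console.WriteLine.
    public static string[] Report((int Workers, double AvgSeconds)[] runs)
        => runs.Select(r =>
               $"{r.Workers,2} workers: {ScenesPerSecond(r.Workers, r.AvgSeconds):F2} scenes/s")
           .ToArray();
}
```

With these numbers the aggregate throughput stays between roughly 1.0 and 1.6 scenes per second regardless of worker count, which is what you would expect if most of the load path were effectively serialized.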

When I pause the debugger with 16 workers running, the “Parallel Stacks” view shows 15 threads stuck on Monitor.Enter:
image.png (292.1 KB)

This has become an issue for me recently because I’m working on reducing the memory used by massive CAD files by breaking them up into lots of smaller files. This way areas can be stored on disk and loaded when they are needed, instead of keeping the entire thing in memory. However, this has reduced performance to a crawl.

With 16 threads, JetBrains dotTrace shows that 81% of the time is spent on lock contention:

image.png (28.9 KB)

asad.ali · August 2, 2021, 6:32pm

@bortos

An issue has been logged as THREEDNET-918 in our issue tracking system for detailed investigation. We will look into it and let you know as soon as it is fixed. Please bear with us in the meantime.

We are sorry for the inconvenience.

aspose.notifier · August 12, 2021, 6:33pm

The issues you have found earlier (filed as THREEDNET-918) have been fixed in Aspose.3D for .NET 21.8.

bortos · August 13, 2021, 9:47am

Hi

Thanks for the update. I have noticed a decent improvement in performance!

There is still a large reduction in performance when running multiple threads with <PackageReference Include="Aspose.3D" Version="21.8.0" />, and only a fraction of the PC’s resources are being used.

Using the same project attached earlier with no debugger attached:

1 worker: Total average over 1 workers was 0.9 seconds from 12 samples
2 workers: Total average over 2 workers was 1.3 seconds from 28 samples
4 workers: Total average over 4 workers was 2.1 seconds from 32 samples
8 workers: Total average over 8 workers was 3.9 seconds from 96 samples
12 workers: Total average over 12 workers was 5.7 seconds from 144 samples
16 workers: Total average over 16 workers was 7.5 seconds from 192 samples
32 workers: Total average over 32 workers was 17.3 seconds from 256 samples

I can’t really make much sense of the dotTrace timeline to find any particular issues. Both flame graphs seem proportional between 1 worker scene load (~1 second) and 32 worker scene load.

Is parallelism expected to cause such a rapid reduction in performance? It seems odd that with 4 workers, each worker takes over twice as long to load a scene as a single worker does.

Thanks

asad.ali · August 13, 2021, 7:09pm

@bortos

The lock contention was reduced dramatically after we upgraded to the latest obfuscator. However, GC Wait may still affect CPU usage during parallel testing. We have improved the memory allocation internally. We will update you further after investigating your recent feedback.

bortos · October 6, 2021, 2:50am

Hi, is this still being looked into?

asad.ali · October 6, 2021, 7:05pm

@bortos

According to our analysis, since 21.8 the lock contention is mainly caused by the GC. We’ll improve the memory allocation performance to reduce GC activity in the FBX loader soon. As a temporary workaround, use multiple processes instead of multiple threads to avoid the lock issue.
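
A minimal sketch of what such a multi-process setup could look like is shown here. It is an assumption about how the workaround might be applied, not code from Aspose: `workerExe` is a hypothetical executable that opens every file passed on its command line, and the class and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class MultiProcess
{
    // Round-robin split of the input files into one shard per child process.
    public static List<List<string>> Shard(IReadOnlyList<string> files, int processes)
    {
        var shards = Enumerable.Range(0, processes).Select(_ => new List<string>()).ToList();
        for (int i = 0; i < files.Count; i++)
            shards[i % processes].Add(files[i]);
        return shards;
    }

    // Starts one child process per non-empty shard, waits for all of them,
    // and returns their exit codes. Each child has its own heap and its own GC,
    // so the children do not wait on each other's collections.
    public static int[] RunAll(string workerExe, IReadOnlyList<string> files, int processes)
    {
        var children = new List<Process>();
        foreach (var shard in Shard(files, processes).Where(s => s.Count > 0))
        {
            var psi = new ProcessStartInfo(workerExe) { UseShellExecute = false };
            foreach (var file in shard)
                psi.ArgumentList.Add(file); // one argument per file, no manual quoting
            children.Add(Process.Start(psi));
        }

        var exitCodes = new List<int>();
        foreach (var child in children)
        {
            using (child)
            {
                child.WaitForExit();
                exitCodes.Add(child.ExitCode);
            }
        }
        return exitCodes.ToArray();
    }
}
```

Passing file paths as arguments keeps the sketch short, but command lines have length limits; for thousands of files a list file or a pipe per child would be a more robust choice.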

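        // Requires: using System.Collections.Generic; using System.IO;
        // Simulates the vector allocations made while loading Export1.fbx.
        // The stream must contain at least numMeshes * 2 * numVecs * 3 floats,
        // otherwise ReadSingle throws EndOfStreamException.
        private static void Worker(MemoryStream ms)
        {
            const int numMeshes = 9659;      // There are 9659 meshes in Export1.fbx
            const int totalVec4s = 4822104;  // Total number of 4-component vectors
            const int numVecs = totalVec4s / 2 / numMeshes; // Vectors per array

            // Note: disposing the reader also closes the underlying stream.
            using var reader = new BinaryReader(ms);
            var scene = new List<List<float[]>>();
            for (int i = 0; i < numMeshes; i++)
            {
                var mesh = new List<float[]>();
                for (int k = 0; k < 2; k++) // k = 0: control points, k = 1: normal data
                {
                    var vecs = new float[numVecs * 4];
                    for (int j = 0, p = 0; j < numVecs; j++)
                    {
                        vecs[p++] = reader.ReadSingle(); // x
                        vecs[p++] = reader.ReadSingle(); // y
                        vecs[p++] = reader.ReadSingle(); // z
                        vecs[p++] = k;                   // w
                    }
                    mesh.Add(vecs);
                }
                scene.Add(mesh);
            }
        }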

The code snippet above simulates the memory allocation of the vectors in Export1.fbx. According to dotTrace, GC Wait still leaves multiple cores starved when this snippet runs in parallel.
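
One rough way to check how much GC activity a workload like this triggers, without a profiler, is to compare the runtime's collection counters before and after a run. The helper below is only an illustrative sketch (the `GcProbe` name and its methods are assumptions); it uses the standard `GC.CollectionCount` and `GCSettings` APIs. The counters are process-wide, so with several workers running they include collections caused by all threads.

```csharp
using System;
using System.Runtime;

public static class GcProbe
{
    // Runs `work` and returns how many gen0/gen1/gen2 collections
    // occurred in the whole process while it was running.
    public static (int Gen0, int Gen1, int Gen2) CountCollections(Action work)
    {
        int g0 = GC.CollectionCount(0);
        int g1 = GC.CollectionCount(1);
        int g2 = GC.CollectionCount(2);
        work();
        return (GC.CollectionCount(0) - g0,
                GC.CollectionCount(1) - g1,
                GC.CollectionCount(2) - g2);
    }

    // Reports which GC flavour and latency mode the process is using.
    public static string Describe()
        => $"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}";
}
```

For example, `GcProbe.CountCollections(() => Worker(stream))` around a single call gives a baseline that can be compared against the counts seen while 16 or 32 workers run.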