.NET Core Parallel.ForEach, Aspose.Words.Document() load not thread safe

I am evaluating Aspose.Words for use as a batch converter for RTF-to-PDF, and have obtained a Trial license.

The program/service in question will receive a “list of files” (either as files to be read from disk or as byte[] arrays) and then process each in turn. This is a classic opportunity to use the TPL and the Parallel.ForEach construct. My test machine has 4 processors available (4 threads), so here we go…

To test this, I built a trivial console application that loads a reasonably complex RTF file into a byte[] array, creates several copies of that array into unique separately allocated MemoryStreams, and then iterates over those streams to process the result. This ensures that there is no memory contention - these are not ‘shared memory’ type operations. Each load/save pair is completely separate.

When 1 processor is used (either with or without Parallel.ForEach), the time to load and save the stream is encouraging. However, when more than one processor is used (MaxDegreeOfParallelism = 2 or 4), I notice that the time taken for new Aspose.Words.Document(inputStream) deteriorates significantly as the number of threads increases. Putting a lock() around the call demonstrates that the Document() load function is not thread safe.
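For context, the test harness looks roughly like the following (a minimal sketch with illustrative names and a hypothetical input file, not the attached project itself): each work item gets its own copy of the RTF bytes in its own MemoryStream, and Parallel.ForEach with MaxDegreeOfParallelism drives the conversions.

    using System;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    class HarnessSketch
    {
        static void Main()
        {
            byte[] rtfBytes = File.ReadAllBytes("sample.rtf");   // hypothetical input file
            int copies = 4;

            // Each work item gets its own independent copy of the bytes in its own
            // MemoryStream, so nothing is shared between iterations.
            var inputs = Enumerable.Range(0, copies)
                                   .Select(_ => new MemoryStream(rtfBytes.ToArray()))
                                   .ToList();

            var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

            Parallel.ForEach(inputs, options, input =>
            {
                var timer = System.Diagnostics.Stopwatch.StartNew();
                var document = new Aspose.Words.Document(input);          // load RTF
                long loadMs = timer.ElapsedMilliseconds;

                timer.Restart();
                using var output = new MemoryStream();
                document.Save(output, Aspose.Words.SaveFormat.Pdf);       // save PDF
                Console.WriteLine($"Load: {loadMs} ms  Save: {timer.ElapsedMilliseconds} ms");
            });
        }
    }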

Here are some results:
Processing 4 documents using 1 threads
Document: doc1 Read: 721 Save: 1066 ms
Document: doc2 Read: 773 Save: 868 ms
Document: doc3 Read: 680 Save: 847 ms
Document: doc4 Read: 601 Save: 886 ms
Total elapsed time: 6482 ms

Now with 4 threads and no lock():
Processing 4 documents using 4 threads
Document: doc2 Read: 2039 Save: 2715 ms
Document: doc4 Read: 2091 Save: 2674 ms
Document: doc3 Read: 2038 Save: 2736 ms
Document: doc1 Read: 2139 Save: 2683 ms
Total elapsed time: 4862 ms

And again with a lock() { } around the Document() load call:

Document: doc2 Read: 628 Save: 1516 ms
Document: doc3 Read: 997 Save: 1350 ms
Document: doc1 Read: 1224 Save: 1185 ms
Document: doc4 Read: 1132 Save: 929 ms
Total elapsed time: 4972 ms

As you can see, with 1 thread (1 document at a time) each document takes ~1.5 seconds. Adding threads without the lock() balloons the per-document time to ~4.7 seconds (as the threads compete). With the lock(), the per-document time improves to ~2.1 seconds, which is still not great. It is likely that serializing only the load with the lock() still lets the Save() calls overlap with each other, so my guess is that Save() is also not thread safe (in fact, now that I think of it, I will test that next, using two lock()s).

Can you please advise the best way to utilize the elegance of Parallel.ForEach to process multiple documents on multiple processors? Is it reasonable to expect near-linear scaling, with the per-document time staying roughly constant as threads are added (perhaps ~1.2x, allowing for TPL overhead)?

I am happy to provide the source code - nothing proprietary. Let me know.

Aspose team, please advise asap. This will determine if we purchase a significant license, including other products in the family such as Aspose.PDF or Aspose.Barcode.

Thank you!

TestRTFtoPDF.zip (121.8 KB)
To short-cut this conversation and analysis, here is the sample project.
Be sure to try different numbers of threads and commenting/uncommenting the lock()s.
Thanks

@dgreenwood,

We have logged this problem in our issue tracking system. Your ticket number is WORDSNET-20769. We will further look into the details of this problem and will keep you updated on the status of the linked issue.

Aspose team,
After doing some more research, I figured I would try (and prove) whether “thread local” storage would help. I wrote another simple console app to prove the point: in the attached app, I utilize Actions with ThreadLocal variables and then run Parallel.Invoke to test. A minimal sketch of the idea is shown below, followed by some results…
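Here is that sketch (illustrative only; the attached ThreadLocal.zip is the real code, and which variables it actually makes ThreadLocal is not shown here — per-thread load options are used purely as an example):

    using System;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    class ThreadLocalSketch
    {
        // Per-thread load options: each thread lazily creates its own instance,
        // so no Aspose state is shared across threads (illustrative use of ThreadLocal<T>).
        static readonly ThreadLocal<Aspose.Words.Loading.LoadOptions> _loadOptions =
            new ThreadLocal<Aspose.Words.Loading.LoadOptions>(() => new Aspose.Words.Loading.LoadOptions());

        // Builds an Action that converts one RTF file to PDF entirely in memory.
        static Action ConvertOne(string path) => () =>
        {
            byte[] bytes = File.ReadAllBytes(path);
            using var input = new MemoryStream(bytes);
            var document = new Aspose.Words.Document(input, _loadOptions.Value);
            using var output = new MemoryStream();
            document.Save(output, Aspose.Words.SaveFormat.Pdf);
            Console.WriteLine($"{path} done on thread {Thread.CurrentThread.ManagedThreadId}");
        };

        static void Main()
        {
            // Parallel.Invoke runs one Action per document, capped at 2 threads here.
            var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
            Parallel.Invoke(options,
                ConvertOne("amosched1.rtf"),
                ConvertOne("amosched2.rtf"),
                ConvertOne("amosched3.rtf"),
                ConvertOne("amosched4.rtf"));
        }
    }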
Processing 8 documents using 1 threads
amosched1.rtf total: 4228 ms Read: 9 ms Load: 1970 ms Save: 2249 ms threadID: 1
amosched2.rtf total: 1703 ms Read: 14 ms Load: 639 ms Save: 1050 ms threadID: 1
amosched3.rtf total: 1764 ms Read: 9 ms Load: 671 ms Save: 1084 ms threadID: 1
amosched4.rtf total: 1592 ms Read: 9 ms Load: 763 ms Save: 820 ms threadID: 1
amosched1.rtf total: 1441 ms Read: 1 ms Load: 581 ms Save: 859 ms threadID: 1
amosched2.rtf total: 1348 ms Read: 0 ms Load: 577 ms Save: 771 ms threadID: 1
amosched3.rtf total: 1414 ms Read: 0 ms Load: 595 ms Save: 819 ms threadID: 1
amosched4.rtf total: 1438 ms Read: 0 ms Load: 595 ms Save: 843 ms threadID: 1

Processing 8 documents using 2 threads
amosched1.rtf total: 2355 ms Read: 25 ms Load: 1018 ms Save: 1312 ms threadID: 1
amosched2.rtf total: 2339 ms Read: 7 ms Load: 1015 ms Save: 1317 ms threadID: 6
amosched4.rtf total: 2515 ms Read: 3 ms Load: 1048 ms Save: 1464 ms threadID: 6
amosched3.rtf total: 2587 ms Read: 2 ms Load: 1235 ms Save: 1350 ms threadID: 1
amosched1.rtf total: 2251 ms Read: 0 ms Load: 941 ms Save: 1310 ms threadID: 6
amosched2.rtf total: 2277 ms Read: 1 ms Load: 969 ms Save: 1307 ms threadID: 1
amosched3.rtf total: 2319 ms Read: 1 ms Load: 931 ms Save: 1387 ms threadID: 6
amosched4.rtf total: 2330 ms Read: 1 ms Load: 1032 ms Save: 1297 ms threadID: 1

Processing 8 documents using 3 threads
amosched3.rtf total: 3243 ms Read: 2 ms Load: 1409 ms Save: 1832 ms threadID: 5
amosched1.rtf total: 3385 ms Read: 4 ms Load: 1475 ms Save: 1906 ms threadID: 1
amosched2.rtf total: 3418 ms Read: 1 ms Load: 1502 ms Save: 1915 ms threadID: 8
amosched4.rtf total: 3470 ms Read: 1 ms Load: 1600 ms Save: 1869 ms threadID: 5
amosched1.rtf total: 3389 ms Read: 4 ms Load: 1472 ms Save: 1913 ms threadID: 1
amosched2.rtf total: 3601 ms Read: 1 ms Load: 1649 ms Save: 1951 ms threadID: 8
amosched3.rtf total: 2491 ms Read: 3 ms Load: 1082 ms Save: 1406 ms threadID: 5
amosched4.rtf total: 2445 ms Read: 0 ms Load: 1042 ms Save: 1403 ms threadID: 1

Processing 8 documents using 4 threads
amosched3.rtf total: 5116 ms Read: 2 ms Load: 2182 ms Save: 2932 ms threadID: 6
amosched4.rtf total: 5147 ms Read: 2 ms Load: 2129 ms Save: 3016 ms threadID: 9
amosched1.rtf total: 5224 ms Read: 2 ms Load: 2168 ms Save: 3054 ms threadID: 1
amosched2.rtf total: 5233 ms Read: 2 ms Load: 2169 ms Save: 3062 ms threadID: 8
amosched1.rtf total: 4555 ms Read: 0 ms Load: 2043 ms Save: 2512 ms threadID: 6
amosched2.rtf total: 4570 ms Read: 0 ms Load: 2047 ms Save: 2523 ms threadID: 9
amosched3.rtf total: 4551 ms Read: 2 ms Load: 2170 ms Save: 2379 ms threadID: 1
amosched4.rtf total: 4565 ms Read: 2 ms Load: 2228 ms Save: 2335 ms threadID: 8

As you can see, Aspose.Words degrades with more threads.

One thought: you may be processing directly against the stream. You should probably work on in-memory byte[] arrays instead, and read (or write) the stream in a single call (see the sketch below). Just a thought.
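For illustration, “read the stream in one call” means something like this (a small sketch using standard .NET calls, not anything Aspose-specific):

    using System.IO;

    static class OneCallLoad
    {
        // Illustration only: pull the whole source stream into memory in one bulk call,
        // then let the converter work against that in-memory copy.
        public static Aspose.Words.Document Load(Stream source)
        {
            var buffer = new MemoryStream();
            source.CopyTo(buffer);       // single bulk read instead of many small reads
            buffer.Position = 0;
            return new Aspose.Words.Document(buffer);
        }
    }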

I am eager to see if you can solve this. I’m sure the answer will help all customers.
Thanks!

ThreadLocal.zip (338.7 KB)

@dgreenwood,

We have also managed to observe that the performance degrades as we increase the number of threads when running the code you provided on our end. To address this undesired performance issue, we have logged a separate issue with ID WORDSNET-20774. We will further look into the details of this problem and will keep you updated on the status of the linked issues.

Is there any update on this issue WORDSNET-20774 (or the related issue WORDSNET-20769) that you can share? I am eager to resolve this and more fully utilize the Aspose.Words product for a scaled deployment.

Thank you

@dgreenwood,

Unfortunately, these issues are not resolved yet. We have completed the initial analysis of WORDSNET-20769, but WORDSNET-20774 is still pending analysis in the queue. We will inform you here as soon as these issues are resolved or any further updates become available. We apologize for any inconvenience.

Okay, thank you. From your “initial analysis of WORDSNET-20769”, is there any guidance you can give? In general, and in light of my findings/examples, what does Aspose recommend as the best way to parallelize conversion operations? I’m sure I am not the only person interested in the answer. Thank you

@dgreenwood,

You use the following code:

    // load/save semaphores
    object _syncLoad = new object();
    object _syncSave = new object();

    // iterate over the work items by index (requires System.Linq)
    Parallel.ForEach(Enumerable.Range(0, streamCount), threadOptions, i => {
        var timer1 = new System.Diagnostics.Stopwatch();
        var timer2 = new System.Diagnostics.Stopwatch();

        Aspose.Words.Document document;

        // only one thread at a time may load
        lock (_syncLoad) {
            timer1.Start();
            document = new Aspose.Words.Document(workList[i].rtfStream);
            timer1.Stop();
        }

        // only one thread at a time may save
        lock (_syncSave) {
            timer2.Start();
            document.Save(workList[i].pdfStream, options);
            timer2.Stop();
        }
    });

The main idea is to use Parallel.ForEach() for more speed. You process 4 identical documents in memory and compare a non-parallel for loop, Parallel.ForEach(), and Parallel.ForEach() with a locking mechanism.

Here are a few words about the locking mechanism you used.
The two parts (reading and writing) are separated, and each part is in its own critical section.
You wrote the comment "load/save semaphores".
I am afraid this is not quite correct: these are not semaphores. The lock() statement is based on monitors.

  1. When you examine the non-parallel For loop, you get “Read: 773 Save: 868 ms”.
  2. When you examine the Parallel.ForEach(), you get “Read: 2091 Save: 2674 ms”.
  3. And when you examine the Parallel.ForEach() with a locking mechanism, you get “Read: 997 Save: 1350 ms”.

The longer processing time in case (2) is explained by the additional parallelization costs.
You explain the shorter processing time in case (3) as evidence that the load is not thread safe.

In fact, the main reason for the shorter processing time in case (3) is that the lock switches those additional parallelization costs off.
Only one thread can enter the critical section at a time, so the parallel effect does not apply.
This is why the processing time becomes similar to the processing time in case (1).

When we talk about a “non-thread safe” function, we usually mean that it produces incorrect results when run in parallel.
But the test shows that both parts (reading and writing) produce correct results (model and output streams) when processed in parallel.

To prove that the results are correct, we analyzed the hash of each output stream and observed the same hash (“B85C9507A6BC6CAB47F54006726E257C”) in every case (non-parallel for loop, Parallel.ForEach(), and Parallel.ForEach() with a locking mechanism).
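Such a check can be sketched as follows (assuming the PDF output lands in a MemoryStream; MD5 is used here only because the quoted hash is 32 hex characters):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class OutputHash
    {
        // Computes an MD5 hex string for one output stream so runs can be compared.
        public static string Md5Hex(MemoryStream output)
        {
            output.Position = 0;
            using var md5 = MD5.Create();
            byte[] hash = md5.ComputeHash(output);
            return BitConverter.ToString(hash).Replace("-", "");   // 32-character hex string
        }
    }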

Therefore, we have concluded that this is not a bug in Aspose.Words API.

It is rather difficult to suggest the best way. Please just use Parallel.ForEach() without locking; according to our tests, we can say that the Document class is thread safe.

Please see the attached test code and screenshots for more details.

I appreciate the prompt reply and detailed analysis. However, I think you may want to re-evaluate your assertion that the increase in processing time when adding threads (without locking) is simply “parallelization overhead”. As you can see in your own results, the time DOUBLES for 4 threads; if you were to try 8 or 16 threads you would see the time increase further. The TPL does not add that much overhead.

To prove this, I wrote yet another test program that uses a simple message queue and a farmer/worker model. The idea is to post a “work to do” message and have one or more separate worker processes listen on the queue and take action. When run this way, as separate worker processes, the parallelization speedup is near linear, with perhaps < 10% “parallelization overhead”. I’m happy to share this code with you if it will help your analysis.

I would assert that the threaded approach is in some way internally blocked (whether in your code or in the underlying .NET library for streams, etc.). Why else would a threaded version degrade by 2x or more while a separate-process version shows no degradation? Thoughts?
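For what it’s worth, here is a stripped-down sketch of the separate-process idea (it omits the message queue from my actual test and simply launches one worker process per document; the RtfToPdfWorker executable and the file names are hypothetical):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;

    class Farmer
    {
        static void Main()
        {
            string[] files = { "doc1.rtf", "doc2.rtf", "doc3.rtf", "doc4.rtf" };
            var workers = new List<Process>();

            // One OS process per document; "RtfToPdfWorker" is a hypothetical console app
            // that loads the RTF with Aspose.Words, saves the PDF, and exits.
            foreach (var file in files)
            {
                workers.Add(Process.Start(new ProcessStartInfo
                {
                    FileName = "RtfToPdfWorker",
                    Arguments = $"\"{file}\" \"{Path.ChangeExtension(file, ".pdf")}\"",
                    UseShellExecute = false
                }));
            }

            // Wait for every conversion to finish; each runs in its own address space,
            // so there is no in-process contention between them.
            foreach (var worker in workers)
                worker.WaitForExit();

            Console.WriteLine("All conversions complete.");
        }
    }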

One more thought. I acknowledge that perhaps I’m using the term “thread safe” incorrectly. I agree that the API does produce correct results when used in a parallel construct; it behaves in a “thread safe” manner in that there are no errors. So perhaps the better phrase I should have used is: “severely degrades in performance with additional threads, even when sufficient processor cores are available”. Agree?

@dgreenwood,

We have logged these details in our issue tracking system and will keep you posted here on any further updates.

@dgreenwood,

We have determined that these two issues, WORDSNET-20769 and WORDSNET-20774, are related to each other. Please check the analysis details of WORDSNET-20774 below:

Again, your main idea is to use parallelism for more speed. You process 8 identical documents in memory using different numbers of threads. The process consists of two steps: reading from a stream and saving to a stream.

  1. When you examine 1 thread, you get “Load: 581 ms Save: 859 ms” per thread.
  2. When you examine 2 threads, you get “Load: 969 ms Save: 1307 ms” per thread.
  3. When you examine 3 threads, you get “Load: 1472 ms Save: 1913 ms” per thread.
  4. And when you examine 4 threads, you get “Load: 2168 ms Save: 3054 ms” per thread.

Put simply, the main reason for the performance degradation (per thread) in multiple threads is the PC memory architecture.

Please look at this picture: WORDSNET-20774 Analysis 01.png (29.7 KB)

In a PC, a CPU can contain multiple cores and handle multiple threads, but it still only works with one memory bus and only one system RAM. This is the main problem. Several threads can work completely independently within the processor, but they need to wait for each other when working with memory.

It doesn’t matter how you work with memory, whether through a byte array or a memory stream; these structures just provide different tools for manipulating data. (And for files with a flexible structure, a byte array will not help.)

Therefore, it is better to say “the PC degrades with more threads” than “Aspose.Words degrades with more threads”, because almost every program loses performance when multiple competing instances of it run on one PC.

There are probably two solutions to the problem.

  1. Use a processor with a large L1/L2/L3 cache. (This allows the threads to work with cached data rather than system RAM.)
    Please see this picture: WORDSNET-20774 Analysis 02.png (43.1 KB)

  2. Use a non-PC architecture (for example, a mainframe).

That is a surprising conclusion you have reached. You are basically saying that “memory contention” can result in a 2x or greater performance decrease in a threaded application. I think you will find that many, many threaded applications do not experience that level of degradation. To prove my point, I will now write a test program that performs memory-intensive operations (any memory-intensive operations) on, say, a byte[] array, and then adds threading. Your assertion is that I should see the same performance decrease. I will post the results here.
For reference, btw, a useful tool to run on any program is dotnet-trace (dotnet-trace diagnostic tool - .NET CLI - .NET | Microsoft Learn); you can then view the results in speedscope (Performance Profiling of .NET Core 3 applications on Linux with dotnet-trace and PerfView | Michael's Coding Spot). What you will find if you profile the test apps I provided is that Aspose spends considerable time in unmanaged system calls, which is likely where the thread contention occurs. One does not, however, experience this same “contention” when running separate instances of a program to perform the conversion on a multi-processor PC. If your idea of memory contention were in fact accurate, how would you explain the difference between running a threaded application versus 2 or more instances of the same application? Stay tuned for more.

As promised, here is a very simple test app you can run. It accesses up to 4 independent multi MB-sized byte[] arrays or memory streams concurrently.

Program.cs.zip (1.1 KB)
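The attached Program.cs is the real test; a rough sketch of the shape of such a test (buffer sizes and the XOR work are illustrative) looks like this:

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Threading.Tasks;

    class MemoryContentionSketch
    {
        const int BufferCount = 4;
        const int BufferSize = 64 * 1024 * 1024;   // 64 MB per buffer (illustrative size)

        static void Main()
        {
            // Four completely independent buffers, as in the attached test.
            byte[][] buffers = Enumerable.Range(0, BufferCount)
                                         .Select(_ => new byte[BufferSize])
                                         .ToArray();

            foreach (int threads in new[] { 1, 2, 4 })
            {
                var stopwatch = Stopwatch.StartNew();
                Parallel.For(0, BufferCount,
                    new ParallelOptions { MaxDegreeOfParallelism = threads },
                    i =>
                    {
                        // Touch every byte so the loop is memory-bound rather than CPU-bound.
                        byte[] buffer = buffers[i];
                        for (int j = 0; j < buffer.Length; j++)
                            buffer[j] ^= (byte)j;
                    });
                Console.WriteLine($"using {threads} threads: {stopwatch.ElapsedMilliseconds} ms");
            }
        }
    }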

The results on a multi-core machine (Release build, running outside Visual Studio in a DOS prompt) are as follows:

byte[] array test
using no threads: 249 ms
using 1 threads: 251 ms
using 2 threads: 127 ms
using 4 threads: 74 ms

MemoryStream test
using no threads: 603 ms
using 1 threads: 534 ms
using 2 threads: 287 ms
using 4 threads: 136 ms

You can see that accessing memory in a threaded application does not produce the high degree of memory contention that you proposed. Yes, there appears to be some overhead for parallelization, but by no means does the time degrade by 2x or more.

Did I misunderstand your point or does this example now prove otherwise?

Given the results here, I will ask my question again: what is it about the Aspose.Words load and save operations (in memory?) that would cause such severe multi-threaded performance degradation?

Please understand, I am very truly interested in using Aspose.Words to process files in parallel. But with such severe performance degradation when using Aspose.Words in a multi-threaded approach, we both need to better understand the underlying performance issue. Do you still suggest the issue is “memory contention” ?

Thanks for your consideration and I eagerly await your reply.

… and just to convince myself that there was no “caching” going on from one loop to the next (after loop 1, loop 2 would otherwise be accessing the same arrays/streams, etc.), I simply copied and pasted additional arrays/streams so that each loop would work on new data.

Program.cs.zip (1.1 KB)

And the results are essentially the same, if not better…
byte[] array test
using no threads: 242 ms
using 1 threads: 257 ms
using 2 threads: 173 ms
using 4 threads: 75 ms

MemoryStream test
using no threads: 568 ms
using 1 threads: 518 ms
using 2 threads: 261 ms
using 4 threads: 146 ms

Please let me know your thoughts. Thanks

@dgreenwood,

Thank you for providing more information on this topic. We have logged these details in our issue tracking system and will keep you posted here on further updates.

Do you anticipate any further action or study on these issues? I have to believe (and it’s easy to test/verify within your code) that there is something in the Aspose load and save processing that accesses a shared resource, perhaps protected by a lock (for thread safety) or some other blocking mechanism. Why else would performance degrade so dramatically with additional threads, when I’ve clearly shown that “memory contention” is not the cause? I’m very interested to hear what other explanations the team may have, and I’m willing to help solve it. Thank you.

@dgreenwood,

I am afraid there is no further news about the linked issues. However, we have noted your concerns and, rest assured, we will keep you posted here on future updates.