Sharding and Multithreading in C# - Deep Dive
Can database-style sharding improve the performance of a multi-threaded application?
I was working on a sample application that solves Sudoku puzzles. The app solves puzzles very rapidly, on multiple threads. I wanted to track how many puzzle boards had been examined in all. I needed a counter that would potentially be incremented 3,000,000 times per second, so I started thinking about performance.
The performance of the counter was probably not going to affect the overall performance of my application, but it would be fun to see how different implementations perform. To learn more, I implemented the counter in a variety of ways. Then, I ran multiple benchmarks on my machine with 6 CPU cores and 12 virtual CPU cores. To cut to the chase, here are the results. Taller is better.

Each column represents one run of the benchmark. The top axis displays the number of tasks simultaneously trying to increment the single counter. The left axis displays how many times the counter was incremented in 1 second.
So, for example, let's consider column 004. This column shows us that, with 4 concurrent tasks, the LockingCounter was incremented just over 40 million times per second, the InterlockedCounter was incremented just under 30 million times per second, and the ShardedCounter was incremented nearly 140 million times per second.
The UnsynchronizedCounter only appears in column 001, because the unsynchronized counter does nothing to prevent race conditions in the code. Trying to increment an UnsynchronizedCounter from multiple threads will result in undercounting: the total count will not be correct. Therefore, it's only appropriate to examine UnsynchronizedCounter's performance when it is being incremented by a single thread.
The benchmark exercises the worst-case scenario for thread contention: multiple threads in a tight loop competing to read and write the same value. Here's what the inner loop for each task looks like:
while (!cancel.Token.IsCancellationRequested)
    counter.Increase(1);
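For context, here is a minimal sketch of how a harness might drive that loop. The class name, the one-second measurement window, and the use of Task.Run are my assumptions, not necessarily how the benchmark in the linked source is written; it relies on the ICounter interface sketched in the next section:

using System;
using System.Threading;
using System.Threading.Tasks;

public static class CounterBenchmark
{
    // Runs taskCount tasks that increment the counter in a tight loop
    // for one second, then reports the final count.
    public static long Run(ICounter counter, int taskCount)
    {
        using (var cancel = new CancellationTokenSource())
        {
            var tasks = new Task[taskCount];
            for (int i = 0; i < taskCount; i++)
            {
                tasks[i] = Task.Run(() =>
                {
                    while (!cancel.Token.IsCancellationRequested)
                        counter.Increase(1);
                });
            }
            Thread.Sleep(TimeSpan.FromSeconds(1));
            cancel.Cancel();
            Task.WaitAll(tasks);
            return counter.Count;
        }
    }
}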
So the big question is, what is a ShardedCounter, and why does it perform better when there are more tasks competing to increase it?
To understand the answer, let's look at each counter implementation, from simplest to most complex.
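All of the implementations below share a common ICounter interface. The article relies on it without showing it; based on the members used, it presumably looks like this minimal sketch:

public interface ICounter
{
    // The current total.
    long Count { get; }

    // Adds amount to the total.
    void Increase(long amount);
}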
UnsynchronizedCounter

The UnsynchronizedCounter is as simple as can be:
public class UnsynchronizedCounter : ICounter
{
    private long _count = 0;

    public long Count => _count;

    public void Increase(long amount)
    {
        _count += amount;
    }
}
The Count property returns the private _count, and the Increase() method increases the private _count.
Because the UnsynchronizedCounter has zero overhead, it can count faster than any of the other counters, but only on one thread.
If multiple threads simultaneously call Increase(), the final count will be less than expected, due to race conditions. Wikipedia has a great description of race conditions and how they cause bugs.
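To see the undercounting in action, a small sketch like the one below (the task count and iteration count are arbitrary choices of mine) usually prints a total well below the expected 10,000,000:

using System;
using System.Threading.Tasks;

public static class RaceDemo
{
    public static void Main()
    {
        var counter = new UnsynchronizedCounter();
        const int taskCount = 10;
        const int perTask = 1_000_000;
        var tasks = new Task[taskCount];
        for (int i = 0; i < taskCount; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                for (int j = 0; j < perTask; j++)
                    counter.Increase(1);
            });
        }
        Task.WaitAll(tasks);
        // Expected 10,000,000; the actual value is usually lower because
        // concurrent read-modify-write updates overwrite each other.
        Console.WriteLine($"Count: {counter.Count:N0}");
    }
}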
LockingCounter
The LockingCounter prevents race conditions by holding a lock while reading and writing _count.
public class LockingCounter : ICounter
{
    private long _count = 0;
    private readonly object _thisLock = new object();

    public long Count
    {
        get
        {
            lock (_thisLock)
            {
                return _count;
            }
        }
    }

    public void Increase(long amount)
    {
        lock (_thisLock)
        {
            _count += amount;
        }
    }
}
The lock prevents the undercounting problem that UnsynchronizedCounter suffered, but as the benchmark results above indicate, LockingCounter is much slower than UnsynchronizedCounter.
InterlockedCounter

System.Threading.Interlocked provides atomic operations for values that are shared by multiple threads. For better performance, C#'s concurrent collections use interlocked operations to implement collections like ConcurrentStack. Interlocked operations cannot replace locking in every case, but in the simple case of increasing a counter, they can. InterlockedCounter uses Interlocked.Add() to increase the count in a way that will never undercount, and without blocking while waiting to acquire a lock.
public class InterlockedCounter : ICounter
{
    private long _count = 0;

    // CompareExchange(ref _count, 0, 0) atomically reads the current
    // value: it replaces _count with 0 only if it is already 0, so the
    // value never changes, and the call returns the value it saw.
    public long Count => Interlocked.CompareExchange(ref _count, 0, 0);

    public void Increase(long amount)
    {
        Interlocked.Add(ref _count, amount);
    }
}
Looking at the benchmark results above, we see that with only a single task, InterlockedCounter can count more than twice as fast as LockingCounter. That's a huge reduction in overhead. InterlockedCounter is the fastest choice when a counter is being increased by multiple threads but there is not a lot of contention.
But notice that when multiple threads are trying to increment the counter very quickly, InterlockedCounter and LockingCounter perform about the same. Why is that? Because both implementations are constantly evicting the value of the counter from CPU caches and loading it again from RAM. Looking up a value in RAM takes at least 10 times as long as finding it in the cache, so evicting the value of counter from the cache is very costly.
Here's a block diagram that illustrates the problem.
There are 2 CPUs, each with its own cache. Initially, the RAM and both caches store the value 0 for counter.

First, the CPU on the left increases the counter. It reads the counter value from its cache, and writes the new value back to its cache and to RAM. But notice also, the cache on the right no longer contains the value counter=0. The cache entry on the right was evicted because its value was out of date.

Next, the CPU on the right increases the counter. It has to retrieve the value of counter from distant RAM, as shown by the red arrow, because its cache no longer has the value.

Looking up a value in RAM takes at least 10 times as long as finding it in cache. Reading the value from RAM dominates performance, so it doesn't matter whether the implementation uses locks or the magic of System.Threading.Interlocked.
Is there anything we can do to avoid the performance bottleneck in both the InterlockedCounter and LockingCounter? Yes, there is.
ShardedCounter

ShardedCounter uses the same principle as database sharding, also known as horizontal partitioning. In short, when a database is performing poorly because too many clients are trying to access the same table at the same time, one solution is to break it up into multiple tables across multiple database servers. For example, consider a table of addresses:

The entire table is stored in one SQL Server, and the server can serve 20 queries per second. When many clients try to access the table at the same time, they are limited to 20 queries per second total. When the load exceeds 20 queries per second, then the clients' requests take longer, and performance of the whole system suffers.
In this situation, it may be possible to improve performance by breaking the one Addresses table into 50 Addresses tables, one for each state. (This depends on what kind of queries are being run; in this case, the queries would need to look up data by state.)

Because each SQL Server is now handling a fraction of the load, the total throughput has increased to 20 queries per second * 50 SQL Servers = 1000 queries per second.
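As a hypothetical illustration of the routing this requires (the class and the state-keyed dictionary below are made up for this sketch, not from the article), a client might direct each state-filtered query to exactly one shard:

using System.Collections.Generic;

public class AddressShardRouter
{
    // Hypothetical map from state code to that shard's connection string.
    private readonly Dictionary<string, string> _connectionStrings;

    public AddressShardRouter(Dictionary<string, string> connectionStrings)
    {
        _connectionStrings = connectionStrings;
    }

    // A query filtered by state touches only one shard, so the
    // other 49 servers stay free to serve other clients.
    public string GetConnectionStringForState(string stateCode) =>
        _connectionStrings[stateCode];
}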
ShardedCounter applies the same strategy for increasing throughput to counters. It breaks up the one counter into multiple counters, one for each CPU core. Each of the counters is called a shard, hence the name ShardedCounter.
Note: Actually, it breaks the counter into multiple counters, one for each thread. But since a single thread tends to run on the same core for a while, this approximation makes the performance easier to explain.
The idea of sharding counters is an old one. I first saw it in MapReduce. A Google search for "sharded counter" today yields a discussion of sharded counters mainly related to Google App Engine and Google Cloud Datastore. I've seen it used in SQL databases too.
Let's replay the same steps above with the ShardedCounter. Initially, there are two counters. Both counters are stored in RAM, and a counter is stored in each CPU cache.
The CPU on the left increments the counter. It reads counterA from the cache, and writes the value back to cache and RAM.

Then, the CPU on the right increments the counter. It reads counterB from the cache, and writes the value back to cache and RAM.

The ShardedCounter performs better because the Increase operation never (see note below) reads a value from RAM. It always reads the value from cache.
Note: Approximately never. Threads get scheduled and descheduled from CPUs. When a new thread is scheduled on a CPU, it may have to read the value from RAM.
But of course, at some point, we need to read the total count, and to do that, we must load some values from RAM. The diagram below illustrates how the total count is computed.

So, reading the total count is still somewhat expensive. However, in my application, in the benchmark, and in many real-world applications, a counter is read far less frequently than it is increased. The benchmark code reads the counter once per second but increases it millions of times per second. Therefore, the total cost of the increases dwarfs the total cost of the reads.
How does ShardedCounter create one counter per core? Technically, it doesn't. It creates one counter per thread. Because threads tend to run on the same core for a while, the effect is similar.
ShardedCounter allocates a new thread-local storage slot via Thread.AllocateDataSlot(). This creates a place to store the counter shard for each thread.
public class ShardedCounter : ICounter
{
    // Protects _shards.
    private readonly object _thisLock = new object();

    // The list of shards.
    private List<Shard> _shards = new List<Shard>();

    // The thread-local slot where shards are stored.
    private readonly LocalDataStoreSlot _slot = Thread.AllocateDataSlot();
Retrieving the count requires summing the counts in all the shards.
    public long Count
    {
        get
        {
            // Sum over all the shards.
            long sum = 0;
            lock (_thisLock)
            {
                foreach (Shard shard in _shards)
                {
                    sum += shard.Count;
                }
            }
            return sum;
        }
    }
In the fast, common path, increasing the counter requires no locks and only reads and writes a value that no other thread can read or write. Therefore, there's little risk of another CPU trying to read the value and fetching it from RAM.
    public void Increase(long amount)
    {
        // Increase counter for this thread.
        Shard counter = Thread.GetData(_slot) as Shard;
        if (null == counter)
        {
            counter = new Shard()
            {
                Owner = Thread.CurrentThread
            };
            Thread.SetData(_slot, counter);
            lock (_thisLock) _shards.Add(counter);
        }
        counter.Increase(amount);
    }
Each shard is an InterlockedCounter, so that Count sees the latest values of all the counters, and thus avoids undercounting.
    private class Shard : InterlockedCounter
    {
        public Thread Owner { get; set; }
    }
}
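Here's a usage sketch (the task count and loop bound are arbitrary choices for illustration): each task bumps its own thread-local shard, and Count sums all the shards, so the final total is exact:

using System;
using System.Linq;
using System.Threading.Tasks;

public static class ShardedCounterDemo
{
    public static void Main()
    {
        var counter = new ShardedCounter();
        // Each task increments its own thread-local shard.
        Task[] tasks = Enumerable.Range(0, 8)
            .Select(_ => Task.Run(() =>
            {
                for (int i = 0; i < 1_000_000; i++)
                    counter.Increase(1);
            }))
            .ToArray();
        Task.WaitAll(tasks);
        // Count sums all the shards: prints 8,000,000.
        Console.WriteLine(counter.Count.ToString("N0"));
    }
}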
The complete code, which is a little more complicated because it cleans up after threads that have finished, can be found here.
Conclusion
When concurrency issues slow down your code, sometimes using more sophisticated concurrency operations like those provided by System.Threading.Interlocked will not improve performance. The problem is that too many actors are trying to access the same value at the same time.
In this particular instance, it was a toy problem that would do little to affect the overall performance of the application. But this same technique of sharding a fought-over value can be applied to larger problems too. Sharding a value is especially easy when it is computed with associative and commutative operations, like addition and multiplication, where the order of operations does not affect the result.
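For example, .NET's built-in ThreadLocal<T> with trackAllValues: true offers another way to express the same sharding idea. This hypothetical variant is my sketch, not code from the article:

using System.Linq;
using System.Threading;

public class ThreadLocalShardedCounter
{
    // One single-element array per thread; the array gives Interlocked
    // a stable memory location to operate on.
    private readonly ThreadLocal<long[]> _shard =
        new ThreadLocal<long[]>(() => new long[1], trackAllValues: true);

    public void Increase(long amount) =>
        Interlocked.Add(ref _shard.Value[0], amount);

    // Addition is associative and commutative, so summing the
    // per-thread subtotals in any order yields the same total.
    public long Count =>
        _shard.Values.Sum(shard => Interlocked.Read(ref shard[0]));
}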
The full source code for the counters benchmark and the sample application is available on github.com.
This article was technically reviewed by Damir Arh.
This article has been editorially reviewed by Suprotim Agarwal.