
Mutex vs Atomic

Some parallel applications do not have any data sharing between threads. Others have sharing between threads that is read-only. But some applications are more complex, and require the coordination of multiple threads that want to write to the same memory. In these situations, we can reach into our parallel programming toolbox for something like std::mutex to make sure only a single thread enters a critical section at a time. However, a mutex is just one tool in our toolbox, and different synchronization mechanisms have different performance characteristics. In this blog post, we'll compare a few different benchmarks using mutexes and atomics, and analyze the performance results of each.

Compiling and Running the Benchmarks

The benchmarks from this blog post were written using Google Benchmark. Below is the command I used to compile the simple increment benchmarks (on Ubuntu 20.04 using g++ 10).

g++ inc_bench.cpp -lbenchmark -lpthread -O3 -march=native -mtune=native -flto -fuse-linker-plugin -o inc_bench

The matrix multiplication benchmarks can be built using the existing makefile in the repository.

Parallel Increment - The Bad Way

Consider the following simple application:

 #include <iostream>
 #include <thread>

 int main() {
   // Shared value for our threads
   int shared_val = 0;
 
   // Number of iterations (65536)
   int N = 1 << 16;
 
   // Lambda that performs an increment
   auto inc_func = [&]() {
     for (auto i = 0; i < N; i++) shared_val++;
   };
 
   // Create two threads
   std::thread t1(inc_func);
   std::thread t2(inc_func);
 
   // Join the threads
   t1.join();
   t2.join();
 
   // Print the result
   std::cout << "FINAL VALUE IS: " << shared_val << '\n';
 
   return 0;
 }

What is the value of shared_val when it is printed to the screen? Our expected result is 2^16 + 2^16 (2^17), because that's how many increments we perform. However, the real answer is far less straightforward. We won't know for certain until runtime, because we have a race condition: both of our threads are trying to write to the same location at the same time.

Each of our increments to shared_val really consists of three operations: a read of shared_val, an increment of the value read, and a write of the new value back to shared_val. When our two threads try to perform these operations at the same time, they can interleave in nasty ways that give us unexpected results.
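To make the decomposition explicit, here is a small sketch of what each shared_val++ behaves like under the hood (illustrative source, not literal compiler output):

 // Each increment is really a non-atomic read-modify-write
 int tmp = shared_val;  // 1. read the current value
 tmp = tmp + 1;         // 2. increment the local copy
 shared_val = tmp;      // 3. write the new value back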

Consider a simple example where our threads politely take turns updating shared_val. First, thread 1 reads shared_val, increments it, then writes the value back to memory. Sometime later, thread 2 performs the same three operations (uninterrupted), and this goes back and forth until each thread has incremented shared_val 2^16 times. Sounds great! We would expect the final value to be 2^16 + 2^16 (2^17). Now let's consider another scenario where the threads do not politely take turns.

First, thread 1 reads shared_val. Then, thread 2 reads shared_val. Next, thread 1 increments the value it read, followed by thread 2 incrementing the value it read. Finally, both threads write the value they incremented to memory, and this set of steps repeats until each thread performs 2^16 increments. What is the value now? 2^16, not 2^17! Let's substitute in some values to make this more clear.

Let’s start with shared_val = 0. First, both threads read shared_val, and find a value of 0. Next both threads increment this value to 1. Finally, both threads write the value 1 to memory. Each thread performed an increment, but the final value of shared_val doesn’t reflect this because of the interleaving!

Here are a few printouts from when I ran the program:

FINAL VALUE IS: 65536
FINAL VALUE IS: 131072
FINAL VALUE IS: 85026

In the first run, it looks like shared_val was incremented half as many times as it should have been. In the second case, we got the expected result (2^17). In the final run, we got something in-between. We clearly need some way to restrict which orderings are allowed.

Additional Notes

Detecting Race Conditions

One way we can detect some race conditions is with Google's ThreadSanitizer (by adding -fsanitize=thread to our compile flags). A sample build command is shown below, followed by the output from running the bad increment program with ThreadSanitizer enabled:
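Here is a minimal sketch of a ThreadSanitizer build (assuming the same inc_bench.cpp source as above; the exact flags are illustrative):

g++ inc_bench.cpp -fsanitize=thread -g -O1 -lpthread -o inc_bench_tsan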

==================
WARNING: ThreadSanitizer: data race (pid=31049)
  Read of size 4 at 0x7fffb18f4d28 by thread T2:
    #0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run() <null> (a.out+0x401379)
    #1 execute_native_thread_routine ../../../../../libstdc++-v3/src/c++11/thread.cc:80 (libstdc++.so.6+0xd62af)

  Previous write of size 4 at 0x7fffb18f4d28 by thread T1:
    #0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run() <null> (a.out+0x40138f)
    #1 execute_native_thread_routine ../../../../../libstdc++-v3/src/c++11/thread.cc:80 (libstdc++.so.6+0xd62af)

  Location is stack of main thread.

  Location is global '<null>' at 0x000000000000 ([stack]+0x00000001ed28)

  Thread T2 (tid=31052, running) created by main thread at:
    #0 pthread_create ../../../../libsanitizer/tsan/tsan_interceptors_posix.cpp:962 (libtsan.so.0+0x5be22)
    #1 __gthread_create /home/cba/forked_repos/gcc/build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:663 (libstdc++.so.6+0xd6524)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) ../../../../../libstdc++-v3/src/c++11/thread.cc:135 (libstdc++.so.6+0xd6524)
    #3 main <null> (a.out+0x401136)

  Thread T1 (tid=31051, finished) created by main thread at:
    #0 pthread_create ../../../../libsanitizer/tsan/tsan_interceptors_posix.cpp:962 (libtsan.so.0+0x5be22)
    #1 __gthread_create /home/cba/forked_repos/gcc/build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:663 (libstdc++.so.6+0xd6524)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) ../../../../../libstdc++-v3/src/c++11/thread.cc:135 (libstdc++.so.6+0xd6524)
    #3 main <null> (a.out+0x401127)

SUMMARY: ThreadSanitizer: data race (/home/cba/forked_repos/misc_code/inc_bench/a.out+0x401379) in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run()
==================
FINAL VALUE IS: 131072
ThreadSanitizer: reported 1 warnings

Race Conditions In a Single Instruction

You may have wondered how there can be any interleaving when our increment in C++ gets translated to a single x86 instruction (like add or inc). This is because x86 processors don't actually execute x86 instructions (at least not directly). These instructions get translated into micro-operations (uops), and the uops can be interleaved between threads.

Check out uops.info for more information.

Parallel Increment - Mutex and Atomic

In this section, we’ll discuss how to avoid the race condition in our parallel increment example using mutexes and atomics.

Parallel Increment - Mutex

The first way we can avoid our race condition is by using a mutex (std::mutex in this example). en.cppreference.com describes std::mutex as a synchronization primitive that can be used to protect shared data from being simultaneously accessed by multiple threads.

Sounds like exactly what we are looking for (a way to serialize access to shared data)! Let’s look at an updated function that will be called from multiple threads where we use a std::mutex to serialize access to shared_val.

 // Function to increment using a lock
 void inc_mutex(std::mutex &m, int &shared_val) {
   for (int i = 0; i < (1 << 16); i++) {
     std::lock_guard<std::mutex> g(m);
     shared_val++;
   }
 }

Let's step through what's going on here. On each iteration of the loop, we use a std::lock_guard to lock our std::mutex. If a thread finds the std::mutex already locked, it waits for the lock to be released. If a thread finds the std::mutex unlocked, it locks the std::mutex, increments shared_val, then unlocks the mutex (the std::lock_guard object locks the mutex when it is created, and releases it when it is destroyed).
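To make the RAII behavior concrete, here is roughly what the loop body would look like with explicit lock and unlock calls (a sketch only; the std::lock_guard version is preferred because it also releases the mutex if an exception is thrown):

 // Hand-written equivalent of what std::lock_guard does each iteration
 for (int i = 0; i < (1 << 16); i++) {
   m.lock();      // block until this thread owns the mutex
   shared_val++;  // critical section - only one thread at a time
   m.unlock();    // release so another waiting thread can enter
 }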

Let's take a look at the low-level assembly:

       │20:┌─→mov   %r12,%rdi
 22.62 │   │→ callq pthread_mutex_lock@plt
  2.51 │   │  test  %eax,%eax
       │   │↓ jne   60
 49.50 │   │  incq  0x0(%rbp)
  0.75 │   │  mov   %r12,%rdi
 11.33 │   │→ callq pthread_mutex_unlock@plt
       │   ├──dec   %ebx
 13.29 │   └──jne   20

Pretty much what we described above! We call pthread_mutex_lock, use an incq to increment the value, then call pthread_mutex_unlock. Now let’s compare the performance when we scale the number of threads.

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
lock_guard_bench/1/real_time       1.30 ms         1.23 ms          543
lock_guard_bench/2/real_time       8.02 ms         7.56 ms           82
lock_guard_bench/3/real_time       8.55 ms         7.41 ms           80
lock_guard_bench/4/real_time       19.4 ms         16.0 ms           36
lock_guard_bench/5/real_time       24.5 ms         20.3 ms           29
lock_guard_bench/6/real_time       32.8 ms         28.3 ms           22
lock_guard_bench/7/real_time       39.4 ms         33.2 ms           18
lock_guard_bench/8/real_time       44.8 ms         38.5 ms           16

Unsurprisingly, our performance gets worse as the number of threads increases. That’s because with more threads, we have more contention for our lock.

Parallel Increment - Atomic

As usual in programming, there are many different ways to solve a problem. Using a mutex to solve our parallel increment race condition is actually overkill. What we should be using are atomic operations. en.cppreference.com describes the atomic operations library as providing components for fine-grained atomic operations that allow for lockless concurrent programming.

Basically, our increment that requires a read, modify, and write of shared_val becomes a single indivisible operation. That means there can be no interleaving of operations across threads.

Let’s take a look at how we can do this in C++:

// Function to increment an atomic int
void inc_atomic(std::atomic<int> &shared_val) {
  for (int i = 0; i < (1 << 16); i++) shared_val++;
}
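As a side note, the post-increment on a std::atomic<int> is just shorthand for an atomic fetch_add, so the loop could equivalently be written as follows (a small sketch; the name inc_atomic_explicit is just for illustration):

// Equivalent formulation using fetch_add explicitly (sequentially
// consistent by default) - the increment is still a single atomic RMW
void inc_atomic_explicit(std::atomic<int> &shared_val) {
  for (int i = 0; i < (1 << 16); i++) shared_val.fetch_add(1);
}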

Each increment of shared_val is now an atomic increment! Let’s see how this translates to the low-level assembly.

 98.89 │10:   lock   incq (%rdx)
  0.03 │      dec    %eax
  1.08 │    ↑ jne    10

A much more streamlined routine. We perform incq instructions with the lock prefix 2^16 times on each thread. This lock prefix ensures that the core performing the instruction has exclusive ownership of the cache line we are writing to for the entire operation. This is how the increment is made into an indivisible unit.

Now let’s look at the performance with a varying number of threads, and compare the performance to using a mutex:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
atomic_bench/1/real_time          0.400 ms        0.400 ms         1737
atomic_bench/2/real_time           3.24 ms         3.13 ms          224
atomic_bench/3/real_time           5.34 ms         5.11 ms          100
atomic_bench/4/real_time           7.18 ms         6.95 ms           96
atomic_bench/5/real_time           8.43 ms         6.03 ms           78
atomic_bench/6/real_time           9.78 ms         7.78 ms           72
atomic_bench/7/real_time           11.5 ms         10.4 ms           59
atomic_bench/8/real_time           13.4 ms         11.3 ms           52

Way faster than using a mutex (by over 3x in some cases)! But this shouldn’t be terribly surprising. When we use a mutex, we’re relying on software routines to lock and unlock the mutex. With our atomic operations, we’re relying on the underlying hardware mechanisms to make the increment an indivisible operation.

However, it's important to note that what we're profiling here is repeated locks and unlocks for our mutex benchmark, and repeated atomic increments for our atomic benchmark. This is a largely artificial scenario, so we should be cautious about assuming that every case where we can replace a mutex with an atomic operation will see a massive speedup.

Additional Notes

One interesting question to ask is: how do locks get passed between threads when we use a mutex? There are numerous techniques with varying levels of complexity, fairness, and performance. Here is a presentation that goes over some of the classical lock algorithms.
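To give a flavor of the simplest of these techniques, here is a minimal test-and-set spinlock built on std::atomic_flag (an illustrative sketch only; this is not how std::mutex is implemented, and real locks typically back off or fall back to the OS rather than spinning forever):

#include <atomic>

// A bare-bones test-and-set spinlock (illustrative only)
struct SpinLock {
  std::atomic_flag flag = ATOMIC_FLAG_INIT;

  void lock() {
    // Atomically set the flag; if it was already set, another thread
    // holds the lock, so keep spinning until we observe it cleared
    while (flag.test_and_set(std::memory_order_acquire)) {
    }
  }

  void unlock() {
    // Clear the flag so a waiting thread's test_and_set can succeed
    flag.clear(std::memory_order_release);
  }
};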

Matrix Multiplication - Mutex and Atomics

In this section, we’ll look at a few different implementations of matrix multiplication, and compare the difference in performance when elements are statically mapped to threads, or dynamically mapped using mutexes or atomics.

Statically Mapped Elements

We can parallelize matrix multiplication without any synchronization mechanisms if we have threads work on different parts of the output matrix. Here is an example implementation that performs some cache tiling for even better performance:

 // Blocked column parallel implementation w/o atomic
 void blocked_column_multi_output_parallel_gemm(const double *A, const double *B,
                                                double *C, std::size_t N,
                                                std::size_t tid,
                                                std::size_t stride) {
   // For each chunk of columns
   for (std::size_t col_chunk = tid * 16; col_chunk < N; col_chunk += stride)
     // For each chunk of rows
     for (std::size_t row_chunk = 0; row_chunk < N; row_chunk += 16)
       // For each block of elements in this row of this column chunk
       // Solve for 16 elements at a time
       for (std::size_t tile = 0; tile < N; tile += 16)
          // Apply that tile to each row of the row chunk
         for (std::size_t row = 0; row < 16; row++)
           // For each row in the tile
           for (std::size_t tile_row = 0; tile_row < 16; tile_row++)
             // Solve for each element in this tile row
             for (std::size_t idx = 0; idx < 16; idx++)
               C[(row + row_chunk) * N + col_chunk + idx] +=
                   A[(row + row_chunk) * N + tile + tile_row] *
                   B[tile * N + tile_row * N + col_chunk + idx];
 }

Each thread processes 16 columns of output elements on each iteration of the outer-most loop until every element has been processed (we're assuming square matrices with a dimension N that is divisible by 16).
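For reference, a hypothetical launch of this static version might look like the sketch below (assuming <thread> and <vector> are included, A, B, and C are already allocated and initialized, and num_threads is an illustrative thread count):

 // Launch one thread per tid; the stride spaces out each thread's column chunks
 std::size_t num_threads = 8;            // hypothetical thread count
 std::size_t stride = num_threads * 16;  // gap between a thread's column chunks
 std::vector<std::thread> threads;
 for (std::size_t tid = 0; tid < num_threads; tid++)
   threads.emplace_back(blocked_column_multi_output_parallel_gemm, A, B, C, N,
                        tid, stride);
 for (auto &t : threads) t.join();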

Let's measure the performance when we statically partition the matrix like this for matrices of dimension 2^8 + 16, 2^9 + 16, and 2^10 + 16. We avoid using a power-of-two dimension directly to sidestep cache associativity problems (conflict misses).

Here are the results:

-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
static_mmul_bench/8/real_time       0.913 ms        0.462 ms         1771
static_mmul_bench/9/real_time        4.01 ms         2.51 ms          340
static_mmul_bench/10/real_time       29.1 ms         23.6 ms           57

Dynamically Mapped Using Mutexes and Atomics

One way we can try to improve the performance is by having a queue of work items rather than statically dividing the work. The rationale behind this is that threads complete work at different rates due to things like runtime scheduling differences. If work is statically and evenly divided between threads, some threads may finish their work faster than others (due to advantageous thread schedules), leading to idle resources. With a queue of work items, threads can keep grabbing more work until there is nothing left, minimizing the time threads are sitting idle while there is still work to do.

Let’s first take a look at an implementation with mutexes:

 // Fetch and add routine using a mutex
 uint64_t fetch_and_add(uint64_t &pos, std::mutex &m) {
   std::lock_guard<std::mutex> g(m);
   auto ret_val = pos;
   pos += 16;
   return ret_val;
 }
 
 // Blocked column parallel implementation w/ a mutex
 void blocked_column_multi_output_parallel_mutex_gemm(const double *A,
                                                      const double *B, double *C,
                                                      std::size_t N,
                                                      uint64_t &pos,
                                                      std::mutex &m) {
   // For each chunk of columns
   for (std::size_t col_chunk = fetch_and_add(pos, m); col_chunk < N;
        col_chunk = fetch_and_add(pos, m))
     // For each chunk of rows
     for (std::size_t row_chunk = 0; row_chunk < N; row_chunk += 16)
       // For each block of elements in this row of this column chunk
       // Solve for 16 elements at a time
       for (std::size_t tile = 0; tile < N; tile += 16)
          // Apply that tile to each row of the row chunk
         for (std::size_t row = 0; row < 16; row++)
           // For each row in the tile
           for (std::size_t tile_row = 0; tile_row < 16; tile_row++)
             // Solve for each element in this tile row
             for (std::size_t idx = 0; idx < 16; idx++)
               C[(row + row_chunk) * N + col_chunk + idx] +=
                   A[(row + row_chunk) * N + tile + tile_row] *
                   B[tile * N + tile_row * N + col_chunk + idx];
 }

In the outer-most loop, each thread gets a chunk of 16 columns to solve from our fetch_and_add routine. This routine reads the current position (pos), increments pos by 16, and returns the previous value. Access to pos is restricted using a std::mutex and std::lock_guard. You can think of pos as our position in the queue of work items (which are really just chunks of the output matrix we are solving).
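A hypothetical launch of this version might look like the sketch below (illustrative only, with num_threads as before; std::ref is needed because std::thread copies its arguments, and pos and the mutex must be shared rather than copied):

 // Shared work-queue position and the mutex that guards it
 uint64_t pos = 0;
 std::mutex m;
 std::vector<std::thread> threads;
 for (std::size_t i = 0; i < num_threads; i++)
   threads.emplace_back(blocked_column_multi_output_parallel_mutex_gemm, A, B,
                        C, N, std::ref(pos), std::ref(m));
 for (auto &t : threads) t.join();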

Let’s measure the performance:

------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
mutex_mmul_bench/8/real_time       0.519 ms        0.474 ms         1345
mutex_mmul_bench/9/real_time        2.80 ms         2.33 ms          242
mutex_mmul_bench/10/real_time       21.3 ms         19.1 ms           33

A decent improvement compared to our statically mapped version (~30%)! Now let's look at an implementation where we directly use the fetch_add method of a std::atomic:

// Blocked column parallel implementation w/ atomics
void blocked_column_multi_output_parallel_atomic_gemm(
    const double *A, const double *B, double *C, std::size_t N,
    std::atomic<uint64_t> &pos) {
  // For each chunk of columns
  for (std::size_t col_chunk = pos.fetch_add(16); col_chunk < N;
       col_chunk = pos.fetch_add(16))
    // For each chunk of rows
    for (std::size_t row_chunk = 0; row_chunk < N; row_chunk += 16)
      // For each block of elements in this row of this column chunk
      // Solve for 16 elements at a time
      for (std::size_t tile = 0; tile < N; tile += 16)
        // Apply that tile to each row of the row chunk
        for (std::size_t row = 0; row < 16; row++)
          // For each row in the tile
          for (std::size_t tile_row = 0; tile_row < 16; tile_row++)
            // Solve for each element in this tile row
            for (std::size_t idx = 0; idx < 16; idx++)
              C[(row + row_chunk) * N + col_chunk + idx] +=
                  A[(row + row_chunk) * N + tile + tile_row] *
                  B[tile * N + tile_row * N + col_chunk + idx];
}

Pretty much the same as our std::mutex version, except we don't have to write our own fetch_and_add routine. Let's take a look at the performance results.

-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
atomic_mmul_bench/8/real_time       0.516 ms        0.458 ms         1349
atomic_mmul_bench/9/real_time        2.78 ms         2.31 ms          246
atomic_mmul_bench/10/real_time       21.0 ms         18.8 ms           33

Interestingly enough, they’re about the same as when we used a std::mutex (and still ~30% faster than the static mapping)! In these examples, most of our time is being spent doing fused multiply-add operations in the inner loops, and not performing the fetch and add in the outer-most loop. Because the fetch and add is not the bottleneck, it’s unlikely that performance differences between the two will manifest in a significant way.

Final Thoughts

Are mutexes and atomics the same thing? No, certainly not. However, both can be used to solve parallel programming problems where threads want to access the same piece of data. Mutexes allow us to serialize access to a region of code through locking and unlocking. Atomic operations allow us to directly make use of hardware locking mechanisms for fundamental operations like increment, add, and exchange. What's important is that we understand the tools in our parallel programming toolbox, and the costs/benefits of each.

Thanks for reading,

–Nick