Spinlocks - Part 3

In the last blog post, we used an optimization called locally spinning to alleviate the high number of coherence misses in our initial naive spinlock implementation. However, at the end of that blog post, I mentioned that we’ve traded constant contention for our cache lines for bursty contention whenever a thread releases the lock.

In this blog post, we’ll spend some time understanding this bursty contention, and try and address the problem using a spinlock optimization called active backoff.

Our Benchmark

We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:

 // Small Benchmark
 static void bench(benchmark::State &s) {
   // Sweep over a range of threads
   auto num_threads = s.range(0);
 
   // Value we will increment
   std::int64_t val = 0;
 
   // Allocate a vector of threads
   std::vector<std::thread> threads;
   threads.reserve(num_threads);
 
   Spinlock sl;
 
   // Timing loop
   for (auto _ : s) {
     for (auto i = 0u; i < num_threads; i++) {
       threads.emplace_back([&] { inc(sl, val); });
     }
     // Join threads
     for (auto &thread : threads) thread.join();
     threads.clear();
   }
 }
 BENCHMARK(bench)
     ->RangeMultiplier(2)
     ->Range(1, std::thread::hardware_concurrency())
     ->UseRealTime()
     ->Unit(benchmark::kMillisecond);
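If you’d like to build this snippet as a standalone program, some scaffolding along the lines of the following should do it (the headers, file name, and compile flags here are my assumptions, not part of the listing above). Google Benchmark’s BENCHMARK_MAIN() macro generates a main() that runs every registered benchmark:

 // Headers the standalone benchmark needs (this assumes the Spinlock class
 // and inc function shown later in the post live in the same file)
 #include <benchmark/benchmark.h>
 
 #include <atomic>
 #include <cstdint>
 #include <thread>
 #include <vector>
 
 // Generates a main() that runs all registered benchmarks
 BENCHMARK_MAIN();

With the library installed, something like g++ spinlock.cpp -O3 -lbenchmark -lpthread should be enough to build and run it.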

We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:

 // Increment val once each time the lock is acquired
 void inc(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

For 100k iterations, our threads try and lock our spinlock, increment a shared value, then release the lock. This gives us a heavy contention scenario in which to test our spinlock.

A Spinlock with Active Backoff

Before we jump into the active backoff optimization, let’s try and understand where the bursty contention in our current spinlock comes from.

Bursty Contention

Our bursty contention occurs when the thread currently holding the lock releases the lock. Let’s take another look at our lock method to understand this:

   // Lock the spinlock
   void lock() {
     // Keep trying until we get the lock
     while (1) {
       // Try and grab the lock
       if (!locked.exchange(true)) return;
 
       // Read the spinlock state
       while (locked.load())
         ;
     }
   }

When the lock is released, all threads spinning in the second while loop (our locally spinning optimization) read that the lock is free and race to the atomic exchange to try and grab it. However, only one of the potentially many threads that break out of the polling loop will get the lock. The threads that lost the race still wastefully perform their atomic exchange, then go back to reading the lock state.

To solve this problem, we need to limit the number of threads that can break out of our waiting loop when the lock is freed. We can do this by adding backoff to this polling loop.

The Active Backoff Optimization

To keep our threads from immediately breaking out of the second while loop, we will add a small delay between each read of the spinlock state. Here is our modified lock method:

   // Locking mechanism
   void lock() {
     while (1) {
       // Try and grab the lock
       if (!locked.exchange(true)) return;
 
       // Wait for the lock to be free
       do {
         // Pause for some number of iterations
         for (volatile int i = 0; i < 150; i += 1)
           ;
         // Read the lock state
       } while (locked.load());
     }
   }

Our lock method is almost identical to our spinning locally implementation, with the exception of the polling loop. Instead of reading the lock state as fast as possible, we insert a configurable delay using a dummy for loop.

But what does inserting a delay here accomplish? If threads read the state of the lock as fast as possible, it’s likely that almost all of them will break out and compete for the lock at the same time. By adding a delay in this loop, we increase the probability that some threads will still be executing the delay instructions from our dummy for loop while others (but not all of the threads) break out and try and get the lock.

But how many iterations should we run in our for loop? It depends! This is a tunable parameter, and you should play around with it yourself to see what works well for your application and architecture. 150 iterations provided a reasonable speedup for my use-case.
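One way to keep this parameter easy to experiment with is to make it a compile-time parameter of the spinlock itself. Here’s a minimal sketch of that idea (the BackoffSpinlock name and Delay parameter are just for illustration; they’re not part of the implementation we’ve been building):

 // A sketch with the backoff length as a template parameter
 #include <atomic>
 
 template <int Delay = 150>
 class BackoffSpinlock {
   std::atomic<bool> locked{false};
 
  public:
   void lock() {
     while (1) {
       // Try and grab the lock
       if (!locked.exchange(true)) return;
 
       // Wait for the lock to be free, pausing between reads
       do {
         for (volatile int i = 0; i < Delay; i++)
           ;
       } while (locked.load());
     }
   }
 
   // Release the lock
   void unlock() { locked.store(false); }
 };
 
 // e.g., BackoffSpinlock<300> sl; to try a longer delay

This makes it easy to benchmark a few different delay values without editing the lock method each time.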

And why did we call it active backoff? I’ve called it active backoff because we’re actively doing work during our delay (executing the instructions of our dummy for loop). We’ll look at a passive form of backoff in the next post.

Here is our full spinlock implementation:

// Simple Spinlock
// Lock now performs local spinning
// Lock now does backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // Locking mechanism
  void lock() {
    // Keep trying
    while (1) {
      // Try and grab the lock
      // Exit if we got the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally. This leads to less traffic.
      // Each iteration we can also call pause to limit the number of writes.
      // How many times you should pause each time should be experimentally
      // determined
      do {
        // Pause for some number of iterations
        for (volatile int i = 0; i < 150; i += 1)
          ;
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false), the same as in the previous posts
  void unlock() { locked.store(false); }
};
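As a side note, because the class exposes lock() and unlock(), it satisfies the BasicLockable requirements, so it can be dropped into standard helpers like std::lock_guard. Here’s a small usage sketch (the guarded_inc function and counter are just an example, not part of the benchmark):

 #include <cstdint>
 #include <mutex>
 
 Spinlock sl;
 std::int64_t counter = 0;
 
 void guarded_inc() {
   // The guard locks the spinlock here and unlocks it on scope exit
   std::lock_guard<Spinlock> guard(sl);
   counter++;
 }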

Low-Level Assembly

Let’s take a look at the low-level assembly for our benchmark with this new optimization:

  0.42 │20:┌─→mov     %edi,%ecx              
 17.29 │   │  xchg    %cl,(%rax)             
  0.01 │   │  test    %cl,%cl                
  0.01 │   │↓ je      67                     
  0.09 │   │  nop                            
  0.35 │30:│  movl    $0x0,-0x4(%rsp)        
  0.07 │   │  mov     -0x4(%rsp),%ecx        
  0.02 │   │  cmp     $0x95,%ecx             
       │   │↓ jg      5e                     
  0.06 │   │  nop                            
  1.77 │48:│  mov     -0x4(%rsp),%ecx        
  1.44 │   │  inc     %ecx                   
 15.57 │   │  mov     %ecx,-0x4(%rsp)        
 21.50 │   │  mov     -0x4(%rsp),%ecx        
  0.07 │   │  cmp     $0x95,%ecx             
  8.49 │   │↑ jle     48                     
 13.96 │5e:│  movzbl  (%rax),%ecx            
  0.04 │   │  test    %cl,%cl                
  0.02 │   │↑ jne     30                     
  0.35 │   │↑ jmp     20                     
  2.54 │67:│  incq    (%rsi)                 
 15.93 │   │  xchg    %cl,(%rax)             
       │   ├──dec     %edx                   
  0.00 │   └──jne     20                     

Our threads first try and grab the lock using an atomic exchange (xchg). If a thread gets the lock, it jumps to label 67:, where it increments the shared value, releases the lock, and decrements the loop counter (for our 100k iterations). If a thread fails to get the lock, it sets up the active backoff for loop at label 30: and executes the delay loop at label 48:.

After this delay loop, a thread reads the lock state again (movzbl) and either goes through the delay loop once more if the lock is still taken, or jumps back up to the atomic exchange to try and grab the lock if it’s free.

Performance

Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
active_backoff/1/real_time       1.05 ms        0.054 ms          679
active_backoff/2/real_time       2.60 ms        0.081 ms          281
active_backoff/4/real_time       9.49 ms        0.121 ms           75
active_backoff/8/real_time       50.5 ms        0.287 ms           14

And here are the performance numbers from our previous (spin_locally) spinlock implementation:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
spin_locally/1/real_time       1.01 ms        0.046 ms          698
spin_locally/2/real_time       10.4 ms        0.070 ms           68
spin_locally/4/real_time       28.9 ms        0.123 ms           19
spin_locally/8/real_time       91.4 ms        0.148 ms            8

Another huge improvement! Compared to our spinlock with only the locally spinning optimization, our end-to-end execution time went down from 10.4ms to 2.60ms for 2 threads, 28.9ms to 9.49ms for 4 threads, and 91.4ms to 50.5ms for 8 threads.

While this is another great improvement in performance, we’re wasting power executing all these extra instructions for the for loop that generates our delay. We’ll address this in the next blog post using a passive form of backoff and the _mm_pause() intrinsic.

Final Thoughts

Understanding how our threads interact with each other is critical for finding these performance opportunities. However, just like last time, there is still plenty of room for improvement in our implementation (in terms of both performance and power consumption).

Thanks for reading,

–Nick