Spinlocks - Part 5

In the last blog post, we used the passive backoff optimization to reduce the power consumption of our spinlock. However, we may not want our threads to always pause for the same number of iterations.

In this blog post, we’ll look at two spinlocks that use a non-constant form of backoff (exponential and random backoff).

Our Benchmark

We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:

 // Small Benchmark
 static void bench(benchmark::State &s) {
   // Sweep over a range of threads
   auto num_threads = s.range(0);
 
   // Value we will increment
   std::int64_t val = 0;
 
   // Allocate a vector of threads
   std::vector<std::thread> threads;
   threads.reserve(num_threads);
 
   Spinlock sl;
 
   // Timing loop
   for (auto _ : s) {
     for (auto i = 0u; i < num_threads; i++) {
       threads.emplace_back([&] { inc(sl, val); });
     }
     // Join threads
     for (auto &thread : threads) thread.join();
     threads.clear();
   }
 }
 BENCHMARK(bench)
     ->RangeMultiplier(2)
     ->Range(1, std::thread::hardware_concurrency())
     ->UseRealTime()
     ->Unit(benchmark::kMillisecond);

We’ll be running our benchmark with a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:

 // Increment val once each time the lock is acquired
 void inc(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

For 100k iterations, our threads try and lock our spinlock, increment a shared value, then release the lock. This gives us a heavy contention scenario in which to test our spinlock.
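
These snippets assume the usual Google Benchmark scaffolding around them. Here is a minimal sketch of what that scaffolding might look like (the header layout and build setup in the original repository may differ):

 // Hypothetical scaffolding assumed around the benchmark and inc() snippets
 #include <benchmark/benchmark.h>

 #include <atomic>
 #include <cstdint>
 #include <thread>
 #include <vector>

 // Google Benchmark generates main() for us
 BENCHMARK_MAIN();

With the library installed, this would typically be built with something like g++ -O3 spinlock.cpp -lbenchmark -lpthread.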

A Spinlock with Non-Constant Backoff

It is sometimes advantageous for threads to pause for a non-constant number of iterations. For example, if one of our threads is holding the lock longer than expected, we might want to increase the time between reads of the spinlock state.

Let’s take a look at two non-constant backoff implementations. In our first implementation, we’ll look at an optimization called exponential backoff where the number of pause iterations increases exponentially the longer we wait for the lock. In the second, we’ll look at a random backoff optimization where the number of pause iterations comes from a random number generator.

Exponential Backoff

The core idea behind exponential backoff is that threads that continuously find that the spinlock is still not available should wait longer between each read of the spinlock state. Here’s how our updated spinlock lock method looks:

  // Locking mechanism
  void lock() {
    // Start backoff at MIN_BACKOFF iterations
    int backoff_iters = MIN_BACKOFF;

    // Keep trying
    while (1) {
      // Try and grab the lock
      if (!locked.exchange(true)) return;

      // Pause for an exponentially increasing number of iterations
      do {
        for (int i = 0; i < backoff_iters; i++) _mm_pause();

        // Get the backoff iterations for next time
        backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);

        // Check to see if the lock is free
      } while (locked.load());
    }
  }

We start our backoff iterations (backoff_iters) at some minimum value (MIN_BACKOFF), which should be experimentally determined (I used 4). If a thread fails to get the lock with the atomic exchange, it falls into our local-spinning loop, where it pauses for backoff_iters iterations. We then exponentially increase the number of backoff iterations for the next trip through the loop, read the spinlock state, and either go back to the top of the pause loop or try for the lock again.

One thing to notice is that we cap the number of backoff iterations at MAX_BACKOFF. This is generally a good idea, because exponential numbers grow quickly (who would have thought!). If we did not cap the backoff, the number of iterations could grow so quickly that all our threads end up waiting unnecessarily long in the delay loop even after the lock has become free. The value of MAX_BACKOFF, like MIN_BACKOFF, should be experimentally determined. I chose 1024.
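
For concreteness, here is a minimal sketch of how these two constants might be defined (their exact placement in the original code may differ; the values are the 4 and 1024 mentioned above):

 // Backoff bounds - experimentally determined (the values I used)
 constexpr int MIN_BACKOFF = 4;
 constexpr int MAX_BACKOFF = 1024;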

Here is the full spinlock implementation:

// Headers needed by the spinlock (MIN_BACKOFF and MAX_BACKOFF as defined above)
#include <immintrin.h>  // _mm_pause()

#include <algorithm>  // std::min()
#include <atomic>

// Simple Spinlock
// Lock now performs local spinning
// Lock now performs exponential backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // Locking mechanism
  void lock() {
    // Start backoff at MIN_BACKOFF iterations
    int backoff_iters = MIN_BACKOFF;

    // Keep trying
    while (1) {
      // Try and grab the lock
      // Return if we get the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally. This leads to less traffic.
      // Pause for an exponentially increasing number of iterations
      do {
        // Pause for some number of iterations
        for (int i = 0; i < backoff_iters; i++) _mm_pause();

        // Get the backoff iterations for next time
        backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);

        // Check to see if the lock is free
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false)
  // Can also use the assignment operator
  void unlock() { locked.store(false); }
};

Low-Level Assembly

Let’s take a look at how our low-level assembly for our benchmark looks with this new optimization:

  0.75 │22:┌─→vmovdqa _IO_stdin_used+0xa0,%xmm0
       │   │  vmovd   %xmm0,%esi                 
       │   │  xchg    %ax,%ax                    
       │30:│  mov     %r8d,%ecx                  
 17.71 │   │  xchg    %cl,(%rax)                 
       │   │  test    %cl,%cl                    
       │   │↓ je      6f                        
  0.00 │   │  nop                              
       │40:│  xor     %ecx,%ecx                
       │   │  test    %esi,%esi                
       │   │↓ jle     58                       
       │   │  nop                              
  0.00 │50:│  inc     %ecx                     
 63.59 │   │  pause                             
  0.00 │   │  cmp     %ecx,%esi                
  0.01 │   │↑ jne     50                        
  0.00 │58:│  vpslld  $0x1,%xmm0,%xmm0          
       │   │  vpminsd %xmm1,%xmm0,%xmm0         
       │   │  vmovd   %xmm0,%esi                
  0.05 │   │  movzbl  (%rax),%ecx               
       │   │  test    %cl,%cl                   
       │   │↑ jne     40                        
  0.01 │   │↑ jmp     30                        
  0.67 │6f:│  incq    (%rdi)                    
 17.20 │   │  xchg    %cl,(%rax)                
       │   ├──dec     %edx                      
  0.01 │   └──jne     22                        

We first try and get the lock using an atomic exchange (xchg). If we get the lock, we increment the shared value, release the lock, decrement our loop counter, then jump back to the top of the assembly to try and grab the lock again.

If we fail to get the lock, we fall into our pause loop (label 50:). After completing our backoff iterations, we calculate the exponential growth of our backoff iterations, read the lock state, then either jump back to our pause loop or break out and try and grab the lock again.
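
To tie this back to the source, here is the lock method again with comments marking my reading of which assembly labels correspond to which lines (the labels are, of course, specific to this particular compile):

  void lock() {
    // Label 22: backoff_iters is reset to MIN_BACKOFF (the vmovdqa/vmovd pair)
    int backoff_iters = MIN_BACKOFF;

    while (1) {
      // Label 30: the atomic exchange (xchg); on success we jump to 6f, which
      // is the inlined critical section (incq on val) and the unlocking store
      if (!locked.exchange(true)) return;

      do {
        // Labels 40/50: the pause loop running for backoff_iters iterations
        for (int i = 0; i < backoff_iters; i++) _mm_pause();

        // Label 58: the shift + min (vpslld/vpminsd) that grows the backoff
        backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);

        // Still label 58: the plain load of the lock state (movzbl), then
        // either back to 40 (still held) or up to 30 (looks free, try again)
      } while (locked.load());
    }
  }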

Performance

Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
exp_backoff/1/real_time       1.17 ms        0.047 ms          607
exp_backoff/2/real_time       2.30 ms        0.063 ms          301
exp_backoff/4/real_time       4.56 ms        0.089 ms          154
exp_backoff/8/real_time       9.12 ms        0.140 ms           77

And here was our performance from our previous blog post:

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time       1.03 ms        0.054 ms          676
passive_backoff/2/real_time       2.48 ms        0.082 ms          276
passive_backoff/4/real_time       9.59 ms        0.110 ms           70
passive_backoff/8/real_time       46.7 ms        0.236 ms           15

A significant performance improvement, but mainly in the higher thread-count cases. While our one- and two-thread benchmarks showed little to no improvement, our 4-thread case improved from 9.59ms to 4.56ms, and our 8-thread case from 46.7ms to 9.12ms.

But why did our performance improve so significantly? We’re approximately serializing our threads! Many of our threads will end up hitting the maximum backoff iterations. This makes it more likely that the thread that just released the lock will be able to re-lock the lock (because the other threads get caught up in the long pause loop).

Serialization is best for performance here because our increments cannot occur in parallel anyway. Furthermore, a single thread working at a time means it can keep hitting on the spinlock state and the shared value in its own L1 cache, without the cache line being invalidated and bouncing between cores.

Random Backoff

Another non-constant form of backoff we can look at is random backoff. Instead of our backoff iterations increasing exponentially, we’ll select the number of iterations at random. Here’s how our updated lock method looks:

  // Locking mechanism
  void lock() {
    // Keep trying
    while (1) {
      // Try and grab the lock
      if (!locked.exchange(true)) return;

      // Pause for a random number of iterations (between 4 and 1024)
      do {
        int backoff_iters = dist(rng);

        // Pause for some number of iterations
        for (int i = 0; i < backoff_iters; i++) _mm_pause();

        // Check to see if the lock is now free
      } while (locked.load());
    }
  }

Here, our random number will be between 4 and 1024 (our min and max backoff iterations from our exponential backoff example).

Here is the full spinlock implementation:

// Headers needed by the spinlock
#include <immintrin.h>  // _mm_pause()

#include <atomic>
#include <random>

// Simple Spinlock
// Lock now performs local spinning
// Lock now performs random backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};
  // Random number generator
  std::uniform_int_distribution<int> dist;
  std::mt19937 rng;

 public:
  // Constructor to set up our RNG
  Spinlock() {
    // Get random numbers between 4 and 1024
    rng.seed(std::random_device()());
    dist = std::uniform_int_distribution<int>(4, 1024);
  }

  // Locking mechanism
  void lock() {
    // Keep trying
    while (1) {
      // Try and grab the lock
      // Exit if we got the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally. This leads to less traffic.
      // Pause for a random number of iterations (between 4 and 1024)
      do {
        // Get a random number of iterations between 4 and 1024
        int backoff_iters = dist(rng);

        // Pause for some number of iterations
        for (int i = 0; i < backoff_iters; i++) _mm_pause();

        // Check to see if the lock is now free
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false)
  // Can also use the assignment operator
  void unlock() { locked.store(false); }
};

Low-Level Assembly

Let’s take a look at how our low-level assembly for our benchmark looks with this new optimization:

  0.94 │28:┌─→mov     %r13d,%eax                        
 15.29 │   │  xchg    %al,(%rbx)                        
       │   │  test    %al,%al                           
  0.00 │   │↓ je      69                               
  0.01 │   │  lea     0x4(%rbx),%r15                   
       │   │  lea     0x10(%rbx),%r14                   
       │   │  nop                                       
  0.01 │40:│  mov     %r15,%rsi                         
  0.00 │   │  mov     %r14,%rdi                         
       │   │→ callq   std::uniform_int_distribution<int>::operator()<std::mersenne_twister_engine<...
       │   │  xor     %edx,%edx                         
       │   │  test    %eax,%eax                         
       │   │↓ jle     60                                
       │   │  nop                                       
       │58:│  inc     %edx                              
 65.19 │   │  pause                                     
       │   │  cmp     %edx,%eax                         
       │   │↑ jne     58                                
  0.09 │60:│  movzbl  (%rbx),%eax                       
       │   │  test    %al,%al                           
       │   │↑ jne     40                                
       │   │↑ jmp     28                                
  0.85 │69:│  incq    (%r12)                            
 17.62 │   │  xchg    %al,(%rbx)                        
       │   ├──dec     %ebp                              
       │   └──jne     28                                

The only major difference from our previous assembly is that our exponential backoff was replaced with a call to our random number generator.

Performance

Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
random_backoff/1/real_time      0.873 ms        0.054 ms          797
random_backoff/2/real_time       1.74 ms        0.080 ms          404
random_backoff/4/real_time       3.40 ms        0.121 ms          207
random_backoff/8/real_time       6.83 ms        0.177 ms          102

And here was our performance from our previous blog post:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
exp_backoff/1/real_time       1.17 ms        0.047 ms          607
exp_backoff/2/real_time       2.30 ms        0.063 ms          301
exp_backoff/4/real_time       4.56 ms        0.089 ms          154
exp_backoff/8/real_time       9.12 ms        0.140 ms           77

It looks like we got even faster! However, we should keep in mind that we’re working with a very simple microbenchmark, so our results should be taken with a grain of salt. Some of this improvement comes from our shared value and lock being forced onto different cache lines (because of where the random number generator sits in our class). You can see that the numbers start to align more closely when we force the lock and shared value onto different cache lines in our exponential backoff benchmark as well:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
exp_backoff/1/real_time       1.01 ms        0.047 ms          695
exp_backoff/2/real_time       1.99 ms        0.057 ms          354
exp_backoff/4/real_time       3.93 ms        0.081 ms          178
exp_backoff/8/real_time       7.87 ms        0.126 ms           89
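
One common way to force that separation is to give the lock flag a full cache line to itself. Here is a minimal sketch of what that could look like, assuming 64-byte cache lines (this is not necessarily how the original benchmark did it, and the PaddedSpinlock name is just for illustration):

#include <atomic>

// Sketch: align (and implicitly pad) the spinlock to a full 64-byte cache
// line so the lock flag and the value being incremented can't share a line
class alignas(64) PaddedSpinlock {
 private:
  // Lock is still just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // lock() with exponential backoff would go here, unchanged from above
  void unlock() { locked.store(false); }
};

// With alignment 64 and a single bool member, the object fills one full line
static_assert(sizeof(PaddedSpinlock) == 64, "spinlock occupies one cache line");

Where available, std::hardware_destructive_interference_size (from <new>) could be used instead of hard-coding 64.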

Final Thoughts

We’ve made a lot of progress in terms of performance and power consumption, but we have yet to discuss other important topics like fairness and starvation. We have no guarantee that some threads won’t be waiting for extreme amounts of time for the lock. We’ll look at this in the next post with a ticket-based spinlock.

Thanks for reading,

–Nick