CoffeeBeforeArch.github.io

Spinlocks - Part 4

While everyone cares about performance (to some degree), an increasing number of people also care about power consumption (especially in massive data centers). Intuitively, we can save power by improving performance so that the processor is active for a shorter amount of time. However, there are also ways to reduce power consumption in applications that run for (approximately) the same length of time.

In the last blog post, we used an optimization called active backoff to mitigate the bursty contention for our spinlock whenever the lock is freed. In this blog post, we’ll spend some time understanding where we are wasting power in that optimization, then analyze the performance and power consumption of an optimization called passive backoff.

Our Benchmark

We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:

 // Small Benchmark
 static void bench(benchmark::State &s) {
   // Sweep over a range of threads
   auto num_threads = s.range(0);
 
   // Value we will increment
   std::int64_t val = 0;
 
   // Allocate a vector of threads
   std::vector<std::thread> threads;
   threads.reserve(num_threads);
 
   Spinlock sl;
 
   // Timing loop
   for (auto _ : s) {
     for (auto i = 0u; i < num_threads; i++) {
       threads.emplace_back([&] { inc(sl, val); });
     }
     // Join threads
     for (auto &thread : threads) thread.join();
     threads.clear();
   }
 }
 BENCHMARK(bench)
     ->RangeMultiplier(2)
     ->Range(1, std::thread::hardware_concurrency())
     ->UseRealTime()
     ->Unit(benchmark::kMillisecond);
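
Not shown above are the headers and entry point needed to build this as a standalone binary. As a rough sketch (assuming Google Benchmark is installed, and using spinlock_benchmark.cpp as a file name of my own choosing), the surrounding boilerplate might look like:

 // Hypothetical file: spinlock_benchmark.cpp
 #include <benchmark/benchmark.h>

 #include <atomic>
 #include <cstdint>
 #include <thread>
 #include <vector>

 // ... Spinlock class, inc(), and bench() from this post go here ...

 // Generates a main() that runs all registered benchmarks
 BENCHMARK_MAIN();

This could then be compiled with something like g++ spinlock_benchmark.cpp -O3 -lbenchmark -lpthread, although the exact flags depend on how Google Benchmark was installed.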

We’ll be evaluating our benchmark for power-of-two numbers of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:

 // Increment val once each time the lock is acquired
 void inc(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

For 100k iterations, our threads try to lock our spinlock, increment a shared value, then release the lock. This gives us a heavy contention scenario in which to test our spinlock.

A Spinlock with Passive Backoff

Before we jump into our passive backoff implementation, let’s try to understand where the waste is in our active backoff implementation.

Wasteful Instructions

Let’s take a look at just the assembly for the active backoff optimization we applied in the last post:

  0.35 │30:│  movl    $0x0,-0x4(%rsp)        
  0.07 │   │  mov     -0x4(%rsp),%ecx        
  0.02 │   │  cmp     $0x95,%ecx             
       │   │↓ jg      5e                     
  0.06 │   │  nop                            
  1.77 │48:│  mov     -0x4(%rsp),%ecx        
  1.44 │   │  inc     %ecx                   
 15.57 │   │  mov     %ecx,-0x4(%rsp)        
 21.50 │   │  mov     -0x4(%rsp),%ecx        
  0.07 │   │  cmp     $0x95,%ecx             
  8.49 │   │↑ jle     48                     

Our active backoff delay loop leads to thousands, millions, or billions of increments, loads, and stores, depending on the duration of our benchmark. While executing these instructions gives us an improvement in performance, it would be nice to get the same pause effect without executing useless instructions.

For spin-wait loops like our spinlock, we can use the _mm_pause() intrinsic for this exact purpose.

The Pause Intrinsic

Here is the _mm_pause() intrinsic from the Intel Intrinsics Guide. If we look at the description, it says:

Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.

With our original active backoff implementation, we got a performance benefit from creating a finite pause within the loop that polls (reads) the state of our spinlock. However, we got this at the cost of thousands, millions, or billions of extra instructions executed for our dummy for loop (depending on how long our application runs).

We can use the _mm_pause() intrinsic to get the same effect (generating a delay between reads of our spinlock state) without executing any of those extra instructions.
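
As a minimal sketch of the intrinsic in isolation (not our spinlock yet; the flag and function names here are placeholders of my own), a spin-wait on a flag might look like:

 #include <immintrin.h>  // _mm_pause()
 #include <atomic>

 // Spin until a flag becomes true, hinting to the processor
 // that this is a spin-wait loop
 void wait_for_flag(std::atomic<bool> &flag) {
   while (!flag.load()) {
     _mm_pause();
   }
 }

Note that _mm_pause() comes from the x86 intrinsics headers (e.g., immintrin.h), so the include is needed even though no other SIMD functionality is used.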

The Passive Backoff Optimization

Let’s take a look at our spinlock’s lock method with passive instead of active backoff. For this, we swap out our dummy for loop for calls to the _mm_pause() intrinsic. Here’s how that may look:

   // Locking mechanism
   void lock() {
     while (1) {
       // Try and grab the lock
       if (!locked.exchange(true)) return;
 
       do {
         // Pause for some number of iterations
         for (int i = 0; i < 4; i++) _mm_pause();
 
         // Check to see if the lock is free
       } while (locked.load());
     }
   }

The core functionality is exactly the same as our active backoff implementation. The only difference is that we’re using the _mm_pause() intrinsic instead of the dummy for loop.

Just like we could configure the number of iterations of our for loop to control the delay in the previous implementation, we can tune how many times we call the pause intrinsic as well. I’ve chosen 4 iterations for this use case to approximately match the performance of our active backoff implementation.
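
One way to make that tuning knob explicit (a sketch of my own, not part of the original implementation) is to lift the pause count into a template parameter so different values can be benchmarked without touching the loop body:

 // Hypothetical helper: pause N times between reads of the lock state
 template <int N>
 inline void backoff() {
   for (int i = 0; i < N; i++) _mm_pause();
 }

 // Inside lock(), the delay then becomes:
 //   do {
 //     backoff<4>();
 //   } while (locked.load());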

Here is our full spinlock implementation with passive backoff:

#include <atomic>       // std::atomic
#include <immintrin.h>  // _mm_pause()

// Simple Spinlock
// Lock now performs local spinning
// Lock now does passive backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // Locking mechanism
  void lock() {
    // Keep trying
    while (1) {
      // Try and grab the lock
      // Exit if we got the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value, which gets cached
      // locally. This leads to less coherence traffic.
      // Each iteration we also call pause to insert a delay between reads
      // of the lock state. How many times to pause per iteration should be
      // experimentally determined.
      do {
        // Pause for some number of iterations
        for (int i = 0; i < 4; i++) _mm_pause();

        // Check to see if the lock is free
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false)
  // Can also use the assignment operator
  void unlock() { locked.store(false); }
};
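
Because the class exposes lock() and unlock(), it satisfies the BasicLockable requirements, so it can be dropped into the standard RAII wrappers. A short usage sketch (the function and variable names here are just for illustration):

 #include <cstdint>  // std::int64_t
 #include <mutex>    // std::lock_guard

 Spinlock sl;
 std::int64_t counter = 0;

 void safe_increment() {
   // Lock is acquired here and automatically released when guard
   // goes out of scope (even if an exception is thrown)
   std::lock_guard<Spinlock> guard(sl);
   counter++;
 }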

Low-Level Assembly

Let’s take a look at the low-level assembly for our benchmark with this new optimization:

  0.34 │20:┌─→mov     %edi,%ecx                  
 17.32 │   │  xchg    %cl,(%rax)                 
  0.01 │   │  test    %cl,%cl                    
  0.01 │   │↓ je      41                         
  0.09 │   │  nop                                
 12.61 │30:│  pause                              
 12.39 │   │  pause                              
 12.32 │   │  pause                              
 12.06 │   │  pause                              
 15.37 │   │  movzbl  (%rax),%ecx                
  0.03 │   │  test    %cl,%cl                    
  0.05 │   │↑ jne     30                         
  0.42 │   │↑ jmp     20                         
  2.75 │41:│  incq    (%rsi)                     
 14.23 │   │  xchg    %cl,(%rax)                 
       │   ├──dec     %edx                       
  0.00 │   └──jne     20                        

No major structural changes compared to our previous implementation. We start by trying to get the lock with an atomic exchange (xchg). If we get the lock, we increment the shared value (incq), free the lock with another atomic exchange (xchg), decrement the loop counter (dec), and return to the top of the loop if there are more iterations.

If we didn’t get the lock, we fall into our locally spinning do-while loop with passive backoff starting at label 30:. There, we see 4 pause instructions (coming from our calls to the _mm_pause() intrinsic), followed by our read of the spinlock state, which decides whether we pause again or try to grab the lock.

Performance

Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time       1.03 ms        0.054 ms          676
passive_backoff/2/real_time       2.48 ms        0.082 ms          276
passive_backoff/4/real_time       9.59 ms        0.110 ms           70
passive_backoff/8/real_time       46.7 ms        0.236 ms           15

And here was our performance for the active backoff implementation:

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
active_backoff/1/real_time       1.05 ms        0.054 ms          679
active_backoff/2/real_time       2.60 ms        0.081 ms          281
active_backoff/4/real_time       9.49 ms        0.121 ms           75
active_backoff/8/real_time       50.5 ms        0.287 ms           14

As expected, the performance of the two implementations was about the same (because I approximately matched the number of loop iterations to the number of calls to _mm_pause()). The slight differences are to be expected because thread scheduling is non-deterministic, and it’s nearly impossible to perfectly match the performance of two different pieces of code. But remember, the primary thing we’re optimizing for here is power consumption. Let’s start digging into those metrics.

Power Consumption

To reason about the power consumption of our benchmarks, we’ll first look at the total number of instructions executed so we can compare the amount of work being done. We’ll then directly query power performance counters to check whether our reasoning about power consumption was correct.

Instructions Executed

Let’s start our power-consumption analysis by comparing the total number of instructions executed. For this, we’ll compare the 8-thread benchmark case for our active and passive backoff implementations with the number of benchmark iterations set to 50.
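
These counts come from hardware performance counters; one way to collect them on Linux (assuming perf is available, and using a placeholder binary name of my own) is something like:

 # Count instructions and cycles for just the 8-thread passive backoff case
 perf stat -e instructions,cycles ./spinlock_benchmark \
     --benchmark_filter=passive_backoff/8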

Here is the number of instructions executed for our active backoff run:

55,227,656,500      instructions              #    0.76  insn per cycle

Around 55 billion. Because thread scheduling is non-deterministic, this number will vary from run to run, but for me it generally fell in the 45-60 billion instruction range.

Now let’s see how many instructions were executed for our passive backoff implementation:

1,012,803,760      instructions              #    0.01  insn per cycle         

A ~55x reduction in instructions! But this should make sense! In our active backoff implementation, we’re executing billions of extra instructions just to induce the delay inside of our locally spinning loop. In our passive backoff implementation, we’re using a dedicated instruction that adds a finite delay to our pipeline, allowing us to simply stop doing work for short periods of time.

Intuitively, we can reason that our passive backoff implementation, which executes ~55x fewer instructions and has roughly equivalent end-to-end execution time, should consume less power than our active backoff implementation while running.

Additional Statistics

One thing you may notice if you look at our other performance counter stats for the passive backoff benchmark is that we have very high branch misprediction and L1 data cache miss rates:

216,996,864      branches                  #   12.154 M/sec                  
 35,495,595      branch-misses             #   16.36% of all branches        
240,163,142      L1-dcache-loads           #   13.452 M/sec                  
147,908,665      L1-dcache-load-misses     #   61.59% of all L1-dcache hits  

What’s going on? Let’s think about what we know about our benchmarks and spinlock implementations. In both benchmarks, we have a heavy contention scenario where threads fight over the cache line(s) containing our spinlock state and shared value val. Furthermore, when a thread actually sees the lock as free or successfully grabs the lock is non-deterministic, and can be difficult for our branch predictor to guess correctly.

These things explain our observations for passive backoff, but why don’t they look like a problem in our active backoff implementation?

 7,871,605,800      branches                  #  576.093 M/sec                  
    76,998,863      branch-misses             #    0.98% of all branches        
15,543,237,429      L1-dcache-loads           # 1137.551 M/sec                  
   106,735,606      L1-dcache-load-misses     #    0.69% of all L1-dcache hits  

Just look at the raw counter values! We’re only doing ~217 million branches in our passive backoff benchmark, but ~7.9 billion in our active backoff benchmark. Likewise, our L1 data cache loads increased from ~240 million to ~15.5 billion! Most of these loads and branches come from the dummy for loop we added to insert a delay. They hide the underlying randomness and lack of cache locality that is fundamental to our benchmark.

Power Performance Counters

A more direct way to estimate the power consumption of our benchmarks is to read the power performance counters directly. Specifically, we’ll be looking at power/energy-pkg/.
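
On Linux, this counter can be read system-wide with perf while the benchmark runs (this generally requires elevated privileges; the binary name below is again a placeholder):

 # Measure package energy for the whole system while the benchmark runs
 sudo perf stat -a -e power/energy-pkg/ ./spinlock_benchmark \
     --benchmark_filter=passive_backoff/8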

Here is the estimated power consumption for our active backoff benchmark:

-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
active_backoff/8/iterations:50       46.3 ms        0.172 ms           50

 Performance counter stats for 'system wide':

            190.59 Joules power/energy-pkg/                                           

       2.337446416 seconds time elapsed

And here is the estimate for our passive backoff benchmark:

--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
passive_backoff/8/iterations:50       48.7 ms        0.197 ms           50

 Performance counter stats for 'system wide':

            151.31 Joules power/energy-pkg/                                           

       2.457350713 seconds time elapsed

Looks great! Notice that even though our passive backoff performance was slightly worse and the benchmark runs for longer, we’re still consuming significantly less energy (151 Joules vs. 190 Joules).

Final Thoughts

While compilers are great, they don’t always use the best instructions for the job, and there are many hyper-specific instructions (like our pause) that they will simply never generate. This makes it very important for us to know what tools we have available in our programming toolbox (e.g., intrinsics). In the next post, we’ll look at some spinlock implementations that use a non-constant form of backoff.

Thanks for reading,

–Nick