Spinlocks - Part 6

In the last blog post, we looked at non-constant forms of backoff that further improved the end-to-end execution time of our benchmarks. However, optimizations that help with throughput/performance may hurt other metrics like fairness.

In this blog post, we’ll look at a spinlock implementation optimized for fairness called a ticket spinlock.

Our Benchmark

We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:

 // Small Benchmark
 static void bench(benchmark::State &s) {
   // Sweep over a range of threads
   auto num_threads = s.range(0);
 
   // Value we will increment
   std::int64_t val = 0;
 
   // Allocate a vector of threads
   std::vector<std::thread> threads;
   threads.reserve(num_threads);
 
   Spinlock sl;
 
   // Timing loop
   for (auto _ : s) {
     for (auto i = 0u; i < num_threads; i++) {
       threads.emplace_back([&] { inc(sl, val); });
     }
     // Join threads
     for (auto &thread : threads) thread.join();
     threads.clear();
   }
 }
 BENCHMARK(bench)
     ->RangeMultiplier(2)
     ->Range(1, std::thread::hardware_concurrency())
     ->UseRealTime()
     ->Unit(benchmark::kMillisecond);

We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:

 // Increment val once each time the lock is acquired
 void inc(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

For 100k iterations, each thread tries to lock our spinlock, increments a shared value, then releases the lock. This gives us a heavy-contention scenario in which to test our spinlock.

A Ticket Spinlock

Before we jump into our ticket spinlock implementation, we’ll discuss why we should care about things like fairness and starvation.

Fairness and Starvation

End-to-end performance and throughput are not the only metrics to consider when optimizing. In many circumstances, we also care about things like fairness and starvation.

In the case of our spinlock, starvation occurs when one thread (or a subset of threads) hogs access to our spinlock, starving one or more threads of access. So far, we have done nothing to prevent this from happening in any of our implementations (naive, spin_locally, active_backoff, passive_backoff, exp_backoff, random_backoff).

In fact, our exponential backoff implementation increases unfairness/starvation. To understand this, let’s take a look back at our lock method.

   // Locking mechanism
   void lock() {
     // Start backoff at MIN_BACKOFF iterations
     int backoff_iters = MIN_BACKOFF;
   
     // Keep trying
     while (1) {
       // Try and grab the lock
       // Return if we get the lock
       if (!locked.exchange(true)) return;
   
       // If we didn't get the lock, just read the value which gets cached
       // locally. This leads to less traffic.
       // Pause for an exponentially increasing number of iterations
       do {
         // Pause for some number of iterations
         for (int i = 0; i < backoff_iters; i++) _mm_pause();
 
         // Get the backoff iterations for next time
         backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);
   
         // Check to see if the lock is free
       } while (locked.load());
     }
   }

The longer a thread waits for the spinlock, the longer it pauses in the _mm_pause() loop. That means a thread that just started waiting for the lock will check the lock state more frequently than a thread that has been waiting for a long time (because it goes through fewer pause iterations between checks).

In order to prevent starvation and promote fairness in how access to the lock is granted, we can use a ticket-based system.

The Ticket Spinlock Optimization

The core idea behind the ticket spinlock is that threads should acquire the lock in the order in which they requested it. What we’re essentially implementing is a FIFO (first in, first out) lock.

Our ticket spinlock will have 4 components:

  1. State for the latest ticket number
  2. State for the now-serving number
  3. A method for locking our spinlock
  4. A method for unlocking our spinlock
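
Before walking through each piece, here is the bare skeleton those four components map onto (just the declarations; this mirrors the full implementation shown later in the post):

  #include <atomic>
  #include <cstdint>

  class Spinlock {
   private:
    // 1.) The latest ticket number handed out
    std::atomic<std::uint16_t> line{0};
    // 2.) The number currently being served
    volatile std::uint16_t serving{0};

   public:
    void lock();    // 3.) Locking mechanism
    void unlock();  // 4.) Unlocking mechanism
  };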

Latest Ticket Number State

For the state that maintains the latest ticket number, we’re going to use a std::atomic<std::uint16_t>.

   std::atomic<std::uint16_t> line{0};

Similar to our lock state from our previous examples, we need to make this atomic because multiple threads will be trying to get “tickets” at the same time, and trying to modify this state. Making this atomic prevents two threads from thinking they have the same “ticket”.
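
As a small aside, here is a minimal standalone sketch (my own illustration, not code from the post) of why fetch_add is enough to hand out unique tickets: each call atomically returns the previous value and increments it, so no two threads can ever observe the same ticket.

  #include <atomic>
  #include <cassert>
  #include <cstdint>
  #include <mutex>
  #include <set>
  #include <thread>
  #include <vector>

  int main() {
    std::atomic<std::uint16_t> line{0};
    std::set<std::uint16_t> tickets;
    std::mutex m;

    // Four threads each take 1000 tickets
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; t++) {
      threads.emplace_back([&] {
        for (int i = 0; i < 1000; i++) {
          auto ticket = line.fetch_add(1);
          std::lock_guard<std::mutex> lg(m);
          tickets.insert(ticket);
        }
      });
    }
    for (auto &thread : threads) thread.join();

    // 4 threads * 1000 tickets -> 4000 distinct values (no duplicates)
    assert(tickets.size() == 4000);
  }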

Now Serving Number State

For our now-serving number we’ll use a volatile std::uint16_t.

   volatile std::uint16_t serving{0};

By marking this volatile, we prevent the compiler from storing this in a register. This is important, because modifying the serving number is how we signal the next thread to start working.

However, we need not make this atomic. The only thread that will be modifying the serving number is the thread currently holding the lock. We’ll go into some of the more gory details about volatile and compiler memory ordering near the end of the post.

A Lock Method

Our lock method is fairly simple:

   // Locking mechanism
   void lock() {
     // Get the latest place in line (and increment the value)
     auto place = line.fetch_add(1);
 
     // Wait until our number is "called"
     while (serving != place)
       ;
   }

To lock our spinlock, we do an atomic fetch and add to get our place in line (our “ticket”). This reads the current value of the latest ticket number (line) into the place variable, and stores line + 1 back to memory (atomically).

After a thread gets a “ticket”, it waits for its number to be “called”. This is done in the while loop, where the thread checks whether serving is equal to place.

Once the values are equal, the thread exits the lock method (it now has the lock).

An Unlock Method

Our unlock method is still just a single line of code:

   void unlock() { serving = serving + 1; }

In order to pass the lock to the next thread, we just need to increment the serving value by 1. The next thread in line will then see that serving is equal to its place, and break out of its lock method.

We’ll discuss some pitfalls related to compiler memory ordering relating to serving near the end of the post.

Full Ticket Spinlock Implementation

Here is the full ticket spinlock implementation:

 // Simple Spinlock
 // Now uses ticket system for fairness
 class Spinlock {
  private:
   // Lock is now two counters:
   //  1.) The latest place taken in line
   //  2.) Which number is currently being served
   std::atomic<std::uint16_t> line{0};
   // Needs to avoid the compiler putting this in a register!
   volatile std::uint16_t serving{0};
 
  public:
   // Locking mechanism
   void lock() {
     // Get the latest place in line (and increment the value)
     auto place = line.fetch_add(1);
 
     // Wait until our number is "called"
     while (serving != place)
       ;
   }
 
   // Unlocking mechanism
   // Increment serving number to pass the lock
  // No need for an atomic! The thread with the lock is the only one that
  // modifies this variable!
   void unlock() { serving = serving + 1; }
 };
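
As a quick usage note (my own addition, not part of the original benchmark), because the class exposes lock() and unlock() it satisfies the standard BasicLockable requirements, so the critical section from our inc function could equally be written with an RAII guard:

  #include <cstdint>
  #include <mutex>  // std::lock_guard

  // Same behavior as the inc() used in the benchmark, written with RAII
  void inc(Spinlock &s, std::int64_t &val) {
    for (int i = 0; i < 100000; i++) {
      std::lock_guard<Spinlock> guard(s);  // calls s.lock() here ...
      val++;
    }                                      // ... and s.unlock() at scope exit
  }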

Low-Level Assembly

Let’s take a look at how our low-level assembly for our benchmark looks with this new optimization:

  0.06 │18:┌─→mov     $0x1,%edx                
 20.41 │   │  lock    xadd    %dx,(%rax)       
  0.03 │   │  nop                              
 50.39 │28:│  movzwl  0x2(%rax),%edi           
  0.05 │   │  cmp     %di,%dx                  
 24.54 │   │↑ jne     28                       
  3.58 │   │  movzwl  0x2(%rax),%edx           
  0.66 │   │  incq    (%rsi)                   
       │   │  inc     %edx                     
       │   │  mov     %dx,0x2(%rax)            
       │   ├──dec     %ecx                     
  0.27 │   └──jne     18                       

The first thing our threads do is an atomic fetch and add (xadd) to get their place in line. The threads then wait inside the loop at label 28:, where the now-serving number is repeatedly read.

If a thread finds the now-serving number is equal to its ticket number, it breaks out of the loop, increments the shared value (incq), then increments the now-serving number and stores it to memory.

Each thread then either jumps back to the top of the loop to get a new ticket, or finishes executing.

Removing the Volatile Qualifier

I mentioned earlier that we needed the volatile qualifier to prevent the compiler from caching the now-serving number in a register. Let’s look at the assembly generated when the volatile qualifier is removed:

  0.81 │11:┌─→mov     $0x1,%edi            
 17.86 │   │  lock    xadd    %di,(%rax)   
       │   │  movzwl  0x2(%rax),%esi       
  0.74 │1f:│  cmp     %si,%di              
 79.72 │   │↑ jne     1f                   
       │   │  incq    (%rcx)               
  0.87 │   │  incw    0x2(%rax)            
       │   ├──dec     %edx                 
       │   └──jne     11                   

Similar to the previous assembly, our threads first do an atomic fetch and add to get a place in line. Next, they read the now-serving number into the register esi. From there, the threads fall into a 2 instruction loop at label 1f: where the values of two registers (our number in line and the now-serving number) are compared until they are no longer equal.

However, if a thread tries to grab the lock while it is already taken, it will never break out of this loop! Let’s think about why. If the lock is free and one thread tries to grab it, the thread grabs a place in line, reads the now-serving number into a register, finds that it is equal to its place in line, and exits the lock method.

If the lock is already taken, the thread gets its place in line, reads the now-serving number into a register, and then compares its place in line against that register value forever, finding that they are never equal. The now-serving value in the register never changes, because the thread holding the lock only updates the now-serving number in memory; that does not affect values already stored in registers (cache coherence does not apply to registers!).

Threads waiting in the loop at 1f: will never see the update in memory, and our program will deadlock.
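
At the C++ level, the hazard boils down to spinning on a plain variable that nothing forces the compiler to re-read. Here is a reduced sketch (my own illustration, not code from the post):

  #include <cstdint>

  std::uint16_t serving = 0;  // plain: neither volatile nor atomic

  void wait_for(std::uint16_t place) {
    // The compiler may load serving once, keep it in a register, and compare
    // register against register forever -- the two-instruction loop at 1f:
    while (serving != place)
      ;
  }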

Performance

Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
ticket_lock/1/real_time      0.651 ms        0.053 ms         1093
ticket_lock/2/real_time       11.3 ms        0.069 ms           67
ticket_lock/4/real_time       29.7 ms        0.103 ms           24
ticket_lock/8/real_time       69.3 ms        0.190 ms           10

Our performance is lower compared to previous implementations (e.g., exponential backoff) but it’s not terrible. In fact, it’s fairly close to our passive backoff implementation. As a reminder, here are those performance numbers:

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time       1.03 ms        0.054 ms          676
passive_backoff/2/real_time       2.48 ms        0.082 ms          276
passive_backoff/4/real_time       9.59 ms        0.110 ms           70
passive_backoff/8/real_time       46.7 ms        0.236 ms           15

So why is our performance worse? Let’s think about what we optimized. In the exponential backoff implementation, we were roughly serializing our application through long delays so that one thread had uninterrupted access to the lock.

In our ticket spinlock, we’re doing almost the opposite. We’re now ensuring that the lock will be passed between requesting threads in FIFO order. That means the cache line with our lock and shared value will bounce between the cores of requesting threads, adding extra overhead.

Thoughts on Compiler Memory Re-Ordering

One of the potential pitfalls we have not discussed yet is memory re-ordering done by the compiler. volatile does not establish a “happens-before” relationship, meaning that a non-volatile store can be reordered and placed after a volatile store, even if it occurs earlier in the written program.

To prevent this, we can insert a compiler memory barrier in our unlock method:

   void unlock() {
     asm volatile("" : : : "memory");
     serving = serving + 1;
   }
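
As an aside, if inline asm feels too compiler-specific, std::atomic_signal_fence from <atomic> is commonly used for the same job; it emits no CPU instructions, and in practice mainstream compilers treat it as a compiler-only barrier. This variant is my own sketch, not the implementation used for the numbers below:

   void unlock() {
     // Compiler-only fence: no instruction is emitted, but the compiler will
     // not reorder memory accesses across it
     std::atomic_signal_fence(std::memory_order_seq_cst);
     serving = serving + 1;
   }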

The empty asm statement with a "memory" clobber tricks the compiler into thinking we have an instruction that may touch all memory (thereby introducing fake dependencies into our program). If we look at the assembly for our benchmark after inserting this hint, we get the following:

       │18:┌─→mov     $0x1,%edx         
 20.97 │   │  lock    xadd    %dx,(%rax)
  0.04 │   │  nop                       
 46.76 │28:│  movzwl  0x2(%rax),%edi    
  0.06 │   │  cmp     %di,%dx           
 28.11 │   │↑ jne     28                
  3.31 │   │  incq    (%rsi)            
  0.33 │   │  movzwl  0x2(%rax),%edx    
  0.01 │   │  inc     %edx              
  0.09 │   │  mov     %dx,0x2(%rax)     
       │   ├──dec     %ecx              
  0.32 │   └──jne     18                

The only difference is that our read (movzwl) of the now-serving number was moved after the increment of our shared value val. Our operations now occur in the same order as in our high-level C++. Unfortunately, this comes at a slight cost to our performance (mainly at a higher number of threads):

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
ticket_lock/1/real_time      0.635 ms        0.044 ms         1101
ticket_lock/2/real_time       10.3 ms        0.057 ms           66
ticket_lock/4/real_time       27.5 ms        0.088 ms           26
ticket_lock/8/real_time       77.1 ms        0.187 ms            9

If all of these volatile and memory ordering details are getting too confusing, we can always replace our now-serving number with an atomic. With the default std::memory_order_seq_cst, this reordering will not occur. Here is the resulting assembly:

  0.02 │18:┌─→mov     $0x1,%edi                   
  9.33 │   │  lock    xadd    %di,(%rax)          
  0.04 │   │  nop                                 
 50.43 │28:│  movzwl  (%rcx),%r8d                 
  0.08 │   │  cmp     %r8w,%di                    
 23.09 │   │↑ jne     28                          
  4.21 │   │  incq    (%rsi)                      
 12.79 │   │  lock    incw (%rcx)                 
       │   ├──dec     %edx                        
       │   └──jne     18                          

We see roughly the same assembly as when we used the compiler memory barrier. The only major difference is that we now use an atomic increment (lock incw) to release the lock. If we look at the performance, we see it has dropped again, this time for all thread counts:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
ticket_lock/1/real_time       1.07 ms        0.054 ms          656
ticket_lock/2/real_time       16.7 ms        0.075 ms           46
ticket_lock/4/real_time       43.5 ms        0.105 ms           16
ticket_lock/8/real_time        111 ms        0.236 ms            6

We could of course change the memory ordering for each of our atomics away from conservative sequential consistency (to the appropriate orderings), but I’ll leave that as an exercise for the reader (I didn’t find the end results to be particularly compelling or worthy of including here).
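
For reference, here is roughly what that might look like (a sketch under my own assumptions, not the code measured above): the ticket can be taken with a relaxed fetch_add, the wait loop only needs an acquire load, and passing the lock only needs a release store.

  #include <atomic>
  #include <cstdint>

  // Ticket spinlock with explicit (non-seq_cst) memory orderings
  class Spinlock {
   private:
    std::atomic<std::uint16_t> line{0};
    std::atomic<std::uint16_t> serving{0};

   public:
    void lock() {
      // Taking a ticket publishes no data, so relaxed is sufficient
      auto place = line.fetch_add(1, std::memory_order_relaxed);

      // Acquire so the previous holder's critical-section writes are visible
      while (serving.load(std::memory_order_acquire) != place)
        ;
    }

    void unlock() {
      // Only the lock holder writes serving, so a relaxed read is fine here;
      // the release store publishes our critical-section writes
      serving.store(serving.load(std::memory_order_relaxed) + 1,
                    std::memory_order_release);
    }
  };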

Final Thoughts

Ticket-based spinlocks give us a fair way for threads to busy-wait for a lock. While our end-to-end performance took a hit (compared to our exponential backoff spinlock), we were able to understand why. Throughput is not the only metric in the world, and there are plenty of situations where we want a compromise between throughput and fairness.

Thanks for reading,

–Nick