Spinlocks - Part 1

One of the most important parts of parallel programming is synchronization and coordinating access to shared data. An important tool in that domain is the spinlock. Spinlocks provide a way for threads to busy-wait for a lock instead of yielding their remaining time to the OS’s thread scheduler. This can be incredibly useful when you know your threads will not be waiting very long for access to the lock.

In this first blog post on spinlocks, we’ll look at how we can implement a basic spinlock in C++, measure the performance on a simple benchmark, and think about areas where we can improve.

Our Benchmark

We will evaluate our spinlock implementation using Google Benchmark. The structure of the benchmark can be found below:

 // Small Benchmark
 static void bench(benchmark::State &s) {
   // Sweep over a range of threads
   auto num_threads = s.range(0);
 
   // Value we will increment
   std::int64_t val = 0;
 
   // Allocate a vector of threads
   std::vector<std::thread> threads;
   threads.reserve(num_threads);
 
   Spinlock sl;
 
   // Timing loop
   for (auto _ : s) {
     for (auto i = 0u; i < num_threads; i++) {
       threads.emplace_back([&] { inc(sl, val); });
     }
     // Join threads
     for (auto &thread : threads) thread.join();
     threads.clear();
   }
 }
 BENCHMARK(bench)
     ->RangeMultiplier(2)
     ->Range(1, std::thread::hardware_concurrency())
     ->UseRealTime()
     ->Unit(benchmark::kMillisecond);

We’ll be evaluating our benchmark for power-of-two numbers of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:

 // Increment val once each time the lock is acquired
 void inc(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

For 100k iterations, our threads try to lock our spinlock, increment a shared value, then release the lock. This gives us a heavy-contention scenario in which to test our spinlock.
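
The snippets above leave out the includes and the program entry point. With Google Benchmark, the rest of the file is typically just the necessary headers plus the BENCHMARK_MAIN() macro. Here is a minimal sketch of that scaffolding (the exact file layout is an assumption):

 // Headers the benchmark and spinlock snippets rely on
 #include <benchmark/benchmark.h>

 #include <atomic>
 #include <cstdint>
 #include <thread>
 #include <vector>

 // ... Spinlock, inc, and bench definitions from above ...

 // Expands to a main() that runs all registered benchmarks
 BENCHMARK_MAIN();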

A Naive Spinlock

Fundamentally, a spinlock requires three elements for functionality:

  1. State to maintain whether or not the spinlock is currently locked
  2. A method for locking the spinlock
  3. A method for unlocking the spinlock

Here is a naive way you could implement a spinlock as a class in C++:

 class Spinlock {
  private:
   // Spinlock state
   std::atomic<bool> locked{false};
 
  public:
   // Lock the spinlock
   void lock() {
     while (locked.exchange(true))
       ;
   }
   
   // Unlock the spinlock
   void unlock() { locked.store(false); }
 };

Spinlock State

For our spinlock state (locked) we’re using a std::atomic<bool>. This has been made atomic because multiple threads will be competing to modify the spinlock state at the same time.

Using an atomic forces the read-modify-write of our lock state to be one indivisible operation (i.e., reading the lock state, changing it to locked, and writing it back to memory happens as a single operation). If we did not use an atomic, the subcomponents of our read-modify-write (the read, the modify, and the write) could be interleaved between different threads, leading to two threads thinking they both grabbed the lock.
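
To see why this matters, here is a hypothetical broken version that uses a plain bool instead of an atomic. The check and the store are separate steps, so two threads can both observe the lock as free and both "acquire" it (this sketch is for illustration only, and is not part of this post's implementation):

 // Broken sketch for illustration only - do not use
 class BrokenSpinlock {
  private:
   // Plain bool - no atomicity, and concurrent access is a data race
   bool locked = false;

  public:
   void lock() {
     // Two threads can both see locked == false here...
     while (locked)
       ;
     // ...and both reach this store, each thinking it owns the lock
     locked = true;
   }

   void unlock() { locked = false; }
 };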

The Lock Method

Our lock method is a simple while loop in which we repeatedly do an atomic exchange of true with the current state of the lock (locked). If the lock is free (locked == false), the atomic exchange will set locked to true and return false. This makes the thread break out of the while loop and return from the lock method.

If the spinlock was already locked by another thread (locked == true), the atomic exchange will repeatedly swap out true (the current state of locked) with true (the value passed to the .exchange() method). This means the waiting thread will get stuck in the while loop until the lock is eventually freed by another thread and its exchange finally returns false.
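
For reference, the same test-and-set loop is often written with std::atomic_flag, whose test_and_set() returns the previous value just like our exchange(true). This is only an equivalent sketch of the locking loop, not the implementation benchmarked in this post:

 #include <atomic>

 // Equivalent sketch of the locking loop using std::atomic_flag;
 // test_and_set() returns the old value, like exchange(true) above
 void flag_lock(std::atomic_flag &flag) {
   while (flag.test_and_set())
     ;
 }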

The Unlock Method

The unlock method for our spinlock is incredibly simple. When the thread holding the lock wants to release the lock, it simply sets locked to false. The next atomic exchange from a waiting thread in the lock method will then grab the lock.
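
Because our class exposes lock() and unlock(), it also satisfies the BasicLockable requirements, so scope-based helpers like std::lock_guard work with it. Here is a small sketch of the increment loop written that way (inc_guarded is a hypothetical name, not part of the benchmark):

 #include <cstdint>
 #include <mutex>  // std::lock_guard

 // Same increment loop, but letting std::lock_guard call lock()/unlock()
 void inc_guarded(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     std::lock_guard<Spinlock> guard(s);  // locks here, unlocks at end of scope
     val++;
   }
 }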

Low Level Assembly

Let’s take a look at the low level assembly for our benchmark to see what is being executed by the processor:

  0.33 │20:┌─→mov     %edi,%ecx     
 77.99 │   │  xchg    %cl,(%rdx)    
  0.03 │   │  test    %cl,%cl       
  0.18 │   │↑ jne     20            
  8.43 │   │  incq    (%rsi)        
 13.05 │   │  xchg    %cl,(%rdx)    
       │   ├──dec     %eax          
  0.00 │   └──jne     20            

The first thing we see is our loop with an atomic exchange (xchg). This is our lock method where we spin in a while loop until our atomic exchange returns false.

The next thing we have is an increment instruction (incq). This is the increment of our shared variable in the inc function our threads are running.

Finally, we have another atomic exchange (xchg) to release the lock (on x86, the unlock’s sequentially consistent store compiles down to an xchg), before decrementing the loop counter for our for loop and repeating the process if the 100k iterations are not done.

Performance

Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
naive/1/real_time       1.05 ms        0.054 ms          686
naive/2/real_time       14.2 ms        0.090 ms           47
naive/4/real_time       66.8 ms        0.129 ms           10
naive/8/real_time        247 ms        0.255 ms            3

We might have initially assumed that doubling the number of threads would double the execution time, because the amount of work (the total number of increments of our shared value) doubles. However, this is not what we see from our timing results. Instead of our 2-thread benchmark taking 2x as long as the single-threaded case, it takes 14x longer. You can observe a similar trend between the other benchmark cases as well.

We must have some inter-thread contention that is causing our execution time to explode. Because our benchmark is simply doing a few loads and stores to memory, a good place to start would be looking at our caches. Here is our L1 cache miss rate for the 8-thread benchmark:

104,434,391      L1-dcache-loads           #   10.438 M/sec                  
 48,181,819      L1-dcache-load-misses     #   46.14% of all L1-dcache hits  

It seems like our current spinlock implementation is causing a huge number of L1 cache misses. This is something we’ll tackle in the next blog post with an optimization called locally spinning.
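
As a rough preview, locally spinning means waiting on plain loads of the lock state (which can be served from the local cache) and only attempting the expensive exchange when the lock looks free. Here is a hedged sketch of the idea; the next post’s actual implementation may differ:

 // Rough sketch of a lock method that spins locally (test and test-and-set);
 // a drop-in replacement for Spinlock::lock() in the class above
 void lock() {
   while (true) {
     // Try to grab the lock
     if (!locked.exchange(true)) return;
     // Otherwise, wait with read-only loads until the lock looks free
     while (locked.load())
       ;
   }
 }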

Final Thoughts

It wasn’t very hard to write a functionally correct spinlock. However, we have a long way to go before we reach performance (and power efficiency) similar to that of a library implementation (like pthread_spinlock_t).

Thanks for reading,

–Nick