Spinlocks - Part 2

In the last blog post, we implemented a basic spinlock in C++. While our implementation was functionally correct, the performance left something to be desired. When we dug into our performance counters, we quickly discovered that our L1 data cache misses were through the roof.

In this blog post, we’ll spend some time understanding where our L1 data cache misses were coming from, and address them with a spinlock optimization called spinning locally.

Our Benchmark

We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:

 // Small Benchmark
 static void bench(benchmark::State &s) {
   // Sweep over a range of threads
   auto num_threads = s.range(0);
 
   // Value we will increment
   std::int64_t val = 0;
 
   // Allocate a vector of threads
   std::vector<std::thread> threads;
   threads.reserve(num_threads);
 
   Spinlock sl;
 
   // Timing loop
   for (auto _ : s) {
     for (auto i = 0u; i < num_threads; i++) {
       threads.emplace_back([&] { inc(sl, val); });
     }
     // Join threads
     for (auto &thread : threads) thread.join();
     threads.clear();
   }
 }
 BENCHMARK(bench)
     ->RangeMultiplier(2)
     ->Range(1, std::thread::hardware_concurrency())
     ->UseRealTime()
     ->Unit(benchmark::kMillisecond);

We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:

 // Increment val once each time the lock is acquired
 void inc(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

For 100k iterations, our threads try and lock our spinlock, increment a shared value, then release the lock. This gives us a heavy contention scenario in which to test our spinlock.

A Spinlock with Locally Spinning

As a reminder, here were the cache misses for our 8-thread benchmark using the naive spinlock:

104,434,391      L1-dcache-loads           #   10.438 M/sec                  
 48,181,819      L1-dcache-load-misses     #   46.14% of all L1-dcache hits  

To tackle these L1 cache misses, we should first try and understand where they are coming from. One way we can approach this problem is by categorizing our misses by miss type.

Let’s categorize our cache misses according to the 4-Cs of cache misses.

The 4-Cs of Cache Misses

Cache misses have classically been categorized into 4 types:

  1. Cold-Start/Compulsory Misses
  2. Capacity Misses
  3. Conflict Misses
  4. Coherence Misses

Let’s reason about which category our misses fall into so that it can guide our plan for optimization.

Cold-Start/Compulsory Misses

Cold-start misses (sometimes called compulsory misses) occur when we access data for the first time. With the exception of prefetching, data accessed for the first time will not be in our caches, and we will see cache misses.

Is the large percentage of L1 cache loads being misses a result of cold-start misses?

Unlikely. Let’s think about the loop each thread in our benchmark is executing:

 // Increment val once each time the lock is acquired
 void inc(Spinlock &s, std::int64_t &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

We’re repeatedly accessing the same two pieces of memory (the state of the spinlock in our lock and unlock methods, and the shared value val). While we may initially miss in the first iteration of our loop, we should not miss in the thousands of remaining iterations.

Capacity Misses

Capacity misses occur when the amount of data we’re accessing is simply too large to fit into our cache. For example, if we’re iterating over 1GB of data, the entire GB cannot possibly fit into a 32KB L1 cache.
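
To make that concrete, here’s a minimal sketch (not from the original post) of a workload whose misses would be dominated by capacity misses; the buffer size is just an illustrative choice:

 // A minimal sketch: streaming over a working set far larger than a 32KB L1
 // evicts lines before they are reused, so even repeated passes keep missing
 // (capacity misses)
 #include <cstdint>
 #include <vector>
 
 int main() {
   // ~64MB of data: far larger than any L1 data cache
   std::vector<std::int64_t> big(8 * 1024 * 1024, 1);
   volatile std::int64_t sink = 0;
   for (int pass = 0; pass < 10; pass++)
     for (auto v : big) sink = sink + v;  // e.g., observe the misses under perf stat
   return 0;
 }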

Is the large percentage of L1 cache loads being misses a result of capacity misses?

Unlikely. Let’s think about the size of the data we’re accessing. We’re still only accessing two pieces of data (our lock state, and our shared value val). Our lock state is a std::atomic<bool>, which conservatively takes up only 1 byte. Our value, val, is a std::int64_t, which is 8 bytes. That means together we’re only using 9 bytes of data.

9 bytes of data is not large enough to cause capacity misses in a modern L1 data cache (typically 32KB).

Conflict Misses

Conflict misses occur when we access cache lines that all map to the same set (or subset of total sets) in our set-associative caches. Modern L1 caches usually have 8-way associativity, meaning that each set has 8 slots where a cache line mapped to that set can be placed.

If we access 8 cache lines that map to the same set, they can all be stored in our cache (in that set). However, if we access a 9th cache line that also maps to that set, one of the cache lines currently in the cache will need to be evicted. This can lead to cache misses when the evicted line is accessed again.
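
For illustration, here’s a minimal sketch (not from the original post) that would generate conflict misses. It assumes a 32KB, 8-way L1 cache with 64-byte lines (64 sets), so addresses 4KB apart map to the same set; the stride and line count are chosen so that more lines compete for one set than it can hold:

 // A minimal sketch (assumes a 32KB, 8-way, 64B-line L1): cycling through 16
 // cache lines that all map to the same set thrashes that set, even though
 // only ~1KB of data is live
 #include <cstdint>
 #include <vector>
 
 int main() {
   constexpr std::size_t kStride = 4096;  // one way of the assumed L1 -> same set
   constexpr std::size_t kLines = 16;     // more lines than the 8 ways per set
   std::vector<std::uint8_t> buf(kStride * kLines, 0);
   volatile std::uint8_t sink = 0;
   for (int i = 0; i < 1000000; i++)
     sink = sink + buf[(i % kLines) * kStride];  // each access evicts a same-set line
   return 0;
 }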

Is the large percentage of L1 cache loads being misses a result of conflict misses?

Unlikely, and for the same reason that our misses are not capacity misses. We are simply not accessing enough data for conflict misses to be an issue. Realistically, the 9 bytes we are repeatedly accessing would at most be split across two different cache lines (one line for our spinlock state, and one line for our shared value val).

Coherence Misses

Coherence misses occur because of writes in multi-threaded applications. Before a thread can write to data on a cache line, it first must get that cache line in the modified (M) state. This state means that the thread has exclusive access to the cache line, and no other copies of the cache line exist in any other cache.

In order to guarantee that there are no copies of the cache line in any other cache, invalidations of that cache line are sent out to all other cores/caches. Coherence misses occur when a cache line in one core is invalidated by a thread doing a write on a different core, leading to an eventual miss when the data is re-accessed.
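
Here’s a minimal sketch (not from the original post) of the access pattern that produces coherence misses: two threads writing to the same atomic variable, so every write must invalidate the other core’s copy of the line before it can complete:

 // A minimal sketch: two threads writing the same atomic force the cache line
 // to bounce between cores. Each write needs the line in the modified (M)
 // state, which invalidates the other core's copy
 #include <atomic>
 #include <cstdint>
 #include <thread>
 
 int main() {
   std::atomic<std::int64_t> shared{0};
   auto writer = [&] {
     for (int i = 0; i < 1000000; i++)
       shared.fetch_add(1);  // write -> invalidate the other copy, own the line
   };
   std::thread t0(writer);
   std::thread t1(writer);
   t0.join();
   t1.join();
   return 0;
 }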

Is the large percentage of L1 cache loads being misses a result of coherence misses?

Yes! Our benchmark is running multiple threads, all of which are writing to the same two pieces of memory (the spinlock state and the shared value val). All of our threads are fighting over the same cache line(s) and sending invalidations to each other.

One way we can prove this out is by eliminating all doubt about the first 3-Cs of cache misses (cold-start, capacity, and conflict misses). If they exist at all in our benchmark, we should see them in the single-threaded case. Here are the cache miss numbers when we only run a single thread:

254,289,668      L1-dcache-loads           #  321.707 M/sec                  
  2,906,087      L1-dcache-load-misses     #    1.14% of all L1-dcache hits  

Only a ~1% miss-rate! Our cache misses only spike up to ~50% when we run our multi-threaded benchmarks (2, 4, or 8 threads).

Another great tool we can use to understand coherence misses is perf c2c. Using perf c2c record and perf c2c report, we can generate a shared data cache line table for our benchmark run. Here’s an abbreviated form of the high-level table generated from my run:

Shared Data Cache Line Table     (1 entries, sorted on Total HITMs)
       ----------- Cacheline ----------    Total      Tot  ----- LLC Load Hitm -----
Index             Address  Node  PA cnt  records     Hitm    Total      Lcl      Rmt
    0      0x7fff8569f400     0    4285    38378   99.99%    15457    15457        0

My table had a single cache line under heavy contention. Heavy contention is denoted by the number of HITM events recorded. A HITM event occurs when a thread tries to access a piece of memory, and finds it in the modified (M) coherence state in another cache. If we see many of these events for a cache line (like we see above), it’s a good indicator that we have multiple threads tearing the cache line back and forth with writes.

We can also get more information for individual cache lines in this table. Here’s an abbreviated form of that sub-table for the cache line above:

Cacheline 0x7fff8569f400
----- HITM -----  -- Store Refs --  ------- CL --------                      
    Rmt      Lcl   L1 Hit  L1 Miss    Off  Node  PA cnt        Code address  
  0.00%   92.92%   93.71%    0.08%   0x27     0       1      0x55e0ebddf802  
  0.00%    0.41%    6.27%    0.00%   0x27     0       1      0x55e0ebddf80b  
  0.00%    6.67%    0.02%   99.92%   0x28     0       1      0x55e0ebddf808  

Notice that HITM events occur at two different offsets within the cache line (0x27 and 0x28) and at three different code addresses. These correspond to our two pieces of memory (the lock state and the shared value val), and our three write locations (the lock method, the unlock method, and the shared value increment in our inc function).

Addressing Coherence Misses

Now that we know that our cache misses are due to coherence (specifically writes), what can we do about it? Intuitively, we want to limit the number of writes we perform, thereby reducing the number of cache line invalidations. Let’s look at the three main places where we are performing writes:

  1. The increment of our shared value val
  2. The unlock method of our spinlock
  3. The lock method of our spinlock

Let’s try and narrow down where we can optimize by reasoning about the opportunity at each write location.

The Increment

Can we limit the number of writes to our shared value val?

No, because that would not address any of the issues with our spinlock. The increment of val is just the arbitrary piece of work we have given our threads to perform, and is completely independent of the implementation of our spinlock. We should instead focus on the writes native to our spinlock implementation, because these will exist no matter what work we assign our threads.

The Unlock Method

Can we limit the number of writes done by our unlock method?

No, because the number of writes our unlock method does is already the minimum. Let’s take a look at how we implemented this method in our naive implementation:

   // Unlock the spinlock
   void unlock() { locked.store(false); }

Each call to unlock only performs a single store that marks the lock as free. We’ll have to look for opportunities elsewhere.

The Lock Method

Can we limit the number of writes done by our lock method?

Yes! Let’s re-examine our lock implementation from our naive spinlock:

   // Lock the spinlock
   void lock() {
     while (locked.exchange(true))
       ;
   }

As long as a thread is waiting in the while loop for the spinlock to be free, it is writing to the spinlock state (with the atomic exchange). That means that our cache line with the lock state is being torn between the different cores on our chip as fast as different threads can issue atomic writes. But how do we address this?

The Spinning Locally Optimization

We can start by rethinking when we try and grab the lock. Our current lock method constantly tries to grab the lock using atomic exchanges. This is a poor design choice because in many cases we will fail to get the lock (especially as we increase the number of threads). Instead, we can perform atomic exchanges (writes) only after a read of the lock state shows that the lock has become free. This leads us to the spinning locally optimization.

Let’s take a look at our lock method with the spinning locally optimization:

   // Lock the spinlock
   void lock() {
     // Keep trying until we get the lock
     while (1) {
       // Try and grab the lock
       if (!locked.exchange(true)) return;
 
       // Read the spinlock state
       while (locked.load())
         ;
     }
   }

When a thread tries to lock our spinlock, it falls into an infinite while loop. Inside that loop, the thread first tries to grab the lock using an atomic exchange. If it gets the lock, the thread returns from the lock method.

If a thread fails to get the lock, it falls into the second while loop, where it repeatedly reads the state of the spinlock until it becomes free. When the thread reads that the lock has become free, it breaks out of the second while loop and attempts to get the lock again.

What we’ve done is replace our writes while the lock is taken with reads. This is good because read-only copies of a cache line can exist in multiple caches at the same time. This allows our waiting threads to read a local copy of the spinlock state from their own L1 caches (and hit on this value) until the thread holding the lock writes that the lock is free. That write invalidates all copies of the cache line, and the waiting threads then read the update that the spinlock has become free.

Here is the full implementation we’ll be using:

// Simple Spinlock
// Lock now performs local spinning
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // Locking mechanism
  void lock() {
    // Keep trying forever
    while (1) {
      // Try and grab the lock
      // Return if we get the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally.
      // This leads to less traffic
      while (locked.load())
        ;
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false)
  // Can also use the assignment operator
  void unlock() { locked.store(false); }
};
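
As an aside, because the class exposes lock() and unlock(), it meets the standard BasicLockable requirements, so it can also be used with RAII wrappers like std::lock_guard. Here’s a small usage sketch (the inc_guarded function is just for illustration, not part of the benchmark):

// A usage sketch: Spinlock provides lock()/unlock(), so it satisfies
// BasicLockable and works with std::lock_guard
#include <cstdint>
#include <mutex>  // std::lock_guard

void inc_guarded(Spinlock &s, std::int64_t &val) {
  for (int i = 0; i < 100000; i++) {
    std::lock_guard<Spinlock> guard(s);  // locks here, unlocks at end of scope
    val++;
  }
}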

Low-Level Assembly

Let’s take a look at how our low-level assembly for our benchmark looks with this new optimization:

  0.25 │20:┌─→mov     %edi,%ecx       
 36.12 │   │  xchg    %cl,(%rax)      
  0.03 │   │  test    %cl,%cl         
  0.06 │   │↓ je      40              
  0.28 │   │  nop                     
 35.84 │30:│  movzbl  (%rax),%ecx     
  0.10 │   │  test    %cl,%cl         
  5.87 │   │↑ jne     30              
  1.10 │   │↑ jmp     20              
       │   │  nop                     
  4.78 │40:│  incq    (%rsi)         
 15.57 │   │  xchg    %cl,(%rax)    
  0.00 │   ├──dec     %edx           
       │   └──jne     20             

The first thing our code does is try and grab the lock using an atomic exchange (xchg). If we get the lock, we jump down to the increment of our shared value val (incq), decrement our loop counter, and jump back to the top of the loop if our loop is not done.

Where things differ from our assembly in the previous post is when we fail to get the lock. Instead of immediately retrying the atomic exchange, we fall into a tight loop where we read the state of the spinlock (movzbl), test if the spinlock is now free, then either jump back up to the atomic exchange if the lock is free, or keep reading the spinlock state if the lock is still taken.

Performance

Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
spin_locally/1/real_time       1.01 ms        0.046 ms          698
spin_locally/2/real_time       10.4 ms        0.070 ms           68
spin_locally/4/real_time       28.9 ms        0.123 ms           19
spin_locally/8/real_time       91.4 ms        0.148 ms            8

And here are the performance numbers from our previous (naive) spinlock implementation:

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
naive/1/real_time       1.05 ms        0.054 ms          686
naive/2/real_time       14.2 ms        0.090 ms           47
naive/4/real_time       66.8 ms        0.129 ms           10
naive/8/real_time        247 ms        0.255 ms            3

Comparing the two, we see a significant performance improvement with spin_locally compared to our naive spinlock! Our time drops from 14.2ms to 10.4ms for 2 threads, from 66.8ms to 28.9ms for 4 threads, and from 247ms to 91.4ms for 8 threads. Let’s take a look at our L1 cache miss-rate for the 8 thread case to see if we helped reduce the number of coherence cache misses:

1,496,253,736      L1-dcache-loads           #  253.267 M/sec                  
   70,272,048      L1-dcache-load-misses     #    4.70% of all L1-dcache hits  

Down from ~50% to ~5% (a huge improvement)! While this is great, we should still be thinking about how we can do better. While we removed the constant contention for the cache line, we have replaced it with bursty contention that occurs each time a thread releases the lock.

We’ll try and alleviate this bursty contention with an optimization called backoff in the next blog post.

Final Thoughts

A relatively minor change to our code (replacing writes with reads) led to a huge improvement in performance. However, there is still plenty of room for improvement in our implementation, especially centered on how fast our threads read the spinlock state.

Thanks for reading,

–Nick