Spinlocks - Part 3
In the last blog post, we used an optimization called locally spinning to alleviate the high number of coherence misses in our initial naive spinlock implementation. However, at the end of that post, I mentioned that we’d traded constant contention for our cache lines for bursty contention whenever a thread releases the lock.
In this blog post, we’ll spend some time understanding this bursty contention, and try to address it using a spinlock optimization called active backoff.
Our Benchmark
We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:
// Small Benchmark
static void bench(benchmark::State &s) {
  // Sweep over a range of threads
  auto num_threads = s.range(0);

  // Value we will increment
  std::int64_t val = 0;

  // Allocate a vector of threads
  std::vector<std::thread> threads;
  threads.reserve(num_threads);

  Spinlock sl;

  // Timing loop
  for (auto _ : s) {
    for (auto i = 0u; i < num_threads; i++) {
      threads.emplace_back([&] { inc(sl, val); });
    }
    // Join threads
    for (auto &thread : threads) thread.join();
    threads.clear();
  }
}
BENCHMARK(bench)
->RangeMultiplier(2)
->Range(1, std::thread::hardware_concurrency())
->UseRealTime()
->Unit(benchmark::kMillisecond);
We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:
// Increment val once each time the lock is acquired
void inc(Spinlock &s, std::int64_t &val) {
  for (int i = 0; i < 100000; i++) {
    s.lock();
    val++;
    s.unlock();
  }
}
For 100k iterations, each thread tries to lock our spinlock, increments a shared value, then releases the lock. This gives us a heavy-contention scenario in which to test our spinlock.
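As a quick sanity check (separate from the benchmark itself), here is a minimal sketch of a standalone test that reuses the Spinlock class and inc function from this post. With a correct lock, no increments are lost:
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Assumes the Spinlock class and inc function from this post are in scope
int main() {
  Spinlock sl;
  std::int64_t val = 0;

  // Spawn 4 threads that each perform 100k locked increments
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; i++)
    threads.emplace_back([&] { inc(sl, val); });

  // Join threads
  for (auto &thread : threads) thread.join();

  // With a correct lock, every increment is preserved
  assert(val == 4 * 100000);
  return 0;
}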
A Spinlock with Active Backoff
Before we jump into our active backoff implementation, let’s try to understand where the bursty contention in our current spinlock comes from.
Bursty Contention
Our bursty contention occurs when the thread currently holding the lock releases the lock. Let’s take another look at our lock method to understand this:
// Lock the spinlock
void lock() {
  // Keep trying until we get the lock
  while (1) {
    // Try and grab the lock
    if (!locked.exchange(true)) return;

    // Read the spinlock state
    while (locked.load())
      ;
  }
}
When the lock is released, all threads spinning in the second while loop (our locally spinning optimization) read that the lock is free, and race to the atomic exchange to try and grab it. However, only one of the potentially many threads that break out of the polling loop will get the lock. All the threads that didn’t get the lock still wastefully perform their atomic exchange before going back to reading the lock state.
To solve this problem, we need to limit the number of threads that can break out of our waiting loop when the lock is freed. We can do this by adding backoff to this polling loop.
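One way to see this wasted work for yourself is to instrument the lock. The sketch below (an illustration of my own, not part of the benchmark code) counts how many atomic exchanges fail to acquire the lock; under heavy contention, this counter grows with every release burst:
#include <atomic>
#include <cstdint>

// Illustrative spinlock variant that counts wasted exchanges
class InstrumentedSpinlock {
 private:
  std::atomic<bool> locked{false};

 public:
  // Number of atomic exchanges that did NOT get the lock
  std::atomic<std::int64_t> failed_exchanges{0};

  void lock() {
    while (1) {
      // Try and grab the lock
      if (!locked.exchange(true)) return;

      // This exchange was wasted (and invalidated the cache line)
      failed_exchanges++;

      // Spin locally until the lock looks free
      while (locked.load())
        ;
    }
  }

  void unlock() { locked.store(false); }
};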
The Active Backoff Optimization
To keep our threads from immediately breaking out of the second while loop, we will add a small delay between each read of the spinlock state. Here is our modified lock method:
// Locking mechanism
void lock() {
  while (1) {
    // Try and grab the lock
    if (!locked.exchange(true)) return;

    // Wait for the lock to be free
    do {
      // Pause for some number of iterations
      for (volatile int i = 0; i < 150; i += 1)
        ;
      // Read the lock state
    } while (locked.load());
  }
}
Our lock method is almost identical to our spinning locally implementation, with the exception of the polling loop. Instead of reading the lock state as fast as possible, we’ll insert a configurable delay using a dummy for loop.
But what does inserting a delay here accomplish? If threads read the state of the lock as fast as possible, it’s likely that almost all of them will break out and compete for the lock at the same time. By adding a delay in this loop, we increase the probability that some threads will still be executing the delay instructions from our dummy for loop while others (but not all of the threads) break out and try to get the lock.
But how many iterations should we run in our for loop? It depends! This is a tunable parameter. You should experiment with it to see what works well for your application and architecture. 150 iterations provided a reasonable speedup for my use case.
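If you want to make this knob explicit, one option (a sketch of my own, with a hypothetical TunableSpinlock name) is to lift the iteration count into a template parameter, so different builds can pick different delay lengths:
#include <atomic>

// Sketch: expose the backoff length as a compile-time parameter
template <int backoff_iterations = 150>
class TunableSpinlock {
 private:
  std::atomic<bool> locked{false};

 public:
  void lock() {
    while (1) {
      // Try and grab the lock
      if (!locked.exchange(true)) return;

      // Wait for the lock to be free
      do {
        // Active backoff delay, length set by the template parameter
        for (volatile int i = 0; i < backoff_iterations; i += 1)
          ;
      } while (locked.load());
    }
  }

  void unlock() { locked.store(false); }
};

// Usage: TunableSpinlock<300> sl; // a longer delay for heavier contention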
And why did we call it active backoff? I’ve called it active backoff because we’re actively doing work during our delay (executing the instructions for our dummy for loop). We’ll look at a passive form of backoff in the next post.
Here is our full spinlock implementation:
// Simple Spinlock
// Lock now performs local spinning
// Lock now does backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // Locking mechanism
  void lock() {
    // Keep trying
    while (1) {
      // Try and grab the lock
      // Exit if we got the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally. This leads to less traffic.
      // Each iteration, we also run a delay loop to limit how many threads
      // break out at the same time. How long the delay should be is
      // experimentally determined.
      do {
        // Pause for some number of iterations
        for (volatile int i = 0; i < 150; i += 1)
          ;
        // Check to see if the lock is free
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just free the lock by setting the state to false
  void unlock() { locked.store(false); }
};
Low-Level Assembly
Let’s take a look at the low-level assembly for our benchmark with this new optimization:
0.42 │20:┌─→mov %edi,%ecx
17.29 │ │ xchg %cl,(%rax)
0.01 │ │ test %cl,%cl
0.01 │ │↓ je 67
0.09 │ │ nop
0.35 │30:│ movl $0x0,-0x4(%rsp)
0.07 │ │ mov -0x4(%rsp),%ecx
0.02 │ │ cmp $0x95,%ecx
│ │↓ jg 5e
0.06 │ │ nop
1.77 │48:│ mov -0x4(%rsp),%ecx
1.44 │ │ inc %ecx
15.57 │ │ mov %ecx,-0x4(%rsp)
21.50 │ │ mov -0x4(%rsp),%ecx
0.07 │ │ cmp $0x95,%ecx
8.49 │ │↑ jle 48
13.96 │5e:│ movzbl (%rax),%ecx
0.04 │ │ test %cl,%cl
0.02 │ │↑ jne 30
0.35 │ │↑ jmp 20
2.54 │67:│ incq (%rsi)
15.93 │ │ xchg %cl,(%rax)
│ ├──dec %edx
0.00 │ └──jne 20
Our threads first try to grab the lock using an atomic exchange (xchg). If a thread gets the lock, it jumps to label 67:, where it increments our shared value, releases the lock, and decrements the loop counter (for our 100k iterations). If a thread fails to get the lock, it sets up our active backoff for loop at label 30: and executes the loop at label 48:.
After this delay loop, our threads read the lock state again (movzbl), and either go through the delay loop again if the lock is still taken, or jump back up to the atomic exchange to try and grab the lock again if it’s free.
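To tie the assembly back to the source, here is the lock method again with the labels from the perf output annotated as comments (the mapping is my own reading of the disassembly; note that 0x95 is 149, which matches our i < 150 bound):
void lock() {
  while (1) {                             // 20: loop entry
    if (!locked.exchange(true)) return;   // xchg + test, je 67 on success
    do {
      // 30: initialize i; 48: increment and compare against 0x95 (149)
      for (volatile int i = 0; i < 150; i += 1)
        ;
    } while (locked.load());              // 5e: movzbl + test, jne 30
  }
}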
Performance
Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
active_backoff/1/real_time 1.05 ms 0.054 ms 679
active_backoff/2/real_time 2.60 ms 0.081 ms 281
active_backoff/4/real_time 9.49 ms 0.121 ms 75
active_backoff/8/real_time 50.5 ms 0.287 ms 14
And here are the performance numbers from our previous (spin_locally) spinlock implementation:
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
spin_locally/1/real_time 1.01 ms 0.046 ms 698
spin_locally/2/real_time 10.4 ms 0.070 ms 68
spin_locally/4/real_time 28.9 ms 0.123 ms 19
spin_locally/8/real_time 91.4 ms 0.148 ms 8
Another huge improvement! Compared to our spinlock with only the locally spinning optimization, our end-to-end execution time went down from 10.4ms to 2.60ms for 2 threads (a 4x speedup), 28.9ms to 9.49ms for 4 threads (a 3x speedup), and 91.4ms to 50.5ms for 8 threads (a 1.8x speedup).
While this is another great improvement in performance, we’re wasting power executing all these extra instructions for the for loop that generates our delay. We’ll address this in the next blog post using a passive form of backoff and the _mm_pause() intrinsic.
Final Thoughts
Understanding how our threads interact with each other is critical for finding these performance opportunities. However, just like last time, there is still plenty of room for improvement in our implementation (in terms of both performance and power consumption).
Thanks for reading,
–Nick
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com