Spinlocks - Part 5
In the last blog post, we used the passive backoff optimization to reduce the power consumption of our spinlock. However, we may not want our threads to always pause for the same number of iterations.
In this blog post, we’ll look at two spinlocks that use a non-constant form of backoff (exponential and random backoff).
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com
Our Benchmark
We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:
// Small Benchmark
static void bench(benchmark::State &s) {
  // Sweep over a range of threads
  auto num_threads = s.range(0);

  // Value we will increment
  std::int64_t val = 0;

  // Allocate a vector of threads
  std::vector<std::thread> threads;
  threads.reserve(num_threads);

  Spinlock sl;

  // Timing loop
  for (auto _ : s) {
    for (auto i = 0u; i < num_threads; i++) {
      threads.emplace_back([&] { inc(sl, val); });
    }
    // Join threads
    for (auto &thread : threads) thread.join();
    threads.clear();
  }
}

BENCHMARK(bench)
    ->RangeMultiplier(2)
    ->Range(1, std::thread::hardware_concurrency())
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);
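If the benchmark is built as its own executable (rather than linking against the benchmark_main library), the file also needs Google Benchmark’s standard entry point:

// Expands to a main() that parses flags and runs all registered benchmarks
BENCHMARK_MAIN();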
We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:
// Increment val once each time the lock is acquired
void inc(Spinlock &s, std::int64_t &val) {
  for (int i = 0; i < 100000; i++) {
    s.lock();
    val++;
    s.unlock();
  }
}
For 100k iterations, each thread tries to lock our spinlock, increments a shared value, then releases the lock. This gives us a heavy-contention scenario in which to test our spinlock.
A Spinlock with Non-Constant Backoff
It is sometimes advantageous for threads to pause for a non-constant number of iterations. For example, if one of our threads is holding the lock longer than expected, we might want to increase the time between reads of the spinlock state.
Let’s take a look at two non-constant backoff implementations. In our first implementation, we’ll look at an optimization called exponential backoff where the number of pause iterations increases exponentially the longer we wait for the lock. In the second, we’ll look at a random backoff optimization where the number of pause iterations comes from a random number generator.
Exponential Backoff
The core idea behind exponential backoff is that a thread that repeatedly finds the spinlock unavailable should wait longer between each read of the spinlock state. Here’s how our updated spinlock’s lock method looks:
// Locking mechanism
void lock() {
  // Start backoff at MIN_BACKOFF iterations
  int backoff_iters = MIN_BACKOFF;

  // Keep trying
  while (1) {
    // Try and grab the lock
    if (!locked.exchange(true)) return;

    // Pause for an exponentially increasing number of iterations
    do {
      for (int i = 0; i < backoff_iters; i++) _mm_pause();

      // Get the backoff iterations for next time
      backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);

      // Check to see if the lock is free
    } while (locked.load());
  }
}
We start our backoff iterations (backoff_iters) at some minimum value (MIN_BACKOFF), which should be experimentally determined (I used 4). If a thread fails to get the lock with the atomic exchange, it falls into our local spinning loop, where we pause for backoff_iters iterations. We then exponentially increase the number of backoff iterations for the next trip through the loop, read the spinlock state, and either go back to the top of the loop or try for the lock again.
One thing to notice is that we cap the number of backoff iterations at MAX_BACKOFF. This is generally a good idea, because exponential numbers grow quickly (who would have thought!). If we did not bound the backoff, the number of iterations could grow so quickly that all of our threads end up waiting unnecessarily long in the delay loop even after the lock has become free. The value of MAX_BACKOFF should be experimentally determined (just like MIN_BACKOFF). I chose 1024.
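To make the growth concrete, here is a minimal sketch (my own illustration, not part of the benchmark) of how backoff_iters evolves across successive failed checks with MIN_BACKOFF = 4 and MAX_BACKOFF = 1024:

#include <algorithm>
#include <iostream>

int main() {
  // Values taken from the discussion above (4 and 1024)
  constexpr int MIN_BACKOFF = 4;
  constexpr int MAX_BACKOFF = 1024;

  // Prints: 4 8 16 32 64 128 256 512 1024 1024 1024 1024
  int backoff_iters = MIN_BACKOFF;
  for (int attempt = 0; attempt < 12; attempt++) {
    std::cout << backoff_iters << ' ';
    backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);
  }
  std::cout << '\n';
}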
Here is the full spinlock implementation:
// Headers needed by the spinlock (added so the example is self-contained)
#include <immintrin.h>

#include <algorithm>
#include <atomic>

// Backoff bounds (experimentally determined; 4 and 1024 as discussed above)
constexpr int MIN_BACKOFF = 4;
constexpr int MAX_BACKOFF = 1024;

// Simple Spinlock
// Lock now performs local spinning
// Lock now performs exponential backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // Locking mechanism
  void lock() {
    // Start backoff at MIN_BACKOFF iterations
    int backoff_iters = MIN_BACKOFF;

    // Keep trying
    while (1) {
      // Try and grab the lock
      // Return if we get the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally. This leads to less traffic.
      // Pause for an exponentially increasing number of iterations
      do {
        // Pause for some number of iterations
        for (int i = 0; i < backoff_iters; i++) _mm_pause();

        // Get the backoff iterations for next time
        backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);

        // Check to see if the lock is free
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false)
  // Can also use the assignment operator
  void unlock() { locked.store(false); }
};
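Because the class only needs lock() and unlock(), it satisfies the standard BasicLockable requirements, so it can also be used with RAII wrappers. A small usage sketch (my own example, not part of the benchmark):

#include <cstdint>
#include <mutex>  // std::lock_guard

// Hypothetical shared state protected by the spinlock above
std::int64_t counter = 0;
Spinlock counter_lock;

void safe_increment() {
  // The spinlock is acquired here and released when guard goes out of scope
  std::lock_guard<Spinlock> guard(counter_lock);
  counter++;
}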
Low-Level Assembly
Let’s take a look at how our low-level assembly for our benchmark looks with this new optimization:
0.75 │22:┌─→vmovdqa _IO_stdin_used+0xa0,%xmm0
│ │ vmovd %xmm0,%esi
│ │ xchg %ax,%ax
│30:│ mov %r8d,%ecx
17.71 │ │ xchg %cl,(%rax)
│ │ test %cl,%cl
│ │↓ je 6f
0.00 │ │ nop
│40:│ xor %ecx,%ecx
│ │ test %esi,%esi
│ │↓ jle 58
│ │ nop
0.00 │50:│ inc %ecx
63.59 │ │ pause
0.00 │ │ cmp %ecx,%esi
0.01 │ │↑ jne 50
0.00 │58:│ vpslld $0x1,%xmm0,%xmm0
│ │ vpminsd %xmm1,%xmm0,%xmm0
│ │ vmovd %xmm0,%esi
0.05 │ │ movzbl (%rax),%ecx
│ │ test %cl,%cl
│ │↑ jne 40
0.01 │ │↑ jmp 30
0.67 │6f:│ incq (%rdi)
17.20 │ │ xchg %cl,(%rax)
│ ├──dec %edx
0.01 │ └──jne 22
We first try and get the lock using an atomic exchange (xchg). If we get the lock, we increment the shared value, release the lock, decrement our loop counter, then jump back to the top of the assembly to try and grab the lock again.
If we fail to get the lock, we fall into our pause loop (label 50:). After completing our backoff iterations, we calculate the exponential growth of our backoff iterations, read the lock state, then either jump back to our pause loop or break out and try and grab the lock again.
Performance
Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
exp_backoff/1/real_time 1.17 ms 0.047 ms 607
exp_backoff/2/real_time 2.30 ms 0.063 ms 301
exp_backoff/4/real_time 4.56 ms 0.089 ms 154
exp_backoff/8/real_time 9.12 ms 0.140 ms 77
And here was our performance from our previous blog post:
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time 1.03 ms 0.054 ms 676
passive_backoff/2/real_time 2.48 ms 0.082 ms 276
passive_backoff/4/real_time 9.59 ms 0.110 ms 70
passive_backoff/8/real_time 46.7 ms 0.236 ms 15
A significant performance improvement, mainly in the higher thread-count cases. While our one and two thread benchmarks showed little-to-no improvement, our 4-thread case improved from 9.59ms to 4.56ms, and our 8-thread case from 46.7ms to 9.12ms.
But why did our performance improve so significantly? We’re approximately serializing our threads! Many of our threads will end up hitting the maximum backoff iterations. This makes it more likely that the thread that just released the lock will be able to re-lock the lock (because the other threads get caught up in the long pause loop).
Serialization is best for performance here because our increments cannot occur in parallel. Furthermore, a single thread working at a time means it can hit on local copies of the spinlock state and the shared value in its L1 cache without those cache lines being invalidated and bouncing between cores.
Random Backoff
Another non-constant form of backoff we can look at is random backoff. Instead of increasing our backoff iterations exponentially, we’ll select the number of iterations at random. Here’s how our updated lock method looks:
// Locking mechanism
void lock() {
  // Keep trying
  while (1) {
    // Try and grab the lock
    if (!locked.exchange(true)) return;

    // Pause for a random number of iterations (between 4 and 1024)
    do {
      int backoff_iters = dist(rng);

      // Pause for some number of iterations
      for (int i = 0; i < backoff_iters; i++) _mm_pause();

      // Check to see if the lock is now free
    } while (locked.load());
  }
}
Here, our random number will be between 4 and 1024 (the min and max backoff iterations from our exponential backoff example).
Here is the full spinlock implementation:
// Headers needed by the spinlock (added so the example is self-contained)
#include <immintrin.h>

#include <atomic>
#include <random>

// Simple Spinlock
// Lock now performs local spinning
// Lock now does random backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

  // Random number generator
  std::uniform_int_distribution<int> dist;
  std::mt19937 rng;

 public:
  // Constructor to set up our RNG
  Spinlock() {
    // Get random numbers between 4 and 1024
    rng.seed(std::random_device()());
    dist = std::uniform_int_distribution<int>(4, 1024);
  }

  // Locking mechanism
  void lock() {
    // Keep trying
    while (1) {
      // Try and grab the lock
      // Exit if we got the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally. This leads to less traffic.
      // Pause for a random number of iterations (between 4 and 1024)
      do {
        // Get a random number of iterations between 4 and 1024
        int backoff_iters = dist(rng);

        // Pause for some number of iterations
        for (int i = 0; i < backoff_iters; i++) _mm_pause();

        // Check to see if the lock is now free
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false)
  // Can also use the assignment operator
  void unlock() { locked.store(false); }
};
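Note that in this implementation every thread draws from the same mt19937 stored inside the lock. A possible variation (my own sketch, not from the original post) is to give each thread its own generator with thread_local, so the backoff draws never touch shared RNG state:

#include <random>

// Hypothetical per-thread variant of the backoff draw
int random_backoff_iters() {
  // Each thread seeds and owns its own generator and distribution
  thread_local std::mt19937 rng{std::random_device{}()};
  thread_local std::uniform_int_distribution<int> dist{4, 1024};
  return dist(rng);
}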
Low-Level Assembly
Let’s take a look at how our low-level assembly for our benchmark looks with this new optimization:
0.94 │28:┌─→mov %r13d,%eax
15.29 │ │ xchg %al,(%rbx)
│ │ test %al,%al
0.00 │ │↓ je 69
0.01 │ │ lea 0x4(%rbx),%r15
│ │ lea 0x10(%rbx),%r14
│ │ nop
0.01 │40:│ mov %r15,%rsi
0.00 │ │ mov %r14,%rdi
│ │→ callq std::uniform_int_distribution<int>::operator()<std::mersenne_twister_engine<...
│ │ xor %edx,%edx
│ │ test %eax,%eax
│ │↓ jle 60
│ │ nop
│58:│ inc %edx
65.19 │ │ pause
│ │ cmp %edx,%eax
│ │↑ jne 58
0.09 │60:│ movzbl (%rbx),%eax
│ │ test %al,%al
│ │↑ jne 40
│ │↑ jmp 28
0.85 │69:│ incq (%r12)
17.62 │ │ xchg %al,(%rbx)
│ ├──dec %ebp
│ └──jne 28
The only major difference from our previous assembly is that the exponential backoff calculation has been replaced with a call to our random number generator.
Performance
Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
random_backoff/1/real_time 0.873 ms 0.054 ms 797
random_backoff/2/real_time 1.74 ms 0.080 ms 404
random_backoff/4/real_time 3.40 ms 0.121 ms 207
random_backoff/8/real_time 6.83 ms 0.177 ms 102
And here was the performance of our exponential backoff implementation from earlier in this post:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
exp_backoff/1/real_time 1.17 ms 0.047 ms 607
exp_backoff/2/real_time 2.30 ms 0.063 ms 301
exp_backoff/4/real_time 4.56 ms 0.089 ms 154
exp_backoff/8/real_time 9.12 ms 0.140 ms 77
It looks like we got even faster! However, we should keep in mind that we’re working with a very simple microbenchmark, so our results should be taken with a grain of salt. Some of this improvement comes from our shared value and lock being forcibly separated onto different cache lines (because of where the random number generator sits in our class). You can see that the numbers start to align more closely when we force different cache lines in our exponential backoff benchmark:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
exp_backoff/1/real_time 1.01 ms 0.047 ms 695
exp_backoff/2/real_time 1.99 ms 0.057 ms 354
exp_backoff/4/real_time 3.93 ms 0.081 ms 178
exp_backoff/8/real_time 7.87 ms 0.126 ms 89
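One way to force that separation (a sketch of my own, assuming 64-byte cache lines; the exact approach used for the numbers above may differ) is to over-align the lock state so it occupies a full cache line by itself:

#include <atomic>
#include <cstdint>

// alignas(64) makes the object start on a cache-line boundary and pads its
// size to a full 64 bytes, so no other object can share its line
struct alignas(64) PaddedLock {
  std::atomic<bool> locked{false};
};

PaddedLock lock_state;  // occupies its own cache line
std::int64_t val = 0;   // cannot land on the same line as lock_state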
Final Thoughts
We’ve made a lot of progress in terms of performance and power consumption, but we have yet to discuss other important topics like fairness and starvation. We have no guarantee that some threads won’t be waiting for extreme amounts of time for the lock. We’ll look at this in the next post with a ticket-based spinlock.
Thanks for reading,
–Nick
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com