Spinlocks - Part 6
In the last blog post, we looked at non-constant forms of backoff that further improved the end-to-end execution time of our benchmarks. However, optimizations that help with throughput/performance may hurt other metrics like fairness.
In this blog post, we’ll look at a spinlock implementation optimized for fairness called a ticket spinlock.
Our Benchmark
We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:
// Small Benchmark
static void bench(benchmark::State &s) {
  // Sweep over a range of threads
  auto num_threads = s.range(0);

  // Value we will increment
  std::int64_t val = 0;

  // Allocate a vector of threads
  std::vector<std::thread> threads;
  threads.reserve(num_threads);

  Spinlock sl;

  // Timing loop
  for (auto _ : s) {
    for (auto i = 0u; i < num_threads; i++) {
      threads.emplace_back([&] { inc(sl, val); });
    }
    // Join threads
    for (auto &thread : threads) thread.join();
    threads.clear();
  }
}
BENCHMARK(bench)
    ->RangeMultiplier(2)
    ->Range(1, std::thread::hardware_concurrency())
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);
We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:
// Increment val once each time the lock is acquired
void inc(Spinlock &s, std::int64_t &val) {
  for (int i = 0; i < 100000; i++) {
    s.lock();
    val++;
    s.unlock();
  }
}
For 100k iterations, our threads try to lock our spinlock, increment a shared value, then release the lock. This gives us a heavily contended scenario in which to test our spinlock.
A Ticket Spinlock
Before we jump into our ticket spinlock implementation, we’ll discuss why we should care about things like fairness and starvation.
Fairness and Starvation
End-to-end performance and throughput are not the only metrics to consider when optimizing. In many circumstances, we also care about things like fairness and starvation.
In the case of our spinlock, starvation occurs when one thread (or a subset of threads) hogs access to our spinlock, starving one or more other threads. So far, we have done nothing to prevent this from happening in any of our implementations (naive, spin_locally, active_backoff, passive_backoff, exp_backoff, random_backoff).
In fact, our exponential backoff implementation increases unfairness/starvation. To understand this, let’s take a look back at our lock method.
// Locking mechanism
void lock() {
  // Start backoff at MIN_BACKOFF iterations
  int backoff_iters = MIN_BACKOFF;

  // Keep trying
  while (1) {
    // Try and grab the lock
    // Return if we get the lock
    if (!locked.exchange(true)) return;

    // If we didn't get the lock, just read the value which gets cached
    // locally. This leads to less traffic.
    // Pause for an exponentially increasing number of iterations
    do {
      // Pause for some number of iterations
      for (int i = 0; i < backoff_iters; i++) _mm_pause();

      // Get the backoff iterations for next time
      backoff_iters = std::min(backoff_iters << 1, MAX_BACKOFF);

      // Check to see if the lock is free
    } while (locked.load());
  }
}
The longer a thread waits for a spinlock, the longer it waits in the _mm_pause() loop. That means a thread that just started waiting for the lock will check the lock state more frequently than a thread that has been waiting for a long time (because it goes through fewer pause-loop iterations).
In order to prevent starvation and promote fairness in how access to the lock is granted, we can use a ticket-based system.
The Ticket Spinlock Optimization
The core idea behind the ticket spinlock is that the first thread to arrive and check if the spinlock is free should be the next thread to get the lock. What we’re essentially implementing is a FIFO (first in, first out) lock.
Our ticket spinlock will have 4 components:
- State for the latest ticket number
- State for the now-serving number
- A method for unlocking our spinlock
- A method for locking our spinlock
Latest Ticket Number State
For our state that maintains the latest ticket number, we’re going to use a std::atomic<std::uint16_t>.
std::atomic<std::uint16_t> line{0};
Similar to our lock state from our previous examples, we need to make this atomic because multiple threads will be trying to get “tickets” at the same time, and trying to modify this state. Making this atomic prevents two threads from thinking they have the same “ticket”.
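Without atomicity, a plain read-modify-write could interleave as follows (a hypothetical trace):

// Hypothetical interleaving with a non-atomic ticket counter:
//   T0 reads line == 5
//   T1 reads line == 5
//   T0 writes line = 6
//   T1 writes line = 6
// Both threads now believe they hold ticket 5!
// fetch_add(1) performs the read-modify-write as a single atomic step.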
Now Serving Number State
For our now-serving number we’ll use a volatile std::uint16_t.
volatile std::uint16_t serving{0};
By marking this volatile, we prevent the compiler from storing this in a register. This is important, because modifying the serving number is how we signal the next thread to start working.
However, we need not make this atomic. The only thread that will be modifying the serving number is the thread currently holding the lock. We’ll go into some of the more gory details about volatile and compiler memory ordering near the end of the post.
A Lock Method
Our lock method is fairly simple:

// Locking mechanism
void lock() {
  // Get the latest place in line (and increment the value)
  auto place = line.fetch_add(1);

  // Wait until our number is "called"
  while (serving != place)
    ;
}
To lock our spinlock, we do an atomic fetch-and-add to get our place in line (our “ticket”). This reads the current value of the latest ticket number (line) into the place variable, and stores line + 1 back to memory (atomically).
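As a quick standalone illustration of fetch-and-add semantics (a toy example, not part of the lock):

#include <atomic>
#include <cstdint>

std::atomic<std::uint16_t> line{0};

void example() {
  auto a = line.fetch_add(1);  // returns the OLD value: a == 0, line == 1
  auto b = line.fetch_add(1);  // b == 1, line == 2
}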
After a thread gets a “ticket”, it waits for its number to be “called”. This is done in the while loop, where the thread checks if serving is equal to place.
Once the values are equal, the thread exits the lock method (it now has the lock).
An Unlock Method
Our unlock method is still just a single line of code:
void unlock() { serving = serving + 1; }
In order to pass the lock to the next thread, we just need to increment the serving value by 1. Then, the next thread in line will see that serving is equal to its place, and break out of its lock method.
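To see the FIFO hand-off in action, consider a hypothetical interleaving of three contending threads:

// Hypothetical interleaving with three contending threads:
//   T0: place = 0 (line becomes 1); serving == 0, so T0 takes the lock
//   T1: place = 1 (line becomes 2); spins while serving != 1
//   T2: place = 2 (line becomes 3); spins while serving != 2
//   T0: unlock() sets serving = 1 -> T1 enters next
//   T1: unlock() sets serving = 2 -> T2 enters last
// Threads are served strictly in the order they took their tickets.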
We’ll discuss some pitfalls related to compiler memory ordering and serving near the end of the post.
Full Ticket Spinlock Implementation
Here is the full ticket spinlock implementation:
// Simple Spinlock
// Now uses ticket system for fairness
class Spinlock {
 private:
  // Lock is now two counters:
  //  1.) The latest place taken in line
  //  2.) Which number is currently being served
  std::atomic<std::uint16_t> line{0};

  // Needs to avoid the compiler putting this in a register!
  volatile std::uint16_t serving{0};

 public:
  // Locking mechanism
  void lock() {
    // Get the latest place in line (and increment the value)
    auto place = line.fetch_add(1);

    // Wait until our number is "called"
    while (serving != place)
      ;
  }

  // Unlocking mechanism
  // Increment serving number to pass the lock
  // No need for an atomic! The thread with the lock is the only one that
  // accesses this variable!
  void unlock() { serving = serving + 1; }
};
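As an aside, because the class exposes lock() and unlock() (the BasicLockable interface), it also composes with standard RAII wrappers. A minimal usage sketch (inc_once is a hypothetical helper, not part of the benchmark):

#include <cstdint>
#include <mutex>  // std::lock_guard

Spinlock sl;
std::int64_t val = 0;

void inc_once() {
  // lock() is called on construction, and unlock() when the guard
  // goes out of scope (including on early return)
  std::lock_guard<Spinlock> guard(sl);
  val++;
}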
Low-Level Assembly
Let’s take a look at the low-level assembly for our benchmark with this new optimization:
0.06 │18:┌─→mov $0x1,%edx
20.41 │ │ lock xadd %dx,(%rax)
0.03 │ │ nop
50.39 │28:│ movzwl 0x2(%rax),%edi
0.05 │ │ cmp %di,%dx
24.54 │ │↑ jne 28
3.58 │ │ movzwl 0x2(%rax),%edx
0.66 │ │ incq (%rsi)
│ │ inc %edx
│ │ mov %dx,0x2(%rax)
│ ├──dec %ecx
0.27 │ └──jne 18
The first thing our threads do is an atomic fetch-and-add (xadd) to get their place in line. The threads then wait inside the loop at label 28:, where the now-serving number is repeatedly read.
If a thread finds the now-serving number is equal to its ticket number, it breaks out of the loop, increments the shared value (incq), then increments the now-serving number and stores it to memory. The thread then either jumps back to the top of the loop to get a new ticket, or finishes executing.
Removing the Volatile Qualifier
I mentioned earlier that we needed the volatile qualifier to prevent the compiler from caching the now-serving number in a register. Let’s look at the assembly generated when the volatile qualifier is removed:
0.81 │11:┌─→mov $0x1,%edi
17.86 │ │ lock xadd %di,(%rax)
│ │ movzwl 0x2(%rax),%esi
0.74 │1f:│ cmp %si,%di
79.72 │ │↑ jne 1f
│ │ incq (%rcx)
0.87 │ │ incw 0x2(%rax)
│ ├──dec %edx
│ └──jne 11
Similar to the previous assembly, our threads first do an atomic fetch-and-add to get a place in line. Next, they read the now-serving number into the register esi. From there, the threads fall into a two-instruction loop at label 1f:, where the values of two registers (our number in line and the now-serving number) are compared until they are no longer equal.
However, if a thread tries to grab the lock while it is already taken, it will never break out of this loop! Let’s think about why. If the lock is free, and one thread tries to grab the lock, the thread grabs a place in line, reads the now-serving number into a register, finds that it is equal to its place in line, and exits the lock method.
If the lock is already taken, the thread gets its place in line, reads the now-serving position into a register, and compares its place in line to the now-serving position (both in registers) forever, finding that they are never equal. The value of the now-serving number in the register never changes, because the unlocking thread only updates the now-serving position in memory. This does not affect any values stored in registers (cache coherence does not apply to registers!).
Threads waiting in the loop at 1f: will never see the update in memory, and our program will deadlock.
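To make the failure mode concrete, here is a sketch of the broken variant (identical to our implementation above, but with the volatile qualifier dropped):

// BROKEN sketch - same ticket lock, but `serving` is a plain
// (non-volatile, non-atomic) integer
std::uint16_t serving{0};

void lock() {
  auto place = line.fetch_add(1);

  // The compiler may load `serving` into a register once and compare
  // register-to-register forever; the unlocking thread's store to
  // memory is never observed, and the waiting thread deadlocks
  while (serving != place)
    ;
}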
Performance
Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
ticket_lock/1/real_time 0.651 ms 0.053 ms 1093
ticket_lock/2/real_time 11.3 ms 0.069 ms 67
ticket_lock/4/real_time 29.7 ms 0.103 ms 24
ticket_lock/8/real_time 69.3 ms 0.190 ms 10
Our performance is lower compared to previous implementations (e.g., exponential backoff), but it’s not terrible. In fact, it’s fairly close to our passive backoff implementation. As a reminder, here are those performance numbers:
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time 1.03 ms 0.054 ms 676
passive_backoff/2/real_time 2.48 ms 0.082 ms 276
passive_backoff/4/real_time 9.59 ms 0.110 ms 70
passive_backoff/8/real_time 46.7 ms 0.236 ms 15
So why is our performance worse? Let’s think about what we optimized. In the exponential backoff implementation, we were roughly serializing our application through long delays so that one thread had uninterrupted access to the lock.
In our ticket spinlock, we’re doing almost the opposite. We’re now ensuring that the lock will be passed between requesting threads in FIFO order. That means the cache line with our lock and shared value will bounce between the cores of requesting threads, adding extra overhead.
Thoughts on Compiler Memory Re-Ordering
One of the potential pitfalls we have not discussed yet is memory re-ordering done by the compiler. volatile does not guarantee a “happens-before” relationship, meaning that a non-volatile store can be reordered and placed after a volatile store, even if it occurs earlier in the program as written.
To prevent this, we can insert a compiler memory barrier in our unlock method:
void unlock() {
  asm volatile("" : : : "memory");
  serving = serving + 1;
}
This tricks the compiler into thinking we have an instruction that may touch all memory (thereby introducing fake dependencies into our program). If we look at the assembly for our benchmark after inserting this hint, we get the following:
│18:┌─→mov $0x1,%edx
20.97 │ │ lock xadd %dx,(%rax)
0.04 │ │ nop
46.76 │28:│ movzwl 0x2(%rax),%edi
0.06 │ │ cmp %di,%dx
28.11 │ │↑ jne 28
3.31 │ │ incq (%rsi)
0.33 │ │ movzwl 0x2(%rax),%edx
0.01 │ │ inc %edx
0.09 │ │ mov %dx,0x2(%rax)
│ ├──dec %ecx
0.32 │ └──jne 18
The only difference is that our read (movzwl) of the now-serving number was moved after the increment of our shared value val. Our operations now occur in the same order as in our high-level C++. Unfortunately, this comes at a slight cost to our performance (mainly at higher thread counts):
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
ticket_lock/1/real_time 0.635 ms 0.044 ms 1101
ticket_lock/2/real_time 10.3 ms 0.057 ms 66
ticket_lock/4/real_time 27.5 ms 0.088 ms 26
ticket_lock/8/real_time 77.1 ms 0.187 ms 9
If all of these volatile and memory-ordering details are getting too confusing, we can always replace our now-serving number with an atomic. With the default std::memory_order_seq_cst, this reordering will not occur.
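A minimal sketch of that variant (simply swapping the volatile counter for a std::atomic; this is my reading of the change, not code quoted from the original post):

class Spinlock {
 private:
  std::atomic<std::uint16_t> line{0};

  // Now atomic: loads/stores default to sequentially consistent ordering
  std::atomic<std::uint16_t> serving{0};

 public:
  void lock() {
    // Get the latest place in line (and increment the value)
    auto place = line.fetch_add(1);

    // Wait until our number is "called"
    while (serving.load() != place)
      ;
  }

  // The atomic increment is what shows up as `lock incw` below
  void unlock() { serving.fetch_add(1); }
};

Here is the resulting assembly: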
0.02 │18:┌─→mov $0x1,%edi
9.33 │ │ lock xadd %di,(%rax)
0.04 │ │ nop
50.43 │28:│ movzwl (%rcx),%r8d
0.08 │ │ cmp %r8w,%di
23.09 │ │↑ jne 28
4.21 │ │ incq (%rsi)
12.79 │ │ lock incw (%rcx)
│ ├──dec %edx
│ └──jne 18
We see roughly the same assembly as when we used the compiler memory barrier. The only major difference is that we now use an atomic increment (lock incw) to release the lock. If we look at the performance, we see it has dropped again, this time for all thread counts:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
ticket_lock/1/real_time 1.07 ms 0.054 ms 656
ticket_lock/2/real_time 16.7 ms 0.075 ms 46
ticket_lock/4/real_time 43.5 ms 0.105 ms 16
ticket_lock/8/real_time 111 ms 0.236 ms 6
We could, of course, change the memory ordering for each of our atomics away from conservative sequential consistency (to the appropriate ordering), but I’ll leave that as an exercise for the reader (I didn’t find the end results to be particularly compelling or worth including here).
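For those who want a starting point for that exercise, here is one plausible sketch (the orderings shown are my assumption of “the appropriate ordering”, and were not benchmarked in this post):

void lock() {
  // Taking a ticket needs atomicity, but imposes no ordering by itself
  auto place = line.fetch_add(1, std::memory_order_relaxed);

  // The acquire load keeps the critical section from moving above it
  while (serving.load(std::memory_order_acquire) != place)
    ;
}

// The release increment keeps the critical section from moving below it
void unlock() { serving.fetch_add(1, std::memory_order_release); }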
Final Thoughts
Ticket-based spinlocks give us a fair way for threads to busy-wait for a lock. While our end-to-end performance took a hit (compared to our exponential backoff spinlock), we were able to understand why. Throughput is not the only metric in the world, and there are plenty of situations where we want a compromise between throughput and fairness.
Thanks for reading,
–Nick