Spinlocks - Part 7
In the last blog post, we looked at yet another implementation of a spinlock (a ticket spinlock). However, most people will never write their own synchronization mechanisms. Instead, they will use what is available to them in libraries.
In this blog post, we’ll look at a pthread spinlock implementation (pthread_spinlock_t), see how it is implemented, and measure the performance using a simple benchmark.
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com
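Before digging in, here is a minimal sketch of the pthread spinlock API we’ll be exercising (error checking omitted for brevity):

// Minimal usage sketch of the pthread spinlock API
#include <pthread.h>

int main() {
  pthread_spinlock_t sl;

  // Initialize the lock (not shared across processes)
  pthread_spin_init(&sl, PTHREAD_PROCESS_PRIVATE);

  // Enter and exit a critical section
  pthread_spin_lock(&sl);
  // ... work done while holding the lock ...
  pthread_spin_unlock(&sl);

  // Release any resources associated with the lock
  pthread_spin_destroy(&sl);
  return 0;
}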
Our Benchmark
We will evaluate the pthread spinlock using Google Benchmark. The structure of the benchmark can be found below:
// Small Benchmark
static void bench(benchmark::State &s) {
  // Sweep over a range of threads
  auto num_threads = s.range(0);
  // Value we will increment
  std::int64_t val = 0;
  // Allocate a vector of threads
  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  // Create a spinlock
  pthread_spinlock_t sl;
  pthread_spin_init(&sl, PTHREAD_PROCESS_PRIVATE);
  // Timing loop
  for (auto _ : s) {
    for (auto i = 0u; i < num_threads; i++) {
      threads.emplace_back([&] { inc(sl, val); });
    }
    // Join threads
    for (auto &thread : threads) thread.join();
    threads.clear();
  }
  // Free the spinlock's resources
  pthread_spin_destroy(&sl);
}
BENCHMARK(bench)
    ->RangeMultiplier(2)
    ->Range(1, std::thread::hardware_concurrency())
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);
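If you’d like to build and run the benchmark yourself, something along these lines should work, assuming Google Benchmark is installed and the benchmark and inc function live in a file ending with BENCHMARK_MAIN() (the file name spinlock.cpp is just for illustration):

g++ spinlock.cpp -O3 -lbenchmark -lpthread -o spinlock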
We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:
// Increment val once each time the lock is acquired
void inc(pthread_spinlock_t &sl, std::int64_t &val) {
  for (int i = 0; i < 100000; i++) {
    pthread_spin_lock(&sl);
    val++;
    pthread_spin_unlock(&sl);
  }
}
For 100,000 iterations, our threads try to lock the spinlock, increment a shared value, then release the lock. This gives us a heavy-contention scenario in which to test the pthread_spinlock_t.
The Pthread Spinlock
At this point of the post, we’d usually look at some sort of C or C++ implementation of a spinlock. Unfortunately, we have no such luxury for our pthread_spin_lock and pthread_spin_unlock functions. This is because they are both implemented directly in assembly! Let’s dig into the assembly for these functions and see how they work.
pthread_spin_lock
Here is the implementation of pthread_spin_lock from glibc:
/* Copyright (C) 2012-2020 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<https://www.gnu.org/licenses/>. */
#include <lowlevellock.h>
#include <sysdep.h>
ENTRY(pthread_spin_lock)
1:      LOCK
        decl    0(%rdi)
        jne     2f
        xor     %eax, %eax
        ret

        .align  16
2:      rep
        nop
        cmpl    $0, 0(%rdi)
        jg      1b
        jmp     2b
END(pthread_spin_lock)
The first thing the pthread_spin_lock routine does is try to grab the lock using an atomic decrement (LOCK decl). Here, a thread gets the lock when it decrements the pthread_spinlock_t value in (%rdi) from 1 to 0 (this sets the zero flag read by the following jne instruction). If the zero flag is set, the thread falls through the jne and exits the routine, returning 0.
If the thread does not get the lock, it jumps to a waiting loop that starts with the pair of instructions rep and nop. These instructions say to repeat a no-op for cx iterations. However, this pair of instructions has the same opcode as the pause instruction we introduced with our passive backoff implementation! This is a clever way to be backwards compatible. On very old processors, the instruction pair gets decoded as rep and nop, and on newer processors it will be decoded as pause.
You can find details about the pause instruction here.
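You don’t have to write assembly to get this instruction; most x86 compilers expose it as an intrinsic. A small sketch (the cpu_relax name is just illustrative):

#include <immintrin.h> // _mm_pause()

// Spin-wait hint: emits the pause instruction (encoded as rep; nop)
inline void cpu_relax() { _mm_pause(); }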
After our disguised pause instruction, the thread checks whether the spinlock value is greater than 0, and either jumps back to try the atomic decrement of (%rdi) again, or jumps back to the top of the waiting loop.
Fundamentally, it is a spinlock doing the same passive backoff optimization (from this post) with some small hand-tuned differences (e.g., instruction scheduling and using an atomic decrement instead of an atomic exchange).
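To make the control flow a little easier to follow, here is a rough C++ sketch of the same locking algorithm using a std::atomic (purely illustrative; the spin_lock name is made up, and this is not the actual glibc code):

#include <atomic>
#include <cstdint>
#include <immintrin.h> // _mm_pause()

// Rough C++ equivalent of the pthread_spin_lock assembly (illustrative only)
void spin_lock(std::atomic<std::int32_t> &lock) {
  while (true) {
    // Try to grab the lock with an atomic decrement (like LOCK decl);
    // fetch_sub returns the previous value, so a return of 1 means we
    // moved the lock from 1 to 0 and now own it
    if (lock.fetch_sub(1, std::memory_order_acquire) == 1) return;
    // Waiting loop: pause, then keep waiting until the value is positive
    do {
      _mm_pause();
    } while (lock.load(std::memory_order_relaxed) <= 0);
  }
}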
Here is how the assembly looks when decoded from our executable:
│ pthread_spin_lock():
│ endbr64
35.11 │ 4:┌─→lock decl (%rdi)
│ │↓ jne 10
0.01 │ │ xor %eax,%eax
0.67 │ │← retq
│ │ nop
29.73 │10:│ pause
34.26 │ ├──cmpl $0x0,(%rdi)
0.18 │ └──jg 4
0.05 │ ↑ jmp 10
Just about the same, except for the final addresses and the fact that rep + nop was decoded as a pause instruction.
But Is This Safe?
One thing you might have wondered is how we can get away with decrementing the pthread_spinlock_t value in (%rdi) instead of using an atomic exchange (like we did). Won’t the value eventually wrap back around to 1 after enough decrements (allowing another thread to get the lock at the same time)?
Yes, it will! Is this likely to happen? Certainly not. The value is only decremented when a new thread first tries to grab the lock (the waiting loop spins on a read-only cmpl). Since the lock word is a 32-bit integer, we would need roughly four billion threads to arrive while the lock is held before the value in (%rdi) rolled back around to 1. Even 100,000 threads trying for the same lock at the same time would not come close. We can safely ignore this corner case.
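As a quick sanity check on that math, here is a tiny sketch that models the lock word with unsigned 32-bit arithmetic (decrementing a held lock is just modular arithmetic on a 32-bit value):

#include <cstdint>
#include <iostream>

int main() {
  // The lock word is 0 while a thread holds the lock; each newly
  // arriving thread decrements it once. Count the decrements needed
  // for the 32-bit value to wrap back around to 1.
  std::uint32_t word = 0;
  std::uint64_t decrements = 0;
  do {
    word--;
    decrements++;
  } while (word != 1);
  // Prints 4294967295 (2^32 - 1)
  std::cout << decrements << " decrements to wrap back to 1\n";
  return 0;
}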
pthread_spin_unlock
Here is the implementation of pthread_spin_unlock from glibc:
/* Copyright (C) 2002-2020 Free Software Foundation, Inc.
This file is part of the GNU C Library.
Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<https://www.gnu.org/licenses/>. */
#include <sysdep.h>
ENTRY(pthread_spin_unlock)
        movl    $1, (%rdi)
        xorl    %eax, %eax
        retq
END(pthread_spin_unlock)

        /* The implementation of pthread_spin_init is identical. */
        .globl  pthread_spin_init
pthread_spin_init = pthread_spin_unlock
Not terribly complicated. All our pthread_spin_unlock function does is set the value of the pthread_spinlock_t to 1 and return 0 (through %eax). The next thread in the pthread_spin_lock function to perform an atomic decrement of the pthread_spinlock_t will end up getting the lock and returning 0 from the function.
Another thing to note is that pthread_spin_init is identical to the pthread_spin_unlock function (it just sets the value of the pthread_spinlock_t to 1).
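Continuing the rough C++ sketch from earlier (again purely illustrative, with a made-up spin_unlock name), the unlock side is a single store:

#include <atomic>
#include <cstdint>

// Rough C++ equivalent of pthread_spin_unlock (illustrative only)
void spin_unlock(std::atomic<std::int32_t> &lock) {
  // Like movl $1, (%rdi): a plain store suffices on x86, and maps to
  // a release store in the C++ memory model
  lock.store(1, std::memory_order_release);
}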
Here is the assembly for the function dumped from my executable:
│ pthread_spin_init():
0.19 │ endbr64
2.39 │ movl $0x1,(%rdi)
94.27 │ xor %eax,%eax
3.15 │ ← retq
Instead of duplicating code for the init and unlock functions, glibc simply aliases them to the exact same routine.
Performance
Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
pthread_spinlock/1/real_time 0.641 ms 0.043 ms 1101
pthread_spinlock/2/real_time 2.82 ms 0.058 ms 284
pthread_spinlock/4/real_time 13.4 ms 0.094 ms 52
pthread_spinlock/8/real_time 46.5 ms 0.178 ms 15
If we compare this to our own passive backoff spinlock performance, we see they are fairly close:
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time 1.01 ms 0.045 ms 689
passive_backoff/2/real_time 2.69 ms 0.065 ms 259
passive_backoff/4/real_time 9.00 ms 0.083 ms 75
passive_backoff/8/real_time 46.2 ms 0.221 ms 15
However, there are a few things to note. For one, the pthread spinlock is significantly faster in the single-threaded case. This reflects an efficient design (compared to ours): a spinlock should be incredibly fast when there is no contention, and ours clearly carries some excess overhead.
Another important point is that our passive backoff spinlock implementation used 4 pause instructions in each iteration of its waiting loop. While this improved performance on my system, it isn’t very portable (it’s even discouraged in the Intel optimization guide). The pthread implementation gets about the same performance with just a single pause in the waiting loop.
Final Thoughts
We’ve come to the end of the last post (for now) on spinlocks. What we’ve seen is that it doesn’t take much to write a functionally correct spinlock, but it takes quite a lot of careful thinking to make one with good performance and/or fairness. We’ve also seen that a library implementation isn’t necessarily the end-all be-all for performance. With a little work, you can write your own implementation that works best for your particular situation.
Thanks for reading,
–Nick
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com