Spinlocks - Part 4
While everyone cares about performance (to some degree), an increasing number of people also care about power-consumption (especially in massive data-centers). Intuitively we can save on power by improving performance so that the processor is active for a shorter amount of time. However, there are ways we can improve power-consumption in applications that run for (approximately) the same length of time.
In the last blog post, we used an optimization called active backoff to improve how our spinlock handles the burst of contention whenever the lock is freed. In this blog post, we’ll spend some time understanding where that optimization wastes power, then analyze the performance and power consumption of an optimization called passive backoff.
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com
Our Benchmark
We will evaluate our spinlock implementations using Google Benchmark. The structure of the benchmark can be found below:
// Small Benchmark
static void bench(benchmark::State &s) {
  // Sweep over a range of threads
  auto num_threads = s.range(0);

  // Value we will increment
  std::int64_t val = 0;

  // Allocate a vector of threads
  std::vector<std::thread> threads;
  threads.reserve(num_threads);

  Spinlock sl;

  // Timing loop
  for (auto _ : s) {
    for (auto i = 0u; i < num_threads; i++) {
      threads.emplace_back([&] { inc(sl, val); });
    }
    // Join threads
    for (auto &thread : threads) thread.join();
    threads.clear();
  }
}
BENCHMARK(bench)
  ->RangeMultiplier(2)
  ->Range(1, std::thread::hardware_concurrency())
  ->UseRealTime()
  ->Unit(benchmark::kMillisecond);
We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:
// Increment val once each time the lock is acquired
void inc(Spinlock &s, std::int64_t &val) {
  for (int i = 0; i < 100000; i++) {
    s.lock();
    val++;
    s.unlock();
  }
}
For 100k iterations, our threads try and lock our spinlock, increment a shared value, then release the lock. This gives us a heavy contention scenario in which to test our spinlock.
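If you want to reproduce these numbers, the benchmark can be built and run along these lines (a sketch; the file name is an assumption, and you’ll need Google Benchmark and pthreads installed):

# Build the benchmark (file name is hypothetical)
g++ spinlock_benchmark.cpp -O3 -lbenchmark -lpthread -o spinlock_benchmark

# Run all thread counts (1, 2, 4, and 8 on an 8-thread machine)
./spinlock_benchmark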
A Spinlock with Passive Backoff
Before we jump into our passive backoff optimization implementation, let’s try and understand where the waste is in our active backoff implementation.
Wasteful Instructions
Let’s take a look at just the assembly for the active backoff optimization we applied in the last post:
0.35 │30:│ movl $0x0,-0x4(%rsp)
0.07 │ │ mov -0x4(%rsp),%ecx
0.02 │ │ cmp $0x95,%ecx
│ │↓ jg 5e
0.06 │ │ nop
1.77 │48:│ mov -0x4(%rsp),%ecx
1.44 │ │ inc %ecx
15.57 │ │ mov %ecx,-0x4(%rsp)
21.50 │ │ mov -0x4(%rsp),%ecx
0.07 │ │ cmp $0x95,%ecx
8.49 │ │↑ jle 48
Our active backoff delay loop leads to thousands, millions, or billions of extra increments, loads, and stores, depending on the duration of our benchmark. While executing these instructions gives us an improvement in performance, it would be nice to get the same pause effect without executing useless instructions.
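For reference, the delay loop that produces this assembly looks something like the snippet below (a sketch reconstructed from the previous post; the volatile counter and the bound of 150 match the cmp $0x95 above, but treat the exact count as an assumption):

// Dummy delay loop from the active backoff implementation.
// The volatile counter forces the compiler to emit a real load,
// increment, and store every iteration (the instructions we see above).
for (volatile int i = 0; i < 150; i++)
  ;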
For spin-wait loops like our spinlock, we can use the _mm_pause() intrinsic for this exact purpose.
The Pause Intrinsic
Here is the _mm_pause() intrinsic from the Intel Intrinsics Guide. If we look at the description, it says:
Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.
With our original active backoff implementation, we got a performance benefit from creating a finite pause within our loop that polls (reads) the state of our spinlock. However, this came at the cost of thousands, millions, or billions of extra instructions (depending on how long our application runs) executed for our dummy for loop.
We can use the _mm_pause() intrinsic to get the same effect (generating a delay between reads of our spinlock state) without executing any of those extra instructions.
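On x86, the intrinsic comes from <immintrin.h> (or <emmintrin.h>) and compiles down to a single pause instruction. If you want to keep the spinlock portable, one option is to hide it behind a tiny wrapper like the sketch below (cpu_pause is a made-up name, and non-x86 targets would substitute their own hint instruction, e.g., yield on ARM):

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>  // _mm_pause()
#endif

// Minimal wrapper around the spin-wait hint (sketch)
inline void cpu_pause() {
#if defined(__x86_64__) || defined(__i386__)
  _mm_pause();
#else
  // No pause-style hint assumed here; just insert a compiler barrier
  // so the spin loop is not optimized away
  asm volatile("" ::: "memory");
#endif
}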
The Passive Backoff Optimization
Let’s take a look at our spinlock’s lock method, but with passive instead of active backoff. For this, we will be swapping out our for loop with calls to the _mm_pause() intrinsic. Here’s how that may look:
// Locking mechanism
void lock() {
  while (1) {
    // Try and grab the lock
    if (!locked.exchange(true)) return;

    do {
      // Pause for some number of iterations
      for (int i = 0; i < 4; i++) _mm_pause();
      // Check to see if the lock is free
    } while (locked.load());
  }
}
The core functionality is exactly the same as our active backoff implementation. The only difference is that we’re using the _mm_pause() intrinsic instead of the dummy for loop.
Just like we could configure the number of iterations of our for loop to control our delay in the previous implementation, we can tune how many times we use the pause intrinsic as well. I’ve chosen 4 iterations for this use-case to approximately match the performance of our active backoff implementation.
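As a sketch of how you might make that count easy to tune (the helper name and template parameter below are my own, not from the original code), the pause count can be pulled out into a compile-time constant:

#include <atomic>
#include <immintrin.h>

// Spin on the lock state, issuing PauseCount pause hints between reads
// (sketch: the delay becomes a compile-time knob instead of a magic number)
template <int PauseCount = 4>
void spin_until_free(std::atomic<bool> &locked) {
  do {
    // Back off before polling again
    for (int i = 0; i < PauseCount; i++) _mm_pause();
    // Re-check the (locally cached) lock state
  } while (locked.load());
}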
Here is our full spinlock implementation with passive backoff:
// Required headers for std::atomic and _mm_pause()
#include <atomic>
#include <immintrin.h>

// Simple Spinlock
// Lock now performs local spinning
// Lock now does backoff
class Spinlock {
 private:
  // Lock is just an atomic bool
  std::atomic<bool> locked{false};

 public:
  // Locking mechanism
  void lock() {
    // Keep trying
    while (1) {
      // Try and grab the lock
      // Exit if we got the lock
      if (!locked.exchange(true)) return;

      // If we didn't get the lock, just read the value which gets cached
      // locally. This leads to less traffic.
      // Each iteration we can also call pause to limit the number of writes.
      // How many times you should pause each iteration should be
      // experimentally determined.
      do {
        // Pause for some number of iterations
        for (int i = 0; i < 4; i++) _mm_pause();
        // Check to see if the lock is free
      } while (locked.load());
    }
  }

  // Unlocking mechanism
  // Just set the lock to free (false)
  // Can also use the assignment operator
  void unlock() { locked.store(false); }
};
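Since the class exposes lock() and unlock(), it satisfies the standard BasicLockable requirements, so it also works with the usual RAII helpers. For example (a usage sketch, not part of the benchmark):

#include <cstdint>
#include <mutex>  // std::lock_guard

Spinlock sl;
std::int64_t val = 0;

void safe_inc() {
  // The lock is released automatically when guard goes out of scope
  std::lock_guard<Spinlock> guard(sl);
  val++;
}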
Low-Level Assembly
Let’s take a look at the low-level assembly for our benchmark with this new optimization:
0.34 │20:┌─→mov %edi,%ecx
17.32 │ │ xchg %cl,(%rax)
0.01 │ │ test %cl,%cl
0.01 │ │↓ je 41
0.09 │ │ nop
12.61 │30:│ pause
12.39 │ │ pause
12.32 │ │ pause
12.06 │ │ pause
15.37 │ │ movzbl (%rax),%ecx
0.03 │ │ test %cl,%cl
0.05 │ │↑ jne 30
0.42 │ │↑ jmp 20
2.75 │41:│ incq (%rsi)
14.23 │ │ xchg %cl,(%rax)
│ ├──dec %edx
0.00 │ └──jne 20
No major structural changes compared to our previous implementation. We start by trying to get the lock with an atomic exchange (xchg). If we get the lock, we increment the shared value (incq), free the lock with another atomic exchange (xchg), decrement the loop counter (dec), and return to the top of the loop if there are more iterations.
If we didn’t get the lock, we fall into our locally spinning do-while loop with passive backoff starting at label 30:. There, we see 4 pause instructions (coming from our calls to the _mm_pause() intrinsic), followed by our read of the spinlock state that decides whether we pause again or try and grab the lock.
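For reference, annotated listings like the one above come from perf; something along these lines (the binary name and filter are assumptions):

# Sample just the 8-thread passive backoff case
perf record ./spinlock_benchmark --benchmark_filter=passive_backoff/8

# Show the per-instruction sample breakdown for the hot function
perf annotate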
Performance
Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time 1.03 ms 0.054 ms 676
passive_backoff/2/real_time 2.48 ms 0.082 ms 276
passive_backoff/4/real_time 9.59 ms 0.110 ms 70
passive_backoff/8/real_time 46.7 ms 0.236 ms 15
And here was our performance for the active backoff implementation:
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
active_backoff/1/real_time 1.05 ms 0.054 ms 679
active_backoff/2/real_time 2.60 ms 0.081 ms 281
active_backoff/4/real_time 9.49 ms 0.121 ms 75
active_backoff/8/real_time 50.5 ms 0.287 ms 14
As expected, the performance of the two implementations was about the same (because I approximately matched the number of loop iterations to the number of calls to _mm_pause()). The slight differences are to be expected because thread scheduling is non-deterministic, and it’s nearly impossible to perfectly match the performance of two different pieces of code. But remember, the primary thing we’re optimizing for here is power consumption. Let’s start digging into those metrics.
Power Consumption
To reason about the power consumption of our benchmarks, we’ll first look at the total number of instructions executed so we can compare the amount of work being done. We’ll then directly query the power performance counters to check if our reasoning about power consumption was correct.
Instructions Executed
Let’s start our power-consumption analysis by comparing the total number of instructions executed. For this, we’ll compare the 8-thread benchmark case for our active and passive backoff implementations with the number of benchmark iterations set to 50.
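These counts come from perf stat; roughly (the binary name and filter are assumptions):

# Count instructions (and cycles) for the 8-thread active backoff case
perf stat -e instructions,cycles ./spinlock_benchmark --benchmark_filter=active_backoff/8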
Here is the number of instructions executed for our active backoff run:
55,227,656,500 instructions # 0.76 insn per cycle
Around 55 billion. Because thread scheduling is non-deterministic, this number will vary from run to run, but for me it generally fell in the 45-60 billion instruction range.
Now let’s see how many instructions were executed for our passive backoff implementation:
1,012,803,760 instructions # 0.01 insn per cycle
A ~55x reduction in instructions! But this should make sense! In our active backoff implementation, we’re executing billions of extra instructions just to induce the delay inside of our locally spinning loop. In our passive backoff implementation, we’re using a dedicated instruction that adds a finite delay to our pipeline, allowing us to simply stop doing work for short periods of time.
Intuitively, we can reason that our passive backoff implementation, which executes 55x fewer instructions and has roughly equivalent end-to-end execution time, should consume less power while running than our active backoff implementation.
Additional Statistics
One thing you may notice if you look at our other performance counter stats for our passive backoff benchmark is that we have a very high branch misprediction rate and L1 data cache miss rate:
216,996,864 branches # 12.154 M/sec
35,495,595 branch-misses # 16.36% of all branches
240,163,142 L1-dcache-loads # 13.452 M/sec
147,908,665 L1-dcache-load-misses # 61.59% of all L1-dcache hits
What’s going on? Let’s think about what we know about our benchmarks and spinlock implementations. In both benchmarks, we have a heavy contention scenario where threads fight over the cache line(s) containing our spinlock state and shared value val. Furthermore, when a thread actually sees the lock as free or successfully gets the lock is non-deterministic, and could be difficult for our branch predictor to guess correctly.
These things explain our observations in passive backoff, but why don’t they look like a problem in our active backoff implementation?
7,871,605,800 branches # 576.093 M/sec
76,998,863 branch-misses # 0.98% of all branches
15,543,237,429 L1-dcache-loads # 1137.551 M/sec
106,735,606 L1-dcache-load-misses # 0.69% of all L1-dcache hits
Just look at the raw counter values! We’re only doing ~217 million branches in our passive backoff benchmark, but ~7.9 billion in our active backoff benchmark. Likewise, our L1 data cache loads increased from ~240 million to ~15.5 billion! Most of these loads and branches come from the dummy for loop we added to insert a delay. They hide the underlying randomness and lack of cache locality that is fundamental to our benchmark.
Power Performance Counters
A more direct way that we can estimate the power consumption of our benchmarks is by directly accessing the power performance counters. Specifically, we’ll be looking at power/energy-pkg/.
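These counters are exposed through perf as well; the measurement is package-wide, so it has to be taken system wide and usually requires root (a sketch; the binary name and filter are assumptions):

# Measure package energy system wide while the benchmark runs
sudo perf stat -a -e power/energy-pkg/ ./spinlock_benchmark --benchmark_filter=passive_backoff/8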
Here is the power consumption estimate for our active backoff benchmark:
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
active_backoff/8/iterations:50 46.3 ms 0.172 ms 50
Performance counter stats for 'system wide':
190.59 Joules power/energy-pkg/
2.337446416 seconds time elapsed
And here is the estimate for our passive backoff benchmark:
--------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------
passive_backoff/8/iterations:50 48.7 ms 0.197 ms 50
Performance counter stats for 'system wide':
151.31 Joules power/energy-pkg/
2.457350713 seconds time elapsed
Looks great! Notice that even though our passive backoff benchmark was slightly slower and ran for longer, we still used significantly less energy (151 Joules vs. 190 Joules).
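We can also turn these numbers into a rough average package power by dividing energy by elapsed time: 190.59 J / 2.34 s ≈ 81.5 W for active backoff versus 151.31 J / 2.46 s ≈ 61.6 W for passive backoff, or roughly a 25% reduction in average power draw (keeping in mind that power/energy-pkg/ is a package-wide, system-wide counter, so it includes everything else running on the machine).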
Final Thoughts
While compilers are great, they don’t always use the best instructions for the job, and there are many hyper-specific instructions (like our pause) that they will simply never generate. This makes it very important for us to know what tools we have available in our programming toolbox (e.g., intrinsics). In the next post, we’ll look at some spinlock implementations that use a non-constant form of backoff.
Thanks for reading,
–Nick
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com