Spinlocks - Part 7
In the last blog post, we looked at yet another implementation of a spinlock (a ticket spinlock). However, most people will never write their own synchronization mechanisms. Instead, they will use what is available to them in libraries.
In this blog post, we’ll look at a pthread spinlock implementation (pthread_spinlock_t), see how it is implemented, and measure the performance using a simple benchmark.
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com
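Before digging in, here is a minimal sketch of the pthread spinlock API we’ll be exercising (error checking omitted for brevity):

// Minimal usage sketch of the pthread spinlock API
#include <pthread.h>

int main() {
  pthread_spinlock_t sl;

  // Initialize the lock (not shared across processes)
  pthread_spin_init(&sl, PTHREAD_PROCESS_PRIVATE);

  // Enter and exit a critical section
  pthread_spin_lock(&sl);
  // ... work done while holding the lock ...
  pthread_spin_unlock(&sl);

  // Release any resources associated with the lock
  pthread_spin_destroy(&sl);
  return 0;
}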
Our Benchmark
We will evaluate the pthread spinlock using Google Benchmark. The structure of the benchmark can be found below:
// Small Benchmark
static void bench(benchmark::State &s) {
  // Sweep over a range of threads
  auto num_threads = s.range(0);
  // Value we will increment
  std::int64_t val = 0;
  // Allocate a vector of threads
  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  // Create a spinlock
  pthread_spinlock_t sl;
  pthread_spin_init(&sl, PTHREAD_PROCESS_PRIVATE);
  // Timing loop
  for (auto _ : s) {
    for (auto i = 0u; i < num_threads; i++) {
      threads.emplace_back([&] { inc(sl, val); });
    }
    // Join threads
    for (auto &thread : threads) thread.join();
    threads.clear();
  }
  // Free the spinlock's resources
  pthread_spin_destroy(&sl);
}
BENCHMARK(bench)
    ->RangeMultiplier(2)
    ->Range(1, std::thread::hardware_concurrency())
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);
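If you’d like to build and run the benchmark yourself, something along these lines should work, assuming Google Benchmark is installed and the benchmark and inc function live in a file ending with BENCHMARK_MAIN() (the file name spinlock.cpp is just for illustration):

g++ spinlock.cpp -O3 -lbenchmark -lpthread -o spinlock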
We’ll be evaluating our benchmark for a power-of-two number of threads (1, 2, 4, and 8 on my machine), where each thread runs a function called inc. This function can be found below:
// Increment val once each time the lock is acquired
void inc(pthread_spinlock_t &sl, std::int64_t &val) {
  for (int i = 0; i < 100000; i++) {
    pthread_spin_lock(&sl);
    val++;
    pthread_spin_unlock(&sl);
  }
}
For 100,000 iterations, our threads try to lock the spinlock, increment a shared value, then release the lock. This gives us a heavy-contention scenario in which to test the pthread_spinlock_t.
The Pthread Spinlock
At this point of the post, we’d usually look at some sort of C or C++ implementation of a spinlock. Unfortunately, we have no such luxury for our pthread_spin_lock and pthread_spin_unlock functions. This is because they are both implemented directly in assembly! Let’s dig into the assembly for these functions and see how they work.
pthread_spin_lock
Here is the implementation of pthread_spin_lock from glibc:
/* Copyright (C) 2012-2020 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<https://www.gnu.org/licenses/>. */
#include <lowlevellock.h>
#include <sysdep.h>
ENTRY(pthread_spin_lock)
1:      LOCK
        decl    0(%rdi)
        jne     2f
        xor     %eax, %eax
        ret

        .align  16
2:      rep
        nop
        cmpl    $0, 0(%rdi)
        jg      1b
        jmp     2b
END(pthread_spin_lock)
The first thing the pthread_spin_lock routine does is try to grab the lock using an atomic decrement (LOCK decl). Here, a thread gets the lock when it decrements the pthread_spinlock_t value in (%rdi) from 1 to 0 (this sets the zero flag read by the following jne instruction). If the zero flag is set, the thread falls through the jne and exits the routine, returning 0.
If the thread does not get the lock, it jumps to a waiting loop that starts with the pair of instructions rep and nop. These instructions say to repeat a no-op for cx iterations. However, this pair of instructions has the same opcode as the pause instruction we introduced with our passive backoff implementation! This is a clever way to be backwards compatible. On very old processors, the instruction pair gets decoded as rep and nop, and on newer processors it will be decoded as pause.
You can find details about the pause instruction here.
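You don’t have to write assembly to get this instruction; most x86 compilers expose it as an intrinsic. A small sketch (the cpu_relax name is just illustrative):

#include <immintrin.h> // _mm_pause()

// Spin-wait hint: emits the pause instruction (encoded as rep; nop)
inline void cpu_relax() { _mm_pause(); }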
After our disguised pause instruction, the thread checks whether the spinlock value is greater than 0, and either jumps back to try the atomic decrement of (%rdi) again, or jumps back to the top of the waiting loop.
Fundamentally, it is a spinlock doing the same passive backoff optimization (from this post) with some small hand-tuned differences (e.g., instruction scheduling and using an atomic decrement instead of an atomic exchange).
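To make the control flow a little easier to follow, here is a rough C++ sketch of the same locking algorithm using a std::atomic (purely illustrative; the spin_lock name is made up, and this is not the actual glibc code):

#include <atomic>
#include <cstdint>
#include <immintrin.h> // _mm_pause()

// Rough C++ equivalent of the pthread_spin_lock assembly (illustrative only)
void spin_lock(std::atomic<std::int32_t> &lock) {
  while (true) {
    // Try to grab the lock with an atomic decrement (like LOCK decl);
    // fetch_sub returns the previous value, so a return of 1 means we
    // moved the lock from 1 to 0 and now own it
    if (lock.fetch_sub(1, std::memory_order_acquire) == 1) return;
    // Waiting loop: pause, then keep waiting until the value is positive
    do {
      _mm_pause();
    } while (lock.load(std::memory_order_relaxed) <= 0);
  }
}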
Here is how the assembly looks when decoded from our executable:
│ pthread_spin_lock():
│ endbr64
35.11 │ 4:┌─→lock decl (%rdi)
│ │↓ jne 10
0.01 │ │ xor %eax,%eax
0.67 │ │← retq
│ │ nop
29.73 │10:│ pause
34.26 │ ├──cmpl $0x0,(%rdi)
0.18 │ └──jg 4
0.05 │ ↑ jmp 10
Just about the same, except for the final addresses and the fact that rep + nop was decoded as a pause instruction.
But Is This Safe?
One thing you might have wondered is how we can get away with decrementing the pthread_spinlock_t value in (%rdi) instead of using an atomic exchange (like we did). Won’t the value eventually wrap back around to 1 after enough decrements (allowing another thread to get the lock at the same time)?
Yes, it will! Is this likely to happen? Certainly not. The value is only decremented when a new thread first tries to grab the lock (the waiting loop spins on a read-only cmpl). Since the lock word is a 32-bit integer, we would need roughly four billion threads to arrive while the lock is held before the value in (%rdi) rolled back around to 1. Even 100,000 threads trying for the same lock at the same time would not come close. We can safely ignore this corner case.
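As a quick sanity check on that math, here is a tiny sketch that models the lock word with unsigned 32-bit arithmetic (decrementing a held lock is just modular arithmetic on a 32-bit value):

#include <cstdint>
#include <iostream>

int main() {
  // The lock word is 0 while a thread holds the lock; each newly
  // arriving thread decrements it once. Count the decrements needed
  // for the 32-bit value to wrap back around to 1.
  std::uint32_t word = 0;
  std::uint64_t decrements = 0;
  do {
    word--;
    decrements++;
  } while (word != 1);
  // Prints 4294967295 (2^32 - 1)
  std::cout << decrements << " decrements to wrap back to 1\n";
  return 0;
}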
pthread_spin_unlock
Here is the implementation of pthread_spin_unlock from glibc:
/* Copyright (C) 2002-2020 Free Software Foundation, Inc.
This file is part of the GNU C Library.
Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<https://www.gnu.org/licenses/>. */
#include <sysdep.h>
ENTRY(pthread_spin_unlock)
        movl    $1, (%rdi)
        xorl    %eax, %eax
        retq
END(pthread_spin_unlock)

        /* The implementation of pthread_spin_init is identical. */
        .globl  pthread_spin_init
pthread_spin_init = pthread_spin_unlock
Not terribly complicated. All our pthread_spin_unlock function does is set the value of the pthread_spinlock_t to 1 and return 0 (through %eax). The next thread in the pthread_spin_lock function to perform an atomic decrement of the pthread_spinlock_t will end up getting the lock and returning 0 from the function.
Another thing to note is that pthread_spin_init is identical to the pthread_spin_unlock function (it just sets the value of the pthread_spinlock_t to 1).
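Continuing the rough C++ sketch from earlier (again purely illustrative, with a made-up spin_unlock name), the unlock side is a single store:

#include <atomic>
#include <cstdint>

// Rough C++ equivalent of pthread_spin_unlock (illustrative only)
void spin_unlock(std::atomic<std::int32_t> &lock) {
  // Like movl $1, (%rdi): a plain store suffices on x86, and maps to
  // a release store in the C++ memory model
  lock.store(1, std::memory_order_release);
}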
Here is the assembly for the function dumped from my executable:
│ pthread_spin_init():
0.19 │ endbr64
2.39 │ movl $0x1,(%rdi)
94.27 │ xor %eax,%eax
3.15 │ ← retq
Instead of duplicating code for the init and unlock functions, glibc simply aliases them to the exact same routine.
Performance
Here are the end-to-end performance numbers for 1, 2, 4, and 8 threads:
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
pthread_spinlock/1/real_time 0.641 ms 0.043 ms 1101
pthread_spinlock/2/real_time 2.82 ms 0.058 ms 284
pthread_spinlock/4/real_time 13.4 ms 0.094 ms 52
pthread_spinlock/8/real_time 46.5 ms 0.178 ms 15
If we compare this to our own passive backoff spinlock performance, we see they are fairly close:
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
passive_backoff/1/real_time 1.01 ms 0.045 ms 689
passive_backoff/2/real_time 2.69 ms 0.065 ms 259
passive_backoff/4/real_time 9.00 ms 0.083 ms 75
passive_backoff/8/real_time 46.2 ms 0.221 ms 15
However, there are a few things to note. For one, the pthread spinlock is significantly faster in the single-threaded case. This reflects an efficient design (compared to ours): a spinlock should be incredibly fast when there is no contention, and ours clearly carries some excess overhead.
Another important point is that our passive backoff spinlock implementation used 4 pause instructions in each iteration of its waiting loop. While this improved performance on my system, it isn’t very portable (it’s even discouraged in the Intel optimization guide). The pthread implementation gets about the same performance with just a single pause in the waiting loop.
Final Thoughts
We’ve come to the end of the last post (for now) on spinlocks. What we’ve seen is that it doesn’t take much to write a functionally correct spinlock, but it takes quite a lot of careful thinking to make one with good performance and/or fairness. We’ve also seen that a library implementation isn’t necessarily the end-all be-all for performance. With a little work, you can write your own implementation that works best for your particular situation.
Thanks for reading,
–Nick
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com