Thread Affinity

When you write a multi-threaded application, there are a number of performance pitfalls to watch out for. If you partition your data poorly, you may suffer from false sharing. Likewise, if you don’t fairly partition your work, you may end up with some threads overloaded with work, and others that are mostly idle. Another thing to consider is where threads are placed in relation to each other.

Placing threads that share data close to each other may allow you to exploit inter-thread locality. Conversely, placing threads that share data far apart may result in contention, especially if the threads are scheduled to different NUMA nodes. Unfortunately, your operating system is mostly oblivious to the needs of your application with respect to thread scheduling. As a result, it often makes sub-optimal decisions (if not given guidance). In this blog post, we’ll be taking a look at exactly how we can give some scheduling guidance to our O.S. using Pthreads.

Compiling and Running the Benchmarks

The benchmarks from this blog post were written using Google Benchmark. The following is the command I used to compile them (on Ubuntu 18.04 using g++-10).

g++ thread_affinity.cpp -lbenchmark -lpthread -O3 -march=native -mtune=native -o thread_affinity

Relying on the O.S.

Let’s craft a benchmark with pairs of threads fighting over shared pieces of data. The first thing we need to do is to have our threads compete over something. Let’s use an atomic integer that we increment 100k times. Here’s what that might look like.

// Hammer on a shared atomic integer with 100k increments
void work(std::atomic<int>& a) {
  for (int i = 0; i < 100000; ++i)
    a++;
}

Now we need some threads to do the fighting. Here’s how the rest of my main benchmark loop was implemented.

// An atomic int aligned to (and padded out to) its own 64B cache line
struct alignas(64) AlignedAtomic {
  AlignedAtomic(int v) { val = v; }
  std::atomic<int> val;
};

void os_scheduler() {
  AlignedAtomic a{0};
  AlignedAtomic b{0};

  // Create four threads and use lambda to launch work
  std::thread t0([&]() { work(a.val); });
  std::thread t1([&]() { work(a.val); });
  std::thread t2([&]() { work(b.val); });
  std::thread t3([&]() { work(b.val); });

  // Join the threads
  t0.join();
  t1.join();
  t2.join();
  t3.join();
}

Threads t0 and t1 compete for AlignedAtomic ‘a’, while threads t2 and t3 compete for AlignedAtomic ‘b’. AlignedAtomic is just an atomic int wrapped in a struct which is aligned to 64B. This keeps our atomic ints from winding up on the same cache line (avoiding any false sharing).
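As a quick sanity check (not part of the original benchmark), we can assert at compile time that the struct really occupies its own cache line (assuming 64B lines, as on x86):

// alignas(64) both aligns the struct and pads its size out to 64B,
// so two AlignedAtomics can never land on the same cache line
static_assert(alignof(AlignedAtomic) == 64, "AlignedAtomic should be 64B aligned");
static_assert(sizeof(AlignedAtomic) == 64, "AlignedAtomic should fill a cache line");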

For this experiment, we’re expecting heavy contention on two cache lines (the ones that hold our AlignedAtomics). Let’s take a look at both the execution time and the Shared Data Cache Line Table from perf c2c. Here is the execution time (we’ll focus on the wall clock time reported in the ‘Time’ column).

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
osScheduling/real_time       5.09 ms        0.080 ms          100

Ok, some baseline numbers. Now let’s take a look at some stats using perf c2c record and perf c2c report. Here’s a truncated view of the Shared Data Cache Line Table.

Shared Data Cache Line Table     (2 entries, sorted on Total HITMs)
                             Total      Tot  ----- LLC Load Hitm -----
Index           Cacheline  records     Hitm    Total      Lcl      Rmt
    0      0x7fff1dede640     3242   53.67%     1630     1630        0     
    1      0x7fff1dede600     2703   46.23%     1404     1404        0 

Just as we expected, two cache lines under heavy contention. As a reminder, Hitm events are being used as a measure of contention. A Hitm event occurs when a thread loads data and finds that the cache line it’s looking for is in the modified state in a different cache. Having a lot of these events indicates that two or more threads are fighting for exclusive access to a cache line so they can write to it. Let’s zoom in on cache line 0x7fff1dede640 for some more information.

Cacheline 0x7fff1dede640
----- HITM -----                cpu
    Rmt      Lcl       Off      cnt
  0.00%   59.20%       0x0        4  (Thread t2)
  0.00%   40.80%       0x0        4  (Thread t3)

Two threads accessing the same part of the same cache line (AlignedAtomic ‘b’ in this case). What we should focus on is the cpu cnt column. This is the number of physical cores that samples came from (all 4 in my 4 core/8 thread CPU). This means that both of our threads hopped around to all 4 cores in our system during execution.

When the O.S. schedules a thread to a core, it doesn’t just leave it there until it finishes. It may move the thread around if the scheduler thinks it is advantageous to do so (depending on the current load of the CPU). This means our performance may vary (and significantly). Let’s compare the timing for a few runs.
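If you want to watch this happen, one option (a small standalone sketch of my own, not part of the benchmark) is to sample sched_getcpu() from inside a worker loop and count how many different CPUs the thread lands on:

#include <sched.h>

#include <atomic>
#include <cstdio>
#include <set>
#include <thread>

int main() {
  std::set<int> cpus_seen;
  std::atomic<int> counter{0};

  std::thread t([&]() {
    for (int i = 0; i < 100000; ++i) {
      counter++;
      // Record which CPU this iteration ran on
      cpus_seen.insert(sched_getcpu());
    }
  });
  t.join();

  std::printf("Thread ran on %zu different CPU(s)\n", cpus_seen.size());
}

On an idle machine the set may stay small, but under load you will often see the thread visit several CPUs.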

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
osScheduling/real_time       6.08 ms        0.149 ms          133
osScheduling/real_time       5.16 ms        0.140 ms          100
osScheduling/real_time       5.45 ms        0.078 ms          100
osScheduling/real_time       4.77 ms        0.074 ms          145

A wide range of results! Even in the better cases of thread scheduling (lower execution time), let’s look at the L1 data cache hit rate.

69,543,035    L1-dcache-loads          #   27.243 M/sec                  
35,919,314    L1-dcache-load-misses    #   51.65% of all L1-dcache hits  

We’re missing almost half the time! Let’s see if we can help our O.S. with some scheduling guidance.
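As an aside, the perf numbers in this section were collected with commands along these lines (the binary name and event names are from my setup, and may differ on yours):

# Record and inspect contended cache lines
perf c2c record ./thread_affinity
perf c2c report

# Count L1 data cache loads and misses
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./thread_affinity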

Setting Thread Affinity

Every major O.S. supports the ability to set thread affinity (i.e., inform the O.S. which physical cores a thread should be scheduled to). However, we can’t do this solely with C++11’s std::thread. We’ll need to use the dedicated functions provided by the library used to implement std::thread. On my Linux machine, it’s Pthreads. Luckily for us, we don’t have to abandon C++11 threads entirely. We can use the std::thread method native_handle() which returns (in this case) a handle to our underlying pthread.

Let’s re-write our benchmark’s main loop using pthread_setaffinity_np to pin each pair of threads to a single core. Here’s how my implementation looked.

#include <cassert>
#include <pthread.h>

void thread_affinity() {
  AlignedAtomic a{0};
  AlignedAtomic b{0};

  // Create cpu sets for threads 0,1 and 2,3
  cpu_set_t cpu_set_1;
  cpu_set_t cpu_set_2;

  // Zero them out
  CPU_ZERO(&cpu_set_1);
  CPU_ZERO(&cpu_set_2);

  // And set the CPU cores we want to pin the threads to
  CPU_SET(0, &cpu_set_1);
  CPU_SET(1, &cpu_set_2);

  // Create threads 0 and 1, and pin them to core 0
  std::thread t0([&]() { work(a.val); });
  assert(pthread_setaffinity_np(t0.native_handle(), sizeof(cpu_set_t),
                                &cpu_set_1) == 0);
  std::thread t1([&]() { work(a.val); });
  assert(pthread_setaffinity_np(t1.native_handle(), sizeof(cpu_set_t),
                                &cpu_set_1) == 0);
 
  // Create threads 2 and 3, and pin them to core 1
  std::thread t2([&]() { work(b.val); });
  assert(pthread_setaffinity_np(t2.native_handle(), sizeof(cpu_set_t),
                                &cpu_set_2) == 0);
  std::thread t3([&]() { work(b.val); });
  assert(pthread_setaffinity_np(t3.native_handle(), sizeof(cpu_set_t),
                                &cpu_set_2) == 0);

  // Join the threads
  t0.join();
  t1.join();
  t2.join();
  t3.join();
}

Let’s pick apart the new parts of our function. The first new addition is cpu_set_t. This is used to specify the set of CPUs (as numbered by the O.S.) that a thread may be scheduled on. With CPU_ZERO, we initialize our cpu_set_t to zero. We then use CPU_SET to mark which CPUs will be valid scheduling options for our threads. Finally, we call pthread_setaffinity_np with our native thread handle (a handle to a pthread), the size of the cpu_set_t, and the cpu_set_t itself to set the thread’s affinity. You can find more information in the pthread_setaffinity_np man page.
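If you want to double-check that the pinning actually took effect, you can read the mask back with pthread_getaffinity_np. Here’s a minimal sketch, assuming it is dropped into thread_affinity() right after the calls above (and that <cstdio> is available):

// Read back t0's affinity mask and print the CPUs it may run on
cpu_set_t check;
CPU_ZERO(&check);
if (pthread_getaffinity_np(t0.native_handle(), sizeof(cpu_set_t), &check) == 0) {
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
    if (CPU_ISSET(cpu, &check))
      std::printf("t0 may run on CPU %d\n", cpu);
}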

For our purposes, we’ll pin threads t0 and t1 to physical core 0, and threads t2 and t3 to physical core 1. This means that our pairs of threads can share the same copy of the cache line containing their respective AlignedAtomics instead of stealing it away from each other. Let’s check our execution time.

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
threadAffinity/real_time       1.32 ms        0.074 ms          536

Over 4x faster (on average) than when we just relied on the O.S. to do its best! Furthermore, this result is far more stable because we’re no longer relying on non-deterministic thread scheduling by the O.S. Let’s take a look at the L1 hit rate.

281,131,290    L1-dcache-loads          #    177.685 M/sec
  7,969,745    L1-dcache-load-misses    #    2.83% of all L1-dcache hits

Very few L1 misses! This is because our pairs of threads are no longer invalidating each other. In my case, the cache lines containing the AlignedAtomics disappeared completely from the Shared Data Cache Line table from perf c2c! All that was left was ~30 cache lines with very few Hitm events (mainly from the kernel, so I’ve omitted that data from the post).

Concluding remarks

You can’t always escape sharing in multi-threaded applications. However, you do have some control over minimizing the performance impact when it occurs. Where you place memory and threads is extremely important, especially when threads are scattered across NUMA nodes.
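As a related aside (not something these benchmarks needed), on multi-socket machines you can do coarse thread and memory placement from the command line with numactl, for example:

# Run with threads and allocations confined to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./thread_affinity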

Thanks for reading,

–Nick