Hardware Memory Reordering
Even if we use software memory barriers to prevent the compiler from reordering memory accesses in our generated assembly, they still might be reordered by the hardware itself during execution. How these accesses can be reordered is a function of the processor’s memory consistency model.
In this blog post, we’ll be looking at an example of x86 Store-Load reordering in an implementation of Peterson’s algorithm, and how we can prevent it using hardware barriers.
Link to the source code
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com
Background on Sequential Consistency
Hardware memory reordering occurs when memory accesses become globally visible in an order different than the accesses appear in the program’s assembly. How a processor is allowed to reorder memory accesses is a function of the processor’s memory consistency model. The most intuitive of these models is sequential consistency.
In a sequentially consistent system, each memory access for a thread becomes globally visible in program order. However, memory accesses from different threads may still be interleaved in a global total order. Consider the following simple pseudo-assembly example where `A` and `B` are both initially `0`:

Thread 1 | Thread 2 |
---|---|
Write A, 1 | Write B, 1 |
Read B | Read A |
In a sequentially consistent system, what combinations of values are possible for `A` and `B`? There are three unique scenarios to consider. Let’s take a look at the globally visible order of instructions for each.
Case 1: A = 1, B = 1
Consider the following interleaving of memory instructions (initially `A = 0` and `B = 0`):
Instruction | Issuing Thread | A, B |
---|---|---|
Write A, 1 | Thread 1 | 1, 0 |
Write B, 1 | Thread 2 | 1, 1 |
Read B | Thread 1 | 1, 1 |
Read A | Thread 2 | 1, 1 |
In this case, the writes to `A` and `B` occur before the reads of `A` and `B`. This leads to the result of `A = 1` and `B = 1`.
This same result could be achieved if the order of the writes is reversed, or the order of reads is reversed.
Case 2: A = 0, B = 1
Consider the following interleaving of memory instructions (initially `A = 0` and `B = 0`):
Instruction | Issuing Thread | A, B |
---|---|---|
Write B, 1 | Thread 2 | 0, 1 |
Read A | Thread 2 | 0, 1 |
Write A, 1 | Thread 1 | 1, 1 |
Read B | Thread 1 | 1, 1 |
In this case, Thread 2 executes its write and read before Thread 1 does either. This leads to the result of `A = 0` and `B = 1`.
Case 3: A = 1, B = 0
Consider the following interleaving of memory instructions (initially `A = 0` and `B = 0`):
Instruction | Issuing Thread | A, B |
---|---|---|
Write A, 1 | Thread 1 | 1, 0 |
Read B | Thread 1 | 1, 0 |
Write B, 1 | Thread 2 | 1, 1 |
Read A | Thread 2 | 1, 1 |
In this case, Thread 1 executes its write and read before Thread 2 does either. This leads to the result of `A = 1` and `B = 0` (the opposite of Case 2).
Takeaways - Sequential Consistency
One thing to notice from our sequential consistency test cases is that the instructions from each thread always become visible in program order (e.g., Thread 1 always executes `Write A, 1` before `Read B`, regardless of how instructions from Thread 2 are interleaved in between).
While reasoning about interleavings of instructions across threads can be difficult, reasoning about activity within a single thread is simple.
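To make this concrete in code, here is a minimal litmus-test sketch of my own (an illustration, not something from this post's benchmarks) using C++ `std::atomic`, whose default ordering, `std::memory_order_seq_cst`, requests sequentially consistent semantics. Under sequential consistency, only Cases 1-3 are possible, so both threads can never read `0`:

```cpp
// Minimal sketch: the four-instruction example above, written with
// sequentially consistent C++ atomics. With seq_cst, r1 == 0 && r2 == 0
// can never be observed (Cases 1-3 are the only possible outcomes).
#include <atomic>
#include <cassert>
#include <thread>

int main() {
  std::atomic<int> A{0}, B{0};
  int r1 = 0, r2 = 0;

  std::thread t1([&] {
    A.store(1);    // Write A, 1
    r1 = B.load(); // Read B
  });
  std::thread t2([&] {
    B.store(1);    // Write B, 1
    r2 = A.load(); // Read A
  });
  t1.join();
  t2.join();

  // Sequential consistency forbids both reads returning 0
  assert(!(r1 == 0 && r2 == 0));
  return 0;
}
```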
Relaxed Memory Consistency Models (x86)
Computer architecture is all about tradeoffs. While sequential consistency is simple and intuitive, it places restrictions on execution that limit performance. These restrictions include the following rules about memory reordering:
Reordering | Description |
---|---|
Write -> Write | Writes are not reordered with older writes |
Write -> Read | Reads are not reordered with older writes |
Read -> Read | Reads are not reordered with older reads |
Read -> Write | Writes are not reordered with older reads |
By relaxing these restrictions, we can increase performance at the cost of increased hardware and programming-model complexity. In this post, we’re going to focus on relaxation in the Intel x86 memory model. This is described in detail in section 8.2 of the Intel Software Developer Manual.
Basics of the x86 Memory Model
The x86 architecture follows a consistency model called Processor Consistency (referred to in the developer manual as Processor Ordering). This model is very similar to sequential consistency, but with the following (major) exception:
- “Reads and writes always appear in programmed order at the system bus—except for the following situation where processor ordering is exhibited. Read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cache hits and, therefore, are not directed to the same address being accessed by the read miss.”
Processor consistency relaxes our Write -> Read ordering constraint, allowing writes to be reordered with subsequent reads. This introduces a fourth possible scenario for interleaved instructions for our two threads.
Case 4: A = 0, B = 0
Consider the following interleaving of memory instructions (initially `A = 0` and `B = 0`):
Instruction | Issuing Thread | A, B |
---|---|---|
Read A | Thread 2 | 0, 0 |
Read B | Thread 1 | 0, 0 |
Write A, 1 | Thread 1 | 1, 0 |
Write B, 1 | Thread 2 | 1, 1 |
In this case, Thread 1 and Thread 2 both perform their reads before performing their writes. This leads to the result of `A = 0` and `B = 0` (a result that is impossible under sequential consistency). The same order could be achieved by swapping the order of the reads, or swapping the order of the writes.
Takeaways - Processor Consistency
Processor consistency allows for instructions to be reordered within a single thread (e.g., Thread 1’s read of `B` can complete before its write to `A`). It’s easy to see how this complicates the programming model, and makes things less intuitive for programmers.
Even if a programmer orders operations correctly in their high-level language (e.g., C++) and ensures the operations are ordered correctly in the compiler-generated assembly, the hardware may still reorder a read with an older write.
Jeff Preshing’s blog post Memory Reordering Caught in the Act is a great read that includes a short benchmark showing this exact example of hardware Write -> Read reordering.
You can find my C++ implementation of the benchmark from that post here. This can be compiled with:

```bash
g++ hw_barrier.cpp -o hw_barrier -O2 -lpthread
```
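To get a feel for what such a benchmark looks like, here is a minimal sketch of my own (an illustration under my own assumptions, not the linked implementation): two threads each perform a relaxed store followed by a relaxed load, and we count the trials in which both loads return `0`. Spawning fresh threads per trial keeps the sketch simple but makes the reordering window small (so observed counts may be low); Preshing’s version uses semaphores to line the threads up on every trial.

```cpp
// A minimal Store->Load reordering detector (illustrative sketch only).
// Relaxed atomics avoid C++ data-race UB while still permitting the
// hardware reordering we want to observe on x86.
#include <atomic>
#include <iostream>
#include <thread>

int main() {
  int reorders = 0;
  constexpr int kTrials = 100000;

  for (int i = 0; i < kTrials; i++) {
    std::atomic<int> X{0}, Y{0};
    int r1 = 0, r2 = 0;

    std::thread t0([&] {
      X.store(1, std::memory_order_relaxed);  // Write X
      r1 = Y.load(std::memory_order_relaxed); // Read Y
    });
    std::thread t1([&] {
      Y.store(1, std::memory_order_relaxed);  // Write Y
      r2 = X.load(std::memory_order_relaxed); // Read X
    });
    t0.join();
    t1.join();

    // Both loads returning 0 is impossible under sequential consistency
    if (r1 == 0 && r2 == 0) reorders++;
  }
  std::cout << "Store->Load reorders observed: " << reorders << '\n';
  return 0;
}
```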
Peterson’s Algorithm for Mutual Exclusion
To understand how hardware memory reordering can cause real-world bugs, let’s take a look at a C++ implementation of Peterson’s algorithm for two threads. Peterson’s algorithm is an algorithm for mutual exclusion where threads communicate through shared memory.
We’ll start by looking at:
- What state is required for our implementation
- How a thread gets access to a critical section
- How a thread notifies it’s done with a critical section
After this, we’ll run a short benchmark to show how x86 memory reordering introduces a bug in our direct implementation of the algorithm, and how we can fix this with a dedicated memory barrier instruction.
You can find the full benchmark here.
Required State
Peterson’s algorithm fundamentally requires two pieces of state:

- `interested` - An array of elements (equal to the number of threads) indicating each thread’s interest in entering the critical section
- `turn` - An element indicating which thread currently has priority to enter the critical section

We can implement these as data members of a class called `Peterson`.
```cpp
// Simple class implementing Peterson's algorithm for mutual exclusion
class Peterson {
 private:
  // Is this thread interested in the critical section?
  volatile int interested[2] = {0, 0};
  // Whose turn is it?
  volatile int turn = 0;
};
```
Note, we have both of our variables marked as `volatile` to keep them from being cached in registers (we need the accesses to these variables to go to memory so that they are visible to both threads).
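As an aside, `volatile` only constrains the compiler; it says nothing about hardware ordering (which is exactly the gap we will run into later). A hypothetical alternative (my sketch, not what this post benchmarks) would express the same state with C++11 atomics:

```cpp
// Hypothetical alternative (not the post's code): C++11 atomics express
// the same state portably. With the default memory_order_seq_cst, the
// compiler would also emit the hardware fence we end up adding by hand.
#include <atomic>

class PetersonAtomic {
 private:
  // Is this thread interested in the critical section?
  std::atomic<int> interested[2] = {0, 0};
  // Whose turn is it?
  std::atomic<int> turn{0};
};
```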
Gaining Access to the Critical Section
The next thing we need is a method where a thread will wait for access to the critical section. We’ll call this method `lock`. Here is how I implemented mine:
```cpp
// Method for locking w/ Peterson's algorithm
void lock(int tid) {
  // Mark that this thread wants to enter the critical section
  interested[tid] = 1;
  // Assume the other thread has priority
  int other = 1 - tid;
  turn = other;
  // Wait until the other thread finishes or is not interested
  while (turn == other && interested[other])
    ;
}
```
The first thing the calling thread does is index into our `interested` array using its thread ID (assumed to be either `0` or `1`), and write `1` to say that it is interested in entering the critical section.

Next, the thread yields priority to the other thread by setting `turn` equal to the other thread’s ID.

Finally, the thread enters a `while` loop, waiting for either the other thread to yield priority, or to no longer be interested in entering the critical section.
Exiting the Critical Section
The final thing we need is a method to notify that a thread is done with the critical section. We can call this method `unlock`. Here is how I implemented mine:
```cpp
// Method for unlocking w/ Peterson's algorithm
void unlock(int tid) {
  // Mark that this thread is no longer interested
  interested[tid] = 0;
}
```
Notifying that a thread is done with the critical section is simple. The exiting thread just needs to say it is no longer interested in the critical section by writing `0` to `interested[tid]`.
Our Benchmark
To test our implementation of Peterson’s algorithm, we’ll be using it to prevent simultaneous increments to a non-atomic integer. Here is the function called `work` that we’ll have two threads run:
```cpp
// Work function
void work(Peterson &p, int &val, int tid) {
  for (int i = 0; i < 1e8; i++) {
    // Lock using Peterson's algorithm
    p.lock(tid);
    // Critical section
    val++;
    // Unlock using Peterson's algorithm
    p.unlock(tid);
  }
}
```
Here, we’re using Peterson’s algorithm to protect the increment of our shared value `val`.
The `main` function that drives our benchmark is simple:
```cpp
int main() {
  // Shared value
  int val = 0;
  Peterson p;
  // Create threads
  std::thread t0([&] { work(p, val, 0); });
  std::thread t1([&] { work(p, val, 1); });
  // Wait for the threads to finish
  t0.join();
  t1.join();
  // Print the result
  std::cout << "FINAL VALUE IS: " << val << '\n';
  return 0;
}
```
We create our shared value `val` and `Peterson` object, spawn two threads (with thread IDs `0` and `1` respectively), wait for them to finish running the `work` function, then print out the final result.
Compiling and Running the Benchmark
We can compile our benchmark `peterson.cpp` using the following command:

```bash
g++ peterson.cpp -o peterson -O2 -lpthread
```
What we’d expect to be printed at the end of execution (if our implementation of Peterson’s algorithm gives us mutual exclusion) is 200 million (from 100 million increments performed by each thread). If our implementation does not give us mutual exclusion, the final value may be less than 200 million because of overlapping increments from our two threads.
Here are the results from a few runs of the benchmark:
```
FINAL VALUE IS: 199999998
FINAL VALUE IS: 199999991
FINAL VALUE IS: 199999993
FINAL VALUE IS: 199999992
```
Definitely a bug! `1e8 + 1e8` does not equal `199999998` (or any of our other values)! So what happened?
We can first look at our assembly to see if the order of our instructions is the problem. Here is the output from perf:
```
 0.00 │18:┌─→movl $0x1,0x4(%rax)
 0.20 │   │  movl $0x0,0x8(%rax)
      │   │↓ jmp  36
      │   │  nop
25.73 │30:│  mov  (%rax),%edx
34.10 │   │  test %edx,%edx
 0.00 │   │↓ je   3d
29.10 │36:│  mov  0x8(%rax),%edx
 0.22 │   │  test %edx,%edx
 3.79 │   │↑ je   30
 6.86 │3d:│  incl (%rsi)
      │   │  movl $0x0,0x4(%rax)
      │   ├──dec  %ecx
      │   └──jne  18
```
Nothing stands out as immediately incorrect. First, our threads say they are interested in entering the critical section and yield priority to each other using the `movl` instructions starting at label `18:`.

Next, our threads wait inside of the `while` loop starting at label `30:`, where they evaluate the two conditions (at labels `30:` and `36:`).

If one of these conditions becomes false, a thread does the increment of the shared value (`incl`), stores `0` to say it is no longer interested in the critical section, and continues executing the `for` loop in our `work` function.
Since the ordering of these instructions makes sense, something must be getting reordered in hardware! Let’s think about where that could be happening in our high-level C++. As a reminder, here’s our `lock` method:
```cpp
// Method for locking w/ Peterson's algorithm
void lock(int tid) {
  // Mark that this thread wants to enter the critical section
  interested[tid] = 1;
  // Assume the other thread has priority
  int other = 1 - tid;
  turn = other;
  // Wait until the other thread finishes or is not interested
  while (turn == other && interested[other])
    ;
}
```
We know that Processor Consistency says writes may be reordered with subsequent reads, so we will focus on those relationships. Specifically, we will focus on the conditions being checked in our `while` loop, because that is where our threads wait for permission to enter the critical section.
First, let’s consider `turn`. Each thread yields priority by setting `turn = other`, then waiting on `turn == other` in the `while` loop. If the read of `turn` for the comparison is reordered before the write, the thread could break out of the `while` loop and enter the critical section prematurely. Is this possible?
No! This write and read cannot be reordered because they access the same location! The developer manual specifically says this is not allowed:
- “Reads may be reordered with older writes to different locations but not with older writes to the same location.”
Let’s consider the write-read pair of `interested[tid]` and `interested[other]`. Our thread first says it’s interested in entering the critical section by writing `1` to `interested[tid]`. In the `while` loop, it then waits for the other thread to no longer be interested in the critical section by checking whether `interested[other]` is still `1`.
If the two reads (one by each thread) of `interested[other]` are reordered before the two writes (one by each thread) to `interested[tid]`, each thread would think that the other is not interested in the critical section, break out of the `while` loop, and enter the critical section. Is this possible?
Yes! `interested[tid]` and `interested[other]` are two different memory locations, meaning that this write-read pair can be reordered by the hardware!
Now that we understand where our bug is coming from, we can ask the million dollar question. How do we prevent this reordering?
Hardware Memory Barriers
To prevent the reordering of memory operations, x86 provides memory barriers in the form of fence instructions that we can insert directly using intrinsics. These fences include:
- `_mm_sfence` - “Perform a serializing operation on all store-to-memory instructions that were issued prior to this instruction. Guarantees that every store instruction that precedes, in program order, is globally visible before any store instruction which follows the fence in program order.”
- `_mm_lfence` - “Perform a serializing operation on all load-from-memory instructions that were issued prior to this instruction. Guarantees that every load instruction that precedes, in program order, is globally visible before any load instruction which follows the fence in program order.”
- `_mm_mfence` - “Perform a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to this instruction. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.”
Now the question is, which one do we need in our program?
`_mm_sfence` only deals with the ordering of store instructions, and `_mm_lfence` only deals with the ordering of load instructions, so neither of these (at least on their own) is what we’re looking for. However, `_mm_mfence` looks like a perfect fit!
`_mm_mfence` gives us the guarantee that prior stores (like the write to `interested[tid]`) will have become globally visible before any subsequent memory operation (like the read of `interested[other]`). The intrinsic is provided by the `<emmintrin.h>` header. Let’s add it between our write of `interested[tid]` and our read of `interested[other]`:
```cpp
// Method for locking w/ Peterson's algorithm
void lock(int tid) {
  // Mark that this thread wants to enter the critical section
  interested[tid] = 1;
  // Assume the other thread has priority
  int other = 1 - tid;
  turn = other;
  // HW memory barrier
  _mm_mfence();
  // Wait until the other thread finishes or is not interested
  while (turn == other && interested[other])
    ;
}
```
If we recompile and rerun our application a few times, we get the following results:
```
FINAL VALUE IS: 200000000
FINAL VALUE IS: 200000000
FINAL VALUE IS: 200000000
```
Finally! The correct answer! Let’s take a look at the assembly:
```
      │18:┌─→movl $0x1,(%rax)
 0.16 │   │  movl $0x1,0x8(%rax)
20.13 │   │  mfence
 1.15 │   │↓ jmp  37
      │   │  nop
19.13 │30:│  mov  0x4(%rax),%ecx
 3.12 │   │  test %ecx,%ecx
      │   │↓ je   3f
 1.55 │37:│  mov  0x8(%rax),%ecx
 0.13 │   │  cmp  $0x1,%ecx
33.72 │   │↑ je   30
20.90 │3f:│  incl (%rsi)
 0.01 │   │  movl $0x0,(%rax)
      │   ├──dec  %edx
      │   └──jne  18
```
The only major difference is that we now have an `mfence` instruction after our stores to `interested[tid]` and `turn`.
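As a side note, if we wanted to stay in portable C++ rather than reaching for an x86 intrinsic, a sequentially consistent fence expresses the same intent. Below is a sketch under my own assumptions (for fully standards-compliant code the shared variables should also be atomics rather than `volatile` ints), where `std::atomic_thread_fence(std::memory_order_seq_cst)` typically lowers to an `mfence` (or an equivalent locked instruction) on x86:

```cpp
// Sketch: a portable full fence in place of the _mm_mfence intrinsic
#include <atomic>

// Method for locking w/ Peterson's algorithm
void lock(int tid) {
  // Mark that this thread wants to enter the critical section
  interested[tid] = 1;
  // Assume the other thread has priority
  int other = 1 - tid;
  turn = other;
  // Full fence: prior stores become globally visible before later loads
  std::atomic_thread_fence(std::memory_order_seq_cst);
  // Wait until the other thread finishes or is not interested
  while (turn == other && interested[other])
    ;
}
```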
Does Hardware Reordering Require OoO Execution?
It’s natural to think that hardware memory reordering requires out-of-order (OoO) execution. However, this is not the case. Hardware memory reordering can happen even in an in-order processor! To understand this, let’s take a look at the architecture block diagram for Intel’s Kaby Lake processors (which perform OoO execution) from WikiChip.
In the Memory Subsystem, you can see a block labeled “Store Buffer & Forwarding”. A store buffer is there to do exactly what it sounds like: buffer stores. Unlike loads, stores are not typically on the critical path (instructions often stall waiting on data to be loaded, not on data being written). Instead of making stores immediately visible, they can first be buffered in a store buffer, and written back gradually as bandwidth becomes available.
In an in-order processor, a thread could first perform a store that goes into the store buffer. At this point, the store is the memory subsystem’s problem (to the core, that operation is complete). The thread could then issue a read which completes before the previous store has drained out of the store buffer (and become globally visible)! No OoO execution required!
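To make the store-buffer intuition concrete, here is a toy, single-threaded model of my own (the `Core`, `store`, `load`, and `drain` names are all hypothetical, purely for illustration). Each “core” completes its store by buffering it, then performs a load that reads memory before the other core’s buffered store has drained:

```cpp
// A toy model of an in-order core with a one-entry store buffer.
// The store is "complete" (from the core's view) once buffered; a later
// load to a *different* address reads memory before the buffer drains,
// which is exactly Store->Load reordering.
#include <cstdio>
#include <optional>
#include <utility>

struct Core {
  std::optional<std::pair<int*, int>> store_buffer;

  // Buffer the store; it is not yet globally visible
  void store(int* addr, int val) { store_buffer.emplace(addr, val); }
  // Loads read memory directly (a real core would forward from its own
  // buffer for same-address loads; this toy ignores that case)
  int load(int* addr) { return *addr; }
  // Drain the buffered store to memory (now globally visible)
  void drain() {
    if (store_buffer) {
      *store_buffer->first = store_buffer->second;
      store_buffer.reset();
    }
  }
};

int main() {
  int A = 0, B = 0;
  Core core1, core2;

  core1.store(&A, 1);      // Thread 1: Write A, 1 (sits in buffer)
  core2.store(&B, 1);      // Thread 2: Write B, 1 (sits in buffer)
  int r1 = core1.load(&B); // Thread 1: Read B -> 0 (B not drained yet)
  int r2 = core2.load(&A); // Thread 2: Read A -> 0 (A not drained yet)
  core1.drain();
  core2.drain();

  printf("r1 = %d, r2 = %d\n", r1, r2); // r1 = 0, r2 = 0
  return 0;
}
```

Even though each core executed its own instructions strictly in order, the globally visible order has both loads ahead of both stores (Case 4 again).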
How Do Fence Instructions Work?
It’s natural to be inquisitive about how the x86 fence instructions work at the hardware level. Below are some quick thoughts on how they could function:

- `_mm_sfence` - To ensure all previous stores have become globally visible, the hardware could stall all later stores (in program order) until all earlier stores have retired and the store buffer has drained, before the `sfence` instruction itself retires.
- `_mm_lfence` - Similar to `_mm_sfence`, the architecture could stall later loads (in program order) until all earlier loads have completed, before the `lfence` instruction itself retires.
- `_mm_mfence` - Naturally, the `mfence` instruction is a combination of the previous two. The architecture could stall all later loads and stores (in program order) until all earlier loads and stores have retired and the store buffer has drained, before the `mfence` instruction itself retires.

There are of course exceptions and nuances not covered by these descriptions, but I’ll leave the patent reading as an exercise for the reader.
Final Thoughts
Memory consistency can be difficult to understand. Even with the relatively strict consistency model of x86 (compared to the weaker models used by ARM and POWER), bugs related to hardware memory reordering often go against our intuition, and can require some very careful thought to work around.
Thanks for reading,
–Nick