Compiler Optimization of a Clamp Function
Modern processors are incredibly complex, and writing functionally correct code for even a moderately complex application can be a painful and tedious endeavor. Luckily for us, we have compilers that allow us to write code in high-level languages like C++ and generate assembly that is both functionally correct and highly optimized. In this blog post, we’ll look at a simple clamp benchmark compiled at different optimization levels and with different optimization flags. We’ll then examine both the performance and the generated assembly from each experiment, and try to understand how things change based on the compiler options.
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com
Misc. Links
For more detail about the GCC optimization flags, check out this link.
The Clamp Function
Before we start looking at optimizations, we need to understand the function we’re optimizing. We’ll be studying a clamp function in this blog post. A clamp function simply clamps a given input number to a range of values. In this case, we’ll be looking at a function that clamps an integer to a maximum value (as opposed to a clamp function that clamps to both an upper and a lower bound).
Here’s an example implementation in C++:
for (int i = 0; i < N; i++)
  v_out[i] = (v_in[i] > ceiling) ? ceiling : v_in[i];
For each element in an array/vector, we compare the input to the ceiling. If the input is greater than the ceiling, we store ceiling in the output array/vector. Otherwise, we store the input value.
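Equivalently, clamping to a ceiling is just taking a minimum. Here is a minimal sketch using std::min (the function name clamp_to_ceiling is mine, purely for illustration); as we’ll see later, this "minimum" view is exactly what the compiler exploits when it eventually vectorizes the loop:
#include <algorithm>  // std::min

// Illustrative only: clamping to a ceiling is the same as taking the
// minimum of the input and the ceiling.
void clamp_to_ceiling(const int *v_in, int *v_out, int N, int ceiling) {
  for (int i = 0; i < N; i++)
    v_out[i] = std::min(v_in[i], ceiling);
}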
Clamp at -O0 Optimization
Let’s start our journey by looking at some un-optimized code. When we compile using gcc without any optimization flags, it defaults to the -O0 optimization level, which generates unoptimized code. Let’s compare the results of a few different C++ implementations of our clamp function.
All of these benchmarks were compiled using the following command:
g++ clamp_bench.cpp -lbenchmark -lpthread -O0 -o clamp
Clamp with Vectors
This implementation of our clamp function uses std::vector
containers and a for-loop.
// Benchmark for a clamp function
static void clamp_bench(benchmark::State &s) {
  // Number of elements in the vector
  auto N = 1 << s.range(0);
  // Create our random number generators
  std::mt19937 rng;
  rng.seed(std::random_device()());
  std::uniform_int_distribution<int> dist(0, 1024);
  // Create a vector of random integers
  std::vector<int> v_in(N);
  std::vector<int> v_out(N);
  std::generate(begin(v_in), end(v_in), [&]() { return dist(rng); });
  // Main benchmark loop
  for (auto _ : s) {
    for (int i = 0; i < N; i++) {
      v_out[i] = (v_in[i] > 512) ? 512 : v_in[i];
    }
  }
}
BENCHMARK(clamp_bench)->DenseRange(8, 10);
And here is the output assembly:
1.77 │306:┌─→mov -0x27e0(%rbp),%eax
0.30 │ │ cmp -0x27dc(%rbp),%eax
│ │↓ jge 382
2.39 │ │ mov -0x27e0(%rbp),%eax
1.96 │ │ movslq %eax,%rdx
1.92 │ │ lea -0x2780(%rbp),%rax
0.33 │ │ mov %rdx,%rsi
1.72 │ │ mov %rax,%rdi
3.57 │ │→ callq std::vector<int, std::allocator<int> >::operator[]
13.36 │ │ mov (%rax),%eax
2.45 │ │ cmp $0x200,%eax
2.59 │ │↓ jg 357
13.52 │ │ mov -0x27e0(%rbp),%eax
0.62 │ │ movslq %eax,%rdx
0.57 │ │ lea -0x2780(%rbp),%rax
1.98 │ │ mov %rdx,%rsi
0.06 │ │ mov %rax,%rdi
2.08 │ │→ callq std::vector<int, std::allocator<int> >::operator[]
4.12 │ │ mov (%rax),%ebx
2.43 │ │↓ jmp 35c
13.62 │357:│ mov $0x200,%ebx
2.99 │35c:│ mov -0x27e0(%rbp),%eax
1.13 │ │ movslq %eax,%rdx
3.25 │ │ lea -0x2760(%rbp),%rax
2.42 │ │ mov %rdx,%rsi
│ │ mov %rax,%rdi
4.84 │ │→ callq std::vector<int, std::allocator<int> >::operator[]
9.64 │ │ mov %ebx,(%rax)
2.63 │ │ addl $0x1,-0x27e0(%rbp)
1.58 │ └──jmp 306
This seems like a lot of code for just a clamp. However, we must remember that this code is largely unoptimized. Let’s break it down into some key components.
We see three calls to ::operator[] in our assembly. Remember, a vector is really just a class that implements the [] operator, and operators act like any other method or function at the assembly level. The first call to the [] operator reads our input value from v_in for the comparison against the ceiling. The second call re-reads v_in[i] when the value does not need to be clamped (the ternary evaluates v_in[i] again on that path). The third call returns the address in v_out where we store either 512 or the input value. Our clamp itself is implemented with a simple compare against 512 (cmp $0x200,%eax) and a jg instruction that determines whether we store 512 or the value from v_in.
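Conceptually, each iteration of this unoptimized loop boils down to something like the following sketch (not literal source, just what the emitted calls correspond to):
// Per-iteration sketch of the -O0 loop: every subscript is a real call to
// std::vector::operator[], because nothing is inlined at this level.
int tmp;
if (v_in.operator[](i) > 512)    // call #1: read v_in[i] for the comparison
  tmp = 512;                     // clamp path (mov $0x200,%ebx)
else
  tmp = v_in.operator[](i);      // call #2: re-read v_in[i]
v_out.operator[](i) = tmp;       // call #3: get the address of v_out[i], then store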
Now let’s look at the performance:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench/8 1361 ns 1361 ns 495944
clamp_bench/9 3364 ns 3364 ns 217505
clamp_bench/10 7437 ns 7437 ns 91160
This looks fairly slow, but we won’t know for certain until we look at our other implementations.
Clamp with Raw Pointers
We seem to have a lot of extra overhead from using C++ STL containers (our calls to operator[] and size for our vectors). Let’s run another experiment where we swap out our std::vector containers for raw pointers.
Here is what that looks like:
// Benchmark for a clamp function
// Uses raw pointers to avoid overhead in unoptimized code
static void clamp_bench_raw_ptr(benchmark::State &s) {
  // Number of elements in the arrays
  auto N = 1 << s.range(0);
  // Create our random number generators
  std::mt19937 rng;
  rng.seed(std::random_device()());
  std::uniform_int_distribution<int> dist(0, 1024);
  // Create arrays of random integers
  int *v_in = new int[N]();
  int *v_out = new int[N]();
  std::generate(v_in, v_in + N, [&]() { return dist(rng); });
  // Main benchmark loop
  for (auto _ : s) {
    for (int i = 0; i < N; i++) {
      v_out[i] = (v_in[i] > 512) ? 512 : v_in[i];
    }
  }
  delete[] v_in;
  delete[] v_out;
}
BENCHMARK(clamp_bench_raw_ptr)->DenseRange(8, 10);
And here is the output assembly:
14.20 │32e:┌─→mov -0x27b0(%rbp),%eax
0.57 │ │ cmp -0x27ac(%rbp),%eax
│ │↓ jge 38b
0.22 │ │ mov -0x27b0(%rbp),%eax
0.34 │ │ cltq
12.82 │ │ lea 0x0(,%rax,4),%rdx
0.54 │ │ mov -0x27a8(%rbp),%rax
0.42 │ │ add %rdx,%rax
21.44 │ │ mov (%rax),%eax
6.38 │ │ mov -0x27b0(%rbp),%edx
│ │ movslq %edx,%rdx
│ │ lea 0x0(,%rdx,4),%rcx
6.65 │ │ mov -0x27a0(%rbp),%rdx
7.04 │ │ add %rcx,%rdx
│ │ mov $0x200,%ecx
│ │ cmp $0x200,%eax
6.23 │ │ cmovg %ecx,%eax
21.78 │ │ mov %eax,(%rdx)
0.67 │ │ addl $0x1,-0x27b0(%rbp)
│ └──jmp 32e
Still unoptimized, but a lot cleaner than our std::vector implementation. Furthermore, you can see we no longer have a branch for our clamp. It has been replaced by a cmovg, which conditionally moves a value based on the result of the previous compare (cmp), in this case when the input is greater than 512.
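In pseudo-C++, the cmp/cmovg pair behaves roughly like this (a sketch of the instruction semantics, not literal source):
// Sketch of the cmp/cmovg sequence: the "if" is resolved as a conditional
// register move rather than a branch the CPU has to predict.
int result = v_in[i];    // mov (%rax),%eax
int ceiling = 512;       // mov $0x200,%ecx
if (result > 512)        // cmp $0x200,%eax
  result = ceiling;      // cmovg %ecx,%eax
v_out[i] = result;       // mov %eax,(%rdx)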
In previous blog posts, we’ve looked at how branchless alternatives are often beneficial because of the large price of branch misprediction. Let’s run our benchmark to see how much performance improved by getting rid of the branch and the std::vector operator overhead.
Here are the performance results:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_raw_ptr/8 408 ns 408 ns 1715523
clamp_bench_raw_ptr/9 803 ns 803 ns 872497
clamp_bench_raw_ptr/10 1592 ns 1592 ns 439707
Much faster (by ~4x)! Let’s keep track of how this changes as we increase the optimization levels.
Clamp with Vectors and std::transform
Another way we can implement our clamp function is using the STL algorithm std::transform
. Here’s how that looks like using std::vector
:
// Benchmark for a clamp function
static void clamp_bench_lambda(benchmark::State &s) {
  // Number of elements in the vector
  auto N = 1 << s.range(0);
  // Create our random number generators
  std::mt19937 rng;
  rng.seed(std::random_device()());
  std::uniform_int_distribution<int> dist(0, 1024);
  // Create a vector of random integers
  std::vector<int> v_in(N);
  std::vector<int> v_out(N);
  std::generate(begin(v_in), end(v_in), [&]() { return dist(rng); });
  // Our clamp function
  auto clamp = [](int in) { return (in > 512) ? 512 : in; };
  // Main benchmark loop
  for (auto _ : s) {
    std::transform(begin(v_in), end(v_in), begin(v_out), clamp);
  }
}
BENCHMARK(clamp_bench_lambda)->DenseRange(8, 10);
This is functionally the same as our previous two implementations, but now we’re relying more heavily on the STL.
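Under the hood, the unary form of std::transform is essentially the same hand-rolled loop expressed through iterators. Here is a simplified sketch of what an implementation does (not libstdc++’s exact source):
// Simplified sketch of a unary std::transform: walk the input range,
// apply the operation, and write through the output iterator.
template <class InIt, class OutIt, class UnaryOp>
OutIt transform_sketch(InIt first, InIt last, OutIt out, UnaryOp op) {
  for (; first != last; ++first, ++out)
    *out = op(*first);
  return out;
}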
Let’s take a look at the assembly:
│ 000000000000b622 <__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > std::transform<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, clamp_bench_lambda(benchmark::State&)::{lambda(int)#2}>(__gnu_cxx::__normal_it
│ _ZSt9transformIN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEES6_ZL18clamp_bench_lambdaRN9benchmark5StateEEUliE0_ET0_T_SC_SB_T1_():
│ endbr64
│ push %rbp
│ mov %rsp,%rbp
│ push %r12
│ push %rbx
│ sub $0x20,%rsp
│ mov %rdi,-0x18(%rbp)
│ mov %rsi,-0x20(%rbp)
│ mov %rdx,-0x28(%rbp)
│ lea -0x20(%rbp),%rdx
│ lea -0x18(%rbp),%rax
│ mov %rdx,%rsi
6.68 │ mov %rax,%rdi
0.03 │ → callq __gnu_cxx::operator!=<int*, std::vector<int, std::allocator<int> > >
0.03 │ test %al,%al
│ → je b69d <__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > std::transform<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> >
│ lea -0x18(%rbp),%rax
1.91 │ mov %rax,%rdi
4.18 │ → callq __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator*
33.23 │ mov (%rax),%r12d
0.13 │ lea -0x28(%rbp),%rax
│ mov %rax,%rdi
7.57 │ → callq __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator*
│ mov %rax,%rbx
1.64 │ lea -0x29(%rbp),%rax
5.18 │ mov %r12d,%esi
0.03 │ mov %rax,%rdi
2.12 │ → callq clamp_bench_lambda(benchmark::State&)::{lambda(int)#2}::operator()
15.85 │ mov %eax,(%rbx)
0.03 │ lea -0x18(%rbp),%rax
0.03 │ mov %rax,%rdi
7.51 │ → callq __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator++
3.07 │ lea -0x28(%rbp),%rax
0.13 │ mov %rax,%rdi
4.21 │ → callq __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator++
6.03 │ → jmp b63d <__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > std::transform<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> >
0.36 │ mov -0x28(%rbp),%rax
│ add $0x20,%rsp
0.03 │ pop %rbx
│ pop %r12
│ pop %rbp
│ ← retq
Let’s parse what’s going on here. On each iteration of the loop, we access our input and output elements through our vector iterators with ::operator*. Our input is then clamped with a call to our lambda. After we store the result, we advance to the next elements using the iterators’ ::operator++. Let’s look at the code generated for our clamp lambda in greater detail:
│ 000000000000ac00 <clamp_bench_lambda(benchmark::State&)::{lambda(int)#2}::operator()(int) const>:
│ _ZZL18clamp_bench_lambdaRN9benchmark5StateEENKUliE0_clEi():
29.45 │ push %rbp
0.71 │ mov %rsp,%rbp
10.84 │ mov %rdi,-0x8(%rbp)
24.26 │ mov %esi,-0xc(%rbp)
0.18 │ mov $0x200,%eax
0.72 │ cmpl $0x200,-0xc(%rbp)
25.04 │ cmovle -0xc(%rbp),%eax
8.43 │ pop %rbp
0.37 │ ← retq
Very similar to our raw pointer implementation. Instead of a branch, our compiler again opted to use a conditional move (cmov) instruction (this time using a less-than-or-equal comparison).
Now let’s check out the performance:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_lambda/8 2883 ns 2853 ns 244106
clamp_bench_lambda/9 5749 ns 5692 ns 122683
clamp_bench_lambda/10 11894 ns 11372 ns 62469
Our worst performance yet! Despite using a cmov instruction, we still have too much overhead from our unoptimized C++ operators.
Clamp with Raw Pointers and std::transform
A final thing we can try is using raw pointers and std::transform. Here’s my implementation:
// Benchmark for a clamp function
// Uses raw pointers to avoid overhead in unoptimized code
static void clamp_bench_raw_ptr_lambda(benchmark::State &s) {
  // Number of elements in the arrays
  auto N = 1 << s.range(0);
  // Create our random number generators
  std::mt19937 rng;
  rng.seed(std::random_device()());
  std::uniform_int_distribution<int> dist(0, 1024);
  // Create arrays of random integers
  int *v_in = new int[N]();
  int *v_out = new int[N]();
  std::generate(v_in, v_in + N, [&]() { return dist(rng); });
  // Our clamp function
  auto clamp = [](int in) { return (in > 512) ? 512 : in; };
  // Main benchmark loop
  for (auto _ : s) {
    std::transform(v_in, v_in + N, v_out, clamp);
  }
  delete[] v_in;
  delete[] v_out;
}
BENCHMARK(clamp_bench_raw_ptr_lambda)->DenseRange(8, 10);
Let’s see how our assembly changed using raw pointers:
│ 000000000000b6ec <int* std::transform<int*, int*, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}>(int*, int*, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2})>:
│ _ZSt9transformIPiS0_ZL26clamp_bench_raw_ptr_lambdaRN9benchmark5StateEEUliE0_ET0_T_S6_S5_T1_():
0.05 │ endbr64
│ push %rbp
│ mov %rsp,%rbp
0.03 │ sub $0x20,%rsp
│ mov %rdi,-0x8(%rbp)
│ mov %rsi,-0x10(%rbp)
│ mov %rdx,-0x18(%rbp)
3.28 │ mov -0x8(%rbp),%rax
0.67 │ cmp -0x10(%rbp),%rax
0.03 │ → je b734 <int* std::transform<int*, int*, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}>(int*, int*, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2})+0x48>
14.56 │ mov -0x8(%rbp),%rax
15.91 │ mov (%rax),%edx
1.81 │ lea -0x19(%rbp),%rax
0.08 │ mov %edx,%esi
8.53 │ mov %rax,%rdi
8.82 │ → callq clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}::operator()
0.21 │ mov -0x18(%rbp),%rdx
28.54 │ mov %eax,(%rdx)
2.83 │ addq $0x4,-0x8(%rbp)
14.13 │ addq $0x4,-0x18(%rbp)
│ → jmp b704 <int* std::transform<int*, int*, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}>(int*, int*, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}, clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2})+0x18>
0.42 │ mov -0x18(%rbp),%rax
0.03 │ leaveq
0.08 │ ← retq
Much cleaner than our implementation using vectors. Now, our std::transform accesses memory directly rather than calling the iterator dereference operator. However, we still see a call to our clamp lambda. Here is how our clamp was implemented:
│ 000000000000b09a <clamp_bench_raw_ptr_lambda(benchmark::State&)::{lambda(int)#2}::operator()(int) const>:
│ _ZZL26clamp_bench_raw_ptr_lambdaRN9benchmark5StateEENKUliE0_clEi():
0.21 │ push %rbp
25.43 │ mov %rsp,%rbp
3.31 │ mov %rdi,-0x8(%rbp)
0.16 │ mov %esi,-0xc(%rbp)
12.88 │ mov $0x200,%eax
29.61 │ cmpl $0x200,-0xc(%rbp)
21.14 │ cmovle -0xc(%rbp),%eax
0.11 │ pop %rbp
7.15 │ ← retq
Roughly the same as our implementation with std::vector. We are still using a cmovle instead of a branch for our clamp.
Let’s see how our performance compares to our previous three implementations:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_raw_ptr_lambda/8 553 ns 552 ns 1159259
clamp_bench_raw_ptr_lambda/9 1127 ns 1088 ns 643189
clamp_bench_raw_ptr_lambda/10 2199 ns 2168 ns 322517
Not bad! Only slightly slower than our implementation that used raw pointers and a hand-rolled for-loop! Based on our generated assembly, this is likely because we still have some lingering C++ operator overhead that doesn’t exist in the other implementation.
-O0 Optimization Summary
So what did we learn in this section? From our initial measurements, writing code in a more C-like style seems to be significantly faster. However, let’s keep in mind this is WITHOUT OPTIMIZATIONS. This is not a typical situation (except when debugging). Let’s continue our experiments with -O1 optimizations in the next section.
Clamp at -O1 Optimization
Now that we have a solid understanding of our performance with unoptimized code, let’s enable -O1 optimizations. For GCC, this enables the following optimization flags:
-fauto-inc-dec
-fbranch-count-reg
-fcombine-stack-adjustments
-fcompare-elim
-fcprop-registers
-fdce
-fdefer-pop
-fdelayed-branch
-fdse
-fforward-propagate
-fguess-branch-probability
-fif-conversion
-fif-conversion2
-finline-functions-called-once
-fipa-profile
-fipa-pure-const
-fipa-reference
-fipa-reference-addressable
-fmerge-constants
-fmove-loop-invariants
-fomit-frame-pointer
-freorder-blocks
-fshrink-wrap
-fshrink-wrap-separate
-fsplit-wide-types
-fssa-backprop
-fssa-phiopt
-ftree-bit-ccp
-ftree-ccp
-ftree-ch
-ftree-coalesce-vars
-ftree-copy-prop
-ftree-dce
-ftree-dominator-opts
-ftree-dse
-ftree-forwprop
-ftree-fre
-ftree-phiprop
-ftree-pta
-ftree-scev-cprop
-ftree-sink
-ftree-slsr
-ftree-sra
-ftree-ter
-funit-at-a-time
All of these benchmarks were compiled using the following command:
g++ clamp_bench.cpp -lbenchmark -lpthread -O1 -o clamp
Clamp with Vectors
Let’s re-evaluate our clamp benchmark that uses std::vector containers and a for-loop. Here is the generated assembly with -O1 optimizations enabled:
0.54 │247:┌─→mov 0x10(%rsp),%rdx
25.07 │ │ cmpl $0x200,(%rdx,%rax,1)
10.78 │ │ mov %esi,%ecx
22.07 │ │ cmovle (%rdx,%rax,1),%ecx
0.12 │ │ mov 0x30(%rsp),%rdx
30.33 │ │ mov %ecx,(%rdx,%rax,1)
0.05 │ │ add $0x4,%rax
│ ├──cmp %rax,%rdi
10.48 │ └──jne 247
Much better! Our compiler was able to get rid of our ::operator[] calls and turn our branch into a cmovle instruction. Let’s see how much our performance improved:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench/8 176 ns 176 ns 3930433
clamp_bench/9 346 ns 346 ns 2025628
clamp_bench/10 696 ns 696 ns 1006332
Much better! Trimming that overhead and changing to a cmovle instruction gave us an 8-10x performance boost!
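For reference, the -O1 loop body now corresponds to roughly the following sketch (the function name is mine; the point is that the inlined operator[] reduces to plain indexed loads and stores):
#include <cstddef>
#include <vector>

// Sketch of what the -O1 codegen corresponds to: the operator[] calls are
// inlined down to indexed loads/stores, and the ternary becomes a cmovle.
void clamp_o1_view(const std::vector<int> &v_in, std::vector<int> &v_out) {
  const int *in = v_in.data();
  int *out = v_out.data();
  for (std::size_t i = 0; i < v_in.size(); ++i)
    out[i] = (in[i] > 512) ? 512 : in[i];  // cmpl $0x200,... / cmovle
}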
Clamp with Raw Pointers
Let’s re-evaluate our clamp with raw pointers and a for-loop. Here is the generated assembly with -O1 optimizations enabled:
49.04 │1ff:┌─→cmpl $0x200,(%rbx,%rax,4)
0.03 │ │ mov %ecx,%edx
0.55 │ │ cmovle (%rbx,%rax,4),%edx
49.79 │ │ mov %edx,0x0(%rbp,%rax,4)
0.02 │ │ add $0x1,%rax
│ ├──cmp %rsi,%rax
│ └──jne 1ff
An even tighter inner loop! We still have a cmovle instruction, and our optimizations trimmed away quite a few instructions!
Now let’s see if this improved our performance:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_raw_ptr/8 116 ns 116 ns 6026730
clamp_bench_raw_ptr/9 226 ns 226 ns 3071095
clamp_bench_raw_ptr/10 456 ns 456 ns 1539512
Roughly 3.5x faster than when we compiled with -O0 optimizations, and still a bit faster than our std::vector with for-loop implementation.
Clamp with Vectors and std::transform
Now let’s see how our worst-performing implementation fared with -O1 optimizations enabled. Here’s our new assembly:
48.00 │1d0:┌─→cmpl $0x200,(%rdx,%rax,1)
0.02 │ │ mov %r8d,%ecx
1.15 │ │ cmovle (%rdx,%rax,1),%ecx
49.59 │ │ mov %ecx,(%rdi,%rax,1)
0.03 │ │ add $0x4,%rax
│ ├──cmp %rsi,%rax
│ └──jne 1d0
Very similar to our clamp benchmark with raw pointers and a for-loop when -O1 optimizations are enabled! Let’s measure the performance:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_lambda/8 117 ns 117 ns 6026575
clamp_bench_lambda/9 227 ns 227 ns 3056295
clamp_bench_lambda/10 510 ns 510 ns 1357904
As you probably expected, the performance is similar to our last example as well!
Clamp with Raw Pointers and std::transform
Let’s see if there is still a benefit of using raw pointers with std::transform. Here is our generated assembly:
48.89 │1e8:┌─→cmpl $0x200,(%rbx,%rax,1)
0.03 │ │ mov %ecx,%edx
1.12 │ │ cmovle (%rbx,%rax,1),%edx
48.94 │ │ mov %edx,0x0(%rbp,%rax,1)
0.01 │ │ add $0x4,%rax
│ ├──cmp %rax,%r12
│ └──jne 1e8
More of the same assembly we saw previously. Let’s check the performance just to make sure nothing else is going on:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_raw_ptr_lambda/8 115 ns 115 ns 6017905
clamp_bench_raw_ptr_lambda/9 226 ns 226 ns 3078079
clamp_bench_raw_ptr_lambda/10 509 ns 509 ns 1355150
Looks good! Only our implementation that uses std::vector and a hand-rolled for-loop is still under-performing.
-O1 Optimization Summary
What we’ve seen so far is that using things like STL containers and algorithms can have substantial overheads, but those largely disappear when you enable even basic optimizations (through a combination of function inlining and other optimizations). In the next section, we’ll continue our discussion by enabling -O2 optimizations for our benchmarks and analyzing the results.
Clamp at -O2 Optimization
When we enable -O2 optimizations, we get all the previous optimizations enabled, along with the following flags in GCC:
-falign-functions -falign-jumps
-falign-labels -falign-loops
-fcaller-saves
-fcode-hoisting
-fcrossjumping
-fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks
-fdevirtualize -fdevirtualize-speculatively
-fexpensive-optimizations
-ffinite-loops
-fgcse -fgcse-lm
-fhoist-adjacent-loads
-finline-functions
-finline-small-functions
-findirect-inlining
-fipa-bit-cp -fipa-cp -fipa-icf
-fipa-ra -fipa-sra -fipa-vrp
-fisolate-erroneous-paths-dereference
-flra-remat
-foptimize-sibling-calls
-foptimize-strlen
-fpartial-inlining
-fpeephole2
-freorder-blocks-algorithm=stc
-freorder-blocks-and-partition -freorder-functions
-frerun-cse-after-loop
-fschedule-insns -fschedule-insns2
-fsched-interblock -fsched-spec
-fstore-merging
-fstrict-aliasing
-fthread-jumps
-ftree-builtin-call-dce
-ftree-pre
-ftree-switch-conversion -ftree-tail-merge
-ftree-vrp
-O2 enables nearly all optimizations, as long as they don’t involve a space-speed tradeoff.
This benchmark was compiled using the following command:
g++ clamp_bench.cpp -lbenchmark -lpthread -O2 -o clamp
Benchmark Results
Unlike at the previous optimization levels, we will no longer be analyzing four separate benchmarks. Why? Because they all generate roughly the same assembly and have the same performance!
Let’s use the most C++-idiomatic of our implementations (using std::vector containers and std::transform) for the remainder of our experiments. Here is the generated assembly for that benchmark:
49.05 │2a0:┌─→cmpl $0x200,(%rdx,%rax,1)
0.01 │ │ mov %esi,%ecx
1.21 │ │ cmovle (%rdx,%rax,1),%ecx
48.38 │ │ mov %ecx,(%r8,%rax,1)
0.01 │ │ add $0x4,%rax
│ ├──cmp %rdi,%rax
│ └──jne 2a0
The exact same assembly as we had with -O1 optimizations. Here are the performance numbers:
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
clamp_bench_lambda/8 119 ns 119 ns 5874782
clamp_bench_lambda/9 232 ns 232 ns 2998641
clamp_bench_lambda/10 521 ns 521 ns 1336188
Unsurprisingly, we get about the same performance as previously reported!
-O2 Optimization Summary
Do -O2 optimizations help at all? Yes, absolutely! However, we are optimizing an incredibly simple operation (a clamp function). With such a simple operation, only a finite set of optimizations is applicable, and in this case none of the ones in the -O2 category helped our tight inner loop.
We’ll continue our exploration with -O3 optimizations in the next section.
Clamp at -O3 Optimization
When we enable -O3 optimizations, we get all the previous optimizations, plus the following flags in GCC:
-fgcse-after-reload
-fipa-cp-clone
-floop-interchange
-floop-unroll-and-jam
-fpeel-loops
-fpredictive-commoning
-fsplit-loops
-fsplit-paths
-ftree-loop-distribution
-ftree-loop-vectorize
-ftree-partial-pre
-ftree-slp-vectorize
-funswitch-loops
-fvect-cost-model
-fvect-cost-model=dynamic
-fversion-loops-for-strides
We have some very important optimizations at this level (e.g., vectorization).
Our benchmark was compiled using the following command:
g++ clamp_bench.cpp -lbenchmark -lpthread -O3 -o clamp
As a reminder, we will only be looking at a single benchmark since they all now generate the same assembly.
Benchmark Results
Let’s see if our assembly changed when we enabled -O3 optimizations:
18.70 │290:┌─→movdqu 0x0(%rbp,%rax,1),%xmm0
1.39 │ │ movdqu 0x0(%rbp,%rax,1),%xmm3
16.98 │ │ movdqa %xmm1,%xmm2
2.73 │ │ pcmpgtd %xmm1,%xmm0
17.90 │ │ pand %xmm0,%xmm2
1.56 │ │ pandn %xmm3,%xmm0
18.13 │ │ por %xmm2,%xmm0
1.50 │ │ movups %xmm0,(%r12,%rax,1)
17.72 │ │ add $0x10,%rax
0.03 │ ├──cmp %rdx,%rax
1.40 │ └──jne
Some new code! What happened? Our compiler performed vectorization! Vectorization allows our processor to process multiple elements with a single instruction. In this case, we’re packing 4 elements into each 128-bit xmm register that is used. Let’s see how this changed our performance:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_lambda/8 36.5 ns 36.5 ns 19224165
clamp_bench_lambda/9 78.2 ns 78.2 ns 8927309
clamp_bench_lambda/10 154 ns 154 ns 4508519
Quite a significant improvement! In fact, almost 4x faster! This is somewhat to be expected. If we’re using instructions that process 4 elements at a time, we would optimistically expect a 4x speedup. We don’t quite hit that mark, likely because we have a few extra instructions to handle the vectorization, and maybe some minor throughput differences in the instructions.
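For reference, here is roughly what that SSE2 code path looks like written with intrinsics. This is a hedged sketch (the function name and the assumption that N is a multiple of 4 are mine, and the real codegen also handles leftover elements): baseline x86-64 only guarantees SSE2, which has no packed-min instruction for 32-bit integers, so the compiler builds the clamp from a compare and a mask blend (pcmpgtd/pand/pandn/por).
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch of the -O3 vectorized loop body (assumes N is a multiple of 4).
void clamp_sse2(const int *v_in, int *v_out, int N) {
  const __m128i ceiling = _mm_set1_epi32(512);
  for (int i = 0; i < N; i += 4) {
    __m128i v = _mm_loadu_si128((const __m128i *)(v_in + i));  // movdqu
    __m128i mask = _mm_cmpgt_epi32(v, ceiling);                // pcmpgtd: lanes where v > 512
    // Blend: take 512 where the mask is set, the input value otherwise
    __m128i res = _mm_or_si128(_mm_and_si128(mask, ceiling),   // pand / pandn / por
                               _mm_andnot_si128(mask, v));
    _mm_storeu_si128((__m128i *)(v_out + i), res);             // movups
  }
}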
-O3 Optimization Summary
Vectorization can have a huge impact on performance (as seen in this example). However, many things can get into the way of the auto-vectorizer (e.g., aliasing) that can prevent vectorization. In the next section, we’ll look at a few additional flags we can pass to our compiler beyond just changing the optimization level.
Clamp at -O3 Optimization with Native Tuning
One final thing we will try is the -march=native and -mtune=native flags. These flags tell the compiler to produce code for the native system’s processor architecture, and to tune for the native architecture, respectively.
Our benchmark was compiled using the following command:
g++ clamp_bench.cpp -lbenchmark -lpthread -O3 -march=native -mtune=native -o clamp
Benchmark Results
Let’s see if compiling and tuning for the native architecture made any difference to our performance. Here is the generated assembly:
46.07 │2c8:┌─→vpminsd (%r12,%rax,1),%ymm1,%ymm0
14.49 │ │ vmovdqu %ymm0,0x0(%r13,%rax,1)
18.79 │ │ add $0x20,%rax
0.26 │ ├──cmp %rax,%rcx
18.78 │ └──jne 2c8
Two important changes! For one, we’re using 256-bit vector instructions (ymm registers are 256 bits wide), which process 8 integers at a time instead of the 4 we were processing previously. Secondly, we have fewer instructions, because we’re using vpminsd, a dedicated instruction for taking the per-lane minimum of packed signed integers. Let’s see how this changes performance:
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
clamp_bench_lambda/8 10.6 ns 10.6 ns 66181966
clamp_bench_lambda/9 19.4 ns 19.4 ns 36134751
clamp_bench_lambda/10 57.3 ns 57.3 ns 12117304
Another ~3-4x performance improvement (somewhat expected based on the changes in the assembly)!
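For reference, the AVX2 path maps almost one-to-one onto intrinsics. A minimal sketch (the function name and the multiple-of-8 assumption are mine; the real code also handles remainders):
#include <immintrin.h>  // AVX2 intrinsics

// Sketch of the -march=native vectorized loop body (assumes N is a multiple of 8).
void clamp_avx2(const int *v_in, int *v_out, int N) {
  const __m256i ceiling = _mm256_set1_epi32(512);
  for (int i = 0; i < N; i += 8) {
    __m256i v = _mm256_loadu_si256((const __m256i *)(v_in + i));
    __m256i r = _mm256_min_epi32(v, ceiling);                  // vpminsd: per-lane signed min
    _mm256_storeu_si256((__m256i *)(v_out + i), r);
  }
}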
-O3 Optimization with Native Tuning Summary
Without informing the compiler about the architecture it is compiling for, it has to be conservative, and it will not use any features (e.g., wider SIMD instructions) that it doesn’t know are supported. Telling our compiler to target and tune for the native architecture meant that it knew it could generate 8-wide SIMD instructions, and use a dedicated instruction for taking the minimum of packed integers.
Final Thoughts
Code can change drastically at different optimization levels, and with different optimization flags. Understanding how these optimizations work can help us improve the performance of our applications, and guide how we write code in our high-level languages. One important parting thought: before you try to perform micro-optimizations yourself, make sure the compiler doesn’t have a switch that will do it for you (and maybe much more!).
Thanks for reading,
–Nick
Link to the source code
- Source Code:
- My YouTube Channel:
- My GitHub Account:
- My Email: CoffeeBeforeArch@gmail.com