Writing a Trace-Based Cache Simulator
Computer architects use many tools to evaluate proposed architectures. They may use coarse-grained analytical models to quickly rule out sub-optimal designs, or complex RTL simulation to get an accurate view of how the real hardware will behave. Another common tool that provides a compromise on speed in accuarcy between these are micro-architecture simulators.
Microarchitecture simulators can provide researchers with a decently accurate view of how hardware will behave in a reasonable amount of time, and are used extensively in both research and industry. This makes them a great thing to get familiar with if you are interested in hardware and architecture.
In this blog post we’ll be looking at how to write a trace-based cache simulator based on the course assignment I found here. This should provide a (relatively) simple entry point for those interested in simulator design and implementation, and give you ideas on how you could design your own,independent of things like the specific trace format we use in this blog post.
Link to the source code
- My YouTube Channel
- My GitHub Account
- The Cache Simulator
- My Email: CoffeeBeforeArch@gmail.com
Trace-Based Simulator Basics
Before we dig into the code, we should discuss what we are actually trying to design, starting with what “trace-based” simulation means.
Simulators often come in two flavors: trace-based and execution-driven. Execution-driven simulators drive simulation by directly executing a program’s instructions. Conversely, trace-driven simulators drive simulation by reading/parsing a pre-recorded trace of instructions and results. For our simple cache simulator, we will be using a trace-based design.
Our simulator must fundamentally do three things:
- Read+Parse instructions from a trace of memory accesses
- Process the cache accesses
- Record statistics about what happened
In the following sections, we will discuss this functionality in greater detail as we implement methods and add data members to our CacheSim
class:
// Our cache simulator
class CacheSim {
public:
private:
};
Reading and Parsing Memory Accesses
The first step in building a cache simulator is building the infrastructure to read/parse the instructions. We’ll start by looking at the format of the instruction trace, then at the code that will read/parse the instructions.
Format of the Instruction Trace
The instruction traces for this simulator are from this course assignment. Each line of the trace files looks like the following:
# 0 7fffed80 1
Following the #
we have the type of memory access (0
for read, and 1
for write), the address of the piece of memory being accessed, and then the number of instructions executed between the previous memory access and this one (for a cache simulator, we don’t care about the details of instructions that are not memory accesses).
Opening and Closing the Trace File
We will be using a std::ifstream
object for reading our trace file. We can add this to our private data members and name it infile
:
// Input trace file
std::ifstream infile;
The first thing we need to set up for our std::ifstream
object is opening and closing the file. We’ll follow the programming technique RAII (resource acquisition is initialization) for this, which binds the life cycle of a resource (the file we’ll be opening) to the lifetime of the object (our CacheSim
object).
Specifically, we will be opening the file in the CacheSim
constructor, and closing it in the destructor. We’ll start with our constructor which will take a single std::string
(the path to the trace file to open). Inside the constructor, we will call the open
method for our ifstream
object:
// Constructor
CacheSim(std::string input) {
// Initialize of input file stream object
infile.open(input);
}
Next, we have the destructor which closes the file using the close
method:
// Destructor
~CacheSim() {
infile.close();
}
Reading and Parsing Trace File Lines
Now that we’ve set up the opening and closing of our trace file, we can focus on reading the instructions line-by-line and extracting the access type, address, and instruction count.
For this, we’ll be implementing two methods:
run
- A method that will keep reading until we reach the end of the traceparse_line
- A method for extracting the access type, address, and instruction count from a line from the trace file
We can implement the run
method as follows:
// Run the simulation
void run() {
// Keep reading data from a file
std::string line;
while (std::getline(infile, line)) {
// We will parse the line from the file here
}
}
Each iteration of the while
loop, we read a new line from the trace file into our std::string
named line
using std::getline
. When we reach the end of the file, we break out of the loop.
The next thing we will add is a helper method to extract the data stored in line
. We’ll implement that in our parse_line
method:
// Get memory access from the trace file
std::tuple<bool, std::uint64_t, int> parse_line(std::string access) {
// What we want to parse
int type;
std::uint64_t address;
int instructions;
// Parse from the string we read from the file
sscanf(access.c_str(), "# %d %lx %d", &type, &address, &instructions);
return {type, address, instructions};
}
We will pass a std::string
containing the line from our trace file to the method. We’ll then create 3 temporary variables (type
, address
, and instructions
) for the three values we want to extract from our string.
Next, we’ll use sscanf
to extract the values in the string into our temporary variables.
Finally, we’ll return all three variables using a std::tuple
.
All we need to do now is add the call to parse_line
to our run
method:
// Run the simulation
void run() {
// Keep reading data from a file
std::string line;
while (std::getline(infile, line)) {
// Get the data for the access
auto [type, address, instructions] = parse_line(line);
}
}
We can even use structured bindings (C++17) to unpack our the three values in the returned std::tuple
directly (as seen above) instead of using something like std::tie
or std::get
.
End of Section Thoughts
Great! We know have a nice setup for opening/closing our trace file, and reading/parsing the data from each line.
This is a good place to stop and verify that everything works. One good way to verify that everything is working correctly is to print out the values returned from parse_line
, redirect the output to a file, and then diff
that file with the original trace (they should perfectly match if everything is working correctly).
Processing Cache Accesses
In this next section, we’ll focus on how to model a cache and process the accesses we read and process from our trace file.
Cache State
Before we can implement the funcionality of our cache model, we first need to add the appropriate state (data members) to our CacheSim
class.
The first thing we’ll add are the configuration parameters. These describe the shape of our cache (block size, associativity, and capactiy), and the overheads (miss penalty and dirty writeback penalty) for each access.
// Cache settings
unsigned block_size;
unsigned associativity;
unsigned capacity;
unsigned miss_penalty;
unsigned dirty_wb_penalty;
The next thing we’ll add to our cache are some offsets and masks that we’ll use for extracting the set number and tag from a memory address:
// Access settings
unsigned set_offset;
unsigned tag_offset;
unsigned set_mask;
The final thing we’ll add for functionality is actual state of our cache:
// The actual cache state
std::vector<std::uint64_t> tags;
std::vector<char> dirty;
std::vector<char> valid;
std::vector<int> priority;
For each cache block, we’ll store the tag at the location, if the cache block is dirty, if the cache block is valid, and the priority of the cache block w.r.t. the LRU replacement policy.
Setting Up the Cache
Now that we have the data members in our CacheSim
class, we need to initialize them. This is something we can do in the constructor.
We’ll start by expanding our constructor to also take the configuration options (block size, associativity, capacity, miss-penalty, and dirty writeback penalty) as parameters.
// Constructor
CacheSim(std::string input, unsigned bs, unsigned a, unsigned c, unsigned mp,
unsigned wbp) {
// Initialize of input file stream object
infile.open(input);
// Set all of our cache settings
block_size = bs;
associativity = a;
capacity = c;
miss_penalty = mp;
dirty_wb_penalty = wbp;
The next thing we need to do is resize our vectors based on the number of cache blocks for which we want to maintain state. This will simply be the capacity of our cache divided by the cache block size:
// Calculate the number of blocks
// Assume this divides evenly
auto num_blocks = capacity / block_size;
// Create our cache based on the number of blocks
tags.resize(num_blocks);
dirty.resize(num_blocks);
valid.resize(num_blocks);
priority.resize(num_blocks);
Note, we don’t actually need to worry about the shape of our cache when sizing the vectors. This is because the shape of our cache will only affect the traversal of the vectors, not the number of total elements (which will always be equal to the number of cache blocks). Furthermore, resizing the vectors as show will initialize all the values to 0. Therefore, all of our cache blocks will initially be invalid.
The final thing we need to initialize are the offsets and mask used to extract the set number and tag from a memory address. We’ll be using the following memory address format as a basis for these offsets and masks.
X-Bits Y-Bits Z-Bits
|****** TAG ******|**** SET ****|** OFFSET **|
The first thing we’ll calculate is the set_offset
, which is the number of bits we need to shift over over to extract the set number (Z-Bits from our diagram). Assuming byte-addressable memory and a cache block_size
of 16-bytes, the number of bits (Z-Bits) would be 4. This is because 4 bits can represent 16 total states (values 0b0000
to 0b1111
, one for each byte in a cache block).
The fundamental calculation we need is log base 2 of the cache block size (e.g. log base 2 of 16 is 4). A simple way to calculate this is by subtracting 1 from block_size
then using std::popcount
to count the number of set bits. Consider the following example where the block_size
is 16 bytes:
block_size = 16 = 0b00010000
block_size - 1 = 15 = 0b00001111
popcount(0b00001111) = 4
In C++, we can do this as follows:
auto set_offset = std::popcount(block_size - 1);
This calculation is reliant on the block_size
being a power-of-2. For all practical intents and purposes, this will be the case.
We next thing we need to do calculate is our set_mask
. This will be a mask of 1s equal to the number of set bits (Y-Bits from our diagram) in our memory address. We can start by calculating the number of sets in our cache:
auto sets = capacity / (block_size * associativity);
The size of each set is the equal to the associativity multiplied by the block size. We can therefore figure out the total number of sets buy dividing the total capacity of the cache by the size of each set.
To create a mask of ones equal to the number of set bits (Y-bits from our diagram), we can simply subtract 1 from the the number of sets. Consider the following example where the number sets is 16:
sets = 16 = 0b00010000
sets - 1 = 15 = 0b00001111
Because the number of sets will be a power-of-2 for all practical intents and purposes, subtracting one from than number will give us a mask of ones equal to the number of set bits (for the same reasoning as with our set_offset
calculation).
set_mask = sets - 1;
The final thing we need to calculate is the tag_offset
(the number of bits to shift the address over to extract the tag). This will simply be the set_offset
plus the number number of set bits (fundamentally, we’re just shifting out the OFFSET
and SET
bits from the diagram). We can calculate this as follows:
auto set_bits = std::popcount(set_mask);
tag_offset = set_offset + set_bits;
We do not need a mask to extract the tag bits because all the upper-bits of the address shifted by tag_offset
will be zeros.
Here is the full implementation of our constructor:
// Constructor
CacheSim(std::string input, unsigned bs, unsigned a, unsigned c, unsigned mp,
unsigned wbp) {
// Initialize of input file stream object
infile.open(input);
// Set all of our cache settings
block_size = bs;
associativity = a;
capacity = c;
miss_penalty = mp;
dirty_wb_penalty = wbp;
// Calculate the number of blocks
// Assume this divides evenly
auto num_blocks = capacity / block_size;
// Create our cache based on the number of blocks
tags.resize(num_blocks);
dirty.resize(num_blocks);
valid.resize(num_blocks);
priority.resize(num_blocks);
// Calculate values for traversal
// Cache lines come in the following format:
// |****** TAG ******|**** SET ****|** OFFSET **|
// Calculate the number of offset bits
auto block_bits = std::popcount(block_size - 1);
// Calculate the number of set bits, and create a mask of 1s
set_offset = block_bits;
auto sets = capacity / (block_size * associativity);
set_mask = sets - 1;
auto set_bits = std::popcount(set_mask);
// Calculate the bit-offset for the tag and create a mask of 1s
// Always use 64-bit addresses
tag_offset = block_bits + set_bits;
}
Accessing the Cache
Now that we have initialized our cache state, we can move into the logic for processing cache accesses. We will implement the following methods:
probe
- A method that performs the cache accessget_set
- A helper method to get the set number from an addressget_tag
- A helper method to get the tag from an address
Additionally, we will need to update our run
method with our call to probe
.
Helper Method Implementations
Our helper method get_set
uses our set_offset
and set_mask
initialized in the constructor:
// Extract the set number
// Shift the set to the bottom then extract the set bits
int get_set(std::uint64_t address) {
auto shifted_address = address >> set_offset;
return shifted_address & set_mask;
}
To get the set number, we simply shift the input memory address down to get rid of the offset bits, then extract only the set bits from the address (ignoring the tag).
Our get_tag
method is similar, except that we do not need to use a mask to extract the tag bits (because the upper bits will all be 0 in our shifted address).
// Extract the tag
// Shift the tag to the bottom
// No need to use mask (tag is all upper remaining bits)
std::uint64_t get_tag(std::uint64_t address) {
return address >> tag_offset;
}
Probe Implementation
Our probe
method will simulate the actual cache access. To our method we will pass the type of access (read/write), and the address of the access. We will return if the access was a hit or a miss, and if the access caused the writeback of a dirty cache line.
// Probe the cache
std::tuple<bool, bool> probe(bool type, std::uint64_t address) { }
The first thing we will add is our calls to the helper methods to extract the set number and tag:
// Calculate the set from the address
auto set = get_set(address);
auto tag = get_tag(address);
Using the cache set number, we’ll create a std::span
(C++20) for each of the vectors containing our cache state:
// create a spans for our set
auto base = set * associativity;
std::span local_tags{tags.data() + base, associativity};
std::span local_dirty{dirty.data() + base, associativity};
std::span local_valid{valid.data() + base, associativity};
std::span local_priority{priority.data() + base, associativity};
std::span
provides a nice way to only view a subset of a container (a vector in our case) at once. This allws us to view the data for just one cache set from each of our vectors.
Next, we start implementing the actual probe functionality. We will begin with the logic for checking if the access is a hit or a miss:
// Check each cache line in the set
auto hit = false;
int invalid_index = -1;
int index;
for (auto i = 0u; i < local_valid.size(); i++) {
// Check if the block is invalid
if (!local_valid[i]) {
// Keep track of invalid entries in case we need them
invalid_index = i;
continue;
}
// Check if the tag matches
if (tag != local_tags[i]) continue;
// We found the line, so mark it as a hit
hit = true;
index = i;
// Update dirty flag
local_dirty[index] |= type;
// Break out of the loop
break;
}
Our for
loop goes over each line in the set of our cache, first checking if the current cache line is invalid. If the line is invalid, we keep track of the index (for use in replacement), and continue
to check the next cache block (or exit if there are no more blocks to check).
If the cache block is valid, we check if the tag of the access matches the tag of the block in the cache. If they don’t match, we continue
to check the next cache block (or exit if there are no more blocks to check).
If the cache block is not invalid, and the tag in the access and block match, we have a cache hit! We therefore mark the access as a hit, save the index (to update priority later), mark the dirty bit for the block if it is a write, and break out of the loop.
If the access was not a hit, we must implement the code to find a cache line to replace:
// Find an element to replace if it wasn't a hit
auto dirty_wb = false;
if (!hit) {
// First try and use an invalid line (if available)
if (invalid_index >= 0) {
index = invalid_index;
local_valid[index] = 1;
}
// Otherwise, evict the lowest-priority cache block (largest value)
else {
auto max_element = std::ranges::max_element(local_priority);
index = std::distance(begin(local_priority), max_element);
dirty_wb = local_dirty[index];
}
// Update the tag and dirty state
local_tags[index] = tag;
local_dirty[index] = type;
}
If we found an invalid cache line when initially searching the set, we will that for replacement. Otherwise, we will replace the element with the lowest priority (where the highest priority is 0
, and the lowest priority is associativity - 1
). Additionally, we will set the tag and dirty bit if the access was a write.
We next need to update the priority for other lines in our cache replacement policy. For this, we will use std::transform
:
// Update the priority
// Go through each element
// Increase the priority of all the blocks with a lower priority than the
// one we are accessing
// High priority -> Low priority = 0 -> associativity - 1
std::transform(begin(local_priority), end(local_priority),
begin(local_priority), [&](int p) {
if (p <= local_priority[index] && p < associativity)
return p + 1;
else
return p;
});
// Currently accessed block has the highest priority (0)
local_priority[index] = 0;
Here, we decrease the priority (increment the value) of all elements in the span that have a higher original priority (lower value) than the cache block being accessed. Then, we set the accessed block to have the highest priority (0
).
Finally, we return if the access was a hit/miss, and if the access caused a dirty writeback:
return {hit, dirty_wb};
Modification to the Run Method
The final step to functionally model our cache is to add a call to probe
to our run
method:
// Run the simulation
void run() {
// Keep reading data from a file
std::string line;
while (std::getline(infile, line)) {
// Get the data for the access
auto [type, address, instructions] = parse_line(line);
// Probe the cache
auto [hit, dirty_wb] = probe(type, address);
}
}
Just like we did with parse_line
, we use structured bindings (C++17) to unpack the two values returned from the probe
method.
End of Section Thoughts
We implemented many things in this section! However, it’s a good idea to test each of these methods as you go.
For example, I would first verify that your get_set
and get_tag
methods are properly extracting the fields from the memory addresses.
Next, I would create a sample trace with just a few accesses (these could be copied from one of the larger trace files) that can be easily manually verified. For example, you could initially check that two accesses to the same cache line result in one hit and one miss.
Recording Statistics
The final thing we need for our simulator is statistic collection. We’ll implement this using a simple update_stats
method, and small modification to our run
method. We will also add a dump_stats
method for printing the cache statistics (and place a call to this method in the destructor).
State for our Statistics
To keep track our our statistics, we’ll add a few more data members to our class:
// Cache statistics
std::int64_t writes_ = 0;
std::int64_t mem_accesses_ = 0;
std::int64_t misses_ = 0;
std::int64_t dirty_wb_ = 0;
std::int64_t instructions_ = 0;
Initially, all our statistics will be 0
.
A good thing to keep in mind is that we don’t need to have a unique counter for every statistic. For example, the number of reads can be calculated at the end of execution by subtracting writes_
from the total number memory accesses (mem_accesses_
).
The Update Stats Method
Our update_stats
method is incredibly simple:
// Update the stats
void update_stats(int instructions, bool type, bool hit, bool dirty_wb) {
mem_accesses_++;
writes_ += type;
misses_ += !hit;
instructions_ += instructions;
dirty_wb_ += dirty_wb;
}
As parameters, our method takes the number of instructions executed, the type of memory access (read/write), if the access was a hit/miss, and if the access resulted in a dirty writeback.
We then update the internal stats accordingly.
The Dump Stats Method
The final method we’ll implement is dump_stats
. This method will simply calculate any statistics that need to be derived, and print them to the screen:
// Dump the statistics from simulation
void dump_stats() {
// Print the cache settings
std::cout << "CACHE SETTINGS\n";
std::cout << " Cache Size (Bytes): " << capacity << '\n';
std::cout << " Associativity : " << associativity << '\n';
std::cout << " Block Size (Bytes): " << block_size << '\n';
std::cout << " Miss Penalty (Cycles): " << miss_penalty << '\n';
std::cout << "Dirty WB Penalty (Cycles): " << dirty_wb_penalty << '\n';
std::cout << '\n';
// Print the access breakdown
std::cout << "CACHE ACCESS STATS\n";
std::cout << "TOTAL ACCESSES: " << mem_accesses_ << '\n';
std::cout << " READS: " << mem_accesses_ - writes_ << '\n';
std::cout << " WRITES: " << writes_ << '\n';
std::cout << '\n';
// Print the miss-rate breakdown
std::cout << "CACHE MISS-RATE STATS\n";
double miss_rate = (double)misses_ / (double)mem_accesses_ * 100.0;
auto hits = mem_accesses_ - misses_;
std::cout << " MISS-RATE: " << miss_rate << '\n';
std::cout << " MISSES: " << misses_ << '\n';
std::cout << " HITS: " << hits << '\n';
std::cout << '\n';
// Print the instruction breakdown
std::cout << "CACHE IPC STATS\n";
auto cycles = miss_penalty * misses_;
cycles += dirty_wb_penalty * dirty_wb_;
cycles += instructions_;
double ipc = (double)instructions_ / (double)cycles;
std::cout << " IPC: " << ipc << '\n';
std::cout << " INSTRUCTIONS: " << instructions_ << '\n';
std::cout << " CYCLES: " << cycles << '\n';
std::cout << " DIRTY WB: " << dirty_wb_ << '\n';
}
We will also add a call to this method in the destructor:
// Destructor
~CacheSim() {
infile.close();
dump_stats();
}
Modification to the Run Method
We can complete our run
method by adding a call to update_stats
:
// Run the simulation
void run() {
// Keep reading data from a file
std::string line;
while (std::getline(infile, line)) {
// Get the data for the access
auto [type, address, instructions] = parse_line(line);
// Probe the cache
auto [hit, dirty_wb] = probe(type, address);
// Update the cache statistics
update_stats(instructions, type, hit, dirty_wb);
}
}
Now our run
method will read every line from our trace file, extract the access details, simulate the cache access, then save record the statistics.
End of Section Thoughts
We’ve finished implementing our cache simulator! However, it’s a good idea to implement these statistics as you go (and not wait until the end).
For example, I’d first add statistics for reads/writes and instruction count before implementing the logic for hits/misses (since they are independent of each other).
Running the Simulator
If all has gone according to plan, you should be able to run your simulator! Here is the main
function I used to create and start my simulator:
int main(int argc, char *argv[]) {
// Kill the program if we didn't get an input file
assert(argc == 2);
// File location
std::string location(argv[1]);
// Hard coded cache settings
unsigned block_size = 1 << 4;
unsigned associativity = 1 << 0;
unsigned capacity = 1 << 14;
unsigned miss_penalty = 30;
unsigned dirty_wb_penalty = 2;
// Create our simulator
CacheSim simulator(location, block_size, associativity, capacity,
miss_penalty, dirty_wb_penalty);
simulator.run();
return 0;
}
While I simply hard-coded the cache configuration parameters in the main
function, you could optionally add the as command line arguments. Additionally, the path to the trace file is the only command line argument.
To compile the simulator (cache_sim.cpp
), I used the following command:
g++ cache_sim.cpp --std=c++2a -O3 -o cache_sim
Here is an example output from running the simulator with one of the traces from the original course website:
CACHE SETTINGS
Cache Size (Bytes): 16384
Associativity : 1
Block Size (Bytes): 16
Miss Penalty (Cycles): 30
Dirty WB Penalty (Cycles): 2
CACHE ACCESS STATS
TOTAL ACCESSES: 1957764
READS: 1731883
WRITES: 225881
CACHE MISS-RATE STATS
MISS-RATE: 28.2638
MISSES: 553339
HITS: 1404425
CACHE IPC STATS
IPC: 0.235004
INSTRUCTIONS: 5136716
CYCLES: 21857966
DIRTY WB: 60540
Final Thoughts
We have made it to the end of our journey! You can test your simulator using the traces found at the original course website and compare the results against the reference results, or write your own!
As with all the code in these blog posts, there are many things we could improve. For example, the cache simulator we created is actually pretty slow! There are numerous optimizations we could perform, especially concerning the how we read accesses from the trace file (why read 1 access at a time when we could read a chunk of accesses!). However, we’ll leave that discussion to another blog post.
Thanks for reading,
–Nick
Link to the source code
- My YouTube Channel
- My GitHub Account
- The Cache Simulator
- My Email: CoffeeBeforeArch@gmail.com