\documentclass[12pt]{article} \usepackage{url} \usepackage{todonotes} \title{Modelling L2 and L3 CPU caches separately in Cachegrind for both x86\_64 and ARM64 architectures - Literature Review} \date{2017-11-18} \author{Matt Coles - mc930} \begin{document} \pagenumbering{gobble} \maketitle \tableofcontents \listoftodos \newpage \pagenumbering{arabic}
\section{Introduction}
To understand the scope of this problem, there are a few areas of interest. To begin with, one must understand how a CPU cache makes use of associative memory to provide fast access to data, and further, how processor architects have split those caches into multiple levels and why that matters. Fast data caches are essential to the performance of the instruction pipeline, and therefore, by extension, to the programs running on the CPU. There are also differing styles of cache, including the simpler direct-mapped cache and the more modern n-way set-associative cache, and when a design has multiple levels of cache, the designers may choose to use one or more of these types. Since the higher levels of cache sit progressively further from the core, there is a corresponding disparity in the speed of access to each level.\\
In addition, as the project is to extend an existing cache profiling tool so that its results better reflect the actual architecture of modern processors, we will also investigate other program profiling tools, how they might be used to produce similar results, and where they take a different approach to profiling cache usage.
\subsection{Introducing Cachegrind}
A popular tool for profiling programs is Valgrind, a dynamic binary instrumentation framework for building dynamic binary analysis tools. As of 2017, Valgrind consists of six `production quality' tools, one of which is Cachegrind. Cachegrind profiles cache behaviour by running the program against a model of the cache and recording whether each piece of data would or would not be present in the cache; it can therefore measure cache hits and misses. For this to work on a given architecture, the instructions of that architecture must be categorised according to whether they read memory, write memory, do both, or do neither. Cachegrind can then monitor the instruction stream and update the cache model as the machine (likely) would \cite{nethercote_2004}. The advantages of this simulated approach are manifold. Firstly, the bulk of the code can be hardware agnostic, avoiding architecture quirks such as whether the architecture provides direct access to the cache controller or completely abstracts it from the programmer. Secondly, any attempt to improve the locality of the code is guided by a simulation environment, so it will likely improve the locality of the code on any architecture without relying on hardware-specific cache behaviour \cite{weidendorfer_2004}.
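To make the simulation-based approach concrete, the short C sketch below replays a stream of memory references against a placeholder cache model and counts hits and misses. It is only an illustration of the general technique: the names (\texttt{MemRef}, \texttt{cache\_access} and so on) are hypothetical and do not correspond to Valgrind's internal API, and the trace here is hard-coded rather than derived from instrumented machine code.
\begin{verbatim}
/* Illustrative sketch of trace-driven cache simulation: every memory
 * reference in the (here hard-coded) stream is classified as an
 * instruction fetch, load or store and replayed against a software
 * cache model.  The names are hypothetical, not Valgrind's API. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { REF_INSTR, REF_LOAD, REF_STORE } RefKind;
typedef struct { RefKind kind; uint64_t addr; } MemRef;

/* Placeholder cache model: a real simulator would consult its
 * tag store here and report whether the access hits. */
static bool cache_access(RefKind kind, uint64_t addr)
{
    (void)kind; (void)addr;
    return false;
}

int main(void)
{
    /* A tiny stand-in for the instrumented instruction stream. */
    MemRef trace[] = {
        { REF_INSTR, 0x400000 }, { REF_LOAD,  0x601000 },
        { REF_INSTR, 0x400004 }, { REF_STORE, 0x601000 },
    };
    unsigned long hits = 0, misses = 0;

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        if (cache_access(trace[i].kind, trace[i].addr)) hits++;
        else                                            misses++;
    }
    printf("hits=%lu misses=%lu\n", hits, misses);
    return 0;
}
\end{verbatim}
Cachegrind follows this general shape, but obtains the reference stream by instrumenting the program's machine code at run time rather than reading a prepared trace \cite{nethercote_2004}.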
\subsection{Introduction to x86\_64 and ARM64}
\subsubsection{x86\_64}
x86 is the all-encompassing name for the backwards-compatible family of architectures that began with the Intel 8086 CPU. However, as most modern CPUs are now 64 bit, this review will focus only on the AMD specification of the x86\_64 implementation. I am choosing to support this architecture as part of this project because it can build upon the work of Kaparelos \cite{kaparelos_2014}, which implements the required changes for the x86\_64 architecture but was never merged into the main project; having read his paper, I can start by implementing the changes in the well-verified x86 environment. In addition, Kaparelos' initial justification for an x86 port of this work still applies: the architecture is the most commonly used in consumer PCs and will therefore provide some of the greatest benefit.\\
The 64-bit extension of x86 that is now widespread, disregarding the Intel Itanium (IA-64) implementation, was developed by AMD \cite{forum_2017}. It succeeded where the original IA-64 ISA did not because it retained backwards compatibility with the 32-bit implementations of x86, so old programs could still be run in a compatibility mode. There are now both Intel and AMD implementations of the ISA, but they are near identical, and most compilers produce code that avoids the differences; this is especially true for application-level programs, as there is little to no chance that they will make use of the kind of instructions that differ. This paper will therefore consider the two implementations as identical and produce work with the intention that it will run on both AMD and Intel branded CPUs.
\subsubsection{ARM64}
ARM64 is the name for the optional 64-bit extension to the ARMv8-A architecture specification from ARM Holdings. The main motivation behind developing the Cachegrind extension for the ARM architecture is the Isambard \cite{isambard_2017} supercomputer project, a world-first supercomputer comprised of ARM-based SoCs with three levels of cache. Accurate cache profiling results for all levels of cache would allow the team to optimise the programs running on Isambard to use the CPU caches efficiently, saving CPU cycles and ensuring that the machine can run as many jobs as possible. \todo{Write more about ARM64}
\section{CPU Caches}
A CPU cache is a (relatively) small, fast piece of memory placed between the CPU bus and the main memory port to address the disparity between CPU frequency and memory frequency. When attempting to access data at a given address, the processor will first look for a match in the cache. However, because the cache is small, there must be some way of deciding where data should be stored in it; we cannot feasibly store all possible data in the cache, as that would simply replicate main memory. We can take advantage of the principles known as temporal locality and spatial locality, which respectively refer to ``the tendency to reuse recently accessed data items'' \cite{hennessy_2011} and ``the tendency to reference data items that are close to other recently referenced items'' \cite{hennessy_2011}. Because programs run on a processor generally exhibit these properties, we do not need to store all possible memory locations to gain an increase in speed. To find a suitable mapping, one can split the address into an upper and a lower portion, where the upper part is stored in a tag store and the lower part refers to the offset within some arbitrarily sized cache line. When data is found in the cache it is referred to as a cache hit, and when the data is not found and a request must be sent to memory, it is referred to as a cache miss. As a request to memory comes with a penalty of around 240 cycles \cite{drepper_2007}, it is important to keep misses to a minimum.
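As an illustration of the address split described above, the following C sketch divides an address into the upper part that would be held in the tag store and the lower part that gives the offset within a cache line. The 64-byte line size is an assumption chosen purely for the example.
\begin{verbatim}
/* Illustrative split of an address into the upper part (kept in the
 * tag store) and the lower part (the offset within the cache line).
 * A 64-byte line is assumed purely for the example. */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* bytes per line -> low 6 bits are the offset */

int main(void)
{
    uint64_t addr   = 0x7f3a92c4ULL;
    uint64_t offset = addr % LINE_SIZE;  /* lower bits: byte in the line  */
    uint64_t tag    = addr / LINE_SIZE;  /* upper bits: checked on lookup */

    printf("address=%#llx tag=%#llx offset=%llu\n",
           (unsigned long long)addr,
           (unsigned long long)tag,
           (unsigned long long)offset);
    return 0;
}
\end{verbatim}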
\subsection{Cache Placement Policies}
\subsubsection{Fully Associative}
A very simple cache implementation would be one where any cache line could fill any slot in the cache. This would be a purely associative memory that accepts new key/data pairs until it is full. To place a new block of memory into the cache, one simply finds an invalid cache line, fills that line with the data, and updates the tag store to record the upper address region that now occupies the line. In the event that the cache is full, a line is evicted based on some replacement policy, for example LRU (least recently used). This is rarely - if ever - a good solution, as each lookup requires comparing against every element in the cache to try to find the matching tag, which takes up a lot of area and power when implemented in silicon.
\subsubsection{Direct Mapped}
Because a fully associative cache is too expensive to implement, and the cost of each lookup can be very high (due to searching the full tag store), a better way to map cache lines to a limited store is to split the upper part of the address further, keeping the tag as a way of determining whether there is a match in the cache, but using the middle slice of bits as an index. When the cache is split into ``sets'' containing a single cache line each, the index can be used to determine which \textit{set} the line will be installed in. This also simplifies the eviction policy: as each cache line can only map to a single \textit{set} in the cache, when there is a conflict the line that occupies the set is simply evicted and the new line is installed. This is advantageous because it makes finding whether a given address is present in the cache much faster and easier and, as mentioned above, makes replacing lines in the cache simpler too. Unfortunately it introduces its own problem: ``If your program makes repeated reference to two data items that happen to share the same cache location (presumably because the low bits of their addresses happen to be close together), then the two data items will keep pushing each other out of the cache and efficiency will fall drastically.'' \cite{sweetman_2007}
\subsubsection{Set Associative}
The final type of cache seeks to remedy this thrashing by further splitting up the sets within the cache to produce an $n \times m$ cache, with $n$ sets and $m$ ways within those sets. This is a compromise between the direct-mapped and fully associative cache designs. With this design, as with the direct-mapped design, the cache line address is split into a tag, an index, and an offset. Again, as in the direct-mapped cache, the index determines which set the cache line belongs to; however, in this design there are multiple \textit{ways} in which the line could be installed, allowing $m$ cache lines per set. This increases the hit rate for lines with spatial locality, as we do not necessarily have to evict a line when another maps to the same set. However, in the case that all ways in a set are occupied, we have to choose one to evict, and this can be done according to an arbitrary eviction policy, as with fully associative cache designs. The downsides to this are that: ``Compared with a direct-mapped cache, a set-associative cache requires many more bus connections between the cache memory and controller. That means that caches too big to integrate onto a single chip are much easier to build direct mapped. More subtly, because the direct-mapped cache has only one possible candidate for the data you need, it's possible to keep the CPU running ahead of the tag check (so long as the CPU does not do anything irrevocable based on the data). Simplicity and running ahead can translate to a faster clock rate.'' \cite{sweetman_2007}
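To show how the tag, index and offset interact in practice, the sketch below implements a hit/miss check for a small n-way set-associative cache with LRU eviction. The geometry (64 sets, 4 ways, 64-byte lines) is an arbitrary assumption for illustration and is not modelled on any particular processor.
\begin{verbatim}
/* Illustrative hit/miss check for an n-way set-associative cache with
 * LRU eviction.  The geometry below is an arbitrary assumption. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS  64u
#define NUM_WAYS  4u
#define LINE_SIZE 64u

typedef struct { bool valid; uint64_t tag; uint64_t last_used; } Line;

static Line     cache[NUM_SETS][NUM_WAYS];
static uint64_t tick;   /* access counter used as an LRU timestamp */

/* Returns true on a hit; on a miss, installs the line in the least
 * recently used (or first invalid) way of its set. */
static bool access_cache(uint64_t addr)
{
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS);
    unsigned victim = 0;

    tick++;
    for (unsigned w = 0; w < NUM_WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].last_used = tick;   /* hit: refresh LRU stamp */
            return true;
        }
        if (cache[set][w].last_used < cache[set][victim].last_used)
            victim = w;   /* invalid ways have last_used == 0, so they
                             are chosen before any valid way */
    }
    cache[set][victim] = (Line){ true, tag, tick };   /* miss: install */
    return false;
}

int main(void)
{
    /* Two lines mapping to different sets: each first access misses,
     * each repeated access then hits. */
    uint64_t probe[] = { 0x1000, 0x1040, 0x1000, 0x1040 };
    for (size_t i = 0; i < 4; i++)
        printf("%#llx -> %s\n", (unsigned long long)probe[i],
               access_cache(probe[i]) ? "hit" : "miss");
    return 0;
}
\end{verbatim}
A software model like this can afford exact recency bookkeeping with a simple counter; hardware implementations typically approximate LRU (for example with pseudo-LRU bits), since exact tracking is expensive in silicon.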
\subsection{Split vs Unified Caches}
The primary goal of the cache is to optimise the memory accesses of the CPU so as to prevent pipeline stalls \cite{sweetman_2007}. For this reason it can be advantageous to split the instruction and data caches - usually at L1 only, as misses there will pay a larger penalty anyway. This is because, due to the nature of pipelines, the CPU can be fetching data for one instruction and the instruction for another at the exact same time \cite{smith_1982}. The cache is split into an instruction cache, which can be further optimised for mostly reads and infrequent writes, and a data cache, which is optimised for the general use case. This doubles the available cache bandwidth, although it can have a detrimental effect on the miss rate. Nevertheless, it remains the prevailing strategy for low-level caches (i.e.\ level 1), as it allows hardware designers to place the caches close to the logic that requires them, cutting down on precious nanoseconds.
\subsection{Multilevel Caches}
Another common method that can be employed to reduce the miss penalty is to provide multiple levels of cache. This is the focus of this dissertation, as Cachegrind will be extended to fully support all three levels of cache present on the Isambard supercomputer.
\newpage \bibliography{literature_review} \bibliographystyle{ieeetr} \end{document}