\documentclass[12pt]{article}
\usepackage{url}
\usepackage{todonotes}
\title{Modelling L2 and L3 CPU caches separately in
Cachegrind for both x86\_64 and ARM64 architectures - Literature Review}
\date{2017-11-18}
\author{Matt Coles - mc930}
\begin{document}
\pagenumbering{gobble}
\maketitle
\tableofcontents
\listoftodos
\newpage
\pagenumbering{arabic}
\section{Introduction}
To understand the scope of this problem, a few areas are of interest. One
must first understand how a CPU cache uses associative memory to provide
fast access to data, and then how processor architects have split those
caches into multiple levels and why that matters. Fast data caches are
essential to the performance of the instruction pipeline and therefore, by
extension, to the programs running on the CPU. There are also differing
styles of cache, including the simpler direct-mapped cache and the more
modern n-way set-associative cache, and when a processor has multiple
levels of cache, designers may choose to use more than one of these types.
Since the additional levels of cache usually sit further from the core,
there is also a disparity in the speed of access between the levels.\\
In addition, as the project is to extend an existing cache profiling tool
so that its results better reflect the actual architecture of modern
processors, we will also investigate other program profiling tools, how
they might be used to produce similar results, and how they take different
approaches to profiling cache usage.
\subsection{Introducing Cachegrind}
A popular tool for profiling programs is Valgrind, a dynamic binary
instrumentation framework for building dynamic binary analysis tools.
Valgrind, as of 2017, consists of six `production quality' tools, one of
which is Cachegrind. Cachegrind profiles cache usage by running the
program against a model of the cache and recording whether each piece of
data would or would not be present in the cache, and can therefore count
cache hits and misses. For this to work on a given architecture, the
instructions of that architecture must be categorised according to whether
they read memory, write memory, do both, or do neither. Cachegrind can
then monitor the instruction stream and update the cache model as the
machine (likely) would \cite{nethercote_2004}.
The advantages of this simulated approach are manifold. Firstly, the bulk
of the code can be hardware agnostic, avoiding architecture quirks such as
whether the architecture provides direct access to the cache controller or
abstracts it completely from the programmer. Secondly, any attempt to
improve the locality of the code is based on a simulation, and will
therefore likely improve the locality of the code on any architecture,
without relying on hardware-specific cache behaviour \cite{weidendorfer_2004}.
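To make this concrete, the sketch below shows the kind of per-reference
bookkeeping such a simulator performs: each memory reference seen by the
instrumentation is classified as a read or a write, the cache model is
probed, and a hit or a miss is tallied. This is an illustrative sketch
only; the names (\texttt{ref\_counts}, \texttt{simulate\_ref},
\texttt{cache\_model\_access}) are hypothetical and do not correspond to
Valgrind's actual source, and the cache model itself is stubbed out here
(later sections sketch what such a model looks like).
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cache counters: total references and misses,
   split into reads and writes. */
typedef struct {
    uint64_t reads,  read_misses;
    uint64_t writes, write_misses;
} ref_counts;

/* Placeholder for the cache model itself (tag-store lookup and
   update); stubbed out here so the sketch stays self-contained. */
static bool cache_model_access(uint64_t addr, unsigned size)
{
    (void)addr; (void)size;
    return false;              /* stub: treat every access as a miss */
}

/* Called once for each data reference in the instrumented
   instruction stream. */
void simulate_ref(ref_counts *n, uint64_t addr, unsigned size,
                  bool is_write)
{
    bool hit = cache_model_access(addr, size);
    if (is_write) {
        n->writes++;
        if (!hit) n->write_misses++;
    } else {
        n->reads++;
        if (!hit) n->read_misses++;
    }
}
\end{verbatim}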
\subsection{Introduction to x86\_64 and ARM64}

\subsubsection{x86\_64}
x86 is the all-encompassing name for the backwards-compatible family of
architectures that began with the Intel 8086 CPU. However, most modern
CPUs are now 64-bit, so this review will focus only on the AMD
specification of an x86\_64 implementation. I am choosing to support this
architecture as part of this project because it can build upon the work of
Kaparelos \cite{kaparelos_2014}, who implemented the required changes for
the x86\_64 architecture, although that work was not merged into the main
project. Having read his paper, I can start by implementing the changes in
the well-verified x86 environment. In addition, Kaparelos' initial
justification for targeting x86 also applies here: the architecture is the
most commonly used in consumer PCs and will therefore provide some of the
most benefit.\\
The 64-bit extension of x86 that is now widespread, as opposed to Intel's
separate Itanium (IA-64) architecture, was developed by AMD
\cite{forum_2017}. The reason it was so successful, unlike the IA-64 ISA,
was that it retained backwards compatibility with the 32-bit
implementations of x86, so old programs could still be run in a
compatibility mode. There are now both Intel and AMD implementations of
the ISA, but they are near identical, and most compilers will produce code
that avoids any differences; this is especially true for application-level
programs, as there is little to no chance that they will make use of the
kinds of instructions that differ. This paper will therefore consider the
two implementations as identical and produce work with the intention that
it will run on both AMD and Intel branded CPUs.
\subsubsection{ARM64}
ARM64 is the common name for AArch64, the 64-bit execution state of the
ARMv8-A architecture specification from ARM Holdings. The main motivation
for developing the Cachegrind extension for the ARM architecture is the
Isambard \cite{isambard_2017} supercomputer project, a world-first
supercomputer built from ARM-based SoCs with three levels of cache.
Accurate cache profiling results for all levels of cache are important so
that the team can optimise the programs running on Isambard to use the CPU
caches efficiently, saving CPU cycles and ensuring that the machine can
run as many jobs as possible. \todo{Write more about ARM64}
\section{CPU Caches}
A CPU cache is a (relatively) small, fast piece of memory placed between
the CPU and main memory to address the disparity between CPU frequency and
memory frequency. When attempting to access data at a given address, the
processor will first look for a match in the cache. However, because the
cache is small, there must be some way of deciding where data should be
stored in it; we cannot feasibly store all possible data in the cache, as
that would simply replicate main memory. We can take advantage of the
principles of temporal locality and spatial locality, which respectively
refer to ``the tendency to reuse recently accessed data items''
\cite{hennessy_2011} and ``the tendency to reference data items that are
close to other recently referenced items'' \cite{hennessy_2011}. Because
programs generally exhibit these properties, we do not actually need to
store all possible memory locations to gain an increase in speed. To find
a suitable mapping, one can split the address into an upper and a lower
portion, where the upper part is stored in a tag store and the lower part
gives the offset within some arbitrarily sized cache line. When data is
found in the cache, it is referred to as a cache hit; when the data is not
found and a request must be sent to memory, it is referred to as a cache
miss. As a request to memory comes with a penalty of around 240 cycles
\cite{drepper_2007}, it is important to keep misses to a minimum.
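As a concrete illustration, the C sketch below splits a 64-bit address
into the tag and line-offset fields described above, assuming a 64-byte
cache line; the line size and the address used are example values only,
not a claim about any particular processor.
\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Example geometry: a 64-byte cache line, so the low 6 bits of an
   address select the byte within the line and the remaining upper
   bits form the tag held in the tag store. */
#define LINE_SIZE   64u
#define OFFSET_BITS 6u            /* log2(LINE_SIZE) */

int main(void)
{
    uint64_t addr   = 0x7ffe12345678u;        /* arbitrary example address */
    uint64_t offset = addr & (LINE_SIZE - 1); /* lower part: byte in line  */
    uint64_t tag    = addr >> OFFSET_BITS;    /* upper part: stored as tag */

    printf("tag=%#llx offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)offset);
    return 0;
}
\end{verbatim}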
\subsection{Cache Placement Policies}
\subsubsection{Fully Associative}
A very simple cache design is one where any block of memory may occupy any
slot in the cache. This is a purely associative memory that accepts new
key/data pairs until it is full. To place a new block of memory into the
cache, one simply finds an invalid cache line and fills it with the data,
updating the tag store to record the upper address bits that now occupy
that line. If the cache is full, a line is evicted according to some
replacement policy, for example LRU. This is rarely, if ever, a good
solution, as each lookup requires checking every element in the cache to
find a matching tag, which takes up a great deal of area and power when
implemented in silicon.
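A minimal sketch of such a lookup is given below, following the sequential
search described above (real hardware compares all the tags in parallel);
the capacity and names are illustrative assumptions, not a description of
any particular cache.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 512          /* example capacity: 512 lines */

typedef struct {
    bool     valid;
    uint64_t tag;              /* upper address bits of the resident line */
} cache_line;

static cache_line lines[NUM_LINES];

/* Fully associative lookup: any line may hold any tag, so every
   entry must be examined to decide hit or miss. */
bool fa_lookup(uint64_t tag)
{
    for (int i = 0; i < NUM_LINES; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return true;       /* hit */
    return false;              /* miss: caller evicts and installs a line */
}
\end{verbatim}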
\subsubsection{Direct Mapped}
Because a fully associative cache is too expensive to implement, and the
miss penalty can be very high (determining a miss requires iterating the
whole tag store), a better way to map cache lines to a limited store is to
split the upper part of the address further: the tag still determines
whether there is a match in the cache, but a middle slice of bits is used
as an index. When the cache is divided into ``sets'' of a single cache
line each, the index determines which \textit{set} a line will be
installed in. This also simplifies the eviction policy: as each cache line
can only map to a single \textit{set}, when there is a conflict the line
occupying the set is simply evicted and the new line installed. This is
advantageous because it makes finding whether a given address is present
in the cache much faster and easier and, as mentioned above, makes
replacing lines in the cache simpler too. Unfortunately it introduces its
own problem: ``If your program makes repeated reference to two data items
that happen to share the same cache location (presumably because the low
bits of their addresses happen to be close together), then the two data
items will keep pushing each other out of the cache and efficiency will
fall drastically'' \cite{sweetman_2007}.
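The sketch below illustrates a direct-mapped lookup with example
parameters only (64-byte lines, 1024 single-line sets): the index bits
select exactly one line, a single tag comparison decides hit or miss, and
on a miss the resident line is simply overwritten.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE   64u        /* example: 64-byte lines                */
#define OFFSET_BITS 6u         /* log2(LINE_SIZE)                       */
#define INDEX_BITS  10u        /* 1024 sets, one line per set (example) */
#define NUM_SETS    (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    uint64_t tag;
} dm_line;

static dm_line sets[NUM_SETS];

/* Direct-mapped access: the index selects exactly one line, so only a
   single tag comparison is needed.  On a miss the resident line is
   evicted unconditionally and replaced. */
bool dm_access(uint64_t addr)
{
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    if (sets[index].valid && sets[index].tag == tag)
        return true;                      /* hit */

    sets[index].valid = true;             /* miss: install the new line */
    sets[index].tag   = tag;
    return false;
}
\end{verbatim}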
\subsubsection{Set Associative}
The final type of cache seeks to remedy this thrashing by further
splitting the sets within the cache to produce an $n \times m$ cache, with
$n$ sets and $m$ ways within each set. This is a compromise between the
direct-mapped and fully associative designs. As with the direct-mapped
design, the cache line address is split into a tag, an index, and the
offset, and the index determines which set the cache line belongs to;
however, there are now multiple \textit{ways} in which the line could be
installed, allowing for $m$ cache lines per set. This increases the hit
rate, as we do not necessarily have to evict a line just because another
line maps to the same set. However, if all ways in a set are occupied, we
have to choose one to evict, and this can be done using an arbitrary
eviction policy, as with fully associative designs. The downsides to this
are that: ``Compared with a direct-mapped cache, a set-associative cache
requires many more bus connections between the cache memory and
controller. That means that caches too big to integrate onto a single chip
are much easier to build direct mapped. More subtly, because the
direct-mapped cache has only one possible candidate for the data you need,
it's possible to keep the CPU running ahead of the tag check (so long as
the CPU does not do anything irrevocable based on the data). Simplicity
and running ahead can translate to a faster clock rate''
\cite{sweetman_2007}.
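The sketch below extends the direct-mapped example to an $m$-way
set-associative lookup with a simple LRU eviction policy; the geometry
(128 sets, 8 ways, 64-byte lines) and the names are illustrative
assumptions only.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 6u         /* 64-byte lines (example)      */
#define INDEX_BITS  7u         /* 128 sets (example)           */
#define NUM_SETS    (1u << INDEX_BITS)
#define NUM_WAYS    8u         /* 8-way set associative        */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;        /* timestamp for LRU eviction   */
} way;

static way cache[NUM_SETS][NUM_WAYS];
static uint64_t now;           /* monotonic access counter     */

/* n x m lookup: the index selects a set, then all m ways in that set
   are checked.  On a miss the least recently used way is evicted and
   refilled with the new line. */
bool sa_access(uint64_t addr)
{
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    way *set = cache[index];
    unsigned victim = 0;

    now++;
    for (unsigned w = 0; w < NUM_WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = now;           /* hit */
            return true;
        }
        if (set[w].last_used < set[victim].last_used)
            victim = w;                       /* track LRU candidate */
    }
    set[victim] = (way){ .valid = true, .tag = tag, .last_used = now };
    return false;                             /* miss */
}
\end{verbatim}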
\subsection{Split vs Unified Caches}
The primary goal of the cache is to optimise the memory accesses of the
CPU to prevent pipeline stalls \cite{sweetman_2007}. For this reason it
can be advantageous to split the instruction and data caches, usually in
the L1 only, as misses there will pay a larger penalty anyway. This is
because, due to the nature of pipelines, the CPU can be fetching data for
one instruction and the instruction for another at the exact same time
\cite{smith_1982}. Splitting the cache into an instruction cache, which
can be optimised for infrequent writes and mostly reads, and a data cache,
which can be optimised for the general use case, doubles the available
cache bandwidth, although it can have a detrimental effect on miss rate.
Nevertheless, this remains the prevailing strategy for low-level caches
(i.e.\ level 1), as it allows hardware designers to place the caches close
to the logic that requires them, cutting down on precious nanoseconds.
\subsection{Multilevel Caches}
Another common method employed to reduce the miss penalty is having
multiple levels of cache. This is the focus of this dissertation, as
Cachegrind will be extended to fully support all three levels of cache
present on the Isambard supercomputer.
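To illustrate why modelling every level matters, the sketch below (a
simplification, not Cachegrind's actual model or Isambard's hardware)
chains three cache levels: a reference is only forwarded to the next level
when it misses, so per-level hit and miss counts can only be reported if
each level is modelled explicitly. Each level is kept direct mapped here
purely for brevity.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define MAX_SETS 65536u        /* enough for the largest example level */

typedef struct {
    unsigned  sets;                    /* power of two, <= MAX_SETS    */
    bool      valid[MAX_SETS];
    uint64_t  tag[MAX_SETS];           /* line address held per set    */
    uint64_t  accesses, misses;
} level;

/* Direct-mapped probe of one level; on a miss the line is refilled
   here so that a later access can hit. */
static bool level_access(level *lvl, uint64_t line)
{
    unsigned idx = (unsigned)(line & (lvl->sets - 1));

    lvl->accesses++;
    if (lvl->valid[idx] && lvl->tag[idx] == line)
        return true;                   /* hit at this level */
    lvl->valid[idx] = true;            /* miss: refill      */
    lvl->tag[idx]   = line;
    lvl->misses++;
    return false;
}

/* A reference walks L1 -> L2 -> L3 and only reaches main memory if it
   misses at every level; modelling all three levels is what allows
   per-level hit/miss counts to be reported. */
void memory_access(level *l1, level *l2, level *l3, uint64_t addr)
{
    uint64_t line = addr >> 6;         /* 64-byte lines (example) */

    if (level_access(l1, line)) return;
    if (level_access(l2, line)) return;
    if (level_access(l3, line)) return;
    /* miss in all three levels: the access goes to main memory */
}
\end{verbatim}
A caller would create one \texttt{level} per cache, with \texttt{sets}
matching each level's geometry, and invoke \texttt{memory\_access} for
every reference in the trace.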

\newpage
\bibliography{literature_review}
\bibliographystyle{ieeetr}
\end{document}