\documentclass[12pt]{article}
\usepackage{url}
\usepackage{todonotes}
\title{Modelling L2 and L3 CPU caches separately in
Cachegrind for both x86\_64 and ARM64 architectures - Literature Review}
\date{2017-11-18}
\author{Matt Coles - mc930}
\begin{document}
\pagenumbering{gobble}
\maketitle
\tableofcontents
\listoftodos
\newpage
\pagenumbering{arabic}
\section{Introduction}
To understand the scope of this problem, a few areas are of interest. One
must first understand how a CPU cache uses associative memory to provide
fast access to data, and then how processor architects have split those
caches into multiple levels and why that matters. Fast data caches are
essential to the performance of the instruction pipeline and therefore, by
extension, to the programs running on the CPU. There are also differing
styles of cache, including the simpler direct-mapped cache and the more
modern n-way set-associative cache, and when a processor has multiple
levels of cache, designers may choose to use more than one of these types.
Since the additional levels of cache usually sit further from the core,
there is also a disparity in the speed of access between the levels.\\
In addition, as the project is to extend an existing cache profiling tool
so that its results better reflect the actual architecture of modern
processors, we will also investigate other program profiling tools, how
they might be used to produce similar results, and how they take different
approaches to profiling cache usage.
\subsection{Introducing Cachegrind}
A popular tool for profiling programs is Valgrind, a dynamic binary
instrumentation framework for building dynamic binary analysis tools.
Valgrind, as of 2017, consists of six `production quality' tools, one of
which is Cachegrind. Cachegrind profiles cache usage by running the
program against a model of the cache and recording whether each piece of
data would or would not be present in the cache, and can therefore count
cache hits and misses. For this to work on a given architecture, the
instructions of that architecture must be categorised according to whether
they read memory, write memory, do both, or do neither. Cachegrind can
then monitor the instruction stream and update the cache model as the
machine (likely) would \cite{nethercote_2004}.
The advantages of this simulated approach are manifold. Firstly, the bulk
of the code can be hardware agnostic, avoiding architecture quirks such as
whether the architecture provides direct access to the cache controller or
abstracts it completely from the programmer. Secondly, any attempt to
improve the locality of the code is based on a simulation, and will
therefore likely improve the locality of the code on any architecture,
without relying on hardware-specific cache behaviour \cite{weidendorfer_2004}.
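To make this concrete, the sketch below shows the kind of per-reference
bookkeeping such a simulator performs: each memory reference seen by the
instrumentation is classified as a read or a write, the cache model is
probed, and a hit or a miss is tallied. This is an illustrative sketch
only; the names (\texttt{ref\_counts}, \texttt{simulate\_ref},
\texttt{cache\_model\_access}) are hypothetical and do not correspond to
Valgrind's actual source, and the cache model itself is stubbed out here
(later sections sketch what such a model looks like).
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cache counters: total references and misses,
   split into reads and writes. */
typedef struct {
    uint64_t reads,  read_misses;
    uint64_t writes, write_misses;
} ref_counts;

/* Placeholder for the cache model itself (tag-store lookup and
   update); stubbed out here so the sketch stays self-contained. */
static bool cache_model_access(uint64_t addr, unsigned size)
{
    (void)addr; (void)size;
    return false;              /* stub: treat every access as a miss */
}

/* Called once for each data reference in the instrumented
   instruction stream. */
void simulate_ref(ref_counts *n, uint64_t addr, unsigned size,
                  bool is_write)
{
    bool hit = cache_model_access(addr, size);
    if (is_write) {
        n->writes++;
        if (!hit) n->write_misses++;
    } else {
        n->reads++;
        if (!hit) n->read_misses++;
    }
}
\end{verbatim}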
\subsection{Introduction to x86\_64 and ARM64}

\subsubsection{x86\_64}
x86 is the all-encompassing name for the backwards-compatible family of
architectures that began with the Intel 8086 CPU. However, most modern
CPUs are now 64-bit, so this review will focus only on the AMD
specification of an x86\_64 implementation. I am choosing to support this
architecture as part of this project because it can build upon the work of
Kaparelos \cite{kaparelos_2014}, who implemented the required changes for
the x86\_64 architecture, although that work was not merged into the main
project. Having read his paper, I can start by implementing the changes in
the well-verified x86 environment. In addition, Kaparelos' initial
justification for targeting x86 also applies here: the architecture is the
most commonly used in consumer PCs and will therefore provide some of the
most benefit.\\
The 64-bit extension of x86 that is now widespread, as opposed to Intel's
separate Itanium (IA-64) architecture, was developed by AMD
\cite{forum_2017}. The reason it was so successful, unlike the IA-64 ISA,
was that it retained backwards compatibility with the 32-bit
implementations of x86, so old programs could still be run in a
compatibility mode. There are now both Intel and AMD implementations of
the ISA, but they are near identical, and most compilers will produce code
that avoids any differences; this is especially true for application-level
programs, as there is little to no chance that they will make use of the
kinds of instructions that differ. This paper will therefore consider the
two implementations as identical and produce work with the intention that
it will run on both AMD and Intel branded CPUs.
\subsubsection{ARM64}
ARM64 is the common name for AArch64, the 64-bit execution state of the
ARMv8-A architecture specification from ARM Holdings. The main motivation
for developing the Cachegrind extension for the ARM architecture is the
Isambard \cite{isambard_2017} supercomputer project, a world-first
supercomputer built from ARM-based SoCs with three levels of cache.
Accurate cache profiling results for all levels of cache are important so
that the team can optimise the programs running on Isambard to use the CPU
caches efficiently, saving CPU cycles and ensuring that the machine can
run as many jobs as possible. \todo{Write more about ARM64}
\section{CPU Caches}
A CPU cache is a (relatively) small, fast piece of memory placed between
the CPU and main memory to address the disparity between CPU frequency and
memory frequency. When attempting to access data at a given address, the
processor will first look for a match in the cache. However, because the
cache is small, there must be some way of deciding where data should be
stored in it; we cannot feasibly store all possible data in the cache, as
that would simply replicate main memory. We can take advantage of the
principles of temporal locality and spatial locality, which respectively
refer to ``the tendency to reuse recently accessed data items''
\cite{hennessy_2011} and ``the tendency to reference data items that are
close to other recently referenced items'' \cite{hennessy_2011}. Because
programs generally exhibit these properties, we do not actually need to
store all possible memory locations to gain an increase in speed. To find
a suitable mapping, one can split the address into an upper and a lower
portion, where the upper part is stored in a tag store and the lower part
gives the offset within some arbitrarily sized cache line. When data is
found in the cache, it is referred to as a cache hit; when the data is not
found and a request must be sent to memory, it is referred to as a cache
miss. As a request to memory comes with a penalty of around 240 cycles
\cite{drepper_2007}, it is important to keep misses to a minimum.
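As a concrete illustration, the C sketch below splits a 64-bit address
into the tag and line-offset fields described above, assuming a 64-byte
cache line; the line size and the address used are example values only,
not a claim about any particular processor.
\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Example geometry: a 64-byte cache line, so the low 6 bits of an
   address select the byte within the line and the remaining upper
   bits form the tag held in the tag store. */
#define LINE_SIZE   64u
#define OFFSET_BITS 6u            /* log2(LINE_SIZE) */

int main(void)
{
    uint64_t addr   = 0x7ffe12345678u;        /* arbitrary example address */
    uint64_t offset = addr & (LINE_SIZE - 1); /* lower part: byte in line  */
    uint64_t tag    = addr >> OFFSET_BITS;    /* upper part: stored as tag */

    printf("tag=%#llx offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)offset);
    return 0;
}
\end{verbatim}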
\subsection{Cache Placement Policies}
\subsubsection{Fully Associative}
A very simple cache design is one where any block of memory may occupy any
slot in the cache. This is a purely associative memory that accepts new
key/data pairs until it is full. To place a new block of memory into the
cache, one simply finds an invalid cache line and fills it with the data,
updating the tag store to record the upper address bits that now occupy
that line. If the cache is full, a line is evicted according to some
replacement policy, for example LRU. This is rarely, if ever, a good
solution, as each lookup requires checking every element in the cache to
find a matching tag, which takes up a great deal of area and power when
implemented in silicon.
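A minimal sketch of such a lookup is given below, following the sequential
search described above (real hardware compares all the tags in parallel);
the capacity and names are illustrative assumptions, not a description of
any particular cache.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 512          /* example capacity: 512 lines */

typedef struct {
    bool     valid;
    uint64_t tag;              /* upper address bits of the resident line */
} cache_line;

static cache_line lines[NUM_LINES];

/* Fully associative lookup: any line may hold any tag, so every
   entry must be examined to decide hit or miss. */
bool fa_lookup(uint64_t tag)
{
    for (int i = 0; i < NUM_LINES; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return true;       /* hit */
    return false;              /* miss: caller evicts and installs a line */
}
\end{verbatim}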
\subsubsection{Direct Mapped}
Because a fully associative cache is too expensive to implement, and the
miss penalty can be very high (determining a miss requires iterating the
whole tag store), a better way to map cache lines to a limited store is to
split the upper part of the address further: the tag still determines
whether there is a match in the cache, but a middle slice of bits is used
as an index. When the cache is divided into ``sets'' of a single cache
line each, the index determines which \textit{set} a line will be
installed in. This also simplifies the eviction policy: as each cache line
can only map to a single \textit{set}, when there is a conflict the line
occupying the set is simply evicted and the new line installed. This is
advantageous because it makes finding whether a given address is present
in the cache much faster and easier and, as mentioned above, makes
replacing lines in the cache simpler too. Unfortunately it introduces its
own problem: ``If your program makes repeated reference to two data items
that happen to share the same cache location (presumably because the low
bits of their addresses happen to be close together), then the two data
items will keep pushing each other out of the cache and efficiency will
fall drastically'' \cite{sweetman_2007}.
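The sketch below illustrates a direct-mapped lookup with example
parameters only (64-byte lines, 1024 single-line sets): the index bits
select exactly one line, a single tag comparison decides hit or miss, and
on a miss the resident line is simply overwritten.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE   64u        /* example: 64-byte lines                */
#define OFFSET_BITS 6u         /* log2(LINE_SIZE)                       */
#define INDEX_BITS  10u        /* 1024 sets, one line per set (example) */
#define NUM_SETS    (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    uint64_t tag;
} dm_line;

static dm_line sets[NUM_SETS];

/* Direct-mapped access: the index selects exactly one line, so only a
   single tag comparison is needed.  On a miss the resident line is
   evicted unconditionally and replaced. */
bool dm_access(uint64_t addr)
{
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    if (sets[index].valid && sets[index].tag == tag)
        return true;                      /* hit */

    sets[index].valid = true;             /* miss: install the new line */
    sets[index].tag   = tag;
    return false;
}
\end{verbatim}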
\subsubsection{Set Associative}
The final type of cache seeks to remedy this thrashing by further
splitting the sets within the cache to produce an $n \times m$ cache, with
$n$ sets and $m$ ways within each set. This is a compromise between the
direct-mapped and fully associative designs. As with the direct-mapped
design, the cache line address is split into a tag, an index, and the
offset, and the index determines which set the cache line belongs to;
however, there are now multiple \textit{ways} in which the line could be
installed, allowing for $m$ cache lines per set. This increases the hit
rate, as we do not necessarily have to evict a line just because another
line maps to the same set. However, if all ways in a set are occupied, we
have to choose one to evict, and this can be done using an arbitrary
eviction policy, as with fully associative designs. The downsides to this
are that: ``Compared with a direct-mapped cache, a set-associative cache
requires many more bus connections between the cache memory and
controller. That means that caches too big to integrate onto a single chip
are much easier to build direct mapped. More subtly, because the
direct-mapped cache has only one possible candidate for the data you need,
it's possible to keep the CPU running ahead of the tag check (so long as
the CPU does not do anything irrevocable based on the data). Simplicity
and running ahead can translate to a faster clock rate''
\cite{sweetman_2007}.
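The sketch below extends the direct-mapped example to an $m$-way
set-associative lookup with a simple LRU eviction policy; the geometry
(128 sets, 8 ways, 64-byte lines) and the names are illustrative
assumptions only.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 6u         /* 64-byte lines (example)      */
#define INDEX_BITS  7u         /* 128 sets (example)           */
#define NUM_SETS    (1u << INDEX_BITS)
#define NUM_WAYS    8u         /* 8-way set associative        */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;        /* timestamp for LRU eviction   */
} way;

static way cache[NUM_SETS][NUM_WAYS];
static uint64_t now;           /* monotonic access counter     */

/* n x m lookup: the index selects a set, then all m ways in that set
   are checked.  On a miss the least recently used way is evicted and
   refilled with the new line. */
bool sa_access(uint64_t addr)
{
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    way *set = cache[index];
    unsigned victim = 0;

    now++;
    for (unsigned w = 0; w < NUM_WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = now;           /* hit */
            return true;
        }
        if (set[w].last_used < set[victim].last_used)
            victim = w;                       /* track LRU candidate */
    }
    set[victim] = (way){ .valid = true, .tag = tag, .last_used = now };
    return false;                             /* miss */
}
\end{verbatim}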
\subsection{Split vs Unified Caches}
The primary goal of the cache is to optimise the memory accesses of the
CPU to prevent pipeline stalls \cite{sweetman_2007}. For this reason it
can be advantageous to split the instruction and data caches, usually in
the L1 only, as misses there will pay a larger penalty anyway. This is
because, due to the nature of pipelines, the CPU can be fetching data for
one instruction and the instruction for another at the exact same time
\cite{smith_1982}. Splitting the cache into an instruction cache, which
can be optimised for infrequent writes and mostly reads, and a data cache,
which can be optimised for the general use case, doubles the available
cache bandwidth, although it can have a detrimental effect on miss rate.
Nevertheless, this remains the prevailing strategy for low-level caches
(i.e.\ level 1), as it allows hardware designers to place the caches close
to the logic that requires them, cutting down on precious nanoseconds.
\subsection{Multilevel Caches}
Another common method employed to reduce the miss penalty is having
multiple levels of cache. This is the focus of this dissertation, as
Cachegrind will be extended to fully support all three levels of cache
present on the Isambard supercomputer.
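To illustrate why modelling every level matters, the sketch below (a
simplification, not Cachegrind's actual model or Isambard's hardware)
chains three cache levels: a reference is only forwarded to the next level
when it misses, so per-level hit and miss counts can only be reported if
each level is modelled explicitly. Each level is kept direct mapped here
purely for brevity.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define MAX_SETS 65536u        /* enough for the largest example level */

typedef struct {
    unsigned  sets;                    /* power of two, <= MAX_SETS    */
    bool      valid[MAX_SETS];
    uint64_t  tag[MAX_SETS];           /* line address held per set    */
    uint64_t  accesses, misses;
} level;

/* Direct-mapped probe of one level; on a miss the line is refilled
   here so that a later access can hit. */
static bool level_access(level *lvl, uint64_t line)
{
    unsigned idx = (unsigned)(line & (lvl->sets - 1));

    lvl->accesses++;
    if (lvl->valid[idx] && lvl->tag[idx] == line)
        return true;                   /* hit at this level */
    lvl->valid[idx] = true;            /* miss: refill      */
    lvl->tag[idx]   = line;
    lvl->misses++;
    return false;
}

/* A reference walks L1 -> L2 -> L3 and only reaches main memory if it
   misses at every level; modelling all three levels is what allows
   per-level hit/miss counts to be reported. */
void memory_access(level *l1, level *l2, level *l3, uint64_t addr)
{
    uint64_t line = addr >> 6;         /* 64-byte lines (example) */

    if (level_access(l1, line)) return;
    if (level_access(l2, line)) return;
    if (level_access(l3, line)) return;
    /* miss in all three levels: the access goes to main memory */
}
\end{verbatim}
A caller would create one \texttt{level} per cache, with \texttt{sets}
matching each level's geometry, and invoke \texttt{memory\_access} for
every reference in the trace.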

\newpage
\bibliography{literature_review}
\bibliographystyle{ieeetr}
\end{document}