\documentclass[12pt]{article}
\usepackage{url}
\usepackage{todonotes}

\title{Modelling L2 and L3 CPU caches separately in
Cachegrind for both x86\_64 and ARM64 architectures - Literature Review}
\date{2017-11-18}
\author{Matt Coles - mc930}

\begin{document}
 \pagenumbering{gobble}
 \maketitle
 \tableofcontents
 \listoftodos
 \newpage
 \pagenumbering{arabic}

 \section{Introduction}

 In order to understand the scope of this problem, there are a few areas of
 interest. To begin with, one must understand how a CPU cache makes use of
 associative memory to provide fast access to data, and then how processor
 architects have split those caches into multiple levels and why those levels
 matter. Fast data caches are essential to the performance of the instruction
 pipeline and, by extension, to the programs running on the CPU. There are
 also differing styles of cache, including the simpler direct-mapped cache and
 the more modern n-way set-associative cache, and when a design has multiple
 levels of cache, its designers may choose a different style for each level.
 Since the higher levels of cache sit further from the core, there is also a
 disparity in the speed of access to each level.

 In addition, as the project is to extend an existing cache profiling tool so
 that its results better reflect the actual architecture of modern processors,
 we will also investigate other program profiling tools, whether they could be
 used to produce similar results, and where they take a different approach to
 profiling cache usage.

 \subsection{Introducing Cachegrind}

 A popular tool for profiling programs is Valgrind, a dynamic binary
 instrumentation framework for building dynamic binary analysis tools. As of
 2017, Valgrind consists of six `production quality' tools, one of which is
 Cachegrind. Cachegrind profiles cache usage by running the program against a
 software model of the cache and recording whether each piece of data would or
 would not have been present in the cache, and it can therefore count cache
 hits and misses. For this to work on a given architecture, the architecture's
 instructions must be categorised according to whether they read memory, write
 memory, do both, or do neither. Cachegrind can then monitor the instruction
 stream and update the cache model as the machine (likely) would
 \cite{nethercote_2004}. The advantages of this simulated approach are
 manifold. Firstly, the bulk of the code can be hardware agnostic, avoiding
 architectural quirks such as whether the architecture provides direct access
 to the cache controller or abstracts it away from the programmer entirely. In
 addition, any attempt to improve the locality of the code is based on a
 simulated environment, so the improvement is likely to carry over to any
 architecture rather than relying on hardware-specific cache behaviour
 \cite{weidendorfer_2004}.

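 To make the simulation idea concrete, the sketch below shows roughly what a
 per-reference cache-model update might look like. It is only an illustrative
 sketch in C and does not reflect Cachegrind's actual code or data structures;
 the cache parameters, the names, and the direct-mapped organisation are
 assumptions chosen for brevity.

\begin{verbatim}
/* Illustrative sketch only -- not Cachegrind's real implementation.    */
/* A tiny direct-mapped cache model, updated once per memory reference. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_SIZE 64                 /* bytes per cache line (assumed)  */
#define NUM_LINES 512                /* 32 KB cache / 64 B lines        */

typedef struct { bool valid; uint64_t tag; } line_t;

static line_t   cache[NUM_LINES];
static uint64_t hits, misses;

/* The instrumentation would call this for every load/store it sees.   */
static void cache_access(uint64_t addr)
{
    uint64_t block = addr / LINE_SIZE;    /* strip the line offset      */
    uint64_t index = block % NUM_LINES;   /* slot this block maps to    */
    uint64_t tag   = block / NUM_LINES;   /* identifies the block       */

    if (cache[index].valid && cache[index].tag == tag) {
        hits++;                           /* data would be in the cache */
    } else {
        misses++;                         /* would have to go further   */
        cache[index].valid = true;        /* install the new line       */
        cache[index].tag   = tag;
    }
}

int main(void)
{
    /* Replay a made-up reference stream: a simple sequential scan.     */
    for (uint64_t a = 0; a < 4096; a += 8)
        cache_access(a);
    printf("hits=%llu misses=%llu\n",
           (unsigned long long)hits, (unsigned long long)misses);
    return 0;
}
\end{verbatim}

 In practice one simply runs the existing tool, e.g.\ \texttt{valgrind
 --tool=cachegrind ./myprogram}; the sketch is only meant to show the point at
 which a miss would be passed on to a model of the next cache level.
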
 \subsection{Introduction to x86\_64 and ARM64}

 \subsubsection{x86\_64}
 x86 is the all-encompassing name for the backwards-compatible family of
 architectures that began with the Intel 8086 CPU. However, most modern CPUs
 are now 64 bit, so this review will focus only on the AMD specification for
 an x86\_64 implementation. I am choosing to support this architecture as part
 of this project because it can build upon the work of Kaparelos
 \cite{kaparelos_2014}, which implements the required changes for the x86\_64
 architecture but was not merged into the main project; having read his paper,
 I can start by implementing the changes in the well-verified x86 environment.
 In addition, Kaparelos's initial justification for an x86 port of this work
 was that the architecture is the most commonly used in consumer PCs and will
 therefore provide some of the most benefit.

 The 64 bit extension of x86 that is now widespread, disregarding Intel's
 separate Itanium (IA-64) architecture, was developed by AMD \cite{forum_2017}.
 The reason it was so much more successful than the original IA-64 ISA is that
 it retained backwards compatibility with the 32 bit implementations of x86,
 so old programs could still be run in a compatibility mode. There are now
 both Intel and AMD implementations of the ISA, but they are near identical
 and most compilers will produce code that avoids any differences; this is
 especially true for application-level programs, as there is little to no
 chance that they will make use of the few instructions that differ. This
 paper will therefore treat the two implementations as identical and produce
 work with the intention that it will run on both AMD and Intel branded CPUs.

 \subsubsection{ARM64}
 ARM64 is the name for the optional 64 bit extension to the ARMv8-A
 architecture specification from ARM Holdings. The main motivation for
 developing the Cachegrind extension for the ARM architecture is the Isambard
 supercomputer project \cite{isambard_2017}, a world-first supercomputer built
 from ARM-based SoCs with three levels of cache. Accurate cache profiling
 results for all levels of cache matter here so that the team can optimise the
 programs running on Isambard to use the CPU caches efficiently, saving CPU
 cycles and ensuring that the machine can run as many jobs as possible.
 \todo{Write more about ARM64}

 \section{CPU Caches}

 A CPU cache is a (relatively) small, fast piece of memory placed between the
 CPU bus and the main memory port to address the disparity between CPU
 frequency and memory frequency. When attempting to access data at a given
 address, the processor will first look for a match in the cache. However,
 because the cache is small, there must be some way of deciding where data
 should be stored in it; we cannot feasibly store all possible data in the
 cache, as that would simply replicate main memory. We can instead take
 advantage of the principles of temporal locality and spatial locality, which
 respectively refer to ``the tendency to reuse recently accessed data items''
 \cite{hennessy_2011} and ``the tendency to reference data items that are
 close to other recently referenced items'' \cite{hennessy_2011}. Because
 programs running on a processor generally exhibit these properties, we do not
 need to store all possible memory locations to get an increase in speed. To
 find a suitable mapping, one can split the address into an upper and a lower
 portion, where the upper part is stored in a tag store and the lower part
 gives the offset within an arbitrarily sized cache line. When data is found
 in the cache, it is referred to as a cache hit; when the data is not found
 and a request must be sent to memory, it is referred to as a cache miss. As a
 request to memory comes with a penalty of around 240 cycles
 \cite{drepper_2007}, it is important to keep misses to a minimum.

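 To illustrate why the miss rate matters so much, consider a standard average
 memory access time estimate. The hit time and miss rates below are
 illustrative assumptions rather than measured figures; only the roughly
 240 cycle memory penalty comes from the paragraph above:
 \[
   t_{avg} = t_{hit} + r_{miss} \times t_{penalty}
           = 4 + 0.05 \times 240 = 16 \textrm{ cycles},
 \]
 so even a 5\% miss rate quadruples the effective access time of a
 hypothetical 4-cycle cache, while halving the miss rate to 2.5\% brings the
 average back down to 10 cycles.
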
 \subsection{Cache Placement Policies}

 \subsubsection{Fully Associative}
 A very simple cache organisation would be one where any block of memory could
 fill any slot in the cache. This would be a purely associative memory that
 accepts new key/data pairs until it is full. To place a new block of memory
 in the cache, one simply finds an invalid cache line and fills that line with
 the data, updating the tag store to record the upper address bits that now
 occupy the line. In the event that the cache is full, a line is evicted based
 on some replacement policy, for example LRU. This is rarely, if ever, a good
 solution, as each lookup must compare the requested tag against every tag in
 the cache, and performing all of those comparisons takes up a great deal of
 area and power when implemented in silicon.

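 A minimal sketch of such a lookup is shown below (in C, with an assumed
 64-line cache; it is not taken from any of the tools discussed here). In
 software the search is a linear scan; hardware checks every tag at once with
 one comparator per line, which is exactly where the area and power cost
 arises.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_LINES 64        /* small fully associative cache (assumed size) */

typedef struct { bool valid; uint64_t tag; } line_t;
static line_t lines[NUM_LINES];

/* Returns true if the address would hit: every stored tag is examined. */
static bool fa_lookup(uint64_t addr)
{
    uint64_t tag = addr / LINE_SIZE;  /* whole block address is the tag */
    for (int i = 0; i < NUM_LINES; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return true;
    return false;
}

int main(void)
{
    lines[10].valid = true;           /* pretend one block is cached    */
    lines[10].tag   = 0x1234;
    printf("%s\n", fa_lookup(0x1234 * LINE_SIZE + 8) ? "hit" : "miss");
    return 0;
}
\end{verbatim}
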
 \subsubsection{Direct Mapped}
 Because a fully associative cache is too expensive to implement, and the cost
 of a lookup can be very high (every entry in the tag store must be checked),
 a better way to map cache lines to a limited store is to split the upper part
 of the address further, keeping the tag as a way of determining whether there
 is a match in the cache but using the middle slice of bits as an index. When
 the cache is divided into ``sets'' of a single cache line each, the index
 determines which \textit{set} a line will be installed in. This also
 simplifies the eviction policy: since each memory block can only map to a
 single \textit{set}, whenever there is a conflict the line occupying that set
 is simply evicted and the new line installed. This is advantageous because it
 makes checking whether a given address is present in the cache much faster
 and, as mentioned above, makes replacing lines simpler too. Unfortunately it
 introduces its own problem: ``If your program makes repeated reference to two
 data items that happen to share the same cache location (presumably because
 the low bits of their addresses happen to be close together), then the two
 data items will keep pushing each other out of the cache and efficiency will
 fall drastically'' \cite{sweetman_2007}.

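 As a worked illustration of this address split (the cache geometry here is an
 assumption, not taken from any particular processor), a 32 KB direct-mapped
 cache with 64 byte lines has 512 sets, giving 6 offset bits and 9 index bits.
 Two addresses that differ only in their tag bits land in the same set and
 will keep evicting one another: exactly the thrashing described in the
 quotation above.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 32 KB direct-mapped cache, 64-byte lines -> 512 sets. */
#define OFFSET_BITS 6                        /* log2(64 bytes per line)  */
#define INDEX_BITS  9                        /* log2(512 sets)           */

static void split(uint64_t addr)
{
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr=0x%-8llx tag=0x%-4llx index=%3llu offset=%2llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)index, (unsigned long long)offset);
}

int main(void)
{
    uint64_t a = 0x12340;                    /* two addresses 32 KB apart    */
    uint64_t b = a + (1u << (OFFSET_BITS + INDEX_BITS));
    split(a);                                /* same index and offset but a  */
    split(b);                                /* different tag, so in a       */
    return 0;                                /* direct-mapped cache they     */
}                                            /* evict each other repeatedly  */
\end{verbatim}
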
 \subsubsection{Set Associative}
 The final type of cache seeks to remedy this thrashing by further splitting
 the cache to produce an $n \times m$ arrangement, with $n$ sets and $m$ ways
 within each set. This is a compromise between the direct-mapped and fully
 associative designs. As with the direct-mapped design, the address is split
 into a tag, an index, and an offset, and the index determines which set a
 cache line belongs to; however, in this design there are multiple
 \textit{ways} in which the line could be installed, allowing $m$ cache lines
 per set. This increases the hit rate, as a line no longer has to be evicted
 simply because another line maps to the same set. However, when all ways in a
 set are occupied, one must be chosen for eviction, which can be done with an
 arbitrary eviction policy, as in fully associative designs. The downside is
 that ``Compared with a direct-mapped cache, a set-associative cache requires
 many more bus connections between the cache memory and controller. That means
 that caches too big to integrate onto a single chip are much easier to build
 direct mapped. More subtly, because the direct-mapped cache has only one
 possible candidate for the data you need, it's possible to keep the CPU
 running ahead of the tag check (so long as the CPU does not do anything
 irrevocable based on the data). Simplicity and running ahead can translate to
 a faster clock rate'' \cite{sweetman_2007}.

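 A minimal sketch of the set-associative lookup is shown below (in C, with an
 assumed 64-set, 4-way geometry and a simple round-robin eviction policy
 standing in for whatever policy a real design would use). The point to note
 is that two blocks mapping to the same set can now coexist, unlike in the
 direct-mapped case above.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed geometry: 64 sets x 4 ways, 64-byte lines. */
#define LINE_SIZE 64
#define NUM_SETS  64
#define NUM_WAYS  4

typedef struct { bool valid; uint64_t tag; } line_t;
static line_t   sets[NUM_SETS][NUM_WAYS];
static unsigned victim[NUM_SETS];          /* round-robin pointer per set */

/* Returns true on a hit; on a miss, installs the line by evicting a way. */
static bool sa_access(uint64_t addr)
{
    uint64_t block = addr / LINE_SIZE;
    uint64_t set   = block % NUM_SETS;
    uint64_t tag   = block / NUM_SETS;

    for (int w = 0; w < NUM_WAYS; w++)         /* check each way in the set */
        if (sets[set][w].valid && sets[set][w].tag == tag)
            return true;

    unsigned w = victim[set];                  /* simple round-robin evict; */
    victim[set] = (victim[set] + 1) % NUM_WAYS;/* real designs often use    */
    sets[set][w].valid = true;                 /* LRU or pseudo-LRU instead */
    sets[set][w].tag   = tag;
    return false;
}

int main(void)
{
    /* Two blocks that map to the same set no longer evict each other.  */
    uint64_t a = 0x10000, b = 0x10000 + NUM_SETS * LINE_SIZE;
    printf("a: %s\n", sa_access(a) ? "hit" : "miss");      /* miss (cold)  */
    printf("b: %s\n", sa_access(b) ? "hit" : "miss");      /* miss (cold)  */
    printf("a again: %s\n", sa_access(a) ? "hit" : "miss");/* hit: both fit*/
    return 0;
}
\end{verbatim}
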
 \subsection{Split vs Unified Caches}
 The primary goal of the cache is to optimise the memory accesses of the CPU
 to prevent pipeline stalls \cite{sweetman_2007}. For this reason it can be
 advantageous to split the instruction and data caches, usually at level 1
 only; the lower levels are left unified, as misses there pay a larger penalty
 and so benefit more from the better miss rate of a unified cache. The split
 helps because, due to the nature of pipelines, the CPU can be fetching the
 data for one instruction and the instruction for another at the exact same
 time \cite{smith_1982}. The cache is divided into an instruction cache, which
 can be optimised for mostly reads and infrequent writes, and a data cache,
 which can be optimised for the general case. This doubles the available cache
 bandwidth, although it can have a detrimental effect on the miss rate.
 Nevertheless, it remains the prevailing strategy for level 1 caches, as it
 allows hardware designers to place each cache close to the logic that
 requires it, cutting down on precious nanoseconds.

 \subsection{Multilevel Caches}
 Another common method employed to reduce the miss penalty is to have multiple
 levels of cache, and this is the focus of this dissertation, as Cachegrind
 will be extended to fully support all three levels of cache present on the
 Isambard supercomputer.

 \newpage
 \bibliography{literature_review}
 \bibliographystyle{ieeetr}

\end{document}