Matt Coles, 8 years ago
Commit 434b04ccef
2 files changed, 348 insertions(+), 0 deletions(-)

  literature_review.bib  +143 -0  (diff too large to display)
  literature_review.tex  +205 -0

literature_review.tex:
\documentclass[12pt]{article}
\usepackage{url}
\usepackage{todonotes}

\title{Modelling L2 and L3 CPU caches separately in
Cachegrind for both x86\_64 and ARM64 architectures - Literature Review}
\date{2017-11-18}
\author{Matt Coles - mc930}

\begin{document}
  \pagenumbering{gobble}
  \maketitle
  \tableofcontents
  \listoftodos
  \newpage
  \pagenumbering{arabic}

  \section{Introduction}

  In order to understand the scope of this problem, a few areas are of
  interest. One must first understand how a CPU cache makes use of associative
  memory to provide fast access to data, and then how processor architects
  have split those caches into multiple levels and why those levels matter.
  Fast data caches are essential to the performance of the instruction
  pipeline and, by extension, to the programs running on the CPU. There are
  also differing styles of cache, including the simpler direct-mapped cache
  and the more modern $n$-way set-associative cache, and a design with
  multiple levels of cache may use one or more of these types. Since the
  higher-numbered levels of cache sit further from the core, there is a
  disparity in the speed of access to the different caches.\\

  In addition, as the project extends an existing cache profiling tool so that
  its results better reflect the actual architecture of modern processors, we
  will also survey other program profiling tools, how they might be used to
  produce similar results, and where they take a different approach to
  profiling cache usage.

  \subsection{Introducing Cachegrind}

  A popular framework for profiling programs is Valgrind, a dynamic binary
  instrumentation framework for building dynamic binary analysis tools. As of
  2017, Valgrind consists of six `production quality' tools, one of which is
  Cachegrind. Cachegrind profiles cache behaviour by running the program
  against a model of the cache and recording whether each piece of data would
  or would not have been present, and it can therefore count cache hits and
  misses. For this to work for a given architecture, the instructions of that
  architecture must be categorised according to whether they read memory,
  write it, do both, or do neither. The tool can then monitor the instruction
  stream and update the cache model as the machine (likely) would
  \cite{nethercote_2004}. The advantages of this simulated approach are
  manifold. Firstly, the bulk of the code can be hardware agnostic, avoiding
  architecture quirks such as whether the architecture provides direct access
  to the cache controller or abstracts it away from the programmer entirely.
  Secondly, any attempt to improve the locality of the code is based on a
  simulation environment, and will therefore likely improve the locality of
  the code on any architecture, without relying on hardware-specific cache
  behaviour \cite{weidendorfer_2004}.

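  To make this concrete, the following is a minimal sketch (in C) of how a
  simulator of this kind might consume a stream of memory references that have
  already been categorised as instruction fetches, data reads or data writes,
  and update a cache model while counting hits and misses. This is not
  Cachegrind's actual implementation; the \texttt{memory\_ref} type and the
  \texttt{cache\_access} function are hypothetical stand-ins for the real data
  structures and interfaces.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* A reference extracted from the instrumented instruction stream,
 * already categorised as an instruction fetch, data read or write. */
typedef enum { REF_INSTR, REF_READ, REF_WRITE } ref_kind;

typedef struct {
    ref_kind kind;
    uint64_t addr;   /* virtual address of the access */
    uint8_t  size;   /* size of the access in bytes   */
} memory_ref;

typedef struct { uint64_t accesses, misses; } cache_stats;

/* Provided by the cache model: returns true on a miss. */
bool cache_access(uint64_t addr, uint8_t size);

/* Replay the reference stream through the model, updating it as
 * the real machine (likely) would, and accumulate statistics. */
void simulate(const memory_ref *refs, size_t n,
              cache_stats *istats, cache_stats *dstats)
{
    for (size_t i = 0; i < n; i++) {
        cache_stats *s = (refs[i].kind == REF_INSTR) ? istats : dstats;
        s->accesses++;
        if (cache_access(refs[i].addr, refs[i].size))
            s->misses++;
    }
}
\end{verbatim}
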
  \subsection{Introduction to x86\_64 and ARM64}

  \subsubsection{x86\_64}
  x86 is the all-encompassing name for the backwards-compatible family of
  architectures that began with the Intel 8086 CPU. However, as most modern
  CPUs are 64 bit, this review will only consider the AMD specification of an
  x86\_64 implementation. I am choosing to support this architecture as part
  of this project because it can build upon the work of Kaparelos
  \cite{kaparelos_2014}, which implements the required changes for the
  x86\_64 architecture but was never merged into the main project; having
  read his paper, I can start by implementing the changes in the well-verified
  x86 environment. In addition, Kaparelos' initial justification for an x86
  port of this work also applies here: the architecture is the one most
  commonly used in consumer PCs, and supporting it therefore provides some of
  the most benefit.\\

  The 64 bit extension of x86 that is now widespread was developed by AMD
  \cite{forum_2017}; Intel's Itanium (IA64) is a separate architecture rather
  than an extension of x86 and is disregarded here. The reason this
  architecture succeeded where the original IA64 ISA did not is that it
  retained backwards compatibility with the 32 bit implementations of x86, so
  old programs could still be run in a compatibility mode. There are now both
  Intel and AMD implementations of the ISA, but they are near identical and
  most compilers will produce code that avoids any differences; this is
  especially true for application-level programs, as there is little to no
  chance that they will make use of the kinds of instructions that differ.
  This paper will consider the two implementations as identical and produce
  work with the intention that it will run on both AMD and Intel branded CPUs.

  \subsubsection{ARM64}
  ARM64 is the name for the optional 64 bit extension to the ARMv8-A
  architecture specification from ARM Holdings. The main motivation for
  developing the Cachegrind extension for the ARM architecture is the Isambard
  supercomputer project \cite{isambard_2017}, a world-first supercomputer
  built from ARM based SoCs with three levels of cache. Accurate cache
  profiling results for all levels of cache would let the team optimise the
  programs running on Isambard to use the CPU caches efficiently, saving CPU
  cycles and ensuring that the machine can run as many jobs as possible.
  \todo{Write more about ARM64}

  \section{CPU Caches}

  A CPU cache is a (relatively) small, fast piece of memory placed between the
  CPU and the main memory port to address the disparity between CPU frequency
  and memory frequency. When attempting to access data at a given address, the
  processor first looks for a match in the cache. However, because the cache
  is small there must be some way of deciding where data should be stored in
  it; we cannot feasibly store all possible data in the cache, as that would
  simply replicate main memory. We can instead take advantage of the
  principles of temporal locality and spatial locality, which refer
  respectively to ``the tendency to reuse recently accessed data items''
  \cite{hennessy_2011} and ``the tendency to reference data items that are
  close to other recently referenced items'' \cite{hennessy_2011}. Because
  programs running on a processor generally exhibit these properties, we do
  not need to store every possible memory location to get an increase in
  speed. To find a suitable mapping, one can split the address into an upper
  and a lower portion, where the upper part is stored in a tag store and the
  lower part gives the offset within some arbitrarily sized cache line. When
  data is found in the cache it is referred to as a cache hit, and when it is
  not found and a request must be sent to memory it is referred to as a cache
  miss. As a request to memory carries a penalty of around 240 cycles
  \cite{drepper_2007}, it is important to keep misses to a minimum.

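  As a concrete illustration of this split, the short C fragment below (a
  minimal sketch, assuming 64-byte cache lines purely for the example)
  separates an address into the upper portion that would be held in the tag
  store and the byte offset within the line.

\begin{verbatim}
#include <stdint.h>

#define LINE_SIZE   64   /* bytes per cache line (example value) */
#define OFFSET_BITS 6    /* log2(LINE_SIZE)                      */

/* Upper portion of the address, kept in the tag store. */
static uint64_t line_tag(uint64_t addr)    { return addr >> OFFSET_BITS; }

/* Byte offset of the access within its cache line. */
static uint64_t line_offset(uint64_t addr) { return addr & (LINE_SIZE - 1); }
\end{verbatim}

  For example, with 64-byte lines the address \texttt{0x7f3a12c4} has offset
  \texttt{0x04} within its line and upper portion \texttt{0x1fce84b}.
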
  \subsection{Cache Placement Policies}

  \subsubsection{Fully Associative}
  A very simple cache implementation would be one where any cache line can
  fill any slot in the cache. This is a purely associative memory that accepts
  new key/data pairs until it is full. To place a new block of memory into the
  cache, one simply finds an invalid cache line, fills it with the data, and
  updates the tag store to record the upper address bits that now occupy that
  line. If the cache is full, a line is evicted according to some replacement
  policy, for example LRU. This is rarely - if ever - a good solution, as each
  lookup must compare the requested tag against every tag in the cache;
  performing those comparisons takes up a great deal of area and power when
  implemented in silicon.

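  A minimal sketch of such a lookup is shown below, assuming an illustrative
  cache of 64 lines with an LRU timestamp per line; the structure and names
  are illustrative rather than taken from any real implementation. Note that
  the loop visits every line, which is exactly the work a hardware
  implementation must do with one comparator per line.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 64   /* illustrative size */

typedef struct {
    bool     valid;
    uint64_t tag;   /* upper address bits of the resident line */
    uint64_t lru;   /* time of last use, for LRU eviction      */
} cache_line;

static cache_line fa_cache[NUM_LINES];
static uint64_t   now;   /* logical clock for LRU */

/* Returns true on a hit; on a miss, installs the line into an
 * invalid slot or evicts the least recently used line. */
static bool fa_access(uint64_t tag)
{
    int victim = 0;
    now++;
    for (int i = 0; i < NUM_LINES; i++) {
        if (fa_cache[i].valid && fa_cache[i].tag == tag) {
            fa_cache[i].lru = now;              /* hit */
            return true;
        }
        if (!fa_cache[i].valid)
            victim = i;                         /* prefer an empty slot */
        else if (fa_cache[victim].valid &&
                 fa_cache[i].lru < fa_cache[victim].lru)
            victim = i;                         /* least recently used  */
    }
    fa_cache[victim].valid = true;              /* miss: install */
    fa_cache[victim].tag   = tag;
    fa_cache[victim].lru   = now;
    return false;
}
\end{verbatim}
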
  \subsubsection{Direct Mapped} Because a fully associative cache is too
  expensive to implement, and lookups can be very slow (due to searching the
  full tag store), a better way to map cache lines to a limited store is to
  split the upper part of the address further, keeping the tag as a way of
  determining whether there is a match in the cache but using a middle slice
  of bits as an index. When the cache is divided into ``sets'' of a single
  cache line each, the index determines which \textit{set} a line will be
  installed in. This also simplifies the eviction policy: since each cache
  line can only map to a single \textit{set}, when there is a conflict the
  line occupying that set is simply evicted and the new line installed. This
  is advantageous because it makes checking whether a given address is present
  in the cache much faster and, as mentioned above, makes replacing lines in
  the cache simpler too. Unfortunately it introduces its own problem: ``If
  your program makes repeated reference to two data items that happen to share
  the same cache location (presumably because the low bits of their addresses
  happen to be close together), then the two data items will keep pushing each
  other out of the cache and efficiency will fall drastically''
  \cite{sweetman_2007}.

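  A sketch of the corresponding lookup is shown below, again with purely
  illustrative sizes (64-byte lines and 256 sets). Because the index selects
  exactly one slot, a conflicting line simply overwrites whatever was resident.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS    256    /* example values: 256 sets of one line */
#define OFFSET_BITS 6      /* 64-byte lines                        */
#define INDEX_BITS  8      /* log2(NUM_SETS)                       */

typedef struct { bool valid; uint64_t tag; } dm_line;

static dm_line dm_cache[NUM_SETS];

/* Returns true on a hit; on a miss the resident line (if any)
 * is evicted, since each address maps to exactly one set. */
static bool dm_access(uint64_t addr)
{
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    dm_line *line  = &dm_cache[index];

    if (line->valid && line->tag == tag)
        return true;        /* hit */
    line->valid = true;     /* miss: evict and install */
    line->tag   = tag;
    return false;
}
\end{verbatim}
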
  \subsubsection{Set Associative} The final type of cache seeks to remedy this
  thrashing by splitting the cache into an $n \times m$ arrangement, with $n$
  sets and $m$ ways within each set. This is a compromise between the
  direct-mapped and fully associative designs. As with the direct-mapped
  design, the cache line address is split into a tag, an index, and an offset,
  and the index again determines which set the cache line will belong to;
  however, there are now multiple \textit{ways} in which the line could be
  installed, allowing for $m$ cache lines per set. This increases the hit
  rate, as we no longer have to evict a line simply because another line maps
  to the same set. When all ways in a set are occupied we must choose one to
  evict, which can be done using an arbitrary eviction policy, as with fully
  associative designs. The downsides to this are that: ``Compared with a
  direct-mapped cache, a set-associative cache requires many more bus
  connections between the cache memory and controller. That means that caches
  too big to integrate onto a single chip are much easier to build direct
  mapped. More subtly, because the direct-mapped cache has only one possible
  candidate for the data you need, it's possible to keep the CPU running ahead
  of the tag check (so long as the CPU does not do anything irrevocable based
  on the data). Simplicity and running ahead can translate to a faster clock
  rate.'' \cite{sweetman_2007}

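  The sketch below extends the direct-mapped example to an $n \times m$
  arrangement (here 256 sets of 4 ways, again purely illustrative) with LRU
  eviction within a set. Setting the number of ways to one recovers the
  direct-mapped case, and a single set containing every line recovers the
  fully associative one.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS    256    /* n sets (example value) */
#define NUM_WAYS    4      /* m ways (example value) */
#define OFFSET_BITS 6      /* 64-byte lines          */
#define INDEX_BITS  8      /* log2(NUM_SETS)         */

typedef struct { bool valid; uint64_t tag; uint64_t lru; } way;

static way      sa_cache[NUM_SETS][NUM_WAYS];
static uint64_t now;   /* logical clock for LRU */

/* Returns true on a hit; on a miss, installs into an empty way
 * or evicts the least recently used way of the selected set. */
static bool sa_access(uint64_t addr)
{
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    way *set = sa_cache[index];
    int victim = 0;

    now++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].lru = now;                   /* hit */
            return true;
        }
        if (!set[w].valid)
            victim = w;                         /* prefer an empty way */
        else if (set[victim].valid && set[w].lru < set[victim].lru)
            victim = w;                         /* least recently used */
    }
    set[victim].valid = true;                   /* miss: install */
    set[victim].tag   = tag;
    set[victim].lru   = now;
    return false;
}
\end{verbatim}
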
  \subsection{Split vs Unified Caches} The primary goal of the cache is to
  optimise the memory accesses of the CPU and so prevent pipeline stalls
  \cite{sweetman_2007}. For this reason it can be advantageous to split the
  instruction and data caches, usually at level 1 only. This is because, due
  to the nature of pipelines, the CPU can be fetching the data for one
  instruction and the instruction for another at the exact same time
  \cite{smith_1982}. Splitting the cache yields an instruction cache, which
  can be optimised for mostly reads and infrequent writes, and a data cache,
  which can be optimised for the general use case. This doubles the available
  cache bandwidth, although it can have a detrimental effect on the miss rate.
  It nevertheless remains the prevailing strategy for low-level caches (i.e.\
  level 1), as it allows hardware designers to place each cache close to the
  logic that requires it, cutting down on precious nanoseconds.

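  In a cache simulator this split can be modelled directly: instruction
  fetches are routed to one L1 model and data references to another, each
  keeping its own statistics. The sketch below assumes an opaque per-cache
  lookup routine, \texttt{cache\_lookup}, returning true on a miss; the name
  and types are hypothetical.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

/* Opaque cache model; the lookup returns true on a miss. */
typedef struct cache cache;
bool cache_lookup(cache *c, uint64_t addr);

typedef struct {
    cache   *icache;            /* L1 instruction cache */
    cache   *dcache;            /* L1 data cache        */
    uint64_t i_misses, d_misses;
} split_l1;

/* Instruction fetches and data accesses consult separate L1
 * caches, so both can be serviced at the same time. */
void fetch_instruction(split_l1 *l1, uint64_t pc)
{
    if (cache_lookup(l1->icache, pc))
        l1->i_misses++;
}

void access_data(split_l1 *l1, uint64_t addr)
{
    if (cache_lookup(l1->dcache, addr))
        l1->d_misses++;
}
\end{verbatim}
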
  \subsection{Multilevel Caches} Another common method employed to reduce the
  miss penalty is to have multiple levels of cache. This is the focus of this
  dissertation, as Cachegrind will be extended to fully support all three
  levels of cache present on the Isambard supercomputer.

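  In a simulator, modelling the L2 and L3 separately amounts to cascading the
  lookups: a reference that misses in one level is passed down to the next,
  and each level keeps its own hit and miss counters. The sketch below assumes
  a generic per-level lookup function, \texttt{level\_lookup} (a hypothetical
  name), returning true on a miss.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

/* Opaque per-level cache model; the lookup returns true on a miss. */
typedef struct level level;
bool level_lookup(level *l, uint64_t addr);

typedef struct {
    level   *cache;
    uint64_t accesses, misses;
} cache_level;

/* Walk the hierarchy: a miss at one level becomes an access at
 * the next, so L2 and L3 only see the traffic that the levels
 * above them could not satisfy. */
void hierarchy_access(cache_level *levels, int num_levels, uint64_t addr)
{
    for (int i = 0; i < num_levels; i++) {
        levels[i].accesses++;
        if (!level_lookup(levels[i].cache, addr))
            return;                 /* hit at this level        */
        levels[i].misses++;         /* miss: try the next level */
    }
    /* Missed in every level: the reference goes to main memory. */
}
\end{verbatim}

  With three elements in the \texttt{levels} array (L1, L2 and L3) this yields
  the per-level miss counts needed to profile a machine with three levels of
  cache, such as the nodes of Isambard.
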
201
+  \newpage 
202
+  \bibliography{literature_review}
203
+  \bibliographystyle{ieeetr}
204
+
205
+\end{document}