\documentclass[12pt]{article}
\usepackage{url}
\usepackage{todonotes}

\title{Modelling L2 and L3 CPU caches separately in
Cachegrind for both x86\_64 and ARM64 architectures - Literature Review}
\date{2017-11-18}
\author{Matt Coles - mc930}

\begin{document}
 \pagenumbering{gobble}
 \maketitle
 \tableofcontents
 \listoftodos
 \newpage
 \pagenumbering{arabic}

 \section{Introduction}

 In order to understand the scope of this problem, there are a few areas of
 interest. To begin with, one must understand how a CPU cache makes use of
 associative memory to provide fast access to data, and then how processor
 architects have split those caches into multiple levels and why those levels
 matter. Fast data caches are essential to the performance of the instruction
 pipeline and, by extension, to the programs running on the CPU. There are
 also differing styles of cache, including the simpler direct-mapped cache and
 the more modern n-way set-associative cache, and when a design has multiple
 levels of cache, its designers may choose a different style for each level.
 Since the higher levels of cache sit further from the core, there is also a
 disparity in the speed of access to each level.

 In addition, as the project is to extend an existing cache profiling tool so
 that its results better reflect the actual architecture of modern processors,
 we will also investigate other program profiling tools, whether they could be
 used to produce similar results, and where they take a different approach to
 profiling cache usage.

 \subsection{Introducing Cachegrind}

 A popular tool for profiling programs is Valgrind, a dynamic binary
 instrumentation framework for building dynamic binary analysis tools. As of
 2017, Valgrind consists of six `production quality' tools, one of which is
 Cachegrind. Cachegrind profiles cache usage by running the program against a
 software model of the cache and recording whether each piece of data would or
 would not have been present in the cache, and it can therefore count cache
 hits and misses. For this to work on a given architecture, the architecture's
 instructions must be categorised according to whether they read memory, write
 memory, do both, or do neither. Cachegrind can then monitor the instruction
 stream and update the cache model as the machine (likely) would
 \cite{nethercote_2004}. The advantages of this simulated approach are
 manifold. Firstly, the bulk of the code can be hardware agnostic, avoiding
 architectural quirks such as whether the architecture provides direct access
 to the cache controller or abstracts it away from the programmer entirely. In
 addition, any attempt to improve the locality of the code is based on a
 simulated environment, so the improvement is likely to carry over to any
 architecture rather than relying on hardware-specific cache behaviour
 \cite{weidendorfer_2004}.

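 To make the simulation idea concrete, the sketch below shows roughly what a
 per-reference cache-model update might look like. It is only an illustrative
 sketch in C and does not reflect Cachegrind's actual code or data structures;
 the cache parameters, the names, and the direct-mapped organisation are
 assumptions chosen for brevity.

\begin{verbatim}
/* Illustrative sketch only -- not Cachegrind's real implementation.    */
/* A tiny direct-mapped cache model, updated once per memory reference. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_SIZE 64                 /* bytes per cache line (assumed)  */
#define NUM_LINES 512                /* 32 KB cache / 64 B lines        */

typedef struct { bool valid; uint64_t tag; } line_t;

static line_t   cache[NUM_LINES];
static uint64_t hits, misses;

/* The instrumentation would call this for every load/store it sees.   */
static void cache_access(uint64_t addr)
{
    uint64_t block = addr / LINE_SIZE;    /* strip the line offset      */
    uint64_t index = block % NUM_LINES;   /* slot this block maps to    */
    uint64_t tag   = block / NUM_LINES;   /* identifies the block       */

    if (cache[index].valid && cache[index].tag == tag) {
        hits++;                           /* data would be in the cache */
    } else {
        misses++;                         /* would have to go further   */
        cache[index].valid = true;        /* install the new line       */
        cache[index].tag   = tag;
    }
}

int main(void)
{
    /* Replay a made-up reference stream: a simple sequential scan.     */
    for (uint64_t a = 0; a < 4096; a += 8)
        cache_access(a);
    printf("hits=%llu misses=%llu\n",
           (unsigned long long)hits, (unsigned long long)misses);
    return 0;
}
\end{verbatim}

 In practice one simply runs the existing tool, e.g.\ \texttt{valgrind
 --tool=cachegrind ./myprogram}; the sketch is only meant to show the point at
 which a miss would be passed on to a model of the next cache level.
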
 \subsection{Introduction to x86\_64 and ARM64}

 \subsubsection{x86\_64}
 x86 is the all-encompassing name for the backwards-compatible family of
 architectures that began with the Intel 8086 CPU. However, most modern CPUs
 are now 64 bit, so this review will focus only on the AMD specification for
 an x86\_64 implementation. I am choosing to support this architecture as part
 of this project because it can build upon the work of Kaparelos
 \cite{kaparelos_2014}, which implements the required changes for the x86\_64
 architecture but was not merged into the main project; having read his paper,
 I can start by implementing the changes in the well-verified x86 environment.
 In addition, Kaparelos's initial justification for an x86 port of this work
 was that the architecture is the most commonly used in consumer PCs and will
 therefore provide some of the most benefit.

 The 64 bit extension of x86 that is now widespread, disregarding Intel's
 separate Itanium (IA-64) architecture, was developed by AMD \cite{forum_2017}.
 The reason it was so much more successful than the original IA-64 ISA is that
 it retained backwards compatibility with the 32 bit implementations of x86,
 so old programs could still be run in a compatibility mode. There are now
 both Intel and AMD implementations of the ISA, but they are near identical
 and most compilers will produce code that avoids any differences; this is
 especially true for application-level programs, as there is little to no
 chance that they will make use of the few instructions that differ. This
 paper will therefore treat the two implementations as identical and produce
 work with the intention that it will run on both AMD and Intel branded CPUs.

 \subsubsection{ARM64}
 ARM64 is the name for the optional 64 bit extension to the ARMv8-A
 architecture specification from ARM Holdings. The main motivation for
 developing the Cachegrind extension for the ARM architecture is the Isambard
 supercomputer project \cite{isambard_2017}, a world-first supercomputer built
 from ARM-based SoCs with three levels of cache. Accurate cache profiling
 results for all levels of cache matter here so that the team can optimise the
 programs running on Isambard to use the CPU caches efficiently, saving CPU
 cycles and ensuring that the machine can run as many jobs as possible.
 \todo{Write more about ARM64}

 \section{CPU Caches}

 A CPU cache is a (relatively) small, fast piece of memory placed between the
 CPU bus and the main memory port to address the disparity between CPU
 frequency and memory frequency. When attempting to access data at a given
 address, the processor will first look for a match in the cache. However,
 because the cache is small, there must be some way of deciding where data
 should be stored in it; we cannot feasibly store all possible data in the
 cache, as that would simply replicate main memory. We can instead take
 advantage of the principles of temporal locality and spatial locality, which
 respectively refer to ``the tendency to reuse recently accessed data items''
 \cite{hennessy_2011} and ``the tendency to reference data items that are
 close to other recently referenced items'' \cite{hennessy_2011}. Because
 programs running on a processor generally exhibit these properties, we do not
 need to store all possible memory locations to get an increase in speed. To
 find a suitable mapping, one can split the address into an upper and a lower
 portion, where the upper part is stored in a tag store and the lower part
 gives the offset within an arbitrarily sized cache line. When data is found
 in the cache, it is referred to as a cache hit; when the data is not found
 and a request must be sent to memory, it is referred to as a cache miss. As a
 request to memory comes with a penalty of around 240 cycles
 \cite{drepper_2007}, it is important to keep misses to a minimum.

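 To illustrate why the miss rate matters so much, consider a standard average
 memory access time estimate. The hit time and miss rates below are
 illustrative assumptions rather than measured figures; only the roughly
 240 cycle memory penalty comes from the paragraph above:
 \[
   t_{avg} = t_{hit} + r_{miss} \times t_{penalty}
           = 4 + 0.05 \times 240 = 16 \textrm{ cycles},
 \]
 so even a 5\% miss rate quadruples the effective access time of a
 hypothetical 4-cycle cache, while halving the miss rate to 2.5\% brings the
 average back down to 10 cycles.
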
 \subsection{Cache Placement Policies}

 \subsubsection{Fully Associative}
 A very simple cache organisation would be one where any block of memory could
 fill any slot in the cache. This would be a purely associative memory that
 accepts new key/data pairs until it is full. To place a new block of memory
 in the cache, one simply finds an invalid cache line and fills that line with
 the data, updating the tag store to record the upper address bits that now
 occupy the line. In the event that the cache is full, a line is evicted based
 on some replacement policy, for example LRU. This is rarely, if ever, a good
 solution, as each lookup must compare the requested tag against every tag in
 the cache, and performing all of those comparisons takes up a great deal of
 area and power when implemented in silicon.

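 A minimal sketch of such a lookup is shown below (in C, with an assumed
 64-line cache; it is not taken from any of the tools discussed here). In
 software the search is a linear scan; hardware checks every tag at once with
 one comparator per line, which is exactly where the area and power cost
 arises.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_LINES 64        /* small fully associative cache (assumed size) */

typedef struct { bool valid; uint64_t tag; } line_t;
static line_t lines[NUM_LINES];

/* Returns true if the address would hit: every stored tag is examined. */
static bool fa_lookup(uint64_t addr)
{
    uint64_t tag = addr / LINE_SIZE;  /* whole block address is the tag */
    for (int i = 0; i < NUM_LINES; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return true;
    return false;
}

int main(void)
{
    lines[10].valid = true;           /* pretend one block is cached    */
    lines[10].tag   = 0x1234;
    printf("%s\n", fa_lookup(0x1234 * LINE_SIZE + 8) ? "hit" : "miss");
    return 0;
}
\end{verbatim}
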
 \subsubsection{Direct Mapped}
 Because a fully associative cache is too expensive to implement, and the cost
 of a lookup can be very high (every entry in the tag store must be checked),
 a better way to map cache lines to a limited store is to split the upper part
 of the address further, keeping the tag as a way of determining whether there
 is a match in the cache but using the middle slice of bits as an index. When
 the cache is divided into ``sets'' of a single cache line each, the index
 determines which \textit{set} a line will be installed in. This also
 simplifies the eviction policy: since each memory block can only map to a
 single \textit{set}, whenever there is a conflict the line occupying that set
 is simply evicted and the new line installed. This is advantageous because it
 makes checking whether a given address is present in the cache much faster
 and, as mentioned above, makes replacing lines simpler too. Unfortunately it
 introduces its own problem: ``If your program makes repeated reference to two
 data items that happen to share the same cache location (presumably because
 the low bits of their addresses happen to be close together), then the two
 data items will keep pushing each other out of the cache and efficiency will
 fall drastically'' \cite{sweetman_2007}.

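 As a worked illustration of this address split (the cache geometry here is an
 assumption, not taken from any particular processor), a 32 KB direct-mapped
 cache with 64 byte lines has 512 sets, giving 6 offset bits and 9 index bits.
 Two addresses that differ only in their tag bits land in the same set and
 will keep evicting one another: exactly the thrashing described in the
 quotation above.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 32 KB direct-mapped cache, 64-byte lines -> 512 sets. */
#define OFFSET_BITS 6                        /* log2(64 bytes per line)  */
#define INDEX_BITS  9                        /* log2(512 sets)           */

static void split(uint64_t addr)
{
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr=0x%-8llx tag=0x%-4llx index=%3llu offset=%2llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)index, (unsigned long long)offset);
}

int main(void)
{
    uint64_t a = 0x12340;                    /* two addresses 32 KB apart    */
    uint64_t b = a + (1u << (OFFSET_BITS + INDEX_BITS));
    split(a);                                /* same index and offset but a  */
    split(b);                                /* different tag, so in a       */
    return 0;                                /* direct-mapped cache they     */
}                                            /* evict each other repeatedly  */
\end{verbatim}
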
 \subsubsection{Set Associative}
 The final type of cache seeks to remedy this thrashing by further splitting
 the cache to produce an $n \times m$ arrangement, with $n$ sets and $m$ ways
 within each set. This is a compromise between the direct-mapped and fully
 associative designs. As with the direct-mapped design, the address is split
 into a tag, an index, and an offset, and the index determines which set a
 cache line belongs to; however, in this design there are multiple
 \textit{ways} in which the line could be installed, allowing $m$ cache lines
 per set. This increases the hit rate, as a line no longer has to be evicted
 simply because another line maps to the same set. However, when all ways in a
 set are occupied, one must be chosen for eviction, which can be done with an
 arbitrary eviction policy, as in fully associative designs. The downside is
 that ``Compared with a direct-mapped cache, a set-associative cache requires
 many more bus connections between the cache memory and controller. That means
 that caches too big to integrate onto a single chip are much easier to build
 direct mapped. More subtly, because the direct-mapped cache has only one
 possible candidate for the data you need, it's possible to keep the CPU
 running ahead of the tag check (so long as the CPU does not do anything
 irrevocable based on the data). Simplicity and running ahead can translate to
 a faster clock rate'' \cite{sweetman_2007}.

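 A minimal sketch of the set-associative lookup is shown below (in C, with an
 assumed 64-set, 4-way geometry and a simple round-robin eviction policy
 standing in for whatever policy a real design would use). The point to note
 is that two blocks mapping to the same set can now coexist, unlike in the
 direct-mapped case above.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed geometry: 64 sets x 4 ways, 64-byte lines. */
#define LINE_SIZE 64
#define NUM_SETS  64
#define NUM_WAYS  4

typedef struct { bool valid; uint64_t tag; } line_t;
static line_t   sets[NUM_SETS][NUM_WAYS];
static unsigned victim[NUM_SETS];          /* round-robin pointer per set */

/* Returns true on a hit; on a miss, installs the line by evicting a way. */
static bool sa_access(uint64_t addr)
{
    uint64_t block = addr / LINE_SIZE;
    uint64_t set   = block % NUM_SETS;
    uint64_t tag   = block / NUM_SETS;

    for (int w = 0; w < NUM_WAYS; w++)         /* check each way in the set */
        if (sets[set][w].valid && sets[set][w].tag == tag)
            return true;

    unsigned w = victim[set];                  /* simple round-robin evict; */
    victim[set] = (victim[set] + 1) % NUM_WAYS;/* real designs often use    */
    sets[set][w].valid = true;                 /* LRU or pseudo-LRU instead */
    sets[set][w].tag   = tag;
    return false;
}

int main(void)
{
    /* Two blocks that map to the same set no longer evict each other.  */
    uint64_t a = 0x10000, b = 0x10000 + NUM_SETS * LINE_SIZE;
    printf("a: %s\n", sa_access(a) ? "hit" : "miss");      /* miss (cold)  */
    printf("b: %s\n", sa_access(b) ? "hit" : "miss");      /* miss (cold)  */
    printf("a again: %s\n", sa_access(a) ? "hit" : "miss");/* hit: both fit*/
    return 0;
}
\end{verbatim}
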
 \subsection{Split vs Unified Caches}
 The primary goal of the cache is to optimise the memory accesses of the CPU
 to prevent pipeline stalls \cite{sweetman_2007}. For this reason it can be
 advantageous to split the instruction and data caches, usually at level 1
 only; the lower levels are left unified, as misses there pay a larger penalty
 and so benefit more from the better miss rate of a unified cache. The split
 helps because, due to the nature of pipelines, the CPU can be fetching the
 data for one instruction and the instruction for another at the exact same
 time \cite{smith_1982}. The cache is divided into an instruction cache, which
 can be optimised for mostly reads and infrequent writes, and a data cache,
 which can be optimised for the general case. This doubles the available cache
 bandwidth, although it can have a detrimental effect on the miss rate.
 Nevertheless, it remains the prevailing strategy for level 1 caches, as it
 allows hardware designers to place each cache close to the logic that
 requires it, cutting down on precious nanoseconds.

 \subsection{Multilevel Caches}
 Another common method employed to reduce the miss penalty is to have multiple
 levels of cache, and this is the focus of this dissertation, as Cachegrind
 will be extended to fully support all three levels of cache present on the
 Isambard supercomputer.

 \newpage
 \bibliography{literature_review}
 \bibliographystyle{ieeetr}

\end{document}