\documentclass[12pt]{article}
\usepackage{url}
\usepackage{todonotes}
\title{Modelling L2 and L3 CPU caches separately in
Cachegrind for both x86\_64 and ARM64 architectures - Literature Review}
\date{2017-11-18}
\author{Matt Coles - mc930}
\begin{document}
\pagenumbering{gobble}
\maketitle
\tableofcontents
\listoftodos
\newpage
\pagenumbering{arabic}
\section{Introduction}
In order to understand the scope of this problem, there are a few areas of
interest. To begin with, one must first understand how a CPU cache makes use
of associative memory to provide fast access to data, and then how processor
architects have split those caches into multiple levels and why that matters.
Fast data caches are essential to the performance of the instruction
pipeline, and therefore, by extension, to the programs running on the CPU. In
addition, there are differing styles of cache, including the simpler
direct-mapped cache and the more modern n-way set-associative cache, and a
design with multiple levels of cache may use one or more of these types.
Because the higher levels of cache usually sit further from the core, there
is also a disparity in the speed of access between the levels.\\
In addition, as the project is to extend an existing cache profiling tool so
that the results it gives might better reflect the actual architecture of
modern processors, we will also investigate other program profiling tools,
how they might be used to produce similar results, and how they take a
different approach to profiling cache usage.
\subsection{Introducing Cachegrind}
A popular tool for profiling programs is Valgrind, a dynamic binary
instrumentation framework for building dynamic binary analysis tools. As of
2017, Valgrind consists of six ``production quality'' tools, one of which is
Cachegrind. Cachegrind profiles cache behaviour by running the program
against a model of the cache and recording whether each piece of data would
or would not have been present in the cache, and can therefore count cache
hits and misses. For this to work on a given architecture, the instructions
of that architecture must be categorised according to whether they read,
write, do both, or do neither. Cachegrind can then monitor the instruction
stream and update the cache model as the machine (likely) would
\cite{nethercote_2004}.
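
The core of this trace-driven approach can be illustrated with a minimal
sketch. The C code below is not Cachegrind's actual implementation; the
\texttt{mem\_ref} type and \texttt{cache\_model\_access} function are
hypothetical names and the real tool tracks far more detail, but it shows
how a stream of already-categorised references drives per-category miss
counters:
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

typedef enum { INSTR_READ, DATA_READ, DATA_WRITE } ref_kind;

typedef struct { ref_kind kind; uint64_t addr; } mem_ref;

/* Hypothetical cache model: returns true on a hit and installs the
 * line on a miss.  Cachegrind's real model is far more detailed. */
extern bool cache_model_access(uint64_t addr);

static uint64_t i_misses, d_read_misses, d_write_misses;

void simulate(const mem_ref *stream, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cache_model_access(stream[i].addr))
            continue;                         /* hit: nothing to record */
        switch (stream[i].kind) {             /* miss: count by category */
        case INSTR_READ: i_misses++;       break;
        case DATA_READ:  d_read_misses++;  break;
        case DATA_WRITE: d_write_misses++; break;
        }
    }
}
\end{verbatim}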
The simulated approach has several advantages. Firstly, the bulk of the code
can be hardware agnostic, avoiding architecture quirks such as whether the
architecture provides direct access to the cache controller or abstracts it
away from the programmer entirely. In addition, any attempt to improve the
locality of the code will be based on a simulation environment and will
therefore likely improve the locality of the code on any architecture,
without relying on hardware-specific cache behaviour \cite{weidendorfer_2004}.
\subsection{Introduction to x86\_64 and ARM64}
\subsubsection{x86\_64}
x86 is the all-encompassing name for the backwards-compatible family of
architectures that began with the Intel 8086 CPU. However, most modern CPUs
are now 64 bit, so this review will focus only on the AMD specification for
an x86\_64 implementation. I am choosing to support this architecture as part
of this project because it can build upon the work of Kaparelos
\cite{kaparelos_2014}, who implemented the required changes for the x86\_64
architecture but whose work was not merged into the main project; having read
his paper, I can start by implementing the changes in the well-verified x86
environment. In addition, Kaparelos's initial justification for an x86 port
still holds: the architecture is the most commonly used in consumer PCs and
therefore the work will provide some of the most benefit.\\
The 64 bit extension of x86 that is now widespread - disregarding the Intel
Itanium (IA-64) implementation - was developed by AMD \cite{forum_2017}. The
reason this architecture was so successful, as opposed to the original IA-64
ISA, is that it retained backwards compatibility with the 32 bit
implementations of x86, so old programs could still be run in a compatibility
mode. There are now both Intel and AMD implementations of the ISA, but they
are near identical, and most compilers will produce code that avoids any
differences; this is especially true for application-level programs, as there
is little to no chance that they will make use of the kind of instructions
that differ. This paper will therefore treat the two implementations as
identical and produce work intended to run on both AMD and Intel branded
CPUs.
\subsubsection{ARM64}
ARM64 is the name for the optional 64 bit extension to the ARMv8-A
architecture specification from ARM Holdings. The main motivation behind
developing the Cachegrind extension for the ARM architecture is the Isambard
\cite{isambard_2017} supercomputer project, a world-first supercomputer
composed of ARM-based SoCs with three levels of cache. Accurate cache
profiling results for all levels of cache are important so that the team can
optimise the programs running on Isambard to use the CPU caches efficiently,
saving CPU cycles and ensuring that the machine can run as many jobs as
possible. \todo{Write more about ARM64}
\section{CPU Caches}
A CPU cache is a (relatively) small, fast piece of memory placed between the
CPU bus and the main memory port to address the disparity between CPU
frequency and memory frequency. When attempting to access data at a given
address, the processor will first look for a match in the cache. However,
because the cache is small, there must be some way of deciding where data
should be stored in it - we cannot feasibly store all possible data in the
cache, as that would simply replicate main memory. We can take advantage of
the principles known as temporal locality and spatial locality, which
respectively refer to ``the tendency to reuse recently accessed data items''
\cite{hennessy_2011} and ``the tendency to reference data items that are
close to other recently referenced items'' \cite{hennessy_2011}. Because
programs running on a processor generally exhibit these properties, we do not
actually need to store all possible memory locations to get an increase in
speed. To find a suitable mapping, one can split the address into an upper
and a lower portion, where the upper part can be stored in a tag store and
the lower part refers to the offset within some arbitrarily sized cache line.
When data is found in the cache, it is referred to as a cache hit; when the
data is not found and a request must be sent to memory, it is referred to as
a cache miss. As a request to memory comes with a penalty of around 240
cycles \cite{drepper_2007}, it is important to keep misses to a minimum.
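
As a small worked example (the 64-byte line size is an assumption chosen
purely for illustration), the lowest $\log_2 64 = 6$ bits of a byte address
$a$ give the offset within the line and the remaining upper bits form the
tag stored in the tag store:
\[
\text{offset} = a \bmod 64, \qquad \text{tag} = \lfloor a / 64 \rfloor ,
\]
so for $a = \texttt{0x1F4A}$ the offset is $\texttt{0x0A}$ and the tag is
$\texttt{0x7D}$.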
\subsection{Cache Placement Policies}
\subsubsection{Fully Associative}
A very simple cache implementation would be one where any cache line can
fill any slot in the cache. This would be a purely associative memory that
accepts new key/data pairs until it is full. To place a new block of memory
into the cache, one must simply find an invalid cache line and fill that line
with the data, updating the tag store to show the upper address region that
now occupies the line. In the event that the cache is full, a line is evicted
based on some replacement policy, for example LRU. This is rarely - if ever -
a good solution, as each tag lookup requires iterating through all the
elements in the cache to find the matching tag. The associative tag store has
its own disadvantages too, as it takes up a lot of area and power when
implemented in silicon.
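
A rough sketch of what a fully associative lookup entails is given below;
the line count and 64-byte line size are assumptions for illustration, not a
real design. The key point is that every valid line's tag must be compared
against the requested tag:
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 512
#define LINE_BITS 6              /* 64-byte lines (assumed) */

typedef struct { bool valid; uint64_t tag; } cache_line;

static cache_line lines[NUM_LINES];

bool fully_assoc_lookup(uint64_t addr)
{
    uint64_t tag = addr >> LINE_BITS;
    for (int i = 0; i < NUM_LINES; i++)   /* check every line's tag */
        if (lines[i].valid && lines[i].tag == tag)
            return true;                  /* hit */
    return false;                         /* miss: caller evicts/installs */
}
\end{verbatim}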
\subsubsection{Direct Mapped}
Because a fully associative cache is too expensive to implement, and the miss
penalty can be very high (due to full iteration of the tag store), a better
way to map cache lines to a limited store is to split the upper part of the
address further, keeping the tag as a way of determining whether there is a
match in the cache, but using the middle slice of bits as an index. When
combined with a cache that splits its storage into ``sets'' with a single
cache line per set, the index can be used to determine which \textit{set} the
line will be installed in. This also simplifies the eviction policy: since
each cache line can only map to a single \textit{set} in the cache, when
there is a conflict the line that occupies the set is simply evicted and the
new line installed. This is advantageous because it makes the process of
finding whether a given address is present in the cache much faster and
easier, and as mentioned above, it makes replacing lines in the cache simpler
too. Unfortunately it introduces its own problem: ``If your program makes
repeated reference to two data items that happen to share the same cache
location (presumably because the low bits of their addresses happen to be
close together), then the two data items will keep pushing each other out of
the cache and efficiency will fall drastically'' \cite{sweetman_2007}.
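
A minimal sketch of a direct-mapped lookup follows; the cache geometry is
assumed purely for illustration. The index bits select exactly one set, so
only a single tag comparison is needed, and a conflict simply overwrites the
resident line:
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define LINE_BITS 6                 /* 64-byte lines (assumed) */
#define SET_BITS  10
#define NUM_SETS  (1 << SET_BITS)   /* 1024 sets, one line each */

typedef struct { bool valid; uint64_t tag; } cache_line;

static cache_line sets[NUM_SETS];

bool direct_mapped_access(uint64_t addr)
{
    uint64_t index = (addr >> LINE_BITS) & (NUM_SETS - 1);  /* middle */
    uint64_t tag   = addr >> (LINE_BITS + SET_BITS);        /* upper  */
    cache_line *line = &sets[index];

    if (line->valid && line->tag == tag)
        return true;            /* hit: only one candidate to check */

    line->valid = true;         /* cold or conflict miss: evict, install */
    line->tag   = tag;
    return false;
}
\end{verbatim}
In this sketch, two addresses that share the same index bits repeatedly
overwrite one another, which is exactly the thrashing behaviour described in
the quotation above.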
\subsubsection{Set Associative}
The final type of cache seeks to remedy this thrashing by further splitting
the sets within the cache to produce an $n \times m$ cache, with $n$ sets and
$m$ ways within those sets. This is a compromise between the direct-mapped
and fully associative cache designs. With this design, as with the
direct-mapped design, the cache line address is split into a tag, an index,
and the offset. Again, as in the direct-mapped cache, the index determines
which set the cache line will belong to; however, in this design there are
multiple \textit{ways} the line could be installed in, allowing for $m$ cache
lines per set. This increases the hit rate for lines with spatial locality,
as we do not necessarily have to evict a line just because it maps to the
same set as another. However, in the case that all ways in a set are
occupied, we have to choose one to evict, and this can be done with an
arbitrary eviction policy, as with fully associative cache designs. The
downsides are that: ``Compared with a direct-mapped cache, a set-associative
cache requires many more bus connections between the cache memory and
controller. That means that caches too big to integrate onto a single chip
are much easier to build direct mapped. More subtly, because the
direct-mapped cache has only one possible candidate for the data you need,
it's possible to keep the CPU running ahead of the tag check (so long as the
CPU does not do anything irrevocable based on the data). Simplicity and
running ahead can translate to a faster clock rate'' \cite{sweetman_2007}.
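
The sketch below illustrates a set-associative lookup ($m$ ways per set)
with a simple age-counter approximation of LRU; the geometry and the
replacement bookkeeping are assumptions for illustration rather than a
description of any particular CPU:
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define LINE_BITS 6                 /* 64-byte lines (assumed) */
#define SET_BITS  6
#define NUM_SETS  (1 << SET_BITS)   /* n = 64 sets */
#define NUM_WAYS  8                 /* m = 8 ways per set */

typedef struct { bool valid; uint64_t tag; uint64_t age; } way_t;

static way_t cache[NUM_SETS][NUM_WAYS];

bool set_assoc_access(uint64_t addr)
{
    uint64_t index = (addr >> LINE_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (LINE_BITS + SET_BITS);
    way_t *set = cache[index];

    for (int w = 0; w < NUM_WAYS; w++)
        set[w].age++;                   /* every way in the set ages */

    for (int w = 0; w < NUM_WAYS; w++)
        if (set[w].valid && set[w].tag == tag) {
            set[w].age = 0;             /* hit: most recently used */
            return true;
        }

    int victim = 0;                     /* miss: evict the oldest way */
    for (int w = 1; w < NUM_WAYS; w++)
        if (set[w].age > set[victim].age)
            victim = w;

    set[victim].valid = true;
    set[victim].tag   = tag;
    set[victim].age   = 0;
    return false;
}
\end{verbatim}
In hardware the $m$ tag comparisons happen in parallel, which is where the
extra bus connections mentioned in the quotation above come from.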
\subsection{Split vs Unified Caches}
The primary goal of the cache is to optimise the memory accesses of the CPU
to prevent pipeline stalls \cite{sweetman_2007}. For this reason it can be
advantageous to split the instruction and data caches - usually in the L1
only, as misses there will pay a larger penalty anyway. This is because, due
to the nature of pipelines, the CPU can be fetching data for one instruction
and the instruction for another at the exact same time \cite{smith_1982}. The
cache is split into an instruction cache, which can be further optimised for
its pattern of mostly reads and infrequent writes, and a data cache, which
can be optimised for the general use case. This doubles the available cache
bandwidth, although it can have a detrimental effect on miss rate.
Nevertheless, this remains the prevailing strategy for low-level caches
(i.e.\ level 1), as it allows hardware designs to place the caches close to
the logic that requires them, cutting down on precious nanoseconds.
\subsection{Multilevel Caches}
Another common method that can be employed to reduce miss penalty is having
multiple levels of cache, and this is the focus of this dissertation, as
Cachegrind will be extended to fully support all three levels of cache
present on the Isambard supercomputer.
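
A miss at one level is forwarded to the next level, and only a miss in the
last level reaches main memory. The sketch below, with hypothetical
per-level lookup functions, shows in outline the kind of three-level miss
propagation such a model has to capture:
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

extern bool l1_lookup(uint64_t addr);   /* hypothetical level models */
extern bool l2_lookup(uint64_t addr);
extern bool l3_lookup(uint64_t addr);

static uint64_t l1_misses, l2_misses, l3_misses;

void multilevel_access(uint64_t addr)
{
    if (l1_lookup(addr)) return;        /* L1 hit: cheapest case */
    l1_misses++;
    if (l2_lookup(addr)) return;        /* satisfied by L2 */
    l2_misses++;
    if (l3_lookup(addr)) return;        /* satisfied by L3 */
    l3_misses++;                        /* must go to main memory */
}
\end{verbatim}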
\newpage
\bibliography{literature_review}
\bibliographystyle{ieeetr}
\end{document}