updates/003_2018dec04_microarchitecture.mdwn

   1 The Libre RISC-V core is planning to deploy an innovative vectorisation
   2 system, known as
   3 [Simple-V](https://libre-riscv.org/simple_v_extension/specification).
   4 Honestly, it's not very simple at all!
   5
   6 #### An Introduction to Simple-V
   7
   8 The principle is straightforward: mark ordinary registers as
   9 "vectorised," and when an instruction uses one such "tagged" register,
  10 go into a hardware-unrolled version of a software macro loop, issuing
  11 otherwise identical instructions with *contiguous*
  12 sequentially-increasing register numbers.
  13
  14 It sounds easy... except that the "tag" table behind the registers has
  15 been extended to:
  16
  17 * add predication (switching on and off individual "elements" of the vector)
  18 * change the element width so what was previously a 64-bit ADD can be
  19   used to do 16-bit or even 8-bit ADDs
  20 * reordering of the elements, for 2D and 3D arrays and matrices: very
  21   useful for in-place transposing for matrix multiplication
  22 * extend the instructions so that they can access up to 128 registers,
  23   where previously that was limited to 32 (and only 8 for compressed
  24   instructions)
  25
  26 All of these turn out to be important for GPU workloads.
  27
  28 One of the most challenging aspects of Simple-V is that there is no restriction
  29 on the "redirection."  Whilst one instruction could use register five and
  30 another uses register ten, *both* of them could actually be "redirected"
  31 to use register 112, for example.  One of those could even be changed
  32 to 32-bit operations whilst the other is set to 16-bit element widths.
  33
  34 Our initial thoughts advocated a standard, simple, in-order, SIMD architecture,
  35 with predication bits passed down into the SIMD ALUs.  If a bit is "off,"
  36 that "lane" within the ALU does not calculate a result, saving power.
  37 However, in Simple-V, when the element width is set to 32-, 16-, or 8-bit, a
  38 pre-issue engine is required that re-orders *parts* of the registers,
  39 packing lanes of data together so that it fits into one SIMD ALU, and, on
  40 exit from the ALU, it may be necessary to split and "redirect" parts of the
  41 data to *multiple* actual 64-bit registers.  In other words, bit-level
  42 (or byte-level) manipulation is required, both pre- and post-ALU.
  43
  44 This is complicated!
  45
  46 As part of the initial design of Simple-V, there was an accidental assumption
  47 that it would be perfectly reasonable to use a multi-issue execution
  48 engine, and to simply drop multiple of those "hardware-loop-unrolled"
  49 operations into the instruction queue.  This turns out to be a radically
  50 different paradigm from standard vector processors, where a loop allocates
  51 elements to "lanes," and if a predication bit is not set, the lane
  52 runs "empty."  By contrast, with the multi-issue execution model, an
  53 operation that is predicated out means that the element-based instruction
  54 does not even make it into the instruction queue, leaving it free for
  55 use by subsequent instructions, even in the same cycle, and even if the
  56 operation is totally different.  Thus, unlike in a
  57 traditional vector architecture, ALUs may be occupied by elements from
  58 other lanes, because the pre-existing decoupling between the multi-issue
  59 instruction queue and the ALUs is efficiently leveraged.
  60
  61 Simple!
  62
  63 #### Tomasulo Algorithm and Reorder Buffers
  64
  65 There are many other benefits to a multi-issue microarchitecture, as
  66 are being discussed on the [Libre RISC-V mailing
  67 list](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000198.html).
  68 Personally, I favour a modified version of the [Tomasulo
  69 Algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm), which
  70 includes what is known as a [reorder
  71 buffer](https://en.wikipedia.org/wiki/Re-order_buffer).  This
  72 algorithm is particularly nice because it was first introduced in
  73 1967, and came to prominence in the early 1990s when Moore's Law
  74 started to hit a speed wall.  In particular, that means that 1) it's
  75 commonly taught in universities, and 2) patents on the algorithm have
  76 long since expired.
  77
  78 {reorder-buffer | link}
  79
  80 Also, there are both memory hazards and register hazards that a
  81 reorder-buffer-augmented Tomasulo algorithm takes care of, whilst also
  82 allowing for branch prediction and really simple roll-back,
  83 preservation of execution order even though instructions may actually
  84 be done out of order, and, crucially, some ALUs may take longer than
  85 others, and the algorithm simply does not care. In addition, there may
  86 be a really simple way to extend the reorder buffer tags to accomodate
  87 SIMD-style characteristics.
  88
  89 We also may need to have simple branch prediction, because some of
  90 the loops in [Kazan](https://salsa.debian.org/Kazan-team/kazan/)
  91 are particularly tight.  A reorder buffer (ROB) can easily be used
  92 to implement branch prediction, because, just as with an exception,
  93 the ROB can to be cleared out (flushed) if the branch is mispredicted.
  94 as it is necessary to respect exceptions, the logic has to exist to
  95 clear out the ROB: branch prediction simply uses this pre-existing logic.
  96
  97 The other way in which out-of-order execution can be handled is called
  98 scoreboarding, as well as explicit register renaming.  These schemes
  99 seem to have significant disadvantages and complexities when compared
 100 to reorder buffers (see upcoming updates for details on how the
 101 disadvantages are down to a complete failure of the academic
 102 literature to fully comprehend the design of the CDC 6600):
 103
 104 * Explicit register renaming needs a global register file quite a bit larger
 105   than the actual one.  The Libre RISC-V SoC already has two whopping
 106   great 128-entry 64-bit register files.
 107 * In-order scoreboarding actually *delays* instruction execution (all of it)
 108   until such time as the source registers and all other dependencies are
 109   ready.  The idea seems to be that the register renaming should have taken
 110   care of as many of these dependencies as possible, in advance.
 111 * Unlike Tomasulo with reorder buffers, there does not appear to be
 112   any assistance in dealing with memory LOAD/STOREs.
 113 * There's no clear way to handle branch prediction, where the reorder
 114   buffer of Tomasulo handles it really cleanly.
 115
 116 However, there are downsides to reorder buffers:
 117
 118 * The common data bus (CDB) may become a serious bottleneck, as it delivers
 119   data from multiple ALUs which may be generating results simultaneously.
 120   To keep up with result generation, *multiple* CDBs may be needed, which
 121   results in each receiver having multiple ports.
 122 * The destination field in the ROB has to act as a key in a content
 123   addressble memory (CAM).  As a result, power consumption of the ROB may be
 124   quite high.  It may or may not be possible to reduce power consumption
 125   by testing an active bitfield (separate from but augmenting the ROB)
 126   to indicate whether destination registers are in use.  If inactive,
 127   the CAM lookup need not take place.
 128
 129 Whilst nothing's firmly set in stone, as we have a charter that
 130 requires unanimous decision-making from contributors, so far it's leaning
 131 towards reorder buffers and Tomasulo as a good, clean fit.  In part, that
 132 is down to more research having been done on that particular algorithm.
 133 For completeness, scoreboarding and explicit register renaming need
 134 to be properly and comprehensively investigated.
 135
 136 #### Scoreboarding
 137
 138 [Scoreboards](https://en.wikipedia.org/wiki/Scoreboarding) are an
 139 effort to keep "score" of whether an instruction has all of its
 140 operands (and the hardware that uses them) ready to go, without conflict.
 141 Everything about the scheme, unfortunately, says that it is an incomplete
 142 mechanism.  Unlike the Tomasulo algorithm when augmented with a reorder
 143 buffer, there's no way to "undo" a completed operation: the operation
 144 proceeds and modifies registers or memory, in an out-of-order fashion,
 145 regardless of consequences.  If an exception occurred, *tough*!
 146
 147 Hence, it was quite hard to accept scoreboarding enough to understand
 148 it.  It wasn't until [this mailing list
 149 message](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000223.html)
 150 that I realised that there is near-direct equivalence between the
 151 features of scoreboarding and the Tomasulo algorithm.  It's not
 152 complete equivalence, because the reorder buffer is what keeps
 153 everything transactional and clean.  However, at least the scoreboard
 154 does have direct equivalence to the reservation station concept: the
 155 scoreboard records whether the source registers are ready or not (and
 156 so does the reservation station).
 157
 158 The problem is: there are far too many things missing from the scoreboard
 159 concept:
 160
 161 * Register-renaming has to be done separately (the Tomasulo reservation
 162   stations, in combination with the ROB, handle that implicitly).
 163 * Scoreboarding introduces the concept of "waffly exceptions"
 164   (the ROB can even record exceptions that are only actioned if they
 165   make it to the head of the queue).
 166 * Scoreboarding does not provide a means to do multi-issue (the ROB
 167   does: you just put more than one entry per cycle into the buffer).
 168
 169 (*Note: in later updates, we find that, fascinatingly, these things are
 170 just not true.*)
 171
 172 #### Unanimous Decision Making
 173
 174 The project is being run along ethical lines.  That in particular means
 175 unanimous decision-making.  Nobody gets to over-rule anyone else: everyone
 176 matters, and everyone's input matters.  So if I think
 177 Tomasulo is better, it's up to me to keep on researching and evaluating
 178 and reasoning until I have convinced everyone else... or they have convinced
 179 me otherwise.
 180
 181 I have it on good authority from some extremely comprehensive research
 182 that this is a hell of a lot better way to do decision-making than
 183 "mob-rule" voting.  "Mob-rule" voting (a.k.a. "democracy") basically
 184 *automatically* destroys the morale and discounts the will of the
 185 "minority."  No wonder democratic countries have minority representation
 186 political groups, because it's heavily brainwash-ingrained into people
 187 living in such countries that the minority view *shall* be disregarded;
 188 of *course* they feel the need to shout and get angry!
 189
 190 I feel learning from these mistakes, which are often copied and reflected
 191 in software libre groups, is very important, to try something
 192 different.  In previous libre projects that I've run, I was the
 193 kind-of informal "technical lead," where I never actually defined any
 194 guidelines or rules about how the project should make decisions: basically,
 195 I let people do what they liked unless it disrupted the project for
 196 everyone else.  I set myself up as the arbitrator, so to speak.
 197 The way that this project will be run is very much new and experimental,
 198 even to me.  We'll see how it goes.