updates/003_2018dec04_microarchitecture.mdwn

   1 # Microarchitectural Decisions
   2
   3 The Libre-RISCV core is planning to deploy innovative vectorisation
   4 system, known as
   5 [Simple-V](https://libre-riscv.org/simple_v_extension/specification).
   6 Honestly, it's not very simple at all!  The principle is straightforward:
   7 mark ordinary registers as "vectorised", and when an instruction uses one
   8 such "tagged" register, go into a hardware-unrolled version of a software
   9 macro loop, issuing otherwise identical instructions with *contiguous*
  10 sequentially-increasing register numbers.
  11
  12 It sounds easy... except that the "tag" table behind the registers has
  13 been extended to:
  14
  15 * add predication (switching on and off individual "elements" of the vector)
  16 * change the element width so that what was previously a 64-bit ADD can be
  17   used to do 16-bit or even 8-bit ADDs
  18 * reordering of the elements, for 2D and 3D arrays and Matrices: very
  19   useful for in-place transposing for Matrix Multiplication)
  20 * extend the instructions so that they can access up to 128 registers,
  21   where previously that was limited to 32 (and only 8 for Compressed
  22   instructions)
  23
  24 All of these turn out to be important for GPU workloads.
  25
  26 One of the most challenging aspects of SV is that there is no restriction
  27 on the "redirection".  Whilst one instruction could use register 5 and
  28 another uses register 10, *both* of them could actually be "redirected"
  29 to use register 112, for example.  One of those could even be changed
  30 to 32-bit operations whilst the other is set to 16-bit element widths.
  31
  32 Our initial thoughts advocated a standard simple in-order SIMD architecture,
  33 with predication bits passed down into the SIMD ALUs.  If a bit is "off",
  34 that "lane" within the ALU does not calculate a result, saving power.
  35 However, in SV, when the element width is set to 32, 16 or 8-bit, a
  36 pre-issue engine is required that re-orders *parts* of the registers,
  37 packing lanes of data together so that it fits into one SIMD ALU, and, on
  38 exit from the ALU, it may be necessary to split and "redirect" parts of the
  39 data to *multiple* actual 64-bit registers.  In other words, bit-level
  40 (or byte-level) manipulation is required, both pre- and post- ALU.
  41
  42 This is complicated!
  43
  44 As part of the initial design of SV, there was an accidental assumption
  45 that it would be perfectly reasonable to use a multi-issue execution
  46 engine, and to simply drop multiple of those "hardware-loop-unrolled"
  47 operations into the instruction queue.  This turns out to be a radically
  48 different paradigm from standard vector processors, where a loop allocates
  49 elements to "lanes", and if a predication bit is not set, the lane
  50 runs "empty".  By contrast, with the multi-issue execution model, an
  51 operation that is predicated out means that the element-based instruction
  52 does not even make it into the instruction queue, leaving it free for
  53 use by following instructions, even in the same cycle, and even if the
  54 operation is totally different.  Thus, unlike in a
  55 traditional vectore architecture, ALUs may be occupied by elements from
  56 other "Lanes", because the pre-existing decoupling between the multi-issue
  57 instruction queue and the ALUs is efficiently leveraged.
  58
  59 Simple!
  60
  61 # Tomasulo Algorithm and Reorder Buffers
  62
  63 There are many other benefits to a multi-issue microarchitecture, and
  64 these are being discussed
  65 [here](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000198.html)
  66 on the mailing list.  Personally I favour a modified version of the
  67 [Tomasulo Algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm),
  68 which includes what is known as a
  69 [Reorder Buffer](https://en.wikipedia.org/wiki/Re-order_buffer).
  70 What is particularly nice about this algorithm is that it was first introduced
  71 in 1967, and came to prominence in the early 1990s when Moore's Law started
  72 to hit a speed wall.  That in particular means that, firstly, it's extremely
  73 commonly taught in Universities, and, secondly, patents on the algorithm
  74 have long since expired.
  75
  76 [[reorder_buffer.jpg]]
  77
  78 Also, there are both memory hazards and register hazards that a Reorder
  79 Buffer augmented Tomasulo algorithm takes care of, whilst also allowing
  80 for branch prediction and really simple roll-back, preservation of
  81 execution order even though instructions may actually be done out of order,
  82 and, crucially, some ALUs may take longer than others, and the algorithm
  83 simply does not care.  In addition, there may be a really simple way to
  84 extend the Reorder Buffer tags to accomodate SIMD-style characteristics.
  85
  86 We also may need to have simple Branch Prediction, because some of
  87 the loops in [Kazan](https://salsa.debian.org/Kazan-team/kazan/)
  88 are particularly tight.  A Reorder Buffer (ROB) can easily be used
  89 to implement Branch Prediction, because, just as with an Exception,
  90 the ROB can to be cleared out (flushed) if the branch is mispredicted.
  91 As it is necessary to respect Exceptions, the logic has to exist to
  92 clear out the ROB: Branch Prediction simply uses this pre-existing logic.
  93
  94 The other way in which out-of-order execution can be handled is called
  95 scoreboarding, as well as explicit register renaming.  These schemes
  96 seem to have significant disadvantages and complexities when compared
  97 to Reorder Buffers (however, see later updates: the disadvantages are down
  98 to a complete failure of the academic literature to fully comprehend
  99 the design of the CDC 6600):
 100
 101 * Explicit Register renaming needs a global register file quite a bit larger
 102   than the "actual" one.  The Libre RISC-V SoC already has two whopping
 103   great 128-entry 64-bit register files.
 104 * In-order scoreboarding actually *delays* instruction execution (all of it)
 105   until such time as the source registers and all other dependencies are
 106   ready.  The idea seems to be that the register renaming "should have taken
 107   care of" as many of these dependencies as possible, in advance.
 108 * Unlike Tomasulo with Reorder buffers, there does not appear to be
 109   any assistance in dealing with memory LOAD/STOREs.
 110 * There's no clear way to handle branch prediction, where the Reorder
 111   Buffer of Tomasulo handles it really cleanly.
 112
 113 However there are downsides to Reorder Buffers:
 114
 115 * The Common Data Bus may become a serious bottleneck, as it delivers
 116   data from multiple ALUs which may be generating results simultaneously.
 117   To keep up with result generation, *multiple* CDBs may be needed, which
 118   results in each receiver having multiple ports
 119 * The Destination field in the ROB has to act as a key in a CAM (Content
 120   Addressble Memory).  As a result, power consumption of the ROB may be
 121   quite high.  It may or may not be possible to reduce power consumption
 122   by testing an "active" bitfield (separate from but augmenting the ROB)
 123   to indicate whether Destination Registers are in use.  If inactive,
 124   the CAM lookup need not take place.
 125
 126 Whilst nothing's firmly set in stone, here, as we have a Charter that
 127 requires unanimous decision-making from contributors, so far it's leaning
 128 towards Reorder Buffers and Tomasulo as a good, clean fit.  In part that
 129 is down to more research having been done on that particular algorithm.
 130 For completeness, scoreboarding and explicit register renaming need
 131 to be properly and comprehensively investigted.
 132
 133 # Scoreboarding
 134
 135 [Scoreboards](https://en.wikipedia.org/wiki/Scoreboarding) are an
 136 effort to keep "score" of whether an instruction has all of its
 137 operands (and the hardware that uses them) ready to go, without conflict.
 138 Everything about the scheme, unfortunately, says that it is an incomplete
 139 mechanism.  Unlike the Tomasulo algorithm when augmented with a Reorder
 140 Buffer, there's no way to "undo" a completed operation: the operation
 141 proceeds and modifies registers or memory, in an out-of-order fashion,
 142 regardless of consequences.  If an exception occurred *tough*!
 143
 144 Hence, it was quite hard to accept scoreboarding enough to understand it.
 145 It wasn't until
 146 [this message](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000223.html)
 147 that I realised that there is near-direct equivalence between the features
 148 of scoreboarding and the Tomasulo Algorithm.  It's not complete equivalence,
 149 because the Reorder Buffer is what keeps everything transactional and clean.
 150 However, at least the scoreboard does have direct equivalence to the
 151 Reservation Station concept: the scoreboard records whether the source
 152 registers are ready or not (and so does the Reservation Station).
 153
 154 The problem is: there are far too many things missing from the scoreboard
 155 concept:
 156
 157 * Register-renaming has to be done separately (the Tomasulo Reservation
 158   Stations, in combination with the ROB, handle that implicitly).
 159 * Scoreboarding introduces the concept of "waffly exceptions"
 160   (the ROB can even record exceptions that are only actioned if they
 161   make it to the head of the queue).
 162 * Scoreboarding does not provide a means to do multi-issue (the ROB
 163   does: you just put more than one entry per cycle into the buffer)
 164
 165 (*Note: in later updates, we find that, fascinatingly, these things are
 166 just not true.*)
 167
 168 # Next step
 169
 170 The project is being run along ethical lines.  That in particular means
 171 unanimous decision-making.  Nobody gets to over-rule anyone else: everyone
 172 matters, and everyone's input matters.  So if I, personally, think that
 173 Tomasulo is better, it's up to me to keep on researching and evaluating
 174 and reasoning until I have convinced everyone else... or they have convinced
 175 me otherwise.
 176
 177 I have it on good authority from some extremely comprehensive research
 178 that this is a hell of a lot better way to do decision-making than
 179 "mob-rule" voting.  "Mob-rule" voting (aka "democracy") basically
 180 *automatically* destroys the morale and discounts the will of the
 181 "minority".  No wonder democratic countries have "Minority Representation
 182 Political Groups", because it's heavily brainwash-ingrained into people
 183 living in such countries that the "Minority" view *shall* be disregarded;
 184 of *course* they feel the need to shout and get angry!
 185
 186 Learning from these mistakes, which are often copied and reflected
 187 in Software Libre groups, is I feel very important, to try something
 188 different.  In previous Libre projects that I've run, I was the
 189 kind-of informal "technical lead", where I never actually defined any
 190 guidelines or rules about how the project should make decisions: basically
 191 I let people do what they liked unless it disrupted the project for
 192 everyone else.  I set myself up as the "arbitrator", so to speak.
 193 The way that this project will be run is very much new and experimental,
 194 even to me.  We'll see how it goes.
 195
 196