updates/003_2018dec04_microarchitecture.mdwn

   1 # Microarchitectural Decisions
   2
   3 The Libre-RISCV core is planning to deploy innovative vectorisation
   4 system, known as
   5 [Simple-V](https://libre-riscv.org/simple_v_extension/specification).
   6 Honestly, it's not very simple at all!  The principle is straightforward:
   7 mark ordinary registers as "vectorised", and when an instruction uses one
   8 such "tagged" register, go into a hardware-unrolled version of a software
   9 macro loop, issuing otherwise identical instructions with *contiguous*
  10 sequentially-increasing register numbers.
  11
  12 It sounds easy... except that the "tag" table behind the registers has
  13 been extended to:
  14
  15 * add predication (switching on and off individual "elements" of the vector)
  16 * change the element width so that what was previously a 64-bit ADD can be
  17   used to do 16-bit or even 8-bit ADDs
  18 * reordering of the elements, for 2D and 3D arrays and Matrices: very
  19   useful for in-place transposing for Matrix Multiplication)
  20 * extend the instructions so that they can access up to 128 registers,
  21   where previously that was limited to 32 (and only 8 for Compressed
  22   instructions)
  23
  24 All of these turn out to be important for GPU workloads.
  25
  26 One of the most challenging aspects of SV is that there is no restriction
  27 on the "redirection".  Whilst one instruction could use register 5 and
  28 another uses register 10, *both* of them could actually be "redirected"
  29 to use register 112, for example.  One of those could even be changed
  30 to 32-bit operations whilst the other is set to 16-bit element widths.
  31
  32 Our initial thoughts were to try a standard simple in-order SIMD architecture,
  33 with predication bits passed down into the SIMD ALUs.  If a bit is "off",
  34 that "lane" within the ALU does not calculate a result, saving power.
  35 However, a pre-analysis engine is required that re-orders the registers,
  36 packs lanes of data together so that it fits into one SIMD ALU, and, on
  37 exit from the ALU, it may be necessary to split and "redirect" parts of the
  38 data to *multiple* actual 64-bit registers.  In other words, bit-level
  39 (or byte-level) manipulation is required, both pre- and post- ALU.
  40
  41 This is complicated!
  42
  43 As part of the initial design of SV, there was an accidental assumption
  44 that it would be perfectly reasonable to use a multi-issue execution
  45 engine, and to simply drop multiple of those "hardware-loop-unrolled"
  46 operations into the instruction queue.  This turns out to be a radically
  47 different paradigm from standard vector processors, where a loop allocates
  48 elements to "lanes", and if a predication bit is not set, the lane
  49 runs "empty".  By contrast, with the multi-issue execution model, an
  50 operation that is predicated out means that the element-based instruction
  51 simply does not make it into the instruction queue.
  52
  53 Simple!
  54
  55 There are many other benefits to a multi-issue microarchitecture, and
  56 these are being discussed
  57 [here](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000198.html)
  58 on the mailing list.  Personally I favour a modified version of the
  59 [Tomasulo Algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm),
  60 which includes what is known as a
  61 [Reorder Buffer](https://en.wikipedia.org/wiki/Re-order_buffer).
  62 What is particularly nice about this algorithm is that it was first introduced
  63 in 1967, and came to prominence in the early 1990s when Moore's Law started
  64 to hit a speed wall.  That in particular means that, firstly, it's extremely
  65 commonly taught in Universities, and, secondly, patents on the algorithm
  66 have long since expired.
  67
  68 Also, there are both memory hazards and register hazards that a Reorder
  69 Buffer augmented Tomasulo algorithm takes care of, whilst also allowing
  70 for branch prediction and really simple roll-back, preservation of
  71 execution order even though instructions may actually be done out of order,
  72 and, crucially, some ALUs may take longer than others, and the algorithm
  73 simply does not care.  In addition, there may be a really simple way to
  74 extend the Reorder Buffer tags to accomodate SIMD-style characteristics.
  75
  76 We also may need to have simple Branch Prediction, because some of the
  77 loops in [Kazan](https://salsa.debian.org/Kazan-team/kazan/) are particularly
  78 tight.  A Reorder Buffer can easily be used to implement Branch Prediction,
  79 because, just as with an Exception, the ROB needs to be cleared out
  80 (flushed) if the branch is mispredicted.  As it is necessary to respect
  81 Exceptions, the logic has to exist to clear out the ROB: Branch Prediction
  82 simply uses this pre-existing logic.
  83