# Microarchitectural Decisions

The Libre-RISCV core is planning to deploy innovative vectorisation
system, known as
[Simple-V](https://libre-riscv.org/simple_v_extension/specification).
Honestly, it's not very simple at all!  The principle is straightforward:
mark ordinary registers as "vectorised", and when an instruction uses one
such "tagged" register, go into a hardware-unrolled version of a software
macro loop, issuing otherwise identical instructions with *contiguous*
sequentially-increasing register numbers.

It sounds easy... except that the "tag" table behind the registers has
been extended to:

* add predication (switching on and off individual "elements" of the vector)
* change the element width so that what was previously a 64-bit ADD can be
  used to do 16-bit or even 8-bit ADDs
* reordering of the elements, for 2D and 3D arrays and Matrices: very
  useful for in-place transposing for Matrix Multiplication)
* extend the instructions so that they can access up to 128 registers,
  where previously that was limited to 32 (and only 8 for Compressed
  instructions)

All of these turn out to be important for GPU workloads.

One of the most challenging aspects of SV is that there is no restriction
on the "redirection".  Whilst one instruction could use register 5 and
another uses register 10, *both* of them could actually be "redirected"
to use register 112, for example.  One of those could even be changed
to 32-bit operations whilst the other is set to 16-bit element widths.

Our initial thoughts advocated a standard simple in-order SIMD architecture,
with predication bits passed down into the SIMD ALUs.  If a bit is "off",
that "lane" within the ALU does not calculate a result, saving power.
However, in SV, when the element width is set to 32, 16 or 8-bit, a
pre-issue engine is required that re-orders *parts* of the registers,
packing lanes of data together so that it fits into one SIMD ALU, and, on
exit from the ALU, it may be necessary to split and "redirect" parts of the
data to *multiple* actual 64-bit registers.  In other words, bit-level
(or byte-level) manipulation is required, both pre- and post- ALU.

This is complicated!

As part of the initial design of SV, there was an accidental assumption
that it would be perfectly reasonable to use a multi-issue execution
engine, and to simply drop multiple of those "hardware-loop-unrolled"
operations into the instruction queue.  This turns out to be a radically
different paradigm from standard vector processors, where a loop allocates
elements to "lanes", and if a predication bit is not set, the lane
runs "empty".  By contrast, with the multi-issue execution model, an
operation that is predicated out means that the element-based instruction
does not even make it into the instruction queue, leaving it free for
use by following instructions, even in the same cycle, and even if the
operation is totally different.  Thus, unlike in a
traditional vectore architecture, ALUs may be occupied by elements from 
other "Lanes", because the pre-existing decoupling between the multi-issue
instruction queue and the ALUs is efficiently leveraged.

Simple!

There are many other benefits to a multi-issue microarchitecture, and
these are being discussed
[here](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000198.html)
on the mailing list.  Personally I favour a modified version of the
[Tomasulo Algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm),
which includes what is known as a
[Reorder Buffer](https://en.wikipedia.org/wiki/Re-order_buffer).
What is particularly nice about this algorithm is that it was first introduced
in 1967, and came to prominence in the early 1990s when Moore's Law started
to hit a speed wall.  That in particular means that, firstly, it's extremely
commonly taught in Universities, and, secondly, patents on the algorithm
have long since expired.

[[reorder_buffer.jpg]]

Also, there are both memory hazards and register hazards that a Reorder
Buffer augmented Tomasulo algorithm takes care of, whilst also allowing
for branch prediction and really simple roll-back, preservation of
execution order even though instructions may actually be done out of order,
and, crucially, some ALUs may take longer than others, and the algorithm
simply does not care.  In addition, there may be a really simple way to
extend the Reorder Buffer tags to accomodate SIMD-style characteristics.

We also may need to have simple Branch Prediction, because some of the
loops in [Kazan](https://salsa.debian.org/Kazan-team/kazan/) are particularly
tight.  A Reorder Buffer can easily be used to implement Branch Prediction,
because, just as with an Exception, the ROB can to be cleared out
(flushed) if the branch is mispredicted.  As it is necessary to respect
Exceptions, the logic has to exist to clear out the ROB: Branch Prediction
simply uses this pre-existing logic.

The other way in which out-of-order execution can be handled is called
scoreboarding, as well as explicit register renaming.  These schemes
seem to have significant disadvantages and complexities when compared
to Reorder Buffers:

* Explicit Register renaming needs a global register file quite a bit larger
  than the "actual" one.  The Libre RISC-V SoC already has two whopping
  great 128-entry 64-bit register files.
* In-order scoreboarding actually *delays* instruction execution (all of it)
  until such time as the source registers and all other dependencies are
  ready.  The idea seems to be that the register renaming "should have taken
  care of" as many of these dependencies as possible, in advance.
* Unlike Tomasulo with Reorder buffers, there does not appear to be
  any assistance in dealing with memory LOAD/STOREs.
* There's no clear way to handle branch prediction, where the Reorder
  Buffer of Tomasulo handles it really cleanly.

Whilst nothing's firmly set in stone, here, as we have a Charter that
requires unanimous decision-making from contributors, so far it's leaning
towards Reorder Buffers and Tomasulo as a good, clean fit.  In part that
is down to more research having been done on that particular algorithm.
For completeness, scoreboarding and explicit register renaming need
to be properly and comprehensively investigted.
More as it happens...