updates/007_2018dec25_predication.mdwn

# Out-of-out-of-Order: Unpredictable Predication Predicament

So, early in the development of the Not-So-Simple-V specification it was
identified that it meshed well with a multi-issue microarchitecture.
Recall that the basic principle is that registers are "tagged" through
a CSR table as "vectorised", and, if the Vector Length is greater than 1,
*multiple* (sequential) instructions are issued to the pipeline (where
one would normally be sent), *without* increasing the Program Counter,
the only difference between these otherwise identical instructions being
that the source and/or destination register numbers are incremented by
one on each iteration.

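To make that concrete, here is a minimal Python sketch of the expansion
step.  It is a simplified model, not the specification's exact rules: the
names `ElementOp` and `expand`, and the set standing in for the CSR tag
table, are all made up for illustration, and I am assuming that an
untagged (scalar) operand is simply reused unchanged for every element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementOp:
    """One scalar, element-level instruction (hypothetical model)."""
    opcode: str
    rd: int
    rs1: int
    rs2: int


def expand(op, vl, vectorised):
    """Expand one instruction into its element-level instructions.

    `vectorised` is a set of register numbers standing in for the CSR
    tag table (an assumption made for this sketch).
    """
    def step(reg, i):
        # Only tagged registers are stepped; scalars stay put.
        return reg + i if reg in vectorised else reg

    # Nothing tagged: this is an ordinary scalar instruction.
    if not vectorised & {op.rd, op.rs1, op.rs2}:
        return [op]

    # The Program Counter does not move: same opcode, VL times over.
    return [
        ElementOp(op.opcode, step(op.rd, i), step(op.rs1, i), step(op.rs2, i))
        for i in range(vl)
    ]


# rd=x8 and rs1=x16 are tagged as vectors, rs2=x3 is a scalar.
ops = expand(ElementOp("add", 8, 16, 3), vl=4, vectorised={8, 16})
```
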
The nice thing about a multi-issue microarchitecture is that it is very
simple to drop these element-based otherwise-identical instructions directly
into the instruction FIFO.  What is even nicer is: when predication is
introduced, all that needs to happen is that, whenever the relevant
element's predicate bit is clear, the associated element-based instruction
is **not** placed into the multi-issue instruction FIFO.

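A rough sketch of that "just don't put it in the FIFO" idea, again in
Python.  The function name `issue_elements` and the tuple layout are
hypothetical, and I am assuming (for simplicity) that all three registers
are vectorised and that bit `i` of the predicate corresponds to element `i`.

```python
from collections import deque


def issue_elements(fifo, opcode, rd, rs1, rs2, vl, predicate):
    """Append element instructions to `fifo`, skipping masked-out elements.

    `predicate` is an integer bitmask: bit i set means element i executes.
    """
    for i in range(vl):
        if (predicate >> i) & 1:
            fifo.append((opcode, rd + i, rs1 + i, rs2 + i))
        # Predicate bit clear: nothing is placed into the FIFO at all.
    return fifo


# Element 2 is masked out, so only three instructions are queued.
fifo = issue_elements(deque(), "add", 8, 16, 24, vl=4, predicate=0b1011)
```

Note that this only works because `predicate` is already a known value
when the loop runs.
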
Simple, right?  Couldn't be easier.

The problem is: the predicate and the source and the destination registers
can all come from the *same register file*.  So, one instruction may modify
an integer register that the *next instruction* then uses as its predicate.
That creates a read-after-write hazard that has to be dealt with.  That
means that in this particular out-of-order architecture, the instruction
issue phase itself has to become a Function Unit.

Let me repeat that: the instruction issue phase *itself* has to
have its own scoreboard Dependency Matrix entry.

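As a rough illustration of what that might look like, here is a toy Python
model in which the issue stage is just another named unit that reads a
register.  Everything here is an assumption made for the sketch: the class
`ToyRegisterMatrix`, the unit name `"alu0"`, and the use of a dictionary
where the real scoreboard would be a hardware bit-matrix.

```python
class ToyRegisterMatrix:
    """Toy stand-in for an FU-to-Register Dependency Matrix (hypothetical)."""

    def __init__(self):
        self.pending_write = {}  # register number -> FU that will write it

    def claim_write(self, fu, reg):
        """Record that `fu` has been issued and will write `reg`."""
        self.pending_write[reg] = fu

    def complete(self, fu):
        """`fu` has written its result: drop all of its write claims."""
        self.pending_write = {
            reg: owner
            for reg, owner in self.pending_write.items()
            if owner != fu
        }

    def blocked_on(self, reads):
        """Return {reg: writer} for every register read still pending."""
        return {
            reg: self.pending_write[reg]
            for reg in reads
            if reg in self.pending_write
        }


matrix = ToyRegisterMatrix()
matrix.claim_write("alu0", 5)     # an earlier instruction will write x5
before = matrix.blocked_on([5])   # the issue unit wants x5 as its predicate
matrix.complete("alu0")           # x5 has now been written
after = matrix.blocked_on([5])    # nothing left to wait for
```
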
This brings some quite fascinating (read: scary) challenges and opportunities.
If handled incorrectly, it means that the entire idea of using a multi-issue
instruction FIFO is toast, as there will be guaranteed stalling whenever
a predicated vectorised instruction is encountered.

Normally, a multi-issue engine has a guaranteed regular number of instructions
to process and place in the queue.  Even branches do not stop the flow
of placement into the FIFO, as branch prediction (with speculative execution)
can guess with high accuracy where the branch will go.  Predicated vectorised
instruction issue is completely different: we have *no idea* - in advance -
if the issued element-based instruction is actually going to be executed
or not.  We do not have the predicate source register (yet) because it
hasn't been calculated, because the prior instruction (which is being
executed out-of-order, and is **itself** dependent on prior instruction
completion) hasn't even started yet.

Perhaps - thinking out loud - it would be okay to have a place-holder,
waiting for the predicate bits to arrive.  Perhaps it is as simple as
adding an extra source register (predicate source) to every single Function
Unit.  So instead of each Function Unit having src1 and src2, it has
src1, src2 and a predicate "bit".  Given that it is just a single bit that
each Function Unit would be waiting for, it does seem somewhat gratuitous,
and a huge complication of an otherwise extremely simple scoreboard
(at present, there are no CAMs and no multi-wire I/Os in any of the
cells of either the FU-to-FU Matrix or the FU-to-Register Dependency Matrix).
Therefore, having **separate** Function Unit(s) which wait for the
predication register to be available, that are themselves plumbed in to
the actual Scoreboard system, decoding and issuing further instructions only
once the predicate register is ready, seems to be a reasonable avenue to
explore.

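For comparison, the first (place-holder) option could be sketched like
this.  It is purely a thought experiment in Python: `ElementSlot`,
`deliver_predicate` and the state names are invented here, and a real
Function Unit would of course be latches and wires, not a dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementSlot:
    """A Function Unit slot with src1, src2 and a predicate "bit"."""
    element: int
    src1_ready: bool = False
    src2_ready: bool = False
    pred_bit: Optional[bool] = None  # None: predicate not yet known

    def state(self):
        if self.pred_bit is None:
            return "waiting-predicate"
        if not self.pred_bit:
            return "cancelled"  # masked out: never executes
        if self.src1_ready and self.src2_ready:
            return "ready"
        return "waiting-operands"


def deliver_predicate(slots, predicate):
    """The predicate register is finally available: hand out its bits."""
    for slot in slots:
        slot.pred_bit = bool((predicate >> slot.element) & 1)


# Four place-holders are issued before the predicate is known.
slots = [
    ElementSlot(element=i, src1_ready=True, src2_ready=(i != 2))
    for i in range(4)
]
states_before = [s.state() for s in slots]
deliver_predicate(slots, 0b0101)
states_after = [s.state() for s in slots]
```
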
However, the last thing that we need is to stall execution completely,
so a lot more thought is going to be needed.  The nice thing about having
a predicated vectorisation "Issue" Function Unit is: some of the more
complex decoding (particularly REMAP) can hypothetically be pipelined.
However, pipelining the decode is **guaranteed** to result in stalled
execution, as the out-of-order system critically depends on knowing what
the dependencies **are**!  Perhaps it may be possible to put in temporary
"blank" entries that are filled in later?  That is, issue place-holder
instructions into the Dependency Matrix, where the registers on which
each instruction depends only become known at a later point?

The reason this needs to be considered: remember that the whole basis of
Simple-V is that you issue multiple *sequential* instructions.  Unfortunately,
REMAP jumbles up the word "sequential" using a 1D/2D/3D/offset algorithm,
such that the actual register (or part-register in the case of 8/16/32-bit
element widths) needs a calculation to be performed in order to determine
which register is to be used.  And, secondly, predication can entirely
skip some of those element-based instructions!

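To give a feel for why "which register?" stops being trivial, here is a
deliberately simplified Python sketch.  It is **not** the actual REMAP
algorithm: `remapped_regs` is a hypothetical helper that walks a row-major
`xdim` by `ydim` block of registers in column order, adds an offset, and
skips masked-out elements.  I am also assuming that the predicate bit is
tested against the element index *before* remapping, and element widths
(part-registers) are ignored entirely.

```python
def remapped_regs(base, xdim, ydim, offset=0, predicate=None):
    """Return the register numbers touched, in issue order.

    Illustrative 2D transpose-style walk only, not the real REMAP.
    """
    regs = []
    for i in range(xdim * ydim):
        # Predication: a clear bit means no instruction is issued.
        if predicate is not None and not (predicate >> i) & 1:
            continue
        row, col = i % ydim, i // ydim
        # Sequential element index i no longer means "base + i".
        regs.append(base + offset + row * xdim + col)
    return regs


plain = remapped_regs(8, xdim=3, ydim=2)
masked = remapped_regs(8, xdim=3, ydim=2, predicate=0b101101)
```

Even in this toy version, the set of registers an instruction touches
cannot be known until both the index calculation and the predicate are
available.
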
Talk about complex!  Simple-V is supposed to be simple!  No wonder
chip designers go for SIMD and let the software sort out the mess...