updates/009_register_overwrites.mdwn

# "Name-less" register exception handling

In this
[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/8pAGuX6UBu0)
post a scheme is outlined that, if added to a precise-exception
augmented CDC 6600 style Scoreboard, would put less load on the register
file (fewer reads and writes) while still guaranteeing precise exception
handling.

The goal here is to reduce the number of reads and writes to the register
file because, quite simply put, doing so saves power and reduces contention
for a limited resource: the data buses between the ALUs and the register
file.  Why is it a limited resource?  Because keeping four or more ALUs fully
occupied with, for example, an FMAC operation requires 3 read ports and 1
write port *per ALU*.  If those are vectorised predicated FMAC operations,
the read count is higher still.

Four parallel FMACs initiated per clock require a whopping **TWELVE** read
ports and four write ports.  This is completely insane, and it is why the
register file has been subdivided into four separate banks.
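As a back-of-envelope check of those numbers, the port arithmetic can be sketched in Python.  The function names, and the assumption that operands spread perfectly evenly over the banks, are illustrative only and are not taken from the actual design.

```python
def ports_needed(alus, reads_per_op=3, writes_per_op=1):
    """Register-file ports needed to start one operation per ALU per clock."""
    return alus * reads_per_op, alus * writes_per_op


def ports_per_bank(alus, banks, reads_per_op=3, writes_per_op=1):
    """Ports per bank, ASSUMING operands are spread evenly over the banks."""
    reads, writes = ports_needed(alus, reads_per_op, writes_per_op)
    # Ceiling division: a bank cannot have a fraction of a port.
    return -(-reads // banks), -(-writes // banks)


print(ports_needed(4))       # (12, 4)
print(ports_per_bank(4, 4))  # (3, 1)
```

Even under that idealised spread, an FMAC still wants three reads per bank per clock, which is one more than a 2R1W cell provides.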

There are certain standard "cells" (pre-designed layouts) for register
files, including in FPGAs.  The typical layout is 2R1W (2 read ports, 1 write
port, per clock cycle).  Keeping to that will therefore not only reduce power
consumption, it will reduce the development cost for the project as well.

It turns out that in a sequential chain of FMAC (Floating-point Multiply and
Accumulate) operations, the destination register is usually also the
(additive) source register.  So, aside from the very first FMAC in the chain,
if operand "forwarding" is available in the architecture then only the two
numbers being multiplied need to be read from the register file: the number
being added arrives by forwarding.  That meshes nicely with the
whole "2R1W" thing.
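To put a number on the saving, here is a minimal sketch that counts register file reads for a chain of FMACs in which each one accumulates into the result of the previous one.  The function name and the simple counting model are assumptions made for illustration.

```python
def regfile_reads(chain_length, forwarding):
    """Register file reads for a chain of dependent FMACs.

    Without forwarding every FMAC reads all 3 operands from the register
    file.  With forwarding only the first FMAC does; every later one takes
    its accumulator from the previous result and reads just 2 operands.
    """
    if chain_length <= 0:
        return 0
    if not forwarding:
        return 3 * chain_length
    return 3 + 2 * (chain_length - 1)


for n in (1, 4, 16):
    print(n, regfile_reads(n, False), regfile_reads(n, True))
```

For long chains the read count tends towards two per instruction, which is what a 2R1W cell can supply.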

[Operand "forwarding"](https://en.wikipedia.org/wiki/Operand_forwarding)
means that the result from one instruction is "forwarded"
*directly* to the source input of a dependent instruction.  In the CDC 6600
this, interestingly, is achieved through a special design of Register File:
if a register is being read at the same time (on the same clock cycle)
as it is being written, the value is "passed through" (literally).  "Normal"
(modern) Register File designs simply do not do this, meaning that a
dependent operation would have to wait an additional cycle.  That is the
reason why the concept of "Operand Forwarding" was "invented"... even though
the 6600 had already implemented it some 55 years ago.
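The difference between the two register file styles can be shown with a toy cycle-level model.  This is only a behavioural sketch: the class, its single read port and the `write_through` flag are hypothetical names, not a description of the 6600 hardware or of the planned SoC.

```python
class RegFile:
    """Toy register file with one read port and one write port per clock."""

    def __init__(self, num_regs, write_through):
        self.regs = [0] * num_regs
        self.write_through = write_through

    def cycle(self, read_reg, write=None):
        """Run one clock.  `write` is a (reg, value) pair or None.

        Returns the value seen on the read port during this clock.
        """
        if write is not None and self.write_through and write[0] == read_reg:
            value = write[1]            # same-clock pass-through
        else:
            value = self.regs[read_reg]  # value stored before this clock
        if write is not None:
            self.regs[write[0]] = write[1]
        return value


passthrough = RegFile(4, write_through=True)
plain = RegFile(4, write_through=False)

same_clock_pt = passthrough.cycle(read_reg=1, write=(1, 42))
same_clock_plain = plain.cycle(read_reg=1, write=(1, 42))
next_clock_plain = plain.cycle(read_reg=1)
print(same_clock_pt, same_clock_plain, next_clock_plain)  # 42 0 42
```

The plain register file only delivers the new value one clock later, which is the extra cycle a dependent operation would otherwise have to wait.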

The "Banks" which are planned to be used in the Libre RISC-V SoC present a
bit of a problem as far as forwarding is concerned, even if they include
6600-style same-clock "write-through" capability (aka Operand Forwarding).
The issue is that whilst there are multiplexers planned to be added to the
source (**after** the reads are performed), there are **no** multiplexers
planned to be added before the **destination** registers are written.
Therefore, the plan is to add an additional "forwarding" Bus which can
"bypass" the register file entirely.

This is apparently fairly standard practice in high-performance modern
micro-architectures.  The problem is, however: if the register is
identified and marked as "not to be written back to the register file",
and an exception occurs, how on earth do you ensure that the system state
is stable i.e. not corrupted?  Most modern systems have a "rollback"
mechanism to deal with this.

Before we get there, however, let's back up a little bit, and go over
the example shown
[here](https://groups.google.com/forum/#!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ)
in more depth.  This is the sequence:

    ADD r1, r2, #5
    ADD r2, r1, #5
    ADD r1, r2, #5

Note that instruction 3 overwrites R1, while R1 is used as
a *source* register in instruction 2.  What that means is, if we have
Tomasulo-style Reservation Stations on the Function Units, we don't
*actually* need to write R1 from instruction 1 into the Register File
at all!  We can simply use the fact that it will be sitting in
the Function Unit's Reservation Station, use "Operand Forwarding" to
pass it to instruction 2, and, once instruction 2 is underway, throw
the instruction 1 R1 result **away**.  We achieve this by noting that
Instruction 3 "overwrites" Instruction 1's R1 as a destination, and,
whilst all three ALUs are still busy with pipelined processing, "marking"
the Function Unit handling instruction 1 as "nameless".  The "name"
of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming
FU 1 is handling instruction 1).
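The overwrite detection described above can be sketched as a scan over the in-flight instruction window.  This is a minimal software illustration, not the hardware mechanism: `Instr`, `find_overwrites` and `forwarded_readers` are hypothetical names, and the immediate operands are left out.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination, a tuple of source registers.
Instr = namedtuple("Instr", ["dest", "srcs"])


def find_overwrites(window):
    """Map each instruction index to the index of the first later
    instruction that writes the same destination register."""
    overwritten_by = {}
    for i, ins in enumerate(window):
        for j in range(i + 1, len(window)):
            if window[j].dest == ins.dest:
                overwritten_by[i] = j
                break
    return overwritten_by


def forwarded_readers(window, producer, overwriter):
    """Instructions after `producer`, up to and including `overwriter`,
    that read the producer's destination and so need it forwarded."""
    dest = window[producer].dest
    return [j for j in range(producer + 1, overwriter + 1)
            if dest in window[j].srcs]


window = [
    Instr("r1", ("r2",)),  # instruction 1: ADD r1, r2, #5
    Instr("r2", ("r1",)),  # instruction 2: ADD r2, r1, #5
    Instr("r1", ("r2",)),  # instruction 3: ADD r1, r2, #5
]
overwrites = find_overwrites(window)
readers = forwarded_readers(window, 0, overwrites[0])
print(overwrites)  # {0: 2}
print(readers)     # [1]
```

Index 0 (instruction 1) is the only candidate for going "nameless": its R1 is overwritten by index 2 (instruction 3), and its only reader in between is index 1 (instruction 2).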

Now we have the context, let's return to the bit about exceptions, and
assume that instruction 2 throws one (ADDs do not normally do that:
let's assume that they can, for now).  Note that these are the conditions:

* Precise Exception handling has been added, by adding a "schroedinger"
  wire plus a write-hazard block that prevents down-stream instructions
  from "committing" (writing) until such time as the up-stream instruction
  absolutely knows that there will not be an exception.
* When an up-stream instruction knows that it has passed (cleared) the
  hurdle of potentially needing to throw an exception, it **drops** the
  write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream
  dependent instructions to be free of write hazards, and thus commit.
  (However, that's not happening here: instruction 2 **has** flipped
  the "schroedinger" wire to "Go\_Die").
* R1 from instruction 1 has been **specifically** marked as **not** to
  be written to the Register File: it has been renamed to "nameless"
  (FU1.#n).
* R1 from instruction 1 is also a source register of instruction 2.
* Instruction 2 is to be "rolled back".
* Instruction 3 is to be told to die as well (instruction 2 has flipped
  the "Go\_Die" signal).

...um, what do we do about the value "FU1.#n"?  Instruction 3 told
Instruction 1 that it was no longer permitted to write to the Register File,
except that now Instruction 3 is dead!  Instruction 1 has absolutely no place
to put that value.  Should we discard Instruction 1 **as well**??  How far
back does this go?  This is completely wasteful of resources!  More than
that, what if we have a multi-issue engine, which issues multiple
instructions in this "nameless" fashion, where they get rolled back
again and again in an endless loop?

This is where modern micro-architectures come a little unstuck: apparently
what they do is roll back to where there are **no** "nameless" registers,
then **disable** multi-issue instruction execution, **disable** the
"nameless" capability, and slowly move forward one instruction at a time
until the exception is re-encountered.
This basically ensures that when the exception is encountered, absolutely
all of the registers may be (or are already) committed to the Register File.
At that point, a trap handler knows that it can safely context-switch, or
do whatever it likes, confident that the Register File Architectural State
is sane.

This approach is extremely wasteful of resources, and sub-optimal.  In a
design that is supposed to be power-efficient, there's an obligation to
"Do Better".  Hence the scheme below.

# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement

In CDC 6600 Terminology there is something called the "Q-Table", which
is basically an array, indexed by Register number, which keeps a record
of which Function Unit (relative to instruction order) last had that
Register as a Destination Register.  This is directly equivalent to and
completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM
entry (except that in 6600 Scoreboarding it's not a CAM).

The problem with "nameless" Operand Forwarding is: whenever a Q-Table
entry (naming any given FU) is overwritten, that's it: that instruction
**absolutely cannot** "roll back".  The critical information that would
allow the prior Function Unit (the "overwritee") to regain its claim on
the register has just been destroyed.

There is a simple solution to that: provide a *Queue* of Q-Table entries.
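As a rough software analogy of that idea, each Q-Table row can hold a short queue of Function Units instead of a single entry.  Everything here is an assumption made for illustration: the class and method names, the history depth of 4, and the policy of stalling issue when a row is full are not taken from the comp.arch posts.

```python
from collections import deque


class QTableHistory:
    """Q-Table where each register keeps a queue of writer FUs, oldest first."""

    def __init__(self, num_regs, depth=4):
        self.depth = depth  # assumed history depth per register
        self.entries = [deque() for _ in range(num_regs)]

    def issue(self, reg, fu):
        """Record `fu` as the newest writer of `reg`."""
        if len(self.entries[reg]) >= self.depth:
            raise RuntimeError("history full: issue would have to stall")
        self.entries[reg].append(fu)

    def owner(self, reg):
        """Newest writer of `reg`, or None if the Register File holds it."""
        row = self.entries[reg]
        return row[-1] if row else None

    def nameless(self, reg):
        """Older writers of `reg` whose results have been overwritten."""
        return list(self.entries[reg])[:-1]

    def cancel_newest(self, reg):
        """Newest writer was told to die: the previous FU owns `reg` again."""
        return self.entries[reg].pop()

    def retire_oldest(self, reg):
        """Oldest writer is done (result written, or safely dropped)."""
        return self.entries[reg].popleft()


q = QTableHistory(num_regs=4)
q.issue(1, "FU1")   # instruction 1 writes r1
q.issue(2, "FU2")   # instruction 2 writes r2
q.issue(1, "FU3")   # instruction 3 overwrites r1
before = (q.owner(1), q.nameless(1))
q.cancel_newest(1)  # instruction 3 is told to die
after = (q.owner(1), q.nameless(1))
print(before)  # ('FU3', ['FU1'])
print(after)   # ('FU1', [])
```

Because the older entry is kept rather than overwritten, cancelling instruction 3 leaves FU1 as the owner of r1 again, so its result has somewhere to go.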

Below is what a 6600 Q-Table looks like (image courtesy of Mitch Alsup).
In the original 6600 it is a binary table with a unary decoder on the
left and a pair of unary encoders on the right.

{{6600_q_table.png}}

With a queue of Q-Table entries, we now have exactly the information needed
to "roll back", should an exception occur.  Like many augmentations and
enhancements to the 6600 Scoreboard system, it's kind-of obvious in
retrospect.  However the *real* "duh" moment, as posted on comp.arch, is to
always ensure that FUs that are providing "nameless" data in their
destination latches will never let down-stream dependent instructions commit
if any of those down-stream instructions could potentially hit an exception.

Why is that important?  It's because it's not enough to know that the
down-stream (dependent) instructions have all initiated (read the
FU's dest latch and taken it as a forwarded src operand).  If **even one**
of those instructions throws an exception, the "nameless" FU is hosed.
So, firstly: the "nameless" FU absolutely has to wait until its dependent
instructions are clear of exceptions, and then (**and only** then) may it
safely drop (throw away) the data without writing it to the Register File;
and secondly, the "nameless" FU absolutely has to know that it can
"roll back" from "nameless" to a "named" state, in the event that one of its
dependent instructions does indeed throw an exception.  This is where the
"History" Q-Table Entries come into play.
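Those two rules can be sketched as a small state machine for one "nameless" result.  This is a behavioural illustration under stated assumptions, not the scoreboard logic itself: the class name, the state names, the dictionary standing in for the Register File, and the choice to wait on the overwriting FU as well as the reading FU are all hypothetical.

```python
class NamelessResult:
    """Result held in an FU's destination latch, marked as "nameless"."""

    def __init__(self, reg, value, dependents):
        self.reg = reg
        self.value = value
        self.waiting_on = set(dependents)  # FUs not yet clear of exceptions
        self.state = "nameless"

    def dependent_cleared(self, fu):
        """`fu` has passed the point where it could throw an exception."""
        self.waiting_on.discard(fu)
        if self.state == "nameless" and not self.waiting_on:
            # Only now is it safe to throw the value away unwritten.
            self.value = None
            self.state = "dropped"

    def dependent_excepted(self, regfile):
        """A dependent threw: become "named" again and commit the value."""
        if self.state != "nameless":
            raise RuntimeError("value already dropped: cannot roll back")
        regfile[self.reg] = self.value
        self.state = "written"


# Case A: both down-stream FUs clear, so the value is dropped unwritten.
regfile_a = {"r1": 0}
a = NamelessResult("r1", 7, {"FU2", "FU3"})
a.dependent_cleared("FU2")
state_after_one = a.state
a.dependent_cleared("FU3")

# Case B: instruction 2 throws, so the value must reach the Register File.
regfile_b = {"r1": 0}
b = NamelessResult("r1", 7, {"FU2", "FU3"})
b.dependent_excepted(regfile_b)

print(state_after_one, a.state, regfile_a)  # nameless dropped {'r1': 0}
print(b.state, regfile_b)                   # written {'r1': 7}
```

The key property is that the value is held until every FU being waited on has cleared, so the roll-back path in case B always still has the data to write.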