# "Name-less" register exception handling

In this
[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/8pAGuX6UBu0)
post a scheme is outlined that, if added to a precise-exception
augmented CDC 6600 style Scoreboard, would put less load on the register
file (fewer reads and writes) while still guaranteeing precise exception
handling.

The goal here is to reduce the number of reads and writes to the register
file, because, quite simply put, doing so saves power and reduces contention
for the limited resource of the data buses between the ALUs and the register
file.  Why is that resource limited?  Because keeping four or more ALUs fully
occupied with, for example, an FMAC operation requires 3 READ ports and
1 WRITE port *per ALU*.  If those are vectorised predicated FMAC operations,
the READ count is even higher than that.

Four parallel FMACs initiated per clock require a whopping **TWELVE** READ
ports and four WRITE ports.  This is completely insane, and it is why the
register file has been subdivided into four separate banks.

There are certain standard "cells" (pre-designed layouts), including in
FPGAs, for register files.  The typical layout is 2R1W (2 read ports, 1 write
port, per clock cycle).  Keeping to that will therefore not only reduce power
consumption, it will reduce the development cost for the project as well.

It turns out that in a sequential chain of FMAC (Floating-point Multiply and
Accumulate) operations, the destination register is usually also the
(additive) source register.  So, aside from the very first FMAC in the chain,
if operand "forwarding" is available in the architecture, it is only the two
numbers being multiplied that need to be read from the register file: the
number being added arrives by forwarding.  That meshes nicely with the
whole "2R1W" thing.
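
The port arithmetic above can be checked with a few lines of Python.  This is
only a back-of-envelope sketch: the helper name `ports_per_bank` is made up for
illustration, and it assumes that accesses spread evenly across the banks,
which the real design may not guarantee.

```python
def ports_per_bank(fmacs_per_clock, accumulator_forwarded, banks=1):
    """Register-file (read, write) ports needed per bank, per clock.

    Hypothetical helper.  Assumes every FMAC needs 3 source operands and
    1 result, and that accesses spread evenly across the banks.
    """
    # With forwarding, the accumulator never comes from the register file.
    reads_per_fmac = 2 if accumulator_forwarded else 3
    reads = fmacs_per_clock * reads_per_fmac
    writes = fmacs_per_clock
    return reads // banks, writes // banks


print(ports_per_bank(4, accumulator_forwarded=False))           # (12, 4)
print(ports_per_bank(4, accumulator_forwarded=False, banks=4))  # (3, 1)
print(ports_per_bank(4, accumulator_forwarded=True, banks=4))   # (2, 1)
```

Under those assumptions, banking alone still leaves each bank needing 3R1W;
it is forwarding of the accumulator that brings each bank down to 2R1W.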

[Operand "forwarding"](https://en.wikipedia.org/wiki/Operand_forwarding)
basically means that the result from one instruction is "forwarded"
*directly* to the source input of a dependent instruction.  In the CDC 6600
this, interestingly, is achieved through a special design of Register File,
where if a register is being read at the same time (on the same clock cycle)
as it is being written, the value is "passed through" (literally).  "Normal"
(modern) Register File designs simply do not do this, meaning that a
dependent operation would have to wait an additional cycle: hence the
reason why the concept of "Operand Forwarding" was "invented"... even though
the 6600 had already implemented it 55 years earlier.
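
The difference between a "pass-through" Register File and a "normal" one can
be shown with a small behavioural model.  This is a sketch in Python, not HDL,
and not the 6600's actual circuit: the class `RegFile` and its methods are
hypothetical names chosen for illustration.

```python
class RegFile:
    """Behavioural model of a register file with optional write-through.

    Hypothetical sketch: writes issued during a clock cycle only land in
    the array when clock() is called.  With write_through enabled, a read
    in the same cycle sees the value that is being written.
    """

    def __init__(self, nregs, write_through):
        self.regs = [0] * nregs
        self.write_through = write_through
        self.pending = {}  # regnum -> value being written this cycle

    def write(self, regnum, value):
        self.pending[regnum] = value

    def read(self, regnum):
        if self.write_through and regnum in self.pending:
            return self.pending[regnum]  # "passed through"
        return self.regs[regnum]

    def clock(self):
        # End of cycle: pending writes become visible in the array.
        for regnum, value in self.pending.items():
            self.regs[regnum] = value
        self.pending.clear()


for write_through in (True, False):
    rf = RegFile(nregs=8, write_through=write_through)
    rf.write(1, 42)
    same_cycle = rf.read(1)  # read on the same clock as the write
    rf.clock()
    next_cycle = rf.read(1)  # read one clock later
    print(write_through, same_cycle, next_cycle)
```

With write-through the dependent read gets 42 on the same clock; without it,
the read returns the stale value and the new one is only visible a cycle later.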

The "Banks" which are planned to be used in the Libre RISC-V SoC present a
bit of a problem as far as forwarding is concerned, even if they include
6600-style same-clock "write-through" capability (aka Operand Forwarding).
The issue is that whilst there are multiplexers planned to be added to the
source (**after** the reads are performed), there are **no** multiplexers
planned to be added before the **destination** registers are written.
Therefore, the plan is to add an additional "forwarding" Bus which can
"bypass" the register file entirely.

This is apparently fairly standard practice in high-performance modern
micro-architectures.  The problem is, however: if the register is
identified and marked as "not to be written back to the register file",
and an exception occurs, how on earth do you ensure that the system state
is stable, i.e. not corrupted?  Most modern systems have a "rollback"
mechanism to deal with this.

Before we get there, however, let's back up a little bit, and go over
the example shown
[here](https://groups.google.com/forum/#!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ)
in more depth.  This is the sequence:

    ADD r1, r2, #5
    ADD r2, r1, #5
    ADD r1, r2, #5

Note that instruction 3 overwrites R1, whereas R1 is used as
a *source* register in instruction 2.  What that means is, if we have
Tomasulo-style Reservation Stations on the Function Units, we don't
*actually* need to write R1 from instruction 1 into the Register File
at all!  Because the result will be sitting in the Function Unit's
Reservation Station, we can use "Operand Forwarding" to
pass it to instruction 2, and, once instruction 2 is underway, throw
the instruction 1 R1 result **away**.  We achieve this by noting that
Instruction 3 "overwrites" Instruction 1's R1 as a destination, and,
whilst all three ALUs are still busy with pipelined processing, "marking"
the Function Unit handling instruction 1 as "nameless".  The "name"
of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming
FU 1 is handling instruction 1).
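
The rule for spotting a "nameless" candidate in the three-ADD example can be
sketched as follows.  This is an illustration only, with assumptions: `Insn`
and `nameless_candidates` are hypothetical names, every instruction in the
window is taken to be in flight at the same time, and exceptions are ignored
entirely (which is exactly the complication that the rest of this post is
about).

```python
from collections import namedtuple

# Hypothetical instruction record: one destination, a list of sources.
Insn = namedtuple("Insn", ["dest", "srcs"])


def nameless_candidates(window):
    """Indices of in-flight instructions whose destination register is
    overwritten by a later instruction in the same window, so that the
    result only ever needs to be forwarded, never written back."""
    candidates = []
    for i, insn in enumerate(window):
        if any(later.dest == insn.dest for later in window[i + 1:]):
            candidates.append(i)
    return candidates


window = [
    Insn(dest="r1", srcs=["r2"]),  # ADD r1, r2, #5
    Insn(dest="r2", srcs=["r1"]),  # ADD r2, r1, #5
    Insn(dest="r1", srcs=["r2"]),  # ADD r1, r2, #5
]
print(nameless_candidates(window))  # [0]
```

Only instruction 1 (index 0) qualifies: its R1 is consumed by instruction 2
and then overwritten by instruction 3.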

Now we have the context, let's return to the bit about exceptions, and
assume that instruction 2 throws one (ADDs do not normally do that:
let's assume that they can, for now).  Note that these are the conditions:

* Precise Exception handling has been added (by adding a "schroedinger"
  wire plus a write-hazard block that prevents down-stream instructions
  from "committing" (writing) until such time as the up-stream instruction
  absolutely knows that there will not be an exception).
* When an up-stream instruction knows that it has passed (cleared) the
  hurdle of potentially needing to throw an exception, it **drops** the
  write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream
  dependent instructions to be free of write hazards, and thus commit.
  (However, that's not happening here: instruction 2 **has** flipped
  the "schroedinger" wire to "Go\_Die").
* R1 from instruction 1 has been **specifically** marked as **not** to
  be written to the Register File: it has been renamed to "nameless"
  (FU1.#n).
* R1 from instruction 1 is also a source register of instruction 2.
* Instruction 2 is to be "rolled back".
* Instruction 3 is to be told to die as well (instruction 2 has flipped
  the "Go\_Die" signal).

...um, what do we do about the value "FU1.#n"?  Instruction 3 told
Instruction 1 that it was no longer permitted to write to the Register File,
except that now Instruction 3 is dead!  Instruction 1 has absolutely no place
to put that value.  Should we discard Instruction 1 **as well**??  How far
back does this go?  This is completely wasteful of resources!  More than that,
what if we have a multi-issue engine, which issues multiple
instructions in this "nameless" fashion, where they get rolled back
again and again in an endless loop?

This is where modern micro-architectures get a little unstuck: apparently
what they do is roll back to where there are **no** "nameless" registers,
then **disable** multi-issue instruction execution, **disable** the
"nameless" capability, and slowly move forward one instruction at a time
until the exception is re-encountered.
This basically ensures that when the exception is encountered, absolutely
all of the registers may be (or are already) committed to the Register File.
At that point, a trap handler knows that it can safely context-switch, or
do whatever it likes, confident that the Register File Architectural State
is sane.

This approach is extremely wasteful of resources, and sub-optimal.  In a
design that is supposed to be power-efficient, there's an obligation to
"Do Better".  Hence the scheme below.

# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement

In CDC 6600 Terminology there is something called the "Q-Table", which
is basically an array, indexed by Register number, which keeps a record
of which Function Unit (relative to instruction order) last had that
Register as its Destination Register.  This is directly equivalent to and
completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM
entry (except that in 6600 Scoreboarding it's not a CAM).

The problem with "nameless" Operand Forwarding is: whenever a Q-Table
entry (for any given FU) is overwritten, that's it: that instruction
**absolutely cannot** "roll back".  The critical information that would
allow the prior Function Unit (the "overwritee") to reclaim the register
has just been destroyed.

There is a simple solution to that: provide a *Queue* of Q-Table entries.

Below is what a 6600 Q-Table looks like (image courtesy of Mitch Alsup).
In the original 6600 it is a binary table with a unary decoder on the
left and a pair of unary encoders on the right.

{{6600_q_table.png}}

The plan is, therefore, to add effectively *multiple* Q-Tables
(or, multiple entries), recording the "history" of which *prior*
Function Units had any given register as their destination.
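
As a rough illustration of the idea, a Q-Table with "history" can be modelled
as one list of Function Units per register, oldest first.  This is a software
sketch under stated assumptions, not the planned hardware: the class
`QTableHistory` and its methods `reserve`, `current` and `release` are
hypothetical names, and an unbounded Python list stands in for what would be
a fixed-depth queue.

```python
class QTableHistory:
    """Per-register history of which FUs hold a destination reservation.

    Hypothetical sketch.  The last entry for a register is the FU that
    currently "owns" the name; earlier entries are "nameless" FUs.
    """

    def __init__(self, nregs):
        self.history = [[] for _ in range(nregs)]

    def current(self, reg):
        """FU that currently owns register `reg`, or None."""
        entries = self.history[reg]
        return entries[-1] if entries else None

    def reserve(self, reg, fu):
        """Issue: `fu` takes `reg` as its destination.  Returns the FU
        that has just been overwritten (made "nameless"), if any."""
        overwritten = self.current(reg)
        self.history[reg].append(fu)
        return overwritten

    def release(self, reg, fu):
        """Drop the reservation of `fu` on `reg` (it was cancelled, or its
        result is no longer needed).  Returns whichever FU now owns `reg`."""
        self.history[reg].remove(fu)
        return self.current(reg)


q = QTableHistory(nregs=4)
print(q.reserve(1, "FU1"))  # None: nothing overwritten
print(q.reserve(2, "FU2"))  # None
print(q.reserve(1, "FU3"))  # FU1: FU1 becomes "nameless"
print(q.release(1, "FU3"))  # FU1: FU3 cancelled, FU1 owns r1 again
```

A single-entry Q-Table would have lost "FU1" at the third step; keeping the
older entry is what makes the last step possible.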

Now we have exactly the information needed to "roll back", should an
exception occur.  Like many augmentations and enhancements to the 6600
Scoreboard system, it's kind-of obvious in retrospect.  However the *real*
"duh" moment, as posted on comp.arch, is to always ensure that FUs that
are providing "nameless" data in their destination latches will never
let down-stream dependent instructions commit if any of those down-stream
instructions could potentially hit an exception.

Why is that important?  It's because it's not enough to know that the
down-stream (dependent) instructions have all initiated (read the
FU's dest latch and taken it as a forwarded src operand).  If **even one**
of those instructions throws an exception, the "nameless" FU from which that
value came is hosed, as it has nowhere to put its result.

So, firstly: the "nameless" FU absolutely has to wait until its dependencies
are clear of exceptions, and then (**and only** then) may it safely drop
(throw away) the data without writing it to the Register File; and secondly,
the "nameless" FU absolutely has to know that it can "roll back" from
"nameless" to a "named" state, in the event that one of its dependent
instructions does indeed throw an exception.  This is where the "History"
Q-Table Entries come into play.
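
Those two requirements amount to a three-way decision for a "nameless" FU,
which can be sketched like this.  The function name, the state strings and
the returned action strings are all hypothetical, chosen only to make the
rule concrete.

```python
def nameless_fu_action(dependent_states):
    """Decide what a "nameless" FU may do with the result it is holding.

    Hypothetical sketch.  Each dependent instruction is in one of three
    states: "pending" (may still throw), "safe" (past the point where it
    could throw) or "excepted" (has thrown an exception).
    """
    if any(state == "excepted" for state in dependent_states):
        return "become_named"  # roll back: the result must be kept
    if all(state == "safe" for state in dependent_states):
        return "discard"       # nobody can need the value any more
    return "hold"              # keep the result latched and wait


print(nameless_fu_action(["safe", "pending"]))   # hold
print(nameless_fu_action(["safe", "safe"]))      # discard
print(nameless_fu_action(["safe", "excepted"]))  # become_named
```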

So there are a few potential ways to go about this:

* Using the Historical Q-Table Entries, in chronological and Dependency
  Order, store all "Nameless" Registers (using the "history" to determine
  where), even if they are going to get overwritten in the next cycle.
* After triggering the "Go\_Die" wire from the Exception, and once all
  dependent instructions have been removed (including their Destination
  Register Reservations), use the "history" information to work out
  which (formerly nameless) Function Unit(s) now actually have the
  Destination Reservation for each "vacated" Register.
* Any remaining "nameless" Registers, if their results are available,
  are likewise either stored or trigger their shadow (dependent)
  instructions to die (even if it's the original exception).
* Once the dust settles, carry on.
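
The second option above can be sketched for the three-ADD example.  This is a
minimal software model, not a definitive implementation: the function
`recover` and its data layout are hypothetical, and it makes the simplifying
assumption that the faulting instruction and everything issued after it are
killed by "Go\_Die".

```python
def recover(in_flight, history, faulting_fu):
    """Roll back after `faulting_fu` throws an exception.

    Hypothetical sketch.  `in_flight` is a list of (fu, dest_reg) pairs in
    issue order; `history` maps each register to the list of FUs that have
    reserved it, oldest first.  Returns the surviving instructions and a
    map of register -> FU that regains the name and must now write back.
    """
    index = [fu for fu, _ in in_flight].index(faulting_fu)
    survivors, killed = in_flight[:index], in_flight[index:]

    # Remove the Destination Register Reservations of the dead FUs.
    for fu, reg in killed:
        history[reg].remove(fu)

    # Whoever is now last in the history regains the vacated register.
    must_write = {}
    for _, reg in killed:
        if history[reg]:
            must_write[reg] = history[reg][-1]
    return survivors, must_write


in_flight = [("FU1", "r1"), ("FU2", "r2"), ("FU3", "r1")]
history = {"r1": ["FU1", "FU3"], "r2": ["FU2"]}
survivors, must_write = recover(in_flight, history, faulting_fu="FU2")
print(survivors)   # [('FU1', 'r1')]
print(must_write)  # {'r1': 'FU1'}
```

Here FU1, which had been made "nameless" by instruction 3, regains the name
R1 and must now write its result to the Register File.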

Realistically, this is going to need to be investigated with simulations.
It's quite complicated, however the payoff is a significant reduction in
the workload on the register file.  It basically means the difference between
12 GFLOPs and 6 GFLOPs when doing 32-bit FMACs, at 800MHz (quad-core),
whilst still being able to keep to a "standard" 2R1W register file.
So it's a big deal!