# "Name-less" register exception handling

In this
[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/8pAGuX6UBu0)
post a scheme has been outlined that, if added to a precise-exception
augmented CDC 6600 style Scoreboard, would allow less load on the register
file (fewer reads and writes) and still guarantee precise exception handling.

The goal here is to reduce the number of reads and writes to the register
file, because, quite simply put, doing so saves power and reduces contention
for the limited resource of the data buses between the ALUs and the register
file. Why limited resource? Because keeping four or more ALUs fully occupied
with, for example, an FMAC operation requires 3 READ ports and 1 WRITE port
*per ALU*. If those are vectorised predicated FMAC operations, it's an even
higher READ count than that.

Four parallel FMACs initiated per clock require a whopping **TWELVE** READ
ports and four WRITE ports. This is completely insane and it is why the
register file has been subdivided into four separate banks.
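As a quick check of that arithmetic, here is a minimal Python sketch. The function name `ports_needed` and the assumption that operations spread evenly across the banks are purely illustrative and not part of the design.

```python
# Illustrative arithmetic only: the names and the even-split assumption
# are hypothetical, not taken from the actual design.
FMAC_READS = 3   # two multiplicands plus the accumulator
FMAC_WRITES = 1  # the accumulated result


def ports_needed(fmacs_per_clock, banks=1):
    """Return (reads, writes) needed per bank per clock, assuming the
    operations are spread evenly across the banks."""
    reads = fmacs_per_clock * FMAC_READS
    writes = fmacs_per_clock * FMAC_WRITES
    # Ceiling division: a bank cannot provide a fraction of a port.
    return (-(-reads // banks), -(-writes // banks))


print(ports_needed(4))           # (12, 4)
print(ports_needed(4, banks=4))  # (3, 1)
```

Under these assumptions, splitting into four banks brings the demand down from twelve reads and four writes to three reads and one write per bank per clock.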

There are certain standard "cells" (pre-designed layouts) for register
files, including in FPGAs. The typical layout is 2R1W (2 read ports, 1 write
port, per clock cycle). Therefore, keeping to that will not only reduce power
consumption, it will reduce the development cost for the project as well.

It turns out that with FMAC (Floating-point Multiply and Accumulate) operations,
the destination register is usually also the (additive) source register,
in a sequential chain of FMACs. So, aside from the very first
FMAC in the chain, if operand "forwarding" is available in the architecture,
then it is only the two numbers being multiplied (and then added)
that need to be read from the register file. That nicely meshes with the
whole "2R1W" thing.

[Operand "forwarding"](https://en.wikipedia.org/wiki/Operand_forwarding)
basically means that the result from one instruction is "forwarded"
*directly* to the source input of a dependent instruction. In the CDC 6600
this, interestingly, is achieved through a special design of Register File,
where if a register is being read at the same time (on the same clock cycle)
as it is being written, it is "passed through" (literally). "Normal"
(modern) Register File designs simply do not do this, meaning that a
dependent operation would have to wait an additional cycle: hence the
reason why the concept of "Operand Forwarding" was "invented"... even though
the 6600 had implemented it some 55 years ago.
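The difference between the two Register File styles can be illustrated with a toy Python model. This is only a sketch: the class `RegFile`, its `clock` method and the `write_through` flag are hypothetical names invented for the illustration, and real hardware would of course be expressed as gates rather than a list of integers.

```python
# Hypothetical toy model: class and method names are illustrative only.
class RegFile:
    def __init__(self, nregs, write_through):
        self.regs = [0] * nregs
        self.write_through = write_through

    def clock(self, reads, write=None):
        """Simulate one clock cycle.

        reads: list of register numbers to read this cycle.
        write: optional (regnum, value) pair written this cycle.
        Returns the list of values seen on the read ports.
        """
        out = []
        for r in reads:
            if self.write_through and write is not None and write[0] == r:
                # Same-cycle read of the register being written:
                # pass the incoming value straight through.
                out.append(write[1])
            else:
                # Otherwise the reader sees the value stored before
                # this cycle's write.
                out.append(self.regs[r])
        if write is not None:
            self.regs[write[0]] = write[1]
        return out


passthru = RegFile(4, write_through=True)
plain = RegFile(4, write_through=False)
print(passthru.clock(reads=[1], write=(1, 42)))  # [42]
print(plain.clock(reads=[1], write=(1, 42)))     # [0]  stale: must wait
print(plain.clock(reads=[1]))                    # [42] one cycle later
```

The pass-through version hands the dependent instruction its operand in the same cycle, whereas the plain version only shows the new value on the following cycle.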

The "Banks" which are planned to be used in the Libre RISC-V SoC present a
bit of a problem as far as forwarding is concerned, even if they include
6600-style same-clock "write-through" capability (aka Operand Forwarding).
The issue is that whilst there are multiplexers planned to be added to the
source (**after** the reads are performed), there are **no** multiplexers
planned to be added before the **destination** registers are written.
Therefore, the plan is to add an additional "forwarding" Bus which can
"bypass" the register file entirely.

This is apparently fairly standard practice in high-performance modern
micro-architectures. The problem is, however: if the register is
identified and marked as "not to be written back to the register file",
and an exception occurs, how on earth do you ensure that the system state
is stable, i.e. not corrupted? Most modern systems have a "rollback"
mechanism to deal with this.

Before we get there, however, let's back up a little bit, and go over
the example shown
[here](https://groups.google.com/forum/#!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ)
in more depth. This is the sequence:

    ADD r1, r2, #5
    ADD r2, r1, #5
    ADD r1, r2, #5

Note that instruction 3 overwrites R1, and that instruction 1's R1 is
only used as a *source* register in instruction 2. So what that means is, if
we have Tomasulo-style Reservation Stations on the Function Units, we don't
*actually* need to write R1 from instruction 1 into the Register File
at all! We can in fact simply use the fact that it will be sitting in
the Function Unit's Reservation Station, use "Operand Forwarding" to
pass it to instruction 2, and, once instruction 2 is underway, throw
the instruction 1 R1 result **away**. We achieve this by noting that
Instruction 3 "overwrites" Instruction 1's R1 as a destination, and,
whilst all three ALUs are still busy with pipelined processing, "mark"
the Function Unit handling instruction 1 as "nameless". The "name"
of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming
FU 1 is handling instruction 1).
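To make the "overwrite" detection concrete, here is a minimal Python sketch applied to the three-instruction sequence above. The function name `nameless_candidates` and the tuple encoding of instructions are assumptions made for the illustration, and it assumes every instruction in the list is still in flight, as in the example.

```python
# Hypothetical sketch: the instruction encoding and function name are
# assumptions for illustration, not the project's actual logic.
def nameless_candidates(program):
    """program: list of (dest_reg, [src_regs]) in issue order, all
    assumed to be still in flight.

    Returns the indices of instructions whose destination is
    overwritten by a later instruction, so that their result only
    needs to be forwarded, not written to the Register File."""
    last_writer = {}  # register name -> index of most recent writer
    nameless = []
    for idx, (dest, _srcs) in enumerate(program):
        if dest in last_writer:
            # A later instruction targets the same register: the
            # earlier writer's result can stay in its FU output latch.
            nameless.append(last_writer[dest])
        last_writer[dest] = idx
    return nameless


prog = [
    ("r1", ["r2"]),  # ADD r1, r2, #5
    ("r2", ["r1"]),  # ADD r2, r1, #5
    ("r1", ["r2"]),  # ADD r1, r2, #5
]
print(nameless_candidates(prog))  # [0]
```

Index 0 is instruction 1: its R1 is overwritten by instruction 3, so it is the one that would be marked "nameless".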

Now that we have the context, let's return to the bit about exceptions, and
assume that instruction 2 throws one (ADDs do not normally do that:
let's assume that they can, for now). Note that these are the conditions:

* Precise Exception handling has been added, by adding a "schroedinger"
  wire plus a write-hazard block that prevents down-stream instructions
  from "committing" (writing) until such time as the up-stream instruction
  absolutely knows that there will not be an exception.
* When an up-stream instruction knows that it has passed (cleared) the
  hurdle of potentially needing to throw an exception, it **drops** the
  write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream
  dependent instructions to be free of write hazards, and thus commit.
  (However, that's not happening here: instruction 2 **has** flipped
  the "schroedinger" wire to "Go\_Die").
* R1 from instruction 1 has been **specifically** marked as **not** to
  be written to the Register File: it has been renamed to "nameless"
  (FU1.#n).
* R1 from instruction 1 is also a source register of instruction 2.
* Instruction 2 is to be "rolled back".
* Instruction 3 is to be told to die as well (instruction 2 has flipped
  the "Go\_Die" signal).

...um, what do we do about the value "FU1.#n"? Instruction 3 told
Instruction 1 that it was no longer permitted to write to the Register File,
except that now Instruction 3 is dead! Instruction 1 has absolutely no place
to put that value. Should we discard Instruction 1 **as well**?? How far back
does this go? This is completely wasteful of resources! More than that, what
if we have a multi-issue engine, which issues multiple
instructions in this "nameless" fashion, where they get rolled back
again and again in an endless loop?

This is where modern micro-architectures come a little unstuck: apparently
what they do is roll back to where there are **no** "nameless" registers,
then **disable** multi-issue instruction execution, **disable** the
"nameless" capability, and slowly move forward one instruction at a time
until the exception is re-encountered.
This basically ensures that when the exception is encountered, absolutely
all of the registers may be (or are already) committed to the Register File.
At that point, a trap handler knows that it can safely context-switch, or
do whatever it likes, confident that the Register File Architectural State
is sane.

This approach is extremely wasteful of resources, and sub-optimal. In a
design that is supposed to be power-efficient, there's an obligation to
"Do Better". Hence the scheme below.

# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement

In CDC 6600 Terminology there is something called the "Q-Table", which
is basically an array, indexed by Register number, which keeps a record
of which Function Unit (relative to instruction order) last had that
Register as its Destination. This is directly equivalent to and
completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM
entry (except that in 6600 Scoreboarding it's not a CAM).

The problem with "nameless" Operand Forwarding is: whenever a Q-Table
entry (for any given FU) is overwritten, that's it: that instruction
**absolutely cannot** "roll back". The critical information that would
allow the prior Function Unit (the "overwritee") to be identified has just
been destroyed.

There is a simple solution to that: provide a *Queue* of Q-Table entries.

Below is what a 6600 Q-Table looks like (image courtesy of Mitch Alsup).
In the original 6600 it is a binary table with a unary decoder on the
left and a pair of unary encoders on the right.

{{6600_q_table.png}}

The plan is, therefore, to add effectively *multiple* Q-Tables
(or, multiple entries), recording the "history" of which *prior*
Function Units had any given register as their destination.
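A software analogy of such a "history" may help. The sketch below is a minimal Python model, assuming one stack of Function Unit names per register; the class `QTableHistory`, its method names and the labels "FU1" to "FU3" are all hypothetical and say nothing about how the hardware would actually be laid out.

```python
# Hypothetical sketch: class name, method names and FU labels are
# illustrative assumptions, not the project's actual design.
class QTableHistory:
    def __init__(self, nregs):
        # One stack of FU names per register: the last element is the
        # current owner, earlier elements are the overwritten owners.
        self.history = [[] for _ in range(nregs)]

    def issue(self, reg, fu):
        """Record that `fu` now has `reg` as its destination."""
        self.history[reg].append(fu)

    def owner(self, reg):
        """Current destination owner (what a plain Q-Table would hold)."""
        return self.history[reg][-1] if self.history[reg] else None

    def cancel(self, reg, fu):
        """Remove a cancelled FU's reservation. The previous owner (if
        any) becomes the owner again, i.e. it is "named" once more."""
        self.history[reg].remove(fu)
        return self.owner(reg)


q = QTableHistory(nregs=4)
q.issue(1, "FU1")          # instruction 1 writes R1
q.issue(2, "FU2")          # instruction 2 writes R2
q.issue(1, "FU3")          # instruction 3 overwrites R1: FU1 goes nameless
print(q.owner(1))          # FU3
print(q.cancel(1, "FU3"))  # FU1  (instruction 3 died: FU1 owns R1 again)
print(q.cancel(2, "FU2"))  # None (nobody else had R2 as a destination)
```

With a single-entry Q-Table the overwrite by "FU3" would have erased "FU1" and there would be nothing to restore; keeping the earlier entries is what makes the hand-back possible.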

Now we have exactly the information needed to "roll back", should an
exception occur. Like many augmentations and enhancements to the 6600
Scoreboard system, it's kind-of obvious in retrospect. However, the *real*
"duh" moment, as posted on comp.arch, is to always ensure that FUs that
are providing "nameless" data in their destination latches will never
let down-stream dependent instructions commit if any of those down-stream
instructions could potentially hit an exception.

Why is that important? It's because it's not enough to know that the
down-stream (dependent) instructions have all initiated (read the
FU's dest latch and taken it as a forwarded src operand). If **even one**
of those instructions throws an exception, the "nameless" FU from which that
value came is hosed, as it has nowhere to put its result.

So, firstly: the "nameless" FU absolutely has to wait until its dependent
instructions are clear of exceptions, and then **and only then** may it
safely drop (throw away) the data without writing it to the Register File;
and secondly, the "nameless" FU absolutely has to know that it can "roll
back" from "nameless" to a "named" state, in the event that one of its
dependent instructions does indeed throw an exception. This is where the
"History" Q-Table Entries come into play.
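Those two rules can be summarised as a small decision function. This is a sketch only: the three state names and the returned action strings are invented for the illustration, and the real logic would be a handful of wires on the Function Unit rather than Python.

```python
# Hypothetical sketch: the state names, action names and function are
# assumptions made for illustration.
def nameless_fu_action(dependent_states):
    """dependent_states: one of 'pending', 'clear' or 'exception' for
    each down-stream instruction that took the forwarded value.

    Returns what the "nameless" FU may do with its held result."""
    if "exception" in dependent_states:
        # A dependent is being cancelled: become "named" again and
        # write the result to the Register File.
        return "rollback_and_write"
    if all(s == "clear" for s in dependent_states):
        # No dependent can fault any more: the result may be dropped.
        return "drop"
    # At least one dependent could still fault: keep holding the value.
    return "hold"


print(nameless_fu_action(["clear", "pending"]))    # hold
print(nameless_fu_action(["clear", "clear"]))      # drop
print(nameless_fu_action(["clear", "exception"]))  # rollback_and_write
```

The important case is the first one: a dependent having merely read the forwarded value is not enough, so the FU keeps holding its result until every dependent is past the point where it could throw.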

So there are a few potential ways to go about this:

* Using the Historical Q-Table Entries, in chronological and Dependency
  Order, store all "Nameless" Registers (using the "history" to determine
  where), even if they are going to get overwritten in the next cycle.
* After triggering the "Go\_Die" wire from the Exception, and all
  dependent instructions have been removed (including their Destination
  Register Reservations), use the "history" information to work out
  which (formerly nameless) Function Unit(s) now actually have the
  Destination Reservation for each "vacated" Register.
* Any remaining "nameless" Registers, if their results are available,
  are likewise either stored or trigger their shadow (dependent)
  instructions to die (even if it's the original exception).
* Once the dust settles, carry on.

Realistically, this is going to need to be investigated with simulations.
It's quite complicated, however the payoff is a significant reduction in
the workload on the register file. It basically means the difference between
12 GFLOPs and 6 GFLOPs when doing 32-bit FMACs, at 800MHz (quad-core),
and still being able to keep to a "standard" 2R1W register file.
So it's a big deal!