# "Name-less" register exception handling

In this
[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/8pAGuX6UBu0)
post a scheme has been outlined that, if added to a precise-exception
augmented CDC 6600 style Scoreboard, would reduce the load on the register
file (fewer reads and writes) and still guarantee precise exception handling.

The goal here is to reduce the number of reads and writes to the register
file, because, quite simply put, doing so saves power and reduces contention
for the limited resource of the data buses between the ALUs and the register
file. Why a limited resource? Because keeping four or more ALUs fully occupied
with, for example, an FMAC operation requires 3 READ ports and 1 WRITE port
*per ALU*. If those are vectorised predicated FMAC operations, the READ count
is higher still.

Four parallel FMACs initiated per clock require a whopping **TWELVE** READ
ports and four WRITE ports. This is completely insane, and it is why the
register file has been subdivided into four separate banks.

There are certain standard "cells" (pre-designed layouts) for register
files, including in FPGAs. The typical layout is 2R1W (2 read ports and
1 write port per clock cycle). Therefore, keeping to that will not only
reduce power consumption, it will reduce the development cost for the
project as well.

It turns out that with FMAC (Floating-point Multiply and Accumulate) operations,
the destination register is usually also the (additive) source register,
in a sequential chain of FMACs. So, aside from the very first FMAC in the
chain, if operand "forwarding" is available in the architecture, it is only
the two numbers being multiplied that need to be read from the register file:
the number being added arrives by forwarding. That meshes nicely with the
whole "2R1W" thing.

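The port arithmetic above can be written out as a back-of-envelope sketch. The function name `ports_needed` is made up for illustration, and the "one FMAC per bank per clock" split is an assumption, not something the design has fixed.

```python
def ports_needed(n_alus, reads_per_op, writes_per_op=1):
    """Register-file ports needed to start one operation per ALU per clock."""
    return n_alus * reads_per_op, n_alus * writes_per_op


# Four FMACs per clock, all three operands read from the register file.
naive = ports_needed(4, reads_per_op=3)

# Same four FMACs, but the accumulator arrives by operand forwarding,
# so only the two multiplicands are read from the register file.
forwarded = ports_needed(4, reads_per_op=2)

# Assumption: the work is spread evenly over four banks, one FMAC per bank.
per_bank = tuple(ports // 4 for ports in forwarded)

print(naive, forwarded, per_bank)
```

Under those assumptions each bank ends up needing exactly the standard 2R1W cell.
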
[Operand "forwarding"](https://en.wikipedia.org/wiki/Operand_forwarding)
basically means that the result from one instruction is "forwarded"
*directly* to the source input of a dependent instruction. In the CDC 6600
this, interestingly, is achieved through a special design of Register File,
where if a register is being read at the same time (on the same clock cycle)
as it is being written, the value is "passed through" (literally). "Normal"
(modern) Register File designs simply do not do this, meaning that a
dependent operation would have to wait an additional cycle: hence the
reason why the concept of "Operand Forwarding" was "invented"... even though
the 6600 had already implemented it 55 years ago.

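As a rough behavioural sketch of that "pass through" idea (not a model of the actual 6600 circuit, and with the class and method names invented here), a register file that forwards a same-cycle write to its readers might look like this:

```python
class WriteThroughRegFile:
    """Toy register file: a read in the same cycle as a write sees the new value."""

    def __init__(self, n_regs):
        self.regs = [0] * n_regs

    def cycle(self, reads, write=None):
        """Model one clock cycle.

        reads: list of register numbers to read.
        write: optional (register number, value) pair.
        Returns the values seen by the readers during this cycle.
        """
        values = []
        for r in reads:
            if write is not None and write[0] == r:
                values.append(write[1])      # "pass through" the incoming value
            else:
                values.append(self.regs[r])  # ordinary read of the stored value
        if write is not None:
            self.regs[write[0]] = write[1]   # the write still lands in the file
        return values


rf = WriteThroughRegFile(4)
rf.regs[2] = 10
# r1 is written and read in the same cycle: the reader sees 15, not the stale 0.
same_cycle = rf.cycle(reads=[1, 2], write=(1, 15))
print(same_cycle)
```

A register file without the pass-through branch would hand the reader the stale value, and the dependent operation would have to retry a cycle later.
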
The "Banks" which are planned to be used in the Libre RISC-V SoC present a
bit of a problem as far as forwarding is concerned, even if they include
6600-style same-clock "write-through" capability (aka Operand Forwarding).
The issue is that whilst there are multiplexers planned to be added to the
source (**after** the reads are performed), there are **no** multiplexers
planned to be added before the **destination** registers are written.
Therefore, the plan is to add an additional "forwarding" Bus which can
"bypass" the register file entirely.

This is apparently fairly standard practice in high-performance modern
micro-architectures. The problem is, however: if the register is
identified and marked as "not to be written back to the register file",
and an exception occurs, how on earth do you ensure that the system state
is stable, i.e. not corrupted? Most modern systems have a "rollback"
mechanism to deal with this.

Before we get there, however, let's back up a little bit, and go over
the example shown
[here](https://groups.google.com/forum/#!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ)
in more depth. This is the sequence:

    ADD r1, r2, #5
    ADD r2, r1, #5
    ADD r1, r2, #5

Note that instruction 3 overwrites R1, whereas R1 is used as
a *source* register in instruction 2. What that means is, if we have
Tomasulo-style Reservation Stations on the Function Units, we don't
*actually* need to write R1 from instruction 1 into the Register File
at all! We can simply rely on the fact that it will be sitting in
the Function Unit's Reservation Station, use "Operand Forwarding" to
pass it to instruction 2, and, once instruction 2 is underway, throw
the instruction 1 R1 result **away**. We achieve this by noting that
Instruction 3 "overwrites" Instruction 1's R1 as a destination, and,
whilst all three ALUs are still busy with pipelined processing, "marking"
the Function Unit handling instruction 1 as "nameless". The "name"
of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming
FU 1 is handling instruction 1).

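The "noting that Instruction 3 overwrites Instruction 1's R1" step can be sketched in software terms as a scan over the in-flight instruction window. This is purely illustrative: `Insn` and `nameless_candidates` are hypothetical names, real hardware would do this with Scoreboard logic rather than a loop, and exceptions are ignored entirely at this point.

```python
from collections import namedtuple

# Hypothetical minimal instruction record: one destination, a tuple of sources.
Insn = namedtuple("Insn", "dest srcs")


def nameless_candidates(window):
    """Return indices of instructions whose result need not reach the register file.

    An instruction qualifies when a later instruction in the same window
    overwrites its destination register.
    """
    last_writer = {}   # register name -> index of its most recent writer
    nameless = set()
    for i, insn in enumerate(window):
        if insn.dest in last_writer:
            nameless.add(last_writer[insn.dest])   # earlier writer is overwritten
        last_writer[insn.dest] = i
    return sorted(nameless)


window = [
    Insn("r1", ("r2",)),   # ADD r1, r2, #5
    Insn("r2", ("r1",)),   # ADD r2, r1, #5
    Insn("r1", ("r2",)),   # ADD r1, r2, #5
]
candidates = nameless_candidates(window)
print(candidates)
```

For the three-instruction sequence, only instruction 1 (index 0) is flagged: its R1 is overwritten by instruction 3, and instruction 2 picks the value up by forwarding.
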
Now that we have the context, let's return to the bit about exceptions, and
assume that instruction 2 throws one (ADDs do not normally do that:
let's assume that they can, for now). Note that these are the conditions:

* Precise Exception handling has been added (by adding a "schroedinger"
  wire plus a write-hazard block that prevents down-stream instructions
  from "committing" (writing) until such time as the up-stream instruction
  absolutely knows that there will not be an exception).
* When an up-stream instruction knows that it has passed (cleared) the
  hurdle of potentially needing to throw an exception, it **drops** the
  write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream
  dependent instructions to be free of write hazards, and thus commit.
  (However, that's not happening here: instruction 2 **has** flipped
  the "schroedinger" wire to "Go\_Die").
* R1 from instruction 1 has been **specifically** marked as **not** to
  be written to the Register File: it has been renamed to "nameless"
  (FU1.#n).
* R1 from instruction 1 is also a source register of instruction 2.
* Instruction 2 is to be "rolled back".
* Instruction 3 is to be told to die as well (instruction 2 has flipped
  the "Go\_Die" signal).

...um, what do we do about the value "FU1.#n"? Instruction 3 told it
that it was no longer permitted to write to the Register File, except that
now Instruction 3 is dead! Instruction 1 has absolutely no place to put
that value. Should we discard Instruction 1 **as well**?? How far back does
this go? This is completely wasteful of resources! More than that, what
if we have a multi-issue engine, which issues multiple instructions in
this "nameless" fashion, where they get rolled back again and again in
an endless loop?

This is where modern micro-architectures get a little unstuck: apparently
what they do is roll back to a point where there are **no** "nameless"
registers, then **disable** multi-issue instruction execution, **disable**
the "nameless" capability, and slowly move forward one instruction at a time
until the exception is re-encountered.
This basically ensures that when the exception is encountered, absolutely
all of the registers may be (or are already) committed to the Register File.
At that point, a trap handler knows that it can safely context-switch, or
do whatever it likes, confident that the Register File Architectural State
is sane.

This approach is extremely wasteful of resources, and sub-optimal. In a
design that is supposed to be power-efficient, there's an obligation to
"Do Better". Hence the scheme below.

# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement

In CDC 6600 Terminology there is something called the "Q-Table", which
is basically an array, indexed by Register number, which keeps a record
of which Function Unit (relative to instruction order) last had that
Register as a Destination Register. This is directly equivalent to and
completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM
entry (except that in 6600 Scoreboarding it's not a CAM).

The problem with "nameless" Operand Forwarding is: whenever a Q-Table
entry (for any given FU) is overwritten, that's it: that instruction
**absolutely cannot** "roll back". The critical information that would
identify the prior Function Unit (the "overwritee") has just been destroyed.

There is a simple solution to that: provide a *Queue* of Q-Table entries.

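A minimal software sketch of that idea follows, assuming one small history per register, held here as a Python list with the newest entry last. The class and method names are invented for illustration, and nothing here says how deep the history would be or how it would be wired in hardware.

```python
class QTableWithHistory:
    """Q-Table sketch: per register, a history of the FUs that targeted it."""

    def __init__(self, n_regs):
        # history[r][-1] is the current owner; earlier entries are "overwritees".
        self.history = [[] for _ in range(n_regs)]

    def issue(self, reg, fu):
        """A new instruction on `fu` names `reg` as its destination."""
        self.history[reg].append(fu)

    def owner(self, reg):
        """FU producing the current value of `reg` (None: read the register file)."""
        return self.history[reg][-1] if self.history[reg] else None

    def roll_back(self, reg, fu):
        """Cancel `fu` and anything issued after it for `reg`; return the restored owner."""
        entries = self.history[reg]
        del entries[entries.index(fu):]
        return self.owner(reg)

    def retire(self, reg, fu):
        """`fu` is finished with `reg` (value written back, or safely dropped)."""
        self.history[reg].remove(fu)


q = QTableWithHistory(4)
q.issue(1, "FU1")   # ADD r1, r2, #5
q.issue(2, "FU2")   # ADD r2, r1, #5
q.issue(1, "FU3")   # ADD r1, r2, #5  (overwrites FU1's entry for r1)
before = q.owner(1)
restored = q.roll_back(1, "FU3")   # instruction 3 is told to die
print(before, restored)
```

With a single-entry table, the `issue` for instruction 3 would have wiped out "FU1"; with the history kept, cancelling instruction 3 hands R1 back to FU1.
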
Below is what a 6600 Q-Table looks like (image courtesy of Mitch Alsup).
In the original 6600 it is a binary table with a unary decoder on the
left and a pair of unary encoders on the right.

{{6600_q_table.png}}

With a queue of entries, we have exactly the information needed to "roll
back", should an exception occur. Like many augmentations and enhancements
to the 6600 Scoreboard system, it's kind-of obvious in retrospect. However,
the *real* "duh" moment, as posted on comp.arch, is to always ensure that FUs
that are providing "nameless" data in their destination latches will never
let down-stream dependent instructions commit if any of those down-stream
instructions could potentially hit an exception.

Why is that important? It's because it's not enough to know that the
down-stream (dependent) instructions have all initiated (read the
FU's dest latch and taken it as a forwarded src operand). If **even one**
of those instructions throws an exception, the "nameless" FU is hosed.
So, firstly: the "nameless" FU absolutely has to wait until its dependencies
are clear of exceptions (then **and only then** may it safely drop, i.e. throw
away, the data without writing it to the Register File); and secondly,
the "nameless" FU absolutely has to know that it can "roll back" from
"nameless" to a "named" state, in the event that one of its dependent
instructions does indeed throw an exception. This is where the "History"
Q-Table Entries come into play.

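The two rules can be sketched as a single decision taken once every dependent instruction has resolved. This is one possible reading, not a description of the final design: it assumes that rolling back to "named" means the held value gets written to the Register File after all, and `resolve_nameless` is a hypothetical name.

```python
def resolve_nameless(value, dependents_threw, regfile, reg):
    """Decide the fate of a "nameless" result once all dependents have resolved.

    dependents_threw: list of booleans, True where that dependent instruction
    threw an exception.
    Assumption: going back to "named" means writing the value to `regfile`.
    """
    if any(dependents_threw):
        regfile[reg] = value      # back to "named": the value must survive
        return "written-back"
    return "dropped"              # all dependents clear: no write needed


regfile = {"r1": 0}
# Instruction 2 (the only dependent of FU1's r1) throws an exception.
outcome_exception = resolve_nameless(7, [True], regfile, "r1")
r1_after_exception = regfile["r1"]

regfile_ok = {"r1": 0}
# No dependent throws: FU1 may discard the value, saving a register write.
outcome_clear = resolve_nameless(7, [False, False], regfile_ok, "r1")
print(outcome_exception, r1_after_exception, outcome_clear, regfile_ok["r1"])
```

The saved register write only happens on the "dropped" path, which is why the FU has to hold its value until the last dependent is known to be exception-free.
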