From 66bc5cb6ac3c6fbe71ca77e672cbb94bf61490f0 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Wed, 2 Jan 2019 01:16:15 +0000
Subject: [PATCH] add register overwrites update

---
 updates/009_register_overwrites.mdwn | 165 +++++++++++++++++++++++++++
 1 file changed, 165 insertions(+)
 create mode 100644 updates/009_register_overwrites.mdwn

diff --git a/updates/009_register_overwrites.mdwn b/updates/009_register_overwrites.mdwn
new file mode 100644
index 0000000..396e4c0
--- /dev/null
+++ b/updates/009_register_overwrites.mdwn
@@ -0,0 +1,165 @@
+# "Name-less" register exception handling
+
+In this
+[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/8pAGuX6UBu0)
+post a scheme has been outlined that, if added to a precise-exception
+augmented CDC 6600 style Scoreboard, would reduce the load on the register
+file (fewer reads and writes) and still guarantee precise exception handling.
+
+The goal here is to reduce the number of reads and writes to the register
+file because, quite simply put, doing so saves power and reduces contention
+for a limited resource: the data buses between the ALUs and the register
+file.  Why is it limited? Because keeping four or more ALUs fully occupied
+with, for example, an FMAC operation requires 3 READ ports and 1 WRITE port
+*per ALU*.  If those are vectorised predicated FMAC operations, the READ
+count is higher still.
+
+Four parallel FMACs initiated per clock require a whopping **TWELVE** READ
+ports and four WRITE ports.  This is completely insane and it is why the
+register file has been subdivided into four separate banks.
+
+There are certain standard "cells" (pre-designed layouts) for register files,
+including in FPGAs.  The typical layout is 2R1W (2 read ports, 1 write port,
+per clock cycle).  Therefore, keeping to that will not only reduce power
+consumption, it will also reduce the development cost for the project.
+
+It turns out that with FMAC (Floating-point Multiply and Accumulate) operations,
+the destination register is usually also the (additive) source register,
+in a sequential chain of FMACs.  So, aside from the very first FMAC in the
+chain, if operand "forwarding" is available in the architecture, it is
+only the two numbers being multiplied that need to be read from the
+register file: the accumulator arrives by forwarding.  That nicely meshes
+with the whole "2R1W" thing.
+
+[Operand "forwarding"](https://en.wikipedia.org/wiki/Operand_forwarding)
+basically means that the result from one instruction is "forwarded"
+*directly* to the source input of a dependent instruction.  In the CDC 6600
+this, interestingly, is achieved through a special design of Register File,
+where if a register is being read at the same time (on the same clock cycle)
+as it is being written, it is "passed through" (literally).  "Normal"
+(modern) Register File designs simply do not do this, meaning that a
+dependent operation would have to wait an additional cycle: hence the
+reason why the concept of "Operand Forwarding" was "invented"... even though
+the 6600 had already implemented it, some 55 years ago.
+
+The "Banks" which are planned to be used in the Libre RISC-V SoC present a
+bit of a problem as far as forwarding is concerned, even if they include
+6600-style same-clock "write-through" capability (aka Operand Forwarding).
+The issue is that whilst there are multiplexers planned to be added to the
+source (**after** the reads are performed), there are **no** multiplexers
+planned to be added before the **destination** registers are written.
+Therefore, the plan is to add an additional "forwarding" Bus which can
+"bypass" the register file entirely.
+
+This is apparently fairly standard practice in high-performance modern
+micro-architectures.  The problem is, however: if the register is
+identified and marked as "not to be written back to the register file",
+and an exception occurs, how on earth do you ensure that the system state
+is stable i.e. not corrupted?  Most modern systems have a "rollback"
+mechanism to deal with this.
+
+Before we get there, however, let's back up a little bit, and go over
+the example shown
+[here](https://groups.google.com/forum/#!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ)
+in more depth.  This is the sequence:
+
+    ADD r1, r2, #5
+    ADD r2, r1, #5
+    ADD r1, r2, #5
+
+Note that instruction 3 actually overwrites R1; however, R1 is used as
+a *source* register in instruction 2.  So what that means is, if we have
+Tomasulo-style Reservation Stations on the Function Units, we don't
+*actually* need to write R1 from instruction 1 into the Register File
+at all!  We can in fact simply use the fact that it will be sitting in
+the Function Unit's Reservation Station, use "Operand Forwarding" to
+pass it to instruction 2, and, once instruction 2 is underway, throw
+the instruction 1 R1 result **away**.  We achieve this by noting that
+Instruction 3 "overwrites" Instruction 1's R1 as a destination, and,
+whilst all three ALUs are still busy with pipelined processing, "mark"
+the Function Unit handling instruction 1 as "nameless".  The "name"
+of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming
+FU 1 is handling instruction 1).
+
+Now that we have the context, let's return to the bit about exceptions, and
+assume that instruction 2 throws one (ADDs do not normally do that:
+let's assume that they can, for now).  Note that these are the conditions:
+
+* Precise Exception handling has been added, by adding a "schroedinger"
+  wire plus a write-hazard block that prevents down-stream instructions
+  from "committing" (writing) until such time as the up-stream instruction
+  absolutely knows that there will not be an exception.
+* When an up-stream instruction knows that it has passed (cleared) the
+  hurdle of potentially needing to throw an exception, it **drops** the
+  write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream
+  dependent instructions to be free of write hazards, and thus commit.
+  (However, that's not happening here: instruction 2 **has** flipped
+  the "schroedinger" wire to "Go\_Die").
+* R1 from instruction 1 has been **specifically** marked as **not** to
+  be written to the Register File: it has been renamed to "nameless"
+  (FU1.#n).
+* R1 from instruction 1 is also a source register of instruction 2.
+* Instruction 2 is to be "rolled back".
+* Instruction 3 is to be told to die as well (instruction 2 has flipped
+  the "Go\_Die" signal).
+
+...um, what do we do about the value "FU1.#n"?  Instruction 3 told it
+that it was no longer permitted to write to the Register File, except that
+now Instruction 3 is dead!  Instruction 1 has absolutely no place to put
+that value.  Should we discard Instruction 1 **as well**?? How far back does
+this go?  This is completely wasteful of resources!  More than that, what
+if we have a multi-issue engine, which issues multiple
+instructions in this "nameless" fashion, where they get rolled back
+again and again in an endless loop?
+
+This is where modern micro-architectures get a little unstuck: apparently
+what they do is, roll back to where there are **no** "nameless" registers,
+they then **disable** multi-issue instruction execution, **disable** the
+"nameless" capability, and slowly move forward one instruction at a time
+until the exception is re-encountered.
+This basically ensures that when the exception is encountered, absolutely
+all of the registers may be (or are already) committed to the Register File.
+At that point, a trap handler knows that it can safely context-switch, or
+do whatever it likes, confident that the Register File Architectural State
+is sane.
+
+This approach is extremely wasteful of resources, and sub-optimal.  In a
+design that is supposed to be power-efficient, there's an obligation to
+"Do Better".  Hence the scheme below.
+
+# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement
+
+In CDC 6600 Terminology there is something called the "Q-Table", which
+is basically an array, indexed by Register number, which keeps a record
+of which Function Unit (relative to instruction order) last had that
+Register as a Destination.  This is directly equivalent to and
+completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM
+entry (except that in 6600 Scoreboarding it's not a CAM).
+
+The problem with "nameless" Operand Forwarding is: whenever a Q-Table
+entry (for any given Register) is overwritten, that's it: that instruction
+**absolutely cannot** "roll back".  The critical information that would allow
+the prior Function Unit (the "overwritee") to be restored has just been destroyed.
+
+There is a simple solution to that: provide a *Queue* of Q-Table entries.
+
+Now we have exactly the information needed to "roll back", should an
+exception occur.  Like many augmentations and enhancements to the 6600
+Scoreboard system, it's kind-of obvious in retrospect.  However the *real*
+"duh" moment, as posted on comp.arch, is to always ensure that FUs that
+are providing "nameless" data in their destination latches will never
+let down-stream dependent instructions commit if any of those down-stream
+instructions could potentially hit an exception.
+
+Why is that important?  It's because it's not enough to know that the
+down-stream (dependent) instructions have all initiated (read the
+FU's dest latch and taken it as a forwarded src operand).  If **even one**
+of those instructions throws an exception, the "nameless" FU is hosed.
+So, firstly: the "nameless" FU absolutely has to wait until its dependencies
+are clear of exceptions, and then **and only then** may it safely drop (throw
+away) the data (without writing it to the Register File); and secondly,
+the "nameless" FU absolutely has to know that it can "roll back" from
+"nameless" to a "named" state, in the event that one of its dependent
+instructions does indeed throw an exception.  This is where the "History"
+Q-Table Entries come into play.
+
-- 
2.30.2