X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=updates%2F004_2018dec06_microarchitecture_cont.mdwn;h=7995dd7584cda460c0b1f1b9f4533a67ed1937ff;hb=5cdd606067e23359f2e59a7e20aaabf10fdd895a;hp=28228c6be1cb8b1b310e6546788f8d356740d1df;hpb=4f0045893e5dbcd145b9364476c783ae891f749e;p=crowdsupply.git

diff --git a/updates/004_2018dec06_microarchitecture_cont.mdwn b/updates/004_2018dec06_microarchitecture_cont.mdwn
index 28228c6..7995dd7 100644
--- a/updates/004_2018dec06_microarchitecture_cont.mdwn
+++ b/updates/004_2018dec06_microarchitecture_cont.mdwn
@@ -1,45 +1,60 @@
-# Modernising 1960s Computer Technology: what can be learned from the CDC 6600
-
-Firstly, many thanks to 
+Firstly, many thanks to
 [Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
-for publishing a story on this project.  I replied to some of the
-[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
-comments, here, endeavouring to use translation software to respect that
-the forum is in German.
-
-In this update, following on from the analysis of the Tomasulo Algorithm,
-by a process of osmosis I finally was able to make out a light at the
-end of the "Scoreboard" tunnel, and it is not an oncoming train.
-Conversations with
-[Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
-are becoming clear.
+for publishing a story on this project. I replied to some of the
+[Heise
+Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
+comments, here, endeavouring to use translation software to respect
+that the forum is in German.
+
+In this update, following on from the analysis of the Tomasulo
+Algorithm, by a process of osmosis I finally was able to make out a
+light at the end of the "Scoreboard" tunnel, and it is not an oncoming
+train. Conversations with [Mitch
+Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
+are becoming clear, providing insights that, as we will find out
+below, have not made it into the academic literature in over 20 years.
 
 In the previous update, I really did not like the
 [Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
 for doing out-of-order superscalar execution, because, *as described*,
-it is hopelessly inadequate.  There's no roll-back method for
-exceptions, no method for coping with register "hazards" (Read after Write
-and so on), so register "renaming" has to be done as a precursor step,
-no way to do branch prediction, and only a single LOAD/STORE can be
-done at any one time.
-
-The only *well-known* documentation on the CDC 6600 Scoreboarding technique
-is the 1967 patent.  Here's the kicker: the patent *does not* describe
-the key strategic part of Scoreboarding that makes it so powerful and
-much more power-efficient than the Tomasulo Algorithm when combined
-with Reorder Buffers: the Dependency Matrices.
+it is hopelessly inadequate. There's no roll-back method for
+exceptions, no method for coping with register "hazards" (Read after
+Write and so on), so register "renaming" has to be done as a precursor
+step, no way to do branch prediction, and only a single LOAD/STORE can
+be done at any one time.
+
+All of these things have to be added, and the best way to do so is to
+absorb the feature known as the "Reorder Buffer" (and associated
+Reservation Stations), normally associated with the Tomasulo
+Algorithm. At which point, as noted on
+[comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ)
+there really is no functional difference between "Scoreboarding plus
+Reorder Buffer" and "Tomasulo Algorithm plus Reorder Buffer". Even
+the Tomasulo Common Data Bus is present in a functionally-orthogonal
+way (see later for details).
+
+The only *well-known* documentation on the CDC 6600 Scoreboarding
+technique is the 1967 patent. Here's the kicker: the patent *does
+not* describe the key strategic part of Scoreboarding that makes it so
+powerful and much more power-efficient than the Tomasulo Algorithm
+when combined with Reorder Buffers: the Functional Unit's Dependency
+Matrices.
 
 Before getting to that stage, I thought it would be a good idea to
-make people aware of a book that Mitch told me about, called
-"Design of a Computer: the Control Data 6600" by James Thornton.
-James worked with Seymour Cray on the 6600.  It was literally
-constructed from PCB modules using hand-soldered transistors.
-Memory was magnetic rings (which is where we get the term "core memory"
-from), and the bootloader was a bank of toggle-switches.
+make people aware of a book that Mitch told me about, called "Design
+of a Computer: the Control Data 6600" by James Thornton. James worked
+with Seymour Cray on the *original design* of the 6600. It was
+literally constructed from PCB modules using hand-soldered
+transistors. Memory was magnetic rings (which is where we get the
+term "core memory" from), and the bootloader was a bank of
+toggle-switches. The design was absolutely revolutionary: where all
+other computers were managing an instruction every 11 clock cycles,
+the 6600 reduced that to **four**. The 7600, its successor, took that
+figure even lower.
 
 In 2002, someone named Tom Uban sought permission from James and his
-wife, to make the book available online, as, historically, the
-CDC 6600 is quite literally the precursor to modern supercomputing:
+wife, to make the book available online, as, historically, the CDC
+6600 is quite literally the precursor to modern supercomputing:
 
 [[design_of_a_computer_6600_permission.jpg]]
 
@@ -50,73 +65,112 @@ key strategic part of the Scoreboard:
 
 Basically, the patent shows a table with src1 and src2, and "ready"
 signals: what it does *not* show is the "Go Read" and "Go Write"
-signals, and it does not show the way in which one Function Unit
-blocks others, via the Dependency Matrix.
-
-It is well-known that the Tomasulo Reorder Buffer requires a CAM
-on the Destination Register, (which is power-hungry and expensive).
-This is described in academic literature as data coming "to".  The
+signals (which allowed an instruction to *begin* execution without
+*committing* execution - a feature that's usually believed to be
+exclusive to Reorder Buffers), and the patent certainly does not show
+the way in which one Function Unit blocks others, via the Dependency
+Matrix.
+
+It is well-known that the Tomasulo Reorder Buffer requires a CAM on
+the Destination Register (which is power-hungry and expensive). This
+is described in academic literature as data coming "to". The
 Scoreboard technique is described as data coming "from" source
-registers, however because the Dependency Matrix is left out of
-these discussions, what they fail to mention is that there are
-*multiple single-line* source wires, thus achieving the exact
-same purpose as the Reorder Buffer's CAM, with *far less power
-and die area*.
+registers; however, because the Dependency Matrix is left out of these
+discussions (not being part of the patent), what they fail to mention
+is that there are *multiple single-line* source wires, thus achieving
+the exact same purpose as the Reorder Buffer's CAM, with *far less
+power and die area*.
+
+Mitch's description of this on comp.arch was that the Dependency
+Matrix columns effectively may be viewed as a single-bit-wide "CAM",
+which of course is far less hardware, being just AND gates. However,
+it wasn't until he very kindly sent me the chapters of his unpublished
+book on the 6600 that the significance of what he was saying actually
+sank in, namely that instead of a merged multi-wire very expensive
+"Destination Register" CAM, copying the *value* of the dependent src
+register into the Reorder Buffer (and then having to match it up
+afterwards, on every clock cycle), the Dependency Matrix breaks this
+down into multiple really, really simple single-wire comparators that
+*preserve* a **direct** link between the src register(s) and the
+destination(s) where they're needed. Consequently, the Scoreboard and
+Dependency Matrix logic gates take up far less space, and use
+significantly less power.
 
 Not only that: it is quite easy to add incremental register-renaming
-tags on top of the Scoreboard + Dependency Matrix, again, no need
-for a CAM.  Not only that: Mitch describes in an unpublished book
+tags on top of the Scoreboard + Dependency Matrix, again, no need for
+a CAM. Not only that: Mitch describes in the second unpublished book
 chapter several techniques that each bring in all of the techniques
-that are usually exclusively associated with Reorder Buffers,
-such as Branch Prediction, speculative execution, precise exceptions
-and multi-issue LOAD / STORE hazard avoidance.  This diagram below
-is reproduced with Mitch's permission:
+that are usually exclusively associated with Reorder Buffers, such as
+Branch Prediction, speculative execution, precise exceptions and
+multi-issue LOAD / STORE hazard avoidance. This diagram below is
+reproduced with Mitch's permission:
 
 [[mitch_ld_st_augmentation.jpg]]
 
 This high-level diagram includes some subtle modifications that
-augment a standard CDC 6600 design to allow speculative execution.
-A "Schroedinger" wire is added ("neither alive nor dead"), which,
-very simply put, prohibits Function Unit "Write" of results.  In
-this way, because the "Read" signals were independent of "Write"
-(something that is again completely missing from the academic
-literature in discussions of 6600 Scoreboards), the instruction
-may *begin* execution, but is prevented from *completing*
-execution.
-
-All that is required is to add one extra line to the Dependency
-Matrix per "branch" that is to be speculatively executed, just like
-any other Functional Unit, in effect.
-
-Mitch also has a high-level diagram of an additional LOAD/STORE Matrix that
-has, again, extremely simple rules: LOADs block STOREs, and
+augment a standard CDC 6600 design to allow speculative execution. A
+"Schroedinger" wire is added ("neither alive nor dead"), which, very
+simply put, prohibits Function Unit "Write" of results (mentioned
+earlier as a pre-existing under-recognised key part of the 6600
+design). In this way, because the "Read" signals were independent of
+"Write" (something that is again completely missing from the academic
+literature in discussions of 6600 Scoreboards), the instruction may
+*begin* execution, but is prevented from *completing* execution.
+
+All that is required to gain speculative execution on branches is to
+add one extra line to the Dependency Matrix per "branch" that is to be
+speculatively executed. The "Branch Speculation" Unit is just like
+any other Functional Unit, in effect. In this way, we gain *exactly*
+the same capability as a Reorder Buffer, including all of the
+benefits. The same trick will work just as well for Exceptions.
+
+Mitch also has a high-level diagram of an additional LOAD/STORE Matrix
+that has, again, extremely simple rules: LOADs block STOREs, and
 STOREs block LOADs, and the signals "Read / Write" are then passed
-down to the Function Unit Dependency Matrix as well.  The rules for
-the blocking need only be based on "there is no possibility of a conflict"
-rather than "on which exact and precise address does a conflict occur".
-This in turn means that the number of address bits needed to detect a
-conflict may be significantly reduced.  Interestingly, RISC-V "Fence"
-instruction rules are based on the same idea.
-
-So this is just amazing.  Let's recap.  It's 2018, there's absolutely zero
-Libre SoCs in existence anywhere on our planet of 8 billion people, and
-we're looking for inspiration at literally a 55-year-old computer design
-that occupied an entire room and was hand-built with transistors, 
-on how to make a modern, power-efficient 3D-capable processor.
-
-Not only that: the project has accidentally unearthed incredibly valuable
-historic processor design information that has eluded the Intels and
-ARMs - billion-dollar companies - as well as the Academic community -
-for several decades.
-
-I'd like to take a minute to especially thank Mitch Alsup for his
-time in ongoing discussions, without which there would be absolutely
-no chance that I could possibly have learned about, let alone understood,
-any of the above.  As I mentioned in the very first update: new processor
-designs get one shot at success.  Basing the core of the design on
-a 55-year-old well-documented and extremely compact and efficient design
-is a reasonable strategy: it's just that, without Mitch's help, there
-would have been no way to understand the 6600's true value.
-
-Bottom line: we do not need to follow Intel's power-inefficient lead, here.
-
+down to the Function Unit Dependency Matrix as well. The rules for
+the blocking need only be based on "there is no possibility of a
+conflict" rather than "on which exact and precise address does a
+conflict occur". This in turn means that the number of address bits
+needed to detect a conflict may be significantly reduced, i.e. only
+the top bits are needed.
+
+Interestingly, RISC-V "Fence" instruction rules are based on the same
+idea, and it may turn out to be possible to leverage the L1 Cache Line
+numbers instead of the (full) address.
+
+Also, thanks to Mitch, his unpublished book chapters make clear
+that the CDC 6600's register file is designed
+with "write-through" capability, i.e. that a register that's written
+will go through *on the same clock cycle* to a "read" request. This
+makes the 6600's register file pretty much synonymous with the
+Tomasulo Algorithm "Common Data Bus". This same-cycle feature *also
+provides operand forwarding for free*!
+
+So this is just amazing. Let's recap. It's 2018, there are absolutely
+zero Libre SoCs in existence anywhere on our planet of 8 billion
+people, and we're looking for inspiration at literally a 55-year-old
+computer design that occupied an entire room and was hand-built with
+transistors, on how to make a modern, power-efficient 3D-capable
+processor.
+
+Not only that: the project has accidentally unearthed incredibly
+valuable historic processor design information that has eluded the
+Intels and ARMs - billion-dollar companies - as well as the Academic
+community - for several decades.
+
+I'd like to take a minute to especially thank Mitch Alsup for his time
+in ongoing discussions, without which there would be absolutely no
+chance that I could possibly have learned about, let alone understood,
+any of the above. As I mentioned in the very first update: new
+processor designs get one shot at success. Basing the core of the
+design on a 55-year-old well-documented and extremely compact and
+efficient design is a reasonable strategy: it's just that, without
+Mitch's help, there would have been no way to understand the 6600's
+true value.
+
+Bottom line is, we have a way forward that will result in
+significantly less hardware, a simpler design, using a lot less power
+than modern designs today, yet providing all of the features normally
+the exclusive domain of top-end processors, all thanks to a refresh of a
+55-year-old processor and the willingness of Mitch Alsup and James
+Thornton to share their expertise with the world.