Copy edit update 4

[crowdsupply.git] / updates / 004_2018dec06_microarchitecture_cont.mdwn
diff --git a/updates/004_2018dec06_microarchitecture_cont.mdwn b/updates/004_2018dec06_microarchitecture_cont.mdwn

index 28228c6be1cb8b1b310e6546788f8d356740d1df..61e2fd74ea49df2ba4506a149d89aa4673bddd53 100644 (file)
--- a/updates/004_2018dec06_microarchitecture_cont.mdwn
+++ b/updates/004_2018dec06_microarchitecture_cont.mdwn
@@ -1,122 +1,178 @@
-# Modernising 1960s Computer Technology: what can be learned from the CDC 6600
-
-Firstly, many thanks to 
+Firstly, many thanks to
  [Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
-for publishing a story on this project.  I replied to some of the
-[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
-comments, here, endeavouring to use translation software to respect that
-the forum is in German.
-
-In this update, following on from the analysis of the Tomasulo Algorithm,
+for publishing a story on this project. I replied to some of the
+[Heise
+Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
+comments, endeavouring to use translation software to respect that the
+forum is in German.
+
+In this update, following on from [the analysis of the Tomasulo
+algorithm](https://www.crowdsupply.com/libre-risc-v/m-class/updates/microarchitectural-decisions),
  by a process of osmosis I finally was able to make out a light at the
-end of the "Scoreboard" tunnel, and it is not an oncoming train.
-Conversations with
-[Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
-are becoming clear.
+end of the "scoreboard" tunnel, and it is not an oncoming
+train. Conversations with [Mitch
+Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
+are becoming clear, providing insights that, as we will find out
+below, have not made it into the academic literature in over 20 years.
  
  In the previous update, I really did not like the
-[Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
+[scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
  for doing out-of-order superscalar execution, because, *as described*,
-it is hopelessly inadequate.  There's no roll-back method for
-exceptions, no method for coping with register "hazards" (Read after Write
-and so on), so register "renaming" has to be done as a precursor step,
-no way to do branch prediction, and only a single LOAD/STORE can be
-done at any one time.
-
-The only *well-known* documentation on the CDC 6600 Scoreboarding technique
-is the 1967 patent.  Here's the kicker: the patent *does not* describe
-the key strategic part of Scoreboarding that makes it so powerful and
-much more power-efficient than the Tomasulo Algorithm when combined
-with Reorder Buffers: the Dependency Matrices.
+it is hopelessly inadequate. There's no roll-back method for
+exceptions, no method for coping with register "hazards" (e.g., read
+after write), so register "renaming" has to be done as a precursor
+step, no way to do branch prediction, and only a single LOAD/STORE can
+be done at any one time.
+
+All of these things have to be added, and the best way to do so is to
+absorb the feature known as the "reorder buffer" (and associated
+reservation stations) normally associated with the Tomasulo
+algorithm. At which point, as noted on
+[comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ)
+there really is no functional difference between "scoreboarding plus
+reorder buffer" and "Tomasulo algorithm plus reorder buffer." Even
+the Tomasulo common data bus is present in a functionally-orthogonal
+way (see later for details).
+
+The only *well-known* documentation on the CDC 6600 scoreboarding
+technique is the 1967 patent. Here's the kicker: the patent *does
+not* describe the key strategic part of scoreboarding that makes it so
+powerful and much more power-efficient than the Tomasulo algorithm
+when combined with reorder buffers: the functional unit's dependency
+matrices.
  
  Before getting to that stage, I thought it would be a good idea to
-make people aware of a book that Mitch told me about, called
-"Design of a Computer: the Control Data 6600" by James Thornton.
-James worked with Seymour Cray on the 6600.  It was literally
-constructed from PCB modules using hand-soldered transistors.
-Memory was magnetic rings (which is where we get the term "core memory"
-from), and the bootloader was a bank of toggle-switches.
+make people aware of a book that Mitch told me about, called "Design
+of a Computer: the Control Data 6600" by James Thornton. James worked
+with Seymour Cray on the *original design* of the 6600. It was
+literally constructed from PCB modules using hand-soldered
+transistors. Memory was magnetic rings (which is where we get the
+term "core memory" from), and the bootloader was a bank of
+toggle-switches. The design was absolutely revolutionary: where all
+other computers were managing an instruction every 11 clock cycles,
+the 6600 reduced that to **four**. The 7600, its successor, took that
+figure even lower.
  
  In 2002, someone named Tom Uban sought permission from James and his
-wife, to make the book available online, as, historically, the
-CDC 6600 is quite literally the precursor to modern supercomputing:
+wife, to make the book available online, as, historically, the CDC
+6600 is quite literally the precursor to modern supercomputing:
  
-[[design_of_a_computer_6600_permission.jpg]]
+{design-of-a-computer-6600-permission | link}
  
-So I particularly wanted to show the Dependency Matrix, which is the
-key strategic part of the Scoreboard:
+I particularly wanted to show the dependency matrix, which is the
+key strategic part of the scoreboard:
  
-[[design_of_a_computer_6600.jpg]]
+{design-of-a-computer-6600 | link}
  
  Basically, the patent shows a table with src1 and src2, and "ready"
-signals: what it does *not* show is the "Go Read" and "Go Write"
-signals, and it does not show the way in which one Function Unit
-blocks others, via the Dependency Matrix.
-
-It is well-known that the Tomasulo Reorder Buffer requires a CAM
-on the Destination Register, (which is power-hungry and expensive).
-This is described in academic literature as data coming "to".  The
-Scoreboard technique is described as data coming "from" source
-registers, however because the Dependency Matrix is left out of
-these discussions, what they fail to mention is that there are
-*multiple single-line* source wires, thus achieving the exact
-same purpose as the Reorder Buffer's CAM, with *far less power
-and die area*.
-
-Not only that: it is quite easy to add incremental register-renaming
-tags on top of the Scoreboard + Dependency Matrix, again, no need
-for a CAM.  Not only that: Mitch describes in an unpublished book
-chapter several techniques that each bring in all of the techniques
-that are usually exclusively associated with Reorder Buffers,
-such as Branch Prediction, speculative execution, precise exceptions
-and multi-issue LOAD / STORE hazard avoidance.  This diagram below
-is reproduced with Mitch's permission:
-
-[[mitch_ld_st_augmentation.jpg]]
+signals: what it does *not* show is the "go read" and "go write"
+signals, which allowed an instruction to *begin* execution without
+*committing* execution - a feature that's usually believed to be
+exclusive to Reorder Buffers. Furthermore, the patent certainly does
+not show the way in which one function unit blocks others, which is
+via the dependency matrix.
+
+It is well known that the Tomasulo Reorder Buffer requires a CAM on
+the Destination Register, which is power-hungry and expensive. This
+is described in academic literature as data coming "to." The
+scoreboard technique is described as data coming "from" source
+registers, however because the dependency matrix is left out of these
+discussions (not being part of the patent), what they fail to mention
+is that there are *multiple single-line* source wires, thus achieving
+the exact same purpose as the reorder buffer's CAM, with *far less
+power and die area*.
+
+Mitch's description of this on comp.arch was that the dependency
+matrix columns effectively may be viewed as a single-bit-wide "CAM,"
+which of course is far less hardware, being just AND gates. However,
+it wasn't until he very kindly sent me the chapters of his unpublished
+book on the 6600 that the significance of what he was saying actually
+sank in. Namely, that instead of a merged multi-wire very expensive
+"destination register" CAM, copying the *value* of the dependent src
+register into the reorder buffer (and then having to match it up
+afterwards on every clock cycle), the dependency matrix breaks this
+down into multiple really simple single wire comparators that
+*preserve* a **direct** link between the src register(s) and the
+destination(s) where they're needed. Consequently, the scoreboard and
+dependency matrix logic gates take up far less space, and use
+significantly less power.
+
+Not only that, but it is quite easy to add incremental
+register-renaming tags on top of the scoreboard + dependency matrix --
+again, no need for a CAM. Moreover, in the second unpublished book
+chapter, Mitch describes several techniques that each bring in all of
+the techniques that are usually exclusively associated with reorder
+buffers, such as branch prediction, speculative execution, precise
+exceptions, and multi-issue LOAD/STORE hazard avoidance. The diagram
+below is reproduced with Mitch's permission:
+
+{mitch-ld-st-augmentation | link}
  
  This high-level diagram includes some subtle modifications that
-augment a standard CDC 6600 design to allow speculative execution.
-A "Schroedinger" wire is added ("neither alive nor dead"), which,
-very simply put, prohibits Function Unit "Write" of results.  In
-this way, because the "Read" signals were independent of "Write"
-(something that is again completely missing from the academic
-literature in discussions of 6600 Scoreboards), the instruction
-may *begin* execution, but is prevented from *completing*
-execution.
-
-All that is required is to add one extra line to the Dependency
-Matrix per "branch" that is to be speculatively executed, just like
-any other Functional Unit, in effect.
-
-Mitch also has a high-level diagram of an additional LOAD/STORE Matrix that
-has, again, extremely simple rules: LOADs block STOREs, and
-STOREs block LOADs, and the signals "Read / Write" are then passed
-down to the Function Unit Dependency Matrix as well.  The rules for
-the blocking need only be based on "there is no possibility of a conflict"
-rather than "on which exact and precise address does a conflict occur".
-This in turn means that the number of address bits needed to detect a
-conflict may be significantly reduced.  Interestingly, RISC-V "Fence"
-instruction rules are based on the same idea.
-
-So this is just amazing.  Let's recap.  It's 2018, there's absolutely zero
-Libre SoCs in existence anywhere on our planet of 8 billion people, and
-we're looking for inspiration at literally a 55-year-old computer design
-that occupied an entire room and was hand-built with transistors, 
-on how to make a modern, power-efficient 3D-capable processor.
-
-Not only that: the project has accidentally unearthed incredibly valuable
-historic processor design information that has eluded the Intels and
-ARMs - billion-dollar companies - as well as the Academic community -
-for several decades.
-
-I'd like to take a minute to especially thank Mitch Alsup for his
-time in ongoing discussions, without which there would be absolutely
-no chance that I could possibly have learned about, let alone understood,
-any of the above.  As I mentioned in the very first update: new processor
-designs get one shot at success.  Basing the core of the design on
-a 55-year-old well-documented and extremely compact and efficient design
-is a reasonable strategy: it's just that, without Mitch's help, there
-would have been no way to understand the 6600's true value.
-
-Bottom line: we do not need to follow Intel's power-inefficient lead, here.
-
+augment a standard CDC 6600 design to allow speculative execution. A
+"Schroedinger" wire is added ("neither alive nor dead"), which, very
+simply put, prohibits function unit "write" of results (mentioned
+earlier as a pre-existing under-recognised key part of the 6600
+design). In this way, because the "read" signals were independent of
+"write" (something that is again completely missing from the academic
+literature in discussions of 6600 scoreboards), the instruction may
+*begin* execution, but is prevented from *completing* execution.
+
+All that is required to gain speculative execution on branches is to
+add to the dependency matrix one extra line per "branch" that is to be
+speculatively executed. The "branch speculation" unit is just like
+any other functional unit, in effect. In this way, we gain *exactly*
+the same capability as a reorder buffer, including all of the
+benefits. The same trick will work just as well for exceptions.
+
+Mitch also has a high-level diagram of an additional LOAD/STORE matrix
+that has, again, extremely simple rules: LOADs block STOREs, and
+STOREs block LOADs, and the signals "read / write" are then passed
+down to the function unit dependency matrix as well. The rules for the
+blocking need only be based on "there is no possibility of a conflict"
+rather than "on which exact and precise address does a conflict
+occur". This in turn means that the number of address bits needed to
+detect a conflict may be significantly reduced, i.e., only the top
+bits are needed.
+
+Interestingly, RISC-V "fence" instruction rules are based on the same
+idea, and it may turn out to be possible to leverage the L1 cache line
+numbers instead of the (full) address.
+
+Also, Mitch's unpublished book chapters help to
+identify and make clear that the CDC 6600's register file is designed
+with "write-through" capability, i.e., that a register that's written
+will go through *on the same clock cycle* to a "read" request. This
+makes the 6600's register file pretty much synonymous with the
+Tomasulo algorithm's "common data bus." This same-cycle feature *also
+provides operand forwarding for free*!
+
+This is just amazing. Let's recap. It's 2018, there's absolutely no
+Libre SoCs in existence anywhere on our planet of 8 billion people,
+and we're looking for inspiration on how to make a modern,
+power-efficient 3D-capable processor, only to find it in a literally
+55-year-old design for a computer that occupied an entire room and was
+hand-built with transistors!
+
+Not only that, but the project has accidentally unearthed incredibly
+valuable historic processor design information that has eluded the
+Intels and ARMs (billion-dollar companies), as well as the academic
+community, for several decades.
+
+I'd like to take a minute to especially thank Mitch Alsup for his time
+in ongoing discussions, without which there would be absolutely no
+chance I could possibly have learned about, let alone understood, any
+of the above. As I mentioned in the [very first project
+update](https://www.crowdsupply.com/libre-risc-v/m-class/updates/why-make-a-quad-core-64-bit-soc-surely-there-are-enough-already):
+new processor designs get one shot at success. Basing the core of the
+design on a 55-year-old, well-documented, and extremely compact and
+efficient design is a reasonable strategy: it's just that, without
+Mitch's help, there would have been no way to understand the 6600's
+true value.
+
+The bottom line is, we have a way forward that will result in
+significantly less hardware and a simpler design, using a lot less
+power than modern designs, yet providing all of the features normally
+the exclusive domain of top-end processors, all thanks to a refresh of
+a 55-year-old processor and the willingness of Mitch Alsup and James
+Thornton to share their expertise with the world.