From: Joshua Harlan Lifton
Date: Sun, 20 Jan 2019 03:10:03 +0000 (-0800)
Subject: Normalize whitespace and line wrapping
X-Git-Url: https://git.libre-soc.org/?p=crowdsupply.git;a=commitdiff_plain;h=5cdd606067e23359f2e59a7e20aaabf10fdd895a

Normalize whitespace and line wrapping
---

diff --git a/updates/004_2018dec06_microarchitecture_cont.mdwn b/updates/004_2018dec06_microarchitecture_cont.mdwn
index 51e02b5..7995dd7 100644
--- a/updates/004_2018dec06_microarchitecture_cont.mdwn
+++ b/updates/004_2018dec06_microarchitecture_cont.mdwn
@@ -1,59 +1,60 @@
-# Modernising 1960s Computer Technology: what can be learned from the CDC 6600
-
-Firstly, many thanks to
+Firstly, many thanks to
 [Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
-for publishing a story on this project. I replied to some of the
-[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
-comments, here, endeavouring to use translation software to respect that
-the forum is in German.
-
-In this update, following on from the analysis of the Tomasulo Algorithm,
-by a process of osmosis I finally was able to make out a light at the
-end of the "Scoreboard" tunnel, and it is not an oncoming train.
-Conversations with
-[Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
-are becoming clear, providing insights that, as we will find out below,
-have not made it into the academic literature in over 20 years.
+for publishing a story on this project. I replied to some of the
+[Heise
+Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
+comments, here, endeavouring to use translation software to respect
+that the forum is in German.
+ +In this update, following on from the analysis of the Tomasulo +Algorithm, by a process of osmosis I finally was able to make out a +light at the end of the "Scoreboard" tunnel, and it is not an oncoming +train. Conversations with [Mitch +Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ) +are becoming clear, providing insights that, as we will find out +below, have not made it into the academic literature in over 20 years. In the previous update, I really did not like the [Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique for doing out-of-order superscalar execution, because, *as described*, -it is hopelessly inadequate. There's no roll-back method for -exceptions, no method for coping with register "hazards" (Read after Write -and so on), so register "renaming" has to be done as a precursor step, -no way to do branch prediction, and only a single LOAD/STORE can be -done at any one time. - -All of these things have to be added, and the best way to do so is -to absorb the feature known as the "Reorder Buffer" (and associated -Reservation Stations), normally associated with the Tomasulo Algorithm. -At which point, as noted on +it is hopelessly inadequate. There's no roll-back method for +exceptions, no method for coping with register "hazards" (Read after +Write and so on), so register "renaming" has to be done as a precursor +step, no way to do branch prediction, and only a single LOAD/STORE can +be done at any one time. + +All of these things have to be added, and the best way to do so is to +absorb the feature known as the "Reorder Buffer" (and associated +Reservation Stations), normally associated with the Tomasulo +Algorithm. At which point, as noted on [comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ) there really is no functional difference between "Scoreboarding plus -Reorder Buffer" and "Tomasulo Algorithm plus Reorder Buffer". 
Even +Reorder Buffer" and "Tomasulo Algorithm plus Reorder Buffer". Even the Tomasulo Common Data Bus is present in a functionally-orthogonal way (see later for details). -The only *well-known* documentation on the CDC 6600 Scoreboarding technique -is the 1967 patent. Here's the kicker: the patent *does not* describe -the key strategic part of Scoreboarding that makes it so powerful and -much more power-efficient than the Tomasulo Algorithm when combined -with Reorder Buffers: the Functional Unit's Dependency Matrices. +The only *well-known* documentation on the CDC 6600 Scoreboarding +technique is the 1967 patent. Here's the kicker: the patent *does +not* describe the key strategic part of Scoreboarding that makes it so +powerful and much more power-efficient than the Tomasulo Algorithm +when combined with Reorder Buffers: the Functional Unit's Dependency +Matrices. Before getting to that stage, I thought it would be a good idea to -make people aware of a book that Mitch told me about, called -"Design of a Computer: the Control Data 6600" by James Thornton. -James worked with Seymour Cray on the *original design* of the 6600. -It was literally constructed from PCB modules using hand-soldered -transistors. Memory was magnetic rings (which is where we get the term -"core memory" from), and the bootloader was a bank of toggle-switches. -The design was absolutely revolutionary: where all other computers -were managing an instruction every 11 clock cycles, the 6600 reduced -that to **four**. The 7600, its successor, took that figure even lower. +make people aware of a book that Mitch told me about, called "Design +of a Computer: the Control Data 6600" by James Thornton. James worked +with Seymour Cray on the *original design* of the 6600. It was +literally constructed from PCB modules using hand-soldered +transistors. Memory was magnetic rings (which is where we get the +term "core memory" from), and the bootloader was a bank of +toggle-switches. 
The design was absolutely revolutionary: where all +other computers were managing an instruction every 11 clock cycles, +the 6600 reduced that to **four**. The 7600, its successor, took that +figure even lower. In 2002, someone named Tom Uban sought permission from James and his -wife, to make the book available online, as, historically, the -CDC 6600 is quite literally the precursor to modern supercomputing: +wife, to make the book available online, as, historically, the CDC +6600 is quite literally the precursor to modern supercomputing: [[design_of_a_computer_6600_permission.jpg]] @@ -66,108 +67,110 @@ Basically, the patent shows a table with src1 and src2, and "ready" signals: what it does *not* show is the "Go Read" and "Go Write" signals (which allowed an instruction to *begin* execution without *committing* execution - a feature that's usually believed to be -exclusive to Reorder Buffers), and the patent certainly does not show the -way in which one Function Unit blocks others, via the Dependency Matrix. +exclusive to Reorder Buffers), and the patent certainly does not show +the way in which one Function Unit blocks others, via the Dependency +Matrix. -It is well-known that the Tomasulo Reorder Buffer requires a CAM -on the Destination Register, (which is power-hungry and expensive). -This is described in academic literature as data coming "to". The +It is well-known that the Tomasulo Reorder Buffer requires a CAM on +the Destination Register, (which is power-hungry and expensive). This +is described in academic literature as data coming "to". The Scoreboard technique is described as data coming "from" source -registers, however because the Dependency Matrix is left out of -these discussions (not being part of the patent), what they fail to -mention is that there are *multiple single-line* source wires, thus -achieving the exact same purpose as the Reorder Buffer's CAM, with *far -less power and die area*. 
- -Mitch's description of this on comp.arch was that the Dependency Matrix columns -effectively may be viewed as a single-bit-wide "CAM", which of course -is far less hardware, being just AND gates. However it wasn't until -he very kindly sent me the chapters of his unpublished book on the 6600 -that the significance of what he was saying actually sank in, namely that -instead of a merged multi-wire very expensive "Destination Register" CAM, -copying the *value* of the dependent src register into the Reorder Buffer -(and then having to match it up afterwards. on every clock cycle), -the Dependency Matrix breaks this down into multiple really really simple -single wire comparators that *preserve* a **direct** link between the -src register(s) and the destination(s) where they're needed. Consequently, -the Scoreboard and Dependency Matrix logic gates take up far less space, -and use significantly less power. +registers, however because the Dependency Matrix is left out of these +discussions (not being part of the patent), what they fail to mention +is that there are *multiple single-line* source wires, thus achieving +the exact same purpose as the Reorder Buffer's CAM, with *far less +power and die area*. + +Mitch's description of this on comp.arch was that the Dependency +Matrix columns effectively may be viewed as a single-bit-wide "CAM", +which of course is far less hardware, being just AND gates. However +it wasn't until he very kindly sent me the chapters of his unpublished +book on the 6600 that the significance of what he was saying actually +sank in, namely that instead of a merged multi-wire very expensive +"Destination Register" CAM, copying the *value* of the dependent src +register into the Reorder Buffer (and then having to match it up +afterwards. 
on every clock cycle), the Dependency Matrix breaks this +down into multiple really really simple single wire comparators that +*preserve* a **direct** link between the src register(s) and the +destination(s) where they're needed. Consequently, the Scoreboard and +Dependency Matrix logic gates take up far less space, and use +significantly less power. Not only that: it is quite easy to add incremental register-renaming -tags on top of the Scoreboard + Dependency Matrix, again, no need -for a CAM. Not only that: Mitch describes in the second unpublished book +tags on top of the Scoreboard + Dependency Matrix, again, no need for +a CAM. Not only that: Mitch describes in the second unpublished book chapter several techniques that each bring in all of the techniques -that are usually exclusively associated with Reorder Buffers, -such as Branch Prediction, speculative execution, precise exceptions -and multi-issue LOAD / STORE hazard avoidance. This diagram below -is reproduced with Mitch's permission: +that are usually exclusively associated with Reorder Buffers, such as +Branch Prediction, speculative execution, precise exceptions and +multi-issue LOAD / STORE hazard avoidance. This diagram below is +reproduced with Mitch's permission: [[mitch_ld_st_augmentation.jpg]] This high-level diagram includes some subtle modifications that -augment a standard CDC 6600 design to allow speculative execution. -A "Schroedinger" wire is added ("neither alive nor dead"), which, -very simply put, prohibits Function Unit "Write" of results (mentioned -earlier as a pre-existing under-recognised key part of the 6600 design). In -this way, because the "Read" signals were independent of "Write" -(something that is again completely missing from the academic -literature in discussions of 6600 Scoreboards), the instruction -may *begin* execution, but is prevented from *completing* -execution. +augment a standard CDC 6600 design to allow speculative execution. 
A +"Schroedinger" wire is added ("neither alive nor dead"), which, very +simply put, prohibits Function Unit "Write" of results (mentioned +earlier as a pre-existing under-recognised key part of the 6600 +design). In this way, because the "Read" signals were independent of +"Write" (something that is again completely missing from the academic +literature in discussions of 6600 Scoreboards), the instruction may +*begin* execution, but is prevented from *completing* execution. All that is required to gain speculative execution on branches is to add one extra line to the Dependency Matrix per "branch" that is to be -speculatively executed. The "Branch Speculation" Unit is just like any -other Functional Unit, in effect. In this way, we gain *exactly* the -same capability as a Reorder Buffer, including all of the benefits. -The same trick will work just as well for Exceptions. +speculatively executed. The "Branch Speculation" Unit is just like +any other Functional Unit, in effect. In this way, we gain *exactly* +the same capability as a Reorder Buffer, including all of the +benefits. The same trick will work just as well for Exceptions. -Mitch also has a high-level diagram of an additional LOAD/STORE Matrix that -has, again, extremely simple rules: LOADs block STOREs, and +Mitch also has a high-level diagram of an additional LOAD/STORE Matrix +that has, again, extremely simple rules: LOADs block STOREs, and STOREs block LOADs, and the signals "Read / Write" are then passed -down to the Function Unit Dependency Matrix as well. The rules for -the blocking need only be based on "there is no possibility of a conflict" -rather than "on which exact and precise address does a conflict occur". -This in turn means that the number of address bits needed to detect a -conflict may be significantly reduced, i.e. only the top bits are -needed. 
- -Interestingly, RISC-V "Fence" instruction rules are based on the same idea, -and it may turn out to be possible to leverage the L1 Cache Line numbers -instead of the (full) address. - -Also, thanks to Mitch's help, his unpublished book chapters help -to identify and make clear that the CDC 6600's register file is designed with -"write-through" capability, i.e. that a register that's written will -go through *on the same clock cycle* to a "read" request. This makes -the 6600's register file pretty much synonymous with the Tomasulo -Algorithm "Common Data Bus". This same-cycle feature *also provides -operand forwarding for free*! - -So this is just amazing. Let's recap. It's 2018, there's absolutely zero -Libre SoCs in existence anywhere on our planet of 8 billion people, and -we're looking for inspiration at literally a 55-year-old computer design -that occupied an entire room and was hand-built with transistors, -on how to make a modern, power-efficient 3D-capable processor. - -Not only that: the project has accidentally unearthed incredibly valuable -historic processor design information that has eluded the Intels and -ARMs - billion-dollar companies - as well as the Academic community - -for several decades. - -I'd like to take a minute to especially thank Mitch Alsup for his -time in ongoing discussions, without which there would be absolutely -no chance that I could possibly have learned about, let alone understood, -any of the above. As I mentioned in the very first update: new processor -designs get one shot at success. Basing the core of the design on -a 55-year-old well-documented and extremely compact and efficient design -is a reasonable strategy: it's just that, without Mitch's help, there -would have been no way to understand the 6600's true value. 
- -Bottom line is, we have a way forward that will result in significantly -less hardware, a simpler design, using a lot less power than modern -designs today, yet providing all of the features normally the exclusive -domain of top-end processors. Thanks to a refresh of a 55-year-old -processor and the willingness of Mitch Alsup and James Thornton to share -their expertise with the world. - +down to the Function Unit Dependency Matrix as well. The rules for +the blocking need only be based on "there is no possibility of a +conflict" rather than "on which exact and precise address does a +conflict occur". This in turn means that the number of address bits +needed to detect a conflict may be significantly reduced, i.e. only +the top bits are needed. + +Interestingly, RISC-V "Fence" instruction rules are based on the same +idea, and it may turn out to be possible to leverage the L1 Cache Line +numbers instead of the (full) address. + +Also, thanks to Mitch's help, his unpublished book chapters help to +identify and make clear that the CDC 6600's register file is designed +with "write-through" capability, i.e. that a register that's written +will go through *on the same clock cycle* to a "read" request. This +makes the 6600's register file pretty much synonymous with the +Tomasulo Algorithm "Common Data Bus". This same-cycle feature *also +provides operand forwarding for free*! + +So this is just amazing. Let's recap. It's 2018, there's absolutely +zero Libre SoCs in existence anywhere on our planet of 8 billion +people, and we're looking for inspiration at literally a 55-year-old +computer design that occupied an entire room and was hand-built with +transistors, on how to make a modern, power-efficient 3D-capable +processor. + +Not only that: the project has accidentally unearthed incredibly +valuable historic processor design information that has eluded the +Intels and ARMs - billion-dollar companies - as well as the Academic +community - for several decades. 
+ +I'd like to take a minute to especially thank Mitch Alsup for his time +in ongoing discussions, without which there would be absolutely no +chance that I could possibly have learned about, let alone understood, +any of the above. As I mentioned in the very first update: new +processor designs get one shot at success. Basing the core of the +design on a 55-year-old well-documented and extremely compact and +efficient design is a reasonable strategy: it's just that, without +Mitch's help, there would have been no way to understand the 6600's +true value. + +Bottom line is, we have a way forward that will result in +significantly less hardware, a simpler design, using a lot less power +than modern designs today, yet providing all of the features normally +the exclusive domain of top-end processors. Thanks to a refresh of a +55-year-old processor and the willingness of Mitch Alsup and James +Thornton to share their expertise with the world.
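
A note on the update text carried in this patch: the Dependency Matrix idea (single-bit cells per Function Unit instead of a Destination Register CAM, plus a "Schroedinger" wire that lets an instruction *begin* but not *complete*) may be easier to follow as a toy model. The sketch below is only an illustration under loud assumptions: the class `ToyScoreboard` and the names `waits_on`, `shadowed`, `release_shadow` and `cancel` are made up here, only Read-after-Write dependencies are tracked, and issue simply refuses on Write-after-Write. It is not taken from the 6600 hardware or from Mitch Alsup's chapters.

```python
class ToyScoreboard:
    """Toy model: one row of single-bit dependency cells per Function Unit.

    Hypothetical illustration only. Row i is a bitmask; bit j set means
    "unit i may not read until unit j has written its result".
    """

    def __init__(self, num_units):
        self.num_units = num_units
        self.busy = [False] * num_units
        self.dest = [None] * num_units       # destination register per unit
        self.waits_on = [0] * num_units      # the dependency matrix rows
        self.has_read = [False] * num_units
        self.shadowed = [False] * num_units  # assumed "Schroedinger" write inhibit

    def issue(self, unit, dest, srcs, shadowed=False):
        """Place an instruction in a unit and record who it must wait for."""
        if self.busy[unit]:
            raise ValueError("unit is busy")
        row = 0
        for other in range(self.num_units):
            if not self.busy[other]:
                continue
            if self.dest[other] == dest:
                # Simplification: the toy refuses rather than tracking WAW.
                raise ValueError("toy model stalls on write-after-write")
            if self.dest[other] in srcs:
                row |= 1 << other            # one bit, no register value copied
        self.busy[unit] = True
        self.dest[unit] = dest
        self.waits_on[unit] = row
        self.has_read[unit] = False
        self.shadowed[unit] = shadowed

    def go_read(self, unit):
        """Operands may be read once every producer bit in the row is clear."""
        return (self.busy[unit] and not self.has_read[unit]
                and self.waits_on[unit] == 0)

    def read(self, unit):
        if not self.go_read(unit):
            raise RuntimeError("Go Read not asserted")
        self.has_read[unit] = True

    def go_write(self, unit):
        """Execution has begun, but the result is held while shadowed."""
        return (self.busy[unit] and self.has_read[unit]
                and not self.shadowed[unit])

    def write(self, unit):
        if not self.go_write(unit):
            raise RuntimeError("Go Write not asserted")
        column = 1 << unit
        # Clearing one column plays the role of the CAM match: a bitwise AND.
        self.waits_on = [row & ~column for row in self.waits_on]
        self.busy[unit] = False

    def release_shadow(self, unit):
        """Speculation confirmed: the result may now be written."""
        self.shadowed[unit] = False

    def cancel(self, unit):
        """Mis-speculation: drop the unit and anything still waiting on it."""
        doomed = [unit]
        while doomed:
            u = doomed.pop()
            if not self.busy[u]:
                continue
            self.busy[u] = False
            self.shadowed[u] = False
            self.waits_on[u] = 0
            for other in range(self.num_units):
                if self.busy[other] and self.waits_on[other] & (1 << u):
                    doomed.append(other)


if __name__ == "__main__":
    sb = ToyScoreboard(3)
    sb.issue(0, dest="r1", srcs=("r2", "r3"))
    sb.issue(1, dest="r4", srcs=("r1", "r5"), shadowed=True)  # speculative
    sb.read(0)
    sb.write(0)                  # clears column 0, so unit 1 may now read
    sb.read(1)
    print(sb.go_write(1))        # held back while the shadow is set
    sb.release_shadow(1)
    print(sb.go_write(1))        # free to commit once speculation resolves
```

The point of the sketch is the shape of the bookkeeping: a dependent unit keeps one bit per producer rather than a copy of a register tag to be matched every cycle, and holding back "Write" while allowing "Read" is all the shadow line does. Write-after-Read tracking, the LOAD/STORE matrix and register renaming are deliberately left out.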