X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=updates%2F004_2018dec06_microarchitecture_cont.mdwn;h=7995dd7584cda460c0b1f1b9f4533a67ed1937ff;hb=5cdd606067e23359f2e59a7e20aaabf10fdd895a;hp=28228c6be1cb8b1b310e6546788f8d356740d1df;hpb=4f0045893e5dbcd145b9364476c783ae891f749e;p=crowdsupply.git diff --git a/updates/004_2018dec06_microarchitecture_cont.mdwn b/updates/004_2018dec06_microarchitecture_cont.mdwn index 28228c6..7995dd7 100644 --- a/updates/004_2018dec06_microarchitecture_cont.mdwn +++ b/updates/004_2018dec06_microarchitecture_cont.mdwn @@ -1,45 +1,60 @@ -# Modernising 1960s Computer Technology: what can be learned from the CDC 6600 - -Firstly, many thanks to +Firstly, many thanks to [Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html) -for publishing a story on this project. I replied to some of the -[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/) -comments, here, endeavouring to use translation software to respect that -the forum is in German. - -In this update, following on from the analysis of the Tomasulo Algorithm, -by a process of osmosis I finally was able to make out a light at the -end of the "Scoreboard" tunnel, and it is not an oncoming train. -Conversations with -[Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ) -are becoming clear. +for publishing a story on this project. I replied to some of the +[Heise +Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/) +comments, here, endeavouring to use translation software to respect +that the forum is in German. 
+ +In this update, following on from the analysis of the Tomasulo +Algorithm, by a process of osmosis I finally was able to make out a +light at the end of the "Scoreboard" tunnel, and it is not an oncoming +train. Conversations with [Mitch +Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ) +are becoming clear, providing insights that, as we will find out +below, have not made it into the academic literature in over 20 years. In the previous update, I really did not like the [Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique for doing out-of-order superscalar execution, because, *as described*, -it is hopelessly inadequate. There's no roll-back method for -exceptions, no method for coping with register "hazards" (Read after Write -and so on), so register "renaming" has to be done as a precursor step, -no way to do branch prediction, and only a single LOAD/STORE can be -done at any one time. - -The only *well-known* documentation on the CDC 6600 Scoreboarding technique -is the 1967 patent. Here's the kicker: the patent *does not* describe -the key strategic part of Scoreboarding that makes it so powerful and -much more power-efficient than the Tomasulo Algorithm when combined -with Reorder Buffers: the Dependency Matrices. +it is hopelessly inadequate. There's no roll-back method for +exceptions, no method for coping with register "hazards" (Read after +Write and so on), so register "renaming" has to be done as a precursor +step, no way to do branch prediction, and only a single LOAD/STORE can +be done at any one time. + +All of these things have to be added, and the best way to do so is to +absorb the feature known as the "Reorder Buffer" (and associated +Reservation Stations), normally associated with the Tomasulo +Algorithm. 
At which point, as noted on +[comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ) +there really is no functional difference between "Scoreboarding plus +Reorder Buffer" and "Tomasulo Algorithm plus Reorder Buffer". Even +the Tomasulo Common Data Bus is present in a functionally-orthogonal +way (see later for details). + +The only *well-known* documentation on the CDC 6600 Scoreboarding +technique is the 1967 patent. Here's the kicker: the patent *does +not* describe the key strategic part of Scoreboarding that makes it so +powerful and much more power-efficient than the Tomasulo Algorithm +when combined with Reorder Buffers: the Functional Unit's Dependency +Matrices. Before getting to that stage, I thought it would be a good idea to -make people aware of a book that Mitch told me about, called -"Design of a Computer: the Control Data 6600" by James Thornton. -James worked with Seymour Cray on the 6600. It was literally -constructed from PCB modules using hand-soldered transistors. -Memory was magnetic rings (which is where we get the term "core memory" -from), and the bootloader was a bank of toggle-switches. +make people aware of a book that Mitch told me about, called "Design +of a Computer: the Control Data 6600" by James Thornton. James worked +with Seymour Cray on the *original design* of the 6600. It was +literally constructed from PCB modules using hand-soldered +transistors. Memory was magnetic rings (which is where we get the +term "core memory" from), and the bootloader was a bank of +toggle-switches. The design was absolutely revolutionary: where all +other computers were managing an instruction every 11 clock cycles, +the 6600 reduced that to **four**. The 7600, its successor, took that +figure even lower. 
In 2002, someone named Tom Uban sought permission from James and his -wife, to make the book available online, as, historically, the -CDC 6600 is quite literally the precursor to modern supercomputing: +wife, to make the book available online, as, historically, the CDC +6600 is quite literally the precursor to modern supercomputing: [[design_of_a_computer_6600_permission.jpg]] @@ -50,73 +65,112 @@ key strategic part of the Scoreboard: Basically, the patent shows a table with src1 and src2, and "ready" signals: what it does *not* show is the "Go Read" and "Go Write" -signals, and it does not show the way in which one Function Unit -blocks others, via the Dependency Matrix. - -It is well-known that the Tomasulo Reorder Buffer requires a CAM -on the Destination Register, (which is power-hungry and expensive). -This is described in academic literature as data coming "to". The +signals (which allowed an instruction to *begin* execution without +*committing* execution - a feature that's usually believed to be +exclusive to Reorder Buffers), and the patent certainly does not show +the way in which one Function Unit blocks others, via the Dependency +Matrix. + +It is well-known that the Tomasulo Reorder Buffer requires a CAM on +the Destination Register, (which is power-hungry and expensive). This +is described in academic literature as data coming "to". The Scoreboard technique is described as data coming "from" source -registers, however because the Dependency Matrix is left out of -these discussions, what they fail to mention is that there are -*multiple single-line* source wires, thus achieving the exact -same purpose as the Reorder Buffer's CAM, with *far less power -and die area*. 
+registers, however because the Dependency Matrix is left out of these +discussions (not being part of the patent), what they fail to mention +is that there are *multiple single-line* source wires, thus achieving +the exact same purpose as the Reorder Buffer's CAM, with *far less +power and die area*. + +Mitch's description of this on comp.arch was that the Dependency +Matrix columns effectively may be viewed as a single-bit-wide "CAM", +which of course is far less hardware, being just AND gates. However +it wasn't until he very kindly sent me the chapters of his unpublished +book on the 6600 that the significance of what he was saying actually +sank in, namely that instead of a merged multi-wire very expensive +"Destination Register" CAM, copying the *value* of the dependent src +register into the Reorder Buffer (and then having to match it up +afterwards, on every clock cycle), the Dependency Matrix breaks this +down into multiple really really simple single wire comparators that +*preserve* a **direct** link between the src register(s) and the +destination(s) where they're needed. Consequently, the Scoreboard and +Dependency Matrix logic gates take up far less space, and use +significantly less power. Not only that: it is quite easy to add incremental register-renaming -tags on top of the Scoreboard + Dependency Matrix, again, no need -for a CAM. Not only that: Mitch describes in an unpublished book +tags on top of the Scoreboard + Dependency Matrix, again, no need for +a CAM. Not only that: Mitch describes in the second unpublished book chapter several techniques that each bring in all of the techniques -that are usually exclusively associated with Reorder Buffers, -such as Branch Prediction, speculative execution, precise exceptions -and multi-issue LOAD / STORE hazard avoidance. 
This diagram below -is reproduced with Mitch's permission: +that are usually exclusively associated with Reorder Buffers, such as +Branch Prediction, speculative execution, precise exceptions and +multi-issue LOAD / STORE hazard avoidance. This diagram below is +reproduced with Mitch's permission: [[mitch_ld_st_augmentation.jpg]] This high-level diagram includes some subtle modifications that -augment a standard CDC 6600 design to allow speculative execution. -A "Schroedinger" wire is added ("neither alive nor dead"), which, -very simply put, prohibits Function Unit "Write" of results. In -this way, because the "Read" signals were independent of "Write" -(something that is again completely missing from the academic -literature in discussions of 6600 Scoreboards), the instruction -may *begin* execution, but is prevented from *completing* -execution. - -All that is required is to add one extra line to the Dependency -Matrix per "branch" that is to be speculatively executed, just like -any other Functional Unit, in effect. - -Mitch also has a high-level diagram of an additional LOAD/STORE Matrix that -has, again, extremely simple rules: LOADs block STOREs, and +augment a standard CDC 6600 design to allow speculative execution. A +"Schroedinger" wire is added ("neither alive nor dead"), which, very +simply put, prohibits Function Unit "Write" of results (mentioned +earlier as a pre-existing under-recognised key part of the 6600 +design). In this way, because the "Read" signals were independent of +"Write" (something that is again completely missing from the academic +literature in discussions of 6600 Scoreboards), the instruction may +*begin* execution, but is prevented from *completing* execution. + +All that is required to gain speculative execution on branches is to +add one extra line to the Dependency Matrix per "branch" that is to be +speculatively executed. The "Branch Speculation" Unit is just like +any other Functional Unit, in effect. 
In this way, we gain *exactly* +the same capability as a Reorder Buffer, including all of the +benefits. The same trick will work just as well for Exceptions. + +Mitch also has a high-level diagram of an additional LOAD/STORE Matrix +that has, again, extremely simple rules: LOADs block STOREs, and STOREs block LOADs, and the signals "Read / Write" are then passed -down to the Function Unit Dependency Matrix as well. The rules for -the blocking need only be based on "there is no possibility of a conflict" -rather than "on which exact and precise address does a conflict occur". -This in turn means that the number of address bits needed to detect a -conflict may be significantly reduced. Interestingly, RISC-V "Fence" -instruction rules are based on the same idea. - -So this is just amazing. Let's recap. It's 2018, there's absolutely zero -Libre SoCs in existence anywhere on our planet of 8 billion people, and -we're looking for inspiration at literally a 55-year-old computer design -that occupied an entire room and was hand-built with transistors, -on how to make a modern, power-efficient 3D-capable processor. - -Not only that: the project has accidentally unearthed incredibly valuable -historic processor design information that has eluded the Intels and -ARMs - billion-dollar companies - as well as the Academic community - -for several decades. - -I'd like to take a minute to especially thank Mitch Alsup for his -time in ongoing discussions, without which there would be absolutely -no chance that I could possibly have learned about, let alone understood, -any of the above. As I mentioned in the very first update: new processor -designs get one shot at success. Basing the core of the design on -a 55-year-old well-documented and extremely compact and efficient design -is a reasonable strategy: it's just that, without Mitch's help, there -would have been no way to understand the 6600's true value. 
- -Bottom line: we do not need to follow Intel's power-inefficient lead, here. - +down to the Function Unit Dependency Matrix as well. The rules for +the blocking need only be based on "there is no possibility of a +conflict" rather than "on which exact and precise address does a +conflict occur". This in turn means that the number of address bits +needed to detect a conflict may be significantly reduced, i.e. only +the top bits are needed. + +Interestingly, RISC-V "Fence" instruction rules are based on the same +idea, and it may turn out to be possible to leverage the L1 Cache Line +numbers instead of the (full) address. + +Also, Mitch's unpublished book chapters make clear that the CDC +6600's register file is designed with "write-through" capability, +i.e. a value written to a register is visible to a "read" request +*on the same clock cycle*. This +makes the 6600's register file pretty much synonymous with the +Tomasulo Algorithm "Common Data Bus". This same-cycle feature *also +provides operand forwarding for free*! + +So this is just amazing. Let's recap. It's 2018, there are absolutely +zero Libre SoCs in existence anywhere on our planet of 8 billion +people, and we're looking to literally a 55-year-old +computer design that occupied an entire room and was hand-built with +transistors, for inspiration on how to make a modern, power-efficient +3D-capable processor. + +Not only that: the project has accidentally unearthed incredibly +valuable historic processor design information that has eluded the +Intels and ARMs - billion-dollar companies - as well as the Academic +community - for several decades. + +I'd like to take a minute to especially thank Mitch Alsup for his time +in ongoing discussions, without which there would be absolutely no +chance that I could possibly have learned about, let alone understood, +any of the above. 
As I mentioned in the very first update: new +processor designs get one shot at success. Basing the core of the +design on a 55-year-old well-documented and extremely compact and +efficient design is a reasonable strategy: it's just that, without +Mitch's help, there would have been no way to understand the 6600's +true value. + +Bottom line: we have a way forward that will result in +significantly less hardware and a simpler design, using a lot less power +than modern designs today, yet providing all of the features that are +normally the exclusive domain of top-end processors. This is thanks to +a refresh of a 55-year-old processor design and the willingness of +Mitch Alsup and James Thornton to share their expertise with the world.
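
The conservative LOAD/STORE blocking rule described in this update (compare only enough address bits to rule a conflict *out*, rather than finding the exact conflicting address) can be sketched in Python as follows. This is a minimal illustration only, not Mitch Alsup's matrix or this project's design: the 64-byte line size, the `MemOp` type and the `may_conflict` / `must_wait` helper names are all assumptions made for the example, and it ignores accesses that straddle a line boundary.

```python
from dataclasses import dataclass

# Assumption for illustration: 64-byte L1 cache lines, so the low 6
# address bits are dropped and only the upper bits are compared.
LINE_BITS = 6


@dataclass(frozen=True)
class MemOp:
    """Hypothetical in-flight memory operation (not from the source)."""
    is_store: bool
    addr: int


def line_number(addr: int) -> int:
    """Keep only the top address bits, i.e. the cache line number."""
    return addr >> LINE_BITS


def may_conflict(a: MemOp, b: MemOp) -> bool:
    """Conservative test: True means "a conflict cannot be ruled out".

    Only the two rules given in the text are modelled: LOADs block
    STOREs and STOREs block LOADs. LOAD/LOAD pairs never block, and
    STORE/STORE ordering is not covered by the text, so it is left out.
    """
    if a.is_store == b.is_store:
        return False
    return line_number(a.addr) == line_number(b.addr)


def must_wait(new_op: MemOp, in_flight: list[MemOp]) -> bool:
    """A new operation waits if any older operation might conflict."""
    return any(may_conflict(new_op, older) for older in in_flight)


if __name__ == "__main__":
    load = MemOp(is_store=False, addr=0x1008)
    store_same_line = MemOp(is_store=True, addr=0x1030)
    store_other_line = MemOp(is_store=True, addr=0x1040)
    print(must_wait(store_same_line, [load]))
    print(must_wait(store_other_line, [load]))
```

The trade-off in this sketch is that a false positive only costs a little time: a LOAD and a STORE to different bytes of the same line wait for each other unnecessarily, but (under the aligned-access assumption above) a genuine conflict is never missed, and the comparison needs far fewer bits than a full-address match.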