updates/004_2018dec06_microarchitecture_cont.mdwn

# Modernising 1960s Computer Technology: what can be learned from the CDC 6600

Firstly, many thanks to
[Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
for publishing a story on this project.  I replied to some of the
[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
comments, using translation software out of respect for the fact that
the forum is in German.

This update follows on from the analysis of the Tomasulo Algorithm.
By a process of osmosis I was finally able to make out a light at the
end of the "Scoreboard" tunnel, and it is not an oncoming train.
The conversations with
[Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
are starting to make sense.

In the previous update, I really did not like the
[Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
for doing out-of-order superscalar execution, because, *as described*,
it is hopelessly inadequate.  There is no roll-back method for
exceptions and no way to do branch prediction.  There is no method for
coping with register "hazards" (Read after Write and so on), so register
"renaming" has to be done as a precursor step.  Only a single LOAD/STORE
can be done at any one time.

The only *well-known* documentation on the CDC 6600 Scoreboarding technique
is the 1967 patent.  Here's the kicker: the patent *does not* describe
the Dependency Matrices, which are the key strategic part of Scoreboarding.
They are what makes it so powerful, and much more power-efficient than
the Tomasulo Algorithm combined with Reorder Buffers.

Before getting to that stage, I thought it would be a good idea to
make people aware of a book that Mitch told me about, called
"Design of a Computer: the Control Data 6600" by James Thornton.
James worked with Seymour Cray on the 6600.  It was literally
constructed from PCB modules using hand-soldered transistors.
Memory was made of magnetic rings (which is where we get the term
"core memory" from), and the bootloader was a bank of toggle-switches.

In 2002, Tom Uban sought permission from James and his wife to make
the book available online because, historically, the CDC 6600 is
quite literally the precursor to modern supercomputing:

[[design_of_a_computer_6600_permission.jpg]]

I particularly wanted to show the Dependency Matrix, which is the
key strategic part of the Scoreboard:

[[design_of_a_computer_6600.jpg]]

Basically, the patent shows a table with src1 and src2 and "ready"
signals.  What it does *not* show is the "Go Read" and "Go Write"
signals, nor the way in which one Function Unit blocks others via
the Dependency Matrix.

It is well-known that the Tomasulo Reorder Buffer requires a CAM
on the Destination Register, which is power-hungry and expensive.
This is described in academic literature as data coming "to".  The
Scoreboard technique is described as data coming "from" source
registers.  However, because the Dependency Matrix is left out of
these discussions, what they fail to mention is that there are
*multiple single-line* source wires, which achieve exactly the
same purpose as the Reorder Buffer's CAM, with *far less power
and die area*.
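
To make the "single-line wires instead of a CAM" idea concrete, here is
a minimal Python sketch of a Function-Unit-to-Function-Unit dependency
matrix.  It is an illustration only, not the circuit from Thornton's
book or from Mitch's chapter: the class name `DependencyMatrix`, its
method names and the decision to stall issue on a write-after-write
clash are all assumptions made for this example.  Each matrix cell is
one bit, standing in for one wire between two units.

```python
class DependencyMatrix:
    """Toy FU-to-FU dependency matrix: every cell is a single bit."""

    def __init__(self, n_units):
        self.n = n_units
        self.busy = [False] * n_units
        self.has_read = [False] * n_units
        self.dest = [None] * n_units   # register each unit will write
        self.srcs = [()] * n_units     # registers each unit reads
        # wait_write[i][j]: unit i may not read until unit j has written (RAW)
        self.wait_write = [[False] * n_units for _ in range(n_units)]
        # wait_read[i][j]: unit i may not write until unit j has read (WAR)
        self.wait_read = [[False] * n_units for _ in range(n_units)]

    def issue(self, unit, dest, srcs):
        """Place an instruction in a unit; returns False if issue must stall."""
        others = [u for u in range(self.n) if u != unit and self.busy[u]]
        if self.busy[unit] or any(self.dest[u] == dest for u in others):
            return False  # unit in use, or write-after-write clash (assumed rule)
        for u in others:
            self.wait_write[unit][u] = self.dest[u] in srcs
            self.wait_read[unit][u] = (not self.has_read[u]) and dest in self.srcs[u]
        self.busy[unit], self.has_read[unit] = True, False
        self.dest[unit], self.srcs[unit] = dest, tuple(srcs)
        return True

    def go_read(self, unit):
        """'Go Read': no earlier unit still has to write one of our sources."""
        return (self.busy[unit] and not self.has_read[unit]
                and not any(self.wait_write[unit]))

    def go_write(self, unit):
        """'Go Write': no earlier unit still has to read our destination."""
        return (self.busy[unit] and self.has_read[unit]
                and not any(self.wait_read[unit]))

    def read(self, unit):
        assert self.go_read(unit)
        self.has_read[unit] = True
        for row in self.wait_read:      # release anyone waiting on this read
            row[unit] = False

    def write(self, unit):
        assert self.go_write(unit)
        for row in self.wait_write:     # release anyone waiting on this write
            row[unit] = False
        self.busy[unit] = False
        self.dest[unit], self.srcs[unit] = None, ()


dm = DependencyMatrix(3)
dm.issue(0, "r1", ("r2", "r3"))   # r1 = r2 op r3
dm.issue(1, "r4", ("r1", "r5"))   # reads r1: must wait for unit 0 to write
dm.issue(2, "r5", ("r6", "r7"))   # writes r5: must wait for unit 1 to read
```

In this sketch unit 2 can read and execute straight away but cannot
write until unit 1 has read the old value of `r5`.  All of the
"searching" is done once, at issue time; after that, `go_read` and
`go_write` only have to OR together one row of bits.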

Not only that: it is quite easy to add incremental register-renaming
tags on top of the Scoreboard + Dependency Matrix, again with no need
for a CAM.  Mitch also describes, in an unpublished book chapter,
several techniques that bring in the features usually associated
exclusively with Reorder Buffers, such as Branch Prediction,
speculative execution, precise exceptions and multi-issue LOAD / STORE
hazard avoidance.  The diagram below is reproduced with Mitch's
permission:

[[mitch_ld_st_augmentation.jpg]]

This high-level diagram includes some subtle modifications that
augment a standard CDC 6600 design to allow speculative execution.
A "Schroedinger" wire is added ("neither alive nor dead"), which,
very simply put, prohibits a Function Unit from writing its result.
Because the "Read" signals are independent of "Write" (something that
is again completely missing from the academic literature in
discussions of 6600 Scoreboards), the instruction may *begin*
execution, but is prevented from *completing* execution.

All that is required is to add one extra line to the Dependency
Matrix per "branch" that is to be speculatively executed: in effect,
the branch is treated just like any other Function Unit.
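
As a rough illustration of the "one extra line per branch" idea, the
Python sketch below keeps one row of bits per unresolved branch and one
column per Function Unit.  The names (`ShadowMatrix`, `predict`,
`may_write`, `resolve`) are hypothetical and the cancellation behaviour
is an assumption; Mitch's actual design may differ in detail.  The
point is only that a unit issued under an unresolved branch may execute,
but its write is held back until the branch is resolved.

```python
class ShadowMatrix:
    """Toy model: one row per unresolved branch, one bit per Function Unit."""

    def __init__(self, n_branches, n_units):
        self.unresolved = [False] * n_branches
        # shadow[b][u]: unit u was issued while branch b was still unresolved
        self.shadow = [[False] * n_units for _ in range(n_branches)]
        self.cancelled = [False] * n_units

    def predict(self, branch):
        """A branch has been predicted; everything issued after it is shadowed."""
        self.unresolved[branch] = True

    def issue(self, unit):
        """Issue into a unit, recording every branch it is speculative on."""
        self.cancelled[unit] = False
        for b, pending in enumerate(self.unresolved):
            self.shadow[b][unit] = pending

    def may_write(self, unit):
        """The 'Schroedinger' condition: no unresolved branch covers this unit."""
        return (not self.cancelled[unit]
                and not any(row[unit] for row in self.shadow))

    def resolve(self, branch, predicted_correctly):
        """Branch outcome is known: release its units, or cancel them."""
        self.unresolved[branch] = False
        for unit, covered in enumerate(self.shadow[branch]):
            if covered and not predicted_correctly:
                self.cancelled[unit] = True   # result is thrown away
                for row in self.shadow:       # drop all of its shadow bits
                    row[unit] = False
            self.shadow[branch][unit] = False


sm = ShadowMatrix(n_branches=1, n_units=2)
sm.issue(0)      # issued before the branch: not speculative
sm.predict(0)
sm.issue(1)      # issued after the branch: may execute, may not write
```

Calling `sm.resolve(0, True)` releases unit 1 so that it may write;
`sm.resolve(0, False)` cancels it instead, and nothing needs rolling
back because the result was never written.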

Mitch also has a high-level diagram of an additional LOAD/STORE Matrix that
has, again, extremely simple rules: LOADs block STOREs, and
STOREs block LOADs, and the "Read / Write" signals are then passed
down to the Function Unit Dependency Matrix as well.  The blocking
rules need only establish that "there is no possibility of a conflict",
rather than "on which exact and precise address does a conflict occur".
This in turn means that the number of address bits needed to detect a
conflict may be significantly reduced.  Interestingly, the RISC-V "Fence"
instruction rules are based on the same idea.
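
A small Python sketch shows why fewer address bits are enough.  It
assumes, purely for illustration, that every access stays inside one
aligned 8-byte block and that only 8 further address bits are compared;
the function names and both parameters are made up for this example.
Comparing a handful of bits can give false alarms, but it can never
miss a real overlap, which is all the blocking rule needs.

```python
def may_conflict(addr_a, addr_b, offset_bits=3, compare_bits=8):
    """Conservative test: False guarantees the two accesses cannot overlap.

    Assumes each access stays within one aligned 2**offset_bits-byte block.
    """
    mask = (1 << compare_bits) - 1
    return ((addr_a >> offset_bits) & mask) == ((addr_b >> offset_bits) & mask)


def blockers(new_op, pending):
    """Indices of pending operations that new_op has to wait for.

    Only the rule quoted above is modelled: LOADs block STOREs and
    STOREs block LOADs.  Same-kind pairs are ignored in this sketch.
    """
    kind, addr = new_op
    return [i for i, (k, a) in enumerate(pending)
            if k != kind and may_conflict(addr, a)]


pending = [("STORE", 0x1004), ("LOAD", 0x1000), ("STORE", 0x2008)]
waits_for = blockers(("LOAD", 0x1000), pending)   # only the first STORE
```

Note that `may_conflict(0x1000, 0x1800)` is also true even though the
addresses are far apart: the compared bits happen to match.  That costs
an occasional unnecessary wait, in exchange for a much narrower
comparator.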

So this is just amazing.  Let's recap.  It's 2018, there are absolutely zero
Libre SoCs in existence anywhere on our planet of 8 billion people, and
we're looking to a 55-year-old computer design, one that occupied an
entire room and was hand-built with transistors, for inspiration on
how to make a modern, power-efficient 3D-capable processor.

Not only that: the project has accidentally unearthed incredibly valuable
historic processor design information that has eluded the Intels and
ARMs (billion-dollar companies), as well as the Academic community,
for several decades.

I'd like to take a minute to especially thank Mitch Alsup for his
time in ongoing discussions, without which there would be absolutely
no chance that I could have learned about, let alone understood,
any of the above.  As I mentioned in the very first update: new processor
designs get one shot at success.  Basing the core of the design on
a 55-year-old, well-documented, extremely compact and efficient design
is a reasonable strategy: it's just that, without Mitch's help, there
would have been no way to understand the 6600's true value.

Bottom line: we do not need to follow Intel's power-inefficient lead here.