updates/004_2018dec06_microarchitecture_cont.mdwn

   1 Firstly, many thanks to
   2 [Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
   3 for publishing a story on this project. I replied to some of the
   4 [Heise
   5 Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
   6 comments, here, endeavouring to use translation software to respect
   7 that the forum is in German.
   8
   9 In this update, following on from the analysis of the Tomasulo
  10 Algorithm, by a process of osmosis I finally was able to make out a
  11 light at the end of the "Scoreboard" tunnel, and it is not an oncoming
  12 train. Conversations with [Mitch
  13 Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
  14 are becoming clear, providing insights that, as we will find out
  15 below, have not made it into the academic literature in over 20 years.
  16
  17 In the previous update, I really did not like the
  18 [Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
  19 for doing out-of-order superscalar execution, because, *as described*,
  20 it is hopelessly inadequate. There's no roll-back method for
  21 exceptions, no method for coping with register "hazards" (Read after
  22 Write and so on), so register "renaming" has to be done as a precursor
  23 step, no way to do branch prediction, and only a single LOAD/STORE can
  24 be done at any one time.
  25
  26 All of these things have to be added, and the best way to do so is to
  27 absorb the feature known as the "Reorder Buffer" (and associated
  28 Reservation Stations), normally associated with the Tomasulo
  29 Algorithm. At which point, as noted on
  30 [comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ)
  31 there really is no functional difference between "Scoreboarding plus
  32 Reorder Buffer" and "Tomasulo Algorithm plus Reorder Buffer". Even
  33 the Tomasulo Common Data Bus is present in a functionally-orthogonal
  34 way (see later for details).
  35
  36 The only *well-known* documentation on the CDC 6600 Scoreboarding
  37 technique is the 1967 patent. Here's the kicker: the patent *does
  38 not* describe the key strategic part of Scoreboarding that makes it so
  39 powerful and much more power-efficient than the Tomasulo Algorithm
  40 when combined with Reorder Buffers: the Functional Unit's Dependency
  41 Matrices.
  42
  43 Before getting to that stage, I thought it would be a good idea to
  44 make people aware of a book that Mitch told me about, called "Design
  45 of a Computer: the Control Data 6600" by James Thornton. James worked
  46 with Seymour Cray on the *original design* of the 6600. It was
  47 literally constructed from PCB modules using hand-soldered
  48 transistors. Memory was magnetic rings (which is where we get the
  49 term "core memory" from), and the bootloader was a bank of
  50 toggle-switches. The design was absolutely revolutionary: where all
  51 other computers were managing an instruction every 11 clock cycles,
  52 the 6600 reduced that to **four**. The 7600, its successor, took that
  53 figure even lower.
  54
  55 In 2002, someone named Tom Uban sought permission from James and his
  56 wife, to make the book available online, as, historically, the CDC
  57 6600 is quite literally the precursor to modern supercomputing:
  58
  59 [[design_of_a_computer_6600_permission.jpg]]
  60
  61 So I particularly wanted to show the Dependency Matrix, which is the
  62 key strategic part of the Scoreboard:
  63
  64 [[design_of_a_computer_6600.jpg]]
  65
  66 Basically, the patent shows a table with src1 and src2, and "ready"
  67 signals: what it does *not* show is the "Go Read" and "Go Write"
  68 signals (which allowed an instruction to *begin* execution without
  69 *committing* execution - a feature that's usually believed to be
  70 exclusive to Reorder Buffers), and the patent certainly does not show
  71 the way in which one Function Unit blocks others, via the Dependency
  72 Matrix.
  73
  74 It is well-known that the Tomasulo Reorder Buffer requires a CAM on
  75 the Destination Register, (which is power-hungry and expensive). This
  76 is described in academic literature as data coming "to". The
  77 Scoreboard technique is described as data coming "from" source
  78 registers, however because the Dependency Matrix is left out of these
  79 discussions (not being part of the patent), what they fail to mention
  80 is that there are *multiple single-line* source wires, thus achieving
  81 the exact same purpose as the Reorder Buffer's CAM, with *far less
  82 power and die area*.
  83
  84 Mitch's description of this on comp.arch was that the Dependency
  85 Matrix columns effectively may be viewed as a single-bit-wide "CAM",
  86 which of course is far less hardware, being just AND gates. However
  87 it wasn't until he very kindly sent me the chapters of his unpublished
  88 book on the 6600 that the significance of what he was saying actually
  89 sank in, namely that instead of a merged multi-wire very expensive
  90 "Destination Register" CAM, copying the *value* of the dependent src
  91 register into the Reorder Buffer (and then having to match it up
  92 afterwards. on every clock cycle), the Dependency Matrix breaks this
  93 down into multiple really really simple single wire comparators that
  94 *preserve* a **direct** link between the src register(s) and the
  95 destination(s) where they're needed. Consequently, the Scoreboard and
  96 Dependency Matrix logic gates take up far less space, and use
  97 significantly less power.
  98
  99 Not only that: it is quite easy to add incremental register-renaming
 100 tags on top of the Scoreboard + Dependency Matrix, again, no need for
 101 a CAM. Not only that: Mitch describes in the second unpublished book
 102 chapter several techniques that each bring in all of the techniques
 103 that are usually exclusively associated with Reorder Buffers, such as
 104 Branch Prediction, speculative execution, precise exceptions and
 105 multi-issue LOAD / STORE hazard avoidance. This diagram below is
 106 reproduced with Mitch's permission:
 107
 108 [[mitch_ld_st_augmentation.jpg]]
 109
 110 This high-level diagram includes some subtle modifications that
 111 augment a standard CDC 6600 design to allow speculative execution. A
 112 "Schroedinger" wire is added ("neither alive nor dead"), which, very
 113 simply put, prohibits Function Unit "Write" of results (mentioned
 114 earlier as a pre-existing under-recognised key part of the 6600
 115 design). In this way, because the "Read" signals were independent of
 116 "Write" (something that is again completely missing from the academic
 117 literature in discussions of 6600 Scoreboards), the instruction may
 118 *begin* execution, but is prevented from *completing* execution.
 119
 120 All that is required to gain speculative execution on branches is to
 121 add one extra line to the Dependency Matrix per "branch" that is to be
 122 speculatively executed. The "Branch Speculation" Unit is just like
 123 any other Functional Unit, in effect. In this way, we gain *exactly*
 124 the same capability as a Reorder Buffer, including all of the
 125 benefits. The same trick will work just as well for Exceptions.
 126
 127 Mitch also has a high-level diagram of an additional LOAD/STORE Matrix
 128 that has, again, extremely simple rules: LOADs block STOREs, and
 129 STOREs block LOADs, and the signals "Read / Write" are then passed
 130 down to the Function Unit Dependency Matrix as well. The rules for
 131 the blocking need only be based on "there is no possibility of a
 132 conflict" rather than "on which exact and precise address does a
 133 conflict occur". This in turn means that the number of address bits
 134 needed to detect a conflict may be significantly reduced, i.e. only
 135 the top bits are needed.
 136
 137 Interestingly, RISC-V "Fence" instruction rules are based on the same
 138 idea, and it may turn out to be possible to leverage the L1 Cache Line
 139 numbers instead of the (full) address.
 140
 141 Also, thanks to Mitch's help, his unpublished book chapters help to
 142 identify and make clear that the CDC 6600's register file is designed
 143 with "write-through" capability, i.e. that a register that's written
 144 will go through *on the same clock cycle* to a "read" request. This
 145 makes the 6600's register file pretty much synonymous with the
 146 Tomasulo Algorithm "Common Data Bus". This same-cycle feature *also
 147 provides operand forwarding for free*!
 148
 149 So this is just amazing. Let's recap. It's 2018, there's absolutely
 150 zero Libre SoCs in existence anywhere on our planet of 8 billion
 151 people, and we're looking for inspiration at literally a 55-year-old
 152 computer design that occupied an entire room and was hand-built with
 153 transistors, on how to make a modern, power-efficient 3D-capable
 154 processor.
 155
 156 Not only that: the project has accidentally unearthed incredibly
 157 valuable historic processor design information that has eluded the
 158 Intels and ARMs - billion-dollar companies - as well as the Academic
 159 community - for several decades.
 160
 161 I'd like to take a minute to especially thank Mitch Alsup for his time
 162 in ongoing discussions, without which there would be absolutely no
 163 chance that I could possibly have learned about, let alone understood,
 164 any of the above. As I mentioned in the very first update: new
 165 processor designs get one shot at success. Basing the core of the
 166 design on a 55-year-old well-documented and extremely compact and
 167 efficient design is a reasonable strategy: it's just that, without
 168 Mitch's help, there would have been no way to understand the 6600's
 169 true value.
 170
 171 Bottom line is, we have a way forward that will result in
 172 significantly less hardware, a simpler design, using a lot less power
 173 than modern designs today, yet providing all of the features normally
 174 the exclusive domain of top-end processors. Thanks to a refresh of a
 175 55-year-old processor and the willingness of Mitch Alsup and James
 176 Thornton to share their expertise with the world.