51e02b534e8ccbd48fde072f193554fcd286df3b
[crowdsupply.git] / updates / 004_2018dec06_microarchitecture_cont.mdwn
1 # Modernising 1960s Computer Technology: what can be learned from the CDC 6600
2
3 Firstly, many thanks to
4 [Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
5 for publishing a story on this project. I replied to some of the
6 [Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
7 comments, here, endeavouring to use translation software to respect that
8 the forum is in German.
9
10 In this update, following on from the analysis of the Tomasulo Algorithm,
11 by a process of osmosis I finally was able to make out a light at the
12 end of the "Scoreboard" tunnel, and it is not an oncoming train.
13 Conversations with
14 [Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
15 are becoming clear, providing insights that, as we will find out below,
16 have not made it into the academic literature in over 20 years.
17
18 In the previous update, I really did not like the
19 [Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
20 for doing out-of-order superscalar execution, because, *as described*,
21 it is hopelessly inadequate. There's no roll-back method for
22 exceptions, no method for coping with register "hazards" (Read after Write
23 and so on), so register "renaming" has to be done as a precursor step,
24 no way to do branch prediction, and only a single LOAD/STORE can be
25 done at any one time.
26
27 All of these things have to be added, and the best way to do so is
28 to absorb the feature known as the "Reorder Buffer" (and associated
29 Reservation Stations), normally associated with the Tomasulo Algorithm.
30 At which point, as noted on
31 [comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ)
32 there really is no functional difference between "Scoreboarding plus
33 Reorder Buffer" and "Tomasulo Algorithm plus Reorder Buffer". Even
34 the Tomasulo Common Data Bus is present in a functionally-orthogonal
35 way (see later for details).
36
37 The only *well-known* documentation on the CDC 6600 Scoreboarding technique
38 is the 1967 patent. Here's the kicker: the patent *does not* describe
39 the key strategic part of Scoreboarding that makes it so powerful and
40 much more power-efficient than the Tomasulo Algorithm when combined
41 with Reorder Buffers: the Functional Unit's Dependency Matrices.
42
43 Before getting to that stage, I thought it would be a good idea to
44 make people aware of a book that Mitch told me about, called
45 "Design of a Computer: the Control Data 6600" by James Thornton.
46 James worked with Seymour Cray on the *original design* of the 6600.
47 It was literally constructed from PCB modules using hand-soldered
48 transistors. Memory was magnetic rings (which is where we get the term
49 "core memory" from), and the bootloader was a bank of toggle-switches.
50 The design was absolutely revolutionary: where all other computers
51 were managing an instruction every 11 clock cycles, the 6600 reduced
52 that to **four**. The 7600, its successor, took that figure even lower.
53
54 In 2002, someone named Tom Uban sought permission from James and his
55 wife, to make the book available online, as, historically, the
56 CDC 6600 is quite literally the precursor to modern supercomputing:
57
58 [[design_of_a_computer_6600_permission.jpg]]
59
60 So I particularly wanted to show the Dependency Matrix, which is the
61 key strategic part of the Scoreboard:
62
63 [[design_of_a_computer_6600.jpg]]
64
65 Basically, the patent shows a table with src1 and src2, and "ready"
66 signals: what it does *not* show is the "Go Read" and "Go Write"
67 signals (which allowed an instruction to *begin* execution without
68 *committing* execution - a feature that's usually believed to be
69 exclusive to Reorder Buffers), and the patent certainly does not show the
70 way in which one Function Unit blocks others, via the Dependency Matrix.
71
72 It is well-known that the Tomasulo Reorder Buffer requires a CAM
73 on the Destination Register, (which is power-hungry and expensive).
74 This is described in academic literature as data coming "to". The
75 Scoreboard technique is described as data coming "from" source
76 registers, however because the Dependency Matrix is left out of
77 these discussions (not being part of the patent), what they fail to
78 mention is that there are *multiple single-line* source wires, thus
79 achieving the exact same purpose as the Reorder Buffer's CAM, with *far
80 less power and die area*.
81
82 Mitch's description of this on comp.arch was that the Dependency Matrix columns
83 effectively may be viewed as a single-bit-wide "CAM", which of course
84 is far less hardware, being just AND gates. However it wasn't until
85 he very kindly sent me the chapters of his unpublished book on the 6600
86 that the significance of what he was saying actually sank in, namely that
87 instead of a merged multi-wire very expensive "Destination Register" CAM,
88 copying the *value* of the dependent src register into the Reorder Buffer
89 (and then having to match it up afterwards. on every clock cycle),
90 the Dependency Matrix breaks this down into multiple really really simple
91 single wire comparators that *preserve* a **direct** link between the
92 src register(s) and the destination(s) where they're needed. Consequently,
93 the Scoreboard and Dependency Matrix logic gates take up far less space,
94 and use significantly less power.
95
96 Not only that: it is quite easy to add incremental register-renaming
97 tags on top of the Scoreboard + Dependency Matrix, again, no need
98 for a CAM. Not only that: Mitch describes in the second unpublished book
99 chapter several techniques that each bring in all of the techniques
100 that are usually exclusively associated with Reorder Buffers,
101 such as Branch Prediction, speculative execution, precise exceptions
102 and multi-issue LOAD / STORE hazard avoidance. This diagram below
103 is reproduced with Mitch's permission:
104
105 [[mitch_ld_st_augmentation.jpg]]
106
107 This high-level diagram includes some subtle modifications that
108 augment a standard CDC 6600 design to allow speculative execution.
109 A "Schroedinger" wire is added ("neither alive nor dead"), which,
110 very simply put, prohibits Function Unit "Write" of results (mentioned
111 earlier as a pre-existing under-recognised key part of the 6600 design). In
112 this way, because the "Read" signals were independent of "Write"
113 (something that is again completely missing from the academic
114 literature in discussions of 6600 Scoreboards), the instruction
115 may *begin* execution, but is prevented from *completing*
116 execution.
117
118 All that is required to gain speculative execution on branches is to
119 add one extra line to the Dependency Matrix per "branch" that is to be
120 speculatively executed. The "Branch Speculation" Unit is just like any
121 other Functional Unit, in effect. In this way, we gain *exactly* the
122 same capability as a Reorder Buffer, including all of the benefits.
123 The same trick will work just as well for Exceptions.
124
125 Mitch also has a high-level diagram of an additional LOAD/STORE Matrix that
126 has, again, extremely simple rules: LOADs block STOREs, and
127 STOREs block LOADs, and the signals "Read / Write" are then passed
128 down to the Function Unit Dependency Matrix as well. The rules for
129 the blocking need only be based on "there is no possibility of a conflict"
130 rather than "on which exact and precise address does a conflict occur".
131 This in turn means that the number of address bits needed to detect a
132 conflict may be significantly reduced, i.e. only the top bits are
133 needed.
134
135 Interestingly, RISC-V "Fence" instruction rules are based on the same idea,
136 and it may turn out to be possible to leverage the L1 Cache Line numbers
137 instead of the (full) address.
138
139 Also, thanks to Mitch's help, his unpublished book chapters help
140 to identify and make clear that the CDC 6600's register file is designed with
141 "write-through" capability, i.e. that a register that's written will
142 go through *on the same clock cycle* to a "read" request. This makes
143 the 6600's register file pretty much synonymous with the Tomasulo
144 Algorithm "Common Data Bus". This same-cycle feature *also provides
145 operand forwarding for free*!
146
147 So this is just amazing. Let's recap. It's 2018, there's absolutely zero
148 Libre SoCs in existence anywhere on our planet of 8 billion people, and
149 we're looking for inspiration at literally a 55-year-old computer design
150 that occupied an entire room and was hand-built with transistors,
151 on how to make a modern, power-efficient 3D-capable processor.
152
153 Not only that: the project has accidentally unearthed incredibly valuable
154 historic processor design information that has eluded the Intels and
155 ARMs - billion-dollar companies - as well as the Academic community -
156 for several decades.
157
158 I'd like to take a minute to especially thank Mitch Alsup for his
159 time in ongoing discussions, without which there would be absolutely
160 no chance that I could possibly have learned about, let alone understood,
161 any of the above. As I mentioned in the very first update: new processor
162 designs get one shot at success. Basing the core of the design on
163 a 55-year-old well-documented and extremely compact and efficient design
164 is a reasonable strategy: it's just that, without Mitch's help, there
165 would have been no way to understand the 6600's true value.
166
167 Bottom line is, we have a way forward that will result in significantly
168 less hardware, a simpler design, using a lot less power than modern
169 designs today, yet providing all of the features normally the exclusive
170 domain of top-end processors. Thanks to a refresh of a 55-year-old
171 processor and the willingness of Mitch Alsup and James Thornton to share
172 their expertise with the world.
173