# Modernising 1960s Computer Technology: what can be learned from the CDC 6600

Firstly, many thanks to
[Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
for publishing a story on this project. I replied to some of the
[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
comments, using translation software out of respect for the fact that
the forum is in German.

In this update, following on from the analysis of the Tomasulo Algorithm,
by a process of osmosis I was finally able to make out a light at the
end of the "Scoreboard" tunnel, and it is not an oncoming train.
Conversations with
[Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
are starting to make sense.

In the previous update, I really did not like the
[Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
for doing out-of-order superscalar execution, because, *as described*,
it is hopelessly inadequate. There is no roll-back method for
exceptions, no method for coping with register "hazards" (Read after Write
and so on), which means that register "renaming" has to be done as a
precursor step, no way to do branch prediction, and only a single
LOAD/STORE can be done at any one time.

The only *well-known* documentation on the CDC 6600 Scoreboarding technique
is the 1967 patent. Here's the kicker: the patent *does not* describe
the key strategic part of Scoreboarding, the part that makes it so powerful
and much more power-efficient than the Tomasulo Algorithm combined
with Reorder Buffers: the Dependency Matrices.

Before getting to that stage, I thought it would be a good idea to
make people aware of a book that Mitch told me about, called
"Design of a Computer: the Control Data 6600" by James Thornton.
James worked with Seymour Cray on the 6600. The machine was literally
constructed from PCB modules using hand-soldered transistors.
Memory was magnetic rings (which is where we get the term "core memory"
from), and the bootloader was a bank of toggle-switches.

In 2002, someone named Tom Uban sought permission from James and his
wife to make the book available online, as, historically, the
CDC 6600 is quite literally the precursor to modern supercomputing:

[[design_of_a_computer_6600_permission.jpg]]

So I particularly wanted to show the Dependency Matrix, which is the
key strategic part of the Scoreboard:

[[design_of_a_computer_6600.jpg]]

Basically, the patent shows a table with src1 and src2, and "ready"
signals. What it does *not* show is the "Go Read" and "Go Write"
signals, and it does not show the way in which one Function Unit
blocks others, via the Dependency Matrix.

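To make the "one Function Unit blocks others" idea concrete, here is a minimal software sketch in Python. It is not taken from the patent, the Thornton book or Mitch's material: the class and method names (`Unit`, `DependencyMatrix`, `go_read`, `go_write`) are made up for illustration, the matrix is modelled as per-unit sets rather than wires, and write-after-write ordering is deliberately left out. Each issued instruction records which earlier units it must wait for before reading (an earlier unit has yet to write one of its sources) and before writing (an earlier unit has yet to read the register it is about to overwrite).

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    dest: int            # destination register number
    srcs: tuple          # source register numbers
    has_read: bool = False
    has_written: bool = False


class DependencyMatrix:
    """Toy Function-Unit-to-Function-Unit dependency tracker (illustrative)."""

    def __init__(self):
        self.units = []        # units in issue order
        self.read_waits = []   # read_waits[i]: earlier units i waits on to read
        self.write_waits = []  # write_waits[i]: earlier units i waits on to write

    def issue(self, name, dest, srcs):
        new = Unit(name, dest, tuple(srcs))
        # Read-after-write: an earlier unit still has to write one of our sources.
        self.read_waits.append({j for j, u in enumerate(self.units)
                                if not u.has_written and u.dest in new.srcs})
        # Write-after-read: an earlier unit still has to read our destination.
        self.write_waits.append({j for j, u in enumerate(self.units)
                                 if not u.has_read and new.dest in u.srcs})
        self.units.append(new)
        return len(self.units) - 1

    def go_read(self, i):
        return not self.units[i].has_read and not self.read_waits[i]

    def go_write(self, i):
        u = self.units[i]
        return u.has_read and not u.has_written and not self.write_waits[i]

    def read(self, i):
        assert self.go_read(i)
        self.units[i].has_read = True
        for waits in self.write_waits:   # release anyone waiting to overwrite
            waits.discard(i)

    def write(self, i):
        assert self.go_write(i)
        self.units[i].has_written = True
        for waits in self.read_waits:    # release anyone waiting for our result
            waits.discard(i)


m = DependencyMatrix()
mul = m.issue("MUL", dest=3, srcs=(1, 2))
add = m.issue("ADD", dest=4, srcs=(3, 5))  # reads r3: blocked until MUL writes
sub = m.issue("SUB", dest=5, srcs=(6, 7))  # writes r5: blocked until ADD reads
```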
It is well-known that the Tomasulo Reorder Buffer requires a CAM
(Content-Addressable Memory) on the Destination Register, which is
power-hungry and expensive. This is described in academic literature
as data coming "to". The Scoreboard technique is described as data
coming "from" source registers. However, because the Dependency Matrix
is left out of these discussions, what they fail to mention is that
there are *multiple single-line* source wires, thus achieving the exact
same purpose as the Reorder Buffer's CAM, with *far less power
and die area*.

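One way to picture the difference, as a rough sketch in Python rather than a description of any real circuit: a CAM compares a broadcast binary register number against every stored entry, whereas a matrix row that keeps one bit per register only has to test a single wire. The one-hot encoding used here, and the names `cam_match`, `one_hot` and `matrix_match`, are assumptions made for illustration only; the sketch says nothing about actual power or area figures.

```python
NUM_REGS = 8  # illustrative register file size


def cam_match(entries, tag):
    """CAM-style search: compare the full binary tag against every entry."""
    return [stored == tag for stored in entries]


def one_hot(reg):
    """Encode a register number as a single set bit (one wire per register)."""
    assert 0 <= reg < NUM_REGS
    return 1 << reg


def matrix_match(rows, reg):
    """Matrix-style search: each row tests one bit, with no multi-bit compare."""
    line = one_hot(reg)
    return [bool(row & line) for row in rows]


# Register that each of four Function Units is waiting on.
waiting_on = [3, 5, 3, 1]
rows = [one_hot(r) for r in waiting_on]

# Both approaches identify the same units as waiting on register 3.
assert cam_match(waiting_on, 3) == matrix_match(rows, 3)
```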
Not only that: it is quite easy to add incremental register-renaming
tags on top of the Scoreboard + Dependency Matrix, again with no need
for a CAM. On top of that, Mitch describes, in an unpublished book
chapter, several augmentations that bring in the features
that are usually associated exclusively with Reorder Buffers,
such as Branch Prediction, speculative execution, precise exceptions
and multi-issue LOAD / STORE hazard avoidance. The diagram below
is reproduced with Mitch's permission:

[[mitch_ld_st_augmentation.jpg]]

This high-level diagram includes some subtle modifications that
augment a standard CDC 6600 design to allow speculative execution.
A "Schroedinger" wire is added ("neither alive nor dead"), which,
very simply put, prohibits Function Unit "Write" of results. In
this way, because the "Read" signals are independent of "Write"
(something that is again completely missing from the academic
literature in discussions of 6600 Scoreboards), the instruction
may *begin* execution, but is prevented from *completing*
execution.

All that is required is to add one extra line to the Dependency
Matrix per "branch" that is to be speculatively executed: in effect,
each such branch is treated just like any other Function Unit.

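A minimal Python sketch of that idea follows. It is a behavioural toy, not Mitch's design: the names `ShadowedUnit`, `ShadowMatrix` and `shadows` are invented here, the "Schroedinger" wire is modelled simply as "the set of unresolved branches this unit was issued under is not empty", and cancelling the unit on a misprediction is an assumption about what happens next. The point it illustrates is that a speculative instruction may execute, but cannot write until every branch ahead of it has resolved.

```python
class ShadowedUnit:
    """A Function Unit whose write can be held back by unresolved branches."""

    def __init__(self, name, shadows):
        self.name = name
        self.shadows = set(shadows)  # unresolved branches we were issued under
        self.result_ready = False
        self.cancelled = False

    def execute(self):
        # Execution may begin regardless of any shadow.
        if not self.cancelled:
            self.result_ready = True

    def go_write(self):
        # The "Schroedinger" condition: no write while any shadow remains.
        return self.result_ready and not self.shadows and not self.cancelled


class ShadowMatrix:
    """Tracks one entry per speculated branch (illustrative only)."""

    def __init__(self):
        self.open_branches = set()
        self.units = []

    def issue_branch(self, branch_id):
        self.open_branches.add(branch_id)

    def issue(self, name):
        unit = ShadowedUnit(name, self.open_branches)
        self.units.append(unit)
        return unit

    def resolve(self, branch_id, predicted_correctly):
        self.open_branches.discard(branch_id)
        for unit in self.units:
            if branch_id in unit.shadows:
                if predicted_correctly:
                    unit.shadows.discard(branch_id)  # shadow lifted
                else:
                    unit.cancelled = True  # nothing was written: nothing to undo


matrix = ShadowMatrix()
older = matrix.issue("ADD")        # issued before the branch: not speculative
matrix.issue_branch("b0")
speculative = matrix.issue("MUL")  # issued under the shadow of b0
older.execute()
speculative.execute()
```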
Mitch also has a high-level diagram of an additional LOAD/STORE Matrix that
has, again, extremely simple rules: LOADs block STOREs, and
STOREs block LOADs, and the "Read / Write" signals are then passed
down to the Function Unit Dependency Matrix as well. The rules for
the blocking need only be based on "there is no possibility of a conflict"
rather than "on which exact and precise address does a conflict occur".
This in turn means that the number of address bits needed to detect a
conflict may be significantly reduced. Interestingly, the RISC-V "Fence"
instruction rules are based on the same idea.

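The "no possibility of a conflict" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual LOAD/STORE Matrix: addresses are treated as aligned word indices with every access the same width, only a hypothetical `ADDR_BITS_COMPARED` low bits are compared, and STORE-against-STORE ordering is left out because the text above only covers LOADs against STOREs. If the compared bits differ, the two accesses cannot touch the same word; if they match, the accesses *might* conflict, so the newer one is conservatively blocked.

```python
ADDR_BITS_COMPARED = 4  # hypothetical: number of low address bits compared


def may_conflict(addr_a, addr_b, bits=ADDR_BITS_COMPARED):
    """True unless the compared bits prove the two addresses differ."""
    mask = (1 << bits) - 1
    return (addr_a & mask) == (addr_b & mask)


def blocked_by(new_op, new_addr, pending):
    """Return the earlier, still-pending operations that block the new one.

    LOADs block STOREs and STOREs block LOADs; same-kind operations are
    not considered here.  `pending` is a list of (op, addr) tuples.
    """
    return [(op, addr) for op, addr in pending
            if op != new_op and may_conflict(new_addr, addr)]


pending = [("STORE", 0x1234), ("LOAD", 0x2238)]

# Different full address, same low bits as the STORE: blocked anyway.
# A false positive costs some performance but never correctness.
print(blocked_by("LOAD", 0x5554, pending))

# Low bits differ from the pending STORE: provably no conflict.
print(blocked_by("LOAD", 0x1235, pending))
```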
So this is just amazing. Let's recap. It's 2018, there are absolutely zero
Libre SoCs in existence anywhere on our planet of 8 billion people, and
we're looking to a literally 55-year-old computer design,
one that occupied an entire room and was hand-built with transistors,
for inspiration on how to make a modern, power-efficient 3D-capable processor.

Not only that: the project has accidentally unearthed incredibly valuable
historic processor design information that has eluded the Intels and
ARMs - billion-dollar companies - as well as the Academic community -
for several decades.

I'd like to take a minute to especially thank Mitch Alsup for his
time in ongoing discussions, without which there would be absolutely
no chance that I could possibly have learned about, let alone understood,
any of the above. As I mentioned in the very first update: new processor
designs get one shot at success. Basing the core of the design on
a 55-year-old well-documented and extremely compact and efficient design
is a reasonable strategy: it's just that, without Mitch's help, there
would have been no way to understand the 6600's true value.

Bottom line: we do not need to follow Intel's power-inefficient lead, here.