Replaced shadow.jpg with svg one.
[libreriscv.git] / 3d_gpu / architecture / 6600scoreboard.mdwn
1 # Augmented 6600-style Scoreboards
2
3 Images reproduced with kind permission from Mitch Alsup. This page
4 describes *augmentations* made to 6600-style scoreboards for full,
5 precise, register-renaming (more accurately "nameless" register) capability
6 directly equivalent to the Tomasulo algorithm. For
7 an explanation of the full functional-equivalence,
8 see [[tomasulo_transformation]].
9
10 # Notes and insights on Scoreboard design
11
12 btw one thing that's not obvious - at all - about scoreboards is: there's nothing that seems to "control" how instructions "know" to read, write, or complete.  it's very... weird.  i'll probably put this on the discussion page.
13
14 the reason i feel that the weirdness exists is for a few reasons:
15
16 * firstly, the Matrices create a Directed Acyclic Graph, using single-bit
17 SR-Latches.  for a software engineer, being able to express a DAG using
18 a matrix is itself.. .weird :)
19 * secondly: those Matrices preserve time *order* (instruction
20 dependent order actually), they are not themselves dependent *on* time
21 itself.  this is especially weird if one is used to an in-order system,
22 which is very much critically dependent on "time" and on strict observance
23 of how long results are going to take to get through a pipeline.  we
24 could do the entire design based around low-gate-count FSMs and it would
25 still be absolutely fine.
26 * thirdly, it's the *absence* of blocks that allows a unit to
27 proceed.  unlike an in-order system, there's nothing saying "you go now,
28 you go now": it's the opposite.  the unit is told instead, "here's the
29 resources you need to WAIT for: go when those resources are available".
30 * fourth (clarifying 3): it's reads that block writes, and writes
31 that block reads.  although obvious when thought through from first
32 principles, it can get particularly confusing that it is the *absence*
33 of read hazards that allow writes to proceed, and the *absence* of write
34 hazards that allow reads to proceed.
35 * fifth: the ComputationUnits still need to "manage" the input and output
36 of those resources to actual pipelines (or FSMs).
37 - (a) the CUs are *not* permitted to blithely say, if there is an
38 expected output that also needs managing "ok i got the inputs, now throw
39 them at the pipeline, i'm done".  they *must* wait for that result.  of
40 course if there is no result to wait for, they're permitted to indicate
41 "done" without waiting (this actually happens in the case of STORE).
42 - (b) there's an apparent disconnect between "fetching of registers"
43 and "Computational Unit progress".  surely, one feels, there should
44 be something that, again, "orders the CU to proceed in a set, orderly
45 progressive fashion?".  instead, because the progress is from the
46 *absence* of hazards, the CU's FSMs likewise make forward progress from
47 the "acknowledgement" of each blockage being dropped.
48 * sixth: one of the incredible but puzzling things is that register
49 renaming is *automatically* built-in to the design.  the Function Unit's
50 input and output latches are effectively "nameless" registers.
51 - (a) the more Function Units you have, the more nameless registers
52 exist.  the more nameless registers exist, the further ahead that
53 in-flight execution can progress, speculatively.
54 - (b) whilst the Function Units are devoid of register "name"
55 information, the FU-Regs Dependency Matrix is *not* devoid of that
56 information, having latched the read/write register numbers in an unary
57 form, as a "row", one bit in each row representing which register(s)
58 the instruction originally contained.
59 - (c) by virtue of the direct Operand Port connectivity between the FU
60 and its corresponding FU-Regs DM "row", the Function Unit requesting for
61 example Operand1 results in the FU-Regs DM *row* triggering a register
62 file read-enable line, *NOT* the Function Unit itself.
63 * seventh: the PriorityPickers manage resource contention between the FUs
64 and the row-information from the FU-Regs Matrix.  the port bandwidth
65 by nature has to be limited (we cannot have 200 read/write ports on
66 the regfile).  therefore the connection between the FU and the FU-Regs
67 "row" in which the actual reg numbers is stored (in unary) is even *less*
68 direct than it is in an in-order system.
69
70 ultimately then, there is:
71
72 * an FU-Regs Matrix that, on a per-row basis, captures the instruction's
73 register numbering (in unary, one SR-Latch raised per register per row)
74 on a per-operand basis
75 * an FU-FU Matrix that preserves, as a Directed Acyclic Graph (DAG),
76 the instruction order.  again, this is a bit-based system (SR Latches)
77 that record which *read port* of the Function Unit needs a write result
78 (when available), and which write port needs a *read* result.
79 * a suite of Function Units with input *and* output latches where the
80 register information is *removed* (that being back in the FU-Regs row
81 associated with a given FU)
82 * a PriorityPicker system that acknowledges the desire for access to the
83 register file, and, due to the regfile ports being a contended resource,
84 *only* permits one and only one FunctionUnit at a time to gain access to
85 that regfile port.  where the FunctionUnit knows the Operand number it
86 requires the input (or output) to come from (or to), it is the FU-Regs
87 *row* that knows, on a per-operand-number basis, what the actual register
88 file number is.
89
90 # Example allocation of Function Units
91
92 This is the key diagram showing the relationship between Function Units
93 and the Register File(s), covering "read". A similar diagram exists
94 for regfile "write". Observe also that there are *two* Increment
95 Function Units.
96
97 The Dependency Matrices manage the DAG of read-write relationships:
98 each Function Unit indicates which registers it requires for read
99 and which it needs permission to write to.
100
101 NOTE: **AT NO TIME** is **any** Function Unit permitted "direct" access to
102 global resources by way of any form of "unmanaged" path. Each Function
103 Unit may **only** receive incoming data and may **only** pass that data
104 out via the set determined path, as controlled by the Dependency Matrices.
105
106 Thus, the inputs for a given FU absolutely have to cover all resources
107 that the ALU will need. In the case of POWER9, for the Integer Function
108 Units this is not just the Integer operands, it's the *Condition* operands
109 (CR0, XER carry bits etc.) that need to be inputs (and outputs) as well.
110 In the case of the Branch Function Unit, the input operands (and outputs)
111 will likewise be not just the Integer operands, but CTR, LR etc. as well.
112
113 In the arrangement below (from the CDC 6600), it can be observed that
114 there are actually separate Register Files (A, B and X). Also observe
115 that whilst X goes to *all* Function Units as input, B only goes to
116 Branch, Increment and LongAdd, where A goes to Branch and the two Increment
117 Function Units. Thus, A was 3R3W, B was 4R3W, and X was 5R4W.
118
119 An augmentation of this arrangement, for a modern processor using pipelines,
120 is, rather than have separate FSMs and doubled-up (or greater) Function Units
121 is to "double up" (or triple, or quadruple etc.) the number of Function
122 Units, but to *share the same pipeline*. See "Concurrent Computation Unit"
123 section below for details.
124
125 [[!img multiple_function_units.png size="600x"]]
126
127 # Modifications needed to Computation Unit and Group Picker
128
129 The scoreboard uses two big NOR gates respectively to determine when there
130 are no read/write hazards. These two NOR gates are permanently active
131 (per Function Unit) even if the Function Unit is idle.
132
133 In the case of the Write path, these "permanently-on" signals are gated
134 by a Write-Release-Request signal that would otherwise leave the Priority
135 Picker permanently selecting one of the Function Units (the highest priority).
136 However the same thing has to be done for the read path, as well.
137
138 Below are the modifications required to add a read-release path that
139 will prevent a Function Unit from requesting a GoRead signal when it
140 has no need to read registers. Note that once both the Busy and GoRead
141 signals combined are dropped, the ReadRelease is dropped.
142
143 Note that this is a loop: GoRead (ANDed with Busy) goes through
144 to the priority picker, which generates GoRead, so it is critical
145 (in a modern design) to use a clock-sync'd latch in this path.
146 The original 6600 used rising-edge and falling-edge of the clock
147 to avoid this issue.
148
149 [[!img comp_unit_req_rel.jpg]]
150 [[!img group_pick_rd_rel.svg]]
151
152 [[!img priority_picker_16_yosys.png size="400x"]]
153
154 Source:
155
156 * [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
157 * [ALU Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu.py;h=f7b5e411a739e770777ceb71d7bd09fe4e70e8c0;hb=b08dee1c3e8cf0d635820693fe50cd0518caeed2)
158
159 # Concurrent Computational Unit
160
161 With the original 6600 having only a 2-stage pipelined FPU (which took
162 many years to notice from examining the now-archaic notation from James
163 Thornton's book, "Design of a Computer"), there is no actual use of this
164 pipeline capability at the front-end Function Unit. Instead it is
165 treated effectively as a Finite State Machine, only one result to be
166 computed at a time.
167
168 Mitch Alsup recommends, when using pipelines, to allow multiple
169 Function Unit "front-ends", each one having inputs that were pushed
170 into a particular stage of the pipeline, and, therefore, those multiple
171 Function Units also track and store the result as it comes out.
172
173 The trick then is to have a method that ensures that FU front-end #1
174 can get result #1 when it pops out the end of the (serial) pipeline.
175 Mitch recommends using timing chains, here.
176
177 Note in this diagram that there are *multiple* ISSUE, GO\_READ and GO\_WRITE
178 signals. These link up to the Function Unit's ISSUE, GO\_RD and GO\_WR,
179 where the latches are, that will (on an available slot) feed the pipeline
180 with incoming data.
181
182 [[!img concurrent_comp_unit.png size="600x"]]
183
184 The actual design being used is slightly different, in the following
185 ways:
186
187 * Due to micro-coding and thus external contention, the pipelines
188 have a ready/valid signalling arrangement that can result in
189 a stall cascading back down the pipeline. Thus a timing chain
190 is not appropriate.
191 * A decision was therefore made to pass a "context" alongside the
192 operands, which is the "Function Unit Index". It is *this* information
193 that is used to "reassociate" the result with the correct FU, when
194 the result is produced.
195 * With "Shadow cancellation" being in effect, *additional* global
196 context is passed (combinatorially) to every single stage of the
197 pipeline, as an unary bitmask. If any Function Unit's "GO_DIE"
198 signal is asserted, the corresponding bit in the unary mask is
199 asserted, terminating effective immediate the intermediary data
200 anywhere in the pipeline from progressing further, thus saving power.
201
202
203 # Multi-in cascading Priority Picker
204
205 Using the Group Picker as a fundamental unit, a cascading chain is created,
206 with each output "masking" an output from being selected in all down-chain
207 Pickers. Whilst the input is a single unary array of bits, the output is
208 *multiple* unary arrays where only one bit in each is set.
209
210 This can be used for "port selection", for example when there are multiple
211 Register File ports or multiple LOAD/STORE cache "ways", and there are many
212 more devices seeking access to those "ports" than there are actual ports.
213 (If the number of devices seeking access to ports were equal to the number
214 of ports, each device could be allocated its own dedicated port).
215
216 Click on image to see full-sized version:
217
218 [[!img multi_priority_picker.png size="800x"]]
219
220 Links:
221
222 * [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
223 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005204.html>
224
225 # Modifications to Dependency Cell
226
227 Note: this version still requires CLK to operate on a HI-LO cycle.
228 Further modifications are needed to create an ISSUE-GORD-PAUSE ISSUE-GORD-PAUSE
229 sequence. For now however it is easier to stick with the original
230 diagrams produced by Mitch Alsup.
231
232 The dependency cell is responsible for recording that a Function Unit
233 requires the use of a dest or src register, which is given in UNARY.
234 It is also responsible for "defending" that unary register bit for
235 read and write hazards, and for also, on request (GoRead/GoWrite)
236 generating a "Register File Select" signal.
237
238 The sequence of operations for determining hazards is as follows:
239
240 * Issue goes HI when CLK is HI. If any of Dest / Oper1 / Oper2 are also HI,
241 the relevant SRLatch will go HI to indicate that this Function Unit requires
242 the use of this dest/src register
243 * Bear in mind that this cell works in conjunction with the FU-FU cells
244 * Issue is LOW when CLK is HI. This is where the "defending" comes into
245 play. There will be *another* Function Unit somewhere that has had
246 its Issue line raised. This cell needs to know if there is a conflict
247 (Read Hazard or Write Hazard).
248 * Therefore, *this* cell must, if either of the Oper1/Oper2 signals are
249 HI, output a "Read after Write" (RaW) hazard if its Dest Latch (Dest-Q) is HI.
250 This is the *Read_Pending* signal.
251 * Likewise, if either of the two SRC Latches (Oper1-Q or Oper2-Q) are HI,
252 this cell must output a "Write after Read" (WaR) hazard if the (other)
253 instruction has raised the unary Dest line.
254
255 The sequence for determining register select is as follows:
256
257 * After the Issue+CLK-HI has resulted in the relevant (unary) latches for
258 dest and src (unary) latches being set, at some point a GoRead (or GoWrite)
259 signal needs to be asserted
260 * The GoRead (or GoWrite) is asserted when *CLK is LOW*. The AND gate
261 on Reset ensures that the SRLatch *remains ENABLED*.
262 * This gives an opportunity for the Latch Q to be ANDed with the GoRead
263 (or GoWrite), raising an indicator flag that the register is being
264 "selected" by this Function Unit.
265 * The "select" outputs from the entire column (all Function Units for this
266 unary Register) are ORed together. Given that only one GoRead (or GoWrite)
267 is guaranteed to be ASSERTed (because that is the Priority Picker's job),
268 the ORing is acceptable.
269 * Whilst the GoRead (or GoWrite) signal is still asserted HI, the *CLK*
270 line goes *LOW*. With the Reset-AND-gate now being HI, this *clears* the
271 latch. This is the desired outcome because in the previous cycle (which
272 happened to be when CLK was LOW), the register file was read (or written)
273
274 The release of the latch happens to have a by-product of releasing the
275 "reservation", such that future instructions, if they ever test for
276 Read/Write hazards, will find that this Cell no longer responds: the
277 hazard has already passed as this Cell already indicated that it was
278 safe to read (or write) the register file, freeing up future instructions
279 from hazards in the process.
280
281 [[!img dependence_cell_pending.jpg]]
282
283 # Shadowing
284
285 Shadowing is important as it is the fundamental basis of:
286
287 * Precise exceptions
288 * Write-after-write hazard avoidance
289 * Correct multi-issue instruction sequencing
290 * Branch speculation
291
292 Modifications to the shadow circuit below allow the shadow flip-flops
293 to be automatically reset after a Function Unit "dies". Without these
294 modifications, the shadow unit may spuriously fire on subsequent re-use
295 due to some of the latches being left in a previous state.
296
297 Note that only "success" will cause the latch to reset. Note also
298 that the introduction of the NOT gate causes the latch to be more like
299 a DFF (register).
300
301 [[!img shadow.svg]]
302
303 # LD/ST Computation Unit
304
305 Discussions:
306
307 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006167.html>
308 * <https://groups.google.com/forum/#!topic/comp.arch/qeMsE7UxvlI>
309
310 Walk-through Videos:
311
312 * <https://www.youtube.com/watch?v=idDn1norNl0>
313 * <https://www.youtube.com/watch?v=ipOe0cLOJWc>
314
315 The Load/Store Computation Unit is a little more complex, involving
316 three functions: LOAD, STORE, and LOAD-UPDATE. The SR Latches create
317 a forward-progressing Finite State Machine, with three possible paths:
318
319 * LD Mode will activate Issue, GoRead1, GoAddr then finally GoWrite1
320 * LD-UPDATE Mode will *additionally* activate GoWrite2.
321 * ST Mode will activate Issue, GoRead1, GoRead2, GoAddr then GoStore.
322 ST-UPDATE Mode will *additionally* activate GoWrite2.
323
324 These signals will be allowed to activate when the correct "Req" lines
325 are active. Minor complications are involved (extra latches) that respond
326 to an external API interface that has a more "traditional" valid/ready
327 signalling interface, with single-clock responses.
328
329 Source:
330
331 * [LD/ST Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst.py)
332
333 [[!img ld_st_comp_unit.jpg]]
334
335 # Memory-Memory Dependency Matrix
336
337 Due to the possibility of more than on LD/ST being in flight, it is necessary
338 to determine which memory operations are conflicting, and to preserve a
339 semblance of order. It turns out that as long as there is no *possibility*
340 of overlaps (note this wording carefully), and that LOADs are done separately
341 from STOREs, this is sufficient.
342
343 The first step then is to ensure that only a mutually-exclusive batch of LDs
344 *or* STs (not both) is detected, with the order between such batches being
345 preserved. This is what the memory-memory dependency matrix does.
346
347 "WAR" stands for "Write After Read" and is an SR Latch. "RAW" stands for
348 "Read After Write" and likewise is an SR Latch. Any LD which comes in
349 when a ST is pending will result in the relevant RAW SR Latch going active.
350 Likewise, any ST which comes in when a LD is pending results in the
351 relevant WAR SR Latch going active.
352
353 LDs can thus be prevented when it has any dependent RAW hazards active,
354 and likewise STs can be prevented when any dependent WAR hazards are active.
355 The matrix also ensures that ordering is preserved.
356
357 Note however that this is the equivalent of an ALU "FU-FU" Matrix. A
358 separate Register-Mem Dependency Matrix is *still needed* in order to
359 preserve the **register** read/write dependencies that occur between
360 instructions, where the Mem-Mem Matrix simply protects against memory
361 hazards.
362
363 Note also that it does not detect address clashes: that is the responsibility
364 of the Address Match Matrix.
365
366 Source:
367
368 * [Memory-Dependency Row](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_dependence_cell.py;h=2958d864cec75480b97a0725d9b3c44f53d2e7a0;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
369 * [Memory-Dependency Matrix](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_fu_matrix.py;h=6b9ce140312290a26babe2e3e3d821ae3036e3ab;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
370
371 [[!img ld_st_dep_matrix.png size="600x"]]
372
373 # Address Match Matrix
374
375 This is an important adjunct to the Memory Dependency Matrices: it ensures
376 that no LDs or STs overlap, because if they did it could result in memory
377 corruption. Example: a 64-bit ST at address 0x0001 comes in at the
378 same time as a 64-bit ST to address 0x0002: the second write will overwrite
379 all writes to bytes in memory 0x0002 thru 0x0008 of the first write,
380 and consequently the order of these two writes absolutely has to be
381 preserved.
382
383 The suggestion from Mitch Alsup was to use a match system based on bits
384 4 thru 10/11 of the address. The idea being: we don't care if the matching
385 is "too inclusive", i.e. we don't care if it includes addresses that don't
386 actually overlap, because this just means "oh dear some LD/STs do not
387 happen concurrently, they happen a few cycles later" (translation: Big Deal)
388
389 What we care about is if it were to **miss** some addresses that **do**
390 actually overlap. Therefore it is perfectly acceptable to use only a few
391 bits of the address. This is fortunate because the matching has to be
392 done in a huge NxN Pascal's Triangle, and if we were to compare against
393 the entirety of the address it would consume vast amounts of power and gates.
394
395 An enhancement of this idea is to turn the length of the operation
396 (LD/ST 1 byte, 2 bytes, 4 or 8 bytes) into a byte-map "mask", using the
397 bottom 4 bits of the address to offset this mask and "line up" with
398 the Memory byte read/write enable wires on the underlying Memory used
399 in the L1 Cache.
400
401 Then, the bottom 4 bits and the LD/ST length, now turned into a 16-bit unary
402 mask, can be "matched" using simple AND gate logic (instead of XOR for
403 binary address matching), with the advantage that it is both trivial to
404 use these masks as L1 Cache byte read/write enable lines, and furthermore
405 it is straightforward to detect misaligned LD/STs crossing cache line
406 boundaries.
407
408 Crossing over cache line boundaries is trivial in that the creation of
409 the byte-map mask is permitted to be 24 bits in length (actually, only
410 23 needed). When the bottom 4 bits of the address are 0b1111 and the
411 LD/ST is an 8-byte operation, 0b1111 1111 (representing the 64-bit LD/ST)
412 will be shifted up by 15 bits. This can then be chopped into two
413 segments:
414
415 * First segment is 0b1000 0000 0000 0000 and indicates that the
416 first byte of the LD/ST is to go into byte 15 of the cache line
417 * Second segment is 0b0111 1111 and indicates that bytes 2 through
418 8 of the LD/ST must go into bytes 0 thru 7 of the **second**
419 cache line at an address offset by 16 bytes from the first.
420
421 Thus we have actually split the LD/ST operation into two. The AddrSplit
422 class takes care of synchronising the two, by issuing two *separate*
423 sets of LD/ST requests, waiting for both of them to complete (or indicate
424 an error), and (in the case of a LD) merging the two.
425
426 The big advantage of this approach is that at no time does the L1 Cache
427 need to know anything about the offsets from which the LD/ST came. All
428 it needs to know is: which bytes to read/write into which positions
429 in the cache line(s).
430
431 Source:
432
433 * [Address Matcher](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=a47f635f4e9c56a7a13329810855576358110339;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
434 * [Address Splitter](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_split.py;h=bf89e0970e9a8b44c76018660114172f5a3061f4;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
435
436 [[!img ld_st_splitter.png size="600x"]]
437
438 # L0 Cache/Buffer
439
440 See:
441
442 * <https://bugs.libre-soc.org/show_bug.cgi?id=216>
443 * <https://bugs.libre-soc.org/show_bug.cgi?id=257>
444 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006118.html>
445
446 The L0 cache/buffer needs to be kept extremely small due to it having
447 significant extra CAM functionality than a normal L1 cache. However,
448 crucially, the Memory Dependency Matrices and address-matching
449 [take care of certain things](https://bugs.libre-soc.org/show_bug.cgi?id=216#c20)
450 that greatly simplify its role.
451
452 The problem is that a standard "queue" in a multi-issue environment would
453 need to be massively-ported: 8-way read and 8-way write. However that's not
454 the only problem: the major problem is caused by the fact that we are
455 overloading "vectorisation" on top of multi-issue execution, where a
456 "normal" vector system would have a Vector LD/ST operation where sequences
457 of consecutive LDs/STs are part of the same operation, and thus a "full
458 cache line" worth of reads/writes is near-trivial to perform and detect.
459
460 Thus with the "element" LD/STs being farmed out to *individual* LD/ST
461 Computation Units, a batch of consecutive LD/ST operations arrive at the
462 LD/ST Buffer which could - hypothetically - be merged into a single
463 cache line, prior to passing them on to the L1 cache.
464
465 This is the primary task of the L0 Cache/Buffer: to resolve multiple
466 (potentially misaligned) 1/2/4/8 LD/ST operations (per cycle) into one
467 **single** L1 16-byte LD/ST operation.
468
469 The amount of wiring involved however is so enormous (3,000+ wires if
470 "only" 4-in 4-out multiplexing is done from the LD/ST Function Units) that
471 considerable care has to be taken to not massively overload the ASIC
472 layout.
473
474 To help with this, a recommendation from
475 [comp.arch](https://groups.google.com/forum/#!topic/comp.arch/cbGAlcCjiZE)
476 came to do a split odd-even double-L1-cache system: have *two* L1 caches,
477 one dealing with even-numbered 16-byte cache lines (addressed by bit 4 == 0)
478 and one dealing with odd-numbered 16-byte cache lines (addr[4] == 1).
479 This trick doubles the sequential throughput whilst halving the bandwidth
480 of a drastically-overloaded multiplexer bus.
481 Thus, we can also have two L0 LD/ST Cache/Buffers (one each looking after
482 its corresponding L1 cache).
483
484 The next phase - task - of the L0 Cache/Buffer - is to identify and merge
485 any requests with the same top 5 bits. This becomes a trivial task (under
486 certain conditions, already satisfied by other components), by simply
487 picking the first request, and using that row's address as a search
488 pattern to match against all upper bits (5 onwards). When such a match
489 is located, then due to the job(s) carried out by prior components, the
490 byte-mask for all requests with the same upper address bits may simply be
491 ORed together.
492
493 This requires a little back-tracking to explain. The prerequisite
494 conditions are as follows:
495
496 * Mask, in each row of the L0 Cache/Buffer, encodes the bottom 4 LSBs
497 of the address **and** the length of the LD/ST operation (1/2/4/8 bytes),
498 in a "bitmap" form.
499 * These "Masks" have already been analysed for overlaps by the Address
500 Match Matrix: we **know** therefore that there are no overlaps (hence why
501 addresses with the same MSBs from bits 5 and above may have their
502 masks ORed together)
503
504 [[!img mem_l0_to_l1_bridge.png size="600x"]]
505
506 ## Twin L0 cache/buffer design
507
508 See <https://groups.google.com/d/msg/comp.arch/cbGAlcCjiZE/OPNAvWSHAQAJ>.
509 [Flaws](https://bugs.libre-soc.org/show_bug.cgi?id=216#c24)
510 in the above were detected, and needed correction.
511
512 Notes:
513
514 * The flaw detected above is that for each pair of LD/ST operations
515 coming from the Function Unit (to cover mis-aligned requests),
516 the Addr[4] bit is **mutually-exclusive**. i.e. it is **guaranteed**
517 that Addr[4] for the first FU port's LD/ST request will **never**
518 equal that of the second.
519 * Therefore, if the two requests are split into left/right separate L0
520 Cache/Buffers, the advantages and optimisations for XOR-comparison
521 of bits 12-48 of the address **may not take place**.
522 * Solution: merge both L0-left and L0-right into one L0 Cache/Buffer,
523 with twin left/right banks in the same L0 Cache/Buffer
524 * This then means that the number of rows may be reduced to 8
525 * It also means that Addr[12-48] may be stored (and compared) only once
526 * It does however mean that the reservation on the row has to wait for
527 *both* ports (left and right) to clear out their LD/ST operation(s).
528 * Addr[4] still selects whether the request is to go into left or right bank
529 * When the misaligned address bits 4-11 are all 0b11111111, this is not
530 a case that can be handled, because it implies that Addr[12:48] will
531 be **different** in the row. This case throws a misaligned exception.
532
533 Other than that, the design remains the same, as does the algorithm to
534 merge the bytemasks. This remains as follows:
535
536 * PriorityPicker selects one row
537 * For all rows greater than the selected row, if Addr[5:48] matches
538 then the bytemask is "merged" into the output-bytemask-selector
539 * The output-bytemask-selector is used as a "byte-enable" line on
540 a single 128-bit byte-level read-or-write (never both).
541
542 Twin 128-bit requests (read-or-write) are then passed directly through
543 to a pair of L1 Caches.
544
545 [[!img twin_l0_cache_buffer.jpg size="600x"]]
546
547 # Multi-input/output Dependency Cell and Computation Unit
548
549 * <https://www.youtube.com/watch?v=ohHbWRLDCfs>
550 * <https://youtu.be/H0Le4ZF0cd0>
551
552 apologies that this is best done using images rather than text.
553 i'm doing a redesign of the (augmented) 6600 engine because there
554 are a couple of design criteria/assumptions that do not fit our
555 requirements:
556
557 1. operations are only 2-in, 1-out
558 2. simultaneous register port read (and write) availability is guaranteed.
559
560 we require:
561
562 1. operations with up to *four* in and up to *three* out
563 2. sporadic availability of far less than 4 Reg-Read ports and 3 Reg-Write
564
565 here are the two associated diagrams which describe the *original*
566 6600 computational unit and FU-to-Regs Dependency Cell:
567
568 1. comp unit https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg
569 2. dep cell https://libre-soc.org/3d_gpu/dependence_cell_pending.jpg
570
571 as described here https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
572 we found a signal missing from Mitch's book chapters, and tracked it down
573 from the original Thornton "Design of a Computer": Read_Release. this
574 is a synchronisation / acknowledgement signal for Go_Read which is directly
575 analogous to Req_Rel for Go_Write.
576
577 also in the dependency cell, we found that it is necessary to OR the
578 two "Read" Oper1 and Oper2 signals together and to AND that with the
579 Write_Pending Latch (top latch in diagram 2.) as shown in the wonderfully
580 hand-drawn orange OR gate.
581
582 thus, Read-After-Write hazard occurs if there is a Write_Pending *AND*
583 any Read (oper1 *OR* oper2) is requested.
584
585
586 now onto the additional modifications.
587
588 3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
589 4. dep cell https://libre-soc.org/3d_gpu/dependence_cell_multi_pending.jpg
590
591 firstly, the computation unit modifications:
592
593 * multiple Go_Read signals are present, GoRD1-3
594 * multiple incoming operands are present, Op1-3
595 * multiple Go_Write signals are present, GoWR1-3
596 * multiple outgoing results are present, Out1-2
597
598 note that these are *NOT* necessarily 64-bit registers: they are in fact
599 Carry Flags because we are implementing POWER9. however (as mentioned
600 yesterday in the huge 250+ discussion, as far as the Dep Matrices are
601 concerned you still have to treat Carry-In and Carry-out as Read/Write
602 Hazard-protected *actual* Registers)
603
604 in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
605 that *both* registers will be read (and supplied) simultaneously from
606 the Register File, the sequence - the Finite State Machine - is real
607 simple:
608
609 * ISSUE -> BUSY (latched)
610 * RD-REQ -> GO_RD
611 * WR-REQ -> GO_WR
612 * repeat
613
614 [aside: there is a protective "revolving door" loop where the SR latch for
615 each state in the FSM is guaranteed stable (never reaches "unknown") ]
616
617 in *this* diagram (3), we instead need:
618
619 * ISSUE -> BUSY (latched)
620 * RD-REQ1 -> GO_RD1 (may occur independent of RD2/3)
621 * RD-REQ2 -> GO_RD2 (may occur independent of RD1/3)
622 * RD-REQ3 -> GO_RD3 (may occur independent of RD1/2)
623 * when all 3 of GO_RD1-3 have been asserted,
624 ONLY THEN raise WR-REQ1-2
625 * WR-REQ1 -> GO_WR1 (may occur independent of WR2)
626 * WR-REQ2 -> GO_WR2 (may occur independent of WR1)
627 * when all (2) of GO_WR1-2 have been asserted,
628 ONLY THEN reset back to the beginning.
629
630 note the crucial difference is that read request and acknowledge (GO_RD)
631 are *all independent* and may occur:
632
633 * in any order
634 * in any combination
635 * all at the same time
636
637 likewise for write-request/go-write.
638
639 thus, if there is only one spare READ Register File port available
640 (because this particular Computation Unit is a low priority, but
641 the other operations need only two Regfile Ports and the Regfile
642 happens to be 3R1W), at least one of OP1-3 may get its operation.
643
644 thus, if we have three 2-operand operations and a 3R1W regfile:
645
646 * clock cycle 1: the first may grab 2 ports and the second grabs 1 (Oper1)
647 * clock cycle 2: the second grabs one more (Oper2) and the third grabs 2
648
649 compare this to the *original* 6600: if there are three 2-operand
650 operations outstanding, they MUST go:
651
652 * clock cycle 1: the first may grab 2 ports, NEITHER the 2nd nor 3rd proceed
653 * clock cycle 2: the second may grab 2 ports, 3rd may NOT proceed
654 * clock cycle 3: the 3rd grabs 2 ports
655
656 this because the Comp Unit - and associated Dependency Matrices - *FORCE*
657 the Comp Unit to only proceed when *ALL* necessary Register Read Ports
658 are available (because there is only the one Go_Read signal).
659
660
661 so my questions are:
662
663 * does the above look reasonable? both in terms of the DM changes
664 and CompUnit changes.
665
666 * the use of the three SR latches looks a little weird to me
667 (bottom right corner of (3) which is a rewrite of the middle
668 of the page.
669
670 it looks a little weird to have an SR Latch looped back
671 "onto itself". namely that when the inversion of both
672 WR_REQ1 and WR_REQ2 going low triggers that AND gate
673 (the one with the input from Q of an SR Latch), it *resets*
674 that very same SR-Latch, which will cause a mini "blip"
675 on Reset, doesn't it?
676
677 argh. that doesn't feel right. what should it be replaced with?
678
679 [[!img compunit_multi_rw.jpg size="600x"]]
680
681 [[!img dependence_cell_multi_pending.jpg size="600x"]]
682
683 # Corresponding FU-FU (Function-to-Function) Dependency Cell Modifications
684
685 * Video <https://youtu.be/_5fmPpInJ7U>
686
687 Original 6600 FU-FU Cell diagram:
688
689 [[!img fu_dep_cell_6600.jpg size="600x"]]
690
691 Augmented multi-GORD/GOWR 6600 FU-FU Cell diagram:
692
693 [[!img fu_dep_cell_multi_6600.jpg size="600x"]]
694
695 # FU-Regs Vectors
696
697 There are two FU-Regs Vectors. The first is an accumulation of
698 all row information. This indicates that (on a per-Operand basis
699 in the Libre-SOC design) there is *a* write pending for that Operand
700 (note that this is not per **register**, it is per **operand**).
701 Likewise, the OR-accumulation of every unary-encoded register SR-Latch
702 bit in the row, for reading for each FU's Operand, indicates a
703 desire of that Function Unit's need to *read* from a given port.
704
705 These accumulated signals, coming out on a per-row basis for each
706 Operand port, are sent straight to every cell in the corresponding
707 FU-FU Matrix row.
708
709 [[!img fu_regs_row_pending_vec.png size="600x"]]
710
711 The second vector set accumulates the **column** information. With the
712 FU-Regs Cells capturing the instruction operand read/write register
713 numbers (in unary form), the ORing per column of those bits creates
714 a "global picture", per register, of the fact that *any* Function Unit
715 needs to read (or write) a particular Operand latch port.
716
717 [[!img fu_regs_global_pending_vec.png size="500x"]]
718
719 # FU-FU Vectors
720
721 Two vectors exist that accumulate row and column information. With the
722 FU-FU Cell recording whether the Function Unit *wants* to read (or write)
723 the per-cell information is not so crucial as the *accumulation* of that
724 information. When all other Function Units in that column no longer
725 indicate that they were waiting for a read, that FU is clear to **write**.
726 Correspondingly, when all FUs in the column no longer indicate waiting
727 for a write, that FU is clear to **read**. With a full NxN matrix of
728 cells, this inversion preserves Read-after-Write and Write-after-Read
729 hazard information relationships between **all** Function Units and all
730 other Function Units.
731
732 [[!img fu_fu_readable_writeable.png size="500x"]]
733