[libreriscv.git] / 3d_gpu / microarchitecture.mdwn
1 # High-level architectural Requirements
2
3 * SMP Cache coherency (TileLink?)
4 * Minimum 800MHz
5 * Minimum 2-core SMP, more likely 4-core uniform design,
6 each core with full 4-wide SIMD-style predicated ALUs
7 * 6GFLOPS single-precision FP
8 * 128 64-bit FP and 128 64-bit INT register files
9 * RV64GC compliance for running full GNU/Linux-based OS
10 * SimpleV compliance
11 * xBitManip (required for VPU and ideal for predication)
12 * On-chip tile buffer (memory-mapped SRAM), likely shared
13 between all cores, for the collaborative creation of pixel "tiles".
14 * 4-lane 2Rx1W SRAMs for registers numbered 32 and above;
15 Multi-R x Multi-W for registers 1-31.
16 TODO: consider 2R for registers to be used as predication targets
17 if >= 32.
18 * Idea: generic implementation of ports on register file so as to be able
19 to experiment with different arrangements.
20 * Potentially: Lane-swapping / crossing / data-multiplexing
21 bus on register data (particularly because of SHAPE-REMAP (1D/2D/3D)
22 * Potentially: Registers subdivided into 16-bit, to match
23 elwidth down to 16-bit (for FP16). 8-bit elwidth only
24 goes down as far as twin-SIMD (with predication). This
25 requires registers to have extra hidden bits: register
26 x30 is now "x30.0 + x30.1 + x30.2 + x30.3". Have to discuss.
27
28 See [[requirements_specification]]
29
30 # Conversation Notes
31
32 ----
33
34 http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000310.html
35
36 > We will need fast f32 <-> i16 at least since that is used for 16-bit
37 > z-buffers. Since we don't have indexed load/store and need to manually
38 > construct pointer vectors we will need fast i32 -> i64. We will also need
39 > fast i32 <-> f32.
40
41 ----
42
43 I'm thinking about using tilelink (or something similar) internally, as
44 having a cache-coherent protocol is required for implementing Vulkan
45 (unless you want to turn off the cache for the GPU memory, which I
46 don't think is a good idea). axi is not a cache-coherent protocol,
47 and tilelink already has atomic rmw operations built into the protocol.
48 We can use an axi-to-tilelink bridge to interface with the memory.
49
50 I'm thinking we will want to have a dual-core GPU since a single
51 core with 4xSIMD is too slow to achieve 6GFLOPS with a reasonable
52 clock speed. Additionally, that allows us to use an 800MHz core clock
53 instead of the 1.6GHz we would otherwise need, allowing us to lower the
54 core voltage and save power, since the power used is proportional to
55 F\*V^2. (just guessing on clock speeds.)
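The F\*V^2 reasoning can be sanity-checked with a quick Python sketch. The 1.0 V and 0.8 V figures below are made-up placeholders, just like the guessed clock speeds in the text:

```python
# Relative dynamic power, P = C * F * V^2 (capacitance C held constant).
def dynamic_power(freq_ghz, volts, cap=1.0):
    return cap * freq_ghz * volts ** 2

# Single core at 1.6 GHz / 1.0 V vs. two cores at 800 MHz / 0.8 V
# (voltage assumed to scale down roughly with frequency).
single = dynamic_power(1.6, 1.0)
dual = 2 * dynamic_power(0.8, 0.8)

print(single, dual)  # dual-core total comes out lower than single-core
```

Same nominal throughput either way; the dual-core arrangement wins on power purely because of the V^2 term.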
56
57 ----
58
59 I don't know about power, however I have done some research and a 4Kbyte
60 (or 16, icr) SRAM (what I was thinking of for a tile buffer) takes in the
61 ballpark of 1000 um^2 in 28nm.
62 Using a 4xFMA with a banked register file where the bank is selected by the
63 lower order register number means we could probably get away with 1Rx1W
64 SRAM as the backing memory for the register file, similarly to Hwacha. I
65 would suggest 8 banks allowing us to do more in parallel since we could run
66 other units in parallel with a 4xFMA. 8 banks would also allow us to clock
67 gate the SRAM banks that are not in use for the current clock cycle
68 allowing us to save more power. Note that the 4xFMA could be 4 separately
69 allocated FMA units, it doesn't have to be SIMD style. If we have enough hw
70 parallelism, we can under-volt and under-clock the GPU cores allowing for a
71 more efficient GPU. If we are using the GPU cores as CPU cores as well, I
72 think it would be important to be able to use a faster clock speed when not
73 using the extended registers (similar to how Intel processors use a lower
74 clock rate when AVX512 is in use) so that scalar code is not slowed down
75 too much.
76
77 > > Using a 4xFMA with a banked register file where the bank is selected by
78 > the
79 > > lower order register number means we could probably get away with 1Rx1W
80 > > SRAM as the backing memory for the register file, similarly to Hwacha.
81 >
82 > okaaay.... sooo... we make an assumption that the top higher "banks"
83 > are pretty much always going to be "vectorised", such that, actually,
84 > they genuinely don't need to be 6R-4W (or whatever).
85 >
86 Yeah pretty much, though I had meant the bank number comes from the
87 least-significant bits of the 7-bit register number.
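To make the banking scheme concrete, here is a small Python sketch, assuming the 8 banks suggested above; `bank_of` is a hypothetical helper name, not anything from the design:

```python
NUM_BANKS = 8

def bank_of(regnum):
    # bank number = least-significant bits of the 7-bit register number
    assert 0 <= regnum < 128
    return regnum & (NUM_BANKS - 1)

# four sequentially-numbered registers land in four distinct banks,
# so a 4xFMA can fetch its operands in parallel from 1R1W SRAM banks
banks = [bank_of(r) for r in range(36, 40)]
print(banks)  # [4, 5, 6, 7]
```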
88
89 ----
90
91 Assuming 64-bit operands:
92 If you could organize 2 SRAM macros and use the pair of them to
93 read/write 4 registers at a time (256-bits). The pipeline will allow you to
94 dedicate 3 cycles for reading and 1 cycle for writing (4 registers each).
95
96 <pre>
97 RS1 = Read of operand S1
98 WRd = Write of result Dst
99 FMx = Floating Point Multiplier, x = stage.
100
101 |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
102 |FWD|FM1|FM2|FM3|FM4|
103 |FWD|FM1|FM2|FM3|FM4|
104 |FWD|FM1|FM2|FM3|FM4|WRd|
105 |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
106 |FWD|FM1|FM2|FM3|FM4|
107 |FWD|FM1|FM2|FM3|FM4|
108 |FWD|FM1|FM2|FM3|FM4|WRd|
109 |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
110 |FWD|FM1|FM2|FM3|FM4|
111 |FWD|FM1|FM2|FM3|FM4|
112 |FWD|FM1|FM2|FM3|FM4|WRd|
113 </pre>
114
115 The only trick is getting the read and write dedicated on different clocks.
116 When the RS3 operand is not needed (60% of the time) you can use
117 the time slot for reading or writing on behalf of memory refs; STs read,
118 LDs write.
119
120 You will find doing VRFs a lot more compact this way. In GPU land we
121 called the flip-flops orchestrating the timing "collectors".
122
123 ----
124
125 Justification for Branch Prediction
126
127 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000212.html>
128
129 We can combine several branch predictors to make a decent predictor:
130 a call/return predictor (important as it can predict calls and returns
131 with around 99.8% accuracy), a loop predictor (basically counts loop
132 iterations), and some kind of global predictor (handles everything else).
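As an illustration of the loop-predictor idea (count iterations, then predict the exit once the previously-learned trip count runs out), here is a toy Python model; the class and its interface are invented for the sketch:

```python
class LoopPredictor:
    def __init__(self):
        self.trip = {}    # branch pc -> trip count learned last time round
        self.count = {}   # branch pc -> taken-iterations seen so far

    def predict(self, pc):
        # predict taken until the learned trip count is reached
        t = self.trip.get(pc)
        return t is None or self.count.get(pc, 0) < t

    def update(self, pc, taken):
        if taken:
            self.count[pc] = self.count.get(pc, 0) + 1
        else:
            self.trip[pc] = self.count.get(pc, 0)  # learn the trip count
            self.count[pc] = 0

p = LoopPredictor()
for _ in range(2):              # run a 10-iteration loop twice
    for i in range(10):
        p.update(0x40, taken=(i < 9))
print(p.trip[0x40])  # 9 taken iterations learned per trip
```

On the second execution of the loop every prediction (including the final not-taken exit) is correct, which is why this style of predictor pairs well with the call/return and global predictors above.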
133
134 We will also want a BTB; a smaller one will work. It reduces average
135 branch cycle count from 2-3 to 1, since it predicts which instructions
136 are taken branches while the instructions are still being fetched,
137 allowing the fetch to go to the target address on the next clock rather
138 than having to wait for the fetched instructions to be decoded.
139
140 ----
141
142 > https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures
143
144 well, there is this concept:
145 https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf
146
147 it is a 2-level hierarchy for register caching. honestly, though, the
148 reservation stations of the tomasulo algorithm are similar to a cache,
149 although only of the intermediate results, not of the initial operands.
150
151 i have a feeling we should investigate putting a 2-level register cache
152 in front of a multiplexed SRAM.
153
154 ----
155
156 For GPU workloads FP64 is not common so I think having 1 FP64 alu would
157 be sufficient. Since indexed loads and stores are not supported, it will
158 be important to support 4x64 integer operations to generate addresses
159 for loads/stores.
160
161 I was thinking we would use scoreboarding to keep track of operations
162 and dependencies since it doesn't need a cam per alu. We should be able
163 to design it to forward past the register file to allow for 0-latency
164 forwarding. If we combined that with register renaming it should prevent
165 most WAR and WAW data hazards.
166
167 I think branch prediction will be essential if only to fetch and decode
168 operations since it will reduce the branch penalty substantially.
169
170 Note that even if we have a zero-overhead loop extension, branch
171 prediction will still be useful as we will want to be able to run code
172 like compilers and standard RV code with decent performance. Additionally,
173 quite a few shaders have branching in their internal loops so
174 zero-overhead loops won't be able to fix all the branching problems.
175
176 ----
177
178 > you would need a 4-wide cdb anyway, since that's the performance we're
179 > trying for.
180
181 if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
182 then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
183 per core, which means only a 2-wide CDB, a heck of a lot better than
184 4.
185
186 oh: i thought of another way to cut the power-impact of the Reorder
187 Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
188 length equal to the number of registers, 2 is because of 2-issue).
189
190 the CAM of a ROB is on the instruction destination register. key:
191 ROBnum, value: instr-dest-reg. if you have a bitfield that says "this
192 destreg has no ROB tag", it's dead-easy to check that bitfield, first.
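A minimal Python model of that bitfield check (all names invented for the sketch); the point is that the power-hungry CAM scan only fires when the per-register bit says a ROB tag exists:

```python
NUM_REGS = 128

class RobTagFilter:
    def __init__(self):
        self.has_tag = [False] * NUM_REGS   # the 1-bit-per-register field
        self.dest_of = {}                   # ROB# -> dest reg (the "CAM")

    def issue(self, rob_num, dest_reg):
        self.has_tag[dest_reg] = True
        self.dest_of[rob_num] = dest_reg

    def lookup(self, src_reg):
        # return the ROB# producing src_reg, or None; the CAM search
        # only happens when the bitfield says a tag exists
        if not self.has_tag[src_reg]:
            return None                     # cheap early-out, no CAM access
        for rob_num, dest in self.dest_of.items():
            if dest == src_reg:
                return rob_num

f = RobTagFilter()
f.issue(rob_num=3, dest_reg=10)
print(f.lookup(10), f.lookup(11))  # 3 None
```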
193
194 ----
195
196 Avoiding Memory Hazards
197
198 * WAW and WAR hazards through memory are eliminated with speculation
199 because actual updating of memory occurs in order, when a store is at
200 the head of the ROB, and hence, no earlier loads or stores can still
201 be pending
202 * RAW hazards are maintained by two restrictions:
203 1. not allowing a load to initiate the second step of its execution if
204 any active ROB entry occupied by a store has a destination
205 field that matches the value of the A field of the load and
206 2. maintaining the program order for the computation of an effective
207 address of a load with respect to all earlier stores
208 * These restrictions ensure that any load that accesses a memory location
209 written to by an earlier store cannot perform the memory access until
210 the store has written the data.
211
212 Advantages of Speculation, Load and Store hazards:
213
214 * A store updates memory only when it reaches the head of the ROB
215 * WAW and WAR type of hazards are eliminated with speculation
216 (actual updating of memory occurs in order)
217 * RAW hazards through memory are maintained by not allowing a load
218 to initiate the second step of its execution
219 * Check if any store has a destination field that matches the
220 address of the load:
221 - SD F1 100(R2)
222 - LD F2 100(R2)
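Restriction 1 and the SD/LD example above can be expressed as a tiny Python check (illustrative only; `rob_entries` stands in for the active ROB entries and their A fields):

```python
# A load may not start its memory access while any active ROB entry
# occupied by a store has an address matching the load's address.
def load_may_proceed(load_addr, rob_entries):
    # rob_entries: list of (kind, addr) pairs for active ROB entries
    return not any(kind == "store" and addr == load_addr
                   for kind, addr in rob_entries)

rob = [("store", 100), ("alu", None)]
print(load_may_proceed(100, rob))  # False: SD 100(R2) blocks LD 100(R2)
print(load_may_proceed(200, rob))  # True: different address, no conflict
```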
223
224 Exceptions
225
226 * Exceptions are handled by not recognising the exception until the
227 instruction that caused it is ready to commit in the ROB (reaches the
228 head of the ROB)
229
230 Reorder Buffer
231
232 * Results of an instruction become visible externally when it leaves
233 the ROB
234 - Registers updated
235 - Memory updated
236
237 Reorder Buffer Entry
238
239 * Instruction type
240 - branch (no destination result)
241 - store (has a memory address destination)
242 - register operation (ALU operation or load, which has reg dests)
243 * Destination
244 - register number (for loads and ALU ops) or
245 - memory address (for stores) where the result should be written
246 * Value
247 - value of instruction result, pending a commit
248 * Ready
249 - indicates that the instruction has completed execution: value is ready
250
251 ----
252
253 Register Renaming resources
254
255 * <https://www.youtube.com/watch?v=p4SdrUhZrBM>
256 * <https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml>
257 * ROBs + Rename <http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323010.html>
258
259 Video @ 3:24, "RAT" table - Register Aliasing Table:
260
261 <img src="/3d_gpu/rat_table.png" />
262
263 This scheme looks very much like a Reservation Station.
264
265 ----
266
267 There is another way to get precise ordering of the writes in a scoreboard.
268 First, one has to implement forwarding in the scoreboard.
269 Second, the function units need an output queue <of say 4 registers>
270 Now, one can launch an instruction and pick up its operand either
271 from the RF or from the function unit output while the result sits
272 in the function unit waiting for its GO_Write signal.
273
274 Thus the launching of instructions is not delayed due to hazards
275 but the results are delivered to the RF in program order.
276
277 This looks surprisingly like a 'belt' at the end of the function unit.
278
279 ----
280
281 > https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
282
283 so, coming back to this diagram, i think if we stratify the
284 Functional Units into lanes as well, we may get a multi-issue
285 architecture.
286
287 the 6600 scoreboard rules - which are awesomely simple and actually
288 involve D-Latches (3 gates) *not* flip-flops (10 gates) can be executed
289 in parallel because there will be no overlap between stratified registers.
290
291 if using that odd-even / msw-lsw division (instead of modulo 4 on the
292 register number) it will be more like a 2-issue for standard RV
293 instructions and a 4-issue for when SV 32-bit ops are loop-generated.
294
295 by subdividing the registers into odd-even banks we will need a
296 _pair_ of (completely independent) register-renaming tables:
297 https://libre-riscv.org/3d_gpu/rat_table.png
298
299 for SIMD'd operations, if we have the same type of reservation
300 station queue as with Tomasulo, it can be augmented with the byte-mask:
301 if the byte-masks in the queue of both the src and dest registers do
302 not overlap, the operations may be done in parallel.
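The byte-mask overlap test is just a bitwise AND; a sketch with one mask bit per byte of a 64-bit register:

```python
# Two SIMD'd operations queued against the same register may proceed in
# parallel if the byte-masks of the bytes they touch do not overlap.
def may_run_in_parallel(mask_a, mask_b):
    return (mask_a & mask_b) == 0

# op A touches the low 4 bytes, op B the high 4 bytes: disjoint, so
# both operations can be done in parallel
print(may_run_in_parallel(0b00001111, 0b11110000))  # True
print(may_run_in_parallel(0b00001111, 0b00000001))  # False: byte 0 clashes
```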
303
304 i still have not yet thought through how the Reorder Buffer would
305 work: here, again, i am tempted to recommend that, again, we "stratify"
306 the ROB into odd-even (modulo 2) or perhaps modulo 4, with 32 entries,
307 however the CAM is only 4-bit or 3-bit wide.
308
309 if an instruction's destination register does not meet the modulo
310 requirements, that ROB entry is *left empty*. this does mean that,
311 for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
312 4), and there are 4 sequential instructions that happen e.g. to have
313 a destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc.
314 etc.... the ROB will only hold 8 such instructions
315
316 and that i think is perfectly fine, because, statistically, it'll balance
317 out, and SV generates sequentially-incrementing instruction registers,
318 so *that* is fine, too.
319
320 i'll keep working on diagrams, and also reading mitch alsup's chapters
321 on the 6600. they're frickin awesome. the 6600 could do multi-issue
322 LD and ST by way of having dedicated registers to LD and ST. X1-X5 were
323 for ST, X6 and X7 for LD.
324
325 ----
326
327 i took a shot at explaining this also on comp.arch today, and that
328 allowed me to identify a problem with the proposed modulo-4 "lanes"
329 stratification.
330
331 when a result is created in one lane, it may need to be passed to the next
332 lane. that means that each of the other lanes needs to keep a watchful
333 eye on when another lane updates the other regfiles (all 3 of them).
334
335 when an incoming update occurs, there may be up to 3 register writes
336 (that need to be queued?) that need to be broadcast (written) into
337 reservation stations.
338
339 what i'm not sure of is: can data consistency be preserved, even if
340 there's a delay? my big concern is that during the time where the data is
341 broadcast from one lane, the head of the ROB arrives at that instruction
342 (which is the "commit" condition), it gets committed, then, unfortunately,
343 the same ROB# gets *reused*.
344
345 now that i think about it, as long as the length of the queue is below
346 the size of the Reorder Buffer (preferably well below), and as long as
347 it's guaranteed to be emptied by the time the ROB cycles through the
348 whole buffer, it *should* be okay.
349
350 ----
351
352 > Don't forget that in these days of Spectre and Meltdown, merely
353 > preventing dead instruction results from being written to registers or
354 > memory is NOT ENOUGH. You also need to prevent load instructions from
355 > altering cache and branch instructions from altering branch prediction
356 > state.
357
358 Which, oddly enough, provides a necessity for being able to consume
359 multiple containers from the cache Miss buffers, which oddly enough,
360 are what makes a crucial mechanism in the Virtual Vector Method work.
361
362 In the past, one would forward the demand container to the waiting
363 memref and then write the whole line into the cache. S&M means you
364 have to forward multiple times from the miss buffers and avoid damaging
365 the cache until the instruction retires. VVM uses this to avoid having
366 a vector strip mine the data cache.
367
368 ----
369
370 > I meant the renaming done as part of the SV extension, not the
371 > microarchitectural renaming.
372
373 ah ok, yes. right. ok, so i don't know what to name that, and i'd
374 been thinking of it in terms of "post-renaming", as in my mind, it's
375 not really renaming, at all, it's... remapping. or, vector
376 "elements".
377
378 as in: architecturally we already have a name (vector "elements").
379 physically we already have a name: register file.
380
381 i was initially thinking that the issue stage would take care of it,
382 by producing:
383
384 * post-remapped elements which are basically post-remapped register indices
385 * a byte-mask indicating which *bytes* of the register are to be
386 modified and which left alone
387 * an element-width that is effectively an augmentation of (part of) the opcode
388
389 the element width goes into the ALU as an augmentation of the opcode
390 because the 64-bit "register" now contains e.g. 16-bit "elements"
391 indexed 0-3, or 8-bit "elements" indexed 0-7, and we now want a
392 SIMD-style (predicated) operation to take place.
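As a sketch of the byte-mask output the issue stage would produce, here is a hypothetical helper mapping (element width, element index) to the bytes of the 64-bit backing register that an element operation would modify:

```python
# Byte-mask for one element of a 64-bit (8-byte) register: one mask bit
# per byte. elwidth_bytes = 1 gives 8 elements, 2 gives 4, and so on.
def element_byte_mask(elwidth_bytes, index):
    elems_per_reg = 8 // elwidth_bytes
    assert 0 <= index < elems_per_reg
    mask_one = (1 << elwidth_bytes) - 1      # contiguous run of byte-bits
    return mask_one << (index * elwidth_bytes)

print(bin(element_byte_mask(2, 3)))  # 16-bit element 3: top two bytes
print(bin(element_byte_mask(1, 0)))  # 8-bit element 0: lowest byte only
```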
393
394 now that i think about it, i think we may need to have the three
395 phases be part of a pipeline, in a single dependency matrix.
396
397 ----
398
399 I had a state machine in one chip that could come up out of power on in a
400 state it could not get out of. Since this experience, I have a rule with
401 state machines: a state machine must be able to go from any state to idle
402 when the reset line is asserted.
403
404 You have to prove that the logic can never create a circular dependency,
405 not a proof with test vectors, a logical proof like what we do with FP
406 arithmetic these days.
407
408 ----
409
410
411 > however... we don't mind that, as the vectorisation engine will
412 > be, for the most part, generating sequentially-increasing index
413 > dest *and* src registers, so we kinda get away with it.
414
415 In this case:: you could simply design a 1R or 1W file (A.K.A. SRAM)
416 and read 4 registers at a time or write 4 registers at a time. Timing
417 looks like:
418
419 <pre>
420 |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
421 |F123|F123|F123|F123|
422 |Esk1|EsK2|EsK3|EsK4|
423 |EfK1|EfK2|EfK3|EfK4|
424 </pre>
425
426 4 cycle FU shown. Read as much as you need in 4 cycles for one operand,
427 Read as much as you need in 4 cycles for another operand, read as much
428 as you need in 4 cycles for the last operand, then write as much as you
429 can for the result. This simply requires flip-flops to capture the width
430 and then deliver operands in parallel (serial to parallel converter) and
431 similarly for writing.
432
433 ----
434
435 * <https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/32aNXIzeDQAJ>
436
437 discussion of how to do dest-latches rather than src-latches.
438
439 also includes need for forwarding to achieve it (synonymous with
440 Tomasulo CDB).
441
442 also, assigning a result number at issue time allows multiple results
443 to be stored-and-forwarded, meaning that multiplying up the FUs is
444 not needed.
445
446 also, discussion of how to have multiple instructions issued even with
447 the same dest reg: drop the reg-store and effectively rename them
448 to "R.FU#". exceptions under discussion.
449
450 ----
451
452 Speculation
453
454 <https://groups.google.com/forum/#!topic/comp.arch/mzXXTU2GUSo>
455
456 There is a minimal partial order that is immune to Spectre and friends.
457 You have the dependence matrix that imposes a minimal partial order on
458 executing instructions (at least in the architecture you have been
459 discussing herein). You just have to prove that your matrix provides
460 that minimal partial order for instructions.
461
462 Then you have to prove that no cache/tlb state can be updated prior to the
463 causing instruction being made retirable (not retired, merely retirable).
464
465 As to cache updates, all "reasonable" interfaces that service cache misses
466 will have line buffers to deal with the inbound and outbound memory traffic.
467 These buffers will provide the appropriate data to the execution stream,
468 but not update the cache until the causing instruction becomes transitively
469 retirable. This will put "a little" extra pressure on these buffers.
470
471 As to the TLB it is easy enough on a TLB miss to fetch the mapping tables
472 transitively and arrive at a PTE. This PTE cannot be installed until the
473 causing instruction becomes retirable. The miss buffers are probably the
474 right place, and if a second TLB miss occurs, you might just as well walk
475 the tables again and if it hits the line in the buffer use the data from
476 there. When we looked at this a long time ago, there was little benefit
477 for being able to walk more than one TLB miss at a time.
478
479 ----
480
481 Register Prefixes <a name="prefixes" />
482
483 <pre>
484 | 3 | 2 | 1 | 0 |
485 | ---------------- | ---------------- | ---------------- | ---------------- |
486 | | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 |
487 | | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
488 | | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
489 | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
490 | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
491 </pre>
492
493 <pre>
494 2x16-bit / 32-bit:
495
496 | 9 8 | 7 6 5 | 4 3 | 2 1 | 0 |
497 | ----- | ----- | ------- | ------- | - |
498 | elwid | VL | rs[6:5] | rd[6:5] | 0 |
499
500 | 9 8 7 6 5 | 4 3 | 2 | 1 | 0 |
501 | --------- | -------- | --- | --- | - |
502 | predicate | predtarg | end | inv | 1 |
503
504
505 | | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
506 | | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
507 | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
508 | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
509 </pre>
510
511 # MVX and other reg-shuffling
512
513 <pre>
514 > Crucial strategic op missing is MVX:
515 > regs[rd]= regs[regs[rs1]]
516 >
517 we could modify the definition slightly:
518 for i in 0..VL {
519 let offset = regs[rs1 + i];
520 // we could also limit on out-of-range
521 assert!(offset < VL); // trap on fail
522 regs[rd + i] = regs[rs2 + offset];
523 }
524
525 The dependency matrix would have the instruction depend on everything from
526 rs2 to rs2 + VL and we let the execution unit figure it out. for
527 simplicity, we could extend the dependencies to a power of 2 or something.
528
529 We should add some constrained swizzle instructions for the more
530 pipeline-friendly cases. One that will be important is:
531 for i in (0..VL) {
532 let i = i * 4;
533 let mut s1 = [0; 4];
534 for j in 0..4 {
535 s1[j] = regs[rs1 + i + j];
536 }
537 for j in 0..4 {
538 regs[rd + i + j] = s1[(imm >> j * 2) & 0x3];
539 }
540 }
541 Another is matrix transpose for (2-4)x(2-4) matrices which we can implement
542 as similar to a strided ld/st except for registers.
543 </pre>
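The quoted pseudocode translates almost directly into runnable Python, modelling the register file as a plain list (the `assert` stands in for the trap-on-fail):

```python
# MVX: regs[rd + i] = regs[rs2 + regs[rs1 + i]], with offsets limited to < VL.
def mvx(regs, rd, rs1, rs2, vl):
    for i in range(vl):
        offset = regs[rs1 + i]
        assert offset < vl               # trap on out-of-range offset
        regs[rd + i] = regs[rs2 + offset]

# Constrained swizzle: groups of 4 elements, with a 2-bit lane selector
# per destination lane packed into imm.
def swizzle4(regs, rd, rs1, imm, vl):
    for i in range(0, vl * 4, 4):
        s1 = [regs[rs1 + i + j] for j in range(4)]
        for j in range(4):
            regs[rd + i + j] = s1[(imm >> (j * 2)) & 0x3]

regs = list(range(64))
regs[8:12] = [3, 1, 2, 0]                # index vector, all values < VL
mvx(regs, rd=32, rs1=8, rs2=16, vl=4)
print(regs[32:36])  # [19, 17, 18, 16]

regs2 = list(range(16))
swizzle4(regs2, rd=8, rs1=0, imm=0, vl=1)    # selector 0 for all 4 lanes
print(regs2[8:12])  # [0, 0, 0, 0]: broadcast of element 0
```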
544
545 # TLBs / Virtual Memory <a name="tlb" />
546
547 ----
548
549 We were specifically looking for ways to not need large CAMs since they are
550 power-hungry when designing the instruction scheduling logic, so it may be
551 a good idea to have a smaller L1 TLB and a larger, slower, more
552 power-efficient, L2 TLB. I would have the L1 be 4-32 entries and the L2 can
553 be 32-128 as long as the L2 cam isn't being activated every clock cycle. We
554 can also share the L2 between the instruction and data caches.
555
556 # Register File having same-cycle "forwarding"
557
558 discussion about CDC 6600 Register File: it was capable of forwarding
559 operands being written out to "reads", *in the same cycle*. this
560 effectively turns the Reg File *into* a "Forwarding Bus".
561
562 we aim to only have (4 banks of) 2R1W ported register files,
563 with *additional* Forwarding Multiplexers (which look exactly
564 like multi-port regfile gate logic).
565
566 suggestion by Mitch is to have a "demon" on the front of the regfile,
567 <https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/qY2SYjd2DgAJ>,
568 which:
569
570 basically, you are going to end up with a "demon" at the RF and when
571 all read reservations have been satisfied the demon determines if the
572 result needs to be written to the RF or discarded. The demon sees
573 the instruction issue process, the branch resolutions, and the FU
574 exceptions, and keeps track of whether the result needs to be written.
575 It then forwards the result from the FU and clears the slot, then writes
576 the result to the RF if needed.
577
578 # Design Layout
579
580 ok,so continuing some thoughts-in-order notes:
581
582 ## Scoreboards
583
584 scoreboards are not just scoreboards, they are dependency matrices,
585 and there are several of them:
586
587 * one for LOAD/STORE-to-LOAD/STORE
588 - most recent LOADs prevent later STOREs
589 - most recent STOREs prevent later LOADs.
590 - a separate process analyses LOAD-STORE addresses for
591 conflicts, based on sufficient bits to assess uniqueness
592 as opposed to precise and exact matches
593 * one for Function-Unit to Function-Unit.
594 - it expresses both RAW and WAW hazards through "Go_Write"
595 and "Go_Read" signals, which are stopped from proceeding by
596 dependent 1-bit CAM latches
597 - exceptions may ALSO be made "precise" by holding a "Write prevention"
598 signal. only when the Function Unit knows that an exception is
599 not going to occur (memory has been fetched, for example), does
600 it release the signal
601 - speculative branch execution likewise may hold a "Write prevention",
602 however it also needs a "Go die" signal, to clear out the
603 incorrectly-taken branch.
604 - LOADs/STOREs *also* must be considered as "Functional Units" and thus
605 must also have corresponding entries (plural) in the FU-to-FU Matrix
606 - it is permitted for ALUs to *BEGIN* execution (read operands are
607 valid) without being permitted to *COMMIT*. thus, each FU must
608 store (buffer) results, until such time as a "commit" signal is
609 received
610 - we may need to express an inter-dependence on the instruction order
611 (raising the WAW hazard line to do so) as a way to preserve execution
612 order. only the oldest instructions will have this flag dropped,
613 permitting execution that has *begun* to also reach "commit" phase.
614 * one for Function-Unit to Registers.
615 - it expresses the read and write requirements: the source
616 and destination registers on which the operation depends. source
617 registers are marked "need read", dest registers marked
618 "need write".
619 - by having *more than one* Functional Unit matrix row per ALU
620 it becomes possible to effectively achieve "Reservation Stations"
621 orthogonality with the Tomasulo Algorithm. the FU row must, like
622 RS's, take and store a copy of the src register values.
623
624 ## Register Renaming
625
626 There are several potential well-known schemes for register-renaming:
627 *none of them will be used here*. The scheme below is a new form of
628 renaming that is a topologically and functionally **direct** equivalent
629 of the Tomasulo Algorithm with a Reorder Buffer, that came from the
630 "Register Alias Table" concept that is better suited to Scoreboards.
631 It works by flattening out Reservation Stations to one per FU (requiring
632 more FUs as a result). On top of this the function normally carried
633 out by "tags" of the RAT table may be merged-morphed into the role
634 carried out by the ROB Destination Register CAM which may be merged-morphed
635 into a single vector (per register) of 1-bit mutually-exclusive "CAMs"
636 that are added, very simply, to the FU-Register Dependency Matrix.
637
638 In this way, exactly as in the Tomasulo Algorithm, there is absolutely no
639 need whatsoever for a separate PRF-ARF scheme. The PRF *is* the ARF.
640
641 Register-renaming will be done with a single extra mutually-exclusive bit
642 in the FUxReg Dependency Matrix, which may be set on only one FU (per register).
643 This bit indicates which of the FUs has the **most recent** destination
644 register value pending. It is **directly** functionally equivalent to
645 the Reorder Buffer Dest Reg# CAM value, except that now it is a
646 string of 1-bit "CAMs".
647
648 When an FU needs a src reg and finds that it needs to create a
649 dependency waiting for a result to be created, it must use this
650 bit to determine which FU it creates a dependency on.
651
652 If there is a destination register that already has a bit set
653 (anywhere in the column), it is **cleared** and **replaced**
654 with a bit in the FU's row and the destination register's column.
655
656 See https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/c80jRn4PCQAJ
657
<pre>
658 MUL r1, r2, r3

660 FU name Reg name
661 12345678
662 add-0 ........
663 add-1 ........
664 mul-0 X.......
665 mul-1 ........

667 ADD r4, r1, r3

669 FU name Reg name
670 12345678
671 add-0 ...X....
672 add-1 ........
673 mul-0 X.......
674 mul-1 ........

676 ADD r1, r5, r6

678 FU name Reg name
679 12345678
680 add-0 ...X....
681 add-1 X.......
682 mul-0 ........
683 mul-1 ........
</pre>
684
685 note how on the 3rd instruction, the (mul-0,R1) entry is **cleared**
686 and **replaced** with an (add-1,R1) entry. future instructions now
687 know that if their src operands require R1, they are to place a
688 RaW dependency on **add-1**, not mul-0
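The worked example above can be reproduced with a small Python model of the FU-by-register bit matrix. The class and method names are invented for the sketch, and registers are indexed from 0, so r1 is column 0:

```python
# One bit per (FU row, register column); at most one FU per column holds
# the "most recent destination value pending" bit, replacing the ROB
# Dest Reg# CAM with a string of 1-bit "CAMs".
class FuRegMatrix:
    def __init__(self, fus, num_regs=8):
        self.bits = {fu: [False] * num_regs for fu in fus}

    def writer_of(self, reg):
        # which FU will produce reg, or None: scan the 1-bit column
        for fu, row in self.bits.items():
            if row[reg]:
                return fu

    def issue(self, fu, dest):
        old = self.writer_of(dest)
        if old is not None:
            self.bits[old][dest] = False   # clear the existing bit...
        self.bits[fu][dest] = True         # ...and set it on this FU's row

m = FuRegMatrix(["add-0", "add-1", "mul-0", "mul-1"])
m.issue("mul-0", 0)        # MUL r1, r2, r3: mul-0 owns r1
m.issue("add-0", 3)        # ADD r4, r1, r3: reads r1, depends on mul-0
m.issue("add-1", 0)        # ADD r1, r5, r6: r1 bit moves to add-1
print(m.writer_of(0))      # add-1: later readers of r1 now depend on add-1
```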
689
690 ## Multi-issue
691
692 we may potentially have 2-issue (or 4-issue) and a simpler issue and
693 detection by "striping" the register file according to modulo 2 (or 4)
694 on the destination register number
695
696 * the Function Unit rows are multiplied up by 2 (or 4) however they are
697 actually connected to the same ALUs (pipelined and with both src and
698 dest register buffers/latches).
699 * the Register Read and Write signals are then "striped" such that
700 read/write requests for every 2nd (or 4th) register are "grouped" and
701 will have to fight for access to a multiplexer in order to access
702 registers that do not have the same modulo 2 (or 4) match.
703 * we MAY potentially be able to drop the destination (write) multiplexer(s)
704 by only permitting FU rows with the same modulo to write to that
705 destination bank. FUs with indices 0,4,8,12 may only write to registers
706 similarly numbered.
707 * there will therefore be FOUR separate register-data buses, with (at least)
708 the Read buses multiplexed so that all FU banks may read all src registers
709 (even if there is contention for the multiplexers)
710
711 ## FU-to-Register address de-muxed already
712
713 an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
714 write/read enable signals already exist as single-bits.  "normal" processors
715 store the src/dest registers as an index (5 bits == 0-31), where in this
716 design, that has been expanded out to 32 individual Read/Write wires,
717 already.
718
* the register file Verilog implementation therefore must take in an
  array of 128-bit write-enable and 128-bit read-enable signals.
* however the data buses will be multiplexed modulo 2 (or 4), according
  to the lower bits of the register number, in order to cross "lanes".

## FU "Grouping"

With so many Function Units in RISC-V (dozens of instructions, times 2
to provide Reservation Stations, times 2 (or 4) for dual (or quad) issue),
we almost certainly are going to have to deploy a "grouping" scheme:

* rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
  to MUL etc., instead we group the FUs by how many src and dest
  registers are required, and *pass the opcode down to them*
* only FUs with the exact same number (and type) of register profile
  will receive like-minded opcodes.
* when src and dest are free for a particular op (and an ALU pipeline is
  not stalled) the FU is at liberty to push the operands into the
  appropriate free ALU.
* FUs therefore only really express the register, memory, and execution
  dependencies: they do not actually do the execution.

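The grouping idea above can be sketched in a few lines. The opcode table
is an invented, illustrative subset (the real profiles would come from
the full instruction listing):

```python
# Hypothetical sketch: group opcodes by register profile
# (num src operands, num dest operands, which register file), so that
# one pool of FUs serves every opcode sharing that profile, and the
# opcode itself is simply passed down to a free FU in the group.

from collections import defaultdict

# opcode -> (num_src, num_dest, regfile); illustrative subset only
PROFILES = {
    "add":  (2, 1, "int"),
    "sub":  (2, 1, "int"),
    "mul":  (2, 1, "int"),
    "fadd": (2, 1, "fp"),
}

def group_by_profile(opcodes):
    groups = defaultdict(list)
    for op in opcodes:
        groups[PROFILES[op]].append(op)
    return dict(groups)

# add/sub/mul share one FU group; fadd is separate (different regfile)
g = group_by_profile(["add", "sub", "mul", "fadd"])
assert g[(2, 1, "int")] == ["add", "sub", "mul"]
assert g[(2, 1, "fp")] == ["fadd"]
```

This keeps the number of FU rows proportional to the number of register
*profiles* rather than the number of opcodes.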
## Recommendations

* Include a merged address-generator in the INT ALU.
* Have simple ALU units duplicated and allow more than one FU to
  receive (and process) the src operands.

## Register file workloads

Note: Vectorisation also includes predication, which is one extra integer read.

Integer workloads:

* 43% Integer
* 21% Load
* 12% Store
* 24% Branch

* 100% of the instruction stream can be integer instructions
* 75% utilize two source operand registers
* 50% of the instruction stream can be Load instructions
* 25% can be Store instructions
* 25% can be Branch instructions

FP workloads:

* 30% Integer
* 25% Load
* 10% Store
* 13% Multiplication
* 17% Addition
* 5% Branch

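A rough back-of-envelope check of read-port demand can be derived from the
integer mix above. The per-category operand counts here are assumptions for
the illustration (loads read 1 register, stores 2 — address plus data —
branches 2, and integer ops 2 sources 75% of the time, else 1):

```python
# Back-of-envelope sketch (operand counts per category are assumed,
# not taken from measurement): average register reads demanded per
# instruction for the integer workload mix above.

mix = {"int": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}
src = {"int": 0.75 * 2 + 0.25 * 1, "load": 1, "store": 2, "branch": 2}

avg_reads = sum(mix[k] * src[k] for k in mix)
print(f"{avg_reads:.2f} register reads per instruction")  # ~1.68
```

Under these assumptions the average is well under two reads per
instruction, which is why matching port counts to the workload mix (rather
than to the worst case) avoids over-provisioning the register file.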
----

> in particular i found it fascinating that analysis of INT
> instructions found a 50% LD, 25% ST and 25% branch, and that
> 70% were 2-src ops. therefore you made sure that the number
> of read and write ports matched these, to ensure no bottlenecks,
> bearing in mind that ST requires reading an address *and*
> a data register.

I never had a problem in "reading the write slot" in any of my pipelines.
That is, take a pipeline where LD (cache hit) has a latency of 3 cycles
(AGEN, Cache, Align). Align would be in the cycle where the data was being
forwarded, and in the subsequent cycle the data could be written into the RF.

|dec|AGN|$$$|ALN|LDW|

For stores I would read the LD's write slot, align the store data, and
merge it into the cache as:

|dec|AGEN|tag|---|STR|ALN|$$$|

You know 4 cycles in advance that a store is coming, 2 cycles after hit,
so there is easy logic to decide to read the write slot (or not), and it
costs 2 address comparators to disambiguate this short shadow in the
pipeline.

This is a lower expense than building another read port into the RF, in
both area and power, and uses the pipeline efficiently.

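The alignment of the two pipelines above can be checked mechanically. This
is only an illustration of the timing claim, using the stage names from
the diagrams, not a model of the actual design:

```python
# Sketch: the load writes the register file in its LDW slot; the store,
# known several cycles in advance, borrows that same slot (its STR stage)
# to read its data register, so no extra RF read port is needed.

LOAD_STAGES  = ["dec", "AGEN", "$$$", "ALN", "LDW"]
STORE_STAGES = ["dec", "AGEN", "tag", "---", "STR", "ALN", "$$$"]

# the store's RF access (STR) lines up with the load's write slot (LDW):
# both land in pipeline cycle 4 relative to decode
assert LOAD_STAGES.index("LDW") == STORE_STAGES.index("STR") == 4
```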
# Explicit Vector Length (EVL) extension to LLVM <a name="llvm_evl" />

* <https://reviews.llvm.org/D57504>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html>
* <http://lists.llvm.org/pipermail/llvm-dev/2019-January/129822.html>

# References

* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
* <https://en.wikipedia.org/wiki/Reservation_station>
* <https://en.wikipedia.org/wiki/Register_renaming> points out that
  reservation stations take a *lot* of power.
* <http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf> scoreboarding
* MESI cache protocol, python: <https://github.com/sunkarapk/mesi-cache.git>
  <https://github.com/afwolfe/mesi-simulator>
* <https://kshitizdange.github.io/418CacheSim/final-report> report on
  types of caches
* <https://github.com/ssc3?tab=repositories> interesting stuff
* <https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing>
  pipeline bypassing
* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
* <http://ece-research.unm.edu/jimp/611/slides/chap4_7.html> Tomasulo / Reorder
* Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
* <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>
* <https://github.com/ataradov/riscv/blob/master/rtl/riscv_core.v#L210>
* <https://www.eda.ncsu.edu/wiki/FreePDK>