# Microarchitectural Design by Osmosis

In a series of different descriptions and evaluations, a picture of a
suitable microarchitecture is beginning to emerge and crystallise, through
a process of talking on [videos](https://youtu.be/DoZrGJIltgU),
[writing out thoughts](https://groups.google.com/forum/#!topic/comp.arch/2kYGFU4ppow)
and then discussing the resultant feedback
[elsewhere](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000261.html),
without overloading any one group of people.

There are several things to remember about this design, the primary one being
that it is not explicitly intended as a discrete GPU (although one could
be made from it): it is primarily for an efficient battery-operated hand-held
device, whilst happening to be just about adequate for, say, a low to
mid-range chromebook. Power consumption *for the entire chip* is targeted
at 2.5 watts.

We learned quite quickly that, paradoxically, even a mobile embedded 3D
GPU *requires* an extremely large number of registers (128 floating-point
registers), because it is handling vectors (or "quads", as they are called),
and even pixel data in floating-point format is four 32-bit numbers
(including the transparency). So where a "normal" RISC processor has 32
registers, a GPU typically has to have four times that amount, simply
because it is dealing with four lots of numbers simultaneously. If you
don't do this, then that data has to go back down to memory (even just to
L1 cache), and, as the L1 cache runs a CAM, that is guaranteed to be
power-hungry.

128 registers brings some unique challenges not normally faced by general
purpose CPUs, and when it becomes possible (or a requirement) to access
even down to the byte level of those 64-bit registers as "elements" in
a vector operation, it is even more challenging. Recall Mitch Alsup's
scoreboard dependency floorplan (reproduced with kind permission, here):

{{mitch_ld_st_augmentation.jpg}}

There are two key Dependency Matrices here: on the left is the Function
Unit (rows) to Register File (columns) matrix, where you can see, at the
bottom, that in the CDC 6600 the Register File is divided down into the
A, B and X registers. On the right is the Function Unit to Function Unit
dependency matrix, which ensures that each FU only starts its arithmetic
operation when the FUs it depends on have created the results it needs.
Thus, that Matrix expresses source register to destination register
dependencies.
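
To make the roles of the two matrices concrete, here is a minimal Python
sketch: a purely illustrative software model, not the hardware, with sizes
and function names invented for the example. One grid of single-bit flags
records which FU will write which register; the other records which FU is
waiting on results from which other FU.

```python
# Illustrative software model of the two CDC 6600-style Dependency Matrices.
# Sizes and names are example values, not the real design parameters.

NUM_FUS = 4        # Function Units (rows)
NUM_REGS = 8       # Registers (columns) - the real design has far more

# FU-to-Register matrix: fu_reg[fu][reg] == 1 means "this FU will write reg"
fu_reg = [[0] * NUM_REGS for _ in range(NUM_FUS)]

# FU-to-FU matrix: fu_fu[a][b] == 1 means "FU a must wait for FU b's result"
fu_fu = [[0] * NUM_FUS for _ in range(NUM_FUS)]

def issue(fu, dest_reg, src_regs):
    """Record the dependencies created by issuing an instruction to `fu`."""
    fu_reg[fu][dest_reg] = 1
    for src in src_regs:
        for other in range(NUM_FUS):
            # if another FU is going to write one of our sources,
            # we must wait for it: set the FU-to-FU dependency bit
            if other != fu and fu_reg[other][src]:
                fu_fu[fu][other] = 1

def can_start(fu):
    """An FU may begin only when it depends on no other outstanding FU."""
    return not any(fu_fu[fu])

def complete(fu):
    """On completion, release the register and drop dependencies on this FU."""
    for reg in range(NUM_REGS):
        fu_reg[fu][reg] = 0
    for waiter in range(NUM_FUS):
        fu_fu[waiter][fu] = 0
```

As a usage example: `issue(0, dest_reg=3, src_regs=[1, 2])` followed by
`issue(1, dest_reg=5, src_regs=[3])` leaves FU 1 blocked (`can_start(1)` is
False) until `complete(0)` is called - the source-to-destination dependency
described above, expressed as a single bit in the FU-to-FU matrix.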

Now let's do something hair-raising. Let's do two crazed things at once:
increase the number of registers to a whopping 256 total (128 floating
point and 128 integer), and at the same time allow those 64-bit registers
to be broken down into **eight** separate 8-bit values... *and allow
Function Unit dependencies to exist on them*!

If we did not properly take this into account in the design, an 8-bit ADD
would require us to "lock" say Register R5 (all 64 bits of it), absolutely
preventing the other 7 bytes of R5 from being used, until such time as
that extremely small 8-bit ADD had completed. Such a design would be
laughed at, its performance would be so low. Only one 8-bit ADD per clock
cycle, when Intel has recently added 512-bit SIMD??

So here is a diagram of a proposed solution. What if, when an 8-bit
operation needs to do a calculation that goes into the 1st byte, the other
7 bytes have their own **completely separate** dependency lines, in
the Register and Function Unit Matrices? It looks like this:

{{reorder_alias_bytemask_scheme.png}}

So, if you recall from the previous updates about Scoreboards, it's not
the "scoreboard" itself that's the key: it's these Register-to-Function-Unit
and FU-to-FU Dependency Matrices that are the (frequently misunderstood) key.
So let's explain this diagram. Firstly, in purple in the bottom left
is a massive FU-to-FU matrix, just as with the standard CDC 6600,
except now there are separate 32-bit FUs, 16-bit FUs, and 8-bit FUs.
In this way, we can have a 32-bit ADD depending on and waiting for
an 8-bit computation, or a 16-bit MUL on a 32-bit SQRT, and so on. Nothing
obviously different there, so far.

Likewise, in the bottom right, in red, we see matrices that have
FUs along the rows and Registers along the columns, exactly again as with
the CDC 6600 standard scoreboard: however, again, we note that
because there are separate 32-bit FUs and separate 16-bit and 8-bit
FUs, there are *three* separate sets of FU-to-Register Matrices.
Also, note that these are drawn separately, where they would be expected
to be grouped together. Except they're *not* independent, and that's
where the diagram at the top (middle) comes in.

The diagram at the top says, in words: "if you need a 32-bit register
for an operation (using a 32-bit Function Unit), the 16-bit and 8-bit
Function Units *also* connected to that exact same register **must**
be prevented from proceeding. Also, if you need 8 bits of a register,
whilst that does not prevent the other bytes of the register from being
used, it *does* prevent the overlapping 16-bit portion **and the 32-bit
and the 64-bit** portions of that same named register from being used".

This "cascading" relationship is absolutely essential to understand. If you
need Register R1 (all of it), you **cannot** go and allocate any part of
that register for use in any 32-bit, 16-bit or 8-bit operations. This is
common sense! However, if you use the lowest byte (byte 1), you can still
use the top three 16-bit portions of R1, and you can also still use byte 2.
This is also common sense!
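
One way to picture the cascade in software terms is to treat each 64-bit
register as an 8-bit byte-mask, and to allow a new reservation only when its
mask does not overlap any mask already outstanding on that register. The
Python sketch below is purely illustrative of that rule (the byte numbering
and helper names are invented for the example, not taken from the design):

```python
# Each 64-bit register is tracked as an 8-bit mask: one bit per byte.
# A bit set to 1 means that byte is already reserved by an in-flight operation.

busy = {}  # register name -> byte mask of outstanding reservations

def mask_for(width_bits, lowest_byte):
    """Byte-mask covering a `width_bits` access starting at `lowest_byte`."""
    nbytes = width_bits // 8
    return ((1 << nbytes) - 1) << lowest_byte

def try_reserve(reg, width_bits, lowest_byte=0):
    """Reserve part of a register, or refuse if any byte overlaps."""
    m = mask_for(width_bits, lowest_byte)
    if busy.get(reg, 0) & m:        # any overlapping byte: must wait
        return False
    busy[reg] = busy.get(reg, 0) | m
    return True

def release(reg, width_bits, lowest_byte=0):
    busy[reg] &= ~mask_for(width_bits, lowest_byte)

# The "common sense" cases from the text:
assert try_reserve("R1", 8, lowest_byte=0)       # byte 1 (the lowest byte) of R1
assert try_reserve("R1", 8, lowest_byte=1)       # byte 2 is still free
assert not try_reserve("R1", 64, lowest_byte=0)  # but all of R1 is blocked
assert not try_reserve("R1", 16, lowest_byte=0)  # so is the overlapping 16-bit portion
assert try_reserve("R1", 16, lowest_byte=2)      # a non-overlapping 16-bit portion is fine
```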

So in fact, it's actually quite simple, and this "cascade" is simply and
easily propagated down to the Function Unit Dependency Matrices, stopping
32-bit operations from overwriting 8-bit ones and vice versa.

# Virtual Registers

The fourth part is the grid in green, in the top left corner. This is
a "virtual" to "real" one-bit table. It's here because the size of
these matrices is so enormous that there is deep concern about the line
driver strength, as well as the actual size. 128 registers means
that one single gate, when it goes high or low, has to "drive" the
input of 128 other gates. That takes longer and longer to do, the higher
the number of gates, so it becomes a critical factor in determining the
maximum speed of the entire processor. We will have to keep an eye
on this.

So, to keep the FU-to-Register matrix size down, this "virtual" register
concept was introduced. Only one bit in each row of the green table
may be active: it says, for example, "IR1 actually represents that there
is an instruction being executed using R3". This does however mean that
if the table does not have enough rows (not enough IRs), the processor
has to stall until an instruction is completed, so that one register
becomes free. Again, another thing to keep an eye on, in simulations.
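
In software terms, the green table behaves roughly like the sketch below:
a small pool of "virtual" rows (IRs), each of which may alias exactly one
real register at a time, with a stall whenever the pool runs out. The pool
size and names are invented for illustration only:

```python
# Rough model of the "virtual" to "real" one-bit table.
# Each row (IR) aliases at most one real register; None means the row is free.

NUM_IRS = 4                      # illustrative pool size, not a design parameter
ir_table = [None] * NUM_IRS      # ir_table[i] == real register currently aliased by IRi

def allocate(real_reg):
    """Map a real register onto a free IR row, or signal a stall."""
    for i, entry in enumerate(ir_table):
        if entry is None:
            ir_table[i] = real_reg   # e.g. "IR1 represents an instruction using R3"
            return i
    return None                      # no free IR: the processor must stall here

def free(ir_index):
    """Instruction completed: the IR row (and its real register) become free again."""
    ir_table[ir_index] = None
```

With only `NUM_IRS` rows, a fifth in-flight allocation returns None, which in
the hardware corresponds to stalling issue until `free()` happens: exactly the
case the simulations need to watch for.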

# Refinements

The second major concern is the purple matrix: the FU-to-FU one. Basically,
where previously we would have FU1 cover all ADDs, FU2 cover all MUL
operations, FU3 cover BRANCH and so on, now we have to multiply the number
of FUs by **four** (64-bit ops, 32-bit ops, 16-bit and 8-bit), and because
the matrix is square, that in turn means that the size of the FU-to-FU
Matrix has gone up by a staggering **sixteen** times. This is not really
acceptable, so we have to do something different.
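
The sixteen-fold figure follows directly from the matrix being square:
quadruple the number of Function Units and the number of cells goes up by
four squared. A two-line illustration (the base FU count is just an example
number, not the real one):

```python
base_fus = 12                    # example only: one FU per operation class
widths = 4                       # 64-bit, 32-bit, 16-bit and 8-bit variants
print(base_fus ** 2)             # cells in the original FU-to-FU matrix: 144
print((base_fus * widths) ** 2)  # cells after multiplying FUs by 4: 2304 (16x larger)
```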

So the refinement is based on the observation that 16-bit operations may
of course be constructed from 8-bit values, and that 64-bit operations
can be constructed from 32-bit ones. So, what if we skipped the cascade
on 64-bit and 16-bit, and made the cascade out of just 32-bit and 8-bit?
Then, very simply, the top half of a 64-bit source register is allocated
to one Function Unit, the bottom half to the one next to it, and when it
comes to actually passing the source registers to the relevant ALU, we
take from *both* FUs.
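
A very rough way to express that pairing in software: a 64-bit source
operand occupies two adjacent 32-bit Function Unit slots, and when the
operation is finally handed to the ALU the two halves are read back from
both slots and glued together. Everything below (slot layout, helper names)
is invented purely to illustrate the idea:

```python
# Illustrative pairing of adjacent 32-bit FU slots to hold one 64-bit source operand.

slots = [None] * 8    # each entry holds a 32-bit value (or None if the slot is free)

def reserve_64bit(value64):
    """Place a 64-bit source into two adjacent 32-bit slots; return the pair index."""
    for i in range(0, len(slots), 2):
        if slots[i] is None and slots[i + 1] is None:
            slots[i]     = value64 & 0xFFFFFFFF          # bottom half in one FU slot
            slots[i + 1] = (value64 >> 32) & 0xFFFFFFFF  # top half in the adjacent slot
            return i
    return None   # no adjacent pair free: the 64-bit operation must wait

def read_64bit(pair_index):
    """When issuing to the ALU, take the source from *both* slots and recombine."""
    lo = slots[pair_index]
    hi = slots[pair_index + 1]
    return (hi << 32) | lo
```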

The primary focus is on 32-bit (single-precision floating-point) performance
anyway, for 3D, so if 64-bit operations happen to have half the number of
Reservation Stations / Function Units, and block more often, we actually
don't mind so much. Also, we can still apply the same "banks" trick on
the Register File, except this time with 4-way multiplexing on 32-bit
wide banks, and 4x4 crossbars on the bytes as well:

{{register_file_multiplexing.jpg}}

To cope with 16-bit operations, pairs of 8-bit values in adjacent Function
Units are reserved. Likewise for 64-bit operations, the 8-bit crossbars
are not used, and pairs of 32-bit source values in adjacent Function Units
in the *32-bit* FU area are reserved.

However, the gate count in such a staggered crossbar arrangement is insane:
bear in mind that this will be 2R1W or 3R1W (2 or 3 reads and 1 write per
register), and that means **three** sets of crossbars, comprising **four**
banks, with effectively 16-byte to 16-byte routing.
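
To get a feel for why that is alarming, here is a deliberately crude
back-of-envelope estimate; every constant below is an assumption made for
illustration, not a figure from the design:

```python
# Back-of-envelope crossbar cost, under crude assumptions.
lanes          = 16   # 16-byte to 16-byte routing
bits_per_lane  = 8
read_crossbars = 3    # one full crossbar per read port (3R1W assumed)
gates_per_mux2 = 4    # rough gate cost of a 1-bit 2:1 mux (assumption)

# Each output lane needs a 16:1 byte-wide mux; a 16:1 mux is ~15 2:1 muxes per bit.
per_crossbar = lanes * (lanes - 1) * bits_per_lane * gates_per_mux2
print(per_crossbar)                  # roughly 7,700 gates per crossbar
print(per_crossbar * read_crossbars) # roughly 23,000 gates for the read paths alone
```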

It's too much - so in later updates, this will be explored further.