updates/006_2018dec23_floorplan.mdwn

# A Reasonably Sane Plan

Honestly, there is nothing sane about merging a variable-size polymorphic
vectorisation front-end onto a standard RISC register file in an MMX/SSE
fashion, right down to the byte level.  However, that's what we've chosen
to do.  Why?  Well, because it's not been done before, and we'd like to
see how it works out.  Plus, there are no new instructions needed.  Unlike
a traditional vector system, which has its own pipeline and its own
register file, we don't need special instructions to transfer data between
a vector register file (which would contain both integer and floating
point numbers) and the scalar registers, and we can leverage an absolutely
standard superscalar out-of-order microarchitecture to save on design
development effort.

That's the theory.

Two things have proved to be rather scary: the size of the register
files (128 FP and 128 INT 64-bit registers), and the number of ports
needed for high-end processors (reports of 8R3W are not uncommon).
We're going for an odd-even hi-lo approach: 4 banks, with each 64-bit
register split into 32-bit hi and lo halves, and divided further into
odd register numbers and even register numbers.

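As a rough illustration of the odd-even hi-lo split, here is a minimal Python sketch that maps one 32-bit half of a 64-bit register to one of the four banks.  The bank numbering and the names `bank_of` and `split64` are assumptions made purely for illustration: the text only says that there are four banks, split by hi/lo half and by odd/even register number.

```python
def bank_of(reg_num, half):
    """Return a bank index 0..3 for one 32-bit half of a 64-bit register.

    Hypothetical numbering: bit 0 selects lo/hi, bit 1 selects even/odd.
    """
    if half not in ("lo", "hi"):
        raise ValueError("half must be 'lo' or 'hi'")
    hi_bit = 1 if half == "hi" else 0
    odd_bit = reg_num & 1
    return (odd_bit << 1) | hi_bit


def split64(value):
    """Split a 64-bit value into its (lo, hi) 32-bit halves."""
    value &= (1 << 64) - 1
    return value & 0xFFFFFFFF, value >> 32


# The four 32-bit halves of an even/odd register pair each land in a
# different bank under this numbering.
banks = {bank_of(r, h) for r in (4, 5) for h in ("lo", "hi")}
```

With this numbering, the halves of two consecutive registers never collide on a bank, which is the property that lets each bank get away with fewer ports.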

In the previous update it was explained that we will fully route source
registers (and sub-register "elements") down to the byte level, so that
after they have been processed through the ALU, there is absolutely no
need to do any further routing.  This is akin to a standard vectorisation
system's "lanes".  Additionally, every byte in the register file will
have its own separate "write" line, such that for 16-bit and 8-bit
element widths we do not need to do extraneous read-merge-write cycles.

It was also explained that byte-level source register routing across
all four banks requires a 16-to-16 crossbar, routing 8-bit values from
any of 16 source locations to any of 16 destination locations.  This is
simply too much, particularly given that if we use 2R1W we will need
*two* 16-to-16 crossbars.  The number of gates is massive.

We have an accompanying [video](https://www.youtube.com/watch?v=78het1cfz_8)
walkthrough; however, here is a photo of the scheme currently under discussion:

{{libreriscv_floorplan.jpg}}

What we will likely go with is a hybrid arrangement.  In the top right of
the above photo is a 4-bank arrangement, 32-bit wide as before.  However,
there is only 4-to-4 crossbar routing, 32-bit wide.  Again, this is only
on the source registers.  Two of these crossbars will be needed: one for
src1, one for src2.

In the bottom middle you can see that we decided to put xBitManip
Function Units into the 8-bit Function Unit Area.  These are actually
32-bit bit manipulation ALUs; however, we are putting them in the *8-bit*
area.  The reason is very simple: these xBitManip ALUs will *also* be
used, in a pseudo-micro-code fashion, to serve the dual purpose of
reordering and routing source element register bytes to the correct "lane".

What will happen is:

* Each 8-bit Function Unit (synonymous in this scheme with a
  "Reservation Station" row) will have src1 and src2 latches
  for incoming registers.
* 32-bit data will be latched into the **wrong** 8-bit Function Unit,
  along with the remainder of the element "address" to which the
  source value **should** be directed.
* The "wrong" data will be sent through the xBitManip ALUs, to shuffle
  and permute it to the **right** order.
* **Pre-existing** operand "forwarding" routing will take the output
  from the xBitManip ALUs and put it **back** into the Function Unit
  Reservation Station src1 (or src2) latches.
* With the source sub-register 8-bit values now in their correct "lanes",
  the actual required 8-bit ALU operation may now proceed.

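The bullets above can be sketched as a small behavioural model in Python.  This is only an illustration under stated assumptions: the class `ReservationStation`, the `perm` encoding of the element "address" remainder, and the helper names are all hypothetical, and the xBitManip ALU is reduced here to a plain 4x4 byte permutation.

```python
def bytes_of(word):
    """Split a 32-bit word into 4 bytes, least significant first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def word_of(byte_list):
    """Reassemble 4 bytes (least significant first) into a 32-bit word."""
    return sum(b << (8 * i) for i, b in enumerate(byte_list))


def xbitmanip_permute(word, perm):
    """Stand-in for an xBitManip ALU used as a 4x4 byte crossbar:
    output lane i takes input byte perm[i]."""
    src = bytes_of(word)
    return word_of([src[p] for p in perm])


class ReservationStation:
    """Hypothetical model of one Function Unit row's src1 latch."""

    def __init__(self):
        self.src1 = None
        self.perm = None  # routing info; None means lanes are already correct

    def latch(self, word, perm):
        # Bullets 1-2: data arrives in the "wrong" order, with routing info.
        self.src1, self.perm = word, perm

    def fix_lanes(self):
        # Bullets 3-4: permute, then forward the result back into the latch.
        if self.perm is not None:
            self.src1 = xbitmanip_permute(self.src1, self.perm)
            self.perm = None


def add8_lanes(a, b):
    """Bullet 5: four independent 8-bit adds, one per lane (wrapping)."""
    return word_of([(x + y) & 0xFF for x, y in zip(bytes_of(a), bytes_of(b))])


rs = ReservationStation()
rs.latch(0x44332211, perm=[3, 2, 1, 0])  # bytes arrive in reversed order
rs.fix_lanes()                           # src1 is now 0x11223344
result = add8_lanes(rs.src1, 0x01010101)
```

The point of the model is that `fix_lanes` reuses the same latch it read from, mirroring the way the real design reuses the existing forwarding path rather than adding a dedicated crossbar.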

So it's a multi-stage process that's very similar to micro-code operations.
It is, however, easier to hard-wire the use of the xBitManip ALUs than it
is to create multiple micro-code instructions, which was one possibility
that was considered.

In essence, the xBitManip ALU can handle 4x4 crossbar routing at the byte
level with no difficulties whatsoever, so we might as well use it for
precisely that job.  What's nice is that we can decide how many xBitManip
ALUs to put in, depending on how the VPU workload works out.  Plus, the
infrastructure to handle queueing, routing and temporary storage of the
in-flight source register values *already exists*.  The alternative previously
discussed was to have massive duplicated dedicated 16x16 crossbars: now
we have only 4x4 32-bit crossbars plus a *small* number of 4x4 8-bit
crossbars (aka xBitManip ALUs), saving significantly on the number of gates.

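To put rough numbers on the saving, the sketch below counts single-bit crosspoints (inputs x outputs x bit-width) for both arrangements.  Crosspoint count is only a crude proxy for gate count, and treating an xBitManip ALU as a bare 4x4 8-bit crossbar understates its real cost, so these figures are indicative at best.  The function names and the example of four xBitManip ALUs are assumptions; the text leaves that number open.

```python
def crosspoints(inputs, outputs, width_bits):
    """Single-bit crosspoints in a full crossbar of the given shape."""
    return inputs * outputs * width_bits


PORTS = 2  # one crossbar each for src1 and src2

# Original proposal: two 16-to-16 crossbars routing 8-bit values.
original = PORTS * crosspoints(16, 16, 8)


def hybrid(num_xbitmanip):
    """Two 4x4 32-bit crossbars plus some number of 4x4 8-bit ones."""
    return PORTS * crosspoints(4, 4, 32) + num_xbitmanip * crosspoints(4, 4, 8)


example = hybrid(4)  # hypothetical choice of four xBitManip ALUs
```

On this measure the original scheme needs 4096 crosspoints, against 1536 for the hybrid with four xBitManip ALUs.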

# Reducing Register-FU Matrix sizes

Also, one significant detail.  Recall from the previous update that a scheme
was finally envisaged where 64-bit Function Units would cascade-block
32-bit Function Units right down to 8-bit, on any given register.  We decided
that this, too, was insane, given that it would result in a whopping
16-fold increase in the size of the Function Unit Matrices.

Instead we decided to go with 32-bit to 8-bit cascade-blocking, where
two adjacent 32-bit Function Units would be required to perform 64-bit
operations, and two adjacent 8-bit Function Units required to do 16-bit
operations.  In this way the FU-to-FU Dependency Matrices are reduced
down to only a four-fold size increase when compared to a more traditional
SIMD arrangement.

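Since an FU-to-FU Dependency Matrix has one cell per pair of Function Units (64 Function Units give a 64x64 matrix, as noted later in this update), its size grows with the square of the Function Unit count.  The sketch below shows just that arithmetic.  The baseline of 8 Function Units is arbitrary, and reading the four-fold figure as a doubling of the FU count and the 16-fold figure as a quadrupling is an assumption, not something stated above.

```python
def matrix_cells(num_fus):
    """Cells in a square FU-to-FU dependency matrix."""
    return num_fus * num_fus


def growth_factor(base_fus, multiplier):
    """How much the matrix grows when the FU count is multiplied."""
    return matrix_cells(base_fus * multiplier) // matrix_cells(base_fus)


BASE = 8  # arbitrary baseline FU count, for illustration only

doubling = growth_factor(BASE, 2)     # FU count x2
quadrupling = growth_factor(BASE, 4)  # FU count x4
```

Here `doubling` comes out as 4 and `quadrupling` as 16, independent of the baseline chosen.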

In the middle towards the top of the above picture, we can therefore
see a four-wide group of 32-bit Function Units: FU1 through FU4.  These,
unsurprisingly, are dedicated to *destination* register banks, i.e. each
one's write port is connected very specifically and exclusively to its
respective RegFile bank.

Function Units 8 through 12 are the 8-bit FUs.  Really there should
be sixteen of these, because it is likely that we will need one for
every byte of the full width of the four 32-bit register banks.  If we
do not have 16 of them, having say only 8, it will be necessary to do
*destination* routing to the correct 32-bit-wide RegFile bank.  This is
something that we are keen to avoid.

Also bear in mind that we have not shown, in the above diagram,
the enhancements designed by Mitch Alsup to the 6600 Scoreboard system.
These enhancements basically add LOAD/STORE "Function Units", which fill
the exact same role as the Tomasulo scheme's LOAD/STORE queues (providing
out-of-order but correctly sequenced LOAD/STORE operations).  One Function
Unit (aka Reservation Station) is required per outstanding LOAD/STORE,
and we need LOAD/STOREs on **both** the 32-bit FUs **and** the
8-bit FUs.  It **may** be possible to merge these into one: we will have
to see.

Also, Branch Prediction (including speculative execution) requires individual
Function Units: one for each branch that is intended to run ahead.  Remember
that it was previously mentioned that there would be a "Schroedinger" wire
indicating that the instructions operating in the "shadow" of the branch
would be neither alive nor dead, and that until this was determined they
would be treated as "Write Hazards", allowing them to *execute* but **not**
commit (write) their results.  We will need such Function Units on both
the 32-bit **and** the 8-bit areas.  Exceptions likewise.

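The execute-but-do-not-commit behaviour of instructions in a branch "shadow" can be modelled very loosely in Python.  This is a sketch of the idea only, not of the scoreboard hardware: the class `ShadowedInstruction`, the `shadowed` flag standing in for the "Schroedinger" wire, and the dictionary used as a register file are all hypothetical.

```python
class ShadowedInstruction:
    """An instruction issued under an unresolved branch."""

    def __init__(self, dest, compute):
        self.dest = dest
        self.compute = compute
        self.result = None
        self.shadowed = True  # "Schroedinger" wire: neither alive nor dead

    def execute(self):
        # Execution is allowed; the result is held, not written back.
        self.result = self.compute()


def resolve_branch(instrs, regfile, prediction_correct):
    """Drop the shadow: commit held results, or cancel them."""
    for ins in instrs:
        ins.shadowed = False
        if prediction_correct and ins.result is not None:
            regfile[ins.dest] = ins.result  # commit (write)
        else:
            ins.result = None  # cancelled: the register file is untouched


regfile = {"r1": 0, "r2": 0}

good = ShadowedInstruction("r1", lambda: 42)
good.execute()
held = regfile["r1"]  # still 0: executed but not committed
resolve_branch([good], regfile, prediction_correct=True)

bad = ShadowedInstruction("r2", lambda: 99)
bad.execute()
resolve_branch([bad], regfile, prediction_correct=False)
```

After this runs, `r1` holds 42 because its branch was predicted correctly, while `r2` remains 0 because the mispredicted result was discarded before it could be written.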

So if we are not careful we could easily end up with 64 Function Units:
32 for the 32-bit area and 32 for the 8-bit area.  This is going to need
some experimentation and some detailed thought, when it comes to actual
implementation.  A 64x64 Function Unit Dependency Matrix is pretty massive,
even if the cell size (and power consumption) is very small compared to
Tomasulo plus Reorder Buffers, with associated CAMs.

There is a lot of detail that still needs to be worked out: we are however
reaching the end of the critical "overview" planning phase.  Really it is
time to start implementing a first iteration, to see how it works out.
For that, we will be looking closely at Mitch Alsup's unpublished book
chapters, as there really is no reason why we should not just implement
the gate-level diagrams that he has kindly given permission to use
(with credit).