updates/005_2018dec14_simd_without_simd.mdwn

# Microarchitectural Design by Osmosis

Across a series of different descriptions and evaluations, a picture of
a suitable microarchitecture is beginning to emerge. The process of
talking on [videos](https://youtu.be/DoZrGJIltgU),
[writing out thoughts](https://groups.google.com/forum/#!topic/comp.arch/2kYGFU4ppow)
and then discussing the resulting feedback
[elsewhere](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000261.html)
is starting to crystallise, without overloading any one group of people.

There are several things to remember about this design. The primary one
is that it is not explicitly intended as a discrete GPU (although one could
be made): it is primarily for an efficient, battery-operated hand-held
device, and it happens to be just about adequate for, say, a low to
mid-range Chromebook.
Power consumption *for the entire chip* is targeted at 2.5 watts.

We learned quite quickly that, paradoxically, even a mobile embedded 3D
GPU *requires* an extreme number of registers (128 floating-point
registers), because it handles vectors (or quads, as they are called), and
even pixel data in floating-point format is four 32-bit numbers (including
the transparency).  So where a "normal" RISC processor has 32 registers,
a GPU typically needs four times that number, simply because it is
dealing with four numbers simultaneously.  Without them, the data has to
go back out to memory (even if only to the L1 cache) and, as the L1 cache
runs a CAM, that is guaranteed to be power-hungry.

128 registers bring some unique challenges not normally faced by
general-purpose CPUs, and when it becomes possible (or a requirement) to
access those 64-bit registers right down to the byte level, as "elements"
in a vector operation, it is even more challenging.  Recall Mitch Alsup's
scoreboard dependency floorplan (reproduced here with kind permission):

{{mitch_ld_st_augmentation.jpg}}

There are two key Dependency Matrices here.  On the left is the Function
Unit (rows) to Register File (columns) matrix, where you can see at the
bottom that in the CDC 6600 the Register File is divided into A, B and X.
On the right is the Function Unit to Function Unit dependency matrix,
which ensures that each FU only starts its arithmetic operation once the
FUs it depends on have produced the results it needs.  That Matrix
therefore expresses source register to destination register dependencies.

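As a rough illustration of how the two matrices cooperate, here is a
minimal Python sketch. It is not Mitch Alsup's design or the CDC 6600
logic: the `Scoreboard` class, its method names and the sizes are all
hypothetical, and only read-after-write hazards are modelled.

```python
class Scoreboard:
    """Toy model of the two dependency matrices (all names hypothetical)."""

    def __init__(self, n_fus, n_regs):
        # FU (rows) x Register (columns): which FU has a pending write where.
        self.fu_writes = [[False] * n_regs for _ in range(n_fus)]
        # FU (rows) x FU (columns): waits_on[a][b] means a needs b's result.
        self.waits_on = [[False] * n_fus for _ in range(n_fus)]

    def issue(self, fu, dest, srcs):
        # Depend on every other FU with a pending write to one of our sources.
        for other, row in enumerate(self.fu_writes):
            if other != fu and any(row[s] for s in srcs):
                self.waits_on[fu][other] = True
        self.fu_writes[fu][dest] = True

    def can_start(self, fu):
        # An FU may start only when its row in the FU-to-FU matrix is clear.
        return not any(self.waits_on[fu])

    def complete(self, fu):
        # The result exists: drop this FU's pending writes and release waiters.
        self.fu_writes[fu] = [False] * len(self.fu_writes[fu])
        for row in self.waits_on:
            row[fu] = False


sb = Scoreboard(n_fus=2, n_regs=8)
sb.issue(fu=0, dest=3, srcs=[1, 2])      # r3 = r1 op r2
sb.issue(fu=1, dest=4, srcs=[3, 5])      # r4 = r3 op r5, needs FU 0's result
print(sb.can_start(0), sb.can_start(1))  # True False
sb.complete(0)
print(sb.can_start(1))                   # True
```

The second operation is held back purely by the FU-to-FU entry, which was
itself derived from the FU-to-Register entry for r3.
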
Now let's do something hair-raising.  Let's do two crazed things at once:
increase the number of registers to a whopping 256 in total (128
floating-point and 128 integer), and at the same time allow those 64-bit
registers to be broken down into **eight** separate 8-bit values... *and
allow Function Unit dependencies to exist on them*!

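To get a feel for the scale, the arithmetic can be written out as below.
This assumes, purely to size the problem, that every byte gets its own
column in the Function Unit to Register matrix; the constant names are
made up for the illustration.

```python
INT_REGS = 128      # integer registers
FP_REGS = 128       # floating-point registers
REG_BITS = 64       # width of each register
ELEMENT_BITS = 8    # smallest addressable element

total_regs = INT_REGS + FP_REGS
lanes_per_reg = REG_BITS // ELEMENT_BITS
byte_columns = total_regs * lanes_per_reg  # columns per Function Unit row

print(total_regs, lanes_per_reg, byte_columns)  # 256 8 2048
```
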
If we did not properly take this into account in the design, an 8-bit ADD
would require us to "lock", say, Register R5 (all 64 bits of it),
preventing the other 7 bytes of R5 from being used until that extremely
small 8-bit ADD had completed.  Such a design would be laughed at, as its
performance would be so low.  Only one 8-bit ADD per clock cycle, when
Intel has recently added 512-bit SIMD?

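The cost of whole-register locking can be shown with a toy model. The
sketch below assumes single-cycle operations, enough Function Units to go
round, and conflicts on destinations only; `cycles_needed` and the two
conflict rules are hypothetical names, not part of any real design.

```python
def cycles_needed(ops, conflicts):
    """Greedy in-order packing: an op shares the current cycle unless it
    conflicts with an op already placed in that cycle."""
    cycles = 0
    current = []
    for op in ops:
        if any(conflicts(op, prev) for prev in current):
            cycles += 1       # close the current cycle and start a new one
            current = []
        current.append(op)
    return cycles + (1 if current else 0)


def whole_register(a, b):
    # Locking all 64 bits: any two ops on the same register conflict.
    return a[0] == b[0]


def per_byte(a, b):
    # Byte-level tracking: only ops on the same byte of a register conflict.
    return a == b


# Each op is (register, byte_index): eight 8-bit ADDs into the bytes of R5.
ops = [(5, byte) for byte in range(8)]

print(cycles_needed(ops, whole_register))  # 8
print(cycles_needed(ops, per_byte))        # 1
```
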
So here is a diagram of a proposed solution.  What if, when an 8-bit
operation needs to write its result into the 1st byte, the other
7 bytes have their own **completely separate** dependency lines in
the Register and Function Unit Matrices?  It looks like this:

{{reorder_alias_bytemask_scheme.png}}

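One way to read the idea in software terms is a per-register byte mask of
outstanding writes. The following is only a sketch under that assumption
and may not match the scheme in the diagram; `ByteLaneTracker` and
`lane_mask` are hypothetical names.

```python
class ByteLaneTracker:
    """Tracks pending writes per byte lane of each 64-bit register."""

    def __init__(self, n_regs=256):
        # One 8-bit mask per register: bit i set means byte i is being written.
        self.pending = [0] * n_regs

    def try_issue(self, reg, mask):
        # Blocked only if a byte we want already has a write outstanding.
        if self.pending[reg] & mask:
            return False
        self.pending[reg] |= mask
        return True

    def complete(self, reg, mask):
        # Release exactly the byte lanes this operation was holding.
        self.pending[reg] &= ~mask & 0xFF


def lane_mask(first_byte, width_bytes):
    """Mask covering width_bytes consecutive bytes starting at first_byte."""
    return ((1 << width_bytes) - 1) << first_byte


t = ByteLaneTracker()
print(t.try_issue(5, lane_mask(0, 1)))  # True: 8-bit op on byte 0 of R5
print(t.try_issue(5, lane_mask(1, 1)))  # True: byte 1 is independent
print(t.try_issue(5, lane_mask(0, 2)))  # False: 16-bit op overlaps bytes 0-1
t.complete(5, lane_mask(0, 1))
t.complete(5, lane_mask(1, 1))
print(t.try_issue(5, lane_mask(0, 8)))  # True: all 64 bits are free again
```

A full 64-bit operation is then simply the case where all eight bits of
the mask are set, so the same check covers every element width.
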