clarify

diff --git a/updates/005_2018dec14_simd_without_simd.mdwn b/updates/005_2018dec14_simd_without_simd.mdwn

index 69f19aff4eb18fd413b77a6432fb0fb977a8b9f6..65487291e7197b538af29710b756546dc6ce274b 100644 (file)
--- a/updates/005_2018dec14_simd_without_simd.mdwn
+++ b/updates/005_2018dec14_simd_without_simd.mdwn
@@ -99,6 +99,8 @@ So in fact, it's actually quite simple, and this "cascade" is simply and
  easily propagated down to the Function Unit Dependency Matrices, stopping
  32-bit operations from overwriting 8-bit and vice-versa.
  
+# Virtual Registers
+
  The fourth part is the grid in green, in the top left corner.  This is
  a "virtual" to "real" one-bit table.  It's here because the size of
  these matrices is so enormous that there is deep concern about the line
@@ -117,6 +119,8 @@ if this table is not high enough (not enough IRs), the processor has to
  stall until an instruction is completed, so that one register becomes
  free.  Again, another thing to keep an eye on, in simulations.
  
+# Refinements
+
  The second major concern is the purple matrix: the FU-to-FU one.  Basically
  where previously we would have FU1 cover all ADDs, FU2 would cover all MUL
  operations, FU3 covers BRANCH and so on, now we have to multiply those
@@ -124,3 +128,33 @@ numbers by **four** (64-bit ops, 32-bit ops, 16-bit and 8), which in turn
  means that the size of the FU-to-FU Matrix has gone up by a staggering
  **sixteen** times.  This is not really acceptable, so we have to do something
  different.
+
+So the refinement is based on an observation that 16-bit operations of
+course may be constructed from 8-bit values, and that 64-bit operations
+can be constructed from 32-bit ones.  So, what if we skipped the
+cascade on 64-bit and 16-bit, and made the cascade out of just 32-bit and 8-bit?
+Then, very simply, the top half of a 64-bit source register is allocated
+to one Function Unit, the bottom half to the one next to it, and when it
+comes to actually passing the source registers to the relevant ALU, take
+from *both* FUs.
+
+The primary focus is on 32-bit (single-precision floating-point) performance
+anyway, for 3D, so if 64-bit operations happen to have half the number of
+Reservation Stations / Function Units, and block more often, we actually
+don't mind so much.  Also, we can still apply the same "banks" trick on
+the Register File, except this time with 4-way multiplexing on 32-bit
+wide banks, and 4x4 crossbars on the bytes as well:
+
+{{register_file_multiplexing.jpg}}
+
+To cope with 16-bit operations, pairs of 8-bit values in adjacent Function
+Units are reserved.  Likewise for 64-bit operations, the 8-bit crossbars
+are not used, and pairs of 32-bit source values in adjacent Function Units
+in the *32-bit* FU area are reserved.
+
+However, the gate count in such a staggered crossbar arrangement is insane:
+bear in mind that this will be 3R1W or 2R1W (3 or 2 reads, 1 write per
+register), and that means **three** sets of crossbars, comprising **four**
+banks, with effectively 16 byte to 16 byte routing.
+
+It's too much, so this will be explored further in later updates.
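
The arithmetic behind the "sixteen times" figure, and the pairing rule in the refinement, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the baseline of 8 Function Units, and the square-matrix cost model are all hypothetical and are not taken from the actual design. The 4x figure for the two-width cascade is derived from the same model and is not a number the update itself gives.

```python
# Hypothetical sketch: names and the baseline FU count are illustrative only.

def fu_matrix_cells(num_fus: int) -> int:
    """Cells in a square FU-to-FU dependency matrix (one row and column per FU)."""
    return num_fus * num_fus


def growth_factor(base_fus: int, width_classes: int) -> int:
    """How much the FU-to-FU matrix grows if every FU is replicated per width class."""
    before = fu_matrix_cells(base_fus)
    after = fu_matrix_cells(base_fus * width_classes)
    return after // before


def fus_reserved(op_bits: int) -> tuple:
    """Which FU area an operation uses, and how many adjacent FUs it reserves.

    Only 32-bit and 8-bit FUs exist in the refinement; 64-bit and 16-bit
    operations are built from adjacent pairs.
    """
    table = {
        8: ("8-bit", 1),
        16: ("8-bit", 2),   # pair of adjacent 8-bit FUs
        32: ("32-bit", 1),
        64: ("32-bit", 2),  # pair of adjacent 32-bit FUs
    }
    return table[op_bits]


if __name__ == "__main__":
    # Four width classes (64, 32, 16, 8): the matrix grows 4 * 4 = 16 times.
    print(growth_factor(base_fus=8, width_classes=4))
    # Cascade on 32-bit and 8-bit only: growth is 2 * 2 = 4 times in this model.
    print(growth_factor(base_fus=8, width_classes=2))
    print(fus_reserved(64))
```

Under this simple model the growth factor is independent of the baseline FU count, because it is just the square of the number of width classes.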