clarify

diff --git a/updates/005_2018dec14_simd_without_simd.mdwn b/updates/005_2018dec14_simd_without_simd.mdwn
index fb75a04f0ff941b1e803a17a3a165ed82ec7addc..65487291e7197b538af29710b756546dc6ce274b 100644
--- a/updates/005_2018dec14_simd_without_simd.mdwn
+++ b/updates/005_2018dec14_simd_without_simd.mdwn
@@ -141,4 +141,20 @@ from *both* FUs.
  The primary focus is on 32-bit (single-precision floating-point) performance
  anyway, for 3D, so if 64-bit operations happen to have half the number of
  Reservation Stations / Function Units, and block more often, we actually
-don't mind so much.  
+don't mind so much.  Also, we can still apply the same "banks" trick on
+the Register File, except this time with 4-way multiplexing on 32-bit
+wide banks, and 4x4 crossbars on the bytes as well:
+
+{{register_file_multiplexing.jpg}}
+
+To cope with 16-bit operations, pairs of 8-bit values in adjacent Function
+Units are reserved.  Likewise for 64-bit operations, the 8-bit crossbars
+are not used, and pairs of 32-bit source values in adjacent Function Units
+in the *32-bit* FU area are reserved.
+
+However, the gate count in such a staggered crossbar arrangement is insane:
+bear in mind that this will be 3R1W or 2R1W (3 or 2 reads, 1 write per
+register), and that means **three** sets of crossbars, comprising **four**
+banks, with effectively 16-byte to 16-byte routing.
+
+It's too much, so this will be explored further in later updates.
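
To put a rough number on the "16 byte to 16 byte routing" remark in the hunk above, here is a minimal sketch that counts byte-wide crosspoints. It is not a gate count. The cost model (a full crossbar needs inputs times outputs crosspoints, each one byte wide) and all of the names in the code are assumptions made for illustration. Only the figures of four 32-bit banks and three sets of crossbars come from the text.

```python
# Rough crosspoint estimate for the byte-level routing described in the update.
# Assumption (not from the source): a full crossbar is modelled as
# inputs * outputs crosspoints, each crosspoint switching one 8-bit byte.

BANKS = 4            # four register-file banks (from the text)
BYTES_PER_BANK = 4   # each bank is 32 bits wide (from the text)
CROSSBAR_SETS = 3    # "three sets of crossbars" (from the text)
BITS_PER_BYTE = 8


def byte_lanes(banks=BANKS, bytes_per_bank=BYTES_PER_BANK):
    """Number of byte lanes that can be routed to one another."""
    return banks * bytes_per_bank


def crosspoints(lanes, sets=CROSSBAR_SETS):
    """Byte-wide crosspoints for `sets` full lanes-by-lanes crossbars."""
    return lanes * lanes * sets


def bit_switches(lanes, sets=CROSSBAR_SETS):
    """Single-bit switch points, since each crosspoint carries 8 bits."""
    return crosspoints(lanes, sets) * BITS_PER_BYTE


if __name__ == "__main__":
    lanes = byte_lanes()
    print("byte lanes:", lanes)
    print("byte-wide crosspoints:", crosspoints(lanes))
    print("single-bit switch points:", bit_switches(lanes))
```

Even under this crude model the total runs to several thousand single-bit switch points before any control logic is counted, which is consistent with the conclusion that the arrangement is too costly as described.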