Spread over various [videos](https://youtu.be/DoZrGJIltgU),
[writings](https://groups.google.com/forum/#!topic/comp.arch/2kYGFU4ppow),
and [mailing list discussions](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000261.html),
a picture is beginning to emerge of a suitable microarchitecture.

There are several things to remember about this design, the primary
being that it is not explicitly intended as a discrete GPU (although
one could be made). Instead, it is primarily for a battery-operated,
power-efficient hand-held device, where it happens to just about pass
muster as, say, a low to mid-range chromebook. Power consumption *for
the entire chip* is targeted at 2.5 watts.
We learned quite quickly that, paradoxically, even a mobile embedded
3D GPU *requires* an extreme number of registers (128 floating-point
registers), because it is handling vectors (or quads, as they are
called), and even pixel data in floating-point format is four 32-bit
numbers (including the transparency). So, where a "normal" RISC
processor has 32 registers, a GPU typically has to have four times
that many, simply because it is dealing with four lots of numbers
simultaneously. If you don't do this, then that data has to go back
down to memory (even to L1 cache), and, as the L1 cache runs a CAM,
it's guaranteed to be power-hungry.

Dealing with 128 registers brings some unique challenges not normally
faced by general-purpose CPUs, and when it becomes possible (or a
requirement) to access even down to the byte level of those 64-bit
registers as "elements" in a vector operation, it is even more
challenging. Recall Mitch Alsup's scoreboard dependency floorplan
(reproduced with kind permission, here):

{mitch-ld-st-augmentation | link}

There are two key dependency matrices here: on the left is the
function unit (rows) to register file (columns) matrix, where you can
see at the bottom that, in the CDC 6600, the register file is divided
down into A, B, and X. On the right is the function unit to function
unit dependency matrix, which ensures that each function unit only
starts its arithmetic operations when the function units it depends on
have created the results it needs. Thus, that matrix expresses source
register to destination register dependencies.
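To make the role of those two matrices a little more concrete, here is
a minimal software sketch of the idea. It is a toy model only: the
number of function units is an arbitrary assumption, and the real
design is per-bit wiring in hardware rather than Python, but it shows
how one-bit cells, set at issue time and cleared at write-back, are
enough to hold the dependencies.

```python
# Toy model of the two matrices: FU-to-register ("this FU will produce a
# result for that register") and FU-to-FU ("this FU must wait for that FU").
# Sizes and method names are illustrative assumptions, not the real design.

NUM_FUS = 4
NUM_REGS = 128

class ToyScoreboard:
    def __init__(self):
        self.fu_writes_reg = [[False] * NUM_REGS for _ in range(NUM_FUS)]
        self.fu_waits_on = [[False] * NUM_FUS for _ in range(NUM_FUS)]

    def issue(self, fu, srcs, dest):
        """Record dependencies of an instruction placed into function unit 'fu'."""
        for other in range(NUM_FUS):
            if any(self.fu_writes_reg[other][s] for s in srcs):
                # a source operand is still being computed by 'other'
                self.fu_waits_on[fu][other] = True
        self.fu_writes_reg[fu][dest] = True

    def can_start(self, fu):
        """A function unit may begin once it is waiting on nothing."""
        return not any(self.fu_waits_on[fu])

    def complete(self, fu, dest):
        """Write-back: drop this FU's claim and release anything waiting on it."""
        self.fu_writes_reg[fu][dest] = False
        for other in range(NUM_FUS):
            self.fu_waits_on[other][fu] = False

sb = ToyScoreboard()
sb.issue(fu=0, srcs=[1, 2], dest=3)   # FU0: R3 := R1 op R2
sb.issue(fu=1, srcs=[3], dest=4)      # FU1 needs FU0's result in R3
assert not sb.can_start(1)            # so it must wait...
sb.complete(fu=0, dest=3)
assert sb.can_start(1)                # ...and is released once R3 is written back
```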
Now, let's do something hair-raising. Let's do two crazed things at
once: increase the number of registers to a whopping 256 total (128
floating point and 128 integer), and at the same time allow those
64-bit registers to be broken down into **eight** separate 8-bit
values... *and allow function unit dependencies to exist on them*!

If we didn't properly take this into account in the design, then an
8-bit ADD would require us to "lock", say, register R5 (all 64 bits of
it), absolutely preventing and prohibiting the other seven bytes of R5
from being used until such time as that extremely small 8-bit ADD had
completed. Such a design would be laughed at, its performance would
be so low. Only one 8-bit ADD per clock cycle, when Intel has
recently added 512-bit [SIMD](https://en.wikipedia.org/wiki/SIMD)?

Here's a proposed solution. What if, when an 8-bit operation needs to
do a calculation that goes into the first byte, the other seven bytes
have their own **completely separate** dependency lines in the
register and function unit matrices? It looks like this:

{reorder-alias-bytemask-scheme | link}

If you recall from the [previous updates about
scoreboards](https://www.crowdsupply.com/libre-risc-v/m-class/updates),
it's not the "scoreboard" that's the key: it's these register to
function unit and function unit to function unit dependency matrices
that are the misunderstood key. Let's explain the above diagram.
Firstly, in purple in the bottom left, is a massive matrix of function
units to function units, just as with the standard CDC 6600, except
now there are separate 32-bit function units, 16-bit function units,
and 8-bit function units. In this way, we can have a 32-bit ADD
depending on and waiting for an 8-bit computation, or a 16-bit MUL on
a 32-bit SQRT, and so on. Nothing obviously different there.

Likewise, in the bottom right, in red, we see matrices that have
function units along the rows and registers along the columns, exactly
as with the CDC 6600 standard scoreboard. However, again, we note
that because there are separate 32-bit function units and separate
16-bit and 8-bit function units, there are *three* separate sets of
function unit to register matrices. Also, note that these are kept
separate, where they would be expected to be grouped together.
Except, they're *not* independent, and that's where the diagram at the
top (middle) comes in.

The diagram at the top says, in words: "if you need a 32-bit register
for an operation (using a 32-bit function unit), operations in the
16-bit and 8-bit function units *also* connected to that exact same
register **must** be prevented from proceeding. Also, if you need
eight bits of a register, whilst that does not prevent the other bytes
of the register from being used, it *does* prevent the overlapping
16-bit portion **and the 32-bit and the 64-bit** portions of that same
named register from being used."
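That rule, stated above in words, boils down to a byte-mask overlap
check. Here is an illustrative sketch of it; in the actual hardware
the equivalent is computed with per-byte dependency lines and the
cascade shown in the diagram, not with integer arithmetic.

```python
# Illustrative byte-mask form of the cascade rule. Offsets are in bytes
# within a 64-bit register; widths are 8, 16, 32 or 64 bits.

def byte_mask(offset, width_bits):
    """Mask of which bytes of a 64-bit register an access touches."""
    nbytes = width_bits // 8
    return ((1 << nbytes) - 1) << offset

def conflicts(off_a, width_a, off_b, width_b):
    """Two uses of the *same* register conflict exactly when their bytes overlap."""
    return (byte_mask(off_a, width_a) & byte_mask(off_b, width_b)) != 0

# Using all 64 bits of a register blocks every narrower use of it:
assert conflicts(0, 64, 0, 8) and conflicts(0, 64, 4, 32)
# Using only the lowest byte leaves the next byte and the upper portions free:
assert not conflicts(0, 8, 1, 8) and not conflicts(0, 8, 2, 16)
# ...but it does block the overlapping 16-, 32- and 64-bit views:
assert conflicts(0, 8, 0, 16) and conflicts(0, 8, 0, 32) and conflicts(0, 8, 0, 64)
```

Roughly speaking, each function unit row in the register matrices
carries one such per-byte mask, as individual wires rather than as an
integer.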
This "cascading" relationship is absolutely essential to understand.
If you need register R1 (all of it), you **cannot** go and allocate
any of that register for use in any 32-bit, 16-bit, or 8-bit
operations. This is common sense! However, if you use the lowest
byte (byte 1), you can still use the top three 16-bit portions of R1,
and you can also still use byte 2. This is also common sense!

So in fact, it's actually quite simple, and this "cascade" is simply
and easily propagated down to the function unit dependency matrices,
stopping 32-bit operations from overwriting 8-bit and vice-versa.

### Virtual Registers

The fourth part of the above diagram is the grid in green, in the top
left corner. This is a "virtual" to "real" one-bit table. It's here
because the size of these matrices is so enormous that there is deep
concern about the line driver strength, as well as the actual size.
128 registers means that one single gate, when it goes high or low,
has to "drive" the input of 128 other gates. That takes longer and
longer to do, the higher the number of gates, so it becomes a critical
factor in determining the maximum speed of the entire processor. We
will have to keep an eye on this.

So, to keep the function unit to register matrix size down, this
"virtual" register concept was introduced. Only one bit in each row
of the green table may be active: it says, for example, "IR1 actually
represents that there is an instruction being executed using R3."
This does mean, however, that if this table is not high enough (not
enough IRs), the processor has to stall until an instruction is
completed, so that one register becomes free. Again, another thing to
keep an eye on, in simulations.
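As a rough sketch of how that green table behaves (the table height
chosen here is an arbitrary assumption, and the real thing is a row of
one-bit cells rather than a Python list), allocation and stalling look
something like this:

```python
# Sketch of the "virtual" to "real" table: each row (IR) records which real
# register an in-flight instruction is using; at most one binding per row.

NUM_IRS = 16          # illustrative height only; the real depth is a design knob

class VirtualRegTable:
    def __init__(self):
        self.rows = [None] * NUM_IRS   # row i -> real register number, or None

    def allocate(self, real_reg):
        """Bind a free IR row to 'real_reg'; return the row, or None to stall."""
        for i, binding in enumerate(self.rows):
            if binding is None:
                self.rows[i] = real_reg
                return i
        return None                    # no free IR: issue must stall

    def release(self, row):
        """Instruction completed: free its binding so issue can proceed again."""
        self.rows[row] = None
```

The trade-off described above falls straight out of this: the matrices
shrink from 128 register columns down to the number of IRs, at the
cost of stalling whenever no free row is available.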
### Refinements

The second major concern is the purple matrix, the function unit to
function unit one. Basically, where previously we would have FU1
cover all ADDs, FU2 cover all MUL operations, FU3 cover BRANCH, and so
on, now we have to multiply those numbers by **four** (64-bit ops,
32-bit ops, 16-bit, and 8-bit), which in turn means that the size of
the FU-to-FU matrix has gone up by a staggering **sixteen** times.
This is not really acceptable, so we have to do something different.

The refinement is based on the observation that 16-bit operations may
of course be constructed from 8-bit values, and that 64-bit operations
can be constructed from 32-bit ones. So, what if we skipped the
cascade on 64-bit and 16-bit, and made the cascade out of just 32-bit
and 8-bit? Then, very simply, the top half of a 64-bit source
register is allocated to one function unit, the bottom half to the one
next to it, and when it comes to actually passing the source registers
to the relevant ALU, take from *both* function units.

For 3D, the primary focus is on 32-bit (single-precision
floating-point) performance anyway, so if 64-bit operations happen to
have half the number of reservation stations / function units, and
block more often, we don't mind so much. Also, we can still apply the
same "banks" trick on the register file, except this time with
four-way multiplexing on 32-bit wide banks, and 4x4 crossbars on the
bytes as well:

{register-file-multiplexing | link}

To cope with 16-bit operations, pairs of 8-bit values in adjacent
function units are reserved. Likewise, for 64-bit operations, the
8-bit crossbars are not used, and pairs of 32-bit source values in
adjacent function units in the *32-bit* function unit area are
reserved.

However, the gate count in such a staggered crossbar arrangement is
insane: bear in mind that the register file will be 3R1W or 2R1W (two
or three reads and one write), and that means **three** sets of
crossbars, comprising **four** banks, with effectively 16-byte to
16-byte routing.
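To get a feel for the scale, here is a rough crosspoint count using
only the figures just quoted (up to three ports' worth of crossbars,
four 32-bit banks, so effectively 16-byte to 16-byte routing per
port), and assuming a full any-byte-to-any-byte crossbar for each
port:

```python
# Back-of-envelope routing count for the staggered crossbar arrangement.
# The "full crossbar per port" assumption is ours, purely for illustration.
ports = 3                  # 3R1W: three sets of crossbars
bytes_per_side = 4 * 4     # four banks x four bytes each
bits_per_byte = 8

crosspoints = ports * bytes_per_side * bytes_per_side * bits_per_byte
print(crosspoints)         # 6144 single-bit crosspoints, before any control logic
```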
It's too much, so this will be explored further in later updates.