3d_gpu/architecture/dynamic_simd.mdwn

   1 # Dynamic Partitioned SIMD
   2
   3 The Dynamic Partitioned SIMD Signal is effectively a parallelisation
   4 of nmigen's Signal.  It is expected to work transparently as if it was
   5 a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
   6 no requirement on the part of the developer to even know that it is
   7 performing parallel dynamically-partitioned operations.
   8
   9 nmigen 32-bit Signal:
  10
  11     a        : .... .... .... .... (32 bits, illustrated as 4x8)
  12
  13 Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
  14 sections, by 3 partition bits:
  15
  16     partition:     P    P    P     (3 bits)
  17     a        : .... .... .... .... (32 bits, illustrated as 4x8)
  18     exp-a    : ....P....P....P.... (32+3 bits, P=0 if no partition)
  19
  20 Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
  21 into all 8 possible combinations of the 3 Partition bits:
  22
  23     exp-a    : ....0....0....0.... 1x 32-bit
  24     exp-a    : ....0....0....1.... 1x 24-bit plus 1x 8-bit
  25     exp-a    : ....0....1....0.... 2x 16-bit
  26     ...
  27     ...
  28     exp-a    : ....1....1....0.... 2x 8-bit, 1x 16-bit
  29     exp-a    : ....1....1....1.... 4x 8-bit
  30
  31 A simple example, a "min" function:
  32
  33     # declare x, y and out as 16-bit scalar Signals
  34     x = Signal(16)
  35     y = Signal(16)
  36     out = Signal(16)
  37
  38     # compare x against y and set output accordingly
  39     with m.If(x < y):
  40         comb += out.eq(x)
  41     with m.Else():
  42         comb += out.eq(y)
  43
  44 This is very straightforward and obvious, that the 16-bit output is the
  45 lesser of the x and y inputs.  We require the exact same obviousness
  46 and under no circumstances any change of any kind to any nmigen language
  47 construct:
  48
  49     # a mask of length 3 indicates a desire to partition Signals at
  50     # 3 points into 4 equally-spaced SIMD "partitions".
  51     mask = Signal(3)
  52     # x y and out are all 16-bit so are subdivided at:
  53     # |      mask[0]     mask[1]     mask[3]         |
  54     # |  0-3    |    4-7    |   8-11    |   12-15    |
  55
  56     x = PartitionedSignal(mask, 16)    # identical except for mask
  57     y = PartitionedSignal(mask, 16)    # identical except for mask
  58     out = PartitionedSignal(mask, 16)  # identical except for mask
  59
  60     # all code here is required to be absolutely identical to the
  61     # scalar case, and identical in nmigen language behaviour in
  62     # every way.  no changes to the nmigen language or its use
  63     # are permitted
  64
  65     with m.If(x < y):
  66         comb += out.eq(x)
  67     with m.Else():
  68         comb += out.eq(y)
  69
  70 The purpose of PartitionedSignal is therefore to provide full 100%
  71 transparent SIMD run-time dynamic behaviour as far as end-usage is
  72 concerned.
  73
  74 The alternative is absolutely awful and completely unacceptable
  75 for both maintenance cost and development cost:
  76
  77     # declare x, y and out as 16-bit scalar Signals
  78     x = Signal(16)
  79     y = Signal(16)
  80     out = Signal(16)
  81
  82     # start an absolutely awful unmaintainable duplication of
  83     # SIMD behaviour.
  84     with m.If(mask == 0b111): # 1x 16-bit
  85        # compare x against y and set output accordingly
  86        with m.If(x < y):
  87            comb += out.eq(x)
  88        with m.Else():
  89            comb += out.eq(y)
  90     with m.ElIf(mask == 0b101): # 2x 8-bit
  91        for i in range(2):
  92            xh = x[i*8:(i+1)*8]
  93            yh = y[i*8:(i+1)*8]
  94            outh = out[i*8:(i+1)*8]
  95            # compare halves of x against halves y and set
  96            # halves of output accordingly
  97            with m.If(xh < yh):
  98                comb += outh.eq(xh)
  99            with m.Else():
 100                comb += outh.eq(yh)
 101     with m.ElIf(mask == 0b000): # 4x 4-bit
 102         ....
 103     with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
 104         ....
 105     with m.ElIf(....)
 106         ....
 107     with m.ElIf(....)
 108         ....
 109     with m.ElIf(....)
 110         ....
 111
 112
 113
 114
 115 # Links
 116
 117 * <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch
 118 * <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
 119 * <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
 120 * <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
 121 * <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
 122 * <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
 123
 124 # Rationale / Introduction
 125
 126 To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.
 127
 128 Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.
 129
 130 Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs.  Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.
 131
 132 Pages below describe the basic features of each and track the relevant bugreports.
 133
 134 * [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
 135 * [[dynamic_simd/assign]] nmigen eq (assignment)
 136 * [[dynamic_simd/gt]]
 137 * [[dynamic_simd/add]]
 138 * [[dynamic_simd/cat]] - limited capability
 139 * [[dynamic_simd/mul]]
 140 * [[dynamic_simd/shift]]
 141 * [[dynamic_simd/logicops]] some all xor
 142
 143 # Integration with nmigen
 144
 145 Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements.
 146
 147 With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:
 148
 149     if partitions == 1x64
 150          with m.If(x > y):
 151               do something
 152     elif partitions == 2x32:
 153          with m.If(x[0:31] > y[0:31]):
 154               do something on 1st half
 155          elif ...
 156     elif ...
 157     # many more lines of repeated laborious hand written
 158     # SIMD nonsense all exactly the same except for the
 159     # for loop and sizes.
 160
 161 Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.
 162
 163 A much more intelligent approach is needed. What we actually want is:
 164
 165     with m.If(x > y): # do a partitioned compare here
 166          do something dynamic here
 167
 168 where behind the scenes the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.
 169
 170 This means that nmigen needs to "understand" the partitioning, in m.If, m.Else and m.Switch, at the bare minimum.
 171
 172 Analysis of the internals of nmigen shows that m.If, m.Else and m.Switch are all redirected to ast.py `Switch`.  Within that function Mux and other "global" functions (similar to python operator functions).  The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.
 173
 174 <https://github.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>
 175
 176 m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement.  With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.
 177
 178 For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions.  Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.
 179
 180 Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column.  Take the following example:
 181
 182      mask = Signal(3) # creates four partitions
 183      a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
 184      b = PartitionedSignal(mask, 4) # likewise
 185      c = PartitionedSignal(mask, 32)
 186      d = PartitionedSignal(mask, 32)
 187      o = PartitionedSignal(mask, 32)
 188
 189      with m.If(a):
 190          comb += o.eq(c)
 191      with m.Elif(b):
 192          comb += o.eq(d)
 193
 194 If these were ordinary Signals, they would be translated to a Switch where:
 195
 196 * if_tests would be Cat(a, b) i.e. a 2 bit quantity
 197 * cases would be (quantity 2) "1-" and "-1" in order to match
 198   against the first binary test bit of Cat(a, b) and the second,
 199   respectively.
 200 * the first case would be "1-" to activate `o.eq(c)
 201 * the second case would be "-1" to activate o.eq(d)
 202
 203 A parallel variant may thus perform a for-loop, creating four
 204 **independent** Switches:
 205
 206 * take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
 207 * take the output of each case result `o[0].eq[c[0])` and
 208   so on
 209 * create the first independent Switch
 210 * take a[1] and b[1] etc.
 211
 212 There are several ways in which the parts of each case, when
 213 activated, can be split up: temporary Signals, analysing
 214 the AST, or using PartitionedMux.