ghostmansd: update submitted RFPs
[libreriscv.git] / 3d_gpu / architecture / dynamic_simd.mdwn
1 # Dynamic Partitioned SIMD
2
3 The Dynamic Partitioned SIMD Signal is effectively a parallelisation
4 of nmigen's Signal. It is expected to work transparently as if it was
5 a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
6 no requirement on the part of the developer to even know that it is
7 performing parallel dynamically-partitioned operations.
8
9 nmigen 32-bit Signal:
10
11 a : .... .... .... .... (32 bits, illustrated as 4x8)
12
13 Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
14 sections, by 3 partition bits:
15
16 partition: P P P (3 bits)
17 a : .... .... .... .... (32 bits, illustrated as 4x8)
18 exp-a : ....P....P....P.... (32+3 bits, P=0 if no partition)
19
20 Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
21 into all 8 possible combinations of the 3 Partition bits:
22
23 exp-a : ....0....0....0.... 1x 32-bit
24 exp-a : ....0....0....1.... 1x 24-bit plus 1x 8-bit
25 exp-a : ....0....1....0.... 2x 16-bit
26 ...
27 ...
28 exp-a : ....1....1....0.... 2x 8-bit, 1x 16-bit
29 exp-a : ....1....1....1.... 4x 8-bit
30
31 A simple example, a "min" function:
32
33 # declare x, y and out as 16-bit scalar Signals
34 x = Signal(16)
35 y = Signal(16)
36 out = Signal(16)
37
38 # compare x against y and set output accordingly
39 with m.If(x < y):
40 comb += out.eq(x)
41 with m.Else():
42 comb += out.eq(y)
43
44 This is very straightforward and obvious, that the 16-bit output is the
45 lesser of the x and y inputs. We require the exact same obviousness
46 and under no circumstances any change of any kind to any nmigen language
47 construct:
48
49 # a mask of length 3 indicates a desire to partition Signals at
50 # 3 points into 4 equally-spaced SIMD "partitions".
51 mask = Signal(3)
52 # x y and out are all 16-bit so are subdivided at:
53 # | mask[0] mask[1] mask[3] |
54 # | 0-3 | 4-7 | 8-11 | 12-15 |
55
56 x = PartitionedSignal(mask, 16) # identical except for mask
57 y = PartitionedSignal(mask, 16) # identical except for mask
58 out = PartitionedSignal(mask, 16) # identical except for mask
59
60 # all code here is required to be absolutely identical to the
61 # scalar case, and identical in nmigen language behaviour in
62 # every way. no changes to the nmigen language or its use
63 # are permitted
64
65 with m.If(x < y):
66 comb += out.eq(x)
67 with m.Else():
68 comb += out.eq(y)
69
70 The purpose of PartitionedSignal is therefore to provide full 100%
71 transparent SIMD run-time dynamic behaviour as far as end-usage is
72 concerned.
73
74 The alternative is absolutely awful and completely unacceptable
75 for both maintenance cost and development cost:
76
77 # declare x, y and out as 16-bit scalar Signals
78 x = Signal(16)
79 y = Signal(16)
80 out = Signal(16)
81
82 # start an absolutely awful unmaintainable duplication of
83 # SIMD behaviour.
84 with m.If(mask == 0b111): # 1x 16-bit
85 # compare x against y and set output accordingly
86 with m.If(x < y):
87 comb += out.eq(x)
88 with m.Else():
89 comb += out.eq(y)
90 with m.ElIf(mask == 0b101): # 2x 8-bit
91 for i in range(2):
92 xh = x[i*8:(i+1)*8]
93 yh = y[i*8:(i+1)*8]
94 outh = out[i*8:(i+1)*8]
95 # compare halves of x against halves y and set
96 # halves of output accordingly
97 with m.If(xh < yh):
98 comb += outh.eq(xh)
99 with m.Else():
100 comb += outh.eq(yh)
101 with m.ElIf(mask == 0b000): # 4x 4-bit
102 ....
103 with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
104 ....
105 with m.ElIf(....)
106 ....
107 with m.ElIf(....)
108 ....
109 with m.ElIf(....)
110 ....
111
112
113
114
115 # Links
116
117 * <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch
118 * <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
119 * <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
120 * <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
121 * <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
122 * <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
123
124 # Rationale / Introduction
125
126 To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.
127
128 Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.
129
130 Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs. Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.
131
132 Pages below describe the basic features of each and track the relevant bugreports.
133
134 * [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
135 * [[dynamic_simd/assign]] nmigen eq (assignment)
136 * [[dynamic_simd/gt]]
137 * [[dynamic_simd/add]]
138 * [[dynamic_simd/cat]] - limited capability
139 * [[dynamic_simd/mul]]
140 * [[dynamic_simd/shift]]
141 * [[dynamic_simd/logicops]] some all xor
142
143 # Integration with nmigen
144
145 Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements.
146
147 With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:
148
149 if partitions == 1x64
150 with m.If(x > y):
151 do something
152 elif partitions == 2x32:
153 with m.If(x[0:31] > y[0:31]):
154 do something on 1st half
155 elif ...
156 elif ...
157 # many more lines of repeated laborious hand written
158 # SIMD nonsense all exactly the same except for the
159 # for loop and sizes.
160
161 Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.
162
163 A much more intelligent approach is needed. What we actually want is:
164
165 with m.If(x > y): # do a partitioned compare here
166 do something dynamic here
167
168 where behind the scenes the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.
169
170 This means that nmigen needs to "understand" the partitioning, in m.If, m.Else and m.Switch, at the bare minimum.
171
172 Analysis of the internals of nmigen shows that m.If, m.Else and m.Switch are all redirected to ast.py `Switch`. Within that function Mux and other "global" functions (similar to python operator functions). The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.
173
174 <https://github.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>
175
176 m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement. With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.
177
178 For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions. Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.
179
180 Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column. Take the following example:
181
182 mask = Signal(3) # creates four partitions
183 a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
184 b = PartitionedSignal(mask, 4) # likewise
185 c = PartitionedSignal(mask, 32)
186 d = PartitionedSignal(mask, 32)
187 o = PartitionedSignal(mask, 32)
188
189 with m.If(a):
190 comb += o.eq(c)
191 with m.Elif(b):
192 comb += o.eq(d)
193
194 If these were ordinary Signals, they would be translated to a Switch where:
195
196 * if_tests would be Cat(a, b) i.e. a 2 bit quantity
197 * cases would be (quantity 2) "1-" and "-1" in order to match
198 against the first binary test bit of Cat(a, b) and the second,
199 respectively.
200 * the first case would be "1-" to activate `o.eq(c)
201 * the second case would be "-1" to activate o.eq(d)
202
203 A parallel variant may thus perform a for-loop, creating four
204 **independent** Switches:
205
206 * take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
207 * take the output of each case result `o[0].eq[c[0])` and
208 so on
209 * create the first independent Switch
210 * take a[1] and b[1] etc.
211
212 There are several ways in which the parts of each case, when
213 activated, can be split up: temporary Signals, analysing
214 the AST, or using PartitionedMux.