[[!tag standards]]

# REMAP Matrix pseudocode

The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:

```
[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
```

An easier-to-read version (using python iterators) shows the loop nesting:

```
[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
```

Each element index from the for-loop `0..VL-1`
is run through the above algorithm to work out the **actual** element
index, instead. Given that there are four possible SHAPE entries, up to
four separate registers in any given operation may be simultaneously
remapped:

```
function op_add(RT, RA, RB) # add not VADD!
  ...
  ...
  for (i=0, id=0, irs1=0, irs2=0; i < VL; i++)
    SVSTATE.srcstep = i # save context
    if (predval & 1<<i) # predication mask
      GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
                            GPR[RB+remap3(irs2)];
      if (!int_vec[RT].isvector) break;
    if (int_vec[RT].isvector) { id += 1; }
    if (int_vec[RA].isvector) { irs1 += 1; }
    if (int_vec[RB].isvector) { irs2 += 1; }
```

By changing remappings, 2D matrices may be transposed "in-place" for one
operation, followed by setting a different permutation order without
having to move the values in the registers to or from memory.

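As an illustration of the transpose access pattern (a sketch only, not the normative remap.py above; the function name is invented), a "yx" permute walks a row-major matrix column-first:

```python
def remap_2d(xdim, ydim, permute="yx"):
    """Yield row-major element indices in the order selected by the
    axis permutation: "xy" is the identity order, "yx" walks columns
    first, giving an in-place transpose access pattern."""
    if permute == "xy":
        for y in range(ydim):
            for x in range(xdim):
                yield y * xdim + x
    else:  # "yx"
        for x in range(xdim):
            for y in range(ydim):
                yield y * xdim + x

# a 2x3 row-major matrix [[0,1,2],[3,4,5]] visited column-first:
print(list(remap_2d(3, 2, "yx")))  # [0, 3, 1, 4, 2, 5]
```
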
Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets *polymorphic* elements *within* registers rather
  than entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change. However, modulo MVL still occurs,
  which will result in repeated operations (use with caution).
* If two or more dimensions are set to zero, the actual order does not change!
* The above algorithm is pseudo-code **only**. Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See SVSTATE SPR, which records the current element index.
  Continuing after return from an interrupt may introduce latency
  due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets. The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped. *This even includes unit-strided LD/ST*
  and other operations in that category, where in that case it will
  be the **offset** that is remapped.
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register. Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or
  to perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  less than MVL is **perfectly legal**, albeit very obscure. It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements, *including polymorphic ones*.

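The wrap-around behaviour noted in the list above can be sketched as follows (a hypothetical illustration of the index arithmetic only, not the exact SPR semantics):

```python
def wrapped_indices(vl, xdim, ydim, zdim=1):
    """When (xdim * ydim * zdim) is less than VL the remapped index
    wraps modulo the total, so the same elements are presented to
    the operation more than once."""
    total = xdim * ydim * zdim
    return [i % total for i in range(vl)]

# VL=8 over a 2x2 shape: each of the 4 elements is used twice,
# allowing the target registers to act as accumulators.
print(wrapped_indices(8, 2, 2))  # [0, 1, 2, 3, 0, 1, 2, 3]
```
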
Clearly, considerable care needs to be taken here, as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps. Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.

# REMAP FFT pseudocode

The algorithm below shows how FFT REMAP works, and may be
executed as a python program:

```
[[!inline pages="openpower/sv/remap_fft_yield.py" quick="yes" raw="yes" ]]
```

The executable code above is designed to show how a hardware
implementation may generate Indices which are completely
independent of the Execution of element-level operations,
even for something as complex as a Triple-loop Cooley-Tukey
Schedule. A comprehensive demo and test suite may be found
[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD).

Uses extend beyond DFT and NTT: the Schedules are not
restricted in any way, and if the programmer can find any algorithm
which has identical triple nesting then the FFT Schedule may be
used.
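
As an illustration of such a triple nesting (a sketch only: the remap_fft_yield.py listing above is the reference), a plain radix-2 Cooley-Tukey butterfly schedule generates every (lower, upper) index pair without performing any arithmetic on the data itself:

```python
def butterfly_schedule(n):
    """Yield (lower, upper) element-index pairs for an in-place
    radix-2 FFT of size n: the index generation is a pure triple
    loop, fully independent of the butterfly arithmetic."""
    size = 2
    while size <= n:                 # outer: butterfly span
        half = size // 2
        for start in range(0, n, size):           # middle: block
            for i in range(start, start + half):  # inner: element
                yield i, i + half
        size *= 2

print(list(butterfly_schedule(4)))
# [(0, 1), (2, 3), (0, 2), (1, 3)]
```
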

# 4x4 Matrix to vec4 Multiply Example

The following settings will allow a 4x4 matrix (starting at f8), expressed
as a sequence of 16 numbers first by row then by column, to be multiplied
by a vector of length 4 (starting at f0), using a single FMAC instruction.

* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
* VL=16, f4=vec, f0=vec, f8=vec
* FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first
four iterations through the hardware loop, the REMAPed index will not
increment. On the second four, the index will increase by one. Likewise
on each subsequent group of four.

The permutation on SHAPE1 will increment f4 continuously, cycling through
f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, its element index
simply increments sequentially through the 16 values f8-f23 in the Matrix.
The equivalent sequence thus issued is:

```
fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f0, f10, f6
fmac f7, f0, f11, f7
fmac f4, f1, f12, f4
fmac f5, f1, f13, f5
fmac f6, f1, f14, f6
fmac f7, f1, f15, f7
fmac f4, f2, f16, f4
fmac f5, f2, f17, f5
fmac f6, f2, f18, f6
fmac f7, f2, f19, f7
fmac f4, f3, f20, f4
fmac f5, f3, f21, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
```

The only other instruction required is to ensure that f4-f7 are
initialised (usually to zero).
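
The sequence can be cross-checked with a small python model (illustrative only: the list `f` stands in for the FP register file, with arbitrary test values):

```python
# Model of the 16 remapped FMACs: each iteration performs
#   f[dst] = f[src] * f[8 + k] + f[dst]
f = [float(i) for i in range(24)]  # f0-f23 with dummy contents
f[4:8] = [0.0] * 4                 # accumulators f4-f7 zeroed
vec = f[0:4]                       # the vec4 at f0-f3
mat = f[8:24]                      # 4x4 matrix at f8-f23, row-major

for k in range(16):
    dst = 4 + (k % 4)   # SHAPE1: cycles through f4-f7
    src = 0 + (k // 4)  # SHAPE0: f0-f3, advancing every 4 iterations
    f[dst] += f[src] * f[8 + k]

# compare against a straightforward vector-times-matrix product
expected = [sum(vec[r] * mat[r * 4 + c] for r in range(4))
            for c in range(4)]
assert f[4:8] == expected
print(f[4:8])  # [104.0, 110.0, 116.0, 122.0]
```
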

It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
the same technique applied to four independent vectors, can be done by
setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
it to apply four times to compute the four columns worth of vectors.

# Warshall transitive closure algorithm

TODO move to [[sv/remap/discussion]] page, copied from here
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html

with thanks to Hendrik.

<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>

> Just a note: interpreting + as 'or', and * as 'and',
> operating on Boolean matrices,
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.

this can be done with the ternary instruction, which has
an in-place triple boolean input:

```
RT = RT | (RA & RB)
```

and also has a CR Field variant of the same.

notes from conversations:

> > for y in y_r:
> >   for x in x_r:
> >     for z in z_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]

> This nesting of loops works for matrix multiply, but not for transitive
> closure.

> > it can be done:
> >
> > for z in z_r:
> >   for y in y_r:
> >     for x in x_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]
>
> And this ordering of loops *does* work for transitive closure, when a,
> b, and result are the very same matrix, updated while being used.
>
> By the way, I believe there is a graph algorithm that does the
> transitive closure thing, but instead of using boolean, "and", and "or",
> they use real numbers, addition, and minimum.  I think that one computes
> shortest paths between vertices.
>
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).

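A direct python rendering of that z-outermost nesting, with result, a and b all being the very same matrix (matching the in-place `RT = RT | (RA & RB)` ternary), might look like:

```python
def transitive_closure(m):
    """In-place Warshall transitive closure of a boolean adjacency
    matrix: + is 'or', * is 'and', and the z loop is outermost."""
    n = len(m)
    for z in range(n):
        for y in range(n):
            for x in range(n):
                # same form as the ternary: RT = RT | (RA & RB)
                m[y][x] |= m[y][z] & m[z][x]
    return m

# edges 0->1 and 1->2 imply the path 0->2
g = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(transitive_closure(g))  # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```
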
# SUBVL Remap

Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
itself must be considered to be the "element". To perform REMAP
on the elements of a vec2/3/4 there are three options: use
[[sv/mv.swizzle]]; since the sub-elements themselves are contiguous,
treat them as such and use Indexing; or add one
extra dimension to Matrix REMAP, the inner dimension being the size
of the Subvector (2, 3, or 4).

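The extra-inner-dimension option can be sketched as follows (an illustration only; the function name is invented): the sub-element index becomes the innermost loop, so each vec2/3/4 stays contiguous while the outer REMAP reorders whole subvectors:

```python
def subvl_remap(outer_indices, subvl):
    """Expand element indices into sub-element indices: the whole
    vec2/3/4 is the "element", so its subvl sub-elements remain
    contiguous while the outer REMAP reorders the subvectors."""
    for el in outer_indices:
        for sub in range(subvl):
            yield el * subvl + sub

# an outer REMAP order [2, 0, 1] applied to vec2 elements:
print(list(subvl_remap([2, 0, 1], 2)))  # [4, 5, 0, 1, 2, 3]
```
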
Note that Swizzle on Sub-vectors may be applied on top of REMAP.
One place where this is appropriate is the Rijndael MixColumns
stage:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />

Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`,
a 2D REMAP allows:

* the column bytes (as a vec4) to be iterated over as an inner loop,
  progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.

This is achieved entirely in-place, without special 128-bit opcodes.
Below is the pseudocode for [[!wikipedia Rijndael MixColumns]]:

```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but vec4 byte-level
    // elwidth overrides can be done though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // These may then each be 4x 8bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```

The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations. Given wide enough SIMD backends in hardware, these 32-bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.
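
The swizzle formulation in the comments above can be checked with a small python model (illustrative only: plain list rotations stand in for the WXYZ-style vec4 swizzles):

```python
def rot(v, k):
    """Left-rotate a 4-element list by k: element i of the result is
    v[(i + k) % 4], standing in for the WXYZ/ZWXY/YZWX swizzles."""
    return v[k:] + v[:k]

def xtime(c):
    """GF(256) multiply-by-2, as computed for b[] in gmix_column."""
    return ((c << 1) ^ (0x1B if c & 0x80 else 0)) & 0xFF

def mixcolumn_bytes(r):
    """Byte-level reference, transcribed from gmix_column above."""
    a, b = list(r), [xtime(c) for c in r]
    return [b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1],
            b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2],
            b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3],
            b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]]

def mixcolumn_swizzle(r):
    """Same result via whole-vec4 XORs of rotated (swizzled) copies:
    out = b ^ a.WXYZ ^ a.ZWXY ^ (b ^ a).YZWX."""
    a, b = list(r), [xtime(c) for c in r]
    out = b[:]
    for src, k in ((a, 3), (a, 2), (b, 1), (a, 1)):
        out = [o ^ s for o, s in zip(out, rot(src, k))]
    return out

col = [0xDB, 0x13, 0x53, 0x45]  # well-known MixColumns test column
assert mixcolumn_swizzle(col) == mixcolumn_bytes(col)
print([hex(x) for x in mixcolumn_swizzle(col)])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```

Each rotation corresponds one-to-one with a commented swizzle, so the whole column update is four 32-bit XORs rather than sixteen byte operations.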

The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.
