move REMAP pseudocode to later pages
[libreriscv.git] / openpower / sv / remap / appendix.mdwn
1 ## REMAP Matrix pseudocode
2
3 The algorithm below shows how REMAP works more clearly, and may be
4 executed as a python program:
5
6 ```
7 [[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
8 ```
9
10 An easier-to-read version (using python iterators) is given in
11 a later section of this Appendix.
12
13 Each element index from the for-loop `0..VL-1`
14 is run through the above algorithm to work out the **actual** element
15 index, instead. Given that there are four possible SHAPE entries, up to
16 four separate registers in any given operation may be simultaneously
17 remapped:
18
19 ```
20 function op_add(RT, RA, RB) # add not VADD!
21  for (i=0,id=0,irs1=0,irs2=0; i < VL; i++)
22 SVSTATE.srcstep = i # save context
23 if (predval & 1<<i) # predication mask
24    GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
25 GPR[RB+remap3(irs2)];
26 if (!RT.isvector) break;
27 if (RT.isvector)  { id += 1; }
28 if (RA.isvector)  { irs1 += 1; }
29 if (RB.isvector)  { irs2 += 1; }
30 ```
31
32 By changing remappings, 2D matrices may be transposed "in-place" for one
33 operation, followed by setting a different permutation order without
34 having to move the values in the registers to or from memory.
35
36 Note that:
37
38 * Over-running the register file clearly has to be detected and
39 an illegal instruction exception thrown
40 * When non-default elwidths are set, the exact same algorithm still
41 applies (i.e. it offsets *polymorphic* elements *within* registers rather
42 than entire registers).
43 * If permute option 000 is utilised, the actual order of the
44 reindexing does not change. However, modulo MVL still occurs
45 which will result in repeated operations (use with caution).
46 * If two or more dimensions are set to zero, the actual order does not change!
47 * The above algorithm is pseudo-code **only**. Actual implementations
48 will need to take into account the fact that the element for-looping
49 must be **re-entrant**, due to the possibility of exceptions occurring.
50 See SVSTATE SPR, which records the current element index.
51 Continuing after return from an interrupt may introduce latency
52 due to re-computation of the remapped offsets.
53 * Twin-predicated operations require **two** separate and distinct
54 element offsets. The above pseudo-code algorithm will be applied
55 separately and independently to each, should each of the two
56 operands be remapped. *This even includes unit-strided LD/ST*
57 and other operations
58 in that category, where in that case it will be the **address offset**
59 that is remapped: `EA <- (RA) + immediate * REMAP(elementoffset)`.
60 * Offset is especially useful, on its own, for accessing elements
61 within the middle of a register. Without offsets, it is necessary
62 to either use a predicated MV, skipping the first elements, or
63 performing a LOAD/STORE cycle to memory.
64 With offsets, the data does not have to be moved.
65 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
66 less than MVL is **perfectly legal**, albeit very obscure. It permits
67 entries to be regularly presented to operands **more than once**, thus
68 allowing the same underlying registers to act as an accumulator of
69 multiple vector or matrix operations, for example.
70 * Note especially that Program Order **must** still be respected
71 even when overlaps occur that read or write the same register
72 elements *including polymorphic ones*
73
74 Clearly here some considerable care needs to be taken as the remapping
75 could hypothetically create arithmetic operations that target the
76 exact same underlying registers, resulting in data corruption due to
77 pipeline overlaps. Out-of-order / Superscalar micro-architectures with
78 register-renaming will have an easier time dealing with this than
79 DSP-style SIMD micro-architectures.
80
81
82 ### 4x4 Matrix to vec4 Multiply (4x4 by 1x4)
83
84 The following settings will allow a 4x4 matrix (starting at f8), expressed
85 as a sequence of 16 numbers first by row then by column, to be multiplied
86 by a vector of length 4 (starting at f0), using a single FMAC instruction.
87
88 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
89 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
90 * VL=16, f4=vec, f0=vec, f8=vec
91 * FMAC f4, f0, f8, f4
92
93 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
94 four iterations through the hardware loop, the REMAPed index will not
95 increment. On the second four, the index will increase by one. Likewise
96 on each subsequent group of four.
97
98 The permutation on SHAPE1 will increment f4 continuously cycling through
99 f4-f7 every four iterations of the hardware loop.
100
101 At the same time, VL will, because there is no SHAPE on f8, increment
102 straight sequentially through the 16 values f8-f23 in the Matrix. The
103 equivalent sequence thus is issued:
104
105 ```
106 fmac f4, f0, f8, f4
107 fmac f5, f0, f9, f5
108 fmac f6, f0, f10, f6
109 fmac f7, f0, f11, f7
110 fmac f4, f1, f12, f4
111 fmac f5, f1, f13, f5
112 fmac f6, f1, f14, f6
113 fmac f7, f1, f15, f7
114 fmac f4, f2, f16, f4
115 fmac f5, f2, f17, f5
116 fmac f6, f2, f18, f6
117 fmac f7, f2, f19, f7
118 fmac f4, f3, f20, f4
119 fmac f5, f3, f21, f5
120 fmac f6, f3, f22, f6
121 fmac f7, f3, f23, f7
122 ```
123
124 Hardware should easily pipeline the above FMACs and as long as each FMAC
125 completes in 4 cycles or less there should be 100% sustained throughput,
126 from the one single Vector FMAC.
127
128 The only other instruction required is to ensure that f4-f7 are
129 initialised (usually to zero) however obviously if used as part
130 of some other computation, which is frequently the case, then
131 clearly the zeroing is not needed.
132
133 ## REMAP FFT, DFT, NTT
134
135 The algorithm from a later section of this Appendix shows how FFT REMAP works,
136 and it may be executed as a standalone python3 program.
137 The executable code is designed to illustrate how a hardware
138 implementation may generate Indices which are completely
139 independent of the Execution of element-level operations,
140 even for something as complex as a Triple-loop Tukey-Cooley
141 Schedule. A comprehensive demo and test suite may be found
142 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD)
143 including Complex Number FFT which deploys Vertical-First Mode
144 on top of the REMAP Schedules.
145
146 Other uses include more than DFT and NTT: as abstracted RISC-paradigm
147 the Schedules are not
148 restricted in any way or tied to any particular instruction.
149 If the programmer can find any algorithm
150 which has identical triple nesting then the FFT Schedule may be
151 used even there.
152
153 ## svshape pseudocode
154
155 ```
156 # for convenience, VL to be calculated and stored in SVSTATE
157 vlen <- [0] * 7
158 mscale[0:5] <- 0b000001 # for scaling MAXVL
159 itercount[0:6] <- [0] * 7
160 SVSTATE[0:31] <- [0] * 32
161 # only overwrite REMAP if "persistence" is zero
162 if (SVSTATE[62] = 0b0) then
163 SVSTATE[32:33] <- 0b00
164 SVSTATE[34:35] <- 0b00
165 SVSTATE[36:37] <- 0b00
166 SVSTATE[38:39] <- 0b00
167 SVSTATE[40:41] <- 0b00
168 SVSTATE[42:46] <- 0b00000
169 SVSTATE[62] <- 0b0
170 SVSTATE[63] <- 0b0
171 # clear out all SVSHAPEs
172 SVSHAPE0[0:31] <- [0] * 32
173 SVSHAPE1[0:31] <- [0] * 32
174 SVSHAPE2[0:31] <- [0] * 32
175 SVSHAPE3[0:31] <- [0] * 32
176
177 # set schedule up for multiply
178 if (SVrm = 0b0000) then
179 # VL in Matrix Multiply is xd*yd*zd
180 xd <- (0b00 || SVxd) + 1
181 yd <- (0b00 || SVyd) + 1
182 zd <- (0b00 || SVzd) + 1
183 n <- xd * yd * zd
184 vlen[0:6] <- n[14:20]
185 # set up template in SVSHAPE0, then copy to 1-3
186 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
187 SVSHAPE0[6:11] <- (0b0 || SVyd) # ydim
188 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim
189 SVSHAPE0[28:29] <- 0b11 # skip z
190 # copy
191 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
192 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
193 SVSHAPE3[0:31] <- SVSHAPE0[0:31]
194 # set up FRA
195 SVSHAPE1[18:20] <- 0b001 # permute x,z,y
196 SVSHAPE1[28:29] <- 0b01 # skip z
197 # FRC
198 SVSHAPE2[18:20] <- 0b001 # permute x,z,y
199 SVSHAPE2[28:29] <- 0b11 # skip y
200
201 # set schedule up for FFT butterfly
202 if (SVrm = 0b0001) then
203 # calculate O(N log2 N)
204 n <- [0] * 3
205 do while n < 5
206 if SVxd[4-n] = 0 then
207 leave
208 n <- n + 1
209 n <- ((0b0 || SVxd) + 1) * n
210 vlen[0:6] <- n[1:7]
211 # set up template in SVSHAPE0, then copy to 1-3
212 # for FRA and FRT
213 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
214 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D FFT)
215 mscale <- (0b0 || SVzd) + 1
216 SVSHAPE0[30:31] <- 0b01 # Butterfly mode
217 # copy
218 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
219 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
220 # set up FRB and FRS
221 SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
222 # FRC (coefficients)
223 SVSHAPE2[28:29] <- 0b10 # k schedule
224
225 # set schedule up for (i)DCT Inner butterfly
226 # SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
227 if ((SVrm = 0b0100) |
228 (SVrm = 0b1100)) then
229 # calculate O(N log2 N)
230 n <- [0] * 3
231 do while n < 5
232 if SVxd[4-n] = 0 then
233 leave
234 n <- n + 1
235 n <- ((0b0 || SVxd) + 1) * n
236 vlen[0:6] <- n[1:7]
237 # set up template in SVSHAPE0, then copy to 1-3
238 # set up FRB and FRS
239 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
240 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
241 mscale <- (0b0 || SVzd) + 1
242 if (SVrm = 0b1100) then
243 SVSHAPE0[30:31] <- 0b11 # iDCT mode
244 SVSHAPE0[18:20] <- 0b011 # iDCT Inner Butterfly sub-mode
245 else
246 SVSHAPE0[30:31] <- 0b01 # DCT mode
247 SVSHAPE0[18:20] <- 0b001 # DCT Inner Butterfly sub-mode
248 SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop
249 SVSHAPE0[6:11] <- 0b000011 # (i)DCT Inner Butterfly mode 4
250 # copy
251 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
252 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
253 if (SVrm != 0b0100) & (SVrm != 0b1100) then
254 SVSHAPE3[0:31] <- SVSHAPE0[0:31]
255 # for FRA and FRT
256 SVSHAPE0[28:29] <- 0b01 # j+halfstep schedule
257 # for cos coefficient
258 SVSHAPE2[28:29] <- 0b10 # ci (k for mode 4) schedule
259 SVSHAPE2[12:17] <- 0b000000 # reset costable "striding" to 1
260 if (SVrm != 0b0100) & (SVrm != 0b1100) then
261 SVSHAPE3[28:29] <- 0b11 # size schedule
262
263 # set schedule up for (i)DCT Outer butterfly
264 if (SVrm = 0b0011) | (SVrm = 0b1011) then
265 # calculate O(N log2 N) number of outer butterfly overlapping adds
266 vlen[0:6] <- [0] * 7
267 n <- 0b000
268 size <- 0b0000001
269 itercount[0:6] <- (0b00 || SVxd) + 0b0000001
270 itercount[0:6] <- (0b0 || itercount[0:5])
271 do while n < 5
272 if SVxd[4-n] = 0 then
273 leave
274 n <- n + 1
275 count <- (itercount - 0b0000001) * size
276 vlen[0:6] <- vlen + count[7:13]
277 size[0:6] <- (size[1:6] || 0b0)
278 itercount[0:6] <- (0b0 || itercount[0:5])
279 # set up template in SVSHAPE0, then copy to 1-3
280 # set up FRB and FRS
281 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
282 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
283 mscale <- (0b0 || SVzd) + 1
284 if (SVrm = 0b1011) then
285 SVSHAPE0[30:31] <- 0b11 # iDCT mode
286 SVSHAPE0[18:20] <- 0b011 # iDCT Outer Butterfly sub-mode
287 SVSHAPE0[21:23] <- 0b101 # "inverse" on outer and inner loop
288 else
289 SVSHAPE0[30:31] <- 0b01 # DCT mode
290 SVSHAPE0[18:20] <- 0b100 # DCT Outer Butterfly sub-mode
291 SVSHAPE0[6:11] <- 0b000010 # DCT Butterfly mode
292 # copy
293 SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
294 SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
295 # for FRA and FRT
296 SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
297 # reset costable "striding" to 1
298 SVSHAPE2[12:17] <- 0b000000
299
300 # set schedule up for DCT COS table generation
301 if (SVrm = 0b0101) | (SVrm = 0b1101) then
302 # calculate O(N log2 N)
303 vlen[0:6] <- [0] * 7
304 itercount[0:6] <- (0b00 || SVxd) + 0b0000001
305 itercount[0:6] <- (0b0 || itercount[0:5])
306 n <- [0] * 3
307 do while n < 5
308 if SVxd[4-n] = 0 then
309 leave
310 n <- n + 1
311 vlen[0:6] <- vlen + itercount
312 itercount[0:6] <- (0b0 || itercount[0:5])
313 # set up template in SVSHAPE0, then copy to 1-3
314 # set up FRB and FRS
315 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
316 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
317 mscale <- (0b0 || SVzd) + 1
318 SVSHAPE0[30:31] <- 0b01 # DCT/FFT mode
319 SVSHAPE0[6:11] <- 0b000100 # DCT Inner Butterfly COS-gen mode
320 if (SVrm = 0b0101) then
321 SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop for DCT
322 # copy
323 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
324 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
325 # for cos coefficient
326 SVSHAPE1[28:29] <- 0b10 # ci schedule
327 SVSHAPE2[28:29] <- 0b11 # size schedule
328
329 # set schedule up for iDCT / DCT inverse of half-swapped ordering
330 if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
331 vlen[0:6] <- (0b00 || SVxd) + 0b0000001
332 # set up template in SVSHAPE0
333 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
334 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
335 mscale <- (0b0 || SVzd) + 1
336 if (SVrm = 0b1110) then
337 SVSHAPE0[18:20] <- 0b001 # DCT opposite half-swap
338 if (SVrm = 0b1111) then
339 SVSHAPE0[30:31] <- 0b01 # FFT mode
340 else
341 SVSHAPE0[30:31] <- 0b11 # DCT mode
342 SVSHAPE0[6:11] <- 0b000101 # DCT "half-swap" mode
343
344 # set schedule up for parallel reduction
345 if (SVrm = 0b0111) then
346 # calculate the total number of operations (brute-force)
347 vlen[0:6] <- [0] * 7
348 itercount[0:6] <- (0b00 || SVxd) + 0b0000001
349 step[0:6] <- 0b0000001
350 i[0:6] <- 0b0000000
351 do while step <u itercount
352 newstep <- step[1:6] || 0b0
353 j[0:6] <- 0b0000000
354 do while (j+step <u itercount)
355 j <- j + newstep
356 i <- i + 1
357 step <- newstep
358 # VL in Parallel-Reduce is the number of operations
359 vlen[0:6] <- i
360 # set up template in SVSHAPE0, then copy to 1. only 2 needed
361 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
362 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
363 mscale <- (0b0 || SVzd) + 1
364 SVSHAPE0[30:31] <- 0b10 # parallel reduce submode
365 # copy
366 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
367 # set up right operand (left operand 28:29 is zero)
368 SVSHAPE1[28:29] <- 0b01 # right operand
369
370 # set VL, MVL and Vertical-First
371 m[0:12] <- vlen * mscale
372 maxvl[0:6] <- m[6:12]
373 SVSTATE[0:6] <- maxvl # MAVXL
374 SVSTATE[7:13] <- vlen # VL
375 SVSTATE[63] <- vf
376 ```
377
378 ## svindex pseudocode
379
380 ```
381 # based on nearest MAXVL compute other dimension
382 MVL <- SVSTATE[0:6]
383 d <- [0] * 6
384 dim <- SVd+1
385 do while d*dim <u ([0]*4 || MVL)
386 d <- d + 1
387
388 # set up template, then copy once location identified
389 shape <- [0]*32
390 shape[30:31] <- 0b00 # mode
391 if SVyx = 0 then
392 shape[18:20] <- 0b110 # indexed xd/yd
393 shape[0:5] <- (0b0 || SVd) # xdim
394 if sk = 0 then shape[6:11] <- 0 # ydim
395 else shape[6:11] <- 0b111111 # ydim max
396 else
397 shape[18:20] <- 0b111 # indexed yd/xd
398 if sk = 1 then shape[6:11] <- 0 # ydim
399 else shape[6:11] <- d-1 # ydim max
400 shape[0:5] <- (0b0 || SVd) # ydim
401 shape[12:17] <- (0b0 || SVG) # SVGPR
402 shape[28:29] <- ew # element-width override
403 shape[21] <- sk # skip 1st dimension
404
405 # select the mode for updating SVSHAPEs
406 SVSTATE[62] <- mm # set or clear persistence
407 if mm = 0 then
408 # clear out all SVSHAPEs first
409 SVSHAPE0[0:31] <- [0] * 32
410 SVSHAPE1[0:31] <- [0] * 32
411 SVSHAPE2[0:31] <- [0] * 32
412 SVSHAPE3[0:31] <- [0] * 32
413 SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
414 SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
415 idx <- 0
416 for bit = 0 to 4
417 if rmm[4-bit] then
418 # activate requested shape
419 if idx = 0 then SVSHAPE0 <- shape
420 if idx = 1 then SVSHAPE1 <- shape
421 if idx = 2 then SVSHAPE2 <- shape
422 if idx = 3 then SVSHAPE3 <- shape
423 SVSTATE[bit*2+32:bit*2+33] <- idx
424 # increment shape index, modulo 4
425 if idx = 3 then idx <- 0
426 else idx <- idx + 1
427 else
428 # refined SVSHAPE/REMAP update mode
429 bit <- rmm[0:2]
430 idx <- rmm[3:4]
431 if idx = 0 then SVSHAPE0 <- shape
432 if idx = 1 then SVSHAPE1 <- shape
433 if idx = 2 then SVSHAPE2 <- shape
434 if idx = 3 then SVSHAPE3 <- shape
435 SVSTATE[bit*2+32:bit*2+33] <- idx
436 SVSTATE[46-bit] <- 1
437 ```
438
439 ## svshape2 pseudocode
440
441 ```
442 # based on nearest MAXVL compute other dimension
443 MVL <- SVSTATE[0:6]
444 d <- [0] * 6
445 dim <- SVd+1
446 do while d*dim <u ([0]*4 || MVL)
447 d <- d + 1
448 # set up template, then copy once location identified
449 shape <- [0]*32
450 shape[30:31] <- 0b00 # mode
451 shape[0:5] <- (0b0 || SVd) # x/ydim
452 if SVyx = 0 then
453 shape[18:20] <- 0b000 # ordering xd/yd(/zd)
454 if sk = 0 then shape[6:11] <- 0 # ydim
455 else shape[6:11] <- 0b111111 # ydim max
456 else
457 shape[18:20] <- 0b010 # ordering yd/xd(/zd)
458 if sk = 1 then shape[6:11] <- 0 # ydim
459 else shape[6:11] <- d-1 # ydim max
460 # offset (the prime purpose of this instruction)
461 shape[24:27] <- SVo # offset
462 if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
463 else shape[28:29] <- 0b00 # no skipping
464 # select the mode for updating SVSHAPEs
465 SVSTATE[62] <- mm # set or clear persistence
466 if mm = 0 then
467 # clear out all SVSHAPEs first
468 SVSHAPE0[0:31] <- [0] * 32
469 SVSHAPE1[0:31] <- [0] * 32
470 SVSHAPE2[0:31] <- [0] * 32
471 SVSHAPE3[0:31] <- [0] * 32
472 SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
473 SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
474 idx <- 0
475 for bit = 0 to 4
476 if rmm[4-bit] then
477 # activate requested shape
478 if idx = 0 then SVSHAPE0 <- shape
479 if idx = 1 then SVSHAPE1 <- shape
480 if idx = 2 then SVSHAPE2 <- shape
481 if idx = 3 then SVSHAPE3 <- shape
482 SVSTATE[bit*2+32:bit*2+33] <- idx
483 # increment shape index, modulo 4
484 if idx = 3 then idx <- 0
485 else idx <- idx + 1
486 else
487 # refined SVSHAPE/REMAP update mode
488 bit <- rmm[0:2]
489 idx <- rmm[3:4]
490 if idx = 0 then SVSHAPE0 <- shape
491 if idx = 1 then SVSHAPE1 <- shape
492 if idx = 2 then SVSHAPE2 <- shape
493 if idx = 3 then SVSHAPE3 <- shape
494 SVSTATE[bit*2+32:bit*2+33] <- idx
495 SVSTATE[46-bit] <- 1
496 ```
497
498 [[!tag standards]]
499
500 ---------
501
502 \newpage{}
503