f531615a0e7f06fb1b8edd2d981e6d3767c6756a
[libreriscv.git] / openpower / sv / twin_butterfly.mdwn
1 # Introduction
2
3 <!-- hide -->
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
6 information about implicit RS/FRS
7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
8 * [[openpower/isa/svfparith]]
9 * [[openpower/isa/svfixedarith]]
10 * [[openpower/sv/rfc/ls016]]
11 <!-- show -->
12
13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
14 context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
15 implementations may not necessarily implement them efficiently (slower Micro-coding)
16 savings still come from the reduction in temporary registers as well as instruction
17 count.
18
19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
20
21 The number of general-purpose uses for DCT is huge. The number of
22 instructions needed instead of these Twin-Butterfly instructions is also
23 huge (**eight**) and given that it is extremely common to explicitly
24 loop-unroll them quantity hundreds to thousands of instructions are
25 dismayingly common (for all ISAs).
26
27 The goal is to implement instructions that calculate the expression:
28
29 ```
30 fdct_round_shift((a +/- b) * c)
31 ```
32
33 For the single-coefficient butterfly instruction, and:
34
35 ```
36 fdct_round_shift(a * c1 +/- b * c2)
37 ```
38
39 For the double-coefficient butterfly instruction.
40
41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
42
43 ```
44 #define ROUND_POWER_OF_TWO(value, n) \
45 (((value) + (1 << ((n)-1))) >> (n))
46 ```
47
48 These instructions are at the core of **ALL** FDCT calculations in many
49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
50 ARM includes special instructions to optimize these operations, although
51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
52
53 The suggestion is to have a single instruction to calculate both values
54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`. The instruction will
55 run in accumulate mode, so in order to calculate the 2-coeff version
56 one would just have to call the same instruction with different order a,
57 b and a different constant c.
58
59 Example taken from libvpx
60 <https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
61
62 ```
63 #include <stdint.h>
64 #define ROUND_POWER_OF_TWO(value, n) \
65 (((value) + (1 << ((n)-1))) >> (n))
66 void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
67 t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
68 t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
69 }
70 ```
71
72 8 instructions are required - replaced by just the one (maddsubrs):
73
74 ```
75 add 9,5,4
76 subf 5,5,4
77 mullw 9,9,6
78 mullw 5,5,6
79 addi 9,9,8192
80 addi 5,5,8192
81 srawi 9,9,14
82 srawi 5,5,14
83 ```
84
85 -------
86
87 \newpage{}
88
89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
90
91 **Add the following to Book I Section 3.3.9.1**
92
93 A-Form
94
95 ```
96 |0 |6 |11 |16 |21 |26 |31 |
97 | PO | RT | RA | RB | SH | XO |Rc |
98 ```
99
100 * maddsubrs RT,RA,SH,RB
101
102 Pseudo-code:
103
104 ```
105 n <- SH
106 sum <- (RT) + (RA)
107 diff <- (RT) - (RA)
108 prod1 <- MULS(RB, sum)
109 prod1_lo <- prod1[XLEN:(XLEN*2)-1]
110 prod2 <- MULS(RB, diff)
111 prod2_lo <- prod2[XLEN:(XLEN*2)-1]
112 if n = 0 then
113 RT <- prod1_lo
114 RS <- prod2_lo
115 else
116 round <- [0]*XLEN
117 round[XLEN -n] <- 1
118 prod1_lo <- prod1_lo + round
119 prod2_lo <- prod2_lo + round
120 m <- MASK(n, (XLEN-1))
121 res1 <- ROTL64(prod1_lo, XLEN-n) & m
122 res2 <- ROTL64(prod2_lo, XLEN-n) & m
123 signbit1 <- prod1_lo[0]
124 signbit2 <- prod2_lo[0]
125 smask1 <- ([signbit1]*XLEN) & ¬m
126 smask2 <- ([signbit2]*XLEN) & ¬m
127 RT <- (res1 | smask1)
128 RS <- (res2 | smask2)
129 ```
130
131 Similar to `RTp`, this instruction produces an implicit result, `RS`,
132 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
133 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
134 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
135
136 Special Registers Altered:
137
138 ```
139 None
140 ```
141
142 # [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT
143
144 A-Form
145
146 * maddrs RT,RA,SH,RB
147
148 Pseudo-code:
149
150 n <- SH
151 prod <- MULS(RB, RA)
152 prod_lo <- prod[XLEN:(XLEN*2)-1]
153 if n = 0 then
154 RT <- (RT) + prod_lo
155 RS <- (RS) - prod_lo
156 else
157 res1 <- (RT) + prod_lo
158 res2 <- (RS) - prod_lo
159 round <- [0]*XLEN
160 round[XLEN -n] <- 1
161 res1 <- res1 + round
162 res2 <- res2 + round
163 signbit1 <- res1[0]
164 signbit2 <- res2[0]
165 m <- MASK(n, (XLEN-1))
166 res1 <- ROTL64(res1, XLEN-n) & m
167 res2 <- ROTL64(res2, XLEN-n) & m
168 smask1 <- ([signbit1]*XLEN) & ¬m
169 smask2 <- ([signbit2]*XLEN) & ¬m
170 RT <- (res1 | smask1)
171 RS <- (res2 | smask2)
172
173 Special Registers Altered:
174
175 None
176
177 Similar to `RTp`, this instruction produces an implicit result, `RS`,
178 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
179 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
180 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
181
182 This instruction is supposed to be used in complement to the maddsubrs
183 to produce the double-coefficient butterfly instruction. In order for that
184 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
185
186 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
187 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
188 from the previous `RT`/`RS`, and *then* do the shifting.
189
190 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
191 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
192 (here, `RS = RT +1`, so `R2`).
193 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and subtract it from `R2` (`RS`), and then
194 round shift right both quantities 14 bits:
195
196 ```
197 maddsubrs 1,10,0,11
198 maddrs 1,10,14,12
199 ```
200
201 In scalar code, that would take ~16 instructions for both operations.
202
203 -------
204
205 \newpage{}
206
207 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
208
209 **Add the following to Book I Section 4.6.6.3**
210
211 ## Floating-Point Twin Multiply-Add DCT [Single]
212
213 X-Form
214
215 ```
216 |0 |6 |11 |16 |21 |31 |
217 | PO | FRT | FRA | FRB | XO |Rc |
218 ```
219
220 * fdmadds FRT,FRA,FRB (Rc=0)
221
222 Pseudo-code:
223
224 ```
225 FRS <- FPADD32(FRT, FRB)
226 sub <- FPSUB32(FRT, FRB)
227 FRT <- FPMUL32(FRA, sub)
228 ```
229
230 The two IEEE754-FP32 operations
231
232 ```
233 FRS <- [(FRT) + (FRB)]
234 FRT <- [(FRT) - (FRB)] * (FRA)
235 ```
236
237 are simultaneously performed.
238
239 The Floating-Point operand in register FRT is added to the floating-point
240 operand in register FRB and the result stored in FRS.
241
242 Using the exact same operand input register values from FRT and FRB
243 that were used to create FRS, the Floating-Point operand in register
244 FRB is subtracted from the floating-point operand in register FRT and
245 the result then rounded before being multiplied by FRA to create an
246 intermediate result that is stored in FRT.
247
248 The add into FRS is treated exactly as `fadds`. The creation of the
249 result FRT is **not** the same as that of `fmsubs`, but is instead as if
250 `fsubs` were performed first followed by `fmuls`. The creation of FRS
251 and FRT are treated as parallel independent operations which occur at
252 the same time.
253
254 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
255
256 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
257 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
258 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
259 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
260
261 Special Registers Altered:
262
263 ```
264 FPRF FR FI
265 FX OX UX XX
266 VXSNAN VXISI VXIMZ
267 ```
268
269 ## Floating-Point Multiply-Add FFT [Single]
270
271 X-Form
272
273 ```
274 |0 |6 |11 |16 |21 |31 |
275 | PO | FRT | FRA | FRB | XO |Rc |
276 ```
277
278 * ffmadds FRT,FRA,FRB (Rc=0)
279
280 Pseudo-code:
281
282 ```
283 FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
284 FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
285 ```
286
287 The two operations
288
289 ```
290 FRS <- -([(FRT) * (FRA)] - (FRB))
291 FRT <- [(FRT) * (FRA)] + (FRB)
292 ```
293
294 are performed.
295
296 The floating-point operand in register FRT is multiplied by the
297 floating-point operand in register FRA. The floating-point operand in
298 register FRB is added to this intermediate result, and the intermediate
299 stored in FRS.
300
301 Using the exact same values of FRT, FRT and FRB as used to create
302 FRS, the floating-point operand in register FRT is multiplied by the
303 floating-point operand in register FRA. The floating-point operand
304 in register FRB is subtracted from this intermediate result, and the
305 intermediate stored in FRT.
306
307 FRT is created as if a `fmadds` operation had been performed. FRS is
308 created as if a `fnmsubs` operation had simultaneously been performed
309 with the exact same register operands, in parallel, independently,
310 at exactly the same time.
311
312 FRT is a Read-Modify-Write operation.
313
314 Note that if Rc=1 an Illegal Instruction is raised.
315 Rc=1 is `RESERVED`
316
317 Similar to `FRTp`, this instruction produces an implicit result,
318 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
319 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
320 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
321 (Max Vector Length).
322
323 Special Registers Altered:
324
325 ```
326 FPRF FR FI
327 FX OX UX XX
328 VXSNAN VXISI VXIMZ
329 ```
330
331 ## Floating-Point Twin Multiply-Add DCT
332
333 X-Form
334
335 ```
336 |0 |6 |11 |16 |21 |31 |
337 | PO | FRT | FRA | FRB | XO |Rc |
338 ```
339
340 * fdmadd FRT,FRA,FRB (Rc=0)
341
342 Pseudo-code:
343
344 ```
345 FRS <- FPADD64(FRT, FRB)
346 sub <- FPSUB64(FRT, FRB)
347 FRT <- FPMUL64(FRA, sub)
348 ```
349
350 The two IEEE754-FP64 operations
351
352 ```
353 FRS <- [(FRT) + (FRB)]
354 FRT <- [(FRT) - (FRB)] * (FRA)
355 ```
356
357 are simultaneously performed.
358
359 The Floating-Point operand in register FRT is added to the floating-point
360 operand in register FRB and the result stored in FRS.
361
362 Using the exact same operand input register values from FRT and FRB
363 that were used to create FRS, the Floating-Point operand in register
364 FRB is subtracted from the floating-point operand in register FRT and
365 the result then rounded before being multiplied by FRA to create an
366 intermediate result that is stored in FRT.
367
368 The add into FRS is treated exactly as `fadd`. The creation of the
369 result FRT is **not** the same as that of `fmsub`, but is instead as if
370 `fsub` were performed first followed by `fmuls. The creation of FRS
371 and FRT are treated as parallel independent operations which occur at
372 the same time.
373
374 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
375
376 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
377 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
378 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
379 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
380
381 Special Registers Altered:
382
383 ```
384 FPRF FR FI
385 FX OX UX XX
386 VXSNAN VXISI VXIMZ
387 ```
388
389 ## Floating-Point Twin Multiply-Add FFT
390
391 X-Form
392
393 ```
394 |0 |6 |11 |16 |21 |31 |
395 | PO | FRT | FRA | FRB | XO |Rc |
396 ```
397
398 * ffmadd FRT,FRA,FRB (Rc=0)
399
400 Pseudo-code:
401
402 ```
403 FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
404 FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
405 ```
406
407 The two operations
408
409 ```
410 FRS <- -([(FRT) * (FRA)] - (FRB))
411 FRT <- [(FRT) * (FRA)] + (FRB)
412 ```
413
414 are performed.
415
416 The floating-point operand in register FRT is multiplied by the
417 floating-point operand in register FRA. The float- ing-point operand in
418 register FRB is added to this intermediate result, and the intermediate
419 stored in FRS.
420
421 Using the exact same values of FRT, FRT and FRB as used to create
422 FRS, the floating-point operand in register FRT is multiplied by the
423 floating-point operand in register FRA. The float- ing-point operand
424 in register FRB is subtracted from this intermediate result, and the
425 intermediate stored in FRT.
426
427 FRT is created as if a `fmadd` operation had been performed. FRS is
428 created as if a `fnmsub` operation had simultaneously been performed
429 with the exact same register operands, in parallel, independently,
430 at exactly the same time.
431
432 FRT is a Read-Modify-Write operation.
433
434 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
435
436 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
437 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
438 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
439 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
440
441 Special Registers Altered:
442
443 ```
444 FPRF FR FI
445 FX OX UX XX
446 VXSNAN VXISI VXIMZ
447 ```
448
449
450 ## Floating-Point Add FFT/DCT [Single]
451
452 A-Form
453
454 ```
455 |0 |6 |11 |16 |21 |26 |31 |
456 | PO | FRT | FRA | FRB | / | XO |Rc |
457 ```
458
459 * ffadds FRT,FRA,FRB (Rc=0)
460
461 Pseudo-code:
462
463 ```
464 FRT <- FPADD32(FRA, FRB)
465 FRS <- FPSUB32(FRB, FRA)
466 ```
467
468 Special Registers Altered:
469
470 ```
471 FPRF FR FI
472 FX OX UX XX
473 VXSNAN VXISI
474 ```
475
476 ## Floating-Point Add FFT/DCT [Double]
477
478 A-Form
479
480 ```
481 |0 |6 |11 |16 |21 |26 |31 |
482 | PO | FRT | FRA | FRB | / | XO |Rc |
483 ```
484
485 * ffadd FRT,FRA,FRB (Rc=0)
486
487 Pseudo-code:
488
489 ```
490 FRT <- FPADD64(FRA, FRB)
491 FRS <- FPSUB64(FRB, FRA)
492 ```
493
494 Special Registers Altered:
495
496 ```
497 FPRF FR FI
498 FX OX UX XX
499 VXSNAN VXISI
500 ```
501
502 ## Floating-Point Subtract FFT/DCT [Single]
503
504 A-Form
505
506 ```
507 |0 |6 |11 |16 |21 |26 |31 |
508 | PO | FRT | FRA | FRB | / | XO |Rc |
509 ```
510
511 * ffsubs FRT,FRA,FRB (Rc=0)
512
513 Pseudo-code:
514
515 ```
516 FRT <- FPSUB32(FRB, FRA)
517 FRS <- FPADD32(FRA, FRB)
518 ```
519
520 Special Registers Altered:
521
522 ```
523 FPRF FR FI
524 FX OX UX XX
525 VXSNAN VXISI
526 ```
527
528 ## Floating-Point Subtract FFT/DCT [Double]
529
530 A-Form
531
532 ```
533 |0 |6 |11 |16 |21 |26 |31 |
534 | PO | FRT | FRA | FRB | / | XO |Rc |
535 ```
536
537 * ffsub FRT,FRA,FRB (Rc=0)
538
539 Pseudo-code:
540
541 ```
542 FRT <- FPSUB64(FRB, FRA)
543 FRS <- FPADD64(FRA, FRB)
544 ```
545
546 Special Registers Altered:
547
548 ```
549 FPRF FR FI
550 FX OX UX XX
551 VXSNAN VXISI
552 ```