* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

Although best used with SVP64 REMAP, these instructions may be used in a Scalar-only
context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
implementations may not necessarily implement them efficiently (slower Micro-coding),
savings still come from the reduction in temporary registers as well as instruction
count.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed instead of these Twin-Butterfly instructions is also
huge (**eight**), and given that it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of instructions are
dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

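As a worked illustration of the rounding, the macro can be wrapped in a small C function and checked against a few values. This is only a sketch: the function names and the coefficient 11585 (roughly 2^14 * cos(pi/4)) are chosen here for illustration, and the code assumes the compiler implements `>>` on signed values as an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))

/* fdct_round_shift in a 32-bit context: add half of 2^14, then shift.
   Right-shifting a negative signed value is implementation-defined in C;
   this sketch assumes the usual arithmetic shift. */
int32_t fdct_round_shift(int32_t x) {
    return ROUND_POWER_OF_TWO(x, 14);
}

/* Worked values: 2^14 = 16384, so the rounding constant is 8192. */
void fdct_round_shift_examples(void) {
    assert(fdct_round_shift(16384) == 1);          /* exactly 1.0 in Q14 */
    assert(fdct_round_shift(8192) == 1);           /* 0.5 rounds up */
    assert(fdct_round_shift(8191) == 0);           /* just under 0.5 rounds down */
    assert(fdct_round_shift(150 * 11585) == 106);  /* (100 + 50) * c, illustrative c */
}
```

The last case corresponds to one output of a single-coefficient butterfly with inputs 100 and 50.
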
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
ARM includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction to calculate both values
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so in order to calculate the 2-coeff version
one would just have to call the same instruction again with `a` and `b`
in a different order and a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    #define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

    /* Both butterfly outputs for one coefficient, rounded and shifted by 14 */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

Eight instructions are required, replaced by just the one (`maddsubrs`):

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)
    prod1_lo <- prod1[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)
    prod2_lo <- prod2[XLEN:(XLEN*2)-1]
    prod1_lo <- prod1_lo + round
    prod2_lo <- prod2_lo + round
    m <- MASK(n, (XLEN-1))
    res1 <- ROTL64(prod1_lo, XLEN-n) & m
    res2 <- ROTL64(prod2_lo, XLEN-n) & m
    signbit1 <- prod1_lo[0]
    signbit2 <- prod2_lo[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 | smask1)
    RS <- (res2 | smask2)

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

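The `MASK`/`ROTL64`/`smask` sequence in the pseudocode amounts to an arithmetic (sign-preserving) shift right by `n`. The following C sketch models just that step, assuming XLEN=64 and MSB0 bit numbering (so `MASK(n, 63)` selects the low `64-n` bits). The function names are hypothetical and are not part of the specification.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left on 64 bits; r is reduced modulo 64 so that r == 0 is safe. */
uint64_t rotl64(uint64_t x, unsigned r) {
    r &= 63u;
    return r ? (x << r) | (x >> (64u - r)) : x;
}

/* Arithmetic shift right by n, built the way the pseudocode builds it:
   rotate, mask off the wrapped-around bits, then fill with the sign bit. */
uint64_t sra_via_rotate(uint64_t value, unsigned n) {
    assert(n >= 1u && n <= 63u);
    uint64_t m = UINT64_MAX >> n;                   /* MASK(n, 63): low 64-n bits */
    uint64_t res = rotl64(value, 64u - n) & m;      /* logical shift right by n */
    uint64_t sign = (value >> 63) ? UINT64_MAX : 0; /* [signbit]*XLEN */
    uint64_t smask = sign & ~m;                     /* sign copies in the top n bits */
    return res | smask;
}
```

For example, shifting the two's complement encoding of -8192 right by 14 yields the encoding of -1, whereas a plain logical shift would give a large positive value.
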
Special Registers Altered:

# [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT

    prod_lo <- prod[XLEN:(XLEN*2)-1]
    res1 <- (RT) + prod_lo
    res2 <- (RS) - prod_lo
    m <- MASK(n, (XLEN-1))
    res1 <- ROTL64(res1, XLEN-n) & m
    res2 <- ROTL64(res2, XLEN-n) & m
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 | smask1)
    RS <- (res2 | smask2)

Special Registers Altered:

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

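The rule for locating the implicit `RS` can be written as a one-line calculation. The sketch below is a simplified model only: the function name and its parameters are hypothetical, and it deliberately ignores how SVP64 extends register numbers and steps through Vector elements.

```c
#include <assert.h>

/* Register number of the implicit RS result.
   Scalar RT: RS is the next register (RT+1).
   Vector RT: RS starts immediately after the MAXVL-long Vector RT. */
unsigned implicit_rs(unsigned rt, int rt_is_vector, unsigned maxvl) {
    if (!rt_is_vector)
        return rt + 1u;
    assert(maxvl >= 1u);  /* a Vector needs a non-zero maximum length */
    return rt + maxvl;
}
```

With a Scalar `RT` of 1 this gives 2, and with a Vector `RT` starting at 8 and `MAXVL` of 4 it gives 12.
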
This instruction is intended to be used together with `maddsubrs`
to produce the double-coefficient butterfly. For that to work, `c2-c1`
must be passed as the coefficient instead of `c2`.

In essence, the quantity `a * c1 +/- b * c1` is calculated first, with
`maddsubrs` *without* shifting (so `SH=0`); `b * (c2-c1)` is then added to
and subtracted from the previous `RT`/`RS`, and only *then* is the shift performed.

In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
The first instruction will put `a * c1 + b * c1` in `R1` (`RT`) and `a * c1 - b * c1` in `RS`
(here, `RS = RT+1`, so `R2`).
Then `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), subtract it from `R2` (`RS`), and then
round-shift both quantities right by 14 bits:

In scalar code, that would take ~16 instructions for both operations.

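The algebra behind the two-instruction sequence can be checked with a small C model. This is a sketch under stated assumptions rather than a reference implementation: the struct and function names are hypothetical, `maddsubrs` with `SH=0` is assumed to perform no rounding or shifting, the coefficient values in the usage note are illustrative, and `>>` on negative values is assumed to be an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t rt; int64_t rs; } pair64;

/* Model of maddsubrs with SH=0 (assumed: no rounding, no shift):
   RT = (a + b) * c, RS = (a - b) * c. */
pair64 maddsubrs_sh0(int64_t a, int64_t b, int64_t c) {
    pair64 r = { (a + b) * c, (a - b) * c };
    return r;
}

/* Model of maddrs: add b*c to RT, subtract it from RS,
   then round and shift both right by n bits. */
pair64 maddrs_model(pair64 acc, int64_t b, int64_t c, unsigned n) {
    assert(n >= 1u && n < 63u);
    int64_t prod = b * c;
    int64_t round = (int64_t)1 << (n - 1);
    pair64 r;
    r.rt = (acc.rt + prod + round) >> n;
    r.rs = (acc.rs - prod + round) >> n;
    return r;
}

/* a*c1 + b*c2 and a*c1 - b*c2, each rounded and shifted by 14:
   the second step is given c2 - c1, not c2. */
pair64 two_coeff_butterfly(int64_t a, int64_t b, int64_t c1, int64_t c2) {
    pair64 first = maddsubrs_sh0(a, b, c1);
    return maddrs_model(first, b, c2 - c1, 14);
}
```

With `a=100`, `b=50`, `c1=11585` and `c2=15137`, the first step leaves `1737750` and `579250`; adding and subtracting `50 * 3552 = 177600` gives `1915350` and `401650`, which are exactly `a*c1 + b*c2` and `a*c1 - b*c2`, and the round-shift yields 117 and 25.
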
# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the Floating-Point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

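The behaviour described for `fdmadds` (both results computed from the same input values, with the difference rounded before the multiply) can be sketched in C. The struct and function names are hypothetical, and the sketch assumes the compiler evaluates `float` expressions in single precision without excess intermediate precision.

```c
#include <assert.h>

typedef struct { float frt; float frs; } fpair32;

/* Model of fdmadds: both results use the original FRT and FRB values. */
fpair32 fdmadds_model(float frt, float fra, float frb) {
    fpair32 out;
    out.frs = frt + frb;     /* as fadds */
    float sub = frt - frb;   /* as fsubs: the difference is rounded here */
    out.frt = fra * sub;     /* then as fmuls on the rounded difference */
    return out;
}

/* Usage: all values below are exactly representable, so == is safe. */
void fdmadds_example(void) {
    fpair32 r = fdmadds_model(3.0f, 0.5f, 1.0f);
    assert(r.frs == 4.0f);   /* 3 + 1 */
    assert(r.frt == 1.0f);   /* (3 - 1) * 0.5 */
}
```

Returning both results in a struct mirrors the fact that the old value of FRT is read before either output is written.
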
Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the intermediate
stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
intermediate stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the Floating-Point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the intermediate
stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
intermediate stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Add FFT/DCT [Single]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |

* ffadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

Special Registers Altered:

## Floating-Point Add FFT/DCT [Double]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |

* ffadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Single]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |

* ffsubs FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

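Going by the pseudocode, `ffadds` and `ffsubs` compute the same sum and difference and differ only in which result lands in FRT and which in the implicit FRS. A minimal C sketch of that relationship follows; the struct and function names are hypothetical and not part of the specification.

```c
#include <assert.h>

typedef struct { float frt; float frs; } ffpair32;

/* Model of ffadds: sum into FRT, difference (FRB - FRA) into FRS. */
ffpair32 ffadds_model(float fra, float frb) {
    ffpair32 out = { fra + frb, frb - fra };
    return out;
}

/* Model of ffsubs: difference (FRB - FRA) into FRT, sum into FRS. */
ffpair32 ffsubs_model(float fra, float frb) {
    ffpair32 out = { frb - fra, fra + frb };
    return out;
}

/* Usage: the inputs are exactly representable, so == is safe here. */
void ffadds_ffsubs_example(void) {
    ffpair32 a = ffadds_model(1.5f, 4.0f);
    ffpair32 s = ffsubs_model(1.5f, 4.0f);
    assert(a.frt == 5.5f && a.frs == 2.5f);
    assert(s.frt == a.frs && s.frs == a.frt);  /* same values, swapped targets */
}
```

Note the operand order of the subtraction: it is FRB minus FRA in both instructions.
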
Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Double]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |

* ffsub FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: