openpower/sv/twin_butterfly.mdwn

   1 # Introduction
   2
   3 <!-- hide -->
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
   5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
   6   information about implicit RS/FRS
   7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
   8 * [[openpower/isa/svfparith]]
   9 * [[openpower/isa/svfixedarith]]
  10 * [[openpower/sv/rfc/ls016]]
  11 <!-- show -->
  12
  13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
  14 context to save considerably on DCT, DFT and FFT processing.  Whilst some hardware
  15 implementations may not necessarily implement them efficiently (slower Micro-coding)
  16 savings still come from the reduction in temporary registers as well as instruction
  17 count.
  18
  19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
  20
  21 The number of general-purpose uses for DCT is huge. The number of
  22 instructions needed instead of these Twin-Butterfly instructions is also
  23 huge (**eight**) and given that it is extremely common to explicitly
  24 loop-unroll them quantity hundreds to thousands of instructions are
  25 dismayingly common (for all ISAs).
  26
  27 The goal is to implement instructions that calculate the expression:
  28
  29 ```
  30     fdct_round_shift((a +/- b) * c)
  31 ```
  32
  33 For the single-coefficient butterfly instruction, and:
  34
  35 ```
  36     fdct_round_shift(a * c1  +/- b * c2)
  37 ```
  38
  39 For the double-coefficient butterfly instruction.
  40
  41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
  42
  43 ```
  44     #define ROUND_POWER_OF_TWO(value, n) \
  45             (((value) + (1 << ((n)-1))) >> (n))
  46 ```
  47
  48 These instructions are at the core of **ALL** FDCT calculations in many
  49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
  50 ARM includes special instructions to optimize these operations, although
  51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
  52
  53 The suggestion is to have a single instruction to calculate both values
  54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`.  The instruction will
  55 run in accumulate mode, so in order to calculate the 2-coeff version
  56 one would just have to call the same instruction with different order a,
  57 b and a different constant c.
  58
  59 Example taken from libvpx
  60 <https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
  61
  62 ```
  63     #include <stdint.h>
  64     #define ROUND_POWER_OF_TWO(value, n) \
  65             (((value) + (1 << ((n)-1))) >> (n))
  66     void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
  67         t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
  68         t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
  69     }
  70 ```
  71
  72 8 instructions are required  - replaced by just the one (maddsubrs):
  73
  74 ```
  75     add 9,5,4
  76     subf 5,5,4
  77     mullw 9,9,6
  78     mullw 5,5,6
  79     addi 9,9,8192
  80     addi 5,5,8192
  81     srawi 9,9,14
  82     srawi 5,5,14
  83 ```
  84
  85 -------
  86
  87 \newpage{}
  88
  89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
  90
  91 **Add the following to Book I Section 3.3.9.1**
  92
  93 A-Form
  94
  95 ```
  96     |0     |6     |11      |16     |21      |26    |31 |
  97     | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
  98 ```
  99
 100 * maddsubrs  RT,RA,SH,RB
 101
 102 Pseudo-code:
 103
 104 ```
 105     n <- SH
 106     sum <- (RT) + (RA)
 107     diff <- (RT) - (RA)
 108     prod1 <- MULS(RB, sum)
 109     prod1_lo <- prod1[XLEN:(XLEN*2)-1]
 110     prod2 <- MULS(RB, diff)
 111     prod2_lo <- prod2[XLEN:(XLEN*2)-1]
 112     if n = 0 then
 113         RT <- prod1_lo
 114         RS <- prod2_lo
 115     else
 116         round <- [0]*XLEN
 117         round[XLEN -n] <- 1
 118         prod1_lo <- prod1_lo + round
 119         prod2_lo <- prod2_lo + round
 120         m <- MASK(n, (XLEN-1))
 121         res1 <- ROTL64(prod1_lo, XLEN-n) & m
 122         res2 <- ROTL64(prod2_lo, XLEN-n) & m
 123         signbit1 <- prod1_lo[0]
 124         signbit2 <- prod2_lo[0]
 125         smask1 <- ([signbit1]*XLEN) & ¬m
 126         smask2 <- ([signbit2]*XLEN) & ¬m
 127         RT <- (res1 | smask1)
 128         RS <- (res2 | smask2)
 129 ```
 130
 131 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 132 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 133 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 134 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 135
 136 Special Registers Altered:
 137
 138 ```
 139     None
 140 ```
 141
 142 # [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT
 143
 144 A-Form
 145
 146 * maddrs  RT,RA,SH,RB
 147
 148 Pseudo-code:
 149
 150     n <- SH
 151     prod <- MULS(RB, RA)
 152     prod_lo <- prod[XLEN:(XLEN*2)-1]
 153     if n = 0 then
 154         RT <- (RT) + prod_lo
 155         RS <- (RS) - prod_lo
 156     else
 157         res1 <- (RT) + prod_lo
 158         res2 <- (RS) - prod_lo
 159         round <- [0]*XLEN
 160         round[XLEN -n] <- 1
 161         res1 <- res1 + round
 162         res2 <- res2 + round
 163         signbit1 <- res1[0]
 164         signbit2 <- res2[0]
 165         m <- MASK(n, (XLEN-1))
 166         res1 <- ROTL64(res1, XLEN-n) & m
 167         res2 <- ROTL64(res2, XLEN-n) & m
 168         smask1 <- ([signbit1]*XLEN) & ¬m
 169         smask2 <- ([signbit2]*XLEN) & ¬m
 170         RT <- (res1 | smask1)
 171         RS <- (res2 | smask2)
 172
 173 Special Registers Altered:
 174
 175     None
 176
 177 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 178 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 179 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 180 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 181
 182 This instruction is supposed to be used in complement to the maddsubrs
 183 to produce the double-coefficient butterfly instruction. In order for that
 184 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
 185
 186 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
 187 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
 188 from the previous `RT`/`RS`, and *then* do the shifting.
 189
 190 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
 191 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
 192 (here, `RS = RT +1`, so `R2`).
 193 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and subtract it from `R2` (`RS`), and then
 194 round shift right both quantities 14 bits:
 195
 196 ```
 197     maddsubrs 1,10,0,11
 198     maddrs 1,10,14,12
 199 ```
 200
 201 In scalar code, that would take ~16 instructions for both operations.
 202
 203 -------
 204
 205 \newpage{}
 206
 207 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
 208
 209 **Add the following to Book I Section 4.6.6.3**
 210
 211 ## Floating-Point Twin Multiply-Add DCT [Single]
 212
 213 X-Form
 214
 215 ```
 216     |0     |6     |11      |16     |21      |31 |
 217     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 218 ```
 219
 220 * fdmadds FRT,FRA,FRB (Rc=0)
 221
 222 Pseudo-code:
 223
 224 ```
 225     FRS <- FPADD32(FRT, FRB)
 226     sub <- FPSUB32(FRT, FRB)
 227     FRT <- FPMUL32(FRA, sub)
 228 ```
 229
 230 The two IEEE754-FP32 operations
 231
 232 ```
 233     FRS <- [(FRT) + (FRB)]
 234     FRT <- [(FRT) - (FRB)] * (FRA)
 235 ```
 236
 237 are simultaneously performed.
 238
 239 The Floating-Point operand in register FRT is added to the floating-point
 240 operand in register FRB and the result stored in FRS.
 241
 242 Using the exact same operand input register values from FRT and FRB
 243 that were used to create FRS, the Floating-Point operand in register
 244 FRB is subtracted from the floating-point operand in register FRT and
 245 the result then rounded before being multiplied by FRA to create an
 246 intermediate result that is stored in FRT.
 247
 248 The add into FRS is treated exactly as `fadds`.  The creation of the
 249 result FRT is **not** the same as that of `fmsubs`, but is instead as if
 250 `fsubs` were performed first followed by `fmuls`.  The creation of FRS
 251 and FRT are treated as parallel independent operations which occur at
 252 the same time.
 253
 254 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 255
 256 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 257 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 258 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 259 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 260
 261 Special Registers Altered:
 262
 263 ```
 264     FPRF FR FI
 265     FX OX UX XX
 266     VXSNAN VXISI VXIMZ
 267 ```
 268
 269 ## Floating-Point Multiply-Add FFT [Single]
 270
 271 X-Form
 272
 273 ```
 274     |0     |6     |11      |16     |21      |31 |
 275     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 276 ```
 277
 278 * ffmadds FRT,FRA,FRB (Rc=0)
 279
 280 Pseudo-code:
 281
 282 ```
 283     FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
 284     FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
 285 ```
 286
 287 The two operations
 288
 289 ```
 290     FRS <- -([(FRT) * (FRA)] - (FRB))
 291     FRT <-   [(FRT) * (FRA)] + (FRB)
 292 ```
 293
 294 are performed.
 295
 296 The floating-point operand in register FRT is multiplied by the
 297 floating-point operand in register FRA. The floating-point operand in
 298 register FRB is added to this intermediate result, and the intermediate
 299 stored in FRS.
 300
 301 Using the exact same values of FRT, FRT and FRB as used to create
 302 FRS, the floating-point operand in register FRT is multiplied by the
 303 floating-point operand in register FRA. The floating-point operand
 304 in register FRB is subtracted from this intermediate result, and the
 305 intermediate stored in FRT.
 306
 307 FRT is created as if a `fmadds` operation had been performed. FRS is
 308 created as if a `fnmsubs` operation had simultaneously been performed
 309 with the exact same register operands, in parallel, independently,
 310 at exactly the same time.
 311
 312 FRT is a Read-Modify-Write operation.
 313
 314 Note that if Rc=1 an Illegal Instruction is raised.
 315 Rc=1 is `RESERVED`
 316
 317 Similar to `FRTp`, this instruction produces an implicit result,
 318 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
 319 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
 320 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
 321 (Max Vector Length).
 322
 323 Special Registers Altered:
 324
 325 ```
 326     FPRF FR FI
 327     FX OX UX XX
 328     VXSNAN VXISI VXIMZ
 329 ```
 330
 331 ## Floating-Point Twin Multiply-Add DCT
 332
 333 X-Form
 334
 335 ```
 336     |0     |6     |11      |16     |21      |31 |
 337     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 338 ```
 339
 340 * fdmadd FRT,FRA,FRB (Rc=0)
 341
 342 Pseudo-code:
 343
 344 ```
 345     FRS <- FPADD64(FRT, FRB)
 346     sub <- FPSUB64(FRT, FRB)
 347     FRT <- FPMUL64(FRA, sub)
 348 ```
 349
 350 The two IEEE754-FP64 operations
 351
 352 ```
 353     FRS <- [(FRT) + (FRB)]
 354     FRT <- [(FRT) - (FRB)] * (FRA)
 355 ```
 356
 357 are simultaneously performed.
 358
 359 The Floating-Point operand in register FRT is added to the floating-point
 360 operand in register FRB and the result stored in FRS.
 361
 362 Using the exact same operand input register values from FRT and FRB
 363 that were used to create FRS, the Floating-Point operand in register
 364 FRB is subtracted from the floating-point operand in register FRT and
 365 the result then rounded before being multiplied by FRA to create an
 366 intermediate result that is stored in FRT.
 367
 368 The add into FRS is treated exactly as `fadd`.  The creation of the
 369 result FRT is **not** the same as that of `fmsub`, but is instead as if
 370 `fsub` were performed first followed by `fmuls.  The creation of FRS
 371 and FRT are treated as parallel independent operations which occur at
 372 the same time.
 373
 374 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 375
 376 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 377 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 378 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 379 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 380
 381 Special Registers Altered:
 382
 383 ```
 384     FPRF FR FI
 385     FX OX UX XX
 386     VXSNAN VXISI VXIMZ
 387 ```
 388
 389 ## Floating-Point Twin Multiply-Add FFT
 390
 391 X-Form
 392
 393 ```
 394     |0     |6     |11      |16     |21      |31 |
 395     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 396 ```
 397
 398 * ffmadd FRT,FRA,FRB (Rc=0)
 399
 400 Pseudo-code:
 401
 402 ```
 403     FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
 404     FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
 405 ```
 406
 407 The two operations
 408
 409 ```
 410     FRS <- -([(FRT) * (FRA)] - (FRB))
 411     FRT <-   [(FRT) * (FRA)] + (FRB)
 412 ```
 413
 414 are performed.
 415
 416 The floating-point operand in register FRT is multiplied by the
 417 floating-point operand in register FRA. The float- ing-point operand in
 418 register FRB is added to this intermediate result, and the intermediate
 419 stored in FRS.
 420
 421 Using the exact same values of FRT, FRT and FRB as used to create
 422 FRS, the floating-point operand in register FRT is multiplied by the
 423 floating-point operand in register FRA. The float- ing-point operand
 424 in register FRB is subtracted from this intermediate result, and the
 425 intermediate stored in FRT.
 426
 427 FRT is created as if a `fmadd` operation had been performed. FRS is
 428 created as if a `fnmsub` operation had simultaneously been performed
 429 with the exact same register operands, in parallel, independently,
 430 at exactly the same time.
 431
 432 FRT is a Read-Modify-Write operation.
 433
 434 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 435
 436 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 437 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 438 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 439 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 440
 441 Special Registers Altered:
 442
 443 ```
 444     FPRF FR FI
 445     FX OX UX XX
 446     VXSNAN VXISI VXIMZ
 447 ```
 448
 449
 450 ## Floating-Point Add FFT/DCT [Single]
 451
 452 A-Form
 453
 454 ```
 455     |0     |6     |11      |16     |21      |26    |31 |
 456     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 457 ```
 458
 459 * ffadds FRT,FRA,FRB (Rc=0)
 460
 461 Pseudo-code:
 462
 463 ```
 464     FRT <- FPADD32(FRA, FRB)
 465     FRS <- FPSUB32(FRB, FRA)
 466 ```
 467
 468 Special Registers Altered:
 469
 470 ```
 471     FPRF FR FI
 472     FX OX UX XX
 473     VXSNAN VXISI
 474 ```
 475
 476 ## Floating-Point Add FFT/DCT [Double]
 477
 478 A-Form
 479
 480 ```
 481     |0     |6     |11      |16     |21      |26    |31 |
 482     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 483 ```
 484
 485 * ffadd FRT,FRA,FRB (Rc=0)
 486
 487 Pseudo-code:
 488
 489 ```
 490     FRT <- FPADD64(FRA, FRB)
 491     FRS <- FPSUB64(FRB, FRA)
 492 ```
 493
 494 Special Registers Altered:
 495
 496 ```
 497     FPRF FR FI
 498     FX OX UX XX
 499     VXSNAN VXISI
 500 ```
 501
 502 ## Floating-Point Subtract FFT/DCT [Single]
 503
 504 A-Form
 505
 506 ```
 507     |0     |6     |11      |16     |21      |26    |31 |
 508     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 509 ```
 510
 511 * ffsubs FRT,FRA,FRB (Rc=0)
 512
 513 Pseudo-code:
 514
 515 ```
 516     FRT <- FPSUB32(FRB, FRA)
 517     FRS <- FPADD32(FRA, FRB)
 518 ```
 519
 520 Special Registers Altered:
 521
 522 ```
 523     FPRF FR FI
 524     FX OX UX XX
 525     VXSNAN VXISI
 526 ```
 527
 528 ## Floating-Point Subtract FFT/DCT [Double]
 529
 530 A-Form
 531
 532 ```
 533     |0     |6     |11      |16     |21      |26    |31 |
 534     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 535 ```
 536
 537 * ffsub FRT,FRA,FRB (Rc=0)
 538
 539 Pseudo-code:
 540
 541 ```
 542     FRT <- FPSUB64(FRB, FRA)
 543     FRS <- FPADD64(FRA, FRB)
 544 ```
 545
 546 Special Registers Altered:
 547
 548 ```
 549     FPRF FR FI
 550     FX OX UX XX
 551     VXSNAN VXISI
 552 ```