openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 [[!toc levels=1]]
   4
   5 # Implementation Log
   6
   7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10 * binutils <https://bugs.libre-soc.org/show_bug.cgi?id=836>
  11 * shift-and-add <https://bugs.libre-soc.org/show_bug.cgi?id=968>
  12
  13 # bitmanipulation
  14
  15 **DRAFT STATUS**
  16
  17 pseudocode: [[openpower/isa/bitmanip]]
  18
  19 this extension amalgamates bitmanipulation primitives from many sources,
  20 including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
  21 Also included are DSP/Multimedia operations suitable for Audio/Video.
  22 Vectorisation and SIMD are removed: these are straight scalar (element)
  23 operations making them suitable for embedded applications.  Vectorisation
  24 Context is provided by [[openpower/sv]].
  25
  26 When combined with SV, scalar variants of bitmanip operations found in
  27 VSX are added so that the Packed SIMD aspects of VSX may be retired as
  28 "legacy" in the far future (10 to 20 years).  Also, VSX is hundreds of
  29 opcodes, requires 128 bit pathways, and is wholly unsuited to low power
  30 or embedded scenarios.
  31
  32 ternlogv is experimental and is the only operation that may be considered
  33 a "Packed SIMD".  It is added as a variant of the already well-justified
  34 ternlog operation (done in AVX512 as an immediate only) "because it
  35 looks fun". As it is based on the LUT4 concept it will allow accelerated
  36 emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to
  37 achieve similar objectives.
  38
  39 general-purpose Galois Field 2^M operations are added so as to avoid
  40 huge custom opcode proliferation across many areas of Computer Science.
  41 however for convenience and also to avoid setup costs, some of the more
  42 common operations (clmul, crc32) are also added.  The expectation is
  43 that these operations would all be covered by the same pipeline.
  44
  45 note that there are brownfield spaces below that could incorporate
  46 some of the set-before-first and other scalar operations listed in
  47 [[sv/mv.swizzle]],
  48 [[sv/vector_ops]], [[sv/int_fp_mv]] and the [[sv/av_opcodes]] as well as
  49 [[sv/setvl]], [[sv/svstep]], [[sv/remap]]
  50
  51 Useful resource:
  52
  53 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  54 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  55 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
  56 * <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
  57
  58 [[!inline pages="openpower/sv/draft_opcode_tables" quick="yes" raw="yes" ]]
  59
  60 # binary and ternary bitops
  61
  62 Similar to FPGA LUTs: for two (binary) or three (ternary) inputs take
  63 bits from each input, concatenate them and perform a lookup into a
  64 table using an 8-8-bit immediate (for the ternary instructions), or in
  65 another register (4-bit for the binary instructions).  The binary lookup
  66 instructions have CR Field lookup variants due to CR Fields being 4 bit.
  67
  68 Like the x86 AVX512F
  69 [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq)
  70 instructions.
  71
  72 ## ternlogi
  73
  74 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
  75 | -- | -- | --- | --- | ----- | --- |--|
  76 | NN | RT | RA  | RB  | im0-7 |  00 |Rc|
  77
  78     lut3(imm, a, b, c):
  79         idx = c << 2 | b << 1 | a
  80         return imm[idx] # idx by LSB0 order
  81
  82     for i in range(64):
  83         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
  84
  85 ## binlut
  86
  87 Binary lookup is a dynamic LUT2 version of ternlogi. Firstly, the
  88 lookup table is 4 bits wide not 8 bits, and secondly the lookup
  89 table comes from a register not an immediate.
  90
  91 | 0.5|6.10|11.15|16.20| 21..25|26..31  | Form    |
  92 | -- | -- | --- | --- | ----- |--------|---------|
  93 | NN | RT | RA  | RB  | RC    |nh 00001| VA-Form |
  94 | NN | RT | RA  | RB  | /BFA/ |0  01001| VA-Form |
  95
  96 For binlut, the 4-bit LUT may be selected from either the high nibble
  97 or the low nibble of the first byte of RC:
  98
  99     lut2(imm, a, b):
 100         idx = b << 1 | a
 101         return imm[idx] # idx by LSB0 order
 102
 103     imm = (RC>>(nh*4))&0b1111
 104     for i in range(64):
 105         RT[i] = lut2(imm, RB[i], RA[i])
 106
 107 For bincrlut, `BFA` selects the 4-bit CR Field as the LUT2:
 108
 109     for i in range(64):
 110         RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
 111
 112 When Vectorised with SVP64, as usual both source and destination may be
 113 Vector or Scalar.
 114
 115 *Programmer's note: a dynamic ternary lookup may be synthesised from
 116 a pair of `binlut` instructions followed by a `ternlogi` to select which
 117 to merge. Use `nh` to select which nibble to use as the lookup table
 118 from the RC source register (`nh=1` nibble high), i.e. keeping
 119 an 8-bit LUT3 in RC, the first `binlut` instruction may set nh=0 and
 120 the second nh=1.*
 121
 122 ## crternlogi
 123
 124 another mode selection would be CRs not Ints.
 125
 126 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
 127 | -- | -- | --- | --- | --- |-----|----- | -----|--|
 128 | NN | BT | BA  | BB  | BC  |m0-2 | imm  |  01  |m3|
 129
 130     mask = m0-3
 131     for i in range(4):
 132         a,b,c = CRs[BA][i], CRs[BB][i], CRs[BC][i])
 133         if mask[i] CRs[BT][i] = lut3(imm, a, b, c)
 134
 135 This instruction is remarkably similar to the existing crops, `crand` etc.
 136 which have been noted to be a 4-bit (binary) LUT.  In effect `crternlogi`
 137 is the ternary LUT version of crops, having an 8-bit LUT.
 138
 139 ## crbinlog
 140
 141 With ternary (LUT3) dynamic instructions being very costly,
 142 and CR Fields being only 4 bit, a binary (LUT2) variant is better
 143
 144 | 0.5|6.8 | 9.11|12.14|15.17|18.21|22...30  |31|
 145 | -- | -- | --- | --- | --- |-----| -------- |--|
 146 | NN | BT | BA  | BB  | BC  |m0-m3|000101110 |0 |
 147
 148     mask = m0..m3
 149     for i in range(4):
 150         a,b = CRs[BA][i], CRs[BB][i])
 151         if mask[i] CRs[BT][i] = lut2(CRs[BC], a, b)
 152
 153 When SVP64 Vectorised any of the 4 operands may be Scalar or
 154 Vector, including `BC` meaning that multiple different dynamic
 155 lookups may be performed with a single instruction.
 156
 157 *Programmer's note: just as with binlut and ternlogi, a pair
 158  of crbinlog instructions followed by a merging crternlogi may
 159  be deployed to synthesise dynamic ternary (LUT3) CR Field
 160  manipulation*
 161
 162 # int ops
 163
 164 ## min/m
 165
 166 required for the [[sv/av_opcodes]]
 167
 168 signed and unsigned min/max for integer.  this is sort-of partly
 169 synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg
 170 is one of the sources, but not both signed and unsigned.  when the dest
 171 is also one of the srces and the mv fails due to the CR bittest failing
 172 this will only overwrite the dest where the src is greater (or less).
 173
 174 signed/unsigned min/max gives more flexibility.
 175
 176 X-Form
 177
 178 * XO=0001001110, itype=0b00 min, unsigned
 179 * XO=0101001110, itype=0b01 min, signed
 180 * XO=0011001110, itype=0b10 max, unsigned
 181 * XO=0111001110, itype=0b11 max, signed
 182
 183
 184 ```
 185 uint_xlen_t mins(uint_xlen_t rs1, uint_xlen_t rs2)
 186 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 187 }
 188 uint_xlen_t maxs(uint_xlen_t rs1, uint_xlen_t rs2)
 189 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 190 }
 191 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 192 { return rs1 < rs2 ? rs1 : rs2;
 193 }
 194 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 195 { return rs1 > rs2 ? rs1 : rs2;
 196 }
 197 ```
 198
 199 ## average
 200
 201 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 202 but not scalar
 203
 204 ```
 205 uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
 206      return (rs1 + rs2 + 1) >> 1:
 207 }
 208 ```
 209
 210 ## absdu
 211
 212 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 213 but not scalar
 214
 215 ```
 216 uint_xlen_t absdu(uint_xlen_t rs1, uint_xlen_t rs2) {
 217      return (src1 > src2) ? (src1-src2) : (src2-src1)
 218 }
 219 ```
 220
 221 ## abs-accumulate
 222
 223 required for the [[sv/av_opcodes]], these are needed for motion estimation.
 224 both are overwrite on RS.
 225
 226 ```
 227 uint_xlen_t uintabsacc(uint_xlen_t rs, uint_xlen_t ra, uint_xlen_t rb) {
 228      return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
 229 }
 230 uint_xlen_t intabsacc(uint_xlen_t rs, int_xlen_t ra, int_xlen_t rb) {
 231      return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
 232 }
 233 ```
 234
 235 For SVP64, the twin Elwidths allows e.g. a 16 bit accumulator for 8 bit
 236 differences.  Form is `RM-1P-3S1D` where RS-as-source has a separate
 237 SVP64 designation from RS-as-dest. This gives a limited range of
 238 non-overwrite capability.
 239
 240 # shift-and-add <a name="shift-add"> </a>
 241
 242 Power ISA is missing LD/ST with shift, which is present in both ARM and x86.
 243 Too complex to add more LD/ST, a compromise is to add shift-and-add.
 244 Replaces a pair of explicit instructions in hot-loops.
 245
 246 ```
 247 uint_xlen_t shadd(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
 248     sh = sh & 0x3;
 249     return (rs1 << (sh+1)) + rs2;
 250 }
 251
 252 uint_xlen_t shadduw(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
 253     uint_xlen_t rs1z = rs1 & 0xFFFFFFFF;
 254     sh = sh & 0x3;
 255     return (rs1z << (sh+1)) + rs2;
 256 }
 257 ```
 258
 259 # bitmask set
 260
 261 based on RV bitmanip singlebit set, instruction format similar to shift
 262 [[isa/fixedshift]].  bmext is actually covered already (shift-with-mask
 263 rldicl but only immediate version).  however bitmask-invert is not,
 264 and set/clr are not covered, although they can use the same Shift ALU.
 265
 266 bmext (RB) version is not the same as rldicl because bmext is a right
 267 shift by RC, where rldicl is a left rotate.  for the immediate version
 268 this does not matter, so a bmexti is not required.  bmrev however there
 269 is no direct equivalent and consequently a bmrevi is required.
 270
 271 bmset (register for mask amount) is particularly useful for creating
 272 predicate masks where the length is a dynamic runtime quantity.
 273 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask"
 274 in a single instruction without needing to initialise or depend on any
 275 other registers.
 276
 277 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 278 | -- | -- | --- | --- | --- | ------- |--| ----- |
 279 | NN | RS | RA  | RB  | RC  | mode 010 |Rc| bm\*   |
 280
 281 Immediate-variant is an overwrite form:
 282
 283 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 284 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 285 | NN | RS | RB  | sh  | SH | itype | 1000 110 |Rc| bm\*i |
 286
 287 ```
 288 def MASK(x, y):
 289      if x < y:
 290          x = x+1
 291          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 292          mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
 293      elif x == y:
 294          return 1 << x
 295      else:
 296          x = x+1
 297          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 298          mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
 299      return mask_a ^ mask_b
 300
 301
 302 uint_xlen_t bmset(RS, RB, sh)
 303 {
 304     int shamt = RB & (XLEN - 1);
 305     mask = (2<<sh)-1;
 306     return RS | (mask << shamt);
 307 }
 308
 309 uint_xlen_t bmclr(RS, RB, sh)
 310 {
 311     int shamt = RB & (XLEN - 1);
 312     mask = (2<<sh)-1;
 313     return RS & ~(mask << shamt);
 314 }
 315
 316 uint_xlen_t bminv(RS, RB, sh)
 317 {
 318     int shamt = RB & (XLEN - 1);
 319     mask = (2<<sh)-1;
 320     return RS ^ (mask << shamt);
 321 }
 322
 323 uint_xlen_t bmext(RS, RB, sh)
 324 {
 325     int shamt = RB & (XLEN - 1);
 326     mask = (2<<sh)-1;
 327     return mask & (RS >> shamt);
 328 }
 329 ```
 330
 331 bitmask extract with reverse.  can be done by bit-order-inverting all
 332 of RB and getting bits of RB from the opposite end.
 333
 334 when RA is zero, no shift occurs. this makes bmextrev useful for
 335 simply reversing all bits of a register.
 336
 337 ```
 338 msb = ra[5:0];
 339 rev[0:msb] = rb[msb:0];
 340 rt = ZE(rev[msb:0]);
 341
 342 uint_xlen_t bmrevi(RA, RB, sh)
 343 {
 344     int shamt = XLEN-1;
 345     if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
 346     shamt = (XLEN-1)-shamt;  # shift other end
 347     brb = bitreverse(GPR(RB))     # swap LSB-MSB
 348     mask = (2<<sh)-1;
 349     return mask & (brb >> shamt);
 350 }
 351
 352 uint_xlen_t bmrev(RA, RB, RC) {
 353     return bmrevi(RA, RB, GPR(RC) & 0b111111);
 354 }
 355 ```
 356
 357 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   | Form     |
 358 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
 359 | NN | RT | RA  | RB  | sh  | 1111    |Rc| bmrevi | MDS-Form |
 360
 361 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name   | Form     |
 362 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
 363 | NN | RT | RA  | RB  | RC  | 11110   |Rc| bmrev  | VA2-Form |
 364
 365 # grevlut <a name="grevlut"> </a>
 366
 367 generalised reverse combined with a pair of LUT2s and allowing
 368 a constant `0b0101...0101` when RA=0, and an option to invert
 369 (including when RA=0, giving a constant 0b1010...1010 as the
 370 initial value) provides a wide range of instructions
 371 and a means to set hundreds of regular 64 bit patterns with one
 372 single 32 bit instruction.
 373
 374 the two LUT2s are applied left-half (when not swapping)
 375 and right-half (when swapping) so as to allow a wider
 376 range of options.
 377
 378 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
 379
 380 * A value of `0b11001010` for the immediate provides
 381 the functionality of a standard "grev".
 382 * `0b11101110` provides gorc
 383
 384 grevlut should be arranged so as to produce the constants
 385 needed to put into bext (bitextract) so as in turn to
 386 be able to emulate x86 pmovmask instructions
 387 <https://www.felixcloutier.com/x86/pmovmskb>.
 388 This only requires 2 instructions (grevlut, bext).
 389
 390 Note that if the mask is required to be placed
 391 directly into CR Fields (for use as CR Predicate
 392 masks rather than a integer mask) then sv.cmpi or sv.ori
 393 may be used instead, bearing in mind that sv.ori
 394 is a 64-bit instruction, and `VL` must have been
 395 set to the required length:
 396
 397     sv.ori./elwid=8 r10.v, r10.v, 0
 398
 399 The following settings provide the required mask constants:
 400
 401 | RA=0     | RB      | imm        | iv | result        |
 402 | -------  | ------- | ---------- | -- | ----------    |
 403 | 0x555..  | 0b10    | 0b01101100 | 0  | 0x111111...   |
 404 | 0x555..  | 0b110   | 0b01101100 | 0  | 0x010101...   |
 405 | 0x555..  | 0b1110  | 0b01101100 | 0  | 0x00010001...   |
 406 | 0x555..  | 0b10    | 0b11000110 | 1  | 0x88888...   |
 407 | 0x555..  | 0b110   | 0b11000110 | 1  | 0x808080...   |
 408 | 0x555..  | 0b1110  | 0b11000110 | 1  | 0x80008000...   |
 409
 410 Better diagram showing the correct ordering of shamt (RB).  A LUT2
 411 is applied to all locations marked in red using the first 4
 412 bits of the immediate, and a separate LUT2 applied to all
 413 locations in green using the upper 4 bits of the immediate.
 414
 415 <img src="/openpower/sv/grevlut.png" width=700 />
 416
 417 demo code [[openpower/sv/grevlut.py]]
 418
 419 ```
 420 lut2(imm, a, b):
 421     idx = b << 1 | a
 422     return imm[idx] # idx by LSB0 order
 423
 424 dorow(imm8, step_i, chunksize, us32b):
 425     for j in 0 to 31 if is32b else 63:
 426         if (j&chunk_size) == 0
 427            imm = imm8[0..3]
 428         else
 429            imm = imm8[4..7]
 430         step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
 431     return step_o
 432
 433 uint64_t grevlut(uint64_t RA, uint64_t RB, uint8 imm, bool iv, bool is32b)
 434 {
 435     uint64_t x = 0x5555_5555_5555_5555;
 436     if (RA != 0) x = GPR(RA);
 437     if (iv) x = ~x;
 438     int shamt = RB & 31 if is32b else 63
 439     for i in 0 to (6-is32b)
 440         step = 1<<i
 441         if (shamt & step) x = dorow(imm, x, step, is32b)
 442     return x;
 443 }
 444 ```
 445
 446 A variant may specify different LUT-pairs per row,
 447 using one byte of RB for each.  If it is desired that
 448 a particular row-crossover shall not be applied it is
 449 a simple matter to set the appropriate LUT-pair in RB
 450 to effect an identity transform for that row (`0b11001010`).
 451
 452 ```
 453 uint64_t grevlutr(uint64_t RA, uint64_t RB, bool iv, bool is32b)
 454 {
 455     uint64_t x = 0x5555_5555_5555_5555;
 456     if (RA != 0) x = GPR(RA);
 457     if (iv) x = ~x;
 458     for i in 0 to (6-is32b)
 459         step = 1<<i
 460         imm = (RB>>(i*8))&0xff
 461         x = dorow(imm, x, step, is32b)
 462     return x;
 463 }
 464
 465 ```
 466
 467 | 0.5|6.10|11.15|16.20 |21..28   | 29.30|31| name | Form |
 468 | -- | -- | --- | ---  | -----   | -----|--| ------ | ----- |
 469 | NN | RT | RA  | s0-4 | im0-7   | 1 iv |s5| grevlogi |      |
 470 | NN | RT | RA  | RB   | im0-7   | 01   |0 | grevlog |           |
 471
 472 An equivalent to `grevlogw` may be synthesised by setting the
 473 appropriate bits in RB to set the top half of RT to zero.
 474 Thus an explicit grevlogw instruction is not necessary.
 475
 476 # xperm
 477
 478 based on RV bitmanip.
 479
 480 RA contains a vector of indices to select parts of RB to be
 481 copied to RT.  The immediate-variant allows up to an 8 bit
 482 pattern (repeated) to be targetted at different parts of RT.
 483
 484 xperm shares some similarity with one of the uses of bmator
 485 in that xperm indices are binary addressing where bitmator
 486 may be considered to be unary addressing.
 487
 488 ```
 489 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
 490 {
 491     uint_xlen_t r = 0;
 492     uint_xlen_t sz = 1LL << sz_log2;
 493     uint_xlen_t mask = (1LL << sz) - 1;
 494     uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
 495     for (int i = 0; i < XLEN; i += sz) {
 496         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 497         if (pos < XLEN)
 498             r |= ((RB >> pos) & mask) << i;
 499     }
 500     return r;
 501 }
 502 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 503 {
 504     uint_xlen_t r = 0;
 505     uint_xlen_t sz = 1LL << sz_log2;
 506     uint_xlen_t mask = (1LL << sz) - 1;
 507     for (int i = 0; i < XLEN; i += sz) {
 508         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 509         if (pos < XLEN)
 510             r |= ((RB >> pos) & mask) << i;
 511     }
 512     return r;
 513 }
 514 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 515 {  return xperm(RA, RB, 2); }
 516 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 517 {  return xperm(RA, RB, 3); }
 518 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 519 {  return xperm(RA, RB, 4); }
 520 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 521 {  return xperm(RA, RB, 5); }
 522 ```
 523
 524 # bitmatrix
 525
 526 bmatflip and bmatxor is found in the Cray XMT, and in x86 is known
 527 as GF2P8AFFINEQB. uses:
 528
 529 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
 530 * SM4, Reed Solomon, RAID6
 531   <https://stackoverflow.com/questions/59124720/what-are-the-avx-512-galois-field-related-instructions-for>
 532 * Vector bit-reverse <https://reviews.llvm.org/D91515?id=305411>
 533 * Affine Inverse <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
 534
 535 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name      |  Form   |
 536 | -- | -- | --- | --- | -- | ----- | -------- |--| ----      | ------- |
 537 | NN | RS | RA  |im04 | im5|  1 1  | im67 00 110 |Rc| bmatxori  | TODO    |
 538
 539
 540 ```
 541 uint64_t bmatflip(uint64_t RA)
 542 {
 543     uint64_t x = RA;
 544     x = shfl64(x, 31);
 545     x = shfl64(x, 31);
 546     x = shfl64(x, 31);
 547     return x;
 548 }
 549
 550 uint64_t bmatxori(uint64_t RS, uint64_t RA, uint8_t imm) {
 551     // transpose of RA
 552     uint64_t RAt = bmatflip(RA);
 553     uint8_t u[8]; // rows of RS
 554     uint8_t v[8]; // cols of RA
 555     for (int i = 0; i < 8; i++) {
 556         u[i] = RS >> (i*8);
 557         v[i] = RAt >> (i*8);
 558     }
 559     uint64_t bit, x = 0;
 560     for (int i = 0; i < 64; i++) {
 561         bit = (imm >> (i%8)) & 1;
 562         bit ^= pcnt(u[i / 8] & v[i % 8]) & 1;
 563         x |= bit << i;
 564     }
 565     return x;
 566 }
 567
 568 uint64_t bmatxor(uint64_t RA, uint64_t RB) {
 569     return bmatxori(RA, RB, 0xff)
 570 }
 571
 572 uint64_t bmator(uint64_t RA, uint64_t RB) {
 573     // transpose of RB
 574     uint64_t RBt = bmatflip(RB);
 575     uint8_t u[8]; // rows of RA
 576     uint8_t v[8]; // cols of RB
 577     for (int i = 0; i < 8; i++) {
 578         u[i] = RA >> (i*8);
 579         v[i] = RBt >> (i*8);
 580     }
 581     uint64_t x = 0;
 582     for (int i = 0; i < 64; i++) {
 583         if ((u[i / 8] & v[i % 8]) != 0)
 584             x |= 1LL << i;
 585     }
 586     return x;
 587 }
 588
 589 uint64_t bmatand(uint64_t RA, uint64_t RB) {
 590     // transpose of RB
 591     uint64_t RBt = bmatflip(RB);
 592     uint8_t u[8]; // rows of RA
 593     uint8_t v[8]; // cols of RB
 594     for (int i = 0; i < 8; i++) {
 595         u[i] = RA >> (i*8);
 596         v[i] = RBt >> (i*8);
 597     }
 598     uint64_t x = 0;
 599     for (int i = 0; i < 64; i++) {
 600         if ((u[i / 8] & v[i % 8]) == 0xff)
 601             x |= 1LL << i;
 602     }
 603     return x;
 604 }
 605 ```
 606
 607 # Introduction to Carry-less and GF arithmetic
 608
 609 * obligatory xkcd <https://xkcd.com/2595/>
 610
 611 There are three completely separate types of Galois-Field-based arithmetic
 612 that we implement which are not well explained even in introductory
 613 literature.  A slightly oversimplified explanation is followed by more
 614 accurate descriptions:
 615
 616 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
 617   but is accidentally referred to as GF(2) - see below as to why.
 618 * `GF(p)` modulo arithmetic with a Prime number, these are "proper"
 619    Galois Fields
 620 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
 621   (2^N) and a second "reducing" polynomial (similar to a prime number), these
 622   are said to be GF(2^N) arithmetic.
 623
 624 further detailed and more precise explanations are provided below
 625
 626 * **Polynomials with coefficients in `GF(2)`**
 627   (aka. Carry-less arithmetic -- the `cl*` instructions).
 628   This isn't actually a Galois Field, but its coefficients are. This is
 629   basically binary integer addition, subtraction, and multiplication like
 630   usual, except that carries aren't propagated at all, effectively turning
 631   both addition and subtraction into the bitwise xor operation. Division and
 632   remainder are defined to match how addition and multiplication works.
 633 * **Galois Fields with a prime size**
 634   (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
 635   This is basically just the integers mod `p`.
 636 * **Galois Fields with a power-of-a-prime size**
 637   (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
 638   integer `n > 0`).
 639   We only implement these for `p == 2`, called Binary Galois Fields
 640   (`GF(2^n)` -- the `gfb*` instructions).
 641   For any prime `p`, `GF(p^n)` is implemented as polynomials with
 642   coefficients in `GF(p)` and degree `< n`, where the polynomials are the
 643   remainders of dividing by a specificly chosen polynomial in `GF(p)` called
 644   the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
 645   Polynomial must be an irreducable polynomial (like primes, but for
 646   polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
 647   and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
 648   affect `GF(p^n)`'s mathematical shape, all that changes is the specific
 649   polynomials used to implement `GF(p^n)`.
 650
 651 Many implementations and much of the literature do not make a clear
 652 distinction between these three categories, which makes it confusing
 653 to understand what their purpose and value is.
 654
 655 * carry-less multiply is extremely common and is used for the ubiquitous
 656   CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
 657 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
 658   has significant uses throughout cryptography
 659 * GF(p) is the basis again of a significant quantity of algorithms
 660   (TODO, list them, jacob knows what they are), even though the
 661   modulo is limited to be below 64-bit (size of a scalar int)
 662
 663 # Instructions for Carry-less Operations
 664
 665 aka. Polynomials with coefficients in `GF(2)`
 666
 667 Carry-less addition/subtraction is simply XOR, so a `cladd`
 668 instruction is not provided since the `xor[i]` instruction can be used instead.
 669
 670 These are operations on polynomials with coefficients in `GF(2)`, with the
 671 polynomial's coefficients packed into integers with the following algorithm:
 672
 673 ```python
 674 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
 675 ```
 676
 677 ## Carry-less Multiply Instructions
 678
 679 based on RV bitmanip
 680 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
 681 <https://www.felixcloutier.com/x86/pclmulqdq> and
 682 <https://en.m.wikipedia.org/wiki/Carry-less_product>
 683
 684 They are worth adding as their own non-overwrite operations
 685 (in the same pipeline).
 686
 687 ### `clmul` Carry-less Multiply
 688
 689 ```python
 690 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
 691 ```
 692
 693 ### `clmulh` Carry-less Multiply High
 694
 695 ```python
 696 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
 697 ```
 698
 699 ### `clmulr` Carry-less Multiply (Reversed)
 700
 701 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
 702 bit-reversed inputs.
 703
 704 ```python
 705 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
 706 ```
 707
 708 ## `clmadd` Carry-less Multiply-Add
 709
 710 ```
 711 clmadd RT, RA, RB, RC
 712 ```
 713
 714 ```
 715 (RT) = clmul((RA), (RB)) ^ (RC)
 716 ```
 717
 718 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
 719
 720 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
 721 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
 722 to avoid the need for a temp register. RS is written to as well as RT.
 723
 724 Note: Polynomials over GF(2) are a Ring rather than a Field, so, because the
 725 definition of the Inverse Discrete Fourier Transform involves calculating a
 726 multiplicative inverse, which may not exist in every Ring, therefore the
 727 Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
 728 to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
 729 TODO: check with someone who knows for sure if that's correct.)
 730
 731 ```
 732 cltmadd RT, RA, RB, RC
 733 ```
 734
 735 TODO: add link to explanation for where `RS` comes from.
 736
 737 ```
 738 a = (RA)
 739 c = (RC)
 740 # read all inputs before writing to any outputs in case
 741 # an input overlaps with an output register.
 742 (RT) = clmul(a, (RB)) ^ c
 743 (RS) = a ^ c
 744 ```
 745
 746 ## `cldivrem` Carry-less Division and Remainder
 747
 748 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
 749 for other instructions.
 750
 751 ```python
 752 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
 753 ```
 754
 755 ## `cldiv` Carry-less Division
 756
 757 ```
 758 cldiv RT, RA, RB
 759 ```
 760
 761 ```
 762 n = (RA)
 763 d = (RB)
 764 q, r = cldivrem(n, d, width=XLEN)
 765 (RT) = q
 766 ```
 767
 768 ## `clrem` Carry-less Remainder
 769
 770 ```
 771 clrem RT, RA, RB
 772 ```
 773
 774 ```
 775 n = (RA)
 776 d = (RB)
 777 q, r = cldivrem(n, d, width=XLEN)
 778 (RT) = r
 779 ```
 780
 781 # Instructions for Binary Galois Fields `GF(2^m)`
 782
 783 see:
 784
 785 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 786 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 787 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
 788
 789 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
 790 instruction is not provided since the `xor[i]` instruction can be used instead.
 791
 792 ## `GFBREDPOLY` SPR -- Reducing Polynomial
 793
 794 In order to save registers and to make operations orthogonal with standard
 795 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
 796 This also allows hardware to pre-compute useful parameters (such as the
 797 degree, or look-up tables) based on the reducing polynomial, and store them
 798 alongside the SPR in hidden registers, only recomputing them whenever the SPR
 799 is written to, rather than having to recompute those values for every
 800 instruction.
 801
 802 Because Galois Fields require the reducing polynomial to be an irreducible
 803 polynomial, that guarantees that any polynomial of `degree > 1` must have
 804 the LSB set, since otherwise it would be divisible by the polynomial `x`,
 805 making it reducible, making whatever we're working on no longer a Field.
 806 Therefore, we can reuse the LSB to indicate `degree == XLEN`.
 807
 808 ```python
 809 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
 810 ```
 811
 812 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
 813
 814 unless this is an immediate op, `mtspr` is completely sufficient.
 815
 816 ```python
 817 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
 818 ```
 819
 820 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
 821
 822 ```
 823 gfbmul RT, RA, RB
 824 ```
 825
 826 ```python
 827 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
 828 ```
 829
 830 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
 831
 832 ```
 833 gfbmadd RT, RA, RB, RC
 834 ```
 835
 836 ```python
 837 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
 838 ```
 839
 840 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
 841
 842 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
 843 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
 844 for a temp register. RS is written to as well as RT.
 845
 846 ```
 847 gfbtmadd RT, RA, RB, RC
 848 ```
 849
 850 TODO: add link to explanation for where `RS` comes from.
 851
 852 ```
 853 a = (RA)
 854 c = (RC)
 855 # read all inputs before writing to any outputs in case
 856 # an input overlaps with an output register.
 857 (RT) = gfbmadd(a, (RB), c)
 858 # use gfbmadd again since it reduces the result
 859 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
 860 ```
 861
 862 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
 863
 864 ```
 865 gfbinv RT, RA
 866 ```
 867
 868 ```python
 869 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
 870 ```
 871
 872 # Instructions for Prime Galois Fields `GF(p)`
 873
 874 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
 875
 876 ## `gfpadd` Prime Galois Field `GF(p)` Addition
 877
 878 ```
 879 gfpadd RT, RA, RB
 880 ```
 881
 882 ```python
 883 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
 884 ```
 885
 886 the addition happens on infinite-precision integers
 887
 888 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
 889
 890 ```
 891 gfpsub RT, RA, RB
 892 ```
 893
 894 ```python
 895 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
 896 ```
 897
 898 the subtraction happens on infinite-precision integers
 899
 900 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
 901
 902 ```
 903 gfpmul RT, RA, RB
 904 ```
 905
 906 ```python
 907 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
 908 ```
 909
 910 the multiplication happens on infinite-precision integers
 911
 912 ## `gfpinv` Prime Galois Field `GF(p)` Invert
 913
 914 ```
 915 gfpinv RT, RA
 916 ```
 917
 918 Some potential hardware implementations are found in:
 919 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
 920
 921 ```python
 922 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
 923 ```
 924
 925 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
 926
 927 ```
 928 gfpmadd RT, RA, RB, RC
 929 ```
 930
 931 ```python
 932 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
 933 ```
 934
 935 the multiplication and addition happens on infinite-precision integers
 936
 937 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
 938
 939 ```
 940 gfpmsub RT, RA, RB, RC
 941 ```
 942
 943 ```python
 944 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
 945 ```
 946
 947 the multiplication and subtraction happens on infinite-precision integers
 948
 949 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
 950
 951 ```
 952 gfpmsubr RT, RA, RB, RC
 953 ```
 954
 955 ```python
 956 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
 957 ```
 958
 959 the multiplication and subtraction happens on infinite-precision integers
 960
 961 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
 962
 963 Used in combination with SV FFT REMAP to perform
 964 a full Number-Theoretic-Transform in-place. Possible by having 3-in 2-out,
 965 to avoid the need for a temp register. RS is written
 966 to as well as RT.
 967
 968 ```
 969 gfpmaddsubr RT, RA, RB, RC
 970 ```
 971
 972 TODO: add link to explanation for where `RS` comes from.
 973
 974 ```
 975 factor1 = (RA)
 976 factor2 = (RB)
 977 term = (RC)
 978 # read all inputs before writing to any outputs in case
 979 # an input overlaps with an output register.
 980 (RT) = gfpmadd(factor1, factor2, term)
 981 (RS) = gfpmsubr(factor1, factor2, term)
 982 ```
 983
 984 # Already in POWER ISA or subsumed
 985
 986 Lists operations either included as part of
 987 other bitmanip operations, or are already in
 988 Power ISA.
 989
 990 ## cmix
 991
 992 based on RV bitmanip, covered by ternlog bitops
 993
 994 ```
 995 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 996     return (RA & RB) | (RC & ~RB);
 997 }
 998 ```
 999
1000 ## count leading/trailing zeros with mask
1001
1002 in v3.1 p105
1003
1004 ```
1005 count = 0
1006 do i = 0 to 63 if((RB)i=1) then do
1007 if((RS)i=1) then break end end count ← count + 1
1008 RA ← EXTZ64(count)
1009 ```
1010
1011 ## bit deposit
1012
1013 pdepd VRT,VRA,VRB, identical to RV bitmamip bdep, found already in v3.1 p106
1014
1015     do while(m < 64)
1016        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1017           result = VSR[VRA+32].dword[i].bit[63-k]
1018           VSR[VRT+32].dword[i].bit[63-m] = result
1019           k = k + 1
1020        m = m + 1
1021
1022 ```
1023
1024 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1025 {
1026     uint_xlen_t r = 0;
1027     for (int i = 0, j = 0; i < XLEN; i++)
1028         if ((RB >> i) & 1) {
1029             if ((RA >> j) & 1)
1030                 r |= uint_xlen_t(1) << i;
1031             j++;
1032         }
1033     return r;
1034 }
1035
1036 ```
1037
1038 ## bit extract
1039
1040 other way round: identical to RV bext: pextd, found in v3.1 p196
1041
1042 ```
1043 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1044 {
1045     uint_xlen_t r = 0;
1046     for (int i = 0, j = 0; i < XLEN; i++)
1047         if ((RB >> i) & 1) {
1048             if ((RA >> i) & 1)
1049                 r |= uint_xlen_t(1) << j;
1050             j++;
1051         }
1052     return r;
1053 }
1054 ```
1055
1056 ## centrifuge
1057
1058 found in v3.1 p106 so not to be added here
1059
1060 ```
1061 ptr0 = 0
1062 ptr1 = 0
1063 do i = 0 to 63
1064     if((RB)i=0) then do
1065        resultptr0 = (RS)i
1066     end
1067     ptr0 = ptr0 + 1
1068     if((RB)63-i==1) then do
1069         result63-ptr1 = (RS)63-i
1070     end
1071     ptr1 = ptr1 + 1
1072 RA = result
1073 ```
1074
1075 ## bit to byte permute
1076
1077 similar to matrix permute in RV bitmanip, which has XOR and OR variants,
1078 these perform a transpose (bmatflip).
1079 TODO this looks VSX is there a scalar variant
1080 in v3.0/1 already
1081
1082     do j = 0 to 7
1083       do k = 0 to 7
1084          b = VSR[VRB+32].dword[i].byte[k].bit[j]
1085          VSR[VRT+32].dword[i].byte[j].bit[k] = b
1086
1087 ## grev
1088
1089 superceded by grevlut
1090
1091 based on RV bitmanip, this is also known as a butterfly network. however
1092 where a butterfly network allows setting of every crossbar setting in
1093 every row and every column, generalised-reverse (grev) only allows
1094 a per-row decision: every entry in the same row must either switch or
1095 not-switch.
1096
1097 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
1098
1099 ```
1100 uint64_t grev64(uint64_t RA, uint64_t RB)
1101 {
1102     uint64_t x = RA;
1103     int shamt = RB & 63;
1104     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
1105                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
1106     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
1107                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
1108     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
1109                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
1110     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
1111                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
1112     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
1113                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
1114     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
1115                         ((x & 0xFFFFFFFF00000000LL) >> 32);
1116     return x;
1117 }
1118
1119 ```
1120
1121 ## gorc
1122
1123 based on RV bitmanip, gorc is superceded by grevlut
1124
1125 ```
1126 uint32_t gorc32(uint32_t RA, uint32_t RB)
1127 {
1128     uint32_t x = RA;
1129     int shamt = RB & 31;
1130     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
1131     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
1132     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
1133     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
1134     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
1135     return x;
1136 }
1137 uint64_t gorc64(uint64_t RA, uint64_t RB)
1138 {
1139     uint64_t x = RA;
1140     int shamt = RB & 63;
1141     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
1142                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
1143     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
1144                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
1145     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
1146                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
1147     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
1148                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
1149     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
1150                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
1151     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
1152                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
1153     return x;
1154 }
1155
1156 ```
1157
1158
1159 # Appendix
1160
1161 see [[bitmanip/appendix]]
1162