openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 [[!toc levels=1]]
   4
   5 # Implementation Log
   6
   7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10 * binutils <https://bugs.libre-soc.org/show_bug.cgi?id=836>
  11 * shift-and-add <https://bugs.libre-soc.org/show_bug.cgi?id=968>
  12
  13 # bitmanipulation
  14
  15 **DRAFT STATUS**
  16
  17 pseudocode: [[openpower/isa/bitmanip]]
  18
  19 this extension amalgamates bitmanipulation primitives from many sources,
  20 including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
  21 Also included are DSP/Multimedia operations suitable for Audio/Video.
  22 Vectorisation and SIMD are removed: these are straight scalar (element)
  23 operations making them suitable for embedded applications.  Vectorisation
  24 Context is provided by [[openpower/sv]].
  25
  26 When combined with SV, scalar variants of bitmanip operations found in
  27 VSX are added so that the Packed SIMD aspects of VSX may be retired as
  28 "legacy" in the far future (10 to 20 years).  Also, VSX is hundreds of
  29 opcodes, requires 128 bit pathways, and is wholly unsuited to low power
  30 or embedded scenarios.
  31
  32 ternlogv is experimental and is the only operation that may be considered
  33 a "Packed SIMD".  It is added as a variant of the already well-justified
  34 ternlog operation (done in AVX512 as an immediate only) "because it
  35 looks fun". As it is based on the LUT4 concept it will allow accelerated
  36 emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to
  37 achieve similar objectives.
  38
  39 general-purpose Galois Field 2^M operations are added so as to avoid
  40 huge custom opcode proliferation across many areas of Computer Science.
  41 however for convenience and also to avoid setup costs, some of the more
  42 common operations (clmul, crc32) are also added.  The expectation is
  43 that these operations would all be covered by the same pipeline.
  44
  45 note that there are brownfield spaces below that could incorporate
  46 some of the set-before-first and other scalar operations listed in
  47 [[sv/mv.swizzle]],
  48 [[sv/vector_ops]], [[sv/int_fp_mv]] and the [[sv/av_opcodes]] as well as
  49 [[sv/setvl]], [[sv/svstep]], [[sv/remap]]
  50
  51 Useful resource:
  52
  53 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  54 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  55 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
  56 * <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
  57
  58 [[!inline pages="openpower/sv/draft_opcode_tables" quick="yes" raw="yes" ]]
  59
  60 # binary and ternary bitops
  61
  62 Similar to FPGA LUTs: for two (binary) or three (ternary) inputs take
  63 bits from each input, concatenate them and perform a lookup into a
  64 table using an 8-8-bit immediate (for the ternary instructions), or in
  65 another register (4-bit for the binary instructions).  The binary lookup
  66 instructions have CR Field lookup variants due to CR Fields being 4 bit.
  67
  68 Like the x86 AVX512F
  69 [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq)
  70 instructions.
  71
  72 ## ternlogi
  73
  74 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
  75 | -- | -- | --- | --- | ----- | --- |--|
  76 | NN | RT | RA  | RB  | im0-7 |  00 |Rc|
  77
  78     lut3(imm, a, b, c):
  79         idx = c << 2 | b << 1 | a
  80         return imm[idx] # idx by LSB0 order
  81
  82     for i in range(64):
  83         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
  84
  85 ## binlut
  86
  87 Binary lookup is a dynamic LUT2 version of ternlogi. Firstly, the
  88 lookup table is 4 bits wide not 8 bits, and secondly the lookup
  89 table comes from a register not an immediate.
  90
  91 | 0.5|6.10|11.15|16.20| 21..25|26..31  | Form    |
  92 | -- | -- | --- | --- | ----- |--------|---------|
  93 | NN | RT | RA  | RB  | RC    |nh 00001| VA-Form |
  94 | NN | RT | RA  | RB  | /BFA/ |0  01001| VA-Form |
  95
  96 For binlut, the 4-bit LUT may be selected from either the high nibble
  97 or the low nibble of the first byte of RC:
  98
  99     lut2(imm, a, b):
 100         idx = b << 1 | a
 101         return imm[idx] # idx by LSB0 order
 102
 103     imm = (RC>>(nh*4))&0b1111
 104     for i in range(64):
 105         RT[i] = lut2(imm, RB[i], RA[i])
 106
 107 For bincrlut, `BFA` selects the 4-bit CR Field as the LUT2:
 108
 109     for i in range(64):
 110         RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
 111
 112 When Vectorised with SVP64, as usual both source and destination may be
 113 Vector or Scalar.
 114
 115 *Programmer's note: a dynamic ternary lookup may be synthesised from
 116 a pair of `binlut` instructions followed by a `ternlogi` to select which
 117 to merge. Use `nh` to select which nibble to use as the lookup table
 118 from the RC source register (`nh=1` nibble high), i.e. keeping
 119 an 8-bit LUT3 in RC, the first `binlut` instruction may set nh=0 and
 120 the second nh=1.*
 121
 122 ## crternlogi
 123
 124 another mode selection would be CRs not Ints.
 125
 126 CRB-Form:
 127
 128 | 0.5|6.8 |9.10|11.13|14.15|16.18|19.25|26.30| 31|
 129 |----|----|----|-----|-----|-----|-----|-----|---|
 130 | NN | BF | msk|BFA  | msk | BFB | TLI | XO  |TLI|
 131
 132     for i in range(4):
 133         a,b,c = CRs[BF][i], CRs[BFA][i], CRs[BFB][i])
 134         if msk[i] CRs[BF][i] = lut3(imm, a, b, c)
 135
 136 This instruction is remarkably similar to the existing crops, `crand` etc.
 137 which have been noted to be a 4-bit (binary) LUT.  In effect `crternlogi`
 138 is the ternary LUT version of crops, having an 8-bit LUT.  However it
 139 is an overwrite instruction in order to save on register file ports,
 140 due to the mask requiring the contents of the BF to be both read and
 141 written.
 142
 143 Programmer's note: This instruction is useful when combined with Matrix REMAP
 144 in "Inner Product" Mode, creating Warshall Transitive Closure that has many
 145 applications in Computer Science.
 146
 147 ## crbinlog
 148
 149 With ternary (LUT3) dynamic instructions being very costly,
 150 and CR Fields being only 4 bit, a binary (LUT2) variant is better
 151
 152 CRB-Form:
 153
 154 | 0.5|6.8 |9.10|11.13|14.15|16.18|19.25|26.30| 31|
 155 |----|----|----|-----|-----|-----|-----|-----|---|
 156 | NN | BF | msk|BFA  | msk | BFB | //  | XO  | //|
 157
 158     for i in range(4):
 159         a,b = CRs[BF][i], CRs[BF][i])
 160         if msk[i] CRs[BF][i] = lut2(CRs[BFB], a, b)
 161
 162 When SVP64 Vectorised any of the 4 operands may be Scalar or
 163 Vector, including `BFB` meaning that multiple different dynamic
 164 lookups may be performed with a single instruction.  Note that
 165 this instruction is deliberately an overwrite in order to reduce
 166 the number of register file ports required: like `crternlogi`
 167 the contents of `BF` **must** be read due to the mask only
 168 writing back to non-masked-out bits of `BF`.
 169
 170 *Programmer's note: just as with binlut and ternlogi, a pair
 171  of crbinlog instructions followed by a merging crternlogi may
 172  be deployed to synthesise dynamic ternary (LUT3) CR Field
 173  manipulation*
 174
 175 # int ops
 176
 177 ## min/m
 178
 179 required for the [[sv/av_opcodes]]
 180
 181 signed and unsigned min/max for integer.  this is sort-of partly
 182 synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg
 183 is one of the sources, but not both signed and unsigned.  when the dest
 184 is also one of the srces and the mv fails due to the CR bittest failing
 185 this will only overwrite the dest where the src is greater (or less).
 186
 187 signed/unsigned min/max gives more flexibility.
 188
 189 X-Form
 190
 191 * XO=0001001110, itype=0b00 min, unsigned
 192 * XO=0101001110, itype=0b01 min, signed
 193 * XO=0011001110, itype=0b10 max, unsigned
 194 * XO=0111001110, itype=0b11 max, signed
 195
 196
 197 ```
 198 uint_xlen_t mins(uint_xlen_t rs1, uint_xlen_t rs2)
 199 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 200 }
 201 uint_xlen_t maxs(uint_xlen_t rs1, uint_xlen_t rs2)
 202 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 203 }
 204 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 205 { return rs1 < rs2 ? rs1 : rs2;
 206 }
 207 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 208 { return rs1 > rs2 ? rs1 : rs2;
 209 }
 210 ```
 211
 212 ## average
 213
 214 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 215 but not scalar
 216
 217 ```
 218 uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
 219      return (rs1 + rs2 + 1) >> 1:
 220 }
 221 ```
 222
 223 ## absdu
 224
 225 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 226 but not scalar
 227
 228 ```
 229 uint_xlen_t absdu(uint_xlen_t rs1, uint_xlen_t rs2) {
 230      return (src1 > src2) ? (src1-src2) : (src2-src1)
 231 }
 232 ```
 233
 234 ## abs-accumulate
 235
 236 required for the [[sv/av_opcodes]], these are needed for motion estimation.
 237 both are overwrite on RS.
 238
 239 ```
 240 uint_xlen_t uintabsacc(uint_xlen_t rs, uint_xlen_t ra, uint_xlen_t rb) {
 241      return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
 242 }
 243 uint_xlen_t intabsacc(uint_xlen_t rs, int_xlen_t ra, int_xlen_t rb) {
 244      return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
 245 }
 246 ```
 247
 248 For SVP64, the twin Elwidths allows e.g. a 16 bit accumulator for 8 bit
 249 differences.  Form is `RM-1P-3S1D` where RS-as-source has a separate
 250 SVP64 designation from RS-as-dest. This gives a limited range of
 251 non-overwrite capability.
 252
 253 # shift-and-add <a name="shift-add"> </a>
 254
 255 Power ISA is missing LD/ST with shift, which is present in both ARM and x86.
 256 Too complex to add more LD/ST, a compromise is to add shift-and-add.
 257 Replaces a pair of explicit instructions in hot-loops.
 258
 259 ```
 260 # 1.6.27 Z23-FORM
 261     |0     |6     |11    |15 |16     |21 |23    |31 |
 262     | PO   |  RT  |   RA     |   RB  |sm |   XO |Rc |
 263 ```
 264
 265 Pseudo-code (shadd):
 266
 267     n <- (RB)
 268     m <- sm + 1
 269     RT <- (n[m:XLEN-1] || [0]*m) + (RA)
 270
 271 Pseudo-code (shadduw):
 272
 273     n <- ([0]*(XLEN/2)) || (RB)[XLEN/2:XLEN-1]
 274     m <- sm + 1
 275     RT <- (n[m:XLEN-1] || [0]*m) + (RA)
 276
 277 ```
 278 uint_xlen_t shadd(uint_xlen_t RA, uint_xlen_t RB, uint8_t sm) {
 279     sm = sm & 0x3;
 280     return (RB << (sm+1)) + RA;
 281 }
 282
 283 uint_xlen_t shadduw(uint_xlen_t RA, uint_xlen_t RB, uint8_t sm) {
 284     uint_xlen_t n = RB & 0xFFFFFFFF;
 285     sm = sm & 0x3;
 286     return (n << (sm+1)) + RA;
 287 }
 288 ```
 289
 290 # bitmask set
 291
 292 based on RV bitmanip singlebit set, instruction format similar to shift
 293 [[isa/fixedshift]].  bmext is actually covered already (shift-with-mask
 294 rldicl but only immediate version).  however bitmask-invert is not,
 295 and set/clr are not covered, although they can use the same Shift ALU.
 296
 297 bmext (RB) version is not the same as rldicl because bmext is a right
 298 shift by RC, where rldicl is a left rotate.  for the immediate version
 299 this does not matter, so a bmexti is not required.  bmrev however there
 300 is no direct equivalent and consequently a bmrevi is required.
 301
 302 bmset (register for mask amount) is particularly useful for creating
 303 predicate masks where the length is a dynamic runtime quantity.
 304 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask"
 305 in a single instruction without needing to initialise or depend on any
 306 other registers.
 307
 308 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 309 | -- | -- | --- | --- | --- | ------- |--| ----- |
 310 | NN | RS | RA  | RB  | RC  | mode 010 |Rc| bm\*   |
 311
 312 Immediate-variant is an overwrite form:
 313
 314 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 315 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 316 | NN | RS | RB  | sh  | SH | itype | 1000 110 |Rc| bm\*i |
 317
 318 ```
 319 def MASK(x, y):
 320      if x < y:
 321          x = x+1
 322          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 323          mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
 324      elif x == y:
 325          return 1 << x
 326      else:
 327          x = x+1
 328          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 329          mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
 330      return mask_a ^ mask_b
 331
 332
 333 uint_xlen_t bmset(RS, RB, sh)
 334 {
 335     int shamt = RB & (XLEN - 1);
 336     mask = (2<<sh)-1;
 337     return RS | (mask << shamt);
 338 }
 339
 340 uint_xlen_t bmclr(RS, RB, sh)
 341 {
 342     int shamt = RB & (XLEN - 1);
 343     mask = (2<<sh)-1;
 344     return RS & ~(mask << shamt);
 345 }
 346
 347 uint_xlen_t bminv(RS, RB, sh)
 348 {
 349     int shamt = RB & (XLEN - 1);
 350     mask = (2<<sh)-1;
 351     return RS ^ (mask << shamt);
 352 }
 353
 354 uint_xlen_t bmext(RS, RB, sh)
 355 {
 356     int shamt = RB & (XLEN - 1);
 357     mask = (2<<sh)-1;
 358     return mask & (RS >> shamt);
 359 }
 360 ```
 361
 362 bitmask extract with reverse.  can be done by bit-order-inverting all
 363 of RB and getting bits of RB from the opposite end.
 364
 365 when RA is zero, no shift occurs. this makes bmextrev useful for
 366 simply reversing all bits of a register.
 367
 368 ```
 369 msb = ra[5:0];
 370 rev[0:msb] = rb[msb:0];
 371 rt = ZE(rev[msb:0]);
 372
 373 uint_xlen_t bmrevi(RA, RB, sh)
 374 {
 375     int shamt = XLEN-1;
 376     if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
 377     shamt = (XLEN-1)-shamt;  # shift other end
 378     brb = bitreverse(GPR(RB))     # swap LSB-MSB
 379     mask = (2<<sh)-1;
 380     return mask & (brb >> shamt);
 381 }
 382
 383 uint_xlen_t bmrev(RA, RB, RC) {
 384     return bmrevi(RA, RB, GPR(RC) & 0b111111);
 385 }
 386 ```
 387
 388 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   | Form     |
 389 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
 390 | NN | RT | RA  | RB  | sh  | 1111    |Rc| bmrevi | MDS-Form |
 391
 392 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name   | Form     |
 393 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
 394 | NN | RT | RA  | RB  | RC  | 11110   |Rc| bmrev  | VA2-Form |
 395
 396 # grevlut <a name="grevlut"> </a>
 397
 398 generalised reverse combined with a pair of LUT2s and allowing
 399 a constant `0b0101...0101` when RA=0, and an option to invert
 400 (including when RA=0, giving a constant 0b1010...1010 as the
 401 initial value) provides a wide range of instructions
 402 and a means to set hundreds of regular 64 bit patterns with one
 403 single 32 bit instruction.
 404
 405 the two LUT2s are applied left-half (when not swapping)
 406 and right-half (when swapping) so as to allow a wider
 407 range of options.
 408
 409 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
 410
 411 * A value of `0b11001010` for the immediate provides
 412 the functionality of a standard "grev".
 413 * `0b11101110` provides gorc
 414
 415 grevlut should be arranged so as to produce the constants
 416 needed to put into bext (bitextract) so as in turn to
 417 be able to emulate x86 pmovmask instructions
 418 <https://www.felixcloutier.com/x86/pmovmskb>.
 419 This only requires 2 instructions (grevlut, bext).
 420
 421 Note that if the mask is required to be placed
 422 directly into CR Fields (for use as CR Predicate
 423 masks rather than a integer mask) then sv.cmpi or sv.ori
 424 may be used instead, bearing in mind that sv.ori
 425 is a 64-bit instruction, and `VL` must have been
 426 set to the required length:
 427
 428     sv.ori./elwid=8 r10.v, r10.v, 0
 429
 430 The following settings provide the required mask constants:
 431
 432 | RA=0     | RB      | imm        | iv | result        |
 433 | -------  | ------- | ---------- | -- | ----------    |
 434 | 0x555..  | 0b10    | 0b01101100 | 0  | 0x111111...   |
 435 | 0x555..  | 0b110   | 0b01101100 | 0  | 0x010101...   |
 436 | 0x555..  | 0b1110  | 0b01101100 | 0  | 0x00010001...   |
 437 | 0x555..  | 0b10    | 0b11000110 | 1  | 0x88888...   |
 438 | 0x555..  | 0b110   | 0b11000110 | 1  | 0x808080...   |
 439 | 0x555..  | 0b1110  | 0b11000110 | 1  | 0x80008000...   |
 440
 441 Better diagram showing the correct ordering of shamt (RB).  A LUT2
 442 is applied to all locations marked in red using the first 4
 443 bits of the immediate, and a separate LUT2 applied to all
 444 locations in green using the upper 4 bits of the immediate.
 445
 446 <img src="/openpower/sv/grevlut.png" width=700 />
 447
 448 demo code [[openpower/sv/grevlut.py]]
 449
 450 ```
 451 lut2(imm, a, b):
 452     idx = b << 1 | a
 453     return imm[idx] # idx by LSB0 order
 454
 455 dorow(imm8, step_i, chunksize, us32b):
 456     for j in 0 to 31 if is32b else 63:
 457         if (j&chunk_size) == 0
 458            imm = imm8[0..3]
 459         else
 460            imm = imm8[4..7]
 461         step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
 462     return step_o
 463
 464 uint64_t grevlut(uint64_t RA, uint64_t RB, uint8 imm, bool iv, bool is32b)
 465 {
 466     uint64_t x = 0x5555_5555_5555_5555;
 467     if (RA != 0) x = GPR(RA);
 468     if (iv) x = ~x;
 469     int shamt = RB & 31 if is32b else 63
 470     for i in 0 to (6-is32b)
 471         step = 1<<i
 472         if (shamt & step) x = dorow(imm, x, step, is32b)
 473     return x;
 474 }
 475 ```
 476
 477 A variant may specify different LUT-pairs per row,
 478 using one byte of RB for each.  If it is desired that
 479 a particular row-crossover shall not be applied it is
 480 a simple matter to set the appropriate LUT-pair in RB
 481 to effect an identity transform for that row (`0b11001010`).
 482
 483 ```
 484 uint64_t grevlutr(uint64_t RA, uint64_t RB, bool iv, bool is32b)
 485 {
 486     uint64_t x = 0x5555_5555_5555_5555;
 487     if (RA != 0) x = GPR(RA);
 488     if (iv) x = ~x;
 489     for i in 0 to (6-is32b)
 490         step = 1<<i
 491         imm = (RB>>(i*8))&0xff
 492         x = dorow(imm, x, step, is32b)
 493     return x;
 494 }
 495
 496 ```
 497
 498 | 0.5|6.10|11.15|16.20 |21..28   | 29.30|31| name | Form |
 499 | -- | -- | --- | ---  | -----   | -----|--| ------ | ----- |
 500 | NN | RT | RA  | s0-4 | im0-7   | 1 iv |s5| grevlogi |      |
 501 | NN | RT | RA  | RB   | im0-7   | 01   |0 | grevlog |           |
 502
 503 An equivalent to `grevlogw` may be synthesised by setting the
 504 appropriate bits in RB to set the top half of RT to zero.
 505 Thus an explicit grevlogw instruction is not necessary.
 506
 507 # xperm
 508
 509 based on RV bitmanip.
 510
 511 RA contains a vector of indices to select parts of RB to be
 512 copied to RT.  The immediate-variant allows up to an 8 bit
 513 pattern (repeated) to be targetted at different parts of RT.
 514
 515 xperm shares some similarity with one of the uses of bmator
 516 in that xperm indices are binary addressing where bitmator
 517 may be considered to be unary addressing.
 518
 519 ```
 520 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
 521 {
 522     uint_xlen_t r = 0;
 523     uint_xlen_t sz = 1LL << sz_log2;
 524     uint_xlen_t mask = (1LL << sz) - 1;
 525     uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
 526     for (int i = 0; i < XLEN; i += sz) {
 527         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 528         if (pos < XLEN)
 529             r |= ((RB >> pos) & mask) << i;
 530     }
 531     return r;
 532 }
 533 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 534 {
 535     uint_xlen_t r = 0;
 536     uint_xlen_t sz = 1LL << sz_log2;
 537     uint_xlen_t mask = (1LL << sz) - 1;
 538     for (int i = 0; i < XLEN; i += sz) {
 539         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 540         if (pos < XLEN)
 541             r |= ((RB >> pos) & mask) << i;
 542     }
 543     return r;
 544 }
 545 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 546 {  return xperm(RA, RB, 2); }
 547 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 548 {  return xperm(RA, RB, 3); }
 549 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 550 {  return xperm(RA, RB, 4); }
 551 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 552 {  return xperm(RA, RB, 5); }
 553 ```
 554
 555 # bitmatrix
 556
 557 bmatflip and bmatxor is found in the Cray XMT, and in x86 is known
 558 as GF2P8AFFINEQB. uses:
 559
 560 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
 561 * SM4, Reed Solomon, RAID6
 562   <https://stackoverflow.com/questions/59124720/what-are-the-avx-512-galois-field-related-instructions-for>
 563 * Vector bit-reverse <https://reviews.llvm.org/D91515?id=305411>
 564 * Affine Inverse <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
 565
 566 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name      |  Form   |
 567 | -- | -- | --- | --- | -- | ----- | -------- |--| ----      | ------- |
 568 | NN | RS | RA  |im04 | im5|  1 1  | im67 00 110 |Rc| bmatxori  | TODO    |
 569
 570
 571 ```
 572 uint64_t bmatflip(uint64_t RA)
 573 {
 574     uint64_t x = RA;
 575     x = shfl64(x, 31);
 576     x = shfl64(x, 31);
 577     x = shfl64(x, 31);
 578     return x;
 579 }
 580
 581 uint64_t bmatxori(uint64_t RS, uint64_t RA, uint8_t imm) {
 582     // transpose of RA
 583     uint64_t RAt = bmatflip(RA);
 584     uint8_t u[8]; // rows of RS
 585     uint8_t v[8]; // cols of RA
 586     for (int i = 0; i < 8; i++) {
 587         u[i] = RS >> (i*8);
 588         v[i] = RAt >> (i*8);
 589     }
 590     uint64_t bit, x = 0;
 591     for (int i = 0; i < 64; i++) {
 592         bit = (imm >> (i%8)) & 1;
 593         bit ^= pcnt(u[i / 8] & v[i % 8]) & 1;
 594         x |= bit << i;
 595     }
 596     return x;
 597 }
 598
 599 uint64_t bmatxor(uint64_t RA, uint64_t RB) {
 600     return bmatxori(RA, RB, 0xff)
 601 }
 602
 603 uint64_t bmator(uint64_t RA, uint64_t RB) {
 604     // transpose of RB
 605     uint64_t RBt = bmatflip(RB);
 606     uint8_t u[8]; // rows of RA
 607     uint8_t v[8]; // cols of RB
 608     for (int i = 0; i < 8; i++) {
 609         u[i] = RA >> (i*8);
 610         v[i] = RBt >> (i*8);
 611     }
 612     uint64_t x = 0;
 613     for (int i = 0; i < 64; i++) {
 614         if ((u[i / 8] & v[i % 8]) != 0)
 615             x |= 1LL << i;
 616     }
 617     return x;
 618 }
 619
 620 uint64_t bmatand(uint64_t RA, uint64_t RB) {
 621     // transpose of RB
 622     uint64_t RBt = bmatflip(RB);
 623     uint8_t u[8]; // rows of RA
 624     uint8_t v[8]; // cols of RB
 625     for (int i = 0; i < 8; i++) {
 626         u[i] = RA >> (i*8);
 627         v[i] = RBt >> (i*8);
 628     }
 629     uint64_t x = 0;
 630     for (int i = 0; i < 64; i++) {
 631         if ((u[i / 8] & v[i % 8]) == 0xff)
 632             x |= 1LL << i;
 633     }
 634     return x;
 635 }
 636 ```
 637
 638 # Introduction to Carry-less and GF arithmetic
 639
 640 * obligatory xkcd <https://xkcd.com/2595/>
 641
 642 There are three completely separate types of Galois-Field-based arithmetic
 643 that we implement which are not well explained even in introductory
 644 literature.  A slightly oversimplified explanation is followed by more
 645 accurate descriptions:
 646
 647 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
 648   but is accidentally referred to as GF(2) - see below as to why.
 649 * `GF(p)` modulo arithmetic with a Prime number, these are "proper"
 650    Galois Fields
 651 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
 652   (2^N) and a second "reducing" polynomial (similar to a prime number), these
 653   are said to be GF(2^N) arithmetic.
 654
 655 further detailed and more precise explanations are provided below
 656
 657 * **Polynomials with coefficients in `GF(2)`**
 658   (aka. Carry-less arithmetic -- the `cl*` instructions).
 659   This isn't actually a Galois Field, but its coefficients are. This is
 660   basically binary integer addition, subtraction, and multiplication like
 661   usual, except that carries aren't propagated at all, effectively turning
 662   both addition and subtraction into the bitwise xor operation. Division and
 663   remainder are defined to match how addition and multiplication works.
 664 * **Galois Fields with a prime size**
 665   (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
 666   This is basically just the integers mod `p`.
 667 * **Galois Fields with a power-of-a-prime size**
 668   (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
 669   integer `n > 0`).
 670   We only implement these for `p == 2`, called Binary Galois Fields
 671   (`GF(2^n)` -- the `gfb*` instructions).
 672   For any prime `p`, `GF(p^n)` is implemented as polynomials with
 673   coefficients in `GF(p)` and degree `< n`, where the polynomials are the
 674   remainders of dividing by a specificly chosen polynomial in `GF(p)` called
 675   the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
 676   Polynomial must be an irreducable polynomial (like primes, but for
 677   polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
 678   and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
 679   affect `GF(p^n)`'s mathematical shape, all that changes is the specific
 680   polynomials used to implement `GF(p^n)`.
 681
 682 Many implementations and much of the literature do not make a clear
 683 distinction between these three categories, which makes it confusing
 684 to understand what their purpose and value is.
 685
 686 * carry-less multiply is extremely common and is used for the ubiquitous
 687   CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
 688 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
 689   has significant uses throughout cryptography
 690 * GF(p) is the basis again of a significant quantity of algorithms
 691   (TODO, list them, jacob knows what they are), even though the
 692   modulo is limited to be below 64-bit (size of a scalar int)
 693
 694 # Instructions for Carry-less Operations
 695
 696 aka. Polynomials with coefficients in `GF(2)`
 697
 698 Carry-less addition/subtraction is simply XOR, so a `cladd`
 699 instruction is not provided since the `xor[i]` instruction can be used instead.
 700
 701 These are operations on polynomials with coefficients in `GF(2)`, with the
 702 polynomial's coefficients packed into integers with the following algorithm:
 703
 704 ```python
 705 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
 706 ```
 707
 708 ## Carry-less Multiply Instructions
 709
 710 based on RV bitmanip
 711 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
 712 <https://www.felixcloutier.com/x86/pclmulqdq> and
 713 <https://en.m.wikipedia.org/wiki/Carry-less_product>
 714
 715 They are worth adding as their own non-overwrite operations
 716 (in the same pipeline).
 717
 718 ### `clmul` Carry-less Multiply
 719
 720 ```python
 721 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
 722 ```
 723
 724 ### `clmulh` Carry-less Multiply High
 725
 726 ```python
 727 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
 728 ```
 729
 730 ### `clmulr` Carry-less Multiply (Reversed)
 731
 732 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
 733 bit-reversed inputs.
 734
 735 ```python
 736 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
 737 ```
 738
 739 ## `clmadd` Carry-less Multiply-Add
 740
 741 ```
 742 clmadd RT, RA, RB, RC
 743 ```
 744
 745 ```
 746 (RT) = clmul((RA), (RB)) ^ (RC)
 747 ```
 748
 749 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
 750
 751 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
 752 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
 753 to avoid the need for a temp register. RS is written to as well as RT.
 754
 755 Note: Polynomials over GF(2) are a Ring rather than a Field, so, because the
 756 definition of the Inverse Discrete Fourier Transform involves calculating a
 757 multiplicative inverse, which may not exist in every Ring, therefore the
 758 Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
 759 to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
 760 TODO: check with someone who knows for sure if that's correct.)
 761
 762 ```
 763 cltmadd RT, RA, RB, RC
 764 ```
 765
 766 TODO: add link to explanation for where `RS` comes from.
 767
 768 ```
 769 a = (RA)
 770 c = (RC)
 771 # read all inputs before writing to any outputs in case
 772 # an input overlaps with an output register.
 773 (RT) = clmul(a, (RB)) ^ c
 774 (RS) = a ^ c
 775 ```
 776
 777 ## `cldivrem` Carry-less Division and Remainder
 778
 779 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
 780 for other instructions.
 781
 782 ```python
 783 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
 784 ```
 785
 786 ## `cldiv` Carry-less Division
 787
 788 ```
 789 cldiv RT, RA, RB
 790 ```
 791
 792 ```
 793 n = (RA)
 794 d = (RB)
 795 q, r = cldivrem(n, d, width=XLEN)
 796 (RT) = q
 797 ```
 798
 799 ## `clrem` Carry-less Remainder
 800
 801 ```
 802 clrem RT, RA, RB
 803 ```
 804
 805 ```
 806 n = (RA)
 807 d = (RB)
 808 q, r = cldivrem(n, d, width=XLEN)
 809 (RT) = r
 810 ```
 811
 812 # Instructions for Binary Galois Fields `GF(2^m)`
 813
 814 see:
 815
 816 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 817 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 818 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
 819
 820 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
 821 instruction is not provided since the `xor[i]` instruction can be used instead.
 822
 823 ## `GFBREDPOLY` SPR -- Reducing Polynomial
 824
 825 In order to save registers and to make operations orthogonal with standard
 826 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
 827 This also allows hardware to pre-compute useful parameters (such as the
 828 degree, or look-up tables) based on the reducing polynomial, and store them
 829 alongside the SPR in hidden registers, only recomputing them whenever the SPR
 830 is written to, rather than having to recompute those values for every
 831 instruction.
 832
 833 Because Galois Fields require the reducing polynomial to be an irreducible
 834 polynomial, that guarantees that any polynomial of `degree > 1` must have
 835 the LSB set, since otherwise it would be divisible by the polynomial `x`,
 836 making it reducible, making whatever we're working on no longer a Field.
 837 Therefore, we can reuse the LSB to indicate `degree == XLEN`.
 838
 839 ```python
 840 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
 841 ```
 842
 843 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
 844
 845 unless this is an immediate op, `mtspr` is completely sufficient.
 846
 847 ```python
 848 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
 849 ```
 850
 851 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
 852
 853 ```
 854 gfbmul RT, RA, RB
 855 ```
 856
 857 ```python
 858 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
 859 ```
 860
 861 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
 862
 863 ```
 864 gfbmadd RT, RA, RB, RC
 865 ```
 866
 867 ```python
 868 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
 869 ```
 870
 871 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
 872
 873 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
 874 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
 875 for a temp register. RS is written to as well as RT.
 876
 877 ```
 878 gfbtmadd RT, RA, RB, RC
 879 ```
 880
 881 TODO: add link to explanation for where `RS` comes from.
 882
 883 ```
 884 a = (RA)
 885 c = (RC)
 886 # read all inputs before writing to any outputs in case
 887 # an input overlaps with an output register.
 888 (RT) = gfbmadd(a, (RB), c)
 889 # use gfbmadd again since it reduces the result
 890 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
 891 ```
 892
 893 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
 894
 895 ```
 896 gfbinv RT, RA
 897 ```
 898
 899 ```python
 900 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
 901 ```
 902
 903 # Instructions for Prime Galois Fields `GF(p)`
 904
 905 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
 906
 907 ## `gfpadd` Prime Galois Field `GF(p)` Addition
 908
 909 ```
 910 gfpadd RT, RA, RB
 911 ```
 912
 913 ```python
 914 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
 915 ```
 916
 917 the addition happens on infinite-precision integers
 918
 919 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
 920
 921 ```
 922 gfpsub RT, RA, RB
 923 ```
 924
 925 ```python
 926 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
 927 ```
 928
 929 the subtraction happens on infinite-precision integers
 930
 931 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
 932
 933 ```
 934 gfpmul RT, RA, RB
 935 ```
 936
 937 ```python
 938 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
 939 ```
 940
 941 the multiplication happens on infinite-precision integers
 942
 943 ## `gfpinv` Prime Galois Field `GF(p)` Invert
 944
 945 ```
 946 gfpinv RT, RA
 947 ```
 948
 949 Some potential hardware implementations are found in:
 950 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
 951
 952 ```python
 953 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
 954 ```
 955
 956 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
 957
 958 ```
 959 gfpmadd RT, RA, RB, RC
 960 ```
 961
 962 ```python
 963 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
 964 ```
 965
 966 the multiplication and addition happens on infinite-precision integers
 967
 968 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
 969
 970 ```
 971 gfpmsub RT, RA, RB, RC
 972 ```
 973
 974 ```python
 975 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
 976 ```
 977
 978 the multiplication and subtraction happens on infinite-precision integers
 979
 980 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
 981
 982 ```
 983 gfpmsubr RT, RA, RB, RC
 984 ```
 985
 986 ```python
 987 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
 988 ```
 989
 990 the multiplication and subtraction happens on infinite-precision integers
 991
 992 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
 993
 994 Used in combination with SV FFT REMAP to perform
 995 a full Number-Theoretic-Transform in-place. Possible by having 3-in 2-out,
 996 to avoid the need for a temp register. RS is written
 997 to as well as RT.
 998
 999 ```
1000 gfpmaddsubr RT, RA, RB, RC
1001 ```
1002
1003 TODO: add link to explanation for where `RS` comes from.
1004
1005 ```
1006 factor1 = (RA)
1007 factor2 = (RB)
1008 term = (RC)
1009 # read all inputs before writing to any outputs in case
1010 # an input overlaps with an output register.
1011 (RT) = gfpmadd(factor1, factor2, term)
1012 (RS) = gfpmsubr(factor1, factor2, term)
1013 ```
1014
1015 # Already in POWER ISA or subsumed
1016
1017 Lists operations either included as part of
1018 other bitmanip operations, or are already in
1019 Power ISA.
1020
1021 ## cmix
1022
1023 based on RV bitmanip, covered by ternlog bitops
1024
1025 ```
1026 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
1027     return (RA & RB) | (RC & ~RB);
1028 }
1029 ```
1030
1031 ## count leading/trailing zeros with mask
1032
1033 in v3.1 p105
1034
1035 ```
1036 count = 0
1037 do i = 0 to 63 if((RB)i=1) then do
1038 if((RS)i=1) then break end end count ← count + 1
1039 RA ← EXTZ64(count)
1040 ```
1041
1042 ## bit deposit
1043
1044 pdepd VRT,VRA,VRB, identical to RV bitmamip bdep, found already in v3.1 p106
1045
1046     do while(m < 64)
1047        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1048           result = VSR[VRA+32].dword[i].bit[63-k]
1049           VSR[VRT+32].dword[i].bit[63-m] = result
1050           k = k + 1
1051        m = m + 1
1052
1053 ```
1054
1055 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1056 {
1057     uint_xlen_t r = 0;
1058     for (int i = 0, j = 0; i < XLEN; i++)
1059         if ((RB >> i) & 1) {
1060             if ((RA >> j) & 1)
1061                 r |= uint_xlen_t(1) << i;
1062             j++;
1063         }
1064     return r;
1065 }
1066
1067 ```
1068
1069 ## bit extract
1070
1071 other way round: identical to RV bext: pextd, found in v3.1 p196
1072
1073 ```
1074 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1075 {
1076     uint_xlen_t r = 0;
1077     for (int i = 0, j = 0; i < XLEN; i++)
1078         if ((RB >> i) & 1) {
1079             if ((RA >> i) & 1)
1080                 r |= uint_xlen_t(1) << j;
1081             j++;
1082         }
1083     return r;
1084 }
1085 ```
1086
1087 ## centrifuge
1088
1089 found in v3.1 p106 so not to be added here
1090
1091 ```
1092 ptr0 = 0
1093 ptr1 = 0
1094 do i = 0 to 63
1095     if((RB)i=0) then do
1096        resultptr0 = (RS)i
1097     end
1098     ptr0 = ptr0 + 1
1099     if((RB)63-i==1) then do
1100         result63-ptr1 = (RS)63-i
1101     end
1102     ptr1 = ptr1 + 1
1103 RA = result
1104 ```
1105
1106 ## bit to byte permute
1107
1108 similar to matrix permute in RV bitmanip, which has XOR and OR variants,
1109 these perform a transpose (bmatflip).
1110 TODO this looks VSX is there a scalar variant
1111 in v3.0/1 already
1112
1113     do j = 0 to 7
1114       do k = 0 to 7
1115          b = VSR[VRB+32].dword[i].byte[k].bit[j]
1116          VSR[VRT+32].dword[i].byte[j].bit[k] = b
1117
1118 ## grev
1119
1120 superceded by grevlut
1121
1122 based on RV bitmanip, this is also known as a butterfly network. however
1123 where a butterfly network allows setting of every crossbar setting in
1124 every row and every column, generalised-reverse (grev) only allows
1125 a per-row decision: every entry in the same row must either switch or
1126 not-switch.
1127
1128 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
1129
1130 ```
1131 uint64_t grev64(uint64_t RA, uint64_t RB)
1132 {
1133     uint64_t x = RA;
1134     int shamt = RB & 63;
1135     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
1136                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
1137     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
1138                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
1139     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
1140                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
1141     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
1142                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
1143     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
1144                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
1145     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
1146                         ((x & 0xFFFFFFFF00000000LL) >> 32);
1147     return x;
1148 }
1149
1150 ```
1151
1152 ## gorc
1153
1154 based on RV bitmanip, gorc is superceded by grevlut
1155
1156 ```
1157 uint32_t gorc32(uint32_t RA, uint32_t RB)
1158 {
1159     uint32_t x = RA;
1160     int shamt = RB & 31;
1161     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
1162     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
1163     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
1164     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
1165     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
1166     return x;
1167 }
1168 uint64_t gorc64(uint64_t RA, uint64_t RB)
1169 {
1170     uint64_t x = RA;
1171     int shamt = RB & 63;
1172     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
1173                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
1174     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
1175                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
1176     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
1177                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
1178     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
1179                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
1180     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
1181                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
1182     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
1183                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
1184     return x;
1185 }
1186
1187 ```
1188
1189
1190 # Appendix
1191
1192 see [[bitmanip/appendix]]
1193