dae0121a550398fa6ed4021a3fb809d31aa7b278
[libreriscv.git] / openpower / sv / bitmanip.mdwn
1 [[!tag standards]]
2
3 [[!toc levels=1]]
4
5 # Implementation Log
6
7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
10 * binutils <https://bugs.libre-soc.org/show_bug.cgi?id=836>
11 * shift-and-add <https://bugs.libre-soc.org/show_bug.cgi?id=968>
12
13 # bitmanipulation
14
15 **DRAFT STATUS**
16
17 pseudocode: [[openpower/isa/bitmanip]]
18
19 this extension amalgamates bitmanipulation primitives from many sources,
20 including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
21 Also included are DSP/Multimedia operations suitable for Audio/Video.
22 Vectorisation and SIMD are removed: these are straight scalar (element)
23 operations making them suitable for embedded applications. Vectorisation
24 Context is provided by [[openpower/sv]].
25
26 When combined with SV, scalar variants of bitmanip operations found in
27 VSX are added so that the Packed SIMD aspects of VSX may be retired as
28 "legacy" in the far future (10 to 20 years). Also, VSX is hundreds of
29 opcodes, requires 128 bit pathways, and is wholly unsuited to low power
30 or embedded scenarios.
31
32 ternlogv is experimental and is the only operation that may be considered
33 a "Packed SIMD". It is added as a variant of the already well-justified
34 ternlog operation (done in AVX512 as an immediate only) "because it
35 looks fun". As it is based on the LUT4 concept it will allow accelerated
36 emulation of FPGAs. Other vendors of ISAs are buying FPGA companies to
37 achieve similar objectives.
38
39 general-purpose Galois Field 2^M operations are added so as to avoid
40 huge custom opcode proliferation across many areas of Computer Science.
41 however for convenience and also to avoid setup costs, some of the more
42 common operations (clmul, crc32) are also added. The expectation is
43 that these operations would all be covered by the same pipeline.
44
45 note that there are brownfield spaces below that could incorporate
46 some of the set-before-first and other scalar operations listed in
47 [[sv/mv.swizzle]],
48 [[sv/vector_ops]], [[sv/int_fp_mv]] and the [[sv/av_opcodes]] as well as
49 [[sv/setvl]], [[sv/svstep]], [[sv/remap]]
50
51 Useful resource:
52
53 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
54 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
55 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
56 * <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
57
58 [[!inline pages="openpower/sv/draft_opcode_tables" quick="yes" raw="yes" ]]
59
60 # binary and ternary bitops
61
62 Similar to FPGA LUTs: for two (binary) or three (ternary) inputs take
63 bits from each input, concatenate them and perform a lookup into a
64 table using an 8-8-bit immediate (for the ternary instructions), or in
65 another register (4-bit for the binary instructions). The binary lookup
66 instructions have CR Field lookup variants due to CR Fields being 4 bit.
67
68 Like the x86 AVX512F
69 [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq)
70 instructions.
71
72 ## ternlogi
73
74 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
75 | -- | -- | --- | --- | ----- | --- |--|
76 | NN | RT | RA | RB | im0-7 | 00 |Rc|
77
78 lut3(imm, a, b, c):
79 idx = c << 2 | b << 1 | a
80 return imm[idx] # idx by LSB0 order
81
82 for i in range(64):
83 RT[i] = lut3(imm, RB[i], RA[i], RT[i])
84
85 ## binlut
86
87 Binary lookup is a dynamic LUT2 version of ternlogi. Firstly, the
88 lookup table is 4 bits wide not 8 bits, and secondly the lookup
89 table comes from a register not an immediate.
90
91 | 0.5|6.10|11.15|16.20| 21..25|26..31 | Form |
92 | -- | -- | --- | --- | ----- |--------|---------|
93 | NN | RT | RA | RB | RC |nh 00001| VA-Form |
94 | NN | RT | RA | RB | /BFA/ |0 01001| VA-Form |
95
96 For binlut, the 4-bit LUT may be selected from either the high nibble
97 or the low nibble of the first byte of RC:
98
99 lut2(imm, a, b):
100 idx = b << 1 | a
101 return imm[idx] # idx by LSB0 order
102
103 imm = (RC>>(nh*4))&0b1111
104 for i in range(64):
105 RT[i] = lut2(imm, RB[i], RA[i])
106
107 For bincrlut, `BFA` selects the 4-bit CR Field as the LUT2:
108
109 for i in range(64):
110 RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
111
112 When Vectorised with SVP64, as usual both source and destination may be
113 Vector or Scalar.
114
115 *Programmer's note: a dynamic ternary lookup may be synthesised from
116 a pair of `binlut` instructions followed by a `ternlogi` to select which
117 to merge. Use `nh` to select which nibble to use as the lookup table
118 from the RC source register (`nh=1` nibble high), i.e. keeping
119 an 8-bit LUT3 in RC, the first `binlut` instruction may set nh=0 and
120 the second nh=1.*
121
122 ## crternlogi
123
124 another mode selection would be CRs not Ints.
125
126 CRB-Form:
127
128 | 0.5|6.8 |9.10|11.13|14.15|16.18|19.25|26.30| 31|
129 |----|----|----|-----|-----|-----|-----|-----|---|
130 | NN | BF | msk|BFA | msk | BFB | TLI | XO |TLI|
131
132 for i in range(4):
133 a,b,c = CRs[BF][i], CRs[BFA][i], CRs[BFB][i])
134 if msk[i] CRs[BF][i] = lut3(imm, a, b, c)
135
136 This instruction is remarkably similar to the existing crops, `crand` etc.
137 which have been noted to be a 4-bit (binary) LUT. In effect `crternlogi`
138 is the ternary LUT version of crops, having an 8-bit LUT. However it
139 is an overwrite instruction in order to save on register file ports,
140 due to the mask requiring the contents of the BF to be both read and
141 written.
142
143 Programmer's note: This instruction is useful when combined with Matrix REMAP
144 in "Inner Product" Mode, creating Warshall Transitive Closure that has many
145 applications in Computer Science.
146
147 ## crbinlog
148
149 With ternary (LUT3) dynamic instructions being very costly,
150 and CR Fields being only 4 bit, a binary (LUT2) variant is better
151
152 CRB-Form:
153
154 | 0.5|6.8 |9.10|11.13|14.15|16.18|19.25|26.30| 31|
155 |----|----|----|-----|-----|-----|-----|-----|---|
156 | NN | BF | msk|BFA | msk | BFB | // | XO | //|
157
158 for i in range(4):
159 a,b = CRs[BF][i], CRs[BF][i])
160 if msk[i] CRs[BF][i] = lut2(CRs[BFB], a, b)
161
162 When SVP64 Vectorised any of the 4 operands may be Scalar or
163 Vector, including `BFB` meaning that multiple different dynamic
164 lookups may be performed with a single instruction. Note that
165 this instruction is deliberately an overwrite in order to reduce
166 the number of register file ports required: like `crternlogi`
167 the contents of `BF` **must** be read due to the mask only
168 writing back to non-masked-out bits of `BF`.
169
170 *Programmer's note: just as with binlut and ternlogi, a pair
171 of crbinlog instructions followed by a merging crternlogi may
172 be deployed to synthesise dynamic ternary (LUT3) CR Field
173 manipulation*
174
175 # int ops
176
177 ## min/m
178
179 required for the [[sv/av_opcodes]]
180
181 signed and unsigned min/max for integer. this is sort-of partly
182 synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg
183 is one of the sources, but not both signed and unsigned. when the dest
184 is also one of the srces and the mv fails due to the CR bittest failing
185 this will only overwrite the dest where the src is greater (or less).
186
187 signed/unsigned min/max gives more flexibility.
188
189 X-Form
190
191 * XO=0001001110, itype=0b00 min, unsigned
192 * XO=0101001110, itype=0b01 min, signed
193 * XO=0011001110, itype=0b10 max, unsigned
194 * XO=0111001110, itype=0b11 max, signed
195
196
197 ```
198 uint_xlen_t mins(uint_xlen_t rs1, uint_xlen_t rs2)
199 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
200 }
201 uint_xlen_t maxs(uint_xlen_t rs1, uint_xlen_t rs2)
202 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
203 }
204 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
205 { return rs1 < rs2 ? rs1 : rs2;
206 }
207 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
208 { return rs1 > rs2 ? rs1 : rs2;
209 }
210 ```
211
212 ## average
213
214 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
215 but not scalar
216
217 ```
218 uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
219 return (rs1 + rs2 + 1) >> 1:
220 }
221 ```
222
223 ## absdu
224
225 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
226 but not scalar
227
228 ```
229 uint_xlen_t absdu(uint_xlen_t rs1, uint_xlen_t rs2) {
230 return (src1 > src2) ? (src1-src2) : (src2-src1)
231 }
232 ```
233
234 ## abs-accumulate
235
236 required for the [[sv/av_opcodes]], these are needed for motion estimation.
237 both are overwrite on RS.
238
239 ```
240 uint_xlen_t uintabsacc(uint_xlen_t rs, uint_xlen_t ra, uint_xlen_t rb) {
241 return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
242 }
243 uint_xlen_t intabsacc(uint_xlen_t rs, int_xlen_t ra, int_xlen_t rb) {
244 return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
245 }
246 ```
247
248 For SVP64, the twin Elwidths allows e.g. a 16 bit accumulator for 8 bit
249 differences. Form is `RM-1P-3S1D` where RS-as-source has a separate
250 SVP64 designation from RS-as-dest. This gives a limited range of
251 non-overwrite capability.
252
253 # shift-and-add <a name="shift-add"> </a>
254
255 Power ISA is missing LD/ST with shift, which is present in both ARM and x86.
256 Too complex to add more LD/ST, a compromise is to add shift-and-add.
257 Replaces a pair of explicit instructions in hot-loops.
258
259 ```
260 # 1.6.27 Z23-FORM
261 |0 |6 |11 |15 |16 |21 |23 |31 |
262 | PO | RT | RA | RB |sm | XO |Rc |
263 ```
264
265 Pseudo-code (shadd):
266
267 n <- (RB)
268 m <- sm + 1
269 RT <- (n[m:XLEN-1] || [0]*m) + (RA)
270
271 Pseudo-code (shadduw):
272
273 n <- ([0]*(XLEN/2)) || (RB)[XLEN/2:XLEN-1]
274 m <- sm + 1
275 RT <- (n[m:XLEN-1] || [0]*m) + (RA)
276
277 ```
278 uint_xlen_t shadd(uint_xlen_t RA, uint_xlen_t RB, uint8_t sm) {
279 sm = sm & 0x3;
280 return (RB << (sm+1)) + RA;
281 }
282
283 uint_xlen_t shadduw(uint_xlen_t RA, uint_xlen_t RB, uint8_t sm) {
284 uint_xlen_t n = RB & 0xFFFFFFFF;
285 sm = sm & 0x3;
286 return (n << (sm+1)) + RA;
287 }
288 ```
289
290 # bitmask set
291
292 based on RV bitmanip singlebit set, instruction format similar to shift
293 [[isa/fixedshift]]. bmext is actually covered already (shift-with-mask
294 rldicl but only immediate version). however bitmask-invert is not,
295 and set/clr are not covered, although they can use the same Shift ALU.
296
297 bmext (RB) version is not the same as rldicl because bmext is a right
298 shift by RC, where rldicl is a left rotate. for the immediate version
299 this does not matter, so a bmexti is not required. bmrev however there
300 is no direct equivalent and consequently a bmrevi is required.
301
302 bmset (register for mask amount) is particularly useful for creating
303 predicate masks where the length is a dynamic runtime quantity.
304 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask"
305 in a single instruction without needing to initialise or depend on any
306 other registers.
307
308 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31| name |
309 | -- | -- | --- | --- | --- | ------- |--| ----- |
310 | NN | RS | RA | RB | RC | mode 010 |Rc| bm\* |
311
312 Immediate-variant is an overwrite form:
313
314 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
315 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
316 | NN | RS | RB | sh | SH | itype | 1000 110 |Rc| bm\*i |
317
318 ```
319 def MASK(x, y):
320 if x < y:
321 x = x+1
322 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
323 mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
324 elif x == y:
325 return 1 << x
326 else:
327 x = x+1
328 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
329 mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
330 return mask_a ^ mask_b
331
332
333 uint_xlen_t bmset(RS, RB, sh)
334 {
335 int shamt = RB & (XLEN - 1);
336 mask = (2<<sh)-1;
337 return RS | (mask << shamt);
338 }
339
340 uint_xlen_t bmclr(RS, RB, sh)
341 {
342 int shamt = RB & (XLEN - 1);
343 mask = (2<<sh)-1;
344 return RS & ~(mask << shamt);
345 }
346
347 uint_xlen_t bminv(RS, RB, sh)
348 {
349 int shamt = RB & (XLEN - 1);
350 mask = (2<<sh)-1;
351 return RS ^ (mask << shamt);
352 }
353
354 uint_xlen_t bmext(RS, RB, sh)
355 {
356 int shamt = RB & (XLEN - 1);
357 mask = (2<<sh)-1;
358 return mask & (RS >> shamt);
359 }
360 ```
361
362 bitmask extract with reverse. can be done by bit-order-inverting all
363 of RB and getting bits of RB from the opposite end.
364
365 when RA is zero, no shift occurs. this makes bmextrev useful for
366 simply reversing all bits of a register.
367
368 ```
369 msb = ra[5:0];
370 rev[0:msb] = rb[msb:0];
371 rt = ZE(rev[msb:0]);
372
373 uint_xlen_t bmrevi(RA, RB, sh)
374 {
375 int shamt = XLEN-1;
376 if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
377 shamt = (XLEN-1)-shamt; # shift other end
378 brb = bitreverse(GPR(RB)) # swap LSB-MSB
379 mask = (2<<sh)-1;
380 return mask & (brb >> shamt);
381 }
382
383 uint_xlen_t bmrev(RA, RB, RC) {
384 return bmrevi(RA, RB, GPR(RC) & 0b111111);
385 }
386 ```
387
388 | 0.5|6.10|11.15|16.20|21.26| 27..30 |31| name | Form |
389 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
390 | NN | RT | RA | RB | sh | 1111 |Rc| bmrevi | MDS-Form |
391
392 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31| name | Form |
393 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
394 | NN | RT | RA | RB | RC | 11110 |Rc| bmrev | VA2-Form |
395
396 # grevlut <a name="grevlut"> </a>
397
398 generalised reverse combined with a pair of LUT2s and allowing
399 a constant `0b0101...0101` when RA=0, and an option to invert
400 (including when RA=0, giving a constant 0b1010...1010 as the
401 initial value) provides a wide range of instructions
402 and a means to set hundreds of regular 64 bit patterns with one
403 single 32 bit instruction.
404
405 the two LUT2s are applied left-half (when not swapping)
406 and right-half (when swapping) so as to allow a wider
407 range of options.
408
409 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
410
411 * A value of `0b11001010` for the immediate provides
412 the functionality of a standard "grev".
413 * `0b11101110` provides gorc
414
415 grevlut should be arranged so as to produce the constants
416 needed to put into bext (bitextract) so as in turn to
417 be able to emulate x86 pmovmask instructions
418 <https://www.felixcloutier.com/x86/pmovmskb>.
419 This only requires 2 instructions (grevlut, bext).
420
421 Note that if the mask is required to be placed
422 directly into CR Fields (for use as CR Predicate
423 masks rather than a integer mask) then sv.cmpi or sv.ori
424 may be used instead, bearing in mind that sv.ori
425 is a 64-bit instruction, and `VL` must have been
426 set to the required length:
427
428 sv.ori./elwid=8 r10.v, r10.v, 0
429
430 The following settings provide the required mask constants:
431
432 | RA=0 | RB | imm | iv | result |
433 | ------- | ------- | ---------- | -- | ---------- |
434 | 0x555.. | 0b10 | 0b01101100 | 0 | 0x111111... |
435 | 0x555.. | 0b110 | 0b01101100 | 0 | 0x010101... |
436 | 0x555.. | 0b1110 | 0b01101100 | 0 | 0x00010001... |
437 | 0x555.. | 0b10 | 0b11000110 | 1 | 0x88888... |
438 | 0x555.. | 0b110 | 0b11000110 | 1 | 0x808080... |
439 | 0x555.. | 0b1110 | 0b11000110 | 1 | 0x80008000... |
440
441 Better diagram showing the correct ordering of shamt (RB). A LUT2
442 is applied to all locations marked in red using the first 4
443 bits of the immediate, and a separate LUT2 applied to all
444 locations in green using the upper 4 bits of the immediate.
445
446 <img src="/openpower/sv/grevlut.png" width=700 />
447
448 demo code [[openpower/sv/grevlut.py]]
449
450 ```
451 lut2(imm, a, b):
452 idx = b << 1 | a
453 return imm[idx] # idx by LSB0 order
454
455 dorow(imm8, step_i, chunksize, us32b):
456 for j in 0 to 31 if is32b else 63:
457 if (j&chunk_size) == 0
458 imm = imm8[0..3]
459 else
460 imm = imm8[4..7]
461 step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
462 return step_o
463
464 uint64_t grevlut(uint64_t RA, uint64_t RB, uint8 imm, bool iv, bool is32b)
465 {
466 uint64_t x = 0x5555_5555_5555_5555;
467 if (RA != 0) x = GPR(RA);
468 if (iv) x = ~x;
469 int shamt = RB & 31 if is32b else 63
470 for i in 0 to (6-is32b)
471 step = 1<<i
472 if (shamt & step) x = dorow(imm, x, step, is32b)
473 return x;
474 }
475 ```
476
477 A variant may specify different LUT-pairs per row,
478 using one byte of RB for each. If it is desired that
479 a particular row-crossover shall not be applied it is
480 a simple matter to set the appropriate LUT-pair in RB
481 to effect an identity transform for that row (`0b11001010`).
482
483 ```
484 uint64_t grevlutr(uint64_t RA, uint64_t RB, bool iv, bool is32b)
485 {
486 uint64_t x = 0x5555_5555_5555_5555;
487 if (RA != 0) x = GPR(RA);
488 if (iv) x = ~x;
489 for i in 0 to (6-is32b)
490 step = 1<<i
491 imm = (RB>>(i*8))&0xff
492 x = dorow(imm, x, step, is32b)
493 return x;
494 }
495
496 ```
497
498 | 0.5|6.10|11.15|16.20 |21..28 | 29.30|31| name | Form |
499 | -- | -- | --- | --- | ----- | -----|--| ------ | ----- |
500 | NN | RT | RA | s0-4 | im0-7 | 1 iv |s5| grevlogi | |
501 | NN | RT | RA | RB | im0-7 | 01 |0 | grevlog | |
502
503 An equivalent to `grevlogw` may be synthesised by setting the
504 appropriate bits in RB to set the top half of RT to zero.
505 Thus an explicit grevlogw instruction is not necessary.
506
507 # xperm
508
509 based on RV bitmanip.
510
511 RA contains a vector of indices to select parts of RB to be
512 copied to RT. The immediate-variant allows up to an 8 bit
513 pattern (repeated) to be targetted at different parts of RT.
514
515 xperm shares some similarity with one of the uses of bmator
516 in that xperm indices are binary addressing where bitmator
517 may be considered to be unary addressing.
518
519 ```
520 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
521 {
522 uint_xlen_t r = 0;
523 uint_xlen_t sz = 1LL << sz_log2;
524 uint_xlen_t mask = (1LL << sz) - 1;
525 uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
526 for (int i = 0; i < XLEN; i += sz) {
527 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
528 if (pos < XLEN)
529 r |= ((RB >> pos) & mask) << i;
530 }
531 return r;
532 }
533 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
534 {
535 uint_xlen_t r = 0;
536 uint_xlen_t sz = 1LL << sz_log2;
537 uint_xlen_t mask = (1LL << sz) - 1;
538 for (int i = 0; i < XLEN; i += sz) {
539 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
540 if (pos < XLEN)
541 r |= ((RB >> pos) & mask) << i;
542 }
543 return r;
544 }
545 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
546 { return xperm(RA, RB, 2); }
547 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
548 { return xperm(RA, RB, 3); }
549 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
550 { return xperm(RA, RB, 4); }
551 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
552 { return xperm(RA, RB, 5); }
553 ```
554
555 # bitmatrix
556
557 bmatflip and bmatxor is found in the Cray XMT, and in x86 is known
558 as GF2P8AFFINEQB. uses:
559
560 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
561 * SM4, Reed Solomon, RAID6
562 <https://stackoverflow.com/questions/59124720/what-are-the-avx-512-galois-field-related-instructions-for>
563 * Vector bit-reverse <https://reviews.llvm.org/D91515?id=305411>
564 * Affine Inverse <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
565
566 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name | Form |
567 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- | ------- |
568 | NN | RS | RA |im04 | im5| 1 1 | im67 00 110 |Rc| bmatxori | TODO |
569
570
571 ```
572 uint64_t bmatflip(uint64_t RA)
573 {
574 uint64_t x = RA;
575 x = shfl64(x, 31);
576 x = shfl64(x, 31);
577 x = shfl64(x, 31);
578 return x;
579 }
580
581 uint64_t bmatxori(uint64_t RS, uint64_t RA, uint8_t imm) {
582 // transpose of RA
583 uint64_t RAt = bmatflip(RA);
584 uint8_t u[8]; // rows of RS
585 uint8_t v[8]; // cols of RA
586 for (int i = 0; i < 8; i++) {
587 u[i] = RS >> (i*8);
588 v[i] = RAt >> (i*8);
589 }
590 uint64_t bit, x = 0;
591 for (int i = 0; i < 64; i++) {
592 bit = (imm >> (i%8)) & 1;
593 bit ^= pcnt(u[i / 8] & v[i % 8]) & 1;
594 x |= bit << i;
595 }
596 return x;
597 }
598
599 uint64_t bmatxor(uint64_t RA, uint64_t RB) {
600 return bmatxori(RA, RB, 0xff)
601 }
602
603 uint64_t bmator(uint64_t RA, uint64_t RB) {
604 // transpose of RB
605 uint64_t RBt = bmatflip(RB);
606 uint8_t u[8]; // rows of RA
607 uint8_t v[8]; // cols of RB
608 for (int i = 0; i < 8; i++) {
609 u[i] = RA >> (i*8);
610 v[i] = RBt >> (i*8);
611 }
612 uint64_t x = 0;
613 for (int i = 0; i < 64; i++) {
614 if ((u[i / 8] & v[i % 8]) != 0)
615 x |= 1LL << i;
616 }
617 return x;
618 }
619
620 uint64_t bmatand(uint64_t RA, uint64_t RB) {
621 // transpose of RB
622 uint64_t RBt = bmatflip(RB);
623 uint8_t u[8]; // rows of RA
624 uint8_t v[8]; // cols of RB
625 for (int i = 0; i < 8; i++) {
626 u[i] = RA >> (i*8);
627 v[i] = RBt >> (i*8);
628 }
629 uint64_t x = 0;
630 for (int i = 0; i < 64; i++) {
631 if ((u[i / 8] & v[i % 8]) == 0xff)
632 x |= 1LL << i;
633 }
634 return x;
635 }
636 ```
637
638 # Introduction to Carry-less and GF arithmetic
639
640 * obligatory xkcd <https://xkcd.com/2595/>
641
642 There are three completely separate types of Galois-Field-based arithmetic
643 that we implement which are not well explained even in introductory
644 literature. A slightly oversimplified explanation is followed by more
645 accurate descriptions:
646
647 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
648 but is accidentally referred to as GF(2) - see below as to why.
649 * `GF(p)` modulo arithmetic with a Prime number, these are "proper"
650 Galois Fields
651 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
652 (2^N) and a second "reducing" polynomial (similar to a prime number), these
653 are said to be GF(2^N) arithmetic.
654
655 further detailed and more precise explanations are provided below
656
657 * **Polynomials with coefficients in `GF(2)`**
658 (aka. Carry-less arithmetic -- the `cl*` instructions).
659 This isn't actually a Galois Field, but its coefficients are. This is
660 basically binary integer addition, subtraction, and multiplication like
661 usual, except that carries aren't propagated at all, effectively turning
662 both addition and subtraction into the bitwise xor operation. Division and
663 remainder are defined to match how addition and multiplication works.
664 * **Galois Fields with a prime size**
665 (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
666 This is basically just the integers mod `p`.
667 * **Galois Fields with a power-of-a-prime size**
668 (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
669 integer `n > 0`).
670 We only implement these for `p == 2`, called Binary Galois Fields
671 (`GF(2^n)` -- the `gfb*` instructions).
672 For any prime `p`, `GF(p^n)` is implemented as polynomials with
673 coefficients in `GF(p)` and degree `< n`, where the polynomials are the
674 remainders of dividing by a specificly chosen polynomial in `GF(p)` called
675 the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
676 Polynomial must be an irreducable polynomial (like primes, but for
677 polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
678 and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
679 affect `GF(p^n)`'s mathematical shape, all that changes is the specific
680 polynomials used to implement `GF(p^n)`.
681
682 Many implementations and much of the literature do not make a clear
683 distinction between these three categories, which makes it confusing
684 to understand what their purpose and value is.
685
686 * carry-less multiply is extremely common and is used for the ubiquitous
687 CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
688 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
689 has significant uses throughout cryptography
690 * GF(p) is the basis again of a significant quantity of algorithms
691 (TODO, list them, jacob knows what they are), even though the
692 modulo is limited to be below 64-bit (size of a scalar int)
693
694 # Instructions for Carry-less Operations
695
696 aka. Polynomials with coefficients in `GF(2)`
697
698 Carry-less addition/subtraction is simply XOR, so a `cladd`
699 instruction is not provided since the `xor[i]` instruction can be used instead.
700
701 These are operations on polynomials with coefficients in `GF(2)`, with the
702 polynomial's coefficients packed into integers with the following algorithm:
703
704 ```python
705 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
706 ```
707
708 ## Carry-less Multiply Instructions
709
710 based on RV bitmanip
711 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
712 <https://www.felixcloutier.com/x86/pclmulqdq> and
713 <https://en.m.wikipedia.org/wiki/Carry-less_product>
714
715 They are worth adding as their own non-overwrite operations
716 (in the same pipeline).
717
718 ### `clmul` Carry-less Multiply
719
720 ```python
721 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
722 ```
723
724 ### `clmulh` Carry-less Multiply High
725
726 ```python
727 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
728 ```
729
730 ### `clmulr` Carry-less Multiply (Reversed)
731
732 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
733 bit-reversed inputs.
734
735 ```python
736 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
737 ```
738
739 ## `clmadd` Carry-less Multiply-Add
740
741 ```
742 clmadd RT, RA, RB, RC
743 ```
744
745 ```
746 (RT) = clmul((RA), (RB)) ^ (RC)
747 ```
748
749 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
750
751 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
752 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
753 to avoid the need for a temp register. RS is written to as well as RT.
754
755 Note: Polynomials over GF(2) are a Ring rather than a Field, so, because the
756 definition of the Inverse Discrete Fourier Transform involves calculating a
757 multiplicative inverse, which may not exist in every Ring, therefore the
758 Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
759 to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
760 TODO: check with someone who knows for sure if that's correct.)
761
762 ```
763 cltmadd RT, RA, RB, RC
764 ```
765
766 TODO: add link to explanation for where `RS` comes from.
767
768 ```
769 a = (RA)
770 c = (RC)
771 # read all inputs before writing to any outputs in case
772 # an input overlaps with an output register.
773 (RT) = clmul(a, (RB)) ^ c
774 (RS) = a ^ c
775 ```
776
777 ## `cldivrem` Carry-less Division and Remainder
778
779 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
780 for other instructions.
781
782 ```python
783 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
784 ```
785
786 ## `cldiv` Carry-less Division
787
788 ```
789 cldiv RT, RA, RB
790 ```
791
792 ```
793 n = (RA)
794 d = (RB)
795 q, r = cldivrem(n, d, width=XLEN)
796 (RT) = q
797 ```
798
799 ## `clrem` Carry-less Remainder
800
801 ```
802 clrem RT, RA, RB
803 ```
804
805 ```
806 n = (RA)
807 d = (RB)
808 q, r = cldivrem(n, d, width=XLEN)
809 (RT) = r
810 ```
811
812 # Instructions for Binary Galois Fields `GF(2^m)`
813
814 see:
815
816 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
817 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
818 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
819
820 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
821 instruction is not provided since the `xor[i]` instruction can be used instead.
822
823 ## `GFBREDPOLY` SPR -- Reducing Polynomial
824
825 In order to save registers and to make operations orthogonal with standard
826 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
827 This also allows hardware to pre-compute useful parameters (such as the
828 degree, or look-up tables) based on the reducing polynomial, and store them
829 alongside the SPR in hidden registers, only recomputing them whenever the SPR
830 is written to, rather than having to recompute those values for every
831 instruction.
832
833 Because Galois Fields require the reducing polynomial to be an irreducible
834 polynomial, that guarantees that any polynomial of `degree > 1` must have
835 the LSB set, since otherwise it would be divisible by the polynomial `x`,
836 making it reducible, making whatever we're working on no longer a Field.
837 Therefore, we can reuse the LSB to indicate `degree == XLEN`.
838
839 ```python
840 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
841 ```
842
843 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
844
845 unless this is an immediate op, `mtspr` is completely sufficient.
846
847 ```python
848 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
849 ```
850
851 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
852
853 ```
854 gfbmul RT, RA, RB
855 ```
856
857 ```python
858 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
859 ```
860
861 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
862
863 ```
864 gfbmadd RT, RA, RB, RC
865 ```
866
867 ```python
868 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
869 ```
870
871 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
872
873 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
874 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
875 for a temp register. RS is written to as well as RT.
876
877 ```
878 gfbtmadd RT, RA, RB, RC
879 ```
880
881 TODO: add link to explanation for where `RS` comes from.
882
883 ```
884 a = (RA)
885 c = (RC)
886 # read all inputs before writing to any outputs in case
887 # an input overlaps with an output register.
888 (RT) = gfbmadd(a, (RB), c)
889 # use gfbmadd again since it reduces the result
890 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
891 ```
892
893 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
894
895 ```
896 gfbinv RT, RA
897 ```
898
899 ```python
900 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
901 ```
902
903 # Instructions for Prime Galois Fields `GF(p)`
904
905 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
906
907 ## `gfpadd` Prime Galois Field `GF(p)` Addition
908
909 ```
910 gfpadd RT, RA, RB
911 ```
912
913 ```python
914 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
915 ```
916
917 the addition happens on infinite-precision integers
918
919 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
920
921 ```
922 gfpsub RT, RA, RB
923 ```
924
925 ```python
926 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
927 ```
928
929 the subtraction happens on infinite-precision integers
930
931 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
932
933 ```
934 gfpmul RT, RA, RB
935 ```
936
937 ```python
938 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
939 ```
940
941 the multiplication happens on infinite-precision integers
942
943 ## `gfpinv` Prime Galois Field `GF(p)` Invert
944
945 ```
946 gfpinv RT, RA
947 ```
948
949 Some potential hardware implementations are found in:
950 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
951
952 ```python
953 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
954 ```
955
956 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
957
958 ```
959 gfpmadd RT, RA, RB, RC
960 ```
961
962 ```python
963 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
964 ```
965
966 the multiplication and addition happens on infinite-precision integers
967
968 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
969
970 ```
971 gfpmsub RT, RA, RB, RC
972 ```
973
974 ```python
975 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
976 ```
977
978 the multiplication and subtraction happens on infinite-precision integers
979
980 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
981
982 ```
983 gfpmsubr RT, RA, RB, RC
984 ```
985
986 ```python
987 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
988 ```
989
990 the multiplication and subtraction happens on infinite-precision integers
991
992 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
993
994 Used in combination with SV FFT REMAP to perform
995 a full Number-Theoretic-Transform in-place. Possible by having 3-in 2-out,
996 to avoid the need for a temp register. RS is written
997 to as well as RT.
998
999 ```
1000 gfpmaddsubr RT, RA, RB, RC
1001 ```
1002
1003 TODO: add link to explanation for where `RS` comes from.
1004
1005 ```
1006 factor1 = (RA)
1007 factor2 = (RB)
1008 term = (RC)
1009 # read all inputs before writing to any outputs in case
1010 # an input overlaps with an output register.
1011 (RT) = gfpmadd(factor1, factor2, term)
1012 (RS) = gfpmsubr(factor1, factor2, term)
1013 ```
1014
1015 # Already in POWER ISA or subsumed
1016
1017 Lists operations either included as part of
1018 other bitmanip operations, or are already in
1019 Power ISA.
1020
1021 ## cmix
1022
1023 based on RV bitmanip, covered by ternlog bitops
1024
1025 ```
1026 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
1027 return (RA & RB) | (RC & ~RB);
1028 }
1029 ```
1030
1031 ## count leading/trailing zeros with mask
1032
1033 in v3.1 p105
1034
1035 ```
1036 count = 0
1037 do i = 0 to 63 if((RB)i=1) then do
1038 if((RS)i=1) then break end end count ← count + 1
1039 RA ← EXTZ64(count)
1040 ```
1041
1042 ## bit deposit
1043
1044 pdepd VRT,VRA,VRB, identical to RV bitmamip bdep, found already in v3.1 p106
1045
1046 do while(m < 64)
1047 if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1048 result = VSR[VRA+32].dword[i].bit[63-k]
1049 VSR[VRT+32].dword[i].bit[63-m] = result
1050 k = k + 1
1051 m = m + 1
1052
1053 ```
1054
1055 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1056 {
1057 uint_xlen_t r = 0;
1058 for (int i = 0, j = 0; i < XLEN; i++)
1059 if ((RB >> i) & 1) {
1060 if ((RA >> j) & 1)
1061 r |= uint_xlen_t(1) << i;
1062 j++;
1063 }
1064 return r;
1065 }
1066
1067 ```
1068
1069 ## bit extract
1070
1071 other way round: identical to RV bext: pextd, found in v3.1 p196
1072
1073 ```
1074 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1075 {
1076 uint_xlen_t r = 0;
1077 for (int i = 0, j = 0; i < XLEN; i++)
1078 if ((RB >> i) & 1) {
1079 if ((RA >> i) & 1)
1080 r |= uint_xlen_t(1) << j;
1081 j++;
1082 }
1083 return r;
1084 }
1085 ```
1086
1087 ## centrifuge
1088
1089 found in v3.1 p106 so not to be added here
1090
1091 ```
1092 ptr0 = 0
1093 ptr1 = 0
1094 do i = 0 to 63
1095 if((RB)i=0) then do
1096 resultptr0 = (RS)i
1097 end
1098 ptr0 = ptr0 + 1
1099 if((RB)63-i==1) then do
1100 result63-ptr1 = (RS)63-i
1101 end
1102 ptr1 = ptr1 + 1
1103 RA = result
1104 ```
1105
1106 ## bit to byte permute
1107
1108 similar to matrix permute in RV bitmanip, which has XOR and OR variants,
1109 these perform a transpose (bmatflip).
1110 TODO this looks VSX is there a scalar variant
1111 in v3.0/1 already
1112
1113 do j = 0 to 7
1114 do k = 0 to 7
1115 b = VSR[VRB+32].dword[i].byte[k].bit[j]
1116 VSR[VRT+32].dword[i].byte[j].bit[k] = b
1117
1118 ## grev
1119
1120 superceded by grevlut
1121
1122 based on RV bitmanip, this is also known as a butterfly network. however
1123 where a butterfly network allows setting of every crossbar setting in
1124 every row and every column, generalised-reverse (grev) only allows
1125 a per-row decision: every entry in the same row must either switch or
1126 not-switch.
1127
1128 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
1129
1130 ```
1131 uint64_t grev64(uint64_t RA, uint64_t RB)
1132 {
1133 uint64_t x = RA;
1134 int shamt = RB & 63;
1135 if (shamt & 1) x = ((x & 0x5555555555555555LL) << 1) |
1136 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
1137 if (shamt & 2) x = ((x & 0x3333333333333333LL) << 2) |
1138 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
1139 if (shamt & 4) x = ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
1140 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
1141 if (shamt & 8) x = ((x & 0x00FF00FF00FF00FFLL) << 8) |
1142 ((x & 0xFF00FF00FF00FF00LL) >> 8);
1143 if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
1144 ((x & 0xFFFF0000FFFF0000LL) >> 16);
1145 if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
1146 ((x & 0xFFFFFFFF00000000LL) >> 32);
1147 return x;
1148 }
1149
1150 ```
1151
1152 ## gorc
1153
1154 based on RV bitmanip, gorc is superceded by grevlut
1155
1156 ```
1157 uint32_t gorc32(uint32_t RA, uint32_t RB)
1158 {
1159 uint32_t x = RA;
1160 int shamt = RB & 31;
1161 if (shamt & 1) x |= ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
1162 if (shamt & 2) x |= ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
1163 if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
1164 if (shamt & 8) x |= ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8);
1165 if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
1166 return x;
1167 }
1168 uint64_t gorc64(uint64_t RA, uint64_t RB)
1169 {
1170 uint64_t x = RA;
1171 int shamt = RB & 63;
1172 if (shamt & 1) x |= ((x & 0x5555555555555555LL) << 1) |
1173 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
1174 if (shamt & 2) x |= ((x & 0x3333333333333333LL) << 2) |
1175 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
1176 if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
1177 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
1178 if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL) << 8) |
1179 ((x & 0xFF00FF00FF00FF00LL) >> 8);
1180 if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL) << 16) |
1181 ((x & 0xFFFF0000FFFF0000LL) >> 16);
1182 if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL) << 32) |
1183 ((x & 0xFFFFFFFF00000000LL) >> 32);
1184 return x;
1185 }
1186
1187 ```
1188
1189
1190 # Appendix
1191
1192 see [[bitmanip/appendix]]
1193