f16955f8506cb6f49814d5108d892fc2c3f3f4a8
[libreriscv.git] / openpower / sv / bitmanip.mdwn
1 [[!tag standards]]
2
3 [[!toc levels=1]]
4
5 # Implementation Log
6
7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
10 * binutils <https://bugs.libre-soc.org/show_bug.cgi?id=836>
11 * shift-and-add <https://bugs.libre-soc.org/show_bug.cgi?id=968>
12
13 # bitmanipulation
14
15 **DRAFT STATUS**
16
17 pseudocode: [[openpower/isa/bitmanip]]
18
19 this extension amalgamates bitmanipulation primitives from many sources,
20 including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
21 Also included are DSP/Multimedia operations suitable for Audio/Video.
22 Vectorisation and SIMD are removed: these are straight scalar (element)
23 operations making them suitable for embedded applications. Vectorisation
24 Context is provided by [[openpower/sv]].
25
26 When combined with SV, scalar variants of bitmanip operations found in
27 VSX are added so that the Packed SIMD aspects of VSX may be retired as
28 "legacy" in the far future (10 to 20 years). Also, VSX is hundreds of
29 opcodes, requires 128 bit pathways, and is wholly unsuited to low power
30 or embedded scenarios.
31
32 ternlogv is experimental and is the only operation that may be considered
33 a "Packed SIMD". It is added as a variant of the already well-justified
34 ternlog operation (done in AVX512 as an immediate only) "because it
35 looks fun". As it is based on the LUT4 concept it will allow accelerated
36 emulation of FPGAs. Other vendors of ISAs are buying FPGA companies to
37 achieve similar objectives.
38
39 general-purpose Galois Field 2^M operations are added so as to avoid
40 huge custom opcode proliferation across many areas of Computer Science.
41 however for convenience and also to avoid setup costs, some of the more
42 common operations (clmul, crc32) are also added. The expectation is
43 that these operations would all be covered by the same pipeline.
44
45 note that there are brownfield spaces below that could incorporate
46 some of the set-before-first and other scalar operations listed in
47 [[sv/mv.swizzle]],
48 [[sv/vector_ops]], [[sv/int_fp_mv]] and the [[sv/av_opcodes]] as well as
49 [[sv/setvl]], [[sv/svstep]], [[sv/remap]]
50
51 Useful resource:
52
53 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
54 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
55 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
56 * <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
57
58 [[!inline pages="openpower/sv/draft_opcode_tables" quick="yes" raw="yes" ]]
59
60 # binary and ternary bitops
61
62 Similar to FPGA LUTs: for two (binary) or three (ternary) inputs take
63 bits from each input, concatenate them and perform a lookup into a
64 table using an 8-8-bit immediate (for the ternary instructions), or in
65 another register (4-bit for the binary instructions). The binary lookup
66 instructions have CR Field lookup variants due to CR Fields being 4 bit.
67
68 Like the x86 AVX512F
69 [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq)
70 instructions.
71
72 ## ternlogi
73
74 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
75 | -- | -- | --- | --- | ----- | --- |--|
76 | NN | RT | RA | RB | im0-7 | 00 |Rc|
77
78 lut3(imm, a, b, c):
79 idx = c << 2 | b << 1 | a
80 return imm[idx] # idx by LSB0 order
81
82 for i in range(64):
83 RT[i] = lut3(imm, RB[i], RA[i], RT[i])
84
85 ## binlut
86
87 Binary lookup is a dynamic LUT2 version of ternlogi. Firstly, the
88 lookup table is 4 bits wide not 8 bits, and secondly the lookup
89 table comes from a register not an immediate.
90
91 | 0.5|6.10|11.15|16.20| 21..25|26..31 | Form |
92 | -- | -- | --- | --- | ----- |--------|---------|
93 | NN | RT | RA | RB | RC |nh 00001| VA-Form |
94 | NN | RT | RA | RB | /BFA/ |0 01001| VA-Form |
95
96 For binlut, the 4-bit LUT may be selected from either the high nibble
97 or the low nibble of the first byte of RC:
98
99 lut2(imm, a, b):
100 idx = b << 1 | a
101 return imm[idx] # idx by LSB0 order
102
103 imm = (RC>>(nh*4))&0b1111
104 for i in range(64):
105 RT[i] = lut2(imm, RB[i], RA[i])
106
107 For bincrlut, `BFA` selects the 4-bit CR Field as the LUT2:
108
109 for i in range(64):
110 RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
111
112 When Vectorised with SVP64, as usual both source and destination may be
113 Vector or Scalar.
114
115 *Programmer's note: a dynamic ternary lookup may be synthesised from
116 a pair of `binlut` instructions followed by a `ternlogi` to select which
117 to merge. Use `nh` to select which nibble to use as the lookup table
118 from the RC source register (`nh=1` nibble high), i.e. keeping
119 an 8-bit LUT3 in RC, the first `binlut` instruction may set nh=0 and
120 the second nh=1.*
121
122 ## crternlogi
123
124 another mode selection would be CRs not Ints.
125
126 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
127 | -- | -- | --- | --- | --- |-----|----- | -----|--|
128 | NN | BT | BA | BB | BC |m0-2 | imm | 01 |m3|
129
130 mask = m0-3
131 for i in range(4):
132 a,b,c = CRs[BA][i], CRs[BB][i], CRs[BC][i])
133 if mask[i] CRs[BT][i] = lut3(imm, a, b, c)
134
135 This instruction is remarkably similar to the existing crops, `crand` etc.
136 which have been noted to be a 4-bit (binary) LUT. In effect `crternlogi`
137 is the ternary LUT version of crops, having an 8-bit LUT.
138
139 ## crbinlog
140
141 With ternary (LUT3) dynamic instructions being very costly,
142 and CR Fields being only 4 bit, a binary (LUT2) variant is better
143
144 | 0.5|6.8 | 9.11|12.14|15.17|18.21|22...30 |31|
145 | -- | -- | --- | --- | --- |-----| -------- |--|
146 | NN | BT | BA | BB | BC |m0-m3|000101110 |0 |
147
148 mask = m0..m3
149 for i in range(4):
150 a,b = CRs[BA][i], CRs[BB][i])
151 if mask[i] CRs[BT][i] = lut2(CRs[BC], a, b)
152
153 When SVP64 Vectorised any of the 4 operands may be Scalar or
154 Vector, including `BC` meaning that multiple different dynamic
155 lookups may be performed with a single instruction.
156
157 *Programmer's note: just as with binlut and ternlogi, a pair
158 of crbinlog instructions followed by a merging crternlogi may
159 be deployed to synthesise dynamic ternary (LUT3) CR Field
160 manipulation*
161
162 # int ops
163
164 ## min/m
165
166 required for the [[sv/av_opcodes]]
167
168 signed and unsigned min/max for integer. this is sort-of partly
169 synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg
170 is one of the sources, but not both signed and unsigned. when the dest
171 is also one of the srces and the mv fails due to the CR bittest failing
172 this will only overwrite the dest where the src is greater (or less).
173
174 signed/unsigned min/max gives more flexibility.
175
176 X-Form
177
178 * XO=0001001110, itype=0b00 min, unsigned
179 * XO=0101001110, itype=0b01 min, signed
180 * XO=0011001110, itype=0b10 max, unsigned
181 * XO=0111001110, itype=0b11 max, signed
182
183
184 ```
185 uint_xlen_t mins(uint_xlen_t rs1, uint_xlen_t rs2)
186 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
187 }
188 uint_xlen_t maxs(uint_xlen_t rs1, uint_xlen_t rs2)
189 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
190 }
191 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
192 { return rs1 < rs2 ? rs1 : rs2;
193 }
194 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
195 { return rs1 > rs2 ? rs1 : rs2;
196 }
197 ```
198
199 ## average
200
201 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
202 but not scalar
203
204 ```
205 uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
206 return (rs1 + rs2 + 1) >> 1:
207 }
208 ```
209
210 ## absdu
211
212 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
213 but not scalar
214
215 ```
216 uint_xlen_t absdu(uint_xlen_t rs1, uint_xlen_t rs2) {
217 return (src1 > src2) ? (src1-src2) : (src2-src1)
218 }
219 ```
220
221 ## abs-accumulate
222
223 required for the [[sv/av_opcodes]], these are needed for motion estimation.
224 both are overwrite on RS.
225
226 ```
227 uint_xlen_t uintabsacc(uint_xlen_t rs, uint_xlen_t ra, uint_xlen_t rb) {
228 return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
229 }
230 uint_xlen_t intabsacc(uint_xlen_t rs, int_xlen_t ra, int_xlen_t rb) {
231 return rs + (src1 > src2) ? (src1-src2) : (src2-src1)
232 }
233 ```
234
235 For SVP64, the twin Elwidths allows e.g. a 16 bit accumulator for 8 bit
236 differences. Form is `RM-1P-3S1D` where RS-as-source has a separate
237 SVP64 designation from RS-as-dest. This gives a limited range of
238 non-overwrite capability.
239
240 # shift-and-add <a name="shift-add"> </a>
241
242 Power ISA is missing LD/ST with shift, which is present in both ARM and x86.
243 Too complex to add more LD/ST, a compromise is to add shift-and-add.
244 Replaces a pair of explicit instructions in hot-loops.
245
246 ```
247 uint_xlen_t shadd(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
248 sh = sh & 0x3;
249 return (rs1 << (sh+1)) + rs2;
250 }
251
252 uint_xlen_t shadduw(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
253 uint_xlen_t rs1z = rs1 & 0xFFFFFFFF;
254 sh = sh & 0x3;
255 return (rs1z << (sh+1)) + rs2;
256 }
257 ```
258
259 # bitmask set
260
261 based on RV bitmanip singlebit set, instruction format similar to shift
262 [[isa/fixedshift]]. bmext is actually covered already (shift-with-mask
263 rldicl but only immediate version). however bitmask-invert is not,
264 and set/clr are not covered, although they can use the same Shift ALU.
265
266 bmext (RB) version is not the same as rldicl because bmext is a right
267 shift by RC, where rldicl is a left rotate. for the immediate version
268 this does not matter, so a bmexti is not required. bmrev however there
269 is no direct equivalent and consequently a bmrevi is required.
270
271 bmset (register for mask amount) is particularly useful for creating
272 predicate masks where the length is a dynamic runtime quantity.
273 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask"
274 in a single instruction without needing to initialise or depend on any
275 other registers.
276
277 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31| name |
278 | -- | -- | --- | --- | --- | ------- |--| ----- |
279 | NN | RS | RA | RB | RC | mode 010 |Rc| bm\* |
280
281 Immediate-variant is an overwrite form:
282
283 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
284 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
285 | NN | RS | RB | sh | SH | itype | 1000 110 |Rc| bm\*i |
286
287 ```
288 def MASK(x, y):
289 if x < y:
290 x = x+1
291 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
292 mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
293 elif x == y:
294 return 1 << x
295 else:
296 x = x+1
297 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
298 mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
299 return mask_a ^ mask_b
300
301
302 uint_xlen_t bmset(RS, RB, sh)
303 {
304 int shamt = RB & (XLEN - 1);
305 mask = (2<<sh)-1;
306 return RS | (mask << shamt);
307 }
308
309 uint_xlen_t bmclr(RS, RB, sh)
310 {
311 int shamt = RB & (XLEN - 1);
312 mask = (2<<sh)-1;
313 return RS & ~(mask << shamt);
314 }
315
316 uint_xlen_t bminv(RS, RB, sh)
317 {
318 int shamt = RB & (XLEN - 1);
319 mask = (2<<sh)-1;
320 return RS ^ (mask << shamt);
321 }
322
323 uint_xlen_t bmext(RS, RB, sh)
324 {
325 int shamt = RB & (XLEN - 1);
326 mask = (2<<sh)-1;
327 return mask & (RS >> shamt);
328 }
329 ```
330
331 bitmask extract with reverse. can be done by bit-order-inverting all
332 of RB and getting bits of RB from the opposite end.
333
334 when RA is zero, no shift occurs. this makes bmextrev useful for
335 simply reversing all bits of a register.
336
337 ```
338 msb = ra[5:0];
339 rev[0:msb] = rb[msb:0];
340 rt = ZE(rev[msb:0]);
341
342 uint_xlen_t bmrevi(RA, RB, sh)
343 {
344 int shamt = XLEN-1;
345 if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
346 shamt = (XLEN-1)-shamt; # shift other end
347 brb = bitreverse(GPR(RB)) # swap LSB-MSB
348 mask = (2<<sh)-1;
349 return mask & (brb >> shamt);
350 }
351
352 uint_xlen_t bmrev(RA, RB, RC) {
353 return bmrevi(RA, RB, GPR(RC) & 0b111111);
354 }
355 ```
356
357 | 0.5|6.10|11.15|16.20|21.26| 27..30 |31| name | Form |
358 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
359 | NN | RT | RA | RB | sh | 1111 |Rc| bmrevi | MDS-Form |
360
361 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31| name | Form |
362 | -- | -- | --- | --- | --- | ------- |--| ------ | -------- |
363 | NN | RT | RA | RB | RC | 11110 |Rc| bmrev | VA2-Form |
364
365 # grevlut <a name="grevlut"> </a>
366
367 generalised reverse combined with a pair of LUT2s and allowing
368 a constant `0b0101...0101` when RA=0, and an option to invert
369 (including when RA=0, giving a constant 0b1010...1010 as the
370 initial value) provides a wide range of instructions
371 and a means to set hundreds of regular 64 bit patterns with one
372 single 32 bit instruction.
373
374 the two LUT2s are applied left-half (when not swapping)
375 and right-half (when swapping) so as to allow a wider
376 range of options.
377
378 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
379
380 * A value of `0b11001010` for the immediate provides
381 the functionality of a standard "grev".
382 * `0b11101110` provides gorc
383
384 grevlut should be arranged so as to produce the constants
385 needed to put into bext (bitextract) so as in turn to
386 be able to emulate x86 pmovmask instructions
387 <https://www.felixcloutier.com/x86/pmovmskb>.
388 This only requires 2 instructions (grevlut, bext).
389
390 Note that if the mask is required to be placed
391 directly into CR Fields (for use as CR Predicate
392 masks rather than a integer mask) then sv.cmpi or sv.ori
393 may be used instead, bearing in mind that sv.ori
394 is a 64-bit instruction, and `VL` must have been
395 set to the required length:
396
397 sv.ori./elwid=8 r10.v, r10.v, 0
398
399 The following settings provide the required mask constants:
400
401 | RA=0 | RB | imm | iv | result |
402 | ------- | ------- | ---------- | -- | ---------- |
403 | 0x555.. | 0b10 | 0b01101100 | 0 | 0x111111... |
404 | 0x555.. | 0b110 | 0b01101100 | 0 | 0x010101... |
405 | 0x555.. | 0b1110 | 0b01101100 | 0 | 0x00010001... |
406 | 0x555.. | 0b10 | 0b11000110 | 1 | 0x88888... |
407 | 0x555.. | 0b110 | 0b11000110 | 1 | 0x808080... |
408 | 0x555.. | 0b1110 | 0b11000110 | 1 | 0x80008000... |
409
410 Better diagram showing the correct ordering of shamt (RB). A LUT2
411 is applied to all locations marked in red using the first 4
412 bits of the immediate, and a separate LUT2 applied to all
413 locations in green using the upper 4 bits of the immediate.
414
415 <img src="/openpower/sv/grevlut.png" width=700 />
416
417 demo code [[openpower/sv/grevlut.py]]
418
419 ```
420 lut2(imm, a, b):
421 idx = b << 1 | a
422 return imm[idx] # idx by LSB0 order
423
424 dorow(imm8, step_i, chunksize, us32b):
425 for j in 0 to 31 if is32b else 63:
426 if (j&chunk_size) == 0
427 imm = imm8[0..3]
428 else
429 imm = imm8[4..7]
430 step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
431 return step_o
432
433 uint64_t grevlut(uint64_t RA, uint64_t RB, uint8 imm, bool iv, bool is32b)
434 {
435 uint64_t x = 0x5555_5555_5555_5555;
436 if (RA != 0) x = GPR(RA);
437 if (iv) x = ~x;
438 int shamt = RB & 31 if is32b else 63
439 for i in 0 to (6-is32b)
440 step = 1<<i
441 if (shamt & step) x = dorow(imm, x, step, is32b)
442 return x;
443 }
444 ```
445
446 A variant may specify different LUT-pairs per row,
447 using one byte of RB for each. If it is desired that
448 a particular row-crossover shall not be applied it is
449 a simple matter to set the appropriate LUT-pair in RB
450 to effect an identity transform for that row (`0b11001010`).
451
452 ```
453 uint64_t grevlutr(uint64_t RA, uint64_t RB, bool iv, bool is32b)
454 {
455 uint64_t x = 0x5555_5555_5555_5555;
456 if (RA != 0) x = GPR(RA);
457 if (iv) x = ~x;
458 for i in 0 to (6-is32b)
459 step = 1<<i
460 imm = (RB>>(i*8))&0xff
461 x = dorow(imm, x, step, is32b)
462 return x;
463 }
464
465 ```
466
467 | 0.5|6.10|11.15|16.20 |21..28 | 29.30|31| name | Form |
468 | -- | -- | --- | --- | ----- | -----|--| ------ | ----- |
469 | NN | RT | RA | s0-4 | im0-7 | 1 iv |s5| grevlogi | |
470 | NN | RT | RA | RB | im0-7 | 01 |0 | grevlog | |
471
472 An equivalent to `grevlogw` may be synthesised by setting the
473 appropriate bits in RB to set the top half of RT to zero.
474 Thus an explicit grevlogw instruction is not necessary.
475
476 # xperm
477
478 based on RV bitmanip.
479
480 RA contains a vector of indices to select parts of RB to be
481 copied to RT. The immediate-variant allows up to an 8 bit
482 pattern (repeated) to be targetted at different parts of RT.
483
484 xperm shares some similarity with one of the uses of bmator
485 in that xperm indices are binary addressing where bitmator
486 may be considered to be unary addressing.
487
488 ```
489 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
490 {
491 uint_xlen_t r = 0;
492 uint_xlen_t sz = 1LL << sz_log2;
493 uint_xlen_t mask = (1LL << sz) - 1;
494 uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
495 for (int i = 0; i < XLEN; i += sz) {
496 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
497 if (pos < XLEN)
498 r |= ((RB >> pos) & mask) << i;
499 }
500 return r;
501 }
502 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
503 {
504 uint_xlen_t r = 0;
505 uint_xlen_t sz = 1LL << sz_log2;
506 uint_xlen_t mask = (1LL << sz) - 1;
507 for (int i = 0; i < XLEN; i += sz) {
508 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
509 if (pos < XLEN)
510 r |= ((RB >> pos) & mask) << i;
511 }
512 return r;
513 }
514 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
515 { return xperm(RA, RB, 2); }
516 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
517 { return xperm(RA, RB, 3); }
518 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
519 { return xperm(RA, RB, 4); }
520 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
521 { return xperm(RA, RB, 5); }
522 ```
523
524 # bitmatrix
525
526 bmatflip and bmatxor is found in the Cray XMT, and in x86 is known
527 as GF2P8AFFINEQB. uses:
528
529 * <https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0>
530 * SM4, Reed Solomon, RAID6
531 <https://stackoverflow.com/questions/59124720/what-are-the-avx-512-galois-field-related-instructions-for>
532 * Vector bit-reverse <https://reviews.llvm.org/D91515?id=305411>
533 * Affine Inverse <https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEINVQB>
534
535 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name | Form |
536 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- | ------- |
537 | NN | RS | RA |im04 | im5| 1 1 | im67 00 110 |Rc| bmatxori | TODO |
538
539
540 ```
541 uint64_t bmatflip(uint64_t RA)
542 {
543 uint64_t x = RA;
544 x = shfl64(x, 31);
545 x = shfl64(x, 31);
546 x = shfl64(x, 31);
547 return x;
548 }
549
550 uint64_t bmatxori(uint64_t RS, uint64_t RA, uint8_t imm) {
551 // transpose of RA
552 uint64_t RAt = bmatflip(RA);
553 uint8_t u[8]; // rows of RS
554 uint8_t v[8]; // cols of RA
555 for (int i = 0; i < 8; i++) {
556 u[i] = RS >> (i*8);
557 v[i] = RAt >> (i*8);
558 }
559 uint64_t bit, x = 0;
560 for (int i = 0; i < 64; i++) {
561 bit = (imm >> (i%8)) & 1;
562 bit ^= pcnt(u[i / 8] & v[i % 8]) & 1;
563 x |= bit << i;
564 }
565 return x;
566 }
567
568 uint64_t bmatxor(uint64_t RA, uint64_t RB) {
569 return bmatxori(RA, RB, 0xff)
570 }
571
572 uint64_t bmator(uint64_t RA, uint64_t RB) {
573 // transpose of RB
574 uint64_t RBt = bmatflip(RB);
575 uint8_t u[8]; // rows of RA
576 uint8_t v[8]; // cols of RB
577 for (int i = 0; i < 8; i++) {
578 u[i] = RA >> (i*8);
579 v[i] = RBt >> (i*8);
580 }
581 uint64_t x = 0;
582 for (int i = 0; i < 64; i++) {
583 if ((u[i / 8] & v[i % 8]) != 0)
584 x |= 1LL << i;
585 }
586 return x;
587 }
588
589 uint64_t bmatand(uint64_t RA, uint64_t RB) {
590 // transpose of RB
591 uint64_t RBt = bmatflip(RB);
592 uint8_t u[8]; // rows of RA
593 uint8_t v[8]; // cols of RB
594 for (int i = 0; i < 8; i++) {
595 u[i] = RA >> (i*8);
596 v[i] = RBt >> (i*8);
597 }
598 uint64_t x = 0;
599 for (int i = 0; i < 64; i++) {
600 if ((u[i / 8] & v[i % 8]) == 0xff)
601 x |= 1LL << i;
602 }
603 return x;
604 }
605 ```
606
607 # Introduction to Carry-less and GF arithmetic
608
609 * obligatory xkcd <https://xkcd.com/2595/>
610
611 There are three completely separate types of Galois-Field-based arithmetic
612 that we implement which are not well explained even in introductory
613 literature. A slightly oversimplified explanation is followed by more
614 accurate descriptions:
615
616 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
617 but is accidentally referred to as GF(2) - see below as to why.
618 * `GF(p)` modulo arithmetic with a Prime number, these are "proper"
619 Galois Fields
620 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
621 (2^N) and a second "reducing" polynomial (similar to a prime number), these
622 are said to be GF(2^N) arithmetic.
623
624 further detailed and more precise explanations are provided below
625
626 * **Polynomials with coefficients in `GF(2)`**
627 (aka. Carry-less arithmetic -- the `cl*` instructions).
628 This isn't actually a Galois Field, but its coefficients are. This is
629 basically binary integer addition, subtraction, and multiplication like
630 usual, except that carries aren't propagated at all, effectively turning
631 both addition and subtraction into the bitwise xor operation. Division and
632 remainder are defined to match how addition and multiplication works.
633 * **Galois Fields with a prime size**
634 (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
635 This is basically just the integers mod `p`.
636 * **Galois Fields with a power-of-a-prime size**
637 (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
638 integer `n > 0`).
639 We only implement these for `p == 2`, called Binary Galois Fields
640 (`GF(2^n)` -- the `gfb*` instructions).
641 For any prime `p`, `GF(p^n)` is implemented as polynomials with
642 coefficients in `GF(p)` and degree `< n`, where the polynomials are the
643 remainders of dividing by a specificly chosen polynomial in `GF(p)` called
644 the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
645 Polynomial must be an irreducable polynomial (like primes, but for
646 polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
647 and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
648 affect `GF(p^n)`'s mathematical shape, all that changes is the specific
649 polynomials used to implement `GF(p^n)`.
650
651 Many implementations and much of the literature do not make a clear
652 distinction between these three categories, which makes it confusing
653 to understand what their purpose and value is.
654
655 * carry-less multiply is extremely common and is used for the ubiquitous
656 CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
657 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
658 has significant uses throughout cryptography
659 * GF(p) is the basis again of a significant quantity of algorithms
660 (TODO, list them, jacob knows what they are), even though the
661 modulo is limited to be below 64-bit (size of a scalar int)
662
663 # Instructions for Carry-less Operations
664
665 aka. Polynomials with coefficients in `GF(2)`
666
667 Carry-less addition/subtraction is simply XOR, so a `cladd`
668 instruction is not provided since the `xor[i]` instruction can be used instead.
669
670 These are operations on polynomials with coefficients in `GF(2)`, with the
671 polynomial's coefficients packed into integers with the following algorithm:
672
673 ```python
674 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
675 ```
676
677 ## Carry-less Multiply Instructions
678
679 based on RV bitmanip
680 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
681 <https://www.felixcloutier.com/x86/pclmulqdq> and
682 <https://en.m.wikipedia.org/wiki/Carry-less_product>
683
684 They are worth adding as their own non-overwrite operations
685 (in the same pipeline).
686
687 ### `clmul` Carry-less Multiply
688
689 ```python
690 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
691 ```
692
693 ### `clmulh` Carry-less Multiply High
694
695 ```python
696 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
697 ```
698
699 ### `clmulr` Carry-less Multiply (Reversed)
700
701 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
702 bit-reversed inputs.
703
704 ```python
705 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
706 ```
707
708 ## `clmadd` Carry-less Multiply-Add
709
710 ```
711 clmadd RT, RA, RB, RC
712 ```
713
714 ```
715 (RT) = clmul((RA), (RB)) ^ (RC)
716 ```
717
718 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
719
720 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
721 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
722 to avoid the need for a temp register. RS is written to as well as RT.
723
724 Note: Polynomials over GF(2) are a Ring rather than a Field, so, because the
725 definition of the Inverse Discrete Fourier Transform involves calculating a
726 multiplicative inverse, which may not exist in every Ring, therefore the
727 Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
728 to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
729 TODO: check with someone who knows for sure if that's correct.)
730
731 ```
732 cltmadd RT, RA, RB, RC
733 ```
734
735 TODO: add link to explanation for where `RS` comes from.
736
737 ```
738 a = (RA)
739 c = (RC)
740 # read all inputs before writing to any outputs in case
741 # an input overlaps with an output register.
742 (RT) = clmul(a, (RB)) ^ c
743 (RS) = a ^ c
744 ```
745
746 ## `cldivrem` Carry-less Division and Remainder
747
748 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
749 for other instructions.
750
751 ```python
752 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
753 ```
754
755 ## `cldiv` Carry-less Division
756
757 ```
758 cldiv RT, RA, RB
759 ```
760
761 ```
762 n = (RA)
763 d = (RB)
764 q, r = cldivrem(n, d, width=XLEN)
765 (RT) = q
766 ```
767
768 ## `clrem` Carry-less Remainder
769
770 ```
771 clrem RT, RA, RB
772 ```
773
774 ```
775 n = (RA)
776 d = (RB)
777 q, r = cldivrem(n, d, width=XLEN)
778 (RT) = r
779 ```
780
781 # Instructions for Binary Galois Fields `GF(2^m)`
782
783 see:
784
785 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
786 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
787 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
788
789 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
790 instruction is not provided since the `xor[i]` instruction can be used instead.
791
792 ## `GFBREDPOLY` SPR -- Reducing Polynomial
793
794 In order to save registers and to make operations orthogonal with standard
795 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
796 This also allows hardware to pre-compute useful parameters (such as the
797 degree, or look-up tables) based on the reducing polynomial, and store them
798 alongside the SPR in hidden registers, only recomputing them whenever the SPR
799 is written to, rather than having to recompute those values for every
800 instruction.
801
802 Because Galois Fields require the reducing polynomial to be an irreducible
803 polynomial, that guarantees that any polynomial of `degree > 1` must have
804 the LSB set, since otherwise it would be divisible by the polynomial `x`,
805 making it reducible, making whatever we're working on no longer a Field.
806 Therefore, we can reuse the LSB to indicate `degree == XLEN`.
807
808 ```python
809 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
810 ```
811
812 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
813
814 unless this is an immediate op, `mtspr` is completely sufficient.
815
816 ```python
817 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
818 ```
819
820 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
821
822 ```
823 gfbmul RT, RA, RB
824 ```
825
826 ```python
827 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
828 ```
829
830 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
831
832 ```
833 gfbmadd RT, RA, RB, RC
834 ```
835
836 ```python
837 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
838 ```
839
840 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
841
842 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
843 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
844 for a temp register. RS is written to as well as RT.
845
846 ```
847 gfbtmadd RT, RA, RB, RC
848 ```
849
850 TODO: add link to explanation for where `RS` comes from.
851
852 ```
853 a = (RA)
854 c = (RC)
855 # read all inputs before writing to any outputs in case
856 # an input overlaps with an output register.
857 (RT) = gfbmadd(a, (RB), c)
858 # use gfbmadd again since it reduces the result
859 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
860 ```
861
862 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
863
864 ```
865 gfbinv RT, RA
866 ```
867
868 ```python
869 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
870 ```
871
872 # Instructions for Prime Galois Fields `GF(p)`
873
874 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
875
876 ## `gfpadd` Prime Galois Field `GF(p)` Addition
877
878 ```
879 gfpadd RT, RA, RB
880 ```
881
882 ```python
883 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
884 ```
885
886 the addition happens on infinite-precision integers
887
888 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
889
890 ```
891 gfpsub RT, RA, RB
892 ```
893
894 ```python
895 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
896 ```
897
898 the subtraction happens on infinite-precision integers
899
900 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
901
902 ```
903 gfpmul RT, RA, RB
904 ```
905
906 ```python
907 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
908 ```
909
910 the multiplication happens on infinite-precision integers
911
912 ## `gfpinv` Prime Galois Field `GF(p)` Invert
913
914 ```
915 gfpinv RT, RA
916 ```
917
918 Some potential hardware implementations are found in:
919 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
920
921 ```python
922 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
923 ```
924
925 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
926
927 ```
928 gfpmadd RT, RA, RB, RC
929 ```
930
931 ```python
932 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
933 ```
934
935 the multiplication and addition happens on infinite-precision integers
936
937 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
938
939 ```
940 gfpmsub RT, RA, RB, RC
941 ```
942
943 ```python
944 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
945 ```
946
947 the multiplication and subtraction happens on infinite-precision integers
948
949 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
950
951 ```
952 gfpmsubr RT, RA, RB, RC
953 ```
954
955 ```python
956 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
957 ```
958
959 the multiplication and subtraction happens on infinite-precision integers
960
961 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
962
963 Used in combination with SV FFT REMAP to perform
964 a full Number-Theoretic-Transform in-place. Possible by having 3-in 2-out,
965 to avoid the need for a temp register. RS is written
966 to as well as RT.
967
968 ```
969 gfpmaddsubr RT, RA, RB, RC
970 ```
971
972 TODO: add link to explanation for where `RS` comes from.
973
974 ```
975 factor1 = (RA)
976 factor2 = (RB)
977 term = (RC)
978 # read all inputs before writing to any outputs in case
979 # an input overlaps with an output register.
980 (RT) = gfpmadd(factor1, factor2, term)
981 (RS) = gfpmsubr(factor1, factor2, term)
982 ```
983
984 # Already in POWER ISA or subsumed
985
986 Lists operations either included as part of
987 other bitmanip operations, or are already in
988 Power ISA.
989
990 ## cmix
991
992 based on RV bitmanip, covered by ternlog bitops
993
994 ```
995 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
996 return (RA & RB) | (RC & ~RB);
997 }
998 ```
999
1000 ## count leading/trailing zeros with mask
1001
1002 in v3.1 p105
1003
1004 ```
1005 count = 0
1006 do i = 0 to 63 if((RB)i=1) then do
1007 if((RS)i=1) then break end end count ← count + 1
1008 RA ← EXTZ64(count)
1009 ```
1010
1011 ## bit deposit
1012
1013 pdepd VRT,VRA,VRB, identical to RV bitmamip bdep, found already in v3.1 p106
1014
1015 do while(m < 64)
1016 if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1017 result = VSR[VRA+32].dword[i].bit[63-k]
1018 VSR[VRT+32].dword[i].bit[63-m] = result
1019 k = k + 1
1020 m = m + 1
1021
1022 ```
1023
1024 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1025 {
1026 uint_xlen_t r = 0;
1027 for (int i = 0, j = 0; i < XLEN; i++)
1028 if ((RB >> i) & 1) {
1029 if ((RA >> j) & 1)
1030 r |= uint_xlen_t(1) << i;
1031 j++;
1032 }
1033 return r;
1034 }
1035
1036 ```
1037
1038 ## bit extract
1039
1040 other way round: identical to RV bext: pextd, found in v3.1 p196
1041
1042 ```
1043 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1044 {
1045 uint_xlen_t r = 0;
1046 for (int i = 0, j = 0; i < XLEN; i++)
1047 if ((RB >> i) & 1) {
1048 if ((RA >> i) & 1)
1049 r |= uint_xlen_t(1) << j;
1050 j++;
1051 }
1052 return r;
1053 }
1054 ```
1055
1056 ## centrifuge
1057
1058 found in v3.1 p106 so not to be added here
1059
1060 ```
1061 ptr0 = 0
1062 ptr1 = 0
1063 do i = 0 to 63
1064 if((RB)i=0) then do
1065 resultptr0 = (RS)i
1066 end
1067 ptr0 = ptr0 + 1
1068 if((RB)63-i==1) then do
1069 result63-ptr1 = (RS)63-i
1070 end
1071 ptr1 = ptr1 + 1
1072 RA = result
1073 ```
1074
1075 ## bit to byte permute
1076
1077 similar to matrix permute in RV bitmanip, which has XOR and OR variants,
1078 these perform a transpose (bmatflip).
1079 TODO this looks VSX is there a scalar variant
1080 in v3.0/1 already
1081
1082 do j = 0 to 7
1083 do k = 0 to 7
1084 b = VSR[VRB+32].dword[i].byte[k].bit[j]
1085 VSR[VRT+32].dword[i].byte[j].bit[k] = b
1086
1087 ## grev
1088
1089 superceded by grevlut
1090
1091 based on RV bitmanip, this is also known as a butterfly network. however
1092 where a butterfly network allows setting of every crossbar setting in
1093 every row and every column, generalised-reverse (grev) only allows
1094 a per-row decision: every entry in the same row must either switch or
1095 not-switch.
1096
1097 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
1098
1099 ```
1100 uint64_t grev64(uint64_t RA, uint64_t RB)
1101 {
1102 uint64_t x = RA;
1103 int shamt = RB & 63;
1104 if (shamt & 1) x = ((x & 0x5555555555555555LL) << 1) |
1105 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
1106 if (shamt & 2) x = ((x & 0x3333333333333333LL) << 2) |
1107 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
1108 if (shamt & 4) x = ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
1109 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
1110 if (shamt & 8) x = ((x & 0x00FF00FF00FF00FFLL) << 8) |
1111 ((x & 0xFF00FF00FF00FF00LL) >> 8);
1112 if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
1113 ((x & 0xFFFF0000FFFF0000LL) >> 16);
1114 if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
1115 ((x & 0xFFFFFFFF00000000LL) >> 32);
1116 return x;
1117 }
1118
1119 ```
1120
1121 ## gorc
1122
1123 based on RV bitmanip, gorc is superceded by grevlut
1124
1125 ```
1126 uint32_t gorc32(uint32_t RA, uint32_t RB)
1127 {
1128 uint32_t x = RA;
1129 int shamt = RB & 31;
1130 if (shamt & 1) x |= ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
1131 if (shamt & 2) x |= ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
1132 if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
1133 if (shamt & 8) x |= ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8);
1134 if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
1135 return x;
1136 }
1137 uint64_t gorc64(uint64_t RA, uint64_t RB)
1138 {
1139 uint64_t x = RA;
1140 int shamt = RB & 63;
1141 if (shamt & 1) x |= ((x & 0x5555555555555555LL) << 1) |
1142 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
1143 if (shamt & 2) x |= ((x & 0x3333333333333333LL) << 2) |
1144 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
1145 if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
1146 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
1147 if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL) << 8) |
1148 ((x & 0xFF00FF00FF00FF00LL) >> 8);
1149 if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL) << 16) |
1150 ((x & 0xFFFF0000FFFF0000LL) >> 16);
1151 if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL) << 32) |
1152 ((x & 0xFFFFFFFF00000000LL) >> 32);
1153 return x;
1154 }
1155
1156 ```
1157
1158
1159 # Appendix
1160
1161 see [[bitmanip/appendix]]
1162