openpower/sv/biginteger.mdwn

   1 [[!tag standards]]
   2
   3 # Big Integer Arithmetic
   4
   5 **DRAFT STATUS** 19apr2022, last edited 23may2022
   6
   7 * [[discussion]] page for notes
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=817> bugreport
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=937> 128/64 shifts
  10 * [[biginteger/analysis]]
  11 * [[openpower/isa/svfixedarith]] pseudocode
  12
  13 BigNum arithmetic is extremely common especially in cryptography,
  14 where for example RSA relies on arithmetic of 2048 or 4096 bits
  15 in length.  The primary operations are add, multiply and divide
  16 (and modulo) with specialisations of subtract and signed multiply.
  17
  18 A reminder that a particular focus of SVP64 is that it is built on
  19 top of Scalar operations, where those scalar operations are useful in
  20 their own right without SVP64. Thus the operations here are proposed
  21 first as Scalar Extensions to the Power ISA.
  22
  23 A secondary focus is that if Vectorised, implementors may choose
  24 to deploy macro-op fusion targetting back-end 256-bit or greater
  25 Dynamic SIMD ALUs for maximum performance and effectiveness.
  26
  27 # Analysis
  28
  29 Covered in [[biginteger/analysis]] the summary is that standard `adde`
  30 is sufficient for SVP64 Vectorisation of big-integer addition (and `subfe`
  31 for subtraction) but that big-integer shift, multiply and divide require an
  32 extra 3-in 2-out instructions, similar to Intel's
  33 [shld](https://www.felixcloutier.com/x86/shld)
  34 and [shrd](https://www.felixcloutier.com/x86/shrd),
  35 [mulx](https://www.felixcloutier.com/x86/mulx) and
  36 [divq](https://www.felixcloutier.com/x86/div),
  37 to be efficient.
  38 The same instruction (`maddedu`) is used in both
  39 big-divide and big-multiply because 'maddedu''s primary
  40 purpose is to perform a fused 64-bit scalar multiply with a large vector,
  41 where that result is Big-Added for Big-Multiply, but Big-Subtracted for
  42 Big-Divide.
  43
  44 Chaining the operations together gives Scalar-by-Vector
  45 operations, except for `sv.adde` and `sv.subfe` which are
  46 Vector-by-Vector Chainable (through the `CA` flag).
  47 Macro-op Fusion and back-end massively-wide SIMD ALUs may be deployed in a
  48 fashion that is hidden from the user, behind a consistent, stable ISA API.
  49 The same macro-op fusion may theoretically be deployed even on Scalar
  50 operations.
  51
  52 # dsld and dsrd
  53
  54 **DRAFT**
  55
  56 **dsld**
  57
  58 |0.....5|6..10|11..15|16..20|21.25|26..30|31|
  59 |-------|-----|------|------|-----|------|--|
  60 | EXT04 | RT  |  RA  |  RB  | RC  |  XO  |Rc|
  61
  62 VA2-Form
  63
  64 * dsld    RT,RA,RB,RC  (Rc=0)
  65 * dsld.   RT,RA,RB,RC  (Rc=1)
  66
  67 Pseudo-code:
  68
  69     n <- (RB)[58:63]
  70     v <- ROTL64((RA), n)
  71     mask <- MASK(0, 63-n)
  72     RT <- (v[0:63] & mask) | ((RC) & ¬mask)
  73     RS <- v[0:63] & ¬mask
  74     overflow = 0
  75     if RS != [0]*64:
  76         overflow = 1
  77
  78 Special Registers Altered:
  79
  80     CR0                    (if Rc=1)
  81
  82 **dsrd**
  83
  84 |0.....5|6..10|11..15|16..20|21.25|26..30|31|
  85 |-------|-----|------|------|-----|------|--|
  86 | EXT04 | RT  |  RA  |  RB  | RC  |  XO  |Rc|
  87
  88 VA2-Form
  89
  90 * dsrd    RT,RA,RB,RC  (Rc=0)
  91 * dsrd.   RT,RA,RB,RC  (Rc=1)
  92
  93 Pseudo-code:
  94
  95     n <- (RB)[58:63]
  96     v <- ROTL64((RA), 64-n)
  97     mask <- MASK(n, 63)
  98     RT <- (v[0:63] & mask) | ((RC) & ¬mask)
  99     RS <- v[0:63] & ¬mask
 100     overflow = 0
 101     if RS != [0]*64:
 102         overflow = 1
 103
 104 Special Registers Altered:
 105
 106     CR0                    (if Rc=1)
 107
 108
 109 # maddedu
 110
 111 **DRAFT**
 112
 113 `maddedu` is similar to v3.0 `madd`, and
 114 is VA-Form despite having 2 outputs: the second
 115 destination register is implicit.
 116
 117 |0.....5|6..10|11..15|16..20|21..25|26..31|
 118 |-------|-----|------|------|------|------|
 119 | EXT04 | RT  |  RA  |  RB  |   RC |  XO  |
 120
 121 The pseudocode for `maddedu RT, RA, RB, RC` is:
 122
 123     prod[0:127] = (RA) * (RB)
 124     sum[0:127] = EXTZ(RC) + prod
 125     RT <- sum[64:127]
 126     RS <- sum[0:63] # RS implicit register, see below
 127
 128 RC is zero-extended (not shifted, not sign-extended), the 128-bit product added
 129 to it; the lower half of that result stored in RT and the upper half
 130 in RS.
 131
 132 The differences here to `maddhdu` are that `maddhdu` stores the upper
 133 half in RT, where `maddedu` stores the upper half in RS.
 134
 135 The value stored in RT is exactly equivalent to `maddld` despite `maddld`
 136 performing sign-extension on RC, because RT is the full mathematical result
 137 modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
 138 results modulo 2^64. This is why there is no maddldu instruction.
 139
 140 *Programmer's Note:
 141 As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1.
 142 To achieve the same big-integer rolling-accumulation effect
 143 as SVP64: assuming the scalar to multiply is in r0,
 144 the vector to multiply by starts at r4 and the result vector
 145 in r20, instructions may be issued `maddedu r20,r4,r0,r20
 146 maddedu r21,r5,r0,r21` etc. where the first `maddedu` will have
 147 stored the upper half of the 128-bit multiply into r21, such
 148 that it may be picked up by the second `maddedu`. Repeat inline
 149 to construct a larger bigint scalar-vector multiply,
 150 as Scalar GPR register file space permits.*
 151
 152 SVP64 overrides the Scalar behaviour of what defines RS.
 153 For SVP64 EXTRA register extension, the `RM-1P-3S-1D` format is
 154 used with the additional bit set for determining RS.
 155
 156 | Field Name | Field bits | Description                            |
 157 |------------|------------|----------------------------------------|
 158 | Rdest\_EXTRA2 | `10:11` | extends RT (R\*\_EXTRA2 Encoding)   |
 159 | Rsrc1\_EXTRA2 | `12:13` | extends RA (R\*\_EXTRA2 Encoding)   |
 160 | Rsrc2\_EXTRA2 | `14:15` | extends RB (R\*\_EXTRA2 Encoding)   |
 161 | Rsrc3\_EXTRA2 | `16:17` | extends RC (R\*\_EXTRA2 Encoding)   |
 162 | EXTRA2_MODE   | `18`    | used by `maddedu` for determining RS |
 163
 164 When `EXTRA2_MODE` is set to zero, the implicit RS register takes
 165 its Vector/Scalar setting from Rdest_EXTRA2, and takes
 166 the register number from RT, but all numbering
 167 is offset by MAXVL. *Note that element-width overrides influence this
 168 offset* (see SVP64 [[svp64/appendix]] for full details).
 169
 170 When `EXTRA2_MODE` is set to one, the implicit RS register is identical
 171 to RC extended with SVP64 using `Rsrc3_EXTRA2` in every respect, including whether RC is set Scalar or
 172 Vector.
 173
 174 # divmod2du RT,RA,RB,RC
 175
 176 **DRAFT**
 177
 178 Divide/Modulu Quad-Double Unsigned is another VA-Form instruction
 179 that is near-identical to `divdeu` except that:
 180
 181 * the lower 64 bits of the dividend, instead of being zero, contain a
 182   register, RC.
 183 * it performs a fused divide and modulo in a single instruction, storing
 184   the modulo in an implicit RS (similar to `maddedu`)
 185
 186 RB, the divisor, remains 64 bit.  The instruction is therefore a 128/64
 187 division, producing a (pair) of 64 bit result(s), in the same way that
 188 Intel [divq](https://www.felixcloutier.com/x86/div) works.
 189 Overflow conditions
 190 are detected in exactly the same fashion as `divdeu`, except that rather
 191 than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
 192 zeros on overflow.
 193
 194 *Programmer's note: there are no Rc variants of any of these VA-Form
 195 instructions. `cmpi` will need to be used to detect overflow conditions:
 196 the saving in instruction count is that both RT and RS will have already
 197 been set to useful values (all 1s and all zeros respectively)
 198 needed as part of implementing Knuth's
 199 Algorithm D*
 200
 201 For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,
 202 the exact same EXTRA format and setting of RS is used as for `sv.maddedu`.
 203 For Scalar usage, just as for `maddedu`, `RS=RT+1` (similar to `lq` and `stq`).
 204
 205 Pseudo-code:
 206
 207     if ((RA) <u (RB)) & ((RB) != [0]*XLEN) then
 208         dividend[0:(XLEN*2)-1] <- (RA) || (RC)
 209         divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
 210         result <- dividend / divisor
 211         modulo <- dividend % divisor
 212         RT <- result[XLEN:(XLEN*2)-1]
 213         RS <- modulo[XLEN:(XLEN*2)-1]
 214     else
 215         RT <- [1]*XLEN
 216         RS <- [0]*XLEN
 217
 218 # [DRAFT] EXT04 Proposed Map
 219
 220 For the Opcode map (XO Field)
 221 see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357.
 222 Proposed is the addition of:
 223
 224 * `maddedu` in `110010`
 225 * `divmod2du` in `111010`
 226 * `pcdec` in `111000`
 227
 228 |v >|   000|   001 |   010    |   011|   100  |   101  |   110   |   111  |
 229 |---|------|-------|----------|------|--------|--------|---------|--------|
 230 |110|maddhd|maddhdu|maddedu   |maddld|rsvd    |rsvd    |rsvd     |rsvd    |
 231 |111|pcdec.|rsvd   |divmod2du |vpermr|vaddequm|vaddecuq|vsubeuqm |vsubecuq|
 232