openpower/sv/rfc/ls003.mdwn

   1 # RFC ls003 Big Integer
   2
   3 **URLs**:
   4
   5 * <https://libre-soc.org/openpower/sv/biginteger/analysis/>
   6 * <https://libre-soc.org/openpower/sv/rfc/ls003/>
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=960>
   8 * <https://git.openpower.foundation/isa/PowerISA/issues/91>
   9
  10 **Severity**: Major
  11
  12 **Status**: New
  13
  14 **Date**: 20 Oct 2022
  15
  16 **Target**: v3.2B
  17
  18 **Source**: v3.0B
  19
  20 **Books and Section affected**: **UPDATE**
  21
  22 ```
  23     Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
  24     Appendix E Power ISA sorted by opcode
  25     Appendix F Power ISA sorted by version
  26     Appendix G Power ISA sorted by Compliancy Subset
  27     Appendix H Power ISA sorted by mnemonic
  28 ```
  29
  30 **Summary**
  31
  32 ```
  33     Instructions added
  34     maddedu - Multiply-Add Extended Double Unsigned
  35     divmod2du - Divide/Modulo Quad-Double Unsigned
  36 ```
  37
  38 **Submitter**: Luke Leighton (Libre-SOC)
  39
  40 **Requester**: Libre-SOC
  41
  42 **Impact on processor**:
  43
  44 ```
  45     Addition of two new GPR-based instructions
  46 ```
  47
  48 **Impact on software**:
  49
  50 ```
  51     Requires support for new instructions in assembler, debuggers,
  52     and related tools.
  53 ```
  54
  55 **Keywords**:
  56
  57 ```
  58     GPR, Big-integer, Double-word
  59 ```
  60
  61 **Motivation**
  62
  63 Similar to `maddhdu` and `maddld`, but allow for a big-integer rolling
  64 accumulation affect: `RC` effectively becomes a 64-bit carry in chains
  65 of highly-efficient loop-unrolled arbitrary-length big-integer operations.
  66 Similar to `divdeu`, and has similar advantages to `maddedu`,
  67 Modulo result is available with the quotient in a single instruction
  68 allowing highly-efficient arbitrary-length big-integer division.
  69
  70 **Notes and Observations**:
  71
  72 1. It is not practical to add Rc=1 variants as VA-Form is used and
  73    there is a **pair** of results produced.
  74 2. An overflow variant (XER.OV set) of `divmod2du` would be valuable
  75    but VA-Form EXT004 is under severe pressure.
  76 3. Both instructions have been present in Intel x86 for several decades.
  77 4. Neither instruction is present in VSX: these are 128/64 whereas
  78    VSX is 128/128.
  79 5. `maddedu` and `divmod2du` are full inverses of each other, including
  80   when used for arbitrary-length big-integer arithmetic
  81 6. These are both 3-in 2-out instructions. If Power ISA did not already
  82   have LD/ST-with-update instructions and instructions with `RAp`
  83   and `RTp` then these instructions would not be proposed.
  84
  85 **Changes**
  86
  87 Add the following entries to:
  88
  89 * the Appendices of Book I
  90 * Instructions of Book I added to Section 3.3.9.1
  91
  92 ----------------
  93
  94 \newpage{}
  95
  96 # Multiply-Add Extended Double Unsigned
  97
  98 `maddedu RT, RA, RB, RC`
  99
 100 |  0-5  | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
 101 |-------|------|-------|-------|-------|-------|---------|
 102 | EXT04 | RT   |  RA   |  RB   |   RC  |  XO   | VA-Form |
 103
 104 Pseudocode:
 105
 106 ```
 107 prod[0:127] <- (RA) * (RB)    # Multiply RA and RB, result 128-bit
 108 sum[0:127] <- EXTZ(RC) + prod # Zero extend RC, add product
 109 RT <- sum[64:127]             # Store low half in RT
 110 RS <- sum[0:63]               # RS implicit register, equal to RC
 111 ```
 112
 113 Special registers altered:
 114
 115     None
 116
 117 The 64-bit operands are (RA), (RB), and (RC).
 118 RC is zero-extended (not shifted, not sign-extended).
 119 The 128-bit product of the operands (RA) and (RB) is added to (RC).
 120 The low-order 64 bits of the 128-bit sum are
 121 placed into register RT.
 122 The high-order 64 bits of the 128-bit sum are
 123 placed into register RS.
 124 RS is implictly defined as the same register as RC.
 125
 126 All three operands and the result are interpreted as
 127 unsigned integers.
 128
 129 The differences here to `maddhdu` are that `maddhdu` stores the upper
 130 half in RT, where `maddedu` stores the upper half in RS.
 131
 132 The value stored in RT is exactly equivalent to `maddld` despite `maddld`
 133 performing sign-extension on RC, because RT is the full mathematical result
 134 modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
 135 results modulo 2^64. This is why there is no maddldu instruction.
 136
 137 *Programmer's Note:
 138 To achieve a big-integer rolling-accumulation effect:
 139 assuming the scalar to multiply is in r0, and r3 is
 140 used (effectively) as a 64-bit carry,
 141 the vector to multiply by starts at r4 and the result vector
 142 in r20, instructions may be issued `maddedu r20,r4,r0,r3
 143 maddedu r21,r5,r0,r3` etc. where the first `maddedu` will have
 144 stored the upper half of the 128-bit multiply into r3, such
 145 that it may be picked up by the second `maddedu`. Repeat inline
 146 to construct a larger bigint scalar-vector multiply,
 147 as Scalar GPR register file space permits. If register
 148 spill is required then r3, as the effective 64-bit carry,
 149 continues the chain.*
 150
 151 Examples:
 152
 153 ```
 154 # (r0 * r1) + r2, store lower in r4, upper in r2
 155 maddedu r4, r0, r1, r2
 156
 157 # Chaining together for larger bigint (see Programmer's Note above)
 158 # r3 starts with zero (no carry-in)
 159 maddedu r20,r4,r0,r3
 160 maddedu r21,r5,r0,r3
 161 maddedu r22,r6,r0,r3
 162 ```
 163
 164 ----------
 165
 166 \newpage{}
 167
 168 # Divide/Modulo Quad-Double Unsigned
 169
 170 **Should name be Divide/Module Double Extended Unsigned?**
 171 **Check the pseudo-code comments**
 172
 173 `divmod2du RT,RA,RB,RC`
 174
 175 |  0-5  | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
 176 |-------|------|-------|-------|-------|-------|---------|
 177 | EXT04 | RT   |  RA   |  RB   |   RC  |  XO   | VA-Form |
 178
 179 Pseudo-code:
 180
 181 ```
 182 if ((RA) <u (RB)) & ((RB) != [0]*64) then  # Check RA<RB, for divide-by-0
 183     dividend[0:127] <- (RA) || (RC)        # Combine RA/RC as 128-bit
 184     divisor[0:127] <- [0]*64 || (RB)       # Extend RB to 128-bit
 185     result <- dividend / divisor           # Division
 186     modulo <- dividend % divisor           # Modulo
 187     RT <- result[64:127]                   # Store result in RT
 188     RS <- modulo[64:127]                   # Modulo in RC, implicit
 189 else                                       # In case of error
 190     RT <- [1]*64                           # RT all 1's
 191     RS <- [0]*64                           # RS all 0's
 192 ```
 193
 194 Special registers altered:
 195
 196     None
 197
 198 The 128-bit dividend is (RA) || (RC). The 64-bit divisor is
 199 (RB). If the quotient can be represented in 64 bits, it is
 200 placed into register RT. The modulo is placed into register RS.
 201 RS is implictly defined as the same register as RC, similarly to maddedu.
 202
 203 The instruction is only defined where both conditions are true:
 204
 205 * (RA) < (RB) (unsigned comparison)
 206 * (RB) is NOT 0 (not divide-by-0)
 207
 208 If these conditions are not met, RT is set to all 1's, RS to all 0's.
 209
 210 Both operands, quotient, and modulo are interpreted as unsigned integers.
 211
 212
 213 Divide/Modulo Quad-Double Unsigned is another VA-Form instruction
 214 that is near-identical to `divdeu` except that:
 215
 216 * the lower 64 bits of the dividend, instead of being zero, contain a
 217   register, RC.
 218 * it performs a fused divide and modulo in a single instruction, storing
 219   the modulo in an implicit RS (similar to `maddedu`)
 220
 221 RB, the divisor, remains 64 bit.  The instruction is therefore a 128/64
 222 division, producing a (pair) of 64 bit result(s), in the same way that
 223 Intel [divq](https://www.felixcloutier.com/x86/div) works.
 224 Overflow conditions
 225 are detected in exactly the same fashion as `divdeu`, except that rather
 226 than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
 227 zeros on overflow.
 228
 229 *Programmer's note: there are no Rc variants of any of these VA-Form
 230 instructions. `cmpi` will need to be used to detect overflow conditions:
 231 the saving in instruction count is that both RT and RS will have already
 232 been set to useful values (all 1s and all zeros respectively)
 233 needed as part of implementing Knuth's Algorithm D*
 234
 235 For Scalar usage, just as for `maddedu`, `RS=RC`
 236 Examples:
 237
 238 ```
 239 # ((r0 << 64) + r2) / r1, store in r4
 240 # ((r0 << 64) + r2) % r1, store in r2
 241 divmod2du r4, r0, r1, r2
 242 ```
 243
 244 [[!tag opf_rfc]]
 245
 246 # Appendices
 247
 248     Appendix E Power ISA sorted by opcode
 249     Appendix F Power ISA sorted by version
 250     Appendix G Power ISA sorted by Compliancy Subset
 251     Appendix H Power ISA sorted by mnemonic
 252
 253 | Form | Book | Page | Version | mnemonic | Description |
 254 |------|------|------|---------|----------|-------------|
 255 | VA   | I    | #    | 3.0B    | maddedu  | Multiply-Add Extend Double Unsigned |
 256 | VA   | I    | #    | 3.0B    | divmod2du | Divide/Modulo Quad-Double Unsigned |
 257