# RFC ls003 Big Integer

**URLs**:

* <https://libre-soc.org/openpower/sv/biginteger/analysis/>
* <https://libre-soc.org/openpower/sv/rfc/ls003/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=960>
* <https://git.openpower.foundation/isa/PowerISA/issues/91>

**Severity**: Major

**Status**: New

**Date**: 20 Oct 2022

**Target**: v3.2B

**Source**: v3.0B

**Books and Section affected**: **UPDATE**

```
    Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic
```

**Summary**

Instructions added

```
    maddedu   - Multiply-Add Extended Double Unsigned
    divmod2du - Divide/Modulo Quad-Double Unsigned
    dsld      - Double Shift Left Doubleword
    dsrd      - Double Shift Right Doubleword
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
    Addition of four new GPR-based instructions
```

**Impact on software**:

```
    Requires support for new instructions in assembler, debuggers,
    and related tools.
```

**Keywords**:

```
    GPR, Big-integer, Double-word
```

**Motivation**

* `maddedu` is similar to `maddhdu` and `maddld`, but allows for a big-integer
  rolling-accumulation effect: `RC` effectively becomes a 64-bit carry in chains
  of highly-efficient loop-unrolled arbitrary-length big-integer operations.
* `divmod2du` is similar to `divdeu`, and has similar advantages to `maddedu`:
  the modulo result is available with the quotient in a single instruction,
  allowing highly-efficient arbitrary-length big-integer division.

**Notes and Observations**:

TODO: address Jacob's comments: <https://libre-soc.org/irclog/%23libre-soc.2022-10-28.log.html#t2022-10-28T18:00:27>

1. It is not practical to add Rc=1 variants as VA-Form is used and
   there is a **pair** of results produced.
2. An overflow variant (XER.OV set) of `divmod2du` would be valuable
   but VA-Form EXT004 is under severe pressure.
3. Both `maddhdu` and `divmod2du` instructions have been present in Intel x86
   for several decades. Likewise, variants of `dsld` and `dsrd`.
4. None of these instructions is present in VSX: these are 128/64 whereas
   VSX is 128/128.
5. `maddedu` and `divmod2du` are full inverses of each other, including
   when used for arbitrary-length big-integer arithmetic.
6. These are all 3-in 2-out instructions. If Power ISA did not already
   have LD/ST-with-update instructions and instructions with `RAp`
   and `RTp` then these instructions would not be proposed.

**Changes**

Add the following entries to:

* the Appendices of Book I
* Instructions of Book I added to Section 3.3.9.1
* VA2-Form of Book I Section 1.6.21.1 and 1.6.2

----------------

\newpage{}

# Multiply-Add Extended Double Unsigned

`maddedu RT, RA, RB, RC`

| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |

Pseudocode:

```
    prod[0:127] <- (RA) * (RB)     # Multiply RA and RB, result 128-bit
    sum[0:127] <- EXTZ(RC) + prod  # Zero extend RC, add product
    RT <- sum[64:127]              # Store low half in RT
    RS <- sum[0:63]                # RS implicit register, equal to RC
```

Special registers altered:

    None

The 64-bit operands are (RA), (RB), and (RC).
RC is zero-extended (not shifted, not sign-extended).
The 128-bit product of the operands (RA) and (RB) is added to (RC).
The low-order 64 bits of the 128-bit sum are
placed into register RT.
The high-order 64 bits of the 128-bit sum are
placed into register RS.
RS is implicitly defined as the same register as RC.

All three operands and the result are interpreted as
unsigned integers.

The difference from `maddhdu` is that `maddhdu` stores the upper
half in RT, whereas `maddedu` stores the upper half in RS
(and the lower half in RT).

The value stored in RT is exactly equivalent to `maddld` despite `maddld`
performing sign-extension on RC, because RT is the full mathematical result
modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
results modulo 2^64. This is why there is no `maddldu` instruction.
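
The modulo-2^64 argument above can be checked numerically. The following is a minimal Python sketch and not part of the RFC: the helper names `sign_extend64`, `low64_zero_ext` and `low64_sign_ext` are hypothetical, and the signed variant is an assumption that models `maddld` as a signed multiply followed by a sign-extended add.

```python
MASK64 = (1 << 64) - 1

def sign_extend64(x):
    """Interpret a 64-bit pattern as a signed two's-complement integer."""
    return x - (1 << 64) if x & (1 << 63) else x

def low64_zero_ext(ra, rb, rc):
    """Low 64 bits of the unsigned product plus zero-extended RC (maddedu RT)."""
    return (ra * rb + rc) & MASK64

def low64_sign_ext(ra, rb, rc):
    """Low 64 bits of the signed product plus sign-extended RC (assumed maddld)."""
    total = sign_extend64(ra) * sign_extend64(rb) + sign_extend64(rc)
    return total & MASK64  # Python's & on a negative int yields the value mod 2^64

# Corner cases: zero, one, all ones, sign bit only, and an arbitrary pattern.
samples = [0, 1, MASK64, 1 << 63, 0x123456789ABCDEF0]
for a in samples:
    for b in samples:
        for c in samples:
            assert low64_zero_ext(a, b, c) == low64_sign_ext(a, b, c)
```

Each signed interpretation differs from the unsigned one by a multiple of 2^64, so the products and sums agree modulo 2^64 and only the upper 64 bits differ.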

*Programmer's Note:
To achieve a big-integer rolling-accumulation effect,
assume that the scalar to multiply by is in r0, that r3 is
used (effectively) as a 64-bit carry, that
the vector to multiply starts at r4, and that the result vector
starts at r20. The instructions `maddedu r20,r4,r0,r3`,
`maddedu r21,r5,r0,r3` etc. may then be issued: the first `maddedu`
stores the upper half of the 128-bit result into r3, such
that it may be picked up by the second `maddedu`. Repeat inline
to construct a larger bigint scalar-vector multiply,
as Scalar GPR register file space permits. If register
spill is required then r3, as the effective 64-bit carry,
continues the chain.*

Examples:

```
    # (r0 * r1) + r2, store lower in r4, upper in r2
    maddedu r4, r0, r1, r2

    # Chaining together for larger bigint (see Programmer's Note above)
    # r3 starts with zero (no carry-in)
    maddedu r20,r4,r0,r3
    maddedu r21,r5,r0,r3
    maddedu r22,r6,r0,r3
```
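
The chaining described in the Programmer's Note can be modelled in software. The following is a minimal Python sketch of the pseudocode, not a reference implementation: the function names `maddedu` and `bigint_mul_scalar` and the little-endian limb list are illustrative assumptions, and registers are modelled as plain integers rather than a register file.

```python
MASK64 = (1 << 64) - 1

def maddedu(ra, rb, rc):
    """Return (RT, RS): low and high 64 bits of RA * RB + RC, all unsigned."""
    total = ra * rb + rc          # at most 2^128 - 2^64, so it fits in 128 bits
    return total & MASK64, total >> 64

def bigint_mul_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar.

    The high half (RS) of each step is fed back as RC of the next step,
    playing the role of r3 in the Programmer's Note.
    """
    carry = 0                     # r3 starts at zero (no carry-in)
    out = []
    for limb in limbs:
        rt, carry = maddedu(limb, scalar, carry)
        out.append(rt)
    return out, carry

# Usage: multiply a 192-bit value by a 64-bit scalar.
limbs = [MASK64, MASK64, 5]
product, carry_out = bigint_mul_scalar(limbs, MASK64)
```

The final `carry_out` is the most significant limb of the product, which corresponds to the value left in r3 after the last `maddedu` in the chain.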

----------

\newpage{}

# Divide/Modulo Quad-Double Unsigned

**Should name be Divide/Modulo Double Extended Unsigned?**
**Check the pseudo-code comments**

`divmod2du RT,RA,RB,RC`

| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |

Pseudo-code:

```
    if ((RA) <u (RB)) & ((RB) != [0]*64) then  # Check RA<RB (no overflow), RB != 0
        dividend[0:127] <- (RA) || (RC)        # Combine RA/RC as 128-bit
        divisor[0:127] <- [0]*64 || (RB)       # Extend RB to 128-bit
        result <- dividend / divisor           # Unsigned Division
        modulo <- dividend % divisor           # Unsigned Modulo
        RT <- result[64:127]                   # Store result in RT
        RS <- modulo[64:127]                   # Modulo in RS, implicitly RC
    else                                       # In case of error
        RT <- [1]*64                           # RT all 1's
        RS <- [0]*64                           # RS all 0's
```

Special registers altered:

    None

The 128-bit dividend is (RA) || (RC). The 64-bit divisor is
(RB). If the quotient can be represented in 64 bits, it is
placed into register RT. The modulo is placed into register RS.
RS is implicitly defined as the same register as RC, similarly to `maddedu`.

The quotient can be represented in 64 bits when both these conditions
are true:

* (RA) < (RB) (unsigned comparison)
* (RB) is NOT 0 (not divide-by-0)

If these conditions are not met, RT is set to all 1's, RS to all 0's.

All operands, quotient, and modulo are interpreted as unsigned integers.

Divide/Modulo Quad-Double Unsigned is a VA-Form instruction
that is near-identical to `divdeu` except that:

* the lower 64 bits of the dividend, instead of being zero, contain a
  register, RC.
* it performs a fused divide and modulo in a single instruction, storing
  the modulo in an implicit RS (similar to `maddedu`).
* there is no `UNDEFINED` behaviour.

RB, the divisor, remains 64-bit. The instruction is therefore a 128/64
division, producing a pair of 64-bit results, in the same way that
Intel [divq](https://www.felixcloutier.com/x86/div) works.
Overflow conditions
are detected in exactly the same fashion as `divdeu`, except that rather
than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
zeros on overflow.

*Programmer's note: there are no Rc variants of any of these VA-Form
instructions. `cmpi` will need to be used to detect overflow conditions:
the saving in instruction count is that both RT and RS will have already
been set to useful values (all 1s and all zeros respectively)
needed as part of implementing Knuth's Algorithm D.*

For Scalar usage, just as for `maddedu`, `RS=RC`.

Examples:

```
    # ((r0 << 64) + r2) / r1, store in r4
    # ((r0 << 64) + r2) % r1, store in r2
    divmod2du r4, r0, r1, r2
```
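
The pseudo-code above, and its use for arbitrary-length division by a 64-bit divisor, can be sketched in Python. This is an illustrative model under stated assumptions, not a reference implementation: the names `divmod2du` and `bigint_div_scalar` and the little-endian limb list are hypothetical, and values are passed directly rather than through registers, so the implicit `RS=RC` register allocation is not modelled.

```python
MASK64 = (1 << 64) - 1

def divmod2du(ra, rb, rc):
    """Return (RT, RS): quotient and modulo of (RA || RC) / RB, all unsigned."""
    if ra < rb and rb != 0:
        dividend = (ra << 64) | rc
        return dividend // rb, dividend % rb
    return MASK64, 0              # overflow or divide-by-zero: all 1's, all 0's

def bigint_div_scalar(limbs, divisor):
    """Divide a little-endian list of 64-bit limbs by a non-zero 64-bit divisor.

    Works from the most significant limb down. The running remainder is the
    high half of each dividend, and is always less than the divisor, so the
    overflow case never triggers inside the loop.
    """
    remainder = 0
    quotient = [0] * len(limbs)
    for i in reversed(range(len(limbs))):
        quotient[i], remainder = divmod2du(remainder, divisor, limbs[i])
    return quotient, remainder

# Usage: the quotient and modulo recombine into the original dividend,
# which is the computation maddedu performs (RT * RB + RS).
hi, lo, d = 3, 0x0123456789ABCDEF, 1000003
q, r = divmod2du(hi, d, lo)
assert q * d + r == (hi << 64) | lo
```

The final assertion illustrates the sense in which the two instructions are inverses: multiplying the quotient by the divisor and adding the modulo reproduces the 128-bit dividend.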

# Dynamic-Shift Left Doubleword

`dsld RT,RA,RB,RC`

| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form     |
|-------|------|-------|-------|-------|-------|----|----------|
| EXT04 | RT   | RA    | RB    | RC    | XO    | Rc | VA2-Form |

Pseudo-code:

    n <- (RB)[58:63]                         # Take lower 6-bits of RB for shift
    v <- ROTL64((RA), n)                     # Rotate RA 64-bit left by n bits
    mask <- MASK(64, 63-n)                   # 1's mask, set mask[64-n:63] to 0's
    RT <- (v[0:63] & mask) | ((RC) & ¬mask)  # Mask out rotated-in bits, merge in RC
    RS <- v[0:63] & ¬mask                    # Bits shifted out of RA, placed in RS
    overflow = 0                             # Clear overflow flag
    if RS != [0]*64:                         # Check if RS is NOT zero
        overflow = 1                         # Set the overflow flag

Special Registers Altered:

    CR0                    (if Rc=1)

**CHECK if overflow flag is the expected behaviour**

The contents of register RA are shifted left the number
of bits specified by (RB)[58:63].

**Please check if this is correct!!! This condition is taken
from PowerISA spec page 253, definition of MASK128(x,y)!!!**

A mask is generated having 0-bits from bit (64-n) through
bit 63 and 1-bits elsewhere.

The rotated data is ANDed with the generated mask, and ORed
with the contents of RC ANDed with the inverted mask.
The result is placed into register RT.

Additionally, the rotated data is ANDed with the inverted mask and
placed into register RS. If the value in RS is not all 0's, the
overflow flag is raised.

Similarly to `maddedu` and `divmod2du`, `dsld` can be chained (using RC).
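
The rotate-and-mask steps and the chaining through RC can be sketched in Python. This is a minimal model of the pseudo-code as written, under stated assumptions: the names `rotl64`, `dsld` and `bigint_shl` are hypothetical, the mask is built directly from the prose description (0-bits in the n least-significant positions), and the chain assumes that RS is fed back as RC of the next step, as for `maddedu`.

```python
MASK64 = (1 << 64) - 1

def rotl64(x, n):
    """Rotate a 64-bit value left by n bits (n taken modulo 64)."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64

def dsld(ra, rb, rc):
    """Return (RT, RS, overflow) following the dsld pseudo-code."""
    n = rb & 0x3F                 # (RB)[58:63]: the low 6 bits of RB
    v = rotl64(ra, n)
    low = (1 << n) - 1            # inverted mask: the n least-significant bits
    mask = MASK64 ^ low           # 0-bits from bit 64-n through 63 (MSB0 numbering)
    rt = (v & mask) | (rc & low)  # shifted RA, low n bits filled from RC
    rs = v & low                  # the n bits shifted out of the top of RA
    return rt, rs, int(rs != 0)

def bigint_shl(limbs, n):
    """Shift a little-endian list of 64-bit limbs left by n bits (0 <= n < 64)."""
    carry = 0
    out = []
    for limb in limbs:            # least significant limb first
        rt, carry, _ = dsld(limb, n, carry)
        out.append(rt)
    return out, carry             # carry holds the bits shifted out of the top limb

# Usage: shift a 128-bit value left by 12 bits.
shifted, spill = bigint_shl([0x0123456789ABCDEF, MASK64], 12)
```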

# Dynamic-Shift Right Doubleword

`dsrd RT,RA,RB,RC`

| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form     |
|-------|------|-------|-------|-------|-------|----|----------|
| EXT04 | RT   | RA    | RB    | RC    | XO    | Rc | VA2-Form |

Pseudo-code:

    n <- (RB)[58:63]                         # Take lower 6-bits of RB for shift
    v <- ROTL64((RA), 64-n)                  # Rotate RA 64-bit left by 64-n bits
    mask <- MASK(n, 63)                      # 0's mask, set mask[n:63] to 1's
    RT <- (v[0:63] & mask) | ((RC) & ¬mask)  # Mask out rotated-in bits, merge in RC
    RS <- v[0:63] & ¬mask                    # Bits shifted out of RA, placed in RS
    overflow = 0                             # Clear overflow flag
    if RS != [0]*64:                         # Check if RS is NOT zero
        overflow = 1                         # Set the overflow flag

Special Registers Altered:

    CR0                    (if Rc=1)
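
The right-shift pseudo-code mirrors `dsld` and can be sketched in Python in the same way. This is an illustrative model only: the names `rotl64`, `dsrd` and `bigint_shr` are hypothetical, `MASK(n, 63)` is assumed to mean 1-bits from bit n through bit 63 in MSB0 numbering, and the chain assumes that RS is fed back as RC of the next step, working from the most significant limb down.

```python
MASK64 = (1 << 64) - 1

def rotl64(x, n):
    """Rotate a 64-bit value left by n bits (n taken modulo 64)."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64

def dsrd(ra, rb, rc):
    """Return (RT, RS, overflow) following the dsrd pseudo-code."""
    n = rb & 0x3F                  # (RB)[58:63]: the low 6 bits of RB
    v = rotl64(ra, 64 - n)         # equivalent to rotating right by n
    mask = MASK64 >> n             # 1-bits from bit n through 63 (MSB0 numbering)
    high = MASK64 ^ mask           # inverted mask: the n most-significant bits
    rt = (v & mask) | (rc & high)  # shifted RA, top n bits filled from RC
    rs = v & high                  # the n bits shifted out of the bottom of RA
    return rt, rs, int(rs != 0)

def bigint_shr(limbs, n):
    """Shift a little-endian list of 64-bit limbs right by n bits (0 <= n < 64)."""
    carry = 0
    out = [0] * len(limbs)
    for i in reversed(range(len(limbs))):   # most significant limb first
        out[i], carry, _ = dsrd(limbs[i], n, carry)
    return out, carry              # carry holds the shifted-out bits, top-aligned

# Usage: shift a 128-bit value right by 12 bits.
shifted, spill = bigint_shr([0x0123456789ABCDEF, MASK64], 12)
```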


\newpage{}

# VA2-Form

Add the following to Book I, 1.6.21.1, VA2-Form

```
    |0     |6     |11    |16    |21    |26    |31  |
    | PO   | RT   | RA   | RB   | RC   | XO   | Rc |
```

Add `VA2` to the Formats list of the `RA` field in Book I, 1.6.2

```
    RA (11:15)
        Field used to specify a GPR to be used as a
        source or as a target.
        Formats: A, BM2, D, DQ, DQE, DS, M, MD, MDS, TX, VA, VA2,
        VX, X, XO, XS, SVL, XB, TLI, Z23
```

*TODO* other fields `RT, RB, RC, XO, and Rc`, see
<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/fields.text;hb=HEAD>

# Appendices

    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

| Form | Book | Page | Version | mnemonic  | Description                           |
|------|------|------|---------|-----------|---------------------------------------|
| VA   | I    | #    | 3.0B    | maddedu   | Multiply-Add Extended Double Unsigned |
| VA   | I    | #    | 3.0B    | divmod2du | Divide/Modulo Quad-Double Unsigned    |

----------------

[[!tag opf_rfc]]