Fix MSB0 issues in the pseudo-code for augmented register numbering
[libreriscv.git] / openpower / sv / svp64 / appendix.mdwn
1 # Appendix
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
5
6 This is the appendix to [[sv/svp64]], providing explanations of modes
7 etc. leaving the main svp64 page's primary purpose as outlining the
8 instruction format.
9
10 Table of contents:
11
12 [[!toc]]
13
14 # XER, SO and other global flags
15
16 Vector systems are expected to be high performance. This is achieved
17 through parallelism, which requires that elements in the vector be
18 independent. XER SO and other global "accumulation" flags (CR.OV) cause
19 Read-Write Hazards on single-bit global resources, having a significant
20 detrimental effect.
21
22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
23 in `cmp` instructions). XER is simply neither read nor written.
24 This includes when `scalar identity behaviour` occurs. If precise
25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
26 instructions should be used without an SV Prefix.
27
28 An interesting side-effect of this decision is that the OE flag is now
29 free for other uses when SV Prefixing is used.
30
31 Regarding XER.CA: this does not fit either: it was designed for a scalar
32 ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
33 Vector element. This provides a means to perform large parallel batches
34 of Vectorised carry-capable additions. crweird instructions can be used
35 to transfer the CRs in and out of an integer, where bitmanipulation
36 may be performed to analyse the carry bits (including carry lookahead
37 propagation) before continuing with further parallel additions.
38
39 # v3.0B/v3.1B relevant instructions
40
41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
42 CPU ISA.
43
44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
45 alternative uses. Additionally, Vectorisation of the VSX SIMD system
46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
47 at the very minimum, predication (which VSX was designed without).
48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
49 illegal instruction exceptions in SV Prefix Mode.
50
51 Likewise, `lq` (Load Quad), and Load/Store Multiple make no sense to
52 have because they are not only provided by SV, the SV alternatives may
53 be predicated as well, making them far better suited to use in function
54 calls and context-switching.
55
56 Additionally, some v3.0/1 instructions simply make no sense at all in a
57 Vector context: `twi` and `tdi` fall into this category, as do branch
58 operations as well as `sc` and `scv`. Here there is simply no point
59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
60 should be called instead.
61
62 Fortuitously this leaves several Major Opcodes free for use by SV
63 to fit alternative future instructions. In a 3D context this means
64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
65 operations, and others critical to an efficient, effective 3D GPU and
66 VPU ISA. With such instructions being included as standard in other
67 commercially-successful GPU ISAs it is likewise critical that a 3D
68 GPU/VPU based on svp64 also have such instructions.
69
70 Note however that svp64 is stand-alone and is in no way
71 critically dependent on the existence or provision of 3D GPU or VPU
72 instructions. These should be considered extensions, and their discussion
73 and specification is out of scope for this document.
74
75 Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
76 v3.1B is *not* altered by svp64 in any way.
77
78 ## Major opcode map (v3.0B)
79
80 This table is taken from v3.0B.
81 Table 9: Primary Opcode Map (opcode bits 0:5)
82
83 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
84 000 | | | tdi | twi | EXT04 | | | mulli | 000
85 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
86 010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm | | rlwnm | 010
87 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
88 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
89 101 | lhz | lhzu | lha | lhau | sth | sthu | lmw | stmw | 101
90 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
91 111 | lq | EXT57 | EXT58 | EXT59 | EXT60 | EXT61 | EXT62 | EXT63 | 111
92 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
93
94 ## Suitable for svp64
95
96 This is the same table containing v3.0B Primary Opcodes except those that
97 make no sense in a Vectorisation Context have been removed. These removed
98 POs can, *in the SV Vector Context only*, be assigned to alternative
99 (Vectorised-only) instructions, including future extensions.
100
101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
102 change. When not prefixed with svp64 these opcodes **specifically**
103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
104
105 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
106 000 | | | | | | | | mulli | 000
107 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
108 010 | | | | EXT19 | rlwimi| rlwinm | | rlwnm | 010
109 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
110 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
111 101 | lhz | lhzu | lha | lhau | sth | sthu | | | 101
112 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
113 111 | | | EXT58 | EXT59 | | EXT61 | | EXT63 | 111
114 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
115
116 # Twin Predication
117
118 This is a novel concept that allows predication to be applied to a single
119 source and a single dest register. The following types of traditional
120 Vector operations may be encoded with it, *without requiring explicit
121 opcodes to do so*
122
123 * VSPLAT (a single scalar distributed across a vector)
124 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
125 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
126 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
127 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
128
129 Those patterns (and more) may be applied to:
130
131 * mv (the usual way that V\* ISA operations are created)
132 * exts\* sign-extension
133 * rwlinm and other RS-RA shift operations (**note**: excluding
134 those that take RA as both a src and dest. These are not
135 1-src 1-dest, they are 2-src, 1-dest)
136 * LD and ST (treating AGEN as one source)
137 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
138 * Condition Register ops mfcr, mtcr and other similar
139
140 This is a huge list that creates extremely powerful combinations,
141 particularly given that one of the predicate options is `(1<<r3)`
142
143 Additional unusual capabilities of Twin Predication include a back-to-back
144 version of VCOMPRESS-VEXPAND which is effectively the ability to do
145 sequentially ordered multiple VINSERTs. The source predicate selects a
146 sequentially ordered subset of elements to be inserted; the destination
147 predicate specifies the sequentially ordered recipient locations.
148 This is equivalent to
149 `llvm.masked.compressstore.*`
150 followed by
151 `llvm.masked.expandload.*`
152
153
154 # Rounding, clamp and saturate
155
156 see [[av_opcodes]].
157
158 To help ensure that audio quality is not compromised by overflow,
159 "saturation" is provided, as well as a way to detect when saturation
160 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
161 one CR per element in the result (Note: this is different from VSX which
162 has a single CR per block).
163
164 When N=0 the result is saturated to within the maximum range of an
165 unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
166 logic applies to FP operations, with the result being saturated to
167 maximum rather than returning INF, and the minimum to +0.0
168
169 When N=1 the same occurs except that the result is saturated to the min
170 or max of a signed result, and for FP to the min and max value rather
171 than returning +/- INF.
172
173 When Rc=1, the CR "overflow" bit is set on the CR associated with the
174 element, to indicate whether saturation occurred. Note that due to
175 the hugely detrimental effect it has on parallel processing, XER.SO is
176 **ignored** completely and is **not** brought into play here. The CR
177 overflow bit is therefore simply set to zero if saturation did not occur,
178 and to one if it did.
179
180 Note also that saturate on operations that produce a carry output are
181 prohibited due to the conflicting use of the CR.so bit for storing if
182 saturation occurred.
183
184 Post-analysis of the Vector of CRs to find out if any given element hit
185 saturation may be done using a mapreduced CR op (cror), or by using the
186 new crweird instruction, transferring the relevant CR bits to a scalar
187 integer and testing it for nonzero. see [[sv/cr_int_predication]]
188
189 Note that the operation takes place at the maximum bitwidth (max of
190 src and dest elwidth) and that truncation occurs to the range of the
191 dest elwidth.
192
193 # Reduce mode
194
195 There are two variants here. The first is when the destination is scalar
196 and at least one of the sources is Vector. The second is more complex
197 and involves map-reduction on vectors.
198
199 The first defining characteristic distinguishing Scalar-dest reduce mode
200 from Vector reduce mode is that Scalar-dest reduce issues VL element
201 operations, whereas Vector reduce mode performs an actual map-reduce
202 (tree reduction): typically `O(VL log VL)` actual computations.
203
204 The second defining characteristic of scalar-dest reduce mode is that it
205 is, in simplistic and shallow terms *serial and sequential in nature*,
206 whereas the Vector reduce mode is definitely inherently paralleliseable.
207
208 The reason why scalar-dest reduce mode is "simplistically" serial and
209 sequential is that in certain circumstances (such as an `OR` operation
210 or a MIN/MAX operation) it may be possible to parallelise the reduction.
211
212 ## Scalar result reduce mode
213
214 In this mode, one register is identified as being the "accumulator".
215 Scalar reduction is thus categorised by:
216
217 * One of the sources is a Vector
218 * the destination is a scalar
219 * optionally but most usefully when one source register is also the destination
220 * That the source register type is the same as the destination register
221 type identified as the "accumulator". scalar reduction on `cmp`,
222 `setb` or `isel` is not possible for example because of the mixture
223 between CRs and GPRs.
224
225 Typical applications include simple operations such as `ADD r3, r10.v,
226 r3` where, clearly, r3 is being used to accumulate the addition of all
227 elements is the vector starting at r10.
228
229 # add RT, RA,RB but when RT==RA
230 for i in range(VL):
231 iregs[RA] += iregs[RB+i] # RT==RA
232
233 However, *unless* the operation is marked as "mapreduce", SV ordinarily
234 **terminates** at the first scalar operation. Only by marking the
235 operation as "mapreduce" will it continue to issue multiple sub-looped
236 (element) instructions in `Program Order`.
237
238 Other examples include shift-mask operations where a Vector of inserts
239 into a single destination register is required, as a way to construct
240 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
241 Using the same register as both the source and destination, with Vectors
242 of different offsets masks and values to be inserted has multiple
243 applications including Video, cryptography and JIT compilation.
244
245 Subtract and Divide are still permitted to be executed in this mode,
246 although from an algorithmic perspective it is strongly discouraged.
247 It would be better to use addition followed by one final subtract,
248 or in the case of divide, to get better accuracy, to perform a multiply
249 cascade followed by a final divide.
250
251 Note that single-operand or three-operand scalar-dest reduce is perfectly
252 well permitted: both still meet the qualifying characteristics that one
253 source operand can also be the destination, which allows the "accumulator"
254 to be identified.
255
256 ## Vector result reduce mode
257
258 1. limited to single predicated dual src operations (add RT, RA, RB).
259 triple source operations are prohibited (fma).
260 2. limited to operations that make sense. divide is excluded, as is
261 subtract (X - Y - Z produces different answers depending on the order)
262 and asymmetric CRops (crandc, crorc). sane operations:
263 multiply, min/max, add, logical bitwise OR, most other CR ops.
264 operations that do have the same source and dest register type are
265 also excluded (isel, cmp). operations involving carry or overflow
266 (XER.CA / OV) are also prohibited.
267 3. the destination is a vector but the result is stored, ultimately,
268 in the first nonzero predicated element. all other nonzero predicated
269 elements are undefined. *this includes the CR vector* when Rc=1
270 4. implementations may use any ordering and any algorithm to reduce
271 down to a single result. However it must be equivalent to a straight
272 application of mapreduce. The destination vector (except masked out
273 elements) may be used for storing any intermediate results. these may
274 be left in the vector (undefined).
275 5. CRM applies when Rc=1. When CRM is zero, the CR associated with
276 the result is regarded as a "some results met standard CR result
277 criteria". When CRM is one, this changes to "all results met standard
278 CR criteria".
279 6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
280 in order to store sufficient state to resume operation should an
281 interrupt occur. this is also why implementations are permitted to use
282 the destination vector to store intermediary computations
283 7. *Predication may be applied*. zeroing mode is not an option. masked-out
284 inputs are ignored; masked-out elements in the destination vector are
285 unaltered (not used for the purposes of intermediary storage); the
286 scalar result is placed in the first available unmasked element.
287
288 Pseudocode for the case where RA==RB:
289
290 result = op(iregs[RA], iregs[RA+1])
291 CR = analyse(result)
292 for i in range(2, VL):
293 result = op(result, iregs[RA+i])
294 CRnew = analyse(result)
295 if Rc=1
296 if CRM:
297 CR = CR bitwise or CRnew
298 else:
299 CR = CR bitwise AND CRnew
300
301 TODO: case where RA!=RB which involves first a vector of 2-operand
302 results followed by a mapreduce on the intermediates.
303
304 Note that when SVM is clear and SUBVL!=1 the sub-elements are
305 *independent*, i.e. they are mapreduced per *sub-element* as a result.
306 illustration with a vec2:
307
308 result.x = op(iregs[RA].x, iregs[RA+1].x)
309 result.y = op(iregs[RA].y, iregs[RA+1].y)
310 for i in range(2, VL):
311 result.x = op(result.x, iregs[RA+i].x)
312 result.y = op(result.y, iregs[RA+i].y)
313
314 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
315
316 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
317 subvector mode. Example for a vec3:
318
319 for i in range(VL):
320 result = op(iregs[RA+i].x, iregs[RA+i].x)
321 result = op(result, iregs[RA+i].y)
322 result = op(result, iregs[RA+i].z)
323 iregs[RT+i] = result
324
325 In this mode, when Rc=1 the Vector of CRs is as normal: each result
326 element creates a corresponding CR element.
327
328 # Fail-on-first
329
330 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
331 the other for arithmetic operations (actually, CR-driven). Note in each
332 case the assumption is that vector elements are required appear to be
333 executed in sequential Program Order, element 0 being the first.
334
335 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
336 ordinary one. Exceptions occur "as normal". However for elements 1
337 and above, if an exception would occur, then VL is **truncated** to the
338 previous element.
339 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
340 CR-creating operation produces a result (including cmp). Similar to
341 branch, an analysis of the CR is performed and if the test fails, the
342 vector operation terminates and discards all element operations at and
343 above the current one, and VL is truncated to the *previous* element.
344 Thus the new VL comprises a contiguous vector of results, all of which
345 pass the testing criteria (equal to zero, less than zero).
346
347 The CR-based data-driven fail-on-first is new and not found in ARM
348 SVE or RVV. It is extremely useful for reducing instruction count,
349 however requires speculative execution involving modifications of VL
350 to get high performance implementations. An additional mode (RC1=1)
351 effectively turns what would otherwise be an arithmetic operation
352 into a type of `cmp`. The CR is stored (and the CR.eq bit tested).
353 If the CR.eq bit fails then the Vector is truncated and the loop ends.
354 Note that when RC1=1 the result elements arw never stored, only the CRs.
355
356 In CR-based data-driven fail-on-first there is only the option to select
357 and test one bit of each CR (just as with branch BO). For more complex
358 tests this may be insufficient. If that is the case, a vectorised crops
359 (crand, cror) may be used, and ffirst applied to the crop instead of to
360 the arithmetic vector.
361
362 One extremely important aspect of ffirst is:
363
364 * LDST ffirst may never set VL equal to zero. This because on the first
365 element an exception must be raised "as normal".
366 * CR-based data-dependent ffirst on the other hand **can** set VL equal
367 to zero. This is the only means in the entirety of SV that VL may be set
368 to zero (with the exception of via the SV.STATE SPR). When VL is set
369 zero due to the first element failing the CR bit-test, all subsequent
370 vectorised operations are effectively `nops` which is
371 *precisely the desired and intended behaviour*.
372
373 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
374 to a nonzero value for any implementation-specific reason. For example:
375 it is perfectly reasonable for implementations to alter VL when ffirst
376 LD or ST operations are initiated on a nonaligned boundary, such that
377 within a loop the subsequent iteration of that loop begins subsequent
378 ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
379 workloads or balance resources.
380
381 CR-based data-dependent first on the other hand MUST not truncate VL
382 arbitrarily. This because it is a precise test on which algorithms
383 will rely.
384
385 # pred-result mode
386
387 This mode merges common CR testing with predication, saving on instruction
388 count. Below is the pseudocode excluding predicate zeroing and elwidth
389 overrides.
390
391 for i in range(VL):
392 # predication test, skip all masked out elements.
393 if predicate_masked_out(i):
394 continue
395 result = op(iregs[RA+i], iregs[RB+i])
396 CRnew = analyse(result) # calculates eq/lt/gt
397 # Rc=1 always stores the CR
398 if Rc=1 or RC1:
399 crregs[offs+i] = CRnew
400 # now test CR, similar to branch
401 if RC1 or CRnew[BO[0:1]] != BO[2]:
402 continue # test failed: cancel store
403 # result optionally stored but CR always is
404 iregs[RT+i] = result
405
406 The reason for allowing the CR element to be stored is so that
407 post-analysis of the CR Vector may be carried out. For example:
408 Saturation may have occurred (and been prevented from updating, by the
409 test) but it is desirable to know *which* elements fail saturation.
410
411 Note that RC1 Mode basically turns all operations into `cmp`. The
412 calculation is performed but it is only the CR that is written. The
413 element result is *always* discarded, never written (just like `cmp`).
414
415 Note that predication is still respected: predicate zeroing is slightly
416 different: elements that fail the CR test *or* are masked out are zero'd.
417
418 ## pred-result mode on CR ops
419
420 Yes, really: CR operations (mtcr, crand, cror) may be Vectorised,
421 predicated, and also pred-result mode applied to it. In this case,
422 the Vectorisation applies to the batch of 4 bits, i.e. it is not the CR
423 individual bits that are treated as the Vector, but the CRs themselves
424 (CR0, CR8, CR9...)
425
426 Thus after each Vectorised operation (crand) a test of the CR result
427 can in fact be performed.
428
429 # CR Operations
430
431 CRs are slightly more involved than INT or FP registers due to the
432 possibility for indexing individual bits (crops BA/BB/BT). Again however
433 the access pattern needs to be understandable in relation to v3.0B / v3.1B
434 numbering, with a clear linear relationship and mapping existing when
435 SV is applied.
436
437 ## CR EXTRA mapping table and algorithm
438
439 Numbering relationships for CR fields are already complex due to being
440 in BE format (*the relationship is not clearly explained in the v3.0B
441 or v3.1B specification*). However with some care and consideration
442 the exact same mapping used for INT and FP regfiles may be applied,
443 just to the upper bits, as explained below.
444
445 In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits. The top 3 bits (0:2)
446 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
447 *in* that CR. The numbering was determined (after 4 months of
448 analysis and research) to be as follows:
449
450 CR_index = 7-(BA>>2) # top 3 bits but BE
451 bit_index = 3-(BA & 0b11) # low 2 bits but BE
452 CR_reg = CR{CR_index} # get the CR
453 # finally get the bit from the CR.
454 CR_bit = (CR_reg & (1<<bit_index)) != 0
455
456 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
457 applies, **not** the CR\_bit portion (bits 3:4):
458
459 if extra3_mode:
460 spec = EXTRA3
461 else:
462 spec = EXTRA2<<1 | 0b0
463 if spec[0]:
464 # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
465 return ((BA >> 2)<<6) | # hi 3 bits shifted up
466 (spec[1:2]<<4) | # to make room for these
467 (BA & 0b11) # CR_bit on the end
468 else:
469 # scalar constructs "00 spec[1:2] BA[0:4]"
470 return (spec[1:2] << 5) | BA
471
472 Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
473 algorithm to determin CR\_reg is modified to as follows:
474
475 CR_index = 7-(BA>>2) # top 3 bits but BE
476 if spec[0]:
477 # vector mode, 0-124 increments of 4
478 CR_index = (CR_index<<4) | (spec[1:2] << 2)
479 else:
480 # scalar mode, 0-32 increments of 1
481 CR_index = (spec[1:2]<<3) | CR_index
482 # same as for v3.0/v3.1 from this point onwards
483 bit_index = 3-(BA & 0b11) # low 2 bits but BE
484 CR_reg = CR{CR_index} # get the CR
485 # finally get the bit from the CR.
486 CR_bit = (CR_reg & (1<<bit_index)) != 0
487
488 Note here that the decoding pattern to determine CR\_bit does not change.
489
490 Note: high-performance implementations may read/write Vectors of CRs in
491 batches of aligned 32-bit chunks (CR0-7, CR7-15). This is to greatly
492 simplify internal design. If instructions are issued where CR Vectors
493 do not start on a 32-bit aligned boundary, performance may be affected.
494
495 ## CR fields as inputs/outputs of vector operations
496
497 CRs (or, the arithmetic operations associated with them)
498 may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
499
500 When vectorized, the CR inputs/outputs are sequentially read/written
501 to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
502 writing to CR8 (TBD evaluate) and increase sequentially from there.
503 This is so that:
504
505 * implementations may rely on the Vector CRs being aligned to 8. This
506 means that CRs may be read or written in aligned batches of 32 bits
507 (8 CRs per batch), for high performance implementations.
508 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
509 overwritten by vector Rc=1 operations except for very large VL
510 * CR-based predication, from CR32, is also not interfered with
511 (except by large VL).
512
513 However when the SV result (destination) is marked as a scalar by the
514 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
515 CR when Rc=1 is written to. This is CR0 for integer operations and CR1
516 for FP operations.
517
518 Note that yes, the CRs are genuinely Vectorised. Unlike in SIMD VSX which
519 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
520 v3.0B scalar operations produce a **tuple** of element results: the
521 result of the operation as one part of that element *and a corresponding
522 CR element*. Greatly simplified pseudocode:
523
524 for i in range(VL):
525 # calculate the vector result of an add iregs[RT+i] = iregs[RA+i]
526 + iregs[RB+i] # now calculate CR bits CRs{8+i}.eq = iregs[RT+i]
527 == 0 CRs{8+i}.gt = iregs[RT+i] > 0 ... etc
528
529 If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
530 then a followup instruction must be performed, setting "reduce" mode on
531 the Vector of CRs, using cr ops (crand, crnor)to do so. This provides far
532 more flexibility in analysing vectors than standard Vector ISAs. Normal
533 Vector ISAs are typically restricted to "were all results nonzero" and
534 "were some results nonzero". The application of mapreduce to Vectorised
535 cr operations allows far more sophisticated analysis, particularly in
536 conjunction with the new crweird operations see [[sv/cr_int_predication]].
537
538 Note in particular that the use of a separate instruction in this way
539 ensures that high performance multi-issue OoO inplementations do not
540 have the computation of the cumulative analysis CR as a bottleneck and
541 hindrance, regardless of the length of VL.
542
543 (see [[discussion]]. some alternative schemes are described there)
544
545 ## Rc=1 when SUBVL!=1
546
547 sub-vectors are effectively a form of SIMD (length 2 to 4). Only 1 bit of
548 predicate is allocated per subvector; likewise only one CR is allocated
549 per subvector.
550
551 This leaves a conundrum as to how to apply CR computation per subvector,
552 when normally Rc=1 is exclusively applied to scalar elements. A solution
553 is to perform a bitwise OR or AND of the subvector tests. Given that
554 OE is ignored, rhis field may (when available) be used to select OR or
555 AND behavior.
556
557 ### Table of CR fields
558
559 CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
560 so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
561
562 CRs are not stored in SPRs: they are registers in their own right.
563 Therefore context-switching the full set of CRs involves a Vectorised
564 mfcr or mtcr, using VL=64, elwidth=8 to do so. This is exactly as how
565 scalar OpenPOWER context-switches CRs: it is just that there are now
566 more of them.
567
568 The 64 SV CRs are arranged similarly to the way the 128 integer registers
569 are arranged. TODO a python program that auto-generates a CSV file
570 which can be included in a table, which is in a new page (so as not to
571 overwhelm this one). [[svp64/cr_names]]
572
573 # Register Profiles
574
575 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
576 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
577
578 Instructions are broken down by Register Profiles as listed in the
579 following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
580 indicates that the operations with this Register Profile cannot be
581 Vectorised (mtspr, bc, dcbz, twi)
582
583 TODO generate table which will be here [[svp64/reg_profiles]]
584
585 # SV pseudocode illilustration
586
587 ## Single-predicated Instruction
588
589 illustration of normal mode add operation: zeroing not included, elwidth
590 overrides not included. if there is no predicate, it is set to all 1s
591
592 function op_add(rd, rs1, rs2) # add not VADD!
593 int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd);
594 for (i = 0; i < VL; i++)
595 STATE.srcoffs = i # save context if (predval & 1<<i) # predication
596 uses intregs
597 ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2]; if (!int_vec[rd
598 ].isvec) break;
599 if (rd.isvec) { id += 1; } if (rs1.isvec) { irs1 += 1; } if
600 (rs2.isvec) { irs2 += 1; } if (id == VL or irs1 == VL or irs2 ==
601 VL) {
602 # end VL hardware loop STATE.srcoffs = 0; # reset return;
603 }
604
605 This has several modes:
606
607 * RT.v = RA.v RB.v * RT.v = RA.v RB.s (and RA.s RB.v) * RT.v = RA.s RB.s *
608 RT.s = RA.v RB.v * RT.s = RA.v RB.s (and RA.s RB.v) * RT.s = RA.s RB.s
609
610 All of these may be predicated. Vector-Vector is straightfoward.
611 When one of source is a Vector and the other a Scalar, it is clear that
612 each element of the Vector source should be added to the Scalar source,
613 each result placed into the Vector (or, if the destination is a scalar,
614 only the first nonpredicated result).
615
616 The one that is not obvious is RT=vector but both RA/RB=scalar.
617 Here this acts as a "splat scalar result", copying the same result into
618 all nonpredicated result elements. If a fixed destination scalar was
619 intended, then an all-Scalar operation should be used.
620
621 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
622
623 # Assembly Annotation
624
625 Assembly code annotation is required for SV to be able to successfully
626 mark instructions as "prefixed".
627
628 A reasonable (prototype) starting point:
629
630 svp64 [field=value]*
631
632 Fields:
633
634 * ew=8/16/32 - element width
635 * sew=8/16/32 - source element width
636 * vec=2/3/4 - SUBVL
637 * mode=reduce/satu/sats/crpred
638 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
639 * spred={reg spec}
640
641 similar to x86 "rex" prefix.
642
643 For actual assembler:
644
645 sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
646
647 Qualifiers:
648
649 * m={pred}: predicate mask mode
650 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
651 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
652 * ew={N}: ew=8/16/32 - sets elwidth override
653 * sw={N}: sw=8/16/32 - sets source elwidth override
654 * ff={xx}: see fail-first mode
655 * pr={xx}: see predicate-result mode
656 * sat{x}: satu / sats - see saturation mode
657 * mr: see map-reduce mode
658 * mr.svm see map-reduce with sub-vector mode
659 * crm: see map-reduce CR mode
660 * crm.svm see map-reduce CR with sub-vector mode
661 * sz: predication with source-zeroing
662 * dz: predication with dest-zeroing
663
664 For modes:
665
666 * pred-result:
667 - pm=lt/gt/le/ge/eq/ne/so/ns OR
668 - pm=RC1 OR pm=~RC1
669 * fail-first
670 - ff=lt/gt/le/ge/eq/ne/so/ns OR
671 - ff=RC1 OR ff=~RC1
672 * saturation:
673 - sats
674 - satu
675 * map-reduce:
676 - mr OR crm: "normal" map-reduce mode or CR-mode.
677 - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
678