f6b221f4dc8b580804f2215268bd663076e65c35
[libreriscv.git] / openpower / sv / svp64 / appendix.mdwn
1 # Appendix
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
5
6 This is the appendix to [[sv/svp64]], providing explanations of modes
7 etc. leaving the main svp64 page's primary purpose as outlining the
8 instruction format.
9
10 Table of contents:
11
12 [[!toc]]
13
14 # XER, SO and other global flags
15
16 Vector systems are expected to be high performance. This is achieved
17 through parallelism, which requires that elements in the vector be
18 independent. XER SO and other global "accumulation" flags (CR.OV) cause
19 Read-Write Hazards on single-bit global resources, having a significant
20 detrimental effect.
21
22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
23 in `cmp` instructions). XER is simply neither read nor written.
24 This includes when `scalar identity behaviour` occurs. If precise
25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
26 instructions should be used without an SV Prefix.
27
28 An interesting side-effect of this decision is that the OE flag is now
29 free for other uses when SV Prefixing is used.
30
31 Regarding XER.CA: this does not fit either: it was designed for a scalar
32 ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
33 Vector element. This provides a means to perform large parallel batches
34 of Vectorised carry-capable additions. crweird instructions can be used
35 to transfer the CRs in and out of an integer, where bitmanipulation
36 may be performed to analyse the carry bits (including carry lookahead
37 propagation) before continuing with further parallel additions.
38
39 # v3.0B/v3.1 relevant instructions
40
41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
42 CPU ISA.
43
44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
45 alternative uses. Additionally, Vectorisation of the VSX SIMD system
46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
47 at the very minimum, predication (which VSX was designed without).
48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
49 illegal instruction exceptions in SV Prefix Mode.
50
51 Likewise, `lq` (Load Quad), and Load/Store Multiple make no sense to
52 have because they are not only provided by SV, the SV alternatives may
53 be predicated as well, making them far better suited to use in function
54 calls and context-switching.
55
56 Additionally, some v3.0/1 instructions simply make no sense at all in a
57 Vector context: `rfid` falls into this category,
58 as well as `sc` and `scv`. Here there is simply no point
59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
60 should be called instead.
61
62 Fortuitously this leaves several Major Opcodes free for use by SV
63 to fit alternative future instructions. In a 3D context this means
64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
65 operations, and others critical to an efficient, effective 3D GPU and
66 VPU ISA. With such instructions being included as standard in other
67 commercially-successful GPU ISAs it is likewise critical that a 3D
68 GPU/VPU based on svp64 also have such instructions.
69
70 Note however that svp64 is stand-alone and is in no way
71 critically dependent on the existence or provision of 3D GPU or VPU
72 instructions. These should be considered extensions, and their discussion
73 and specification is out of scope for this document.
74
75 Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
76 v3.1B is *not* altered by svp64 in any way.
77
78 ## Major opcode map (v3.0B)
79
80 This table is taken from v3.0B.
81 Table 9: Primary Opcode Map (opcode bits 0:5)
82
83 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
84 000 | | | tdi | twi | EXT04 | | | mulli | 000
85 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
86 010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm | | rlwnm | 010
87 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
88 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
89 101 | lhz | lhzu | lha | lhau | sth | sthu | lmw | stmw | 101
90 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
91 111 | lq | EXT57 | EXT58 | EXT59 | EXT60 | EXT61 | EXT62 | EXT63 | 111
92 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
93
94 ## Suitable for svp64-only
95
96 This is the same table containing v3.0B Primary Opcodes except those that
97 make no sense in a Vectorisation Context have been removed. These removed
98 POs can, *in the SV Vector Context only*, be assigned to alternative
99 (Vectorised-only) instructions, including future extensions.
100
101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
102 change. When not prefixed with svp64 these opcodes **specifically**
103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
104
105 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
106 000 | | | | | | | | mulli | 000
107 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
108 010 | | | | EXT19 | rlwimi| rlwinm | | rlwnm | 010
109 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
110 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
111 101 | lhz | lhzu | lha | lhau | sth | sthu | | | 101
112 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
113 111 | | | EXT58 | EXT59 | | EXT61 | | EXT63 | 111
114 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
115
116 It is important to note that having a different v3.0B Scalar opcode
117 that is different from an SVP64 one is highly undesirable: the complexity
118 in the decoder is greatly increased.
119
120 # Single Predication
121
122 This is a standard mode normally found in Vector ISAs. every element in every source Vector and in the destination uses the same bit of one single predicate mask.
123
124 In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep: unlike Twin-Predication the two must be equal at all times.
125
126 # Twin Predication
127
128 This is a novel concept that allows predication to be applied to a single
129 source and a single dest register. The following types of traditional
130 Vector operations may be encoded with it, *without requiring explicit
131 opcodes to do so*
132
133 * VSPLAT (a single scalar distributed across a vector)
134 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
135 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
136 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
137 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
138
139 Those patterns (and more) may be applied to:
140
141 * mv (the usual way that V\* ISA operations are created)
142 * exts\* sign-extension
143 * rwlinm and other RS-RA shift operations (**note**: excluding
144 those that take RA as both a src and dest. These are not
145 1-src 1-dest, they are 2-src, 1-dest)
146 * LD and ST (treating AGEN as one source)
147 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
148 * Condition Register ops mfcr, mtcr and other similar
149
150 This is a huge list that creates extremely powerful combinations,
151 particularly given that one of the predicate options is `(1<<r3)`
152
153 Additional unusual capabilities of Twin Predication include a back-to-back
154 version of VCOMPRESS-VEXPAND which is effectively the ability to do
155 sequentially ordered multiple VINSERTs. The source predicate selects a
156 sequentially ordered subset of elements to be inserted; the destination
157 predicate specifies the sequentially ordered recipient locations.
158 This is equivalent to
159 `llvm.masked.compressstore.*`
160 followed by
161 `llvm.masked.expandload.*`
162
163 # Reduce modes
164
165 Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
166 Vector ISA would have explicit Reduce opcodes with defined characteristics
167 per operation: in SX Aurora there is even an additional scalar argument
168 containing the initial reduction value. SVP64 fundamentally has to
169 utilise *existing* Scalar Power ISA v3.0B operations, which presents some
170 unique challenges.
171
172 The solution turns out to be to simply define reduction as permitting
173 deterministic element-based schedules to be issued using the base Scalar
174 operations, and to rely on the underlying microarchitecture to resolve
175 Register Hazards at the element level. This goes back to
176 the fundamental principle that SV is nothing more than a Sub-Program-Counter
177 sitting between Decode and Issue phases.
178
179 Microarchitectures *may* take opportunities to parallelise the reduction
180 but only if in doing so they preserve Program Order at the Element Level.
181 Opportunities where this is possible include an `OR` operation
182 or a MIN/MAX operation: it may be possible to parallelise the reduction,
183 but for Floating Point it is not permitted due to different results
184 being obtained if the reduction is not executed in strict sequential
185 order.
186
187 ## Scalar result reduce mode
188
189 Scalar Reduction per se does not exist, instead is implemented in SVP64
190 as a simple and natural relaxation of the usual restriction on the Vector
191 Looping which would terminate if the destination was marked as a Scalar.
192 Scalar Reduction by contrast *keeps issuing Vector Element Operations*
193 even though the destination register is marked as scalar.
194 Thus it is up to the programmer to be aware of this and observe some
195 conventions. It is also important to appreciate that there is no
196 actual imposition or restriction on how this mode is utilised: there
197 will therefore be several valuable uses (including Vector Iteration)
198 and it is up to the programmer to make best use of the capability
199 provided.
200
201 In this mode, which is suited to operations involving carry or overflow,
202 one register must be identified by the programmer as being the "accumulator".
203 Scalar reduction is thus categorised by:
204
205 * One of the sources is a Vector
206 * the destination is a scalar
207 * optionally but most usefully when one source register is also the destination
208 * That the source register type is the same as the destination register
209 type identified as the "accumulator". scalar reduction on `cmp`,
210 `setb` or `isel` makes no sense for example because of the mixture
211 between CRs and GPRs.
212
213 *Note that issuing instructions in Scalar reduce mode such as `setb`
214 are neither `UNDEFINED` nor prohibited, despite them not making much
215 sense at first glance.
216 Scalar reduce is strictly defined behaviour, and the cost in
217 hardware terms of prohibition of seemingly non-sensical operations is too great.
218 Therefore it is permitted and required to be executed successfully.
219 Implementors **MAY** choose to optimise such instructions in instances
220 where their use results in "extraneous execution", i.e. where it is clear
221 that the sequence of operations, comprising multiple overwrites to
222 a scalar destination **without** cumulative, iterative, or reductive
223 behaviour, may discard all but the last element operation. Identification
224 of such is trivial to do for `setb` and `cmp`: the source register type is
225 a completely different register file from the destination*
226
227 Typical applications include simple operations such as `ADD r3, r10.v,
228 r3` where, clearly, r3 is being used to accumulate the addition of all
229 elements is the vector starting at r10.
230
231 # add RT, RA,RB but when RT==RA
232 for i in range(VL):
233 iregs[RA] += iregs[RB+i] # RT==RA
234
235 However, *unless* the operation is marked as "mapreduce", SV ordinarily
236 **terminates** at the first scalar operation. Only by marking the
237 operation as "mapreduce" will it continue to issue multiple sub-looped
238 (element) instructions in `Program Order`.
239
240 To.perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This is useful for leaving a cumulative suffix sum in reverse order:
241
242 for i in (VL-1 downto 0):
243 # RT-1 = RA gives a suffix sum
244 iregs[RT+i] = iregs[RA+i] - iregs[RB+i]
245
246 Other examples include shift-mask operations where a Vector of inserts
247 into a single destination register is required, as a way to construct
248 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
249 Using the same register as both the source and destination, with Vectors
250 of different offsets masks and values to be inserted has multiple
251 applications including Video, cryptography and JIT compilation.
252
253 Subtract and Divide are still permitted to be executed in this mode,
254 although from an algorithmic perspective it is strongly discouraged.
255 It would be better to use addition followed by one final subtract,
256 or in the case of divide, to get better accuracy, to perform a multiply
257 cascade followed by a final divide.
258
259 Note that single-operand or three-operand scalar-dest reduce is perfectly
260 well permitted: the programmer may still declare one register, used as
261 both a Vector source and Scalar destination, to be utilised as
262 the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc
263 this naturally fits well with the normal expected usage of these
264 operations.
265
266 If an interrupt or exception occurs in the middle of the scalar mapreduce,
267 the scalar destination register **MUST** be updated with the current
268 (intermediate) result, because this is how ```Program Order``` is
269 preserved (Vector Loops are to be considered to be just another way of issuing instructions
270 in Program Order). In this way, after return from interrupt,
271 the scalar mapreduce may continue where it left off. This provides
272 "precise" exception behaviour.
273
274 Note that hardware is perfectly permitted to perform multi-issue
275 parallel optimisation of the scalar reduce operation: it's just that
276 as far as the user is concerned, all exceptions and interrupts **MUST**
277 be precise.
278
279 ## Vector result reduce mode
280
281 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
282 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
283 *appearance* and *effect* of Reduction.
284
285 Given that the tree-reduction schedule is deterministic,
286 Interrupts and exceptions
287 can therefore also be precise. The final result will be in the first
288 non-predicate-masked-out destination element, but due again to
289 the deterministic schedule programmers may find uses for the intermediate
290 results.
291
292 When Rc=1 a corresponding Vector of co-resultant CRs is also
293 created. No special action is taken: the result and its CR Field
294 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
295
296 ## Sub-Vector Horizontal Reduction
297
298 Note that when SVM is clear and SUBVL!=1 the sub-elements are
299 *independent*, i.e. they are mapreduced per *sub-element* as a result.
300 illustration with a vec2:
301
302 result.x = op(iregs[RA].x, iregs[RA+1].x)
303 result.y = op(iregs[RA].y, iregs[RA+1].y)
304 for i in range(2, VL):
305 result.x = op(result.x, iregs[RA+i].x)
306 result.y = op(result.y, iregs[RA+i].y)
307
308 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
309
310 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
311 subvector mode. Example for a vec3:
312
313 for i in range(VL):
314 result = iregs[RA+i].x
315 result = op(result, iregs[RA+i].y)
316 result = op(result, iregs[RA+i].z)
317 iregs[RT+i] = result
318
319 In this mode, when Rc=1 the Vector of CRs is as normal: each result
320 element creates a corresponding CR element (for the final, reduced, result).
321
322 # Fail-on-first
323
324 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
325 the other for arithmetic operations (actually, CR-driven). Note in each
326 case the assumption is that vector elements are required appear to be
327 executed in sequential Program Order, element 0 being the first.
328
329 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
330 ordinary one. Exceptions occur "as normal". However for elements 1
331 and above, if an exception would occur, then VL is **truncated** to the
332 previous element.
333 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
334 CR-creating operation produces a result (including cmp). Similar to
335 branch, an analysis of the CR is performed and if the test fails, the
336 vector operation terminates and discards all element operations at and
337 above the current one, and VL is truncated to either
338 the *previous* element or the current one, depending on whether
339 VLi (VL "inclusive") is set.
340
341 Thus the new VL comprises a contiguous vector of results,
342 all of which pass the testing criteria (equal to zero, less than zero).
343
344 The CR-based data-driven fail-on-first is new and not found in ARM
345 SVE or RVV. It is extremely useful for reducing instruction count,
346 however requires speculative execution involving modifications of VL
347 to get high performance implementations. An additional mode (RC1=1)
348 effectively turns what would otherwise be an arithmetic operation
349 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
350 against the `inv` field).
351 If the CR.eq bit is equal to `inv` then the Vector is truncated and
352 the loop ends.
353 Note that when RC1=1 the result elements are never stored, only the CRs.
354
355 VLi is only available as an option when `Rc=0` (or for instructions
356 which do not have Rc). When set, the current element is always
357 also included in the count (the new length that VL will be set to).
358 This may be useful in combination with "inv" to truncate the Vector
359 to `exclude` elements that fail a test, or, in the case of implementations
360 of strncpy, to include the terminating zero.
361
362 In CR-based data-driven fail-on-first there is only the option to select
363 and test one bit of each CR (just as with branch BO). For more complex
364 tests this may be insufficient. If that is the case, a vectorised crops
365 (crand, cror) may be used, and ffirst applied to the crop instead of to
366 the arithmetic vector.
367
368 One extremely important aspect of ffirst is:
369
370 * LDST ffirst may never set VL equal to zero. This because on the first
371 element an exception must be raised "as normal".
372 * CR-based data-dependent ffirst on the other hand **can** set VL equal
373 to zero. This is the only means in the entirety of SV that VL may be set
374 to zero (with the exception of via the SV.STATE SPR). When VL is set
375 zero due to the first element failing the CR bit-test, all subsequent
376 vectorised operations are effectively `nops` which is
377 *precisely the desired and intended behaviour*.
378
379 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
380 to a nonzero value for any implementation-specific reason. For example:
381 it is perfectly reasonable for implementations to alter VL when ffirst
382 LD or ST operations are initiated on a nonaligned boundary, such that
383 within a loop the subsequent iteration of that loop begins subsequent
384 ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
385 workloads or balance resources.
386
387 CR-based data-dependent first on the other hand MUST not truncate VL
388 arbitrarily to a length decided by the hardware: VL MUST only be
389 truncated based explicitly on whether a test fails.
390 This because it is a precise test on which algorithms
391 will rely.
392
393 ## Data-dependent fail-first on CR operations (crand etc)
394
395 Operations that actually produce or alter CR Field as a result
396 do not also in turn have an Rc=1 mode. However it makes no
397 sense to try to test the 4 bits of a CR Field for being equal
398 or not equal to zero. Moreover, the result is already in the
399 form that is desired: it is a CR field. Therefore,
400 CR-based operations have their own SVP64 Mode, described
401 in [[sv/cr_ops]]
402
403 There are two primary different types of CR operations:
404
405 * Those which have a 3-bit operand field (referring to a CR Field)
406 * Those which have a 5-bit operand (referring to a bit within the
407 whole 32-bit CR)
408
409 More details can be found in [[sv/cr_ops]].
410
411 # pred-result mode
412
413 This mode merges common CR testing with predication, saving on instruction
414 count. Below is the pseudocode excluding predicate zeroing and elwidth
415 overrides. Note that the paeudocode for [[sv/cr_ops]] is slightly different.
416
417 for i in range(VL):
418 # predication test, skip all masked out elements.
419 if predicate_masked_out(i):
420 continue
421 result = op(iregs[RA+i], iregs[RB+i])
422 CRnew = analyse(result) # calculates eq/lt/gt
423 # Rc=1 always stores the CR
424 if Rc=1 or RC1:
425 crregs[offs+i] = CRnew
426 # now test CR, similar to branch
427 if RC1 or CRnew[BO[0:1]] != BO[2]:
428 continue # test failed: cancel store
429 # result optionally stored but CR always is
430 iregs[RT+i] = result
431
432 The reason for allowing the CR element to be stored is so that
433 post-analysis of the CR Vector may be carried out. For example:
434 Saturation may have occurred (and been prevented from updating, by the
435 test) but it is desirable to know *which* elements fail saturation.
436
437 Note that RC1 Mode basically turns all operations into `cmp`. The
438 calculation is performed but it is only the CR that is written. The
439 element result is *always* discarded, never written (just like `cmp`).
440
441 Note that predication is still respected: predicate zeroing is slightly
442 different: elements that fail the CR test *or* are masked out are zero'd.
443
444 ## pred-result mode on CR ops
445
446 CR operations (mtcr, crand, cror) may be Vectorised,
447 predicated, and also pred-result mode applied to it.
448 Vectorisation applies to 4-bit CR Fields which are treated as
449 elements, not the individual bits of the 32-bit CR.
450 CR ops and how to identify them is described in [[sv/cr_ops]]
451
452 # CR Operations
453
454 CRs are slightly more involved than INT or FP registers due to the
455 possibility for indexing individual bits (crops BA/BB/BT). Again however
456 the access pattern needs to be understandable in relation to v3.0B / v3.1B
457 numbering, with a clear linear relationship and mapping existing when
458 SV is applied.
459
460 ## CR EXTRA mapping table and algorithm
461
462 Numbering relationships for CR fields are already complex due to being
463 in BE format (*the relationship is not clearly explained in the v3.0B
464 or v3.1B specification*). However with some care and consideration
465 the exact same mapping used for INT and FP regfiles may be applied,
466 just to the upper bits, as explained below. The notation
467 `CR{field number}` is used to indicate access to a particular
468 Condition Register Field (as opposed to the notation `CR[bit]`
469 which accesses one bit of the 32 bit Power ISA v3.0B
470 Condition Register)
471
472 In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits. The top 3 bits (0:2)
473 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
474 *in* that CR. The numbering was determined (after 4 months of
475 analysis and research) to be as follows:
476
477 CR_index = 7-(BA>>2) # top 3 bits but BE
478 bit_index = 3-(BA & 0b11) # low 2 bits but BE
479 CR_reg = CR{CR_index} # get the CR
480 # finally get the bit from the CR.
481 CR_bit = (CR_reg & (1<<bit_index)) != 0
482
483 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
484 applies, **not** the CR\_bit portion (bits 3:4):
485
486 if extra3_mode:
487 spec = EXTRA3
488 else:
489 spec = EXTRA2<<1 | 0b0
490 if spec[0]:
491 # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
492 return ((BA >> 2)<<6) | # hi 3 bits shifted up
493 (spec[1:2]<<4) | # to make room for these
494 (BA & 0b11) # CR_bit on the end
495 else:
496 # scalar constructs "00 spec[1:2] BA[0:4]"
497 return (spec[1:2] << 5) | BA
498
499 Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
500 algorithm to determin CR\_reg is modified to as follows:
501
502 CR_index = 7-(BA>>2) # top 3 bits but BE
503 if spec[0]:
504 # vector mode, 0-124 increments of 4
505 CR_index = (CR_index<<4) | (spec[1:2] << 2)
506 else:
507 # scalar mode, 0-32 increments of 1
508 CR_index = (spec[1:2]<<3) | CR_index
509 # same as for v3.0/v3.1 from this point onwards
510 bit_index = 3-(BA & 0b11) # low 2 bits but BE
511 CR_reg = CR{CR_index} # get the CR
512 # finally get the bit from the CR.
513 CR_bit = (CR_reg & (1<<bit_index)) != 0
514
515 Note here that the decoding pattern to determine CR\_bit does not change.
516
517 Note: high-performance implementations may read/write Vectors of CRs in
518 batches of aligned 32-bit chunks (CR0-7, CR7-15). This is to greatly
519 simplify internal design. If instructions are issued where CR Vectors
520 do not start on a 32-bit aligned boundary, performance may be affected.
521
522 ## CR fields as inputs/outputs of vector operations
523
524 CRs (or, the arithmetic operations associated with them)
525 may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
526
527 When vectorized, the CR inputs/outputs are sequentially read/written
528 to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
529 writing to CR8 (TBD evaluate) and increase sequentially from there.
530 This is so that:
531
532 * implementations may rely on the Vector CRs being aligned to 8. This
533 means that CRs may be read or written in aligned batches of 32 bits
534 (8 CRs per batch), for high performance implementations.
535 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
536 overwritten by vector Rc=1 operations except for very large VL
537 * CR-based predication, from CR32, is also not interfered with
538 (except by large VL).
539
540 However when the SV result (destination) is marked as a scalar by the
541 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
542 CR when Rc=1 is written to. This is CR0 for integer operations and CR1
543 for FP operations.
544
545 Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD VSX which
546 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
547 v3.0B scalar operations produce a **tuple** of element results: the
548 result of the operation as one part of that element *and a corresponding
549 CR element*. Greatly simplified pseudocode:
550
551 for i in range(VL):
552 # calculate the vector result of an add iregs[RT+i] = iregs[RA+i]
553 + iregs[RB+i] # now calculate CR bits CRs{8+i}.eq = iregs[RT+i]
554 == 0 CRs{8+i}.gt = iregs[RT+i] > 0 ... etc
555
556 If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
557 then a followup instruction must be performed, setting "reduce" mode on
558 the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
559 more flexibility in analysing vectors than standard Vector ISAs. Normal
560 Vector ISAs are typically restricted to "were all results nonzero" and
561 "were some results nonzero". The application of mapreduce to Vectorised
562 cr operations allows far more sophisticated analysis, particularly in
563 conjunction with the new crweird operations see [[sv/cr_int_predication]].
564
565 Note in particular that the use of a separate instruction in this way
566 ensures that high performance multi-issue OoO inplementations do not
567 have the computation of the cumulative analysis CR as a bottleneck and
568 hindrance, regardless of the length of VL.
569
570 Additionally,
571 SVP64 [[sv/branches]] may be used, even when the branch itself is to
572 the following instruction. The combined side-effects of CTR reduction
573 and VL truncation provide several benefits.
574
575 (see [[discussion]]. some alternative schemes are described there)
576
577 ## Rc=1 when SUBVL!=1
578
579 sub-vectors are effectively a form of Packed SIMD (length 2 to 4). Only 1 bit of
580 predicate is allocated per subvector; likewise only one CR is allocated
581 per subvector.
582
583 This leaves a conundrum as to how to apply CR computation per subvector,
584 when normally Rc=1 is exclusively applied to scalar elements. A solution
585 is to perform a bitwise OR or AND of the subvector tests. Given that
586 OE is ignored in SVP64, this field may (when available) be used to select OR or
587 AND behavior.
588
589 ### Table of CR fields
590
591 CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
592 so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
593
594 CRs are not stored in SPRs: they are registers in their own right.
595 Therefore context-switching the full set of CRs involves a Vectorised
596 mfcr or mtcr, using VL=64, elwidth=8 to do so. This is exactly as how
597 scalar OpenPOWER context-switches CRs: it is just that there are now
598 more of them.
599
600 The 64 SV CRs are arranged similarly to the way the 128 integer registers
601 are arranged. TODO a python program that auto-generates a CSV file
602 which can be included in a table, which is in a new page (so as not to
603 overwhelm this one). [[svp64/cr_names]]
604
605 # Register Profiles
606
607 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
608 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
609
610 Instructions are broken down by Register Profiles as listed in the
611 following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
612 indicates that the operations with this Register Profile cannot be
613 Vectorised (mtspr, bc, dcbz, twi)
614
615 TODO generate table which will be here [[svp64/reg_profiles]]
616
617 # SV pseudocode illilustration
618
619 ## Single-predicated Instruction
620
621 illustration of normal mode add operation: zeroing not included, elwidth
622 overrides not included. if there is no predicate, it is set to all 1s
623
624 function op_add(rd, rs1, rs2) # add not VADD!
625 int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd);
626 for (i = 0; i < VL; i++)
627 STATE.srcoffs = i # save context if (predval & 1<<i) # predication
628 uses intregs
629 ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2]; if (!int_vec[rd
630 ].isvec) break;
631 if (rd.isvec) { id += 1; } if (rs1.isvec) { irs1 += 1; } if
632 (rs2.isvec) { irs2 += 1; } if (id == VL or irs1 == VL or irs2 ==
633 VL) {
634 # end VL hardware loop STATE.srcoffs = 0; # reset return;
635 }
636
637 This has several modes:
638
639 * RT.v = RA.v RB.v * RT.v = RA.v RB.s (and RA.s RB.v) * RT.v = RA.s RB.s *
640 RT.s = RA.v RB.v * RT.s = RA.v RB.s (and RA.s RB.v) * RT.s = RA.s RB.s
641
642 All of these may be predicated. Vector-Vector is straightfoward.
643 When one of source is a Vector and the other a Scalar, it is clear that
644 each element of the Vector source should be added to the Scalar source,
645 each result placed into the Vector (or, if the destination is a scalar,
646 only the first nonpredicated result).
647
648 The one that is not obvious is RT=vector but both RA/RB=scalar.
649 Here this acts as a "splat scalar result", copying the same result into
650 all nonpredicated result elements. If a fixed destination scalar was
651 intended, then an all-Scalar operation should be used.
652
653 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
654
655 # Assembly Annotation
656
657 Assembly code annotation is required for SV to be able to successfully
658 mark instructions as "prefixed".
659
660 A reasonable (prototype) starting point:
661
662 svp64 [field=value]*
663
664 Fields:
665
666 * ew=8/16/32 - element width
667 * sew=8/16/32 - source element width
668 * vec=2/3/4 - SUBVL
669 * mode=reduce/satu/sats/crpred
670 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
671 * spred={reg spec}
672
673 similar to x86 "rex" prefix.
674
675 For actual assembler:
676
677 sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
678
679 Qualifiers:
680
681 * m={pred}: predicate mask mode
682 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
683 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
684 * ew={N}: ew=8/16/32 - sets elwidth override
685 * sw={N}: sw=8/16/32 - sets source elwidth override
686 * ff={xx}: see fail-first mode
687 * pr={xx}: see predicate-result mode
688 * sat{x}: satu / sats - see saturation mode
689 * mr: see map-reduce mode
690 * mr.svm see map-reduce with sub-vector mode
691 * crm: see map-reduce CR mode
692 * crm.svm see map-reduce CR with sub-vector mode
693 * sz: predication with source-zeroing
694 * dz: predication with dest-zeroing
695
696 For modes:
697
698 * pred-result:
699 - pm=lt/gt/le/ge/eq/ne/so/ns OR
700 - pm=RC1 OR pm=~RC1
701 * fail-first
702 - ff=lt/gt/le/ge/eq/ne/so/ns OR
703 - ff=RC1 OR ff=~RC1
704 * saturation:
705 - sats
706 - satu
707 * map-reduce:
708 - mr OR crm: "normal" map-reduce mode or CR-mode.
709 - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
710
711 # Proposed Parallel-reduction algorithm
712
713 ```
714 /// reference implementation of proposed SimpleV reduction semantics.
715 ///
716 // reduction operation -- we still use this algorithm even
717 // if the reduction operation isn't associative or
718 // commutative.
719 /// `temp_pred` is a user-visible Vector Condition register
720 ///
721 /// all input arrays have length `vl`
722 def reduce( vl, vec, pred, pred,):
723 step = 1;
724 while step < vl
725 step *= 2;
726 for i in (0..vl).step_by(step)
727 other = i + step / 2;
728 other_pred = other < vl && pred[other];
729 if pred[i] && other_pred
730 vec[i] += vec[other];
731 else if other_pred
732 vec[i] = vec[other];
733 pred[i] |= other_pred;
734
735 def reduce( vl, vec, pred, pred,):
736 j = 0
737 vi = [] # array of lookup indices to skip nonpredicated
738 for i, pbit in enumerate(pred):
739 if pbit:
740 vi[j] = i
741 j += 1
742 step = 2
743 while step <= vl
744 halfstep = step // 2
745 for i in (0..vl).step_by(step)
746 other = vi[i + halfstep]
747 i = vi[i]
748 other_pred = other < vl && pred[other]
749 if pred[i] && other_pred
750 vec[i] += vec[other]
751 pred[i] |= other_pred
752 step *= 2
753
754 ```