3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
6 This is the appendix to [[sv/svp64]], providing explanations of modes
7 etc. leaving the main svp64 page's primary purpose as outlining the
14 # XER, SO and other global flags
16 Vector systems are expected to be high performance. This is achieved
17 through parallelism, which requires that elements in the vector be
18 independent. XER SO and other global "accumulation" flags (CR.OV) cause
19 Read-Write Hazards on single-bit global resources, having a significant
22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
23 in `cmp` instructions). XER is simply neither read nor written.
24 This includes when `scalar identity behaviour` occurs. If precise
25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
26 instructions should be used without an SV Prefix.
28 An interesting side-effect of this decision is that the OE flag is now
29 free for other uses when SV Prefixing is used.
31 Regarding XER.CA: this does not fit either: it was designed for a scalar
32 ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
33 Vector element. This provides a means to perform large parallel batches
34 of Vectorised carry-capable additions. crweird instructions can be used
35 to transfer the CRs in and out of an integer, where bitmanipulation
36 may be performed to analyse the carry bits (including carry lookahead
37 propagation) before continuing with further parallel additions.
39 # v3.0B/v3.1 relevant instructions
41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
45 alternative uses. Additionally, Vectorisation of the VSX SIMD system
46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
47 at the very minimum, predication (which VSX was designed without).
48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
49 illegal instruction exceptions in SV Prefix Mode.
51 Likewise, `lq` (Load Quad), and Load/Store Multiple make no sense to
52 have because they are not only provided by SV, the SV alternatives may
53 be predicated as well, making them far better suited to use in function
54 calls and context-switching.
56 Additionally, some v3.0/1 instructions simply make no sense at all in a
57 Vector context: `rfid` falls into this category,
58 as well as `sc` and `scv`. Here there is simply no point
59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
60 should be called instead.
62 Fortuitously this leaves several Major Opcodes free for use by SV
63 to fit alternative future instructions. In a 3D context this means
64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
65 operations, and others critical to an efficient, effective 3D GPU and
66 VPU ISA. With such instructions being included as standard in other
67 commercially-successful GPU ISAs it is likewise critical that a 3D
68 GPU/VPU based on svp64 also have such instructions.
70 Note however that svp64 is stand-alone and is in no way
71 critically dependent on the existence or provision of 3D GPU or VPU
72 instructions. These should be considered extensions, and their discussion
73 and specification is out of scope for this document.
75 Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
76 v3.1B is *not* altered by svp64 in any way.
78 ## Major opcode map (v3.0B)
80 This table is taken from v3.0B.
81 Table 9: Primary Opcode Map (opcode bits 0:5)
83 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
84 000 | | | tdi | twi | EXT04 | | | mulli | 000
85 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
86 010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm | | rlwnm | 010
87 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
88 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
89 101 | lhz | lhzu | lha | lhau | sth | sthu | lmw | stmw | 101
90 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
91 111 | lq | EXT57 | EXT58 | EXT59 | EXT60 | EXT61 | EXT62 | EXT63 | 111
92 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
94 ## Suitable for svp64-only
96 This is the same table containing v3.0B Primary Opcodes except those that
97 make no sense in a Vectorisation Context have been removed. These removed
98 POs can, *in the SV Vector Context only*, be assigned to alternative
99 (Vectorised-only) instructions, including future extensions.
101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
102 change. When not prefixed with svp64 these opcodes **specifically**
103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
105 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
106 000 | | | | | | | | mulli | 000
107 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
108 010 | | | | EXT19 | rlwimi| rlwinm | | rlwnm | 010
109 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
110 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
111 101 | lhz | lhzu | lha | lhau | sth | sthu | | | 101
112 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
113 111 | | | EXT58 | EXT59 | | EXT61 | | EXT63 | 111
114 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
116 It is important to note that having a different v3.0B Scalar opcode
117 that is different from an SVP64 one is highly undesirable: the complexity
118 in the decoder is greatly increased.
122 This is a standard mode normally found in Vector ISAs. every element in every source Vector and in the destination uses the same bit of one single predicate mask.
124 In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep: unlike Twin-Predication the two must be equal at all times.
128 This is a novel concept that allows predication to be applied to a single
129 source and a single dest register. The following types of traditional
130 Vector operations may be encoded with it, *without requiring explicit
133 * VSPLAT (a single scalar distributed across a vector)
134 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
135 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
136 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
137 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
139 Those patterns (and more) may be applied to:
141 * mv (the usual way that V\* ISA operations are created)
142 * exts\* sign-extension
143 * rwlinm and other RS-RA shift operations (**note**: excluding
144 those that take RA as both a src and dest. These are not
145 1-src 1-dest, they are 2-src, 1-dest)
146 * LD and ST (treating AGEN as one source)
147 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
148 * Condition Register ops mfcr, mtcr and other similar
150 This is a huge list that creates extremely powerful combinations,
151 particularly given that one of the predicate options is `(1<<r3)`
153 Additional unusual capabilities of Twin Predication include a back-to-back
154 version of VCOMPRESS-VEXPAND which is effectively the ability to do
155 sequentially ordered multiple VINSERTs. The source predicate selects a
156 sequentially ordered subset of elements to be inserted; the destination
157 predicate specifies the sequentially ordered recipient locations.
158 This is equivalent to
159 `llvm.masked.compressstore.*`
161 `llvm.masked.expandload.*`
165 Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
166 Vector ISA would have explicit Reduce opcodes with defined characteristics
167 per operation: in SX Aurora there is even an additional scalar argument
168 containing the initial reduction value. SVP64 fundamentally has to
169 utilise *existing* Scalar Power ISA v3.0B operations, which presents some
172 The solution turns out to be to simply define reduction as permitting
173 deterministic element-based schedules to be issued using the base Scalar
174 operations, and to rely on the underlying microarchitecture to resolve
175 Register Hazards at the element level. This goes back to
176 the fundamental principle that SV is nothing more than a Sub-Program-Counter
177 sitting between Decode and Issue phases.
179 Microarchitectures *may* take opportunities to parallelise the reduction
180 but only if in doing so they preserve Program Order at the Element Level.
181 Opportunities where this is possible include an `OR` operation
182 or a MIN/MAX operation: it may be possible to parallelise the reduction,
183 but for Floating Point it is not permitted due to different results
184 being obtained if the reduction is not executed in strict sequential
187 ## Scalar result reduce mode
189 Scalar Reduction per se does not exist, instead is implemented in SVP64
190 as a simple and natural relaxation of the usual restriction on the Vector
191 Looping which would terminate if the destination was marked as a Scalar.
192 Scalar Reduction by contrast *keeps issuing Vector Element Operations*
193 even though the destination register is marked as scalar.
194 Thus it is up to the programmer to be aware of this and observe some
195 conventions. It is also important to appreciate that there is no
196 actual imposition or restriction on how this mode is utilised: there
197 will therefore be several valuable uses (including Vector Iteration)
198 and it is up to the programmer to make best use of the capability
201 In this mode, which is suited to operations involving carry or overflow,
202 one register must be identified by the programmer as being the "accumulator".
203 Scalar reduction is thus categorised by:
205 * One of the sources is a Vector
206 * the destination is a scalar
207 * optionally but most usefully when one source register is also the destination
208 * That the source register type is the same as the destination register
209 type identified as the "accumulator". scalar reduction on `cmp`,
210 `setb` or `isel` makes no sense for example because of the mixture
211 between CRs and GPRs.
213 *Note that issuing instructions in Scalar reduce mode such as `setb`
214 are neither `UNDEFINED` nor prohibited, despite them not making much
215 sense at first glance.
216 Scalar reduce is strictly defined behaviour, and the cost in
217 hardware terms of prohibition of seemingly non-sensical operations is too great.
218 Therefore it is permitted and required to be executed successfully.
219 Implementors **MAY** choose to optimise such instructions in instances
220 where their use results in "extraneous execution", i.e. where it is clear
221 that the sequence of operations, comprising multiple overwrites to
222 a scalar destination **without** cumulative, iterative, or reductive
223 behaviour, may discard all but the last element operation. Identification
224 of such is trivial to do for `setb` and `cmp`: the source register type is
225 a completely different register file from the destination*
227 Typical applications include simple operations such as `ADD r3, r10.v,
228 r3` where, clearly, r3 is being used to accumulate the addition of all
229 elements is the vector starting at r10.
231 # add RT, RA,RB but when RT==RA
233 iregs[RA] += iregs[RB+i] # RT==RA
235 However, *unless* the operation is marked as "mapreduce", SV ordinarily
236 **terminates** at the first scalar operation. Only by marking the
237 operation as "mapreduce" will it continue to issue multiple sub-looped
238 (element) instructions in `Program Order`.
240 To.perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This is useful for leaving a cumulative suffix sum in reverse order:
242 for i in (VL-1 downto 0):
243 # RT-1 = RA gives a suffix sum
244 iregs[RT+i] = iregs[RA+i] - iregs[RB+i]
246 Other examples include shift-mask operations where a Vector of inserts
247 into a single destination register is required, as a way to construct
248 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
249 Using the same register as both the source and destination, with Vectors
250 of different offsets masks and values to be inserted has multiple
251 applications including Video, cryptography and JIT compilation.
253 Subtract and Divide are still permitted to be executed in this mode,
254 although from an algorithmic perspective it is strongly discouraged.
255 It would be better to use addition followed by one final subtract,
256 or in the case of divide, to get better accuracy, to perform a multiply
257 cascade followed by a final divide.
259 Note that single-operand or three-operand scalar-dest reduce is perfectly
260 well permitted: the programmer may still declare one register, used as
261 both a Vector source and Scalar destination, to be utilised as
262 the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc
263 this naturally fits well with the normal expected usage of these
266 If an interrupt or exception occurs in the middle of the scalar mapreduce,
267 the scalar destination register **MUST** be updated with the current
268 (intermediate) result, because this is how ```Program Order``` is
269 preserved (Vector Loops are to be considered to be just another way of issuing instructions
270 in Program Order). In this way, after return from interrupt,
271 the scalar mapreduce may continue where it left off. This provides
272 "precise" exception behaviour.
274 Note that hardware is perfectly permitted to perform multi-issue
275 parallel optimisation of the scalar reduce operation: it's just that
276 as far as the user is concerned, all exceptions and interrupts **MUST**
279 ## Vector result reduce mode
281 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
282 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
283 *appearance* and *effect* of Reduction.
285 Given that the tree-reduction schedule is deterministic,
286 Interrupts and exceptions
287 can therefore also be precise. The final result will be in the first
288 non-predicate-masked-out destination element, but due again to
289 the deterministic schedule programmers may find uses for the intermediate
292 When Rc=1 a corresponding Vector of co-resultant CRs is also
293 created. No special action is taken: the result and its CR Field
294 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
296 ## Sub-Vector Horizontal Reduction
298 Note that when SVM is clear and SUBVL!=1 the sub-elements are
299 *independent*, i.e. they are mapreduced per *sub-element* as a result.
300 illustration with a vec2:
302 result.x = op(iregs[RA].x, iregs[RA+1].x)
303 result.y = op(iregs[RA].y, iregs[RA+1].y)
304 for i in range(2, VL):
305 result.x = op(result.x, iregs[RA+i].x)
306 result.y = op(result.y, iregs[RA+i].y)
308 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
310 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
311 subvector mode. Example for a vec3:
314 result = iregs[RA+i].x
315 result = op(result, iregs[RA+i].y)
316 result = op(result, iregs[RA+i].z)
319 In this mode, when Rc=1 the Vector of CRs is as normal: each result
320 element creates a corresponding CR element (for the final, reduced, result).
324 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
325 the other for arithmetic operations (actually, CR-driven). Note in each
326 case the assumption is that vector elements are required appear to be
327 executed in sequential Program Order, element 0 being the first.
329 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
330 ordinary one. Exceptions occur "as normal". However for elements 1
331 and above, if an exception would occur, then VL is **truncated** to the
333 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
334 CR-creating operation produces a result (including cmp). Similar to
335 branch, an analysis of the CR is performed and if the test fails, the
336 vector operation terminates and discards all element operations at and
337 above the current one, and VL is truncated to either
338 the *previous* element or the current one, depending on whether
339 VLi (VL "inclusive") is set.
341 Thus the new VL comprises a contiguous vector of results,
342 all of which pass the testing criteria (equal to zero, less than zero).
344 The CR-based data-driven fail-on-first is new and not found in ARM
345 SVE or RVV. It is extremely useful for reducing instruction count,
346 however requires speculative execution involving modifications of VL
347 to get high performance implementations. An additional mode (RC1=1)
348 effectively turns what would otherwise be an arithmetic operation
349 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
350 against the `inv` field).
351 If the CR.eq bit is equal to `inv` then the Vector is truncated and
353 Note that when RC1=1 the result elements are never stored, only the CRs.
355 VLi is only available as an option when `Rc=0` (or for instructions
356 which do not have Rc). When set, the current element is always
357 also included in the count (the new length that VL will be set to).
358 This may be useful in combination with "inv" to truncate the Vector
359 to `exclude` elements that fail a test, or, in the case of implementations
360 of strncpy, to include the terminating zero.
362 In CR-based data-driven fail-on-first there is only the option to select
363 and test one bit of each CR (just as with branch BO). For more complex
364 tests this may be insufficient. If that is the case, a vectorised crops
365 (crand, cror) may be used, and ffirst applied to the crop instead of to
366 the arithmetic vector.
368 One extremely important aspect of ffirst is:
370 * LDST ffirst may never set VL equal to zero. This because on the first
371 element an exception must be raised "as normal".
372 * CR-based data-dependent ffirst on the other hand **can** set VL equal
373 to zero. This is the only means in the entirety of SV that VL may be set
374 to zero (with the exception of via the SV.STATE SPR). When VL is set
375 zero due to the first element failing the CR bit-test, all subsequent
376 vectorised operations are effectively `nops` which is
377 *precisely the desired and intended behaviour*.
379 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
380 to a nonzero value for any implementation-specific reason. For example:
381 it is perfectly reasonable for implementations to alter VL when ffirst
382 LD or ST operations are initiated on a nonaligned boundary, such that
383 within a loop the subsequent iteration of that loop begins subsequent
384 ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
385 workloads or balance resources.
387 CR-based data-dependent first on the other hand MUST not truncate VL
388 arbitrarily to a length decided by the hardware: VL MUST only be
389 truncated based explicitly on whether a test fails.
390 This because it is a precise test on which algorithms
393 ## Data-dependent fail-first on CR operations (crand etc)
395 Operations that actually produce or alter CR Field as a result
396 do not also in turn have an Rc=1 mode. However it makes no
397 sense to try to test the 4 bits of a CR Field for being equal
398 or not equal to zero. Moreover, the result is already in the
399 form that is desired: it is a CR field. Therefore,
400 CR-based operations have their own SVP64 Mode, described
403 There are two primary different types of CR operations:
405 * Those which have a 3-bit operand field (referring to a CR Field)
406 * Those which have a 5-bit operand (referring to a bit within the
409 More details can be found in [[sv/cr_ops]].
413 This mode merges common CR testing with predication, saving on instruction
414 count. Below is the pseudocode excluding predicate zeroing and elwidth
415 overrides. Note that the paeudocode for [[sv/cr_ops]] is slightly different.
418 # predication test, skip all masked out elements.
419 if predicate_masked_out(i):
421 result = op(iregs[RA+i], iregs[RB+i])
422 CRnew = analyse(result) # calculates eq/lt/gt
423 # Rc=1 always stores the CR
425 crregs[offs+i] = CRnew
426 # now test CR, similar to branch
427 if RC1 or CRnew[BO[0:1]] != BO[2]:
428 continue # test failed: cancel store
429 # result optionally stored but CR always is
432 The reason for allowing the CR element to be stored is so that
433 post-analysis of the CR Vector may be carried out. For example:
434 Saturation may have occurred (and been prevented from updating, by the
435 test) but it is desirable to know *which* elements fail saturation.
437 Note that RC1 Mode basically turns all operations into `cmp`. The
438 calculation is performed but it is only the CR that is written. The
439 element result is *always* discarded, never written (just like `cmp`).
441 Note that predication is still respected: predicate zeroing is slightly
442 different: elements that fail the CR test *or* are masked out are zero'd.
444 ## pred-result mode on CR ops
446 CR operations (mtcr, crand, cror) may be Vectorised,
447 predicated, and also pred-result mode applied to it.
448 Vectorisation applies to 4-bit CR Fields which are treated as
449 elements, not the individual bits of the 32-bit CR.
450 CR ops and how to identify them is described in [[sv/cr_ops]]
454 CRs are slightly more involved than INT or FP registers due to the
455 possibility for indexing individual bits (crops BA/BB/BT). Again however
456 the access pattern needs to be understandable in relation to v3.0B / v3.1B
457 numbering, with a clear linear relationship and mapping existing when
460 ## CR EXTRA mapping table and algorithm
462 Numbering relationships for CR fields are already complex due to being
463 in BE format (*the relationship is not clearly explained in the v3.0B
464 or v3.1B specification*). However with some care and consideration
465 the exact same mapping used for INT and FP regfiles may be applied,
466 just to the upper bits, as explained below. The notation
467 `CR{field number}` is used to indicate access to a particular
468 Condition Register Field (as opposed to the notation `CR[bit]`
469 which accesses one bit of the 32 bit Power ISA v3.0B
472 In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits. The top 3 bits (0:2)
473 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
474 *in* that CR. The numbering was determined (after 4 months of
475 analysis and research) to be as follows:
477 CR_index = 7-(BA>>2) # top 3 bits but BE
478 bit_index = 3-(BA & 0b11) # low 2 bits but BE
479 CR_reg = CR{CR_index} # get the CR
480 # finally get the bit from the CR.
481 CR_bit = (CR_reg & (1<<bit_index)) != 0
483 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
484 applies, **not** the CR\_bit portion (bits 3:4):
489 spec = EXTRA2<<1 | 0b0
491 # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
492 return ((BA >> 2)<<6) | # hi 3 bits shifted up
493 (spec[1:2]<<4) | # to make room for these
494 (BA & 0b11) # CR_bit on the end
496 # scalar constructs "00 spec[1:2] BA[0:4]"
497 return (spec[1:2] << 5) | BA
499 Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
500 algorithm to determin CR\_reg is modified to as follows:
502 CR_index = 7-(BA>>2) # top 3 bits but BE
504 # vector mode, 0-124 increments of 4
505 CR_index = (CR_index<<4) | (spec[1:2] << 2)
507 # scalar mode, 0-32 increments of 1
508 CR_index = (spec[1:2]<<3) | CR_index
509 # same as for v3.0/v3.1 from this point onwards
510 bit_index = 3-(BA & 0b11) # low 2 bits but BE
511 CR_reg = CR{CR_index} # get the CR
512 # finally get the bit from the CR.
513 CR_bit = (CR_reg & (1<<bit_index)) != 0
515 Note here that the decoding pattern to determine CR\_bit does not change.
517 Note: high-performance implementations may read/write Vectors of CRs in
518 batches of aligned 32-bit chunks (CR0-7, CR7-15). This is to greatly
519 simplify internal design. If instructions are issued where CR Vectors
520 do not start on a 32-bit aligned boundary, performance may be affected.
522 ## CR fields as inputs/outputs of vector operations
524 CRs (or, the arithmetic operations associated with them)
525 may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
527 When vectorized, the CR inputs/outputs are sequentially read/written
528 to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
529 writing to CR8 (TBD evaluate) and increase sequentially from there.
532 * implementations may rely on the Vector CRs being aligned to 8. This
533 means that CRs may be read or written in aligned batches of 32 bits
534 (8 CRs per batch), for high performance implementations.
535 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
536 overwritten by vector Rc=1 operations except for very large VL
537 * CR-based predication, from CR32, is also not interfered with
538 (except by large VL).
540 However when the SV result (destination) is marked as a scalar by the
541 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
542 CR when Rc=1 is written to. This is CR0 for integer operations and CR1
545 Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD VSX which
546 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
547 v3.0B scalar operations produce a **tuple** of element results: the
548 result of the operation as one part of that element *and a corresponding
549 CR element*. Greatly simplified pseudocode:
552 # calculate the vector result of an add iregs[RT+i] = iregs[RA+i]
553 + iregs[RB+i] # now calculate CR bits CRs{8+i}.eq = iregs[RT+i]
554 == 0 CRs{8+i}.gt = iregs[RT+i] > 0 ... etc
556 If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
557 then a followup instruction must be performed, setting "reduce" mode on
558 the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
559 more flexibility in analysing vectors than standard Vector ISAs. Normal
560 Vector ISAs are typically restricted to "were all results nonzero" and
561 "were some results nonzero". The application of mapreduce to Vectorised
562 cr operations allows far more sophisticated analysis, particularly in
563 conjunction with the new crweird operations see [[sv/cr_int_predication]].
565 Note in particular that the use of a separate instruction in this way
566 ensures that high performance multi-issue OoO inplementations do not
567 have the computation of the cumulative analysis CR as a bottleneck and
568 hindrance, regardless of the length of VL.
571 SVP64 [[sv/branches]] may be used, even when the branch itself is to
572 the following instruction. The combined side-effects of CTR reduction
573 and VL truncation provide several benefits.
575 (see [[discussion]]. some alternative schemes are described there)
577 ## Rc=1 when SUBVL!=1
579 sub-vectors are effectively a form of Packed SIMD (length 2 to 4). Only 1 bit of
580 predicate is allocated per subvector; likewise only one CR is allocated
583 This leaves a conundrum as to how to apply CR computation per subvector,
584 when normally Rc=1 is exclusively applied to scalar elements. A solution
585 is to perform a bitwise OR or AND of the subvector tests. Given that
586 OE is ignored in SVP64, this field may (when available) be used to select OR or
589 ### Table of CR fields
591 CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
592 so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
594 CRs are not stored in SPRs: they are registers in their own right.
595 Therefore context-switching the full set of CRs involves a Vectorised
596 mfcr or mtcr, using VL=64, elwidth=8 to do so. This is exactly as how
597 scalar OpenPOWER context-switches CRs: it is just that there are now
600 The 64 SV CRs are arranged similarly to the way the 128 integer registers
601 are arranged. TODO a python program that auto-generates a CSV file
602 which can be included in a table, which is in a new page (so as not to
603 overwhelm this one). [[svp64/cr_names]]
607 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
608 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
610 Instructions are broken down by Register Profiles as listed in the
611 following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
612 indicates that the operations with this Register Profile cannot be
613 Vectorised (mtspr, bc, dcbz, twi)
615 TODO generate table which will be here [[svp64/reg_profiles]]
617 # SV pseudocode illilustration
619 ## Single-predicated Instruction
621 illustration of normal mode add operation: zeroing not included, elwidth
622 overrides not included. if there is no predicate, it is set to all 1s
624 function op_add(rd, rs1, rs2) # add not VADD!
625 int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd);
626 for (i = 0; i < VL; i++)
627 STATE.srcoffs = i # save context if (predval & 1<<i) # predication
629 ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2]; if (!int_vec[rd
631 if (rd.isvec) { id += 1; } if (rs1.isvec) { irs1 += 1; } if
632 (rs2.isvec) { irs2 += 1; } if (id == VL or irs1 == VL or irs2 ==
634 # end VL hardware loop STATE.srcoffs = 0; # reset return;
637 This has several modes:
639 * RT.v = RA.v RB.v * RT.v = RA.v RB.s (and RA.s RB.v) * RT.v = RA.s RB.s *
640 RT.s = RA.v RB.v * RT.s = RA.v RB.s (and RA.s RB.v) * RT.s = RA.s RB.s
642 All of these may be predicated. Vector-Vector is straightfoward.
643 When one of source is a Vector and the other a Scalar, it is clear that
644 each element of the Vector source should be added to the Scalar source,
645 each result placed into the Vector (or, if the destination is a scalar,
646 only the first nonpredicated result).
648 The one that is not obvious is RT=vector but both RA/RB=scalar.
649 Here this acts as a "splat scalar result", copying the same result into
650 all nonpredicated result elements. If a fixed destination scalar was
651 intended, then an all-Scalar operation should be used.
653 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
655 # Assembly Annotation
657 Assembly code annotation is required for SV to be able to successfully
658 mark instructions as "prefixed".
660 A reasonable (prototype) starting point:
666 * ew=8/16/32 - element width
667 * sew=8/16/32 - source element width
669 * mode=reduce/satu/sats/crpred
670 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
673 similar to x86 "rex" prefix.
675 For actual assembler:
677 sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
681 * m={pred}: predicate mask mode
682 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
683 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
684 * ew={N}: ew=8/16/32 - sets elwidth override
685 * sw={N}: sw=8/16/32 - sets source elwidth override
686 * ff={xx}: see fail-first mode
687 * pr={xx}: see predicate-result mode
688 * sat{x}: satu / sats - see saturation mode
689 * mr: see map-reduce mode
690 * mr.svm see map-reduce with sub-vector mode
691 * crm: see map-reduce CR mode
692 * crm.svm see map-reduce CR with sub-vector mode
693 * sz: predication with source-zeroing
694 * dz: predication with dest-zeroing
699 - pm=lt/gt/le/ge/eq/ne/so/ns OR
702 - ff=lt/gt/le/ge/eq/ne/so/ns OR
708 - mr OR crm: "normal" map-reduce mode or CR-mode.
709 - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
711 # Proposed Parallel-reduction algorithm
714 /// reference implementation of proposed SimpleV reduction semantics.
716 // reduction operation -- we still use this algorithm even
717 // if the reduction operation isn't associative or
719 /// `temp_pred` is a user-visible Vector Condition register
721 /// all input arrays have length `vl`
722 def reduce( vl, vec, pred, pred,):
726 for i in (0..vl).step_by(step)
727 other = i + step / 2;
728 other_pred = other < vl && pred[other];
729 if pred[i] && other_pred
730 vec[i] += vec[other];
733 pred[i] |= other_pred;
735 def reduce( vl, vec, pred, pred,):
737 vi = [] # array of lookup indices to skip nonpredicated
738 for i, pbit in enumerate(pred):
745 for i in (0..vl).step_by(step)
746 other = vi[i + halfstep]
748 other_pred = other < vl && pred[other]
749 if pred[i] && other_pred
751 pred[i] |= other_pred