# RFC ls008 SVP64 Management instructions

* <https://libre-soc.org/openpower/sv/>
* <https://libre-soc.org/openpower/sv/rfc/ls008/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1040>
* <https://git.openpower.foundation/isa/PowerISA/issues/87>

**Books and Section affected**:

    Book I, new Scalar Chapter. (Or, new Book on "Zero-Overhead Loop Subsystem")
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

    setvl - Cray-style "Set Vector Length" instruction
    svstep - Vertical-First Mode explicit Step and Status
    svremap - Re-Mapping of Register Element Offsets
    svindex - General-purpose setting of SHAPEs to be re-mapped
    svshape - Hardware-level setting of SHAPEs for element re-mapping
    svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

Addition of six new "Zero-Overhead-Loop-Control" DSP-style Vector-style
Management Instructions which can be implemented extremely efficiently
and effectively by inserting an additional phase between Decode and Issue.
More complex designs are NOT adversely impacted and in fact greatly benefit
whilst still retaining an obvious linear sequential execution programming model.

**Impact on software**:

Requires support for new instructions in assembler, debuggers,

Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control,
Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model

**Notes and Observations**:

Add the following entries to:

* Section 1.3.2 Notation
* the Appendices of Book I
* Instructions of Book I as a new Section
* SVL-Form of Book I Section 1.6.1.6 and 1.6.2

# Notation, Section 1.3.2

When register operands (RA, RT, BF) are prefixed by a single underscore
(`_RT`, `_RA`, `_BF`) the variable contains the contents of the instruction field,
not the contents of the Register File referenced *by* that field. Example:
`_RT` contains the contents of bits 6 thru 10. The relationship
`RT = GPR(_RT)` is thus always true. Uses include making alternative
decisions within an instruction based on whether the operand field
# svstep: Vertical-First Stepping and status reporting

* svstep RT,SVi,vf (Rc=0)
* svstep. RT,SVi,vf (Rc=1)

| 0-5|6-10|11-15|16-22 | 23-25    | 26-30 |31| Form     |
|----|----|-----|------|----------|-------|--|----------|
|PO  | RT | /   | SVi  | / / vf   | XO    |Rc| SVL-Form |

    if SVi[3:4] = 0b11 then
    # store pack and unpack in SVSTATE
    SVSTATE[53] <- SVi[5]
    SVSTATE[54] <- SVi[6]
    RT <- [0]*62 || SVSTATE[53:54]
    # Vertical-First explicit stepping.
    step <- SVSTATE_NEXT(SVi, vf)

Special Registers Altered:

| 0-5|6-10|11-15|16-22 | 23 24 25 | 26-30 |31| Form     |
| -- | -- | --- | ---- |----------| ----- |--|----------|
|PO  | RT | RA  | SVi  | ms vs vf | XO    |Rc| SVL-Form |

* setvl RT,RA,SVi,vf,vs,ms (Rc=0)
* setvl. RT,RA,SVi,vf,vs,ms (Rc=1)

    overflow <- 0b0 # sets CR.SO if set and if Rc=1
    if ms = 1 then MVL <- VLimm[0:6]
    else MVL <- SVSTATE[0:6]
    if vs = 0 then VL <- SVSTATE[7:13]
    else if _RA != 0 then
    if (RA) >u 0b1111111 then
    else VL <- (RA)[57:63]
    else if _RT = 0 then VL <- VLimm[0:6]
    else if CTR >u 0b1111111 then
    else VL <- CTR[57:63]
    # limit VL to within MVL
    GPR(_RT) <- [0]*57 || VL
    # MAXVL is a static "state-reset" opportunity so VF is only set then.
    SVSTATE[63] <- vf # set Vertical-First mode
    SVSTATE[62] <- 0b0 # clear persist bit

Special Registers Altered:

* `SVi` - bits 16-22 - an immediate operand for setting MVL and/or VL
* `ms` - bit 23 - allows for setting of MVL
* `vs` - bit 24 - allows for setting of VL
* `vf` - bit 25 - sets "Vertical First Mode".

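The field positions listed above can be illustrated with a minimal Python sketch that packs and unpacks an SVL-Form word using Power ISA MSB0 bit numbering (bit 0 is the most significant bit). The function names `encode_svl` and `field` are hypothetical helpers, and the `po` and `xo` values used with them are placeholders: this document does not state the actual opcode assignments.

```python
def encode_svl(po, rt, ra, svi, ms, vs, vf, xo, rc):
    """Pack SVL-Form fields into a 32-bit word (MSB0 bit numbering).

    po and xo are caller-supplied placeholders, not real opcode values.
    """
    # (value, width in bits), listed from bit 0 (MSB) down to bit 31 (LSB)
    fields = [
        (po, 6), (rt, 5), (ra, 5), (svi, 7),
        (ms, 1), (vs, 1), (vf, 1), (xo, 5), (rc, 1),
    ]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def field(word, start, end):
    """Extract MSB0-numbered bits start..end (inclusive) of a 32-bit word."""
    width = end - start + 1
    return (word >> (31 - end)) & ((1 << width) - 1)


# Placeholder opcode values, chosen only to exercise the bit layout.
word = encode_svl(po=0b111111, rt=3, ra=4, svi=7,
                  ms=1, vs=0, vf=1, xo=0b10101, rc=1)
print(field(word, 16, 22))  # the SVi field
```

Reading a field back with `field(word, 23, 23)` returns `ms`, `field(word, 24, 24)` returns `vs`, and `field(word, 25, 25)` returns `vf`, matching the bit positions in the list.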
Note that in immediate setting mode VL and MVL start from **one**
but that this is compensated for in the assembly notation,
i.e. an immediate value of 1 in assembler notation
actually places the value 0b0000000 in the `SVi` field bits:
on execution the `setvl` instruction adds one to the decoded
`SVi` field bits, resulting in
VL/MVL being set to 1. This allows VL to be set to values
ranging from 1 to 128 with only 7 bits instead of 8.

to 0 would result in all Vector operations becoming `nop`. If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].

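As an illustration of the offset-by-one immediate described above, the following Python sketch models what an assembler and the instruction would each do. The names `encode_svi` and `decode_svi` are hypothetical and are not part of any toolchain.

```python
def encode_svi(asm_value):
    """Assembler side: map an assembly-notation length (1..128) to the
    7-bit SVi field by subtracting one."""
    if not 1 <= asm_value <= 128:
        raise ValueError("immediate VL/MVL must be in the range 1..128")
    return asm_value - 1


def decode_svi(svi_field):
    """Execution side: setvl adds one to the decoded 7-bit SVi field."""
    return (svi_field & 0b1111111) + 1


print(format(encode_svi(1), "07b"))   # field bits for an assembler value of 1
print(decode_svi(encode_svi(8)))      # round trip of an assembler value of 8
```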
Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise

    setvli VL=8 : setvl r0, r0, VL=8, vf=0, vs=1, ms=0
    setvli. VL=8 : setvl. r0, r0, VL=8, vf=0, vs=1, ms=0
    setmvli MVL=8 : setvl r0, r0, MVL=8, vf=0, vs=0, ms=1
    setmvli. MVL=8 : setvl. r0, r0, MVL=8, vf=0, vs=0, ms=1

Additional pseudo-op for obtaining VL without modifying it (or any state):

    getvl r5 : setvl r5, r0, vf=0, vs=0, ms=0
    getvl. r5 : setvl. r5, r0, vf=0, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction. Doing so would require two instructions.

**Selecting sources for VL**

There is considerable opcode pressure; consequently, setting MVL and VL
from different sources is done as follows:

| condition | effect |
| --------- | ------ |
| `vs=1, RA=0, RT!=0` | VL,RT set to MIN(MVL, CTR) |
| `vs=1, RA=0, RT=0` | VL set to MIN(MVL, SVi+1) |
| `vs=1, RA!=0, RT=0` | VL set to MIN(MVL, RA) |
| `vs=1, RA!=0, RT!=0` | VL,RT set to MIN(MVL, RA) |

The reasoning here is that the opportunity to set RT equal to the
immediate `SVi+1` is sacrificed in favour of setting from CTR.

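The four `vs=1` rows of the table can be modelled with a short Python sketch. This is an illustrative model only: `select_vl` and its parameter names are hypothetical, `ra_field` and `rt_field` stand for the instruction field numbers (`_RA`, `_RT`), `ra_value` stands for the contents of the register, and overflow reporting is deliberately left out.

```python
def select_vl(ra_field, rt_field, ra_value, ctr, svi, mvl):
    """Model the vs=1 rows of the source-selection table.

    Returns (vl, rt_written): the new VL and whether RT also receives it.
    Overflow handling (CR0.SO) is not modelled here.
    """
    if ra_field != 0:
        requested = ra_value      # RA != 0: take the length from (RA)
    elif rt_field != 0:
        requested = ctr           # RA = 0, RT != 0: take it from CTR
    else:
        requested = svi + 1       # RA = 0, RT = 0: take it from the immediate
    vl = min(mvl, requested)      # VL is always limited to MVL
    return vl, rt_field != 0


# RA=0, RT=r5: VL and RT come from CTR, limited to MVL.
print(select_vl(ra_field=0, rt_field=5, ra_value=0, ctr=100, svi=3, mvl=8))
```

Note how the combination `RA=0, RT!=0` never uses the immediate, which is the trade-off the preceding paragraph describes.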
# Unusual Rc=1 behaviour

Normally, the return result from an instruction is in `RT`. With
it being possible for `RT=0` to mean that `CTR` mode is to be read,
some different semantics are needed.

CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
overflow may occur: `VL`, if set either from an immediate or from `CTR`,
may not exceed `MAXVL`, and if it does, `CR0.SO` must be set.

Additionally, in reality it is **`VL`** being set. Therefore, rather
than `CR0` testing `RT` when `Rc=1`, CR0.EQ is set if `VL=0`, CR0.GE
is set if `VL` is non-zero.

# Vertical First Mode

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction. **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B

An explicit mode of setvl is called which can move srcstep and
dststep on to the next element, still respecting predicate

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**, then an explicit instruction is used to
advance srcstep/dststep. An outer loop is expected to be
used (branch instruction) which completes a series of
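
The difference in execution order between the two modes can be sketched in Python. This is a simplified model under stated assumptions: predication, REMAP and sub-vectors are ignored, the instruction names are arbitrary strings, and `horizontal_first` and `vertical_first` are hypothetical helper names.

```python
def horizontal_first(instructions, vl):
    """Each instruction loops through elements 0..VL-1 before the PC moves on."""
    return [(insn, step) for insn in instructions for step in range(vl)]


def vertical_first(instructions, vl):
    """Every instruction executes one element at the same step; an explicit
    step instruction plus an outer branch then advances to the next element."""
    trace = []
    for step in range(vl):            # outer loop: branch plus explicit step
        for insn in instructions:     # PC advances through the loop body
            trace.append((insn, step))
    return trace


body = ["sv.add", "sv.mul"]
print(horizontal_first(body, 2))
print(vertical_first(body, 2))
```

Both traces contain the same (instruction, element) pairs; only their ordering differs.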

`svfstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loops reach an end condition, or if VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0 (when Rc=1).

When RT is not zero, an internal stepping index may also be returned,
either the REMAP index or srcstep or dststep. This table is identical
to that of [[sv/svstep]]:

* `SVi=1`: also include inner middle and outer
  loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
* `SVi=2`: test SVSTATE1 (and return conditions)
* `SVi=3`: test SVSTATE2 (and return conditions)
* `SVi=4`: test SVSTATE3 (and return conditions)
* `SVi=5`: `SVSTATE.srcstep` is returned.
* `SVi=6`: `SVSTATE.dststep` is returned.

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSHAPEs) should be stored on the stack.*

Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if the "single element". Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
due to the order in which VL and SUBVL loops are applied being
swapped (outer-inner becomes inner-outer).

    setvl a3, a0, MVL=8 # update a3 with vl
                        # (# of elements this iteration)
    # do vector operations at up to 8 length (MVL=8)
    sub a0, a0, a3 # Decrement count by vl
    bnez a0, loop # Any more?

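The strip-mining pattern in the fragment above can be modelled in Python as follows. This is a sketch, not a translation of the full loop: the vector body is omitted, `strip_mine` is a hypothetical name, and unlike the assembly (which tests the count at the bottom of the loop) this model checks the remaining count before each iteration.

```python
def strip_mine(count, mvl=8):
    """Split `count` elements into chunks of at most `mvl`, returning a list
    of (start_index, vl) pairs, one per loop iteration."""
    chunks = []
    start = 0
    remaining = count
    while remaining != 0:
        vl = min(remaining, mvl)   # setvl: VL = MIN(count, MVL)
        chunks.append((start, vl)) # vector operations on vl elements go here
        start += vl
        remaining -= vl            # sub: decrement count by vl
    return chunks


print(strip_mine(20))  # 20 elements processed with MVL=8
```

The final iteration naturally handles the leftover elements, so no separate scalar tail loop is needed.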
    setvli. r4, r3, MVL=64

## Load/Store-Multi (selective)

Up to 64 FPRs will be loaded here. `r3` has one bit set
for each FP register required to be loaded. The block of memory
from which the registers are loaded is contiguous (no gaps):
any FP register which has a corresponding zero bit in `r3`
is *unaltered*. In essence this is a selective LD-multi with
"Scatter" capability.

    setvli r0, MVL=64, VL=64
    sv.fld/dm=r3 *r0, 0(r30) # selective load 64 FP registers

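A behavioural sketch of the selective load in Python may help. It assumes, for illustration only, that bit `i` of the mask (counting from the least significant bit) corresponds to register `i`; that bit ordering is an assumption of this sketch and is not stated in this section. `selective_load` is a hypothetical name.

```python
def selective_load(mask, memory, regs):
    """Load consecutive memory values into only those registers whose mask
    bit is set; registers with a zero bit are left unaltered.

    Assumes mask bit i (LSB = bit 0) selects register i.
    """
    regs = list(regs)
    src = 0                          # memory is read contiguously (no gaps)
    for dst in range(len(regs)):
        if (mask >> dst) & 1:
            regs[dst] = memory[src]
            src += 1                 # only advances when a register is loaded
    return regs


# Registers 1 and 3 are selected; registers 0 and 2 keep their old values.
print(selective_load(0b1010, [1.5, 2.5], [0.0, 0.0, 0.0, 0.0]))
```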
Up to 64 FPRs will be saved, here. Again, `r3`

    setvli r0, MVL=64, VL=64
    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers

The format of the SVSTATE SPR is as follows:

| Field | Name     | Description           |
| ----- | -------- | --------------------- |
| 0:6   | maxvl    | Max Vector Length     |
| 7:13  | vl       | Vector Length         |
| 14:20 | srcstep  | for srcstep = 0..VL-1 |
| 21:27 | dststep  | for dststep = 0..VL-1 |
| 28:29 | dsubstep | for substep = 0..SUBVL-1 |
| 30:31 | ssubstep | for substep = 0..SUBVL-1 |
| 32:33 | mi0      | REMAP RA/FRA/BFA SVSHAPE0-3 |
| 34:35 | mi1      | REMAP RB/FRB/BFB SVSHAPE0-3 |
| 36:37 | mi2      | REMAP RC/FRT SVSHAPE0-3 |
| 38:39 | mo0      | REMAP RT/FRT/BF SVSHAPE0-3 |
| 40:41 | mo1      | REMAP EA/RS/FRS SVSHAPE0-3 |
| 42:46 | SVme     | REMAP enable (RA-RT)  |
| 47:52 | rsvd     | reserved              |
| 53    | pack     | PACK (srcstep reorder) |
| 54    | unpack   | UNPACK (dststep order) |
| 55:61 | hphint   | Horizontal Hint       |
| 62    | RMpst    | REMAP persistence     |
| 63    | vfirst   | Vertical First mode   |

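The field layout in the table can be exercised with a small Python sketch that reads and writes fields of a 64-bit SVSTATE value. It uses the Power ISA MSB0 convention, in which bit 0 is the most significant bit and bit 63 the least significant. `SVSTATE_FIELDS`, `get_field` and `set_field` are hypothetical names introduced only for this illustration.

```python
# (first bit, last bit) of each field, MSB0 numbering, copied from the table
SVSTATE_FIELDS = {
    "maxvl": (0, 6), "vl": (7, 13), "srcstep": (14, 20), "dststep": (21, 27),
    "dsubstep": (28, 29), "ssubstep": (30, 31), "mi0": (32, 33),
    "mi1": (34, 35), "mi2": (36, 37), "mo0": (38, 39), "mo1": (40, 41),
    "SVme": (42, 46), "rsvd": (47, 52), "pack": (53, 53), "unpack": (54, 54),
    "hphint": (55, 61), "RMpst": (62, 62), "vfirst": (63, 63),
}


def get_field(svstate, name):
    """Return the value of a named field from a 64-bit SVSTATE integer."""
    start, end = SVSTATE_FIELDS[name]
    width = end - start + 1
    return (svstate >> (63 - end)) & ((1 << width) - 1)


def set_field(svstate, name, value):
    """Return a copy of svstate with the named field replaced by value."""
    start, end = SVSTATE_FIELDS[name]
    mask = (1 << (end - start + 1)) - 1
    if not 0 <= value <= mask:
        raise ValueError(f"{value} does not fit in field {name}")
    shift = 63 - end
    return (svstate & ~(mask << shift)) | (value << shift)


state = set_field(0, "maxvl", 8)
state = set_field(state, "vl", 5)
state = set_field(state, "vfirst", 1)
print(get_field(state, "vl"), get_field(state, "vfirst"))
```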
* The entries are truncated to be within range. Attempts to set VL to
  greater than MAXVL will truncate VL.
* Setting srcstep, dststep to 64 or greater, or VL or MVL to greater
  than 64 is reserved and will cause an illegal instruction trap.

SVSTATE is a standard SPR that (if REMAP is not activated) contains sufficient
self-contained information for a full context save/restore.
SVSTATE contains (and permits setting of):

* MVL (the Maximum Vector Length) - declares (statically) how
  much of a regfile is to be reserved for Vector elements
* dststep - the destination element offset of the current parallel
  instruction being executed
* srcstep - for twin-predication, the source element offset as well.
* ssubstep - the source subvector element offset of the current
  parallel instruction being executed
* dsubstep - the destination subvector element offset of the current
  parallel instruction being executed
* vfirst - Vertical First mode. srcstep, dststep and substep
  **do not advance** unless explicitly requested to do so with
  pseudo-op svstep (a mode of setvl)
* RMpst - REMAP persistence. REMAP will apply only to the following
  instruction unless this bit is set, in which case REMAP "persists".
  Reset (cleared) on use of the `setvl` instruction if used to
* Pack - if set then srcstep/substep VL/SUBVL loop-ordering is inverted.
* UnPack - if set then dststep/substep VL/SUBVL loop-ordering is inverted.
* hphint - Horizontal Parallelism Hint. Indicates that
  no Hazards exist between groups of elements in sequential multiples of this number
  (before REMAP). By definition: elements for which `FLOOR(srcstep/hphint)` is
  equal *before REMAP* are in the same parallelism "group". In Vertical First Mode
  hardware **MUST ONLY** process elements in the same group, and must stop
  Horizontal Issue at the last element of a given group. Set to zero to indicate "no hint".
* SVme - REMAP enable bits, indicating which register is to be
  REMAPed: RA, RB, RC, RT and EA are the canonical (typical) register names
  associated with each bit, with RA being the LSB and EA being the MSB.
  See table below for ordering. When `SVme` is zero (0b00000) REMAP
  is **fully disabled and inactive** regardless of the contents of
  `SVSTATE`, `mi0-mi2/mo0-mo1`, or the four `SVSHAPEn` SPRs
* mi0-mi2/mo0-mo1 - these indicate the SVSHAPE (0-3) that the
  corresponding register (RA etc) should use, as long as the
  register's corresponding SVme bit is set

Programmer's Note: the fact that REMAP is entirely dormant when `SVme` is zero
allows establishment of REMAP context well in advance, followed by utilising `svremap`
at a precise (or the very last) moment. Some implementations may exploit this
to cache (or take some time to prepare caches) in the background whilst other
(unrelated) instructions are being executed. This is particularly important to
bear in mind when using `svindex` which will require hardware to perform (and
cache) additional GPR reads.

Programmer's Note: when REMAP is activated it becomes necessary on any
context-switch (Interrupt or Function call) to detect (or know in advance)
that REMAP is enabled and to additionally save/restore the four SVSHAPE
SPRs, SVSHAPE0-3. Given that this is expected to be a rare occurrence it was
deemed unreasonable to burden every context-switch or function call with
mandatory save/restore of SVSHAPEs, and consequently it is a *callee*
(and Trap Handler) responsibility. Callees (and Trap Handlers) **MUST**
avoid using any and all SVP64 instructions during the period where state
could be adversely affected. SVP64 purely relies on Scalar instructions,
so Scalar instructions (except the SVP64 Management ones and mtspr and
mfspr) are 100% guaranteed to have zero impact on SVP64 state.

**Max Vector Length (maxvl)** <a name="mvl" />

MAXVECTORLENGTH is the same concept as MVL in RISC-V RVV, except that it
is variable length and may be dynamically set (normally from an immediate
field only). MVL is limited to 7 bits
(in the first version of SVP64) and consequently the maximum number of
elements is limited to between 0 and 127.

Programmer's Note: Except by directly using `mtspr` on SVSTATE, which may
result in performance penalties on some hardware implementations, SVSTATE's `maxvl`
field may only be set **statically** as an immediate, by the `setvl` instruction.
It may **NOT** be set dynamically from a register. Compiler writers and assembly
programmers are expected to perform static register file analysis, subdivision,
and allocation and only utilise `setvl`. Direct writing to SVSTATE in order to
"bypass" this Note could, in less-advanced implementations, potentially cause stalling,
particularly if SVP64 instructions are issued directly after the `mtspr` to SVSTATE.

**Vector Length (vl)** <a name="vl" />

The actual Vector length, the number of elements in a "Vector", `SVSTATE.vl`, may be set
entirely dynamically at runtime from a number of sources. `setvl` is the primary
instruction for setting Vector Length.
`setvl` is conceptually similar to, but different from, the Cray, SX Aurora, and RISC-V RVV
equivalents. Similar to RVV, VL is set to be within
the range 0 <= VL <= MVL. Unlike RVV, VL is set **exactly** according to the following:

    VL = (RT|0) = MIN(vlen, MVL)

where 0 <= MVL <= 127 and vlen may come from an immediate, `RA`, or from the `CTR` SPR,
depending on options selected with the `setvl` instruction.

Programmer's Note: conceptual understanding of Cray-style Vectors is far beyond the scope
of the Power ISA Technical Reference. Guidance on the 50-year-old Cray Vector paradigm is
best sought elsewhere: good studies include Academic Courses given on the 1970s
Cray Supercomputers over at least the past three decades.

**SUBVL - Sub Vector Length**

This is a "group by quantity" that effectively asks each iteration
of the hardware loop to load SUBVL elements of width elwidth at a
time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
operation issued, SUBVL operations are issued.

The main effect of SUBVL is that predication bits are applied per
**group**, rather than by individual element. Legal values are 0 to 3,
representing 1 operation (1 element) thru 4 operations (4 elements) respectively.
Elements are best thought of in the context of 3D, Audio and Video: two Left and Right
Channel "elements" or four ARGB "elements", or three XYZ coordinate "elements".

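Per-group predication can be illustrated with a small Python sketch. It assumes, purely for illustration, that predicate bit `i` (counting from the least significant bit) governs sub-vector group `i`; `apply_subvl_predicate` is a hypothetical name and the element values are arbitrary.

```python
def apply_subvl_predicate(elements, predicate, subvl_field):
    """Return the sub-vector groups whose predicate bit is set.

    subvl_field is the encoded value 0..3, meaning 1..4 elements per group.
    Assumes predicate bit i (LSB = bit 0) applies to group i as a whole.
    """
    group_size = subvl_field + 1
    selected = []
    for first in range(0, len(elements), group_size):
        group_index = first // group_size
        if (predicate >> group_index) & 1:
            selected.append(tuple(elements[first:first + group_size]))
    return selected


# Two ARGB pixels (SUBVL encoded as 3): only the second pixel is enabled.
pixels = ["A0", "R0", "G0", "B0", "A1", "R1", "G1", "B1"]
print(apply_subvl_predicate(pixels, 0b10, 3))
```

A single predicate bit therefore enables or disables all of the elements of a pixel, coordinate or channel pair together.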
`subvl` is again primarily set by the `setvl` instruction. Not to be confused

Directly related to `subvl` are the `pack` and `unpack` Mode bits of `SVSTATE`.
See `svstep` instruction for how to set Pack and Unpack Modes.

**Horizontal Parallelism**

A problem exists for hardware where it may not be able to detect
that a programmer (or compiler) knows of opportunities for parallelism
and lack of overlap between loops.

For hphint, the number chosen must be consistently
executed **every time**. Hardware is not permitted to execute five
computations for one instruction then three on the next.
hphint is a hint from the compiler to hardware that exactly this
many elements may be safely executed in parallel, without hazards
(including Memory accesses).
Interestingly, when hphint is set equal to VL, it is in effect
as if Vertical First mode were not set, because the hardware is
given the option to run through all elements in an instruction.
This is exactly what Horizontal-First is: a for-loop from 0 to VL-1
except that the hardware may *choose* the number of elements.

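The grouping rule given in the `hphint` definition (elements with equal `FLOOR(srcstep/hphint)` belong to the same group) can be sketched in Python. `hphint_groups` is a hypothetical helper, and REMAP is ignored since the rule applies before REMAP.

```python
def hphint_groups(vl, hphint):
    """Partition element indices 0..VL-1 into parallelism groups using
    FLOOR(srcstep / hphint). An hphint of zero means "no hint"."""
    if hphint == 0:
        return []
    groups = {}
    for srcstep in range(vl):
        groups.setdefault(srcstep // hphint, []).append(srcstep)
    return [groups[key] for key in sorted(groups)]


print(hphint_groups(10, 4))   # VL=10 split into groups of at most 4
print(hphint_groups(8, 8))    # hphint equal to VL gives one group
```

With `hphint` equal to VL the whole vector falls into a single group, which corresponds to the observation above that the hardware may then run through all elements of an instruction.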
*Note to programmers: changing VL during the middle of such modes
should be done only with due care and respect for the fact that SVSTATE
has exactly the same peer-level status as a Program Counter.*

Add the following to Book I, 1.6.1, SVL-Form

    |0     |6    |11    |16   |23 |24 |25 |26   |31 |
    | PO   |  RT |   RA | SVi |ms |vs |vf | XO  |Rc |
    | PO   |  RT | /    | SVi |/  |/  |vf | XO  |Rc |

* Add `SVL` to `RA (11:15)` Field in Book I, 1.6.2
* Add `SVL` to `RT (6:10)` Field in Book I, 1.6.2
* Add `SVL` to `Rc (31)` Field in Book I, 1.6.2
* Add `SVL` to `XO (26:30)` Field in Book I, 1.6.2

Add the following to Book I, 1.6.2

Field used in Simple-V to specify whether MVL (maxvl in the SVSTATE SPR)

Field used in Simple-V to specify whether "Vertical" Mode is set
(vfirst in the SVSTATE SPR)

Field used in Simple-V to specify whether VL (vl in the SVSTATE SPR) is to be set

Simple-V immediate field used by setvl for setting VL or MVL
(vl, maxvl in the SVSTATE SPR)
and used as a "Mode of Operation" selector in svstep

    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVL  | I    | #    | 3.0B    | svstep   | Vertical-First Stepping and status reporting |
| SVL  | I    | #    | 3.0B    | setvl    | Cray-like establishment of Looping (Vector) context |