# RFC ls008 SVP64 Management instructions

[[!tag opf_rfc]]

**URLs**:

* <https://libre-soc.org/openpower/sv/>
* <https://libre-soc.org/openpower/sv/rfc/ls008/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1040>
* <https://git.openpower.foundation/isa/PowerISA/issues/87>

**Severity**: Major

**Status**: New

**Date**: 24 Mar 2023

**Target**: v3.2B

**Source**: v3.0B

**Books and Section affected**:

```
    Book I, new Scalar Chapter.  (Or, new Book on "Zero-Overhead Loop Subsystem")
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic
```

**Summary**

```
    setvl    - Cray-style "Set Vector Length" instruction
    svstep   - Vertical-First Mode explicit Step and Status
    svremap  - Re-Mapping of Register Element Offsets
    svindex  - General-purpose setting of SHAPEs to be re-mapped
    svshape  - Hardware-level setting of SHAPEs for element re-mapping
    svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
    Addition of six new "Zero-Overhead-Loop-Control" DSP-style Vector-style
    Management Instructions which can be implemented extremely efficiently
    and effectively by inserting an additional phase between Decode and Issue.
    More complex designs are NOT adversely impacted and in fact greatly benefit.
```

**Impact on software**:

```
    Requires support for new instructions in assembler, debuggers,
    and related tools.
```

**Keywords**:

```
    Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
    Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
    Digital Signal Processing (DSP)
```

**Motivation**

Power ISA is synonymous with Supercomputing, and the early Supercomputers
(ETA-10, ILLIAC-IV, CDC200, Cray) had Vectorisation. It is therefore anomalous
that Power ISA does not have Scalable Vectors.  This presents the opportunity to
modernise Power ISA and keep it at the top of Supercomputing.

**Notes and Observations**:

1. SVP64 is designed to scale from ultra-light-weight Embedded use-cases all the
  way up to moving the bar of Supercomputing orders of magnitude above its present
  perception, whilst retaining at all times Sequential Programming Execution.
2. This proposal is the **base** for further Extensions, to be discussed at a later
  time.  These include extending SVP64 onto the Scalar VSX instructions (with a
  **LONG TERM** view in 10+ years to deprecating the PackedSIMD aspects of VSX),
  the potential for extending VSX registers to 128 or beyond, and extending
  Arithmetic operations to a runtime-selectable choice of 128-bit, 256-bit,
  512-bit or 1024-bit.
3. Massive reductions in instruction count of between 2x and 20x have been demonstrated
  with SVP64, which is far beyond anything ever achieved by any *general-purpose*
  ISA Extension added to any ISA in the history of Computing.

**Changes**

Add the following entries to:

* Section 1.3.2 Notation
* the Appendices of Book I
* Instructions of Book I as a new Section
* SVL-Form of Book I Sections 1.6.1.6 and 1.6.2

----------------

\newpage{}

# Notation, Section 1.3.2

When register operands (RA, RT, BF) are prefixed by a single underscore
(_RT, _RA, _BF), the variable contains the contents of the instruction field,
not the contents of the Register File entry referenced *by* that field. Example:
`_RT` contains the contents of instruction bits 6 through 10. The relationship
`RT = GPR(_RT)` is thus always true. Uses include making alternative
decisions within an instruction based on whether the operand field
is zero or non-zero.

----------------

\newpage{}

# svstep: Vertical-First Stepping and status reporting

SVL-Form

* svstep RT,SVi,vf (Rc=0)
* svstep. RT,SVi,vf (Rc=1)

| 0-5|6-10|11-15|16-22 | 23-25    | 26-30 |31|   Form   |
|----|----|-----|------|----------|-------|--|----------|
|PO  | RT | /   | SVi  |  / / vf  | XO    |Rc| SVL-Form |

Pseudo-code:

```
    if SVi[3:4] = 0b11 then
        # store pack and unpack in SVSTATE
        SVSTATE[53] <- SVi[5]
        SVSTATE[54] <- SVi[6]
        RT <- [0]*62 || SVSTATE[53:54]
    else
        # Vertical-First explicit stepping.
        step <- SVSTATE_NEXT(SVi, vf)
        RT <- [0]*57 || step
```

Special Registers Altered:

    CR0                     (if Rc=1)

**Description**

svstep may be used to enquire about the REMAP Schedule, and it may be
used to alter Vectorisation State.  When `vf=1` stepping occurs.
When `vf=0` the enquiry is performed without altering internal
state.  If `SVi=0, Rc=0, vf=0` the instruction is a `nop`.

The following Modes exist:

* `SVi=0`: appropriately step srcstep, dststep, ssubstep and dsubstep to the next
  element, taking pack and unpack into consideration.
* When `SVi` is 1-4 the REMAP Schedule for a given SVSHAPE may be
  returned in `RT`.  SVi=1 selects the current state of SVSHAPE0,
  through to SVi=4 which selects SVSHAPE3.
* When `SVi` is 5, `SVSTATE.srcstep` is returned.
* When `SVi` is 6, `SVSTATE.dststep` is returned.
* When `SVi` is 0b1100, pack/unpack in SVSTATE are cleared.
* When `SVi` is 0b1101, pack in SVSTATE is set and unpack is cleared.
* When `SVi` is 0b1110, unpack in SVSTATE is set and pack is cleared.
* When `SVi` is 0b1111, pack/unpack in SVSTATE are set.

As this is a Single-Predicated (1P) instruction, predication may be applied
to skip (or zero) elements.

* Vertical-First Mode will return the requested index
  (and move to the next state if `vf=1`).
* Horizontal-First Mode can be used to return all indices,
  i.e. it walks through all possible states.

**Vectorisation of svstep itself**

As a 32-bit instruction, `svstep` may itself be Vector-Prefixed, as
`sv.svstep`. This works as well in Horizontal-First Mode
as it does in Vertical-First Mode.

Example: to obtain the full set of possible computed element
indices use `sv.svstep RT.v,SVi,1`, which will store all computed element
indices, starting from RT.  If Rc=1 then a co-result Vector of CR Fields
will also be returned, comprising the "loop end-points" of each of the inner
loops when either Matrix Mode or DCT/FFT is set.  In other words,
for example, when the `xdim` inner loop reaches the end and on the next
iteration will begin again at zero, the CR Field `EQ` bit will be set.
With a maximum of three loops within both Matrix and DCT/FFT Modes,
the CR Field's EQ bit will be set at the end of the first inner loop,
the LT bit for the second, the GT bit for the outermost loop and the
SO bit on the very last element, when all loops reach their maximum
extent.
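
As an illustration of the loop end-point co-result, the Python sketch below enumerates a three-dimensional Matrix schedule and computes, for each element, the four end-point flags described above. It assumes `xdim` is the innermost loop and that the outermost loop is walked exactly once, with no inversion or permutation applied; `loop_endpoints` is a hypothetical name, and this is a behavioural model only, not a statement of how hardware computes the flags.

```python
from itertools import product


def loop_endpoints(xdim, ydim, zdim):
    """Yield (x, y, z, flags) for a simple 3-D Matrix schedule.

    Assumes x is the innermost loop and z the outermost, with no
    inversion or permutation.  flags = (first inner loop ends,
    second loop ends, outermost loop ends, very last element).
    """
    total = xdim * ydim * zdim
    order = product(range(zdim), range(ydim), range(xdim))  # x varies fastest
    for idx, (z, y, x) in enumerate(order):
        x_end = x == xdim - 1              # EQ: x wraps to zero next time
        y_end = x_end and y == ydim - 1    # second inner loop ends
        z_end = y_end and z == zdim - 1    # GT: outermost loop ends
        last = idx == total - 1            # SO: the final element
        yield x, y, z, (x_end, y_end, z_end, last)
```

For a single pass over the whole Matrix, the outermost-loop flag and the last-element flag coincide on the final element.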

*Programmer's note (1): VL in some situations, particularly larger Matrices,
may exceed 64, meaning that `sv.svstep` returns a considerable number
of values. Under such circumstances `sv.svstep/ew=8` (which stores the
results as 8-bit elements) is recommended.*

*Programmer's note (2): having conveniently obtained a pre-computed
Schedule with `sv.svstep`, it may then be used as the input to Indexed
REMAP Mode to achieve the exact same Schedule. Before use, however,
some of the Indices may of course be arbitrarily altered as desired.
`sv.svstep` helps the programmer avoid having to manually recreate
Indices for certain types of common Loop patterns, and in its simplest
form, without REMAP (SVi=5 or SVi=6), is equivalent to the `iota`
instruction found in other Vector ISAs.*
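
The `iota` equivalence can be sketched in Python. The model assumes Horizontal-First execution of `sv.svstep` with SVi=5 (return srcstep), no REMAP, single predication in which bit i of the mask (counting from the least significant bit) governs element i, and non-zeroing predication so that masked-out destination elements are left unaltered. `sv_svstep_srcstep` is a hypothetical name.

```python
def sv_svstep_srcstep(vl, mask=None, dest=None):
    """Model Horizontal-First `sv.svstep RT.v` with SVi=5, without REMAP.

    Each active element writes its own srcstep into the destination
    vector, which is exactly an `iota` when no predicate is applied.
    Masked-out elements are skipped (non-zeroing predication), so their
    destination slots keep whatever value they held before.
    """
    if mask is None:
        mask = (1 << vl) - 1               # all elements active
    result = list(dest) if dest is not None else [0] * vl
    for srcstep in range(vl):
        if mask & (1 << srcstep):          # predicate bit i covers element i
            result[srcstep] = srcstep
    return result
```

Without a mask, `sv_svstep_srcstep(5)` returns `[0, 1, 2, 3, 4]`, the classic `iota` sequence.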

**Vertical First Mode**

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction.  **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
instruction.

A mode of svstep (SVi=0) is provided which moves srcstep and
dststep on to the next element, still respecting predicate
masks.

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**, after which an explicit instruction is used to
advance srcstep/dststep. An outer loop (a branch instruction) is
expected to be used to complete a series of Vector operations.

Testing any end condition of any loop of any REMAP state allows branches to be
used to create loops.

Programmer's note: when Predicate Non-Zeroing is used this indicates to
the underlying hardware that any masked-out element must be skipped.
*This includes in Vertical-First Mode*, and programmers should be keenly
aware that srcstep or dststep or both *may* jump by more than one as
a result, because the actual request under these circumstances was to execute
on the first available next *non-masked-out* element.
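
The Vertical-First flow, including the predicate-skipping behaviour just described, can be modelled in a few lines of Python. Each pass of the outer loop stands for several instructions that all operate at the same step, followed by an explicit svstep (SVi=0) that advances to the next non-masked-out element. It assumes single predication (srcstep and dststep move together), no REMAP and no SUBVL; `vf_loop` and its arguments are hypothetical.

```python
def vf_loop(vl, mask, body):
    """Run body(step) once per active element, Vertical-First style.

    `step` models srcstep/dststep (identical under single predication).
    The explicit step skips masked-out elements, so consecutive visited
    steps may differ by more than one.  Returns the steps visited.
    """
    def next_active(start):
        # first element at or after `start` whose predicate bit is set
        for i in range(start, vl):
            if mask & (1 << i):
                return i
        return None                        # end of the vector reached

    visited = []
    step = next_active(0)
    while step is not None:                # outer loop: the branch
        body(step)                         # several instructions, same step
        visited.append(step)
        step = next_active(step + 1)       # models svstep with SVi=0
    return visited
```

For example, `vf_loop(8, 0b10110010, lambda step: None)` returns `[1, 4, 5, 7]`: the jump from 1 to 4 is the predicate skipping described in the note above.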

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSHAPE SPRs) should
obviously be stored on the stack in order to achieve this benefit.*

-------------

\newpage{}

# setvl

SVL-Form

| 0-5|6-10|11-15|16-22 | 23 24 25 | 26-30 |31|   FORM   |
| -- | -- | --- | ---- |----------| ----- |--|----------|
|PO  | RT | RA  | SVi  | ms vs vf | XO    |Rc| SVL-Form |

* setvl RT,RA,SVi,vf,vs,ms (Rc=0)
* setvl. RT,RA,SVi,vf,vs,ms (Rc=1)

Pseudo-code:

```
    overflow <- 0b0    # sets CR.SO if set and if Rc=1
    VLimm <- SVi + 1
    # set or get MVL
    if ms = 1 then MVL <- VLimm[0:6]
    else           MVL <- SVSTATE[0:6]
    # set or get VL
    if vs = 0                then VL <- SVSTATE[7:13]
    else if _RA != 0         then
        if (RA) >u 0b1111111 then
            VL <- 0b1111111
            overflow <- 0b1
        else                      VL <- (RA)[57:63]
    else if _RT = 0          then VL <- VLimm[0:6]
    else if CTR >u 0b1111111 then
        VL <- 0b1111111
        overflow <- 0b1
    else                          VL <- CTR[57:63]
    # limit VL to within MVL
    if VL >u MVL then
        overflow <- 0b1
        VL <- MVL
    SVSTATE[0:6] <- MVL
    SVSTATE[7:13] <- VL
    if _RT != 0 then
        GPR(_RT) <- [0]*57 || VL
    # MAXVL is a static "state-reset" opportunity so VF is only set then.
    if ms = 1 then
        SVSTATE[63] <- vf   # set Vertical-First mode
        SVSTATE[62] <- 0b0  # clear persist bit
```

Special Registers Altered:

```
    CR0                     (if Rc=1)
```

* `SVi` - bits 16-22 - an immediate operand for setting MVL and/or VL
* `ms` - bit 23 - allows for setting of MVL
* `vs` - bit 24 - allows for setting of VL
* `vf` - bit 25 - sets "Vertical First Mode".
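
To check the VL source selection and overflow rules, the pseudo-code can be transcribed into Python. The sketch assumes MSB0 bit numbering for SVSTATE (bit 0 is the most significant of 64 bits), models the GPRs as a Python list indexed by register number, and returns the CR0 result as `(eq, nonzero, so)` following the Rc=1 description. `setvl`, `get_bits` and `set_bits` here are hypothetical model functions, not part of any existing simulator API.

```python
MASK64 = (1 << 64) - 1


def get_bits(value, start, end):
    """Extract MSB0 bits start..end (inclusive) of a 64-bit value."""
    return (value >> (63 - end)) & ((1 << (end - start + 1)) - 1)


def set_bits(value, start, end, field):
    """Return value with MSB0 bits start..end replaced by field (truncated)."""
    shift = 63 - end
    mask = ((1 << (end - start + 1)) - 1) << shift
    return (value & ~mask & MASK64) | ((field << shift) & mask)


def setvl(svstate, gpr, ctr, _rt, _ra, svi, ms, vs, vf):
    """Model of setvl. Returns (new_svstate, cr0), cr0 = (eq, nonzero, so)."""
    overflow = False
    vlimm = svi + 1                          # SVi + 1: 1..128
    mvl = vlimm if ms else get_bits(svstate, 0, 6)
    if not vs:
        vl = get_bits(svstate, 7, 13)        # keep the current VL
    elif _ra != 0:
        src = gpr[_ra] & MASK64              # (RA) as an unsigned value
        vl, overflow = (0x7F, True) if src > 0x7F else (src, False)
    elif _rt == 0:
        vl = vlimm                           # immediate source
    else:
        src = ctr & MASK64                   # CTR as an unsigned value
        vl, overflow = (0x7F, True) if src > 0x7F else (src, False)
    if vl > mvl:                             # limit VL to within MVL
        vl, overflow = mvl, True
    svstate = set_bits(svstate, 0, 6, mvl)   # only the 7 field bits are kept
    svstate = set_bits(svstate, 7, 13, vl)
    if _rt != 0:
        gpr[_rt] = vl                        # zero-extended VL
    if ms:
        svstate = set_bits(svstate, 63, 63, vf)  # Vertical-First mode
        svstate = set_bits(svstate, 62, 62, 0)   # clear persist bit
    return svstate, (vl == 0, vl != 0, overflow)
```

For example, with `gpr[3] = 1000`, the call `setvl(0, gpr, 0, _rt=4, _ra=3, svi=63, ms=1, vs=1, vf=0)` sets MVL to 64, clamps VL to 64, writes 64 to `gpr[4]` and reports overflow in the third CR0 element.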

Note that in immediate setting mode VL and MVL start from **one**,
and that this is compensated for in the assembly notation:
an immediate value of 1 in assembler notation
actually places the value 0b0000000 in the `SVi` field bits.
On execution the `setvl` instruction adds one to the decoded
`SVi` field bits, resulting in VL/MVL being set to 1. This allows
VL to be set to values ranging from 1 to 128 with only 7 bits
instead of 8. Setting VL/MVL to 0 would result in all Vector
operations becoming `nop`.  If this is truly desired (nop behaviour),
setting VL and MVL to zero is to be done via the [[SVSTATE SPR|sv/sprs]].
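
A minimal sketch of the off-by-one immediate convention, as an assembler and the executing instruction might each apply it. The helper names are hypothetical and the 1 to 128 range is taken from the paragraph above.

```python
def encode_vl_imm(n):
    """Assembler side: VL/MVL value n (1..128) -> 7-bit SVi field."""
    if not 1 <= n <= 128:
        raise ValueError("immediate VL/MVL must be in the range 1..128")
    return n - 1                           # 1 encodes as 0b0000000


def decode_vl_imm(svi):
    """Execution side: setvl adds one to the decoded SVi field."""
    return (svi & 0x7F) + 1
```

Writing `VL=1` in assembler therefore produces an `SVi` field of zero, and the two helpers round-trip every value from 1 to 128.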

Note that `setmvli` is a pseudo-op, based on RA/RT=0, and `setvli` likewise:

    setvli   VL=8   : setvl  r0, r0, VL=8, vf=0, vs=1, ms=0
    setvli.  VL=8   : setvl. r0, r0, VL=8, vf=0, vs=1, ms=0
    setmvli  MVL=8  : setvl  r0, r0, MVL=8, vf=0, vs=0, ms=1
    setmvli. MVL=8  : setvl. r0, r0, MVL=8, vf=0, vs=0, ms=1

Additional pseudo-op for obtaining VL without modifying it (or any state):

    getvl  r5      : setvl  r5, r0, vf=0, vs=0, ms=0
    getvl. r5      : setvl. r5, r0, vf=0, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction.  Doing so would require two instructions.

**Selecting sources for VL**

There is considerable opcode pressure; consequently the source from
which VL is set is selected as follows:

| condition            | effect                     |
| -------------------- | -------------------------- |
| `vs=1, RA=0, RT!=0`  | VL,RT set to MIN(MVL, CTR) |
| `vs=1, RA=0, RT=0`   | VL set to MIN(MVL, SVi+1)  |
| `vs=1, RA!=0, RT=0`  | VL set to MIN(MVL, RA)     |
| `vs=1, RA!=0, RT!=0` | VL,RT set to MIN(MVL, RA)  |

The reasoning here is that the opportunity to set RT equal to the
immediate `SVi+1` is sacrificed in favour of setting from CTR.

# Unusual Rc=1 behaviour

Normally, the return result from an instruction is in `RT`. With
it being possible for `RT=0` to mean that `CTR` mode is to be read,
some different semantics are needed.

CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
overflow may occur: `VL`, whether set from an immediate, from `RA` or from
`CTR`, may not exceed `MAXVL`, and if it would, `VL` is truncated and
`CR0.SO` must be set.

Additionally, in reality it is **`VL`** that is being set. Therefore, rather
than `CR0` testing `RT` when `Rc=1`, CR0.EQ is set if `VL=0`, and CR0.GT
is set if `VL` is non-zero.

**SUBVL**

Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if it were the "single element".  Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
due to the order in which VL and SUBVL loops are applied being
swapped (outer-inner becomes inner-outer).

# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?
```

## Loop using Rc=1

    my_fn:
      li r3, 1000
      b test
    loop:
      sub r3, r3, r4
      ...
    test:
      setvl. r4, r3, MVL=64
      bne cr0, loop
    end:
      blr
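
The Rc=1 loop can be traced with a short Python simulation. Each pass sets VL to MIN(MVL, r3), the branch falls through once CR0.EQ indicates that VL is zero, and `r3` is decremented by VL. The sketch assumes MVL=64 and ignores the vector work done in the loop body; `strip_mine` is a hypothetical name.

```python
def strip_mine(count, mvl=64):
    """Return the VL used on each pass of the Rc=1 loop example."""
    r3 = count
    passes = []
    while True:
        r4 = min(mvl, r3)        # VL = MIN(MVL, r3), also written to r4
        if r4 == 0:              # CR0.EQ set: VL is zero, loop exits
            break
        passes.append(r4)        # ... vector work on r4 elements ...
        r3 -= r4                 # sub r3, r3, r4
    return passes
```

`strip_mine(1000)` gives fifteen passes with VL=64 followed by a final pass with VL=40.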

## Load/Store-Multi (selective)

Up to 64 FPRs will be loaded, here.  `r3` has one bit set
for each FP register required to be loaded.  The block of memory
from which the registers are loaded is contiguous (no gaps):
any FP register which has a corresponding zero bit in `r3`
is *unaltered*.  In essence this is a selective LD-multi with
"Scatter" capability.

    setvli r0, MVL=64, VL=64
    sv.lfd/dm=r3 *fp0, 0(r30) # selective load 64 FP registers

Up to 64 FPRs will be saved, here.  Again, `r3` has one bit set
for each FP register required to be stored.

    setvli r0, MVL=64, VL=64
    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers
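
The selective load and store can be modelled in Python as below. Assumptions: predicate bit i of `r3` (counting from the least significant bit) governs FP register i, memory is a flat list of doublewords starting at `0(r30)`, and the load consumes memory contiguously as described above. The store side is modelled symmetrically, packing the selected registers contiguously; that is an assumption rather than something stated above. Function names are hypothetical.

```python
def selective_load(fprs, memory, mask, vl=64):
    """Model the selective FP load (dm=r3): contiguous memory, scattered regs."""
    fprs = list(fprs)
    src = 0                                # memory is read with no gaps
    for dst in range(vl):
        if mask & (1 << dst):              # destination predicate bit set
            fprs[dst] = memory[src]
            src += 1
        # registers with a zero mask bit are left unaltered
    return fprs


def selective_store(fprs, mask, vl=64):
    """Model the selective FP store (sm=r3), assuming contiguous writes."""
    return [fprs[i] for i in range(vl) if mask & (1 << i)]
```

With `mask = 0b00100101`, only registers 0, 2 and 5 receive the three doublewords read from memory; all other registers keep their previous values.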

-------------

\newpage{}

# SVSTATE SPR

The format of the SVSTATE SPR is as follows:

| Field | Name     | Description                 |
| ----- | -------- | --------------------------- |
| 0:6   | maxvl    | Max Vector Length           |
| 7:13  | vl       | Vector Length               |
| 14:20 | srcstep  | for srcstep = 0..VL-1       |
| 21:27 | dststep  | for dststep = 0..VL-1       |
| 28:29 | dsubstep | for substep = 0..SUBVL-1    |
| 30:31 | ssubstep | for substep = 0..SUBVL-1    |
| 32:33 | mi0      | REMAP RA/FRA/BFA SVSHAPE0-3 |
| 34:35 | mi1      | REMAP RB/FRB/BFB SVSHAPE0-3 |
| 36:37 | mi2      | REMAP RC/FRT SVSHAPE0-3     |
| 38:39 | mo0      | REMAP RT/FRT/BF SVSHAPE0-3  |
| 40:41 | mo1      | REMAP EA/RS/FRS SVSHAPE0-3  |
| 42:46 | SVme     | REMAP enable (RA-RT)        |
| 47:52 | rsvd     | reserved                    |
| 53    | pack     | PACK (srcstep reorder)      |
| 54    | unpack   | UNPACK (dststep reorder)    |
| 55:61 | hphint   | Horizontal Hint             |
| 62    | RMpst    | REMAP persistence           |
| 63    | vfirst   | Vertical First mode         |
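
Because the table uses MSB0 numbering (bit 0 is the most significant bit of the 64-bit SPR), converting field positions into shifts is a common source of mistakes. A minimal Python sketch, with the field positions copied from the table and hypothetical helper names:

```python
SVSTATE_FIELDS = {
    # name: (first_bit, last_bit) in MSB0 numbering; rsvd (47:52) omitted
    "maxvl": (0, 6), "vl": (7, 13), "srcstep": (14, 20), "dststep": (21, 27),
    "dsubstep": (28, 29), "ssubstep": (30, 31),
    "mi0": (32, 33), "mi1": (34, 35), "mi2": (36, 37),
    "mo0": (38, 39), "mo1": (40, 41), "SVme": (42, 46),
    "pack": (53, 53), "unpack": (54, 54), "hphint": (55, 61),
    "RMpst": (62, 62), "vfirst": (63, 63),
}


def svstate_pack(**fields):
    """Build a 64-bit SVSTATE value; omitted fields (and rsvd) are zero."""
    value = 0
    for name, field in fields.items():
        start, end = SVSTATE_FIELDS[name]
        width = end - start + 1
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name}={field} does not fit in {width} bits")
        value |= field << (63 - end)
    return value


def svstate_unpack(value):
    """Split a 64-bit SVSTATE value into its named fields."""
    return {name: (value >> (63 - end)) & ((1 << (end - start + 1)) - 1)
            for name, (start, end) in SVSTATE_FIELDS.items()}
```

For example, `maxvl` occupies the top seven bits, so `svstate_pack(maxvl=64) >> 57` is 64, while `vfirst` is the least significant bit.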

Notes:

* The entries are truncated to be within range.  Attempts to set VL to
  greater than MAXVL will truncate VL.
* Setting srcstep, dststep to 64 or greater, or VL or MVL to greater
  than 64 is reserved and will cause an illegal instruction trap.

**SVSTATE Fields**

SVSTATE is a standard SPR that (if REMAP is not activated) contains sufficient
self-contained information for a full context save/restore.
SVSTATE contains (and permits setting of):

* MVL (the Maximum Vector Length) - declares (statically) how
  much of a regfile is to be reserved for Vector elements
* VL - Vector Length
* dststep - the destination element offset of the current parallel
  instruction being executed
* srcstep - for twin-predication, the source element offset as well.
* ssubstep - the source subvector element offset of the current
  parallel instruction being executed
* dsubstep - the destination subvector element offset of the current
  parallel instruction being executed
* vfirst - Vertical First mode.  srcstep, dststep and substep
  **do not advance** unless explicitly requested to do so with
  the `svstep` instruction
* RMpst - REMAP persistence.  REMAP will apply only to the following
  instruction unless this bit is set, in which case REMAP "persists".
  Reset (cleared) on use of the `setvl` instruction if used to
  alter VL or MVL.
* Pack - if set then srcstep/substep VL/SUBVL loop-ordering is inverted.
* UnPack - if set then dststep/substep VL/SUBVL loop-ordering is inverted.
* hphint - Horizontal Parallelism Hint. Indicates that
  no Hazards exist between groups of elements in sequential multiples of this number
  (before REMAP).  By definition: elements for which `FLOOR(srcstep/hphint)` is
  equal *before REMAP* are in the same parallelism "group". In Vertical First Mode
  hardware **MUST ONLY** process elements in the same group, and must stop
  Horizontal Issue at the last element of a given group. Set to zero to indicate "no hint".
* SVme - REMAP enable bits, indicating which register is to be
  REMAPed: RA, RB, RC, RT and EA are the canonical (typical) register names
  associated with each bit, with RA being the LSB and EA being the MSB.
  See the `mi0-mi2/mo0-mo1` entries in the table above for ordering.
  When `SVme` is zero (0b00000) REMAP is **fully disabled and inactive**
  regardless of the contents of `SVSTATE`, `mi0-mi2/mo0-mo1`, or the
  four `SVSHAPEn` SPRs.
* mi0-mi2/mo0-mo1 - these indicate the SVSHAPE (0-3) that the
  corresponding register (RA etc) should use, when that register's
  corresponding SVme bit is set.
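
The interaction of `SVme` with `mi0-mi2/mo0-mo1` can be sketched as follows. This assumes, per the description above, that SVme bit 0 (the LSB) enables RA, followed by RB, RC, RT and finally EA at the MSB, paired respectively with mi0, mi1, mi2, mo0 and mo1 as in the SVSTATE table; `remap_shapes` is a hypothetical name.

```python
# (canonical register name, SVSTATE selector field), in SVme LSB-first order
REMAP_SLOTS = [("RA", "mi0"), ("RB", "mi1"), ("RC", "mi2"),
               ("RT", "mo0"), ("EA", "mo1")]


def remap_shapes(svme, selectors):
    """Return {register: SVSHAPE index} for every REMAP-enabled operand.

    `selectors` maps "mi0".."mo1" to a 2-bit SVSHAPE number (0-3).
    When svme is zero REMAP is fully inactive, whatever the selectors hold.
    """
    shapes = {}
    for bit, (reg, sel) in enumerate(REMAP_SLOTS):
        if svme & (1 << bit):
            shapes[reg] = selectors[sel] & 0b11
    return shapes
```

This also illustrates the Programmer's Note below: the selectors can be loaded well in advance, and nothing is REMAPed until `SVme` becomes non-zero.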

Programmer's Note: the fact that REMAP is entirely dormant when `SVme` is zero
allows establishment of REMAP context well in advance, followed by utilising `svremap`
at a precise (or the very last) moment.  Some implementations may exploit this
to cache (or take some time to prepare caches) in the background whilst other
(unrelated) instructions are being executed. This is particularly important to
bear in mind when using `svindex`, which will require hardware to perform (and
cache) additional GPR reads.

Programmer's Note: when REMAP is activated it becomes necessary on any
context-switch (Interrupt or Function call) to detect (or know in advance)
that REMAP is enabled and to additionally save/restore the four SVSHAPE
SPRs, SVSHAPE0-3.  Given that this is expected to be a rare occurrence it was
deemed unreasonable to burden every context-switch or function call with
mandatory save/restore of SVSHAPEs, and consequently it is a *callee*
(and Trap Handler) responsibility.  Callees (and Trap Handlers) **MUST**
avoid using any SVP64 instructions at all during the period where state
could be adversely affected.  SVP64 purely relies on Scalar instructions,
so Scalar instructions (except the SVP64 Management ones, and mtspr and
mfspr) are 100% guaranteed to have zero impact on SVP64 state.

**Max Vector Length (maxvl)** <a name="mvl" />

MAXVECTORLENGTH is the same concept as MVL in RISC-V RVV, except that it
is variable length and may be dynamically set (normally from an immediate
field only).  MVL is limited to 7 bits
(in the first version of SVP64) and consequently the maximum number of
elements is limited to between 0 and 127.

Programmer's Note: Except by directly using `mtspr` on SVSTATE, which may
result in performance penalties on some hardware implementations, SVSTATE's `maxvl`
field may only be set **statically** as an immediate, by the `setvl` instruction.
It may **NOT** be set dynamically from a register.  Compiler writers and assembly
programmers are expected to perform static register file analysis, subdivision,
and allocation and only utilise `setvl`. Direct writing to SVSTATE in order to
"bypass" this Note could, in less-advanced implementations, potentially cause stalling,
particularly if SVP64 instructions are issued directly after the `mtspr` to SVSTATE.

**Vector Length (vl)** <a name="vl" />

The actual Vector Length, the number of elements in a "Vector", `SVSTATE.vl`,
may be set entirely dynamically at runtime from a number of sources. `setvl` is
the primary instruction for setting Vector Length.
`setvl` is conceptually similar to, but different from, the Cray, SX Aurora,
and RISC-V RVV equivalents. As in RVV, VL is set to be within
the range 0 <= VL <= MVL. Unlike RVV, VL is set **exactly** according to the following:

    VL = (RT|0) = MIN(vlen, MVL)

where 0 <= MVL <= 127 and vlen may come from an immediate, `RA`, or from the `CTR` SPR,
depending on options selected with the `setvl` instruction.

Programmer's Note: conceptual understanding of Cray-style Vectors is far beyond the scope
of the Power ISA Technical Reference.  Guidance on the 50-year-old Cray Vector paradigm is
best sought elsewhere: good studies include Academic Courses given on the 1970s
Cray Supercomputers over at least the past three decades.

**SUBVL - Sub Vector Length**

This is a "group by quantity" that effectively asks each iteration
of the hardware loop to load SUBVL elements of width elwidth at a
time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
operation issued, SUBVL operations are issued.

The main effect of SUBVL is that predication bits are applied per
**group**, rather than by individual element.  Legal values are 0 to 3,
representing 1 operation (1 element) through 4 operations (4 elements) respectively.
Elements are best thought of in the context of 3D, Audio and Video: two Left and Right
Channel "elements", four ARGB "elements", or three XYZ coordinate "elements".

`subvl` is again primarily set by the `setvl` instruction. Not to be confused
with `hphint`.

Directly related to `subvl` are the `pack` and `unpack` Mode bits of `SVSTATE`.
See the `svstep` instruction for how to set Pack and Unpack Modes.
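
The effect of SUBVL on predication, and the loop-order inversion that Pack/Unpack introduces, can be shown with a short Python sketch. It assumes the encoded `subvl` value 0-3 means 1-4 sub-elements (as stated above) and that predicate bit i applies to the whole group at element i; it lists (element, sub-element) pairs in the order they are visited. `subvl_order` is a hypothetical name.

```python
def subvl_order(vl, subvl_enc, mask, pack=False):
    """List (element, sub-element) pairs in the order they are visited.

    subvl_enc is the encoded value 0..3, meaning 1..4 sub-elements.
    Predicate bits apply per group: a masked-out element skips all of
    its sub-elements.  With pack set, the VL and SUBVL loops swap
    places (outer-inner becomes inner-outer).
    """
    subvl = subvl_enc + 1
    active = [i for i in range(vl) if mask & (1 << i)]
    if pack:
        return [(i, s) for s in range(subvl) for i in active]
    return [(i, s) for i in active for s in range(subvl)]
```

With two-element groups (for example Left and Right audio channels), the normal order visits each element's pair together, whereas with pack set all Left sub-elements are visited before all Right ones.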

**Horizontal Parallelism**

A problem exists for hardware in that it may not be able to detect
opportunities for parallelism, and lack of overlap between loops,
that a programmer (or compiler) knows of.

For hphint, the number chosen must be consistently
executed **every time**. Hardware is not permitted to execute five
computations for one instruction then three on the next.
hphint is a hint from the compiler to hardware that exactly this
many elements may be safely executed in parallel, without hazards
(including Memory accesses).
Interestingly, when hphint is set equal to VL, it is in effect
as if Vertical First mode were not set, because the hardware is
given the option to run through all elements in an instruction.
This is exactly what Horizontal-First is: a for-loop from 0 to VL-1
except that the hardware may *choose* the number of elements.
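
The hphint grouping rule from the SVSTATE field list (elements with equal `FLOOR(srcstep/hphint)` before REMAP form one group) can be sketched in Python. `hphint_groups` is a hypothetical helper, and hphint=0 ("no hint") is modelled here as one element per group, which is an assumption.

```python
def hphint_groups(vl, hphint):
    """Partition srcsteps 0..VL-1 into hazard-free parallelism groups."""
    if hphint == 0:
        # "no hint": assume no parallelism is guaranteed
        return [[i] for i in range(vl)]
    groups = {}
    for srcstep in range(vl):
        groups.setdefault(srcstep // hphint, []).append(srcstep)
    return list(groups.values())
```

With hphint equal to VL the result is a single group containing every element, matching the observation above that this behaves like Horizontal-First.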

*Note to programmers: changing VL during the middle of such modes
should be done only with due care and respect for the fact that SVSTATE
has exactly the same peer-level status as a Program Counter.*

-------------

\newpage{}

# SVL-Form

Add the following to Book I, 1.6.1, SVL-Form

```
    |0     |6    |11    |16   |23 |24 |25 |26    |31 |
    | PO   |  RT |   RA | SVi |ms |vs |vf |   XO |Rc |
    | PO   |  RT | /    | SVi |/  |/  |vf |   XO |Rc |
```
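
A sketch of decoding the SVL-Form fields from a 32-bit instruction word, using the MSB0 bit positions above (bit 0 is the most significant). The primary and extended opcode values are not given in this document, so the usage below relies on arbitrary placeholder PO/XO numbers; `decode_svl` and `encode_svl` are hypothetical names. In the Notation of Section 1.3.2, the decoded register numbers are `_RT` and `_RA`.

```python
def field(insn, start, end):
    """Extract MSB0 bits start..end (inclusive) of a 32-bit instruction."""
    return (insn >> (31 - end)) & ((1 << (end - start + 1)) - 1)


def decode_svl(insn):
    """Split an SVL-Form instruction word into its named fields."""
    return {
        "PO": field(insn, 0, 5),
        "_RT": field(insn, 6, 10),
        "_RA": field(insn, 11, 15),
        "SVi": field(insn, 16, 22),
        "ms": field(insn, 23, 23),
        "vs": field(insn, 24, 24),
        "vf": field(insn, 25, 25),
        "XO": field(insn, 26, 30),
        "Rc": field(insn, 31, 31),
    }


def encode_svl(po, rt, ra, svi, ms, vs, vf, xo, rc):
    """Inverse of decode_svl, for building test words."""
    return ((po << 26) | (rt << 21) | (ra << 16) | (svi << 9) |
            (ms << 8) | (vs << 7) | (vf << 6) | (xo << 1) | rc)
```

For example, encoding with placeholder values `po=22, xo=27` (illustration only, not the real opcode allocation) and decoding the result returns the same fields unchanged.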

* Add `SVL` to `RA (11:15)` Field in Book I, 1.6.2
* Add `SVL` to `RT (6:10)` Field in Book I, 1.6.2
* Add `SVL` to `Rc (31)` Field in Book I, 1.6.2
* Add `SVL` to `XO (26:30)` Field in Book I, 1.6.2

Add the following to Book I, 1.6.2

```
    ms (23)
        Field used in Simple-V to specify whether MVL (maxvl in the SVSTATE SPR)
        is to be set
        Formats: SVL
    vf (25)
        Field used in Simple-V to specify whether "Vertical" Mode is set
        (vfirst in the SVSTATE SPR)
        Formats: SVL
    vs (24)
        Field used in Simple-V to specify whether VL (vl in the SVSTATE SPR) is to be set
        Formats: SVL
    SVi (16:22)
        Simple-V immediate field used by setvl for setting VL or MVL
        (vl, maxvl in the SVSTATE SPR)
        and used as a "Mode of Operation" selector in svstep
        Formats: SVL
```

# Appendices

    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVL  | I    | #    | 3.0B    | svstep   | Vertical-First Stepping and status reporting |
| SVL  | I    | #    | 3.0B    | setvl    | Cray-like establishment of Looping (Vector) context |
