# RFC ls008 SVP64 Management instructions

[[!tag opf_rfc]]

**URLs**:

* <https://libre-soc.org/openpower/sv/>
* <https://libre-soc.org/openpower/sv/rfc/ls008/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1040>
* <https://git.openpower.foundation/isa/PowerISA/issues/87>

**Severity**: Major

**Status**: New

**Date**: 24 Mar 2023

**Target**: v3.2B

**Source**: v3.0B

**Books and Section affected**:

```
    Book I, new Scalar Chapter.  (Or, new Book on "Zero-Overhead Loop Subsystem")
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic
```

**Summary**

```
    Instructions added
    setvl    - Cray-style "Set Vector Length" instruction
    svstep   - Vertical-First Mode explicit Step and Status
    svremap  - Re-Mapping of Register Element Offsets
    svindex  - General-purpose setting of SHAPEs to be re-mapped
    svshape  - Hardware-level setting of SHAPEs for element re-mapping
    svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
    Addition of six new "Zero-Overhead-Loop-Control" DSP-style Vector-style
    Management Instructions which can be implemented extremely efficiently
    and effectively by inserting an additional phase between Decode and Issue.
    More complex designs are NOT adversely impacted and in fact greatly benefit
    whilst still retaining an obvious linear sequential execution programming model.
```

**Impact on software**:

```
    Requires support for new instructions in assembler, debuggers,
    and related tools.
```

**Keywords**:

```
    Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
    Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
    Digital Signal Processing (DSP)
```

**Motivation**

Power ISA is synonymous with Supercomputing, and the early Supercomputers
(ETA-10, ILLIAC-IV, CDC200, Cray) had Vectorisation. It is therefore anomalous
that Power ISA does not have Scalable Vectors, instead having the legacy
"PackedSIMD" paradigm. Fortunately this presents the opportunity to modernise
Power ISA, learning from both past ISA features and mistakes, and placing it
far above the top of Supercomputing for the next two decades and beyond.

**Notes and Observations**:

1. SVP64 is very much designed for ultra-light-weight Embedded use-cases all the
  way up to moving the bar of Supercomputing orders of magnitude above its present
  perception, whilst retaining at all times the Sequential Programming Execution
  Model.
2. This proposal is the **base** for further Extensions.  These include
  extending SVP64 onto the Scalar VSX instructions (with a **LONG TERM** view in 10+ years
  of deprecating the PackedSIMD aspects of VSX), to be discussed at a later
  time, the potential for extending VSX registers to 128 or beyond, and Arithmetic
  operations to a runtime-selectable choice of 128-bit, 256-bit, 512-bit or 1024-bit.
3. Massive reductions in instruction count of between 2x and 20x have been demonstrated
  with SVP64, which is far beyond anything ever achieved by any *general-purpose*
  ISA Extension added to any ISA in the history of Computing. Normally, reductions
  of the order of 5 to 10% are considered enough to make inclusion a highly
  worthwhile exercise to pursue, not reductions to fractions of former sizes.
4. Other potential extensions include work inspired by EXTRA-V and Eth-Zurich "Snitch"
  to reduce CPU workload by 95% in the case of EXTRA-V and power consumption by
  85% in the case of Snitch.  Additional massive reductions from ZOLC Research are
  also anticipated.

**Changes**

Add the following entries to:

* Section 1.3.2 Notation
* the Appendices of Book I
* Instructions of Book I as a new Section
* SVL-Form of Book I Section 1.6.1.6 and 1.6.2

----------------

\newpage{}

# Notation, Section 1.3.2

When register operands (RA, RT, BF) are prefixed by a single underscore
(_RT, _RA, _BF) the variable contains the contents of the instruction field,
not the contents of the Register File referenced *by* that field. Example:
`_RT` contains the contents of bits 6 thru 10. The relationship
`RT = GPR(_RT)` is thus always true. Uses include making alternative
decisions within an instruction based on whether the operand field
is zero or non-zero.

----------------

\newpage{}

# svstep: Vertical-First Stepping and status reporting

SVL-Form

* svstep RT,SVi,vf (Rc=0)
* svstep. RT,SVi,vf (Rc=1)

| 0-5|6-10|11-15|16-22 | 23-25    | 26-30 |31|   Form   |
|----|----|-----|------|----------|-------|--|--------- |
|PO  | RT | /   | SVi  |  / / vf  | XO    |Rc| SVL-Form |

Pseudo-code:

```
    if SVi[3:4] = 0b11 then
        # store pack and unpack in SVSTATE
        SVSTATE[53] <- SVi[5]
        SVSTATE[54] <- SVi[6]
        RT <- [0]*62 || SVSTATE[53:54]
    else
        # Vertical-First explicit stepping.
        step <- SVSTATE_NEXT(SVi, vf)
        RT <- [0]*57 || step
```

Special Registers Altered:

    CR0                     (if Rc=1)

**Description**

svstep may be used to enquire about the REMAP Schedule, and it may be used
to alter Vectorisation State.  When `vf=1` stepping occurs.
When `vf=0` the enquiry is performed without altering internal
state.  If `SVi=0, Rc=0, vf=0` the instruction is a `nop`.

The following Modes exist:

* `SVi=0`: appropriately step srcstep, dststep, subsrcstep and subdststep to the next
  element, taking pack and unpack into consideration.
* When `SVi` is 1-4 the REMAP Schedule for a given SVSHAPE may be
  returned in `RT`.  `SVi=1` selects the current state of SVSHAPE0,
  through to `SVi=4` which selects SVSHAPE3.
* When `SVi` is 5, `SVSTATE.srcstep` is returned.
* When `SVi` is 6, `SVSTATE.dststep` is returned.
* When `SVi` is 0b1100, pack and unpack in SVSTATE are both cleared.
* When `SVi` is 0b1101, pack in SVSTATE is set and unpack is cleared.
* When `SVi` is 0b1110, unpack in SVSTATE is set and pack is cleared.
* When `SVi` is 0b1111, pack and unpack in SVSTATE are both set.

As this is a Single-Predicated (1P) instruction, predication may be applied
to skip (or zero) elements.

* Vertical-First Mode will return the requested index
  (and move to the next state if `vf=1`)
* Horizontal-First Mode can be used to return all indices,
  i.e. walks through all possible states.

**Vectorisation of svstep itself**

As a 32-bit instruction, `svstep` may itself be Vector-Prefixed, as
`sv.svstep`. This will work just as well in Horizontal-First
as it will in Vertical-First Mode.

Example: to obtain the full set of possible computed element
indices use `sv.svstep RT.v,SVi,1` which will store all computed element
indices, starting from RT.  If Rc=1 then a co-result Vector of CR Fields
will also be returned, comprising the "loop end-points" of each of the inner
loops when either Matrix Mode or DCT/FFT is set.  In other words,
for example, when the `xdim` inner loop reaches the end and on the next
iteration will begin again at zero, the CR Field `EQ` bit will be set.
With a maximum of three loops within both Matrix and DCT/FFT Modes,
the CR Field's EQ bit will be set at the end of the first inner loop,
the LT bit for the second, the GT bit for the outermost loop and the
SO bit on the very last element, when all loops reach their maximum
extent.

*Programmer's note (1): VL in some situations, particularly larger Matrices,
may exceed 64, meaning that `sv.svstep` returns a considerable number of
values. Under such circumstances `sv.svstep/ew=8` is recommended.*

*Programmer's note (2): having conveniently obtained a pre-computed
Schedule with `sv.svstep`,
it may then be used as the input to Indexed REMAP Mode
to achieve the exact same Schedule. It is evident however that
before use some of the Indices may be arbitrarily altered as desired.
`sv.svstep` helps the programmer avoid having to manually recreate
Indices for certain types of common Loop patterns, and in its simplest
form, without REMAP (SVi=5 or SVi=6),
is equivalent to the `iota` instruction found in other Vector ISAs.*

**Vertical First Mode**

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction.  **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
instruction.

A mode of `svstep` (`SVi=0`) is provided which moves srcstep and
dststep on to the next element, still respecting predicate
masks.

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**, with an explicit instruction then used to
advance srcstep/dststep. An outer loop (closed by a branch
instruction) is expected to be used, which completes a series of
Vector operations.

Testing any end condition of any loop of any REMAP state allows branches to be
used to create loops.
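
The difference in execution order between the two Modes may be illustrated
with a small Python sketch. It is a model under simplifying assumptions
only: there is no predication, no REMAP and no SUBVL, the function names
are hypothetical, and a "program" is just a list of instruction names.

```python
def horizontal_first(program, vl):
    """Each instruction loops over all elements before the PC moves on."""
    trace = []
    for insn in program:
        for step in range(vl):
            trace.append((insn, step))
    return trace

def vertical_first(program, vl):
    """One element per instruction; an explicit step advances the element."""
    trace = []
    step = 0
    while step < vl:              # outer loop, closed by a branch
        for insn in program:      # PC moves on with the same srcstep/dststep
            trace.append((insn, step))
        step += 1                 # stands in for svstep with SVi=0, vf=1
    return trace

hf = horizontal_first(["add", "mul"], 2)
vf = vertical_first(["add", "mul"], 2)
```

Both traces contain the same (instruction, element) pairs. Only the order
differs: Horizontal-First completes `add` for every element before starting
`mul`, whereas Vertical-First completes `add` then `mul` on element 0
before moving to element 1.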

Programmer's note: when Predicate Non-Zeroing is used this indicates to
the underlying hardware that any masked-out element must be skipped.
*This includes in Vertical-First Mode*, and programmers should be keenly
aware that srcstep or dststep or both *may* jump by more than one as
a result, because the actual request under these circumstances was to execute
on the first available next *non-masked-out* element.

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSHAPEs) should
obviously be stored on the stack in order to achieve this benefit.*

-------------

\newpage{}

# setvl

SVL-Form

| 0-5|6-10|11-15|16-22 | 23 24 25 | 26-30 |31|   FORM   |
| -- | -- | --- | ---- |----------| ----- |--|----------|
|PO  | RT | RA  | SVi  | ms vs vf | XO    |Rc| SVL-Form |

* setvl RT,RA,SVi,vf,vs,ms (Rc=0)
* setvl. RT,RA,SVi,vf,vs,ms (Rc=1)

Pseudo-code:

```
    overflow <- 0b0    # sets CR.SO if set and if Rc=1
    VLimm <- SVi + 1
    # set or get MVL
    if ms = 1 then MVL <- VLimm[0:6]
    else           MVL <- SVSTATE[0:6]
    # set or get VL
    if vs = 0                then VL <- SVSTATE[7:13]
    else if _RA != 0         then
        if (RA) >u 0b1111111 then
            VL <- 0b1111111
            overflow <- 0b1
        else                      VL <- (RA)[57:63]
    else if _RT = 0          then VL <- VLimm[0:6]
    else if CTR >u 0b1111111 then
        VL <- 0b1111111
        overflow <- 0b1
    else                          VL <- CTR[57:63]
    # limit VL to within MVL
    if VL >u MVL then
        overflow <- 0b1
        VL <- MVL
    SVSTATE[0:6] <- MVL
    SVSTATE[7:13] <- VL
    if _RT != 0 then
       GPR(_RT) <- [0]*57 || VL
    # MAXVL is a static "state-reset" opportunity so VF is only set then.
    if ms = 1 then
         SVSTATE[63] <- vf   # set Vertical-First mode
         SVSTATE[62] <- 0b0  # clear persist bit
```

Special Registers Altered:

```
    CR0                     (if Rc=1)
```

* `SVi` - bits 16-22 - an immediate operand for setting MVL and/or VL
* `ms` - bit 23 - allows for setting of MVL
* `vs` - bit 24 - allows for setting of VL
* `vf` - bit 25 - sets "Vertical First Mode".

Note that in immediate setting mode VL and MVL start from **one**
but that this is compensated for in the assembly notation,
i.e. an immediate value of 1 in assembler notation
actually places the value 0b0000000 in the `SVi` field bits:
on execution the `setvl` instruction adds one to the decoded
`SVi` field bits, resulting in
VL/MVL being set to 1. This allows VL to be set to values
ranging from 1 to 128 with only 7 bits instead of 8.
Setting VL/MVL
to 0 would result in all Vector operations becoming `nop`.  If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].
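
The off-by-one between assembler notation and the `SVi` field can be
sketched as a pair of Python helpers. The names `encode_svi` and
`decode_svi` are hypothetical and serve only to illustrate the arithmetic
described above.

```python
def encode_svi(value: int) -> int:
    """Assembler side: turn a VL/MVL immediate of 1..128 into the 7-bit field."""
    if not 1 <= value <= 128:
        raise ValueError("immediate VL/MVL must be in the range 1..128")
    return value - 1

def decode_svi(svi: int) -> int:
    """Execution side: setvl adds one to the decoded 7-bit field."""
    return (svi & 0b1111111) + 1

field_for_one = encode_svi(1)          # the field holds 0b0000000
value_for_max = decode_svi(0b1111111)  # the largest field value
```

An immediate of zero cannot be expressed this way, which is why setting
VL or MVL to zero has to go through the SVSTATE SPR instead.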

Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:

    setvli   VL=8   : setvl  r0, r0, VL=8, vf=0, vs=1, ms=0
    setvli.  VL=8   : setvl. r0, r0, VL=8, vf=0, vs=1, ms=0
    setmvli  MVL=8  : setvl  r0, r0, MVL=8, vf=0, vs=0, ms=1
    setmvli. MVL=8  : setvl. r0, r0, MVL=8, vf=0, vs=0, ms=1

Additional pseudo-op for obtaining VL without modifying it (or any state):

    getvl  r5      : setvl  r5, r0, vf=0, vs=0, ms=0
    getvl. r5      : setvl. r5, r0, vf=0, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction.  Doing so would require two instructions.

**Selecting sources for VL**

There is considerable opcode pressure; consequently the source from
which VL is set is selected as follows:

| condition            | effect                      |
| -------------------- | --------------------------- |
| `vs=1, RA=0, RT!=0`  | VL,RT set to MIN(MVL, CTR)  |
| `vs=1, RA=0, RT=0`   | VL set to MIN(MVL, SVi+1)   |
| `vs=1, RA!=0, RT=0`  | VL set to MIN(MVL, RA)      |
| `vs=1, RA!=0, RT!=0` | VL,RT set to MIN(MVL, RA)   |

The reasoning here is that the opportunity to set RT equal to the
immediate `SVi+1` is sacrificed in favour of setting from CTR.
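
The source selection may be sketched in Python as follows. This is a
simplified model, not a definitive implementation: `select_vl`, `saturate`
and all parameter names are hypothetical, values are assumed to be
unsigned, and the immediate path simply takes `SVi+1` before the MIN with
MVL rather than modelling `VLimm[0:6]` exactly.

```python
VL_FIELD_MAX = 0b1111111  # largest value a 7-bit field can hold

def saturate(value):
    """Clamp an unsigned value to 7 bits, reporting whether it overflowed."""
    if value > VL_FIELD_MAX:
        return VL_FIELD_MAX, True
    return value, False

def select_vl(vs, ra_field, rt_field, svi, mvl, current_vl, gpr, ctr):
    """Return (VL, overflow) for the given operand fields and state."""
    overflow = False
    if vs == 0:
        vl = current_vl                          # VL is not being set
    elif ra_field != 0:
        vl, overflow = saturate(gpr[ra_field])   # VL requested from RA
    elif rt_field == 0:
        vl = svi + 1                             # VL requested from immediate
    else:
        vl, overflow = saturate(ctr)             # VL requested from CTR
    if vl > mvl:                                 # limit VL to within MVL
        vl, overflow = mvl, True
    return vl, overflow
```

In this model the returned `overflow` flag is what would be reported in
CR0.SO when `Rc=1`, and the returned VL is also what would be written to
RT whenever the RT field is non-zero.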

# Unusual Rc=1 behaviour

Normally, the return result from an instruction is in `RT`. With
it being possible for `RT=0` to mean that `CTR` mode is to be read,
some different semantics are needed.

CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
overflow may occur: `VL`, if set either from an immediate or from `CTR`,
may not exceed `MAXVL`, and if it does, `CR0.SO` must be set.

Additionally, in reality it is **`VL`** being set. Therefore, rather
than `CR0` testing `RT` when `Rc=1`, CR0.EQ is set if `VL=0`, CR0.GT
is set if `VL` is non-zero.

**SUBVL**

Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if it were the "single element".  Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
due to the order in which VL and SUBVL loops are applied being
swapped (outer-inner becomes inner-outer).

# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?
```
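
The same strip-mining pattern can be expressed as a short Python sketch,
in which `min(n, mvl)` stands in for the value that `setvl` returns. The
generator name `strip_mine` is hypothetical, and the sketch tests the
remaining count before each chunk rather than after it.

```python
def strip_mine(n, mvl=8):
    """Yield (offset, vl) chunks covering n elements, at most mvl at a time."""
    offset = 0
    while n > 0:
        vl = min(n, mvl)    # elements handled this iteration
        yield offset, vl
        offset += vl        # advance past the elements just processed
        n -= vl             # decrement the remaining count by vl

chunks = list(strip_mine(20, mvl=8))
```

For 20 elements with MVL=8 this gives two full chunks of 8 followed by a
final chunk of 4, with no separate clean-up loop needed for the remainder.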

## Loop using Rc=1

    my_fn:
      li r3, 1000
      b test
    loop:
      sub r3, r3, r4
      ...
    test:
      setvli. r4, r3, MVL=64
      bne cr0, loop
    end:
      blr

## Load/Store-Multi (selective)

Up to 64 FPRs will be loaded, here.  `r3` has one bit set
for each FP register required to be loaded.  The block of memory
from which the registers are loaded is contiguous (no gaps):
any FP register which has a corresponding zero bit in `r3`
is *unaltered*.  In essence this is a selective LD-multi with
"Scatter" capability.

    setvli r0, MVL=64, VL=64
    sv.fld/dm=r3 *r0, 0(r30) # selective load 64 FP registers

Up to 64 FPRs will be saved, here.  Again, `r3` has one bit set
for each FP register required to be saved.

    setvli r0, MVL=64, VL=64
    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers
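
The effect of the selective load may be modelled with the following Python
sketch. It rests on assumptions that are not stated above: the function
name `selective_load` is hypothetical, memory is modelled as a plain list
of values, and mask bit `i` (counted from the least significant bit) is
assumed to guard register `i`.

```python
def selective_load(fprs, memory, mask, vl=64):
    """Load consecutive memory words into the registers whose mask bit is set."""
    result = list(fprs)               # registers with a zero bit stay unaltered
    src = 0                           # memory is read contiguously, no gaps
    for dst in range(vl):
        if (mask >> dst) & 1:         # assumption: bit i guards register i
            result[dst] = memory[src]
            src += 1
    return result

old = [-1.0] * 64
new = selective_load(old, memory=[1.5, 2.5, 3.5], mask=0b100101)
```

Three consecutive memory words end up in registers 0, 2 and 5, and every
other register keeps its previous contents.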

-------------

\newpage{}

# SVSTATE SPR

The format of the SVSTATE SPR is as follows:

| Field | Name     | Description                 |
| ----- | -------- | --------------------------- |
| 0:6   | maxvl    | Max Vector Length           |
| 7:13  | vl       | Vector Length               |
| 14:20 | srcstep  | for srcstep = 0..VL-1       |
| 21:27 | dststep  | for dststep = 0..VL-1       |
| 28:29 | dsubstep | for substep = 0..SUBVL-1    |
| 30:31 | ssubstep | for substep = 0..SUBVL-1    |
| 32:33 | mi0      | REMAP RA/FRA/BFA SVSHAPE0-3 |
| 34:35 | mi1      | REMAP RB/FRB/BFB SVSHAPE0-3 |
| 36:37 | mi2      | REMAP RC/FRT SVSHAPE0-3     |
| 38:39 | mo0      | REMAP RT/FRT/BF SVSHAPE0-3  |
| 40:41 | mo1      | REMAP EA/RS/FRS SVSHAPE0-3  |
| 42:46 | SVme     | REMAP enable (RA-RT)        |
| 47:52 | rsvd     | reserved                    |
| 53    | pack     | PACK (srcstep reorder)      |
| 54    | unpack   | UNPACK (dststep order)      |
| 55:61 | hphint   | Horizontal Hint             |
| 62    | RMpst    | REMAP persistence           |
| 63    | vfirst   | Vertical First mode         |

Notes:

* The entries are truncated to be within range.  Attempts to set VL to
  greater than MAXVL will truncate VL.
* Setting srcstep, dststep to 64 or greater, or VL or MVL to greater
  than 64 is reserved and will cause an illegal instruction trap.
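
As an aid to reading the table, the field positions can be expressed as a
small Python sketch that packs and unpacks a 64-bit SVSTATE value. MSB0
bit numbering is assumed (bit 0 is the most significant bit), and the
helper names `get_field` and `set_field` are hypothetical.

```python
# (start, end) bit positions, inclusive, MSB0 numbering, taken from the table.
SVSTATE_FIELDS = {
    "maxvl": (0, 6), "vl": (7, 13), "srcstep": (14, 20), "dststep": (21, 27),
    "dsubstep": (28, 29), "ssubstep": (30, 31), "mi0": (32, 33),
    "mi1": (34, 35), "mi2": (36, 37), "mo0": (38, 39), "mo1": (40, 41),
    "SVme": (42, 46), "pack": (53, 53), "unpack": (54, 54),
    "hphint": (55, 61), "RMpst": (62, 62), "vfirst": (63, 63),
}

def get_field(svstate, name):
    """Read one named field out of a 64-bit SVSTATE value."""
    start, end = SVSTATE_FIELDS[name]
    width = end - start + 1
    return (svstate >> (63 - end)) & ((1 << width) - 1)

def set_field(svstate, name, value):
    """Return a copy of svstate with one named field replaced (value truncated)."""
    start, end = SVSTATE_FIELDS[name]
    width = end - start + 1
    mask = ((1 << width) - 1) << (63 - end)
    return (svstate & ~mask) | ((value << (63 - end)) & mask)

state = set_field(0, "maxvl", 8)
state = set_field(state, "vl", 5)
state = set_field(state, "vfirst", 1)
```

With MSB0 numbering `vfirst` (bit 63) is the least significant bit of the
64-bit value, and `maxvl` occupies the seven most significant bits.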

**SVSTATE Fields**

SVSTATE is a standard SPR that (if REMAP is not activated) contains sufficient
self-contained information for a full context save/restore.
SVSTATE contains (and permits setting of):

* MVL (the Maximum Vector Length) - declares (statically) how
  much of a regfile is to be reserved for Vector elements
* VL - Vector Length
* dststep - the destination element offset of the current parallel
  instruction being executed
* srcstep - for twin-predication, the source element offset as well.
* ssubstep - the source subvector element offset of the current
  parallel instruction being executed
* dsubstep - the destination subvector element offset of the current
  parallel instruction being executed
* vfirst - Vertical First mode.  srcstep, dststep and substep
  **do not advance** unless explicitly requested to do so with
  the `svstep` instruction
* RMpst - REMAP persistence.  REMAP will apply only to the following
  instruction unless this bit is set, in which case REMAP "persists".
  Reset (cleared) on use of the `setvl` instruction if used to
  alter VL or MVL.
* Pack - if set then srcstep/substep VL/SUBVL loop-ordering is inverted.
* UnPack - if set then dststep/substep VL/SUBVL loop-ordering is inverted.
* hphint - Horizontal Parallelism Hint. Indicates that
  no Hazards exist between groups of elements in sequential multiples of this number
  (before REMAP).  By definition: elements for which `FLOOR(srcstep/hphint)` is
  equal *before REMAP* are in the same parallelism "group". In Vertical First Mode
  hardware **MUST ONLY** process elements in the same group, and must stop
  Horizontal Issue at the last element of a given group. Set to zero to indicate "no hint".
* SVme - REMAP enable bits, indicating which register is to be
  REMAPed: RA, RB, RC, RT and EA are the canonical (typical) register names
  associated with each bit, with RA being the LSB and EA being the MSB.
  See table below for ordering. When `SVme` is zero (0b00000) REMAP
  is **fully disabled and inactive** regardless of the contents of
  `SVSTATE`, `mi0-mi2/mo0-mo1`, or the four `SVSHAPEn` SPRs
* mi0-mi2/mo0-mo1 - when the corresponding SVme bit is set, these
  indicate the SVSHAPE (0-3) that the corresponding register (RA etc)
  should use

Programmer's Note: the fact that REMAP is entirely dormant when `SVme` is zero
allows establishment of REMAP context well in advance, followed by utilising `svremap`
at a precise (or the very last) moment.  Some implementations may exploit this
to cache (or take some time to prepare caches) in the background whilst other
(unrelated) instructions are being executed. This is particularly important to
bear in mind when using `svindex` which will require hardware to perform (and
cache) additional GPR reads.

Programmer's Note: when REMAP is activated it becomes necessary on any
context-switch (Interrupt or Function call) to detect (or know in advance)
that REMAP is enabled and to additionally save/restore the four SVSHAPE
SPRs, SVSHAPE0-3.  Given that this is expected to be a rare occurrence it was
deemed unreasonable to burden every context-switch or function call with
mandatory save/restore of SVSHAPEs, and consequently it is a *callee*
(and Trap Handler) responsibility.  Callees (and Trap Handlers) **MUST**
avoid using any and all SVP64 instructions during the period where state
could be adversely affected.  SVP64 purely relies on Scalar instructions,
so Scalar instructions (except the SVP64 Management ones and mtspr and
mfspr) are 100% guaranteed to have zero impact on SVP64 state.

**Max Vector Length (maxvl)** <a name="mvl" />

MAXVECTORLENGTH is the same concept as MVL in RISC-V RVV, except that it
is variable length and may be dynamically set (normally from an immediate
field only).  MVL is limited to 7 bits
(in the first version of SVP64) and consequently the maximum number of
elements is limited to between 0 and 127.

Programmer's Note: Except by directly using `mtspr` on SVSTATE, which may
result in performance penalties on some hardware implementations, SVSTATE's `maxvl`
field may only be set **statically** as an immediate, by the `setvl` instruction.
It may **NOT** be set dynamically from a register.  Compiler writers and assembly
programmers are expected to perform static register file analysis, subdivision,
and allocation and only utilise `setvl`. Direct writing to SVSTATE in order to
"bypass" this Note could, in less-advanced implementations, potentially cause stalling,
particularly if SVP64 instructions are issued directly after the `mtspr` to SVSTATE.

**Vector Length (vl)** <a name="vl" />

The actual Vector length, the number of elements in a "Vector", `SVSTATE.vl` may be set
entirely dynamically at runtime from a number of sources. `setvl` is the primary
instruction for setting Vector Length.
`setvl` is conceptually similar to, but different from, the Cray, SX Aurora, and RISC-V RVV
equivalents. Similar to RVV, VL is set to be within
the range 0 <= VL <= MVL. Unlike RVV, VL is set **exactly** according to the following:

    VL = (RT|0) = MIN(vlen, MVL)

where 0 <= MVL <= 127 and vlen may come from an immediate, `RA`, or from the `CTR` SPR,
depending on options selected with the `setvl` instruction.

Programmer's Note: conceptual understanding of Cray-style Vectors is far beyond the scope
of the Power ISA Technical Reference.  Guidance on the 50-year-old Cray Vector paradigm is
best sought elsewhere: good studies include Academic Courses given on the 1970s
Cray Supercomputers over at least the past three decades.

**SUBVL - Sub Vector Length**

This is a "group by quantity" that effectively asks each iteration
of the hardware loop to load SUBVL elements of width elwidth at a
time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
operation issued, SUBVL operations are issued.

The main effect of SUBVL is that predication bits are applied per
**group**, rather than by individual element.  Legal values are 0 to 3,
representing 1 operation (1 element) thru 4 operations (4 elements) respectively.
Elements are best thought of in the context of 3D, Audio and Video: two Left and Right
Channel "elements" or four ARGB "elements", or three XYZ coordinate "elements".

`subvl` is again primarily set by the `setvl` instruction. Not to be confused
with `hphint`.

Directly related to `subvl` are the `pack` and `unpack` Mode bits of `SVSTATE`.
See `svstep` instruction for how to set Pack and Unpack Modes.

**Horizontal Parallelism**

A problem exists for hardware where it may not be able to detect
that a programmer (or compiler) knows of opportunities for parallelism
and lack of overlap between loops.

For hphint, the number chosen must be consistently
executed **every time**. Hardware is not permitted to execute five
computations for one instruction then three on the next.
hphint is a hint from the compiler to hardware that exactly this
many elements may be safely executed in parallel, without hazards
(including Memory accesses).
Interestingly, when hphint is set equal to VL, it is in effect
as if Vertical First mode were not set, because the hardware is
given the option to run through all elements in an instruction.
This is exactly what Horizontal-First is: a for-loop from 0 to VL-1
except that the hardware may *choose* the number of elements.
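
The grouping implied by `FLOOR(srcstep/hphint)` may be illustrated with a
short Python sketch. The function name `hphint_groups` is hypothetical,
REMAP is ignored, and returning `None` for a hint of zero is simply this
sketch's way of representing "no hint".

```python
from itertools import groupby

def hphint_groups(vl, hphint):
    """Split element indices 0..vl-1 into hazard-free parallelism groups."""
    if hphint == 0:
        return None    # "no hint": no grouping information is conveyed
    # Elements with equal FLOOR(srcstep / hphint) belong to the same group.
    return [list(group)
            for _, group in groupby(range(vl), key=lambda s: s // hphint)]

groups = hphint_groups(10, 4)
single = hphint_groups(4, 4)
```

With VL=10 and hphint=4 the elements fall into groups of four, four and
two. When hphint equals VL there is a single group containing every
element, matching the observation above.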

*Note to programmers: changing VL during the middle of such modes
should be done only with due care and respect for the fact that SVSTATE
has exactly the same peer-level status as a Program Counter.*

-------------

\newpage{}

# SVL-Form

Add the following to Book I, 1.6.1, SVL-Form

```
    |0     |6    |11    |16   |23 |24 |25 |26    |31 |
    | PO   |  RT |   RA | SVi |ms |vs |vf |   XO |Rc |
    | PO   |  RT | /    | SVi |/  |/  |vf |   XO |Rc |
```

* Add `SVL` to `RA (11:15)` Field in Book I, 1.6.2
* Add `SVL` to `RT (6:10)` Field in Book I, 1.6.2
* Add `SVL` to `Rc (31)` Field in Book I, 1.6.2
* Add `SVL` to `XO (26:30)` Field in Book I, 1.6.2

Add the following to Book I, 1.6.2

```
    ms (23)
        Field used in Simple-V to specify whether MVL (maxvl in the SVSTATE SPR)
        is to be set
        Formats: SVL
    vf (25)
        Field used in Simple-V to specify whether "Vertical" Mode is set
        (vfirst in the SVSTATE SPR)
        Formats: SVL
    vs (24)
        Field used in Simple-V to specify whether VL (vl in the SVSTATE SPR) is to be set
        Formats: SVL
    SVi (16:22)
         Simple-V immediate field used by setvl for setting VL or MVL
         (vl, maxvl in the SVSTATE SPR)
         and used as a "Mode of Operation" selector in svstep
         Formats: SVL
```

# Appendices

    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVL  | I    | #    | 3.0B    | svstep   | Vertical-First Stepping and status reporting |
| SVL  | I    | #    | 3.0B    | setvl    | Cray-like establishment of Looping (Vector) context |
 660 | SVL  | I    | #    | 3.0B    | setvl    | Cray-like establishment of Looping (Vector) context |
 661