openpower/sv/setvl.mdwn

[[!tag standards]]

# OpenPOWER SV setvl/setvli

See links:

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=535>
* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
* <https://bugs.libre-soc.org/show_bug.cgi?id=862> VF Predication
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
* [[sv/svstep]]
* pseudocode [[openpower/isa/simplev]]

Use of setvl results in changes to the MVL, VL and STATE SPRs. See [[sv/sprs]].

# Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
just like RVV.  However unlike RVV, SV sits on top of the standard Scalar
regfiles: there is no separate Vector register numbering.  Therefore, also
unlike RVV, SV does not have hard-coded "Lanes": microarchitects
may use *ordinary* in-order, out-of-order, or superscalar designs
as the basis for SV. By contrast, the relevant parameter
in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" because if the amount is too small performance may
be affected, and if too large then other registers would overlap and
cause data corruption, or even if allocated correctly would require
spill to memory.

The answer effectively needs to be parameterised.  Hence: MAXVL (MVL)
is set from an immediate, so that the compiler may decide, statically, a
guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL is a hardware limit, SV's MVL is simply a loop
optimization. It does not carry side-effects for the architecture, though
for a specific CPU it may affect hardware unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist.  RVV permits the `setvl` instruction to
set VL to an arbitrary value chosen by the hardware.  In SV, within the
limit of MVL, VL **MUST** be set to the requested value. Given that RVV
only works on Vector Loops, RVV's approach is fine and part of its value
and design.  However, SV sits on top of the standard register files.
When MVL=VL=2, a Vector Add on `r3` will perform two Scalar Adds: one on
`r3` and one on `r4`.

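
The way a Vector operation expands into Scalar operations on consecutive
registers can be sketched as a small Python model. This is only an
illustration: the helper `sv_add`, the 128-entry register list and the
choice to treat all three operands as Vectors are assumptions made for the
sketch, not part of the specification.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumed 64-bit registers


def sv_add(gpr, rt, ra, rb, vl):
    """Hypothetical model: element i operates on registers rt+i, ra+i, rb+i."""
    for i in range(vl):
        gpr[rt + i] = (gpr[ra + i] + gpr[rb + i]) & MASK64


gpr = [0] * 128            # assumed register file size for the sketch
gpr[3], gpr[4] = 10, 20    # first source vector lives in r3, r4
gpr[8], gpr[9] = 1, 2      # second source vector lives in r8, r9

# With VL=2, one "Vector Add on r3" becomes two Scalar Adds: r3 and r4.
sv_add(gpr, rt=3, ra=3, rb=8, vl=2)
print(gpr[3], gpr[4])
```
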
Thus there is the opportunity to set VL to an explicit value (within the
limits of MVL) with the reasonable expectation that if two operations
are requested (by setting VL=2) then two operations are guaranteed.
This avoids the need for a loop (with not-insignificant use of the
regfiles for counters). Simply two instructions suffice:

    setvli r0, MVL=64, VL=64
    ld r0.v, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside this is *guaranteed* 100% without fail to perform
64 unit-strided LDs starting from the address pointed to by r30 and put
the contents into r0 through r63.  Thus it becomes a "LOAD-MULTI". Twin
Predication could even be used to only load relevant registers from
the stack.  This *only works if VL is set to the requested value* rather
than, as in RVV, allowing the hardware to set VL to an arbitrary value
(with the caveat that VL is limited to not exceed MVL).

Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops. Note: to avoid having an extra bit in `setvl`,
selecting CTR is slightly convoluted.

# Format

*(Allocation of opcode TBD pending OPF ISA WG approval)*,
using EXT22 temporarily and fitting into the
[[sv/bitmanip]] space

Form: SVL-Form (see [[isatables/fields.text]])

| 0..5 | 6..10 | 11..15 | 16..22 | 23..25   | 26..30 | 31 | name  |
| ---- | ----- | ------ | ------ | -------- | ------ | -- | ----- |
| OPCD | RT    | RA     | SVi    | ms vs vf | 11011  | Rc | setvl |

Instruction format:

    setvl RT,RA,SVi,vf,vs,ms
    setvl. RT,RA,SVi,vf,vs,ms

Note that the immediate (`SVi`) spans 7 bits (16 to 22).

* `ms` - bit 23 - allows for setting of MVL
* `vs` - bit 24 - allows for setting of VL
* `vf` - bit 25 - sets "Vertical First Mode".

Note that in immediate setting mode VL and MVL start from **one**,
i.e. an immediate value of zero will result in VL/MVL being set to 1,
and 0b111111 results in VL/MVL being set to 64. This is because setting
VL/MVL to 1 results in "scalar identity" behaviour, whereas setting VL/MVL
to 0 would result in all Vector operations becoming `nop`.  If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].

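
The field layout and the "immediate plus one" rule can be illustrated with
a short Python sketch. It assumes Power-style MSB0 bit numbering (bit 0 is
the most significant bit of the 32-bit word) and uses the temporary EXT22
primary opcode mentioned above. The helper names `field`, `encode_setvl`
and `decode_length` are hypothetical, and the resulting encoding is
provisional for as long as the opcode allocation is TBD.

```python
EXT22 = 22           # temporary primary opcode (bits 0..5), per the text above
XO_SETVL = 0b11011   # bits 26..30 from the SVL-Form table


def field(value, start, end):
    """Place value into MSB0-numbered bits start..end of a 32-bit word."""
    width = end - start + 1
    assert 0 <= value < (1 << width), "value does not fit in field"
    return value << (31 - end)


def encode_setvl(rt, ra, length, ms, vs, vf, rc=0):
    """Hypothetical encoder: 'length' is the desired VL/MVL, 1..64."""
    assert 1 <= length <= 64
    svi = length - 1     # the immediate stores length-1, so 0 means 1
    return (field(EXT22, 0, 5) | field(rt, 6, 10) | field(ra, 11, 15)
            | field(svi, 16, 22) | field(ms, 23, 23) | field(vs, 24, 24)
            | field(vf, 25, 25) | field(XO_SETVL, 26, 30)
            | field(rc, 31, 31))


def decode_length(insn):
    """Recover the VL/MVL value from the SVi field (bits 16..22)."""
    return ((insn >> (31 - 22)) & 0x7F) + 1


insn = encode_setvl(rt=5, ra=0, length=8, ms=1, vs=1, vf=0)
print(hex(insn), decode_length(insn))
```
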
Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:

    setvli VL=8    : setvl r5, r0, VL=8
    setmvli MVL=8  : setvl r0, r0, MVL=8

Additional pseudo-op for obtaining VL without modifying it:

    getvl r5       : setvl r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing
of srcstep and dststep:

    svstep.        : setvl. 0, 0, vf=1, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction.  That would require two instructions.

**Selecting CTR to set VL**

There is considerable opcode pressure. Consequently, setting MVL and VL
from different sources is encoded as follows:

| condition           | effect          |
| ------------------- | --------------- |
| `vf=1, RA=0, RT!=0` | VL set from CTR |

# Vertical First Mode

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction.  **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
instruction.

An explicit mode of setvl is then called to move srcstep and
dststep on to the next element, still respecting predicate
masks.

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**. An explicit instruction is then used to
advance srcstep/dststep, and an outer loop (using a branch
instruction) is expected to complete the series of
Vector operations.

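
The difference in execution order between the two modes can be sketched in
Python. This is a deliberately simplified model: predication, SUBVL and
separate srcstep/dststep values are ignored, and the function names and the
three-instruction "program" are hypothetical.

```python
def horizontal_first(instructions, vl):
    """Normal SVP64: each instruction loops over all elements before the
    PC moves on to the next instruction."""
    trace = []
    for insn in instructions:
        for step in range(vl):
            trace.append((insn, step))
    return trace


def vertical_first(instructions, vl):
    """Vertical-First: every instruction runs for the current element only;
    an explicit step plus a branch forms the outer loop."""
    trace = []
    for step in range(vl):           # outer loop: explicit step + branch
        for insn in instructions:    # PC advances with the same step value
            trace.append((insn, step))
    return trace


prog = ["ld", "add", "st"]
print(horizontal_first(prog, 2))
print(vertical_first(prog, 2))
```
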
`svstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loops reaches an end condition, or if VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0.

* setvl immediate = 1: only VL testing is enabled. CR0.SO is set
  to 1 when either srcstep or dststep reaches VL
* setvl immediate = 2: also include inner, middle and outer
  loop end conditions from SVSTATE0 into CR.EQ, CR.LE and CR.GT
* setvl immediate = 3: test SVSTATE1
* setvl immediate = 4: test SVSTATE2
* setvl immediate = 5: test SVSTATE3

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSTATE0-3) should be saved on the stack.*

**SUBVL**

Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if it were a "single element".  Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled.

# Pseudocode

    // instruction fields:
    rt = get_rt_field();         // bits 6..10
    ra = get_ra_field();         // bits 11..15
    ms = get_ms_field();         // bit 23
    vs = get_vs_field();         // bit 24
    vf = get_vf_field();         // bit 25
    Rc = get_Rc_field();         // bit 31

    if vf and not vs and not ms {
        // increment src/dest step mode
        // NOTE! this is in no way complete! predication is not included
        // and neither is SUBVL mode
        srcstep = SPR[SV].srcstep
        dststep = SPR[SV].dststep
        VL = SPR[SV].VL
        srcstep++
        dststep++
        rollover = (srcstep == VL or dststep == VL)
        if rollover {
            // Reset srcstep, dststep, and also exit "Vertical First" mode
            srcstep = 0
            dststep = 0
            MSR[6] = 0
        }
        SPR[SV].srcstep = srcstep
        SPR[SV].dststep = dststep

        // write CR? helps for doing Vertical loops, detects end
        // of Vector Elements
        if Rc == 1 {
            // update CR to indicate that srcstep/dststep "rolled over"
            CR0.eq = rollover
        }
    } else {
        // add one. MVL/VL=1..64 not 0..63
        vlimmed = get_immed_field()+1; // bits 16..22

        // set VL (or not).
        // 4 options: from SPR, from immed, from ra, from CTR
        // (the CTR option is not shown in this pseudocode)
        if vs {
           // VL to be sourced from fields/regs
           if ra != 0 {
               VL = GPR[ra]
           } else {
               VL = vlimmed
           }
        } else {
           // VL not to change (except if MVL is reduced)
           // read from SPRs
           VL = SPR[SV_VL]
        }

        // set MVL (or not).
        // 2 options: from SPR, from immed
        if ms {
           MVL = vlimmed
        } else {
           // MVL not to change, read from SPRs
           MVL = SPR[SV_MVL]
        }

        // calculate (limit) VL
        VL = min(VL, MVL)

        // store VL, MVL
        SVSTATE.VL = VL
        SVSTATE.MVL = MVL

        // write rt
        if rt != 0 {
            // rt is not zero
            GPR[rt] = VL;
        }
        // write CR?
        if Rc == 1 {
            // update CR from VL (not rt)
            CR0.eq = (VL == 0)
            ...
            ...
        }
        // write Vertical-First mode
        SVSTATE.vf = vf
    }

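
The non-stepping branch of the pseudocode above can be turned into a small
executable Python model, which may help when checking corner cases such as
VL being clamped when MVL is reduced. This is only a sketch: the `SVState`
class and the `setvl` function signature are hypothetical, the CTR source
and the svstep path are omitted, and only the `CR0.eq` result is modelled
(returned as a boolean).

```python
from dataclasses import dataclass


@dataclass
class SVState:
    """Hypothetical stand-in for the VL, MVL and vf fields of SVSTATE."""
    vl: int = 0
    mvl: int = 0
    vf: int = 0


def setvl(state, gpr, rt, ra, svi, vf, vs, ms):
    """Model of the 'else' branch of the pseudocode. Returns CR0.eq."""
    vlimmed = svi + 1                  # immediate encodes 1..64 as 0..63
    if vs:
        vl = gpr[ra] if ra != 0 else vlimmed
    else:
        vl = state.vl                  # VL unchanged unless MVL is reduced
    mvl = vlimmed if ms else state.mvl
    vl = min(vl, mvl)                  # VL may never exceed MVL
    state.vl, state.mvl, state.vf = vl, mvl, vf
    if rt != 0:
        gpr[rt] = vl
    return vl == 0


state = SVState()
gpr = [0] * 32
gpr[3] = 100                           # 100 elements requested via r3
eq = setvl(state, gpr, rt=4, ra=3, svi=63, vf=0, vs=1, ms=1)
print(state.vl, state.mvl, gpr[4], eq)
```
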
# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?
```

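
The same strip-mining pattern can be written out in Python to show how many
elements each pass through the loop handles. The function name
`strip_mined_double` and the element-doubling "vector operation" are
assumptions chosen for illustration; only the `min(remaining, MVL)` step
and the decrement correspond to the loop above.

```python
def strip_mined_double(data, mvl=8):
    """Process data in chunks of at most mvl elements, doubling each one."""
    out, vls = [], []
    pos = 0
    remaining = len(data)                  # plays the role of a0
    while remaining != 0:                  # bnez a0, loop
        vl = min(remaining, mvl)           # setvl a3, a0, MVL=8
        out.extend(2 * x for x in data[pos:pos + vl])  # the vector operation
        pos += vl
        remaining -= vl                    # sub a0, a0, a3
        vls.append(vl)
    return out, vls


out, vls = strip_mined_double(list(range(19)))
print(vls)
```
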
## Loop using Rc=1

    my_fn:
      li r3, 1000
      b test
    loop:
      sub r3, r3, r4
      ...
    test:
      setvli. r4, r3, MVL=64
      bne cr0, loop
    end:
      blr
