openpower/sv/setvl.mdwn

   1 [[!tag standards]]
   2
   3 # DRAFT setvl/setvli
   4
   5 See links:
   6
   7 * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=535>
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=587>
  10 * <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=862> VF Predication
  12 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
  13 * [[sv/svstep]]
  14 * pseudocode [[openpower/isa/simplev]]
  15
Use of setvl results in changes to the SVSTATE SPR. See [[sv/sprs]].
  17
  18 # Behaviour and Rationale
  19
  20 SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
  21 just like RVV.  However unlike RVV, SV sits on top of the standard Scalar
  22 regfiles: there is no separate Vector register numbering.  Therefore, also
  23 unlike RVV, SV does not have hard-coded "Lanes": microarchitects
  24 may use *ordinary* in-order, out-of-order, or superscalar designs
  25 as the basis for SV. By contrast, the relevant parameter
  26 in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
  27 anywhere from 1 to tens of thousands of Lanes in supercomputers.
  28
  29 SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" If the amount is too small, performance may
be affected; if it is too large, other registers would overlap and
cause data corruption or, even if allocated correctly, would require
spilling to memory.
  36
  37 The answer effectively needs to be parameterised.  Hence: MAXVL (MVL)
  38 is set from an immediate, so that the compiler may decide, statically, a
  39 guaranteed resource allocation according to the needs of the application.
  40
  41 While RVV's MAXVL was a hw limit, SV's MVL is simply a loop
  42 optimization. It does not carry side-effects for the arch, though for
  43 a specific cpu it may affect hw unit usage.
  44
Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist.  RVV permits the hardware to set VL to an
arbitrary value when `setvl` is executed.  In SV, within the limit of MVL, VL
**MUST** be set to the requested value. Given that RVV only works on Vector Loops,
RVV's behaviour is fine and part of its value and design.  However, SV sits on top
of the standard register files.  When MVL=VL=2, a Vector Add on `r3`
will perform two Scalar Adds: one on `r3` and one on `r4`.
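
To make the register-file behaviour concrete, here is a minimal Python sketch (not the specification pseudocode) of how a Vectorised add expands into scalar adds on consecutive registers. The function name `sv_add`, the list-based register file, and the assumption that all three operands are marked as vectors are illustrative only.

```python
def sv_add(regs, rt, ra, rb, vl):
    """Model a Vectorised add as VL scalar adds on consecutive registers.

    Assumption: RT, RA and RB are all treated as vectors, so each
    register number advances by one per element.
    """
    for i in range(vl):
        regs[rt + i] = regs[ra + i] + regs[rb + i]
    return regs


regs = list(range(32))   # hypothetical regfile: register n initially holds n
sv_add(regs, rt=3, ra=10, rb=20, vl=2)
# r3 and r4 have been written; r5 onwards is untouched because VL=2
```

With VL=2 exactly two scalar adds are performed, which is why an over-large VL would overwrite neighbouring registers.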
  52
  53 Thus there is the opportunity to set VL to an explicit value (within the
  54 limits of MVL) with the reasonable expectation that if two operations
  55 are requested (by setting VL=2) then two operations are guaranteed.
  56 This avoids the need for a loop (with not-insignificant use of the
  57 regfiles for counters), simply two instructions:
  58
  59     setvli r0, MVL=64, VL=64
  60     ld r0.v, 0(r30) # load exactly 64 registers from memory
  61
  62 Page Faults etc. aside this is *guaranteed* 100% without fail to perform
  63 64 unit-strided LDs starting from the address pointed to by r30 and put
  64 the contents into r0 through r63.  Thus it becomes a "LOAD-MULTI". Twin
  65 Predication could even be used to only load relevant registers from
  66 the stack.  This *only works if VL is set to the requested value* rather
  67 than, as in RVV, allowing the hardware to set VL to an arbitrary value.
  68
Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops. Note: to avoid needing an extra opcode
bit in `setvl`, the encoding that selects CTR is slightly convoluted.
  74
  75 # Format
  76
  77 *(Allocation of opcode TBD pending OPF ISA WG approval)*,
  78 using EXT22 temporarily and fitting into the
  79 [[sv/bitmanip]] space
  80
  81 Form: SVL-Form (see [[isatables/fields.text]])
  82
| 0.5  | 6.10 | 11.15 | 16.22 | 23.25    | 26.30 | 31 | name  |
| ---- | ---- | ----- | ----- | -------- | ----- | -- | ----- |
| OPCD | RT   | RA    | SVi   | ms vs vf | 11011 | Rc | setvl |
  86
  87 Instruction format:
  88
  89     setvl RT,RA,SVi,vf,vs,ms
  90     setvl. RT,RA,SVi,vf,vs,ms
  91
  92 Note that the immediate (`SVi`) spans 7 bits (16 to 22).
  93
  94 Instruction encodings where `SVi`'s MSB is set are reserved for future extensions.
  95 Implementations are required to cause an illegal instruction exception when
  96 `SVi`'s MSB is set to allow software emulation of those future extensions.
  97
  98 * `ms` - bit 23 - allows for setting of MVL
  99 * `vs` - bit 24 - allows for setting of VL
 100 * `vf` - bit 25 - sets "Vertical First Mode".
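
As an illustration of the field layout, the following Python sketch packs the SVL-Form fields into a 32-bit word. It assumes Power ISA MSB0 bit numbering (bit 0 is the most significant bit) and the temporary EXT22 primary opcode mentioned above. The helper names `field` and `encode_setvl` are hypothetical, and the final opcode allocation may differ.

```python
def field(value, start, end):
    """Place value into bits start..end (inclusive, MSB0) of a 32-bit word."""
    width = end - start + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value} does not fit in {width} bits")
    return value << (31 - end)


def encode_setvl(rt, ra, svi, ms, vs, vf, rc=0, opcd=22):
    """Pack the SVL-Form fields; opcd=22 is the temporary EXT22 allocation."""
    if svi & 0b1000000:
        # SVi's MSB is reserved: hardware raises an illegal instruction trap
        raise ValueError("SVi MSB set: reserved encoding")
    return (field(opcd, 0, 5) | field(rt, 6, 10) | field(ra, 11, 15)
            | field(svi, 16, 22) | field(ms, 23, 23) | field(vs, 24, 24)
            | field(vf, 25, 25) | field(0b11011, 26, 30) | field(rc, 31, 31))


# Set both MVL and VL to 8 from the same immediate (SVi = 8 - 1)
word = encode_setvl(rt=0, ra=0, svi=7, ms=1, vs=1, vf=0)
print(f"{word:#010x}")
```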
 101
Note that in immediate setting mode VL and MVL start from **one**,
i.e. an immediate value of zero will result in VL/MVL being set to 1, and
0b111111 results in VL/MVL being set to 64. This is because setting
VL/MVL to 1 results in "scalar identity" behaviour, whereas setting VL/MVL
to 0 would result in all Vector operations becoming `nop`.  If nop
behaviour is truly desired then VL and MVL are to be set to zero
via the [[SVSTATE SPR|sv/sprs]].
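
The off-by-one mapping between the immediate and the resulting length can be sketched in Python as follows. The function names are hypothetical and the sketch covers only the immediate itself, not the MIN(MVL, ...) clamping applied when VL is set.

```python
def length_from_immediate(svi):
    """Return the VL/MVL value requested by an immediate of 0..63.

    Immediates with the 7-bit field's MSB set (64..127) are reserved.
    """
    if not 0 <= svi <= 0b111111:
        raise ValueError("reserved or out-of-range SVi")
    return svi + 1


def immediate_for(length):
    """Inverse mapping: the SVi value that requests a length of 1..64."""
    if not 1 <= length <= 64:
        raise ValueError("immediate mode can only express 1..64")
    return length - 1
```

A length of zero has no immediate encoding under this mapping, which matches the requirement to go through the SVSTATE SPR for that case.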
 109
Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:
 111
    setvli VL=8    : setvl r0, r0, VL=8
    setmvli MVL=8  : setvl r0, r0, MVL=8
 114
 115 Additional pseudo-op for obtaining VL without modifying it (or any state):
 116
 117     getvl r5       : setvl r5, r0, vf=0, vs=0, ms=0
 118
 119 For Vertical-First mode, a pseudo-op for explicit incrementing
 120 of srcstep and dststep:
 121
 122     svfstep.        : setvl. 0, 0, vf=1, vs=0, ms=0
 123
This pseudo-op is different from [[sv/svstep]], which is used to
perform detailed enquiries about internal state.
 126
 127 Note that whilst it is possible to set both MVL and VL from the same
 128 immediate, it is not possible to set them to different immediates in
 129 the same instruction.  Doing so would require two instructions.
 130
 131 **Selecting sources for VL**
 132
There is considerable opcode pressure; consequently the source from which
VL is set is selected by whether RA and RT are zero, as follows:
 135
 136 | condition           | effect         |
 137 | - | - |
 138 | `vs=1, RA=0, RT!=0` | VL,RT set to MIN(MVL, CTR)  |
 139 | `vs=1, RA=0, RT=0`  | VL set to MIN(MVL, SVi+1)  |
 140 | `vs=1, RA!=0, RT=0` | VL set to MIN(MVL, RA)  |
 141 | `vs=1, RA!=0, RT!=0` | VL,RT set to MIN(MVL, RA)  |
 142
 143 The reasoning here is that the opportunity to set RT equal to the
 144 immediate `SVi+1` is sacrificed in favour of setting from CTR.
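
The selection table can be read as a small decision function. The Python sketch below is one interpretation for the `vs=1` case only. The name `select_vl`, the dictionary-based register file, and passing MVL and CTR in as plain integers are assumptions made for illustration.

```python
def select_vl(ra, rt, svi, mvl, ctr, gpr):
    """Return (new_vl, rt_is_written) for vs=1.

    ra and rt are register numbers; gpr maps register number to value.
    """
    if ra == 0 and rt != 0:
        return min(mvl, ctr), True        # VL and RT come from CTR
    if ra == 0 and rt == 0:
        return min(mvl, svi + 1), False   # VL comes from the immediate
    return min(mvl, gpr[ra]), rt != 0     # VL comes from RA; RT optional


gpr = {3: 1000}   # hypothetical register contents
```

For example, `select_vl(ra=3, rt=4, svi=0, mvl=64, ctr=0, gpr=gpr)` clamps the 1000 held in r3 to an MVL of 64 and reports that RT is written.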
 145
 146 # Vertical First Mode
 147
 148 Vertical First is effectively like an implicit single bit predicate
 149 applied to every SVP64 instruction.  **ONLY** one element in each
 150 SVP64 Vector instruction is executed; srcstep and dststep do **not**
 151 increment, and the Program Counter progresses **immediately** to
 152 the next instruction just as it would for any standard scalar v3.0B
 153 instruction.
 154
 155 An explicit mode of setvl is called which can move srcstep and
 156 dststep on to the next element, still respecting predicate
 157 masks.
 158
In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**, and then an explicit instruction is used to
advance srcstep/dststep. An outer loop (using a branch instruction)
is expected to be used to complete a series of
Vector operations.
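
The difference in execution order can be shown with a short Python sketch that lists (instruction, step) pairs in the order they would be executed. It is a simplified model: predication, REMAP and separate srcstep/dststep values are ignored, and the function names are hypothetical.

```python
def horizontal_first(program, vl):
    """Each instruction runs over all elements before the PC advances."""
    return [(insn, step) for insn in program for step in range(vl)]


def vertical_first(program, vl):
    """The PC runs through the whole loop body with one step value;
    the step then advances (the role of the explicit step instruction)
    and the outer loop branches back to the start of the body."""
    return [(insn, step) for step in range(vl) for insn in program]


program = ["ld", "add", "st"]   # hypothetical three-instruction loop body
horizontal = horizontal_first(program, vl=2)
vertical = vertical_first(program, vl=2)
```

Both orders execute the same set of element operations; only the nesting of the two loops is swapped.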
 167
`svfstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loops reaches an end condition, or if VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0 (when Rc=1).
 173
 174 When RT is not zero, an internal stepping index may also be returned,
 175 either the REMAP index or srcstep or dststep. This table is identical
 176 to that of [[sv/svstep]]:
 177
 178 * `SVi=1`: also include inner middle and outer
 179   loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
 180 * `SVi=2`: test SVSTATE1 (and return conditions)
 181 * `SVi=3`: test SVSTATE2 (and return conditions)
 182 * `SVi=4`: test SVSTATE3 (and return conditions)
 183 * `SVi=5`: `SVSTATE.srcstep` is returned.
 184 * `SVi=6`: `SVSTATE.dststep` is returned.
 185
 186 Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.
 187
*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions; however, SVSTATE (and any associated SVSTATE) should be stored on the stack.*
 191
 192 **SUBVL**
 193
Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if it were a single element.  Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
due to the order in which VL and SUBVL loops are applied being
swapped (outer-inner becomes inner-outer).
 199
 200 # Examples
 201
 202 ## Core concept loop
 203
 204 ```
 205 loop:
 206     setvl a3, a0, MVL=8    #  update a3 with vl
 207                            # (# of elements this iteration)
 208                            # set MVL to 8
 209     # do vector operations at up to 8 length (MVL=8)
 210     # ...
 211     sub a0, a0, a3   # Decrement count by vl
 212     bnez a0, loop    # Any more?
 213 ```
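
The arithmetic of this loop (often called strip-mining) can be modelled in a few lines of Python. This is a sketch of the counting only, with a hypothetical generator name `strip_mine`; it assumes a non-negative element count and does not model the vector operations themselves.

```python
def strip_mine(count, mvl):
    """Yield the VL used on each pass until no elements remain."""
    if count < 0 or mvl < 1:
        raise ValueError("count must be >= 0 and mvl >= 1")
    while count != 0:
        vl = min(count, mvl)   # models setvl: VL = MIN(requested, MVL)
        yield vl
        count -= vl            # models: sub a0, a0, a3


chunks = list(strip_mine(20, mvl=8))
```

Every pass except possibly the last runs at the full MVL, and the final pass picks up the remainder without needing a separate scalar tail loop.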
 214
 215 ## Loop using Rc=1
 216
 217     my_fn:
 218       li r3, 1000
 219       b test
 220     loop:
 221       sub r3, r3, r4
 222       ...
 223     test:
 224       setvli. r4, r3, MVL=64
 225       bne cr0, loop
 226     end:
 227       blr
 228