[[!tag standards]]

# OpenPOWER SV setvl/setvli

See links:

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=535>
* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
* <https://bugs.libre-soc.org/show_bug.cgi?id=862> VF Predication
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
* [[sv/svstep]]
* pseudocode [[openpower/isa/simplev]]

Use of setvl results in changes to the MVL, VL and STATE SPRs. See [[sv/sprs]].

# Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
just like RVV. However unlike RVV, SV sits on top of the standard Scalar
regfiles: there is no separate Vector register numbering. Therefore, also
unlike RVV, SV does not have hard-coded "Lanes": microarchitects
may use *ordinary* in-order, out-of-order, or superscalar designs
as the basis for SV. By contrast, the relevant parameter
in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" If the amount is too small, performance may
be affected; if it is too large, other registers would overlap and
cause data corruption, or, even if allocated correctly, would require
spill to memory.

The answer effectively needs to be parameterised. Hence: MAXVL (MVL)
is set from an immediate, so that the compiler may decide, statically, a
guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL is a hardware limit, SV's MVL is simply a loop
optimisation. It does not carry side-effects for the architecture, though for
a specific CPU it may affect hardware unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist. RVV permits the `setvl` instruction to
set VL to an arbitrary value chosen by the hardware. Given that RVV only
works on Vector Loops, this is fine and part of its value and design.
In SV, by contrast, VL **MUST** be set to the requested value, within the
limit of MVL. This is because SV sits on top
of the standard register files. When MVL=VL=2, a Vector Add on `r3`
will perform two Scalar Adds: one on `r3` and one on `r4`.
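
The MVL=VL=2 case can be illustrated with a minimal Python sketch. It assumes
that all three operands are marked as Vectors (SVP64 also allows Scalar
operands, which are not modelled here), and the names `sv_add` and `regs` are
purely illustrative, not part of the specification.

```python
def sv_add(regs, rt, ra, rb, vl):
    """Model a Vectorised add as `vl` Scalar adds on consecutive registers.

    Assumption: RT, RA and RB are all Vectors, so each register number
    advances by one per element.
    """
    for i in range(vl):
        regs[rt + i] = regs[ra + i] + regs[rb + i]
    return regs


# Hypothetical regfile: register n initially holds the value n.
regs = list(range(32))

# A Vector Add on r3 with VL=2 writes r3 and r4, and nothing else.
sv_add(regs, rt=3, ra=10, rb=20, vl=2)
```

Because the destination is simply the next Scalar register, choosing VL too
large would silently overwrite whatever the program keeps in `r5` onwards,
which is why the allocation has to be decided statically.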

Thus there is the opportunity to set VL to an explicit value (within the
limits of MVL) with the reasonable expectation that if two operations
are requested (by setting VL=2) then two operations are guaranteed.
This avoids the need for a loop (with not-insignificant use of the
regfiles for counters); two instructions suffice:

    setvli r0, MVL=64, VL=64
    ld r0.v, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside, this is *guaranteed* 100% without fail to perform
64 unit-strided LDs starting from the address pointed to by r30 and put
the contents into r0 through r63. Thus it becomes a "LOAD-MULTI". Twin
Predication could even be used to only load relevant registers from
the stack. This *only works if VL is set to the requested value* rather
than, as in RVV, allowing the hardware to set VL to an arbitrary value
(the caveat being that it is limited to not exceed MVL).
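
As a rough illustration of the "LOAD-MULTI" effect, the following Python
sketch models the Vectorised `ld` as VL Scalar doubleword loads. The
little-endian layout, the 8-byte element width and the helper name `sv_ld`
are assumptions made for the example only. Page faults and predication are
not modelled.

```python
import struct


def sv_ld(regs, mem, rt, base_addr, vl, elwidth=8):
    """Model a unit-strided Vectorised `ld` as `vl` Scalar loads.

    Assumptions: 8-byte elements, little-endian memory, no faults.
    """
    for i in range(vl):
        offset = base_addr + i * elwidth
        (regs[rt + i],) = struct.unpack_from("<Q", mem, offset)
    return regs


# Hypothetical memory image holding 64 doublewords: 1000, 1001, ...
mem = bytearray(64 * 8)
for i in range(64):
    struct.pack_into("<Q", mem, i * 8, 1000 + i)

regs = [0] * 64
sv_ld(regs, mem, rt=0, base_addr=0, vl=64)   # fills r0 through r63
```

The point of the sketch is that exactly `vl` loads happen: if the hardware
were free to pick a smaller VL, some of `r0` through `r63` would be left
unloaded.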

Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops.

# Format

*(Allocation of opcode TBD pending OPF ISA WG approval)*,
using EXT22 temporarily and fitting into the
[[sv/bitmanip]] space

Form: SVL-Form (see [[isatables/fields.text]])

| 0..5 | 6..10 | 11..15 | 16..22 | 23..25   | 26..30 | 31 | name  |
| ---- | ----- | ------ | ------ | -------- | ------ | -- | ----- |
| OPCD | RT    | RA     | SVi    | ms vs vf | 11011  | Rc | setvl |

Instruction format:

    setvl  RT,RA,SVi,vf,vs,ms
    setvl. RT,RA,SVi,vf,vs,ms

Note that the immediate (`SVi`) spans 7 bits (16 to 22).

* `ms` - bit 23 - allows for setting of MVL.
* `vs` - bit 24 - allows for setting of VL.
* `vf` - bit 25 - sets "Vertical First Mode".

Note that in immediate setting mode VL and MVL start from **one**,
i.e. an immediate value of zero will result in VL/MVL being set to 1,
and 0b111111 results in VL/MVL being set to 64. This is because setting
VL/MVL to 1 results in "scalar identity" behaviour, whereas setting VL/MVL
to 0 would result in all Vector operations becoming `nop`. If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].
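
The field layout and the off-by-one immediate can be sketched in Python as
follows. This is only an illustration: the major opcode value 22 is an
assumption based on the temporary use of EXT22, the function names are
hypothetical, and the range check reflects only what a 7-bit field can hold
rather than any architectural limit on VL/MVL.

```python
XO_SETVL = 0b11011    # bits 26..30, taken from the table above
OPCD_EXT22 = 22       # assumption: temporary EXT22 major opcode


def encode_setvl(rt, ra, vl_or_mvl, ms, vs, vf, rc=0, opcd=OPCD_EXT22):
    """Pack SVL-Form fields into a 32-bit word (bit 0 is the MSB)."""
    if not 1 <= vl_or_mvl <= 128:
        raise ValueError("a 7-bit SVi can only encode the values 1..128")
    svi = vl_or_mvl - 1                       # immediate 0 means VL/MVL = 1
    word = (opcd << 26) | ((rt & 0x1F) << 21) | ((ra & 0x1F) << 16)
    word |= svi << 9                          # bits 16..22
    word |= (ms & 1) << 8                     # bit 23
    word |= (vs & 1) << 7                     # bit 24
    word |= (vf & 1) << 6                     # bit 25
    word |= (XO_SETVL << 1) | (rc & 1)        # bits 26..30 and bit 31
    return word


def decode_svi(word):
    """Recover the VL/MVL value (1-based) from an encoded word."""
    return ((word >> 9) & 0x7F) + 1


# Field values corresponding to setting VL=8 with RT=r5 and RA=0.
word = encode_setvl(rt=5, ra=0, vl_or_mvl=8, ms=0, vs=1, vf=0)
```

Storing the value minus one means the all-zeros immediate still gives the
"scalar identity" behaviour rather than turning every Vector operation into
a `nop`.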

Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:

    setvli VL=8   : setvl r5, r0, VL=8
    setmvli MVL=8 : setvl r0, r0, MVL=8

Additional pseudo-op for obtaining VL without modifying it:

    getvl r5 : setvl r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing
of srcstep and dststep:

    svstep. : setvl. 0, 0, vf=1, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction. That would require two instructions.

# Vertical First Mode

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction. **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
instruction.

An explicit mode of setvl can be called which moves srcstep and
dststep on to the next element, still respecting predicate
masks.

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**. An explicit instruction is then used to
advance srcstep/dststep, and an outer loop (using a branch
instruction) is expected to complete the series of
Vector operations.
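
The difference in execution order can be shown with a small Python sketch.
It is a simplification under stated assumptions: a single `step` stands in
for both srcstep and dststep, predication and REMAP are ignored, VL is at
least 1, and the function names are illustrative only.

```python
def horizontal_first(instructions, vl):
    """Normal SVP64: each instruction loops over all elements in turn."""
    order = []
    for insn in instructions:
        for step in range(vl):
            order.append((insn, step))
    return order


def vertical_first(instructions, vl):
    """Vertical-First: all instructions run on one element, then step."""
    order = []
    step = 0
    while True:
        for insn in instructions:       # PC moves on immediately
            order.append((insn, step))
        step += 1                       # the explicit step instruction
        if step >= vl:                  # end condition: outer loop exits
            break
    return order


h_order = horizontal_first(["add", "mul"], vl=3)
v_order = vertical_first(["add", "mul"], vl=3)
```

Both orderings visit the same set of (instruction, element) pairs; only the
nesting of the two loops is swapped.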

`svstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loops reaches an end condition, or whether VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0.

* setvl immediate = 1: only VL testing is enabled. CR0.SO is set
  to 1 when either srcstep or dststep reaches VL
* setvl immediate = 2: also include inner, middle and outer
  loop end conditions from SVSTATE0 into CR.EQ, CR.LT and CR.GT
* setvl immediate = 3: test SVSTATE1
* setvl immediate = 4: test SVSTATE2
* setvl immediate = 5: test SVSTATE3

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions; however, SVSTATE (and any associated SVSTATE) should be stored on the stack.*

**SUBVL**

Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if it were the "single element". Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled.

# Pseudocode

    // instruction fields:
    rt = get_rt_field();         // bits 6..10
    ra = get_ra_field();         // bits 11..15
    ms = get_ms_field();         // bit 23
    vs = get_vs_field();         // bit 24
    vf = get_vf_field();         // bit 25
    Rc = get_Rc_field();         // bit 31

    if vf and not vs and not ms {
        // increment src/dest step mode
        // NOTE! this is in no way complete! predication is not included
        // and neither is SUBVL mode
        srcstep = SPR[SV].srcstep
        dststep = SPR[SV].dststep
        VL = SPR[SV].VL
        srcstep++
        dststep++
        rollover = (srcstep == VL or dststep == VL)
        if rollover {
            // Reset srcstep, dststep, and also exit "Vertical First" mode
            srcstep = 0
            dststep = 0
            MSR[6] = 0
        }
        SPR[SV].srcstep = srcstep
        SPR[SV].dststep = dststep

        // write CR? helps for doing Vertical loops, detects end
        // of Vector Elements
        if Rc == 1 {
            // update CR to indicate that srcstep/dststep "rolled over"
            CR0.eq = rollover
        }
    } else {
        // add one. MVL/VL=1..64 not 0..63
        vlimmed = get_immed_field()+1; // 16..22

        // set VL (or not).
        // 4 options: from SPR, from immed, from ra, from CTR
        // (the CTR option is not shown in this pseudocode)
        if vs {
            // VL to be sourced from fields/regs
            if ra != 0 {
                VL = GPR[ra]
            } else {
                VL = vlimmed
            }
        } else {
            // VL not to change (except if MVL is reduced)
            // read from SPRs
            VL = SPR[SV_VL]
        }

        // set MVL (or not).
        // 2 options: from SPR, from immed
        if ms {
            MVL = vlimmed
        } else {
            // MVL not to change, read from SPRs
            MVL = SPR[SV_MVL]
        }

        // calculate (limit) VL
        VL = min(VL, MVL)

        // store VL, MVL
        SVSTATE.VL = VL
        SVSTATE.MVL = MVL

        // write rt
        if rt != 0 {
            // rt is not zero
            GPR[rt] = VL;
        }
        // write CR?
        if Rc == 1 {
            // update CR from VL (not rt)
            CR0.eq = (VL == 0)
            ...
            ...
        }
        // write Vertical-First mode
        SVSTATE.vf = vf
    }
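
For experimentation, the `else` branch of the pseudocode above can be
transcribed into a short runnable Python model. This is a sketch only: the
`SVState` class and the `setvl` function signature are hypothetical, the
svstep branch and the CTR source are deliberately left out, and only the
CR0.eq bit is returned.

```python
from dataclasses import dataclass


@dataclass
class SVState:
    """Hypothetical stand-in for the VL, MVL and vf fields of SVSTATE."""
    vl: int = 0
    mvl: int = 0
    vf: int = 0


def setvl(state, gpr, rt, ra, svi, ms, vs, vf, rc=0):
    """Model of the non-svstep branch; returns CR0.eq, or None if Rc=0."""
    if vf and not vs and not ms:
        raise NotImplementedError("svstep mode is not modelled here")
    vlimmed = svi + 1                   # immediate is stored minus one
    if vs:
        vl = gpr[ra] if ra != 0 else vlimmed
    else:
        vl = state.vl                   # VL unchanged unless MVL shrinks
    mvl = vlimmed if ms else state.mvl
    vl = min(vl, mvl)                   # VL may never exceed MVL
    state.vl, state.mvl, state.vf = vl, mvl, vf
    if rt != 0:
        gpr[rt] = vl
    return (vl == 0) if rc else None


state = SVState()
gpr = [0] * 32
gpr[3] = 1000
# Request 1000 elements with MVL=64: VL is clamped to 64 and copied to r4.
cr0_eq = setvl(state, gpr, rt=4, ra=3, svi=63, ms=1, vs=1, vf=0, rc=1)
```

Note that with `vs=0` and `ms=1` the existing VL is still clamped, which
matches the pseudocode comment "VL not to change (except if MVL is reduced)".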

# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    # update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3         # Decrement count by vl
    bnez a0, loop          # Any more?
```
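
The same strip-mining pattern, written as a Python sketch for clarity. The
function name `strip_mine` is illustrative. Unlike the assembly loop, which
always executes its body at least once, this version performs no iterations
when the count starts at zero.

```python
def strip_mine(n, mvl=8):
    """Split n elements into chunks of at most mvl, as the loop above does.

    Returns a list of (start_index, vl) pairs, one per loop iteration.
    """
    chunks = []
    start = 0
    while n != 0:
        vl = min(n, mvl)            # setvl a3, a0, MVL=8
        chunks.append((start, vl))  # vector operations on vl elements
        start += vl
        n -= vl                     # sub a0, a0, a3
    return chunks


chunks = strip_mine(20)
```

For 20 elements and MVL=8 this gives two full chunks of 8 followed by a
final chunk of 4, with no separate tail-handling code.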

## Loop using Rc=1

    my_fn:
        li r3, 1000
        b test
    loop:
        sub r3, r3, r4
        ...
    test:
        setvli. r4, r3, MVL=64
        bne cr0, loop
    end:
        blr