[[!tag standards]]

# DRAFT setvl/setvli

See links:

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=535>
* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
* <https://bugs.libre-soc.org/show_bug.cgi?id=927> bug - RT>=32
* <https://bugs.libre-soc.org/show_bug.cgi?id=862> VF Predication
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
* [[sv/svstep]]
* pseudocode [[openpower/isa/simplev]]

Use of setvl results in changes to the SVSTATE SPR. See [[sv/sprs]].

# Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
just like RVV. However, unlike RVV, SV sits on top of the standard Scalar
regfiles: there is no separate Vector register numbering. Therefore, also
unlike RVV, SV does not have hard-coded "Lanes": microarchitects
may use *ordinary* in-order, out-of-order, or superscalar designs
as the basis for SV. By contrast, the relevant parameter
in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore, when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" If the amount is too small, performance may
be affected; if it is too large, other registers would overlap and
cause data corruption, or, even if allocated correctly, would require
spill to memory.

The answer effectively needs to be parameterised. Hence: MAXVL (MVL)
is set from an immediate, so that the compiler may decide, statically, a
guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL is a hardware limit, SV's MVL is simply a loop
optimization. It does not carry side-effects for the architecture, though for
a specific CPU it may affect hardware unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist. RVV permits the hardware to set VL to an
arbitrary value when `setvl` is executed. Given that RVV only works on Vector Loops,
this is fine and part of its value and design. However, SV sits on top
of the standard register files, so in SV, within the limit of MVL, VL
**MUST** be set to the requested value. When MVL=VL=2, a Vector Add on `r3`
will perform two Scalar Adds: one on `r3` and one on `r4`.
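
The VL=2 case above can be modelled in a few lines of Python. This is an illustrative sketch only: the `regs` list and the `sv_add` helper are hypothetical names, all three operands are assumed to be marked as Vectors, and 64-bit wrap-around is ignored.

```python
def sv_add(regs, rt, ra, rb, vl):
    """Model a Vectorised add as VL consecutive scalar adds.

    Assumption: RT, RA and RB are all Vectors, so each register
    number advances by one per element.
    """
    for i in range(vl):
        regs[rt + i] = regs[ra + i] + regs[rb + i]


# Hypothetical register file in which r[n] initially holds the value n.
regs = list(range(32))
sv_add(regs, rt=3, ra=10, rb=20, vl=2)
print(regs[3], regs[4])  # 30 32
```

With VL=2 exactly two scalar adds take place (targeting `r3` and `r4`); `r5` onwards is untouched.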

Thus there is the opportunity to set VL to an explicit value (within the
limits of MVL) with the reasonable expectation that if two operations
are requested (by setting VL=2) then two operations are guaranteed.
This avoids the need for a loop (with not-insignificant use of the
regfiles for counters): only two instructions are needed.

    setvli r0, MVL=64, VL=64
    sv.ld *r0, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside, this is *guaranteed*, 100% without fail, to perform
64 unit-strided LDs starting from the address pointed to by r30 and put
the contents into r0 through r63. Thus it becomes a "LOAD-MULTI". Twin
Predication could even be used to only load relevant registers from
the stack. This *only works if VL is set to the requested value* rather
than, as in RVV, allowing the hardware to set VL to an arbitrary value
(due to variances in implementation choices).

Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops. A caveat: to avoid having an extra opcode
bit in `setvl`, selection of CTR mode is slightly convoluted.

# Format

*(Allocation of opcode TBD pending OPF ISA WG approval)*,
using EXT22 temporarily and fitting into the
[[sv/bitmanip]] space.

Form: SVL-Form (see [[isatables/fields.text]])

| 0.5|6.10|11.15|16.22| 23.25    | 26.30 |31| name    |
| -- | -- | --- | --- | -------- | ----- |--| ------- |
|OPCD| RT | RA  | SVi | ms vs vf | 11011 |Rc| setvl   |

Instruction format:

    setvl  RT,RA,SVi,vf,vs,ms
    setvl. RT,RA,SVi,vf,vs,ms

Note that the immediate (`SVi`) spans 7 bits (16 to 22).

* `ms` - bit 23 - allows for setting of MVL
* `vs` - bit 24 - allows for setting of VL
* `vf` - bit 25 - sets "Vertical First Mode".
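
The field layout in the table can be sketched as a bit-packing routine. This is a hedged illustration, not an assembler: `field` and `encode_setvl` are hypothetical helpers, bits are assumed to be numbered MSB0 (bit 0 is the most significant bit of the 32-bit word, as is usual for Power ISA), and `OPCD=22` simply reflects the temporary use of EXT22 stated above. The `svi_field` argument is the raw 7-bit field value.

```python
def field(value, start, end):
    """Place `value` into MSB0-numbered bits start..end of a 32-bit word."""
    width = end - start + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return value << (31 - end)


def encode_setvl(rt, ra, svi_field, ms, vs, vf, rc=0, opcd=22, xo=0b11011):
    """Pack the SVL-Form fields (opcode allocation is provisional)."""
    return (field(opcd, 0, 5) | field(rt, 6, 10) | field(ra, 11, 15)
            | field(svi_field, 16, 22)
            | field(ms, 23, 23) | field(vs, 24, 24) | field(vf, 25, 25)
            | field(xo, 26, 30) | field(rc, 31, 31))


word = encode_setvl(rt=0, ra=0, svi_field=7, ms=0, vs=1, vf=0)
print(hex(word))  # 0x58000eb6
```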

Note that in immediate setting mode VL and MVL start from **one**,
but that this is compensated for in the assembly notation:
an immediate value of 1 in assembler notation
actually places the value 0b0000000 in the `SVi` field bits.
On execution the `setvl` instruction adds one to the decoded
`SVi` field bits, resulting in
VL/MVL being set to 1. This allows VL to be set to values
ranging from 1 to 128 with only 7 bits instead of 8.
Setting VL/MVL
to 0 would result in all Vector operations becoming `nop`. If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].
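
The off-by-one relationship between the assembler immediate and the `SVi` field can be written out as a pair of small functions. The names `encode_svi` and `decode_svi` are hypothetical and exist only for this sketch.

```python
def encode_svi(imm):
    """Assembler side: map an immediate in 1..128 to the 7-bit SVi field."""
    if not 1 <= imm <= 128:
        raise ValueError("immediate must be in the range 1..128")
    return imm - 1


def decode_svi(svi_field):
    """Execution side: add one to the decoded 7-bit SVi field."""
    return (svi_field & 0x7F) + 1


print(encode_svi(1), encode_svi(128))      # 0 127
print(decode_svi(0b0000000), decode_svi(0b1111111))  # 1 128
```

A value of zero is deliberately not representable, which is why zeroing VL or MVL has to go through the SVSTATE SPR.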

Note that `setmvli` is a pseudo-op, based on RA/RT=0, and `setvli` likewise:

    setvli   VL=8  : setvl  r0, r0, VL=8, vf=0, vs=1, ms=0
    setvli.  VL=8  : setvl. r0, r0, VL=8, vf=0, vs=1, ms=0
    setmvli  MVL=8 : setvl  r0, r0, MVL=8, vf=0, vs=0, ms=1
    setmvli. MVL=8 : setvl. r0, r0, MVL=8, vf=0, vs=0, ms=1

Additional pseudo-op for obtaining VL without modifying it (or any state):

    getvl  r5 : setvl  r5, r0, vf=0, vs=0, ms=0
    getvl. r5 : setvl. r5, r0, vf=0, vs=0, ms=0

This pseudo-op is different from [[sv/svstep]], which is used to
perform detailed enquiries about internal state.

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction. Doing so would require two instructions.
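
The pseudo-op expansions listed above can be summarised as a lookup. This is a sketch under stated assumptions: `expand_pseudo_op` is a hypothetical helper, the returned dictionary is just a convenient way of naming the `setvl` fields, and the `SVi` field for `getvl` (which the listing does not specify) is assumed to be zero.

```python
def expand_pseudo_op(mnemonic, operand):
    """Expand setvli/setmvli/getvl (with optional '.') into setvl fields.

    `operand` is the immediate for setvli/setmvli, or the RT register
    number for getvl.
    """
    rc = 1 if mnemonic.endswith(".") else 0
    base = mnemonic.rstrip(".")
    if base == "setvli":
        return dict(RT=0, RA=0, SVi=operand - 1, vf=0, vs=1, ms=0, Rc=rc)
    if base == "setmvli":
        return dict(RT=0, RA=0, SVi=operand - 1, vf=0, vs=0, ms=1, Rc=rc)
    if base == "getvl":
        # Assumption: SVi is unused here and left as zero.
        return dict(RT=operand, RA=0, SVi=0, vf=0, vs=0, ms=0, Rc=rc)
    raise ValueError(f"unknown pseudo-op: {mnemonic}")


print(expand_pseudo_op("setvli", 8))
print(expand_pseudo_op("getvl.", 5))
```

Setting MVL and VL to different immediates is then a matter of expanding `setmvli` followed by `setvli`, i.e. two instructions.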

**Selecting sources for VL**

There is considerable opcode pressure; consequently, setting MVL and VL
from different sources is done as follows:

| condition | effect |
| - | - |
| `vs=1, RA=0, RT!=0` | VL,RT set to MIN(MVL, CTR) |
| `vs=1, RA=0, RT=0` | VL set to MIN(MVL, SVi+1) |
| `vs=1, RA!=0, RT=0` | VL set to MIN(MVL, RA) |
| `vs=1, RA!=0, RT!=0` | VL,RT set to MIN(MVL, RA) |

The reasoning here is that the opportunity to set RT equal to the
immediate `SVi+1` is sacrificed in favour of setting from CTR.
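
The four rows of the table can be expressed as one selection function. This is a minimal sketch, not the normative pseudocode (for that see [[openpower/isa/simplev]]): `select_vl` is a hypothetical name, `gpr` stands in for the integer register file, and `mvl` is assumed to be whatever MVL is in effect for the instruction.

```python
def select_vl(vs, ra, rt, svi_field, mvl, ctr, gpr):
    """Return (new_vl, write_rt) following the source-selection table.

    new_vl is None when vs=0, i.e. VL is not being set at all.
    """
    if not vs:
        return None, False
    if ra == 0:
        if rt != 0:
            return min(mvl, ctr), True           # CTR mode: VL and RT set
        return min(mvl, svi_field + 1), False    # immediate mode: VL only
    return min(mvl, gpr[ra]), rt != 0            # register mode


gpr = [0] * 32
gpr[3] = 1000
print(select_vl(vs=1, ra=0, rt=4, svi_field=0, mvl=64, ctr=10, gpr=gpr))
print(select_vl(vs=1, ra=3, rt=4, svi_field=0, mvl=64, ctr=10, gpr=gpr))
```

The first call selects CTR mode and yields `(10, True)`; the second reads `r3` and clamps it to MVL, yielding `(64, True)`.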

# Unusual Rc=1 behaviour

Normally, the return result from an instruction is in `RT`. With
`RT=0` taking part in source selection (see the table above) rather than
simply naming a destination register, some different semantics are needed.

CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
overflow may occur: `VL`, if set either from an immediate or from `CTR`,
may not exceed `MAXVL`, and if the requested value does, `CR0.SO` must be set.

Additionally, in reality it is **`VL`** being set. Therefore, rather
than `CR0` testing `RT` when `Rc=1`, CR0.EQ is set if `VL=0`, and CR0.GT
is set if `VL` is non-zero.
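
The `Rc=1` rules can be sketched as follows. The helper name `setvl_cr0` is hypothetical, and several points are assumptions rather than statements from this page: the "non-zero" bit is modelled as `GT` (the standard CR field bit name), `LT` is assumed to stay clear, and `SO` is set whenever the requested value exceeds MVL.

```python
def setvl_cr0(requested, mvl):
    """Return (vl, cr0) where cr0 tests the resulting VL rather than RT."""
    vl = min(requested, mvl)
    cr0 = {
        "LT": 0,                      # assumption: never set by setvl.
        "GT": int(vl != 0),
        "EQ": int(vl == 0),
        "SO": int(requested > mvl),   # overflow: request clamped to MVL
    }
    return vl, cr0


print(setvl_cr0(1000, 64))
print(setvl_cr0(0, 64))
```

The first call clamps VL to 64 and reports overflow; the second yields VL=0 with only `EQ` set, which is the condition a loop would branch on to exit.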

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions; however, SVSTATE (and any associated SVSTATE) should be stored on the stack.*

**SUBVL**

Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if it were the "single element". Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
due to the order in which VL and SUBVL loops are applied being
swapped (outer-inner becomes inner-outer).
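
The loop-order swap can be visualised by listing the order in which (element, sub-element) pairs are visited. This sketch assumes that the normal order has VL as the outer loop and SUBVL as the inner loop, consistent with treating each vec2/3/4 as one element; `element_order` is a hypothetical helper.

```python
def element_order(vl, subvl, swapped=False):
    """List (element, subelement) pairs in visiting order."""
    if not swapped:
        # Assumed normal order: VL outer, SUBVL inner.
        return [(i, j) for i in range(vl) for j in range(subvl)]
    # Pack/Unpack enabled: SUBVL outer, VL inner.
    return [(i, j) for j in range(subvl) for i in range(vl)]


print(element_order(2, 2))                # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(element_order(2, 2, swapped=True))  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```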

# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    # update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3         # Decrement count by vl
    bnez a0, loop          # Any more?
```
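
The same strip-mining pattern, modelled in Python to show how the element count is consumed. `strip_mine` is a hypothetical name; unlike the assembly loop, which always runs its body at least once, this sketch skips the body entirely when the count starts at zero.

```python
def strip_mine(count, mvl=8):
    """Return the VL chosen on each pass over `count` elements."""
    vls = []
    while count != 0:
        vl = min(count, mvl)   # setvl: VL = MIN(count, MVL)
        vls.append(vl)         # vector operations on `vl` elements go here
        count -= vl            # sub a0, a0, a3
    return vls


print(strip_mine(20))  # [8, 8, 4]
```

Only the final pass uses a VL smaller than MVL, so no separate scalar tail loop is needed.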

## Loop using Rc=1

    my_fn:
      li r3, 1000
      b test
    loop:
      sub r3, r3, r4          # decrement remaining count by VL
      ...
    test:
      setvl. r4, r3, MVL=64   # r4 = VL; CR0.EQ set when VL is zero
      bne cr0, loop
    end:
      blr
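
A Python trace of the control flow above shows how the branch on CR0 ends the loop. The function name `rc1_loop` is hypothetical, and the sketch assumes that `setvl.` sets `r4` to `MIN(r3, MVL)` and sets CR0.EQ when the resulting VL is zero.

```python
def rc1_loop(total=1000, mvl=64):
    """Return the VL used on each pass through the loop body."""
    r3 = total
    passes = []
    while True:
        r4 = min(r3, mvl)      # test: setvl. r4, r3, MVL=64
        cr0_eq = (r4 == 0)     # CR0.EQ reflects VL == 0
        if cr0_eq:             # bne cr0, loop is not taken: fall through
            break
        r3 -= r4               # loop: sub r3, r3, r4
        passes.append(r4)      # "..." vector operations would go here
    return passes


trace = rc1_loop()
print(len(trace), trace[-1])  # 16 40
```

For 1000 elements with MVL=64 the body runs 16 times: fifteen passes at VL=64 and a final pass at VL=40.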

## Load/Store-Multi (selective)

Up to 64 FPRs will be loaded, here. `r3` has one bit set
for each FP register required to be loaded. The block of memory
from which the registers are loaded is contiguous (no gaps):
any FP register which has a corresponding zero bit in `r3`
is *unaltered*. In essence this is a selective LD-multi with
"Scatter" capability.

    setvli r0, MVL=64, VL=64
    sv.fld/dm=r3 *r0, 0(r30) # selective load 64 FP registers
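
The effect of the destination mask can be modelled as below. This is a sketch under assumptions: `selective_load` is a hypothetical helper, memory is represented as a plain list of already-decoded values, and bit `i` of the mask (counting from the least significant bit) is assumed to correspond to FP register `i`.

```python
def selective_load(fprs, memory, mask, vl=64):
    """Load contiguous `memory` values into the FPRs selected by `mask`.

    The source index advances only when a register is actually loaded,
    so memory is consumed without gaps. Returns the number of loads.
    """
    src = 0
    for dst in range(vl):
        if (mask >> dst) & 1:
            fprs[dst] = memory[src]
            src += 1
        # registers with a zero mask bit are left unaltered
    return src


fprs = [0.0] * 64
loaded = selective_load(fprs, [1.5, 2.5, 3.5], mask=0b100101)
print(loaded, fprs[0], fprs[2], fprs[5])  # 3 1.5 2.5 3.5
```

Three consecutive memory values end up in registers 0, 2 and 5, which is the "Scatter" behaviour described above.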

Up to 64 FPRs will be saved, here. Again, `r3` has one bit set for each
FP register required to be saved.

    setvli r0, MVL=64, VL=64
    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers