[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is a type of permute shorthand allowing arbitrary selection
of elements from a vec2/3/4 to create a new vec2/3/4.
Its value lies in the high occurrence of Swizzle
in 3D Shader Binaries (over 10% of all instructions).
Swizzle is usually done on a per-vec-operand basis in 3D GPU ISAs, making
for extremely long instructions (64 bits or greater).
However, it is not practical to add two or more sets of 12-bit
prefixes into a single instruction.
A compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle as well as constants 1/1.0 and 0/0.0.

An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*. A contiguous
array of vec3 (XYZ) may have only 2 elements (ZY) swizzle-copied to
a contiguous array of vec2. A contiguous array of vec2 sources
may have each vec2 element (XY) copied several times to a contiguous
vec4 array (YYXX or XYXX). For this reason, *when Vectorised*,
Swizzle Moves support independent subvector lengths for both
source and destination.

Although conceptually similar to `vpermd` of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with only up to four
selectors, whereas VSX refers to individual bytes and may not
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and also often
use the letters "RGBA"
when referring to pixel data. These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.

As a standalone Scalar operation this instruction is valuable
if Prefixed with SVP64Single (providing Predication).
Combined with `cmpi` it synthesises Compare-and-Swap.

# Format

| 0.5 |6.10|11.15|16.27|28.31| name         | Form    |
|-----|----|-----|-----|-----|--------------|---------|
|PO   | RTp| RAp |imm  | 0011| mv.swiz      | DQ-Form |
|PO   | RTp| RAp |imm  | 1011| fmv.swiz     | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index. 3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|pixel  |R   | G  | B | A  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip". This is equivalent to predicate masking
* 0b001 subvector length end marker (length=4 if not present)
* 0b010 to indicate "constant 0" (or 0.0)
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN index 0 through 3 to copy from the subelement in position XYZW

In very simplistic terms the relationship between swizzle indices
(NN, above), source, and destination is:

    dest[i] = src[swiz[i]]

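The field layout and option codes above can be exercised with a small Python sketch. It assumes the Power ISA MSB0 bit-numbering convention for the immediate, so that X (bits 0 to 2) sits in the three most significant bits of the 12-bit value. The text notation (`.` for skip, `0` and `1` for the constants, a string shorter than four characters being closed by the 0b001 end marker) and the names `encode_swizzle` and `decode_swizzle` are illustrative assumptions, not part of any specification.

```python
# Illustrative sketch only: helper names and the text notation are assumptions.
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011

def encode_swizzle(text):
    """Turn e.g. "W.Y." or "Y1" into a 12-bit immediate (X in the top 3 bits)."""
    if not 1 <= len(text) <= 4:
        raise ValueError("swizzle must name 1 to 4 destination positions")
    fields = []
    for ch in text.upper():
        if ch == ".":
            fields.append(SKIP)
        elif ch == "0":
            fields.append(CONST0)
        elif ch == "1":
            fields.append(CONST1)
        else:
            fields.append(0b100 | "XYZW".index(ch))  # 0b1NN selector
    if len(fields) < 4:
        fields.append(END)                 # destination subvector ends here
    fields += [SKIP] * (4 - len(fields))   # unused trailing fields (assumed zero)
    imm = 0
    for f in fields:
        imm = (imm << 3) | f
    return imm

def decode_swizzle(imm):
    """Split a 12-bit immediate into the four 3-bit fields for X, Y, Z, W."""
    return [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]

print(hex(encode_swizzle("W.Y.")))          # 0xe28
print(decode_swizzle(encode_swizzle("Y1"))) # [5, 3, 1, 0]
```
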
Note that 8 options are needed (not 6) because option 0b001 encodes
the subvector length, and option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered"

    0    1    2    3
    X    Y    Z    W  source
         |         |
         +----+    |
         .    |    |
    +--------------+
    |    .    |    .
    W    .    Y    .  swizzle
    |    .    |    .
    |    Y    |    W  Y,W unmodified
    |    .    |    .
    W    Y    Y    W  dest

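The "W.Y." diagram can be reproduced in a few lines of Python. This is a sketch only: `apply_swizzle` is a hypothetical helper, subvectors are modelled as plain lists, and the swizzle is supplied as four already-decoded 3-bit fields.

```python
def apply_swizzle(src, dest, fields):
    """Return a copy of dest updated from src according to four 3-bit fields."""
    out = list(dest)
    for pos, f in enumerate(fields):
        if f == 0b000:        # skip: destination position left unaltered
            continue
        if f == 0b001:        # end marker: no further destination positions
            break
        if f == 0b010:
            out[pos] = 0
        elif f == 0b011:
            out[pos] = 1
        else:
            out[pos] = src[f & 0b11]   # 0b1NN selects source subelement NN
    return out

# "W.Y." as fields: W, skip, Y, skip (in-place, so dest starts as a copy of src)
vec = ["X", "Y", "Z", "W"]
print(apply_swizzle(vec, vec, [0b111, 0b000, 0b101, 0b000]))
# prints ['W', 'Y', 'Y', 'W']
```
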
**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would
be an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on a par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorised)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pairs of 64-bit registers:

| swizzle name | source | dest | half    |
|--------------|--------|------|---------|
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of the RT pair not covered by
the Swizzle is unmodified. For example a Swizzle of "..XY"
places source X and Y into destination positions Z and W: it
will copy the contents of RA into RT+1 but leave RT unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero. A Swizzle of "..XY" would
copy the contents of RA into RT+1, but set RT to zero.

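The two cases can be modelled with ordinary Python integers. The sketch below is an assumption-laden illustration rather than a reference model: it takes "lo-half" to mean the least significant 32 bits of each 64-bit register, covers only the integer form (constant 1 is the integer 1, not 1.0), and `scalar_mv_swiz` and its helpers are hypothetical names.

```python
# Sketch only: names and the bit placement of "lo-half" are assumptions.
MASK32 = 0xFFFFFFFF

def unpack_pair(first, second):
    """Split a 64-bit register pair into [X, Y, Z, W] 32-bit values."""
    return [first & MASK32, (first >> 32) & MASK32,
            second & MASK32, (second >> 32) & MASK32]

def pack_pair(vals):
    """Inverse of unpack_pair: returns (first register, second register)."""
    x, y, z, w = vals
    return (y << 32) | x, (w << 32) | z

def scalar_mv_swiz(ra_pair, fields, in_place):
    """Return the new (RT, RT+1) values; in_place models RA=RT."""
    src = unpack_pair(*ra_pair)            # both source registers read first
    dest = list(src) if in_place else [0, 0, 0, 0]
    for pos, f in enumerate(fields):
        if f == 0b000:                     # skip
            continue
        if f == 0b001:                     # end marker
            break
        if f in (0b010, 0b011):
            dest[pos] = f & 1              # constant 0 or constant 1
        else:
            dest[pos] = src[f & 0b11]
    return pack_pair(dest)

ra = (0x2222222211111111, 0x4444444433333333)
yx = [0b101, 0b100, 0b000, 0b000]          # "YX..": swap the halves of RA
print([hex(r) for r in scalar_mv_swiz(ra, yx, in_place=True)])
# ['0x1111111122222222', '0x4444444433333333']
print([hex(r) for r in scalar_mv_swiz(ra, yx, in_place=False)])
# ['0x1111111122222222', '0x0']
```

Copying the source into a list before any write mirrors the requirement that the Scalar variant reads both RA registers before writing the RT pair.
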
Also, making life easier, RT and RA are only permitted to be even
(no overlapping can occur). This makes RT (and RA) a "pair" exactly
as in `lq` and `stq`. Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.

Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.

**SVP64 Vectorised**

Vectorised Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
subvector length can be *different* from the source subvector
length, and consequently the destination subvector length is
encoded into the Swizzle.

When Vectorised, given that the use-case is a High-performance GPU,
the fundamental assumption is that Micro-coding or
another technique will
be deployed in hardware to issue multiple Scalar MV operations and
full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
Any overlap between Vectorised
source and destination Elements across the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.

This in turn implies that Traps and Exceptions are, as usual,
permitted in between element-level moves: because there
is no overlap there is no risk of destroying a source with
an overwrite. This is *unlike* the Scalar variant which, when
`RT=RA`, must buffer both halves of the RT pair.

Determining the source and destination subvector lengths is tricky.
Swizzle Pseudocode:

```
swiz[0] = imm[0:3]  # X
swiz[1] = imm[3:6]  # Y
swiz[2] = imm[6:9]  # Z
swiz[3] = imm[9:12] # W
# determine implied subvector length from Swizzle: the end marker
# occupies the field immediately after the last destination element
dst_subvl = 4
for i in range(4):
    if swiz[i] == 0b001:
        dst_subvl = i
        break
```

What is going on here is that the option is provided to have different
source and destination subvector lengths, by exploiting redundancy in
the Swizzle Immediate. With the Swizzles marking what goes into
each destination position, the marker "0b001" may be used to indicate
the end. If no marker is present then the destination subvector length
may be assumed to be 4. SUBVL is considered to be the "source" subvector
length.

Pseudocode exploiting python "yield" for clarity. Element-width overrides,
Saturation and Predication are also left out, for clarity:

```
def index_src():
    for i in range(VL):
        # swiz[] holds one entry per destination position
        for j in range(4):
            if swiz[j] == 0b000: # skip
                continue
            if swiz[j] == 0b001: # end
                break
            if swiz[j] in [0b010, 0b011]:
                yield (i*SUBVL, CONSTANT)
            else:
                # 0b1NN: NN is the source subelement offset
                yield (i*SUBVL, swiz[j] & 0b11)

def index_dest():
    for i in range(VL):
        for j in range(dst_subvl):
            if swiz[j] == 0b000: # skip
                continue
            yield i*dst_subvl+j

# walk through both source and dest indices simultaneously
for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
    if offs == CONSTANT:
        set(RT+dst_idx, CONSTANT)
    else:
        move_operation(RT+dst_idx, RA+src_idx+offs)
```

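A runnable approximation of the Vectorised behaviour, operating on flat Python lists instead of a register file, is sketched below. It is not a reference model: `sv_mv_swiz` is a hypothetical name, the end marker is assumed to occupy the field immediately after the last destination element, and element-width overrides, Saturation and Predication are ignored. No check is made that a selector stays within the source subvector, because the text does not say how that case is treated.

```python
def sv_mv_swiz(src, fields, vl, subvl):
    """Model a Vectorised swizzle move over a flat list of elements.

    src holds vl source subvectors of length subvl, back to back.
    Returns (dst_subvl, dest); skipped destination elements stay None.
    """
    # destination length: number of fields before the 0b001 end marker
    dst_subvl = 4
    for i, f in enumerate(fields):
        if f == 0b001:
            dst_subvl = i
            break
    dest = [None] * (vl * dst_subvl)
    for i in range(vl):
        for j in range(dst_subvl):
            f = fields[j]
            if f == 0b000:                 # skip (static predicate mask)
                continue
            if f == 0b010:
                dest[i * dst_subvl + j] = 0
            elif f == 0b011:
                dest[i * dst_subvl + j] = 1
            else:
                dest[i * dst_subvl + j] = src[i * subvl + (f & 0b11)]
    return dst_subvl, dest

# vec3 array to vec2 array with swizzle "ZY" (fields Z, Y, end marker)
print(sv_mv_swiz(["x0", "y0", "z0", "x1", "y1", "z1"],
                 [0b110, 0b101, 0b001, 0b000], vl=2, subvl=3))
# (2, ['z0', 'y0', 'z1', 'y1'])

# vec2 array to vec4 array with swizzle "YYXX" (no end marker)
print(sv_mv_swiz(["x0", "y0", "x1", "y1"],
                 [0b101, 0b101, 0b100, 0b100], vl=2, subvl=2))
# (4, ['y0', 'y0', 'x0', 'x0', 'y1', 'y1', 'x1', 'x1'])
```
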
**Vertical-First Mode**

It is important to appreciate that *only* the main loop VL
is Vertical-First: the SUBVL loop is not. This makes sense
from the perspective that the Swizzle Move is a group of
moves, but is still a single instruction that happens to take
vec2/3/4 as operands. Vertical-First
performing only one of the *sub*-elements at a time, rather
than operating on the entire vec2/3/4 together, would
violate that expectation. The exceptions to this, explained
later, are when Pack/Unpack is enabled.

**Effect of Saturation on Vectorised Swizzle**

A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R,G,B or A. Therefore,
when Saturation is enabled and a Swizzle=0b011 (Constant 1) is requested,
the maximum permitted Saturated value is inserted rather than Constant 1.
`sv.mv.swiz/sats/vec2/ew=8 RT.v, RA.v, Y1` would insert the 2nd subelement
(Y) into the first destination subelement and the signed-maximum constant
0x7f into the second. A Constant 0 Swizzle on the other hand still inserts
zero because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.

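The constant selection described above might be sketched as follows. The function name `swizzle_constant` and the `sat` argument values are hypothetical. The signed case follows the `/sats` example in the text; treating unsigned Saturation as inserting the all-ones value (0xff at 8 bits) is an inference from the mention of 0xff, not something the text states outright.

```python
def swizzle_constant(field, elwidth, sat=None):
    """Value inserted for a constant swizzle field (0b010 or 0b011).

    sat is None (no Saturation), "signed" or "unsigned".
    """
    if field == 0b010:
        return 0                          # Constant 0 always inserts zero
    if field != 0b011:
        raise ValueError("not a constant swizzle field")
    if sat is None:
        return 1
    if sat == "signed":
        return (1 << (elwidth - 1)) - 1   # 0x7f for 8-bit elements
    if sat == "unsigned":
        return (1 << elwidth) - 1         # 0xff for 8-bit elements (assumed)
    raise ValueError("unknown saturation mode")

print(hex(swizzle_constant(0b011, 8, "signed")))    # 0x7f
print(hex(swizzle_constant(0b011, 8, "unsigned")))  # 0xff
```
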
# Pack/Unpack Mode

It is possible to apply Pack and Unpack to Vectorised
swizzle moves, and these instructions are of EXTRA type
`RM-2P-1S1D-PU`. The interaction requires specific explanation
because it involves the separate SUBVLs (with destination SUBVL
being separate). Key to understanding is that the
source and
destination SUBVL may be "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `PACK_en` and `UNPACK_en`.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For a separate source/dest SUBVL (again, no elwidth overrides):

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_dest(outer):
        if outer:
            for j in range(dst_subvl):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(dst_subvl):
                    ....

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_src(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    ....

"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.

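The difference between the two loop orders is easiest to see by listing the order in which (element, subelement) pairs are visited. The sketch below deliberately yields the pairs themselves rather than register offsets, because the offset expressions are elided in the text above. `visit_order` is a hypothetical name.

```python
def visit_order(vl, subvl, outer):
    """Yield (element, subelement) pairs in the order the loops visit them."""
    if outer:
        # SUBVL is the outer loop: one subelement position across all elements
        for j in range(subvl):
            for i in range(vl):
                yield (i, j)
    else:
        # SUBVL is the inner loop: each subvector is completed in turn
        for i in range(vl):
            for j in range(subvl):
                yield (i, j)

print(list(visit_order(2, 3, outer=False)))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(list(visit_order(2, 3, outer=True)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```
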
Just as in [[sv/mv.vec]], when `PACK_en` is set it is the source
that swaps to Outer-subvector loops, and when `UNPACK_en` is set
it is the destination that swaps its loop-order. Setting both
`PACK_en` and `UNPACK_en` is neither prohibited nor `UNDEFINED`
because the behaviour is fully deterministic.

*However*, in
Vertical-First Mode, when both are enabled,
with both source and destination being outer loops, a **single**
step of srcstep and dststep is performed. Contrast this with the case
where only `PACK_en` is set: it is the *destination* that is an inner
subvector loop, and therefore Vertical-First runs through the
entire `dst_subvl` group. Likewise when only `UNPACK_en` is set it
is the source subvector that is run through as a group.

```
if VERTICAL_FIRST:
    # must run through SUBVL or dst_subvl elements, to keep
    # the subvector "together".  weirdness occurs due to
    # PACK_en/UNPACK_en
    num_runs = SUBVL # 1-4
    if PACK_en:
        num_runs = dst_subvl # destination still an inner loop
    if PACK_en and UNPACK_en:
        num_runs = 1 # both are outer loops
    for substep in range(num_runs):
        (src_idx, offs) = yield from index_src(PACK_en)
        dst_idx = yield from index_dest(UNPACK_en)
        move_operation(RT+dst_idx, RA+src_idx+offs)
```
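
The choice of `num_runs` can be isolated as a small pure function, which makes the four combinations of `PACK_en` and `UNPACK_en` easy to tabulate. This is a sketch under the rules stated above, and `vertical_first_runs` is a hypothetical name.

```python
def vertical_first_runs(subvl, dst_subvl, pack_en, unpack_en):
    """Number of sub-steps performed by one Vertical-First step."""
    if pack_en and unpack_en:
        return 1            # both source and destination are outer loops
    if pack_en:
        return dst_subvl    # source is outer, destination still inner
    return subvl            # default, and the UNPACK_en-only case

for pack in (False, True):
    for unpack in (False, True):
        print(pack, unpack, vertical_first_runs(3, 2, pack, unpack))
# False False 3
# False True 3
# True False 2
# True True 1
```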