[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is a type of permute shorthand allowing arbitrary selection
of elements from vec2/3/4 creating a new vec2/3/4.
Its value lies in the high occurrence of Swizzle
in 3D Shader Binaries (over 10% of all instructions).
Swizzle is usually done on a per-vec-operand basis in 3D GPU ISAs, making
for extremely long instructions (64 bits or greater).
However, it is not practical to add two or more sets of 12-bit
prefixes into a single instruction.
A compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle as well as constants 1/1.0 and 0/0.0.

An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*.  A contiguous
array of vec3 (XYZ) may have only 2 elements (ZY) swizzle-copied to
a contiguous array of vec2.  A contiguous array of vec2 sources
may have each vec2 element (XY) copied multiple times to a contiguous
vec4 array (YYXX or XYXX). For this reason, *when Vectorised*,
Swizzle Moves support independent subvector lengths for both
source and destination.

Although conceptually similar to `vpermd` of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with only up to four
selectors, whereas VSX refers to individual bytes and may not
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and also often
use the letters "RGBA"
when referring to pixel data.  These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.

As a standalone Scalar operation this instruction is valuable
if Prefixed with SVP64Single (providing Predication).
Combined with `cmpi` it synthesises Compare-and-Swap.

# Format

| 0.5 |6.10|11.15|16.27|28.31|  name        | Form    |
|-----|----|-----|-----|-----|--------------|-------- |
|PO   | RTp| RAp |imm  | 0011| mv.swiz      | DQ-Form |
|PO   | RTp| RAp |imm  | 1011| fmv.swiz     | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index.  3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|pixel  |R   | G  | B | A  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip". This is equivalent to predicate masking
* 0b001 subvector length end marker (length=4 if not present)
* 0b010 to indicate "constant 0"
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN index 0 thru 3 to copy from subelement in pos XYZW

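The packing of four 3-bit selectors into the 12-bit immediate can be sketched in Python. This is only an illustration: the helper names, the use of `.` for "skip" and `0`/`1` for the constants, MSB0 numbering of the field (X in the top three bits), and placing the end marker directly after the last selector are all assumptions made here, not part of the encoding definition.

```python
# Hypothetical helpers; MSB0 numbering of the 12-bit field is an assumption.
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011
SELECTORS = {".": SKIP, "0": CONST0, "1": CONST1,
             "X": 0b100, "Y": 0b101, "Z": 0b110, "W": 0b111}

def encode_swizzle(text):
    """Pack up to four selector characters into a 12-bit immediate."""
    if not 1 <= len(text) <= 4:
        raise ValueError("a swizzle has between 1 and 4 selectors")
    fields = [SELECTORS[ch] for ch in text.upper()]
    if len(fields) < 4:
        fields.append(END)                 # assumed: marker follows last selector
    fields += [SKIP] * (4 - len(fields))   # pad any unused trailing fields
    imm = 0
    for field in fields:                   # first field lands in the top 3 bits
        imm = (imm << 3) | field
    return imm

def decode_swizzle(imm):
    """Split a 12-bit immediate into its four 3-bit selector fields."""
    return [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]

print(format(encode_swizzle("W.Y."), "012b"))   # prints 111000101000
```
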
In very simplistic terms the relationship between swizzle indices
(NN, above), source, and destination is:

    dest[i] = src[swiz[i]]

Note that 8 options are needed (not 6) because option 0b001 encodes
the subvector length, and option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered".

    0    1    2    3
    X    Y    Z    W  source
         |         |
         +----+    |
         .    |    |
    +--------------+
    |    .    |    .
    W    .    Y    .  swizzle
    |    .    |    .
    |    Y    |    W  Y,W unmodified
    |    .    |    .
    W    Y    Y    W  dest

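The "W.Y." case can be traced with a small Python model of a single subvector. It is a sketch only: the function name and the use of Python lists for the source and destination are assumptions, and the selector values are taken from the options list.

```python
# Illustrative model only; the function name and list-based vectors are
# assumptions made for this sketch.
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011

def apply_swizzle(swiz, src, dest):
    """Return a copy of dest updated as dest[i] = src[swiz[i]]."""
    result = list(dest)
    for i, sel in enumerate(swiz):
        if sel == END:          # end marker: no further positions are written
            break
        if sel == SKIP:         # static predicate masking: leave dest[i] alone
            continue
        if sel == CONST0:
            result[i] = 0
        elif sel == CONST1:
            result[i] = 1
        else:                   # 0b1NN: copy source subelement NN
            result[i] = src[sel & 0b11]
    return result

# "W.Y." applied in place: W -> position X, Y -> position Z
src = ["X", "Y", "Z", "W"]
print(apply_swizzle([0b111, SKIP, 0b101, SKIP], src, src))
# prints ['W', 'Y', 'Y', 'W']
```
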
**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would
be an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorised)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pairs of 64-bit registers:

| swizzle name | source | dest | half    |
|--            | --     | --   | --      |
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT not covered by
the Swizzle is unmodified.  For example a Swizzle of "..XY"
will copy the contents of RA into RT+1 but leave RT unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero.  A Swizzle of "..XY" would
copy the contents of RA into RT+1, but set RT to zero.

Also, making life easier, RT and RA are only permitted to be even
(no overlapping can occur).  This makes RT (and RA) a "pair" exactly
as in `lq` and `stq`.  Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.

Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.

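The Scalar behaviour can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than a reference implementation: "lo-half" is taken to mean the least-significant 32 bits of each 64-bit register, the register file is a plain list of integers, only the integer `mv.swiz` constants are modelled, and the function name is hypothetical. Both source registers are read before anything is written, which is the buffering required when `RT=RA`.

```python
# Sketch under assumptions: "lo-half" is taken to be the least-significant
# 32 bits of each 64-bit register, and regs is a plain list of integers.
MASK32 = 0xFFFF_FFFF

def scalar_mv_swiz(regs, rt, ra, swiz):
    """Model mv.swiz on the register pairs (RT, RT+1) and (RA, RA+1)."""
    if rt % 2 or ra % 2:
        raise ValueError("RT and RA must be even register numbers")
    # read both source registers first, so RT=RA cannot corrupt the source
    pair = (regs[ra], regs[ra + 1])
    src = [pair[0] & MASK32, pair[0] >> 32, pair[1] & MASK32, pair[1] >> 32]
    if rt == ra:
        dest = list(src)        # in-place: uncovered positions are unmodified
    else:
        dest = [0, 0, 0, 0]     # otherwise uncovered positions become zero
    for i, sel in enumerate(swiz):
        if sel == 0b001:        # end marker
            break
        if sel == 0b000:        # skip
            continue
        if sel in (0b010, 0b011):
            dest[i] = sel & 1   # constant 0 or constant 1
        else:
            dest[i] = src[sel & 0b11]
    regs[rt] = dest[0] | (dest[1] << 32)
    regs[rt + 1] = dest[2] | (dest[3] << 32)
```
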
**SVP64 Vectorised**

Vectorised Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
subvector length can be *different* from the source subvector
length, and consequently the destination subvector length is
encoded into the Swizzle.

When Vectorised, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
another technique will
be deployed in hardware to issue multiple Scalar MV operations and
full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
Any overlap between Vectorised
source and destination Elements for the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.

This in turn implies that Traps and Exceptions are, as usual,
permitted in between element-level moves because, with no overlap,
there is no risk of destroying a source with
an overwrite.  This is *unlike* the Scalar variant which, when
`RT=RA`, must buffer both halves of the RT pair.

Determining the source and destination subvector lengths is tricky.
Swizzle Pseudocode:

```
    swiz[0] = imm[0:3]   # X
    swiz[1] = imm[3:6]   # Y
    swiz[2] = imm[6:9]   # Z
    swiz[3] = imm[9:12]  # W
    # determine implied subvector length from Swizzle
    dst_subvl = 4
    for i in range(4):
        if swiz[i] == 0b001:
            dst_subvl = i # the marker itself occupies no destination slot
            break
```

What is going on here is that the option is provided to have different
source and destination subvector lengths, by exploiting redundancy in
the Swizzle Immediate.  With the Swizzles marking what goes into
each destination position, the marker "0b001" may be used to indicate
the end. If no marker is present then the destination subvector length
may be assumed to be 4.  SUBVL is considered to be the "source" subvector
length.

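As a quick check of the two cases mentioned earlier (vec3 narrowed to vec2, and vec2 widened to vec4), the length rule can be written as a small Python function. The function name is hypothetical, and the sketch assumes that the end marker itself occupies no destination slot, so that a marker in position `i` implies a destination length of `i`.

```python
END = 0b001  # subvector length end marker, from the options list

def dest_subvl(swiz):
    """Destination subvector length implied by four 3-bit selectors."""
    for i, sel in enumerate(swiz):
        if sel == END:
            return i    # assumption: the marker occupies no destination slot
    return 4            # no marker present: full vec4 destination

# source vec3 narrowed to a vec2 destination ("ZY" then end marker)
print(dest_subvl([0b110, 0b101, END, 0b000]))    # prints 2
# source vec2 widened to a vec4 destination ("YYXX", no marker)
print(dest_subvl([0b101, 0b101, 0b100, 0b100]))  # prints 4
```
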
Pseudocode exploiting python "yield" for clarity. Element-width overrides,
Saturation and Predication are left out, also for clarity:

```
    def index_src():
        for i in range(VL):
            for j in range(4): # one selector per destination position
                if swiz[j] == 0b000: # skip
                    continue
                if swiz[j] == 0b001: # end
                    break
                if swiz[j] in [0b010, 0b011]:
                    yield (i*SUBVL, CONSTANT)
                else:
                    yield (i*SUBVL, swiz[j]-4) # 0b1NN: NN is the index

    def index_dest():
        for i in range(VL):
            for j in range(dst_subvl):
                if swiz[j] == 0b000: # skip
                    continue
                yield i*dst_subvl+j

    # walk through both source and dest indices simultaneously
    for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
        if offs == CONSTANT:
            set(RT+dst_idx, CONSTANT)
        else:
            move_operation(RT+dst_idx, RA+src_idx+offs)
```

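For experimentation, the same walk can be collapsed into one runnable Python function. This is a minimal model and not the specification: it assumes a flat list standing in for the register file with one element per register, no overlap between source and destination, an end marker that occupies no destination slot, and integer constants only. The function and parameter names are hypothetical.

```python
# Minimal model, assuming a flat Python list stands in for the register
# file, one element per register, with no overlap between source and dest.
def sv_mv_swiz(regs, rt, ra, swiz, vl, subvl):
    """Copy vl subvectors from RA (length subvl) to RT (length implied)."""
    # destination length: position of the end marker, or 4 if absent
    dst_subvl = swiz.index(0b001) if 0b001 in swiz else 4
    for i in range(vl):
        for j in range(dst_subvl):
            sel = swiz[j]
            if sel == 0b000:                    # skip: dest left unmodified
                continue
            dst = rt + i * dst_subvl + j
            if sel in (0b010, 0b011):           # constant 0 or constant 1
                regs[dst] = sel & 1
            else:                               # 0b1NN: source subelement NN
                regs[dst] = regs[ra + i * subvl + (sel & 0b11)]

# two vec3 sources (XYZ) at r0..r5 narrowed to two vec2 (ZY) at r8..r11
regs = [10, 11, 12, 20, 21, 22, 0, 0, 0, 0, 0, 0]
sv_mv_swiz(regs, rt=8, ra=0, swiz=[0b110, 0b101, 0b001, 0b000], vl=2, subvl=3)
print(regs[8:12])   # prints [12, 11, 22, 21]
```
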
**Vertical-First Mode**

It is important to appreciate that *only* the main loop VL
is Vertical-First: the SUBVL loop is not.  This makes sense
from the perspective that the Swizzle Move is a group of
moves, but is still a single instruction that happens to take
vec2/3/4 as operands.  Were Vertical-First
to perform only one of the *sub*-elements at a time, rather
than operating on the entire vec2/3/4 together, it would
violate that expectation.  The exceptions to this, explained
later, are when Pack/Unpack is enabled.

**Effect of Saturation on Vectorised Swizzle**

A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R,G,B or A. Therefore,
when Saturation is enabled and a Swizzle=0b011 (Constant 1) is requested,
the maximum permitted Saturated value is inserted rather than Constant 1.
`sv.mv.swiz/sats/vec2/ew=8 RT.v, RA.v, Y1` would insert the 2nd subelement
(Y) into the first destination subelement and the signed-maximum constant
0x7f into the second. A Constant 0 Swizzle on the other hand still inserts
zero because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.

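The constant-selection rule can be expressed as a short Python function. It is a sketch only: the function and parameter names are hypothetical, integer data is assumed, and the unsigned case is taken to yield the all-ones value (0xff at 8 bits) mentioned above.

```python
# Sketch only: parameter names are hypothetical, and integer data is assumed.
def swizzle_constant(sel, elwidth, saturate=None):
    """Value inserted for selector 0b010 (constant 0) or 0b011 (constant 1).

    saturate is None (no saturation), "signed" or "unsigned".
    """
    if sel == 0b010:
        return 0                            # Constant 0 is never altered
    if sel != 0b011:
        raise ValueError("not a constant selector")
    if saturate == "signed":
        return (1 << (elwidth - 1)) - 1     # e.g. 0x7f at elwidth=8
    if saturate == "unsigned":
        return (1 << elwidth) - 1           # e.g. 0xff at elwidth=8
    return 1

print(hex(swizzle_constant(0b011, 8, "signed")))    # prints 0x7f
print(hex(swizzle_constant(0b011, 8, "unsigned")))  # prints 0xff
```
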
# Pack/Unpack Mode:

It is possible to apply Pack and Unpack to Vectorised
swizzle moves, and these instructions are of EXTRA type
`RM-2P-1S1D-PU`. The interaction requires specific explanation
because it involves the separate SUBVLs (with destination SUBVL
being separate). Key to understanding is that the
source and
destination SUBVL become "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `PACK_en` and `UNPACK_en`.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For a separate source/dest SUBVL (again, no elwidth overrides):

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_dest(outer):
        if outer:
            for j in range(dst_subvl):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(dst_subvl):
                    ....

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_src(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    ....

"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.

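The difference between the two loop orders is easiest to see by listing the order in which (element, subelement) pairs are visited. The generator below is an illustration with a hypothetical name. It shows only the visiting order and makes no claim about how each pair is flattened into a register offset, which the loops above leave elided.

```python
# Illustration with hypothetical names; it only shows the visiting order and
# leaves the flattening to a register offset (elided in the text) alone.
def subvector_order(vl, subvl, outer):
    """Yield (i, j) pairs: element i in range(vl), subelement j in range(subvl)."""
    if outer:
        for j in range(subvl):      # SUBVL is the outer loop
            for i in range(vl):
                yield (i, j)
    else:
        for i in range(vl):         # normal order: SUBVL is the inner loop
            for j in range(subvl):
                yield (i, j)

print(list(subvector_order(2, 3, outer=False)))
# prints [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(list(subvector_order(2, 3, outer=True)))
# prints [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```
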
Just as in [[sv/mv.vec]], when `PACK_en` is set it is the source
that swaps to Outer-subvector loops, and when `UNPACK_en` is set
it is the destination that swaps its loop-order.  Setting both
`PACK_en` and `UNPACK_en` is neither prohibited nor `UNDEFINED`
because the behaviour is fully deterministic.

*However*, in
Vertical-First Mode, when both are enabled,
with both source and destination being outer loops, a **single**
step of srcstep and dststep is performed.  Contrast this with when
only `PACK_en` is set: it is the *destination* that is an inner
subvector loop, and therefore Vertical-First runs through the
entire `dst_subvl` group. Likewise when only `UNPACK_en` is set it
is the source subvector that is run through as a group.

```
if VERTICAL_FIRST:
    # must run through SUBVL or dst_subvl elements, to keep
    # the subvector "together".  weirdness occurs due to
    # PACK_en/UNPACK_en
    num_runs = SUBVL # 1-4
    if PACK_en:
        num_runs = dst_subvl # destination still an inner loop
    if PACK_en and UNPACK_en:
        num_runs = 1 # both are outer loops
    for substep in range(num_runs):
        (src_idx, offs) = yield from index_src(PACK_en)
        dst_idx = yield from index_dest(UNPACK_en)
        move_operation(RT+dst_idx, RA+src_idx+offs)
```