[[!tag standards]]

# Scalar OpenPOWER Audio and Video Opcodes

The fundamental principle of SV is a hardware for-loop. Therefore the first (and in nearly all cases the only) place to put Vector operations is in the *scalar* ISA. However, only by analysing those scalar opcodes *in* an SV Vectorisation context does it become clear why they are needed and how they may be designed.

This page therefore has accompanying discussion at <https://bugs.libre-soc.org/show_bug.cgi?id=230> for evolution of suitable opcodes.

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=915> add overflow to maxmin.
* <https://bugs.libre-soc.org/show_bug.cgi?id=863> add pseudocode etc.
* <https://bugs.libre-soc.org/show_bug.cgi?id=234> hardware implementation
* <https://bugs.libre-soc.org/show_bug.cgi?id=910> mins/maxs zero-option?
* [[vpu]]
* [[sv/int_fp_mv]]
* [[openpower/isa/av]] pseudocode
* TODO review HP 1994-6 PA-RISC MAX <https://en.m.wikipedia.org/wiki/Multimedia_Acceleration_eXtensions>
* <https://en.m.wikipedia.org/wiki/Sum_of_absolute_differences>
* List of MMX instructions <https://cs.fit.edu/~mmahoney/cse3101/mmx.html>

# Summary

To summarise in advance, the base scalar operations that need to be added are:

| instruction    | pseudocode |
| -------------- | ------------------------ |
| average-add    | result = (src1 + src2 + 1) >> 1 |
| abs-diff       | result = abs (src1-src2) |
| abs-accumulate | result += abs (src1-src2) |
| (un)signed min | result = (src1 < src2) ? src1 : src2 (use [[sv/bitmanip]]) |
| (un)signed max | result = (src1 > src2) ? src1 : src2 (use [[sv/bitmanip]]) |
| bitwise sel    | (a ? b : c) - use [[sv/bitmanip]] ternary |
| int/fp move    | covered by [[sv/int_fp_mv]] |

Implemented at the [[openpower/isa/av]] pseudocode page.

All other capabilities (saturate in particular) are achieved with [[sv/svp64]] modes and swizzle. Note that min/max and ternary are added in [[sv/bitmanip]].
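
As a reference model only, here is a minimal Python sketch of the element-level semantics of the table above (function names are purely illustrative, not opcodes):

    # illustrative reference model for the summary table, not actual opcodes
    def avgadd(src1, src2):
        # average-add, rounding up
        return (src1 + src2 + 1) >> 1

    def absdiff(src1, src2):
        # abs-diff: branch-free view of abs(src1 - src2)
        return src1 - src2 if src1 > src2 else src2 - src1

    def absaccumulate(acc, src1, src2):
        # abs-accumulate: running sum of absolute differences
        return acc + absdiff(src1, src2)

    def vmin(src1, src2):
        return src1 if src1 < src2 else src2

    def vmax(src1, src2):
        return src1 if src1 > src2 else src2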

# Audio

The fundamental principle for these instructions is:

* identify the scalar primitive
* assume that longer runs of scalars will have Simple-V vectorisation applied
* assume that "swizzle" may be applied at the (vec2 - SUBVL=2) Vector level
  (even if that involves a mv.swizzle, which may be macro-op fused),
  in order to perform the necessary HI/LO selection normally hard-coded
  into SIMD ISAs.

Thus for example, where OpenPOWER VSX has vpkswss, this would be achieved in SV with simply:

* applying saturation to minu (sv.minu/satu)
* 1st op, swizzle-selection vec2 "select X only" from source to dest:
  dest.X = extclamp(src.X)
* 2nd op, swizzle-select vec2 "select Y only" from source to dest:
  dest.Y = extclamp(src.Y)

Macro-op fusion may be used to detect that these two interleave cleanly, overlapping the vec2.X with vec2.Y to produce a single vec2.XY operation.

Alternatively, Twin-Predication may be applied, with every even bit set in
the source mask and every odd bit set in the destination mask:

    r3=0b10101010
    r10=0b01010101
    r0=0x00007fff # or other limit
    sv.minu/satu/sm=r3/dm=r10/ew=32 *r20,*r20,r0

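A rough Python illustration (not the actual pseudocode) of the per-element effect being relied on here: the saturating minu clamps each 32-bit element against the limit held in r0, and the twin predicates then control which source elements are read and which destination slots are written:

    # illustration only: clamp each element against the limit in r0 (0x7fff here)
    def clamp_elements(src, limit=0x7fff):
        return [min(e, limit) for e in src]

    assert clamp_elements([0x12345, 0x7ffe, 0x0]) == [0x7fff, 0x7ffe, 0x0]
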
## Scalar element operations

* clamping / saturation for signed and unsigned. Best done in a similar way to FP rounding modes, i.e. with an SPR.
* average-add. result = (src1 + src2 + 1) >> 1
* abs-diff: result = (src1 > src2) ? (src1-src2) : (src2-src1)
* signed min/max

\[un]signed min/max instructions are specifically needed for vector reduce min/max operations, which are common.

# Video

* DCT added as [[sv/remap]] <https://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/node231.html>
  <https://www.nayuki.io/page/fast-discrete-cosine-transform-algorithms>
* Absolute-diff Accumulation, used in Motion Estimation, added,
  see [[sv/bitmanip]] and opcodes in [[openpower/isa/av]]

# VSX SIMD analysis

Useful parts of VSX, and how they might map.

## vpks[\*][\*]s (vec_pack*)

signed and unsigned, these are N-to-M (N=64/32/16, M=32/16/8) chop/clamp/sign/zero-extend operations. May be implemented by a clamped move to a smaller elwidth.

The other direction, vec_unpack widening ops, may need some way to tell whether to sign-extend or zero-extend.

*scalar extsw/b/h gives one set, mv gives another. src elwidth override and dest elwidth override provide the pack/unpack*.

Implemented by Pack/Unpack. [[sv/normal]] arithmetic also has Pack-with-Saturate.
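
To illustrate the sign-extend versus zero-extend distinction for the widening (unpack) direction, a small Python sketch (assuming nothing about the eventual encoding):

    # illustration only: widening an 8-bit element to a larger width
    def zero_extend8(x):
        return x & 0xff                      # mv-style: zero-extends

    def sign_extend8(x):
        x &= 0xff
        return x - 0x100 if x & 0x80 else x  # extsb-style: sign-extends

    assert zero_extend8(0x80) == 128
    assert sign_extend8(0x80) == -128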

## vavgs\* (vec_avg)

signed and unsigned, 8/16/32: these are all of the form:

    result = truncate((a + b + 1) >> 1)

*These do not exist in the scalar ISA and would need to be added. Essentially it is a type of post-processing involving the CA bit, so it could be included in the existing scalar pipeline ALU*
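
A short Python illustration of why the intermediate sum needs one extra bit, which is exactly what the CA bit provides (names illustrative):

    # illustration: 8-bit unsigned average-add without losing the carry
    def vavgu8(a, b):
        s = a + b + 1            # 9-bit intermediate; the 9th bit is the carry (CA)
        return (s >> 1) & 0xff   # shift the full 9-bit sum; result fits in 8 bits

    assert vavgu8(0xff, 0xff) == 0xff  # would be wrong if the carry were discarded
    assert vavgu8(4, 5) == 5           # the +1 rounds 4.5 up to 5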

## vabsdu\* (vec_abs)

unsigned 8/16/32: these are all of the form:

    result = (src1 > src2) ? truncate(src1-src2) :
                             truncate(src2-src1)

*These do not exist in the scalar ISA and would need to be added*

## abs-accumulate

signed and unsigned variants needed:

    result += (src1 > src2) ? truncate(src1-src2) :
                              truncate(src2-src1)

*These do not exist in the scalar ISA and would need to be added*
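
As a usage illustration (Python, not ISA pseudocode), this is the sum-of-absolute-differences kernel from Motion Estimation that abs-accumulate is intended to speed up:

    # illustration: sum of absolute differences (SAD) over two blocks of pixels
    def sad(block_a, block_b):
        acc = 0
        for a, b in zip(block_a, block_b):
            acc += a - b if a > b else b - a   # abs-accumulate per element
        return acc

    assert sad([1, 5, 9], [4, 5, 2]) == 3 + 0 + 7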

## vmaxs\* / vmaxu\* (and min)

signed and unsigned, 8/16/32: these are all of the form:

    result = (src1 > src2) ? src1 : src2 # max
    result = (src1 < src2) ? src1 : src2 # min

*These do not exist in the scalar INTEGER ISA and would need to be added*.
There are additionally no scalar FP min/max either; these also
need to be added.

Also it makes sense for both the integer and FP variants
to have Rc=1 modes, where those modes are based on the
respective cmp (or fsel / isel) behaviour. In other words,
the Rc=1 setting is based on the *comparison* of the
two inputs, rather than on which of the two results was
returned by the min/max opcode.

    result = (src1 > src2) ? src1 : src2 # max
    CR0 = CR_compute(src2-src1) # not based on result
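
A Python sketch of this proposed Rc=1 behaviour (CR0 bits named LT/GT/EQ here purely for illustration): CR0 follows the comparison of the two inputs, as cmp would set it, regardless of which operand the max returns:

    # illustration: max with Rc=1, CR0 reflecting the *comparison* of src1 vs src2
    def max_rc(src1, src2):
        result = src1 if src1 > src2 else src2
        cr0 = {"LT": src1 < src2,    # set as a cmp of the two inputs would set it,
               "GT": src1 > src2,    # not by testing the returned result
               "EQ": src1 == src2}
        return result, cr0

    assert max_rc(3, 7) == (7, {"LT": True, "GT": False, "EQ": False})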

## vmerge operations

Their main point was to work around the odd/even multiplies. SV swizzles and mv.x should handle all cases.

These take two src vectors of various widths and splice them together. The best technique to cover them is a straightforward predicated pair of mv operations, inverting the predicate in the second case, or, alternatively, a pair of vec2 (SUBVL=2) swizzled operations.

In the swizzle case the first instruction would be destvec2.X = srcvec2.X and the second would swizzle-select Y: destvec2.Y = srcvec2.Y. In both the predicated variant and the swizzle variant, macro-op fusion would identify the pattern and interleave the two operations into the same SIMD backend ALUs.

With twin predication the elwidth can be overridden on both src and dest, such that either straight scalar mv or extsw/b/h can be used to provide the combinations of coverage needed, with only 2 actual instructions (plus vector prefixing).

See [[sv/mv.vec]] and [[sv/mv.swizzle]]
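
A Python sketch of the predicated-pair approach (plain element lists standing in for registers; purely illustrative): the first mv uses one predicate, the second uses its inverse, and the two interleave into a single merged vector:

    # illustration: vmerge as two predicated element-wise mv operations
    def predicated_mv(dest, src, mask):
        return [s if m else d for d, s, m in zip(dest, src, mask)]

    a = [10, 11, 12, 13]
    b = [20, 21, 22, 23]
    even = [True, False, True, False]
    odd  = [not m for m in even]          # inverted predicate for the 2nd mv

    dest = [0, 0, 0, 0]
    dest = predicated_mv(dest, a, even)   # 1st mv: even elements from a
    dest = predicated_mv(dest, b, odd)    # 2nd mv: odd elements from b
    assert dest == [10, 21, 12, 23]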

## Float estimates

    vec_expte - float 2^x
    vec_loge - float log2(x)
    vec_re - float 1/x
    vec_rsqrte - float 1/sqrt(x)

The spec says the max relative inaccuracy is 1/4096.

*In conjunction with the FPSPR "accuracy" bit, these could be done by assigning meaning to the "sat mode" SVP64 bits in an FP context. 0b00 is IEEE754 FP, 0b01 is 2^12 accuracy for FP32. These can be applied to standard scalar FP ops*

The other alternative is to use the "single precision" FP operations on a 32-bit elwidth override. As explained in [[sv/fcvt]] this halves the precision,
operating at FP16 accuracy but storing in an FP32 format.

## vec_madd(s) - FMA, multiply-add, optionally saturated

    a * b + c

*Standard scalar madd*

## vec_msum(s) - horizontal gather multiply-add, optionally saturated

This should be separated into a horizontal multiply and a horizontal add. How a horizontal operation would work in SV is TBD: how wide is it, etc.

    a.x + a.y + a.z ...
    a.x * a.y * a.z ...

*This would realistically need to be done with a loop doing a mapreduce sequence. I looked very early on at doing this type of operation and concluded it would be better done with a series of halvings each time, as separate instructions: VL=16 then VL=8 then 4 then 2 and finally one scalar, i.e. not an actual part of SV at all. An OoO multi-issue engine would be more than capable of dealing with the Dependencies.*

That has the issue that it's a massive PITA to code, plus it's slow. Plus there's the "access to non-4-offset regs stalls" problem. Even if there's no ready operation, it should be made easier and faster than a manual mapreduce loop.
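
For reference, a Python sketch of the halving sequence described above (VL=16, then 8, 4, 2, then a final scalar), shown here for a horizontal add:

    # illustration: horizontal sum by repeatedly halving the vector length
    def horizontal_add(vec):
        vl = len(vec)                  # e.g. VL=16, assumed a power of two
        while vl > 1:
            half = vl // 2
            for i in range(half):      # one vector add issued at VL=half
                vec[i] = vec[i] + vec[i + half]
            vl = half
        return vec[0]

    assert horizontal_add(list(range(16))) == sum(range(16))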

--

As a mid-solution, 4-element gathers were discussed. 4 elements would also make them useful for pixel packing, not just the general vector gather. This is because OR and ADD are the same operation when bits don't overlap.

    gather-add: d = a.x + a.y + a.z + a.w
    gather-mul: d = a.x * a.y * a.z * a.w

But can the SV loop increment the src reg # by 4? Hmm.

The idea then leads to the opposite operation, a 1-to-4 bit-scatter instruction. Like gather, it could be implemented with a normal loop, but it's faster for certain uses.

    bit-scatter dest, src, bits

    bit-scatter rd, rs, 8 # assuming source and dest are 32-bit wide
    rd   = (rs >> (0 * 8)) & (2^8 - 1)
    rd+1 = (rs >> (1 * 8)) & (2^8 - 1)
    rd+2 = (rs >> (2 * 8)) & (2^8 - 1)
    rd+3 = (rs >> (3 * 8)) & (2^8 - 1)

So at the start you have an RGBA packed pixel in one 32-bit register; at the end you have each channel separated into its own register, in the low bits, and ANDed so only the relevant bits are there.
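
A small Python usage illustration, mirroring the pseudocode above (not an actual instruction definition):

    # illustration: split a packed 32-bit pixel into four 8-bit channel values
    def bit_scatter8(rs):
        return [(rs >> (i * 8)) & 0xff for i in range(4)]

    c0, c1, c2, c3 = bit_scatter8(0x44332211)
    assert (c0, c1, c2, c3) == (0x11, 0x22, 0x33, 0x44)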

## vec_mul*

There should be both a same-width multiply and a widening multiply. Signed and unsigned versions. Optionally saturated.

    u8 * u8 = u8
    u8 * u8 = u16

For 8,16,32,64, resulting in 8,16,32,64,128.

*All of these can be done with SV elwidth overrides, as long as the dest is no greater than 128. SV specifically does not do 128 bit arithmetic. Instead, vec2.X mul-lo followed by vec2.Y mul-hi can be macro-op fused to get at the full 128 bit internal result. Specifying e.g. src elwidth=8 and dest elwidth=16 will give a widening multiply*

(Now added `maddedu` which is twin-half 64x64->HI64/LO64 in [[sv/biginteger]])
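
A Python sketch of what the elwidth overrides provide in the widening case (src elwidth=8, dest elwidth=16), and of the mul-lo/mul-hi split used when the full double-width result is wanted (purely illustrative):

    # illustration: widening 8x8->16 unsigned multiply via elwidth overrides
    def mul_widen_u8(a, b):
        return (a & 0xff) * (b & 0xff)           # result always fits in 16 bits

    # illustration: 64x64->128 delivered as a LO64/HI64 pair
    def mul_lo_hi_u64(a, b):
        full = (a & (2**64 - 1)) * (b & (2**64 - 1))
        return full & (2**64 - 1), full >> 64    # (LO64, HI64)

    assert mul_widen_u8(0xff, 0xff) == 0xfe01
    assert mul_lo_hi_u64(2**63, 2) == (0, 1)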

## vec_rl - rotate left

    (a << x) | (a >> (WIDTH - x))

*Standard scalar rlwinm*

## vec_sel - bitwise select

    (a ? b : c)

*This does not exist in the scalar ISA and would need to be added*

Interesting operation: in Tim Forsyth's video on Larrabee, they added a logical ternary lookup table op, which can cover this and more, similar to the CR ops' 2-2 bit lookup. See the sketch after the links below.

* <http://0x80.pl/articles/avx512-ternary-functions.html>
* <https://github.com/WojciechMula/ternary-logic/blob/master/py/show-function.py>
* [[sv/bitmanip]]
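
A Python sketch of the ternary lookup-table idea (in the style of the AVX-512 ternary functions linked above): an 8-bit immediate encodes the truth table for three bitwise inputs, and bitwise select is just one of the 256 possible functions:

    # illustration: 3-input bitwise ternary lookup; imm8 is the truth table
    def ternlog(imm8, a, b, c, width=64):
        result = 0
        for i in range(width):
            idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
            result |= ((imm8 >> idx) & 1) << i
        return result

    # bitwise select (a ? b : c) is truth table 0xCA
    a, b, c = 0b1100, 0b1010, 0b0110
    assert ternlog(0xCA, a, b, c) == ((a & b) | (~a & c)) & 0xf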
## vec_splat - broadcast

Implemented using swizzle/predicate.

## vec_perm - permute

Implemented using swizzle, mv.x.

## vec_*c[tl]z, vec_popcnt - count leading/trailing zeroes, set bits

Bit counts.

    ctz - count trailing zeroes
    clz - count leading zeroes
    popcnt - count set bits

*These all exist in the scalar ISA*