simple_v_extension/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # Vector Operations Extension to SV
   4
   5 This extension defines vector operations that would otherwise take several cycles to complete in software. With 3D priorities being to compute as many pixels per clock as possible, the normal RISC rules (reduce opcode count and make heavy use of macro op fusion) do not necessarily apply.
   6
   7 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
   8
   9 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
  10
  11 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
  12
  13 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
  14
  15 Examples which can require SUBVL include cross product and may in future involve complex numbers.
  16
  17 ## CORDIC
  18
  19 6 opcode options (fmt3):
  20
  21 * CORDIC.lin.rot vd, vs, beta
  22 * CORDIC.cir.rot vd, vs, beta
  23 * CORDIC.hyp.rot vd, vs, beta
  24 * CORDIC.lin.vec vd, vs, beta
  25 * CORDIC.cir.vec vd, vs, beta
  26 * CORDIC.hyp.vec vd, vs, beta
  27
  28
  29 | Instr | result | src1 | src2 | SUBVL | VL | Notes |
  30 | ------------------ | ------ | ---- | ---- | ----- | -- | ------ |
  31 | CORDIC.x.t vd, vs1, rs2 | vec2 | vec2 | scal | 2 | any | src2 ignores SUBVL |
  32
  33
  34 SUBVL must be set to 2 and applies to vd and vs. SUBVL is *ignored* on beta.  vd and vs must be marked as vectors.
  35
  36 VL may be applied.  beta as a scalar is ok (applies across all vectors vd and vs). Predication is also ok (single predication) sourced from vd. Use of swizzle is also ok.
  37
  38 Non vector args vd, vs, or SUBVL != 2 are reserved encodings.
  39
  40 CORDIC is an extremely general-purpose algorithm useful for a huge number
  41 of diverse purposes.  In its full form it does however require quite a
  42 few parameters, one of which is a vector, making it awkward to include in
  43 a standard "scalar" ISA.  Additionally the coordinates can be set to circular,
  44 linear or hyperbolic, producing three different modes, and the algorithm
  45 may also be run in either "vector" mode or "rotation" mode.  See [[discussion]]
  46
  47 CORDIC can also be used for performing DCT.  See
  48 <https://arxiv.org/abs/1606.02424>
  49
  50 CORDIC has several RADIX-4 papers for efficient pipelining.  Each stage requires its own ROM tables which can get costly.  Two combinatorial blocks may be chained together to double the RADIX and halve the pipeline depth, at the cost of doubling the latency.
  51
  52 Also, to get good accuracy, particularly at the limits of CORDIC input range, requires double the bitwidth of the output in internal computations. This similar to how MUL requires double the bitwidth to compute.
  53
  54 Links:
  55
  56 * <http://www.myhdl.org/docs/examples/sinecomp/>
  57 * <https://www.atlantis-press.com/proceedings/jcis2006/232>
  58
  59 ## Vector cross product
  60
  61 * VCROSS vd, vs1, vs1
  62
  63 Result is the cross product of x and y.
  64
  65 SUBVL must be set to 3, and all regs must be vectors. VL nonzero produces multiple results in vd.
  66
  67 The resulting components are, in order:
  68
  69     x[1] * y[2] - y[1] * x[2]
  70     x[2] * y[0] - y[2] * x[0]
  71     x[0] * y[1] - y[0] * x[1]
  72
  73 All the operands must be vectors of 3 components of a floating-point type.
  74
  75 Pseudocode:
  76
  77     vec3 a, b; // elements in order a.x, a.y, a.z
  78     // compute a cross b:
  79     vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
  80     vec3 t2 = b.zxy;
  81     vec3 t3 = a.zxy;
  82     vec3 t4 = b.yzx;
  83     vec3 p = t3 * t4;
  84     vec3 cross = t1 * t2 - p;
  85
  86 Assembler:
  87
  88     fswizzlei,2130 F4, F1
  89     fswizzlei,1320 F5, F1
  90     fswizzlei,2130 F6, F2
  91     fswizzlei,1320 F7, F2
  92     fmul F8, F5, F6
  93     fmulsub F3, F4, F7, F8
  94
  95 ## Vector dot product
  96
  97
  98 * VDOT rd, vs1, vs2
  99
 100 Computes the dot product of two vectors. Internal accuracy must be
 101 greater than the input vectors and the result.
 102
 103 There are two possible argument options:
 104
 105 * SUBVL=2,3,4 vs1 and vs2 set as vectors,  multiple results are generated. When VL is set, only the first (unpredicated) SUBVector is used to create a result, if rd is scalar (standard behaviour for single predication). Otherwise, if rd is a vector, multiple scalar results are calculated (i.e. SUBVL is always ignored for rd). Swizzling may be applied.
 106 * When rd=scalar, SUBVL=1 and vs1=vec, vs2=vec, one scalar result is generated from the entire src vectors.  Predication is allowed on the src vectors.
 107
 108 Pseudocode in python:
 109
 110     from operator import mul
 111     sum(map(mul, A, B))
 112
 113 Pseudocode in c:
 114
 115     double dot_product(float v[], float u[], int n)
 116     {
 117         double result = 0.0;
 118         for (int i = 0; i < n; i++)
 119             result += v[i] * u[i];
 120         return result;
 121     }
 122
 123 ## Vector Normalisation (not included)
 124
 125 Vector normalisation may be performed through dot product, recip square root and multiplication:
 126
 127     fdot F3, F1, F1 # vector dot with self
 128     rcpsqrta F3, F3
 129     fscale,0 F2, F3, F1
 130
 131 Or it may be performed through VLEN (Vector length) and division.
 132
 133 ## Vector length
 134
 135 * rd=scalar, vs1=vec (SUBVL=1)
 136 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
 137 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
 138 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
 139
 140 * VLEN rd, vs1
 141
 142 The scalar length of a vector:
 143
 144     sqrt(x[0]^2 + x[1]^2 + ...).
 145
 146 One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.
 147
 148 ## Vector distance
 149
 150 * VDIST rd, vs1, vs2
 151
 152 The scalar distance between two vectors. Subtracts one vector from the
 153 other and returns length:
 154
 155     length(v0 - v1)
 156
 157 ## Vector LERP
 158
 159 * VLERP vd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
 160
 161 | Instr | result | src1 | src2 | SUBVL | VL |
 162 | ------------------ | ------ | ---- | ---- | ----- | -- |
 163 | VLERP vd, vs1, rs2 | vec2 | vec2 | scal | 2 | any |
 164
 165 Known as **fmix** in GLSL.
 166
 167 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
 168
 169 Pseudocode:
 170
 171     // Imprecise method, which does not guarantee v = v1 when t = 1,
 172     // due to floating-point arithmetic error.
 173     // This form may be used when the hardware has a native fused
 174     // multiply-add instruction.
 175     float lerp(float v0, float v1, float t) {
 176       return v0 + t * (v1 - v0);
 177     }
 178
 179     // Precise method, which guarantees v = v1 when t = 1.
 180     float lerp(float v0, float v1, float t) {
 181       return (1 - t) * v0 + t * v1;
 182     }
 183
 184 ## Vector SLERP
 185
 186 * VSLERP vd, vs1, vs2, rs3
 187
 188 Not recommended as it is not commonly used and has several trigonometric
 189 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
 190
 191 <https://en.m.wikipedia.org/wiki/Slerp>
 192
 193 Pseudocode:
 194
 195     Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
 196         // Only unit quaternions are valid rotations.
 197         // Normalize to avoid undefined behavior.
 198         v0.normalize();
 199         v1.normalize();
 200
 201         // Compute the cosine of the angle between the two vectors.
 202         double dot = dot_product(v0, v1);
 203
 204         // If the dot product is negative, slerp won't take
 205         // the shorter path. Note that v1 and -v1 are equivalent when
 206         // the negation is applied to all four components. Fix by
 207         // reversing one quaternion.
 208         if (dot < 0.0f) {
 209             v1 = -v1;
 210             dot = -dot;
 211         }
 212
 213         const double DOT_THRESHOLD = 0.9995;
 214         if (dot > DOT_THRESHOLD) {
 215             // If the inputs are too close for comfort, linearly interpolate
 216             // and normalize the result.
 217
 218             Quaternion result = v0 + t*(v1 - v0);
 219             result.normalize();
 220             return result;
 221         }
 222
 223         // Since dot is in range [0, DOT_THRESHOLD], acos is safe
 224         double theta_0 = acos(dot);        // theta_0 = angle between input vectors
 225         double theta = theta_0*t;          // theta = angle between v0 and result
 226         double sin_theta = sin(theta);     // compute this value only once
 227         double sin_theta_0 = sin(theta_0); // compute this value only once
 228
 229         double s0 = cos(theta) - dot * sin_theta / sin_theta_0;  // == sin(theta_0 - theta) / sin(theta_0)
 230         double s1 = sin_theta / sin_theta_0;
 231
 232         return (s0 * v0) + (s1 * v1);
 233     }
 234
 235 However this algorithm does not involve transcendentals except in
 236 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
 237
 238     function v = cordic(beta,n)
 239         % This function computes v = [cos(beta), sin(beta)] (beta in radians)
 240         % using n iterations. Increasing n will increase the precision.
 241
 242         if beta < -pi/2 || beta > pi/2
 243             if beta < 0
 244                 v = cordic(beta + pi, n);
 245             else
 246                 v = cordic(beta - pi, n);
 247             end
 248             v = -v; % flip the sign for second or third quadrant
 249             return
 250         end
 251
 252         % Initialization of tables of constants used by CORDIC
 253         % need a table of arctangents of negative powers of two, in radians:
 254         % angles = atan(2.^-(0:27));
 255         angles =  [  ...
 256             0.78539816339745   0.46364760900081
 257             0.24497866312686   0.12435499454676 ...
 258             0.06241880999596   0.03123983343027
 259             0.01562372862048   0.00781234106010 ...
 260             0.00390623013197   0.00195312251648
 261             0.00097656218956   0.00048828121119 ...
 262             0.00024414062015   0.00012207031189
 263             0.00006103515617   0.00003051757812 ...
 264             0.00001525878906   0.00000762939453
 265             0.00000381469727   0.00000190734863 ...
 266             0.00000095367432   0.00000047683716
 267             0.00000023841858   0.00000011920929 ...
 268             0.00000005960464   0.00000002980232
 269             0.00000001490116   0.00000000745058 ];
 270         % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
 271         % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
 272         Kvalues = [ ...
 273             0.70710678118655   0.63245553203368
 274             0.61357199107790   0.60883391251775 ...
 275             0.60764825625617   0.60735177014130
 276             0.60727764409353   0.60725911229889 ...
 277             0.60725447933256   0.60725332108988
 278             0.60725303152913   0.60725295913894 ...
 279             0.60725294104140   0.60725293651701
 280             0.60725293538591   0.60725293510314 ...
 281             0.60725293503245   0.60725293501477
 282             0.60725293501035   0.60725293500925 ...
 283             0.60725293500897   0.60725293500890
 284             0.60725293500889   0.60725293500888 ];
 285         Kn = Kvalues(min(n, length(Kvalues)));
 286
 287         % Initialize loop variables:
 288         v = [1;0]; % start with 2-vector cosine and sine of zero
 289         poweroftwo = 1;
 290         angle = angles(1);
 291
 292         % Iterations
 293         for j = 0:n-1;
 294             if beta < 0
 295                 sigma = -1;
 296             else
 297                 sigma = 1;
 298             end
 299             factor = sigma * poweroftwo;
 300             % Note the matrix multiplication can be done using scaling by
 301             % powers of two and addition subtraction
 302             R = [1, -factor; factor, 1];
 303             v = R * v; % 2-by-2 matrix multiply
 304             beta = beta - sigma * angle; % update the remaining angle
 305             poweroftwo = poweroftwo / 2;
 306             % update the angle from table, or eventually by just dividing by two
 307             if j+2 > length(angles)
 308                 angle = angle / 2;
 309             else
 310                 angle = angles(j+2);
 311             end
 312         end
 313
 314         % Adjust length of output vector to be [cos(beta), sin(beta)]:
 315         v = v * Kn;
 316         return
 317
 318     endfunction
 319
 320 2x2 matrix multiply can be done with shifts and adds:
 321
 322     x = v[0] - sigma * (v[1] * 2^(-j));
 323     y = sigma * (v[0] * 2^(-j)) + v[1];
 324     v = [x; y];
 325
 326 The technique is outlined in a paper as being applicable to 3D:
 327 <https://www.atlantis-press.com/proceedings/jcis2006/232>
 328
 329 # Expensive 3-operand OP32 operations
 330
 331 3-operand operations are extremely expensive in terms of OP32 encoding space.  A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
 332
 333 Another is to overwrite one of the src registers.
 334
 335 # Opcode Table
 336
 337 TODO
 338
 339 # Links
 340
 341 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
 342 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
 343 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>
 344
 345 Research Papers
 346
 347 * <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>