simple_v_extension/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # Vector Operations Extension to SV
   4
   5 This extension defines vector operations that would otherwise take several cycles to complete in software. With 3D priorities being to compute as many pixels per clock as possible, the normal RISC rules (reduce opcode count and make heavy use of macro op fusion) do not necessarily apply.
   6
   7 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
   8
   9 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
  10
  11 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
  12
  13 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
  14
  15 Examples which can require SUBVL include cross product and may in future involve complex numbers.
  16
  17 ## CORDIC
  18
  19 6 opcode options (fmt3):
  20
  21 * CORDIC.lin.rot vd, vs, beta
  22 * CORDIC.cir.rot vd, vs, beta
  23 * CORDIC.hyp.rot vd, vs, beta
  24 * CORDIC.lin.vec vd, vs, beta
  25 * CORDIC.cir.vec vd, vs, beta
  26 * CORDIC.hyp.vec vd, vs, beta
  27
  28 SUBVL must be set to 2 and applies to vd and vs. SUBVL is *ignored* on beta.  vd and vs must be marked as vectors.
  29
  30 VL may be applied.  beta as a scalar is ok (applies across all vectors vd and vs). Predication is also ok (single predication) sourced from vd. Use of swizzle is also ok.
  31
  32 Non vector args vd, vs, or SUBVL != 2 are reserved encodings.
  33
  34 CORDIC is an extremely general-purpose algorithm useful for a huge number
  35 of diverse purposes.  In its full form it does however require quite a
  36 few parameters, one of which is a vector, making it awkward to include in
  37 a standard "scalar" ISA.  Additionally the coordinates can be set to circular,
  38 linear or hyperbolic, producing three different modes, and the algorithm
  39 may also be run in either "vector" mode or "rotation" mode.  See [[discussion]]
  40
  41 CORDIC can also be used for performing DCT.  See
  42 <https://arxiv.org/abs/1606.02424>
  43
  44 CORDIC has several RADIX-4 papers for efficient pipelining.  Each stage requires its own ROM tables which can get costly.  Two combinatorial blocks may be chained together to double the RADIX and halve the pipeline depth, at the cost of doubling the latency.
  45
  46 Also, to get good accuracy, particularly at the limits of CORDIC input range, requires double the bitwidth of the output in internal computations. This similar to how MUL requires double the bitwidth to compute.
  47
  48 Links:
  49
  50 * <http://www.myhdl.org/docs/examples/sinecomp/>
  51 * <https://www.atlantis-press.com/proceedings/jcis2006/232>
  52
  53 ## Vector cross product
  54
  55 * VCROSS vd, vs1, vs1
  56
  57 Result is the cross product of x and y.
  58
  59 SUBVL must be set to 3, and all regs must be vectors. VL nonzero produces multiple results in vd.
  60
  61 The resulting components are, in order:
  62
  63     x[1] * y[2] - y[1] * x[2]
  64     x[2] * y[0] - y[2] * x[0]
  65     x[0] * y[1] - y[0] * x[1]
  66
  67 All the operands must be vectors of 3 components of a floating-point type.
  68
  69 Pseudocode:
  70
  71     vec3 a, b; // elements in order a.x, a.y, a.z
  72     // compute a cross b:
  73     vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
  74     vec3 t2 = b.zxy;
  75     vec3 t3 = a.zxy;
  76     vec3 t4 = b.yzx;
  77     vec3 p = t3 * t4;
  78     vec3 cross = t1 * t2 - p;
  79
  80 Assembler:
  81
  82     fswizzlei,2130 F4, F1
  83     fswizzlei,1320 F5, F1
  84     fswizzlei,2130 F6, F2
  85     fswizzlei,1320 F7, F2
  86     fmul F8, F5, F6
  87     fmulsub F3, F4, F7, F8
  88
  89 ## Vector dot product
  90
  91
  92 * VDOT rd, vs1, vs2
  93
  94 Computes the dot product of two vectors. Internal accuracy must be
  95 greater than the input vectors and the result.
  96
  97 There are two possible argument options:
  98
  99 * SUBVL=2,3,4 vs1 and vs2 set as vectors,  multiple results are generated. When VL is set, only the first (unpredicated) SUBVector is used to create a result, if rd is scalar (standard behaviour for single predication). Otherwise, if rd is a vector, multiple scalar results are calculated (i.e. SUBVL is always ignored for rd). Swizzling may be applied.
 100 * When rd=scalar, SUBVL=1 and vs1=vec, vs2=vec, one scalar result is generated from the entire src vectors.  Predication is allowed on the src vectors.
 101
 102 Pseudocode in python:
 103
 104     from operator import mul
 105     sum(map(mul, A, B))
 106
 107 Pseudocode in c:
 108
 109     double dot_product(float v[], float u[], int n)
 110     {
 111         double result = 0.0;
 112         for (int i = 0; i < n; i++)
 113             result += v[i] * u[i];
 114         return result;
 115     }
 116
 117 ## Vector Normalisation (not included)
 118
 119 Vector normalisation may be performed through dot product, recip square root and multiplication:
 120
 121     fdot F3, F1, F1 # vector dot with self
 122     rcpsqrta F3, F3
 123     fscale,0 F2, F3, F1
 124
 125 Or it may be performed through VLEN (Vector length) and division.
 126
 127 ## Vector length
 128
 129 * rd=scalar, vs1=vec (SUBVL=1)
 130 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
 131 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
 132 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
 133
 134 * VLEN rd, vs1
 135
 136 The scalar length of a vector:
 137
 138     sqrt(x[0]^2 + x[1]^2 + ...).
 139
 140 One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.
 141
 142 ## Vector distance
 143
 144 * VDIST rd, vs1, vs2
 145
 146 The scalar distance between two vectors. Subtracts one vector from the
 147 other and returns length:
 148
 149     length(v0 - v1)
 150
 151 ## Vector LERP
 152
 153 * VLERP vd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
 154
 155 | Instr | result | src1 | src2 | SUBVL | VL |
 156 | ------------------ | ------ | ---- | ---- | ----- | -- |
 157 | VLERP vd, vs1, rs2 | vec2 | vec2 | scal | 2 | any |
 158
 159 Known as **fmix** in GLSL.
 160
 161 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
 162
 163 Pseudocode:
 164
 165     // Imprecise method, which does not guarantee v = v1 when t = 1,
 166     // due to floating-point arithmetic error.
 167     // This form may be used when the hardware has a native fused
 168     // multiply-add instruction.
 169     float lerp(float v0, float v1, float t) {
 170       return v0 + t * (v1 - v0);
 171     }
 172
 173     // Precise method, which guarantees v = v1 when t = 1.
 174     float lerp(float v0, float v1, float t) {
 175       return (1 - t) * v0 + t * v1;
 176     }
 177
 178 ## Vector SLERP
 179
 180 * VSLERP vd, vs1, vs2, rs3
 181
 182 Not recommended as it is not commonly used and has several trigonometric
 183 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
 184
 185 <https://en.m.wikipedia.org/wiki/Slerp>
 186
 187 Pseudocode:
 188
 189     Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
 190         // Only unit quaternions are valid rotations.
 191         // Normalize to avoid undefined behavior.
 192         v0.normalize();
 193         v1.normalize();
 194
 195         // Compute the cosine of the angle between the two vectors.
 196         double dot = dot_product(v0, v1);
 197
 198         // If the dot product is negative, slerp won't take
 199         // the shorter path. Note that v1 and -v1 are equivalent when
 200         // the negation is applied to all four components. Fix by
 201         // reversing one quaternion.
 202         if (dot < 0.0f) {
 203             v1 = -v1;
 204             dot = -dot;
 205         }
 206
 207         const double DOT_THRESHOLD = 0.9995;
 208         if (dot > DOT_THRESHOLD) {
 209             // If the inputs are too close for comfort, linearly interpolate
 210             // and normalize the result.
 211
 212             Quaternion result = v0 + t*(v1 - v0);
 213             result.normalize();
 214             return result;
 215         }
 216
 217         // Since dot is in range [0, DOT_THRESHOLD], acos is safe
 218         double theta_0 = acos(dot);        // theta_0 = angle between input vectors
 219         double theta = theta_0*t;          // theta = angle between v0 and result
 220         double sin_theta = sin(theta);     // compute this value only once
 221         double sin_theta_0 = sin(theta_0); // compute this value only once
 222
 223         double s0 = cos(theta) - dot * sin_theta / sin_theta_0;  // == sin(theta_0 - theta) / sin(theta_0)
 224         double s1 = sin_theta / sin_theta_0;
 225
 226         return (s0 * v0) + (s1 * v1);
 227     }
 228
 229 However this algorithm does not involve transcendentals except in
 230 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
 231
 232     function v = cordic(beta,n)
 233         % This function computes v = [cos(beta), sin(beta)] (beta in radians)
 234         % using n iterations. Increasing n will increase the precision.
 235
 236         if beta < -pi/2 || beta > pi/2
 237             if beta < 0
 238                 v = cordic(beta + pi, n);
 239             else
 240                 v = cordic(beta - pi, n);
 241             end
 242             v = -v; % flip the sign for second or third quadrant
 243             return
 244         end
 245
 246         % Initialization of tables of constants used by CORDIC
 247         % need a table of arctangents of negative powers of two, in radians:
 248         % angles = atan(2.^-(0:27));
 249         angles =  [  ...
 250             0.78539816339745   0.46364760900081
 251             0.24497866312686   0.12435499454676 ...
 252             0.06241880999596   0.03123983343027
 253             0.01562372862048   0.00781234106010 ...
 254             0.00390623013197   0.00195312251648
 255             0.00097656218956   0.00048828121119 ...
 256             0.00024414062015   0.00012207031189
 257             0.00006103515617   0.00003051757812 ...
 258             0.00001525878906   0.00000762939453
 259             0.00000381469727   0.00000190734863 ...
 260             0.00000095367432   0.00000047683716
 261             0.00000023841858   0.00000011920929 ...
 262             0.00000005960464   0.00000002980232
 263             0.00000001490116   0.00000000745058 ];
 264         % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
 265         % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
 266         Kvalues = [ ...
 267             0.70710678118655   0.63245553203368
 268             0.61357199107790   0.60883391251775 ...
 269             0.60764825625617   0.60735177014130
 270             0.60727764409353   0.60725911229889 ...
 271             0.60725447933256   0.60725332108988
 272             0.60725303152913   0.60725295913894 ...
 273             0.60725294104140   0.60725293651701
 274             0.60725293538591   0.60725293510314 ...
 275             0.60725293503245   0.60725293501477
 276             0.60725293501035   0.60725293500925 ...
 277             0.60725293500897   0.60725293500890
 278             0.60725293500889   0.60725293500888 ];
 279         Kn = Kvalues(min(n, length(Kvalues)));
 280
 281         % Initialize loop variables:
 282         v = [1;0]; % start with 2-vector cosine and sine of zero
 283         poweroftwo = 1;
 284         angle = angles(1);
 285
 286         % Iterations
 287         for j = 0:n-1;
 288             if beta < 0
 289                 sigma = -1;
 290             else
 291                 sigma = 1;
 292             end
 293             factor = sigma * poweroftwo;
 294             % Note the matrix multiplication can be done using scaling by
 295             % powers of two and addition subtraction
 296             R = [1, -factor; factor, 1];
 297             v = R * v; % 2-by-2 matrix multiply
 298             beta = beta - sigma * angle; % update the remaining angle
 299             poweroftwo = poweroftwo / 2;
 300             % update the angle from table, or eventually by just dividing by two
 301             if j+2 > length(angles)
 302                 angle = angle / 2;
 303             else
 304                 angle = angles(j+2);
 305             end
 306         end
 307
 308         % Adjust length of output vector to be [cos(beta), sin(beta)]:
 309         v = v * Kn;
 310         return
 311
 312     endfunction
 313
 314 2x2 matrix multiply can be done with shifts and adds:
 315
 316     x = v[0] - sigma * (v[1] * 2^(-j));
 317     y = sigma * (v[0] * 2^(-j)) + v[1];
 318     v = [x; y];
 319
 320 The technique is outlined in a paper as being applicable to 3D:
 321 <https://www.atlantis-press.com/proceedings/jcis2006/232>
 322
 323 # Expensive 3-operand OP32 operations
 324
 325 3-operand operations are extremely expensive in terms of OP32 encoding space.  A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
 326
 327 Another is to overwrite one of the src registers.
 328
 329 # Opcode Table
 330
 331 TODO
 332
 333 # Links
 334
 335 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
 336 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
 337 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>
 338
 339 Research Papers
 340
 341 * <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>