simple_v_extension/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # Vector Operations Extension to SV
   4
   5 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
   6
   7 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
   8
   9 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
  10
  11 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
  12
  13 Examples which can require SUBVL include cross product and may in future involve complex numbers.
  14
  15 ## CORDIC
  16
  17 * SUBVL=2, vd, vs; SUBVL ignored on beta.
  18 * VL nonzero ok.  beta as scalar ok (applies across all vectors)
  19 * non vector args vd, vs, or SUBVL!=2 reserved.
  20
  21 6 opcode options (fmt3):
  22
  23 * CORDIC.lin.rot vd, vs, beta
  24 * CORDIC.cir.rot vd, vs, beta
  25 * CORDIC.hyp.rot vd, vs, beta
  26 * CORDIC.lin.vec vd, vs, beta
  27 * CORDIC.cir.vec vd, vs, beta
  28 * CORDIC.hyp.vec vd, vs, beta
  29
  30 CORDIC is an extremely general-purpose algorithm useful for a huge number
  31 of diverse purposes.  In its full form it does however require quite a
  32 few parameters, one of which is a vector, making it awkward to include in
  33 a standard "scalar" ISA.  Additionally the coordinates can be set to circular,
  34 linear or hyperbolic, producing three different modes, and the algorithm
  35 may also be run in either "vector" mode or "rotation" mode.  See [[discussion]]
  36
  37 CORDIC can also be used for performing DCT.  See
  38 <https://arxiv.org/abs/1606.02424>
  39
  40 vx, vy = CORDIC(vx, vy, coordinate\_mode, beta)
  41
  42      int i = 0;
  43      int iterations = 0; // Number of times to run the algorithm
  44      float arctanTable[iterations]; // in Radians
  45      float K = 0.6073; // K
  46      float v_x,v_y; // Vector v; x and y components
  47
  48      for(i=0; i < iterations; i++) {
  49         arctanTable[i] = atan(pow(2,-i));
  50      }
  51
  52      float vnew_x;   // To store the new value of x;
  53      for(i = 0; i < iterations; i++) {
  54          // If beta is negative, we need to do a counter-clockwise rotation:
  55          if( beta < 0) {
  56             vnew_x = v_x + (v_y*pow(2,-i));
  57             v_y -= (v_x*pow(2,-i));
  58             beta += arctanTable[i];
  59          }
  60          // If beta is positive, we need to do a clockwise rotation:
  61          else {
  62             vnew_x = v_x - (v_y*pow(2,-i));
  63             v_y += (v_x*pow(2,-i));
  64             beta -= arctanTable[i];
  65          }
  66          v_x = vnew_x;
  67      }
  68      v_x *= K;
  69      v_y *= K;
  70
  71 Links:
  72
  73 * <http://www.myhdl.org/docs/examples/sinecomp/>
  74 * <https://www.atlantis-press.com/proceedings/jcis2006/232>
  75
  76 ## Vector cross product
  77
  78 SUBVL=3, all regs. VL nonzero produces multiple vd results.
  79
  80 * VCROSS vd, vs1, vs1
  81
  82 Result is the cross product of x and y, i.e., the resulting components are, in order:
  83
  84     x[1] * y[2] - y[1] * x[2]
  85     x[2] * y[0] - y[2] * x[0]
  86     x[0] * y[1] - y[0] * x[1]
  87
  88 All the operands must be vectors of 3 components of a floating-point type.
  89
  90 Pseudocode:
  91
  92     vec3 a, b; // elements in order a.x, a.y, a.z
  93     // compute a cross b:
  94     vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
  95     vec3 t2 = b.zxy;
  96     vec3 t3 = a.zxy;
  97     vec3 t4 = b.yzx;
  98     vec3 p = t3 * t4;
  99     vec3 cross = t1 * t2 - p;
 100
 101 Assembler:
 102
 103     fpermute,2130 F4, F1
 104     fpermute,1320 F5, F1
 105     fpermute,2130 F6, F2
 106     fpermute,1320 F7, F2
 107     fmul F8, F5, F6
 108     fmulsub F3, F4, F7, F8
 109
 110 ## Vector dot product
 111
 112 * SUBVL ignored on rd.  SUBVL=2,3,4 vs1,vs2, if all vectors, multiple results generated. If rd scalar, only first (unpredicated) SUBVector is used.
 113 * rd=scalar, SUBVL=1 and vs1, vs2=vec will produce one scalar result. Predication allowed on src vectors.
 114
 115 * VDOT rd, vs1, vs2
 116
 117 Computes the dot product of two vectors. Internal accuracy must be
 118 greater than the input vectors and the result.
 119
 120 Pseudocode in python:
 121
 122     from operator import mul
 123     sum(map(mul, A, B))
 124
 125 Pseudocode in c:
 126
 127     double dot_product(float v[], float u[], int n)
 128     {
 129         double result = 0.0;
 130         for (int i = 0; i < n; i++)
 131             result += v[i] * u[i];
 132         return result;
 133     }
 134
 135 ## Vector Normalisation (not included)
 136
 137 Vector normalisation may be performed through dot product, recip square root and multiplication:
 138
 139     fdot F3, F1, F1 # vector dot with self
 140     rcpsqrta F3, F3
 141     fscale,0 F2, F3, F1
 142
 143 Or it may be performed through VLEN (Vector length) and division.
 144
 145 ## Vector length
 146
 147 * rd=scalar, vs1=vec (SUBVL=1)
 148 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
 149 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
 150 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
 151
 152 * VLEN rd, vs1
 153
 154 The scalar length of a vector:
 155
 156     sqrt(x[0]^2 + x[1]^2 + ...).
 157
 158 One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.
 159
 160 ## Vector distance
 161
 162 * VDIST rd, vs1, vs2
 163
 164 The scalar distance between two vectors. Subtracts one vector from the
 165 other and returns length:
 166
 167     length(v0 - v1)
 168
 169 ## Vector LERP
 170
 171 * VLERP rd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
 172
 173 Known as **fmix** in GLSL.
 174
 175 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
 176
 177 Pseudocode:
 178
 179     // Imprecise method, which does not guarantee v = v1 when t = 1,
 180     // due to floating-point arithmetic error.
 181     // This form may be used when the hardware has a native fused
 182     // multiply-add instruction.
 183     float lerp(float v0, float v1, float t) {
 184       return v0 + t * (v1 - v0);
 185     }
 186
 187     // Precise method, which guarantees v = v1 when t = 1.
 188     float lerp(float v0, float v1, float t) {
 189       return (1 - t) * v0 + t * v1;
 190     }
 191
 192 ## Vector SLERP
 193
 194 * VSLERP vd, vs1, vs2, rs3
 195
 196 Not recommended as it is not commonly used and has several trigonometric
 197 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
 198
 199 <https://en.m.wikipedia.org/wiki/Slerp>
 200
 201 Pseudocode:
 202
 203     Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
 204         // Only unit quaternions are valid rotations.
 205         // Normalize to avoid undefined behavior.
 206         v0.normalize();
 207         v1.normalize();
 208
 209         // Compute the cosine of the angle between the two vectors.
 210         double dot = dot_product(v0, v1);
 211
 212         // If the dot product is negative, slerp won't take
 213         // the shorter path. Note that v1 and -v1 are equivalent when
 214         // the negation is applied to all four components. Fix by
 215         // reversing one quaternion.
 216         if (dot < 0.0f) {
 217             v1 = -v1;
 218             dot = -dot;
 219         }
 220
 221         const double DOT_THRESHOLD = 0.9995;
 222         if (dot > DOT_THRESHOLD) {
 223             // If the inputs are too close for comfort, linearly interpolate
 224             // and normalize the result.
 225
 226             Quaternion result = v0 + t*(v1 - v0);
 227             result.normalize();
 228             return result;
 229         }
 230
 231         // Since dot is in range [0, DOT_THRESHOLD], acos is safe
 232         double theta_0 = acos(dot);        // theta_0 = angle between input vectors
 233         double theta = theta_0*t;          // theta = angle between v0 and result
 234         double sin_theta = sin(theta);     // compute this value only once
 235         double sin_theta_0 = sin(theta_0); // compute this value only once
 236
 237         double s0 = cos(theta) - dot * sin_theta / sin_theta_0;  // == sin(theta_0 - theta) / sin(theta_0)
 238         double s1 = sin_theta / sin_theta_0;
 239
 240         return (s0 * v0) + (s1 * v1);
 241     }
 242
 243 However this algorithm does not involve transcendentals except in
 244 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
 245
 246     function v = cordic(beta,n)
 247         % This function computes v = [cos(beta), sin(beta)] (beta in radians)
 248         % using n iterations. Increasing n will increase the precision.
 249
 250         if beta < -pi/2 || beta > pi/2
 251             if beta < 0
 252                 v = cordic(beta + pi, n);
 253             else
 254                 v = cordic(beta - pi, n);
 255             end
 256             v = -v; % flip the sign for second or third quadrant
 257             return
 258         end
 259
 260         % Initialization of tables of constants used by CORDIC
 261         % need a table of arctangents of negative powers of two, in radians:
 262         % angles = atan(2.^-(0:27));
 263         angles =  [  ...
 264             0.78539816339745   0.46364760900081
 265             0.24497866312686   0.12435499454676 ...
 266             0.06241880999596   0.03123983343027
 267             0.01562372862048   0.00781234106010 ...
 268             0.00390623013197   0.00195312251648
 269             0.00097656218956   0.00048828121119 ...
 270             0.00024414062015   0.00012207031189
 271             0.00006103515617   0.00003051757812 ...
 272             0.00001525878906   0.00000762939453
 273             0.00000381469727   0.00000190734863 ...
 274             0.00000095367432   0.00000047683716
 275             0.00000023841858   0.00000011920929 ...
 276             0.00000005960464   0.00000002980232
 277             0.00000001490116   0.00000000745058 ];
 278         % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
 279         % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
 280         Kvalues = [ ...
 281             0.70710678118655   0.63245553203368
 282             0.61357199107790   0.60883391251775 ...
 283             0.60764825625617   0.60735177014130
 284             0.60727764409353   0.60725911229889 ...
 285             0.60725447933256   0.60725332108988
 286             0.60725303152913   0.60725295913894 ...
 287             0.60725294104140   0.60725293651701
 288             0.60725293538591   0.60725293510314 ...
 289             0.60725293503245   0.60725293501477
 290             0.60725293501035   0.60725293500925 ...
 291             0.60725293500897   0.60725293500890
 292             0.60725293500889   0.60725293500888 ];
 293         Kn = Kvalues(min(n, length(Kvalues)));
 294
 295         % Initialize loop variables:
 296         v = [1;0]; % start with 2-vector cosine and sine of zero
 297         poweroftwo = 1;
 298         angle = angles(1);
 299
 300         % Iterations
 301         for j = 0:n-1;
 302             if beta < 0
 303                 sigma = -1;
 304             else
 305                 sigma = 1;
 306             end
 307             factor = sigma * poweroftwo;
 308             % Note the matrix multiplication can be done using scaling by
 309             % powers of two and addition subtraction
 310             R = [1, -factor; factor, 1];
 311             v = R * v; % 2-by-2 matrix multiply
 312             beta = beta - sigma * angle; % update the remaining angle
 313             poweroftwo = poweroftwo / 2;
 314             % update the angle from table, or eventually by just dividing by two
 315             if j+2 > length(angles)
 316                 angle = angle / 2;
 317             else
 318                 angle = angles(j+2);
 319             end
 320         end
 321
 322         % Adjust length of output vector to be [cos(beta), sin(beta)]:
 323         v = v * Kn;
 324         return
 325
 326     endfunction
 327
 328 2x2 matrix multiply can be done with shifts and adds:
 329
 330     x = v[0] - sigma * (v[1] * 2^(-j));
 331     y = sigma * (v[0] * 2^(-j)) + v[1];
 332     v = [x; y];
 333
 334 The technique is outlined in a paper as being applicable to 3D:
 335 <https://www.atlantis-press.com/proceedings/jcis2006/232>
 336
 337 # Expensive 3-operand OP32 operations
 338
 339 3-operand operations are extremely expensive in terms of OP32 encoding space.  A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
 340
 341 Another is to overwrite one of the src registers.
 342
 343 # Opcode Table
 344
 345 TODO
 346
 347 # Links
 348
 349 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
 350 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
 351 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>
 352
 353 Research Papers
 354
 355 * <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>