simple_v_extension/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # Vector Operations Extension to SV
   4
   5 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
   6
   7 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
   8
   9 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
  10
  11 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
  12
  13 Examples which can require SUBVL include cross product and may in future involve complex numbers.
  14
  15 ## CORDIC
  16
  17 * SUBVL=2, vd, vs; SUBVL ignored on beta.
  18 * VL nonzero ok.  beta as scalar ok (applies across all vectors)
  19 * non vector args vd, vs, or SUBVL!=2 reserved.
  20
  21 6 opcode options (fmt3):
  22
  23 * CORDIC.lin.rot vd, vs, beta
  24 * CORDIC.cir.rot vd, vs, beta
  25 * CORDIC.hyp.rot vd, vs, beta
  26 * CORDIC.lin.vec vd, vs, beta
  27 * CORDIC.cir.vec vd, vs, beta
  28 * CORDIC.hyp.vec vd, vs, beta
  29
  30 CORDIC is an extremely general-purpose algorithm useful for a huge number
  31 of diverse purposes.  In its full form it does however require quite a
  32 few parameters, one of which is a vector, making it awkward to include in
  33 a standard "scalar" ISA.  Additionally the coordinates can be set to circular,
  34 linear or hyperbolic, producing three different modes, and the algorithm
  35 may also be run in either "vector" mode or "rotation" mode.  See [[discussion]]
  36
  37 CORDIC can also be used for performing DCT.  See
  38 <https://arxiv.org/abs/1606.02424>
  39
  40 CORDIC has several RADIX-4 papers for efficient pipelining.  Each stage requires its own ROM tables which can get costly.  Two combinatorial blocks may be chained together to double the RADIX and halve the pipeline depth, at the cost of doubling the latency.
  41
  42 Also, to get good accuracy, particularly at the limits of CORDIC input range, requires double the bitwidth of the output in internal computations. This similar to how MUL requires double the bitwidth to compute.
  43
  44 Links:
  45
  46 * <http://www.myhdl.org/docs/examples/sinecomp/>
  47 * <https://www.atlantis-press.com/proceedings/jcis2006/232>
  48
  49 ## Vector cross product
  50
  51 SUBVL=3, all regs. VL nonzero produces multiple vd results.
  52
  53 * VCROSS vd, vs1, vs1
  54
  55 Result is the cross product of x and y, i.e., the resulting components are, in order:
  56
  57     x[1] * y[2] - y[1] * x[2]
  58     x[2] * y[0] - y[2] * x[0]
  59     x[0] * y[1] - y[0] * x[1]
  60
  61 All the operands must be vectors of 3 components of a floating-point type.
  62
  63 Pseudocode:
  64
  65     vec3 a, b; // elements in order a.x, a.y, a.z
  66     // compute a cross b:
  67     vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
  68     vec3 t2 = b.zxy;
  69     vec3 t3 = a.zxy;
  70     vec3 t4 = b.yzx;
  71     vec3 p = t3 * t4;
  72     vec3 cross = t1 * t2 - p;
  73
  74 Assembler:
  75
  76     fpermute,2130 F4, F1
  77     fpermute,1320 F5, F1
  78     fpermute,2130 F6, F2
  79     fpermute,1320 F7, F2
  80     fmul F8, F5, F6
  81     fmulsub F3, F4, F7, F8
  82
  83 ## Vector dot product
  84
  85 * SUBVL ignored on rd.  SUBVL=2,3,4 vs1,vs2, if all vectors, multiple results generated. If rd scalar, only first (unpredicated) SUBVector is used.
  86 * rd=scalar, SUBVL=1 and vs1, vs2=vec will produce one scalar result. Predication allowed on src vectors.
  87
  88 * VDOT rd, vs1, vs2
  89
  90 Computes the dot product of two vectors. Internal accuracy must be
  91 greater than the input vectors and the result.
  92
  93 Pseudocode in python:
  94
  95     from operator import mul
  96     sum(map(mul, A, B))
  97
  98 Pseudocode in c:
  99
 100     double dot_product(float v[], float u[], int n)
 101     {
 102         double result = 0.0;
 103         for (int i = 0; i < n; i++)
 104             result += v[i] * u[i];
 105         return result;
 106     }
 107
 108 ## Vector Normalisation (not included)
 109
 110 Vector normalisation may be performed through dot product, recip square root and multiplication:
 111
 112     fdot F3, F1, F1 # vector dot with self
 113     rcpsqrta F3, F3
 114     fscale,0 F2, F3, F1
 115
 116 Or it may be performed through VLEN (Vector length) and division.
 117
 118 ## Vector length
 119
 120 * rd=scalar, vs1=vec (SUBVL=1)
 121 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
 122 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
 123 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
 124
 125 * VLEN rd, vs1
 126
 127 The scalar length of a vector:
 128
 129     sqrt(x[0]^2 + x[1]^2 + ...).
 130
 131 One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.
 132
 133 ## Vector distance
 134
 135 * VDIST rd, vs1, vs2
 136
 137 The scalar distance between two vectors. Subtracts one vector from the
 138 other and returns length:
 139
 140     length(v0 - v1)
 141
 142 ## Vector LERP
 143
 144 * VLERP rd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
 145
 146 Known as **fmix** in GLSL.
 147
 148 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
 149
 150 Pseudocode:
 151
 152     // Imprecise method, which does not guarantee v = v1 when t = 1,
 153     // due to floating-point arithmetic error.
 154     // This form may be used when the hardware has a native fused
 155     // multiply-add instruction.
 156     float lerp(float v0, float v1, float t) {
 157       return v0 + t * (v1 - v0);
 158     }
 159
 160     // Precise method, which guarantees v = v1 when t = 1.
 161     float lerp(float v0, float v1, float t) {
 162       return (1 - t) * v0 + t * v1;
 163     }
 164
 165 ## Vector SLERP
 166
 167 * VSLERP vd, vs1, vs2, rs3
 168
 169 Not recommended as it is not commonly used and has several trigonometric
 170 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
 171
 172 <https://en.m.wikipedia.org/wiki/Slerp>
 173
 174 Pseudocode:
 175
 176     Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
 177         // Only unit quaternions are valid rotations.
 178         // Normalize to avoid undefined behavior.
 179         v0.normalize();
 180         v1.normalize();
 181
 182         // Compute the cosine of the angle between the two vectors.
 183         double dot = dot_product(v0, v1);
 184
 185         // If the dot product is negative, slerp won't take
 186         // the shorter path. Note that v1 and -v1 are equivalent when
 187         // the negation is applied to all four components. Fix by
 188         // reversing one quaternion.
 189         if (dot < 0.0f) {
 190             v1 = -v1;
 191             dot = -dot;
 192         }
 193
 194         const double DOT_THRESHOLD = 0.9995;
 195         if (dot > DOT_THRESHOLD) {
 196             // If the inputs are too close for comfort, linearly interpolate
 197             // and normalize the result.
 198
 199             Quaternion result = v0 + t*(v1 - v0);
 200             result.normalize();
 201             return result;
 202         }
 203
 204         // Since dot is in range [0, DOT_THRESHOLD], acos is safe
 205         double theta_0 = acos(dot);        // theta_0 = angle between input vectors
 206         double theta = theta_0*t;          // theta = angle between v0 and result
 207         double sin_theta = sin(theta);     // compute this value only once
 208         double sin_theta_0 = sin(theta_0); // compute this value only once
 209
 210         double s0 = cos(theta) - dot * sin_theta / sin_theta_0;  // == sin(theta_0 - theta) / sin(theta_0)
 211         double s1 = sin_theta / sin_theta_0;
 212
 213         return (s0 * v0) + (s1 * v1);
 214     }
 215
 216 However this algorithm does not involve transcendentals except in
 217 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
 218
 219     function v = cordic(beta,n)
 220         % This function computes v = [cos(beta), sin(beta)] (beta in radians)
 221         % using n iterations. Increasing n will increase the precision.
 222
 223         if beta < -pi/2 || beta > pi/2
 224             if beta < 0
 225                 v = cordic(beta + pi, n);
 226             else
 227                 v = cordic(beta - pi, n);
 228             end
 229             v = -v; % flip the sign for second or third quadrant
 230             return
 231         end
 232
 233         % Initialization of tables of constants used by CORDIC
 234         % need a table of arctangents of negative powers of two, in radians:
 235         % angles = atan(2.^-(0:27));
 236         angles =  [  ...
 237             0.78539816339745   0.46364760900081
 238             0.24497866312686   0.12435499454676 ...
 239             0.06241880999596   0.03123983343027
 240             0.01562372862048   0.00781234106010 ...
 241             0.00390623013197   0.00195312251648
 242             0.00097656218956   0.00048828121119 ...
 243             0.00024414062015   0.00012207031189
 244             0.00006103515617   0.00003051757812 ...
 245             0.00001525878906   0.00000762939453
 246             0.00000381469727   0.00000190734863 ...
 247             0.00000095367432   0.00000047683716
 248             0.00000023841858   0.00000011920929 ...
 249             0.00000005960464   0.00000002980232
 250             0.00000001490116   0.00000000745058 ];
 251         % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
 252         % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
 253         Kvalues = [ ...
 254             0.70710678118655   0.63245553203368
 255             0.61357199107790   0.60883391251775 ...
 256             0.60764825625617   0.60735177014130
 257             0.60727764409353   0.60725911229889 ...
 258             0.60725447933256   0.60725332108988
 259             0.60725303152913   0.60725295913894 ...
 260             0.60725294104140   0.60725293651701
 261             0.60725293538591   0.60725293510314 ...
 262             0.60725293503245   0.60725293501477
 263             0.60725293501035   0.60725293500925 ...
 264             0.60725293500897   0.60725293500890
 265             0.60725293500889   0.60725293500888 ];
 266         Kn = Kvalues(min(n, length(Kvalues)));
 267
 268         % Initialize loop variables:
 269         v = [1;0]; % start with 2-vector cosine and sine of zero
 270         poweroftwo = 1;
 271         angle = angles(1);
 272
 273         % Iterations
 274         for j = 0:n-1;
 275             if beta < 0
 276                 sigma = -1;
 277             else
 278                 sigma = 1;
 279             end
 280             factor = sigma * poweroftwo;
 281             % Note the matrix multiplication can be done using scaling by
 282             % powers of two and addition subtraction
 283             R = [1, -factor; factor, 1];
 284             v = R * v; % 2-by-2 matrix multiply
 285             beta = beta - sigma * angle; % update the remaining angle
 286             poweroftwo = poweroftwo / 2;
 287             % update the angle from table, or eventually by just dividing by two
 288             if j+2 > length(angles)
 289                 angle = angle / 2;
 290             else
 291                 angle = angles(j+2);
 292             end
 293         end
 294
 295         % Adjust length of output vector to be [cos(beta), sin(beta)]:
 296         v = v * Kn;
 297         return
 298
 299     endfunction
 300
 301 2x2 matrix multiply can be done with shifts and adds:
 302
 303     x = v[0] - sigma * (v[1] * 2^(-j));
 304     y = sigma * (v[0] * 2^(-j)) + v[1];
 305     v = [x; y];
 306
 307 The technique is outlined in a paper as being applicable to 3D:
 308 <https://www.atlantis-press.com/proceedings/jcis2006/232>
 309
 310 # Expensive 3-operand OP32 operations
 311
 312 3-operand operations are extremely expensive in terms of OP32 encoding space.  A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
 313
 314 Another is to overwrite one of the src registers.
 315
 316 # Opcode Table
 317
 318 TODO
 319
 320 # Links
 321
 322 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
 323 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
 324 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>
 325
 326 Research Papers
 327
 328 * <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>