(no commit message)
[libreriscv.git] / simple_v_extension / vector_ops.mdwn
1 [[!tag standards]]
2
3 # Vector Operations Extension to SV
4
5 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
6
7 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
8
9 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
10
11 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
12
13 Examples which can require SUBVL include cross product and may in future involve complex numbers.
14
15 ## CORDIC
16
17 * SUBVL=2, vd, vs; SUBVL ignored on beta.
18 * VL nonzero ok. beta as scalar ok (applies across all vectors)
19 * non vector args vd, vs, or SUBVL!=2 reserved.
20
21 6 opcode options (fmt3):
22
23 * CORDIC.lin.rot vd, vs, beta
24 * CORDIC.cir.rot vd, vs, beta
25 * CORDIC.hyp.rot vd, vs, beta
26 * CORDIC.lin.vec vd, vs, beta
27 * CORDIC.cir.vec vd, vs, beta
28 * CORDIC.hyp.vec vd, vs, beta
29
30 CORDIC is an extremely general-purpose algorithm useful for a huge number
31 of diverse purposes. In its full form it does however require quite a
32 few parameters, one of which is a vector, making it awkward to include in
33 a standard "scalar" ISA. Additionally the coordinates can be set to circular,
34 linear or hyperbolic, producing three different modes, and the algorithm
35 may also be run in either "vector" mode or "rotation" mode. See [[discussion]]
36
37 CORDIC can also be used for performing DCT. See
38 <https://arxiv.org/abs/1606.02424>
39
40 vx, vy = CORDIC(vx, vy, coordinate\_mode, beta)
41
42 int i = 0;
43 int iterations = 0; // Number of times to run the algorithm
44 float arctanTable[iterations]; // in Radians
45 float K = 0.6073; // K
46 float v_x,v_y; // Vector v; x and y components
47
48 for(i=0; i < iterations; i++) {
49 arctanTable[i] = atan(pow(2,-i));
50 }
51
52 float vnew_x; // To store the new value of x;
53 for(i = 0; i < iterations; i++) {
54 // If beta is negative, we need to do a counter-clockwise rotation:
55 if( beta < 0) {
56 vnew_x = v_x + (v_y*pow(2,-i));
57 v_y -= (v_x*pow(2,-i));
58 beta += arctanTable[i];
59 }
60 // If beta is positive, we need to do a clockwise rotation:
61 else {
62 vnew_x = v_x - (v_y*pow(2,-i));
63 v_y += (v_x*pow(2,-i));
64 beta -= arctanTable[i];
65 }
66 v_x = vnew_x;
67 }
68 v_x *= K;
69 v_y *= K;
70
71 Links:
72
73 * <http://www.myhdl.org/docs/examples/sinecomp/>
74
75 ## Vector cross product
76
77 SUBVL=2,3,4 all regs. VL nonzero produces multiple vd results.
78
79 * VCROSS vd, vs1, vs1
80
81 Result is the cross product of x and y, i.e., the resulting components are, in order:
82
83 x[1] * y[2] - y[1] * x[2]
84 x[2] * y[0] - y[2] * x[0]
85 x[0] * y[1] - y[0] * x[1]
86
87 All the operands must be vectors of 3 components of a floating-point type.
88
89 Pseudocode:
90
91 vec3 a, b; // elements in order a.x, a.y, a.z
92 // compute a cross b:
93 vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
94 vec3 t2 = b.zxy;
95 vec3 t3 = a.zxy;
96 vec3 t4 = b.yzx;
97 vec3 p = t3 * t4;
98 vec3 cross = t1 * t2 - p;
99
100 ## Vector dot product
101
102 * SUBVL ignored on rd. SUBVL=2,3,4 vs1,vs2
103 * rd=scalar, SUBVL=1, vs1, vs2=vec
104
105 * VDOT rd, vs1, vs2
106
107 Computes the dot product of two vectors. Internal accuracy must be
108 greater than the input vectors and the result.
109
110 Pseudocode in python:
111
112 from operator import mul
113 sum(map(mul, A, B))
114
115 Pseudocode in c:
116
117 double dot_product(float v[], float u[], int n)
118 {
119 double result = 0.0;
120 for (int i = 0; i < n; i++)
121 result += v[i] * u[i];
122 return result;
123 }
124
125 ## Vector length
126
127 * rd=scalar, vs1=vec (SUBVL=1)
128 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
129 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
130 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
131
132 * VLEN rd, vs1
133
134 The scalar length of a vector:
135
136 sqrt(x[0]^2 + x[1]^2 + ...).
137
138 ## Vector distance
139
140 * VDIST rd, vs1, vs2
141
142 The scalar distance between two vectors. Subtracts one vector from the
143 other and returns length:
144
145 length(v0 - v1)
146
147 ## Vector LERP
148
149 * VLERP rd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
150
151 Known as **fmix** in GLSL.
152
153 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
154
155 Pseudocode:
156
157 // Imprecise method, which does not guarantee v = v1 when t = 1,
158 // due to floating-point arithmetic error.
159 // This form may be used when the hardware has a native fused
160 // multiply-add instruction.
161 float lerp(float v0, float v1, float t) {
162 return v0 + t * (v1 - v0);
163 }
164
165 // Precise method, which guarantees v = v1 when t = 1.
166 float lerp(float v0, float v1, float t) {
167 return (1 - t) * v0 + t * v1;
168 }
169
170 ## Vector SLERP
171
172 * VSLERP vd, vs1, vs2, rs3
173
174 Not recommended as it is not commonly used and has several trigonometric
175 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
176
177 <https://en.m.wikipedia.org/wiki/Slerp>
178
179 Pseudocode:
180
181 Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
182 // Only unit quaternions are valid rotations.
183 // Normalize to avoid undefined behavior.
184 v0.normalize();
185 v1.normalize();
186
187 // Compute the cosine of the angle between the two vectors.
188 double dot = dot_product(v0, v1);
189
190 // If the dot product is negative, slerp won't take
191 // the shorter path. Note that v1 and -v1 are equivalent when
192 // the negation is applied to all four components. Fix by
193 // reversing one quaternion.
194 if (dot < 0.0f) {
195 v1 = -v1;
196 dot = -dot;
197 }
198
199 const double DOT_THRESHOLD = 0.9995;
200 if (dot > DOT_THRESHOLD) {
201 // If the inputs are too close for comfort, linearly interpolate
202 // and normalize the result.
203
204 Quaternion result = v0 + t*(v1 - v0);
205 result.normalize();
206 return result;
207 }
208
209 // Since dot is in range [0, DOT_THRESHOLD], acos is safe
210 double theta_0 = acos(dot); // theta_0 = angle between input vectors
211 double theta = theta_0*t; // theta = angle between v0 and result
212 double sin_theta = sin(theta); // compute this value only once
213 double sin_theta_0 = sin(theta_0); // compute this value only once
214
215 double s0 = cos(theta) - dot * sin_theta / sin_theta_0; // == sin(theta_0 - theta) / sin(theta_0)
216 double s1 = sin_theta / sin_theta_0;
217
218 return (s0 * v0) + (s1 * v1);
219 }
220
221 However this algorithm does not involve transcendentals except in
222 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
223
224 function v = cordic(beta,n)
225 % This function computes v = [cos(beta), sin(beta)] (beta in radians)
226 % using n iterations. Increasing n will increase the precision.
227
228 if beta < -pi/2 || beta > pi/2
229 if beta < 0
230 v = cordic(beta + pi, n);
231 else
232 v = cordic(beta - pi, n);
233 end
234 v = -v; % flip the sign for second or third quadrant
235 return
236 end
237
238 % Initialization of tables of constants used by CORDIC
239 % need a table of arctangents of negative powers of two, in radians:
240 % angles = atan(2.^-(0:27));
241 angles = [ ...
242 0.78539816339745 0.46364760900081
243 0.24497866312686 0.12435499454676 ...
244 0.06241880999596 0.03123983343027
245 0.01562372862048 0.00781234106010 ...
246 0.00390623013197 0.00195312251648
247 0.00097656218956 0.00048828121119 ...
248 0.00024414062015 0.00012207031189
249 0.00006103515617 0.00003051757812 ...
250 0.00001525878906 0.00000762939453
251 0.00000381469727 0.00000190734863 ...
252 0.00000095367432 0.00000047683716
253 0.00000023841858 0.00000011920929 ...
254 0.00000005960464 0.00000002980232
255 0.00000001490116 0.00000000745058 ];
256 % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
257 % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
258 Kvalues = [ ...
259 0.70710678118655 0.63245553203368
260 0.61357199107790 0.60883391251775 ...
261 0.60764825625617 0.60735177014130
262 0.60727764409353 0.60725911229889 ...
263 0.60725447933256 0.60725332108988
264 0.60725303152913 0.60725295913894 ...
265 0.60725294104140 0.60725293651701
266 0.60725293538591 0.60725293510314 ...
267 0.60725293503245 0.60725293501477
268 0.60725293501035 0.60725293500925 ...
269 0.60725293500897 0.60725293500890
270 0.60725293500889 0.60725293500888 ];
271 Kn = Kvalues(min(n, length(Kvalues)));
272
273 % Initialize loop variables:
274 v = [1;0]; % start with 2-vector cosine and sine of zero
275 poweroftwo = 1;
276 angle = angles(1);
277
278 % Iterations
279 for j = 0:n-1;
280 if beta < 0
281 sigma = -1;
282 else
283 sigma = 1;
284 end
285 factor = sigma * poweroftwo;
286 % Note the matrix multiplication can be done using scaling by
287 % powers of two and addition subtraction
288 R = [1, -factor; factor, 1];
289 v = R * v; % 2-by-2 matrix multiply
290 beta = beta - sigma * angle; % update the remaining angle
291 poweroftwo = poweroftwo / 2;
292 % update the angle from table, or eventually by just dividing by two
293 if j+2 > length(angles)
294 angle = angle / 2;
295 else
296 angle = angles(j+2);
297 end
298 end
299
300 % Adjust length of output vector to be [cos(beta), sin(beta)]:
301 v = v * Kn;
302 return
303
304 endfunction
305
306 2x2 matrix multiply can be done with shifts and adds:
307
308 x = v[0] - sigma * (v[1] * 2^(-j));
309 y = sigma * (v[0] * 2^(-j)) + v[1];
310 v = [x; y];
311
312 The technique is outlined in a paper as being applicable to 3D:
313 <https://www.atlantis-press.com/proceedings/jcis2006/232>
314
315 # Expensive 3-operand OP32 operations
316
317 3-operand operations are extremely expensive in terms of OP32 encoding space. A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
318
319 Another is to overwrite one of the src registers.
320
321 # Opcode Table
322
323 TODO
324
325 # Links
326
327 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
328 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
329 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>