(no commit message)
[libreriscv.git] / simple_v_extension / vector_ops.mdwn
1 [[!tag standards]]
2
3 # Vector Operations Extension to SV
4
5 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
6
7 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
8
9 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
10
11 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
12
13 Examples which can require SUBVL include cross product and may in future involve complex numbers.
14
15 ## CORDIC
16
17 * SUBVL=2, vd, vs; SUBVL ignored on beta.
18 * VL nonzero ok. beta as scalar ok (applies across all vectors)
19 * non vector args vd, vs, or SUBVL!=2 reserved.
20
21 6 opcode options (fmt3):
22
23 * CORDIC.lin.rot vd, vs, beta
24 * CORDIC.cir.rot vd, vs, beta
25 * CORDIC.hyp.rot vd, vs, beta
26 * CORDIC.lin.vec vd, vs, beta
27 * CORDIC.cir.vec vd, vs, beta
28 * CORDIC.hyp.vec vd, vs, beta
29
30 CORDIC is an extremely general-purpose algorithm useful for a huge number
31 of diverse purposes. In its full form it does however require quite a
32 few parameters, one of which is a vector, making it awkward to include in
33 a standard "scalar" ISA. Additionally the coordinates can be set to circular,
34 linear or hyperbolic, producing three different modes, and the algorithm
35 may also be run in either "vector" mode or "rotation" mode. See [[discussion]]
36
37 CORDIC can also be used for performing DCT. See
38 <https://arxiv.org/abs/1606.02424>
39
40 vx, vy = CORDIC(vx, vy, coordinate\_mode, beta)
41
42 int i = 0;
43 int iterations = 0; // Number of times to run the algorithm
44 float arctanTable[iterations]; // in Radians
45 float K = 0.6073; // K
46 float v_x,v_y; // Vector v; x and y components
47
48 for(i=0; i < iterations; i++) {
49 arctanTable[i] = atan(pow(2,-i));
50 }
51
52 float vnew_x; // To store the new value of x;
53 for(i = 0; i < iterations; i++) {
54 // If beta is negative, we need to do a counter-clockwise rotation:
55 if( beta < 0) {
56 vnew_x = v_x + (v_y*pow(2,-i));
57 v_y -= (v_x*pow(2,-i));
58 beta += arctanTable[i];
59 }
60 // If beta is positive, we need to do a clockwise rotation:
61 else {
62 vnew_x = v_x - (v_y*pow(2,-i));
63 v_y += (v_x*pow(2,-i));
64 beta -= arctanTable[i];
65 }
66 v_x = vnew_x;
67 }
68 v_x *= K;
69 v_y *= K;
70
71 Links:
72
73 * <http://www.myhdl.org/docs/examples/sinecomp/>
74
75 ## Vector cross product
76
77 SUBVL=3, all regs. VL nonzero produces multiple vd results.
78
79 * VCROSS vd, vs1, vs1
80
81 Result is the cross product of x and y, i.e., the resulting components are, in order:
82
83 x[1] * y[2] - y[1] * x[2]
84 x[2] * y[0] - y[2] * x[0]
85 x[0] * y[1] - y[0] * x[1]
86
87 All the operands must be vectors of 3 components of a floating-point type.
88
89 Pseudocode:
90
91 vec3 a, b; // elements in order a.x, a.y, a.z
92 // compute a cross b:
93 vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
94 vec3 t2 = b.zxy;
95 vec3 t3 = a.zxy;
96 vec3 t4 = b.yzx;
97 vec3 p = t3 * t4;
98 vec3 cross = t1 * t2 - p;
99
100 Assembler:
101
102 fpermute,2130 F4, F1
103 fpermute,1320 F5, F1
104 fpermute,2130 F6, F2
105 fpermute,1320 F7, F2
106 fmul F8, F5, F6
107 fmulsub F3, F4, F7, F8
108
109 ## Vector dot product
110
111 * SUBVL ignored on rd. SUBVL=2,3,4 vs1,vs2, if all vectors, multiple results generated. If rd scalar, only first (unpredicated) SUBVector is used.
112 * rd=scalar, SUBVL=1 and vs1, vs2=vec will produce one scalar result. Predication allowed on src vectors.
113
114 * VDOT rd, vs1, vs2
115
116 Computes the dot product of two vectors. Internal accuracy must be
117 greater than the input vectors and the result.
118
119 Pseudocode in python:
120
121 from operator import mul
122 sum(map(mul, A, B))
123
124 Pseudocode in c:
125
126 double dot_product(float v[], float u[], int n)
127 {
128 double result = 0.0;
129 for (int i = 0; i < n; i++)
130 result += v[i] * u[i];
131 return result;
132 }
133
134 ## Vector length
135
136 * rd=scalar, vs1=vec (SUBVL=1)
137 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
138 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
139 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
140
141 * VLEN rd, vs1
142
143 The scalar length of a vector:
144
145 sqrt(x[0]^2 + x[1]^2 + ...).
146
147 ## Vector distance
148
149 * VDIST rd, vs1, vs2
150
151 The scalar distance between two vectors. Subtracts one vector from the
152 other and returns length:
153
154 length(v0 - v1)
155
156 ## Vector LERP
157
158 * VLERP rd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
159
160 Known as **fmix** in GLSL.
161
162 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
163
164 Pseudocode:
165
166 // Imprecise method, which does not guarantee v = v1 when t = 1,
167 // due to floating-point arithmetic error.
168 // This form may be used when the hardware has a native fused
169 // multiply-add instruction.
170 float lerp(float v0, float v1, float t) {
171 return v0 + t * (v1 - v0);
172 }
173
174 // Precise method, which guarantees v = v1 when t = 1.
175 float lerp(float v0, float v1, float t) {
176 return (1 - t) * v0 + t * v1;
177 }
178
179 ## Vector SLERP
180
181 * VSLERP vd, vs1, vs2, rs3
182
183 Not recommended as it is not commonly used and has several trigonometric
184 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
185
186 <https://en.m.wikipedia.org/wiki/Slerp>
187
188 Pseudocode:
189
190 Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
191 // Only unit quaternions are valid rotations.
192 // Normalize to avoid undefined behavior.
193 v0.normalize();
194 v1.normalize();
195
196 // Compute the cosine of the angle between the two vectors.
197 double dot = dot_product(v0, v1);
198
199 // If the dot product is negative, slerp won't take
200 // the shorter path. Note that v1 and -v1 are equivalent when
201 // the negation is applied to all four components. Fix by
202 // reversing one quaternion.
203 if (dot < 0.0f) {
204 v1 = -v1;
205 dot = -dot;
206 }
207
208 const double DOT_THRESHOLD = 0.9995;
209 if (dot > DOT_THRESHOLD) {
210 // If the inputs are too close for comfort, linearly interpolate
211 // and normalize the result.
212
213 Quaternion result = v0 + t*(v1 - v0);
214 result.normalize();
215 return result;
216 }
217
218 // Since dot is in range [0, DOT_THRESHOLD], acos is safe
219 double theta_0 = acos(dot); // theta_0 = angle between input vectors
220 double theta = theta_0*t; // theta = angle between v0 and result
221 double sin_theta = sin(theta); // compute this value only once
222 double sin_theta_0 = sin(theta_0); // compute this value only once
223
224 double s0 = cos(theta) - dot * sin_theta / sin_theta_0; // == sin(theta_0 - theta) / sin(theta_0)
225 double s1 = sin_theta / sin_theta_0;
226
227 return (s0 * v0) + (s1 * v1);
228 }
229
230 However this algorithm does not involve transcendentals except in
231 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
232
233 function v = cordic(beta,n)
234 % This function computes v = [cos(beta), sin(beta)] (beta in radians)
235 % using n iterations. Increasing n will increase the precision.
236
237 if beta < -pi/2 || beta > pi/2
238 if beta < 0
239 v = cordic(beta + pi, n);
240 else
241 v = cordic(beta - pi, n);
242 end
243 v = -v; % flip the sign for second or third quadrant
244 return
245 end
246
247 % Initialization of tables of constants used by CORDIC
248 % need a table of arctangents of negative powers of two, in radians:
249 % angles = atan(2.^-(0:27));
250 angles = [ ...
251 0.78539816339745 0.46364760900081
252 0.24497866312686 0.12435499454676 ...
253 0.06241880999596 0.03123983343027
254 0.01562372862048 0.00781234106010 ...
255 0.00390623013197 0.00195312251648
256 0.00097656218956 0.00048828121119 ...
257 0.00024414062015 0.00012207031189
258 0.00006103515617 0.00003051757812 ...
259 0.00001525878906 0.00000762939453
260 0.00000381469727 0.00000190734863 ...
261 0.00000095367432 0.00000047683716
262 0.00000023841858 0.00000011920929 ...
263 0.00000005960464 0.00000002980232
264 0.00000001490116 0.00000000745058 ];
265 % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
266 % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
267 Kvalues = [ ...
268 0.70710678118655 0.63245553203368
269 0.61357199107790 0.60883391251775 ...
270 0.60764825625617 0.60735177014130
271 0.60727764409353 0.60725911229889 ...
272 0.60725447933256 0.60725332108988
273 0.60725303152913 0.60725295913894 ...
274 0.60725294104140 0.60725293651701
275 0.60725293538591 0.60725293510314 ...
276 0.60725293503245 0.60725293501477
277 0.60725293501035 0.60725293500925 ...
278 0.60725293500897 0.60725293500890
279 0.60725293500889 0.60725293500888 ];
280 Kn = Kvalues(min(n, length(Kvalues)));
281
282 % Initialize loop variables:
283 v = [1;0]; % start with 2-vector cosine and sine of zero
284 poweroftwo = 1;
285 angle = angles(1);
286
287 % Iterations
288 for j = 0:n-1;
289 if beta < 0
290 sigma = -1;
291 else
292 sigma = 1;
293 end
294 factor = sigma * poweroftwo;
295 % Note the matrix multiplication can be done using scaling by
296 % powers of two and addition subtraction
297 R = [1, -factor; factor, 1];
298 v = R * v; % 2-by-2 matrix multiply
299 beta = beta - sigma * angle; % update the remaining angle
300 poweroftwo = poweroftwo / 2;
301 % update the angle from table, or eventually by just dividing by two
302 if j+2 > length(angles)
303 angle = angle / 2;
304 else
305 angle = angles(j+2);
306 end
307 end
308
309 % Adjust length of output vector to be [cos(beta), sin(beta)]:
310 v = v * Kn;
311 return
312
313 endfunction
314
315 2x2 matrix multiply can be done with shifts and adds:
316
317 x = v[0] - sigma * (v[1] * 2^(-j));
318 y = sigma * (v[0] * 2^(-j)) + v[1];
319 v = [x; y];
320
321 The technique is outlined in a paper as being applicable to 3D:
322 <https://www.atlantis-press.com/proceedings/jcis2006/232>
323
324 # Expensive 3-operand OP32 operations
325
326 3-operand operations are extremely expensive in terms of OP32 encoding space. A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
327
328 Another is to overwrite one of the src registers.
329
330 # Opcode Table
331
332 TODO
333
334 # Links
335
336 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
337 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
338 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>