(no commit message)
[libreriscv.git] / simple_v_extension / vector_ops.mdwn
1 [[!tag standards]]
2
3 # Vector Operations Extension to SV
4
5 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
6
7 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
8
9 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
10
11 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
12
13 Examples which can require SUBVL include cross product and may in future involve complex numbers.
14
15 ## CORDIC
16
17 * SUBVL=2, vd, vs; SUBVL ignored on beta.
18 * VL nonzero ok. beta as scalar ok (applies across all vectors)
19 * non vector args vd, vs, or SUBVL!=2 reserved.
20
21 6 opcode options (fmt3):
22
23 * CORDIC.lin.rot vd, vs, beta
24 * CORDIC.cir.rot vd, vs, beta
25 * CORDIC.hyp.rot vd, vs, beta
26 * CORDIC.lin.vec vd, vs, beta
27 * CORDIC.cir.vec vd, vs, beta
28 * CORDIC.hyp.vec vd, vs, beta
29
30 CORDIC is an extremely general-purpose algorithm useful for a huge number
31 of diverse purposes. In its full form it does however require quite a
32 few parameters, one of which is a vector, making it awkward to include in
33 a standard "scalar" ISA. Additionally the coordinates can be set to circular,
34 linear or hyperbolic, producing three different modes, and the algorithm
35 may also be run in either "vector" mode or "rotation" mode. See [[discussion]]
36
37 CORDIC can also be used for performing DCT. See
38 <https://arxiv.org/abs/1606.02424>
39
40 vx, vy = CORDIC(vx, vy, coordinate\_mode, beta)
41
42 int i = 0;
43 int iterations = 0; // Number of times to run the algorithm
44 float arctanTable[iterations]; // in Radians
45 float K = 0.6073; // K
46 float v_x,v_y; // Vector v; x and y components
47
48 for(i=0; i < iterations; i++) {
49 arctanTable[i] = atan(pow(2,-i));
50 }
51
52 float vnew_x; // To store the new value of x;
53 for(i = 0; i < iterations; i++) {
54 // If beta is negative, we need to do a counter-clockwise rotation:
55 if( beta < 0) {
56 vnew_x = v_x + (v_y*pow(2,-i));
57 v_y -= (v_x*pow(2,-i));
58 beta += arctanTable[i];
59 }
60 // If beta is positive, we need to do a clockwise rotation:
61 else {
62 vnew_x = v_x - (v_y*pow(2,-i));
63 v_y += (v_x*pow(2,-i));
64 beta -= arctanTable[i];
65 }
66 v_x = vnew_x;
67 }
68 v_x *= K;
69 v_y *= K;
70
71 Links:
72
73 * <http://www.myhdl.org/docs/examples/sinecomp/>
74 * <https://www.atlantis-press.com/proceedings/jcis2006/232>
75
76 ## Vector cross product
77
78 SUBVL=3, all regs. VL nonzero produces multiple vd results.
79
80 * VCROSS vd, vs1, vs1
81
82 Result is the cross product of x and y, i.e., the resulting components are, in order:
83
84 x[1] * y[2] - y[1] * x[2]
85 x[2] * y[0] - y[2] * x[0]
86 x[0] * y[1] - y[0] * x[1]
87
88 All the operands must be vectors of 3 components of a floating-point type.
89
90 Pseudocode:
91
92 vec3 a, b; // elements in order a.x, a.y, a.z
93 // compute a cross b:
94 vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
95 vec3 t2 = b.zxy;
96 vec3 t3 = a.zxy;
97 vec3 t4 = b.yzx;
98 vec3 p = t3 * t4;
99 vec3 cross = t1 * t2 - p;
100
101 Assembler:
102
103 fpermute,2130 F4, F1
104 fpermute,1320 F5, F1
105 fpermute,2130 F6, F2
106 fpermute,1320 F7, F2
107 fmul F8, F5, F6
108 fmulsub F3, F4, F7, F8
109
110 ## Vector dot product
111
112 * SUBVL ignored on rd. SUBVL=2,3,4 vs1,vs2, if all vectors, multiple results generated. If rd scalar, only first (unpredicated) SUBVector is used.
113 * rd=scalar, SUBVL=1 and vs1, vs2=vec will produce one scalar result. Predication allowed on src vectors.
114
115 * VDOT rd, vs1, vs2
116
117 Computes the dot product of two vectors. Internal accuracy must be
118 greater than the input vectors and the result.
119
120 Pseudocode in python:
121
122 from operator import mul
123 sum(map(mul, A, B))
124
125 Pseudocode in c:
126
127 double dot_product(float v[], float u[], int n)
128 {
129 double result = 0.0;
130 for (int i = 0; i < n; i++)
131 result += v[i] * u[i];
132 return result;
133 }
134
135 ## Vector Normalisation (not included)
136
137 Vector normalisation may be performed through dot product, recip square root and multiplication:
138
139 fdot F3, F1, F1 # vector dot with self
140 rcpsqrta F3, F3
141 fscale,0 F2, F3, F1
142
143 Or it may be performed through VLEN (Vector length) and division.
144
145 ## Vector length
146
147 * rd=scalar, vs1=vec (SUBVL=1)
148 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
149 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
150 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
151
152 * VLEN rd, vs1
153
154 The scalar length of a vector:
155
156 sqrt(x[0]^2 + x[1]^2 + ...).
157
158 One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.
159
160 ## Vector distance
161
162 * VDIST rd, vs1, vs2
163
164 The scalar distance between two vectors. Subtracts one vector from the
165 other and returns length:
166
167 length(v0 - v1)
168
169 ## Vector LERP
170
171 * VLERP rd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
172
173 Known as **fmix** in GLSL.
174
175 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
176
177 Pseudocode:
178
179 // Imprecise method, which does not guarantee v = v1 when t = 1,
180 // due to floating-point arithmetic error.
181 // This form may be used when the hardware has a native fused
182 // multiply-add instruction.
183 float lerp(float v0, float v1, float t) {
184 return v0 + t * (v1 - v0);
185 }
186
187 // Precise method, which guarantees v = v1 when t = 1.
188 float lerp(float v0, float v1, float t) {
189 return (1 - t) * v0 + t * v1;
190 }
191
192 ## Vector SLERP
193
194 * VSLERP vd, vs1, vs2, rs3
195
196 Not recommended as it is not commonly used and has several trigonometric
197 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
198
199 <https://en.m.wikipedia.org/wiki/Slerp>
200
201 Pseudocode:
202
203 Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
204 // Only unit quaternions are valid rotations.
205 // Normalize to avoid undefined behavior.
206 v0.normalize();
207 v1.normalize();
208
209 // Compute the cosine of the angle between the two vectors.
210 double dot = dot_product(v0, v1);
211
212 // If the dot product is negative, slerp won't take
213 // the shorter path. Note that v1 and -v1 are equivalent when
214 // the negation is applied to all four components. Fix by
215 // reversing one quaternion.
216 if (dot < 0.0f) {
217 v1 = -v1;
218 dot = -dot;
219 }
220
221 const double DOT_THRESHOLD = 0.9995;
222 if (dot > DOT_THRESHOLD) {
223 // If the inputs are too close for comfort, linearly interpolate
224 // and normalize the result.
225
226 Quaternion result = v0 + t*(v1 - v0);
227 result.normalize();
228 return result;
229 }
230
231 // Since dot is in range [0, DOT_THRESHOLD], acos is safe
232 double theta_0 = acos(dot); // theta_0 = angle between input vectors
233 double theta = theta_0*t; // theta = angle between v0 and result
234 double sin_theta = sin(theta); // compute this value only once
235 double sin_theta_0 = sin(theta_0); // compute this value only once
236
237 double s0 = cos(theta) - dot * sin_theta / sin_theta_0; // == sin(theta_0 - theta) / sin(theta_0)
238 double s1 = sin_theta / sin_theta_0;
239
240 return (s0 * v0) + (s1 * v1);
241 }
242
243 However this algorithm does not involve transcendentals except in
244 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
245
246 function v = cordic(beta,n)
247 % This function computes v = [cos(beta), sin(beta)] (beta in radians)
248 % using n iterations. Increasing n will increase the precision.
249
250 if beta < -pi/2 || beta > pi/2
251 if beta < 0
252 v = cordic(beta + pi, n);
253 else
254 v = cordic(beta - pi, n);
255 end
256 v = -v; % flip the sign for second or third quadrant
257 return
258 end
259
260 % Initialization of tables of constants used by CORDIC
261 % need a table of arctangents of negative powers of two, in radians:
262 % angles = atan(2.^-(0:27));
263 angles = [ ...
264 0.78539816339745 0.46364760900081
265 0.24497866312686 0.12435499454676 ...
266 0.06241880999596 0.03123983343027
267 0.01562372862048 0.00781234106010 ...
268 0.00390623013197 0.00195312251648
269 0.00097656218956 0.00048828121119 ...
270 0.00024414062015 0.00012207031189
271 0.00006103515617 0.00003051757812 ...
272 0.00001525878906 0.00000762939453
273 0.00000381469727 0.00000190734863 ...
274 0.00000095367432 0.00000047683716
275 0.00000023841858 0.00000011920929 ...
276 0.00000005960464 0.00000002980232
277 0.00000001490116 0.00000000745058 ];
278 % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
279 % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
280 Kvalues = [ ...
281 0.70710678118655 0.63245553203368
282 0.61357199107790 0.60883391251775 ...
283 0.60764825625617 0.60735177014130
284 0.60727764409353 0.60725911229889 ...
285 0.60725447933256 0.60725332108988
286 0.60725303152913 0.60725295913894 ...
287 0.60725294104140 0.60725293651701
288 0.60725293538591 0.60725293510314 ...
289 0.60725293503245 0.60725293501477
290 0.60725293501035 0.60725293500925 ...
291 0.60725293500897 0.60725293500890
292 0.60725293500889 0.60725293500888 ];
293 Kn = Kvalues(min(n, length(Kvalues)));
294
295 % Initialize loop variables:
296 v = [1;0]; % start with 2-vector cosine and sine of zero
297 poweroftwo = 1;
298 angle = angles(1);
299
300 % Iterations
301 for j = 0:n-1;
302 if beta < 0
303 sigma = -1;
304 else
305 sigma = 1;
306 end
307 factor = sigma * poweroftwo;
308 % Note the matrix multiplication can be done using scaling by
309 % powers of two and addition subtraction
310 R = [1, -factor; factor, 1];
311 v = R * v; % 2-by-2 matrix multiply
312 beta = beta - sigma * angle; % update the remaining angle
313 poweroftwo = poweroftwo / 2;
314 % update the angle from table, or eventually by just dividing by two
315 if j+2 > length(angles)
316 angle = angle / 2;
317 else
318 angle = angles(j+2);
319 end
320 end
321
322 % Adjust length of output vector to be [cos(beta), sin(beta)]:
323 v = v * Kn;
324 return
325
326 endfunction
327
328 2x2 matrix multiply can be done with shifts and adds:
329
330 x = v[0] - sigma * (v[1] * 2^(-j));
331 y = sigma * (v[0] * 2^(-j)) + v[1];
332 v = [x; y];
333
334 The technique is outlined in a paper as being applicable to 3D:
335 <https://www.atlantis-press.com/proceedings/jcis2006/232>
336
337 # Expensive 3-operand OP32 operations
338
339 3-operand operations are extremely expensive in terms of OP32 encoding space. A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
340
341 Another is to overwrite one of the src registers.
342
343 # Opcode Table
344
345 TODO
346
347 # Links
348
349 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
350 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
351 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>
352
353 Research Papers
354
355 * <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>