(no commit message)
[libreriscv.git] / simple_v_extension / vector_ops.mdwn
1 [[!tag standards]]
2
3 # Vector Operations Extension to SV
4
5 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
6
7 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
8
9 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
10
11 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
12
13 Examples which can require SUBVL include cross product and may in future involve complex numbers.
14
15 ## CORDIC
16
17 * SUBVL=2, vd, vs; SUBVL ignored on beta.
18 * VL nonzero ok. beta as scalar ok (applies across all vectors)
19 * non vector args vd, vs, or SUBVL!=2 reserved.
20
21 6 opcode options (fmt3):
22
23 * CORDIC.lin.rot vd, vs, beta
24 * CORDIC.cir.rot vd, vs, beta
25 * CORDIC.hyp.rot vd, vs, beta
26 * CORDIC.lin.vec vd, vs, beta
27 * CORDIC.cir.vec vd, vs, beta
28 * CORDIC.hyp.vec vd, vs, beta
29
30 CORDIC is an extremely general-purpose algorithm useful for a huge number
31 of diverse purposes. In its full form it does however require quite a
32 few parameters, one of which is a vector, making it awkward to include in
33 a standard "scalar" ISA. Additionally the coordinates can be set to circular,
34 linear or hyperbolic, producing three different modes, and the algorithm
35 may also be run in either "vector" mode or "rotation" mode. See [[discussion]]
36
37 CORDIC can also be used for performing DCT. See
38 <https://arxiv.org/abs/1606.02424>
39
40 CORDIC has several RADIX-4 papers for efficient pipelining. Each stage requires its own ROM tables which can get costly. Two combinatorial blocks may be chained together to double the RADIX and halve the pipeline depth, at the cost of doubling the latency.
41
42 Also, to get good accuracy, particularly at the limits of CORDIC input range, requires double the bitwidth of the output in internal computations. This similar to how MUL requires double the bitwidth to compute.
43
44 Links:
45
46 * <http://www.myhdl.org/docs/examples/sinecomp/>
47 * <https://www.atlantis-press.com/proceedings/jcis2006/232>
48
49 ## Vector cross product
50
51 SUBVL=3, all regs. VL nonzero produces multiple vd results.
52
53 * VCROSS vd, vs1, vs1
54
55 Result is the cross product of x and y, i.e., the resulting components are, in order:
56
57 x[1] * y[2] - y[1] * x[2]
58 x[2] * y[0] - y[2] * x[0]
59 x[0] * y[1] - y[0] * x[1]
60
61 All the operands must be vectors of 3 components of a floating-point type.
62
63 Pseudocode:
64
65 vec3 a, b; // elements in order a.x, a.y, a.z
66 // compute a cross b:
67 vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
68 vec3 t2 = b.zxy;
69 vec3 t3 = a.zxy;
70 vec3 t4 = b.yzx;
71 vec3 p = t3 * t4;
72 vec3 cross = t1 * t2 - p;
73
74 Assembler:
75
76 fpermute,2130 F4, F1
77 fpermute,1320 F5, F1
78 fpermute,2130 F6, F2
79 fpermute,1320 F7, F2
80 fmul F8, F5, F6
81 fmulsub F3, F4, F7, F8
82
83 ## Vector dot product
84
85 * SUBVL ignored on rd. SUBVL=2,3,4 vs1,vs2, if all vectors, multiple results generated. If rd scalar, only first (unpredicated) SUBVector is used.
86 * rd=scalar, SUBVL=1 and vs1, vs2=vec will produce one scalar result. Predication allowed on src vectors.
87
88 * VDOT rd, vs1, vs2
89
90 Computes the dot product of two vectors. Internal accuracy must be
91 greater than the input vectors and the result.
92
93 Pseudocode in python:
94
95 from operator import mul
96 sum(map(mul, A, B))
97
98 Pseudocode in c:
99
100 double dot_product(float v[], float u[], int n)
101 {
102 double result = 0.0;
103 for (int i = 0; i < n; i++)
104 result += v[i] * u[i];
105 return result;
106 }
107
108 ## Vector Normalisation (not included)
109
110 Vector normalisation may be performed through dot product, recip square root and multiplication:
111
112 fdot F3, F1, F1 # vector dot with self
113 rcpsqrta F3, F3
114 fscale,0 F2, F3, F1
115
116 Or it may be performed through VLEN (Vector length) and division.
117
118 ## Vector length
119
120 * rd=scalar, vs1=vec (SUBVL=1)
121 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
122 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
123 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
124
125 * VLEN rd, vs1
126
127 The scalar length of a vector:
128
129 sqrt(x[0]^2 + x[1]^2 + ...).
130
131 One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.
132
133 ## Vector distance
134
135 * VDIST rd, vs1, vs2
136
137 The scalar distance between two vectors. Subtracts one vector from the
138 other and returns length:
139
140 length(v0 - v1)
141
142 ## Vector LERP
143
144 * VLERP rd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
145
146 Known as **fmix** in GLSL.
147
148 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
149
150 Pseudocode:
151
152 // Imprecise method, which does not guarantee v = v1 when t = 1,
153 // due to floating-point arithmetic error.
154 // This form may be used when the hardware has a native fused
155 // multiply-add instruction.
156 float lerp(float v0, float v1, float t) {
157 return v0 + t * (v1 - v0);
158 }
159
160 // Precise method, which guarantees v = v1 when t = 1.
161 float lerp(float v0, float v1, float t) {
162 return (1 - t) * v0 + t * v1;
163 }
164
165 ## Vector SLERP
166
167 * VSLERP vd, vs1, vs2, rs3
168
169 Not recommended as it is not commonly used and has several trigonometric
170 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
171
172 <https://en.m.wikipedia.org/wiki/Slerp>
173
174 Pseudocode:
175
176 Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
177 // Only unit quaternions are valid rotations.
178 // Normalize to avoid undefined behavior.
179 v0.normalize();
180 v1.normalize();
181
182 // Compute the cosine of the angle between the two vectors.
183 double dot = dot_product(v0, v1);
184
185 // If the dot product is negative, slerp won't take
186 // the shorter path. Note that v1 and -v1 are equivalent when
187 // the negation is applied to all four components. Fix by
188 // reversing one quaternion.
189 if (dot < 0.0f) {
190 v1 = -v1;
191 dot = -dot;
192 }
193
194 const double DOT_THRESHOLD = 0.9995;
195 if (dot > DOT_THRESHOLD) {
196 // If the inputs are too close for comfort, linearly interpolate
197 // and normalize the result.
198
199 Quaternion result = v0 + t*(v1 - v0);
200 result.normalize();
201 return result;
202 }
203
204 // Since dot is in range [0, DOT_THRESHOLD], acos is safe
205 double theta_0 = acos(dot); // theta_0 = angle between input vectors
206 double theta = theta_0*t; // theta = angle between v0 and result
207 double sin_theta = sin(theta); // compute this value only once
208 double sin_theta_0 = sin(theta_0); // compute this value only once
209
210 double s0 = cos(theta) - dot * sin_theta / sin_theta_0; // == sin(theta_0 - theta) / sin(theta_0)
211 double s1 = sin_theta / sin_theta_0;
212
213 return (s0 * v0) + (s1 * v1);
214 }
215
216 However this algorithm does not involve transcendentals except in
217 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
218
219 function v = cordic(beta,n)
220 % This function computes v = [cos(beta), sin(beta)] (beta in radians)
221 % using n iterations. Increasing n will increase the precision.
222
223 if beta < -pi/2 || beta > pi/2
224 if beta < 0
225 v = cordic(beta + pi, n);
226 else
227 v = cordic(beta - pi, n);
228 end
229 v = -v; % flip the sign for second or third quadrant
230 return
231 end
232
233 % Initialization of tables of constants used by CORDIC
234 % need a table of arctangents of negative powers of two, in radians:
235 % angles = atan(2.^-(0:27));
236 angles = [ ...
237 0.78539816339745 0.46364760900081
238 0.24497866312686 0.12435499454676 ...
239 0.06241880999596 0.03123983343027
240 0.01562372862048 0.00781234106010 ...
241 0.00390623013197 0.00195312251648
242 0.00097656218956 0.00048828121119 ...
243 0.00024414062015 0.00012207031189
244 0.00006103515617 0.00003051757812 ...
245 0.00001525878906 0.00000762939453
246 0.00000381469727 0.00000190734863 ...
247 0.00000095367432 0.00000047683716
248 0.00000023841858 0.00000011920929 ...
249 0.00000005960464 0.00000002980232
250 0.00000001490116 0.00000000745058 ];
251 % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
252 % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
253 Kvalues = [ ...
254 0.70710678118655 0.63245553203368
255 0.61357199107790 0.60883391251775 ...
256 0.60764825625617 0.60735177014130
257 0.60727764409353 0.60725911229889 ...
258 0.60725447933256 0.60725332108988
259 0.60725303152913 0.60725295913894 ...
260 0.60725294104140 0.60725293651701
261 0.60725293538591 0.60725293510314 ...
262 0.60725293503245 0.60725293501477
263 0.60725293501035 0.60725293500925 ...
264 0.60725293500897 0.60725293500890
265 0.60725293500889 0.60725293500888 ];
266 Kn = Kvalues(min(n, length(Kvalues)));
267
268 % Initialize loop variables:
269 v = [1;0]; % start with 2-vector cosine and sine of zero
270 poweroftwo = 1;
271 angle = angles(1);
272
273 % Iterations
274 for j = 0:n-1;
275 if beta < 0
276 sigma = -1;
277 else
278 sigma = 1;
279 end
280 factor = sigma * poweroftwo;
281 % Note the matrix multiplication can be done using scaling by
282 % powers of two and addition subtraction
283 R = [1, -factor; factor, 1];
284 v = R * v; % 2-by-2 matrix multiply
285 beta = beta - sigma * angle; % update the remaining angle
286 poweroftwo = poweroftwo / 2;
287 % update the angle from table, or eventually by just dividing by two
288 if j+2 > length(angles)
289 angle = angle / 2;
290 else
291 angle = angles(j+2);
292 end
293 end
294
295 % Adjust length of output vector to be [cos(beta), sin(beta)]:
296 v = v * Kn;
297 return
298
299 endfunction
300
301 2x2 matrix multiply can be done with shifts and adds:
302
303 x = v[0] - sigma * (v[1] * 2^(-j));
304 y = sigma * (v[0] * 2^(-j)) + v[1];
305 v = [x; y];
306
307 The technique is outlined in a paper as being applicable to 3D:
308 <https://www.atlantis-press.com/proceedings/jcis2006/232>
309
310 # Expensive 3-operand OP32 operations
311
312 3-operand operations are extremely expensive in terms of OP32 encoding space. A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
313
314 Another is to overwrite one of the src registers.
315
316 # Opcode Table
317
318 TODO
319
320 # Links
321
322 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
323 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
324 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>
325
326 Research Papers
327
328 * <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>