[[!tag standards]]

# Vector Operations Extension to SV

This extension defines vector operations that would otherwise take several cycles to complete in software. Because the priority in 3D is to compute as many pixels per clock as possible, the normal RISC rules (reduce opcode count and make heavy use of macro-op fusion) do not necessarily apply.

This extension usually depends on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector, the operations in this extension interpret the elements of that subvector as a single vector.

Normally in SV all operations are scalar and independent, so they may inherently be parallelised independently, and the result is a vector of length exactly equal to that of the input vectors.

In this extension the subvector itself is typically the unit, although some operations also work on scalars or standard vectors, and for some the result is a scalar that depends on all elements within the vector arguments.

However, given that some of the parameters are vectors (with and without SUBVL set) and some are scalars (where SUBVL does not apply), clear rules need to be defined as to how the operations work.

Examples which can require SUBVL include cross product, and may in future include complex numbers.

## CORDIC

6 opcode options (fmt3):

* CORDIC.lin.rot vd, vs, beta
* CORDIC.cir.rot vd, vs, beta
* CORDIC.hyp.rot vd, vs, beta
* CORDIC.lin.vec vd, vs, beta
* CORDIC.cir.vec vd, vs, beta
* CORDIC.hyp.vec vd, vs, beta

SUBVL must be set to 2, and applies to vd and vs. SUBVL is *ignored* on beta. vd and vs must be marked as vectors.

VL may be applied. beta may be a scalar, in which case it applies across all subvectors of vd and vs. Single predication, sourced from vd, is permitted, as is swizzle.

Encodings in which vd or vs is not a vector, or in which SUBVL != 2, are reserved.

CORDIC is an extremely general-purpose algorithm useful for a huge number
of diverse purposes. In its full form it does however require quite a
few parameters, one of which is a vector, making it awkward to include in
a standard "scalar" ISA. Additionally the coordinates can be set to circular,
linear or hyperbolic, producing three different modes, and the algorithm
may also be run in either "vector" mode or "rotation" mode. See [[discussion]].
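
The three coordinate systems and two modes listed above can be sketched with the textbook unified CORDIC iteration. This is a minimal floating-point model, not the instruction's definition: the names `cordic` and `shift_sequence`, the argument layout, the iteration count and the final division by the gain are all assumptions made for illustration.

```python
import math

def shift_sequence(coord, n):
    """Shift amounts per iteration (hypothetical helper).

    Hyperbolic mode starts at 1 and repeats shifts 4, 13, 40, ...
    which the textbook algorithm needs in order to converge.
    """
    if coord != "hyp":
        return list(range(n))
    shifts, i, repeat = [], 1, 4
    while len(shifts) < n:
        shifts.append(i)
        if i == repeat:
            shifts.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return shifts[:n]

def cordic(x, y, z, coord="cir", vectoring=False, n=40):
    """Unified CORDIC on the pair (x, y) with angle accumulator z.

    coord is "cir", "lin" or "hyp"; vectoring=False is rotation mode.
    Returns (x, y, z) with the CORDIC gain already divided out
    (an assumption: the extension text does not say who compensates).
    """
    m = {"cir": 1.0, "lin": 0.0, "hyp": -1.0}[coord]
    gain = 1.0
    for i in shift_sequence(coord, n):
        t = 2.0 ** -i
        if coord == "cir":
            step = math.atan(t)
        elif coord == "hyp":
            step = math.atanh(t)
        else:
            step = t
        if vectoring:
            d = 1.0 if y < 0 else -1.0   # drive y towards zero
        else:
            d = 1.0 if z >= 0 else -1.0  # drive z towards zero
        x, y, z = x - m * d * y * t, y + d * x * t, z - d * step
        gain *= math.sqrt(1.0 + m * t * t)
    return x / gain, y / gain, z
```

Under these assumptions, circular rotation of (1, 0) by z yields (cos z, sin z), linear rotation yields y + x*z, hyperbolic rotation yields (cosh z, sinh z), and circular vectoring yields the magnitude in x and the angle in z.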

CORDIC can also be used for performing DCT. See
<https://arxiv.org/abs/1606.02424>

Several papers describe RADIX-4 CORDIC for efficient pipelining. Each stage requires its own ROM tables, which can get costly. Two combinatorial blocks may be chained together to double the RADIX and halve the pipeline depth, at the cost of doubling the latency.

Also, good accuracy, particularly at the limits of the CORDIC input range, requires internal computations at double the bitwidth of the output. This is similar to how MUL requires double the bitwidth to compute.

Links:

* <http://www.myhdl.org/docs/examples/sinecomp/>
* <https://www.atlantis-press.com/proceedings/jcis2006/232>

## Vector cross product

* VCROSS vd, vs1, vs2

The result is the cross product of x (vs1) and y (vs2).

SUBVL must be set to 3, and all regs must be vectors. VL nonzero produces multiple results in vd.

The resulting components are, in order:

    x[1] * y[2] - y[1] * x[2]
    x[2] * y[0] - y[2] * x[0]
    x[0] * y[1] - y[0] * x[1]

All the operands must be vectors of 3 components of a floating-point type.
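
As an illustration of the component order above, here is a small Python model in which VL subvectors of 3 elements each are processed in turn. The function name `vcross` and the flat-list register layout (subvector elements stored contiguously) are assumptions, not part of the specification.

```python
def vcross(vs1, vs2, vl=1):
    """Model of VCROSS with SUBVL=3: one 3-element result per VL step.

    vs1 and vs2 are flat lists of vl * 3 floats (assumed layout).
    """
    subvl = 3
    vd = []
    for i in range(vl):
        x = vs1[i * subvl:(i + 1) * subvl]
        y = vs2[i * subvl:(i + 1) * subvl]
        vd += [
            x[1] * y[2] - y[1] * x[2],
            x[2] * y[0] - y[2] * x[0],
            x[0] * y[1] - y[0] * x[1],
        ]
    return vd
```

For example, `vcross([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])` gives the unit z vector, and with `vl=2` two independent cross products are written to consecutive subvectors of the result.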

Pseudocode:

    vec3 a, b; // elements in order a.x, a.y, a.z
    // compute a cross b:
    vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
    vec3 t2 = b.zxy;
    vec3 t3 = a.zxy;
    vec3 t4 = b.yzx;
    vec3 p = t3 * t4;
    vec3 cross = t1 * t2 - p;
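
The swizzle formulation in the pseudocode can be checked against a known result with a few lines of Python. The helper `swizzle` and the function `cross_swizzled` are hypothetical names used only for this sketch.

```python
def swizzle(v, order):
    """Return the components of v in the named order, e.g. "yzx"."""
    return [v["xyz".index(c)] for c in order]

def cross_swizzled(a, b):
    """Cross product written as two swizzled multiplies and a subtract."""
    t1 = swizzle(a, "yzx")
    t2 = swizzle(b, "zxy")
    t3 = swizzle(a, "zxy")
    t4 = swizzle(b, "yzx")
    p = [m * n for m, n in zip(t3, t4)]
    return [m * n - q for m, n, q in zip(t1, t2, p)]
```

The final step is an elementwise multiply followed by a subtract, which is why a fused multiply-subtract is a natural fit for it.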

Assembler:

    fswizzlei,2130 F4, F1
    fswizzlei,1320 F5, F1
    fswizzlei,2130 F6, F2
    fswizzlei,1320 F7, F2
    fmul F8, F5, F6
    fmulsub F3, F4, F7, F8

## Vector dot product

* VDOT rd, vs1, vs2

Computes the dot product of two vectors. Internal accuracy must be
greater than that of the input vectors and the result.

There are two possible argument options:

* SUBVL=2,3,4, with vs1 and vs2 set as vectors: multiple results are generated. When VL is set and rd is a scalar, only the first (unpredicated) subvector is used to create a result (standard behaviour for single predication). Otherwise, if rd is a vector, multiple scalar results are calculated (i.e. SUBVL is always ignored for rd). Swizzling may be applied.
* SUBVL=1, with rd=scalar, vs1=vec and vs2=vec: one scalar result is generated from the entire src vectors. Predication is allowed on the src vectors.
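
The two argument options can be modelled roughly as follows. This is a sketch under stated assumptions: the name `vdot`, the flat-list register layout, the use of `None` for masked-out destination elements, and the reading of "first (unpredicated) subvector" as "first subvector whose predicate bit is set" are all interpretations, not definitions from this page.

```python
def vdot(vs1, vs2, vl, subvl=1, rd_is_vector=False, pred=None):
    """Rough model of the two VDOT argument options.

    subvl == 1: reduce the whole (predicated) source vectors to one scalar.
    subvl 2..4: one dot product per subvector; a scalar rd receives only
    the first subvector that is not masked out.
    """
    pred = [True] * vl if pred is None else pred
    if subvl == 1:
        return sum(a * b for a, b, p in zip(vs1[:vl], vs2[:vl], pred) if p)
    results = [None] * vl  # None: destination element left untouched
    for i in range(vl):
        if not pred[i]:
            continue
        lo = i * subvl
        r = sum(a * b for a, b in zip(vs1[lo:lo + subvl], vs2[lo:lo + subvl]))
        if not rd_is_vector:
            return r
        results[i] = r
    return results if rd_is_vector else None
```

With `subvl=3`, `vl=2` and a vector destination, the sources `[1, 2, 3, 4, 5, 6]` and `[1, 1, 1, 2, 2, 2]` produce two independent results, one per subvector.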

Pseudocode in python:

    from operator import mul
    # A and B are the two input sequences
    sum(map(mul, A, B))

Pseudocode in c:

    double dot_product(float v[], float u[], int n)
    {
        double result = 0.0;
        for (int i = 0; i < n; i++)
            result += (double)v[i] * u[i]; // widen before multiplying
        return result;
    }

## Vector Normalisation (not included)

Vector normalisation may be performed through dot product, reciprocal square root and multiplication:

    fdot F3, F1, F1 # vector dot with self
    rcpsqrta F3, F3
    fscale,0 F2, F3, F1

Or it may be performed through VLEN (Vector length) and division.
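
Both routes can be sketched in Python to show that they agree. The function names `normalise_rsqrt` and `normalise_div` are illustrative only, and `1.0 / math.sqrt(d)` stands in for a hardware reciprocal square root.

```python
import math

def normalise_rsqrt(v):
    """Dot with self, reciprocal square root, then scale."""
    d = sum(e * e for e in v)      # fdot with self
    r = 1.0 / math.sqrt(d)         # stand-in for reciprocal sqrt
    return [r * e for e in v]      # scale every element

def normalise_div(v):
    """Vector length followed by a division per element."""
    length = math.sqrt(sum(e * e for e in v))
    return [e / length for e in v]
```

The first form needs one reciprocal square root and then only multiplies, whereas the second needs a division per element.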

## Vector length

* VLEN rd, vs1

Argument options:

* rd=scalar, vs1=vec (SUBVL=1)
* rd=scalar, vs1=vec (SUBVL=2,3,4): only 1 result (predication rules apply)
* rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
* rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.

The scalar length of a vector:

    sqrt(x[0]^2 + x[1]^2 + ...)

One option is for this to be a macro-op fusion sequence, with inverse-sqrt also being a second macro-op sequence suitable for normalisation.
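
A rough Python model of the argument options for VLEN, ignoring predication for brevity. The name `vlen`, the flat-list layout and the use of an exception to represent the reserved encoding are assumptions for illustration.

```python
import math

def vlen(vs1, vl, subvl=1, rd_is_vector=False):
    """Rough model of VLEN (predication not modelled)."""
    if subvl == 1:
        if rd_is_vector:
            raise ValueError("reserved encoding")
        # Whole source vector reduced to a single scalar length.
        return math.sqrt(sum(e * e for e in vs1[:vl]))
    # One length per subvector of subvl elements.
    lengths = [
        math.sqrt(sum(e * e for e in vs1[i * subvl:(i + 1) * subvl]))
        for i in range(vl)
    ]
    # A scalar rd receives only the first result.
    return lengths if rd_is_vector else lengths[0]
```

For instance, with `subvl=2` and `vl=2` the source `[3.0, 4.0, 6.0, 8.0]` yields the lengths 5.0 and 10.0 when rd is a vector, and only 5.0 when rd is a scalar.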

## Vector distance

* VDIST rd, vs1, vs2

The scalar distance between two vectors. Subtracts one vector from the
other and returns the length of the difference:

    length(v0 - v1)
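
In Python the same computation is an elementwise subtract followed by the length calculation. The name `vdist` is illustrative, and the sketch covers a single pair of vectors only.

```python
import math

def vdist(v0, v1):
    """Euclidean distance: length of the elementwise difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v0, v1)))
```

The distance is symmetric, so the order of the two source operands does not affect the result.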

## Vector LERP

* VLERP vd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1

| Instr | result | src1 | src2 | SUBVL | VL |

Known as **mix** in GLSL (FMix in the SPIR-V GLSL.std.450 extended instruction set).

<https://en.m.wikipedia.org/wiki/Linear_interpolation>

Pseudocode:

    // Imprecise method, which does not guarantee v = v1 when t = 1,
    // due to floating-point arithmetic error.
    // This form may be used when the hardware has a native fused
    // multiply-add instruction.
    float lerp(float v0, float v1, float t) {
        return v0 + t * (v1 - v0);
    }

    // Precise method, which guarantees v = v1 when t = 1.
    float lerp(float v0, float v1, float t) {
        return (1 - t) * v0 + t * v1;
    }
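
The difference between the two methods shows up when v0 and v1 differ greatly in magnitude. The sketch below also models one possible reading of the operand comment above, in which each SUBVL=2 subvector of vs1 holds the pair (v0, v1) and rs2 holds the scalar t. That reading, and the names `lerp_fast`, `lerp_precise` and `vlerp`, are assumptions.

```python
def lerp_fast(v0, v1, t):
    """Imprecise form: one subtract and one multiply-add."""
    return v0 + t * (v1 - v0)

def lerp_precise(v0, v1, t):
    """Precise form: returns exactly v1 when t == 1."""
    return (1 - t) * v0 + t * v1

def vlerp(vs1, t, vl=1, method=lerp_precise):
    """Assumed operand layout: vs1 is a flat list of (v0, v1) pairs."""
    return [method(vs1[2 * i], vs1[2 * i + 1], t) for i in range(vl)]
```

With `v0 = 1e20`, `v1 = 1.0` and `t = 1.0`, the subtraction in `lerp_fast` loses `v1` entirely to rounding, so it returns 0.0, whereas `lerp_precise` returns 1.0.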

## Vector SLERP

* VSLERP vd, vs1, vs2, rs3

Not recommended, as it is not commonly used and requires several trigonometric
functions, although CORDIC in vector rotate circular mode is designed for this purpose. It is also a costly 4-argument operation.

<https://en.m.wikipedia.org/wiki/Slerp>

Pseudocode:

    Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
        // Only unit quaternions are valid rotations.
        // Normalize to avoid undefined behavior.
        v0.normalize();
        v1.normalize();

        // Compute the cosine of the angle between the two vectors.
        double dot = dot_product(v0, v1);

        // If the dot product is negative, slerp won't take
        // the shorter path. Note that v1 and -v1 are equivalent when
        // the negation is applied to all four components. Fix by
        // reversing one quaternion.
        if (dot < 0.0f) {
            v1 = -v1;
            dot = -dot;
        }

        const double DOT_THRESHOLD = 0.9995;
        if (dot > DOT_THRESHOLD) {
            // If the inputs are too close for comfort, linearly interpolate
            // and normalize the result.

            Quaternion result = v0 + t*(v1 - v0);
            result.normalize();
            return result;
        }

        // Since dot is in range [0, DOT_THRESHOLD], acos is safe
        double theta_0 = acos(dot); // theta_0 = angle between input vectors
        double theta = theta_0*t; // theta = angle between v0 and result
        double sin_theta = sin(theta); // compute this value only once
        double sin_theta_0 = sin(theta_0); // compute this value only once

        double s0 = cos(theta) - dot * sin_theta / sin_theta_0; // == sin(theta_0 - theta) / sin(theta_0)
        double s1 = sin_theta / sin_theta_0;

        return (s0 * v0) + (s1 * v1);
    }
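
The pseudocode relies on a `Quaternion` class that is not shown. A self-contained Python rendering of the same steps, using plain 4-element lists in (w, x, y, z) order, might look like this; the component order and the helper name `_normalise` are assumptions.

```python
import math

def _normalise(q):
    """Scale q to unit length (hypothetical helper)."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]

def slerp(q0, q1, t):
    """Spherical linear interpolation between two quaternions."""
    q0, q1 = _normalise(q0), _normalise(q1)
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # Take the shorter path by reversing one quaternion.
        q1 = [-c for c in q1]
        dot = -dot
    if dot > 0.9995:
        # Inputs nearly parallel: fall back to normalised lerp.
        return _normalise([a + t * (b - a) for a, b in zip(q0, q1)])
    theta_0 = math.acos(dot)       # angle between the inputs
    theta = theta_0 * t            # angle between q0 and the result
    s1 = math.sin(theta) / math.sin(theta_0)
    s0 = math.cos(theta) - dot * s1
    return [s0 * a + s1 * b for a, b in zip(q0, q1)]
```

Interpolating halfway between the identity and a 90 degree rotation about z gives a 45 degree rotation about z, which is a convenient sanity check.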

However, the CORDIC algorithm does not involve transcendentals except in
the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>

    function v = cordic(beta,n)
    % This function computes v = [cos(beta), sin(beta)] (beta in radians)
    % using n iterations. Increasing n will increase the precision.

    if beta < -pi/2 || beta > pi/2
        if beta < 0
            v = cordic(beta + pi, n);
        else
            v = cordic(beta - pi, n);
        end
        v = -v; % flip the sign for second or third quadrant
        return
    end

    % Initialization of tables of constants used by CORDIC
    % need a table of arctangents of negative powers of two, in radians:
    % angles = atan(2.^-(0:27));
    angles = [ ...
        0.78539816339745 0.46364760900081 0.24497866312686 0.12435499454676 ...
        0.06241880999596 0.03123983343027 0.01562372862048 0.00781234106010 ...
        0.00390623013197 0.00195312251648 0.00097656218956 0.00048828121119 ...
        0.00024414062015 0.00012207031189 0.00006103515617 0.00003051757812 ...
        0.00001525878906 0.00000762939453 0.00000381469727 0.00000190734863 ...
        0.00000095367432 0.00000047683716 0.00000023841858 0.00000011920929 ...
        0.00000005960464 0.00000002980232 0.00000001490116 0.00000000745058 ];
    % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
    % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
    Kvalues = [ ...
        0.70710678118655 0.63245553203368 0.61357199107790 0.60883391251775 ...
        0.60764825625617 0.60735177014130 0.60727764409353 0.60725911229889 ...
        0.60725447933256 0.60725332108988 0.60725303152913 0.60725295913894 ...
        0.60725294104140 0.60725293651701 0.60725293538591 0.60725293510314 ...
        0.60725293503245 0.60725293501477 0.60725293501035 0.60725293500925 ...
        0.60725293500897 0.60725293500890 0.60725293500889 0.60725293500888 ];
    Kn = Kvalues(min(n, length(Kvalues)));

    % Initialize loop variables:
    v = [1;0]; % start with 2-vector cosine and sine of zero
    poweroftwo = 1;
    angle = angles(1);

    % Iterations
    for j = 0:n-1;
        if beta < 0
            sigma = -1;
        else
            sigma = 1;
        end
        factor = sigma * poweroftwo;
        % Note the matrix multiplication can be done using scaling by
        % powers of two and addition subtraction
        R = [1, -factor; factor, 1];
        v = R * v; % 2-by-2 matrix multiply
        beta = beta - sigma * angle; % update the remaining angle
        poweroftwo = poweroftwo / 2;
        % update the angle from table, or eventually by just dividing by two
        if j+2 > length(angles)
            angle = angle / 2;
        else
            angle = angles(j+2);
        end
    end

    % Adjust length of output vector to be [cos(beta), sin(beta)]:
    v = v * Kn;
    return

    endfunction

The 2x2 matrix multiply can be done with shifts and adds:

    x = v[0] - sigma * (v[1] * 2^(-j));
    y = sigma * (v[0] * 2^(-j)) + v[1];
    v = [x; y];
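
To make the shifts-and-adds point concrete, here is a fixed-point Python sketch in which the loop body uses only integer shifts, additions and table lookups. The word format (30 fractional bits), the iteration count, and the names `cordic_sincos_fixed`, `ANGLES` and `K_FIXED` are assumptions chosen for illustration; floating point is used only to build the tables and to convert the inputs and outputs.

```python
import math

FRAC_BITS = 30          # assumed fixed-point format: 30 fractional bits
N_ITER = 30             # assumed iteration count
ONE = 1 << FRAC_BITS

# Tables are the only place where transcendentals are evaluated.
ANGLES = [round(math.atan(2.0 ** -j) * ONE) for j in range(N_ITER)]
K = 1.0
for j in range(N_ITER):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * j))
K_FIXED = round(K * ONE)

def cordic_sincos_fixed(beta):
    """Return (cos(beta), sin(beta)) for beta in [-pi/2, pi/2] radians."""
    z = round(beta * ONE)
    # Start from the pre-scaled vector [K, 0] so that no final
    # multiplication by the gain correction is needed.
    x, y = K_FIXED, 0
    for j in range(N_ITER):
        if z >= 0:
            x, y, z = x - (y >> j), y + (x >> j), z - ANGLES[j]
        else:
            x, y, z = x + (y >> j), y - (x >> j), z + ANGLES[j]
    return x / ONE, y / ONE
```

Python's `>>` on negative integers is an arithmetic shift, which matches what a hardware implementation on two's-complement values would do. Truncation in the shifts accumulates a small error, which is one reason for carrying extra internal bits.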

The technique is outlined in a paper as being applicable to 3D:
<https://www.atlantis-press.com/proceedings/jcis2006/232>

# Expensive 3-operand OP32 operations

3-operand operations are extremely expensive in terms of OP32 encoding space. One potential idea is to embed 3 RVC register formats across two out of the three 5-bit fields rs1/rs2/rd.

Another is to overwrite one of the src registers.

# Opcode Table

TODO

# Links

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
* <http://bugs.libre-riscv.org/show_bug.cgi?id=142>

Research Papers

* <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>