(no commit message)
[libreriscv.git] / simple_v_extension / vector_ops.mdwn
1 [[!tag oldstandards]]
2
3 **OBSOLETE**, see [[openpower/sv/3d_vector_ops]]
4
5 # Vector Operations Extension to SV
6
7 This extension defines vector operations that would otherwise take several cycles to complete in software. With 3D priorities being to compute as many pixels per clock as possible, the normal RISC rules (reduce opcode count and make heavy use of macro op fusion) do not necessarily apply.
8
9 This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector the operations in this extension interpret the elements as a single vector.
10
11 Normally in SV all operations are scalar and independent, and the operations on them may inherently be independently parallelised, with the result being a vector of length exactly equal to the input vectors.
12
13 In this extension, the subvector itself is typically the unit, although some operations will work on scalars or standard vectors as well, or the result is a scalar that is dependent on all elements within the vector arguments.
14
15 However given that some of the parameters are vectors (with and without SUBVL set), and some are scalars (where SUBVL will not apply), some clear rules need to be defined as to how the operations work.
16
17 Examples which can require SUBVL include cross product and may in future involve complex numbers.
18
19 ## CORDIC
20
21 6 opcode options (fmt3):
22
23 * CORDIC.lin.rot vd, vs, beta
24 * CORDIC.cir.rot vd, vs, beta
25 * CORDIC.hyp.rot vd, vs, beta
26 * CORDIC.lin.vec vd, vs, beta
27 * CORDIC.cir.vec vd, vs, beta
28 * CORDIC.hyp.vec vd, vs, beta
29
30
31 | Instr | result | src1 | src2 | SUBVL | VL | Notes |
32 | ------------------ | ------ | ---- | ---- | ----- | -- | ------ |
33 | CORDIC.x.t vd, vs1, rs2 | vec2 | vec2 | scal | 2 | any | src2 ignores SUBVL |
34
35 SUBVL must be set to 2 and applies to vd and vs. SUBVL is *ignored* on beta. vd and vs must be marked as vectors.
36
37 VL may be applied. beta as a scalar is ok (applies across all vectors vd and vs). Predication is also ok (single predication) sourced from vd. Use of swizzle is also ok.
38
39 Non vector args vd, vs are reserved encodings.
40
41 CORDIC is an extremely general-purpose algorithm useful for a huge number
42 of diverse purposes. In its full form it does however require quite a
43 few parameters, one of which is a vector, making it awkward to include in
44 a standard "scalar" ISA. Additionally the coordinates can be set to circular,
45 linear or hyperbolic, producing three different modes, and the algorithm
46 may also be run in either "vector" mode or "rotation" mode. See [[discussion]]
47
48 CORDIC can also be used for performing DCT. See
49 <https://arxiv.org/abs/1606.02424>
50
51 CORDIC has several RADIX-4 papers for efficient pipelining. Each stage requires its own ROM tables which can get costly. Two combinatorial blocks may be chained together to double the RADIX and halve the pipeline depth, at the cost of doubling the latency.
52
53 Also, to get good accuracy, particularly at the limits of CORDIC input range, requires double the bitwidth of the output in internal computations. This similar to how MUL requires double the bitwidth to compute.
54
55 Links:
56
57 * <http://www.myhdl.org/docs/examples/sinecomp/>
58 * <https://www.atlantis-press.com/proceedings/jcis2006/232>
59
60 ## Vector cross product
61
62 * VCROSS vd, vs1, vs1
63
64 Result is the cross product of x and y.
65
66 SUBVL must be set to 3, and all regs must be vectors. VL nonzero produces multiple results in vd.
67
68 | Instr | result | src1 | src2 | SUBVL | VL |
69 | ------------------ | ------ | ---- | ---- | ----- | -- |
70 | VCROSS vd, vs1, vs2 | vec3 | vec3 | vec3 | 3 | any |
71
72 The resulting components are, in order:
73
74 x[1] * y[2] - y[1] * x[2]
75 x[2] * y[0] - y[2] * x[0]
76 x[0] * y[1] - y[0] * x[1]
77
78 All the operands must be vectors of 3 components of a floating-point type.
79
80 Pseudocode:
81
82 vec3 a, b; // elements in order a.x, a.y, a.z
83 // compute a cross b:
84 vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
85 vec3 t2 = b.zxy;
86 vec3 t3 = a.zxy;
87 vec3 t4 = b.yzx;
88 vec3 p = t3 * t4;
89 vec3 cross = t1 * t2 - p;
90
91 Assembler:
92
93 fswizzlei,2130 F4, F1
94 fswizzlei,1320 F5, F1
95 fswizzlei,2130 F6, F2
96 fswizzlei,1320 F7, F2
97 fmul F8, F5, F6
98 fmulsub F3, F4, F7, F8
99
100 ## Vector dot product
101
102 * VDOT rd, vs1, vs2
103
104 Computes the dot product of two vectors. Internal accuracy must be
105 greater than the input vectors and the result.
106
107 There are two possible argument options:
108
109 * SUBVL=2,3,4 vs1 and vs2 set as vectors, multiple results are generated. When VL is set, only the first (unpredicated) SUBVector is used to create a result, if rd is scalar (standard behaviour for single predication). Otherwise, if rd is a vector, multiple scalar results are calculated (i.e. SUBVL is always ignored for rd). Swizzling may be applied.
110 * When rd=scalar, SUBVL=1 and vs1=vec, vs2=vec, one scalar result is generated from the entire src vectors. Predication is allowed on the src vectors.
111
112
113 | Instr | result | src1 | src2 | SUBVL | VL |
114 | ------------------ | ------ | ---- | ---- | ----- | -- |
115 | VDOT rd, vs1, vs2 | scal | vec | vec | 2-4 | any |
116 | VDOT rd, vs1, vs2 | scal | vec | vec | 1 | any |
117
118 Pseudocode in python:
119
120 from operator import mul
121 sum(map(mul, A, B))
122
123 Pseudocode in c:
124
125 double dot_product(float v[], float u[], int n)
126 {
127 double result = 0.0;
128 for (int i = 0; i < n; i++)
129 result += v[i] * u[i];
130 return result;
131 }
132
133 ## Vector Normalisation (not included)
134
135 Vector normalisation may be performed through dot product, recip square root and multiplication:
136
137 fdot F3, F1, F1 # vector dot with self
138 rcpsqrta F3, F3
139 fscale,0 F2, F3, F1
140
141 Or it may be performed through VLEN (Vector length) and division.
142
143 ## Vector length
144
145 * rd=scalar, vs1=vec (SUBVL=1)
146 * rd=scalar, vs1=vec (SUBVL=2,3,4) only 1 (predication rules apply)
147 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
148 * rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.
149
150 * VLEN rd, vs1
151
152 The scalar length of a vector:
153
154 sqrt(x[0]^2 + x[1]^2 + ...).
155
156 One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.
157
158 ## Vector distance
159
160 * VDIST rd, vs1, vs2
161
162 The scalar distance between two vectors. Subtracts one vector from the
163 other and returns length:
164
165 length(v0 - v1)
166
167 ## Vector LERP
168
169 * VLERP vd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1
170
171 | Instr | result | src1 | src2 | SUBVL | VL |
172 | ------------------ | ------ | ---- | ---- | ----- | -- |
173 | VLERP vd, vs1, rs2 | vec2 | vec2 | scal | 2 | any |
174
175 Known as **fmix** in GLSL.
176
177 <https://en.m.wikipedia.org/wiki/Linear_interpolation>
178
179 Pseudocode:
180
181 // Imprecise method, which does not guarantee v = v1 when t = 1,
182 // due to floating-point arithmetic error.
183 // This form may be used when the hardware has a native fused
184 // multiply-add instruction.
185 float lerp(float v0, float v1, float t) {
186 return v0 + t * (v1 - v0);
187 }
188
189 // Precise method, which guarantees v = v1 when t = 1.
190 float lerp(float v0, float v1, float t) {
191 return (1 - t) * v0 + t * v1;
192 }
193
194 ## Vector SLERP
195
196 * VSLERP vd, vs1, vs2, rs3
197
198 Not recommended as it is not commonly used and has several trigonometric
199 functions, although CORDIC in vector rotate circular mode is designed for this purpose. Also a costly 4 arg operation.
200
201 <https://en.m.wikipedia.org/wiki/Slerp>
202
203 Pseudocode:
204
205 Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
206 // Only unit quaternions are valid rotations.
207 // Normalize to avoid undefined behavior.
208 v0.normalize();
209 v1.normalize();
210
211 // Compute the cosine of the angle between the two vectors.
212 double dot = dot_product(v0, v1);
213
214 // If the dot product is negative, slerp won't take
215 // the shorter path. Note that v1 and -v1 are equivalent when
216 // the negation is applied to all four components. Fix by
217 // reversing one quaternion.
218 if (dot < 0.0f) {
219 v1 = -v1;
220 dot = -dot;
221 }
222
223 const double DOT_THRESHOLD = 0.9995;
224 if (dot > DOT_THRESHOLD) {
225 // If the inputs are too close for comfort, linearly interpolate
226 // and normalize the result.
227
228 Quaternion result = v0 + t*(v1 - v0);
229 result.normalize();
230 return result;
231 }
232
233 // Since dot is in range [0, DOT_THRESHOLD], acos is safe
234 double theta_0 = acos(dot); // theta_0 = angle between input vectors
235 double theta = theta_0*t; // theta = angle between v0 and result
236 double sin_theta = sin(theta); // compute this value only once
237 double sin_theta_0 = sin(theta_0); // compute this value only once
238
239 double s0 = cos(theta) - dot * sin_theta / sin_theta_0; // == sin(theta_0 - theta) / sin(theta_0)
240 double s1 = sin_theta / sin_theta_0;
241
242 return (s0 * v0) + (s1 * v1);
243 }
244
245 However this algorithm does not involve transcendentals except in
246 the computation of the tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>
247
248 function v = cordic(beta,n)
249 % This function computes v = [cos(beta), sin(beta)] (beta in radians)
250 % using n iterations. Increasing n will increase the precision.
251
252 if beta < -pi/2 || beta > pi/2
253 if beta < 0
254 v = cordic(beta + pi, n);
255 else
256 v = cordic(beta - pi, n);
257 end
258 v = -v; % flip the sign for second or third quadrant
259 return
260 end
261
262 % Initialization of tables of constants used by CORDIC
263 % need a table of arctangents of negative powers of two, in radians:
264 % angles = atan(2.^-(0:27));
265 angles = [ ...
266 0.78539816339745 0.46364760900081
267 0.24497866312686 0.12435499454676 ...
268 0.06241880999596 0.03123983343027
269 0.01562372862048 0.00781234106010 ...
270 0.00390623013197 0.00195312251648
271 0.00097656218956 0.00048828121119 ...
272 0.00024414062015 0.00012207031189
273 0.00006103515617 0.00003051757812 ...
274 0.00001525878906 0.00000762939453
275 0.00000381469727 0.00000190734863 ...
276 0.00000095367432 0.00000047683716
277 0.00000023841858 0.00000011920929 ...
278 0.00000005960464 0.00000002980232
279 0.00000001490116 0.00000000745058 ];
280 % and a table of products of reciprocal lengths of vectors [1, 2^-2j]:
281 % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
282 Kvalues = [ ...
283 0.70710678118655 0.63245553203368
284 0.61357199107790 0.60883391251775 ...
285 0.60764825625617 0.60735177014130
286 0.60727764409353 0.60725911229889 ...
287 0.60725447933256 0.60725332108988
288 0.60725303152913 0.60725295913894 ...
289 0.60725294104140 0.60725293651701
290 0.60725293538591 0.60725293510314 ...
291 0.60725293503245 0.60725293501477
292 0.60725293501035 0.60725293500925 ...
293 0.60725293500897 0.60725293500890
294 0.60725293500889 0.60725293500888 ];
295 Kn = Kvalues(min(n, length(Kvalues)));
296
297 % Initialize loop variables:
298 v = [1;0]; % start with 2-vector cosine and sine of zero
299 poweroftwo = 1;
300 angle = angles(1);
301
302 % Iterations
303 for j = 0:n-1;
304 if beta < 0
305 sigma = -1;
306 else
307 sigma = 1;
308 end
309 factor = sigma * poweroftwo;
310 % Note the matrix multiplication can be done using scaling by
311 % powers of two and addition subtraction
312 R = [1, -factor; factor, 1];
313 v = R * v; % 2-by-2 matrix multiply
314 beta = beta - sigma * angle; % update the remaining angle
315 poweroftwo = poweroftwo / 2;
316 % update the angle from table, or eventually by just dividing by two
317 if j+2 > length(angles)
318 angle = angle / 2;
319 else
320 angle = angles(j+2);
321 end
322 end
323
324 % Adjust length of output vector to be [cos(beta), sin(beta)]:
325 v = v * Kn;
326 return
327
328 endfunction
329
330 2x2 matrix multiply can be done with shifts and adds:
331
332 x = v[0] - sigma * (v[1] * 2^(-j));
333 y = sigma * (v[0] * 2^(-j)) + v[1];
334 v = [x; y];
335
336 The technique is outlined in a paper as being applicable to 3D:
337 <https://www.atlantis-press.com/proceedings/jcis2006/232>
338
339 # Expensive 3-operand OP32 operations
340
341 3-operand operations are extremely expensive in terms of OP32 encoding space. A potential idea is to embed 3 RVC register formats across two out of three 5-bit fields rs1/rs2/rd
342
343 Another is to overwrite one of the src registers.
344
345 # Opcode Table
346
347 TODO
348
349 # Links
350
351 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
352 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
353 * <http://bugs.libre-riscv.org/show_bug.cgi?id=142>
354
355 Research Papers
356
357 * <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>