add atan/atanpi pseudo-op note, fix errors, and change pseudo-code to be consistant
[libreriscv.git] / ztrans_proposal.mdwn
1 # Zftrans - transcendental operations
2
3 See:
4
5 * <http://bugs.libre-riscv.org/show_bug.cgi?id=127>
6 * <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
7 * Discussion: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002342.html>
8 * [[rv_major_opcode_1010011]] for opcode listing.
9 * [[zfpacc_proposal]] for accuracy settings proposal
10
11 Extension subsets:
12
13 * **Zftrans**: standard transcendentals (best suited to 3D)
14 * **ZftransExt**: extra functions (useful, not generally needed for 3D,
15 can be synthesised using Ztrans)
16 * **Ztrigpi**: trig. xxx-pi sinpi cospi tanpi
17 * **Ztrignpi**: trig non-xxx-pi sin cos tan
18 * **Zarctrigpi**: arc-trig. a-xxx-pi: atan2pi asinpi acospi
19 * **Zarctrignpi**: arc-trig. non-a-xxx-pi: atan2, asin, acos
20 * **Zfhyp**: hyperbolic/inverse-hyperbolic. sinh, cosh, tanh, asinh,
21 acosh, atanh (can be synthesised - see below)
22 * **ZftransAdv**: much more complex to implement in hardware
23 * **Zfrsqrt**: Reciprocal square-root.
24
25 Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi,
26 Zarctrignpi
27
28 [[!toc levels=2]]
29
30 # TODO:
31
32 * Decision on accuracy, moved to [[zfpacc_proposal]]
33 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html>
34 * Errors **MUST** be repeatable.
35 * How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded?
36 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html>
37 Accuracy requirements for dual (triple) purpose implementations must
38 meet the higher standard.
39 * Reciprocal Square-root is in its own separate extension (Zfrsqrt) as
40 it is desirable on its own by other implementors. This to be evaluated.
41
42 # Requirements <a name="requirements"></a>
43
44 This proposal is designed to meet a wide range of extremely diverse needs,
45 allowing implementors from all of them to benefit from the tools and hardware
46 cost reductions associated with common standards adoption.
47
48 **There are *four* different, disparate platform's needs (two new)**:
49
50 * 3D Embedded Platform
51 * Embedded Platform
52 * 3D UNIX Platform
53 * UNIX Platform
54
55 3D Embedded will require significantly less accuracy and will need to make
56 power budget and die area compromises that other platforms (including Embedded)
57 will not need to make.
58
59 3D UNIX Platform has to be performance-price-competitive: subtly-reduced
60 accuracy in FP32 is acceptable where, conversely, in the UNIX Platform,
61 IEEE754 compliance is a hard requirement that would compromise power
62 and efficiency on a 3D UNIX Platform.
63
64 **The use-cases are**:
65
66 * 3D GPUs
67 * Numerical Computation
68 * (Potentially) A.I. / Machine-learning (1)
69
70 (1) although approximations suffice in this field, making it more likely
71 to use a custom extension. High-end ML would inherently definitely
72 be excluded.
73
74 **The power and die-area requirements vary from**:
75
76 * Ultra-low-power (smartwatches where GPU power budgets are in milliwatts)
77 * Mobile-Embedded (good performance with high efficiency for battery life)
78 * Desktop Computing
79 * Server / HPC (2)
80
81 (2) Supercomputing is left out of the requirements as it is traditionally
82 covered by Supercomputer Vectorisation Standards (such as RVV).
83
84 **The software requirements are**:
85
86 * Full public integration into GNU math libraries (libm)
87 * Full public integration into well-known Numerical Computation systems (numpy)
88 * Full public integration into upstream GNU and LLVM Compiler toolchains
89 * Full public integration into Khronos OpenCL SPIR-V compatible Compilers
90 seeking public Certification and Endorsement from the Khronos Group
91 under their Trademarked Certification Programme.
92
93 **The "contra"-requirements are**:
94
95 * The requirements are **not** for the purposes of developing a full custom
96 proprietary GPU with proprietary firmware.
97 * A full custom proprietary GPU ASIC Manufacturer *may* benefit from
98 this proposal however the fact that they typically develop proprietary
99 software that is not shared with the rest of the community likely to
100 use this proposal means that they have completely different needs.
101
102 This proposal is for *sharing* of effort in reducing development costs,
103 and as such needs to be entirely public, transparent, and fully integrated
104 into public upstream libre-licensed libraries and toolchains.
105
106 *Overall this proposal is categorically and wholly unsuited to
107 relegation of "custom" status*.
108
109 # Proposed Opcodes vs Khronos OpenCL Opcodes <a name="khronos_equiv"></a>
110
111 This list shows the (direct) equivalence between proposed opcodes and
112 their Khronos OpenCL equivalents.
113
114 See
115 <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
116
117 Special FP16 opcodes are *not* being proposed, except by indirect / inherent
118 use of the "fmt" field that is already present in the RISC-V Specification.
119
120 "Native" opcodes are *not* being proposed: implementors will be expected
121 to use the (equivalent) proposed opcode covering the same function.
122
123 "Fast" opcodes are *not* being proposed, because the Khronos Specification
124 fast\_length, fast\_normalise and fast\_distance OpenCL opcodes require
125 vectors (or can be done as scalar operations using other RISC-V instructions).
126
127 The OpenCL FP32 opcodes are **direct** equivalents to the proposed opcodes.
128 Deviation from conformance with the Khronos Specification - including the
129 Khronos Specification accuracy requirements - is not an option.
130
131 [[!table data="""
132 Proposed opcode | OpenCL FP32 | OpenCL FP16 | OpenCL native | OpenCL fast |
133 FSIN | sin | half\_sin | native\_sin | NONE |
134 FCOS | cos | half\_cos | native\_cos | NONE |
135 FTAN | tan | half\_tan | native\_tan | NONE |
136 NONE (1) | sincos | NONE | NONE | NONE |
137 FASIN | asin | NONE | NONE | NONE |
138 FACOS | acos | NONE | NONE | NONE |
139 FATAN | atan | NONE | NONE | NONE |
140 FSINPI | sinpi | NONE | NONE | NONE |
141 FCOSPI | cospi | NONE | NONE | NONE |
142 FTANPI | tanpi | NONE | NONE | NONE |
143 FASINPI | asinpi | NONE | NONE | NONE |
144 FACOSPI | acospi | NONE | NONE | NONE |
145 FATANPI | atanpi | NONE | NONE | NONE |
146 FSINH | sinh | NONE | NONE | NONE |
147 FCOSH | cosh | NONE | NONE | NONE |
148 FTANH | tanh | NONE | NONE | NONE |
149 FASINH | asinh | NONE | NONE | NONE |
150 FACOSH | acosh | NONE | NONE | NONE |
151 FATANH | atanh | NONE | NONE | NONE |
152 FRSQRT | rsqrt | half\_rsqrt | native\_rsqrt | NONE |
153 FCBRT | cbrt | NONE | NONE | NONE |
154 FEXP2 | exp2 | half\_exp2 | native\_exp2 | NONE |
155 FLOG2 | log2 | half\_log2 | native\_log2 | NONE |
156 FEXPM1 | expm1 | NONE | NONE | NONE |
157 FLOG1P | log1p | NONE | NONE | NONE |
158 FEXP | exp | half\_exp | native\_exp | NONE |
159 FLOG | log | half\_log | native\_log | NONE |
160 FEXP10 | exp10 | half\_exp10 | native\_exp10 | NONE |
161 FLOG10 | log10 | half\_log10 | native\_log10 | NONE |
162 FATAN2 | atan2 | NONE | NONE | NONE |
163 FATAN2PI | atan2pi | NONE | NONE | NONE |
164 FPOW | pow | NONE | NONE | NONE |
165 FROOT | rootn | NONE | NONE | NONE |
166 FHYPOT | hypot | NONE | NONE | NONE |
167 FRECIP | NONE | half\_recip | native\_recip | NONE |
168 """]]
169
170 Note (1) FSINCOS is macro-op fused (see below).
171
172 # List of 2-arg opcodes
173
174 [[!table data="""
175 opcode | Description | pseudo-code | Extension |
176 FATAN2 | atan2 arc tangent | rd = atan2(rs2, rs1) | Zarctrignpi |
177 FATAN2PI | atan2 arc tangent / pi | rd = atan2(rs2, rs1) / pi | Zarctrigpi |
178 FPOW | x power of y | rd = pow(rs1, rs2) | ZftransAdv |
179 FROOT | x power 1/y | rd = pow(rs1, 1/rs2) | ZftransAdv |
180 FHYPOT | hypotenuse | rd = sqrt(rs1^2 + rs2^2) | Zftrans |
181 """]]
182
183 # List of 1-arg transcendental opcodes
184
185 [[!table data="""
186 opcode | Description | pseudo-code | Extension |
187 FRSQRT | Reciprocal Square-root | rd = sqrt(rs1) | Zfrsqrt |
188 FCBRT | Cube Root | rd = pow(rs1, 1.0 / 3) | Zftrans |
189 FRECIP | Reciprocal | rd = 1.0 / rs1 | Zftrans |
190 FEXP2 | power-of-2 | rd = pow(2, rs1) | Zftrans |
191 FLOG2 | log2 | rd = log(2. rs1) | Zftrans |
192 FEXPM1 | exponential minus 1 | rd = pow(e, rs1) - 1.0 | Zftrans |
193 FLOG1P | log plus 1 | rd = log(e, 1 + rs1) | Zftrans |
194 FEXP | exponential | rd = pow(e, rs1) | ZftransExt |
195 FLOG | natural log (base e) | rd = log(e, rs1) | ZftransExt |
196 FEXP10 | power-of-10 | rd = pow(10, rs1) | ZftransExt |
197 FLOG10 | log base 10 | rd = log(10, rs1) | ZftransExt |
198 """]]
199
200 # List of 1-arg trigonometric opcodes
201
202 [[!table data="""
203 opcode | Description | pseudo-code | Extension |
204 FSIN | sin (radians) | rd = sin(rs1) | Ztrignpi |
205 FCOS | cos (radians) | rd = cos(rs1) | Ztrignpi |
206 FTAN | tan (radians) | rd = tan(rs1) | Ztrignpi |
207 FASIN | arcsin (radians) | rd = asin(rs1) | Zarctrignpi |
208 FACOS | arccos (radians) | rd = acos(rs1) | Zarctrignpi |
209 FATAN (1) | arctan (radians) | rd = atan(rs1) | Zarctrignpi |
210 FSINPI | sin times pi | rd = sin(pi * rs1) | Ztrigpi |
211 FCOSPI | cos times pi | rd = cos(pi * rs1) | Ztrigpi |
212 FTANPI | tan times pi | rd = tan(pi * rs1) | Ztrigpi |
213 FASINPI | arcsin / pi | rd = asin(rs1) / pi | Zarctrigpi |
214 FACOSPI | arccos / pi | rd = acos(rs1) / pi | Zarctrigpi |
215 FATANPI (1) | arctan / pi | rd = atan(rs1) / pi | Zarctrigpi |
216 FSINH | hyperbolic sin (radians) | rd = sinh(rs1) | Zfhyp |
217 FCOSH | hyperbolic cos (radians) | rd = cosh(rs1) | Zfhyp |
218 FTANH | hyperbolic tan (radians) | rd = tanh(rs1) | Zfhyp |
219 FASINH | inverse hyperbolic sin | rd = asinh(rs1) | Zfhyp |
220 FACOSH | inverse hyperbolic cos | rd = acosh(rs1) | Zfhyp |
221 FATANH | inverse hyperbolic tan | rd = atanh(rs1) | Zfhyp |
222 """]]
223
224 Note (1): FATAN/FATANPI is a pseudo-op expanding to FATAN2/FATAN2PI (needs deciding)
225
226 # Synthesis, Pseudo-code ops and macro-ops
227
228 The pseudo-ops are best left up to the compiler rather than being actual
229 pseudo-ops, by allocating one scalar FP register for use as a constant
230 (loop invariant) set to "1.0" at the beginning of a function or other
231 suitable code block.
232
233 * FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
234 * FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).
235
236 FATANPI example pseudo-code:
237
238 lui t0, 0x3F800 // upper bits of f32 1.0
239 fmv.x.s ft0, t0
240 fatan2pi.s rd, rs1, ft0
241
242 Hyperbolic function example (obviates need for Zfhyp except for
243 high-performance or correctly-rounding):
244
245 ASINH( x ) = ln( x + SQRT(x**2+1))
246
247 # Reciprocal
248
249 Used to be an alias. Some imolementors may wish to implement divide as y times recip(x)
250
251 # To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?
252
253 RISC principle says "exclude LOG because it's covered by LOGP1 plus an ADD".
254 Research needed to ensure that implementors are not compromised by such
255 a decision
256 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html>
257
258 > > correctly-rounded LOG will return different results than LOGP1 and ADD.
259 > > Likewise for EXP and EXPM1
260
261 > ok, they stay in as real opcodes, then.
262
263 # ATAN / ATAN2 commentary
264
265 Discussion starts here:
266 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html>
267
268 from Mitch Alsup:
269
270 would like to point out that the general implementations of ATAN2 do a
271 bunch of special case checks and then simply call ATAN.
272
273 double ATAN2( double y, double x )
274 { // IEEE 754-2008 quality ATAN2
275
276 // deal with NANs
277 if( ISNAN( x ) ) return x;
278 if( ISNAN( y ) ) return y;
279
280 // deal with infinities
281 if( x == +∞ && |y|== +∞ ) return copysign( π/4, y );
282 if( x == +∞ ) return copysign( 0.0, y );
283 if( x == -∞ && |y|== +∞ ) return copysign( 3π/4, y );
284 if( x == -∞ ) return copysign( π, y );
285 if( |y|== +∞ ) return copysign( π/2, y );
286
287 // deal with signed zeros
288 if( x == 0.0 && y != 0.0 ) return copysign( π/2, y );
289 if( x >=+0.0 && y == 0.0 ) return copysign( 0.0, y );
290 if( x <=-0.0 && y == 0.0 ) return copysign( π, y );
291
292 // calculate ATAN2 textbook style
293 if( x > 0.0 ) return ATAN( |y / x| );
294 if( x < 0.0 ) return π - ATAN( |y / x| );
295 }
296
297
298 Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent
299 a constant and then call/use ATAN2.
300
301 When one considers an implementation of ATAN, one must consider several
302 ranges of evaluation::
303
304 x [ -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
305 x (-1.0, +1.0]:: ATAN( x ) = + ATAN( x );
306 x [ 1.0, +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
307
308 I should point out that the add/sub of π/2 can not lose significance
309 since the result of ATAN(1/x) is bounded 0..π/2
310
311 The bottom line is that I think you are choosing to make too many of
312 these into OpCodes, making the hardware function/calculation unit (and
313 sequencer) more complicated that necessary.
314
315 --------------------------------------------------------
316
317 I might suggest that if there were a way for a calculation to be performed
318 and the result of that calculation
319
320 chained to a subsequent calculation such that the precision of the
321 result-becomes-operand is wider than
322
323 what will fit in a register, then you can dramatically reduce the count
324 of instructions in this category while retaining
325
326 acceptable accuracy:
327
328 z = x / y
329
330 can be calculated as::
331
332 z = x * (1/y)
333
334 Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008
335 accurate, but GPUs want speed and
336
337 1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It
338 is also not "that inaccurate" displaying 0.625-to-0.52 ULP.
339
340 Given that one has the ability to carry (and process) more fraction bits,
341 one can then do high precision multiplies of π or other transcendental
342 radixes.
343
344 And GPUs have been doing this almost since the dawn of 3D.
345
346 // calculate ATAN2 high performance style
347 // Note: at this point x != y
348 //
349 if( x > 0.0 )
350 {
351 if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
352 if( y < 0.0 && |y| > |x| ) return + ATAN( y / x );
353 if( y > 0.0 && |y| < |x| ) return + ATAN( y / x );
354 if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
355 }
356 if( x < 0.0 )
357 {
358 if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
359 if( y < 0.0 && |y| > |x| ) return + π - ATAN( y / x );
360 if( y > 0.0 && |y| < |x| ) return + π - ATAN( y / x );
361 if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
362 }
363
364 This way the adds and subtracts from the constant are not in a precision
365 precarious position.