1 # Zftrans - transcendental operations
5 * <http://bugs.libre-riscv.org/show_bug.cgi?id=127>
6 * <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
7 * Discussion: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002342.html>
8 * [[rv_major_opcode_1010011]] for opcode listing.
9 * [[zfpacc_proposal]] for accuracy settings proposal
13 * **Zftrans**: standard transcendentals (best suited to 3D)
14 * **ZftransExt**: extra functions (useful, not generally needed for 3D,
15 can be synthesised using Ztrans)
16 * **Ztrigpi**: trig. xxx-pi sinpi cospi tanpi
17 * **Ztrignpi**: trig non-xxx-pi sin cos tan
18 * **Zarctrigpi**: arc-trig. a-xxx-pi: atan2pi asinpi acospi
19 * **Zarctrignpi**: arc-trig. non-a-xxx-pi: atan2, asin, acos
20 * **Zfhyp**: hyperbolic/inverse-hyperbolic. sinh, cosh, tanh, asinh,
21 acosh, atanh (can be synthesised - see below)
22 * **ZftransAdv**: much more complex to implement in hardware
23 * **Zfrsqrt**: Reciprocal square-root.
25 Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi,
32 * Decision on accuracy, moved to [[zfpacc_proposal]]
33 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html>
34 * Errors **MUST** be repeatable.
35 * How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded?
36 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html>
37 Accuracy requirements for dual (triple) purpose implementations must
38 meet the higher standard.
39 * Reciprocal Square-root is in its own separate extension (Zfrsqrt) as
40 it is desirable on its own by other implementors. This to be evaluated.
42 # Requirements <a name="requirements"></a>
44 This proposal is designed to meet a wide range of extremely diverse needs,
45 allowing implementors from all of them to benefit from the tools and hardware
46 cost reductions associated with common standards adoption.
48 **There are *four* different, disparate platform's needs (two new)**:
50 * 3D Embedded Platform
55 3D Embedded will require significantly less accuracy and will need to make
56 power budget and die area compromises that other platforms (including Embedded)
57 will not need to make.
59 3D UNIX Platform has to be performance-price-competitive: subtly-reduced
60 accuracy in FP32 is acceptable where, conversely, in the UNIX Platform,
61 IEEE754 compliance is a hard requirement that would compromise power
62 and efficiency on a 3D UNIX Platform.
64 **The use-cases are**:
67 * Numerical Computation
68 * (Potentially) A.I. / Machine-learning (1)
70 (1) although approximations suffice in this field, making it more likely
71 to use a custom extension. High-end ML would inherently definitely
74 **The power and die-area requirements vary from**:
76 * Ultra-low-power (smartwatches where GPU power budgets are in milliwatts)
77 * Mobile-Embedded (good performance with high efficiency for battery life)
81 (2) Supercomputing is left out of the requirements as it is traditionally
82 covered by Supercomputer Vectorisation Standards (such as RVV).
84 **The software requirements are**:
86 * Full public integration into GNU math libraries (libm)
87 * Full public integration into well-known Numerical Computation systems (numpy)
88 * Full public integration into upstream GNU and LLVM Compiler toolchains
89 * Full public integration into Khronos OpenCL SPIR-V compatible Compilers
90 seeking public Certification and Endorsement from the Khronos Group
91 under their Trademarked Certification Programme.
93 **The "contra"-requirements are**:
95 * The requirements are **not** for the purposes of developing a full custom
96 proprietary GPU with proprietary firmware.
97 * A full custom proprietary GPU ASIC Manufacturer *may* benefit from
98 this proposal however the fact that they typically develop proprietary
99 software that is not shared with the rest of the community likely to
100 use this proposal means that they have completely different needs.
102 This proposal is for *sharing* of effort in reducing development costs,
103 and as such needs to be entirely public, transparent, and fully integrated
104 into public upstream libre-licensed libraries and toolchains.
106 *Overall this proposal is categorically and wholly unsuited to
107 relegation of "custom" status*.
109 # Proposed Opcodes vs Khronos OpenCL Opcodes <a name="khronos_equiv"></a>
111 This list shows the (direct) equivalence between proposed opcodes and
112 their Khronos OpenCL equivalents.
115 <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
117 Special FP16 opcodes are *not* being proposed, except by indirect / inherent
118 use of the "fmt" field that is already present in the RISC-V Specification.
120 "Native" opcodes are *not* being proposed: implementors will be expected
121 to use the (equivalent) proposed opcode covering the same function.
123 "Fast" opcodes are *not* being proposed, because the Khronos Specification
124 fast\_length, fast\_normalise and fast\_distance OpenCL opcodes require
125 vectors (or can be done as scalar operations using other RISC-V instructions).
127 The OpenCL FP32 opcodes are **direct** equivalents to the proposed opcodes.
128 Deviation from conformance with the Khronos Specification - including the
129 Khronos Specification accuracy requirements - is not an option.
132 Proposed opcode | OpenCL FP32 | OpenCL FP16 | OpenCL native | OpenCL fast |
133 FSIN | sin | half\_sin | native\_sin | NONE |
134 FCOS | cos | half\_cos | native\_cos | NONE |
135 FTAN | tan | half\_tan | native\_tan | NONE |
136 NONE (1) | sincos | NONE | NONE | NONE |
137 FASIN | asin | NONE | NONE | NONE |
138 FACOS | acos | NONE | NONE | NONE |
139 FATAN | atan | NONE | NONE | NONE |
140 FSINPI | sinpi | NONE | NONE | NONE |
141 FCOSPI | cospi | NONE | NONE | NONE |
142 FTANPI | tanpi | NONE | NONE | NONE |
143 FASINPI | asinpi | NONE | NONE | NONE |
144 FACOSPI | acospi | NONE | NONE | NONE |
145 FATANPI | atanpi | NONE | NONE | NONE |
146 FSINH | sinh | NONE | NONE | NONE |
147 FCOSH | cosh | NONE | NONE | NONE |
148 FTANH | tanh | NONE | NONE | NONE |
149 FASINH | asinh | NONE | NONE | NONE |
150 FACOSH | acosh | NONE | NONE | NONE |
151 FATANH | atanh | NONE | NONE | NONE |
152 FRSQRT | rsqrt | half\_rsqrt | native\_rsqrt | NONE |
153 FCBRT | cbrt | NONE | NONE | NONE |
154 FEXP2 | exp2 | half\_exp2 | native\_exp2 | NONE |
155 FLOG2 | log2 | half\_log2 | native\_log2 | NONE |
156 FEXPM1 | expm1 | NONE | NONE | NONE |
157 FLOG1P | log1p | NONE | NONE | NONE |
158 FEXP | exp | half\_exp | native\_exp | NONE |
159 FLOG | log | half\_log | native\_log | NONE |
160 FEXP10 | exp10 | half\_exp10 | native\_exp10 | NONE |
161 FLOG10 | log10 | half\_log10 | native\_log10 | NONE |
162 FATAN2 | atan2 | NONE | NONE | NONE |
163 FATAN2PI | atan2pi | NONE | NONE | NONE |
164 FPOW | pow | NONE | NONE | NONE |
165 FROOT | rootn | NONE | NONE | NONE |
166 FHYPOT | hypot | NONE | NONE | NONE |
167 FRECIP | NONE | half\_recip | native\_recip | NONE |
170 Note (1) FSINCOS is macro-op fused (see below).
172 # List of 2-arg opcodes
175 opcode | Description | pseudo-code | Extension |
176 FATAN2 | atan2 arc tangent | rd = atan2(rs2, rs1) | Zarctrignpi |
177 FATAN2PI | atan2 arc tangent / pi | rd = atan2(rs2, rs1) / pi | Zarctrigpi |
178 FPOW | x power of y | rd = pow(rs1, rs2) | ZftransAdv |
179 FROOT | x power 1/y | rd = pow(rs1, 1/rs2) | ZftransAdv |
180 FHYPOT | hypotenuse | rd = sqrt(rs1^2 + rs2^2) | Zftrans |
183 # List of 1-arg transcendental opcodes
186 opcode | Description | pseudo-code | Extension |
187 FRSQRT | Reciprocal Square-root | rd = sqrt(rs1) | Zfrsqrt |
188 FCBRT | Cube Root | rd = pow(rs1, 1.0 / 3) | Zftrans |
189 FRECIP | Reciprocal | rd = 1.0 / rs1 | Zftrans |
190 FEXP2 | power-of-2 | rd = pow(2, rs1) | Zftrans |
191 FLOG2 | log2 | rd = log(2. rs1) | Zftrans |
192 FEXPM1 | exponential minus 1 | rd = pow(e, rs1) - 1.0 | Zftrans |
193 FLOG1P | log plus 1 | rd = log(e, 1 + rs1) | Zftrans |
194 FEXP | exponential | rd = pow(e, rs1) | ZftransExt |
195 FLOG | natural log (base e) | rd = log(e, rs1) | ZftransExt |
196 FEXP10 | power-of-10 | rd = pow(10, rs1) | ZftransExt |
197 FLOG10 | log base 10 | rd = log(10, rs1) | ZftransExt |
200 # List of 1-arg trigonometric opcodes
203 opcode | Description | pseudo-code | Extension |
204 FSIN | sin (radians) | rd = sin(rs1) | Ztrignpi |
205 FCOS | cos (radians) | rd = cos(rs1) | Ztrignpi |
206 FTAN | tan (radians) | rd = tan(rs1) | Ztrignpi |
207 FASIN | arcsin (radians) | rd = asin(rs1) | Zarctrignpi |
208 FACOS | arccos (radians) | rd = acos(rs1) | Zarctrignpi |
209 FATAN (1) | arctan (radians) | rd = atan(rs1) | Zarctrignpi |
210 FSINPI | sin times pi | rd = sin(pi * rs1) | Ztrigpi |
211 FCOSPI | cos times pi | rd = cos(pi * rs1) | Ztrigpi |
212 FTANPI | tan times pi | rd = tan(pi * rs1) | Ztrigpi |
213 FASINPI | arcsin / pi | rd = asin(rs1) / pi | Zarctrigpi |
214 FACOSPI | arccos / pi | rd = acos(rs1) / pi | Zarctrigpi |
215 FATANPI (1) | arctan / pi | rd = atan(rs1) / pi | Zarctrigpi |
216 FSINH | hyperbolic sin (radians) | rd = sinh(rs1) | Zfhyp |
217 FCOSH | hyperbolic cos (radians) | rd = cosh(rs1) | Zfhyp |
218 FTANH | hyperbolic tan (radians) | rd = tanh(rs1) | Zfhyp |
219 FASINH | inverse hyperbolic sin | rd = asinh(rs1) | Zfhyp |
220 FACOSH | inverse hyperbolic cos | rd = acosh(rs1) | Zfhyp |
221 FATANH | inverse hyperbolic tan | rd = atanh(rs1) | Zfhyp |
224 Note (1): FATAN/FATANPI is a pseudo-op expanding to FATAN2/FATAN2PI (needs deciding)
226 # Synthesis, Pseudo-code ops and macro-ops
228 The pseudo-ops are best left up to the compiler rather than being actual
229 pseudo-ops, by allocating one scalar FP register for use as a constant
230 (loop invariant) set to "1.0" at the beginning of a function or other
233 * FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
234 * FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).
236 FATANPI example pseudo-code:
238 lui t0, 0x3F800 // upper bits of f32 1.0
240 fatan2pi.s rd, rs1, ft0
242 Hyperbolic function example (obviates need for Zfhyp except for
243 high-performance or correctly-rounding):
245 ASINH( x ) = ln( x + SQRT(x**2+1))
249 Used to be an alias. Some imolementors may wish to implement divide as y times recip(x)
251 # To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?
253 RISC principle says "exclude LOG because it's covered by LOGP1 plus an ADD".
254 Research needed to ensure that implementors are not compromised by such
256 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html>
258 > > correctly-rounded LOG will return different results than LOGP1 and ADD.
259 > > Likewise for EXP and EXPM1
261 > ok, they stay in as real opcodes, then.
263 # ATAN / ATAN2 commentary
265 Discussion starts here:
266 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html>
270 would like to point out that the general implementations of ATAN2 do a
271 bunch of special case checks and then simply call ATAN.
273 double ATAN2( double y, double x )
274 { // IEEE 754-2008 quality ATAN2
277 if( ISNAN( x ) ) return x;
278 if( ISNAN( y ) ) return y;
280 // deal with infinities
281 if( x == +∞ && |y|== +∞ ) return copysign( π/4, y );
282 if( x == +∞ ) return copysign( 0.0, y );
283 if( x == -∞ && |y|== +∞ ) return copysign( 3π/4, y );
284 if( x == -∞ ) return copysign( π, y );
285 if( |y|== +∞ ) return copysign( π/2, y );
287 // deal with signed zeros
288 if( x == 0.0 && y != 0.0 ) return copysign( π/2, y );
289 if( x >=+0.0 && y == 0.0 ) return copysign( 0.0, y );
290 if( x <=-0.0 && y == 0.0 ) return copysign( π, y );
292 // calculate ATAN2 textbook style
293 if( x > 0.0 ) return ATAN( |y / x| );
294 if( x < 0.0 ) return π - ATAN( |y / x| );
298 Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent
299 a constant and then call/use ATAN2.
301 When one considers an implementation of ATAN, one must consider several
302 ranges of evaluation::
304 x [ -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
305 x (-1.0, +1.0]:: ATAN( x ) = + ATAN( x );
306 x [ 1.0, +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
308 I should point out that the add/sub of π/2 can not lose significance
309 since the result of ATAN(1/x) is bounded 0..π/2
311 The bottom line is that I think you are choosing to make too many of
312 these into OpCodes, making the hardware function/calculation unit (and
313 sequencer) more complicated that necessary.
315 --------------------------------------------------------
317 I might suggest that if there were a way for a calculation to be performed
318 and the result of that calculation
320 chained to a subsequent calculation such that the precision of the
321 result-becomes-operand is wider than
323 what will fit in a register, then you can dramatically reduce the count
324 of instructions in this category while retaining
330 can be calculated as::
334 Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008
335 accurate, but GPUs want speed and
337 1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It
338 is also not "that inaccurate" displaying 0.625-to-0.52 ULP.
340 Given that one has the ability to carry (and process) more fraction bits,
341 one can then do high precision multiplies of π or other transcendental
344 And GPUs have been doing this almost since the dawn of 3D.
346 // calculate ATAN2 high performance style
347 // Note: at this point x != y
351 if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
352 if( y < 0.0 && |y| > |x| ) return + ATAN( y / x );
353 if( y > 0.0 && |y| < |x| ) return + ATAN( y / x );
354 if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
358 if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
359 if( y < 0.0 && |y| > |x| ) return + π - ATAN( y / x );
360 if( y > 0.0 && |y| < |x| ) return + π - ATAN( y / x );
361 if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
364 This way the adds and subtracts from the constant are not in a precision