ztrans_proposal.mdwn

   1 # Zftrans - transcendental operations
   2
   3 See:
   4
   5 * <http://bugs.libre-riscv.org/show_bug.cgi?id=127>
   6 * <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
   7 * Discussion: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002342.html>
   8 * [[rv_major_opcode_1010011]] for opcode listing.
   9 * [[zfpacc_proposal]] for accuracy settings proposal
  10
  11 Extension subsets:
  12
  13 * **Zftrans**: standard transcendentals (best suited to 3D)
  14 * **ZftransExt**: extra functions (useful, not generally needed for 3D,
  15   can be synthesised using Ztrans)
  16 * **Ztrigpi**: trig. xxx-pi sinpi cospi tanpi
  17 * **Ztrignpi**: trig non-xxx-pi sin cos tan
  18 * **Zarctrigpi**: arc-trig. a-xxx-pi: atan2pi asinpi acospi
  19 * **Zarctrignpi**: arc-trig. non-a-xxx-pi: atan2, asin, acos
  20 * **Zfhyp**: hyperbolic/inverse-hyperbolic.  sinh, cosh, tanh, asinh,
  21   acosh, atanh (can be synthesised - see below)
  22 * **ZftransAdv**: much more complex to implement in hardware
  23 * **Zfrsqrt**: Reciprocal square-root.
  24
  25 Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi,
  26 Zarctrignpi
  27
  28 [[!toc levels=2]]
  29
  30 # TODO:
  31
  32 * Decision on accuracy, moved to [[zfpacc_proposal]]
  33 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html>
  34 * Errors **MUST** be repeatable.
  35 * How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded?
  36 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html>
  37   Accuracy requirements for dual (triple) purpose implementations must
  38   meet the higher standard.
  39 * Reciprocal Square-root is in its own separate extension (Zfrsqrt) as
  40   it is desirable on its own by other implementors.  This to be evaluated.
  41
  42 # Requirements <a name="requirements"></a>
  43
  44 This proposal is designed to meet a wide range of extremely diverse needs,
  45 allowing implementors from all of them to benefit from the tools and hardware
  46 cost reductions associated with common standards adoption.
  47
  48 **There are *four* different, disparate platform's needs (two new)**:
  49
  50 * 3D Embedded Platform
  51 * Embedded Platform
  52 * 3D UNIX Platform
  53 * UNIX Platform
  54
  55 3D Embedded will require significantly less accuracy and will need to make
  56 power budget and die area compromises that other platforms (including Embedded)
  57 will not need to make.
  58
  59 3D UNIX Platform has to be performance-price-competitive: subtly-reduced
  60 accuracy in FP32 is acceptable where, conversely, in the UNIX Platform,
  61 IEEE754 compliance is a hard requirement that would compromise power
  62 and efficiency on a 3D UNIX Platform.
  63
  64 **The use-cases are**:
  65
  66 * 3D GPUs
  67 * Numerical Computation
  68 * (Potentially) A.I. / Machine-learning (1)
  69
  70 (1) although approximations suffice in this field, making it more likely
  71 to use a custom extension.  High-end ML would inherently definitely
  72 be excluded.
  73
  74 **The power and die-area requirements vary from**:
  75
  76 * Ultra-low-power (smartwatches where GPU power budgets are in milliwatts)
  77 * Mobile-Embedded (good performance with high efficiency for battery life)
  78 * Desktop Computing
  79 * Server / HPC (2)
  80
  81 (2) Supercomputing is left out of the requirements as it is traditionally
  82 covered by Supercomputer Vectorisation Standards (such as RVV).
  83
  84 **The software requirements are**:
  85
  86 * Full public integration into GNU math libraries (libm)
  87 * Full public integration into well-known Numerical Computation systems (numpy)
  88 * Full public integration into upstream GNU and LLVM Compiler toolchains
  89 * Full public integration into Khronos OpenCL SPIR-V compatible Compilers
  90   seeking public Certification and Endorsement from the Khronos Group
  91   under their Trademarked Certification Programme.
  92
  93 **The "contra"-requirements are**:
  94
  95 * The requirements are **not** for the purposes of developing a full custom
  96   proprietary GPU with proprietary firmware.
  97 * A full custom proprietary GPU ASIC Manufacturer *may* benefit from
  98   this proposal however the fact that they typically develop proprietary
  99   software that is not shared with the rest of the community likely to
 100   use this proposal means that they have completely different needs.
 101
 102 This proposal is for *sharing* of effort in reducing development costs,
 103 and as such needs to be entirely public, transparent, and fully integrated
 104 into public upstream libre-licensed libraries and toolchains.
 105
 106 *Overall this proposal is categorically and wholly unsuited to
 107 relegation of "custom" status*.
 108
 109 # Proposed Opcodes vs Khronos OpenCL Opcodes <a name="khronos_equiv"></a>
 110
 111 This list shows the (direct) equivalence between proposed opcodes and
 112 their Khronos OpenCL equivalents.
 113
 114 See
 115 <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
 116
 117 Special FP16 opcodes are *not* being proposed, except by indirect / inherent
 118 use of the "fmt" field that is already present in the RISC-V Specification.
 119
 120 "Native" opcodes are *not* being proposed: implementors will be expected
 121 to use the (equivalent) proposed opcode covering the same function.
 122
 123 "Fast" opcodes are *not* being proposed, because the Khronos Specification
 124 fast\_length, fast\_normalise and fast\_distance OpenCL opcodes require
 125 vectors (or can be done as scalar operations using other RISC-V instructions).
 126
 127 The OpenCL FP32 opcodes are **direct** equivalents to the proposed opcodes.
 128 Deviation from conformance with the Khronos Specification - including the
 129 Khronos Specification accuracy requirements - is not an option.
 130
 131 [[!table data="""
 132 Proposed opcode | OpenCL FP32 | OpenCL FP16 | OpenCL native | OpenCL fast |
 133 FSIN            | sin         | half\_sin   | native\_sin   | NONE        |
 134 FCOS            | cos         | half\_cos   | native\_cos   | NONE        |
 135 FTAN            | tan         | half\_tan   | native\_tan   | NONE        |
 136 NONE (1)        | sincos      | NONE        | NONE          | NONE        |
 137 FASIN           | asin        | NONE        | NONE          | NONE        |
 138 FACOS           | acos        | NONE        | NONE          | NONE        |
 139 FATAN           | atan        | NONE        | NONE          | NONE        |
 140 FSINPI          | sinpi       | NONE        | NONE          | NONE        |
 141 FCOSPI          | cospi       | NONE        | NONE          | NONE        |
 142 FTANPI          | tanpi       | NONE        | NONE          | NONE        |
 143 FASINPI         | asinpi      | NONE        | NONE          | NONE        |
 144 FACOSPI         | acospi      | NONE        | NONE          | NONE        |
 145 FATANPI         | atanpi      | NONE        | NONE          | NONE        |
 146 FSINH           | sinh        | NONE        | NONE          | NONE        |
 147 FCOSH           | cosh        | NONE        | NONE          | NONE        |
 148 FTANH           | tanh        | NONE        | NONE          | NONE        |
 149 FASINH          | asinh       | NONE        | NONE          | NONE        |
 150 FACOSH          | acosh       | NONE        | NONE          | NONE        |
 151 FATANH          | atanh       | NONE        | NONE          | NONE        |
 152 FRSQRT          | rsqrt       | half\_rsqrt | native\_rsqrt | NONE        |
 153 FCBRT           | cbrt        | NONE        | NONE          | NONE        |
 154 FEXP2           | exp2        | half\_exp2  | native\_exp2  | NONE        |
 155 FLOG2           | log2        | half\_log2  | native\_log2  | NONE        |
 156 FEXPM1          | expm1       | NONE        | NONE          | NONE        |
 157 FLOG1P          | log1p       | NONE        | NONE          | NONE        |
 158 FEXP            | exp         | half\_exp   | native\_exp   | NONE        |
 159 FLOG            | log         | half\_log   | native\_log   | NONE        |
 160 FEXP10          | exp10       | half\_exp10 | native\_exp10 | NONE        |
 161 FLOG10          | log10       | half\_log10 | native\_log10 | NONE        |
 162 FATAN2          | atan2       | NONE        | NONE          | NONE        |
 163 FATAN2PI        | atan2pi     | NONE        | NONE          | NONE        |
 164 FPOW            | pow         | NONE        | NONE          | NONE        |
 165 FROOT           | rootn       | NONE        | NONE          | NONE        |
 166 FHYPOT          | hypot       | NONE        | NONE          | NONE        |
 167 FRECIP          | NONE        | half\_recip | native\_recip | NONE        |
 168 """]]
 169
 170 Note (1) FSINCOS is macro-op fused (see below).
 171
 172 # List of 2-arg opcodes
 173
 174 [[!table  data="""
 175 opcode    | Description            | pseudo-code                | Extension   |
 176 FATAN2    | atan2 arc tangent      | rd = atan2(rs2, rs1)       | Zarctrignpi |
 177 FATAN2PI  | atan2 arc tangent / pi | rd = atan2(rs2, rs1) / pi  | Zarctrigpi  |
 178 FPOW      | x power of y           | rd = pow(rs1, rs2)         | ZftransAdv  |
 179 FROOT     | x power 1/y            | rd = pow(rs1, 1/rs2)       | ZftransAdv  |
 180 FHYPOT    | hypotenuse             | rd = sqrt(rs1^2 + rs2^2)   | Zftrans     |
 181 """]]
 182
 183 # List of 1-arg transcendental opcodes
 184
 185 [[!table  data="""
 186 opcode   | Description              | pseudo-code             | Extension  |
 187 FRSQRT   | Reciprocal Square-root   | rd = sqrt(rs1)          | Zfrsqrt    |
 188 FCBRT    | Cube Root                | rd = pow(rs1, 1.0 / 3)  | Zftrans    |
 189 FRECIP   | Reciprocal               | rd = 1.0 / rs1          | Zftrans    |
 190 FEXP2    | power-of-2               | rd = pow(2, rs1)        | Zftrans    |
 191 FLOG2    | log2                     | rd = log(2. rs1)        | Zftrans    |
 192 FEXPM1   | exponential minus 1      | rd = pow(e, rs1) - 1.0  | Zftrans    |
 193 FLOG1P   | log plus 1               | rd = log(e, 1 + rs1)    | Zftrans    |
 194 FEXP     | exponential              | rd = pow(e, rs1)        | ZftransExt |
 195 FLOG     | natural log (base e)     | rd = log(e, rs1)        | ZftransExt |
 196 FEXP10   | power-of-10              | rd = pow(10, rs1)       | ZftransExt |
 197 FLOG10   | log base 10              | rd = log(10, rs1)       | ZftransExt |
 198 """]]
 199
 200 # List of 1-arg trigonometric opcodes
 201
 202 [[!table  data="""
 203 opcode      | Description              | pseudo-code             | Extension |
 204 FSIN        | sin (radians)            | rd = sin(rs1)           | Ztrignpi    |
 205 FCOS        | cos (radians)            | rd = cos(rs1)           | Ztrignpi    |
 206 FTAN        | tan (radians)            | rd = tan(rs1)           | Ztrignpi    |
 207 FASIN       | arcsin (radians)         | rd = asin(rs1)          | Zarctrignpi |
 208 FACOS       | arccos (radians)         | rd = acos(rs1)          | Zarctrignpi |
 209 FATAN (1)   | arctan (radians)         | rd = atan(rs1)          | Zarctrignpi |
 210 FSINPI      | sin times pi             | rd = sin(pi * rs1)      | Ztrigpi |
 211 FCOSPI      | cos times pi             | rd = cos(pi * rs1)      | Ztrigpi |
 212 FTANPI      | tan times pi             | rd = tan(pi * rs1)      | Ztrigpi |
 213 FASINPI     | arcsin / pi              | rd = asin(rs1) / pi     | Zarctrigpi |
 214 FACOSPI     | arccos / pi              | rd = acos(rs1) / pi     | Zarctrigpi |
 215 FATANPI (1) | arctan / pi              | rd = atan(rs1) / pi     | Zarctrigpi |
 216 FSINH       | hyperbolic sin (radians) | rd = sinh(rs1)          | Zfhyp |
 217 FCOSH       | hyperbolic cos (radians) | rd = cosh(rs1)          | Zfhyp |
 218 FTANH       | hyperbolic tan (radians) | rd = tanh(rs1)          | Zfhyp |
 219 FASINH      | inverse hyperbolic sin   | rd = asinh(rs1)         | Zfhyp |
 220 FACOSH      | inverse hyperbolic cos   | rd = acosh(rs1)         | Zfhyp |
 221 FATANH      | inverse hyperbolic tan   | rd = atanh(rs1)         | Zfhyp |
 222 """]]
 223
 224 Note (1): FATAN/FATANPI is a pseudo-op expanding to FATAN2/FATAN2PI (needs deciding)
 225
 226 # Synthesis, Pseudo-code ops and macro-ops
 227
 228 The pseudo-ops are best left up to the compiler rather than being actual
 229 pseudo-ops, by allocating one scalar FP register for use as a constant
 230 (loop invariant) set to "1.0" at the beginning of a function or other
 231 suitable code block.
 232
 233 * FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
 234 * FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).
 235
 236 FATANPI example pseudo-code:
 237
 238     lui t0, 0x3F800 // upper bits of f32 1.0
 239     fmv.x.s ft0, t0
 240     fatan2pi.s rd, rs1, ft0
 241
 242 Hyperbolic function example (obviates need for Zfhyp except for
 243 high-performance or correctly-rounding):
 244
 245     ASINH( x ) = ln( x + SQRT(x**2+1))
 246
 247 # Reciprocal
 248
 249 Used to be an alias. Some imolementors may wish to implement divide as y times recip(x)
 250
 251 # To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?
 252
 253 RISC principle says "exclude LOG because it's covered by LOGP1 plus an ADD".
 254 Research needed to ensure that implementors are not compromised by such
 255 a decision
 256 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html>
 257
 258 > > correctly-rounded LOG will return different results than LOGP1 and ADD.
 259 > > Likewise for EXP and EXPM1
 260
 261 > ok, they stay in as real opcodes, then.
 262
 263 # ATAN / ATAN2 commentary
 264
 265 Discussion starts here:
 266 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html>
 267
 268 from Mitch Alsup:
 269
 270 would like to point out that the general implementations of ATAN2 do a
 271 bunch of special case checks and then simply call ATAN.
 272
 273     double ATAN2( double y, double x )
 274     {   // IEEE 754-2008 quality ATAN2
 275
 276         // deal with NANs
 277         if( ISNAN( x )             ) return x;
 278         if( ISNAN( y )             ) return y;
 279
 280         // deal with infinities
 281         if( x == +∞    && |y|== +∞  ) return copysign(  π/4, y );
 282         if( x == +∞                 ) return copysign(  0.0, y );
 283         if( x == -∞    && |y|== +∞  ) return copysign( 3π/4, y );
 284         if( x == -∞                 ) return copysign(    π, y );
 285         if(               |y|== +∞  ) return copysign(  π/2, y );
 286
 287         // deal with signed zeros
 288         if( x == 0.0  &&  y != 0.0 ) return copysign(  π/2, y );
 289         if( x >=+0.0  &&  y == 0.0 ) return copysign(  0.0, y );
 290         if( x <=-0.0  &&  y == 0.0 ) return copysign(    π, y );
 291
 292         // calculate ATAN2 textbook style
 293         if( x  > 0.0               ) return     ATAN( |y / x| );
 294         if( x  < 0.0               ) return π - ATAN( |y / x| );
 295     }
 296
 297
 298 Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent
 299 a constant and then call/use ATAN2.
 300
 301 When one considers an implementation of ATAN, one must consider several
 302 ranges of evaluation::
 303
 304      x  [  -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
 305      x  (-1.0, +1.0]:: ATAN( x ) =      + ATAN(   x );
 306      x  [ 1.0,   +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
 307
 308 I should point out that the add/sub of π/2 can not lose significance
 309 since the result of ATAN(1/x) is bounded 0..π/2
 310
 311 The bottom line is that I think you are choosing to make too many of
 312 these into OpCodes, making the hardware function/calculation unit (and
 313 sequencer) more complicated that necessary.
 314
 315 --------------------------------------------------------
 316
 317 I might suggest that if there were a way for a calculation to be performed
 318 and the result of that calculation
 319
 320 chained to a subsequent calculation such that the precision of the
 321 result-becomes-operand is wider than
 322
 323 what will fit in a register, then you can dramatically reduce the count
 324 of instructions in this category while retaining
 325
 326 acceptable accuracy:
 327
 328      z = x / y
 329
 330 can be calculated as::
 331
 332      z = x * (1/y)
 333
 334 Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008
 335 accurate, but GPUs want speed and
 336
 337 1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It
 338 is also not "that inaccurate" displaying 0.625-to-0.52 ULP.
 339
 340 Given that one has the ability to carry (and process) more fraction bits,
 341 one can then do high precision multiplies of  π or other transcendental
 342 radixes.
 343
 344 And GPUs have been doing this almost since the dawn of 3D.
 345
 346     // calculate ATAN2 high performance style
 347     // Note: at this point x != y
 348     //
 349     if( x  > 0.0             )
 350     {
 351         if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
 352         if( y < 0.0 && |y| > |x| ) return       + ATAN( y / x );
 353         if( y > 0.0 && |y| < |x| ) return       + ATAN( y / x );
 354         if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
 355     }
 356     if( x  < 0.0             )
 357     {
 358         if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
 359         if( y < 0.0 && |y| > |x| ) return + π   - ATAN( y / x );
 360         if( y > 0.0 && |y| < |x| ) return + π   - ATAN( y / x );
 361         if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
 362     }
 363
 364 This way the adds and subtracts from the constant are not in a precision
 365 precarious position.