ztrans_proposal.mdwn

   1 # Zftrans - transcendental operations
   2
   3 See:
   4
   5 * <http://bugs.libre-riscv.org/show_bug.cgi?id=127>
   6 * <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
   7 * Discussion: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002342.html>
   8 * [[rv_major_opcode_1010011]] for opcode listing.
   9 * [[zfpacc_proposal]] for accuracy settings proposal
  10
  11 Extension subsets:
  12
  13 * **Zftrans**: standard transcendentals (best suited to 3D)
  14 * **ZftransExt**: extra functions (useful, not generally needed for 3D,
  15   can be synthesised using Ztrans)
  16 * **Ztrigpi**: trig. xxx-pi sinpi cospi tanpi
  17 * **Ztrignpi**: trig non-xxx-pi sin cos tan
  18 * **Zarctrigpi**: arc-trig. a-xxx-pi: atan2pi asinpi acospi
  19 * **Zarctrignpi**: arc-trig. non-a-xxx-pi: atan2, asin, acos
  20 * **Zfhyp**: hyperbolic/inverse-hyperbolic.  sinh, cosh, tanh, asinh,
  21   acosh, atanh (can be synthesised - see below)
  22 * **ZftransAdv**: much more complex to implement in hardware
  23 * **Zfrsqrt**: Reciprocal square-root.
  24
  25 Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi,
  26 Zarctrignpi
  27
  28 [[!toc levels=2]]
  29
  30 # TODO:
  31
  32 * Decision on accuracy, moved to [[zfpacc_proposal]]
  33 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html>
  34 * Errors **MUST** be repeatable.
  35 * How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded?
  36 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html>
  37   Accuracy requirements for dual (triple) purpose implementations must
  38   meet the higher standard.
  39 * Reciprocal Square-root is in its own separate extension (Zfrsqrt) as
  40   it is desirable on its own by other implementors.  This to be evaluated.
  41
  42 # Proposed Opcodes vs Khronos OpenCL Opcodes <a name="khronos_equiv"></a>
  43
  44 This list shows the (direct) equivalence between proposed opcodes and
  45 their Khronos OpenCL equivalents.
  46
  47 See
  48 <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
  49
  50 Special FP16 opcodes are *not* being proposed, except by indirect / inherent
  51 use of the "fmt" field that is already present in the RISC-V Specification.
  52
  53 "Native" opcodes are *not* being proposed: implementors will be expected
  54 to use the (equivalent) proposed opcode covering the same function.
  55
  56 "Fast" opcodes are *not* being proposed, because the Khronos Specification
  57 fast\_length, fast\_normalise and fast\_distance OpenCL opcodes require
  58 vectors (or can be done as scalar operations using other RISC-V instructions).
  59
  60 The OpenCL FP32 opcodes are **direct** equivalents to the proposed opcodes.
  61 Deviation from conformance with the Khronos Specification - including the
  62 Khronos Specification accuracy requirements - is not an option.
  63
  64 [[!table data="""
  65 Proposed opcode | OpenCL FP32 | OpenCL FP16 | OpenCL native | OpenCL fast |
  66 FSIN            | sin         | half\_sin   | native\_sin   | NONE        |
  67 FCOS            | cos         | half\_cos   | native\_cos   | NONE        |
  68 FTAN            | tan         | half\_tan   | native\_tan   | NONE        |
  69 NONE (1)        | sincos      | NONE        | NONE          | NONE        |
  70 FASIN           | asin        | NONE        | NONE          | NONE        |
  71 FACOS           | acos        | NONE        | NONE          | NONE        |
  72 NONE (3)        | atan        | NONE        | NONE          | NONE        |
  73 FSINPI          | sinpi       | NONE        | NONE          | NONE        |
  74 FCOSPI          | cospi       | NONE        | NONE          | NONE        |
  75 FTANPI          | tanpi       | NONE        | NONE          | NONE        |
  76 FASINPI         | asinpi      | NONE        | NONE          | NONE        |
  77 FACOSPI         | acospi      | NONE        | NONE          | NONE        |
  78 NONE (2)        | atanpi      | NONE        | NONE          | NONE        |
  79 FSINH           | sinh        | NONE        | NONE          | NONE        |
  80 FCOSH           | cosh        | NONE        | NONE          | NONE        |
  81 FTANH           | tanh        | NONE        | NONE          | NONE        |
  82 FASINH          | asinh       | NONE        | NONE          | NONE        |
  83 FACOSH          | acosh       | NONE        | NONE          | NONE        |
  84 FATANH          | atanh       | NONE        | NONE          | NONE        |
  85 FRSQRT          | rsqrt       | half\_rsqrt | native\_rsqrt | NONE        |
  86 FCBRT           | cbrt        | NONE        | NONE          | NONE        |
  87 FEXP2           | exp2        | half\_exp2  | native\_exp2  | NONE        |
  88 FLOG2           | log2        | half\_log2  | native\_log2  | NONE        |
  89 FEXPM1          | expm1       | NONE        | NONE          | NONE        |
  90 FLOG1P          | log1p       | NONE        | NONE          | NONE        |
  91 FEXP            | exp         | half\_exp   | native\_exp   | NONE        |
  92 FLOG            | log         | half\_log   | native\_log   | NONE        |
  93 FEXP10          | exp10       | half\_exp10 | native\_exp10 | NONE        |
  94 FLOG10          | log10       | half\_log10 | native\_log10 | NONE        |
  95 FATAN2          | atan2       | NONE        | NONE          | NONE        |
  96 FATAN2PI        | atan2pi     | NONE        | NONE          | NONE        |
  97 FPOW            | pow         | NONE        | NONE          | NONE        |
  98 FROOT           | rootn       | NONE        | NONE          | NONE        |
  99 FHYPOT          | hypot       | NONE        | NONE          | NONE        |
 100 NONE (4)        | NONE        | half\_recip | native\_recip | NONE        |
 101 """]]
 102
 103 Note (1) FSINCOS is macro-op fused (see below).
 104
 105 Note (2) FATANPI is a synthesised alias, below.
 106
 107 Note (3) FATAN2 is a sythesised alias, below.
 108
 109 Note (4) FCRECIP is a sythesised alias, below.
 110
 111 # List of 2-arg opcodes
 112
 113 [[!table  data="""
 114 opcode    | Description            | pseudo-code                | Extension   |
 115 FATAN2    | atan2 arc tangent      | rd = atan2(rs2, rs1)       | Zarctrignpi |
 116 FATAN2PI  | atan2 arc tangent / pi | rd = atan2(rs2, rs1) / pi  | Zarctrigpi  |
 117 FPOW      | x power of y           | rd = pow(rs1, rs2)         | ZftransAdv  |
 118 FROOT     | x power 1/y            | rd = pow(rs1, 1/rs2)       | ZftransAdv  |
 119 FHYPOT    | hypotenuse             | rd = sqrt(rs1^2 + rs2^2)   | Zftrans     |
 120 """]]
 121
 122 # List of 1-arg transcendental opcodes
 123
 124 [[!table  data="""
 125 opcode   | Description              | pseudo-code             | Extension  |
 126 FRSQRT   | Reciprocal Square-root   | rd = sqrt(rs1)          | Zfrsqrt    |
 127 FCBRT    | Cube Root                | rd = pow(rs1, 3)        | Zftrans    |
 128 FEXP2    | power-of-2               | rd = pow(2, rs1)        | Zftrans    |
 129 FLOG2    | log2                     | rd = log2(rs1)          | Zftrans    |
 130 FEXPM1   | exponential minus 1      | rd = pow(e, rs1) - 1.0  | Zftrans    |
 131 FLOG1P   | log plus 1               | rd = log(e, 1 + rs1)    | Zftrans    |
 132 FEXP     | exponential              | rd = pow(e, rs1)        | ZftransExt |
 133 FLOG     | natural log (base e)     | rd = log(e, rs1)        | ZftransExt |
 134 FEXP10   | power-of-10              | rd = pow(10, rs1)       | ZftransExt |
 135 FLOG10   | log base 10              | rd = log10(rs1)         | ZftransExt |
 136 """]]
 137
 138 # List of 1-arg trigonometric opcodes
 139
 140 [[!table  data="""
 141 opcode   | Description              | pseudo-code             | Extension |
 142 FSIN     | sin (radians)            | rd = sin(rs1)           | Ztrignpi    |
 143 FCOS     | cos (radians)            | rd = cos(rs1)           | Ztrignpi    |
 144 FTAN     | tan (radians)            | rd = tan(rs1)           | Ztrignpi    |
 145 FASIN    | arcsin (radians)         | rd = asin(rs1)          | Zarctrignpi |
 146 FACOS    | arccos (radians)         | rd = acos(rs1)          | Zarctrignpi |
 147 FSINPI   | sin times pi             | rd = sin(pi * rs1)      | Ztrigpi |
 148 FCOSPI   | cos times pi             | rd = cos(pi * rs1)      | Ztrigpi |
 149 FTANPI   | tan times pi             | rd = tan(pi * rs1)      | Ztrigpi |
 150 FASINPI  | arcsin / pi              | rd = asin(rs1) / pi     | Zarctrigpi |
 151 FACOSPI  | arccos / pi              | rd = acos(rs1) / pi     | Zarctrigpi |
 152 FATANPI  | arctan / pi              | rd = atan(rs1) / pi     | Zarctrigpi |
 153 FSINH    | hyperbolic sin (radians) | rd = sinh(rs1)          | Zfhyp |
 154 FCOSH    | hyperbolic cos (radians) | rd = cosh(rs1)          | Zfhyp |
 155 FTANH    | hyperbolic tan (radians) | rd = tanh(rs1)          | Zfhyp |
 156 FASINH   | inverse hyperbolic sin   | rd = asinh(rs1)         | Zfhyp |
 157 FACOSH   | inverse hyperbolic cos   | rd = acosh(rs1)         | Zfhyp |
 158 FATANH   | inverse hyperbolic tan   | rd = atanh(rs1)         | Zfhyp |
 159 """]]
 160
 161 # Synthesis, Pseudo-code ops and macro-ops
 162
 163 The pseudo-ops are best left up to the compiler rather than being actual
 164 pseudo-ops, by allocating one scalar FP register for use as a constant
 165 (loop invariant) set to "1.0" at the beginning of a function or other
 166 suitable code block.
 167
 168 * FRCP rd, rs1 - pseudo-code alias for rd = 1.0 / rs1
 169 * FATAN - pseudo-code alias for rd = atan2(rs1, 1.0) - FATAN2
 170 * FATANPI - pseudo alias for rd = atan2pi(rs1, 1.0) - FATAN2PI
 171 * FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
 172 * FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).
 173
 174 FATANPI example pseudo-code:
 175
 176     lui t0, 0x3F800 // upper bits of f32 1.0
 177     fmv.x.s ft0, t0
 178     fatan2pi.s rd, rs1, ft0
 179
 180 Hyperbolic function example (obviates need for Zfhyp except for high-performance or correctly-rounding):
 181
 182     ASINH( x ) = ln( x + SQRT(x**2+1))
 183
 184 # To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?
 185
 186 RISC principle says "exclude LOG because it's covered by LOGP1 plus an ADD".
 187 Research needed to ensure that implementors are not compromised by such
 188 a decision
 189 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html>
 190
 191 > > correctly-rounded LOG will return different results than LOGP1 and ADD.
 192 > > Likewise for EXP and EXPM1
 193
 194 > ok, they stay in as real opcodes, then.
 195
 196 # ATAN / ATAN2 commentary
 197
 198 Discussion starts here:
 199 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html>
 200
 201 from Mitch Alsup:
 202
 203 would like to point out that the general implementations of ATAN2 do a
 204 bunch of special case checks and then simply call ATAN.
 205
 206     double ATAN2( double y, double x )
 207     {   // IEEE 754-2008 quality ATAN2
 208
 209         // deal with NANs
 210         if( ISNAN( x )             ) return x;
 211         if( ISNAN( y )             ) return y;
 212
 213         // deal with infinities
 214         if( x == +∞    && |y|== +∞  ) return copysign(  π/4, y );
 215         if( x == +∞                 ) return copysign(  0.0, y );
 216         if( x == -∞    && |y|== +∞  ) return copysign( 3π/4, y );
 217         if( x == -∞                 ) return copysign(    π, y );
 218         if(               |y|== +∞  ) return copysign(  π/2, y );
 219
 220         // deal with signed zeros
 221         if( x == 0.0  &&  y != 0.0 ) return copysign(  π/2, y );
 222         if( x >=+0.0  &&  y == 0.0 ) return copysign(  0.0, y );
 223         if( x <=-0.0  &&  y == 0.0 ) return copysign(    π, y );
 224
 225         // calculate ATAN2 textbook style
 226         if( x  > 0.0               ) return     ATAN( |y / x| );
 227         if( x  < 0.0               ) return π - ATAN( |y / x| );
 228     }
 229
 230
 231 Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent
 232 a constant and then call/use ATAN2.
 233
 234 When one considers an implementation of ATAN, one must consider several
 235 ranges of evaluation::
 236
 237      x  [  -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
 238      x  (-1.0, +1.0]:: ATAN( x ) =      + ATAN(   x );
 239      x  [ 1.0,   +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
 240
 241 I should point out that the add/sub of π/2 can not lose significance
 242 since the result of ATAN(1/x) is bounded 0..π/2
 243
 244 The bottom line is that I think you are choosing to make too many of
 245 these into OpCodes, making the hardware function/calculation unit (and
 246 sequencer) more complicated that necessary.
 247
 248 --------------------------------------------------------
 249
 250 I might suggest that if there were a way for a calculation to be performed
 251 and the result of that calculation
 252
 253 chained to a subsequent calculation such that the precision of the
 254 result-becomes-operand is wider than
 255
 256 what will fit in a register, then you can dramatically reduce the count
 257 of instructions in this category while retaining
 258
 259 acceptable accuracy:
 260
 261      z = x / y
 262
 263 can be calculated as::
 264
 265      z = x * (1/y)
 266
 267 Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008
 268 accurate, but GPUs want speed and
 269
 270 1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It
 271 is also not "that inaccurate" displaying 0.625-to-0.52 ULP.
 272
 273 Given that one has the ability to carry (and process) more fraction bits,
 274 one can then do high precision multiplies of  π or other transcendental
 275 radixes.
 276
 277 And GPUs have been doing this almost since the dawn of 3D.
 278
 279     // calculate ATAN2 high performance style
 280     // Note: at this point x != y
 281     //
 282     if( x  > 0.0             )
 283     {
 284         if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
 285         if( y < 0.0 && |y| > |x| ) return       + ATAN( y / x );
 286         if( y > 0.0 && |y| < |x| ) return       + ATAN( y / x );
 287         if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
 288     }
 289     if( x  < 0.0             )
 290     {
 291         if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
 292         if( y < 0.0 && |y| > |x| ) return + π   - ATAN( y / x );
 293         if( y > 0.0 && |y| < |x| ) return + π   - ATAN( y / x );
 294         if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
 295     }
 296
 297 This way the adds and subtracts from the constant are not in a precision
 298 precarious position.