ztrans_proposal.mdwn

   1 # Zftrans - transcendental operations
   2
   3 See:
   4
   5 * <http://bugs.libre-riscv.org/show_bug.cgi?id=127>
   6 * <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
   7 * Discussion: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002342.html>
   8 * [[rv_major_opcode_1010011]] for opcode listing.
   9 * [[zfpacc_proposal]] for accuracy settings proposal
  10
  11 Extension subsets:
  12
  13 * **Zftrans**: standard transcendentals (best suited to 3D)
  14 * **ZftransExt**: extra functions (useful, not generally needed for 3D,
  15   can be synthesised using Ztrans)
  16 * **Ztrigpi**: trig. xxx-pi sinpi cospi tanpi
  17 * **Ztrignpi**: trig non-xxx-pi sin cos tan
  18 * **Zarctrigpi**: arc-trig. a-xxx-pi: atan2pi asinpi acospi
  19 * **Zarctrignpi**: arc-trig. non-a-xxx-pi: atan2, asin, acos
  20 * **Zfhyp**: hyperbolic/inverse-hyperbolic.  sinh, cosh, tanh, asinh,
  21   acosh, atanh (can be synthesised - see below)
  22 * **ZftransAdv**: much more complex to implement in hardware
  23 * **Zfrsqrt**: Reciprocal square-root.
  24
  25 Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi,
  26 Zarctrignpi
  27
  28 [[!toc levels=2]]
  29
  30 # TODO:
  31
  32 * Decision on accuracy, moved to [[zfpacc_proposal]]
  33 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html>
  34 * Errors **MUST** be repeatable.
  35 * How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded?
  36 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html>
  37   Accuracy requirements for dual (triple) purpose implementations must
  38   meet the higher standard.
  39 * Reciprocal Square-root is in its own separate extension (Zfrsqrt) as
  40   it is desirable on its own by other implementors.  This to be evaluated.
  41
  42 # Proposed Opcodes vs Khronos OpenCL Opcodes <a name="khronos_equiv"></a>
  43
  44 This list shows the (direct) equivalence between proposed opcodes and
  45 their Khronos OpenCL equivalents.
  46
  47 See
  48 <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
  49
  50 Special FP16 opcodes are *not* being proposed, except by indirect / inherent
  51 use of the "fmt" field that is already present in the RISC-V Specification.
  52
  53 "Native" opcodes are *not* being proposed: implementors will be expected
  54 to use the (equivalent) proposed opcode covering the same function.
  55
  56 "Fast" opcodes are *not* being proposed, because the Khronos Specification
  57 fast\_length, fast\_normalise and fast\_distance OpenCL opcodes require
  58 vectors (or can be done as scalar operations using other RISC-V instructions).
  59
  60 The OpenCL FP32 opcodes are **direct** equivalents to the proposed opcodes.
  61 Deviation from conformance with the Khronos Specification - including the
  62 Khronos Specification accuracy requirements - is not an option.
  63
  64 [[!table data="""
  65 Proposed opcode | OpenCL FP32 | OpenCL FP16 | OpenCL native | OpenCL fast |
  66 FSIN            | sin         | half\_sin   | native\_sin   | NONE        |
  67 FCOS            | cos         | half\_cos   | native\_cos   | NONE        |
  68 FTAN            | tan         | half\_tan   | native\_tan   | NONE        |
  69 NONE (1)        | sincos      | NONE        | NONE          | NONE        |
  70 FASIN           | asin        | NONE        | NONE          | NONE        |
  71 FACOS           | acos        | NONE        | NONE          | NONE        |
  72 FATAN           | atan        | NONE        | NONE          | NONE        |
  73 FSINPI          | sinpi       | NONE        | NONE          | NONE        |
  74 FCOSPI          | cospi       | NONE        | NONE          | NONE        |
  75 FTANPI          | tanpi       | NONE        | NONE          | NONE        |
  76 FASINPI         | asinpi      | NONE        | NONE          | NONE        |
  77 FACOSPI         | acospi      | NONE        | NONE          | NONE        |
  78 FATANPI         | atanpi      | NONE        | NONE          | NONE        |
  79 FSINH           | sinh        | NONE        | NONE          | NONE        |
  80 FCOSH           | cosh        | NONE        | NONE          | NONE        |
  81 FTANH           | tanh        | NONE        | NONE          | NONE        |
  82 FASINH          | asinh       | NONE        | NONE          | NONE        |
  83 FACOSH          | acosh       | NONE        | NONE          | NONE        |
  84 FATANH          | atanh       | NONE        | NONE          | NONE        |
  85 FRSQRT          | rsqrt       | half\_rsqrt | native\_rsqrt | NONE        |
  86 FCBRT           | cbrt        | NONE        | NONE          | NONE        |
  87 FEXP2           | exp2        | half\_exp2  | native\_exp2  | NONE        |
  88 FLOG2           | log2        | half\_log2  | native\_log2  | NONE        |
  89 FEXPM1          | expm1       | NONE        | NONE          | NONE        |
  90 FLOG1P          | log1p       | NONE        | NONE          | NONE        |
  91 FEXP            | exp         | half\_exp   | native\_exp   | NONE        |
  92 FLOG            | log         | half\_log   | native\_log   | NONE        |
  93 FEXP10          | exp10       | half\_exp10 | native\_exp10 | NONE        |
  94 FLOG10          | log10       | half\_log10 | native\_log10 | NONE        |
  95 FATAN2          | atan2       | NONE        | NONE          | NONE        |
  96 FATAN2PI        | atan2pi     | NONE        | NONE          | NONE        |
  97 FPOW            | pow         | NONE        | NONE          | NONE        |
  98 FROOT           | rootn       | NONE        | NONE          | NONE        |
  99 FHYPOT          | hypot       | NONE        | NONE          | NONE        |
 100 NONE (4)        | NONE        | half\_recip | native\_recip | NONE        |
 101 """]]
 102
 103 Note (1) FSINCOS is macro-op fused (see below).
 104
 105 Note (4) FCRECIP is a sythesised alias, below.
 106
 107 # List of 2-arg opcodes
 108
 109 [[!table  data="""
 110 opcode    | Description            | pseudo-code                | Extension   |
 111 FATAN2    | atan2 arc tangent      | rd = atan2(rs2, rs1)       | Zarctrignpi |
 112 FATAN2PI  | atan2 arc tangent / pi | rd = atan2(rs2, rs1) / pi  | Zarctrigpi  |
 113 FPOW      | x power of y           | rd = pow(rs1, rs2)         | ZftransAdv  |
 114 FROOT     | x power 1/y            | rd = pow(rs1, 1/rs2)       | ZftransAdv  |
 115 FHYPOT    | hypotenuse             | rd = sqrt(rs1^2 + rs2^2)   | Zftrans     |
 116 """]]
 117
 118 # List of 1-arg transcendental opcodes
 119
 120 [[!table  data="""
 121 opcode   | Description              | pseudo-code             | Extension  |
 122 FRSQRT   | Reciprocal Square-root   | rd = sqrt(rs1)          | Zfrsqrt    |
 123 FCBRT    | Cube Root                | rd = pow(rs1, 3)        | Zftrans    |
 124 FEXP2    | power-of-2               | rd = pow(2, rs1)        | Zftrans    |
 125 FLOG2    | log2                     | rd = log2(rs1)          | Zftrans    |
 126 FEXPM1   | exponential minus 1      | rd = pow(e, rs1) - 1.0  | Zftrans    |
 127 FLOG1P   | log plus 1               | rd = log(e, 1 + rs1)    | Zftrans    |
 128 FEXP     | exponential              | rd = pow(e, rs1)        | ZftransExt |
 129 FLOG     | natural log (base e)     | rd = log(e, rs1)        | ZftransExt |
 130 FEXP10   | power-of-10              | rd = pow(10, rs1)       | ZftransExt |
 131 FLOG10   | log base 10              | rd = log10(rs1)         | ZftransExt |
 132 """]]
 133
 134 # List of 1-arg trigonometric opcodes
 135
 136 [[!table  data="""
 137 opcode   | Description              | pseudo-code             | Extension |
 138 FSIN     | sin (radians)            | rd = sin(rs1)           | Ztrignpi    |
 139 FCOS     | cos (radians)            | rd = cos(rs1)           | Ztrignpi    |
 140 FTAN     | tan (radians)            | rd = tan(rs1)           | Ztrignpi    |
 141 FASIN    | arcsin (radians)         | rd = asin(rs1)          | Zarctrignpi |
 142 FACOS    | arccos (radians)         | rd = acos(rs1)          | Zarctrignpi |
 143 FATAN    | arctan (radians)         | rd = atan(rs1)          | Zarctrignpi |
 144 FSINPI   | sin times pi             | rd = sin(pi * rs1)      | Ztrigpi |
 145 FCOSPI   | cos times pi             | rd = cos(pi * rs1)      | Ztrigpi |
 146 FTANPI   | tan times pi             | rd = tan(pi * rs1)      | Ztrigpi |
 147 FASINPI  | arcsin / pi              | rd = asin(rs1) / pi     | Zarctrigpi |
 148 FACOSPI  | arccos / pi              | rd = acos(rs1) / pi     | Zarctrigpi |
 149 FATANPI  | arctan / pi              | rd = atan(rs1) / pi     | Zarctrigpi |
 150 FSINH    | hyperbolic sin (radians) | rd = sinh(rs1)          | Zfhyp |
 151 FCOSH    | hyperbolic cos (radians) | rd = cosh(rs1)          | Zfhyp |
 152 FTANH    | hyperbolic tan (radians) | rd = tanh(rs1)          | Zfhyp |
 153 FASINH   | inverse hyperbolic sin   | rd = asinh(rs1)         | Zfhyp |
 154 FACOSH   | inverse hyperbolic cos   | rd = acosh(rs1)         | Zfhyp |
 155 FATANH   | inverse hyperbolic tan   | rd = atanh(rs1)         | Zfhyp |
 156 """]]
 157
 158 # Synthesis, Pseudo-code ops and macro-ops
 159
 160 The pseudo-ops are best left up to the compiler rather than being actual
 161 pseudo-ops, by allocating one scalar FP register for use as a constant
 162 (loop invariant) set to "1.0" at the beginning of a function or other
 163 suitable code block.
 164
 165 * FRCP rd, rs1 - pseudo-code alias for rd = 1.0 / rs1
 166 * FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
 167 * FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).
 168
 169 FATANPI example pseudo-code:
 170
 171     lui t0, 0x3F800 // upper bits of f32 1.0
 172     fmv.x.s ft0, t0
 173     fatan2pi.s rd, rs1, ft0
 174
 175 Hyperbolic function example (obviates need for Zfhyp except for
 176 high-performance or correctly-rounding):
 177
 178     ASINH( x ) = ln( x + SQRT(x**2+1))
 179
 180 # To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?
 181
 182 RISC principle says "exclude LOG because it's covered by LOGP1 plus an ADD".
 183 Research needed to ensure that implementors are not compromised by such
 184 a decision
 185 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html>
 186
 187 > > correctly-rounded LOG will return different results than LOGP1 and ADD.
 188 > > Likewise for EXP and EXPM1
 189
 190 > ok, they stay in as real opcodes, then.
 191
 192 # ATAN / ATAN2 commentary
 193
 194 Discussion starts here:
 195 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html>
 196
 197 from Mitch Alsup:
 198
 199 would like to point out that the general implementations of ATAN2 do a
 200 bunch of special case checks and then simply call ATAN.
 201
 202     double ATAN2( double y, double x )
 203     {   // IEEE 754-2008 quality ATAN2
 204
 205         // deal with NANs
 206         if( ISNAN( x )             ) return x;
 207         if( ISNAN( y )             ) return y;
 208
 209         // deal with infinities
 210         if( x == +∞    && |y|== +∞  ) return copysign(  π/4, y );
 211         if( x == +∞                 ) return copysign(  0.0, y );
 212         if( x == -∞    && |y|== +∞  ) return copysign( 3π/4, y );
 213         if( x == -∞                 ) return copysign(    π, y );
 214         if(               |y|== +∞  ) return copysign(  π/2, y );
 215
 216         // deal with signed zeros
 217         if( x == 0.0  &&  y != 0.0 ) return copysign(  π/2, y );
 218         if( x >=+0.0  &&  y == 0.0 ) return copysign(  0.0, y );
 219         if( x <=-0.0  &&  y == 0.0 ) return copysign(    π, y );
 220
 221         // calculate ATAN2 textbook style
 222         if( x  > 0.0               ) return     ATAN( |y / x| );
 223         if( x  < 0.0               ) return π - ATAN( |y / x| );
 224     }
 225
 226
 227 Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent
 228 a constant and then call/use ATAN2.
 229
 230 When one considers an implementation of ATAN, one must consider several
 231 ranges of evaluation::
 232
 233      x  [  -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
 234      x  (-1.0, +1.0]:: ATAN( x ) =      + ATAN(   x );
 235      x  [ 1.0,   +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
 236
 237 I should point out that the add/sub of π/2 can not lose significance
 238 since the result of ATAN(1/x) is bounded 0..π/2
 239
 240 The bottom line is that I think you are choosing to make too many of
 241 these into OpCodes, making the hardware function/calculation unit (and
 242 sequencer) more complicated that necessary.
 243
 244 --------------------------------------------------------
 245
 246 I might suggest that if there were a way for a calculation to be performed
 247 and the result of that calculation
 248
 249 chained to a subsequent calculation such that the precision of the
 250 result-becomes-operand is wider than
 251
 252 what will fit in a register, then you can dramatically reduce the count
 253 of instructions in this category while retaining
 254
 255 acceptable accuracy:
 256
 257      z = x / y
 258
 259 can be calculated as::
 260
 261      z = x * (1/y)
 262
 263 Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008
 264 accurate, but GPUs want speed and
 265
 266 1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It
 267 is also not "that inaccurate" displaying 0.625-to-0.52 ULP.
 268
 269 Given that one has the ability to carry (and process) more fraction bits,
 270 one can then do high precision multiplies of  π or other transcendental
 271 radixes.
 272
 273 And GPUs have been doing this almost since the dawn of 3D.
 274
 275     // calculate ATAN2 high performance style
 276     // Note: at this point x != y
 277     //
 278     if( x  > 0.0             )
 279     {
 280         if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
 281         if( y < 0.0 && |y| > |x| ) return       + ATAN( y / x );
 282         if( y > 0.0 && |y| < |x| ) return       + ATAN( y / x );
 283         if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
 284     }
 285     if( x  < 0.0             )
 286     {
 287         if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
 288         if( y < 0.0 && |y| > |x| ) return + π   - ATAN( y / x );
 289         if( y > 0.0 && |y| < |x| ) return + π   - ATAN( y / x );
 290         if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
 291     }
 292
 293 This way the adds and subtracts from the constant are not in a precision
 294 precarious position.