# FP Accuracy proposal

TODO: complete writeup

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>

Zfpacc: a proposal to allow implementations to dynamically set the
bit-accuracy of floating-point results, trading accuracy for speed
*at runtime*: lower accuracy gives reduced latency, higher accuracy gives
higher latency.  IEEE754 format is preserved:
instruction operand and result format requirements are unmodified by
this proposal.  Only the accuracy of the instruction *result*, measured
in ULP (Units in the Last Place), is permitted to meet alternative
requirements, whilst the result still retains the instruction's
requested format.

# Extension of FCSR

Zfpacc would use some of the reserved bits of FCSR.  It would be treated
very similarly to how dynamic frm is treated.

frm is treated as follows:

* Floating-point operations use either a static rounding mode encoded
  in the instruction, or a dynamic rounding mode held in frm.
* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec.
* A value of 111 in the instruction’s rm field selects the dynamic rounding
  mode held in frm. If frm is set to an invalid value (101–111),
  any subsequent attempt to execute a floating-point operation with a
  dynamic rounding mode will raise an illegal instruction exception.

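The frm rules above can be sketched in a few lines of Python. This is a minimal model rather than spec text: `effective_rounding_mode` and `IllegalInstruction` are hypothetical names, and treating the reserved static encodings (101 and 110) as illegal is an assumption of this sketch. A dynamic accuracy mode would be resolved along similar lines.

```python
# Rounding-mode encodings used by the rm (instruction) and frm (CSR) fields.
RNE, RTZ, RDN, RUP, RMM = 0b000, 0b001, 0b010, 0b011, 0b100
DYN = 0b111  # only meaningful in an instruction's rm field


class IllegalInstruction(Exception):
    """Stands in for the illegal instruction exception (hypothetical name)."""


def effective_rounding_mode(instr_rm: int, frm: int) -> int:
    """Return the rounding mode that an FP instruction actually uses."""
    if instr_rm == DYN:
        # Dynamic mode: take the value from frm, where 101-111 are invalid.
        if frm > RMM:
            raise IllegalInstruction(f"invalid frm {frm:#05b}")
        return frm
    if instr_rm > RMM:
        # Assumption of this sketch: reserved static encodings also trap.
        raise IllegalInstruction(f"reserved rm {instr_rm:#05b}")
    # Static mode: the instruction's own rm field wins, frm is ignored.
    return instr_rm
```
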
If we wish to support up to 4 accuracy modes, that would require 2 'fam'
bits.  The default mode would be IEEE754-compliant, encoded as 00.  This means
that all current hardware would be compliant with the default mode.

Unsupported modes cause a trap, allowing emulation where traps are supported.
Emulation of unsupported modes would be required for UNIX platforms.
As with frm, an implementation may choose to support any combination
of dynamic fam-instruction pairs. It will raise an illegal-instruction
trap upon executing an unsupported fam-instruction pair.  The implementation
can then emulate the required accuracy mode.

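A minimal sketch of the trap-and-emulate flow described above, assuming a hypothetical implementation whose hardware supports only a few fam-instruction pairs. The names `HW_SUPPORTED`, `REFERENCE`, `hw_execute` and `execute`, and the choice of supported pairs, are illustrative only. The emulation path simply computes the host's reference result, which assumes that delivering more accuracy than requested is acceptable. The `"hardware"`/`"emulated"` tag is returned purely for illustration; user mode is not meant to see whether an instruction was emulated.

```python
import math
import operator

# Hypothetical: the (fam, instruction) pairs this hardware implements.
HW_SUPPORTED = {(0b00, "fadd"), (0b00, "fsqrt"), (0b01, "fsqrt")}

# Software reference results (correctly rounded on IEEE754 hosts).
REFERENCE = {"fadd": operator.add, "fsqrt": math.sqrt}


class IllegalInstruction(Exception):
    """Stands in for the illegal-instruction trap (hypothetical name)."""


def hw_execute(fam, op, *args):
    """Model of the hardware: traps on an unsupported fam-instruction pair."""
    if (fam, op) not in HW_SUPPORTED:
        raise IllegalInstruction((fam, op))
    # A real unit could return a less accurate result when fam != 0;
    # this model always returns the reference result.
    return REFERENCE[op](*args)


def execute(fam, op, *args):
    """Run one FP operation, falling back to emulation on a trap."""
    try:
        return hw_execute(fam, op, *args), "hardware"
    except IllegalInstruction:
        # Trap handler: emulate the requested mode in software.
        return REFERENCE[op](*args), "emulated"
```
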
If the bits are in FCSR, then the switch itself would be exposed to
user mode.  By design, however, user mode would not be able to tell
emulated instructions from hardware-supported ones.  Detecting that
would require some platform-specific code.

TODO:

A mechanism (CSR? syscall?) for user-mode code to detect which modes are
emulated, if the supervisor decides to make the emulation visible.  That
would allow user code to switch to faster software implementations
if it chooses to.

TODO:

Choose which accuracy modes are required.

    Which accuracy modes should be included is a question outside of
    my expertise and would require a literature review of instruction
    frequency in key workloads, PPA analysis of simple and advanced
    implementations, etc.

TODO: reduced accuracy

    I don't see why Unix should be required to emulate some arbitrary
    reduced accuracy ML mode.  My guess would be that Unix Platform Spec
    requires support for IEEE, whereas arbitrary ML platform requires
    support for Mode XYZ.  Of course, implementations of either platform
    would be free to support any/all modes that they find valuable.
    Compiling for a specific platform means that support for required
    accuracy modes is guaranteed (and therefore does not need discovery
    sequences), while allowing portable code to execute discovery
    sequences to detect support for alternative accuracy modes.

# Dynamic accuracy CSR <a name="dynamic"></a>

FCSR to be modified to include accuracy bits:

| 31....11 | 10..8  | 7..5 | 4....0 |
| -------- | ------ | ---- | ------ |
| reserved | facc   | frm  | fflags |

The facc field would take the following values:

| facc  | mode    | description                |
| ----- | ------- | -------------------------- |
| 0b000 | IEEE754 | correctly rounded          |
| 0b010 | ULP<1   | Unit in the Last Place < 1 |
| 0b100 | Vulkan  | Vulkan compliant           |
| 0b110 | Appx    | Machine Learning           |

Notes:

* facc=0 matches current RISC-V behaviour, where these bits were formerly reserved and set to zero.
* The format of the operands and result remains the same for
  all opcodes. The only change is in the *accuracy* of the result, not
  its format.
* facc sets the *minimum* accuracy. It is acceptable to provide *more* accurate results than are requested by a given facc mode (although, clearly, the opportunity for reduced power and latency would be missed).

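As an illustration of the proposed layout, the sketch below reads and writes the facc field of an FCSR value held in a plain integer. The bit positions and mode encodings are taken from the tables above; the helper names (`get_facc`, `set_facc`, `get_frm`, `get_fflags`) are hypothetical, and rejecting encodings that the table does not list is an assumption of this sketch.

```python
# Field positions from the proposed FCSR layout.
FFLAGS_SHIFT, FFLAGS_MASK = 0, 0x1F  # bits 4..0
FRM_SHIFT, FRM_MASK = 5, 0x7         # bits 7..5
FACC_SHIFT, FACC_MASK = 8, 0x7       # bits 10..8 (proposed)

# facc encodings listed in the table above.
FACC_MODES = {0b000: "IEEE754", 0b010: "ULP<1", 0b100: "Vulkan", 0b110: "Appx"}


def get_fflags(fcsr: int) -> int:
    return (fcsr >> FFLAGS_SHIFT) & FFLAGS_MASK


def get_frm(fcsr: int) -> int:
    return (fcsr >> FRM_SHIFT) & FRM_MASK


def get_facc(fcsr: int) -> int:
    return (fcsr >> FACC_SHIFT) & FACC_MASK


def set_facc(fcsr: int, facc: int) -> int:
    """Return fcsr with the facc field replaced; frm and fflags are kept."""
    if facc not in FACC_MODES:
        # Assumption: encodings not listed in the table are rejected.
        raise ValueError(f"unlisted facc encoding {facc:#05b}")
    return (fcsr & ~(FACC_MASK << FACC_SHIFT)) | (facc << FACC_SHIFT)
```

For example, `set_facc(0x21, 0b100)` gives `0x421`: the Vulkan mode is selected while frm and fflags are left untouched, and setting facc back to `0b000` restores the original value.
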
## Discussion

Maybe a solution would be to add an extra field to the fp control csr
to allow selecting one of several accurate or fast modes:

- machine-learning-mode: fast as possible
  (maybe need additional requirements such as monotonicity for atanh?)
- GPU-mode: accurate to within a few ULP
  (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
- almost-accurate-mode: accurate to <1 ULP
  (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

Question: should better accuracy than is requested be permitted? Example:
Amdahl-370 issues.

Comments:

    Yes, embedded systems typically can do with 12, 16 or 32 bit
    accuracy. Rarely does it require 64 bits. But the idea of making
    a low power 32 bit FPU/DSP that can accommodate 64 bits is already
    being done in other designs such as PIC etc I believe. For embedded
    graphics 16 bit is more than adequate. In fact, Cornell had a very
    innovative 18-bit floating point format described here (useful for
    FPGA designs with 18-bit DSPs):

    <https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>

    A very interesting GPU using the 18-bit FPU is also described here:

    <https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>

    There are also 8 and 9-bit floating point formats that could be useful

    <https://en.wikipedia.org/wiki/Minifloat>

### Compliance

Dan Petroski:

    It’s a bit more complicated than that. Different FP
    representations/algorithms have different quantization ranges, so you
    can get more or less precise depending on how large the arguments are.

    For instance, machine A can compute within ULP3 from 0 to 10000, but
    ULP2 from 10000 upwards. Machine B can compute within ULP2 from 0 to
    6000, then ULP3 for 6000+. How do you design a compliance suite which
    guarantees behavior across all fpaccs?

and from Allen Baum:

    In the example above, you'd need a ratified spec with the defined
    ranges (possibly per range and per op) - and then implementations
    would need to at least meet that spec (but could be more accurate)

    so - not impossible, but a lot more work to write different kinds
    of tests than standard IEEE compatible test would have.

    And, by the way, if you want it to be a ratified spec, it needs a
    compliance suite, and whoever has defined the spec is responsible
    for writing it.

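One way to make the per-range bounds discussed above concrete is a small checker that measures the error of a result in ULPs and compares it with a bound that depends on the argument. This is a sketch under stated assumptions, not a proposed compliance suite: the names `ulp_error`, `max_ulp_allowed`, `complies`, `exact_recip` and `sloppy_recip` are hypothetical, the table `SPEC_A` reuses the Machine A figures from the example purely as illustrative data, and the error is measured against the ULP of the double nearest to the exact value. It requires Python 3.9 or later for `math.ulp` and `math.nextafter`.

```python
import math
from fractions import Fraction


def ulp_error(result: float, exact: Fraction) -> Fraction:
    """Error of `result`, in ULPs of the double nearest to `exact`."""
    ref = float(exact)  # correctly rounded reference value
    return abs(Fraction(result) - exact) / Fraction(math.ulp(ref))


# Illustrative only: (low, high, max ULP error) per argument range.
SPEC_A = [(0.0, 10000.0, 3), (10000.0, math.inf, 2)]


def max_ulp_allowed(x: float, spec):
    """Look up the ULP bound that applies to argument x."""
    for low, high, bound in spec:
        if low <= x < high:
            return bound
    raise ValueError(f"no bound defined for argument {x}")


def complies(op, exact_op, inputs, spec) -> bool:
    """True if op meets the spec's bound for every tested argument."""
    return all(
        ulp_error(op(x), exact_op(x)) <= max_ulp_allowed(x, spec)
        for x in inputs
    )


def exact_recip(x: float) -> Fraction:
    """Exact reciprocal as a rational number."""
    return 1 / Fraction(x)


def sloppy_recip(x: float) -> float:
    """Deliberately inaccurate: one ULP above the rounded quotient."""
    return math.nextafter(1.0 / x, math.inf)
```

With these definitions, `complies(sloppy_recip, exact_recip, [3.0, 7.0, 12345.0], SPEC_A)` holds, while the same operation fails a correctly-rounded bound of half an ULP. A real suite would need far more test points per range and per operation, which is the extra work the comments above refer to.
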