# FP Accuracy proposal

TODO: complete writeup

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>

Zfpacc: a proposal to allow implementations to dynamically set the bit-accuracy
of results, trading accuracy against speed: reduced accuracy permits lower
latency, while higher accuracy costs higher latency.
The IEEE 754 format is preserved: only the error of a result, measured in ULP
(Units in the Last Place), is permitted to be non-zero.

# Extension of FCSR

Zfpacc would use some of the reserved bits of FCSR.  It would be treated
very similarly to how dynamic frm is treated.

frm is treated as follows:

* Floating-point operations use either a static rounding mode encoded
  in the instruction, or a dynamic rounding mode held in frm.
* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec.
* A value of 111 in the instruction’s rm field selects the dynamic rounding
  mode held in frm. If frm is set to an invalid value (101–111),
  any subsequent attempt to execute a floating-point operation with a
  dynamic rounding mode will raise an illegal instruction exception.

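The frm rules above can be sketched in a few lines of Python. This is only an illustrative model, not part of the proposal: the names `effective_rounding_mode` and `IllegalInstruction` are made up for the example, and the validity of static rm encodings is deliberately not checked.

```python
DYN = 0b111  # rm value that selects the dynamic rounding mode held in frm


class IllegalInstruction(Exception):
    """Hypothetical stand-in for the illegal instruction exception."""


def effective_rounding_mode(rm: int, frm: int) -> int:
    """Return the rounding mode a floating-point operation would use.

    rm  -- 3-bit rounding-mode field of the instruction
    frm -- 3-bit dynamic rounding-mode field held in FCSR
    """
    if rm != DYN:
        # Static rounding mode encoded in the instruction itself.
        return rm
    if frm >= 0b101:
        # frm holds an invalid value (101-111): the operation traps.
        raise IllegalInstruction(f"invalid frm value {frm:#05b}")
    # Dynamic rounding mode taken from frm.
    return frm
```

The point of the sketch is that frm is only consulted, and only validated, when an instruction actually asks for the dynamic mode. The accuracy field is intended to behave in a similar way.
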
If we wish to support up to 4 accuracy modes, that would require 2 'fam'
bits.  The default would be IEEE754-compliant, encoded as 00.  This means
that all current hardware would be compliant with the default mode.

Unsupported modes cause a trap to allow emulation where traps are supported.
Emulation of unsupported modes would be required for UNIX platforms.
As with frm, an implementation may choose to support any permutation
of dynamic fam-instruction pairs. It will raise an illegal-instruction trap
upon executing an unsupported fam-instruction pair.  The implementation can
then emulate the required accuracy mode.

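A minimal sketch of the trap-and-emulate flow described above, written as a Python model. Everything here is hypothetical: the `HW_SUPPORTED` table, the operation names and the `execute` helper are invented for illustration, and the "emulation" simply falls back to the host's own arithmetic rather than to a real reduced-accuracy routine.

```python
import math

# Hypothetical capability table: (fam, operation) pairs handled natively.
HW_SUPPORTED = {(0b00, "fadd"), (0b00, "fsqrt"), (0b01, "fsqrt")}

# Reference behaviour of each operation, used by both paths in this model.
OPS = {"fadd": lambda a, b: a + b, "fsqrt": math.sqrt}


class IllegalInstruction(Exception):
    """Hypothetical stand-in for the illegal-instruction trap."""


def hw_execute(fam, op, *args):
    """Model of the hardware: trap on an unsupported fam-instruction pair."""
    if (fam, op) not in HW_SUPPORTED:
        raise IllegalInstruction((fam, op))
    return OPS[op](*args)


def execute(fam, op, *args):
    """Run an operation, emulating it in the trap handler if hardware traps.

    Returns (result, path), where path records which route was taken. Real
    user code would only see the result.
    """
    try:
        return hw_execute(fam, op, *args), "hardware"
    except IllegalInstruction:
        # Trap handler: emulate the requested accuracy mode in software.
        return OPS[op](*args), "emulated"
```

For example, `execute(0b00, "fadd", 1.0, 2.0)` takes the hardware path, while `execute(0b10, "fsqrt", 4.0)` traps and is emulated. Both return the same kind of result to the caller.
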
If the bits are in FCSR, then the switch itself would be exposed to
user mode.  By design, however, user mode would not be able to distinguish
emulated instructions from hardware-supported ones.  Detecting the
difference would require some platform-specific code.

TODO:

A mechanism for user-mode code to detect which modes are emulated
(CSR? syscall?), if the supervisor decides to make the emulation visible.
That would allow user code to switch to faster software implementations
if it chooses to.

TODO:

Choose which accuracy modes are required.

    Which accuracy modes should be included is a question outside of
    my expertise and would require a literature review of instruction
    frequency in key workloads, PPA analysis of simple and advanced
    implementations, etc.

TODO: reduced accuracy

    I don't see why Unix should be required to emulate some arbitrary
    reduced accuracy ML mode.  My guess would be that Unix Platform Spec
    requires support for IEEE, whereas arbitrary ML platform requires
    support for Mode XYZ.  Of course, implementations of either platform
    would be free to support any/all modes that they find valuable.
    Compiling for a specific platform means that support for required
    accuracy modes is guaranteed (and therefore does not need discovery
    sequences), while allowing portable code to execute discovery
    sequences to detect support for alternative accuracy modes.

# Dynamic accuracy CSR <a name="dynamic"></a>

FCSR would be modified to include accuracy bits:

| 31..11   | 10..8  | 7..5 | 4..0   |
| -------- | ------ | ---- | ------ |
| reserved | facc   | frm  | fflags |

The field facc would take the following values:

| facc  | mode    | description                            |
| ----- | ------- | -------------------------------------- |
| 0b000 | IEEE754 | correctly rounded                      |
| 0b010 | ULP<1   | error below 1 Unit in the Last Place   |
| 0b100 | Vulkan  | Vulkan compliant                       |
| 0b110 | Appx    | Machine Learning                       |

Note that the format of the operands and result remains the same for all opcodes. The only change is in the *accuracy* of the result, not its format.

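To make the proposed layout concrete, here is a small Python sketch that reads and writes the facc field of a 32-bit FCSR value. The bit positions and mode encodings are taken from the tables above. The helper names, and the choice to reject unlisted encodings with `ValueError`, are assumptions made for this example only.

```python
FFLAGS_SHIFT, FFLAGS_MASK = 0, 0x1F  # bits 4..0
FRM_SHIFT, FRM_MASK = 5, 0x7         # bits 7..5
FACC_SHIFT, FACC_MASK = 8, 0x7       # bits 10..8 (proposed)

# Encodings listed in the facc table; other values are not defined there.
FACC_MODES = {
    0b000: "IEEE754",
    0b010: "ULP<1",
    0b100: "Vulkan",
    0b110: "Appx",
}


def get_facc(fcsr: int) -> int:
    """Extract the 3-bit facc field from an FCSR value."""
    return (fcsr >> FACC_SHIFT) & FACC_MASK


def set_facc(fcsr: int, facc: int) -> int:
    """Return fcsr with facc replaced, leaving frm and fflags untouched."""
    if facc not in FACC_MODES:
        # Assumption for this sketch: refuse encodings the table omits.
        raise ValueError(f"facc encoding {facc:#05b} is not listed")
    cleared = fcsr & ~(FACC_MASK << FACC_SHIFT) & 0xFFFFFFFF
    return cleared | (facc << FACC_SHIFT)
```

For example, starting from an FCSR value of `0x23` (frm = 001, fflags = 00011), `set_facc(0x23, 0b100)` yields `0x423`: only bits 10..8 change.
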
Maybe a solution would be to add an extra field to the FP control CSR
to allow selecting one of several accurate or fast modes:

- machine-learning-mode: as fast as possible
  (may need additional requirements such as monotonicity for atanh?)
- GPU-mode: accurate to within a few ULP
  (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
- almost-accurate-mode: accurate to <1 ULP
  (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

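The modes listed above are distinguished by how many ULP of error they tolerate. As a rough illustration of what that measurement means, the following Python sketch (requires Python 3.9+ for `math.ulp` and `math.nextafter`) computes the error of a double-precision result in ULPs against an exact rational reference. The function names are invented here, and measuring the ULP at the double nearest the exact value is one convention among several, chosen as an assumption for the example.

```python
import math
from fractions import Fraction


def ulp_error(computed: float, exact: Fraction) -> Fraction:
    """Error of `computed` against `exact`, in double-precision ULPs.

    The ULP size is taken at the double nearest to the exact value
    (an assumption of this sketch).
    """
    reference = float(exact)
    return abs(Fraction(computed) - exact) / Fraction(math.ulp(reference))


def within_half_ulp(computed: float, exact: Fraction) -> bool:
    """True if the error does not exceed 0.5 ULP."""
    return ulp_error(computed, exact) <= Fraction(1, 2)


def within_one_ulp(computed: float, exact: Fraction) -> bool:
    """True if the error is strictly below 1 ULP."""
    return ulp_error(computed, exact) < 1


third = Fraction(1, 3)
nearest = 1 / 3                            # the double nearest to 1/3
neighbour = math.nextafter(nearest, 1.0)   # the next double above it
```

Here `nearest` is within half a ULP of 1/3, while `neighbour` is not, yet it still passes `within_one_ulp`. That gap is the kind of difference the "almost-accurate" and "fully-accurate" modes are meant to separate.
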
Question: should better accuracy than is requested be permitted? Example:
Amdahl-370 issues.

Comments:

    Yes, embedded systems typically can do with 12, 16 or 32 bit
    accuracy. Rarely does it require 64 bits. But the idea of making
    a low power 32 bit FPU/DSP that can accommodate 64 bits is already
    being done in other designs such as PIC etc I believe. For embedded
    graphics 16 bit is more than adequate. In fact, Cornell had a very
    innovative 18-bit floating point format described here (useful for
    FPGA designs with 18-bit DSPs):

    <https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>

    A very interesting GPU using the 18-bit FPU is also described here:

    <https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>

    There are also 8 and 9-bit floating point formats that could be useful

    <https://en.wikipedia.org/wiki/Minifloat>