# FP Accuracy proposal

TODO: complete writeup

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>

Zfpacc: a proposal to allow implementations to dynamically set the
bit-accuracy of floating-point results, trading accuracy (higher latency)
for speed (reduced latency) *at runtime*. IEEE754 format is preserved:
instruction operand and result format requirements are unmodified by
this proposal. Only the ULP (Unit in Last Place) of the instruction
*result* is permitted to meet alternative accuracy requirements, whilst
still retaining the instruction's requested format.

# Extension of FCSR

Zfpacc would use some of the reserved bits of FCSR. It would be treated
very similarly to how dynamic frm is treated.

frm is treated as follows:

* Floating-point operations use either a static rounding mode encoded
  in the instruction, or a dynamic rounding mode held in frm.
* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec.
* A value of 111 in the instruction's rm field selects the dynamic rounding
  mode held in frm. If frm is set to an invalid value (101–111),
  any subsequent attempt to execute a floating-point operation with a
  dynamic rounding mode will raise an illegal instruction exception.

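The frm selection rules above can be sketched as follows (an illustrative
model only; the constant and function names are not from the spec):

```python
RM_DYN = 0b111                 # rm=111 selects the dynamic mode held in frm
VALID_FRM = {0b000, 0b001, 0b010, 0b011, 0b100}  # RNE, RTZ, RDN, RUP, RMM

def effective_rm(inst_rm, frm):
    """Return the rounding mode an FP operation would actually use."""
    if inst_rm != RM_DYN:
        return inst_rm         # static mode encoded in the instruction
    if frm not in VALID_FRM:   # frm values 101-111 are invalid
        raise ValueError("illegal instruction")
    return frm
```
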
If we wish to support up to 4 accuracy modes, that would require 2 'fam'
bits. The default would be IEEE754-compliant, encoded as 00. This means
that all current hardware would be compliant with the default mode.

Unsupported modes cause a trap, to allow emulation where traps are
supported. Emulation of unsupported modes would be required for UNIX
platforms. As with frm, an implementation may choose to support any
permutation of dynamic fam-instruction pairs: executing an unsupported
fam-instruction pair raises an illegal-instruction trap, and the
implementation can then emulate the accuracy mode required.

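As a sketch, the trap-and-emulate dispatch on fam-instruction pairs might
look like the following (the supported set and all names are hypothetical):

```python
# Hypothetical: which (fam, opcode) pairs this core implements in hardware.
HW_SUPPORTED = {(0b00, "fadd.s"), (0b00, "fmul.s"), (0b10, "fadd.s")}

def dispatch(fam, opcode, execute_hw, emulate):
    """Run in hardware when the fam-instruction pair is supported;
    otherwise take the illegal-instruction path, where the trap
    handler emulates the requested accuracy mode."""
    if (fam, opcode) in HW_SUPPORTED:
        return execute_hw(fam, opcode)
    return emulate(fam, opcode)
```
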
If the bits are in FCSR, then the switch itself would be exposed to
user mode. User mode would not be able to detect emulation vs hardware
supported instructions, however (by design). That would require some
platform-specific code.

TODO:

A mechanism for user mode code to detect which modes are emulated
(csr? syscall?), if the supervisor decides to make the emulation visible:
that would allow user code to switch to faster software implementations
if it chooses to.

TODO:

Choose which accuracy modes are required.

Which accuracy modes should be included is a question outside of
my expertise and would require a literature review of instruction
frequency in key workloads, PPA analysis of simple and advanced
implementations, etc.

TODO: reduced accuracy

I don't see why Unix should be required to emulate some arbitrary
reduced-accuracy ML mode. My guess would be that the Unix Platform Spec
requires support for IEEE, whereas an arbitrary ML platform requires
support for Mode XYZ. Of course, implementations of either platform
would be free to support any/all modes that they find valuable.
Compiling for a specific platform means that support for required
accuracy modes is guaranteed (and therefore does not need discovery
sequences), while allowing portable code to execute discovery
sequences to detect support for alternative accuracy modes.

# Dynamic accuracy CSR <a name="dynamic"></a>

FCSR is to be modified to include accuracy bits:

| 31....11 | 10..8 | 7..5 | 4....0 |
| -------- | ----- | ---- | ------ |
| reserved | facc  | frm  | fflags |

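Field extraction for the layout above can be sketched as follows (facc is
this proposal's addition; the function name is illustrative):

```python
def decode_fcsr(fcsr):
    fflags = fcsr & 0x1f          # bits 4..0: exception flags
    frm    = (fcsr >> 5) & 0x7    # bits 7..5: dynamic rounding mode
    facc   = (fcsr >> 8) & 0x7    # bits 10..8: accuracy mode (proposed)
    return facc, frm, fflags
```
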
The values of the facc field are as follows:

| facc  | mode    | description         |
| ----- | ------- | ------------------- |
| 0b00H | IEEE754 | correctly rounded   |
| 0b01H | ULP<1   | Unit in Last Place < 1 |
| 0b10H | Vulkan  | Vulkan compliant    |
| 0b11H | Appx    | Machine Learning    |

When bit 0 (H) of facc is set to zero, half-precision mode is
disabled. When set, an automatic down-conversion (FCVT) of the operands
to half the instruction bitwidth is performed (an FP32 opcode would
convert to FP16), followed by the operation occurring at half precision,
followed by an automatic up-conversion back to the instruction's bitwidth.
99
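Splitting facc per the table and description above might look like this
(the function name is illustrative):

```python
def decode_facc(facc):
    halfmode = facc & 1           # bit 0 (H): half-precision enable
    mode = (facc >> 1) & 0b11     # 00=IEEE754 01=ULP<1 10=Vulkan 11=Appx
    return mode, halfmode
```
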
Note that the format of the operands and result remains the same for
all opcodes. The only change is in the *accuracy* of the result, not
its format.

Pseudocode for half accuracy mode:

    def fpadd32(op1, op2):
        if FCSR.facc.halfmode:
            op1 = fcvt32to16(op1)
            op2 = fcvt32to16(op2)
            result = fpadd16(op1, op2)  # add carried out at half precision
            return fcvt16to32(result)
        else:
            # TODO: reduced accuracy if requested
            return op1 + op2

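For illustration, the half-accuracy round trip can be modelled in Python
using the IEEE754 binary16 support in `struct` (format code `'e'`); the
plain `+` on the narrowed values stands in for a hardware half-precision
adder, and the function names are this sketch's own:

```python
import struct

def fcvt32to16(x):
    # round-trip through binary16; value comes back as a Python float
    return struct.unpack('e', struct.pack('e', x))[0]

def fpadd32_halfmode(op1, op2):
    a = fcvt32to16(op1)
    b = fcvt32to16(op2)
    return fcvt32to16(a + b)      # result is also rounded to binary16
```

A small addend vanishes in this mode: the binary16 ulp near 1.0 is 2**-10,
so adding 1e-4 to 1.0 returns exactly 1.0.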
## Discussion

Maybe a solution would be to add an extra field to the FP control CSR
to allow selecting one of several accurate or fast modes:

- machine-learning-mode: fast as possible
  (maybe need additional requirements such as monotonicity for atanh?)
- GPU-mode: accurate to within a few ULP
  (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
- almost-accurate-mode: accurate to <1 ULP
  (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

Question: should better accuracy than is requested be permitted? Example:
Amdahl-370 issues.

Comments:

Yes, embedded systems typically can do with 12, 16 or 32 bit
accuracy. Rarely does it require 64 bits. But the idea of making
a low-power 32-bit FPU/DSP that can accommodate 64 bits is already
being done in other designs such as PIC etc., I believe. For embedded
graphics 16 bit is more than adequate. In fact, Cornell had a very
innovative 18-bit floating point format, described here (useful for
FPGA designs with 18-bit DSPs):

<https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>

A very interesting GPU using the 18-bit FPU is also described here:

<https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>

There are also 8 and 9-bit floating point formats that could be useful:

<https://en.wikipedia.org/wiki/Minifloat>