# FP Accuracy proposal

Credits:

* Bruce Hoult
* Allen Baum
* Dan Petroski
* Jacob Lifshay

TODO: complete writeup

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>

Zfpacc: a proposal to allow implementations to dynamically set the
bit-accuracy of floating-point results, trading accuracy for speed
*at runtime*: reduced accuracy gives lower latency, full accuracy
higher latency. IEEE754 format is preserved: instruction operand and
result format requirements are unmodified by this proposal. Only the
accuracy of the instruction *result*, measured in ULP (Unit in the
Last Place), is permitted to meet alternative requirements, whilst
still retaining the instruction's requested format.

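To make "accuracy measured in ULP" concrete, here is a minimal sketch of how the error of a result could be measured. It assumes Python 3.9 or later (for `math.ulp`) and takes the ULP at the correctly rounded reference value, which is only one of several conventions in use. The helper names `ulp_error` and `meets_accuracy` are hypothetical and not part of the proposal.

```python
import math
from fractions import Fraction


def ulp_error(computed: float, exact: Fraction) -> Fraction:
    """Distance between a computed double and the exact result, in ULPs.

    Assumption: the ULP is that of the correctly rounded reference value.
    """
    reference = float(exact)  # exact value rounded to the nearest double
    return abs(Fraction(computed) - exact) / Fraction(math.ulp(reference))


def meets_accuracy(computed: float, exact: Fraction, max_ulp) -> bool:
    """True if the result is within max_ulp ULPs of the exact value."""
    return ulp_error(computed, exact) <= max_ulp


exact_third = Fraction(1, 3)
correctly_rounded = 1 / 3                      # error of at most 0.5 ULP
sloppy = 1 / 3 + 3 * math.ulp(1 / 3)           # about 3 ULP off, same format

print(meets_accuracy(correctly_rounded, exact_third, Fraction(1, 2)))  # True
print(meets_accuracy(sloppy, exact_third, 4))                          # True
print(meets_accuracy(sloppy, exact_third, 1))                          # False
```

Both values are ordinary doubles: the format is unchanged, and only the distance from the exact result differs.
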
# Extension of FCSR

Zfpacc would use some of the reserved bits of FCSR. It would be treated
very similarly to how dynamic frm is treated.

frm is treated as follows:

* Floating-point operations use either a static rounding mode encoded
  in the instruction, or a dynamic rounding mode held in frm.
* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec.
* A value of 111 in the instruction’s rm field selects the dynamic rounding
  mode held in frm. If frm is set to an invalid value (101–111),
  any subsequent attempt to execute a floating-point operation with a
  dynamic rounding mode will raise an illegal instruction exception.

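The frm rules listed above can be sketched as a small selection function. This is an illustrative model only: `IllegalInstruction` and `effective_rounding_mode` are hypothetical names, and the sketch covers just the behaviour stated in the list (reserved static encodings are not modelled).

```python
class IllegalInstruction(Exception):
    """Stands in for the illegal instruction exception."""


DYNAMIC_RM = 0b111  # rm field value that selects the mode held in frm


def effective_rounding_mode(instr_rm: int, frm: int) -> int:
    """Return the rounding mode an operation would actually use."""
    if instr_rm != DYNAMIC_RM:
        # Static rounding mode encoded in the instruction itself.
        return instr_rm
    if frm >= 0b101:
        # frm holds an invalid value (101-111): dynamic use must trap.
        raise IllegalInstruction(f"invalid frm value {frm:#05b}")
    return frm


print(effective_rounding_mode(0b001, frm=0b111))  # static mode: frm ignored
print(effective_rounding_mode(0b111, frm=0b010))  # dynamic mode: taken from frm
```

The proposal intends the accuracy field to behave the same way: the setting lives in FCSR and an unsupported value only causes a trap when an instruction actually depends on it.
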
If we wish to support up to 4 accuracy modes, that would require 2 'fam'
bits. The default would be IEEE754-compliant, encoded as 00. This means
that all current hardware would be compliant with the default mode.
(The draft CSR layout further down names this field facc and allots it
three bits.)

Unsupported modes cause a trap to allow emulation where traps are supported.
Emulation of unsupported modes would be required for UNIX platforms.
As with frm, an implementation may choose to support any permutation
of dynamic fam-instruction pairs. It will raise an illegal-instruction
trap upon executing an unsupported fam-instruction pair. The
implementation can then emulate the accuracy mode required.

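A minimal sketch of the trap-and-emulate flow described above, under stated assumptions: the set of hardware-supported pairs, the opcode names and the function names are all hypothetical, and the emulation simply computes at full IEEE754 accuracy, relying on the rule that results may be more accurate than the mode requests.

```python
import operator


class IllegalInstruction(Exception):
    """Stands in for the illegal-instruction trap."""


OPS = {"fadd": operator.add, "fdiv": operator.truediv}

# Hypothetical: (fam, opcode) pairs this implementation handles in hardware.
HW_SUPPORTED = {(0b00, "fadd"), (0b00, "fdiv"), (0b01, "fdiv")}


def hw_execute(fam, op, a, b):
    """Model of the hardware: traps on any unsupported fam-instruction pair."""
    if (fam, op) not in HW_SUPPORTED:
        raise IllegalInstruction((fam, op))
    return OPS[op](a, b), "hardware"


def emulate(fam, op, a, b):
    """Trap handler: a fully accurate result satisfies any looser mode."""
    return OPS[op](a, b), "emulated"


def execute(fam, op, a, b):
    try:
        return hw_execute(fam, op, a, b)
    except IllegalInstruction:
        return emulate(fam, op, a, b)


print(execute(0b00, "fadd", 1.0, 2.0))  # (3.0, 'hardware')
print(execute(0b10, "fdiv", 1.0, 4.0))  # (0.25, 'emulated')
```

The `"hardware"` / `"emulated"` tag is returned here only to show which path ran; in a real system the numeric result would be all that the program sees.
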
If the bits are in FCSR, then the switch itself would be exposed to
user mode. By design, however, user mode would not be able to tell
emulated instructions from hardware-supported ones. Detecting that
would require some platform-specific code.

TODO:

Define a mechanism (CSR? syscall?) for user-mode code to detect which
modes are emulated, if the supervisor decides to make the emulation
visible. That would allow user code to switch to faster software
implementations if it chooses to.

TODO:

Choose which accuracy modes are required.

Which accuracy modes should be included is a question outside of
my expertise and would require a literature review of instruction
frequency in key workloads, PPA analysis of simple and advanced
implementations, etc.

TODO: reduced accuracy

I don't see why Unix should be required to emulate some arbitrary
reduced-accuracy ML mode. My guess would be that the Unix Platform Spec
requires support for IEEE, whereas an arbitrary ML platform requires
support for Mode XYZ. Of course, implementations of either platform
would be free to support any/all modes that they find valuable.
Compiling for a specific platform means that support for the required
accuracy modes is guaranteed (and therefore does not need discovery
sequences), while portable code can still execute discovery
sequences to detect support for alternative accuracy modes.

# Dynamic accuracy CSR <a name="dynamic"></a>

FCSR is to be modified to include accuracy bits:

| 31....11 | 10..8 | 7..5 | 4....0 |
| -------- | ----- | ---- | ------ |
| reserved | facc  | frm  | fflags |

The values for the field facc to include the following:

| facc  | mode    | description                 |
| ----- | ------- | --------------------------- |
| 0b000 | IEEE754 | correctly rounded           |
| 0b010 | ULP<1   | Unit in Last Place < 1      |
| 0b100 | Vulkan  | Vulkan compliant            |
| 0b110 | Appx    | Machine Learning            |

(TODO: review alternative idea: ULP0.5, ULP1, ULP2, ULP4, ULP16)

Notes:

* facc=0 matches current RISC-V behaviour, where these bits were formerly reserved and set to zero.
* The format of the operands and result remains the same for
  all opcodes. The only change is in the *accuracy* of the result, not
  its format.
* facc sets the *minimum* accuracy. It is acceptable to provide *more* accurate results than is requested by a given facc mode (although, clearly, the opportunity for reduced power and latency would be missed).

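The field layout and facc encodings from the two tables above can be sketched as plain bit manipulation. The helper names `get_field` and `set_field` are hypothetical; the bit positions and mode values are taken directly from the tables.

```python
# (shift, mask) pairs taken from the FCSR layout table above.
FFLAGS_SHIFT, FFLAGS_MASK = 0, 0b11111  # bits 4..0
FRM_SHIFT, FRM_MASK = 5, 0b111          # bits 7..5
FACC_SHIFT, FACC_MASK = 8, 0b111        # bits 10..8

# facc encodings from the draft table above.
FACC_MODES = {0b000: "IEEE754", 0b010: "ULP<1", 0b100: "Vulkan", 0b110: "Appx"}


def get_field(fcsr: int, shift: int, mask: int) -> int:
    """Extract one field from an FCSR value."""
    return (fcsr >> shift) & mask


def set_field(fcsr: int, shift: int, mask: int, value: int) -> int:
    """Return fcsr with one field replaced, leaving the others untouched."""
    if value & ~mask:
        raise ValueError(f"value {value:#b} does not fit in the field")
    return (fcsr & ~(mask << shift)) | (value << shift)


legacy = 0b001_00001  # frm = 0b001, fflags = 0b00001, upper bits zero
updated = set_field(legacy, FACC_SHIFT, FACC_MASK, 0b100)

print(FACC_MODES[get_field(legacy, FACC_SHIFT, FACC_MASK)])   # IEEE754
print(FACC_MODES[get_field(updated, FACC_SHIFT, FACC_MASK)])  # Vulkan
print(get_field(updated, FRM_SHIFT, FRM_MASK))                # 1
```

An FCSR value written by existing software has the formerly reserved bits at zero, so it decodes as the IEEE754 mode, which is the compatibility property the first note relies on.
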
## Discussion

Maybe a solution would be to add an extra field to the FP control CSR
to allow selecting one of several accurate or fast modes:

- machine-learning-mode: as fast as possible
  (maybe need additional requirements such as monotonicity for atanh?)
- GPU-mode: accurate to within a few ULP
  (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
- almost-accurate-mode: accurate to <1 ULP
  (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

Extra mode suggestions:

It might be reasonable to add a mode saying you're prepared to accept
worse than 0.5 ULP accuracy, perhaps with a few options: 1, 2, 4,
16 or something like that.

Question: should better accuracy than is requested be permitted? Example:
Amdahl-370 issues.

Comments:

Yes, embedded systems can typically make do with 12, 16 or 32 bit
accuracy; they rarely require 64 bits. But the idea of making
a low-power 32-bit FPU/DSP that can accommodate 64 bits is already
being done in other designs such as PIC etc., I believe. For embedded
graphics 16 bit is more than adequate. In fact, Cornell had a very
innovative 18-bit floating point format described here (useful for
FPGA designs with 18-bit DSPs):

<https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>

A very interesting GPU using the 18-bit FPU is also described here:

<https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>

There are also 8 and 9-bit floating point formats that could be useful:

<https://en.wikipedia.org/wiki/Minifloat>

### Function accuracy in standards (OpenCL, Vulkan)

Vulkan requires full IEEE754 precision for all F/D instructions except for fdiv and fsqrt.

<https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#spirvenv-precision-operation>

OpenCL is slightly different; suggest adding it as an extra entry.

<https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps>

The search below finds version 2.1 of the OpenCL environment specification (table 8.4.1). It needs checking whether that table is the same as the one linked above, which has "SPIRV" in it and is version 2.2, not 2.1.

<https://www.google.com/search?q=opencl+environment+specification>

Version 2.1 is superseded by 2.2:
<https://github.com/KhronosGroup/OpenCL-Docs/blob/master/env/numerical_compliance.asciidoc>

### Compliance

Dan Petroski:

It’s a bit more complicated than that. Different FP
representations/algorithms have different quantization ranges, so you
can get more or less precise depending on how large the arguments are.

For instance, machine A can compute within ULP3 from 0 to 10000, but
ULP2 from 10000 upwards. Machine B can compute within ULP2 from 0 to
6000, then ULP3 for 6000+. How do you design a compliance suite which
guarantees behavior across all fpaccs?

and from Allen Baum:

In the example above, you'd need a ratified spec with the defined
ranges (possibly per range and per op) - and then implementations
would need to at least meet that spec (but could be more accurate)

so - not impossible, but a lot more work to write different kinds
of tests than standard IEEE compatible tests would have.

And, by the way, if you want it to be a ratified spec, it needs a
compliance suite, and whoever has defined the spec is responsible
for writing it.

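The exchange above can be illustrated with a small sketch that compares per-range ULP bounds of an implementation against those of a spec. It encodes machines A and B from Dan Petroski's example, and assumes half-open ranges that all start at 0 and extend to infinity (the boundary handling is an assumption, as the example does not state it). The spec tables and function names are hypothetical.

```python
import math

# Each entry: (lower bound inclusive, upper bound exclusive, max error in ULP).
MACHINE_A = [(0, 10000, 3), (10000, math.inf, 2)]
MACHINE_B = [(0, 6000, 2), (6000, math.inf, 3)]

# Hypothetical specs of the kind a ratified document might define.
SPEC_LOOSE = [(0, math.inf, 3)]
SPEC_RANGED = [(0, 10000, 3), (10000, math.inf, 2)]


def bound_at(ranges, x):
    """ULP bound that applies to argument magnitude x."""
    for lo, hi, max_ulp in ranges:
        if lo <= x < hi:
            return max_ulp
    raise ValueError(f"{x} is not covered by any range")


def meets_spec(impl, spec):
    """True if impl is at least as accurate as spec over every range.

    The bounds are piecewise constant, so it is enough to compare them
    at the start of every range appearing in either table.
    """
    points = {lo for lo, _, _ in impl} | {lo for lo, _, _ in spec}
    return all(bound_at(impl, x) <= bound_at(spec, x) for x in points)


print(meets_spec(MACHINE_A, SPEC_LOOSE), meets_spec(MACHINE_B, SPEC_LOOSE))
# True True
print(meets_spec(MACHINE_A, SPEC_RANGED), meets_spec(MACHINE_B, SPEC_RANGED))
# True False
```

This only compares declared bounds. An actual compliance suite would still have to measure the error of real results in each range and for each operation, which is the extra work Allen Baum refers to.
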