[[!tag standards]]

# FP Accuracy proposal

Credits:

* Bruce Hoult
* Allen Baum
* Dan Petroski
* Jacob Lifshay

TODO: complete writeup

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>

Zfpacc: a proposal to allow implementations to dynamically set the
bit-accuracy of floating-point results, trading accuracy for speed
(reduced latency) or speed for accuracy (higher latency) *at runtime*.
IEEE754 format is preserved: instruction operand and result format
requirements are unmodified by this proposal. Only the accuracy of the
instruction *result*, measured in ULP (Units in the Last Place), is
permitted to meet alternative requirements, whilst still retaining the
instruction's requested format.

This proposal is *only* suitable for adding pre-existing accuracy standards
where it is clearly established, well in advance of applications being
written that conform to that standard, that dealing with variations in
accuracy across hardware implementations is the responsibility of the
application writer. This is the case for both Vulkan and OpenCL.

This proposal is *not* suitable for inclusion of "de-facto" (proprietary)
accuracy standards (historic IBM Mainframe vs Amdahl incompatibility)
where there was no prior agreement or notification to application
writers that variations in accuracy across hardware implementations
would occur. In the unlikely event that they *are* ever to be included
in the future, rather than as a Custom Extension, then, unlike Vulkan
and OpenCL, they must **only** be added as "bit-for-bit compatible".

# Extension of FCSR

Zfpacc would use some of the reserved bits of FCSR. It would be treated
very similarly to how dynamic frm is treated.

frm is treated as follows:

* Floating-point operations use either a static rounding mode encoded
  in the instruction, or a dynamic rounding mode held in frm.
* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec.
* A value of 111 in the instruction’s rm field selects the dynamic rounding
  mode held in frm. If frm is set to an invalid value (101–111),
  any subsequent attempt to execute a floating-point operation with a
  dynamic rounding mode will raise an illegal instruction exception.

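The frm rules listed above can be sketched in C as follows. This is a minimal illustration, not text from the RISC-V specification: the function name `resolve_rounding_mode` and the constant `RM_DYN` are hypothetical, and rejecting the static encodings 101 and 110 in the same way as invalid frm values is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define RM_DYN 0x7u /* rm == 111: use the dynamic mode held in frm */

/* Resolve the effective rounding mode for one FP instruction.
 * Returns true and stores the mode in *out when it is valid.
 * Returns false when an illegal instruction exception is due.
 * Assumption: static encodings 101 and 110 are rejected as well. */
static bool resolve_rounding_mode(uint32_t insn_rm, uint32_t frm,
                                  uint32_t *out)
{
    uint32_t mode = ((insn_rm & 0x7u) == RM_DYN) ? (frm & 0x7u)
                                                 : (insn_rm & 0x7u);
    if (mode >= 0x5u) /* 101..111 are not valid rounding modes */
        return false;
    *out = mode;
    return true;
}
```
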
If we wish to support up to 4 accuracy modes, that would require 2 'fam'
bits. The Default would be IEEE754-compliant, encoded as 00. This means
that all current hardware would be compliant with the default mode.

Unsupported modes cause a trap to allow emulation where traps are supported.
Emulation of unsupported modes would be required for UNIX platforms.
As with frm, an implementation may choose to support any permutation
of dynamic fam-instruction pairs. It raises an illegal-instruction trap upon
executing an unsupported fam-instruction pair. The implementation can
then emulate the accuracy mode required.

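One way to picture "any permutation of fam-instruction pairs" is a per-operation capability mask. The sketch below is only an illustration under assumptions: the operation list `enum fp_op`, the structure `fam_caps` and the function `fam_traps` are hypothetical names, and it assumes the 2-bit fam field discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical set of operation classes, for illustration only. */
enum fp_op { OP_FADD, OP_FMUL, OP_FDIV, OP_FSQRT, OP_COUNT };

/* Bit n of hw_modes[op] set: accuracy mode n of that operation is
 * implemented in hardware. Bit 0 (IEEE754) is expected to be set. */
struct fam_caps {
    uint8_t hw_modes[OP_COUNT];
};

/* True when executing op under the given fam value must raise an
 * illegal-instruction trap so that the mode can be emulated. */
static bool fam_traps(const struct fam_caps *caps, enum fp_op op,
                      unsigned fam)
{
    return ((caps->hw_modes[op] >> (fam & 0x3u)) & 1u) == 0;
}
```
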
If the bits are in FCSR, then the switch itself would be exposed to
user mode. User mode would not, however, be able to tell emulated
instructions from hardware-supported ones (by design). Detecting that
would require some platform-specific code.

TODO:

A mechanism (csr? syscall?) for user-mode code to detect which modes are
emulated, if the supervisor decides to make the emulation visible. That
would allow user code to switch to faster software implementations
if it chooses to.

TODO:

Choose which accuracy modes are required

Which accuracy modes should be included is a question outside of
my expertise and would require a literature review of instruction
frequency in key workloads, PPA analysis of simple and advanced
implementations, etc.

TODO: reduced accuracy

I don't see why Unix should be required to emulate some arbitrary
reduced accuracy ML mode. My guess would be that Unix Platform Spec
requires support for IEEE, whereas arbitrary ML platform requires
support for Mode XYZ. Of course, implementations of either platform
would be free to support any/all modes that they find valuable.
Compiling for a specific platform means that support for required
accuracy modes is guaranteed (and therefore does not need discovery
sequences), while allowing portable code to execute discovery
sequences to detect support for alternative accuracy modes.

# Dynamic accuracy CSR <a name="dynamic"></a>

FCSR to be modified to include accuracy bits:

| 31....11 | 10..8  | 7..5 | 4....0 |
| -------- | ------ | ---- | ------ |
| reserved | facc   | frm  | fflags |

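The field positions in the layout above translate directly into shift-and-mask helpers. The following is a minimal sketch operating on a plain 32-bit value rather than on the real CSR; the helper names are hypothetical, and the facc position is the one proposed here, not part of any ratified specification.

```c
#include <stdint.h>

/* Field positions taken from the proposed FCSR layout. */
#define FCSR_FFLAGS_SHIFT 0
#define FCSR_FFLAGS_MASK  0x1fu
#define FCSR_FRM_SHIFT    5
#define FCSR_FRM_MASK     0x7u
#define FCSR_FACC_SHIFT   8 /* proposed field, bits 10..8 */
#define FCSR_FACC_MASK    0x7u

static inline uint32_t fcsr_get_fflags(uint32_t fcsr)
{
    return (fcsr >> FCSR_FFLAGS_SHIFT) & FCSR_FFLAGS_MASK;
}

static inline uint32_t fcsr_get_frm(uint32_t fcsr)
{
    return (fcsr >> FCSR_FRM_SHIFT) & FCSR_FRM_MASK;
}

static inline uint32_t fcsr_get_facc(uint32_t fcsr)
{
    return (fcsr >> FCSR_FACC_SHIFT) & FCSR_FACC_MASK;
}

/* Return a copy of fcsr with facc replaced; other fields unchanged. */
static inline uint32_t fcsr_set_facc(uint32_t fcsr, uint32_t facc)
{
    fcsr &= ~(FCSR_FACC_MASK << FCSR_FACC_SHIFT);
    return fcsr | ((facc & FCSR_FACC_MASK) << FCSR_FACC_SHIFT);
}
```
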
The values for the field facc to include the following:

| facc  | mode    | description         |
| ----- | ------- | ------------------- |
| 0b000 | IEEE754 | correctly rounded   |
| 0b010 | ULP<1   | Unit Last Place < 1 |
| 0b100 | Vulkan  | Vulkan compliant    |
| 0b110 | Appx    | Machine Learning    |

(TODO: review alternative idea: ULP0.5, ULP1, ULP2, ULP4, ULP16)

Notes:

* facc=0 to match current RISC-V behaviour, where these bits were formerly reserved and set to zero.
* The format of the operands and result remains the same for
  all opcodes. The only change is in the *accuracy* of the result, not
  its format.
* facc sets the *minimum* accuracy. It is acceptable to provide *more* accurate results than are requested by a given facc mode (although, clearly, the opportunity for reduced power and latency would be missed).

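For reference, the facc encodings from the table can be written as a C enumeration. This is a sketch only: the identifier names are hypothetical, and treating the four unassigned encodings (0b001, 0b011, 0b101, 0b111) as undefined is an assumption, since the proposal does not yet say what they mean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encodings copied from the facc table; names are illustrative. */
enum facc_mode {
    FACC_IEEE754 = 0x0, /* correctly rounded */
    FACC_ULP_LT1 = 0x2, /* error below 1 ULP */
    FACC_VULKAN  = 0x4, /* Vulkan compliant */
    FACC_APPX    = 0x6  /* machine learning */
};

/* True when the 3-bit value is one of the encodings in the table.
 * Assumption: every other encoding is treated as undefined. */
static bool facc_is_defined(uint32_t facc)
{
    switch (facc & 0x7u) {
    case FACC_IEEE754:
    case FACC_ULP_LT1:
    case FACC_VULKAN:
    case FACC_APPX:
        return true;
    default:
        return false;
    }
}
```
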
## Discussion

Maybe a solution would be to add an extra field to the fp control csr
to allow selecting one of several accurate or fast modes:

- machine-learning-mode: fast as possible
  (maybe need additional requirements such as monotonicity for atanh?)
- GPU-mode: accurate to within a few ULP
  (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
- almost-accurate-mode: accurate to <1 ULP
  (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

Extra mode suggestions:

It might be reasonable to add a mode saying you're prepared to accept
worse than 0.5 ULP accuracy, perhaps with a few options: 1, 2, 4,
16 or something like that.

Question: should better accuracy than is requested be permitted? Example:
Amdahl-370 issues.

Comments:

Yes, embedded systems typically can do with 12, 16 or 32 bit
accuracy. Rarely does it require 64 bits. But the idea of making
a low power 32 bit FPU/DSP that can accommodate 64 bits is already
being done in other designs such as PIC etc I believe. For embedded
graphics 16 bit is more than adequate. In fact, Cornell had a very
innovative 18-bit floating point format described here (useful for
FPGA designs with 18-bit DSPs):

<https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>

A very interesting GPU using the 18-bit FPU is also described here:

<https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>

There are also 8 and 9-bit floating point formats that could be useful:

<https://en.wikipedia.org/wiki/Minifloat>

### Function accuracy in standards (OpenCL, Vulkan)

[[resources]] for OpenCL and Vulkan

Vulkan requires full IEEE754 precision for all F/D instructions except for fdiv and fsqrt.

<https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation>

Source is here:
<https://github.com/KhronosGroup/Vulkan-Docs/blob/master/appendices/spirvenv.txt#L1172>

OpenCL is slightly different; suggest adding it as an extra entry.

<https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps>

The search link below finds version 2.1 of the OpenCL environment specification (table 8.4.1). It needs checking whether that table is the same as the one linked above, which has "SPIRV" in it and is version 2.2, not 2.1.

<https://www.google.com/search?q=opencl+environment+specification>

Version 2.1 is superseded by 2.2:
<https://github.com/KhronosGroup/OpenCL-Docs/blob/master/env/numerical_compliance.asciidoc>

### Compliance

Dan Petroski:

It’s a bit more complicated than that. Different FP
representations/algorithms have different quantization ranges, so you
can get more or less precise depending on how large the arguments are.

For instance, machine A can compute within ULP3 from 0 to 10000, but
ULP2 from 10000 upwards. Machine B can compute within ULP2 from 0 to
6000, then ULP3 for 6000+. How do you design a compliance suite which
guarantees behavior across all fpaccs?

and from Allen Baum:

In the example above, you'd need a ratified spec with the defined
ranges (possibly per range and per op) - and then implementations
would need to at least meet that spec (but could be more accurate)

so - not impossible, but a lot more work to write different kinds
of tests than standard IEEE compatible test would have.

And, by the way, if you want it to be a ratified spec, it needs a
compliance suite, and whoever has defined the spec is responsible
for writing it.

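The "defined ranges per op" idea can be made concrete with a small checker. It is only a sketch: `struct ulp_range`, `ulp_of`, `ulp_error` and `within_spec` are hypothetical names, the table contents would have to come from a ratified spec, and measuring the error in single-precision ULPs at the magnitude of a double-precision reference is one of several possible ULP conventions. Finite, non-maximal values are assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of a hypothetical per-operation accuracy table: for inputs
 * in [lo, hi) the result must be within max_ulp ULPs of the reference. */
struct ulp_range {
    double lo, hi, max_ulp;
};

/* Spacing between |r| and the next larger float (finite r assumed). */
static double ulp_of(float r)
{
    float a = (r < 0.0f) ? -r : r;
    float next;
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits += 1u; /* next representable magnitude */
    memcpy(&next, &bits, sizeof next);
    return (double)next - (double)a;
}

/* Error of a float result against a double reference, in float ULPs. */
static double ulp_error(float got, double ref)
{
    double diff = (double)got - ref;
    if (diff < 0.0)
        diff = -diff;
    return diff / ulp_of((float)ref);
}

/* True when the result for input x meets the bound of its range. */
static bool within_spec(const struct ulp_range *tab, size_t n,
                        double x, float got, double ref)
{
    for (size_t i = 0; i < n; i++)
        if (x >= tab[i].lo && x < tab[i].hi)
            return ulp_error(got, ref) <= tab[i].max_ulp;
    return false; /* input outside every specified range */
}
```
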