# FPUs and nmigen

[nmigen](https://github.com/m-labs/nmigen) by
[m-labs](http://m-labs.hk/migen/index.html) is turning out to be a very
interesting choice. It's not a panacea: however, the fact remains that
verilog is simply never going to be as up-to-date as python, or gain the
advanced and powerful features that python has, and, in all seriousness,
it never should. Instead, it is quite reasonable to treat verilog, in
effect, as machine code (a compiler target).

However, it is critical to remember that, despite the code being written
in python, the underlying rules to obey are those of hardware, not
software. Modules (and how to use them) are not the same thing - at all -
as functions and function calls, and classes are definitely not synonymous
with modules. This update outlines some of the quirks encountered.

# Modules != classes

The initial conversion process of John Dawson's IEEE754 FPU verilog code
to nmigen went extremely well and very rapidly. Where things began to
come unstuck, for over a week, was in the effort to "pythonify" the code,
with a view to converting a Finite State Machine design into a pipeline.
The initial efforts focussed on splitting out the relevant sections of
code into python classes and functions, to be followed up by subsequently
converting those into modules (actual verilog modules, rather than "python"
modules).

John's design is based around the use of global variables. The code moves
from state to state, using the global variables to store forward progress.
A pipeline requires the use of *local* variables (local to each stage),
where the output from one stage is connected, time-synchronised, as the
input to the next. Aleksander encountered some excellent work by
Dan Gisselquist on
[zipcpu](https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html),
which describes various pipeline strategies, including one which involves
buffered handshakes. It turns out that John's code (as a unit) in fact
conforms to the very paradigm that Dan describes. However, John's code
also has stages that perform shifting one bit at a time, for normalisation
of the floating-point result. The global internal variable is updated by
one bit every cycle, and that's not how pipelines work: it's an imperative
prerequisite that a pipeline stage do its work in a *single* cycle.

So the first incremental change was to split out each stage (alignment,
normalisation, the actual add, rounding and so on) into separate classes.

It was a mess.

The problem is that, whereas in computer programming languages it is normal
to have a variable that can be updated (overwritten), hardware is parallel
and doesn't like it when more than one piece of "code" tries to update the
same "variable". Outputs *have* to be separated from inputs. So although
the "code" to do some work may be split out into a separate class, it's
necessary to also cleanly separate the inputs from the outputs. *No*
variables may be overwritten without being properly protected, and in a
pipeline paradigm, global variables are not an option.
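
To illustrate the principle (a minimal sketch only - names like
StageExample, i_data and o_data are made up for illustration, and this is
not the actual FPU code), a pipeline stage keeps its input and output as
separate signals, and only ever drives the output:

    from nmigen import Module, Signal

    class StageExample:
        """ Hypothetical pipeline stage: inputs and outputs are separate
            Signals, and nothing is overwritten in place. """
        def __init__(self, width):
            self.i_data = Signal(width)  # driven by the previous stage
            self.o_data = Signal(width)  # drives the next stage

        def elaborate(self, platform):
            m = Module()
            # the output is computed from the input; the input is never
            # written to from inside this stage.
            m.d.comb += self.o_data.eq(self.i_data + 1)
            return m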

In addition, modules need to be "connected" to the places where they are
used. It's not possible to "call" a module, passing in parameters, and
expect the inputs and outputs to magically work: nmigen is a different
paradigm, where connections have to be made explicitly, using either
"sync" or "comb" - clock-synchronisation or combinatorial logic.

If you use "comb", nmigen generates hardware that is updated immediately
from its inputs. However, if you use "sync", nmigen knows to auto-generate
hardware where, on the **next** cycle, the result is updated from its
inputs. The problem with converting code over to a module, using local
inputs and outputs, *and* removing globals, is that it's too many things
to tackle at once.
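
Here's a small sketch of the difference (signal names are made up; this
is just the general pattern, not code from the FPU):

    from nmigen import Module, Signal

    m = Module()
    a = Signal(8)
    b = Signal(8)
    x = Signal(9)
    y = Signal(9)

    # "comb": x follows a + b immediately (combinatorial logic)
    m.d.comb += x.eq(a + b)

    # "sync": y only takes the value of a + b on the *next* clock edge
    m.d.sync += y.eq(a + b)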

It took about ten days to work all this out, keeping the unit tests running
at all times and using their success or failure as an indicator of whether
things were on track. Eventually however it all worked out.

# Add Example module

It's worthwhile showing some of the pipeline stages. Here's the python
nmigen code for the adder first stage:

{add_code_screenshot.png}

A prerequisite is that an "alignment" phase was run, which ensured that
the exponents were both the same, so there is no need in this phase to
perform bit-shifting of the mantissas: that's already been handled.

There are two inputs (in_a and in_b) and one output (out_z): these are
modules in their own right, each containing a sign, mantissa and exponent.
in_a.m is the mantissa of input A, for example. So the first thing is that
four intermediate variables are created: one for testing whether the
signs of A and B are equal (or not), a second for comparing the
mantissas of A and B, and two further intermediates to store the
mantissas of A and B zero-extended by one bit.

Next we have some simple combinatorial tests: if the signs were the
same, we perform an add of A and B's mantissas, storing the result in Z's
mantissa. If we get to the next "If" statement, we know that this
is to be a subtraction, not an addition. However, for subtraction,
it matters which way round the subtraction is done, depending on
which of A or B is the larger.

It's really quite straightforward, and the important thing here is to
note that the code is properly commented. It's not the most compact
code in the world, and it's not the prettiest-looking. Python cannot
handle overloading of the assignment operator (not without overloading
getattr and setattr, that is), so nmigen creates and uses a method
named "eq" to handle assignment.
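
Pieced together from the description above, here is a rough sketch of the
shape of that first stage. The class and signal names (FPNum, AddStage,
seq, mge, am0, bm0) and the widths are illustrative guesses, not the
actual repository code:

    from nmigen import Module, Signal, Cat

    class FPNum:
        """ hypothetical container for one operand's sign, exponent and
            mantissa (the real code's classes have rather more to them) """
        def __init__(self, m_width=27, e_width=10):
            self.s = Signal()          # sign
            self.e = Signal(e_width)   # exponent
            self.m = Signal(m_width)   # mantissa

    class AddStage:
        def __init__(self):
            self.in_a  = FPNum()
            self.in_b  = FPNum()
            self.out_z = FPNum(m_width=29)  # wide enough to hold the carry

        def elaborate(self, platform):
            m = Module()
            seq = Signal()                      # are the signs equal?
            mge = Signal()                      # is mantissa A >= mantissa B?
            am0 = Signal(len(self.in_a.m) + 1)  # mantissa A, zero-extended
            bm0 = Signal(len(self.in_b.m) + 1)  # mantissa B, zero-extended
            m.d.comb += [
                seq.eq(self.in_a.s == self.in_b.s),
                mge.eq(self.in_a.m >= self.in_b.m),
                am0.eq(Cat(self.in_a.m, 0)),    # Cat is LSB-first
                bm0.eq(Cat(self.in_b.m, 0)),
            ]
            with m.If(seq):          # same sign: add the mantissas
                m.d.comb += self.out_z.m.eq(am0 + bm0)
            with m.Elif(mge):        # signs differ, A is larger: A - B
                m.d.comb += self.out_z.m.eq(am0 - bm0)
            with m.Else():           # signs differ, B is larger: B - A
                m.d.comb += self.out_z.m.eq(bm0 - am0)
            # (sign and exponent handling omitted for brevity)
            return m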

One aspect of this project that's considered to be extremely important
is to do a visual inspection of each module. Here's what add looks like
when the yosys "show" command is run on it:

{add_graph.png}
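
For reference, one way to obtain such a graph (a hedged sketch, assuming
nmigen's verilog backend and the hypothetical AddStage sketch above) is to
convert the module to verilog and hand the result to yosys:

    from nmigen.back import verilog

    add = AddStage()
    ports = [add.in_a.s, add.in_a.e, add.in_a.m,
             add.in_b.s, add.in_b.e, add.in_b.m,
             add.out_z.s, add.out_z.e, add.out_z.m]
    with open("add.v", "w") as f:
        f.write(verilog.convert(add.elaborate(None), ports=ports))

    # then, in a shell:  yosys -p "read_verilog add.v; proc; show"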

On the left it can be seen that the names are a bit of a mess: the members
of A and B (s, e and m) are extracted and, because they clash, are given
auto-generated names. m can be seen to go into a square (a graphviz module)
with "e" and "m" on it, in a box named "add0_n_a". That's the name we
chose to give to the submodule in the nmigen code, shown above, purely
so that it would be easy to visually identify in the graphviz output.

Note that there is an arrow into a block that takes m (bits 26 down to 0)
and a single-bit zero, and outputs them Concatenated together: these then
go into a diamond-block named "am0". We've identified am0 from the python
code!

The m (mantissa A) and m$2 (mantissa B) also go into $9, a "ge" (Greater
or Equal) operator, which in turn goes to a diamond-block named "mge":
this is the check to see which of the mantissas is larger!

Then we can see $15, $12 and $18 are add and subtraction operations,
which feed through to a selection procedure ($group_5), which ultimately
goes into the "out_tot" variable. This is the mantissa output of the
addition.

So with a little careful analysis, by tracking the names of the inputs,
intermediates and outputs, we can verify that the resultant auto-generated
output from nmigen looks reasonable. The thing is: the module has been
*deliberately* kept small so as to be able to do *exactly this*. One of
the reasons for this is illustrated below.

# Where things go horribly wrong

In nmigen, it's perfectly possible to use python variables to assign
(accumulate) intermediate results, without actually storing them in
actual "named" hardware (so to speak). Note in the add above, that
the tests for the If and Elif statements were placed into intermediate
variables? The reason for this was that if they were not, yosys
**duplicated** the expressions. Here's an example of where that goes
horribly wrong. Note the simple-looking innocuous code, below:

{shift_screenshot.png}

sm.rshift basically does a variable-length right shift (the ">>" operator
in both python and verilog). Note the assignment of the intermediary
calculation m_mask to a python temporary variable. Note the commented-out
code, which uses the "reduce" operator to OR together all of the bits of
a *secondary* expression, which ANDs all of the bits of "m_mask" with
the input mantissa? Watch what happens when that's handed over to yosys:

{align_single_fail.png}

It's an absolute mess. If you zoom in close on the left side, what's
happened is that the shift expression has been **multiplied** (duplicated)
a whopping **twenty-four** times (this is a 32-bit FP number, so the mantissa
is 24 bits). The reason is that the reduce operation needed 24 copies
of the input in order to select one bit at a time. Then, on the right
hand side, each bit is ORed in a chain with the previous bit, exactly as
would be expected of a sequential processor performing a "reduce"
operation.

On seeing this graph output, it was immediately apparent that it would be
totally unacceptable, yet from the python nmigen code it is not in the
slightest bit obvious that there's a problem. **This is why the yosys
"show" output is so important**.

On further investigation, it was discovered that nmigen has a "bool"
function, which ORs all of the bits of a variable together. In yosys it
even has a name: "reduce_bool". Here's the graph output once that function
has been used instead:

{align_single_clean.png}

*Now* we are looking at something that's much clearer, smaller, cleaner and
easier to understand. It's still quite amazing how so few lines of code
can turn into something so comprehensive. The line of "1s" (11111111...)
is where the variable "m_mask" gets created: this line of "1s" is
right-shifted to create the mask. In the box named "$43" it is then ANDed
with the original mantissa, reduced to a single boolean OR ($44, a
$reduce_bool operation), and so on.
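
To make the contrast concrete, here is a hedged sketch of the two
approaches (widths and signal names are illustrative, and the plain ">>"
stands in for the rshift helper; this is not the actual repository code):

    from functools import reduce   # only needed for the commented-out variant
    from operator import or_
    from nmigen import Module, Signal, Const

    m = Module()
    mantissa = Signal(24)
    shift    = Signal(5)
    sticky   = Signal()

    # a row of 1s, right-shifted by a variable amount: the mask
    m_mask = Const((1 << 24) - 1, 24) >> shift

    # roughly what the commented-out code did: python's reduce ORs the
    # 24 masked bits one at a time, so the masked expression ends up
    # duplicated once per bit by the time it reaches yosys.
    # m.d.comb += sticky.eq(reduce(or_, [(mantissa & m_mask)[i]
    #                                    for i in range(24)]))

    # the cleaner version: let nmigen reduce the AND result to one bit
    m.d.comb += sticky.eq((mantissa & m_mask).bool())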

This shift-mask is basically for the creation of the "sticky" bit in
IEEE754 rounding. It's essential to get right, and it's an essential part
of IEEE754 Floating-Point. By doing this kind of close visual inspection,
keeping things to small, compact modules, maintaining comprehensive unit
test coverage and making incremental, minimalist changes, we stand a
reasonable chance of not making huge glaring design errors, and of being
able to bootstrap up to a decent design.

Not knowing how to do something is not an excuse for not trying. Having a
strategy for working things out is essential to succeeding, even when
faced with a huge number of unknowns. Go from known-good to known-good,
create the building blocks first, make sure that they're reasonable, make
sure that they're comprehensively unit-tested; then incremental changes can
be attempted with the confidence that mistakes will be weeded out
immediately by a unit test failing when it should not.

However, as this update demonstrates, both of those versions of the
normalisation alignment produced the correct answer, yet one of them was
deeply flawed. Even code that produces the correct answer may have design
flaws: that's what the visual inspection is for.