# FPUs and nmigen

[nmigen](https://github.com/m-labs/nmigen) by
[m-labs](http://m-labs.hk/migen/index.html) is turning out to be a very
interesting choice. It's not a panacea: however, the fact remains that
verilog is simply never going to be as up-to-date as python, or gain the
advanced and powerful features that python has, and, in all seriousness,
it never should. Instead, it is quite reasonable to treat verilog, in
effect, as machine code (a compiler target).

However, it is critical to remember that, despite the code being written
in python, the underlying rules to obey are those of hardware, not
software. Modules (and how to use them) are not the same thing - at all -
as functions and function calls, and classes are definitely not synonymous
with modules. This update outlines some of the quirks encountered.

# Modules != classes

The initial conversion process of John Dawson's IEEE754 FPU verilog code
to nmigen went extremely well and very rapidly. Where things began to
come unstuck, for over a week, was in the effort to "pythonify" the code,
with a view to converting a Finite State Machine design into a pipeline.
The initial efforts focussed on splitting out the relevant sections of
code into python classes and functions, to be followed up by subsequently
converting those into modules (actual verilog modules, rather than "python"
modules).

John's design is based around the use of global variables. The code moves
from state to state, using the global variables to store forward progress.
A pipeline requires the use of *local* variables (local to each stage),
where the output from one stage is connected, time-synchronised, as the
input to the next. Aleksander encountered some excellent work by
Dan Gisselquist on
[zipcpu](https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html),
which describes various pipeline strategies, including one which involves
buffered handshakes. It turns out that John's code (as a unit) in fact
conforms to the very paradigm that Dan describes. However, John's code
also has stages that perform shifting one bit at a time, for normalisation
of the floating-point result. The global internal variable is updated by
one bit every cycle, and that's not how pipelines work: it's an imperative
prerequisite that a pipeline stage do its work in a *single* cycle.

So the first incremental change was to split out each stage (alignment,
normalisation, the actual add, rounding and so on) into separate classes.

It was a mess.

The problem is that, whereas in computer programming languages it is normal
to have a variable that can be updated (overwritten), hardware is parallel
and doesn't like it when more than one piece of "code" tries to update the
same "variable". Outputs *have* to be separated from inputs. So although
the "code" to do some work may be split out into a separate class, it's
necessary to also cleanly separate the inputs from the outputs. *No*
variables may be overwritten without being properly protected, and in a
pipeline paradigm, global variables are not an option.
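
To illustrate the principle (a minimal sketch only - names like
StageExample, i_data and o_data are made up for illustration, and this is
not the actual FPU code), a pipeline stage keeps its input and output as
separate signals, and only ever drives the output:

    from nmigen import Module, Signal

    class StageExample:
        """ Hypothetical pipeline stage: inputs and outputs are separate
            Signals, and nothing is overwritten in place. """
        def __init__(self, width):
            self.i_data = Signal(width)  # driven by the previous stage
            self.o_data = Signal(width)  # drives the next stage

        def elaborate(self, platform):
            m = Module()
            # the output is computed from the input; the input is never
            # written to from inside this stage.
            m.d.comb += self.o_data.eq(self.i_data + 1)
            return m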

In addition, modules need to be "connected" to the places where they are
used. It's not possible to "call" a module, passing in parameters, and
expect the inputs and outputs to magically work: nmigen is a different
paradigm, where connections have to be made explicitly, using either
"sync" or "comb" - clock-synchronisation or combinatorial logic.

If you use "comb", nmigen generates hardware that is updated immediately
from its inputs. However, if you use "sync", nmigen knows to auto-generate
hardware where, on the **next** cycle, the result is updated from its
inputs. The problem with converting code over to a module, using local
inputs and outputs, *and* removing globals, is that it's too many things
to tackle at once.
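
Here's a small sketch of the difference (signal names are made up; this
is just the general pattern, not code from the FPU):

    from nmigen import Module, Signal

    m = Module()
    a = Signal(8)
    b = Signal(8)
    x = Signal(9)
    y = Signal(9)

    # "comb": x follows a + b immediately (combinatorial logic)
    m.d.comb += x.eq(a + b)

    # "sync": y only takes the value of a + b on the *next* clock edge
    m.d.sync += y.eq(a + b)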

It took about ten days to work all this out, keeping the unit tests running
at all times and using their success or failure as an indicator of whether
things were on track. Eventually however it all worked out.

# Add Example module

It's worthwhile showing some of the pipeline stages. Here's the python
nmigen code for the adder first stage:

{add_code_screenshot.png}

A prerequisite is that an "alignment" phase was run, which ensured that
the exponents were both the same, so there is no need in this phase to
perform bit-shifting of the mantissas: that's already been handled.

There are two inputs (in_a and in_b) and one output (out_z): these are
modules in their own right, each containing a sign, mantissa and exponent.
in_a.m is the mantissa of input A, for example. So the first thing is that
four intermediate variables are created: one for testing whether the
signs of A and B are equal (or not), a second for comparing the
mantissas of A and B, and two further intermediates to store the
mantissas of A and B zero-extended by one bit.

Next we have some simple combinatorial tests: if the signs were the
same, we perform an add of A and B's mantissas, storing the result in Z's
mantissa. If we get to the next "If" statement, we know that this
is to be a subtraction, not an addition. However, for subtraction,
it matters which way round the subtraction is done, depending on
which of A or B is the larger.

It's really quite straightforward, and the important thing here is to
note that the code is properly commented. It's not the most compact
code in the world, and it's not the prettiest-looking. Python cannot
handle overloading of the assignment operator (not without overloading
getattr and setattr, that is), so nmigen creates and uses a method
named "eq" to handle assignment.
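
Pieced together from the description above, here is a rough sketch of the
shape of that first stage. The class and signal names (FPNum, AddStage,
seq, mge, am0, bm0) and the widths are illustrative guesses, not the
actual repository code:

    from nmigen import Module, Signal, Cat

    class FPNum:
        """ hypothetical container for one operand's sign, exponent and
            mantissa (the real code's classes have rather more to them) """
        def __init__(self, m_width=27, e_width=10):
            self.s = Signal()          # sign
            self.e = Signal(e_width)   # exponent
            self.m = Signal(m_width)   # mantissa

    class AddStage:
        def __init__(self):
            self.in_a  = FPNum()
            self.in_b  = FPNum()
            self.out_z = FPNum(m_width=29)  # wide enough to hold the carry

        def elaborate(self, platform):
            m = Module()
            seq = Signal()                      # are the signs equal?
            mge = Signal()                      # is mantissa A >= mantissa B?
            am0 = Signal(len(self.in_a.m) + 1)  # mantissa A, zero-extended
            bm0 = Signal(len(self.in_b.m) + 1)  # mantissa B, zero-extended
            m.d.comb += [
                seq.eq(self.in_a.s == self.in_b.s),
                mge.eq(self.in_a.m >= self.in_b.m),
                am0.eq(Cat(self.in_a.m, 0)),    # Cat is LSB-first
                bm0.eq(Cat(self.in_b.m, 0)),
            ]
            with m.If(seq):          # same sign: add the mantissas
                m.d.comb += self.out_z.m.eq(am0 + bm0)
            with m.Elif(mge):        # signs differ, A is larger: A - B
                m.d.comb += self.out_z.m.eq(am0 - bm0)
            with m.Else():           # signs differ, B is larger: B - A
                m.d.comb += self.out_z.m.eq(bm0 - am0)
            # (sign and exponent handling omitted for brevity)
            return m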

One aspect of this project that's considered to be extremely important
is to do a visual inspection of each module. Here's what add looks like
when the yosys "show" command is run on it:

{add_graph.png}
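
For reference, one way to obtain such a graph (a hedged sketch, assuming
nmigen's verilog backend and the hypothetical AddStage sketch above) is to
convert the module to verilog and hand the result to yosys:

    from nmigen.back import verilog

    add = AddStage()
    ports = [add.in_a.s, add.in_a.e, add.in_a.m,
             add.in_b.s, add.in_b.e, add.in_b.m,
             add.out_z.s, add.out_z.e, add.out_z.m]
    with open("add.v", "w") as f:
        f.write(verilog.convert(add.elaborate(None), ports=ports))

    # then, in a shell:  yosys -p "read_verilog add.v; proc; show"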

On the left it can be seen that the names are a bit of a mess: the members
of A and B (s, e and m) are extracted and, because they clash, are given
auto-generated names. m can be seen to go into a square (a graphviz module)
with "e" and "m" on it, in a box named "add0_n_a". That's the name we
chose to give to the submodule in the nmigen code, shown above, purely
so that it would be easy to visually identify in the graphviz output.

Note that there is an arrow into a block that takes m (bits 26 down to 0)
and a single-bit zero, and outputs them Concatenated together: these then
go into a diamond-block named "am0". We've identified am0 from the python
code!

The m (mantissa A) and m$2 (mantissa B) also go into $9, a "ge" (Greater
or Equal) operator, which in turn goes to a diamond-block named "mge":
this is the check to see which of the mantissas is larger!

Then we can see $15, $12 and $18 are add and subtraction operations,
which feed through to a selection procedure ($group_5), which ultimately
goes into the "out_tot" variable. This is the mantissa output of the
addition.

So with a little careful analysis, by tracking the names of the inputs,
intermediates and outputs, we can verify that the resultant auto-generated
output from nmigen looks reasonable. The thing is: the module has been
*deliberately* kept small so as to be able to do *exactly this*. One of
the reasons for this is illustrated below.

# Where things go horribly wrong

In nmigen, it's perfectly possible to use python variables to assign
(accumulate) intermediate results, without actually storing them in
actual "named" hardware (so to speak). Note in the add above, that
the tests for the If and Elif statements were placed into intermediate
variables? The reason for this was that if they were not, yosys
**duplicated** the expressions. Here's an example of where that goes
horribly wrong. Note the simple-looking innocuous code, below:

{shift_screenshot.png}

sm.rshift basically does a variable-length right shift (the ">>" operator
in both python and verilog). Note the assignment of the intermediary
calculation m_mask to a python temporary variable. Note the commented-out
code, which uses the "reduce" operator to OR together all of the bits of
a *secondary* expression, which ANDs all of the bits of "m_mask" with
the input mantissa? Watch what happens when that's handed over to yosys:

{align_single_fail.png}

It's an absolute mess. If you zoom in close on the left side, what's
happened is that the shift expression has been **multiplied** (duplicated)
a whopping **twenty-four** times (this is a 32-bit FP number, so the mantissa
is 24 bits). The reason is that the reduce operation needed 24 copies
of the input in order to select one bit at a time. Then, on the right
hand side, each bit is ORed in a chain with the previous bit, exactly as
would be expected of a sequential processor performing a "reduce"
operation.

On seeing this graph output, it was immediately apparent that it would be
totally unacceptable, yet from the python nmigen code it is not in the
slightest bit obvious that there's a problem. **This is why the yosys
"show" output is so important**.

On further investigation, it was discovered that nmigen has a "bool"
function, which ORs all of the bits of a variable together. In yosys it
even has a name: "reduce_bool". Here's the graph output once that function
has been used instead:

{align_single_clean.png}

*Now* we are looking at something that's much clearer, smaller, cleaner and
easier to understand. It's still quite amazing how so few lines of code
can turn into something so comprehensive. The line of "1s" (11111111...)
is where the variable "m_mask" gets created: this line of "1s" is
right-shifted to create the mask. In the box named "$43" it is then ANDed
with the original mantissa, reduced to a single boolean OR ($44, a
$reduce_bool operation), and so on.
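
To make the contrast concrete, here is a hedged sketch of the two
approaches (widths and signal names are illustrative, and the plain ">>"
stands in for the rshift helper; this is not the actual repository code):

    from functools import reduce   # only needed for the commented-out variant
    from operator import or_
    from nmigen import Module, Signal, Const

    m = Module()
    mantissa = Signal(24)
    shift    = Signal(5)
    sticky   = Signal()

    # a row of 1s, right-shifted by a variable amount: the mask
    m_mask = Const((1 << 24) - 1, 24) >> shift

    # roughly what the commented-out code did: python's reduce ORs the
    # 24 masked bits one at a time, so the masked expression ends up
    # duplicated once per bit by the time it reaches yosys.
    # m.d.comb += sticky.eq(reduce(or_, [(mantissa & m_mask)[i]
    #                                    for i in range(24)]))

    # the cleaner version: let nmigen reduce the AND result to one bit
    m.d.comb += sticky.eq((mantissa & m_mask).bool())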

This shift-mask is basically for the creation of the "sticky" bit in
IEEE754 rounding. It's essential to get right, and it's an essential part
of IEEE754 Floating-Point. By doing this kind of close visual inspection,
keeping things to small, compact modules, maintaining comprehensive unit
test coverage and making incremental, minimalist changes, we stand a
reasonable chance of not making huge glaring design errors, and of being
able to bootstrap up to a decent design.

Not knowing how to do something is not an excuse for not trying. Having a
strategy for working things out is essential to succeeding, even when
faced with a huge number of unknowns. Go from known-good to known-good,
create the building blocks first, make sure that they're reasonable, make
sure that they're comprehensively unit-tested; then incremental changes can
be attempted with the confidence that mistakes will be weeded out
immediately by a unit test failing when it should not.

However, as this update demonstrates, both of those versions of the
normalisation alignment produced the correct answer, yet one of them was
deeply flawed. Even code that produces the correct answer may have design
flaws: that's what the visual inspection is for.