# FPUs and nmigen

[nmigen](https://github.com/m-labs/nmigen) by
[m-labs](http://m-labs.hk/migen/index.html) is turning out to be a very
interesting choice. It's not a panacea: however, the fact remains that
verilog is simply never going to be as up-to-date as python, or gain the
advanced and powerful features that python has, and, in all seriousness,
it never should. Instead, it is quite reasonable to treat verilog, in
effect, as machine code: a compiler target.

However, it is critical to remember that, despite the code being written
in python, the underlying rules to obey are those of hardware, not
software. Modules (and how to use them) are not the same thing - at all -
as calling a function, and classes are definitely not synonymous with
modules. This update outlines some of the quirks encountered.

# Modules != classes

The initial conversion of John Dawson's IEEE754 FPU verilog code to
nmigen went extremely well and very rapidly. Where things came unstuck,
for over a week, was in the effort to "pythonify" the code, with a view
to converting a Finite State Machine design into a pipeline. The initial
work focussed on splitting the relevant sections of code out into python
classes and functions, to be followed by converting those into modules
(actual hardware modules, as in verilog, rather than "python" modules).

John's design is based around the use of global variables: the code moves
from state to state, using the globals to store forward progress. A
pipeline, by contrast, requires *local* variables (local to each stage),
where the output of one stage is connected, time-synchronised, as the
input of the next. Aleksander encountered some excellent work by
Dan Gisselquist on
[zipcpu](https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html),
which describes various pipeline strategies, including one that involves
buffered handshakes. It turns out that John's code, as a unit, in fact
conforms to the very paradigm that Dan describes. However, John's code
also has stages that perform shifting one bit at a time, for
normalisation of the floating-point result: a global internal variable is
updated by one bit every cycle. That is not how pipelines work: it is an
imperative prerequisite that a pipeline stage do its work in a *single*
cycle.

So the first incremental change was to split out each stage (alignment,
normalisation, the actual add, rounding and so on) into separate classes.

It was a mess.

The problem is that, where in computer programming languages it is normal
to have a variable that can be updated (overwritten), hardware is
parallel, and does not like it when more than one piece of "code" tries
to update the same "variable". Outputs *have* to be separated from
inputs.
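As a minimal sketch of what that separation looks like in nmigen (written
for this update as an illustration - the function and signal names are
hypothetical, not the project's actual code):

```python
from nmigen import Module, Signal

def stage_sketch(width=24):
    # inputs and outputs are distinct, dedicated signals: the only
    # place the two meet is a single, explicit eq() assignment.
    m = Module()
    i_a = Signal(width)      # input:  mantissa A
    i_b = Signal(width)      # input:  mantissa B
    o_z = Signal(width + 1)  # output: sum, one bit wider for the carry

    # this is a hardware connection, not a mutation: nothing here
    # is ever "overwritten" in the software sense.
    m.d.comb += o_z.eq(i_a + i_b)
    return m
```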
So although the "code" to do some work may be split out into a separate
class, it is necessary to also cleanly separate the inputs from the
outputs. *No* variable may be overwritten without being properly
protected, and in a pipeline paradigm, global variables are not an
option.

In addition, modules need to be "connected" to the places where they are
used. It is not possible to "call" a module and expect the parameters to
be passed in, with the inputs and outputs magically working: nmigen is a
different paradigm, because every assignment is either "sync" or "comb" -
clock-synchronised or combinatorial logic.

If you use "comb", nmigen generates hardware that is updated immediately
from its inputs. If you use "sync", however, nmigen knows to generate
hardware where the result is updated from its inputs on the **next**
clock cycle. The problem with converting code over to a module, using
local inputs and outputs, *and* removing globals, is that it is too many
things to tackle at once.

It took about ten days to work all this out, keeping the unit tests
running at all times and using their success or failure as the indicator
of whether things were on track. Eventually, however, it all worked out.

# Add Example module

It's worthwhile showing some of the pipeline stages. Here's the python
nmigen code for the adder first stage:

{add_code_screenshot.png}

A prerequisite is that an "alignment" phase has already run, ensuring
that the exponents are both the same, so there is no need in this phase
to perform bit-shifting of the mantissas: that has already been handled.

There are two inputs (in_a and in_b) and one output (out_z). These are
modules in their own right, each containing a sign, an exponent and a
mantissa: in_a.m is the mantissa of input A, for example. So the first
thing that happens is that four intermediate variables are created: one
for testing whether the signs of A and B are equal (or not), a second for
comparing the mantissas of A and B, and two further intermediates for
storing the mantissas of A and B, zero-extended by one bit.

Next come some simple combinatorial tests: if the signs are the same, the
mantissas of A and B are added, and stored in Z's mantissa. If we reach
the next "If" statement, we know that this is to be a subtraction, not an
addition. For subtraction, however, it matters which way round the
subtraction is done, depending on which of A or B is the larger.

It's really quite straightforward, and the important thing to note here
is that the code is properly commented. It's not the most compact code in
the world, and not the prettiest-looking either. Python cannot overload
the assignment operator (not without overloading getattr and setattr,
that is), so nmigen creates and uses a method named "eq" to express
assignment.

One aspect of this project that's considered extremely important is a
visual inspection of each module. Here's what the add stage looks like
when the yosys "show" command is run on it:

{add_graph.png}

On the left it can be seen that the names are a bit of a mess: the
members of A and B (s, e and m) are extracted and, because they clash,
are given auto-generated names. m can be seen to go into a square (a
graphviz node) with "e" and "m" on it, in a box named "add0_n_a". That's
the name we chose to give to the submodule in the nmigen code, shown
above, purely so that it would be easy to identify visually in the
graphviz output.
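For reference while reading the graph analysis below, here is roughly
what that first adder stage looks like in nmigen. This is a
reconstruction from the description above, using the signal names that
appear in the graph (seq, mge, am0, bm0, out_tot); the function name,
widths, and the .s/.e/.m member layout are assumptions, not the project's
exact source:

```python
from nmigen import Cat, Const, Module, Signal

def adder_stage_sketch(m, in_a, in_b, out_z):
    # in_a, in_b and out_z are assumed to be submodules with
    # .s (sign), .e (exponent) and .m (27-bit mantissa) members;
    # sign and exponent handling is omitted for brevity.
    seq = Signal()    # are the two signs equal?
    mge = Signal()    # is mantissa A >= mantissa B?
    am0 = Signal(28)  # mantissa A, zero-extended by one bit
    bm0 = Signal(28)  # mantissa B, zero-extended by one bit
    m.d.comb += [
        seq.eq(in_a.s == in_b.s),
        mge.eq(in_a.m >= in_b.m),
        am0.eq(Cat(in_a.m, Const(0, 1))),  # the Concat seen in the graph
        bm0.eq(Cat(in_b.m, Const(0, 1))),
    ]
    out_tot = Signal(28)   # selected result ("out_tot" in the graph)
    with m.If(seq):        # same sign: add the mantissas
        m.d.comb += out_tot.eq(am0 + bm0)
    with m.Elif(mge):      # signs differ, A is larger: A - B
        m.d.comb += out_tot.eq(am0 - bm0)
    with m.Else():         # signs differ, B is larger: B - A
        m.d.comb += out_tot.eq(bm0 - am0)
    m.d.comb += out_z.m.eq(out_tot)
```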
Note that there is an arrow into a block that takes m (bits 26 down to 0)
and a single-bit zero, and outputs them Concatenated together: these then
go into a diamond-block named "am0". We've identified am0 from the python
code!

The m (mantissa A) and m$2 (mantissa B) also go into $9, a "ge" (Greater
than or Equal) operator, which in turn goes to a diamond-block named
"mge": this is the check to see which of the mantissas is larger!

Then we can see that $15, $12 and $18 are the add and subtraction
operations, which feed through to a selection procedure ($group_5), which
ultimately goes into the "out_tot" variable. This is the mantissa output
of the addition.

So, with a little careful analysis, by tracking the names of the inputs,
intermediates and outputs, we can verify that the auto-generated output
from nmigen looks reasonable. The thing is: the module has been
*deliberately* kept small precisely so that this kind of verification is
possible. One of the reasons why is illustrated below.

# Where things go horribly wrong

In nmigen, it's perfectly possible to use python variables to hold
(accumulate) intermediate expressions, without actually storing them in
"named" hardware (so to speak). Notice how, in the add stage above, the
tests for the If and Elif statements were placed into intermediate
signals? The reason is that, if they were not, yosys **duplicated** the
expressions. Here's an example of where that goes horribly wrong. Note
the simple-looking, innocuous code below:

{shift_screenshot.png}

sm.rshift basically performs a variable-length right shift (the ">>"
operator in both python and verilog). Note the assignment of the
intermediate calculation m_mask to a python temporary variable, and note
the commented-out code, which uses the "reduce" operator to OR together
all of the bits of a *secondary* expression - one which ANDs all of the
bits of m_mask with the input mantissa. Watch what happens when that's
handed over to yosys:

{align_single_fail.png}

It's an absolute mess. If you zoom in close on the left side, what's
happened is that the shift expression has been **multiplied**
(duplicated) a whopping **twenty-four** times (this is a 32-bit FP
number, so the mantissa is 24 bits). The reason is that the reduce
operation needed 24 copies of the input, in order to select one bit at a
time. Then, on the right-hand side, each bit is ORed in a chain with the
previous bit, exactly as would be expected of a sequential processor
performing a "reduce" operation.

On seeing this graph output, it was immediately apparent that it would be
totally unacceptable, yet from the python nmigen it is not in the
slightest bit obvious that there's a problem. **This is why the yosys
"show" output is so important.**

On further investigation, it was discovered that nmigen has a "bool"
function, which ORs all of the bits of a variable together. In yosys it
even has a name: "reduce_bool". Here's the graph output once that
function is used instead:

{align_single_clean.png}

*Now* we are looking at something that's much clearer, smaller, cleaner
and easier to understand. It's still quite amazing how so few lines of
code can turn into something so comprehensive. The line of "1"s
(11111111...) is where the variable m_mask gets created: this line of
"1"s is right-shifted to create the mask. In the box named "$43" it is
then ANDed with the original mantissa, then reduced to a single boolean
OR ($44, a $reduce_bool operation), and so on.
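In code terms, the difference between the two versions is roughly the
following (a sketch, assuming a 24-bit mantissa; the signal names are
illustrative, not the project's exact ones):

```python
from functools import reduce
import operator

from nmigen import Const, Module, Signal

m      = Module()
mant   = Signal(24)  # the input mantissa
shamt  = Signal(5)   # how far the mantissa is right-shifted
sticky = Signal()    # OR of every bit that the shift discards

# m_mask is a python temporary holding an *expression*: a line of
# "1"s right-shifted by the shift amount (the 11111111... visible
# in the graph output).
m_mask = Const(0xffffff, 24) >> shamt

# flawed version (the commented-out code in the screenshot):
# selecting one bit at a time forces yosys to keep twenty-four
# separate copies of the whole AND-plus-shift expression, then
# OR them together in a long sequential chain.
#   m.d.comb += sticky.eq(
#       reduce(operator.or_, [(mant & m_mask)[i] for i in range(24)]))

# clean version: nmigen's bool() maps to a single $reduce_bool
# cell, ORing all the bits together at once.
m.d.comb += sticky.eq((mant & m_mask).bool())
```

Both versions compute exactly the same result; the difference is solely
in the hardware that yosys is asked to build.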
This shift-and-mask is all for the creation of the "sticky" bit in
IEEE754 rounding: the OR of every bit shifted out of the mantissa. It's
essential to get right: without it, the rounding decision cannot be made
correctly. By doing this kind of close visual inspection, by keeping
things to small, compact modules, and by combining comprehensive unit
test coverage with incremental, minimalist changes, we stand a reasonable
chance of not making huge, glaring design errors, and of being able to
bootstrap our way up to a decent design.

Not knowing how to do something is not an excuse for not trying. Having a
strategy for working things out is essential to success, even when faced
with a huge number of unknowns. Go from known-good to known-good; create
the building blocks first; make sure that they're reasonable; make sure
that they're comprehensively unit-tested (a tiny illustration of the
style follows at the end of this update). Then incremental changes can be
attempted with confidence, because any mistake is weeded out immediately
by a unit test failing when it should not.

However, as this update demonstrates, both versions of the normalisation
alignment produced the correct answer, yet one of them was deeply flawed.
Even code that produces the correct answer may have design flaws: that is
what the visual inspection is for.
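To give a flavour of that "known-good to known-good" testing style, here
is a deliberately tiny, hypothetical example in plain python - far
simpler than the project's actual IEEE754 unit tests: a compact
sticky-bit calculation is checked against an obviously-correct reference
over thousands of random inputs.

```python
import random
import unittest

def sticky_reference(mant, shamt):
    # obviously-correct version: OR together every bit that a
    # right-shift by shamt would throw away, one bit at a time.
    return int(any((mant >> i) & 1 for i in range(shamt)))

def sticky_fast(mant, shamt):
    # compact version under test: mask off the discarded bits in
    # one operation and test for non-zero.
    return int(mant & ((1 << shamt) - 1) != 0)

class StickyBitTest(unittest.TestCase):
    def test_random(self):
        for _ in range(10000):
            mant = random.getrandbits(24)
            shamt = random.randrange(25)
            self.assertEqual(sticky_fast(mant, shamt),
                             sticky_reference(mant, shamt))

if __name__ == "__main__":
    unittest.main()
```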