From c756d2f8b95bfc84305d04dfb639533528655e2d Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Sun, 3 Mar 2019 00:17:06 +0000 Subject: [PATCH] FPU update --- images/shift_screenshot.png | Bin 0 -> 4817 bytes .../017_2019mar02_nmigen_learning_curve.mdwn | 210 ++++++++++++++++++ 2 files changed, 210 insertions(+) create mode 100644 images/shift_screenshot.png create mode 100644 updates/017_2019mar02_nmigen_learning_curve.mdwn diff --git a/images/shift_screenshot.png b/images/shift_screenshot.png new file mode 100644 index 0000000000000000000000000000000000000000..17fab04f4b77218d126a6079c2f7e3c28d8c56f5 GIT binary patch literal 4817 zcma)=cQD*r`2RmbNYp41ix334tZ1v3l`Du~m53Iit=@MLy@uRGODwBzB+4S%Y6-z| z6D`V$-j`t2Sbg<;-I?E?zwgZVk5lG6&zU)A=JlNSd(LYkm<|&?H$4CVOuCOAng9S* z>E-<6wJVps|3y*g<#65iktGrUn7{wmsM5umc>sVdR`=lp)1chnd21C*O_kceU*%|E zO^&;aN^@_VMQ}Z{d&i;Gk3NuqaL?U;T_|Ez5dM5TpgCbrm6&l4(0Dd40stzzk+n<{ z*?X!kIYu4php~B)p4^thr9hHcl~L?! zCToxgUxyjmtZ6F(2HOiF-Oo;nEZi)s`JNo%!m9VM$K2kUZd`w2v_NVkwTQ2>d7XIw zq4n0kG9I`}$?5kw$ESJb!0E2pJJcrW+PI4?(KvvMB66rKxK-<1hDwJa<&yC*ev4Zp zHb7f)v%QZj04!Z0X3(y1b`A#km44!^7WCN;U*XStxdLM->D;d!loHx#U3#{C_=-$9 zi$FTwE%+qWd z;_(Ls_M3_jn9Wd0fVjwkbJw2VTgKKAhLXRPD+ZcG{{j`k%&+LCv8;$=O)VLoG$zo4yoRWXv zzLB+g>UnaYlf3AnXwA-$)L*=@P6u?ThEyz`2!rOJ(YJYYC-KUT%TrQ`AxUy70f{Y= zHn;c_1)vjIAI{YSUYc04y45z$knb3M>pplZV`Km8uSRD<9j~6WAG+FNIst+`{<{Ct zW}N<7l-~{XwHK5Gd+;jV)>!;`+JC3$I(6(Sx;ly`$u8{(Qq0jjRl4e2gfHYeT`1y$ z^`a13YR|Z!4@1;FdVTeY!mzZ&5UDUucdw>;8sB$Zg9{T^(pi}S+Isq4fQ;U=#mho~ zJrx`X0@6-zOI%AUDU5>%GO=H zOONQPio4#!_RLPBrQd7WNDN60undRUdV+hZzxqJ*D-R!ECs4ttqOOpt0;=1YpG|$0 z%zZL-p;WMup^uWtIUke%@L|zIZcQGs%N)VaA$x2?yB{+^Jp|I=G-m=Yzo>z_)v_l1 z`!Qo7r81e{$)5n=OVG)u8L=-P$NxI|>ph<6KHg!ndkW{M?ey$lh96)3+Nj%yV>Mw({jIJ;H7-vAkp6 zsAM1IOn=(NXmW3-qB3W~$H0?L$PqcXjGsGFC*`^g)Tl52JlvnUIwV(?GvOkOQJWAk zh0Q-ZEU{0e+sth*v2bRY2756`o46~sp3KIXd<8S0wLzUX2A1w3ZvxQ8L;2w{NgjPq z^J-ggi+sbZ^Q4fZg^p&3^0U7TvXuj!o8a6}m0(3OXxT6mSf4Yf?a9_a4UixIf$aC; 
z*1q)fC+P{xL7RR})glIdzM)%%uw)HOPNiPG^x+YcY86@HVrp2TN>_756ZNW@=||=W zVN94dYt)?nvyFUqWn3WQ+ywk8t~_lKYvMeC6kXxxnL1<6B96Sri+` zt-m_D3^Xo)d3!aYx#}b0cOg5rON1^14)YuU5dEN4lJWaQrMa>OY6b4ZrKyPe=5QRx zCEz4xn!}KZfdZSIgL};@%}8?36PkhNA0RTeT5W7xXrn3k~$2eqWp-^8TlaN?eL zF7a=}Q%MU`;|}yOw;@NP5#~(sWTyVW%zM0X3ey{WvBI*(GBcVgY%xcvVDJz{tTjB< zO!d}U-!*bJ9NIC!E|8EqX@8QnVK1*M5&l$e5ChH`D0E-TVCyGES6Y3zI~N)((nvak zXG1?sBMHMtiYM#cjB6tY39f%K0`Lm8`<$|SHnbQigvdR4g|xKXTvG?CTzP(n!9p>U zZ{NRn-WByqb|YvNs|8w2j5#T`k~klSP8h;+91#dTBlc7NQz)l?^7F~OrEPgvUD^Ws z5sNalGG5KEXQKtPS#M;VTYKBsruy>RyT7XhUp}n(iz<1o}>j`w7g@na<8)? z=8=t9Pu=MrRAdq~=Oa+7L+eX2d~p4Q@!~~_KKak6)IQc%LUMKD@aO;w(WsF`k92I} zP{ks5o#RD|229hWFK@tSaqtm)`v>=P#-}y1f-OeNcdw!wV=cV*2bV{-Ql6FCXcgxP zkGa)~VdH|K=j!#kbfSpQXTHN|>rJ6T6^~_CRRrU2YVE<4!|=`PVYc0LK{e8PbY--t z)LvI%to}Gq6o2@-q+aV6uJH*%^}KblYLc=+rpXO&5-D+eucGy()9O`jyS)%|D1bP3 z`1B~*DHqR~_!{9f)wJtUK>zJl;Rt)H-QNcEERN3)QNyGd|EDMY^RY|!YMwj$Ma=~< zx_U3wy#NLMJIG4^_3e<@D};29zFj6*RyjqQp>~NEUmCb>QXlGHYQHLnkF{mJmN9#&8I|ec*JOuRs~cX? zb(6W-q0dvH*!6)0esZGGq%HlsN2z(dbN4&{~`YWNrexdo+`J{@B+XOu9J$7zO(Cb zop(B;6fiG3_RMP&_rww4`58WIDbf$sX28&l~!n=U?3r=@b<%#WEGi+7g~>WPt9?=SoG(g;3w#J?_+v2j}=DJn)6$0kjS zZE-W4dSW*BvuXFj~j>%eC{;!az61{yuQ+Zz;^KL-e#+nZ1)=a}h;(b$V$NbzpasQ6|tQdYQ;%PQ|S)m|f|QW7*h05Kn+95!3xzQ)bu7UUZ}e2~Vqs@?HB8 zBi((;U|WI3ymi;d#9-TR$6ZktNRW@j=+Rm8Zra+L<++rOu=m&rj8UD9C8XS%<^0E&avIc>X1NSc_S^F7ERvM=zWs1_tu^z|%%%2G|IDVs zMda>Foo=p5f=gl3JGljr59`%(nWk0hj_cl z6x}$@7d+SEAtf#ZnQ?x;avoX2v}p)Rm}5b>`isrH^>oc>DwGbaws`0)+M+8wJw^Tp zCl$Zwq0UH|D2RWZKENvJRHYlKIX;-aHv{Ua&IVEwLtWIwed?Dc3Z5Q#yeiP`eyd+I zvh~i^ngGvuaN=#a9y4i~-kSeTu^?#kojtr)Qrt=FmY?tV3i6K+u_x7*Fp~}3iw)xs z|B`<5sR7FOM^&AaxsA_Fx%H+7@yT$s65bFscQ=a#zE4qJ-|#Ot7$bDtl>qbY-KXbC zmTTua8+t!jK-B`O8++i<=x@^am9FINX;X3DU?V;=Zb9_1e!f7vyrxLu=tx9TE{yqJ zCP#0uW6-_0I$<-tuA^Tfaspt#?`VA67oAVa@AIs^q?%~m_1^x*YKKbW-tg$f&7$On z1_Wzvwiv>fb>vJ&z4!5|NThU57To18QAGOY&t9tyBXg)jyUE;26KFuHPjEx?h!AGa2Kt!)QF{CtDD$ z1i}jatp$Sbt6Bda#*(Ss70)pJ*6riGBo%r3z_S_^>Dr_M=7o=*cwDxL3a+fb@9!)} 
zj4F*Cuoz~${di5ZRaY3fOnfC8Icw z%Hz99vtZ2=cXYG=Y#u1xC>$G&4mR)$E$?2WA~8vio1O#>DAVP?YAT4+uC{MGz;wd8DPmJF&W zcF4_3rMWy4HCJ6WxvQz&;CzTPh;BQabKPXI`#?vDVqVd?iA6NJv+JR^r=ycj3DvoY^ZL4%v}_?-lfMJMUs3{Es=B5`f24WqpH4j z%hO?uxPU8p!RU zVssu~l|9SnAMy6Lt+dFz+D0J`GKN)ihCkB66?~|rFI%?Ma_oL>U>OrWQAg~uDi2w? z7m`7R%8xe=5DxW??;oPlrU& zk7S~5lD8S>JXxkz4iAkniR9H&(AuwU69dom0YHAH4e4}{p9(Mvv|Q<6mIMH&TbFhK z;P^k;ftuMyW&lw1#oPj_MY;B_{BPdjf}hUpl(; literal 0 HcmV?d00001 diff --git a/updates/017_2019mar02_nmigen_learning_curve.mdwn b/updates/017_2019mar02_nmigen_learning_curve.mdwn new file mode 100644 index 0000000..6765f9a --- /dev/null +++ b/updates/017_2019mar02_nmigen_learning_curve.mdwn @@ -0,0 +1,210 @@ +# FPUs and nmigen + +[nmigen](https://github.com/m-labs/nmigen) by +[m-labs](http://m-labs.hk/migen/index.html) is turning out to be a very +interesting choice. It's not a panacea: however the fact remains that +verilog is simply never going to be as up-to-date or have advanced and +powerful features added to it that python has, and, in all seriousness, +it never should be updated either. Instead, it is quite reasonable to +treat verilog in effect as a machine-code (compiler target). + +However, it is critical to remember that despite writing code in python, +the underlying rules to obey are those of hardware, not software. That +modules (and how to use them) are not the same thing - at all - as calling +a function, and that classes are definitely not synonymous with modules. +This update outlines some of the quirks encountered. + +# Modules != classes + +The initial conversion process of John Dawson's IEEE754 FPU verilog code +to nmigen went extremely well and very rapidly. Where things began to +come unstuck for over a week was in the efforts to "pythonify" the code, +with a view to converting a Finite State Machine design into a pipeline. 
+The initial efforts focussed on splitting out the relevant sections of +code into python classes and functions, to be followed by +converting those to modules (actual verilog modules, rather than "python" +modules). + +John's design is based around the use of global variables. The code moves +from state to state, using the global variables to store forward progress. +A pipeline requires the use of *local* variables (local to each stage), +where the output from one stage is connected, time-synchronised, as the +input to another. Aleksander encountered some excellent work by +Dan Gisselquist on +[zipcpu](https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html), +which describes various pipeline strategies including one which involves +buffered handshakes. It turns out that John's code (as a unit) in fact +conforms to the very paradigm that Dan describes. However, John's code +also has stages that perform shifting one bit at a time, for normalisation +of the floating-point result. The global internal variable is updated one +bit every cycle, and that's not how pipelines work: it's a +prerequisite that a pipeline stage do its work in a *single* cycle. + +So the first incremental change was to split out each stage (alignment, +normalisation, the actual add, rounding and so on) into separate classes. + +It was a mess. + +The problem is that where in computer programming languages it is normal +to have a variable that can be updated (overwritten), hardware is parallel +and doesn't like it when more than one piece of "code" tries to update the +same "variable". Outputs *have* to be separated from inputs. So although +the "code" to do some work may be split out into a separate class, it's +necessary to also cleanly separate the inputs from the outputs. *No* +variables may be overwritten without them being properly protected, and +in a pipeline paradigm, global variables are not an option. 
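The difference between one-bit-per-cycle normalisation and a single-cycle stage can be sketched in plain Python, with integers standing in for hardware signals. This is an illustrative model only, not John's code or the nmigen conversion: the function names and the `width` parameter are hypothetical, the mantissa is assumed to fit in `width` bits, and exponent limits are ignored.

```python
def normalise_fsm(m, e, width=24):
    """FSM style: shared state is updated one bit per 'cycle'."""
    top = 1 << (width - 1)          # position of the leading bit
    cycles = 0
    while m != 0 and not (m & top):
        m <<= 1                     # overwrite the "global" mantissa
        e -= 1                      # and the "global" exponent
        cycles += 1
    return m, e, cycles


def normalise_stage(m, e, width=24):
    """Pipeline style: inputs go in, fresh outputs come out, in one step."""
    if m == 0:
        return m, e
    shift = width - m.bit_length()  # number of leading zeros
    return m << shift, e - shift    # outputs kept separate from inputs
```

The first form takes a data-dependent number of cycles; the second produces the same mantissa and exponent in a single step, which is what a pipeline stage needs.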
+ +In addition, modules need to be "connected" to the places where they are +used. It's not possible to "call" a module and expect the parameters +to be passed in and the inputs and outputs to magically work. +nmigen is a different paradigm because you can either use "sync" or +"comb" - clock-synchronised or combinatorial logic. + +If you use "comb", it generates hardware that is updated immediately +from its inputs. However if you use "sync", nmigen knows to auto-generate +hardware where on the **next** cycle, the result is updated from its +inputs. The problem in converting code over to a module and using local +inputs and outputs *and* removing globals is that it's too many things at +once to tackle. + +It took about ten days to work all this out, keeping the unit tests running +at all times and using their success or failure as an indicator of whether +things were on track. Eventually however it all worked out. + +# Add Example module + +It's worthwhile showing some of the pipeline stages. Here's the python +nmigen code for the adder first stage: + +{add_code_screenshot.png} + +A prerequisite is that an "alignment" phase was run, which ensured that +the exponents were both the same, so there is no need in this phase to +perform bit-shifting of the mantissas: that's already been handled. + +There are two inputs (in_a and in_b) and one output (out_z): these are +modules in their own right, each containing a sign, mantissa and exponent. +in_a.m is the mantissa of input A, for example. So the first thing is that +four intermediate variables are created: one for testing whether the +signs of A and B are equal (or not), the second for comparing the +mantissas of A and B, and two further intermediates are used to store +the mantissas A and B zero-extended by one bit. + +Next we have some simple combinatorial tests: if the signs were the +same, we perform an add of A and B's mantissas, storing the result in Z's +mantissa.
If we get to the next "If" statement, we know that this +is to be a subtraction, not an addition. However, for subtraction, +it matters which way round the subtraction is done, depending on +which of A or B is the larger. + +It's really quite straightforward, and the important thing here is to +note that the code is properly commented. It's not the most compact +code in the world: it's not the prettiest-looking. Python cannot +handle overloading of the assignment operator (not without overloading +getattr and setattr, that is), so nmigen creates and uses a method +named "eq" to handle assignment. + +One aspect of this project that's considered to be extremely important +is to do a visual inspection of each module. Here's what add looks like +when the yosys "show" command is run on it: + +{add_graph.png} + +On the left it can be seen that the names are a bit of a mess: the members +of A and B (s, e and m) are extracted and, because they clash, are given +auto-generated names. m can be seen to go into a square (a graphviz module) +with "e" and "m" on it, in a box named "add0_n_a". That's the name we +chose to give to the submodule in the nmigen code, shown above, purely +so that it would be easy to visually identify in the graphviz output. + +Note that there is an arrow into a block that takes m (bits 26 down to 0) +and a single-bit zero, and outputs them concatenated together: these then +go into a diamond-block named "am0". We've identified am0 from the python +code! + +The m (mantissa A) and m$2 (mantissa B) also go into $9, a "ge" (Greater +or Equal) operator, which in turn goes to a diamond-block named "mge": +this is the check to see which of the mantissas is larger! + +Then we can see $15, $12 and $18 are add and subtraction operations, +which feed through to a selection procedure ($group_5), which ultimately +goes into the "out_tot" variable. This is the mantissa output of the +addition. 
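As a cross-check when reading the graph, the behaviour described for the add stage can be modelled in plain Python. This is a sketch under stated assumptions, not the nmigen code in the screenshot: `am0` and `mge` are the names visible in the graph, while `seq`, `bm0`, the function name and the choice of result sign (the sign of the larger operand) are assumptions made for illustration.

```python
def add_stage0(a_s, a_m, b_s, b_m, width=27):
    """Model of the first add stage; exponents are assumed already aligned.

    a_s/b_s are sign bits, a_m/b_m are width-bit mantissas.
    Returns (sign, total) where total is width+1 bits wide.
    """
    mask = (1 << (width + 1)) - 1   # result is one bit wider (carry)
    seq = a_s == b_s                # are the signs equal?
    mge = a_m >= b_m                # is mantissa A >= mantissa B?
    am0 = a_m & mask                # mantissas zero-extended by one bit
    bm0 = b_m & mask
    if seq:                         # same sign: plain addition
        return a_s, (am0 + bm0) & mask
    if mge:                         # different signs, A larger: A - B
        return a_s, (am0 - bm0) & mask
    return b_s, (bm0 - am0) & mask  # different signs, B larger: B - A
```

For example, `add_stage0(0, 5, 1, 3)` subtracts the smaller mantissa from the larger one and keeps the sign of A, while swapping the operands' magnitudes keeps the sign of B.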
+ +So with a little careful analysis, by tracking the names of the inputs, +intermediates and outputs, we can verify that the resultant auto-generated +output from nmigen looks reasonable. The thing is: the module has been +*deliberately* kept small so as to be able to do *exactly this*. One of +the reasons for this is illustrated below. + +# Where things go horribly wrong + +In nmigen, it's perfectly possible to use python variables to assign +(accumulate) intermediate results, without actually storing them in +actual "named" hardware (so to speak). Note in the add above, that +the tests for the If and Elif statements were placed into intermediate +variables? The reason for this was that if they were not, yosys +**duplicated** the expressions. Here's an example of where that goes +horribly wrong. Note the simple-looking innocuous code, below: + +{shift_screenshot.png} + +sm.rshift basically does a variable-length right shift (the ">>" operator +in both python and verilog). Note the assignment of the intermediary +calculation m_mask to a python temporary variable. Note the commented-out +code which uses the "reduce" operator to OR all of the bits of +a *secondary* expression, which ANDs all of the bits of "m_mask" with +the input mantissa? Watch what happens when that's handed over to yosys: + +{align_single_fail.png} + +It's an absolute mess. If you zoom in close on the left side, what's +happened is that the shift expression has been **multiplied** (duplicated) +a whopping **twenty four** times (this is a 32-bit FP number so the mantissa +is 24 bits). The reason is that the reduce operation needed 24 copies +of the input, in order to select one bit at a time. Then, on the right +hand side, each bit is ORed in a chain with the previous bit, exactly as +would be expected to be carried out by a sequential processor performing +a "reduce" operation. 
+ +On seeing this graph output, it was immediately apparent that it would be +totally unacceptable, yet from the python nmigen it is not in the slightest +bit obvious that there's a problem. **This is why the yosys "show" output is +so important**. + +On further investigation, it was discovered that there is a "bool" function +of nmigen, which ORs all bits of a variable together. In yosys it even +has a name, "reduce_bool". Here's the graph output once that function has +been used instead: + +{align_single_clean.png} + +*Now* we are looking at something that's much clearer, smaller, cleaner and +easier to understand. It's still quite amazing how so few lines of code +can turn into something so comprehensive. The line of "1s" (11111111...) +is where the variable "m_mask" gets created: this line of "1s" is right-shifted +to create the mask. In the box named "$43" it is then ANDed with the original +mantissa, reduced to a single boolean OR ($44) with a $reduce_bool operation, +and so on. + +This shift-mask is basically for the creation of the "sticky" bit in +IEEE754 rounding. It's essential to get right, and it's an essential part +of IEEE754 Floating-Point. By doing this kind of close visual inspection, +and keeping things to small, compact modules, in combination with comprehensive +unit test coverage and performing incremental minimalist changes, we stand a +reasonable chance of not making huge glaring design errors and being able +to bootstrap up to a decent design. + +Not knowing how to do something is not an excuse for not trying. Having a +strategy for being able to work things out is essential to succeeding, even +when faced with a huge number of unknowns. Go from known-good to known-good, +create the building blocks first, make sure that they're reasonable, make +sure that they're unit-tested comprehensively, then incremental changes can +be attempted with the confidence that mistakes can be weeded out immediately +by a unit test failing when it should not. 
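The two ways of forming the sticky bit can be compared in a plain-Python model, with integers standing in for signals. This is a sketch, not the code from the screenshots: the function names, the `width` parameter and the way the mask is built are assumptions for illustration. It says nothing about the hardware each form generates, only that the two agree on the value.

```python
from functools import reduce
from operator import or_


def shift_mask(shift, width=24):
    """A line of 1s, right-shifted so that only the low 'shift' bits remain."""
    ones = (1 << width) - 1
    return ones >> (width - shift)


def sticky_reduce(m, shift, width=24):
    """OR the masked bits together one at a time (the chained form)."""
    lost = m & shift_mask(shift, width)
    return reduce(or_, ((lost >> i) & 1 for i in range(width)), 0)


def sticky_bool(m, shift, width=24):
    """Single 'is any bit set?' test (the reduce_bool form)."""
    lost = m & shift_mask(shift, width)
    return int(lost != 0)


# Exhaustive comparison at a small width: both forms must agree.
for m in range(1 << 6):
    for shift in range(7):
        assert sticky_reduce(m, shift, 6) == sticky_bool(m, shift, 6)
```

The exhaustive comparison is cheap at this width and is the kind of check a unit test can carry.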
+ +However, as this update demonstrates: both those versions of the normalisation +alignment produced the correct answer, yet one of them was deeply flawed. +Even code that produces the correct answer may have design flaws: +that's what the visual inspection is for. + -- 2.30.2