# IEEE754 Floating-Point, Virtual Memory, SimpleV Prefixing

This update covers three different topics, as we now have four people
working on different areas. Daniel is designing a Virtual Memory TLB;
Aleksander and I are working on an IEEE754 FPU; and Jacob has been
designing a 48-bit parallel-prefixed RISC-V ISA extension.

# IEEE754 FPU

Prior to Aleksander joining the team, we evaluated
[nmigen](https://github.com/m-labs/nmigen) by taking an existing
verilog project (Jacob's rv32 processor) and converting it.
To give Aleksander a way to bootstrap up to understanding both nmigen
and IEEE754, I decided to do a
[similar conversion](https://git.libre-riscv.org/?p=ieee754fpu.git;a=tree;f=src/add)
of Jon Dawson's
[adder.v](https://github.com/dawsonjon/fpu/blob/master/adder/adder.v).
It turned out
[pretty well](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-February/000551.html).

We added a lot of comments so that each stage of the FP add is easy to
understand. A python class - FPNum - was created that abstracts away
a significant amount of repetitive verilog, particularly when it comes
to packing and unpacking the result. Common patterns for recognising
(or creating) +/- INF or NaN were each given a function with an
easily-recognisable name in the FPNum class, returning nmigen
code-fragments: a technique that is flat-out impossible to achieve in a
traditional HDL such as verilog.

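To make the idea concrete, here is a minimal sketch of the technique.
The class and method names below are illustrative assumptions, not the
actual FPNum API: a plain python class whose methods hand back nmigen
expression fragments for the common IEEE754 special-case tests.

    from nmigen import Signal

    class FPNumSketch:
        """Illustrative only: wraps the three fields of an IEEE754
        number; methods return nmigen expression fragments."""
        def __init__(self, e_width=8, m_width=24):
            self.s = Signal()            # sign
            self.e = Signal(e_width)     # biased exponent
            self.m = Signal(m_width)     # mantissa (hidden bit included)
            self.e_max = (1 << e_width) - 1

        def is_nan(self):
            # all-ones exponent, non-zero mantissa
            return (self.e == self.e_max) & (self.m != 0)

        def is_inf(self):
            # all-ones exponent, zero mantissa
            return (self.e == self.e_max) & (self.m == 0)

Because `is_nan()` returns an nmigen expression rather than a value, it
can be dropped straight into a `with m.If(a.is_nan() | b.is_nan()):`
block: python's class mechanism does the rest.
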
Already we have identified that the code for Jon's 32-bit and
64-bit FPUs is near-identical, the only difference being a set of
constants defining the widths of the mantissa and exponent. More than
that, we've also identified that around 80 to 90% of the code is duplicated
between the adder, divider and multiplier: in particular the normalisation
and denormalisation stages, as well as the packing and unpacking stages,
are absolutely identical. Consequently we can abstract these stages out
into base classes.

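To make the 32/64-bit point concrete, here is how the width constants
might be derived (a sketch under assumed names; the actual base classes
in ieee754fpu may organise this differently):

    # IEEE754 binary32: 1 sign + 8 exponent + 23 stored-mantissa bits
    # IEEE754 binary64: 1 sign + 11 exponent + 52 stored-mantissa bits
    E_WIDTH = {32: 8, 64: 11}

    def fp_widths(width):
        """Exponent width and mantissa width (hidden bit included)
        for a given overall format width."""
        e_width = E_WIDTH[width]
        # sign bit out, hidden bit in: the two cancel, so the working
        # mantissa width is simply width - e_width
        m_width = width - e_width   # 24 for binary32, 53 for binary64
        return e_width, m_width

The shared normalise, denormalise, pack and unpack stages can then be
written once, in terms of e_width and m_width, and inherited by the
adder, multiplier and divider.
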
Also, an aside: many thanks to attie from #m-labs on Freenode: it turns
out that converting verilog to migen as a way to learn is something that
other people do as well. It's a nice coincidence that attie
converted the
[milkymist FPU](https://github.com/m-labs/milkymist/blob/master/cores/pfpu/rtl/pfpu_faddsub.v) over to
[migen](https://github.com/nakengelhardt/fpgagraphlib/blob/master/src/faddsub.py),
so as to learn migen without having to simultaneously learn IEEE754.
We'll be comparing notes :)

# Virtual Memory / TLB

A [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
is a Translation Lookaside Buffer: the fundamental basis of
Virtual Memory. We're not doing an Embedded Controller here, so
Virtual Memory is essential. Daniel has been
[researching](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-February/000490.html)
this and found an extremely useful paper explaining that standard
CPU Virtual Memory strategies typically fail extremely badly when naively
transferred over to GPUs.

The reason for the failure is explained in the paper: GPU workloads typically
involve considerable amounts of parallel data that is processed once and
*only* once. Most standard scalar CPU Virtual Memory strategies, by contrast,
are based on the assumption that areas of memory (code and data) will be
accessed several times in quick succession (temporal locality).

We will therefore need to put a *lot* of thought into what we are going
to do here. A good lead to follow is the hint that if one GPU thread needs
a region of data then, because the workload is parallel, it is extremely
likely that there will be nearby data that could usefully be loaded
in advance.

This may be an area where a software-based TLB has an advantage over
a hardware one: we can deploy different strategies, even after the hardware
is finalised, as the sketch below illustrates. Again, we just have to see
how it goes, here.

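A minimal python model of the idea follows. Everything here - the page
size, the prefetch depth, the dict standing in for a page-table walk -
is an assumption for illustration, not the actual design: on a miss,
pre-translate the next few pages as well, on the expectation that
neighbouring parallel threads will touch them exactly once.

    PAGE_SHIFT = 12                      # assume 4 KiB pages

    class SoftTLB:
        def __init__(self, page_table, prefetch=4):
            self.page_table = page_table # vpn -> ppn (stands in for a walk)
            self.prefetch = prefetch     # pages to pre-translate on a miss
            self.entries = {}            # the TLB itself: vpn -> ppn

        def translate(self, vaddr):
            vpn = vaddr >> PAGE_SHIFT
            if vpn not in self.entries:
                # miss: walk the table, speculatively pulling in the
                # next few pages (spatial locality of parallel threads)
                for v in range(vpn, vpn + self.prefetch):
                    if v in self.page_table:
                        self.entries[v] = self.page_table[v]
            # a vpn still missing here would be a page fault
            offset = vaddr & ((1 << PAGE_SHIFT) - 1)
            return (self.entries[vpn] << PAGE_SHIFT) | offset

The point is not this particular policy: it is that, with the
translation strategy in software, the policy can be swapped out even
after the hardware is finalised.
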
# Simple-V Prefix Proposal (v2)

In looking at the SPIR-V to LLVM-IR compiler, Jacob identified that LLVM's
IR backend really does not have the ability to support what are, in effect,
stack-based wholesale changes to the meaning of instructions. In addition,
the setup time of the SimpleV extension (the number of instructions required
to set up the changed meaning of instructions) is significant.

The "Prefix" Proposal therefore compromises by introducing a 48-bit
(and also a 32-bit "Compressed") parallel instruction format. We spent
considerable time going over the options, and the result is a
[second proposed prefix scheme](https://salsa.debian.org/Kazan-team/kazan/blob/master/docs/SVprefix%20Proposal.rst).

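For readers wondering how 48-bit instructions even fit into RISC-V: the
standard variable-length encoding marks them with 0b011111 in the lowest
six bits. The field layout below is purely an illustrative assumption -
see the proposal document for the real encoding - but it shows the basic
shape: a short prefix wrapped around an ordinary instruction.

    def split_sv48(insn):
        """Illustrative only: split a 48-bit word into (prefix, base),
        assuming the low 16 bits carry the length marker plus prefix
        fields and the high 32 bits carry the instruction being
        vectorised. The real SVprefix layout differs."""
        if (insn & 0x3F) != 0x1F:        # RISC-V 48-bit length marker
            return None
        prefix = insn & 0xFFFF           # hypothetical prefix fields
        base = (insn >> 16) & 0xFFFFFFFF # the underlying RV instruction
        return prefix, base
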
What's really nice is that Simon Moll, the author of an LLVM Vector Predication
proposal, is
[taking SimpleV into consideration](https://lists.llvm.org/pipermail/llvm-dev/2019-February/129973.html).
A GPU (and the Vulkan API) deals with a considerable number of 3-long and
4-long Floating Point Vectors. Crucially, these are processed in *parallel*,
so there are *multiple* 3-long and 4-long vectors in flight at once. It makes
no sense to have predication bits down to the granularity of individual
elements *in* the vectors, so we need a vector mask that allocates one bit
per 3-long or 4-long "group". Or, more to the point: there is a performance
penalty associated with having to allocate mask bits right down to the level
of the individual elements. So it is really nice that Simon is taking this
into consideration.

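A quick worked example of the saving: with eight vec4s in flight, a
per-group mask needs 8 bits where a per-element mask needs 32. Expanding
one into the other is trivial (python, purely illustrative):

    def expand_group_mask(group_mask, group_len=4, n_groups=8):
        """Turn one predicate bit per vec4 into one bit per element."""
        elem_mask = 0
        for g in range(n_groups):
            if (group_mask >> g) & 1:
                # set all group_len element bits for this group
                elem_mask |= ((1 << group_len) - 1) << (g * group_len)
        return elem_mask

    # 8 group bits stand in for 32 element bits:
    assert expand_group_mask(0b00000101) == 0xF0F
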
# Summary

So there is quite a lot going on, and we're making progress. There are
a lot of unknowns: that's okay. It's a complex project. The main thing
is to just keep going. Sadly, a significant number of threads on reddit and
other forums are full of people saying things like "this project could not
possibly succeed, because they don't have the billion-dollar resources
of e.g. Intel". So... um, should we stop? Should we just quit, then, and
not even try?

That's the thing when something has never been done before: you simply
cannot say "It Won't Possibly Work", because that's just not how innovation
works. You don't *know* if it will or won't work, and you simply do not
have enough information to categorically say, one way or the other, until
you try.

In short: failure and success are *both* their own self-fulfilling prophecies.