# First NLNet Grant approved to fund development of the Libre RISC-V SoC

The application for funding from NLnet, from back in November of last year,
has been approved. It means that we have EUR 50,000 to pay for full-time
engineering work to be carried out over the next year, and to pay for
"bounty" style tasks.

However this is not all: by splitting the tasks up into separate groups,
and using a second European-based individual as the applicant, we can
apply for a second grant (also of up to EUR 50,000). In the next couple
of days we will put in an application for "Formal Mathematical Proofs"
of the processor design.

There are several reasons for doing so. The primary one is down to the
fact that we anticipate this (commercial, libre) product to be closely
and independently examined by third parties, to verify for themselves
that it does not contain spying backdoor co-processors, as well as to
check the usual security and correctness guarantees. If there exist
*formal mathematical proofs* that the processor and its sub-components
operate correctly, that independent third-party verification task is a
lot easier.

In addition, it turns out that when writing unit tests, using formal
mathematical proofs makes for *complete* code coverage - far better
than any other "comprehensive" multiple unit test technique could ever
hope to achieve - with less code and not just better accuracy but *100%
provable* accuracy. Additional, much simpler unit tests can then be written
which are more along the lines of "HOWTOs" - examples of how to use the
unit.

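To give a tiny, concrete flavour of the difference (this is an
illustration using the z3 SMT solver's Python bindings, not project
code), here is a property proven for *every* possible pair of 32-bit
inputs - something no finite list of hand-written test vectors can do.
The HDL-level proofs would apply the same principle to the actual
nmigen modules, via yosys' formal flow:

```python
# Illustrative only: prove a bit-level identity over ALL 2^64 input
# combinations, using the z3 SMT solver (pip install z3-solver).
from z3 import BitVec, prove

a = BitVec('a', 32)
b = BitVec('b', 32)

# For any 32-bit a and b:  (a & b) + (a | b) == a + b  (wrapping arithmetic).
# A conventional unit test can only sample a handful of (a, b) pairs;
# prove() either reports a counter-example or confirms the property holds
# for all possible inputs.
prove((a & b) + (a | b) == a + b)
```
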
This is one of those "epiphany" moments that, as a Software Engineer of
25 years' experience, has me stunned and wondering why on earth this is
not more generally and widely deployed in software. The answer, I believe,
is down to the nature of what a processor actually is.

A processor is developed much more along the lines of how Functional
Programming works. Functional Programming can have formal mathematical
proofs applied to it because, for any given inputs, the output is
*guaranteed* to be the same. This of course breaks down when the function
has "side-effects" (such as reading from a file or accessing other
external state outside of the "control" of the function). In the design
of a processor, by the very nature of hardware, you simply cannot create
a Verilog module that has access to "files" or to "global variables".

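As a small illustration (plain Python, not project code): a combinational
block behaves exactly like a pure function, which means an implementation
can be checked against its specification for *every* input combination:

```python
# Illustrative only: a combinational block modelled as a pure function.
# No files, no globals - the output depends solely on the inputs, so the
# implementation can be checked against the specification exhaustively.

def mux2_spec(sel, a, b):
    """Behavioural specification of a 2-to-1 multiplexer."""
    return b if sel else a

def mux2_gates(sel, a, b):
    """Gate-level style implementation of the same multiplexer."""
    return ((a & ~sel) | (b & sel)) & 1

# Exhaustive equivalence check over every 1-bit input combination.
for sel in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            assert mux2_gates(sel, a, b) == mux2_spec(sel, a, b)
```
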
In addition, hardware is based purely on boolean logic, and on if/else
constructs. In essence, then, the *entire hardware design* has to be made
according to far simpler rules than "normal" software is expected to conform
to. Even memory accesses in hardware have to be implemented according to
these strict rules (we're *implementing* LOAD and STORE instructions, here,
not *using* those LOAD and STORE instructions).

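Even "memory", at this level, is just more explicit state. A sketch of
the idea (plain Python, nothing like the project's actual memory
pipeline): LOAD and STORE reduce to functions whose inputs and outputs
include the memory contents themselves - there is no hidden global for
the hardware to reach out and touch:

```python
# Illustrative only: LOAD and STORE as pure functions over explicit state.

def load(mem, addr):
    """LOAD: the memory contents are an explicit input, the data an output."""
    return mem[addr]

def store(mem, addr, data):
    """STORE: returns the *new* memory state rather than mutating anything."""
    return mem[:addr] + (data,) + mem[addr + 1:]

mem = (0, 0, 0, 0)           # tiny 4-entry memory, held as immutable state
mem = store(mem, 2, 0xff)    # "STORE 0xff at address 2"
assert load(mem, 2) == 0xff  # a subsequent "LOAD" sees exactly that value
```
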
Consequently, adding in formal proofs is a little bit easier, brings
huge benefits in terms of code readability, reliability, and
time-cost savings, and has the crucial advantage of being aligned with
the overall privacy goal *and* with NLnet's funding remit.

# Summary from the past couple of months

The past few months have seen a *lot* of activity. The IEEE754 ADD unit
has been completed, both as a Finite State Machine (FSM) and as a fully
pipelined design, both of which have parameters that allow them to do
FP16, FP32 or FP64. The DIV unit has been implemented as an FSM, and
will stay that way for now. MUL has been completed; however, it needs to
be turned into an FMAC (3 operands: Multiply and Accumulate).

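Why is an FMAC a distinct unit rather than simply MUL feeding ADD? Because
the fused operation rounds only once. A quick illustration (plain Python
FP64 floats plus Fraction for the exact value - not project code): with
two separate roundings the answer below vanishes entirely, whereas a
single final rounding, as an FMAC performs, keeps it:

```python
# Illustrative only: fused multiply-accumulate (one rounding) versus
# separate multiply then add (two roundings), using IEEE754 FP64.
from fractions import Fraction

a = 1.0 + 2.0**-52
b = 1.0 + 2.0**-52
c = -(1.0 + 2.0**-51)

separate = a * b + c                # two roundings: the tiny term is lost
exact = Fraction(a) * Fraction(b) + Fraction(c)
fused = float(exact)                # one rounding, as an FMAC unit would do

print(separate)                     # 0.0
print(fused)                        # about 4.93e-32 (2**-104), the correct answer
```
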
A provisional pipeline API has been developed, which the IEEE754 FPU
is using. It includes data "funneling" (multiplexing) blocks that allow
for the creation of what Mitch Alsup calls a "Concurrent Computation Unit".
It's basically an array of matched operand latches and result latches
(as input and output, respectively), in front of a single pipeline.
This arrangement allows a batch of operations to be presented to the
CDC6600-style "Dependency Matrices".

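A deliberately simplified, cycle-by-cycle toy model of the idea (plain
Python - not the actual nmigen pipeline API): several operand and result
latch pairs share one pipeline, so a whole batch of operations can be in
flight at once:

```python
# Illustrative toy only: a "Concurrent Computation Unit" as an array of
# operand/result latch pairs in front of a single shared pipeline.  Each
# cycle, at most one latched operand pair enters the pipeline; its result
# pops out PIPE_DEPTH cycles later into the matching result latch.

PIPE_DEPTH = 3

class ToyCCU:
    def __init__(self, num_slots):
        self.opnd = [None] * num_slots      # operand latches (a, b)
        self.result = [None] * num_slots    # result latches
        self.pipe = [None] * PIPE_DEPTH     # the single shared pipeline

    def issue(self, slot, a, b):
        self.opnd[slot] = (a, b)            # latch the operands

    def step(self):
        # drain: a completed operation lands in its result latch
        if self.pipe[-1] is not None:
            slot, value = self.pipe[-1]
            self.result[slot] = value
        # advance the pipeline by one stage
        self.pipe = [None] + self.pipe[:-1]
        # fill: pick one waiting operand latch and send it down the pipe
        for slot, op in enumerate(self.opnd):
            if op is not None:
                a, b = op
                self.pipe[0] = (slot, a + b)  # toy "computation": an add
                self.opnd[slot] = None
                break

ccu = ToyCCU(num_slots=4)
for slot, (a, b) in enumerate([(1, 2), (3, 4), (5, 6), (7, 8)]):
    ccu.issue(slot, a, b)                   # present a whole batch at once
for _ in range(PIPE_DEPTH + 4):             # enough cycles to drain
    ccu.step()
print(ccu.result)                           # [3, 7, 11, 15]
```
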
Jacob has been working on a fascinating design: a dynamically partitionable
Adder and Multiplier Unit. Given that we are doing a Vector Processing
front-end onto SIMD back-end operations, it makes sense to save gates by
allowing the ADD and MUL units to optionally handle a batch
of 8-bit operations, or half the number of 16-bit operations, or a quarter
of the number of 32-bit operations, or one eighth of the number of 64-bit
operations. In this way, far fewer gates are required than if they
were separate units. The unit tests demonstrate that the code that Jacob
has written provides RISC-V mul, mulh, mulhu and mulhsu functionality.

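A minimal model of the partitioning trick (plain Python - not Jacob's
actual nmigen code; real hardware would do this inside a single carry
chain rather than with a loop): the wide adder is treated as independent
lanes simply by refusing to let carries cross the lane boundaries:

```python
# Illustrative model only: a "partitioned" adder.  The same wide adder can
# act as 8x8-bit, 4x16-bit, 2x32-bit or 1x64-bit additions, depending on
# where carry propagation is cut.

def partitioned_add(a, b, lane_bits, total_bits=64):
    """Add a and b lane-by-lane, with no carry across lane boundaries."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift   # wrap within the lane
    return result

a, b = 0x00000000000000ff, 0x0000000000000001

# One 64-bit lane: identical to the ordinary wide adder.
assert partitioned_add(a, b, 64) == (a + b) & (2**64 - 1) == 0x100

# Eight 8-bit lanes: the low byte wraps to 0x00 and the carry is *not*
# allowed to ripple into the neighbouring lane.
assert partitioned_add(a, b, 8) == 0x00

# Because the lanes share one set of gates and partitioning is just a
# matter of gating the carries, this is far cheaper than instantiating
# separate 8/16/32/64-bit units.
```
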
The augmented 6600 Scoreboard took literally six weeks to get Read-after-Write
and Write-after-Read hazards correctly implemented. It required extraordinary
and excruciating patience to get right. Adding in Write-after-Write,
however, took only two days, as the infrastructure to do so had already
been developed.

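For readers unfamiliar with the terminology, here is a compact illustration
(plain Python - nothing like the actual scoreboard logic) of what the
three hazard classes mean, in terms of the registers an earlier and a
later instruction read and write:

```python
# Illustrative only: the three classic data-hazard classes.

def hazards(first_reads, first_writes, second_reads, second_writes):
    found = []
    if first_writes & second_reads:
        found.append("RaW")    # Read-after-Write: the later instruction
                               # needs the earlier instruction's result
    if first_reads & second_writes:
        found.append("WaR")    # Write-after-Read: the later instruction must
                               # not overwrite a register still to be read
    if first_writes & second_writes:
        found.append("WaW")    # Write-after-Write: results must land in order
    return found

# add x3, x1, x2  followed by  sub x4, x3, x1   ->  RaW on x3
print(hazards({"x1", "x2"}, {"x3"}, {"x3", "x1"}, {"x4"}))   # ['RaW']
# add x3, x1, x2  followed by  sub x1, x5, x6   ->  WaR on x1
print(hazards({"x1", "x2"}, {"x3"}, {"x5", "x6"}, {"x1"}))   # ['WaR']
```
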
Currently being implemented is "branch shadowing" - this is not the
same as branch *prediction*: that is a different algorithm which, when
*combined* with "branch shadowing", provides the feature known as branch
*speculation*. This is the source of a lot of confusion about Out-of-Order
designs in general. It seems to be assumed that an OoO design *has* to
have branch speculation: it doesn't. It's just that, given all the
pieces, adding in branch speculation is actually quite straightforward,
and provides such a high performance increase that it is hard to justify
leaving it out.

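The concept, very roughly (a toy sketch in plain Python - nothing like
the real Dependency-Matrix implementation): work issued after an
unresolved branch carries that branch's "shadow"; it may execute, but
cannot commit until the branch resolves. If the branch went the expected
way the shadow simply lifts; if not, everything under the shadow is
cancelled:

```python
# Illustrative toy only: the idea of a branch "shadow".

class ShadowedOp:
    def __init__(self, name, shadowed_by):
        self.name = name
        self.shadowed_by = shadowed_by   # the branch this op is speculative under
        self.done = False                # executed (result latched, not visible)
        self.cancelled = False           # result thrown away

def resolve_branch(ops, branch, prediction_correct):
    for op in ops:
        if op.shadowed_by is branch:
            if prediction_correct:
                op.shadowed_by = None    # shadow lifts: free to commit
            else:
                op.cancelled = True      # shadow "dies": result discarded

branch = object()                        # stand-in for an in-flight branch
ops = [ShadowedOp("add x3,x1,x2", shadowed_by=branch),
       ShadowedOp("mul x5,x3,x4", shadowed_by=branch)]
for op in ops:
    op.done = True                       # speculative execution is allowed

resolve_branch(ops, branch, prediction_correct=False)
print([(op.name, op.cancelled) for op in ops])
# [('add x3,x1,x2', True), ('mul x5,x3,x4', True)] - executed, then discarded
```
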
One huge surprise came out of a recent discussion with Mitch Alsup. It
has been assumed all along that turning this design from single-issue
to multi-issue would be difficult, or would require significant gates and
latency. The "simple" approach to multi-issue, using the Dependency
Matrices, would be to analyse a batch of instructions and, if there are
no overlaps (no registers in common), allow that batch to proceed in
parallel. This is a naive approach.

Mitch pointed out that in his work on the AMD Opteron (the processor
family that AMD had to publish "Intel equivalent" speed numbers for,
because it was so much more efficient and effective than Intel's designs),
each instruction "accumulated" the dependencies of all prior instructions
being issued in the same batch. This works because read and write
dependencies are *transitive* (google it...).

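A simplified sketch of the difference (plain Python - not the actual
Dependency Matrix logic): the naive check refuses to multi-issue the
moment any register is shared, whereas the accumulation approach lets
the whole batch be handed over in one go, with each instruction carrying
the writes of every earlier instruction in the batch as extra
dependencies:

```python
# Illustrative sketch only.  Each instruction is (dest, src1, src2).

batch = [("x3", "x1", "x2"),   # add x3, x1, x2
         ("x5", "x3", "x4"),   # mul x5, x3, x4  (reads x3: depends on insn 0)
         ("x7", "x6", "x6")]   # add x7, x6, x6  (actually independent)

def naive_can_issue_together(batch):
    """All-or-nothing: the whole batch must share no registers at all."""
    seen = set()
    for dest, src1, src2 in batch:
        regs = {dest, src1, src2}
        if regs & seen:
            return False
        seen |= regs
    return True

def accumulate_dependencies(batch):
    """Opteron-style: each instruction also waits on the writes of all
    earlier instructions in the batch.  Conservative, but safe (the
    dependencies are transitive), and it lets the entire batch drop into
    the Dependency Matrices in a single cycle."""
    deps, writes_so_far = [], set()
    for dest, src1, src2 in batch:
        deps.append({src1, src2} | writes_so_far)
        writes_so_far |= {dest}
    return deps

print(naive_can_issue_together(batch))      # False: x3 appears twice
for insn, d in zip(batch, accumulate_dependencies(batch)):
    print(insn, "waits on", sorted(d))
```
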
What that means, in practical terms, is that we have a way to create a
design that could, if ramped up, "take on the big boys". To make that
clear: there's no technical barrier that would prevent us from creating
a quad-issue (or higher) design.

There is still a heck of a lot to get done, here. However, it has to
be said that actually adding an instruction decoder onto the 6600-style
Dependency Matrices is relatively straightforward, this being RISC, after
all. It is possible, then, that we may have a subset of functionality
operational far sooner than anticipated.