updates/019_2019jul16_purism_donation.mdwn

**DRAFT STATUS. last edit 16jul2019**

We are delighted to be able to announce additional sponsorship by
[Purism](http://puri.sm), through [NLnet](http://nlnet.nl).

# Purism Sponsorship

As a Social Purpose Corporation, Purism is empowered to balance ethics, social
enterprise and profitable business. I am delighted that they chose to
fund the Libre RISC-V hybrid CPU/GPU through the NLnet Foundation. Their
donation provides us some extra flexibility in how we reach the goal of
bringing to market a hybrid CPU, VPU and GPU that is libre to the bedrock.

Purism started with a
[Crowd Supply campaign](https://www.crowdsupply.com/purism/librem-15)
to deliver a modern laptop with full software support and a
[coreboot BIOS](https://puri.sm/coreboot/).
I know that, after this initial success, they worked hard to try to
solve the "NSA backdoor coprocessor" issue, known as the
["Management Engine"](https://libreboot.org/faq.html#intelme).
Ironically, inspired by Purism, Intel's internal efforts became moot,
as a third party reverse-engineered an Intel BIOS and discovered the
["nsa\_me\_off\_switch"](https://it.slashdot.org/story/17/08/29/2239231/researchers-find-a-way-to-disable-intel-me-component-courtesy-of-the-nsa)
parameter, designed to be used by the NSA when Intel equipment is deployed
within NSA premises.

Purism then moved quickly to provide a BIOS update to disable this
"feature", eliminating the last and most important barrier to being able
to declare a full privacy software stack.

It is these kinds of brave strategic decisions to buck the trend towards
privacy-invading hardware "by default" for which Purism deserves our
respect and gratitude.

However, just as NLnet recognise, Purism also appreciate that we cannot
stop at just the software.  Profit-maximising Corporations just do not
take the brave decisions that can compromise profits, particularly when
faced with competition: the risk is too great.  This is why being a Social
Purpose Corporation is so critically important.  Socially-responsible
decisions do not get undermined by profit-maximisation.

So we are extremely grateful for their donation, managed through NLnet.

# Progress

So much has happened since the last update that it is hard to know
where to begin.

* The IEEE754 FPU has a simulation-proven FADD pipeline, and FMUL,
  FDIV, FSQRT and FCVT are on the way.
* A RISC-V Reciprocal Square Root FP Opcode has been proposed, which is
  needed for 3D operations, particularly normalisation of vectors.  With
  other RISC-V implementors needing this opcode it makes sense for it
  to be a Standard Extension.
* The SimpleV extension has had a major overhaul, with the addition of a
  single-instruction prefix (P32C, P48 and P64), and a "VBLOCK" format that
  adds Vectorisation Context to a batch of instructions.
* Implementation of the precise-augmented 6600 style scoreboard system has
  begun, with ALU register hazards and shadowing already completed, and
  memory hazards underway.

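To illustrate why the Reciprocal Square Root opcode in the list above matters
for 3D work, here is a minimal Python sketch of vector normalisation. The
names `rsqrt` and `normalise` are purely illustrative and not part of any
proposed spec: `rsqrt` simply stands in for what would be a single FP
instruction.

```python
import math


def rsqrt(x: float) -> float:
    """Reciprocal square root, 1/sqrt(x). Stands in for the proposed opcode."""
    return 1.0 / math.sqrt(x)


def normalise(v):
    """Scale a 3D vector to unit length using one rsqrt and three multiplies."""
    x, y, z = v
    inv_len = rsqrt(x * x + y * y + z * z)
    return (x * inv_len, y * inv_len, z * inv_len)
```

Without such an opcode the same result needs a square root followed by
divisions (or a separate reciprocal), which is why it shows up so often in
3D inner loops.
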
# Multi Issue

Multi Issue is absolutely critical for this CPU/VPU/GPU because the
[SimpleV](https://libre-riscv.org/simple_v_extension/specification)
engine relies on being able to turn one "vector" operation into multiple
"scalar element" instructions, in every cycle. The
simplest way to do this is to throw equivalent scalar opcodes into a
multi issue execution engine, and let the engine sort it out.

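As a rough sketch of that expansion (not the actual hardware or the full
spec), the Python below turns one "vector" add into VL scalar adds. It assumes
all three operands are marked as vectors and that elements live in consecutive
registers; predication and element widths are ignored, and the function names
are made up for illustration.

```python
def expand_vector_add(rd, rs1, rs2, vl):
    """Return the scalar element operations generated by one vector add."""
    return [("add", rd + i, rs1 + i, rs2 + i) for i in range(vl)]


def execute(regs, ops):
    """Run the scalar operations one by one on a simple register file list."""
    for op, rd, rs1, rs2 in ops:
        assert op == "add"  # only adds are modelled in this sketch
        regs[rd] = regs[rs1] + regs[rs2]


# One instruction with VL=4 becomes four scalar adds handed to the issue engine.
regs = list(range(32))
execute(regs, expand_vector_add(rd=8, rs1=16, rs2=24, vl=4))
```
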
So, regarding the Dependency Matrices: thanks to Mitch Alsup's absolutely
invaluable input we now know how to do multi-issue. On top of a precise
6600 style Dependency Matrix it is almost comically trivial.

The key insight that Mitch gave us was that instruction dependencies are
transitive. In other words: if there are 4 instructions to be issued,
the second instruction may have the dependencies of the first added to it;
the 3rd may accumulate the dependencies of the first and second and so on.

Where this trick does not work well (or takes significant hardware to
implement) is when, for example with the Tomasulo Algorithm (or the
original 6600 Q-Table), the Register Dependency Hazards are expressed
in *binary* (r5 = 0b00101, r3 = 0b00011). If instead the registers are
expressed in *unary* (r5 = 0b00100000, r3 = 0b00001000) then it should
be pretty obvious that in a multi issue design, all that is needed in
each clock cycle is to OR the cumulative register dependencies in a
cascading fashion. Aside from now also needing to increase the number of
register ports and other resources to cope with the increased workload,
amazingly that's all it takes!

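The cascading OR can be sketched in a few lines of Python. This is only a
model of the idea, under the assumption that bit n of a mask stands for
register n and that reads and writes are lumped into one mask per instruction
(the real Dependency Matrices track them separately). The function names are
illustrative.

```python
from itertools import accumulate
from operator import or_


def one_hot(reg: int) -> int:
    """Unary (one-hot) encoding: bit n set for register n (assumed convention)."""
    return 1 << reg


def instruction_mask(dest, srcs):
    """OR together the one-hot encodings of every register an instruction uses."""
    mask = one_hot(dest)
    for src in srcs:
        mask |= one_hot(src)
    return mask


def cascade(batch):
    """Running OR over a batch: entry i covers instructions 0..i inclusive."""
    return list(accumulate((instruction_mask(d, s) for d, s in batch), or_))


# Three instructions issued together, each given as (dest, [sources]).
batch = [(5, [3, 1]), (6, [5, 2]), (7, [6, 4])]
for mask in cascade(batch):
    print(format(mask, "08b"))
```
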
To achieve the same trick with a Tomasulo Reorder Buffer (ROB) requires
the addition of an entire extra CAM for every extra issue to be added to
the architecture: four way multi issue would require four ROB CAMs! The
power consumption and gate count would be prohibitive,
and resolving the commits of multiple parallel operations is also fraught.

# SimpleV

What began ironically as "simple" still bears some vestige of its
original name, in that the ISA needs no new opcodes: any scalar RISC-V
implementation may be turned parallel through the addition of SV at the
instruction issue phase.

However, one of the major drawbacks of the initial draft spec was that
the use of CSRs took a huge number of instructions just to set up and
then tear down the vectorisation context.

This had to be solved.

The idea which came to mind was to embed RISC-V opcodes within
a longer, variable-length encoding, which we've called the
[VBLOCK Format](https://libre-riscv.org/simple_v_extension/vblock_format/).
At the beginning of this new format, the vectorisation and predication
context could be embedded, which "changes" the standard *scalar* opcodes
to become "parallel" (multi-issue) operations.

The advantage of this approach is that, firstly, the context is much
smaller: the actual CSR opcodes are gone, leaving only the "data",
which is now batched together. Secondly, there is no need to "reset"
(tear down) the vectorisation context, because that automatically goes
away when the long-format block ends.

The other issue that needed to be fixed is that we really need a
[SETVL](https://libre-riscv.org/simple_v_extension/specification/sv.setvl/)
instruction. This is really unfortunate as it breaks the "no new opcodes"
paradigm.  However, firstly, we are simply going to reuse the RVV
SETVL opcode, now that RVV has reached its last anticipated draft before
ratification.  Secondly: it's not an *actual* instruction related to
elements (it doesn't perform a parallel add, for example).  It's more an
"infrastructure support" instruction.

The reason for needing SETVL is complex. It is down to the fact that,
unlike in RVV, the Maximum Vector Length (MVL) is **not** an architectural
hard design parameter: it is a runtime dynamic one. Thus, it is absolutely
crucial that not only VL is set on every loop (or SV Prefix instruction),
but that MVL is also set.

This means that SV would need two additional instructions for any algorithm,
when compared to RVV, and this kind of penalty is just not acceptable. The
solution therefore was to create a special SV.SETVL opcode that always
takes the MVL as an *additional* parameter over and above those
provided to the RVV equivalent opcode. That basically puts SV on par with
RVV as far as instruction count is concerned.

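A hedged behavioural sketch of the idea follows. It assumes that SV.SETVL
sets MVL and then clamps the requested length to it, giving VL = min(requested,
MVL); the real operand encoding and corner cases are defined by the linked
specification, not by this model. The `state` dictionary and the function
names are hypothetical.

```python
def sv_setvl(state, requested, mvl):
    """Assumed model: set MVL, then VL = min(requested, MVL). Returns VL."""
    state["MVL"] = mvl
    state["VL"] = min(requested, mvl)
    return state["VL"]


def strip_mine(n, mvl):
    """Process n elements in chunks, using a single SV.SETVL per iteration."""
    state = {}
    chunks = []
    done = 0
    while done < n:
        vl = sv_setvl(state, n - done, mvl)  # MVL and VL set in one go
        chunks.append(vl)
        done += vl
    return chunks
```

With `n = 10` and `mvl = 4` the loop runs three times, on 4, 4 and 2
elements, with no separate instruction needed to set MVL.
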
# Fail on First

The other really nice addition, which came with a small reorganisation
of the Vector and Predicate Contexts, is data dependent
["fail on first"](https://libre-riscv.org/simple_v_extension/appendix/#ffirst).

ARM's SVE, RVV, and the Mill Architecture all have an incredibly neat
feature where if data is being loaded from memory in parallel, and the
LD operations run past a page boundary, this may be detected
and the *legal* parallel operations may complete, all without needing
to drop into "scalar" mode.

In the case of the Mill Architecture, this is achieved through the
extremely innovative feature of simply marking the result of the
operation as "invalid", and that "tag" cascades through all subsequent
operations. Thus, any attempts to ADD or STORE the data will result in
the invalid data being simply ignored.

RVV instead detects the point at which the LD became invalid, "fails"
at the "first" such illegal memory access, and truncates all subsequent
vector operations to within that limit, by *changing VL*. This is an
extremely effective and very simple idea, and it was worth adding to SV.

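The VL-truncating behaviour can be modelled roughly as follows. This sketch
treats memory as a Python list and any address beyond its end as illegal,
which is a simplification of real page faults. It also assumes, as in RVV's
fault-only-first loads, that a fault on the very first element still traps
normally. `ffirst_load` is an illustrative name.

```python
def ffirst_load(memory, base, vl):
    """Load up to vl elements; return (loaded_values, new_vl).

    VL is truncated at the first illegal access, keeping the legal elements.
    """
    loaded = []
    for i in range(vl):
        addr = base + i
        if addr >= len(memory):  # illegal access in this toy model
            if i == 0:
                raise MemoryError("first element faulted: trap as normal")
            return loaded, i  # only the elements before the fault count
        loaded.append(memory[addr])
    return loaded, vl


memory = list(range(10))
values, new_vl = ffirst_load(memory, base=7, vl=8)  # runs off the end
```
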
However, when doing so, the idea sprang to mind: why not extend the
"fail on first" concept to not just cover LD/ST operations, but to cover
actual ALU operations as well? Why not, if any of the results from
a sequence of parallel operations is zero ("fail"), similarly truncate VL?

This idea was tested out on strncpy (the canonical function
used to test out data-dependent ISA concepts), and it worked! So, that
is going into SV as well. It does mean that after every ALU operation,
a comparator against zero will be optionally activated: given that it
is optional and under the control of the ffirst bit, it is not a power
penalty on every single instruction.

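A toy Python model of the strncpy case is sketched below. It is not the
assembly that was actually tested. It assumes that VL is truncated so as to
*include* the element that produced zero (so the terminating NUL is copied),
that `src` is NUL-terminated, and it skips the zero-padding that C's `strncpy`
performs. All names are illustrative.

```python
def ffirst_truncate(results, vl):
    """Data-dependent fail-on-first: shrink VL at the first zero result.

    Assumption: the zero element itself stays inside the new VL.
    """
    for i in range(vl):
        if results[i] == 0:
            return i + 1
    return vl


def sv_strncpy(dest, src, n, mvl=4):
    """Copy at most n bytes from NUL-terminated src, up to mvl per iteration."""
    copied = 0
    while copied < n:
        vl = min(n - copied, mvl)
        chunk = src[copied:copied + vl]        # parallel LD of up to vl bytes
        vl = ffirst_truncate(chunk, vl)        # zero byte found: truncate VL
        dest[copied:copied + vl] = chunk[:vl]  # parallel ST of vl bytes
        copied += vl
        if chunk[vl - 1] == 0:                 # NUL copied: string is done
            break
    return copied
```

The loop body contains no per-byte branch: the zero test is folded into the
truncation of VL, which is the point of the extension.
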
# Summary

There is so much to do, and so much that has already been achieved,
it is almost overwhelming. We cannot lose sight of the fact that
there is an enormous amount that we do not yet know, yet at the same
time we must never let that stop us from moving forward. A journey starts
with a first step, and continues with each step.

With help from NLnet and companies like Purism we can look forward
to actually paying people to contribute to solving what was formerly
considered an impossible task.

It is worthwhile emphasising: for any individual or Corporation wishing to
see this project succeed (so that you can use it as the basis for one
of your products, for example), donations through NLnet, as a Registered
Charitable Foundation, are tax deductible.

Likewise, for anyone who would like to help with the project's Milestones,
payments from NLnet are *donations*, and, depending on jurisdiction,
may also be tax deductible (i.e. not classed as "earnings").  If you are
interested to learn more, do get in touch.