updates/020_2019aug28_intriguing_ideas.mdwn

   1 # Intriguing Ideas
   2
   3 Pixilica starts a 3D Open Graphics Alliance initiative;
   4 We decide to go with a "reconfigurable" pipeline;
   5 Seven additional EUR 50,000 NLNet Grant proposals submitted.
   6
   7 # The possibility of a 3D Open Graphics Alliance
   8
   9 {https://youtu.be/HeVz-z4D8os}
  10
  11 At SIGGRAPH 2019 this year there was a very interesting BoF, where the
  12 [idea was put forward]
  13 (https://www.pixilica.com/forum/event/risc-v-graphical-isa-at-siggraph-2019/p-1/dl-5d62b6282dc27100170a4a05)
  14 by Atif, of Pixilica, to use RISC-V as the core
  15 basis of a 3D Embedded flexible GPGPU (hybrid / general purpose GPU).
  16 Whilst the idea of a GPGPU has been floated before (in particular by
  17 ICubeCorp), the reasons *why* were what particularly caught people's
  18 attention at the BoF.
  19
  20 Current 3D GPU designs (from NVIDIA, AMD and Intel) are heavily optimised
  21 for mass-volume appeal. Niche markets, where the profit
  22 opportunities are lower or even negative given the design choices of
  23 the incumbents, are inherently penalised.  Not only that: whilst things are
  24 slowly changing due to ongoing multi-man-year reverse-engineering efforts,
  25 3D driver source code is often proprietary as well.
  26
  27 At the BoF, one attendee described how they are implementing *transparent*
  28 shader algorithms. Most shader hardware provides triangle algorithms that
  29 assume a solid surface. Using such hardware for transparent shaders is a
  30 two-pass process, which comes with an inherent *100%* performance
  31 penalty. If on the other hand they had some input into a new 3D core,
  32 one that was designed to be flexible...
  33
  34 The level of interest was sufficiently high that Atif is reaching out to
  35 people (including our team) to set up a 3D Open Graphics Alliance. The
  36 basic idea is to have people work together to create an appropriate,
  37 efficient "Hybrid CPU/GPU" Instruction Set (ISA) suitable for a diverse
  38 range of architectures and requirements: all the way from small embedded
  39 softcores, to embedded GPUs for use in mobile processors, to HPC servers,
  40 to high-end Machine Learning and Robotics applications.
  41
  42 One interesting thing that has to be made clear - the lesson from
  43 Nyuzi and Larrabee - is that a good Vector Processor does **not**
  44 automatically make a good 3D GPU. Jeff Bush designed Nyuzi very
  45 specifically to replicate the Larrabee team's work: in particular, their
  46 use of a recursive software-based tiling algorithm.  By deliberately
  47 not including custom 3D Hardware Accelerated Opcodes, Nyuzi has only
  48 25% of the performance of a modern GPU consuming the same amount of power.
  49 Put another way: if you want to use a pure Vector Engine to get the same
  50 performance as a commercially-competitive GPU, you need *four times*
  51 the power consumption and four times the silicon area.
  52
  53 Thus we simply cannot use an off-the-shelf Vector extension such as the
  54 upcoming RISC-V Vector Extension, or even SimpleV, and expect to
  55 automatically have a commercially competitive 3D GPU. It takes texture
  56 opcodes, Z-Buffers, pixel conversion, Linear Interpolation, Transcendentals
  57 (sin, cos, exp, log), and much more, all of which has to be designed,
  58 thought through, implemented *and then used behind a suitable API*.
  59
  60 In addition, given that the Alliance is to meet the needs of "unusual"
  61 markets, it is no good creating an ISA that has such a high barrier to
  62 entry and such a power-performance penalty that it inherently excludes
  63 the very implementors it is targeted at, particularly in Embedded markets.
  64
  65 Thus we need a Hybrid Architecture, not just to reduce complexity, not
  66 just to meet Libre criteria, but to meet the long tail of innovation in
  67 3D and kick start some real innovation.
  68 These were the challenges discussed at the first
  69 [meetup](https://www.meetup.com/Bay-Area-RISC-V-Meetup/events/264231095/)
  70 at Western Digital's Milpitas HQ. Experts at the Meetup, from the 3D
  71 Industry who have worked for decades for ATI, NVIDIA and Intel, were
  72 really enthusiastic and praised this approach, saying that it was exactly
  73 the kind of shakeup the 3D Industry needs.
  74
  75 # Reconfigurable Pipelines
  76
  77 Jacob came up with a fascinating idea: a reconfigurable pipeline. The
  78 basic idea behind pipelines is that combinatorial blocks are separated
  79 by latches.  The reason is that when gates are chained together,
  80 there is a ripple effect which has to have time to stabilise. If the
  81 clock is run too fast, computations no longer have time to become valid.
  82
  83 So the solution is to split the combinatorial blocks into shorter chains,
  84 and have "latches" in between them which capture the intermediary
  85 results. This is termed a "pipeline".  Actually it's more like an
  86 escalator.
  87
  88 The problem comes when you want to vary the clock speed: if the
  89 pipeline is long and the clock rate is slow, clearly the latency
  90 (completion time of an instruction) is also long.
  91
  92 Conversely, if the pipeline is short (large numbers of gates connected
  93 together) then as mentioned above, this can inherently limit the maximum
  94 frequency that the processor could run at, because due to the "ripple" effect
  95 in each pipeline stage, a longer chain of gates clearly has to have a longer
  96 time to stabilise.
  97
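The "ripple" constraint can be put into numbers. What follows is a minimal sketch, not a timing model of any real process: it assumes every gate has the same delay, ignores latch setup time and clock skew, and uses made-up figures (the 50 ps gate delay and the gate counts are purely illustrative). It only shows that doubling the chain of gates in a stage halves the highest clock rate that stage can tolerate.

```python
def max_clock_hz(gates_per_stage: int, gate_delay_ps: int) -> float:
    """Upper bound on the clock rate for one stage of chained gates.

    The stage output is only valid once the signal has rippled through
    every gate, so the clock period cannot be shorter than that delay.
    """
    stage_delay_ps = gates_per_stage * gate_delay_ps
    return 1e12 / stage_delay_ps  # 1e12 picoseconds per second


# Hypothetical figures: 50 ps per gate, 10 versus 20 gates per stage.
short_chain_hz = max_clock_hz(gates_per_stage=10, gate_delay_ps=50)
long_chain_hz = max_clock_hz(gates_per_stage=20, gate_delay_ps=50)

print(f"10 gates per stage: {short_chain_hz / 1e9:.1f} GHz at most")
print(f"20 gates per stage: {long_chain_hz / 1e9:.1f} GHz at most")
```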
  98 What if there was a solution which allowed *both* options? What if you
  99 could actually reconfigure the pipeline to be shorter or longer?
 100
 101 It turns out that by using what are termed "transparent latches" it
 102 is possible to do precisely that.  The advantages are enormous and were
 103 described in detail on comp.arch.
 104
 105 Earlier in
 106 [this thread](https://groups.google.com/d/msg/comp.arch/fcq-GLQqvas/SY2F9Hd8AQAJ),
 107 someone kindly pointed out that IBM published
 108 papers on the technique.  Basically, the latches normally present in the
 109 pipeline have an additional combinatorial "bypass" in the form of a
 110 Mux. The output is dynamically selected from either the input *or* the
 111 input after it has been put through a flip-flop. The flip-flop basically
 112 stores (and delays) its input for one clock cycle, or it can be bypassed
 113 i.e. just be another part of that "ripple" effect mentioned earlier.
 114
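The mux-plus-flip-flop arrangement described above can be modelled behaviourally in a few lines. This is a sketch under stated assumptions rather than the project's actual hardware description: the class name `BypassableRegister`, its methods and the helper `run` are all hypothetical, and real hardware would be written in an HDL. With the bypass off, the next stage sees each value one cycle late; with the bypass on, it sees the value in the same cycle.

```python
class BypassableRegister:
    """Behavioural model of a pipeline register with a bypass mux."""

    def __init__(self, bypass: bool = False) -> None:
        self.bypass = bypass
        self.stored = 0  # value captured at the most recent clock edge

    def output(self, data_in: int) -> int:
        # The mux: either pass the input straight through (bypass),
        # or present the value held by the flip-flop.
        return data_in if self.bypass else self.stored

    def clock_edge(self, data_in: int) -> None:
        # The flip-flop captures its input on every clock edge.
        self.stored = data_in


def run(values, bypass):
    """Feed one value per cycle; return what the next stage sees each cycle."""
    reg = BypassableRegister(bypass=bypass)
    seen = []
    for value in values:
        seen.append(reg.output(value))  # output during this cycle
        reg.clock_edge(value)           # edge at the end of the cycle
    return seen


registered = run([1, 2, 3], bypass=False)   # delayed by one cycle
transparent = run([1, 2, 3], bypass=True)   # passes straight through
print(registered, transparent)
```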
 115 By putting these transparent latches on every other combinatorial stage
 116 in the processing chain, the length of the pipeline may be halved, such
 117 that when the clock rate is also halved the *instruction completion time
 118 remains the same*.
 119
 120 As described earlier, normally if the processor speed were lowered it
 121 would have an adverse impact on instruction latency.  With the transparent
 122 latches bypassed and with plenty of time to stabilise at the lower speed,
 123 two back-to-back stages now comprise a *single* pipeline stage, and thus,
 124 even if the processor speed is halved,
 125 *so is the length of the overall pipeline* and thus the instruction
 126 completion time remains the same.
 127
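The claim that completion time stays the same is simple arithmetic: latency is the number of stages multiplied by the clock period. The sketch below checks it with hypothetical figures (a 10-stage pipeline, and an exact halving from 1500 MHz to 750 MHz); these numbers are assumptions chosen to make the halving exact, not the project's actual pipeline depth or clock targets.

```python
from fractions import Fraction


def latency_ns(stages: int, clock_mhz: int) -> Fraction:
    """Time for one instruction to traverse the pipeline, in nanoseconds."""
    period_ns = Fraction(1000, clock_mhz)  # one clock period
    return stages * period_ns


# Hypothetical full-speed configuration: every latch active.
fast = latency_ns(stages=10, clock_mhz=1500)

# Clock halved with every other latch bypassed: stage pairs merge into one.
slow_merged = latency_ns(stages=5, clock_mhz=750)

# Clock halved with the pipeline left at full length, for comparison.
slow_unmerged = latency_ns(stages=10, clock_mhz=750)

print(float(fast), float(slow_merged), float(slow_unmerged))
```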
 128 It's a fantastic idea that will allow us to reconfigure the processor
 129 either to reach a 1.5GHz clock rate for high performance bursts, or to
 130 run at 800MHz in reduced power mode.
 131
 132 # NLNet Funding proposals
 133
 134 The next step is to put in half a dozen NLNet Funding proposals. No,
 135 literally:
 136 [seven new proposals](https://libre-riscv.org/nlnet_proposals/),
 137 each for EUR 50,000. One for gcc, one for a port of MESA RADV to the
 138 new processor, another for writing experimental assembly code to go into
 139 libswscale, libx264 etc. ultimately for use in VLC and ffmpeg and so on.
 140
 141 Best of all, two for actually doing a test ASIC: one working with
 142 [chips4makers](http://chips4makers.io/blog), the other with
 143 [lip6.fr](https://www-soc.lip6.fr/en/). It turns out that 180nm ASIC shuttle
 144 services cost only USD 600 per square mm, and we can get away with around
 145 20 sq.mm, which is about USD 12,000 for an estimated 800,000 gates.
 146
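The shuttle figures quoted above can be checked with a few lines of arithmetic. The inputs are the numbers from the text; the derived per-area and per-gate values are only rough illustrations, since both the area and the gate count are estimates.

```python
COST_PER_MM2_USD = 600      # quoted 180nm shuttle price per square mm
AREA_MM2 = 20               # estimated die area
ESTIMATED_GATES = 800_000   # estimated gate count

total_cost_usd = COST_PER_MM2_USD * AREA_MM2
gates_per_mm2 = ESTIMATED_GATES // AREA_MM2
cost_per_gate_usd = total_cost_usd / ESTIMATED_GATES

print(f"Total shuttle cost: USD {total_cost_usd:,}")
print(f"Implied density: {gates_per_mm2:,} gates per sq.mm")
print(f"Implied cost: USD {cost_per_gate_usd:.3f} per gate")
```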
 147 At that low cost, we can iterate before going to lower geometries plus
 148 actually have something which, even at 350MHz, if it was dual issue,
 149 would be a reasonable saleable product in its own right.  The only thing
 150 we have to watch out for, there, is that it will be a bit of a monster
 151 so power consumption is going to be high at 350MHz. Still, for a first
 152 ASIC ever, it's just exciting to think that it's possible at all.
 153
 154 Regarding the NLNet proposals: we need people! In particular, we need two
 155 EU Citizens to come forward, to satisfy NLNet's backers' requirements:
 156 thanks to [NGI.eu](https://ngi.eu), NLNet has received its money under
 157 the EU Horizon 2020 Programme, so at least one EU Citizen has to be
 158 part of the proposal. One for gcc, another for the MESA/RADV port.
 159 Please do contact me for details. There's no contract or obligation,
 160 because these are charitable donations.
 161
 162 In addition, if anyone wants to receive tax deductible charitable
 163 donations direct from NLNet for working on aspects of this project,
 164 do get in touch: there is plenty to do.  Application reviews start in two
 165 weeks; we will hear from NLNet by December as to what has been approved,
 166 and will be able to expand the project scope around January 2020.
 167
 168 Also remember, if you work for a Corporation that could financially
 169 benefit from this project being a reality, sponsorship, via NLNet,
 170 is tax deductible because it is a charitable donation.
 171
 172 (Update: covered in a
 173 [Slashdot](https://hardware.slashdot.org/story/19/09/29/1845252/libre-risc-v-3d-cpugpu-seeks-grants-for-ambitious-expansion#comments)
 174 article)