updates/020_2019aug28_intriguing_ideas.mdwn

Intriguing Ideas

Pixilica starts a 3D Open Graphics Alliance initiative;
We decide to go with a "reconfigurable" pipeline;

# The possibility of a 3D Open Graphics Alliance

At SIGGRAPH 2019 this year there was a very interesting BoF, where the
[idea was put forward](https://www.pixilica.com/forum/event/risc-v-graphical-isa-at-siggraph-2019/p-1/dl-5d62b6282dc27100170a4a05)
by Atif, of Pixilica, to use RISC-V as the core
basis of a 3D Embedded flexible GPGPU (hybrid / general purpose GPU).
Whilst the idea of a GPGPU has been floated before (in particular by
ICubeCorp), the reasons *why* were what particularly caught people's
attention at the BoF.

The current 3D GPU designs (NVIDIA, AMD, Intel) are heavily optimised
for mass-volume appeal. Niche markets, by virtue of the profit
opportunities being lower or even negative given the design choices of
the incumbents, are inherently penalised. Not only that, but the source
code of the 3D engines is proprietary, meaning that anything outside of
what is dictated by the incumbents is out of the question.

At the BoF, one attendee described how they are implementing *transparent*
shader algorithms. Most shader hardware provides triangle algorithms that
assume a solid surface. Using such hardware for transparent shaders is a
two-pass process, which clearly comes with an inherent *100%* performance
penalty. If on the other hand they had some input into a new 3D core,
one that was designed to be flexible...

The level of interest was sufficiently high that Atif is reaching out to
people (including our team) to set up an Open 3D Graphics Alliance. The
basic idea is to have people work together to create an appropriate,
efficient "Hybrid CPU/GPU" Instruction Set Architecture (ISA) suitable
for a diverse range of architectures and requirements: all the way from
small embedded softcores, to embedded GPUs for use in mobile processors,
to HPC servers, to high-end Machine Learning and Robotics applications.

One interesting thing that has to be made clear - the lesson from
Nyuzi and Larrabee - is that a good Vector Processor does **not**
automatically make a good 3D GPU. Jeff Bush designed Nyuzi very
specifically to replicate the Larrabee team's work: in particular, their
use of a recursive software-based tiling algorithm. By deliberately
not including custom 3D Hardware Accelerated Opcodes, Nyuzi has only
25% of the performance of a modern GPU consuming the same amount of power.
Put another way: if you want to use a pure Vector Engine to get the same
performance as a commercially-competitive GPU, you need *four times*
the power consumption and four times the silicon area.

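The "four times" figure follows directly from the 25% figure, provided performance is assumed to scale linearly with power and silicon area (for instance by replicating cores). That linear-scaling assumption is made here purely for illustration, and the function name is hypothetical:

```python
def scale_to_match(relative_perf_at_equal_power: float) -> float:
    """Factor by which power (and area) must grow to reach parity.

    ASSUMPTION: performance scales linearly with power and area,
    e.g. by replicating cores. Real designs rarely scale this cleanly.
    """
    if not 0.0 < relative_perf_at_equal_power <= 1.0:
        raise ValueError("relative performance must be in (0, 1]")
    return 1.0 / relative_perf_at_equal_power


# 25% of the performance at equal power, as quoted for Nyuzi above.
factor = scale_to_match(0.25)
print(f"power and area multiplier: {factor:.1f}x")
```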
Thus we simply cannot use an off-the-shelf Vector extension such as the
upcoming RISC-V Vector Extension, or even SimpleV, and expect to
automatically have a commercially competitive 3D GPU. It takes texture
opcodes, Z-Buffers, pixel conversion, Linear Interpolation, Transcendentals
(sin, cos, exp, log), and much more, all of which have to be designed,
thought through, implemented *and then used behind a suitable API*.

In addition, given that the Alliance is to meet the needs of "unusual"
markets, it is no good creating an ISA that has such a high barrier to
entry and such a power-performance penalty that it inherently excludes
the very implementors it is targeted at, particularly in Embedded markets.

These are the challenges to be discussed at the upcoming first
[meetup](https://www.meetup.com/Bay-Area-RISC-V-Meetup/events/264231095/)
at Western Digital's Milpitas HQ.

https://youtu.be/HeVz-z4D8os

# Reconfigurable Pipelines

Jacob came up with a fascinating idea: a reconfigurable pipeline. The
basic idea behind pipelines is that combinatorial blocks are separated
by latches. The reason is that when gates are chained together,
there is a ripple effect which has to have time to stabilise. If the
clock is run too fast, computations no longer have time to become valid.

So the solution is to split the combinatorial blocks into shorter chains,
and have "latches" in between them which capture the intermediate
results. This is termed a "pipeline". Actually it's more like an
escalator.

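As a rough illustration of why shorter chains permit a faster clock, the sketch below treats the clock period as bounded by the slowest stage's settling time. The 8 ns total delay is a made-up figure, the function names are hypothetical, and latch overhead (setup and clock-to-output time) is deliberately ignored:

```python
def max_clock_hz(stage_delays_ns):
    """Highest usable clock: the period must cover the slowest stage."""
    return 1e9 / max(stage_delays_ns)


def split_evenly(total_delay_ns, stages):
    """Idealised split of one long chain into equal-delay stages."""
    return [total_delay_ns / stages] * stages


TOTAL_DELAY_NS = 8.0  # hypothetical settling time of the unsplit logic

for stages in (1, 4, 8):
    delays = split_evenly(TOTAL_DELAY_NS, stages)
    print(f"{stages} stage(s): {max_clock_hz(delays) / 1e6:.0f} MHz max")
```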
The problem comes when you want to vary the clock speed. If the pipeline
is long and the clock rate is slow, the latency (completion time of an
instruction) is also long.

Conversely, if the pipeline is short (large numbers of gates chained
together in each stage) then, as mentioned above, this inherently limits
the maximum frequency that the processor can run at.

What if there was a solution which allowed *both* options? What if you
could actually reconfigure the pipeline to be shorter or longer?

It turns out that by using what are termed "transparent latches",
it is possible to do precisely that. The advantages are enormous and were
described in detail on comp.arch:

https://groups.google.com/d/msg/comp.arch/fcq-GLQqvas/SY2F9Hd8AQAJ

Earlier in that thread, someone kindly pointed out that IBM published
papers on the technique. Basically, the latches normally present in the
pipeline have a combinatorial "bypass" in the form of a Mux. The output
is dynamically selected from either the input *or* the input after it
has been put through a flip-flop. The flip-flop basically stores (and
delays) its input for one clock cycle.

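A minimal behavioural sketch of the Mux-plus-flip-flop arrangement described above may help. This is a software model only, not HDL from the project, and the class and method names are hypothetical:

```python
class BypassableLatch:
    """Flip-flop with a combinatorial bypass selected by a Mux."""

    def __init__(self, bypass=False, reset_value=0):
        self.bypass = bypass        # Mux select: True = pass input straight through
        self._stored = reset_value  # value held by the flip-flop

    def output(self, data_in):
        """Mux: either the live input, or the input as captured earlier."""
        return data_in if self.bypass else self._stored

    def clock(self, data_in):
        """Clock edge: the flip-flop captures its input."""
        self._stored = data_in


registered = BypassableLatch(bypass=False)
registered.clock(5)                      # capture 5 on this edge
delayed = registered.output(7)           # still 5: delayed by one cycle

transparent = BypassableLatch(bypass=True)
transparent.clock(5)
immediate = transparent.output(7)        # 7: the flip-flop is bypassed
```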
By putting these transparent latches on every other combinatorial stage
in the processing chain, the length of the pipeline may be halved, such
that when the clock rate is also halved the *instruction completion time
remains the same*.

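The claim can be checked with a small cycle-level model. The sketch below is an idealised simulation under stated assumptions: four made-up combinatorial blocks, a hypothetical 1 ns settling time per block, zero latch overhead, and a clock period equal to the longest unbroken chain. It counts clock edges until the final register holds the result, first with every latch clocked and then with every other latch made transparent:

```python
def cycles_to_complete(stage_fns, bypass, value):
    """Count clock edges until the final register holds the result.

    stage_fns[i] is combinatorial block i; bypass[i] says whether the
    latch after block i is transparent (True) or clocked (False).
    """
    n = len(stage_fns)
    assert len(bypass) == n and not bypass[-1], "final latch must be clocked"
    regs = [None] * n              # None marks "no valid data yet"
    cycles = 0
    while regs[-1] is None:
        x = value                  # input is held steady at the front
        new_regs = []
        for i in range(n):
            y = stage_fns[i](x) if x is not None else None
            new_regs.append(y)
            # Mux: forward the fresh value, or the one stored on an earlier edge.
            x = y if bypass[i] else regs[i]
        regs = new_regs            # clock edge: all flip-flops capture at once
        cycles += 1
    return cycles, regs[-1]


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
STAGE_DELAY_NS = 1.0  # hypothetical settling time of one block

full_cycles, full_result = cycles_to_complete(stages, [False] * 4, 5)
half_cycles, half_result = cycles_to_complete(stages, [True, False, True, False], 5)

full_latency_ns = full_cycles * STAGE_DELAY_NS         # full clock rate
half_latency_ns = half_cycles * (2 * STAGE_DELAY_NS)   # half clock, half the stages
naive_latency_ns = full_cycles * (2 * STAGE_DELAY_NS)  # half clock, no bypass
print(full_latency_ns, half_latency_ns, naive_latency_ns)
```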
Normally, if the processor speed were lowered, it would have an adverse
impact on instruction latency.

It's a fantastic idea that will allow us to reconfigure the processor
to reach a 1.5GHz clock rate for high performance bursts.