# RISC-V 3D GPU

See [[libre_3d_gpu]]

at FOSDEM 2018, when Yunsup and the team announced the U540, there was
some discussion about this: it was one of the questions asked.  one of
the possibilities raised there was that maddog was heading something:
i've looked for that effort, and have not been able to find it [jon is
getting quite old, now, bless him.  he had to have an operation last
year.  he's recovered well].

also, at the Barcelona Conference, i mentioned in the
very-very-very-rapid talk on the Libre RISC-V chip that i have been
tasked with that, if there is absolutely no other option, it will use
Vivante GC800 (and, obviously, use etnaviv).  what *that* means is
that there's a definite budget of USD $250,000 available which the
(anonymous) sponsor is definitely willing to spend... so if anyone can
come up with an alternative that is entirely libre and open, i can put
that initiative to the sponsor for evaluation.

basically i've been looking at this for several months, and have been
talking to various people (jeff bush from nyuzi [1] and chiselgpu [3],
frank from gplgpu [2], VRG for MIAOW [4]) to get a feel for what would
be involved.

* miaow is just an OpenCL engine that is compatible with a subset of
  AMD/ATI's OpenCL assembly code.  it is NOT a GPU.  they have
  preliminary plans to *make* one... however the development process is
  not open.  we'll hear about it if and when it succeeds, probably as
  part of a published research paper.

* nyuzi is a *modern* "software shader / renderer" and is a
  replication of the intel larrabee architecture.  it explored the
  concept of doing recursive software-driven rasterisation (as did
  larrabee), whereas hardware rasterisation uses brute force and often
  wastes time and power.  jeff went to a lot of trouble to find out
  *why* intel's researchers were um "not permitted" to actually put
  performance numbers into their published papers.  he found out why :)
  one of the main facts that jeff's research reveals (and there are a
  lot of them) is that most of the energy of a GPU is spent getting data
  each way past the L2/L1 cache barrier.  a second is that much of the
  time (if doing software-only rendering) you need several instruction
  cycles where a hardware design would issue one instruction and let a
  separate pipeline take over (see videocore-iv below).

* chiselgpu was an additional effort by jeff to create the absolute
  minimum required tile-based "triangle renderer" in hardware, for
  comparative purposes in the nyuzi raster engine research.  synthesis
  of such a block, he pointed out to me, would actually be *enormous*,
  despite appearances from how little code there is in the chiselgpu
  repository.  in his paper he mentions that, the majority of the time
  when such hardware-renderers are deployed, the rest of the GPU really
  struggles to keep the hardware-rasteriser fed, so you have to put in
  multiple threads, and that brings its own problems.  it's all in the
  paper, it's fascinating stuff.

* gplgpu was done by one of the original developers of the "Number
  Nine" GPU, and is based around a "fixed function" design.  as such it
  is no longer considered suitable for use in the modern 3D developer
  community (they hate having to code for it), and its performance would
  be *really* hard to optimise and extend.  however in speaking to jeff,
  who analysed it quite comprehensively, he said that there were a large
  number of features (4-tuple floating-point colour to 16/32-bit ARGB
  fixed functions) that have retained a presence in modern designs, so
  it's still useful for inspiration and analysis purposes.  you can see
  jeff's analysis here [7]

* an extremely useful resource has been the videocore-iv project [8]
  which has collected documentation and part-implemented compiler tools.
  the architecture is quite interesting: it's a hybrid of a
  Software-driven Vector architecture similar to Nyuzi plus
  fixed-functions on separate pipelines, such as that "take 4-tuple FP,
  turn it into fixed-point ARGB and overlay it into the tile"
  instruction.  that's done as a *single* instruction to cover, i think,
  4 pixels, where Nyuzi requires an average of 4 cycles per pixel.  the
  other thing about videocore-iv is that there is a separate internal
  "scratch" memory area of size 4x4 (x32-bit) which is the "tile" area,
  and focussing on filling just that is one of the things that saves
  power.  jeff did a walkthrough, you can read it here [10] [11]

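to make the "take 4-tuple FP, turn it into fixed-point ARGB and put it
into the tile" operation a bit more concrete, here is a minimal python
sketch.  it is an illustration only, not the actual behaviour of
videocore-iv or gplgpu: the names `fp_to_argb` and `store_pixels`, the
8-bits-per-channel packing order, the clamping and the rounding are all
assumptions, and it does a plain store rather than a blend.

```python
TILE_W = TILE_H = 4  # 4x4 tile of 32-bit words, the size mentioned above


def fp_to_argb(r, g, b, a):
    """Pack four floats (nominally 0.0..1.0) into one 32-bit ARGB word.

    Assumed layout: alpha in bits 31..24, then red, green, blue.
    """
    def to_u8(x):
        # clamp to 0.0..1.0, then scale and round to an 8-bit value
        return int(round(min(max(x, 0.0), 1.0) * 255.0))

    return (to_u8(a) << 24) | (to_u8(r) << 16) | (to_u8(g) << 8) | to_u8(b)


def store_pixels(tile, x, y, colours):
    """Write up to 4 horizontally adjacent pixels into the tile in one call.

    Hypothetical stand-in for a fixed-function instruction that handles
    several pixels at once; this version overwrites, it does not blend.
    """
    for i, (r, g, b, a) in enumerate(colours):
        tile[y][x + i] = fp_to_argb(r, g, b, a)


tile = [[0] * TILE_W for _ in range(TILE_H)]
opaque_red = (1.0, 0.0, 0.0, 1.0)
store_pixels(tile, 0, 0, [opaque_red] * 4)
```

in a software-only renderer every step inside `fp_to_argb` (clamp,
scale, round, shift, OR) is a separate instruction per pixel; the point
of the fixed-function version is that the whole of `store_pixels`
is issued once.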
so on this basis i have been investigating a couple of proposals for
RISC-V extensions: one is Simple-V [9] and the other is a *small*
general-purpose memory-scratch area extension, which would be
accessible only on the *other* side of the L1/L2 cache area and *ONLY*
accessible by an individual core [or its hyperthreads].  small would
be essential because, if a context-switch occurs, it would be necessary
to swap the scratch-area out to main memory (and back).
general-purpose so that it's useful and useable in other contexts and
situations.

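a rough feel for why the scratch area has to be small: everything in it
has to be written out and read back on a context-switch.  the sketch
below just does that arithmetic.  the 4x4 (x32-bit) size is borrowed
from the videocore-iv tile described above purely as an example; the
64x64 size and the function names are hypothetical, and no size has
actually been decided for the proposed extension.

```python
def scratch_bytes(width, height, bits_per_entry=32):
    """Size of a width x height scratch area, in bytes."""
    return width * height * bits_per_entry // 8


def switch_traffic_bytes(width, height, bits_per_entry=32):
    """Bytes moved per context-switch: saved once out, restored once back."""
    return 2 * scratch_bytes(width, height, bits_per_entry)


small = switch_traffic_bytes(4, 4)    # videocore-iv sized tile
large = switch_traffic_bytes(64, 64)  # hypothetical much larger area
```

with these numbers the small area costs 128 bytes of memory traffic per
switch and the large one 32768, all of which has to cross the cache
barrier that the scratch area was supposed to avoid.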
whilst there are many additional justifications that make it
attractive for *general-purpose* usage (such as accidentally
providing LD.MULTI and ST.MULTI for context-switching and efficient
function call parameter stack storing, and an accidental
single-instruction "memcpy" and "memzero"), the primary driver behind
Simple-V has been as the basis for turning RISC-V into an
embedded-style (low-power) GPU (and also a VPU).

one of the things that's lacking from
[RVV](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc)
is parallelisation of
Bit-Manipulation.  RVV has been primarily designed based on input from
the Supercomputer community, and as such it's *incredible*.
absolutely amazing... but only desirable to implement if you need to
build a Supercomputer.

Simple-V i therefore designed to parallelise *everything*.  custom
extensions, future extensions, current extensions, current
instructions, *everything*.  RVV, once it's been implemented in gcc
for example, would require heavy customisation to support e.g.
Bit-Manipulation: special Bit-Manipulation Vector instructions would
have to be added *to RVV*... all of which would need to AGAIN go
through the Extension Proposal process... you can imagine how that
would go, and the subsequent cost of maintaining gcc, binutils and so
on as a long-term interim fork or (if the extension to RVV is not
accepted, after all the hard work) even a permanent hard-fork.

in other words once you've been through the "Extension Proposal
Process" with Simple-V, it never needs to be done again: not for one
single parallel / vector / SIMD instruction, ever.

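the "parallelise everything, once" idea can be shown with a toy
register-file model in python.  this is a sketch of the general
concept only and is NOT the Simple-V specification: the `vectorise`
wrapper, the `vl` parameter, the 32-entry register file and the
stand-in `andn` bit-manipulation op are all assumptions made for
illustration.

```python
def vectorise(scalar_op):
    """Wrap a scalar two-operand op so that it repeats over `vl` registers.

    The wrapped op knows nothing about vectors: the repetition is
    applied from outside, the same way for every op.
    """
    def run(regs, rd, rs1, rs2, vl=1):
        for i in range(vl):
            # results are truncated to 32 bits, like a 32-bit register
            regs[rd + i] = scalar_op(regs[rs1 + i], regs[rs2 + i]) & 0xFFFFFFFF
    return run


# an existing scalar instruction...
add = vectorise(lambda a, b: a + b)
# ...and a stand-in for a later "bit-manipulation" extension op
andn = vectorise(lambda a, b: a & ~b)

regs = [0] * 32
regs[8:12] = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]

add(regs, 24, 8, 16, vl=4)    # regs[24..27] = regs[8..11] + regs[16..19]
andn(regs, 28, 16, 8, vl=4)   # regs[28..31] = regs[16..19] & ~regs[8..11]
```

note that `andn` got its parallel form for free: nothing vector-specific
had to be written (or proposed) for it, which is the whole argument.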
that would include for example creating a fixed-function 3D "FP to
ARGB" custom instruction.  a custom extension with special 3D
pipelines would not, with Simple-V, also have to worry about how those
operations would be parallelised.

this is not a new concept: it's borrowed directly from videocore-iv
(which in turn probably borrowed it from somewhere else).
videocore-iv calls it "virtual parallelism".  the Vector Unit
*actually* has a 4-wide FPU for certain heavily-used operations such
as ADD, and a ***ONE*** wide FPU for less-used operations such as
RECIPSQRT.

however at the *instruction* level each of those operations,
regardless of whether it's heavily-used or less-used, *appears* to be
16 parallel operations all at once, as far as the compiler and
assembly writers are concerned.  Simple-V just borrows this exact same
concept and lets implementors decide where to deploy it, to best
advantage.

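a small python sketch of "virtual parallelism" as described above: the
instruction always looks 16 wide, and the hardware takes however many
passes it needs.  the 16 / 4 / 1 widths come from the text; everything
else (the `run_virtual` name, counting one pass per group of lanes,
ignoring pipelining and latency) is a simplifying assumption and not a
model of the real videocore-iv pipeline.

```python
def run_virtual(op, a, b, physical_lanes):
    """Apply `op` element-wise, `physical_lanes` elements per pass.

    Returns (results, passes).  One pass stands in for one trip through
    the hardware unit.
    """
    results = []
    passes = 0
    for start in range(0, len(a), physical_lanes):
        chunk_a = a[start:start + physical_lanes]
        chunk_b = b[start:start + physical_lanes]
        results.extend(op(x, y) for x, y in zip(chunk_a, chunk_b))
        passes += 1
    return results, passes


VIRTUAL_WIDTH = 16  # what the compiler and assembly writer see
a = [float(i + 1) for i in range(VIRTUAL_WIDTH)]
b = [1.0] * VIRTUAL_WIDTH

# heavily-used op on a 4-wide unit
add_out, add_passes = run_virtual(lambda x, y: x + y, a, b, physical_lanes=4)
# less-used op (reciprocal square root, second operand ignored) on a 1-wide unit
rsq_out, rsq_passes = run_virtual(lambda x, _: x ** -0.5, a, b, physical_lanes=1)
```

both calls produce 16 results from what is, to the caller, a single
operation; only the number of passes (4 versus 16) differs, and that is
an implementation decision hidden below the instruction level.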
> 2. If it’s a good idea to implement, are there any projects currently
> working on it?

i haven't been able to find any: if you do, please let me know.  i
would like to speak to them and find out how much time and money they
would need to complete the work.

> If the answer is yes, would you mind mention the project’s name and
> website?
>
> If the answer is no, are there any special reasons that nobody not
> implement it yet?

it's damn hard, it requires a *lot* of resources, and if the idea is
to make it entirely libre-licensed and royalty-free there is an extra
step required which a proprietary GPU company would not normally do,
and that is to follow the example of the BBC when they created their
own Video CODEC called Dirac [5].

what the BBC did there was create the algorithm *exclusively* from
prior art and expired patents... they applied for their own patents...
and then *DELIBERATELY* let them lapse.  the way that the patent
system works, the patents will *still be published*: there will be an
official priority filing date in the patent records with the full text
and details of the patents.

this strategy, where you MUST actually pay for the first filing
otherwise the records are REMOVED and never published, acts as a way
of preventing and prohibiting unscrupulous people from grabbing the
whitepapers and source code, and trying to patent details of the
algorithm themselves, just like Google did very recently [6].

* [0] https://www.youtube.com/watch?v=7z6xjIRXcp4
* [1] https://github.com/jbush001/NyuziProcessor/wiki
* [2] https://github.com/asicguy/gplgpu
* [3] https://github.com/jbush001/ChiselGPU/
* [4] http://miaowgpu.org/
* [5] https://en.wikipedia.org/wiki/Dirac_(video_compression_format)
* [6] https://yro.slashdot.org/story/18/06/11/2159218/inventor-says-google-is-patenting-his-public-domain-work
* [7] https://jbush001.github.io/2016/07/24/gplgpu-walkthrough.html
* [8] https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-Programmers-Manual
* [9] libre-riscv.org/simple_v_extension/
* [10] https://jbush001.github.io/2016/03/02/videocore-qpu-pipeline.html
* [11] https://jbush001.github.io/2016/02/27/life-of-triangle.html
* OpenPiton https://openpiton-blog.princeton.edu/2018/11/announcing-openpiton-with-ariane/

# News Articles

* <https://hub.packtpub.com/a-libre-gpu-effort-based-on-risc-v-rust-llvm-and-vulkan-by-the-developer-of-an-earth-friendly-computer/>
* <https://riscv.org/2018/10/packt-hub-article-a-libre-gpu-effort-based-on-risc-v-rust-llvm-and-vulkan-by-the-developer-of-an-earth-friendly-computer/>
* <https://www.reddit.com/r/RISCV/comments/9jts9t/theres_a_new_libre_gpu_effort_building_on_riscv/>
* <https://www.linux.com/blog/2018/11/risc-v-linux-development-full-swing>
* <https://www.phoronix.com/scan.php?page=news_item&px=Libre-GPU-RISC-V-Vulkan>
* <https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html>
* <https://news.ycombinator.com/item?id=18094734>
* <http://www.tuxmachines.org/node/116004>
* <https://linuxfr.org/users/martoni/journaux/risc-v-est-pret-pour-le-desktop>