shakti/m_class/libre_3d_gpu.mdwn

# Libre 3D GPU Requirements

See [[3d_gpu/microarchitecture]]

## GPU capabilities

Based on GC800 the following would be acceptable performance (as would
Mali-400):

* 35 million triangles/sec
* 325 million pixels/sec
* 6 GFLOPS

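To put these throughput figures in per-frame terms, here is a small worked calculation. The 60 frames/sec refresh rate, the 1280x720 resolution and the 4 bytes per pixel are illustrative assumptions, not part of the requirements above.

```python
# Per-frame budgets implied by the target throughput figures.
# ASSUMPTIONS (not from the requirements): 60 fps, 1280x720, 32-bit pixels.

TRIANGLES_PER_SEC = 35_000_000
PIXELS_PER_SEC = 325_000_000

FPS = 60                    # assumed refresh rate
WIDTH, HEIGHT = 1280, 720   # assumed display resolution
BYTES_PER_PIXEL = 4         # assumed 32-bit colour


def per_frame_budget(fps=FPS, width=WIDTH, height=HEIGHT):
    """Return (triangles per frame, pixels per frame, overdraw factor)."""
    triangles = TRIANGLES_PER_SEC / fps
    pixels = PIXELS_PER_SEC / fps
    # How many times over the whole screen could be filled each frame.
    overdraw = pixels / (width * height)
    return triangles, pixels, overdraw


def fill_bandwidth_bytes_per_sec(bytes_per_pixel=BYTES_PER_PIXEL):
    """Write bandwidth if every filled pixel is written to memory once."""
    return PIXELS_PER_SEC * bytes_per_pixel


if __name__ == "__main__":
    tris, pix, overdraw = per_frame_budget()
    print(f"triangles/frame: {tris:,.0f}")
    print(f"pixels/frame:    {pix:,.0f}")
    print(f"overdraw factor: {overdraw:.2f}")
    print(f"fill bandwidth:  {fill_bandwidth_bytes_per_sec() / 1e9:.2f} GB/s")
```

Under these assumptions the target works out to roughly 583,000 triangles per frame, and enough fill rate to cover a 720p screen nearly six times over per frame.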
## GPU size and power

* Basically the power requirement should be at or below around 1 watt
  in 40nm. Beyond 1 watt it becomes... difficult.
* Size is not particularly critical as such but should not be insane.

Based on GC800 the following would be acceptable area in 40nm:

* 1.9mm^2 synthesis area
* 2.5mm^2 silicon area.

So here's a table showing embedded cores:

<https://www.cnx-software.com/2013/01/19/gpus-comparison-arm-mali-vs-vivante-gcxxx-vs-powervr-sgx-vs-nvidia-geforce-ulp/>

Silicon area corresponds *ROUGHLY* with power usage, but PLEASE do
not take that as absolute, because if you read Jeff's Nyuzi 2016 paper
you'll see that getting data through the L1/L2 cache barrier is by far
and away the biggest eater of power.

Note, lower down in that table, that the numbers for Mali-400 are for
the *4* core version - Mali-400 (MP4) - whereas Jeff and I compared
against a SINGLE CORE Mali-400 and discovered that Nyuzi, if 4 parallel
Nyuzi cores were put together, would reach only 25% of Mali-400's
performance (in about the same silicon area).

## Other

* The deadline is about 12-18 months.
* It is highly recommended to use Gallium3D for the software stack
  (see below if deciding whether to use Nyuzi or RISC-V or other)
* Software must be licensed under LGPLv2+ or BSD/MIT.
* Hardware (RTL) must be licensed under BSD or MIT with no
  "NON-COMMERCIAL" CLAUSES.
* Any proposals will be competing against Vivante GC800 (using the
  Etnaviv driver).
* The GPU is integrated (like Mali-400). So all that the GPU needs
  to do is write to an area of memory (framebuffer or area of the
  framebuffer). The SoC - which in this case has a RISC-V core and has
  peripherals such as the LCD controller - will take care of the rest.
* In this architecture, the GPU, the CPU and the peripherals are all on
  the same AXI4 shared memory bus. They all have access to the same shared
  DDR3/DDR4 RAM. So as a result the GPU will use AXI4 to write directly
  to the framebuffer and the rest will be handled by the SoC.
* The job must be done by a team that shows sufficient expertise to
  reduce the risk.

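The integration model in the list above (the GPU's only output is pixels written into shared memory, which the LCD controller then scans out) can be illustrated in software. This is a minimal sketch only: the `Framebuffer` class, the linear ARGB8888 layout, the little-endian stores and the stride handling are all hypothetical choices made for illustration, since the actual pixel format and memory layout are not specified here.

```python
import struct


class Framebuffer:
    """A linear block of shared memory holding 32-bit ARGB pixels.

    Hypothetical model: a bytearray stands in for the region of shared
    RAM that both the GPU and the LCD controller can see.
    """

    BYTES_PER_PIXEL = 4  # assumed ARGB8888

    def __init__(self, width, height, stride=None):
        self.width = width
        self.height = height
        # stride = bytes per row; may exceed width * 4 for alignment
        self.stride = stride or width * self.BYTES_PER_PIXEL
        self.mem = bytearray(self.stride * height)

    def offset(self, x, y):
        """Byte offset of pixel (x, y), with bounds checking."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError("pixel outside framebuffer")
        return y * self.stride + x * self.BYTES_PER_PIXEL

    def write_pixel(self, x, y, argb):
        # One 32-bit store per pixel (little-endian is an assumption).
        struct.pack_into("<I", self.mem, self.offset(x, y), argb)

    def read_pixel(self, x, y):
        return struct.unpack_from("<I", self.mem, self.offset(x, y))[0]

    def fill_rect(self, x0, y0, w, h, argb):
        """Write only a sub-area of the framebuffer."""
        if w <= 0 or h <= 0:
            return
        # Validate both corners so rows never spill outside the buffer.
        self.offset(x0, y0)
        self.offset(x0 + w - 1, y0 + h - 1)
        row = struct.pack("<I", argb) * w
        for y in range(y0, y0 + h):
            start = self.offset(x0, y)
            self.mem[start:start + len(row)] = row


if __name__ == "__main__":
    fb = Framebuffer(640, 480)
    fb.fill_rect(10, 20, 100, 50, 0xFF00FF00)  # opaque green rectangle
    print(hex(fb.read_pixel(10, 20)))
```

Everything after the write (scan-out, timing, display interface) is outside this model, matching the division of labour described above.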
## Notes

* The deadline is really tight. If an FPGA (or simulation) plus the basics
  of the software driver are at least prototyped by then it *might* be ok.
* If using Nyuzi as the basis it *might* be possible to begin the
  software port in parallel because Jeff went to the trouble of writing
  a cycle-accurate simulation.
* I *suspect* it will result in less work to use Gallium3D than, for
  example, writing an entire OpenGL stack from scratch.
* A *demo* should run on an FPGA as an initial step. The FPGA is not a
  priority for assessment, but it would be *nice* if it could fit into
  a ZC706.
* Also if there is parallel hardware obviously it would be nice to be able
  to demonstrate parallelism to the maximum extent possible. But again,
  being reasonable, if the GPU is so big that only a single core can fit
  into even a large FPGA then for an initial demo that would be fine.
* Note that no other licenses are acceptable for the hardware: all GPL
  licenses (GPL, AGPL, LGPL) are out.  The GPL (all revisions: v2, v3,
  v2+, v3+) is out for software, with the exception of the LGPL (v2+ or
  v3+ acceptable).

## Design decisions and considerations

Whilst Nyuzi has a big advantage in that it has simulations and also an
LLVM port and so on, if utilised for this particular RISC-V chip it would
mean needing to write a "memory shim" between the general-purpose Nyuzi
core and the main processor, i.e. all the shader info, state etc. needs
synchronisation hardware (and software). That could significantly
complicate design, especially of software.

Whilst I *recommended* Gallium3D there is actually another possible approach:

A RISC-V multi-core design which accelerates *software*
rendering... including potentially utilising the fact that Gallium3D
has a *software* (LLVM) renderer:

<https://mesa3d.org/llvmpipe.html>

The general aim of this approach is *not* to have the complexity of
transferring significant amounts of data structures to and from disparate
cores (one Nyuzi, one RISC-V) but to STAY WITHIN THE RISC-V ARCHITECTURE
and simply compile Mesa3D (for RISC-V), gallium3d-llvm (for RISC-V),
modifying LLVM for RISC-V to do the heavy-lifting instead.

Then it just becomes a matter of adding Vector/SIMD/Parallelization
extensions to RISC-V, and adding support in LLVM for the same:

<https://lists.llvm.org/pipermail/llvm-dev/2018-April/122517.html>

So if basing the design on RISC-V is under consideration, that means
turning RISC-V into a vector processor. Now, whilst Hwacha has been
located (finally), it's a design that is specifically targeted at
supercomputers. I have been taking an alternative approach to
vectorisation which is more about *parallelization* than it is about
*vectorization*.

It would be great for Simple-V to be given consideration for
implementation as the abstraction "API" of Simple-V would greatly simplify
the process of adding custom features such as fixed-function pixel
conversion and rasterization instructions (if those are chosen to be
added) and so on. Bear in mind that a high-speed clock rate is NOT a
good idea for GPUs (power being a square law): multi-core parallelism
and longer SIMD/vectors are much better options to consider instead.

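The clock-rate remark can be made concrete with the standard first-order model of dynamic CMOS power, P = C * V^2 * f. The sketch below additionally assumes that supply voltage has to rise in proportion to clock frequency, which is a common simplification rather than a measured property of any particular process. The function names and normalised units are illustrative only.

```python
def dynamic_power(cap, voltage, freq):
    """First-order dynamic power: P = C * V^2 * f (normalised units)."""
    return cap * voltage ** 2 * freq


def power_for_config(cores, clock_scale, base_voltage=1.0, cap=1.0):
    """Total power for `cores` cores, each clocked at `clock_scale` x baseline.

    ASSUMPTION: supply voltage scales linearly with clock frequency.
    Leakage and interconnect overheads are ignored.
    """
    voltage = base_voltage * clock_scale
    return cores * dynamic_power(cap, voltage, clock_scale)


if __name__ == "__main__":
    # Two ways to double raw throughput relative to one baseline core.
    fast_single = power_for_config(cores=1, clock_scale=2.0)
    wide_dual = power_for_config(cores=2, clock_scale=1.0)
    print(f"1 core at 2x clock:  {fast_single:.1f}x baseline power")
    print(f"2 cores at 1x clock: {wide_dual:.1f}x baseline power")
```

Under this model, doubling the clock costs 8x the baseline power whereas doubling the core count costs 2x, which is the reasoning behind preferring width over frequency.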
The PDF/slides on Simple-V are here:

<http://hands.com/~lkcl/simple_v_chennai_2018.pdf>

And the assessment, design and implementation is being done here:

<http://libre-riscv.org/simple_v_extension/>

----

My feeling on this is therefore that the following approach is one which
involves minimal work:

* Investigate the ChiselGPU code to see if it can be leveraged (an
  "image" added instead of straight ARGB color).
* OR... add sufficient fixed-function 3D instructions (plus a memory
  scratch area) to RISC-V to do the equivalent job.
* Implement the Simple-V RISC-V "parallelism" extension (which can
  parallelize xBitManip *and* the above-suggested 3D fixed-function
  instructions).
* Wait for RISC-V LLVM to have vectorization support added to it.
* MODIFY the resultant RISC-V LLVM code so that it supports Simple-V.
* Grab the gallium3d-llvm source code and hit the "compile" button.
* Grab the *standard* Mesa3D library, tell it to use the gallium3d-llvm
  library and hit the "compile" button.
* See what happens.

Now, interestingly, if spike is thrown into the mix there (as a
RISC-V ISA simulator) it should be perfectly well possible to
get an idea of where performance of the above would need optimization,
just like Jeff did with the Nyuzi paper.

He focussed on specific algorithms and checked the assembly code, and
worked out how many instruction cycles per pixel were needed, which is
an invaluable measure.

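The cycles-per-pixel measure can be turned around into a budget: given a clock rate and a degree of parallelism, how many cycles can each pixel afford at the 325 million pixels/sec target? This is a rough sketch; the 800 MHz clock, 4 cores and 4 lanes used in the example are made-up illustrative values, not figures from the Nyuzi paper or from this proposal.

```python
TARGET_PIXELS_PER_SEC = 325_000_000  # from the GPU capabilities target


def cycle_budget_per_pixel(clock_hz, cores=1, lanes=1,
                           target=TARGET_PIXELS_PER_SEC):
    """Lane-cycles available per pixel at the target fill rate."""
    return clock_hz * cores * lanes / target


def achievable_pixels_per_sec(clock_hz, cycles_per_pixel, cores=1, lanes=1):
    """Fill rate reached if each pixel costs `cycles_per_pixel` lane-cycles."""
    return clock_hz * cores * lanes / cycles_per_pixel


if __name__ == "__main__":
    # Hypothetical configuration: 800 MHz, 4 cores, 4 lanes per core.
    budget = cycle_budget_per_pixel(800e6, cores=4, lanes=4)
    print(f"budget: {budget:.1f} lane-cycles per pixel")

    # If a measured inner loop costs 40 lane-cycles per pixel:
    rate = achievable_pixels_per_sec(800e6, 40, cores=4, lanes=4)
    print(f"achievable: {rate / 1e6:.0f} Mpixels/sec")
```

Comparing a measured cycles-per-pixel figure against such a budget shows directly whether an algorithm needs optimising, more parallelism, or a fixed-function instruction.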
As I mention in the above page, one of the problems with doing a
completely separate engine (Nyuzi is actually a general-purpose RISC-based
vector processor) is that when it comes to using it, you need to transfer
all the "state" data structures from the main core over to the GPU's core.

... But if the main core is RISC-V *and the GPU is RISC-V as well*
and they are SMP cores then transferring the state is a simple matter of
doing a context-switch... or if *all* cores have vector and 3D instruction
extensions, a context-switch is not needed at all.

Will that approach work? Honestly I have absolutely no idea, but it
would be a fascinating and extremely ambitious research project.

Can we get people to fund it?  Yeah, I think so.  There's a lot of buzz
about RISC-V, and a lot of buzz can be created about a libre 3D GPU. If
that same GPU happens to be good at doing crypto-currency mining there
will be a LOT more attention paid, particularly given that people have
noticed that relying on proprietary GPUs and CPUs to manage billions of
dollars' worth of crypto-currency, when the NSA is *known* to have
blackmailed Intel into putting a spying back-door co-processor into x86,
miiight not be a good idea:

<http://libreboot.org/faq#intelme>

## Q & A

> Q:
>
> Do you need a team with good CVs? What about if the team shows you
> an acceptable FPGA prototype? I’m talking about a team of students
> which do not have big industrial CVs but they know how to handle this
> job (just like RocketChip or MIAOW or etc…).

A:

That would be fantastic as it would demonstrate not only competence but
also commitment. And it will have taken out the "risk" of being "unknown",
entirely. So that works perfectly for me :) .

> Q:
>
> Is there any guarantee that there would be a sponsorship for the GPU?

A:

Please please let's be absolutely clear:

I can put the *business case* to the anonymous sponsor to *consider*
sponsoring a libre GPU, *only* and purely on the basis of a *commercial*
decision based on cost and risk analysis, comparing against the best
alternative option which is USD $250,000 for a one-time proprietary
license for Vivante GC800 using etnaviv. So as a result we need to be
*really clear* that *there is no "guaranteed sponsorship"*.  This is a
purely commercial *business* assessment.

However, it just so happens that there are quite a lot of people who are
pissed at how things go in the 3D embedded space. That can be leveraged,
by way of a crowd-funding campaign, to invite people to help put money
behind this, in a way that has *nothing to do with the libre-riscv
anonymous sponsor*.

As in: there is absolutely nothing which prevents or prohibits raising
of funds from other sources and using other initiatives.  The anonymous
sponsor *purely* seeks a chip for use in a product.  They are **NOT**
demanding ownership or control *of* the chip being designed, in any
way, shape or form.  There just happens not to be a chip available on
the market today that suits their requirements, hence the interest in
ensuring that there is.
