shakti/m_class/libre_3d_gpu.mdwn

   1 # Requirements
   2
   3 ## GPU 3D capabilities
   4
   5 Based on GC800 the following would be acceptable performance
   6 (as would MALI400).
   7
   8 * 35 million triangles/sec
   9 * 325 million pixels/sec
  10 * 6 GFLOPS
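
To get a feel for what these targets mean per frame, here is a rough back-of-envelope sketch. The display modes are hypothetical, picked only for illustration, and the quoted rates are treated as sustained throughput, which real hardware rarely achieves.

```python
# Targets quoted above (GC800-class figures).
PIXELS_PER_SEC = 325e6
TRIANGLES_PER_SEC = 35e6


def per_frame_budget(width, height, fps):
    """Return (overdraw factor, triangles per frame) for a display mode."""
    screen_pixels = width * height
    # How many times each screen pixel could be written per frame.
    overdraw = PIXELS_PER_SEC / (screen_pixels * fps)
    triangles = TRIANGLES_PER_SEC / fps
    return overdraw, triangles


# Hypothetical display modes, chosen only for illustration.
MODES = {"720p60": (1280, 720, 60), "1080p30": (1920, 1080, 30)}

for name, (w, h, fps) in MODES.items():
    overdraw, tris = per_frame_budget(w, h, fps)
    print(f"{name}: ~{overdraw:.1f}x overdraw, ~{tris:,.0f} triangles/frame")
```

Under these assumptions the fill rate allows each screen pixel to be touched only a handful of times per frame, which is consistent with calling the target modest.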
  11
  12 ## GPU size and power
  13
  14 > 1.1. GPU size MUST be < 0.XX mm for ASICs after synthesis with
  15 > DesignCompiler tool using YY cell library at ZZ nm tech.
  16
  17 basically the power requirement should be at or below around 1 watt
  18 in 40nm.  beyond 1 watt it becomes... difficult.   size is not
  19 particularly critical as such but should not be insane.
  20
  21 so here's a table showing embedded cores:
  22 <https://www.cnx-software.com/2013/01/19/gpus-comparison-arm-mali-vs-vivante-gcxxx-vs-powervr-sgx-vs-nvidia-geforce-ulp/>
  23
  24 GC800 has (in 40nm):
  25
  26 * 35 million triangles/sec
  27 * 325 million pixels/sec
  28 * 6 GFLOPS
  29 * 1.9mm^2 synthesis area
  30 * 2.5mm^2 silicon area.
  31
  32 silicon area corresponds *ROUGHLY* with power usage, but PLEASE do
  33 not take that as absolute, because if you read jeff's nyuzi 2016 paper
  34 you'll see that getting data through the L1/L2 cache barrier is by far
  35 and away the biggest eater of power.
  36
  37 note that, lower down in that table, the numbers for MALI400 are for
  38 the *4* core version - MALI400-MP4 - whereas jeff and i compared against
  39 MALI400 SINGLE CORE and discovered that nyuzi, if 4 parallel nyuzi cores
  40 were put together, would reach only 25% of MALI400's performance (in
  41 about the same silicon area).
  42
  43 ## Other
  44
  45 * Deadline = 12-18 months
  46 * The GPU is matched by the Gallium3D driver
  47 * RTL must be sufficient to run on an FPGA.
  48 * Software must be licensed under LGPLv2+ or BSD/MIT.
  49 * Hardware (RTL) must be licensed under BSD or MIT with no
  50   "NON-COMMERCIAL" CLAUSES.
  51 * Any proposals will be competing against Vivante GC800 (using Etnaviv driver).
  52 * The GPU is integrated (like Mali400). So all that the GPU needs to do
  53   is write to an area of memory (framebuffer or area of the framebuffer).
  54   The SoC - which in this case has a RISC-V core and has peripherals such
  55   as the LCD controller - will take care of the rest.
  56 * In this architecture, the GPU, the CPU and the peripherals are all on
  57   the same AXI4 shared memory bus. They all have access to the same shared
  58   DDR3/DDR4 RAM. So as a result the GPU will use AXI4 to write directly
  59   to the framebuffer and the rest will be handled by the SoC.
  60 * The job must be done by a team that shows sufficient expertise to
  61   reduce the risk. (Do you mean a team with good CVs? What about if the
  62   team shows you an acceptable FPGA prototype? I’m talking about a team
  63   of students who do not have big industrial CVs but they know how to
  64   handle this job (just like RocketChip or MIAOW etc…).)
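
The "GPU just writes to an area of memory" model in the list above can be sketched in software. This is only an illustration: the `bytearray` stands in for the shared DDR region that the LCD controller scans out, and the resolution, the 32-bit XRGB8888 pixel format and the function names are all assumptions, not part of the requirements.

```python
import struct

# Hypothetical framebuffer geometry: 32-bit XRGB8888, tightly packed rows.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 640, 480, 4
STRIDE = WIDTH * BYTES_PER_PIXEL

# Stand-in for the region of shared DDR that the LCD controller reads.
framebuffer = bytearray(STRIDE * HEIGHT)


def write_pixel(fb, x, y, r, g, b):
    """Store one pixel at the byte offset a bus master would address."""
    offset = y * STRIDE + x * BYTES_PER_PIXEL
    struct.pack_into("<I", fb, offset, (r << 16) | (g << 8) | b)
    return offset


def fill_rect(fb, x0, y0, w, h, r, g, b):
    """Fill a rectangle row by row, one contiguous write per row."""
    row = struct.pack("<I", (r << 16) | (g << 8) | b) * w
    for y in range(y0, y0 + h):
        start = y * STRIDE + x0 * BYTES_PER_PIXEL
        fb[start:start + len(row)] = row


offset = write_pixel(framebuffer, 10, 2, 255, 0, 0)  # one red pixel
fill_rect(framebuffer, 0, 0, 4, 2, 0, 255, 0)        # small green block
```

Everything the GPU produces ends up as plain stores like these; scan-out, timing and display configuration stay with the SoC.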
  65
  66 response:
  67
  68 > Deadline = ?
  69
  70 about 12-18 months which is really tight.  if an FPGA (or simulation)
  71 plus the basics of the software driver are at least prototyped by then
  72 it *might* be ok.
  73
  74 if using nyuzi as the basis it *might* be possible to begin the
  75 software port in parallel because jeff went to the trouble of writing
  76 a cycle-accurate simulation.
  77
  78
  79 > The GPU must be matched by the Gallium3D driver
  80
  81 that's the *recommended* approach, as i *suspect* it will result in less
  82 work than, for example, writing an entire OpenGL stack from scratch.
  83
  84
  85 > RTL must be sufficient to run on an FPGA.
  86
  87 a *demo* must run on an FPGA as an initial
  88
  89 > Software must be licensed under LGPLv2+ or BSD/MIT.
  90
  91 and no other licenses.  GPLv2+ is out.
  92
  93 > Hardware (RTL) must be licensed under BSD or MIT with no “NON-COMMERCIAL
  94 > CLAUSES”.
  95 > Any proposals will be competing against Vivante GC800 (using Etnaviv
  96 > driver).
  97
  98 in terms of price, performance and power budget, yes.  if you look up
  99 the numbers (triangles/sec, pixels/sec, power usage, die area) you'll
 100 find it's really quite modest.  nyuzi right now requires FOUR times the
 101 silicon area of e.g. MALI400 to achieve the same performance as MALI400,
 102 meaning that the power usage alone would be well in excess of the budget.
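
The arithmetic behind that statement, as a rough sketch. It assumes performance scales linearly with silicon area and that power tracks area, which (as noted earlier) is only roughly true, and it assumes a MALI400-class design just fits the 1 watt budget. None of these figures are measurements.

```python
def area_for_equal_performance(relative_perf, relative_area=1.0):
    """Area multiple needed to match the reference design, assuming
    performance scales linearly with area (i.e. with core count)."""
    return relative_area / relative_perf


# From the text: 4 nyuzi cores in about the same area reach ~25% of MALI400.
nyuzi_area_multiple = area_for_equal_performance(relative_perf=0.25)

# Assumption: power tracks area, and the reference just fits the budget.
POWER_BUDGET_W = 1.0
estimated_power_w = nyuzi_area_multiple * POWER_BUDGET_W

print(f"area multiple: {nyuzi_area_multiple:.0f}x, "
      f"estimated power: ~{estimated_power_w:.0f} W vs {POWER_BUDGET_W:.0f} W budget")
```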
 103
 104 > The job must be done by a team that shows sufficient expertise to reduce the
 105 > risk. (Do you mean a team with good CVs? What about if the team shows you an
 106 > acceptable FPGA prototype?
 107
 108 that would be fantastic as it would demonstrate not only competence
 109 but also commitment.  and will have taken out the "risk" of being
 110 "unknown", entirely.
 111
 112 > I’m talking about a team of students who do not
 113 > have big industrial CVs but they know how to handle this job (just like
 114 > RocketChip or MIAOW etc…).
 115
 116  works perfectly for me :)
 117
 118 ## Design decisions and considerations
 119
 120 whilst Nyuzi has a big advantage in that it has simulations and also an
 121 LLVM port and so on, if utilised for this particular RISC-V chip it would
 122 mean needing to write a "memory shim" between the general-purpose Nyuzi
 123 core and the main processor, i.e. all the shader info, state etc. needs
 124 synchronisation hardware (and software).
 125
 126 that could significantly complicate design, especially of software.
 127
 128 whilst i *recommended* Gallium3D there is actually another possible
 129 approach: a RISC-V multi-core design which accelerates *software*
 130 rendering... including potentially utilising the fact that gallium3d
 131 has a *software* (LLVM) renderer:
 132
 133 <https://mesa3d.org/llvmpipe.html>
 134
 135 the general aim of this approach is *not* to have the complexity of
 136 transferring significant amounts of data structures to and from disparate
 137 cores (one Nyuzi, one RISC-V) but to STAY WITHIN THE RISC-V ARCHITECTURE
 138 and simply compile mesa3d (for RISC-V), gallium3d-llvm (for RISC-V).
 139
 140 so if considering basing the design on RISC-V, that means turning RISC-V
 141 into a vector processor.  now, whilst hwacha has been located (finally),
 142 it's a design that is specifically targeted at supercomputers.  i have
 143 been taking an alternative approach to vectorisation which is more about
 144 *parallelisation* than it is about *vectorisation*.
 145
 146 it would be great for Simple-V to be given consideration for
 147 implementation as the abstraction "API" of Simple-V would greatly simplify
 148 the process of adding custom features such as fixed-function pixel
 149 conversion and rasterisation instructions (if those are chosen to be
 150 added) and so on.  bear in mind that a high-speed clock rate is NOT a
 151 good idea for GPUs (power being a square law): multi-core parallelism
 152 and longer SIMD/vectors are much better to consider, instead.
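
A minimal sketch of the clock-rate argument, using the usual CMOS switching-power approximation P = C * V^2 * f. The units are normalised, the 1.3x voltage needed to double the clock is an illustrative guess rather than a measurement, and leakage and imperfect parallel scaling are ignored.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Classic CMOS switching-power approximation: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Normalised, made-up units: one core at 1.0 V and 1.0 "frequency units".
C, V, F = 1.0, 1.0, 1.0

# Option A: two cores at the base clock and voltage.
two_slow_cores = 2 * dynamic_power(C, V, F)

# Option B: one core at double the clock. Assumption: reaching 2x frequency
# needs 1.3x the supply voltage (illustrative figure only).
one_fast_core = dynamic_power(C, 1.3 * V, 2 * F)

print(f"two cores at base clock: {two_slow_cores:.2f}")
print(f"one core at 2x clock:    {one_fast_core:.2f}")
```

Both options offer the same nominal throughput, but the voltage term makes the fast single core the more expensive one under these assumptions.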
 153
 154 the PDF / slides on Simple-V are here:
 155 <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
 156
 157 and the assessment, design and implementation is being done here:
 158 <http://libre-riscv.org/simple_v_extension/>
 159