# Libre 3D GPU Requirements

## GPU capabilities

Based on GC800 the following would be acceptable performance (as would
Mali-400):

* 35 million triangles/sec
* 325 million pixels/sec
* 6 GFLOPS

## GPU size and power

* Basically the power requirement should be at or below around 1 watt
  in 40nm. Beyond 1 watt it becomes... difficult.
* Size is not particularly critical as such but should not be insane.

Based on GC800 the following would be acceptable area in 40nm:

* 1.9mm^2 synthesis area
* 2.5mm^2 silicon area

So here's a table showing embedded cores:

Silicon area corresponds *ROUGHLY* with power usage, but PLEASE do
not take that as absolute, because if you read Jeff's Nyuzi 2016 paper
you'll see that getting data through the L1/L2 cache barrier is by far
and above the biggest eater of power.

Note lower down that the numbers for Mali-400 are for the *4*-core
version - Mali-400 MP4. Jeff and I compared against Mali-400 SINGLE CORE
and discovered that Nyuzi, if 4 parallel Nyuzi cores were put together,
would reach only 25% of Mali-400's performance (in about the same
silicon area).

## Other

* The deadline is about 12-18 months.
* It is highly recommended to use Gallium3D for the software stack
  (see below if deciding whether to use Nyuzi or RISC-V or other).
* Software must be licensed under LGPLv2+ or BSD/MIT.
* Hardware (RTL) must be licensed under BSD or MIT with no
  "NON-COMMERCIAL" CLAUSES.
* Any proposals will be competing against the Vivante GC800 (using the
  Etnaviv driver) in terms of price, performance and power budget.
* The GPU is integrated (like Mali-400). So all that the GPU needs
  to do is write to an area of memory (the framebuffer, or an area of
  the framebuffer). The SoC - which in this case has a RISC-V core and
  has peripherals such as the LCD controller - will take care of the
  rest.
* In this architecture, the GPU, the CPU and the peripherals are all on
  the same AXI4 shared memory bus. They all have access to the same
  shared DDR3/DDR4 RAM. As a result the GPU will use AXI4 to write
  directly to the framebuffer, and the rest will be handled by the SoC
  (see the sketch below).
* The job must be done by a team that shows sufficient expertise to
  reduce the risk.
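
To make the integrated-GPU and AXI4 bullets concrete, here is a minimal
sketch of what "the GPU just writes to an area of memory" means in
practice. Everything in it (the framebuffer base address, resolution and
XRGB8888 pixel format) is hypothetical and for illustration only: the
real values would come from the SoC memory map and the LCD controller
configuration.

```c
#include <stdint.h>

/* Hypothetical values, for illustration only: the real framebuffer base
 * and geometry come from the SoC memory map / LCD controller setup. */
#define FB_BASE   ((volatile uint32_t *)0x80000000u)  /* in shared DDR */
#define FB_WIDTH  800u
#define FB_HEIGHT 480u

/* Fill a rectangle with a single XRGB8888 colour.  Whether these stores
 * are issued by the CPU or by the GPU makes no difference: both sit on
 * the same AXI4 bus and write the same shared DDR3/DDR4 RAM, and the
 * LCD controller independently scans that memory out to the panel. */
static void fill_rect(unsigned x0, unsigned y0, unsigned w, unsigned h,
                      uint32_t xrgb)
{
    for (unsigned y = y0; y < y0 + h && y < FB_HEIGHT; y++)
        for (unsigned x = x0; x < x0 + w && x < FB_WIDTH; x++)
            FB_BASE[y * FB_WIDTH + x] = xrgb;
}
```

A real driver would of course render into a back buffer and page-flip,
but the principle - plain memory writes into shared DDR, with scan-out
handled by the rest of the SoC - is the same.
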
## Notes

* The deadline is really tight. If an FPGA (or simulation) plus the
  basics of the software driver are at least prototyped by then, it
  *might* be ok.
* If using Nyuzi as the basis it *might* be possible to begin the
  software port in parallel, because Jeff went to the trouble of writing
  a cycle-accurate simulation.
* I *suspect* it will result in less work to use Gallium3D than, for
  example, writing an entire OpenGL stack from scratch.
* A *demo* should run on an FPGA as an initial step. The FPGA is not a
  priority for assessment, but it would be *nice* if it could fit into
  a ZC706.
* Also, if there is parallel hardware, it would obviously be nice to be
  able to demonstrate parallelism to the maximum extent possible. But
  again, being reasonable: if the GPU is so big that only a single core
  can fit into even a large FPGA, then for an initial demo that would
  be fine.
* Note that no other licenses are acceptable for the hardware: all GPL
  licenses (GPL, AGPL, LGPL) are out. All revisions of the GPL (v2, v3,
  v2+, v3+) are out for software, with the exception of the LGPL (v2+
  or v3+ is acceptable).

## Design decisions and considerations

Whilst Nyuzi has a big advantage in that it has simulations, an LLVM
port and so on, if utilised for this particular RISC-V chip it would
mean needing to write a "memory shim" between the general-purpose Nyuzi
core and the main processor, i.e. all the shader info, state etc. needs
synchronisation hardware (and software). That could significantly
complicate the design, especially of the software.

Whilst I *recommended* Gallium3D, there is actually another possible
approach: a RISC-V multi-core design which accelerates *software*
rendering... including potentially utilising the fact that Gallium3D
has a *software* (LLVM) renderer:
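
To give a rough sense of the work such a software renderer ends up
doing, here is a deliberately naive, purely illustrative scalar sketch
of edge-function triangle rasterization. It is NOT Mesa/llvmpipe code
(llvmpipe JIT-compiles far better code than this), but this kind of
per-pixel inner loop is exactly what the RISC-V cores, and any
vector/parallel extensions added to them, would have to execute quickly.
Counting the instructions it needs per pixel is the same style of
measurement Jeff made for Nyuzi (see the notes on spike further down).

```c
#include <stdint.h>

/* Purely illustrative scalar triangle fill using edge functions.
 * A pixel is inside the triangle when all three edge tests agree in
 * sign.  The multiplies and subtracts per edge, per candidate pixel,
 * are the kind of work that dominates an "instructions per pixel"
 * measurement of a software renderer. */
static int32_t edge(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                    int32_t px, int32_t py)
{
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax);
}

void fill_triangle(uint32_t *fb, int32_t stride,
                   int32_t x0, int32_t y0, int32_t x1, int32_t y1,
                   int32_t x2, int32_t y2, uint32_t colour)
{
    /* Bounding box of the triangle (clipping omitted for brevity). */
    int32_t minx = x0 < x1 ? (x0 < x2 ? x0 : x2) : (x1 < x2 ? x1 : x2);
    int32_t miny = y0 < y1 ? (y0 < y2 ? y0 : y2) : (y1 < y2 ? y1 : y2);
    int32_t maxx = x0 > x1 ? (x0 > x2 ? x0 : x2) : (x1 > x2 ? x1 : x2);
    int32_t maxy = y0 > y1 ? (y0 > y2 ? y0 : y2) : (y1 > y2 ? y1 : y2);

    for (int32_t y = miny; y <= maxy; y++) {
        for (int32_t x = minx; x <= maxx; x++) {
            int32_t w0 = edge(x1, y1, x2, y2, x, y);
            int32_t w1 = edge(x2, y2, x0, y0, x, y);
            int32_t w2 = edge(x0, y0, x1, y1, x, y);
            /* Same sign for all three edges => inside (any winding). */
            if ((w0 >= 0 && w1 >= 0 && w2 >= 0) ||
                (w0 <= 0 && w1 <= 0 && w2 <= 0))
                fb[y * stride + x] = colour;
        }
    }
}
```
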
The general aim of this approach is *not* to have the complexity of
transferring significant amounts of data structures to and from disparate
cores (one Nyuzi, one RISC-V) but to STAY WITHIN THE RISC-V ARCHITECTURE
and simply compile Mesa3D (for RISC-V) and gallium3d-llvm (for RISC-V),
modifying LLVM for RISC-V to do the heavy lifting instead.

Then it just becomes a matter of adding Vector/SIMD/Parallelization
extensions to RISC-V, and adding support in LLVM for the same:

So if considering basing the design on RISC-V, that means turning RISC-V
into a vector processor. Now, whilst Hwacha has been located (finally),
it is a design that is specifically targeted at supercomputers. I have
been taking an alternative approach to vectorisation which is more about
*parallelization* than it is about *vectorization*.

It would be great for Simple-V to be given consideration for
implementation, as the abstraction "API" of Simple-V would greatly
simplify the process of adding custom features such as fixed-function
pixel-conversion and rasterization instructions (if those are chosen to
be added) and so on. Bear in mind that a high-speed clock rate is NOT a
good idea for GPUs (power being a square law): multi-core parallelism
and longer SIMD/vectors are much better things to consider instead.

The PDF/slides on Simple-V are here:

And the assessment, design and implementation is being done here:

----

My feeling on this is therefore that the following approach is one which
involves minimal work:

* Investigate the ChiselGPU code to see if it can be leveraged (an
  "image" added instead of straight ARGB color).
* OR... add sufficient fixed-function 3D instructions (plus a memory
  scratch area) to RISC-V to do the equivalent job.
* Implement the Simple-V RISC-V "parallelism" extension (which can
  parallelize xBitManip *and* the above-suggested 3D fixed-function
  instructions); see the sketch after this list.
* Wait for RISC-V LLVM to have vectorization support added to it.
* MODIFY the resultant RISC-V LLVM code so that it supports Simple-V.
* Grab the gallium3d-llvm source code and hit the "compile" button.
* Grab the *standard* Mesa3D library, tell it to use the gallium3d-llvm
  library, and hit the "compile" button.
* See what happens.
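
As a purely hypothetical illustration of what the Simple-V and
fixed-function steps above are aiming at, consider a scalar pixel-format
conversion loop of the kind a software renderer runs constantly. The
function below is not from any existing codebase; it simply shows the
shape of loop that a vectorising LLVM with Simple-V support would be
expected to spread across parallel lanes, and that a dedicated
pixel-conversion instruction could collapse into roughly one operation
per pixel.

```c
#include <stdint.h>
#include <stddef.h>

/* Clamp a float colour component to [0,1] and scale to 0..255. */
static inline uint32_t to_u8(float c)
{
    if (c < 0.0f) c = 0.0f;
    if (c > 1.0f) c = 1.0f;
    return (uint32_t)(c * 255.0f + 0.5f);
}

/* Hypothetical example: convert floating-point RGBA fragments into
 * packed ARGB8888 pixels.  Written as a plain scalar loop; under the
 * plan above, the LLVM vectorizer plus Simple-V support would issue it
 * over multiple elements at once, and a fixed-function pixel-conversion
 * instruction could replace the loop body entirely. */
void rgba_to_argb8888(const float *rgba, uint32_t *out, size_t npixels)
{
    for (size_t i = 0; i < npixels; i++) {
        uint32_t r = to_u8(rgba[4 * i + 0]);
        uint32_t g = to_u8(rgba[4 * i + 1]);
        uint32_t b = to_u8(rgba[4 * i + 2]);
        uint32_t a = to_u8(rgba[4 * i + 3]);
        out[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

Broadly speaking, the point of Simple-V here is that no new SIMD opcodes
are needed for such a loop: the existing scalar instructions are marked
as operating over groups of registers, which is what keeps the
abstraction "API" small.
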
Now, interestingly, if spike is thrown into the mix there (as a
cycle-accurate RISC-V simulator) it should be perfectly possible to get
an idea of where the performance of the above would need optimization,
just like Jeff did with the Nyuzi paper.

He focussed on specific algorithms and checked the assembly code, and
worked out how many instruction cycles per pixel were needed, which is
an invaluable measure.

As I mention in the above page, one of the problems with doing a
completely separate engine (Nyuzi is actually a general-purpose RISC-based
vector processor) is that when it comes to using it, you need to transfer
all the "state" data structures from the main core over to the GPU's core.

... But if the main core is RISC-V *and the GPU is RISC-V as well*,
and they are SMP cores, then transferring the state is a simple matter of
doing a context-switch... or, if *all* cores have the vector and 3D
instruction extensions, a context-switch is not needed at all.

Will that approach work? Honestly, I have absolutely no idea, but it
would be a fascinating and extremely ambitious research project.

Can we get people to fund it? Yeah, I think so. There's a lot of buzz
about RISC-V, and a lot of buzz can be created about a libre 3D GPU. If
that same GPU happens to be good at doing crypto-currency mining, there
will be a LOT more attention paid, particularly given that people have
noticed that relying on proprietary GPUs and CPUs to manage billions of
dollars worth of crypto-currency - when the NSA is *known* to have
blackmailed Intel into putting a spying back-door co-processor into
x86 - miiight not be a good idea:

## Q & A

> Q:
>
> Do you need a team with good CVs? What about if the team shows you
> an acceptable FPGA prototype? I'm talking about a team of students
> which do not have big industrial CVs but they know how to handle this
> job (just like RocketChip or MIAOW etc.).

A:

That would be fantastic, as it would demonstrate not only competence but
also commitment, and it would take out the "risk" of being "unknown"
entirely. So that works perfectly for me :)

> Q:
>
> Is there any guarantee that there would be a sponsorship for the GPU?

A:

Please, please, let's be absolutely clear: I can put the *business case*
to the anonymous sponsor to *consider* sponsoring a libre GPU, *only* and
purely on the basis of a *commercial* decision based on cost and risk
analysis, comparing against the best alternative option, which is USD
$250,000 for a one-time proprietary license for the Vivante GC800 using
Etnaviv. So we need to be *really clear* that *there is no "guaranteed
sponsorship"*: this is a pure commercial *business* assessment.

However, it just so happens that there are quite a lot of people who are
pissed off at how things are going in the 3D embedded space. That can be
leveraged, by way of a crowd-funding campaign, to invite people to help
and to put money behind this - money that has *nothing to do with the
libre-riscv anonymous sponsor*.