- * miaow is just an OpenCL engine that is compatible with a subset of
-AMD/ATI's OpenCL assembly code. it is NOT a GPU. they have
-preliminary plans to *make* one... however the development process is
-not open. we'll hear about it if and when it succeeds, probably as
-part of a published research paper.
-
- * nyuzi is a *modern* "software shader / renderer" and is a
-replication of the intel larrabee architecture. it explored the
-concept of doing recursive software-driven rasterisation (as did
-larrabee) where hardware rasterisation uses brute force and often
-wastes time and power. jeff went to a lot of trouble to find out
-*why* intel's researchers were um "not permitted" to actually put
-performance numbers into their published papers. he found out why :)
-one of the main facts that jeff's research reveals (and there are a
-lot of them) is that most of the energy of a GPU is spent getting data
-each way past the L2/L1 cache barrier, and secondly much of the time
-(if doing software-only rendering) you have several instruction cycles
-where in a hardware design you issue one and a separate pipeline takes
-over (see videocore-iv below)
-
- * chiselgpu was an additional effort by jeff to create the absolute
-minimum required tile-based "triangle renderer" in hardware, for
-comparative purposes in the nyuzi raster engine research. synthesis
-of such a block he pointed out to me would actually be *enormous*,
-despite appearances from how little code there is in the chiselgpu
-repository. in his paper he mentions that the majority of the time
-when such hardware-renderers are deployed, the rest of the GPU is
-really struggling to keep up feeding the hardware-rasteriser, so you
-have to put in multiple threads, and that brings its own problems.
-it's all in the paper, it's fascinating stuff.
-
- * gplgpu was done by one of the original developers of the "Number
-Nine" GPU, and is based around a "fixed function" design and as such
-is no longer considered suitable for use in the modern 3D developer
-community (they hate having to code for it), and its performance would
-be *really* hard to optimise and extend. however in speaking to jeff,
-who analysed it quite comprehensively, he said that there were a large
-number of features (4-tuple floating-point colour to 16/32-bit ARGB
-fixed functions) that have retained a presence in modern designs, so
-it's still useful for inspiration and analysis purposes. you can see
-jeff's analysis here [7]
-
- * an extremely useful resource has been the videocore-iv project [8]
-which has collected documentation and part-implemented compiler tools.
-the architecture is quite interesting, it's a hybrid of a
-Software-driven Vector architecture similar to Nyuzi plus
-fixed-functions on separate pipelines such as that "take 4-tuple FP,
-turn it into fixed-point ARGB and overlay it into the tile"
-instruction. that's done as a *single* instruction to cover i think 4
-pixels, where Nyuzi requires an average of 4 cycles per pixel. the
-other thing about videocore-iv is that there is a separate internal
-"scratch" memory area of size 4x4 (x32-bit) which is the "tile" area,
-and focussing on filling just that is one of the things that saves
-power. jeff did a walkthrough, you can read it here [10] [11]
-
- so on this basis i have been investigating a couple of proposals for
+* miaow is just an OpenCL engine that is compatible with a subset of
+ AMD/ATI's OpenCL assembly code. it is NOT a GPU. they have
+ preliminary plans to *make* one... however the development process is
+ not open. we'll hear about it if and when it succeeds, probably as
+ part of a published research paper.
+
+* nyuzi is a *modern* "software shader / renderer" and is a
+ replication of the intel larrabee architecture. it explored the
+ concept of doing recursive software-driven rasterisation (as did
+ larrabee) where hardware rasterisation uses brute force and often
+ wastes time and power. jeff went to a lot of trouble to find out
+ *why* intel's researchers were um "not permitted" to actually put
+ performance numbers into their published papers. he found out why :)
+ one of the main facts that jeff's research reveals (and there are a
+ lot of them) is that most of the energy of a GPU is spent getting data
+ each way past the L2/L1 cache barrier, and secondly that, when doing
+ software-only rendering, you often spend several instruction cycles
+ where a hardware design would issue one instruction and let a separate
+ pipeline take over (see videocore-iv below).
+
+* chiselgpu was an additional effort by jeff to create the absolute
+ minimum required tile-based "triangle renderer" in hardware, for
+ comparative purposes in the nyuzi raster engine research. he pointed
+ out to me that synthesis of such a block would actually be *enormous*,
+ despite appearances from how little code there is in the chiselgpu
+ repository. in his paper he mentions that the majority of the time
+ when such hardware-renderers are deployed, the rest of the GPU is
+ really struggling to keep up feeding the hardware-rasteriser, so you
+ have to put in multiple threads, and that brings its own problems.
+ it's all in the paper, it's fascinating stuff.
+
+* gplgpu was done by one of the original developers of the "Number
+ Nine" GPU, and is based around a "fixed function" design and as such
+ is no longer considered suitable for use in the modern 3D developer
+ community (they hate having to code for it), and its performance would
+ be *really* hard to optimise and extend. however in speaking to jeff,
+ who analysed it quite comprehensively, he said that there were a large
+ number of features (such as the fixed functions converting 4-tuple
+ floating-point colour to 16/32-bit ARGB) that have retained a presence
+ in modern designs, so it's still useful for inspiration and analysis
+ purposes. you can see jeff's analysis here [7].
+
+* an extremely useful resource has been the videocore-iv project [8]
+ which has collected documentation and part-implemented compiler tools.
+ the architecture is quite interesting, it's a hybrid of a
+ Software-driven Vector architecture similar to Nyuzi plus
+ fixed-functions on separate pipelines such as that "take 4-tuple FP,
+ turn it into fixed-point ARGB and overlay it into the tile"
+ instruction. that's done as a *single* instruction to cover i think 4
+ pixels, where Nyuzi requires an average of 4 cycles per pixel. the
+ other thing about videocore-iv is that there is a separate internal
+ "scratch" memory area of size 4x4 (x32-bit) which is the "tile" area,
+ and focussing on filling just that is one of the things that saves
+ power. jeff did a walkthrough, which you can read here [10] [11].
+
+so on this basis i have been investigating a couple of proposals for