# SIMD / Simple-V Extension Proposal

This proposal exists so as to be able to satisfy several disparate
requirements: area-conscious designs and performance-conscious designs.
Also, the existing P (SIMD) proposal and the V (Vector) proposals, whilst
each extremely powerful in their own right and clearly desirable, are also:

* Clearly independent in their origins (Cray and AndeStar v3 respectively)
* Both contain duplication of pre-existing RISC-V instructions
* Both have independent and disparate methods for introducing parallelism
  at the instruction level
* Both require that their respective parallelism paradigm be implemented
  alongside their respective functionality *or not at all*
* Both independently have methods for introducing parallelism that could,
  if separated, benefit *other areas of RISC-V, not just DSP and
  Floating-point*

Therefore it makes a huge amount of sense to have a means and method of
introducing instruction parallelism in a flexible way that provides
implementors with the option to choose exactly where they wish to offer
performance improvements and where they wish to optimise for power and
area. If that can be offered even on a per-operation basis, that would
provide even more flexibility.

# Analysis and discussion of Vector vs SIMD

There are five combined areas between the two proposals that help with
parallelism without over-burdening the ISA with a huge proliferation of
instructions:

* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
* Implicit vs fixed instruction bit-width (integral to the instruction or not)
* Implicit vs explicit type-conversion (compounded on bit-width)
* Implicit vs explicit inner loops
* Masks / tagging (selecting/preventing certain indexed elements from
  execution)

The pros and cons of each are discussed and analysed below.
## Fixed vs variable parallelism length

In David Patterson and Andrew Waterman's analysis of SIMD and Vector
ISAs, the analysis comes out clearly in favour of (effectively)
variable-length SIMD. As SIMD is a fixed width, typically 4, 8 or in
extreme cases 16 or 32 simultaneous operations, the setup, teardown and
corner-cases of SIMD are extremely burdensome except for applications
whose requirements *specifically* match the *precise and exact* depth
of the SIMD engine.

Thus, SIMD, no matter what width is chosen, is never going to be
acceptable for general-purpose computation, and in the context of
developing a general-purpose ISA, is never going to satisfy 100 percent
of implementors.

That basically leaves "variable-length vector" as the clear
*general-purpose* winner, at least in terms of greatly simplifying the
instruction set, reducing the number of instructions required for any
given task, and thus reducing power consumption for the same.

## Implicit vs fixed instruction bit-width

SIMD again has a severe disadvantage here, over Vector: a huge
proliferation of specialist instructions that target 8-bit, 16-bit,
32-bit and 64-bit, and then have to have operations *for each and
between each*. It gets very messy, very quickly.

The V-Extension on the other hand proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent
instructions involving that register are *implicitly* of that particular
bit-width until otherwise changed or reset.

This has some extremely useful properties, without being particularly
burdensome to implementations, given that instruction decode already has
to direct the operation to a correctly-sized-width ALU engine, anyway.
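The per-register implicit bit-width idea can be sketched with a toy
register-file simulator. This is purely illustrative: the class,
`setwidth` instruction name and encoding below are hypothetical and are
not taken from the actual V-extension specification; the sketch only
shows how a single ADD opcode can serve all element widths once the
width is carried as register state.

```python
# Toy sketch of per-register implicit bit-width (hypothetical encoding,
# NOT the real V-extension): a "set width" instruction tags a register,
# and subsequent ALU operations infer their width from the tag instead
# of needing add8/add16/add32/add64 opcode variants.

WIDTH_MASKS = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF,
               64: 0xFFFFFFFFFFFFFFFF}

class ToyRegFile:
    def __init__(self):
        self.regs = [0] * 32        # XLEN-wide storage
        self.width = [64] * 32      # implicit bit-width tag per register

    def setwidth(self, rd, bits):
        """Hypothetical 'set bit-width' instruction: tag rd as `bits` wide."""
        self.width[rd] = bits

    def add(self, rd, rs1, rs2):
        """One ADD opcode; the operation width is inferred from rd's tag."""
        mask = WIDTH_MASKS[self.width[rd]]
        self.regs[rd] = (self.regs[rs1] + self.regs[rs2]) & mask

rf = ToyRegFile()
rf.regs[1], rf.regs[2] = 0xFF, 1
rf.setwidth(3, 8)                   # tag x3 as 8-bit
rf.add(3, 1, 2)                     # 0xFF + 1 wraps to 0 at 8 bits
```

The same `add` call with x3 re-tagged as 16-bit would instead produce
0x100, which is the type-overloading property discussed above: one frozen
opcode, multiple widths, selected entirely by prior state.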
Not least: in places where an ISA was previously constrained (due to
whatever reason, including limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and
immutable instructions, in a fully backwards-compatible fashion.

## Implicit vs explicit type-conversion

The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
deal with over-population of instructions, such that type-casting from
integer (and floating point) of various sizes is automatically inferred
due to "type tagging" that is set with a special instruction. A register
will be *specifically* marked as "16-bit Floating-Point" and, if added
to an operand that is specifically tagged as "32-bit Integer", an
implicit type-conversion will take place *without* requiring that
type-conversion to be explicitly done with its own separate instruction.

However, implicit type-conversion is not only quite burdensome to
implement (an explosion of inferred type-to-type conversions) but is
also never really going to be complete. It gets even worse when
bit-widths also have to be taken into consideration.

Overall, type-conversion is generally best left to explicit
type-conversion instructions, or in definite specific use-cases left to
be part of an actual instruction (DSP or FP).

## Zero-overhead loops vs explicit loops

The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
contains an extremely interesting feature: zero-overhead loops. This
proposal would basically allow an inner loop of instructions to be
repeated a fixed number of times.

Its specific advantage over explicit loops is that the pipeline in a DSP
can potentially be kept completely full *even in an in-order
implementation*. Normally, it requires a superscalar architecture and
out-of-order execution capabilities to "pre-process" instructions in
order to keep ALU pipelines 100% occupied.
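The zero-overhead loop idea can be made concrete with a tiny interpreter
sketch. The `repeat` instruction, its operand layout, and the opcode
names below are hypothetical illustrations (not Chuanhua Chang's actual
encoding); the point is only that the repeated body contains no
compare/decrement/branch instructions, so the fetch unit can replay it
from a small buffer and keep the ALU pipeline fed.

```python
# Hypothetical sketch of a zero-overhead loop: a REPEAT instruction
# tells the fetch unit to replay the next `body_len` instructions
# `count` times, so the inner loop itself carries no branch overhead.

def exec_op(op, regs):
    if op[0] == "addi":                 # ("addi", rd, rs, imm)
        _, rd, rs, imm = op
        regs[rd] = regs[rs] + imm

def run(program, regs):
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "repeat":           # ("repeat", count, body_len)
            _, count, body_len = op
            body = program[pc + 1 : pc + 1 + body_len]
            for _ in range(count):      # hardware would replay from a
                for b in body:          # loop buffer rather than refetch
                    exec_op(b, regs)
            pc += 1 + body_len          # fall through past the body
        else:
            exec_op(op, regs)
            pc += 1
    return regs

# Accumulate 5 into r1 ten times, with zero per-iteration branch ops:
regs = run([("repeat", 10, 1), ("addi", 1, 1, 5)], [0] * 4)
# regs[1] is now 50
```

In hardware terms the `for` loops collapse into a repeat counter and a
loop buffer in the fetch stage, which is why an in-order DSP can keep its
pipeline full without any speculation machinery.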
This very simple proposal offers a way to increase pipeline activity in
the one key area which really matters: the inner loop.

## Mask and Tagging

*TODO: research masks, as they can be superb and extremely powerful. If
the B-Extension is implemented and provides Bit-Gather-Scatter it
becomes really cool and easy to switch out certain indexed values from
an array of data, but actually BGS **on its own** might be sufficient.
Bottom line, this is complex, and needs a proper analysis. The other
sections are pretty straightforward.*

## Conclusions

The sections above outlined the different ways in which parallel
instruction execution has closely and loosely inter-related implications
for the ISA and for implementors. The pluses and minuses came out as
follows:

* Fixed vs variable parallelism: variable
* Implicit (indirect) vs fixed (integral) instruction bit-width: indirect
* Implicit vs explicit type-conversion: explicit
* Implicit vs explicit inner loops: implicit
* Tag or no-tag: TODO

In particular: variable-length vectors came out on top because of the
high setup, teardown and corner-case costs associated with the fixed
width of SIMD. Implicit bit-width helps to extend the ISA to escape from
former limitations and restrictions (in a backwards-compatible fashion),
and implicit (zero-overhead) loops provide a means to keep pipelines
potentially 100% occupied *without* requiring a superscalar or
out-of-order architecture.

Constructing a SIMD/Simple-Vector proposal based around even only these
four (five?) requirements would therefore seem to be a logical thing to
do.

# References

* SIMD considered harmful
* Link to first proposal
* Recommendation by Jacob Bachmeyer to make zero-overhead loop an
  "implicit program-counter"
* Re-continuing P-Extension proposal
* First Draft P-SIMD (DSP) proposal