X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=e422e11f7c8b38c33b20642e31586ee7dc8b652b;hb=d9abaf326f14a84687aebd021f02ae8aa0225cc1;hp=bd99c92cc7f32aab77b7a33d15112b2f6fa0d470;hpb=e42745c78789d0153ca7a254dac43fb90ae071c7;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index bd99c92cc..e422e11f7 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,14 +1,5 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
-* TODO 23may2018: CSR-CAM-ify regfile tables
-* TODO 23may2018: zero-mark predication CSR
-* TODO 28may2018: sort out VSETVL: CSR length to be removed?
-* TODO 09jun2018: Chennai Presentation more up-to-date
-* TODO 09jun2019: elwidth only 4 values (dflt, dflt/2, 8, 16)
-* TODO 09jun2019: extra register banks (future option)
-* TODO 09jun2019: new Reg CSR table (incl. packed=Y/N)
-
-
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -18,8 +9,10 @@ instruction queue (FIFO), pending execution.
 
 *Actual* parallelism, if added independently of Simple-V in the form
 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+Talk slides: 
 
 [[!toc ]]
 
@@ -945,7 +938,7 @@ of detecting early page / segmentation faults and adjusting the TLB in advance,
 accordingly: other strategies are explored in the Appendix Section
 "Virtual Memory Page Faults".
 
-## Vectorised Copy/Move instructions
+## Vectorised Copy/Move (and conversion) instructions
 
 There is a series of 2-operand instructions involving copying (and
 alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
@@ -1068,6 +1061,168 @@ Similar rules apply to the destination register.
 * Throw an exception.  Whether that actually results in spawning threads
   as part of the trap-handling remains to be seen.
 
+# Under consideration
+
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should the future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+  (and how many in RV32E?)
+* How many in M-Mode (for doing context-switches)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)?
+* Should SIMD saturation (etc.) also be set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+  (same arrangement as C.MV, with the same flexibility/power)
+* For 8/16-bit ops, is it worthwhile adding a "start offset"
+  (a bit like misaligned addressing... for registers),
+  or should predication just be used to skip the start?
+
+## Should the future (extra) bank be included (made mandatory)?
+
+Expanding the *standard* register file from 32 entries per bank to
+64 per bank is quite an extensive architectural change, and it also
+has implications for context-switching.
+
+Therefore, on balance, it is not recommended, and it certainly should
+not be made a *mandatory* requirement for the use of SV.  SV's design
+ethos is to be minimally disruptive, so that implementors can
+shoe-horn it into an existing design.
+
+## How large should the Register and Predication CSR key-value stores be?
+
+This is something that definitely needs actual evaluation: code needs
+to be run and the results analysed.  At the time of writing (12jul2018)
+it is too early to tell.  An approximate best guess, however, would be
+16 entries.
+
+RV32E is a special case: it is highly unlikely (though not outside the
+realm of possibility) to be chosen for performance reasons; it is far
+more likely to be chosen to reduce instruction count.  The number of
+CSR entries for RV32E therefore has to be considered extremely
+carefully.
+
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
+
+The minimum number of required CSR entries would be 1 for each
+register bank: one for integer and one for floating-point.  However,
+as shown in the "Context Switch Example" section, for optimal
+efficiency (minimal instructions in a low-latency situation) the CSRs
+for the context-switch should be set up *and left alone*.
+
+This means that it is not really a good idea to touch the CSRs used
+for context-switching in the M-Mode (or S-Mode) trap, so if a need
+for vectors is ever demonstrated there, *at least* one more free entry
+would be needed.  However, just one does not make much sense (as it
+only covers scalar-vector ops), so it is more likely that at least
+two extra would be needed.
+
+These would be needed *in addition*, in the RV32E case, if an RV32E
+implementation happens also to support U/S/M modes.  This would be
+considered quite rare, but it is not outside the realm of possibility.
+
+Conclusion: this all needs careful analysis and future work.
+
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+
+On balance it's a neat idea, but one where the benefits are not really
+clear.  It would obviate the need for an exception to be raised when
+VL runs out of registers to put things in (gets to x31, tries a
+non-existent x32 and fails).  The "fly in the ointment", however, is
+that x0 is hard-coded to "zero": the increment would therefore need to
+be double-stepped to skip over x0.  Some microarchitectures could run
+into difficulties with this (SIMD-like ones in particular), so it
+needs a lot more thought.
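+
+To illustrate the "double-step" requirement, here is a minimal C sketch
+(purely illustrative, not part of the proposal) of how an element index
+might be mapped onto the integer register file if wrapping were
+permitted.  The helper name wrap_reg and its exact behaviour are
+assumptions made solely for the sake of the example:
+
+    #include <stdio.h>
+
+    /* Hypothetical mapping of (start register, element index) onto the
+       integer register file, wrapping x31 round to x1 and double-stepping
+       over x0, which is hard-wired to zero. */
+    static int wrap_reg(int start, int element)
+    {
+        int r = start;               /* start is assumed to be 1..31 */
+        while (element-- > 0) {
+            r = (r + 1) & 31;        /* x31 wraps round to x0...     */
+            if (r == 0)
+                r = 1;               /* ...which must be skipped     */
+        }
+        return r;
+    }
+
+    int main(void)
+    {
+        for (int i = 0; i < 4; i++)  /* starting at x30 with VL=4    */
+            printf("x%d ", wrap_reg(30, i));
+        printf("\n");                /* prints: x30 x31 x1 x2        */
+        return 0;
+    }
+
+The extra conditional inside the per-element step is precisely the kind
+of irregularity that SIMD-like microarchitectures would find awkward.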
+
+## Can CLIP be done as a CSR (mode, like elwidth)?
+
+RVV appears to be going this way.  At the time of writing (12jun2018),
+the V2.3-Draft V0.4 RVV Chapter notes that RVV intends to do clip by
+way of exactly this method: setting a "clip mode" in a CSR.
+
+No details are given; however, the most sensible approach would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis.  Other bits may be used for other purposes
+(see SIMD saturation below).
+
+## Should SIMD saturation (etc.) also be set as a mode?
+
+Similarly to "CLIP" as an extension to the CSR key-value store,
+"saturate" may also need extra details (what the saturation maximum
+is, for example).
+
+## Include src1/src2 predication on Comparison Ops?
+
+In C.MV (and other ops - see "C.MV Instruction"), unlike in ADD (etc.),
+which are 3-operand ops, the decision was taken to use *both* the src
+*and* dest predication masks, giving an extremely powerful and flexible
+instruction that covers a huge number of "traditional" vector opcodes.
+
+The natural question to ask, therefore, is: where else could this
+flexibility be deployed?  What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction, setting bit i of the predicate register for element i:
+
+    regs[pred] |= (cmp(regs[src1+i], regs[src2+i]) ? 1 : 0) << i
+
+Therefore, at first glance, it does not make sense to use src1 and src2
+predication masks, as doing so breaks the rule, established for
+3-operand instructions, of taking the predication mask from the
+*destination* register.
+
+In this case, however, the destination *is* a predication register,
+as opposed to a predication mask that is applied *to* the (vectorised)
+operation, element-at-a-time, on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
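+
+As a purely illustrative sketch of the semantics under discussion, the
+following C fragment shows one possible behaviour for a vectorised
+compare whose destination is a predicate register, with only a single
+destination mask applied element-at-a-time.  The function name, the
+fixed "less than" comparison and the treatment of masked-out elements
+are assumptions made for the sake of the example, not part of the
+specification:
+
+    #include <stdint.h>
+
+    /* Vectorised "less than": writes one bit per element into the
+       destination predicate register regs[pred].  Elements whose bit
+       is clear in destmask are skipped, leaving their result bit
+       untouched (just one of several possible choices). */
+    void vec_cmp_lt(uint64_t regs[], int pred, int src1, int src2,
+                    int vl, uint64_t destmask)
+    {
+        for (int i = 0; i < vl; i++) {
+            if (!(destmask & (1ULL << i)))
+                continue;                     /* element masked out */
+            if (regs[src1 + i] < regs[src2 + i])
+                regs[pred] |= (1ULL << i);    /* set bit i   */
+            else
+                regs[pred] &= ~(1ULL << i);   /* clear bit i */
+        }
+    }
+
+Whether src1 and src2 should each have their *own* predication masks
+on top of this single destination mask is exactly the open question
+raised above.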
+
+## 8/16-bit ops: is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in the "Packed SIMD"
+case, to avoid unaligned Load/Store operations by specifying that
+operations, instead of being carried out element-for-element, are
+offset by a fixed amount, *even* for 8 and 16-bit element Packed SIMD.
+
+For example, rather than take two 32-bit registers, each divided into
+four 8-bit elements, and have them ADDed element-for-element as follows:
+
+    r3[0] = add r4[0], r6[0]
+    r3[1] = add r4[1], r6[1]
+    r3[2] = add r4[2], r6[2]
+    r3[3] = add r4[3], r6[3]
+
+an offset of 1 would instead result in the following four operations:
+
+    r3[0] = add r4[1], r6[0]
+    r3[1] = add r4[2], r6[1]
+    r3[2] = add r4[3], r6[2]
+    r3[3] = add r5[0], r6[3]
+
+In non-Packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built in; this
+leaves just the Packed SIMD case to consider.
+
+There are two ways in which this could be implemented / emulated
+(without special hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+  either prior to or as part of the operation requiring the offset.
+* simply use an unaligned Load/Store sequence, even if there are
+  performance penalties for doing so.
+
+The question then is whether the performance hit is worth the extra
+hardware involved in byte-shuffling/shifting the data by an arbitrary
+offset.  On balance, given that there are two reasonable
+instruction-based options, the hardware-offset option should be left
+out of the initial version of SV, with the option of considering it in
+an "advanced" version of the specification.
+
 # Implementing V on top of Simple-V
 
 With Simple-V converting the original RVV draft concept-for-concept
@@ -1349,37 +1504,35 @@ the question is asked "How can each of the proposals effectively implement
 
 ### Example Instruction translation:
 
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+The instruction "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO.  r7 and r4 are marked as "vectorised":
+
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
 
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+The instruction "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO.  r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
 
 ## Example of vector / vector, vector / scalar, scalar / scalar => vector add
 
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
- register CSRpredicate[XLEN][4]; # 2^4 is max vector length - register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well - register x[32][XLEN]; - - function op_add(rd, rs1, rs2, predr) - { -    /* note that this is ADD, not PADD */ -    int i, id, irs1, irs2; -    # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored -    # also destination makes no sense as a scalar but what the hell... -    for (i = 0, id=0, irs1=0, irs2=0; i @@ -2191,3 +2344,5 @@ TBD: floating-point compare and other exception handling * * Full Description (last page) of RVV instructions +* PULP Low-energy Cluster Vector Processor +