X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=e422e11f7c8b38c33b20642e31586ee7dc8b652b;hb=3f444fffc5a6b49cb5879ee44bd351b0a80a7dd8;hp=99ae03031cdf81bf80a89a6d9e58dd98027effcb;hpb=a8f03fb480aeab533c93cba39c50295ff7238121;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 99ae03031..e422e11f7 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,14 +1,5 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
-* TODO 23may2018: CSR-CAM-ify regfile tables
-* TODO 23may2018: zero-mark predication CSR
-* TODO 28may2018: sort out VSETVL: CSR length to be removed?
-* TODO 09jun2018: Chennai Presentation more up-to-date
-* TODO 09jun2019: elwidth only 4 values (dflt, dflt/2, 8, 16)
-* TODO 09jun2019: extra register banks (future option)
-* TODO 09jun2019: new Reg CSR table (incl. packed=Y/N)
-
-
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -18,8 +9,10 @@ instruction queue (FIFO), pending execution.
 
 *Actual* parallelism, if added independently of Simple-V in the form
 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+Talk slides:
 
 [[!toc ]]
 
@@ -1068,7 +1061,7 @@ Similar rules apply to the destination register.
 * Throw an exception. Whether that actually results in spawning threads
   as part of the trap-handling remains to be seen.
 
-# Under consideration
+# Under consideration
 
 From the Chennai 2018 slides the following issues were raised.
 Efforts to analyse and answer these questions are below.
@@ -1191,7 +1184,44 @@ It is quite complex, in other words, and needs careful consideration.
 
 ## 8/16-bit ops is it worthwhile adding a "start offset"?
 
-TBD
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
+
+For example rather than take 2 32-bit registers divided into 4 8-bit
+elements and have them ADDed element-for-element as follows:
+
+    r3[0] = add r4[0], r6[0]
+    r3[1] = add r4[1], r6[1]
+    r3[2] = add r4[2], r6[2]
+    r3[3] = add r4[3], r6[3]
+
+an offset of 1 would result in four operations as follows, instead:
+
+    r3[0] = add r4[1], r6[0]
+    r3[1] = add r4[2], r6[1]
+    r3[2] = add r4[3], r6[2]
+    r3[3] = add r5[0], r6[3]
+
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in. So this
+leaves just the packed-SIMD case to consider.
+
+Two ways in which this could be implemented / emulated (without special
+hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+  either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+  penalties for doing so.
+
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset. On
+balance given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
 
 # Impementing V on top of Simple-V
 
@@ -1474,37 +1504,35 @@ the question is asked "How can each of the proposals effectively implement
 
 ### Example Instruction translation:
 
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+Instructions "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO. r7 and r4 are marked as "vectorised":
 
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
+
+Instructions "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO. r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
 
 ## Example of vector / vector, vector / scalar, scalar / scalar => vector add
 
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
-    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
-    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
-    register x[32][XLEN];
-
-    function op_add(rd, rs1, rs2, predr)
-    {
-       /* note that this is ADD, not PADD */
-       int i, id, irs1, irs2;
-       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
-       # also destination makes no sense as a scalar but what the hell...
-       for (i = 0, id=0, irs1=0, irs2=0; i
@@ -2316,3 +2344,5 @@ TBD: floating-point compare and other exception handling
 *
 * Full Description (last page) of RVV instructions
+* PULP Low-energy Cluster Vector Processor
+
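The expansion described under "Example Instruction translation" in this patch can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposal's actual mechanism: the function name `expand_vector_op`, the `vectorised` set of register numbers, and the explicit vector length `vl` (taken as 3 because the text lists three generated instructions) are all hypothetical, and predication is ignored entirely.

```python
def expand_vector_op(op, rd, rs1, rs2, vectorised, vl):
    """Expand one instruction into `vl` scalar instructions for the FIFO.

    Hypothetical helper: register numbers listed in `vectorised` advance
    by one per element; all other registers are treated as scalars and
    stay fixed across the generated instructions.
    """
    def reg(base, i):
        # A vectorised register rN stands for rN, rN+1, ..., rN+vl-1.
        return base + i if base in vectorised else base

    return [
        f"{op} r{reg(rd, i)} r{reg(rs1, i)} r{reg(rs2, i)}"
        for i in range(vl)
    ]


# Vector / vector case from the text: r7 and r4 are marked as vectorised.
for line in expand_vector_op("ADD", 7, 4, 4, vectorised={7, 4}, vl=3):
    print(line)

# Vector / scalar case from the text: r7 and r1 vectorised, r4 scalar.
for line in expand_vector_op("ADD", 7, 4, 1, vectorised={7, 1}, vl=3):
    print(line)
```

Keeping the "is this register vectorised" decision per register, rather than per instruction, is what lets the same ADD opcode cover both the vector / vector and the vector / scalar forms shown in the text.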