X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=4cf0937edde233618bbd012a4a8ad571c9acd473;hb=989d018425f243ab9d3959108356eba243ad57ae;hp=51fe43d14b4a3629fbc9c51898ab63d0c94de1af;hpb=4af323d95355a1a0c60c8ae96b130d7256337583;p=libreriscv.git diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn index 51fe43d14..4cf0937ed 100644 --- a/simple_v_extension.mdwn +++ b/simple_v_extension.mdwn @@ -9,8 +9,8 @@ instruction queue (FIFO), pending execution. *Actual* parallelism, if added independently of Simple-V in the form of Out-of-order restructuring (including parallel ALU lanes) or VLIW -implementations, or SIMD, or anything else, would then benefit *if* -Simple-V was added on top. +implementations, or SIMD, or anything else, would then benefit from +the uniformity of a consistent API. [[!toc ]] @@ -126,7 +126,8 @@ reducing power consumption for the same. SIMD again has a severe disadvantage here, over Vector: huge proliferation of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and have to then have operations *for each and between each*. It gets very -messy, very quickly. +messy, very quickly: *six* separate dimensions giving an O(N^6) instruction +proliferation profile. The V-Extension on the other hand proposes to set the bit-width of future instructions on a per-register basis, such that subsequent instructions @@ -356,16 +357,14 @@ level all-hardware parallelism. Options are covered in the Appendix. # CSRs -There are a number of CSRs needed, which are used at the instruction -decode phase to re-interpret RV opcodes (a practice that has -precedent in the setting of MISA to enable / disable extensions). +There are two CSR tables needed to create lookup tables which are used at +the register decode phase. 
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1) +* Integer Register N is Vector * Integer Register N is of implicit bitwidth M (M=default,8,16,32,64) * Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1) * Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64) * Integer Register N is a Predication Register (note: a key-value store) -* Vector Length CSR (VSETVL, VGETVL) Also (see Appendix, "Context Switch Example") it may turn out to be important to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that @@ -389,40 +388,68 @@ Notes: V2.3-Draft ISA Reference) it becomes possible to greatly reduce state needed for context-switches (empty slots need never be stored). -## Predication CSR +## Predication CSR The Predication CSR is a key-value store indicating whether, if a given destination register (integer or floating-point) is referred to in an -instruction, it is to be predicated. The first entry is whether predication -is enabled. The second entry is whether the register index refers to a -floating-point or an integer register. The third entry is the index -of that register which is to be predicated (if referred to). The fourth entry -is the integer register that is treated as a bitfield, indexable by the -vector element index. - -| RegNo | 6 | 5 | (4..0) | (4..0) | -| ----- | - | - | ------- | ------- | -| r0 | pren0 | i/f | regidx | predidx | -| r1 | pren1 | i/f | regidx | predidx | -| .. | pren.. | i/f | regidx | predidx | -| r15 | pren15 | i/f | regidx | predidx | +instruction, it is to be predicated. However it is important to note +that the *actual* register is *different* from the one that ends up +being used, due to the level of indirection through the lookup table. 
+This includes redirecting to a *second* bank of +integer registers (as a future option). + +* regidx is the actual register that in combination with the + i/f flag, if that integer or floating-point register is referred to, + results in the lookup table being referenced to find the predication + mask to use on the operation in which that (regidx) register has + been used +* predidx (in combination with the bank bit in the future) is the + *actual* register to be used for the predication mask. Note: + in effect predidx is actually a 6-bit register address, as the bank + bit is the MSB (and is nominally set to zero for now). +* inv indicates that the predication mask bits are to be inverted + prior to use *without* actually modifying the contents of the + register itself. +* zeroing is either 1 or 0, and if set to 1, the operation must + place zeros in any element position where the predication mask is + set to zero. If zeroing is set to 0, unpredicated elements *must* + be left alone. Some microarchitectures may choose to interpret + this as skipping the operation entirely. Others which wish to + stick more closely to a SIMD architecture may choose instead to + interpret unpredicated elements as an internal "copy element" + operation (which would be necessary in SIMD microarchitectures + that perform register-renaming) + +| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) | +| ----- | - | - | - | - | ------- | ------- | +| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx | +| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx | +| .. | bank.. | zero.. | inv.. 
| i/f | regidx | predidx | +| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx | The Predication CSR Table is a key-value store, so implementation-wise it will be faster to turn the table around (maintain topologically equivalent state): - fp_pred_enabled[32]; - int_pred_enabled[32]; + struct pred { + bool zero; + bool inv; + bool bank; // 0 for now, 1=rsvd + bool enabled; + int predidx; // redirection: actual int register to use + } + + struct pred fp_pred_reg[32]; // 64 in future (bank=1) + struct pred int_pred_reg[32]; // 64 in future (bank=1) + for (i = 0; i < 16; i++) - if CSRpred[i].pren: - idx = CSRpred[i].regidx - predidx = CSRpred[i].predidx - if CSRpred[i].type == 0: # integer - int_pred_enabled[idx] = 1 - int_pred_reg[idx] = predidx - else: - fp_pred_enabled[idx] = 1 - fp_pred_reg[idx] = predidx + tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg; + idx = CSRpred[i].regidx + tb[idx].zero = CSRpred[i].zero + tb[idx].inv = CSRpred[i].inv + tb[idx].bank = CSRpred[i].bank + tb[idx].predidx = CSRpred[i].predidx + tb[idx].enabled = true So when an operation is to be predicated, it is the internal state that is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following @@ -436,24 +463,54 @@ reference to the predication register to be used: s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs This instead becomes an *indirect* reference using the *internal* state -table generated from the Predication CSR key-value store: +table generated from the Predication CSR key-value store, which is used +as follows. if type(iop) == INT: - pred_enabled = int_pred_enabled preg = int_pred_reg[rd] else: - pred_enabled = fp_pred_enabled preg = fp_pred_reg[rd] for (int i=0; i 1; - s2 = CSRvectorlen[src2] > 1; - for (int i=0; i + +There is no MV instruction in RV; however, there is a C.MV instruction. +It is used for copying integer-to-integer registers (vectorised FMV +is used for copying floating-point). 
+ +If either the source or the destination register are marked as vectors +C.MV is reinterpreted to be a vectorised (multi-register) predicated +move operation. The actual instruction's format does not change: + +[[!table data=""" +15 12 | 11 7 | 6 2 | 1 0 | +funct4 | rd | rs | op | +4 | 5 | 5 | 2 | +C.MV | dest | src | C0 | +"""]] + +A simplified version of the pseudocode for this operation is as follows: + + function op_mv(rd, rs) # MV not VMV! +  rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd; +  rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs; +  ps = get_pred_val(FALSE, rs); # predication on src +  pd = get_pred_val(FALSE, rd); # ... AND on dest +  for (int i = 0, int j = 0; i < VL && j < VL;): + if (int_vec[rs].isvec) while (!(ps & 1< What does an ADD of two different-sized vectors do in simple-V? @@ -876,6 +1059,168 @@ Section "Virtual Memory Page Faults". * Throw an exception. Whether that actually results in spawning threads as part of the trap-handling remains to be seen. +# Under consideration + +From the Chennai 2018 slides the following issues were raised. +Efforts to analyse and answer these questions are below. + +* Should future extra bank be included now? +* How many Register and Predication CSRs should there be? + (and how many in RV32E) +* How many in M-Mode (for doing context-switch)? +* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)? +* Can CLIP be done as a CSR (mode, like elwidth) +* SIMD saturation (etc.) also set as a mode? +* Include src1/src2 predication on Comparison Ops? + (same arrangement as C.MV, with same flexibility/power) +* 8/16-bit ops is it worthwhile adding a "start offset"? + (a bit like misaligned addressing... for registers) + or just use predication to skip start? + +## Future (extra) bank be included (made mandatory) + +The implications of expanding the *standard* register file from +32 entries per bank to 64 per bank is quite an extensive architectural +change. 
Also it has implications for context-switching. + +Therefore, on balance, it is not recommended and certainly should +not be made a *mandatory* requirement for the use of SV. SV's design +ethos is to be minimally-disruptive for implementors to shoe-horn +into an existing design. + +## How large should the Register and Predication CSR key-value stores be? + +This is something that definitely needs actual evaluation and for +code to be run and the results analysed. At the time of writing +(12jul2018) that is too early to tell. An approximate best-guess +however would be 16 entries. + +RV32E however is a special case, given that it is highly unlikely +(but not outside the realm of possibility) that it would be used +for performance reasons but instead for reducing instruction count. +The number of CSR entries therefore has to be considered extremely +carefully. + +## How many CSR entries in M-Mode or S-Mode (for context-switching)? + +The minimum required CSR entries would be 1 for each register-bank: +one for integer and one for floating-point. However, as shown +in the "Context Switch Example" section, for optimal efficiency +(minimal instructions in a low-latency situation) the CSRs for +the context-switch should be set up *and left alone*. + +This means that it is not really a good idea to touch the CSRs +used for context-switching in the M-Mode (or S-Mode) trap, so +if there is ever demonstrated a need for vectors then there would +need to be *at least* one more free. However just one does not make +much sense (as one only covers scalar-vector ops) so it is more +likely that at least two extra would be needed. + +This is *in addition* - in the RV32E case - if an RV32E implementation +happens also to support U/S/M modes. This would be considered quite +rare but not outside of the realm of possibility. + +Conclusion: all needs careful analysis and future work. + +## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)? 
+ +On balance it's a neat idea however it does seem to be one where the +benefits are not really clear. It would however obviate the need for +an exception to be raised if the VL runs out of registers to put +things in (gets to x31, tries a non-existent x32 and fails), however +the "fly in the ointment" is that x0 is hard-coded to "zero". The +increment therefore would need to be double-stepped to skip over x0. +Some microarchitectures could run into difficulties (SIMD-like ones +in particular) so it needs a lot more thought. + +## Can CLIP be done as a CSR (mode, like elwidth) + +RVV appears to be going this way. At the time of writing (12jun2018) +it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do +clip by way of exactly this method: setting a "clip mode" in a CSR. + +No details are given however the most sensible thing to have would be +to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have +extra bits specifying the type of clipping to be carried out, on +a per-register basis. Other bits may be used for other purposes +(see SIMD saturation below) + +## SIMD saturation (etc.) also set as a mode? + +Similar to "CLIP" as an extension to the CSR key-value store, "saturate" +may also need extra details (what the saturation maximum is for example). + +## Include src1/src2 predication on Comparison Ops? + +In the C.MV (and other ops - see "C.MV Instruction"), the decision +was taken, unlike in ADD (etc.) which are 3-operand ops, to use +*both* the src *and* dest predication masks to give an extremely +powerful and flexible instruction that covers a huge number of +"traditional" vector opcodes. + +The natural question therefore to ask is: where else could this +flexibility be deployed? What about comparison operations? + +Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst +predicated comparison operations are actually a *three* operand +instruction: + + regs[pred] |= 1<< (cmp(regs[src1], regs[src2]) ? 
1 : 0) + +Therefore at first glance it does not make sense to use src1 and src2 +predication masks, as it breaks the rule of 3-operand instructions +to use the *destination* predication register. + +In this case however, the destination *is* a predication register +as opposed to being a predication mask that is applied *to* the +(vectorised) operation, element-at-a-time on src1 and src2. + +Thus the question is directly inter-related to whether the modification +of the predication mask should *itself* be predicated. + +It is quite complex, in other words, and needs careful consideration. + +## 8/16-bit ops is it worthwhile adding a "start offset"? + +The idea here is to make it possible, particularly in a "Packed SIMD" +case, to be able to avoid doing unaligned Load/Store operations +by specifying that operations, instead of being carried out +element-for-element, are offset by a fixed amount *even* in 8 and 16-bit +element Packed SIMD cases. + +For example rather than take 2 32-bit registers divided into 4 8-bit +elements and have them ADDed element-for-element as follows: + + r3[0] = add r4[0], r6[0] + r3[1] = add r4[1], r6[1] + r3[2] = add r4[2], r6[2] + r3[3] = add r4[3], r6[3] + +an offset of 1 would result in four operations as follows, instead: + + r3[0] = add r4[1], r6[0] + r3[1] = add r4[2], r6[1] + r3[2] = add r4[3], r6[2] + r3[3] = add r5[0], r6[3] + +In non-packed-SIMD mode there is no benefit at all, as a vector may +be created using a different CSR that has the offset built-in. So this +leaves just the packed-SIMD case to consider. + +Two ways in which this could be implemented / emulated (without special +hardware): + +* bit-manipulation that shuffles the data along by one byte (or one word) + either prior to or as part of the operation requiring the offset. +* just use an unaligned Load/Store sequence, even if there are performance + penalties for doing so. 
+ +The question then is whether the performance hit is worth the extra hardware +involving byte-shuffling/shifting the data by an arbitrary offset. On +balance given that there are two reasonable instruction-based options, the +hardware-offset option should be left out for the initial version of SV, +with the option to consider it in an "advanced" version of the specification. + # Implementing V on top of Simple-V With Simple-V converting the original RVV draft concept-for-concept @@ -1555,7 +1900,7 @@ To illustrate how this works, here is some example code from FreeRTOS ... STORE x30, 29 * REGBYTES(sp) STORE x31, 30 * REGBYTES(sp) - + /* Store current stackpointer in task control block (TCB) */ LOAD t0, pxCurrentTCB //pointer STORE sp, 0x0(t0) @@ -1616,11 +1961,11 @@ bank of registers is to be loaded/saved: .macroVectorSetup MVECTORCSRx1 = 31, defaultlen MVECTORCSRx4 = 28, defaultlen - + /* Save Context */ SETVL x0, x0, 31 /* x0 ignored silently */ - STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth - + STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth + /* Restore registers, Skip global pointer because that does not change */ LOAD x1, 0x0(sp) @@ -1774,7 +2119,7 @@ discussion then led to the question of OoO architectures > relevant, is that the imprecise model increases the size of the context > structure, as the microarchitectural guts have to be spilled to memory.) ------ +## Zero/Non-zero Predication >> > it just occurred to me that there's another reason why the data >> > should be left instead of zeroed. if the standard register file is @@ -1804,6 +2149,109 @@ discussion then led to the question of OoO architectures > there may be a way to implement DTM as well. +## Implementation detail for scalar-only op detection + +Note 1: this idea is a pipeline-bypass concept, which may *or may not* be +worthwhile. + +Note 2: this is just one possible implementation. 
Another implementation +may choose to treat *all* operations as vectorised (including treating +scalars as vectors of length 1), choosing to add an extra pipeline stage +dedicated to *all* instructions. + +This section *specifically* covers the implementor's freedom to choose +that they wish to minimise disruption to an existing design by detecting +"scalar-only operations", bypassing the vectorisation phase (which may +or may not require an additional pipeline stage) + +[[scalardetect.png]] + +>> For scalar ops an implementation may choose to compare 2-3 bits through an +>> AND gate: are src & dest scalar? Yep, ok send straight to ALU  (or instr +>> FIFO). + +> Those bits cannot be known until after the registers are decoded from the +> instruction and a lookup in the "vector length table" has completed. +> Considering that one of the reasons RISC-V keeps registers in invariant +> positions across all instructions is to simplify register decoding, I expect +> that inserting an SRAM read would lengthen the critical path in most +> implementations. + +reply: + +> briefly: the trick i mentioned about ANDing bits together to check if +> an op was fully-scalar or not was to be read out of a single 32-bit +> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per +> register indicating "is register vectorised yes no". 3R because you need +> to check src1, src2 and dest simultaneously. the entries are *generated* +> from the CSRs and are an optimisation that on slower embedded systems +> would likely not be needed. + +> is there anything unreasonable that anyone can foresee about that? +> what are the down-sides? + +## C.MV predicated src, predicated dest + +> Can this be usefully defined in such a way that it is +> equivalent to vector gather-scatter on each source, followed by a +> non-predicated vector-compare, followed by vector gather-scatter on the +> result? + +## element width conversion: restrict or remove? + +summary: don't restrict / remove. 
it's fine. + +> > it has virtually no cost/overhead as long as you specify +> > that inputs can only upconvert, and operations are always done at the +> > largest size, and downconversion only happens at the output. +> +> okaaay.  so that's a really good piece of implementation advice. +> algorithms do require data size conversion, so at some point you need to +> introduce the feature of upconverting and downconverting. +> +> > for int and uint, this is dead simple and fits well within the RVV pipeline +> > without any critical path, pipeline depth, or area implications. + + + +## Under review / discussion: remove CSR vector length, use VSETVL + +**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines +length on all regs**. This section kept for historical reasons. + +So the issue is as follows: + +* CSRs are used to set the "span" of a vector (how many of the standard + register file to contiguously use) +* VSETVL in RVV works as follows: it sets the vector length (copy of which + is placed in a dest register), and if the "required" length is longer + than the *available* length, the dest reg is set to the MIN of those + two. +* **HOWEVER**... in SV, *EVERY* vector register has its own separate + length and thus there is no way (at the time that VSETVL is called) to + know what to set the vector length *to*. +* At first glance it seems that it would be perfectly fine to just limit + the vector operation to the length specified in the destination + register's CSR, at the time that each instruction is issued... + except that that cannot possibly be guaranteed to match + with the value *already loaded into the target register from VSETVL*. + +Therefore a different approach is needed. + +Possible options include: + +* Removing the CSR "Vector Length" and always using the value from + VSETVL. 
"VSETVL destreg, counterreg, #lenimmed" will set VL *and* + destreg equal to MIN(counterreg, lenimmed), with register-based + variant "VSETVL destreg, counterreg, lenreg" doing the same. +* Keeping the CSR "Vector Length" and having the lenreg version have + a "twist": "if lenreg is vectorised, read the length from the CSR" +* Other (TBD) + +The first option (of the ones brainstormed so far) is a lot simpler. +It does however mean that the length set in VSETVL will apply across-the-board +to all src1, src2 and dest vectorised registers until it is otherwise changed +(by another VSETVL call). This is probably desirable behaviour. ## Implementation Paradigms @@ -1890,6 +2338,9 @@ TBD: floating-point compare and other exception handling * Dot Product Vector * RVV slides 2017 -* Wavefront skipping using BRAMS +* Wavefront skipping using BRAMS * Streaming Pipelines * Barcelona SIMD Presentation +* +* Full Description (last page) of RVV instructions +