X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=e422e11f7c8b38c33b20642e31586ee7dc8b652b;hb=d9abaf326f14a84687aebd021f02ae8aa0225cc1;hp=bd99c92cc7f32aab77b7a33d15112b2f6fa0d470;hpb=e42745c78789d0153ca7a254dac43fb90ae071c7;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index bd99c92cc..e422e11f7 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,14 +1,5 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
-* TODO 23may2018: CSR-CAM-ify regfile tables
-* TODO 23may2018: zero-mark predication CSR
-* TODO 28may2018: sort out VSETVL: CSR length to be removed?
-* TODO 09jun2018: Chennai Presentation more up-to-date
-* TODO 09jun2019: elwidth only 4 values (dflt, dflt/2, 8, 16)
-* TODO 09jun2019: extra register banks (future option)
-* TODO 09jun2019: new Reg CSR table (incl. packed=Y/N)
-
-
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -18,8 +9,10 @@ instruction queue (FIFO), pending execution.
 
 *Actual* parallelism, if added independently of Simple-V in the form
 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+Talk slides: 
 
 [[!toc ]]
 
@@ -945,7 +938,7 @@ of detecting early page / segmentation faults and adjusting the TLB in advance,
 accordingly: other strategies are explored in the Appendix Section
 "Virtual Memory Page Faults".
 
-## Vectorised Copy/Move instructions
+## Vectorised Copy/Move (and conversion) instructions
 
 There is a series of 2-operand instructions involving copying (and
 alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
@@ -1068,6 +1061,168 @@ Similar rules apply to the destination register.
 * Throw an exception.  Whether that actually results in spawning threads
   as part of the trap-handling remains to be seen.
 
+# Under consideration
+
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should the future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+  (and how many in RV32E?)
+* How many in M-Mode (for doing context-switches)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)?
+* Should SIMD saturation (etc.) also be set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+  (same arrangement as C.MV, with the same flexibility/power)
+* For 8/16-bit ops, is it worthwhile adding a "start offset"
+  (a bit like misaligned addressing... for registers),
+  or should predication just be used to skip the start?
+
+## Should the future (extra) bank be included (made mandatory)?
+
+Expanding the *standard* register file from 32 entries per bank to
+64 per bank is quite an extensive architectural change, and it also
+has implications for context-switching.
+
+Therefore, on balance, it is not recommended, and it certainly should
+not be made a *mandatory* requirement for the use of SV.  SV's design
+ethos is to be minimally disruptive, so that implementors can
+shoe-horn it into an existing design.
+
+## How large should the Register and Predication CSR key-value stores be?
+
+This is something that definitely needs actual evaluation: code needs
+to be run and the results analysed.  At the time of writing (12jul2018)
+it is too early to tell.  An approximate best guess, however, would be
+16 entries.
+
+RV32E is a special case: it is highly unlikely (though not outside the
+realm of possibility) to be chosen for performance reasons; it is far
+more likely to be chosen to reduce instruction count.  The number of
+CSR entries for RV32E therefore has to be considered extremely
+carefully.
+
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
+
+The minimum number of required CSR entries would be 1 for each
+register bank: one for integer and one for floating-point.  However,
+as shown in the "Context Switch Example" section, for optimal
+efficiency (minimal instructions in a low-latency situation) the CSRs
+for the context-switch should be set up *and left alone*.
+
+This means that it is not really a good idea to touch the CSRs used
+for context-switching in the M-Mode (or S-Mode) trap, so if a need
+for vectors is ever demonstrated there, *at least* one more free entry
+would be needed.  However, just one does not make much sense (as it
+only covers scalar-vector ops), so it is more likely that at least
+two extra would be needed.
+
+These would be needed *in addition*, in the RV32E case, if an RV32E
+implementation happens also to support U/S/M modes.  This would be
+considered quite rare, but it is not outside the realm of possibility.
+
+Conclusion: this all needs careful analysis and future work.
+
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+
+On balance it's a neat idea, but one where the benefits are not really
+clear.  It would obviate the need for an exception to be raised when
+VL runs out of registers to put things in (gets to x31, tries a
+non-existent x32 and fails).  The "fly in the ointment", however, is
+that x0 is hard-coded to "zero": the increment would therefore need to
+be double-stepped to skip over x0.  Some microarchitectures could run
+into difficulties with this (SIMD-like ones in particular), so it
+needs a lot more thought.
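+
+To illustrate the "double-step" requirement, here is a minimal C sketch
+(purely illustrative, not part of the proposal) of how an element index
+might be mapped onto the integer register file if wrapping were
+permitted.  The helper name wrap_reg and its exact behaviour are
+assumptions made solely for the sake of the example:
+
+    #include <stdio.h>
+
+    /* Hypothetical mapping of (start register, element index) onto the
+       integer register file, wrapping x31 round to x1 and double-stepping
+       over x0, which is hard-wired to zero. */
+    static int wrap_reg(int start, int element)
+    {
+        int r = start;               /* start is assumed to be 1..31 */
+        while (element-- > 0) {
+            r = (r + 1) & 31;        /* x31 wraps round to x0...     */
+            if (r == 0)
+                r = 1;               /* ...which must be skipped     */
+        }
+        return r;
+    }
+
+    int main(void)
+    {
+        for (int i = 0; i < 4; i++)  /* starting at x30 with VL=4    */
+            printf("x%d ", wrap_reg(30, i));
+        printf("\n");                /* prints: x30 x31 x1 x2        */
+        return 0;
+    }
+
+The extra conditional inside the per-element step is precisely the kind
+of irregularity that SIMD-like microarchitectures would find awkward.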
+
+## Can CLIP be done as a CSR (mode, like elwidth)?
+
+RVV appears to be going this way.  At the time of writing (12jun2018),
+the V2.3-Draft V0.4 RVV Chapter notes that RVV intends to do clip by
+way of exactly this method: setting a "clip mode" in a CSR.
+
+No details are given; however, the most sensible approach would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis.  Other bits may be used for other purposes
+(see SIMD saturation below).
+
+## Should SIMD saturation (etc.) also be set as a mode?
+
+Similarly to "CLIP" as an extension to the CSR key-value store,
+"saturate" may also need extra details (what the saturation maximum
+is, for example).
+
+## Include src1/src2 predication on Comparison Ops?
+
+In C.MV (and other ops - see "C.MV Instruction"), unlike in ADD (etc.),
+which are 3-operand ops, the decision was taken to use *both* the src
+*and* dest predication masks, giving an extremely powerful and flexible
+instruction that covers a huge number of "traditional" vector opcodes.
+
+The natural question to ask, therefore, is: where else could this
+flexibility be deployed?  What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction, setting bit i of the predicate register for element i:
+
+    regs[pred] |= (cmp(regs[src1+i], regs[src2+i]) ? 1 : 0) << i
+
+Therefore, at first glance, it does not make sense to use src1 and src2
+predication masks, as doing so breaks the rule, established for
+3-operand instructions, of taking the predication mask from the
+*destination* register.
+
+In this case, however, the destination *is* a predication register,
+as opposed to a predication mask that is applied *to* the (vectorised)
+operation, element-at-a-time, on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
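+
+As a purely illustrative sketch of the semantics under discussion, the
+following C fragment shows one possible behaviour for a vectorised
+compare whose destination is a predicate register, with only a single
+destination mask applied element-at-a-time.  The function name, the
+fixed "less than" comparison and the treatment of masked-out elements
+are assumptions made for the sake of the example, not part of the
+specification:
+
+    #include <stdint.h>
+
+    /* Vectorised "less than": writes one bit per element into the
+       destination predicate register regs[pred].  Elements whose bit
+       is clear in destmask are skipped, leaving their result bit
+       untouched (just one of several possible choices). */
+    void vec_cmp_lt(uint64_t regs[], int pred, int src1, int src2,
+                    int vl, uint64_t destmask)
+    {
+        for (int i = 0; i < vl; i++) {
+            if (!(destmask & (1ULL << i)))
+                continue;                     /* element masked out */
+            if (regs[src1 + i] < regs[src2 + i])
+                regs[pred] |= (1ULL << i);    /* set bit i   */
+            else
+                regs[pred] &= ~(1ULL << i);   /* clear bit i */
+        }
+    }
+
+Whether src1 and src2 should each have their *own* predication masks
+on top of this single destination mask is exactly the open question
+raised above.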
+
+## 8/16-bit ops: is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in the "Packed SIMD"
+case, to avoid unaligned Load/Store operations by specifying that
+operations, instead of being carried out element-for-element, are
+offset by a fixed amount, *even* for 8 and 16-bit element Packed SIMD.
+
+For example, rather than take two 32-bit registers, each divided into
+four 8-bit elements, and have them ADDed element-for-element as follows:
+
+    r3[0] = add r4[0], r6[0]
+    r3[1] = add r4[1], r6[1]
+    r3[2] = add r4[2], r6[2]
+    r3[3] = add r4[3], r6[3]
+
+an offset of 1 would instead result in the following four operations:
+
+    r3[0] = add r4[1], r6[0]
+    r3[1] = add r4[2], r6[1]
+    r3[2] = add r4[3], r6[2]
+    r3[3] = add r5[0], r6[3]
+
+In non-Packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built in; this
+leaves just the Packed SIMD case to consider.
+
+There are two ways in which this could be implemented / emulated
+(without special hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+  either prior to or as part of the operation requiring the offset.
+* simply use an unaligned Load/Store sequence, even if there are
+  performance penalties for doing so.
+
+The question then is whether the performance hit is worth the extra
+hardware involved in byte-shuffling/shifting the data by an arbitrary
+offset.  On balance, given that there are two reasonable
+instruction-based options, the hardware-offset option should be left
+out of the initial version of SV, with the option of considering it in
+an "advanced" version of the specification.
+
 # Implementing V on top of Simple-V
 
 With Simple-V converting the original RVV draft concept-for-concept
@@ -1349,37 +1504,35 @@ the question is asked "How can each of the proposals effectively implement
 
 ### Example Instruction translation:
 
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+The instruction "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO.  r7 and r4 are marked as "vectorised":
+
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
 
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+The instruction "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO.  r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
 
 ## Example of vector / vector, vector / scalar, scalar / scalar => vector add
 
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
- register CSRpredicate[XLEN][4]; # 2^4 is max vector length - register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well - register x[32][XLEN]; - - function op_add(rd, rs1, rs2, predr) - { -    /* note that this is ADD, not PADD */ -    int i, id, irs1, irs2; -    # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored -    # also destination makes no sense as a scalar but what the hell... -    for (i = 0, id=0, irs1=0, irs2=0; i @@ -2191,3 +2344,5 @@ TBD: floating-point compare and other exception handling * * Full Description (last page) of RVV instructions +* PULP Low-energy Cluster Vector Processor +