add slids

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index 634d205e9b9a8c45a216383d3e0f4298f75a854e..e422e11f7c8b38c33b20642e31586ee7dc8b652b 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,29 +1,68 @@
  # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
  
+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FIFO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+Talk slides: <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
+
+[[!toc ]]
+
+# Introduction
+
  This proposal exists so as to be able to satisfy several disparate
  requirements: power-conscious, area-conscious, and performance-conscious
  designs all pull an ISA and its implementation in different conflicting
  directions, as do the specific intended uses for any given implementation.
  
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
  whilst each extremely powerful in their own right and clearly desirable,
  are also:
  
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
+* Clearly independent in their origins (Cray and AndesStar v3 respectively)
    so need work to adapt to the RISC-V ethos and paradigm
  * Are sufficiently large so as to make adoption (and exploration for
    analysis and review purposes) prohibitively expensive
  * Both contain partial duplication of pre-existing RISC-V instructions
    (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+  parallelism at the instruction level
  * Both require that their respective parallelism paradigm be implemented
    along-side and integral to their respective functionality *or not at all*.
  * Both independently have methods for introducing parallelism that
    could, if separated, benefit
    *other areas of RISC-V not just DSP or Floating-point respectively*.
  
-Therefore it makes a huge amount of sense to have a means and method
+There are also key differences between Vectorisation and SIMD (full
+details outlined in the Appendix), the key points being:
+
+* SIMD has an extremely seductively compelling ease of implementation argument:
+  each operation is passed to the ALU, which is where the parallelism
+  lies.  There is *negligeable* (if any) impact on the rest of the core
+  (with life instead being made hell for compiler writers and applications
+  writers due to extreme ISA proliferation).
+* By contrast, Vectorisation has quite some complexity (for considerable
+  flexibility, reduction in opcode proliferation and much more).
+* Vectorisation typically includes much more comprehensive memory load
+  and store schemes (unit stride, constant-stride and indexed), which
+  in turn have ramifications: virtual memory misses (TLB cache misses)
+  and even multiple page-faults... all caused by a *single instruction*,
+  yet with a clear benefit that the regularisation of LOAD/STOREs can
+  be optimised for minimal impact on caches and maximised throughput.
+* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
+  to pages), and these load/stores have absolutely nothing to do with the
+  SIMD / ALU engine, no matter how wide the operand.  Simplicity but with
+  more impact on instruction and data caches.
+
+Overall it makes a huge amount of sense to have a means and method
  of introducing instruction parallelism in a flexible way that provides
  implementors with the option to choose exactly where they wish to offer
  performance improvements and where they wish to optimise for power
@@ -32,83 +71,29 @@ would provide even more flexibility).
  
  Additionally it makes sense to *split out* the parallelism inherent within
  each of P and V, and to see if each of P and V then, in *combination* with
-a "best-of-both" parallelism extension, would work well.
-
-**TODO**: reword this to better suit this document:
-
-Having looked at both P and V as they stand, they're _both_ very much
-"separate engines" that, despite both their respective merits and
-extremely powerful features, don't really cleanly fit into the RV design
-ethos (or the flexible extensibility) and, as such, are both in danger
-of not being widely adopted.  I'm inclined towards recommending:
-
-* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
-* splitting out the polymorphism, esoteric data types (GF, complex
-  numbers) and unusual operations of V to create a single-issue "Esoteric
-  Floating-Point" extension
-* splitting out the loop-aspects, vector aspects and data-width aspects
-  of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
-  apply across *all* Extensions, whether those be DSP, M, Base, V, P -
-  everything.
-
-**TODO**: propose overflow registers be actually one of the integer regs
-(flowing to multiple regs).
-
-**TODO**: propose "mask" (predication) registers likewise.  combination with
-standard RV instructions and overflow registers extremely powerful
-
-## Stride
-
-**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular
-register as being "if you use this reg in LOAD/STORE, use the offset
-amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous".
-can be used for matrix spanning.
-
-> For LOAD/STORE, could a better option be to interpret the offset in the 
-> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is 
-> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12), 
-> t5 = *(t2+24), t6 = *(t2+32)?  Perhaps include a bit in the 
-> vector-control CSRs to select between offset-as-stride and unit-stride 
-> memory accesses? 
-
-So there would be an instruction like this:
+a "best-of-both" parallelism extension, could be added on *on top* of
+this proposal, to topologically provide the exact same functionality of
+each of P and V.  Each of P and V then can focus on providing the best
+operations possible for their respective target areas, without being
+hugely concerned about the actual parallelism.
  
-| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM |
-| opcode | 5 bit | 1 bit             | 1 bit             | 5 bit, OFFn=XLEN |
-
-
-which would mean:
-
-* CSR-Offset register n <= (float|int) register number N
-* CSR-Offset Stride-mode = offset or unit
-* CSR-Offset amount register n = contents of register M
-
-LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):
-
-> offs = 0
-> stride = 1
-> vector-len = CSR-Vector-length register N
->
-> for (o = 0, o < 2, o++)
->   if (CSR-Offset register o == M)
->       offs = CSR-Offset amount register o
->       if CSR-Offset Stride-mode == offset:
->           stride = ldoffs
->       break
->
-> for (i = 0, i < vector-len; i++)
->   r[N+i] = mem[(offs*i + r[M+i])*stride]
+Furthermore, an additional goal of this proposal is to reduce the number
+of opcodes utilised by each of P and V as they currently stand, leveraging
+existing RISC-V opcodes where possible, and also potentially allowing
+P and V to make use of Compressed Instructions as a result.
  
  # Analysis and discussion of Vector vs SIMD
  
-There are four combined areas between the two proposals that help with
-parallelism without over-burdening the ISA with a huge proliferation of
+There are six combined areas between the two proposals that help with
+parallelism (increased performance, reduced power / area) without
+over-burdening the ISA with a huge proliferation of
  instructions:
  
  * Fixed vs variable parallelism (fixed or variable "M" in SIMD)
  * Implicit vs fixed instruction bit-width (integral to instruction or not)
  * Implicit vs explicit type-conversion (compounded on bit-width)
  * Implicit vs explicit inner loops.
+* Single-instruction LOAD/STORE.
  * Masks / tagging (selecting/preventing certain indexed elements from execution)
  
  The pros and cons of each are discussed and analysed below.
@@ -126,6 +111,13 @@ Thus, SIMD, no matter what width is chosen, is never going to be acceptable
  for general-purpose computation, and in the context of developing a
  general-purpose ISA, is never going to satisfy 100 percent of implementors.
  
+To explain this further: for increased workloads over time, as the
+performance requirements increase for new target markets, implementors
+choose to extend the SIMD width (so as to again avoid mixing parallelism
+into the instruction issue phases: the primary "simplicity" benefit of
+SIMD in the first place), with the result that the entire opcode space
+effectively doubles with each new SIMD width that's added to the ISA.
+
  That basically leaves "variable-length vector" as the clear *general-purpose*
  winner, at least in terms of greatly simplifying the instruction set,
  reducing the number of instructions required for any given task, and thus
@@ -136,7 +128,8 @@ reducing power consumption for the same.
  SIMD again has a severe disadvantage here, over Vector: huge proliferation
  of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
  have to then have operations *for each and between each*.  It gets very
-messy, very quickly.
+messy, very quickly: *six* separate dimensions giving an O(N^6) instruction
+proliferation profile.
  
  The V-Extension on the other hand proposes to set the bit-width of
  future instructions on a per-register basis, such that subsequent instructions
@@ -148,7 +141,7 @@ burdensome to implementations, given that instruction decode already has
  to direct the operation to a correctly-sized width ALU engine, anyway.
  
  Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
  implicit bit-width allows the meaning of certain operations to be
  type-overloaded *without* pollution or alteration of frozen and immutable
  instructions, in a fully backwards-compatible fashion.
@@ -161,13 +154,16 @@ integer (and floating point) of various sizes is automatically inferred
  due to "type tagging" that is set with a special instruction.  A register
  will be *specifically* marked as "16-bit Floating-Point" and, if added
  to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
  to be explicitly done with its own separate instruction.
  
  However, implicit type-conversion is not only quite burdensome to
  implement (explosion of inferred type-to-type conversion) but also is
  never really going to be complete.  It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration.  Each new type results in
+an increased O(N^2) conversion space that, as anyone who has examined
+python's source code (which has built-in polymorphic type-conversion),
+knows that the task is more complex than it first seems.
  
  Overall, type-conversion is generally best to leave to explicit
  type-conversion instructions, or in definite specific use-cases left to
@@ -180,27 +176,116 @@ contains an extremely interesting feature: zero-overhead loops.  This
  proposal would basically allow an inner loop of instructions to be
  repeated indefinitely, a fixed number of times.
  
-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
+Its specific advantage over explicit loops is that the pipeline in a DSP
+can potentially be kept completely full *even in an in-order single-issue
  implementation*.  Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
-
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
-
-## Mask and Tagging
-
-*TODO: research masks as they can be superb and extremely powerful.
-If B-Extension is implemented and provides Bit-Gather-Scatter it
-becomes really cool and easy to switch out certain indexed values
-from an array of data, but actually BGS **on its own** might be
-sufficient.  Bottom line, this is complex, and needs a proper analysis.
-The other sections are pretty straightforward.*
+out-of-order execution capabilities to "pre-process" instructions in
+order to keep ALU pipelines 100% occupied.
+
+By bringing that capability in, this proposal could offer a way to increase
+pipeline activity even in simpler implementations in the one key area
+which really matters: the inner loop.
+
+However when looking at much more comprehensive schemes
+"A portable specification of zero-overhead loop control hardware
+applied to embedded processors" (ZOLC), optimising only the single
+inner loop seems inadequate, tending to suggest that ZOLC may be
+better off being proposed as an entirely separate Extension.
+
+## Single-instruction LOAD/STORE
+
+In traditional Vector Architectures there are instructions which
+result in multiple register-memory transfer operations resulting
+from a single instruction.  They're complicated to implement in hardware,
+yet the benefits are a huge consistent regularisation of memory accesses
+that can be highly optimised with respect to both actual memory and any
+L1, L2 or other caches.  In Hwacha EECS-2015-263 it is explicitly made
+clear the consequences of getting this architecturally wrong:
+L2 cache-thrashing at the very least.
+
+Complications arise when Virtual Memory is involved: TLB cache misses
+need to be dealt with, as do page faults.  Some of the tradeoffs are
+discussed in <http://people.eecs.berkeley.edu/~krste/thesis.pdf>, Section
+4.6, and an article by Jeff Bush when faced with some of these issues
+is particularly enlightening
+<https://jbush001.github.io/2015/11/03/lost-in-translation.html>
+
+Interestingly, none of this complexity is faced in SIMD architectures...
+but then they do not get the opportunity to optimise for highly-streamlined
+memory accesses either.
+
+With the "bang-per-buck" ratio being so high and the indirect improvement
+in L1 Instruction Cache usage (reduced instruction count), as well as
+the opportunity to optimise L1 and L2 cache usage, the case for including
+Vector LOAD/STORE is compelling.
+
+## Mask and Tagging (Predication)
+
+Tagging (aka Masks aka Predication) is a pseudo-method of implementing
+simplistic branching in a parallel fashion, by allowing execution on
+elements of a vector to be switched on or off depending on the results
+of prior operations in the same array position.
+
+The reason for considering this is simple: by *definition* it
+is not possible to perform individual parallel branches in a SIMD
+(Single-Instruction, **Multiple**-Data) context.  Branches (modifying
+of the Program Counter) will result in *all* parallel data having
+a different instruction executed on it: that's just the definition of
+SIMD, and it is simply unavoidable.
+
+So these are the ways in which conditional execution may be implemented:
+
+* explicit compare and branch: BNE x, y -> offs would jump offs
+  instructions if x was not equal to y
+* explicit store of tag condition: CMP x, y -> tagbit
+* implicit (condition-code) such as ADD results in a carry, carry bit
+  implicitly (or sometimes explicitly) goes into a "tag" (mask) register
+
+The first of these is a "normal" branch method, which is flat-out impossible
+to parallelise without look-ahead and effectively rewriting instructions.
+This would defeat the purpose of RISC.
+
+The latter two are where parallelism becomes easy to do without complexity:
+every operation is modified to be "conditionally executed" (in an explicit
+way directly in the instruction format *or* implicitly).
+
+RVV (Vector-Extension) proposes to have *explicit* storing of the compare
+in a tag/mask register, and to *explicitly* have every vector operation
+*require* that its operation be "predicated" on the bits within an
+explicitly-named tag/mask register.
+
+SIMD (P-Extension) has not yet published precise documentation on what its
+schema is to be: there is however verbal indication at the time of writing
+that:
+
+> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
+> be executed using the same compare ALU logic for the base ISA with some
+> minor modifications to handle smaller data types. The function will not
+> be duplicated.
+
+This is an *implicit* form of predication as the base RV ISA does not have
+condition-codes or predication.  By adding a CSR it becomes possible
+to also tag certain registers as "predicated if referenced as a destination".
+Example:
+
+    // in future operations from now on, if r0 is the destination use r5 as
+    // the PREDICATION register
+    SET_IMPLICIT_CSRPREDICATE r0, r5
+    // store the compares in r5 as the PREDICATION register
+    CMPEQ8 r5, r1, r2
+    // r0 is used here.  ah ha!  that means it's predicated using r5!
+    ADD8 r0, r1, r3
+
+With enough registers (and in RISC-V there are enough registers) some fairly
+complex predication can be set up and yet still execute without significant
+stalling, even in a simple non-superscalar architecture.
+
+(For details on how Branch Instructions would be retro-fitted to indirectly
+predicated equivalents, see Appendix)
  
  ## Conclusions
  
-In the above sections the four different ways where parallel instruction
+In the above sections the five different ways where parallel instruction
  execution has closely and loosely inter-related implications for the ISA and
  for implementors, were outlined.  The pluses and minuses came out as
  follows:
@@ -208,29 +293,25 @@ follows:
  * Fixed vs variable parallelism: <b>variable</b>
  * Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
  * Implicit vs explicit type-conversion: <b>explicit</b>
-* Implicit vs explicit inner loops: <b>implicit</b>
-* Tag or no-tag: <b>TODO</b>
-
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD.  Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
-
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
-
-# Instruction Format
-
-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought.  V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width.  Lots to
-look at and investigate!*
+* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
+* Single-instruction Vector LOAD/STORE: <b>Complex but highly beneficial</b>
+* Tag or no-tag: <b>Complex but highly beneficial</b>
+
+In particular:
+
+* variable-length vectors came out on top because of the high setup, teardown
+  and corner-cases associated with the fixed width of SIMD.
+* Implicit bit-width helps to extend the ISA to escape from
+  former limitations and restrictions (in a backwards-compatible fashion),
+  whilst also leaving implementors free to simmplify implementations
+  by using actual explicit internal parallelism.
+* Implicit (zero-overhead) loops provide a means to keep pipelines
+  potentially 100% occupied in a single-issue in-order implementation
+  i.e. *without* requiring a super-scalar or out-of-order architecture,
+  but doing a proper, full job (ZOLC) is an entirely different matter.
+
+Constructing a SIMD/Simple-Vector proposal based around four of these six
+requirements would therefore seem to be a logical thing to do.
  
  # Note on implementation of parallelism
  
@@ -256,465 +337,1975 @@ basis* whether and how much "Virtual Parallelism" to deploy.
  
  It is absolutely critical to note that it is proposed that such choices MUST
  be **entirely transparent** to the end-user and the compiler.  Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
+a Vector (varible-width SIMD) may not precisely match the width of the
  parallelism within the implementation, the end-user **should not care**
  and in this way the performance benefits are gained but the ISA remains
-simple.  All that happens at the end of an instruction run is: some
+straightforward.  All that happens at the end of an instruction run is: some
  parallel units (if there are any) would remain offline, completely
  transparently to the ISA, the program, and the compiler.
  
-The "SIMD considered harmful" trap of having huge complexity and extra
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+Thus the "SIMD considered harmful" trap of having huge complexity and extra
  instructions to deal with corner-cases is thus avoided, and implementors
  get to choose precisely where to focus and target the benefits of their
-implementationefforts..
+implementation efforts, without "extra baggage".
+
+In addition, implementors will be free to choose whether to provide an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected), all the way up to full supercomputing
+level all-hardware parallelism.  Options are covered in the Appendix.
+
+# CSRs <a name="csrs"></a>
+
+There are two CSR tables needed to create lookup tables which are used at
+the register decode phase.
+
+* Integer Register N is Vector
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+
+Also (see Appendix, "Context Switch Example") it may turn out to be important
+to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
+Vectorised LOAD / STORE may be used to load and store multiple registers:
+something that is missing from the Base RV ISA.
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+  marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+  of vsetl.
+* Actual vector length is *multipled* by how many blocks of length
+  "bitwidth" may fit into an XLEN-sized register file.
+* Predication is a key-value store due to the implicit referencing,
+  as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+  state information.
+* TODO: assess whether the same technique could be applied to the other
+  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+  needed for context-switches (empty slots need never be stored).
+
+## Predication CSR <a name="predication_csr_table"></a>
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated.  However it is important to note
+that the *actual* register is *different* from the one that ends up
+being used, due to the level of indirection through the lookup table.
+This includes (in the future) redirecting to a *second* bank of
+integer registers (as a future option)
+
+* regidx is the actual register that in combination with the
+  i/f flag, if that integer or floating-point register is referred to,
+  results in the lookup table being referenced to find the predication
+  mask to use on the operation in which that (regidx) register has
+  been used
+* predidx (in combination with the bank bit in the future) is the
+  *actual* register to be used for the predication mask.  Note:
+  in effect predidx is actually a 6-bit register address, as the bank
+  bit is the MSB (and is nominally set to zero for now).
+* inv indicates that the predication mask bits are to be inverted
+  prior to use *without* actually modifying the contents of the
+  register itself.
+* zeroing is either 1 or 0, and if set to 1, the operation must
+  place zeros in any element position where the predication mask is
+  set to zero.  If zeroing is set to 1, unpredicated elements *must*
+  be left alone.  Some microarchitectures may choose to interpret
+  this as skipping the operation entirely.  Others which wish to
+  stick more closely to a SIMD architecture may choose instead to
+  interpret unpredicated elements as an internal "copy element"
+  operation (which would be necessary in SIMD microarchitectures
+  that perform register-renaming)
+
+| PrCSR | 13     | 12     | 11    | 10  | (9..5)  | (4..0)  |
+| ----- | -      | -      | -     | -   | ------- | ------- |
+| 0     | bank0  | zero0  | inv0  | i/f | regidx  | predidx |
+| 1     | bank1  | zero1  | inv1  | i/f | regidx  | predidx |
+| ..    | bank.. | zero.. | inv.. | i/f | regidx  | predidx |
+| 15    | bank15 | zero15 | inv15 | i/f | regidx  | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    struct pred {
+        bool zero;
+        bool inv;
+        bool bank;   // 0 for now, 1=rsvd
+        bool enabled;
+        int predidx; // redirection: actual int register to use
+    }
+
+    struct pred fp_pred_reg[32];   // 64 in future (bank=1)
+    struct pred int_pred_reg[32];  // 64 in future (bank=1)
+
+    for (i = 0; i < 16; i++)
+      tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
+      idx = CSRpred[i].regidx
+      tb[idx].zero = CSRpred[i].zero
+      tb[idx].inv  = CSRpred[i].inv
+      tb[idx].bank = CSRpred[i].bank
+      tb[idx].predidx  = CSRpred[i].predidx
+      tb[idx].enabled  = true
+
+So when an operation is to be predicated, it is the internal state that
+is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i<vl; ++i)
+        if ([!]preg[p][i])
+           (d ? vreg[rd][i] : sreg[rd]) =
+            iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+This instead becomes an *indirect* reference using the *internal* state
+table generated from the Predication CSR key-value store, which iwws used
+as follows.
+
+    if type(iop) == INT:
+        preg = int_pred_reg[rd]
+    else:
+        preg = fp_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+        predidx = preg[rd].predidx; // the indirection takes place HERE
+        if (!preg[rd].enabled)
+            predicate = ~0x0; // all parallel ops enabled
+        else:
+            predicate = intregfile[predidx]; // get actual reg contents HERE
+            if (preg[rd].inv) // invert if requested
+                predicate = ~predicate;
+        if (predicate && (1<<i))
+           (d ? regfile[rd+i] : regfile[rd]) =
+            iop(s1 ? regfile[rs1+i] : regfile[rs1],
+                s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
+        else if (preg[rd].zero)
+            // TODO: place zero in dest reg
+
+Note:
+
+* d, s1 and s2 are booleans indicating whether destination,
+  source1 and source2 are vector or scalar
+* key-value CSR-redirection of rd, rs1 and rs2 have NOT been included
+  above, for clarity.  rd, rs1 and rs2 all also must ALSO go through
+  register-level redirection (from the Register CSR table) if they are
+  vectors.
+
+If written as a function, obtaining the predication mask (but not whether
+zeroing takes place) may be done as follows:
+
+    def get_pred_val(bool is_fp_op, int reg):
+       tb = int_pred if is_fp_op else fp_pred
+       if (!tb[reg].enabled):
+          return ~0x0              // all ops enabled
+       predidx = tb[reg].predidx   // redirection occurs HERE
+       predicate = intreg[predidx] // actual predicate HERE
+       if (tb[reg].inv):
+          predicate = ~predicate   // invert ALL bits
+       return predicate
+
+## MAXVECTORLENGTH
+
+MAXVECTORLENGTH is the same concept as MVL in RVV.  However in Simple-V,
+given that its primary (base, unextended) purpose is for 3D, Video and
+other purposes (not requiring supercomputing capability), it makes sense
+to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
+and so on).
+
+The reason for setting this limit is so that predication registers, when
+marked as such, may fit into a single register as opposed to fanning out
+over several registers.  This keeps the implementation a little simpler.
+Note also (as also described in the VSETVL section) that the *minimum*
+for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
+and 31 for RV32 or RV64).
+
+Note that RVV on top of Simple-V may choose to over-ride this decision.
+
+## Register CSR key-value (CAM) table
+
+The purpose of the Register CSR table is four-fold:
+
+* To mark integer and floating-point registers as requiring "redirection"
+  if it is ever used as a source or destination in any given operation.
+  This involves a level of indirection through a 5-to-6-bit lookup table
+  (where the 6th bit - bank - is always set to 0 for now).
+* To indicate whether, after redirection through the lookup table, the
+  register is a vector (or remains a scalar).
+* To over-ride the implicit or explicit bitwidth that the operation would
+  normally give the register.
+* To indicate if the register is to be interpreted as "packed" (SIMD)
+  i.e. containing multiple contiguous elements of size equal to "bitwidth".
+
+| RgCSR | 15     | 14     | 13       | (12..11) | 10  | (9..5)  | (4..0)  |
+| ----- | -      | -      | -        | -        | -   | ------- | ------- |
+| 0     | simd0  | bank0  | isvec0   | vew0     | i/f | regidx  | predidx |
+| 1     | simd1  | bank1  | isvec1   | vew1     | i/f | regidx  | predidx |
+| ..    | simd.. | bank.. | isvec..  | vew..    | i/f | regidx  | predidx |
+| 15    | simd15 | bank15 | isvec15  | vew15    | i/f | regidx  | predidx |
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth  |
+| --- | --------- |
+| 00  | default   |
+| 01  | default/2 |
+| 10  | 8         |
+| 11  | 16        |
+
+Extending this table (with extra bits) is covered in the section
+"Implementing RVV on top of Simple-V".
+
+As the above table is a CAM (key-value store) it may be appropriate
+to expand it as follows:
+
+    struct vectorised fp_vec[32], int_vec[32]; // 64 in future
+
+    for (i = 0; i < 16; i++) // 16 CSRs?
+       tb = int_vec if CSRvec[i].type == 0 else fp_vec
+       idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
+       tb[idx].elwidth  = CSRvec[i].elwidth
+       tb[idx].regidx   = CSRvec[i].regidx  // indirection
+       tb[idx].isvector = CSRvec[i].isvector // 0=scalar
+       tb[idx].packed   = CSRvec[i].packed  // SIMD or not
+       tb[idx].bank     = CSRvec[i].bank    // 0 (1=rsvd)
+
+TODO: move elsewhere
+
+    # TODO: use elsewhere (retire for now)
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    elif (vew == 1)
+        bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 8 or 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+
+The reason for multiplying the vector length by the number of SIMD elements
+(in each individual register) is so that each SIMD element may optionally be
+predicated.
+
+An example of how to subdivide the register file when bitwidth != default
+is given in the section "Bitwidth Virtual Register Reordering".
+
+# Instructions
+
+Despite being a 98% complete and accurate topological remap of RVV
+concepts and functionality, the only instructions needed are VSETVL
+and VGETVL.  *All* RVV instructions can be re-mapped, however xBitManip
+becomes a critical dependency for efficient manipulation of predication
+masks (as a bit-field).  Despite the removal of all but VSETVL and VGETVL,
+*all instructions from RVV are topologically re-mapped and retain their
+complete functionality, intact*.
+
+Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
+equivalents, so are left out of Simple-V.  VSELECT could be included if
+there existed a MV.X instruction in RV (MV.X is a hypothetical
+non-immediate variant of MV that would allow another register to
+specify which register was to be copied).  Note that if any of these three
+instructions are added to any given RV extension, their functionality
+will be inherently parallelised.
+
+## Instruction Format
+
+The instruction format for Simple-V does not actually have *any* explicit
+compare operations, *any* arithmetic, floating point or *any*
+memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations, MV,
+FCVT, and LOAD/STORE
+depending on CSR configurations for bitwidth and
+predication.  **Everything** becomes parallelised.  *This includes
+Compressed instructions* as well as any
+future instructions and Custom Extensions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+  outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+  for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead.  Parallelism therefore effectively may end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
+with some opportunities for to use Compressed bringing it down to 48.
+Also to consider is whether one or both of the last two remaining Compressed
+instruction codes in Quadrant 1 could be used as a parallelism prefix,
+bringing parallelised opcodes down to 32-bit (when combined with C)
+and having the benefit of being explicit.*
+
+## VSETVL
+
+NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
+with the instruction format remaining the same.
+
+VSETVL is slightly different from RVV in that the minimum vector length
+is required to be at least the number of registers in the register file,
+and no more than XLEN.  This allows vector LOAD/STORE to be used to switch
+the entire bank of registers using a single instruction (see Appendix,
+"Context Switch Example").  The reason for limiting VSETVL to XLEN is
+down to the fact that predication bits fit into a single register of length
+XLEN bits.
+
+The second change is that when VSETVL is requested to be stored
+into x0, it is *ignored* silently (VSETVL x0, x5, #4)
+
+The third change is that there is an additional immediate added to VSETVL,
+to which VL is set after first going through MIN-filtering.
+So When using the "vsetl rs1, rs2, #vlen" instruction, it becomes:
+
+    VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+
+where RegfileLen <= MAXVECTORDEPTH < XLEN
+
+This has implication for the microarchitecture, as VL is required to be
+set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
+requested in the #immediate parameter.  RVV has the option to set VL
+to an arbitrary value that suits the conditions and the micro-architecture:
+SV does *not* permit that.
+
+The reason is so that if SV is to be used for a context-switch or as a
+substitute for LOAD/STORE-Multiple, the operation can be done with only
+2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
+single LD/ST operation).  If VL does *not* get set to the register file
+length when VSETVL is called, then a software-loop would be needed.
+To avoid this need, VL *must* be set to exactly what is requested
+(limits notwithstanding).
+
+Therefore, in turn, unlike RVV, implementors *must* provide
+pseudo-parallelism (using sequential loops in hardware) if actual
+hardware-parallelism in the ALUs is not deployed.  A hybrid is also
+permitted (as used in Broadcom's VideoCore-IV) however this must be
+*entirely* transparent to the ISA.
+
+## Branch Instruction:
+
+Branch operations use standard RV opcodes that are reinterpreted to
+be "predicate variants" in the instance where either of the two src
+registers are marked as vectors (isvector=1).  When this reinterpretation
+is enabled the "immediate" field of the branch operation is taken to be a
+predication target register, rs3.  The predicate target register rs3 is
+to be treated as a bitfield (up to a maximum of XLEN bits corresponding
+to a maximum of XLEN elements).
+
+If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
+goes ahead as vector-scalar or scalar-vector.  Implementors should note that
+this could require considerable multi-porting of the register file in order
+to parallelise properly, so may have to involve the use of register cacheing
+and transparent copying (see Multiple-Banked Register File Architectures
+paper).
+
+In instances where no vectorisation is detected on either src registers
+the operation is treated as an absolutely standard scalar branch operation.
+
+This is the overloaded table for Integer-base Branch operations.  Opcode
+(bits 6..0) is set in all cases to 1100011.
+
+[[!table  data="""
+31    .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
+imm[12,10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
+7           | 5        | 5     | 3      | 4             | 1  | 7       |
+reserved    | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
+reserved    | src2     | src1  | 000    | predicate rs3     || BEQ     |
+reserved    | src2     | src1  | 001    | predicate rs3     || BNE     |
+reserved    | src2     | src1  | 010    | predicate rs3     || rsvd    |
+reserved    | src2     | src1  | 011    | predicate rs3     || rsvd    |
+reserved    | src2     | src1  | 100    | predicate rs3     || BLE     |
+reserved    | src2     | src1  | 101    | predicate rs3     || BGE     |
+reserved    | src2     | src1  | 110    | predicate rs3     || BLTU    |
+reserved    | src2     | src1  | 111    | predicate rs3     || BGEU    |
+"""]]
+
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BLT, BGT, BLEU and BTGU may be synthesised by inverting
+src1 and src2.
+
+Below is the overloaded table for Floating-point Predication operations.
+Interestingly no change is needed to the instruction format because
+FP Compare already stores a 1 or a zero in its "rd" integer register
+target, i.e. it's not actually a Branch at all: it's a compare.
+The target needs to simply change to be a predication bitfield (done
+implicitly).
+
+As with
+Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
+Likewise Single-precision, fmt bits 26..25) is still set to 00.
+Double-precision is still set to 01, whilst Quad-precision
+appears not to have a definition in V2.3-Draft (but should be unaffected).
+
+It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
+and whilst in ordinary branch code this is fine because the standard
+RVF compare can always be followed up with an integer BEQ or a BNE (or
+a compressed comparison to zero or non-zero), in predication terms that
+becomes more of an impact.  To deal with this, SV's predication has
+had "invert" added to it.
+
+[[!table  data="""
+31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 .. 7  | 6 ... 0 |
+funct5  | fmt      | rs2      | rs1   | funct3 | rd       | opcode  |
+5       | 2        | 5        | 5     | 3      | 4        | 7       |
+10100   | 00/01/11 | src2     | src1  | 010    | pred rs3 | FEQ     |
+10100   | 00/01/11 | src2     | src1  | **011**| pred rs3 | rsvd    |
+10100   | 00/01/11 | src2     | src1  | 001    | pred rs3 | FLT     |
+10100   | 00/01/11 | src2     | src1  | 000    | pred rs3 | FLE     |
+"""]]
+
+Note (**TBD**): floating-point exceptions will need to be extended
+to cater for multiple exceptions (and statuses of the same).  The
+usual approach is to have an array of status codes and bit-fields,
+and one exception, rather than throw separate exceptions for each
+Vector element.
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+                           s2 ? vreg[rs2][i] : sreg[rs2]);
+
+With associated predication, vector-length adjustments and so on,
+and temporarily ignoring bitwidth (which makes the comparisons more
+complex), this becomes:
+
+    if I/F == INT: # integer type cmp
+        preg = int_pred_reg[rd]
+        reg = int_regfile
+    else:
+        preg = fp_pred_reg[rd]
+        reg = fp_regfile
+
+    s1 = reg_is_vectorised(src1);
+    s2 = reg_is_vectorised(src2);
+    if (!s2 && !s1) goto branch;
+    for (int i = 0; i < VL; ++i)
+      if (cmp(s1 ? reg[src1+i]:reg[src1],
+              s2 ? reg[src2+i]:reg[src2])
+             preg[rs3] |= 1<<i;  # bitfield not vector
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+  Reordering") setting Vector-Length times (number of SIMD elements) bits
+  in Predicate Register rs3 as opposed to just Vector-Length bits.
+* Predicated Branches do not actually have an adjustment to the Program
+  Counter, so all of bits 25 through 30 in every case are not needed.
+* There are plenty of reserved opcodes for which bits 25 through 30 could
+  be put to good use if there is a suitable use-case.
+  FLT and FLE may be inverted to FGT and FGE if needed by swapping
+  src1 and src2 (likewise the integer counterparts).
+
+## Compressed Branch Instruction:
+
+Compressed Branch instructions are likewise re-interpreted as predicated
+2-register operations, with the result going into rs3.  All the bits of
+the immediate are re-interpreted for different purposes, to extend the
+number of comparator operations to beyond the original specification,
+but also to cater for floating-point comparisons as well as integer ones.
+
+[[!table  data="""
+15..13 | 12...10  | 9..7 | 6..5  | 4..2 | 1..0 | name |
+funct3 | imm      | rs10 | imm   |      | op   |      |
+3      | 3        | 3    | 2     |  3   | 2    |      |
+C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
+110    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.EQ |
+111    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.NE |
+110    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LT |
+111    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LE |
+"""]]
+
+Notes:
+
+* Bits 5 13 14 and 15 make up the comparator type
+* Bit 6 indicates whether to use integer or floating-point comparisons
+* In both floating-point and integer cases there are four predication
+  comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
+  src1 and src2).
+
+## LOAD / STORE Instructions <a name="load_store"></a>
+
+For full analysis of topological adaptation of RVV LOAD/STORE
+see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
+
+Revised LOAD:
+
+[[!table  data="""
+31 | 30 | 29 25 | 24    20 | 19 15 | 14   12 | 11 7 | 6    0 |
+imm[11:0]               |||| rs1   | funct3  | rd   | opcode |
+1  | 1  |  5    | 5        | 5     | 3       | 5    | 7      |
+?  | s  |  rs2  | imm[4:0] | base  | width   | dest | LOAD   |
+"""]]
+
+The exact same corresponding adaptation is also carried out on the single,
+double and quad precision floating-point LOAD-FP and STORE-FP operations,
+which fit the exact same instruction format.  Thus all three types
+(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
+as well as FSW, FSD and FSQ.
+
+Notes:
+
+* LOAD remains functionally (topologically) identical to RVV LOAD
+  (for both integer and floating-point variants).
+* Predication CSR-marking register is not explicitly shown in instruction, it's
+  implicit based on the CSR predicate state for the rd (destination) register
+* rs2, the source, may *also be marked as a vector*, which implicitly
+  is taken to indicate "Indexed Load" (LD.X)
+* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
+* Bit 31 is reserved (ideas under consideration: auto-increment)
+* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
+* **TODO**: clarify where width maps to elsize
+
+Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+
+    preg = int_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[rd] & 1<<i)
+        for (int j=0; j<seglen+1; j++)
+        {
+          if CSRvectorised[rs2])
+             offs = vreg[rs2+i]
+          else
+             offs = i*(seglen+1)*stride;
+          vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
+        }
+
+Taking CSR (SIMD) bitwidth into account involves using the vector
+length and register encoding according to the "Bitwidth Virtual Register
+Reordering" scheme shown in the Appendix (see function "regoffs").
+
+A similar instruction exists for STORE, with identical topological
+translation of all features.  **TODO**
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are of the same format, where bits 2-4 are
+a src register instead of dest:
+
+[[!table  data="""
+15  13 | 12       10 | 9    7 | 6         5 | 4  2 | 1  0 |
+funct3 | imm         | rs10   | imm         | rd0  | op   |
+3      | 3           | 3      | 2           | 3    | 2    |
+C.LW   | offset[5:3] | base   | offset[2|6] | dest | C0   |
+"""]]
+
+Unfortunately it is not possible to fit the full functionality
+of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
+require another operand (rs2) in addition to the operand width
+(which is also missing), offset, base, and src/dest.
+
+However a close approximation may be achieved by taking the top bit
+of the offset in each of the five types of LD (and ST), reducing the
+offset to 4 bits and utilising the 5th bit to indicate whether "stride"
+is to be enabled.  In this way it is at least possible to introduce
+that functionality.
+
+(**TODO**: *assess whether the loss of one bit from offset is worth having
+"stride" capability.*)
+
+We also assume (including for the "stride" variant) that the "width"
+parameter, which is missing, is derived and implicit, just as it is
+with the standard Compressed LOAD/STORE instructions.  For C.LW, C.LD
+and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
+C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
+
+Interestingly we note that the Vectorised Simple-V variant of
+LOAD/STORE (Compressed and otherwise), due to it effectively using the
+standard register file(s), is the direct functional equivalent of
+standard load-multiple and store-multiple instructions found in other
+processors.
+
+In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on
+page 76, "For virtual memory systems some data accesses could be resident
+in physical memory and some not".  The interesting question then arises:
+how does RVV deal with the exact same scenario?
+Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
+of detecting early page / segmentation faults and adjusting the TLB
+in advance, accordingly: other strategies are explored in the Appendix
+Section "Virtual Memory Page Faults".
+
+## Vectorised Copy/Move (and conversion) instructions
+
+There is a series of 2-operand instructions involving copying (and
+alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ.  These operations all
+follow the same pattern, as it is *both* the source *and* destination
+predication masks that are taken into account.  This is different from
+the three-operand arithmetic instructions, where the predication mask
+is taken from the *destination* register, and applied uniformly to the
+elements of the source register(s), element-for-element.
+
+### C.MV Instruction <a name="c_mv"></a>
+
+There is no MV instruction in RV however there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
+
+If either the source or the destination register are marked as vectors
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation.  The actual instruction's format does not change:
+
+[[!table  data="""
+15  12 | 11   7 | 6  2 | 1  0 |
+funct4 | rd     | rs   | op   |
+4      | 5      | 5    | 2    |
+C.MV   | dest   | src  | C0   |
+"""]]
+
+A simplified version of the pseudocode for this operation is as follows:
+
+    function op_mv(rd, rs) # MV not VMV!
+      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
+      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_vec[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_vec[rd].isvec) while (!(pd & 1<<j)) j++;
+        ireg[rd+j] <= ireg[rs+i];
+        if (int_vec[rs].isvec) i++;
+        if (int_vec[rd].isvec) j++;
+
+Note that:
+
+* elwidth (SIMD) is not covered above
+* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
+  not covered
+
+There are several different instructions from RVV that are covered by
+this one opcode:
+
+[[!table  data="""
+src    | dest    | predication   | op             |
+scalar | vector  | none          | VSPLAT         |
+scalar | vector  | destination   | sparse VSPLAT  |
+scalar | vector  | 1-bit dest    | VINSERT        |
+vector | scalar  | 1-bit? src    | VEXTRACT       |
+vector | vector  | none          | VCOPY          |
+vector | vector  | src           | Vector Gather  |
+vector | vector  | dest          | Vector Scatter |
+vector | vector  | src & dest    | Gather/Scatter |
+vector | vector  | src == dest   | sparse VCOPY   |
+"""]]
+
+Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
+operations with inversion on the src and dest predication for one of the
+two C.MV operations.
+
+Note that in the instance where the Compressed Extension is not implemented,
+MV may be used, but that is a pseudo-operation mapping to addi rd, x0, rs.
+Note that the behaviour is **different** from C.MV because with addi the
+predication mask to use is taken **only** from rd and is applied against
+all elements: rs[i] = rd[i].
+
+### FMV, FNEG and FABS Instructions
+
+These are identical in form to C.MV, except covering floating-point
+register copying.  The same double-predication rules also apply.
+However when elwidth is not set to default the instruction is implicitly
+and automatic converted to a (vectorised) floating-point type conversion
+operation of the appropriate size covering the source and destination
+register bitwidths.
+
+(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
+
+### FVCT Instructions
+
+These are again identical in form to C.MV, except that they cover
+floating-point to integer and integer to floating-point.  When element
+width in each vector is set to default, the instructions behave exactly
+as they are defined for standard RV (scalar) operations, except vectorised
+in exactly the same fashion as outlined in C.MV.
+
+However when the source or destination element width is not set to default,
+the opcode's explicit element widths are *over-ridden* to new definitions,
+and the opcode's element width is taken as indicative of the SIMD width
+(if applicable i.e. if packed SIMD is requested) instead.
+
+For example FCVT.S.L would normally be used to convert a 64-bit
+integer in register rs1 to a 64-bit floating-point number in rd.
+If however the source rs1 is set to be a vector, where elwidth is set to
+default/2 and "packed SIMD" is enabled, then the first 32 bits of
+rs1 are converted to a floating-point number to be stored in rd's
+first element and the higher 32-bits *also* converted to floating-point
+and stored in the second.  The 32 bit size comes from the fact that
+FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
+divide that by two it means that rs1 element width is to be taken as 32.
+
+Similar rules apply to the destination register.
  
-# V-Extension to Simple-V Comparative Analysis
+# Exceptions
  
-This section covers the ways in which Simple-V is comparable
-to, or more flexible than, V-Extension (V2.3-draft).  Also covered is
-one major weak-point (register files are fixed size, where V is
-arbitrary length), and how best to deal with that, should V be adapted
-to be on top of Simple-V.
+> What does an ADD of two different-sized vectors do in simple-V?
  
-The first stages of this section go over each of the sections of V2.3-draft V
-where appropriate
+* if the two source operands are not the same, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+  than the destination, throw an exception.
  
-## 17.3 Shape Encoding
+> And what about instructions like JALR? 
+> What does jumping to a vector do?
+
+* Throw an exception.  Whether that actually results in spawning threads
+  as part of the trap-handling remains to be seen.
  
-Simple-V's proposed means of expressing whether a register (from the
-standard integer or the standard floating-point file) is a scalar or
-a vector is to simply set the vector length to 1.  The instruction
-would however have to specify which register file (integer or FP) that
-the vector-length was to be applied to.
+# Under consideration <a name="issues"></a>
  
-Extended shapes (2-D etc) would not be part of Simple-V at all.
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+         (and how many in RV32E)
+* How many in M-Mode (for doing context-switch)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)
+* SIMD saturation (etc.) also set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+  (same arrangement as C.MV, with same flexibility/power)
+* 8/16-bit ops is it worthwhile adding a "start offset"?
+  (a bit like misaligned addressing... for registers)
+  or just use predication to skip start?
  
-## 17.4 Representation Encoding
+## Future (extra) bank be included (made mandatory)
  
-Simple-V would not have representation-encoding.  This is part of
-polymorphism, which is considered too complex to implement (TODO: confirm?)
+The implications of expanding the *standard* register file from
+32 entries per bank to 64 per bank is quite an extensive architectural
+change.  Also it has implications for context-switching.
  
-## 17.5 Element Bitwidth
+Therefore, on balance, it is not recommended and certainly should
+not be made a *mandatory* requirement for the use of SV.  SV's design
+ethos is to be minimally-disruptive for implementors to shoe-horn
+into an existing design.
  
-This is directly equivalent to Simple-V's "Packed", and implies that
-integer (or floating-point) are divided down into vector-indexable
-chunks of size Bitwidth.
+## How large should the Register and Predication CSR key-value stores be?
  
-In this way it becomes possible to have ADD effectively and implicitly
-turn into ADDb (8-bit add), ADDw (16-bit add) and so on, and where
-vector-length has been set to greater than 1, it becomes a "Packed"
-(SIMD) instruction.
+This is something that definitely needs actual evaluation and for
+code to be run and the results analysed.  At the time of writing
+(12jul2018) that is too early to tell.  An approximate best-guess
+however would be 16 entries.
  
-It remains to be decided what should be done when RV32 / RV64 ADD (sized)
-opcodes are used.  One useful idea would be, on an RV64 system where
-a 32-bit-sized ADD was performed, to simply use the least significant
-32-bits of the register (exactly as is currently done) but at the same
-time to *respect the packed bitwidth as well*.
+RV32E however is a special case, given that it is highly unlikely
+(but not outside the realm of possibility) that it would be used
+for performance reasons but instead for reducing instruction count.
+The number of CSR entries therefore has to be considered extremely
+carefully.
  
-The extended encoding (Table 17.6) would not be part of Simple-V.
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
  
-## 17.6 Base Vector Extension Supported Types
+The minimum required CSR entries would be 1 for each register-bank:
+one for integer and one for floating-point.  However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
  
-TODO: analyse.  probably exactly the same.
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free.  However just one does not make
+much sense (as it one only covers scalar-vector ops) so it is more
+likely that at least two extra would be needed.
  
-## 17.7 Maximum Vector Element Width
+This *in addition* - in the RV32E case - if an RV32E implementation
+happens also to support U/S/M modes.  This would be considered quite
+rare but not outside of the realm of possibility.
  
-No equivalent in Simple-V
+Conclusion: all needs careful analysis and future work.
  
-## 17.8 Vector Configuration Registers
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
  
-TODO: analyse.
+On balance it's a neat idea however it does seem to be one where the
+benefits are not really clear.  It would however obviate the need for
+an exception to be raised if the VL runs out of registers to put
+things in (gets to x31, tries a non-existent x32 and fails), however
+the "fly in the ointment" is that x0 is hard-coded to "zero".  The
+increment therefore would need to be double-stepped to skip over x0.
+Some microarchitectures could run into difficulties (SIMD-like ones
+in particular) so it needs a lot more thought.
  
-## 17.9 Legal Vector Unit Configurations
+## Can CLIP be done as a CSR (mode, like elwidth)
  
-TODO: analyse
+RVV appears to be going this way.  At the time of writing (12jun2018)
+it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
  
-## 17.10 Vector Unit CSRs
+No details are given however the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis.  Other bits may be used for other purposes
+(see SIMD saturation below)
  
-TODO: analyse
+## SIMD saturation (etc.) also set as a mode?
  
-> Ok so this is an aspect of Simple-V that I hadn't thought through,
-> yet (proposal / idea only a few days old!).  in V2.3-Draft ISA Section
-> 17.10 the CSRs are listed.  I note that there's some general-purpose
-> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i
-> don't precisely know what those are for.
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is for example).
  
->  In the Simple-V proposal, *every* register in both the integer
-> register-file *and* the floating-point register-file would have at
-> least a 2-bit "data-width" CSR and probably something like an 8-bit
-> "vector-length" CSR (less in RV32E, by exactly one bit).
+## Include src1/src2 predication on Comparison Ops?
  
->  What I *don't* know is whether that would be considered perfectly
-> reasonable or completely insane.  If it turns out that the proposed
-> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
-> adding somewhere in the region of 10 bits per register would be... okay? 
-> I really don't honestly know.
+In the C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
  
->  Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
-> be multi-ported? No I don't believe they would.
+The natural question therefore to ask is: where else could this
+flexibility be deployed?  What about comparison operations?
  
-## 17.11 Maximum Vector Length (MVL)
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+    regs[pred] |= 1<< (cmp(regs[src1], regs[src2]) ? 1 : 0)
  
-Basically implicitly this is set to the maximum size of the register
-file multiplied by the number of 8-bit packed ints that can fit into
-a register (4 for RV32, 8 for RV64 and 16 for RV128).
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
  
-## !7.12 Vector Instruction Formats
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
  
-No equivalent in Simple-V because *all* instructions of *all* Extensions
-are implicitly parallelised (and packed).
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
  
-## 17.13 Polymorphic Vector Instructions
+It is quite complex, in other words, and needs careful consideration.
  
-Polymorphism (implicit type-casting) is deliberately not supported
-in Simple-V.
+## 8/16-bit ops is it worthwhile adding a "start offset"?
  
-## 17.14 Rapid Configuration Instructions
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
  
-TODO: analyse if this is useful to have an equivalent in Simple-V
+For example rather than take 2 32-bit registers divided into 4 8-bit
+elements and have them ADDed element-for-element as follows:
  
-## 17.15 Vector-Type-Change Instructions
+    r3[0] = add r4[0], r6[0]
+    r3[1] = add r4[1], r6[1]
+    r3[2] = add r4[2], r6[2]
+    r3[3] = add r4[3], r6[3]
  
-TODO: analyse if this is useful to have an equivalent in Simple-V
+an offset of 1 would result in four operations as follows, instead:
  
-## 17.16 Vector Length
+    r3[0] = add r4[1], r6[0]
+    r3[1] = add r4[2], r6[1]
+    r3[2] = add r4[3], r6[2]
+    r3[3] = add r5[0], r6[3]
  
-Has a direct corresponding equivalent.
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in.  So this
+leaves just the packed-SIMD case to consider.
  
-## 17.17 Predicated Execution
+Two ways in which this could be implemented / emulated (without special
+hardware):
  
-Predicated Execution is another name for "masking" or "tagging".  Masked
-(or tagged) implies that there is a bit field which is indexed, and each
-bit associated with the corresponding indexed offset register within
-the "Vector".  If the tag / mask bit is 1, when a parallel operation is
-issued, the indexed element of the vector has the operation carried out.
-However if the tag / mask bit is *zero*, that particular indexed element
-of the vector does *not* have the requested operation carried out.
+* bit-manipulation that shuffles the data along by one byte (or one word)
+  either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+  penalties for doing so.
  
-In V2.3-draft V, there is a significant (not recommended) difference:
-the zero-tagged elements are *set to zero*.  This loses a *significant*
-advantage of mask / tagging, particularly if the entire mask register
-is itself a general-purpose register, as that general-purpose register
-can be inverted, shifted, and'ed, or'ed and so on.  In other words
-it becomes possible, especially if Carry/Overflow from each vector
-operation is also accessible, to do conditional (step-by-step) vector
-operations including things like turn vectors into 1024-bit or greater
-operands with very few instructions, by treating the "carry" from
-one instruction as a way to do "Conditional add of 1 to the register
-next door".  If V2.3-draft V sets zero-tagged elements to zero, such
-extremely powerful techniques are simply not possible.
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset.  On
+balance given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
  
-It is noted that there is no mention of an equivalent to BEXT (element
-skipping) which would be particularly fascinating and powerful to have.
-In this mode, the "mask" would skip elements where its mask bit was zero
-in either the source or the destination operand.
+# Impementing V on top of Simple-V
  
-Lots to be discussed.
+With Simple-V converting the original RVV draft concept-for-concept
+from explicit opcodes to implicit overloading of existing RV Standard
+Extensions, certain features were (deliberately) excluded that need
+to be added back in for RVV to reach its full potential.  This is
+made slightly complicated by the fact that RVV itself has two
+levels: Base and reserved future functionality.
+
+* Representation Encoding is entirely left out of Simple-V in favour of
+  implicitly taking the exact (explicit) meaning from RV Standard Extensions.
+* VCLIP and VCLIPI do not have corresponding RV Standard Extension
+  opcodes (and are the only such operations).
+* Extended Element bitwidths (1 through to 24576 bits) were left out
+  of Simple-V as, again, there is no corresponding RV Standard Extension
+  that covers anything even below 32-bit operands.
+* Polymorphism was entirely left out of Simple-V due to the inherent
+  complexity of automatic type-conversion.
+* Vector Register files were specifically left out of Simple-V in favour
+  of fitting on top of the integer and floating-point files.  An
+  "RVV re-retro-fit" needs to be able to mark (implicitly marked)
+  registers as being actually in a separate *vector* register file.
+* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
+  register file size is 5 bits (32 registers), whilst the "Extended"
+  variant of RVV specifies 8 bits (256 registers) and has yet to
+  be published.
+* One big difference: Sections 17.12 and 17.17, there are only two possible
+  predication registers in RVV "Base".  Through the "indirect" method,
+  Simple-V provides a key-value CSR table that allows (arbitrarily)
+  up to 16 (TBD) of either the floating-point or integer registers to
+  be marked as "predicated" (key), and if so, which integer register to
+  use as the predication mask (value).
+
+**TODO**
  
-## 17.18 Vector Load/Store Instructions
+# Implementing P (renamed to DSP) on top of Simple-V
  
-These may not have a direct equivalent in Simple-V, except if mask/tagging
-is to be deployed.
+* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
+  (caveat: anything not specified drops through to software-emulation / traps)
+* TODO
  
-To be discussed.
+# Appendix
+
+## V-Extension to Simple-V Comparative Analysis
+
+This section has been moved to its own page [[v_comparative_analysis]]
+
+## P-Ext ISA
+
+This section has been moved to its own page [[p_comparative_analysis]]
+
+## Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
+### [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+  allocating identical opcodes multiple independent registers) meaning
+  that SRAM or 2R1W can be used for entire regfile (potentially).
+* minus: a more complex instruction set where the parallelism is much
+  more explicitly directly specified in the instruction and
+* minus: if you *don't* have an explicit instruction (opcode) and you
+  need one, the only place it can be added is... in the vector unit and
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+  not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+  but with the down-side that they're an all-or-nothing part of the Extension.
+  No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+  parallelisation can be carried out, followed by further parallel Lane-based
+  work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+  is to drop data into memory and immediately back in again (like MMX).
+
+### Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware.  It's an API in effect that's
+designed to be slotted in to an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity (if full parallelism is to be exploited)
+  of having to use register renames, OoO, VLIW, register file cacheing,
+  all of which has been done before but is a pain
+* plus: transparent re-use of existing opcodes as-is just indirectly
+  saying "this register's now a vector" which
+* plus: means that future instructions also get to be inherently
+  parallelised because there's no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V means that setup (setting
+  a CSR register to indicate vector length, a separate one to indicate
+  that it is a predicate register and so on) means a little more setup
+  time than Alt-RVP or RVV's "direct and within the (longer) instruction"
+  approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+  operations not suited to parallelisation may be carried out interleaved
+  between parallelised instructions *without* requiring data to be dropped
+  down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+  files means that huge parallel workloads would use up considerable
+  chunks of the register file.  However in the case of RV64 and 32-bit
+  operations, that effectively means 64 slots are available for parallel
+  operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+  be added, yet the instruction opcodes remain unchanged (and still appear
+  to be parallel).  consistent "API" regardless of actual internal parallelism:
+  even an in-order single-issue implementation with a single ALU would still
+  appear to have parallel vectoristion.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+  hard to say if there would be pluses or minuses (on die area).  At worse it
+  would be "no worse" than existing register renaming, OoO, VLIW and register
+  file cacheing schemes.
+
+### RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+  streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+  (similar to Alt-RVP) may be used as an implementation detail,
+  using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
+  really feasible.
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+  do not gain parallelism, resulting in prolific duplication of functionality
+  inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+  using the standard integer or FP regfile) an entire vector must be
+  transferred out to memory, into standard regfiles, then back to memory,
+  then back to the vector unit, this to occur potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is.  May
+  be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+  vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+  implementation time and die area, meaning that adoption is likely only
+  to be in high-performance specialist supercomputing (where it will
+  be absolutely superb).
+
+### Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+get good performance.  Unfortunately that makes it quite seductive...
+
+* plus: really straightforward, ALU basically does several packed operations
+  at once.  Parallelism is inherent at the ALU, making the addition of
+  SIMD-style parallelism an easy decision that has zero significant impact
+  on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+  therefore result in superb throughput, easily achieved even with a very
+  simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+  increase instruction count on what would otherwise be a "simple loop",
+  should the number of elements in an array not happen to exactly match
+  the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+  are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+  are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+  dimension and parallelism (width): an at least O(N^2) and quite probably
+  O(N^3) ISA proliferation that often results in several thousand
+  separate instructions.  all requiring separate and distinct corner-case
+  algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+  8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
+  For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+  four separate and distinct instructions: one for (r1:low r2:high),
+  one for (r1:high r2:low), one for (r1:high r2:high) and one for
+  (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+  between operand and result bit-widths.  In combination with high/low
+  proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+  that allow control over individual elements within the SIMD block.
+
+## Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD.  In other words,
+the question is asked "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be"?
+
+### [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+  a SIMD architecture where the ALU becomes responsible for the parallelism,
+  Alt-RVP ALUs would likewise be so responsible... with *additional*
+  (lane-based) parallelism on top.
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
+  at least one dimension are avoided (architectural upgrades introducing
+  128-bit then 256-bit then 512-bit variants of the exact same 64-bit
+  SIMD block)
+* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
+  of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+  be able to subdivide the bits of each register lane (columns) down into
+  arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+  16-bit or even 8-bit, effectively dividing the registerfile into
+  Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further.  If inter-lane
+  "swapping" instructions were then introduced, some of the disadvantages
+  of SIMD could be mitigated.
+
+### RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+  parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+  DSPs with a focus on Multimedia (Audio, Video and Image processing),
+  RVV's primary focus appears to be on Supercomputing: optimisation of
+  mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel)
+  into a SIMD instruction requires an equivalent to be added to the
+  RVV Extension, if one does not exist.  Given the specialist nature of
+  some SIMD instructions (8-bit or 16-bit saturated or halving add),
+  this possibility seems extremely unlikely to occur, even if the
+  implementation overhead of RVV were acceptable (compared to
+  normal SIMD/DSP-style single-issue in-order simplicity).
+
+### Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+  topologically transplant every single instruction from RVV (as
+  designed) into Simple-V equivalents, with *zero loss of functionality
+   or capability*.
+* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
+  Extension which contained the basic primitives (non-parallelised
+  8, 16 or 32-bit SIMD operations) inherently *become* parallel,
+  automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+  to have special SIMD-parallel opcodes added need no longer have *any*
+  of the length-dependent variants (2of 32-bit ADDs in a 64-bit register,
+  4of 32-bit ADDs in a 128-bit register) because Simple-V takes the
+  *standard* RV opcodes (present and future) and automatically parallelises
+  them.
+* By inheriting the RVV feature of arbitrary vector-length, then just as
+  with RVV the corner-cases and ISA proliferation of SIMD is avoided.
+* Whilst not entirely finalised, registers are expected to be
+  capable of being subdivided down to an implementor-chosen bitwidth
+  in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+  and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
+  choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+  ALUs that perform twin 8-bit operations as they see fit, or anything
+  else including no subdivisions at all.
+* Even though implementors have that choice even to have full 64-bit
+  (with RV64) SIMD, they *must* provide predication that transparently
+  switches off appropriate units on the last loop, thus neatly fitting
+  underlying SIMD ALU implementations *into* the arbitrary vector-length
+  RVV paradigm, keeping the uniform consistent API that is a key strategic
+  feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+  of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+  can be done by applying *Parallelised* Bit-manipulation operations
+  followed by parallelised *straight* versions of element-to-element
+  arithmetic operations, even if the bit-manipulation operations require
+  changing the bitwidth of the "vectors" to do so.  Predication can
+  be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD - massive duplication of
+  identical functions over time as an architecture evolves from 32-bit
+  wide SIMD all the way up to 512-bit, is avoided with Simple-V, through
+  vector-style parallelism being dropped on top of 8-bit or 16-bit
+  operations, all the while keeping a consistent ISA-level "API" irrespective
+  of implementor design choices (or indeed actual implementations).
+
+### Example Instruction translation: <a name="example_translation"></a>
+
+Instructions "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO.  r7 and r4 are marked as "vectorised":
+
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
+
+Instructions "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO.  r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
+
+## Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      predval = get_pred_val(FALSE, rd);
+      for (i = 0; i < VL; i++)
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+
+## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication.  However, with the base ISA having been designed explictly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+bcomes quite challenging.
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set?  This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLT would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*!  The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not.  All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table  data="""
+31      |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7       | 6....0 |
+imm[12] | imm[10:5]  |rs2   | rs1  | funct3 | imm[4:1] | imm[11] | opcode |
+ 1      | 6          | 5    | 5    | 3      | 4        | 1       |   7    |
+   offset[12,10:5]  || src2 | src1 | BEQ    | offset[11,4:1]    || BRANCH |
+"""]]
+
+This would become:
+
+[[!table  data="""
+31      | 30 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
+imm[12] | imm[10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
+1       | 6        | 5        | 5     | 3      | 4             | 1  | 7       |
+reserved          || src2     | src1  | BEQ    | predicate rs3     || BRANCH  |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is presently
+the "immediate offset" field to reinterpret that to add in not only a bit
+field to distinguish between floating-point compare and integer compare,
+not only to add in a second source register, but also use some of the bits as
+a predication target as well.
+
+[[!table  data="""
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2     | 1 .. 0 |
+funct3 | imm           | rs10  | imm               | op     |
+3      | 3             | 3     | 5                 | 2      |
+C.BEQZ | offset[8,4:3] | src   | offset[7:6,2:1,5] | C1     |
+"""]]
+
+Now uses the CS format:
+
+[[!table  data="""
+15..13 | 12 .  10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm      | rs10   | imm    |      | op     |
+3      | 3        | 3      | 2      | 3    | 2      |
+C.BEQZ | pred rs3 | src1   | I/F B  | src2 | C1     |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float" including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators.  In both floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
+
+## Register reordering <a name="register_reordering"></a>
+
+### Register File
+
+| Reg Num | Bits |
+| ------- | ---- |
+| r0 | (32..0) |
+| r1 | (32..0) |
+| r2 | (32..0) |
+| r3 | (32..0) |
+| r4 | (32..0) |
+| r5 | (32..0) |
+| r6 | (32..0) |
+| r7 | (32..0) |
+| .. | (32..0) |
+| r31| (32..0) |
+
+### Vectorised CSR
+
+May not be an actual CSR: may be generated from Vector Length CSR:
+single-bit is less burdensome on instruction decode phase.
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
+
+### Vector Length CSR
+
+| Reg Num | (3..0) |
+| ------- | ---- |
+| r0 | 2 |
+| r1 | 0 |
+| r2 | 1 |
+| r3 | 1 |
+| r4 | 3 |
+| r5 | 0 |
+| r6 | 0 |
+| r7 | 1 |
+
+### Virtual Register Reordering
+
+This example assumes the above Vector Length CSR table
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0 | (32..0) | (32..0) |
+| r2 | (32..0) |
+| r3 | (32..0) |
+| r4 | (32..0) | (32..0) | (32..0) |
+| r7 | (32..0) |
+
+### Bitwidth Virtual Register Reordering
+
+This example goes a little further and illustrates the effect that a
+bitwidth CSR has been set on a register.  Preconditions:
+
+* RV32 assumed
+* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
+* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
+* vsetl rs1, 5 # set the vector length to 5
+
+This is interpreted as follows:
+
+* Given that the context is RV32, ELEN=32.
+* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
+* Therefore the actual vector length is up to *six* elements
+* However vsetl sets a length 5 therefore the last "element" is skipped
+
+So when using an operation that uses r2 as a source (or destination)
+the operation is carried out as follows:
+
+* 16-bit operation on r2(15..0) - vector element index 0
+* 16-bit operation on r2(31..16) - vector element index 1
+* 16-bit operation on r3(15..0) - vector element index 2
+* 16-bit operation on r3(31..16) - vector element index 3
+* 16-bit operation on r4(15..0) - vector element index 4
+* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
+
+Predication has been left out of the above example for simplicity, however
+predication is ANDed with the latter stages (vsetl not equal to maximum
+capacity).
+
+Note also that it is entirely an implementor's choice as to whether to have
+actual separate ALUs down to the minimum bitwidth, or whether to have something
+more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
+operations carried out 32-bits at a time is perfectly acceptable, as is
+8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
+Regardless of the internal parallelism choice, *predication must
+still be respected*, making Simple-V in effect the "consistent public API".
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default  | XLEN/8     |
+| 001 | 8        | 1          |
+| 010 | 16       | 2          |
+| 011 | 32       | 4          |
+| 100 | 64       | 8          |
+| 101 | 128      | 16         |
+| 110 | rsvd     | rsvd       |
+| 111 | rsvd     | rsvd       |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx = floor(i / simdmult)  # integer-div rounded down
+        byteidx = i % simdmult        # integer-remainder
+        return rnum + regidx,         # actual real register
+               byteidx * 8,           # low
+               byteidx * 8 + (vew-1), # high
+
+### Insights
+
+SIMD register file splitting still to consider.  For RV64, benefits of doubling
+(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
+size of the floating point register file to 64 (128 in the case of HP)
+seem pretty clear and worth the complexity.
+
+64 virtual 32-bit F.P. registers and given that 32-bit FP operations are
+done on 64-bit registers it's not so conceptually difficult.  May even
+be achieved by *actually* splitting the regfile into 64 virtual 32-bit
+registers such that a 64-bit FP scalar operation is dropped into (r0.H
+r0.L) tuples.  Implementation therefore hidden through register renaming.
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated quicker (or in a more compact fashion that is less heavy
+on caches).  Interestingly we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+## Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
+
+It could indeed have been logically deduced (or expected), that there
+would be additional decode latency in this proposal, because if
+overloading the opcodes to have different meanings, there is guaranteed
+to be some state, some-where, directly related to registers.
+
+There are several cases:
+
+* All operands vector-length=1 (scalars), all operands
+  packed-bitwidth="default": instructions are passed through direct as if
+  Simple-V did not exist.  Simple-V is, in effect, completely disabled.
+* At least one operand vector-length > 1, all operands
+  packed-bitwidth="default": any parallel vector ALUs placed on "alert",
+  virtual parallelism looping may be activated.
+* All operands vector-length=1 (scalars), at least one
+  operand packed-bitwidth != default: degenerate case of SIMD,
+  implementation-specific complexity here (packed decode before ALUs or
+  *IN* ALUs)
+* At least one operand vector-length > 1, at least one operand
+  packed-bitwidth != default: parallel vector ALUs (if any)
+  placed on "alert", virtual parallelsim looping may be activated,
+  implementation-specific SIMD complexity kicks in (packed decode before
+  ALUs or *IN* ALUs).
+
+Bear in mind that the proposal includes that the decision whether
+to parallelise in hardware or whether to virtual-parallelise (to
+dramatically simplify compilers and also not to run into the SIMD
+instruction proliferation nightmare) *or* a transprent combination
+of both, be done on a *per-operand basis*, so that implementors can
+specifically choose to create an application-optimised implementation
+that they believe (or know) will sell extremely well, without having
+"Extra Standards-Mandated Baggage" that would otherwise blow their area
+or power budget completely out the window.
+
+Additionally, two possible CSR schemes have been proposed, in order to
+greatly reduce CSR space:
+
+* per-register CSRs (vector-length and packed-bitwidth)
+* a smaller number of CSRs with the same information but with an *INDEX*
+  specifying WHICH register in one of three regfiles (vector, fp, int)
+  the length and bitwidth applies to.
+
+(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
+
+In addition, LOAD/STORE has its own associated proposed CSRs that
+mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
+V (and Hwacha).
+
+Also bear in mind that, for reasons of simplicity for implementors,
+I was coming round to the idea of permitting implementors to choose
+exactly which bitwidths they would like to support in hardware and which
+to allow to fall through to software-trap emulation.
+
+So the question boils down to:
+
+* whether either (or both) of those two CSR schemes have significant
+  latency that could even potentially require an extra pipeline decode stage
+* whether there are implementations that can be thought of which do *not*
+  introduce significant latency
+* whether it is possible to explicitly (through quite simply
+  disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
+  all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
+  the extreme of skipping an entire pipeline stage (if one is needed)
+* whether packed bitwidth and associated regfile splitting is so complex
+  that it should definitely, definitely be made mandatory that implementors
+  move regfile splitting into the ALU, and what are the implications of that
+* whether even if that *is* made mandatory, is software-trapped
+  "unsupported bitwidths" still desirable, on the basis that SIMD is such
+  a complete nightmare that *even* having a software implementation is
+  better, making Simple-V have more in common with a software API than
+  anything else.
+
+Whilst the above may seem to be severe minuses, there are some strong
+pluses:
+
+* Significant reduction of V's opcode space: over 95%.
+* Smaller reduction of P's opcode space: around 10%.
+* The potential to use Compressed instructions in both Vector and SIMD
+  due to the overloading of register meaning (implicit vectorisation,
+  implicit packing)
+* Not only present but also future extensions automatically gain parallelism.
+* Already mentioned but worth emphasising: the simplification to compiler
+  writers and assembly-level writers of having the same consistent ISA
+  regardless of whether the internal level of parallelism (number of
+  parallel ALUs) is only equal to one ("virtual" parallelism), or is
+  greater than one, should not be underestimated.
+
+## Reducing Register Bank porting
+
+This looks quite reasonable.
+<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+
+The main details are outlined on page 4.  They propose a 2-level register
+cache hierarchy, note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
+a scheme where you look ahead by only 2 instructions to determine which
+registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
+
+## Overflow registers in combination with predication
  
-## 17.19 Vector Register Gather
+**TODO**: propose overflow registers be actually one of the integer regs
+(flowing to multiple regs).
  
-TODO
+**TODO**: propose "mask" (predication) registers likewise.  combination with
+standard RV instructions and overflow registers extremely powerful, see
+Aspex ASP.
+
+When integer overflow is stored in an easily-accessible bit (or another
+register), parallelisation turns this into a group of bits which can
+potentially be interacted with in predication, in interesting and powerful
+ways.  For example, by taking the integer-overflow result as a predication
+field and shifting it by one, a predicated vectorised "add one" can emulate
+"carry" on arbitrary (unlimited) length addition.
+
+However despite RVV having made room for floating-point exceptions, neither
+RVV nor base RV have taken integer-overflow (carry) into account, which
+makes proposing it quite challenging given that the relevant (Base) RV
+sections are frozen.  Consequently it makes sense to forgo this feature.
+
+## Context Switch Example <a name="context_switch"></a>
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction).  An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+    /* Macro for saving task context */
+    .macro portSAVE_CONTEXT
+        .global        pxCurrentTCB
+        /* make room in stack */
+        addi   sp, sp, -REGBYTES * 32
+
+        /* Save Context */
+        STORE  x1, 0x0(sp)
+        STORE  x2, 1 * REGBYTES(sp)
+        STORE  x3, 2 * REGBYTES(sp)
+        ...
+        ...
+        STORE  x30, 29 * REGBYTES(sp)
+        STORE  x31, 30 * REGBYTES(sp)
+
+        /* Store current stackpointer in task control block (TCB) */
+        LOAD   t0, pxCurrentTCB        //pointer
+        STORE  sp, 0x0(t0)
+        .endm
+
+    /* Saves current error program counter (EPC) as task program counter */
+    .macro portSAVE_EPC
+        csrr   t0, mepc
+        STORE  t0, 31 * REGBYTES(sp)
+        .endm
+
+    /* Saves current return adress (RA) as task program counter */
+    .macro portSAVE_RA
+        STORE  ra, 31 * REGBYTES(sp)
+        .endm
+
+    /* Macro for restoring task context */
+    .macro portRESTORE_CONTEXT
+
+        .global        pxCurrentTCB
+        /* Load stack pointer from the current TCB */
+        LOAD   sp, pxCurrentTCB
+        LOAD   sp, 0x0(sp)
+
+        /* Load task program counter */
+        LOAD   t0, 31 * REGBYTES(sp)
+        csrw   mepc, t0
+
+        /* Run in machine mode */
+        li             t0, MSTATUS_PRV1
+        csrs   mstatus, t0
+
+        /* Restore registers,
+           Skip global pointer because that does not change */
+        LOAD   x1, 0x0(sp)
+        LOAD   x4, 3 * REGBYTES(sp)
+        LOAD   x5, 4 * REGBYTES(sp)
+        ...
+        ...
+        LOAD   x30, 29 * REGBYTES(sp)
+        LOAD   x31, 30 * REGBYTES(sp)
+
+        addi   sp, sp, REGBYTES * 32
+        mret
+        .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+    /* a few things are assumed here: (a) that when switching to
+       M-Mode an entirely different set of CSRs is used from that
+       which is used in U-Mode and (b) that the M-Mode x1 and x4
+       vectors are also not used anywhere else in M-Mode, consequently
+       only need to be set up just the once.
+     */
+    .macroVectorSetup
+        MVECTORCSRx1 = 31, defaultlen
+        MVECTORCSRx4 = 28, defaultlen
  
-## TODO, sort
+        /* Save Context */
+        SETVL x0, x0, 31 /* x0 ignored silently */
+        STORE  x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+        /* Restore registers,
+           Skip global pointer because that does not change */
+        LOAD   x1, 0x0(sp)
+        SETVL x0, x0, 28 /* x0 ignored silently */
+        LOAD   x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear not
+to be being restored.  If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save / Restore
+Context assembly code, into the macroVectorSetup, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
+
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
+
+> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
+> riscv-isa-manual in order to work out how to re-map RVV onto the standard
+> ISA, and came across an interesting comments at the bottom of pages 75
+> and 76:
+
+> "   A common mechanism used in other ISAs to further reduce save/restore
+> code size is load- multiple and store-multiple instructions. "
+
+> Fascinatingly, due to Simple-V proposing to use the *standard* register
+> file, both C.LOAD / C.STORE *and* LOAD / STORE would in effect be exactly
+> that: load-multiple and store-multiple instructions.  Which brings us
+> on to this comment:
+
+> "For virtual memory systems, some data accesses could be resident in
+> physical memory and
+>   some could not, which requires a new restart mechanism for partially
+> executed instructions."
+
+> Which then of course brings us to the interesting question: how does RVV
+> cope with the scenario when, particularly with LD.X (Indexed / indirect
+> loads), part-way through the loading a page fault occurs?
+
+> Has this been noted or discussed before?
+
+For applications-class platforms, the RVV exception model is
+element-precise (that is, if an exception occurs on element j of a
+vector instruction, elements 0..j-1 have completed execution and elements
+j+1..vl-1 have not executed).
+
+Certain classes of embedded platforms where exceptions are always fatal
+might choose to offer resumable/swappable interrupts but not precise
+exceptions.
+
+
+> Is RVV designed in any way to be re-entrant?
+
+Yes.
+
+
+> What would the implications be for instructions that were in a FIFO at
+> the time, in out-of-order and VLIW implementations, where partial decode
+> had taken place?
+
+The usual bag of tricks for maintaining precise exceptions applies to
+vector machines as well.  Register renaming makes the job easier, and
+it's relatively cheaper for vectors, since the control cost is amortized
+over longer registers.
+
+
+> Would it be reasonable at least to say *bypass* (and freeze) the
+> instruction FIFO (drop down to a single-issue execution model temporarily)
+> for the purposes of executing the instructions in the interrupt (whilst
+> setting up the VM page), then re-continue the instruction with all
+> state intact?
+
+This approach has been done successfully, but it's desirable to be
+able to swap out the vector unit state to support context switches on
+exceptions that result in long-latency I/O.
+
+
+> Or would it be better to switch to an entirely separate secondary
+> hyperthread context?
+
+> Does anyone have any ideas or know if there is any academic literature
+> on solutions to this problem?
+
+The Vector VAX offered imprecise but restartable and swappable exceptions:
+http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf
+
+Sec. 4.6 of Krste's dissertation assesses some of
+the tradeoffs and references a bunch of related work:
+http://people.eecs.berkeley.edu/~krste/thesis.pdf
+
+
+----
+
+Started reading section 4.6 of Krste's thesis, noted the "IEE85 F.P
+exceptions" and thought, "hmmm that could go into a CSR, must re-read
+the section on FP state CSRs in RVV 0.4-Draft again" then i suddenly
+thought, "ah ha!  what if the memory exceptions were, instead of having
+an immediate exception thrown, were simply stored in a type of predication
+bit-field with a flag "error this element failed"?
+
+Then, *after* the vector load (or store, or even operation) was
+performed, you could *then* raise an exception, at which point it
+would be possible (yes in software... I know....) to go "hmmm, these
+indexed operations didn't work, let's get them into memory by triggering
+page-loads", then *re-run the entire instruction* but this time with a
+"memory-predication CSR" that stops the already-performed operations
+(whether they be loads, stores or an arithmetic / FP operation) from
+being carried out a second time.
+
+This theoretically could end up being done multiple times in an SMP
+environment, and also for LD.X there would be the remote outside annoying
+possibility that the indexed memory address could end up being modified.
+
+The advantage would be that the order of execution need not be
+sequential, which potentially could have some big advantages.
+Am still thinking through the implications as any dependent operations
+(particularly ones already decoded and moved into the execution FIFO)
+would still be there (and stalled).  hmmm.
+
+----
+
+    > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
+    > > VSETL r0, 8
+    > > FADD x1, x2, x3
+    >
+    > > x3[0]: ok
+    > > x3[1]: exception
+    > > x3[2]: ok
+    > > ...
+    > > ...
+    > > x3[7]: ok
+    >
+    > > what happens to result elements 2-7?  those may be *big* results
+    > > (RV128)
+    > > or in the RVV-Extended may be arbitrary bit-widths far greater.
+    >
+    >  (you replied:)
+    >
+    > Thrown away.
+
+discussion then led to the question of OoO architectures
+
+> The costs of the imprecise-exception model are greater than the benefit.
+> Software doesn't want to cope with it.  It's hard to debug.  You can't
+> migrate state between different microarchitectures--unless you force all
+> implementations to support the same imprecise-exception model, which would
+> greatly limit implementation flexibility.  (Less important, but still
+> relevant, is that the imprecise model increases the size of the context
+> structure, as the microarchitectural guts have to be spilled to memory.)
+
+## Zero/Non-zero Predication
+
+>> >  it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed.  if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> >  if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> >  whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density time masking. It doesn’t apply to only your
+> model with the “standard register file” is used. it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a “zero bit” per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
  
-> However, there are also several features that go beyond simply attaching VL
-> to a scalar operation and are crucial to being able to vectorize a lot of
-> code.  To name a few:
-> - Conditional execution (i.e., predicated operations)
-> - Inter-lane data movement (e.g. SLIDE, SELECT)
-> - Reductions (e.g., VADD with a scalar destination)
-
- Ok so the Conditional and also the Reductions is one of the reasons
- why as part of SimpleV / variable-SIMD / parallelism (gah gotta think
- of a decent name) i proposed that it be implemented as "if you say r0
- is to be a vector / SIMD that means operations actually take place on
- r0,r1,r2... r(N-1)".
-
- Consequently any parallel operation could be paused (or... more
- specifically: vectors disabled by resetting it back to a default /
- scalar / vector-length=1) yet the results would actually be in the
- *main register file* (integer or float) and so anything that wasn't
- possible to easily do in "simple" parallel terms could be done *out*
- of parallel "mode" instead.
-
- I do appreciate that the above does imply that there is a limit to the
- length that SimpleV (whatever) can be parallelised, namely that you
- run out of registers!  my thought there was, "leave space for the main
- V-Ext proposal to extend it to the length that V currently supports".
- Honestly i had not thought through precisely how that would work.
-
- Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that,
- it reminds me of the discussion with Clifford on bit-manipulation
- (gather-scatter except not Bit Gather Scatter, *data* gather scatter): if
- applied "globally and outside of V and P" SLIDE and SELECT might become
- an extremely powerful way to do fast memory copy and reordering [2[.
-
- However I haven't quite got my head round how that would work: i am
- used to the concept of register "tags" (the modern term is "masks")
- and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
- STORE you would get the exact same thing as SELECT.
-
- SLIDE you could do simply by setting say r0 vector-length to say 16
- (meaning that if referred to in any operation it would be an implicit
- parallel operation on *all* registers r0 through r15), and temporarily
- set say.... r7 vector-length to say... 5.  Do a LOAD on r7 and it would
- implicitly mean "load from memory into r7 through r11".  Then you go
- back and do an operation on r0 and ta-daa, you're actually doing an
- operation on a SLID {SLIDED?) vector.
-
- The advantage of Simple-V (whatever) over V would be that you could
- actually do *operations* in the middle of vectors (not just SLIDEs)
- simply by (as above) setting r0 vector-length to 16 and r7 vector-length
- to 5.  There would be nothing preventing you from doing an ADD on r0
- (which meant do an ADD on r0 through r15) followed *immediately in the
- next instruction with no setup cost* a MUL on r7 (which actually meant
- "do a parallel MUL on r7 through r11").
-
- btw it's worth mentioning that you'd get scalar-vector and vector-scalar
- implicitly by having one of the source register be vector-length 1
- (the default) and one being N > 1.  but without having special opcodes
- to do it.  i *believe* (or more like "logically infer or deduce" as
- i haven't got access to the spec) that that would result in a further
- opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V.
-
- Also, Reduction *might* be possible by specifying that the destination be
- a scalar (vector-length=1) whilst the source be a vector.  However... it
- would be an awful lot of work to go through *every single instruction*
- in *every* Extension, working out which ones could be parallelised (ADD,
- MUL, XOR) and those that definitely could not (DIV, SUB).  Is that worth
- the effort?  maybe.  Would it result in huge complexity? probably.
- Could an implementor just go "I ain't doing *that* as parallel!
- let's make it virtual-parallelism (sequential reduction) instead"?
- absolutely.  So, now that I think it through, Simple-V (whatever)
- covers Reduction as well.  huh, that's a surprise.
-
-
-> - Vector-length speculation (making it possible to vectorize some loops with
-> unknown trip count) - I don't think this part of the proposal is written
-> down yet.
-
- Now that _is_ an interesting concept.  A little scary, i imagine, with
- the possibility of putting a processor into a hard infinite execution
- loop... :)
-
-
-> Also, note the vector ISA consumes relatively little opcode space (all the
-> arithmetic fits in 7/8ths of a major opcode).  This is mainly because data
-> type and size is a function of runtime configuration, rather than of opcode.
-
- yes.  i love that aspect of V, i am a huge fan of polymorphism [1]
- which is why i am keen to advocate that the same runtime principle be
- extended to the rest of the RISC-V ISA [3]
-
- Yikes that's a lot.  I'm going to need to pull this into the wiki to
- make sure it's not lost.
-
-[1] inherent data type conversion: 25 years ago i designed a hypothetical
-hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit
-(escape-extended) opcodes and 2-bit (escape-extended) operands that
-only required a fixed 8-bit instruction length.  that relied heavily
-on polymorphism and runtime size configurations as well.  At the time
-I thought it would have meant one HELL of a lot of CSRs... but then I
-met RISC-V and was cured instantly of that delusion^Wmisapprehension :)
-
-[2] Interestingly if you then also add in the other aspect of Simple-V
-(the data-size, which is effectively functionally orthogonal / identical
-to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE
-operations become byte / half-word / word augmenters of B-Ext's proposed
-"BGS" i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored
-LOAD / STORE would deal with 8 / 16 / 32 bits at a time.  Where it
-would get really REALLY interesting would be masked-packed-vectored
-B-Ext BGS instructions.  I can't even get my head fully round that,
-which is a good sign that the combination would be *really* powerful :)
-
-[3] ok sadly maybe not the polymorphism, it's too complicated and I
-think would be much too hard for implementors to easily "slide in" to an
-existing non-Simple-V implementation.  i say that despite really *really*
-wanting IEEE 704 FP Half-precision to end up somewhere in RISC-V in some
-fashion, for optimising 3D Graphics.  *sigh*.
-
-## TODO: instructions (based on Hwacha) V-Ext duplication analysis
-
-This is partly speculative due to lack of access to an up-to-date
-V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing).  However
-basin an analysis instead on Hwacha, a cursory examination shows over
-an **85%** duplication of V-Ext operand-related instructions when
-compared to Simple-V on a standard RG64G base.   Even Vector Fetch
-is analogous to "zero-overhead loop".
-
-Exceptions are:
-
-* Vector Indexed Memory Instructions (non-contiguous)
-* Vector Atomic Memory Instructions.
-* Some of the Vector Arithmetic ops: MADD, MSUB,
-  VSRL, VSRA, VEIDX, VFIRST, VSGNJN, VFSGNJX and potentially more.
-* Consensual Jump
-
-Table of RV32V Instructions
-
-| RV32V      |     |
-| -----      | --- |
-| VADD       |     |
-| VSUB       |     |
-| VSL        |     |
-| VSR        |     |
-| VAND       |     |
-| VOR        |     |
-| VXOR       |     |
-| VSEQ       |     |
-| VSNE       |     |
-| VSLT       |     |
-| VSGE       |     |
-| VCLIP      |     |
-| VCVT       |     |
-| VMPOP      |     |
-| VMFIRST    |     |
-| VEXTRACT   |     |
-| VINSERT    |     |
-| VMERGE     |     |
-| VSELECT    |     |
-| VSLIDE     |     |
-| VDIV       |     |
-| VREM       |     |
-| VMUL       |     |
-| VMULH      |     |
-| VMIN       |     |
-| VMAX       |     |
-| VSGNJ      |     |
-| VSGNJN     |     |
-| VSGNJX     |     |
-| VSQRT      |     |
-| VCLASS     |     |
-| VPOPC      |     |
-| VADDI      |     |
-| VSLI       |     |
-| VSRI       |     |
-| VANDI      |     |
-| VORI       |     |
-| VXORI      |     |
-| VCLIPI     |     |
-| VMADD      |     |
-| VMSUB      |     |
-| VNMADD     |     |
-| VNMSUB     |     |
-| VLD        |     |
-| VLDS       |     |
-| VLDX       |     |
-| VST        |     |
-| VSTS       |     |
-| VSTX       |     |
-| VAMOSWAP   |     |
-| VAMOADD    |     |
-| VAMOAND    |     |
-| VAMOOR     |     |
-| VAMOXOR    |     |
-| VAMOMIN    |     |
-| VAMOMAX    |     |
-
-## TODO: sort
-
-> I suspect that the "hardware loop" in question is actually a zero-overhead
-> loop unit that diverts execution from address X to address Y if a certain
-> condition is met.
-
- not quite.  The zero-overhead loop unit interestingly would be at
-an [independent] level above vector-length.  The distinctions are
-as follows:
-
-* Vector-length issues *virtual* instructions where the register
-  operands are *specifically* altered (to cover a range of registers),
-  whereas zero-overhead loops *specifically* do *NOT* alter the operands
-  in *ANY* way.
-
-* Vector-length-driven "virtual" instructions are driven by *one*
- and *only* one instruction (whether it be a LOAD, STORE, or pure
- one/two/three-operand opcode) whereas zero-overhead loop units
- specifically apply to *multiple* instructions.
-
-Where vector-length-driven "virtual" instructions might get conceptually
-blurred with zero-overhead loops is LOAD / STORE.  In the case of LOAD /
-STORE, to actually be useful, vector-length-driven LOAD / STORE should
-increment the LOAD / STORE memory address to correspondingly match the
-increment in the register bank.  example:
-
-* set vector-length for r0 to 4
-* issue RV32 LOAD from addr 0x1230 to r0
-
-translates effectively to:
-
-* RV32 LOAD from addr 0x1230 to r0
-* ...
-* ...
-* RV32 LOAD from addr 0x123B to r3
-
-# P-Ext ISA
-
-## 16-bit Arithmetic
-
-| Mnemonic           | 16-bit Instruction        | Simple-V Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| ADD16 rt, ra, rb   | add                       | RV ADD (bitwidth=16) |
-| RADD16 rt, ra, rb  | Signed Halving add        | |
-| URADD16 rt, ra, rb | Unsigned Halving add      | |
-| KADD16 rt, ra, rb  | Signed Saturating add     | |
-| UKADD16 rt, ra, rb | Unsigned Saturating add   | |
-| SUB16 rt, ra, rb   | sub                       | RV SUB (bitwidth=16) |
-| RSUB16 rt, ra, rb  | Signed Halving sub        | |
-| URSUB16 rt, ra, rb | Unsigned Halving sub                | |
-| KSUB16 rt, ra, rb  | Signed Saturating sub               | |
-| UKSUB16 rt, ra, rb | Unsigned Saturating sub             | |
-| CRAS16 rt, ra, rb  | Cross Add & Sub                     | |
-| RCRAS16 rt, ra, rb | Signed Halving Cross Add & Sub      | |
-| URCRAS16 rt, ra, rb| Unsigned Halving Cross Add & Sub    | |
-| KCRAS16 rt, ra, rb | Signed Saturating Cross Add & Sub   | |
-| UKCRAS16 rt, ra, rb| Unsigned Saturating Cross Add & Sub | |
-| CRSA16 rt, ra, rb  | Cross Sub & Add                     | |
-| RCRSA16 rt, ra, rb | Signed Halving Cross Sub & Add      | |
-| URCRSA16 rt, ra, rb| Unsigned Halving Cross Sub & Add    | |
-| KCRSA16 rt, ra, rb | Signed Saturating Cross Sub & Add   | |
-| UKCRSA16 rt, ra, rb| Unsigned Saturating Cross Sub & Add | |
-
-## 8-bit Arithmetic
-
-| Mnemonic           | 16-bit Instruction        | Simple-V Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| ADD8 rt, ra, rb    | add                       | RV ADD (bitwidth=8)|
-| RADD8 rt, ra, rb   | Signed Halving add        | |
-| URADD8 rt, ra, rb  | Unsigned Halving add      | |
-| KADD8 rt, ra, rb   | Signed Saturating add     | |
-| UKADD8 rt, ra, rb  | Unsigned Saturating add   | |
-| SUB8 rt, ra, rb    | sub                       | RV SUB (bitwidth=8)|
-| RSUB8 rt, ra, rb   | Signed Halving sub        | |
-| URSUB8 rt, ra, rb  | Unsigned Halving sub      | |
  
-# Exceptions
+## Implementation detail for scalar-only op detection <a name="scalar_detection"></a>
  
-> What does an ADD of two different-sized vectors do in simple-V?
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
  
-* if the two source operands are not the same, throw an exception.
-* if the destination operand is also a vector, and the source is longer
-  than the destination, throw an exception.
+Note 2: this is just one possible implementation.  Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
  
-# Impementing V on top of Simple-V
+This section *specifically* covers the implementor's freedom to choose
+that they wish to minimise disruption to an existing design by detecting
+"scalar-only operations", bypassing the vectorisation phase (which may
+or may not require an additional pipeline stage)
  
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
-  as well as integer or float file.
-* TODO
+[[scalardetect.png]]
  
-# Implementing P (renamed to DSP) on top of Simple-V
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU  (or instr
+>> FIFO).
  
-* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
-  (caveat: anything not specified drops through to software-emulation / traps)
-* TODO
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists).  the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no".  3R because you need
+> to check src1, src2 and dest simultaneously.  the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+>  is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
+
+## C.MV predicated src, predicated dest
+
+> Can this be usefully defined in such a way that it is
+> equivalent to vector gather-scatter on each source, followed by a
+> non-predicated vector-compare, followed by vector gather-scatter on the
+> result?
+
+## element width conversion: restrict or remove?
+
+summary: don't restrict / remove.  it's fine.
+
+> > it has virtually no cost/overhead as long as you specify
+> > that inputs can only upconvert, and operations are always done at the
+> > largest size, and downconversion only happens at the output.
+>
+> okaaay.  so that's a really good piece of implementation advice.
+> algorithms do require data size conversion, so at some point you need to
+> introduce the feature of upconverting and downconverting.
+>
+> > for int and uint, this is dead simple and fits well within the RVV pipeline
+> > without any critical path, pipeline depth, or area implications.
+
+<https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/g3feFnAoKIM>
+
+## Under review / discussion: remove CSR vector length, use VSETVL <a name="vsetvl"></a>
+
+**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
+length on all regs**.  This section kept for historical reasons.
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+  register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+  is placed in a dest register), and if the "required" length is longer
+  than the *available* length, the dest reg is set to the MIN of those
+  two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+  length and thus there is no way (at the time that VSETVL is called) to
+  know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+  the vector operation to the length specified in the destination
+  register's CSR, at the time that each instruction is issued...
+  except that that cannot possibly be guaranteed to match
+  with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+  VSETVL.  "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+  destreg equal to MIN(counterreg, lenimmed), with register-based
+  variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lengreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call).  This is probably desirable behaviour.
+
+## Implementation Paradigms <a name="implementation_paradigms"></a>
+
+TODO: assess various implementation paradigms.  These are listed roughly
+in order of simplicity (minimum compliance, for ultra-light-weight
+embedded systems or to reduce design complexity and the burden of
+design implementation and compliance, in non-critical areas), right the
+way to high-performance systems.
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+  required, however when a register is used that is detected (in hardware)
+  to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comphrensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised.  On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support.  In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
+# TODO Research
+
+> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
+
+Idea: basic simple butterfly swap on a few element indices, primarily targetted
+at SIMD / DSP.  High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements?  8?  Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone".  Detection of
+ADD r1, rs1, rs0 cases result in nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Destruction of destination indices requires a copy of the entire vector
+in advance to avoid.
+
+TBD: floating-point compare and other exception handling
  
  # References
  
@@ -730,3 +2321,28 @@ translates effectively to:
  * Hwacha <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-262.html>
  * Hwacha <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.html>
  * Vector Workshop <http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>
+* Predication <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/XoP4BfYSLXA>
+* Branch Divergence <https://jbush001.github.io/2014/12/07/branch-divergence-in-parallel-kernels.html>
+* Life of Triangles (3D) <https://jbush001.github.io/2016/02/27/life-of-triangle.html>
+* Videocore-IV <https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-3d-Graphics-Pipeline>
+* Discussion proposing CSRs that change ISA definition
+  <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/InzQ1wr_3Ak>
+* Zero-overhead loops <https://pdfs.semanticscholar.org/dbaa/66985cc730d4b44d79f519e96ec9c43ab5b7.pdf>
+* Multi-ported VLIW Register File Implementation <https://ce-publications.et.tudelft.nl/publications/1517_multiple_contexts_in_a_multiported_vliw_register_file_impl.pdf>
+* Fast context save/restore proposal <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57F823FA.6030701%40gmail.com>
+* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+* Expired Patent on Vector Virtual Memory solutions
+  <https://patentimages.storage.googleapis.com/fc/f6/e2/2cbee92fcd8743/US5895501.pdf>
+* Discussion on RVV "re-entrant" capabilities allowing operations to be
+  restarted if an exception occurs (VM page-table miss)
+  <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
+* Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>
+* RVV slides 2017 <https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
+* Wavefront skipping using BRAMS <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
+* Streaming Pipelines <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2014.pdf>
+* Barcelona SIMD Presentation <https://content.riscv.org/wp-content/uploads/2018/05/09.05.2018-9.15-9.30am-RISCV201805-Andes-proposed-P-extension.pdf>
+* <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
+* Full Description (last page) of RVV instructions
+  <https://inst.eecs.berkeley.edu/~cs152/sp18/handouts/lab4-1.0.pdf>
+* PULP Low-energy Cluster Vector Processor
+  <http://iis-projects.ee.ethz.ch/index.php/Low-Energy_Cluster-Coupled_Vector_Coprocessor_for_Special-Purpose_PULP_Acceleration>