(no commit message)

diff --git a/simple_v_extension/specification.mdwn b/simple_v_extension/specification.mdwn

index 22f58baadc058e3317458baed91322dfe2a98633..2b1160f8c58e6b3042a7b505a916321f26e6320f 100644
--- a/simple_v_extension/specification.mdwn
+++ b/simple_v_extension/specification.mdwn
@@ -1,16 +1,18 @@
+
  # Simple-V (Parallelism Extension Proposal) Specification
  
  * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
-* Status: DRAFTv0.6
-* Last edited: 21 jun 2019
+* Status: DRAFTv0.6.1
+* Last edited: 10 sep 2019
  * Ancillary resource: [[opcodes]]
  * Ancillary resource: [[sv_prefix_proposal]]
  * Ancillary resource: [[abridged_spec]]
  * Ancillary resource: [[vblock_format]]
  * Ancillary resource: [[appendix]]
  
-With thanks to:
+Authors/Contributors:
  
+* Luke Kenneth Casson Leighton
  * Allen Baum
  * Bruce Hoult
  * comp.arch
@@ -66,7 +68,7 @@ To emphasise that clearly: Simple-V (SV) is *not*:
  * A SIMT system
  * A Vectorisation Microarchitecture
  * A microarchitecture of any specific kind
-* A mandary parallel processor microarchitecture of any kind
+* A mandatory parallel processor microarchitecture of any kind
  * A supercomputer extension
  
  SV does **not** tell implementors how or even if they should implement
@@ -110,9 +112,10 @@ on hidden context that augments *scalar* RISCV instructions.
  There are five additional CSRs, available in any privilege level:
  
  * MVL (the Maximum Vector Length)
-* VL (which has different characteristics from standard CSRs)
+* VL (sets which scalar register is to be the Vector Length)
  * SUBVL (effectively a kind of SIMD)
  * STATE (containing copies of MVL, VL and SUBVL as well as context information)
+* SVPSTATE (state information for SVPrefix)
  * PCVBLK (the current operation being executed within a VBLOCK Group)
  
  For User Mode there are the following CSRs:
@@ -121,24 +124,31 @@ For User Mode there are the following CSRs:
    to the start of the current VBLOCK Group, set on a trap).
  * ueSTATE (useful for saving and restoring during context switch,
    and for providing fast transitions)
+* ueSVPSTATE when SVPrefix is implemented
+ Note: ueSVPSTATE is mirrored in the top 32 bits of ueSTATE.
  
-There are also two additional CSRs for Supervisor-Mode:
+There are also three additional CSRs for Supervisor-Mode:
  
  * sePCVBLK
-* seSTATE
+* seSTATE (which contains seSVPSTATE)
+* seSVPSTATE
  
  And likewise for M-Mode:
  
  * mePCVBLK
-* meSTATE
+* meSTATE (which contains meSVPSTATE)
+* meSVPSTATE
  
  The u/m/s CSRs are treated and handled exactly like their (x)epc
-equivalents. On entry to or exit from a privilege level, the contents of its (x)eSTATE are swapped with STATE.
+equivalents. On entry to or exit from a privilege level, the contents
+of its (x)eSTATE are swapped with STATE.
  
  Thus for example, a User Mode trap will end up swapping STATE and ueSTATE
  (on both entry and exit), allowing User Mode traps to have their own
  Vectorisation Context set up, separated from and unaffected by normal
-user applications.  If an M Mode trap occurs in the middle of the U Mode trap, STATE is swapped with meSTATE, and restored on exit: the U Mode trap continues unaware that the M Mode trap even occurred.
+user applications.  If an M Mode trap occurs in the middle of the U Mode
+trap, STATE is swapped with meSTATE, and restored on exit: the U Mode
+trap continues unaware that the M Mode trap even occurred.
  
  Likewise, Supervisor Mode may perform context-switches, safe in the
  knowledge that its Vectorisation State is unaffected by User Mode.
@@ -154,15 +164,13 @@ same pattern for other CSRs that have M-Mode and S-Mode "mirrors":
  * In U-Mode, accessing and changing of the S-Mode and U-Mode CSRs
    is prohibited.
  
-An interesting side effect of SV STATE being
-separate and distinct in S Mode
-is that
-Vectorised saving of an entire register file to the stack is a single
-instruction (through accidental provision of LOAD-MULTI semantics).  If the
-SVPrefix P64-LD-type format is used, LOAD-MULTI may even be done with a
-single standalone 64 bit opcode (P64 may set up SUBVL, VL and MVL from an
-immediate field, to cover the full regfile). It can even be predicated, which opens up some very
-interesting possibilities.
+An interesting side effect of SV STATE being separate and distinct in S
+Mode is that Vectorised saving of an entire register file to the stack
+is a single instruction (through accidental provision of LOAD-MULTI
+semantics).  If the SVPrefix P64-LD-type format is used, LOAD-MULTI may
+even be done with a single standalone 64 bit opcode (P64 may set up SVPSTATE.SUBVL,
+SVPSTATE.VL and SVPSTATE.MVL from an immediate field, to cover the full regfile). It can
+even be predicated, which opens up some very interesting possibilities.
  
  (x)EPCVBLK CSRs must be treated exactly like their corresponding (x)epc
  equivalents. See VBLOCK section for details.
@@ -187,65 +195,11 @@ section, where there are subtle differences between CSRRW and CSRRWI.
  
  ## Vector Length (VL) <a name="vl" />
  
-VSETVL is slightly different from RVV.  Similar to RVV, VL is set to be within
-the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN)
-
-    VL = rd = MIN(vlen, MVL)
-
-where 1 <= MVL <= XLEN
-
-However just like MVL it is important to note that the range for VL has
-subtle design implications, covered in the "CSR pseudocode" section
-
-The fixed (specific) setting of VL allows vector LOAD/STORE to be used
-to switch the entire bank of registers using a single instruction (see
-Appendix, "Context Switch Example").  The reason for limiting VL to XLEN
-is down to the fact that predication bits fit into a single register of
-length XLEN bits.
-
-The second and most important change is that, within the limits set by
-MVL, the value passed in **must** be set in VL (and in the
-destination register).
-
-This has implication for the microarchitecture, as VL is required to be
-set (limits from MVL notwithstanding) to the actual value
-requested.  RVV has the option to set VL to an arbitrary value that suits
-the conditions and the micro-architecture: SV does *not* permit this.
-
-The reason is so that if SV is to be used for a context-switch or as a
-substitute for LOAD/STORE-Multiple, the operation can be done with only
-2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
-single LD/ST operation).  If VL does *not* get set to the register file
-length when VSETVL is called, then a software-loop would be needed.
-To avoid this need, VL *must* be set to exactly what is requested
-(limits notwithstanding).
-
-Therefore, in turn, unlike RVV, implementors *must* provide
-pseudo-parallelism (using sequential loops in hardware) if actual
-hardware-parallelism in the ALUs is not deployed.  A hybrid is also
-permitted (as used in Broadcom's VideoCore-IV) however this must be
-*entirely* transparent to the ISA.
-
-The third change is that VSETVL is implemented as a CSR, where the
-behaviour of CSRRW (and CSRRWI) must be changed to specifically store
-the *new* value in the destination register, **not** the old value.
-Where context-load/save is to be implemented in the usual fashion
-by using a single CSRRW instruction to obtain the old value, the
-*secondary* CSR must be used (STATE).  This CSR by contrast behaves
-exactly as standard CSRs, and contains more than just VL.
-
-One interesting side-effect of using CSRRWI to set VL is that this
-may be done with a single instruction, useful particularly for a
-context-load/save.  There are however limitations: CSRWI's immediate
-is limited to 0-31 (representing VL=1-32).
-
-Note that when VL is set to 1, vector operations cease (but not subvector
-operations: that requires setting SUBVL=1) the hardware loop is reduced
-to a single element: scalar operations.  This is in effect the default,
-normal operating mode. However it is important to appreciate that this
-does **not** result in the Register table or SUBVL being disabled. Only
-when the Register table is empty (P48/64 prefix fields notwithstanding)
-would SV have no effect.
+VL is very different from RVV's VL.  It contains the scalar register *number* that is to be treated as the Vector Length. It is a sub-field of STATE. When set to zero (x0) VL (vectorisation) is disabled.
+
+Implementations realistically should keep a cached copy of the register pointed to by VL in the instruction issue and decode phases. Out of Order Engines must then, if it is not x0, add this register to Vectorised instruction Dependency Checking as an additional read/write hazard as appropriate.
+
+Setting VL via this CSR is very unusual. It should not normally be needed except when [[specification/sv.setvl]] is not implemented.  Note that unlike in sv.setvl, setting VL does not change the contents of the scalar register that it points to, although if the scalar register's contents are not within the range of MVL at the time that VL is set, an illegal instruction exception must be raised.
  
  ## SUBVL - Sub Vector Length
  
@@ -256,7 +210,7 @@ operation issued, SUBVL operations are issued.
  
  Another way to view SUBVL is that each element in the VL length vector is
  now SUBVL times elwidth bits in length and now comprises SUBVL discrete
-sub operations.  An inner SUBVL for-loop within a VL for-loop in effect,
+sub operations.  In effect this is an inner SUBVL hardware for-loop within a VL hardware for-loop,
  with the sub-element increased every time in the innermost loop. This
  is best illustrated in the (simplified) pseudocode example, in the
  [[appendix]].
@@ -279,6 +233,8 @@ See SUBVL Pseudocode illustration in the [[appendix]], for details.
  
  ## STATE
  
+out of date, see <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001896.html>
+
  This is a standard CSR that contains sufficient information for a
  full context save/restore.  It contains (and permits setting of):
  
@@ -290,8 +246,6 @@ full context save/restore.  It contains (and permits setting of):
  * SUBVL
  * svdestoffs - the subvector destination element offset of the current
    parallel instruction being executed
-* svsrcoffs - for twin-predication, the subvector source element offset
-  as well.
  
  Interestingly STATE may hypothetically also be modified to make the
  immediately-following instruction to skip a certain number of elements,
@@ -308,9 +262,11 @@ and seSTATE).
  
  The format of the STATE CSR is as follows:
  
-| (29..28 | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
-| ------- | -------- | -------- | -------- | -------- | ------- | ------- |
-| dsvoffs | ssvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
+| (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
+| -------- | -------- | -------- | -------- | -------- | ------- | ------- |
+| rsvd     | dsvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
+
+Legal values of vl are between 0 and 31.
  
  The relationship between SUBVL and the subvl field is:
  
@@ -324,18 +280,28 @@ The relationship between SUBVL and the subvl field is:
  When setting this CSR, the following characteristics will be enforced:
  
  * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
-* **VL** will be truncated (after offset) to be within the range 1 to MAXVL
+* **VL** must be set to a scalar register number between 0 and 31.
  * **SUBVL** which sets a SIMD-like quantity, has only 4 values so there
    are no changes needed
  * **srcoffs** will be truncated to be within the range 0 to VL-1
  * **destoffs** will be truncated to be within the range 0 to VL-1
-* **ssvoffs** will be truncated to be within the range 0 to SUBVL-1
  * **dsvoffs** will be truncated to be within the range 0 to SUBVL-1
  
  NOTE: if the following instruction is not a twin predicated instruction,
  and destoffs or dsvoffs has been set to non-zero, subsequent execution
  behaviour is undefined. **USE WITH CARE**.
  
+NOTE: sub-vector looping does not require a twin-predicate corresponding
+index, because sub-vectors use the *main* (VL) loop predicate bit.
+
+When SVPrefix is implemented, it can have its own VL, MVL and SUBVL, as well as element offsets. SVPSTATE.VL acts slightly differently in that it is no longer a pointer to a scalar register but is an actual value just like RVV's VL.
+
+The format of SVPSTATE, which fits into *both* the top bits of STATE and also into a separate CSR, is as follows:
+
+| (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
+| -------- | -------- | -------- | -------- | -------- | ------- | ------- |
+| rsvd     | dsvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
+
  ### Hardware rules for when to increment STATE offsets
  
  The offsets inside STATE are like the indices in a loop, except
@@ -369,19 +335,16 @@ The pseudo-code for get and set of VL and MVL use the following internal
  functions as follows:
  
      set_mvl_csr(value, rd):
-        regs[rd] = STATE.MVL
          STATE.MVL = MIN(value, STATE.MVL)
  
      get_mvl_csr(rd):
          regs[rd] = STATE.VL
  
      set_vl_csr(value, rd):
-        STATE.VL = MIN(value, STATE.MVL)
-        regs[rd] = STATE.VL # yes returning the new value NOT the old CSR
+        STATE.VL = rd
          return STATE.VL
  
      get_vl_csr(rd):
-        regs[rd] = STATE.VL
          return STATE.VL
  
  Note that where setting MVL behaves as a normal CSR (returns the old
@@ -505,36 +468,14 @@ anywhere to the *full* 128 register range. Thus, RVC becomes far more
  powerful and has many more opportunities to reduce code size that in
  Standard RV32/RV64 executables.
  
-16 bit format:
-
-| RegCAM | | 15       | (14..8)  | 7   | (6..5) | (4..0)  |
-| ------ | | -        | -        | -   | ------ | ------- |
-| 0      | | isvec0   | regidx0  | i/f | vew0   | regkey  |
-| 1      | | isvec1   | regidx1  | i/f | vew1   | regkey  |
-| ..     | | isvec..  | regidx.. | i/f | vew..  | regkey  |
-| 15     | | isvec15  | regidx15 | i/f | vew15  | regkey  |
-
-8 bit format:
-
-| RegCAM | | 7   | (6..5) | (4..0)  |
-| ------ | | -   | ------ | ------- |
-| 0      | | i/f | vew0   | regnum  |
-
-Showing the mapping (relationship) between 8-bit and 16-bit format:
-
-| RegCAM | 15      | (14..8)    | 7   | (6..5) | (4..0)  |
-| ------ | -       | -          | -   | ------ | ------- |
-| 0      | isvec=1 | regnum0<<2 | i/f | vew0   | regnum0 |
-| 1      | isvec=1 | regnum1<<2 | i/f | vew1   | regnum1 |
-| 2      | isvec=1 | regnum2<<2 | i/f | vew2   | regnum2 |
-| 3      | isvec=1 | regnum2<<2 | i/f | vew3   | regnum3 |
+[[!inline raw="yes" pages="simple_v_extension/reg_table_format" ]]
  
  i/f is set to "1" to indicate that the redirection/tag entry is to
  be applied to integer registers; 0 indicates that it is relevant to
  floating-point registers.
  
  The 8 bit format is used for a much more compact expression. "isvec"
-is implicit and, similar to [[sv-prefix-proposal]], the target vector
+is implicit and, similar to [[sv_prefix_proposal]], the target vector
  is "regnum<<2", implicitly. Contrast this with the 16-bit format where
  the target vector is *explicitly* named in bits 8 to 14, and bit 15 may
  optionally set "scalar" mode.
@@ -557,14 +498,7 @@ operand size is "over-ridden" in a polymorphic fashion:
  As the above table is a CAM (key-value store) it may be appropriate
  (faster, implementation-wise) to expand it as follows:
  
-    struct vectorised fp_vec[32], int_vec[32];
-
-    for (i = 0; i < len; i++) // from VBLOCK Format
-       tb = int_vec if CSRvec[i].type == 0 else fp_vec
-       idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
-       tb[idx].elwidth  = CSRvec[i].elwidth
-       tb[idx].regidx   = CSRvec[i].regidx  // indirection
-       tb[idx].isvector = CSRvec[i].isvector // 0=scalar
+[[!inline raw="yes" pages="simple_v_extension/reg_table" ]]
  
  ## Predication Table <a name="predication_csr_table"></a>
  
@@ -604,44 +538,22 @@ in the instruction, due to the redirection through the lookup table.
    The handling of each (trap or conditional test) is slightly different:
    see Instruction sections for further details
  
-16 bit format:
-
-| PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
-| ----- | -        | -      | -     | -   | ------- | ------- |
-| 0     | predidx  | zero0  | inv0  | i/f | regidx  | ffirst0 |
-| 1     | predidx  | zero1  | inv1  | i/f | regidx  | ffirst1 |
-| 2     | predidx  | zero2  | inv2  | i/f | regidx  | ffirst2 |
-| 3     | predidx  | zero3  | inv3  | i/f | regidx  | ffirst3 |
-
-Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding.  Its use must
-generate an illegal instruction trap.
-
-8 bit format:
-
-| PrCSR | 7     | 6     | 5   | (4..0)  |
-| ----- | -     | -     | -   | ------- |
-| 0     | zero0 | inv0  | i/f | regnum  |
+[[!inline raw="yes" pages="simple_v_extension/pred_table_format" ]]
  
  The 8 bit format is a compact and less expressive variant of the full
-16 bit format.  Using the 8 bit formatis very different: the predicate
+16 bit format.  Using the 8 bit format is very different: the predicate
  register to use is implicit, and numbering begins inplicitly from x9. The
  regnum is still used to "activate" predication, in the same fashion as
  described above.
  
-Thus if we map from 8 to 16 bit format, the table becomes:
-
-| PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
-| ----- | -        | -      | -     | -   | ------- | ------- |
-| 0     | x9       | zero0  | inv0  | i/f | regnum  | ff=0    |
-| 1     | x10      | zero1  | inv1  | i/f | regnum  | ff=0    |
-| 2     | x11      | zero2  | inv2  | i/f | regnum  | ff=0    |
-| 3     | x12      | zero3  | inv3  | i/f | regnum  | ff=0    |
-
  The 16 bit Predication CSR Table is a key-value store, so
  implementation-wise it will be faster to turn the table around (maintain
-topologically equivalent state):
+topologically equivalent state).  Opportunities then exist to access
+registers in unary form instead of binary, saving gates and power by
+only activating "redirection" with a single AND gate, instead of
+multiple multi-bit XORs (a CAM):
  
-[[!inline raw="yes" pages="pred_table"]]
+[[!inline raw="yes" pages="simple_v_extension/pred_table" ]]
  
  So when an operation is to be predicated, it is the internal state that
  is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
@@ -690,18 +602,7 @@ Note:
  If written as a function, obtaining the predication mask (and whether
  zeroing takes place) may be done as follows:
  
-    def get_pred_val(bool is_fp_op, int reg):
-       tb = int_reg if is_fp_op else fp_reg
-       if (!tb[reg].enabled):
-          return ~0x0, False       // all enabled; no zeroing
-       tb = int_pred if is_fp_op else fp_pred
-       if (!tb[reg].enabled):
-          return ~0x0, False       // all enabled; no zeroing
-       predidx = tb[reg].predidx   // redirection occurs HERE
-       predicate = intreg[predidx] // actual predicate HERE
-       if (tb[reg].inv):
-          predicate = ~predicate   // invert ALL bits
-       return predicate, tb[reg].zero
+[[!inline raw="yes" pages="simple_v_extension/get_pred_value" ]]
  
  Note here, critically, that **only** if the register is marked
  in its **register** table entry as being "active" does the testing
@@ -722,11 +623,17 @@ The other variant is comparisons such as FEQ (or the augmented behaviour
  of Branch), and any operation that returns a result of zero (whether
  integer or floating-point).  In the FP case, this includes negative-zero.
  
-Note that the execution order must "appear" to be sequential for ffirst
-mode to work correctly.  An in-order architecture must execute the element
+ffirst interacts with zeroing and non-zeroing predication.  In non-zeroing
+mode, masked-out operations are simply excluded from testing (can never
+fail).  However for fail-comparisons (not faults) in zeroing mode, the
+result will be zero: this *always* "fails", thus on the very first
+masked-out element ffirst will always terminate.
+
+Note that ffirst mode works because the execution order must "appear" to be
+sequential (in "program order").  An in-order architecture must execute the element
  operations in sequence, whilst an out-of-order architecture must *commit*
-the element operations in sequence (giving the appearance of in-order
-execution).
+the element operations in sequence and cancel speculatively-executed
+ones (giving the appearance of in-order execution).
  
  Note also, that if ffirst mode is needed without predication, a special
  "always-on" Predicate Table Entry may be constructed by setting
@@ -737,162 +644,9 @@ to be set.
  See [[appendix]] for more details on fail-on-first modes, as well as
  pseudo-code, below.
  
-## REMAP CSR <a name="remap" />
-
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
-There is one 32-bit CSR which may be used to indicate which registers,
-if used in any operation, must be "reshaped" (re-mapped) from a linear
-form to a 2D or 3D transposed form, or "offset" to permit arbitrary
-access to elements within a register.
-
-The 32-bit REMAP CSR may reshape up to 3 registers:
-
-| 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
-| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
-| shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |
-
-regidx0-2 refer not to the Register CSR CAM entry but to the underlying
-*real* register (see regidx, the value) and consequently is 7-bits wide.
-When set to zero (referring to x0), clearly reshaping x0 is pointless,
-so is used to indicate "disabled".
-shape0-2 refers to one of three SHAPE CSRs.  A value of 0x3 is reserved.
-Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
-
-It is anticipated that these specialist CSRs not be very often used.
-Unlike the CSR Register and Predication tables, the REMAP CSRs use
-the full 7-bit regidx so that they can be set once and left alone,
-whilst the CSR Register entries pointing to them are disabled, instead.
-
-## SHAPE 1D/2D/3D vector-matrix remapping CSRs
-
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
-There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
-which have the same format.  When each SHAPE CSR is set entirely to zeros,
-remapping is disabled: the register's elements are a linear (1D) vector.
-
-| 26..24  | 23      | 22..16  | 15      | 14..8   | 7       | 6..0    |
-| ------- | --      | ------- | --      | ------- | --      | ------- |
-| permute | offs[2] | zdimsz  | offs[1] | ydimsz  | offs[0] | xdimsz  |
-
-offs is a 3-bit field, spread out across bits 7, 15 and 23, which
-is added to the element index during the loop calculation.
-
-xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
-that the array dimensionality for that dimension is 1.  A value of xdimsz=2
-would indicate that in the first dimension there are 3 elements in the
-array.  The format of the array is therefore as follows:
-
-    array[xdim+1][ydim+1][zdim+1]
-
-However whilst illustrative of the dimensionality, that does not take the
-"permute" setting into account.  "permute" may be any one of six values
-(0-5, with values of 6 and 7 being reserved, and not legal).  The table
-below shows how the permutation dimensionality order works:
-
-| permute | order | array format             |
-| ------- | ----- | ------------------------ |
-| 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
-| 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
-| 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
-| 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
-| 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
-| 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
-
-In other words, the "permute" option changes the order in which
-nested for-loops over the array would be done.  The algorithm below
-shows this more clearly, and may be executed as a python program:
-
-    # mapidx = REMAP.shape2
-    xdim = 3 # SHAPE[mapidx].xdim_sz+1
-    ydim = 4 # SHAPE[mapidx].ydim_sz+1
-    zdim = 5 # SHAPE[mapidx].zdim_sz+1
-
-    lims = [xdim, ydim, zdim]
-    idxs = [0,0,0] # starting indices
-    order = [1,0,2] # experiment with different permutations, here
-    offs = 0        # experiment with different offsets, here
-
-    for idx in range(xdim * ydim * zdim):
-        new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
-        print new_idx,
-        for i in range(3):
-            idxs[order[i]] = idxs[order[i]] + 1
-            if (idxs[order[i]] != lims[order[i]]):
-                break
-            print
-            idxs[order[i]] = 0
-
-Here, it is assumed that this algorithm be run within all pseudo-code
-throughout this document where a (parallelism) for-loop would normally
-run from 0 to VL-1 to refer to contiguous register
-elements; instead, where REMAP indicates to do so, the element index
-is run through the above algorithm to work out the **actual** element
-index, instead.  Given that there are three possible SHAPE entries, up to
-three separate registers in any given operation may be simultaneously
-remapped:
-
-    function op_add(rd, rs1, rs2) # add not VADD!
-      ...
-      ...
-      for (i = 0; i < VL; i++)
-        xSTATE.srcoffs = i # save context
-        if (predval & 1<<i) # predication uses intregs
-           ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
-                                 ireg[rs2+remap(irs2)];
-           if (!int_vec[rd ].isvector) break;
-        if (int_vec[rd ].isvector)  { id += 1; }
-        if (int_vec[rs1].isvector)  { irs1 += 1; }
-        if (int_vec[rs2].isvector)  { irs2 += 1; }
-
-By changing remappings, 2D matrices may be transposed "in-place" for one
-operation, followed by setting a different permutation order without
-having to move the values in the registers to or from memory.  Also,
-the reason for having REMAP separate from the three SHAPE CSRs is so
-that in a chain of matrix multiplications and additions, for example,
-the SHAPE CSRs need only be set up once; only the REMAP CSR need be
-changed to target different registers.
-
-Note that:
-
-* Over-running the register file clearly has to be detected and
-  an illegal instruction exception thrown
-* When non-default elwidths are set, the exact same algorithm still
-  applies (i.e. it offsets elements *within* registers rather than
-  entire registers).
-* If permute option 000 is utilised, the actual order of the
-  reindexing does not change!
-* If two or more dimensions are set to zero, the actual order does not change!
-* The above algorithm is pseudo-code **only**.  Actual implementations
-  will need to take into account the fact that the element for-looping
-  must be **re-entrant**, due to the possibility of exceptions occurring.
-  See MSTATE CSR, which records the current element index.
-* Twin-predicated operations require **two** separate and distinct
-  element offsets.  The above pseudo-code algorithm will be applied
-  separately and independently to each, should each of the two
-  operands be remapped.  *This even includes C.LDSP* and other operations
-  in that category, where in that case it will be the **offset** that is
-  remapped (see Compressed Stack LOAD/STORE section).
-* Offset is especially useful, on its own, for accessing elements
-  within the middle of a register.  Without offsets, it is necessary
-  to either use a predicated MV, skipping the first elements, or
-  performing a LOAD/STORE cycle to memory.
-  With offsets, the data does not have to be moved.
-* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
-  less than MVL is **perfectly legal**, albeit very obscure.  It permits
-  entries to be regularly presented to operands **more than once**, thus
-  allowing the same underlying registers to act as an accumulator of
-  multiple vector or matrix operations, for example.
-
-Clearly here some considerable care needs to be taken as the remapping
-could hypothetically create arithmetic operations that target the
-exact same underlying registers, resulting in data corruption due to
-pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
-register-renaming will have an easier time dealing with this than
-DSP-style SIMD micro-architectures.
+## REMAP and SHAPE CSRs <a name="remap" />
+
+See optional [[remap]] section.
  
  # Instruction Execution Order
  
@@ -921,26 +675,41 @@ to the **one** instruction.
  
  # Instructions <a name="instructions" />
  
-See [[appendix]]
+See [[appendix]] for specific cases where instruction behaviour is
+augmented.  A greatly simplified example is below.  Note that this
+is the ADD implementation, not a separate VADD instruction:
+
+[[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
+
+Note that several things have been left out of this example.
+See [[appendix]] for additional examples that show how to add
+support for additional features (twin predication, elwidth,
+zeroing, SUBVL etc.)
+
+Branches in particular have been transparently augmented to include
+"collation" of comparison results into a tagged register.
  
  # Exceptions
  
-TODO: expand.  Exceptions may occur at any time, in any given underlying
-scalar operation.  This implies that context-switching (traps) may
-occur, and operation must be returned to where it left off.  That in
-turn implies that the full state - including the current parallel
-element being processed - has to be saved and restored.  This is
-what the **STATE** CSR is for.
+Exceptions may occur at any time, in any given underlying scalar
+operation.  This implies that context-switching (traps) may occur, and
+operation must be returned to where it left off.  That in turn implies
+that the full state - including the current parallel element being
+processed - has to be saved and restored.  This is what the **STATE**
+and **PCVBLK** CSRs are for.
  
  The implications are that all underlying individual scalar operations
  "issued" by the parallelisation have to appear to be executed sequentially.
  The further implications are that if two or more individual element
  operations are underway, and one with an earlier index causes an exception,
-it may be necessary for the microarchitecture to **discard** or terminate
-operations with higher indices.
+it will be necessary for the microarchitecture to **discard** or terminate
+operations with higher indices.  Optimised microarchitectures could
+hypothetically store (cache) results, for subsequent replay if appropriate.
  
-This being somewhat dissatisfactory, an "opaque predication" variant
-of the STATE CSR is being considered.
+In short: exception handling **MUST** be precise, in-order, and exactly
+like Standard RISC-V as far as the instruction execution order is
+concerned, regardless of whether it is PC, PCVBLK, VL or SUBVL that
+is currently being incremented.
  
  # Hints
  
@@ -959,64 +728,13 @@ No specific hints are yet defined in Simple-V
  
  # Vector Block Format <a name="vliw-format"></a>
  
-See ancillary resource: [[vblock_format]]
-
-# Under consideration <a name="issues"></a>
-
-for element-grouping, if there is unused space within a register
-(3 16-bit elements in a 64-bit register for example), recommend:
-
-* For the unused elements in an integer register, the used element
-  closest to the MSB is sign-extended on write and the unused elements
-  are ignored on read.
-* The unused elements in a floating-point register are treated as-if
-  they are set to all ones on write and are ignored on read, matching the
-  existing standard for storing smaller FP values in larger registers.
-
----
-
-info register,
-
-> One solution is to just not support LR/SC wider than a fixed
-> implementation-dependent size, which must be at least 
->1 XLEN word, which can be read from a read-only CSR
-> that can also be used for info like the kind and width of 
-> hw parallelism supported (128-bit SIMD, minimal virtual 
-> parallelism, etc.) and other things (like maybe the number 
-> of registers supported). 
+The VBLOCK Format allows Register, Predication and Vector Length to be contextually associated with a group of RISC-V scalar opcodes.  The format is as follows:
  
-> That CSR would have to have a flag to make a read trap so
-> a hypervisor can simulate different values.
+[[!inline raw="yes" pages="simple_v_extension/vblock_format_table" ]]
  
-----
+For more details, including the CSRs, see ancillary resource: [[vblock_format]]
  
-> And what about instructions like JALR? 
-
-answer: they're not vectorised, so not a problem
-
-----
-
-* if opcode is in the RV32 group, rd, rs1 and rs2 bitwidth are
-  XLEN if elwidth==default
-* if opcode is in the RV32I group, rd, rs1 and rs2 bitwidth are
-  *32* if elwidth == default
-
----
-
-TODO: document different lengths for INT / FP regfiles, and provide
-as part of info register. 00=32, 01=64, 10=128, 11=reserved.
-
----
-
-TODO, update to remove RegCam and PredCam CSRs, just use SVprefix and
-VBLOCK format
-
----
-
-Could the 8 bit Register VBLOCK format use regnum<<1 instead, only accessing regs 0 to 64?
-
---
+# Under consideration <a name="issues"></a>
  
-Expand the range of SUBVL and its associated svsrcoffs and svdestoffs by
-adding a 2nd STATE CSR (or extending STATE to 64 bits).  Future version?
+See [[discussion]]
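
Reviewer sketch (not part of the patch): the revised STATE layout in the hunk above can be checked with a few lines of Python. This is only an illustration under stated assumptions. The field positions are copied from the post-change table, and the 0 to 31 limit on `vl` comes from the added "Legal values" line. The helper names `STATE_FIELDS`, `pack_state` and `unpack_state` are hypothetical and do not appear in the specification. The values handled are raw field contents: the "after offset" encoding of MAXVL and the SUBVL-to-subvl mapping are deliberately not modelled.

```python
# Field name -> (most significant bit, least significant bit), copied from
# the post-change STATE table in the diff. Helper names are hypothetical.
STATE_FIELDS = {
    "rsvd":     (31, 28),
    "dsvoffs":  (27, 26),
    "subvl":    (25, 24),
    "destoffs": (23, 18),
    "srcoffs":  (17, 12),
    "vl":       (11, 6),
    "maxvl":    (5, 0),
}


def unpack_state(value):
    """Split a 32-bit STATE value into its raw (un-offset) fields."""
    fields = {}
    for name, (msb, lsb) in STATE_FIELDS.items():
        width = msb - lsb + 1
        fields[name] = (value >> lsb) & ((1 << width) - 1)
    return fields


def pack_state(**fields):
    """Build a 32-bit STATE value from raw field values (default 0)."""
    value = 0
    for name, raw in fields.items():
        msb, lsb = STATE_FIELDS[name]
        width = msb - lsb + 1
        if not 0 <= raw < (1 << width):
            raise ValueError(f"{name}={raw} does not fit in {width} bits")
        # The diff states that legal values of vl are 0..31 (a scalar
        # register number), even though the field itself is 6 bits wide.
        if name == "vl" and raw > 31:
            raise ValueError("vl must name a scalar register, 0..31")
        value |= raw << lsb
    return value
```

For example, `unpack_state(pack_state(vl=5, maxvl=7))` gives back `vl == 5` and `maxvl == 7` with every other field zero, which is a quick way to confirm that the seven fields tile bits 31..0 without overlap.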