add abridged spec, split out vblock format to own file

author Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Tue, 25 Jun 2019 10:16:36 +0000 (11:16 +0100)

committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Tue, 25 Jun 2019 10:16:36 +0000 (11:16 +0100)
diff --git a/simple_v_extension/abridged_spec.mdwn b/simple_v_extension/abridged_spec.mdwn

new file mode 100644 (file)

index 0000000..d4c008f
--- /dev/null
+++ b/simple_v_extension/abridged_spec.mdwn
@@ -0,0 +1,1104 @@
+# Simple-V (Parallelism Extension Proposal) Specification (Abridged)
+
+* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
+* Status: DRAFTv0.6
+* Last edited: 25 jun 2019
+
+[[!toc ]]
+
+# Introduction
+
+Simple-V is a uniform parallelism API for RISC-V hardware that allows
+the Program Counter to enter "sub-contexts" in which, ultimately, standard
+RISC-V scalar opcodes are executed.
+
+The sub-context execution is "nested" in "re-entrant" form, in the
+following order:
+
+* Main standard RISC-V Program Counter (PC)
+* VBLOCK sub-execution context (PCVBLK increments whilst PC is paused)
+* VL element loops (STATE srcoffs and destoffs increment, PC and PCVBLK pause)
+* SUBVL element loops (STATE svsrcoffs/svdestoffs increment, VL pauses)
+
+Note: **there are *no* new opcodes**. The scheme works *entirely*
+on hidden context that augments *scalar* RISC-V instructions.
+
+# CSRs <a name="csrs"></a>
+
+There are five CSRs, available in any privilege level:
+
+* MVL (the Maximum Vector Length)
+* VL (which has different characteristics from standard CSRs)
+* SUBVL (effectively a kind of SIMD)
+* STATE (containing copies of MVL, VL and SUBVL as well as context information)
+* PCVBLK (the current operation being executed within a VBLOCK Group)
+
+For Privilege Levels (trap handling) there are the following CSRs,
+where x may be u, m, s or h for User, Machine, Supervisor or Hypervisor
+Modes respectively:
+
+* (x)ePCVBLK (a copy of the sub-execution Program Counter, that is relative
+  to the start of the current VBLOCK Group, set on a trap).
+* (x)eSTATE (useful for saving and restoring during context switch,
+  and for providing fast transitions)
+
+The u/m/s CSRs are treated and handled exactly like their (x)epc
+equivalents.  On entry to or exit from a privilege level, the contents
+of its (x)eSTATE are swapped with STATE.
+
+(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
+equivalents. See VBLOCK section for details.
+
+## MAXVECTORLENGTH (MVL) <a name="mvl" />
+
+MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
+is variable length and may be dynamically set.  MVL is
+however limited to the regfile bitwidth XLEN (1-32 for RV32,
+1-64 for RV64 and so on).
+
+## Vector Length (VL) <a name="vl" />
+
+VSETVL is slightly different from RVV.  As in RVV, VL is set to be within
+the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN):
+
+    VL = rd = MIN(vlen, MVL)
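The clamping rule for VL can be sketched in a few lines of Python. This is only an illustration: `set_vl` and its parameters are hypothetical names, and the floor of 1 is an assumption read from the stated range 1 <= VL <= MVL, not a rule the text spells out for a request of zero.

```python
def set_vl(requested: int, mvl: int, xlen: int = 64) -> int:
    """Return the VL that results from requesting `requested` elements."""
    if not 1 <= mvl <= xlen:
        raise ValueError("MVL must satisfy 1 <= MVL <= XLEN")
    # VL = MIN(vlen, MVL); the lower bound of 1 is an assumption (see lead-in)
    return max(1, min(requested, mvl))
```

For example, `set_vl(100, 8)` gives 8 and `set_vl(3, 8)` gives 3.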
+
+## SUBVL - Sub Vector Length
+
+This is a "group by quantity" that effectively asks each iteration
+of the hardware loop to load SUBVL elements of width elwidth at a
+time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
+operation issued, SUBVL operations are issued.
+
+The main effect of SUBVL is that predication bits are applied per
+**group**, rather than by individual element.
+
+## STATE
+
+This is a standard CSR that contains sufficient information for a
+full context save/restore.  It contains (and permits setting of):
+
+* MVL
+* VL
+* destoffs - the destination element offset of the current parallel
+  instruction being executed
+* srcoffs - for twin-predication, the source element offset as well.
+* SUBVL
+* svdestoffs - the subvector destination element offset of the current
+  parallel instruction being executed
+* svsrcoffs - for twin-predication, the subvector source element offset
+  as well.
+
+The format of the STATE CSR is as follows:
+
+| (29..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) |
+| ------- | -------- | -------- | -------- | -------- | ------- | ------- |
+| dsvoffs | ssvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
+
+Notes:
+
+* The entries are truncated to be within range.  Attempts to set VL to
+  greater than MAXVL will truncate VL.
+* Both VL and MAXVL are stored offset by one.  0b000000 represents VL=1,
+  0b000001 represents VL=2.  This allows the full range 1 to XLEN instead
+  of 0 to only 63.
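As a rough illustration of the STATE layout, the following Python sketch packs and unpacks the fields using the bit positions from the table. The helper names (`pack_state`, `unpack_state`, `FIELDS`) are hypothetical. Only VL and MAXVL are stored offset by one, as the notes state; storing SUBVL and the offsets unmodified is an assumption. The sketch also rejects out-of-range values rather than truncating them.

```python
# (name, lowest bit, width), taken from the STATE table
FIELDS = [
    ("maxvl", 0, 6), ("vl", 6, 6), ("srcoffs", 12, 6), ("destoffs", 18, 6),
    ("subvl", 24, 2), ("ssvoffs", 26, 2), ("dsvoffs", 28, 2),
]
OFFSET_BY_ONE = {"maxvl", "vl"}  # stored as (value - 1), per the notes


def pack_state(**values):
    """Build a STATE word; omitted fields default to VL=MAXVL=1, others 0."""
    word = 0
    for name, lo, width in FIELDS:
        raw = values.get(name, 1 if name in OFFSET_BY_ONE else 0)
        if name in OFFSET_BY_ONE:
            raw -= 1
        if not 0 <= raw < (1 << width):
            raise ValueError(f"{name} out of range")
        word |= raw << lo
    return word


def unpack_state(word):
    """Split a STATE word back into a dict of field values."""
    out = {}
    for name, lo, width in FIELDS:
        raw = (word >> lo) & ((1 << width) - 1)
        out[name] = raw + 1 if name in OFFSET_BY_ONE else raw
    return out
```

With this encoding, `pack_state(maxvl=64, vl=2)` stores 0b111111 in bits 5..0 and 0b000001 in bits 11..6.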
+
+## VL, MVL and SUBVL instruction aliases
+
+This table contains pseudo-assembly instruction aliases. Note the
+subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
+reduced range of the 5 bit immediate.
+
+| alias           | CSR                  |
+| -               | -                    |
+| SETVL rd, rs    | CSRRW  VL, rd, rs    |
+| SETVLi rd, #n   | CSRRWI VL, rd, #n-1  |
+| GETVL rd        | CSRRW  VL, rd, x0    |
+| SETMVL rd, rs   | CSRRW  MVL, rd, rs   |
+| SETMVLi rd, #n  | CSRRWI MVL, rd, #n-1 |
+| GETMVL rd       | CSRRW  MVL, rd, x0   |
+
+Note: CSRRC and other bit-setting CSR operations may still be used; they are however not particularly useful (very obscure).
+
+## Register key-value (CAM) table <a name="regcsrtable" />
+
+The purpose of the Register table is to mark which registers change behaviour
+if used in a "Standard" (normally scalar) opcode.
+
+16 bit format:
+
+| RegCAM | 15       | (14..8)  | 7   | (6..5) | (4..0)  |
+| ------ | -        | -        | -   | ------ | ------- |
+| 0      | isvec0   | regidx0  | i/f | vew0   | regkey  |
+| 1      | isvec1   | regidx1  | i/f | vew1   | regkey  |
+| 2      | isvec2   | regidx2  | i/f | vew2   | regkey  |
+| 3      | isvec3   | regidx3  | i/f | vew3   | regkey  |
+
+8 bit format:
+
+| RegCAM | 7   | (6..5) | (4..0)  |
+| ------ | -   | ------ | ------- |
+| 0      | i/f | vew0   | regnum  |
+
+Mapping the 8-bit to 16-bit format:
+
+| RegCAM | 15      | (14..8)    | 7   | (6..5) | (4..0)  |
+| ------ | -       | -          | -   | ------ | ------- |
+| 0      | isvec=1 | regnum0<<2 | i/f | vew0   | regnum0 |
+| 1      | isvec=1 | regnum1<<2 | i/f | vew1   | regnum1 |
+| 2      | isvec=1 | regnum2<<2 | i/f | vew2   | regnum2 |
+| 3      | isvec=1 | regnum3<<2 | i/f | vew3   | regnum3 |
+
+Fields:
+
+* i/f is set to "1" to indicate that the redirection/tag entry is to
+  be applied to integer registers; 0 indicates that it is relevant to
+  floating-point registers.
+* isvec indicates that the register (whether a src or dest) is to progress
+  incrementally forward on each loop iteration.  This gives the "effect"
+  of vectorisation.  isvec set to zero indicates "do not progress", giving
+  the "effect" of that register being scalar.
+* vew overrides the operation's default width.  See table below
+* regkey is the register which, if encountered in an op (as src or dest)
+  is to be "redirected"
+* in the 16-bit format, regidx is the *actual* register to be used
+  for the operation (note that it is 7 bits wide)
+
+| vew | bitwidth            |
+| --- | ------------------- |
+| 00  | default (XLEN/FLEN) |
+| 01  | 8 bit               |
+| 10  | 16 bit              |
+| 11  | 32 bit              |
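The mapping from the 8-bit to the 16-bit Register table format can be sketched as plain bit manipulation. The function names below are hypothetical; the bit positions are those given in the tables of this section (isvec forced to 1, regidx set to regnum shifted left by 2, the remaining fields copied across).

```python
def expand_regcam8(entry8: int) -> int:
    """Expand an 8-bit RegCAM entry to the 16-bit format."""
    regnum = entry8 & 0x1F            # bits 4..0
    regidx = (regnum << 2) & 0x7F     # 7-bit field, bits 14..8
    # bit 15 is isvec=1; i/f, vew and regnum keep their positions
    return (1 << 15) | (regidx << 8) | (entry8 & 0xFF)


def decode_regcam16(entry16: int) -> dict:
    """Split a 16-bit RegCAM entry into its named fields."""
    return {
        "isvec": (entry16 >> 15) & 1,
        "regidx": (entry16 >> 8) & 0x7F,
        "i_f": (entry16 >> 7) & 1,
        "vew": (entry16 >> 5) & 0x3,
        "regkey": entry16 & 0x1F,
    }
```

An 8-bit entry with i/f=1, vew=01 and regnum=3 therefore expands to an entry that redirects register key 3 to actual register 12, marked as a vector.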
+
+As the above table is a CAM (key-value store), it may be appropriate
+(faster, fewer gates, implementation-wise) to expand it as follows:
+
+    struct vectorised {
+        bool isvector:1; // 0=scalar
+        int  elwidth:2;  // the "vew" field
+        bool enabled:1;
+        int  regidx:7;   // redirection: actual register to use
+    };
+
+    struct vectorised fp_vec[32], int_vec[32];
+
+    for (i = 0; i < len; i++) // from VBLOCK Format
+       tb = int_vec if CSRvec[i].type == 0 else fp_vec
+       idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
+       tb[idx].elwidth  = CSRvec[i].elwidth
+       tb[idx].regidx   = CSRvec[i].regidx  // indirection
+       tb[idx].isvector = CSRvec[i].isvector // 0=scalar
+       tb[idx].enabled  = true;
+
+## Predication Table <a name="predication_csr_table"></a>
+
+The Predication Table is a key-value store indicating whether, if a
+given destination register (integer or floating-point) is referred to
+in an instruction, it is to be predicated. Like the Register table, it
+is an indirect lookup that allows the RV opcodes to not need modification.
+
+* regidx is the register that, in combination with the i/f flag, acts as
+  the lookup key: if that integer or floating-point register is referred
+  to in a (standard RV) instruction, the lookup table is referenced
+  to find the predication mask to use for this operation.
+* predidx is the *actual* (full, 7 bit) register to be used for the
+  predication mask.
+* inv indicates that the predication mask bits are to be inverted
+  prior to use *without* actually modifying the contents of the
+  register from which those bits originated.
+* zeroing is either 1 or 0, and if set to 1, the operation must
+  place zeros in any element position where the predication mask is
+  set to zero.  If zeroing is set to 0, unpredicated elements *must*
+  be left alone.  Some microarchitectures may choose to interpret
+  this as skipping the operation entirely.  Others which wish to
+  stick more closely to a SIMD architecture may choose instead to
+  interpret unpredicated elements as an internal "copy element"
+  operation (which would be necessary in SIMD microarchitectures
+  that perform register-renaming).
+* ffirst is a special mode that stops sequential element processing when
+  a data-dependent condition occurs, whether a trap or a conditional test.
+  The handling of each (trap or conditional test) is slightly different:
+  see Instruction sections for further details.
+
+16 bit format:
+
+| PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
+| ----- | -        | -      | -     | -   | ------- | ------- |
+| 0     | predidx  | zero0  | inv0  | i/f | regidx  | ffirst0 |
+| 1     | predidx  | zero1  | inv1  | i/f | regidx  | ffirst1 |
+| 2     | predidx  | zero2  | inv2  | i/f | regidx  | ffirst2 |
+| 3     | predidx  | zero3  | inv3  | i/f | regidx  | ffirst3 |
+
+Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding.  Its use must
+generate an illegal instruction trap.
+
+8 bit format:
+
+| PrCSR | 7     | 6     | 5   | (4..0)  |
+| ----- | -     | -     | -   | ------- |
+| 0     | zero0 | inv0  | i/f | regnum  |
+
+Mapping from 8 to 16 bit format, the table becomes:
+
+| PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
+| ----- | -        | -      | -     | -   | ------- | ------- |
+| 0     | x9       | zero0  | inv0  | i/f | regnum  | ff=0    |
+| 1     | x10      | zero1  | inv1  | i/f | regnum  | ff=0    |
+| 2     | x11      | zero2  | inv2  | i/f | regnum  | ff=0    |
+| 3     | x12      | zero3  | inv3  | i/f | regnum  | ff=0    |
+
+Pseudocode for predication:
+
+    struct pred {
+        bool zero;    // zeroing
+        bool inv;     // register at predidx is inverted
+        bool ffirst;  // fail-on-first
+        bool enabled; // use this to tell if the table-entry is active
+        int predidx;  // redirection: actual int register to use
+    }
+
+    struct pred fp_pred_reg[32];
+    struct pred int_pred_reg[32];
+
+    for (i = 0; i < len; i++) // number of Predication entries in VBLOCK
+      tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
+      idx = CSRpred[i].regidx
+      tb[idx].zero     = CSRpred[i].zero
+      tb[idx].inv      = CSRpred[i].inv
+      tb[idx].ffirst   = CSRpred[i].ffirst
+      tb[idx].predidx  = CSRpred[i].predidx
+      tb[idx].enabled  = true
+
+    def get_pred_val(bool is_fp_op, int reg):
+       tb = fp_vec if is_fp_op else int_vec   // Register table
+       if (!tb[reg].enabled):
+          return ~0x0, False       // all enabled; no zeroing
+       tb = fp_pred_reg if is_fp_op else int_pred_reg // Predication table
+       if (!tb[reg].enabled):
+          return ~0x0, False       // all enabled; no zeroing
+       predidx = tb[reg].predidx   // redirection occurs HERE
+       predicate = intreg[predidx] // actual predicate HERE
+       if (tb[reg].inv):
+          predicate = ~predicate   // invert ALL bits
+       return predicate, tb[reg].zero
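A runnable approximation of the predicate lookup is sketched below in Python. It is simplified: the check against the Register table is left out, the `Pred` class and the 64-bit `MASK` are hypothetical, and a single table is passed in explicitly instead of being selected by an integer/FP flag.

```python
from dataclasses import dataclass

XLEN = 64                    # assumed register width for this sketch
MASK = (1 << XLEN) - 1


@dataclass
class Pred:
    enabled: bool = False    # is this table entry active?
    zero: bool = False       # zeroing
    inv: bool = False        # invert the predicate before use
    predidx: int = 0         # integer register holding the mask


def get_pred_val(pred_table, intreg, reg):
    """Return (predicate mask, zeroing flag) for register `reg`."""
    entry = pred_table[reg]
    if not entry.enabled:
        return MASK, False               # all elements enabled, no zeroing
    predicate = intreg[entry.predidx] & MASK   # redirection happens here
    if entry.inv:
        predicate = ~predicate & MASK    # invert all bits; intreg is untouched
    return predicate, entry.zero
```

An entry with `inv=True` and `predidx=0` yields a mask of all ones, because x0 always reads as zero.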
+
+## Fail-on-First Mode <a name="ffirst-mode"></a>
+
+ffirst is a special data-dependent predicate mode.  There are two
+variants: one is for faults: typically for LOAD/STORE operations,
+which may encounter end of page faults during a series of operations.
+The other variant is comparisons such as FEQ (or the augmented behaviour
+of Branch), and any operation that returns a result of zero (whether
+integer or floating-point).  In the FP case, this includes negative-zero.
+
+Note that the execution order must "appear" to be sequential for ffirst
+mode to work correctly.  An in-order architecture must execute the element
+operations in sequence, whilst an out-of-order architecture must *commit*
+the element operations in sequence (giving the appearance of in-order
+execution).
+
+Note also, that if ffirst mode is needed without predication, a special
+"always-on" Predicate Table Entry may be constructed by setting
+inverse-on and using x0 as the predicate register.  This
+will have the effect of creating a mask of all ones, allowing ffirst
+to be set.
+
+### Fail-on-first traps
+
+Except for the first element, ffirst stops sequential element processing
+when a trap occurs.  The first element is treated normally (as if ffirst
+is clear).  Should any subsequent element instruction require a trap,
+instead it and subsequent indexed elements are ignored (or cancelled in
+out-of-order designs), and VL is set to cover only the elements up to the
+*last* one that did not take the trap.
+
+Note that predicated-out elements (where the predicate mask bit is zero)
+are clearly excluded (i.e. the trap will not occur).  However, note that
+the loop still had to test the predicate bit: thus on return,
+VL is set to include elements that did not take the trap *and* the
+elements that were predicated (masked) out, and so not tested, up to the
+point where the trap occurred.
+
+If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
+will cause a trap as normal (as if ffirst is not set); subsequently,
+the trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
+be modified.
+
+Given that predication bits apply to SUBVL groups, the same rules apply
+to predicated-out (masked-out) sub-groups in calculating the value that VL
+is set to.
+
+### Fail-on-first conditional tests
+
+ffirst stops sequential element conditional testing on the first element result
+being zero.  VL is set to the number of elements that were processed before
+the fail-condition was encountered.
+
+Note that just as with traps, if SUBVL!=1, the first of any of the *sub-group*
+will cause the processing to end, and, even if there were elements within
+the *sub-group* that passed the test, that sub-group is still (entirely)
+excluded from the count (from setting VL).  i.e. VL is set to the total
+number of *sub-groups* that had no fail-condition up until execution was
+stopped.
+
+Note again that, just as with traps, predicated-out (masked-out) elements
+are included in the count leading up to the fail-condition, even though they
+were not tested.
+
+The pseudo-code for Predication makes this clearer and simpler than it is
+in words (the loop ends, VL is set to the current element index, "i").
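The counting rule for the conditional-test variant can be illustrated with a small Python sketch. The name `ffirst_new_vl` is hypothetical and SUBVL is taken to be 1. The text does not say what VL becomes when the very first element fails; the sketch simply returns the index (zero) in that case, which is an assumption.

```python
def ffirst_new_vl(results, predicate, vl):
    """Return the VL left behind by a fail-on-first conditional test.

    results[i] is the already-computed result for element i;
    bit i of predicate enables the test on element i.
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue        # masked out: not tested, but still counted
        if results[i] == 0:
            return i        # the loop ends, VL is the current index
    return vl               # no fail-condition: VL is unchanged
```

With results `[1, 0, 0, 1]` and predicate `0b1101`, element 1 is masked out and not tested, element 2 fails, and VL becomes 2.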
+
+# Instructions <a name="instructions" />
+
+To illustrate how Scalar operations are turned "vector" and "predicated",
+simplified example pseudo-code for an integer ADD operation is shown below.
+Floating-point would use the FP Register Table.
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      predval = get_pred_val(FALSE, rd);
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      for (i = 0; i < VL; i++)
+        xSTATE.srcoffs = i # save context
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+           if (!int_vec[rd ].isvector) break;
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+
+Note that for simplicity there is quite a lot missing from the above
+pseudo-code.
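The following Python sketch is a runnable rendering of the simplified loop above, under the same omissions (no zeroing, no elwidth, no STATE save). `RegEntry` and the argument names are hypothetical; the one deliberate difference is that the vector flags are captured before redirection is applied, so later tests do not index the table with the redirected register number.

```python
from dataclasses import dataclass


@dataclass
class RegEntry:
    isvector: bool = False
    regidx: int = 0          # actual register used when redirected


def op_add(ireg, int_vec, rd, rs1, rs2, vl, predval=-1):
    """Scalar ADD opcode turned into a predicated element loop."""
    # Record which operands are vectors before applying redirection.
    d_vec, s1_vec, s2_vec = (int_vec[r].isvector for r in (rd, rs1, rs2))
    rd = int_vec[rd].regidx if d_vec else rd
    rs1 = int_vec[rs1].regidx if s1_vec else rs1
    rs2 = int_vec[rs2].regidx if s2_vec else rs2
    i_d = i_s1 = i_s2 = 0
    for i in range(vl):
        if predval & (1 << i):          # predicate bit i enables element i
            ireg[rd + i_d] = ireg[rs1 + i_s1] + ireg[rs2 + i_s2]
            if not d_vec:
                break                   # scalar destination: one result only
        i_d += d_vec                    # only vector operands advance
        i_s1 += s1_vec
        i_s2 += s2_vec
```

With a vector destination, a vector rs1 and a scalar rs2, this adds the same scalar to every enabled element and leaves masked-out destination elements untouched.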
+
+## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
+
+Adding in support for SUBVL is a matter of adding in an extra inner
+for-loop, where register src and dest are still incremented inside the
+inner part. Note that the predication is still taken from the VL index.
+
+So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
+indexed by "(i)":
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      predval = get_pred_val(FALSE, rd);
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      for (i = 0; i < VL; i++)
+       xSTATE.srcoffs = i # save context
+       for (s = 0; s < SUBVL; s++)
+        xSTATE.ssvoffs = s # save context
+        if (predval & 1<<i) # predication uses intregs
+           # actual add is here (at last)
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+           if (!int_vec[rd ].isvector) break;
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+        if (id == VL or irs1 == VL or irs2 == VL) {
+          # end VL hardware loop
+          xSTATE.srcoffs = 0; # reset
+          xSTATE.ssvoffs = 0; # reset
+          return;
+        }
+
+NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
+elwidth handling etc. all left out.
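The relationship between element indices and predicate bits under SUBVL can be shown in isolation. This Python generator is only an illustration with a hypothetical name; it lists which elements are active, without performing any operation on them.

```python
def subvl_elements(vl, subvl, predval):
    """Yield (element index, group index) for every active element."""
    for i in range(vl):
        if not (predval >> i) & 1:
            continue                 # one predicate bit masks the whole group
        for s in range(subvl):
            yield i * subvl + s, i   # element index vs. predicate index
```

For VL=3, SUBVL=2 and predicate `0b101`, the active elements are 0, 1, 4 and 5: group 1 (elements 2 and 3) is skipped as a unit.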
+
+## Instruction Format
+
+It is critical to appreciate that there are
+**no operations added to SV, at all**.
+
+Examples are given below where "standard" RV scalar behaviour is augmented.
+
+## Branch Instructions
+
+Branch operations are augmented slightly to be a little more like FP
+Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
+of multiple comparisons into a register (taken indirectly from the predicate
+table).  As such, "ffirst" - fail-on-first - condition mode can be enabled.
+See ffirst mode in the Predication Table section.
+
+### Standard Branch <a name="standard_branch"></a>
+
+Branch operations use standard RV opcodes that are reinterpreted to
+be "predicate variants" in the instance where either of the two src
+registers are marked as vectors (active=1, vector=1).
+
+Note that the predication register to use (if one is enabled) is taken from
+the *first* src register, and that this is used, just as with predicated
+arithmetic operations, to mask whether the comparison operations take
+place or not.  If the second register is also marked as predicated,
+that (scalar) predicate register is used as a **destination** to store
+the results of all the comparisons.
+
+In instances where no vectorisation is detected on either src register,
+the operation is treated as an absolutely standard scalar branch operation.
+Where vectorisation is present on either or both src registers, the
+branch may still go ahead if and only if *all* tests succeed (i.e. excluding
+those tests that are predicated out).
+
+Pseudo-code for branch:
+
+    s1 = reg_is_vectorised(src1);
+    s2 = reg_is_vectorised(src2);
+
+    if not s1 && not s2
+        if cmp(rs1, rs2) # scalar compare
+            goto branch
+        return
+
+    preg = int_pred_reg[rd]
+    reg = int_regfile
+
+    ps = get_pred_val(I/F==INT, rs1);
+    rd = get_pred_val(I/F==INT, rs2); # this may not exist
+
+    if not exists(rd) or zeroing:
+        result = 0
+    else
+        result = preg[rd]
+
+    for (int i = 0; i < VL; ++i)
+      if (zeroing)
+        if not (ps & (1<<i))
+           result &= ~(1<<i);
+      else if (ps & (1<<i))
+          if (cmp(s1 ? reg[src1+i]:reg[src1],
+                               s2 ? reg[src2+i]:reg[src2]))
+              result |= 1<<i;
+          else
+              result &= ~(1<<i);
+
+     if not exists(rd)
+        if result == ps
+            goto branch
+     else
+        preg[rd] = result # store in destination
+        if preg[rd] == ps
+            goto branch
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+  Reordering") setting Vector-Length times (number of SIMD elements) bits
+  in Predicate Register rd, as opposed to just Vector-Length bits.
+* The execution of "parallelised" instructions **must** be implemented
+  as "re-entrant" (to use a term from software).  If an exception (trap)
+  occurs during the middle of a vectorised
+  Branch (now a SV predicated compare) operation, the partial results
+  of any comparisons must be written out to the destination
+  register before the trap is permitted to begin.  If however there
+  is no predicate, the **entire** set of comparisons must be **restarted**,
+  with the offset loop indices set back to zero.  This is because
+  there is no place to store the temporary result during the handling
+  of traps.
+
+Note also that where normally, predication requires that there must
+also be a CSR register entry for the register being used in order
+for the **predication** CSR register entry to also be active,
+for branches this is **not** the case.  src2 does **not** have
+to have its CSR register entry marked as active in order for
+predication on src2 to be active.
+
+### Floating-point Comparisons
+
+There are no floating-point branch operations, only compare.
+Interestingly no change is needed to the instruction format because
+FP Compare already stores a 1 or a zero in its "rd" integer register
+target, i.e. it's not actually a Branch at all: it's a compare.
+
+As RV Scalar does not have "FNE", predication inversion must be used.
+Also: note that FP Compare may be predicated, using the destination
+integer register (rd) to determine the predicate.  FP Compare is **not**
+a twin-predication operation, as, again, just as with SV Branches,
+there are three registers involved: FP src1, FP src2 and INT rd.
+
+Also: note that ffirst (fail first mode) applies directly to this operation.
+
+### Compressed Branch Instruction
+
+Compressed Branch instructions are, just like standard Branch instructions,
+reinterpreted to be vectorised and predicated based on the source register
+(rs1s) CSR entries.  As however there is only the one source register,
+given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
+to store the results of the comparisons is taken from CSR predication
+table entries for **x0**.
+
+The specific required use of x0 is, with a little thought, quite obvious,
+but is counterintuitive.  Clearly it is **not** recommended to redirect
+x0 with a CSR register entry, however as a means to opaquely obtain
+a predication target it is the only sensible option that does not involve
+additional special CSRs (or, worse, additional special opcodes).
+
+Note also that, just as with standard branches, the 2nd source
+(in this case x0 rather than src2) does **not** have to have its CSR
+register table marked as "active" in order for predication to work.
+
+## Vectorised Dual-operand instructions
+
+There is a series of 2-operand instructions involving copying (and
+sometimes alteration):
+
+* C.MV
+* FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
+* C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
+* LOAD(-FP) and STORE(-FP)
+
+All of these operations follow the same two-operand pattern, so it is
+*both* the source *and* destination predication masks that are taken into
+account.  This is different from
+the three-operand arithmetic instructions, where the predication mask
+is taken from the *destination* register, and applied uniformly to the
+elements of the source register(s), element-for-element.
+
+The pseudo-code pattern for twin-predicated operations is as
+follows:
+
+    function op(rd, rs):
+      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+        xSTATE.srcoffs = i # save context
+        xSTATE.destoffs = j # save context
+        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
+        if (int_csr[rs].isvec) i++;
+        if (int_csr[rd].isvec) j++; else break
+
+This pattern covers scalar-scalar, scalar-vector, vector-scalar
+and vector-vector, and predicated variants of all of those.
+Zeroing is not presently included (TODO).  As such, when compared
+to RVV, the twin-predicated variants of C.MV and FMV cover
+**all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
+VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
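To make the twin-predication pattern concrete, here is a Python sketch of a vectorised move. It is a simplification with hypothetical names: register redirection is omitted, the vector flags are passed in directly, and bounds checks have been added to the predicate-skipping loops (the pseudo-code above does not show them) so that an exhausted mask ends the loop cleanly.

```python
def twin_pred_mv(reg, rd, rs, vl, src_vec, dst_vec, ps=-1, pd=-1):
    """Copy elements from rs to rd under source and destination masks."""
    i = j = 0
    while i < vl and j < vl:
        if src_vec:
            while i < vl and not (ps >> i) & 1:
                i += 1                   # skip masked-out source elements
        if dst_vec:
            while j < vl and not (pd >> j) & 1:
                j += 1                   # skip masked-out destination elements
        if i >= vl or j >= vl:
            break
        reg[rd + j] = reg[rs + i]        # the scalar operation (a plain move)
        if src_vec:
            i += 1
        if dst_vec:
            j += 1
        else:
            break                        # scalar destination: single result
```

A scalar source with a vector destination behaves like VSPLAT, a source mask alone packs the selected elements together (gather), and a vector source with a single-bit mask and scalar destination behaves like VEXTRACT.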
+
+### C.MV Instruction <a name="c_mv"></a>
+
+There is no MV instruction in RV; however, there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
+
+If either the source or the destination register is marked as a vector,
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation.  The actual instruction's format does not change.
+
+There are several different instructions from RVV that are covered by
+this one opcode:
+
+[[!table  data="""
+src    | dest    | predication   | op             |
+scalar | vector  | none          | VSPLAT         |
+scalar | vector  | destination   | sparse VSPLAT  |
+scalar | vector  | 1-bit dest    | VINSERT        |
+vector | scalar  | 1-bit? src    | VEXTRACT       |
+vector | vector  | none          | VCOPY          |
+vector | vector  | src           | Vector Gather  |
+vector | vector  | dest          | Vector Scatter |
+vector | vector  | src & dest    | Gather/Scatter |
+vector | vector  | src == dest   | sparse VCOPY   |
+"""]]
+
+Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
+operations with inversion on the src and dest predication for one of the
+two C.MV operations.
+
+### FMV, FNEG and FABS Instructions
+
+These are identical in form to C.MV, except covering floating-point
+register copying.  The same double-predication rules also apply.
+However, when elwidth is not set to default, the instruction is implicitly
+and automatically converted to a (vectorised) floating-point type conversion
+operation of the appropriate size covering the source and destination
+register bitwidths.
+
+(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
+
+### FCVT Instructions
+
+These are again identical in form to C.MV, except that they cover
+floating-point to integer and integer to floating-point.  When element
+width in each vector is set to default, the instructions behave exactly
+as they are defined for standard RV (scalar) operations, except vectorised
+in exactly the same fashion as outlined in C.MV.
+
+However when the source or destination element width is not set to default,
+the opcode's explicit element widths are *over-ridden* to new definitions,
+and the opcode's element width is taken as indicative of the SIMD width
+(if applicable i.e. if packed SIMD is requested) instead.
+
+## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
+
+In vectorised architectures there are usually at least two different modes
+for LOAD/STORE:
+
+* Read (or write for STORE) from sequential locations, where one
+  register specifies the address, and the one address is incremented
+  by a fixed amount.  This is usually known as "Unit Stride" mode.
+* Read (or write) from multiple indirected addresses, where the
+  vector elements each specify separate and distinct addresses.
+
+To support these different addressing modes, the CSR Register "isvector"
+bit is used.  So, for a LOAD, when the src register is set to
+scalar, the LOADs are sequentially incremented by the src register
+element width, and when the src register is set to "vector", the
+elements are treated as indirection addresses.  Simplified
+pseudo-code would look like this:
+
+    function op_ld(rd, rs) # LD not VLD!
+      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
+      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+        if (int_csr[rs].isvec)
+          # indirect mode (multi mode)
+          srcbase = ireg[rsv+i];
+        else
+          # unit stride mode
+          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
+        ireg[rdv+j] <= mem[srcbase + imm_offs];
+        if (!int_csr[rs].isvec &&
+            !int_csr[rd].isvec) break # scalar-scalar LD
+        if (int_csr[rs].isvec) i++;
+        if (int_csr[rd].isvec) j++;
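The two addressing modes can be compared in a small unpredicated Python sketch. Several things here are assumptions made for illustration: memory is modelled as a dict from byte address to the loaded value, redirection and predication are omitted, and the unit-stride offset is driven by the loop count so that a scalar source with a vector destination walks through consecutive addresses.

```python
def op_ld(ireg, mem, rd, rs, vl, src_vec, dst_vec, imm=0, xlen=64):
    """Unpredicated vectorised LD; mem maps byte address -> value."""
    if not src_vec and not dst_vec:
        ireg[rd] = mem[ireg[rs] + imm]          # plain scalar-scalar load
        return
    for n in range(vl):
        if src_vec:
            base = ireg[rs + n]                 # indirect: element n is an address
        else:
            base = ireg[rs] + n * (xlen // 8)   # unit stride, offset in bytes
        dest = rd + n if dst_vec else rd
        ireg[dest] = mem[base + imm]
```

Marking only the destination as a vector gives a sequential block load from one base address; marking the source as a vector as well makes each source element supply its own address.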
+
+## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
+
+C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
+where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
+It is therefore possible to use predicated C.LWSP to efficiently
+pop registers off the stack (by predicating x2 as the source), cherry-picking
+which registers to store to (by predicating the destination).  Likewise
+for C.SWSP.  In this way, LOAD/STORE-Multiple is efficiently achieved.
+
+**Note**: it is still possible to redirect x2 to an alternative target
+register.  With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
+general-purpose LOAD/STORE operations.
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
+the same rules and the same pseudo-code apply as for
+non-compressed LOAD/STORE.  Again: setting scalar or vector mode
+on the src for LOAD and dest for STORE switches mode between "Unit Stride"
+and "Multi-indirection", respectively.
+
+# Element bitwidth polymorphism <a name="elwidth"></a>
+
+Element bitwidth is best covered as its own special section, as it
+is quite involved and applies uniformly across-the-board.  SV restricts
+bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
+
+The effect of setting an element bitwidth is to re-cast each entry
+in the register table, and for all memory operations involving
+load/stores of certain specific sizes, to a completely different width.
+Thus, in C-style terms, on an RV64 architecture, effectively each register
+now looks like this:
+
+    typedef union {
+        uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
+        uint8_t  b[0]; // array of type uint8_t
+        uint16_t s[0];
+        uint32_t i[0];
+        uint64_t l[0];
+        uint128_t d[0];
+    } reg_t;
+
+    reg_t int_regfile[128];
+
+where when accessing any individual regfile[n].b entry it is permitted
+(in c) to arbitrarily over-run the *declared* length of the array (zero),
+and thus "overspill" to consecutive register file entries in a fashion
+that is completely transparent to a greatly-simplified software / pseudo-code
+representation.
+It is however critical to note that it is clearly the responsibility of
+the implementor to ensure that, towards the end of the register file,
+an exception is thrown if an access beyond the "real" register
+bytes is ever attempted.
+
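+The "overspill" behaviour can be modelled outside of C with a flat byte
+array.  The following is a minimal Python sketch, not part of the
+specification: the names `regfile`, `get_element` and `set_element` are
+hypothetical, RV64 with 128 registers is assumed, and little-endian
+element ordering within the register file is also an assumption.
+
```python
XLEN = 64                  # assumption: RV64
NUM_REGS = 128             # matches int_regfile[128] above
REG_BYTES = XLEN // 8

# The whole register file as one flat byte array, so that element
# indexing can run past the end of a single register ("overspill").
regfile = bytearray(NUM_REGS * REG_BYTES)


def _span(reg, bitwidth, offset):
    """Return (start, length) in bytes of element `offset` of register `reg`."""
    nbytes = bitwidth // 8
    start = reg * REG_BYTES + offset * nbytes
    # Implementor's responsibility: trap on accesses beyond the real bytes.
    if start < 0 or start + nbytes > len(regfile):
        raise IndexError("element lies beyond the end of the register file")
    return start, nbytes


def get_element(reg, bitwidth, offset):
    """Read one element, zero-extended (little-endian layout assumed)."""
    start, nbytes = _span(reg, bitwidth, offset)
    return int.from_bytes(regfile[start:start + nbytes], "little")


def set_element(reg, bitwidth, offset, val):
    """Write one element, truncating `val` to `bitwidth` bits."""
    start, nbytes = _span(reg, bitwidth, offset)
    val &= (1 << bitwidth) - 1
    regfile[start:start + nbytes] = val.to_bytes(nbytes, "little")
```
+
+Under these assumptions, writing 16-bit element 4 of register 5 lands in
+element 0 of register 6, and reading 8-bit element 8 of register 127
+raises an error rather than silently wrapping.
+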
+The pseudo-code is as follows, to demonstrate how the sign-extending
+and width-extending works:
+
+    typedef union {
+        uint8_t  b;
+        uint16_t s;
+        uint32_t i;
+        uint64_t l;
+    } el_reg_t;
+
+    bw(elwidth):
+        if elwidth == 0:
+            return xlen
+        if elwidth == 1:
+            return xlen / 2
+        if elwidth == 2:
+            return xlen * 2
+        // elwidth == 3:
+        return 8
+
+    get_max_elwidth(rs1, rs2):
+        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
+                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry
+
+    get_polymorphed_reg(reg, bitwidth, offset):
+        el_reg_t res;
+        res.l = 0; // TODO: going to need sign-extending / zero-extending
+        if bitwidth == 8:
+            res.b = int_regfile[reg].b[offset]
+        elif bitwidth == 16:
+            res.s = int_regfile[reg].s[offset]
+        elif bitwidth == 32:
+            res.i = int_regfile[reg].i[offset]
+        elif bitwidth == 64:
+            res.l = int_regfile[reg].l[offset]
+        return res
+
+    set_polymorphed_reg(reg, bitwidth, offset, val):
+        if (!int_csr[reg].isvec):
+            # sign/zero-extend depending on opcode requirements, from
+            # the reg's bitwidth out to the full bitwidth of the regfile
+            val = sign_or_zero_extend(val, bitwidth, xlen)
+            int_regfile[reg].l[0] = val
+        elif bitwidth == 8:
+            int_regfile[reg].b[offset] = val
+        elif bitwidth == 16:
+            int_regfile[reg].s[offset] = val
+        elif bitwidth == 32:
+            int_regfile[reg].i[offset] = val
+        elif bitwidth == 64:
+            int_regfile[reg].l[offset] = val
+
+      maxsrcwid =  get_max_elwidth(rs1, rs2) # source element width(s)
+      destwid = bw(int_csr[rd].elwidth)      # destination element width
+      for (i = 0; i < VL; i++)
+        if (predval & 1<<i) # predication uses intregs
+           // TODO, calculate if over-run occurs, for each elwidth
+           src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
+           // TODO, sign/zero-extend src1 and src2 as operation requires
+           if (op_requires_sign_extend_src1)
+              src1 = sign_extend(src1, maxsrcwid)
+           src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
+           result = src1 + src2 # actual add here
+           // TODO, sign/zero-extend result, as operation requires
+           if (op_requires_sign_extend_dest)
+              result = sign_extend(result, maxsrcwid)
+           set_polymorphed_reg(rd, destwid, ird, result)
+           if (!int_vec[rd].isvector) break
+        if (int_vec[rd ].isvector)  { ird += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+
+## Polymorphic floating-point operation exceptions and error-handling
+
+For floating-point operations, conversion takes place without
+raising any kind of exception.  Exactly as in the standard
+RV specification, NAN (or appropriate) is stored if the result
+is beyond the range of the destination and, just as with scalar
+operations, the floating-point flag is raised (FCSR).  Again, just as
+with scalar operations, it is software's responsibility to check this flag.
+Given that the FCSR flags are "accrued", the fact that multiple element
+operations could have occurred is not a problem.
+
+Note that it is perfectly legitimate for floating-point bitwidths of
+only 8 to be specified.  However whilst it is possible to apply IEEE 754
+principles, no actual standard yet exists.  Implementors wishing to
+provide hardware-level 8-bit support rather than throw a trap to emulate
+in software should contact the author of this specification before
+proceeding.
+
+## Polymorphic shift operators
+
+A special note is needed for changing the element width of left and right
+shift operators, particularly right-shift.  
+
+For SV, where each operand's element bitwidth may be over-ridden, the
+rule about determining the operation's bitwidth *still applies*, being
+defined as the maximum bitwidth of RS1 and RS2.  *However*, this rule
+**also applies to the truncation of RS2**.  In other words, *after*
+determining the maximum bitwidth, RS2's range must **also be truncated**
+to ensure a correct answer.  Example:
+
+* RS1 is over-ridden to a 16-bit width
+* RS2 is over-ridden to an 8-bit width
+* RD is over-ridden to a 64-bit width
+* the maximum bitwidth is thus determined to be 16-bit - max(8,16)
+* RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
+
+Pseudocode (in spike) for this example would therefore be:
+
+    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
+
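+The truncation rule can also be sketched in Python.  This is an
+illustrative model only: `polymorphic_sll` is a hypothetical name, the
+source widths are passed in directly rather than looked up in the CSR
+tables, and the final step simply masks to XLEN where spike would
+use `sext_xlen`.
+
```python
def polymorphic_sll(rs1_val, rs2_val, rs1_width, rs2_width, xlen=64):
    """Shift-left performed at the maximum of the two source widths."""
    opwidth = max(rs1_width, rs2_width)
    # Zero-extend RS1 at the operation width (zext_16bit in the example).
    src = rs1_val & ((1 << opwidth) - 1)
    # Truncate the shift amount to the range 0..opwidth-1.
    shamt = rs2_val & (opwidth - 1)
    # Keep only XLEN bits of the result (simplification of sext_xlen).
    return (src << shamt) & ((1 << xlen) - 1)
```
+
+With RS1 at 16-bit and RS2 at 8-bit, a shift amount of 17 is therefore
+treated as a shift by 1.
+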
+## Polymorphic MULH/MULHU/MULHSU
+
+MULH is designed to take the top half MSBs of a multiply that
+does not fit within the range of the source operands, such that
+smaller width operations may produce a full double-width multiply
+in two cycles.  The issue is: SV allows the source operands to
+have variable bitwidth.
+
+Here again special attention has to be paid to the rules regarding
+bitwidth, which, again, are that the operation is performed at
+the maximum bitwidth of the **source** registers.  Therefore:
+
+* An 8-bit x 8-bit multiply will create a 16-bit result that must
+  be shifted down by 8 bits
+* A 16-bit x 8-bit multiply will create a 24-bit result that must
+  be shifted down by 16 bits (top 8 bits being zero)
+* A 16-bit x 16-bit multiply will create a 32-bit result that must
+  be shifted down by 16 bits
+* A 32-bit x 16-bit multiply will create a 48-bit result that must
+  be shifted down by 32 bits
+* A 32-bit x 8-bit multiply will create a 40-bit result that must
+  be shifted down by 32 bits
+
+So again, just as with shift-left and shift-right, the result
+is shifted down by the maximum of the two source register bitwidths.
+Again, truncation or sign-extension is then performed on the
+result.  If sign-extension is to be carried out, it is performed
+from the same maximum of the two source register bitwidths out
+to the result element's bitwidth.
+
+If truncation occurs, i.e. the top MSBs of the result are lost,
+this is "Officially Not Our Problem": it is assumed that the
+programmer actually desires the result to be truncated.  If the
+programmer wanted all of the bits, they would have set the destination
+elwidth to accommodate them.
+
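+The shift-down rule for the unsigned case (MULHU) can be sketched as
+follows.  The function name and the explicit width parameters are
+hypothetical, chosen only for illustration; signed variants and the
+final extension or truncation to the destination elwidth are left out.
+
```python
def polymorphic_mulhu(a, b, a_width, b_width):
    """Unsigned high-half multiply at the maximum source element width."""
    opwidth = max(a_width, b_width)
    # Each operand is first limited to its own element width.
    a &= (1 << a_width) - 1
    b &= (1 << b_width) - 1
    # The full product is shifted down by the operation width.
    return (a * b) >> opwidth
```
+
+For example, a 16-bit x 8-bit multiply of 0xFFFF by 0xFF produces the
+24-bit product 0xFEFF01, which is shifted down by 16 bits to give 0xFE.
+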
+## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
+
+Polymorphic element widths in vectorised form mean that the data
+being loaded (or stored) across multiple registers needs to be treated
+(reinterpreted) as a contiguous stream of elwidth-wide items, where
+the source register's element width is **independent** from the destination's.
+
+This makes for a slightly more complex algorithm when using indirection
+on the "addressed" register (source for LOAD and destination for STORE),
+particularly given that the LOAD/STORE instruction provides important
+information about the width of the data to be reinterpreted.
+
+As LOAD/STORE may be twin-predicated, it is important to note that
+the rules on twin predication still apply.  Where in previous
+pseudo-code (elwidth=default for both source and target) it was
+the *registers* that the predication was applied to, it is now the
+**elements** that the predication is applied to.
+
+The full pseudocode for all LD operations may be written out
+as follows:
+
+    function LBU(rd, rs):
+        load_elwidthed(rd, rs, 8, true)
+    function LB(rd, rs):
+        load_elwidthed(rd, rs, 8, false)
+    function LH(rd, rs):
+        load_elwidthed(rd, rs, 16, false)
+    ...
+    ...
+    function LQ(rd, rs):
+        load_elwidthed(rd, rs, 128, false)
+
+    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
+    function load_memory(rs, imm, i, opwidth):
+        elwidth = int_csr[rs].elwidth
+        bitwidth = bw(elwidth);
+        elsperblock = max(1, opwidth / bitwidth)
+        srcbase = ireg[rs+i/(elsperblock)];
+        offs = i % elsperblock;
+        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
+
+    function load_elwidthed(rd, rs, opwidth, unsigned):
+      destwid = int_csr[rd].elwidth # destination element width
+      bitwidth = bw(destwid)        # destination element width, in bits
+      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+        val = load_memory(rs, imm, i, opwidth)
+        if unsigned:
+            val = zero_extend(val, min(opwidth, bitwidth))
+        else:
+            val = sign_extend(val, min(opwidth, bitwidth))
+        set_polymorphed_reg(rd, bitwidth, j, val)
+        if (int_csr[rs].isvec) i++;
+        if (int_csr[rd].isvec) j++; else break;
+
+# Predication Element Zeroing
+
+The introduction of zeroing on traditional vector predication is usually
+intended as an optimisation for lane-based microarchitectures with register
+renaming to be able to save power by avoiding a register read on elements
+that are passed en-masse through the ALU.  Simpler microarchitectures
+do not have this issue: they simply do not pass the element through to
+the ALU at all, and therefore do not store it back in the destination.
+More complex non-lane-based micro-architectures can, when zeroing is
+not set, use the predication bits to simply avoid sending element-based
+operations to the ALUs, entirely: thus, over the long term, potentially
+keeping all ALUs 100% occupied even when elements are predicated out.
+
+SimpleV's design principle is not based on or influenced by
+microarchitectural design factors: it is a hardware-level API.
+Therefore, looking purely at whether zeroing is *useful* or not
+(whether fewer instructions are needed for certain scenarios),
+given that a case can be made for zeroing *and* non-zeroing, the
+decision was taken to add support for both.
+
+## Single-predication (based on destination register)
+
+Zeroing on predication for arithmetic operations is taken from
+the destination register's predicate.  i.e. the predication *and*
+zeroing settings to be applied to the whole operation come from the
+CSR Predication table entry for the destination register.
+Thus when zeroing is set on predication of a destination element,
+if the predication bit is clear, then the destination element is *set*
+to zero (twin-predication is slightly different, and will be covered
+next).
+
+Thus the pseudo-code loop for a predicated arithmetic operation
+is modified as follows:
+
+      for (i = 0; i < VL; i++)
+        if not zeroing: # an optimisation
+           while (!(predval & 1<<i) && i < VL)
+             if (int_vec[rd ].isvector)  { ird += 1; }
+             if (int_vec[rs1].isvector)  { irs1 += 1; }
+             if (int_vec[rs2].isvector)  { irs2 += 1; }
+             i += 1 # skip the masked-out element
+           if i == VL:
+             return
+        if (predval & 1<<i)
+           src1 = ....
+           src2 = ...
+           result = src1 + src2 # actual add (or other op) here
+           set_polymorphed_reg(rd, destwid, ird, result)
+           if int_vec[rd].ffirst and result == 0:
+              VL = i # result was zero, end loop early, return VL
+              return
+           if (!int_vec[rd].isvector) return
+        else if zeroing:
+           result = 0
+           set_polymorphed_reg(rd, destwid, ird, result)
+        if (int_vec[rd ].isvector)  { ird += 1; }
+        else if (predval & 1<<i) return
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+        if (ird == VL or irs1 == VL or irs2 == VL): return
+
+The optimisation to skip elements entirely is only possible for certain
+micro-architectures when zeroing is not set.  However for lane-based
+micro-architectures this optimisation may not be practical, as it
+implies that elements end up in different "lanes".  Under these
+circumstances it is perfectly fine to simply have the lanes
+"inactive" for predicated elements, even though it results in
+less than 100% ALU utilisation.
+
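+The visible difference between zeroing and non-zeroing can be shown with
+a much-reduced Python sketch.  It assumes that rd, rs1 and rs2 are all
+marked as vectors at the default element width, and it leaves out
+fail-first and the element-skipping optimisation; `predicated_add` and
+the plain list standing in for the register file are hypothetical.
+
```python
def predicated_add(regs, rd, rs1, rs2, vl, predval, zeroing):
    """Single-predicated vector add over a list modelling the regfile."""
    for i in range(vl):
        if (predval >> i) & 1:
            # Predicate bit set: perform the operation as normal.
            regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]
        elif zeroing:
            # Predicate bit clear and zeroing set: store a zero.
            regs[rd + i] = 0
        # Predicate bit clear, no zeroing: destination left untouched.
    return regs
```
+
+With a predicate of 0b0101 and VL=4, non-zeroing leaves destination
+elements 1 and 3 at their previous values, whereas zeroing overwrites
+them with 0.
+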
+## Twin-predication (based on source and destination register)
+
+Twin-predication is not that much different, except that
+the source is independently zero-predicated from the destination.
+This means that the source may be zero-predicated *or* the
+destination zero-predicated *or both*, or neither.
+
+When, with twin-predication, zeroing is set on the source and not
+the destination, a clear predicate bit indicates that a zero
+data element is passed through the operation (the exception being:
+if the source data element is to be treated as an address - a LOAD -
+then the data returned *from* the LOAD is zero, rather than looking up an
+*address* of zero).
+
+When zeroing is set on the destination and not the source, then just
+as with single-predicated operations, a zero is stored into the destination
+element (or target memory address for a STORE).
+
+Zeroing on both source and destination effectively results in a bitwise
+AND operation of the source and destination predicates: the result is that
+where either source predicate OR destination predicate is set to 0,
+a zero element will ultimately end up in the destination register.
+
+However: this may not necessarily be the case for all operations;
+implementors, particularly of custom instructions, clearly need to
+think through the implications in each and every case.
+
+Here is pseudo-code for a twin zero-predicated operation:
+
+    function op_mv(rd, rs) # MV not VMV!
+      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
+      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
+        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
+        if ((pd & 1<<j))
+            if ((ps & 1<<i))
+                sourcedata = ireg[rs+i];
+            else
+                sourcedata = 0
+            ireg[rd+j] <= sourcedata
+        else if (zerodst)
+            ireg[rd+j] <= 0
+        if (int_csr[rs].isvec)
+            i++;
+        if (int_csr[rd].isvec)
+            j++;
+        else
+            if ((pd & 1<<j))
+                break;
+
+Note that in the instance where the destination is a scalar, the hardware
+loop is ended the moment a value *or a zero* is placed into the destination
+register/element.  Also note that, for clarity, variable element widths
+have been left out of the above.
+
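+For experimentation, the twin-predicated MV can be written as runnable
+Python.  This is a sketch under stated assumptions rather than a
+reference model: the register file is a plain list, the isvec flags,
+predicates and zeroing flags are passed in as parameters instead of being
+looked up in the CSR tables, and bounds checks have been added to the
+skip loops so that the model always terminates.
+
```python
def twin_pred_mv(regs, rd, rs, vl, ps, pd, zerosrc, zerodst,
                 rs_isvec=True, rd_isvec=True):
    """Twin-predicated MV; ps/pd are the source/destination predicates."""
    i = j = 0
    while i < vl and j < vl:
        # Without zeroing, masked-out elements are skipped entirely.
        if rs_isvec and not zerosrc:
            while i < vl and not (ps >> i) & 1:
                i += 1
        if rd_isvec and not zerodst:
            while j < vl and not (pd >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break
        if (pd >> j) & 1:
            # A masked-out source element is passed through as zero.
            regs[rd + j] = regs[rs + i] if (ps >> i) & 1 else 0
        elif zerodst:
            regs[rd + j] = 0
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        elif (pd >> j) & 1 or zerodst:
            break  # scalar destination: stop once something is written
    return regs
```
+
+With ps=0b1010, pd=0b0101 and no zeroing, source elements 1 and 3 are
+moved to destination elements 0 and 2.  With zeroing on both sides the
+indices advance in lock-step, and only elements whose source *and*
+destination predicate bits are both set receive data.
+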
+# Exceptions
+
+TODO: expand.  
+
+# Hints
+
+With Simple-V being capable of issuing *parallel* instructions where
+rd=x0, the space for possible HINTs is expanded considerably.  VL
+could be used to indicate different hints.  In addition, if predication
+is set, the predication register itself could hypothetically be passed
+in as a *parameter* to the HINT operation.
+
+No specific hints are yet defined in Simple-V.
+
+# Vector Block Format <a name="vliw-format"></a>
+
+See ancillary resource: [[vblock_format]]
+
+# Subsets of RV functionality
+
+It is permitted to only implement SVprefix and not the VBLOCK instruction
+format option, and vice-versa.  UNIX Platforms **MUST** raise illegal
+instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
+traps may emulate the format.
+
+It is permitted in SVprefix to either not implement VL or not implement
+SUBVL (see [[sv_prefix_proposal]] for full details).  Again, UNIX Platforms
+**MUST** raise illegal instruction on implementations that do not support
+VL or SUBVL.
+
+It is permitted to limit the size of either (or both) the register files
+down to the original size of the standard RV architecture.  However, reducing
+either below the mandatory limits set in the RV standard will result in
+non-compliance with the SV Specification.
+
diff --git a/simple_v_extension/specification.mdwn b/simple_v_extension/specification.mdwn

index 6b57f8c5a8064b7dc6a431ab2217ae8fcbd33d9e..3c7caf3374dc3195d4b6f1cc7d16b1213c39705a 100644 (file)
--- a/simple_v_extension/specification.mdwn
+++ b/simple_v_extension/specification.mdwn
@@ -3,7 +3,10 @@
  * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
  * Status: DRAFTv0.6
  * Last edited: 21 jun 2019
-* Ancillary resource: [[opcodes]] [[sv_prefix_proposal]]
+* Ancillary resource: [[opcodes]]
+* Ancillary resource: [[sv_prefix_proposal]]
+* Ancillary resource: [[abridged_spec]]
+* Ancillary resource: [[vblock_format]]
  
  With thanks to:
  
@@ -2380,186 +2383,7 @@ No specific hints are yet defined in Simple-V
  
  # Vector Block Format <a name="vliw-format"></a>
  
-One issue with a former revision of SV was the setup and teardown
-time of the CSRs.  The cost of the use of a full CSRRW (requiring LI)
-to set up registers and predicates was quite high.  A VLIW-like format
-therefore makes sense (named VBLOCK), and is conceptually reminiscent of
-the ARM Thumb2 "IT" instruction.
-
-The format is:
-
-* the standard RISC-V 80 to 192 bit encoding sequence, with bits
-  defining the options to follow within the block
-* An optional VL Block (16-bit)
-* Optional predicate entries (8/16-bit blocks: see Predicate Table, above)
-* Optional register entries (8/16-bit blocks: see Register Table, above)
-* finally some 16/32/48 bit standard RV or SVPrefix opcodes follow.
-
-Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used
-as follows:
-
-| base+4 ... base+2          | base             | number of bits             |
-| ------ -----------------   | ---------------- | -------------------------- |
-| ..xxxx  xxxxxxxxxxxxxxxx   | xnnnxxxxx1111111 | (80+16\*nnn)-bit, nnn!=111 |
-| {ops}{Pred}{Reg}{VL Block} | SV Prefix        |                            |
-
-A suitable prefix, which fits the Expanded Instruction-Length encoding
-for "(80 + 16 times instruction-length)", as defined in Section 1.5
-of the RISC-V ISA, is as follows:
-
-| 15    | 14:12 | 11:10 | 9:8   | 7    | 6:0     |
-| -     | ----- | ----- | ----- | ---  | ------- |
-| vlset | 16xil | pplen | rplen | mode | 1111111 |
-
-The VL/MAXVL/SubVL Block format:
-
-| 31-30 | 29:28 | 27:22  | 21:17  - 16  |
-| -     | ----- | ------ | ------ - -   |
-| 0     | SubVL | VLdest | VLEN     vlt |
-| 1     | SubVL | VLdest | VLEN         |
-
-Note: this format is very similar to that used in [[sv_prefix_proposal]]
-
-If vlt is 0, VLEN is a 5 bit immediate value, offset by one (i.e
-a bit sequence of 0b00000 represents VL=1 and so on). If vlt is 1,
-it specifies the scalar register from which VL is set by this VBLOCK
-instruction group. VL, whether set from the register or the immediate,
-is then modified (truncated) to be MIN(VL, MAXVL), and the result stored
-in the scalar register specified in VLdest. If VLdest is zero, no store
-in the regfile occurs (however VL is still set).
-
-This option will typically be used to start vectorised loops, where
-the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL"
-sequence (in compact form).
-
-When bit 15 is set to 1, MAXVL and VL are both set to the immediate,
-VLEN (again, offset by one), which is 6 bits in length, and the same
-value stored in scalar register VLdest (if that register is nonzero).
-A value of 0b000000 will set MAXVL=VL=1, a value of 0b000001 will
-set MAXVL=VL= 2 and so on.
-
-This option will typically not be used so much for loops as it will be
-for one-off instructions such as saving the entire register file to the
-stack with a single one-off Vectorised and predicated LD/ST, or as a way
-to save or restore registers in a function call with a single instruction.
-
-CSRs needed:
-
-* mepcvliw
-* sepcvliw
-* uepcvliw
-* hepcvliw
-
-Notes:
-
-* Bit 7 specifies if the prefix block format is the full 16 bit format
-  (1) or the compact less expressive format (0). In the 8 bit format,
-  pplen is multiplied by 2.
-* 8 bit format predicate numbering is implicit and begins from x9. Thus
-  it is critical to put blocks in the correct order as required.
-* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit
-  (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number
-  of entries are needed the last may be set to 0x00, indicating "unused".
-* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block
-  immediately follows the VBLOCK instruction Prefix
-* Bits 8 and 9 define how many RegCam entries (0 to 3 if bit 15 is 1,
-  otherwise 0 to 6) follow the (optional) VL Block.
-* Bits 10 and 11 define how many PredCam entries (0 to 3 if bit 7 is 1,
-  otherwise 0 to 6) follow the (optional) RegCam entries
-* Bits 14 to 12 (IL) define the actual length of the instruction: total
-  number of bits is 80 + 16 times IL.  Standard RV32, RVC and also
-  SVPrefix (P48/64-\*-Type) instructions fit into this space, after the
-  (optional) VL / RegCam / PredCam entries
-* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed
-  format *MUST* have the RegCam and PredCam entries applied to the
-  operation (and the Vectorisation loop activated)
-* P48 and P64 opcodes do **not** take their Register or predication
-  context from the VBLOCK tables: they do however have VL or SUBVL
-  applied (unless VLtyp or svlen are set).
-* At the end of the VBLOCK Group, the RegCam and PredCam entries
-  *no longer apply*.  VL, MAXVL and SUBVL on the other hand remain at
-  the values set by the last instruction (whether a CSRRW or the VL
-  Block header).
-* Although an inefficient use of resources, it is fine to set the MAXVL,
-  VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.
-
-All this would greatly reduce the amount of space utilised by Vectorised
-instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes:
-the CSR itself, a LI, and the setting up of the value into the RS
-register of the CSR, which, again, requires a LI / LUI to get the 32
-bit data into the CSR.  To get 64-bit data into the register in order
-to put it into the CSR(s), LOAD operations from memory are needed!
-
-Given that each 64-bit CSR can hold only 4x PredCAM entries (or 4 RegCAM
-entries), that's potentially 6 to eight 32-bit instructions, just to
-establish the Vector State!
-
-Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more
-bits if VL needs to be set to greater than 32).  Bear in mind that in SV,
-both MAXVL and VL need to be set.
-
-By contrast, the VBLOCK prefix is only 16 bits, the VL/MAX/SubVL block is
-only 16 bits, and as long as not too many predicates and register vector
-qualifiers are specified, several 32-bit and 16-bit opcodes can fit into
-the format. If the full flexibility of the 16 bit block formats are not
-needed, more space is saved by using the 8 bit formats.
-
-In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
-into a VBLOCK format makes a lot of sense.
-
-Bear in mind the warning in an earlier section that use of VLtyp or svlen
-in a P48 or P64 opcode within a VBLOCK Group will result in corruption
-(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
-avoid this situation, the STATE CSR may be copied into a temp register
-and restored afterwards.
-
-Open Questions:
-
-* Is it necessary to stick to the RISC-V 1.5 format?  Why not go with
-  using the 15th bit to allow 80 + 16\*0bnnnn bits?  Perhaps to be sane,
-  limit to 256 bits (16 times 0-11).
-* Could a "hint" be used to set which operations are parallel and which
-  are sequential?
-* Could a new sub-instruction opcode format be used, one that does not
-  conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes?
-  no need for byte or bit-alignment
-* Could a hardware compression algorithm be deployed?  Quite likely,
-  because of the sub-execution context (sub-VBLOCK PC)
-
-## Limitations on instructions.
-
-To greatly simplify implementations, it is required to treat the VBLOCK
-group as a separate sub-program with its own separate PC. The sub-pc
-advances separately whilst the main PC remains pointing at the beginning
-of the VBLOCK instruction (not to be confused with how VL works, which
-is exactly the same principle, except it is VStart in the STATE CSR
-that increments).
-
-This has implications, namely that a new set of CSRs identical to xepc
-(mepc, srpc, hepc and uepc) must be created and managed and respected
-as being a sub extension of the xepc set of CSRs.  Thus, xepcvliw CSRs
-must be context switched and saved / restored in traps.
-
-The srcoffs and destoffs indices in the STATE CSR may be similarly
-regarded as another sub-execution context, giving in effect two sets of
-nested sub-levels of the RISCV Program Counter (actually, three including
-SUBVL and ssvoffs).
-
-In addition, as xepcvliw CSRs are relative to the beginning of the VBLOCK,
-branches MUST be restricted to within (relative to) the block,
-i.e. addressing is now restricted to the start (and very short) length
-of the block.
-
-Also: calling subroutines is inadviseable, unless they can be entirely
-accomplished within a block.
-
-A normal jump, normal branch and a normal function call may only be taken
-by letting the VBLOCK group end, returning to "normal" standard RV mode,
-and then using standard RVC, 32 bit or P48/64-\*-type opcodes.
-
-## Links
-
-* <https://groups.google.com/d/msg/comp.arch/yIFmee-Cx-c/jRcf0evSAAAJ>
+See ancillary resource: [[vblock_format]]
  
  # Subsets of RV functionality
  
diff --git a/simple_v_extension/vblock_format.mdwn b/simple_v_extension/vblock_format.mdwn

new file mode 100644 (file)

index 0000000..5eda930
--- /dev/null
+++ b/simple_v_extension/vblock_format.mdwn
@@ -0,0 +1,188 @@
+# Simple-V (Parallelism Extension Proposal) Vector Block Format
+
+* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
+* Status: DRAFTv0.6
+* Last edited: 21 jun 2019
+
+[[!toc ]]
+
+# Vector Block Format <a name="vliw-format"></a>
+
+This is a way to give Vector and Predication Context to a group of
+standard scalar RISC-V instructions, in a highly compact form.
+
+The format is:
+
+* the standard RISC-V 80 to 192 bit encoding sequence, with bits
+  defining the options to follow within the block
+* An optional VL Block (16-bit)
+* Optional predicate entries (8/16-bit blocks: see Predicate Table, above)
+* Optional register entries (8/16-bit blocks: see Register Table, above)
+* finally some 16/32/48 bit standard RV or SVPrefix opcodes follow.
+
+Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used
+as follows:
+
+| base+4 ... base+2          | base             | number of bits             |
+| ------ -----------------   | ---------------- | -------------------------- |
+| ..xxxx  xxxxxxxxxxxxxxxx   | xnnnxxxxx1111111 | (80+16\*nnn)-bit, nnn!=111 |
+| {ops}{Pred}{Reg}{VL Block} | SV Prefix        |                            |
+
+A suitable prefix, which fits the Expanded Instruction-Length encoding
+for "(80 + 16 times instruction-length)", as defined in Section 1.5
+of the RISC-V ISA, is as follows:
+
+| 15    | 14:12 | 11:10 | 9:8   | 7    | 6:0     |
+| -     | ----- | ----- | ----- | ---  | ------- |
+| vlset | 16xil | pplen | rplen | mode | 1111111 |
+
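+As an illustration of the field layout in the table above, the 16-bit
+prefix can be unpacked as follows.  The function name and the returned
+dictionary keys are hypothetical; only the bit positions, the 0b1111111
+marker and the (80+16\*nnn) length rule with nnn!=111 are taken from
+the tables.
+
```python
def decode_vblock_prefix(prefix):
    """Split a 16-bit VBLOCK prefix into its fields (illustrative only)."""
    if prefix & 0x7F != 0x7F:
        raise ValueError("bits 6:0 are not 1111111: not a VBLOCK prefix")
    il = (prefix >> 12) & 0x7
    if il == 0x7:
        raise ValueError("nnn=111 is excluded from this length encoding")
    return {
        "vlset": (prefix >> 15) & 0x1,   # bit 15
        "il": il,                        # bits 14:12 (16xil)
        "pplen": (prefix >> 10) & 0x3,   # bits 11:10
        "rplen": (prefix >> 8) & 0x3,    # bits 9:8
        "mode": (prefix >> 7) & 0x1,     # bit 7
        "total_bits": 80 + 16 * il,      # overall VBLOCK length
    }
```
+
+A prefix of 0b1010011011111111, for example, decodes to vlset=1, il=2,
+pplen=1, rplen=2 and mode=1, giving a total block length of 112 bits.
+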
+The VL/MAXVL/SubVL Block format:
+
+| 31-30 | 29:28 | 27:22  | 21:17  - 16  |
+| -     | ----- | ------ | ------ - -   |
+| 0     | SubVL | VLdest | VLEN     vlt |
+| 1     | SubVL | VLdest | VLEN         |
+
+Note: this format is very similar to that used in [[sv_prefix_proposal]]
+
+If vlt is 0, VLEN is a 5 bit immediate value, offset by one (i.e.
+a bit sequence of 0b00000 represents VL=1 and so on). If vlt is 1,
+it specifies the scalar register from which VL is set by this VBLOCK
+instruction group. VL, whether set from the register or the immediate,
+is then modified (truncated) to be MIN(VL, MAXVL), and the result stored
+in the scalar register specified in VLdest. If VLdest is zero, no store
+in the regfile occurs (however VL is still set).
+
+This option will typically be used to start vectorised loops, where
+the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL"
+sequence (in compact form).
+
+When bit 15 is set to 1, MAXVL and VL are both set to the immediate,
+VLEN (again, offset by one), which is 6 bits in length, and the same
+value stored in scalar register VLdest (if that register is nonzero).
+A value of 0b000000 will set MAXVL=VL=1, a value of 0b000001 will
+set MAXVL=VL= 2 and so on.
+
+This option will typically not be used so much for loops as it will be
+for one-off instructions such as saving the entire register file to the
+stack with a single one-off Vectorised and predicated LD/ST, or as a way
+to save or restore registers in a function call with a single instruction.
+
+CSRs needed:
+
+* mepcvliw
+* sepcvliw
+* uepcvliw
+* hepcvliw
+
+Notes:
+
+* Bit 7 specifies if the prefix block format is the full 16 bit format
+  (1) or the compact less expressive format (0). In the 8 bit format,
+  pplen is multiplied by 2.
+* 8 bit format predicate numbering is implicit and begins from x9. Thus
+  it is critical to put blocks in the correct order as required.
+* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit
+  (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number
+  of entries are needed the last may be set to 0x00, indicating "unused".
+* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block
+  immediately follows the VBLOCK instruction Prefix
+* Bits 8 and 9 define how many RegCam entries (0 to 3 if bit 7 is 1,
+  otherwise 0 to 6) follow the (optional) VL Block.
+* Bits 10 and 11 define how many PredCam entries (0 to 3 if bit 7 is 1,
+  otherwise 0 to 6) follow the (optional) RegCam entries
+* Bits 14 to 12 (IL) define the actual length of the instruction: total
+  number of bits is 80 + 16 times IL.  Standard RV32, RVC and also
+  SVPrefix (P48/64-\*-Type) instructions fit into this space, after the
+  (optional) VL / RegCam / PredCam entries
+* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed
+  format *MUST* have the RegCam and PredCam entries applied to the
+  operation (and the Vectorisation loop activated)
+* P48 and P64 opcodes do **not** take their Register or predication
+  context from the VBLOCK tables: they do however have VL or SUBVL
+  applied (unless VLtyp or svlen are set).
+* At the end of the VBLOCK Group, the RegCam and PredCam entries
+  *no longer apply*.  VL, MAXVL and SUBVL on the other hand remain at
+  the values set by the last instruction (whether a CSRRW or the VL
+  Block header).
+* Although an inefficient use of resources, it is fine to set the MAXVL,
+  VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.
+
+All this would greatly reduce the amount of space utilised by Vectorised
+instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes:
+the CSR itself, a LI, and the setting up of the value into the RS
+register of the CSR, which, again, requires a LI / LUI to get the 32
+bit data into the CSR.  To get 64-bit data into the register in order
+to put it into the CSR(s), LOAD operations from memory are needed!
+
+Given that each 64-bit CSR can hold only 4x PredCAM entries (or 4 RegCAM
+entries), that's potentially 6 to 8 32-bit instructions, just to
+establish the Vector State!
+
+Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more
+bits if VL needs to be set to greater than 32).  Bear in mind that in SV,
+both MAXVL and VL need to be set.
+
+By contrast, the VBLOCK prefix is only 16 bits, the VL/MAXVL/SUBVL block
+is only 16 bits, and as long as not too many predicates and register
+vector qualifiers are specified, several 32-bit and 16-bit opcodes can
+fit into the format. If the full flexibility of the 16-bit block formats
+is not needed, more space is saved by using the 8-bit formats.
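
A rough space budget makes the comparison concrete. The sketch below is a minimal illustration, not part of the specification: `opcode_space_bits` is a hypothetical helper, and it assumes that entries are packed back to back with no alignment padding, which the text above does not state.

```python
def opcode_space_bits(il: int, has_vl_block: bool,
                      n_regcam: int, n_predcam: int,
                      entry_bits: int = 16) -> int:
    """Bits left for opcodes in a VBLOCK of 80 + 16*IL bits.

    Assumption: every RegCam / PredCam entry costs entry_bits (8 or 16)
    and no padding is inserted between the tables and the opcodes.
    """
    total = 80 + 16 * il
    overhead = 16                           # VBLOCK prefix
    overhead += 16 if has_vl_block else 0   # optional VL/MAXVL/SUBVL block
    overhead += entry_bits * (n_regcam + n_predcam)
    remaining = total - overhead
    if remaining < 0:
        raise ValueError("tables do not fit in the chosen block length")
    return remaining


# Longest block (IL = 7), VL block present, two RegCam and one PredCam entry.
wide = opcode_space_bits(7, True, 2, 1, entry_bits=16)
narrow = opcode_space_bits(7, True, 2, 1, entry_bits=8)
```

Under these assumptions the 16-bit entry format leaves 112 bits (for example three 32-bit opcodes plus one 16-bit opcode), and the 8-bit entry format leaves 136 bits.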
+
+In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
+into a VBLOCK format makes a lot of sense.
+
+Bear in mind the warning in an earlier section that use of VLtyp or svlen
+in a P48 or P64 opcode within a VBLOCK Group will result in corruption
+(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
+avoid this situation, the STATE CSR may be copied into a temp register
+and restored afterwards.
+
+Open Questions:
+
+* Is it necessary to stick to the RISC-V 1.5 format?  Why not go with
+  using the 15th bit to allow 80 + 16\*0bnnnn bits?  Perhaps to be sane,
+  limit to 256 bits (80 + 16 times 0-11).
+* Could a "hint" be used to set which operations are parallel and which
+  are sequential?
+* Could a new sub-instruction opcode format be used, one that does not
+  conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes?
+  There would then be no need for byte or bit-alignment.
+* Could a hardware compression algorithm be deployed?  Quite likely,
+  because of the sub-execution context (sub-VBLOCK PC)
+
+## Limitations on instructions
+
+To greatly simplify implementations, it is required to treat the VBLOCK
+group as a separate sub-program with its own separate PC. The sub-PC
+advances separately whilst the main PC remains pointing at the beginning
+of the VBLOCK instruction. This is not to be confused with how VL works:
+VL follows exactly the same principle, except that it is VStart in the
+STATE CSR that increments.
+
+This has implications, namely that a new set of CSRs identical to xepc
+(mepc, sepc, hepc and uepc) must be created and managed and respected
+as being a sub extension of the xepc set of CSRs.  Thus, xepcvliw CSRs
+must be context switched and saved / restored in traps.
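
The save / restore requirement can be pictured with a small model. It is a sketch under stated assumptions, not an implementation: the `Hart` class and its field names are hypothetical, a single privilege level is modelled, and the sub-PC is treated as an opaque offset whose units are not specified here.

```python
from dataclasses import dataclass


@dataclass
class Hart:
    pc: int = 0        # main PC: stays at the start of the VBLOCK
    pcvblk: int = 0    # sub-PC: position within the VBLOCK group
    epc: int = 0       # stands in for xepc
    epcvblk: int = 0   # stands in for the companion xepcvliw CSR

    def step_in_vblock(self, delta: int) -> None:
        """Execute one operation inside the block: only the sub-PC moves."""
        self.pcvblk += delta

    def trap(self, handler: int) -> None:
        """Enter a trap: both the PC and the sub-PC must be saved."""
        self.epc, self.epcvblk = self.pc, self.pcvblk
        self.pc, self.pcvblk = handler, 0

    def trap_return(self) -> None:
        """Leave the trap: resume mid-block at the saved sub-PC."""
        self.pc, self.pcvblk = self.epc, self.epcvblk


hart = Hart(pc=0x1000)
hart.step_in_vblock(4)
hart.step_in_vblock(2)
hart.trap(handler=0x80)
hart.trap_return()
```

If only `pc` were saved, the return would restart the block from its first operation; saving the sub-PC alongside it is what lets execution resume where it was interrupted.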
+
+The srcoffs and destoffs indices in the STATE CSR may be similarly
+regarded as another sub-execution context, giving in effect two sets of
+nested sub-levels of the RISCV Program Counter (actually, three including
+SUBVL and ssvoffs).
+
+In addition, as xepcvliw CSRs are relative to the beginning of the VBLOCK,
+branches MUST be restricted to within (relative to) the block,
+i.e. addressing is now restricted to offsets from the start of the block
+that fall within its (very short) length.
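
The restriction amounts to a simple range check on the branch target. The helper below is a hypothetical sketch: it assumes offsets and the block length share the same units, and it treats a target equal to the block length as out of range, which the text above does not settle.

```python
def branch_stays_in_block(sub_pc: int, offset: int, block_len: int) -> bool:
    """True if a relative branch from sub_pc lands inside the VBLOCK."""
    target = sub_pc + offset
    return 0 <= target < block_len


backward_ok = branch_stays_in_block(sub_pc=8, offset=-8, block_len=14)
forward_bad = branch_stays_in_block(sub_pc=8, offset=16, block_len=14)
```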
+
+Also: calling subroutines is inadvisable, unless they can be entirely
+accomplished within a block.
+
+A normal jump, normal branch and a normal function call may only be taken
+by letting the VBLOCK group end, returning to "normal" standard RV mode,
+and then using standard RVC, 32 bit or P48/64-\*-type opcodes.
+
+## Links
+
+* <https://groups.google.com/d/msg/comp.arch/yIFmee-Cx-c/jRcf0evSAAAJ>
+