+# Instructions
+
+Despite being a 98% complete and accurate topological remap of RVV
+concepts and functionality, the only instructions needed are VSETVL
+and VGETVL. *All* RVV instructions can be re-mapped; however, xBitManip
+becomes a critical dependency for efficient manipulation of predication
+masks (as a bit-field). Despite the removal of all but VSETVL and VGETVL,
+*all instructions from RVV are topologically re-mapped and retain their
+complete functionality, intact*.
+
+Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
+equivalents, so are left out of Simple-V. VSELECT could be included if
+there existed a MV.X instruction in RV (MV.X is a hypothetical
+non-immediate variant of MV that would allow another register to
+specify which register was to be copied). Note that if any of these three
+instructions are added to any given RV extension, their functionality
+will be inherently parallelised.
+
+## Instruction Format
+
+The instruction format for Simple-V does not actually have *any* explicit
+compare operations, *any* arithmetic, floating point or *any*
+memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations, MV,
+FCVT, and LOAD/STORE
+depending on CSR configurations for bitwidth and
+predication. **Everything** becomes parallelised. *This includes
+Compressed instructions* as well as any
+future instructions and Custom Extensions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+ outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+ for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead. Parallelism therefore effectively may end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
+with some opportunities to use Compressed bringing it down to 48.
+Also to consider is whether one or both of the last two remaining Compressed
+instruction codes in Quadrant 1 could be used as a parallelism prefix,
+bringing parallelised opcodes down to 32-bit (when combined with C)
+and having the benefit of being explicit.*
+
+## VSETVL
+
+NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
+with the instruction format remaining the same.
+
+VSETVL is slightly different from RVV in that the minimum vector length
+is required to be at least the number of registers in the register file,
+and no more than XLEN. This allows vector LOAD/STORE to be used to switch
+the entire bank of registers using a single instruction (see Appendix,
+"Context Switch Example"). The reason for limiting VSETVL to XLEN is
+down to the fact that predication bits fit into a single register of length
+XLEN bits.
+
+The second change is that when VSETVL is requested to be stored
+into x0, it is *ignored* silently (VSETVL x0, x5, #4).
+
+The third change is that there is an additional immediate added to VSETVL,
+to which VL is set after first going through MIN-filtering.
+So when using the "vsetvl rs1, rs2, #vlen" instruction, it becomes:
+
+    VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+
+where RegfileLen <= MAXVECTORDEPTH < XLEN
+
+This has implications for the microarchitecture, as VL is required to be
+set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
+requested in the #immediate parameter. RVV has the option to set VL
+to an arbitrary value that suits the conditions and the micro-architecture:
+SV does *not* permit that.
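+
+As an illustration only, the MIN-filtering above can be modelled in a few
+lines of Python. The function name and the example values (MAXVECTORDEPTH
+of 32) are assumptions made for this sketch, and any special handling of
+x0 as an operand is not modelled:
+

```python
def vsetvl(vlen_imm, rs2_value, max_vector_depth):
    """Return the new VL: the immediate, limited by MAXVECTORDEPTH, then by rs2."""
    return min(min(vlen_imm, max_vector_depth), rs2_value)


# Hypothetical configuration: MAXVECTORDEPTH = 32.
print(vsetvl(4, 100, 32))   # -> 4  (the immediate is honoured exactly)
print(vsetvl(48, 100, 32))  # -> 32 (limited by MAXVECTORDEPTH)
print(vsetvl(48, 7, 32))    # -> 7  (limited by the value in rs2)
```

+
+The point of the sketch is that the result is fully determined by the
+three inputs: there is no implementation-chosen value, unlike RVV.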
+
+The reason is so that if SV is to be used for a context-switch or as a
+substitute for LOAD/STORE-Multiple, the operation can be done with only
+2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
+single LD/ST operation). If VL does *not* get set to the register file
+length when VSETVL is called, then a software-loop would be needed.
+To avoid this need, VL *must* be set to exactly what is requested
+(limits notwithstanding).
+
+Therefore, in turn, unlike RVV, implementors *must* provide
+pseudo-parallelism (using sequential loops in hardware) if actual
+hardware-parallelism in the ALUs is not deployed. A hybrid is also
+permitted (as used in Broadcom's VideoCore-IV) however this must be
+*entirely* transparent to the ISA.
+
+## Branch Instruction:
+
+Branch operations use standard RV opcodes that are reinterpreted to
+be "predicate variants" whenever either of the two src
+registers is marked as a vector (isvector=1). When this reinterpretation
+is enabled the "immediate" field of the branch operation is taken to be a
+predication target register, rs3. The predicate target register rs3 is
+to be treated as a bitfield (up to a maximum of XLEN bits corresponding
+to a maximum of XLEN elements).
+
+If either src1 or src2 is a scalar (CSRvectorlen[src] == 0) the comparison
+goes ahead as vector-scalar or scalar-vector. Implementors should note that
+this could require considerable multi-porting of the register file in order
+to parallelise properly, so may have to involve the use of register caching
+and transparent copying (see Multiple-Banked Register File Architectures
+paper).
+
+In instances where no vectorisation is detected on either src register
+the operation is treated as an absolutely standard scalar branch operation.
+
+This is the overloaded table for Integer-base Branch operations. Opcode
+(bits 6..0) is set in all cases to 1100011.
+
+[[!table data="""
+31 .. 25 |24 ... 20 | 19 .. 15 | 14 .. 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+7 | 5 | 5 | 3 | 4 | 1 | 7 |
+reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
+reserved | src2 | src1 | 001 | predicate rs3 || BNE |
+reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+reserved | src2 | src1 | 100 | predicate rs3 || BLT |
+reserved | src2 | src1 | 101 | predicate rs3 || BGE |
+reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
+reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
+"""]]
+
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BLE, BGT, BLEU and BGTU may be synthesised by swapping
+src1 and src2.
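+
+A possible decoder for the overloaded fields in the table above is
+sketched below in Python. Reading bits 11..7 as one plain 5-bit register
+number is an assumption (the table only says that the two immediate fields
+together hold rs3), and the names `PredBranch` and `decode_pred_branch`
+are hypothetical:
+

```python
from typing import NamedTuple

BRANCH_OPCODE = 0b1100011  # bits 6..0, as stated above


class PredBranch(NamedTuple):
    funct3: int    # comparison type (BEQ, BNE, ...)
    src1: int
    src2: int
    pred_rs3: int  # predicate target register


def decode_pred_branch(insn: int) -> PredBranch:
    """Split a 32-bit overloaded branch instruction into its fields."""
    if insn & 0x7F != BRANCH_OPCODE:
        raise ValueError("not a BRANCH opcode")
    return PredBranch(
        funct3=(insn >> 12) & 0x7,
        src1=(insn >> 15) & 0x1F,
        src2=(insn >> 20) & 0x1F,
        pred_rs3=(insn >> 7) & 0x1F,  # assumption: bits 11..7 read as one number
    )


# Example: funct3 = 001 compare of x5 against x6, predicate result into x9.
example = (6 << 20) | (5 << 15) | (0b001 << 12) | (9 << 7) | BRANCH_OPCODE
print(decode_pred_branch(example))
```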
+
+Below is the overloaded table for Floating-point Predication operations.
+Interestingly no change is needed to the instruction format because
+FP Compare already stores a 1 or a zero in its "rd" integer register
+target, i.e. it's not actually a Branch at all: it's a compare.
+The target needs to simply change to be a predication bitfield (done
+implicitly).
+
+As with
+Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
+Likewise Single-precision (fmt bits 26..25) is still set to 00.
+Double-precision is still set to 01, whilst Quad-precision
+appears not to have a definition in V2.3-Draft (but should be unaffected).
+
+It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
+and whilst in ordinary branch code this is fine because the standard
+RVF compare can always be followed up with an integer BEQ or a BNE (or
+a compressed comparison to zero or non-zero), in predication terms the
+missing FNE has more of an impact. To deal with this, SV's predication has
+had "invert" added to it.
+
+[[!table data="""
+31 .. 27| 26 .. 25 |24 ... 20 | 19 .. 15 | 14 .. 12 | 11 .. 7 | 6 ... 0 |
+funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
+5 | 2 | 5 | 5 | 3 | 5 | 7 |
+10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
+10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | rsvd |
+10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
+10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
+"""]]
+
+Note (**TBD**): floating-point exceptions will need to be extended
+to cater for multiple exceptions (and statuses of the same). The
+usual approach is to have an array of status codes and bit-fields,
+and one exception, rather than throw separate exceptions for each
+Vector element.
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+                           s2 ? vreg[rs2][i] : sreg[rs2]);
+
+With associated predication, vector-length adjustments and so on,
+and temporarily ignoring bitwidth (which makes the comparisons more
+complex), this becomes:
+
+    if I/F == INT: # integer type cmp
+        preg = int_pred_reg[rd]
+        reg = int_regfile
+    else:
+        preg = fp_pred_reg[rd]
+        reg = fp_regfile
+
+    s1 = reg_is_vectorised(src1);
+    s2 = reg_is_vectorised(src2);
+    if (!s2 && !s1) goto branch; # both scalar: standard branch
+    for (int i = 0; i < VL; ++i)
+      if (cmp(s1 ? reg[src1+i]:reg[src1],
+              s2 ? reg[src2+i]:reg[src2]))
+        preg[rs3] |= 1<<i;  # bitfield not vector
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+ into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+ Reordering") setting Vector-Length times (number of SIMD elements) bits
+ in Predicate Register rs3 as opposed to just Vector-Length bits.
+* Predicated Branches do not actually have an adjustment to the Program
+ Counter, so all of bits 25 through 30 in every case are not needed.
+* There are plenty of reserved opcodes for which bits 25 through 30 could
+ be put to good use if there is a suitable use-case.
+* FLT and FLE may be inverted to FGT and FGE if needed by swapping
+ src1 and src2 (likewise the integer counterparts).
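+
+The comparison loop above can be turned into a small runnable Python
+model. This is a sketch under simplifying assumptions: the register file
+is a flat list, bitwidth and the "invert" option are ignored, the
+function returns a fresh bitfield instead of OR-ing into an existing
+predicate register, and all names are hypothetical:
+

```python
import operator


def predicated_compare(cmp, regfile, src1, src2, s1_is_vec, s2_is_vec, vl):
    """Return a predicate bitfield with bit i set when cmp holds for element i."""
    if not (s1_is_vec or s2_is_vec):
        raise ValueError("both operands scalar: treat as a standard branch")
    mask = 0
    for i in range(vl):
        a = regfile[src1 + i] if s1_is_vec else regfile[src1]
        b = regfile[src2 + i] if s2_is_vec else regfile[src2]
        if cmp(a, b):
            mask |= 1 << i  # one bit per element, not one register per element
    return mask


regfile = [0] * 32
regfile[8:12] = [3, 7, 5, 9]  # vector operand held in x8..x11
regfile[4] = 5                # scalar operand held in x4
# vector-scalar "less than": only element 0 (value 3) is below 5
print(bin(predicated_compare(operator.lt, regfile, 8, 4, True, False, 4)))
```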
+
+## Compressed Branch Instruction:
+
+Compressed Branch instructions are likewise re-interpreted as predicated
+2-register operations, with the result going into rs3. All the bits of
+the immediate are re-interpreted for different purposes, to extend the
+number of comparator operations to beyond the original specification,
+but also to cater for floating-point comparisons as well as integer ones.
+
+[[!table data="""
+15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
+funct3 | imm | rs1' | imm | | op | |
+3 | 3 | 3 | 2 | 3 | 2 | |
+C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
+110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
+111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
+110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
+111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
+"""]]
+
+Notes:
+
+* Bits 5, 13, 14 and 15 make up the comparator type
+* Bit 6 indicates whether to use integer or floating-point comparisons
+* In both floating-point and integer cases there are four predication
+ comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
+ src1 and src2).
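+
+As a sketch of how the compressed table might be decoded, the Python
+below extracts the fields by bit position. The 3-bit register fields are
+returned raw (whether they map onto x8-x15 as in standard RVC is not
+stated here), the meaning of each value of bit 6 is left uninterpreted,
+and the function and dictionary names are hypothetical:
+

```python
# (funct3, bit 5) -> comparator name, taken from the table above
COMPARATORS = {
    (0b110, 0): "P.EQ",
    (0b111, 0): "P.NE",
    (0b110, 1): "P.LT",
    (0b111, 1): "P.LE",
}


def decode_c_pred_branch(insn):
    """Split a 16-bit overloaded compressed branch into its fields."""
    if insn & 0b11 != 0b01:
        raise ValueError("not in Compressed quadrant C1")
    key = ((insn >> 13) & 0x7, (insn >> 5) & 0x1)
    if key not in COMPARATORS:
        raise ValueError("funct3 is not a compressed branch")
    return {
        "name": COMPARATORS[key],
        "pred_rs3": (insn >> 10) & 0x7,
        "src1": (insn >> 7) & 0x7,
        "if_bit": (insn >> 6) & 0x1,  # integer / floating-point selector
        "src2": (insn >> 2) & 0x7,
    }


example = (0b111 << 13) | (5 << 10) | (2 << 7) | (1 << 6) | (1 << 5) | (3 << 2) | 0b01
print(decode_c_pred_branch(example))
```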
+
+## LOAD / STORE Instructions <a name="load_store"></a>
+
+For full analysis of topological adaptation of RVV LOAD/STORE
+see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
+
+Revised LOAD:
+
+[[!table data="""
+31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:0] |||| rs1 | funct3 | rd | opcode |
+1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
+? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
+"""]]
+
+The exact same corresponding adaptation is also carried out on the single,
+double and quad precision floating-point LOAD-FP and STORE-FP operations,
+which fit the exact same instruction format. Thus all three types
+(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
+as well as FSW, FSD and FSQ.
+
+Notes:
+
+* LOAD remains functionally (topologically) identical to RVV LOAD
+ (for both integer and floating-point variants).
+* Predication CSR-marking register is not explicitly shown in instruction: it's
+ implicit, based on the CSR predicate state for the rd (destination) register
+* rs2, the source, may *also be marked as a vector*, which implicitly
+ is taken to indicate "Indexed Load" (LD.X)
+* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
+* Bit 31 is reserved (ideas under consideration: auto-increment)
+* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
+* **TODO**: clarify where width maps to elsize
+
+Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+
+    preg = int_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[rd] & 1<<i)
+        for (int j=0; j<seglen+1; j++)
+        {
+          if (CSRvectorised[rs2])
+             offs = vreg[rs2+i]  // indexed: offset taken from vector rs2
+          else
+             offs = i*(seglen+1)*stride;
+          vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
+        }
+
+Taking CSR (SIMD) bitwidth into account involves using the vector
+length and register encoding according to the "Bitwidth Virtual Register
+Reordering" scheme shown in the Appendix (see function "regoffs").
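+
+A runnable Python approximation of the LOAD pseudo-code is given below,
+purely as an illustration. It assumes seglen = 0 (no segments), a flat
+register list where element i of the destination vector lives in
+register rd+i, little-endian byte-addressed memory, and it leaves
+masked-out destinations untouched. The function name and parameters are
+hypothetical:
+

```python
from typing import List, Optional


def sv_load(mem: bytes, regs: List[int], rd: int, base: int, vl: int,
            pred: int, elsize: int, stride: Optional[int] = None,
            index_reg: Optional[int] = None) -> None:
    """Load up to vl elements into regs[rd], regs[rd+1], ... (seglen = 0)."""
    step = elsize if stride is None else stride  # unit- or constant-stride
    for i in range(vl):
        if not (pred >> i) & 1:
            continue  # element masked out: destination left untouched
        if index_reg is not None:
            offs = regs[index_reg + i]  # indexed: offset comes from a vector
        else:
            offs = i * step
        addr = base + offs
        regs[rd + i] = int.from_bytes(mem[addr:addr + elsize], "little")


mem = bytes(range(64))
regs = [0] * 32
# unit-stride load of four 32-bit elements, element 2 masked out
sv_load(mem, regs, rd=8, base=0, vl=4, pred=0b1011, elsize=4)
print([hex(r) for r in regs[8:12]])
```

+
+Passing `stride=16` selects the constant-stride form, and passing
+`index_reg` selects the indexed form, mirroring the three cases that the
+one base LOAD instruction is overloaded with.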
+
+A similar instruction exists for STORE, with identical topological
+translation of all features. **TODO**
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE share the same format, except that for STORE
+bits 2-4 are a src register instead of dest:
+
+[[!table data="""
+15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
+funct3 | imm | rs1' | imm | rd' | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
+"""]]
+
+Unfortunately it is not possible to fit the full functionality
+of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
+require another operand (rs2) in addition to the operand width
+(which is also missing), offset, base, and src/dest.
+
+However a close approximation may be achieved by taking the top bit
+of the offset in each of the five types of LD (and ST), reducing the
+offset to 4 bits and utilising the 5th bit to indicate whether "stride"
+is to be enabled. In this way it is at least possible to introduce
+that functionality.
+
+(**TODO**: *assess whether the loss of one bit from offset is worth having
+"stride" capability.*)
+
+We also assume (including for the "stride" variant) that the "width"
+parameter, which is missing, is derived and implicit, just as it is
+with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
+and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
+C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
+
+Interestingly we note that the Vectorised Simple-V variant of
+LOAD/STORE (Compressed and otherwise), due to it effectively using the
+standard register file(s), is the direct functional equivalent of
+standard load-multiple and store-multiple instructions found in other
+processors.
+
+Section 12.3 of the riscv-isa manual V2.3-draft comments (page 76) that
+"For virtual memory systems some data accesses could be resident
+in physical memory and some not". The interesting question then arises:
+how does RVV deal with the exact same scenario?
+Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
+of detecting early page / segmentation faults and adjusting the TLB
+in advance, accordingly: other strategies are explored in the Appendix
+Section "Virtual Memory Page Faults".
+
+## Vectorised Copy/Move (and conversion) instructions
+
+There is a series of 2-operand instructions involving copying (and
+alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
+follow the same pattern, as it is *both* the source *and* destination
+predication masks that are taken into account. This is different from
+the three-operand arithmetic instructions, where the predication mask
+is taken from the *destination* register, and applied uniformly to the
+elements of the source register(s), element-for-element.
+
+### C.MV Instruction <a name="c_mv"></a>
+
+There is no MV instruction in RV; however, there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
+
+If either the source or the destination register is marked as a vector,
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation. The actual instruction's format does not change:
+
+[[!table data="""
+15 12 | 11 7 | 6 2 | 1 0 |
+funct4 | rd | rs | op |
+4 | 5 | 5 | 2 |
+C.MV | dest | src | C2 |
+"""]]
+
+A simplified version of the pseudocode for this operation is as follows:
+
+    function op_mv(rd, rs) # MV not VMV!
+      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
+      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        # skip masked-out elements, never stepping beyond VL
+        if (int_vec[rs].isvec) while (i < VL && !(ps & 1<<i)) i++;
+        if (int_vec[rd].isvec) while (j < VL && !(pd & 1<<j)) j++;
+        if (i == VL || j == VL) break; # a mask ran out of set bits
+        ireg[rd+j] <= ireg[rs+i];
+        if (int_vec[rs].isvec) i++;
+        if (int_vec[rd].isvec) j++;
+
+Note that:
+
+* elwidth (SIMD) is not covered above
+* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
+ not covered
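+
+To make the twin-predication behaviour concrete, here is a Python sketch
+of the same loop. It is illustrative only: register remapping (regidx)
+and elwidth are left out, the predicate values are passed in directly,
+and ending after a single copy when the destination is scalar is this
+sketch's own assumption about the "ending early" case. All names are
+hypothetical:
+

```python
def sv_mv(regs, rd, rs, rd_isvec, rs_isvec, pd, ps, vl):
    """Twin-predicated move: copy active src elements to active dest slots."""
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            while i < vl and not (ps >> i) & 1:
                i += 1  # skip masked-out source elements
        if rd_isvec:
            while j < vl and not (pd >> j) & 1:
                j += 1  # skip masked-out destination elements
        if i == vl or j == vl:
            break  # one of the masks has no further bits set
        regs[rd + j] = regs[rs + i]
        if not rd_isvec:
            break  # assumption: a scalar destination ends after one copy
        if rs_isvec:
            i += 1
        j += 1


# Vector Gather: src predicate picks elements 1 and 3, dest is unmasked.
regs = list(range(100, 132))  # regs[k] holds 100 + k
sv_mv(regs, rd=16, rs=8, rd_isvec=True, rs_isvec=True, pd=0b1111, ps=0b1010, vl=4)
print(regs[16:20])  # -> [109, 111, 118, 119]
```

+
+Swapping the two masks in the call turns the same function into a
+scatter, and marking the source as scalar gives a (sparse) splat, which
+is how one opcode covers several RVV instructions.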
+
+There are several different instructions from RVV that are covered by
+this one opcode:
+
+[[!table data="""
+src | dest | predication | op |
+scalar | vector | none | VSPLAT |
+scalar | vector | destination | sparse VSPLAT |
+scalar | vector | 1-bit dest | VINSERT |
+vector | scalar | 1-bit? src | VEXTRACT |
+vector | vector | none | VCOPY |
+vector | vector | src | Vector Gather |
+vector | vector | dest | Vector Scatter |
+vector | vector | src & dest | Gather/Scatter |
+vector | vector | src == dest | sparse VCOPY |
+"""]]
+
+Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
+operations with inversion on the src and dest predication for one of the
+two C.MV operations.
+
+Note that in the instance where the Compressed Extension is not implemented,
+MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
+Note that the behaviour is **different** from C.MV because with addi the
+predication mask to use is taken **only** from rd and is applied against
+all elements: rd[i] = rs[i].
+
+### FMV, FNEG and FABS Instructions
+
+These are identical in form to C.MV, except covering floating-point
+register copying. The same double-predication rules also apply.
+However when elwidth is not set to default the instruction is implicitly
+and automatically converted to a (vectorised) floating-point type conversion
+operation of the appropriate size covering the source and destination
+register bitwidths.
+
+(Note that FMV, FNEG and FABS are all actually pseudo-instructions.)
+
+### FCVT Instructions
+
+These are again identical in form to C.MV, except that they cover
+floating-point to integer and integer to floating-point. When element
+width in each vector is set to default, the instructions behave exactly
+as they are defined for standard RV (scalar) operations, except vectorised
+in exactly the same fashion as outlined in C.MV.
+
+However when the source or destination element width is not set to default,
+the opcode's explicit element widths are *over-ridden* to new definitions,
+and the opcode's element width is taken as indicative of the SIMD width
+(if applicable i.e. if packed SIMD is requested) instead.
+
+For example FCVT.S.L would normally be used to convert a 64-bit
+integer in register rs1 to a single-precision (32-bit) floating-point
+number in rd.
+If however the source rs1 is set to be a vector, where elwidth is set to
+default/2 and "packed SIMD" is enabled, then the first 32 bits of
+rs1 are converted to a floating-point number to be stored in rd's
+first element and the higher 32-bits *also* converted to floating-point
+and stored in the second. The 32 bit size comes from the fact that
+FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
+divide that by two it means that rs1 element width is to be taken as 32.
+
+Similar rules apply to the destination register.
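+
+The FCVT.S.L example can be mimicked in Python as follows. This is a
+sketch with assumptions stated up front: the "first 32 bits" are taken
+to be the low-order half of rs1, each half is treated as a signed
+integer, and rounding to single precision is emulated with the `struct`
+module. The function name is hypothetical:
+

```python
import struct


def fcvt_packed_halves(rs1_value):
    """Convert the two 32-bit halves of a 64-bit register to single floats."""
    results = []
    for lane in range(2):  # lane 0 = low 32 bits, lane 1 = high 32 bits
        raw = (rs1_value >> (32 * lane)) & 0xFFFFFFFF
        signed = raw - (1 << 32) if raw & 0x80000000 else raw
        # round-trip through a 4-byte float to get single-precision rounding
        single = struct.unpack("<f", struct.pack("<f", float(signed)))[0]
        results.append(single)
    return results


# low half holds 7, high half holds -2 (two's complement)
print(fcvt_packed_halves((0xFFFFFFFE << 32) | 7))  # -> [7.0, -2.0]
```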
+