1 [[!oldstandards]]
2
3 # Simple-V (Parallelism Extension Proposal) Appendix (OBSOLETE)
4
5 **OBSOLETE**
6
7 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
8 * Status: DRAFTv0.6
9 * Last edited: 30 jun 2019
10 * main spec [[specification]]
11
12 [[!toc ]]
13
14 # Fail-on-first modes <a name="ffirst"></a>
15
16 Fail-on-first data dependency has different behaviour for traps than
17 for conditional testing. "Conditional" is taken to mean "anything
18 that is zero", however with traps, the first element has to
19 be given the opportunity to throw the exact same trap that would
20 be thrown if this were a scalar operation (when VL=1).
21
Note that implementors are required to mutually exclusively choose one
mode or the other: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies to
custom opcode writers as well as future extension writers.
26
27 ## Fail-on-first traps
28
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, that element
and all subsequent indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence element
that did not take the trap.
35
36 Note that predicated-out elements (where the predicate mask bit is
37 zero) are clearly excluded (i.e. the trap will not occur). However,
38 note that the loop still had to test the predicate bit: thus on return,
39 VL is set to include elements that did not take the trap *and* includes
40 the elements that were predicated (masked) out (not tested up to the
41 point where the trap occurred).
42
43 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
44 unaltered by setting zero or non-zero predication mode.
45
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst is not set); in subsequent
*sub-groups* of elements the trap must not occur. SUBVL will **NOT**
be modified. Traps must analyse (x)eSTATE (subvl offset indices) to
determine the element that caused the trap.
51
52 Given that predication bits apply to SUBVL groups, the same rules apply
53 to predicated-out (masked-out) sub-groups in calculating the value that
54 VL is set to.
55
56 ## Fail-on-first conditional tests
57
58 ffirst stops sequential (or sequentially-appearing in the case of
59 out-of-order designs) element conditional testing on the first element
60 result being zero (or other "fail" condition). VL is set to the number
61 of elements that were (sequentially) processed before the fail-condition
62 was encountered.
63
64 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
65 responds to changes in the zero or non-zero predication mode. Whilst
66 in non-zeroing mode, masked-out elements are simply not tested (and
67 thus considered "never to fail"), in zeroing mode, masked-out elements
68 may be viewed as *always* (unconditionally) failing. This effectively
69 turns VL into something akin to a software-controlled loop.
70
Note that, just as with traps, if SUBVL!=1, the first failed test in the
*sub-group* will cause processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
(x)eSTATE (subvl offset indices) must be analysed to determine the
element that caused processing to stop.
79
80 Note again that, just as with traps, predicated-out (masked-out) elements
81 are included in the (sequential) count leading up to the fail-condition,
82 even though they were not tested.
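
To make the VL-truncation rule concrete, here is a minimal C sketch (not
part of the specification) of fail-on-first conditional testing for SUBVL=1;
the element array, predicate layout and test() callback are illustrative
assumptions only:

    #include <stdint.h>
    #include <stdbool.h>

    // returns the value that VL would be set to after ffirst testing
    static int ffirst_conditional_vl(const int64_t *el, uint64_t pred,
                                     bool zeroing, int VL,
                                     bool (*test)(int64_t)) {
        for (int i = 0; i < VL; i++) {
            bool masked_in = (pred >> i) & 1;
            if (!masked_in) {
                if (zeroing)
                    return i;  // zeroing: masked-out element counts as a fail
                continue;      // non-zeroing: never tested, but still counted
            }
            if (!test(el[i]))
                return i;      // first failing element truncates VL here
        }
        return VL;             // no fail-condition encountered: VL unchanged
    }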
83
84 # Instructions <a name="instructions" />
85
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite all RVV operations being removed,
with the exception of CLIP and VSELECT.X
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
96
97 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
98 equivalents, so are left out of Simple-V. VSELECT could be included if
99 there existed a MV.X instruction in RV (MV.X is a hypothetical
100 non-immediate variant of MV that would allow another register to
101 specify which register was to be copied). Note that if any of these three
102 instructions are added to any given RV extension, their functionality
103 will be inherently parallelised.
104
105 With some exceptions, where it does not make sense or is simply too
106 challenging, all RV-Base instructions are parallelised:
107
108 * CSR instructions, whilst a case could be made for fast-polling of
109 a CSR into multiple registers, or for being able to copy multiple
110 contiguously addressed CSRs into contiguous registers, and so on,
111 are the fundamental core basis of SV. If parallelised, extreme
112 care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
114 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
115 left as scalar.
116 * LR/SC could hypothetically be parallelised however their purpose is
117 single (complex) atomic memory operations where the LR must be followed
118 up by a matching SC. A sequence of parallel LR instructions followed
119 by a sequence of parallel SC instructions therefore is guaranteed to
120 not be useful. Not least: the guarantees of a Multi-LR/SC
121 would be impossible to provide if emulated in a trap.
122 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
123 paralleliseable anyway.
124
125 All other operations using registers are automatically parallelised.
126 This includes AMOMAX, AMOSWAP and so on, where particular care and
127 attention must be paid.
128
129 Example pseudo-code for an integer ADD operation (including scalar
130 operations). Floating-point uses the FP Register Table.
131
132 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
133
134 Note that for simplicity there is quite a lot missing from the above
135 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
136 reshaping and offsets and so on. However it demonstrates the basic
137 principle. Augmentations that produce the full pseudo-code are covered in
138 other sections.
139
140 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
141
142 Adding in support for SUBVL is a matter of adding in an extra inner
143 for-loop, where register src and dest are still incremented inside the
144 inner part. Note that the predication is still taken from the VL index.
145
So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".
148
    function op_add(rd, rs1, rs2) # add not VADD!
        int i, id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
        rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
        rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
        for (i = 0; i < VL; i++)
            xSTATE.srcoffs = i # save context
            for (s = 0; s < SUBVL; s++)
                xSTATE.ssvoffs = s # save context
                if (predval & 1<<i) # predication uses intregs
                    # actual add is here (at last)
                    ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                    if (!int_vec[rd ].isvector) break;
                if (int_vec[rd ].isvector)  { id += 1; }
                if (int_vec[rs1].isvector)  { irs1 += 1; }
                if (int_vec[rs2].isvector)  { irs2 += 1; }
                if (id == VL or irs1 == VL or irs2 == VL) {
                    # end VL hardware loop
                    xSTATE.srcoffs = 0; # reset
                    xSTATE.ssvoffs = 0; # reset
                    return;
                }
172
173
174 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
175 elwidth handling etc. all left out.
176
177 ## Instruction Format
178
179 It is critical to appreciate that there are
180 **no operations added to SV, at all**.
181
182 Instead, by using CSRs to tag registers as an indication of "changed
183 behaviour", SV *overloads* pre-existing branch operations into predicated
184 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
185 LOAD/STORE depending on CSR configurations for bitwidth and predication.
186 **Everything** becomes parallelised. *This includes Compressed
187 instructions* as well as any future instructions and Custom Extensions.
188
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
191 FRM changes the behaviour of the floating-point unit, to alter the rounding
192 mode. Other architectures change the LOAD/STORE byte-order from big-endian
193 to little-endian on a per-instruction basis. SV is just a little more...
194 comprehensive in its effect on instructions.
195
196 ## Branch Instructions
197
198 Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
200 of multiple comparisons into a register (taken indirectly from the predicate
201 table) and enhancing them to branch "consensually" depending on *multiple*
202 tests. "ffirst" - fail-on-first - condition mode can also be enabled,
203 to terminate the comparisons early.
204 See ffirst mode in the Predication Table section.
205
206 There are two registers for the comparison operation, therefore there
207 is the opportunity to associate two predicate registers (note: not in
208 the same way as twin-predication). The first is a "normal" predicate
209 register, which acts just as it does on any other single-predicated
210 operation: masks out elements where a bit is zero, applies an inversion
211 to the predicate mask, and enables zeroing / non-zeroing mode.
212
213 The second (not to be confused with a twin-predication 2nd register)
214 is utilised to indicate where the results of each comparison are to
215 be stored, as a bitmask. Additionally, the behaviour of the branch -
216 when it occurs - may also be modified depending on whether the 2nd predicate's
217 "invert" and "zeroing" bits are set. These four combinations result
218 in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
219 cbranch.ifall (AND), cbranch.ifnotall (NAND).
220
221 | invert | zeroing | description | operation | cbranch |
222 | ------ | ------- | --------------------------- | --------- | ------- |
223 | 0 | 0 | branch if all pass | AND | ifall |
| 1 | 0 | branch if one fails | NAND | ifnotall |
225 | 0 | 1 | branch if one passes | OR | ifany |
226 | 1 | 1 | branch if all fail | NOR | ifnone |
227
228 This inversion capability covers AND, OR, NAND and NOR branching
229 based on multiple element comparisons. Without the full set of four,
230 it is necessary to have two-sequence branch operations: one conditional, one
231 unconditional.
232
Note that, unlike normal computer programming where chains of AND or OR
conditional tests terminate early, here the chain does *not* terminate early
unless fail-on-first is set, and even then ffirst ends on the first
data-dependent zero. When ffirst mode is not set, *all* conditional
element tests must be performed (and the result optionally stored in
the result mask), with a "post-analysis" phase carried out which checks
whether to branch.
240
241 Note also that whilst it may seem excessive to have all four (because
242 conditional comparisons may be inverted by swapping src1 and src2),
243 data-dependent fail-on-first is *not* invertible and *only* terminates
244 on first zero-condition encountered. Additionally it may be inconvenient
245 to have to swap the predicate registers associated with src1 and src2,
246 because this involves a new VBLOCK Context.
247
248 ### Standard Branch <a name="standard_branch"></a>
249
250 Branch operations use standard RV opcodes that are reinterpreted to
251 be "predicate variants" in the instance where either of the two src
252 registers are marked as vectors (active=1, vector=1).
253
254 Note that the predication register to use (if one is enabled) is taken from
255 the *first* src register, and that this is used, just as with predicated
256 arithmetic operations, to mask whether the comparison operations take
257 place or not. The target (destination) predication register
258 to use (if one is enabled) is taken from the *second* src register.
259
260 If either of src1 or src2 are scalars (whether by there being no
261 CSR register entry or whether by the CSR entry specifically marking
262 the register as "scalar") the comparison goes ahead as vector-scalar
263 or scalar-vector.
264
In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
270
271 Note that when zero-predication is enabled (from source rs1),
272 a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) be set to zero. Contrast this with
275 when zeroing is not set: bits in the destination predicate are
276 only *set*; they are **not** cleared. This is important to appreciate,
277 as there may be an expectation that, going into the hardware-loop,
278 the destination predicate is always expected to be set to zero:
279 this is **not** the case. The destination predicate is only set
280 to zero if **zeroing** is enabled.
281
282 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
284 src1 and src2, however note that in doing so, the predicate table
285 setup must also be correspondingly adjusted.
286
287 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
288 for predicated compare operations of function "cmp":
289
    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);
294
295 With associated predication, vector-length adjustments and so on,
296 and temporarily ignoring bitwidth (which makes the comparisons more
297 complex), this becomes:
298
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion, pred_zeroing = get_pred_flags(rs2)
    else
        pred_inversion, pred_zeroing = False, False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);
                if ffirst_mode:
                    break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if pred_zeroing:
            # NOR
            if result == 0:
                goto branch
        else:
            # NAND
            if (result & ps) != result:
                goto branch
    else:
        if pred_zeroing:
            # OR
            if result != 0:
                goto branch
        else:
            # AND
            if (result & ps) == result:
                goto branch
358
359 Notes:
360
361 * Predicated SIMD comparisons would break src1 and src2 further down
362 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
363 Reordering") setting Vector-Length times (number of SIMD elements) bits
364 in Predicate Register rd, as opposed to just Vector-Length bits.
365 * The execution of "parallelised" instructions **must** be implemented
366 as "re-entrant" (to use a term from software). If an exception (trap)
367 occurs during the middle of a vectorised
368 Branch (now a SV predicated compare) operation, the partial results
369 of any comparisons must be written out to the destination
370 register before the trap is permitted to begin. If however there
371 is no predicate, the **entire** set of comparisons must be **restarted**,
372 with the offset loop indices set back to zero. This is because
373 there is no place to store the temporary result during the handling
374 of traps.
375
376 TODO: predication now taken from src2. also branch goes ahead
377 if all compares are successful.
378
Note also that, whereas normally predication requires that there must
also be a CSR register entry for the register being used in order
for the **predication** CSR register entry to also be active,
382 for branches this is **not** the case. src2 does **not** have
383 to have its CSR register entry marked as active in order for
384 predication on src2 to be active.
385
386 Also note: SV Branch operations are **not** twin-predicated
387 (see Twin Predication section). This would require three
388 element offsets: one to track src1, one to track src2 and a third
389 to track where to store the accumulation of the results. Given
390 that the element offsets need to be exposed via CSRs so that
391 the parallel hardware looping may be made re-entrant on traps
392 and exceptions, the decision was made not to make SV Branches
393 twin-predicated.
394
395 ### Floating-point Comparisons
396
There are no floating-point branch operations, only compares.
Interestingly, no change is needed to the instruction format because
399 FP Compare already stores a 1 or a zero in its "rd" integer register
400 target, i.e. it's not actually a Branch at all: it's a compare.
401
402 In RV (scalar) Base, a branch on a floating-point compare is
403 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
404 This does extend to SV, as long as x1 (in the example sequence given)
405 is vectorised. When that is the case, x1..x(1+VL-1) will also be
406 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
407 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
408 so on. Consequently, unlike integer-branch, FP Compare needs no
409 modification in its behaviour.
410
411 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
412 missing, and whilst in ordinary branch code this is fine because the
413 standard RVF compare can always be followed up with an integer BEQ or
414 a BNE (or a compressed comparison to zero or non-zero), in predication
415 terms that becomes more of an impact. To deal with this, SV's predication
416 has had "invert" added to it.
417
418 Also: note that FP Compare may be predicated, using the destination
419 integer register (rd) to determine the predicate. FP Compare is **not**
420 a twin-predication operation, as, again, just as with SV Branches,
421 there are three registers involved: FP src1, FP src2 and INT rd.
422
423 Also: note that ffirst (fail first mode) applies directly to this operation.
424
425 ### Compressed Branch Instruction
426
427 Compressed Branch instructions are, just like standard Branch instructions,
428 reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register
(c.beqz rs1' being equivalent to beq rs1', x0), the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
433
The specific required use of x0 is, with a little thought, quite obvious,
although initially counterintuitive. Clearly it is **not** recommended to redirect
436 x0 with a CSR register entry, however as a means to opaquely obtain
437 a predication target it is the only sensible option that does not involve
438 additional special CSRs (or, worse, additional special opcodes).
439
440 Note also that, just as with standard branches, the 2nd source
441 (in this case x0 rather than src2) does **not** have to have its CSR
442 register table marked as "active" in order for predication to work.
443
444 ## Vectorised Dual-operand instructions
445
446 There is a series of 2-operand instructions involving copying (and
447 sometimes alteration):
448
449 * C.MV
450 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
451 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
452 * LOAD(-FP) and STORE(-FP)
453
454 All of these operations follow the same two-operand pattern, so it is
455 *both* the source *and* destination predication masks that are taken into
456 account. This is different from
457 the three-operand arithmetic instructions, where the predication mask
458 is taken from the *destination* register, and applied uniformly to the
459 elements of the source register(s), element-for-element.
460
461 The pseudo-code pattern for twin-predicated operations is as
462 follows:
463
    function op(rd, rs):
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
477
478 This pattern covers scalar-scalar, scalar-vector, vector-scalar
479 and vector-vector, and predicated variants of all of those.
480 Zeroing is not presently included (TODO). As such, when compared
481 to RVV, the twin-predicated variants of C.MV and FMV cover
482 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
483 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
484
485 Note that:
486
487 * elwidth (SIMD) is not covered in the pseudo-code above
488 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
489 not covered
490 * zero predication is also not shown (TODO).
491
492 ### C.MV Instruction <a name="c_mv"></a>
493
494 There is no MV instruction in RV however there is a C.MV instruction.
495 It is used for copying integer-to-integer registers (vectorised FMV
496 is used for copying floating-point).
497
498 If either the source or the destination register are marked as vectors
499 C.MV is reinterpreted to be a vectorised (multi-register) predicated
500 move operation. The actual instruction's format does not change:
501
502 [[!table data="""
503 15 12 | 11 7 | 6 2 | 1 0 |
504 funct4 | rd | rs | op |
505 4 | 5 | 5 | 2 |
506 C.MV | dest | src | C0 |
507 """]]
508
509 A simplified version of the pseudocode for this operation is as follows:
510
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            ireg[rd+j] <= ireg[rs+i];
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
524
525 There are several different instructions from RVV that are covered by
526 this one opcode:
527
528 [[!table data="""
529 src | dest | predication | op |
530 scalar | vector | none | VSPLAT |
531 scalar | vector | destination | sparse VSPLAT |
532 scalar | vector | 1-bit dest | VINSERT |
533 vector | scalar | 1-bit? src | VEXTRACT |
534 vector | vector | none | VCOPY |
535 vector | vector | src | Vector Gather |
536 vector | vector | dest | Vector Scatter |
537 vector | vector | src & dest | Gather/Scatter |
538 vector | vector | src == dest | sparse VCOPY |
539 """]]
540
541 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
542 operations with zeroing off, and inversion on the src and dest predication
543 for one of the two C.MV operations. The non-inverted C.MV will place
544 one set of registers into the destination, and the inverted one the other
545 set. With predicate-inversion, copying and inversion of the predicate mask
546 need not be done as a separate (scalar) instruction.
547
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
553
554 ### FMV, FNEG and FABS Instructions
555
556 These are identical in form to C.MV, except covering floating-point
557 register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
560 operation of the appropriate size covering the source and destination
561 register bitwidths.
562
563 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
564
### FCVT Instructions
566
567 These are again identical in form to C.MV, except that they cover
568 floating-point to integer and integer to floating-point. When element
569 width in each vector is set to default, the instructions behave exactly
570 as they are defined for standard RV (scalar) operations, except vectorised
571 in exactly the same fashion as outlined in C.MV.
572
573 However when the source or destination element width is not set to default,
574 the opcode's explicit element widths are *over-ridden* to new definitions,
575 and the opcode's element width is taken as indicative of the SIMD width
576 (if applicable i.e. if packed SIMD is requested) instead.
577
578 For example FCVT.S.L would normally be used to convert a 64-bit
579 integer in register rs1 to a 64-bit floating-point number in rd.
580 If however the source rs1 is set to be a vector, where elwidth is set to
581 default/2 and "packed SIMD" is enabled, then the first 32 bits of
582 rs1 are converted to a floating-point number to be stored in rd's
583 first element and the higher 32-bits *also* converted to floating-point
584 and stored in the second. The 32 bit size comes from the fact that
585 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
586 divide that by two it means that rs1 element width is to be taken as 32.
587
588 Similar rules apply to the destination register.
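
A minimal C sketch of the FCVT.S.L example above (rs1 vectorised, elwidth set
to default/2 with packed SIMD, RV64 assumed; the function name is purely
illustrative):

    #include <stdint.h>

    // one 64-bit source register yields two single-precision results
    static void fcvt_s_l_packed32(uint64_t rs1, float rd_elems[2]) {
        rd_elems[0] = (float)(int32_t)(uint32_t)(rs1 & 0xffffffffu); // low 32 bits -> element 0
        rd_elems[1] = (float)(int32_t)(uint32_t)(rs1 >> 32);         // high 32 bits -> element 1
    }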
589
590 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
591
592 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
593 the interpretation of the instruction fields). This
594 actually undermined the fundamental principle of SV, namely that there
595 be no modifications to the scalar behaviour (except where absolutely
596 necessary), in order to simplify an implementor's task if considering
597 converting a pre-existing scalar design to support parallelism.
598
599 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
600 do not change in SV, however just as with C.MV it is important to note
601 that dual-predication is possible.
602
603 In vectorised architectures there are usually at least two different modes
604 for LOAD/STORE:
605
606 * Read (or write for STORE) from sequential locations, where one
607 register specifies the address, and the one address is incremented
608 by a fixed amount. This is usually known as "Unit Stride" mode.
609 * Read (or write) from multiple indirected addresses, where the
610 vector elements each specify separate and distinct addresses.
611
612 To support these different addressing modes, the CSR Register "isvector"
613 bit is used. So, for a LOAD, when the src register is set to
614 scalar, the LOADs are sequentially incremented by the src register
615 element width, and when the src register is set to "vector", the
616 elements are treated as indirection addresses. Simplified
617 pseudo-code would look like this:
618
    function op_ld(rd, rs) # LD not VLD!
        rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            if (int_csr[rd].isvec)
                # indirect mode (multi mode)
                srcbase = ireg[rsv+i];
            else
                # unit stride mode
                srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
            ireg[rdv+j] <= mem[srcbase + imm_offs];
            if (!int_csr[rs].isvec &&
                !int_csr[rd].isvec) break # scalar-scalar LD
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++;
638
639 Notes:
640
641 * For simplicity, zeroing and elwidth is not included in the above:
642 the key focus here is the decision-making for srcbase; vectorised
643 rs means use sequentially-numbered registers as the indirection
644 address, and scalar rs is "offset" mode.
645 * The test towards the end for whether both source and destination are
646 scalar is what makes the above pseudo-code provide the "standard" RV
647 Base behaviour for LD operations.
648 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
650 (8 bytes), and also whether the element width is over-ridden
651 (see special element width section).
652
653 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
654
655 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
656 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
657 It is therefore possible to use predicated C.LWSP to efficiently
658 pop registers off the stack (by predicating x2 as the source), cherry-picking
659 which registers to store to (by predicating the destination). Likewise
660 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
661
662 The two modes ("unit stride" and multi-indirection) are still supported,
663 as with standard LD/ST. Essentially, the only difference is that the
664 use of x2 is hard-coded into the instruction.
665
666 **Note**: it is still possible to redirect x2 to an alternative target
667 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
668 general-purpose LOAD/STORE operations.
669
670 ## Compressed LOAD / STORE Instructions
671
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
675 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
676 to "Multi-indirection", respectively.
677
678 # Element bitwidth polymorphism <a name="elwidth"></a>
679
680 Element bitwidth is best covered as its own special section, as it
681 is quite involved and applies uniformly across-the-board. SV restricts
682 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
683
684 The effect of setting an element bitwidth is to re-cast each entry
685 in the register table, and for all memory operations involving
686 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
now looks like this:
689
    typedef union {
        uint8_t b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
699
700 where the CSR Register table entry (not the instruction alone) determines
701 which of those union entries is to be used on each operation, and the
702 VL element offset in the hardware-loop specifies the index into each array.
703
704 However a naive interpretation of the data structure above masks the
705 fact that setting VL greater than 8, for example, when the bitwidth is 8,
706 accessing one specific register "spills over" to the following parts of
707 the register file in a sequential fashion. So a much more accurate way
708 to reflect this would be:
709
    typedef union {
        uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t  b[0]; // array of type uint8_t
        uint16_t s[0];
        uint32_t i[0];
        uint64_t l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
720
where, when accessing any individual regfile[n].b entry, it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" to consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if any attempt is made to access beyond the "real"
register bytes.
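
The following self-contained C sketch illustrates the overspill arithmetic.
It uses fixed-size arrays and explicit index arithmetic (rather than the
zero-length-array overrun) purely so that it is well-defined standalone C;
the helper name is an assumption:

    #include <stdint.h>
    #include <stdio.h>

    typedef union {
        uint8_t  actual_bytes[8];   // RV64
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    static reg_t int_regfile[128];

    // element "offset" at elwidth=16, starting at register "reg": the access
    // spills into register reg + offset/4, halfword offset%4 within it
    static uint16_t read_elwidth16(int reg, int offset) {
        return int_regfile[reg + offset / 4].s[offset % 4];
    }

    int main(void) {
        int_regfile[6].s[1] = 0x1234;  // halfword 1 of register 6
        // element 5 of the "vector starting at x5" lands in register 6:
        printf("%#x\n", (unsigned)read_elwidth16(5, 5));  // prints 0x1234
        return 0;
    }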
730
Now we may modify the pseudo-code for an operation where all element bitwidths
have been set to the same size; this pseudo-code is otherwise identical
to its non-polymorphic version (above):
734
    function op_add(rd, rs1, rs2) # add not VADD!
    ...
    ...
        for (i = 0; i < VL; i++)
            ...
            ...
            // TODO, calculate if over-run occurs, for each elwidth
            if (elwidth == 8) {
                int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                         int_regfile[rs2].b[irs2];
            } else if elwidth == 16 {
                int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                         int_regfile[rs2].s[irs2];
            } else if elwidth == 32 {
                int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                         int_regfile[rs2].i[irs2];
            } else { // elwidth == 64
                int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                         int_regfile[rs2].l[irs2];
            }
            ...
            ...
757
758 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
759 following sequentially on respectively from the same) are "type-cast"
760 to 8-bit; for 16-bit entries likewise and so on.
761
762 However that only covers the case where the element widths are the same.
763 Where the element widths are different, the following algorithm applies:
764
765 * Analyse the bitwidth of all source operands and work out the
766 maximum. Record this as "maxsrcbitwidth"
767 * If any given source operand requires sign-extension or zero-extension
768 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
769 sign-extension / zero-extension or whatever is specified in the standard
770 RV specification, **change** that to sign-extending from the respective
771 individual source operand's bitwidth from the CSR table out to
772 "maxsrcbitwidth" (previously calculated), instead.
773 * Following separate and distinct (optional) sign/zero-extension of all
774 source operands as specifically required for that operation, carry out the
775 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
into a copy).
779 * If the destination operand requires sign-extension or zero-extension,
780 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
782 etc.), overload the RV specification with the bitwidth from the
783 destination register's elwidth entry.
784 * Finally, store the (optionally) sign/zero-extended value into its
785 destination: memory for sb/sw etc., or an offset section of the register
786 file for an arithmetic operation.
787
788 In this way, polymorphic bitwidths are achieved without requiring a
789 massive 64-way permutation of calculations **per opcode**, for example
790 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
791 rd bitwidths). The pseudo-code is therefore as follows:
792
    typedef union {
        uint8_t b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;
799
    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
858
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
862
863 * the source operands are extended out to the maximum bitwidth of all
864 source operands
865 * the operation takes place at that maximum source bitwidth (the
866 destination bitwidth is not involved at this point, at all)
867 * the result is extended (or potentially even, truncated) before being
868 stored in the destination. i.e. truncation (if required) to the
869 destination width occurs **after** the operation **not** before.
870 * when the destination is not marked as "vectorised", the **full**
871 (standard, scalar) register file entry is taken up, i.e. the
872 element is either sign-extended or zero-extended to cover the
873 full register bitwidth (XLEN) if it is not already XLEN bits long.
874
875 Implementors are entirely free to optimise the above, particularly
876 if it is specifically known that any given operation will complete
877 accurately in less bits, as long as the results produced are
878 directly equivalent and equal, for all inputs and all outputs,
879 to those produced by the above algorithm.
880
881 ## Polymorphic floating-point operation exceptions and error-handling
882
883 For floating-point operations, conversion takes place without raising any
884 kind of exception. Exactly as specified in the standard RV specification,
885 NAN (or appropriate) is stored if the result is beyond the range of the
886 destination, and, again, exactly as with the standard RV specification
887 just as with scalar operations, the floating-point flag is raised
888 (FCSR). And, again, just as with scalar operations, it is software's
889 responsibility to check this flag. Given that the FCSR flags are
890 "accrued", the fact that multiple element operations could have occurred
891 is not a problem.
892
893 Note that it is perfectly legitimate for floating-point bitwidths of
894 only 8 to be specified. However whilst it is possible to apply IEEE 754
895 principles, no actual standard yet exists. Implementors wishing to
896 provide hardware-level 8-bit support rather than throw a trap to emulate
897 in software should contact the author of this specification before
898 proceeding.
899
900 ## Polymorphic shift operators
901
902 A special note is needed for changing the element width of left and
903 right shift operators, particularly right-shift. Even for standard RV
904 base, in order for correct results to be returned, the second operand
905 RS2 must be truncated to be within the range of RS1's bitwidth.
906 spike's implementation of sll for example is as follows:
907
    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
909
910 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
911 range 0..31 so that RS1 will only be left-shifted by the amount that
912 is possible to fit into a 32-bit register. Whilst this appears not
913 to matter for hardware, it matters greatly in software implementations,
914 and it also matters where an RV64 system is set to "RV32" mode, such
915 that the underlying registers RS1 and RS2 comprise 64 hardware bits
916 each.
917
918 For SV, where each operand's element bitwidth may be over-ridden, the
919 rule about determining the operation's bitwidth *still applies*, being
920 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
921 **also applies to the truncation of RS2**. In other words, *after*
922 determining the maximum bitwidth, RS2's range must **also be truncated**
923 to ensure a correct answer. Example:
924
925 * RS1 is over-ridden to a 16-bit width
926 * RS2 is over-ridden to an 8-bit width
927 * RD is over-ridden to a 64-bit width
928 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
929 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
930
931 Pseudocode (in spike) for this example would therefore be:
932
    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
934
935 This example illustrates that considerable care therefore needs to be
936 taken to ensure that left and right shift operations are implemented
937 correctly. The key is that
938
939 * The operation bitwidth is determined by the maximum bitwidth
940 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
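
A minimal C sketch of the worked example above (RS1 elwidth 16, RS2 elwidth 8,
RD elwidth 64; sign-extension of the result is assumed for illustration,
the actual extension depends on the operation):

    #include <stdint.h>

    static uint64_t sv_sll_16_8_to_64(uint16_t rs1, uint8_t rs2) {
        unsigned shamt = rs2 & (16 - 1);            // truncate RS2 to 0..15
        uint16_t res16 = (uint16_t)(rs1 << shamt);  // operate at max(8,16) = 16 bits
        return (uint64_t)(int64_t)(int16_t)res16;   // extend to RD's 64-bit elwidth
    }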
942
943 ## Polymorphic MULH/MULHU/MULHSU
944
945 MULH is designed to take the top half MSBs of a multiply that
946 does not fit within the range of the source operands, such that
947 smaller width operations may produce a full double-width multiply
in two instructions. The issue is: SV allows the source operands to
949 have variable bitwidth.
950
951 Here again special attention has to be paid to the rules regarding
952 bitwidth, which, again, are that the operation is performed at
953 the maximum bitwidth of the **source** registers. Therefore:
954
955 * An 8-bit x 8-bit multiply will create a 16-bit result that must
956 be shifted down by 8 bits
957 * A 16-bit x 8-bit multiply will create a 24-bit result that must
958 be shifted down by 16 bits (top 8 bits being zero)
959 * A 16-bit x 16-bit multiply will create a 32-bit result that must
960 be shifted down by 16 bits
961 * A 32-bit x 16-bit multiply will create a 48-bit result that must
962 be shifted down by 32 bits
963 * A 32-bit x 8-bit multiply will create a 40-bit result that must
964 be shifted down by 32 bits
965
966 So again, just as with shift-left and shift-right, the result
967 is shifted down by the maximum of the two source register bitwidths.
968 And, exactly again, truncation or sign-extension is performed on the
969 result. If sign-extension is to be carried out, it is performed
970 from the same maximum of the two source register bitwidths out
971 to the result element's bitwidth.
972
973 If truncation occurs, i.e. the top MSBs of the result are lost,
974 this is "Officially Not Our Problem", i.e. it is assumed that the
975 programmer actually desires the result to be truncated. i.e. if the
976 programmer wanted all of the bits, they would have set the destination
977 elwidth to accommodate them.
978
979 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
980
Polymorphic element widths in vectorised form mean that the data
being loaded (or stored) across multiple registers needs to be treated
983 (reinterpreted) as a contiguous stream of elwidth-wide items, where
984 the source register's element width is **independent** from the destination's.
985
986 This makes for a slightly more complex algorithm when using indirection
987 on the "addressed" register (source for LOAD and destination for STORE),
988 particularly given that the LOAD/STORE instruction provides important
989 information about the width of the data to be reinterpreted.
990
991 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
992 was as follows, and i is the loop from 0 to VL-1:
993
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
996
997 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
998 chunks are taken from the source memory location addressed by the current
999 indexed source address register, and only when a full 32-bits-worth
1000 are taken will the index be moved on to the next contiguous source
1001 address register:
1002
    bitwidth = bw(elwidth); // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock; // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
1008
1009 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
1010 and 128 for LQ.
1011
1012 The principle is basically exactly the same as if the srcbase were pointing
1013 at the memory of the *register* file: memory is re-interpreted as containing
1014 groups of elwidth-wide discrete elements.
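
A small C sketch (function and parameter names are illustrative) of the index
arithmetic above, i.e. which indirection register and which elwidth-wide item
within its block element i comes from:

    #include <stdint.h>

    static void ld_elwidth_index(int i, int opwidth, int elwidth,
                                 int *reg_delta, int *elem_offs) {
        int elsperblock = opwidth / elwidth;  // e.g. 32/16 = 2 for LW, elwidth=16
        if (elsperblock < 1)
            elsperblock = 1;                  // element wider than the operation (the "fix" below)
        *reg_delta = i / elsperblock;         // which rs+n address register
        *elem_offs = i % elsperblock;         // which elwidth-wide item at that address
    }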
1015
1016 When storing the result from a load, it's important to respect the fact
1017 that the destination register has its *own separate element width*. Thus,
1018 when each element is loaded (at the source element width), any sign-extension
1019 or zero-extension (or truncation) needs to be done to the *destination*
1020 bitwidth. Also, the storing has the exact same analogous algorithm as
1021 above, where in fact it is just the set\_polymorphed\_reg pseudocode
1022 (completely unchanged) used above.
1023
1024 One issue remains: when the source element width is **greater** than
1025 the width of the operation, it is obvious that a single LB for example
1026 cannot possibly obtain 16-bit-wide data. This condition may be detected
1027 where, when using integer divide, elsperblock (the width of the LOAD
1028 divided by the bitwidth of the element) is zero.
1029
The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
1033
1034 The elements, if the element bitwidth is larger than the LD operation's
1035 size, will then be sign/zero-extended to the full LD operation size, as
1036 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1037 being passed on to the second phase.
1038
1039 As LOAD/STORE may be twin-predicated, it is important to note that
1040 the rules on twin predication still apply, except where in previous
1041 pseudo-code (elwidth=default for both source and target) it was
1042 the *registers* that the predication was applied to, it is now the
1043 **elements** that the predication is applied to.
1044
1045 Thus the full pseudocode for all LD operations may be written out
1046 as follows:
1047
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        bitwidth = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1085
1086 Note:
1087
1088 * when comparing against for example the twin-predicated c.mv
1089 pseudo-code, the pattern of independent incrementing of rd and rs
1090 is preserved unchanged.
1091 * just as with the c.mv pseudocode, zeroing is not included and must be
1092 taken into account (TODO).
1093 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1094 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1095 VSCATTER characteristics.
1096 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1097 a destination that is not vectorised (marked as scalar) will
1098 result in the element being fully sign-extended or zero-extended
1099 out to the full register file bitwidth (XLEN). When the source
1100 is also marked as scalar, this is how the compatibility with
1101 standard RV LOAD/STORE is preserved by this algorithm.
1102
1103 ### Example Tables showing LOAD elements <a name="load_example"></a>
1104
1105 This section contains examples of vectorised LOAD operations, showing
1106 how the two stage process works (three if zero/sign-extension is included).
1107
1108
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1110
1111 This is:
1112
1113 * a 64-bit load, with an offset of zero
1114 * with a source-address elwidth of 16-bit
1115 * into a destination-register with an elwidth of 32-bit
1116 * where VL=7
1117 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1118 * RV64, where XLEN=64 is assumed.
1119
First, the memory table: due to the element width being 16 and the
operation being LD (64), the 64 bits loaded from memory are subdivided
into groups of **four** elements. And, with VL being 7 (deliberately,
to illustrate that this is reasonable and possible), the first four are
sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1126
1127 [[!table data="""
1128 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1129 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1130 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1131 """]]
1132
1133 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1134 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1135
[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0 | 0x0 | elem0 ||
0x0 | 0x0 | elem1 ||
0x0 | 0x0 | elem2 ||
0x0 | 0x0 | elem3 ||
0x0 | 0x0 | elem4 ||
0x0 | 0x0 | elem5 ||
0x0 | 0x0 | elem6 ||
"""]]
1147
1148 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1149 byte-addressable "memory". That "memory" happens to cover registers
1150 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1151
1152 [[!table data="""
1153 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1154 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1155 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1156 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1157 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1158 """]]
1159
1160 Thus we have data that is loaded from the **addresses** pointed to by
1161 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1162 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1166
1167 Note that whilst the memory addressing table is shown left-to-right byte order,
1168 the registers are shown in right-to-left (MSB) order. This does **not**
1169 imply that bit or byte-reversal is carried out: it's just easier to visualise
1170 memory as being contiguous bytes, and emphasises that registers are not
1171 really actually "memory" as such.
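
For cross-checking, a short C program (layout constants taken from the example
above) reproduces the element-to-register mapping shown in the tables:

    #include <stdio.h>

    int main(void) {
        for (int i = 0; i < 7; i++) {          // VL = 7
            int src_reg   = 5 + i / 4;         // 4 x 16-bit elements per 64-bit LD block
            int dest_reg  = 8 + i / 2;         // 2 x 32-bit elements per destination register
            int dest_slot = i % 2;             // 0 = bits 31..0, 1 = bits 63..32
            printf("elem %d: from @x%d, into x%d bits %d..%d\n",
                   i, src_reg, dest_reg, dest_slot * 32 + 31, dest_slot * 32);
        }
        return 0;
    }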
1172
1173 ## Why SV bitwidth specification is restricted to 4 entries
1174
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit
1180
This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex, no RV128 implementation yet exists, and RV64's
default is already 64 bit, so the 4 major element widths are covered anyway.
1185
There is an absolutely crucial aspect of SV here that explicitly
1187 needs spelling out, and it's whether the "vectorised" bit is set in
1188 the Register's CSR entry.
1189
1190 If "vectorised" is clear (not set), this indicates that the operation
1191 is "scalar". Under these circumstances, when set on a destination (RD),
1192 then sign-extension and zero-extension, whilst changed to match the
1193 override bitwidth (if set), will erase the **full** register entry
1194 (64-bit if RV64).
1195
1196 When vectorised is *set*, this indicates that the operation now treats
1197 **elements** as if they were independent registers, so regardless of
1198 the length, any parts of a given actual register that are not involved
1199 in the operation are **NOT** modified, but are **PRESERVED**.
1200
1201 For example:
1202
1203 * when the vector bit is clear and elwidth set to 16 on the destination
1204 register, operations are truncated to 16 bit and then sign or zero
1205 extended to the *FULL* XLEN register width.
1206 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1207 groups of elwidth sized elements do not fill an entire XLEN register),
1208 the "top" bits of the destination register do *NOT* get modified, zero'd
1209 or otherwise overwritten.
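
A minimal C sketch (RV64, elwidth=16, writing element 0 only; sign-extension
chosen purely for illustration) contrasting the two bullet points above:

    #include <stdint.h>

    static void write_elem0_elwidth16(uint64_t *rd, uint16_t v, int rd_is_vector) {
        if (!rd_is_vector)
            // scalar destination: value extended and the *full* register replaced
            *rd = (uint64_t)(int64_t)(int16_t)v;
        else
            // vector destination: only bits 15..0 touched, upper bits preserved
            *rd = (*rd & ~(uint64_t)0xffff) | v;
    }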
1210
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of a
multi-element operation.
1214
1215 Other microarchitectures may choose to provide byte-level write-enable
1216 lines on the register file, such that each 64 bit register in an RV64
1217 system requires 8 WE lines. Scalar RV64 operations would require
1218 activation of all 8 lines, where SV elwidth based operations would
1219 activate the required subset of those byte-level write lines.
1220
1221 Example:
1222
1223 * rs1, rs2 and rd are all set to 8-bit
1224 * VL is set to 3
1225 * RV64 architecture is set (UXL=64)
1226 * add operation is carried out
1227 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1228 concatenated with similar add operations on bits 15..8 and 7..0
1229 * bits 24 through 63 **remain as they originally were**.
1230
1231 Example SIMD micro-architectural implementation:
1232
1233 * SIMD architecture works out the nearest round number of elements
1234 that would fit into a full RV64 register (in this case: 8)
1235 * SIMD architecture creates a hidden predicate, binary 0b00000111
1236 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1237 * SIMD architecture goes ahead with the add operation as if it
1238 was a full 8-wide batch of 8 adds
* SIMD architecture passes the top 5 elements through the adders
  (which are "disabled" due to the zero predicate bits)
* SIMD architecture gets the top 5 bytes back unmodified
  and stores them in rd.
1243
This requires a read of rd; however a read is required anyway in order
to support non-zeroing mode.
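
A minimal Python sketch of the SIMD implementation above (illustrative
only; register values are modelled as plain integers):

    def simd_add_elwidth8(rd_old, rs1, rs2, VL):
        nelems = 8                       # eight 8-bit elements fit in 64 bits
        hidden_pred = (1 << VL) - 1      # 0b00000111 for VL=3
        rd_new = 0
        for i in range(nelems):
            if hidden_pred & (1 << i):
                # active element: perform the 8-bit add
                byte = (((rs1 >> (8 * i)) & 0xFF) +
                        ((rs2 >> (8 * i)) & 0xFF)) & 0xFF
            else:
                # predicated-out element: old destination byte passes through
                byte = (rd_old >> (8 * i)) & 0xFF
            rd_new |= byte << (8 * i)
        return rd_new

With VL=3, bits 24 through 63 of the returned value are identical to
bits 24 through 63 of rd_old, as required.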
1246
1247 ## Polymorphic floating-point
1248
1249 Standard scalar RV integer operations base the register width on XLEN,
1250 which may be changed (UXL in USTATUS, and the corresponding MXL and
1251 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1252 arithmetic operations are therefore restricted to an active XLEN bits,
1253 with sign or zero extension to pad out the upper bits when XLEN has
1254 been dynamically set to less than the actual register size.
1255
1256 For scalar floating-point, the active (used / changed) bits are
1257 specified exclusively by the operation: ADD.S specifies an active
1258 32-bits, with the upper bits of the source registers needing to
1259 be all 1s ("NaN-boxed"), and the destination upper bits being
1260 *set* to all 1s (including on LOAD/STOREs).
1261
Where elwidth is set to default (on any source or the destination)
it is obvious that this NaN-boxing behaviour can and should be
preserved. When elwidth is non-default, things are less obvious
and need to be thought through. Here is a normal (scalar) sequence,
assuming an RV64 which supports Quad (128-bit) FLEN:
1267
1268 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1269 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1270 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1271 top 64 MSBs ignored.
1272
1273 Therefore it makes sense to mirror this behaviour when, for example,
1274 elwidth is set to 32. Assume elwidth set to 32 on all source and
1275 destination registers:
1276
1277 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1278 floating-point numbers.
1279 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1280 in bits 0-31 and the second in bits 32-63.
1281 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1282
1283 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1284 of the registers either during the FLD **or** the ADD.D. The reason
1285 is that, effectively, the top 64 MSBs actually represent a completely
1286 independent 64-bit register, so overwriting it is not only gratuitous
1287 but may actually be harmful for a future extension to SV which may
1288 have a way to directly access those top 64 bits.
1289
The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to a non-default value, including
when "isvec" is false in a given register's CSR entry. Only when the
elwidth is set to default **and** isvec is false will the standard
RV behaviour be followed, namely that the upper bits be modified.
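
A rough sketch of the two destination-write policies (a model only,
assuming FLEN=128 and considering just a single element sitting in the
bottom of the register):

    def fp_reg_write(freg_old, result, result_bits, elwidth_default, isvec):
        mask = (1 << result_bits) - 1
        if elwidth_default and not isvec:
            # standard scalar RV behaviour: NaN-box, i.e. set all upper
            # bits of the 128-bit register to 1
            return (((1 << 128) - 1) & ~mask) | (result & mask)
        # elwidth overridden (and/or vectorised): upper bits are preserved
        return (freg_old & ~mask) | (result & mask)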
1295
1296 Ultimately if elwidth is default and isvec false on *all* source
1297 and destination registers, a SimpleV instruction defaults completely
1298 to standard RV scalar behaviour (this holds true for **all** operations,
1299 right across the board).
1300
The nice thing here is that, when elwidth is set to a non-default value,
ADD.S, ADD.D and ADD.Q are effectively all the same: they all still perform
multiple ADD operations, just at different widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1307
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar OoO architecture
there may be absolutely no difference; however, simpler SIMD-style
microarchitectures may not have the infrastructure in place to tell
the difference, such that when VL=8 an ADD.D instruction completes in
2 cycles (or more) rather than one, whereas an ADD.Q issued instead on
such simpler microarchitectures would complete in one.
1318
1319 ## Specific instruction walk-throughs
1320
1321 This section covers walk-throughs of the above-outlined procedure
1322 for converting standard RISC-V scalar arithmetic operations to
1323 polymorphic widths, to ensure that it is correct.
1324
1325 ### add
1326
1327 Standard Scalar RV32/RV64 (xlen):
1328
1329 * RS1 @ xlen bits
1330 * RS2 @ xlen bits
1331 * add @ xlen bits
1332 * RD @ xlen bits
1333
1334 Polymorphic variant:
1335
1336 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1337 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1338 * add @ max(rs1, rs2) bits
1339 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1340
1341 Note here that polymorphic add zero-extends its source operands,
1342 where addw sign-extends.
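
A minimal Python sketch of the polymorphic add rules above (widths are
in bits; the helper is illustrative and not part of the spec):

    def poly_add(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        # zero-extension of an unsigned value is simply masking to its width
        src1 = rs1_val & ((1 << rs1_bits) - 1)
        src2 = rs2_val & ((1 << rs2_bits) - 1)
        result = (src1 + src2) & ((1 << opwidth) - 1)
        # zero-extend if rd is wider than opwidth, truncate if narrower
        return result & ((1 << rd_bits) - 1)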
1343
1344 ### addw
1345
1346 The RV Specification specifically states that "W" variants of arithmetic
1347 operations always produce 32-bit signed values. In a polymorphic
1348 environment it is reasonable to assume that the signed aspect is
1349 preserved, where it is the length of the operands and the result
1350 that may be changed.
1351
1352 Standard Scalar RV64 (xlen):
1353
1354 * RS1 @ xlen bits
1355 * RS2 @ xlen bits
1356 * add @ xlen bits
1357 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1358
1359 Polymorphic variant:
1360
1361 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1362 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1363 * add @ max(rs1, rs2) bits
1364 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1365
1366 Note here that polymorphic addw sign-extends its source operands,
1367 where add zero-extends.
1368
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1373
1374 Effectively however, both rs1 and rs2 are being sign-extended (or
1375 truncated), where for add they are both zero-extended. This holds true
1376 for all arithmetic operations ending with "W".
1377
1378 ### addiw
1379
1380 Standard Scalar RV64I:
1381
1382 * RS1 @ xlen bits, truncated to 32-bit
1383 * immed @ 12 bits, sign-extended to 32-bit
1384 * add @ 32 bits
1385 * RD @ rd bits. sign-extend to rd if rd > 32, otherwise truncate.
1386
1387 Polymorphic variant:
1388
1389 * RS1 @ rs1 bits
1390 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1391 * add @ max(rs1, 12) bits
1392 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
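
A companion sketch (same caveats as the add sketch above) for the
sign-extending "W" variants, covering both addw and addiw:

    def sign_ext(val, frm, to):
        # sign-extend a frm-bit value to "to" bits
        val &= (1 << frm) - 1
        if val & (1 << (frm - 1)):
            val |= ((1 << to) - 1) & ~((1 << frm) - 1)
        return val

    def poly_addw(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        src1 = sign_ext(rs1_val, rs1_bits, opwidth)
        src2 = sign_ext(rs2_val, rs2_bits, opwidth)
        result = (src1 + src2) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return sign_ext(result, opwidth, rd_bits)
        return result & ((1 << rd_bits) - 1)

For addiw the second operand is simply the 12-bit immediate,
sign-extended to max(rs1, 12) bits in the same way.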
1393
1394 # Predication Element Zeroing
1395
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to be able to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1406
1407 SimpleV's design principle is not based on or influenced by
1408 microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios),
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1413
1414 ## Single-predication (based on destination register)
1415
1416 Zeroing on predication for arithmetic operations is taken from
1417 the destination register's predicate. i.e. the predication *and*
1418 zeroing settings to be applied to the whole operation come from the
1419 CSR Predication table entry for the destination register.
1420 Thus when zeroing is set on predication of a destination element,
1421 if the predication bit is clear, then the destination element is *set*
1422 to zero (twin-predication is slightly different, and will be covered
1423 next).
1424
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1427
    for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
            while (!(predval & 1<<i) && i < VL)
                if (int_vec[rd ].isvector) { ird  += 1; }
                if (int_vec[rs1].isvector) { irs1 += 1; }
                if (int_vec[rs2].isvector) { irs2 += 1; }
            if i == VL:
                return
        if (predval & 1<<i)
            src1 = ....
            src2 = ...
            result = src1 + src2 # actual add (or other op) here
            set_polymorphed_reg(rd, destwid, ird, result)
            if int_vec[rd].ffirst and result == 0:
                VL = i # result was zero, end loop early, return VL
                return
            if (!int_vec[rd].isvector) return
        else if zeroing:
            result = 0
            set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector) { ird  += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector) { irs1 += 1; }
        if (int_vec[rs2].isvector) { irs2 += 1; }
        if (ird == VL or irs1 == VL or irs2 == VL): return
1454
1455 The optimisation to skip elements entirely is only possible for certain
1456 micro-architectures when zeroing is not set. However for lane-based
1457 micro-architectures this optimisation may not be practical, as it
1458 implies that elements end up in different "lanes". Under these
1459 circumstances it is perfectly fine to simply have the lanes
1460 "inactive" for predicated elements, even though it results in
1461 less than 100% ALU utilisation.
1462
1463 ## Twin-predication (based on source and destination register) <a name="tpred"></a>
1464
Twin-predication is not that much different, except that the source
is zero-predicated independently of the destination. This means that
the source may be zero-predicated *or* the destination zero-predicated
*or both*, or neither.
1469
With twin-predication, when zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than the result
of looking up an *address* of zero).
1476
1477 When zeroing is set on the destination and not the source, then just
1478 as with single-predicated operations, a zero is stored into the destination
1479 element (or target memory address for a STORE).
1480
Zeroing on both source and destination effectively results in the two
predicates being ANDed together as far as actual data transfer is
concerned: wherever either the source predicate OR the destination
predicate bit is zero, a zero element will ultimately end up in the
destination register.
1485
1486 However: this may not necessarily be the case for all operations;
1487 implementors, particularly of custom instructions, clearly need to
1488 think through the implications in each and every case.
1489
1490 Here is pseudo-code for a twin zero-predicated operation:
1491
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
        pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL):
            if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
            if ((pd & 1<<j))
                if ((ps & 1<<i))
                    sourcedata = ireg[rs+i];
                else
                    sourcedata = 0
                ireg[rd+j] <= sourcedata
            else if (zerodst)
                ireg[rd+j] <= 0
            if (int_csr[rs].isvec)
                i++;
            if (int_csr[rd].isvec)
                j++;
            else
                if ((pd & 1<<j))
                    break;
1515
1516 Note that in the instance where the destination is a scalar, the hardware
1517 loop is ended the moment a value *or a zero* is placed into the destination
1518 register/element. Also note that, for clarity, variable element widths
1519 have been left out of the above.
1520
1521 # Subsets of RV functionality
1522
1523 This section describes the differences when SV is implemented on top of
1524 different subsets of RV.
1525
1526 ## Common options
1527
1528 It is permitted to only implement SVprefix and not the VBLOCK instruction
1529 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1530 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1531 traps may emulate the format.
1532
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1537
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.
1542
1543 ## RV32 / RV32F
1544
When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
maximum limit for predication is also restricted to 32 bits. Whilst not
strictly an "option", it is worth noting.
1548
1549 ## RV32G
1550
Normally, in standard RV32, it does not make much sense to have
RV32G: the critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1555
1556 In an earlier draft of SV, it was possible to specify an elwidth
1557 of double the standard register size: this had to be dropped,
1558 and may be reintroduced in future revisions.
1559
1560 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1561
1562 When floating-point is not implemented, the size of the User Register and
1563 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1564 per table).
1565
1566 ## RV32E
1567
1568 In embedded scenarios the User Register and Predication CSRs may be
1569 dropped entirely, or optionally limited to 1 CSR, such that the combined
1570 number of entries from the M-Mode CSR Register table plus U-Mode
1571 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1572 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1573 the Predication CSR tables.
1574
1575 RV32E is the most likely candidate for simply detecting that registers
1576 are marked as "vectorised", and generating an appropriate exception
1577 for the VL loop to be implemented in software.
1578
1579 ## RV128
1580
RV128 has not been especially considered here, however it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1585
1586 # Example usage
1587
1588 TODO evaluate strncpy and strlen
1589 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1590
## strncpy <a name="strncpy"></a>
1592
1593 RVV version:
1594
1595 strncpy:
1596 c.mv a3, a0 # Copy dst
1597 loop:
1598 setvli x0, a2, vint8 # Vectors of bytes.
1599 vlbff.v v1, (a1) # Get src bytes
1600 vseq.vi v0, v1, 0 # Flag zero bytes
1601 vmfirst a4, v0 # Zero found?
1602 vmsif.v v0, v0 # Set mask up to and including zero byte.
1603 vsb.v v1, (a3), v0.t # Write out bytes
1604 c.bgez a4, exit # Done
1605 csrr t1, vl # Get number of bytes fetched
1606 c.add a1, a1, t1 # Bump src pointer
1607 c.sub a2, a2, t1 # Decrement count.
1608 c.add a3, a3, t1 # Bump dst pointer
1609 c.bnez a2, loop # Anymore?
1610
1611 exit:
1612 c.ret
1613
1614 SV version (WIP):
1615
1616 strncpy:
1617 c.mv a3, a0
1618 VBLK.RegCSR[t0] = 8bit, t0, vector
1619 VBLK.PredTb[t0] = ffirst, x0, inv
1620 loop:
1621 VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8)
1622 c.ldb t0, (a1) # t0 fail first mode
1623 c.bne t0, x0, allnonzero # still ff
1624 # VL (t4) points to last nonzero
1625 c.addi t4, t4, 1 # include zero
1626 c.stb t0, (a3) # store incl zero
1627 c.ret # end subroutine
1628 allnonzero:
1629 c.stb t0, (a3) # VL legal range
1630 c.add a1, a1, t4 # Bump src pointer
1631 c.sub a2, a2, t4 # Decrement count.
1632 c.add a3, a3, t4 # Bump dst pointer
1633 c.bnez a2, loop # Anymore?
1634 exit:
1635 c.ret
1636
1637 Notes:
1638
1639 * Setting MVL to 8 is just an example. If enough registers are spare it
1640 may be set to XLEN which will require a bank of 8 scalar registers for
1641 a1, a3 and t0.
1642 * obviously if that is done, t0 is not separated by 8 full registers, and
1643 would overwrite t1 thru t7. x80 would work well, as an example, instead.
1644 * with the exception of the GETVL (a pseudo code alias for csrr), every
1645 single instruction above may use RVC.
1646 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1647 registers through redirection
1648 * RVC C.LW and C.SW may be used because the W format may be overridden by
1649 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1650 * with the exception of the GETVL, all Vector Context may be done in
1651 VBLOCK form.
1652 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1653 just ffirst on t0
1654 * ldb and bne are both using t0, both in ffirst mode
1655 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1656 vectorised, no (un)sign-extension or truncation" mode.
* ldb will end on illegal mem, reduce VL, but will have copied all sorts
  of stuff into t0 (which could contain zeros).
1659 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1660 scalar x0
1661 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1662 compares, and reduce VL as well
1663 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1666 * SETVL sets *exactly* the requested amount into VL.
1667 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1668 activates but the bne allzeros does not.
1669 * this would cause the stb to copy up to the end of the legal memory
1670 * of course, on the next loop the ldb would throw a trap, as a1 now
1671 points to the first illegal mem location.
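
A rough Python model (purely illustrative, ignoring memory faults) of the
fail-first compare performed by "c.bne t0, x0" above: VL is truncated at
the first element that fails the non-zero test.

    def ffirst_bne_nonzero(t0_elements, VL):
        for i in range(VL):
            if t0_elements[i] == 0:   # the "!= 0" test fails here
                return i              # new VL: only elements before the zero
        return VL                     # all elements passed: VL unchanged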
1672
1673 ## strcpy
1674
1675 RVV version:
1676
1677 mv a3, a0 # Save start
1678 loop:
1679 setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
1680 vldbff.v v1, (a3) # Get bytes
1681 csrr a1, vl # Get bytes actually read e.g. if fault
1682 vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
1683 add a3, a3, a1 # Bump pointer
1684 vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
1685 bltz a2, loop # Not found?
1686 add a0, a0, a1 # Sum start + bump
1687 add a3, a3, a2 # Add index of zero byte
1688 sub a0, a3, a0 # Subtract start address+bump
1689 ret
1690
1691 ## DAXPY <a name="daxpy"></a>
1692
1693 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1694
1695 Notes:
1696
1697 * Setting MVL to 4 is just an example. With enough space between the
1698 FP regs, MVL may be set to larger values
1699 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1700 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1701 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1702 * All instructions except fmadd may use Compressed variants. Total
1703 number of 16-bit instruction words: 11.
1704 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1705
1706 ## BigInt add <a name="bigadd"></a>
1707
1708 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]