1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 25 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes
11
12 Fail-on-first data dependency has different behaviour for traps than
13 for conditional testing. "Conditional" is taken to mean "anything
14 that is zero", however with traps, the first element has to
15 be given the opportunity to throw the exact same trap that would
16 be thrown if this were a scalar operation (when VL=1).
17
18 ## Fail-on-first traps
19
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and all
following elements are instead ignored (or cancelled in out-of-order
designs), and VL is set to point at the *last* element that did not
take the trap.
26
Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur). However, note that
the loop still has to test the predicate bit: thus, on return,
VL is set to include both the elements that did not take the trap *and*
the elements that were predicated (masked) out (which were never tested),
up to the point where the trap occurred.
33
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
is permitted to take a trap as normal (as if ffirst were not set); in
subsequent *sub-groups*, however, the trap must not be allowed to occur
(processing stops instead). SUBVL will **NOT** be modified.
38
39 Given that predication bits apply to SUBVL groups, the same rules apply
40 to predicated-out (masked-out) sub-groups in calculating the value that VL
41 is set to.
42
43 ## Fail-on-first conditional tests
44
45 ffirst stops sequential element conditional testing on the first element result
46 being zero. VL is set to the number of elements that were processed before
47 the fail-condition was encountered.
48
Note that just as with traps, if SUBVL!=1, the first failing element of any
*sub-group* will cause processing to end, and, even if there were elements
within that *sub-group* that passed the test, the sub-group is still (entirely)
excluded from the count (from setting VL). i.e. VL is set to the total
number of *sub-groups* that had no fail-condition up until execution was
stopped.
55
56 Note again that, just as with traps, predicated-out (masked-out) elements
57 are included in the count leading up to the fail-condition, even though they
58 were not tested.
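
By way of illustration, here is a minimal python sketch (not part of the
specification; names are invented) of how VL would be truncated by
fail-on-first conditional testing when SUBVL groups are involved:

    # ffirst conditional-test sketch: VL is truncated to the number of
    # whole sub-groups processed before the first zero ("fail") result.
    def ffirst_truncate_vl(results, pred, VL, SUBVL):
        new_vl = 0
        for i in range(VL):
            if (pred >> i) & 1:              # masked-out groups are counted
                group = results[i*SUBVL:(i+1)*SUBVL]   # but never tested
                if any(r == 0 for r in group):
                    return new_vl            # fail: this group is excluded
            new_vl = i + 1
        return new_vl

    # VL=4, SUBVL=2: the second sub-group contains a zero, so VL becomes 1
    print(ffirst_truncate_vl([5, 3, 7, 0, 1, 1, 2, 2], 0b1111, 4, 2))  # -> 1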
59
60 # Instructions <a name="instructions" />
61
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
*All* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all explicit vector opcodes,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*, with the exception of CLIP and VSELECT.X.
Note that if RV64G ever gained a MV.X as well as an FCLIP, the full
functionality of RVV-Base would be obtained in SV.
72
73 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
74 equivalents, so are left out of Simple-V. VSELECT could be included if
75 there existed a MV.X instruction in RV (MV.X is a hypothetical
76 non-immediate variant of MV that would allow another register to
77 specify which register was to be copied). Note that if any of these three
78 instructions are added to any given RV extension, their functionality
79 will be inherently parallelised.
80
81 With some exceptions, where it does not make sense or is simply too
82 challenging, all RV-Base instructions are parallelised:
83
84 * CSR instructions, whilst a case could be made for fast-polling of
85 a CSR into multiple registers, or for being able to copy multiple
86 contiguously addressed CSRs into contiguous registers, and so on,
87 are the fundamental core basis of SV. If parallelised, extreme
88 care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
90 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
91 left as scalar.
92 * LR/SC could hypothetically be parallelised however their purpose is
93 single (complex) atomic memory operations where the LR must be followed
94 up by a matching SC. A sequence of parallel LR instructions followed
95 by a sequence of parallel SC instructions therefore is guaranteed to
96 not be useful. Not least: the guarantees of a Multi-LR/SC
97 would be impossible to provide if emulated in a trap.
98 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
99 paralleliseable anyway.
100
101 All other operations using registers are automatically parallelised.
102 This includes AMOMAX, AMOSWAP and so on, where particular care and
103 attention must be paid.
104
105 Example pseudo-code for an integer ADD operation (including scalar
106 operations). Floating-point uses the FP Register Table.
107
108 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
109
110 Note that for simplicity there is quite a lot missing from the above
111 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
112 reshaping and offsets and so on. However it demonstrates the basic
113 principle. Augmentations that produce the full pseudo-code are covered in
114 other sections.
115
116 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
117
118 Adding in support for SUBVL is a matter of adding in an extra inner
119 for-loop, where register src and dest are still incremented inside the
120 inner part. Note that the predication is still taken from the VL index.
121
122 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
123 indexed by "(i)"
124
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
148
149
150 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
151 elwidth handling etc. all left out.
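
For clarity, the same indexing rule (elements at "i*SUBVL + s", predicate
bit at "i") may be modelled in ordinary python. This is purely an
illustrative sketch, with a flat register file and invented helper names,
not a normative definition:

    # illustrative model of the SUBVL loop: element index is i*SUBVL+s,
    # predicate bit index is i (one bit per *group*, not per element).
    def sv_add_subvl(regs, rd, rs1, rs2, VL, SUBVL, predval,
                     rd_vec=True, rs1_vec=True, rs2_vec=True):
        for i in range(VL):
            if not ((predval >> i) & 1):
                continue                      # whole sub-group masked out
            for s in range(SUBVL):
                el = i * SUBVL + s            # element offset within the group
                d = rd  + el if rd_vec  else rd
                a = rs1 + el if rs1_vec else rs1
                b = rs2 + el if rs2_vec else rs2
                regs[d] = regs[a] + regs[b]
                if not rd_vec:
                    return                    # scalar destination: stop early

    regs = list(range(32)) + [0]*32           # toy 64-entry register file
    sv_add_subvl(regs, rd=40, rs1=2, rs2=10, VL=2, SUBVL=3, predval=0b11)
    print(regs[40:46])   # -> [12, 14, 16, 18, 20, 22]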
152
153 ## Instruction Format
154
155 It is critical to appreciate that there are
156 **no operations added to SV, at all**.
157
158 Instead, by using CSRs to tag registers as an indication of "changed
159 behaviour", SV *overloads* pre-existing branch operations into predicated
160 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
161 LOAD/STORE depending on CSR configurations for bitwidth and predication.
162 **Everything** becomes parallelised. *This includes Compressed
163 instructions* as well as any future instructions and Custom Extensions.
164
Note: using CSR tags to change the behaviour of instructions is nothing new, including
166 in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
167 FRM changes the behaviour of the floating-point unit, to alter the rounding
168 mode. Other architectures change the LOAD/STORE byte-order from big-endian
169 to little-endian on a per-instruction basis. SV is just a little more...
170 comprehensive in its effect on instructions.
171
172 ## Branch Instructions
173
174 Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
176 of multiple comparisons into a register (taken indirectly from the predicate
177 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
178 See ffirst mode in the Predication Table section.
179
180 ### Standard Branch <a name="standard_branch"></a>
181
182 Branch operations use standard RV opcodes that are reinterpreted to
183 be "predicate variants" in the instance where either of the two src
184 registers are marked as vectors (active=1, vector=1).
185
186 Note that the predication register to use (if one is enabled) is taken from
187 the *first* src register, and that this is used, just as with predicated
188 arithmetic operations, to mask whether the comparison operations take
189 place or not. The target (destination) predication register
190 to use (if one is enabled) is taken from the *second* src register.
191
192 If either of src1 or src2 are scalars (whether by there being no
193 CSR register entry or whether by the CSR entry specifically marking
194 the register as "scalar") the comparison goes ahead as vector-scalar
195 or scalar-vector.
196
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
202
203 Note that when zero-predication is enabled (from source rs1),
204 a cleared bit in the predicate indicates that the result
205 of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
207 when zeroing is not set: bits in the destination predicate are
208 only *set*; they are **not** cleared. This is important to appreciate,
209 as there may be an expectation that, going into the hardware-loop,
210 the destination predicate is always expected to be set to zero:
211 this is **not** the case. The destination predicate is only set
212 to zero if **zeroing** is enabled.
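
An informal python sketch of the difference (invented names; "dest" is the
pre-existing value of the destination predicate register):

    # predicated-compare destination-predicate update: with zeroing,
    # masked-out bits are cleared; without zeroing they are left untouched.
    def update_dest_pred(dest, ps, results, VL, zeroing):
        for i in range(VL):
            if not ((ps >> i) & 1):          # element is predicated out
                if zeroing:
                    dest &= ~(1 << i)        # zeroing: explicitly clear
                continue                     # non-zeroing: bit left as-is
            if results[i]:
                dest |= 1 << i               # test passed: set
            else:
                dest &= ~(1 << i)            # active-but-failed: cleared, as
            # in the compare pseudo-code earlier in this section
        return dest

    # dest starts as 0b0010; element 1 is masked out; all active tests pass
    print(bin(update_dest_pred(0b0010, 0b1101, [1, 1, 1, 1], 4, False)))  # 0b1111
    print(bin(update_dest_pred(0b0010, 0b1101, [1, 1, 1, 1], 4, True)))   # 0b1101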
213
214 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
216 src1 and src2.
217
218 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
219 for predicated compare operations of function "cmp":
220
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
225
226 With associated predication, vector-length adjustments and so on,
227 and temporarily ignoring bitwidth (which makes the comparisons more
228 complex), this becomes:
229
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
267
268 Notes:
269
270 * Predicated SIMD comparisons would break src1 and src2 further down
271 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
272 Reordering") setting Vector-Length times (number of SIMD elements) bits
273 in Predicate Register rd, as opposed to just Vector-Length bits.
274 * The execution of "parallelised" instructions **must** be implemented
275 as "re-entrant" (to use a term from software). If an exception (trap)
276 occurs during the middle of a vectorised
277 Branch (now a SV predicated compare) operation, the partial results
278 of any comparisons must be written out to the destination
279 register before the trap is permitted to begin. If however there
280 is no predicate, the **entire** set of comparisons must be **restarted**,
281 with the offset loop indices set back to zero. This is because
282 there is no place to store the temporary result during the handling
283 of traps.
284
285 TODO: predication now taken from src2. also branch goes ahead
286 if all compares are successful.
287
Note also that, where normally predication requires that there also
be a CSR register entry for the register being used in order for
the **predication** CSR register entry to be active, for branches
this is **not** the case: src2 does **not** have to have its CSR
register entry marked as active in order for predication on src2
to be active.
294
295 Also note: SV Branch operations are **not** twin-predicated
296 (see Twin Predication section). This would require three
297 element offsets: one to track src1, one to track src2 and a third
298 to track where to store the accumulation of the results. Given
299 that the element offsets need to be exposed via CSRs so that
300 the parallel hardware looping may be made re-entrant on traps
301 and exceptions, the decision was made not to make SV Branches
302 twin-predicated.
303
304 ### Floating-point Comparisons
305
There are no floating-point branch operations, only compares.
307 Interestingly no change is needed to the instruction format because
308 FP Compare already stores a 1 or a zero in its "rd" integer register
309 target, i.e. it's not actually a Branch at all: it's a compare.
310
311 In RV (scalar) Base, a branch on a floating-point compare is
312 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
313 This does extend to SV, as long as x1 (in the example sequence given)
314 is vectorised. When that is the case, x1..x(1+VL-1) will also be
315 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
316 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
317 so on. Consequently, unlike integer-branch, FP Compare needs no
318 modification in its behaviour.
319
320 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
321 and whilst in ordinary branch code this is fine because the standard
322 RVF compare can always be followed up with an integer BEQ or a BNE (or
323 a compressed comparison to zero or non-zero), in predication terms that
has more of an impact. To deal with this, SV's predication has
325 had "invert" added to it.
326
327 Also: note that FP Compare may be predicated, using the destination
328 integer register (rd) to determine the predicate. FP Compare is **not**
329 a twin-predication operation, as, again, just as with SV Branches,
330 there are three registers involved: FP src1, FP src2 and INT rd.
331
332 Also: note that ffirst (fail first mode) applies directly to this operation.
333
334 ### Compressed Branch Instruction
335
336 Compressed Branch instructions are, just like standard Branch instructions,
337 reinterpreted to be vectorised and predicated based on the source register
338 (rs1s) CSR entries. As however there is only the one source register,
given that c.beqz a1 is equivalent to beq a1, x0, the optional target
to store the results of the comparisons is taken from the CSR predication
table entries for **x0**.
342
343 The specific required use of x0 is, with a little thought, quite obvious,
but at first sight counterintuitive. Clearly it is **not** recommended to redirect
345 x0 with a CSR register entry, however as a means to opaquely obtain
346 a predication target it is the only sensible option that does not involve
347 additional special CSRs (or, worse, additional special opcodes).
348
349 Note also that, just as with standard branches, the 2nd source
350 (in this case x0 rather than src2) does **not** have to have its CSR
351 register table marked as "active" in order for predication to work.
352
353 ## Vectorised Dual-operand instructions
354
355 There is a series of 2-operand instructions involving copying (and
356 sometimes alteration):
357
358 * C.MV
359 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
360 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
361 * LOAD(-FP) and STORE(-FP)
362
363 All of these operations follow the same two-operand pattern, so it is
364 *both* the source *and* destination predication masks that are taken into
365 account. This is different from
366 the three-operand arithmetic instructions, where the predication mask
367 is taken from the *destination* register, and applied uniformly to the
368 elements of the source register(s), element-for-element.
369
370 The pseudo-code pattern for twin-predicated operations is as
371 follows:
372
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
386
387 This pattern covers scalar-scalar, scalar-vector, vector-scalar
388 and vector-vector, and predicated variants of all of those.
389 Zeroing is not presently included (TODO). As such, when compared
390 to RVV, the twin-predicated variants of C.MV and FMV cover
391 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
392 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
393
394 Note that:
395
396 * elwidth (SIMD) is not covered in the pseudo-code above
397 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
398 not covered
399 * zero predication is also not shown (TODO).
400
401 ### C.MV Instruction <a name="c_mv"></a>
402
403 There is no MV instruction in RV however there is a C.MV instruction.
404 It is used for copying integer-to-integer registers (vectorised FMV
405 is used for copying floating-point).
406
407 If either the source or the destination register are marked as vectors
408 C.MV is reinterpreted to be a vectorised (multi-register) predicated
409 move operation. The actual instruction's format does not change:
410
411 [[!table data="""
412 15 12 | 11 7 | 6 2 | 1 0 |
413 funct4 | rd | rs | op |
414 4 | 5 | 5 | 2 |
415 C.MV | dest | src | C0 |
416 """]]
417
418 A simplified version of the pseudocode for this operation is as follows:
419
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
433
434 There are several different instructions from RVV that are covered by
435 this one opcode:
436
437 [[!table data="""
438 src | dest | predication | op |
439 scalar | vector | none | VSPLAT |
440 scalar | vector | destination | sparse VSPLAT |
441 scalar | vector | 1-bit dest | VINSERT |
442 vector | scalar | 1-bit? src | VEXTRACT |
443 vector | vector | none | VCOPY |
444 vector | vector | src | Vector Gather |
445 vector | vector | dest | Vector Scatter |
446 vector | vector | src & dest | Gather/Scatter |
447 vector | vector | src == dest | sparse VCOPY |
448 """]]
449
450 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
451 operations with zeroing off, and inversion on the src and dest predication
452 for one of the two C.MV operations. The non-inverted C.MV will place
453 one set of registers into the destination, and the inverted one the other
454 set. With predicate-inversion, copying and inversion of the predicate mask
455 need not be done as a separate (scalar) instruction.
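
As an informal illustration (python, invented names, zeroing off), the two
back-to-back inverted-predicate moves described above combine two source
vectors like this:

    # VMERGE modelled as two predicated moves: the first C.MV copies "a"
    # elements where the predicate bit is set, the second (predicate
    # inverted) copies "b" elements where it is clear.
    def predicated_mv(dest, src, pred, invert=False):
        want = 0 if invert else 1
        for i in range(len(dest)):
            if ((pred >> i) & 1) != want:
                continue                        # masked out: dest untouched
            dest[i] = src[i]

    a, b = [10, 11, 12, 13], [20, 21, 22, 23]
    rd = [0, 0, 0, 0]
    predicated_mv(rd, a, 0b0101)                # take a[0], a[2]
    predicated_mv(rd, b, 0b0101, invert=True)   # take b[1], b[3]
    print(rd)   # -> [10, 21, 12, 23]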
456
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
462
463 ### FMV, FNEG and FABS Instructions
464
465 These are identical in form to C.MV, except covering floating-point
466 register copying. The same double-predication rules also apply.
467 However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
469 operation of the appropriate size covering the source and destination
470 register bitwidths.
471
472 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
473
### FCVT Instructions
475
476 These are again identical in form to C.MV, except that they cover
477 floating-point to integer and integer to floating-point. When element
478 width in each vector is set to default, the instructions behave exactly
479 as they are defined for standard RV (scalar) operations, except vectorised
480 in exactly the same fashion as outlined in C.MV.
481
482 However when the source or destination element width is not set to default,
483 the opcode's explicit element widths are *over-ridden* to new definitions,
484 and the opcode's element width is taken as indicative of the SIMD width
485 (if applicable i.e. if packed SIMD is requested) instead.
486
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 32-bit (single-precision) floating-point number in rd.
489 If however the source rs1 is set to be a vector, where elwidth is set to
490 default/2 and "packed SIMD" is enabled, then the first 32 bits of
491 rs1 are converted to a floating-point number to be stored in rd's
492 first element and the higher 32-bits *also* converted to floating-point
493 and stored in the second. The 32 bit size comes from the fact that
494 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
495 divide that by two it means that rs1 element width is to be taken as 32.
496
497 Similar rules apply to the destination register.
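
An illustrative python sketch of the packed-SIMD halving described above
(assumptions: rs1's elwidth override makes each source element 32 bits wide;
rounding details are simplified):

    import struct

    # FCVT.S.L with rs1 elwidth = default/2 on RV64: the 64-bit source
    # register is re-interpreted as two 32-bit integers, each converted
    # to single-precision and stored in consecutive destination elements.
    def fcvt_s_l_packed(rs1_value_64bit):
        lo = rs1_value_64bit & 0xFFFFFFFF
        hi = (rs1_value_64bit >> 32) & 0xFFFFFFFF
        # round-tripping through struct 'f' models rounding to single precision
        to_single = lambda x: struct.unpack('f', struct.pack('f', float(x)))[0]
        return [to_single(lo), to_single(hi)]

    print(fcvt_s_l_packed((7 << 32) | 3))   # -> [3.0, 7.0]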
498
499 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
500
501 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
502 the interpretation of the instruction fields). This
503 actually undermined the fundamental principle of SV, namely that there
504 be no modifications to the scalar behaviour (except where absolutely
505 necessary), in order to simplify an implementor's task if considering
506 converting a pre-existing scalar design to support parallelism.
507
508 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
509 do not change in SV, however just as with C.MV it is important to note
510 that dual-predication is possible.
511
512 In vectorised architectures there are usually at least two different modes
513 for LOAD/STORE:
514
515 * Read (or write for STORE) from sequential locations, where one
516 register specifies the address, and the one address is incremented
517 by a fixed amount. This is usually known as "Unit Stride" mode.
518 * Read (or write) from multiple indirected addresses, where the
519 vector elements each specify separate and distinct addresses.
520
521 To support these different addressing modes, the CSR Register "isvector"
522 bit is used. So, for a LOAD, when the src register is set to
523 scalar, the LOADs are sequentially incremented by the src register
524 element width, and when the src register is set to "vector", the
525 elements are treated as indirection addresses. Simplified
526 pseudo-code would look like this:
527
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
547
548 Notes:
549
550 * For simplicity, zeroing and elwidth is not included in the above:
551 the key focus here is the decision-making for srcbase; vectorised
552 rs means use sequentially-numbered registers as the indirection
553 address, and scalar rs is "offset" mode.
554 * The test towards the end for whether both source and destination are
555 scalar is what makes the above pseudo-code provide the "standard" RV
556 Base behaviour for LD operations.
557 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
559 (8 bytes), and also whether the element width is over-ridden
560 (see special element width section).
561
562 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
563
564 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
565 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
566 It is therefore possible to use predicated C.LWSP to efficiently
567 pop registers off the stack (by predicating x2 as the source), cherry-picking
568 which registers to store to (by predicating the destination). Likewise
569 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
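
A brief python sketch of the idea (illustrative only, with invented names:
"memory" holds the stack words, "sp" stands in for x2, and the destination
predicate cherry-picks which registers receive values):

    # predicated C.LWSP used as LOAD-multiple: sequential words are popped
    # from the stack (addressed by x2) into whichever destination registers
    # the predicate selects; other registers are skipped entirely.
    def predicated_lwsp(regs, memory, sp, dest_base, dest_pred, VL):
        i = 0                                   # source (stack word) index
        for j in range(VL):                     # destination element index
            if not ((dest_pred >> j) & 1):
                continue                        # register not selected
            regs[dest_base + j] = memory[sp + i]
            i += 1                              # next sequential stack word

    regs = [0] * 32
    memory = {100: 0xAAAA, 101: 0xBBBB, 102: 0xCCCC}
    predicated_lwsp(regs, memory, sp=100, dest_base=8, dest_pred=0b1011, VL=4)
    print(regs[8:12])   # -> [43690, 48059, 0, 52428]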
570
571 The two modes ("unit stride" and multi-indirection) are still supported,
572 as with standard LD/ST. Essentially, the only difference is that the
573 use of x2 is hard-coded into the instruction.
574
575 **Note**: it is still possible to redirect x2 to an alternative target
576 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
577 general-purpose LOAD/STORE operations.
578
579 ## Compressed LOAD / STORE Instructions
580
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting vector mode (rather than scalar)
on the src for LOAD, and on the dest for STORE, switches the mode from
"Unit Stride" to "Multi-indirection".
586
587 # Element bitwidth polymorphism <a name="elwidth"></a>
588
589 Element bitwidth is best covered as its own special section, as it
590 is quite involved and applies uniformly across-the-board. SV restricts
591 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
592
593 The effect of setting an element bitwidth is to re-cast each entry
594 in the register table, and for all memory operations involving
595 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, each register effectively
now looks like this:
598
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
608
609 where the CSR Register table entry (not the instruction alone) determines
610 which of those union entries is to be used on each operation, and the
611 VL element offset in the hardware-loop specifies the index into each array.
612
However a naive interpretation of the data structure above masks the
fact that, when the bitwidth is 8 and VL is set greater than 8 (for example),
accessing one specific register "spills over" into the following parts of
the register file in a sequential fashion. So a much more accurate way
to reflect this would be:
618
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
629
630 where when accessing any individual regfile[n].b entry it is permitted
631 (in c) to arbitrarily over-run the *declared* length of the array (zero),
632 and thus "overspill" to consecutive register file entries in a fashion
633 that is completely transparent to a greatly-simplified software / pseudo-code
634 representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register bytes
is ever attempted.
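
The "overspill" behaviour can be modelled in a few lines of python, treating
the whole integer register file as one flat byte array (an illustrative
sketch only; RV64, i.e. XLEN=64, assumed):

    # model the register file as one flat byte array so that elements of
    # any width "spill over" naturally into the next numbered register.
    XLEN_BYTES = 8
    regfile = bytearray(128 * XLEN_BYTES)     # 128 registers, RV64

    def set_element(reg, elwidth_bytes, offset, value):
        addr = reg * XLEN_BYTES + offset * elwidth_bytes
        regfile[addr:addr + elwidth_bytes] = value.to_bytes(elwidth_bytes, 'little')

    def get_element(reg, elwidth_bytes, offset):
        addr = reg * XLEN_BYTES + offset * elwidth_bytes
        return int.from_bytes(regfile[addr:addr + elwidth_bytes], 'little')

    # with elwidth=8-bit and VL=10, elements 8 and 9 of "x3" land in x4
    for i in range(10):
        set_element(3, 1, i, 0x10 + i)
    print(hex(get_element(4, 8, 0)))   # -> 0x1918 (x4 holds elements 8 and 9)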
639
Now we may modify the pseudo-code for an operation where all element bitwidths
have been set to the same size; this pseudo-code is otherwise identical
to its "non"-polymorphic version (above):
643
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
666
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each, respectively) are "type-cast"
to 8-bit; for 16-bit entries likewise, and so on.
670
671 However that only covers the case where the element widths are the same.
672 Where the element widths are different, the following algorithm applies:
673
674 * Analyse the bitwidth of all source operands and work out the
675 maximum. Record this as "maxsrcbitwidth"
676 * If any given source operand requires sign-extension or zero-extension
677 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
678 sign-extension / zero-extension or whatever is specified in the standard
679 RV specification, **change** that to sign-extending from the respective
680 individual source operand's bitwidth from the CSR table out to
681 "maxsrcbitwidth" (previously calculated), instead.
682 * Following separate and distinct (optional) sign/zero-extension of all
683 source operands as specifically required for that operation, carry out the
684 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
685 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
687 into a copy).
688 * If the destination operand requires sign-extension or zero-extension,
689 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
691 etc.), overload the RV specification with the bitwidth from the
692 destination register's elwidth entry.
693 * Finally, store the (optionally) sign/zero-extended value into its
694 destination: memory for sb/sw etc., or an offset section of the register
695 file for an arithmetic operation.
696
697 In this way, polymorphic bitwidths are achieved without requiring a
698 massive 64-way permutation of calculations **per opcode**, for example
699 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
700 rd bitwidths). The pseudo-code is therefore as follows:
701
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, ird, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
767
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
771
772 * the source operands are extended out to the maximum bitwidth of all
773 source operands
774 * the operation takes place at that maximum source bitwidth (the
775 destination bitwidth is not involved at this point, at all)
776 * the result is extended (or potentially even, truncated) before being
777 stored in the destination. i.e. truncation (if required) to the
778 destination width occurs **after** the operation **not** before.
779 * when the destination is not marked as "vectorised", the **full**
780 (standard, scalar) register file entry is taken up, i.e. the
781 element is either sign-extended or zero-extended to cover the
782 full register bitwidth (XLEN) if it is not already XLEN bits long.
783
784 Implementors are entirely free to optimise the above, particularly
785 if it is specifically known that any given operation will complete
786 accurately in less bits, as long as the results produced are
787 directly equivalent and equal, for all inputs and all outputs,
788 to those produced by the above algorithm.
789
790 ## Polymorphic floating-point operation exceptions and error-handling
791
For floating-point operations, conversion takes place without
raising any kind of exception. Exactly as specified in the standard
RV specification, NaN (or appropriate) is stored if the result
is beyond the range of the destination, and, again exactly as
with standard scalar RV operations, the floating-point flag is raised
(FCSR). And, again, just as with scalar operations, it is software's
responsibility to check this flag.
Given that the FCSR flags are "accrued", the fact that multiple element
operations could have occurred is not a problem.
801
802 Note that it is perfectly legitimate for floating-point bitwidths of
803 only 8 to be specified. However whilst it is possible to apply IEEE 754
804 principles, no actual standard yet exists. Implementors wishing to
805 provide hardware-level 8-bit support rather than throw a trap to emulate
806 in software should contact the author of this specification before
807 proceeding.
808
809 ## Polymorphic shift operators
810
811 A special note is needed for changing the element width of left and right
812 shift operators, particularly right-shift. Even for standard RV base,
813 in order for correct results to be returned, the second operand RS2 must
814 be truncated to be within the range of RS1's bitwidth. spike's implementation
815 of sll for example is as follows:
816
817 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
818
819 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
820 range 0..31 so that RS1 will only be left-shifted by the amount that
821 is possible to fit into a 32-bit register. Whilst this appears not
822 to matter for hardware, it matters greatly in software implementations,
823 and it also matters where an RV64 system is set to "RV32" mode, such
824 that the underlying registers RS1 and RS2 comprise 64 hardware bits
825 each.
826
827 For SV, where each operand's element bitwidth may be over-ridden, the
828 rule about determining the operation's bitwidth *still applies*, being
829 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
830 **also applies to the truncation of RS2**. In other words, *after*
831 determining the maximum bitwidth, RS2's range must **also be truncated**
832 to ensure a correct answer. Example:
833
834 * RS1 is over-ridden to a 16-bit width
835 * RS2 is over-ridden to an 8-bit width
836 * RD is over-ridden to a 64-bit width
837 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
838 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
839
840 Pseudocode (in spike) for this example would therefore be:
841
842 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
843
844 This example illustrates that considerable care therefore needs to be
845 taken to ensure that left and right shift operations are implemented
846 correctly. The key is that
847
848 * The operation bitwidth is determined by the maximum bitwidth
849 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate, as sketched below.
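
A small python sketch of the 16-bit/8-bit example above (illustrative only;
operand values are invented):

    # polymorphic SLL: the operation width is the maximum of the *source*
    # element widths (here max(16, 8) = 16), and RS2 is masked to that
    # width's shift range *before* the shift is performed.
    def sv_sll(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        shamt = rs2_val & (opwidth - 1)             # e.g. RS2 & (16-1)
        result = (rs1_val << shamt) & ((1 << opwidth) - 1)
        # zero-extension to the destination width shown here; some opcodes
        # would require sign-extension instead
        return result & ((1 << rd_width) - 1)

    # RS1 elwidth=16, RS2 elwidth=8, RD elwidth=64
    print(hex(sv_sll(0x0001, 16, 0x11, 8, 64)))     # shamt = 0x11 & 15 = 1 -> 0x2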
851
852 ## Polymorphic MULH/MULHU/MULHSU
853
854 MULH is designed to take the top half MSBs of a multiply that
855 does not fit within the range of the source operands, such that
856 smaller width operations may produce a full double-width multiply
857 in two cycles. The issue is: SV allows the source operands to
858 have variable bitwidth.
859
860 Here again special attention has to be paid to the rules regarding
861 bitwidth, which, again, are that the operation is performed at
862 the maximum bitwidth of the **source** registers. Therefore:
863
864 * An 8-bit x 8-bit multiply will create a 16-bit result that must
865 be shifted down by 8 bits
866 * A 16-bit x 8-bit multiply will create a 24-bit result that must
867 be shifted down by 16 bits (top 8 bits being zero)
868 * A 16-bit x 16-bit multiply will create a 32-bit result that must
869 be shifted down by 16 bits
870 * A 32-bit x 16-bit multiply will create a 48-bit result that must
871 be shifted down by 32 bits
872 * A 32-bit x 8-bit multiply will create a 40-bit result that must
873 be shifted down by 32 bits
874
875 So again, just as with shift-left and shift-right, the result
876 is shifted down by the maximum of the two source register bitwidths.
877 And, exactly again, truncation or sign-extension is performed on the
878 result. If sign-extension is to be carried out, it is performed
879 from the same maximum of the two source register bitwidths out
880 to the result element's bitwidth.
881
882 If truncation occurs, i.e. the top MSBs of the result are lost,
883 this is "Officially Not Our Problem", i.e. it is assumed that the
884 programmer actually desires the result to be truncated. i.e. if the
885 programmer wanted all of the bits, they would have set the destination
886 elwidth to accommodate them.
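
The rule can be captured in a few lines of python (an illustrative sketch;
the unsigned MULHU case is shown, and the operand values are invented):

    # polymorphic MULHU: multiply at full precision, then shift down by the
    # maximum of the two *source* element widths to obtain the "high half".
    def sv_mulhu(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        high = (rs1_val * rs2_val) >> opwidth      # discard the low "half"
        return high & ((1 << rd_width) - 1)        # truncate/extend to dest

    # 16-bit x 8-bit: 24-bit-capable result, shifted down by 16
    print(hex(sv_mulhu(0xFFFF, 16, 0xFF, 8, 16)))  # -> 0xfe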
887
888 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
889
890 Polymorphic element widths in vectorised form means that the data
891 being loaded (or stored) across multiple registers needs to be treated
892 (reinterpreted) as a contiguous stream of elwidth-wide items, where
893 the source register's element width is **independent** from the destination's.
894
895 This makes for a slightly more complex algorithm when using indirection
896 on the "addressed" register (source for LOAD and destination for STORE),
897 particularly given that the LOAD/STORE instruction provides important
898 information about the width of the data to be reinterpreted.
899
900 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
901 was as follows, and i is the loop from 0 to VL-1:
902
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
905
906 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
907 chunks are taken from the source memory location addressed by the current
908 indexed source address register, and only when a full 32-bits-worth
909 are taken will the index be moved on to the next contiguous source
910 address register:
911
    bitwidth = bw(elwidth); // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock; // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
917
918 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
919 and 128 for LQ.
920
921 The principle is basically exactly the same as if the srcbase were pointing
922 at the memory of the *register* file: memory is re-interpreted as containing
923 groups of elwidth-wide discrete elements.
924
925 When storing the result from a load, it's important to respect the fact
926 that the destination register has its *own separate element width*. Thus,
927 when each element is loaded (at the source element width), any sign-extension
928 or zero-extension (or truncation) needs to be done to the *destination*
929 bitwidth. Also, the storing has the exact same analogous algorithm as
930 above, where in fact it is just the set\_polymorphed\_reg pseudocode
931 (completely unchanged) used above.
932
933 One issue remains: when the source element width is **greater** than
934 the width of the operation, it is obvious that a single LB for example
935 cannot possibly obtain 16-bit-wide data. This condition may be detected
936 where, when using integer divide, elsperblock (the width of the LOAD
937 divided by the bitwidth of the element) is zero.
938
939 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
940
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
942
943 The elements, if the element bitwidth is larger than the LD operation's
944 size, will then be sign/zero-extended to the full LD operation size, as
945 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
946 being passed on to the second phase.
947
948 As LOAD/STORE may be twin-predicated, it is important to note that
949 the rules on twin predication still apply, except where in previous
950 pseudo-code (elwidth=default for both source and target) it was
951 the *registers* that the predication was applied to, it is now the
952 **elements** that the predication is applied to.
953
954 Thus the full pseudocode for all LD operations may be written out
955 as follows:
956
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, bitwidth, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
994
995 Note:
996
997 * when comparing against for example the twin-predicated c.mv
998 pseudo-code, the pattern of independent incrementing of rd and rs
999 is preserved unchanged.
1000 * just as with the c.mv pseudocode, zeroing is not included and must be
1001 taken into account (TODO).
1002 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1003 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1004 VSCATTER characteristics.
1005 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1006 a destination that is not vectorised (marked as scalar) will
1007 result in the element being fully sign-extended or zero-extended
1008 out to the full register file bitwidth (XLEN). When the source
1009 is also marked as scalar, this is how the compatibility with
1010 standard RV LOAD/STORE is preserved by this algorithm.
1011
1012 ### Example Tables showing LOAD elements
1013
1014 This section contains examples of vectorised LOAD operations, showing
1015 how the two stage process works (three if zero/sign-extension is included).
1016
1017
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1019
1020 This is:
1021
1022 * a 64-bit load, with an offset of zero
1023 * with a source-address elwidth of 16-bit
1024 * into a destination-register with an elwidth of 32-bit
1025 * where VL=7
1026 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1027 * RV64, where XLEN=64 is assumed.
1028
First, the memory table: because the element width is 16 and the operation
is LD (64), the 64 bits loaded from memory are subdivided into groups of
**four** elements. And, with VL being 7 (deliberately, to illustrate that
this is reasonable and possible), the first four are sourced from the offset
addresses pointed to by x5, and the next three from the offset addresses
pointed to by the next contiguous register, x6:
1036
1037 [[!table data="""
1038 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1039 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1040 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1041 """]]
1042
1043 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1044 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1045
[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0 | 0x0 | elem0 ||
0x0 | 0x0 | elem1 ||
0x0 | 0x0 | elem2 ||
0x0 | 0x0 | elem3 ||
0x0 | 0x0 | elem4 ||
0x0 | 0x0 | elem5 ||
0x0 | 0x0 | elem6 ||
"""]]
1057
1058 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1059 byte-addressable "memory". That "memory" happens to cover registers
1060 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1061
1062 [[!table data="""
1063 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1064 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1065 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1066 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1067 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1068 """]]
1069
1070 Thus we have data that is loaded from the **addresses** pointed to by
1071 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1072 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
1074 shifted up 32 bits, and so on, until finally element 6 is in the
1075 LSBs of x11.
1076
1077 Note that whilst the memory addressing table is shown left-to-right byte order,
1078 the registers are shown in right-to-left (MSB) order. This does **not**
1079 imply that bit or byte-reversal is carried out: it's just easier to visualise
1080 memory as being contiguous bytes, and emphasises that registers are not
1081 really actually "memory" as such.
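
For those who prefer to check the tables programmatically, here is a small
python sketch reproducing the two-stage process for this example
(assumptions: little-endian layout, and the addresses held in x5/x6 simply
point at seven consecutive 16-bit values 0x101..0x107):

    # LD x8, 0(x5) with src elwidth=16, dest elwidth=32, VL=7 (RV64)
    src_elems = [0x101 + i for i in range(7)]        # elements fetched via x5/x6
    dest = bytearray(4 * 8)                          # registers x8..x11 (zeroed)

    for j, el in enumerate(src_elems):
        el32 = el & 0xFFFFFFFF                       # zero-extend 16 -> 32 bits
        dest[j*4:(j+1)*4] = el32.to_bytes(4, 'little')

    for n in range(4):
        word = int.from_bytes(dest[n*8:(n+1)*8], 'little')
        print("x%d = 0x%016x" % (8 + n, word))
    # x8  = 0x0000010200000101   (elements 1 and 0)
    # ...
    # x11 = 0x0000000000000107   (element 6; in hardware the top 32 bits of
    #                             x11 would in fact remain UNMODIFIED)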
1082
1083 ## Why SV bitwidth specification is restricted to 4 entries
1084
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit
1090
This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer is
that it gets too complex, no RV128 implementation yet exists, and RV64's
default is in any case 64 bit, so the 4 major element widths are covered anyway.
1095
There is an absolutely crucial aspect of SV here that explicitly
1097 needs spelling out, and it's whether the "vectorised" bit is set in
1098 the Register's CSR entry.
1099
1100 If "vectorised" is clear (not set), this indicates that the operation
1101 is "scalar". Under these circumstances, when set on a destination (RD),
1102 then sign-extension and zero-extension, whilst changed to match the
1103 override bitwidth (if set), will erase the **full** register entry
1104 (64-bit if RV64).
1105
1106 When vectorised is *set*, this indicates that the operation now treats
1107 **elements** as if they were independent registers, so regardless of
1108 the length, any parts of a given actual register that are not involved
1109 in the operation are **NOT** modified, but are **PRESERVED**.
1110
1111 For example:
1112
1113 * when the vector bit is clear and elwidth set to 16 on the destination
1114 register, operations are truncated to 16 bit and then sign or zero
1115 extended to the *FULL* XLEN register width.
1116 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1117 groups of elwidth sized elements do not fill an entire XLEN register),
1118 the "top" bits of the destination register do *NOT* get modified, zero'd
1119 or otherwise overwritten.
1120
1121 SIMD micro-architectures may implement this by using predication on
1122 any elements in a given actual register that are beyond the end of
1123 multi-element operation.
1124
1125 Other microarchitectures may choose to provide byte-level write-enable
1126 lines on the register file, such that each 64 bit register in an RV64
1127 system requires 8 WE lines. Scalar RV64 operations would require
1128 activation of all 8 lines, where SV elwidth based operations would
1129 activate the required subset of those byte-level write lines.
1130
1131 Example:
1132
1133 * rs1, rs2 and rd are all set to 8-bit
1134 * VL is set to 3
1135 * RV64 architecture is set (UXL=64)
1136 * add operation is carried out
1137 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1138 concatenated with similar add operations on bits 15..8 and 7..0
1139 * bits 24 through 63 **remain as they originally were**.
1140
1141 Example SIMD micro-architectural implementation:
1142
1143 * SIMD architecture works out the nearest round number of elements
1144 that would fit into a full RV64 register (in this case: 8)
1145 * SIMD architecture creates a hidden predicate, binary 0b00000111
1146 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1147 * SIMD architecture goes ahead with the add operation as if it
1148 was a full 8-wide batch of 8 adds
1149 * SIMD architecture passes top 5 elements through the adders
1150 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them in rd.
1153
1154 This requires a read on rd, however this is required anyway in order
1155 to support non-zeroing mode.
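
A python sketch of that hidden-predicate trick (illustrative only; 8-bit
elements, VL=3, operand values invented):

    # SIMD-style implementation of an 8-bit elwidth add with VL=3 on a
    # 64-bit register: a hidden predicate masks off the top 5 lanes so
    # that their original bytes are written back unmodified.
    def simd_add8(rd_old, rs1, rs2, VL):
        lanes = 8                                    # 64-bit / 8-bit
        hidden_pred = (1 << VL) - 1                  # 0b00000111 for VL=3
        out = 0
        for lane in range(lanes):
            a = (rs1 >> (lane*8)) & 0xFF
            b = (rs2 >> (lane*8)) & 0xFF
            old = (rd_old >> (lane*8)) & 0xFF
            val = (a + b) & 0xFF if (hidden_pred >> lane) & 1 else old
            out |= val << (lane*8)
        return out

    print(hex(simd_add8(0xAAAAAAAAAAAAAAAA, 0x0000000000000102,
                        0x0000000000000304, VL=3)))  # -> 0xaaaaaaaaaa000406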
1156
1157 ## Polymorphic floating-point
1158
1159 Standard scalar RV integer operations base the register width on XLEN,
1160 which may be changed (UXL in USTATUS, and the corresponding MXL and
1161 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1162 arithmetic operations are therefore restricted to an active XLEN bits,
1163 with sign or zero extension to pad out the upper bits when XLEN has
1164 been dynamically set to less than the actual register size.
1165
1166 For scalar floating-point, the active (used / changed) bits are
1167 specified exclusively by the operation: ADD.S specifies an active
1168 32-bits, with the upper bits of the source registers needing to
1169 be all 1s ("NaN-boxed"), and the destination upper bits being
1170 *set* to all 1s (including on LOAD/STOREs).
1171
1172 Where elwidth is set to default (on any source or the destination)
1173 it is obvious that this NaN-boxing behaviour can and should be
1174 preserved. When elwidth is non-default things are less obvious,
1175 so need to be thought through. Here is a normal (scalar) sequence,
1176 assuming an RV64 which supports Quad (128-bit) FLEN:
1177
1178 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1179 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1180 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1181 top 64 MSBs ignored.
1182
1183 Therefore it makes sense to mirror this behaviour when, for example,
1184 elwidth is set to 32. Assume elwidth set to 32 on all source and
1185 destination registers:
1186
1187 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1188 floating-point numbers.
1189 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1190 in bits 0-31 and the second in bits 32-63.
1191 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1192
1193 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1194 of the registers either during the FLD **or** the ADD.D. The reason
1195 is that, effectively, the top 64 MSBs actually represent a completely
1196 independent 64-bit register, so overwriting it is not only gratuitous
1197 but may actually be harmful for a future extension to SV which may
1198 have a way to directly access those top 64 bits.
1199
1200 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1202 when "isvec" is false in a given register's CSR entry. Only when the
1203 elwidth is set to default **and** isvec is false will the standard
1204 RV behaviour be followed, namely that the upper bits be modified.
1205
1206 Ultimately if elwidth is default and isvec false on *all* source
1207 and destination registers, a SimpleV instruction defaults completely
1208 to standard RV scalar behaviour (this holds true for **all** operations,
1209 right across the board).
1210
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set to
non-default values, are effectively all the same: they all still perform
1213 multiple ADD operations, just at different widths. A future extension
1214 to SimpleV may actually allow ADD.S to access the upper bits of the
1215 register, effectively breaking down a 128-bit register into a bank
1216 of 4 independently-accesible 32-bit registers.
1217
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether FADD.S, FADD.D or FADD.Q is used,
using FADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar out-of-order
architecture there may be absolutely no difference; however, simpler
SIMD-style microarchitectures may not have the infrastructure in place
to know the difference, such that when VL=8 and an FADD.D instruction
is issued it completes in two cycles (or more) rather than one, where
an FADD.Q issued instead would complete in one.
1228
1229 ## Specific instruction walk-throughs
1230
1231 This section covers walk-throughs of the above-outlined procedure
1232 for converting standard RISC-V scalar arithmetic operations to
1233 polymorphic widths, to ensure that it is correct.
1234
1235 ### add
1236
1237 Standard Scalar RV32/RV64 (xlen):
1238
1239 * RS1 @ xlen bits
1240 * RS2 @ xlen bits
1241 * add @ xlen bits
1242 * RD @ xlen bits
1243
1244 Polymorphic variant:
1245
1246 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1247 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1248 * add @ max(rs1, rs2) bits
1249 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1250
Note here that polymorphic add zero-extends its source operands,
whereas addw sign-extends.
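
A minimal Python sketch of the polymorphic add rules listed above
(helper and function names are invented for illustration; widths are
in bits):

    def zero_extend(value, width):
        """Keep only the lowest 'width' bits (zero-extension is then implicit)."""
        return value & ((1 << width) - 1)

    def poly_add(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        src1 = zero_extend(rs1_val, rs1_width)        # RS1, zero-extended
        src2 = zero_extend(rs2_val, rs2_width)        # RS2, zero-extended
        result = zero_extend(src1 + src2, opwidth)    # add @ max(rs1, rs2) bits
        # RD @ rd bits: zero-extend if rd > opwidth (a no-op for an
        # unsigned value), otherwise truncate down to rd bits.
        return zero_extend(result, rd_width) if rd_width < opwidth else result

    # example: 8-bit sources into a 16-bit destination
    assert poly_add(0xFF, 8, 0x01, 8, 16) == 0x00   # the 8-bit add wraps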
1253
1254 ### addw
1255
The RV Specification specifically states that "W" variants of arithmetic
operations always produce 32-bit signed values. In a polymorphic
environment it is reasonable to assume that the signed aspect is
preserved, whilst the length of the operands and the result
may be changed.
1261
1262 Standard Scalar RV64 (xlen):
1263
1264 * RS1 @ xlen bits
1265 * RS2 @ xlen bits
1266 * add @ xlen bits
1267 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1268
1269 Polymorphic variant:
1270
1271 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1272 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1273 * add @ max(rs1, rs2) bits
1274 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1275
Note here that polymorphic addw sign-extends its source operands,
whereas add zero-extends.
1278
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1283
1284 Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
1285 where for add they are both zero-extended. This holds true for all arithmetic
1286 operations ending with "W".
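
The same sketch, adjusted for the signed "W" behaviour described above:
the source operands are sign- rather than zero-extended, which only
matters for the narrower of the two (again, names are illustrative only):

    def sign_extend(value, width):
        value &= (1 << width) - 1
        return value - (1 << width) if value & (1 << (width - 1)) else value

    def poly_addw(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        src1 = sign_extend(rs1_val, rs1_width)          # sign-extended
        src2 = sign_extend(rs2_val, rs2_width)          # sign-extended
        result = (src1 + src2) & ((1 << opwidth) - 1)   # add @ opwidth bits
        if rd_width >= opwidth:                         # sign-extend up to rd
            return sign_extend(result, opwidth) & ((1 << rd_width) - 1)
        return result & ((1 << rd_width) - 1)           # otherwise truncate

    # 8-bit -1 plus 16-bit 0: the 8-bit operand is sign-extended, result -1
    assert poly_addw(0xFF, 8, 0x0000, 16, 16) == 0xFFFF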
1287
1288 ### addiw
1289
1290 Standard Scalar RV64I:
1291
1292 * RS1 @ xlen bits, truncated to 32-bit
1293 * immed @ 12 bits, sign-extended to 32-bit
1294 * add @ 32 bits
* RD @ xlen bits. sign-extend the 32-bit result to xlen.
1296
1297 Polymorphic variant:
1298
1299 * RS1 @ rs1 bits
1300 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1301 * add @ max(rs1, 12) bits
1302 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
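
A corresponding illustrative sketch for addiw, where the 12-bit immediate
is sign-extended to max(rs1, 12) bits before the add (names invented for
illustration):

    def sign_extend(value, width):
        value &= (1 << width) - 1
        return value - (1 << width) if value & (1 << (width - 1)) else value

    def poly_addiw(rs1_val, rs1_width, imm12, rd_width):
        opwidth = max(rs1_width, 12)
        src1 = rs1_val & ((1 << rs1_width) - 1)         # RS1 @ rs1 bits
        imm = sign_extend(imm12, 12)                    # immed sign-extended
        result = (src1 + imm) & ((1 << opwidth) - 1)    # add @ max(rs1, 12) bits
        if rd_width >= opwidth:                         # sign-extend to rd bits
            return sign_extend(result, opwidth) & ((1 << rd_width) - 1)
        return result & ((1 << rd_width) - 1)           # otherwise truncate

    assert poly_addiw(0x10, 8, 0xFFF, 16) == 0x000F     # 0x10 + (-1)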
1303
1304 # Predication Element Zeroing
1305
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to be able to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1316
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1323
1324 ## Single-predication (based on destination register)
1325
1326 Zeroing on predication for arithmetic operations is taken from
1327 the destination register's predicate. i.e. the predication *and*
1328 zeroing settings to be applied to the whole operation come from the
1329 CSR Predication table entry for the destination register.
1330 Thus when zeroing is set on predication of a destination element,
1331 if the predication bit is clear, then the destination element is *set*
1332 to zero (twin-predication is slightly different, and will be covered
1333 next).
1334
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1337
    for (i = 0; i < VL; i++)
      if not zeroing: # an optimisation
        while (!(predval & 1<<i) && i < VL)
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
        if i == VL:
          return
      if (predval & 1<<i)
        src1 = ....
        src2 = ...
        result = src1 + src2 # actual add (or other op) here
        set_polymorphed_reg(rd, destwid, ird, result)
        if int_vec[rd].ffirst and result == 0:
          VL = i # result was zero, end loop early, return VL
          return
        if (!int_vec[rd].isvector) return
      else if zeroing:
        result = 0
        set_polymorphed_reg(rd, destwid, ird, result)
      if (int_vec[rd ].isvector)  { id += 1; }
      else if (predval & 1<<i) return
      if (int_vec[rs1].isvector)  { irs1 += 1; }
      if (int_vec[rs2].isvector)  { irs2 += 1; }
      if (id == VL or irs1 == VL or irs2 == VL): return
1364
1365 The optimisation to skip elements entirely is only possible for certain
1366 micro-architectures when zeroing is not set. However for lane-based
1367 micro-architectures this optimisation may not be practical, as it
1368 implies that elements end up in different "lanes". Under these
1369 circumstances it is perfectly fine to simply have the lanes
1370 "inactive" for predicated elements, even though it results in
1371 less than 100% ALU utilisation.
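
To summarise the difference between zeroing and non-zeroing (merging) on a
single destination predicate, here is a deliberately simplified executable
Python model; element widths, scalar/vector stepping and fail-first are
omitted, and all names are illustrative:

    # dest/src1/src2 are lists of element values; predval is a bitmask.
    def predicated_op(dest, src1, src2, predval, VL, zeroing,
                      op=lambda a, b: a + b):
        for i in range(VL):
            if predval & (1 << i):
                dest[i] = op(src1[i], src2[i])   # active element: do the op
            elif zeroing:
                dest[i] = 0                      # masked out + zeroing: store 0
            # masked out + non-zeroing: dest[i] is left untouched (merging)
        return dest

    old = [9, 9, 9, 9]
    print(predicated_op(old[:], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, 4, False))
    # -> [11, 9, 33, 9]   (non-zeroing: masked elements keep their old value)
    print(predicated_op(old[:], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, 4, True))
    # -> [11, 0, 33, 0]   (zeroing: masked elements are set to zero)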
1372
1373 ## Twin-predication (based on source and destination register)
1374
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1379
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1386
1387 When zeroing is set on the destination and not the source, then just
1388 as with single-predicated operations, a zero is stored into the destination
1389 element (or target memory address for a STORE).
1390
Zeroing on both source and destination effectively results in a bitwise
NAND of the source and destination predicates: only where *both* predicate
bits are set does actual data reach the destination, so wherever either the
source predicate OR the destination predicate is zero, a zero element will
ultimately end up in the destination register.
1395
1396 However: this may not necessarily be the case for all operations;
1397 implementors, particularly of custom instructions, clearly need to
1398 think through the implications in each and every case.
1399
1400 Here is pseudo-code for a twin zero-predicated operation:
1401
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL):
        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
        if ((pd & 1<<j))
          if ((ps & 1<<i))
            sourcedata = ireg[rs+i];
          else
            sourcedata = 0
          ireg[rd+j] <= sourcedata
        else if (zerodst)
          ireg[rd+j] <= 0
        if (int_csr[rs].isvec)
          i++;
        if (int_csr[rd].isvec)
          j++;
        else
          if ((pd & 1<<j))
            break;
1425
1426 Note that in the instance where the destination is a scalar, the hardware
1427 loop is ended the moment a value *or a zero* is placed into the destination
1428 register/element. Also note that, for clarity, variable element widths
1429 have been left out of the above.
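
The following simplified executable Python model of the above (assuming
both rs and rd are marked as vectors, and omitting element widths) may help
illustrate the combinations of source and destination zeroing; all names
are illustrative only:

    def twin_pred_mv(src, ps, zerosrc, dest, pd, zerodst, VL):
        i = j = 0
        while i < VL and j < VL:
            if not zerosrc:                   # skip masked-out source elements
                while i < VL and not (ps & (1 << i)):
                    i += 1
            if not zerodst:                   # skip masked-out dest elements
                while j < VL and not (pd & (1 << j)):
                    j += 1
            if i >= VL or j >= VL:
                break
            if pd & (1 << j):
                # active destination: copy data, or zero if the source
                # element is masked out (only possible when zerosrc is set)
                dest[j] = src[i] if (ps & (1 << i)) else 0
            elif zerodst:
                dest[j] = 0                   # destination masked out + zeroing
            i += 1
            j += 1
        return dest

    # zeroing on both: a zero lands wherever ps AND pd is clear (the NAND
    # of the two predicates, as described above)
    print(twin_pred_mv([5, 6, 7, 8], 0b0011, True, [1, 1, 1, 1], 0b0101, True, 4))
    # -> [5, 0, 0, 0]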
1430
1431 # Subsets of RV functionality
1432
1433 This section describes the differences when SV is implemented on top of
1434 different subsets of RV.
1435
1436 ## Common options
1437
1438 It is permitted to only implement SVprefix and not the VBLOCK instruction
1439 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1440 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1441 traps may emulate the format.
1442
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1447
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However, reducing
them below the mandatory limits set in the RV standard will result in
non-compliance with the SV Specification.
1452
1453 ## RV32 / RV32F
1454
1455 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1456 maximum limit for predication is also restricted to 32 bits. Whilst not
1457 actually specifically an "option" it is worth noting.
1458
1459 ## RV32G
1460
Normally in standard RV32 it does not make much sense to have
RV32G. The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1465
1466 In an earlier draft of SV, it was possible to specify an elwidth
1467 of double the standard register size: this had to be dropped,
1468 and may be reintroduced in future revisions.
1469
1470 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1471
1472 When floating-point is not implemented, the size of the User Register and
1473 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1474 per table).
1475
1476 ## RV32E
1477
1478 In embedded scenarios the User Register and Predication CSRs may be
1479 dropped entirely, or optionally limited to 1 CSR, such that the combined
1480 number of entries from the M-Mode CSR Register table plus U-Mode
1481 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1482 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1483 the Predication CSR tables.
1484
1485 RV32E is the most likely candidate for simply detecting that registers
1486 are marked as "vectorised", and generating an appropriate exception
1487 for the VL loop to be implemented in software.
1488
1489 ## RV128
1490
RV128 has not been especially considered here; however it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1495
1496 # Example usage
1497
1498 TODO evaluate strncpy and strlen
1499 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1500
1501 ## strncpy
1502
RVV version: <a name="strncpy"></a>
1504
1505 strncpy:
1506 mv a3, a0 # Copy dst
1507 loop:
1508 setvli x0, a2, vint8 # Vectors of bytes.
1509 vlbff.v v1, (a1) # Get src bytes
1510 vseq.vi v0, v1, 0 # Flag zero bytes
1511 vmfirst a4, v0 # Zero found?
vmsif.v v0, v0 # Set mask up to and including zero byte.
1513 vsb.v v1, (a3), v0.t # Write out bytes
1514 bgez a4, exit # Done
1515 csrr t1, vl # Get number of bytes fetched
1516 add a1, a1, t1 # Bump src pointer
1517 sub a2, a2, t1 # Decrement count.
1518 add a3, a3, t1 # Bump dst pointer
1519 bnez a2, loop # Anymore?
1520
1521 exit:
1522 ret
1523
1524 SV version (WIP):
1525
1526 strncpy:
1527 mv a3, a0
1528 SETMVLI 8 # set max vector to 8
1529 RegCSR[a3] = 8bit, a3, scalar
1530 RegCSR[a1] = 8bit, a1, scalar
1531 RegCSR[t0] = 8bit, t0, vector
1532 PredTb[t0] = ffirst, x0, inv
1533 loop:
1534 SETVLI a2, t4 # t4 and VL now 1..8
1535 ldb t0, (a1) # t0 fail first mode
1536 bne t0, x0, allnonzero # still ff
1537 # VL points to last nonzero
1538 GETVL t4 # from bne tests
1539 addi t4, t4, 1 # include zero
1540 SETVL t4 # set exactly to t4
1541 stb t0, (a3) # store incl zero
1542 ret # end subroutine
1543 allnonzero:
1544 stb t0, (a3) # VL legal range
1545 GETVL t4 # from bne tests
1546 add a1, a1, t4 # Bump src pointer
1547 sub a2, a2, t4 # Decrement count.
1548 add a3, a3, t4 # Bump dst pointer
1549 bnez a2, loop # Anymore?
1550 exit:
1551 ret
1552
1553 Notes:
1554
1555 * Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
1556 * obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
1557 * with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
1558 * RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
1559 * RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1560 * with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
1561 * setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
1562 * ldb and bne are both using t0, both in ffirst mode
* ldb will end on illegal mem, reduce VL, but will have copied all sorts of stuff into t0
1564 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
1566 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
1568 * SETVL sets *exactly* the requested amount into VL.
1569 * the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
1570 * this would cause the stb to copy up to the end of the legal memory
1571 * of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
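
For reference, here is a behavioural Python sketch of what the loop above
achieves; it models the outcome only, not the SV semantics, and note that,
unlike ISO C strncpy, the routine stops after the terminating zero byte
rather than zero-padding the remainder:

    def strncpy_model(dst: bytearray, src: bytes, n: int) -> bytearray:
        i = 0
        while i < n:
            dst[i] = src[i]          # copy one byte (including the zero)
            if src[i] == 0:
                break                # stop after the terminating zero
            i += 1
        return dst

    buf = bytearray(8)
    print(strncpy_model(buf, b"hi\x00junk", 8))
    # -> bytearray(b'hi\x00\x00\x00\x00\x00\x00')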
1572
1573 ## strcpy
1574
1575 RVV version:
1576
1577 mv a3, a0 # Save start
1578 loop:
1579 setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
1580 vldbff.v v1, (a3) # Get bytes
1581 csrr a1, vl # Get bytes actually read e.g. if fault
1582 vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
1583 add a3, a3, a1 # Bump pointer
1584 vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
1585 bltz a2, loop # Not found?
1586 add a0, a0, a1 # Sum start + bump
1587 add a3, a3, a2 # Add index of zero byte
1588 sub a0, a3, a0 # Subtract start address+bump
1589 ret
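
As a scalar cross-check, the chunked scan performed by the RVV sequence
above can be sketched in Python as follows; CHUNK stands in for the
hardware-selected vector length and is an arbitrary illustrative value:

    CHUNK = 16  # stands in for the vector length chosen by setvli

    def scan_for_zero(mem: bytes, start: int) -> int:
        """Return the offset of the first zero byte from 'start', scanning
        up to CHUNK bytes at a time (mirrors the loop structure above)."""
        a3 = start
        while True:
            chunk = mem[a3:a3 + CHUNK]        # vldbff.v: get up to CHUNK bytes
            if not chunk:
                raise ValueError("no terminating zero byte")  # sketch guard
            for idx, b in enumerate(chunk):   # vseq.vi + vmfirst
                if b == 0:
                    return (a3 + idx) - start # offset of the zero byte
            a3 += len(chunk)                  # bump pointer, scan next chunk

    print(scan_for_zero(b"hello world\x00 trailing", 0))   # -> 11
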