1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 25 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes
11
12 Fail-on-first data dependency has different behaviour for traps than
13 for conditional testing. "Conditional" is taken to mean "anything
14 that is zero", however with traps, the first element has to
15 be given the opportunity to throw the exact same trap that would
16 be thrown if this were a scalar operation (when VL=1).
17
Note that implementors are required to choose one mode or the other for any given instruction, mutually exclusively: an instruction is **not** permitted to fail on a trap *and* fail a conditional test. This advice applies to custom opcode writers as well as future extension writers.
19
20 ## Fail-on-first traps
21
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap,
it and all subsequent indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence element that did
not take the trap.
28
Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur). However, note that
the loop still had to test the predicate bit: thus on return,
VL is set to include the elements that did not take the trap *and*
the elements that were predicated (masked) out (and therefore not tested),
up to the point where the trap occurred.
35
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst were not set); in subsequent
*sub-groups* of elements, the trap must not occur. SUBVL will **NOT**
be modified.
40
41 Given that predication bits apply to SUBVL groups, the same rules apply
42 to predicated-out (masked-out) sub-groups in calculating the value that VL
43 is set to.
44
45 ## Fail-on-first conditional tests
46
47 ffirst stops sequential (or sequentially-appearing in the case of out-of-order designs)
48 element conditional testing on the first element result
49 being zero (or other "fail" condition).
50 VL is set to the number of elements that were (sequentially) processed before
51 the fail-condition was encountered.
52
Note that just as with traps, if SUBVL!=1, a fail-condition in any element of
a *sub-group* will cause the processing to end, and, even if there were elements within
the *sub-group* that passed the test, that sub-group is still (entirely)
excluded from the count (from setting VL). i.e. VL is set to the total
number of *sub-groups* that had no fail-condition up until execution was
stopped.
59
60 Note again that, just as with traps, predicated-out (masked-out) elements
61 are included in the (sequential)
62 count leading up to the fail-condition, even though they
63 were not tested.
64
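The following is a minimal C sketch (not part of the specification) of the
data-dependent fail-on-first rule for conditional tests, assuming SUBVL=1
and a hypothetical per-element helper `test_element` standing in for whatever
condition the instruction applies:

    #include <stdint.h>
    #include <stdbool.h>

    /* hypothetical stand-in for the per-element condition ("zero" fails) */
    static bool test_element(int64_t result) { return result != 0; }

    /* returns the new VL after a data-dependent ffirst sweep; predicated-out
       elements are not tested but still count towards the total */
    static int ffirst_conditional(const int64_t *results, uint64_t predmask, int vl)
    {
        for (int i = 0; i < vl; i++) {
            if (!(predmask & (1ULL << i)))
                continue;          /* masked out: skip the test, keep counting */
            if (!test_element(results[i]))
                return i;          /* VL = elements processed before the fail */
        }
        return vl;                 /* no fail-condition: VL unchanged */
    }
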
65 # Instructions <a name="instructions" />
66
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, SV requires no new instructions.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite none of the RVV opcodes being added,
*all instructions from RVV Base* (with the exception of CLIP and VSELECT.X)
*are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
77
78 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
79 equivalents, so are left out of Simple-V. VSELECT could be included if
80 there existed a MV.X instruction in RV (MV.X is a hypothetical
81 non-immediate variant of MV that would allow another register to
82 specify which register was to be copied). Note that if any of these three
83 instructions are added to any given RV extension, their functionality
84 will be inherently parallelised.
85
86 With some exceptions, where it does not make sense or is simply too
87 challenging, all RV-Base instructions are parallelised:
88
* CSR instructions, whilst a case could be made for fast-polling of
a CSR into multiple registers, or for being able to copy multiple
contiguously addressed CSRs into contiguous registers, and so on,
are the fundamental core basis of SV. If parallelised, extreme
care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
* LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
left as scalar.
* LR/SC could hypothetically be parallelised, however their purpose is
single (complex) atomic memory operations where the LR must be followed
up by a matching SC. A sequence of parallel LR instructions followed
by a sequence of parallel SC instructions therefore is guaranteed to
not be useful. Not least: the guarantees of a Multi-LR/SC
would be impossible to provide if emulated in a trap.
* EBREAK, NOP, FENCE and others do not use registers so are not inherently
parallelisable anyway.
105
106 All other operations using registers are automatically parallelised.
107 This includes AMOMAX, AMOSWAP and so on, where particular care and
108 attention must be paid.
109
110 Example pseudo-code for an integer ADD operation (including scalar
111 operations). Floating-point uses the FP Register Table.
112
113 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
114
115 Note that for simplicity there is quite a lot missing from the above
116 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
117 reshaping and offsets and so on. However it demonstrates the basic
118 principle. Augmentations that produce the full pseudo-code are covered in
119 other sections.
120
121 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
122
123 Adding in support for SUBVL is a matter of adding in an extra inner
124 for-loop, where register src and dest are still incremented inside the
125 inner part. Note that the predication is still taken from the VL index.
126
127 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
128 indexed by "(i)"
129
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
             # end VL hardware loop
             xSTATE.srcoffs = 0; # reset
             xSTATE.ssvoffs = 0; # reset
             return;
          }
153
154
155 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
156 elwidth handling etc. all left out.
157
158 ## Instruction Format
159
160 It is critical to appreciate that there are
161 **no operations added to SV, at all**.
162
163 Instead, by using CSRs to tag registers as an indication of "changed
164 behaviour", SV *overloads* pre-existing branch operations into predicated
165 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
166 LOAD/STORE depending on CSR configurations for bitwidth and predication.
167 **Everything** becomes parallelised. *This includes Compressed
168 instructions* as well as any future instructions and Custom Extensions.
169
Note: using CSR tags to change the behaviour of instructions is nothing new, including
171 in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
172 FRM changes the behaviour of the floating-point unit, to alter the rounding
173 mode. Other architectures change the LOAD/STORE byte-order from big-endian
174 to little-endian on a per-instruction basis. SV is just a little more...
175 comprehensive in its effect on instructions.
176
177 ## Branch Instructions
178
179 Branch operations are augmented slightly to be a little more like FP
180 Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
181 of multiple comparisons into a register (taken indirectly from the predicate
182 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
183 See ffirst mode in the Predication Table section.
184
185 ### Standard Branch <a name="standard_branch"></a>
186
187 Branch operations use standard RV opcodes that are reinterpreted to
188 be "predicate variants" in the instance where either of the two src
189 registers are marked as vectors (active=1, vector=1).
190
191 Note that the predication register to use (if one is enabled) is taken from
192 the *first* src register, and that this is used, just as with predicated
193 arithmetic operations, to mask whether the comparison operations take
194 place or not. The target (destination) predication register
195 to use (if one is enabled) is taken from the *second* src register.
196
197 If either of src1 or src2 are scalars (whether by there being no
198 CSR register entry or whether by the CSR entry specifically marking
199 the register as "scalar") the comparison goes ahead as vector-scalar
200 or scalar-vector.
201
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
207
Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared. This is important to appreciate,
as there may be an expectation that, going into the hardware-loop,
the destination predicate is always set to zero:
this is **not** the case. The destination predicate is only set
to zero if **zeroing** is enabled.
218
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2.
222
223 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
224 for predicated compare operations of function "cmp":
225
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
230
231 With associated predication, vector-length adjustments and so on,
232 and temporarily ignoring bitwidth (which makes the comparisons more
233 complex), this becomes:
234
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
272
273 Notes:
274
275 * Predicated SIMD comparisons would break src1 and src2 further down
276 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
277 Reordering") setting Vector-Length times (number of SIMD elements) bits
278 in Predicate Register rd, as opposed to just Vector-Length bits.
279 * The execution of "parallelised" instructions **must** be implemented
280 as "re-entrant" (to use a term from software). If an exception (trap)
281 occurs during the middle of a vectorised
282 Branch (now a SV predicated compare) operation, the partial results
283 of any comparisons must be written out to the destination
284 register before the trap is permitted to begin. If however there
285 is no predicate, the **entire** set of comparisons must be **restarted**,
286 with the offset loop indices set back to zero. This is because
287 there is no place to store the temporary result during the handling
288 of traps.
289
290 TODO: predication now taken from src2. also branch goes ahead
291 if all compares are successful.
292
293 Note also that where normally, predication requires that there must
294 also be a CSR register entry for the register being used in order
295 for the **predication** CSR register entry to also be active,
296 for branches this is **not** the case. src2 does **not** have
297 to have its CSR register entry marked as active in order for
298 predication on src2 to be active.
299
300 Also note: SV Branch operations are **not** twin-predicated
301 (see Twin Predication section). This would require three
302 element offsets: one to track src1, one to track src2 and a third
303 to track where to store the accumulation of the results. Given
304 that the element offsets need to be exposed via CSRs so that
305 the parallel hardware looping may be made re-entrant on traps
306 and exceptions, the decision was made not to make SV Branches
307 twin-predicated.
308
309 ### Floating-point Comparisons
310
There are no floating-point branch operations, only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
315
316 In RV (scalar) Base, a branch on a floating-point compare is
317 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
318 This does extend to SV, as long as x1 (in the example sequence given)
319 is vectorised. When that is the case, x1..x(1+VL-1) will also be
320 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
321 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
322 so on. Consequently, unlike integer-branch, FP Compare needs no
323 modification in its behaviour.
324
325 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
326 and whilst in ordinary branch code this is fine because the standard
327 RVF compare can always be followed up with an integer BEQ or a BNE (or
328 a compressed comparison to zero or non-zero), in predication terms that
329 becomes more of an impact. To deal with this, SV's predication has
330 had "invert" added to it.
331
332 Also: note that FP Compare may be predicated, using the destination
333 integer register (rd) to determine the predicate. FP Compare is **not**
334 a twin-predication operation, as, again, just as with SV Branches,
335 there are three registers involved: FP src1, FP src2 and INT rd.
336
337 Also: note that ffirst (fail first mode) applies directly to this operation.
338
339 ### Compressed Branch Instruction
340
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz a0 is equivalent to beq a0, x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
347
The specific required use of x0 is, with a little thought, quite obvious,
though at first counterintuitive. Clearly it is **not** recommended to redirect
350 x0 with a CSR register entry, however as a means to opaquely obtain
351 a predication target it is the only sensible option that does not involve
352 additional special CSRs (or, worse, additional special opcodes).
353
354 Note also that, just as with standard branches, the 2nd source
355 (in this case x0 rather than src2) does **not** have to have its CSR
356 register table marked as "active" in order for predication to work.
357
358 ## Vectorised Dual-operand instructions
359
360 There is a series of 2-operand instructions involving copying (and
361 sometimes alteration):
362
363 * C.MV
364 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
365 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
366 * LOAD(-FP) and STORE(-FP)
367
368 All of these operations follow the same two-operand pattern, so it is
369 *both* the source *and* destination predication masks that are taken into
370 account. This is different from
371 the three-operand arithmetic instructions, where the predication mask
372 is taken from the *destination* register, and applied uniformly to the
373 elements of the source register(s), element-for-element.
374
375 The pseudo-code pattern for twin-predicated operations is as
376 follows:
377
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
391
392 This pattern covers scalar-scalar, scalar-vector, vector-scalar
393 and vector-vector, and predicated variants of all of those.
394 Zeroing is not presently included (TODO). As such, when compared
395 to RVV, the twin-predicated variants of C.MV and FMV cover
396 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
397 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
398
399 Note that:
400
401 * elwidth (SIMD) is not covered in the pseudo-code above
402 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
403 not covered
404 * zero predication is also not shown (TODO).
405
406 ### C.MV Instruction <a name="c_mv"></a>
407
408 There is no MV instruction in RV however there is a C.MV instruction.
409 It is used for copying integer-to-integer registers (vectorised FMV
410 is used for copying floating-point).
411
412 If either the source or the destination register are marked as vectors
413 C.MV is reinterpreted to be a vectorised (multi-register) predicated
414 move operation. The actual instruction's format does not change:
415
416 [[!table data="""
417 15 12 | 11 7 | 6 2 | 1 0 |
418 funct4 | rd | rs | op |
419 4 | 5 | 5 | 2 |
420 C.MV | dest | src | C0 |
421 """]]
422
423 A simplified version of the pseudocode for this operation is as follows:
424
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
438
439 There are several different instructions from RVV that are covered by
440 this one opcode:
441
442 [[!table data="""
443 src | dest | predication | op |
444 scalar | vector | none | VSPLAT |
445 scalar | vector | destination | sparse VSPLAT |
446 scalar | vector | 1-bit dest | VINSERT |
447 vector | scalar | 1-bit? src | VEXTRACT |
448 vector | vector | none | VCOPY |
449 vector | vector | src | Vector Gather |
450 vector | vector | dest | Vector Scatter |
451 vector | vector | src & dest | Gather/Scatter |
452 vector | vector | src == dest | sparse VCOPY |
453 """]]
454
455 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
456 operations with zeroing off, and inversion on the src and dest predication
457 for one of the two C.MV operations. The non-inverted C.MV will place
458 one set of registers into the destination, and the inverted one the other
459 set. With predicate-inversion, copying and inversion of the predicate mask
460 need not be done as a separate (scalar) instruction.
461
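As an illustrative C sketch (helper names are hypothetical, not part of the
specification), the macro-op-fused pair behaves roughly as follows:

    #include <stdint.h>

    /* one predicated, non-zeroing element-copy pass (vector-vector case);
       "invert" models SV's predicate-inversion */
    static void predicated_mv(int64_t *dst, const int64_t *src,
                              uint64_t pred, int invert, int VL)
    {
        for (int i = 0; i < VL; i++) {
            uint64_t bit = (pred >> i) & 1;
            if (invert)
                bit ^= 1;
            if (bit)
                dst[i] = src[i];   /* zeroing off: unselected elements untouched */
        }
    }

    /* VMERGE rd, rs1, rs2 under predicate p, as two back-to-back C.MV passes */
    static void vmerge(int64_t *rd, const int64_t *rs1, const int64_t *rs2,
                       uint64_t p, int VL)
    {
        predicated_mv(rd, rs1, p, 0, VL);   /* elements where p[i] = 1 */
        predicated_mv(rd, rs2, p, 1, VL);   /* elements where p[i] = 0 */
    }
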
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
467
468 ### FMV, FNEG and FABS Instructions
469
470 These are identical in form to C.MV, except covering floating-point
471 register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
474 operation of the appropriate size covering the source and destination
475 register bitwidths.
476
477 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
478
### FCVT Instructions
480
481 These are again identical in form to C.MV, except that they cover
482 floating-point to integer and integer to floating-point. When element
483 width in each vector is set to default, the instructions behave exactly
484 as they are defined for standard RV (scalar) operations, except vectorised
485 in exactly the same fashion as outlined in C.MV.
486
487 However when the source or destination element width is not set to default,
488 the opcode's explicit element widths are *over-ridden* to new definitions,
489 and the opcode's element width is taken as indicative of the SIMD width
490 (if applicable i.e. if packed SIMD is requested) instead.
491
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision floating-point number in rd.
494 If however the source rs1 is set to be a vector, where elwidth is set to
495 default/2 and "packed SIMD" is enabled, then the first 32 bits of
496 rs1 are converted to a floating-point number to be stored in rd's
497 first element and the higher 32-bits *also* converted to floating-point
498 and stored in the second. The 32 bit size comes from the fact that
499 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
500 divide that by two it means that rs1 element width is to be taken as 32.
501
502 Similar rules apply to the destination register.
503
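A small C sketch of that packed-SIMD interpretation (illustrative only;
the element layout shown is an assumption of this example):

    #include <stdint.h>

    /* Sketch: FCVT.S.L with rs1's elwidth overridden to half of default
       and "packed SIMD" enabled -- the 64-bit source register is treated
       as two 32-bit integer elements, each converted independently to
       single-precision.  Purely illustrative, not the specification. */
    static void fcvt_s_l_packed(float rd_elems[2], uint64_t rs1)
    {
        rd_elems[0] = (float)(int32_t)(rs1 & 0xFFFFFFFFULL);  /* low 32 bits  */
        rd_elems[1] = (float)(int32_t)(rs1 >> 32);            /* high 32 bits */
    }
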
504 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
505
506 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
507 the interpretation of the instruction fields). This
508 actually undermined the fundamental principle of SV, namely that there
509 be no modifications to the scalar behaviour (except where absolutely
510 necessary), in order to simplify an implementor's task if considering
511 converting a pre-existing scalar design to support parallelism.
512
513 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
514 do not change in SV, however just as with C.MV it is important to note
515 that dual-predication is possible.
516
517 In vectorised architectures there are usually at least two different modes
518 for LOAD/STORE:
519
520 * Read (or write for STORE) from sequential locations, where one
521 register specifies the address, and the one address is incremented
522 by a fixed amount. This is usually known as "Unit Stride" mode.
523 * Read (or write) from multiple indirected addresses, where the
524 vector elements each specify separate and distinct addresses.
525
526 To support these different addressing modes, the CSR Register "isvector"
527 bit is used. So, for a LOAD, when the src register is set to
528 scalar, the LOADs are sequentially incremented by the src register
529 element width, and when the src register is set to "vector", the
530 elements are treated as indirection addresses. Simplified
531 pseudo-code would look like this:
532
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
552
553 Notes:
554
555 * For simplicity, zeroing and elwidth is not included in the above:
556 the key focus here is the decision-making for srcbase; vectorised
557 rs means use sequentially-numbered registers as the indirection
558 address, and scalar rs is "offset" mode.
559 * The test towards the end for whether both source and destination are
560 scalar is what makes the above pseudo-code provide the "standard" RV
561 Base behaviour for LD operations.
562 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
564 (8 bytes), and also whether the element width is over-ridden
565 (see special element width section).
566
567 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
568
569 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
570 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
571 It is therefore possible to use predicated C.LWSP to efficiently
572 pop registers off the stack (by predicating x2 as the source), cherry-picking
573 which registers to store to (by predicating the destination). Likewise
574 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
575
576 The two modes ("unit stride" and multi-indirection) are still supported,
577 as with standard LD/ST. Essentially, the only difference is that the
578 use of x2 is hard-coded into the instruction.
579
580 **Note**: it is still possible to redirect x2 to an alternative target
581 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
582 general-purpose LOAD/STORE operations.
583
584 ## Compressed LOAD / STORE Instructions
585
586 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
588 non-compressed LOAD/STORE. Again: setting scalar or vector mode
589 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
590 to "Multi-indirection", respectively.
591
592 # Element bitwidth polymorphism <a name="elwidth"></a>
593
594 Element bitwidth is best covered as its own special section, as it
595 is quite involved and applies uniformly across-the-board. SV restricts
596 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
597
598 The effect of setting an element bitwidth is to re-cast each entry
599 in the register table, and for all memory operations involving
600 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
602 now looks like this:
603
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
613
614 where the CSR Register table entry (not the instruction alone) determines
615 which of those union entries is to be used on each operation, and the
616 VL element offset in the hardware-loop specifies the index into each array.
617
However a naive interpretation of the data structure above masks the
fact that, when the bitwidth is 8 and VL is set greater than 8 (for example),
accessing one specific register "spills over" into the following entries of
the register file in a sequential fashion. So a much more accurate way
to reflect this would be:
623
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
634
where when accessing any individual regfile[n].b entry it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" into consecutive register file entries, in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an attempt to access beyond the "real" register
bytes is ever made.
644
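A C sketch of the resulting addressing model (illustrative, with a
hypothetical `element_ptr` helper) shows how an element index simply becomes
a byte offset into the register file, spilling into subsequent registers:

    #include <stdint.h>
    #include <stddef.h>
    #include <assert.h>

    #define NREGS 128            /* SV 7-bit regfile */
    #define XLEN_BYTES 8         /* RV64 */

    static uint8_t int_regfile_bytes[NREGS * XLEN_BYTES];

    /* Returns a pointer to element "offset" of width ew_bytes (1/2/4/8),
       counting from register "reg".  Element indices past one register
       simply continue into the next register's bytes; running off the end
       of the regfile is the implementor's responsibility to trap on
       (modelled here with an assert). */
    static void *element_ptr(int reg, int ew_bytes, int offset)
    {
        size_t byte_off = (size_t)reg * XLEN_BYTES + (size_t)offset * ew_bytes;
        assert(byte_off + ew_bytes <= sizeof(int_regfile_bytes));
        return &int_regfile_bytes[byte_off];
    }
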
645 Now we may modify pseudo-code an operation where all element bitwidths have
646 been set to the same size, where this pseudo-code is otherwise identical
647 to its "non" polymorphic versions (above):
648
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
671
672 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
673 following sequentially on respectively from the same) are "type-cast"
674 to 8-bit; for 16-bit entries likewise and so on.
675
676 However that only covers the case where the element widths are the same.
677 Where the element widths are different, the following algorithm applies:
678
679 * Analyse the bitwidth of all source operands and work out the
680 maximum. Record this as "maxsrcbitwidth"
681 * If any given source operand requires sign-extension or zero-extension
(lb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
683 sign-extension / zero-extension or whatever is specified in the standard
684 RV specification, **change** that to sign-extending from the respective
685 individual source operand's bitwidth from the CSR table out to
686 "maxsrcbitwidth" (previously calculated), instead.
687 * Following separate and distinct (optional) sign/zero-extension of all
688 source operands as specifically required for that operation, carry out the
689 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
690 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
692 into a copy).
693 * If the destination operand requires sign-extension or zero-extension,
694 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
696 etc.), overload the RV specification with the bitwidth from the
697 destination register's elwidth entry.
698 * Finally, store the (optionally) sign/zero-extended value into its
699 destination: memory for sb/sw etc., or an offset section of the register
700 file for an arithmetic operation.
701
702 In this way, polymorphic bitwidths are achieved without requiring a
703 massive 64-way permutation of calculations **per opcode**, for example
704 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
705 rd bitwidths). The pseudo-code is therefore as follows:
706
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, ird, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
772
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
776
777 * the source operands are extended out to the maximum bitwidth of all
778 source operands
779 * the operation takes place at that maximum source bitwidth (the
780 destination bitwidth is not involved at this point, at all)
781 * the result is extended (or potentially even, truncated) before being
782 stored in the destination. i.e. truncation (if required) to the
783 destination width occurs **after** the operation **not** before.
784 * when the destination is not marked as "vectorised", the **full**
785 (standard, scalar) register file entry is taken up, i.e. the
786 element is either sign-extended or zero-extended to cover the
787 full register bitwidth (XLEN) if it is not already XLEN bits long.
788
789 Implementors are entirely free to optimise the above, particularly
790 if it is specifically known that any given operation will complete
791 accurately in less bits, as long as the results produced are
792 directly equivalent and equal, for all inputs and all outputs,
793 to those produced by the above algorithm.
794
795 ## Polymorphic floating-point operation exceptions and error-handling
796
For floating-point operations, conversion takes place without
raising any kind of exception. Exactly as specified in the standard
RV specification, NaN (or the appropriate value) is stored if the result
is beyond the range of the destination, and, again exactly as
with standard RV scalar operations, the floating-point flag is raised
(in FCSR). And, again, just as with scalar operations, it is software's
responsibility to check this flag.
Given that the FCSR flags are "accrued", the fact that multiple element
operations could have occurred is not a problem.
806
807 Note that it is perfectly legitimate for floating-point bitwidths of
808 only 8 to be specified. However whilst it is possible to apply IEEE 754
809 principles, no actual standard yet exists. Implementors wishing to
810 provide hardware-level 8-bit support rather than throw a trap to emulate
811 in software should contact the author of this specification before
812 proceeding.
813
814 ## Polymorphic shift operators
815
816 A special note is needed for changing the element width of left and right
817 shift operators, particularly right-shift. Even for standard RV base,
818 in order for correct results to be returned, the second operand RS2 must
819 be truncated to be within the range of RS1's bitwidth. spike's implementation
820 of sll for example is as follows:
821
    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
823
824 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
825 range 0..31 so that RS1 will only be left-shifted by the amount that
826 is possible to fit into a 32-bit register. Whilst this appears not
827 to matter for hardware, it matters greatly in software implementations,
828 and it also matters where an RV64 system is set to "RV32" mode, such
829 that the underlying registers RS1 and RS2 comprise 64 hardware bits
830 each.
831
832 For SV, where each operand's element bitwidth may be over-ridden, the
833 rule about determining the operation's bitwidth *still applies*, being
834 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
835 **also applies to the truncation of RS2**. In other words, *after*
836 determining the maximum bitwidth, RS2's range must **also be truncated**
837 to ensure a correct answer. Example:
838
839 * RS1 is over-ridden to a 16-bit width
840 * RS2 is over-ridden to an 8-bit width
841 * RD is over-ridden to a 64-bit width
842 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
843 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
844
845 Pseudocode (in spike) for this example would therefore be:
846
    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
848
849 This example illustrates that considerable care therefore needs to be
850 taken to ensure that left and right shift operations are implemented
correctly. The key points are that:

* The operation bitwidth is determined by the maximum bitwidth
of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate
(see the sketch below).
856
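The following C sketch (hypothetical helper, little more than the rule
restated) shows the truncation being applied at the maximum *source*
element width rather than at XLEN:

    #include <stdint.h>

    /* Sketch of the SV polymorphic SLL rule: the operation width is the
       maximum of the two source element widths, and rs2 is truncated
       to that width before shifting.  Narrowing or extending to the
       destination element width is left to the caller. */
    static uint64_t poly_sll(uint64_t rs1, uint64_t rs2,
                             unsigned rs1_bits, unsigned rs2_bits)
    {
        unsigned opwidth = rs1_bits > rs2_bits ? rs1_bits : rs2_bits;
        uint64_t mask = (opwidth >= 64) ? ~0ULL : ((1ULL << opwidth) - 1);
        uint64_t shamt = rs2 & (opwidth - 1);   /* e.g. RS2 & (16-1) above */
        return ((rs1 & mask) << shamt) & mask;  /* operate at opwidth */
    }

With RS1 at 16-bit and RS2 at 8-bit this reduces to exactly the
`RS2 & (16-1)` truncation shown in the worked example above.
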
857 ## Polymorphic MULH/MULHU/MULHSU
858
859 MULH is designed to take the top half MSBs of a multiply that
860 does not fit within the range of the source operands, such that
861 smaller width operations may produce a full double-width multiply
862 in two cycles. The issue is: SV allows the source operands to
863 have variable bitwidth.
864
865 Here again special attention has to be paid to the rules regarding
866 bitwidth, which, again, are that the operation is performed at
867 the maximum bitwidth of the **source** registers. Therefore:
868
869 * An 8-bit x 8-bit multiply will create a 16-bit result that must
870 be shifted down by 8 bits
871 * A 16-bit x 8-bit multiply will create a 24-bit result that must
872 be shifted down by 16 bits (top 8 bits being zero)
873 * A 16-bit x 16-bit multiply will create a 32-bit result that must
874 be shifted down by 16 bits
875 * A 32-bit x 16-bit multiply will create a 48-bit result that must
876 be shifted down by 32 bits
877 * A 32-bit x 8-bit multiply will create a 40-bit result that must
878 be shifted down by 32 bits
879
880 So again, just as with shift-left and shift-right, the result
881 is shifted down by the maximum of the two source register bitwidths.
882 And, exactly again, truncation or sign-extension is performed on the
883 result. If sign-extension is to be carried out, it is performed
884 from the same maximum of the two source register bitwidths out
885 to the result element's bitwidth.
886
887 If truncation occurs, i.e. the top MSBs of the result are lost,
888 this is "Officially Not Our Problem", i.e. it is assumed that the
889 programmer actually desires the result to be truncated. i.e. if the
890 programmer wanted all of the bits, they would have set the destination
891 elwidth to accommodate them.
892
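A C sketch of this rule for MULHU (sources up to 32 bits shown for brevity;
64-bit sources would need a 128-bit intermediate; helper names are
illustrative only):

    #include <stdint.h>

    /* Sketch of polymorphic MULHU: multiply at the maximum source element
       width, then shift the double-width product down by that same width.
       Truncation/extension to the destination element width happens after. */
    static uint64_t poly_mulhu(uint32_t rs1, uint32_t rs2,
                               unsigned rs1_bits, unsigned rs2_bits)
    {
        unsigned opwidth = rs1_bits > rs2_bits ? rs1_bits : rs2_bits; /* 8/16/32 */
        uint64_t mask = (1ULL << opwidth) - 1;
        uint64_t prod = (uint64_t)(rs1 & mask) * (uint64_t)(rs2 & mask);
        return prod >> opwidth;   /* e.g. 16-bit x 8-bit: 24-bit result >> 16 */
    }
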
893 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
894
895 Polymorphic element widths in vectorised form means that the data
896 being loaded (or stored) across multiple registers needs to be treated
897 (reinterpreted) as a contiguous stream of elwidth-wide items, where
898 the source register's element width is **independent** from the destination's.
899
900 This makes for a slightly more complex algorithm when using indirection
901 on the "addressed" register (source for LOAD and destination for STORE),
902 particularly given that the LOAD/STORE instruction provides important
903 information about the width of the data to be reinterpreted.
904
905 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
906 was as follows, and i is the loop from 0 to VL-1:
907
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
910
911 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
912 chunks are taken from the source memory location addressed by the current
913 indexed source address register, and only when a full 32-bits-worth
914 are taken will the index be moved on to the next contiguous source
915 address register:
916
    bitwidth = bw(elwidth); // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock; // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
922
923 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
924 and 128 for LQ.
925
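In other words, each element index i selects both an indirection register
and a byte offset within the re-interpreted memory block. A C sketch of
that calculation (helper name and parameters are illustrative only):

    /* which indirection register (relative to rs) and which byte offset
       within the loaded block does element i come from?  opwidth_bits is
       8/16/32/64/128 for LB/LH/LW/LD/LQ; elwidth_bits is the source CSR
       element width */
    static void elem_addr(int i, int opwidth_bits, int elwidth_bits,
                          int *reg_step, int *byte_offs)
    {
        int elsperblock = opwidth_bits / elwidth_bits;
        if (elsperblock < 1)
            elsperblock = 1;              /* the "minimum of 1" rule, below */
        *reg_step  = i / elsperblock;     /* which ireg[rs + n] to use      */
        *byte_offs = (i % elsperblock) * (elwidth_bits / 8);
    }
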
926 The principle is basically exactly the same as if the srcbase were pointing
927 at the memory of the *register* file: memory is re-interpreted as containing
928 groups of elwidth-wide discrete elements.
929
930 When storing the result from a load, it's important to respect the fact
931 that the destination register has its *own separate element width*. Thus,
932 when each element is loaded (at the source element width), any sign-extension
933 or zero-extension (or truncation) needs to be done to the *destination*
934 bitwidth. Also, the storing has the exact same analogous algorithm as
935 above, where in fact it is just the set\_polymorphed\_reg pseudocode
936 (completely unchanged) used above.
937
938 One issue remains: when the source element width is **greater** than
939 the width of the operation, it is obvious that a single LB for example
940 cannot possibly obtain 16-bit-wide data. This condition may be detected
941 where, when using integer divide, elsperblock (the width of the LOAD
942 divided by the bitwidth of the element) is zero.
943
944 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
945
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
947
948 The elements, if the element bitwidth is larger than the LD operation's
949 size, will then be sign/zero-extended to the full LD operation size, as
950 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
951 being passed on to the second phase.
952
953 As LOAD/STORE may be twin-predicated, it is important to note that
954 the rules on twin predication still apply, except where in previous
955 pseudo-code (elwidth=default for both source and target) it was
956 the *registers* that the predication was applied to, it is now the
957 **elements** that the predication is applied to.
958
959 Thus the full pseudocode for all LD operations may be written out
960 as follows:
961
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        bitwidth = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
999
1000 Note:
1001
1002 * when comparing against for example the twin-predicated c.mv
1003 pseudo-code, the pattern of independent incrementing of rd and rs
1004 is preserved unchanged.
1005 * just as with the c.mv pseudocode, zeroing is not included and must be
1006 taken into account (TODO).
1007 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1008 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1009 VSCATTER characteristics.
1010 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1011 a destination that is not vectorised (marked as scalar) will
1012 result in the element being fully sign-extended or zero-extended
1013 out to the full register file bitwidth (XLEN). When the source
1014 is also marked as scalar, this is how the compatibility with
1015 standard RV LOAD/STORE is preserved by this algorithm.
1016
1017 ### Example Tables showing LOAD elements
1018
1019 This section contains examples of vectorised LOAD operations, showing
1020 how the two stage process works (three if zero/sign-extension is included).
1021
1022
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1024
1025 This is:
1026
1027 * a 64-bit load, with an offset of zero
1028 * with a source-address elwidth of 16-bit
1029 * into a destination-register with an elwidth of 32-bit
1030 * where VL=7
1031 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1032 * RV64, where XLEN=64 is assumed.
1033
First, the memory table: because the
element width is 16 and the operation is LD (64), the 64 bits
loaded from memory are subdivided into groups of **four** elements.
1037 And, with VL being 7 (deliberately to illustrate that this is reasonable
1038 and possible), the first four are sourced from the offset addresses pointed
to by x5, and the next three from the offset addresses pointed to by
1040 the next contiguous register, x6:
1041
1042 [[!table data="""
1043 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1044 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1045 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1046 """]]
1047
1048 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1049 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1050
1051 [[!table data="""
1052 byte 3 | byte 2 | byte 1 | byte 0 |
1053 0x0 | 0x0 | elem0 ||
1054 0x0 | 0x0 | elem1 ||
1055 0x0 | 0x0 | elem2 ||
1056 0x0 | 0x0 | elem3 ||
1057 0x0 | 0x0 | elem4 ||
1058 0x0 | 0x0 | elem5 ||
1059 0x0 | 0x0 | elem6 ||
1061 """]]
1062
1063 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1064 byte-addressable "memory". That "memory" happens to cover registers
1065 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1066
1067 [[!table data="""
1068 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1069 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1070 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1071 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1072 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1073 """]]
1074
1075 Thus we have data that is loaded from the **addresses** pointed to by
1076 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1077 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1081
1082 Note that whilst the memory addressing table is shown left-to-right byte order,
1083 the registers are shown in right-to-left (MSB) order. This does **not**
1084 imply that bit or byte-reversal is carried out: it's just easier to visualise
1085 memory as being contiguous bytes, and emphasises that registers are not
1086 really actually "memory" as such.
1087
1088 ## Why SV bitwidth specification is restricted to 4 entries
1089
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit

This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex, no RV128 implementation yet exists, and RV64's
default is 64 bit, so the 4 major element widths are covered anyway.
1100
There is an absolutely crucial aspect of SV here that explicitly
1102 needs spelling out, and it's whether the "vectorised" bit is set in
1103 the Register's CSR entry.
1104
If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when an elwidth override is set
on a destination (RD), sign-extension and zero-extension, whilst changed
to match the override bitwidth, will erase (overwrite) the **full** register
entry (64-bit if RV64).
1110
1111 When vectorised is *set*, this indicates that the operation now treats
1112 **elements** as if they were independent registers, so regardless of
1113 the length, any parts of a given actual register that are not involved
1114 in the operation are **NOT** modified, but are **PRESERVED**.
1115
1116 For example:
1117
1118 * when the vector bit is clear and elwidth set to 16 on the destination
1119 register, operations are truncated to 16 bit and then sign or zero
1120 extended to the *FULL* XLEN register width.
1121 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1122 groups of elwidth sized elements do not fill an entire XLEN register),
1123 the "top" bits of the destination register do *NOT* get modified, zero'd
1124 or otherwise overwritten.
1125
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of a
multi-element operation.
1129
1130 Other microarchitectures may choose to provide byte-level write-enable
1131 lines on the register file, such that each 64 bit register in an RV64
1132 system requires 8 WE lines. Scalar RV64 operations would require
1133 activation of all 8 lines, where SV elwidth based operations would
1134 activate the required subset of those byte-level write lines.
1135
1136 Example:
1137
1138 * rs1, rs2 and rd are all set to 8-bit
1139 * VL is set to 3
1140 * RV64 architecture is set (UXL=64)
1141 * add operation is carried out
1142 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1143 concatenated with similar add operations on bits 15..8 and 7..0
1144 * bits 24 through 63 **remain as they originally were**.
1145
1146 Example SIMD micro-architectural implementation:
1147
1148 * SIMD architecture works out the nearest round number of elements
1149 that would fit into a full RV64 register (in this case: 8)
1150 * SIMD architecture creates a hidden predicate, binary 0b00000111
1151 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1152 * SIMD architecture goes ahead with the add operation as if it
1153 was a full 8-wide batch of 8 adds
1154 * SIMD architecture passes top 5 elements through the adders
1155 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit quantities back unmodified
and stores them in rd.
1158
1159 This requires a read on rd, however this is required anyway in order
1160 to support non-zeroing mode.
1161
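A C sketch of the write-back step for this example (elwidth=8, VL=3, RV64),
purely illustrative of the byte-granular write-enable idea:

    #include <stdint.h>

    /* Byte-granular write-enable for a 64-bit destination register when
       elwidth=8 and VL=3: only the low VL bytes are written back; bytes
       3..7 of rd are preserved (non-zeroing behaviour). */
    static uint64_t write_back(uint64_t old_rd, uint64_t simd_result, int VL)
    {
        uint64_t wmask = 0;
        for (int i = 0; i < VL; i++)
            wmask |= 0xFFULL << (8 * i);        /* VL=3 -> 0x0000000000FFFFFF */
        return (old_rd & ~wmask) | (simd_result & wmask);
    }
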
1162 ## Polymorphic floating-point
1163
1164 Standard scalar RV integer operations base the register width on XLEN,
1165 which may be changed (UXL in USTATUS, and the corresponding MXL and
1166 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1167 arithmetic operations are therefore restricted to an active XLEN bits,
1168 with sign or zero extension to pad out the upper bits when XLEN has
1169 been dynamically set to less than the actual register size.
1170
1171 For scalar floating-point, the active (used / changed) bits are
1172 specified exclusively by the operation: ADD.S specifies an active
1173 32-bits, with the upper bits of the source registers needing to
1174 be all 1s ("NaN-boxed"), and the destination upper bits being
1175 *set* to all 1s (including on LOAD/STOREs).
1176
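For reference, a minimal C sketch of the scalar NaN-boxing rule for a
single-precision value held in a 64-bit FP register (illustrative only):

    #include <stdint.h>
    #include <string.h>

    /* Scalar RV NaN-boxing: a 32-bit value held in a wider FP register has
       all upper bits set to 1s.  Sketch for FLEN=64 holding a float. */
    static uint64_t nanbox_f32(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return 0xFFFFFFFF00000000ULL | bits;
    }
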
1177 Where elwidth is set to default (on any source or the destination)
1178 it is obvious that this NaN-boxing behaviour can and should be
1179 preserved. When elwidth is non-default things are less obvious,
1180 so need to be thought through. Here is a normal (scalar) sequence,
1181 assuming an RV64 which supports Quad (128-bit) FLEN:
1182
1183 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1184 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1185 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1186 top 64 MSBs ignored.
1187
1188 Therefore it makes sense to mirror this behaviour when, for example,
1189 elwidth is set to 32. Assume elwidth set to 32 on all source and
1190 destination registers:
1191
1192 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1193 floating-point numbers.
1194 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1195 in bits 0-31 and the second in bits 32-63.
1196 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1197
1198 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1199 of the registers either during the FLD **or** the ADD.D. The reason
1200 is that, effectively, the top 64 MSBs actually represent a completely
1201 independent 64-bit register, so overwriting it is not only gratuitous
1202 but may actually be harmful for a future extension to SV which may
1203 have a way to directly access those top 64 bits.
1204
1205 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1207 when "isvec" is false in a given register's CSR entry. Only when the
1208 elwidth is set to default **and** isvec is false will the standard
1209 RV behaviour be followed, namely that the upper bits be modified.
1210
1211 Ultimately if elwidth is default and isvec false on *all* source
1212 and destination registers, a SimpleV instruction defaults completely
1213 to standard RV scalar behaviour (this holds true for **all** operations,
1214 right across the board).
1215
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
to a non-default value, are effectively all the same: they all still perform
multiple ADD operations, just at different widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1222
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar OoO architecture
there may be absolutely no difference; however, simpler SIMD-style
microarchitectures may not have the infrastructure in place to know the
difference, such that when VL=8 and an ADD.D instruction is issued, it
completes in 2 cycles (or more) rather than one, where an ADD.Q issued
instead on such simpler microarchitectures would complete in one.
1233
1234 ## Specific instruction walk-throughs
1235
1236 This section covers walk-throughs of the above-outlined procedure
1237 for converting standard RISC-V scalar arithmetic operations to
1238 polymorphic widths, to ensure that it is correct.
1239
1240 ### add
1241
1242 Standard Scalar RV32/RV64 (xlen):
1243
1244 * RS1 @ xlen bits
1245 * RS2 @ xlen bits
1246 * add @ xlen bits
1247 * RD @ xlen bits
1248
1249 Polymorphic variant:
1250
1251 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1252 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1253 * add @ max(rs1, rs2) bits
* RD @ rd bits: zero-extend to rd if rd > max(rs1, rs2), otherwise truncate
1255
1256 Note here that polymorphic add zero-extends its source operands,
1257 where addw sign-extends.
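
The above rules can be captured in a short, non-normative sketch (poly_add
is a made-up helper name; reading and writing the register file is
abstracted away):

    def poly_add(rs1_val, rs2_val, rs1_w, rs2_w, rd_w):
        opw = max(rs1_w, rs2_w)
        a = rs1_val & ((1 << rs1_w) - 1)     # RS1, zero-extended to opw
        b = rs2_val & ((1 << rs2_w) - 1)     # RS2, zero-extended to opw
        result = (a + b) & ((1 << opw) - 1)  # add @ max(rs1, rs2) bits
        # RD: zero-extend if rd_w > opw (a no-op here), otherwise truncate
        return result & ((1 << rd_w) - 1)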
1258
1259 ### addw
1260
1261 The RV Specification specifically states that "W" variants of arithmetic
1262 operations always produce 32-bit signed values. In a polymorphic
1263 environment it is reasonable to assume that the signed aspect is
1264 preserved, where it is the length of the operands and the result
1265 that may be changed.
1266
1267 Standard Scalar RV64 (xlen):
1268
1269 * RS1 @ xlen bits
1270 * RS2 @ xlen bits
1271 * add @ xlen bits
1272 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1273
1274 Polymorphic variant:
1275
1276 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1277 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1278 * add @ max(rs1, rs2) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, rs2), otherwise truncate
1280
1281 Note here that polymorphic addw sign-extends its source operands,
1282 where add zero-extends.
1283
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.
1288
Effectively, however, both rs1 and rs2 are being sign-extended (or truncated),
where for add they are both zero-extended. This holds true for all arithmetic
operations ending with "W".
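
The same sketch, adapted to the sign-extending "W" behaviour (sign_extend
and poly_addw are again made-up helper names, not part of the spec):

    def sign_extend(val, width):
        val &= (1 << width) - 1
        return val - (1 << width) if val & (1 << (width - 1)) else val

    def poly_addw(rs1_val, rs2_val, rs1_w, rs2_w, rd_w):
        opw = max(rs1_w, rs2_w)
        a = sign_extend(rs1_val, rs1_w)      # RS1, sign-extended to opw
        b = sign_extend(rs2_val, rs2_w)      # RS2, sign-extended to opw
        result = (a + b) & ((1 << opw) - 1)  # add @ max(rs1, rs2) bits
        # RD: sign-extend if rd_w > opw, otherwise truncate
        return sign_extend(result, min(opw, rd_w)) & ((1 << rd_w) - 1)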
1292
1293 ### addiw
1294
1295 Standard Scalar RV64I:
1296
1297 * RS1 @ xlen bits, truncated to 32-bit
1298 * immed @ 12 bits, sign-extended to 32-bit
1299 * add @ 32 bits
* RD @ rd bits: sign-extend to rd if rd > 32, otherwise truncate.
1301
1302 Polymorphic variant:
1303
1304 * RS1 @ rs1 bits
1305 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1306 * add @ max(rs1, 12) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, 12), otherwise truncate
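
As a non-normative sketch, addiw then simply sign-extends the 12-bit
immediate to max(rs1, 12) bits before the add (reusing the hypothetical
sign_extend helper from the addw sketch above):

    def poly_addiw(rs1_val, imm12, rs1_w, rd_w):
        opw = max(rs1_w, 12)
        a = rs1_val & ((1 << rs1_w) - 1)     # RS1 @ rs1 bits
        b = sign_extend(imm12, 12)           # immed, sign-extended
        result = (a + b) & ((1 << opw) - 1)  # add @ max(rs1, 12) bits
        # RD: sign-extend if rd_w > opw, otherwise truncate
        return sign_extend(result, min(opw, rd_w)) & ((1 << rd_w) - 1)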
1308
1309 # Predication Element Zeroing
1310
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, allowing them to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1321
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1328
1329 ## Single-predication (based on destination register)
1330
1331 Zeroing on predication for arithmetic operations is taken from
1332 the destination register's predicate. i.e. the predication *and*
1333 zeroing settings to be applied to the whole operation come from the
1334 CSR Predication table entry for the destination register.
1335 Thus when zeroing is set on predication of a destination element,
1336 if the predication bit is clear, then the destination element is *set*
1337 to zero (twin-predication is slightly different, and will be covered
1338 next).
1339
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1342
    for (i = 0; i < VL; i++)
       if not zeroing: # an optimisation
          while (!(predval & 1<<i) && i < VL)
             if (int_vec[rd ].isvector)  { id += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
          if i == VL:
             return
       if (predval & 1<<i)
          src1 = ....
          src2 = ...
          result = src1 + src2 # actual add (or other op) here
          set_polymorphed_reg(rd, destwid, id, result)
          if int_vec[rd].ffirst and result == 0:
             VL = i # result was zero, end loop early, return VL
             return
          if (!int_vec[rd].isvector) return
       else if zeroing:
          result = 0 # predicate bit clear and zeroing set: zero the element
          set_polymorphed_reg(rd, destwid, id, result)
       if (int_vec[rd ].isvector)  { id += 1; }
       else if (predval & 1<<i) return
       if (int_vec[rs1].isvector)  { irs1 += 1; }
       if (int_vec[rs2].isvector)  { irs2 += 1; }
       if (id == VL or irs1 == VL or irs2 == VL): return
1369
1370 The optimisation to skip elements entirely is only possible for certain
1371 micro-architectures when zeroing is not set. However for lane-based
1372 micro-architectures this optimisation may not be practical, as it
1373 implies that elements end up in different "lanes". Under these
1374 circumstances it is perfectly fine to simply have the lanes
1375 "inactive" for predicated elements, even though it results in
1376 less than 100% ALU utilisation.
1377
1378 ## Twin-predication (based on source and destination register)
1379
Twin-predication is not that much different, except that the source is
zero-predicated independently of the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1384
When, with twin-predication, zeroing is set on the source and not
the destination, a clear source predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1391
1392 When zeroing is set on the destination and not the source, then just
1393 as with single-predicated operations, a zero is stored into the destination
1394 element (or target memory address for a STORE).
1395
Zeroing on both source and destination effectively combines the two
predicates: real data is copied only where *both* predicate bits are set
(a bitwise AND); wherever either the source predicate OR the destination
predicate is set to 0, a zero element will ultimately end up in the
destination register.
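
As a small, non-normative illustration (assuming both source and destination
are vectors, so the element indices advance in lockstep):

    def twin_zeroing_mv(src, dest, ps, pd, VL):
        # ps, pd: source / destination predicate masks, both with zeroing set
        for j in range(VL):
            if (pd >> j) & 1 and (ps >> j) & 1:
                dest[j] = src[j]   # both predicate bits set: real data copied
            else:
                dest[j] = 0        # either bit clear: a zero lands in dest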
1400
1401 However: this may not necessarily be the case for all operations;
1402 implementors, particularly of custom instructions, clearly need to
1403 think through the implications in each and every case.
1404
1405 Here is pseudo-code for a twin zero-predicated operation:
1406
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          # skip forward to the next enabled element (only when not zeroing)
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i))
                sourcedata = ireg[rs+i];
             else
                sourcedata = 0 # src predicate bit clear, zerosrc: pass a zero
             ireg[rd+j] <= sourcedata
          else if (zerodst)
             ireg[rd+j] <= 0 # dest predicate bit clear, zerodst: store a zero
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break; # scalar destination written: end the loop
1430
1431 Note that in the instance where the destination is a scalar, the hardware
1432 loop is ended the moment a value *or a zero* is placed into the destination
1433 register/element. Also note that, for clarity, variable element widths
1434 have been left out of the above.
1435
1436 # Subsets of RV functionality
1437
1438 This section describes the differences when SV is implemented on top of
1439 different subsets of RV.
1440
1441 ## Common options
1442
1443 It is permitted to only implement SVprefix and not the VBLOCK instruction
1444 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1445 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1446 traps may emulate the format.
1447
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1452
It is permitted to limit the size of either (or both) of the register files
down to the original size of the standard RV architecture. However, reducing
them below the mandatory limits set in the RV standard will result in
non-compliance with the SV Specification.
1457
1458 ## RV32 / RV32F
1459
1460 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1461 maximum limit for predication is also restricted to 32 bits. Whilst not
1462 actually specifically an "option" it is worth noting.
1463
1464 ## RV32G
1465
Normally, in standard RV32, it does not make much sense to have
RV32G. The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1470
1471 In an earlier draft of SV, it was possible to specify an elwidth
1472 of double the standard register size: this had to be dropped,
1473 and may be reintroduced in future revisions.
1474
1475 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1476
1477 When floating-point is not implemented, the size of the User Register and
1478 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1479 per table).
1480
1481 ## RV32E
1482
1483 In embedded scenarios the User Register and Predication CSRs may be
1484 dropped entirely, or optionally limited to 1 CSR, such that the combined
1485 number of entries from the M-Mode CSR Register table plus U-Mode
1486 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1487 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1488 the Predication CSR tables.
1489
1490 RV32E is the most likely candidate for simply detecting that registers
1491 are marked as "vectorised", and generating an appropriate exception
1492 for the VL loop to be implemented in software.
1493
1494 ## RV128
1495
RV128 has not been especially considered here; however, it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1500
1501 # Example usage
1502
1503 TODO evaluate strncpy and strlen
1504 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1505
1506 ## strncpy
1507
RVV version: <a name="strncpy"></a>
1509
1510 strncpy:
1511 mv a3, a0 # Copy dst
1512 loop:
1513 setvli x0, a2, vint8 # Vectors of bytes.
1514 vlbff.v v1, (a1) # Get src bytes
1515 vseq.vi v0, v1, 0 # Flag zero bytes
1516 vmfirst a4, v0 # Zero found?
vmsif.v v0, v0 # Set mask up to and including zero byte
1518 vsb.v v1, (a3), v0.t # Write out bytes
1519 bgez a4, exit # Done
1520 csrr t1, vl # Get number of bytes fetched
1521 add a1, a1, t1 # Bump src pointer
1522 sub a2, a2, t1 # Decrement count.
1523 add a3, a3, t1 # Bump dst pointer
1524 bnez a2, loop # Anymore?
1525
1526 exit:
1527 ret
1528
1529 SV version (WIP):
1530
1531 strncpy:
1532 mv a3, a0
1533 SETMVLI 8 # set max vector to 8
1534 RegCSR[a3] = 8bit, a3, scalar
1535 RegCSR[a1] = 8bit, a1, scalar
1536 RegCSR[t0] = 8bit, t0, vector
1537 PredTb[t0] = ffirst, x0, inv
1538 loop:
1539 SETVLI a2, t4 # t4 and VL now 1..8
1540 ldb t0, (a1) # t0 fail first mode
1541 bne t0, x0, allnonzero # still ff
1542 # VL points to last nonzero
1543 GETVL t4 # from bne tests
1544 addi t4, t4, 1 # include zero
1545 SETVL t4 # set exactly to t4
1546 stb t0, (a3) # store incl zero
1547 ret # end subroutine
1548 allnonzero:
1549 stb t0, (a3) # VL legal range
1550 GETVL t4 # from bne tests
1551 add a1, a1, t4 # Bump src pointer
1552 sub a2, a2, t4 # Decrement count.
1553 add a3, a3, t4 # Bump dst pointer
1554 bnez a2, loop # Anymore?
1555 exit:
1556 ret
1557
1558 Notes:
1559
1560 * Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
1561 * obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
1562 * with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
1563 * RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
1564 * RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1565 * with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
1566 * setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
1567 * ldb and bne are both using t0, both in ffirst mode
1568 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride, vectorised, no (un)sign-extension or truncation" mode.
* ldb will end on illegal mem, reduce VL, but will have copied arbitrary data into t0 up to that point (which could contain zeros).
1570 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
1572 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
1574 * SETVL sets *exactly* the requested amount into VL.
1575 * the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
1576 * this would cause the stb to copy up to the end of the legal memory
1577 * of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
1578
1579 ## strcpy
1580
1581 RVV version:
1582
1583 mv a3, a0 # Save start
1584 loop:
1585 setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
1586 vldbff.v v1, (a3) # Get bytes
1587 csrr a1, vl # Get bytes actually read e.g. if fault
1588 vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
1589 add a3, a3, a1 # Bump pointer
1590 vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
1591 bltz a2, loop # Not found?
1592 add a0, a0, a1 # Sum start + bump
1593 add a3, a3, a2 # Add index of zero byte
1594 sub a0, a3, a0 # Subtract start address+bump
1595 ret
1596
1597 ## DAXPY
1598
1599 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]