1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 25 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes
11
Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
that is zero"; with traps, however, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (with VL=1).
17
18 ## Fail-on-first traps
19
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
were clear). Should any subsequent element raise a trap, that element
and all subsequently-indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the number of elements up to and
including the *last* element that did not take the trap.
26
Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur). However, the
loop still had to test the predicate bit: thus on return, VL is set to
include both the elements that did not take the trap *and* the elements
that were predicated (masked) out, up to the point where the trap
occurred.
33
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst were not set); in subsequent
*sub-groups*, however, the trap must not occur. SUBVL will **NOT**
be modified.
38
39 Given that predication bits apply to SUBVL groups, the same rules apply
40 to predicated-out (masked-out) sub-groups in calculating the value that VL
41 is set to.
42
43 ## Fail-on-first conditional tests
44
45 ffirst stops sequential element conditional testing on the first element result
46 being zero. VL is set to the number of elements that were processed before
47 the fail-condition was encountered.
48
Note that, just as with traps, if SUBVL!=1, a fail-condition on any element
of a *sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped.
55
56 Note again that, just as with traps, predicated-out (masked-out) elements
57 are included in the count leading up to the fail-condition, even though they
58 were not tested.
59
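The following minimal C sketch (purely illustrative: the `element_test`
callback and the flat predicate layout are assumptions, not part of the
specification) shows how VL would be truncated under fail-on-first
conditional testing, with predicated-out sub-groups counted but never
tested:

    #include <stdbool.h>
    #include <stdint.h>

    /* Software model of ffirst conditional testing.  Elements are grouped
     * in SUBVL-wide sub-groups; one predicate bit covers each sub-group.
     * VL becomes the number of sub-groups processed before the first
     * failing (zero) test; masked-out sub-groups are counted but never
     * tested. */
    int ffirst_conditional(int VL, int SUBVL, uint64_t pred,
                           bool (*element_test)(int i, int s))
    {
        for (int i = 0; i < VL; i++) {
            if (!(pred & (1ULL << i)))
                continue;              /* masked out: counted, not tested */
            for (int s = 0; s < SUBVL; s++)
                if (!element_test(i, s))
                    return i;          /* new VL: sub-groups before the fail */
        }
        return VL;                     /* no fail-condition encountered */
    }
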
60 # Instructions <a name="instructions" />
61
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped; however, xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all dedicated vector
opcodes, with the exception of CLIP and VSELECT.X
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever gained
an MV.X instruction as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
72
73 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
74 equivalents, so are left out of Simple-V. VSELECT could be included if
75 there existed a MV.X instruction in RV (MV.X is a hypothetical
76 non-immediate variant of MV that would allow another register to
77 specify which register was to be copied). Note that if any of these three
78 instructions are added to any given RV extension, their functionality
79 will be inherently parallelised.
80
81 With some exceptions, where it does not make sense or is simply too
82 challenging, all RV-Base instructions are parallelised:
83
84 * CSR instructions, whilst a case could be made for fast-polling of
85 a CSR into multiple registers, or for being able to copy multiple
86 contiguously addressed CSRs into contiguous registers, and so on,
87 are the fundamental core basis of SV. If parallelised, extreme
88 care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
90 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
91 left as scalar.
92 * LR/SC could hypothetically be parallelised however their purpose is
93 single (complex) atomic memory operations where the LR must be followed
94 up by a matching SC. A sequence of parallel LR instructions followed
95 by a sequence of parallel SC instructions therefore is guaranteed to
96 not be useful. Not least: the guarantees of a Multi-LR/SC
97 would be impossible to provide if emulated in a trap.
98 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
99 paralleliseable anyway.
100
101 All other operations using registers are automatically parallelised.
102 This includes AMOMAX, AMOSWAP and so on, where particular care and
103 attention must be paid.
104
105 Example pseudo-code for an integer ADD operation (including scalar
106 operations). Floating-point uses the FP Register Table.
107
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
122
123 Note that for simplicity there is quite a lot missing from the above
124 pseudo-code: element widths, zeroing on predication, dimensional
125 reshaping and offsets and so on. However it demonstrates the basic
126 principle. Augmentations that produce the full pseudo-code are covered in
127 other sections.
128
129 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
130
131 Adding in support for SUBVL is a matter of adding in an extra inner
132 for-loop, where register src and dest are still incremented inside the
133 inner part. Note that the predication is still taken from the VL index.
134
135 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
136 indexed by "(i)"
137
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
161
162
163 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
164 elwidth handling etc. all left out.
165
166 ## Instruction Format
167
168 It is critical to appreciate that there are
169 **no operations added to SV, at all**.
170
171 Instead, by using CSRs to tag registers as an indication of "changed
172 behaviour", SV *overloads* pre-existing branch operations into predicated
173 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
174 LOAD/STORE depending on CSR configurations for bitwidth and predication.
175 **Everything** becomes parallelised. *This includes Compressed
176 instructions* as well as any future instructions and Custom Extensions.
177
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
180 FRM changes the behaviour of the floating-point unit, to alter the rounding
181 mode. Other architectures change the LOAD/STORE byte-order from big-endian
182 to little-endian on a per-instruction basis. SV is just a little more...
183 comprehensive in its effect on instructions.
184
185 ## Branch Instructions
186
187 Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
189 of multiple comparisons into a register (taken indirectly from the predicate
190 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
191 See ffirst mode in the Predication Table section.
192
193 ### Standard Branch <a name="standard_branch"></a>
194
195 Branch operations use standard RV opcodes that are reinterpreted to
196 be "predicate variants" in the instance where either of the two src
197 registers are marked as vectors (active=1, vector=1).
198
199 Note that the predication register to use (if one is enabled) is taken from
200 the *first* src register, and that this is used, just as with predicated
201 arithmetic operations, to mask whether the comparison operations take
202 place or not. The target (destination) predication register
203 to use (if one is enabled) is taken from the *second* src register.
204
205 If either of src1 or src2 are scalars (whether by there being no
206 CSR register entry or whether by the CSR entry specifically marking
207 the register as "scalar") the comparison goes ahead as vector-scalar
208 or scalar-vector.
209
In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
215
216 Note that when zero-predication is enabled (from source rs1),
217 a cleared bit in the predicate indicates that the result
218 of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
220 when zeroing is not set: bits in the destination predicate are
221 only *set*; they are **not** cleared. This is important to appreciate,
222 as there may be an expectation that, going into the hardware-loop,
223 the destination predicate is always expected to be set to zero:
224 this is **not** the case. The destination predicate is only set
225 to zero if **zeroing** is enabled.
226
227 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
229 src1 and src2.
230
231 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
232 for predicated compare operations of function "cmp":
233
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
238
239 With associated predication, vector-length adjustments and so on,
240 and temporarily ignoring bitwidth (which makes the comparisons more
241 complex), this becomes:
242
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
280
281 Notes:
282
283 * Predicated SIMD comparisons would break src1 and src2 further down
284 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
285 Reordering") setting Vector-Length times (number of SIMD elements) bits
286 in Predicate Register rd, as opposed to just Vector-Length bits.
287 * The execution of "parallelised" instructions **must** be implemented
288 as "re-entrant" (to use a term from software). If an exception (trap)
289 occurs during the middle of a vectorised
290 Branch (now a SV predicated compare) operation, the partial results
291 of any comparisons must be written out to the destination
292 register before the trap is permitted to begin. If however there
293 is no predicate, the **entire** set of comparisons must be **restarted**,
294 with the offset loop indices set back to zero. This is because
295 there is no place to store the temporary result during the handling
296 of traps.
297
298 TODO: predication now taken from src2. also branch goes ahead
299 if all compares are successful.
300
301 Note also that where normally, predication requires that there must
302 also be a CSR register entry for the register being used in order
303 for the **predication** CSR register entry to also be active,
304 for branches this is **not** the case. src2 does **not** have
305 to have its CSR register entry marked as active in order for
306 predication on src2 to be active.
307
308 Also note: SV Branch operations are **not** twin-predicated
309 (see Twin Predication section). This would require three
310 element offsets: one to track src1, one to track src2 and a third
311 to track where to store the accumulation of the results. Given
312 that the element offsets need to be exposed via CSRs so that
313 the parallel hardware looping may be made re-entrant on traps
314 and exceptions, the decision was made not to make SV Branches
315 twin-predicated.
316
317 ### Floating-point Comparisons
318
There are no floating-point branch operations, only compares.
320 Interestingly no change is needed to the instruction format because
321 FP Compare already stores a 1 or a zero in its "rd" integer register
322 target, i.e. it's not actually a Branch at all: it's a compare.
323
324 In RV (scalar) Base, a branch on a floating-point compare is
325 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
326 This does extend to SV, as long as x1 (in the example sequence given)
327 is vectorised. When that is the case, x1..x(1+VL-1) will also be
328 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
329 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
330 so on. Consequently, unlike integer-branch, FP Compare needs no
331 modification in its behaviour.
332
333 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
334 and whilst in ordinary branch code this is fine because the standard
335 RVF compare can always be followed up with an integer BEQ or a BNE (or
336 a compressed comparison to zero or non-zero), in predication terms that
337 becomes more of an impact. To deal with this, SV's predication has
338 had "invert" added to it.
339
340 Also: note that FP Compare may be predicated, using the destination
341 integer register (rd) to determine the predicate. FP Compare is **not**
342 a twin-predication operation, as, again, just as with SV Branches,
343 there are three registers involved: FP src1, FP src2 and INT rd.
344
345 Also: note that ffirst (fail first mode) applies directly to this operation.
346
347 ### Compressed Branch Instruction
348
349 Compressed Branch instructions are, just like standard Branch instructions,
350 reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz rs1 is equivalent to beq rs1, x0, the optional target
to store the results of the comparisons is taken from the CSR predication
table entries for **x0**.
355
356 The specific required use of x0 is, with a little thought, quite obvious,
357 but is counterintuitive. Clearly it is **not** recommended to redirect
358 x0 with a CSR register entry, however as a means to opaquely obtain
359 a predication target it is the only sensible option that does not involve
360 additional special CSRs (or, worse, additional special opcodes).
361
362 Note also that, just as with standard branches, the 2nd source
363 (in this case x0 rather than src2) does **not** have to have its CSR
364 register table marked as "active" in order for predication to work.
365
366 ## Vectorised Dual-operand instructions
367
368 There is a series of 2-operand instructions involving copying (and
369 sometimes alteration):
370
371 * C.MV
372 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
373 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
374 * LOAD(-FP) and STORE(-FP)
375
376 All of these operations follow the same two-operand pattern, so it is
377 *both* the source *and* destination predication masks that are taken into
378 account. This is different from
379 the three-operand arithmetic instructions, where the predication mask
380 is taken from the *destination* register, and applied uniformly to the
381 elements of the source register(s), element-for-element.
382
383 The pseudo-code pattern for twin-predicated operations is as
384 follows:
385
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
399
400 This pattern covers scalar-scalar, scalar-vector, vector-scalar
401 and vector-vector, and predicated variants of all of those.
402 Zeroing is not presently included (TODO). As such, when compared
403 to RVV, the twin-predicated variants of C.MV and FMV cover
404 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
405 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
406
407 Note that:
408
409 * elwidth (SIMD) is not covered in the pseudo-code above
410 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
411 not covered
412 * zero predication is also not shown (TODO).
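
To make the pattern above concrete, here is a small, self-contained C model
of the twin-predicated copy loop (the register-file array and `isvec` flags
are simplifications for illustration only, and the predicates are assumed
to have enough set bits, as in the pseudo-code). With a scalar source and
vector destination it behaves as VSPLAT; with a vector source and scalar
destination it extracts the first non-masked element (VEXTRACT):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define VL 4

    static void twin_pred_mv(uint64_t *reg, int rd, int rs,
                             bool rd_isvec, bool rs_isvec,
                             uint64_t ps, uint64_t pd)
    {
        for (int i = 0, j = 0; i < VL && j < VL;) {
            if (rs_isvec) while (!(ps & (1ULL << i))) i++;
            if (rd_isvec) while (!(pd & (1ULL << j))) j++;
            reg[rd + j] = reg[rs + i];
            if (rs_isvec) i++;
            if (rd_isvec) j++; else break;
        }
    }

    int main(void)
    {
        uint64_t reg[32] = {0};
        reg[5] = 0x99;                                       /* scalar source */
        twin_pred_mv(reg, 10, 5, true, false, ~0ULL, ~0ULL); /* VSPLAT        */
        for (int k = 0; k < VL; k++)
            printf("x%d = 0x%llx\n", 10 + k, (unsigned long long)reg[10 + k]);
        return 0;
    }
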
413
414 ### C.MV Instruction <a name="c_mv"></a>
415
416 There is no MV instruction in RV however there is a C.MV instruction.
417 It is used for copying integer-to-integer registers (vectorised FMV
418 is used for copying floating-point).
419
420 If either the source or the destination register are marked as vectors
421 C.MV is reinterpreted to be a vectorised (multi-register) predicated
422 move operation. The actual instruction's format does not change:
423
424 [[!table data="""
425 15 12 | 11 7 | 6 2 | 1 0 |
426 funct4 | rd | rs | op |
427 4 | 5 | 5 | 2 |
428 C.MV | dest | src | C0 |
429 """]]
430
431 A simplified version of the pseudocode for this operation is as follows:
432
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
446
447 There are several different instructions from RVV that are covered by
448 this one opcode:
449
450 [[!table data="""
451 src | dest | predication | op |
452 scalar | vector | none | VSPLAT |
453 scalar | vector | destination | sparse VSPLAT |
454 scalar | vector | 1-bit dest | VINSERT |
455 vector | scalar | 1-bit? src | VEXTRACT |
456 vector | vector | none | VCOPY |
457 vector | vector | src | Vector Gather |
458 vector | vector | dest | Vector Scatter |
459 vector | vector | src & dest | Gather/Scatter |
460 vector | vector | src == dest | sparse VCOPY |
461 """]]
462
463 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
464 operations with zeroing off, and inversion on the src and dest predication
465 for one of the two C.MV operations. The non-inverted C.MV will place
466 one set of registers into the destination, and the inverted one the other
467 set. With predicate-inversion, copying and inversion of the predicate mask
468 need not be done as a separate (scalar) instruction.
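
As an illustration only (simplified to a single predicate per copy rather
than full twin-predication), the VMERGE effect of the two back-to-back
predicated moves can be modelled in C as:

    #include <stdio.h>
    #include <stdint.h>

    /* VMERGE modelled as two non-zeroing predicated copies, the second
     * with the predicate inverted: elements with the predicate bit set
     * come from `a`, the rest from `b`. */
    #define VL 4

    static void pred_copy(uint64_t *dst, const uint64_t *src,
                          uint64_t pred, int invert)
    {
        for (int i = 0; i < VL; i++) {
            uint64_t bit = (pred >> i) & 1;
            if (bit ^ (uint64_t)invert)     /* zeroing off: skip, don't clear */
                dst[i] = src[i];
        }
    }

    int main(void)
    {
        uint64_t a[VL] = {1, 2, 3, 4}, b[VL] = {10, 20, 30, 40}, d[VL] = {0};
        uint64_t pred = 0x5;        /* 0b0101 */
        pred_copy(d, a, pred, 0);   /* first C.MV: predicate as-is      */
        pred_copy(d, b, pred, 1);   /* second C.MV: predicate inverted  */
        for (int i = 0; i < VL; i++)
            printf("d[%d] = %llu\n", i, (unsigned long long)d[i]);
        return 0;
    }
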
469
Note that in the instance where the Compressed Extension is not implemented,
MV may be used instead, but note that it is a pseudo-operation mapping to
addi rd, rs, 0. The behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
475
476 ### FMV, FNEG and FABS Instructions
477
478 These are identical in form to C.MV, except covering floating-point
479 register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
482 operation of the appropriate size covering the source and destination
483 register bitwidths.
484
485 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
486
### FCVT Instructions
488
489 These are again identical in form to C.MV, except that they cover
490 floating-point to integer and integer to floating-point. When element
491 width in each vector is set to default, the instructions behave exactly
492 as they are defined for standard RV (scalar) operations, except vectorised
493 in exactly the same fashion as outlined in C.MV.
494
495 However when the source or destination element width is not set to default,
496 the opcode's explicit element widths are *over-ridden* to new definitions,
497 and the opcode's element width is taken as indicative of the SIMD width
498 (if applicable i.e. if packed SIMD is requested) instead.
499
500 For example FCVT.S.L would normally be used to convert a 64-bit
501 integer in register rs1 to a 64-bit floating-point number in rd.
502 If however the source rs1 is set to be a vector, where elwidth is set to
503 default/2 and "packed SIMD" is enabled, then the first 32 bits of
504 rs1 are converted to a floating-point number to be stored in rd's
505 first element and the higher 32-bits *also* converted to floating-point
506 and stored in the second. The 32 bit size comes from the fact that
507 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
508 divide that by two it means that rs1 element width is to be taken as 32.
509
510 Similar rules apply to the destination register.
511
512 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
513
514 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
515 the interpretation of the instruction fields). This
516 actually undermined the fundamental principle of SV, namely that there
517 be no modifications to the scalar behaviour (except where absolutely
518 necessary), in order to simplify an implementor's task if considering
519 converting a pre-existing scalar design to support parallelism.
520
521 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
522 do not change in SV, however just as with C.MV it is important to note
523 that dual-predication is possible.
524
525 In vectorised architectures there are usually at least two different modes
526 for LOAD/STORE:
527
528 * Read (or write for STORE) from sequential locations, where one
529 register specifies the address, and the one address is incremented
530 by a fixed amount. This is usually known as "Unit Stride" mode.
531 * Read (or write) from multiple indirected addresses, where the
532 vector elements each specify separate and distinct addresses.
533
534 To support these different addressing modes, the CSR Register "isvector"
535 bit is used. So, for a LOAD, when the src register is set to
536 scalar, the LOADs are sequentially incremented by the src register
537 element width, and when the src register is set to "vector", the
538 elements are treated as indirection addresses. Simplified
539 pseudo-code would look like this:
540
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
560
561 Notes:
562
563 * For simplicity, zeroing and elwidth is not included in the above:
564 the key focus here is the decision-making for srcbase; vectorised
565 rs means use sequentially-numbered registers as the indirection
566 address, and scalar rs is "offset" mode.
567 * The test towards the end for whether both source and destination are
568 scalar is what makes the above pseudo-code provide the "standard" RV
569 Base behaviour for LD operations.
570 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
572 (8 bytes), and also whether the element width is over-ridden
573 (see special element width section).
574
575 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
576
577 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
578 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
579 It is therefore possible to use predicated C.LWSP to efficiently
580 pop registers off the stack (by predicating x2 as the source), cherry-picking
581 which registers to store to (by predicating the destination). Likewise
582 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
583
584 The two modes ("unit stride" and multi-indirection) are still supported,
585 as with standard LD/ST. Essentially, the only difference is that the
586 use of x2 is hard-coded into the instruction.
587
588 **Note**: it is still possible to redirect x2 to an alternative target
589 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
590 general-purpose LOAD/STORE operations.
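
A hypothetical software model of the "pop multiple" effect (register
numbers, the predicate value and the unit-stride source are all
illustrative assumptions, not a mandated sequence):

    #include <stdio.h>
    #include <stdint.h>

    /* Predicated C.LWSP as "pop multiple": x2 provides a unit-stride
     * source, and the destination predicate cherry-picks which registers
     * receive values. */
    #define VL 4

    int main(void)
    {
        uint32_t stack[VL] = {0x11, 0x22, 0x33, 0x44}; /* memory at [x2] */
        uint32_t reg[32]   = {0};
        uint32_t dest_pred = 0xb;                      /* 0b1011: x8, x9, x11 */
        int      rd_base   = 8;

        for (int i = 0, j = 0; i < VL && j < VL; i++, j++) {
            while (j < VL && !((dest_pred >> j) & 1))
                j++;                      /* skip masked-out destinations */
            if (j >= VL)
                break;
            reg[rd_base + j] = stack[i];  /* next stack slot, next chosen reg */
        }
        for (int k = 0; k < VL; k++)
            printf("x%d = 0x%x\n", rd_base + k, reg[rd_base + k]);
        return 0;
    }
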
591
592 ## Compressed LOAD / STORE Instructions
593
594 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
597 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
598 to "Multi-indirection", respectively.
599
600 # Element bitwidth polymorphism <a name="elwidth"></a>
601
602 Element bitwidth is best covered as its own special section, as it
603 is quite involved and applies uniformly across-the-board. SV restricts
604 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
605
606 The effect of setting an element bitwidth is to re-cast each entry
607 in the register table, and for all memory operations involving
608 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
610 now looks like this:
611
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
621
622 where the CSR Register table entry (not the instruction alone) determines
623 which of those union entries is to be used on each operation, and the
624 VL element offset in the hardware-loop specifies the index into each array.
625
However a naive interpretation of the data structure above masks the
fact that, when the bitwidth is 8 and VL is set greater than 8 (for example),
accessing one specific register "spills over" to the following parts of
the register file in a sequential fashion. So a much more accurate way
630 to reflect this would be:
631
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
642
where when accessing any individual regfile[n].b entry it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" to consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if any attempt is made to access beyond the
"real" register bytes.
652
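A small C sketch (helper name and layout assumed purely for illustration)
of how an element index at a given elwidth maps onto the byte-addressed
register file, spilling over into the following register:

    #include <stdio.h>

    /* RV64 assumed: 8 bytes per register. */
    #define REGWIDTH_BYTES 8

    static void locate(int reg, int elwidth_bits, int element)
    {
        int bytes_per_el = elwidth_bits / 8;
        int byteoffs     = element * bytes_per_el;        /* from start of reg */
        int actual_reg   = reg + byteoffs / REGWIDTH_BYTES;
        int offs_in_reg  = byteoffs % REGWIDTH_BYTES;
        printf("reg x%d elwidth=%d element %2d -> x%d, byte offset %d\n",
               reg, elwidth_bits, element, actual_reg, offs_in_reg);
    }

    int main(void)
    {
        /* with elwidth=8, elements 8..11 "spill over" into the next register */
        for (int e = 0; e < 12; e++)
            locate(5, 8, e);
        locate(5, 16, 5);   /* 16-bit element 5 lands in x6, bytes 2..3 */
        return 0;
    }
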
Now we may modify the pseudo-code for an operation where all element bitwidths
have been set to the same size, where this pseudo-code is otherwise identical
to its "non"-polymorphic version (above):
656
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
679
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each of them, respectively) are
"type-cast" to 8-bit; likewise for 16-bit entries, and so on.
683
684 However that only covers the case where the element widths are the same.
685 Where the element widths are different, the following algorithm applies:
686
687 * Analyse the bitwidth of all source operands and work out the
688 maximum. Record this as "maxsrcbitwidth"
689 * If any given source operand requires sign-extension or zero-extension
690 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
691 sign-extension / zero-extension or whatever is specified in the standard
692 RV specification, **change** that to sign-extending from the respective
693 individual source operand's bitwidth from the CSR table out to
694 "maxsrcbitwidth" (previously calculated), instead.
695 * Following separate and distinct (optional) sign/zero-extension of all
696 source operands as specifically required for that operation, carry out the
697 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
698 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
700 into a copy).
701 * If the destination operand requires sign-extension or zero-extension,
702 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
704 etc.), overload the RV specification with the bitwidth from the
705 destination register's elwidth entry.
706 * Finally, store the (optionally) sign/zero-extended value into its
707 destination: memory for sb/sw etc., or an offset section of the register
708 file for an arithmetic operation.
709
710 In this way, polymorphic bitwidths are achieved without requiring a
711 massive 64-way permutation of calculations **per opcode**, for example
712 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
713 rd bitwidths). The pseudo-code is therefore as follows:
714
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2)  # source element width(s)
    destwid = bw(int_csr[rd].elwidth)      # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
780
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
784
785 * the source operands are extended out to the maximum bitwidth of all
786 source operands
787 * the operation takes place at that maximum source bitwidth (the
788 destination bitwidth is not involved at this point, at all)
789 * the result is extended (or potentially even, truncated) before being
790 stored in the destination. i.e. truncation (if required) to the
791 destination width occurs **after** the operation **not** before.
792 * when the destination is not marked as "vectorised", the **full**
793 (standard, scalar) register file entry is taken up, i.e. the
794 element is either sign-extended or zero-extended to cover the
795 full register bitwidth (XLEN) if it is not already XLEN bits long.
796
797 Implementors are entirely free to optimise the above, particularly
798 if it is specifically known that any given operation will complete
799 accurately in less bits, as long as the results produced are
800 directly equivalent and equal, for all inputs and all outputs,
801 to those produced by the above algorithm.
802
803 ## Polymorphic floating-point operation exceptions and error-handling
804
805 For floating-point operations, conversion takes place without
806 raising any kind of exception. Exactly as specified in the standard
807 RV specification, NAN (or appropriate) is stored if the result
808 is beyond the range of the destination, and, again, exactly as
809 with the standard RV specification just as with scalar
810 operations, the floating-point flag is raised (FCSR). And, again, just as
811 with scalar operations, it is software's responsibility to check this flag.
812 Given that the FCSR flags are "accrued", the fact that multiple element
813 operations could have occurred is not a problem.
814
815 Note that it is perfectly legitimate for floating-point bitwidths of
816 only 8 to be specified. However whilst it is possible to apply IEEE 754
817 principles, no actual standard yet exists. Implementors wishing to
818 provide hardware-level 8-bit support rather than throw a trap to emulate
819 in software should contact the author of this specification before
820 proceeding.
821
822 ## Polymorphic shift operators
823
824 A special note is needed for changing the element width of left and right
825 shift operators, particularly right-shift. Even for standard RV base,
826 in order for correct results to be returned, the second operand RS2 must
827 be truncated to be within the range of RS1's bitwidth. spike's implementation
828 of sll for example is as follows:
829
    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
831
832 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
833 range 0..31 so that RS1 will only be left-shifted by the amount that
834 is possible to fit into a 32-bit register. Whilst this appears not
835 to matter for hardware, it matters greatly in software implementations,
836 and it also matters where an RV64 system is set to "RV32" mode, such
837 that the underlying registers RS1 and RS2 comprise 64 hardware bits
838 each.
839
840 For SV, where each operand's element bitwidth may be over-ridden, the
841 rule about determining the operation's bitwidth *still applies*, being
842 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
843 **also applies to the truncation of RS2**. In other words, *after*
844 determining the maximum bitwidth, RS2's range must **also be truncated**
845 to ensure a correct answer. Example:
846
847 * RS1 is over-ridden to a 16-bit width
848 * RS2 is over-ridden to an 8-bit width
849 * RD is over-ridden to a 64-bit width
850 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
851 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
852
853 Pseudocode (in spike) for this example would therefore be:
854
    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
856
857 This example illustrates that considerable care therefore needs to be
858 taken to ensure that left and right shift operations are implemented
859 correctly. The key is that
860
861 * The operation bitwidth is determined by the maximum bitwidth
862 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
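
A minimal C sketch of this rule, assuming illustrative helper names and
zero-extension/truncation of the result (the actual extension depends on
the opcode):

    #include <stdio.h>
    #include <stdint.h>

    /* Operation width is max(source elwidths); RS2 is masked to that
     * width's range *before* shifting; the result is then truncated
     * (or extended) to the destination elwidth. */
    static uint64_t sv_sll(uint64_t rs1, uint64_t rs2,
                           int rs1_wid, int rs2_wid, int rd_wid)
    {
        int opwid = rs1_wid > rs2_wid ? rs1_wid : rs2_wid;   /* max(8,16)=16 */
        uint64_t opmask = (opwid == 64) ? ~0ULL : ((1ULL << opwid) - 1);
        uint64_t shamt  = rs2 & (uint64_t)(opwid - 1);       /* RS2 truncated */
        uint64_t result = ((rs1 & opmask) << shamt) & opmask;
        uint64_t rdmask = (rd_wid == 64) ? ~0ULL : ((1ULL << rd_wid) - 1);
        return result & rdmask;       /* truncate/extend to dest width */
    }

    int main(void)
    {
        /* RS1 elwidth 16, RS2 elwidth 8, RD elwidth 64: shift happens at 16 bits */
        printf("0x%llx\n", (unsigned long long)sv_sll(0x00ff, 20, 16, 8, 64));
        return 0;
    }
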
864
865 ## Polymorphic MULH/MULHU/MULHSU
866
867 MULH is designed to take the top half MSBs of a multiply that
868 does not fit within the range of the source operands, such that
869 smaller width operations may produce a full double-width multiply
870 in two cycles. The issue is: SV allows the source operands to
871 have variable bitwidth.
872
873 Here again special attention has to be paid to the rules regarding
874 bitwidth, which, again, are that the operation is performed at
875 the maximum bitwidth of the **source** registers. Therefore:
876
877 * An 8-bit x 8-bit multiply will create a 16-bit result that must
878 be shifted down by 8 bits
879 * A 16-bit x 8-bit multiply will create a 24-bit result that must
880 be shifted down by 16 bits (top 8 bits being zero)
881 * A 16-bit x 16-bit multiply will create a 32-bit result that must
882 be shifted down by 16 bits
883 * A 32-bit x 16-bit multiply will create a 48-bit result that must
884 be shifted down by 32 bits
885 * A 32-bit x 8-bit multiply will create a 40-bit result that must
886 be shifted down by 32 bits
887
888 So again, just as with shift-left and shift-right, the result
889 is shifted down by the maximum of the two source register bitwidths.
890 And, exactly again, truncation or sign-extension is performed on the
891 result. If sign-extension is to be carried out, it is performed
892 from the same maximum of the two source register bitwidths out
893 to the result element's bitwidth.
894
895 If truncation occurs, i.e. the top MSBs of the result are lost,
896 this is "Officially Not Our Problem", i.e. it is assumed that the
897 programmer actually desires the result to be truncated. i.e. if the
898 programmer wanted all of the bits, they would have set the destination
899 elwidth to accommodate them.
900
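A short C sketch of the MULHU case under these rules (widths and helper
names are illustrative assumptions; `unsigned __int128` is a common
compiler extension used here only to hold the full product):

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply at max(source elwidths), take the top half by shifting
     * down by that same maximum, then truncate to the destination width. */
    static uint64_t sv_mulhu(uint64_t rs1, uint64_t rs2,
                             int rs1_wid, int rs2_wid, int rd_wid)
    {
        int opwid = rs1_wid > rs2_wid ? rs1_wid : rs2_wid;
        uint64_t m1 = (rs1_wid == 64) ? rs1 : (rs1 & ((1ULL << rs1_wid) - 1));
        uint64_t m2 = (rs2_wid == 64) ? rs2 : (rs2 & ((1ULL << rs2_wid) - 1));
        unsigned __int128 prod = (unsigned __int128)m1 * m2;
        uint64_t hi = (uint64_t)(prod >> opwid);  /* shift down by max source width */
        uint64_t rdmask = (rd_wid == 64) ? ~0ULL : ((1ULL << rd_wid) - 1);
        return hi & rdmask;
    }

    int main(void)
    {
        /* 16-bit x 8-bit: 24-bit product, top half obtained by shifting down 16 */
        printf("0x%llx\n", (unsigned long long)sv_mulhu(0x1234, 0xff, 16, 8, 16));
        return 0;
    }
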
901 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
902
903 Polymorphic element widths in vectorised form means that the data
904 being loaded (or stored) across multiple registers needs to be treated
905 (reinterpreted) as a contiguous stream of elwidth-wide items, where
906 the source register's element width is **independent** from the destination's.
907
908 This makes for a slightly more complex algorithm when using indirection
909 on the "addressed" register (source for LOAD and destination for STORE),
910 particularly given that the LOAD/STORE instruction provides important
911 information about the width of the data to be reinterpreted.
912
913 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
914 was as follows, and i is the loop from 0 to VL-1:
915
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
918
919 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
920 chunks are taken from the source memory location addressed by the current
921 indexed source address register, and only when a full 32-bits-worth
922 are taken will the index be moved on to the next contiguous source
923 address register:
924
    bitwidth = bw(elwidth); // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock; // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
930
931 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
932 and 128 for LQ.
933
934 The principle is basically exactly the same as if the srcbase were pointing
935 at the memory of the *register* file: memory is re-interpreted as containing
936 groups of elwidth-wide discrete elements.
937
938 When storing the result from a load, it's important to respect the fact
939 that the destination register has its *own separate element width*. Thus,
940 when each element is loaded (at the source element width), any sign-extension
941 or zero-extension (or truncation) needs to be done to the *destination*
942 bitwidth. Also, the storing has the exact same analogous algorithm as
943 above, where in fact it is just the set\_polymorphed\_reg pseudocode
944 (completely unchanged) used above.
945
946 One issue remains: when the source element width is **greater** than
947 the width of the operation, it is obvious that a single LB for example
948 cannot possibly obtain 16-bit-wide data. This condition may be detected
949 where, when using integer divide, elsperblock (the width of the LOAD
950 divided by the bitwidth of the element) is zero.
951
952 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
953
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
955
956 The elements, if the element bitwidth is larger than the LD operation's
957 size, will then be sign/zero-extended to the full LD operation size, as
958 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
959 being passed on to the second phase.
960
961 As LOAD/STORE may be twin-predicated, it is important to note that
962 the rules on twin predication still apply, except where in previous
963 pseudo-code (elwidth=default for both source and target) it was
964 the *registers* that the predication was applied to, it is now the
965 **elements** that the predication is applied to.
966
967 Thus the full pseudocode for all LD operations may be written out
968 as follows:
969
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = bw(int_csr[rd].elwidth)  # destination element width
        srcwid = bw(int_csr[rs].elwidth)   # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, srcwid))
            else:
                val = sign_extend(val, min(opwidth, srcwid))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1007
1008 Note:
1009
1010 * when comparing against for example the twin-predicated c.mv
1011 pseudo-code, the pattern of independent incrementing of rd and rs
1012 is preserved unchanged.
1013 * just as with the c.mv pseudocode, zeroing is not included and must be
1014 taken into account (TODO).
1015 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1016 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1017 VSCATTER characteristics.
1018 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1019 a destination that is not vectorised (marked as scalar) will
1020 result in the element being fully sign-extended or zero-extended
1021 out to the full register file bitwidth (XLEN). When the source
1022 is also marked as scalar, this is how the compatibility with
1023 standard RV LOAD/STORE is preserved by this algorithm.
1024
1025 ### Example Tables showing LOAD elements
1026
1027 This section contains examples of vectorised LOAD operations, showing
1028 how the two stage process works (three if zero/sign-extension is included).
1029
1030
1031 #### Example: LD x8, x5(0), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1032
1033 This is:
1034
1035 * a 64-bit load, with an offset of zero
1036 * with a source-address elwidth of 16-bit
1037 * into a destination-register with an elwidth of 32-bit
1038 * where VL=7
1039 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1040 * RV64, where XLEN=64 is assumed.
1041
First, the memory table: because the element width is 16 and the operation
is LD (64-bit), the 64 bits loaded from memory are subdivided into groups
of **four** elements. And, with VL being 7 (deliberately, to illustrate that
this is reasonable and possible), the first four are sourced from the offset
addresses pointed to by x5, and the next three from the offset addresses
pointed to by the next contiguous register, x6:
1049
1050 [[!table data="""
1051 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1052 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1053 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1054 """]]
1055
1056 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1057 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1058
1059 [[!table data="""
1060 byte 3 | byte 2 | byte 1 | byte 0 |
1061 0x0 | 0x0 | elem0 ||
1062 0x0 | 0x0 | elem1 ||
1063 0x0 | 0x0 | elem2 ||
1064 0x0 | 0x0 | elem3 ||
1065 0x0 | 0x0 | elem4 ||
1066 0x0 | 0x0 | elem5 ||
1067 0x0 | 0x0 | elem6 ||
1069 """]]
1070
1071 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1072 byte-addressable "memory". That "memory" happens to cover registers
1073 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1074
1075 [[!table data="""
1076 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1077 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1078 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1079 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1080 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1081 """]]
1082
1083 Thus we have data that is loaded from the **addresses** pointed to by
1084 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1085 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1089
1090 Note that whilst the memory addressing table is shown left-to-right byte order,
1091 the registers are shown in right-to-left (MSB) order. This does **not**
1092 imply that bit or byte-reversal is carried out: it's just easier to visualise
1093 memory as being contiguous bytes, and emphasises that registers are not
1094 really actually "memory" as such.
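
The placement in the example above can be recomputed with the following
short C sketch (names are local to this illustration and not part of the
specification):

    #include <stdio.h>

    /* LD, source elwidth 16, dest elwidth 32, VL=7, base registers x5
     * (source addresses) and x8 (destination), RV64. */
    int main(void)
    {
        int VL = 7, opwidth = 64, src_elwidth = 16, dest_elwidth = 32;
        int src_base = 5, dest_base = 8, xlen_bytes = 8;

        int elsperblock  = opwidth / src_elwidth;            /* 4 elements per LD */
        int dest_per_reg = xlen_bytes / (dest_elwidth / 8);   /* 2 per register   */

        for (int i = 0; i < VL; i++) {
            int src_reg   = src_base + i / elsperblock;
            int src_offs  = (i % elsperblock) * (src_elwidth / 8);
            int dest_reg  = dest_base + i / dest_per_reg;
            int dest_offs = (i % dest_per_reg) * (dest_elwidth / 8);
            printf("elem %d: loaded from @x%d + %d bytes -> x%d bytes %d..%d\n",
                   i, src_reg, src_offs, dest_reg, dest_offs,
                   dest_offs + dest_elwidth / 8 - 1);
        }
        return 0;
    }
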
1095
1096 ## Why SV bitwidth specification is restricted to 4 entries
1097
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit
1103
This would seem inadequate: surely it would be better to have 3 bits or
more and also allow 64, 128 and other options besides. The answer is
that it gets too complex, that no RV128 implementation yet exists, and that
RV64's default is already 64 bit, so the four major element widths are
covered anyway.
1108
There is an absolutely crucial aspect of SV here that explicitly
1110 needs spelling out, and it's whether the "vectorised" bit is set in
1111 the Register's CSR entry.
1112
If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when an elwidth override is set
on a destination (RD), sign-extension and zero-extension, whilst changed
to match the override bitwidth, will overwrite the **full** register entry
(64-bit if RV64).
1118
1119 When vectorised is *set*, this indicates that the operation now treats
1120 **elements** as if they were independent registers, so regardless of
1121 the length, any parts of a given actual register that are not involved
1122 in the operation are **NOT** modified, but are **PRESERVED**.
1123
1124 For example:
1125
1126 * when the vector bit is clear and elwidth set to 16 on the destination
1127 register, operations are truncated to 16 bit and then sign or zero
1128 extended to the *FULL* XLEN register width.
1129 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1130 groups of elwidth sized elements do not fill an entire XLEN register),
1131 the "top" bits of the destination register do *NOT* get modified, zero'd
1132 or otherwise overwritten.
1133
1134 SIMD micro-architectures may implement this by using predication on
1135 any elements in a given actual register that are beyond the end of
a multi-element operation.
1137
1138 Other microarchitectures may choose to provide byte-level write-enable
1139 lines on the register file, such that each 64 bit register in an RV64
1140 system requires 8 WE lines. Scalar RV64 operations would require
1141 activation of all 8 lines, where SV elwidth based operations would
1142 activate the required subset of those byte-level write lines.
1143
1144 Example:
1145
1146 * rs1, rs2 and rd are all set to 8-bit
1147 * VL is set to 3
1148 * RV64 architecture is set (UXL=64)
1149 * add operation is carried out
1150 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1151 concatenated with similar add operations on bits 15..8 and 7..0
1152 * bits 24 through 63 **remain as they originally were**.
1153
1154 Example SIMD micro-architectural implementation:
1155
1156 * SIMD architecture works out the nearest round number of elements
1157 that would fit into a full RV64 register (in this case: 8)
1158 * SIMD architecture creates a hidden predicate, binary 0b00000111
1159 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1160 * SIMD architecture goes ahead with the add operation as if it
1161 was a full 8-wide batch of 8 adds
1162 * SIMD architecture passes top 5 elements through the adders
1163 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
and stores them in rd.
1166
1167 This requires a read on rd, however this is required anyway in order
1168 to support non-zeroing mode.
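
A brief C sketch of the byte-level write-enable approach, using illustrative
values (elwidth=8, VL=3, RV64); the mask computation is one possible
implementation, not a requirement:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int VL = 3, elwidth_bytes = 1;              /* 8-bit elements */
        uint64_t we_bytes = (1ULL << (VL * elwidth_bytes)) - 1;  /* 0b111 */

        /* expand the per-byte enables into a per-bit mask */
        uint64_t bitmask = 0;
        for (int b = 0; b < 8; b++)
            if ((we_bytes >> b) & 1)
                bitmask |= 0xffULL << (8 * b);

        uint64_t old_rd = 0xdeadbeefcafef00dULL;
        uint64_t alu    = 0x0000000000112233ULL;    /* three 8-bit add results */
        /* bytes 3..7 of rd are preserved; only the enabled bytes change */
        uint64_t new_rd = (old_rd & ~bitmask) | (alu & bitmask);

        printf("new rd = 0x%016llx\n", (unsigned long long)new_rd);
        return 0;
    }
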
1169
1170 ## Polymorphic floating-point
1171
1172 Standard scalar RV integer operations base the register width on XLEN,
1173 which may be changed (UXL in USTATUS, and the corresponding MXL and
1174 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1175 arithmetic operations are therefore restricted to an active XLEN bits,
1176 with sign or zero extension to pad out the upper bits when XLEN has
1177 been dynamically set to less than the actual register size.
1178
1179 For scalar floating-point, the active (used / changed) bits are
1180 specified exclusively by the operation: ADD.S specifies an active
1181 32-bits, with the upper bits of the source registers needing to
1182 be all 1s ("NaN-boxed"), and the destination upper bits being
1183 *set* to all 1s (including on LOAD/STOREs).
1184
1185 Where elwidth is set to default (on any source or the destination)
1186 it is obvious that this NaN-boxing behaviour can and should be
1187 preserved. When elwidth is non-default things are less obvious,
1188 so need to be thought through. Here is a normal (scalar) sequence,
1189 assuming an RV64 which supports Quad (128-bit) FLEN:
1190
1191 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1192 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1193 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1194 top 64 MSBs ignored.
1195
1196 Therefore it makes sense to mirror this behaviour when, for example,
1197 elwidth is set to 32. Assume elwidth set to 32 on all source and
1198 destination registers:
1199
1200 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1201 floating-point numbers.
1202 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1203 in bits 0-31 and the second in bits 32-63.
1204 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1205
1206 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1207 of the registers either during the FLD **or** the ADD.D. The reason
1208 is that, effectively, the top 64 MSBs actually represent a completely
1209 independent 64-bit register, so overwriting it is not only gratuitous
1210 but may actually be harmful for a future extension to SV which may
1211 have a way to directly access those top 64 bits.
1212
1213 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1215 when "isvec" is false in a given register's CSR entry. Only when the
1216 elwidth is set to default **and** isvec is false will the standard
1217 RV behaviour be followed, namely that the upper bits be modified.
1218
1219 Ultimately if elwidth is default and isvec false on *all* source
1220 and destination registers, a SimpleV instruction defaults completely
1221 to standard RV scalar behaviour (this holds true for **all** operations,
1222 right across the board).
1223
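The write-back rule can be sketched in Python (illustrative only: the
register-file model, helper name and parameters are assumptions, and only
the lowest element of the register is shown for simplicity):

    FLEN = 128   # assume a Quad-capable register file, as in the example above

    def fp_writeback(old_reg, value, elwidth_is_default, isvec, opwidth):
        # returns the new FLEN-bit register contents after writing one
        # element of opwidth bits into the bottom of the register
        lowmask = (1 << opwidth) - 1
        if elwidth_is_default and not isvec:
            # standard scalar RV behaviour: NaN-box (upper bits set to all 1s)
            upper = ((1 << (FLEN - opwidth)) - 1) << opwidth
            return upper | (value & lowmask)
        # SV polymorphic behaviour: bits above the element are left untouched
        return (old_reg & ~lowmask) | (value & lowmask)
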
The nice thing here is that when elwidth is set to a non-default value,
ADD.S, ADD.D and ADD.Q are effectively all the same: they all still perform
multiple ADD operations, just at the overridden element width rather than
at their nominal widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1230
In the meantime, although with e.g. VL set to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is about to receive a higher VL value. On a superscalar OoO architecture
there may be absolutely no difference; however simpler SIMD-style
microarchitectures may not have the infrastructure in place to know the
difference, such that when VL=8 an ADD.D instruction
completes in 2 cycles (or more) rather than one, where
an ADD.Q issued instead on such simpler microarchitectures
would complete in one.
1241
1242 ## Specific instruction walk-throughs
1243
1244 This section covers walk-throughs of the above-outlined procedure
1245 for converting standard RISC-V scalar arithmetic operations to
1246 polymorphic widths, to ensure that it is correct.
1247
1248 ### add
1249
1250 Standard Scalar RV32/RV64 (xlen):
1251
1252 * RS1 @ xlen bits
1253 * RS2 @ xlen bits
1254 * add @ xlen bits
1255 * RD @ xlen bits
1256
1257 Polymorphic variant:
1258
1259 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1260 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1261 * add @ max(rs1, rs2) bits
1262 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1263
1264 Note here that polymorphic add zero-extends its source operands,
1265 where addw sign-extends.
1266
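The zero-extension rule above can be sketched in Python (illustrative
only; element indexing and the polymorphed-register read/write helpers
used elsewhere in this appendix are omitted for brevity):

    def polymorphic_add(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        src1 = rs1_val & ((1 << rs1_bits) - 1)    # zero-extend to opwidth
        src2 = rs2_val & ((1 << rs2_bits) - 1)    # zero-extend to opwidth
        result = (src1 + src2) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return result                         # zero-extend into rd
        return result & ((1 << rd_bits) - 1)      # otherwise truncate
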
1267 ### addw
1268
1269 The RV Specification specifically states that "W" variants of arithmetic
1270 operations always produce 32-bit signed values. In a polymorphic
1271 environment it is reasonable to assume that the signed aspect is
1272 preserved, where it is the length of the operands and the result
1273 that may be changed.
1274
1275 Standard Scalar RV64 (xlen):
1276
1277 * RS1 @ xlen bits
1278 * RS2 @ xlen bits
1279 * add @ xlen bits
1280 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1281
1282 Polymorphic variant:
1283
1284 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1285 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1286 * add @ max(rs1, rs2) bits
1287 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1288
1289 Note here that polymorphic addw sign-extends its source operands,
1290 where add zero-extends.
1291
1292 This requires a little more in-depth analysis. Where the bitwidth of
1293 rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.
1296
1297 Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
1298 where for add they are both zero-extended. This holds true for all arithmetic
1299 operations ending with "W".
1300
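For comparison, a sketch of the sign-extending addw rule, using a
hypothetical sign_extend helper (illustrative only):

    def sign_extend(value, frombits, tobits):
        value &= (1 << frombits) - 1
        if value & (1 << (frombits - 1)):         # negative: set upper bits
            value |= ((1 << tobits) - 1) & ~((1 << frombits) - 1)
        return value

    def polymorphic_addw(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        src1 = sign_extend(rs1_val, rs1_bits, opwidth)
        src2 = sign_extend(rs2_val, rs2_bits, opwidth)
        result = (src1 + src2) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return sign_extend(result, opwidth, rd_bits)
        return result & ((1 << rd_bits) - 1)      # otherwise truncate
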
1301 ### addiw
1302
1303 Standard Scalar RV64I:
1304
1305 * RS1 @ xlen bits, truncated to 32-bit
1306 * immed @ 12 bits, sign-extended to 32-bit
1307 * add @ 32 bits
* RD @ xlen bits, sign-extend the 32-bit result to xlen.
1309
1310 Polymorphic variant:
1311
1312 * RS1 @ rs1 bits
1313 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1314 * add @ max(rs1, 12) bits
1315 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
1316
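The corresponding sketch for addiw, reusing the hypothetical sign_extend
helper from the addw sketch above (illustrative only):

    def polymorphic_addiw(rs1_val, rs1_bits, imm12, rd_bits):
        opwidth = max(rs1_bits, 12)
        src1 = rs1_val & ((1 << rs1_bits) - 1)
        imm = sign_extend(imm12, 12, opwidth)     # immediate sign-extended
        result = (src1 + imm) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return sign_extend(result, opwidth, rd_bits)
        return result & ((1 << rd_bits) - 1)      # otherwise truncate
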
1317 # Predication Element Zeroing
1318
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1329
1330 SimpleV's design principle is not based on or influenced by
1331 microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1336
1337 ## Single-predication (based on destination register)
1338
1339 Zeroing on predication for arithmetic operations is taken from
1340 the destination register's predicate. i.e. the predication *and*
1341 zeroing settings to be applied to the whole operation come from the
1342 CSR Predication table entry for the destination register.
1343 Thus when zeroing is set on predication of a destination element,
1344 if the predication bit is clear, then the destination element is *set*
1345 to zero (twin-predication is slightly different, and will be covered
1346 next).
1347
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1350
    for (i = 0; i < VL; i++)
       if not zeroing: # an optimisation
          # skip ahead over elements whose predicate bit is zero
          while (!(predval & 1<<i) && i < VL)
             if (int_vec[rd ].isvector)  { id += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
          if i == VL:
             return
       if (predval & 1<<i)
          src1 = ....
          src2 = ...
          result = src1 + src2 # actual add (or other op) here
          set_polymorphed_reg(rd, destwid, ird, result)
          if int_vec[rd].ffirst and result == 0:
             VL = i # result was zero, end loop early, return VL
             return
          if (!int_vec[rd].isvector) return
       else if zeroing:
          result = 0
          set_polymorphed_reg(rd, destwid, ird, result)
       if (int_vec[rd ].isvector)  { id += 1; }
       else if (predval & 1<<i) return
       if (int_vec[rs1].isvector)  { irs1 += 1; }
       if (int_vec[rs2].isvector)  { irs2 += 1; }
       if (rd == VL or rs1 == VL or rs2 == VL): return
1377
1378 The optimisation to skip elements entirely is only possible for certain
1379 micro-architectures when zeroing is not set. However for lane-based
1380 micro-architectures this optimisation may not be practical, as it
1381 implies that elements end up in different "lanes". Under these
1382 circumstances it is perfectly fine to simply have the lanes
1383 "inactive" for predicated elements, even though it results in
1384 less than 100% ALU utilisation.
1385
1386 ## Twin-predication (based on source and destination register)
1387
Twin-predication is not that much different, except that
1389 the source is independently zero-predicated from the destination.
1390 This means that the source may be zero-predicated *or* the
1391 destination zero-predicated *or both*, or neither.
1392
When, with twin-predication, zeroing is set on the source and not
the destination, a clear source predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1399
1400 When zeroing is set on the destination and not the source, then just
1401 as with single-predicated operations, a zero is stored into the destination
1402 element (or target memory address for a STORE).
1403
Zeroing on both source and destination effectively results in an AND
of the source and destination predicates: where either the source
predicate OR the destination predicate is set to 0,
a zero element will ultimately end up in the destination register.
1408
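A small worked example (illustrative only; with both zeroing flags set,
no element-skipping occurs, so source and destination indices advance in
lockstep):

    VL = 4
    ps = 0b1011     # source predicate (source zeroing set)
    pd = 0b1101     # destination predicate (destination zeroing set)
    for j in range(VL):
        if (pd >> j) & 1 and (ps >> j) & 1:
            print("element", j, "receives real source data")
        else:
            print("element", j, "receives zero")  # one of the bits was clear
    # elements 0 and 3 receive data; elements 1 and 2 receive zero
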
1409 However: this may not necessarily be the case for all operations;
1410 implementors, particularly of custom instructions, clearly need to
1411 think through the implications in each and every case.
1412
1413 Here is pseudo-code for a twin zero-predicated operation:
1414
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          # with zeroing not set, skip past clear predicate bits
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i))
                sourcedata = ireg[rs+i];
             else
                sourcedata = 0 # zerosrc: a zero element is passed through
             ireg[rd+j] <= sourcedata
          else if (zerodst)
             ireg[rd+j] <= 0
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break;
1438
1439 Note that in the instance where the destination is a scalar, the hardware
1440 loop is ended the moment a value *or a zero* is placed into the destination
1441 register/element. Also note that, for clarity, variable element widths
1442 have been left out of the above.
1443
1444 # Subsets of RV functionality
1445
1446 This section describes the differences when SV is implemented on top of
1447 different subsets of RV.
1448
1449 ## Common options
1450
1451 It is permitted to only implement SVprefix and not the VBLOCK instruction
1452 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1453 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1454 traps may emulate the format.
1455
1456 It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
1458 *MUST* raise illegal instruction on implementations that do not support
1459 VL or SUBVL.
1460
1461 It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However, going
below the mandatory limits set in the RV standard will result in non-compliance
1464 with the SV Specification.
1465
1466 ## RV32 / RV32F
1467
1468 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1469 maximum limit for predication is also restricted to 32 bits. Whilst not
strictly an "option", it is worth noting.
1471
1472 ## RV32G
1473
1474 Normally in standard RV32 it does not make much sense to have
RV32G. The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1478
1479 In an earlier draft of SV, it was possible to specify an elwidth
1480 of double the standard register size: this had to be dropped,
1481 and may be reintroduced in future revisions.
1482
1483 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1484
1485 When floating-point is not implemented, the size of the User Register and
1486 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1487 per table).
1488
1489 ## RV32E
1490
1491 In embedded scenarios the User Register and Predication CSRs may be
1492 dropped entirely, or optionally limited to 1 CSR, such that the combined
1493 number of entries from the M-Mode CSR Register table plus U-Mode
1494 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1495 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1496 the Predication CSR tables.
1497
1498 RV32E is the most likely candidate for simply detecting that registers
1499 are marked as "vectorised", and generating an appropriate exception
1500 for the VL loop to be implemented in software.
1501
1502 ## RV128
1503
RV128 has not been especially considered here; however it has some
1505 extremely large possibilities: double the element width implies
1506 256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1508
1509 # Example usage
1510
1511 TODO evaluate strncpy and strlen
1512 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1513
1514 ## strncpy
1515
RVV version: <a name="strncpy"></a>
1517
    strncpy:
        mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8    # Vectors of bytes.
        vlbff.v v1, (a1)        # Get src bytes
        vseq.vi v0, v1, 0       # Flag zero bytes
        vmfirst a4, v0          # Zero found?
        vmsif.v v0, v0          # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t    # Write out bytes
        bgez a4, exit           # Done
        csrr t1, vl             # Get number of bytes fetched
        add a1, a1, t1          # Bump src pointer
        sub a2, a2, t1          # Decrement count.
        add a3, a3, t1          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
1536
1537 SV version (WIP):
1538
    strncpy:
        mv a3, a0
        SETMVLI 8               # set max vector to 8
        RegCSR[a3] = 8bit, a3, scalar
        RegCSR[a1] = 8bit, a1, scalar
        RegCSR[t0] = 8bit, t0, vector
        PredTb[t0] = ffirst, x0, inv
    loop:
        SETVLI a2, t4           # t4 and VL now 1..8
        ldb t0, (a1)            # t0 fail first mode
        bne t0, x0, allnonzero  # still ff
        # VL points to last nonzero
        GETVL t4                # from bne tests
        addi t4, t4, 1          # include zero
        SETVL t4                # set exactly to t4
        stb t0, (a3)            # store incl zero
        ret                     # end subroutine
    allnonzero:
        stb t0, (a3)            # VL legal range
        GETVL t4                # from bne tests
        add a1, a1, t4          # Bump src pointer
        sub a2, a2, t4          # Decrement count.
        add a3, a3, t4          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
1565
1566 Notes:
1567
1568 * Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
1569 * obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
1570 * with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
1571 * RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
1572 * RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1573 * with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
1574 * setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
1575 * ldb and bne are both using t0, both in ffirst mode
1576 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
1577 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
* SETVL sets *exactly* the requested amount into VL.
* the SETVL just after the allnonzero label is needed in case the ldb ffirst activates but the bne's fail-first does not.
1583 * this would cause the stb to copy up to the end of the legal memory
1584 * of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
1585
1586 ## strcpy
1587
1588 RVV version:
1589
        mv a3, a0               # Save start
    loop:
        setvli a1, x0, vint8    # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)       # Get bytes
        csrr a1, vl             # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0       # Set v0[i] where v1[i] = 0
        add a3, a3, a1          # Bump pointer
        vmfirst a2, v0          # Find first set bit in mask, returns -1 if none
        bltz a2, loop           # Not found?
        add a0, a0, a1          # Sum start + bump
        add a3, a3, a2          # Add index of zero byte
        sub a0, a3, a0          # Subtract start address+bump
        ret