1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 30 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes <a name="ffirst"></a>
11
12 Fail-on-first data dependency has different behaviour for traps than
13 for conditional testing. "Conditional" is taken to mean "anything
14 that is zero", however with traps, the first element has to
15 be given the opportunity to throw the exact same trap that would
16 be thrown if this were a scalar operation (when VL=1).
17
Note that implementors are required to choose one mode or the other, mutually
exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies
to custom opcode writers as well as to future extension writers.
22
23 ## Fail-on-first traps
24
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and all
following elements are instead ignored (or cancelled in out-of-order
designs), and VL is truncated to cover only the elements, in sequence,
up to (and including) the *last* one that did not take the trap.
31
32 Note that predicated-out elements (where the predicate mask bit is
33 zero) are clearly excluded (i.e. the trap will not occur). However,
34 note that the loop still had to test the predicate bit: thus on return,
35 VL is set to include elements that did not take the trap *and* includes
36 the elements that were predicated (masked) out (not tested up to the
37 point where the trap occurred).
38
39 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
40 unaltered by setting zero or non-zero predication mode.
41
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst is not set); in subsequent
*sub-groups*, however, the trap must not actually be taken: processing
stops instead. SUBVL will **NOT** be modified. Trap handlers must analyse
(x)eSTATE (the SUBVL offset indices) to determine which element caused the trap.
47
48 Given that predication bits apply to SUBVL groups, the same rules apply
49 to predicated-out (masked-out) sub-groups in calculating the value that
50 VL is set to.
51
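A minimal C-style sketch of the trap-ffirst rule may help; the helper
names (element_would_trap, take_trap, execute_element) and the flat
predicate bitmask are illustrative assumptions, not part of the
specification:

    // Sketch only: VL, predicate and the element_* helpers are illustrative.
    extern int VL;
    extern unsigned long predicate;
    extern int  element_would_trap(int i);
    extern void take_trap(int i);
    extern void execute_element(int i);

    void ffirst_trap_loop(void) {
        for (int i = 0; i < VL; i++) {
            if (!((predicate >> i) & 1))
                continue;               // masked-out: never traps, still counted
            if (element_would_trap(i)) {
                if (i == 0) {
                    take_trap(i);       // first element traps exactly as a scalar op
                } else {
                    VL = i;             // truncate VL; the trap is *not* taken
                }
                return;
            }
            execute_element(i);
        }
    }
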
52 ## Fail-on-first conditional tests
53
54 ffirst stops sequential (or sequentially-appearing in the case of
55 out-of-order designs) element conditional testing on the first element
56 result being zero (or other "fail" condition). VL is set to the number
57 of elements that were (sequentially) processed before the fail-condition
58 was encountered.
59
60 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
61 responds to changes in the zero or non-zero predication mode. Whilst
62 in non-zeroing mode, masked-out elements are simply not tested (and
63 thus considered "never to fail"), in zeroing mode, masked-out elements
64 may be viewed as *always* (unconditionally) failing. This effectively
65 turns VL into something akin to a software-controlled loop.
66
Note that, just as with traps, if SUBVL!=1, the first failed test within a
*sub-group* will cause processing to end, and, even if there were
elements within that *sub-group* that passed the test, the sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
traps must analyse (x)eSTATE (the SUBVL offset indices) to determine the
element that caused the trap.
75
76 Note again that, just as with traps, predicated-out (masked-out) elements
77 are included in the (sequential) count leading up to the fail-condition,
78 even though they were not tested.
79
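The following is a minimal sketch (in the same C style as the trap-ffirst
sketch above, with illustrative helper names) of how VL truncation works
for data-dependent fail-on-first, including the difference between
zeroing and non-zeroing predication:

    // Sketch only: test_element() stands in for whatever "test" the
    // instruction performs (e.g. a predicated compare); zero result = fail.
    extern int VL;
    extern unsigned long predicate;
    extern int test_element(int i);

    void ffirst_conditional_loop(int zeroing) {
        for (int i = 0; i < VL; i++) {
            int active = (predicate >> i) & 1;
            int fail;
            if (!active)
                fail = zeroing;         // zeroing: masked-out = unconditional fail
            else
                fail = (test_element(i) == 0);
            if (fail) {
                VL = i;                 // masked-out elements before i stay counted
                return;
            }
        }
    }
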
80 # Instructions <a name="instructions" />
81
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped; however, xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite no new opcodes being added,
with the exception of CLIP and VSELECT.X
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
92
93 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
94 equivalents, so are left out of Simple-V. VSELECT could be included if
95 there existed a MV.X instruction in RV (MV.X is a hypothetical
96 non-immediate variant of MV that would allow another register to
97 specify which register was to be copied). Note that if any of these three
98 instructions are added to any given RV extension, their functionality
99 will be inherently parallelised.
100
101 With some exceptions, where it does not make sense or is simply too
102 challenging, all RV-Base instructions are parallelised:
103
* CSR instructions: whilst a case could be made for fast-polling of
a CSR into multiple registers, or for being able to copy multiple
contiguously-addressed CSRs into contiguous registers, and so on,
CSRs are the fundamental core basis of SV. If parallelised, extreme
care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
110 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
111 left as scalar.
* LR/SC could hypothetically be parallelised; however, their purpose is
to provide single (complex) atomic memory operations, where the LR must be
followed up by a matching SC. A sequence of parallel LR instructions followed
by a sequence of parallel SC instructions is therefore guaranteed
not to be useful. Not least: the guarantees of a Multi-LR/SC
would be impossible to provide if emulated in a trap.
118 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
119 paralleliseable anyway.
120
121 All other operations using registers are automatically parallelised.
122 This includes AMOMAX, AMOSWAP and so on, where particular care and
123 attention must be paid.
124
125 Example pseudo-code for an integer ADD operation (including scalar
126 operations). Floating-point uses the FP Register Table.
127
128 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
129
130 Note that for simplicity there is quite a lot missing from the above
131 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
132 reshaping and offsets and so on. However it demonstrates the basic
133 principle. Augmentations that produce the full pseudo-code are covered in
134 other sections.
135
136 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
137
138 Adding in support for SUBVL is a matter of adding in an extra inner
139 for-loop, where register src and dest are still incremented inside the
140 inner part. Note that the predication is still taken from the VL index.
141
142 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
143 indexed by "(i)"
144
    function op_add(rd, rs1, rs2) # add not VADD!
        int i, id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
        rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
        rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
        for (i = 0; i < VL; i++)
            xSTATE.srcoffs = i # save context
            for (s = 0; s < SUBVL; s++)
                xSTATE.ssvoffs = s # save context
                if (predval & 1<<i) # predication uses intregs
                    # actual add is here (at last)
                    ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                    if (!int_vec[rd ].isvector) break;
                if (int_vec[rd ].isvector)  { id += 1; }
                if (int_vec[rs1].isvector)  { irs1 += 1; }
                if (int_vec[rs2].isvector)  { irs2 += 1; }
                if (id == VL or irs1 == VL or irs2 == VL) {
                    # end VL hardware loop
                    xSTATE.srcoffs = 0; # reset
                    xSTATE.ssvoffs = 0; # reset
                    return;
                }
168
169
170 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
171 elwidth handling etc. all left out.
172
173 ## Instruction Format
174
175 It is critical to appreciate that there are
176 **no operations added to SV, at all**.
177
178 Instead, by using CSRs to tag registers as an indication of "changed
179 behaviour", SV *overloads* pre-existing branch operations into predicated
180 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
181 LOAD/STORE depending on CSR configurations for bitwidth and predication.
182 **Everything** becomes parallelised. *This includes Compressed
183 instructions* as well as any future instructions and Custom Extensions.
184
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
FRM changes the behaviour of the floating-point unit, altering the rounding
mode. Other architectures change the LOAD/STORE byte-order from big-endian
to little-endian on a per-instruction basis. SV is just a little more...
comprehensive in its effect on instructions.
191
192 ## Branch Instructions
193
Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
table). As such, the "ffirst" (fail-on-first) condition mode can be enabled.
See ffirst mode in the Predication Table section.
199
200 ### Standard Branch <a name="standard_branch"></a>
201
202 Branch operations use standard RV opcodes that are reinterpreted to
203 be "predicate variants" in the instance where either of the two src
204 registers are marked as vectors (active=1, vector=1).
205
206 Note that the predication register to use (if one is enabled) is taken from
207 the *first* src register, and that this is used, just as with predicated
208 arithmetic operations, to mask whether the comparison operations take
209 place or not. The target (destination) predication register
210 to use (if one is enabled) is taken from the *second* src register.
211
212 If either of src1 or src2 are scalars (whether by there being no
213 CSR register entry or whether by the CSR entry specifically marking
214 the register as "scalar") the comparison goes ahead as vector-scalar
215 or scalar-vector.
216
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
222
Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared. This is important to appreciate:
there may be an expectation that, going into the hardware-loop,
the destination predicate is always set to zero. This is **not**
the case. The destination predicate is only set
to zero if **zeroing** is enabled.
233
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2.
237
238 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
239 for predicated compare operations of function "cmp":
240
    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);
245
246 With associated predication, vector-length adjustments and so on,
247 and temporarily ignoring bitwidth (which makes the comparisons more
248 complex), this becomes:
249
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
287
288 Notes:
289
290 * Predicated SIMD comparisons would break src1 and src2 further down
291 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
292 Reordering") setting Vector-Length times (number of SIMD elements) bits
293 in Predicate Register rd, as opposed to just Vector-Length bits.
294 * The execution of "parallelised" instructions **must** be implemented
295 as "re-entrant" (to use a term from software). If an exception (trap)
296 occurs during the middle of a vectorised
297 Branch (now a SV predicated compare) operation, the partial results
298 of any comparisons must be written out to the destination
299 register before the trap is permitted to begin. If however there
300 is no predicate, the **entire** set of comparisons must be **restarted**,
301 with the offset loop indices set back to zero. This is because
302 there is no place to store the temporary result during the handling
303 of traps.
304
305 TODO: predication now taken from src2. also branch goes ahead
306 if all compares are successful.
307
308 Note also that where normally, predication requires that there must
309 also be a CSR register entry for the register being used in order
310 for the **predication** CSR register entry to also be active,
311 for branches this is **not** the case. src2 does **not** have
312 to have its CSR register entry marked as active in order for
313 predication on src2 to be active.
314
315 Also note: SV Branch operations are **not** twin-predicated
316 (see Twin Predication section). This would require three
317 element offsets: one to track src1, one to track src2 and a third
318 to track where to store the accumulation of the results. Given
319 that the element offsets need to be exposed via CSRs so that
320 the parallel hardware looping may be made re-entrant on traps
321 and exceptions, the decision was made not to make SV Branches
322 twin-predicated.
323
324 ### Floating-point Comparisons
325
There are no floating-point branch operations, only compares.
Interestingly, no change is needed to the instruction format, because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it is not actually a Branch at all: it is a compare.
330
331 In RV (scalar) Base, a branch on a floating-point compare is
332 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
333 This does extend to SV, as long as x1 (in the example sequence given)
334 is vectorised. When that is the case, x1..x(1+VL-1) will also be
335 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
336 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
337 so on. Consequently, unlike integer-branch, FP Compare needs no
338 modification in its behaviour.
339
In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
missing, and whilst in ordinary branch code this is fine, because the
standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero), in predication
terms it has more of an impact. To deal with this, SV's predication
has had "invert" added to it.
346
347 Also: note that FP Compare may be predicated, using the destination
348 integer register (rd) to determine the predicate. FP Compare is **not**
349 a twin-predication operation, as, again, just as with SV Branches,
350 there are three registers involved: FP src1, FP src2 and INT rd.
351
352 Also: note that ffirst (fail first mode) applies directly to this operation.
353
354 ### Compressed Branch Instruction
355
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated, based on the source register
(rs1s) CSR entries. However, as there is only the one source register
(c.beqz a0 is equivalent to beqz a0,x0), the optional target in which
to store the results of the comparisons is taken from the CSR predication
table entry for **x0**.
362
The specific required use of x0 makes sense with a little thought, even
if it is initially counterintuitive. Clearly it is **not** recommended to redirect
x0 with a CSR register entry; however, as a means to opaquely obtain
a predication target, it is the only sensible option that does not involve
additional special CSRs (or, worse, additional special opcodes).
368
369 Note also that, just as with standard branches, the 2nd source
370 (in this case x0 rather than src2) does **not** have to have its CSR
371 register table marked as "active" in order for predication to work.
372
373 ## Vectorised Dual-operand instructions
374
375 There is a series of 2-operand instructions involving copying (and
376 sometimes alteration):
377
378 * C.MV
379 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
380 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
381 * LOAD(-FP) and STORE(-FP)
382
383 All of these operations follow the same two-operand pattern, so it is
384 *both* the source *and* destination predication masks that are taken into
385 account. This is different from
386 the three-operand arithmetic instructions, where the predication mask
387 is taken from the *destination* register, and applied uniformly to the
388 elements of the source register(s), element-for-element.
389
390 The pseudo-code pattern for twin-predicated operations is as
391 follows:
392
    function op(rd, rs):
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
406
407 This pattern covers scalar-scalar, scalar-vector, vector-scalar
408 and vector-vector, and predicated variants of all of those.
409 Zeroing is not presently included (TODO). As such, when compared
410 to RVV, the twin-predicated variants of C.MV and FMV cover
411 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
412 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
413
414 Note that:
415
416 * elwidth (SIMD) is not covered in the pseudo-code above
417 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
418 not covered
419 * zero predication is also not shown (TODO).
420
421 ### C.MV Instruction <a name="c_mv"></a>
422
423 There is no MV instruction in RV however there is a C.MV instruction.
424 It is used for copying integer-to-integer registers (vectorised FMV
425 is used for copying floating-point).
426
427 If either the source or the destination register are marked as vectors
428 C.MV is reinterpreted to be a vectorised (multi-register) predicated
429 move operation. The actual instruction's format does not change:
430
431 [[!table data="""
432 15 12 | 11 7 | 6 2 | 1 0 |
433 funct4 | rd | rs | op |
434 4 | 5 | 5 | 2 |
435 C.MV | dest | src | C0 |
436 """]]
437
438 A simplified version of the pseudocode for this operation is as follows:
439
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            ireg[rd+j] <= ireg[rs+i];
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
453
454 There are several different instructions from RVV that are covered by
455 this one opcode:
456
457 [[!table data="""
458 src | dest | predication | op |
459 scalar | vector | none | VSPLAT |
460 scalar | vector | destination | sparse VSPLAT |
461 scalar | vector | 1-bit dest | VINSERT |
462 vector | scalar | 1-bit? src | VEXTRACT |
463 vector | vector | none | VCOPY |
464 vector | vector | src | Vector Gather |
465 vector | vector | dest | Vector Scatter |
466 vector | vector | src & dest | Gather/Scatter |
467 vector | vector | src == dest | sparse VCOPY |
468 """]]
469
470 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
471 operations with zeroing off, and inversion on the src and dest predication
472 for one of the two C.MV operations. The non-inverted C.MV will place
473 one set of registers into the destination, and the inverted one the other
474 set. With predicate-inversion, copying and inversion of the predicate mask
475 need not be done as a separate (scalar) instruction.
476
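As an illustration of the VMERGE technique described above, here is a
minimal C sketch (element-level only, ignoring elwidth and zeroing;
the variable names and the flat `mask` predicate are illustrative
assumptions):

    // Sketch only: two back-to-back predicated moves, the second with the
    // predicate inverted, merge vectors b and c into destination vector a.
    extern int VL;
    extern unsigned long mask;
    extern long a[], b[], c[];

    void vmerge_via_two_cmv(void) {
        for (int i = 0; i < VL; i++)        // first C.MV: predicate as-is
            if ((mask >> i) & 1)
                a[i] = b[i];
        for (int i = 0; i < VL; i++)        // second C.MV: predicate inverted
            if (!((mask >> i) & 1))
                a[i] = c[i];
    }
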
Note that in the instance where the Compressed Extension is not implemented,
MV may be used instead, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV, because with addi the
predication mask to use is taken **only** from rd and is applied uniformly
to all elements: rd[i] = rs[i].
482
483 ### FMV, FNEG and FABS Instructions
484
These are identical in form to C.MV, except that they cover floating-point
register copying. The same twin-predication rules also apply.
However, when elwidth is not set to default, the instruction is implicitly
and automatically converted to a (vectorised) floating-point type-conversion
operation of the appropriate size, covering the source and destination
register bitwidths.
491
492 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
493
### FCVT Instructions
495
496 These are again identical in form to C.MV, except that they cover
497 floating-point to integer and integer to floating-point. When element
498 width in each vector is set to default, the instructions behave exactly
499 as they are defined for standard RV (scalar) operations, except vectorised
500 in exactly the same fashion as outlined in C.MV.
501
502 However when the source or destination element width is not set to default,
503 the opcode's explicit element widths are *over-ridden* to new definitions,
504 and the opcode's element width is taken as indicative of the SIMD width
505 (if applicable i.e. if packed SIMD is requested) instead.
506
For example, FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision (32-bit) floating-point
number in rd. If, however, the source rs1 is set to be a vector, with elwidth
set to default/2 and "packed SIMD" enabled, then the first 32 bits of
rs1 are converted to a floating-point number to be stored in rd's
first element, and the upper 32 bits are *also* converted to floating-point
and stored in the second. The 32-bit size comes from the fact that
FCVT.S.L's integer width is 64-bit, and with elwidth on rs1 set to
divide that by two, rs1's element width is to be taken as 32.
516
517 Similar rules apply to the destination register.
518
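A rough C sketch of the FCVT.S.L example above (source elwidth at half of
default with packed SIMD, destination at default width); the function and
parameter names are illustrative only, and the sign-handling follows the
scalar FCVT.S.L definition:

    #include <stdint.h>

    // Sketch only: converts the two 32-bit halves of a 64-bit source element
    // into two single-precision results, as described above.
    void fcvt_s_l_packed(uint64_t rs1, float rd[2]) {
        int32_t lo = (int32_t)(uint32_t)rs1;          // first 32-bit element
        int32_t hi = (int32_t)(uint32_t)(rs1 >> 32);  // second 32-bit element
        rd[0] = (float)lo;
        rd[1] = (float)hi;
    }
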
519 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
520
521 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
522 the interpretation of the instruction fields). This
523 actually undermined the fundamental principle of SV, namely that there
524 be no modifications to the scalar behaviour (except where absolutely
525 necessary), in order to simplify an implementor's task if considering
526 converting a pre-existing scalar design to support parallelism.
527
528 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
529 do not change in SV, however just as with C.MV it is important to note
530 that dual-predication is possible.
531
532 In vectorised architectures there are usually at least two different modes
533 for LOAD/STORE:
534
535 * Read (or write for STORE) from sequential locations, where one
536 register specifies the address, and the one address is incremented
537 by a fixed amount. This is usually known as "Unit Stride" mode.
538 * Read (or write) from multiple indirected addresses, where the
539 vector elements each specify separate and distinct addresses.
540
541 To support these different addressing modes, the CSR Register "isvector"
542 bit is used. So, for a LOAD, when the src register is set to
543 scalar, the LOADs are sequentially incremented by the src register
544 element width, and when the src register is set to "vector", the
545 elements are treated as indirection addresses. Simplified
546 pseudo-code would look like this:
547
    function op_ld(rd, rs) # LD not VLD!
        rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            if (int_csr[rs].isvec)
                # indirect mode (multi mode)
                srcbase = ireg[rsv+i];
            else
                # unit stride mode
                srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
            ireg[rdv+j] <= mem[srcbase + imm_offs];
            if (!int_csr[rs].isvec &&
                !int_csr[rd].isvec) break # scalar-scalar LD
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++;
567
568 Notes:
569
* For simplicity, zeroing and elwidth are not included in the above:
the key focus here is the decision-making for srcbase; vectorised
rs means use sequentially-numbered registers as the indirection
addresses, and scalar rs is "offset" mode.
574 * The test towards the end for whether both source and destination are
575 scalar is what makes the above pseudo-code provide the "standard" RV
576 Base behaviour for LD operations.
* The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
(8 bytes), and also whether the element width is over-ridden
(see special element width section).
581
582 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
583
584 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
585 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
586 It is therefore possible to use predicated C.LWSP to efficiently
587 pop registers off the stack (by predicating x2 as the source), cherry-picking
588 which registers to store to (by predicating the destination). Likewise
589 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
590
591 The two modes ("unit stride" and multi-indirection) are still supported,
592 as with standard LD/ST. Essentially, the only difference is that the
593 use of x2 is hard-coded into the instruction.
594
595 **Note**: it is still possible to redirect x2 to an alternative target
596 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
597 general-purpose LOAD/STORE operations.
598
599 ## Compressed LOAD / STORE Instructions
600
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
on the src for LOAD (and on the dest for STORE) selects "Unit Stride"
or "Multi-indirection" mode, respectively.
606
607 # Element bitwidth polymorphism <a name="elwidth"></a>
608
609 Element bitwidth is best covered as its own special section, as it
610 is quite involved and applies uniformly across-the-board. SV restricts
611 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
612
The effect of setting an element bitwidth is to re-cast each entry
in the register table, and, for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in c-style terms, on an RV64 architecture, each register
effectively now looks like this:
618
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
628
629 where the CSR Register table entry (not the instruction alone) determines
630 which of those union entries is to be used on each operation, and the
631 VL element offset in the hardware-loop specifies the index into each array.
632
However, a naive interpretation of the data structure above masks the
fact that, when VL is set greater than 8 (for example) and the bitwidth
is 8, accesses to one specific register "spill over" into the following
entries of the register file in a sequential fashion. A much more accurate
way to reflect this is therefore:
638
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
649
where, when accessing any individual regfile[n].b entry, it is permitted
(in c) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" into consecutive register file entries, in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if any attempt to access beyond the "real" register
bytes is ever made.
659
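A small C sketch may make the "overspill" clearer: treating the whole
register file as one contiguous byte array, the n-th elwidth-sized element
of register "reg" is simply found at a byte offset from the start of that
register, regardless of whether it crosses into the next register. The
helper name is illustrative, a little-endian host is assumed, and the
end-of-regfile bounds check described above is deliberately omitted:

    #include <stdint.h>
    #include <string.h>

    uint64_t int_regfile[128];   // RV64: 8 bytes per register

    // Sketch only: fetch element "offset" of width "bytes" belonging to
    // register "reg", allowing it to spill into following registers.
    uint64_t get_element(int reg, int bytes, int offset) {
        uint8_t *base = (uint8_t *)&int_regfile[reg];
        uint64_t val = 0;
        memcpy(&val, base + offset * bytes, bytes);  // no bounds check (see text)
        return val;
    }
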
Now we may modify the pseudo-code for an operation where all element bitwidths
have been set to the same size, where this pseudo-code is otherwise identical
to its "non"-polymorphic version (above):
663
    function op_add(rd, rs1, rs2) # add not VADD!
        ...
        ...
        for (i = 0; i < VL; i++)
            ...
            ...
            // TODO, calculate if over-run occurs, for each elwidth
            if (elwidth == 8) {
                int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                         int_regfile[rs2].b[irs2];
            } else if elwidth == 16 {
                int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                         int_regfile[rs2].s[irs2];
            } else if elwidth == 32 {
                int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                         int_regfile[rs2].i[irs2];
            } else { // elwidth == 64
                int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                         int_regfile[rs2].l[irs2];
            }
            ...
            ...
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each of them) are "type-cast"
to 8-bit; for 16-bit entries likewise, and so on.

However, that only covers the case where the element widths are all the same.
Where the element widths differ, the following algorithm applies:
693
694 * Analyse the bitwidth of all source operands and work out the
695 maximum. Record this as "maxsrcbitwidth"
696 * If any given source operand requires sign-extension or zero-extension
697 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
698 sign-extension / zero-extension or whatever is specified in the standard
699 RV specification, **change** that to sign-extending from the respective
700 individual source operand's bitwidth from the CSR table out to
701 "maxsrcbitwidth" (previously calculated), instead.
* Following separate and distinct (optional) sign/zero-extension of all
source operands as specifically required for that operation, carry out the
operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
into a copy).
* If the destination operand requires sign-extension or zero-extension,
instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
etc.), overload the RV specification with the bitwidth from the
destination register's elwidth entry.
713 * Finally, store the (optionally) sign/zero-extended value into its
714 destination: memory for sb/sw etc., or an offset section of the register
715 file for an arithmetic operation.
716
717 In this way, polymorphic bitwidths are achieved without requiring a
718 massive 64-way permutation of calculations **per opcode**, for example
719 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
720 rd bitwidths). The pseudo-code is therefore as follows:
721
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth         # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
787
Whilst the specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
791
792 * the source operands are extended out to the maximum bitwidth of all
793 source operands
794 * the operation takes place at that maximum source bitwidth (the
795 destination bitwidth is not involved at this point, at all)
796 * the result is extended (or potentially even, truncated) before being
797 stored in the destination. i.e. truncation (if required) to the
798 destination width occurs **after** the operation **not** before.
799 * when the destination is not marked as "vectorised", the **full**
800 (standard, scalar) register file entry is taken up, i.e. the
801 element is either sign-extended or zero-extended to cover the
802 full register bitwidth (XLEN) if it is not already XLEN bits long.
803
804 Implementors are entirely free to optimise the above, particularly
805 if it is specifically known that any given operation will complete
806 accurately in less bits, as long as the results produced are
807 directly equivalent and equal, for all inputs and all outputs,
808 to those produced by the above algorithm.
809
810 ## Polymorphic floating-point operation exceptions and error-handling
811
812 For floating-point operations, conversion takes place without raising any
813 kind of exception. Exactly as specified in the standard RV specification,
814 NAN (or appropriate) is stored if the result is beyond the range of the
815 destination, and, again, exactly as with the standard RV specification
816 just as with scalar operations, the floating-point flag is raised
817 (FCSR). And, again, just as with scalar operations, it is software's
818 responsibility to check this flag. Given that the FCSR flags are
819 "accrued", the fact that multiple element operations could have occurred
820 is not a problem.
821
822 Note that it is perfectly legitimate for floating-point bitwidths of
823 only 8 to be specified. However whilst it is possible to apply IEEE 754
824 principles, no actual standard yet exists. Implementors wishing to
825 provide hardware-level 8-bit support rather than throw a trap to emulate
826 in software should contact the author of this specification before
827 proceeding.
828
829 ## Polymorphic shift operators
830
831 A special note is needed for changing the element width of left and
832 right shift operators, particularly right-shift. Even for standard RV
833 base, in order for correct results to be returned, the second operand
834 RS2 must be truncated to be within the range of RS1's bitwidth.
835 spike's implementation of sll for example is as follows:
836
837 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
838
839 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
840 range 0..31 so that RS1 will only be left-shifted by the amount that
841 is possible to fit into a 32-bit register. Whilst this appears not
842 to matter for hardware, it matters greatly in software implementations,
843 and it also matters where an RV64 system is set to "RV32" mode, such
844 that the underlying registers RS1 and RS2 comprise 64 hardware bits
845 each.
846
847 For SV, where each operand's element bitwidth may be over-ridden, the
848 rule about determining the operation's bitwidth *still applies*, being
849 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
850 **also applies to the truncation of RS2**. In other words, *after*
851 determining the maximum bitwidth, RS2's range must **also be truncated**
852 to ensure a correct answer. Example:
853
854 * RS1 is over-ridden to a 16-bit width
855 * RS2 is over-ridden to an 8-bit width
856 * RD is over-ridden to a 64-bit width
857 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
858 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
859
860 Pseudocode (in spike) for this example would therefore be:
861
862 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
863
864 This example illustrates that considerable care therefore needs to be
865 taken to ensure that left and right shift operations are implemented
correctly. The key is that:

* The operation bitwidth is determined by the maximum bitwidth
of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
871
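A brief C sketch of the rule (with maxsrcwid computed from the source
elwidths only, as described above; the helper name is illustrative, and
the truncation or sign-extension to the destination elwidth is assumed to
happen afterwards, when the result is stored):

    #include <stdint.h>

    // Sketch only: polymorphic SLL. The first operand is zero-extended from
    // maxsrcwid, and the shift amount is masked to the *maximum source*
    // width, mirroring the spike example above.
    uint64_t poly_sll(uint64_t rs1, uint64_t rs2, int maxsrcwid) {
        uint64_t mask = (maxsrcwid == 64) ? ~0ULL : ((1ULL << maxsrcwid) - 1);
        return (rs1 & mask) << (rs2 & (maxsrcwid - 1));
    }
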
872 ## Polymorphic MULH/MULHU/MULHSU
873
874 MULH is designed to take the top half MSBs of a multiply that
875 does not fit within the range of the source operands, such that
876 smaller width operations may produce a full double-width multiply
877 in two cycles. The issue is: SV allows the source operands to
878 have variable bitwidth.
879
880 Here again special attention has to be paid to the rules regarding
881 bitwidth, which, again, are that the operation is performed at
882 the maximum bitwidth of the **source** registers. Therefore:
883
884 * An 8-bit x 8-bit multiply will create a 16-bit result that must
885 be shifted down by 8 bits
886 * A 16-bit x 8-bit multiply will create a 24-bit result that must
887 be shifted down by 16 bits (top 8 bits being zero)
888 * A 16-bit x 16-bit multiply will create a 32-bit result that must
889 be shifted down by 16 bits
890 * A 32-bit x 16-bit multiply will create a 48-bit result that must
891 be shifted down by 32 bits
892 * A 32-bit x 8-bit multiply will create a 40-bit result that must
893 be shifted down by 32 bits
894
895 So again, just as with shift-left and shift-right, the result
896 is shifted down by the maximum of the two source register bitwidths.
897 And, exactly again, truncation or sign-extension is performed on the
898 result. If sign-extension is to be carried out, it is performed
899 from the same maximum of the two source register bitwidths out
900 to the result element's bitwidth.
901
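A sketch of the rule in C, for the signed case (operands assumed already
sign-extended to maxsrcwid as per the main algorithm; the helper name is
illustrative and the gcc/clang __int128 type is used for the double-width
product):

    #include <stdint.h>

    // Sketch only: polymorphic MULH. The full product is formed at twice
    // the maximum source width, then shifted down by maxsrcwid; truncation
    // or sign-extension to the destination elwidth happens afterwards.
    int64_t poly_mulh(int64_t src1, int64_t src2, int maxsrcwid) {
        __int128 product = (__int128)src1 * (__int128)src2;
        return (int64_t)(product >> maxsrcwid);
    }
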
902 If truncation occurs, i.e. the top MSBs of the result are lost,
903 this is "Officially Not Our Problem", i.e. it is assumed that the
904 programmer actually desires the result to be truncated. i.e. if the
905 programmer wanted all of the bits, they would have set the destination
906 elwidth to accommodate them.
907
908 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
909
910 Polymorphic element widths in vectorised form means that the data
911 being loaded (or stored) across multiple registers needs to be treated
912 (reinterpreted) as a contiguous stream of elwidth-wide items, where
913 the source register's element width is **independent** from the destination's.
914
915 This makes for a slightly more complex algorithm when using indirection
916 on the "addressed" register (source for LOAD and destination for STORE),
917 particularly given that the LOAD/STORE instruction provides important
918 information about the width of the data to be reinterpreted.
919
920 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
921 was as follows, and i is the loop from 0 to VL-1:
922
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
925
926 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
927 chunks are taken from the source memory location addressed by the current
928 indexed source address register, and only when a full 32-bits-worth
929 are taken will the index be moved on to the next contiguous source
930 address register:
931
    bitwidth = bw(elwidth);             // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth         // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock;             // modulo
    return &mem[srcbase + imm + offs];  // re-cast to uint8_t*, uint16_t* etc.
937
938 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
939 and 128 for LQ.
940
941 The principle is basically exactly the same as if the srcbase were pointing
942 at the memory of the *register* file: memory is re-interpreted as containing
943 groups of elwidth-wide discrete elements.
944
945 When storing the result from a load, it's important to respect the fact
946 that the destination register has its *own separate element width*. Thus,
947 when each element is loaded (at the source element width), any sign-extension
948 or zero-extension (or truncation) needs to be done to the *destination*
949 bitwidth. Also, the storing has the exact same analogous algorithm as
950 above, where in fact it is just the set\_polymorphed\_reg pseudocode
951 (completely unchanged) used above.
952
953 One issue remains: when the source element width is **greater** than
954 the width of the operation, it is obvious that a single LB for example
955 cannot possibly obtain 16-bit-wide data. This condition may be detected
956 where, when using integer divide, elsperblock (the width of the LOAD
957 divided by the bitwidth of the element) is zero.
958
The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
962
963 The elements, if the element bitwidth is larger than the LD operation's
964 size, will then be sign/zero-extended to the full LD operation size, as
965 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
966 being passed on to the second phase.
967
968 As LOAD/STORE may be twin-predicated, it is important to note that
969 the rules on twin predication still apply, except where in previous
970 pseudo-code (elwidth=default for both source and target) it was
971 the *registers* that the predication was applied to, it is now the
972 **elements** that the predication is applied to.
973
974 Thus the full pseudocode for all LD operations may be written out
975 as follows:
976
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth      # destination element width
        bitwidth = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1014
1015 Note:
1016
1017 * when comparing against for example the twin-predicated c.mv
1018 pseudo-code, the pattern of independent incrementing of rd and rs
1019 is preserved unchanged.
1020 * just as with the c.mv pseudocode, zeroing is not included and must be
1021 taken into account (TODO).
1022 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1023 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1024 VSCATTER characteristics.
1025 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1026 a destination that is not vectorised (marked as scalar) will
1027 result in the element being fully sign-extended or zero-extended
1028 out to the full register file bitwidth (XLEN). When the source
1029 is also marked as scalar, this is how the compatibility with
1030 standard RV LOAD/STORE is preserved by this algorithm.
1031
1032 ### Example Tables showing LOAD elements
1033
1034 This section contains examples of vectorised LOAD operations, showing
1035 how the two stage process works (three if zero/sign-extension is included).
1036
1037
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1039
1040 This is:
1041
1042 * a 64-bit load, with an offset of zero
1043 * with a source-address elwidth of 16-bit
1044 * into a destination-register with an elwidth of 32-bit
1045 * where VL=7
1046 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1047 * RV64, where XLEN=64 is assumed.
1048
First, the memory table: because the element width is 16 and the
operation is LD (64), the 64 bits loaded from memory are subdivided
into groups of **four** elements. And, with VL being 7 (deliberately,
to illustrate that this is reasonable and possible), the first four are
sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1055
1056 [[!table data="""
1057 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1058 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1059 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1060 """]]
1061
1062 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1063 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1064
[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0 | 0x0 | elem0 ||
0x0 | 0x0 | elem1 ||
0x0 | 0x0 | elem2 ||
0x0 | 0x0 | elem3 ||
0x0 | 0x0 | elem4 ||
0x0 | 0x0 | elem5 ||
0x0 | 0x0 | elem6 ||
"""]]
1076
1077 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1078 byte-addressable "memory". That "memory" happens to cover registers
1079 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1080
1081 [[!table data="""
1082 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1083 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1084 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1085 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1086 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1087 """]]
1088
Thus we have data that is loaded from the **addresses** pointed to by
x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1095
1096 Note that whilst the memory addressing table is shown left-to-right byte order,
1097 the registers are shown in right-to-left (MSB) order. This does **not**
1098 imply that bit or byte-reversal is carried out: it's just easier to visualise
1099 memory as being contiguous bytes, and emphasises that registers are not
1100 really actually "memory" as such.
1101
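For cross-checking, a small C sketch (using the same integer-divide and
modulo scheme as load_memory above) reproduces the element placement in
this example: elements 0-3 come via x5, elements 4-6 via x6, and each
32-bit destination element lands in the lower or upper half of x8-x11 in
turn. The register numbers and widths below are just those of the example:

    #include <stdio.h>

    int main(void) {
        int VL = 7;
        int src_elwidth = 16, dest_elwidth = 32, opwidth = 64;
        int src_elsperblock  = opwidth / src_elwidth;   // 4 elements per address
        int dest_elsperblock = 64 / dest_elwidth;       // 2 per 64-bit register
        for (int i = 0; i < VL; i++) {
            int src_reg  = 5 + i / src_elsperblock;     // x5 then x6
            int src_offs = i % src_elsperblock;
            int dst_reg  = 8 + i / dest_elsperblock;    // x8..x11
            int dst_offs = i % dest_elsperblock;
            printf("elem %d: addr from x%d[%d] -> x%d bits %d..%d\n",
                   i, src_reg, src_offs, dst_reg,
                   dst_offs * dest_elwidth, dst_offs * dest_elwidth + 31);
        }
        return 0;
    }
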
1102 ## Why SV bitwidth specification is restricted to 4 entries
1103
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit

This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex, that no RV128 implementation yet exists, and
that RV64's default is 64-bit, so the four major element widths are covered
anyway.
1114
There is an absolutely crucial aspect of SV here that explicitly
needs spelling out: whether the "vectorised" bit is set in
the register's CSR entry.

If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when an elwidth override is set
on a destination (RD), sign-extension and zero-extension, whilst changed
to match the override bitwidth, will overwrite the **full** register entry
(64-bit if RV64).
1124
1125 When vectorised is *set*, this indicates that the operation now treats
1126 **elements** as if they were independent registers, so regardless of
1127 the length, any parts of a given actual register that are not involved
1128 in the operation are **NOT** modified, but are **PRESERVED**.
1129
1130 For example:
1131
1132 * when the vector bit is clear and elwidth set to 16 on the destination
1133 register, operations are truncated to 16 bit and then sign or zero
1134 extended to the *FULL* XLEN register width.
1135 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1136 groups of elwidth sized elements do not fill an entire XLEN register),
1137 the "top" bits of the destination register do *NOT* get modified, zero'd
1138 or otherwise overwritten.
1139
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of the
multi-element operation.
1143
1144 Other microarchitectures may choose to provide byte-level write-enable
1145 lines on the register file, such that each 64 bit register in an RV64
1146 system requires 8 WE lines. Scalar RV64 operations would require
1147 activation of all 8 lines, where SV elwidth based operations would
1148 activate the required subset of those byte-level write lines.
1149
1150 Example:
1151
1152 * rs1, rs2 and rd are all set to 8-bit
1153 * VL is set to 3
1154 * RV64 architecture is set (UXL=64)
1155 * add operation is carried out
1156 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1157 concatenated with similar add operations on bits 15..8 and 7..0
1158 * bits 24 through 63 **remain as they originally were**.
1159
1160 Example SIMD micro-architectural implementation:
1161
1162 * SIMD architecture works out the nearest round number of elements
1163 that would fit into a full RV64 register (in this case: 8)
1164 * SIMD architecture creates a hidden predicate, binary 0b00000111
1165 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1166 * SIMD architecture goes ahead with the add operation as if it
1167 was a full 8-wide batch of 8 adds
* SIMD architecture passes the top 5 elements through the adders
(which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
and stores them in rd.
1172
1173 This requires a read on rd, however this is required anyway in order
1174 to support non-zeroing mode.
1175
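A sketch of the byte-level write-enable idea in C (an illustrative model,
not a hardware description): the write-enable mask derived from VL and
elwidth selects which bytes of the 64-bit destination register are
updated, leaving the rest untouched. The function and parameter names are
assumptions for the purpose of the sketch:

    #include <stdint.h>

    // Sketch only: vl_bytes 8-bit adds into a 64-bit RV64 register (e.g.
    // vl_bytes=3 as in the example above), with the remaining bytes of rd
    // preserved via a byte write-enable mask.
    uint64_t masked_byte_add(uint64_t rd, uint64_t rs1, uint64_t rs2,
                             int vl_bytes) {
        uint64_t result = rd;                 // start from old value (non-zeroing)
        for (int byte = 0; byte < 8; byte++) {
            if (byte >= vl_bytes)
                continue;                     // write-enable off: byte preserved
            uint8_t a = (rs1 >> (8 * byte)) & 0xff;
            uint8_t b = (rs2 >> (8 * byte)) & 0xff;
            uint8_t sum = (uint8_t)(a + b);
            result &= ~(0xffULL << (8 * byte));
            result |=  ((uint64_t)sum << (8 * byte));
        }
        return result;
    }
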
1176 ## Polymorphic floating-point
1177
1178 Standard scalar RV integer operations base the register width on XLEN,
1179 which may be changed (UXL in USTATUS, and the corresponding MXL and
1180 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1181 arithmetic operations are therefore restricted to an active XLEN bits,
1182 with sign or zero extension to pad out the upper bits when XLEN has
1183 been dynamically set to less than the actual register size.
1184
1185 For scalar floating-point, the active (used / changed) bits are
1186 specified exclusively by the operation: ADD.S specifies an active
1187 32-bits, with the upper bits of the source registers needing to
1188 be all 1s ("NaN-boxed"), and the destination upper bits being
1189 *set* to all 1s (including on LOAD/STOREs).
1190
1191 Where elwidth is set to default (on any source or the destination)
1192 it is obvious that this NaN-boxing behaviour can and should be
1193 preserved. When elwidth is non-default things are less obvious,
1194 so need to be thought through. Here is a normal (scalar) sequence,
1195 assuming an RV64 which supports Quad (128-bit) FLEN:
1196
1197 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1198 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1199 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1200 top 64 MSBs ignored.
1201
1202 Therefore it makes sense to mirror this behaviour when, for example,
1203 elwidth is set to 32. Assume elwidth set to 32 on all source and
1204 destination registers:
1205
1206 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1207 floating-point numbers.
1208 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1209 in bits 0-31 and the second in bits 32-63.
1210 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1211
1212 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1213 of the registers either during the FLD **or** the ADD.D. The reason
1214 is that, effectively, the top 64 MSBs actually represent a completely
1215 independent 64-bit register, so overwriting it is not only gratuitous
1216 but may actually be harmful for a future extension to SV which may
1217 have a way to directly access those top 64 bits.
1218
1219 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1221 when "isvec" is false in a given register's CSR entry. Only when the
1222 elwidth is set to default **and** isvec is false will the standard
1223 RV behaviour be followed, namely that the upper bits be modified.
1224
1225 Ultimately if elwidth is default and isvec false on *all* source
1226 and destination registers, a SimpleV instruction defaults completely
1227 to standard RV scalar behaviour (this holds true for **all** operations,
1228 right across the board).
1229
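To summarise the rule, here is a minimal illustrative sketch in Python
(not normative: FLEN, the function name and its parameters are
assumptions made purely for the example). It decides whether the upper
bits of the destination are NaN-boxed, as in standard scalar RV, or
left exactly as they were:

    # Illustrative sketch of the upper-bits rule for FP destinations.
    FLEN = 128      # assume a Quad-capable register file for the example

    def write_fp_result(old, result, opwidth, elwidth_default, isvec):
        mask = (1 << opwidth) - 1
        if elwidth_default and not isvec:
            # standard scalar RV behaviour: NaN-box the destination
            return (((1 << FLEN) - 1) ^ mask) | (result & mask)
        # SV rule: non-default elwidth (or a vectorised register) leaves
        # the bits above opwidth exactly as they were
        return (old & ~mask) | (result & mask)

    # elwidth=32 (non-default): only the low 32 bits are written,
    # everything above is preserved rather than NaN-boxed.
    old = (0xAAAABBBBCCCCDDDD << 64) | 0xFFFFFFFFFFFFFFFF
    new = write_fp_result(old, 0x4048F5C3, opwidth=32,
                          elwidth_default=False, isvec=True)
    assert new >> 64 == 0xAAAABBBBCCCCDDDD    # top 64 bits untouched
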
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
to a non-default value, are effectively all the same: they all still
perform multiple ADD operations, just at different widths. A future
extension to SimpleV may actually allow ADD.S to access the upper bits
of the register, effectively breaking down a 128-bit register into a
bank of 4 independently-accessible 32-bit registers.
1236
In the meantime, although when VL is set to 8 (for example) it would
technically make no difference to the ALU whether ADD.S, ADD.D or ADD.Q
is used, using ADD.Q may be an easy way to signal to the
microarchitecture that it is to receive a higher VL value. On a
superscalar out-of-order architecture there may be absolutely no
difference; however, simpler SIMD-style microarchitectures may not have
the infrastructure in place to know the difference, such that when VL=8
and an ADD.D instruction is issued it completes in two cycles (or more)
rather than one, where an ADD.Q issued instead on such simpler
microarchitectures would complete in one.
1247
1248 ## Specific instruction walk-throughs
1249
1250 This section covers walk-throughs of the above-outlined procedure
1251 for converting standard RISC-V scalar arithmetic operations to
1252 polymorphic widths, to ensure that it is correct.
1253
1254 ### add
1255
1256 Standard Scalar RV32/RV64 (xlen):
1257
1258 * RS1 @ xlen bits
1259 * RS2 @ xlen bits
1260 * add @ xlen bits
1261 * RD @ xlen bits
1262
1263 Polymorphic variant:
1264
1265 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1266 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1267 * add @ max(rs1, rs2) bits
1268 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1269
1270 Note here that polymorphic add zero-extends its source operands,
1271 where addw sign-extends.
1272
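A minimal illustrative sketch of the polymorphic add rule above, in
Python (the function name is an assumption for the example only):

    # Illustrative model of the polymorphic (unsigned) add rule.
    def poly_add(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        a = rs1_val & ((1 << rs1_bits) - 1)      # sources, zero-extended
        b = rs2_val & ((1 << rs2_bits) - 1)      #   to max(rs1, rs2) bits
        result = (a + b) & ((1 << opwidth) - 1)  # add @ max(rs1, rs2) bits
        return result & ((1 << rd_bits) - 1)     # zero-extend or truncate to rd

    # 8-bit + 16-bit sources, 32-bit destination: the add is performed at
    # 16 bits, then zero-extended to 32 bits.
    assert poly_add(0xFF, 8, 0xFF00, 16, 32) == 0xFFFF
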
1273 ### addw
1274
1275 The RV Specification specifically states that "W" variants of arithmetic
1276 operations always produce 32-bit signed values. In a polymorphic
1277 environment it is reasonable to assume that the signed aspect is
1278 preserved, where it is the length of the operands and the result
1279 that may be changed.
1280
1281 Standard Scalar RV64 (xlen):
1282
1283 * RS1 @ xlen bits
1284 * RS2 @ xlen bits
1285 * add @ xlen bits
1286 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1287
1288 Polymorphic variant:
1289
1290 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1291 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1292 * add @ max(rs1, rs2) bits
1293 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1294
1295 Note here that polymorphic addw sign-extends its source operands,
1296 where add zero-extends.
1297
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1302
1303 Effectively however, both rs1 and rs2 are being sign-extended (or
1304 truncated), where for add they are both zero-extended. This holds true
1305 for all arithmetic operations ending with "W".
1306
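The corresponding illustrative sketch for the signed ("W") rule
(again, the helper names are assumptions for the example only):

    # Illustrative model of the polymorphic addw rule (signed).
    def sign_extend(value, bits):
        sign = 1 << (bits - 1)
        return (value & (sign - 1)) - (value & sign)

    def poly_addw(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        a = sign_extend(rs1_val, rs1_bits)       # sources, sign-extended
        b = sign_extend(rs2_val, rs2_bits)       #   to max(rs1, rs2) bits
        result = sign_extend(a + b, opwidth)     # add @ max(rs1, rs2) bits
        return result & ((1 << rd_bits) - 1)     # sign-extend/truncate to rd

    # 8-bit -1 plus 16-bit 0: -1 at 16 bits, sign-extended to 32 bits.
    assert poly_addw(0xFF, 8, 0x0000, 16, 32) == 0xFFFFFFFF
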
1307 ### addiw
1308
1309 Standard Scalar RV64I:
1310
1311 * RS1 @ xlen bits, truncated to 32-bit
1312 * immed @ 12 bits, sign-extended to 32-bit
1313 * add @ 32 bits
* RD @ xlen bits: the 32-bit result is sign-extended to xlen.
1315
1316 Polymorphic variant:
1317
1318 * RS1 @ rs1 bits
1319 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1320 * add @ max(rs1, 12) bits
1321 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
1322
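And an equivalent illustrative sketch for the polymorphic addiw rule
(helper names again being assumptions for the example only):

    # Illustrative model of the polymorphic addiw rule.
    def sign_extend(value, bits):
        sign = 1 << (bits - 1)
        return (value & (sign - 1)) - (value & sign)

    def poly_addiw(rs1_val, rs1_bits, imm12, rd_bits):
        opwidth = max(rs1_bits, 12)
        a = sign_extend(rs1_val, rs1_bits)
        b = sign_extend(imm12, 12)               # immed @ 12 bits, sign-extended
        result = sign_extend(a + b, opwidth)     # add @ max(rs1, 12) bits
        return result & ((1 << rd_bits) - 1)     # sign-extend/truncate to rd

    assert poly_addiw(0x00, 8, 0xFFF, 16) == 0xFFFF  # 0 + (-1) -> -1 @ 16 bits
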
1323 # Predication Element Zeroing
1324
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming: power is saved by avoiding a register read on elements that are
passed en-masse through the ALU. Simpler microarchitectures do not have
this issue: they simply do not pass the element through to the ALU at all,
and therefore do not store it back in the destination. More complex
non-lane-based micro-architectures can, when zeroing is not set, use the
predication bits to avoid sending element-based operations to the ALUs
entirely: thus, over the long term, potentially keeping all ALUs 100%
occupied even when elements are predicated out.
1335
1336 SimpleV's design principle is not based on or influenced by
1337 microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(i.e. whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1342
1343 ## Single-predication (based on destination register)
1344
1345 Zeroing on predication for arithmetic operations is taken from
1346 the destination register's predicate. i.e. the predication *and*
1347 zeroing settings to be applied to the whole operation come from the
1348 CSR Predication table entry for the destination register.
1349 Thus when zeroing is set on predication of a destination element,
1350 if the predication bit is clear, then the destination element is *set*
1351 to zero (twin-predication is slightly different, and will be covered
1352 next).
1353
1354 Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1356
    for (i = 0; i < VL; i++)
       if not zeroing: # an optimisation
          # skip over masked-out elements (they are left untouched)
          while (!(predval & 1<<i) && i < VL)
             if (int_vec[rd ].isvector)  { id += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
          if i == VL:
             return
       if (predval & 1<<i)
          src1 = ....
          src2 = ...
          result = src1 + src2 # actual add (or other op) here
          set_polymorphed_reg(rd, destwid, ird, result)
          if int_vec[rd].ffirst and result == 0:
             VL = i # result was zero, end loop early, return VL
             return
          if (!int_vec[rd].isvector) return
       else if zeroing:
          result = 0
          set_polymorphed_reg(rd, destwid, ird, result)
       if (int_vec[rd ].isvector)  { id += 1; }
       else if (predval & 1<<i) return
       if (int_vec[rs1].isvector)  { irs1 += 1; }
       if (int_vec[rs2].isvector)  { irs2 += 1; }
       if (rd == VL or rs1 == VL or rs2 == VL): return
1383
1384 The optimisation to skip elements entirely is only possible for certain
1385 micro-architectures when zeroing is not set. However for lane-based
1386 micro-architectures this optimisation may not be practical, as it
1387 implies that elements end up in different "lanes". Under these
1388 circumstances it is perfectly fine to simply have the lanes
1389 "inactive" for predicated elements, even though it results in
1390 less than 100% ALU utilisation.
1391
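The loop above can be cross-checked with a small self-contained model.
This is an illustrative Python sketch only: it deliberately leaves out
register redirection, element offsets, fail-on-first and the
scalar-destination early exit, and shows just the difference that the
zeroing flag makes to masked-out destination elements.

    # Illustrative model: predicated vector add, zeroing vs non-zeroing.
    def pred_add(rd, rs1, rs2, predval, vl, zeroing):
        for i in range(vl):
            if predval & (1 << i):
                rd[i] = rs1[i] + rs2[i]   # active element: normal add
            elif zeroing:
                rd[i] = 0                 # zeroing: masked-out element set to zero
            # non-zeroing: masked-out element left exactly as it was
        return rd

    old = [9, 9, 9, 9]
    assert pred_add(old[:], [1, 2, 3, 4], [10, 20, 30, 40],
                    predval=0b0101, vl=4, zeroing=False) == [11, 9, 33, 9]
    assert pred_add(old[:], [1, 2, 3, 4], [10, 20, 30, 40],
                    predval=0b0101, vl=4, zeroing=True) == [11, 0, 33, 0]
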
1392 ## Twin-predication (based on source and destination register)
1393
Twin-predication is not that much different, except that the source is
independently zero-predicated from the destination. This means that
the source may be zero-predicated *or* the destination zero-predicated,
*or both*, or neither.
1398
When, with twin-predication, zeroing is set on the source and not
the destination, a cleared predicate bit indicates that a zero
data element is passed through the operation (the exception being
a LOAD, where the source data element would be treated as an address:
in that case the data returned *from* the LOAD is zero, rather than
the LOAD looking up an *address* of zero).
1405
1406 When zeroing is set on the destination and not the source, then just
1407 as with single-predicated operations, a zero is stored into the destination
1408 element (or target memory address for a STORE).
1409
Zeroing on both source and destination effectively results in the two
predicates being ANDed together: wherever either the source predicate
*or* the destination predicate is set to 0, a zero element will
ultimately end up in the destination register.
1414
1415 However: this may not necessarily be the case for all operations;
1416 implementors, particularly of custom instructions, clearly need to
1417 think through the implications in each and every case.
1418
1419 Here is pseudo-code for a twin zero-predicated operation:
1420
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          # skipping of masked-out elements only occurs when not zeroing
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i))
                sourcedata = ireg[rs+i];
             else
                sourcedata = 0 # zerosrc: masked-out source becomes zero
             ireg[rd+j] <= sourcedata
          else if (zerodst)
             ireg[rd+j] <= 0
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break;
1444
1445 Note that in the instance where the destination is a scalar, the hardware
1446 loop is ended the moment a value *or a zero* is placed into the destination
1447 register/element. Also note that, for clarity, variable element widths
1448 have been left out of the above.
1449
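As a companion to the pseudo-code, here is an illustrative Python
sketch (again ignoring element widths, register redirection and the
scalar-destination early exit) of the case where zeroing is set on
*both* source and destination: no skipping occurs, so the indices
advance in lock-step and data survives only where both predicate bits
are set.

    # Illustrative: twin-predicated MV, zeroing on both source and destination.
    def twin_zeroing_mv(dest, src, ps, pd, vl):
        for i in range(vl):
            if pd & (1 << i):
                # destination active: pass source data, or zero if the
                # source predicate bit is clear (zerosrc behaviour)
                dest[i] = src[i] if (ps & (1 << i)) else 0
            else:
                dest[i] = 0               # zerodst: masked-out destination zeroed
        return dest

    out = twin_zeroing_mv([9, 9, 9, 9], [1, 2, 3, 4],
                          ps=0b0011, pd=0b0101, vl=4)
    assert out == [1, 0, 0, 0]            # data survives only where ps AND pd
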
1450 # Subsets of RV functionality
1451
1452 This section describes the differences when SV is implemented on top of
1453 different subsets of RV.
1454
1455 ## Common options
1456
1457 It is permitted to only implement SVprefix and not the VBLOCK instruction
1458 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1459 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1460 traps may emulate the format.
1461
1462 It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
1464 *MUST* raise illegal instruction on implementations that do not support
1465 VL or SUBVL.
1466
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.
1471
1472 ## RV32 / RV32F
1473
1474 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1475 maximum limit for predication is also restricted to 32 bits. Whilst not
1476 actually specifically an "option" it is worth noting.
1477
1478 ## RV32G
1479
Normally, in standard RV32, it does not make much sense to have
RV32G: the critical instructions that are missing from standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1484
1485 In an earlier draft of SV, it was possible to specify an elwidth
1486 of double the standard register size: this had to be dropped,
1487 and may be reintroduced in future revisions.
1488
1489 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1490
1491 When floating-point is not implemented, the size of the User Register and
1492 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1493 per table).
1494
1495 ## RV32E
1496
1497 In embedded scenarios the User Register and Predication CSRs may be
1498 dropped entirely, or optionally limited to 1 CSR, such that the combined
1499 number of entries from the M-Mode CSR Register table plus U-Mode
1500 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1501 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1502 the Predication CSR tables.
1503
1504 RV32E is the most likely candidate for simply detecting that registers
1505 are marked as "vectorised", and generating an appropriate exception
1506 for the VL loop to be implemented in software.
1507
1508 ## RV128
1509
RV128 has not been especially considered here; however, it has some
extremely large possibilities: double the element width implies 256-bit
operands, each spanning 2 128-bit registers, and predication of total
length 128 bits, given that XLEN is now 128.
1514
1515 # Example usage
1516
1517 TODO evaluate strncpy and strlen
1518 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1519
## strncpy <a name="strncpy"></a>
1521
1522 RVV version:
1523
    strncpy:
        c.mv a3, a0              # Copy dst
    loop:
        setvli x0, a2, vint8     # Vectors of bytes.
        vlbff.v v1, (a1)         # Get src bytes
        vseq.vi v0, v1, 0        # Flag zero bytes
        vmfirst a4, v0           # Zero found?
        vmsif.v v0, v0           # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t     # Write out bytes
        c.bgez a4, exit          # Done
        csrr t1, vl              # Get number of bytes fetched
        c.add a1, a1, t1         # Bump src pointer
        c.sub a2, a2, t1         # Decrement count.
        c.add a3, a3, t1         # Bump dst pointer
        c.bnez a2, loop          # Anymore?
    exit:
        c.ret
1542
1543 SV version (WIP):
1544
    strncpy:
        c.mv a3, a0
        VBLK.RegCSR[t0] = 8bit, t0, vector
        VBLK.PredTb[t0] = ffirst, x0, inv
    loop:
        VBLK.SETVLI a2, t4, 8    # t4 and VL now 1..8 (MVL=8)
        c.ldb t0, (a1)           # t0 fail first mode
        c.bne t0, x0, allnonzero # still ff
        # VL (t4) points to last nonzero
        c.addi t4, t4, 1         # include zero
        c.stb t0, (a3)           # store incl zero
        c.ret                    # end subroutine
    allnonzero:
        c.stb t0, (a3)           # VL legal range
        c.add a1, a1, t4         # Bump src pointer
        c.sub a2, a2, t4         # Decrement count.
        c.add a3, a3, t4         # Bump dst pointer
        c.bnez a2, loop          # Anymore?
    exit:
        c.ret
1565
1566 Notes:
1567
1568 * Setting MVL to 8 is just an example. If enough registers are spare it
1569 may be set to XLEN which will require a bank of 8 scalar registers for
1570 a1, a3 and t0.
* obviously if that is done, t0 is not separated by 8 full registers and
  would overwrite t1 through t7; x80, as an example, would work well instead.
1573 * with the exception of the GETVL (a pseudo code alias for csrr), every
1574 single instruction above may use RVC.
1575 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1576 registers through redirection
1577 * RVC C.LW and C.SW may be used because the W format may be overridden by
1578 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1579 * with the exception of the GETVL, all Vector Context may be done in
1580 VBLOCK form.
1581 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1582 just ffirst on t0
1583 * ldb and bne are both using t0, both in ffirst mode
1584 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1585 vectorised, no (un)sign-extension or truncation" mode.
* ldb will end on illegal mem and reduce VL, but will have copied all sorts
  of stuff into t0 (which could contain zeros).
1588 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1589 scalar x0
1590 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1591 compares, and reduce VL as well
1592 * the branch only goes to allnonzero if all tests succeed
1593 * if it did not, we can safely increment VL by 1 (using a4) to include
1594 the zero.
1595 * SETVL sets *exactly* the requested amount into VL.
1596 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1597 activates but the bne allzeros does not.
1598 * this would cause the stb to copy up to the end of the legal memory
1599 * of course, on the next loop the ldb would throw a trap, as a1 now
1600 points to the first illegal mem location.
1601
1602 ## strcpy
1603
1604 RVV version:
1605
        mv a3, a0                # Save start
    loop:
        setvli a1, x0, vint8     # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)        # Get bytes
        csrr a1, vl              # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0        # Set v0[i] where v1[i] = 0
        add a3, a3, a1           # Bump pointer
        vmfirst a2, v0           # Find first set bit in mask, returns -1 if none
        bltz a2, loop            # Not found?
        add a0, a0, a1           # Sum start + bump
        add a3, a3, a2           # Add index of zero byte
        sub a0, a3, a0           # Subtract start address+bump
        ret
1619
1620 ## DAXPY <a name="daxpy"></a>
1621
1622 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1623
1624 Notes:
1625
1626 * Setting MVL to 4 is just an example. With enough space between the
1627 FP regs, MVL may be set to larger values
1628 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1629 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1630 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1631 * All instructions except fmadd may use Compressed variants. Total
1632 number of 16-bit instruction words: 11.
1633 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1634