# Simple-V (Parallelism Extension Proposal) Appendix (OBSOLETE)

**OBSOLETE**

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 30 Jun 2019
* main spec [[specification]]

[[!toc ]]

# Fail-on-first modes <a name="ffirst"></a>

Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
that is zero", however with traps, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (when VL=1).

Note that implementors are required to mutually exclusively choose one
mode or the other: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies
to custom opcode writers as well as future extension writers.

## Fail-on-first traps

Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and all
subsequent indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence element
that did not take the trap.

Note that predicated-out elements (where the predicate mask bit is
zero) are clearly excluded (i.e. the trap will not occur). However,
note that the loop still had to test the predicate bit: thus on return,
VL is set to include elements that did not take the trap *and* includes
the elements that were predicated (masked) out (not tested up to the
point where the trap occurred).

Unlike conditional tests, "fail-on-first trap" instruction behaviour is
unaltered by setting zero or non-zero predication mode.

If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst is not set); subsequently, the
trap must not occur in any later *sub-group* of elements. SUBVL will
**NOT** be modified. Traps must analyse (x)eSTATE (subvl offset indices)
to determine the element that caused the trap.

Given that predication bits apply to SUBVL groups, the same rules apply
to predicated-out (masked-out) sub-groups in calculating the value that
VL is set to.

## Fail-on-first conditional tests

ffirst stops sequential (or sequentially-appearing, in the case of
out-of-order designs) element conditional testing on the first element
result being zero (or other "fail" condition). VL is set to the number
of elements that were (sequentially) processed before the fail-condition
was encountered.

Unlike trap fail-on-first, fail-on-first conditional testing behaviour
responds to changes in the zero or non-zero predication mode. Whilst
in non-zeroing mode, masked-out elements are simply not tested (and
thus considered "never to fail"), in zeroing mode, masked-out elements
may be viewed as *always* (unconditionally) failing. This effectively
turns VL into something akin to a software-controlled loop.

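To illustrate, a minimal Python sketch of how VL might be recalculated
under fail-on-first conditional testing (the helper `test()` and the flat
predicate bitmask are assumptions for illustration, not part of the
specification):

    def ffirst_conditional_vl(VL, pred, zeroing, test):
        # returns the new VL after data-dependent fail-on-first testing
        for i in range(VL):
            masked_out = ((pred >> i) & 1) == 0
            if masked_out:
                if zeroing:
                    return i      # masked-out counts as an unconditional fail
                continue          # non-zeroing: not tested, "never fails"
            if test(i) == 0:      # data-dependent "fail" condition
                return i          # VL = elements processed before the fail
        return VL                 # no fail encountered: VL unchanged
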
Note that just as with traps, if SUBVL!=1, the first failing test in a
*sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
traps must analyse (x)eSTATE (subvl offset indices) to determine the
element that caused the trap.

Note again that, just as with traps, predicated-out (masked-out) elements
are included in the (sequential) count leading up to the fail-condition,
even though they were not tested.

# Instructions <a name="instructions" />

Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all RVV opcodes,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*, with the exception of CLIP and VSELECT.X.
Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.

Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
equivalents, so are left out of Simple-V. VSELECT could be included if
there existed a MV.X instruction in RV (MV.X is a hypothetical
non-immediate variant of MV that would allow another register to
specify which register was to be copied). Note that if any of these three
instructions are added to any given RV extension, their functionality
will be inherently parallelised.

With some exceptions, where it does not make sense or is simply too
challenging, all RV-Base instructions are parallelised:

* CSR instructions are the fundamental core basis of SV, so are not
  parallelised. Whilst a case could be made for fast-polling of
  a CSR into multiple registers, or for being able to copy multiple
  contiguously addressed CSRs into contiguous registers, and so on,
  if parallelised, extreme care would need to be taken. Additionally,
  CSR reads are done using x0, and it is *really* inadvisable to tag x0.
* LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
  left as scalar.
* LR/SC could hypothetically be parallelised, however their purpose is
  single (complex) atomic memory operations, where the LR must be followed
  up by a matching SC. A sequence of parallel LR instructions followed
  by a sequence of parallel SC instructions is therefore guaranteed not
  to be useful. Not least: the guarantees of a Multi-LR/SC
  would be impossible to provide if emulated in a trap.
* EBREAK, NOP, FENCE and others do not use registers so are not inherently
  paralleliseable anyway.

All other operations using registers are automatically parallelised.
This includes AMOMAX, AMOSWAP and so on, where particular care and
attention must be paid.

Example pseudo-code for an integer ADD operation (including scalar
operations). Floating-point uses the FP Register Table.

[[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]

Note that for simplicity there is quite a lot missing from the above
pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
reshaping and offsets and so on. However it demonstrates the basic
principle. Augmentations that produce the full pseudo-code are covered in
other sections.

## SUBVL Pseudocode <a name="subvl-pseudocode"></a>

Adding in support for SUBVL is a matter of adding in an extra inner
for-loop, where register src and dest are still incremented inside the
inner part. Note that the predication is still taken from the VL index.

So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".

    function op_add(rd, rs1, rs2) # add not VADD!
        int i, id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
        rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
        rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
        for (i = 0; i < VL; i++)
            xSTATE.srcoffs = i # save context
            for (s = 0; s < SUBVL; s++)
                xSTATE.ssvoffs = s # save context
                if (predval & 1<<i) # predication uses intregs
                    # actual add is here (at last)
                    ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                    if (!int_vec[rd ].isvector) break;
                if (int_vec[rd ].isvector)  { id += 1; }
                if (int_vec[rs1].isvector)  { irs1 += 1; }
                if (int_vec[rs2].isvector)  { irs2 += 1; }
                if (id == VL or irs1 == VL or irs2 == VL) {
                    # end VL hardware loop
                    xSTATE.srcoffs = 0; # reset
                    xSTATE.ssvoffs = 0; # reset
                    return;
                }

NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
elwidth handling etc. all left out.

## Instruction Format

It is critical to appreciate that there are
**no operations added to SV, at all**.

Instead, by using CSRs to tag registers as an indication of "changed
behaviour", SV *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV, FCVT, and
LOAD/STORE depending on CSR configurations for bitwidth and predication.
**Everything** becomes parallelised. *This includes Compressed
instructions* as well as any future instructions and Custom Extensions.

Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that
XLEN=32/64/128. FRM changes the behaviour of the floating-point unit, to
alter the rounding mode. Other architectures change the LOAD/STORE
byte-order from big-endian to little-endian on a per-instruction basis.
SV is just a little more... comprehensive in its effect on instructions.

## Branch Instructions

Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
table) and enhancing them to branch "consensually" depending on *multiple*
tests. "ffirst" - fail-on-first - condition mode can also be enabled,
to terminate the comparisons early.
See ffirst mode in the Predication Table section.

There are two registers for the comparison operation, therefore there
is the opportunity to associate two predicate registers (note: not in
the same way as twin-predication). The first is a "normal" predicate
register, which acts just as it does on any other single-predicated
operation: masks out elements where a bit is zero, applies an inversion
to the predicate mask, and enables zeroing / non-zeroing mode.

The second (not to be confused with a twin-predication 2nd register)
is utilised to indicate where the results of each comparison are to
be stored, as a bitmask. Additionally, the behaviour of the branch -
when it occurs - may also be modified depending on whether the 2nd predicate's
"invert" and "zeroing" bits are set. These four combinations result
in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
cbranch.ifall (AND), cbranch.ifnotall (NAND).

| invert | zeroing | description          | operation | cbranch  |
| ------ | ------- | -------------------- | --------- | -------- |
| 0      | 0       | branch if all pass   | AND       | ifall    |
| 1      | 0       | branch if one fails  | NAND      | ifnotall |
| 0      | 1       | branch if one passes | OR       | ifany    |
| 1      | 1       | branch if all fail   | NOR       | ifnone   |

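As a cross-check, a minimal Python sketch of the four "post-analysis"
branch decisions in the table (mirroring the fuller pseudocode given
later in this section; `result` is the stored comparison bitmask and
`ps` the first predicate):

    def cbranch_taken(result, ps, invert, zeroing):
        if invert:
            if zeroing:
                return result == 0              # NOR: cbranch.ifnone
            return (result & ps) != result      # NAND: cbranch.ifnotall
        if zeroing:
            return result != 0                  # OR: cbranch.ifany
        return (result & ps) == result          # AND: cbranch.ifall
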
This inversion capability covers AND, OR, NAND and NOR branching
based on multiple element comparisons. Without the full set of four,
it is necessary to have two-sequence branch operations: one conditional, one
unconditional.

Note that, unlike normal computer programming where chains of AND or OR
conditional tests may terminate early, here the chain does *not* terminate
early except if fail-on-first is set, and even then ffirst ends on the first
data-dependent zero. When ffirst mode is not set, *all* conditional
element tests must be performed (and the result optionally stored in
the result mask), with a "post-analysis" phase carried out which checks
whether to branch.

Note also that whilst it may seem excessive to have all four (because
conditional comparisons may be inverted by swapping src1 and src2),
data-dependent fail-on-first is *not* invertible and *only* terminates
on the first zero-condition encountered. Additionally it may be inconvenient
to have to swap the predicate registers associated with src1 and src2,
because this involves a new VBLOCK Context.

### Standard Branch <a name="standard_branch"></a>

Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).

Note that the predication register to use (if one is enabled) is taken from
the *first* src register, and that this is used, just as with predicated
arithmetic operations, to mask whether the comparison operations take
place or not. The target (destination) predication register
to use (if one is enabled) is taken from the *second* src register.

If either of src1 or src2 are scalars (whether by there being no
CSR register entry or whether by the CSR entry specifically marking
the register as "scalar") the comparison goes ahead as vector-scalar
or scalar-vector.

In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).

Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) be set to zero. Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared. This is important to appreciate,
as there may be an expectation that, going into the hardware-loop,
the destination predicate is always expected to be set to zero:
this is **not** the case. The destination predicate is only set
to zero if **zeroing** is enabled.

Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2, however note that in doing so, the predicate table
setup must also be correspondingly adjusted.

In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":

    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);

With associated predication, vector-length adjustments and so on,
and temporarily ignoring bitwidth (which makes the comparisons more
complex), this becomes:

    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion, pred_zeroing = get_pred_flags(rs2)
    else
        pred_inversion, pred_zeroing = False, False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);
                if ffirst_mode:
                    break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if pred_zeroing:
            # NOR
            if result == 0:
                goto branch
        else:
            # NAND
            if (result & ps) != result:
                goto branch
    else:
        if pred_zeroing:
            # OR
            if result != 0:
                goto branch
        else:
            # AND
            if (result & ps) == result:
                goto branch

Notes:

* Predicated SIMD comparisons would break src1 and src2 further down
  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
  Reordering") setting Vector-Length times (number of SIMD elements) bits
  in Predicate Register rd, as opposed to just Vector-Length bits.
* The execution of "parallelised" instructions **must** be implemented
  as "re-entrant" (to use a term from software). If an exception (trap)
  occurs during the middle of a vectorised
  Branch (now an SV predicated compare) operation, the partial results
  of any comparisons must be written out to the destination
  register before the trap is permitted to begin. If however there
  is no predicate, the **entire** set of comparisons must be **restarted**,
  with the offset loop indices set back to zero. This is because
  there is no place to store the temporary result during the handling
  of traps.

TODO: predication is now taken from src2. Also, the branch goes ahead
if all compares are successful.

Note also that where normally, predication requires that there must
also be a CSR register entry for the register being used in order
for the **predication** CSR register entry to also be active,
for branches this is **not** the case. src2 does **not** have
to have its CSR register entry marked as active in order for
predication on src2 to be active.

Also note: SV Branch operations are **not** twin-predicated
(see Twin Predication section). This would require three
element offsets: one to track src1, one to track src2 and a third
to track where to store the accumulation of the results. Given
that the element offsets need to be exposed via CSRs so that
the parallel hardware looping may be made re-entrant on traps
and exceptions, the decision was made not to make SV Branches
twin-predicated.

### Floating-point Comparisons

There are no floating-point branch operations, only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.

In RV (scalar) Base, a branch on a floating-point compare is
done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
This does extend to SV, as long as x1 (in the example sequence given)
is vectorised. When that is the case, x1..x(1+VL-1) will also be
set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
so on. Consequently, unlike integer-branch, FP Compare needs no
modification in its behaviour.

In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
missing, and whilst in ordinary branch code this is fine because the
standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero), in predication
terms that becomes more of an impact. To deal with this, SV's predication
has had "invert" added to it.

Also: note that FP Compare may be predicated, using the destination
integer register (rd) to determine the predicate. FP Compare is **not**
a twin-predication operation, as, again, just as with SV Branches,
there are three registers involved: FP src1, FP src2 and INT rd.

Also: note that ffirst (fail first mode) applies directly to this operation.

### Compressed Branch Instruction

Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz rs1 is equivalent to beq rs1, x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.

The specific required use of x0 is, with a little thought, quite obvious,
though counterintuitive at first. Clearly it is **not** recommended to
redirect x0 with a CSR register entry, however as a means to opaquely obtain
a predication target it is the only sensible option that does not involve
additional special CSRs (or, worse, additional special opcodes).

Note also that, just as with standard branches, the 2nd source
(in this case x0 rather than src2) does **not** have to have its CSR
register table marked as "active" in order for predication to work.

## Vectorised Dual-operand instructions

There is a series of 2-operand instructions involving copying (and
sometimes alteration):

* C.MV
* FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
* C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
* LOAD(-FP) and STORE(-FP)

All of these operations follow the same two-operand pattern, so it is
*both* the source *and* destination predication masks that are taken into
account. This is different from
the three-operand arithmetic instructions, where the predication mask
is taken from the *destination* register, and applied uniformly to the
elements of the source register(s), element-for-element.

The pseudo-code pattern for twin-predicated operations is as
follows:

    function op(rd, rs):
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break

This pattern covers scalar-scalar, scalar-vector, vector-scalar
and vector-vector, and predicated variants of all of those.
Zeroing is not presently included (TODO). As such, when compared
to RVV, the twin-predicated variants of C.MV and FMV cover
**all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.

Note that:

* elwidth (SIMD) is not covered in the pseudo-code above
* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
  not covered
* zero predication is also not shown (TODO).

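As a usage illustration of the pattern above, a Python sketch (predication
omitted for brevity; register numbers are purely illustrative) showing how
a scalar source with a vector destination yields VSPLAT: i never
increments, so the one source element is broadcast:

    def twin_pred_mv(regs, rd, rs, rd_isvec, rs_isvec, VL):
        i = j = 0
        while i < VL and j < VL:
            regs[rd + j] = regs[rs + i]
            if rs_isvec:
                i += 1        # vector src: walk the source elements
            if rd_isvec:
                j += 1        # vector dest: walk the dest elements
            else:
                break         # scalar dest: single result

    regs = [0] * 32
    regs[5] = 42
    twin_pred_mv(regs, 8, 5, rd_isvec=True, rs_isvec=False, VL=4)
    # regs[8..11] are now all 42: a VSPLAT
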
### C.MV Instruction <a name="c_mv"></a>

There is no MV instruction in RV however there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register are marked as vectors
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation. The actual instruction's format does not change:

[[!table data="""
15 12 | 11 7 | 6 2 | 1 0 |
funct4 | rd | rs | op |
4 | 5 | 5 | 2 |
C.MV | dest | src | C0 |
"""]]

A simplified version of the pseudocode for this operation is as follows:

    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            ireg[rd+j] <= ireg[rs+i];
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break

There are several different instructions from RVV that are covered by
this one opcode:

[[!table data="""
src | dest | predication | op |
scalar | vector | none | VSPLAT |
scalar | vector | destination | sparse VSPLAT |
scalar | vector | 1-bit dest | VINSERT |
vector | scalar | 1-bit? src | VEXTRACT |
vector | vector | none | VCOPY |
vector | vector | src | Vector Gather |
vector | vector | dest | Vector Scatter |
vector | vector | src & dest | Gather/Scatter |
vector | vector | src == dest | sparse VCOPY |
"""]]

Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
operations with zeroing off, and inversion on the src and dest predication
for one of the two C.MV operations. The non-inverted C.MV will place
one set of registers into the destination, and the inverted one the other
set. With predicate-inversion, copying and inversion of the predicate mask
need not be done as a separate (scalar) instruction.

Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].

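A minimal Python sketch of that addi-based behaviour (a single predicate,
taken from rd, applied element-for-element; names are illustrative only),
for contrast with the twin-predicated C.MV pseudocode above:

    def mv_via_addi(regs, rd, rs, pd, VL):
        for i in range(VL):
            if (pd >> i) & 1:                 # one predicate only, from rd
                regs[rd + i] = regs[rs + i]   # rd[i] = rs[i]
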
### FMV, FNEG and FABS Instructions

These are identical in form to C.MV, except covering floating-point
register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions)

### FCVT Instructions

These are again identical in form to C.MV, except that they cover
floating-point to integer and integer to floating-point. When element
width in each vector is set to default, the instructions behave exactly
as they are defined for standard RV (scalar) operations, except vectorised
in exactly the same fashion as outlined in C.MV.

However when the source or destination element width is not set to default,
the opcode's explicit element widths are *over-ridden* to new definitions,
and the opcode's element width is taken as indicative of the SIMD width
(if applicable i.e. if packed SIMD is requested) instead.

For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision floating-point number in rd.
If however the source rs1 is set to be a vector, where elwidth is set to
default/2 and "packed SIMD" is enabled, then the first 32 bits of
rs1 are converted to a floating-point number to be stored in rd's
first element and the higher 32-bits *also* converted to floating-point
and stored in the second. The 32 bit size comes from the fact that
FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
divide that by two it means that rs1 element width is to be taken as 32.

Similar rules apply to the destination register.

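A Python sketch of that packed-SIMD example (an illustration of the
over-ride rule only, not a definitive implementation):

    def fcvt_s_l_packed(rs1):
        # rs1 elwidth = 32 (64/2, packed SIMD): treat the 64-bit register
        # as two 32-bit signed integers and convert each to float
        def s32(x):
            x &= 0xFFFFFFFF
            return x - (1 << 32) if x & (1 << 31) else x
        lo = s32(rs1)              # first element  -> rd element 0
        hi = s32(rs1 >> 32)        # second element -> rd element 1
        return [float(lo), float(hi)]
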
## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>

An earlier draft of SV modified the behaviour of LOAD/STORE (modified
the interpretation of the instruction fields). This
actually undermined the fundamental principle of SV, namely that there
be no modifications to the scalar behaviour (except where absolutely
necessary), in order to simplify an implementor's task if considering
converting a pre-existing scalar design to support parallelism.

So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
do not change in SV, however just as with C.MV it is important to note
that dual-predication is possible.

In vectorised architectures there are usually at least two different modes
for LOAD/STORE:

* Read (or write for STORE) from sequential locations, where one
  register specifies the address, and the one address is incremented
  by a fixed amount. This is usually known as "Unit Stride" mode.
* Read (or write) from multiple indirected addresses, where the
  vector elements each specify separate and distinct addresses.

To support these different addressing modes, the CSR Register "isvector"
bit is used. So, for a LOAD, when the src register is set to
scalar, the LOADs are sequentially incremented by the src register
element width, and when the src register is set to "vector", the
elements are treated as indirection addresses. Simplified
pseudo-code would look like this:

    function op_ld(rd, rs) # LD not VLD!
        rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            if (int_csr[rs].isvec)
                # indirect mode (multi mode)
                srcbase = ireg[rsv+i];
            else
                # unit stride mode
                srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
            ireg[rdv+j] <= mem[srcbase + imm_offs];
            if (!int_csr[rs].isvec &&
                !int_csr[rd].isvec) break # scalar-scalar LD
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++;

Notes:

* For simplicity, zeroing and elwidth is not included in the above:
  the key focus here is the decision-making for srcbase; vectorised
  rs means use sequentially-numbered registers as the indirection
  address, and scalar rs is "offset" mode.
* The test towards the end for whether both source and destination are
  scalar is what makes the above pseudo-code provide the "standard" RV
  Base behaviour for LD operations.
* The offset in bytes (XLEN/8) changes depending on whether the
  operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
  (8 bytes), and also whether the element width is over-ridden
  (see special element width section).

## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>

C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
It is therefore possible to use predicated C.LWSP to efficiently
pop registers off the stack (by predicating x2 as the source), cherry-picking
which registers to store to (by predicating the destination). Likewise
for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.

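A Python sketch of that cherry-picked pop (following the twin-predicated
LD pattern above; the destination register base, the 4-byte word size and
the helper names are illustrative assumptions, not the specification):

    def predicated_c_lwsp(mem, regs, ps, pd, VL, rd_base):
        sp = regs[2]                              # x2 is implicitly the source
        i = j = 0
        while i < VL and j < VL:
            while i < VL and not (ps >> i) & 1:   # src predicate: stack slots
                i += 1
            while j < VL and not (pd >> j) & 1:   # dest predicate: registers
                j += 1
            if i >= VL or j >= VL:
                break
            regs[rd_base + j] = mem[sp + i * 4]   # unit-stride word "pop"
            i += 1
            j += 1
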
The two modes ("unit stride" and multi-indirection) are still supported,
as with standard LD/ST. Essentially, the only difference is that the
use of x2 is hard-coded into the instruction.

**Note**: it is still possible to redirect x2 to an alternative target
register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
general-purpose LOAD/STORE operations.

## Compressed LOAD / STORE Instructions

Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
on the src for LOAD and dest for STORE switches mode from "Unit Stride"
to "Multi-indirection", respectively.

# Element bitwidth polymorphism <a name="elwidth"></a>

Element bitwidth is best covered as its own special section, as it
is quite involved and applies uniformly across-the-board. SV restricts
bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.

The effect of setting an element bitwidth is to re-cast each entry
in the register table, and for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
now looks like this:

    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];

where the CSR Register table entry (not the instruction alone) determines
which of those union entries is to be used on each operation, and the
VL element offset in the hardware-loop specifies the index into each array.

However a naive interpretation of the data structure above masks the
fact that setting VL greater than 8, for example, when the bitwidth is 8,
accessing one specific register "spills over" to the following parts of
the register file in a sequential fashion. So a much more accurate way
to reflect this would be:

    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];

where when accessing any individual regfile[n].b entry it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" to consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an attempt to access beyond the "real" register
bytes is ever made.

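The same overspill behaviour can be sketched in Python by modelling the
integer register file as flat bytes (an illustrative model, not part of
the specification):

    # RV64: 8 bytes per register, 128 registers
    regfile = bytearray(128 * 8)

    def get_elem(reg, elwidth_bytes, offset):
        # element "offset" of register "reg"; large offsets spill over
        # into following registers, exactly as the union version does
        start = reg * 8 + offset * elwidth_bytes
        if start + elwidth_bytes > len(regfile):
            raise IndexError("access beyond the real register file")
        return int.from_bytes(regfile[start:start+elwidth_bytes], 'little')
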
Now we may modify the pseudo-code of an operation where all element
bitwidths have been set to the same size, where this pseudo-code is
otherwise identical to its "non"-polymorphic versions (above):

    function op_add(rd, rs1, rs2) # add not VADD!
        ...
        ...
        for (i = 0; i < VL; i++)
            ...
            ...
            // TODO, calculate if over-run occurs, for each elwidth
            if (elwidth == 8) {
                int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                         int_regfile[rs2].b[irs2];
            } else if elwidth == 16 {
                int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                         int_regfile[rs2].s[irs2];
            } else if elwidth == 32 {
                int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                         int_regfile[rs2].i[irs2];
            } else { // elwidth == 64
                int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                         int_regfile[rs2].l[irs2];
            }
            ...
            ...

So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
following sequentially on respectively from the same) are "type-cast"
to 8-bit; for 16-bit entries likewise and so on.

However that only covers the case where the element widths are the same.
Where the element widths are different, the following algorithm applies:

* Analyse the bitwidth of all source operands and work out the
  maximum. Record this as "maxsrcbitwidth".
* If any given source operand requires sign-extension or zero-extension
  (lb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
  sign-extension / zero-extension or whatever is specified in the standard
  RV specification, **change** that to sign-extending from the respective
  individual source operand's bitwidth from the CSR table out to
  "maxsrcbitwidth" (previously calculated), instead.
* Following separate and distinct (optional) sign/zero-extension of all
  source operands as specifically required for that operation, carry out the
  operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
  this may be a "null" (copy) operation, and that with FCVT, the changes
  to the source and destination bitwidths may also turn FCVT effectively
  into a copy).
* If the destination operand requires sign-extension or zero-extension,
  instead of a mandatory fixed size (typically 32-bit for arithmetic,
  for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
  etc.), overload the RV specification with the bitwidth from the
  destination register's elwidth entry.
* Finally, store the (optionally) sign/zero-extended value into its
  destination: memory for sb/sh etc., or an offset section of the register
  file for an arithmetic operation.

In this way, polymorphic bitwidths are achieved without requiring a
massive 64-way permutation of calculations **per opcode**, for example
(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
rd bitwidths). The pseudo-code is therefore as follows:

    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, ird, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make it clear that:

* the source operands are extended out to the maximum bitwidth of all
  source operands
* the operation takes place at that maximum source bitwidth (the
  destination bitwidth is not involved at this point, at all)
* the result is extended (or potentially even, truncated) before being
  stored in the destination. i.e. truncation (if required) to the
  destination width occurs **after** the operation **not** before.
* when the destination is not marked as "vectorised", the **full**
  (standard, scalar) register file entry is taken up, i.e. the
  element is either sign-extended or zero-extended to cover the
  full register bitwidth (XLEN) if it is not already XLEN bits long.

Implementors are entirely free to optimise the above, particularly
if it is specifically known that any given operation will complete
accurately in less bits, as long as the results produced are
directly equivalent and equal, for all inputs and all outputs,
to those produced by the above algorithm.

## Polymorphic floating-point operation exceptions and error-handling

For floating-point operations, conversion takes place without raising any
kind of exception. Exactly as specified in the standard RV specification,
NaN (or appropriate) is stored if the result is beyond the range of the
destination, and, again, exactly as with the standard RV specification
just as with scalar operations, the floating-point flag is raised
(FCSR). And, again, just as with scalar operations, it is software's
responsibility to check this flag. Given that the FCSR flags are
"accrued", the fact that multiple element operations could have occurred
is not a problem.

Note that it is perfectly legitimate for floating-point bitwidths of
only 8 to be specified. However whilst it is possible to apply IEEE 754
principles, no actual standard yet exists. Implementors wishing to
provide hardware-level 8-bit support rather than throw a trap to emulate
in software should contact the author of this specification before
proceeding.

## Polymorphic shift operators

A special note is needed for changing the element width of left and
right shift operators, particularly right-shift. Even for standard RV
base, in order for correct results to be returned, the second operand
RS2 must be truncated to be within the range of RS1's bitwidth.
spike's implementation of sll for example is as follows:

    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));

which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
range 0..31 so that RS1 will only be left-shifted by the amount that
is possible to fit into a 32-bit register. Whilst this appears not
to matter for hardware, it matters greatly in software implementations,
and it also matters where an RV64 system is set to "RV32" mode, such
that the underlying registers RS1 and RS2 comprise 64 hardware bits
each.

For SV, where each operand's element bitwidth may be over-ridden, the
rule about determining the operation's bitwidth *still applies*, being
defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
**also applies to the truncation of RS2**. In other words, *after*
determining the maximum bitwidth, RS2's range must **also be truncated**
to ensure a correct answer. Example:

* RS1 is over-ridden to a 16-bit width
* RS2 is over-ridden to an 8-bit width
* RD is over-ridden to a 64-bit width
* the maximum bitwidth is thus determined to be 16-bit - max(8,16)
* RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)

Pseudocode (in spike) for this example would therefore be:

    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));

This example illustrates that considerable care therefore needs to be
taken to ensure that left and right shift operations are implemented
correctly. The key is that:

* The operation bitwidth is determined by the maximum bitwidth
  of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
## Polymorphic MULH/MULHU/MULHSU

MULH is designed to take the top half MSBs of a multiply that
does not fit within the range of the source operands, such that
smaller width operations may produce a full double-width multiply
in two cycles. The issue is: SV allows the source operands to
have variable bitwidth.

Here again special attention has to be paid to the rules regarding
bitwidth, which, again, are that the operation is performed at
the maximum bitwidth of the **source** registers. Therefore:

* An 8-bit x 8-bit multiply will create a 16-bit result that must
  be shifted down by 8 bits
* A 16-bit x 8-bit multiply will create a 24-bit result that must
  be shifted down by 16 bits (top 8 bits being zero)
* A 16-bit x 16-bit multiply will create a 32-bit result that must
  be shifted down by 16 bits
* A 32-bit x 16-bit multiply will create a 48-bit result that must
  be shifted down by 32 bits
* A 32-bit x 8-bit multiply will create a 40-bit result that must
  be shifted down by 32 bits

So again, just as with shift-left and shift-right, the result
is shifted down by the maximum of the two source register bitwidths.
And, exactly again, truncation or sign-extension is performed on the
result. If sign-extension is to be carried out, it is performed
from the same maximum of the two source register bitwidths out
to the result element's bitwidth.

If truncation occurs, i.e. the top MSBs of the result are lost,
this is "Officially Not Our Problem", i.e. it is assumed that the
programmer actually desires the result to be truncated. i.e. if the
programmer wanted all of the bits, they would have set the destination
elwidth to accommodate them.

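A Python sketch of signed MULH under these rules (illustrative only;
sign-extension of the final result to the destination elwidth is elided
into a simple truncation):

    def sext(val, width):
        # interpret the low "width" bits of val as a signed integer
        sign = 1 << (width - 1)
        val &= (1 << width) - 1
        return (val & (sign - 1)) - (val & sign)

    def mulh_poly(a, wa, b, wb, wd):
        opw = max(wa, wb)                # operation width = max(src widths)
        full = sext(a, wa) * sext(b, wb) # double-width product
        res = full >> opw                # MULH: shift down by opw
        return res & ((1 << wd) - 1)     # truncate/store at dest elwidth

    # 16-bit x 8-bit: product shifted down by 16 (per the list above)
    print(hex(mulh_poly(0x4000, 16, 0x40, 8, 16)))  # 0x10
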
## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>

Polymorphic element widths in vectorised form means that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.

This makes for a slightly more complex algorithm when using indirection
on the "addressed" register (source for LOAD and destination for STORE),
particularly given that the LOAD/STORE instruction provides important
information about the width of the data to be reinterpreted.

Let's illustrate the "load" part, where the pseudo-code for elwidth=default
was as follows, and i is the loop from 0 to VL-1:

    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits

Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
chunks are taken from the source memory location addressed by the current
indexed source address register, and only when a full 32-bits-worth
are taken will the index be moved on to the next contiguous source
address register:

    bitwidth = bw(elwidth);             // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth         // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock;             // modulo
    return &mem[srcbase + imm + offs];  // re-cast to uint8_t*, uint16_t* etc.

Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
and 128 for LQ.

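A short worked sketch (Python) of the chunking arithmetic above, for LW
(opwidth 32) with a source elwidth of 16, purely to illustrate the
divide/modulo:

    bitwidth = 16
    elsperblock = 32 // bitwidth   # = 2 elements per address register
    for i in range(4):
        print(i // elsperblock, i % elsperblock)
    # i=0,1 -> ireg[rs+0], offsets 0,1; i=2,3 -> ireg[rs+1], offsets 0,1
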
The principle is basically exactly the same as if the srcbase were pointing
at the memory of the *register* file: memory is re-interpreted as containing
groups of elwidth-wide discrete elements.

When storing the result from a load, it's important to respect the fact
that the destination register has its *own separate element width*. Thus,
when each element is loaded (at the source element width), any sign-extension
or zero-extension (or truncation) needs to be done to the *destination*
bitwidth. Also, the storing has the exact same analogous algorithm as
above, where in fact it is just the set\_polymorphed\_reg pseudocode
(completely unchanged) used above.

One issue remains: when the source element width is **greater** than
the width of the operation, it is obvious that a single LB for example
cannot possibly obtain 16-bit-wide data. This condition may be detected
where, when using integer divide, elsperblock (the width of the LOAD
divided by the bitwidth of the element) is zero.

The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)

The elements, if the element bitwidth is larger than the LD operation's
size, will then be sign/zero-extended to the full LD operation size, as
specified by the LOAD (LDU instead of LD, LBU instead of LB), before
being passed on to the second phase.

As LOAD/STORE may be twin-predicated, it is important to note that
the rules on twin predication still apply, except where in previous
pseudo-code (elwidth=default for both source and target) it was
the *registers* that the predication was applied to, it is now the
**elements** that the predication is applied to.

Thus the full pseudocode for all LD operations may be written out
as follows:

    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        bitwidth = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;

Note:

* when comparing against for example the twin-predicated c.mv
  pseudo-code, the pattern of independent incrementing of rd and rs
  is preserved unchanged.
* just as with the c.mv pseudocode, zeroing is not included and must be
  taken into account (TODO).
* that due to the use of a twin-predication algorithm, LOAD/STORE also
  take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
  VSCATTER characteristics.
* that due to the use of the same set\_polymorphed\_reg pseudocode,
  a destination that is not vectorised (marked as scalar) will
  result in the element being fully sign-extended or zero-extended
  out to the full register file bitwidth (XLEN). When the source
  is also marked as scalar, this is how the compatibility with
  standard RV LOAD/STORE is preserved by this algorithm.

### Example Tables showing LOAD elements <a name="load_example"></a>

This section contains examples of vectorised LOAD operations, showing
how the two stage process works (three if zero/sign-extension is included).

#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7

This is:

* a 64-bit load, with an offset of zero
* with a source-address elwidth of 16-bit
* into a destination-register with an elwidth of 32-bit
* where VL=7
* from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
* RV64, where XLEN=64 is assumed.

First, the memory table, which, due to the element width being 16 and the
operation being LD (64), the 64-bits loaded from memory are subdivided
into groups of **four** elements. And, with VL being 7 (deliberately
to illustrate that this is reasonable and possible), the first four are
sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:

[[!table data="""
addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
@x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
@x6 | elem 4 || elem 5 || elem 6 || not loaded ||
"""]]

Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.

[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0 | 0x0 | elem0 ||
0x0 | 0x0 | elem1 ||
0x0 | 0x0 | elem2 ||
0x0 | 0x0 | elem3 ||
0x0 | 0x0 | elem4 ||
0x0 | 0x0 | elem5 ||
0x0 | 0x0 | elem6 ||
"""]]

Lastly, the elements are stored in contiguous blocks, as if x8 was also
byte-addressable "memory". That "memory" happens to cover registers
x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:

[[!table data="""
reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
"""]]

Thus we have data that is loaded from the **addresses** pointed to by
x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.

Note that whilst the memory addressing table is shown in left-to-right byte
order, the registers are shown in right-to-left (MSB) order. This does **not**
imply that bit or byte-reversal is carried out: it's just easier to visualise
memory as being contiguous bytes, and it emphasises that registers are not
really actually "memory" as such.

## Why SV bitwidth specification is restricted to 4 entries

The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit

This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex, no RV128 implementation yet exists, and
RV64's default is 64 bit, so the 4 major element widths are covered anyway.

There is an absolutely crucial aspect of SV here that explicitly
needs spelling out, and it's whether the "vectorised" bit is set in
the Register's CSR entry.

1189 If "vectorised" is clear (not set), this indicates that the operation
1190 is "scalar". Under these circumstances, when set on a destination (RD),
1191 then sign-extension and zero-extension, whilst changed to match the
1192 override bitwidth (if set), will erase the **full** register entry
1193 (64-bit if RV64).
1194
1195 When vectorised is *set*, this indicates that the operation now treats
1196 **elements** as if they were independent registers, so regardless of
1197 the length, any parts of a given actual register that are not involved
1198 in the operation are **NOT** modified, but are **PRESERVED**.
1199
1200 For example:
1201
1202 * when the vector bit is clear and elwidth set to 16 on the destination
1203 register, operations are truncated to 16 bit and then sign or zero
1204 extended to the *FULL* XLEN register width.
1205 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1206 groups of elwidth sized elements do not fill an entire XLEN register),
1207 the "top" bits of the destination register do *NOT* get modified, zero'd
1208 or otherwise overwritten.
1209
1210 SIMD micro-architectures may implement this by using predication on
1211 any elements in a given actual register that are beyond the end of
1212 multi-element operation.
1213
Other microarchitectures may choose to provide byte-level write-enable
lines on the register file, such that each 64-bit register in an RV64
system requires 8 WE lines. Scalar RV64 operations would require
activation of all 8 lines, whereas SV elwidth-based operations would
activate only the required subset of those byte-level write lines.
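As a sketch, assuming one WE line per byte of a 64-bit register and
elements packed from bit 0 upwards, the set of write-enable lines to
activate could be derived as:

    def byte_we_mask(elwidth_bits, nelems):
        # one bit per byte of the 64-bit register: e.g. three 8-bit
        # elements -> 0b00000111; a full scalar RV64 op -> 0b11111111
        bytes_written = (elwidth_bits // 8) * nelems
        return (1 << bytes_written) - 1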
1219
1220 Example:
1221
1222 * rs1, rs2 and rd are all set to 8-bit
1223 * VL is set to 3
1224 * RV64 architecture is set (UXL=64)
1225 * add operation is carried out
* bits 0-23 of RD are modified: bits 23..16 become rs1[23..16] + rs2[23..16],
  concatenated with similar add operations on bits 15..8 and 7..0
* bits 24 through 63 **remain as they originally were** (see the sketch
  below).
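Worked through in Python (a sketch of the semantics only, not an
implementation; the function name is illustrative), the example behaves
as follows:

    def predicated_add8(rd_val, rs1_val, rs2_val, vl=3):
        result = rd_val                        # bits beyond VL*8 preserved
        for i in range(vl):
            a = (rs1_val >> (8 * i)) & 0xFF
            b = (rs2_val >> (8 * i)) & 0xFF
            s = (a + b) & 0xFF                 # 8-bit wraparound add
            result = (result & ~(0xFF << (8 * i))) | (s << (8 * i))
        return result

Bits 24-63 of `rd_val` pass straight through to the result, exactly as
the example requires.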
1229
1230 Example SIMD micro-architectural implementation:
1231
1232 * SIMD architecture works out the nearest round number of elements
1233 that would fit into a full RV64 register (in this case: 8)
1234 * SIMD architecture creates a hidden predicate, binary 0b00000111
1235 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1236 * SIMD architecture goes ahead with the add operation as if it
1237 was a full 8-wide batch of 8 adds
1238 * SIMD architecture passes top 5 elements through the adders
1239 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 unmodified 8-bit elements back and
  stores them in rd (the hidden predicate is sketched below).
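The hidden predicate in the sequence above may be derived as simply as
the following sketch (assuming 8-bit elements in a 64-bit SIMD register):

    LANES = 8                     # 64 bits / 8-bit elements
    VL = 3
    hidden_pred = (1 << VL) - 1   # 0b00000111: lanes 0-2 active
    # lanes 3-7 are zero-predicated: their old rd bytes pass through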
1242
This requires a read of rd; however, such a read is required anyway in
order to support non-zeroing mode.
1245
1246 ## Polymorphic floating-point
1247
1248 Standard scalar RV integer operations base the register width on XLEN,
1249 which may be changed (UXL in USTATUS, and the corresponding MXL and
1250 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1251 arithmetic operations are therefore restricted to an active XLEN bits,
1252 with sign or zero extension to pad out the upper bits when XLEN has
1253 been dynamically set to less than the actual register size.
1254
1255 For scalar floating-point, the active (used / changed) bits are
1256 specified exclusively by the operation: ADD.S specifies an active
1257 32-bits, with the upper bits of the source registers needing to
1258 be all 1s ("NaN-boxed"), and the destination upper bits being
1259 *set* to all 1s (including on LOAD/STOREs).
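A minimal sketch of this scalar NaN-boxing rule, assuming FLEN=128 and
a 64-bit operation:

    FLEN = 128

    def nanbox64(value64):
        # upper 64 bits forced to all 1s; lower 64 bits hold the result
        upper_ones = ((1 << (FLEN - 64)) - 1) << 64
        return upper_ones | (value64 & ((1 << 64) - 1))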
1260
Where elwidth is set to default (on any source or the destination)
it is obvious that this NaN-boxing behaviour can and should be
preserved. When elwidth is non-default, things are less obvious,
so they need to be thought through. Here is a normal (scalar) sequence,
assuming an RV64 which supports Quad (128-bit) FLEN:
1266
1267 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1268 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1269 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1270 top 64 MSBs ignored.
1271
1272 Therefore it makes sense to mirror this behaviour when, for example,
1273 elwidth is set to 32. Assume elwidth set to 32 on all source and
1274 destination registers:
1275
1276 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1277 floating-point numbers.
1278 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1279 in bits 0-31 and the second in bits 32-63.
1280 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1281
1282 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1283 of the registers either during the FLD **or** the ADD.D. The reason
1284 is that, effectively, the top 64 MSBs actually represent a completely
1285 independent 64-bit register, so overwriting it is not only gratuitous
1286 but may actually be harmful for a future extension to SV which may
1287 have a way to directly access those top 64 bits.
1288
The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
when "isvec" is false in a given register's CSR entry. Only when the
elwidth is set to default **and** isvec is false will the standard
RV behaviour be followed, namely that the upper bits be modified.
1294
1295 Ultimately if elwidth is default and isvec false on *all* source
1296 and destination registers, a SimpleV instruction defaults completely
1297 to standard RV scalar behaviour (this holds true for **all** operations,
1298 right across the board).
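Expressed as a sketch (illustrative Python, not normative; the function
name and parameters are assumptions), the destination-write decision for
floating-point registers becomes:

    def fp_write(fregs, rd, value, elwidth_default, isvec, flen=128, opw=64):
        if elwidth_default and not isvec:
            # standard RV scalar behaviour: NaN-box the upper bits
            upper = ((1 << (flen - opw)) - 1) << opw
            fregs[rd] = upper | (value & ((1 << opw) - 1))
        else:
            # SV polymorphic behaviour: upper parts left untouched
            mask = (1 << opw) - 1
            fregs[rd] = (fregs[rd] & ~mask) | (value & mask)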
1299
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
to a non-default value, are effectively all the same: they all still
perform multiple ADD operations, just at different widths. A future
extension to SimpleV may actually allow ADD.S to access the upper bits
of the register, effectively breaking down a 128-bit register into a
bank of 4 independently-accessible 32-bit registers.
1306
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar out-of-order
architecture there may be absolutely no difference; however, simpler
SIMD-style microarchitectures may not have the infrastructure in place
to know the difference, such that when VL=8 an ADD.D instruction
completes in 2 cycles (or more) rather than one, whereas an ADD.Q issued
instead would complete in one.
1317
1318 ## Specific instruction walk-throughs
1319
1320 This section covers walk-throughs of the above-outlined procedure
1321 for converting standard RISC-V scalar arithmetic operations to
1322 polymorphic widths, to ensure that it is correct.
1323
1324 ### add
1325
1326 Standard Scalar RV32/RV64 (xlen):
1327
1328 * RS1 @ xlen bits
1329 * RS2 @ xlen bits
1330 * add @ xlen bits
1331 * RD @ xlen bits
1332
1333 Polymorphic variant:
1334
1335 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1336 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1337 * add @ max(rs1, rs2) bits
1338 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1339
Note here that polymorphic add zero-extends its source operands,
whereas addw sign-extends.
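The rules above, expressed as a Python sketch (operand values are raw
bit patterns; the widths are the per-operand elwidths in bits; the
function name is illustrative):

    def poly_add(rs1_val, rs1_w, rs2_val, rs2_w, rd_w):
        opw = max(rs1_w, rs2_w)
        a = rs1_val & ((1 << rs1_w) - 1)      # zero-extension to opw
        b = rs2_val & ((1 << rs2_w) - 1)      # is implicit in the masking
        result = (a + b) & ((1 << opw) - 1)   # add @ opw bits
        # zero-extend to rd_w if wider, otherwise truncate
        return result & ((1 << rd_w) - 1)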
1342
1343 ### addw
1344
1345 The RV Specification specifically states that "W" variants of arithmetic
1346 operations always produce 32-bit signed values. In a polymorphic
1347 environment it is reasonable to assume that the signed aspect is
1348 preserved, where it is the length of the operands and the result
1349 that may be changed.
1350
1351 Standard Scalar RV64 (xlen):
1352
1353 * RS1 @ xlen bits
1354 * RS2 @ xlen bits
1355 * add @ xlen bits
1356 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1357
1358 Polymorphic variant:
1359
1360 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1361 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1362 * add @ max(rs1, rs2) bits
1363 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1364
Note here that polymorphic addw sign-extends its source operands,
whereas add zero-extends.
1367
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.

Effectively, however, both rs1 and rs2 are being sign-extended (or
truncated), whereas for add they are both zero-extended. This holds true
for all arithmetic operations ending with "W".
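A companion sketch for the sign-extending "W" rules (again, illustrative
only):

    def sext(value, width, to):
        # sign-extend a raw width-bit pattern out to 'to' bits
        value &= (1 << width) - 1
        if value >> (width - 1):
            value |= ((1 << to) - 1) ^ ((1 << width) - 1)
        return value

    def poly_addw(rs1_val, rs1_w, rs2_val, rs2_w, rd_w):
        opw = max(rs1_w, rs2_w)
        a = sext(rs1_val, rs1_w, opw)
        b = sext(rs2_val, rs2_w, opw)
        result = (a + b) & ((1 << opw) - 1)
        # sign-extend to rd_w if wider, otherwise truncate
        if rd_w > opw:
            return sext(result, opw, rd_w)
        return result & ((1 << rd_w) - 1)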
1376
1377 ### addiw
1378
1379 Standard Scalar RV64I:
1380
1381 * RS1 @ xlen bits, truncated to 32-bit
1382 * immed @ 12 bits, sign-extended to 32-bit
1383 * add @ 32 bits
* RD @ xlen bits: sign-extend the 32-bit result to xlen.
1385
1386 Polymorphic variant:
1387
1388 * RS1 @ rs1 bits
1389 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1390 * add @ max(rs1, 12) bits
1391 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
1392
1393 # Predication Element Zeroing
1394
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with
register renaming, to be able to save power by avoiding a register read
on elements that are passed en-masse through the ALU. Simpler
microarchitectures do not have this issue: they simply do not pass the
element through to the ALU at all, and therefore do not store it back in
the destination. More complex non-lane-based micro-architectures can,
when zeroing is not set, use the predication bits to simply avoid sending
element-based operations to the ALUs entirely: thus, over the long term,
potentially keeping all ALUs 100% occupied even when elements are
predicated out.
1405
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for both zeroing *and* non-zeroing, the
decision was taken to add support for both.
1412
1413 ## Single-predication (based on destination register)
1414
Zeroing on predication for arithmetic operations is taken from
the destination register's predicate, i.e. the predication *and*
zeroing settings applied to the whole operation come from the
CSR Predication table entry for the destination register.
Thus, when zeroing is set and the predication bit for a destination
element is clear, that destination element is *set* to zero
(twin-predication is slightly different, and is covered next).
1423
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:

    for (i = 0; i < VL; i++)
       if not zeroing: # an optimisation
          # skip elements that have been predicated (masked) out
          while (!(predval & 1<<i) && i < VL)
             if (int_vec[rd ].isvector)  { ird += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
          if i == VL:
             return
       if (predval & 1<<i)
          src1 = ....
          src2 = ...
          result = src1 + src2 # actual add (or other op) here
          set_polymorphed_reg(rd, destwid, ird, result)
          if int_vec[rd].ffirst and result == 0:
             VL = i # result was zero, end loop early, return VL
             return
          if (!int_vec[rd].isvector) return
       else if zeroing:
          result = 0
          set_polymorphed_reg(rd, destwid, ird, result)
       if (int_vec[rd ].isvector)  { ird += 1; }
       else if (predval & 1<<i) return
       if (int_vec[rs1].isvector)  { irs1 += 1; }
       if (int_vec[rs2].isvector)  { irs2 += 1; }
       if (ird == VL or irs1 == VL or irs2 == VL): return
1453
1454 The optimisation to skip elements entirely is only possible for certain
1455 micro-architectures when zeroing is not set. However for lane-based
1456 micro-architectures this optimisation may not be practical, as it
1457 implies that elements end up in different "lanes". Under these
1458 circumstances it is perfectly fine to simply have the lanes
1459 "inactive" for predicated elements, even though it results in
1460 less than 100% ALU utilisation.
1461
1462 ## Twin-predication (based on source and destination register) <a name="tpred"></a>
1463
Twin-predication is not that much different, except that the
source is zero-predicated independently of the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1468
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being a
LOAD, where the source data element is to be treated as an address:
the data returned *from* the LOAD is zero, rather than an *address*
of zero being looked up).
1475
1476 When zeroing is set on the destination and not the source, then just
1477 as with single-predicated operations, a zero is stored into the destination
1478 element (or target memory address for a STORE).
1479
Zeroing on both source and destination effectively combines the two
predicates: wherever either the source predicate OR the destination
predicate bit is 0, a zero element will ultimately end up in the
destination register (real data passes through only where both
predicate bits are set).
1484
However, this may not necessarily be the case for all operations;
implementors, particularly of custom instructions, clearly need to
think through the implications in each and every case.
1488
1489 Here is pseudo-code for a twin zero-predicated operation:
1490
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i)) # source pred selects real data or zero
                sourcedata = ireg[rs+i];
             else
                sourcedata = 0
             ireg[rd+j] <= sourcedata
          else if (zerodst)
             ireg[rd+j] <= 0
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break;
1514
1515 Note that in the instance where the destination is a scalar, the hardware
1516 loop is ended the moment a value *or a zero* is placed into the destination
1517 register/element. Also note that, for clarity, variable element widths
1518 have been left out of the above.
1519
1520 # Subsets of RV functionality
1521
1522 This section describes the differences when SV is implemented on top of
1523 different subsets of RV.
1524
1525 ## Common options
1526
1527 It is permitted to only implement SVprefix and not the VBLOCK instruction
1528 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1529 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1530 traps may emulate the format.
1531
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1536
It is permitted to limit the size of either (or both) of the register
files down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.
1541
1542 ## RV32 / RV32F
1543
When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
maximum limit for predication is also restricted to 32 bits. Whilst not
specifically an "option", this is worth noting.
1547
1548 ## RV32G
1549
Normally, in standard RV32, it does not make much sense to have
RV32G: the critical instructions that are missing in standard RV32
are those for moving data to and from the double-width floating-point
registers into the integer ones, as well as the FCVT routines.
1554
1555 In an earlier draft of SV, it was possible to specify an elwidth
1556 of double the standard register size: this had to be dropped,
1557 and may be reintroduced in future revisions.
1558
1559 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1560
1561 When floating-point is not implemented, the size of the User Register and
1562 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1563 per table).
1564
1565 ## RV32E
1566
1567 In embedded scenarios the User Register and Predication CSRs may be
1568 dropped entirely, or optionally limited to 1 CSR, such that the combined
1569 number of entries from the M-Mode CSR Register table plus U-Mode
1570 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1571 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1572 the Predication CSR tables.
1573
1574 RV32E is the most likely candidate for simply detecting that registers
1575 are marked as "vectorised", and generating an appropriate exception
1576 for the VL loop to be implemented in software.
1577
1578 ## RV128
1579
RV128 has not been especially considered here; however, it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1584
1585 # Example usage
1586
1587 TODO evaluate strncpy and strlen
1588 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1589
## strncpy <a name="strncpy"></a>
1591
1592 RVV version:
1593
    strncpy:
        c.mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8      # Vectors of bytes.
        vlbff.v v1, (a1)          # Get src bytes
        vseq.vi v0, v1, 0         # Flag zero bytes
        vmfirst a4, v0            # Zero found?
        vmsif.v v0, v0            # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t      # Write out bytes
        c.bgez a4, exit           # Done
        csrr t1, vl               # Get number of bytes fetched
        c.add a1, a1, t1          # Bump src pointer
        c.sub a2, a2, t1          # Decrement count.
        c.add a3, a3, t1          # Bump dst pointer
        c.bnez a2, loop           # Anymore?
    exit:
        c.ret
1612
1613 SV version (WIP):
1614
    strncpy:
        c.mv a3, a0
        VBLK.RegCSR[t0] = 8bit, t0, vector
        VBLK.PredTb[t0] = ffirst, x0, inv
    loop:
        VBLK.SETVLI a2, t4, 8     # t4 and VL now 1..8 (MVL=8)
        c.ldb t0, (a1)            # t0 fail first mode
        c.bne t0, x0, allnonzero  # still ff
        # VL (t4) points to last nonzero
        c.addi t4, t4, 1          # include zero
        c.stb t0, (a3)            # store incl zero
        c.ret                     # end subroutine
    allnonzero:
        c.stb t0, (a3)            # VL legal range
        c.add a1, a1, t4          # Bump src pointer
        c.sub a2, a2, t4          # Decrement count.
        c.add a3, a3, t4          # Bump dst pointer
        c.bnez a2, loop           # Anymore?
    exit:
        c.ret
1635
1636 Notes:
1637
1638 * Setting MVL to 8 is just an example. If enough registers are spare it
1639 may be set to XLEN which will require a bank of 8 scalar registers for
1640 a1, a3 and t0.
* obviously if that is done, t0 is not separated by 8 full registers, and
  would overwrite t1 through t7. x80, as an example, would work well instead.
* with the exception of the GETVL (a pseudo-code alias for csrr), every
  single instruction above may use RVC.
1645 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1646 registers through redirection
1647 * RVC C.LW and C.SW may be used because the W format may be overridden by
1648 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1649 * with the exception of the GETVL, all Vector Context may be done in
1650 VBLOCK form.
1651 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1652 just ffirst on t0
1653 * ldb and bne are both using t0, both in ffirst mode
1654 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1655 vectorised, no (un)sign-extension or truncation" mode.
* ldb will end on illegal mem, reduce VL, but will have copied all sorts
  of stuff into t0 (which could contain zeros).
1658 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1659 scalar x0
1660 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1661 compares, and reduce VL as well
1662 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1665 * SETVL sets *exactly* the requested amount into VL.
* the SETVL just after the allnonzero label is needed in case the ldb
  ffirst activates but the bne to allnonzero does not branch.
1668 * this would cause the stb to copy up to the end of the legal memory
1669 * of course, on the next loop the ldb would throw a trap, as a1 now
1670 points to the first illegal mem location.
1671
1672 ## strcpy
1673
1674 RVV version:
1675
        mv a3, a0                 # Save start
    loop:
        setvli a1, x0, vint8      # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)         # Get bytes
        csrr a1, vl               # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0         # Set v0[i] where v1[i] = 0
        add a3, a3, a1            # Bump pointer
        vmfirst a2, v0            # Find first set bit in mask, returns -1 if none
        bltz a2, loop             # Not found?
        add a0, a0, a1            # Sum start + bump
        add a3, a3, a2            # Add index of zero byte
        sub a0, a3, a0            # Subtract start address+bump
        ret
1689
1690 ## DAXPY <a name="daxpy"></a>
1691
1692 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1693
1694 Notes:
1695
1696 * Setting MVL to 4 is just an example. With enough space between the
1697 FP regs, MVL may be set to larger values
1698 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1699 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1700 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1701 * All instructions except fmadd may use Compressed variants. Total
1702 number of 16-bit instruction words: 11.
1703 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1704
1705 ## BigInt add <a name="bigadd"></a>
1706
1707 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]