1 [[!tag standards]]
2
3 # Simple-V (Parallelism Extension Proposal) Appendix
4
5 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
6 * Status: DRAFTv0.6
7 * Last edited: 30 jun 2019
8 * main spec [[specification]]
9
10 [[!toc ]]
11
12 # Fail-on-first modes <a name="ffirst"></a>
13
Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
that is zero"; with traps, however, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (when VL=1).
19
Note that implementors are required to choose one mode or the other,
mutually exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies
to custom opcode writers as well as to future extension writers.
24
25 ## Fail-on-first traps
26
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and the
elements following it are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence element
that did not take the trap.
33
34 Note that predicated-out elements (where the predicate mask bit is
35 zero) are clearly excluded (i.e. the trap will not occur). However,
36 note that the loop still had to test the predicate bit: thus on return,
37 VL is set to include elements that did not take the trap *and* includes
38 the elements that were predicated (masked) out (not tested up to the
39 point where the trap occurred).
40
41 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
42 unaltered by setting zero or non-zero predication mode.
43
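As an illustration only (not part of the specification), a minimal C sketch of the trap rule above, for SUBVL=1 and a single predicate; `would_trap` and `take_trap` are hypothetical stand-ins for the element operation's trap condition and the actual trap mechanism:

    #include <stdbool.h>
    #include <stdint.h>

    /* hypothetical stand-ins, for illustration only */
    typedef bool (*trap_check_t)(int element_index);
    extern void take_trap(int element_index);

    /* returns the value that VL would be set to */
    static int ffirst_trap_vl(int VL, uint64_t pred, trap_check_t would_trap)
    {
        for (int i = 0; i < VL; i++) {
            if (!((pred >> i) & 1))
                continue;              /* masked-out: no trap, still counted */
            if (would_trap(i)) {
                if (i == 0)
                    take_trap(i);      /* first element: trap exactly as scalar */
                else
                    return i;          /* later element: suppress trap, truncate VL */
            }
        }
        return VL;                     /* no suppressed trap: VL unchanged */
    }
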
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst is not set); in subsequent
*sub-groups*, the trap must not actually be taken. SUBVL will **NOT**
be modified. Traps must analyse (x)eSTATE (subvl offset indices) to
determine the element that caused the trap.
49
50 Given that predication bits apply to SUBVL groups, the same rules apply
51 to predicated-out (masked-out) sub-groups in calculating the value that
52 VL is set to.
53
54 ## Fail-on-first conditional tests
55
56 ffirst stops sequential (or sequentially-appearing in the case of
57 out-of-order designs) element conditional testing on the first element
58 result being zero (or other "fail" condition). VL is set to the number
59 of elements that were (sequentially) processed before the fail-condition
60 was encountered.
61
62 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
63 responds to changes in the zero or non-zero predication mode. Whilst
64 in non-zeroing mode, masked-out elements are simply not tested (and
65 thus considered "never to fail"), in zeroing mode, masked-out elements
66 may be viewed as *always* (unconditionally) failing. This effectively
67 turns VL into something akin to a software-controlled loop.
68
Note that just as with traps, if SUBVL!=1, the first fail-condition in a
*sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
traps must analyse (x)eSTATE (subvl offset indices) to determine the
element that caused the fail-condition.
77
78 Note again that, just as with traps, predicated-out (masked-out) elements
79 are included in the (sequential) count leading up to the fail-condition,
80 even though they were not tested.
81
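A minimal C sketch (purely illustrative) of how VL might be derived under fail-on-first conditional testing, covering both the zeroing and non-zeroing cases; `test` is a hypothetical stand-in for the per-element condition:

    #include <stdbool.h>
    #include <stdint.h>

    typedef bool (*elem_test_t)(int element_index);   /* hypothetical */

    /* returns the value that VL would be set to */
    static int ffirst_conditional_vl(int VL, uint64_t pred, bool zeroing,
                                     elem_test_t test)
    {
        for (int i = 0; i < VL; i++) {
            bool masked_out = !((pred >> i) & 1);
            if (masked_out) {
                if (zeroing)
                    return i;   /* zeroing: masked-out counts as an immediate fail */
                continue;       /* non-zeroing: never fails, but still counted */
            }
            if (!test(i))
                return i;       /* first failing element truncates VL here */
        }
        return VL;              /* no fail-condition encountered: VL unchanged */
    }
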
82 # Instructions <a name="instructions" />
83
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *All* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all the RVV operations,
and with the exception of CLIP and VSELECT.X,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
94
95 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
96 equivalents, so are left out of Simple-V. VSELECT could be included if
97 there existed a MV.X instruction in RV (MV.X is a hypothetical
98 non-immediate variant of MV that would allow another register to
99 specify which register was to be copied). Note that if any of these three
100 instructions are added to any given RV extension, their functionality
101 will be inherently parallelised.
102
103 With some exceptions, where it does not make sense or is simply too
104 challenging, all RV-Base instructions are parallelised:
105
106 * CSR instructions, whilst a case could be made for fast-polling of
107 a CSR into multiple registers, or for being able to copy multiple
108 contiguously addressed CSRs into contiguous registers, and so on,
109 are the fundamental core basis of SV. If parallelised, extreme
110 care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
112 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
113 left as scalar.
* LR/SC could hypothetically be parallelised, however their purpose is
to perform single (complex) atomic memory operations where the LR must be followed
116 up by a matching SC. A sequence of parallel LR instructions followed
117 by a sequence of parallel SC instructions therefore is guaranteed to
118 not be useful. Not least: the guarantees of a Multi-LR/SC
119 would be impossible to provide if emulated in a trap.
120 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
121 paralleliseable anyway.
122
123 All other operations using registers are automatically parallelised.
124 This includes AMOMAX, AMOSWAP and so on, where particular care and
125 attention must be paid.
126
127 Example pseudo-code for an integer ADD operation (including scalar
128 operations). Floating-point uses the FP Register Table.
129
130 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
131
132 Note that for simplicity there is quite a lot missing from the above
133 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
134 reshaping and offsets and so on. However it demonstrates the basic
135 principle. Augmentations that produce the full pseudo-code are covered in
136 other sections.
137
138 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
139
140 Adding in support for SUBVL is a matter of adding in an extra inner
141 for-loop, where register src and dest are still incremented inside the
142 inner part. Note that the predication is still taken from the VL index.
143
So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".
146
    function op_add(rd, rs1, rs2) # add not VADD!
        int i, id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
        rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
        rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
        for (i = 0; i < VL; i++)
            xSTATE.srcoffs = i # save context
            for (s = 0; s < SUBVL; s++)
                xSTATE.ssvoffs = s # save context
                if (predval & 1<<i) # predication uses intregs
                    # actual add is here (at last)
                    ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                    if (!int_vec[rd ].isvector) break;
                    if (int_vec[rd ].isvector)  { id += 1; }
                    if (int_vec[rs1].isvector)  { irs1 += 1; }
                    if (int_vec[rs2].isvector)  { irs2 += 1; }
                    if (id == VL or irs1 == VL or irs2 == VL) {
                        # end VL hardware loop
                        xSTATE.srcoffs = 0; # reset
                        xSTATE.ssvoffs = 0; # reset
                        return;
                    }
170
171
172 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
173 elwidth handling etc. all left out.
174
175 ## Instruction Format
176
177 It is critical to appreciate that there are
178 **no operations added to SV, at all**.
179
180 Instead, by using CSRs to tag registers as an indication of "changed
181 behaviour", SV *overloads* pre-existing branch operations into predicated
182 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
183 LOAD/STORE depending on CSR configurations for bitwidth and predication.
184 **Everything** becomes parallelised. *This includes Compressed
185 instructions* as well as any future instructions and Custom Extensions.
186
Note: using CSR tags to change the behaviour of instructions is nothing new, including
188 in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
189 FRM changes the behaviour of the floating-point unit, to alter the rounding
190 mode. Other architectures change the LOAD/STORE byte-order from big-endian
191 to little-endian on a per-instruction basis. SV is just a little more...
192 comprehensive in its effect on instructions.
193
194 ## Branch Instructions
195
196 Branch operations are augmented slightly to be a little more like FP
197 Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
198 of multiple comparisons into a register (taken indirectly from the predicate
199 table) and enhancing them to branch "consensually" depending on *multiple*
200 tests. "ffirst" - fail-on-first - condition mode can also be enabled,
201 to terminate the comparisons early.
202 See ffirst mode in the Predication Table section.
203
204 There are two registers for the comparison operation, therefore there
205 is the opportunity to associate two predicate registers (note: not in
206 the same way as twin-predication). The first is a "normal" predicate
207 register, which acts just as it does on any other single-predicated
208 operation: masks out elements where a bit is zero, applies an inversion
209 to the predicate mask, and enables zeroing / non-zeroing mode.
210
211 The second (not to be confused with a twin-predication 2nd register)
212 is utilised to indicate where the results of each comparison are to
213 be stored, as a bitmask. Additionally, the behaviour of the branch -
214 when it occurs - may also be modified depending on whether the 2nd predicate's
215 "invert" and "zeroing" bits are set. These four combinations result
216 in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
217 cbranch.ifall (AND), cbranch.ifnotall (NAND).
218
219 | invert | zeroing | description | operation | cbranch |
220 | ------ | ------- | --------------------------- | --------- | ------- |
221 | 0 | 0 | branch if all pass | AND | ifall |
| 1 | 0 | branch if one fails | NAND | ifnotall |
223 | 0 | 1 | branch if one passes | OR | ifany |
224 | 1 | 1 | branch if all fail | NOR | ifnone |
225
226 This inversion capability covers AND, OR, NAND and NOR branching
227 based on multiple element comparisons. Without the full set of four,
it is necessary to have a two-instruction branch sequence: one conditional,
one unconditional.
230
Note that, unlike in normal computer programming where chains of AND or
OR conditional tests short-circuit, the chain here does *not* terminate
early except if fail-on-first is set, and even then ffirst ends on the first
data-dependent zero. When ffirst mode is not set, *all* conditional
235 element tests must be performed (and the result optionally stored in
236 the result mask), with a "post-analysis" phase carried out which checks
237 whether to branch.
238
239 Note also that whilst it may seem excessive to have all four (because
240 conditional comparisons may be inverted by swapping src1 and src2),
241 data-dependent fail-on-first is *not* invertible and *only* terminates
242 on first zero-condition encountered. Additionally it may be inconvenient
243 to have to swap the predicate registers associated with src1 and src2,
244 because this involves a new VBLOCK Context.
245
246 ### Standard Branch <a name="standard_branch"></a>
247
248 Branch operations use standard RV opcodes that are reinterpreted to
249 be "predicate variants" in the instance where either of the two src
250 registers are marked as vectors (active=1, vector=1).
251
252 Note that the predication register to use (if one is enabled) is taken from
253 the *first* src register, and that this is used, just as with predicated
254 arithmetic operations, to mask whether the comparison operations take
255 place or not. The target (destination) predication register
256 to use (if one is enabled) is taken from the *second* src register.
257
258 If either of src1 or src2 are scalars (whether by there being no
259 CSR register entry or whether by the CSR entry specifically marking
260 the register as "scalar") the comparison goes ahead as vector-scalar
261 or scalar-vector.
262
263 In instances where no vectorisation is detected on either src registers
264 the operation is treated as an absolutely standard scalar branch operation.
265 Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
267 those tests that are predicated out).
268
269 Note that when zero-predication is enabled (from source rs1),
270 a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
273 when zeroing is not set: bits in the destination predicate are
274 only *set*; they are **not** cleared. This is important to appreciate,
275 as there may be an expectation that, going into the hardware-loop,
276 the destination predicate is always expected to be set to zero:
277 this is **not** the case. The destination predicate is only set
278 to zero if **zeroing** is enabled.
279
280 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
282 src1 and src2, however note that in doing so, the predicate table
283 setup must also be correspondingly adjusted.
284
285 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
286 for predicated compare operations of function "cmp":
287
    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);
292
293 With associated predication, vector-length adjustments and so on,
294 and temporarily ignoring bitwidth (which makes the comparisons more
295 complex), this becomes:
296
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion, pred_zeroing = get_pred_flags(rs2)
    else
        pred_inversion, pred_zeroing = False, False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);
                if ffirst_mode:
                    break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if pred_zeroing:
            # NOR
            if result == 0:
                goto branch
        else:
            # NAND
            if (result & ps) != result:
                goto branch
    else:
        if pred_zeroing:
            # OR
            if result != 0:
                goto branch
        else:
            # AND
            if (result & ps) == result:
                goto branch
356
357 Notes:
358
359 * Predicated SIMD comparisons would break src1 and src2 further down
360 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
361 Reordering") setting Vector-Length times (number of SIMD elements) bits
362 in Predicate Register rd, as opposed to just Vector-Length bits.
363 * The execution of "parallelised" instructions **must** be implemented
364 as "re-entrant" (to use a term from software). If an exception (trap)
365 occurs during the middle of a vectorised
366 Branch (now a SV predicated compare) operation, the partial results
367 of any comparisons must be written out to the destination
368 register before the trap is permitted to begin. If however there
369 is no predicate, the **entire** set of comparisons must be **restarted**,
370 with the offset loop indices set back to zero. This is because
371 there is no place to store the temporary result during the handling
372 of traps.
373
374 TODO: predication now taken from src2. also branch goes ahead
375 if all compares are successful.
376
377 Note also that where normally, predication requires that there must
378 also be a CSR register entry for the register being used in order
379 for the **predication** CSR register entry to also be active,
380 for branches this is **not** the case. src2 does **not** have
381 to have its CSR register entry marked as active in order for
382 predication on src2 to be active.
383
384 Also note: SV Branch operations are **not** twin-predicated
385 (see Twin Predication section). This would require three
386 element offsets: one to track src1, one to track src2 and a third
387 to track where to store the accumulation of the results. Given
388 that the element offsets need to be exposed via CSRs so that
389 the parallel hardware looping may be made re-entrant on traps
390 and exceptions, the decision was made not to make SV Branches
391 twin-predicated.
392
393 ### Floating-point Comparisons
394
There are no floating-point branch operations, only compares.
396 Interestingly no change is needed to the instruction format because
397 FP Compare already stores a 1 or a zero in its "rd" integer register
398 target, i.e. it's not actually a Branch at all: it's a compare.
399
400 In RV (scalar) Base, a branch on a floating-point compare is
401 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
402 This does extend to SV, as long as x1 (in the example sequence given)
403 is vectorised. When that is the case, x1..x(1+VL-1) will also be
404 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
405 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
406 so on. Consequently, unlike integer-branch, FP Compare needs no
407 modification in its behaviour.
408
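To make the sequence concrete, here is an illustrative C sketch (assuming a simple in-order model, with hypothetical array names) of the vectorised FEQ/BEQ pair described above: the FEQ writes one 0/1 result per element, and the BEQ-against-x0 then branches only if *every* element comparison succeeds:

    #include <stdbool.h>
    #include <stdint.h>

    /* hypothetical model: f0[], f5[] are the vectorised FP sources,
       x1[] is the vectorised integer destination of FEQ */
    static bool feq_then_beq(const double *f0, const double *f5,
                             uint64_t *x1, int VL)
    {
        for (int i = 0; i < VL; i++)        /* vectorised FEQ x1, f0, f5 */
            x1[i] = (f0[i] == f5[i]) ? 1 : 0;
        for (int i = 0; i < VL; i++)        /* vectorised BEQ x1, x0, #jumploc */
            if (x1[i] != 0)
                return false;               /* one element test failed: no branch */
        return true;                        /* all element tests succeeded: branch */
    }
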
409 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
410 missing, and whilst in ordinary branch code this is fine because the
411 standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero), in predication
terms it has more of an impact. To deal with this, SV's predication
414 has had "invert" added to it.
415
416 Also: note that FP Compare may be predicated, using the destination
417 integer register (rd) to determine the predicate. FP Compare is **not**
418 a twin-predication operation, as, again, just as with SV Branches,
419 there are three registers involved: FP src1, FP src2 and INT rd.
420
421 Also: note that ffirst (fail first mode) applies directly to this operation.
422
423 ### Compressed Branch Instruction
424
425 Compressed Branch instructions are, just like standard Branch instructions,
426 reinterpreted to be vectorised and predicated based on the source register
427 (rs1s) CSR entries. As however there is only the one source register,
428 given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
430 table entries for **x0**.
431
The specific required use of x0 is, with a little thought, quite obvious,
though initially counterintuitive. Clearly it is **not** recommended to redirect
434 x0 with a CSR register entry, however as a means to opaquely obtain
435 a predication target it is the only sensible option that does not involve
436 additional special CSRs (or, worse, additional special opcodes).
437
438 Note also that, just as with standard branches, the 2nd source
439 (in this case x0 rather than src2) does **not** have to have its CSR
440 register table marked as "active" in order for predication to work.
441
442 ## Vectorised Dual-operand instructions
443
444 There is a series of 2-operand instructions involving copying (and
445 sometimes alteration):
446
447 * C.MV
448 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
449 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
450 * LOAD(-FP) and STORE(-FP)
451
452 All of these operations follow the same two-operand pattern, so it is
453 *both* the source *and* destination predication masks that are taken into
454 account. This is different from
455 the three-operand arithmetic instructions, where the predication mask
456 is taken from the *destination* register, and applied uniformly to the
457 elements of the source register(s), element-for-element.
458
459 The pseudo-code pattern for twin-predicated operations is as
460 follows:
461
    function op(rd, rs):
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
475
476 This pattern covers scalar-scalar, scalar-vector, vector-scalar
477 and vector-vector, and predicated variants of all of those.
478 Zeroing is not presently included (TODO). As such, when compared
479 to RVV, the twin-predicated variants of C.MV and FMV cover
480 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
481 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
482
483 Note that:
484
485 * elwidth (SIMD) is not covered in the pseudo-code above
486 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
487 not covered
488 * zero predication is also not shown (TODO).
489
490 ### C.MV Instruction <a name="c_mv"></a>
491
492 There is no MV instruction in RV however there is a C.MV instruction.
493 It is used for copying integer-to-integer registers (vectorised FMV
494 is used for copying floating-point).
495
496 If either the source or the destination register are marked as vectors
497 C.MV is reinterpreted to be a vectorised (multi-register) predicated
498 move operation. The actual instruction's format does not change:
499
500 [[!table data="""
501 15 12 | 11 7 | 6 2 | 1 0 |
502 funct4 | rd | rs | op |
503 4 | 5 | 5 | 2 |
504 C.MV | dest | src | C0 |
505 """]]
506
507 A simplified version of the pseudocode for this operation is as follows:
508
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            ireg[rd+j] <= ireg[rs+i];
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
522
523 There are several different instructions from RVV that are covered by
524 this one opcode:
525
526 [[!table data="""
527 src | dest | predication | op |
528 scalar | vector | none | VSPLAT |
529 scalar | vector | destination | sparse VSPLAT |
530 scalar | vector | 1-bit dest | VINSERT |
531 vector | scalar | 1-bit? src | VEXTRACT |
532 vector | vector | none | VCOPY |
533 vector | vector | src | Vector Gather |
534 vector | vector | dest | Vector Scatter |
535 vector | vector | src & dest | Gather/Scatter |
536 vector | vector | src == dest | sparse VCOPY |
537 """]]
538
539 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
540 operations with zeroing off, and inversion on the src and dest predication
541 for one of the two C.MV operations. The non-inverted C.MV will place
542 one set of registers into the destination, and the inverted one the other
543 set. With predicate-inversion, copying and inversion of the predicate mask
544 need not be done as a separate (scalar) instruction.
545
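A minimal C sketch of the VMERGE idea above (illustrative only, ignoring CSR tagging and element widths): the first predicated move copies elements where the predicate bit is set; the second, with the predicate inverted and zeroing off, copies the remainder, each move preserving the elements it does not touch:

    #include <stdint.h>

    static void vmerge_via_two_moves(uint32_t *dest, const uint32_t *a,
                                     const uint32_t *b, uint64_t pred, int VL)
    {
        for (int i = 0; i < VL; i++)        /* first C.MV: predicate as-is */
            if ((pred >> i) & 1)
                dest[i] = a[i];
        for (int i = 0; i < VL; i++)        /* second C.MV: predicate inverted */
            if (!((pred >> i) & 1))
                dest[i] = b[i];
    }
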
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
551
552 ### FMV, FNEG and FABS Instructions
553
554 These are identical in form to C.MV, except covering floating-point
555 register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
558 operation of the appropriate size covering the source and destination
559 register bitwidths.
560
561 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
562
### FCVT Instructions
564
565 These are again identical in form to C.MV, except that they cover
566 floating-point to integer and integer to floating-point. When element
567 width in each vector is set to default, the instructions behave exactly
568 as they are defined for standard RV (scalar) operations, except vectorised
569 in exactly the same fashion as outlined in C.MV.
570
571 However when the source or destination element width is not set to default,
572 the opcode's explicit element widths are *over-ridden* to new definitions,
573 and the opcode's element width is taken as indicative of the SIMD width
574 (if applicable i.e. if packed SIMD is requested) instead.
575
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision floating-point number in rd.
578 If however the source rs1 is set to be a vector, where elwidth is set to
579 default/2 and "packed SIMD" is enabled, then the first 32 bits of
580 rs1 are converted to a floating-point number to be stored in rd's
581 first element and the higher 32-bits *also* converted to floating-point
582 and stored in the second. The 32 bit size comes from the fact that
583 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
584 divide that by two it means that rs1 element width is to be taken as 32.
585
586 Similar rules apply to the destination register.
587
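As a purely illustrative C sketch of the packed-SIMD FCVT.S.L example above (one 64-bit source register supplying two 32-bit integer elements, each converted to single-precision and stored as consecutive destination elements):

    #include <stdint.h>

    static void fcvt_s_l_packed(uint64_t rs1, float rd[2])
    {
        int32_t lo = (int32_t)(rs1 & 0xffffffffu);  /* first 32-bit element */
        int32_t hi = (int32_t)(rs1 >> 32);          /* second 32-bit element */
        rd[0] = (float)lo;   /* converted, stored as rd's first element */
        rd[1] = (float)hi;   /* converted, stored as rd's second element */
    }
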
588 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
589
590 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
591 the interpretation of the instruction fields). This
592 actually undermined the fundamental principle of SV, namely that there
593 be no modifications to the scalar behaviour (except where absolutely
594 necessary), in order to simplify an implementor's task if considering
595 converting a pre-existing scalar design to support parallelism.
596
597 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
does not change in SV; however, just as with C.MV it is important to note
599 that dual-predication is possible.
600
601 In vectorised architectures there are usually at least two different modes
602 for LOAD/STORE:
603
604 * Read (or write for STORE) from sequential locations, where one
605 register specifies the address, and the one address is incremented
606 by a fixed amount. This is usually known as "Unit Stride" mode.
607 * Read (or write) from multiple indirected addresses, where the
608 vector elements each specify separate and distinct addresses.
609
610 To support these different addressing modes, the CSR Register "isvector"
611 bit is used. So, for a LOAD, when the src register is set to
612 scalar, the LOADs are sequentially incremented by the src register
613 element width, and when the src register is set to "vector", the
614 elements are treated as indirection addresses. Simplified
615 pseudo-code would look like this:
616
    function op_ld(rd, rs) # LD not VLD!
        rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            if (int_csr[rs].isvec)
                # indirect mode (multi mode)
                srcbase = ireg[rsv+i];
            else
                # unit stride mode
                srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
            ireg[rdv+j] <= mem[srcbase + imm_offs];
            if (!int_csr[rs].isvec &&
                !int_csr[rd].isvec) break # scalar-scalar LD
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++;
636
637 Notes:
638
639 * For simplicity, zeroing and elwidth is not included in the above:
640 the key focus here is the decision-making for srcbase; vectorised
641 rs means use sequentially-numbered registers as the indirection
642 address, and scalar rs is "offset" mode.
643 * The test towards the end for whether both source and destination are
644 scalar is what makes the above pseudo-code provide the "standard" RV
645 Base behaviour for LD operations.
646 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
648 (8 bytes), and also whether the element width is over-ridden
649 (see special element width section).
650
651 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
652
653 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
654 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
655 It is therefore possible to use predicated C.LWSP to efficiently
656 pop registers off the stack (by predicating x2 as the source), cherry-picking
657 which registers to store to (by predicating the destination). Likewise
658 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
659
660 The two modes ("unit stride" and multi-indirection) are still supported,
661 as with standard LD/ST. Essentially, the only difference is that the
662 use of x2 is hard-coded into the instruction.
663
664 **Note**: it is still possible to redirect x2 to an alternative target
665 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
666 general-purpose LOAD/STORE operations.
667
668 ## Compressed LOAD / STORE Instructions
669
670 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
672 non-compressed LOAD/STORE. Again: setting scalar or vector mode
673 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
674 to "Multi-indirection", respectively.
675
676 # Element bitwidth polymorphism <a name="elwidth"></a>
677
678 Element bitwidth is best covered as its own special section, as it
679 is quite involved and applies uniformly across-the-board. SV restricts
680 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
681
682 The effect of setting an element bitwidth is to re-cast each entry
683 in the register table, and for all memory operations involving
684 load/stores of certain specific sizes, to a completely different width.
Thus, in c-style terms, on an RV64 architecture, effectively each register
686 now looks like this:
687
    typedef union {
        uint8_t b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
697
698 where the CSR Register table entry (not the instruction alone) determines
699 which of those union entries is to be used on each operation, and the
700 VL element offset in the hardware-loop specifies the index into each array.
701
702 However a naive interpretation of the data structure above masks the
703 fact that setting VL greater than 8, for example, when the bitwidth is 8,
704 accessing one specific register "spills over" to the following parts of
705 the register file in a sequential fashion. So a much more accurate way
706 to reflect this would be:
707
    typedef union {
        uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t b[0]; // array of type uint8_t
        uint16_t s[0];
        uint32_t i[0];
        uint64_t l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
718
719 where when accessing any individual regfile[n].b entry it is permitted
720 (in c) to arbitrarily over-run the *declared* length of the array (zero),
721 and thus "overspill" to consecutive register file entries in a fashion
722 that is completely transparent to a greatly-simplified software / pseudo-code
723 representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an attempt is ever made to access beyond the
"real" register bytes.
728
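A minimal C sketch of the bounds check implied above (assuming RV64 and the 128-entry register file model used here; names are illustrative only):

    #include <stdbool.h>
    #include <stddef.h>

    #define XLEN 64          /* assumption: RV64 */
    #define NUM_REGS 128     /* SV 7-bit regfile model from above */

    /* true if accessing element "offset" of width "elwidth_bytes" starting
       at register "regnum" stays within the real register file bytes */
    static bool regfile_access_in_bounds(int regnum, int elwidth_bytes, int offset)
    {
        size_t start = (size_t)regnum * (XLEN / 8)
                     + (size_t)offset * (size_t)elwidth_bytes;
        return start + (size_t)elwidth_bytes <= (size_t)NUM_REGS * (XLEN / 8);
    }
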
Now we may modify the pseudo-code for an operation where all element bitwidths have
been set to the same size, where this pseudo-code is otherwise identical
to its "non" polymorphic versions (above):
732
    function op_add(rd, rs1, rs2) # add not VADD!
        ...
        ...
        for (i = 0; i < VL; i++)
            ...
            ...
            // TODO, calculate if over-run occurs, for each elwidth
            if (elwidth == 8) {
                int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                         int_regfile[rs2].b[irs2];
            } else if elwidth == 16 {
                int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                         int_regfile[rs2].s[irs2];
            } else if elwidth == 32 {
                int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                         int_regfile[rs2].i[irs2];
            } else { // elwidth == 64
                int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                         int_regfile[rs2].l[irs2];
            }
            ...
            ...
755
756 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
757 following sequentially on respectively from the same) are "type-cast"
758 to 8-bit; for 16-bit entries likewise and so on.
759
760 However that only covers the case where the element widths are the same.
761 Where the element widths are different, the following algorithm applies:
762
763 * Analyse the bitwidth of all source operands and work out the
764 maximum. Record this as "maxsrcbitwidth"
765 * If any given source operand requires sign-extension or zero-extension
766 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
767 sign-extension / zero-extension or whatever is specified in the standard
768 RV specification, **change** that to sign-extending from the respective
769 individual source operand's bitwidth from the CSR table out to
770 "maxsrcbitwidth" (previously calculated), instead.
771 * Following separate and distinct (optional) sign/zero-extension of all
772 source operands as specifically required for that operation, carry out the
773 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
774 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
776 into a copy).
777 * If the destination operand requires sign-extension or zero-extension,
778 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
780 etc.), overload the RV specification with the bitwidth from the
781 destination register's elwidth entry.
782 * Finally, store the (optionally) sign/zero-extended value into its
783 destination: memory for sb/sw etc., or an offset section of the register
784 file for an arithmetic operation.
785
786 In this way, polymorphic bitwidths are achieved without requiring a
787 massive 64-way permutation of calculations **per opcode**, for example
788 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
789 rd bitwidths). The pseudo-code is therefore as follows:
790
    typedef union {
        uint8_t b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
            if (int_vec[rd ].isvector)  { id += 1; }
            if (int_vec[rs1].isvector)  { irs1 += 1; }
            if (int_vec[rs2].isvector)  { irs2 += 1; }
856
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
860
861 * the source operands are extended out to the maximum bitwidth of all
862 source operands
863 * the operation takes place at that maximum source bitwidth (the
864 destination bitwidth is not involved at this point, at all)
865 * the result is extended (or potentially even, truncated) before being
866 stored in the destination. i.e. truncation (if required) to the
867 destination width occurs **after** the operation **not** before.
868 * when the destination is not marked as "vectorised", the **full**
869 (standard, scalar) register file entry is taken up, i.e. the
870 element is either sign-extended or zero-extended to cover the
871 full register bitwidth (XLEN) if it is not already XLEN bits long.
872
873 Implementors are entirely free to optimise the above, particularly
874 if it is specifically known that any given operation will complete
875 accurately in less bits, as long as the results produced are
876 directly equivalent and equal, for all inputs and all outputs,
877 to those produced by the above algorithm.
878
879 ## Polymorphic floating-point operation exceptions and error-handling
880
881 For floating-point operations, conversion takes place without raising any
882 kind of exception. Exactly as specified in the standard RV specification,
883 NAN (or appropriate) is stored if the result is beyond the range of the
884 destination, and, again, exactly as with the standard RV specification
885 just as with scalar operations, the floating-point flag is raised
886 (FCSR). And, again, just as with scalar operations, it is software's
887 responsibility to check this flag. Given that the FCSR flags are
888 "accrued", the fact that multiple element operations could have occurred
889 is not a problem.
890
891 Note that it is perfectly legitimate for floating-point bitwidths of
892 only 8 to be specified. However whilst it is possible to apply IEEE 754
893 principles, no actual standard yet exists. Implementors wishing to
894 provide hardware-level 8-bit support rather than throw a trap to emulate
895 in software should contact the author of this specification before
896 proceeding.
897
898 ## Polymorphic shift operators
899
900 A special note is needed for changing the element width of left and
901 right shift operators, particularly right-shift. Even for standard RV
902 base, in order for correct results to be returned, the second operand
903 RS2 must be truncated to be within the range of RS1's bitwidth.
904 spike's implementation of sll for example is as follows:
905
    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
907
908 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
909 range 0..31 so that RS1 will only be left-shifted by the amount that
910 is possible to fit into a 32-bit register. Whilst this appears not
911 to matter for hardware, it matters greatly in software implementations,
912 and it also matters where an RV64 system is set to "RV32" mode, such
913 that the underlying registers RS1 and RS2 comprise 64 hardware bits
914 each.
915
916 For SV, where each operand's element bitwidth may be over-ridden, the
917 rule about determining the operation's bitwidth *still applies*, being
918 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
919 **also applies to the truncation of RS2**. In other words, *after*
920 determining the maximum bitwidth, RS2's range must **also be truncated**
921 to ensure a correct answer. Example:
922
923 * RS1 is over-ridden to a 16-bit width
924 * RS2 is over-ridden to an 8-bit width
925 * RD is over-ridden to a 64-bit width
926 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
927 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
928
929 Pseudocode (in spike) for this example would therefore be:
930
    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
932
933 This example illustrates that considerable care therefore needs to be
934 taken to ensure that left and right shift operations are implemented
935 correctly. The key is that
936
937 * The operation bitwidth is determined by the maximum bitwidth
938 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
940
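A small illustrative C sketch of the earlier worked example (RS1 elwidth=16, RS2 elwidth=8, RD elwidth=64): the operation is carried out at max(16,8)=16 bits, RS2 is masked to the range 0..15, and the 16-bit result is then sign-extended out to the 64-bit destination width:

    #include <stdint.h>

    static int64_t sv_sll_16x8_to_64(uint16_t rs1, uint8_t rs2)
    {
        uint16_t shifted = (uint16_t)(rs1 << (rs2 & (16 - 1)));  /* 16-bit operation */
        return (int64_t)(int16_t)shifted;      /* extend to the 64-bit destination */
    }
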
941 ## Polymorphic MULH/MULHU/MULHSU
942
943 MULH is designed to take the top half MSBs of a multiply that
944 does not fit within the range of the source operands, such that
945 smaller width operations may produce a full double-width multiply
946 in two cycles. The issue is: SV allows the source operands to
947 have variable bitwidth.
948
949 Here again special attention has to be paid to the rules regarding
950 bitwidth, which, again, are that the operation is performed at
951 the maximum bitwidth of the **source** registers. Therefore:
952
953 * An 8-bit x 8-bit multiply will create a 16-bit result that must
954 be shifted down by 8 bits
955 * A 16-bit x 8-bit multiply will create a 24-bit result that must
956 be shifted down by 16 bits (top 8 bits being zero)
957 * A 16-bit x 16-bit multiply will create a 32-bit result that must
958 be shifted down by 16 bits
959 * A 32-bit x 16-bit multiply will create a 48-bit result that must
960 be shifted down by 32 bits
961 * A 32-bit x 8-bit multiply will create a 40-bit result that must
962 be shifted down by 32 bits
963
964 So again, just as with shift-left and shift-right, the result
965 is shifted down by the maximum of the two source register bitwidths.
966 And, exactly again, truncation or sign-extension is performed on the
967 result. If sign-extension is to be carried out, it is performed
968 from the same maximum of the two source register bitwidths out
969 to the result element's bitwidth.
970
971 If truncation occurs, i.e. the top MSBs of the result are lost,
972 this is "Officially Not Our Problem", i.e. it is assumed that the
973 programmer actually desires the result to be truncated. i.e. if the
974 programmer wanted all of the bits, they would have set the destination
975 elwidth to accommodate them.
976
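A small illustrative C sketch of the 8-bit x 8-bit case (assuming signed MULH semantics, and ignoring SV's register-file indexing): the full product is formed at double the source width, then shifted down by the maximum source bitwidth:

    #include <stdint.h>

    static int8_t mulh_8x8(int8_t a, int8_t b)
    {
        int16_t full = (int16_t)a * (int16_t)b;   /* 16-bit full product */
        return (int8_t)(full >> 8);               /* top half: shift down by 8 */
    }
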
977 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
978
979 Polymorphic element widths in vectorised form means that the data
980 being loaded (or stored) across multiple registers needs to be treated
981 (reinterpreted) as a contiguous stream of elwidth-wide items, where
982 the source register's element width is **independent** from the destination's.
983
984 This makes for a slightly more complex algorithm when using indirection
985 on the "addressed" register (source for LOAD and destination for STORE),
986 particularly given that the LOAD/STORE instruction provides important
987 information about the width of the data to be reinterpreted.
988
989 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
990 was as follows, and i is the loop from 0 to VL-1:
991
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
994
995 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
996 chunks are taken from the source memory location addressed by the current
997 indexed source address register, and only when a full 32-bits-worth
998 are taken will the index be moved on to the next contiguous source
999 address register:
1000
    bitwidth = bw(elwidth); // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock; // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
1006
1007 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
1008 and 128 for LQ.
1009
1010 The principle is basically exactly the same as if the srcbase were pointing
1011 at the memory of the *register* file: memory is re-interpreted as containing
1012 groups of elwidth-wide discrete elements.
1013
1014 When storing the result from a load, it's important to respect the fact
1015 that the destination register has its *own separate element width*. Thus,
1016 when each element is loaded (at the source element width), any sign-extension
1017 or zero-extension (or truncation) needs to be done to the *destination*
1018 bitwidth. Also, the storing has the exact same analogous algorithm as
1019 above, where in fact it is just the set\_polymorphed\_reg pseudocode
1020 (completely unchanged) used above.
1021
1022 One issue remains: when the source element width is **greater** than
1023 the width of the operation, it is obvious that a single LB for example
1024 cannot possibly obtain 16-bit-wide data. This condition may be detected
1025 where, when using integer divide, elsperblock (the width of the LOAD
1026 divided by the bitwidth of the element) is zero.
1027
1028 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
1029
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
1031
1032 The elements, if the element bitwidth is larger than the LD operation's
1033 size, will then be sign/zero-extended to the full LD operation size, as
1034 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1035 being passed on to the second phase.
1036
1037 As LOAD/STORE may be twin-predicated, it is important to note that
1038 the rules on twin predication still apply, except where in previous
1039 pseudo-code (elwidth=default for both source and target) it was
1040 the *registers* that the predication was applied to, it is now the
1041 **elements** that the predication is applied to.
1042
1043 Thus the full pseudocode for all LD operations may be written out
1044 as follows:
1045
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, bitwidth, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1083
1084 Note:
1085
1086 * when comparing against for example the twin-predicated c.mv
1087 pseudo-code, the pattern of independent incrementing of rd and rs
1088 is preserved unchanged.
1089 * just as with the c.mv pseudocode, zeroing is not included and must be
1090 taken into account (TODO).
1091 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1092 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1093 VSCATTER characteristics.
1094 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1095 a destination that is not vectorised (marked as scalar) will
1096 result in the element being fully sign-extended or zero-extended
1097 out to the full register file bitwidth (XLEN). When the source
1098 is also marked as scalar, this is how the compatibility with
1099 standard RV LOAD/STORE is preserved by this algorithm.
1100
1101 ### Example Tables showing LOAD elements
1102
1103 This section contains examples of vectorised LOAD operations, showing
1104 how the two stage process works (three if zero/sign-extension is included).
1105
1106
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1108
1109 This is:
1110
1111 * a 64-bit load, with an offset of zero
1112 * with a source-address elwidth of 16-bit
1113 * into a destination-register with an elwidth of 32-bit
1114 * where VL=7
1115 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1116 * RV64, where XLEN=64 is assumed.
1117
1118 First, the memory table, which, due to the element width being 16 and the
1119 operation being LD (64), the 64-bits loaded from memory are subdivided
1120 into groups of **four** elements. And, with VL being 7 (deliberately
1121 to illustrate that this is reasonable and possible), the first four are
1122 sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1124
1125 [[!table data="""
1126 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1127 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1128 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1129 """]]
1130
1131 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1132 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1133
1134 [[!table data="""
1135 byte 3 | byte 2 | byte 1 | byte 0 |
1136 0x0 | 0x0 | elem0 ||
1137 0x0 | 0x0 | elem1 ||
1138 0x0 | 0x0 | elem2 ||
1139 0x0 | 0x0 | elem3 ||
1140 0x0 | 0x0 | elem4 ||
1141 0x0 | 0x0 | elem5 ||
1142 0x0 | 0x0 | elem6 ||
1144 """]]
1145
1146 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1147 byte-addressable "memory". That "memory" happens to cover registers
1148 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1149
1150 [[!table data="""
1151 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1152 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1153 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1154 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1155 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1156 """]]
1157
1158 Thus we have data that is loaded from the **addresses** pointed to by
1159 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1160 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
1163 LSBs of x11.
1164
1165 Note that whilst the memory addressing table is shown left-to-right byte order,
1166 the registers are shown in right-to-left (MSB) order. This does **not**
1167 imply that bit or byte-reversal is carried out: it's just easier to visualise
1168 memory as being contiguous bytes, and emphasises that registers are not
1169 really actually "memory" as such.
1170
1171 ## Why SV bitwidth specification is restricted to 4 entries
1172
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit

This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex, no RV128 implementation yet exists, and RV64's
default is 64 bit, so the 4 major element widths are covered anyway.
1183
There is an absolutely crucial aspect of SV here that explicitly
1185 needs spelling out, and it's whether the "vectorised" bit is set in
1186 the Register's CSR entry.
1187
1188 If "vectorised" is clear (not set), this indicates that the operation
1189 is "scalar". Under these circumstances, when set on a destination (RD),
1190 then sign-extension and zero-extension, whilst changed to match the
1191 override bitwidth (if set), will erase the **full** register entry
1192 (64-bit if RV64).
1193
1194 When vectorised is *set*, this indicates that the operation now treats
1195 **elements** as if they were independent registers, so regardless of
1196 the length, any parts of a given actual register that are not involved
1197 in the operation are **NOT** modified, but are **PRESERVED**.
1198
1199 For example:
1200
1201 * when the vector bit is clear and elwidth set to 16 on the destination
1202 register, operations are truncated to 16 bit and then sign or zero
1203 extended to the *FULL* XLEN register width.
1204 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1205 groups of elwidth sized elements do not fill an entire XLEN register),
1206 the "top" bits of the destination register do *NOT* get modified, zero'd
1207 or otherwise overwritten.
1208
1209 SIMD micro-architectures may implement this by using predication on
1210 any elements in a given actual register that are beyond the end of
1211 multi-element operation.
1212
1213 Other microarchitectures may choose to provide byte-level write-enable
1214 lines on the register file, such that each 64 bit register in an RV64
1215 system requires 8 WE lines. Scalar RV64 operations would require
1216 activation of all 8 lines, where SV elwidth based operations would
1217 activate the required subset of those byte-level write lines.
1218
1219 Example:
1220
1221 * rs1, rs2 and rd are all set to 8-bit
1222 * VL is set to 3
1223 * RV64 architecture is set (UXL=64)
1224 * add operation is carried out
1225 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1226 concatenated with similar add operations on bits 15..8 and 7..0
1227 * bits 24 through 63 **remain as they originally were**.
1228
1229 Example SIMD micro-architectural implementation:
1230
1231 * SIMD architecture works out the nearest round number of elements
1232 that would fit into a full RV64 register (in this case: 8)
1233 * SIMD architecture creates a hidden predicate, binary 0b00000111
1234 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1235 * SIMD architecture goes ahead with the add operation as if it
1236 was a full 8-wide batch of 8 adds
* SIMD architecture passes the top 5 elements through the adders
  (which are "disabled" due to the zero predicate bits)
* SIMD architecture gets the top 5 (unmodified) 8-bit elements back
  and stores them in rd.
1241
This requires a read of rd; however, such a read is required anyway in order
to support non-zeroing mode.
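
A minimal Python sketch of this preserve-the-untouched-bytes behaviour
follows (the helper `write_element` and the dictionary register file are
illustrative assumptions, not the specification's `set_polymorphed_reg`);
it assumes the element fits entirely within one 64-bit register:

    # Sketch only: write an elwidth-sized element into a 64-bit register,
    # preserving all bits outside the element, exactly as byte-level
    # write-enable lines (or a hidden predicate) would.
    def write_element(regfile, rd, element, elwidth_bits, value):
        shift = element * elwidth_bits
        mask = (1 << elwidth_bits) - 1
        old = regfile[rd]                    # read of rd (needed for non-zeroing anyway)
        regfile[rd] = (old & ~(mask << shift)) | ((value & mask) << shift)

    regs = {3: 0xFFFFFFFF_FFFFFFFF}
    write_element(regs, 3, 0, 8, 0x11)       # VL=3, 8-bit elements: only bytes 0..2 change
    write_element(regs, 3, 1, 8, 0x22)
    write_element(regs, 3, 2, 8, 0x33)
    assert regs[3] == 0xFFFFFFFF_FF332211    # bits 24-63 remain as they originally were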
1244
1245 ## Polymorphic floating-point
1246
1247 Standard scalar RV integer operations base the register width on XLEN,
1248 which may be changed (UXL in USTATUS, and the corresponding MXL and
1249 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1250 arithmetic operations are therefore restricted to an active XLEN bits,
1251 with sign or zero extension to pad out the upper bits when XLEN has
1252 been dynamically set to less than the actual register size.
1253
1254 For scalar floating-point, the active (used / changed) bits are
1255 specified exclusively by the operation: ADD.S specifies an active
1256 32-bits, with the upper bits of the source registers needing to
1257 be all 1s ("NaN-boxed"), and the destination upper bits being
1258 *set* to all 1s (including on LOAD/STOREs).
1259
1260 Where elwidth is set to default (on any source or the destination)
1261 it is obvious that this NaN-boxing behaviour can and should be
1262 preserved. When elwidth is non-default things are less obvious,
1263 so need to be thought through. Here is a normal (scalar) sequence,
1264 assuming an RV64 which supports Quad (128-bit) FLEN:
1265
1266 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1267 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1268 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1269 top 64 MSBs ignored.
1270
1271 Therefore it makes sense to mirror this behaviour when, for example,
1272 elwidth is set to 32. Assume elwidth set to 32 on all source and
1273 destination registers:
1274
1275 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1276 floating-point numbers.
1277 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1278 in bits 0-31 and the second in bits 32-63.
1279 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1280
1281 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1282 of the registers either during the FLD **or** the ADD.D. The reason
1283 is that, effectively, the top 64 MSBs actually represent a completely
1284 independent 64-bit register, so overwriting it is not only gratuitous
1285 but may actually be harmful for a future extension to SV which may
1286 have a way to directly access those top 64 bits.
1287
The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
when "isvec" is false in a given register's CSR entry. Only when the
elwidth is set to default **and** isvec is false will the standard
RV behaviour be followed, namely that the upper bits be modified.
1293
1294 Ultimately if elwidth is default and isvec false on *all* source
1295 and destination registers, a SimpleV instruction defaults completely
1296 to standard RV scalar behaviour (this holds true for **all** operations,
1297 right across the board).
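
A one-line decision rule capturing the above, as a hedged Python sketch
(function and parameter names are illustrative only):

    # Standard RV NaN-boxing of the upper FP register bits applies *only*
    # when the destination has default elwidth and is not marked as a vector.
    def overwrite_upper_bits(elwidth_is_default, isvec):
        return elwidth_is_default and not isvec

    assert overwrite_upper_bits(True,  False) is True   # plain scalar RV: NaN-box
    assert overwrite_upper_bits(False, False) is False  # elwidth override: preserve
    assert overwrite_upper_bits(True,  True)  is False  # vectorised: preserve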
1298
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set to
a non-default value, are effectively all the same: they all still perform
multiple ADD operations, just at different widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1305
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar out-of-order
architecture there may be absolutely no difference; however, simpler
SIMD-style microarchitectures may not have the infrastructure in place
to know the difference, such that when VL=8 an ADD.D instruction
completes in 2 cycles (or more) rather than one, where an ADD.Q issued
instead on such simpler microarchitectures would complete in one.
1316
1317 ## Specific instruction walk-throughs
1318
1319 This section covers walk-throughs of the above-outlined procedure
1320 for converting standard RISC-V scalar arithmetic operations to
1321 polymorphic widths, to ensure that it is correct.
1322
1323 ### add
1324
1325 Standard Scalar RV32/RV64 (xlen):
1326
1327 * RS1 @ xlen bits
1328 * RS2 @ xlen bits
1329 * add @ xlen bits
1330 * RD @ xlen bits
1331
1332 Polymorphic variant:
1333
1334 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1335 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1336 * add @ max(rs1, rs2) bits
1337 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1338
1339 Note here that polymorphic add zero-extends its source operands,
1340 where addw sign-extends.
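
As an illustrative (non-normative) Python sketch of the above walk-through,
with the hypothetical helper `poly_add` and all widths given in bits:

    # Sketch only: sources are zero-extended to the larger source width,
    # the add is performed at that width, then the result is truncated
    # (or zero-extended) to the destination's width.
    def poly_add(src1, rs1_bits, src2, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        a = src1 & ((1 << rs1_bits) - 1)       # zero-extend rs1 operand
        b = src2 & ((1 << rs2_bits) - 1)       # zero-extend rs2 operand
        result = (a + b) & ((1 << opwidth) - 1)
        return result & ((1 << rd_bits) - 1)   # truncate / zero-extend to rd

    assert poly_add(0xFF, 8, 0x01, 16, 8)  == 0x00   # 8-bit rd truncates the carry
    assert poly_add(0xFF, 8, 0x01, 16, 16) == 0x100  # 16-bit rd keeps it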
1341
1342 ### addw
1343
1344 The RV Specification specifically states that "W" variants of arithmetic
1345 operations always produce 32-bit signed values. In a polymorphic
1346 environment it is reasonable to assume that the signed aspect is
1347 preserved, where it is the length of the operands and the result
1348 that may be changed.
1349
1350 Standard Scalar RV64 (xlen):
1351
1352 * RS1 @ xlen bits
1353 * RS2 @ xlen bits
1354 * add @ xlen bits
1355 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1356
1357 Polymorphic variant:
1358
1359 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1360 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1361 * add @ max(rs1, rs2) bits
1362 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1363
1364 Note here that polymorphic addw sign-extends its source operands,
1365 where add zero-extends.
1366
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1371
1372 Effectively however, both rs1 and rs2 are being sign-extended (or
1373 truncated), where for add they are both zero-extended. This holds true
1374 for all arithmetic operations ending with "W".
1375
1376 ### addiw
1377
1378 Standard Scalar RV64I:
1379
1380 * RS1 @ xlen bits, truncated to 32-bit
1381 * immed @ 12 bits, sign-extended to 32-bit
1382 * add @ 32 bits
1383 * RD @ rd bits. sign-extend to rd if rd > 32, otherwise truncate.
1384
1385 Polymorphic variant:
1386
1387 * RS1 @ rs1 bits
1388 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1389 * add @ max(rs1, 12) bits
1390 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
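
A corresponding non-normative Python sketch for addiw (the helpers `sext`
and `poly_addiw` are illustrative; rs1 is assumed here to be read as a
signed value, consistent with the addw analysis above):

    # Sketch only: the 12-bit immediate is sign-extended to max(rs1, 12)
    # bits, the add is performed at that width, and the result is
    # sign-extended or truncated to rd's width.
    def sext(value, bits):
        value &= (1 << bits) - 1
        return value - (1 << bits) if value & (1 << (bits - 1)) else value

    def poly_addiw(src1, rs1_bits, imm12, rd_bits):
        opwidth = max(rs1_bits, 12)
        result = sext(src1, rs1_bits) + sext(imm12, 12)
        return sext(result, opwidth) & ((1 << rd_bits) - 1)

    assert poly_addiw(0x7F, 8, -1, 16) == 0x007E  # 127 + (-1), widened to 16 bits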
1391
1392 # Predication Element Zeroing
1393
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, allowing them to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1404
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1411
1412 ## Single-predication (based on destination register)
1413
1414 Zeroing on predication for arithmetic operations is taken from
1415 the destination register's predicate. i.e. the predication *and*
1416 zeroing settings to be applied to the whole operation come from the
1417 CSR Predication table entry for the destination register.
1418 Thus when zeroing is set on predication of a destination element,
1419 if the predication bit is clear, then the destination element is *set*
1420 to zero (twin-predication is slightly different, and will be covered
1421 next).
1422
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:

     for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
           # skip (rather than zero) masked-out elements entirely
           while (!(predval & 1<<i) && i < VL)
              if (int_vec[rd ].isvector)  { id += 1; }
              if (int_vec[rs1].isvector)  { irs1 += 1; }
              if (int_vec[rs2].isvector)  { irs2 += 1; }
              i += 1
           if i == VL:
              return
        if (predval & 1<<i)
           src1 = ... # polymorphic read of element irs1 from rs1
           src2 = ... # polymorphic read of element irs2 from rs2
           result = src1 + src2 # actual add (or other op) here
           set_polymorphed_reg(rd, destwid, id, result)
           if int_vec[rd].ffirst and result == 0:
              VL = i # result was zero: ffirst ends loop early, truncating VL
              return
           if (!int_vec[rd].isvector) return
        else if zeroing:
           result = 0 # masked-out element: a zero is stored in the destination
           set_polymorphed_reg(rd, destwid, id, result)
        if (int_vec[rd ].isvector)  { id += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL): return # end of hardware loop
1452
1453 The optimisation to skip elements entirely is only possible for certain
1454 micro-architectures when zeroing is not set. However for lane-based
1455 micro-architectures this optimisation may not be practical, as it
1456 implies that elements end up in different "lanes". Under these
1457 circumstances it is perfectly fine to simply have the lanes
1458 "inactive" for predicated elements, even though it results in
1459 less than 100% ALU utilisation.
1460
Twin-predication is not that much different, except that
the source is zero-predicated independently of the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated, *or both*, or neither.
1467
When, with twin-predication, zeroing is set on the source and not
the destination, a clear source predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1474
1475 When zeroing is set on the destination and not the source, then just
1476 as with single-predicated operations, a zero is stored into the destination
1477 element (or target memory address for a STORE).
1478
Zeroing on both source and destination effectively means that an element
is passed through only where *both* the source and destination predicate
bits are set: where either the source predicate OR the destination
predicate is set to 0, a zero element will ultimately end up in the
destination register.
1483
1484 However: this may not necessarily be the case for all operations;
1485 implementors, particularly of custom instructions, clearly need to
1486 think through the implications in each and every case.
1487
1488 Here is pseudo-code for a twin zero-predicated operation:
1489
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          # skip masked-out elements only when zeroing is not requested
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i))   # source active: pass the data through
                sourcedata = ireg[rs+i];
             else               # source zeroing: pass a zero through
                sourcedata = 0
             ireg[rd+j] <= sourcedata
          else if (zerodst)     # destination zeroing: store a zero
             ireg[rd+j] <= 0
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break;
1513
1514 Note that in the instance where the destination is a scalar, the hardware
1515 loop is ended the moment a value *or a zero* is placed into the destination
1516 register/element. Also note that, for clarity, variable element widths
1517 have been left out of the above.
1518
1519 # Subsets of RV functionality
1520
1521 This section describes the differences when SV is implemented on top of
1522 different subsets of RV.
1523
1524 ## Common options
1525
1526 It is permitted to only implement SVprefix and not the VBLOCK instruction
1527 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1528 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1529 traps may emulate the format.
1530
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1535
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However,
reducing the number of registers below the mandatory limits set in the
RV standard will result in non-compliance with the SV Specification.
1540
1541 ## RV32 / RV32F
1542
1543 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1544 maximum limit for predication is also restricted to 32 bits. Whilst not
1545 actually specifically an "option" it is worth noting.
1546
1547 ## RV32G
1548
Normally, in standard RV32, it does not make much sense to have RV32G:
the critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1553
1554 In an earlier draft of SV, it was possible to specify an elwidth
1555 of double the standard register size: this had to be dropped,
1556 and may be reintroduced in future revisions.
1557
1558 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1559
1560 When floating-point is not implemented, the size of the User Register and
1561 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1562 per table).
1563
1564 ## RV32E
1565
1566 In embedded scenarios the User Register and Predication CSRs may be
1567 dropped entirely, or optionally limited to 1 CSR, such that the combined
1568 number of entries from the M-Mode CSR Register table plus U-Mode
1569 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1570 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1571 the Predication CSR tables.
1572
1573 RV32E is the most likely candidate for simply detecting that registers
1574 are marked as "vectorised", and generating an appropriate exception
1575 for the VL loop to be implemented in software.
1576
1577 ## RV128
1578
1579 RV128 has not been especially considered, here, however it has some
1580 extremely large possibilities: double the element width implies
1581 256-bit operands, spanning 2 128-bit registers each, and predication
1582 of total length 128 bit given that XLEN is now 128.
1583
1584 # Example usage
1585
1586 TODO evaluate strncpy and strlen
1587 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1588
## strncpy <a name="strncpy"></a>
1590
1591 RVV version:
1592
    strncpy:
        c.mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8      # Vectors of bytes.
        vlbff.v v1, (a1)          # Get src bytes
        vseq.vi v0, v1, 0         # Flag zero bytes
        vmfirst a4, v0            # Zero found?
        vmsif.v v0, v0            # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t      # Write out bytes
        c.bgez a4, exit           # Done
        csrr t1, vl               # Get number of bytes fetched
        c.add a1, a1, t1          # Bump src pointer
        c.sub a2, a2, t1          # Decrement count.
        c.add a3, a3, t1          # Bump dst pointer
        c.bnez a2, loop           # Anymore?

    exit:
        c.ret
1611
1612 SV version (WIP):
1613
    strncpy:
        c.mv a3, a0
        VBLK.RegCSR[t0] = 8bit, t0, vector
        VBLK.PredTb[t0] = ffirst, x0, inv
    loop:
        VBLK.SETVLI a2, t4, 8     # t4 and VL now 1..8 (MVL=8)
        c.ldb t0, (a1)            # t0 fail first mode
        c.bne t0, x0, allnonzero  # still ff
        # VL (t4) points to last nonzero
        c.addi t4, t4, 1          # include zero
        c.stb t0, (a3)            # store incl zero
        c.ret                     # end subroutine
    allnonzero:
        c.stb t0, (a3)            # VL legal range
        c.add a1, a1, t4          # Bump src pointer
        c.sub a2, a2, t4          # Decrement count.
        c.add a3, a3, t4          # Bump dst pointer
        c.bnez a2, loop           # Anymore?
    exit:
        c.ret
1634
1635 Notes:
1636
1637 * Setting MVL to 8 is just an example. If enough registers are spare it
1638 may be set to XLEN which will require a bank of 8 scalar registers for
1639 a1, a3 and t0.
1640 * obviously if that is done, t0 is not separated by 8 full registers, and
1641 would overwrite t1 thru t7. x80 would work well, as an example, instead.
1642 * with the exception of the GETVL (a pseudo code alias for csrr), every
1643 single instruction above may use RVC.
1644 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1645 registers through redirection
1646 * RVC C.LW and C.SW may be used because the W format may be overridden by
1647 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1648 * with the exception of the GETVL, all Vector Context may be done in
1649 VBLOCK form.
1650 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1651 just ffirst on t0
1652 * ldb and bne are both using t0, both in ffirst mode
1653 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1654 vectorised, no (un)sign-extension or truncation" mode.
* ldb will end on illegal mem, reduce VL, but will have copied all sorts of
  stuff into t0 (which could contain zeros).
1657 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1658 scalar x0
1659 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1660 compares, and reduce VL as well
1661 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1664 * SETVL sets *exactly* the requested amount into VL.
* the SETVL just after the allnonzero label is needed in case the ldb ffirst
  activates but the bne (to allnonzero) does not.
1667 * this would cause the stb to copy up to the end of the legal memory
1668 * of course, on the next loop the ldb would throw a trap, as a1 now
1669 points to the first illegal mem location.
1670
1671 ## strcpy
1672
1673 RVV version:
1674
        mv a3, a0                 # Save start
    loop:
        setvli a1, x0, vint8      # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)         # Get bytes
        csrr a1, vl               # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0         # Set v0[i] where v1[i] = 0
        add a3, a3, a1            # Bump pointer
        vmfirst a2, v0            # Find first set bit in mask, returns -1 if none
        bltz a2, loop             # Not found?
        add a0, a0, a1            # Sum start + bump
        add a3, a3, a2            # Add index of zero byte
        sub a0, a3, a0            # Subtract start address+bump
        ret
1688
1689 ## DAXPY <a name="daxpy"></a>
1690
1691 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1692
1693 Notes:
1694
1695 * Setting MVL to 4 is just an example. With enough space between the
1696 FP regs, MVL may be set to larger values
1697 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1698 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1699 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1700 * All instructions except fmadd may use Compressed variants. Total
1701 number of 16-bit instruction words: 11.
1702 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1703
1704 ## BigInt add <a name="bigadd"></a>
1705
1706 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]