simple_v_extension/appendix.mdwn

# Simple-V (Parallelism Extension Proposal) Appendix

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 30 jun 2019
* main spec [[specification]]

[[!toc ]]
   9
# Fail-on-first modes <a name="ffirst"></a>

Fail-on-first data dependency has different behaviour for traps than
for conditional testing.  "Conditional" is taken to mean "anything
that is zero"; with traps, however, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (when VL=1).

Note that implementors are required to choose one mode or the other,
mutually exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time.  This advice applies
to custom opcode writers as well as future extension writers.
  22
## Fail-on-first traps

Except for the first element, ffirst stops sequential element processing
when a trap occurs.  The first element is treated normally (as if ffirst
is clear).  Should any subsequent element instruction require a trap,
it and all subsequent indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence instruction
that did not take the trap.

Note that predicated-out elements (where the predicate mask bit is
zero) are clearly excluded (i.e. the trap will not occur).  However,
the loop still had to test the predicate bit: thus on return,
VL is set to include the elements that did not take the trap *and*
the elements that were predicated (masked) out, and therefore not tested,
up to the point where the trap occurred.
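The VL calculation described above can be sketched in Python.  This is a minimal model under stated assumptions, not the specification's own pseudo-code: `ffirst_trap_vl`, `ScalarTrap` and the `would_trap` callback are hypothetical names, SUBVL is taken to be 1, and "first element" is assumed to mean element index 0.

```python
class ScalarTrap(Exception):
    """Stands in for the ordinary trap that element 0 is allowed to raise."""


def ffirst_trap_vl(vl, predicate, would_trap):
    """Return the VL left behind by a fail-on-first (trap) hardware loop.

    vl         -- vector length on entry
    predicate  -- bitmask; bit i set means element i is active
    would_trap -- hypothetical callback, True if element i would fault
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue  # masked out: cannot trap, but is still counted
        if would_trap(i):
            if i == 0:
                # first element: behave exactly like the scalar operation
                raise ScalarTrap("element 0 trapped")
            return i  # truncate VL: elements 0..i-1 are kept
    return vl  # nothing trapped, VL is unchanged


# Element 1 is masked out and element 3 would fault.
print(ffirst_trap_vl(8, 0b11111101, lambda i: i == 3))
```

In this model the masked-out element 1 contributes to the returned count even though it was never executed, which is the point made in the paragraph above.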
  38
  39 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
  40 unaltered by setting zero or non-zero predication mode.
  41
  42 If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
  43 will cause a trap as normal (as if ffirst is not set); subsequently, the
  44 trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
  45 be modified.  Traps must analyse (x)eSTATE (subvl offset indices) to
  46 determine the element that caused the trap.
  47
  48 Given that predication bits apply to SUBVL groups, the same rules apply
  49 to predicated-out (masked-out) sub-groups in calculating the value that
  50 VL is set to.
  51
  52 ## Fail-on-first conditional tests
  53
  54 ffirst stops sequential (or sequentially-appearing in the case of
  55 out-of-order designs) element conditional testing on the first element
  56 result being zero (or other "fail" condition).  VL is set to the number
  57 of elements that were (sequentially) processed before the fail-condition
  58 was encountered.
  59
  60 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
  61 responds to changes in the zero or non-zero predication mode.  Whilst
  62 in non-zeroing mode, masked-out elements are simply not tested (and
  63 thus considered "never to fail"), in zeroing mode, masked-out elements
  64 may be viewed as *always* (unconditionally) failing.  This effectively
  65 turns VL into something akin to a software-controlled loop.
  66
Note that just as with traps, if SUBVL!=1, the first fail-condition in the
*sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL), i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped.  However, again: SUBVL must not be modified:
traps must analyse (x)eSTATE (subvl offset indices) to determine the
element that caused the trap.
  75
  76 Note again that, just as with traps, predicated-out (masked-out) elements
  77 are included in the (sequential) count leading up to the fail-condition,
  78 even though they were not tested.
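As an illustration of how zeroing changes the outcome, the following Python sketch models the conditional-test variant.  It is a simplified model with SUBVL assumed to be 1; `ffirst_cond_vl` and the `test` callback are hypothetical names introduced only for this example.

```python
def ffirst_cond_vl(vl, predicate, zeroing, test):
    """Return the VL left behind by fail-on-first conditional testing.

    predicate -- bitmask; bit i set means element i is active
    zeroing   -- True for zeroing predication mode
    test      -- hypothetical callback, True if element i passes
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            if zeroing:
                return i  # zeroing: a masked-out element always "fails"
            continue      # non-zeroing: not tested, so it never fails
        if not test(i):
            return i      # first failing element ends the loop
    return vl


results = [5, 7, 0, 9]

def nonzero(i):
    return results[i] != 0

print(ffirst_cond_vl(4, 0b1111, False, nonzero))  # element 2 is zero
print(ffirst_cond_vl(4, 0b1011, False, nonzero))  # element 2 masked out, skipped
print(ffirst_cond_vl(4, 0b1011, True, nonzero))   # element 2 masked out, zeroing
```

The second and third calls differ only in the zeroing flag, which is what makes VL behave like a software-controlled loop bound.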
  79
# Instructions <a name="instructions" />

Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, SV needs no new instructions.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field).  Despite the removal of all operations,
*all instructions from RVV Base, with the exception of CLIP and VSELECT.X,
are topologically re-mapped and retain their complete functionality,
intact*.  Note that if RV64G ever had a MV.X added as well as FCLIP,
the full functionality of RVV-Base would be obtained in SV.

Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
equivalents, so are left out of Simple-V.  VSELECT could be included if
there existed a MV.X instruction in RV (MV.X is a hypothetical
non-immediate variant of MV that would allow another register to
specify which register was to be copied).  Note that if any of these three
instructions are added to any given RV extension, their functionality
will be inherently parallelised.
 100
 101 With some exceptions, where it does not make sense or is simply too
 102 challenging, all RV-Base instructions are parallelised:
 103
* CSR instructions: whilst a case could be made for fast-polling of
  a CSR into multiple registers, or for being able to copy multiple
  contiguously addressed CSRs into contiguous registers, and so on,
  CSRs are the fundamental core basis of SV.  If parallelised, extreme
  care would need to be taken.  Additionally, CSR reads are done
  using x0, and it is *really* inadvisable to tag x0.
* LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
  left as scalar.
* LR/SC could hypothetically be parallelised, however their purpose is
  single (complex) atomic memory operations where the LR must be followed
  up by a matching SC.  A sequence of parallel LR instructions followed
  by a sequence of parallel SC instructions therefore is guaranteed to
  not be useful. Not least: the guarantees of a Multi-LR/SC
  would be impossible to provide if emulated in a trap.
* EBREAK, NOP, FENCE and others do not use registers so are not inherently
  parallelisable anyway.
 120
 121 All other operations using registers are automatically parallelised.
 122 This includes AMOMAX, AMOSWAP and so on, where particular care and
 123 attention must be paid.
 124
 125 Example pseudo-code for an integer ADD operation (including scalar
 126 operations).  Floating-point uses the FP Register Table.
 127
 128 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
 129
 130 Note that for simplicity there is quite a lot missing from the above
 131 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
 132 reshaping and offsets and so on.  However it demonstrates the basic
 133 principle.  Augmentations that produce the full pseudo-code are covered in
 134 other sections.
 135
## SUBVL Pseudocode <a name="subvl-pseudocode"></a>

Adding in support for SUBVL is a matter of adding in an extra inner
for-loop, where register src and dest are still incremented inside the
inner part. Note that the predication is still taken from the VL index.

So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
       xSTATE.srcoffs = i # save context
       for (s = 0; s < SUBVL; s++)
        xSTATE.ssvoffs = s # save context
        if (predval & 1<<i) # predication uses intregs
           # actual add is here (at last)
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL) {
          # end VL hardware loop
          xSTATE.srcoffs = 0; # reset
          xSTATE.ssvoffs = 0; # reset
          return;
        }

NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
elwidth handling etc. all left out.
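The indexing rule (elements by `i * SUBVL + s`, predicate bits by `i`) can be tried out with a small Python model.  This is a sketch only: `subvl_add` is a hypothetical helper, all three registers are assumed to be tagged as vectors, and the register file is a plain list.

```python
def subvl_add(regs, rd, rs1, rs2, vl, subvl, predicate):
    """Add sub-vectors element by element; one predicate bit per sub-group."""
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue  # the whole sub-group i is masked out
        for s in range(subvl):
            e = i * subvl + s  # element index within the vector
            regs[rd + e] = regs[rs1 + e] + regs[rs2 + e]


regs = [0] * 32
regs[8:14] = [1, 2, 3, 4, 5, 6]          # rs1: two groups of three
regs[16:22] = [10, 20, 30, 40, 50, 60]   # rs2: two groups of three
# VL=2, SUBVL=3, predicate selects only sub-group 1
subvl_add(regs, rd=24, rs1=8, rs2=16, vl=2, subvl=3, predicate=0b10)
print(regs[24:30])
```

Only the second group of three destination registers is written, because a single predicate bit covers the entire sub-group.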
 172
## Instruction Format

It is critical to appreciate that there are
**no operations added to SV, at all**.

Instead, by using CSRs to tag registers as an indication of "changed
behaviour", SV *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV, FCVT, and
LOAD/STORE depending on CSR configurations for bitwidth and predication.
**Everything** becomes parallelised.  *This includes Compressed
instructions* as well as any future instructions and Custom Extensions.

Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V.  UXL, SXL and MXL change the behaviour so that
XLEN=32/64/128.  FRM changes the behaviour of the floating-point unit, to
alter the rounding mode.  Other architectures change the LOAD/STORE
byte-order from big-endian to little-endian on a per-instruction basis.
SV is just a little more... comprehensive in its effect on instructions.
 191
## Branch Instructions

Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
table).  As such, "ffirst" (fail-on-first) condition mode can be enabled.
See ffirst mode in the Predication Table section.

There are two registers for the comparison operation, therefore there is
the opportunity to associate two predicate registers.  The first is a
"normal" predicate register, which acts just as it does on any other
single-predicated operation: it masks out elements where a bit is zero,
applies an inversion to the predicate mask, and enables zeroing / non-zeroing
mode.

The second is utilised to indicate where the results of each comparison
are to be stored, as a bitmask.  Additionally, the behaviour of the branch
(when it occurs) may also be modified depending on whether the predicate
"invert" and "zeroing" bits are set.

* If "invert" is zero, and "zeroing" is zero, the branch will occur if and
  only if all tests pass
* If "invert" is set and "zeroing" is zero, the branch will occur if all
  tests *fail* (opposite of inv=0,zero=0)
* If "invert" is zero, and "zeroing" is set, the branch will occur if
  even *one* test passes
* If "invert" is set and "zeroing" is set, the branch will occur if
  even *one* test fails.

This inversion capability covers AND, OR, NAND and NOR branching based
on multiple element comparisons.  Note that, unlike the early termination
of chains of AND or OR conditional tests in normal computer programming,
the chain does *not* terminate early unless fail-on-first is set,
and even then ffirst ends on the first data-dependent zero.  When ffirst
mode is not set, *all* conditional element tests must be performed (and
the result optionally stored in the result mask), with a "post-analysis"
phase carried out which checks whether to branch.
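The four invert/zeroing combinations listed above reduce to a "post-analysis" over the per-element results.  The Python sketch below follows the wording of that list only; `branch_taken` is a hypothetical name, and `results` is assumed to hold one boolean per element that was actually tested.

```python
def branch_taken(results, invert, zeroing):
    """Decide whether the branch occurs, given per-element test results."""
    if not invert and not zeroing:
        return all(results)       # every test passes
    if invert and not zeroing:
        return not any(results)   # every test fails
    if not invert and zeroing:
        return any(results)       # at least one test passes
    return not all(results)       # at least one test fails


results = [True, False, True]
for invert in (False, True):
    for zeroing in (False, True):
        print(invert, zeroing, branch_taken(results, invert, zeroing))
```

Note that all element results are consumed before the decision is made, mirroring the absence of early termination when ffirst is not set.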
 229
### Standard Branch <a name="standard_branch"></a>

Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).

Note that the predication register to use (if one is enabled) is taken from
the *first* src register, and that this is used, just as with predicated
arithmetic operations, to mask whether the comparison operations take
place or not.  The target (destination) predication register
to use (if one is enabled) is taken from the *second* src register.

If either of src1 or src2 is a scalar (whether by there being no
CSR register entry or whether by the CSR entry specifically marking
the register as "scalar") the comparison goes ahead as vector-scalar
or scalar-vector.

In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).

Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) be set to zero.  Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared.  This is important to appreciate,
as there may be an expectation that, going into the hardware-loop,
the destination predicate is always set to zero:
this is **not** the case.  The destination predicate is only set
to zero if **zeroing** is enabled.

Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by swapping
src1 and src2, however note that in doing so, the predicate table
setup must also be correspondingly adjusted.
 268
In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":

    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);

With associated predication, vector-length adjustments and so on,
and temporarily ignoring bitwidth (which makes the comparisons more
complex), this becomes:

    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion, pred_zeroing = get_pred_flags(rs2)
    else
        pred_inversion, pred_zeroing = False, False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
      if (zeroing)
        if not (ps & (1<<i))
           result &= ~(1<<i);
      else if (ps & (1<<i))
          if (cmp(s1 ? reg[src1+i]:reg[src1],
                  s2 ? reg[src2+i]:reg[src2]))
              result |= 1<<i;
          else
              result &= ~(1<<i);
              if ffirst_mode:
                break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if pred_zeroing:
            if result != 0:
                goto branch
        else:
            if result == 0:
                goto branch
    else:
        if pred_zeroing:
            if (result & ps) != result:
                goto branch
        else:
            if (result & ps) == result:
                goto branch
 336
 337 Notes:
 338
 339 * Predicated SIMD comparisons would break src1 and src2 further down
 340   into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
 341   Reordering") setting Vector-Length times (number of SIMD elements) bits
 342   in Predicate Register rd, as opposed to just Vector-Length bits.
 343 * The execution of "parallelised" instructions **must** be implemented
 344   as "re-entrant" (to use a term from software).  If an exception (trap)
 345   occurs during the middle of a vectorised
 346   Branch (now a SV predicated compare) operation, the partial results
 347   of any comparisons must be written out to the destination
 348   register before the trap is permitted to begin.  If however there
 349   is no predicate, the **entire** set of comparisons must be **restarted**,
 350   with the offset loop indices set back to zero.  This is because
 351   there is no place to store the temporary result during the handling
 352   of traps.
 353
 354 TODO: predication now taken from src2.  also branch goes ahead
 355 if all compares are successful.
 356
 357 Note also that where normally, predication requires that there must
 358 also be a CSR register entry for the register being used in order
 359 for the **predication** CSR register entry to also be active,
 360 for branches this is **not** the case.  src2 does **not** have
 361 to have its CSR register entry marked as active in order for
 362 predication on src2 to be active.
 363
 364 Also note: SV Branch operations are **not** twin-predicated
 365 (see Twin Predication section).  This would require three
 366 element offsets: one to track src1, one to track src2 and a third
 367 to track where to store the accumulation of the results.  Given
 368 that the element offsets need to be exposed via CSRs so that
 369 the parallel hardware looping may be made re-entrant on traps
 370 and exceptions, the decision was made not to make SV Branches
 371 twin-predicated.
 372
### Floating-point Comparisons

There are no floating-point branch operations, only compares.
Interestingly, no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.

In RV (scalar) Base, a branch on a floating-point compare is
done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
This does extend to SV, as long as x1 (in the example sequence given)
is vectorised.  When that is the case, x1..x(1+VL-1) will also be
set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
so on.  Consequently, unlike integer-branch, FP Compare needs no
modification in its behaviour.
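The FEQ/BEQ sequence can be modelled in a few lines of Python.  This is an illustrative sketch: `vec_feq` and `vec_beq_x0` are hypothetical helpers, x1, f0 and f5 are all assumed to be tagged as vectors, x0 is scalar, and predication is left out.

```python
def vec_feq(fregs, iregs, rd, rs1, rs2, vl):
    """Vectorised FEQ: write 1 or 0 into consecutive integer registers."""
    for i in range(vl):
        iregs[rd + i] = 1 if fregs[rs1 + i] == fregs[rs2 + i] else 0


def vec_beq_x0(iregs, rs1, vl):
    """Vectorised BEQ against x0: taken only if every element equals zero."""
    return all(iregs[rs1 + i] == 0 for i in range(vl))


VL = 3
fregs = [0.0] * 32
iregs = [0] * 32
fregs[0:3] = [1.0, 2.0, 3.0]   # f0..f2
fregs[5:8] = [1.0, 0.0, 3.0]   # f5..f7
vec_feq(fregs, iregs, rd=1, rs1=0, rs2=5, vl=VL)
print(iregs[1:4], vec_beq_x0(iregs, 1, VL))
```

Here x1..x3 receive one comparison result each, and the following BEQ inspects all of them rather than just x1.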
 388
In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
missing.  In ordinary branch code this is fine, because the
standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero); in predication
terms, however, the omission has more of an impact.  To deal with this,
SV's predication has had "invert" added to it.
 395
 396 Also: note that FP Compare may be predicated, using the destination
 397 integer register (rd) to determine the predicate.  FP Compare is **not**
 398 a twin-predication operation, as, again, just as with SV Branches,
 399 there are three registers involved: FP src1, FP src2 and INT rd.
 400
 401 Also: note that ffirst (fail first mode) applies directly to this operation.
 402
### Compressed Branch Instruction

Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries.  As however there is only the one source register,
given that c.beqz a10 is equivalent to beq a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
 411
 412 The specific required use of x0 is, with a little thought, quite obvious,
 413 but is counterintuitive.  Clearly it is **not** recommended to redirect
 414 x0 with a CSR register entry, however as a means to opaquely obtain
 415 a predication target it is the only sensible option that does not involve
 416 additional special CSRs (or, worse, additional special opcodes).
 417
 418 Note also that, just as with standard branches, the 2nd source
 419 (in this case x0 rather than src2) does **not** have to have its CSR
 420 register table marked as "active" in order for predication to work.
 421
 422 ## Vectorised Dual-operand instructions
 423
 424 There is a series of 2-operand instructions involving copying (and
 425 sometimes alteration):
 426
 427 * C.MV
 428 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
 429 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
 430 * LOAD(-FP) and STORE(-FP)
 431
 432 All of these operations follow the same two-operand pattern, so it is
 433 *both* the source *and* destination predication masks that are taken into
 434 account.  This is different from
 435 the three-operand arithmetic instructions, where the predication mask
 436 is taken from the *destination* register, and applied uniformly to the
 437 elements of the source register(s), element-for-element.
 438
 439 The pseudo-code pattern for twin-predicated operations is as
 440 follows:
 441
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
 455
 456 This pattern covers scalar-scalar, scalar-vector, vector-scalar
 457 and vector-vector, and predicated variants of all of those.
 458 Zeroing is not presently included (TODO).  As such, when compared
 459 to RVV, the twin-predicated variants of C.MV and FMV cover
 460 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
 461 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
 462
 463 Note that:
 464
 465 * elwidth (SIMD) is not covered in the pseudo-code above
 466 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
 467   not covered
 468 * zero predication is also not shown (TODO).
 469
### C.MV Instruction <a name="c_mv"></a>

There is no MV instruction in RV, however there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation.  The actual instruction's format does not change:
 479
[[!table  data="""
15  12 | 11   7 | 6  2 | 1  0 |
funct4 | rd     | rs   | op   |
4      | 5      | 5    | 2    |
C.MV   | dest   | src  | C2   |
"""]]
 486
A simplified version of the pseudocode for this operation is as follows:

    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
 502
 503 There are several different instructions from RVV that are covered by
 504 this one opcode:
 505
[[!table  data="""
src    | dest    | predication   | op             |
scalar | vector  | none          | VSPLAT         |
scalar | vector  | destination   | sparse VSPLAT  |
scalar | vector  | 1-bit dest    | VINSERT        |
vector | scalar  | 1-bit? src    | VEXTRACT       |
vector | vector  | none          | VCOPY          |
vector | vector  | src           | Vector Gather  |
vector | vector  | dest          | Vector Scatter |
vector | vector  | src & dest    | Gather/Scatter |
vector | vector  | src == dest   | sparse VCOPY   |
"""]]
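Several rows of the table can be reproduced with a small Python model of the twin-predicated move.  This is a sketch under assumptions: `twin_pred_mv` is a hypothetical helper, the register file is a plain list, a predicate of -1 stands for "no predication" (all bits set), and the skip loops are bounded by VL so that the model always terminates.

```python
def twin_pred_mv(regs, rd, rs, vl, rd_isvec, rs_isvec, pd=-1, ps=-1):
    """Copy src to dest, skipping masked-out src and dest elements."""
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            while i < vl and not (ps >> i) & 1:
                i += 1  # skip masked-out source elements
        if rd_isvec:
            while j < vl and not (pd >> j) & 1:
                j += 1  # skip masked-out destination elements
        if i >= vl or j >= vl:
            break
        regs[rd + j] = regs[rs + i]
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        else:
            break  # scalar destination: one copy only


regs = [0] * 32
regs[8:12] = [10, 11, 12, 13]
regs[4] = 7
# Vector Gather: src predicate picks elements 1 and 3
twin_pred_mv(regs, rd=16, rs=8, vl=4, rd_isvec=True, rs_isvec=True, ps=0b1010)
# VSPLAT: scalar src, vector dest, no predication
twin_pred_mv(regs, rd=20, rs=4, vl=4, rd_isvec=True, rs_isvec=False)
# VEXTRACT: vector src with a 1-bit predicate, scalar dest
twin_pred_mv(regs, rd=3, rs=8, vl=4, rd_isvec=False, rs_isvec=True, ps=0b0100)
print(regs[16:20], regs[20:24], regs[3])
```

The same function produces gather, splat and extract purely from the choice of vector tags and predicate masks, with no change to the "opcode".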
 518
 519 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
 520 operations with zeroing off, and inversion on the src and dest predication
 521 for one of the two C.MV operations.  The non-inverted C.MV will place
 522 one set of registers into the destination, and the inverted one the other
 523 set.  With predicate-inversion, copying and inversion of the predicate mask
 524 need not be done as a separate (scalar) instruction.
 525
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
 531
### FMV, FNEG and FABS Instructions

These are identical in form to C.MV, except covering floating-point
register copying.  The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions.)
 542
### FCVT Instructions

These are again identical in form to C.MV, except that they cover
floating-point to integer and integer to floating-point conversion.  When
element width in each vector is set to default, the instructions behave
exactly as they are defined for standard RV (scalar) operations, except
vectorised in exactly the same fashion as outlined in C.MV.

However when the source or destination element width is not set to default,
the opcode's explicit element widths are *over-ridden* to new definitions,
and the opcode's element width is taken as indicative of the SIMD width
(if applicable i.e. if packed SIMD is requested) instead.

For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision (32-bit) floating-point
number in rd.
If however the source rs1 is set to be a vector, where elwidth is set to
default/2 and "packed SIMD" is enabled, then the first 32 bits of
rs1 are converted to a floating-point number to be stored in rd's
first element and the higher 32 bits *also* converted to floating-point
and stored in the second.  The 32 bit size comes from the fact that
FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
divide that by two it means that rs1 element width is to be taken as 32.
 565
 566 Similar rules apply to the destination register.
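A minimal Python sketch of the source-side behaviour in the FCVT.S.L example follows.  It is an approximation under assumptions: `fcvt_packed_halves` is a hypothetical name, the two 32-bit halves are assumed to be signed (as FCVT.S.L's source is), little-endian element order is assumed, and Python's `float` stands in for the destination floating-point format.

```python
import struct


def fcvt_packed_halves(rs1_value):
    """Split a 64-bit register into two signed 32-bit ints, convert each."""
    raw = struct.pack("<Q", rs1_value & 0xFFFFFFFFFFFFFFFF)
    lo, hi = struct.unpack("<ii", raw)  # element 0 = low half, 1 = high half
    return [float(lo), float(hi)]


# low half holds -2 (0xFFFFFFFE), high half holds 5
print(fcvt_packed_halves((5 << 32) | 0xFFFFFFFE))
```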
 567
 568 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
 569
 570 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
 571 the interpretation of the instruction fields).  This
 572 actually undermined the fundamental principle of SV, namely that there
 573 be no modifications to the scalar behaviour (except where absolutely
 574 necessary), in order to simplify an implementor's task if considering
 575 converting a pre-existing scalar design to support parallelism.
 576
 577 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
 578 do not change in SV, however just as with C.MV it is important to note
 579 that dual-predication is possible.
 580
 581 In vectorised architectures there are usually at least two different modes
 582 for LOAD/STORE:
 583
 584 * Read (or write for STORE) from sequential locations, where one
 585   register specifies the address, and the one address is incremented
 586   by a fixed amount.  This is usually known as "Unit Stride" mode.
 587 * Read (or write) from multiple indirected addresses, where the
 588   vector elements each specify separate and distinct addresses.
 589
 590 To support these different addressing modes, the CSR Register "isvector"
 591 bit is used.  So, for a LOAD, when the src register is set to
 592 scalar, the LOADs are sequentially incremented by the src register
 593 element width, and when the src register is set to "vector", the
 594 elements are treated as indirection addresses.  Simplified
 595 pseudo-code would look like this:
 596
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode): each src element is an address
          srcbase = ireg[rsv+i];
        else
          # unit stride mode: rs is scalar so i never advances, use j
          srcbase = ireg[rsv] + j * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
 616
Notes:

* For simplicity, zeroing and elwidth are not included in the above:
  the key focus here is the decision-making for srcbase; vectorised
  rs means use sequentially-numbered registers as the indirection
  address, and scalar rs is "offset" mode.
* The test towards the end for whether both source and destination are
  scalar is what makes the above pseudo-code provide the "standard" RV
  Base behaviour for LD operations.
* The offset in bytes (XLEN/8) changes depending on whether the
  operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
  (8 bytes), and also whether the element width is over-ridden
  (see special element width section).
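The two addressing modes can be compared side by side with a small Python model.  This sketch makes several assumptions that go beyond the text: `vec_load` is a hypothetical helper, memory is a dictionary keyed by byte address, predication is omitted, the mode is selected by whether rs is tagged as a vector (as the prose describes), and the unit-stride offset advances with the destination element index.

```python
def vec_load(regs, mem, rd, rs, imm, vl, rd_isvec, rs_isvec, width=8):
    """Model of a vectorised LD: unit stride (scalar rs) or indirect (vector rs)."""
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            base = regs[rs + i]           # indirect: each element is an address
        else:
            base = regs[rs] + j * width   # unit stride: step by element size
        regs[rd + j] = mem[base + imm]
        if not rs_isvec and not rd_isvec:
            break                         # plain scalar LD
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1


mem = {0x100 + 8 * k: 100 + k for k in range(4)}
regs = [0] * 32
regs[5] = 0x100
vec_load(regs, mem, rd=10, rs=5, imm=0, vl=3, rd_isvec=True, rs_isvec=False)
unit_stride = regs[10:13]

regs[20:23] = [0x110, 0x100, 0x108]
vec_load(regs, mem, rd=10, rs=20, imm=0, vl=3, rd_isvec=True, rs_isvec=True)
indirect = regs[10:13]
print(unit_stride, indirect)
```

With a scalar base register the loads walk consecutive addresses, whereas with a vector of addresses they follow whatever order the address registers specify.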
 630
 631 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
 632
 633 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
 634 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
 635 It is therefore possible to use predicated C.LWSP to efficiently
 636 pop registers off the stack (by predicating x2 as the source), cherry-picking
 637 which registers to store to (by predicating the destination).  Likewise
 638 for C.SWSP.  In this way, LOAD/STORE-Multiple is efficiently achieved.
 639
 640 The two modes ("unit stride" and multi-indirection) are still supported,
 641 as with standard LD/ST.  Essentially, the only difference is that the
 642 use of x2 is hard-coded into the instruction.
 643
 644 **Note**: it is still possible to redirect x2 to an alternative target
 645 register.  With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
 646 general-purpose LOAD/STORE operations.
 647
 648 ## Compressed LOAD / STORE Instructions
 649
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE.  Again: setting scalar or vector mode
on the src for LOAD and on the dest for STORE selects "Unit Stride"
or "Multi-indirection" mode, respectively.
 655
 656 # Element bitwidth polymorphism <a name="elwidth"></a>
 657
 658 Element bitwidth is best covered as its own special section, as it
 659 is quite involved and applies uniformly across-the-board.  SV restricts
 660 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
 661
 662 The effect of setting an element bitwidth is to re-cast each entry
 663 in the register table, and for all memory operations involving
 664 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
now looks like this:
 667
 668     typedef union {
 669         uint8_t  b[8];
 670         uint16_t s[4];
 671         uint32_t i[2];
 672         uint64_t l[1];
 673     } reg_t;
 674
 675     // integer table: assume maximum SV 7-bit regfile size
 676     reg_t int_regfile[128];
 677
 678 where the CSR Register table entry (not the instruction alone) determines
 679 which of those union entries is to be used on each operation, and the
 680 VL element offset in the hardware-loop specifies the index into each array.
 681
However a naive interpretation of the data structure above masks the
fact that when VL is set greater than 8, for example, and the bitwidth is 8,
accessing one specific register "spills over" to the following parts of
the register file in a sequential fashion.  So a much more accurate way
to reflect this would be:
 687
 688     typedef union {
 689         uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
 690         uint8_t  b[0]; // array of type uint8_t
 691         uint16_t s[0];
 692         uint32_t i[0];
 693         uint64_t l[0];
 694         uint128_t d[0];
 695     } reg_t;
 696
 697     reg_t int_regfile[128];
 698
 699 where when accessing any individual regfile[n].b entry it is permitted
 700 (in c) to arbitrarily over-run the *declared* length of the array (zero),
 701 and thus "overspill" to consecutive register file entries in a fashion
 702 that is completely transparent to a greatly-simplified software / pseudo-code
 703 representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register
bytes is ever attempted.
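The "overspill" behaviour can be illustrated with a minimal sketch that models the register file as one flat array of bytes.  This is only an illustration, not part of the specification: the names `regfile`, `read_elem` and `write_elem` are hypothetical, RV64 (8 bytes per register) is assumed, and elements are assumed to be packed in little-endian order.

```python
XLEN_BYTES = 8    # assumption: RV64, so 8 "real" bytes per register
NUM_REGS = 128    # maximum SV 7-bit regfile size

# the whole register file as one contiguous run of bytes
regfile = bytearray(XLEN_BYTES * NUM_REGS)

def elem_range(reg, elwidth_bits, idx):
    """Byte range of element idx, counted from the start of register reg."""
    nbytes = elwidth_bits // 8
    start = reg * XLEN_BYTES + idx * nbytes
    end = start + nbytes
    if end > len(regfile):
        # access beyond the "real" register bytes must raise an exception
        raise IndexError("element access beyond end of register file")
    return start, end

def read_elem(reg, elwidth_bits, idx):
    start, end = elem_range(reg, elwidth_bits, idx)
    return int.from_bytes(regfile[start:end], "little")

def write_elem(reg, elwidth_bits, idx, val):
    start, end = elem_range(reg, elwidth_bits, idx)
    val &= (1 << elwidth_bits) - 1    # truncate to the element width
    regfile[start:end] = val.to_bytes(end - start, "little")

# 8-bit element 9 of register 5 "spills over" into byte 1 of register 6
write_elem(5, 8, 9, 0xAB)
```

Here index 9 at an element width of 8 is one register's worth of bytes plus one, so the write lands in the following register, exactly as the zero-length-array union permits.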
 708
Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size, where this pseudo-code is
otherwise identical to its "non" polymorphic versions (above):
 712
 713     function op_add(rd, rs1, rs2) # add not VADD!
 714       ...
 715       ...
 716       for (i = 0; i < VL; i++)
 717            ...
 718            ...
 719            // TODO, calculate if over-run occurs, for each elwidth
           if (elwidth == 8) {
               int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                        int_regfile[rs2].b[irs2];
           } else if (elwidth == 16) {
               int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                        int_regfile[rs2].s[irs2];
           } else if (elwidth == 32) {
               int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                        int_regfile[rs2].i[irs2];
           } else { // elwidth == 64
               int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                        int_regfile[rs2].l[irs2];
           }
 733            ...
 734            ...
 735
 736 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
 737 following sequentially on respectively from the same) are "type-cast"
 738 to 8-bit; for 16-bit entries likewise and so on.
 739
 740 However that only covers the case where the element widths are the same.
 741 Where the element widths are different, the following algorithm applies:
 742
 743 * Analyse the bitwidth of all source operands and work out the
 744   maximum.  Record this as "maxsrcbitwidth"
 745 * If any given source operand requires sign-extension or zero-extension
 746   (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
 747   sign-extension / zero-extension or whatever is specified in the standard
 748   RV specification, **change** that to sign-extending from the respective
 749   individual source operand's bitwidth from the CSR table out to
 750   "maxsrcbitwidth" (previously calculated), instead.
 751 * Following separate and distinct (optional) sign/zero-extension of all
 752   source operands as specifically required for that operation, carry out the
 753   operation at "maxsrcbitwidth".  (Note that in the case of LOAD/STORE or MV
 754   this may be a "null" (copy) operation, and that with FCVT, the changes
  to the source and destination bitwidths may also turn FCVT effectively
 756   into a copy).
 757 * If the destination operand requires sign-extension or zero-extension,
 758   instead of a mandatory fixed size (typically 32-bit for arithmetic,
  for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
 760   etc.), overload the RV specification with the bitwidth from the
 761   destination register's elwidth entry.
 762 * Finally, store the (optionally) sign/zero-extended value into its
 763   destination: memory for sb/sw etc., or an offset section of the register
 764   file for an arithmetic operation.
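The steps above can be sketched for a single pair of elements as follows.  This is a hedged illustration only: `polymorphic_op`, `zero_extend` and `sign_extend` are hypothetical helper names, the choice between sign and zero extension is reduced to one `signed` flag (in reality it depends on the opcode), and predication and register indexing are left out.

```python
def zero_extend(val, from_bits):
    """Keep the low from_bits bits; upper bits become zero."""
    return val & ((1 << from_bits) - 1)

def sign_extend(val, from_bits):
    """Interpret the low from_bits bits as a two's complement number."""
    val &= (1 << from_bits) - 1
    sign = 1 << (from_bits - 1)
    return (val ^ sign) - sign

def polymorphic_op(op, src1, w1, src2, w2, wd, signed=False):
    """Apply op to elements of w1 and w2 bits, returning a wd-bit result."""
    extend = sign_extend if signed else zero_extend
    maxsrcbitwidth = max(w1, w2)
    # extend each source from its own bitwidth out to maxsrcbitwidth
    a = extend(src1, w1)
    b = extend(src2, w2)
    # carry out the operation at maxsrcbitwidth (result wraps at that width)
    result = op(a, b) & ((1 << maxsrcbitwidth) - 1)
    # extend, or truncate, from maxsrcbitwidth to the destination bitwidth
    result = extend(result, maxsrcbitwidth)
    return result & ((1 << wd) - 1)

add = lambda a, b: a + b
# 16-bit + 8-bit sources: the add wraps at 16 bits before widening to 32
wrapped = polymorphic_op(add, 0xFFFF, 16, 0x01, 8, 32)
```

Note that the destination width plays no part until the final step: truncation to the destination happens after the operation, not before.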
 765
 766 In this way, polymorphic bitwidths are achieved without requiring a
 767 massive 64-way permutation of calculations **per opcode**, for example
 768 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
 769 rd bitwidths).  The pseudo-code is therefore as follows:
 770
 771     typedef union {
 772         uint8_t  b;
 773         uint16_t s;
 774         uint32_t i;
 775         uint64_t l;
 776     } el_reg_t;
 777
 778     bw(elwidth):
 779         if elwidth == 0: return xlen
 780         if elwidth == 1: return 8
 781         if elwidth == 2: return 16
 782         // elwidth == 3:
 783         return 32
 784
 785     get_max_elwidth(rs1, rs2):
 786         return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
 787                    bw(int_csr[rs2].elwidth)) # again XLEN if no entry
 788
    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res
 801
 802     set_polymorphed_reg(reg, bitwidth, offset, val):
 803         if (!int_csr[reg].isvec):
 804             # sign/zero-extend depending on opcode requirements, from
 805             # the reg's bitwidth out to the full bitwidth of the regfile
 806             val = sign_or_zero_extend(val, bitwidth, xlen)
 807             int_regfile[reg].l[0] = val
 808         elif bitwidth == 8:
 809             int_regfile[reg].b[offset] = val
 810         elif bitwidth == 16:
 811             int_regfile[reg].s[offset] = val
 812         elif bitwidth == 32:
 813             int_regfile[reg].i[offset] = val
 814         elif bitwidth == 64:
 815             int_regfile[reg].l[offset] = val
 816
      maxsrcwid = get_max_elwidth(rs1, rs2)  # source element width(s)
      destwid = bw(int_csr[rd].elwidth)      # destination element width
 819       for (i = 0; i < VL; i++)
 820         if (predval & 1<<i) # predication uses intregs
 821            // TODO, calculate if over-run occurs, for each elwidth
 822            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
 823            // TODO, sign/zero-extend src1 and src2 as operation requires
 824            if (op_requires_sign_extend_src1)
 825               src1 = sign_extend(src1, maxsrcwid)
 826            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
 827            result = src1 + src2 # actual add here
 828            // TODO, sign/zero-extend result, as operation requires
 829            if (op_requires_sign_extend_dest)
 830               result = sign_extend(result, maxsrcwid)
 831            set_polymorphed_reg(rd, destwid, ird, result)
           if (!int_csr[rd].isvec) break
        if (int_csr[rd ].isvec)  { id += 1; }
        if (int_csr[rs1].isvec)  { irs1 += 1; }
        if (int_csr[rs2].isvec)  { irs2 += 1; }
 836
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, it
should be clear from the above that:
 840
 841 * the source operands are extended out to the maximum bitwidth of all
 842   source operands
 843 * the operation takes place at that maximum source bitwidth (the
 844   destination bitwidth is not involved at this point, at all)
 845 * the result is extended (or potentially even, truncated) before being
 846   stored in the destination.  i.e. truncation (if required) to the
 847   destination width occurs **after** the operation **not** before.
 848 * when the destination is not marked as "vectorised", the **full**
 849   (standard, scalar) register file entry is taken up, i.e. the
 850   element is either sign-extended or zero-extended to cover the
 851   full register bitwidth (XLEN) if it is not already XLEN bits long.
 852
 853 Implementors are entirely free to optimise the above, particularly
 854 if it is specifically known that any given operation will complete
accurately in fewer bits, as long as the results produced are
 856 directly equivalent and equal, for all inputs and all outputs,
 857 to those produced by the above algorithm.
 858
 859 ## Polymorphic floating-point operation exceptions and error-handling
 860
 861 For floating-point operations, conversion takes place without raising any
 862 kind of exception.  Exactly as specified in the standard RV specification,
NaN (or appropriate) is stored if the result is beyond the range of the
 864 destination, and, again, exactly as with the standard RV specification
 865 just as with scalar operations, the floating-point flag is raised
 866 (FCSR).  And, again, just as with scalar operations, it is software's
 867 responsibility to check this flag.  Given that the FCSR flags are
 868 "accrued", the fact that multiple element operations could have occurred
 869 is not a problem.
 870
 871 Note that it is perfectly legitimate for floating-point bitwidths of
 872 only 8 to be specified.  However whilst it is possible to apply IEEE 754
 873 principles, no actual standard yet exists.  Implementors wishing to
 874 provide hardware-level 8-bit support rather than throw a trap to emulate
 875 in software should contact the author of this specification before
 876 proceeding.
 877
 878 ## Polymorphic shift operators
 879
 880 A special note is needed for changing the element width of left and
 881 right shift operators, particularly right-shift.  Even for standard RV
 882 base, in order for correct results to be returned, the second operand
 883 RS2 must be truncated to be within the range of RS1's bitwidth.
 884 spike's implementation of sll for example is as follows:
 885
 886     WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
 887
 888 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
 889 range 0..31 so that RS1 will only be left-shifted by the amount that
 890 is possible to fit into a 32-bit register.  Whilst this appears not
 891 to matter for hardware, it matters greatly in software implementations,
 892 and it also matters where an RV64 system is set to "RV32" mode, such
 893 that the underlying registers RS1 and RS2 comprise 64 hardware bits
 894 each.
 895
 896 For SV, where each operand's element bitwidth may be over-ridden, the
 897 rule about determining the operation's bitwidth *still applies*, being
 898 defined as the maximum bitwidth of RS1 and RS2.  *However*, this rule
 899 **also applies to the truncation of RS2**.  In other words, *after*
 900 determining the maximum bitwidth, RS2's range must **also be truncated**
 901 to ensure a correct answer.  Example:
 902
 903 * RS1 is over-ridden to a 16-bit width
 904 * RS2 is over-ridden to an 8-bit width
 905 * RD is over-ridden to a 64-bit width
 906 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
 907 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
 908
 909 Pseudocode (in spike) for this example would therefore be:
 910
 911     WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
 912
 913 This example illustrates that considerable care therefore needs to be
 914 taken to ensure that left and right shift operations are implemented
 915 correctly.  The key is that
 916
 917 * The operation bitwidth is determined by the maximum bitwidth
 918   of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
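As a hedged illustration of these two points, the following sketch performs a polymorphic left shift on one pair of elements.  The function name `poly_sll` is hypothetical, and one interpretation is assumed that the spike one-liner does not spell out: the shifted value is wrapped to the operation bitwidth before being sign-extended out to the destination.

```python
def poly_sll(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
    """Left shift at the maximum source bitwidth (a power of two)."""
    opwidth = max(rs1_bits, rs2_bits)
    src = rs1_val & ((1 << rs1_bits) - 1)      # zero-extend RS1
    shamt = rs2_val & ((1 << rs2_bits) - 1)    # RS2 at its own width...
    shamt &= opwidth - 1                       # ...truncated to 0..opwidth-1
    result = (src << shamt) & ((1 << opwidth) - 1)
    # sign-extend from the operation width, then fit to the destination
    sign = 1 << (opwidth - 1)
    result = (result ^ sign) - sign
    return result & ((1 << rd_bits) - 1)

# RS1 16-bit, RS2 8-bit, RD 64-bit: a shift amount of 0x11 becomes 1
shifted = poly_sll(0x0001, 16, 0x11, 8, 64)
```

With a 16-bit operation width a requested shift of 16 is masked to zero, which matches the `RS2 & (16-1)` truncation in the example.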
 920
 921 ## Polymorphic MULH/MULHU/MULHSU
 922
 923 MULH is designed to take the top half MSBs of a multiply that
 924 does not fit within the range of the source operands, such that
 925 smaller width operations may produce a full double-width multiply
 926 in two cycles.  The issue is: SV allows the source operands to
 927 have variable bitwidth.
 928
 929 Here again special attention has to be paid to the rules regarding
 930 bitwidth, which, again, are that the operation is performed at
 931 the maximum bitwidth of the **source** registers.  Therefore:
 932
 933 * An 8-bit x 8-bit multiply will create a 16-bit result that must
 934   be shifted down by 8 bits
 935 * A 16-bit x 8-bit multiply will create a 24-bit result that must
 936   be shifted down by 16 bits (top 8 bits being zero)
 937 * A 16-bit x 16-bit multiply will create a 32-bit result that must
 938   be shifted down by 16 bits
 939 * A 32-bit x 16-bit multiply will create a 48-bit result that must
 940   be shifted down by 32 bits
 941 * A 32-bit x 8-bit multiply will create a 40-bit result that must
 942   be shifted down by 32 bits
 943
 944 So again, just as with shift-left and shift-right, the result
 945 is shifted down by the maximum of the two source register bitwidths.
 946 And, exactly again, truncation or sign-extension is performed on the
 947 result.  If sign-extension is to be carried out, it is performed
 948 from the same maximum of the two source register bitwidths out
 949 to the result element's bitwidth.
 950
 951 If truncation occurs, i.e. the top MSBs of the result are lost,
 952 this is "Officially Not Our Problem", i.e. it is assumed that the
 953 programmer actually desires the result to be truncated.  i.e. if the
 954 programmer wanted all of the bits, they would have set the destination
 955 elwidth to accommodate them.
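The shift-down rule can be sketched for the unsigned case (MULHU) as follows.  This is an illustrative sketch only: `poly_mulhu` is a hypothetical name, and the signed variants (MULH, MULHSU) would additionally need sign-extension of the sources, which is not shown.

```python
def poly_mulhu(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
    """Unsigned high-half multiply at the maximum source bitwidth."""
    opwidth = max(rs1_bits, rs2_bits)
    a = rs1_val & ((1 << rs1_bits) - 1)
    b = rs2_val & ((1 << rs2_bits) - 1)
    # shift the full product down by the maximum source bitwidth
    high = (a * b) >> opwidth
    # truncate to the destination element width if it is narrower
    return high & ((1 << rd_bits) - 1)

# 16-bit x 8-bit: 24-bit product shifted down by 16, top 8 bits zero
high_16x8 = poly_mulhu(0xFFFF, 16, 0xFF, 8, 16)
```

Setting `rd_bits` smaller than the remaining result width silently drops the upper bits, which is the "Officially Not Our Problem" truncation described above.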
 956
 957 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
 958
 959 Polymorphic element widths in vectorised form means that the data
 960 being loaded (or stored) across multiple registers needs to be treated
 961 (reinterpreted) as a contiguous stream of elwidth-wide items, where
 962 the source register's element width is **independent** from the destination's.
 963
 964 This makes for a slightly more complex algorithm when using indirection
 965 on the "addressed" register (source for LOAD and destination for STORE),
 966 particularly given that the LOAD/STORE instruction provides important
 967 information about the width of the data to be reinterpreted.
 968
 969 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
 970 was as follows, and i is the loop from 0 to VL-1:
 971
 972     srcbase = ireg[rs+i];
 973     return mem[srcbase + imm]; // returns XLEN bits
 974
 975 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
 976 chunks are taken from the source memory location addressed by the current
 977 indexed source address register, and only when a full 32-bits-worth
 978 are taken will the index be moved on to the next contiguous source
 979 address register:
 980
 981     bitwidth = bw(elwidth);             // source elwidth from CSR reg entry
 982     elsperblock = 32 / bitwidth         // 1 if bw=32, 2 if bw=16, 4 if bw=8
 983     srcbase = ireg[rs+i/(elsperblock)]; // integer divide
 984     offs = i % elsperblock;             // modulo
 985     return &mem[srcbase + imm + offs];  // re-cast to uint8_t*, uint16_t* etc.
 986
 987 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
 988 and 128 for LQ.
 989
 990 The principle is basically exactly the same as if the srcbase were pointing
 991 at the memory of the *register* file: memory is re-interpreted as containing
 992 groups of elwidth-wide discrete elements.
 993
 994 When storing the result from a load, it's important to respect the fact
 995 that the destination register has its *own separate element width*.  Thus,
 996 when each element is loaded (at the source element width), any sign-extension
 997 or zero-extension (or truncation) needs to be done to the *destination*
 998 bitwidth.  Also, the storing has the exact same analogous algorithm as
 999 above, where in fact it is just the set\_polymorphed\_reg pseudocode
1000 (completely unchanged) used above.
1001
1002 One issue remains: when the source element width is **greater** than
1003 the width of the operation, it is obvious that a single LB for example
1004 cannot possibly obtain 16-bit-wide data.  This condition may be detected
1005 where, when using integer divide, elsperblock (the width of the LOAD
1006 divided by the bitwidth of the element) is zero.
1007
1008 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
1009
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
1011
1012 The elements, if the element bitwidth is larger than the LD operation's
1013 size, will then be sign/zero-extended to the full LD operation size, as
1014 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1015 being passed on to the second phase.
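The index arithmetic can be sketched on its own.  In this hedged example, `element_address` is a hypothetical helper that only computes which contiguous address register an element uses and its position within the loaded block; the clamp to a minimum of 1 is written with `max()`.

```python
def element_address(i, opwidth, elwidth_bits):
    """Map element index i to (address register offset, position in block)."""
    # number of elements per LOAD-sized block, never less than 1
    elsperblock = max(1, opwidth // elwidth_bits)
    return i // elsperblock, i % elsperblock

# LW (32-bit) with 16-bit elements: two elements per address register
lw_16 = [element_address(i, 32, 16) for i in range(4)]
```

For an LB with 16-bit elements the integer division gives zero, the clamp raises it to one, and each element is then taken from its own address register.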
1016
1017 As LOAD/STORE may be twin-predicated, it is important to note that
1018 the rules on twin predication still apply, except where in previous
1019 pseudo-code (elwidth=default for both source and target) it was
1020 the *registers* that the predication was applied to, it is now the
1021 **elements** that the predication is applied to.
1022
1023 Thus the full pseudocode for all LD operations may be written out
1024 as follows:
1025
1026     function LBU(rd, rs):
1027         load_elwidthed(rd, rs, 8, true)
1028     function LB(rd, rs):
1029         load_elwidthed(rd, rs, 8, false)
1030     function LH(rd, rs):
1031         load_elwidthed(rd, rs, 16, false)
1032     ...
1033     ...
1034     function LQ(rd, rs):
1035         load_elwidthed(rd, rs, 128, false)
1036
1037     # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
1038     function load_memory(rs, imm, i, opwidth):
1039         elwidth = int_csr[rs].elwidth
1040         bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
1042         srcbase = ireg[rs+i/(elsperblock)];
1043         offs = i % elsperblock;
1044         return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
1045
    function load_elwidthed(rd, rs, opwidth, unsigned):
      bitwidth = bw(int_csr[rs].elwidth) # source element width
      destwid = bw(int_csr[rd].elwidth)  # destination element width
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        val = load_memory(rs, imm, i, opwidth)
        if unsigned:
            val = zero_extend(val, min(opwidth, bitwidth))
        else:
            val = sign_extend(val, min(opwidth, bitwidth))
        set_polymorphed_reg(rd, destwid, j, val)
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break;
1063
1064 Note:
1065
1066 * when comparing against for example the twin-predicated c.mv
1067   pseudo-code, the pattern of independent incrementing of rd and rs
1068   is preserved unchanged.
1069 * just as with the c.mv pseudocode, zeroing is not included and must be
1070   taken into account (TODO).
1071 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1072   take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1073   VSCATTER characteristics.
1074 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1075   a destination that is not vectorised (marked as scalar) will
1076   result in the element being fully sign-extended or zero-extended
1077   out to the full register file bitwidth (XLEN).  When the source
1078   is also marked as scalar, this is how the compatibility with
1079   standard RV LOAD/STORE is preserved by this algorithm.
1080
1081 ### Example Tables showing LOAD elements
1082
1083 This section contains examples of vectorised LOAD operations, showing
1084 how the two stage process works (three if zero/sign-extension is included).
1085
1086
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1088
1089 This is:
1090
1091 * a 64-bit load, with an offset of zero
1092 * with a source-address elwidth of 16-bit
1093 * into a destination-register with an elwidth of 32-bit
1094 * where VL=7
1095 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1096 * RV64, where XLEN=64 is assumed.
1097
First, the memory table.  Because the element width is 16 and the
operation is LD (64), the 64 bits loaded from memory are subdivided
into groups of **four** elements.  With VL being 7 (deliberately,
to illustrate that this is reasonable and possible), the first four are
sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1104
1105 [[!table  data="""
1106 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1107 @x5  | elem 0         || elem 1         || elem 2         || elem 3         ||
1108 @x6  | elem 4         || elem 5         || elem 6         || not loaded     ||
1109 """]]
1110
1111 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1112 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1113
1114 [[!table  data="""
1115 byte 3 | byte 2 | byte 1 | byte 0 |
1116 0x0    | 0x0    | elem0          ||
1117 0x0    | 0x0    | elem1          ||
1118 0x0    | 0x0    | elem2          ||
1119 0x0    | 0x0    | elem3          ||
1120 0x0    | 0x0    | elem4          ||
1121 0x0    | 0x0    | elem5          ||
1122 0x0    | 0x0    | elem6          ||
1124 """]]
1125
1126 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1127 byte-addressable "memory".  That "memory" happens to cover registers
1128 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1129
1130 [[!table  data="""
1131 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1132 x8   | 0x0    | 0x0    | elem 1         || 0x0    | 0x0    | elem 0         ||
1133 x9   | 0x0    | 0x0    | elem 3         || 0x0    | 0x0    | elem 2         ||
1134 x10  | 0x0    | 0x0    | elem 5         || 0x0    | 0x0    | elem 4         ||
1135 x11  | **UNMODIFIED**                 |||| 0x0    | 0x0    | elem 6         ||
1136 """]]
1137
1138 Thus we have data that is loaded from the **addresses** pointed to by
1139 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1140 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1144
Note that whilst the memory addressing table is shown in left-to-right byte
order, the registers are shown in right-to-left (MSB first) order.  This does
**not** imply that bit or byte-reversal is carried out: it's just easier to
visualise memory as being contiguous bytes, and emphasises that registers
are not actually "memory" as such.
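The worked example can be reproduced with a short sketch.  Several things here are assumptions made for illustration: the name `vector_load`, the modelling of registers as a list of 64-bit integers, the little-endian element packing, the scaling of the in-block position by the element size in bytes, and the omission of predication and of the immediate offset.

```python
XLEN = 64

def vector_load(mem, addr_regs, regs, rd, vl, opwidth, src_bits, dst_bits):
    """Load vl elements of src_bits from memory into dst_bits-wide slots."""
    elsperblock = max(1, opwidth // src_bits)
    src_bytes = src_bits // 8
    per_reg = XLEN // dst_bits            # destination elements per register
    dst_mask = (1 << dst_bits) - 1
    for i in range(vl):
        # stage 1: pick the address register and the position in its block
        block, offs = divmod(i, elsperblock)
        addr = addr_regs[block] + offs * src_bytes
        val = int.from_bytes(mem[addr:addr + src_bytes], "little")
        # stage 2: zero-extend (or truncate) to the destination width
        val &= dst_mask
        # stage 3: store into the destination, preserving all other bits
        reg, slot = divmod(i, per_reg)
        shift = slot * dst_bits
        old = regs[rd + reg]
        regs[rd + reg] = (old & ~(dst_mask << shift)) | (val << shift)
    return regs

mem = bytes(range(32))                    # byte n holds the value n
regs = [0xFFFFFFFFFFFFFFFF] * 16          # all-ones shows what is preserved
vector_load(mem, [0, 16], regs, 8, 7, 64, 16, 32)   # x5 -> 0, x6 -> 16
```

After the call, `regs[8]` to `regs[10]` each hold two zero-extended elements, while the upper 32 bits of `regs[11]` still contain their original all-ones value.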
1150
1151 ## Why SV bitwidth specification is restricted to 4 entries
1152
The four entries for SV element bitwidths allow only three over-rides:

* 8 bit
* 16 bit
* 32 bit

This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides?  The answer here
is that it gets too complex, no RV128 implementation yet exists, and RV64's
default is 64 bit, so the 4 major element widths are covered anyway.
1163
There is an absolutely crucial aspect of SV here that explicitly
needs spelling out, and it's whether the "vectorised" bit is set in
the Register's CSR entry.
1167
1168 If "vectorised" is clear (not set), this indicates that the operation
1169 is "scalar".  Under these circumstances, when set on a destination (RD),
1170 then sign-extension and zero-extension, whilst changed to match the
1171 override bitwidth (if set), will erase the **full** register entry
1172 (64-bit if RV64).
1173
1174 When vectorised is *set*, this indicates that the operation now treats
1175 **elements** as if they were independent registers, so regardless of
1176 the length, any parts of a given actual register that are not involved
1177 in the operation are **NOT** modified, but are **PRESERVED**.
1178
1179 For example:
1180
1181 * when the vector bit is clear and elwidth set to 16 on the destination
1182   register, operations are truncated to 16 bit and then sign or zero
1183   extended to the *FULL* XLEN register width.
1184 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1185   groups of elwidth sized elements do not fill an entire XLEN register),
1186   the "top" bits of the destination register do *NOT* get modified, zero'd
1187   or otherwise overwritten.
1188
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of
the multi-element operation.
1192
1193 Other microarchitectures may choose to provide byte-level write-enable
1194 lines on the register file, such that each 64 bit register in an RV64
1195 system requires 8 WE lines.  Scalar RV64 operations would require
1196 activation of all 8 lines, where SV elwidth based operations would
1197 activate the required subset of those byte-level write lines.
1198
1199 Example:
1200
1201 * rs1, rs2 and rd are all set to 8-bit
1202 * VL is set to 3
1203 * RV64 architecture is set (UXL=64)
1204 * add operation is carried out
1205 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1206   concatenated with similar add operations on bits 15..8 and 7..0
1207 * bits 24 through 63 **remain as they originally were**.
1208
1209 Example SIMD micro-architectural implementation:
1210
1211 * SIMD architecture works out the nearest round number of elements
1212   that would fit into a full RV64 register (in this case: 8)
1213 * SIMD architecture creates a hidden predicate, binary 0b00000111
1214   i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1215 * SIMD architecture goes ahead with the add operation as if it
1216   was a full 8-wide batch of 8 adds
1217 * SIMD architecture passes top 5 elements through the adders
1218   (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them in rd.
1221
1222 This requires a read on rd, however this is required anyway in order
1223 to support non-zeroing mode.
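The hidden-predicate approach can be sketched for the 8-bit case as follows.  The function name `simd_add8` and the modelling of registers as 64-bit integers are assumptions made for this illustration; a real implementation would of course do this in the datapath rather than in a loop.

```python
def simd_add8(rd_old, rs1, rs2, vl):
    """8-lane, 8-bit add on 64-bit registers with a hidden predicate."""
    lanes = 8
    pred = (1 << min(vl, lanes)) - 1       # VL=3 gives 0b00000111
    result = 0
    for lane in range(lanes):
        shift = lane * 8
        if (pred >> lane) & 1:
            # active lane: 8-bit add, wrapping at the element width
            byte = ((rs1 >> shift) + (rs2 >> shift)) & 0xFF
        else:
            # masked-out lane: pass the old rd byte through unmodified
            byte = (rd_old >> shift) & 0xFF
        result |= byte << shift
    return result

rd = simd_add8(0xAAAAAAAAAAAAAAAA, 0x00FF0201, 0x00010101, vl=3)
```

The bottom three bytes of `rd` hold the three sums and the top five bytes are the original contents of rd, which is why the read of rd is needed.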
1224
1225 ## Polymorphic floating-point
1226
1227 Standard scalar RV integer operations base the register width on XLEN,
1228 which may be changed (UXL in USTATUS, and the corresponding MXL and
1229 SXL in MSTATUS and SSTATUS respectively).  Integer LOAD, STORE and
1230 arithmetic operations are therefore restricted to an active XLEN bits,
1231 with sign or zero extension to pad out the upper bits when XLEN has
1232 been dynamically set to less than the actual register size.
1233
1234 For scalar floating-point, the active (used / changed) bits are
1235 specified exclusively by the operation: ADD.S specifies an active
1236 32-bits, with the upper bits of the source registers needing to
1237 be all 1s ("NaN-boxed"), and the destination upper bits being
1238 *set* to all 1s (including on LOAD/STOREs).
1239
1240 Where elwidth is set to default (on any source or the destination)
1241 it is obvious that this NaN-boxing behaviour can and should be
1242 preserved.  When elwidth is non-default things are less obvious,
1243 so need to be thought through.  Here is a normal (scalar) sequence,
1244 assuming an RV64 which supports Quad (128-bit) FLEN:
1245
1246 * FLD loads 64-bit wide from memory.  Top 64 MSBs are set to all 1s
1247 * ADD.D performs a 64-bit-wide add.  Top 64 MSBs of destination set to 1s.
1248 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1249   top 64 MSBs ignored.
1250
1251 Therefore it makes sense to mirror this behaviour when, for example,
1252 elwidth is set to 32.  Assume elwidth set to 32 on all source and
1253 destination registers:
1254
1255 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1256   floating-point numbers.
1257 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1258   in bits 0-31 and the second in bits 32-63.
1259 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1260
1261 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1262 of the registers either during the FLD **or** the ADD.D.  The reason
1263 is that, effectively, the top 64 MSBs actually represent a completely
1264 independent 64-bit register, so overwriting it is not only gratuitous
1265 but may actually be harmful for a future extension to SV which may
1266 have a way to directly access those top 64 bits.
1267
1268 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1270 when "isvec" is false in a given register's CSR entry.  Only when the
1271 elwidth is set to default **and** isvec is false will the standard
1272 RV behaviour be followed, namely that the upper bits be modified.
1273
1274 Ultimately if elwidth is default and isvec false on *all* source
1275 and destination registers, a SimpleV instruction defaults completely
1276 to standard RV scalar behaviour (this holds true for **all** operations,
1277 right across the board).
1278
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
to a non-default value, are effectively all the same: they all still perform
multiple ADD operations, just at different widths.  A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1285
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value.  On a superscalar OoO architecture
there may be absolutely no difference.  Simpler SIMD-style
microarchitectures, however, may not have the infrastructure in
place to know the difference: when VL=8 and an ADD.D instruction
is issued, it may complete in 2 cycles (or more), where
an ADD.Q issued instead would complete in one.
1296
1297 ## Specific instruction walk-throughs
1298
1299 This section covers walk-throughs of the above-outlined procedure
1300 for converting standard RISC-V scalar arithmetic operations to
1301 polymorphic widths, to ensure that it is correct.
1302
1303 ### add
1304
1305 Standard Scalar RV32/RV64 (xlen):
1306
1307 * RS1 @ xlen bits
1308 * RS2 @ xlen bits
1309 * add @ xlen bits
1310 * RD @ xlen bits
1311
1312 Polymorphic variant:
1313
1314 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1315 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1316 * add @ max(rs1, rs2) bits
1317 * RD @ rd bits.  zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1318
1319 Note here that polymorphic add zero-extends its source operands,
1320 where addw sign-extends.
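
The polymorphic add steps listed above can be sketched as follows.  This is
an illustrative Python model only: `poly_add` and its argument names are
hypothetical, and element values are modelled as non-negative integers.

```python
def zero_extend(value, from_bits):
    """Keep the low from_bits bits; upper bits become zero."""
    return value & ((1 << from_bits) - 1)


def poly_add(src1, rs1_bits, src2, rs2_bits, rd_bits):
    """Polymorphic add: operate at max(rs1, rs2) bits, then fit into rd."""
    op_bits = max(rs1_bits, rs2_bits)
    a = zero_extend(src1, rs1_bits)
    b = zero_extend(src2, rs2_bits)
    result = (a + b) & ((1 << op_bits) - 1)  # add @ max(rs1, rs2) bits
    # zero-extending to a wider rd leaves the value unchanged;
    # a narrower rd truncates it
    return result & ((1 << rd_bits) - 1)
```

For example, an 8-bit 0xFF added to a 16-bit 0x0001 is performed at 16 bits,
giving 0x0100, whereas two 8-bit operands wrap at 8 bits before being
zero-extended into a wider destination.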
1321
1322 ### addw
1323
1324 The RV Specification specifically states that "W" variants of arithmetic
1325 operations always produce 32-bit signed values.  In a polymorphic
1326 environment it is reasonable to assume that the signed aspect is
1327 preserved, where it is the length of the operands and the result
1328 that may be changed.
1329
1330 Standard Scalar RV64 (xlen):
1331
1332 * RS1 @ xlen bits
1333 * RS2 @ xlen bits
1334 * add @ xlen bits
1335 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1336
1337 Polymorphic variant:
1338
1339 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1340 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1341 * add @ max(rs1, rs2) bits
1342 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1343
1344 Note here that polymorphic addw sign-extends its source operands,
1345 where add zero-extends.
1346
This requires a little more in-depth analysis.  Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur.  It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.
1351
1352 Effectively however, both rs1 and rs2 are being sign-extended (or
1353 truncated), where for add they are both zero-extended.  This holds true
1354 for all arithmetic operations ending with "W".
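
A minimal Python sketch of the sign-extending variant follows, to make the
contrast with zero-extension concrete.  The names `sign_extend` and
`poly_addw` are hypothetical, and values are held as unsigned bit patterns.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide bit pattern to to_bits bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits          # interpret as negative
    return value & ((1 << to_bits) - 1)  # back to an unsigned pattern


def poly_addw(src1, rs1_bits, src2, rs2_bits, rd_bits):
    """Polymorphic addw: as poly add, but sign-extending throughout."""
    op_bits = max(rs1_bits, rs2_bits)
    a = sign_extend(src1, rs1_bits, op_bits)
    b = sign_extend(src2, rs2_bits, op_bits)
    result = (a + b) & ((1 << op_bits) - 1)
    if rd_bits > op_bits:
        return sign_extend(result, op_bits, rd_bits)
    return result & ((1 << rd_bits) - 1)  # truncate
```

Here an 8-bit 0xFF is treated as -1, so adding a 16-bit 0x0001 gives zero,
where the zero-extending add would give 0x0100 for the same inputs.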
1355
1356 ### addiw
1357
1358 Standard Scalar RV64I:
1359
1360 * RS1 @ xlen bits, truncated to 32-bit
1361 * immed @ 12 bits, sign-extended to 32-bit
1362 * add @ 32 bits
* RD @ xlen bits.  sign-extend the 32-bit add result to xlen.
1364
1365 Polymorphic variant:
1366
1367 * RS1 @ rs1 bits
1368 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1369 * add @ max(rs1, 12) bits
1370 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
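
As a hedged illustration of the immediate case, the sketch below treats the
12-bit immediate as a second operand of fixed width.  `poly_addiw` is a
hypothetical name, and sign-extending RS1 when it is narrower than 12 bits
is an assumption based on the general rule for "W" operations, since the
list above does not state it.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide bit pattern to to_bits bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)


def poly_addiw(src1, rs1_bits, imm12, rd_bits):
    """Polymorphic addiw: operate at max(rs1, 12) bits."""
    op_bits = max(rs1_bits, 12)
    a = sign_extend(src1, rs1_bits, op_bits)  # assumption, see lead-in
    b = sign_extend(imm12, 12, op_bits)
    result = (a + b) & ((1 << op_bits) - 1)
    if rd_bits > op_bits:
        return sign_extend(result, op_bits, rd_bits)
    return result & ((1 << rd_bits) - 1)
```

An immediate of 0xFFF is -1, so with an 8-bit source of 0x10 the add takes
place at 12 bits and yields 0x00F.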
1371
1372 # Predication Element Zeroing
1373
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, saving power by avoiding a register read on elements
that are passed en-masse through the ALU.  Simpler microarchitectures
1378 do not have this issue: they simply do not pass the element through to
1379 the ALU at all, and therefore do not store it back in the destination.
1380 More complex non-lane-based micro-architectures can, when zeroing is
1381 not set, use the predication bits to simply avoid sending element-based
1382 operations to the ALUs, entirely: thus, over the long term, potentially
1383 keeping all ALUs 100% occupied even when elements are predicated out.
1384
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1391
1392 ## Single-predication (based on destination register)
1393
1394 Zeroing on predication for arithmetic operations is taken from
1395 the destination register's predicate.  i.e. the predication *and*
1396 zeroing settings to be applied to the whole operation come from the
1397 CSR Predication table entry for the destination register.
1398 Thus when zeroing is set on predication of a destination element,
1399 if the predication bit is clear, then the destination element is *set*
1400 to zero (twin-predication is slightly different, and will be covered
1401 next).
1402
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1405
      for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation: skip masked-out elements
           while (i < VL && !(predval & 1<<i))
             if (int_vec[rd ].isvector)  { ird += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
             i += 1 # advance to the next predicate bit
           if i == VL:
             return
        if (predval & 1<<i)
           src1 = ....
           src2 = ...
           result = src1 + src2 # actual add (or other op) here
           set_polymorphed_reg(rd, destwid, ird, result)
           if int_vec[rd].ffirst and result == 0:
              VL = i # result was zero, end loop early, return VL
              return
           if (!int_vec[rd].isvector) return
        else if zeroing:
           result = 0
           set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { ird += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (ird == VL or irs1 == VL or irs2 == VL): return
1432
1433 The optimisation to skip elements entirely is only possible for certain
1434 micro-architectures when zeroing is not set.  However for lane-based
1435 micro-architectures this optimisation may not be practical, as it
1436 implies that elements end up in different "lanes".  Under these
1437 circumstances it is perfectly fine to simply have the lanes
1438 "inactive" for predicated elements, even though it results in
1439 less than 100% ALU utilisation.
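
The difference between zeroing and non-zeroing can be seen in a reduced,
runnable model.  This is only a sketch under simplifying assumptions: all
three registers are vectors, element width is default, ffirst is off, and
`predicated_add` with its list-based registers is a hypothetical helper.

```python
def predicated_add(dest, src1, src2, predval, vl, zeroing):
    """Single-predicated element-wise add over the first vl elements."""
    out = list(dest)
    for i in range(vl):
        if predval & (1 << i):
            out[i] = src1[i] + src2[i]  # predicate bit set: do the op
        elif zeroing:
            out[i] = 0                  # masked out, zeroing: store zero
        # masked out, non-zeroing: destination element left untouched
    return out
```

With a predicate of 0b0101 and a destination initially holding all 9s,
zeroing gives `[11, 0, 33, 0]` for the inputs `[1, 2, 3, 4]` and
`[10, 20, 30, 40]`, whilst non-zeroing gives `[11, 9, 33, 9]`.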
1440
1441 ## Twin-predication (based on source and destination register)
1442
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1447
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1454
1455 When zeroing is set on the destination and not the source, then just
1456 as with single-predicated operations, a zero is stored into the destination
1457 element (or target memory address for a STORE).
1458
Zeroing on both source and destination effectively results in a bitwise
AND of the source and destination predicates: the result is that
where either source predicate OR destination predicate is set to 0,
a zero element will ultimately end up in the destination register.
1463
1464 However: this may not necessarily be the case for all operations;
1465 implementors, particularly of custom instructions, clearly need to
1466 think through the implications in each and every case.
1467
1468 Here is pseudo-code for a twin zero-predicated operation:
1469
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL):
        # without zeroing, masked-out elements are skipped entirely
        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
        if ((pd & 1<<j))
            if ((ps & 1<<i)) # source predicate selects data or zero
                sourcedata = ireg[rs+i];
            else
                sourcedata = 0
            ireg[rd+j] <= sourcedata
        else if (zerodst)
            ireg[rd+j] <= 0
        if (int_csr[rs].isvec)
            i++;
        if (int_csr[rd].isvec)
            j++;
        else
            if ((pd & 1<<j))
                break;
1493
1494 Note that in the instance where the destination is a scalar, the hardware
1495 loop is ended the moment a value *or a zero* is placed into the destination
1496 register/element.  Also note that, for clarity, variable element widths
1497 have been left out of the above.
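
A runnable approximation of the twin-predicated MV is given below.  It is a
sketch under stated assumptions rather than a reference model: both source
and destination are vectors, element widths are default, the register file
is a plain Python list, and `twin_pred_mv` is a hypothetical name.  Bounds
checks have been added to the skip loops so that the model always terminates.

```python
def twin_pred_mv(regs, rd, rs, vl, ps, pd, zerosrc, zerodst):
    """Twin-predicated move of up to vl elements from rs.. to rd.."""
    regs = list(regs)
    i = j = 0
    while i < vl and j < vl:
        if not zerosrc:  # non-zeroing source: skip masked-out elements
            while i < vl and not (ps >> i) & 1:
                i += 1
        if not zerodst:  # non-zeroing dest: skip masked-out elements
            while j < vl and not (pd >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break
        if (pd >> j) & 1:
            # source predicate selects real data or a zero
            regs[rd + j] = regs[rs + i] if (ps >> i) & 1 else 0
        elif zerodst:
            regs[rd + j] = 0
        i += 1
        j += 1
    return regs
```

With no zeroing, source elements selected by `ps` are packed into the
destination slots selected by `pd`.  With zeroing on both, the indices move
in lock-step and only positions where both predicates are set receive data.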
1498
1499 # Subsets of RV functionality
1500
1501 This section describes the differences when SV is implemented on top of
1502 different subsets of RV.
1503
1504 ## Common options
1505
1506 It is permitted to only implement SVprefix and not the VBLOCK instruction
1507 format option, and vice-versa.  UNIX Platforms **MUST** raise illegal
1508 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1509 traps may emulate the format.
1510
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details).  Again, UNIX Platforms
**MUST** raise illegal instruction on implementations that do not support
VL or SUBVL.
1515
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture.  However, going below
the mandatory limits set in the RV standard will result in non-compliance
with the SV Specification.
1520
1521 ## RV32 / RV32F
1522
1523 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1524 maximum limit for predication is also restricted to 32 bits.  Whilst not
1525 actually specifically an "option" it is worth noting.
1526
1527 ## RV32G
1528
Normally in standard RV32 it does not make much sense to have
RV32G.  The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1533
1534 In an earlier draft of SV, it was possible to specify an elwidth
1535 of double the standard register size: this had to be dropped,
1536 and may be reintroduced in future revisions.
1537
1538 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1539
1540 When floating-point is not implemented, the size of the User Register and
1541 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1542 per table).
1543
1544 ## RV32E
1545
1546 In embedded scenarios the User Register and Predication CSRs may be
1547 dropped entirely, or optionally limited to 1 CSR, such that the combined
1548 number of entries from the M-Mode CSR Register table plus U-Mode
1549 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1550 zero) only 2 16-bit entries (M-Mode CSR table only).  Likewise for
1551 the Predication CSR tables.
1552
1553 RV32E is the most likely candidate for simply detecting that registers
1554 are marked as "vectorised", and generating an appropriate exception
1555 for the VL loop to be implemented in software.
1556
1557 ## RV128
1558
RV128 has not been especially considered here; however, it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1563
1564 # Example usage
1565
1566 TODO evaluate strncpy and strlen
1567 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1568
## strncpy <a name="strncpy"></a>
1570
1571 RVV version:
1572
1573     strncpy:
1574         c.mv a3, a0               # Copy dst
1575     loop:
1576         setvli x0, a2, vint8    # Vectors of bytes.
1577         vlbff.v v1, (a1)        # Get src bytes
1578         vseq.vi v0, v1, 0       # Flag zero bytes
1579         vmfirst a4, v0          # Zero found?
1580         vmsif.v v0, v0          # Set mask up to and including zero byte.
1581         vsb.v v1, (a3), v0.t    # Write out bytes
1582         c.bgez a4, exit           # Done
1583         csrr t1, vl             # Get number of bytes fetched
1584         c.add a1, a1, t1          # Bump src pointer
1585         c.sub a2, a2, t1          # Decrement count.
1586         c.add a3, a3, t1          # Bump dst pointer
1587         c.bnez a2, loop           # Anymore?
1588
1589     exit:
1590         c.ret
1591
1592 SV version (WIP):
1593
1594     strncpy:
1595         c.mv a3, a0
1596         VBLK.RegCSR[t0] = 8bit, t0, vector
1597         VBLK.PredTb[t0] = ffirst, x0, inv
1598     loop:
1599         VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8)
1600         c.ldb t0, (a1) # t0 fail first mode
1601         c.bne t0, x0, allnonzero # still ff
1602         # VL (t4) points to last nonzero
1603         c.addi t4, t4, 1 # include zero
1604         c.stb t0, (a3)   # store incl zero
1605         c.ret            # end subroutine
1606     allnonzero:
1607         c.stb t0, (a3)    # VL legal range
1608         c.add a1, a1, t4  # Bump src pointer
1609         c.sub a2, a2, t4  # Decrement count.
1610         c.add a3, a3, t4  # Bump dst pointer
1611         c.bnez a2, loop   # Anymore?
1612     exit:
1613         c.ret
1614
1615 Notes:
1616
1617 * Setting MVL to 8 is just an example. If enough registers are spare it
1618   may be set to XLEN which will require a bank of 8 scalar registers for
1619   a1, a3 and t0.
1620 * obviously if that is done, t0 is not separated by 8 full registers, and
1621   would overwrite t1 thru t7. x80 would work well, as an example, instead.
1622 * with the exception of the GETVL (a pseudo code alias for csrr), every
1623   single instruction above may use RVC.
1624 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1625   registers through redirection
1626 * RVC C.LW and C.SW may be used because the W format may be overridden by
1627   the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1628 * with the exception of the GETVL, all Vector Context may be done in
1629   VBLOCK form.
1630 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1631   just ffirst on t0
1632 * ldb and bne are both using t0, both in ffirst mode
1633 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1634   vectorised, no (un)sign-extension or truncation" mode.
1635 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff
1636   into t0 (could contain zeros).
1637 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1638   scalar x0
1639 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1640   compares, and reduce VL as well
1641 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1644 * SETVL sets *exactly* the requested amount into VL.
1645 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1646   activates but the bne allzeros does not.
1647 * this would cause the stb to copy up to the end of the legal memory
1648 * of course, on the next loop the ldb would throw a trap, as a1 now
1649   points to the first illegal mem location.
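
The interaction of the two fail-on-first steps in the notes above can be
sketched in Python.  This is an illustrative model only, not the SV
semantics in full: memory is a `bytearray`, any address past its end is
assumed to be illegal, and `ffirst_load`, `ffirst_nonzero` and `strncpy_sv`
are hypothetical names.

```python
def ffirst_load(mem, addr, vl):
    """Fail-on-first byte load: only element 0 may trap."""
    data = []
    for k in range(vl):
        if addr + k >= len(mem):  # assumed model of an illegal address
            if k == 0:
                raise MemoryError("trap on first element")
            break                 # later elements: truncate VL instead
        data.append(mem[addr + k])
    return data, len(data)


def ffirst_nonzero(data, vl):
    """Fail-on-first compare with zero: VL stops at the first zero."""
    for k in range(vl):
        if data[k] == 0:
            return False, k
    return True, vl


def strncpy_sv(mem, dst, src, n, mvl=8):
    while n > 0:
        vl = min(n, mvl)                            # SETVLI
        data, vl = ffirst_load(mem, src, vl)        # ldb in ffirst mode
        all_nonzero, vl = ffirst_nonzero(data, vl)  # bne in ffirst mode
        if not all_nonzero:
            vl += 1                                 # include the zero
            mem[dst:dst + vl] = bytes(data[:vl])
            return
        mem[dst:dst + vl] = bytes(data[:vl])
        src, dst, n = src + vl, dst + vl, n - vl
```

When the load is cut short by illegal memory and no zero is found, the bytes
fetched so far are still stored, and the trap is only taken on the next
iteration, when the first element of the load is itself illegal.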
1650
## strlen
1652
1653 RVV version:
1654
1655         mv a3, a0             # Save start
1656     loop:
1657         setvli a1, x0, vint8  # byte vec, x0 (Zero reg) => use max hardware len
1658         vldbff.v v1, (a3)     # Get bytes
1659         csrr a1, vl           # Get bytes actually read e.g. if fault
1660         vseq.vi v0, v1, 0     # Set v0[i] where v1[i] = 0
1661         add a3, a3, a1        # Bump pointer
1662         vmfirst a2, v0        # Find first set bit in mask, returns -1 if none
1663         bltz a2, loop         # Not found?
1664         add a0, a0, a1        # Sum start + bump
1665         add a3, a3, a2        # Add index of zero byte
1666         sub a0, a3, a0        # Subtract start address+bump
1667         ret
1668
1669 ## DAXPY <a name="daxpy"></a>
1670
1671 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1672
1673 Notes:
1674
1675 * Setting MVL to 4 is just an example.  With enough space between the
1676   FP regs, MVL may be set to larger values
1677 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1678   taking only another 16 bits, VBLOCK.SETVL requires 16 bits.  Total
1679   overhead for use of VBLOCK: 48 bits (3 16-bit words).
1680 * All instructions except fmadd may use Compressed variants.  Total
1681   number of 16-bit instruction words: 11.
1682 * Total: 14 16-bit words.  By contrast, RVV requires around 18 16-bit words.
1683
1684 ## BigInt add <a name="bigadd"></a>
1685
1686 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]