1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 25 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Instructions <a name="instructions" />
11
12 Despite being a 98% complete and accurate topological remap of RVV
13 concepts and functionality, no new instructions are needed.
14 Compared to RVV: *All* RVV instructions can be re-mapped, however xBitManip
15 becomes a critical dependency for efficient manipulation of predication
16 masks (as a bit-field). Despite none of the RVV opcodes themselves being
17 added, and with the exception of CLIP and VSELECT.X,
18 *all instructions from RVV Base are topologically re-mapped and retain their
19 complete functionality, intact*. Note that if RV64G ever had
20 an MV.X added as well as FCLIP, the full functionality of RVV-Base would
21 be obtained in SV.
22
23 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
24 equivalents, so are left out of Simple-V. VSELECT could be included if
25 there existed a MV.X instruction in RV (MV.X is a hypothetical
26 non-immediate variant of MV that would allow another register to
27 specify which register was to be copied). Note that if any of these three
28 instructions are added to any given RV extension, their functionality
29 will be inherently parallelised.
30
31 With some exceptions, where it does not make sense or is simply too
32 challenging, all RV-Base instructions are parallelised:
33
34 * CSR instructions, whilst a case could be made for fast-polling of
35 a CSR into multiple registers, or for being able to copy multiple
36 contiguously addressed CSRs into contiguous registers, and so on,
37 are the fundamental core basis of SV. If parallelised, extreme
38 care would need to be taken. Additionally, CSR reads are done
39 using x0, and it is *really* inadvisable to tag x0.
40 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
41 left as scalar.
42 * LR/SC could hypothetically be parallelised however their purpose is
43 single (complex) atomic memory operations where the LR must be followed
44 up by a matching SC. A sequence of parallel LR instructions followed
45 by a sequence of parallel SC instructions therefore is guaranteed to
46 not be useful. Not least: the guarantees of a Multi-LR/SC
47 would be impossible to provide if emulated in a trap.
48 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
49 paralleliseable anyway.
50
51 All other operations using registers are automatically parallelised.
52 This includes AMOMAX, AMOSWAP and so on, where particular care and
53 attention must be paid.
54
55 Example pseudo-code for an integer ADD operation (including scalar
56 operations). Floating-point uses the FP Register Table.
57
58 function op_add(rd, rs1, rs2) # add not VADD!
59   int i, id=0, irs1=0, irs2=0;
60   predval = get_pred_val(FALSE, rd);
61   rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
62   rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
63   rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
64   for (i = 0; i < VL; i++)
65     xSTATE.srcoffs = i # save context
66     if (predval & 1<<i) # predication uses intregs
67        ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
68        if (!int_vec[rd ].isvector) break;
69     if (int_vec[rd ].isvector)  { id += 1; }
70     if (int_vec[rs1].isvector)  { irs1 += 1; }
71     if (int_vec[rs2].isvector)  { irs2 += 1; }
72
73 Note that for simplicity there is quite a lot missing from the above
74 pseudo-code: element widths, zeroing on predication, dimensional
75 reshaping and offsets and so on. However it demonstrates the basic
76 principle. Augmentations that produce the full pseudo-code are covered in
77 other sections.
78
79 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
80
81 Adding in support for SUBVL is a matter of adding in an extra inner
82 for-loop, where register src and dest are still incremented inside the
83 inner part. Note that the predication is still taken from the VL index.
84
85 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
86 indexed by "(i)".
87
88 function op_add(rd, rs1, rs2) # add not VADD!
89   int i, id=0, irs1=0, irs2=0;
90   predval = get_pred_val(FALSE, rd);
91   rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
92   rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
93   rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
94   for (i = 0; i < VL; i++)
95     xSTATE.srcoffs = i # save context
96     for (s = 0; s < SUBVL; s++)
97       xSTATE.ssvoffs = s # save context
98       if (predval & 1<<i) # predication uses intregs
99          # actual add is here (at last)
100          ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
101          if (!int_vec[rd ].isvector) break;
102       if (int_vec[rd ].isvector)  { id += 1; }
103       if (int_vec[rs1].isvector)  { irs1 += 1; }
104       if (int_vec[rs2].isvector)  { irs2 += 1; }
105       if (id == VL or irs1 == VL or irs2 == VL) {
106         # end VL hardware loop
107         xSTATE.srcoffs = 0; # reset
108         xSTATE.ssvoffs = 0; # reset
109         return;
110       }
111
112
113 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
114 elwidth handling etc. all left out.
115
116 ## Instruction Format
117
118 It is critical to appreciate that there are
119 **no operations added to SV, at all**.
120
121 Instead, by using CSRs to tag registers as an indication of "changed
122 behaviour", SV *overloads* pre-existing branch operations into predicated
123 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
124 LOAD/STORE depending on CSR configurations for bitwidth and predication.
125 **Everything** becomes parallelised. *This includes Compressed
126 instructions* as well as any future instructions and Custom Extensions.
127
128 Note: using CSR tags to change the behaviour of instructions is nothing new,
129 including in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
130 FRM changes the behaviour of the floating-point unit, to alter the rounding
131 mode. Other architectures change the LOAD/STORE byte-order from big-endian
132 to little-endian on a per-instruction basis. SV is just a little more...
133 comprehensive in its effect on instructions.
134
135 ## Branch Instructions
136
137 Branch operations are augmented slightly to be a little more like FP
138 Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
139 of multiple comparisons into a register (taken indirectly from the predicate
140 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
141 See ffirst mode in the Predication Table section.
142
143 ### Standard Branch <a name="standard_branch"></a>
144
145 Branch operations use standard RV opcodes that are reinterpreted to
146 be "predicate variants" in the instance where either of the two src
147 registers are marked as vectors (active=1, vector=1).
148
149 Note that the predication register to use (if one is enabled) is taken from
150 the *first* src register, and that this is used, just as with predicated
151 arithmetic operations, to mask whether the comparison operations take
152 place or not. The target (destination) predication register
153 to use (if one is enabled) is taken from the *second* src register.
154
155 If either of src1 or src2 are scalars (whether by there being no
156 CSR register entry or whether by the CSR entry specifically marking
157 the register as "scalar") the comparison goes ahead as vector-scalar
158 or scalar-vector.
159
160 In instances where no vectorisation is detected on either src registers
161 the operation is treated as an absolutely standard scalar branch operation.
162 Where vectorisation is present on either or both src registers, the
163 branch may still go ahead if and only if *all* tests succeed (i.e. excluding
164 those tests that are predicated out).
165
166 Note that when zero-predication is enabled (from source rs1),
167 a cleared bit in the predicate indicates that the result
168 of the compare is set to "false", i.e. that the corresponding
169 destination bit (or result) is set to zero. Contrast this with
170 when zeroing is not set: bits in the destination predicate are
171 only *set*; they are **not** cleared. This is important to appreciate,
172 as there may be an expectation that, going into the hardware-loop,
173 the destination predicate is always expected to be set to zero:
174 this is **not** the case. The destination predicate is only set
175 to zero if **zeroing** is enabled.
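
A small C-style sketch may make the difference concrete (a minimal illustration only; the helper names predbit, cmpres and the destination predicate register destpred are invented for this example and are not part of the specification):

    // illustrative sketch: destination-predicate update with and without zeroing
    // predbit(i) is the source predicate bit for element i, cmpres(i) the compare result
    for (int i = 0; i < VL; i++) {
        if (predbit(i)) {
            if (cmpres(i)) destpred |=  (1 << i);   // set on success
            else           destpred &= ~(1 << i);   // clear on failure
        } else if (zeroing) {
            destpred &= ~(1 << i);                  // masked-out bit forced to zero
        }                                           // non-zeroing: bit left untouched
    }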
176
177 Note that just as with the standard (scalar, non-predicated) branch
178 operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
179 src1 and src2.
180
181 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
182 for predicated compare operations of function "cmp":
183
184 for (int i=0; i<vl; ++i)
185     if ([!]preg[p][i])
186         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
187                           s2 ? vreg[rs2][i] : sreg[rs2]);
188
189 With associated predication, vector-length adjustments and so on,
190 and temporarily ignoring bitwidth (which makes the comparisons more
191 complex), this becomes:
192
193 s1 = reg_is_vectorised(src1);
194 s2 = reg_is_vectorised(src2);
195
196 if not s1 && not s2
197     if cmp(rs1, rs2) # scalar compare
198         goto branch
199     return
200
201 preg = int_pred_reg[rd]
202 reg = int_regfile
203
204 ps = get_pred_val(I/F==INT, rs1);
205 rd = get_pred_val(I/F==INT, rs2); # this may not exist
206
207 if not exists(rd) or zeroing:
208     result = 0
209 else
210     result = preg[rd]
211
212 for (int i = 0; i < VL; ++i)
213     if (zeroing)
214         if not (ps & (1<<i))
215             result &= ~(1<<i);
216     else if (ps & (1<<i))
217         if (cmp(s1 ? reg[src1+i]:reg[src1],
218                 s2 ? reg[src2+i]:reg[src2]))
219             result |= 1<<i;
220         else
221             result &= ~(1<<i);
222
223 if not exists(rd)
224     if result == ps
225         goto branch
226 else
227     preg[rd] = result # store in destination
228     if preg[rd] == ps
229         goto branch
230
231 Notes:
232
233 * Predicated SIMD comparisons would break src1 and src2 further down
234 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
235 Reordering") setting Vector-Length times (number of SIMD elements) bits
236 in Predicate Register rd, as opposed to just Vector-Length bits.
237 * The execution of "parallelised" instructions **must** be implemented
238 as "re-entrant" (to use a term from software). If an exception (trap)
239 occurs during the middle of a vectorised
240 Branch (now a SV predicated compare) operation, the partial results
241 of any comparisons must be written out to the destination
242 register before the trap is permitted to begin. If however there
243 is no predicate, the **entire** set of comparisons must be **restarted**,
244 with the offset loop indices set back to zero. This is because
245 there is no place to store the temporary result during the handling
246 of traps.
247
248 TODO: predication is now taken from src2. Also, the branch goes ahead
249 if all compares are successful.
250
251 Note also that where normally, predication requires that there must
252 also be a CSR register entry for the register being used in order
253 for the **predication** CSR register entry to also be active,
254 for branches this is **not** the case. src2 does **not** have
255 to have its CSR register entry marked as active in order for
256 predication on src2 to be active.
257
258 Also note: SV Branch operations are **not** twin-predicated
259 (see Twin Predication section). This would require three
260 element offsets: one to track src1, one to track src2 and a third
261 to track where to store the accumulation of the results. Given
262 that the element offsets need to be exposed via CSRs so that
263 the parallel hardware looping may be made re-entrant on traps
264 and exceptions, the decision was made not to make SV Branches
265 twin-predicated.
266
267 ### Floating-point Comparisons
268
269 There are no floating-point branch operations, only compares.
270 Interestingly no change is needed to the instruction format because
271 FP Compare already stores a 1 or a zero in its "rd" integer register
272 target, i.e. it's not actually a Branch at all: it's a compare.
273
274 In RV (scalar) Base, a branch on a floating-point compare is
275 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
276 This does extend to SV, as long as x1 (in the example sequence given)
277 is vectorised. When that is the case, x1..x(1+VL-1) will also be
278 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
279 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
280 so on. Consequently, unlike integer-branch, FP Compare needs no
281 modification in its behaviour.
282
283 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
284 and whilst in ordinary branch code this is fine because the standard
285 RVF compare can always be followed up with an integer BEQ or a BNE (or
286 a compressed comparison to zero or non-zero), in predication terms that
287 becomes more of an impact. To deal with this, SV's predication has
288 had "invert" added to it.
289
290 Also: note that FP Compare may be predicated, using the destination
291 integer register (rd) to determine the predicate. FP Compare is **not**
292 a twin-predication operation, as, again, just as with SV Branches,
293 there are three registers involved: FP src1, FP src2 and INT rd.
294
295 Also: note that ffirst (fail first mode) applies directly to this operation.
296
297 ### Compressed Branch Instruction
298
299 Compressed Branch instructions are, just like standard Branch instructions,
300 reinterpreted to be vectorised and predicated based on the source register
301 (rs1s) CSR entries. As however there is only the one source register,
302 given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
303 to store the results of the comparisons is taken from CSR predication
304 table entries for **x0**.
305
306 The specific required use of x0 is, with a little thought, quite obvious,
307 even if initially counterintuitive. Clearly it is **not** recommended to redirect
308 x0 with a CSR register entry, however as a means to opaquely obtain
309 a predication target it is the only sensible option that does not involve
310 additional special CSRs (or, worse, additional special opcodes).
311
312 Note also that, just as with standard branches, the 2nd source
313 (in this case x0 rather than src2) does **not** have to have its CSR
314 register table marked as "active" in order for predication to work.
315
316 ## Vectorised Dual-operand instructions
317
318 There is a series of 2-operand instructions involving copying (and
319 sometimes alteration):
320
321 * C.MV
322 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
323 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
324 * LOAD(-FP) and STORE(-FP)
325
326 All of these operations follow the same two-operand pattern, so it is
327 *both* the source *and* destination predication masks that are taken into
328 account. This is different from
329 the three-operand arithmetic instructions, where the predication mask
330 is taken from the *destination* register, and applied uniformly to the
331 elements of the source register(s), element-for-element.
332
333 The pseudo-code pattern for twin-predicated operations is as
334 follows:
335
336 function op(rd, rs):
337   rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
338   rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
339   ps = get_pred_val(FALSE, rs); # predication on src
340   pd = get_pred_val(FALSE, rd); # ... AND on dest
341   for (int i = 0, int j = 0; i < VL && j < VL;):
342     if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
343     if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
344     xSTATE.srcoffs = i # save context
345     xSTATE.destoffs = j # save context
346     reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
347     if (int_csr[rs].isvec) i++;
348     if (int_csr[rd].isvec) j++; else break
349
350 This pattern covers scalar-scalar, scalar-vector, vector-scalar
351 and vector-vector, and predicated variants of all of those.
352 Zeroing is not presently included (TODO). As such, when compared
353 to RVV, the twin-predicated variants of C.MV and FMV cover
354 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
355 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
356
357 Note that:
358
359 * elwidth (SIMD) is not covered in the pseudo-code above
360 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
361 not covered
362 * zero predication is also not shown (TODO).
363
364 ### C.MV Instruction <a name="c_mv"></a>
365
366 There is no MV instruction in RV, however there is a C.MV instruction.
367 It is used for copying integer-to-integer registers (vectorised FMV
368 is used for copying floating-point).
369
370 If either the source or the destination register are marked as vectors
371 C.MV is reinterpreted to be a vectorised (multi-register) predicated
372 move operation. The actual instruction's format does not change:
373
374 [[!table data="""
375 15 12 | 11 7 | 6 2 | 1 0 |
376 funct4 | rd | rs | op |
377 4 | 5 | 5 | 2 |
378 C.MV | dest | src | C0 |
379 """]]
380
381 A simplified version of the pseudocode for this operation is as follows:
382
383 function op_mv(rd, rs) # MV not VMV!
384   rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
385   rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
386   ps = get_pred_val(FALSE, rs); # predication on src
387   pd = get_pred_val(FALSE, rd); # ... AND on dest
388   for (int i = 0, int j = 0; i < VL && j < VL;):
389     if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
390     if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
391     xSTATE.srcoffs = i # save context
392     xSTATE.destoffs = j # save context
393     ireg[rd+j] <= ireg[rs+i];
394     if (int_csr[rs].isvec) i++;
395     if (int_csr[rd].isvec) j++; else break
396
397 There are several different instructions from RVV that are covered by
398 this one opcode:
399
400 [[!table data="""
401 src | dest | predication | op |
402 scalar | vector | none | VSPLAT |
403 scalar | vector | destination | sparse VSPLAT |
404 scalar | vector | 1-bit dest | VINSERT |
405 vector | scalar | 1-bit? src | VEXTRACT |
406 vector | vector | none | VCOPY |
407 vector | vector | src | Vector Gather |
408 vector | vector | dest | Vector Scatter |
409 vector | vector | src & dest | Gather/Scatter |
410 vector | vector | src == dest | sparse VCOPY |
411 """]]
412
413 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
414 operations with zeroing off, and inversion on the src and dest predication
415 for one of the two C.MV operations. The non-inverted C.MV will place
416 one set of registers into the destination, and the inverted one the other
417 set. With predicate-inversion, copying and inversion of the predicate mask
418 need not be done as a separate (scalar) instruction.
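
A minimal sketch of that VMERGE idea, assuming the same predicate mask pmask is applied to the source and destination of each C.MV (names follow the pseudocode above; illustrative only, not normative):

    // two back-to-back predicated C.MVs, zeroing off, second with inverted predicate
    for (int i = 0; i < VL; i++)
        if (pmask & (1 << i))    ireg[rd + i] = ireg[rs1 + i]; // first C.MV
    for (int i = 0; i < VL; i++)
        if (!(pmask & (1 << i))) ireg[rd + i] = ireg[rs2 + i]; // inverted C.MV
    // net effect: rd[i] = pmask[i] ? rs1[i] : rs2[i], i.e. VMERGE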
419
420 Note that in the instance where the Compressed Extension is not implemented,
421 MV may be used, but that is a pseudo-operation mapping to addi rd, x0, rs.
422 Note that the behaviour is **different** from C.MV because with addi the
423 predication mask to use is taken **only** from rd and is applied against
424 all elements: rd[i] = rs[i].
425
426 ### FMV, FNEG and FABS Instructions
427
428 These are identical in form to C.MV, except covering floating-point
429 register copying. The same double-predication rules also apply.
430 However when elwidth is not set to default the instruction is implicitly
431 and automatically converted to a (vectorised) floating-point type conversion
432 operation of the appropriate size covering the source and destination
433 register bitwidths.
434
435 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
436
437 ### FCVT Instructions
438
439 These are again identical in form to C.MV, except that they cover
440 floating-point to integer and integer to floating-point. When element
441 width in each vector is set to default, the instructions behave exactly
442 as they are defined for standard RV (scalar) operations, except vectorised
443 in exactly the same fashion as outlined in C.MV.
444
445 However when the source or destination element width is not set to default,
446 the opcode's explicit element widths are *over-ridden* to new definitions,
447 and the opcode's element width is taken as indicative of the SIMD width
448 (if applicable i.e. if packed SIMD is requested) instead.
449
450 For example FCVT.S.L would normally be used to convert a 64-bit
451 integer in register rs1 to a 64-bit floating-point number in rd.
452 If however the source rs1 is set to be a vector, where elwidth is set to
453 default/2 and "packed SIMD" is enabled, then the first 32 bits of
454 rs1 are converted to a floating-point number to be stored in rd's
455 first element and the higher 32-bits *also* converted to floating-point
456 and stored in the second. The 32 bit size comes from the fact that
457 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
458 divide that by two it means that rs1 element width is to be taken as 32.
459
460 Similar rules apply to the destination register.
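
A rough C sketch of that FCVT.S.L example (elwidth on rs1 halved, packed SIMD enabled), purely to illustrate how the 64-bit source is split into two elements; the register-file names ireg and freg and the element layout are invented for illustration:

    // rs1 is a 64-bit register holding two packed 32-bit integer elements
    uint64_t rs1val = ireg[rs1];
    int32_t  lo = (int32_t)(rs1val & 0xffffffff);
    int32_t  hi = (int32_t)(rs1val >> 32);
    freg[rd].f[0] = (float)lo;   // first destination element
    freg[rd].f[1] = (float)hi;   // second destination element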
461
462 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
463
464 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
465 the interpretation of the instruction fields). This
466 actually undermined the fundamental principle of SV, namely that there
467 be no modifications to the scalar behaviour (except where absolutely
468 necessary), in order to simplify an implementor's task if considering
469 converting a pre-existing scalar design to support parallelism.
470
471 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
472 do not change in SV, however just as with C.MV it is important to note
473 that dual-predication is possible.
474
475 In vectorised architectures there are usually at least two different modes
476 for LOAD/STORE:
477
478 * Read (or write for STORE) from sequential locations, where one
479 register specifies the address, and the one address is incremented
480 by a fixed amount. This is usually known as "Unit Stride" mode.
481 * Read (or write) from multiple indirected addresses, where the
482 vector elements each specify separate and distinct addresses.
483
484 To support these different addressing modes, the CSR Register "isvector"
485 bit is used. So, for a LOAD, when the src register is set to
486 scalar, the LOADs are sequentially incremented by the src register
487 element width, and when the src register is set to "vector", the
488 elements are treated as indirection addresses. Simplified
489 pseudo-code would look like this:
490
491 function op_ld(rd, rs) # LD not VLD!
492   rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
493   rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
494   ps = get_pred_val(FALSE, rs); # predication on src
495   pd = get_pred_val(FALSE, rd); # ... AND on dest
496   for (int i = 0, int j = 0; i < VL && j < VL;):
497     if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
498     if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
499     if (int_csr[rs].isvec)
500       # indirect mode (multi mode)
501       srcbase = ireg[rsv+i];
502     else
503       # unit stride mode
504       srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
505     ireg[rdv+j] <= mem[srcbase + imm_offs];
506     if (!int_csr[rs].isvec &&
507         !int_csr[rd].isvec) break # scalar-scalar LD
508     if (int_csr[rs].isvec) i++;
509     if (int_csr[rd].isvec) j++;
510
511 Notes:
512
513 * For simplicity, zeroing and elwidth is not included in the above:
514 the key focus here is the decision-making for srcbase; vectorised
515 rs means use sequentially-numbered registers as the indirection
516 address, and scalar rs is "offset" mode.
517 * The test towards the end for whether both source and destination are
518 scalar is what makes the above pseudo-code provide the "standard" RV
519 Base behaviour for LD operations.
520 * The offset in bytes (XLEN/8) changes depending on whether the
521 operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
522 (8 bytes), and also whether the element width is over-ridden
523 (see special element width section).
524
525 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
526
527 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
528 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
529 It is therefore possible to use predicated C.LWSP to efficiently
530 pop registers off the stack (by predicating x2 as the source), cherry-picking
531 which registers to store to (by predicating the destination). Likewise
532 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
533
534 The two modes ("unit stride" and multi-indirection) are still supported,
535 as with standard LD/ST. Essentially, the only difference is that the
536 use of x2 is hard-coded into the instruction.
537
538 **Note**: it is still possible to redirect x2 to an alternative target
539 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
540 general-purpose LOAD/STORE operations.
541
542 ## Compressed LOAD / STORE Instructions
543
544 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
545 where the same rules apply and the same pseudo-code applies as for
546 non-compressed LOAD/STORE. Again: setting scalar or vector mode
547 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
548 to "Multi-indirection", respectively.
549
550 # Element bitwidth polymorphism <a name="elwidth"></a>
551
552 Element bitwidth is best covered as its own special section, as it
553 is quite involved and applies uniformly across-the-board. SV restricts
554 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
555
556 The effect of setting an element bitwidth is to re-cast each entry
557 in the register table, and for all memory operations involving
558 load/stores of certain specific sizes, to a completely different width.
559 Thus, in C-style terms, on an RV64 architecture, each register
560 effectively now looks like this:
561
562 typedef union {
563     uint8_t  b[8];
564     uint16_t s[4];
565     uint32_t i[2];
566     uint64_t l[1];
567 } reg_t;
568
569 // integer table: assume maximum SV 7-bit regfile size
570 reg_t int_regfile[128];
571
572 where the CSR Register table entry (not the instruction alone) determines
573 which of those union entries is to be used on each operation, and the
574 VL element offset in the hardware-loop specifies the index into each array.
575
576 However a naive interpretation of the data structure above masks the
577 fact that, when the bitwidth is 8 and VL is set greater than 8 (for example),
578 accessing one specific register "spills over" to the following parts of
579 the register file in a sequential fashion. So a much more accurate way
580 to reflect this would be:
581
582 typedef union {
583     uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
584     uint8_t  b[0]; // array of type uint8_t
585     uint16_t s[0];
586     uint32_t i[0];
587     uint64_t l[0];
588     uint128_t d[0];
589 } reg_t;
590
591 reg_t int_regfile[128];
592
593 where when accessing any individual regfile[n].b entry it is permitted
594 (in c) to arbitrarily over-run the *declared* length of the array (zero),
595 and thus "overspill" to consecutive register file entries in a fashion
596 that is completely transparent to a greatly-simplified software / pseudo-code
597 representation.
598 It is however critical to note that it is clearly the responsibility of
599 the implementor to ensure that, towards the end of the register file,
600 an exception is thrown if any attempt is ever made to access beyond
601 the "real" register bytes.
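
To make the "overspill" behaviour concrete, here is a minimal, self-contained C sketch (for illustration only: the regfile size follows the declaration above, and the program deliberately relies on the same over-run idiom just described, which uses GCC-style zero-length arrays and is formally out-of-bounds in ISO C):

    #include <stdint.h>
    #include <stdio.h>

    typedef union {
        uint8_t  actual_bytes[8];
        uint8_t  b[0];
        uint16_t s[0];
        uint32_t i[0];
        uint64_t l[0];
    } reg_t;

    reg_t int_regfile[128];   // illustrative 7-bit regfile

    int main(void) {
        // with elwidth=8 and VL=12, elements 0..7 live in x5 and
        // elements 8..11 "overspill" into x6
        for (int el = 0; el < 12; el++)
            int_regfile[5].b[el] = (uint8_t)el;
        printf("x6 byte 0 = %d\n", int_regfile[6].b[0]);  // prints 8
        return 0;
    }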
602
603 Now we may modify the pseudo-code for an operation where all element bitwidths have
604 been set to the same size, where this pseudo-code is otherwise identical
605 to its "non" polymorphic versions (above):
606
607 function op_add(rd, rs1, rs2) # add not VADD!
608 ...
609 ...
610  for (i = 0; i < VL; i++)
611 ...
612 ...
613 // TODO, calculate if over-run occurs, for each elwidth
614 if (elwidth == 8) {
615    int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
616     int_regfile[rs2].b[irs2];
617 } else if elwidth == 16 {
618    int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
619     int_regfile[rs2].s[irs2];
620 } else if elwidth == 32 {
621    int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
622     int_regfile[rs2].i[irs2];
623 } else { // elwidth == 64
624    int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
625     int_regfile[rs2].l[irs2];
626 }
627 ...
628 ...
629
630 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
631 following sequentially on respectively from the same) are "type-cast"
632 to 8-bit; for 16-bit entries likewise and so on.
633
634 However that only covers the case where the element widths are the same.
635 Where the element widths are different, the following algorithm applies:
636
637 * Analyse the bitwidth of all source operands and work out the
638 maximum. Record this as "maxsrcbitwidth"
639 * If any given source operand requires sign-extension or zero-extension
640 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
641 sign-extension / zero-extension or whatever is specified in the standard
642 RV specification, **change** that to sign-extending from the respective
643 individual source operand's bitwidth from the CSR table out to
644 "maxsrcbitwidth" (previously calculated), instead.
645 * Following separate and distinct (optional) sign/zero-extension of all
646 source operands as specifically required for that operation, carry out the
647 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
648 this may be a "null" (copy) operation, and that with FCVT, the changes
649 to the source and destination bitwidths may also turn FCVT effectively
650 into a copy).
651 * If the destination operand requires sign-extension or zero-extension,
652 instead of a mandatory fixed size (typically 32-bit for arithmetic,
653 for subw for example, and otherwise various: 8-bit for sb, 16-bit for sw
654 etc.), overload the RV specification with the bitwidth from the
655 destination register's elwidth entry.
656 * Finally, store the (optionally) sign/zero-extended value into its
657 destination: memory for sb/sw etc., or an offset section of the register
658 file for an arithmetic operation.
659
660 In this way, polymorphic bitwidths are achieved without requiring a
661 massive 64-way permutation of calculations **per opcode**, for example
662 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
663 rd bitwidths). The pseudo-code is therefore as follows:
664
665 typedef union {
666     uint8_t  b;
667     uint16_t s;
668     uint32_t i;
669     uint64_t l;
670 } el_reg_t;
671
672 bw(elwidth):
673     if elwidth == 0: return xlen
674     if elwidth == 1: return 8
675     if elwidth == 2: return 16
676     // elwidth == 3:
677     return 32
678
679 get_max_elwidth(rs1, rs2):
680     return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
681                bw(int_csr[rs2].elwidth)) # again XLEN if no entry
682
683 get_polymorphed_reg(reg, bitwidth, offset):
684     el_reg_t res;
685     res.l = 0; // TODO: going to need sign-extending / zero-extending
686     if bitwidth == 8:
687         res.b = int_regfile[reg].b[offset]
688     elif bitwidth == 16:
689         res.s = int_regfile[reg].s[offset]
690     elif bitwidth == 32:
691         res.i = int_regfile[reg].i[offset]
692     elif bitwidth == 64:
693         res.l = int_regfile[reg].l[offset]
694     return res
695
696 set_polymorphed_reg(reg, bitwidth, offset, val):
697     if (!int_csr[reg].isvec):
698         # sign/zero-extend depending on opcode requirements, from
699         # the reg's bitwidth out to the full bitwidth of the regfile
700         val = sign_or_zero_extend(val, bitwidth, xlen)
701         int_regfile[reg].l[0] = val
702     elif bitwidth == 8:
703         int_regfile[reg].b[offset] = val
704     elif bitwidth == 16:
705         int_regfile[reg].s[offset] = val
706     elif bitwidth == 32:
707         int_regfile[reg].i[offset] = val
708     elif bitwidth == 64:
709         int_regfile[reg].l[offset] = val
710
711 maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
712 destwid = int_csr[rd].elwidth          # destination element width
713 for (i = 0; i < VL; i++)
714     if (predval & 1<<i) # predication uses intregs
715         // TODO, calculate if over-run occurs, for each elwidth
716         src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
717         // TODO, sign/zero-extend src1 and src2 as operation requires
718         if (op_requires_sign_extend_src1)
719             src1 = sign_extend(src1, maxsrcwid)
720         src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
721         result = src1 + src2 # actual add here
722         // TODO, sign/zero-extend result, as operation requires
723         if (op_requires_sign_extend_dest)
724             result = sign_extend(result, maxsrcwid)
725         set_polymorphed_reg(rd, destwid, id, result)
726         if (!int_vec[rd].isvector) break
727     if (int_vec[rd ].isvector)  { id += 1; }
728     if (int_vec[rs1].isvector)  { irs1 += 1; }
729     if (int_vec[rs2].isvector)  { irs2 += 1; }
730
731 Whilst specific sign-extension and zero-extension pseudocode call
732 details are left out, due to each operation being different, the above
733 should make clear that:
734
735 * the source operands are extended out to the maximum bitwidth of all
736 source operands
737 * the operation takes place at that maximum source bitwidth (the
738 destination bitwidth is not involved at this point, at all)
739 * the result is extended (or potentially even, truncated) before being
740 stored in the destination. i.e. truncation (if required) to the
741 destination width occurs **after** the operation **not** before.
742 * when the destination is not marked as "vectorised", the **full**
743 (standard, scalar) register file entry is taken up, i.e. the
744 element is either sign-extended or zero-extended to cover the
745 full register bitwidth (XLEN) if it is not already XLEN bits long.
746
747 Implementors are entirely free to optimise the above, particularly
748 if it is specifically known that any given operation will complete
749 accurately in less bits, as long as the results produced are
750 directly equivalent and equal, for all inputs and all outputs,
751 to those produced by the above algorithm.
752
753 ## Polymorphic floating-point operation exceptions and error-handling
754
755 For floating-point operations, conversion takes place without
756 raising any kind of exception. Exactly as specified in the standard
757 RV specification, NAN (or appropriate) is stored if the result
758 is beyond the range of the destination, and, again, exactly as
759 with the standard RV specification just as with scalar
760 operations, the floating-point flag is raised (FCSR). And, again, just as
761 with scalar operations, it is software's responsibility to check this flag.
762 Given that the FCSR flags are "accrued", the fact that multiple element
763 operations could have occurred is not a problem.
764
765 Note that it is perfectly legitimate for floating-point bitwidths of
766 only 8 to be specified. However whilst it is possible to apply IEEE 754
767 principles, no actual standard yet exists. Implementors wishing to
768 provide hardware-level 8-bit support rather than throw a trap to emulate
769 in software should contact the author of this specification before
770 proceeding.
771
772 ## Polymorphic shift operators
773
774 A special note is needed for changing the element width of left and right
775 shift operators, particularly right-shift. Even for standard RV base,
776 in order for correct results to be returned, the second operand RS2 must
777 be truncated to be within the range of RS1's bitwidth. spike's implementation
778 of sll for example is as follows:
779
780 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
781
782 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
783 range 0..31 so that RS1 will only be left-shifted by the amount that
784 is possible to fit into a 32-bit register. Whilst this appears not
785 to matter for hardware, it matters greatly in software implementations,
786 and it also matters where an RV64 system is set to "RV32" mode, such
787 that the underlying registers RS1 and RS2 comprise 64 hardware bits
788 each.
789
790 For SV, where each operand's element bitwidth may be over-ridden, the
791 rule about determining the operation's bitwidth *still applies*, being
792 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
793 **also applies to the truncation of RS2**. In other words, *after*
794 determining the maximum bitwidth, RS2's range must **also be truncated**
795 to ensure a correct answer. Example:
796
797 * RS1 is over-ridden to a 16-bit width
798 * RS2 is over-ridden to an 8-bit width
799 * RD is over-ridden to a 64-bit width
800 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
801 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
802
803 Pseudocode (in spike) for this example would therefore be:
804
805 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
806
807 This example illustrates that considerable care therefore needs to be
808 taken to ensure that left and right shift operations are implemented
809 correctly. The key is that
810
811 * The operation bitwidth is determined by the maximum bitwidth
812 of the *source registers*, **not** the destination register bitwidth
813 * The result is then sign-extended (or truncated) as appropriate, as in the sketch below.
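
A minimal C sketch of the worked example above (RS1 elwidth 16, RS2 elwidth 8, RD 64), purely illustrative and with the function name invented for this example:

    #include <stdint.h>

    // illustrative only: polymorphic SLL with RS1@16-bit, RS2@8-bit, RD@64-bit
    uint64_t poly_sll_16_8_64(uint16_t rs1, uint8_t rs2) {
        unsigned opwidth = 16;                       // max(16, 8)
        unsigned shamt   = rs2 & (opwidth - 1);      // RS2 truncated to 0..15
        uint16_t res16   = (uint16_t)(rs1 << shamt); // operation at 16 bits
        return (uint64_t)(int64_t)(int16_t)res16;    // sign-extend out to RD's 64 bits
    }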
814
815 ## Polymorphic MULH/MULHU/MULHSU
816
817 MULH is designed to take the top half MSBs of a multiply that
818 does not fit within the range of the source operands, such that
819 smaller width operations may produce a full double-width multiply
820 in two cycles. The issue is: SV allows the source operands to
821 have variable bitwidth.
822
823 Here again special attention has to be paid to the rules regarding
824 bitwidth, which, again, are that the operation is performed at
825 the maximum bitwidth of the **source** registers. Therefore:
826
827 * An 8-bit x 8-bit multiply will create a 16-bit result that must
828 be shifted down by 8 bits
829 * A 16-bit x 8-bit multiply will create a 24-bit result that must
830 be shifted down by 16 bits (top 8 bits being zero)
831 * A 16-bit x 16-bit multiply will create a 32-bit result that must
832 be shifted down by 16 bits
833 * A 32-bit x 16-bit multiply will create a 48-bit result that must
834 be shifted down by 32 bits
835 * A 32-bit x 8-bit multiply will create a 40-bit result that must
836 be shifted down by 32 bits
837
838 So again, just as with shift-left and shift-right, the result
839 is shifted down by the maximum of the two source register bitwidths.
840 And, exactly again, truncation or sign-extension is performed on the
841 result. If sign-extension is to be carried out, it is performed
842 from the same maximum of the two source register bitwidths out
843 to the result element's bitwidth.
844
845 If truncation occurs, i.e. the top MSBs of the result are lost,
846 this is "Officially Not Our Problem", i.e. it is assumed that the
847 programmer actually desires the result to be truncated. i.e. if the
848 programmer wanted all of the bits, they would have set the destination
849 elwidth to accommodate them.
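
As an illustrative (non-normative) C sketch of one of the cases above, a 16-bit x 8-bit signed MULH performed at the maximum source width of 16 bits, with a 16-bit destination element assumed:

    #include <stdint.h>

    // illustrative only: MULH with rs1 elwidth=16, rs2 elwidth=8, result elwidth=16
    int16_t poly_mulh_16x8(int16_t rs1, int8_t rs2) {
        int32_t a = rs1;                 // sign-extend both sources to a working width
        int32_t b = rs2;                 // (16x8 product fits easily in 32 bits)
        int32_t prod = a * b;
        return (int16_t)(prod >> 16);    // shift down by max(srcwidth) = 16, then truncate
    }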
850
851 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
852
853 Polymorphic element widths in vectorised form means that the data
854 being loaded (or stored) across multiple registers needs to be treated
855 (reinterpreted) as a contiguous stream of elwidth-wide items, where
856 the source register's element width is **independent** from the destination's.
857
858 This makes for a slightly more complex algorithm when using indirection
859 on the "addressed" register (source for LOAD and destination for STORE),
860 particularly given that the LOAD/STORE instruction provides important
861 information about the width of the data to be reinterpreted.
862
863 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
864 was as follows, and i is the loop from 0 to VL-1:
865
866 srcbase = ireg[rs+i];
867 return mem[srcbase + imm]; // returns XLEN bits
868
869 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
870 chunks are taken from the source memory location addressed by the current
871 indexed source address register, and only when a full 32-bits-worth
872 are taken will the index be moved on to the next contiguous source
873 address register:
874
875 bitwidth = bw(elwidth); // source elwidth from CSR reg entry
876 elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
877 srcbase = ireg[rs+i/(elsperblock)]; // integer divide
878 offs = i % elsperblock; // modulo
879 return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
880
881 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
882 and 128 for LQ.
883
884 The principle is basically exactly the same as if the srcbase were pointing
885 at the memory of the *register* file: memory is re-interpreted as containing
886 groups of elwidth-wide discrete elements.
887
888 When storing the result from a load, it's important to respect the fact
889 that the destination register has its *own separate element width*. Thus,
890 when each element is loaded (at the source element width), any sign-extension
891 or zero-extension (or truncation) needs to be done to the *destination*
892 bitwidth. Also, the storing has the exact same analogous algorithm as
893 above, where in fact it is just the set\_polymorphed\_reg pseudocode
894 (completely unchanged) used above.
895
896 One issue remains: when the source element width is **greater** than
897 the width of the operation, it is obvious that a single LB for example
898 cannot possibly obtain 16-bit-wide data. This condition may be detected
899 where, when using integer divide, elsperblock (the width of the LOAD
900 divided by the bitwidth of the element) is zero.
901
902 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
903
904 elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
905
906 The elements, if the element bitwidth is larger than the LD operation's
907 size, will then be sign/zero-extended to the full LD operation size, as
908 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
909 being passed on to the second phase.
910
911 As LOAD/STORE may be twin-predicated, it is important to note that
912 the rules on twin predication still apply, except where in previous
913 pseudo-code (elwidth=default for both source and target) it was
914 the *registers* that the predication was applied to, it is now the
915 **elements** that the predication is applied to.
916
917 Thus the full pseudocode for all LD operations may be written out
918 as follows:
919
920 function LBU(rd, rs):
921     load_elwidthed(rd, rs, 8, true)
922 function LB(rd, rs):
923     load_elwidthed(rd, rs, 8, false)
924 function LH(rd, rs):
925     load_elwidthed(rd, rs, 16, false)
926 ...
927 ...
928 function LQ(rd, rs):
929     load_elwidthed(rd, rs, 128, false)
930
931 # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
932 function load_memory(rs, imm, i, opwidth):
933     elwidth = int_csr[rs].elwidth
934     bitwidth = bw(elwidth);
935     elsperblock = max(1, opwidth / bitwidth)
936     srcbase = ireg[rs+i/(elsperblock)];
937     offs = i % elsperblock;
938     return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
939
940 function load_elwidthed(rd, rs, opwidth, unsigned):
941     destwid = int_csr[rd].elwidth # destination element width
942     rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
943     rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
944     ps = get_pred_val(FALSE, rs); # predication on src
945     pd = get_pred_val(FALSE, rd); # ... AND on dest
946     for (int i = 0, int j = 0; i < VL && j < VL;):
947         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
948         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
949         val = load_memory(rs, imm, i, opwidth)
950         if unsigned:
951             val = zero_extend(val, min(opwidth, bitwidth))
952         else:
953             val = sign_extend(val, min(opwidth, bitwidth))
954         set_polymorphed_reg(rd, bitwidth, j, val)
955         if (int_csr[rs].isvec) i++;
956         if (int_csr[rd].isvec) j++; else break;
957
958 Note:
959
960 * when comparing against for example the twin-predicated c.mv
961 pseudo-code, the pattern of independent incrementing of rd and rs
962 is preserved unchanged.
963 * just as with the c.mv pseudocode, zeroing is not included and must be
964 taken into account (TODO).
965 * that due to the use of a twin-predication algorithm, LOAD/STORE also
966 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
967 VSCATTER characteristics.
968 * that due to the use of the same set\_polymorphed\_reg pseudocode,
969 a destination that is not vectorised (marked as scalar) will
970 result in the element being fully sign-extended or zero-extended
971 out to the full register file bitwidth (XLEN). When the source
972 is also marked as scalar, this is how the compatibility with
973 standard RV LOAD/STORE is preserved by this algorithm.
974
975 ### Example Tables showing LOAD elements
976
977 This section contains examples of vectorised LOAD operations, showing
978 how the two stage process works (three if zero/sign-extension is included).
979
980
981 #### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
982
983 This is:
984
985 * a 64-bit load, with an offset of zero
986 * with a source-address elwidth of 16-bit
987 * into a destination-register with an elwidth of 32-bit
988 * where VL=7
989 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
990 * RV64, where XLEN=64 is assumed.
991
992 First, the memory table, which, due to the
993 element width being 16 and the operation being LD (64), the 64-bits
994 loaded from memory are subdivided into groups of **four** elements.
995 And, with VL being 7 (deliberately to illustrate that this is reasonable
996 and possible), the first four are sourced from the offset addresses pointed
997 to by x5, and the next three from the offset addresses pointed to by
998 the next contiguous register, x6:
999
1000 [[!table data="""
1001 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1002 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1003 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1004 """]]
1005
1006 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1007 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1008
1009 [[!table data="""
1010 byte 3 | byte 2 | byte 1 | byte 0 |
1011 0x0 | 0x0 | elem0 ||
1012 0x0 | 0x0 | elem1 ||
1013 0x0 | 0x0 | elem2 ||
1014 0x0 | 0x0 | elem3 ||
1015 0x0 | 0x0 | elem4 ||
1016 0x0 | 0x0 | elem5 ||
1017 0x0 | 0x0 | elem6 ||
1019 """]]
1020
1021 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1022 byte-addressable "memory". That "memory" happens to cover registers
1023 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1024
1025 [[!table data="""
1026 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1027 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1028 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1029 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1030 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1031 """]]
1032
1033 Thus we have data that is loaded from the **addresses** pointed to by
1034 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1035 x8 through to half of x11.
1036 The end result is that elements 0 and 1 end up in x8, with element 1 being
1037 shifted up 32 bits, and so on, until finally element 6 is in the
1038 LSBs of x11.
1039
1040 Note that whilst the memory addressing table is shown left-to-right byte order,
1041 the registers are shown in right-to-left (MSB) order. This does **not**
1042 imply that bit or byte-reversal is carried out: it's just easier to visualise
1043 memory as being contiguous bytes, and emphasises that registers are not
1044 really actually "memory" as such.
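
The same two-stage process can be sketched in C (illustrative only; the function name and the memory-pointer arguments are invented to match the example above):

    #include <stdint.h>

    // stage 1: gather 7 16-bit elements from the addresses held in x5 and x6
    // stage 2: zero-extend each to 32 bits and pack into x8..x11
    void example_ld(uint64_t xreg[32], const uint16_t *mem_at_x5,
                    const uint16_t *mem_at_x6) {
        uint32_t elems[7];
        for (int i = 0; i < 7; i++) {
            const uint16_t *src = (i < 4) ? mem_at_x5 : mem_at_x6;
            elems[i] = (uint32_t)src[i % 4];          // zero-extend 16 -> 32
        }
        for (int i = 0; i < 7; i++) {
            int regno = 8 + i / 2;                    // two 32-bit elements per register
            int shift = (i % 2) * 32;
            xreg[regno] &= ~(0xffffffffULL << shift); // clear only the target half
            xreg[regno] |= (uint64_t)elems[i] << shift;
        }                                             // top half of x11 left unmodified
    }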
1045
1046 ## Why SV bitwidth specification is restricted to 4 entries
1047
1048 The four entries for SV element bitwidths only allows three over-rides:
1049
1050 * 8 bit
1051 * 16 bit
1052 * 32 bit
1053
1054 This would seem inadequate: surely it would be better to have 3 bits or
1055 more and allow 64, 128 and some other options besides. The answer here
1056 is that it gets too complex: no RV128 implementation yet exists, RV64's
1057 default is 64 bit, and so the 4 major element widths are covered anyway.
1058
1059 There is an absolutely crucial aspect of SV here that explicitly
1060 needs spelling out, and it's whether the "vectorised" bit is set in
1061 the Register's CSR entry.
1062
1063 If "vectorised" is clear (not set), this indicates that the operation
1064 is "scalar". Under these circumstances, when set on a destination (RD),
1065 then sign-extension and zero-extension, whilst changed to match the
1066 override bitwidth (if set), will erase the **full** register entry
1067 (64-bit if RV64).
1068
1069 When vectorised is *set*, this indicates that the operation now treats
1070 **elements** as if they were independent registers, so regardless of
1071 the length, any parts of a given actual register that are not involved
1072 in the operation are **NOT** modified, but are **PRESERVED**.
1073
1074 For example:
1075
1076 * when the vector bit is clear and elwidth set to 16 on the destination
1077 register, operations are truncated to 16 bit and then sign or zero
1078 extended to the *FULL* XLEN register width.
1079 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1080 groups of elwidth sized elements do not fill an entire XLEN register),
1081 the "top" bits of the destination register do *NOT* get modified, zero'd
1082 or otherwise overwritten.
1083
1084 SIMD micro-architectures may implement this by using predication on
1085 any elements in a given actual register that are beyond the end of
1086 a multi-element operation.
1087
1088 Other microarchitectures may choose to provide byte-level write-enable
1089 lines on the register file, such that each 64 bit register in an RV64
1090 system requires 8 WE lines. Scalar RV64 operations would require
1091 activation of all 8 lines, where SV elwidth based operations would
1092 activate the required subset of those byte-level write lines.
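
A sketch of how such a byte-level write-enable mask might be derived (purely illustrative; the function name and encoding are invented, and the elements are assumed to fit within a single 64-bit register):

    #include <stdint.h>

    // illustrative only: which of the 8 byte-lanes of a 64-bit destination
    // register are written, given the element width in bytes, the number of
    // elements landing in this register, and the starting byte offset
    uint8_t byte_write_enables(int elwidth_bytes, int nelems, int start_byte) {
        uint8_t we = 0;
        for (int e = 0; e < nelems; e++)
            for (int b = 0; b < elwidth_bytes; b++)
                we |= 1 << (start_byte + e * elwidth_bytes + b);
        return we;   // e.g. elwidth=1, nelems=3, start=0 -> 0b00000111
    }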
1093
1094 Example:
1095
1096 * rs1, rs2 and rd are all set to 8-bit
1097 * VL is set to 3
1098 * RV64 architecture is set (UXL=64)
1099 * add operation is carried out
1100 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1101 concatenated with similar add operations on bits 15..8 and 7..0
1102 * bits 24 through 63 **remain as they originally were**.
1103
1104 Example SIMD micro-architectural implementation:
1105
1106 * SIMD architecture works out the nearest round number of elements
1107 that would fit into a full RV64 register (in this case: 8)
1108 * SIMD architecture creates a hidden predicate, binary 0b00000111
1109 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1110 * SIMD architecture goes ahead with the add operation as if it
1111 was a full 8-wide batch of 8 adds
1112 * SIMD architecture passes top 5 elements through the adders
1113 (which are "disabled" due to zero-bit predication)
1114 * SIMD architecture gets the top 5 8-bit elements back unmodified
1115 and stores them in rd.
1116
1117 This requires a read on rd, however this is required anyway in order
1118 to support non-zeroing mode.
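
A correspondingly minimal sketch of that read-modify-write merge, assuming 8-bit lanes within a 64-bit register (the function name is invented; illustrative only):

    #include <stdint.h>

    // illustrative only: VL=3, elwidth=8, non-zeroing SIMD merge into rd
    uint64_t simd_merge_add(uint64_t rd_old, uint64_t rs1, uint64_t rs2, int vl) {
        uint64_t result = rd_old;                      // requires a read of rd
        for (int lane = 0; lane < vl; lane++) {
            uint8_t a = (rs1 >> (lane * 8)) & 0xff;
            uint8_t b = (rs2 >> (lane * 8)) & 0xff;
            uint8_t sum = (uint8_t)(a + b);
            result &= ~(0xffULL << (lane * 8));        // clear only the active lane
            result |=  ((uint64_t)sum << (lane * 8));  // lanes >= VL stay as they were
        }
        return result;
    }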
1119
1120 ## Polymorphic floating-point
1121
1122 Standard scalar RV integer operations base the register width on XLEN,
1123 which may be changed (UXL in USTATUS, and the corresponding MXL and
1124 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1125 arithmetic operations are therefore restricted to an active XLEN bits,
1126 with sign or zero extension to pad out the upper bits when XLEN has
1127 been dynamically set to less than the actual register size.
1128
1129 For scalar floating-point, the active (used / changed) bits are
1130 specified exclusively by the operation: ADD.S specifies an active
1131 32-bits, with the upper bits of the source registers needing to
1132 be all 1s ("NaN-boxed"), and the destination upper bits being
1133 *set* to all 1s (including on LOAD/STOREs).
1134
1135 Where elwidth is set to default (on any source or the destination)
1136 it is obvious that this NaN-boxing behaviour can and should be
1137 preserved. When elwidth is non-default things are less obvious,
1138 so need to be thought through. Here is a normal (scalar) sequence,
1139 assuming an RV64 which supports Quad (128-bit) FLEN:
1140
1141 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1142 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1143 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1144 top 64 MSBs ignored.
1145
1146 Therefore it makes sense to mirror this behaviour when, for example,
1147 elwidth is set to 32. Assume elwidth set to 32 on all source and
1148 destination registers:
1149
1150 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1151 floating-point numbers.
1152 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1153 in bits 0-31 and the second in bits 32-63.
1154 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1155
1156 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1157 of the registers either during the FLD **or** the ADD.D. The reason
1158 is that, effectively, the top 64 MSBs actually represent a completely
1159 independent 64-bit register, so overwriting it is not only gratuitous
1160 but may actually be harmful for a future extension to SV which may
1161 have a way to directly access those top 64 bits.
1162
1163 The decision is therefore **not** to touch the upper parts of floating-point
1164 registers wherever elwidth is set to non-default values, including
1165 when "isvec" is false in a given register's CSR entry. Only when the
1166 elwidth is set to default **and** isvec is false will the standard
1167 RV behaviour be followed, namely that the upper bits be modified.
1168
1169 Ultimately if elwidth is default and isvec false on *all* source
1170 and destination registers, a SimpleV instruction defaults completely
1171 to standard RV scalar behaviour (this holds true for **all** operations,
1172 right across the board).
1173
1174 The nice thing here is that ADD.S, ADD.D and ADD.Q when elwidth are
1175 non-default values are effectively all the same: they all still perform
1176 multiple ADD operations, just at different widths. A future extension
1177 to SimpleV may actually allow ADD.S to access the upper bits of the
1178 register, effectively breaking down a 128-bit register into a bank
1179 of 4 independently-accessible 32-bit registers.
1180
1181 In the meantime, although when e.g. setting VL to 8 it would technically
1182 make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
1183 using ADD.Q may be an easy way to signal to the microarchitecture that
1184 it is to receive a higher VL value. On a superscalar OoO architecture
1185 there may be absolutely no difference, however on simpler SIMD-style
1186 microarchitectures they may not necessarily have the infrastructure in
1187 place to know the difference, such that when VL=8 and an ADD.D instruction
1188 is issued, it completes in 2 cycles (or more) rather than one, where
1189 if an ADD.Q had been issued instead on such simpler microarchitectures
1190 it would complete in one.
1191
1192 ## Specific instruction walk-throughs
1193
1194 This section covers walk-throughs of the above-outlined procedure
1195 for converting standard RISC-V scalar arithmetic operations to
1196 polymorphic widths, to ensure that it is correct.
1197
1198 ### add
1199
1200 Standard Scalar RV32/RV64 (xlen):
1201
1202 * RS1 @ xlen bits
1203 * RS2 @ xlen bits
1204 * add @ xlen bits
1205 * RD @ xlen bits
1206
1207 Polymorphic variant:
1208
1209 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1210 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1211 * add @ max(rs1, rs2) bits
1212 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1213
1214 Note here that polymorphic add zero-extends its source operands,
1215 where addw sign-extends.
1216
1217 ### addw
1218
1219 The RV Specification specifically states that "W" variants of arithmetic
1220 operations always produce 32-bit signed values. In a polymorphic
1221 environment it is reasonable to assume that the signed aspect is
1222 preserved, where it is the length of the operands and the result
1223 that may be changed.
1224
1225 Standard Scalar RV64 (xlen):
1226
1227 * RS1 @ xlen bits
1228 * RS2 @ xlen bits
1229 * add @ xlen bits
1230 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1231
1232 Polymorphic variant:
1233
1234 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1235 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1236 * add @ max(rs1, rs2) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, rs2), otherwise truncate
1238
1239 Note here that polymorphic addw sign-extends its source operands,
1240 where add zero-extends.
1241
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1246
1247 Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
1248 where for add they are both zero-extended. This holds true for all arithmetic
1249 operations ending with "W".
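
A companion sketch (again Python-style, illustrative names only) of the
sign-extending variant, together with a worked example of the
width-mismatch case just described:

    def sext(value, from_bits, to_bits):
        value &= (1 << from_bits) - 1
        if value & (1 << (from_bits - 1)):     # sign bit set: extend with ones
            value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
        return value

    def poly_addw(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwid = max(rs1_bits, rs2_bits)
        a = sext(rs1_val, rs1_bits, opwid)     # sign-extend both sources
        b = sext(rs2_val, rs2_bits, opwid)
        result = (a + b) & ((1 << opwid) - 1)  # add @ max(rs1, rs2) bits
        if rd_bits > opwid:
            return sext(result, opwid, rd_bits)   # sign-extend to rd
        return result & ((1 << rd_bits) - 1)      # otherwise truncate

    # worked example: rs1 elwidth 8 holding 0xF0 (-16), rs2 elwidth 16
    # holding 0x0010 (+16): poly_addw(0xF0, 8, 0x0010, 16, 16) == 0x0000,
    # where the zero-extending add would have produced 0x0100.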
1250
1251 ### addiw
1252
1253 Standard Scalar RV64I:
1254
1255 * RS1 @ xlen bits, truncated to 32-bit
1256 * immed @ 12 bits, sign-extended to 32-bit
1257 * add @ 32 bits
* RD @ xlen bits: sign-extend the 32-bit add result to xlen.
1259
1260 Polymorphic variant:
1261
1262 * RS1 @ rs1 bits
1263 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1264 * add @ max(rs1, 12) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, 12), otherwise truncate
(see the sketch below)
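
The same pattern applies to the immediate form. The following sketch
(reusing the illustrative sext helper from the addw sketch above) is an
assumption of how the polymorphic addiw bullets translate, not a
normative definition:

    def poly_addiw(rs1_val, rs1_bits, imm12, rd_bits):
        opwid = max(rs1_bits, 12)
        a = rs1_val & ((1 << rs1_bits) - 1)    # RS1 @ rs1 bits
        b = sext(imm12, 12, opwid)             # immed sign-extended to opwid
        result = (a + b) & ((1 << opwid) - 1)  # add @ max(rs1, 12) bits
        if rd_bits > opwid:
            return sext(result, opwid, rd_bits)   # sign-extend to rd
        return result & ((1 << rd_bits) - 1)      # otherwise truncate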
1266
1267 # Predication Element Zeroing
1268
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to be able to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1279
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(i.e. whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1286
1287 ## Single-predication (based on destination register)
1288
Zeroing on predication for arithmetic operations is taken from
the destination register's predicate, i.e. the predication *and*
zeroing settings to be applied to the whole operation come from the
CSR Predication table entry for the destination register.
Thus when zeroing is set on predication of a destination element,
if the predication bit is clear, then the destination element is *set*
to zero (twin-predication is slightly different, and will be covered
next).
1297
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1300
     for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
           while (!(predval & 1<<i) && i < VL)
              if (int_vec[rd ].isvector)  { id += 1; }
              if (int_vec[rs1].isvector)  { irs1 += 1; }
              if (int_vec[rs2].isvector)  { irs2 += 1; }
           if i == VL:
              return
        if (predval & 1<<i)
           src1 = ....
           src2 = ...
           result = src1 + src2 # actual add (or other op) here
           set_polymorphed_reg(rd, destwid, ird, result)
           if int_vec[rd].ffirst and result == 0:
              VL = i # result was zero, end loop early, return VL
              return
           if (!int_vec[rd].isvector) return
        else if zeroing:
           result = 0
           set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { id += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (rd == VL or rs1 == VL or rs2 == VL): return
1327
1328 The optimisation to skip elements entirely is only possible for certain
1329 micro-architectures when zeroing is not set. However for lane-based
1330 micro-architectures this optimisation may not be practical, as it
1331 implies that elements end up in different "lanes". Under these
1332 circumstances it is perfectly fine to simply have the lanes
1333 "inactive" for predicated elements, even though it results in
1334 less than 100% ALU utilisation.
1335
1336 ## Twin-predication (based on source and destination register)
1337
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1342
When, with twin-predication, zeroing is set on the source and not
the destination, a cleared predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1349
1350 When zeroing is set on the destination and not the source, then just
1351 as with single-predicated operations, a zero is stored into the destination
1352 element (or target memory address for a STORE).
1353
Zeroing on both source and destination effectively results in the element
being zeroed wherever the bitwise AND of the source and destination
predicates is zero: where either the source predicate OR the destination
predicate is 0, a zero element will ultimately end up in the destination
register.
1358
1359 However: this may not necessarily be the case for all operations;
1360 implementors, particularly of custom instructions, clearly need to
1361 think through the implications in each and every case.
1362
1363 Here is pseudo-code for a twin zero-predicated operation:
1364
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i))
                sourcedata = ireg[rs+i];
             else
                sourcedata = 0
             ireg[rd+j] <= sourcedata
          else if (zerodst)
             ireg[rd+j] <= 0
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break;
1388
1389 Note that in the instance where the destination is a scalar, the hardware
1390 loop is ended the moment a value *or a zero* is placed into the destination
1391 register/element. Also note that, for clarity, variable element widths
1392 have been left out of the above.
1393
1394 # Subsets of RV functionality
1395
1396 This section describes the differences when SV is implemented on top of
1397 different subsets of RV.
1398
1399 ## Common options
1400
1401 It is permitted to only implement SVprefix and not the VBLOCK instruction
1402 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1403 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1404 traps may emulate the format.
1405
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1410
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.
1415
1416 ## RV32 / RV32F
1417
1418 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1419 maximum limit for predication is also restricted to 32 bits. Whilst not
1420 actually specifically an "option" it is worth noting.
1421
1422 ## RV32G
1423
Normally, in standard RV32, it does not make much sense to have
RV32G. The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1428
1429 In an earlier draft of SV, it was possible to specify an elwidth
1430 of double the standard register size: this had to be dropped,
1431 and may be reintroduced in future revisions.
1432
1433 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1434
1435 When floating-point is not implemented, the size of the User Register and
1436 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1437 per table).
1438
1439 ## RV32E
1440
1441 In embedded scenarios the User Register and Predication CSRs may be
1442 dropped entirely, or optionally limited to 1 CSR, such that the combined
1443 number of entries from the M-Mode CSR Register table plus U-Mode
1444 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1445 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1446 the Predication CSR tables.
1447
1448 RV32E is the most likely candidate for simply detecting that registers
1449 are marked as "vectorised", and generating an appropriate exception
1450 for the VL loop to be implemented in software.
1451
1452 ## RV128
1453
RV128 has not been especially considered here; however it has some
extremely large possibilities: doubling the element width implies
256-bit operands, each spanning two 128-bit registers, and predication
of total length 128 bits, given that XLEN is now 128.
1458
1459 # Example usage
1460
1461 TODO evaluate strncpy and strlen
1462 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1463
1464 ## strncpy
1465
RVV version: <a name="strncpy"></a>
1467
    strncpy:
        mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8    # Vectors of bytes.
        vlbff.v v1, (a1)        # Get src bytes
        vseq.vi v0, v1, 0       # Flag zero bytes
        vmfirst a4, v0          # Zero found?
        vmsif.v v0, v0          # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t    # Write out bytes
        bgez a4, exit           # Done
        csrr t1, vl             # Get number of bytes fetched
        add a1, a1, t1          # Bump src pointer
        sub a2, a2, t1          # Decrement count.
        add a3, a3, t1          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
1486
1487 SV version (WIP):
1488
    strncpy:
        mv a3, a0
        SETMVLI 8               # set max vector to 8
        RegCSR[a3] = 8bit, a3, scalar
        RegCSR[a1] = 8bit, a1, scalar
        RegCSR[t0] = 8bit, t0, vector
        PredTb[t0] = ffirst, x0, inv
    loop:
        SETVLI a2, t4           # t4 and VL now 1..8
        ldb t0, (a1)            # t0 fail first mode
        bne t0, x0, allnonzero  # still ff
        # VL points to last nonzero
        GETVL t4                # from bne tests
        addi t4, t4, 1          # include zero
        SETVL t4                # set exactly to t4
        stb t0, (a3)            # store incl zero
        ret                     # end subroutine
    allnonzero:
        stb t0, (a3)            # VL legal range
        GETVL t4                # from bne tests
        add a1, a1, t4          # Bump src pointer
        sub a2, a2, t4          # Decrement count.
        add a3, a3, t4          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
1515
1516 Notes:
1517
1518 * Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
1519 * obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
1520 * with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
1521 * RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
1522 * RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1523 * with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
1524 * setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
1525 * ldb and bne are both using t0, both in ffirst mode
* ldb will end on illegal mem, reduce VL, but will have copied all sorts of stuff into t0
* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well (see the sketch after these notes)
* the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
1531 * SETVL sets *exactly* the requested amount into VL.
1532 * the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
1533 * this would cause the stb to copy up to the end of the legal memory
1534 * of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
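
The fail-first behaviour that the notes above rely on can be summarised
with the following Python-style sketch. It is purely illustrative: the
function names, the `readable` predicate and the data structures are
assumptions, not part of the specification.

    def ffirst_load_bytes(mem, addr, VL, readable):
        # sketch of ldb in fail-first mode: load until the first
        # inaccessible address, truncating VL to the number loaded;
        # `readable` is an illustrative predicate standing in for the MMU
        data = []
        for i in range(VL):
            if not readable(addr + i):   # would trap: stop here instead
                return data, i           # new (reduced) VL
            data.append(mem[addr + i])
        return data, VL

    def ffirst_bne_zero(elements, VL):
        # sketch of "bne t0, x0" with t0 in fail-first mode: compare until
        # the first zero element, truncating VL at that point
        for i in range(VL):
            if elements[i] == 0:         # first "fail" stops the compares
                return False, i          # branch not taken, VL reduced
        return True, VL                  # all non-zero: branch taken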
1535
1536 ## strcpy
1537
1538 RVV version:
1539
        mv a3, a0               # Save start
    loop:
        setvli a1, x0, vint8    # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)       # Get bytes
        csrr a1, vl             # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0       # Set v0[i] where v1[i] = 0
        add a3, a3, a1          # Bump pointer
        vmfirst a2, v0          # Find first set bit in mask, returns -1 if none
        bltz a2, loop           # Not found?
        add a0, a0, a1          # Sum start + bump
        add a3, a3, a2          # Add index of zero byte
        sub a0, a3, a0          # Subtract start address+bump
        ret