[libreriscv.git] / simple_v_extension / v_comparative_analysis.mdwn
# V-Extension to Simple-V Comparative Analysis

[[!toc ]]

This section covers the ways in which Simple-V is comparable
to, or more flexible than, the V-Extension (V2.3-draft). Also covered is
one major weak point (register files are of fixed size, where V is
arbitrary-length), and how best to deal with that, should V be adapted
to sit on top of Simple-V.

The first stages of this section go over each of the relevant sections
of the V2.3-draft V specification.

# 17.3 Shape Encoding

Simple-V's proposed means of expressing whether a register (from the
standard integer or the standard floating-point file) is a scalar or
a vector is to simply set the vector length to 1. The instruction
would however have to specify which register file (integer or FP) the
vector-length was to be applied to.

Extended shapes (2-D etc.) would not be part of Simple-V at all.

# 17.4 Representation Encoding

Simple-V would not have representation-encoding. This is part of
polymorphism, which is considered too complex to implement (TODO: confirm?).

# 17.5 Element Bitwidth

This is directly equivalent to Simple-V's "Packed" mode, and implies that
integer (or floating-point) registers are divided down into vector-indexable
chunks of size Bitwidth.

In this way it becomes possible to have ADD effectively and implicitly
turn into ADDb (8-bit add), ADDw (16-bit add) and so on, and where the
vector-length has been set to greater than 1, it becomes a "Packed"
(SIMD) instruction.

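As a hedged illustration of the implicit Packed behaviour described above (function name and semantics are assumptions for illustration, not from either spec), here is a minimal C sketch of a 32-bit ADD with element-bitwidth set to 8, i.e. four independent lanes with no carry propagation across lane boundaries:

```c
#include <stdint.h>

/* Illustrative sketch: an ADD on a 32-bit register with element-bitwidth
 * set to 8 behaves as four independent 8-bit adds (ADDb), a "Packed"
 * SIMD operation in which no carry propagates between lanes. */
uint32_t packed_add8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t ea = (uint8_t)(a >> (lane * 8));
        uint8_t eb = (uint8_t)(b >> (lane * 8));
        uint8_t sum = (uint8_t)(ea + eb);   /* per-lane carry discarded */
        result |= (uint32_t)sum << (lane * 8);
    }
    return result;
}
```

Note how an overflowing lane wraps within its 8 bits rather than carrying into its neighbour, which is exactly what distinguishes a Packed add from a plain 32-bit add.
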
It remains to be decided what should be done when RV32 / RV64 ADD (sized)
opcodes are used. One useful idea would be, on an RV64 system where
a 32-bit-sized ADD was performed, to simply use the least significant
32 bits of the register (exactly as is currently done) but at the same
time to *respect the packed bitwidth as well*.

The extended encoding (Table 17.6) would not be part of Simple-V.

# 17.6 Base Vector Extension Supported Types

TODO: analyse. Probably exactly the same.

# 17.7 Maximum Vector Element Width

No equivalent in Simple-V.

# 17.8 Vector Configuration Registers

TODO: analyse.

# 17.9 Legal Vector Unit Configurations

TODO: analyse.

# 17.10 Vector Unit CSRs

TODO: analyse.

> Ok so this is an aspect of Simple-V that I hadn't thought through,
> yet (proposal / idea only a few days old!).  in V2.3-Draft ISA Section
> 17.10 the CSRs are listed.  I note that there's some general-purpose
> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i
> don't precisely know what those are for.

>  In the Simple-V proposal, *every* register in both the integer
> register-file *and* the floating-point register-file would have at
> least a 2-bit "data-width" CSR and probably something like an 8-bit
> "vector-length" CSR (less in RV32E, by exactly one bit).

>  What I *don't* know is whether that would be considered perfectly
> reasonable or completely insane.  If it turns out that the proposed
> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
> adding somewhere in the region of 10 bits per register would be... okay?
> I really don't honestly know.

>  Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
> be multi-ported? No I don't believe they would.

# 17.11 Maximum Vector Length (MVL)

Implicitly, this is set to the maximum size of the register
file multiplied by the number of 8-bit packed ints that can fit into
a register (4 for RV32, 8 for RV64 and 16 for RV128).

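The arithmetic above can be sketched as follows (a hypothetical helper, assuming the standard 32-entry RV integer register file):

```c
/* Sketch of the implicit MVL calculation: 32 registers multiplied by
 * the number of 8-bit packed ints per register (XLEN/8: 4 for RV32,
 * 8 for RV64, 16 for RV128). */
int simple_v_mvl(int xlen_bits)
{
    int num_regs = 32;
    int packed_per_reg = xlen_bits / 8;
    return num_regs * packed_per_reg;
}
```

So the implicit MVL would be 128 on RV32, 256 on RV64 and 512 on RV128.
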
# 17.12 Vector Instruction Formats

No equivalent in Simple-V because *all* instructions of *all* Extensions
are implicitly parallelised (and packed).

# 17.13 Polymorphic Vector Instructions

Polymorphism (implicit type-casting) is deliberately not supported
in Simple-V.

# 17.14 Rapid Configuration Instructions

TODO: analyse whether it would be useful to have an equivalent in Simple-V.

# 17.15 Vector-Type-Change Instructions

TODO: analyse whether it would be useful to have an equivalent in Simple-V.

# 17.16 Vector Length

Has a direct corresponding equivalent.

# 17.17 Predicated Execution

Predicated Execution is another name for "masking" or "tagging". Masked
(or tagged) implies that there is a bit field which is indexed, with each
bit associated with the correspondingly-indexed element register within
the "Vector". If the tag / mask bit is 1, when a parallel operation is
issued the indexed element of the vector has the operation carried out.
However if the tag / mask bit is *zero*, that particular indexed element
of the vector does *not* have the requested operation carried out.

In V2.3-draft V, there is a significant (not recommended) difference:
the zero-tagged elements are *set to zero*. This loses a *significant*
advantage of mask / tagging, particularly if the entire mask register
is itself a general-purpose register, as that general-purpose register
can be inverted, shifted, AND'ed, OR'ed and so on. In other words
it becomes possible, especially if Carry/Overflow from each vector
operation is also accessible, to do conditional (step-by-step) vector
operations, including things like turning vectors into 1024-bit or greater
operands with very few instructions, by treating the "carry" from
one instruction as a way to do "conditional add of 1 to the register
next door". If V2.3-draft V sets zero-tagged elements to zero, such
extremely powerful techniques are simply not possible.

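The two predication semantics can be contrasted in a short C sketch (function names are illustrative, not from either spec): Simple-V-style masking leaves unselected destination elements untouched, whereas the V2.3-draft behaviour described above zeroes them:

```c
#include <stdint.h>
#include <stddef.h>

/* Simple-V-style semantics: destination elements whose mask bit is 0
 * are left completely untouched. */
void masked_add_preserve(int64_t *dst, const int64_t *a, const int64_t *b,
                         uint64_t mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if ((mask >> i) & 1)
            dst[i] = a[i] + b[i];
}

/* V2.3-draft-style semantics: elements whose mask bit is 0 are
 * overwritten with zero, destroying the prior destination contents. */
void masked_add_zeroing(int64_t *dst, const int64_t *a, const int64_t *b,
                        uint64_t mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = ((mask >> i) & 1) ? a[i] + b[i] : (int64_t)0;
}
```

With the preserving variant, two passes with inverted masks merge cleanly into one destination; with the zeroing variant, the second pass wipes out the first.
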
It is noted that there is no mention of an equivalent to BEXT (element
skipping), which would be particularly fascinating and powerful to have.
In this mode, the "mask" would skip elements where its mask bit was zero
in either the source or the destination operand.

Lots to be discussed.

# 17.18 Vector Load/Store Instructions

The Vector Load/Store instructions as proposed in V are extremely powerful
and can be used for reordering and regular restructuring.

Vector Load:

    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          vreg[vd+j][i] = mem[areg[as1] + (i*(seglen+1)+j)*stride];

Store:

    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          mem[areg[base] + (i*(seglen+1)+j)*stride] = vreg[vd+j][i];

Indexed Load:

    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          vreg[vd+j][i] = mem[sreg[base] + vreg[vs2][i] + j*elsize];

Indexed Store:

    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          mem[sreg[base] + vreg[vs2][i] + j*elsize] = vreg[vd+j][i];

Keeping these instructions as-is for Simple-V is highly recommended.
However: one of the goals of this Extension is to retro-fit (re-use)
existing RV Load/Store:

[[!table data="""
31 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] | rs1 | funct3 | rd | opcode |
12 | 5 | 3 | 5 | 7 |
offset[11:0] | base | width | dest | LOAD |
"""]]

[[!table data="""
31 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
7 | 5 | 5 | 3 | 5 | 7 |
offset[11:5] | src | base | width | offset[4:0] | STORE |
"""]]

The RV32 instruction opcodes are as follows:

[[!table data="""
31 28 27 | 26 25 | 24 20 | 19 15 | 14 | 13 12 | 11 7 | 6 0 | op |
imm[4:0] | 00 | 00000 | rs1 | 1 | m | vd | 0000111 | VLD |
imm[4:0] | 01 | rs2 | rs1 | 1 | m | vd | 0000111 | VLDS |
imm[4:0] | 11 | vs2 | rs1 | 1 | m | vd | 0000111 | VLDX |
vs3 | 00 | 00000 | rs1 | 1 | m | imm[4:0] | 0100111 | VST |
vs3 | 01 | rs2 | rs1 | 1 | m | imm[4:0] | 0100111 | VSTS |
vs3 | 11 | vs2 | rs1 | 1 | m | imm[4:0] | 0100111 | VSTX |
"""]]

Conversion on LOAD is as follows:

* rd or rs1 are CSR-vectorised, indicating "Vector Mode"
* rd is equivalent to vd
* rs1 is equivalent to rs1
* imm[4:0] from the RV format (bits 11..7) is the same
* imm[9:5] from the RV format (bits 29..25) is rs2 (rs2=00000 for VLD)
* imm[11:10] from the RV format (bits 31..30) is the opcode (VLD, VLDS, VLDX)
* width from the RV format (bits 14..12) is the same (width and zero/sign extend)

[[!table data="""
31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] ||| rs1 | funct3 | rd | opcode |
2 | 5 | 5 | 5 | 3 | 5 | 7 |
00 | 00000 | imm[4:0] | base | width | dest | LOAD |
01 | rs2 | imm[4:0] | base | width | dest | LOAD.S |
11 | rs2 | imm[4:0] | base | width | dest | LOAD.X |
"""]]

A similar conversion on STORE is as follows:

[[!table data="""
31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] ||| rs1 | funct3 | rd | opcode |
2 | 5 | 5 | 5 | 3 | 5 | 7 |
00 | 00000 | src | base | width | offs[4:0] | STORE |
01 | rs3 | src | base | width | offs[4:0] | STORE.S |
11 | rs3 | src | base | width | offs[4:0] | STORE.X |
"""]]

Notes:

* The predication CSR-marking register is not explicitly shown in the instruction.
* In both LOAD and STORE, it is now possible to use rs2 (or rs3) as a vector.
* That in turn means that Indexed Load need not have an explicit opcode.
* That in turn means that bit 30 may indicate "stride" and bit 31 is free.

Revised LOAD:

[[!table data="""
31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] |||| rs1 | funct3 | rd | opcode |
1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
"""]]

Where in turn the pseudo-code may now combine the two:

    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
        {
          if (CSRvectorised[rs2])
            offs = vreg[rs2][i];
          else
            offs = i*(seglen+1)*stride;
          vreg[vd+j][i] = mem[sreg[base] + offs + j*stride];
        }

Notes:

* j is multiplied by stride, not elsize, including in the rs2 vectorised case.
* There may be more sophisticated variants involving the 31st bit; however,
it would be nice to reserve that bit for post-increment of address registers.

# 17.19 Vector Register Gather

TODO

# TODO, sort

> However, there are also several features that go beyond simply attaching VL
> to a scalar operation and are crucial to being able to vectorize a lot of
> code. To name a few:
> - Conditional execution (i.e., predicated operations)
> - Inter-lane data movement (e.g. SLIDE, SELECT)
> - Reductions (e.g., VADD with a scalar destination)

Ok so the Conditional and also the Reductions are some of the reasons
why, as part of SimpleV / variable-SIMD / parallelism (gah, gotta think
of a decent name), i proposed that it be implemented as "if you say r0
is to be a vector / SIMD, that means operations actually take place on
r0,r1,r2... r(N-1)".

Consequently any parallel operation could be paused (or, more
specifically, vectors disabled by resetting the length back to a default /
scalar / vector-length=1), yet the results would actually be in the
*main register file* (integer or float), and so anything that wasn't
possible to easily do in "simple" parallel terms could be done *out*
of parallel "mode" instead.

I do appreciate that the above does imply that there is a limit to the
length that SimpleV (whatever) can be parallelised, namely that you
run out of registers! my thought there was, "leave space for the main
V-Ext proposal to extend it to the length that V currently supports".
Honestly i had not thought through precisely how that would work.

Inter-lane (SELECT): i saw 17.19 in V2.3-Draft p117 and liked it;
it reminds me of the discussion with Clifford on bit-manipulation
(gather-scatter except not Bit Gather Scatter, *data* gather scatter): if
applied "globally and outside of V and P", SLIDE and SELECT might become
an extremely powerful way to do fast memory copy and reordering [2].

However I haven't quite got my head round how that would work: i am
used to the concept of register "tags" (the modern term is "masks")
and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
STORE you would get the exact same thing as SELECT.

SLIDE you could do simply by setting, say, r0's vector-length to 16
(meaning that if referred to in any operation it would be an implicit
parallel operation on *all* registers r0 through r15), and temporarily
setting, say, r7's vector-length to 5. Do a LOAD on r7 and it would
implicitly mean "load from memory into r7 through r11". Then you go
back and do an operation on r0 and ta-daa, you're actually doing an
operation on a SLID (SLIDED?) vector.

The advantage of Simple-V (whatever) over V would be that you could
actually do *operations* in the middle of vectors (not just SLIDEs)
simply by (as above) setting r0's vector-length to 16 and r7's vector-length
to 5. There would be nothing preventing you from doing an ADD on r0
(which meant "do an ADD on r0 through r15") followed *immediately in the
next instruction, with no setup cost*, by a MUL on r7 (which actually meant
"do a parallel MUL on r7 through r11").

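The overlapping-vector idea above can be sketched as a toy model (purely illustrative, with register and CSR handling vastly simplified): one flat register file, where an operation's vector-length decides how many consecutive registers it touches:

```c
#include <stdint.h>

/* Toy model of overlapping vectors in one flat register file: an
 * operation on register rd with vector-length vlen touches
 * rd..rd+vlen-1.  So after treating r0 as vector-length 16, an
 * operation on r7 with vector-length 5 works on the middle (r7..r11)
 * of the r0..r15 "vector", with no setup cost. */
int64_t regfile[32];

void vec_add_imm(int rd, int vlen, int64_t imm)
{
    for (int i = 0; i < vlen; i++)
        regfile[rd + i] += imm;   /* implicit parallel op over vlen regs */
}
```

Issuing `vec_add_imm(0, 16, 1)` then `vec_add_imm(7, 5, 10)` shows the point: the second operation lands in the middle of the register range the first one covered.
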
Btw it's worth mentioning that you'd get scalar-vector and vector-scalar
implicitly by having one of the source registers be vector-length 1
(the default) and the other being N > 1, but without having special opcodes
to do it. i *believe* (or more like "logically infer or deduce", as
i haven't got access to the spec) that that would result in a further
opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V.

Also, Reduction *might* be possible by specifying that the destination be
a scalar (vector-length=1) whilst the source is a vector. However... it
would be an awful lot of work to go through *every single instruction*
in *every* Extension, working out which ones could be parallelised (ADD,
MUL, XOR) and those that definitely could not (DIV, SUB). Is that worth
the effort? Maybe. Would it result in huge complexity? Probably.
Could an implementor just go "I ain't doing *that* as parallel!
Let's make it virtual-parallelism (sequential reduction) instead"?
Absolutely. So, now that I think it through, Simple-V (whatever)
covers Reduction as well. Huh, that's a surprise.

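A sequential-reduction fallback of the kind described ("virtual parallelism") might look like the following sketch, where the scalar-destination convention is an assumption:

```c
#include <stdint.h>

/* Sketch of virtual-parallelism reduction: destination is a scalar
 * (vector-length 1), source is a vector, and the implementation falls
 * back to a sequential loop.  Associative ops (ADD, MUL, XOR) would
 * also permit a parallel tree; non-associative ones (SUB, DIV) would
 * not. */
int64_t reduce_add(const int64_t *src, int vl)
{
    int64_t acc = 0;
    for (int i = 0; i < vl; i++)
        acc += src[i];            /* sequential reduction */
    return acc;
}
```
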
> - Vector-length speculation (making it possible to vectorize some loops with
> unknown trip count) - I don't think this part of the proposal is written
> down yet.

Now that _is_ an interesting concept. A little scary, i imagine, with
the possibility of putting a processor into a hard infinite execution
loop... :)

> Also, note the vector ISA consumes relatively little opcode space (all the
> arithmetic fits in 7/8ths of a major opcode). This is mainly because data
> type and size is a function of runtime configuration, rather than of opcode.

Yes. i love that aspect of V; i am a huge fan of polymorphism [1],
which is why i am keen to advocate that the same runtime principle be
extended to the rest of the RISC-V ISA [3].

Yikes, that's a lot. I'm going to need to pull this into the wiki to
make sure it's not lost.

[1] Inherent data type conversion: 25 years ago i designed a hypothetical
hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit
(escape-extended) opcodes and 2-bit (escape-extended) operands that
only required a fixed 8-bit instruction length. That relied heavily
on polymorphism and runtime size configurations as well. At the time
I thought it would have meant one HELL of a lot of CSRs... but then I
met RISC-V and was cured instantly of that delusion^Wmisapprehension :)

[2] Interestingly, if you then also add in the other aspect of Simple-V
(the data-size, which is effectively functionally orthogonal / identical
to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE
operations become byte / half-word / word augmenters of B-Ext's proposed
"BGS", i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored
LOAD / STORE would deal with 8 / 16 / 32 bits at a time. Where it
would get really REALLY interesting would be masked-packed-vectored
B-Ext BGS instructions. I can't even get my head fully round that,
which is a good sign that the combination would be *really* powerful :)

[3] Ok, sadly, maybe not the polymorphism: it's too complicated, and i
think it would be much too hard for implementors to easily "slide in" to an
existing non-Simple-V implementation. i say that despite really *really*
wanting IEEE 754 FP half-precision to end up somewhere in RISC-V in some
fashion, for optimising 3D graphics. *sigh*.

# TODO: analyse, auto-increment on unit-stride and constant-stride

So i thought about that for a day or so, and wondered if it would be
possible to propose a variant of zero-overhead loop that included
auto-incrementing the two address registers a2 and a3, as well as
providing a means to interact between the zero-overhead loop and the
vsetvl instruction. A sort-of pseudo-assembly of that would look like:

    # a2 to be auto-incremented by t0 times 4
    zero-overhead-set-auto-increment a2, t0, 4
    # a3 to be auto-incremented by t0 times 4
    zero-overhead-set-auto-increment a3, t0, 4
    zero-overhead-set-loop-terminator-condition a0 zero
    zero-overhead-set-start-end stripmine, stripmine+endoffset
    stripmine:
        vsetvl t0,a0
        vlw v0, a2
        vlw v1, a3
        vfma v1, a1, v0, v1
        vsw v1, a3
        sub a0, a0, t0
    stripmine+endoffset:

The question is: would something like this even be desirable? It's a
variant of auto-increment [1]. The last time i saw any hint of auto-increment
register opcodes was in the 1980s... 68000, if i recall correctly... yep,
see [1].

[1] http://fourier.eng.hmc.edu/e85_old/lectures/instruction/node6.html

Reply:

Another option for auto-increment is for vector-memory-access instructions
to support post-increment addressing for unit-stride and constant-stride
modes. This can be implemented by the scalar unit passing the operation
to the vector unit while itself executing an appropriate multiply-and-add
to produce the incremented address. This does *not* require additional
ports on the scalar register file, unlike scalar post-increment addressing
modes.

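The scheme in the reply can be sketched in one line of C (function and parameter names are illustrative): the scalar unit computes the post-incremented base with a single multiply-and-add while the vector unit performs the memory accesses:

```c
#include <stdint.h>

/* One scalar multiply-and-add produces the post-incremented base
 * address; no extra ports on the scalar register file are needed. */
uint64_t post_increment_base(uint64_t base, uint64_t vl, uint64_t stride)
{
    return base + vl * stride;
}
```
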
# TODO: instructions V-Ext duplication analysis <a name="duplication_analysis">

This is partly speculative due to lack of access to an up-to-date
V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing).
A cursory examination shows an **85%** duplication of V-Ext
operand-related instructions when compared to a standard RV64G base,
and a **95%** duplication of arithmetic and floating-point operations.

Exceptions are:

* The Vector Misc ops: VEIDX, VFIRST, VPOPC,
and potentially more (9 control-related instructions)
* VCLIP and VCLIPI (the only 2 opcodes not duplicated out of 47
total arithmetic / floating-point operations)

Table of RV32V Instructions:

| RV32V | RV Std (FP) | RV Std (Int) | Notes |
| ----- | ----------- | ------------ | ----- |
| VADD | FADD | ADD | |
| VSUB | FSUB | SUB | |
| VSL | | SLL | |
| VSR | | SRL | |
| VAND | | AND | |
| VOR | | OR | |
| VXOR | | XOR | |
| VSEQ | FEQ | BEQ | (1) |
| VSNE | !FEQ | BNE | (1) |
| VSLT | FLT | BLT | (1) |
| VSGE | !FLE | BGE | (1) |
| VCLIP | | | |
| VCVT | FCVT | | |
| VMPOP | | | |
| VMFIRST | | | |
| VEXTRACT | | | |
| VINSERT | | | |
| VMERGE | | | |
| VSELECT | | | |
| VSLIDE | | | |
| VDIV | FDIV | DIV | |
| VREM | | REM | |
| VMUL | FMUL | MUL | |
| VMULH | | MULH | |
| VMIN | FMIN | | |
| VMAX | FMAX | | |
| VSGNJ | FSGNJ | | |
| VSGNJN | FSGNJN | | |
| VSGNJX | FSGNJX | | |
| VSQRT | FSQRT | | |
| VCLASS | FCLASS | | |
| VPOPC | | | |
| VADDI | | ADDI | |
| VSLI | | SLLI | |
| VSRI | | SRLI | |
| VANDI | | ANDI | |
| VORI | | ORI | |
| VXORI | | XORI | |
| VCLIPI | | | |
| VMADD | FMADD | | |
| VMSUB | FMSUB | | |
| VNMADD | FNMSUB | | |
| VNMSUB | FNMADD | | |
| VLD | FLD | LD | |
| VLDS | FLD | LD | (2) |
| VLDX | FLD | LD | (3) |
| VST | FST | ST | |
| VSTS | FST | ST | (2) |
| VSTX | FST | ST | (3) |
| VAMOSWAP | | AMOSWAP | |
| VAMOADD | | AMOADD | |
| VAMOAND | | AMOAND | |
| VAMOOR | | AMOOR | |
| VAMOXOR | | AMOXOR | |
| VAMOMIN | | AMOMIN | |
| VAMOMAX | | AMOMAX | |

Notes:

* (1) Retro-fit predication variants into branch instructions (base and C),
decoding triggered by a CSR bit marking the register as "Vector type".
* (2) Retro-fit LOAD/STORE constant-stride by reinterpreting one bit of the
immediate offset when register arguments are detected as being vectorised.
* (3) Retro-fit LOAD/STORE indexed-stride through detection of the address
register argument being vectorised.

# TODO: sort

> I suspect that the "hardware loop" in question is actually a zero-overhead
> loop unit that diverts execution from address X to address Y if a certain
> condition is met.

Not quite. The zero-overhead loop unit interestingly would be at
an [independent] level above vector-length. The distinctions are
as follows:

* Vector-length issues *virtual* instructions where the register
operands are *specifically* altered (to cover a range of registers),
whereas zero-overhead loops *specifically* do *NOT* alter the operands
in *ANY* way.

* Vector-length-driven "virtual" instructions are driven by *one*
and *only* one instruction (whether it be a LOAD, STORE, or pure
one/two/three-operand opcode), whereas zero-overhead loop units
specifically apply to *multiple* instructions.

Where vector-length-driven "virtual" instructions might get conceptually
blurred with zero-overhead loops is LOAD / STORE. In the case of LOAD /
STORE, to actually be useful, vector-length-driven LOAD / STORE should
increment the LOAD / STORE memory address to correspondingly match the
increment in the register bank. Example:

* set vector-length for r0 to 4
* issue RV32 LOAD from addr 0x1230 to r0

translates effectively to:

* RV32 LOAD from addr 0x1230 to r0
* ...
* ...
* RV32 LOAD from addr 0x123C to r3