simple_v_extension/specification.mdwn

   1 # Simple-V (Parallelism Extension Proposal) Specification
   2
   3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
   4 * Status: DRAFTv0.6.1
   5 * Last edited: 10 sep 2019
   6 * Ancillary resource: [[opcodes]]
   7 * Ancillary resource: [[sv_prefix_proposal]]
   8 * Ancillary resource: [[abridged_spec]]
   9 * Ancillary resource: [[vblock_format]]
  10 * Ancillary resource: [[appendix]]
  11
  12 With thanks to:
  13
  14 * Allen Baum
  15 * Bruce Hoult
  16 * comp.arch
  17 * Jacob Bachmeyer
  18 * Guy Lemurieux
  19 * Jacob Lifshay
  20 * Terje Mathisen
  21 * The RISC-V Founders, without whom this all would not be possible.
  22
  23 [[!toc ]]
  24
  25 # Summary and Background: Rationale
  26
Simple-V is a uniform parallelism API for RISC-V hardware that has several
unplanned side-effects, including code-size reduction, expansion of
HINT space and more.  It was created to provide a manageable way to turn
a pre-existing design into a parallel one, in a step-by-step incremental
fashion, without adding any new opcodes, thus allowing the implementor
to focus on adding hardware only where it is needed.
The primary target is mobile-class 3D GPUs and VPUs, with secondary
goals being to reduce executable size (by extending the effectiveness of
RV opcodes, RVC in particular) and to reduce context-switch latency.
  35
  36 Critically: **No new instructions are added**.  The parallelism (if any
  37 is implemented) is implicitly added by tagging *standard* scalar registers
  38 for redirection.  When such a tagged register is used in any instruction,
  39 it indicates that the PC shall **not** be incremented; instead a loop
  40 is activated where *multiple* instructions are issued to the pipeline
  41 (as determined by a length CSR), with contiguously incrementing register
  42 numbers starting from the tagged register.  When the last "element"
  43 has been reached, only then is the PC permitted to move on.  Thus
  44 Simple-V effectively sits (slots) *in between* the instruction decode phase
  45 and the ALU(s).
  46
The barrier to entry with SV is therefore very low.  The minimum
compliant implementation is software-emulation (traps), requiring
only the CSRs and CSR tables, and that an exception be thrown if an
instruction's registers are detected to have been tagged.  The looping
that would otherwise be done in hardware is thus carried out in software,
instead.  Whilst much slower, it is "compliant" with the SV specification,
and may be suited for implementation in RV32E and also in situations
where the implementor wishes to focus on certain aspects of SV, without
investing unnecessary time and resources in the silicon, whilst also
conforming strictly with the API.  A good area to punt to software would
be the polymorphic element width capability for example.
  58
  59 Hardware Parallelism, if any, is therefore added at the implementor's
  60 discretion to turn what would otherwise be a sequential loop into a
  61 parallel one.
  62
  63 To emphasise that clearly: Simple-V (SV) is *not*:
  64
* A SIMD system
* A SIMT system
* A Vectorisation Microarchitecture
* A microarchitecture of any specific kind
* A mandatory parallel processor microarchitecture of any kind
* A supercomputer extension
  71
  72 SV does **not** tell implementors how or even if they should implement
  73 parallelism: it is a hardware "API" (Application Programming Interface)
  74 that, if implemented, presents a uniform and consistent way to *express*
  75 parallelism, at the same time leaving the choice of if, how, how much,
  76 when and whether to parallelise operations **entirely to the implementor**.
  77
  78 # Basic Operation
  79
  80 The principle of SV is as follows:
  81
  82 * Standard RV instructions are "prefixed" (extended) through a 48/64
  83   bit format (single instruction option) or a variable
  84  length VLIW-like prefix (multi or "grouped" option).
  85 * The prefix(es) indicate which registers are "tagged" as
  86   "vectorised". Predicates can also be added, and element widths
  87   overridden on any src or dest register.
  88 * A "Vector Length" CSR is set, indicating the span of any future
  89   "parallel" operations.
  90 * If any operation (a **scalar** standard RV opcode) uses a register
  91   that has been so "marked" ("tagged"), a hardware "macro-unrolling loop"
  92   is activated, of length VL, that effectively issues **multiple**
  93   identical instructions using contiguous sequentially-incrementing
  94   register numbers, based on the "tags".
  95 * **Whether they be executed sequentially or in parallel or a
  96   mixture of both or punted to software-emulation in a trap handler
  97   is entirely up to the implementor**.
  98
  99 In this way an entire scalar algorithm may be vectorised with
 100 the minimum of modification to the hardware and to compiler toolchains.
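
The macro-unrolling principle listed above can be sketched in Python. This is an illustrative model only: `expand`, the tuple representation of an instruction and the `tagged` set are hypothetical names that do not appear in the specification, and predication, element widths and the redirection tables are deliberately left out.

```python
def expand(op, rd, rs1, rs2, tagged, vl):
    """Expand one scalar instruction into the element operations SV would issue.

    tagged: set of register numbers marked as vectors (hypothetical model).
    Returns a list of (op, rd, rs1, rs2) tuples.
    """
    regs = (rd, rs1, rs2)
    if not any(r in tagged for r in regs):
        return [(op, rd, rs1, rs2)]      # plain scalar: issued once, PC moves on
    issued = []
    for i in range(vl):                  # PC is held until the loop completes
        # tagged registers step through contiguous register numbers;
        # untagged (scalar) registers are reused for every element
        issued.append((op,) + tuple(r + i if r in tagged else r for r in regs))
    return issued

# add x8, x16, x5 with x8 and x16 tagged as vectors, VL = 3
for insn in expand("add", 8, 16, 5, tagged={8, 16}, vl=3):
    print(insn)
```

Whether the resulting element operations are then run one at a time, in parallel, or in a trap handler is exactly the choice that SV leaves to the implementor.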
 101
To reiterate: **There are *no* new opcodes**. The scheme works *entirely*
on hidden context that augments *scalar* RISC-V instructions.
 104
 105 # CSRs <a name="csrs"></a>
 106
There are six additional CSRs, available in any privilege level:

* MVL (the Maximum Vector Length)
* VL (sets which scalar register is to be the Vector Length)
* SUBVL (effectively a kind of SIMD)
* STATE (containing copies of MVL, VL and SUBVL as well as context information)
* SVPSTATE (state information for SVPrefix)
* PCVBLK (the current operation being executed within a VBLOCK Group)

There is also an optional "reshaping" CSR key-value table which remaps
from a 1D linear shape to 2D or 3D, including full transposition.
 118
 119 For User Mode there are the following CSRs:
 120
 121 * uePCVBLK (a copy of the sub-execution Program Counter, that is relative
 122   to the start of the current VBLOCK Group, set on a trap).
 123 * ueSTATE (useful for saving and restoring during context switch,
 124   and for providing fast transitions)
* ueSVPSTATE (when SVPrefix is implemented).
  Note: ueSVPSTATE is mirrored in the top 32 bits of ueSTATE.
 127
 128 There are also three additional CSRs for Supervisor-Mode:
 129
 130 * sePCVBLK
 131 * seSTATE (which contains seSVPSTATE)
 132 * seSVPSTATE
 133
 134 And likewise for M-Mode:
 135
 136 * mePCVBLK
 137 * meSTATE (which contains meSVPSTATE)
 138 * meSVPSTATE
 139
 140 The u/m/s CSRs are treated and handled exactly like their (x)epc
 141 equivalents. On entry to or exit from a privilege level, the contents
 142 of its (x)eSTATE are swapped with STATE.
 143
 144 Thus for example, a User Mode trap will end up swapping STATE and ueSTATE
 145 (on both entry and exit), allowing User Mode traps to have their own
 146 Vectorisation Context set up, separated from and unaffected by normal
 147 user applications.  If an M Mode trap occurs in the middle of the U Mode
 148 trap, STATE is swapped with meSTATE, and restored on exit: the U Mode
 149 trap continues unaware that the M Mode trap even occurred.
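
The swap behaviour described above can be modelled with a few lines of Python. This is a sketch under assumptions: the `Hart` class and its `trap_enter` / `trap_exit` methods are hypothetical, and only the STATE CSRs named in the text are modelled.

```python
class Hart:
    """Toy model of the STATE swap performed on trap entry and exit."""

    def __init__(self):
        self.csr = {"STATE": 0, "ueSTATE": 0, "seSTATE": 0, "meSTATE": 0}

    def _swap(self, level):
        name = level + "eSTATE"          # "u" -> ueSTATE, "s" -> seSTATE, ...
        self.csr["STATE"], self.csr[name] = self.csr[name], self.csr["STATE"]

    def trap_enter(self, level):         # level is "u", "s" or "m"
        self._swap(level)

    def trap_exit(self, level):          # the same swap undoes the entry swap
        self._swap(level)

hart = Hart()
hart.csr["STATE"] = 0x111      # context of the user application
hart.csr["ueSTATE"] = 0x222    # context prepared for the User Mode trap handler
hart.trap_enter("u")           # handler now runs with STATE == 0x222
hart.trap_exit("u")            # application resumes with STATE == 0x111
```

Because entry and exit are the same operation, nesting an M Mode trap inside the U Mode handler restores the handler's STATE on the way out without any extra bookkeeping.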
 150
 151 Likewise, Supervisor Mode may perform context-switches, safe in the
 152 knowledge that its Vectorisation State is unaffected by User Mode.
 153
The access pattern for these groups of CSRs in each mode follows the
same pattern as other CSRs that have M-Mode and S-Mode "mirrors":

* In M-Mode, the S-Mode and U-Mode CSRs are separate and distinct.
* In S-Mode, accessing and changing the M-Mode CSRs is transparently
  identical to changing the S-Mode CSRs.  Accessing and changing the
  U-Mode CSRs is permitted.
* In U-Mode, accessing and changing the S-Mode and M-Mode CSRs
  is prohibited.
 164
 165 An interesting side effect of SV STATE being separate and distinct in S
 166 Mode is that Vectorised saving of an entire register file to the stack
 167 is a single instruction (through accidental provision of LOAD-MULTI
 168 semantics).  If the SVPrefix P64-LD-type format is used, LOAD-MULTI may
 169 even be done with a single standalone 64 bit opcode (P64 may set up SVPSTATE.SUBVL,
 170 SVPSTATE.VL and SVPSTATE.MVL from an immediate field, to cover the full regfile). It can
 171 even be predicated, which opens up some very interesting possibilities.
 172
(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
equivalents. See VBLOCK section for details.
 175
 176 ## MAXVECTORLENGTH (MVL) <a name="mvl" />
 177
 178 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
 179 is variable length and may be dynamically set.  MVL is
 180 however limited to the regfile bitwidth XLEN (1-32 for RV32,
 181 1-64 for RV64 and so on).
 182
 183 The reason for setting this limit is so that predication registers, when
 184 marked as such, may fit into a single register as opposed to fanning
 185 out over several registers.  This keeps the hardware implementation a
 186 little simpler.
 187
The other important factor to note is that the actual MVL is internally
stored **offset by one**, so that it can fit into only 6 bits (for RV64)
and still cover a range up to XLEN bits.  Attempts to set MVL to zero will
raise an exception.  This is expressed more clearly in the "pseudocode"
section, where there are subtle differences between CSRRW and CSRRWI.
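
The offset-by-one storage can be illustrated as follows. The helper names `encode_mvl` and `decode_mvl` are hypothetical, and RV64 (XLEN = 64) is assumed as the default.

```python
XLEN = 64  # assumption: RV64

def encode_mvl(mvl, xlen=XLEN):
    """Return the value stored internally for a requested MVL."""
    if not 1 <= mvl <= xlen:
        raise ValueError("MVL must be in the range 1..XLEN")
    return mvl - 1              # 1..64 maps to 0..63, which fits in 6 bits

def decode_mvl(field):
    """Recover the architectural MVL from the stored 6-bit field."""
    return (field & 0x3F) + 1
```

Storing MVL minus one is what lets a 6-bit field reach XLEN = 64; the cost is that zero cannot be represented, which is why an attempt to set it raises an exception.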
 193
 194 ## Vector Length (VL) <a name="vl" />
 195
 196 VL is very different from RVV's VL.  It contains the scalar register *number* that is to be treated as the Vector Length. It is a sub-field of STATE. When set to zero (x0) VL (vectorisation) is disabled.
 197
 198 Implementations realistically should keep a cached copy of the register pointed to by VL in the instruction issue and decode phases. Out of Order Engines must then, if it is not x0, add this register to Vectorised instruction Dependency Checking as an additional read/write hazard as appropriate.
 199
 200 Setting VL via this CSR is very unusual. It should not normally be needed except when [[specification/sv.setvl]] is not implemented.  Note that unlike in sv.setvl, setting VL does not change the contents of the scalar register that it points to, although if the scalar register's contents are not within the range of MVL at the time that VL is set, an illegal instruction exception must be raised.
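
A minimal sketch of VL acting as a pointer to a scalar register is given below. The helper names `point_vl_at` and `effective_vl` and the dictionary layout are hypothetical. Reading "within the range of MVL" as 1 to MVL inclusive is an assumption made for this sketch, not something the text above confirms.

```python
def point_vl_at(state, regs, regnum):
    """Make VL refer to scalar register `regnum`; x0 disables vectorisation."""
    # assumption: "within the range of MVL" means 1 <= contents <= MVL
    if regnum != 0 and not 1 <= regs[regnum] <= state["MVL"]:
        raise ValueError("illegal instruction: register contents outside MVL range")
    state["VL"] = regnum          # the scalar register itself is left untouched

def effective_vl(state, regs):
    """Vector length seen by the hardware loop; None means vectorisation is off."""
    return None if state["VL"] == 0 else regs[state["VL"]]

regs = [0] * 32
regs[10] = 3
state = {"MVL": 4, "VL": 0}
point_vl_at(state, regs, 10)      # VL now follows the contents of x10
```

Since the length is read through the pointer, a later write to x10 changes the effective vector length, which is why the register has to take part in dependency checking.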
 201
 202 ## SUBVL - Sub Vector Length
 203
 204 This is a "group by quantity" that effectively asks each iteration
 205 of the hardware loop to load SUBVL elements of width elwidth at a
 206 time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
 207 operation issued, SUBVL operations are issued.
 208
 209 Another way to view SUBVL is that each element in the VL length vector is
 210 now SUBVL times elwidth bits in length and now comprises SUBVL discrete
 211 sub operations.  This can be viewed as an inner SUBVL hardware for-loop within a VL hardware for-loop in effect,
 212 with the sub-element increased every time in the innermost loop. This
 213 is best illustrated in the (simplified) pseudocode example, in the
 214 [[appendix]].
 215
 216 The primary use case for SUBVL is for 3D FP Vectors. A Vector of 3D
 217 coordinates X,Y,Z for example may be loaded and multiplied then stored, per
 218 VL element iteration, rather than having to set VL to three times larger.
 219
 220 Setting this CSR to 0 must raise an exception.  Setting it to a value
 221 greater than 4 likewise.  To see the relationship with STATE, see below.
 222
 223 The main effect of SUBVL is that predication bits are applied per
 224 **group**, rather than by individual element.
 225
 226 This saves a not insignificant number of instructions when handling 3D
 227 vectors, as otherwise a much longer predicate mask would have to be set
 228 up with regularly-repeated bit patterns.
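
The nested-loop view of SUBVL, with one predicate bit per group, can be sketched as below. The function names are hypothetical, and the flat index `i * subvl + s` is an assumption made for illustration; the [[appendix]] pseudocode remains the reference.

```python
def subvl_element_ops(vl, subvl, predicate):
    """Yield (group, sub, flat_index) for every sub-operation that executes."""
    for i in range(vl):                    # outer VL loop: one predicate bit each
        if not (predicate >> i) & 1:
            continue                       # the whole group is masked out
        for s in range(subvl):             # inner SUBVL loop
            yield i, s, i * subvl + s      # assumed flat index into the register span

def expanded_mask(vl, subvl, predicate):
    """Per-element mask that would be needed if SUBVL did not exist."""
    mask = 0
    for _, _, flat in subvl_element_ops(vl, subvl, predicate):
        mask |= 1 << flat
    return mask

# Four XYZ vectors, groups 0 and 2 enabled: a 4-bit predicate is enough
print(bin(expanded_mask(4, 3, 0b0101)))
```

Without SUBVL the same selection would need the 12-bit mask computed by `expanded_mask`, which is the regularly-repeated bit pattern that the text refers to.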
 229
 230 See SUBVL Pseudocode illustration in the [[appendix]], for details.
 231
 232 ## STATE
 233
 234 out of date, see <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001896.html>
 235
 236 This is a standard CSR that contains sufficient information for a
 237 full context save/restore.  It contains (and permits setting of):
 238
 239 * MVL
 240 * VL
 241 * destoffs - the destination element offset of the current parallel
 242   instruction being executed
 243 * srcoffs - for twin-predication, the source element offset as well.
 244 * SUBVL
 245 * svdestoffs - the subvector destination element offset of the current
 246   parallel instruction being executed
 247
Interestingly, STATE may hypothetically also be modified to make the
immediately-following instruction skip a certain number of elements,
by playing with destoffs and srcoffs (and the subvector offsets as well).

Setting destoffs and srcoffs is realistically intended for saving state
so that exceptions (page faults in particular) may be serviced and the
hardware-loop that was being executed at the time of the trap, from
user-mode (or Supervisor-mode), may be returned to and continued from
exactly where it left off.  This works because User-Mode STATE is
neither changed nor used in M-Mode or S-Mode (which is entirely why
M-Mode and S-Mode have their own STATE CSRs, meSTATE and seSTATE).
 260
 261 The format of the STATE CSR is as follows:
 262
 263 | (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
 264 | -------- | -------- | -------- | -------- | -------- | ------- | ------- |
 265 | rsvd     | dsvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
 266
 267 Legal values of vl are between 0 and 31.
 268
 269 The relationship between SUBVL and the subvl field is:
 270
 271 | SUBVL | (25..24) |
 272 | ----- | -------- |
 273 | 1     | 0b00     |
 274 | 2     | 0b01     |
 275 | 3     | 0b10     |
 276 | 4     | 0b11     |
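
The two tables above can be turned into a small pack/unpack sketch. The names `FIELDS`, `pack_state`, `unpack_state` and `subvl_to_field` are hypothetical. Only the raw bit layout is modelled: the offset-by-one meaning of the stored fields and the truncation rules are not applied here.

```python
# (name, least significant bit, width), taken from the STATE format table
FIELDS = [("maxvl", 0, 6), ("vl", 6, 6), ("srcoffs", 12, 6),
          ("destoffs", 18, 6), ("subvl", 24, 2), ("dsvoffs", 26, 2)]

def pack_state(**values):
    """Build a STATE word from raw field values; missing fields are zero."""
    word = 0
    for name, lsb, width in FIELDS:
        v = values.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= v << lsb
    return word

def unpack_state(word):
    """Split a STATE word back into its raw fields."""
    return {name: (word >> lsb) & ((1 << width) - 1)
            for name, lsb, width in FIELDS}

def subvl_to_field(subvl):
    """SUBVL 1..4 is stored as 0b00..0b11."""
    if subvl not in (1, 2, 3, 4):
        raise ValueError("SUBVL must be 1..4")
    return subvl - 1
```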
 277
 278 When setting this CSR, the following characteristics will be enforced:
 279
 280 * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
 281 * **VL** must be set to a scalar register between 0 and 31.
 282 * **SUBVL** which sets a SIMD-like quantity, has only 4 values so there
 283   are no changes needed
 284 * **srcoffs** will be truncated to be within the range 0 to VL-1
 285 * **destoffs** will be truncated to be within the range 0 to VL-1
 286 * **dsvoffs** will be truncated to be within the range 0 to SUBVL-1
 287
 288 NOTE: if the following instruction is not a twin predicated instruction,
 289 and destoffs or dsvoffs has been set to non-zero, subsequent execution
 290 behaviour is undefined. **USE WITH CARE**.
 291
 292 NOTE: sub-vector looping does not require a twin-predicate corresponding
 293 index, because sub-vectors use the *main* (VL) loop predicate bit.
 294
When SVPrefix is implemented, it can have its own VL, MVL and SUBVL, as well as element offsets. SVPSTATE.VL acts slightly differently in that it is no longer a pointer to a scalar register but is an actual value just like RVV's VL.

The format of SVPSTATE, which fits into *both* the top bits of STATE and also into a separate CSR, is as follows:
 298
 299 | (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
 300 | -------- | -------- | -------- | -------- | -------- | ------- | ------- |
 301 | rsvd     | dsvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
 302
 303 ### Hardware rules for when to increment STATE offsets
 304
 305 The offsets inside STATE are like the indices in a loop, except
 306 in hardware. They are also partially (conceptually) similar to a
 307 "sub-execution Program Counter". As such, and to allow proper context
 308 switching and to define correct exception behaviour, the following rules
 309 must be observed:
 310
 311 * When the VL CSR is set, srcoffs and destoffs are reset to zero.
 312 * Each instruction that contains a "tagged" register shall start
 313   execution at the *current* value of srcoffs (and destoffs in the case
 314   of twin predication)
 315 * Unpredicated bits (in nonzeroing mode) shall cause the element operation
 316   to skip, incrementing the srcoffs (or destoffs)
 317 * On execution of an element operation, Exceptions shall **NOT** cause
 318   srcoffs or destoffs to increment.
 319 * On completion of the full Vector Loop (srcoffs = VL-1 or destoffs =
 320   VL-1 after the last element is executed), both srcoffs and destoffs
 321   shall be reset to zero.
 322
 323 This latter is why srcoffs and destoffs may be stored as values from
 324 0 to XLEN-1 in the STATE CSR, because as loop indices they refer to
 325 elements. srcoffs and destoffs never need to be set to VL: their maximum
 326 operating values are limited to 0 to VL-1.
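
The rules above amount to a resumable loop whose index lives in STATE. The sketch below models only srcoffs for a single-predicated instruction; `run_vector_loop`, `ElementFault` and the dictionary used for state are hypothetical names chosen for illustration.

```python
class ElementFault(Exception):
    """Stands in for a trap (for example a page fault) on one element."""

def run_vector_loop(state, vl, predicate, element_op):
    """Run or resume one tagged instruction; state["srcoffs"] is the loop index."""
    while state["srcoffs"] < vl:
        i = state["srcoffs"]
        if (predicate >> i) & 1:
            element_op(i)          # may raise: srcoffs is then left unchanged
        state["srcoffs"] = i + 1   # executed or skipped: move to the next element
    state["srcoffs"] = 0           # full loop complete: reset for the next instruction

done, faulted = [], set()

def op(i):
    if i == 2 and i not in faulted:    # element 2 faults the first time only
        faulted.add(i)
        raise ElementFault(i)
    done.append(i)

state = {"srcoffs": 0}
try:
    run_vector_loop(state, 4, 0b1111, op)
except ElementFault:
    pass                               # a trap handler would service the fault here
run_vector_loop(state, 4, 0b1111, op)  # resumes at element 2, not at element 0
```

Because the faulting element does not advance srcoffs, re-issuing the same instruction after the trap repeats that element and then continues, so no element is executed twice or skipped.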
 327
 328 The same corresponding rules apply to SUBVL, svsrcoffs and svdestoffs.
 329
 330 ## MVL and VL Pseudocode
 331
 332 The pseudo-code for get and set of VL and MVL use the following internal
 333 functions as follows:
 334
    set_mvl_csr(value, rd):
        regs[rd] = STATE.MVL     # old value is returned, as for a normal CSR
        STATE.MVL = MIN(value, STATE.MVL)

    get_mvl_csr(rd):
        regs[rd] = STATE.MVL

    set_vl_csr(value, rd):
        STATE.VL = rd
        return STATE.VL          # the new value, not the old one

    get_vl_csr(rd):
        return STATE.VL
 347
 348 Note that where setting MVL behaves as a normal CSR (returns the old
 349 value), unlike standard CSR behaviour, setting VL will return the **new**
 350 value of VL **not** the old one.
 351
 352 For CSRRWI, the range of the immediate is restricted to 5 bits.  In order to
 353 maximise the effectiveness, an immediate of 0 is used to set VL=1,
 354 an immediate of 1 is used to set VL=2 and so on:
 355
    CSRRWI_Set_MVL(value):
        set_mvl_csr(value+1, x0)

    CSRRWI_Set_VL(value):
        set_vl_csr(value+1, x0)
 361
 362 However for CSRRW the following pseudocode is used for MVL and VL,
 363 where setting the value to zero will cause an exception to be raised.
 364 The reason is that if VL or MVL are set to zero, the STATE CSR is
 365 not capable of storing that value.
 366
    CSRRW_Set_MVL(rs1, rd):
        value = regs[rs1]
        if value == 0 or value > XLEN:
            raise Exception
        set_mvl_csr(value, rd)

    CSRRW_Set_VL(rs1, rd):
        value = regs[rs1]
        if value == 0 or value > XLEN:
            raise Exception
        set_vl_csr(value, rd)
 378
 379 In this way, when CSRRW is utilised with a loop variable, the value
 380 that goes into VL (and into the destination register) may be used
 381 in an instruction-minimal fashion:
 382
     CSRvect1 = {type: F, key: a3, val: a3, elwidth: dflt}
     CSRvect2 = {type: F, key: a7, val: a7, elwidth: dflt}
     CSRRWI MVL, 3          # sets MVL == 4 (not 3)
     j zerotest             # in case loop counter a0 already 0
    loop:
     CSRRW VL, t0, a0       # vl = t0 = min(mvl, a0)
     ld     a3, a1          # load 4 registers a3-6 from x
     slli   t1, t0, 3       # t1 = vl * 8 (in bytes)
     ld     a7, a2          # load 4 registers a7-10 from y
     add    a1, a1, t1      # increment pointer to x by vl*8
     fmadd  a7, a3, fa0, a7 # v1 += v0 * fa0 (y = a * x + y)
     sub    a0, a0, t0      # n -= vl (t0)
     st     a7, a2          # store 4 registers a7-10 to y
     add    a2, a2, t1      # increment pointer to y by vl*8
    zerotest:
     bnez   a0, loop        # repeat if n != 0
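
The strip-mining pattern in the assembly loop above can be followed in plain Python. This is only a behavioural sketch: `daxpy_stripmined` is a hypothetical name, lists stand in for memory, and the inner `for` loop plays the role of the hardware loop over the tagged registers.

```python
def daxpy_stripmined(a, x, y, mvl=4):
    """Compute y = a*x + y in chunks of at most MVL elements.

    Returns the list of VL values used, one per pass of the outer loop.
    """
    n, pos = len(x), 0
    chunks = []
    while n != 0:                          # bnez a0, loop
        vl = min(mvl, n)                   # CSRRW VL, t0, a0
        for i in range(vl):                # hardware loop over tagged registers
            y[pos + i] += a * x[pos + i]   # fmadd
        pos += vl                          # pointer increments (vl*8 bytes)
        n -= vl                            # sub a0, a0, t0
        chunks.append(vl)
    return chunks

x = [float(i) for i in range(10)]
y = [1.0] * 10
print(daxpy_stripmined(2.0, x, y))         # VL used on each pass
```

The last pass simply runs with a shorter VL, so no separate scalar clean-up loop is needed for the tail of the array.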
 399
With the STATE CSR, just like with CSRRWI, in order to maximise the
utilisation of the limited bitspace, "000000" in binary represents
VL==1, "000001" represents VL==2 and so on (likewise for MVL):

    CSRRW_Set_SV_STATE(rs1, rd):
        value = regs[rs1]
        get_state_csr(rd)
        # field positions follow the STATE format table
        STATE.MVL = set_mvl_csr(value[5:0]+1)
        STATE.VL = set_vl_csr(value[11:6]+1)
        STATE.srcoffs = value[17:12]
        STATE.destoffs = value[23:18]

    get_state_csr(rd):
        regs[rd] = (STATE.MVL-1) | (STATE.VL-1)<<6 | (STATE.srcoffs)<<12 |
                   (STATE.destoffs)<<18
        return regs[rd]
 416
In both cases, whilst a CSR read of VL or MVL returns the exact value
of VL or MVL respectively, reading and writing the STATE CSR uses
those values **minus one**.  This is absolutely critical to implement
if the STATE CSR is to be used for fast context-switching.
 421
 422 ## VL, MVL and SUBVL instruction aliases
 423
 424 This table contains pseudo-assembly instruction aliases. Note the
 425 subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
 426 reduced range of the 5 bit immediate.
 427
 428 | alias           | CSR                  |
 429 | -               | -                    |
 430 | SETVL rd, rs    | CSRRW  VL, rd, rs    |
 431 | SETVLi rd, #n   | CSRRWI VL, rd, #n-1  |
 432 | GETVL rd        | CSRRW  VL, rd, x0    |
 433 | SETMVL rd, rs   | CSRRW  MVL, rd, rs   |
 434 | SETMVLi rd, #n  | CSRRWI MVL,rd, #n-1  |
 435 | GETMVL rd       | CSRRW  MVL, rd, x0   |
 436
Note: CSRRC and other bit-setting operations may still be used; they are however not particularly useful (very obscure).
 438
 439 ## Register key-value (CAM) table <a name="regcsrtable" />
 440
*NOTE: in prior versions of SV, this table used to be writable and
accessible via CSRs. It is now stored in the VBLOCK instruction format. Note
that this table does **not** get applied to the SVPrefix P48/64 format,
only to scalar opcodes.*
 445
 446 The purpose of the Register table is three-fold:
 447
* To mark integer and floating-point registers as requiring "redirection"
  if they are ever used as a source or destination in any given operation.
  This involves a level of indirection through a 5-to-7-bit lookup table,
  such that **unmodified** operands with 5 bits (3 for some RVC ops) may
  access up to **128** registers.
 453 * To indicate whether, after redirection through the lookup table, the
 454   register is a vector (or remains a scalar).
 455 * To over-ride the implicit or explicit bitwidth that the operation would
 456   normally give the register.
 457
Note: clearly, if an RVC operation uses a 3 bit spec'd register (x8-x15)
and the Register table contains entries that only refer to registers
x1-x7 or x16-x31, such operations will *never* activate the VL hardware
loop!

If however the (16 bit) Register table does contain such an entry (x8-x15
or x2 in the case of LWSP), that src or dest reg may be redirected
anywhere to the *full* 128 register range. Thus, RVC becomes far more
powerful and has many more opportunities to reduce code size than in
Standard RV32/RV64 executables.
 468
 469 [[!inline raw="yes" pages="simple_v_extension/reg_table_format" ]]
 470
 471 i/f is set to "1" to indicate that the redirection/tag entry is to
 472 be applied to integer registers; 0 indicates that it is relevant to
 473 floating-point registers.
 474
 475 The 8 bit format is used for a much more compact expression. "isvec"
 476 is implicit and, similar to [[sv_prefix_proposal]], the target vector
 477 is "regnum<<2", implicitly. Contrast this with the 16-bit format where
 478 the target vector is *explicitly* named in bits 8 to 14, and bit 15 may
 479 optionally set "scalar" mode.
 480
 481 Note that whilst SVPrefix adds one extra bit to each of rd, rs1 etc.,
 482 and thus the "vector" mode need only shift the (6 bit) regnum by 1 to
 483 get the actual (7 bit) register number to use, there is not enough space
 484 in the 8 bit format (only 5 bits for regnum) so "regnum<<2" is required.
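
The redirection described in this section can be sketched as a key-value search. The `RegEntry` tuple, its field names and both functions are hypothetical, modelled on the prose rather than on the exact bit layout of the table format.

```python
from collections import namedtuple

# One table entry: the key is (is_int, regkey); the rest is the redirection.
RegEntry = namedtuple("RegEntry", "is_int regkey regidx isvec vew")

def lookup(table, is_int, regnum):
    """Return (actual 7-bit register, isvec, vew) for a 5-bit operand."""
    for e in table:                        # CAM: search on the key
        if e.is_int == is_int and e.regkey == regnum:
            return e.regidx, e.isvec, e.vew
    return regnum, False, 0                # no entry: ordinary scalar register

def target_8bit(regnum):
    """8 bit format: the target vector is implicitly regnum<<2."""
    return (regnum & 0x1F) << 2

table = [RegEntry(is_int=True, regkey=8, regidx=100, isvec=True, vew=0)]
print(lookup(table, True, 8))              # x8 is redirected to register 100
print(lookup(table, True, 9))              # x9 has no entry and stays scalar
```

With a 5 bit regnum, `regnum<<2` reaches register 124 at most, which is how the compact format still spans the 128-entry register range.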
 485
 486 vew has the following meanings, indicating that the instruction's
 487 operand size is "over-ridden" in a polymorphic fashion:
 488
 489 | vew | bitwidth            |
 490 | --- | ------------------- |
 491 | 00  | default (XLEN/FLEN) |
 492 | 01  | 8 bit               |
 493 | 10  | 16 bit              |
 494 | 11  | 32 bit              |
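
As a small illustration of the table above, the vew field can be decoded as follows. `VEW_BITS` and `element_width` are hypothetical names, and `default_bits` stands for XLEN or FLEN depending on the operation.

```python
VEW_BITS = {0b00: None, 0b01: 8, 0b10: 16, 0b11: 32}   # None = keep the default

def element_width(vew, default_bits):
    """Return the element width in bits selected by a 2-bit vew field."""
    override = VEW_BITS[vew & 0b11]
    return default_bits if override is None else override
```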
 495
 496 As the above table is a CAM (key-value store) it may be appropriate
 497 (faster, implementation-wise) to expand it as follows:
 498
 499 [[!inline raw="yes" pages="simple_v_extension/reg_table" ]]
 500
 501 ## Predication Table <a name="predication_csr_table"></a>
 502
 503 *NOTE: in prior versions of SV, this table used to be writable and
 504 accessible via CSRs. It is now stored in the VBLOCK instruction format.
 505 The table does **not** apply to SVPrefix opcodes*
 506
 507 The Predication Table is a key-value store indicating whether, if a
 508 given destination register (integer or floating-point) is referred to
 509 in an instruction, it is to be predicated. Like the Register table, it
 510 is an indirect lookup that allows the RV opcodes to not need modification.
 511
 512 It is particularly important to note
 513 that the *actual* register used can be *different* from the one that is
 514 in the instruction, due to the redirection through the lookup table.
 515
* regidx is the lookup key.  In combination with the i/f flag, it
  names the integer or floating-point register which, when referred to
  in a (standard RV) instruction, causes the lookup table to be
  referenced to find the predication mask to use for this operation.
 520 * predidx is the *actual* (full, 7 bit) register to be used for the
 521   predication mask.
 522 * inv indicates that the predication mask bits are to be inverted
 523   prior to use *without* actually modifying the contents of the
 524   register from which those bits originated.
 525 * zeroing is either 1 or 0, and if set to 1, the operation must
 526   place zeros in any element position where the predication mask is
 527   set to zero.  If zeroing is set to 0, unpredicated elements *must*
 528   be left alone.  Some microarchitectures may choose to interpret
 529   this as skipping the operation entirely.  Others which wish to
 530   stick more closely to a SIMD architecture may choose instead to
 531   interpret unpredicated elements as an internal "copy element"
 532   operation (which would be necessary in SIMD microarchitectures
 533   that perform register-renaming)
 534 * ffirst is a special mode that stops sequential element processing when
 535   a data-dependent condition occurs, whether a trap or a conditional test.
 536   The handling of each (trap or conditional test) is slightly different:
 537   see Instruction sections for further details
 538
 539 [[!inline raw="yes" pages="simple_v_extension/pred_table_format" ]]
 540
The 8 bit format is a compact and less expressive variant of the full
16 bit format.  Using the 8 bit format is very different: the predicate
register to use is implicit, and numbering begins implicitly from x9. The
regnum is still used to "activate" predication, in the same fashion as
described above.
 546
 547 The 16 bit Predication CSR Table is a key-value store, so
 548 implementation-wise it will be faster to turn the table around (maintain
 549 topologically equivalent state).  Opportunities then exist to access
 550 registers in unary form instead of binary, saving gates and power by
 551 only activating "redirection" with a single AND gate, instead of
 552 multiple multi-bit XORs (a CAM):
 553
 554 [[!inline raw="yes" pages="simple_v_extension/pred_table" ]]
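
"Turning the table around" can be sketched as building two directly indexed arrays from the key-value entries. The function name and the dictionary keys are hypothetical, and `None` stands in for a cleared enable bit.

```python
def invert_pred_table(entries, nregs=32):
    """Convert key-value predication entries into directly indexed tables.

    entries: iterable of dicts with regidx, is_int, predidx, inv, zeroing.
    Returns (int_table, fp_table), each indexed by the 5-bit register number.
    """
    int_tb = [None] * nregs     # None plays the role of a cleared enable bit
    fp_tb = [None] * nregs
    for e in entries:
        tb = int_tb if e["is_int"] else fp_tb
        tb[e["regidx"]] = {k: e[k] for k in ("predidx", "inv", "zeroing")}
    return int_tb, fp_tb

int_tb, fp_tb = invert_pred_table(
    [{"regidx": 5, "is_int": True, "predidx": 70, "inv": False, "zeroing": True}])
```

A lookup is then a single index operation on the register number instead of a comparison against every key, which is the software analogue of replacing the multi-bit XORs of a CAM with one enable line per register.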
 555
 556 So when an operation is to be predicated, it is the internal state that
 557 is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
 558 pseudo-code for operations is given, where p is the explicit (direct)
 559 reference to the predication register to be used:
 560
    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
           (d ? vreg[rd][i] : sreg[rd]) =
            iop(s1 ? vreg[rs1][i] : sreg[rs1],
                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
 566
 567 This instead becomes an *indirect* reference using the *internal* state
 568 table generated from the Predication CSR key-value store, which is used
 569 as follows.
 570
    if type(iop) == INT:
        preg = int_pred_reg[rd]
    else:
        preg = fp_pred_reg[rd]

    predicate, zeroing = get_pred_val(type(iop) == INT, rd)
    for (int i=0; i<vl; ++i)
        if (predicate & (1<<i))   # bitwise test of predicate bit i
           result = iop(s1 ? regfile[rs1+i] : regfile[rs1],
                        s2 ? regfile[rs2+i] : regfile[rs2]);
           (d ? regfile[rd+i] : regfile[rd]) = result
           if preg.ffirst and result == 0:
              VL = i # result was zero, end loop early, return VL
              return
        else if (zeroing)
           (d ? regfile[rd+i] : regfile[rd]) = 0
 587
 588 Note:
 589
 590 * d, s1 and s2 are booleans indicating whether destination,
 591   source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity.  rd, rs1 and rs2 must ALSO all go through
  register-level redirection (from the Register table) if they are
  vectors.
 596 * fail-on-first mode stops execution early whenever an operation
 597   returns a zero value.  floating-point results count both
 598   positive-zero as well as negative-zero as "fail".
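
For readers who prefer something executable, the predicated loop can be written in Python roughly as below. It is a sketch that follows the pseudocode above, with hypothetical names and a flat list standing in for the register file; register redirection and element widths are again left out.

```python
def predicated_op(regfile, iop, rd, rs1, rs2, vl, mask,
                  d=True, s1=True, s2=True, zeroing=False, ffirst=False):
    """Apply iop element by element; return the (possibly truncated) VL."""
    for i in range(vl):
        dest = rd + i if d else rd
        if (mask >> i) & 1:
            a = regfile[rs1 + i] if s1 else regfile[rs1]
            b = regfile[rs2 + i] if s2 else regfile[rs2]
            result = iop(a, b)
            regfile[dest] = result
            if ffirst and result == 0:
                return i              # fail-on-first: VL is truncated to i
        elif zeroing:
            regfile[dest] = 0         # masked-out element is zeroed
    return vl

regfile = [0] * 128
regfile[16:20] = [1, 2, 3, 4]
regfile[24:28] = [1, 2, 0, 4]
new_vl = predicated_op(regfile, lambda a, b: a * b, 8, 16, 24,
                       vl=4, mask=0b1111, ffirst=True)
```

In this run the third product is zero, so the loop stops there and `new_vl` is 2; the fourth element is never computed.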
 599
 600 If written as a function, obtaining the predication mask (and whether
 601 zeroing takes place) may be done as follows:
 602
 603 [[!inline raw="yes" pages="simple_v_extension/get_pred_value" ]]
 604
 605 Note here, critically, that **only** if the register is marked
 606 in its **register** table entry as being "active" does the testing
 607 proceed further to check if the **predicate** table entry is
 608 also active.
 609
Note also that this is in direct contrast to branch operations
for the storage of comparisons: in these specific circumstances
the requirement for there to be an active *register* entry
is removed.
 614
 615 ## Fail-on-First Mode <a name="ffirst-mode"></a>
 616
 617 ffirst is a special data-dependent predicate mode.  There are two
 618 variants: one is for faults: typically for LOAD/STORE operations,
 619 which may encounter end of page faults during a series of operations.
 620 The other variant is comparisons such as FEQ (or the augmented behaviour
 621 of Branch), and any operation that returns a result of zero (whether
 622 integer or floating-point).  In the FP case, this includes negative-zero.
 623
ffirst interacts with zeroing and non-zeroing predication.  In non-zeroing
mode, masked-out operations are simply excluded from testing (can never
fail).  However for fail-comparisons (not faults) in zeroing mode, the
result will be zero: this *always* "fails", thus on the very first
masked-out element ffirst will always terminate.
 629
Note that ffirst mode works because the execution order must "appear" to be
in "program order".  An in-order architecture must execute the element
operations in sequence, whilst an out-of-order architecture must *commit*
the element operations in sequence and cancel speculatively-executed
ones (giving the appearance of in-order execution).
 635
 636 Note also, that if ffirst mode is needed without predication, a special
 637 "always-on" Predicate Table Entry may be constructed by setting
 638 inverse-on and using x0 as the predicate register.  This
 639 will have the effect of creating a mask of all ones, allowing ffirst
 640 to be set.
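
The "always-on" trick can be checked with a short sketch. `predicate_mask` is a hypothetical helper, and it relies only on the fact that x0 always reads as zero.

```python
def predicate_mask(regs, predidx, inv, vl, xlen=64):
    """Return the predicate bits that apply to elements 0..VL-1."""
    bits = regs[predidx]
    if inv:
        bits = ~bits                           # inversion does not modify the register
    return bits & ((1 << min(vl, xlen)) - 1)   # keep only the VL active bits

regs = [0] * 32                                # regs[0] models x0
always_on = predicate_mask(regs, predidx=0, inv=True, vl=5)
```

Inverting the zero held in x0 gives a mask with every active bit set, so every element is enabled while the table entry still exists to carry the ffirst flag.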
 641
See [[appendix]] for more details on fail-on-first modes, including
pseudo-code.
 644
 645 ## REMAP and SHAPE CSRs <a name="remap" />
 646
 647 See optional [[remap]] section.
 648
 649 # Instruction Execution Order
 650
 651 Simple-V behaves as if it is a hardware-level "macro expansion system",
 652 substituting and expanding a single instruction into multiple sequential
 653 instructions with contiguous and sequentially-incrementing registers.
 654 As such, it does **not** modify - or specify - the behaviour and semantics of
 655 the execution order: that may be deduced from the **existing** RV
 656 specification in each and every case.
 657
 658 So for example if a particular micro-architecture permits out-of-order
 659 execution, and it is augmented with Simple-V, then wherever instructions
 660 may be executed out-of-order, so may the "post-expansion" SV ones.
 661
 662 If on the other hand there are memory guarantees which specifically
 663 prevent and prohibit certain instructions from being re-ordered
 664 (such as the Atomicity Axiom, or FENCE constraints), then clearly
 665 those constraints **MUST** also be obeyed "post-expansion".
 666
 667 It should be absolutely clear that SV is **not** about providing new
 668 functionality or changing the existing behaviour of a micro-architectural
 669 design, or about changing the RISC-V Specification.
 670 It is **purely** about compacting what would otherwise be contiguous
 671 instructions that use sequentially-increasing register numbers down
 672 to the **one** instruction.
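
The "macro expansion" view can be illustrated with a short sketch. It is deliberately simplified: predication, element widths, SUBVL and any special handling of a scalar destination are omitted, and the function name `expand` and the `tagged` set are hypothetical names introduced here for illustration.

```python
def expand(opcode, rd, rs1, rs2, vl, tagged):
    """Return the scalar instructions that one SV instruction stands for.

    `tagged` is the set of register numbers marked as vectors.
    Tagged registers increment per element; untagged ones stay fixed.
    """
    regs = (rd, rs1, rs2)
    if not any(r in tagged for r in regs):
        # No tagged registers: an ordinary scalar instruction.
        return [(opcode, rd, rs1, rs2)]
    return [
        (opcode,) + tuple(r + i if r in tagged else r for r in regs)
        for i in range(vl)
    ]


for insn in expand("add", 8, 16, 24, vl=3, tagged={8, 16, 24}):
    print(insn)
```

The expanded sequence is exactly what a programmer could have written by hand as consecutive scalar instructions, which is why the existing ordering rules apply to it unchanged.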
 673
 674 # Instructions <a name="instructions" />
 675
 676 See [[appendix]] for specific cases where instruction behaviour is
 677 augmented.  A greatly simplified example is below.  Note that this
 678 is the ADD implementation, not a separate VADD instruction:
 679
 680 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
 681
 682 Note that several things have been left out of this example.
 683 See [[appendix]] for additional examples that show how to add
 684 support for additional features (twin predication, elwidth,
 685 zeroing, SUBVL etc.)
 686
 687 Branches in particular have been transparently augmented to include
 688 "collation" of comparison results into a tagged register.
 689
 690 # Exceptions
 691
 692 Exceptions may occur at any time, in any given underlying scalar
 693 operation.  This implies that context-switching (traps) may occur, and
 694 execution must then resume where it left off.  That in turn implies
 695 that the full state - including the current parallel element being
 696 processed - has to be saved and restored.  This is what the **STATE**
 697 and **PCVBLK** CSRs are for.
 698
 699 The implications are that all underlying individual scalar operations
 700 "issued" by the parallelisation have to appear to be executed sequentially.
 701 The further implications are that if two or more individual element
 702 operations are underway, and one with an earlier index causes an exception,
 703 it will be necessary for the microarchitecture to **discard** or terminate
 704 operations with higher indices.  Optimised microarchitectures could
 705 hypothetically store (cache) results, for subsequent replay if appropriate.
 706
 707 In short: exception handling **MUST** be precise, in-order, and exactly
 708 like Standard RISC-V as far as the instruction execution order is
 709 concerned, regardless of whether it is PC, PCVBLK, VL or SUBVL that
 710 is currently being incremented.
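
The requirement to resume at the interrupted element can be modelled as a loop whose index lives in saved state rather than in a local variable. This is a sketch under stated assumptions: `ElementTrap`, `run_vector_op` and the `"element"` key are hypothetical stand-ins, and the real layout of the STATE CSR is not reproduced here.

```python
class ElementTrap(Exception):
    """Hypothetical: raised when an element operation faults."""


def run_vector_op(element_op, vl, state):
    """Run element operations in order, resuming from saved state."""
    i = state.get("element", 0)
    while i < vl:
        try:
            element_op(i)
        except ElementTrap:
            # Record the faulting element so that, after the trap is
            # handled, execution resumes here and not at element 0.
            state["element"] = i
            raise
        i += 1
    state["element"] = 0  # finished: the next instruction starts afresh


log = []
pending_fault = [True]


def op(i):
    if i == 2 and pending_fault[0]:
        pending_fault[0] = False  # the "trap handler" fixes the cause
        raise ElementTrap()
    log.append(i)


state = {}
try:
    run_vector_op(op, 4, state)
except ElementTrap:
    pass
run_vector_op(op, 4, state)  # resumes at element 2
print(log)
```

Elements 0 and 1 are not re-executed on the second call, which mirrors the precise, in-order behaviour required above.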
 711
 712 # Hints
 713
 714 A "HINT" is an operation that has no effect on architectural state,
 715 where its use may, by agreed convention, give advance notification
 716 to the microarchitecture: branch prediction notification would be
 717 a good example.  HINTs are usually encoded as operations where rd=x0.
 718
 719 With Simple-V being capable of issuing *parallel* instructions where
 720 rd=x0, the space for possible HINTs is expanded considerably.  VL
 721 could be used to indicate different hints.  In addition, if predication
 722 is set, the predication register itself could hypothetically be passed
 723 in as a *parameter* to the HINT operation.
 724
 725 No specific hints are yet defined in Simple-V.
 726
 727 # Vector Block Format <a name="vliw-format"></a>
 728
 729 The VBLOCK Format allows Register, Predication and Vector Length to be contextually associated with a group of RISC-V scalar opcodes.  The format is as follows:
 730
 731 [[!inline raw="yes" pages="simple_v_extension/vblock_format_table" ]]
 732
 733 For more details, including the CSRs, see ancillary resource: [[vblock_format]]
 734
 735 # Under consideration <a name="issues"></a>
 736
 737 See [[discussion]]
 738