1 # RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
2
3 Credits and acknowledgements:
4
5 * Luke Leighton
6 * Jacob Lifshay
7 * Hendrik Boom
8 * Richard Wilbur
9 * Alexandre Oliva
10 * Cesar Strauss
11 * NLnet Foundation, for funding
12 * OpenPOWER Foundation
13 * Paul Mackerras
14 * Toshaan Bharvani
15 * IBM for the Power ISA itself
16
17 Links:
18
19 * <https://bugs.libre-soc.org/show_bug.cgi?id=1045>
20
21 # Introduction
22
23 Simple-V is a type of Vectorisation best described as a "Prefix Loop Subsystem"
24 similar to the Z80 `LDIR` instruction and to the x86 `REP` Prefix instruction.
25 More advanced features are similar to the Z80 `CPIR` instruction. If viewed
26 as an actual Vector ISA it introduces over 1.5 million 64-bit Vector instructions.
27 SVP64, the instruction format, is therefore best viewed as an orthogonal
28 RISC-style "Prefixing" subsystem instead.
29
30 Except where explicitly stated all bit numbers remain as in the rest of the Power ISA:
31 in MSB0 form (the bits are numbered from 0 at the MSB on the left
32 and counting up as you move rightwards to the LSB end). All bit ranges are inclusive
33 (so `4:6` means bits 4, 5, and 6, in MSB0 order). **All register numbering and
34 element numbering however is LSB0 ordering** which is a different convention from that used
35 elsewhere in the Power ISA.
36
37 The SVP64 prefix always comes before the suffix in PC order and must be considered
38 an independent "Defined word" that augments the behaviour of the following instruction,
39 but does **not** change the actual Decoding of that following instruction.
40 **All prefixed instructions retain their non-prefixed encoding and definition**.
41
42 *Architectural Resource Allocation note: it is prohibited to accept RFCs which
43 fundamentally violate this hard requirement. Under no circumstances must the
44 Suffix space have an alternate instruction encoding allocated within SVP64 that is
45 entirely different from the non-prefixed Defined Word. Hardware Implementors
46 critically rely on this inviolate guarantee to implement High-Performance Multi-Issue
47 micro-architectures that can sustain 100% throughput*
48
49 | 0:5 | 6:31 | 32:63 |
50 |--------|--------------|--------------|
51 | EXT09 | v3.1 Prefix | v3.0/1 Suffix |
52
53 Subset implementations in hardware are permitted, as long as certain
54 rules are followed, allowing for full soft-emulation including future
55 revisions. Compliancy Subsets exist to ensure minimum levels of binary
56 interoperability expectations within certain environments.
57
58 ## Register files, elements, and Element-width Overrides
59
60 In the Upper Compliancy Levels the GPR and FPR Register files are expanded
61 from 32 to 128 entries, and the number of CR Fields is expanded from CR0-CR7 to CR0-CR127.
62
63 Memory access remains exactly the same: the effects of `MSR.LE` remain exactly the same,
64 applying as they already do **only** to the byte-order of Load and Store
65 memory-register operations, and having nothing to do with the
66 ordering of the contents of register files or register-register operations.
67
68 Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered and
69 sequentially numbered, element offset numbering is naturally
70 **LSB0-sequentially-incrementing from zero, not MSB0-incrementing.** Expressed exclusively in
71 MSB0-numbering, SVP64 is unnecessarily complex to understand: the required
72 subtractions from 63, 31, 15 and 7 unfortunately become a hostile minefield.
73 Therefore for the purposes of this section the more natural
74 **LSB0 numbering is assumed** and it is up to the reader to translate to MSB0 numbering.
75
76 The Canonical specification for how element-sequential numbering and element-width
77 overrides are defined is expressed in the following C structure, assuming a Little-Endian
78 system, and naturally using LSB0 numbering everywhere because the ANSI C specification
79 is inherently LSB0:
80
```
#include <stdint.h>

#pragma pack
typedef union {
    uint8_t  b[8]; // elwidth 8
    uint16_t s[4]; // elwidth 16
    uint32_t i[2]; // elwidth 32
    uint64_t l[1]; // elwidth 64
    uint8_t  actual_bytes[8];
} el_reg_t;

// the regfile is contiguous: element numbering deliberately carries on
// into subsequent registers when the element index exceeds one entry
el_reg_t int_regfile[128];

void get_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: el->l[0] = int_regfile[gpr].l[element]; break;
        case 32: el->i[0] = int_regfile[gpr].i[element]; break;
        case 16: el->s[0] = int_regfile[gpr].s[element]; break;
        case 8 : el->b[0] = int_regfile[gpr].b[element]; break;
    }
}
void set_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: int_regfile[gpr].l[element] = el->l[0]; break;
        case 32: int_regfile[gpr].i[element] = el->i[0]; break;
        case 16: int_regfile[gpr].s[element] = el->s[0]; break;
        case 8 : int_regfile[gpr].b[element] = el->b[0]; break;
    }
}
```
110
111 Example Vector-looped add operation implementation when elwidths are 64-bit:
112
```
# add RT, RA, RB using the "uint64_t" union member, "l"
for i in range(VL):
    int_regfile[RT].l[i] = int_regfile[RA].l[i] + int_regfile[RB].l[i]
```
118
119 However if elwidth overrides are set to 16 for both source and destination:
120
```
# add RT, RA, RB using the "uint16_t" union member, "s"
for i in range(VL):
    int_regfile[RT].s[i] = int_regfile[RA].s[i] + int_regfile[RB].s[i]
```
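
As an illustration only (not part of the specification), the following short sketch shows where a given element lands in the underlying 64-bit registers under the union model above:

```
# Illustrative only: compute where element 'n' of a GPR vector lives,
# given the union model above (contiguous regfile, Little-Endian canonical layout).
def element_location(gpr, n, elwidth_bits):
    byte_index = n * (elwidth_bits // 8)    # byte offset from the start of gpr
    underlying_reg = gpr + byte_index // 8  # elements spill into the next register
    byte_offset = byte_index % 8            # position within that 64-bit register
    return underlying_reg, byte_offset

# example: with elwidth=16, element 5 of a vector starting at r4 sits in r5
# (bytes 2 and 3), because four 16-bit elements fill r4 first.
print(element_location(4, 5, 16))  # -> (5, 2)
```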
126
127 Hardware Architectural note: to avoid a Read-Modify-Write at the register file it is
128 strongly recommended to implement byte-level write-enable lines exactly as has been
129 implemented in DRAM ICs for many decades. Additionally the predicate mask bit is advised
130 to be associated with the element operation and alongside the result ultimately
131 passed to the register file.
132 When element-width is set to 64-bit the relevant predicate mask bit may be repeated
133 eight times and pull all eight write-port byte-level lines HIGH. Clearly when element-width
134 is set to 8-bit the relevant predicate mask bit corresponds directly with one single
135 byte-level write-enable line. It is up to the Hardware Architect to then amortise (merge)
136 elements together into both PredicatedSIMD Pipelines as well as simultaneous non-overlapping
137 Register File writes, to achieve High Performance designs.
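
As an illustration of the byte-level write-enable scheme described above (a sketch only, not a normative requirement, assuming LSB0 byte numbering within the 64-bit write port):

```
# Sketch: expand one predicate mask bit into per-byte write-enable lines
# for a 64-bit-wide register file write port.
def byte_write_enables(pred_bit, elwidth_bits, byte_offset=0):
    nbytes = elwidth_bits // 8                # 8, 4, 2 or 1 bytes per element
    lanes = (1 << nbytes) - 1 if pred_bit else 0
    return lanes << byte_offset               # e.g. elwidth=64: 0b11111111

print(bin(byte_write_enables(1, 64)))                 # 0b11111111: all eight lanes HIGH
print(bin(byte_write_enables(1, 8, byte_offset=3)))   # 0b1000: one single lane
```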
138
139 ## SVP64 encoding features
140
141 A number of features need to be compacted into a very small space of only 24 bits:
142
143 * Independent per-register Scalar/Vector tagging and range extension on every register
144 * Element width overrides on both source and destination
145 * Predication on both source and destination
146 * Two different sources of predication: INT and CR Fields
147 * SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and
148 predicate-result mode.
149
150 Different classes of operations require different formats. The following sections cover
151 the common formats and the four separate modes: CR operations (crops),
152 Arithmetic/Logical (termed "normal"), Load/Store and Branch-Conditional.
153
154 ## Definition of Reserved in this spec.
155
156 For the new fields added in SVP64, instructions that have any of their
157 fields set to a reserved value must cause an illegal instruction trap,
158 to allow emulation of future instruction sets, or for subsets of SVP64
159 to be implemented in hardware and the rest emulated.
160 This includes SVP64 SPRs: reading or writing values which are not
161 supported in hardware must also raise illegal instruction traps
162 in order to allow emulation.
163 Unless otherwise stated, reserved values are always all zeros.
164
165 This is unlike OpenPower ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero. Where the standard Power ISA definition
166 is intended the red keyword `RESERVED` is used.
167
168 ## Definition of "UnVectoriseable"
169
170 Any operation that inherently makes no sense if repeated is termed "UnVectoriseable"
171 or "UnVectorised". Examples include `sc` or `sync` which have no registers. `mtmsr` is
172 also classed as UnVectoriseable because there is only one `MSR`.
173
174 ## Scalar Identity Behaviour
175
176 SVP64 is designed so that when the prefix is all zeros, and
177 VL=1, no effect or
178 influence occurs (no augmentation) such that all standard Power ISA
179 v3.0/v3.1 instructions covered by the prefix are "unaltered". This is termed `scalar identity behaviour` (based on the mathematical definition for "identity", as in, "identity matrix" or better "identity transformation").
180
181 Note that this is completely different from when VL=0. VL=0 turns all operations under its influence into `nops` (regardless of the prefix)
182 whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction (an "identity transformation").
183
184 ## Register Naming and size
185
186 As previously mentioned SV Registers are simply the INT, FP and CR register files extended
187 linearly to larger sizes; SV Vectorisation iterates sequentially through these registers
188 (LSB0 sequential ordering from 0 to VL-1).
189
190 Where the integer regfile in standard scalar
191 Power ISA v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127.
192 Likewise FP registers are extended to 128 (fp0 to fp127), and CR Fields
193 are
194 extended to 128 entries, CR0 thru CR127.
195
196 The names of the registers therefore reflect a simple linear extension
197 of the Power ISA v3.0B / v3.1B register naming, and in hardware this
198 would be reflected by a linear increase in the size of the underlying
199 SRAM used for the regfiles.
200
201 Note: when an EXTRA field (defined below) is zero, SV is deliberately designed
202 so that the register fields are identical to as if SV was not in effect
203 i.e. under these circumstances (EXTRA=0) the register field names RA,
204 RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. This is part of
205 `scalar identity behaviour` described above.
206
207 ## Future expansion.
208
209 With the way that EXTRA fields are defined and applied to register fields,
210 future versions of SV may involve 256 or greater registers. Backwards binary compatibility may be achieved with a PCR bit (Program Compatibility Register). Further discussion is out of scope for this version of SVP64.
211
212 --------
213
214 \newpage{}
215
216 # Remapped Encoding (`RM[0:23]`)
217
218 To allow relatively easy remapping of which portions of the Prefix Opcode
219 Map are used for SVP64 without needing to rewrite a large portion of the
220 SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to
221 a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]`
222 at the LSB.
223
224 The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
225 is defined in the Prefix Fields section.
226
227 ## Prefix Fields
228
229 TODO incorporate EXT09
230
231 To "activate" svp64 (in a way that does not conflict with v3.1B 64 bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
232 (see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
233 This is achieved by setting bits 7 and 9 to 1:
234
235 | Name | Bits | Value | Description |
236 |------------|---------|-------|--------------------------------|
237 | EXT01 | `0:5` | `1` | Indicates Prefixed 64-bit |
238 | `RM[0]` | `6` | | Bit 0 of Remapped Encoding |
239 | SVP64_7 | `7` | `1` | Indicates this is SVP64 |
240 | `RM[1]` | `8` | | Bit 1 of Remapped Encoding |
241 | SVP64_9 | `9` | `1` | Indicates this is SVP64 |
242 | `RM[2:23]` | `10:31` | | Bits 2-23 of Remapped Encoding |
243
244 Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
245 are constructed:
246
247 | 0:5 | 6 | 7 | 8 | 9 | 10:31 |
248 |--------|-------|---|-------|---|----------|
249 | EXT01 | RM | 1 | RM | 1 | RM |
250 | 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
251
252 Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1
253 instruction. That instruction becomes "prefixed" with the SVP context: the
254 Remapped Encoding field (RM).
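
As an illustration (not normative), the 32-bit prefix word may be assembled from the 24-bit Remapped Encoding as follows, using conventional LSB0 shifts on a plain 32-bit integer (MSB0 bit 0 corresponds to LSB0 bit 31):

```
# Illustrative sketch: build the 32-bit SVP64 prefix from RM[0:23] (MSB0 names),
# using LSB0 arithmetic on a plain 32-bit integer.
def svp64_prefix(rm):                  # rm: 24-bit Remapped Encoding, RM[0] is its MSB
    assert 0 <= rm < (1 << 24)
    rm0    = (rm >> 23) & 1            # RM[0]    -> prefix bit 6 (MSB0)
    rm1    = (rm >> 22) & 1            # RM[1]    -> prefix bit 8 (MSB0)
    rm2_23 = rm & ((1 << 22) - 1)      # RM[2:23] -> prefix bits 10:31 (MSB0)
    word  = 0b000001 << 26             # EXT01, MSB0 bits 0:5
    word |= rm0 << 25                  # MSB0 bit 6
    word |= 1 << 24                    # SVP64_7 = 1, MSB0 bit 7
    word |= rm1 << 23                  # MSB0 bit 8
    word |= 1 << 22                    # SVP64_9 = 1, MSB0 bit 9
    word |= rm2_23                     # MSB0 bits 10:31
    return word

print(hex(svp64_prefix(0)))  # 0x5400000: all-zeros RM ("scalar identity" prefix)
```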
255
256 It is important to note that unlike v3.1 64-bit prefixed instructions
257 there is insufficient space in `RM` to provide identification of
258 any SVP64 Fields without first partially decoding the
259 32-bit suffix. Similar to the "Forms" (X-Form, D-Form) the
260 `RM` format is individually associated with every instruction.
261
262 Extreme caution and care must therefore be taken
263 when extending SVP64 in future, to not create unnecessary relationships
264 between prefix and suffix that could complicate decoding, adding latency.
265
266 # Common RM fields
267
268 The following fields are common to all Remapped Encodings:
269
270 | Field Name | Field bits | Description |
271 |------------|------------|----------------------------------------|
272 | MASKMODE | `0` | Execution (predication) Mask Kind |
273 | MASK | `1:3` | Execution Mask |
274 | SUBVL | `8:9` | Sub-vector length |
275
276 The following fields are optional or encoded differently depending
277 on context after decoding of the Scalar suffix:
278
279 | Field Name | Field bits | Description |
280 |------------|------------|----------------------------------------|
281 | ELWIDTH | `4:5` | Element Width |
282 | ELWIDTH_SRC | `6:7` | Element Width for Source |
283 | EXTRA | `10:18` | Register Extra encoding |
284 | MODE | `19:23` | changes Vector behaviour |
285
286 * MODE changes the behaviour of the SV operation (result saturation, mapreduce)
287 * SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work
288 * ELWIDTH and ELWIDTH_SRC override the instruction's destination and source operand widths
289 * MASK (and MASK_SRC) and MASKMODE provide predication (two types of sources: scalar INT and Vector CR).
290 * Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix.
291
292 Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag.
293
294 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
295
296 # Mode
297
298 Mode is an augmentation of SV behaviour. Different types of
299 instructions have different needs, similar to how the Power ISA
300 v3.1 64-bit prefix 8LS and MTRR formats apply to different
301 instruction types. Modes include Reduction, Iteration, arithmetic
302 saturation, and Fail-First. More specific details are given in each
303 section and in the [[svp64/appendix]].
304
305 * For condition register operations see [[sv/cr_ops]]
306 * For LD/ST Modes, see [[sv/ldst]].
307 * For Branch modes, see [[sv/branches]]
308 * For arithmetic and logical, see [[sv/normal]]
309
310 # ELWIDTH Encoding
311
312 Default behaviour is set to 0b00 so that zeros follow the convention of
313 `scalar identity behaviour`. In this case it means that elwidth overrides
314 are not applicable. Thus if a 32-bit instruction operates on 32-bit values,
315 `elwidth=0b00` specifies that this behaviour is unmodified. Likewise
316 when a processor is switched from 64 bit to 32 bit mode, `elwidth=0b00`
317 states that, again, the behaviour is not to be modified.
318
319 Only when elwidth is nonzero is the element width overridden to the
320 explicitly required value.
321
322 ## Elwidth for Integers:
323
324 | Value | Mnemonic | Description |
325 |-------|----------------|------------------------------------|
326 | 00 | DEFAULT | default behaviour for operation |
327 | 01 | `ELWIDTH=w` | Word: 32-bit integer |
328 | 10 | `ELWIDTH=h` | Halfword: 16-bit integer |
329 | 11 | `ELWIDTH=b` | Byte: 8-bit integer |
330
331 This encoding is chosen such that the element width in bits may be computed as
332 `8<<(3-ew)` (and the width in bytes as `1<<(3-ew)`).
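
A small illustrative sketch of that relationship (`ew=0b00` is "default", which for integer operations works out to the full 64-bit register width):

```
# Illustrative: element width in bits for the 2-bit integer ELWIDTH field
def int_elwidth_bits(ew):
    return 8 << (3 - ew)          # ew=0 -> 64, 1 -> 32, 2 -> 16, 3 -> 8

assert [int_elwidth_bits(ew) for ew in range(4)] == [64, 32, 16, 8]
```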
333
334 ## Elwidth for FP Registers:
335
336 | Value | Mnemonic | Description |
337 |-------|----------------|------------------------------------|
338 | 00 | DEFAULT | default behaviour for FP operation |
339 | 01 | `ELWIDTH=f32` | 32-bit IEEE 754 Single floating-point |
340 | 10 | `ELWIDTH=f16` | 16-bit IEEE 754 Half floating-point |
341 | 11 | `ELWIDTH=bf16` | Reserved for `bf16` |
342
343 Note:
344 [`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
345 is reserved for a future implementation of SV
346
347 Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall
348 perform its operation at **half** the ELWIDTH, with the result then padded back out
349 to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
350 clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
351 then padded back out to fit in IEEE754 FP64, exactly as for Scalar
352 v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16
353 or ELWIDTH=bf16 is reserved and must raise an illegal instruction
354 (IEEE754 FP8 or BF8 are not defined).
355
356 ## Elwidth for CRs:
357
358 Element-width overrides for CR Fields have no meaning. The bits
359 are therefore used for other purposes, or when Rc=1, the Elwidth
360 applies to the result being tested (a GPR or FPR), but not to the
361 Vector of CR Fields.
362
363 # SUBVL Encoding
364
365 The default for SUBVL is 1 and its encoding is 0b00, to indicate that
366 SUBVL is effectively disabled (a SUBVL for-loop of only one element). This
367 lines up in combination with all other "default is all zeros" behaviour.
368
369 | Value | Mnemonic | Subvec | Description |
370 |-------|-----------|---------|------------------------|
371 | 00 | `SUBVL=1` | single | Sub-vector length of 1 |
372 | 01 | `SUBVL=2` | vec2 | Sub-vector length of 2 |
373 | 10 | `SUBVL=3` | vec3 | Sub-vector length of 3 |
374 | 11 | `SUBVL=4` | vec4 | Sub-vector length of 4 |
375
376 The SUBVL encoding value may be thought of as an inclusive range of a
377 sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
378 this may be considered to be elements 0b00 to 0b01 inclusive.
379
380 # MASK/MASK_SRC & MASKMODE Encoding
381
382 TODO: rename MASK_KIND to MASKMODE
383
384 One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
385 types may not be mixed.
386
387 Special note: to disable predication this field must
388 be set to zero in combination with Integer Predication also being set
389 to 0b000. This has the effect of enabling "all 1s" in the predicate
390 mask, which is equivalent to "not having any predication at all"
391 and consequently, in combination with all other default zeros, fully
392 disables SV (`scalar identity behaviour`).
393
394 `MASKMODE` may be set to one of 2 values:
395
396 | Value | Description |
397 |-----------|------------------------------------------------------|
398 | 0 | MASK/MASK_SRC are encoded using Integer Predication |
399 | 1 | MASK/MASK_SRC are encoded using CR-based Predication |
400
401 Integer Twin predication has a second set of 3 bits that uses the same
402 encoding, thus allowing either the same register (r3, r10 or r30) to be used
403 for both src and dest, or different regs (one for src, one for dest).
404
405 Likewise CR based twin predication has a second set of 3 bits, allowing
406 a different test to be applied.
407
408 Note that it is assumed that Predicate Masks (whether INT or CR)
409 are read *before* the operations proceed. In practice (for CR Fields)
410 this creates an unnecessary block on parallelism. Therefore,
411 it is up to the programmer to ensure that the CR fields used as
412 Predicate Masks are not being written to by any parallel Vector Loop.
413 Doing so results in **UNDEFINED** behaviour, according to the definition
414 outlined in the Power ISA v3.0B Specification.
415
416 Hardware Implementations are therefore free and clear to delay reading
417 of individual CR fields until the actual predicated element operation
418 needs to take place, safe in the knowledge that no programmer will
419 have issued a Vector Instruction where previous elements could have
420 overwritten (destroyed) not-yet-executed CR-Predicated element operations.
421
422 ## Integer Predication (MASKMODE=0)
423
424 When the predicate mode bit is zero the 3 bits are interpreted as below.
425 Twin predication has an identical 3 bit field similarly encoded.
426
427 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
428
429 | Value | Mnemonic | Element `i` enabled if: |
430 |-------|----------|------------------------------|
431 | 000 | ALWAYS | predicate effectively all 1s |
432 | 001 | 1 << R3 | `i == R3` |
433 | 010 | R3 | `R3 & (1 << i)` is non-zero |
434 | 011 | ~R3 | `R3 & (1 << i)` is zero |
435 | 100 | R10 | `R10 & (1 << i)` is non-zero |
436 | 101 | ~R10 | `R10 & (1 << i)` is zero |
437 | 110 | R30 | `R30 & (1 << i)` is non-zero |
438 | 111 | ~R30 | `R30 & (1 << i)` is zero |
439
440 r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs.
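
The following sketch (illustrative only: the hypothetical `regs` mapping stands in for reading the named GPRs) evaluates the table above for a given element `i`:

```
# Illustrative: is element i enabled under the 3-bit integer MASK encoding?
def int_pred_enabled(mask, i, regs):
    if mask == 0b000: return True                     # ALWAYS (predicate all 1s)
    if mask == 0b001: return i == regs[3]             # 1 << R3
    reg = {0b01: 3, 0b10: 10, 0b11: 30}[mask >> 1]    # R3, R10 or R30
    bit = (regs[reg] >> i) & 1
    return bool(bit) ^ bool(mask & 1)                 # odd encodings invert (~R3 etc.)

regs = {3: 0b1010, 10: 0, 30: 0}                      # placeholder register values
print([int_pred_enabled(0b010, i, regs) for i in range(4)])  # [False, True, False, True]
```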
441
442 ## CR-based Predication (MASKMODE=1)
443
444 When the predicate mode bit is one the 3 bits are interpreted as below.
445 Twin predication has an identical 3 bit field similarly encoded.
446
447 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
448
449 | Value | Mnemonic | Element `i` is enabled if |
450 |-------|----------|--------------------------|
451 | 000 | lt | `CR[offs+i].LT` is set |
452 | 001 | nl/ge | `CR[offs+i].LT` is clear |
453 | 010 | gt | `CR[offs+i].GT` is set |
454 | 011 | ng/le | `CR[offs+i].GT` is clear |
455 | 100 | eq | `CR[offs+i].EQ` is set |
456 | 101 | ne | `CR[offs+i].EQ` is clear |
457 | 110 | so/un | `CR[offs+i].FU` is set |
458 | 111 | ns/nu | `CR[offs+i].FU` is clear |
459
460 CR-based predication. TODO: select alternate CR for twin predication? see
461 [[discussion]]. Overlap of the two CR-based predicates must be taken
462 into account, so the starting point for one of them must be suitably
463 high, or accept that for twin predication VL must not exceed the range
464 where overlap will occur, *or* that they use the same starting point
465 but select different *bits* of the same CRs.
466
467 `offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
468
469 The CR Predicates chosen must start on a boundary that Vectorised
470 CR operations can access cleanly, in full.
471 With EXTRA2 restricting starting points
472 to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and CR Predicate
473 Masks have to be adapted to fit on these boundaries as well.
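
As an illustration of the table above (not the canonical pseudocode), the following sketch returns which CR Field, which bit, and which polarity is tested for element `i`, using `offs` = 32 as defined above:

```
# Illustrative: which CR Field and bit the 3-bit CR MASK encoding tests for element i
CR_BITS = {0b00: "LT", 0b01: "GT", 0b10: "EQ", 0b11: "FU"}

def cr_pred_test(mask, i, offs=32):
    field = offs + i                 # CR Field number holding the predicate bit
    bit   = CR_BITS[mask >> 1]       # 00x: LT, 01x: GT, 10x: EQ, 11x: FU (SO)
    want_set = (mask & 1) == 0       # even encodings test "is set", odd "is clear"
    return field, bit, want_set

print(cr_pred_test(0b101, 3))  # (35, 'EQ', False): element 3 enabled if CR35.EQ is clear
```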
474
475 # Extra Remapped Encoding <a name="extra_remap"> </a>
476
477 Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants. Note that due to the very tight space, the encoding mode is *not* included in the prefix itself. The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*).
478
479 These mappings are part of the SVP64 Specification in exactly the same
480 way as X-Form, D-Form. New Scalar instructions added to the Power ISA
481 will need a corresponding SVP64 Mapping, which can be derived by-rote
482 from examining the Register "Profile" of the instruction.
483
484 There are two categories: Single and Twin Predication.
485 Due to space considerations further subdivision of Single Predication
486 is based on whether the number of src operands is 2 or 3. With only
487 9 bits available some compromises have to be made.
488
489 * `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
490 * `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest)
491 * `RM-2P-1S1D` Twin Predication (src=1, dest=1)
492 * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
493 * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
494
495 ## RM-1P-3S1D
496
497 | Field Name | Field bits | Description |
498 |------------|------------|----------------------------------------|
499 | Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
500 | Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
501 | Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
502 | Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
503 | EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS |
504
505 These are for 3 operand in and either 1 or 2 out instructions.
506 3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
507 such as `maddedu` have an implicit second destination, RS, the
508 selection of which is determined by bit 18.
509
510 ## RM-1P-2S1D
511
512 | Field Name | Field bits | Description |
513 |------------|------------|-------------------------------------------|
514 | Rdest\_EXTRA3 | `10:12` | extends Rdest |
515 | Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1 |
516 | Rsrc2\_EXTRA3 | `16:18`    | extends Rsrc2 |
517
518 These are for 2 operand 1 dest instructions, such as `add RT, RA,
519 RB`. However also included are unusual instructions with an implicit dest
520 that is identical to its src reg, such as `rlwimi`.
521
522 Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would not have sufficient bit fields to allow
523 an alternative destination. With SV however this becomes possible.
524 Therefore, the fact that the dest is implicitly also a src should not
525 mislead: due to the *prefix* they are different SV regs.
526
527 * `rlwimi RA, RS, ...`
528 * Rsrc1_EXTRA3 applies to RS as the first src
529 * Rsrc2_EXTRA3 applies to RA as the second src
530 * Rdest_EXTRA3 applies to RA to create an **independent** dest.
531
532 With the addition of the EXTRA bits, the three registers
533 each may be *independently* made vector or scalar, and be independently
534 augmented to 7 bits in length.
535
536 ## RM-2P-1S1D/2S
537
538 | Field Name | Field bits | Description |
539 |------------|------------|----------------------------|
540 | Rdest_EXTRA3 | `10:12` | extends Rdest |
541 | Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
542 | MASK_SRC | `16:18` | Execution Mask for Source |
543
544 `RM-2P-2S` is for `stw` etc., where both register operands are sources (Rsrc1, Rsrc2).
545
546 ## RM-1P-2S1D
547
548 single-predicate, three registers (2 read, 1 write)
549
550 | Field Name | Field bits | Description |
551 |------------|------------|----------------------------|
552 | Rdest_EXTRA3 | `10:12` | extends Rdest |
553 | Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
554 | Rsrc2_EXTRA3 | `16:18` | extends Rsrc2 |
555
556 ## RM-2P-2S1D/1S2D/3S
557
558 The primary purpose for this encoding is for Twin Predication on LOAD
559 and STORE operations. See [[sv/ldst]] for detailed analysis.
560
561 RM-2P-2S1D:
562
563 | Field Name | Field bits | Description |
564 |------------|------------|----------------------------|
565 | Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
566 | Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
567 | Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
568 | MASK_SRC | `16:18` | Execution Mask for Source |
569
570 Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
571 is in bits 10:11, Rdest1_EXTRA2 in 12:13).
572
573 Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
574
575 Note also that LD with update indexed, which takes 2 src and 2 dest
576 (e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
577 Twin Predication. Therefore these are treated as RM-2P-2S1D and the
578 src spec for RA is also used for the same RA as a dest.
579
580 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
581
582 # R\*\_EXTRA2/3
583
584 EXTRA is the means by which two things are achieved:
585
586 1. Registers are marked as either Vector *or Scalar*
587 2. Register field numbers (limited typically to 5 bit)
588 are extended in range, both for Scalar and Vector.
589
590 The register files are therefore extended:
591
592 * INT is extended from r0-31 to r0-127
593 * FP is extended from fp0-fp31 to fp0-fp127
594 * CR Fields are extended from CR0-7 to CR0-127
595
596 However due to pressure in `RM.EXTRA` not all these registers
597 are accessible by all instructions, particularly those with
598 a large number of operands (`madd`, `isel`).
599
600 In the following tables register numbers are constructed from the
601 standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2
602 or EXTRA3 field from the SV Prefix, determined by the specific
603 RM-xx-yyyy designation for a given instruction.
604 The prefixing is arranged so that
605 interoperability between prefixing and nonprefixing of scalar registers
606 is direct and convenient (when the EXTRA field is all zeros).
607
608 A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs):
609
```
if extra3_mode:
    spec = EXTRA3
elif EXTRA2[0]:          # EXTRA2, Vector bit set
    spec = EXTRA2 << 1   # same as EXTRA3, shifted
else:                    # EXTRA2, Scalar
    spec = EXTRA2
if spec[0]: # vector
    return (RA << 2) | spec[1:2]
else: # scalar
    return (spec[1:2] << 5) | RA
```
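
For illustration only (the pseudocode above and the tables below are definitive), here is a directly-executable rendering with two worked examples:

```
# Illustrative, directly-executable version of the pseudocode above
# (INT/FP registers only; CR Fields are decoded differently, see the appendix).
def decode_extra_reg(RA, extra, extra3_mode):
    if extra3_mode:
        spec = extra                   # 3-bit EXTRA3
    elif extra & 0b10:                 # EXTRA2, Vector bit set
        spec = extra << 1              # same as EXTRA3, shifted
    else:                              # EXTRA2, Scalar
        spec = extra
    if spec & 0b100:                   # Vector
        return "vector", (RA << 2) | (spec & 0b11)
    return "scalar", ((spec & 0b11) << 5) | RA

print(decode_extra_reg(5, 0b110, True))   # ('vector', 22): EXTRA3=0b110, RA=5 -> r22
print(decode_extra_reg(5, 0b01, False))   # ('scalar', 37): EXTRA2=0b01,  RA=5 -> r37
```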
620
621 Future versions may extend to 256 by shifting Vector numbering up.
622 Scalar will not be altered.
623
624 Note that in some cases the range of starting points for Vectors
625 is limited.
626
627 ## INT/FP EXTRA3
628
629 If EXTRA3 is zero, the register maps to
630 "scalar identity" (scalar Power ISA field naming).
631
632 Fields are as follows:
633
634 * Value: R_EXTRA3
635 * Mode: register is tagged as scalar or vector
636 * Range/Inc: the range of registers accessible from this EXTRA
637 encoding, and the "increment" (accessibility). "/4" means
638 that this EXTRA encoding may only give access (starting point)
639 every 4th register.
640 * MSB..LSB: the bit field showing how the register opcode field
641 combines with EXTRA to give (extend) the register number (GPR)
642
643 | Value | Mode | Range/Inc | 6..0 |
644 |-----------|-------|---------------|---------------------|
645 | 000 | Scalar | `r0-r31`/1 | `0b00 RA` |
646 | 001 | Scalar | `r32-r63`/1 | `0b01 RA` |
647 | 010 | Scalar | `r64-r95`/1 | `0b10 RA` |
648 | 011 | Scalar | `r96-r127`/1 | `0b11 RA` |
649 | 100 | Vector | `r0-r124`/4 | `RA 0b00` |
650 | 101 | Vector | `r1-r125`/4 | `RA 0b01` |
651 | 110 | Vector | `r2-r126`/4 | `RA 0b10` |
652 | 111 | Vector | `r3-r127`/4 | `RA 0b11` |
653
654 ## INT/FP EXTRA2
655
656 If EXTRA2 is zero it will map to
657 "scalar identity behaviour", i.e. Scalar Power ISA register naming:
658
659 | Value | Mode | Range/inc | 6..0 |
660 |-----------|-------|---------------|-----------|
661 | 00 | Scalar | `r0-r31`/1 | `0b00 RA` |
662 | 01 | Scalar | `r32-r63`/1 | `0b01 RA` |
663 | 10 | Vector | `r0-r124`/4 | `RA 0b00` |
664 | 11 | Vector | `r2-r126`/4 | `RA 0b10` |
665
666 **Note that unlike in EXTRA3, in EXTRA2**:
667
668 * the GPR Vectors may only start from
669 `r0, r2, r4, r6, r8` and likewise FPR Vectors.
670 * the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.
671
672 as there are insufficient bits to cover the full range.
673
674 ## CR Field EXTRA3
675
676 CR Field encoding is essentially the same but made more complex due to CRs being bit-based. See [[svp64/appendix]] for explanation and pseudocode.
677 Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
678 and Scalars may only go from `CR0, CR1, ... CR31`
679
680 Encoding shown MSB down to LSB
681
682 For a 5-bit operand (BA, BB, BT):
683
684 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
685 |-------|------|---------------|-----------| --------|---------|
686 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
687 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
688 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[4:2] | BA[1:0] |
689 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[4:2] | BA[1:0] |
690 | 100 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
691 | 101 | Vector | `CR4-CR116`/16 | BA[4:2] 0 | 0b100 | BA[1:0] |
692 | 110 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
693 | 111 | Vector | `CR12-CR124`/16 | BA[4:2] 1 | 0b100 | BA[1:0] |
694
695 For a 3-bit operand (e.g. BFA):
696
697 | Value | Mode | Range/Inc | 6..3 | 2..0 |
698 |-------|------|---------------|-----------| --------|
699 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
700 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
701 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BFA |
702 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BFA |
703 | 100 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
704 | 101 | Vector | `CR4-CR116`/16 | BFA 0 | 0b100 |
705 | 110 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
706 | 111 | Vector | `CR12-CR124`/16 | BFA 1 | 0b100 |
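
As an illustration derived directly from the 5-bit-operand table above (the canonical pseudocode is in [[svp64/appendix]]), the extended CR Field number and the bit within it may be computed as:

```
# Illustrative decode of the CR Field EXTRA3 table above, for a 5-bit operand BA
# (LSB0: BA[4:2] = CR field 0-7, BA[1:0] = bit within the field).
def cr_extra3_5bit(BA, extra3):
    field, bit = BA >> 2, BA & 0b11
    if extra3 & 0b100:                               # Vector
        cr = (field << 4) | ((extra3 & 0b11) << 2)   # starts at CR0/CR4/CR8/CR12, /16
    else:                                            # Scalar
        cr = ((extra3 & 0b11) << 3) | field          # CR0-CR31
    return cr, bit

print(cr_extra3_5bit(0b01010, 0b110))  # (40, 2): Vector, CR40 (CR8 + 2*16), bit 2
```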
707
708 ## CR EXTRA2
709
710 CR encoding is essentially the same but made more complex due to CRs being bit-based. See separate section for explanation and pseudocode.
711 Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
712
713
714 Encoding shown MSB down to LSB
715
716 For a 5-bit operand (BA, BB, BC):
717
718 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
719 |-------|--------|----------------|---------|---------|---------|
720 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
721 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
722 | 10 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
723 | 11 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
724
725 For a 3-bit operand (e.g. BFA):
726
727 | Value | Mode | Range/Inc | 6..3 | 2..0 |
728 |-------|------|---------------|-----------| --------|
729 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
730 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
731 | 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
732 | 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
733
734 --------
735
736 \newpage{}
737
738
739 # Normal SVP64 Modes, for Arithmetic and Logical Operations
740
741 Normal SVP64 Mode covers Arithmetic and Logical operations
742 to provide suitable additional behaviour. The Mode
743 field is bits 19-23 of the [[svp64]] RM Field.
744
745 ## Mode
746
747 Mode is an augmentation of SV behaviour, providing additional
748 functionality. Some of these alterations are element-based (saturation), others involve post-analysis (predicate result) and others are Vector-based (mapreduce, fail-on-first).
749
750 [[sv/ldst]],
751 [[sv/cr_ops]] and [[sv/branches]] are covered separately: the following
752 Modes apply to Arithmetic and Logical SVP64 operations:
753
754 * **simple** mode is straight vectorisation. no augmentations: the vector comprises an array of independently created results.
755 * **ffirst** or data-dependent fail-on-first: see separate section. the vector may be truncated depending on certain criteria.
756 *VL is altered as a result*.
757 * **sat mode** or saturation: clamps each element result to a min/max rather than overflows / wraps. allows signed and unsigned clamping for both INT
758 and FP.
759 * **reduce mode**. if used correctly, a mapreduce (or a prefix sum)
760 is performed. see [[svp64/appendix]].
761 note that there are comprehensive caveats when using this mode.
762 * **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch conditional testing) and if the test fails it
763 is as if the
764 *destination* predicate bit was zero even before starting the operation.
765 When Rc=1 the CR element however is still stored in the CR regfile, even if the test failed. See appendix for details.
766
767 Note that ffirst and reduce modes are not anticipated to be high-performance in some implementations. ffirst due to interactions with VL, and reduce due to it requiring additional operations to produce a result. simple, saturate and pred-result are however inter-element independent and may easily be parallelised to give high performance, regardless of the value of VL.
768
769 The Mode table for Arithmetic and Logical operations
770 is laid out as follows:
771
772 | 0-1 | 2 | 3 4 | description |
773 | --- | --- |---------|-------------------------- |
774 | 00 | 0 | dz sz | simple mode |
775 | 00 | 1 | 0 RG | scalar reduce mode (mapreduce) |
776 | 00 | 1 | 1 / | reserved |
777 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
778 | 01 | inv | VLi RC1 | Rc=0: ffirst z/nonz |
779 | 10 | N | dz sz | sat mode: N=0/1 u/s |
780 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
781 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
782
783 Fields:
784
785 * **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context.
786 * **zz**: both sz and dz are set equal to this flag
787 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
788 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
789 than the normal 0..VL-1
790 * **N** sets signed/unsigned saturation.
791 * **RC1** as if Rc=1, enables access to `VLi`.
792 * **VLi** VL inclusive: in fail-first mode, the truncation of
793 VL *includes* the current element at the failure point rather
794 than excludes it from the count.
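
The following decoder is a sketch of the table above (illustrative only, not normative):

```
# Illustrative decode of the 5-bit MODE field for Normal (Arithmetic/Logical)
# operations, following the table above (MODE bits shown MSB0, as in the table).
def decode_normal_mode(mode, Rc):
    m01 = (mode >> 3) & 0b11
    m2  = (mode >> 2) & 1
    m34 = mode & 0b11
    if m01 == 0b00:
        if m2 == 0:
            return "simple", {"dz": m34 >> 1, "sz": m34 & 1}
        if (m34 >> 1) == 0:
            return "mapreduce", {"RG": m34 & 1}
        return "reserved", {}
    if m01 == 0b10:
        return "saturate", {"N": m2, "dz": m34 >> 1, "sz": m34 & 1}
    kind = "ffirst" if m01 == 0b01 else "pred-result"
    if Rc:
        return kind, {"inv": m2, "CR-bit": m34}
    sub = {"VLi": m34 >> 1} if kind == "ffirst" else {"zz": m34 >> 1}
    sub.update({"inv": m2, "RC1": m34 & 1})
    return kind, sub

print(decode_normal_mode(0b10100, Rc=0))  # ('saturate', {'N': 1, 'dz': 0, 'sz': 0})
```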
795
796 For LD/ST Modes, see [[sv/ldst]]. For Condition Registers
797 see [[sv/cr_ops]].
798 For Branch modes, see [[sv/branches]].
799
800 ## Rounding, clamp and saturate
801
802 To help ensure for example that audio quality is not compromised by overflow,
803 "saturation" is provided, as well as a way to detect when saturation
804 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
805 one CR per element in the result (Note: this is different from VSX which
806 has a single CR per block).
807
808 When N=0 the result is saturated to within the maximum range of an
809 unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
810 logic applies to FP operations, with the result being saturated to
811 maximum rather than returning INF, and the minimum to +0.0
812
813 When N=1 the same occurs except that the result is saturated to the min
814 or max of a signed result, and for FP to the min and max value rather
815 than returning +/- INF.
816
817 When Rc=1, the CR "overflow" bit is set on the CR associated with the
818 element, to indicate whether saturation occurred. Note that due to
819 the hugely detrimental effect it has on parallel processing, XER.SO is
820 **ignored** completely and is **not** brought into play here. The CR
821 overflow bit is therefore simply set to zero if saturation did not occur,
822 and to one if it did.
823
824 Note also that saturate on operations that set OE=1 must raise an
825 Illegal Instruction due to the conflicting use of the CR.so bit for
826 storing whether
827 saturation occurred. Integer Operations that produce a Carry-Out (CA, CA32):
828 these two bits will be `UNDEFINED` if saturation is also requested.
829
830 Note that the operation takes place at the maximum bitwidth (max of
831 src and dest elwidth) and that truncation occurs to the range of the
832 dest elwidth.
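
A minimal sketch of the clamping described above; the second return value corresponds to the per-element CR "overflow" bit stored when Rc=1:

```
# Illustrative saturation helper: clamp a result to the destination elwidth,
# unsigned (N=0) or signed (N=1).
def saturate(value, elwidth, N):
    if N == 0:                                   # unsigned: 0 .. 2^elwidth - 1
        lo, hi = 0, (1 << elwidth) - 1
    else:                                        # signed: -2^(ew-1) .. 2^(ew-1) - 1
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    clamped = min(max(value, lo), hi)
    return clamped, clamped != value             # (result, saturation occurred)

print(saturate(300, 8, N=0))    # (255, True): clamped, per-element "overflow" set
print(saturate(-200, 8, N=1))   # (-128, True)
```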
833
834 *Programmer's Note: Post-analysis of the Vector of CRs to find out if any given element hit
835 saturation may be done using a mapreduced CR op (cror), or by using the
836 new crweird instruction with Rc=1, which will transfer the required
837 CR bits to a scalar integer and update CR0, which will allow testing
838 the scalar integer for nonzero. see [[sv/cr_int_predication]]*
839
840 ## Reduce mode
841
842 Reduction in SVP64 is similar in essence to other Vector Processing
843 ISAs, but leverages the underlying scalar Base v3.0B operations.
844 Thus it is more a convention that the programmer may utilise to give
845 the appearance and effect of a Horizontal Vector Reduction. Due
846 to the unusual decoupling it is also possible to perform
847 prefix-sum (Fibonacci Series) in certain circumstances. Details are in the [[svp64/appendix]]
848
849 Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
850 As explained in the [[sv/appendix]] Reduce Mode switches off the check
851 which would normally stop looping if the result register is scalar.
852 Thus, the result scalar register, if also used as a source scalar,
853 may be used to perform sequential accumulation. This *deliberately*
854 sets up a chain
855 of Register Hazard Dependencies, whereas Parallel Reduce [[sv/remap]]
856 deliberately issues a Tree-Schedule of operations that may be parallelised.
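
A minimal sketch (not the specification's pseudocode; the `regs` list and names are placeholders) of what scalar-destination Reduce Mode effectively performs, with the stop-on-scalar-result check switched off:

```
# Illustrative: scalar RT accumulates sequentially over a vector starting at RA
# (conceptually, an sv.add in mapreduce mode with a scalar destination).
def reduce_add(regs, RT, RA, VL):
    for i in range(VL):
        regs[RT] = regs[RT] + regs[RA + i]   # deliberate chain of register hazards
    return regs[RT]

regs = [0] * 32
regs[10:14] = [1, 2, 3, 4]                   # vector of 4 elements starting at r10
print(reduce_add(regs, RT=3, RA=10, VL=4))   # 10: sequential accumulation into r3
```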
857
858 ## Fail-on-first
859
860 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
861 the other for arithmetic operations (actually, CR-driven). Note in each
862 case the assumption is that vector elements are required to appear to be
863 executed in sequential Program Order. When REMAP is not active,
864 element 0 would be the first.
865
866 Data-driven (CR-driven) fail-on-first activates when Rc=1 or when another
867 CR-creating operation produces a result (including cmp). Similar to
868 branch, an analysis of the CR is performed and if the test fails, the
869 vector operation terminates and discards all element operations **at and
870 above the current one**, and VL is truncated to either
871 the *previous* element or the current one, depending on whether
872 VLi (VL "inclusive") is clear or set, respectively.
873
874 Thus the new VL comprises a contiguous vector of results,
875 all of which pass the testing criteria (equal to zero, less than zero etc
876 as defined by the CR-bit test).
877
878 *Note: when VLi is clear, the behaviour at first seems counter-intuitive.
879 A result is calculated but if the test fails it is prohibited from being
880 actually written. This becomes intuitive again when it is remembered
881 that the length that VL is set to is the number of *written* elements,
882 and only when VLi is set will the current element be included in that
883 count.*
884
885 The CR-based data-driven fail-on-first is "new" and not found in ARM
886 SVE or RVV. At the same time it is "old" because it is almost
887 identical to a generalised form of Z80's `CPIR` instruction.
888 It is extremely useful for reducing instruction count,
889 however requires speculative execution involving modifications of VL
890 to get high performance implementations. An additional mode (RC1=1)
891 effectively turns what would otherwise be an arithmetic operation
892 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
893 against the `inv` field).
894 If the CR.eq bit is equal to `inv` then the Vector is truncated and
895 the loop ends.
896
897 VLi is only available as an option when `Rc=0` (or for instructions
898 which do not have Rc). When set, the current element is always
899 also included in the count (the new length that VL will be set to).
900 This may be useful in combination with "inv" to truncate the Vector
901 to *exclude* elements that fail a test, or, in the case of implementations
902 of strncpy, to include the terminating zero.
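
A minimal model of the truncation behaviour (illustrative only; `test` stands in for the CR bit-test selected by `inv` and the CR-bit field):

```
# Illustrative model of CR-based data-dependent fail-first: elements execute in
# program order; on the first failed test VL is truncated, and VLi decides
# whether the failing element itself is kept.
def ffirst_truncate(values, test, VL, VLi=False):
    results, new_VL = [], VL
    for i in range(VL):
        r = values[i]                     # stand-in for the element operation
        if not test(r):                   # CR bit-test failed
            new_VL = i + 1 if VLi else i
            if VLi:
                results.append(r)
            break
        results.append(r)
    return results, new_VL

# e.g. keep elements while non-zero (strncpy-style; VLi includes the terminating 0)
print(ffirst_truncate([5, 7, 0, 9], lambda x: x != 0, VL=4, VLi=True))  # ([5, 7, 0], 3)
```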
903
904 In CR-based data-driven fail-on-first there is only the option to select
905 and test one bit of each CR (just as with branch BO). For more complex
906 tests this may be insufficient. If that is the case, a vectorised crop
907 such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
908 and ffirst applied to the crop instead of to
909 the arithmetic vector. Note that crops are covered by
910 the [[sv/cr_ops]] Mode format.
911
912 *Programmer's note: `VLi` is only accessible in normal operations
913 which in turn limits the CR field bit-testing to only `EQ/NE`.
914 [[sv/cr_ops]] are not so limited. Thus it is possible to use for
915 example `sv.cror/ff=gt/vli *0,*0,*0`, which is not a `nop` because
916 it allows Fail-First Mode to perform a test and truncate VL.*
917
918 Two extremely important aspects of ffirst are:
919
920 * LDST ffirst may never set VL equal to zero. This is because on the first
921 element an exception must be raised "as normal".
922 * CR-based data-dependent ffirst on the other hand **can** set VL equal
923 to zero. This is the only means in the entirety of SV that VL may be set
924 to zero (with the exception of via the SV.STATE SPR). When VL is set
925 zero due to the first element failing the CR bit-test, all subsequent
926 vectorised operations are effectively `nops` which is
927 *precisely the desired and intended behaviour*.
928
929 The second crucial aspect, compared to LDST Ffirst:
930
931 * LD/ST Failfirst may (beyond the initial first element
932 conditions) truncate VL for any architecturally
933 suitable reason. Beyond the first element LD/ST Failfirst is
934 arbitrarily speculative and 100% non-deterministic.
935 * CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
936 arbitrarily to a length decided by the hardware: VL MUST only be
937 truncated based explicitly on whether a test fails.
938 This is because it is a precise Deterministic test on which algorithms
939 can and will rely.
940
941 **Floating-point Exceptions**
942
943 When Floating-point exceptions are enabled VL must be truncated at
944 the point where the Exception appears not to have occurred. If `VLi`
945 is set then VL must include the faulting element, and thus the
946 faulting element will always raise its exception. If however `VLi`
947 is clear then VL **excludes** the faulting element and thus the
948 exception will **never** be raised.
949
950 Although very strongly
951 discouraged, the Exception Mode that permits Floating Point Exception
952 notification to arrive too late to unwind is permitted
953 (under protest, due to it violating
954 the otherwise 100% Deterministic nature of Data-dependent Fail-first).
955
956 **Use of lax FP Exception Notification Mode could result in parallel
957 computations proceeding with invalid results that have to be explicitly
958 detected, whereas with the strict FP Exception Mode enabled, FFirst
959 truncates VL, allowing subsequent parallel computation to avoid
960 the exceptions entirely.**
961
962 ## Data-dependent fail-first on CR operations (crand etc)
963
964 Operations that actually produce or alter CR Field as a result
965 have their own SVP64 Mode, described
966 in [[sv/cr_ops]].
967
968 ## pred-result mode
969
970 This mode merges common CR testing with predication, saving on instruction
971 count. Below is the pseudocode excluding predicate zeroing and elwidth
972 overrides. Note that the pseudocode for SVP64 CR-ops is slightly different.
973
974 ```
975 for i in range(VL):
976 # predication test, skip all masked out elements.
977 if predicate_masked_out(i):
978 continue
979 result = op(iregs[RA+i], iregs[RB+i])
980 CRnew = analyse(result) # calculates eq/lt/gt
981 # Rc=1 always stores the CR field
982 if Rc=1 or RC1:
983 CR.field[offs+i] = CRnew
984 # now test CR, similar to branch
985 if RC1 or CR.field[BO[0:1]] != BO[2]:
986 continue # test failed: cancel store
987 # result optionally stored but CR always is
988 iregs[RT+i] = result
989 ```
990
991 The reason for allowing the CR element to be stored is so that
992 post-analysis of the CR Vector may be carried out. For example:
993 Saturation may have occurred (and been prevented from updating, by the
994 test) but it is desirable to know *which* elements fail saturation.
995
996 Note that RC1 Mode basically turns all operations into `cmp`. The
997 calculation is performed but it is only the CR that is written. The
998 element result is *always* discarded, never written (just like `cmp`).
999
1000 Note that predication is still respected: predicate zeroing is slightly
1001 different: elements that fail the CR test *or* are masked out are zero'd.
1002
1003 --------
1004
1005 \newpage{}
1006
1007 # SV Load and Store
1008
1009 **Rationale**
1010
1011 All Vector ISAs dating back fifty years have extensive and comprehensive
1012 Load and Store operations that go far beyond the capabilities of Scalar
1013 RISC and most CISC processors, yet at their heart on an individual element
1014 basis may be found to be no different from RISC Scalar equivalents.
1015
1016 The resource savings from Vector LD/ST are significant and stem from
1017 the fact that one single instruction can trigger a dozen (or, in some
1018 microarchitectures such as Cray or NEC SX Aurora, hundreds of) element-level Memory accesses.
1019
1020 Additionally, and simply: if the Arithmetic side of an ISA supports
1021 Vector Operations, then in order to keep the ALUs 100% occupied the
1022 Memory infrastructure (and the ISA itself) correspondingly needs Vector
1023 Memory Operations as well.
1024
1025 Vectorised Load and Store also presents an extra dimension (literally)
1026 which creates scenarios unique to Vector applications, that a Scalar
1027 (and even a SIMD) ISA simply never encounters. SVP64 endeavours to
1028 add the modes typically found in *all* Scalable Vector ISAs,
1029 without changing the behaviour of the underlying Base
1030 (Scalar) v3.0B operations in any way.
1031
1032 ## Modes overview
1033
1034 Vectorisation of Load and Store requires the creation, from scalar operations,
1035 of a number of different modes:
1036
1037 * **fixed aka "unit" stride** - contiguous sequence with no gaps
1038 * **element strided** - sequential but regularly offset, with gaps
1039 * **vector indexed** - vector of base addresses and vector of offsets
1040 * **Speculative fail-first** - where it makes sense to do so
1041 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
1042
1043 *Despite being constructed from Scalar LD/ST none of these Modes
1044 exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
1045
1046 Also included in SVP64 LD/ST is both signed and unsigned Saturation,
1047 as well as Element-width overrides and Twin-Predication.
1048
1049 Note also that Indexed [[sv/remap]] mode may be applied to both
1050 v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
1051 LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
1052 is provided below.
1053
1054 **Determining the LD/ST Modes**
1055
1056 A minor complication (caused by the retro-fitting of modern Vector
1057 features to a Scalar ISA) is that certain features do not exactly make
1058 sense or are considered a security risk. Fail-first on Vector Indexed
1059 would allow attackers to probe large numbers of pages from userspace, where
1060 strided fail-first (by creating contiguous sequential LDs) does not.
1061
1062 In addition, reduce mode makes no sense.
1063 Realistically we need
1064 an alternative table definition for [[sv/svp64]] `RM.MODE`.
1065 The following modes make sense:
1066
1067 * saturation
1068 * predicate-result (mostly for cache-inhibited LD/ST)
1069 * simple (no augmentation)
1070 * fail-first (where Vector Indexed is banned)
1071 * Signed Effective Address computation (Vector Indexed only)
1072 * Pack/Unpack (on LD/ST immediate operations only)
1073
1074 More than that however it is necessary to fit the usual Vector ISA
1075 capabilities onto both Power ISA LD/ST with immediate and to
1076 LD/ST Indexed. They present subtly different Mode tables, which, due
1077 to lack of space, have the following quirks:
1078
1079 * LD/ST Immediate has no individual control over src/dest zeroing,
1080 whereas LD/ST Indexed does.
1081 * LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
1082 * LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
1083
1084 ## Format and fields
1085
1086 Fields used in tables below:
1087
1088 * **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context.
1089 * **zz**: both sz and dz are set equal to this flag.
1090 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1091 * **N** sets signed/unsigned saturation.
1092 * **RC1** as if Rc=1, stores CRs *but not the result*
1093 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
1094 registers that have been reduced due to elwidth overrides
1095
1096 **LD/ST immediate**
1097
1098 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
1099 (bits 19:23 of `RM`) is:
1100
1101 | 0-1 | 2 | 3 4 | description |
1102 | --- | --- |---------|--------------------------- |
1103 | 00 | 0 | zz els | simple mode |
1104 | 00 | 1 | PI LF | post-increment and Fault-First |
1105 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
1106 | 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
1107 | 10 | N | zz els | sat mode: N=0/1 u/s |
1108 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1109 | 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
1110
1111 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
1112 whether stride is unit or element:
1113
1114 ```
1115 if RA.isvec:
1116 svctx.ldstmode = indexed
1117 elif els == 0:
1118 svctx.ldstmode = unitstride
1119 elif immediate != 0:
1120 svctx.ldstmode = elementstride
1121 ```
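
As an illustration (derived from the `op_load` pseudocode later in this section), these are the Effective Addresses generated for each element in the two strided sub-modes:

```
# Illustrative: the sequence of Effective Addresses each LD/ST-immediate mode
# generates, for a scalar base in RA.
def ea_sequence(ldstmode, RA_base, immed, op_width, VL):
    if ldstmode == "unitstride":
        return [RA_base + immed + i * op_width for i in range(VL)]
    if ldstmode == "elementstride":
        return [RA_base + i * immed for i in range(VL)]
    raise ValueError("indexed mode takes its addresses from a vector of registers")

print([hex(a) for a in ea_sequence("unitstride", 0x1000, 8, 4, 4)])
# ['0x1008', '0x100c', '0x1010', '0x1014']
print([hex(a) for a in ea_sequence("elementstride", 0x1000, 16, 4, 4)])
# ['0x1000', '0x1010', '0x1020', '0x1030']
```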
1122
1123 An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
1124 in effect the multiplication of the element index by a zero immediate results
1125 in reading from the exact same memory location, *even with a Vector
1126 register*. (Normally this type of behaviour is reserved for the
1127 mapreduce modes)
1128
1129 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
1130 just the once and be copied, rather than hitting the Data Cache
1131 multiple times with the same memory read at the same location.
1132 The benefit of Cache-inhibited LD-splats is that it allows
1133 for memory-mapped peripherals to have multiple
1134 data values read in quick succession and stored in sequentially
1135 numbered registers (but, see Note below).
1136
1137 For non-cache-inhibited ST from a vector source onto a scalar
1138 destination: with the Vector
1139 loop effectively creating multiple memory writes to the same location,
1140 we can deduce that the last of these will be the "successful" one. Thus,
1141 implementations are free and clear to optimise out the overwriting STs,
1142 leaving just the last one as the "winner". Bear in mind that predicate
1143 masks will skip some elements (in source non-zeroing mode).
1144 Cache-inhibited ST operations on the other hand **MUST** write out
1145 a Vector source multiple successive times to the exact same Scalar
1146 destination. Just like Cache-inhibited LDs, multiple values may be
1147 written out in quick succession to a memory-mapped peripheral from
1148 sequentially-numbered registers.
1149
1150 Note that any memory location may be Cache-inhibited
1151 (Power ISA v3.1, Book III, 1.6.1, p1033)
1152
1153 *Programmer's Note: an immediate also with a Scalar source as
1154 a "VSPLAT" mode is simply not possible: there are not enough
1155 Mode bits. One single Scalar Load operation may be used instead, followed
1156 by any arithmetic operation (including a simple mv) in "Splat"
1157 mode.*
1158
1159 **LD/ST Indexed**
1160
1161 The modes for `RA+RB` indexed version are slightly different
1162 but are the same `RM.MODE` bits (19:23 of `RM`):
1163
1164 | 0-1 | 2 | 3 4 | description |
1165 | --- | --- |---------|-------------------------- |
1166 | 00 | SEA | dz sz | simple mode |
1167 | 01 | SEA | dz sz | Strided (scalar only source) |
1168 | 10 | N | dz sz | sat mode: N=0/1 u/s |
1169 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1170 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
1171
1172 Vector Indexed Strided Mode is qualified as follows:
1173
1174 if mode = 0b01 and !RA.isvec and !RB.isvec:
1175 svctx.ldstmode = elementstride
1176
1177 A summary of the effect of Vectorisation of src or dest:
1178
    imm(RA)  RT.v   RA.v      no stride allowed
    imm(RA)  RT.s   RA.v      no stride allowed
    imm(RA)  RT.v   RA.s      stride-select allowed
    imm(RA)  RT.s   RA.s      not vectorised
    RA,RB    RT.v   {RA|RB}.v Standard Indexed
    RA,RB    RT.s   {RA|RB}.v Indexed but single LD (no VSPLAT)
    RA,RB    RT.v   {RA&RB}.s VSPLAT possible. stride selectable
    RA,RB    RT.s   {RA&RB}.s not vectorised (scalar identity)
1187
Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is
sign-extended from elwidth bits to the full 64 bits before
being added to RA to calculate the Effective Address.
For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
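
A minimal Python sketch of the SEA rule described above (the function
names and the chosen widths are illustrative assumptions, not part of
the specification): the RB element is sign-extended from the source
elwidth to 64 bits before the addition to RA.

```
# sign-extend a value of 'bits' width to a Python integer
def sext(value, bits):
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

# EA for one element of Vector Indexed mode, with or without SEA
def indexed_ea(ra, rb_elem, src_elwidth, sea):
    offs = sext(rb_elem, src_elwidth) if sea else rb_elem
    return (ra + offs) & ((1 << 64) - 1)

print(hex(indexed_ea(0x2000, 0xFFFE, 16, sea=True)))   # 0x1ffe  (offset -2)
print(hex(indexed_ea(0x2000, 0xFFFE, 16, sea=False)))  # 0x11ffe (offset +65534)
```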
1195
Note that a cache-inhibited LD/ST when VSPLAT is activated will perform
**multiple** LD/ST operations, sequentially. Even with a scalar src a
Cache-inhibited LD will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are typically used to read and
write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
copying the one *scalar* value into multiple register destinations.
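
The recommended sequence can be modelled in a few lines of Python (an
illustrative sketch only): one scalar read of the peripheral register,
then a pure register-to-register splat, so memory is touched exactly
once.

```
# illustrative: one cache-inhibited scalar LD, then splat via mv
def scalar_ld_then_splat(mem, addr, vl):
    value = mem[addr]        # single memory access
    return [value] * vl      # VSPLAT-style mv into VL destination registers

print(scalar_ld_then_splat({0x4000_0000: 0x5A}, 0x4000_0000, 4))
# [90, 90, 90, 90]
```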
1201
Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This allows, for example, a massive batch of memory-mapped
peripheral reads to be issued, stopping at the first NULL character and
truncating VL to that point. No branch is needed to issue that large burst
of LDs, which may be valuable in Embedded scenarios.
1207
1208 ## Vectorisation of Scalar Power ISA v3.0B
1209
1210 Scalar Power ISA Load/Store operations may be seen from their
1211 pseudocode to be of the form:
1212
    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)
1216
1217 and for immediate variants:
1218
    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)
1222
1223 Thus in the first example, the source registers may each be independently
1224 marked as scalar or vector, and likewise the destination; in the second
1225 example only the one source and one dest may be marked as scalar or
1226 vector.
1227
1228 Thus we can see that Vector Indexed may be covered, and, as demonstrated
1229 with the pseudocode below, the immediate can be used to give unit
1230 stride or element stride. With there being no way to tell which from
1231 the Power v3.0B Scalar opcode alone, the choice is provided instead by
1232 the SV Context.
1233
1234 ```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
1280 ```
1281
1282 Indexed LD is:
1283
1284 ```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
1307 ```
1308
1309 Note that Element-Strided uses the Destination Step because with both
1310 sources being Scalar as a prerequisite condition of activation of
1311 Element-Stride Mode, the source step (being Scalar) would never advance.
1312
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.
1314
1315 *Programmer's note: being able to set RA-as-a-source
1316 as separate from RA-as-a-destination as Scalar is **extremely valuable**
1317 once it is remembered that Simple-V element operations must
1318 be in Program Order, especially in loops, for saving on
1319 multiple address computations. Care does have
1320 to be taken however that RA-as-src is not overwritten by
1321 RA-as-dest unless intentionally desired, especially in element-strided Mode.*
1322
1323 ## LD/ST Indexed vs Indexed REMAP
1324
1325 Unfortunately the word "Indexed" is used twice in completely different
1326 contexts, potentially causing confusion.
1327
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA
  since its creation: these are called "LD/ST Indexed" instructions and
  their name and meaning are well-established.
1331 * There now exists, in Simple-V, a REMAP mode called "Indexed"
1332 Mode that can be applied to *any* instruction **including those
1333 named LD/ST Indexed**.
1334
Whilst allowing REMAP Indexed Mode to be applied to any Vectorised LD/ST
Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in terms of
register reads, or even misleadingly labelled as redundant, firstly the
strict application of the RISC Paradigm that Simple-V follows makes it
awkward to consider *preventing* the application of Indexed REMAP to such
operations, and secondly the two are not actually the same at all.
1342
1343 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
1344 effectively performs an *in-place* re-ordering of the offsets, RB.
1345 To achieve the same effect without Indexed REMAP would require taking
1346 a *copy* of the Vector of offsets starting at RB, manually explicitly
1347 reordering them, and finally using the copy of re-ordered offsets in
a non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the
pseudocode below shows what actually occurs (the pseudocode for
`indexed_remap` may be found in [[sv/remap]]):
1351
1352 ```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
1361 ```
1362
1363 Thus it can be seen that the use of Indexed REMAP saves copying
1364 and manual reordering of the Vector of RB offsets.
1365
1366 ## LD/ST ffirst
1367
1368 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
1369 is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
1371 in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
1372 1 and above, if an exception would occur, then VL is **truncated**
1373 to the previous element: the exception is **not** then raised because
1374 the LD/ST that would otherwise have caused an exception is *required*
1375 to be cancelled. Additionally an implementor may choose to truncate VL
1376 for any arbitrary reason *except for the very first*.
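
The truncation behaviour can be illustrated with a small Python model
(a sketch, not the normative pseudocode): element 0 faults normally,
whereas a fault on any later element simply truncates VL and raises no
exception.

```
# illustrative Fail-First model: 'mem' is a dict, and a missing key
# plays the role of a page fault
def ffirst_load(mem, addrs, vl):
    result = []
    for i in range(vl):
        if addrs[i] not in mem:
            if i == 0:
                raise MemoryError("element 0 must fault as a normal LD")
            return result, i          # VL truncated to i, no exception
        result.append(mem[addrs[i]])
    return result, vl

mem = {0x1000: 1, 0x1008: 2}          # 0x1010 onwards unmapped
print(ffirst_load(mem, [0x1000, 0x1008, 0x1010, 0x1018], 4))
# ([1, 2], 2)  <- VL truncated to 2
```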
1377
1378 ffirst LD/ST to multiple pages via a Vectorised Index base is
1379 considered a security risk due to the abuse of probing multiple
1380 pages in rapid succession and getting speculative feedback on which
1381 pages would fail. Therefore Vector Indexed LD/ST is prohibited
1382 entirely, and the Mode bit instead used for element-strided LD/ST.
1383 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
1384
1385 ```
1386 for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
1388 ```
1389
1390 High security implementations where any kind of speculative probing
1391 of memory pages is considered a risk should take advantage of the fact that
1392 implementations may truncate VL at any point, without requiring software
1393 to be rewritten and made non-portable. Such implementations may choose
1394 to *always* set VL=1 which will have the effect of terminating any
1395 speculative probing (and also adversely affect performance), but will
1396 at least not require applications to be rewritten.
1397
Low-performance simpler hardware implementations may also
choose to always set VL=1 as the bare minimum compliant implementation of
1400 LD/ST Fail-First. It is however critically important to remember that
1401 the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
1402 **MUST** raise exceptions exactly like an ordinary LD/ST.
1403
For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a nonaligned boundary, such that within a
loop the subsequent iteration of that loop begins the following ffirst
LD/ST operations on an aligned boundary such as the beginning of a cache
line, or beginning of a Virtual Memory page. Likewise, VL may be
truncated to reduce workloads or balance resources.
1407
1408 Vertical-First Mode is slightly strange in that only one element
1409 at a time is ever executed anyway. Given that programmers may
1410 legitimately choose to alter srcstep and dststep in non-sequential
1411 order as part of explicit loops, it is neither possible nor
1412 safe to make speculative assumptions about future LD/STs.
1413 Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
1414 This is very different from Arithmetic (Data-dependent) FFirst
1415 where Vertical-First Mode is fully deterministic, not speculative.
1416
1417 ## LOAD/STORE Elwidths <a name="elwidth"></a>
1418
1419 Loads and Stores are almost unique in that the Power Scalar ISA
1420 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
1421 others like it provide an explicit operation width. There are therefore
1422 *three* widths involved:
1423
1424 * operation width (lb=8, lh=16, lw=32, ld=64)
1425 * src element width override (8/16/32/default)
1426 * destination element width override (8/16/32/default)
1427
1428 Some care is therefore needed to express and make clear the transformations,
1429 which are expressly in this order:
1430
1431 * Calculate the Effective Address from RA at full width
1432 but (on Indexed Load) allow srcwidth overrides on RB
1433 * Load at the operation width (lb/lh/lw/ld) as usual
1434 * byte-reversal as usual
1435 * Non-saturated mode:
1436 - zero-extension or truncation from operation width to dest elwidth
1437 - place result in destination at dest elwidth
1438 * Saturated mode:
1439 - Sign-extension or truncation from operation width to dest width
1440 - signed/unsigned saturation down to dest elwidth
1441
1442 In order to respect Power v3.0B Scalar behaviour the memory side
1443 is treated effectively as completely separate and distinct from SV
1444 augmentation. This is primarily down to quirks surrounding LE/BE and
1445 byte-reversal.
1446
1447 It is rather unfortunately possible to request an elwidth override
1448 on the memory side which
1449 does not mesh with the overridden operation width: these result in
1450 `UNDEFINED`
1451 behaviour. The reason is that the effect of attempting a 64-bit `sv.ld`
1452 operation with a source elwidth override of 8/16/32 would result in
1453 overlapping memory requests, particularly on unit and element strided
1454 operations. Thus it is `UNDEFINED` when the elwidth is smaller than
1455 the memory operation width. Examples include `sv.lw/sw=16/els` which
1456 requests (overlapping) 4-byte memory reads offset from
1457 each other at 2-byte intervals. Store likewise is also `UNDEFINED`
1458 where the dest elwidth override is less than the operation width.
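
The overlap in the `sv.lw/sw=16/els` example can be seen with a line of
Python (illustrative arithmetic only, following the numbers given in the
text): each element requests a 4-byte read, but successive reads start
only 2 bytes apart.

```
# illustrative: 4-byte operation width, 2-byte (elwidth=16) stride
op_width, stride, base = 4, 2, 0x1000
print([(hex(base + i * stride), op_width) for i in range(4)])
# [('0x1000', 4), ('0x1002', 4), ('0x1004', 4), ('0x1006', 4)]  <- overlapping
```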
1459
1460 Note the following regarding the pseudocode to follow:
1461
1462 * `scalar identity behaviour` SV Context parameter conditions turn this
1463 into a straight absolute fully-compliant Scalar v3.0B LD operation
1464 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
1465 rather than `ld`)
1466 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
1467 a "normal" part of Scalar v3.0B LD
1468 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
1469 as a "normal" part of Scalar v3.0B LD
1470 * `svctx` specifies the SV Context and includes VL as well as
1471 source and destination elwidth overrides.
1472
Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Observe in particular that RA, as the base address in
both Immediate and Indexed LD/ST,
does not have element-width overriding applied to it.
1476
1477 Note that predication, predication-zeroing,
1478 and other modes except saturation have all been removed,
1479 for clarity and simplicity:
1480
1481 ```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1516 ```
1517
1518 Note above that the source elwidth is *not used at all* in LD-immediate.
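
The helper functions `clamp` and `adjust_wid` referenced in the
pseudocode are not defined in this section; the following Python
sketches (illustrative assumptions only, with widths expressed in bytes
and the unsigned case shown for `clamp`) convey the intent: plain
truncation/zero-extension versus saturation down to the destination
element width.

```
# zero-extend or truncate to the destination element width (bytes)
def adjust_wid(value, op_width, dest_elwidth):
    return value & ((1 << (dest_elwidth * 8)) - 1)

# unsigned saturation down to the destination element width (bytes)
def clamp(value, op_width, dest_elwidth):
    maxval = (1 << (dest_elwidth * 8)) - 1
    return min(value, maxval)

print(hex(adjust_wid(0x12345678, 4, 2)))  # 0x5678 (truncated)
print(hex(clamp(0x12345678, 4, 2)))       # 0xffff (saturated)
```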
1519
1520 For LD/Indexed, the key is that in the calculation of the Effective Address,
1521 RA has no elwidth override but RB does. Pseudocode below is simplified
1522 for clarity: predication and all modes except saturation are removed:
1523
1524 ```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1565 ```
1566
1567 ## Remapped LD/ST
1568
1569 In the [[sv/remap]] page the concept of "Remapping" is described.
1570 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
1571 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
1572 elements worth of LDs or STs. The usual interest in such re-mapping
1573 is for example in separating out 24-bit RGB channel data into separate
1574 contiguous registers.
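
As a motivating illustration (a Python sketch of the index arithmetic
only, not the REMAP pseudocode), de-interleaving packed R,G,B bytes into
channel-contiguous registers amounts to a simple 2D re-ordering of
element indices:

```
packed = [10, 20, 30, 11, 21, 31, 12, 22, 32]   # interleaved R,G,B triples
n = len(packed) // 3
regs = [0] * len(packed)
for i, byte in enumerate(packed):
    # transpose a 3 x n layout: channel-major destination ordering
    regs[(i % 3) * n + i // 3] = byte
print(regs)  # [10, 11, 12, 20, 21, 22, 30, 31, 32] -> R..., G..., B...
```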
1575
1576 REMAP easily covers this capability, and with dest
1577 elwidth overrides and saturation may do so with built-in conversion that
1578 would normally require additional width-extension, sign-extension and
1579 min/max Vectorised instructions as post-processing stages.
1580
1581 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
1582 because the generic abstracted concept of "Remapping", when applied to
1583 LD/ST, will give that same capability, with far more flexibility.
1584
1585 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
1586 established through `svstep`, are also an easy way to perform regular
1587 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond
1588 that, REMAP will need to be used.
1589
1590 --------
1591
1592 \newpage{}
1593
1594 # Condition Register SVP64 Operations
1595
1596 Condition Register Fields are only 4 bits wide: this presents some
1597 interesting conceptual challenges for SVP64, which was designed
1598 primarily for vectors of arithmetic and logical operations. However
1599 if predicates may be bits of CR Fields it makes sense to extend
1600 Simple-V to cover CR Operations, especially given that Vectorised Rc=1
may be processed by Vectorised CR Operations that usefully in turn
1602 may become Predicate Masks to yet more Vector operations, like so:
1603
1604 ```
sv.cmpi/ew=8 *B,*ra,0      # compare bytes against zero
sv.cmpi/ew=8 *B2,*ra,13    # and against newline
sv.cror PM.EQ,B.EQ,B2.EQ   # OR compares to create mask
sv.stb/sm=EQ ...           # store only nonzero/newline
1609 ```
1610
Element width however is clearly meaningless for a 4-bit collation of
Conditions, LT GT EQ SO. Likewise, arithmetic saturation (an important
1613 part of Arithmetic SVP64) has no meaning. An alternative Mode Format is
1614 required, and given that elwidths are meaningless for CR Fields the bits
1615 in SVP64 `RM` may be used for other purposes.
1616
1617 This alternative mapping **only** applies to instructions that **only**
1618 reference a CR Field or CR bit as the sole exclusive result. This section
1619 **does not** apply to instructions which primarily produce arithmetic
1620 results that also, as an aside, produce a corresponding
1621 CR Field (such as when Rc=1).
1622 Instructions that involve Rc=1 are definitively arithmetic in nature,
1623 where the corresponding Condition Register Field can be considered to
be a "co-result". Such CR Field "co-result" arithmetic operations
1625 are firmly out of scope for
1626 this section, being covered fully by [[sv/normal]].
1627
* Examples of v3.0B instructions to which this section does
  apply are
1630 - `mfcr` and `cmpi` (3 bit operands) and
1631 - `crnor` and `crand` (5 bit operands).
1632 * Examples to which this section does **not** apply include
1633 `fadds.` and `subf.` which both produce arithmetic results
1634 (and a CR Field co-result).
1635
1636 The CR Mode Format still applies to `sv.cmpi` because despite
1637 taking a GPR as input, the output from the Base Scalar v3.0B `cmpi`
1638 instruction is purely to a Condition Register Field.
1639
1640 Other modes are still applicable and include:
1641
1642 * **Data-dependent fail-first**.
1643 useful to truncate VL based on
1644 analysis of a Condition Register result bit.
1645 * **Reduction**.
1646 Reduction is useful
1647 for analysing a Vector of Condition Register Fields
1648 and reducing it to one
1649 single Condition Register Field.
1650
1651 Predicate-result does not make any sense because
1652 when Rc=1 a co-result is created (a CR Field). Testing the co-result
1653 allows the decision to be made to store or not store the main
1654 result, and for CR Ops the CR Field result *is*
1655 the main result.
1656
1657 ## Format
1658
1659 SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations:
1660
1661 |6 | 7 |19-20| 21 | 22 23 | description |
1662 |--|---|-----| --- |---------|----------------- |
1663 |/ | / |0 RG | 0 | dz sz | simple mode |
1664 |/ | / |0 RG | 1 | dz sz | scalar reduce mode (mapreduce) |
1665 |zz|SNZ|1 VLI| inv | CR-bit | Ffirst 3-bit mode |
1666 |/ |SNZ|1 VLI| inv | dz sz | Ffirst 5-bit mode (implies CR-bit from result) |
1667
1668 Fields:
1669
* **sz / dz** if predication is enabled will put zeros into the dest
  (or as src in the case of twin pred) when the predicate bit is zero.
  Otherwise the element is ignored or skipped, depending on context.
1671 * **zz** set both sz and dz equal to this flag
1672 * **SNZ** In fail-first mode, on the bit being tested, when sz=1 and SNZ=1 a value "1" is put in place of "0".
1673 * **inv CR-bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1674 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
1675 than the normal 0..VL-1
1676 * **SVM** sets "subvector" reduce mode
1677 * **VLi** VL inclusive: in fail-first mode, the truncation of
1678 VL *includes* the current element at the failure point rather
1679 than excludes it from the count.
1680
1681 ## Data-dependent fail-first on CR operations
1682
1683 The principle of data-dependent fail-first is that if, during
1684 the course of sequentially evaluating an element's Condition Test,
1685 one such test is encountered which fails,
1686 then VL (Vector Length) is truncated (set) at that point. In the case
1687 of Arithmetic SVP64 Operations the Condition Register Field generated from
1688 Rc=1 is used as the basis for the truncation decision.
1689 However with CR-based operations that CR Field result to be
1690 tested is provided
1691 *by the operation itself*.
1692
1693 Data-dependent SVP64 Vectorised Operations involving the creation or
1694 modification of a CR can require an extra two bits, which are not available
1695 in the compact space of the SVP64 RM `MODE` Field. With the concept of element
1696 width overrides being meaningless for CR Fields it is possible to use the
1697 `ELWIDTH` field for alternative purposes.
1698
1699 Condition Register based operations such as `sv.mfcr` and `sv.crand` can thus
1700 be made more flexible. However the rules that apply in this section
1701 also apply to future CR-based instructions.
1702
1703 There are two primary different types of CR operations:
1704
1705 * Those which have a 3-bit operand field (referring to a CR Field)
1706 * Those which have a 5-bit operand (referring to a bit within the
1707 whole 32-bit CR)
1708
1709 Examining these two types it is observed that the
1710 difference may be considered to be that the 5-bit variant
1711 *already* provides the
prerequisite information about which CR Field bit (LT, GT, EQ, SO) is to
1713 be operated on by the instruction.
1714 Thus, logically, we may set the following rule:
1715
1716 * When a 5-bit CR Result field is used in an instruction, the
1717 5-bit variant of Data-Dependent Fail-First
1718 must be used. i.e. the bit of the CR field to be tested is
1719 the one that has just been modified (created) by the operation.
1720 * When a 3-bit CR Result field is used the 3-bit variant
1721 must be used, providing as it does the missing `CRbit` field
1722 in order to select which CR Field bit of the result shall
  be tested (LT, GT, EQ, SO)
1724
The reason why the 3-bit CR variant needs the additional CR-bit
field should be obvious from the fact that the 3-bit CR Field
from the base Power ISA v3.0B operation clearly does not contain
the two CR Field Selector bits. Thus, these two
bits (to select LT, GT, EQ or SO) must be provided in another
way.
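
The distinction can be shown with a few lines of Python (illustrative
only; the helper names are not part of the specification): a 5-bit
operand already carries both the CR Field number and the bit within it,
whereas a 3-bit operand carries only the Field number, leaving the bit
selector to be supplied by the `CRbit` field.

```
CR_BITS = ("LT", "GT", "EQ", "SO")

# 5-bit operand (e.g. BT in crand): field and bit are both encoded
def decode_5bit(bt):
    return bt >> 2, CR_BITS[bt & 0b11]

# 3-bit operand (e.g. BF in mcrf): only the field; the bit must come
# from the SVP64 RM CR-bit selector
def decode_3bit(bf, crbit):
    return bf, CR_BITS[crbit]

print(decode_5bit(0b00110))     # (1, 'EQ')  -> CR1.EQ
print(decode_3bit(0b001, 0b10)) # (1, 'EQ')  -> CR1.EQ via CR-bit selector
```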
1731
Examples of both types:
1733
1734 * crand, cror, crnor. These all are 5-bit (BA, BB, BT). The bit
1735 to be tested against `inv` is the one selected by `BT`
1736 * mcrf. This has only 3-bit (BF, BFA). In order to select the
1737 bit to be tested, the alternative encoding must be used.
1738 With `CRbit` coming from the SVP64 RM bits 22-23 the bit
1739 of BF to be tested is identified.
1740
1741 Just as with SVP64 [[sv/branches]] there is the option to truncate
1742 VL to include the element being tested (`VLi=1`) and to exclude it
1743 (`VLi=0`).
1744
1745 Also exactly as with [[sv/normal]] fail-first, VL cannot, unlike
1746 [[sv/ldst]], be set to an arbitrary value. Deterministic behaviour
1747 is *required*.
1748
1749 ## Reduction and Iteration
1750
Bearing in mind, as described in the svp64 Appendix, that SVP64 Horizontal
Reduction is a deterministic schedule on top of base Scalar v3.0 operations,
the same rules apply to CR Operations, i.e. that programmers must
1754 follow certain conventions in order for an *end result* of a
1755 reduction to be achieved. Unlike
1756 other Vector ISAs *there are no explicit reduction opcodes*
1757 in SVP64: Schedules however achieve the same effect.
1758
1759 Due to these conventions only reduction on operations such as `crand`
1760 and `cror` are meaningful because these have Condition Register Fields
1761 as both input and output.
1762 Meaningless operations are not prohibited because the cost in hardware
1763 of doing so is prohibitive, but neither are they `UNDEFINED`. Implementations
1764 are still required to execute them but are at liberty to optimise out
1765 any operations that would ultimately be overwritten, as long as Strict
Program Order is still observable by the programmer.
1767
1768 Also bear in mind that 'Reverse Gear' may be enabled, which can be
1769 used in combination with overlapping CR operations to iteratively accumulate
1770 results. Issuing a `sv.crand` operation for example with `BA`
1771 differing from `BB` by one Condition Register Field would
1772 result in a cascade effect, where the first-encountered CR Field
1773 would set the result to zero, and also all subsequent CR Field
1774 elements thereafter:
1775
1776 ```
1777 # sv.crand/mr/rg CR4.ge.v, CR5.ge.v, CR4.ge.v
1778 for i in VL-1 downto 0 # reverse gear
    CR.field[4+i].ge &= CR.field[5+i].ge
1780 ```
1781
1782 `sv.crxor` with reduction would be particularly useful for parity calculation
1783 for example, although there are many ways in which the same calculation
1784 could be carried out after transferring a vector of CR Fields to a GPR
1785 using crweird operations.
1786
1787 Implementations are free and clear to optimise these reductions in any
1788 way they see fit, as long as the end-result is compatible with Strict Program
1789 Order being observed, and Interrupt latency is not adversely impacted.
1790
1791 ## Unusual and quirky CR operations
1792
1793 **cmp and other compare ops**
1794
1795 `cmp` and `cmpi` etc take GPRs as sources and create a CR Field as a result.
1796
    cmpli BF,L,RA,UI
    cmpeqb BF,RA,RB
1799
1800 With `ELWIDTH` applying to the source GPR operands this is perfectly fine.
1801
1802 **crweird operations**
1803
1804 There are 4 weird CR-GPR operations and one reasonable one in
1805 the [[cr_int_predication]] set:
1806
1807 * crrweird
1808 * mtcrweird
1809 * crweirder
1810 * crweird
1811 * mcrfm - reasonably normal and referring to CR Fields for src and dest.
1812
1813 The "weird" operations have a non-standard behaviour, being able to
1814 treat *individual bits* of a GPR effectively as elements. They are
1815 expected to be Micro-coded by most Hardware implementations.
1816
1817
1818 --------
1819
1820 \newpage{}
1821
1822 # SVP64 Branch Conditional behaviour
1823
1824 Please note: although similar, SVP64 Branch instructions should be
1825 considered completely separate and distinct from
1826 standard scalar OpenPOWER-approved v3.0B branches.
1827 **v3.0B branches are in no way impacted, altered,
1828 changed or modified in any way, shape or form by
1829 the SVP64 Vectorised Variants**.
1830
1831 It is also
1832 extremely important to note that Branches are the
1833 sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`.
1834 SVP64 Branches contain additional modes that are useful
1835 for scalar operations (i.e. even when VL=1 or when
1836 using single-bit predication).
1837
1838 **Rationale**
1839
1840 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
1841 Condition Register. However for parallel processing it is simply impossible
1842 to perform multiple independent branches: the Program Counter simply
1843 cannot branch to multiple destinations based on multiple conditions.
1844 The best that can be done is
1845 to test multiple Conditions and make a decision of a *single* branch,
1846 based on analysis of a *Vector* of CR Fields
1847 which have just been calculated from a *Vector* of results.
1848
1849 In 3D Shader
1850 binaries, which are inherently parallelised and predicated, testing all or
1851 some results and branching based on multiple tests is extremely common,
1852 and a fundamental part of Shader Compilers. Example:
1853 without such multi-condition
1854 test-and-branch, if a predicate mask is all zeros a large batch of
1855 instructions may be masked out to `nop`, and it would waste
1856 CPU cycles to run them. 3D GPU ISAs can test for this scenario
1857 and, with the appropriate predicate-analysis instruction,
1858 jump over fully-masked-out operations, by spotting that
1859 *all* Conditions are false.
1860
1861 Unless Branches are aware and capable of such analysis, additional
1862 instructions would be required which perform Horizontal Cumulative
1863 analysis of Vectorised Condition Register Fields, in order to
1864 reduce the Vector of CR Fields down to one single yes or no
1865 decision that a Scalar-only v3.0B Branch-Conditional could cope with.
1866 Such instructions would be unavoidable, required, and costly
1867 by comparison to a single Vector-aware Branch.
1868 Therefore, in order to be commercially competitive, `sv.bc` and
1869 other Vector-aware Branch Conditional instructions are a high priority
1870 for 3D GPU (and OpenCL-style) workloads.
1871
1872 Given that Power ISA v3.0B is already quite powerful, particularly
1873 the Condition Registers and their interaction with Branches, there
1874 are opportunities to create extremely flexible and compact
1875 Vectorised Branch behaviour. In addition, the side-effects (updating
1876 of CTR, truncation of VL, described below) make it a useful instruction
1877 even if the branch points to the next instruction (no actual branch).
1878
1879 ## Overview
1880
1881 When considering an "array" of branch-tests, there are four
1882 primarily-useful modes:
1883 AND, OR, NAND and NOR of all Conditions.
1884 NAND and NOR may be synthesised from AND and OR by
1885 inverting `BO[1]` which just leaves two modes:
1886
1887 * Branch takes place on the **first** CR Field test to succeed
1888 (a Great Big OR of all condition tests). Exit occurs
1889 on the first **successful** test.
1890 * Branch takes place only if **all** CR field tests succeed:
1891 a Great Big AND of all condition tests. Exit occurs
1892 on the first **failed** test.
1893
1894 Early-exit is enacted such that the Vectorised Branch does not
1895 perform needless extra tests, which will help reduce reads on
1896 the Condition Register file.
1897
1898 *Note: Early-exit is **MANDATORY** (required) behaviour.
1899 Branches **MUST** exit at the first sequentially-encountered
1900 failure point, for
1901 exactly the same reasons for which it is mandatory in
1902 programming languages doing early-exit: to avoid
1903 damaging side-effects and to provide deterministic
1904 behaviour. Speculative testing of Condition
1905 Register Fields is permitted, as is speculative calculation
1906 of CTR, as long as, as usual in any Out-of-Order microarchitecture,
1907 that speculative testing is cancelled should an early-exit occur.
1908 i.e. the speculation must be "precise": Program Order must be preserved*
1909
1910 Also note that when early-exit occurs in Horizontal-first Mode,
1911 srcstep, dststep etc. are all reset, ready to begin looping from the
1912 beginning for the next instruction. However for Vertical-first
1913 Mode srcstep etc. are incremented "as usual" i.e. an early-exit
1914 has no special impact, regardless of whether the branch
1915 occurred or not. This can leave srcstep etc. in what may be
1916 considered an unusual
1917 state on exit from a loop and it is up to the programmer to
1918 reset srcstep, dststep etc. to known-good values
1919 *(easily achieved with `setvl`)*.
1920
1921 Additional useful behaviour involves two primary Modes (both of
1922 which may be enabled and combined):
1923
1924 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
1925 for Arithmetic SVP64 operations, with more
1926 flexibility and a close interaction and integration into the
1927 underlying base Scalar v3.0B Branch instruction.
1928 Truncation of VL takes place around the early-exit point.
1929 * **CTR-test Mode**: gives much more flexibility over when and why
1930 CTR is decremented, including options to decrement if a Condition
1931 test succeeds *or if it fails*.
1932
1933 With these side-effects, basic Boolean Logic Analysis advises that
1934 it is important to provide a means
1935 to enact them each based on whether testing succeeds *or fails*. This
1936 results in a not-insignificant number of additional Mode Augmentation bits,
1937 accompanying VLSET and CTR-test Modes respectively.
1938
1939 Predicate skipping or zeroing may, as usual with SVP64, be controlled
1940 by `sz`.
1941 Where the predicate is masked out and
1942 zeroing is enabled, then in such circumstances
1943 the same Boolean Logic Analysis dictates that
1944 rather than testing only against zero, the option to test
1945 against one is also prudent. This introduces a new
1946 immediate field, `SNZ`, which works in conjunction with
1947 `sz`.
1948
1949
1950 Vectorised Branches can be used
1951 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
1952 at an element level, the behaviour is identical in both Modes,
1953 although the `ALL` bit is meaningless in Vertical-First Mode.
1954
1955 It is also important
1956 to bear in mind that, fundamentally, Vectorised Branch-Conditional
1957 is still extremely close to the Scalar v3.0B Branch-Conditional
1958 instructions, and that the same v3.0B Scalar Branch-Conditional
1959 instructions are still
1960 *completely separate and independent*, being unaltered and
1961 unaffected by their SVP64 variants in every conceivable way.
1962
1963 *Programming note: One important point is that SVP64 instructions are 64 bit.
1964 (8 bytes not 4). This needs to be taken into consideration when computing
1965 branch offsets: the offset is relative to the start of the instruction,
1966 which **includes** the SVP64 Prefix*
1967
1968 ## Format and fields
1969
1970 With element-width overrides being meaningless for Condition
1971 Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
1972 Mode bits.
1973
1974 SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5,
1975 and `ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch
1976 Conditional:
1977
1978 | 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
1979 | - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
1980 |ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | simple mode |
1981 |ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode |
1982 |ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode |
1983 |ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
1984
1985 Brief description of fields:
1986
1987 * **sz=1** if predication is enabled and `sz=1` and a predicate
1988 element bit is zero, `SNZ` will
1989 be substituted in place of the CR bit selected by `BI`,
1990 as the Condition tested.
1991 Contrast this with
1992 normal SVP64 `sz=1` behaviour, where *only* a zero is put in
1993 place of masked-out predicate bits.
1994 * **sz=0** When `sz=0` skipping occurs as usual on
1995 masked-out elements, but unlike all
1996 other SVP64 behaviour which entirely skips an element with
1997 no related side-effects at all, there are certain
1998 special circumstances where CTR
1999 may be decremented. See CTR-test Mode, below.
2000 * **ALL** when set, all branch conditional tests must pass in order for
2001 the branch to succeed. When clear, it is the first sequentially
2002 encountered successful test that causes the branch to succeed.
2003 This is identical behaviour to how programming languages perform
2004 early-exit on Boolean Logic chains.
2005 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
2006 In VLSET mode, VL *may* (depending on `VSb`) be truncated.
2007 If VLI (Vector Length Inclusive) is clear,
2008 VL is truncated to *exclude* the current element, otherwise it is
2009 included. SVSTATE.MVL is not altered: only VL.
2010 * **SL** identical to `LR` except applicable to SVSTATE. If `SL`
2011 is set, SVSTATE is transferred to SVLR (conditionally on
2012 whether `SLu` is set).
2013 * **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
2014 * **LRu**: Link Register Update, used in conjunction with LK=1
2015 to make LR update conditional
2016 * **VSb** In VLSET Mode, after testing,
2017 if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
2018 VL is truncated if a test *fails*. Masked-out (skipped)
2019 bits are not considered
2020 part of testing when `sz=0`
2021 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
2022 tested. CTR inversion decrements if a test *fails*. Only relevant
2023 in CTR-test Mode.
2024
2025 LRu and CTR-test modes are where SVP64 Branches subtly differ from
2026 Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
2027 `sv.bcl/lru` will only update LR if the branch succeeds.
2028
2029 Of special interest is that when using ALL Mode (Great Big AND
2030 of all Condition Tests), if `VL=0`,
2031 which is rare but can occur in Data-Dependent Modes, the Branch
2032 will always take place because there will be no failing Condition
2033 Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
2034 of all Condition Tests) and `VL=0` the Branch is guaranteed not
2035 to occur because there will be no *successful* Condition Tests
2036 to make it happen.
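
This VL=0 corner case mirrors the conventional definition of AND/OR over
an empty set, as a short Python analogy shows:

```
print(all([]))  # True  -> ALL-mode Branch with VL=0 always succeeds
print(any([]))  # False -> non-ALL (OR) mode with VL=0 never succeeds
```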
2037
2038 ## Vectorised CR Field numbering, and Scalar behaviour
2039
2040 It is important to keep in mind that just like all SVP64 instructions,
2041 the `BI` field of the base v3.0B Branch Conditional instruction
2042 may be extended by SVP64 EXTRA augmentation, as well as be marked
2043 as either Scalar or Vector. It is also crucially important to keep in mind
2044 that for CRs, SVP64 sequentially increments the CR *Field* numbers.
2045 CR *Fields* are treated as elements, not bit-numbers of the CR *register*.
2046
2047 The `BI` operand of Branch Conditional operations is five bits, in scalar
2048 v3.0B this would select one bit of the 32 bit CR,
2049 comprising eight CR Fields of 4 bits each. In SVP64 there are
16 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT GT EQ SO), and the top 3 bits
2052 are extended to either scalar or vector and to select CR Fields 0..127
2053 as specified in SVP64 [[sv/svp64/appendix]].
2054
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply:
2057 the Vector loop ends at the first element tested
2058 (the first CR *Field*), after taking
2059 predication into consideration. Thus, also as usual, when a predicate mask is
2060 given, and `BI` marked as scalar, and `sz` is zero, srcstep
2061 skips forward to the first non-zero predicated element, and only that
2062 one element is tested.
2063
2064 In other words, the fact that this is a Branch
2065 Operation (instead of an arithmetic one) does not result, ultimately,
2066 in significant changes as to
2067 how SVP64 is fundamentally applied, except with respect to:
2068
2069 * the unique properties associated with conditionally
2070 changing the Program
2071 Counter (aka "a Branch"), resulting in early-out
2072 opportunities
2073 * CTR-testing
2074
2075 Both are outlined below, in later sections.
2076
2077 ## Horizontal-First and Vertical-First Modes
2078
2079 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
2080 AND) results in early exit: no more updates to CTR occur (if requested);
2081 no branch occurs, and LR is not updated (if requested). Likewise for
2082 non-ALL mode (Great Big Or) on first success early exit also occurs,
2083 however this time with the Branch proceeding. In both cases the testing
2084 of the Vector of CRs should be done in linear sequential order (or in
2085 REMAP re-sequenced order): such that tests that are sequentially beyond
2086 the exit point are *not* carried out. (*Note: it is standard practice in
2087 Programming languages to exit early from conditional tests, however
2088 a little unusual to consider in an ISA that is designed for Parallel
2089 Vector Processing. The reason is to have strictly-defined guaranteed
2090 behaviour*)
2091
2092 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
2093 behaviour. Given that only one element is being tested at a time
2094 in Vertical-First Mode, a test designed to be done on multiple
2095 bits is meaningless.
2096
2097 ## Description and Modes
2098
2099 Predication in both INT and CR modes may be applied to `sv.bc` and other
2100 SVP64 Branch Conditional operations, exactly as they may be applied to
2101 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
2102 operations are not included in condition testing, exactly like all other
2103 SVP64 operations, *including* side-effects such as potentially updating
2104 LR or CTR, which will also be skipped. There is *one* exception here,
2105 which is when
2106 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
2107 predicate mask bit is also zero:
2108 under these special circumstances CTR will also decrement.
2109
2110 When `sz` is non-zero, this normally requests insertion of a zero
2111 in place of the input data, when the relevant predicate mask bit is zero.
2112 This would mean that a zero is inserted in place of `CR[BI+32]` for
2113 testing against `BO`, which may not be desirable in all circumstances.
2114 Therefore, an extra field is provided `SNZ`, which, if set, will insert
2115 a **one** in place of a masked-out element, instead of a zero.
2116
2117 (*Note: Both options are provided because it is useful to deliberately
2118 cause the Branch-Conditional Vector testing to fail at a specific point,
2119 controlled by the Predicate mask. This is particularly useful in `VLSET`
2120 mode, which will truncate SVSTATE.VL at the point of the first failed
2121 test.*)
2122
Normally, CTR mode will decrement once per Condition Test, with the result
that, under normal circumstances, CTR reduces by up to VL in Horizontal-First
2125 Mode. Just as when v3.0B Branch-Conditional saves at
2126 least one instruction on tight inner loops through auto-decrementation
2127 of CTR, likewise it is also possible to save instruction count for
2128 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
2129 in circumstances where there is conditional interaction between the
2130 element computation and testing, and the continuation (or otherwise)
2131 of a given loop. The potential combinations of interactions is why CTR
2132 testing options have been added.
2133
2134 Also, the unconditional bit `BO[0]` is still relevant when Predication
2135 is applied to the Branch because in `ALL` mode all nonmasked bits have
2136 to be tested, and when `sz=0` skipping occurs.
2137 Even when VLSET mode is not used, CTR
2138 may still be decremented by the total number of nonmasked elements,
2139 acting in effect as either a popcount or cntlz depending on which
2140 mode bits are set.
2141 In short, Vectorised Branch becomes an extremely powerful tool.
2142
2143 **Micro-Architectural Implementation Note**: *when implemented on
2144 top of a Multi-Issue Out-of-Order Engine it is possible to pass
2145 a copy of the predicate and the prerequisite CR Fields to all
2146 Branch Units, as well as the current value of CTR at the time of
2147 multi-issue, and for each Branch Unit to compute how many times
2148 CTR would be subtracted, in a fully-deterministic and parallel
2149 fashion. A SIMD-based Branch Unit, receiving and processing
2150 multiple CR Fields covered by multiple predicate bits, would
2151 do the exact same thing. Obviously, however, if CTR is modified
2152 within any given loop (mtctr) the behaviour of CTR is no longer
2153 deterministic.*
2154
2155 ### Link Register Update
2156
2157 For a Scalar Branch, unconditional updating of the Link Register
2158 LR is useful and practical. However, if a loop of CR Fields is
2159 tested, unconditional updating of LR becomes problematic.
2160
2161 For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
2162 LR's value will be unconditionally overwritten after the first element,
2163 such that for execution (testing) of the second element, LR
2164 has the value `CIA+8`. This is covered in the `bclrl` example, in
2165 a later section.
2166
2167 The addition of a LRu bit modifies behaviour in conjunction
2168 with LK, as follows:
2169
2170 * `sv.bc` When LRu=0,LK=0, Link Register is not updated
2171 * `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
2172 * `sv.bcl/lru` When LRu=1,LK=1, Link Register will
2173 only be updated if the Branch Condition fails.
2174 * `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
2175 the Branch Condition succeeds.
2176
2177 This avoids
2178 destruction of LR during loops (particularly Vertical-First
2179 ones).
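
A small Python truth-table (an illustrative model of the `lr_ok` logic
in the pseudocode later in this section) shows how LK and LRu combine
with the per-element branch outcome:

```
def lr_updated(lk, lru, branch_taken):
    lr_ok = lk
    if branch_taken and lru:
        lr_ok = not lr_ok    # LRu inverts the decision on a taken branch
    return bool(lr_ok)

for lk in (0, 1):
    for lru in (0, 1):
        print("LK", lk, "LRu", lru,
              "taken:", lr_updated(lk, lru, True),
              "not-taken:", lr_updated(lk, lru, False))
```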
2180
2181 **SVLR and SVSTATE**
2182
2183 For precisely the reasons why `LK=1` was added originally to the Power
2184 ISA, with SVSTATE being a peer of the Program Counter it becomes
2185 necessary to also add an SVLR (SVSTATE Link Register)
2186 and corresponding control bits `SL` and `SLu`.
2187
2188 ### CTR-test
2189
2190 Where a standard Scalar v3.0B branch unconditionally decrements
2191 CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
2192 which allows CTR to be used for many more types of Vector loops
2193 constructs.
2194
2195 CTR-test mode and CTi interaction is as follows: note that
2196 `BO[2]` is still required to be clear for CTR decrements to be
2197 considered, exactly as is the case in Scalar Power ISA v3.0B
2198
2199 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
2200 if `BO[2]` is zero. Masked-out elements when `sz=0` are
2201 skipped (i.e. CTR is *not* decremented when the predicate
2202 bit is zero and `sz=0`).
2203 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
2204 if `BO[2]` is zero and a masked-out element is skipped
(`sz=0` and predicate bit is zero). This one special case is the
**opposite** of other combinations, as well as being
completely different from normal SVP64 `sz=0` behaviour.
2208 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
2209 if `BO[2]` is zero and the Condition Test succeeds.
2210 Masked-out elements when `sz=0` are skipped (including
2211 not decrementing CTR)
2212 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
2213 if `BO[2]` is zero and the Condition Test *fails*.
2214 Masked-out elements when `sz=0` are skipped (including
2215 not decrementing CTR)
2216
2217 `CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
2218 only time in the entirety of SVP64 that has side-effects when
2219 a predicate mask bit is clear. **All** other SVP64 operations
2220 entirely skip an element when sz=0 and a predicate mask bit is zero.
2221 It is also critical to emphasise that in this unusual mode,
2222 no other side-effects occur: **only** CTR is decremented, i.e. the
2223 rest of the Branch operation is skipped.
2224
2225 ### VLSET Mode
2226
2227 VLSET Mode truncates the Vector Length so that subsequent instructions
2228 operate on a reduced Vector Length. This is similar to
2229 Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
2230 truncation occurs at the Branch decision-point.
2231
2232 Interestingly, due to the side-effects of `VLSET` mode
2233 it is actually useful to use Branch Conditional even
2234 to perform no actual branch operation, i.e to point to the instruction
2235 after the branch. Truncation of VL would thus conditionally occur yet control
2236 flow alteration would not.
2237
2238 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
2239 is designed to be used for explicit looping, where an explicit call to
2240 `svstep` is required to move both srcstep and dststep on to
2241 the next element, until VL (or other condition) is reached.
2242 Vertical-First Looping is expected (required) to terminate if the end
2243 of the Vector, VL, is reached. If however that loop is terminated early
2244 because VL is truncated, VLSET with Vertical-First becomes meaningless.
2245 Resolving this would require two branches: one Conditional, the other
2246 branching unconditionally to create the loop, where the Conditional
2247 one jumps over it.
2248
2249 Therefore, with `VSb`, the option to decide whether truncation should occur if the
2250 branch succeeds *or* if the branch condition fails allows for the flexibility
2251 required. This allows a Vertical-First Branch to *either* be used as
2252 a branch-back (loop) *or* as part of a conditional exit or function
2253 call from *inside* a loop, and for VLSET to be integrated into both
2254 types of decision-making.
2255
2256 In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
2257 place if success conditions are met, but on exit from that loop
2258 (branch condition fails), VL will be truncated. This is extremely
2259 useful.
2260
2261 `VLSET` mode with Horizontal-First when `VSb=0` is still
2262 useful, because it can be used to truncate VL to the first predicated
2263 (non-masked-out) element.
2264
2265 The truncation point for VL, when VLi is clear, must not include skipped
2266 elements that preceded the current element being tested.
2267 Example: `sz=0, VLi=0, predicate mask = 0b110010` and the Condition
2268 Register failure point is at CR Field element 4.
2269
2270 * Testing at element 0 is skipped because its predicate bit is zero
2271 * Testing at element 1 passed
2272 * Testing elements 2 and 3 are skipped because their
2273 respective predicate mask bits are zero
2274 * Testing element 4 fails therefore VL is truncated to **2**
2275 not 4 due to elements 2 and 3 being skipped.
2276
If `sz=1` in the above example *then* VL would have been set to 4 because
with zeroing enabled the masked-out elements are still effectively part of
the Vector (with their respective test bits set to `SNZ`)
2280
2281 If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
2282 of the element actually being tested.
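
The worked example above can be modelled directly in Python (an
illustrative sketch using LSB0 predicate bit numbering; not normative):

```
def truncate_vl(pred, fail_elem, sz, vli):
    if vli:
        return fail_elem + 1      # VLi=1: include the failing element
    if sz:
        return fail_elem          # sz=1: zeroed/SNZ elements still count
    # sz=0, VLi=0: skip back over masked-out elements
    i = fail_elem - 1
    while i >= 0 and not (pred >> i) & 1:
        i -= 1
    return i + 1

pred = 0b110010                   # elements 1, 4 and 5 are predicated-in
print(truncate_vl(pred, 4, sz=0, vli=0))  # 2
print(truncate_vl(pred, 4, sz=1, vli=0))  # 4
print(truncate_vl(pred, 4, sz=0, vli=1))  # 5
```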
2283
2284 ### VLSET and CTR-test combined
2285
2286 If both CTR-test and VLSET Modes are requested, it's important to
2287 observe the correct order. What occurs depends on whether VLi
2288 is enabled, because VLi affects the length, VL.
2289
2290 If VLi (VL truncate inclusive) is set:
2291
2292 1. compute the test including whether CTR triggers
2293 2. (optionally) decrement CTR
2294 3. (optionally) truncate VL (VSb inverts the decision)
2295 4. decide (based on step 1) whether to terminate looping
2296 (including not executing step 5)
2297 5. decide whether to branch.
2298
2299 If VLi is clear, then when a test fails that element
2300 and any following it
2301 should **not** be considered part of the Vector. Consequently:
2302
2303 1. compute the branch test including whether CTR triggers
2304 2. if the test fails against VSb, truncate VL to the *previous*
2305 element, and terminate looping. No further steps executed.
2306 3. (optionally) decrement CTR
2307 4. decide whether to branch.
2308
2309 ## Boolean Logic combinations
2310
2311 In a Scalar ISA, Branch-Conditional testing even of vector
2312 results may be performed through inversion of tests. NOR of
2313 all tests may be performed by inversion of the scalar condition
2314 and branching *out* from the scalar loop around elements,
2315 using scalar operations.
2316
2317 In a parallel (Vector) ISA it is the ISA itself which must perform
2318 the prerequisite logic manipulation.
Thus for SVP64 there are an extraordinary number of necessary combinations
2320 which provide completely different and useful behaviour.
2321 Available options to combine:
2322
2323 * `BO[0]` to make an unconditional branch would seem irrelevant if
2324 it were not for predication and for side-effects (CTR Mode
2325 for example)
2326 * Enabling CTR-test Mode and setting `BO[2]` can still result in the
2327 Branch
2328 taking place, not because the Condition Test itself failed, but
2329 because CTR reached zero **because**, as required by CTR-test mode,
2330 CTR was decremented as a **result** of Condition Tests failing.
2331 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
2332 * `R30` and `~R30` and other predicate mask options including CR and
2333 inverted CR bit testing
2334 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
2335 predicate bits
2336 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
2337 `OR` of all tests, respectively.
2338 * Predicate Mask bits, which combine in effect with the CR being
2339 tested.
2340 * Inversion of Predicate Masks (`~r3` instead of `r3`, or using
2341 `NE` rather than `EQ`) which results in an additional
2342 level of possible ANDing, ORing etc. that would otherwise
2343 need explicit instructions.
2344
2345 The most obviously useful combinations here are to set `BO[1]` to zero
2346 in order to turn `ALL` into Great-Big-NAND and `ANY` into
2347 Great-Big-NOR. Other Mode bits which perform behavioural inversion then
2348 have to work round the fact that the Condition Testing is NOR or NAND.
2349 The alternative to not having additional behavioural inversion
2350 (`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
2351 branch directly after the first, which the first branch jumps over.
2352 This contrivance is avoided by the behavioural inversion bits.
2353
2354 ## Pseudocode and examples
2355
2356 Please see the SVP64 appendix regarding CR bit ordering and for
2357 the definition of `CR{n}`
2358
2359 For comparative purposes this is a copy of the v3.0B `bc` pseudocode
2360
2361 ```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
2371 ```
2372
2373 Below is simplified pseudocode including LRu and CTR skipping, which
2374 illustrates clearly that SVP64 Scalar Branches (VL=1) are **not**
2375 identical to v3.0B Scalar Branches. The key areas where differences
2376 occur are the inclusion of predication (which can still be used when
2377 VL=1), when and why CTR is decremented (CTRtest Mode), and whether LR
2378 is updated (which is unconditional in v3.0B when LK=1, and conditional
2379 in SVP64 when LRu=1).
2380
2381 Inline comments highlight the fact that the Scalar Branch behaviour
2382 and pseudocode are still clearly visible and embedded within the
2383 Vectorised variant:
2384
2385 ```
2386 if (mode_is_64bit) then M <- 0
2387 else M <- 32
2388 # the bit of CR to test, if the predicate bit is zero,
2389 # is overridden
2390 testbit = CR[BI+32]
2391 if ¬predicate_bit then testbit = SVRMmode.SNZ
2392 # otherwise apart from the override ctr_ok and cond_ok
2393 # are exactly the same
2394 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2395 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2396 if ¬predicate_bit & ¬SVRMmode.sz then
2397 # this is entirely new: CTR-test mode still decrements CTR
2398 # even when predicate-bits are zero
2399 if ¬BO[2] & CTRtest & ¬CTi then
2400 CTR = CTR - 1
2401 # instruction finishes here
2402 else
2403 # usual BO[2] CTR-mode now under CTR-test mode as well
2404 if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
2405 # new VLset mode, conditional test truncates VL
2406 if VLSET and VSb = (cond_ok & ctr_ok) then
2407 if SVRMmode.VLI then SVSTATE.VL = srcstep+1
2408 else SVSTATE.VL = srcstep
2409 # usual LR is now conditional, but also joined by SVLR
2410 lr_ok <- LK
2411 svlr_ok <- SVRMmode.SL
2412 if ctr_ok & cond_ok then
2413 if AA then NIA <-iea EXTS(BD || 0b00)
2414 else NIA <-iea CIA + EXTS(BD || 0b00)
2415 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2416 if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
2417 if lr_ok then LR <-iea CIA + 4
2418 if svlr_ok then SVLR <- SVSTATE
2419 ```
2420
2421 Below is the pseudocode for SVP64 Branches, which is a little less
2422 obvious but functionally identical to the above. The lack of
2423 obviousness is down to the early-exit opportunities.
2424
2425 Effective pseudocode for Horizontal-First Mode:
2426
2427 ```
2428 if (mode_is_64bit) then M <- 0
2429 else M <- 32
2430 cond_ok = SVRMmode.ALL  # AND-identity for ALL, OR-identity for ANY
2431 for srcstep in range(VL):
2432 # select predicate bit or zero/one
2433 if predicate[srcstep]:
2434 # get SVP64 extended CR field 0..127
2435 SVCRf = SVP64EXTRA(BI>>2)
2436 CRbits = CR{SVCRf}
2437 testbit = CRbits[BI & 0b11]
2438 # testbit = CR[BI+32+srcstep*4]
2439 else if not SVRMmode.sz:
2440 # inverted CTR test skip mode
2441         if ¬BO[2] & CTRtest & ¬CTi then
2442 CTR = CTR - 1
2443 continue # skip to next element
2444 else
2445 testbit = SVRMmode.SNZ
2446 # actual element test here
2447 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2448 el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2449 # check if CTR dec should occur
2450 ctrdec = ¬BO[2]
2451 if CTRtest & (el_cond_ok ^ CTi) then
2452 ctrdec = 0b0
2453 if ctrdec then CTR <- CTR - 1
2454 # merge in the test
2455 if SVRMmode.ALL:
2456 cond_ok &= (el_cond_ok & ctr_ok)
2457 else
2458 cond_ok |= (el_cond_ok & ctr_ok)
2459 # test for VL to be set (and exit)
2460 if VLSET and VSb = (el_cond_ok & ctr_ok) then
2461 if SVRMmode.VLI then SVSTATE.VL = srcstep+1
2462 else SVSTATE.VL = srcstep
2463 break
2464 # early exit?
2465 if SVRMmode.ALL != (el_cond_ok & ctr_ok):
2466 break
2467 # SVP64 rules about Scalar registers still apply!
2468 if SVCRf.scalar:
2469 break
2470 # loop finally done, now test if branch (and update LR)
2471 lr_ok <- LK
2472 svlr_ok <- SVRMmode.SL
2473 if cond_ok then
2474 if AA then NIA <-iea EXTS(BD || 0b00)
2475 else NIA <-iea CIA + EXTS(BD || 0b00)
2476 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2477 if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
2478 if lr_ok then LR <-iea CIA + 4
2479 if svlr_ok then SVLR <- SVSTATE
2480 ```
2481
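The early exit in the Horizontal-First loop above never changes the outcome
of the `ALL`/`ANY` reduction itself. The following Python sketch
(non-normative; the names are assumptions for illustration) mirrors the
accumulation and early-exit above and checks the equivalence exhaustively for
short vectors of element-test results:

```
from itertools import product


def reduce_with_early_exit(tests, all_mode):
    """Sketch: the Horizontal-First accumulation plus early exit gives
    the same branch decision as a plain ALL/ANY reduction."""
    cond_ok = all_mode                 # AND identity (1) or OR identity (0)
    for el_ok in tests:
        if all_mode:
            cond_ok = cond_ok and el_ok
        else:
            cond_ok = cond_ok or el_ok
        if all_mode != el_ok:          # result can no longer change
            break                      # early exit
    return cond_ok


# exhaustive check over short vectors of element-test results
for n in range(4):
    for tests in product([False, True], repeat=n):
        for all_mode in (False, True):
            plain = all(tests) if all_mode else any(tests)
            assert reduce_with_early_exit(list(tests), all_mode) == plain
```
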
2482 Pseudocode for Vertical-First Mode:
2483
2484 ```
2485 # get SVP64 extended CR field 0..127
2486 SVCRf = SVP64EXTRA(BI>>2)
2487 CRbits = CR{SVCRf}
2488 # select predicate bit or zero/one
2489 if predicate[srcstep]:
2490 if BRc = 1 then # CR0 vectorised
2491 CR{SVCRf+srcstep} = CRbits
2492 testbit = CRbits[BI & 0b11]
2493 else if not SVRMmode.sz:
2494 # inverted CTR test skip mode
2495     if ¬BO[2] & CTRtest & ¬CTi then
2496 CTR = CTR - 1
2497 SVSTATE.srcstep = new_srcstep
2498 exit # no branch testing
2499 else
2500 testbit = SVRMmode.SNZ
2501 # actual element test here
2502 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2503 # test for VL to be set (and exit)
2504 if VLSET and cond_ok = VSb then
2505 if SVRMmode.VLI
2506 SVSTATE.VL = new_srcstep+1
2507 else
2508 SVSTATE.VL = new_srcstep
2509 ```
2510
2511 ### Example Shader code
2512
2513 ```
2514 // assume f() g() or h() modify a and/or b
2515 while(a > 2) {
2516 if(b < 5)
2517 f();
2518 else
2519 g();
2520 h();
2521 }
2522 ```
2523
2524 which compiles to something like:
2525
2526 ```
2527 vec<i32> a, b;
2528 // ...
2529 pred loop_pred = a > 2;
2530 // loop continues while any of a elements greater than 2
2531 while(loop_pred.any()) {
2532 // vector of predicate bits
2533 pred if_pred = loop_pred & (b < 5);
2534 // only call f() if at least 1 bit set
2535 if(if_pred.any()) {
2536 f(if_pred);
2537 }
2538 label1:
2539 // loop mask ANDs with inverted if-test
2540 pred else_pred = loop_pred & ~if_pred;
2541 // only call g() if at least 1 bit set
2542 if(else_pred.any()) {
2543 g(else_pred);
2544 }
2545 h(loop_pred);
2546 }
2547 ```
2548
2549 which will end up as:
2550
2551 ```
2552 # start from while loop test point
2553 b looptest
2554 while_loop:
2555 sv.cmpi CR80.v, b.v, 5 # vector compare b into CR80 Vector
2556 sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
2557 # only calculate loop_pred & pred_b because needed in f()
2558 sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
2559 f(CR80.v.SO)
2560 skip_f:
2561 # illustrate inversion of pred_b: invert r30 and test ALL
2562 # rather than ANY. A masked-out zero test would FAIL, so
2563 # masked-out elements are instead tested against 1 (SNZ) not 0
2564 sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
2565 # else = loop & ~pred_b, need this because used in g()
2566 sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
2567 g(CR80.v.SO)
2568 skip_g:
2569 # conditionally call h(r30) if any loop pred set
2570 sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
2571 looptest:
2572 sv.cmpi CR60.v, a.v, 2 # vector compare a into CR60 vector
2573 sv.crweird r30, CR60.GT # transfer GT vector to r30
2574 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
2575 end:
2576 ```
2577
2578 ### LRu example
2579
2580 This example shows why `LRu` is useful in a loop. Imagine the
2581 following C code:
2582
2583 ```
2584 for (int i = 0; i < 8; i++) {
2585 if (x < y) break;
2586 }
2587 ```
2588
2589 Under these circumstances exiting from the loop is not based
2590 solely on CTR: it has become conditional on a CR result.
2591 Thus it is desirable that NIA *and* LR be modified only
2592 if the conditions are met.
2593
2594
2595 v3.0 pseudocode for `bclrl`:
2596
2597 ```
2598 if (mode_is_64bit) then M <- 0
2599 else M <- 32
2600 if ¬BO[2] then CTR <- CTR - 1
2601 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2602 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
2603 if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
2604 if LK then LR <-iea CIA + 4
2605 ```
2606
2607 The latter part for SVP64 `bclrl` becomes:
2608
2609 ```
2610 for i in 0 to VL-1:
2611 ...
2612 ...
2613 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
2614 lr_ok <- LK
2615 if ctr_ok & cond_ok then
2616 NIA <-iea LR[0:61] || 0b00
2617 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2618 if lr_ok then LR <-iea CIA + 4
2619 # if NIA modified exit loop
2620 ```
2621
2622 The reason should be clear from this being a Vector loop:
2623 unconditional destruction of LR when LK=1 makes `sv.bclrl`
2624 ineffective, because the intention going into the loop is
2625 that the branch should be to the copy of LR set at the *start*
2626 of the loop, not halfway through it.
2627 If, however, the change to LR occurs only when
2628 the branch is taken, then it becomes a useful instruction.
2629
2630 The following pseudocode should **not** be implemented because
2631 it violates the fundamental principle of SVP64, which is that
2632 SVP64 looping is a thin wrapper around Scalar Instructions.
2633 The pseudocode below is more like an actual Vector ISA Branch and
2634 as such is not at all appropriate:
2635
2636 ```
2637 for i in 0 to VL-1:
2638 ...
2639 ...
2640 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
2641 if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
2642 # only at the end of looping is LK checked.
2643 # this completely violates the design principle of SVP64
2644 # and would actually need to be a separate (scalar)
2645 # instruction "set LR to CIA+4 but retrospectively"
2646 # which is clearly impossible
2647 if LK then LR <-iea CIA + 4
2648 ```
2649
2650 [[!tag standards]]