# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
2
3 Credits and acknowledgements:
4
5 * Luke Leighton
6 * Jacob Lifshay
7 * Hendrik Boom
8 * Richard Wilbur
9 * Alexandre Oliva
10 * Cesar Strauss
11 * NLnet Foundation, for funding
12 * OpenPOWER Foundation
13 * Paul Mackerras
14 * Toshaan Bharvani
15 * IBM for the Power ISA itself
16
17 Links:
18
19 * <https://bugs.libre-soc.org/show_bug.cgi?id=1045>
20
21 # Introduction
22
23 Simple-V is a type of Vectorisation best described as a "Prefix Loop Subsystem"
24 similar to the Z80 `LDIR` instruction and to the x86 `REP` Prefix instruction.
25 More advanced features are similar to the Z80 `CPIR` instruction. If viewed
26 as an actual Vector ISA it introduces over 1.5 million 64-bit Vector instructions.
27 SVP64, the instruction format, is therefore best viewed as an orthogonal
28 RISC-style "Prefixing" subsystem instead.
29
30 Except where explicitly stated all bit numbers remain as in the rest of the Power ISA:
31 in MSB0 form (the bits are numbered from 0 at the MSB on the left
32 and counting up as you move rightwards to the LSB end). All bit ranges are inclusive
33 (so `4:6` means bits 4, 5, and 6, in MSB0 order). **All register numbering and
34 element numbering however is LSB0 ordering** which is a different convention from that used
35 elsewhere in the Power ISA.
36
37 The SVP64 prefix always comes before the suffix in PC order and must be considered
38 an independent "Defined word" that augments the behaviour of the following instruction,
39 but does **not** change the actual Decoding of that following instruction.
40 **All prefixed instructions retain their non-prefixed encoding and definition**.
41
42 *Architectural Resource Allocation note: it is prohibited to accept RFCs which
43 fundamentally violate this hard requirement. Under no circumstances must the
44 Suffix space have an alternate instruction encoding allocated within SVP64 that is
45 entirely different from the non-prefixed Defined Word. Hardware Implementors
46 critically rely on this inviolate guarantee to implement High-Performance Multi-Issue
47 micro-architectures that can sustain 100% throughput*
48
49 | 0:5 | 6:31 | 32:63 |
50 |--------|--------------|--------------|
51 | EXT09 | v3.1 Prefix | v3.0/1 Suffix |
52
53 Subset implementations in hardware are permitted, as long as certain
54 rules are followed, allowing for full soft-emulation including future
55 revisions. Compliancy Subsets exist to ensure minimum levels of binary
56 interoperability expectations within certain environments.
57
58 ## Register files, elements, and Element-width Overrides
59
In the Upper Compliancy Levels the GPR and FPR register files are expanded
from 32 to 128 entries, and the number of CR Fields is expanded from CR0-CR7 to CR0-CR127.
62
Memory access remains exactly the same: the effects of `MSR.LE` remain exactly the same,
affecting, as they already do, **only** the byte-order of Load and Store memory-register
operations, and having nothing to do with the
ordering of the contents of register files or with register-register operations.
67
Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered and
sequentially numbered, the element offset numbering is naturally
**LSB0-sequentially-incrementing from zero, not MSB0-incrementing.** Expressed exclusively in
MSB0 numbering, SVP64 is unnecessarily complex to understand: the required
subtractions from 63, 31, 15 and 7 unfortunately become a hostile minefield.
Therefore, for the purposes of this section, the more natural
**LSB0 numbering is assumed**, and it is up to the reader to translate to MSB0 numbering.
75
The canonical specification for how element-sequential numbering and element-width
overrides are defined is expressed in the following C structure, assuming a Little-Endian
system, and naturally using LSB0 numbering everywhere because the ANSI C specification
is inherently LSB0:
80
```
#include <stdint.h>

#pragma pack
typedef union {
    // zero-length arrays (GNU extension): element indexing deliberately
    // carries over into the next sequentially-numbered register
    uint8_t  b[0]; // elwidth 8
    uint16_t s[0]; // elwidth 16
    uint32_t i[0]; // elwidth 32
    uint64_t l[0]; // elwidth 64
    uint8_t  actual_bytes[8];
} el_reg_t;

el_reg_t int_regfile[128];

void get_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: el->l[0] = int_regfile[gpr].l[element]; break;
        case 32: el->i[0] = int_regfile[gpr].i[element]; break;
        case 16: el->s[0] = int_regfile[gpr].s[element]; break;
        case 8 : el->b[0] = int_regfile[gpr].b[element]; break;
    }
}
void set_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: int_regfile[gpr].l[element] = el->l[0]; break;
        case 32: int_regfile[gpr].i[element] = el->i[0]; break;
        case 16: int_regfile[gpr].s[element] = el->s[0]; break;
        case 8 : int_regfile[gpr].b[element] = el->b[0]; break;
    }
}
```
110
111 Example Vector-looped add operation implementation when elwidths are 64-bit:
112
113 ```
114 # add RT, RA,RB using the "uint64_t" union member, "l"
115 for i in range(VL):
116 int_regfile[RT].l[i] = int_regfile[RA].l[i] + int_regfile[RB].l[i]
117 ```
118
119 However if elwidth overrides are set to 16 for both source and destination:
120
121 ```
# add RT, RA, RB using the "uint16_t" union member "s"
123 for i in range(VL):
124 int_regfile[RT].s[i] = int_regfile[RA].s[i] + int_regfile[RB].s[i]
125 ```
126
127 Hardware Architectural note: to avoid a Read-Modify-Write at the register file it is
128 strongly recommended to implement byte-level write-enable lines exactly as has been
implemented in DRAM ICs for many decades. Additionally, the predicate mask bit is advised
to be associated with the element operation and, alongside the result, ultimately
passed to the register file.
132 When element-width is set to 64-bit the relevant predicate mask bit may be repeated
133 eight times and pull all eight write-port byte-level lines HIGH. Clearly when element-width
134 is set to 8-bit the relevant predicate mask bit corresponds directly with one single
135 byte-level write-enable line. It is up to the Hardware Architect to then amortise (merge)
elements together into both Predicated-SIMD Pipelines as well as simultaneous non-overlapping
137 Register File writes, to achieve High Performance designs.
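
*Implementor's sketch (non-normative): the following illustrates how a single predicate mask bit may be expanded into the eight byte-level write-enable lines of a 64-bit write port, for a given element width and byte offset. The function name and enable-line encoding are purely illustrative.*

```
# illustrative only: expand one predicate mask bit into byte-level
# write-enable lines for a 64-bit-wide register file write port.
def byte_write_enables(pred_bit, elwidth_bits, element_offset_bytes):
    if not pred_bit:
        return 0b00000000                    # predicate clear: write nothing
    bytes_per_element = elwidth_bits // 8
    mask = (1 << bytes_per_element) - 1      # repeat the bit across the element
    return mask << element_offset_bytes

# a 64-bit element pulls all eight lines HIGH...
assert byte_write_enables(1, 64, 0) == 0b11111111
# ...whereas an 8-bit element at byte offset 3 pulls exactly one line HIGH
assert byte_write_enables(1, 8, 3) == 0b00001000
```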
138
139 ## SVP64 encoding features
140
141 A number of features need to be compacted into a very small space of only 24 bits:
142
143 * Independent per-register Scalar/Vector tagging and range extension on every register
144 * Element width overrides on both source and destination
145 * Predication on both source and destination
146 * Two different sources of predication: INT and CR Fields
147 * SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and
148 predicate-result mode.
149
Different classes of operations require different formats. The following sections cover
the common formats, and the four separate modes are then covered: CR operations (crops),
Arithmetic/Logical (termed "normal"), Load/Store and Branch-Conditional.
153
154 ## Definition of Reserved in this spec.
155
156 For the new fields added in SVP64, instructions that have any of their
157 fields set to a reserved value must cause an illegal instruction trap,
158 to allow emulation of future instruction sets, or for subsets of SVP64
159 to be implemented in hardware and the rest emulated.
160 This includes SVP64 SPRs: reading or writing values which are not
161 supported in hardware must also raise illegal instruction traps
162 in order to allow emulation.
163 Unless otherwise stated, reserved values are always all zeros.
164
This is unlike the Power ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero. Where the standard Power ISA definition
is intended, the red keyword `RESERVED` is used.
167
168 ## Definition of "UnVectoriseable"
169
170 Any operation that inherently makes no sense if repeated is termed "UnVectoriseable"
171 or "UnVectorised". Examples include `sc` or `sync` which have no registers. `mtmsr` is
172 also classed as UnVectoriseable because there is only one `MSR`.
173
174 ## Scalar Identity Behaviour
175
SVP64 is designed so that when the prefix is all zeros, and VL=1, no effect or
influence occurs (no augmentation) such that all standard Power ISA
v3.0/v3.1 instructions covered by the prefix are "unaltered". This is termed
`scalar identity behaviour` (based on the mathematical definition for "identity",
as in, "identity matrix" or, better, "identity transformation").
180
181 Note that this is completely different from when VL=0. VL=0 turns all operations under its influence into `nops` (regardless of the prefix)
182 whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction (an "identity transformation").
183
184 ## Register Naming and size
185
186 As previously mentioned SV Registers are simply the INT, FP and CR register files extended
187 linearly to larger sizes; SV Vectorisation iterates sequentially through these registers
188 (LSB0 sequential ordering from 0 to VL-1).
189
Where the integer regfile in standard scalar
Power ISA v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127.
Likewise FP registers are extended to 128 (fp0 to fp127), and CR Fields
are extended to 128 entries, CR0 through CR127.
195
196 The names of the registers therefore reflects a simple linear extension
197 of the Power ISA v3.0B / v3.1B register naming, and in hardware this
198 would be reflected by a linear increase in the size of the underlying
199 SRAM used for the regfiles.
200
201 Note: when an EXTRA field (defined below) is zero, SV is deliberately designed
202 so that the register fields are identical to as if SV was not in effect
203 i.e. under these circumstances (EXTRA=0) the register field names RA,
204 RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. This is part of
205 `scalar identity behaviour` described above.
206
207 ## Future expansion.
208
209 With the way that EXTRA fields are defined and applied to register fields,
future versions of SV may involve 256 or more registers. Backwards binary compatibility may be achieved with a PCR bit (Processor Compatibility Register). Further discussion is out of scope for this version of SVP64.
211
212 # Remapped Encoding (`RM[0:23]`)
213
214 To allow relatively easy remapping of which portions of the Prefix Opcode
215 Map are used for SVP64 without needing to rewrite a large portion of the
216 SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to
217 a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]`
218 at the LSB.
219
220 The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
221 is defined in the Prefix Fields section.
222
223 ## Prefix Fields
224
225 TODO incorporate EXT09
226
To "activate" SVP64 (in a way that does not conflict with v3.1B 64-bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
228 (see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
229 This is achieved by setting bits 7 and 9 to 1:
230
231 | Name | Bits | Value | Description |
232 |------------|---------|-------|--------------------------------|
233 | EXT01 | `0:5` | `1` | Indicates Prefixed 64-bit |
234 | `RM[0]` | `6` | | Bit 0 of Remapped Encoding |
235 | SVP64_7 | `7` | `1` | Indicates this is SVP64 |
236 | `RM[1]` | `8` | | Bit 1 of Remapped Encoding |
237 | SVP64_9 | `9` | `1` | Indicates this is SVP64 |
238 | `RM[2:23]` | `10:31` | | Bits 2-23 of Remapped Encoding |
239
240 Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
241 are constructed:
242
243 | 0:5 | 6 | 7 | 8 | 9 | 10:31 |
244 |--------|-------|---|-------|---|----------|
245 | EXT01 | RM | 1 | RM | 1 | RM |
246 | 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
247
248 Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1
249 instruction. That instruction becomes "prefixed" with the SVP context: the
250 Remapped Encoding field (RM).
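
*Illustrative (non-normative) sketch: assembling the 32-bit prefix word from the 24-bit Remapped Encoding, expressed here in LSB0 arithmetic for convenience (MSB0 bit 0 is LSB0 bit 31). The function name is purely for illustration.*

```
# illustrative only: build the 32-bit SVP64 prefix from RM[0:23],
# where rm is a 24-bit integer whose most significant bit is RM[0].
def svp64_prefix(rm):
    assert 0 <= rm < (1 << 24)
    rm0    = (rm >> 23) & 1          # RM[0]
    rm1    = (rm >> 22) & 1          # RM[1]
    rm2_23 = rm & ((1 << 22) - 1)    # RM[2:23]
    prefix  = 0b000001 << 26         # EXT01, MSB0 bits 0:5
    prefix |= rm0 << 25              # MSB0 bit 6
    prefix |= 1   << 24              # SVP64_7 identifier bit
    prefix |= rm1 << 23              # MSB0 bit 8
    prefix |= 1   << 22              # SVP64_9 identifier bit
    prefix |= rm2_23                 # MSB0 bits 10:31
    return prefix

# an all-zeros RM gives only the EXT01 opcode and the two identifier bits
assert svp64_prefix(0) == 0x05400000
```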
251
252 It is important to note that unlike v3.1 64-bit prefixed instructions
253 there is insufficient space in `RM` to provide identification of
254 any SVP64 Fields without first partially decoding the
255 32-bit suffix. Similar to the "Forms" (X-Form, D-Form) the
256 `RM` format is individually associated with every instruction.
257
258 Extreme caution and care must therefore be taken
259 when extending SVP64 in future, to not create unnecessary relationships
260 between prefix and suffix that could complicate decoding, adding latency.
261
262 # Common RM fields
263
264 The following fields are common to all Remapped Encodings:
265
266 | Field Name | Field bits | Description |
267 |------------|------------|----------------------------------------|
268 | MASKMODE | `0` | Execution (predication) Mask Kind |
269 | MASK | `1:3` | Execution Mask |
270 | SUBVL | `8:9` | Sub-vector length |
271
272 The following fields are optional or encoded differently depending
273 on context after decoding of the Scalar suffix:
274
275 | Field Name | Field bits | Description |
276 |------------|------------|----------------------------------------|
277 | ELWIDTH | `4:5` | Element Width |
278 | ELWIDTH_SRC | `6:7` | Element Width for Source |
279 | EXTRA | `10:18` | Register Extra encoding |
280 | MODE | `19:23` | changes Vector behaviour |
281
282 * MODE changes the behaviour of the SV operation (result saturation, mapreduce)
283 * SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work
284 * ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and source operand width
285 * MASK (and MASK_SRC) and MASKMODE provide predication (two types of sources: scalar INT and Vector CR).
286 * Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix.
287
288 Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag.
289
290 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
291
292 # Mode
293
Mode is an augmentation of SV behaviour. Different types of
instructions have different needs, similar to how the Power ISA
v3.1 64-bit prefix 8LS and MRR formats apply to different
instruction types. Modes include Reduction, Iteration, arithmetic
saturation, and Fail-First. More specific details are given in each
section and in the [[svp64/appendix]].
300
301 * For condition register operations see [[sv/cr_ops]]
302 * For LD/ST Modes, see [[sv/ldst]].
303 * For Branch modes, see [[sv/branches]]
304 * For arithmetic and logical, see [[sv/normal]]
305
306 # ELWIDTH Encoding
307
308 Default behaviour is set to 0b00 so that zeros follow the convention of
309 `scalar identity behaviour`. In this case it means that elwidth overrides
are not applicable. Thus if a 32-bit instruction operates on 32-bit data,
`elwidth=0b00` specifies that this behaviour is unmodified. Likewise
when a processor is switched from 64-bit to 32-bit mode, `elwidth=0b00`
states that, again, the behaviour is not to be modified.
314
315 Only when elwidth is nonzero is the element width overridden to the
316 explicitly required value.
317
318 ## Elwidth for Integers:
319
320 | Value | Mnemonic | Description |
321 |-------|----------------|------------------------------------|
322 | 00 | DEFAULT | default behaviour for operation |
323 | 01 | `ELWIDTH=w` | Word: 32-bit integer |
324 | 10 | `ELWIDTH=h` | Halfword: 16-bit integer |
325 | 11 | `ELWIDTH=b` | Byte: 8-bit integer |
326
This encoding is chosen such that the element width in bits may be computed as
`8<<(3-ew)`.
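
*A trivial (non-normative) check of that formula:*

```
# element width in bits for each 2-bit integer ELWIDTH encoding
assert [8 << (3 - ew) for ew in range(4)] == [64, 32, 16, 8]
```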
329
330 ## Elwidth for FP Registers:
331
332 | Value | Mnemonic | Description |
333 |-------|----------------|------------------------------------|
334 | 00 | DEFAULT | default behaviour for FP operation |
335 | 01 | `ELWIDTH=f32` | 32-bit IEEE 754 Single floating-point |
336 | 10 | `ELWIDTH=f16` | 16-bit IEEE 754 Half floating-point |
337 | 11 | `ELWIDTH=bf16` | Reserved for `bf16` |
338
339 Note:
340 [`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
341 is reserved for a future implementation of SV
342
Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall
perform its operation at **half** the ELWIDTH, with the result then padded back out
to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
346 clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
347 then padded back out to fit in IEEE754 FP64, exactly as for Scalar
348 v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16
349 or ELWIDTH=bf16 is reserved and must raise an illegal instruction
350 (IEEE754 FP8 or BF8 are not defined).
351
352 ## Elwidth for CRs:
353
Element-width overrides for CR Fields have no meaning. The bits
355 are therefore used for other purposes, or when Rc=1, the Elwidth
356 applies to the result being tested (a GPR or FPR), but not to the
357 Vector of CR Fields.
358
359 # SUBVL Encoding
360
The default for SUBVL is 1 and its encoding is 0b00, to indicate that
SUBVL is effectively disabled (a SUBVL for-loop of only one element). This
lines up in combination with all other "default is all zeros" behaviour.
364
365 | Value | Mnemonic | Subvec | Description |
366 |-------|-----------|---------|------------------------|
367 | 00 | `SUBVL=1` | single | Sub-vector length of 1 |
368 | 01 | `SUBVL=2` | vec2 | Sub-vector length of 2 |
369 | 10 | `SUBVL=3` | vec3 | Sub-vector length of 3 |
370 | 11 | `SUBVL=4` | vec4 | Sub-vector length of 4 |
371
372 The SUBVL encoding value may be thought of as an inclusive range of a
373 sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
374 this may be considered to be elements 0b00 to 0b01 inclusive.
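
*Illustrative (non-normative) sketch of SUBVL grouping, ignoring predication, elwidth overrides, Pack/Unpack and REMAP: the inner SUBVL loop simply steps through sequentially-numbered elements, so a vec3 operand advances three elements per outer iteration. Register numbers and the operation are arbitrary.*

```
# illustrative only: element traversal with a sub-vector length.
VL, SUBVL = 2, 3                 # two vec3 elements
RT, RA = 8, 16
regs = list(range(64))           # stand-in integer register file, elwidth=64
for i in range(VL):              # hardware for-loop over VL
    for j in range(SUBVL):       # inner loop over the sub-vector
        e = i * SUBVL + j        # sequential element (and register) number
        regs[RT + e] = regs[RA + e] + 1
assert regs[8:14] == [17, 18, 19, 20, 21, 22]
```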
375
376 # MASK/MASK_SRC & MASKMODE Encoding
377
378 TODO: rename MASK_KIND to MASKMODE
379
380 One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
381 types may not be mixed.
382
383 Special note: to disable predication this field must
384 be set to zero in combination with Integer Predication also being set
to 0b000. This has the effect of enabling "all 1s" in the predicate
386 mask, which is equivalent to "not having any predication at all"
387 and consequently, in combination with all other default zeros, fully
388 disables SV (`scalar identity behaviour`).
389
390 `MASKMODE` may be set to one of 2 values:
391
392 | Value | Description |
393 |-----------|------------------------------------------------------|
394 | 0 | MASK/MASK_SRC are encoded using Integer Predication |
395 | 1 | MASK/MASK_SRC are encoded using CR-based Predication |
396
397 Integer Twin predication has a second set of 3 bits that uses the same
encoding, thus allowing either the same register (r3, r10 or r30) to be used
399 for both src and dest, or different regs (one for src, one for dest).
400
401 Likewise CR based twin predication has a second set of 3 bits, allowing
402 a different test to be applied.
403
404 Note that it is assumed that Predicate Masks (whether INT or CR)
405 are read *before* the operations proceed. In practice (for CR Fields)
406 this creates an unnecessary block on parallelism. Therefore,
407 it is up to the programmer to ensure that the CR fields used as
408 Predicate Masks are not being written to by any parallel Vector Loop.
409 Doing so results in **UNDEFINED** behaviour, according to the definition
410 outlined in the Power ISA v3.0B Specification.
411
412 Hardware Implementations are therefore free and clear to delay reading
413 of individual CR fields until the actual predicated element operation
414 needs to take place, safe in the knowledge that no programmer will
415 have issued a Vector Instruction where previous elements could have
416 overwritten (destroyed) not-yet-executed CR-Predicated element operations.
417
418 ## Integer Predication (MASKMODE=0)
419
420 When the predicate mode bit is zero the 3 bits are interpreted as below.
421 Twin predication has an identical 3 bit field similarly encoded.
422
423 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
424
425 | Value | Mnemonic | Element `i` enabled if: |
426 |-------|----------|------------------------------|
427 | 000 | ALWAYS | predicate effectively all 1s |
428 | 001 | 1 << R3 | `i == R3` |
429 | 010 | R3 | `R3 & (1 << i)` is non-zero |
430 | 011 | ~R3 | `R3 & (1 << i)` is zero |
431 | 100 | R10 | `R10 & (1 << i)` is non-zero |
432 | 101 | ~R10 | `R10 & (1 << i)` is zero |
433 | 110 | R30 | `R30 & (1 << i)` is non-zero |
434 | 111 | ~R30 | `R30 & (1 << i)` is zero |
435
436 r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs.
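
*The table above may be summarised by the following illustrative (non-normative) sketch, returning whether element `i` is enabled for a given 3-bit MASK value; `regs` stands in for the integer register file.*

```
# illustrative only: Integer Predication (MASKMODE=0) element test
def int_pred_enabled(mask, i, regs):
    if mask == 0b000: return True                         # ALWAYS: all 1s
    if mask == 0b001: return i == regs[3]                 # 1 << R3
    if mask == 0b010: return ((regs[3]  >> i) & 1) == 1   # R3
    if mask == 0b011: return ((regs[3]  >> i) & 1) == 0   # ~R3
    if mask == 0b100: return ((regs[10] >> i) & 1) == 1   # R10
    if mask == 0b101: return ((regs[10] >> i) & 1) == 0   # ~R10
    if mask == 0b110: return ((regs[30] >> i) & 1) == 1   # R30
    if mask == 0b111: return ((regs[30] >> i) & 1) == 0   # ~R30
```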
437
438 ## CR-based Predication (MASKMODE=1)
439
440 When the predicate mode bit is one the 3 bits are interpreted as below.
441 Twin predication has an identical 3 bit field similarly encoded.
442
443 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
444
445 | Value | Mnemonic | Element `i` is enabled if |
446 |-------|----------|--------------------------|
447 | 000 | lt | `CR[offs+i].LT` is set |
448 | 001 | nl/ge | `CR[offs+i].LT` is clear |
449 | 010 | gt | `CR[offs+i].GT` is set |
450 | 011 | ng/le | `CR[offs+i].GT` is clear |
451 | 100 | eq | `CR[offs+i].EQ` is set |
452 | 101 | ne | `CR[offs+i].EQ` is clear |
453 | 110 | so/un | `CR[offs+i].FU` is set |
454 | 111 | ns/nu | `CR[offs+i].FU` is clear |
455
456 CR based predication. TODO: select alternate CR for twin predication? see
457 [[discussion]] Overlap of the two CR based predicates must be taken
458 into account, so the starting point for one of them must be suitably
459 high, or accept that for twin predication VL must not exceed the range
460 where overlap will occur, *or* that they use the same starting point
461 but select different *bits* of the same CRs
462
463 `offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
464
465 The CR Predicates chosen must start on a boundary that Vectorised
466 CR operations can access cleanly, in full.
467 With EXTRA2 restricting starting points
468 to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and CR Predicate
469 Masks have to be adapted to fit on these boundaries as well.
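
*Illustrative (non-normative) sketch of the CR-based element test, derived from the table above: the upper two bits of the encoding select which of LT/GT/EQ/SO to test and the lowest bit selects set versus clear. `cr_fields` stands in for the Vectorised CR Field regfile and the bit positions are simply list indices for this sketch.*

```
# illustrative only: CR-based Predication (MASKMODE=1) element test
LT, GT, EQ, SO = 0, 1, 2, 3          # bit-within-CR-Field, for this sketch

def cr_pred_enabled(mask, i, cr_fields, offs=32):
    bit = (LT, LT, GT, GT, EQ, EQ, SO, SO)[mask]    # which CR bit to test
    want_set = (mask & 1) == 0                      # even encodings test "is set"
    is_set = ((cr_fields[offs + i] >> bit) & 1) == 1
    return is_set == want_set
```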
470
471 # Extra Remapped Encoding <a name="extra_remap"> </a>
472
473 Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants. Note that due to the very tight space, the encoding mode is *not* included in the prefix itself. The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*).
474
475 These mappings are part of the SVP64 Specification in exactly the same
476 way as X-Form, D-Form. New Scalar instructions added to the Power ISA
477 will need a corresponding SVP64 Mapping, which can be derived by-rote
478 from examining the Register "Profile" of the instruction.
479
480 There are two categories: Single and Twin Predication.
481 Due to space considerations further subdivision of Single Predication
482 is based on whether the number of src operands is 2 or 3. With only
483 9 bits available some compromises have to be made.
484
485 * `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
486 * `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest)
487 * `RM-2P-1S1D` Twin Predication (src=1, dest=1)
488 * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
489 * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
490
491 ## RM-1P-3S1D
492
493 | Field Name | Field bits | Description |
494 |------------|------------|----------------------------------------|
495 | Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
496 | Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
497 | Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
498 | Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
499 | EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS |
500
These are for instructions with 3 source operands and either 1 or 2 destinations.
502 3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
503 such as `maddedu` have an implicit second destination, RS, the
504 selection of which is determined by bit 18.
505
506 ## RM-1P-2S1D
507
508 | Field Name | Field bits | Description |
509 |------------|------------|-------------------------------------------|
510 | Rdest\_EXTRA3 | `10:12` | extends Rdest |
511 | Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1 |
| Rsrc2\_EXTRA3 | `16:18` | extends Rsrc2 |
513
514 These are for 2 operand 1 dest instructions, such as `add RT, RA,
515 RB`. However also included are unusual instructions with an implicit dest
that is identical to its src reg, such as `rlwimi`.
517
Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would not have sufficient bit fields to allow
519 an alternative destination. With SV however this becomes possible.
520 Therefore, the fact that the dest is implicitly also a src should not
521 mislead: due to the *prefix* they are different SV regs.
522
523 * `rlwimi RA, RS, ...`
524 * Rsrc1_EXTRA3 applies to RS as the first src
* Rsrc2_EXTRA3 applies to RA as the second src
526 * Rdest_EXTRA3 applies to RA to create an **independent** dest.
527
528 With the addition of the EXTRA bits, the three registers
529 each may be *independently* made vector or scalar, and be independently
530 augmented to 7 bits in length.
531
532 ## RM-2P-1S1D/2S
533
534 | Field Name | Field bits | Description |
535 |------------|------------|----------------------------|
536 | Rdest_EXTRA3 | `10:12` | extends Rdest |
537 | Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
538 | MASK_SRC | `16:18` | Execution Mask for Source |
539
540 `RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2.
541
552 ## RM-2P-2S1D/1S2D/3S
553
554 The primary purpose for this encoding is for Twin Predication on LOAD
and STORE operations. See [[sv/ldst]] for detailed analysis.
556
557 RM-2P-2S1D:
558
559 | Field Name | Field bits | Description |
560 |------------|------------|----------------------------|
561 | Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
562 | Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
563 | Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
564 | MASK_SRC | `16:18` | Execution Mask for Source |
565
Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
is in bits 10:11, Rdest1_EXTRA2 in 12:13).
568
569 Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
570
571 Note also that LD with update indexed, which takes 2 src and 2 dest
572 (e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
Twin Predication. Therefore these are treated as RM-2P-2S1D and the
574 src spec for RA is also used for the same RA as a dest.
575
576 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
577
578 # R\*\_EXTRA2/3
579
580 EXTRA is the means by which two things are achieved:
581
582 1. Registers are marked as either Vector *or Scalar*
583 2. Register field numbers (limited typically to 5 bit)
584 are extended in range, both for Scalar and Vector.
585
586 The register files are therefore extended:
587
588 * INT is extended from r0-31 to r0-127
* FP is extended from fp0-fp31 to fp0-fp127
590 * CR Fields are extended from CR0-7 to CR0-127
591
592 However due to pressure in `RM.EXTRA` not all these registers
593 are accessible by all instructions, particularly those with
594 a large number of operands (`madd`, `isel`).
595
596 In the following tables register numbers are constructed from the
597 standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2
598 or EXTRA3 field from the SV Prefix, determined by the specific
599 RM-xx-yyyy designation for a given instruction.
600 The prefixing is arranged so that
601 interoperability between prefixing and nonprefixing of scalar registers
602 is direct and convenient (when the EXTRA field is all zeros).
603
604 A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs)
605
```
if extra3_mode:
    spec = EXTRA3
elif EXTRA2[0]: # vector mode
    spec = EXTRA2 << 1 # same as EXTRA3, shifted
else: # scalar mode
    spec = EXTRA2 # not shifted: see the EXTRA2 table below
if spec[0]: # vector
    return (RA << 2) | spec[1:2]
else: # scalar
    return (spec[1:2] << 5) | RA
```
616
617 Future versions may extend to 256 by shifting Vector numbering up.
618 Scalar will not be altered.
619
620 Note that in some cases the range of starting points for Vectors
621 is limited.
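
*The pseudocode above may be expanded into the following illustrative (non-normative) sketch, which returns whether the operand is a Vector plus its 7-bit register number, and which can be cross-checked against the EXTRA3/EXTRA2 tables in the next two sections.*

```
# illustrative only: INT/FP EXTRA2/EXTRA3 decode.
# RA is the standard 5-bit register field from the 32-bit suffix.
def decode_extra(RA, extra, extra3_mode):
    if extra3_mode:
        spec = extra                    # 3-bit EXTRA3
    elif (extra >> 1) & 1:              # EXTRA2 MSB set: vector mode
        spec = extra << 1               # same as EXTRA3, shifted
    else:                               # EXTRA2 scalar mode
        spec = extra                    # not shifted
    vector = (spec >> 2) & 1            # spec[0] in MSB0 terms
    lo2    = spec & 0b11                # spec[1:2] in MSB0 terms
    if vector:
        return True,  (RA << 2) | lo2   # Vectors: starting points step by 4
    return False, (lo2 << 5) | RA       # Scalars: r0-r127, step of 1

assert decode_extra(1, 0b101, True)  == (True, 5)    # EXTRA3: r1-r125/4
assert decode_extra(1, 0b011, True)  == (False, 97)  # EXTRA3: r96-r127
assert decode_extra(1, 0b01,  False) == (False, 33)  # EXTRA2: r32-r63
assert decode_extra(1, 0b11,  False) == (True, 6)    # EXTRA2: r2-r126/4
```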
622
623 ## INT/FP EXTRA3
624
If EXTRA3 is zero, it maps to
626 "scalar identity" (scalar Power ISA field naming).
627
628 Fields are as follows:
629
630 * Value: R_EXTRA3
631 * Mode: register is tagged as scalar or vector
632 * Range/Inc: the range of registers accessible from this EXTRA
633 encoding, and the "increment" (accessibility). "/4" means
634 that this EXTRA encoding may only give access (starting point)
635 every 4th register.
636 * MSB..LSB: the bit field showing how the register opcode field
637 combines with EXTRA to give (extend) the register number (GPR)
638
639 | Value | Mode | Range/Inc | 6..0 |
640 |-----------|-------|---------------|---------------------|
641 | 000 | Scalar | `r0-r31`/1 | `0b00 RA` |
642 | 001 | Scalar | `r32-r63`/1 | `0b01 RA` |
643 | 010 | Scalar | `r64-r95`/1 | `0b10 RA` |
644 | 011 | Scalar | `r96-r127`/1 | `0b11 RA` |
645 | 100 | Vector | `r0-r124`/4 | `RA 0b00` |
646 | 101 | Vector | `r1-r125`/4 | `RA 0b01` |
647 | 110 | Vector | `r2-r126`/4 | `RA 0b10` |
648 | 111 | Vector | `r3-r127`/4 | `RA 0b11` |
649
650 ## INT/FP EXTRA2
651
If EXTRA2 is zero, it will map to
"scalar identity behaviour", i.e. Scalar Power ISA register naming:
654
655 | Value | Mode | Range/inc | 6..0 |
656 |-----------|-------|---------------|-----------|
657 | 00 | Scalar | `r0-r31`/1 | `0b00 RA` |
658 | 01 | Scalar | `r32-r63`/1 | `0b01 RA` |
659 | 10 | Vector | `r0-r124`/4 | `RA 0b00` |
660 | 11 | Vector | `r2-r126`/4 | `RA 0b10` |
661
662 **Note that unlike in EXTRA3, in EXTRA2**:
663
664 * the GPR Vectors may only start from
665 `r0, r2, r4, r6, r8` and likewise FPR Vectors.
666 * the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.
667
as there are insufficient bits to cover the full range.
669
670 ## CR Field EXTRA3
671
672 CR Field encoding is essentially the same but made more complex due to CRs being bit-based. See [[svp64/appendix]] for explanation and pseudocode.
673 Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
674 and Scalars may only go from `CR0, CR1, ... CR31`
675
676 Encoding shown MSB down to LSB
677
678 For a 5-bit operand (BA, BB, BT):
679
680 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
681 |-------|------|---------------|-----------| --------|---------|
682 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
683 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
684 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[4:2] | BA[1:0] |
685 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[4:2] | BA[1:0] |
686 | 100 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
687 | 101 | Vector | `CR4-CR116`/16 | BA[4:2] 0 | 0b100 | BA[1:0] |
688 | 110 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
689 | 111 | Vector | `CR12-CR124`/16 | BA[4:2] 1 | 0b100 | BA[1:0] |
690
691 For a 3-bit operand (e.g. BFA):
692
693 | Value | Mode | Range/Inc | 6..3 | 2..0 |
694 |-------|------|---------------|-----------| --------|
695 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
696 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
697 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BFA |
698 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BFA |
699 | 100 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
700 | 101 | Vector | `CR4-CR116`/16 | BFA 0 | 0b100 |
701 | 110 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
702 | 111 | Vector | `CR12-CR124`/16 | BFA 1 | 0b100 |
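
*A non-normative sketch of the 5-bit-operand table above (the authoritative pseudocode is in [[svp64/appendix]]), returning the extended CR Field number plus the unchanged bit-within-field selector:*

```
# illustrative only: CR Field EXTRA3 decode for a 5-bit operand (BA/BB/BT).
# Returns (vector?, CR Field number 0-127, bit within the CR Field 0-3).
def decode_cr_extra3(BA, extra3):
    ba_field = BA >> 2                 # BA[4:2]: scalar CR Field CR0-CR7
    ba_bit   = BA & 0b11               # BA[1:0]: bit within the CR Field
    if (extra3 >> 2) & 1:              # vector
        cr = (ba_field << 4) | ((extra3 & 0b11) << 2)   # a multiple of 4
        return True, cr, ba_bit
    cr = ((extra3 & 0b11) << 3) | ba_field              # CR0-CR31
    return False, cr, ba_bit

assert decode_cr_extra3(0b00111, 0b110) == (True, 24, 3)   # CR8-CR120/16 row
assert decode_cr_extra3(0b00111, 0b011) == (False, 25, 3)  # CR24-CR31 row
```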
703
704 ## CR EXTRA2
705
706 CR encoding is essentially the same but made more complex due to CRs being bit-based. See separate section for explanation and pseudocode.
707 Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
708
709
710 Encoding shown MSB down to LSB
711
712 For a 5-bit operand (BA, BB, BC):
713
714 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
715 |-------|--------|----------------|---------|---------|---------|
716 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
717 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
718 | 10 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
719 | 11 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
720
721 For a 3-bit operand (e.g. BFA):
722
723 | Value | Mode | Range/Inc | 6..3 | 2..0 |
724 |-------|------|---------------|-----------| --------|
725 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
726 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
727 | 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
728 | 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
729
730
731 # Normal SVP64 Modes, for Arithmetic and Logical Operations
732
733 Normal SVP64 Mode covers Arithmetic and Logical operations
734 to provide suitable additional behaviour. The Mode
735 field is bits 19-23 of the [[svp64]] RM Field.
736
737 ## Mode
738
739 Mode is an augmentation of SV behaviour, providing additional
740 functionality. Some of these alterations are element-based (saturation), others involve post-analysis (predicate result) and others are Vector-based (mapreduce, fail-on-first).
741
742 [[sv/ldst]],
743 [[sv/cr_ops]] and [[sv/branches]] are covered separately: the following
744 Modes apply to Arithmetic and Logical SVP64 operations:
745
* **simple** mode is straight vectorisation, with no augmentations: the vector comprises an array of independently created results.
* **ffirst** or data-dependent fail-on-first: see separate section. The vector may be truncated depending on certain criteria.
*VL is altered as a result*.
* **sat mode** or saturation: clamps each element result to a min/max rather than overflowing/wrapping. Allows signed and unsigned clamping for both INT
and FP.
* **reduce mode**. If used correctly, a mapreduce (or a prefix sum)
is performed. See [[svp64/appendix]];
note that there are comprehensive caveats when using this mode.
* **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch conditional testing) and if the test fails it
is as if the
*destination* predicate bit was zero even before starting the operation.
When Rc=1 the CR element however is still stored in the CR regfile, even if the test failed. See appendix for details.
758
759 Note that ffirst and reduce modes are not anticipated to be high-performance in some implementations. ffirst due to interactions with VL, and reduce due to it requiring additional operations to produce a result. simple, saturate and pred-result are however inter-element independent and may easily be parallelised to give high performance, regardless of the value of VL.
760
761 The Mode table for Arithmetic and Logical operations
762 is laid out as follows:
763
764 | 0-1 | 2 | 3 4 | description |
765 | --- | --- |---------|-------------------------- |
766 | 00 | 0 | dz sz | simple mode |
767 | 00 | 1 | 0 RG | scalar reduce mode (mapreduce) |
768 | 00 | 1 | 1 / | reserved |
769 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
770 | 01 | inv | VLi RC1 | Rc=0: ffirst z/nonz |
771 | 10 | N | dz sz | sat mode: N=0/1 u/s |
772 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
773 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
774
775 Fields:
776
* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
778 * **zz**: both sz and dz are set equal to this flag
779 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
780 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
781 than the normal 0..VL-1
782 * **N** sets signed/unsigned saturation.
783 * **RC1** as if Rc=1, enables access to `VLi`.
784 * **VLi** VL inclusive: in fail-first mode, the truncation of
785 VL *includes* the current element at the failure point rather
786 than excludes it from the count.
787
788 For LD/ST Modes, see [[sv/ldst]]. For Condition Registers
789 see [[sv/cr_ops]].
790 For Branch modes, see [[sv/branches]].
791
792 # Rounding, clamp and saturate
793
794 See [[av_opcodes]] for relevant opcodes and use-cases.
795
796 To help ensure that audio quality is not compromised by overflow,
797 "saturation" is provided, as well as a way to detect when saturation
798 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
799 one CR per element in the result (Note: this is different from VSX which
800 has a single CR per block).
801
802 When N=0 the result is saturated to within the maximum range of an
803 unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
804 logic applies to FP operations, with the result being saturated to
805 maximum rather than returning INF, and the minimum to +0.0
806
807 When N=1 the same occurs except that the result is saturated to the min
808 or max of a signed result, and for FP to the min and max value rather
809 than returning +/- INF.
810
811 When Rc=1, the CR "overflow" bit is set on the CR associated with the
812 element, to indicate whether saturation occurred. Note that due to
813 the hugely detrimental effect it has on parallel processing, XER.SO is
814 **ignored** completely and is **not** brought into play here. The CR
815 overflow bit is therefore simply set to zero if saturation did not occur,
816 and to one if it did.
817
818 Note also that saturate on operations that set OE=1 must raise an
819 Illegal Instruction due to the conflicting use of the CR.so bit for
820 storing if
821 saturation occurred. Integer Operations that produce a Carry-Out (CA, CA32):
822 these two bits will be `UNDEFINED` if saturation is also requested.
823
824 Note that the operation takes place at the maximum bitwidth (max of
825 src and dest elwidth) and that truncation occurs to the range of the
826 dest elwidth.
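
*Illustrative (non-normative) sketch of the clamping itself for integer operations, returning both the clamped value and the flag that would be placed in the CR Field's SO position when Rc=1:*

```
# illustrative only: saturate a result to the destination element width
def saturate(value, elwidth_bits, signed):
    if signed:                                  # N=1
        lo = -(1 << (elwidth_bits - 1))
        hi = (1 << (elwidth_bits - 1)) - 1
    else:                                       # N=0
        lo, hi = 0, (1 << elwidth_bits) - 1
    clamped = min(max(value, lo), hi)
    return clamped, clamped != value            # (result, saturation occurred)

assert saturate(300, 8, signed=False) == (255, True)
assert saturate(-5,  8, signed=False) == (0, True)
assert saturate(130, 8, signed=True)  == (127, True)
```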
827
828 *Programmer's Note: Post-analysis of the Vector of CRs to find out if any given element hit
829 saturation may be done using a mapreduced CR op (cror), or by using the
830 new crrweird instruction with Rc=1, which will transfer the required
831 CR bits to a scalar integer and update CR0, which will allow testing
832 the scalar integer for nonzero. see [[sv/cr_int_predication]]*
833
834 ## Reduce mode
835
836 Reduction in SVP64 is similar in essence to other Vector Processing
837 ISAs, but leverages the underlying scalar Base v3.0B operations.
838 Thus it is more a convention that the programmer may utilise to give
839 the appearance and effect of a Horizontal Vector Reduction. Due
840 to the unusual decoupling it is also possible to perform
841 prefix-sum (Fibonacci Series) in certain circumstances. Details are in the [[svp64/appendix]]
842
843 Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
844 As explained in the [[sv/appendix]] Reduce Mode switches off the check
845 which would normally stop looping if the result register is scalar.
846 Thus, the result scalar register, if also used as a source scalar,
847 may be used to perform sequential accumulation. This *deliberately*
848 sets up a chain
849 of Register Hazard Dependencies, whereas Parallel Reduce [[sv/remap]]
850 deliberately issues a Tree-Schedule of operations that may be parallelised.
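
*Illustrative (non-normative) sketch of the effect: with both the destination and one source being the same Scalar register, Reduce Mode keeps the loop running over all VL elements, so the scalar acts as a sequential accumulator. Register numbers are arbitrary.*

```
# illustrative only: sv.add with RT and RA Scalar (the same register)
# and RB a Vector, under Reduce Mode: a running sum into r3.
VL, RT, RA, RB = 4, 3, 3, 8
iregs = [0] * 32
iregs[RB:RB + VL] = [10, 20, 30, 40]
for i in range(VL):                          # Reduce Mode: loop is NOT cut short
    iregs[RT] = iregs[RA] + iregs[RB + i]    # deliberate hazard chain on r3
assert iregs[3] == 100
```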
851
852 ## Fail-on-first
853
854 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
855 the other for arithmetic operations (actually, CR-driven). Note in each
856 case the assumption is that vector elements are required to appear to be
857 executed in sequential Program Order. When REMAP is not active,
858 element 0 would be the first.
859
Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
861 CR-creating operation produces a result (including cmp). Similar to
862 branch, an analysis of the CR is performed and if the test fails, the
863 vector operation terminates and discards all element operations **at and
864 above the current one**, and VL is truncated to either
865 the *previous* element or the current one, depending on whether
866 VLi (VL "inclusive") is clear or set, respectively.
867
868 Thus the new VL comprises a contiguous vector of results,
869 all of which pass the testing criteria (equal to zero, less than zero etc
870 as defined by the CR-bit test).
871
872 *Note: when VLi is clear, the behaviour at first seems counter-intuitive.
873 A result is calculated but if the test fails it is prohibited from being
874 actually written. This becomes intuitive again when it is remembered
875 that the length that VL is set to is the number of *written* elements,
876 and only when VLI is set will the current element be included in that
877 count.*
878
879 The CR-based data-driven fail-on-first is "new" and not found in ARM
880 SVE or RVV. At the same time it is "old" because it is almost
881 identical to a generalised form of Z80's `CPIR` instruction.
882 It is extremely useful for reducing instruction count,
883 however requires speculative execution involving modifications of VL
884 to get high performance implementations. An additional mode (RC1=1)
885 effectively turns what would otherwise be an arithmetic operation
886 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
887 against the `inv` field).
888 If the CR.eq bit is equal to `inv` then the Vector is truncated and
889 the loop ends.
890
891 VLi is only available as an option when `Rc=0` (or for instructions
892 which do not have Rc). When set, the current element is always
893 also included in the count (the new length that VL will be set to).
894 This may be useful in combination with "inv" to truncate the Vector
895 to *exclude* elements that fail a test, or, in the case of implementations
896 of strncpy, to include the terminating zero.
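
*Illustrative (non-normative) sketch of the VL truncation rule, excluding predication and element-width overrides; `cr_test_bit` stands in for the BO-style selection of one bit of each element's CR Field.*

```
# illustrative only: CR-driven data-dependent fail-first.
# Truncates VL at the first element whose tested CR bit equals "inv";
# VLi selects whether that failing element is included in the new VL.
def ffirst_truncate(VL, VLi, inv, cr_test_bit):
    for i in range(VL):
        if cr_test_bit(i) == inv:            # test failed at element i
            return i + 1 if VLi else i       # may legitimately become zero
    return VL                                # no failure: VL unchanged
```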
897
898 In CR-based data-driven fail-on-first there is only the option to select
899 and test one bit of each CR (just as with branch BO). For more complex
900 tests this may be insufficient. If that is the case, a vectorised crop
901 such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
902 and ffirst applied to the crop instead of to
903 the arithmetic vector. Note that crops are covered by
904 the [[sv/cr_ops]] Mode format.
905
906 *Programmer's note: `VLi` is only accessible in normal operations
907 which in turn limits the CR field bit-testing to only `EQ/NE`.
908 [[sv/cr_ops]] are not so limited. Thus it is possible to use for
909 example `sv.cror/ff=gt/vli *0,*0,*0`, which is not a `nop` because
910 it allows Fail-First Mode to perform a test and truncate VL.*
911
912 Two extremely important aspects of ffirst are:
913
* LDST ffirst may never set VL equal to zero. This is because on the first
915 element an exception must be raised "as normal".
916 * CR-based data-dependent ffirst on the other hand **can** set VL equal
917 to zero. This is the only means in the entirety of SV that VL may be set
918 to zero (with the exception of via the SV.STATE SPR). When VL is set
919 zero due to the first element failing the CR bit-test, all subsequent
920 vectorised operations are effectively `nops` which is
921 *precisely the desired and intended behaviour*.
922
923 The second crucial aspect, compared to LDST Ffirst:
924
925 * LD/ST Failfirst may (beyond the initial first element
926 conditions) truncate VL for any architecturally
927 suitable reason. Beyond the first element LD/ST Failfirst is
928 arbitrarily speculative and 100% non-deterministic.
* CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise Deterministic test on which algorithms
can and will rely.
934
935 **Floating-point Exceptions**
936
937 When Floating-point exceptions are enabled VL must be truncated at
938 the point where the Exception appears not to have occurred. If `VLi`
939 is set then VL must include the faulting element, and thus the
940 faulting element will always raise its exception. If however `VLi`
941 is clear then VL **excludes** the faulting element and thus the
942 exception will **never** be raised.
943
Although very strongly
discouraged, the Exception Mode that permits Floating Point Exception
notification to arrive too late to unwind is permitted
(under protest, due to it violating
the otherwise 100% Deterministic nature of Data-dependent Fail-first).
949
**Use of lax FP Exception Notification Mode could result in parallel
computations proceeding with invalid results that have to be explicitly
detected, whereas with the strict FP Exception Mode enabled, FFirst
truncates VL, allowing subsequent parallel computation to avoid
the exceptions entirely.**
955
956 ## Data-dependent fail-first on CR operations (crand etc)
957
958 Operations that actually produce or alter CR Field as a result
959 have their own SVP64 Mode, described
960 in [[sv/cr_ops]].
961
962 ## pred-result mode
963
964 This mode merges common CR testing with predication, saving on instruction
965 count. Below is the pseudocode excluding predicate zeroing and elwidth
966 overrides. Note that the pseudocode for SVP64 CR-ops is slightly different.
967
```
for i in range(VL):
    # predication test, skip all masked out elements.
    if predicate_masked_out(i):
        continue
    result = op(iregs[RA+i], iregs[RB+i])
    CRnew = analyse(result) # calculates eq/lt/gt
    # Rc=1 always stores the CR field
    if Rc == 1 or RC1:
        CR.field[offs+i] = CRnew
    # now test CR, similar to branch
    if RC1 or CRnew[BO[0:1]] != BO[2]:
        continue # test failed: cancel store
    # result optionally stored but CR always is
    iregs[RT+i] = result
```
984
985 The reason for allowing the CR element to be stored is so that
986 post-analysis of the CR Vector may be carried out. For example:
987 Saturation may have occurred (and been prevented from updating, by the
988 test) but it is desirable to know *which* elements fail saturation.
989
990 Note that RC1 Mode basically turns all operations into `cmp`. The
991 calculation is performed but it is only the CR that is written. The
992 element result is *always* discarded, never written (just like `cmp`).
993
994 Note that predication is still respected: predicate zeroing is slightly
995 different: elements that fail the CR test *or* are masked out are zero'd.
996
997 # SV Load and Store
998
999 **Rationale**
1000
1001 All Vector ISAs dating back fifty years have extensive and comprehensive
1002 Load and Store operations that go far beyond the capabilities of Scalar
1003 RISC and most CISC processors, yet at their heart on an individual element
1004 basis may be found to be no different from RISC Scalar equivalents.
1005
1006 The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of) element-level Memory accesses.
1009
1010 Additionally, and simply: if the Arithmetic side of an ISA supports
1011 Vector Operations, then in order to keep the ALUs 100% occupied the
1012 Memory infrastructure (and the ISA itself) correspondingly needs Vector
1013 Memory Operations as well.
1014
1015 Vectorised Load and Store also presents an extra dimension (literally)
1016 which creates scenarios unique to Vector applications, that a Scalar
1017 (and even a SIMD) ISA simply never encounters. SVP64 endeavours to
1018 add the modes typically found in *all* Scalable Vector ISAs,
1019 without changing the behaviour of the underlying Base
1020 (Scalar) v3.0B operations in any way.
1021
1022 ## Modes overview
1023
Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different modes:
1026
1027 * **fixed aka "unit" stride** - contiguous sequence with no gaps
1028 * **element strided** - sequential but regularly offset, with gaps
1029 * **vector indexed** - vector of base addresses and vector of offsets
1030 * **Speculative fail-first** - where it makes sense to do so
1031 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
1032
1033 *Despite being constructed from Scalar LD/ST none of these Modes
1034 exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
1035
1036 Also included in SVP64 LD/ST is both signed and unsigned Saturation,
1037 as well as Element-width overrides and Twin-Predication.
1038
1039 Note also that Indexed [[sv/remap]] mode may be applied to both
1040 v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
1041 LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
1042 is provided below.
1043
1044 **Determining the LD/ST Modes**
1045
1046 A minor complication (caused by the retro-fitting of modern Vector
1047 features to a Scalar ISA) is that certain features do not exactly make
1048 sense or are considered a security risk. Fail-first on Vector Indexed
1049 would allow attackers to probe large numbers of pages from userspace, where
1050 strided fail-first (by creating contiguous sequential LDs) does not.
1051
1052 In addition, reduce mode makes no sense.
1053 Realistically we need
1054 an alternative table definition for [[sv/svp64]] `RM.MODE`.
1055 The following modes make sense:
1056
1057 * saturation
1058 * predicate-result (mostly for cache-inhibited LD/ST)
1059 * simple (no augmentation)
1060 * fail-first (where Vector Indexed is banned)
1061 * Signed Effective Address computation (Vector Indexed only)
1062 * Pack/Unpack (on LD/ST immediate operations only)
1063
1064 More than that however it is necessary to fit the usual Vector ISA
1065 capabilities onto both Power ISA LD/ST with immediate and to
1066 LD/ST Indexed. They present subtly different Mode tables, which, due
1067 to lack of space, have the following quirks:
1068
1069 * LD/ST Immediate has no individual control over src/dest zeroing,
1070 whereas LD/ST Indexed does.
1071 * LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
1072 * LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
1073
1074 # Format and fields
1075
1076 Fields used in tables below:
1077
* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
1079 * **zz**: both sz and dz are set equal to this flag.
1080 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1081 * **N** sets signed/unsigned saturation.
1082 * **RC1** as if Rc=1, stores CRs *but not the result*
1083 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
1084 registers that have been reduced due to elwidth overrides
1085
1086 **LD/ST immediate**
1087
1088 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
1089 (bits 19:23 of `RM`) is:
1090
1091 | 0-1 | 2 | 3 4 | description |
1092 | --- | --- |---------|--------------------------- |
1093 | 00 | 0 | zz els | simple mode |
1094 | 00 | 1 | PI LF | post-increment and Fault-First |
1095 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
1096 | 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
1097 | 10 | N | zz els | sat mode: N=0/1 u/s |
1098 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1099 | 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
1100
1101 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
1102 whether stride is unit or element:
1103
1104 ```
1105 if RA.isvec:
1106 svctx.ldstmode = indexed
1107 elif els == 0:
1108 svctx.ldstmode = unitstride
1109 elif immediate != 0:
1110 svctx.ldstmode = elementstride
1111 ```
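
*A brief (non-normative) worked example of the resulting Effective Addresses for a hypothetical doubleword load with scalar RA, an immediate of 8 and VL=4, comparing the two stride interpretations:*

```
# illustrative only: EAs for VL=4, immed=8, op_width=8 (doubleword),
# scalar RA holding 0x1000, unit stride versus element stride.
srcbase, immed, op_width, VL = 0x1000, 8, 8, 4
unit    = [srcbase + immed + i * op_width for i in range(VL)]
element = [srcbase + i * immed            for i in range(VL)]
assert unit    == [0x1008, 0x1010, 0x1018, 0x1020]
assert element == [0x1000, 0x1008, 0x1010, 0x1018]
```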
1112
1113 An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
1114 in effect the multiplication of the immediate-offset by zero results
1115 in reading from the exact same memory location, *even with a Vector
1116 register*. (Normally this type of behaviour is reserved for the
1117 mapreduce modes)
1118
1119 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
1120 just the once and be copied, rather than hitting the Data Cache
1121 multiple times with the same memory read at the same location.
1122 The benefit of Cache-inhibited LD-splats is that it allows
1123 for memory-mapped peripherals to have multiple
1124 data values read in quick succession and stored in sequentially
1125 numbered registers (but, see Note below).
1126
1127 For non-cache-inhibited ST from a vector source onto a scalar
1128 destination: with the Vector
1129 loop effectively creating multiple memory writes to the same location,
1130 we can deduce that the last of these will be the "successful" one. Thus,
1131 implementations are free and clear to optimise out the overwriting STs,
1132 leaving just the last one as the "winner". Bear in mind that predicate
1133 masks will skip some elements (in source non-zeroing mode).
1134 Cache-inhibited ST operations on the other hand **MUST** write out
1135 a Vector source multiple successive times to the exact same Scalar
1136 destination. Just like Cache-inhibited LDs, multiple values may be
1137 written out in quick succession to a memory-mapped peripheral from
1138 sequentially-numbered registers.
1139
1140 Note that any memory location may be Cache-inhibited
1141 (Power ISA v3.1, Book III, 1.6.1, p1033)
1142
1143 *Programmer's Note: an immediate also with a Scalar source as
1144 a "VSPLAT" mode is simply not possible: there are not enough
1145 Mode bits. One single Scalar Load operation may be used instead, followed
1146 by any arithmetic operation (including a simple mv) in "Splat"
1147 mode.*
1148
1149 **LD/ST Indexed**
1150
1151 The modes for `RA+RB` indexed version are slightly different
1152 but are the same `RM.MODE` bits (19:23 of `RM`):
1153
1154 | 0-1 | 2 | 3 4 | description |
1155 | --- | --- |---------|-------------------------- |
1156 | 00 | SEA | dz sz | simple mode |
1157 | 01 | SEA | dz sz | Strided (scalar only source) |
1158 | 10 | N | dz sz | sat mode: N=0/1 u/s |
1159 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1160 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
1161
1162 Vector Indexed Strided Mode is qualified as follows:
1163
1164 if mode = 0b01 and !RA.isvec and !RB.isvec:
1165 svctx.ldstmode = elementstride
1166
1167 A summary of the effect of Vectorisation of src or dest:
1168
1169 imm(RA) RT.v RA.v no stride allowed
1170 imm(RA) RT.s RA.v no stride allowed
1171 imm(RA) RT.v RA.s stride-select allowed
1172 imm(RA) RT.s RA.s not vectorised
1173 RA,RB RT.v {RA|RB}.v Standard Indexed
1174 RA,RB RT.s {RA|RB}.v Indexed but single LD (no VSPLAT)
1175 RA,RB RT.v {RA&RB}.s VSPLAT possible. stride selectable
1176 RA,RB RT.s {RA&RB}.s not vectorised (scalar identity)
1177
Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is
set, RB is sign-extended from elwidth bits to the full 64
bits before being added to RA in order to calculate the Effective
Address. For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
1185
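As an informal illustration of the SEA rule (a sketch only, not the normative pseudocode given later), the RB offset is read at the source element width and, when SEA is set, sign-extended to 64 bits before the addition with RA:

```
# Illustrative model of Signed Effective Address (SEA) for Vector Indexed mode.
def sext(value, bits):
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def indexed_ea(ra, rb_elem, src_elwidth, sea):
    offs = rb_elem & ((1 << src_elwidth) - 1)   # RB element at overridden width
    if sea:
        offs = sext(offs, src_elwidth)          # sign-extend to 64 bits
    return (ra + offs) & ((1 << 64) - 1)

print(hex(indexed_ea(0x1000, 0xFFFE, 16, sea=True)))   # 0xffe   (RA - 2)
print(hex(indexed_ea(0x1000, 0xFFFE, 16, sea=False)))  # 0x10ffe (RA + 0xFFFE)
```
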
Note that cache-inhibited LD/ST when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially. Even with a scalar src a
Cache-inhibited LD will read the same memory location *multiple times*, storing the result in successive Vector destination registers. This is because the cache-inhibit instructions are typically used to read and write memory-mapped peripherals.
1188 If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
1189 cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
1190 copying the one *scalar* value into multiple register destinations.
1191
Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This makes it possible, for example, to issue a massive batch of memory-mapped
peripheral reads, stopping at the first NUL (zero) byte and
truncating VL to that point. No branch is needed to issue that large burst
of LDs, which may be valuable in Embedded scenarios.
1197
1198 ## Vectorisation of Scalar Power ISA v3.0B
1199
1200 Scalar Power ISA Load/Store operations may be seen from their
1201 pseudocode to be of the form:
1202
    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)
1206
1207 and for immediate variants:
1208
    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)
1212
1213 Thus in the first example, the source registers may each be independently
1214 marked as scalar or vector, and likewise the destination; in the second
1215 example only the one source and one dest may be marked as scalar or
1216 vector.
1217
1218 Thus we can see that Vector Indexed may be covered, and, as demonstrated
1219 with the pseudocode below, the immediate can be used to give unit
1220 stride or element stride. With there being no way to tell which from
1221 the Power v3.0B Scalar opcode alone, the choice is provided instead by
1222 the SV Context.
1223
1224 ```
1225 # LD not VLD! format - ldop RT, immed(RA)
1226 # op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip non-predicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
1270 ```
1271
1272 Indexed LD is:
1273
1274 ```
1275 # format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
1297 ```
1298
1299 Note that Element-Strided uses the Destination Step because with both
1300 sources being Scalar as a prerequisite condition of activation of
1301 Element-Stride Mode, the source step (being Scalar) would never advance.
1302
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source. This is because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.
1304
1305 *Programmer's note: being able to set RA-as-a-source
1306 as separate from RA-as-a-destination as Scalar is **extremely valuable**
1307 once it is remembered that Simple-V element operations must
1308 be in Program Order, especially in loops, for saving on
1309 multiple address computations. Care does have
1310 to be taken however that RA-as-src is not overwritten by
1311 RA-as-dest unless intentionally desired, especially in element-strided Mode.*
1312
1313 ## LD/ST Indexed vs Indexed REMAP
1314
1315 Unfortunately the word "Indexed" is used twice in completely different
1316 contexts, potentially causing confusion.
1317
* Instructions such as `ld RT,RA,RB` have existed in the Power ISA since
  its creation: these are called "LD/ST Indexed" instructions and their
  name and meaning are well-established.
1321 * There now exists, in Simple-V, a REMAP mode called "Indexed"
1322 Mode that can be applied to *any* instruction **including those
1323 named LD/ST Indexed**.
1324
Whilst it may be costly in terms of register reads to allow REMAP
Indexed Mode to be applied to any Vectorised LD/ST Indexed operation such as
`sv.ld *RT,RA,*RB`, and whilst the combination may even be misleadingly
labelled as redundant, firstly the strict
application of the RISC Paradigm that Simple-V follows makes it awkward
to consider *preventing* the application of Indexed REMAP to such
operations, and secondly the two are not actually the same at all.
1332
1333 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
1334 effectively performs an *in-place* re-ordering of the offsets, RB.
1335 To achieve the same effect without Indexed REMAP would require taking
1336 a *copy* of the Vector of offsets starting at RB, manually explicitly
1337 reordering them, and finally using the copy of re-ordered offsets in
a non-REMAP'ed `sv.ld`. Using non-strided LD as an example,
the pseudocode below shows what actually occurs,
where the pseudocode for `indexed_remap` may be found in [[sv/remap]]:
1341
1342 ```
1343 # sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
1351 ```
1352
1353 Thus it can be seen that the use of Indexed REMAP saves copying
1354 and manual reordering of the Vector of RB offsets.
1355
1356 ## LD/ST ffirst
1357
1358 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
1362 1 and above, if an exception would occur, then VL is **truncated**
1363 to the previous element: the exception is **not** then raised because
1364 the LD/ST that would otherwise have caused an exception is *required*
1365 to be cancelled. Additionally an implementor may choose to truncate VL
1366 for any arbitrary reason *except for the very first*.
1367
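A minimal Python sketch of this Fail-First rule (illustrative only: `would_fault` stands in for whatever exception condition an implementation detects, and `do_ldst` for the actual memory operation):

```
# Illustrative model: element 0 faults exactly like a Scalar LD/ST; for
# elements 1 and above a would-be fault instead truncates VL and stops.
def ldst_ffirst(VL, would_fault, do_ldst):
    for i in range(VL):
        if would_fault(i):
            if i == 0:
                raise MemoryError("element 0 must fault like scalar LD/ST")
            return i        # VL truncated: elements 0..i-1 completed
        do_ldst(i)
    return VL               # no truncation

new_VL = ldst_ffirst(8, would_fault=lambda i: i == 5, do_ldst=lambda i: None)
print(new_VL)               # 5: elements 0..4 completed, element 5 cancelled
```
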
1368 ffirst LD/ST to multiple pages via a Vectorised Index base is
1369 considered a security risk due to the abuse of probing multiple
1370 pages in rapid succession and getting speculative feedback on which
1371 pages would fail. Therefore Vector Indexed LD/ST is prohibited
entirely, and the Mode bit is instead used for element-strided LD/ST.
1373 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
1374
1375 ```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
1378 ```
1379
1380 High security implementations where any kind of speculative probing
1381 of memory pages is considered a risk should take advantage of the fact that
1382 implementations may truncate VL at any point, without requiring software
1383 to be rewritten and made non-portable. Such implementations may choose
1384 to *always* set VL=1 which will have the effect of terminating any
1385 speculative probing (and also adversely affect performance), but will
1386 at least not require applications to be rewritten.
1387
Low-performance simpler hardware implementations may also
choose to always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
1391 the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
1392 **MUST** raise exceptions exactly like an ordinary LD/ST.
1393
For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any
implementation-specific reason. For example: it is perfectly reasonable for
implementations to alter VL when ffirst LD or ST operations are initiated on a
non-aligned boundary, such that within a loop the subsequent iteration of that
loop begins the following ffirst LD/ST operations on an aligned boundary
such as the beginning of a cache line, or the beginning of a Virtual Memory
page. Likewise, VL may be truncated to reduce workloads or to balance resources.
1397
1398 Vertical-First Mode is slightly strange in that only one element
1399 at a time is ever executed anyway. Given that programmers may
1400 legitimately choose to alter srcstep and dststep in non-sequential
1401 order as part of explicit loops, it is neither possible nor
1402 safe to make speculative assumptions about future LD/STs.
1403 Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
1404 This is very different from Arithmetic (Data-dependent) FFirst
1405 where Vertical-First Mode is fully deterministic, not speculative.
1406
1407 ## LOAD/STORE Elwidths <a name="elwidth"></a>
1408
1409 Loads and Stores are almost unique in that the Power Scalar ISA
1410 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
1411 others like it provide an explicit operation width. There are therefore
1412 *three* widths involved:
1413
1414 * operation width (lb=8, lh=16, lw=32, ld=64)
1415 * src element width override (8/16/32/default)
1416 * destination element width override (8/16/32/default)
1417
1418 Some care is therefore needed to express and make clear the transformations,
1419 which are expressly in this order:
1420
1421 * Calculate the Effective Address from RA at full width
1422 but (on Indexed Load) allow srcwidth overrides on RB
1423 * Load at the operation width (lb/lh/lw/ld) as usual
1424 * byte-reversal as usual
1425 * Non-saturated mode:
1426 - zero-extension or truncation from operation width to dest elwidth
1427 - place result in destination at dest elwidth
1428 * Saturated mode:
1429 - Sign-extension or truncation from operation width to dest width
1430 - signed/unsigned saturation down to dest elwidth
1431
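The ordering above may be illustrated with a small Python sketch of the final destination-elwidth stage (informal: the helper name and the unsigned-load assumption are for the example only, and do not replace the pseudocode given later):

```
# Illustrative model of the dest-elwidth stage for a load: the memory
# access itself has already been performed at the operation width.
def to_dest_elwidth(memread, dest_bits, saturated, signed_sat):
    if not saturated:
        # Non-saturated: zero-extend or truncate to dest elwidth
        return memread & ((1 << dest_bits) - 1)
    # Saturated: clamp into the destination range
    if signed_sat:
        lo, hi = -(1 << (dest_bits - 1)), (1 << (dest_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_bits) - 1
    return min(max(memread, lo), hi)

print(hex(to_dest_elwidth(0x1234, 8, saturated=False, signed_sat=False)))  # 0x34
print(hex(to_dest_elwidth(0x1234, 8, saturated=True,  signed_sat=False)))  # 0xff
```
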
1432 In order to respect Power v3.0B Scalar behaviour the memory side
1433 is treated effectively as completely separate and distinct from SV
1434 augmentation. This is primarily down to quirks surrounding LE/BE and
1435 byte-reversal.
1436
1437 It is rather unfortunately possible to request an elwidth override
1438 on the memory side which
1439 does not mesh with the overridden operation width: these result in
1440 `UNDEFINED`
1441 behaviour. The reason is that the effect of attempting a 64-bit `sv.ld`
1442 operation with a source elwidth override of 8/16/32 would result in
1443 overlapping memory requests, particularly on unit and element strided
1444 operations. Thus it is `UNDEFINED` when the elwidth is smaller than
1445 the memory operation width. Examples include `sv.lw/sw=16/els` which
1446 requests (overlapping) 4-byte memory reads offset from
1447 each other at 2-byte intervals. Store likewise is also `UNDEFINED`
1448 where the dest elwidth override is less than the operation width.
1449
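The `sv.lw/sw=16/els` example may be made concrete with a few lines of illustrative arithmetic (assuming, as stated above, 4-byte reads advancing at 2-byte intervals):

```
# Illustrative arithmetic only: sv.lw (4-byte reads) with a 16-bit source
# elwidth override in element-strided mode produces overlapping requests.
op_width = 4   # lw reads 4 bytes
stride   = 2   # sw=16 override: 2-byte intervals
for i in range(4):
    offs = i * stride
    print(f"element {i}: reads bytes {offs}..{offs + op_width - 1}")
# element 0: bytes 0..3, element 1: bytes 2..5, ... hence UNDEFINED
```
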
1450 Note the following regarding the pseudocode to follow:
1451
1452 * `scalar identity behaviour` SV Context parameter conditions turn this
1453 into a straight absolute fully-compliant Scalar v3.0B LD operation
1454 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
1455 rather than `ld`)
1456 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
1457 a "normal" part of Scalar v3.0B LD
1458 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
1459 as a "normal" part of Scalar v3.0B LD
1460 * `svctx` specifies the SV Context and includes VL as well as
1461 source and destination elwidth overrides.
1462
1463 Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in
1464 both Immediate and Indexed LD/ST,
1465 does not have element-width overriding applied to it.
1466
1467 Note that predication, predication-zeroing,
1468 and other modes except saturation have all been removed,
1469 for clarity and simplicity:
1470
1471 ```
1472 # LD not VLD!
1473 # this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1506 ```
1507
1508 Note above that the source elwidth is *not used at all* in LD-immediate.
1509
1510 For LD/Indexed, the key is that in the calculation of the Effective Address,
1511 RA has no elwidth override but RB does. Pseudocode below is simplified
1512 for clarity: predication and all modes except saturation are removed:
1513
1514 ```
1515 # LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1555 ```
1556
1557 # Remapped LD/ST
1558
1559 In the [[sv/remap]] page the concept of "Remapping" is described.
1560 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
1561 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
1562 elements worth of LDs or STs. The usual interest in such re-mapping
1563 is for example in separating out 24-bit RGB channel data into separate
1564 contiguous registers.
1565
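As a purely conceptual sketch (the actual REMAP algorithms are defined in [[sv/remap]], and the index function here is invented for the example), separating interleaved RGB bytes into three contiguous per-channel groups amounts to re-ordering element indices:

```
# Illustrative only: a 1D "structure packing" re-ordering of element
# indices, of the kind that REMAP can apply to a vector of byte LDs.
def rgb_split_indices(num_pixels):
    # interleaved memory layout: R0 G0 B0 R1 G1 B1 ...
    # desired register layout:   all R, then all G, then all B
    order = []
    for channel in range(3):
        for pixel in range(num_pixels):
            order.append(pixel * 3 + channel)
    return order

print(rgb_split_indices(4))  # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```
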
1566 REMAP easily covers this capability, and with dest
1567 elwidth overrides and saturation may do so with built-in conversion that
1568 would normally require additional width-extension, sign-extension and
1569 min/max Vectorised instructions as post-processing stages.
1570
1571 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
1572 because the generic abstracted concept of "Remapping", when applied to
1573 LD/ST, will give that same capability, with far more flexibility.
1574
1575 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
1576 established through `svstep`, are also an easy way to perform regular
1577 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond
1578 that, REMAP will need to be used.
1579
1580 # Condition Register SVP64 Operations
1581
1582 Condition Register Fields are only 4 bits wide: this presents some
1583 interesting conceptual challenges for SVP64, which was designed
1584 primarily for vectors of arithmetic and logical operations. However
1585 if predicates may be bits of CR Fields it makes sense to extend
1586 Simple-V to cover CR Operations, especially given that Vectorised Rc=1
may be processed by Vectorised CR Operations that usefully in turn
1588 may become Predicate Masks to yet more Vector operations, like so:
1589
1590 ```
sv.cmpi/ew=8 *B,*ra,0     # compare bytes against zero
sv.cmpi/ew=8 *B2,*ra,13   # and against newline
sv.cror PM.EQ,B.EQ,B2.EQ  # OR compares to create mask
sv.stb/sm=EQ ...          # store only nonzero/newline
1595 ```
1596
1597 Element width however is clearly meaningless for a 4-bit collation of
Conditions, LT GT EQ SO. Likewise, arithmetic saturation (an important
1599 part of Arithmetic SVP64) has no meaning. An alternative Mode Format is
1600 required, and given that elwidths are meaningless for CR Fields the bits
1601 in SVP64 `RM` may be used for other purposes.
1602
1603 This alternative mapping **only** applies to instructions that **only**
1604 reference a CR Field or CR bit as the sole exclusive result. This section
1605 **does not** apply to instructions which primarily produce arithmetic
1606 results that also, as an aside, produce a corresponding
1607 CR Field (such as when Rc=1).
1608 Instructions that involve Rc=1 are definitively arithmetic in nature,
1609 where the corresponding Condition Register Field can be considered to
be a "co-result". Such CR Field "co-result" arithmetic operations
1611 are firmly out of scope for
1612 this section, being covered fully by [[sv/normal]].
1613
* Examples of v3.0B instructions to which this section does
  apply are:
1616 - `mfcr` and `cmpi` (3 bit operands) and
1617 - `crnor` and `crand` (5 bit operands).
1618 * Examples to which this section does **not** apply include
1619 `fadds.` and `subf.` which both produce arithmetic results
1620 (and a CR Field co-result).
1621
1622 The CR Mode Format still applies to `sv.cmpi` because despite
1623 taking a GPR as input, the output from the Base Scalar v3.0B `cmpi`
1624 instruction is purely to a Condition Register Field.
1625
1626 Other modes are still applicable and include:
1627
1628 * **Data-dependent fail-first**.
1629 useful to truncate VL based on
1630 analysis of a Condition Register result bit.
1631 * **Reduction**.
1632 Reduction is useful
1633 for analysing a Vector of Condition Register Fields
1634 and reducing it to one
1635 single Condition Register Field.
1636
1637 Predicate-result does not make any sense because
1638 when Rc=1 a co-result is created (a CR Field). Testing the co-result
1639 allows the decision to be made to store or not store the main
1640 result, and for CR Ops the CR Field result *is*
1641 the main result.
1642
1643 ## Format
1644
1645 SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations:
1646
1647 |6 | 7 |19-20| 21 | 22 23 | description |
1648 |--|---|-----| --- |---------|----------------- |
1649 |/ | / |0 RG | 0 | dz sz | simple mode |
1650 |/ | / |0 RG | 1 | dz sz | scalar reduce mode (mapreduce) |
1651 |zz|SNZ|1 VLI| inv | CR-bit | Ffirst 3-bit mode |
1652 |/ |SNZ|1 VLI| inv | dz sz | Ffirst 5-bit mode (implies CR-bit from result) |
1653
1654 Fields:
1655
* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
1657 * **zz** set both sz and dz equal to this flag
1658 * **SNZ** In fail-first mode, on the bit being tested, when sz=1 and SNZ=1 a value "1" is put in place of "0".
1659 * **inv CR-bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1660 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
1661 than the normal 0..VL-1
1662 * **SVM** sets "subvector" reduce mode
1663 * **VLi** VL inclusive: in fail-first mode, the truncation of
1664 VL *includes* the current element at the failure point rather
1665 than excludes it from the count.
1666
1667 ## Data-dependent fail-first on CR operations
1668
1669 The principle of data-dependent fail-first is that if, during
1670 the course of sequentially evaluating an element's Condition Test,
1671 one such test is encountered which fails,
1672 then VL (Vector Length) is truncated (set) at that point. In the case
1673 of Arithmetic SVP64 Operations the Condition Register Field generated from
1674 Rc=1 is used as the basis for the truncation decision.
1675 However with CR-based operations that CR Field result to be
1676 tested is provided
1677 *by the operation itself*.
1678
1679 Data-dependent SVP64 Vectorised Operations involving the creation or
1680 modification of a CR can require an extra two bits, which are not available
1681 in the compact space of the SVP64 RM `MODE` Field. With the concept of element
1682 width overrides being meaningless for CR Fields it is possible to use the
1683 `ELWIDTH` field for alternative purposes.
1684
1685 Condition Register based operations such as `sv.mfcr` and `sv.crand` can thus
1686 be made more flexible. However the rules that apply in this section
1687 also apply to future CR-based instructions.
1688
1689 There are two primary different types of CR operations:
1690
1691 * Those which have a 3-bit operand field (referring to a CR Field)
1692 * Those which have a 5-bit operand (referring to a bit within the
1693 whole 32-bit CR)
1694
Examining these two types it is observed that the key difference
is that the 5-bit variant *already* provides the
prerequisite information about which CR Field bit (LT, GT, EQ, SO) is to
be operated on by the instruction.
1700 Thus, logically, we may set the following rule:
1701
1702 * When a 5-bit CR Result field is used in an instruction, the
1703 5-bit variant of Data-Dependent Fail-First
1704 must be used. i.e. the bit of the CR field to be tested is
1705 the one that has just been modified (created) by the operation.
1706 * When a 3-bit CR Result field is used the 3-bit variant
1707 must be used, providing as it does the missing `CRbit` field
1708 in order to select which CR Field bit of the result shall
be tested (LT, GT, EQ, SO).
1710
1711 The reason why the 3-bit CR variant needs the additional CR-bit
1712 field should be obvious from the fact that the 3-bit CR Field
1713 from the base Power ISA v3.0B operation clearly does not contain
1714 and is missing the two CR Field Selector bits. Thus, these two
bits (to select LT, GT, EQ or SO) must be provided in another
1716 way.
1717
Examples of the two types:
1719
* crand, cror, crnor. These are all 5-bit (BA, BB, BT). The bit
1721 to be tested against `inv` is the one selected by `BT`
1722 * mcrf. This has only 3-bit (BF, BFA). In order to select the
1723 bit to be tested, the alternative encoding must be used.
1724 With `CRbit` coming from the SVP64 RM bits 22-23 the bit
1725 of BF to be tested is identified.
1726
1727 Just as with SVP64 [[sv/branches]] there is the option to truncate
1728 VL to include the element being tested (`VLi=1`) and to exclude it
1729 (`VLi=0`).
1730
1731 Also exactly as with [[sv/normal]] fail-first, VL cannot, unlike
1732 [[sv/ldst]], be set to an arbitrary value. Deterministic behaviour
1733 is *required*.
1734
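An illustrative Python sketch of the CR-based Data-Dependent Fail-First rule (informal: `tested_bits` stands for the selected CR Field bit of each element's result, and the function is not the normative pseudocode):

```
# Illustrative model: walk the Vector of CR Field results, truncating
# VL at the first element whose tested bit fails against `inv`.
def cr_ffirst(VL, tested_bits, inv, VLi):
    for i in range(VL):
        success = tested_bits[i] != inv      # inv=0: bit must be set to pass
        if not success:
            return i + 1 if VLi else i       # inclusive / exclusive truncation
    return VL

bits = [1, 1, 0, 1]
print(cr_ffirst(4, bits, inv=0, VLi=0))      # 2: element 2 fails, excluded
print(cr_ffirst(4, bits, inv=0, VLi=1))      # 3: element 2 fails, included
```
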
1735 ## Reduction and Iteration
1736
Bearing in mind that, as described in the SVP64 Appendix, SVP64 Horizontal
Reduction is a deterministic schedule on top of base Scalar v3.0 operations,
the same rules apply to CR Operations, i.e. programmers must
follow certain conventions in order for an *end result* of a
reduction to be achieved. Unlike
1742 other Vector ISAs *there are no explicit reduction opcodes*
1743 in SVP64: Schedules however achieve the same effect.
1744
Due to these conventions only reduction on operations such as `crand`
and `cror` is meaningful because these have Condition Register Fields
1747 as both input and output.
1748 Meaningless operations are not prohibited because the cost in hardware
1749 of doing so is prohibitive, but neither are they `UNDEFINED`. Implementations
1750 are still required to execute them but are at liberty to optimise out
1751 any operations that would ultimately be overwritten, as long as Strict
Program Order is still observable by the programmer.
1753
1754 Also bear in mind that 'Reverse Gear' may be enabled, which can be
1755 used in combination with overlapping CR operations to iteratively accumulate
1756 results. Issuing a `sv.crand` operation for example with `BA`
1757 differing from `BB` by one Condition Register Field would
result in a cascade effect, where the first zero encountered in a
CR Field would clear the result, and consequently all subsequent CR Field
elements thereafter:
1761
1762 ```
1763 # sv.crand/mr/rg CR4.ge.v, CR5.ge.v, CR4.ge.v
for i in VL-1 downto 0 # reverse gear
    CR.field[4+i].ge &= CR.field[5+i].ge
1766 ```
1767
1768 `sv.crxor` with reduction would be particularly useful for parity calculation
1769 for example, although there are many ways in which the same calculation
1770 could be carried out after transferring a vector of CR Fields to a GPR
1771 using crweird operations.
1772
1773 Implementations are free and clear to optimise these reductions in any
1774 way they see fit, as long as the end-result is compatible with Strict Program
1775 Order being observed, and Interrupt latency is not adversely impacted.
1776
1777 ## Unusual and quirky CR operations
1778
1779 **cmp and other compare ops**
1780
1781 `cmp` and `cmpi` etc take GPRs as sources and create a CR Field as a result.
1782
    cmpli BF,L,RA,UI
    cmpeqb BF,RA,RB
1785
1786 With `ELWIDTH` applying to the source GPR operands this is perfectly fine.
1787
1788 **crweird operations**
1789
1790 There are 4 weird CR-GPR operations and one reasonable one in
1791 the [[cr_int_predication]] set:
1792
1793 * crrweird
1794 * mtcrweird
1795 * crweirder
1796 * crweird
1797 * mcrfm - reasonably normal and referring to CR Fields for src and dest.
1798
1799 The "weird" operations have a non-standard behaviour, being able to
1800 treat *individual bits* of a GPR effectively as elements. They are
1801 expected to be Micro-coded by most Hardware implementations.
1802
1803
# SVP64 Branch Conditional behaviour
1805
1806 Please note: although similar, SVP64 Branch instructions should be
1807 considered completely separate and distinct from
1808 standard scalar OpenPOWER-approved v3.0B branches.
1809 **v3.0B branches are in no way impacted, altered,
1810 changed or modified in any way, shape or form by
1811 the SVP64 Vectorised Variants**.
1812
1813 It is also
1814 extremely important to note that Branches are the
1815 sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`.
1816 SVP64 Branches contain additional modes that are useful
1817 for scalar operations (i.e. even when VL=1 or when
1818 using single-bit predication).
1819
1820 **Rationale**
1821
1822 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
1823 Condition Register. However for parallel processing it is simply impossible
1824 to perform multiple independent branches: the Program Counter simply
1825 cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a decision for a *single* branch,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.
1830
1831 In 3D Shader
1832 binaries, which are inherently parallelised and predicated, testing all or
1833 some results and branching based on multiple tests is extremely common,
1834 and a fundamental part of Shader Compilers. Example:
1835 without such multi-condition
1836 test-and-branch, if a predicate mask is all zeros a large batch of
1837 instructions may be masked out to `nop`, and it would waste
1838 CPU cycles to run them. 3D GPU ISAs can test for this scenario
1839 and, with the appropriate predicate-analysis instruction,
1840 jump over fully-masked-out operations, by spotting that
1841 *all* Conditions are false.
1842
1843 Unless Branches are aware and capable of such analysis, additional
1844 instructions would be required which perform Horizontal Cumulative
1845 analysis of Vectorised Condition Register Fields, in order to
1846 reduce the Vector of CR Fields down to one single yes or no
1847 decision that a Scalar-only v3.0B Branch-Conditional could cope with.
1848 Such instructions would be unavoidable, required, and costly
1849 by comparison to a single Vector-aware Branch.
1850 Therefore, in order to be commercially competitive, `sv.bc` and
1851 other Vector-aware Branch Conditional instructions are a high priority
1852 for 3D GPU (and OpenCL-style) workloads.
1853
1854 Given that Power ISA v3.0B is already quite powerful, particularly
1855 the Condition Registers and their interaction with Branches, there
1856 are opportunities to create extremely flexible and compact
1857 Vectorised Branch behaviour. In addition, the side-effects (updating
1858 of CTR, truncation of VL, described below) make it a useful instruction
1859 even if the branch points to the next instruction (no actual branch).
1860
1861 ## Overview
1862
1863 When considering an "array" of branch-tests, there are four
1864 primarily-useful modes:
1865 AND, OR, NAND and NOR of all Conditions.
1866 NAND and NOR may be synthesised from AND and OR by
1867 inverting `BO[1]` which just leaves two modes:
1868
1869 * Branch takes place on the **first** CR Field test to succeed
1870 (a Great Big OR of all condition tests). Exit occurs
1871 on the first **successful** test.
1872 * Branch takes place only if **all** CR field tests succeed:
1873 a Great Big AND of all condition tests. Exit occurs
1874 on the first **failed** test.
1875
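Conceptually (an informal Python analogy rather than the normative pseudocode given later), the two modes behave like short-circuiting `all()` and `any()` over the per-element condition tests:

```
# Illustrative analogy: ALL mode is a short-circuit AND over the element
# tests; non-ALL ("ANY") mode is a short-circuit OR.
def branch_taken(tests, ALL):
    if ALL:
        return all(tests)   # exits at the first failed test
    return any(tests)       # exits at the first successful test

print(branch_taken(iter([True, True, False, True]), ALL=True))   # False
print(branch_taken(iter([False, False, True, True]), ALL=False)) # True
```
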
1876 Early-exit is enacted such that the Vectorised Branch does not
1877 perform needless extra tests, which will help reduce reads on
1878 the Condition Register file.
1879
1880 *Note: Early-exit is **MANDATORY** (required) behaviour.
1881 Branches **MUST** exit at the first sequentially-encountered
1882 failure point, for
1883 exactly the same reasons for which it is mandatory in
1884 programming languages doing early-exit: to avoid
1885 damaging side-effects and to provide deterministic
1886 behaviour. Speculative testing of Condition
1887 Register Fields is permitted, as is speculative calculation
1888 of CTR, as long as, as usual in any Out-of-Order microarchitecture,
1889 that speculative testing is cancelled should an early-exit occur.
1890 i.e. the speculation must be "precise": Program Order must be preserved*
1891
1892 Also note that when early-exit occurs in Horizontal-first Mode,
1893 srcstep, dststep etc. are all reset, ready to begin looping from the
1894 beginning for the next instruction. However for Vertical-first
1895 Mode srcstep etc. are incremented "as usual" i.e. an early-exit
1896 has no special impact, regardless of whether the branch
1897 occurred or not. This can leave srcstep etc. in what may be
1898 considered an unusual
1899 state on exit from a loop and it is up to the programmer to
1900 reset srcstep, dststep etc. to known-good values
1901 *(easily achieved with `setvl`)*.
1902
1903 Additional useful behaviour involves two primary Modes (both of
1904 which may be enabled and combined):
1905
1906 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
1907 for Arithmetic SVP64 operations, with more
1908 flexibility and a close interaction and integration into the
1909 underlying base Scalar v3.0B Branch instruction.
1910 Truncation of VL takes place around the early-exit point.
1911 * **CTR-test Mode**: gives much more flexibility over when and why
1912 CTR is decremented, including options to decrement if a Condition
1913 test succeeds *or if it fails*.
1914
1915 With these side-effects, basic Boolean Logic Analysis advises that
1916 it is important to provide a means
1917 to enact them each based on whether testing succeeds *or fails*. This
1918 results in a not-insignificant number of additional Mode Augmentation bits,
1919 accompanying VLSET and CTR-test Modes respectively.
1920
1921 Predicate skipping or zeroing may, as usual with SVP64, be controlled
1922 by `sz`.
1923 Where the predicate is masked out and
1924 zeroing is enabled, then in such circumstances
1925 the same Boolean Logic Analysis dictates that
1926 rather than testing only against zero, the option to test
1927 against one is also prudent. This introduces a new
1928 immediate field, `SNZ`, which works in conjunction with
1929 `sz`.
1930
1931
1932 Vectorised Branches can be used
1933 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
1934 at an element level, the behaviour is identical in both Modes,
1935 although the `ALL` bit is meaningless in Vertical-First Mode.
1936
1937 It is also important
1938 to bear in mind that, fundamentally, Vectorised Branch-Conditional
1939 is still extremely close to the Scalar v3.0B Branch-Conditional
1940 instructions, and that the same v3.0B Scalar Branch-Conditional
1941 instructions are still
1942 *completely separate and independent*, being unaltered and
1943 unaffected by their SVP64 variants in every conceivable way.
1944
*Programming note: One important point is that SVP64 instructions are 64-bit
(8 bytes not 4). This needs to be taken into consideration when computing
1947 branch offsets: the offset is relative to the start of the instruction,
1948 which **includes** the SVP64 Prefix*
1949
1950 ## Format and fields
1951
1952 With element-width overrides being meaningless for Condition
1953 Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
1954 Mode bits.
1955
1956 SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5,
1957 and `ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch
1958 Conditional:
1959
1960 | 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
1961 | - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
1962 |ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | simple mode |
1963 |ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode |
1964 |ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode |
1965 |ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
1966
1967 Brief description of fields:
1968
1969 * **sz=1** if predication is enabled and `sz=1` and a predicate
1970 element bit is zero, `SNZ` will
1971 be substituted in place of the CR bit selected by `BI`,
1972 as the Condition tested.
1973 Contrast this with
1974 normal SVP64 `sz=1` behaviour, where *only* a zero is put in
1975 place of masked-out predicate bits.
1976 * **sz=0** When `sz=0` skipping occurs as usual on
1977 masked-out elements, but unlike all
1978 other SVP64 behaviour which entirely skips an element with
1979 no related side-effects at all, there are certain
1980 special circumstances where CTR
1981 may be decremented. See CTR-test Mode, below.
1982 * **ALL** when set, all branch conditional tests must pass in order for
1983 the branch to succeed. When clear, it is the first sequentially
1984 encountered successful test that causes the branch to succeed.
1985 This is identical behaviour to how programming languages perform
1986 early-exit on Boolean Logic chains.
1987 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
1988 In VLSET mode, VL *may* (depending on `VSb`) be truncated.
1989 If VLI (Vector Length Inclusive) is clear,
1990 VL is truncated to *exclude* the current element, otherwise it is
1991 included. SVSTATE.MVL is not altered: only VL.
1992 * **SL** identical to `LR` except applicable to SVSTATE. If `SL`
1993 is set, SVSTATE is transferred to SVLR (conditionally on
1994 whether `SLu` is set).
1995 * **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
1996 * **LRu**: Link Register Update, used in conjunction with LK=1
1997 to make LR update conditional
1998 * **VSb** In VLSET Mode, after testing,
1999 if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
2000 VL is truncated if a test *fails*. Masked-out (skipped)
2001 bits are not considered
2002 part of testing when `sz=0`
2003 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
2004 tested. CTR inversion decrements if a test *fails*. Only relevant
2005 in CTR-test Mode.
2006
2007 LRu and CTR-test modes are where SVP64 Branches subtly differ from
2008 Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
`sv.bcl/lru` will only update LR if the Branch Condition fails.
2010
2011 Of special interest is that when using ALL Mode (Great Big AND
2012 of all Condition Tests), if `VL=0`,
2013 which is rare but can occur in Data-Dependent Modes, the Branch
2014 will always take place because there will be no failing Condition
2015 Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
2016 of all Condition Tests) and `VL=0` the Branch is guaranteed not
2017 to occur because there will be no *successful* Condition Tests
2018 to make it happen.
2019
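This mirrors the usual vacuous-truth convention (informally, in Python terms):

```
# VL=0 gives an empty set of Condition Tests:
print(all([]))   # True  -> ALL mode: the Branch is always taken
print(any([]))   # False -> non-ALL mode: the Branch is never taken
```
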
2020 ## Vectorised CR Field numbering, and Scalar behaviour
2021
2022 It is important to keep in mind that just like all SVP64 instructions,
2023 the `BI` field of the base v3.0B Branch Conditional instruction
2024 may be extended by SVP64 EXTRA augmentation, as well as be marked
2025 as either Scalar or Vector. It is also crucially important to keep in mind
2026 that for CRs, SVP64 sequentially increments the CR *Field* numbers.
2027 CR *Fields* are treated as elements, not bit-numbers of the CR *register*.
2028
The `BI` operand of Branch Conditional operations is five bits: in scalar
v3.0B this selects one bit of the 32-bit CR,
comprising eight CR Fields of 4 bits each. In SVP64 there are
sixteen 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT GT EQ SO), and the top 3 bits
2034 are extended to either scalar or vector and to select CR Fields 0..127
2035 as specified in SVP64 [[sv/svp64/appendix]].
2036
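Informally, the decode may be sketched as follows (the `augment` helper here is a hypothetical, simplified stand-in for the EXTRA augmentation rules defined in the appendix):

```
# Illustrative decode of the 5-bit BI operand under SVP64: the 2 LSBs
# select the bit within the CR Field, the top 3 bits (plus EXTRA) select
# one of CR Fields 0..127.
def augment(base_field, extra):
    # hypothetical stand-in: simply widens the 3-bit field number
    return (extra << 3) | base_field

def decode_BI(BI, extra):
    bit_in_field = BI & 0b11             # LT, GT, EQ or SO within the Field
    cr_field = augment(BI >> 2, extra)   # CR Field number 0..127
    return cr_field, bit_in_field

print(decode_BI(0b00110, extra=0))       # (1, 2): CR Field 1, EQ bit
```
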
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply:
2039 the Vector loop ends at the first element tested
2040 (the first CR *Field*), after taking
2041 predication into consideration. Thus, also as usual, when a predicate mask is
2042 given, and `BI` marked as scalar, and `sz` is zero, srcstep
2043 skips forward to the first non-zero predicated element, and only that
2044 one element is tested.
2045
2046 In other words, the fact that this is a Branch
2047 Operation (instead of an arithmetic one) does not result, ultimately,
2048 in significant changes as to
2049 how SVP64 is fundamentally applied, except with respect to:
2050
2051 * the unique properties associated with conditionally
2052 changing the Program
2053 Counter (aka "a Branch"), resulting in early-out
2054 opportunities
2055 * CTR-testing
2056
2057 Both are outlined below, in later sections.
2058
2059 ## Horizontal-First and Vertical-First Modes
2060
2061 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
2062 AND) results in early exit: no more updates to CTR occur (if requested);
2063 no branch occurs, and LR is not updated (if requested). Likewise for
2064 non-ALL mode (Great Big Or) on first success early exit also occurs,
2065 however this time with the Branch proceeding. In both cases the testing
2066 of the Vector of CRs should be done in linear sequential order (or in
2067 REMAP re-sequenced order): such that tests that are sequentially beyond
2068 the exit point are *not* carried out. (*Note: it is standard practice in
2069 Programming languages to exit early from conditional tests, however
2070 a little unusual to consider in an ISA that is designed for Parallel
2071 Vector Processing. The reason is to have strictly-defined guaranteed
2072 behaviour*)
2073
2074 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
2075 behaviour. Given that only one element is being tested at a time
2076 in Vertical-First Mode, a test designed to be done on multiple
2077 bits is meaningless.
2078
2079 ## Description and Modes
2080
2081 Predication in both INT and CR modes may be applied to `sv.bc` and other
2082 SVP64 Branch Conditional operations, exactly as they may be applied to
2083 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
2084 operations are not included in condition testing, exactly like all other
2085 SVP64 operations, *including* side-effects such as potentially updating
2086 LR or CTR, which will also be skipped. There is *one* exception here,
2087 which is when
2088 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
2089 predicate mask bit is also zero:
2090 under these special circumstances CTR will also decrement.
2091
2092 When `sz` is non-zero, this normally requests insertion of a zero
2093 in place of the input data, when the relevant predicate mask bit is zero.
2094 This would mean that a zero is inserted in place of `CR[BI+32]` for
2095 testing against `BO`, which may not be desirable in all circumstances.
2096 Therefore, an extra field is provided `SNZ`, which, if set, will insert
2097 a **one** in place of a masked-out element, instead of a zero.
2098
2099 (*Note: Both options are provided because it is useful to deliberately
2100 cause the Branch-Conditional Vector testing to fail at a specific point,
2101 controlled by the Predicate mask. This is particularly useful in `VLSET`
2102 mode, which will truncate SVSTATE.VL at the point of the first failed
2103 test.*)
2104
Normally, CTR mode will decrement once per Condition Test, with the
result that, under normal circumstances, CTR reduces by up to VL in Horizontal-First
Mode. Just as v3.0B Branch-Conditional saves at
2108 least one instruction on tight inner loops through auto-decrementation
2109 of CTR, likewise it is also possible to save instruction count for
2110 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
2111 in circumstances where there is conditional interaction between the
2112 element computation and testing, and the continuation (or otherwise)
2113 of a given loop. The potential combinations of interactions is why CTR
2114 testing options have been added.
2115
2116 Also, the unconditional bit `BO[0]` is still relevant when Predication
2117 is applied to the Branch because in `ALL` mode all nonmasked bits have
2118 to be tested, and when `sz=0` skipping occurs.
2119 Even when VLSET mode is not used, CTR
2120 may still be decremented by the total number of nonmasked elements,
2121 acting in effect as either a popcount or cntlz depending on which
2122 mode bits are set.
2123 In short, Vectorised Branch becomes an extremely powerful tool.
2124
2125 **Micro-Architectural Implementation Note**: *when implemented on
2126 top of a Multi-Issue Out-of-Order Engine it is possible to pass
2127 a copy of the predicate and the prerequisite CR Fields to all
2128 Branch Units, as well as the current value of CTR at the time of
2129 multi-issue, and for each Branch Unit to compute how many times
2130 CTR would be subtracted, in a fully-deterministic and parallel
2131 fashion. A SIMD-based Branch Unit, receiving and processing
2132 multiple CR Fields covered by multiple predicate bits, would
2133 do the exact same thing. Obviously, however, if CTR is modified
2134 within any given loop (mtctr) the behaviour of CTR is no longer
2135 deterministic.*
2136
2137 ### Link Register Update
2138
2139 For a Scalar Branch, unconditional updating of the Link Register
2140 LR is useful and practical. However, if a loop of CR Fields is
2141 tested, unconditional updating of LR becomes problematic.
2142
For example when using `bclrl` with `LRu=0,LK=1` in Horizontal-First Mode,
2144 LR's value will be unconditionally overwritten after the first element,
2145 such that for execution (testing) of the second element, LR
2146 has the value `CIA+8`. This is covered in the `bclrl` example, in
2147 a later section.
2148
2149 The addition of a LRu bit modifies behaviour in conjunction
2150 with LK, as follows:
2151
2152 * `sv.bc` When LRu=0,LK=0, Link Register is not updated
2153 * `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
2154 * `sv.bcl/lru` When LRu=1,LK=1, Link Register will
2155 only be updated if the Branch Condition fails.
2156 * `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
2157 the Branch Condition succeeds.
2158
2159 This avoids
2160 destruction of LR during loops (particularly Vertical-First
2161 ones).
2162
2163 **SVLR and SVSTATE**
2164
2165 For precisely the reasons why `LK=1` was added originally to the Power
2166 ISA, with SVSTATE being a peer of the Program Counter it becomes
2167 necessary to also add an SVLR (SVSTATE Link Register)
2168 and corresponding control bits `SL` and `SLu`.
2169
2170 ### CTR-test
2171
2172 Where a standard Scalar v3.0B branch unconditionally decrements
2173 CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
2174 which allows CTR to be used for many more types of Vector loops
2175 constructs.
2176
2177 CTR-test mode and CTi interaction is as follows: note that
2178 `BO[2]` is still required to be clear for CTR decrements to be
2179 considered, exactly as is the case in Scalar Power ISA v3.0B
2180
2181 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
2182 if `BO[2]` is zero. Masked-out elements when `sz=0` are
2183 skipped (i.e. CTR is *not* decremented when the predicate
2184 bit is zero and `sz=0`).
2185 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
2186 if `BO[2]` is zero and a masked-out element is skipped
2187 (`sz=0` and predicate bit is zero). This one special case is the
2188 **opposite** of other combinations, as well as being
completely different from normal SVP64 `sz=0` behaviour.
2190 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
2191 if `BO[2]` is zero and the Condition Test succeeds.
2192 Masked-out elements when `sz=0` are skipped (including
2193 not decrementing CTR)
2194 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
2195 if `BO[2]` is zero and the Condition Test *fails*.
2196 Masked-out elements when `sz=0` are skipped (including
2197 not decrementing CTR)
2198
2199 `CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
2200 only time in the entirety of SVP64 that has side-effects when
2201 a predicate mask bit is clear. **All** other SVP64 operations
2202 entirely skip an element when sz=0 and a predicate mask bit is zero.
2203 It is also critical to emphasise that in this unusual mode,
2204 no other side-effects occur: **only** CTR is decremented, i.e. the
2205 rest of the Branch operation is skipped.
2206
2207 ### VLSET Mode
2208
2209 VLSET Mode truncates the Vector Length so that subsequent instructions
2210 operate on a reduced Vector Length. This is similar to
2211 Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
2212 truncation occurs at the Branch decision-point.
2213
2214 Interestingly, due to the side-effects of `VLSET` mode
2215 it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
2217 after the branch. Truncation of VL would thus conditionally occur yet control
2218 flow alteration would not.
2219
2220 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
2221 is designed to be used for explicit looping, where an explicit call to
2222 `svstep` is required to move both srcstep and dststep on to
2223 the next element, until VL (or other condition) is reached.
2224 Vertical-First Looping is expected (required) to terminate if the end
2225 of the Vector, VL, is reached. If however that loop is terminated early
2226 because VL is truncated, VLSET with Vertical-First becomes meaningless.
2227 Resolving this would require two branches: one Conditional, the other
2228 branching unconditionally to create the loop, where the Conditional
2229 one jumps over it.
2230
2231 Therefore, with `VSb`, the option to decide whether truncation should occur if the
2232 branch succeeds *or* if the branch condition fails allows for the flexibility
2233 required. This allows a Vertical-First Branch to *either* be used as
2234 a branch-back (loop) *or* as part of a conditional exit or function
2235 call from *inside* a loop, and for VLSET to be integrated into both
2236 types of decision-making.
2237
2238 In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
2239 place if success conditions are met, but on exit from that loop
2240 (branch condition fails), VL will be truncated. This is extremely
2241 useful.
2242
2243 `VLSET` mode with Horizontal-First when `VSb=0` is still
2244 useful, because it can be used to truncate VL to the first predicated
2245 (non-masked-out) element.
2246
2247 The truncation point for VL, when VLi is clear, must not include skipped
2248 elements that preceded the current element being tested.
2249 Example: `sz=0, VLi=0, predicate mask = 0b110010` and the Condition
2250 Register failure point is at CR Field element 4.
2251
2252 * Testing at element 0 is skipped because its predicate bit is zero
2253 * Testing at element 1 passed
2254 * Testing elements 2 and 3 are skipped because their
2255 respective predicate mask bits are zero
2256 * Testing element 4 fails therefore VL is truncated to **2**
2257 not 4 due to elements 2 and 3 being skipped.
2258
2259 If `sz=1` in the above example *then* VL would have been set to 4 because
2260 in non-zeroing mode the zero'd elements are still effectively part of the
2261 Vector (with their respective elements set to `SNZ`)
2262
2263 If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
2264 of the element actually being tested.
2265
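The worked example above can be checked with a few lines of illustrative Python (a sketch only; element 0 is the least significant bit of the predicate mask):

```
# Illustrative model of the VL truncation point: mask = 0b110010
# (elements 1, 4 and 5 active), with the test failing at element 4.
def vlset_truncate(mask_bits, fail_at, sz, VLi):
    last_ok = -1                        # highest element index kept inside VL
    for i, active in enumerate(mask_bits):
        if not active and not sz:
            continue                    # sz=0: skipped, never counted
        if i == fail_at:
            return fail_at + 1 if VLi else last_ok + 1
        last_ok = i                     # element tested (assumed to pass here)
    return len(mask_bits)

mask = [0, 1, 0, 0, 1, 1]               # element 0 first
print(vlset_truncate(mask, fail_at=4, sz=0, VLi=0))  # 2
print(vlset_truncate(mask, fail_at=4, sz=1, VLi=0))  # 4
print(vlset_truncate(mask, fail_at=4, sz=0, VLi=1))  # 5
```
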
2266 ### VLSET and CTR-test combined
2267
2268 If both CTR-test and VLSET Modes are requested, it's important to
2269 observe the correct order. What occurs depends on whether VLi
2270 is enabled, because VLi affects the length, VL.
2271
2272 If VLi (VL truncate inclusive) is set:
2273
2274 1. compute the test including whether CTR triggers
2275 2. (optionally) decrement CTR
2276 3. (optionally) truncate VL (VSb inverts the decision)
2277 4. decide (based on step 1) whether to terminate looping
2278 (including not executing step 5)
2279 5. decide whether to branch.
2280
2281 If VLi is clear, then when a test fails that element
2282 and any following it
2283 should **not** be considered part of the Vector. Consequently:
2284
2285 1. compute the branch test including whether CTR triggers
2286 2. if the test fails against VSb, truncate VL to the *previous*
2287 element, and terminate looping. No further steps executed.
2288 3. (optionally) decrement CTR
2289 4. decide whether to branch.
2290
2291 ## Boolean Logic combinations
2292
2293 In a Scalar ISA, Branch-Conditional testing even of vector
2294 results may be performed through inversion of tests. NOR of
2295 all tests may be performed by inversion of the scalar condition
2296 and branching *out* from the scalar loop around elements,
2297 using scalar operations.
2298
2299 In a parallel (Vector) ISA it is the ISA itself which must perform
2300 the prerequisite logic manipulation.
Thus for SVP64 there is an extraordinary number of necessary combinations
2302 which provide completely different and useful behaviour.
2303 Available options to combine:
2304
2305 * `BO[0]` to make an unconditional branch would seem irrelevant if
2306 it were not for predication and for side-effects (CTR Mode
2307 for example)
2308 * Enabling CTR-test Mode and setting `BO[2]` can still result in the
2309 Branch
2310 taking place, not because the Condition Test itself failed, but
2311 because CTR reached zero **because**, as required by CTR-test mode,
2312 CTR was decremented as a **result** of Condition Tests failing.
2313 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
2314 * `R30` and `~R30` and other predicate mask options including CR and
2315 inverted CR bit testing
2316 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
2317 predicate bits
2318 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
2319 `OR` of all tests, respectively.
2320 * Predicate Mask bits, which combine in effect with the CR being
2321 tested.
2322 * Inversion of Predicate Masks (`~r3` instead of `r3`, or using
2323 `NE` rather than `EQ`) which results in an additional
2324 level of possible ANDing, ORing etc. that would otherwise
2325 need explicit instructions.
2326
2327 The most obviously useful combinations here are to set `BO[1]` to zero
2328 in order to turn `ALL` into Great-Big-NAND and `ANY` into
2329 Great-Big-NOR. Other Mode bits which perform behavioural inversion then
2330 have to work round the fact that the Condition Testing is NOR or NAND.
2331 The alternative to not having additional behavioural inversion
2332 (`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
2333 branch directly after the first, which the first branch jumps over.
2334 This contrivance is avoided by the behavioural inversion bits.
2335
2336 ## Pseudocode and examples
2337
2338 Please see the SVP64 appendix regarding CR bit ordering and for
2339 the definition of `CR{n}`
2340
2341 For comparative purposes this is a copy of the v3.0B `bc` pseudocode
2342
2343 ```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
2353 ```
2354
2355 Simplified pseudocode including LRu and CTR skipping, which illustrates
2356 clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
2357 v3.0B Scalar Branches. The key areas where differences occur are
2358 the inclusion of predication (which can still be used when VL=1), in
2359 when and why CTR is decremented (CTRtest Mode) and whether LR is
2360 updated (which is unconditional in v3.0B when LK=1, and conditional
2361 in SVP64 when LRu=1).
2362
2363 Inline comments highlight the fact that the Scalar Branch behaviour
2364 and pseudocode is still clearly visible and embedded within the
2365 Vectorised variant:
2366
2367 ```
2368 if (mode_is_64bit) then M <- 0
2369 else M <- 32
2370 # the bit of CR to test, if the predicate bit is zero,
2371 # is overridden
2372 testbit = CR[BI+32]
2373 if ¬predicate_bit then testbit = SVRMmode.SNZ
2374 # otherwise apart from the override ctr_ok and cond_ok
2375 # are exactly the same
2376 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2377 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2378 if ¬predicate_bit & ¬SVRMmode.sz then
2379 # this is entirely new: CTR-test mode still decrements CTR
2380 # even when predicate-bits are zero
2381 if ¬BO[2] & CTRtest & ¬CTi then
2382 CTR = CTR - 1
2383 # instruction finishes here
2384 else
2385 # usual BO[2] CTR-mode now under CTR-test mode as well
2386 if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
2387 # new VLset mode, conditional test truncates VL
2388 if VLSET and VSb = (cond_ok & ctr_ok) then
2389 if SVRMmode.VLI then SVSTATE.VL = srcstep+1
2390 else SVSTATE.VL = srcstep
2391 # usual LR is now conditional, but also joined by SVLR
2392 lr_ok <- LK
2393 svlr_ok <- SVRMmode.SL
2394 if ctr_ok & cond_ok then
2395 if AA then NIA <-iea EXTS(BD || 0b00)
2396 else NIA <-iea CIA + EXTS(BD || 0b00)
2397 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2398 if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
2399 if lr_ok then LR <-iea CIA + 4
2400 if svlr_ok then SVLR <- SVSTATE
2401 ```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but functionally identical to the above. The lack of
obviousness is down to the early-exit opportunities.
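
A tiny Python sketch of the early-exit rule used in that loop: in
`ALL` (AND-reduction) mode the first failing element already decides
the result, and in `ANY` (OR-reduction) mode the first passing
element does, so the loop may stop early without changing the
outcome.

```
def reduce_with_early_exit(element_tests, ALL):
    cond_ok = ALL          # identity element: True for AND, False for OR
    for t in element_tests:
        cond_ok = (cond_ok and t) if ALL else (cond_ok or t)
        if ALL != t:       # mirrors: if SVRMmode.ALL != (el_cond_ok & ctr_ok)
            break          # the outcome can no longer change
    return cond_ok
```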

Effective pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
svlr_ok <- SVRMmode.SL
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
if lr_ok then LR <-iea CIA + 4
if svlr_ok then SVLR <- SVSTATE
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI then
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

### Example Shader code

```
// assume f(), g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any elements of a are greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5               # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2               # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT              # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```

### LRu example

This example shows why LRu would be useful in a loop. Imagine the
following C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances exiting from the loop is not only
based on CTR: it has also become conditional on a CR result.
Thus it is desirable that NIA *and* LR only be modified
if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```

The latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason why should be clear from this being a Vector loop:
unconditional destruction of LR when LK=1 makes `sv.bclrl`
ineffective, because the intention going into the loop is
that the branch should be to the copy of LR set at the *start*
of the loop, not halfway through it.
However, if the change to LR only occurs when
the branch is taken, it becomes a useful instruction.
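
An informal Python sketch of that difference (illustrative only,
covering one element's worth of the loop above): with `LK=0` and
`LRu=1`, LR is only overwritten on the element whose branch is
actually taken, whereas with plain `LK=1` (v3.0B behaviour) LR is
clobbered on every element, destroying the copy of LR saved at loop
entry.

```
def sv_bclrl_element(LR, CIA, LK, LRu, branch_taken):
    lr_ok = LK
    NIA = None
    if branch_taken:
        NIA = LR              # branch to the previously-saved LR
        if LRu:
            lr_ok = not lr_ok # LRu inverts the LR-update decision
    if lr_ok:
        LR = CIA + 4          # LR update (possibly clobbering)
    return NIA, LR
```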

The following pseudocode should **not** be implemented because
it violates the fundamental principle of SVP64, which is that
SVP64 looping is a thin wrapper around Scalar Instructions.
The pseudocode below is more an actual Vector ISA Branch and
as such is not at all appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```