add notes and observations for ls010 SVP64 main book proposal

[libreriscv.git] / openpower / sv / rfc / ls010.mdwn
diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn

index 180591001e8b74657e47321b3dd381923d6e1dfd..626d1d307ba6203f45c6830b0b48d4556dd27697 100644 (file)
--- a/openpower/sv/rfc/ls010.mdwn
+++ b/openpower/sv/rfc/ls010.mdwn
@@ -1,2705 +1,109 @@
  # RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
+**URLs**:
  
-Credits and acknowledgements:
-
-* Luke Leighton
-* Jacob Lifshay
-* Hendrik Boom
-* Richard Wilbur
-* Alexandre Oliva
-* Cesar Strauss
-* NLnet Foundation, for funding
-* OpenPOWER Foundation
-* Paul Mackerras
-* Toshaan Bharvani
-* IBM for the Power ISA itself
-
-Links:
-
+* <https://www.sigarch.org/simd-instructions-considered-harmful/>
+* <https://libre-soc.org/openpower/sv/>
+* <https://libre-soc.org/openpower/sv/rfc/ls010/>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=1045>
+* <https://git.openpower.foundation/isa/PowerISA/issues/64>
+* <https://git.openpower.foundation/isa/PowerISA/issues/121>
  
-# Introduction
-
-Simple-V is a type of Vectorisation best described as a "Prefix Loop
-Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR` instruction and to the 8086 `REP`
-Prefix instruction.  More advanced features are similar to the Z80
-`CPIR` instruction. If viewed one-dimensionally as an actual Vector ISA it introduces
-over 1.5 million 64-bit Vector instructions.  SVP64, the instruction
-format, is therefore best viewed as an orthogonal RISC-paradigm "Prefixing"
-subsystem instead.
-
-Except where explicitly stated all bit numbers remain as in the rest of
-the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on
-the left and counting up as you move rightwards to the LSB end). All bit
-ranges are inclusive (so `4:6` means bits 4, 5, and 6, in MSB0 order).
-**All register numbering and element numbering however is LSB0 ordering**
-which is a different convention from that used elsewhere in the Power ISA.
-
-The SVP64 prefix always comes before the suffix in PC order and must be
-considered an independent "Defined word" that augments the behaviour of
-the following instruction, but does **not** change the actual Decoding
-of that following instruction.  **All prefixed instructions retain their
-non-prefixed encoding and definition**.
-
-*Architectural Resource Allocation note: it is prohibited to accept RFCs
-which fundamentally violate this hard requirement.  Under no circumstances
-must the Suffix space have an alternate instruction encoding allocated
-within SVP64 that is entirely different from the non-prefixed Defined
-Word. Hardware Implementors critically rely on this inviolate guarantee
-to implement High-Performance Multi-Issue micro-architectures that can
-sustain 100% throughput*
-
-| 0:5    | 6:31         | 32:63        |
-|--------|--------------|--------------|
-| EXT09  | v3.1  Prefix | v3.0/1  Suffix |
-
-Two apparent exceptions to the above hard rule exist: SV Branch-Conditional
-operations and LD/ST-update "Post-Increment" Mode.  Post-Increment
-was considered sufficiently high priority (significantly reducing hot-loop
-instruction count) that one bit in the Prefix is reserved for it.
-Vectorised Branch-Conditional operations "embed" the original Scalar
-Branch-Conditional behaviour into a much more advanced variant that
-is highly suited to High-Performance Computation (HPC), Supercomputing,
-and parallel GPU Workloads.
-
-Subset implementations in hardware are permitted, as long as certain
-rules are followed, allowing for full soft-emulation including future
-revisions.  Compliancy Subsets exist to ensure minimum levels of binary
-interoperability expectations within certain environments.
-
-## SVP64 encoding features
-
-A number of features need to be compacted into a very small space of
-only 24 bits:
-
-* Independent per-register Scalar/Vector tagging and range extension on
-  every register
-* Element width overrides on both source and destination
-* Predication on both source and destination
-* Two different sources of predication: INT and CR Fields
-* SV Modes including saturation (for Audio, Video and DSP), mapreduce,
-  fail-first and predicate-result mode.
-
-Different classes of operations require different formats. The earlier
-sections cover the c9mmon formats and the four separate modes follow:
-CR operations (crops), Arithmetic/Logical (termed "normal"), Load/Store
-and Branch-Conditional.
-
-## Definition of Reserved in this spec.
-
-For the new fields added in SVP64, instructions that have any of their
-fields set to a reserved value must cause an illegal instruction trap,
-to allow emulation of future instruction sets, or for subsets of SVP64 to
-be implemented in hardware and the rest emulated.  This includes SVP64
-SPRs: reading or writing values which are not supported in hardware
-must also raise illegal instruction traps in order to allow emulation.
-Unless otherwise stated, reserved values are always all zeros.
+**Severity**: Major
  
-This is unlike OpenPower ISA v3.1, which in many instances does not
-require a trap if reserved fields are nonzero.  Where the standard Power
-ISA definition is intended the red keyword `RESERVED` is used.
+**Status**: New
  
-##  Definition of "UnVectoriseable"
+**Date**: 04 Apr 2023 v1
  
-Any operation that inherently makes no sense if repeated is termed
-"UnVectoriseable" or "UnVectorised".  Examples include `sc` or `sync`
-which have no registers. `mtmsr` is also classed as UnVectoriseable
-because there is only one `MSR`.
+**Target**: v3.2B
  
-## Register files, elements, and Element-width Overrides
+**Source**: v3.0B
  
-In the Upper Compliancy Levels of SVP64 the size of the GPR and FPR Register
-files are expanded from 32 to 128 entries, and the number of CR Fields
-expanded from CR0-CR7 to CR0-CR127. (Note: A future version of SVP64 is anticipated
-to extend the VSR register file).
+**Books and Section affected**:
  
-Memory access remains exactly the same: the effects of `MSR.LE` remain
-exactly the same, affecting as they already do and remain **only**
-on the Load and Store memory-register operation byte-order, and having
-nothing to do with the ordering of the contents of register files or
-register-register operations.
-
-To be absolutely clear:
-
-**No conceptual arithmetic ordering or other changes over the Scalar Power ISA
-definitions to registers or register files
-or to arithmetic or Logical Operations beyond element-width subdivision and
-sequential element numbering are expressed or implied**
-
-Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered
-and for numbering to be sequentially incremental the element offset
-numbering is naturally **LSB0-sequentially-incrementing from zero not
-MSB0-incrementing.**  When exclusively using MSB0-numbering, SVP64
-becomes unnecessarily complex to both express and subsequently understand:
-the required subtractions from 63,
-31, 15 and 7 unfortunately become a hostile minefield, obscuring both
-intent and meaning. Therefore for the
-purposes of this section the more natural **LSB0 numbering is assumed**
-and it is left to the reader to translate to MSB0 numbering.
-
-The Canonical specification for how element-sequential numbering and
-element-width overrides is defined is expressed in the following c
-structure, assuming a Little-Endian system, and naturally using LSB0
-numbering everywhere because the ANSI c specification is inherently LSB0:
-
-```
-    #pragma pack
-    typedef union {
-        uint8_t  b[]; // elwidth 8
-        uint16_t s[]; // elwidth 16
-        uint32_t i[]; // elwidth 32
-        uint64_t l[]; // elwidth 64
-        uint8_t actual_bytes[8];
-    } el_reg_t;
-
-    elreg_t int_regfile[128];
-
-    void get_register_element(el_reg_t* el, int gpr, int element, int width) {
-        switch (width) {
-            case 64: el->l = int_regfile[gpr].l[element];
-            case 32: el->i = int_regfile[gpr].i[element];
-            case 16: el->s = int_regfile[gpr].s[element];
-            case 8 : el->b = int_regfile[gpr].b[element];
-        }
-    }
-    void set_register_element(el_reg_t* el, int gpr, int element, int width) {
-        switch (width) {
-            case 64: int_regfile[gpr].l[element] = el->l;
-            case 32: int_regfile[gpr].i[element] = el->i;
-            case 16: int_regfile[gpr].s[element] = el->s;
-            case 8 : int_regfile[gpr].b[element] = el->b;
-        }
-    }
  ```
-
-Example Vector-looped add operation implementation when elwidths are 64-bit:
-
-```
- # add RT, RA,RB using the "uint64_t" union member, "l"
- for i in range(VL):
-      int_regfile[RT].l[i] = int_regfile[RA].l[i] + int_regfile[RB].l[i]
+    New Book: new Zero-Overhead-Loop
+    New Appendix, Zero-Overhead-Loop
  ```
  
-However if elwidth overrides are set to 16 for both source and destination:
+**Summary**
  
  ```
- # add RT, RA, RB using the "uint64_t" union member "s"
- for i in range(VL):
-      int_regfile[RT].s[i] = int_regfile[RA].s[i] + int_regfile[RB].s[i]
+    Adds a Zero-Overhead-Loop Subsystem based on the Cray True-Scalable Vector concept
+    in a RISC-paradigm fashion.  Total instructions added is six, plus Prefix format.
  ```
  
-Hardware Architectural note: to avoid a Read-Modify-Write at the register
-file it is strongly recommended to implement byte-level write-enable lines
-exactly as has been implemented in DRAM ICs for many decades. Additionally
-the predicate mask bit is advised to be associated with the element
-operation and alongside the result ultimately passed to the register file.
-When element-width is set to 64-bit the relevant predicate mask bit
-may be repeated eight times and pull all eight write-port byte-level
-lines HIGH. Clearly when element-width is set to 8-bit the relevant
-predicate mask bit corresponds directly with one single byte-level
-write-enable line.  It is up to the Hardware Architect to then amortise
-(merge) elements together into both PredicatedSIMD Pipelines as well
-as simultaneous non-overlapping Register File writes, to achieve High
-Performance designs.
-
-## Scalar Identity Behaviour
-
-SVP64 is designed so that when the prefix is all zeros, and
- VL=1, no effect or
-influence occurs (no augmentation) such that all standard Power ISA
-v3.0/v3 1 instructions covered by the prefix are "unaltered". This
-is termed `scalar identity behaviour` (based on the mathematical
-definition for "identity", as in, "identity matrix" or better "identity
-transformation").
-
-Note that this is completely different from when VL=0.  VL=0 turns all
-operations under its influence into `nops` (regardless of the prefix)
- whereas when VL=1 and the SV prefix is all zeros, the operation simply
- acts as if SV had not been applied at all to the instruction  (an
- "identity transformation").
-
-## Register Naming and size
-
-As indicated above SV Registers are simply the INT, FP and CR
-register files extended linearly to larger sizes; SV Vectorisation
-iterates sequentially through these registers (LSB0 sequential ordering
-from 0 to VL-1).
-
-Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is
-r0 to r31, SV extends this as r0 to r127.  Likewise FP registers are
-extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries,
-CR0 thru CR127.
-
-The names of the registers therefore reflects a simple linear extension
-of the Power ISA v3.0B / v3.1B register naming, and in hardware this
-would be reflected by a linear increase in the size of the underlying
-SRAM used for the regfiles.
-
-Note: when an EXTRA field (defined below) is zero, SV is deliberately
-designed so that the register fields are identical to as if SV was not in
-effect i.e. under these circumstances (EXTRA=0) the register field names
-RA, RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers.
-This is part of `scalar identity behaviour` described above.
-
-## Future expansion.
-
-With the way that EXTRA fields are defined and applied to register fields,
-future versions of SV may involve 256 or greater registers. Backwards
-binary compatibility may be achieved with a PCR bit (Program Compatibility
-Register).  Further discussion is out of scope for this version of SVP64.
-
---------
-
-\newpage{}
-
-# Remapped Encoding (`RM[0:23]`)
-
-To allow relatively easy remapping of which portions of the Prefix Opcode
-Map are used for SVP64 without needing to rewrite a large portion of the
-SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to
-a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]`
-at the LSB.
-
-The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
-is defined in the Prefix Fields section.
-
-## Prefix Fields
-
-TODO incorporate EXT09
-
-To "activate" svp64 (in a way that does not conflict with v3.1B 64
-bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
-(see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
-This is achieved by setting bits 7 and 9 to 1:
-
-| Name       | Bits    | Value | Description                    |
-|------------|---------|-------|--------------------------------|
-| EXT01      | `0:5`   | `1`   | Indicates Prefixed 64-bit      |
-| `RM[0]`    | `6`     |       | Bit 0 of Remapped Encoding     |
-| SVP64_7    | `7`     | `1`   | Indicates this is SVP64        |
-| `RM[1]`    | `8`     |       | Bit 1 of Remapped Encoding     |
-| SVP64_9    | `9`     | `1`   | Indicates this is SVP64        |
-| `RM[2:23]` | `10:31` |       | Bits 2-23 of Remapped Encoding |
-
-Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
-are constructed:
-
-| 0:5    | 6     | 7 | 8     | 9 | 10:31    |
-|--------|-------|---|-------|---|----------|
-| EXT01  | RM    | 1 | RM    | 1 | RM       |
-| 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
-
-Following the prefix will be the suffix: this is simply a 32-bit v3.0B
-/ v3.1 instruction.  That instruction becomes "prefixed" with the SVP
-context: the Remapped Encoding field (RM).
-
-It is important to note that unlike v3.1 64-bit prefixed instructions
-there is insufficient space in `RM` to provide identification of any SVP64
-Fields without first partially decoding the 32-bit suffix.  Similar to
-the "Forms" (X-Form, D-Form) the `RM` format is individually associated
-with every instruction.
-
-Extreme caution and care must therefore be taken when extending SVP64
-in future, to not create unnecessary relationships between prefix and
-suffix that could complicate decoding, adding latency.
-
-# Common RM fields
-
-The following fields are common to all Remapped Encodings:
-
-| Field Name | Field bits | Description                            |
-|------------|------------|----------------------------------------|
-| MASKMODE   | `0`        | Execution (predication) Mask Kind      |
-| MASK       | `1:3`      | Execution Mask                      |
-| SUBVL      | `8:9`      | Sub-vector length                   |
-
-The following fields are optional or encoded differently depending
-on context after decoding of the Scalar suffix:
-
-| Field Name | Field bits | Description                            |
-|------------|------------|----------------------------------------|
-| ELWIDTH       | `4:5`      | Element Width                       |
-| ELWIDTH_SRC   | `6:7`      | Element Width for Source      |
-| EXTRA         | `10:18`    | Register Extra encoding                |
-| MODE          | `19:23`    | changes Vector behaviour               |
-
-* MODE changes the behaviour of the SV operation (result saturation,
-  mapreduce)
-* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D
-  and Audio/Video DSP work
-* ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and
-  source operand width
-* MASK (and MASK_SRC) and MASKMODE provide predication (two types of
-  sources: scalar INT and Vector CR).
-* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category
-  for the instruction, which is determined only by decoding the Scalar 32
-  bit suffix.
-
-Similar to Power ISA `X-Form` etc. EXTRA bits are given designations,
-such as `RM-1P-3S1D` which indicates for this example that the operation
-is to be single-predicated and that there are 3 source operand EXTRA
-tags and one destination operand tag.
-
-Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
-or increased latency in some implementations due to lane-crossing.
-
-## Mode
-
-Mode is an augmentation of SV behaviour.  Different types of instructions
-have different needs, similar to Power ISA v3.1 64 bit prefix 8LS and MTRR
-formats apply to different instruction types.  Modes include Reduction,
-Iteration, arithmetic saturation, and Fail-First.  More specific details
-in each section and in the SVP64 appendix
-
-* For condition register operations see [[sv/cr_ops]]
-* For LD/ST Modes, see [[sv/ldst]].
-* For Branch modes, see [[sv/branches]]
-* For arithmetic and logical, see [[sv/normal]]
-
-## ELWIDTH Encoding
-
-Default behaviour is set to 0b00 so that zeros follow the convention
-of `scalar identity behaviour`.  In this case it means that elwidth
-overrides are not applicable.  Thus if a 32 bit instruction operates
-on 32 bit, `elwidth=0b00` specifies that this behaviour is unmodified.
-Likewise when a processor is switched from 64 bit to 32 bit mode,
-`elwidth=0b00` states that, again, the behaviour is not to be modified.
-
-Only when elwidth is nonzero is the element width overridden to the
-explicitly required value.
-
-### Elwidth for Integers:
-
-| Value | Mnemonic       | Description                        |
-|-------|----------------|------------------------------------|
-| 00    | DEFAULT        | default behaviour for operation    |
-| 01    | `ELWIDTH=w`    | Word: 32-bit integer                 |
-| 10    | `ELWIDTH=h`    | Halfword: 16-bit integer             |
-| 11    | `ELWIDTH=b`    | Byte: 8-bit integer                  |
-
-This encoding is chosen such that the byte width may be computed as
-`8<<(3-ew)`
-
-### Elwidth for FP Registers:
-
-| Value | Mnemonic       | Description                        |
-|-------|----------------|------------------------------------|
-| 00    | DEFAULT        | default behaviour for FP operation     |
-| 01    | `ELWIDTH=f32`  | 32-bit IEEE 754 Single floating-point  |
-| 10    | `ELWIDTH=f16`  | 16-bit IEEE 754 Half floating-point   |
-| 11    | `ELWIDTH=bf16` | Reserved for `bf16` |
-
-Note:
-[`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
-is reserved for a future implementation of SV
-
-Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall
-perform its operation at **half** the ELWIDTH then padded back out
-to ELWIDTH.  `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
-clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
-then padded back out to fit in IEEE754 FP64, exactly as for Scalar
-v3.0B "single" FP.  Any FP operation ending in "s" where ELWIDTH=f16
-or ELWIDTH=bf16 is reserved and must raise an illegal instruction
-(IEEE754 FP8 or BF8 are not defined).
-
-### Elwidth for CRs (no meaning)
-
-Element-width overrides for CR Fields has no meaning. The bits
-are therefore used for other purposes, or when Rc=1, the Elwidth
-applies to the result being tested (a GPR or FPR), but not to the
-Vector of CR Fields.
-
-## SUBVL Encoding
-
-The default for SUBVL is 1 and its encoding is 0b00 to indicate that
-SUBVL is effectively disabled (a SUBVL for-loop of only one element). this
-lines up in combination with all other "default is all zeros" behaviour.
-
-| Value | Mnemonic  | Subvec  | Description            |
-|-------|-----------|---------|------------------------|
-| 00    | `SUBVL=1` | single  | Sub-vector length of 1 |
-| 01    | `SUBVL=2` | vec2    | Sub-vector length of 2 |
-| 10    | `SUBVL=3` | vec3    | Sub-vector length of 3 |
-| 11    | `SUBVL=4` | vec4    | Sub-vector length of 4 |
-
-The SUBVL encoding value may be thought of as an inclusive range of a
-sub-vector.  SUBVL=2 represents a vec2, its encoding is 0b01, therefore
-this may be considered to be elements 0b00 to 0b01 inclusive.
-
-## MASK/MASK_SRC & MASKMODE Encoding
-
-One bit (`MASKMODE`) indicates the mode: CR or Int predication.   The two
-types may not be mixed.
-
-Special note: to disable predication this field must be set to zero in
-combination with Integer Predication also being set to 0b000. this has the
-effect of enabling "all 1s" in the predicate mask, which is equivalent to
-"not having any predication at all" and consequently, in combination with
-all other default zeros, fully disables SV (`scalar identity behaviour`).
-
-`MASKMODE` may be set to one of 2 values:
-
-| Value | Description                                          |
-|-----------|------------------------------------------------------|
-| 0         | MASK/MASK_SRC are encoded using Integer Predication  |
-| 1         | MASK/MASK_SRC are encoded using CR-based Predication |
-
-Integer Twin predication has a second set of 3 bits that uses the same
-encoding thus allowing either the same register (r3, r10 or r31) to be
-used for both src and dest, or different regs (one for src, one for dest).
-
-Likewise CR based twin predication has a second set of 3 bits, allowing
-a different test to be applied.
-
-Note that it is assumed that Predicate Masks (whether INT or CR) are
-read *before* the operations proceed.  In practice (for CR Fields)
-this creates an unnecessary block on parallelism.  Therefore, it is up
-to the programmer to ensure that the CR fields used as Predicate Masks
-are not being written to by any parallel Vector Loop.  Doing so results
-in **UNDEFINED** behaviour, according to the definition outlined in the
-Power ISA v3.0B Specification.
-
-Hardware Implementations are therefore free and clear to delay reading
-of individual CR fields until the actual predicated element operation
-needs to take place, safe in the knowledge that no programmer will have
-issued a Vector Instruction where previous elements could have overwritten
-(destroyed) not-yet-executed CR-Predicated element operations.
-
-### Integer Predication (MASKMODE=0)
-
-When the predicate mode bit is zero the 3 bits are interpreted as below.
-Twin predication has an identical 3 bit field similarly encoded.
-
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the
-following meaning:
-
-| Value | Mnemonic | Element `i` enabled if:      |
-|-------|----------|------------------------------|
-| 000   | ALWAYS   | predicate effectively all 1s |
-| 001   | 1 << R3  | `i == R3`                    |
-| 010   | R3       | `R3 & (1 << i)` is non-zero  |
-| 011   | ~R3      | `R3 & (1 << i)` is zero      |
-| 100   | R10      | `R10 & (1 << i)` is non-zero |
-| 101   | ~R10     | `R10 & (1 << i)` is zero     |
-| 110   | R30      | `R30 & (1 << i)` is non-zero |
-| 111   | ~R30     | `R30 & (1 << i)` is zero     |
-
-r10 and r30 are at the high end of temporary and unused registers,
-so as not to interfere with register allocation from ABIs.
-
-### CR-based Predication (MASKMODE=1)
-
-When the predicate mode bit is one the 3 bits are interpreted as below.
-Twin predication has an identical 3 bit field similarly encoded.
-
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the
-following meaning:
-
-| Value | Mnemonic | Element `i` is enabled if     |
-|-------|----------|--------------------------|
-| 000   | lt       | `CR[offs+i].LT` is set   |
-| 001   | nl/ge    | `CR[offs+i].LT` is clear |
-| 010   | gt       | `CR[offs+i].GT` is set   |
-| 011   | ng/le    | `CR[offs+i].GT` is clear |
-| 100   | eq       | `CR[offs+i].EQ` is set   |
-| 101   | ne       | `CR[offs+i].EQ` is clear |
-| 110   | so/un    | `CR[offs+i].FU` is set   |
-| 111   | ns/nu    | `CR[offs+i].FU` is clear |
-
-CR based predication.  TODO: select alternate CR for twin predication? see
-[[discussion]]  Overlap of the two CR based predicates must be taken
-into account, so the starting point for one of them must be suitably
-high, or accept that for twin predication VL must not exceed the range
-where overlap will occur, *or* that they use the same starting point
-but select different *bits* of the same CRs
-
-`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised
-Rc=1 operations (see below).  Rc=1 operations start from CR8 (TBD).
-
-The CR Predicates chosen must start on a boundary that Vectorised CR
-operations can access cleanly, in full.  With EXTRA2 restricting starting
-points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and
-CR Predicate Masks have to be adapted to fit on these boundaries as well.
-
-## Extra Remapped Encoding <a name="extra_remap"> </a>
-
-Shows all instruction-specific fields in the Remapped Encoding
-`RM[10:18]` for all instruction variants.  Note that due to the very
-tight space, the encoding mode is *not* included in the prefix itself.
-The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form)
-on a per-instruction basis, and, like "Forms" are given a designation
-(below) of the form `RM-nP-nSnD`. The full list of which instructions
-use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV
-files have been provided which will make the task of creating SV-aware
-ISA decoders easier*).
-
-These mappings are part of the SVP64 Specification in exactly the same
-way as X-Form, D-Form. New Scalar instructions added to the Power ISA
-will need a corresponding SVP64 Mapping, which can be derived by-rote
-from examining the Register "Profile" of the instruction.
-
-There are two categories:  Single and Twin Predication.  Due to space
-considerations further subdivision of Single Predication is based on
-whether the number of src operands is 2 or 3.  With only 9 bits available
-some compromises have to be made.
-
-* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand
-   instructions (fmadd, isel, madd).
-* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand 
-   instructions (src1 src2 dest)
-* `RM-2P-1S1D` Twin Predication (src=1, dest=1)
-* `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
-* `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
-
-### RM-1P-3S1D
-
-| Field Name | Field bits | Description                            |
-|------------|------------|----------------------------------------|
-| Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding)   |
-| Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
-| Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
-| Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding)   |
-| EXTRA2_MODE   | `18`    | used by `divmod2du` and `maddedu` for RS   |
-
-These are for 3 operand in and either 1 or 2 out instructions.
-3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
-such as `maddedu` have an implicit second destination, RS, the
-selection of which is determined by bit 18.
-
-### RM-1P-2S1D
-
-| Field Name | Field bits | Description                               |
-|------------|------------|-------------------------------------------|
-| Rdest\_EXTRA3 | `10:12` | extends Rdest  |
-| Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1  |
-| Rsrc2\_EXTRA3 | `16:18` | extends Rsrc3  |
-
-These are for 2 operand 1 dest instructions, such as `add RT, RA,
-RB`. However also included are unusual instructions with an implicit
-dest that is identical to its src reg, such as `rlwinmi`.
-
-Normally, with instructions such as `rlwinmi`, the scalar v3.0B ISA would
-not have sufficient bit fields to allow an alternative destination.
-With SV however this becomes possible.  Therefore, the fact that the
-dest is implicitly also a src should not mislead: due to the *prefix*
-they are different SV regs.
-
-* `rlwimi RA, RS, ...`
-* Rsrc1_EXTRA3 applies to RS as the first src
-* Rsrc2_EXTRA3 applies to RA as the secomd src
-* Rdest_EXTRA3 applies to RA to create an **independent** dest.
-
-With the addition of the EXTRA bits, the three registers
-each may be *independently* made vector or scalar, and be independently
-augmented to 7 bits in length.
-
-### RM-2P-1S1D/2S
-
-| Field Name | Field bits | Description                 |
-|------------|------------|----------------------------|
-| Rdest_EXTRA3 | `10:12`    | extends Rdest             |
-| Rsrc1_EXTRA3 | `13:15`    | extends Rsrc1             |
-| MASK_SRC     | `16:18`    | Execution Mask for Source |
-
-`RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2.
-
-### RM-1P-2S1D
-
-single-predicate, three registers (2 read, 1 write)
-
-| Field Name | Field bits | Description                 |
-|------------|------------|----------------------------|
-| Rdest_EXTRA3 | `10:12`    | extends Rdest             |
-| Rsrc1_EXTRA3 | `13:15`    | extends Rsrc1             |
-| Rsrc2_EXTRA3 | `16:18`    | extends Rsrc2             |
-
-### RM-2P-2S1D/1S2D/3S
-
-The primary purpose for this encoding is for Twin Predication on LOAD
-and STORE operations.  see [[sv/ldst]] for detailed anslysis.
-
-RM-2P-2S1D:
-
-| Field Name | Field bits | Description                     |
-|------------|------------|----------------------------|
-| Rdest_EXTRA2 | `10:11`  | extends Rdest (R\*\_EXTRA2 Encoding)   |
-| Rsrc1_EXTRA2 | `12:13`  | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
-| Rsrc2_EXTRA2 | `14:15`  | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
-| MASK_SRC     | `16:18`  | Execution Mask for Source     |
-
-Note that for 1S2P the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
-is in bits 10:11, Rdest1_EXTRA2 in 12:13)
-
-Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src:
-Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
-
-Note also that LD with update indexed, which takes 2 src and 2 dest
-(e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
-Twin Predication.  therefore these are treated as RM-2P-2S1D and the
-src spec for RA is also used for the same RA as a dest.
+**Submitter**: Luke Leighton (Libre-SOC)
  
-Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
-or increased latency in some implementations due to lane-crossing.
+**Requester**: Libre-SOC
  
-## R\*\_EXTRA2/3
-
-EXTRA is the means by which two things are achieved:
-
-1. Registers are marked as either Vector *or Scalar*
-2. Register field numbers (limited typically to 5 bit)
-   are extended in range, both for Scalar and Vector.
-
-The register files are therefore extended:
-
-* INT is extended from r0-31 to r0-127
-* FP is extended from fp0-32 to fp0-fp127
-* CR Fields are extended from CR0-7 to CR0-127
-
-However due to pressure in `RM.EXTRA` not all these registers
-are accessible by all instructions, particularly those with
-a large number of operands (`madd`, `isel`).
-
-In the following tables register numbers are constructed from the
-standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2 or
-EXTRA3 field from the SV Prefix, determined by the specific RM-xx-yyyy
-designation for a given instruction.  The prefixing is arranged so that
-interoperability between prefixing and nonprefixing of scalar registers
-is direct and convenient (when the EXTRA field is all zeros).
-
-A pseudocode algorithm explains the relationship, for INT/FP (see
-SVP64 appendix for CRs)
+**Impact on processor**:
  
  ```
-    if extra3_mode:
-        spec = EXTRA3
-    else:
-        spec = EXTRA2 << 1 # same as EXTRA3, shifted
-    if spec[0]: # vector
-         return (RA << 2) | spec[1:2]
-    else:         # scalar
-         return (spec[1:2] << 5) | RA
+    Addition of new "Zero-Overhead-Loop-Control" DSP-style Vector-style
+    subsystem that in simple low-end (Embedded) systems may be minimalistically
+    and easily be implemented by inserting a new fully-independent Pipeline Stage
+    in between Decode and Issue, with very little disruption, and in higher
+    performance pre-existing Multi-Issue Out-of-Order systems seamlessly fits likewise
+    to significantly boost performance.
  ```
  
-Future versions may extend to 256 by shifting Vector numbering up.
-Scalar will not be altered.
-
-Note that in some cases the range of starting points for Vectors
-is limited.
-
-### INT/FP EXTRA3
-
-If EXTRA3 is zero, maps to "scalar identity" (scalar Power ISA field
-naming).
-
-Fields are as follows:
-
-* Value: R_EXTRA3
-* Mode: register is tagged as scalar or vector
-* Range/Inc: the range of registers accessible from this EXTRA
-  encoding, and the "increment" (accessibility). "/4" means
-  that this EXTRA encoding may only give access (starting point)
-  every 4th register.
-* MSB..LSB: the bit field showing how the register opcode field
-  combines with EXTRA to give (extend) the register number (GPR)
-
-| Value     | Mode  | Range/Inc     | 6..0 |
-|-----------|-------|---------------|---------------------|
-| 000       | Scalar | `r0-r31`/1 | `0b00 RA`      |
-| 001       | Scalar | `r32-r63`/1 | `0b01 RA`      |
-| 010       | Scalar | `r64-r95`/1 | `0b10 RA`      |
-| 011       | Scalar | `r96-r127`/1 | `0b11 RA`      |
-| 100       | Vector | `r0-r124`/4 | `RA 0b00`      |
-| 101       | Vector | `r1-r125`/4 | `RA 0b01`      |
-| 110       | Vector | `r2-r126`/4 | `RA 0b10`      |
-| 111       | Vector | `r3-r127`/4 | `RA 0b11`      |
-
-### INT/FP EXTRA2
-
-If EXTRA2 is zero will map to
-"scalar identity behaviour" i.e Scalar Power ISA register naming:
-
-| Value    | Mode  | Range/inc     | 6..0 |
-|----------|-------|---------------|-----------|
-| 00       | Scalar | `r0-r31`/1 | `0b00 RA`     |
-| 01       | Scalar | `r32-r63`/1 | `0b01 RA`      |
-| 10       | Vector | `r0-r124`/4 | `RA 0b00`      |
-| 11       | Vector | `r2-r126`/4 | `RA 0b10`   |
-
-**Note that unlike in EXTRA3, in EXTRA2**:
-
-* the GPR Vectors may only start from
-  `r0, r2, r4, r6, r8` and likewise FPR Vectors.
-* the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.
-
-as there is insufficient bits to cover the full range.
-
-### CR Field EXTRA3
-
-CR Field encoding is essentially the same but made more complex due to CRs
-being bit-based, because the application of SVP64 element-numbering applies
-to the CR *Field* numbering not the CR register *bit* numbering.
-Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
-and Scalars may only go from `CR0, CR1, ... CR31`
-
-Encoding shown MSB down to LSB
-
-For a 5-bit operand (BA, BB, BT):
-
-| Value | Mode | Range/Inc     | 8..5      | 4..2    | 1..0    |
-|-------|------|---------------|-----------| --------|---------|
-| 000   | Scalar | `CR0-CR7`/1   | 0b0000    | BA[0:2] | BA[3:4] |
-| 001   | Scalar | `CR8-CR15`/1  | 0b0001    | BA[0:2] | BA[3:4] |
-| 010   | Scalar | `CR16-CR23`/1 | 0b0010    | BA[0:2] | BA[3:4] |
-| 011   | Scalar | `CR24-CR31`/1 | 0b0011    | BA[0:2] | BA[3:4] |
-| 100   | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000   | BA[3:4] |
-| 101   | Vector | `CR4-CR116`/16 | BA[0:2] 0 | 0b100   | BA[3:4] |
-| 110   | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000   | BA[3:4] |
-| 111   | Vector | `CR12-CR124`/16 | BA[0:2] 1 | 0b100   | BA[3:4] |
-
-For a 3-bit operand (e.g. BFA):
-
-| Value | Mode | Range/Inc     | 6..3      | 2..0    |
-|-------|------|---------------|-----------| --------|
-| 000   | Scalar | `CR0-CR7`/1   | 0b0000    | BFA   |
-| 001   | Scalar | `CR8-CR15`/1  | 0b0001    | BFA      |
-| 010   | Scalar | `CR16-CR23`/1 | 0b0010    | BFA      |
-| 011   | Scalar | `CR24-CR31`/1 | 0b0011    | BFA      |
-| 100   | Vector | `CR0-CR112`/16 | BFA 0 | 0b000   |
-| 101   | Vector | `CR4-CR116`/16 | BFA 0 | 0b100   |
-| 110   | Vector | `CR8-CR120`/16 | BFA 1 | 0b000   |
-| 111   | Vector | `CR12-CR124`/16 | BFA 1 | 0b100   |
-
-### CR EXTRA2
-
-CR encoding is essentially the same but made more complex due to CRs
-being bit-based, because the application of SVP64 element-numbering applies
-to the CR *Field* numbering not the CR register *bit* numbering.
-See separate section for explanation and pseudocode.
-Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
-
-Encoding shown MSB down to LSB
-
-For a 5-bit operand (BA, BB, BC):
-
-| Value | Mode   | Range/Inc      | 8..5    | 4..2    | 1..0    |
-|-------|--------|----------------|---------|---------|---------|
-| 00    | Scalar | `CR0-CR7`/1    | 0b0000  | BA[0:2] | BA[3:4] |
-| 01    | Scalar | `CR8-CR15`/1   | 0b0001  | BA[0:2] | BA[3:4] |
-| 10    | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000   | BA[3:4] |
-| 11    | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000   | BA[3:4] |
-
-For a 3-bit operand (e.g. BFA):
-
-| Value | Mode | Range/Inc     | 6..3      | 2..0    |
-|-------|------|---------------|-----------| --------|
-| 00    | Scalar | `CR0-CR7`/1   | 0b0000  | BFA   |
-| 01    | Scalar | `CR8-CR15`/1  | 0b0001  | BFA     |
-| 10    | Vector | `CR0-CR112`/16 | BFA 0 | 0b000   |
-| 11    | Vector | `CR8-CR120`/16 | BFA 1 | 0b000   |
-
---------
-
-\newpage{}
-
-
-# Normal SVP64 Modes, for Arithmetic and Logical Operations
-
-Normal SVP64 Mode covers Arithmetic and Logical operations
-to provide suitable additional behaviour.  The Mode
-field is bits 19-23 of the [[svp64]] RM Field.
-
-## Mode
-
-Mode is an augmentation of SV behaviour, providing additional
-functionality.  Some of these alterations are element-based (saturation),
-others involve post-analysis (predicate result) and others are
-Vector-based (mapreduce, fail-on-first).
-
-[[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately:
-the following Modes apply to Arithmetic and Logical SVP64 operations:
-
-* **simple** mode is straight vectorisation.  no augmentations: the
-  vector comprises an array of independently created results.
-* **ffirst** or data-dependent fail-on-first: see separate section.
-  the vector may be truncated depending on certain criteria.
-  *VL is altered as a result*.
-* **sat mode** or saturation: clamps each element result to a min/max
-  rather than overflows / wraps.  allows signed and unsigned clamping
-  for both INT and FP.
-* **reduce mode**. if used correctly, a mapreduce (or a prefix sum)
-  is performed.    see [[svp64/appendix]].
-  note that there are comprehensive caveats when using this mode.
-* **pred-result** will test the result (CR testing selects a bit of CR
-  and inverts it, just like branch conditional testing) and if the
-  test fails it is as if the *destination* predicate bit was zero even
-  before starting the operation.  When Rc=1 the CR element however is
-  still stored in the CR regfile, even if the test failed.  See appendix
-  for details.
-
-Note that ffirst and reduce modes are not anticipated to be
-high-performance in some implementations.  ffirst due to interactions
-with VL, and reduce due to it requiring additional operations to produce
-a result.  simple, saturate and pred-result are however inter-element
-independent and may easily be parallelised to give high performance,
-regardless of the value of VL.
-
-The Mode table for Arithmetic and Logical operations is laid out as
-follows:
-
-| 0-1 |  2  |  3   4  |  description              |
-| --- | --- |---------|-------------------------- |
-| 00  |   0 |  dz  sz | simple mode                      |
-| 00  |   1 | 0  RG   | scalar reduce mode (mapreduce) |
-| 00  |   1 | 1  /    | reserved     |
-| 01  | inv | CR-bit  | Rc=1: ffirst CR sel              |
-| 01  | inv | VLi RC1 |  Rc=0: ffirst z/nonz |
-| 10  |   N | dz   sz |  sat mode: N=0/1 u/s |
-| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
-| 11  | inv | zz  RC1 |  Rc=0: pred-result z/nonz |
-
-Fields:
-
-* **sz / dz**  if predication is enabled will put zeros into the dest
-  (or as src in the case of twin pred) when the predicate bit is zero.
-  otherwise the element is ignored or skipped, depending on context.
-* **zz**: both sz and dz are set equal to this flag
-* **inv CR bit** just as in branches (BO) these bits allow testing of
-  a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **RG** inverts the Vector Loop order (VL-1 downto 0) rather
-  than the normal 0..VL-1
-* **N** sets signed/unsigned saturation.
-* **RC1** as if Rc=1, enables access to `VLi`.
-* **VLi** VL inclusive: in fail-first mode, the truncation of
-  VL *includes* the current element at the failure point rather
-  than excludes it from the count.
-
-For LD/ST Modes, see [[sv/ldst]].  For Condition Registers see
-[[sv/cr_ops]].  For Branch modes, see [[sv/branches]].
-
-## Rounding, clamp and saturate
-
-To help ensure for example that audio quality is not compromised by
-overflow, "saturation" is provided, as well as a way to detect when
-saturation occurred if desired (Rc=1). When Rc=1 there will be a *vector*
-of CRs, one CR per element in the result (Note: this is different from
-VSX which has a single CR per block).
-
-When N=0 the result is saturated to within the maximum range of an
-unsigned value.  For integer ops this will be 0 to 2^elwidth-1. Similar
-logic applies to FP operations, with the result being saturated to
-maximum rather than returning INF, and the minimum to +0.0
-
-When N=1 the same occurs except that the result is saturated to the min
-or max of a signed result, and for FP to the min and max value rather
-than returning +/- INF.
-
-When Rc=1, the CR "overflow" bit is set on the CR associated with the
-element, to indicate whether saturation occurred.  Note that due to
-the hugely detrimental effect it has on parallel processing, XER.SO is
-**ignored** completely and is **not** brought into play here.  The CR
-overflow bit is therefore simply set to zero if saturation did not occur,
-and to one if it did.  This behaviour (ignoring XER.SO) is actually optional in
-the SFFS Compliancy Subset: for SVP64 it is made mandatory *but only on
-Vectorised instructions*.
-
-Note also that saturate on operations that set OE=1 must raise an Illegal
-Instruction due to the conflicting use of the CR.so bit for storing if
-saturation occurred. Vectorised Integer Operations that produce a Carry-Out (CA,
-CA32): these two bits will be `UNDEFINED` if saturation is also requested.
-
-Note that the operation takes place at the maximum bitwidth (max of
-src and dest elwidth) and that truncation occurs to the range of the
-dest elwidth.
-
-*Programmer's Note: Post-analysis of the Vector of CRs to find out if any
-given element hit saturation may be done using a mapreduced CR op (cror),
-or by using the new crrweird instruction with Rc=1, which will transfer
-the required CR bits to a scalar integer and update CR0, which will allow
-testing the scalar integer for nonzero.  see [[sv/cr_int_predication]].
-Alternatively, a Data-Dependent Fail-First may be used to truncate the
-Vector Length to non-saturated elements, greatly increasing the productivity
-of parallelised inner hot-loops.*
-
-## Reduce mode
-
-Reduction in SVP64 is similar in essence to other Vector Processing ISAs,
-but leverages the underlying scalar Base v3.0B operations.  Thus it is
-more a convention that the programmer may utilise to give the appearance
-and effect of a Horizontal Vector Reduction. Due to the unusual decoupling
-it is also possible to perform prefix-sum (Fibonacci Series) in certain
-circumstances. Details are in the SVP64 appendix
-
-Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
-As explained in the [[sv/appendix]] Reduce Mode switches off the check
-which would normally stop looping if the result register is scalar.
-Thus, the result scalar register, if also used as a source scalar,
-may be used to perform sequential accumulation.  This *deliberately*
-sets up a chain of Register Hazard Dependencies, whereas Parallel Reduce
-[[sv/remap]] deliberately issues a Tree-Schedule of operations that may
-be parallelised.
-
-## Data-dependent Fail-on-first
-
-Data-dependent fail-on-first is very different from LD/ST Fail-First
-(also known as Fault-First) and is actually CR-field-driven.
-Vector elements are required to appear
-to be executed in sequential Program Order. When REMAP is not active,
-element 0 would be the first.
-
-Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
-CR-creating operation produces a result (including cmp).  Similar to
-branch, an analysis of the CR is performed and if the test fails, the
-vector operation terminates and discards all element operations **at and
-above the current one**, and VL is truncated to either the *previous*
-element or the current one, depending on whether VLi (VL "inclusive")
-is clear or set, respectively.
-
-Thus the new VL comprises a contiguous vector of results, all of which
-pass the testing criteria (equal to zero, less than zero etc as defined
-by the CR-bit test).
-
-*Note: when VLi is clear, the behaviour at first seems counter-intuitive.
-A result is calculated but if the test fails it is prohibited from being
-actually written.  This becomes intuitive again when it is remembered
-that the length that VL is set to is the number of *written* elements, and
-only when VLI is set will the current element be included in that count.*
-
-The CR-based data-driven fail-on-first is "new" and not found in ARM SVE
-or RVV. At the same time it is "old" because it is almost identical to
-a generalised form of Z80's `CPIR` instruction.  It is extremely useful
-for reducing instruction count, however requires speculative execution
-involving modifications of VL to get high performance implementations.
-An additional mode (RC1=1) effectively turns what would otherwise be an
-arithmetic operation into a type of `cmp`.  The CR is stored (and the
-CR.eq bit tested against the `inv` field).  If the CR.eq bit is equal to
-`inv` then the Vector is truncated and the loop ends.
-
-VLi is only available as an option when `Rc=0` (or for instructions
-which do not have Rc). When set, the current element is always also
-included in the count (the new length that VL will be set to).  This may
-be useful in combination with "inv" to truncate the Vector to *exclude*
-elements that fail a test, or, in the case of implementations of strncpy,
-to include the terminating zero.
-
-In CR-based data-driven fail-on-first there is only the option to select
-and test one bit of each CR (just as with branch BO).  For more complex
-tests this may be insufficient.  If that is the case, a vectorised crop
-such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
-and ffirst applied to the crop instead of to the arithmetic vector. Note
-that crops are covered by the [[sv/cr_ops]] Mode format.
-
-*Programmer's note: `VLi` is only accessible in normal operations which in
-turn limits the CR field bit-testing to only `EQ/NE`.  [[sv/cr_ops]] are
-not so limited.  Thus it is possible to use for example `sv.cror/ff=gt/vli
-*0,*0,*0`, which is not a `nop` because it allows Fail-First Mode to
-perform a test and truncate VL.*
-
-*Hardware implementor's note: effective Sequential Program Order must be preserved.
-Speculative Execution is perfectly permitted as long as the speculative elements
-are held back from writing to register files (kept in Resevation Stations),
-until such time as the relevant
-CR Field bit(s) has been analysed. All Speculative elements sequentially beyond the
-test-failure point **MUST** be cancelled. This is no different from standard
-Out-of-Order Execution and the modification effort to efficiently support 
-Data-Dependent Fail-First within a pre-existing Multi-Issue Out-of-Order Engine
-is anticipated to be minimal. In-Order systems on the other hand are expected,
-unavoidably, to be low-performance*.
-
-Two extremely important aspects of ffirst are:
-
-* LDST ffirst may never set VL equal to zero.  This because on the first
-  element an exception must be raised "as normal".
-* CR-based data-dependent ffirst on the other hand **can** set VL equal
-  to zero. This is the only means in the entirety of SV that VL may be set
-  to zero (with the exception of via the SV.STATE SPR).  When VL is set
-  zero due to the first element failing the CR bit-test, all subsequent
-  vectorised operations are effectively `nops` which is
-  *precisely the desired and intended behaviour*.
-
-The second crucial aspect, compared to LDST Ffirst:
-
-* LD/ST Failfirst may (beyond the initial first element
-  conditions) truncate VL for any architecturally suitable reason. Beyond
-  the first element LD/ST Failfirst is arbitrarily speculative and 100%
-  non-deterministic.
-* CR-based data-dependent first on the other hand MUST NOT truncate VL
-  arbitrarily to a length decided by the hardware: VL MUST only be
-  truncated based explicitly on whether a test fails.  This because it is
-  a precise Deterministic test on which algorithms can and will will rely.
-
-**Floating-point Exceptions**
-
-When Floating-point exceptions are enabled VL must be truncated at
-the point where the Exception appears not to have occurred. If `VLi`
-is set then VL must include the faulting element, and thus the faulting
-element will always raise its exception.  If however `VLi` is clear then
-VL **excludes** the faulting element and thus the exception will **never**
-be raised.
-
-Although very strongly discouraged the Exception Mode that permits
-Floating Point Exception notification to arrive too late to unwind
-is permitted (under protest, due it violating the otherwise 100%
-Deterministic nature of Data-dependent Fail-first) and is `UNDEFINED`
-behaviour.
-
-**Use of lax FP Exception Notification Mode could result in parallel
-computations proceeding with invalid results that have to be explicitly
-detected, whereas with the strict FP Execption Mode enabled, FFirst
-truncates VL, allows subsequent parallel computation to avoid the
-exceptions entirely**
-
-## Data-dependent fail-first on CR operations (crand etc)
-
-Operations that actually produce or alter CR Field as a result have
-their own SVP64 Mode, described in [[sv/cr_ops]].
-
-## pred-result mode
-
-This mode merges common CR testing with predication, saving on instruction
-count. Below is the pseudocode excluding predicate zeroing and elwidth
-overrides. Note that the pseudocode for SVP64 CR-ops is slightly different.
+**Impact on software**:
  
  ```
-    for i in range(VL):
-        # predication test, skip all masked out elements.
-        if predicate_masked_out(i):
-             continue
-        result = op(iregs[RA+i], iregs[RB+i])
-        CRnew = analyse(result) # calculates eq/lt/gt
-        # Rc=1 always stores the CR field
-        if Rc=1 or RC1:
-            CR.field[offs+i] = CRnew
-        # now test CR, similar to branch
-        if RC1 or CR.field[BO[0:1]] != BO[2]:
-            continue # test failed: cancel store
-        # result optionally stored but CR always is
-        iregs[RT+i] = result
+    Requires support for new instructions in assembler, debuggers, and related tools.
+    Dramatically reduces instructions. Requires introduction of term "High-Level Assembler"
  ```
  
-The reason for allowing the CR element to be stored is so that
-post-analysis of the CR Vector may be carried out.  For example:
-Saturation may have occurred (and been prevented from updating, by the
-test) but it is desirable to know *which* elements fail saturation.
-
-Note that RC1 Mode basically turns all operations into `cmp`.  The
-calculation is performed but it is only the CR that is written. The
-element result is *always* discarded, never written (just like `cmp`).
-
-Note that predication is still respected: predicate zeroing is slightly
-different: elements that fail the CR test *or* are masked out are zero'd.
-
---------
-
-\newpage{}
-
-# SV Load and Store
-
-**Rationale**
-
-All Vector ISAs dating back fifty years have extensive and comprehensive
-Load and Store operations that go far beyond the capabilities of Scalar
-RISC and most CISC processors, yet at their heart on an individual element
-basis may be found to be no different from RISC Scalar equivalents.
-
-The resource savings from Vector LD/ST are significant and stem
-from the fact that one single instruction can trigger a dozen (or in
-some microarchitectures such as Cray or NEC SX Aurora) hundreds of
-element-level Memory accesses.
-
-Additionally, and simply: if the Arithmetic side of an ISA supports
-Vector Operations, then in order to keep the ALUs 100% occupied the
-Memory infrastructure (and the ISA itself) correspondingly needs Vector
-Memory Operations as well.
-
-Vectorised Load and Store also presents an extra dimension (literally)
-which creates scenarios unique to Vector applications, that a Scalar
-(and even a SIMD) ISA simply never encounters.  SVP64 endeavours to add
-the modes typically found in *all* Scalable Vector ISAs, without changing
-the behaviour of the underlying Base (Scalar) v3.0B operations in any way.
-
-## Modes overview
-
-Vectorisation of Load and Store requires creation, from scalar operations,
-a number of different modes:
-
-* **fixed aka "unit" stride** - contiguous sequence with no gaps
-* **element strided** - sequential but regularly offset, with gaps
-* **vector indexed** - vector of base addresses and vector of offsets
-* **Speculative fail-first** - where it makes sense to do so
-* **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
-
-*Despite being constructed from Scalar LD/ST none of these Modes exist
-or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
-
-Also included in SVP64 LD/ST is both signed and unsigned Saturation,
-as well as Element-width overrides and Twin-Predication.
-
-Note also that Indexed [[sv/remap]] mode may be applied to both v3.0
-LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
-LD/ST-Indexed should not be conflated with Indexed REMAP mode:
-clarification is provided below.
-
-**Determining the LD/ST Modes**
-
-A minor complication (caused by the retro-fitting of modern Vector
-features to a Scalar ISA) is that certain features do not exactly make
-sense or are considered a security risk.  Fail-first on Vector Indexed
-would allow attackers to probe large numbers of pages from userspace,
-where strided fail-first (by creating contiguous sequential LDs) does not.
-
-In addition, reduce mode makes no sense.  Realistically we need an
-alternative table definition for [[sv/svp64]] `RM.MODE`.  The following
-modes make sense:
-
-* saturation
-* predicate-result (mostly for cache-inhibited LD/ST)
-* simple (no augmentation)
-* fail-first (where Vector Indexed is banned)
-* Signed Effective Address computation (Vector Indexed only)
-
-More than that however it is necessary to fit the usual Vector ISA
-capabilities onto both Power ISA LD/ST with immediate and to LD/ST
-Indexed. They present subtly different Mode tables, which, due to lack
-of space, have the following quirks:
-
-* LD/ST Immediate has no individual control over src/dest zeroing,
-  whereas LD/ST Indexed does.
-* LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
-* LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
-
-## Format and fields
-
-Fields used in tables below:
-
-* **sz / dz**  if predication is enabled will put zeros into the dest
-  (or as src in the case of twin pred) when the predicate bit is zero.
-  otherwise the element is ignored or skipped, depending on context.
-* **zz**: both sz and dz are set equal to this flag.
-* **inv CR bit** just as in branches (BO) these bits allow testing of
-  a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **N** sets signed/unsigned saturation.
-* **RC1** as if Rc=1, stores CRs *but not the result*
-* **SEA** - Signed Effective Address, if enabled performs sign-extension on
-  registers that have been reduced due to elwidth overrides
-* **PI** - post-increment mode (applies to LD/ST with update only).
-  the Effective Address utilised is always just RA, i.e. the computation of
-  EA is stored in RA **after** it is actually used.
-* **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
-  may be truncated to (at least) one element, and VL altered to indicate such.
-
-**LD/ST immediate**
-
-The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
-(bits 19:23 of `RM`) is:
-
-| 0-1 |  2  |  3   4  |  description               |
-| --- | --- |---------|--------------------------- |
-| 00  | 0   |  zz els | simple mode                |
-| 00  | 1   | PI  LF  | post-increment and Fault-First  |
-| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
-| 01  | inv | els RC1 |  Rc=0: ffirst z/nonz       |
-| 10  |   N | zz  els |  sat mode: N=0/1 u/s       |
-| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel  |
-| 11  | inv | els RC1 |  Rc=0: pred-result z/nonz  |
-
-The `els` bit is only relevant when `RA.isvec` is clear: this indicates
-whether stride is unit or element:
+**Keywords**:
  
  ```
-    if RA.isvec:
-        svctx.ldstmode = indexed
-    elif els == 0:
-        svctx.ldstmode = unitstride
-    elif immediate != 0:
-        svctx.ldstmode = elementstride
+    Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
+    True-Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
+    Digital Signal Processing (DSP), High-level Assembler
  ```
  
-An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
-the multiplication of the immediate-offset by zero results in reading from
-the exact same memory location, *even with a Vector register*. (Normally
-this type of behaviour is reserved for the mapreduce modes)
-
-For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
-the once and be copied, rather than hitting the Data Cache multiple
-times with the same memory read at the same location.  The benefit of
-Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
-to have multiple data values read in quick succession and stored in
-sequentially numbered registers (but, see Note below).
-
-For non-cache-inhibited ST from a vector source onto a scalar destination:
-with the Vector loop effectively creating multiple memory writes to
-the same location, we can deduce that the last of these will be the
-"successful" one. Thus, implementations are free and clear to optimise
-out the overwriting STs, leaving just the last one as the "winner".
-Bear in mind that predicate masks will skip some elements (in source
-non-zeroing mode).  Cache-inhibited ST operations on the other hand
-**MUST** write out a Vector source multiple successive times to the exact
-same Scalar destination. Just like Cache-inhibited LDs, multiple values
-may be written out in quick succession to a memory-mapped peripheral
-from sequentially-numbered registers.
-
-Note that any memory location may be Cache-inhibited
-(Power ISA v3.1, Book III, 1.6.1, p1033)
-
-*Programmer's Note: an immediate also with a Scalar source as a "VSPLAT"
-mode is simply not possible: there are not enough Mode bits. One single
-Scalar Load operation may be used instead, followed by any arithmetic
-operation (including a simple mv) in "Splat" mode.*
-
-**LD/ST Indexed**
-
-The modes for `RA+RB` indexed version are slightly different
-but are the same `RM.MODE` bits (19:23 of `RM`):
+**Motivation**
  
-| 0-1 |  2  |  3   4  |  description              |
-| --- | --- |---------|-------------------------- |
-| 00  | SEA |  dz  sz | simple mode        |
-| 01  | SEA | dz sz   | Strided (scalar only source)   |
-| 10  |   N | dz   sz |  sat mode: N=0/1 u/s |
-| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
-| 11  | inv | zz  RC1 |  Rc=0: pred-result z/nonz |
+Just at the time when customers are asking for higher performance,
+the seductive lure of SIMD, as outlined in the sigarch "SIMD Considered
+Harmful" article is getting out of control and damaging the reputation
+of mainstream general-purpose ISAs that offer it.  A solution from
+50 years ago exists in the form of Cray-Style True-Scalable Vectors.
+However the usual way that True-Scalable Vector ISAs are done *also*
+adds more instructions and complexifies the ISA.  Simple-V takes a step
+back to a simpler era in computing from half a century ago: the Zilog
+Z80 CPIR and LDIR instructions, and the 8086 REP instruction, and brings
+them forward to Modern-day Computing.  The result is a huge reduction in
+programming complexity, and a strong base to project the Power ISA back
+to the world's most powerful Supercomputing ISA for at least the next two
+decades.
  
-Vector Indexed Strided Mode is qualified as follows:
+**Notes and Observations**:
  
-    if mode = 0b01 and !RA.isvec and !RB.isvec:
-        svctx.ldstmode = elementstride
+Related RFCs are [[ls008]] for the two Management instructions `setvl`
+and `svstep`, and [ls009]] for the REMAP Subsystem. Also [[ls001]] is
+a Dependency as it introduces Primary Opcode 9 64-bit encoding. An
+additional RFC [[ls005]] introduced XLEN on which SVP64 is also critically
+dependent, for Element-width Overrides.
  
-A summary of the effect of Vectorisation of src or dest:
+**Changes**
  
-```
-    imm(RA)  RT.v   RA.v   no stride allowed
-    imm(RA)  RT.s   RA.v   no stride allowed
-    imm(RA)  RT.v   RA.s   stride-select allowed
-    imm(RA)  RT.s   RA.s   not vectorised
-    RA,RB    RT.v  {RA|RB}.v Standard Indexed
-    RA,RB    RT.s  {RA|RB}.v Indexed but single LD (no VSPLAT)
-    RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
-    RA,RB    RT.s  {RA&RB}.s not vectorised (scalar identity)
-```
-
-Signed Effective Address computation is only relevant for Vector Indexed
-Mode, when elwidth overrides are applied.  The source override applies to
-RB, and before adding to RA in order to calculate the Effective Address,
-if SEA is set RB is sign-extended from elwidth bits to the full 64 bits.
-For other Modes (ffirst, saturate), all EA computation with elwidth
-overrides is unsigned.
-
-Note that cache-inhibited LD/ST  when VSPLAT is activated will perform
-**multiple** LD/ST operations, sequentially.  Even with scalar src
-a Cache-inhibited LD will read the same memory location *multiple
-times*, storing the result in successive Vector destination registers.
-This because the cache-inhibit instructions are typically used to read
-and write memory-mapped peripherals.  If a genuine cache-inhibited
-LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
-be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
-value into multiple register destinations.
+Add the following entries to:
  
-Note also that cache-inhibited VSPLAT with Predicate-result is possible.
-This allows for example to issue a massive batch of memory-mapped
-peripheral reads, stopping at the first NULL-terminated character and
-truncating VL to that point. No branch is needed to issue that large
-burst of LDs, which may be valuable in Embedded scenarios.
-
-## Vectorisation of Scalar Power ISA v3.0B
-
-Scalar Power ISA Load/Store operations may be seen from their
-pseudocode to be of the form:
-
-```
-    lbux RT, RA, RB
-    EA <- (RA) + (RB)
-    RT <- MEM(EA)
-```
-
-and for immediate variants:
-
-```
-    lb RT,D(RA)
-    EA <- RA + EXTS(D)
-    RT <- MEM(EA)
-```
-
-Thus in the first example, the source registers may each be independently
-marked as scalar or vector, and likewise the destination; in the second
-example only the one source and one dest may be marked as scalar or
-vector.
-
-Thus we can see that Vector Indexed may be covered, and, as demonstrated
-with the pseudocode below, the immediate can be used to give unit
-stride or element stride.  With there being no way to tell which from
-the Power v3.0B Scalar opcode alone, the choice is provided instead by
-the SV Context.
-
-```
-    # LD not VLD!  format - ldop RT, immed(RA)
-    # op_width: lb=1, lh=2, lw=4, ld=8
-    op_load(RT, RA, op_width, immed, svctx, RAupdate):
-      ps = get_pred_val(FALSE, RA); # predication on src
-      pd = get_pred_val(FALSE, RT); # ... AND on dest
-      for (i=0, j=0, u=0; i < VL && j < VL;):
-        # skip nonpredicates elements
-        if (RA.isvec) while (!(ps & 1<<i)) i++;
-        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
-        if (RT.isvec) while (!(pd & 1<<j)) j++;
-        if postinc:
-            offs = 0; # added afterwards
-            if RA.isvec: srcbase = ireg[RA+i]
-            else         srcbase = ireg[RA]
-        elif svctx.ldstmode == elementstride:
-          # element stride mode
-          srcbase = ireg[RA]
-          offs = i * immed              # j*immed for a ST
-        elif svctx.ldstmode == unitstride:
-          # unit stride mode
-          srcbase = ireg[RA]
-          offs = immed + (i * op_width) # j*op_width for ST
-        elif RA.isvec:
-          # quirky Vector indexed mode but with an immediate
-          srcbase = ireg[RA+i]
-          offs = immed;
-        else
-          # standard scalar mode (but predicated)
-          # no stride multiplier means VSPLAT mode
-          srcbase = ireg[RA]
-          offs = immed
-
-        # compute EA
-        EA = srcbase + offs
-        # load from memory
-        ireg[RT+j] <= MEM[EA];
-        # check post-increment of EA
-        if postinc: EA = srcbase + immed;
-        # update RA?
-        if RAupdate: ireg[RAupdate+u] = EA;
-        if (!RT.isvec)
-            break # destination scalar, end now
-        if (RA.isvec) i++;
-        if (RAupdate.isvec) u++;
-        if (RT.isvec) j++;
-```
-
-Indexed LD is:
-
-```
-    # format: ldop RT, RA, RB
-    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
-      ps = get_pred_val(FALSE, RA); # predication on src
-      pd = get_pred_val(FALSE, RT); # ... AND on dest
-      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
-        # skip nonpredicated RA, RB and RT
-        if (RA.isvec) while (!(ps & 1<<i)) i++;
-        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
-        if (RB.isvec) while (!(ps & 1<<k)) k++;
-        if (RT.isvec) while (!(pd & 1<<j)) j++;
-        if svctx.ldstmode == elementstride:
-            EA = ireg[RA] + ireg[RB]*j   # register-strided
-        else
-            EA = ireg[RA+i] + ireg[RB+k] # indexed address
-        if RAupdate: ireg[RAupdate+u] = EA
-        ireg[RT+j] <= MEM[EA];
-        if (!RT.isvec)
-            break # destination scalar, end immediately
-        if (RA.isvec) i++;
-        if (RAupdate.isvec) u++;
-        if (RB.isvec) k++;
-        if (RT.isvec) j++;
-```
-
-Note that Element-Strided uses the Destination Step because with both
-sources being Scalar as a prerequisite condition of activation of
-Element-Stride Mode, the source step (being Scalar) would never advance.
-
-Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
-mode (`ldux`) to be effectively a *completely different* register from
-RA-as-a-source.  This because there is room in svp64 to extend RA-as-src
-as well as RA-as-dest, both independently as scalar or vector *and*
-independently extending their range.
-
-*Programmer's note: being able to set RA-as-a-source as separate from
-RA-as-a-destination as Scalar is **extremely valuable** once it is
-remembered that Simple-V element operations must be in Program Order,
-especially in loops, for saving on multiple address computations. Care
-does have to be taken however that RA-as-src is not overwritten by
-RA-as-dest unless intentionally desired, especially in element-strided
-Mode.*
-
-## LD/ST Indexed vs Indexed REMAP
-
-Unfortunately the word "Indexed" is used twice in completely different
-contexts, potentially causing confusion.
-
-* There has existed instructions in the Power ISA `ld RT,RA,RB` since
-  its creation: these are called "LD/ST Indexed" instructions and their
-  name and meaning is well-established.
-* There now exists, in Simple-V, a REMAP mode called "Indexed"
-  Mode that can be applied to *any* instruction **including those
-  named LD/ST Indexed**.
-
-Whilst it may be costly in terms of register reads to allow REMAP Indexed
-Mode to be applied to any Vectorised LD/ST Indexed operation such as
-`sv.ld *RT,RA,*RB`, or even misleadingly labelled  as redundant, firstly
-the strict application of the RISC Paradigm that Simple-V follows makes
-it awkward to consider *preventing* the application of Indexed REMAP to
-such operations, and secondly they are not actually the same at all.
-
-Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
-effectively performs an *in-place* re-ordering of the offsets, RB.
-To achieve the same effect without Indexed REMAP would require taking
-a *copy* of the Vector of offsets starting at RB, manually explicitly
-reordering them, and finally using the copy of re-ordered offsets in a
-non-REMAP'ed `sv.ld`.  Using non-strided LD as an example, pseudocode
-showing what actually occurs, where the pseudocode for `indexed_remap`
-may be found in [[sv/remap]]:
-
-```
-    # sv.ld *RT,RA,*RB with Index REMAP applied to RB
-    for i in 0..VL-1:
-        if remap.indexed:
-            rb_idx = indexed_remap(i) # remap
-        else:
-            rb_idx = i # use the index as-is
-        EA = GPR(RA) + GPR(RB+rb_idx)
-        GPR(RT+i) = MEM(EA, 8)
-```
-
-Thus it can be seen that the use of Indexed REMAP saves copying
-and manual reordering of the Vector of RB offsets.
-
-## LD/ST ffirst
-
-LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
-is not active) as an ordinary one, with all behaviour with respect to
-Interrupts Exceptions Page Faults Memory Management being identical
-in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
-1 and above, if an exception would occur, then VL is **truncated**
-to the previous element: the exception is **not** then raised because
-the LD/ST that would otherwise have caused an exception is *required*
-to be cancelled. Additionally an implementor may choose to truncate VL
-for any arbitrary reason *except for the very first*.
-
-ffirst LD/ST to multiple pages via a Vectorised Index base is
-considered a security risk due to the abuse of probing multiple
-pages in rapid succession and getting speculative feedback on which
-pages would fail.  Therefore Vector Indexed LD/ST is prohibited
-entirely, and the Mode bit instead used for element-strided LD/ST.
-
-```
-    for(i = 0; i < VL; i++)
-        reg[rt + i] = mem[reg[ra] + i * reg[rb]];
-```
-
-High security implementations where any kind of speculative probing of
-memory pages is considered a risk should take advantage of the fact
-that implementations may truncate VL at any point, without requiring
-software to be rewritten and made non-portable. Such implementations may
-choose to *always* set VL=1 which will have the effect of terminating
-any speculative probing (and also adversely affect performance), but
-will at least not require applications to be rewritten.
-
-Low-performance simpler hardware implementations may also choose (always)
-to also set VL=1 as the bare minimum compliant implementation of LD/ST
-Fail-First. It is however critically important to remember that the first
-element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.  **MUST**
-raise exceptions exactly like an ordinary LD/ST.
-
-For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
-for any implementation-specific reason. For example: it is perfectly
-reasonable for implementations to alter VL when ffirst LD or ST operations
-are initiated on a nonaligned boundary, such that within a loop the
-subsequent iteration of that loop begins the following ffirst LD/ST
-operations on an aligned boundary such as the beginning of a cache line,
-or beginning of a Virtual Memory page. Likewise, to reduce workloads or
-balance resources.
-
-Vertical-First Mode is slightly strange in that only one element at a time
-is ever executed anyway.  Given that programmers may legitimately choose
-to alter srcstep and dststep in non-sequential order as part of explicit
-loops, it is neither possible nor safe to make speculative assumptions
-about future LD/STs.  Therefore, Fail-First LD/ST in Vertical-First is
-`UNDEFINED`.  This is very different from Arithmetic (Data-dependent)
-FFirst where Vertical-First Mode is fully deterministic, not speculative.
-
-## LOAD/STORE Elwidths <a name="elwidth"></a>
-
-Loads and Stores are almost unique in that the Power Scalar ISA
-provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
-others like it provide an explicit operation width.  There are therefore
-*three* widths involved:
-
-* operation width (lb=8, lh=16, lw=32, ld=64)
-* src element width override (8/16/32/default)
-* destination element width override (8/16/32/default)
-
-Some care is therefore needed to express and make clear the transformations,
-which are expressly in this order:
-
-* Calculate the Effective Address from RA at full width
-  but (on Indexed Load) allow srcwidth overrides on RB
-* Load at the operation width (lb/lh/lw/ld) as usual
-* byte-reversal as usual
-* Non-saturated mode:
-   - zero-extension or truncation from operation width to dest elwidth
-   - place result in destination at dest elwidth
-* Saturated mode:
-   - Sign-extension or truncation from operation width to dest width
-   - signed/unsigned saturation down to dest elwidth
-
-In order to respect Power v3.0B Scalar behaviour the memory side
-is treated effectively as completely separate and distinct from SV
-augmentation.  This is primarily down to quirks surrounding LE/BE and
-byte-reversal.
-
-It is rather unfortunately possible to request an elwidth override on
-the memory side which does not mesh with the overridden operation width:
-these result in `UNDEFINED` behaviour.  The reason is that the effect
-of attempting a 64-bit `sv.ld` operation with a source elwidth override
-of 8/16/32 would result in overlapping memory requests, particularly
-on unit and element strided operations.  Thus it is `UNDEFINED` when
-the elwidth is smaller than the memory operation width. Examples include
-`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
-from each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
-where the dest elwidth override is less than the operation width.
-
-Note the following regarding the pseudocode to follow:
-
-* `scalar identity behaviour` SV Context parameter conditions turn this
-  into a straight absolute fully-compliant Scalar v3.0B LD operation
-* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
-  rather than `ld`)
-* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
-  a "normal" part of Scalar v3.0B LD
-* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
-  as a "normal" part of Scalar v3.0B LD
-* `svctx` specifies the SV Context and includes VL as well as
-  source and destination elwidth overrides.
-
-Below is the pseudocode for Unit-Strided LD (which includes Vector
-capability). Observe in particular that RA, as the base address in both
-Immediate and Indexed LD/ST, does not have element-width overriding
-applied to it.
-
-Note that predication, predication-zeroing, and other modes except
-saturation have all been removed, for clarity and simplicity:
-
-```
-    # LD not VLD!
-    # this covers unit stride mode and a type of vector offset
-    function op_ld(RT, RA, op_width, imm_offs, svctx)
-      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
-        if not svctx.unit/el-strided:
-            # strange vector mode, compute 64 bit address which is
-            # not polymorphic! elwidth hardcoded to 64 here
-            srcbase = get_polymorphed_reg(RA, 64, i)
-        else:
-            # unit / element stride mode, compute 64 bit address
-            srcbase = get_polymorphed_reg(RA, 64, 0)
-            # adjust for unit/el-stride
-            srcbase += ....
-
-        # read the underlying memory
-        memread <= MEM(srcbase + imm_offs, op_width)
-
-        # check saturation.
-        if svpctx.saturation_mode:
-            # ... saturation adjustment...
-            memread = clamp(memread, op_width, svctx.dest_elwidth)
-        else:
-            # truncate/extend to over-ridden dest width.
-            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
-
-        # takes care of inserting memory-read (now correctly byteswapped)
-        # into regfile underlying LE-defined order, into the right place
-        # within the NEON-like register, respecting destination element
-        # bitwidth, and the element index (j)
-        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
-
-        # increments both src and dest element indices (no predication here)
-        i++;
-        j++;
-```
-
-Note above that the source elwidth is *not used at all* in LD-immediate.
-
-For LD/Indexed, the key is that in the calculation of the Effective Address,
-RA has no elwidth override but RB does.  Pseudocode below is simplified
-for clarity: predication and all modes except saturation are removed:
-
-```
-    # LD not VLD! ld*rx if brev else ld*
-    function op_ld(RT, RA, RB, op_width, svctx, brev)
-      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
-        if not svctx.el-strided:
-            # RA not polymorphic! elwidth hardcoded to 64 here
-            srcbase = get_polymorphed_reg(RA, 64, i)
-        else:
-            # element stride mode, again RA not polymorphic
-            srcbase = get_polymorphed_reg(RA, 64, 0)
-        # RB *is* polymorphic
-        offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
-        # sign-extend
-        if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)
-
-        # takes care of (merges) processor LE/BE and ld/ldbrx
-        bytereverse = brev XNOR MSR.LE
-
-        # read the underlying memory
-        memread <= MEM(srcbase + offs, op_width)
-
-        # optionally performs byteswap at op width
-        if (bytereverse):
-            memread = byteswap(memread, op_width)
-
-        if svpctx.saturation_mode:
-            # ... saturation adjustment...
-            memread = clamp(memread, op_width, svctx.dest_elwidth)
-        else:
-            # truncate/extend to over-ridden dest width.
-            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
-
-        # takes care of inserting memory-read (now correctly byteswapped)
-        # into regfile underlying LE-defined order, into the right place
-        # within the NEON-like register, respecting destination element
-        # bitwidth, and the element index (j)
-        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
-
-        # increments both src and dest element indices (no predication here)
-        i++;
-        j++;
-```
-
-## Remapped LD/ST
-
-In the [[sv/remap]] page the concept of "Remapping" is described.  Whilst
-it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
-arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
-of LDs or STs.  The usual interest in such re-mapping is for example in
-separating out 24-bit RGB channel data into separate contiguous registers.
-
-REMAP easily covers this capability, and with dest elwidth overrides
-and saturation may do so with built-in conversion that would normally
-require additional width-extension, sign-extension and min/max Vectorised
-instructions as post-processing stages.
-
-Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
-because the generic abstracted concept of "Remapping", when applied to
-LD/ST, will give that same capability, with far more flexibility.
-
-It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
-established through `svstep`, are also an easy way to perform regular
-Structure Packing, at the vec2/vec3/vec4 granularity level.  Beyond that,
-REMAP will need to be used.
-
---------
-
-\newpage{}
-
-# Condition Register SVP64 Operations
-
-Condition Register Fields are only 4 bits wide: this presents some
-interesting conceptual challenges for SVP64, which was designed
-primarily for vectors of arithmetic and logical operations. However
-if predicates may be bits of CR Fields it makes sense to extend
-Simple-V to cover CR Operations, especially given that Vectorised Rc=1
-may be processed by Vectorised CR Operations tbat usefully in turn
-may become Predicate Masks to yet more Vector operations, like so:
-
-```
-    sv.cmpi/ew=8 *B,*ra,0    # compare bytes against zero
-    sv.cmpi/ew=8 *B2,*ra,13. # and against newline
-    sv.cror PM.EQ,B.EQ,B2.EQ # OR compares to create mask
-    sv.stb/sm=EQ    ...      # store only nonzero/newline
-```
-
-Element width however is clearly meaningless for a 4-bit collation of
-Conditions, EQ LT GE SO. Likewise, arithmetic saturation (an important
-part of Arithmetic SVP64) has no meaning. An alternative Mode Format is
-required, and given that elwidths are meaningless for CR Fields the bits
-in SVP64 `RM` may be used for other purposes.
-
-This alternative mapping **only** applies to instructions that **only**
-reference a CR Field or CR bit as the sole exclusive result. This section
-**does not** apply to instructions which primarily produce arithmetic
-results that also, as an aside, produce a corresponding CR Field (such as
-when Rc=1).  Instructions that involve Rc=1 are definitively arithmetic
-in nature, where the corresponding Condition Register Field can be
-considered to be a "co-result". Such CR Field "co-result" arithmeric
-operations are firmly out of scope for this section, being covered fully
-by [[sv/normal]].
-
-* Examples of v3.0B instructions to which this section does
-  apply is
-  - `mfcr` and `cmpi` (3 bit operands) and
-  - `crnor` and `crand` (5 bit operands).
-* Examples to which this section does **not** apply include
-  `fadds.` and `subf.` which both produce arithmetic results
-  (and a CR Field co-result).
-
-The CR Mode Format still applies to `sv.cmpi` because despite
-taking a GPR as input, the output from the Base Scalar v3.0B `cmpi`
-instruction is purely to a Condition Register Field.
-
-Other modes are still applicable and include:
-
-* **Data-dependent fail-first**.
-  useful to truncate VL based on analysis of a Condition Register result bit.
-* **Reduction**.
-  Reduction is useful for analysing a Vector of Condition Register Fields
-  and reducing it to one single Condition Register Field.
-
-Predicate-result does not make any sense because when Rc=1 a co-result
-is created (a CR Field). Testing the co-result allows the decision to
-be made to store or not store the main result, and for CR Ops the CR
-Field result *is* the main result.
-
-## Format
-
-SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations:
-
-|6 | 7 |19-20|  21 | 22   23 |  description     |
-|--|---|-----| --- |---------|----------------- |
-|/ | / |0 RG |   0 | dz  sz  | simple mode                      |
-|/ | / |0 RG |   1 | dz  sz  | scalar reduce mode (mapreduce) |
-|zz|SNZ|1 VLI| inv |  CR-bit | Ffirst 3-bit mode      |
-|/ |SNZ|1 VLI| inv |  dz sz  | Ffirst 5-bit mode (implies CR-bit from result) |
-
-Fields:
-
-* **sz / dz**  if predication is enabled will put zeros into the dest
- (or as src in the case of twin pred) when the predicate bit is zero.
-  otherwise the element is ignored or skipped, depending on context.
-* **zz** set both sz and dz equal to this flag
-* **SNZ** In fail-first mode, on the bit being tested, when sz=1 and
-  SNZ=1 a value "1" is put in place of "0".
-* **inv CR-bit** just as in branches (BO) these bits allow testing of
-  a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **RG** inverts the Vector Loop order (VL-1 downto 0) rather
-  than the normal 0..VL-1
-* **SVM** sets "subvector" reduce mode
-* **VLi** VL inclusive: in fail-first mode, the truncation of
-  VL *includes* the current element at the failure point rather
-  than excludes it from the count.
-
-## Data-dependent fail-first on CR operations
-
-The principle of data-dependent fail-first is that if, during the course
-of sequentially evaluating an element's Condition Test, one such test
-is encountered which fails, then VL (Vector Length) is truncated (set)
-at that point. In the case of Arithmetic SVP64 Operations the Condition
-Register Field generated from Rc=1 is used as the basis for the truncation
-decision.  However with CR-based operations that CR Field result to be
-tested is provided *by the operation itself*.
-
-Data-dependent SVP64 Vectorised Operations involving the creation
-or modification of a CR can require an extra two bits, which are not
-available in the compact space of the SVP64 RM `MODE` Field. With the
-concept of element width overrides being meaningless for CR Fields it
-is possible to use the `ELWIDTH` field for alternative purposes.
-
-Condition Register based operations such as `sv.mfcr` and `sv.crand`
-can thus be made more flexible.  However the rules that apply in this
-section also apply to future CR-based instructions.
-
-There are two primary different types of CR operations:
-
-* Those which have a 3-bit operand field (referring to a CR Field)
-* Those which have a 5-bit operand (referring to a bit within the
-   whole 32-bit CR)
-
-Examining these two types it is observed that the difference may
-be considered to be that the 5-bit variant *already* provides the
-prerequisite information about which CR Field bit (EQ, GE, LT, SO) is
-to be operated on by the instruction.  Thus, logically, we may set the
-following rule:
-
-* When a 5-bit CR Result field is used in an instruction, the
-  5-bit variant of Data-Dependent Fail-First
-  must be used. i.e. the bit of the CR field to be tested is
-  the one that has just been modified (created) by the operation.
-* When a 3-bit CR Result field is used the 3-bit variant
-  must be used, providing as it does the missing `CRbit` field
-  in order to select which CR Field bit of the result shall
-  be tested (EQ, LE, GE, SO)
-
-The reason why the 3-bit CR variant needs the additional CR-bit field
-should be obvious from the fact that the 3-bit CR Field from the base
-Power ISA v3.0B operation clearly does not contain and is missing the
-two CR Field Selector bits. Thus, these two bits (to select EQ, LE,
-GE or SO) must be provided in another way.
-
-Examples of the former type:
-
-* crand, cror, crnor. These all are 5-bit (BA, BB, BT). The bit
-  to be tested against `inv` is the one selected by `BT`
-* mcrf. This has only 3-bit (BF, BFA). In order to select the
-  bit to be tested, the alternative encoding must be used.
-  With `CRbit` coming from the SVP64 RM bits 22-23 the bit
-  of BF to be tested is identified.
-
-Just as with SVP64 [[sv/branches]] there is the option to truncate
-VL to include the element being tested (`VLi=1`) and to exclude it
-(`VLi=0`).
-
-Also exactly as with [[sv/normal]] fail-first, VL cannot, unlike
-[[sv/ldst]], be set to an arbitrary value.  Deterministic behaviour
-is *required*.
-
-## Reduction and Iteration
-
-Bearing in mind as described in the svp64 Appendix, SVP64 Horizontal
-Reduction is a deterministic schedule on top of base Scalar v3.0
-operations, the same rules apply to CR Operations, i.e. that programmers
-must follow certain conventions in order for an *end result* of a
-reduction to be achieved.  Unlike other Vector ISAs *there are no explicit
-reduction opcodes* in SVP64: Schedules however achieve the same effect.
-
-Due to these conventions only reduction on operations such as `crand`
-and `cror` are meaningful because these have Condition Register Fields
-as both input and output.  Meaningless operations are not prohibited
-because the cost in hardware of doing so is prohibitive, but neither
-are they `UNDEFINED`. Implementations are still required to execute them
-but are at liberty to optimise out any operations that would ultimately
-be overwritten, as long as Strict Program Order is still obvservable by
-the programmer.
-
-Also bear in mind that 'Reverse Gear' may be enabled, which can be
-used in combination with overlapping CR operations to iteratively
-accumulate results.  Issuing a `sv.crand` operation for example with
-`BA` differing from `BB` by one Condition Register Field would result
-in a cascade effect, where the first-encountered CR Field would set the
-result to zero, and also all subsequent CR Field elements thereafter:
-
-```
-    # sv.crand/mr/rg CR4.ge.v, CR5.ge.v, CR4.ge.v
-    for i in VL-1 downto 0 # reverse gear
-         CR.field[4+i].ge &= CR.field[5+i].ge
-```
-
-`sv.crxor` with reduction would be particularly useful for parity
-calculation for example, although there are many ways in which the same
-calculation could be carried out after transferring a vector of CR Fields
-to a GPR using crweird operations.
-
-Implementations are free and clear to optimise these reductions in any way
-they see fit, as long as the end-result is compatible with Strict Program
-Order being observed, and Interrupt latency is not adversely impacted.
-
-## Unusual and quirky CR operations
-
-**cmp and other compare ops**
-
-`cmp` and `cmpi` etc take GPRs as sources and create a CR Field as a result.
-
-    cmpli BF,L,RA,UI
-    cmpeqb BF,RA,RB
-
-With `ELWIDTH` applying to the source GPR operands this is perfectly fine.
-
-**crweird operations**
-
-There are 4 weird CR-GPR operations and one reasonable one in
-the [[cr_int_predication]] set:
-
-* crrweird
-* mtcrweird
-* crweirder
-* crweird
-* mcrfm - reasonably normal and referring to CR Fields for src and dest.
-
-The "weird" operations have a non-standard behaviour, being able to
-treat *individual bits* of a GPR effectively as elements.  They are
-expected to be Micro-coded by most Hardware implementations.
+* A new "Vector Looping" Book
+* New Vector-Looping Chapters
+* New Vector-Looping Appendices
  
+[[!tag opf_rfc]]
  
  --------
  
  \newpage{}
  
-# SVP64 Branch Conditional behaviour
-
-Please note: although similar, SVP64 Branch instructions should be
-considered completely separate and distinct from standard scalar
-OpenPOWER-approved v3.0B branches.  **v3.0B branches are in no way
-impacted, altered, changed or modified in any way, shape or form by the
-SVP64 Vectorised Variants**.
-
-It is also extremely important to note that Branches are the sole
-pseudo-exception in SVP64 to `Scalar Identity Behaviour`.  SVP64 Branches
-contain additional modes that are useful for scalar operations (i.e. even
-when VL=1 or when using single-bit predication).
-
-**Rationale**
-
-Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test
-a Condition Register.  However for parallel processing it is simply
-impossible to perform multiple independent branches: the Program
-Counter simply cannot branch to multiple destinations based on multiple
-conditions.  The best that can be done is to test multiple Conditions
-and make a decision of a *single* branch, based on analysis of a *Vector*
-of CR Fields which have just been calculated from a *Vector* of results.
-
-In 3D Shader binaries, which are inherently parallelised and predicated,
-testing all or some results and branching based on multiple tests is
-extremely common, and a fundamental part of Shader Compilers.  Example:
-without such multi-condition test-and-branch, if a predicate mask is
-all zeros a large batch of instructions may be masked out to `nop`,
-and it would waste CPU cycles to run them.  3D GPU ISAs can test for
-this scenario and, with the appropriate predicate-analysis instruction,
-jump over fully-masked-out operations, by spotting that *all* Conditions
-are false.
-
-Unless Branches are aware and capable of such analysis, additional
-instructions would be required which perform Horizontal Cumulative
-analysis of Vectorised Condition Register Fields, in order to reduce
-the Vector of CR Fields down to one single yes or no decision that a
-Scalar-only v3.0B Branch-Conditional could cope with.  Such instructions
-would be unavoidable, required, and costly by comparison to a single
-Vector-aware Branch.  Therefore, in order to be commercially competitive,
-`sv.bc` and other Vector-aware Branch Conditional instructions are a
-high priority for 3D GPU (and OpenCL-style) workloads.
-
-Given that Power ISA v3.0B is already quite powerful, particularly
-the Condition Registers and their interaction with Branches, there are
-opportunities to create extremely flexible and compact Vectorised Branch
-behaviour.  In addition, the side-effects (updating of CTR, truncation
-of VL, described below) make it a useful instruction even if the branch
-points to the next instruction (no actual branch).
-
-## Overview
-
-When considering an "array" of branch-tests, there are four
-primarily-useful modes: AND, OR, NAND and NOR of all Conditions.
-NAND and NOR may be synthesised from AND and OR by inverting `BO[1]`
-which just leaves two modes:
-
-* Branch takes place on the **first** CR Field test to succeed
-  (a Great Big OR of all condition tests). Exit occurs
-  on the first **successful** test.
-* Branch takes place only if **all** CR field tests succeed:
-  a Great Big AND of all condition tests.  Exit occurs
-  on the first **failed** test.
-
-Early-exit is enacted such that the Vectorised Branch does not
-perform needless extra tests, which will help reduce reads on
-the Condition Register file.
-
-*Note: Early-exit is **MANDATORY** (required) behaviour.  Branches
-**MUST** exit at the first sequentially-encountered failure point,
-for exactly the same reasons for which it is mandatory in programming
-languages doing early-exit: to avoid damaging side-effects and to provide
-deterministic behaviour. Speculative testing of Condition Register
-Fields is permitted, as is speculative calculation of CTR, as long as,
-as usual in any Out-of-Order microarchitecture, that speculative testing
-is cancelled should an early-exit occur.  i.e. the speculation must be
-"precise": Program Order must be preserved*
-
-Also note that when early-exit occurs in Horizontal-first Mode, srcstep,
-dststep etc. are all reset, ready to begin looping from the beginning
-for the next instruction. However for Vertical-first Mode srcstep
-etc. are incremented "as usual" i.e. an early-exit has no special impact,
-regardless of whether the branch occurred or not. This can leave srcstep
-etc. in what may be considered an unusual state on exit from a loop and
-it is up to the programmer to reset srcstep, dststep etc. to known-good
-values *(easily achieved with `setvl`)*.
-
-Additional useful behaviour involves two primary Modes (both of which
-may be enabled and combined):
-
-* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
-  for Arithmetic SVP64 operations, with more
-  flexibility and a close interaction and integration into the
-  underlying base Scalar v3.0B Branch instruction.
-  Truncation of VL takes place around the early-exit point.
-* **CTR-test Mode**: gives much more flexibility over when and why
-  CTR is decremented, including options to decrement if a Condition
-  test succeeds *or if it fails*.
-
-With these side-effects, basic Boolean Logic Analysis advises that it
-is important to provide a means to enact them each based on whether
-testing succeeds *or fails*. This results in a not-insignificant number
-of additional Mode Augmentation bits, accompanying VLSET and CTR-test
-Modes respectively.
-
-Predicate skipping or zeroing may, as usual with SVP64, be controlled by
-`sz`.  Where the predicate is masked out and zeroing is enabled, then in
-such circumstances the same Boolean Logic Analysis dictates that rather
-than testing only against zero, the option to test against one is also
-prudent. This introduces a new immediate field, `SNZ`, which works in
-conjunction with `sz`.
-
-Vectorised Branches can be used in either SVP64 Horizontal-First or
-Vertical-First Mode. Essentially, at an element level, the behaviour
-is identical in both Modes, although the `ALL` bit is meaningless in
-Vertical-First Mode.
-
-It is also important to bear in mind that, fundamentally, Vectorised
-Branch-Conditional is still extremely close to the Scalar v3.0B
-Branch-Conditional instructions, and that the same v3.0B Scalar
-Branch-Conditional instructions are still *completely separate and
-independent*, being unaltered and unaffected by their SVP64 variants in
-every conceivable way.
-
-*Programming note: One important point is that SVP64 instructions are
-64 bit.  (8 bytes not 4). This needs to be taken into consideration
-when computing branch offsets: the offset is relative to the start of
-the instruction, which **includes** the SVP64 Prefix*
-
-## Format and fields
-
-With element-width overrides being meaningless for Condition Register
-Fields, bits 4 thru 7 of SVP64 RM may be used for additional Mode bits.
-
-SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5, and
-`ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch Conditional:
-
-| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 |  21 | 22  23 |  description     |
-| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
-|ALL|SNZ| / | / | SL |SLu | 0  | 0  | /   | LRu sz | simple mode      |
-|ALL|SNZ| / |VSb| SL |SLu | 0  | 1  | VLI | LRu sz | VLSET mode       |
-|ALL|SNZ|CTi| / | SL |SLu | 1  | 0  | /   | LRu sz | CTR-test mode         |
-|ALL|SNZ|CTi|VSb| SL |SLu | 1  | 1  | VLI | LRu sz | CTR-test+VLSET mode   |
-
-Brief description of fields:
-
-* **sz=1** if predication is enabled and `sz=1` and a predicate
-  element bit is zero, `SNZ` will
-  be substituted in place of the CR bit selected by `BI`,
-  as the Condition tested.
-  Contrast this with
-  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
-  place of masked-out predicate bits.
-* **sz=0** When `sz=0` skipping occurs as usual on
-  masked-out elements, but unlike all
-  other SVP64 behaviour which entirely skips an element with
-  no related side-effects at all, there are certain
-  special circumstances where CTR
-  may be decremented.  See CTR-test Mode, below.
-* **ALL** when set, all branch conditional tests must pass in order for
-  the branch to succeed. When clear, it is the first sequentially
-  encountered successful test that causes the branch to succeed.
-  This is identical behaviour to how programming languages perform
-  early-exit on Boolean Logic chains.
-* **VLI** VLSET is identical to Data-dependent Fail-First mode.
-  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
-  If VLI (Vector Length Inclusive) is clear,
-  VL is truncated to *exclude* the current element, otherwise it is
-  included. SVSTATE.MVL is not altered: only VL.
-* **SL** identical to `LR` except applicable to SVSTATE. If `SL`
-  is set, SVSTATE is transferred to SVLR (conditionally on
-  whether `SLu` is set).
-* **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
-* **LRu**: Link Register Update, used in conjunction with LK=1
-  to make LR update conditional
-* **VSb** In VLSET Mode, after testing,
-  if VSb is set, VL is truncated if the test succeeds.  If VSb is clear,
-  VL is truncated if a test *fails*. Masked-out (skipped)
-  bits are not considered
-  part of testing when `sz=0`
-* **CTi** CTR inversion. CTR-test Mode normally decrements per element
-  tested. CTR inversion decrements if a test *fails*. Only relevant
-  in CTR-test Mode.
-
-LRu and CTR-test modes are where SVP64 Branches subtly differ from
-Scalar v3.0B Branches. `sv.bcl` for example will always update LR,
-whereas `sv.bcl/lru` will only update LR if the branch succeeds.
-
-Of special interest is that when using ALL Mode (Great Big AND of all
-Condition Tests), if `VL=0`, which is rare but can occur in Data-Dependent
-Modes, the Branch will always take place because there will be no failing
-Condition Tests to prevent it. Likewise when not using ALL Mode (Great
-Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not
-to occur because there will be no *successful* Condition Tests to make
-it happen.
-
-## Vectorised CR Field numbering, and Scalar behaviour
-
-It is important to keep in mind that just like all SVP64 instructions,
-the `BI` field of the base v3.0B Branch Conditional instruction may be
-extended by SVP64 EXTRA augmentation, as well as be marked as either
-Scalar or Vector. It is also crucially important to keep in mind that for
-CRs, SVP64 sequentially increments the CR *Field* numbers.  CR *Fields*
-are treated as elements, not bit-numbers of the CR *register*.
-
-The `BI` operand of Branch Conditional operations is five bits, in scalar
-v3.0B this would select one bit of the 32 bit CR, comprising eight CR
-Fields of 4 bits each.  In SVP64 there are 16 32 bit CRs, containing
-128 4-bit CR Fields.  Therefore, the 2 LSBs of `BI` select the bit from
-the CR Field (EQ LT GT SO), and the top 3 bits are extended to either
-scalar or vector and to select CR Fields 0..127 as specified in SVP64
-[[sv/svp64/appendix]].
-
-When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
-then as the usual SVP64 rules apply: the Vector loop ends at the first
-element tested (the first CR *Field*), after taking predication into
-consideration. Thus, also as usual, when a predicate mask is given, and
-`BI` marked as scalar, and `sz` is zero, srcstep skips forward to the
-first non-zero predicated element, and only that one element is tested.
-
-In other words, the fact that this is a Branch Operation (instead of an
-arithmetic one) does not result, ultimately, in significant changes as
-to how SVP64 is fundamentally applied, except with respect to:
-
-* the unique properties associated with conditionally
- changing the Program Counter (aka "a Branch"), resulting in early-out
- opportunities
-* CTR-testing
-
-Both are outlined below, in later sections.
-
-## Horizontal-First and Vertical-First Modes
-
-In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
-AND) results in early exit: no more updates to CTR occur (if requested);
-no branch occurs, and LR is not updated (if requested). Likewise for
-non-ALL mode (Great Big Or) on first success early exit also occurs,
-however this time with the Branch proceeding.  In both cases the testing
-of the Vector of CRs should be done in linear sequential order (or in
-REMAP re-sequenced order): such that tests that are sequentially beyond
-the exit point are *not* carried out. (*Note: it is standard practice
-in Programming languages to exit early from conditional tests, however a
-little unusual to consider in an ISA that is designed for Parallel Vector
-Processing. The reason is to have strictly-defined guaranteed behaviour*)
-
-In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
-behaviour. Given that only one element is being tested at a time in
-Vertical-First Mode, a test designed to be done on multiple bits is
-meaningless.
-
-## Description and Modes
-
-Predication in both INT and CR modes may be applied to `sv.bc` and other
-SVP64 Branch Conditional operations, exactly as they may be applied to
-other SVP64 operations.  When `sz` is zero, any masked-out Branch-element
-operations are not included in condition testing, exactly like all other
-SVP64 operations, *including* side-effects such as potentially updating
-LR or CTR, which will also be skipped. There is *one* exception here,
-which is when `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
-predicate mask bit is also zero: under these special circumstances CTR
-will also decrement.
-
-When `sz` is non-zero, this normally requests insertion of a zero in
-place of the input data, when the relevant predicate mask bit is zero.
-This would mean that a zero is inserted in place of `CR[BI+32]` for
-testing against `BO`, which may not be desirable in all circumstances.
-Therefore, an extra field is provided `SNZ`, which, if set, will insert
-a **one** in place of a masked-out element, instead of a zero.
-
-(*Note: Both options are provided because it is useful to deliberately
-cause the Branch-Conditional Vector testing to fail at a specific point,
-controlled by the Predicate mask. This is particularly useful in `VLSET`
-mode, which will truncate SVSTATE.VL at the point of the first failed
-test.*)
-
-Normally, CTR mode will decrement once per Condition Test, resulting under
-normal circumstances that CTR reduces by up to VL in Horizontal-First
-Mode. Just as when v3.0B Branch-Conditional saves at least one instruction
-on tight inner loops through auto-decrementation of CTR, likewise it
-is also possible to save instruction count for SVP64 loops in both
-Vertical-First and Horizontal-First Mode, particularly in circumstances
-where there is conditional interaction between the element computation
-and testing, and the continuation (or otherwise) of a given loop. The
-potential combinations of interactions is why CTR testing options have
-been added.
-
-Also, the unconditional bit `BO[0]` is still relevant when Predication
-is applied to the Branch because in `ALL` mode all nonmasked bits have
-to be tested, and when `sz=0` skipping occurs.  Even when VLSET mode is
-not used, CTR may still be decremented by the total number of nonmasked
-elements, acting in effect as either a popcount or cntlz depending
-on which mode bits are set.  In short, Vectorised Branch becomes an
-extremely powerful tool.
-
-**Micro-Architectural Implementation Note**: *when implemented on top
-of a Multi-Issue Out-of-Order Engine it is possible to pass a copy of
-the predicate and the prerequisite CR Fields to all Branch Units, as
-well as the current value of CTR at the time of multi-issue, and for
-each Branch Unit to compute how many times CTR would be subtracted,
-in a fully-deterministic and parallel fashion. A SIMD-based Branch
-Unit, receiving and processing multiple CR Fields covered by multiple
-predicate bits, would do the exact same thing. Obviously, however, if
-CTR is modified within any given loop (mtctr) the behaviour of CTR is
-no longer deterministic.*
-
-### Link Register Update
-
-For a Scalar Branch, unconditional updating of the Link Register LR
-is useful and practical. However, if a loop of CR Fields is tested,
-unconditional updating of LR becomes problematic.
-
-For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
-LR's value will be unconditionally overwritten after the first element,
-such that for execution (testing) of the second element, LR has the value
-`CIA+8`. This is covered in the `bclrl` example, in a later section.
-
-The addition of a LRu bit modifies behaviour in conjunction with LK,
-as follows:
-
-* `sv.bc` When LRu=0,LK=0, Link Register is not updated
-* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
-* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
-  only be updated if the Branch Condition fails.
-* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
-  the Branch Condition succeeds.
-
-This avoids destruction of LR during loops (particularly Vertical-First
-ones).
-
-**SVLR and SVSTATE**
-
-For precisely the reasons why `LK=1` was added originally to the Power
-ISA, with SVSTATE being a peer of the Program Counter it becomes necessary
-to also add an SVLR (SVSTATE Link Register) and corresponding control bits
-`SL` and `SLu`.
-
-### CTR-test
-
-Where a standard Scalar v3.0B branch unconditionally decrements CTR when
-`BO[2]` is clear, CTR-test Mode introduces more flexibility which allows
-CTR to be used for many more types of Vector loops constructs.
-
-CTR-test mode and CTi interaction is as follows: note that `BO[2]`
-is still required to be clear for CTR decrements to be considered,
-exactly as is the case in Scalar Power ISA v3.0B
-
-* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
-  if `BO[2]` is zero. Masked-out elements when `sz=0` are
-  skipped (i.e. CTR is *not* decremented when the predicate
-  bit is zero and `sz=0`).
-* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
-  if `BO[2]` is zero and a masked-out element is skipped
-  (`sz=0` and predicate bit is zero). This one special case is the
-  **opposite** of other combinations, as well as being
-  completely different from normal SVP64 `sz=0` behaviour)
-* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
-  if `BO[2]` is zero and the Condition Test succeeds.
-  Masked-out elements when `sz=0` are skipped (including
-  not decrementing CTR)
-* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
-  if `BO[2]` is zero and the Condition Test *fails*.
-  Masked-out elements when `sz=0` are skipped (including
-  not decrementing CTR)
-
-`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
-only time in the entirety of SVP64 that has side-effects when
-a predicate mask bit is clear.  **All** other SVP64 operations
-entirely skip an element when sz=0 and a predicate mask bit is zero.
-It is also critical to emphasise that in this unusual mode,
-no other side-effects occur: **only** CTR is decremented, i.e. the
-rest of the Branch operation is skipped.
-
-### VLSET Mode
-
-VLSET Mode truncates the Vector Length so that subsequent instructions
-operate on a reduced Vector Length. This is similar to Data-dependent
-Fail-First and LD/ST Fail-First, where for VLSET the truncation occurs
-at the Branch decision-point.
-
-Interestingly, due to the side-effects of `VLSET` mode it is actually
-useful to use Branch Conditional even to perform no actual branch
-operation, i.e to point to the instruction after the branch. Truncation of
-VL would thus conditionally occur yet control flow alteration would not.
-
-`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
-is designed to be used for explicit looping, where an explicit call to
-`svstep` is required to move both srcstep and dststep on to the next
-element, until VL (or other condition) is reached.  Vertical-First Looping
-is expected (required) to terminate if the end of the Vector, VL, is
-reached. If however that loop is terminated early because VL is truncated,
-VLSET with Vertical-First becomes meaningless.  Resolving this would
-require two branches: one Conditional, the other branching unconditionally
-to create the loop, where the Conditional one jumps over it.
-
-Therefore, with `VSb`, the option to decide whether truncation should
-occur if the branch succeeds *or* if the branch condition fails allows
-for the flexibility required.  This allows a Vertical-First Branch to
-*either* be used as a branch-back (loop) *or* as part of a conditional
-exit or function call from *inside* a loop, and for VLSET to be integrated
-into both types of decision-making.
-
-In the case of a Vertical-First branch-back (loop), with `VSb=0` the
-branch takes place if success conditions are met, but on exit from that
-loop (branch condition fails), VL will be truncated. This is extremely
-useful.
-
-`VLSET` mode with Horizontal-First when `VSb=0` is still useful, because
-it can be used to truncate VL to the first predicated (non-masked-out)
-element.
-
-The truncation point for VL, when VLi is clear, must not include skipped
-elements that preceded the current element being tested.  Example:
-`sz=0, VLi=0, predicate mask = 0b110010` and the Condition Register
-failure point is at CR Field element 4.
-
-* Testing at element 0 is skipped because its predicate bit is zero
-* Testing at element 1 passed
-* Testing elements 2 and 3 are skipped because their
-  respective predicate mask bits are zero
-* Testing element 4 fails therefore VL is truncated to **2**
-  not 4 due to elements 2 and 3 being skipped.
-
-If `sz=1` in the above example *then* VL would have been set to 4 because
-in non-zeroing mode the zero'd elements are still effectively part of the
-Vector (with their respective elements set to `SNZ`)
-
-If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
-of the element actually being tested.
-
-### VLSET and CTR-test combined
-
-If both CTR-test and VLSET Modes are requested, it is important to
-observe the correct order. What occurs depends on whether VLi is enabled,
-because VLi affects the length, VL.
-
-If VLi (VL truncate inclusive) is set:
-
-1. compute the test including whether CTR triggers
-2. (optionally) decrement CTR
-3. (optionally) truncate VL (VSb inverts the decision)
-4. decide (based on step 1) whether to terminate looping
-   (including not executing step 5)
-5. decide whether to branch.
-
-If VLi is clear, then when a test fails that element
-and any following it
-should **not** be considered part of the Vector. Consequently:
-
-1. compute the branch test including whether CTR triggers
-2. if the test fails against VSb, truncate VL to the *previous*
-   element, and terminate looping. No further steps executed.
-3. (optionally) decrement CTR
-4. decide whether to branch.
-
-## Boolean Logic combinations
-
-In a Scalar ISA, Branch-Conditional testing even of vector results may be
-performed through inversion of tests. NOR of all tests may be performed
-by inversion of the scalar condition and branching *out* from the scalar
-loop around elements, using scalar operations.
-
-In a parallel (Vector) ISA it is the ISA itself which must perform
-the prerequisite logic manipulation.  Thus for SVP64 there are an
-extraordinary number of nesessary combinations which provide completely
-different and useful behaviour.  Available options to combine:
-
-* `BO[0]` to make an unconditional branch would seem irrelevant if
-  it were not for predication and for side-effects (CTR Mode
-  for example)
-* Enabling CTR-test Mode and setting `BO[2]` can still result in the
-  Branch
-  taking place, not because the Condition Test itself failed, but
-  because CTR reached zero **because**, as required by CTR-test mode,
-  CTR was decremented as a  **result** of Condition Tests failing.
-* `BO[1]` to select whether the CR bit being tested is zero or nonzero
-* `R30` and `~R30` and other predicate mask options including CR and
-  inverted CR bit testing
-* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
-  predicate bits
-* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
-  `OR` of all tests, respectively.
-* Predicate Mask bits, which combine in effect with the CR being
-  tested.
-* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
-  `NE` rather than `EQ`) which results in an additional
-  level of possible ANDing, ORing etc. that would otherwise
-  need explicit instructions.
-
-The most obviously useful combinations here are to set `BO[1]` to zero
-in order to turn `ALL` into Great-Big-NAND and `ANY` into Great-Big-NOR.
-Other Mode bits which perform behavioural inversion then have to work
-round the fact that the Condition Testing is NOR or NAND.  The alternative
-to not having additional behavioural inversion (`SNZ`, `VSb`, `CTi`)
-would be to have a second (unconditional) branch directly after the first,
-which the first branch jumps over.  This contrivance is avoided by the
-behavioural inversion bits.
-
-## Pseudocode and examples
-
-Please see the SVP64 appendix regarding CR bit ordering and for
-the definition of `CR{n}`
-
-For comparative purposes this is a copy of the v3.0B `bc` pseudocode
-
-```
-    if (mode_is_64bit) then M <- 0
-    else M <- 32
-    if ¬BO[2] then CTR <- CTR - 1
-    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
-    if ctr_ok & cond_ok then
-      if AA then NIA <-iea EXTS(BD || 0b00)
-      else       NIA <-iea CIA + EXTS(BD || 0b00)
-    if LK then LR  <-iea  CIA + 4
-```
-
-Simplified pseudocode including LRu and CTR skipping, which illustrates
-clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
-v3.0B Scalar Branches.  The key areas where differences occur are the
-inclusion of predication (which can still be used when VL=1), in when and
-why CTR is decremented (CTRtest Mode) and whether LR is updated (which
-is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1).
-
-Inline comments highlight the fact that the Scalar Branch behaviour and
-pseudocode is still clearly visible and embedded within the Vectorised
-variant:
-
-```
-    if (mode_is_64bit) then M <- 0
-    else M <- 32
-    # the bit of CR to test, if the predicate bit is zero,
-    # is overridden
-    testbit = CR[BI+32]
-    if ¬predicate_bit then testbit = SVRMmode.SNZ
-    # otherwise apart from the override ctr_ok and cond_ok
-    # are exactly the same
-    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-    cond_ok <- BO[0] | ¬(testbit ^ BO[1])
-    if ¬predicate_bit & ¬SVRMmode.sz then
-      # this is entirely new: CTR-test mode still decrements CTR
-      # even when predicate-bits are zero
-      if ¬BO[2] & CTRtest & ¬CTi then
-        CTR = CTR - 1
-      # instruction finishes here
-    else
-      # usual BO[2] CTR-mode now under CTR-test mode as well
-      if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
-      # new VLset mode, conditional test truncates VL
-      if VLSET and VSb = (cond_ok & ctr_ok) then
-        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
-        else                 SVSTATE.VL = srcstep
-      # usual LR is now conditional, but also joined by SVLR
-      lr_ok <- LK
-      svlr_ok <- SVRMmode.SL
-      if ctr_ok & cond_ok then
-        if AA then NIA <-iea EXTS(BD || 0b00)
-        else       NIA <-iea CIA + EXTS(BD || 0b00)
-        if SVRMmode.LRu then lr_ok <- ¬lr_ok
-        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
-      if lr_ok   then LR   <-iea CIA + 4
-      if svlr_ok then SVLR <- SVSTATE
-```
-
-Below is the pseudocode for SVP64 Branches, which is a little less
-obvious but identical to the above. The lack of obviousness is down to
-the early-exit opportunities.
-
-Effective pseudocode for Horizontal-First Mode:
-
-```
-    if (mode_is_64bit) then M <- 0
-    else M <- 32
-    cond_ok = not SVRMmode.ALL
-    for srcstep in range(VL):
-        # select predicate bit or zero/one
-        if predicate[srcstep]:
-            # get SVP64 extended CR field 0..127
-            SVCRf = SVP64EXTRA(BI>>2)
-            CRbits = CR{SVCRf}
-            testbit = CRbits[BI & 0b11]
-            # testbit = CR[BI+32+srcstep*4]
-        else if not SVRMmode.sz:
-            # inverted CTR test skip mode
-            if ¬BO[2] & CTRtest & ¬CTI then
-              CTR = CTR - 1
-            continue # skip to next element
-        else
-            testbit = SVRMmode.SNZ
-        # actual element test here
-        ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-        el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
-        # check if CTR dec should occur
-        ctrdec = ¬BO[2]
-        if CTRtest & (el_cond_ok ^ CTi) then
-           ctrdec = 0b0
-        if ctrdec then CTR <- CTR - 1
-        # merge in the test
-        if SVRMmode.ALL:
-            cond_ok &= (el_cond_ok & ctr_ok)
-        else
-            cond_ok |= (el_cond_ok & ctr_ok)
-        # test for VL to be set (and exit)
-        if VLSET and VSb = (el_cond_ok & ctr_ok) then
-            if SVRMmode.VLI then SVSTATE.VL = srcstep+1
-            else                 SVSTATE.VL = srcstep
-            break
-        # early exit?
-        if SVRMmode.ALL != (el_cond_ok & ctr_ok):
-             break
-        # SVP64 rules about Scalar registers still apply!
-        if SVCRf.scalar:
-           break
-    # loop finally done, now test if branch (and update LR)
-    lr_ok <- LK
-    svlr_ok <- SVRMmode.SL
-    if cond_ok then
-        if AA then NIA <-iea EXTS(BD || 0b00)
-        else       NIA <-iea CIA + EXTS(BD || 0b00)
-        if SVRMmode.LRu then lr_ok <- ¬lr_ok
-        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
-    if lr_ok then LR <-iea CIA + 4
-    if svlr_ok then SVLR <- SVSTATE
-```
-
-Pseudocode for Vertical-First Mode:
-
-```
-    # get SVP64 extended CR field 0..127
-    SVCRf = SVP64EXTRA(BI>>2)
-    CRbits = CR{SVCRf}
-    # select predicate bit or zero/one
-    if predicate[srcstep]:
-        if BRc = 1 then # CR0 vectorised
-            CR{SVCRf+srcstep} = CRbits
-        testbit = CRbits[BI & 0b11]
-    else if not SVRMmode.sz:
-        # inverted CTR test skip mode
-        if ¬BO[2] & CTRtest & ¬CTI then
-           CTR = CTR - 1
-        SVSTATE.srcstep = new_srcstep
-        exit # no branch testing
-    else
-        testbit = SVRMmode.SNZ
-    # actual element test here
-    cond_ok <- BO[0] | ¬(testbit ^ BO[1])
-    # test for VL to be set (and exit)
-    if VLSET and cond_ok = VSb then
-        if SVRMmode.VLI
-            SVSTATE.VL = new_srcstep+1
-        else
-            SVSTATE.VL = new_srcstep
-```
-
-### Example Shader code
-
-```
-    // assume f() g() or h() modify a and/or b
-    while(a > 2) {
-        if(b < 5)
-            f();
-        else
-            g();
-        h();
-    }
-```
-
-which compiles to something like:
-
-```
-    vec<i32> a, b;
-    // ...
-    pred loop_pred = a > 2;
-    // loop continues while any of a elements greater than 2
-    while(loop_pred.any()) {
-        // vector of predicate bits
-        pred if_pred = loop_pred & (b < 5);
-        // only call f() if at least 1 bit set
-        if(if_pred.any()) {
-            f(if_pred);
-        }
-    label1:
-        // loop mask ANDs with inverted if-test
-        pred else_pred = loop_pred & ~if_pred;
-        // only call g() if at least 1 bit set
-        if(else_pred.any()) {
-            g(else_pred);
-        }
-        h(loop_pred);
-    }
-```
-
-which will end up as:
-
-```
-       # start from while loop test point
-       b looptest
-    while_loop:
-       sv.cmpi CR80.v, b.v, 5     # vector compare b into CR64 Vector
-       sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
-       # only calculate loop_pred & pred_b because needed in f()
-       sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b
-       f(CR80.v.SO)
-    skip_f:
-       # illustrate inversion of pred_b. invert r30, test ALL
-       # rather than SOME, but masked-out zero test would FAIL,
-       # therefore masked-out instead is tested against 1 not 0
-       sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
-       # else = loop & ~pred_b, need this because used in g()
-       sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT
-       g(CR80.v.SO)
-    skip_g:
-       # conditionally call h(r30) if any loop pred set
-       sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
-    looptest:
-       sv.cmpi CR60.v a.v, 2      # vector compare a into CR60 vector
-       sv.crweird r30, CR60.GT # transfer GT vector to r30
-       sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
-    end:
-```
-
-### LRu example
-
-show why LRu would be useful in a loop.  Imagine the following
-c code:
-
-```
-    for (int i = 0; i < 8; i++) {
-        if (x < y) break;
-    }
-```
-
-Under these circumstances exiting from the loop is not only based on
-CTR it has become conditional on a CR result.  Thus it is desirable that
-NIA *and* LR only be modified if the conditions are met
-
-v3.0 pseudocode for `bclrl`:
-
-```
-    if (mode_is_64bit) then M <- 0
-    else M <- 32
-    if ¬BO[2]  then CTR <- CTR - 1
-    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-    cond_ok <- BO[0] | ¬(CR[BI+32] ^  BO[1])
-    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
-    if LK then LR <-iea CIA + 4
-```
-
-the latter part for SVP64 `bclrl` becomes:
-
-```
-    for i in 0 to VL-1:
-        ...
-        ...
-        cond_ok <- BO[0] | ¬(CR[BI+32] ^  BO[1])
-        lr_ok <- LK
-        if ctr_ok & cond_ok then
-           NIA <-iea LR[0:61] || 0b00
-           if SVRMmode.LRu then lr_ok <- ¬lr_ok
-        if lr_ok then LR <-iea CIA + 4
-        # if NIA modified exit loop
-```
-
-The reason why should be clear from this being a Vector loop:
-unconditional destruction of LR when LK=1 makes `sv.bclrl` ineffective,
-because the intention going into the loop is that the branch should be to
-the copy of LR set at the *start* of the loop, not half way through it.
-However if the change to LR only occurs if the branch is taken then it
-becomes a useful instruction.
-
-The following pseudocode should **not** be implemented because it
-violates the fundamental principle of SVP64 which is that SVP64 looping
-is a thin wrapper around Scalar Instructions.  The pseducode below is
-more an actual Vector ISA Branch and as such is not at all appropriate:
-
-```
-    for i in 0 to VL-1:
-        ...
-        ...
-        cond_ok <- BO[0] | ¬(CR[BI+32] ^  BO[1])
-        if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
-    # only at the end of looping is LK checked.
-    # this completely violates the design principle of SVP64
-    # and would actually need to be a separate (scalar)
-    # instruction "set LR to CIA+4 but retrospectively"
-    # which is clearly impossible
-    if LK then LR <-iea CIA + 4
-```
-
-[[!tag standards]]
+[[!inline pages="openpower/sv/svp64" raw=yes ]]
+[[!inline pages="openpower/sv/normal" raw=yes ]]
+[[!inline pages="openpower/sv/ldst" raw=yes ]]
+[[!inline pages="openpower/sv/branches" raw=yes ]]
+[[!inline pages="openpower/sv/cr_ops" raw=yes ]]
+[[!inline pages="openpower/sv/svp64/appendix" raw=yes ]]
+[[!inline pages="openpower/sv/compliancy_levels" raw=yes ]]