add notes and observations for ls010 SVP64 main book proposal

diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn

index d8d4886e613ee89045464828c3df296317849106..626d1d307ba6203f45c6830b0b48d4556dd27697 100644
--- a/openpower/sv/rfc/ls010.mdwn
+++ b/openpower/sv/rfc/ls010.mdwn
@@ -1,737 +1,109 @@
-# RFC ls009 SVP64 Zero-Overhead Loop Prefix Subsystem
-
-Credits and acknowledgements:
-
-* Luke Leighton
-* Jacob Lifshay
-* Hendrik Boom
-* Richard Wilbur
-* Alexandre Oliva
-* Cesar Strauss
-* NLnet Foundation, for funding
-* OpenPOWER Foundation
-* Paul Mackerras
-* Toshaan Bharvani
-* IBM for the Power ISA itself
-
-Links:
+# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
+**URLs**:
  
+* <https://www.sigarch.org/simd-instructions-considered-harmful/>
+* <https://libre-soc.org/openpower/sv/>
+* <https://libre-soc.org/openpower/sv/rfc/ls010/>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=1045>
+* <https://git.openpower.foundation/isa/PowerISA/issues/64>
+* <https://git.openpower.foundation/isa/PowerISA/issues/121>
  
-# Introduction
-
-This document focuses on the encoding of [[SV|sv]], and assumes familiarity with the same. It does not cover how SV works (merely the instruction encoding), and is therefore best read in conjunction with the [[sv/overview]], as well as the [[sv/svp64_quirks]] section.
-It is also crucial to note that whilst this format augments instruction
-behaviour it works in conjunction with SVSTATE and other [[sv/sprs]].
-
-Except where explicitly stated all bit numbers remain as in the Power ISA:
-in MSB0 form (the bits are numbered from 0 at the MSB on the left
-and counting up as you move rightwards to the LSB end). All bit ranges are inclusive
-(so `4:6` means bits 4, 5, and 6, in MSB0 order).  **All register numbering and
-element numbering however is LSB0 ordering** which is a different convention used
-elsewhere in the Power ISA.
+**Severity**: Major
  
-64-bit instructions are split into two 32-bit words, the prefix and the
-suffix. The prefix always comes before the suffix in PC order.
+**Status**: New
  
-| 0:5    | 6:31         | 32:63        |
-|--------|--------------|--------------|
-| EXT01  | v3.1  Prefix | v3.0/1  Suffix |
+**Date**: 04 Apr 2023 v1
  
-svp64 fits into the "reserved" portions of the v3.1 prefix, making it possible for svp64, v3.0B (or v3.1 including 64 bit prefixed) instructions  to co-exist in the same binary without conflict.
+**Target**: v3.2B
  
-Subset implementations in hardware are permitted, as long as certain
-rules are followed, allowing for full soft-emulation including future
-revisions.  Compliancy Subsets exist to ensure minimum levels of binary
-interoperability expectations within certain environments.
+**Source**: v3.0B
  
-## Register files, elements, and Element-width Overrides
+**Books and Section affected**:
  
-In the Upper Compliancy Levels the size of the GPR and FPR Register files are expanded
-from 32 to 128 entries, and the number of CR Fields expanded from CR0-CR7 to CR0-CR127.
-
-Memory access remains exactly the same: the effects of `MSR.LE` remain exactly the same,
-affecting as they already do and remain **only** on the Load and Store memory-register 
-operation byte-order, and having nothing to do with the
-ordering of the contents of register files or register-register operations.
-
-Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered and for
-numbering to be sequentially incremental the element offset numbering is naturally
-**LSB0-sequentially-incrementing from zero not MSB0-incrementing.**  Expressed exclusively in
-MSB0-numbering, SVP64 is unnecessarily complex to understand: the required
-subtractions from 63, 31, 15 and 7 unfortunately become a hostile minefield.
-Therefore for the purposes of this section the more natural
-**LSB0 numbering is assumed** and it is up to the reader to translate to MSB0 numbering.
+```
+    New Book: new Zero-Overhead-Loop
+    New Appendix, Zero-Overhead-Loop
+```
  
-The Canonical specification for how element-sequential numbering and element-width
-overrides are defined is expressed in the following C structure, assuming a Little-Endian
-system, and naturally using LSB0 numbering everywhere because the ANSI C specification
-is inherently LSB0:
+**Summary**
  
  ```
-    #pragma pack
-    typedef union {
-        uint8_t  b[]; // elwidth 8
-        uint16_t s[]; // elwidth 16
-        uint32_t i[]; // elwidth 32
-        uint64_t l[]; // elwidth 64
-        uint8_t actual_bytes[8];
-    } el_reg_t;
+    Adds a Zero-Overhead-Loop Subsystem based on the Cray True-Scalable Vector concept
+    in a RISC-paradigm fashion.  Total instructions added is six, plus Prefix format.
+```
  
-    el_reg_t int_regfile[128];
+**Submitter**: Luke Leighton (Libre-SOC)
  
-    void get_register_element(el_reg_t* el, int gpr, int element, int width) {
-        switch (width) {
-            case 64: el->l[0] = int_regfile[gpr].l[element]; break;
-            case 32: el->i[0] = int_regfile[gpr].i[element]; break;
-            case 16: el->s[0] = int_regfile[gpr].s[element]; break;
-            case 8 : el->b[0] = int_regfile[gpr].b[element]; break;
-        }
-    }
-    void set_register_element(el_reg_t* el, int gpr, int element, int width) {
-        switch (width) {
-            case 64: int_regfile[gpr].l[element] = el->l[0]; break;
-            case 32: int_regfile[gpr].i[element] = el->i[0]; break;
-            case 16: int_regfile[gpr].s[element] = el->s[0]; break;
-            case 8 : int_regfile[gpr].b[element] = el->b[0]; break;
-        }
-    }
-```
+**Requester**: Libre-SOC
  
-Example add operation implementation when elwidths are 64-bit:
+**Impact on processor**:
  
  ```
- # add RT, RA,RB using the "uint64_t" union member, "l"
- for i in range(VL):
-      int_regfile[RT].l[i] = int_regfile[RA].l[i] + int_regfile[RB].l[i]
+    Addition of new "Zero-Overhead-Loop-Control" DSP-style Vector-style
+    subsystem that in simple low-end (Embedded) systems may be minimalistically
+    and easily implemented by inserting a new fully-independent Pipeline Stage
+    in between Decode and Issue, with very little disruption, and in higher
+    performance pre-existing Multi-Issue Out-of-Order systems seamlessly fits likewise
+    to significantly boost performance.
  ```
  
-However if elwidth overrides are set to 16 for both source and destination:
+**Impact on software**:
  
  ```
- # add RT, RA, RB using the "uint16_t" union member "s"
- for i in range(VL):
-      int_regfile[RT].s[i] = int_regfile[RA].s[i] + int_regfile[RB].s[i]
+    Requires support for new instructions in assembler, debuggers, and related tools.
+    Dramatically reduces instruction count. Requires introduction of the term "High-Level Assembler"
  ```
  
-Hardware Architectural note: to avoid a Read-Modify-Write at the register file it is
-strongly recommended to implement byte-level write-enable lines exactly as has been
-implemented in DRAM ICs for many decades. Additionally the predicate mask bit is advised
-to be associated with the element operation and ultimately passed to the register file.
-When element-width is set to 64-bit the relevant predicate mask bit may be repeated
-eight times and pull all eight write-port byte-level lines HIGH. Clearly when element-width
-is set to 8-bit the relevant predicate mask bit corresponds directly with one single
-byte-level write-enable line.  It is up to the Hardware Architect to then amortise (merge)
-elements together into both PredicatedSIMD Pipelines as well as simultaneous non-overlapping
-Register File writes, to achieve High Performance designs.
-
-## SVP64 encoding features
-
-A number of features need to be compacted into a very small space of only 24 bits:
-
-* Independent per-register Scalar/Vector tagging and range extension on every register
-* Element width overrides on both source and destination
-* Predication on both source and destination
-* Two different sources of predication: INT and CR Fields
-* SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and
-  predicate-result mode.
-
-Different classes of operations require 
-
-# Definition of Reserved in this spec.
-
-For the new fields added in SVP64, instructions that have any of their
-fields set to a reserved value must cause an illegal instruction trap,
-to allow emulation of future instruction sets, or for subsets of SVP64
-to be implemented in hardware and the rest emulated.
-This includes SVP64 SPRs: reading or writing values which are not
-supported in hardware must also raise illegal instruction traps
-in order to allow emulation.
-Unless otherwise stated, reserved values are always all zeros.
-
-This is unlike OpenPower ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero.  Where the standard Power ISA definition
-is intended the red keyword `RESERVED` is used.
-
-# Definition of "UnVectoriseable"
-
-Any operation that inherently makes no sense if repeated is termed "UnVectoriseable"
-or "UnVectorised".  Examples include `sc` or `sync` which have no registers. `mtmsr` is
-also classed as UnVectoriseable because there is only one `MSR`.
-
-# Scalar Identity Behaviour
-
-SVP64 is designed so that when the prefix is all zeros, and
- VL=1, no effect or
-influence occurs (no augmentation) such that all standard Power ISA
-v3.0/v3.1 instructions covered by the prefix are "unaltered". This is termed `scalar identity behaviour` (based on the mathematical definition for "identity", as in, "identity matrix" or better "identity transformation").
-
-Note that this is completely different from when VL=0.  VL=0 turns all operations under its influence into `nops` (regardless of the prefix)
- whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction  (an "identity transformation").
-
-# Register Naming and size
-
-SV Registers are simply the INT, FP and CR register files extended
-linearly to larger sizes; SV Vectorisation iterates sequentially through these registers.
-
-Where the integer regfile in standard scalar
-Power ISA v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127.
-Likewise FP registers are extended to 128 (fp0 to fp127), and CR Fields
-are
-extended to 128 entries, CR0 thru CR127.
-
-The names of the registers therefore reflect a simple linear extension
-of the Power ISA v3.0B / v3.1B register naming, and in hardware this
-would be reflected by a linear increase in the size of the underlying
-SRAM used for the regfiles.
-
-Note: when an EXTRA field (defined below) is zero, SV is deliberately designed
-so that the register fields are identical to as if SV was not in effect
-i.e. under these circumstances (EXTRA=0) the register field names RA,
-RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers.  This is part of
-`scalar identity behaviour` described above.
-
-## Future expansion.
-
-With the way that EXTRA fields are defined and applied to register fields,
-future versions of SV may involve 256 or greater registers. Backwards binary compatibility may be achieved with a PCR bit (Program Compatibility Register).  Further discussion is out of scope for this version of SVP64.
-
-# Remapped Encoding (`RM[0:23]`)
-
-To allow relatively easy remapping of which portions of the Prefix Opcode
-Map are used for SVP64 without needing to rewrite a large portion of the
-SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to
-a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]`
-at the LSB.
-
-The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
-is defined in the Prefix Fields section.
-
-## Prefix Opcode Map (64-bit instruction encoding)
-
-In the original table in the v3.1B Power ISA Spec on p1350, Table 12, prefix bits 6:11 are shown, with their allocations to different v3.1B prefix "modes".
-
-The table below shows how both PowerISA v3.1 instructions and new SVP64 instructions fit;
-empty spaces are yet-to-be-allocated Illegal Instructions.  
-
-| 6:11 | ---000 | ---001 | ---010 | ---011 | ---100 | ---101 | ---110 | ---111 |
-|------|--------|--------|--------|--------|--------|--------|--------|--------|
-|000---| 8LS    | 8LS    | 8LS    | 8LS    | 8LS    | 8LS    | 8LS    | 8LS    |
-|001---|        |        |        |        |        |        |        |        |
-|010---| 8RR    |        |        |        | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-|011---|        |        |        |        | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-|100---| MLS    | MLS    | MLS    | MLS    | MLS    | MLS    | MLS    | MLS    |
-|101---|        |        |        |        |        |        |        |        |
-|110---| MRR    |        |        |        | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-|111---|        | MMIRR  |        |        | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-
-Note that by taking up a block of 16, where in every case bits 7 and 9 are set, this allows svp64 to utilise four bits of the v3.1B Prefix space and "allocate" them to svp64's Remapped Encoding field, instead.
-
-## Prefix Fields
-
-To "activate" svp64 (in a way that does not conflict with v3.1B 64 bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
-(see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
-This is achieved by setting bits 7 and 9 to 1:  
-
-| Name       | Bits    | Value | Description                    |
-|------------|---------|-------|--------------------------------|
-| EXT01      | `0:5`   | `1`   | Indicates Prefixed 64-bit      |
-| `RM[0]`    | `6`     |       | Bit 0 of Remapped Encoding     |
-| SVP64_7    | `7`     | `1`   | Indicates this is SVP64        |
-| `RM[1]`    | `8`     |       | Bit 1 of Remapped Encoding     |
-| SVP64_9    | `9`     | `1`   | Indicates this is SVP64        |
-| `RM[2:23]` | `10:31` |       | Bits 2-23 of Remapped Encoding |
-
-Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
-are constructed:
-
-| 0:5    | 6     | 7 | 8     | 9 | 10:31    |
-|--------|-------|---|-------|---|----------|
-| EXT01  | RM    | 1 | RM    | 1 | RM       |
-| 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
-
-Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1
-instruction.  That instruction becomes "prefixed" with the SVP context: the
-Remapped Encoding field (RM).
-
-It is important to note that unlike v3.1 64-bit prefixed instructions
-there is insufficient space in `RM` to provide identification of
-any SVP64 Fields without first partially decoding the
-32-bit suffix.  Similar to the "Forms" (X-Form, D-Form) the
-`RM` format is individually associated with every instruction.
-
-Extreme caution and care must therefore be taken
-when extending SVP64 in future, to not create unnecessary relationships
-between prefix and suffix that could complicate decoding, adding latency.
-
-# Common RM fields
-
-The following fields are common to all Remapped Encodings:
-
-| Field Name | Field bits | Description                            |
-|------------|------------|----------------------------------------|
-| MASKMODE   | `0`        | Execution (predication) Mask Kind      |
-| MASK       | `1:3`      | Execution Mask                      |
-| SUBVL      | `8:9`      | Sub-vector length                   |                          
-
-The following fields are optional or encoded differently depending
-on context after decoding of the Scalar suffix:
-
-| Field Name | Field bits | Description                            |
-|------------|------------|----------------------------------------|
-| ELWIDTH       | `4:5`      | Element Width                       |
-| ELWIDTH_SRC   | `6:7`      | Element Width for Source      |
-| EXTRA         | `10:18`    | Register Extra encoding                |                          
-| MODE          | `19:23`    | changes Vector behaviour               |
-
-* MODE changes the behaviour of the SV operation (result saturation, mapreduce)
-* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work
-* ELWIDTH and ELWIDTH_SRC override the instruction's destination and source operand width
-* MASK (and MASK_SRC) and MASKMODE provide predication (two types of sources: scalar INT and Vector CR).
-* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix.
-
-Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag.
-
-Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. 
-
-# Mode
-
-Mode is an augmentation of SV behaviour.  Different types of
-instructions have different needs, similar to Power ISA 
-v3.1 64 bit prefix 8LS and MTRR formats apply to different
-instruction types.  Modes include Reduction, Iteration, arithmetic
-saturation, and Fail-First.  More specific details in each
-section and in the [[svp64/appendix]]
-
-* For condition register operations see [[sv/cr_ops]]
-* For LD/ST Modes, see [[sv/ldst]].
-* For Branch modes, see [[sv/branches]]
-* For arithmetic and logical, see [[sv/normal]]
-
-# ELWIDTH Encoding
-
-Default behaviour is set to 0b00 so that zeros follow the convention of
-`scalar identity behaviour`.  In this case it means that elwidth overrides
-are not applicable.  Thus if a 32 bit instruction operates on 32 bit,
-`elwidth=0b00` specifies that this behaviour is unmodified.  Likewise
-when a processor is switched from 64 bit to 32 bit mode, `elwidth=0b00`
-states that, again, the behaviour is not to be modified.
-
-Only when elwidth is nonzero is the element width overridden to the
-explicitly required value.
-
-## Elwidth for Integers:
-
-| Value | Mnemonic       | Description                        |
-|-------|----------------|------------------------------------|
-| 00    | DEFAULT        | default behaviour for operation    |
-| 01    | `ELWIDTH=w`    | Word: 32-bit integer                 |
-| 10    | `ELWIDTH=h`    | Halfword: 16-bit integer             |
-| 11    | `ELWIDTH=b`    | Byte: 8-bit integer                  |
-
-This encoding is chosen such that the element width in bits may be computed as
-`8<<(3-ew)`
-
-## Elwidth for FP Registers:
-
-| Value | Mnemonic       | Description                        |
-|-------|----------------|------------------------------------|
-| 00    | DEFAULT        | default behaviour for FP operation     |
-| 01    | `ELWIDTH=f32`  | 32-bit IEEE 754 Single floating-point  |
-| 10    | `ELWIDTH=f16`  | 16-bit IEEE 754 Half floating-point   |
-| 11    | `ELWIDTH=bf16` | Reserved for `bf16` |
-
-Note:
-[`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
-is reserved for a future implementation of SV
-
-Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall
-perform its operation at **half** the ELWIDTH then padded back out
-to ELWIDTH.  `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
-clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
-then padded back out to fit in IEEE754 FP64, exactly as for Scalar
-v3.0B "single" FP.  Any FP operation ending in "s" where ELWIDTH=f16
-or ELWIDTH=bf16 is reserved and must raise an illegal instruction
-(IEEE754 FP8 or BF8 are not defined). 
-
-## Elwidth for CRs:
-
-Element-width overrides for CR Fields have no meaning. The bits
-are therefore used for other purposes, or when Rc=1, the Elwidth
-applies to the result being tested (a GPR or FPR), but not to the
-Vector of CR Fields.
-
-# SUBVL Encoding
-
-The default for SUBVL is 1 and its encoding is 0b00 to indicate that
-SUBVL is effectively disabled (a SUBVL for-loop of only one element). This
-lines up in combination with all other "default is all zeros" behaviour.
-
-| Value | Mnemonic  | Subvec  | Description            |
-|-------|-----------|---------|------------------------|
-| 00    | `SUBVL=1` | single  | Sub-vector length of 1 |
-| 01    | `SUBVL=2` | vec2    | Sub-vector length of 2 |
-| 10    | `SUBVL=3` | vec3    | Sub-vector length of 3 |
-| 11    | `SUBVL=4` | vec4    | Sub-vector length of 4 |
-
-The SUBVL encoding value may be thought of as an inclusive range of a
-sub-vector.  SUBVL=2 represents a vec2, its encoding is 0b01, therefore
-this may be considered to be elements 0b00 to 0b01 inclusive.
-
-# MASK/MASK_SRC & MASKMODE Encoding
-
-TODO: rename MASK_KIND to MASKMODE
-
-One bit (`MASKMODE`) indicates the mode: CR or Int predication.   The two
-types may not be mixed.
-
-Special note: to disable predication this field must
-be set to zero in combination with Integer Predication also being set
-to 0b000. This has the effect of enabling "all 1s" in the predicate
-mask, which is equivalent to "not having any predication at all"
-and consequently, in combination with all other default zeros, fully
-disables SV (`scalar identity behaviour`).
-
-`MASKMODE` may be set to one of 2 values:
-
-| Value | Description                                          |
-|-----------|------------------------------------------------------|
-| 0         | MASK/MASK_SRC are encoded using Integer Predication  |
-| 1         | MASK/MASK_SRC are encoded using CR-based Predication |
-
-Integer Twin predication has a second set of 3 bits that uses the same
-encoding thus allowing either the same register (r3, r10 or r30) to be used
-for both src and dest, or different regs (one for src, one for dest).
-
-Likewise CR based twin predication has a second set of 3 bits, allowing
-a different test to be applied.
-
-Note that it is assumed that Predicate Masks (whether INT or CR)
-are read *before* the operations proceed.  In practice (for CR Fields)
-this creates an unnecessary block on parallelism.  Therefore,
-it is up to the programmer to ensure that the CR fields used as
-Predicate Masks are not being written to by any parallel Vector Loop.
-Doing so results in **UNDEFINED** behaviour, according to the definition
-outlined in the Power ISA v3.0B Specification.
-
-Hardware Implementations are therefore free and clear to delay reading
-of individual CR fields until the actual predicated element operation
-needs to take place, safe in the knowledge that no programmer will
-have issued a Vector Instruction where previous elements could have
-overwritten (destroyed) not-yet-executed CR-Predicated element operations.
-
-## Integer Predication (MASKMODE=0)
-
-When the predicate mode bit is zero the 3 bits are interpreted as below.
-Twin predication has an identical 3 bit field similarly encoded.
-
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
-
-| Value | Mnemonic | Element `i` enabled if:      |
-|-------|----------|------------------------------|
-| 000   | ALWAYS   | predicate effectively all 1s |
-| 001   | 1 << R3  | `i == R3`                    |
-| 010   | R3       | `R3 & (1 << i)` is non-zero  |
-| 011   | ~R3      | `R3 & (1 << i)` is zero      |
-| 100   | R10      | `R10 & (1 << i)` is non-zero |
-| 101   | ~R10     | `R10 & (1 << i)` is zero     |
-| 110   | R30      | `R30 & (1 << i)` is non-zero |
-| 111   | ~R30     | `R30 & (1 << i)` is zero     |
-
-r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs.
-
-## CR-based Predication (MASKMODE=1)
-
-When the predicate mode bit is one the 3 bits are interpreted as below.
-Twin predication has an identical 3 bit field similarly encoded.
-
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
-
-| Value | Mnemonic | Element `i` is enabled if     |
-|-------|----------|--------------------------|
-| 000   | lt       | `CR[offs+i].LT` is set   |
-| 001   | nl/ge    | `CR[offs+i].LT` is clear |
-| 010   | gt       | `CR[offs+i].GT` is set   |
-| 011   | ng/le    | `CR[offs+i].GT` is clear |
-| 100   | eq       | `CR[offs+i].EQ` is set   |
-| 101   | ne       | `CR[offs+i].EQ` is clear |
-| 110   | so/un    | `CR[offs+i].FU` is set   |
-| 111   | ns/nu    | `CR[offs+i].FU` is clear |
-
-CR based predication.  TODO: select alternate CR for twin predication? see
-[[discussion]]  Overlap of the two CR based predicates must be taken
-into account, so the starting point for one of them must be suitably
-high, or accept that for twin predication VL must not exceed the range
-where overlap will occur, *or* that they use the same starting point
-but select different *bits* of the same CRs
-
-`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below).  Rc=1 operations start from CR8 (TBD).
-
-The CR Predicates chosen must start on a boundary that Vectorised
-CR operations can access cleanly, in full.
-With EXTRA2 restricting starting points
-to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and CR Predicate
-Masks have to be adapted to fit on these boundaries as well.
-
-# Extra Remapped Encoding <a name="extra_remap"> </a>
-
-Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants.  Note that due to the very tight space, the encoding mode is *not* included in the prefix itself.  The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*).
-
-These mappings are part of the SVP64 Specification in exactly the same
-way as X-Form, D-Form. New Scalar instructions added to the Power ISA
-will need a corresponding SVP64 Mapping, which can be derived by-rote
-from examining the Register "Profile" of the instruction.
-
-There are two categories:  Single and Twin Predication.
-Due to space considerations further subdivision of Single Predication
-is based on whether the number of src operands is 2 or 3.  With only
-9 bits available some compromises have to be made.
-
-* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
-* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest)
-* `RM-2P-1S1D` Twin Predication (src=1, dest=1)
-* `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
-* `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
-
-## RM-1P-3S1D
-
-| Field Name | Field bits | Description                            |
-|------------|------------|----------------------------------------|
-| Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding)   |
-| Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
-| Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
-| Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding)   |
-| EXTRA2_MODE   | `18`    | used by `divmod2du` and `maddedu` for RS   |
-
-These are for 3 operand in and either 1 or 2 out instructions.
-3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
-such as `maddedu` have an implicit second destination, RS, the
-selection of which is determined by bit 18.
-
-## RM-1P-2S1D
-
-| Field Name | Field bits | Description                               |
-|------------|------------|-------------------------------------------|
-| Rdest\_EXTRA3 | `10:12` | extends Rdest  |
-| Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1  |
-| Rsrc2\_EXTRA3 | `16:18` | extends Rsrc2  |
-
-These are for 2 operand 1 dest instructions, such as `add RT, RA,
-RB`. However also included are unusual instructions with an implicit dest
-that is identical to its src reg, such as `rlwimi`.
-
-Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would not have sufficient bit fields to allow
-an alternative destination.  With SV however this becomes possible.
-Therefore, the fact that the dest is implicitly also a src should not
-mislead: due to the *prefix* they are different SV regs.
-
-* `rlwimi RA, RS, ...`
-* Rsrc1_EXTRA3 applies to RS as the first src
-* Rsrc2_EXTRA3 applies to RA as the second src
-* Rdest_EXTRA3 applies to RA to create an **independent** dest.
-
-With the addition of the EXTRA bits, the three registers
-each may be *independently* made vector or scalar, and be independently
-augmented to 7 bits in length.
-
-## RM-2P-1S1D/2S
-
-| Field Name | Field bits | Description                 |
-|------------|------------|----------------------------|
-| Rdest_EXTRA3 | `10:12`    | extends Rdest             |
-| Rsrc1_EXTRA3 | `13:15`    | extends Rsrc1             |
-| MASK_SRC     | `16:18`    | Execution Mask for Source |
-
-`RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2.
-
-## RM-1P-2S1D
-
-single-predicate, three registers (2 read, 1 write)
- 
-| Field Name | Field bits | Description                 |
-|------------|------------|----------------------------|
-| Rdest_EXTRA3 | `10:12`    | extends Rdest             |
-| Rsrc1_EXTRA3 | `13:15`    | extends Rsrc1             |
-| Rsrc2_EXTRA3 | `16:18`    | extends Rsrc2             |
-
-## RM-2P-2S1D/1S2D/3S
-
-The primary purpose for this encoding is for Twin Predication on LOAD
-and STORE operations.  See [[sv/ldst]] for detailed analysis.
-
-RM-2P-2S1D:
-
-| Field Name | Field bits | Description                     |
-|------------|------------|----------------------------|
-| Rdest_EXTRA2 | `10:11`  | extends Rdest (R\*\_EXTRA2 Encoding)   |
-| Rsrc1_EXTRA2 | `12:13`  | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
-| Rsrc2_EXTRA2 | `14:15`  | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
-| MASK_SRC     | `16:18`  | Execution Mask for Source     |
-
-Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
-is in bits 10:11, Rdest1_EXTRA2 in 12:13).
-
-Note also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
-
-Note also that LD with update indexed, which takes 2 src and 2 dest
-(e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
-Twin Predication.  Therefore these are treated as RM-2P-2S1D and the
-src spec for RA is also used for the same RA as a dest.
-
-Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. 
-
-# R\*\_EXTRA2/3
-
-EXTRA is the means by which two things are achieved:
-
-1. Registers are marked as either Vector *or Scalar*
-2. Register field numbers (limited typically to 5 bit)
-   are extended in range, both for Scalar and Vector.
-
-The register files are therefore extended:
-
-* INT is extended from r0-31 to r0-127
-* FP is extended from fp0-fp31 to fp0-fp127
-* CR Fields are extended from CR0-7 to CR0-127
-
-However due to pressure in `RM.EXTRA` not all these registers
-are accessible by all instructions, particularly those with
-a large number of operands (`madd`, `isel`).
-
-In the following tables register numbers are constructed from the
-standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2
-or EXTRA3 field from the SV Prefix, determined by the specific
-RM-xx-yyyy designation for a given instruction.
-The prefixing is arranged so that
-interoperability between prefixing and nonprefixing of scalar registers
-is direct and convenient (when the EXTRA field is all zeros).
-
-A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs)
+**Keywords**:
  
  ```
-    if extra3_mode:
-        spec = EXTRA3
-    else:
-        spec = EXTRA2 << 1 # same as EXTRA3, shifted
-    if spec[0]: # vector
-         return (RA << 2) | spec[1:2]
-    else:         # scalar
-         return (spec[1:2] << 5) | RA
+    Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
+    True-Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
+    Digital Signal Processing (DSP), High-level Assembler
  ```
  
-Future versions may extend to 256 by shifting Vector numbering up.
-Scalar will not be altered.
-
-Note that in some cases the range of starting points for Vectors
-is limited. 
-
-## INT/FP EXTRA3
-
-If EXTRA3 is zero, the register maps to
-"scalar identity" (scalar Power ISA field naming).
-
-Fields are as follows:
-
-* Value: R_EXTRA3
-* Mode: register is tagged as scalar or vector
-* Range/Inc: the range of registers accessible from this EXTRA
-  encoding, and the "increment" (accessibility). "/4" means
-  that this EXTRA encoding may only give access (starting point)
-  every 4th register.
-* MSB..LSB: the bit field showing how the register opcode field
-  combines with EXTRA to give (extend) the register number (GPR)
-
-| Value | Mode | Range/Inc | 6..0 |
-|-----------|-------|---------------|---------------------|
-| 000       | Scalar | `r0-r31`/1 | `0b00 RA`      |
-| 001       | Scalar | `r32-r63`/1 | `0b01 RA`      |
-| 010       | Scalar | `r64-r95`/1 | `0b10 RA`      |
-| 011       | Scalar | `r96-r127`/1 | `0b11 RA`      |
-| 100       | Vector | `r0-r124`/4 | `RA 0b00`      |
-| 101       | Vector | `r1-r125`/4 | `RA 0b01`      |
-| 110       | Vector | `r2-r126`/4 | `RA 0b10`      |
-| 111       | Vector | `r3-r127`/4 | `RA 0b11`      |
-
-## INT/FP EXTRA2
-
-If EXTRA2 is zero, the register maps to
-"scalar identity behaviour", i.e. Scalar Power ISA register naming:
-
-| Value | Mode | Range/inc | 6..0 |
-|-----------|-------|---------------|-----------|
-| 00       | Scalar | `r0-r31`/1 | `0b00 RA`     |
-| 01       | Scalar | `r32-r63`/1 | `0b01 RA`      |
-| 10       | Vector | `r0-r124`/4 | `RA 0b00`      |
-| 11       | Vector | `r2-r126`/4 | `RA 0b10`   |
-
-**Note that unlike in EXTRA3, in EXTRA2**:
-
-* the GPR Vectors may only start from
-  `r0, r2, r4, r6, r8...` and likewise FPR Vectors.
-* the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.
-
-as there are insufficient bits to cover the full range.
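
The INT/FP EXTRA3 and EXTRA2 tables above can be sketched in Python as follows. This is a minimal illustration only, not the reference pseudocode: the function name `decode_gpr` and its parameters are hypothetical, and ordinary integer shifts are used instead of MSB0 bit-slices. For EXTRA2 it follows the table, so that the single range bit selects `r32-r63` for Scalars and an offset of 2 for Vectors.

```python
def decode_gpr(ra, extra, extra3_mode):
    """Return (is_vector, regnum) for a 5-bit register field and EXTRA2/3.

    Hypothetical helper: mirrors the INT/FP EXTRA3 and EXTRA2 tables.
    """
    assert 0 <= ra < 32
    if extra3_mode:
        assert 0 <= extra < 8
        is_vector = bool(extra >> 2)      # top bit tags Vector
        low = extra & 0b11                # two range bits
    else:
        assert 0 <= extra < 4
        is_vector = bool(extra >> 1)      # top bit tags Vector
        bit = extra & 1                   # only one range bit available
        # Vector: 0b00 or 0b10 below RA; Scalar: 0b00 or 0b01 above RA
        low = (bit << 1) if is_vector else bit
    if is_vector:
        return True, (ra << 2) | low      # `RA 0bxx` in the tables
    return False, (low << 5) | ra         # `0bxx RA` in the tables
```

When the EXTRA field is zero the function returns the unmodified register number, which is the "scalar identity" behaviour described above.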
-
-## CR Field EXTRA3
-
-CR Field encoding is essentially the same but made more complex due to CRs being bit-based.  See [[svp64/appendix]] for explanation and pseudocode.
-Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
-and Scalars may only go from `CR0, CR1, ... CR31`
-
-Encoding shown MSB down to LSB
-
-For a 5-bit operand (BA, BB, BT):
-
-| Value | Mode | Range/Inc     | 8..5      | 4..2    | 1..0    |
-|-------|------|---------------|-----------| --------|---------|
-| 000   | Scalar | `CR0-CR7`/1   | 0b0000    | BA[4:2] | BA[1:0] |
-| 001   | Scalar | `CR8-CR15`/1  | 0b0001    | BA[4:2] | BA[1:0] |
-| 010   | Scalar | `CR16-CR23`/1 | 0b0010    | BA[4:2] | BA[1:0] |
-| 011   | Scalar | `CR24-CR31`/1 | 0b0011    | BA[4:2] | BA[1:0] |
-| 100   | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000   | BA[1:0] |
-| 101   | Vector | `CR4-CR116`/16 | BA[4:2] 0 | 0b100   | BA[1:0] |
-| 110   | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000   | BA[1:0] |
-| 111   | Vector | `CR12-CR124`/16 | BA[4:2] 1 | 0b100   | BA[1:0] |
-
-For a 3-bit operand (e.g. BFA):
-
-| Value | Mode | Range/Inc     | 6..3      | 2..0    |
-|-------|------|---------------|-----------| --------|
-| 000   | Scalar | `CR0-CR7`/1   | 0b0000    | BFA   |
-| 001   | Scalar | `CR8-CR15`/1  | 0b0001    | BFA      |
-| 010   | Scalar | `CR16-CR23`/1 | 0b0010    | BFA      |
-| 011   | Scalar | `CR24-CR31`/1 | 0b0011    | BFA      |
-| 100   | Vector | `CR0-CR112`/16 | BFA 0 | 0b000   |
-| 101   | Vector | `CR4-CR116`/16 | BFA 0 | 0b100   |
-| 110   | Vector | `CR8-CR120`/16 | BFA 1 | 0b000   |
-| 111   | Vector | `CR12-CR124`/16 | BFA 1 | 0b100   |
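
As a rough illustration of the two CR Field EXTRA3 tables above, the following Python sketch builds the extended CR Field number (7 bits) and the extended CR bit number (9 bits). The function names `decode_cr_field` and `decode_cr_bit` are hypothetical, and the operand is assumed to be split with its upper 3 bits selecting the CR Field and its lower 2 bits selecting the bit within that field, as the `BA[4:2]` and `BA[1:0]` columns suggest.

```python
def decode_cr_field(bf, extra3):
    """Return (is_vector, crfield) for a 3-bit operand such as BFA.

    Hypothetical helper: mirrors the 3-bit operand EXTRA3 table.
    """
    assert 0 <= bf < 8 and 0 <= extra3 < 8
    is_vector = bool(extra3 >> 2)         # top bit tags Vector
    low = extra3 & 0b11                   # remaining two EXTRA3 bits
    if is_vector:
        # bits 6..3 = `BFA x`, bits 2..0 = `0by00`, where xy = low
        return True, (bf << 4) | (low << 2)
    # bits 6..3 = `0b00xy`, bits 2..0 = BFA
    return False, (low << 3) | bf


def decode_cr_bit(ba, extra3):
    """Return (is_vector, crbit) for a 5-bit operand such as BA."""
    assert 0 <= ba < 32
    field, bit = ba >> 2, ba & 0b11       # BA[4:2] and BA[1:0]
    is_vector, crfield = decode_cr_field(field, extra3)
    return is_vector, (crfield << 2) | bit
```

The 5-bit case reuses the 3-bit case because the tables differ only in the two extra low-order bits that select a bit within the CR Field.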
-
-## CR EXTRA2
+**Motivation**
  
-CR encoding is essentially the same but made more complex due to CRs being bit-based.  See separate section for explanation and pseudocode.
-Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
+Just at the time when customers are asking for higher performance,
+the seductive lure of SIMD, as outlined in the sigarch "SIMD Instructions
+Considered Harmful" article, is getting out of control and damaging the reputation
+of mainstream general-purpose ISAs that offer it.  A solution from
+50 years ago exists in the form of Cray-Style True-Scalable Vectors.
+However the usual way that True-Scalable Vector ISAs are done *also*
+adds more instructions and complicates the ISA.  Simple-V takes a step
+back to a simpler era in computing from half a century ago: the Zilog
+Z80 CPIR and LDIR instructions, and the 8086 REP instruction, and brings
+them forward to Modern-day Computing.  The result is a huge reduction in
+programming complexity, and a strong base to project the Power ISA back
+to the world's most powerful Supercomputing ISA for at least the next two
+decades.
  
+**Notes and Observations**:
  
-Encoding shown MSB down to LSB
+Related RFCs are [[ls008]] for the two Management instructions `setvl`
+and `svstep`, and [[ls009]] for the REMAP Subsystem. Also [[ls001]] is
+a Dependency as it introduces Primary Opcode 9 64-bit encoding. An
+additional RFC [[ls005]] introduced XLEN on which SVP64 is also critically
+dependent, for Element-width Overrides.
  
-For a 5-bit operand (BA, BB, BC):
+**Changes**
  
-| Value | Mode   | Range/Inc      | 8..5    | 4..2    | 1..0    |
-|-------|--------|----------------|---------|---------|---------|
-| 00    | Scalar | `CR0-CR7`/1    | 0b0000  | BA[4:2] | BA[1:0] |
-| 01    | Scalar | `CR8-CR15`/1   | 0b0001  | BA[4:2] | BA[1:0] |
-| 10    | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000   | BA[1:0] |
-| 11    | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000   | BA[1:0] |
+Add the following entries to:
  
-For a 3-bit operand (e.g. BFA):
+* A new "Vector Looping" Book
+* New Vector-Looping Chapters
+* New Vector-Looping Appendices
  
-| Value | Mode | Range/Inc     | 6..3      | 2..0    |
-|-------|------|---------------|-----------| --------|
-| 00    | Scalar | `CR0-CR7`/1   | 0b0000  | BFA   |
-| 01    | Scalar | `CR8-CR15`/1  | 0b0001  | BFA     |
-| 10    | Vector | `CR0-CR112`/16 | BFA 0 | 0b000   |
-| 11    | Vector | `CR8-CR120`/16 | BFA 1 | 0b000   |
+[[!tag opf_rfc]]
  
-# Appendix
+--------
  
-Now at its own page: [[svp64/appendix]]
+\newpage{}
  
+[[!inline pages="openpower/sv/svp64" raw=yes ]]
+[[!inline pages="openpower/sv/normal" raw=yes ]]
+[[!inline pages="openpower/sv/ldst" raw=yes ]]
+[[!inline pages="openpower/sv/branches" raw=yes ]]
+[[!inline pages="openpower/sv/cr_ops" raw=yes ]]
+[[!inline pages="openpower/sv/svp64/appendix" raw=yes ]]
+[[!inline pages="openpower/sv/compliancy_levels" raw=yes ]]