update date to 24 mar 2023 on ls001 v3

diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn

index 67c87bd4a4b3ebb2d9fad5429382296007dbbce1..e7b5d12907e5a94a8046841907075b9bff628871 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -1,4 +1,4 @@
-# OPF ISA WG External RFC LS001 08Sep2022
+# OPF ISA WG External RFC LS001 v3 24mar2023
  
  * RFC Author: Luke Kenneth Casson Leighton.
  * RFC Contributors/Ideas: Brad Frey, Paul Mackerras, Konstantinos Magritis,
@@ -10,8 +10,8 @@
    [[ls001/discussion]]
  
  This proposal is to extend the Power ISA with an Abstract RISC-Paradigm
-Vectorisation Concept that may be orthogonally applied to **all and any** suitable
-Scalar instructions, present and future, in the Scalar Power ISA.
+Vectorisation Concept that may be orthogonally applied to **all and any**
+suitable Scalar instructions, present and future, in the Scalar Power ISA.
  The Vectorisation System is called
  ["Simple-V"](https://libre-soc.org/openpower/sv/)
  and the Prefix Format is called
@@ -20,7 +20,7 @@ and the Prefix Format is called
  does not add Vector opcodes or regfiles**.
  An ISA Concept similar to Simple-V was originally invented in 1994 by
  Peter Hsu (Architect of the MIPS R8000) but was dropped as MIPS did not
-have an Out-of-Order Microarchitecture.
+have an Out-of-Order Microarchitecture at the time.
  
  Simple-V is designed for Embedded Scenarios right the way through
  Audio/Visual DSPs to 3D GPUs and Supercomputing.  As it does **not**
@@ -32,17 +32,16 @@ The goal of RED Semiconductor Ltd, an OpenPOWER
  Stakeholder, is to bring to market mass-volume general-purpose compute
  processors that are competitive in the 3D GPU Audio Visual DSP EDGE IoT
  desktop chromebook netbook smartphone laptop markets, performance-leveraged
-by Simple-V.  Simple-V thus has to
-be accompanied by corresponding **Scalar** instructions that bring the
-**Scalar** Power ISA up-to-date.  These include IEEE754
+by Simple-V.  To achieve this goal both Simple-V and accompanying
+**Scalar** Power ISA instructions are needed.  These include IEEE754
  [Transcendentals](https://libre-soc.org/openpower/transcendentals/)
  [AV](https://libre-soc.org/openpower/sv/av_opcodes/)
  cryptographic
  [Biginteger](https://libre-soc.org/openpower/sv/biginteger/) and
  [bitmanipulation](https://libre-soc.org/openpower/sv/bitmanip)
-operations that ARM
-Intel AMD and many other ISAs have been adding over the past 12 years
-and Power ISA has not.  Three additional FP-related sets are needed
+operations present in ARM
+Intel AMD and many other ISAs.
+Three additional FP-related sets are needed
  (missing from SFFS) -
  [int_fp_mv](https://libre-soc.org/openpower/sv/int_fp_mv/)
  [fclass](https://libre-soc.org/openpower/sv/fclass/) and
@@ -51,20 +50,54 @@ and one set named
  [crweird](https://libre-soc.org/openpower/sv/cr_int_predication/)
  increase the capability of CR Fields.
  
-*Thus it becomes necesary to consider the Architectural Resource
+*Thus as the primary motivation is to create a **Hybrid 3D CPU-GPU-VPU ISA**
+it becomes necessary to consider the Architectural Resource
  Allocation of not just Simple-V but the 80-100 Scalar instructions all
  at the same time*.
  
  It is also critical to note that Simple-V **does not modify the Scalar
  Power ISA**, that **only** Scalar words may be
  Vectorised, and that Vectorised instructions are **not** permitted to be
-different from their Scalar words.
-The sole exception to that is Vectorised
+different from their Scalar words (`addi` must use the same Word encoding
+as `sv.addi`, and any new Prefixed instruction added **must** also
+be added as Scalar).
+The sole semi-exception is Vectorised
  Branch Conditional, in order to provide the usual Advanced Branching
  capability present in every Commercial 3D GPU ISA, but it
  is the *Vectorised* Branch-Conditional that is augmented, not Scalar
  Branch.
  
+# Basic principle
+
+The inspiration for Simple-V came from the fact that on examination of every
+Vector ISA pseudocode encountered the Vector operations were expressed
+as a for-loop on a Scalar element
+operation, and then both a Scalar **and** a Vector instruction were added.
+With
+[Zero-Overhead Looping](https://en.m.wikipedia.org/wiki/Zero-overhead_looping)
+*already* being common for over four
+decades it felt natural to separate the looping at both the ISA and
+the Hardware Level
+and thus provide only Scalar instructions (instantly halving the number
+of instructions), but rather than go the VLIW route (TI MSP Series)
+keep closely to existing Power ISA standard Scalar execution.
+
+Thus the basic principle of Simple-V is to provide a Precise-Interruptible
+Zero-Overhead Loop system[^zolc] with associated register "offsetting"
+which augments a Suffixed instruction as a "template",
+incrementing the register numbering progressively *and automatically*
+each time round the "loop".  Thus it may be considered to be a form
+of "Sub-Program-Counter" and at its simplest level can replace a large
+sequence of regularly-increasing loop-unrolled instructions with just two:
+one to set the Vector length and one saying where to
+start from in the regfile.
+
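The "Sub-Program-Counter" idea can be sketched in a few lines of Python. This is an illustrative model only, not the normative pseudocode: `sv_loop`, `step` and the flat `regs` list are hypothetical names, and the sketch assumes every operand is marked as a Vector, whereas operands may in practice be marked Scalar or Vector individually.

```python
def sv_loop(regs, vl, rt, ra, rb, op):
    """Apply one Scalar operation VL times, stepping the register numbers.

    regs : flat list standing in for a register file (assumption)
    vl   : Vector Length, as set by a prior "set VL" instruction
    rt, ra, rb : starting register numbers of the Scalar "template"
    op   : the Scalar operation itself (e.g. add)
    """
    for step in range(vl):  # the hypothetical "Sub-PC" counter
        regs[rt + step] = op(regs[ra + step], regs[rb + step])
    return regs


# Four unrolled adds collapse into "set VL=4" plus one templated add.
regs = list(range(32))
sv_loop(regs, vl=4, rt=8, ra=16, rb=24, op=lambda a, b: a + b)
print(regs[8:12])  # [40, 42, 44, 46]
```

The point of the sketch is that the loop lives outside the operation: `op` never needs to know that it is being Vectorised.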
+On this sound and profoundly simple concept which leverages *Scalar*
+Micro-architectural capabilities much more comprehensive features are
+easy to add, working up towards an ISA that easily matches the capability
+of powerful 3D GPU Vector Supercomputing ISAs, without ever adding even
+one single Vector opcode.
+
  # Extension Levels
  
  Simple-V has been subdivided into levels akin to the Power ISA Compliancy
@@ -82,26 +115,29 @@ to be reserved.
  
  Power ISA has a reputation as being long-term stable.
  **Simple-V guarantees binary interoperability** by defining fixed
-register file bitwidths and size for all instructions.
+register file bitwidths and size for a given set of instructions.
  The seduction of permitting different implementors to choose a register file
  bitwidth and size with the same instructions unfortunately has
  the catastrophic side-effect of introducing not only binary incompatibility
  but silent data corruption as well as no means to trap-and-emulate differing
  bitwidths.[^vsx256]
  
-Thus "Silicon-Partner" Scalability
-is prohibited in the Simple-V Scalable Vector ISA,
-This does
-mean that `RESERVED` space is crucial to have, in order
-to safely provide future expanded register file bitwidths and sizes[^msr]
-**at the discretion of and with the full authority of the OPF ISA WG**,
-not the implementor ("Silicon Partner").
+"Silicon-Partner" Scalability is identical to attempting to run 64-bit
+Power ISA binaries without setting - or having `MSR.SF` - on "Scaled"
+32-bit hardware: **the same opcodes** were shared between 32 and 64 bit.
+`RESERVED` space is thus crucial
+to have, in order to provide the **OPF ISA WG** - not implementors
+("Silicon Partners") - with the option to properly review and decide
+any (if any) future expanded register file bitwidths and sizes[^msr],
+**under explicitly-distinguishable encodings** so as to guarantee
+long-term stability and binary interoperability.
  
  # Hardware Implementations
  
  The fundamental principle of Simple-V is that it sits between Issue and
  Decode, pausing the Program-Counter to service a "Sub-PC"
-hardware for-loop.  This is very similar to "Zero-Overhead Loops"
+hardware for-loop.  This is very similar to
+[Zero-Overhead Loops](https://en.m.wikipedia.org/wiki/Zero-overhead_looping)
  in High-end DSPs (TI MSP Series).
  
  Considerable effort has been expended to ensure that Simple-V is
@@ -117,19 +153,22 @@ complexity to achieve high throughput, even on a single-issue in-order
  microarchitecture. As usually becomes quickly apparent with in-order, its
  limitations extend also to when Simple-V is deployed, which is why
  Multi-Issue Out-of-Order is the recommended (but not mandatory) Scalar
-Micro-architecture.
+Micro-architecture.  Byte-level write-enable regfiles (like SRAMs) are
+strongly recommended, to avoid a Read-Modify-Write cycle.
  
  The only major concern is in the upper SV Extension Levels: the Hazard
  Management for increased number of Scalar Registers to 128 (in current
  versions) but given that IBM POWER9/10 has VSX register numbering 64,
-and modern GPUs have 128, 256 amd even 512 registers this was deemed
+and modern GPUs have 128, 256 and even 512 registers this was deemed
  acceptable. Strategies do exist in hardware for Hazard Management of
  such large numbers of registers, even for Multi-Issue microarchitectures.
  
  # Simple-V Architectural Resources
  
  * No new Interrupt types are required.
-  (**No modifications to existing Power ISA opcodes are required either**).
+  No modifications to existing Power ISA opcodes are required.
+  No new Register Files are required (all because Simple-V is a category of
+  Zero-Overhead Looping on Scalar instructions)
  * GPR FPR and CR Field Register extend to 128.  A future
    version may extend to 256 or beyond[^extend] or also extend VSX[^futurevsx]
  * 24-bits are needed within the main SVP64 Prefix (equivalent to a 2-bit XO)
@@ -138,9 +177,7 @@ such large numbers of registers, even for Multi-Issue microarchitectures.
  * A third 24-bits (third 2-bit XO) is strongly recommended to be `RESERVED`
    in case future unforeseen capability is needed (although this may be
    alternatively achieved with a mandatory PCR or MSR bit)
-* To hold all Vector Context, five SPRs are needed for userspace.
-  If Supervisor and Hypervisor mode are to
-  also support Simple-V they will correspondingly need five SPRs each.
+* To hold all Vector Context, four SPRs are needed.
    (Some 32/32-to-64 aliases are advantageous but not critical).
  * Five 6-bit XO (A-Form) "Management" instructions are needed.  These are
    Scalar 32-bit instructions and *may* be 64-bit-extended in future
@@ -157,11 +194,11 @@ at least the next decade (including if added on VSX)
  **Simple-V SPRs**
  
  * **SVSTATE** - Vectorisation State sufficient for Precise-Interrupt
-  Context-switching and no adverse latency.
-* **SVSRR0** - identical in purpose to SRR0/1: storing SVSTATE on context-switch
-  along-side MSR and PC.
+  Context-switching and no adverse latency. It may be considered to
+  be a "Sub-PC" and as such absolutely must be treated with the same
+  respect and priority as MSR and PC.
  * **SVSHAPE0-3** - these are 32-bit and may be grouped in pairs, they REMAP
-  (shape) the Vectors
+  (shape) the Vectors[^svshape]
  * **SVLR** - again similar to LR for exactly the same purpose, SVSTATE
    is swapped with SVLR by SV-Branch-Conditional for exactly the same
    reason that NIA is swapped with LR
@@ -181,44 +218,56 @@ the same space):
    (fits within svshape's XO encoding)
  * **svindex** - convenience instruction for setting up "Indexed" REMAP.
  
-# SVP64 and SVP64-Single 24-bit Prefixes
+\newpage{}
+# SVP64 24-bit Prefixes
  
-The SVP64 24-bit Prefix provides several options,
-all fitting within the 24-bit space (and no other). REMAP is separately
-outlined below.
-The primary options all of which are aimed at reducing instruction
-count and reducing assembler complexity are:
+The SVP64 24-bit Prefix (RM) options aim to reduce instruction count
+and assembler complexity.
+These Modes do not interact with SVSTATE per se.  SVSTATE
+primarily controls the looping (quantity, order), RM
+influences the *elements* (the Suffix).  There is however
+some close interaction when it comes to predication.
+REMAP is outlined separately.
  
-* element-width overrides, which dynamically redefine each SFFS or SFS
+* **element-width overrides**, which dynamically redefine each SFFS or SFS
    Scalar prefixed instruction to be 8-bit, 16-bit, 32-bit or 64-bit
    operands **without requiring new 8/16/32 instructions.**[^pseudorewrite]
    This results in full BF16 and FP16 opcodes being added to the Power ISA
    **without adding BF16 or FP16 opcodes** including full conversion
    between all formats.
-* predication.  this is an absolutely essential feature for a 3D GPU VPU ISA.
+* **predication**.
+  This is an absolutely essential feature for a 3D GPU VPU ISA.
    CR Fields are available as Predicate Masks hence the reason for their
-  extension to 128.
-* Saturation. **all** LD/ST and Arithmetic and Logical operations may
-  be saturated (without adding explicit scalar saturated opcodes)
-* Reduction and Prefix-Sum (Fibonnacci Series) Modes
-* vec2/3/4  "Packing" and "Unpacking" (similar to VSX `vpack` and `vpkss`)
+  extension to 128. Twin-Predication is also provided: this may best
+  be envisaged as back-to-back VGATHER-VSCATTER but is not restricted
+  to LD/ST, its use saves on instruction count.  Enabling one or other
+  of the predicates provides all of the other types of operations
+  found in Vector ISAs (VEXTRACT, VINSERT etc) again with no need
+  to actually provide explicit such instructions.
+* **Saturation**. applies to **all** LD/ST and Arithmetic and Logical
+  operations (without adding explicit saturation ops)
+* **Reduction and Prefix-Sum** (Fibonacci Series) Modes, including a
+  "Reverse Gear" (running loops backwards).
+* **vec2/3/4 "Packing" and "Unpacking"** (similar to VSX `vpack` and `vpkss`)
    accessible in a way that is easier than REMAP, added for the same reasons
    that drove `vpack` and `vpkss` etc. to be added: pixel, audio, and 3D
    data manipulation. With Pack/Unpack being part of SVSTATE it can be
    applied *in-place* saving register file space (no copy/mv needed).
-* Load/Store speculative "fault-first" behaviour, identical to ARM and RVV
-  Fault-first: provides auto-truncation of a speculative LD/ST helping
+* **Load/Store "fault-first"** speculative behaviour,
+  identical to SVE and RVV
+  Fault-first: provides auto-truncation of a speculative sequential parallel
+  LD/ST batch, helping
    solve the "SIMD Considered Harmful" stripmining problem from a Memory
    Access perspective.
-* Data-Dependent Fail-First: a 100% Deterministic extension of the LDST
-  ffirst concept: first `Rc=1 BO test` failure terminates looping and 
+* **Data-Dependent Fail-First**: a 100% Deterministic extension of the LDST
+  ffirst concept: first `Rc=1 BO test` failure terminates looping and
    truncates VL to that exact point. Useful for implementing algorithms
    such as `strcpy` in around 14 high-performance Vector instructions, the
    option exists to include or exclude the failing element.
-* Predicate-result: a strategic mode that effectively turns all and any
+* **Predicate-result**: a strategic mode that effectively turns all and any
    operations into a type of `cmp`. An `Rc=1 BO test` is performed and if
-  failing the result is **not** written to the regfile. The `Rc=1`
-  Vector of co-results **is** always written (subject to predication).
+  failing that element result is **not** written to the regfile. The `Rc=1`
+  Vector of co-results **is** always written (subject to usual predication).
    Termed "predicate-result" because the combination of producing then
    testing a result is as if the test was in a follow-up predicated
    copy/mv operation, it reduces regfile pressure and instruction count.
@@ -244,16 +293,16 @@ be suitably adapted to each category.
  It does have to be pointed out that there is huge pressure on the
  Mode bits.  There was therefore insufficient room, unlike the way that
  EXT001 was designed, to provide "identifying bits" *without first partially
-decoding the Suffix*.  This should in no way be conflated with or taken
-as an indicator that changing the meaning of the Suffix is performed
-or desirable.
+decoding the Suffix*.
  
  Some considerable care has been taken to ensure that Decoding may be
  performed in a strict forward-pipelined fashion that, aside from changes in
-SVSTATE and aside from the initial 32/64 length detection (also kept simple),
+SVSTATE (necessarily cached and propagated alongside MSR and PC)
+and aside from the initial 32/64 length detection (also kept simple),
  a Multi-Issue Engine would have no difficulty (performance maximisable).
-With the initial partial RM identification
-decode performed above the Vector operations may easily be passed downstream
+With the initial partial RM Mode type-identification
+decode performed above the Vector operations may then
+easily be passed downstream in a fully forward-progressive pipelined fashion
  to independent parallel units for further analysis.
  
  **Vectorised Branch-Conditional**
@@ -276,10 +325,121 @@ Boolean Logic rules on sets (treating the Vector of CR Fields to be tested by
  `BO` as a set) dictate that the Branch should take place on either 'ALL'
  tests succeeding (or failing) or whether 'SOME' tests succeed (or fail).
  These options provide the ability to cover the majority of Parallel
-3D GPU Conditions, saving a not inconsiderable number of instructions
-especially given the close interaction with CTR in hot-loops. 
+3D GPU Conditions, saving up to **twelve** instructions
+especially given the close interaction with CTR in hot-loops.[^parity]
+
+[^parity]: adding a parity (XOR) option was too much. instead a parallel-reduction on `crxor` may be used in combination with a Scalar Branch.
+
+Also `SVLR` is introduced, which is a parallel twin of `LR`, and saving
+and restoring of LR and SVLR may be deferred until the final decision
+as to whether to branch.  In this way `sv.bclrl` does not corrupt `LR`.
+
+Vectorised Branch-Conditional due to its side-effects (e.g. reducing CTR
+or truncating VL) has practical uses even if the Branch is deliberately
+set to the next instruction (CIA+8). For example it may be used to reduce
+CTR by the number of bits set in a GPR, if that GPR is given as the predicate
+mask `sv.bc/pm=r3`.
+
+# LD/ST RM Modes
+
+Traditional Vector ISAs have vastly more (and more complex) addressing
+modes than Scalar ISAs: unit strided, element strided, Indexed, Structure
+Packing. All of these had to be jammed in on top of existing Scalar
+instructions **without modifying or adding new Scalar instructions**.
+A small conceptual "cheat" was therefore needed.  The Immediate (D)
+is in some Modes multiplied by the element index, which gives us
+element-strided.  For unit-strided the width of the operation (`ld`,
+8 byte) is multiplied by the element index and *substituted* for "D"
+when the immediate, D, is zero.  Modifications to support this "cheat"
+on top of pre-existing Scalar HDL (and Simulators) have both turned
+out to be minimal.[^mul] Also added was the option to perform signed
+or unsigned Effective Address calculation, which comes into play only
+on LD/ST Indexed, when elwidth overrides are used.  Another quirk:
+`RA` is never allowed to have its width altered: it remains 64-bit,
+as it is the Base Address.
+
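The address arithmetic behind this "cheat" can be sketched as follows. This is one reading of the paragraph above, not the specification's pseudocode: `strided_addresses` and `op_bytes` are hypothetical names, and the sketch assumes a Mode in which the immediate is multiplied by the element index.

```python
def strided_addresses(base, d, vl, op_bytes):
    """Effective Addresses for a Vectorised immediate-offset LD/ST.

    base     : contents of RA (always treated as a 64-bit Base Address)
    d        : the immediate D from the Scalar instruction
    vl       : Vector Length
    op_bytes : width of the operation in bytes (8 for `ld`)
    """
    # Unit-strided: when D is zero the operation width is substituted for D.
    # Element-strided: otherwise D itself is multiplied by the element index.
    stride = d if d != 0 else op_bytes
    return [base + stride * i for i in range(vl)]


unit = strided_addresses(0x1000, d=0, vl=4, op_bytes=8)
elem = strided_addresses(0x1000, d=24, vl=3, op_bytes=8)
print([hex(ea) for ea in unit])  # ['0x1000', '0x1008', '0x1010', '0x1018']
print([hex(ea) for ea in elem])  # ['0x1000', '0x1018', '0x1030']
```

Both cases reuse the same Scalar encoding: only the multiplication by the element index is new.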
+One confusing thing is the unfortunate naming of LD/ST Indexed and
+REMAP Indexed: some care is taken in the spec to discern the two.
+LD/ST Indexed is Scalar `EA=RA+RB` (where **either** RA or RB
+may be marked as Vectorised), where obviously
+that Vector of RA (or RB) is read in the usual linear sequential
+fashion. REMAP Indexed affects the
+**order** in which the Vector of RA (or RB) is accessed,
+according to a schedule determined by *another* vector of offsets
+in the register file.  Effectively this combines VSX `vperm`
+back-to-back with LD/ST operations *in the calculation of each
+Effective Address* in one instruction.
+
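The difference between the two "Indexed" concepts may be easier to see side by side. The following Python sketch is illustrative only: the function names and the `schedule` list are assumptions, with `schedule` standing in for the separate vector of offsets held in the register file.

```python
def ldst_indexed(ra, rb_vec):
    """LD/ST Indexed with a Vectorised RB: EA = RA + RB[i], read linearly."""
    return [ra + rb for rb in rb_vec]


def remap_indexed(ra, rb_vec, schedule):
    """REMAP Indexed: the same EA = RA + RB[...] calculation, but the
    *order* in which RB elements are visited comes from `schedule`."""
    return [ra + rb_vec[idx] for idx in schedule]


rb_vec = [0, 8, 16, 24]
linear = ldst_indexed(0x100, rb_vec)
remapped = remap_indexed(0x100, rb_vec, schedule=[3, 1, 0, 2])
print([hex(ea) for ea in linear])    # ['0x100', '0x108', '0x110', '0x118']
print([hex(ea) for ea in remapped])  # ['0x118', '0x108', '0x100', '0x110']
```

In the first function the vector supplies the offsets; in the second, a second vector additionally supplies the visiting order.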
+For DCT and FFT, normally it is very expensive to perform the
+"bit-inversion" needed for address calculation and/or reordering
+of elements.  DCT in particular needs both bit-inversion *and
+Gray-Coding* offsets (a complexity that often "justifies" full
+assembler loop-unrolling).  DCT/FFT REMAP **automatically** performs
+the required offset adjustment to get data loaded and stored in
+the required order.  Matrix REMAP can likewise perform up to 3
+Dimensions of reordering (on both Immediate and Indexed), and
+when combined with vec2/3/4 the reordering can even go as far as
+four dimensions (four nested fixed size loops).
+
+Twin Predication is worth a special mention. Many Vector ISAs have
+special LD/ST `VCOMPRESS` and `VREDUCE` instructions, which sequentially
+skip elements based on predicate mask bits. They also add special
+`VINSERT` and `VEXTRACT` Register-based instructions to compensate
+for lack of single-element LD/ST (where in Simple-V you just use
+Scalar LD/ST). Also Broadcasting (`VSPLAT`) is either added to LDST
+or as Register-based.
+
+*All of the above modes are covered by Twin-Predication*
+
+In particular, a special predicate mode `1<<r3` uses the register `r3`
+*binary* value, converted to single-bit unary mask,
+effectively as a single (Scalar) Index *runtime*-dynamic offset into
+a Vector.[^r3] Combined with the
+(mis-named) "mapreduce" mode when used as a source predicate
+a `VSPLAT` (broadcast) is performed.  When used as a destination
+predicate `1<<r3`
+provides `VINSERT` behaviour.
+
+[^r3]: Effectively: `GPR(RA+r3)`
+
+Also worth an explicit mention is that Twin Predication when using
+different source from destination predicate masks effectively combines
+back-to-back `VCOMPRESS` and `VEXPAND` (in a single instruction), and,
+further, that the benefits of Twin Predication are not limited to LD/ST,
+they may be applied to Arithmetic, Logical and CR Field operations as well.
+
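A small Python model may help show why one mechanism covers both compress and expand. It is a sketch under the assumption that masked-out source and destination elements are skipped independently; `twin_pred_mv` and its parameters are hypothetical names, and predicate masks are modelled as plain integers with bit `i` controlling element `i`.

```python
def twin_pred_mv(src, dst, srcmask, dstmask, vl):
    """Twin-Predicated element copy (a model of a Vectorised mv).

    Source elements whose srcmask bit is clear are skipped, and
    destination elements whose dstmask bit is clear are skipped,
    each side stepping independently of the other.
    """
    out = list(dst)
    s = d = 0
    while True:
        while s < vl and not (srcmask >> s) & 1:
            s += 1  # skip masked-out source element
        while d < vl and not (dstmask >> d) & 1:
            d += 1  # skip masked-out destination element
        if s >= vl or d >= vl:
            break
        out[d] = src[s]
        s += 1
        d += 1
    return out


src = [10, 11, 12, 13]
compressed = twin_pred_mv(src, [0] * 4, srcmask=0b1010, dstmask=0b0011, vl=4)
expanded = twin_pred_mv(src, [0] * 4, srcmask=0b0011, dstmask=0b1010, vl=4)
print(compressed)  # [11, 13, 0, 0]
print(expanded)    # [0, 10, 0, 11]
```

With a sparse source mask and a dense destination mask the copy behaves like `VCOMPRESS`; swapping the masks gives `VEXPAND`; using sparse masks on both sides gives the back-to-back combination in a single pass.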
+Overall the LD/ST Modes available are astoundingly powerful, especially
+when combining arithmetic (lharx) with saturation, element-width overrides,
+Twin Predication,
+vec2/3/4 Structure Packing *and* REMAP, the combinations far exceed anything
+seen in any other Vector ISA in history, yet are really nothing more
+than concepts abstracted out in pure RISC form.[^ldstcisc]
+
+# CR Field RM Modes.
+
+CR Field operations (`crand` etc.) are somewhat underappreciated in the
+Power ISA. The CR Fields however are perfect for providing up to four
+separate Vectors of Predicate Masks: `EQ LT GT SO` and thus some special
+attention was given to first making transfer between GPR and CR Fields
+much more powerful with the
+[crweird](https://libre-soc.org/openpower/sv/cr_int_predication/)
+operations, and secondly by adding powerful binary and ternary CR Field
+operations into the bitmanip extension.[^crops]
+
+On these instructions RM Modes may still be applied (mapreduce and Data-Dependent Fail-first).  The usefulness of
+being able to auto-truncate subsequent Vector Processing at the point
+at which a CR Field test fails, based on any arbitrary logical operation involving `three` CR Field Vectors (`crternlogi`) should be clear, as
+should the benefits of being able to do mapreduce and REMAP Parallel
+Reduction on `crternlogi`: dramatic reduction in instruction count
+for Branch-based control flow when faced with complex analysis of
+multiple Vectors, including XOR-reduction (parity).
+
+Overall the addition of the CR Operations and the CR RM Modes is about
+getting instruction count down and increasing the power and flexibility of CR Fields as pressed into service for the purpose of Predicate Masks.
  
-**SVP64Single**
+[^crops]: the alternative to powerful transfer instructions between GPR and CR Fields was to add the full duplicated suite of BMI and TBM operations present in GPR (popcnt, cntlz, set-before-first) as CR Field Operations. all of which was deemed inappropriate.
+
+# SVP64Single 24-bits
  
  The `SVP64-Single` 24-bit encoding focusses primarily on ensuring that
  all 128 Scalar registers are fully accessible, provides element-width
@@ -290,31 +450,90 @@ provided in the Scalar Power ISA without one single explicit FP16 or BF16
  32-bit opcode being added.  The downside: such Scalar operations are
  all 64-bit encodings.
  
+As SVP64Single is new and still under development, space for it may
+instead be `RESERVED`. It is however necessary in *some* form
+as there are limitations
+in SVP64 Register numbering, particularly for 4-operand instructions,
+that can only be easily overcome by SVP64Single.
+
  # Vertical-First Mode
  
  This is a Computer Science term that needed first to be invented.
  There exists only one other Vertical-First Vector ISA in the world:
-Mitch Alsup's VVM Extension for the 66000.
+Mitch Alsup's VVM Extension for the 66000, details of which may be
+obtained publicly on `comp.arch` or directly from Mitch Alsup under
+NDA. Several people have
+independently derived Vertical-First: it simply did not have a
+Computer Science term associated with it.
  
  If we envisage register and Memory layout to be Horizontal and
-instructions to be vertical, and to then have some form of Loop
-System it is easier to conceptualise VF vs HF Mode:
+instructions to be Vertical, and to then have some form of Loop
+System (whether Zero-Overhead or just branch-conditional based)
+it is easier to then conceptualise VF vs HF Mode:
  
  * Vertical-First progresses through *instructions* first before
    moving on to the next *register* (or Memory-address in the case
    of Mitch Alsup's VVM).
  * Horizontal-First (also known as Cray-style Vectors) progresses
    through **registers** (or, register *elements* in traditional
-  Cray-Vector ISAs) in full before moving on to the next instruction.
-
-
+  Cray-Vector ISAs) in full before moving on to the next *instruction*.
+
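The two orderings can be made concrete by listing which (instruction, element) pair executes at each step. This Python sketch is purely illustrative: the function names are hypothetical and instructions are represented by their mnemonics.

```python
def horizontal_first(instrs, vl):
    """Cray-style: finish every element of one instruction, then move on."""
    return [(ins, el) for ins in instrs for el in range(vl)]


def vertical_first(instrs, vl):
    """Run every instruction on one element, then step to the next element."""
    return [(ins, el) for el in range(vl) for ins in instrs]


hf = horizontal_first(["ld", "add"], vl=2)
vf = vertical_first(["ld", "add"], vl=2)
print(hf)  # [('ld', 0), ('ld', 1), ('add', 0), ('add', 1)]
print(vf)  # [('ld', 0), ('add', 0), ('ld', 1), ('add', 1)]
```

The same work is done in both cases; only the nesting of the two loops is exchanged.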
+Mitch Alsup's VVM Extension is a form of hardware-level auto-vectorisation
+based around Zero-Overhead Loops. Using a Variable-Length Encoding all
+loop-invariant registers are "tagged" such that the Hazard Management
+Engine may perform optimally and do less work in automatically identifying
+parallelism opportunities.
+With it not being appropriate to use Variable-Length Encoding in the Power
+ISA a different much more explicit strategy was taken in Simple-V.
+
+The biggest advantage inherent in Vertical-First is that it is very easy
+to introduce into compilers, because all looping, as far as programs
+are concerned, remains expressed as *Scalar assembler*.[^autovec]
+The biggest strength of Mitch Alsup's VVM
+is its hardware-level auto-vectorisation,
+but it is limited in its ability to call
+functions. Simple-V's Vertical-First provides explicit control over the
+parallelism ("hphint")[^hphint] and also allows for full state to be stored/restored
+(SVLR combined with LR), permitting full function calls to be made
+from inside Vertical-First Loops, and potentially allows arbitrary-depth
+nested VF Loops.
+
+Simple-V Vertical-First Looping requires an explicit instruction to
+move `SVSTATE` regfile offsets forward: `svstep`. An early version of
+Vectorised
+Branch-Conditional attempted to merge the functionality of `svstep`
+into `sv.bc`: it became CISC-like in its complexity and was quickly reverted.
  
-\newpage{}
  # Simple-V REMAP subsystem
  
  [REMAP](https://libre-soc.org/openpower/sv/remap)
  is extremely advanced but brings features already present in other
-DSPs and Supercomputing ISAs.
+DSPs and Supercomputing ISAs. The usual sequential progression
+through elements is pushed through a hardware-defined
+*fully Deterministic*
+"remapping".  Normally (without REMAP)
+algorithms are costly or
+convoluted to implement.  They are typically implemented
+as hard-coded fully loop-unrolled assembler which is often
+auto-generated by specialist tools, or written
+entirely by hand.
+All REMAP Schedules *including Indexed*
+are 100% Deterministic from their point of declaration,
+making it possible to forward-plan
+Issue, Memory access and Register Hazard Management
+in Multi-Issue Micro-architectures.
+
+If combined with Vertical-First then much more complex operations may exploit
+REMAP Schedules, such as Complex Number FFTs, by using Scalar intermediary
+temporary registers to compute results that have a Vector source
+or destination or both.
+Contrast this with a Standard Horizontal-First Vector ISA where the only
+way to perform Vectorised Complex Arithmetic would be to add Complex Vector
+Arithmetic operations, because due to the Horizontal (element-level)
+progression there is no way to utilise intermediary temporary (scalar)
+variables.[^complex]
+
+[^complex]: a case could be made for constructing Complex number arithmetic using multiple sequential Horizontal-First (Cray-style Vector) instructions. This may not be convenient in the least when REMAP is involved (such as Parallel Reduction of Complex Multiply).
  
  * **DCT/FFT** REMAP brings more capability than TI's MSP-Series DSPs and
    Qualcomm Hexagon DSPs, and is not restricted to Integer or FP.
@@ -328,10 +547,25 @@ DSPs and Supercomputing ISAs.
    suited to Convolutions, Matrix Transpose and rotate, *all* of which is
    in-place.
  * **General-purpose Indexed** REMAP, this option is provided to implement
-  an equivalent of VSX `vperm`
+  an equivalent of VSX `vperm`, as a general-purpose catch-all means of
+  covering algorithms outside of the other REMAP Engines.
  * **Parallel Reduction** REMAP, performs an automatic map-reduce using
    *any suitable scalar operation*.
  
+All REMAP Schedules are Precise-Interruptible. No latency penalty is caused by
+the fact that the Schedule is Parallel-Reduction, for example.  The operations
+are Issued (Deterministically) as **Scalar** operations and thus any latency
+associated with **Scalar** operation Issue exactly as in a **Scalar**
+Micro-architecture will result.  Contrast this with a Standard Vector ISA
+where frequently there is either considerable interrupt latency due to
+requiring a Parallel Reduction to complete in full, or partial results
+to be discarded and re-started should a high-priority Interrupt occur
+in the middle.
+
+Note that predication is possible on REMAP but is hard to use effectively.
+It is often best to make copies of data (`VCOMPRESS`) then apply REMAP.
+
+\newpage{}
  # Scalar Operations
  
  The primary reason for mentioning the additional Scalar operations
@@ -409,9 +643,41 @@ For each of EXT059 and EXT063:
    [under evaluation](https://bugs.libre-soc.org/show_bug.cgi?id=923)
    as of 08Sep2022
  
+# Adding new opcodes.
+
+With Simple-V being a type of
+[Zero-Overhead Loop](https://en.m.wikipedia.org/wiki/Zero-overhead_looping)
+Engine on top of
+Scalar operations some clear guidelines are needed on how both
+existing "Defined Words" (Public v3.1 Section 1.6.3 term) and future
+Scalar operations are added within the 64-bit space.  Examples of
+legal and illegal allocations are given later.
+
+The primary point is that once an instruction is defined in Scalar
+32-bit form its corresponding space **must** be reserved in the
+SVP64 area with the exact same 32-bit form, even if that instruction
+is "Unvectoriseable" (`sc`, `sync`, `rfid` and `mtspr` for example).
+Instructions may **not** be added in the Vector space without also
+being added in the Scalar space, and vice-versa, *even if Unvectoriseable*.
+
+This is extremely important because the worst possible situation
+is if a conflicting Scalar instruction is added by another Stakeholder,
+which then turns out to be Vectoriseable: it would then have to be
+added to the Vector Space with a *completely different Defined Word*
+and things go rapidly downhill in the Decode Phase from there.
+Setting a simple inviolate rule helps avoid this scenario but does
+need to be borne in mind when discussing potential allocation
+schemes, as well as when new Vectoriseable Opcodes are proposed
+for addition by future RFCs: the opcodes **must** be uniformly
+added to Scalar **and** Vector spaces, or added in one and reserved
+in the other, or
+not added at all in either.[^whoops]
+
  \newpage{}
-# Potential Opcode allocation solution
+# Potential Opcode allocation solution (superseded)
  
+*Note this scheme is superseded below but kept for completeness as it
+defines terms and context*.
  There are unfortunately some inviolate requirements that directly place
  pressure on the EXT000-EXT063 (32-bit) opcode space to such a degree that
  it risks jeopardising the Power ISA. These requirements are:
@@ -430,9 +696,9 @@ is based loosely around Public v3.1 EXT001 Encoding.[^ext001]
  | 0-5 | 6 | 7 | 8-31  | Description               |
  |-----|---|---|-------|---------------------------|
  | PO  | 0 | 0 | 0000  | new-suffix `RESERVED1`               |
-| PO  | 0 | 0 | !zero | new-suffix, scalar (SVP64Single) |
+| PO  | 0 | 0 | !zero | new-suffix, scalar (SVP64Single), or `RESERVED3` |
  | PO  | 1 | 0 | 0000  | new scalar-only word, or `RESERVED2`               |
-| PO  | 1 | 0 | !zero | old-suffix, scalar (SVP64Single) |
+| PO  | 1 | 0 | !zero | old-suffix, scalar (SVP64Single), or `RESERVED4` |
  | PO  | 0 | 1 | nnnn  | new-suffix, vector (SVP64)       |
  | PO  | 1 | 1 | nnnn  | old-suffix, vector (SVP64)       |
  
@@ -455,6 +721,9 @@ is based loosely around Public v3.1 EXT001 Encoding.[^ext001]
    except that it is equivalent to hard-coded VL=1
    at all times. Predication is permitted, Element-width-overrides is
    permitted, Saturation is permitted.
+  If not allocated within the scope of this RFC
+  then these are requested to be `RESERVED` for a future Simple-V
+  proposal.
  * **SVP64** - a (well-defined, 2 years) DRAFT Proposal for a Vectorisation
    Augmentation of suffixes.
  
@@ -464,8 +733,8 @@ allocation to new POs, `RESERVED2` does not.[^only2]
  
  |          | Scalar (bit7=0,8-31=0000) | Scalar (bit7=0,8-31=!zero)| Vector (bit7=1)  |
  |----------|---------------------------|---------------------------|------------------|
-|new bit6=0| `RESERVED1`:{EXT200-263}  | SVP64-Single:{EXT200-263} | SVP64:{EXT200-263} |
-|old bit6=1| `RESERVED2`:{EXT300-363}  | SVP64-Single:{EXT000-063}   | SVP64:{EXT000-063}   |
+|new bit6=0| `RESERVED1`:{EXT200-263}  | `RESERVED3`:SVP64-Single:{EXT200-263} | SVP64:{EXT200-263} |
+|old bit6=1| `RESERVED2`:{EXT300-363}  | `RESERVED4`:SVP64-Single:{EXT000-063}   | SVP64:{EXT000-063}   |
  
  * **`RESERVED2`:{EXT300-363}** (not strictly necessary to be added) is not
    and **cannot** ever be Vectorised or Augmented by Simple-V or any future
@@ -473,17 +742,16 @@ allocation to new POs, `RESERVED2` does not.[^only2]
    it is a pure **Scalar-only** word-length PO Group. It may remain `RESERVED`.
  * **`RESERVED1`:{EXT200-263}** is also a new set of 64 word-length Major
    Opcodes.
-  These opcodes do not *need* to be Simple-V-Augmented
-  *but the option to do so exists* should an Implementor choose to do so.
-  This is unlike `EXT300-363` which may **never** be Simple-V-Augmented
+  These opcodes would be Simple-V-Augmentable
+  unlike `EXT300-363` which may **never** be Simple-V-Augmented
    under any circumstances.
-* **`SVP64-Single:{EXT200-263}`** - Major opcodes 200-263 with
+* **RESERVED3:`SVP64-Single:{EXT200-263}`** - Major opcodes 200-263 with
    Single-Augmentation, providing a one-bit predicate mask, element-width
    overrides on source and destination, and the option to extend the Scalar
    Register numbering (r0-32 extends to r0-127).  **Placing of alternative
    instruction encodings other than those exactly defined in EXT200-263
    is prohibited**.
-* **`SVP64-Single:{EXT000-063}`** - Major opcodes 000-063 with
+* **RESERVED4:`SVP64-Single:{EXT000-063}`** - Major opcodes 000-063 with
    Single-Augmentation, just like SVP64-Single on EXT200-263, these are
    in effect Single-Augmented-Prefixed variants of the v3.0 32-bit Power ISA.
    Alternative instruction encodings other than the exact same 32-bit word
@@ -504,7 +772,107 @@ The issues of allocation for bitmanip etc. from Libre-SOC is therefore
  overwhelmingly made moot. The only downside is that there is no
  `SVP64-Reserved` which will have to be achieved with SPRs (PCR or MSR).
  
+*Most importantly what this scheme does not do is provide large areas
+for other (non-Vectoriseable) RFCs.*
+
+# Potential Opcode allocation solution (2)
+
+One of the risks of the bit 6/7 scheme above is that there is no
+room to share PO9 (EXT009) with other potential uses.  A workaround for
+that is as follows:
+
+* EXT009, like EXT001 of Public v3.1, is **defined** as a 64-bit
+  encoding. This makes Multi-Issue Length-identification trivial.
+* bit 6, if 0b1, is 100% for Simple-V augmentation of (Public v3.1 1.6.3)
+  "Defined Words" (aka EXT000-063), with the exception of 0x26000000
+  as a Prefix, which is a new RESERVED encoding.
+* when bit 6 is 0b0 and bits 32-33 are 0b11, the encoding is **defined**
+  as also allocated to Simple-V
+* all other patterns are `RESERVED` for other non-Vectoriseable
+  purposes (just over 37.5%).
+
+| 0-5 | 6 | 7 | 8-31  | 32:33 |  Description               |
+|-----|---|---|-------|-------|----------------------------|
+| PO9?| 0 | 0 | !zero | 00-10 | RESERVED (other)           |
+| PO9?| 0 | 1 | xxxx  | 00-10 | RESERVED (other)          |
+| PO9?| x | 0 | 0000  | xx    | RESERVED (other)          |
+| PO9?| 0 | 0 | !zero | 11    | SVP64 (current and future) |
+| PO9?| 0 | 1 | xxxx  | 11    | SVP64 (current and future) |
+| PO9?| 1 | 0 | !zero | xx    | SVP64 (current and future) |
+| PO9?| 1 | 1 | xxxx  | xx    | SVP64 (current and future) |
+
+This ensures that any potential for future conflict over uses of the
+EXT009 space, jeopardising Simple-V in the process, is avoided,
+yet leaves huge areas (just over 37.5% of the 64-bit space) for other
+(non-Vectoriseable) uses.
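
The table above can be read as a small decision procedure. The following is a minimal sketch, not part of the RFC: the function names are hypothetical, PO9 is assumed as the prefix opcode (the table itself marks it `PO9?`), and bits are numbered MSB0 as in the Power ISA. It follows only the rows of the table directly above, and it also counts the exact share of the space left `RESERVED`, which comes out at 37.5% plus a handful of encodings.

```python
from fractions import Fraction

PO9 = 9  # assumed Primary Opcode of the 64-bit prefix (the "PO9?" column)


def classify(prefix: int, suffix: int) -> str:
    """Classify a 64-bit encoding using the rows of the table above.

    prefix holds bits 0-31 and suffix holds bits 32-63 (MSB0 numbering).
    """
    assert (prefix >> 26) & 0x3F == PO9, "bits 0-5 must hold the prefix PO"
    bit6 = (prefix >> 25) & 1
    bit7 = (prefix >> 24) & 1
    bits8_31 = prefix & 0xFFFFFF
    bits32_33 = (suffix >> 30) & 0b11  # top two bits of the second word
    if bit7 == 0 and bits8_31 == 0:
        return "RESERVED"  # row "x 0 0000 xx"
    if bit6 == 1 or bits32_33 == 0b11:
        return "SVP64"
    return "RESERVED"


def reserved_fraction() -> Fraction:
    """Exact share of the prefixed space that the table leaves RESERVED."""
    total = 2 * 2 * 2**24 * 4  # bit 6, bit 7, bits 8-31, bits 32-33
    degenerate = 2 * 4  # bit7=0 and bits 8-31 zero: any bit 6, any bits 32-33
    # bit6=0 with bits 32-33 in 00..10, excluding the degenerate case above
    other = (2 * 2**24 - 1) * 3
    return Fraction(degenerate + other, total)


if __name__ == "__main__":
    print(classify(0x26000001, 0x12345678))  # bit 6 set, non-zero bits 8-31
    print(classify(0x24000001, 0x12345678))  # bit 6 clear, bits 32-33 = 0b00
    print(float(reserved_fraction()))
```

Counting encodings rather than multiplying probabilities shows where the "just over" comes from: the excess above 3/8 is the small all-zero (bits 8-31) corner of the table.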
+
+These areas thus need to be Allocated (SVP64 and Scalar EXT248-263):
+
+| 0-5 | 6 | 7 | 8-31  | 32-3 | Description               |
+|-----|---|---|-------|------|---------------------------|
+| PO  | 0 | 0 | !zero | 0b11 | SVP64Single:EXT248-263, or `RESERVED3` |
+| PO  | 0 | 0 | 0000  | 0b11 | Scalar EXT248-263               |
+| PO  | 0 | 1 | nnnn  | 0b11 | SVP64:EXT248-263     |
+| PO  | 1 | 0 | !zero | nn   | SVP64Single:EXT000-063 or `RESERVED4` |
+| PO  | 1 | 1 | nnnn  | nn   | SVP64:EXT000-063       |
+
+and reserved areas, QTY 1of 32-bit, and QTY 3of 55-bit, are:
+
+| 0-5 | 6 | 7 | 8-31  | 32-3 | Description               |
+|-----|---|---|-------|------|---------------------------|
+| PO9?| 1 | 0 | 0000  | xx   | `RESERVED1` or EXT300-363 (32-bit) |
+| PO9?| 0 | x | xxxx  | 0b00 | `RESERVED2` or EXT200-215 (55-bit) |
+| PO9?| 0 | x | xxxx  | 0b01 | `RESERVED2` or EXT216-231 (55-bit) |
+| PO9?| 0 | x | xxxx  | 0b10 | `RESERVED2` or EXT232-247 (55-bit) |
+
+* SVP64Single (`RESERVED3/4`) is *planned* for a future RFC
+  (but needs reserving as part of this RFC)
+* `RESERVED1/2` is available for new general-purpose
+  (non-Vectoriseable) 32-bit encodings (other RFCs)
+* EXT248-263 is for "new" instructions
+  which **must** be granted corresponding space
+  in SVP64.
+* Anything Vectorised-EXT000-063 is **automatically** being
+  requested as 100% Reserved for every single "Defined Word"
+  (Public v3.1 1.6.3 definition). Vectorised-EXT001 or EXT009
+  is defined as illegal.
+* Any **future** instruction
+  added to EXT000-063 likewise, must **automatically** be
+  assigned corresponding reservations in the SVP64:EXT000-063
+  and SVP64Single:EXT000-063 area, regardless of whether the
+  instruction is Vectoriseable or not.
+
+Bit-allocation Summary:
+
+* EXT3nn and other areas provide space for up to
+  QTY 4of non-Vectoriseable EXTn00-EXTn47 ranges.
+* QTY 3of 55-bit spaces also exist for future use (longer by 3 bits
+  than opcodes allocated in EXT001)
+* Simple-V EXT2nn is restricted to range EXT248-263
+* non-Simple-V (non-Vectoriseable) EXT2nn (if ever requested in any future RFC) is restricted to range EXT200-247
+* Simple-V EXT0nn takes up 50% of PO9 for this and future Simple-V RFCs
+
+**This however potentially puts SVP64 under pressure (in 5-10 years).**
+Ideas being discussed already include adding LD/ST-with-Shift and variant
+Shift-Immediate operations that require a large quantity of Primary Opcodes.
+To ensure that there is room in future,
+it may be better to allocate 25% to `RESERVED`:
+
+| 0-5 | 6 | 7 | 8-31  | 32| Description                        |
+|-----|---|---|-------|---|------------------------------------|
+| PO9?| 1 | 0 | 0000  | x | EXT300-363 or `RESERVED1` (32-bit) |
+| PO9?| 0 | x | xxxx  | 0 | EXT200-231 or `RESERVED2` (56-bit) |
+| PO9?| 0 | x | xxxx  | 1 | EXT232-263 and SVP64(/V/S)         |
+
+The clear separation between Simple-V and non-Simple-V stops
+conflict in future RFCs, both of which get plenty of space.
+EXT000-063 pressure is reduced in both Vectoriseable and
+non-Vectoriseable, and the 100+ Vectoriseable Scalar operations
+identified by Libre-SOC may safely be proposed and each evaluated
+on their merits.
+
  \newpage{}
+
  **EXT000-EXT063**
  
  These are Scalar word-encodings. Often termed "v3.0 Scalar" in this document
@@ -512,86 +880,230 @@ Power ISA v3.1 Section 1.6.3 Book I calls it a "defined word".
  
  | 0-5    | 6-31   |
  |--------|--------|
-| PO     | EXT000-063 Scalar (v3.0 or v3.1) operation |
+| PO     | EXT000-063 "Defined word" |
  
-**RESERVED2 / EXT300-363** bit6=old bit7=scalar
+**SVP64Single:{EXT000-063}** bit6=old  bit7=scalar
  
-This is entirely at the discretion of the ISA WG. Libre-SOC is *not*
-proposing the addition of EXT300-363: it is merely a possibility for
-future.  The reason the space is not needed is because this is within
-the realm of Scalar-extended (SVP64Single), and with the 24-bit prefix
-area being all-zero (bits 8-31) this is defined as "having no augmentation"
-(in the Simple-V Specification it is termed `Scalar Identity Behaviour`).
-This in turn makes this prefix a *degenerate duplicate* so may be allocated
-for other purposes.
+This encoding, identical to SVP64Single:{EXT248-263},
+introduces SVP64Single Augmentation of Scalar "defined words".
+All meanings must be identical to EXT000-063, and it is likewise
+prohibited to add an instruction in this area without also adding
+the exact same (non-Augmented) instruction in EXT000-063 with the
+exact same Scalar word.
+Bits 32-37 0b000000 to 0b111111 represent EXT000-063 respectively.
+Augmenting EXT001 or EXT009 is prohibited.
  
  | 0-5    | 6 | 7 | 8-31  | 32-63   |
  |--------|---|---|-------|---------|
-| PO (9)?| 1 | 0 | 0000  | EXT300-363 or `RESERVED1` |
+| PO (9)?| 1 | 0 | !zero | SVP64Single:{EXT000-063} |
  
-**{EXT200-263}** bit6=new bit7=scalar
+**SVP64:{EXT000-063}** bit6=old bit7=vector
  
-This encoding represents the opportunity to introduce EXT200-263.
-It is a Scalar-word encoding, and does not require implementing
-SVP64 or SVP64-Single.
-PO2 is in the range 0b00000 to 0b11111 to represent EXT200-263 respectively.
+This encoding is identical to **SVP64:{EXT248-263}** except it
+is the Vectorisation of existing v3.0/3.1 Scalar-words, EXT000-063.
+All the same rules apply with the addition that
+Vectorisation of EXT001 or EXT009 is prohibited.
  
-| 0-5    | 6 | 7 | 8-31  | 32-37  | 38-63   |
-|--------|---|---|-------|--------|---------|
-| PO (9)?| 0 | 0 | 0000  | PO2    | {EXT200-263} |
+| 0-5    | 6 | 7 | 8-31  | 32-63   |
+|--------|---|---|-------|---------|
+| PO (9)?| 1 | 1 | nnnn  | SVP64:{EXT000-063} |
  
-**SVP64Single:{EXT200-263}** bit6=new bit7=scalar
+**{EXT232-263}** bit6=new bit7=scalar
  
-This encoding, which is effectively "implicit VL=1"
-and comprising (from bits 8-31)
-*at least some* form of Augmentation, it represents the opportunity
-to Augment EXT200-263 with the SVP64Single capabilities.
-Instructions may not be placed in this category without also being
-implemented as pure Scalar.
+This encoding represents the opportunity to introduce EXT232-263.
+It is a Scalar-word encoding, and does not require implementing
+SVP64 or SVP64-Single, but does require the Vector-space to be allocated.
+PO2 is in the range 0b100000 to 0b111111 to represent EXT232-263 respectively.
  
-| 0-5    | 6 | 7 | 8-31  | 32-37  | 38-63   |
-|--------|---|---|-------|--------|---------|
-| PO (9)?| 0 | 0 | !zero | PO2    | SVP64Single:{EXT200-263} |
+| 0-5    | 6 | 7 | 8-31  | 32 | 33-37   | 38-63   |
+|--------|---|---|-------|----|---------|---------|
+| PO (9)?| 0 | 0 | 0000  | 1  |PO2[1:5] | {EXT232-263} |
  
-**SVP64Single:{EXT000-063}** bit6=old  bit7=scalar
+**SVP64Single:{EXT232-263}** bit6=new bit7=scalar
  
-This encoding, identical to SVP64Single:{EXT200-263},
-introduces SVP64Single Augmentation of v3.0 Scalar word instructions.
-All meanings must be identical to EXT000 to EXT063, and is is likewise
-prohibited to add an instruction in this area without also adding
-the exact same (non-Augmented) instruction in EXT000-063 with the
-exact same Scalar word.
-PO2 is in the range 0b00000 to 0b11111 to represent EXT000-063 respectively.
-Augmenting EXT001 is prohibited.
+This encoding, which is effectively "implicit VL=1"
+and comprises (from bits 8-31 being non-zero)
+*at least some* form of Augmentation, represents the opportunity
+to Augment EXT232-263 with the SVP64Single capabilities.
+It must be allocated under Scalar *and* SVP64 simultaneously.
  
-| 0-5    | 6 | 7 | 8-31  | 32-37  | 38-63   |
-|--------|---|---|-------|--------|---------|
-| PO (9)?| 1 | 0 | !zero | PO2    | SVP64Single:{EXT000-063} |
+| 0-5    | 6 | 7 | 8-31  | 32 | 33-37   | 38-63   |
+|--------|---|---|-------|----|---------|---------|
+| PO (9)?| 0 | 0 | !zero | 1  |PO2[1:5] | SVP64Single:{EXT232-263} |
  
-**SVP64:{EXT200-263}** bit6=new bit7=vector
+**SVP64:{EXT248-263}** bit6=new bit7=vector
  
  This encoding, which permits VL to be dynamic (settable from GPR or CTR)
-is the Vectorisation of EXT200-263.
+is the Vectorisation of EXT248-263.
  Instructions may not be placed in this category without also being
  implemented as pure Scalar *and* SVP64Single. Unlike SVP64Single
  however, there is **no reserved encoding** (bits 8-24 zero).
  VL=1 may occur dynamically
  at runtime, even when bits 8-31 are zero.
  
-| 0-5    | 6 | 7 | 8-31  | 32-37  | 38-63   |
-|--------|---|---|-------|--------|---------|
-| PO (9)?| 0 | 1 | nnnn  | PO2    | SVP64:{EXT200-263} |
+| 0-5    | 6 | 7 | 8-31  | 32 | 33-37   | 38-63   |
+|--------|---|---|-------|----|---------|---------|
+| PO (9)?| 0 | 1 | nnnn  | 1  |PO2[1:5] | SVP64:{EXT232-263} |
  
-**SVP64:{EXT000-063}** bit6=old bit7=vector
+**RESERVED1 / EXT300-363** bit6=old bit7=scalar
  
-This encoding is identical to **SVP64:{EXT200-263}** except it
-is the Vectorisation of existing v3.0/3.1 Scalar-words, EXT000-063.
-All the same rules apply with the addition that
-Vectorisation of EXT001 is prohibited.
+This is at the discretion of the ISA WG. Libre-SOC is *not*
+proposing the addition of EXT300-363: it is merely a possibility.
+
+| 0-5    | 6 | 7 | 8-31  | 32-63   |
+|--------|---|---|-------|---------|
+| PO (9)?| 1 | 0 | 0000  | EXT300-363 or `RESERVED1` |
+
+**RESERVED2 / EXT200-231** bit6=new bit32=0
  
-| 0-5    | 6 | 7 | 8-31  | 32-37  | 38-63   |
-|--------|---|---|-------|--------|---------|
-| PO (9)?| 1 | 1 | nnnn  | PO2    | SVP64:{EXT000-063} |
+This is at the discretion of the ISA WG. Libre-SOC is *not*
+proposing the addition of EXT200-231: it is merely a possibility.
+
+| 0-5    | 6 | 7 | 8-31  | 32 | 33-37   | 38-63   |
+|--------|---|---|-------|----|---------|---------|
+| PO (9)?| 0 | x | nnnn  | 0  |PO2[1:5] | {EXT200-231} |
+
+\newpage{}
+# Example Legal Encodings and RESERVED spaces
+
+This section illustrates what is a legal encoding, what is not, and
+why the 4 spaces should be `RESERVED` even if not allocated as part
+of this RFC.
+
+**legal, scalar and vector**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 32bit | fishmv    | none         | 0x12345678| scalar EXT0nn |
+| 64bit | ss.fishmv | 0x26!zero    | 0x12345678| scalar SVP64Single:EXT0nn |
+| 64bit | sv.fishmv | 0x27nnnnnn   | 0x12345678| vector SVP64:EXT0nn |
+
+OR:
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 64bit | fishmv    | 0x24000000   | 0x12345678| scalar EXT2nn |
+| 64bit | ss.fishmv | 0x24!zero    | 0x12345678| scalar SVP64Single:EXT2nn |
+| 64bit | sv.fishmv | 0x25nnnnnn   | 0x12345678| vector SVP64:EXT2nn |
+
+Here the encodings are the same, 0x12345678 means the same thing in
+all cases. Anything other than this risks either damage (truncation
+of capabilities of Simple-V) or far greater complexity in the
+Decode Phase.
+
+This drives the compromise proposal (above) to reserve certain
+EXT2nn POs right
+across the board
+(in the Scalar Suffix side, irrespective of Prefix), some allocated
+to Simple-V, some not.
+
+**illegal due to missing**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 32bit | fishmv    | none         | 0x12345678| scalar EXT0nn |
+| 64bit | ss.fishmv | 0x26!zero    | 0x12345678| scalar SVP64Single:EXT0nn |
+| 64bit | unallocated | 0x27nnnnnn   | 0x12345678| vector SVP64:EXT0nn |
+
+This is illegal because the instruction is possible to Vectorise,
+therefore it should be **defined** as Vectoriseable.
+
+**illegal due to unvectoriseable**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 32bit | mtmsr     | none         | 0x12345678| scalar EXT0nn |
+| 64bit | ss.mtmsr  | 0x26!zero    | 0x12345678| scalar SVP64Single:EXT0nn |
+| 64bit | sv.mtmsr  | 0x27nnnnnn   | 0x12345678| vector SVP64:EXT0nn |
+
+This is illegal because the instruction `mtmsr` is not possible to Vectorise,
+at all.  This does **not** convey an opportunity to allocate the
+space to an alternative instruction.
+
+**illegal unvectoriseable in EXT2nn**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 64bit | mtmsr2    | 0x24000000   | 0x12345678| scalar EXT2nn |
+| 64bit | ss.mtmsr2 | 0x24!zero    | 0x12345678| scalar SVP64Single:EXT2nn |
+| 64bit | sv.mtmsr2 | 0x25nnnnnn   | 0x12345678| vector SVP64:EXT2nn |
+
+For a given hypothetical `mtmsr2` which is inherently Unvectoriseable,
+whilst it may be put into the scalar EXT2nn space, it may **not** be
+allocated in the Vector space. As with Unvectoriseable EXT0nn opcodes
+this does not convey the right to use the 0x24/0x26 space for alternative
+opcodes.  This hypothetical Unvectoriseable operation would be better off
+being allocated as EXT001 Prefixed, EXT000-063, or hypothetically in
+EXT300-363.
+
+**ILLEGAL: dual allocation**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 32bit | fredmv    | none         | 0x12345678| scalar EXT0nn |
+| 64bit | ss.fredmv | 0x26!zero    | 0x12345678| scalar SVP64Single:EXT0nn |
+| 64bit | sv.fishmv  | 0x27nnnnnn   | 0x12345678| vector SVP64:EXT0nn |
+
+The use of 0x12345678 for fredmv in scalar but fishmv in Vector is
+illegal.  The suffix in both 64-bit locations
+must be allocated to a Vectoriseable EXT000-063
+"Defined Word" (Public v3.1 Section 1.6.3 definition)
+or not at all.
+
+\newpage{}
+
+**illegal unallocated scalar EXT0nn or EXT2nn:**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 32bit | unallocated | none         | 0x12345678| scalar EXT0nn |
+| 64bit | ss.fredmv | 0x26!zero    | 0x12345678| scalar SVP64Single:EXT0nn |
+| 64bit | sv.fishmv  | 0x27nnnnnn   | 0x12345678| vector SVP64:EXT0nn |
+
+and:
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 64bit | unallocated | 0x24000000 | 0x12345678| scalar EXT2nn |
+| 64bit | ss.fishmv | 0x24!zero    | 0x12345678| scalar SVP64Single:EXT2nn |
+| 64bit | sv.fishmv | 0x25nnnnnn   | 0x12345678| vector SVP64:EXT2nn |
+
+Both of these Simple-V operations are illegally-allocated. The fact that
+there does not exist a scalar "Defined Word" (even for EXT200-263) - the
+unallocated block - means that the instruction may **not** be allocated in
+the Simple-V space.
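
The legal and illegal tables so far can be summarised as a small rule check. This is only an illustrative sketch: `check_allocation` and its arguments are hypothetical names, each slot is modelled as a mnemonic string or `None` for "unallocated", and an unallocated SVP64Single slot is assumed to be acceptable because that space is requested as `RESERVED` for a future RFC.

```python
from typing import List, Optional


def check_allocation(scalar: Optional[str],
                     single: Optional[str],
                     vector: Optional[str],
                     vectoriseable: bool) -> List[str]:
    """Return the rule violations for one suffix value (empty list = legal).

    scalar : instruction defined for the Scalar "Defined Word", or None
    single : instruction placed in the SVP64Single space, or None
    vector : instruction placed in the SVP64 (Vector) space, or None
    """
    problems = []
    if scalar is None:
        # Nothing may live in Simple-V space without a Scalar Defined Word.
        if single is not None or vector is not None:
            problems.append("no scalar Defined Word")
        return problems
    if not vectoriseable:
        # Unvectoriseable: the Simple-V slots stay reserved, never reused.
        if single is not None or vector is not None:
            problems.append("unvectoriseable instruction in Simple-V space")
        return problems
    if vector is None:
        problems.append("vectoriseable instruction missing from SVP64")
    elif vector != scalar:
        problems.append("dual allocation in SVP64")
    if single is not None and single != scalar:
        problems.append("dual allocation in SVP64Single")
    return problems


if __name__ == "__main__":
    print(check_allocation("fishmv", "fishmv", "fishmv", True))
    print(check_allocation("fredmv", "fredmv", "fishmv", True))
    print(check_allocation(None, "fredmv", "fishmv", True))
```

Each call corresponds to one three-row table: the first is the legal case, the second the dual-allocation case, and the third the unallocated-scalar case.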
+
+**illegal attempt to put Scalar EXT004 into Vector EXT2nn**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 32bit | unallocated | none         | 0x10345678| scalar EXT0nn |
+| 64bit | ss.fishmv | 0x24!zero    | 0x10345678| scalar SVP64Single:EXT2nn |
+| 64bit | sv.fishmv | 0x25nnnnnn   | 0x10345678| vector SVP64:EXT2nn |
+
+This is an illegal attempt to place an EXT004 "Defined Word"
+(Public v3.1 Section 1.6.3) into the EXT2nn Vector space.
+This is not just illegal: it is not even possible to achieve.
+If attempted, by dropping EXT004 into bits 32-37, the top two
+MSBs are actually *zero*, and the Vector EXT2nn space is only
+legal for Primary Opcodes in the range 232-263, where the top
+two MSBs are 0b11.  Thus this faulty attempt actually falls
+unintentionally
+into `RESERVED` "Non-Vectoriseable" Encoding space.
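
The bit arithmetic behind this example can be checked directly. A minimal sketch, with hypothetical helper names, that extracts the Primary Opcode field (bits 32-37 of the 64-bit encoding, i.e. the top six bits of the suffix word) and the two most significant bits of that field:

```python
def suffix_po(suffix: int) -> int:
    """Top six bits of the 32-bit suffix word (bits 32-37 of the encoding)."""
    return (suffix >> 26) & 0x3F


def suffix_top_two(suffix: int) -> int:
    """Two most significant bits of the suffix word (bits 32-33)."""
    return (suffix >> 30) & 0b11


if __name__ == "__main__":
    word = 0x10345678  # the suffix used in the table above
    print(suffix_po(word))       # Primary Opcode field of the suffix
    print(suffix_top_two(word))  # bits 32-33
```

For the suffix in the table the opcode field is 4 and bits 32-33 are both zero, which is why the encoding cannot land in the Vector EXT2nn area.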
+
+**illegal attempt to put Scalar EXT001 into Vector space**
+
+| width | assembler | prefix?      | suffix    | description   |
+|-------|-----------|--------------|-----------|---------------|
+| 64bit | EXT001    | 0x04nnnnnn   | any       | scalar EXT001 |
+| 96bit | sv.EXT001 | 0x24!zero    | EXT001    | scalar SVP64Single:EXT001 |
+| 96bit | sv.EXT001 | 0x25nnnnnn   | EXT001    | vector SVP64:EXT001 |
+
+This becomes in effect an effort to define 96-bit instructions,
+which are illegal due to cost at the Decode Phase (Variable-Length
+Encoding). Likewise attempting to embed EXT009 (chained) is also
+illegal. Unfortunately the implication is clear: all 64-bit
+EXT001 Scalar instructions are Unvectoriseable.
  
  \newpage{}
  # Use cases
@@ -649,32 +1161,38 @@ prohibited either.
  
  <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_predication.py;hb=HEAD>
  
-## 3D GPU style "Branch Conditional"
+## Matrix Multiply
  
-(*Note: Specification is ready, Simulator still under development of
-full specification capabilities*)
-This example demonstrates a 2-long Vector Branch-Conditional only
-succeeding if *all* elements in the Vector are successful.  This
-avoids the need for additional instructions that would need to
-perform a Parallel Reduction of a Vector of Condition Register
-tests down to a single value, on which a Scalar Branch-Conditional
-could then be performed.  Full Rationale at
-<https://libre-soc.org/openpower/sv/branches/>
+Matrix Multiply of any size (non-power-2) up to a total of 127 operations
+is achievable with only three instructions.  Normally in any other SIMD
+ISA at least one source requires Transposition and often massive rolling
+repetition of data is required.  These 3 instructions may be used as the
+"inner triple-loop kernel" of the usual 6-loop Massive Matrix Multiply.
  
  ```
-  80   # test_sv_branch_cond_all
-  81       for i in [7, 8, 9]:
-  83               addi 1, 0, i+1        # set r1 to i
-  84               addi 2, 0, i          # set r2 to i
-  85               cmpi cr0, 1, 1, 8     # compare r1 with 10 and store to cr0
-  86               cmpi cr1, 1, 2, 8     # compare r2 with 10 and store to cr1
-  87               sv.bc/all 12, *1, 0xc # bgt 0xc - branch if BOTH
-  88                                     # r1 AND r2 greater 8 to the nop below
-  89               addi 3, 0, 0x1234,    # if tests fail this shouldn't execute
-  90               or 0, 0, 0            # branch target
+  28     # test_sv_remap1   5x4 by 4x3 matrix multiply
+  29                        svshape 5, 4, 3, 0, 0
+  30                        svremap 31, 1, 2, 3, 0, 0, 0
+  31                        sv.fmadds *0, *8, *16, *0
  ```
  
-<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_bc.py;hb=HEAD>
+<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD>
+
+## Parallel Reduction
+
+Parallel (Horizontal) Reduction is often deeply problematic in SIMD and
+Vector ISAs.  Parallel Reduction is Fully Deterministic in Simple-V and
+thus may even usefully be deployed on non-associative and non-commutative
+operations.
+
+```
+  75     # test_sv_remap2
+  76                        svshape 7, 0, 0, 7, 0
+  77                        svremap 31, 1, 0, 0, 0, 0, 0 # different order
+  78                        sv.subf *0, *8, *16
+```
+
+<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_parallel_reduce.py;hb=HEAD>
  
  \newpage{}
  ## DCT
@@ -704,44 +1222,78 @@ The cosine table may be computed (once) with 18 Vector instructions
  
  <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
  
-## Matrix Multiply
+## 3D GPU style "Branch Conditional"
  
-Matrix Multiply of any size (non-power-2) up to a total of 127 operations
-is achievable with only three instructions.  Normally in any other SIMD
-ISA at least one source requires Transposition and often massive rolling
-repetition of data is required.  These 3 instructions may be used as the
-"inner triple-loop kernel" of the usual 6-loop Massive Matrix Multiply.
+(*Note: Specification is ready, Simulator still under development of
+full specification capabilities*)
+This example demonstrates a 2-long Vector Branch-Conditional only
+succeeding if *all* elements in the Vector are successful.  This
+avoids the need for additional instructions that would need to
+perform a Parallel Reduction of a Vector of Condition Register
+tests down to a single value, on which a Scalar Branch-Conditional
+could then be performed.  Full Rationale at
+<https://libre-soc.org/openpower/sv/branches/>
  
  ```
-  28     # test_sv_remap1   5x4 by 4x3 matrix multiply
-  29                        svshape 5, 4, 3, 0, 0
-  30                        svremap 31, 1, 2, 3, 0, 0, 0
-  31                        sv.fmadds *0, *8, *16, *0
+  80   # test_sv_branch_cond_all
+  81       for i in [7, 8, 9]:
+  83               addi 1, 0, i+1        # set r1 to i+1
+  84               addi 2, 0, i          # set r2 to i
+  85               cmpi cr0, 1, 1, 8     # compare r1 with 8 and store to cr0
+  86               cmpi cr1, 1, 2, 8     # compare r2 with 8 and store to cr1
+  87               sv.bc/all 12, *1, 0xc # bgt 0xc - branch if BOTH
+  88                                     # r1 AND r2 greater 8 to the nop below
+  89               addi 3, 0, 0x1234,    # if tests fail this shouldn't execute
+  90               or 0, 0, 0            # branch target
  ```
  
-<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD>
+<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_bc.py;hb=HEAD>
  
-## Parallel Reduction
+## Big-Integer Math
  
-Parallel (Horizontal) Reduction is often deeply problematic in SIMD and
-Vector ISAs.  Parallel Reduction is Fully Deterministic in Simple-V and
-thus may even usefully be deployed on non-associative and non-commutative
-operations.
+Remarkably, `sv.adde` is inherently a big-integer Vector Add, using `CA`
+chaining between **Scalar** operations.
+Using Vector LD/ST and recalling that the first and last `CA` may
+be chained in and out of an entire **Vector**, unlimited-length arithmetic is
+possible.
  
  ```
-  75     # test_sv_remap2
-  76                        svshape 7, 0, 0, 7, 0
-  77                        svremap 31, 1, 0, 0, 0, 0, 0 # different order
-  78                        sv.subf *0, *8, *16
-  79
-  80                 REMAP sv.subf RT,RA,RB - inverted application of RA/RB
-  81                                          left/right due to subf
+  26     # test_sv_bigint_add
+  32
+  33         r3/r2: 0x0000_0000_0000_0001 0xffff_ffff_ffff_ffff +
+  34         r5/r4: 0x8000_0000_0000_0000 0x0000_0000_0000_0001 =
+  35         r1/r0: 0x8000_0000_0000_0002 0x0000_0000_0000_0000
+  36
+  37                          sv.adde *0, *2, *4
  ```
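
The arithmetic in the listing above can be reproduced with a plain carry-chained loop. This is a sketch of the idea only, not the simulator's implementation: `bigint_add` is a hypothetical name, limbs are 64-bit and ordered least-significant first (r2 before r3, r4 before r5), and each loop iteration stands in for one element of the Vector add-with-carry.

```python
MASK64 = (1 << 64) - 1


def bigint_add(a_limbs, b_limbs, carry_in=0):
    """Add two equal-length lists of 64-bit limbs, least-significant first.

    Returns (result_limbs, carry_out); the carry is chained from each
    element into the next, as CA is in the listing above.
    """
    result = []
    carry = carry_in
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry
        result.append(total & MASK64)
        carry = total >> 64
    return result, carry


if __name__ == "__main__":
    r2_r3 = [0xFFFF_FFFF_FFFF_FFFF, 0x0000_0000_0000_0001]
    r4_r5 = [0x0000_0000_0000_0001, 0x8000_0000_0000_0000]
    limbs, carry_out = bigint_add(r2_r3, r4_r5)
    print([hex(x) for x in limbs], carry_out)
```

Passing the returned carry as `carry_in` of the next call models chaining the carry out of one Vector and into the next for longer operands.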
  
-<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_parallel_reduce.py;hb=HEAD>
+A 128/64-bit shift may be used as a Vector shift by a Scalar amount, by merging
+two 64-bit consecutive registers in succession.
+
+```
+  62     # test_sv_bigint_scalar_shiftright(self):
+  64
+  65     r3                    r2                    r1                       r4
+  66     0x0000_0000_0000_0002 0x8000_8000_8000_8001 0xffff_ffff_ffff_ffff >> 4
+  67     0x0000_0000_0000_0002 0x2800_0800_0800_0800 0x1fff_ffff_ffff_ffff
+  68
+  69                          sv.dsrd *0,*1,4,1
+```
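
The values in the listing above can be checked with a generic multi-limb right shift. This sketch is an assumption-laden model and not the definition of `dsrd`: `shift_right_limbs` is a hypothetical name, limbs are 64-bit and ordered least-significant first (r1, r2, r3), and each output limb is formed by merging a limb with its more-significant neighbour as a 128-bit value before shifting.

```python
MASK64 = (1 << 64) - 1


def shift_right_limbs(limbs, shift):
    """Shift a little-endian list of 64-bit limbs right by 0 < shift < 64.

    Each output limb takes its top bits from the next (more significant)
    limb; zeros are shifted into the most significant limb.
    """
    assert 0 < shift < 64
    out = []
    for i, low in enumerate(limbs):
        high = limbs[i + 1] if i + 1 < len(limbs) else 0
        merged = (high << 64) | low  # 128-bit pair of consecutive limbs
        out.append((merged >> shift) & MASK64)
    return out


if __name__ == "__main__":
    r1, r2, r3 = 0xFFFF_FFFF_FFFF_FFFF, 0x8000_8000_8000_8001, 0x2
    shifted = shift_right_limbs([r1, r2, r3], 4)
    print([hex(x) for x in shifted[:2]])  # the two limbs shown in the listing
```

Only the two lowest limbs are compared with the listing, which shows r3 unchanged; the sketch shifts the top limb as well.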
+
+Additional 128/64 Mul and Div/Mod instructions may similarly be exploited
+to perform roll-over in arbitrary-length arithmetic: effectively they use
+one of the two 64-bit output registers as a form of "64-bit Carry In-Out".
+
+All of these big-integer instructions are Scalar instructions standing on
+their own merit and may be utilised even in a Scalar environment to improve
+performance.  When used with Simple-V they may also be used to improve
+performance and also greatly simplify unlimited-length biginteger algorithms.
+
+<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_bigint.py;hb=HEAD>
  
  [[!tag opf_rfc]]
  
+[^zolc]: first introduced in DSPs, Zero-Overhead Loops are astoundingly effective in reducing total number of instructions executed or needed. [ZOLC](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf) reduces instructions by **25 to 80 percent**.
  [^msr]: an MSR bit or bits, conceptually equivalent to `MSR.SF` and added for the same reasons, would suffice perfectly.
  [^extend]: Prefix opcode space (or MSR bits) **must** be reserved in advance to do so, in order to avoid the catastrophic binary-incompatibility mistake made by RISC-V RVV and ARM SVE/2
  [^likeext001]: SVP64-Single is remarkably similar to the "bit 1" of EXT001 being set to indicate that the 64-bits is to be allocated in full to a new encoding, but in fact SVP64-single still embeds v3.0 Scalar operations.
@@ -749,4 +1301,10 @@ operations.
  [^only2]: reminder that this proposal only needs 75% of two POs for Scalar instructions. The rest of EXT200-263 is for general use.
  [^ext001]: Recall that EXT100 to EXT163 is for Public v3.1 64-bit-augmented Operations prefixed by EXT001, for which, from Section 1.6.3, bit 6 is set to 1.  This concept is where the above scheme originated. Section 1.6.3 uses the term "defined word" to refer to pre-existing EXT000-EXT063 32-bit instructions so prefixed to create the new numbering EXT100-EXT163, respectively
  [^futurevsx]: A future version or other Stakeholder *may* wish to drop Simple-V onto VSX: this would be a separate RFC
-[^vsx256]: imagine a hypothetical future VSX-256 using the exact same instructions as VSX. it would catstrophically damage existing IBM POWER8,9,10 hardware's reputation and that of Power ISA overall.
+[^vsx256]: imagine a hypothetical future VSX-256 using the exact same instructions as VSX. the binary incompatibility introduced would catastrophically **and retroactively** damage existing IBM POWER8,9,10 hardware's reputation and that of Power ISA overall.
+[^autovec]: Compiler auto-vectorisation for best exploitation of SIMD and Vector ISAs on Scalar programming languages (c, c++) is an Industry-wide known-hard decades-long problem. Cross-reference the number of hand-optimised assembler algorithms.
+[^hphint]: intended for use when the compiler has determined the extent of Memory or register aliases in loops: `a[i] += a[i+4]` would necessitate a Vertical-First hphint of 4
+[^svshape]: although SVSHAPE0-3 should, realistically, be regarded as high a priority as SVSTATE, and given corresponding SVSRR and SVLR equivalents, it was felt that having to context-switch **five** SPRs on Interrupts and function calls was too much.
+[^whoops]: two efforts were made to mix non-uniform encodings into Simple-V space: one deliberate to see how it would go, and one accidental. They both went extremely badly, the deliberate one costing over two months to add then remove.
+[^mul]: Setting this "multiplier" to 1 clearly leaves pre-existing Scalar behaviour completely intact as a degenerate case.
+[^ldstcisc]: At least the CISC "auto-increment" modes are not present, from the CDC 6600 and Motorola 68000! although these would be fun to introduce they do unfortunately make for 3-in 3-out register profiles, all 64-bit, which explains why the 6600 and 68000 had separate special dedicated address regfiles.