whitespace

author Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Sat, 8 Apr 2023 16:03:42 +0000 (17:03 +0100)

committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Sat, 8 Apr 2023 16:03:42 +0000 (17:03 +0100)
diff --git a/openpower/sv/rfc/ls012.mdwn b/openpower/sv/rfc/ls012.mdwn

index f725ddf2c62cc7a0449dc5604425aa34bbce6651..4eb5f18496b9b86736f3d337ed57f065bd96f8c3 100644 (file)
--- a/openpower/sv/rfc/ls012.mdwn
+++ b/openpower/sv/rfc/ls012.mdwn
@@ -8,54 +8,54 @@ The purpose of this RFC is:
  
  * to give a full list of the upcoming Scalar opcodes developed by Libre-SOC
    (respecting that *all* of them are Vectoriseable)
-* formally agree a priority order on an itertive basis with new versions of this RFC,
+* formally agree a priority order on an iterative basis with new versions
+  of this RFC,
  * which ones should be EXT022 Sandbox, which in EXT0xx, which in EXT2xx,
  * and for IBM to get a clear picture of the Opcode Allocation needs.
  
-As this is a Formal ISA RFC the evaluation
-shall ultimatly define (in advance of the actual submission of the instructions
-themselves) which instructions will be submitted over the next 8-18
-months.
-
-*It is expected that readers visit and interact with the Libre-SOC resources
-in order to do due-diligence on the prioritisation evaluation. Otherwise
-the ISA WG is overwhelmed by "drip-fed" RFCs that may turn out not
-to be useful, against a background of having no guiding overview
-or pre-filtering, and everybody's precious time is wasted.
-Also note that the Libre-SOC Team, being funded by NLnet
+As this is a Formal ISA RFC the evaluation shall ultimately define
+(in advance of the actual submission of the instructions themselves)
+which instructions will be submitted over the next 8-18 months.
+
+*It is expected that readers visit and interact with the Libre-SOC
+resources in order to do due-diligence on the prioritisation
+evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs
+that may turn out not to be useful, against a background of having
+no guiding overview or pre-filtering, and everybody's precious time
+is wasted.  Also note that the Libre-SOC Team, being funded by NLnet
  under Privacy and Enhanced Trust Grants, are **prohibited** from signing
-Commercial-Confidentiality NDAs, as doing so is a direct conflict of interest
-with their funding body's Charitable Foundation Status and remit*.
-
-Worth bearing in mind during evaluation that every "Defined
-Word" may or may not be Vectoriseable, but that every "Defined Word"
-should have merits on its own, not just when Vectorised.  An example
-of a borderline Vectoriseable Defined Word is `mv.swizzle` which
-only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads,
-but has less merit as a Scalar-only operation.
-
-Power ISA Scalar (SFFS) has not been significantly advanced in 12 years:
-IBM's primary focus has understandably been on PackedSIMD VSX.
-Unfortunately, with VSX being 914 instructions and 128-bit it is far too much for any
-new team to consider (10 years development effort) and far outside of
-Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar
-up-to-date to modern standards *and on its own merits* is a reasonable goal,
-and the advantages of the reduced focus is that SFFS remains RISC-paradigm,
-and  that lessons can be learned from other ISAs from the intervening years.
-Good examples here include `bmask`.
+Commercial-Confidentiality NDAs, as doing so is a direct conflict of
+interest with their funding body's Charitable Foundation Status and
+remit*.
+
+Worth bearing in mind during evaluation that every "Defined Word" may
+or may not be Vectoriseable, but that every "Defined Word" should have
+merits on its own, not just when Vectorised.  An example of a borderline
+Vectoriseable Defined Word is `mv.swizzle` which only really becomes
+high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
+less merit as a Scalar-only operation.
+
+Power ISA Scalar (SFFS) has not been significantly advanced in 12
+years: IBM's primary focus has understandably been on PackedSIMD VSX.
+Unfortunately, with VSX being 914 instructions and 128-bit it is far too
+much for any new team to consider (10 years development effort) and far
+outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
+Power Scalar up-to-date to modern standards *and on its own merits*
+is a reasonable goal, and the advantage of the reduced focus is that
+SFFS remains RISC-paradigm, and that lessons can be learned from other
+ISAs from the intervening years.  Good examples here include `bmask`.
  
  SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
  as well as "True-Scalable-Vector Prefixing" - also literally brings new
  dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
-it has to unavoidably and simultaneously be taken into consideration their value when
-Vector-Prefixed, *as well as* SVP64Single-Prefixed.
+it is unavoidable that consideration must simultaneously be given to
+their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed.
  
  **Target areas**
  
-Whilst entirely general-purpose there are some categories that
-these instructions are targetting: Bitmanipulation, Big-integer,
-cryptography, Audio/Visual, High-Performance Compute, GPU workloads
-and DSP.
+Whilst entirely general-purpose there are some categories that these
+instructions are targeting: Bitmanipulation, Big-integer, cryptography,
+Audio/Visual, High-Performance Compute, GPU workloads and DSP.
  
  **Instruction count guide and approximate priority order**
  
@@ -86,247 +86,265 @@ to this RFC.
  
  ## SVP64 Management instructions
  
-These without question have to go in EXT0xx.  Future extended variants, bringing
-even more powerful capabilities, can be followed up later with EXT1xx prefixed
-variants, which is not possible if placed in EXT2xx.
-*Only `svstep` is actually Vectoriseable*, all other Management instructions
-are UnVectoriseane.  PO1-Prefixed examples include adding psvshape in order to
-support both Inner and
-Outer Product Matrix Schedules, by providing the option to directly reverse the
-order of the triple loops.  Outer is used for standard Matrix Multiply, but Inner
-is required for Warshall Transitive Closure (on top of a cumulatively-applied
+These without question have to go in EXT0xx.  Future extended variants,
+bringing even more powerful capabilities, can be followed up later with
+EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
+*Only `svstep` is actually Vectoriseable*, all other Management
+instructions are UnVectoriseable.  PO1-Prefixed examples include adding
+psvshape in order to support both Inner and Outer Product Matrix
+Schedules, by providing the option to directly reverse the order of the
+triple loops.  Outer is used for standard Matrix Multiply, but Inner is
+required for Warshall Transitive Closure (on top of a cumulatively-applied
  max instruction).
  
-The Management Instructions themselves are all Scalar Operations, so PO1-Prefixing
-is perfecly reasonable.  SVP64 Management instructions of which there are only
-6 are all 5 or 6 bit XO, meaning that the opcode space they take up in EXT0xx is
-not alarmingly high for their intrinsic strategic value.
+The Management Instructions themselves are all Scalar Operations, so
+PO1-Prefixing is perfectly reasonable.  SVP64 Management instructions of
+which there are only 6 are all 5 or 6 bit XO, meaning that the opcode
+space they take up in EXT0xx is not alarmingly high for their intrinsic
+strategic value.
  
  ## Transcendentals
  
-Found at [[openpower/transcendentals]] these subdivide into high priority for
-accelerating general-purpose and High-Performance Compute, specialist 3D GPU
-operations suited to 3D visualisation, and low-priority less common instructions
-where IEEE754 full bit-accuracy is paramount.  In 3D GPU scenarios for example
-even 12-bit accuracy can be overkill, but for HPC Scientific scenarios 12-bit
-would be disastrous.
-
-There are a **lot** of operations here, and they also bring Power ISA
-up-to-date to IEEE754-2019.  Fortunately the number of critical instructions
-is quite low, but the caveat is that if those operations are utilised to
-synthesise other IEEE754 operations (divide by `pi` for example) full bitlevel
-accuracy (a hard requirement for IEEE754) is lost.
-
-Also worth noting that the Khronos Group defines minimum acceptable bit-accuracy
-levels for 3D Graphics: these are **nowhere near* the full accuracy demanded
-by IEEE754, the reason for the Khronos definitions is a massive reduction often
-four-fold in power consumption and gate count when 3D Graphics simply has no need
-for full accuracy.
+Found at [[openpower/transcendentals]] these subdivide into high
+priority for accelerating general-purpose and High-Performance Compute,
+specialist 3D GPU operations suited to 3D visualisation, and low-priority
+less common instructions where IEEE754 full bit-accuracy is paramount.
+In 3D GPU scenarios for example even 12-bit accuracy can be overkill,
+but for HPC Scientific scenarios 12-bit would be disastrous.
+
+There are a **lot** of operations here, and they also bring Power
+ISA up-to-date to IEEE754-2019.  Fortunately the number of critical
+instructions is quite low, but the caveat is that if those operations
+are utilised to synthesise other IEEE754 operations (divide by `pi` for
+example) full bitlevel accuracy (a hard requirement for IEEE754) is lost.
+
+Also worth noting that the Khronos Group defines minimum acceptable
+bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
+accuracy demanded by IEEE754, the reason for the Khronos definitions is
+a massive reduction often four-fold in power consumption and gate count
+when 3D Graphics simply has no need for full accuracy.
  
  *For 3D GPU markets this definitely needs addressing*
  
  ## Audio/Video
  
-Found at [[sv/av_opcodes]] these do not require Saturated variants because Saturation
-is added via [[sv/svp64]] (Vector Prefixing) and via [[sv/svp64_single]] Scalar
-Prefixing. This is important to note for Opcode Allocation because placing these
-operations in the UnVectoriseble areas would irrediemably damage their value.
-Unlike PackedSIMD ISAs the actual number of AV Opcodes is remarkably small once
-the usual cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO)
-are abstracted out to RISC-paradigm Prefixing, leaving just absolute-diff-accumulate,
-min-max, average-add etc. as "basic primitives".
+Found at [[sv/av_opcodes]] these do not require Saturated variants
+because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
+[[sv/svp64_single]] Scalar Prefixing. This is important to note for
+Opcode Allocation because placing these operations in the UnVectoriseable
+areas would irredeemably damage their value.  Unlike PackedSIMD ISAs
+the actual number of AV Opcodes is remarkably small once the usual
+cascading-option-multipliers (SIMD width, bitwidth, saturation,
+HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
+absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".
  
  ## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV
  
-The number of uses in Computer Science for DCT, NTT, FFT and DFT, is astonishing.
-The wikipedia page lists over a hundred separate and distinct areas: Audio, Video,
-Radar, Baseband processing, AI, Solomon-Reed Error Correction, the list goes on and on.
-ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs
-have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
+The number of uses in Computer Science for DCT, NTT, FFT and DFT,
+is astonishing.  The Wikipedia page lists over a hundred separate and
+distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon
+Error Correction, the list goes on and on.  ARM has special dedicated
+Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT
+Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
  DSP can do full FFT triple loops in one VLIW group.
  
  It should be pretty clear this is high priority.
  
-With SVP64  [[sv/remap]] providing the Loop Schedules it falls to the Scalar side of
-the ISA to add the prerequisite "Twin Butterfly" operations, typically performing
-for example one multiply but in-place subtracting that product from one operand and
-adding it to the other.  The *in-place* aspect is strategically extremely important
-for significant reductions in Vectorised register usage, particularly for DCT.
+With SVP64  [[sv/remap]] providing the Loop Schedules it falls to
+the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
+operations, typically performing for example one multiply but in-place
+subtracting that product from one operand and adding it to the other.
+The *in-place* aspect is strategically extremely important for significant
+reductions in Vectorised register usage, particularly for DCT.
  
  ## CR Weird group
  
-Outlined in [[sv/cr_int_predication]] these instructions massively save on CR-Field
-instruction count.  Multi-bit to single-bit and vice-versa normally requiring several
-CR-ops (crand, crxor) are done in one single instruction.  The reason for their
-addition is down to SVP64 overloading CR Fields as Vector Predicate Masks.
-Reducing instruction count in hot-loops is considered high priority.
+Outlined in [[sv/cr_int_predication]] these instructions massively save
+on CR-Field instruction count.  Multi-bit to single-bit and vice-versa
+normally requiring several CR-ops (crand, crxor) are done in one single
+instruction.  The reason for their addition is down to SVP64 overloading
+CR Fields as Vector Predicate Masks.  Reducing instruction count in
+hot-loops is considered high priority.
  
-An additional need is to do popcount on CR Field bit vectors but adding such instructions
-to the *Condition Register* side was deemed to be far too much. Therefore, priority
-was giiven instead to transferring several CR Field bits into GPRs, whereupon
-the full set of tandard Scalar GPR Logical Operations may be used. This strategy
-has the side-effect of keeping the CRweird group down to only five instructions.
+An additional need is to do popcount on CR Field bit vectors but adding
+such instructions to the *Condition Register* side was deemed to be far
+too much. Therefore, priority was given instead to transferring several
+CR Field bits into GPRs, whereupon the full set of standard Scalar GPR
+Logical Operations may be used. This strategy has the side-effect of
+keeping the CRweird group down to only five instructions.
  
  ## Big-integer Math
  
-[[sv/biginteger]]  has always been a high priority area for commercial applications, privacy,
-Banking, as well as HPC Numerical Accuracy: libgmp as well as cryptographic uses
-in Asymmetric Ciphers. poly1305 and ec25519 are finding their way into everyday
-use via OpenSSL.
+[[sv/biginteger]]  has always been a high priority area for commercial
+applications, privacy, Banking, as well as HPC Numerical Accuracy:
+libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
+and ec25519 are finding their way into everyday use via OpenSSL.
  
-A very early variant of the Power ISA had a 32-bit Carry-in Carry-out SPR. Its
-removal from subsequent revisions is regrettable.  An alternative concept is
-to add six explicit 3-in 2-out operations that, on close inspection, always
-turn out to be supersets of *existing Scalar operations* that discard upper
-or lower DWords, or parts thereof.
+A very early variant of the Power ISA had a 32-bit Carry-in Carry-out
+SPR. Its removal from subsequent revisions is regrettable.  An alternative
+concept is to add six explicit 3-in 2-out operations that, on close
+inspection, always turn out to be supersets of *existing Scalar
+operations* that discard upper or lower DWords, or parts thereof.
  
  *Thus it is critical to note that not one single one of these operations
  expands the bitwidth of any existing Scalar pipelines*.
  
-The `dsld` instruction for example merely places additional LSBs into the 64-bit
-shift (64-bit carry-in), and then places the (normally discarded) MSBs into the second
-output register (64-bit carry-out). It does **not** require a 128-bit shifter to
-replace the existing Scalar Power ISA 64-bit shifters.
+The `dsld` instruction for example merely places additional LSBs into the
+64-bit shift (64-bit carry-in), and then places the (normally discarded)
+MSBs into the second output register (64-bit carry-out). It does **not**
+require a 128-bit shifter to replace the existing Scalar Power ISA
+64-bit shifters.
  
-The reduction in instruction count these operations bring, in critical hotloops,
-is remarkably high, to the extent where a Scalar-to-Vector operation of
-*arbitrary length* becomes just the one Vector-Prefixed instruction.
+The reduction in instruction count these operations bring, in critical
+hotloops, is remarkably high, to the extent where a Scalar-to-Vector
+operation of *arbitrary length* becomes just the one Vector-Prefixed
+instruction.
  
-Whilst these are 5-6 bit XO their utility is considered high strategic value
-and as such are strongly advocated to be in EXT04. The alternative is to bring
-back a 64-bit Carry SPR but how it is retrospectively applicable to pre-existing Scalar
-Power ISA mutiply, divide, and shift operations at this late stage of maturity of
-the Power ISA is an entire area of research on its own deemed unlikely to be
-achievable.
+Whilst these are 5-6 bit XO their utility is considered high strategic
+value and as such are strongly advocated to be in EXT04. The alternative
+is to bring back a 64-bit Carry SPR but how it is retrospectively
+applicable to pre-existing Scalar Power ISA multiply, divide, and shift
+operations at this late stage of maturity of the Power ISA is an entire
+area of research on its own deemed unlikely to be achievable.
  
  ## fclass and GPR-FPR moves
  
-[[sv/fclass]] - just one instruction.  With SFFS being locked down to exclude VSX,
-and there being no desire within the nascent OpenPOWER ecosystem outside of IBM to
-implement the VSX PackedSIMD paradigm, it becomes necessary to upgrade SFFS
-such that it is stand-alone capable. One omission based on the assumption
-that VSX would always be present is an equivalent to `xvtstdcsp`.
-
-Similar arguments apply to the GPR-INT move operations, proposed
-in [[ls006]], with the opportunity taken
-to add rounding modes present in other ISAs that Power ISA VSX PackedSIMD does not
-have. Javascript rounding, one of the worst offenders of Computer Science, requires
-a phenomental 35 instructions with *six branches* to emulate in Power ISA! For
-desktop as well as Server HTML/JS back-end execution of javascript this becomes an
-obvious priority, recognised already by ARM as just one example.
+[[sv/fclass]] - just one instruction.  With SFFS being locked down to
+exclude VSX, and there being no desire within the nascent OpenPOWER
+ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it
+becomes necessary to upgrade SFFS such that it is stand-alone capable. One
+omission based on the assumption that VSX would always be present is an
+equivalent to `xvtstdcsp`.
+
+Similar arguments apply to the GPR-INT move operations, proposed in
+[[ls006]], with the opportunity taken to add rounding modes present
+in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
+rounding, one of the worst offenders of Computer Science, requires a
+phenomenal 35 instructions with *six branches* to emulate in Power
+ISA! For desktop as well as Server HTML/JS back-end execution of
+javascript this becomes an obvious priority, recognised already by ARM
+as just one example.
  
  ## Bitmanip LUT2/3
  
-These LUT2/3 operations are high cost high reward. Outlined in [[sv/bitmanip]],
-the simplest ones already exist in PackedSIMD VSX: `xxeval`.
-The same reasoning applies as to fclass: SFFS needs to be stand-alone on its
-own merits and not "punished" should an implementor choose not to implement
-any aspect of PackedSIMD VSX.
+These LUT2/3 operations are high cost high reward. Outlined in
+[[sv/bitmanip]], the simplest ones already exist in PackedSIMD VSX:
+`xxeval`.  The same reasoning applies as to fclass: SFFS needs to be
+stand-alone on its own merits and not "punished" should an implementor
+choose not to implement any aspect of PackedSIMD VSX.
  
-With Predication being such a high priority in GPUs and HPC, CR Field variants
-of Ternary and Binary LUT instructions were considered high priority, and again
-just like in the CRweird group the opportunity was taken to work on *all*
-bits of a CR Field rather than just one bit as is done with the existing CR operations
-crand, cror etc.
+With Predication being such a high priority in GPUs and HPC, CR Field
+variants of Ternary and Binary LUT instructions were considered high
+priority, and again just like in the CRweird group the opportunity was
+taken to work on *all* bits of a CR Field rather than just one bit as
+is done with the existing CR operations crand, cror etc.
  
-The other high strategic value instruction is `grevlut` (and  `grevluti` which can
-generate a remarkably large number of regular-patterned magic constants).
-The grevlut set require of the order of 20,000 gates but provide an astonishing
-plethora of innovative bit-permuting instructions never seen in any other ISA.
+The other high strategic value instruction is `grevlut` (and  `grevluti`
+which can generate a remarkably large number of regular-patterned magic
+constants).  The grevlut set requires of the order of 20,000 gates but
+provides an astonishing plethora of innovative bit-permuting instructions
+never seen in any other ISA.
  
-The downside of all of these instructions is the extremely low XO bit requirements:
-2-3 bit XO due to the large immediates *and* the number of operands required.
-The LUT3 instructions are already compacted down to "Overwrite" variants.
-(By contrast the Float-Load-Immediate instructions are a much larger XO because
-despite having 16-bit immediate only one Register Operand is needed).
+The downside of all of these instructions is the extremely low XO bit
+requirements: 2-3 bit XO due to the large immediates *and* the number of
+operands required.  The LUT3 instructions are already compacted down to
+"Overwrite" variants.  (By contrast the Float-Load-Immediate instructions
+are a much larger XO because despite having 16-bit immediate only one
+Register Operand is needed).
  
-Realistically these high-value instructions should be proposed in EXT2xx where
-their XO cost does not overwhelm EXT0xx.
+Realistically these high-value instructions should be proposed in EXT2xx
+where their XO cost does not overwhelm EXT0xx.
  
  
  ## (f)mv.swizzle
  
-[[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value as a Scalar
-instruction is limited *except* if combined with `cmpi` and SVP64Single
-Predication, whereupon the end result is the RISC-synthesis of Compare-and-Swap,
-in two instructions.
+[[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value
+as a Scalar instruction is limited *except* if combined with `cmpi` and
+SVP64Single Predication, whereupon the end result is the RISC-synthesis
+of Compare-and-Swap, in two instructions.
  
-Where this instruction comes into its full value is when Vectorised.  3D GPU
-and HPC numerical workloads astonishingly contain between 10 to 15% swizzle
-operations: access to YYZ, XY, of an XYZW Quaternion, performing balancing
-of ARGB pixel data. The usage is so high that 3D GPU ISAs make Swizzle a first-class
-priority in their VLIW words. Even 64-bit Embedded GPU ISAs have a staggering
-24-bits dedicated to 2-operand Swizzle.
+Where this instruction comes into its full value is when Vectorised.
+3D GPU and HPC numerical workloads astonishingly contain between 10 to 15%
+swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
+balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
+Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded
+GPU ISAs have a staggering 24-bits dedicated to 2-operand Swizzle.
  
-So as not to radicalise the Power ISA the Libre-SOC team decided to introduce
-mv Swizzle operations, which can always be Macro-op fused in exactly the same
-way that ARM SVE predicated-move extends 3-operand "overwrite" opcodes to full
-independent 3-in 1-out.
+So as not to radicalise the Power ISA the Libre-SOC team decided to
+introduce mv Swizzle operations, which can always be Macro-op fused
+in exactly the same way that ARM SVE predicated-move extends 3-operand
+"overwrite" opcodes to full independent 3-in 1-out.
  
  ## BMI (bitmanipulation) group.
  
-Whilst the [[sv/vector_ops]] instructions are only two in number, in reality the
-`bmask` instruction has a Mode field allowing it to cover **24** instructions,
-more than have been added to any other CPUs by ARM, Intel or AMD.  Analyis of
-the BMI sets of these CPUs shows simple patterns that can greatly simplify both
-Decode and implementation. These are sufficiently commonly used, saving instruction
-count regularly, that they justify going into EXT0xx.
-
-The other instruction is `cprop` - Carry-Propagation - which takes the P and Q
-from carry-propagation algorithms and generates carry look-ahead. Greatly
-increases the efficiency of arbitrary-precision integer arithmetic by combining
-what would otherwise be half a dozen instructions into one. However it is
-still not a huge priority unlike `bmask` so is probably best placed in EXT2xx.
+Whilst the [[sv/vector_ops]] instructions are only two in number, in
+reality the `bmask` instruction has a Mode field allowing it to cover
+**24** instructions, more than have been added to any other CPUs by
+ARM, Intel or AMD.  Analysis of the BMI sets of these CPUs shows simple
+patterns that can greatly simplify both Decode and implementation. These
+are sufficiently commonly used, saving instruction count regularly,
+that they justify going into EXT0xx.
+
+The other instruction is `cprop` - Carry-Propagation - which takes
+the P and Q from carry-propagation algorithms and generates carry
+look-ahead. Greatly increases the efficiency of arbitrary-precision
+integer arithmetic by combining what would otherwise be half a dozen
+instructions into one. However it is still not a huge priority unlike
+`bmask` so is probably best placed in EXT2xx.
  
  ## Float-Load-Immediate
  
-Very easily justified.  As explained in [[ls002]] these
-always saves one LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
-FP value being in the I-Cache side.  It is such a high priority that these instuctions
-are easily justifiable adding into EXT0xx, despite requiring a 16-bit immediate.
-By designing the second-half instruction as a Read-Modify-Write it saves on XO
-bitlength (only 5 bits), and can be macro-op fused with its first-half to store a
-full IEEE754 FP32 immediate into a register.
-
-There is little point in putting these instructions into EXT2xx. Their very benefit
-and inherent value *is* as 32-bit instructions, not 64-bit ones. Likewise there is
-less value in taking up EXT1xx Enoding space because EXT1xx only brings an additional
-16 bits (approx) to the table, and that is provided already by the second-half
-instuction.
+Very easily justified.  As explained in [[ls002]] these always save one
+LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
+FP value being in the I-Cache side.  It is such a high priority that
+these instructions are easily justifiable adding into EXT0xx, despite
+requiring a 16-bit immediate.  By designing the second-half instruction
+as a Read-Modify-Write it saves on XO bitlength (only 5 bits), and can be
+macro-op fused with its first-half to store a full IEEE754 FP32 immediate
+into a register.
+
+There is little point in putting these instructions into EXT2xx. Their
+very benefit and inherent value *is* as 32-bit instructions, not 64-bit
+ones. Likewise there is less value in taking up EXT1xx Encoding space
+because EXT1xx only brings an additional 16 bits (approx) to the table,
+and that is provided already by the second-half instruction.
  
  Thus they qualify as both high priority and also EXT0xx candidates.
  
-## FPR/GPR LD/ST-PostIncrement-Update 
+## FPR/GPR LD/ST-PostIncrement-Update
  
-These instruction, outlined in [[ls011]], save hugely in hot-loops.  Early ISAs
-such as PDP-8, PDP-11, which inspired the iconic Motorola 68000, 88100, Mitch
-Alsup's MyISA 66000, and can even be traced back to the iconic ultra-RISC CDC 6600,
-all had both pre- and post- increment Addressing Modes.
+These instructions, outlined in [[ls011]], save hugely in hot-loops.
+Early ISAs such as PDP-8, PDP-11, which inspired the iconic Motorola
+68000, 88100, Mitch Alsup's MyISA 66000, and can even be traced back to
+the iconic ultra-RISC CDC 6600, all had both pre- and post- increment
+Addressing Modes.
  
-The reason is very simple: it is a direct recognition of the practice in c to
-frequently utilise both `*p++` and `*++p` which itself stems from common need in
-Computer Science algorithms.
+The reason is very simple: it is a direct recognition of the practice
+in C to frequently utilise both `*p++` and `*++p` which itself stems
+from common need in Computer Science algorithms.
  
-The problem for the Power ISA is - was - that the opcode space needed to support both
-was far too great, and the decision was made to go with pre-increment, on the basis
-that outside the loop a "pre-subtraction" may be performed.
+The problem for the Power ISA is - was - that the opcode space needed
+to support both was far too great, and the decision was made to go with
+pre-increment, on the basis that outside the loop a "pre-subtraction"
+may be performed.
  
-Whilst this is a "solution" it is less than ideal, and the opportunity exists now
-with the EXT2xx Primary Opcodes to correct this and bring Power ISA up a level.
+Whilst this is a "solution" it is less than ideal, and the opportunity
+exists now with the EXT2xx Primary Opcodes to correct this and bring
+Power ISA up a level.
  
  ## Shift-and-add
  
-Shift-and-Add are proposed in [[ls004]].  They mitigate the need to
-add LD-ST-Shift instructions which are a high-priority aspect of both
-x86 and ARM.  LD-ST-Shift is normally just the one instruction: Shift-and-add
+Shift-and-Add are proposed in [[ls004]].  They mitigate the need to add
+LD-ST-Shift instructions which are a high-priority aspect of both x86
+and ARM.  LD-ST-Shift is normally just the one instruction: Shift-and-add
  brings that down to two, where Power ISA presently requires three.
-Cryptography e.g. twofish also makes use of Integer double-and-add, so the value
-of these instructions is not limited to Effective Address computation.
-They will also have value in Audio DSP.
+Cryptography e.g. twofish also makes use of Integer double-and-add,
+so the value of these instructions is not limited to Effective Address
+computation.  They will also have value in Audio DSP.
  
-Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx when their
-whole purpose and value is to reduce binary size in Address offset computation,
-thus they are best placed in EXT0xx.
+Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
+when their whole purpose and value is to reduce binary size in Address
+offset computation, thus they are best placed in EXT0xx.
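
Note on the Big-integer Math text in this diff: the description of `dsld` (extra LSBs shifted in as a 64-bit carry-in, normally-discarded MSBs delivered as a 64-bit carry-out) can be illustrated with a minimal Python sketch. This is not the RFC's pseudocode: the function names `dsld_sketch` and `bigint_shl`, the operand order, and the restriction of the shift amount to 0..63 are assumptions made purely for illustration.

```python
MASK64 = (1 << 64) - 1

def dsld_sketch(ra, sh, carry_in):
    """Hypothetical model of a 64-bit left shift with carry-in and carry-out.

    ra       : 64-bit value to shift
    sh       : shift amount, assumed here to be 0..63
    carry_in : value whose low `sh` bits fill the vacated LSBs
    Returns (result, carry_out), both 64-bit.
    """
    assert 0 <= sh < 64
    low_mask = (1 << sh) - 1
    # Only a 64-bit result is kept: the vacated LSBs come from carry_in.
    result = ((ra << sh) & MASK64) | (carry_in & low_mask)
    # The MSBs that a plain shift would discard become the carry-out.
    carry_out = (ra & MASK64) >> (64 - sh)
    return result, carry_out

def bigint_shl(limbs, sh):
    """Shift a little-endian list of 64-bit limbs left by sh bits,
    chaining the carry-out of each limb into the next."""
    carry = 0
    out = []
    for limb in limbs:
        word, carry = dsld_sketch(limb, sh, carry)
        out.append(word)
    out.append(carry)  # final carry-out becomes the new top limb
    return out
```

Each step only ever produces two 64-bit outputs from 64-bit inputs, which is the point the text makes about not needing a 128-bit shifter; the Python big integers used inside `dsld_sketch` are a modelling convenience only.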
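
Note on the LD/ST-PostIncrement-Update text in this diff: the difference between `*p++` and `*++p`, and the "pre-subtraction" needed when only pre-increment addressing is available, can be shown with a small Python model. Memory is modelled as a list and the pointer as an index; the function names are hypothetical and nothing here reflects actual Power ISA instruction semantics beyond what the prose states.

```python
def sum_post_increment(mem, p, n):
    """Loop body in the style of `total += *p++`: use the pointer, then advance."""
    total = 0
    for _ in range(n):
        total += mem[p]  # access at the current address
        p += 1           # post-increment
    return total

def sum_pre_increment(mem, p, n):
    """Loop body in the style of `total += *++p`: advance, then use the pointer.

    Because the pointer is advanced before every access, it has to be
    biased back by one element before the loop starts (the
    "pre-subtraction" performed outside the loop).
    """
    p -= 1               # pre-subtraction outside the loop
    total = 0
    for _ in range(n):
        p += 1           # pre-increment
        total += mem[p]  # access at the updated address
    return total
```

Both functions visit the same elements; the pre-increment form simply pays for one extra adjustment outside the loop, which is the cost the text describes as "less than ideal".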