# External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

* <https://git.openpower.foundation/isa/PowerISA/issues/121>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1052>

The purpose of this RFC is to give a full list of the upcoming Scalar
opcodes developed by Libre-SOC, formally agree a priority order, which
ones should be EXT022 Sandbox, and for IBM to get a clear picture of
the Opcode Allocation needs.  As this is a Formal ISA RFC the evaluation
shall define (in advance of the actual submission of the instructions
themselves) which instructions should be submitted over the next 18
months.

*It is expected that readers visit and interact with the Libre-SOC resources
in order to do due-diligence on the prioritisation evaluation. Otherwise
the ISA WG is overwhelmed by piecemeal RFCs that may turn out not
to be useful, against a background of having no guiding overview
or pre-filtering*.

Worth bearing in mind during evaluation that every "Defined
Word" may or may not be Vectoriseable, but that every "Defined Word"
should have merits on its own, not just when Vectorised.  An example
of a borderline Vectoriseable Defined Word is `mv.swizzle` which
only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads,
but has less merit as a Scalar-only operation.

Power ISA Scalar (SFFS) has not been significantly advanced in 12 years:
IBM's primary focus has understandably been on PackedSIMD VSX.
Unfortunately, with VSX being 914 instructions and 128-bit it is far too much for any
new team to consider (10 years development effort) and far outside of
Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar
up-to-date to modern standards is a reasonable goal, and the advantage is
that lessons can be learned from other ISAs.

SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
as well as "True-Scalable-Vector Prefixing" - also literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed,
unavoidably has to be taken into consideration at the same time.

**Target areas**

Whilst the instructions are entirely general-purpose, there are some
categories that they are targeting: Bitmanipulation, Big-integer,
cryptography, Audio/Visual, High-Performance Compute, GPU workloads
and DSP.

**Instruction count guide and approximate priority order**

* 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
* 5 - CR weirds [[sv/cr_int_predication]]
* 4 - INT<->FP mv [[ls006]]
* 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
* ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
* 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
* 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
* 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
* 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
* 5 - Audio-Video [[sv/av_opcodes]]
* 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish)
* 2 - BMI group [[sv/vector_ops]]
* 2 - GPU swizzle [[sv/mv.swizzle]]
* 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly
* 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
* 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
* 25 - Transcendentals (2-arg) [[openpower/transcendentals]]

Summary tables are created below by different sort categories. Additional
columns as necessary can be requested to be added as part of update revisions
to this RFC.

# Target Area summaries

## Transcendentals

Found at [[openpower/transcendentals]] these subdivide into high priority for
accelerating general-purpose and High-Performance Compute, specialist 3D GPU
operations suited to 3D visualisation, and low-priority less common instructions
where IEEE754 full bit-accuracy is paramount.  In 3D GPU scenarios for example
even 12-bit accuracy can be overkill, but for HPC Scientific scenarios 12-bit
would be disastrous.

There are a **lot** of operations here, and they also bring Power ISA
up-to-date to IEEE754-2019.  Fortunately the number of critical instructions
is quite low, but the caveat is that if those operations are utilised to
synthesise other IEEE754 operations (divide by `pi` for example) full bitlevel
accuracy (a hard requirement for IEEE754) is lost.

Also worth noting is that the Khronos Group defines minimum acceptable bit-accuracy
levels for 3D Graphics: these are **nowhere near** the full accuracy demanded
by IEEE754. The reason for the Khronos definitions is a massive reduction, often
four-fold, in power consumption and gate count when 3D Graphics simply has no need
for full accuracy.

*For 3D GPU markets this definitely needs addressing*

## Audio/Video

Found at [[sv/av_opcodes]] these do not require Saturated variants because Saturation
is added via [[sv/svp64]] (Vector Prefixing) and via [[sv/svp64_single]] Scalar
Prefixing. This is important to note for Opcode Allocation because placing these
operations in the UnVectoriseable areas would irredeemably damage their value.
Unlike PackedSIMD ISAs the actual number of AV Opcodes is remarkably small once
the usual cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO)
are abstracted out to RISC-paradigm Prefixing, leaving just absolute-diff-accumulate,
min-max, average-add etc. as "basic primitives".
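
As an illustration of why the primitive count stays small, the sketch below treats saturation as a wrapper applied around a plain primitive rather than as a separate opcode. The function names (`absdiff_acc`, `avg_add`, `saturate`, `sad_u8`) and the unsigned 8-bit range are assumptions made for illustration only: they are not the definitions found in [[sv/av_opcodes]].

```python
def absdiff_acc(acc, a, b):
    """Absolute-difference-accumulate: acc + |a - b|."""
    return acc + abs(a - b)


def avg_add(a, b):
    """Average-add with round-up: (a + b + 1) >> 1."""
    return (a + b + 1) >> 1


def saturate(value, lo=0, hi=255):
    """Clamp to [lo, hi]. Stands in for saturation supplied by a Prefix
    (assumed range: unsigned 8-bit)."""
    return max(lo, min(hi, value))


def sad_u8(xs, ys, saturating=False):
    """Sum of absolute differences over two element lists.
    The same primitive is used either way: only the wrapper changes."""
    acc = 0
    for a, b in zip(xs, ys):
        acc = absdiff_acc(acc, a, b)
        if saturating:
            acc = saturate(acc)
    return acc
```

The point of the design is visible in `sad_u8`: the saturating and non-saturating forms share one primitive, so no second opcode is needed.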

## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT is astonishing.
The Wikipedia page lists over a hundred separate and distinct areas: Audio, Video,
Radar, Baseband processing, AI, Reed-Solomon Error Correction, the list goes on and on.
ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs
have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear this is high priority.

With SVP64  [[sv/remap]] providing the Loop Schedules it falls to the Scalar side of
the ISA to add the prerequisite "Twin Butterfly" operations, typically performing
for example one multiply but in-place subtracting that product from one operand and
adding it to the other.  The *in-place* aspect is strategically extremely important
for significant reductions in Vectorised register usage, particularly for DCT.
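
A minimal sketch of the idea follows, using the generic radix-2 Cooley-Tukey butterfly: one multiply, with the product both added and subtracted, and the two results overwriting the two inputs. The names `twin_butterfly`, `fft_in_place` and `dft` are hypothetical, and the exact operand assignment of the proposed instructions may well differ from this textbook form.

```python
import cmath


def twin_butterfly(a, b, w):
    """One multiply (w * b); the product is added to and subtracted from a."""
    t = w * b
    return a + t, a - t


def fft_in_place(x):
    """Radix-2 decimation-in-time FFT overwriting list x.
    len(x) is assumed to be a power of two."""
    n = len(x)
    # Bit-reversal permutation of the input.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Triple loop: stage size, block start, butterfly index.
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                lo, hi = start + k, start + k + half
                # Both outputs land back where the inputs came from.
                x[lo], x[hi] = twin_butterfly(x[lo], x[hi], w)
        size *= 2
    return x


def dft(x):
    """Naive O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n)
                for t in range(n)) for k in range(n)]
```

Because each butterfly writes its two results over its two inputs, no temporary vector is needed, which is the register-usage saving referred to above.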

## CR Weird group

Outlined in [[sv/cr_int_predication]] these instructions massively save on CR-Field
instruction count.  Multi-bit to single-bit conversions and vice-versa, normally
requiring several CR-ops (crand, crxor), are done in one single instruction.  The reason for their
addition is down to SVP64 overloading CR Fields as Vector Predicate Masks.
Reducing instruction count in hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors but adding such instructions
to the *Condition Register* side was deemed to be far too much. Therefore, priority
was given instead to transferring several CR Field bits into GPRs, whereupon
the full set of standard Scalar GPR Logical Operations may be used. This strategy
has the side-effect of keeping the CRweird group down to only five instructions.
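
The strategy can be sketched as follows: one bit is gathered from each CR Field into an integer, after which an ordinary popcount applies. The function names and the choice of which result bit each field lands in are assumptions for illustration, not the encoding given in [[sv/cr_int_predication]].

```python
def cr_bits_to_gpr(cr_fields, bit_index):
    """Gather one bit from each 4-bit CR Field into an integer mask.

    cr_fields: list of 4-bit integers, one per CR Field.
    bit_index: 0..3 in Power ISA MSB0 order (0=LT, 1=GT, 2=EQ, 3=SO).
    Assumption: field i lands in bit i (LSB0) of the result.
    """
    mask = 0
    for i, field in enumerate(cr_fields):
        bit = (field >> (3 - bit_index)) & 1
        mask |= bit << i
    return mask


def popcount(x):
    """Ordinary integer population count, standing in for a GPR popcount."""
    return bin(x).count("1")
```

For example, with eight CR Fields in a list, `popcount(cr_bits_to_gpr(fields, 2))` counts how many of them have EQ set.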

## Big-integer Math

[[sv/biginteger]]  has always been a high priority area for commercial applications, privacy,
Banking, as well as HPC Numerical Accuracy: libgmp as well as cryptographic uses
in Asymmetric Ciphers. poly1305 and ec25519 are finding their way into everyday
use via OpenSSL.

A very early variant of the Power ISA had a 32-bit Carry-in Carry-out SPR. Its
removal from subsequent revisions is regrettable.  An alternative concept is
to add six explicit 3-in 2-out operations that, on close inspection, always
turn out to be supersets of *existing Scalar operations* that discard upper
or lower DWords, or parts thereof.

*Thus it is critical to note that not one single one of these operations
expands the bitwidth of any existing Scalar pipelines*.
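
A sketch of one such 3-in 2-out operation is given below: a 64x64 multiply with a 64-bit carry-in, returning both DWords of the result. The name `mul_add_3in_2out` and the helper functions are hypothetical and are not the mnemonics or pseudocode from [[sv/biginteger]].

```python
MASK64 = (1 << 64) - 1


def mul_add_3in_2out(a, b, carry_in):
    """64x64 multiply plus a 64-bit carry-in; returns (low, high) DWords.
    The largest result, (2**64-1)**2 + (2**64-1), still fits in 128 bits."""
    product = a * b + carry_in
    return product & MASK64, (product >> 64) & MASK64


def bigint_mul_scalar(limbs, scalar):
    """Multiply little-endian 64-bit limbs by a 64-bit scalar.
    The high DWord of each step is chained in as the next carry-in."""
    out, carry = [], 0
    for limb in limbs:
        lo, carry = mul_add_3in_2out(limb, scalar, carry)
        out.append(lo)
    out.append(carry)
    return out


def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

With `carry_in` set to zero, the low output is what a low-half 64-bit multiply already produces and the high output is what an unsigned high-half multiply already produces, which is the "superset" property described above.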

The `dsld` instruction for example merely places additional LSBs into the 64-bit
shift (64-bit carry-in), and then places the (normally discarded) MSBs into the second
output register (64-bit carry-out). It does **not** require a 128-bit shifter to
replace the existing Scalar Power ISA 64-bit shifters.
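
The behaviour described can be modelled with nothing wider than a 64-bit rotate and two masks, as in the sketch below. This is an illustrative model written from the description above, with hypothetical function names. It is not the normative `dsld` pseudocode.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, n):
    """Rotate a 64-bit value left by n (0..63)."""
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64


def shift_left_carry(value, shift, carry_in):
    """64-bit left shift with 64-bit carry-in and carry-out.

    The vacated low bits are filled from carry_in; the bits that a plain
    shift would discard are returned as carry_out.
    """
    assert 0 <= shift < 64
    low_mask = (1 << shift) - 1
    rot = rotl64(value, shift)
    result = (rot & ~low_mask) | (carry_in & low_mask)
    carry_out = rot & low_mask
    return result, carry_out


def bigint_shl(limbs, shift):
    """Shift little-endian 64-bit limbs left by `shift` (0..63) bits."""
    out, carry = [], 0
    for limb in limbs:
        lo, carry = shift_left_carry(limb, shift, carry)
        out.append(lo)
    out.append(carry)
    return out


def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

The loop in `bigint_shl` is the part that would collapse into a single Vector-Prefixed instruction, with the carry chained from element to element.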

The reduction in instruction count these operations bring, in critical hotloops,
is remarkably high, to the extent where a Scalar-to-Vector operation of
*arbitrary length* becomes just the one Vector-Prefixed instruction.

Whilst these are 5-6 bit XO, their utility is considered to be of high strategic
value, and as such they are strongly advocated to be in EXT04. The alternative is to bring
back a 64-bit Carry SPR, but how it could be retrospectively applied to pre-existing Scalar
Power ISA multiply, divide, and shift operations at this late stage of maturity of
the Power ISA is an entire area of research on its own, deemed unlikely to be
achievable.

## fclass and GPR-FPR moves

[[sv/fclass]] - just one instruction.  With SFFS being locked down to exclude VSX,
and there being no desire within the nascent OpenPOWER ecosystem outside of IBM to
implement the VSX PackedSIMD paradigm, it becomes necessary to upgrade SFFS
such that it is stand-alone capable. One omission based on the assumption
that VSX would always be present is an equivalent to `xvtstdcsp`.
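
For readers unfamiliar with data-class testing, the sketch below classifies an FP32 value purely from its sign, exponent and fraction fields. The function names and the `(sign, kind)` return shape are assumptions for illustration: the real instruction's mask layout is defined in [[sv/fclass]] and is not reproduced here.

```python
import struct


def fp32_bits(x):
    """Return the IEEE754 binary32 bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fclass32(x):
    """Classify x as it would be stored in FP32.
    Returns (sign, kind), kind being one of
    'nan', 'inf', 'zero', 'denormal', 'normal'."""
    bits = fp32_bits(x)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        kind = "nan" if fraction else "inf"
    elif exponent == 0:
        kind = "denormal" if fraction else "zero"
    else:
        kind = "normal"
    return sign, kind
```

A mask-based test, in the style of `xvtstdcsp`, would then check whether the resulting class is a member of a caller-supplied set.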

Similar arguments apply to the GPR-FPR move operations, with the opportunity taken
to add rounding modes present in other ISAs that Power ISA VSX PackedSIMD does not
have. JavaScript rounding, one of the worst offenders of Computer Science, requires
a phenomenal 35 instructions with *six branches* to emulate in Power ISA! For
desktop as well as Server HTML/JS back-end execution of JavaScript this becomes an
obvious priority, recognised already by ARM as just one example.
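
To show what has to be emulated, the sketch below models the ECMAScript `ToInt32` conversion of a floating-point number. That this is the conversion meant by "Javascript rounding" above is an assumption, and `js_to_int32` is a hypothetical name.

```python
import math


def js_to_int32(x):
    """ECMAScript ToInt32 applied to a Python float.

    NaN and infinities become 0; everything else is truncated toward
    zero, wrapped modulo 2**32 and reinterpreted as signed 32-bit.
    """
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) % (1 << 32)
    return n - (1 << 32) if n >= (1 << 31) else n
```

Each of the special cases (NaN, infinity, out-of-range wrap, sign reinterpretation) is a candidate for a branch when emulated with ordinary conversion instructions, which is where the instruction count comes from.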

## (f)mv.swizzle

[[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value as a Scalar
instruction is limited *except* if combined with `cmpi` and SVP64Single
Predication, whereupon the end result is the RISC-synthesis of Compare-and-Swap,
in two instructions.

Where this instruction comes into its full value is when Vectorised.  3D GPU
and HPC numerical workloads astonishingly contain between 10 and 15% swizzle
operations: access to YYZ, XY, of an XYZW Quaternion, performing balancing
of ARGB pixel data. The usage is so high that 3D GPU ISAs make Swizzle a first-class
priority in their VLIW words. Even 64-bit Embedded GPU ISAs have a staggering
24-bits dedicated to 2-operand Swizzle.
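
For readers who have not met the term, a swizzle selects and reorders lanes of a short vector by name. The sketch below is a plain software model with a hypothetical `swizzle` function; it does not represent the operand encoding in [[sv/mv.swizzle]].

```python
def swizzle(vec, pattern):
    """Select lanes of an up-to-4-element vector by name.
    Lanes may repeat, and the result may be shorter than the source."""
    lanes = {"X": 0, "Y": 1, "Z": 2, "W": 3}
    return [vec[lanes[c]] for c in pattern]
```

For instance `swizzle(q, "YYZ")` on an XYZW vector `q` returns its second, second and third elements, matching the YYZ access mentioned above.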

So as not to radicalise the Power ISA the Libre-SOC team decided to introduce
mv Swizzle operations, which can always be Macro-op fused in exactly the same
way that ARM SVE predicated-move extends 3-operand "overwrite" opcodes to full
independent 3-in 1-out.

## BMI (bitmanipulation) group

Whilst the [[sv/vector_ops]] instructions are only two in number, in reality the
`bmask` instruction has a Mode field allowing it to cover **24** instructions,
more than have been added to any other CPUs by ARM, Intel or AMD.  Analysis of
the BMI sets of these CPUs shows simple patterns that can greatly simplify both
Decode and implementation. These are sufficiently commonly used, saving instruction
count regularly, that they justify going into EXT0xx.
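
The kind of pattern referred to can be seen in well-known x86 BMI1 and AMD TBM operations: each combines `x` (optionally inverted) with `x+1` or `x-1` (optionally inverted) through a single logic operation. The sketch below expresses five of them through one parameterised function. The parameter tuple is purely an illustrative assumption and is not the `bmask` Mode field encoding from [[sv/vector_ops]].

```python
MASK64 = (1 << 64) - 1


def bmask_like(x, invert_x, delta, invert_y, op):
    """Combine x (optionally inverted) with x+delta (optionally inverted)."""
    a = ~x & MASK64 if invert_x else x & MASK64
    y = (x + delta) & MASK64
    if invert_y:
        y = ~y & MASK64
    if op == "and":
        return a & y
    if op == "or":
        return a | y
    return a ^ y


# (invert_x, delta, invert_y, op) for some established operations.
PATTERNS = {
    "blsr":    (False, -1, False, "and"),  # x & (x-1): clear lowest set bit
    "blsi":    (False, -1, True,  "and"),  # x & -x: isolate lowest set bit
    "blsmsk":  (False, -1, False, "xor"),  # x ^ (x-1): mask up to lowest set
    "blcfill": (False, +1, False, "and"),  # x & (x+1): clear trailing ones
    "tzmsk":   (True,  -1, False, "and"),  # ~x & (x-1): mask trailing zeros
}


def bmi(name, x):
    """Apply one of the named patterns to x."""
    return bmask_like(x, *PATTERNS[name])
```

With only an inversion flag per input, a choice of increment or decrement, and a choice of logic operation, a small number of mode bits covers many named instructions.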

The other instruction is `cprop` - Carry-Propagation - which takes the P and Q
from carry-propagation algorithms and generates carry look-ahead. It greatly
increases the efficiency of arbitrary-precision integer arithmetic by combining
what would otherwise be half a dozen instructions into one. However it is
still not a huge priority unlike `bmask` so is probably best placed in EXT2xx.
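
As background, the sketch below computes the carry-in to every bit position from a conventional propagate/generate pair, once bit by bit and once with a single addition. Two things are assumptions here: that the inputs correspond to propagate `p = a ^ b` and generate `g = a & b` (the text above names them P and Q), and that the single-addition formula is a fair illustration. Neither is taken from the `cprop` specification in [[sv/vector_ops]].

```python
MASK64 = (1 << 64) - 1


def carries_ripple(p, g, width=64):
    """Reference: c[0] = 0, c[i+1] = g[i] | (p[i] & c[i]).
    Returns an integer whose bit i is the carry INTO bit i."""
    carries, c = 0, 0
    for i in range(width):
        carries |= c << i
        c = ((g >> i) & 1) | (((p >> i) & 1) & c)
    return carries


def carries_lookahead(p, g):
    """Same result from one add and two logic ops.
    Valid when p = a ^ b and g = a & b, because (p | g) + g == a + b
    and (a + b) ^ a ^ b is the vector of carries into each bit."""
    return (((p | g) + g) ^ p) & MASK64
```

The ripple form needs a loop or a long dependent chain of instructions, whereas the look-ahead form is a fixed handful of operations, which illustrates the saving claimed above.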

## Float-Load-Immediate

Very easily justified.  As explained in [[ls002]] these
always save one LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
FP value being in the I-Cache side.  It is such a high priority that these instructions
are easily justified for addition into EXT0xx, despite requiring a 16-bit immediate.
By designing the second-half instruction as a Read-Modify-Write it saves on XO
bitlength (only 5 bits), and can be macro-op fused with its first-half to store a
full IEEE754 FP32 immediate into a register.
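
The two-half scheme can be sketched as below, with the register modelled as a bare FP32 bit pattern. That is a simplification (FPRs hold FP64), and the function names are hypothetical: the actual instruction definitions are in [[ls002]].

```python
import struct


def fp32_to_bits(x):
    """IEEE754 binary32 bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits):
    """Value of an IEEE754 binary32 bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def load_upper_half(imm16):
    """First half: the 16-bit immediate becomes the upper half of an
    FP32 pattern, with the lower half zeroed."""
    return (imm16 & 0xFFFF) << 16


def insert_lower_half(reg_bits, imm16):
    """Second half (Read-Modify-Write): keep the upper 16 bits already
    in the register and fill in the lower 16 from the immediate."""
    return (reg_bits & 0xFFFF0000) | (imm16 & 0xFFFF)
```

The first half on its own is already useful: many common constants, such as 1.0 (upper half `0x3F80`), have an all-zero lower half and so need only one instruction.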

There is little point in putting these instructions into EXT2xx. Their very benefit
and inherent value *is* as 32-bit instructions, not 64-bit ones. Likewise there is
less value in taking up EXT1xx Encoding space because EXT1xx only brings an additional
16 bits (approx) to the table, and that is provided already by the second-half
instruction.

Thus they qualify as both high priority and also EXT0xx candidates.


[[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]

[[!tag opf_rfc]]