openpower/sv/rfc/ls012.mdwn

   1 # External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops
   2
   3 * <https://git.openpower.foundation/isa/PowerISA/issues/121>
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=1052>
   6
   7 The purpose of this RFC is to give a full list of the upcoming Scalar
   8 opcodes developed by Libre-SOC, to formally agree a priority order, to
   9 decide which ones should be EXT022 Sandbox, and to give IBM a clear picture
  10 of the Opcode Allocation needs.  As this is a Formal ISA RFC the evaluation
  11 shall define (in advance of the actual submission of the instructions
  12 themselves) which instructions should be submitted over the next 18
  13 months.
  14
  15 *It is expected that readers visit and interact with the Libre-SOC resources
  16 in order to do due-diligence on the prioritisation evaluation. Otherwise
  17 the ISA WG is overwhelmed by piecemeal RFCs that may turn out not
  18 to be useful, against a background of having no guiding overview
  19 or pre-filtering*.
  20
  21 It is worth bearing in mind during evaluation that every "Defined
  22 Word" may or may not be Vectoriseable, but that every "Defined Word"
  23 should have merits on its own, not just when Vectorised.  An example
  24 of a borderline Vectoriseable Defined Word is `mv.swizzle` which
  25 only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads,
  26 but has less merit as a Scalar-only operation.
  27
  28 Power ISA Scalar (SFFS) has not been significantly advanced in 12 years:
  29 IBM's primary focus has understandably been on PackedSIMD VSX.
  30 Unfortunately, with VSX being 914 instructions and 128-bit it is far too much for any
  31 new team to consider (10 years development effort) and far outside of
  32 Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar
  33 up-to-date to modern standards is a reasonable goal, and the advantage is
  34 that lessons can be learned from other ISAs.
  35
  36 SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
  37 as well as "True-Scalable-Vector Prefixing" - also literally brings new
  38 dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
  39 their value when Vector-Prefixed, *as well as* when SVP64Single-Prefixed,
  40 unavoidably has to be taken into consideration at the same time.
  41
  42 **Target areas**
  43
  44 Whilst entirely general-purpose there are some categories that
  45 these instructions are targeting: Bitmanipulation, Big-integer,
  46 cryptography, Audio/Visual, High-Performance Compute, GPU workloads
  47 and DSP.
  48
  49 **Instruction count guide and approximate priority order**
  50
  51 * 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
  52 * 5 - CR weirds [[sv/cr_int_predication]]
  53 * 4 - INT<->FP mv [[ls006]]
  54 * 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
  55 * ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
  56 * 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
  57 * 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
  58 * 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
  59 * 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
  60 * 5 - Audio-Video [[sv/av_opcodes]]
  61 * 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish)
  62 * 2 - BMI group [[sv/vector_ops]]
  63 * 2 - GPU swizzle [[sv/mv.swizzle]]
  64 * 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
  65 * ~9 - Integer DCT/FFT Butterfly
  66 * 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
  67 * 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
  68 * 25 - Transcendentals (2-arg) [[openpower/transcendentals]]
  69
  70 Summary tables are created below by different sort categories. Additional
  71 columns as necessary can be requested to be added as part of update revisions
  72 to this RFC.
  73
  74 # Target Area summaries
  75
  76 ## Transcendentals
  77
  78 Found at [[openpower/transcendentals]] these subdivide into high priority for
  79 accelerating general-purpose and High-Performance Compute, specialist 3D GPU
  80 operations suited to 3D visualisation, and low-priority less common instructions
  81 where IEEE754 full bit-accuracy is paramount.  In 3D GPU scenarios for example
  82 even 12-bit accuracy can be overkill, but for HPC Scientific scenarios 12-bit
  83 would be disastrous.
  84
  85 There are a **lot** of operations here, and they also bring Power ISA
  86 up-to-date to IEEE754-2019.  Fortunately the number of critical instructions
  87 is quite low, but the caveat is that if those operations are utilised to
  88 synthesise other IEEE754 operations (divide by `pi` for example) full bit-level
  89 accuracy (a hard requirement for IEEE754) is lost.
  90
  91 Also worth noting is that the Khronos Group defines minimum acceptable bit-accuracy
  92 levels for 3D Graphics: these are **nowhere near** the full accuracy demanded
  93 by IEEE754. The reason for the Khronos definitions is a massive reduction, often
  94 four-fold, in power consumption and gate count when 3D Graphics simply has no need
  95 for full accuracy.
  96
  97 *For 3D GPU markets this definitely needs addressing*
  98
  99 ## Audio/Video
 100
 101 Found at [[sv/av_opcodes]] these do not require Saturated variants because Saturation
 102 is added via [[sv/svp64]] (Vector Prefixing) and via [[sv/svp64_single]] Scalar
 103 Prefixing. This is important to note for Opcode Allocation because placing these
 104 operations in the UnVectoriseable areas would irredeemably damage their value.
 105 Unlike PackedSIMD ISAs the actual number of AV Opcodes is remarkably small once
 106 the usual cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO)
 107 are abstracted out to RISC-paradigm Prefixing, leaving just absolute-diff-accumulate,
 108 min-max, average-add etc. as "basic primitives".
 109
 110 ## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV
 111
 112 The number of uses in Computer Science for DCT, NTT, FFT and DFT is astonishing.
 113 The Wikipedia page lists over a hundred separate and distinct areas: Audio, Video,
 114 Radar, Baseband processing, AI, Reed-Solomon Error Correction, the list goes on and on.
 115 ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs
 116 have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
 117 DSP can do full FFT triple loops in one VLIW group.
 118
 119 It should be pretty clear this is high priority.
 120
 121 With SVP64 [[sv/remap]] providing the Loop Schedules it falls to the Scalar side of
 122 the ISA to add the prerequisite "Twin Butterfly" operations, typically performing
 123 for example one multiply but in-place subtracting that product from one operand and
 124 adding it to the other.  The *in-place* aspect is strategically extremely important
 125 for significant reductions in Vectorised register usage, particularly for DCT.
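
The in-place behaviour described above can be modelled with a minimal Python sketch. It uses the classic radix-2 butterfly as a stand-in: the function name `twin_butterfly`, its arguments and the list-based "register file" are illustrative assumptions, not the semantics of any proposed instruction.

```python
def twin_butterfly(x, i, j, w):
    """In-place radix-2 butterfly on x[i] and x[j] (illustrative model only).

    One multiply is performed; the product is then added to one operand
    and subtracted from it to produce the second result, and both results
    overwrite the original slots so no extra storage is needed.
    """
    prod = x[j] * w        # the single multiply
    x[j] = x[i] - prod     # difference overwrites the second slot
    x[i] = x[i] + prod     # sum overwrites the first slot


# Hypothetical two-element "register file" and twiddle factor
data = [1.0, 2.0]
twin_butterfly(data, 0, 1, 0.5)
print(data)
```

Because both results land in the slots the inputs came from, a full FFT or DCT pass built from such steps needs no second copy of the vector, which is the register-usage saving the paragraph refers to.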
 126
 127 ## CR Weird group
 128
 129 Outlined in [[sv/cr_int_predication]] these instructions massively save on CR-Field
 130 instruction count.  Multi-bit to single-bit conversions and vice-versa, normally requiring
 131 several CR-ops (crand, crxor), are done in one single instruction.  The reason for their
 132 addition is down to SVP64 overloading CR Fields as Vector Predicate Masks.
 133 Reducing instruction count in hot-loops is considered high priority.
 134
 135 An additional need is to do popcount on CR Field bit vectors, but adding such instructions
 136 to the *Condition Register* side was deemed to be far too much. Therefore, priority
 137 was given instead to transferring several CR Field bits into GPRs, whereupon
 138 the full set of standard Scalar GPR Logical Operations may be used. This strategy
 139 has the side-effect of keeping the CRweird group down to only five instructions.
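
As a rough illustration of the "transfer to GPR, then use ordinary integer operations" strategy, the following Python sketch gathers one bit from each of the eight 4-bit CR Fields into an integer mask and counts the set bits. The helper `gather_cr_bits` is hypothetical and only models the idea; the actual instruction encodings and semantics are those given in [[sv/cr_int_predication]].

```python
def gather_cr_bits(cr, bit):
    """Collect one bit (0=LT, 1=GT, 2=EQ, 3=SO) from each of the eight
    4-bit CR Fields of a 32-bit CR value into an 8-bit mask.

    CR0 is taken from the most significant nibble and ends up in the
    most significant bit of the mask. Hypothetical helper, not an
    instruction definition.
    """
    mask = 0
    for field in range(8):
        nibble = (cr >> (28 - 4 * field)) & 0xF
        mask = (mask << 1) | ((nibble >> (3 - bit)) & 1)
    return mask


# Example: EQ is set in CR0 and CR7 only
cr = 0x20000002
eq_mask = gather_cr_bits(cr, 2)
# Once the bits are in an integer, a standard popcount applies
count = bin(eq_mask).count("1")
print(hex(eq_mask), count)
```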
 140
 141 ## Big-integer Math
 142
 143 [[sv/biginteger]] has always been a high priority area for commercial applications, privacy,
 144 Banking, as well as HPC Numerical Accuracy: libgmp as well as cryptographic uses
 145 in Asymmetric Ciphers. poly1305 and ec25519 are finding their way into everyday
 146 use via OpenSSL.
 147
 148 A very early variant of the Power ISA had a 32-bit Carry-in Carry-out SPR. Its
 149 removal from subsequent revisions is regrettable.  An alternative concept is
 150 to add six explicit 3-in 2-out operations that, on close inspection, always
 151 turn out to be supersets of *existing Scalar operations* that discard upper
 152 or lower DWords, or parts thereof.
 153
 154 *Thus it is critical to note that not one single one of these operations
 155 expands the bitwidth of any existing Scalar pipelines*.
 156
 157 The `dsld` instruction for example merely places additional LSBs into the 64-bit
 158 shift (64-bit carry-in), and then places the (normally discarded) MSBs into the second
 159 output register (64-bit carry-out). It does **not** require a 128-bit shifter to
 160 replace the existing Scalar Power ISA 64-bit shifters.
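
A minimal Python model of the carry-in/carry-out idea follows. It is a sketch under stated assumptions, not the `dsld` specification: the function names, the restriction to shift amounts 0..63, and the choice of which carry-in bits are used are all illustrative. Every intermediate value is masked to 64 bits to show that no 128-bit shifter is needed.

```python
MASK64 = (1 << 64) - 1


def shift_left_carry(word, carry_in, sh):
    """Model of a 64-bit left shift with 64-bit carry-in and carry-out.

    The low `sh` bits of carry_in fill the vacated LSBs; the MSBs that a
    plain shift would discard are returned as the carry-out.
    Assumes 0 <= sh < 64 (illustrative restriction).
    """
    result = ((word << sh) & MASK64) | (carry_in & ((1 << sh) - 1))
    carry_out = (word & MASK64) >> (64 - sh)
    return result, carry_out


def bigint_shl(words, sh):
    """Shift a little-endian list of 64-bit words left by sh bits,
    chaining the carry-out of each word into the next."""
    carry = 0
    out = []
    for w in words:
        lo, carry = shift_left_carry(w, carry, sh)
        out.append(lo)
    out.append(carry)
    return out


def to_int(words):
    """Reassemble a little-endian word list into a Python integer."""
    return sum(w << (64 * i) for i, w in enumerate(words))


words = [0x8000000000000001, 0x1]
shifted = bigint_shl(words, 4)
print([hex(w) for w in shifted])
```

The chained loop in `bigint_shl` is the pattern that a Vector-Prefixed form of such an instruction would be expected to replace.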
 161
 162 The reduction in instruction count these operations bring, in critical hotloops,
 163 is remarkably high, to the extent where a Scalar-to-Vector operation of
 164 *arbitrary length* becomes just the one Vector-Prefixed instruction.
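
To make the hot-loop pattern concrete, here is a hedged Python sketch of multiplying an arbitrary-length big integer by a 64-bit scalar using a generic 3-in 2-out multiply-add with a 64-bit carry. The names `mul_add_carry` and `bigint_mul_scalar` are illustrative assumptions; the actual operations are those listed in [[sv/biginteger]].

```python
MASK64 = (1 << 64) - 1


def mul_add_carry(a, b, carry_in):
    """Generic 3-in 2-out step: returns (low 64 bits, high 64 bits) of
    a * b + carry_in. With 64-bit inputs the sum always fits in 128 bits."""
    total = a * b + carry_in
    return total & MASK64, total >> 64


def bigint_mul_scalar(words, scalar):
    """Multiply a little-endian list of 64-bit words by a 64-bit scalar,
    chaining the high half of each step into the next as the carry."""
    carry = 0
    out = []
    for w in words:
        lo, carry = mul_add_carry(w, scalar, carry)
        out.append(lo)
    out.append(carry)
    return out


def to_int(words):
    """Reassemble a little-endian word list into a Python integer."""
    return sum(w << (64 * i) for i, w in enumerate(words))


big = [MASK64, MASK64]
product = bigint_mul_scalar(big, MASK64)
print([hex(w) for w in product])
```

The body of the `for` loop is a single 3-in 2-out step per element, which is the part that collapses to one Vector-Prefixed instruction in the scenario described above.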
 165
 166 Whilst these are 5-6 bit XO, their utility is considered to be of high strategic
 167 value, and as such they are strongly advocated to be in EXT04. The alternative is to bring
 168 back a 64-bit Carry SPR, but how it is retrospectively applicable to pre-existing Scalar
 169 Power ISA multiply, divide, and shift operations at this late stage of maturity of
 170 the Power ISA is an entire area of research on its own, deemed unlikely to be
 171 achievable.
 172
 173 ## fclass
 174
 175 [[sv/fclass]] - just one instruction.  With SFFS being locked down to exclude VSX,
 176 and there being no desire within the nascent OpenPOWER ecosystem outside of IBM to
 177 implement the VSX PackedSIMD paradigm, it becomes necessary to upgrade SFFS
 178 such that it is stand-alone capable. One omission based on the assumption
 179 that VSX would always be present is an equivalent to `xvtstdcsp`.
 180
 181 ## (f)mv.swizzle
 182
 183 [[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value as a Scalar
 184 instruction is limited *except* if combined with `cmpi` and SVP64Single
 185 Predication, whereupon the end result is the RISC-synthesis of Compare-and-Swap,
 186 in two instructions.
 187
 188 Where this instruction comes into its full value is when Vectorised.  3D GPU
 189 and HPC numerical workloads astonishingly contain between 10 and 15% swizzle
 190 operations: access to YYZ, XY, of an XYZW Quaternion, performing balancing
 191 of ARGB pixel data. The usage is so high that 3D GPU ISAs make Swizzle a first-class
 192 priority in their VLIW words. Even 64-bit Embedded GPU ISAs have a staggering
 193 24 bits dedicated to 2-operand Swizzle.
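
For readers unfamiliar with the term, a swizzle simply selects and reorders lanes of a short vector. The Python sketch below is a conceptual illustration only: the function `swizzle` and its string-based lane selector are assumptions made for this example and say nothing about the encoding in [[sv/mv.swizzle]].

```python
LANES = {"X": 0, "Y": 1, "Z": 2, "W": 3}


def swizzle(vec, pattern):
    """Return the lanes of `vec` named by `pattern`, e.g. "YYZ".

    Lanes may be repeated or omitted, so the result can be shorter than,
    or a permutation of, the source vector.
    """
    return [vec[LANES[c]] for c in pattern]


quat = [1.0, 2.0, 3.0, 4.0]      # an XYZW quaternion
yyz = swizzle(quat, "YYZ")       # duplicate Y, then take Z
xy = swizzle(quat, "XY")         # first two lanes only
print(yyz, xy)
```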
 194
 195 So as not to radicalise the Power ISA the Libre-SOC team decided to introduce
 196 mv Swizzle operations, which can always be Macro-op fused in exactly the same
 197 way that ARM SVE predicated-move extends 3-operand "overwrite" opcodes to full
 198 independent 3-in 1-out.
 199
 200
 201
 202
 203 [[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
 204
 205 [[!tag opf_rfc]]