# External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

* <https://git.openpower.foundation/isa/PowerISA/issues/121>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1052>

The purpose of this RFC is to give a full list of the upcoming Scalar
opcodes developed by Libre-SOC, to formally agree a priority order on an
iterative basis, to decide which ones should be in EXT022 Sandbox, which in
EXT0xx and which in EXT2xx, and for IBM to get a clear picture of
the Opcode Allocation needs.  As this is a Formal ISA RFC the evaluation
shall ultimately define (in advance of the actual submission of the instructions
themselves) which instructions should be submitted over the next 18
months.

*It is expected that readers visit and interact with the Libre-SOC resources
in order to do due-diligence on the prioritisation evaluation. Otherwise
the ISA WG is overwhelmed by piecemeal RFCs that may turn out not
to be useful, against a background of having no guiding overview
or pre-filtering*.

It is worth bearing in mind during evaluation that every "Defined
Word" may or may not be Vectoriseable, but that every "Defined Word"
should have merits on its own, not just when Vectorised.  An example
of a borderline Vectoriseable Defined Word is `mv.swizzle`, which
only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads,
but has less merit as a Scalar-only operation.

Power ISA Scalar (SFFS) has not been significantly advanced in 12 years:
IBM's primary focus has understandably been on PackedSIMD VSX.
Unfortunately, with VSX being 914 instructions and 128-bit, it is far too much
for any new team to consider (10 years development effort) and far outside of
Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar
up-to-date to modern standards is a reasonable goal, and the advantage is
that lessons can be learned from other ISAs from the intervening years.

SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
as well as "True-Scalable-Vector Prefixing" - also literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed,
unavoidably has to be taken into consideration at the same time.

**Target areas**

Whilst entirely general-purpose there are some categories that
these instructions are targeting: Bitmanipulation, Big-integer,
cryptography, Audio/Visual, High-Performance Compute, GPU workloads
and DSP.

**Instruction count guide and approximate priority order**

* 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
* 5 - CR weirds [[sv/cr_int_predication]]
* 4 - INT<->FP mv [[ls006]]
* 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
* ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
* 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
* 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
* 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
* 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
* 5 - Audio-Video [[sv/av_opcodes]]
* 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish)
* 2 - BMI group [[sv/vector_ops]]
* 2 - GPU swizzle [[sv/mv.swizzle]]
* 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly
* 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
* 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
* 25 - Transcendentals (2-arg) [[openpower/transcendentals]]

Summary tables are created below by different sort categories. Additional
columns can be requested as necessary as part of update revisions
to this RFC.

# Target Area summaries

## Transcendentals

Found at [[openpower/transcendentals]], these subdivide into high priority for
accelerating general-purpose and High-Performance Compute, specialist 3D GPU
operations suited to 3D visualisation, and low-priority less common instructions
where IEEE754 full bit-accuracy is paramount.  In 3D GPU scenarios for example
even 12-bit accuracy can be overkill, but for HPC Scientific scenarios 12-bit
would be disastrous.

There are a **lot** of operations here, and they also bring Power ISA
up-to-date to IEEE754-2019.  Fortunately the number of critical instructions
is quite low, but the caveat is that if those operations are utilised to
synthesise other IEEE754 operations (divide by `pi` for example) full bit-level
accuracy (a hard requirement for IEEE754) is lost.

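The loss of accuracy from synthesis can be seen with a small experiment. The
sketch below, in Python purely for illustration, builds a `sinpi`-style
operation out of an ordinary `sin` and a rounded value of pi. The helper name
`sinpi_synthesised` is an assumption made for this example and is not an
instruction name taken from the RFC.

```python
import math

def sinpi_synthesised(x):
    """sin(pi * x) built from a plain sin and a rounded pi (illustrative only)."""
    return math.sin(math.pi * x)

# The exact value of sin(pi * 1) is zero, so a correctly rounded sinpi
# would have to return 0.0. The synthesised version cannot, because
# math.pi is only the nearest double-precision value to pi.
synth_result = sinpi_synthesised(1.0)
```

The result is tiny but not zero, which is exactly the kind of discrepancy that
a dedicated bit-accurate operation avoids.
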
It is also worth noting that the Khronos Group defines minimum acceptable
bit-accuracy levels for 3D Graphics: these are **nowhere near** the full accuracy
demanded by IEEE754. The reason for the Khronos definitions is a massive
reduction, often four-fold, in power consumption and gate count when 3D Graphics
simply has no need for full accuracy.

*For 3D GPU markets this definitely needs addressing*

## Audio/Video

Found at [[sv/av_opcodes]], these do not require Saturated variants because Saturation
is added via [[sv/svp64]] (Vector Prefixing) and via [[sv/svp64_single]] (Scalar
Prefixing). This is important to note for Opcode Allocation because placing these
operations in the UnVectoriseable areas would irredeemably damage their value.
Unlike PackedSIMD ISAs the actual number of AV Opcodes is remarkably small once
the usual cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO)
are abstracted out to RISC-paradigm Prefixing, leaving just absolute-diff-accumulate,
min-max, average-add etc. as "basic primitives".

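As a rough illustration of how one primitive plus a prefix can replace a family
of PackedSIMD variants, the Python sketch below keeps the absolute-diff-accumulate
primitive free of any width or saturation options and applies those from the
outside. All function names here are hypothetical, and the loop is only a model
of what Prefixing would supply, not a description of the actual encodings.

```python
def absdiff_acc(acc, a, b):
    """Basic primitive: accumulate the absolute difference of a and b."""
    return acc + abs(a - b)

def saturate_unsigned(value, bits):
    """Clamp a value to the unsigned range of the given element width."""
    return max(0, min(value, (1 << bits) - 1))

def prefixed_sad(xs, ys, bits, saturate=False):
    """Loop the scalar primitive over elements, optionally saturating.

    The loop and the saturation stand in for what a prefix would provide.
    Without saturation the Python integer simply grows; element-width
    wrapping is not modelled in this sketch.
    """
    acc = 0
    for a, b in zip(xs, ys):
        acc = absdiff_acc(acc, a, b)
        if saturate:
            acc = saturate_unsigned(acc, bits)
    return acc
```
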
## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT is astonishing.
The Wikipedia page lists over a hundred separate and distinct areas: Audio, Video,
Radar, Baseband processing, AI, Reed-Solomon Error Correction, the list goes on and on.
ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs
have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear this is high priority.

With SVP64 [[sv/remap]] providing the Loop Schedules it falls to the Scalar side of
the ISA to add the prerequisite "Twin Butterfly" operations, typically performing
for example one multiply but in-place subtracting that product from one operand and
adding it to the other.  The *in-place* aspect is strategically extremely important
for significant reductions in Vectorised register usage, particularly for DCT.

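To make the in-place idea concrete, here is a minimal Python sketch of a
textbook radix-2 decimation-in-time butterfly that overwrites both of its
inputs, used inside an ordinary iterative FFT. It is an assumption that this
textbook form is representative: the exact operand semantics of the proposed
instructions are not given in this section, and the names `twin_butterfly` and
`fft_inplace` are invented for the example.

```python
import cmath

def twin_butterfly(x, i, j, w):
    """In-place butterfly: one multiply, one subtract, one add."""
    t = w * x[j]        # the single multiply
    x[j] = x[i] - t     # difference overwrites the second operand
    x[i] = x[i] + t     # sum overwrites the first operand

def fft_inplace(x):
    """Iterative in-place FFT for power-of-two lengths."""
    n = len(x)
    # Bit-reversal permutation so that the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # The triple loop that a Loop Schedule would otherwise generate.
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                twin_butterfly(x, start + k, start + k + half, w)
        size *= 2
    return x
```

No temporary array is needed at any stage: every butterfly reads two elements
and writes the same two elements back.
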
## CR Weird group

Outlined in [[sv/cr_int_predication]] these instructions massively save on CR-Field
instruction count.  Multi-bit to single-bit and vice-versa, normally requiring several
CR-ops (crand, crxor), are done in one single instruction.  The reason for their
addition is down to SVP64 overloading CR Fields as Vector Predicate Masks.
Reducing instruction count in hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors, but adding such
instructions to the *Condition Register* side was deemed to be far too much.
Therefore, priority was given instead to transferring several CR Field bits into
GPRs, whereupon the full set of standard Scalar GPR Logical Operations may be used.
This strategy has the side-effect of keeping the CRweird group down to only five
instructions.

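The strategy of moving CR Field bits into a GPR and then using ordinary integer
operations can be modelled in a few lines of Python. This is only a behavioural
sketch: the function names are hypothetical, and the choice to place CR Field
`i` in bit `i` of the result is an assumption rather than something taken from
the instruction definitions.

```python
LT, GT, EQ, SO = 0b1000, 0b0100, 0b0010, 0b0001  # bits within one 4-bit CR Field

def cr_fields_to_gpr(cr_fields, bit):
    """Gather one chosen bit from each CR Field into an integer mask.

    Field i lands in bit i of the result (an assumption of this sketch).
    """
    gpr = 0
    for i, field in enumerate(cr_fields):
        if field & bit:
            gpr |= 1 << i
    return gpr

def popcount(value):
    """Ordinary GPR-side population count."""
    return bin(value).count("1")

# Eight CR Fields used as a predicate mask, testing the EQ bit of each.
cr = [EQ, LT, EQ | SO, GT, EQ, EQ, LT, GT]
mask = cr_fields_to_gpr(cr, EQ)
active_lanes = popcount(mask)
```

Once the bits are in a GPR, counting active predicate lanes needs nothing
beyond existing integer operations.
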
## Big-integer Math

[[sv/biginteger]] has always been a high priority area for commercial applications, privacy,
Banking, as well as HPC Numerical Accuracy: libgmp as well as cryptographic uses
in Asymmetric Ciphers. poly1305 and ec25519 are finding their way into everyday
use via OpenSSL.

A very early variant of the Power ISA had a 32-bit Carry-in Carry-out SPR. Its
removal from subsequent revisions is regrettable.  An alternative concept is
to add six explicit 3-in 2-out operations that, on close inspection, always
turn out to be supersets of *existing Scalar operations* that discard upper
or lower DWords, or parts thereof.

*Thus it is critical to note that not one single one of these operations
expands the bitwidth of any existing Scalar pipelines*.

The `dsld` instruction for example merely places additional LSBs into the 64-bit
shift (64-bit carry-in), and then places the (normally discarded) MSBs into the second
output register (64-bit carry-out). It does **not** require a 128-bit shifter to
replace the existing Scalar Power ISA 64-bit shifters.

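The behaviour described for `dsld` can be modelled as follows. This Python
sketch follows the prose description only and is not the normative pseudocode:
the function names are hypothetical, the shift amount is assumed to be in the
range 0 to 63, and the carry-out is assumed to be right-aligned in its register.

```python
MASK64 = (1 << 64) - 1

def shift_left_carry(ra, sh, carry_in):
    """64-bit left shift with 64-bit carry-in and carry-out.

    Returns (result, carry_out). Every value produced fits in 64 bits.
    """
    low_mask = (1 << sh) - 1
    result = ((ra << sh) & MASK64) | (carry_in & low_mask)
    carry_out = ra >> (64 - sh)      # the MSBs a plain shift would discard
    return result, carry_out

def bigint_shift_left(limbs, sh):
    """Shift a little-endian list of 64-bit limbs left by sh bits (0..63)."""
    out = []
    carry = 0
    for limb in limbs:
        word, carry = shift_left_carry(limb, sh, carry)
        out.append(word)
    out.append(carry)                # final carry-out becomes the top limb
    return out
```

Chaining the carry from one limb to the next is all that is needed to shift an
integer of any length, while each step remains a 64-bit operation.
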
The reduction in instruction count these operations bring, in critical hot-loops,
is remarkably high, to the extent where a Scalar-to-Vector operation of
*arbitrary length* becomes just the one Vector-Prefixed instruction.

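A sketch of the Scalar-to-Vector case, assuming a 3-in 2-out multiply-and-add
with a 64-bit carry, is given below in Python. The names `mul_add_carry` and
`bigint_mul_scalar` are invented for the example. Python's wide integer product
merely stands in for the low and high halves that the existing `mulld` and
`mulhdu` instructions already compute separately.

```python
MASK64 = (1 << 64) - 1

def mul_add_carry(ra, rb, carry_in):
    """3-in 2-out: 64x64 multiply plus a 64-bit carry-in.

    Returns (low 64 bits, high 64 bits). The sum always fits in 128 bits.
    """
    product = ra * rb + carry_in
    return product & MASK64, product >> 64

def bigint_mul_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar."""
    out = []
    carry = 0
    for limb in limbs:               # the loop a Vector prefix would express
        low, carry = mul_add_carry(limb, scalar, carry)
        out.append(low)
    out.append(carry)                # final carry-out becomes the top limb
    return out
```

The loop body is a single operation, so when the loop itself is supplied by the
prefix the whole big-integer multiply collapses to one instruction.
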
Whilst these are 5-6 bit XO, they are considered to be of high strategic value
and as such are strongly advocated to be in EXT04. The alternative is to bring
back a 64-bit Carry SPR, but how it would be retrospectively applicable to
pre-existing Scalar Power ISA multiply, divide, and shift operations at this late
stage of maturity of the Power ISA is an entire area of research on its own,
deemed unlikely to be achievable.

## fclass and GPR-FPR moves

[[sv/fclass]] - just one instruction.  With SFFS being locked down to exclude VSX,
and there being no desire within the nascent OpenPOWER ecosystem outside of IBM to
implement the VSX PackedSIMD paradigm, it becomes necessary to upgrade SFFS
such that it is stand-alone capable. One omission based on the assumption
that VSX would always be present is an equivalent to `xvtstdcsp`.

Similar arguments apply to the GPR-FPR move operations, with the opportunity taken
to add rounding modes present in other ISAs that Power ISA VSX PackedSIMD does not
have. JavaScript rounding, one of the worst offenders of Computer Science, requires
a phenomenal 35 instructions with *six branches* to emulate in Power ISA! For
desktop as well as Server HTML/JS back-end execution of JavaScript this becomes an
obvious priority, recognised already by ARM as just one example.

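For readers unfamiliar with the conversion in question, the Python sketch below
models the ECMAScript `ToInt32` operation. It is an assumption that this is the
"JavaScript rounding" meant here; it is the conversion that ARM's `FJCVTZS`
instruction targets. The function name `js_to_int32` is invented for the example.

```python
import math

def js_to_int32(value):
    """Convert a float to a 32-bit signed integer as ECMAScript ToInt32 does."""
    if math.isnan(value) or math.isinf(value):
        return 0
    truncated = math.trunc(value)        # round toward zero
    wrapped = truncated % (1 << 32)      # modulo 2**32, never negative in Python
    if wrapped >= (1 << 31):
        wrapped -= 1 << 32               # reinterpret as a signed value
    return wrapped
```

The special cases, the truncation, the wrap-around and the sign fix-up each
cost instructions and branches when no single conversion instruction exists.
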
## (f)mv.swizzle

[[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value as a Scalar
instruction is limited *except* if combined with `cmpi` and SVP64Single
Predication, whereupon the end result is the RISC-synthesis of Compare-and-Swap,
in two instructions.

Where this instruction comes into its full value is when Vectorised.  3D GPU
and HPC numerical workloads astonishingly contain between 10 and 15% swizzle
operations: access to YYZ, XY, of an XYZW Quaternion, performing balancing
of ARGB pixel data. The usage is so high that 3D GPU ISAs make Swizzle a first-class
priority in their VLIW words. Even 64-bit Embedded GPU ISAs have a staggering
24-bits dedicated to 2-operand Swizzle.

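For anyone who has not met the term, a swizzle is simply a selection and
reordering of vector elements by name. The Python sketch below shows the idea;
the `swizzle` helper and its string-pattern interface are illustrative
assumptions and say nothing about how `mv.swizzle` is encoded.

```python
LANES = {"X": 0, "Y": 1, "Z": 2, "W": 3}

def swizzle(vec, pattern):
    """Return the elements of vec selected by a pattern such as "YYZ"."""
    return [vec[LANES[name]] for name in pattern]

xyzw = [1.0, 2.0, 3.0, 4.0]
yyz = swizzle(xyzw, "YYZ")      # repeat and reorder elements
xy = swizzle(xyzw, "XY")        # take a shorter sub-vector

# Reversing the channel order of a four-channel pixel.
reversed_pixel = swizzle([0xFF, 0x10, 0x20, 0x30], "WZYX")
```
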
So as not to radicalise the Power ISA the Libre-SOC team decided to introduce
mv Swizzle operations, which can always be Macro-op fused in exactly the same
way that ARM SVE predicated-move extends 3-operand "overwrite" opcodes to full
independent 3-in 1-out.

## BMI (bitmanipulation) group

Whilst the [[sv/vector_ops]] instructions are only two in number, in reality the
`bmask` instruction has a Mode field allowing it to cover **24** instructions,
more than have been added to any other CPUs by ARM, Intel or AMD.  Analysis of
the BMI sets of these CPUs shows simple patterns that can greatly simplify both
Decode and implementation. These are sufficiently commonly used, saving instruction
count regularly, that they justify going into EXT0xx.

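The kind of pattern referred to can be shown with a few well-known x86 BMI1 and
TBM operations, each of which combines `x` (or its inverse) with `x + 1` or
`x - 1` (or its inverse) through AND, OR or XOR. The Python sketch below uses a
hypothetical four-field selection to express them. Two inversion choices, two
deltas and three logical operations give 2 x 2 x 2 x 3 = 24 combinations, but
whether the `bmask` Mode field is actually organised this way is an assumption
of the example.

```python
import operator

MASK64 = (1 << 64) - 1
OPS = {"and": operator.and_, "or": operator.or_, "xor": operator.xor}

def bit_pattern_op(x, invert_x, delta, invert_y, op):
    """Compute op(x or ~x, (x + delta) or ~(x + delta)) on 64 bits.

    delta is +1 or -1. This four-field selection is hypothetical.
    """
    a = (~x if invert_x else x) & MASK64
    y = (x + delta) & MASK64
    b = (~y if invert_y else y) & MASK64
    return OPS[op](a, b)

def blsr(x):
    """Reset lowest set bit: x & (x - 1)."""
    return bit_pattern_op(x, False, -1, False, "and")

def blsmsk(x):
    """Mask up to and including lowest set bit: x ^ (x - 1)."""
    return bit_pattern_op(x, False, -1, False, "xor")

def blcs(x):
    """Set lowest clear bit: x | (x + 1)."""
    return bit_pattern_op(x, False, +1, False, "or")

def tzmsk(x):
    """Mask of trailing zeros: ~x & (x - 1)."""
    return bit_pattern_op(x, True, -1, False, "and")
```
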
The other instruction is `cprop` - Carry-Propagation - which takes the P and Q
from carry-propagation algorithms and generates carry look-ahead. It greatly
increases the efficiency of arbitrary-precision integer arithmetic by combining
what would otherwise be half a dozen instructions into one. However, unlike
`bmask`, it is still not a huge priority, so is probably best placed in EXT2xx.

## Float-Load-Immediate

Very easily justified.  As explained in [[ls002]] these
always save one LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
FP value being in the I-Cache side.  It is such a high priority that these instructions
are easily justifiable adding into EXT0xx, despite requiring a 16-bit immediate.
By designing the second-half instruction as a Read-Modify-Write it saves on XO
bitlength (only 5 bits), and can be macro-op fused with its first-half to store a
full IEEE754 FP32 immediate into a register.

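The two-instruction construction of an FP32 immediate can be modelled on bit
patterns alone. In the Python sketch below it is assumed that the first
instruction supplies the upper 16 bits and the second merges in the lower 16
bits. That split, the function names, and the use of a raw FP32 bit pattern to
stand in for the register contents are all assumptions of the example; [[ls002]]
defines the real instructions.

```python
import struct

def fp32_bits(value):
    """IEEE754 FP32 bit pattern of a Python float, as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def bits_to_fp32(bits):
    """Inverse of fp32_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def load_first_half(imm16):
    """First instruction: write the immediate as the upper half, lower half zero."""
    return (imm16 & 0xFFFF) << 16

def merge_second_half(reg_bits, imm16):
    """Second instruction: Read-Modify-Write, replacing only the lower half."""
    return (reg_bits & 0xFFFF0000) | (imm16 & 0xFFFF)

target = fp32_bits(3.14159)
reg = load_first_half(target >> 16)             # already a usable low-precision value
reg = merge_second_half(reg, target & 0xFFFF)   # now the full FP32 value
```

Because the second step only modifies its destination, it needs no extra source
register field, which is where the saving in XO bits comes from.
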
There is little point in putting these instructions into EXT2xx. Their very benefit
and inherent value *is* as 32-bit instructions, not 64-bit ones. Likewise there is
less value in taking up EXT1xx Encoding space because EXT1xx only brings an additional
16 bits (approx) to the table, and that is provided already by the second-half
instruction.

Thus they qualify as both high priority and also EXT0xx candidates.

## FPR/GPR LD/ST-PostIncrement-Update

These instructions, outlined in [[ls011]], save hugely in hot-loops.  Early ISAs
such as the PDP-8 and PDP-11, which inspired the iconic Motorola 68000 and 88100
as well as Mitch Alsup's MyISA 66000, all had both pre- and post-increment
Addressing Modes, and the idea can even be traced back to the iconic ultra-RISC
CDC 6600.

The reason is very simple: it is a direct recognition of the practice in C to
frequently utilise both `*p++` and `*++p` which itself stems from common need in
Computer Science algorithms.

The problem for the Power ISA is - was - that the opcode space needed to support both
was far too great, and the decision was made to go with pre-increment, on the basis
that outside the loop a "pre-subtraction" may be performed.

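The difference between the two addressing modes, and the "pre-subtraction"
workaround, can be modelled in a few lines of Python. The memory is a plain
list and the register is a one-entry dictionary; both helper names are
hypothetical and the sketch does not reflect any particular instruction
encoding.

```python
def load_pre_increment(mem, reg, step):
    """Pre-increment load-with-update: bump the pointer, then load (like *++p)."""
    reg["p"] += step
    return mem[reg["p"]]

def load_post_increment(mem, reg, step):
    """Post-increment: load, then bump the pointer (like *p++)."""
    value = mem[reg["p"]]
    reg["p"] += step
    return value

mem = [5, 7, 11, 13]

# Pre-increment needs a "pre-subtraction" outside the loop.
reg = {"p": 0}
reg["p"] -= 1
total_pre = sum(load_pre_increment(mem, reg, 1) for _ in range(len(mem)))

# Post-increment starts from the natural base address.
reg = {"p": 0}
total_post = sum(load_post_increment(mem, reg, 1) for _ in range(len(mem)))
```

Both loops read the same elements, but only the post-increment form can start
from the unmodified base pointer.
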
Whilst this is a "solution" it is less than ideal, and the opportunity exists now
with the EXT2xx Primary Opcodes to correct this and bring Power ISA up a level.

[[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]

[[!tag opf_rfc]]