# External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

* <https://git.openpower.foundation/isa/PowerISA/issues/121>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1052>

The purpose of this RFC is:

* to give a full list of the upcoming Scalar opcodes developed by Libre-SOC
  (respecting that *all* of them are Vectoriseable),
* to formally agree a priority order on an iterative basis with new versions of this RFC,
* to decide which ones should be in EXT022 Sandbox, which in EXT0xx, and which in EXT2xx,
* and for IBM to get a clear picture of the Opcode Allocation needs.

As this is a Formal ISA RFC the evaluation
shall ultimately define (in advance of the actual submission of the instructions
themselves) which instructions will be submitted over the next 8-18
months.

*It is expected that readers visit and interact with the Libre-SOC resources
in order to do due-diligence on the prioritisation evaluation. Otherwise
the ISA WG is overwhelmed by "drip-fed" RFCs that may turn out not
to be useful, against a background of having no guiding overview
or pre-filtering, and everybody's precious time is wasted.
Also note that the Libre-SOC Team, being funded by NLnet
under Privacy and Enhanced Trust Grants, are **prohibited** from signing
Commercial-Confidentiality NDAs, as doing so is a direct conflict of interest
with their funding body's Charitable Foundation Status and remit*.

It is worth bearing in mind during evaluation that every "Defined
Word" may or may not be Vectoriseable, but that every "Defined Word"
should have merits on its own, not just when Vectorised.  An example
of a borderline Vectoriseable Defined Word is `mv.swizzle` which
only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads,
but has less merit as a Scalar-only operation.

Power ISA Scalar (SFFS) has not been significantly advanced in 12 years:
IBM's primary focus has understandably been on PackedSIMD VSX.
Unfortunately, with VSX being 914 instructions and 128-bit, it is far too much for any
new team to consider (10 years development effort) and far outside of
Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar
up-to-date to modern standards *and on its own merits* is a reasonable goal,
and the advantages of the reduced focus are that SFFS remains RISC-paradigm,
and that lessons can be learned from other ISAs from the intervening years.
Good examples here include `bmask`.

SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
as well as "True-Scalable-Vector Prefixing" - also literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
their value when Vector-Prefixed, *as well as* when SVP64Single-Prefixed,
unavoidably has to be taken into consideration at the same time.

**Target areas**

Whilst entirely general-purpose, there are some categories that
these instructions are targeting: Bit-manipulation, Big-integer,
Cryptography, Audio/Visual, High-Performance Compute, GPU workloads
and DSP.

**Instruction count guide and approximate priority order**

* 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
* 5 - CR weirds [[sv/cr_int_predication]]
* 4 - INT<->FP mv [[ls006]]
* 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
* ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
* 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
* 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
* 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
* 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
* 5 - Audio-Video [[sv/av_opcodes]]
* 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish) [[ls004]]
* 2 - BMI group [[sv/vector_ops]]
* 2 - GPU swizzle [[sv/mv.swizzle]]
* 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly <https://bugs.libre-soc.org/show_bug.cgi?id=1028>
* 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
* 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
* 25 - Transcendentals (2-arg) [[openpower/transcendentals]]

Summary tables are created below by different sort categories. Additional
columns as necessary can be requested to be added as part of update revisions
to this RFC.

# Target Area summaries

## SVP64 Management instructions

These without question have to go in EXT0xx.  Future extended variants, bringing
even more powerful capabilities, can be followed up later with EXT1xx prefixed
variants.  Examples include adding psvshape in order to support both Inner and
Outer Product Matrix Schedules, by providing the option to directly reverse the
order of the triple loops.  Outer is used for standard Matrix Multiply, but Inner
is required for Warshall Transitive Closure.

The Management Instructions themselves are all Scalar Operations, so PO1-Prefixing
is perfectly reasonable.  SVP64 Management instructions, of which there are only
6, are all 5 or 6 bit XO, meaning that the opcode space they take up in EXT0xx is
not alarmingly high for their intrinsic strategic value.

## Transcendentals

Found at [[openpower/transcendentals]], these subdivide into high priority for
accelerating general-purpose and High-Performance Compute, specialist 3D GPU
operations suited to 3D visualisation, and low-priority less common instructions
where IEEE754 full bit-accuracy is paramount.  In 3D GPU scenarios for example
even 12-bit accuracy can be overkill, but for HPC Scientific scenarios 12-bit
would be disastrous.

There are a **lot** of operations here, and they also bring Power ISA
up-to-date to IEEE754-2019.  Fortunately the number of critical instructions
is quite low, but the caveat is that if those operations are utilised to
synthesise other IEEE754 operations (divide by `pi` for example) full bitlevel
accuracy (a hard requirement for IEEE754) is lost.

Also worth noting is that the Khronos Group defines minimum acceptable bit-accuracy
levels for 3D Graphics: these are **nowhere near** the full accuracy demanded
by IEEE754. The reason for the Khronos definitions is a massive reduction, often
four-fold, in power consumption and gate count when 3D Graphics simply has no need
for full accuracy.

*For 3D GPU markets this definitely needs addressing*

## Audio/Video

Found at [[sv/av_opcodes]], these do not require Saturated variants because Saturation
is added via [[sv/svp64]] (Vector Prefixing) and via [[sv/svp64_single]] Scalar
Prefixing. This is important to note for Opcode Allocation because placing these
operations in the UnVectoriseable areas would irredeemably damage their value.
Unlike PackedSIMD ISAs the actual number of AV Opcodes is remarkably small once
the usual cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO)
are abstracted out to RISC-paradigm Prefixing, leaving just absolute-diff-accumulate,
min-max, average-add etc. as "basic primitives".

## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT is astonishing.
The Wikipedia page lists over a hundred separate and distinct areas: Audio, Video,
Radar, Baseband processing, AI, Reed-Solomon Error Correction, the list goes on and on.
ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs
have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear this is high priority.

With SVP64 [[sv/remap]] providing the Loop Schedules it falls to the Scalar side of
the ISA to add the prerequisite "Twin Butterfly" operations, typically performing
for example one multiply but in-place subtracting that product from one operand and
adding it to the other.  The *in-place* aspect is strategically extremely important
for significant reductions in Vectorised register usage, particularly for DCT.

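To illustrate why the in-place form saves registers, here is a minimal Python sketch of the textbook radix-2 butterfly written as a single 3-in 2-out step. The name `twin_butterfly` and its operand order are assumptions made for this sketch only; the exact semantics of the proposed instructions are defined in the linked Libre-SOC pages, not here.

```python
def twin_butterfly(a, b, w):
    """One multiply, then an add and a subtract that both reuse the product.

    Hypothetical helper, not the proposed instruction: returns the pair
    (a + b*w, a - b*w) so that the caller can overwrite a and b in place.
    """
    t = b * w          # the single multiply
    return a + t, a - t


# In-place use: both results go back where the inputs came from,
# so no extra temporaries stay live across a loop iteration.
data = [3, 2]
data[0], data[1] = twin_butterfly(data[0], data[1], 5)
```

Because both outputs overwrite the inputs, a Vectorised loop over such steps needs no second bank of result registers.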
## CR Weird group

Outlined in [[sv/cr_int_predication]], these instructions massively save on CR-Field
instruction count.  Multi-bit to single-bit and vice-versa, normally requiring several
CR-ops (crand, crxor), are done in one single instruction.  The reason for their
addition is down to SVP64 overloading CR Fields as Vector Predicate Masks.
Reducing instruction count in hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors, but adding such instructions
to the *Condition Register* side was deemed to be far too much. Therefore, priority
was given instead to transferring several CR Field bits into GPRs, whereupon
the full set of standard Scalar GPR Logical Operations may be used. This strategy
has the side-effect of keeping the CRweird group down to only five instructions.

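The "transfer to GPR, then use ordinary integer operations" strategy can be sketched in Python as follows. This is an illustration under stated assumptions, not the semantics of any CRweird instruction: `gather_bit` is a hypothetical helper, and the choice that result bit `i` corresponds to CR Field `i` is made only for this sketch. The CR layout used (eight 4-bit Fields, each holding LT, GT, EQ, SO from most to least significant) is the standard Power ISA one.

```python
def cr_field(cr, n):
    """Return 4-bit CR Field n (0..7) of a 32-bit CR value; Field 0 is most significant."""
    return (cr >> (28 - 4 * n)) & 0xF


EQ = 0b0010  # within a Field the bits are LT, GT, EQ, SO from most to least significant


def gather_bit(cr, bit):
    """Collect one chosen bit from each of the 8 CR Fields into an integer mask.

    Hypothetical helper for illustration: bit i of the result is set when
    CR Field i has the chosen bit set.
    """
    mask = 0
    for n in range(8):
        if cr_field(cr, n) & bit:
            mask |= 1 << n
    return mask


cr = (EQ << 28) | (EQ << 16)   # EQ set in CR Field 0 and CR Field 3
mask = gather_bit(cr, EQ)
count = bin(mask).count("1")   # ordinary integer popcount on the gathered bits
```

Once the bits are in an integer register, popcount and every other Logical Operation come for free, which is the point made in the paragraph above.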
## Big-integer Math

Big-integer Math, covered in [[sv/biginteger]], has always been a high priority area
for commercial applications, privacy and Banking, as well as for HPC Numerical Accuracy
(libgmp) and cryptographic uses in Asymmetric Ciphers. poly1305 and ec25519 are
finding their way into everyday use via OpenSSL.

A very early variant of the Power ISA had a 32-bit Carry-in Carry-out SPR. Its
removal from subsequent revisions is regrettable.  An alternative concept is
to add six explicit 3-in 2-out operations that, on close inspection, always
turn out to be supersets of *existing Scalar operations* that discard upper
or lower DWords, or parts thereof.

*Thus it is critical to note that not one single one of these operations
expands the bitwidth of any existing Scalar pipelines*.

The `dsld` instruction for example merely places additional LSBs into the 64-bit
shift (64-bit carry-in), and then places the (normally discarded) MSBs into the second
output register (64-bit carry-out). It does **not** require a 128-bit shifter to
replace the existing Scalar Power ISA 64-bit shifters.

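The carry-in and carry-out behaviour described for `dsld` can be sketched in Python as a 64-bit shift chained across the words of a big integer. This is a model of the idea only, under the assumption that the carry-in supplies the low `n` bits and the carry-out returns the `n` bits shifted off the top; the function names are hypothetical and the normative pseudocode is in [[sv/biginteger]].

```python
MASK64 = (1 << 64) - 1


def shift_left_carry(value, n, carry_in):
    """64-bit left shift by n (0..63) with 64-bit carry-in and carry-out.

    Illustrative only: the low n bits of the result come from carry_in,
    and the n bits shifted out of the top are returned as carry_out.
    """
    low_mask = (1 << n) - 1
    result = ((value << n) & MASK64) | (carry_in & low_mask)
    carry_out = (value >> (64 - n)) & low_mask
    return result, carry_out


def bigint_shift_left(words, n):
    """Shift a little-endian list of 64-bit words left by n bits (0..63)."""
    out, carry = [], 0
    for w in words:
        # Each step only ever handles 64-bit quantities: no 128-bit shifter.
        r, carry = shift_left_carry(w, n, carry)
        out.append(r)
    out.append(carry)  # the final carry-out becomes the new top word
    return out
```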
The reduction in instruction count these operations bring, in critical hotloops,
is remarkably high, to the extent where a Scalar-to-Vector operation of
*arbitrary length* becomes just the one Vector-Prefixed instruction.

Whilst these are 5-6 bit XO, their utility is considered to be of high strategic value
and as such they are strongly advocated to be in EXT04. The alternative is to bring
back a 64-bit Carry SPR, but how it would be retrospectively applied to pre-existing Scalar
Power ISA multiply, divide, and shift operations at this late stage of maturity of
the Power ISA is an entire area of research on its own, deemed unlikely to be
achievable.

## fclass and GPR-FPR moves

[[sv/fclass]] - just one instruction.  With SFFS being locked down to exclude VSX,
and there being no desire within the nascent OpenPOWER ecosystem outside of IBM to
implement the VSX PackedSIMD paradigm, it becomes necessary to upgrade SFFS
such that it is stand-alone capable. One omission based on the assumption
that VSX would always be present is an equivalent to `xvtstdcsp`.

Similar arguments apply to the GPR-FPR move operations, proposed
in [[ls006]], with the opportunity taken
to add rounding modes present in other ISAs that Power ISA VSX PackedSIMD does not
have. JavaScript rounding, one of the worst offenders of Computer Science, requires
a phenomenal 35 instructions with *six branches* to emulate in Power ISA! For
desktop as well as Server HTML/JS back-end execution of JavaScript this becomes an
obvious priority, recognised already by ARM as just one example.

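For readers unfamiliar with why this conversion is so awkward, here is a minimal Python sketch, assuming that "JavaScript rounding" refers to the double-to-int32 conversion that ECMAScript calls ToInt32. The function name `js_to_int32` is hypothetical; the sketch models the required result only and says nothing about how [[ls006]] encodes the operation.

```python
import math


def js_to_int32(x):
    """ECMAScript-style ToInt32 of a float: truncate, then wrap modulo 2**32."""
    if not math.isfinite(x):       # NaN and +/-Infinity convert to 0
        return 0
    n = math.trunc(x) % (1 << 32)  # truncate toward zero, wrap into 0..2**32-1
    return n - (1 << 32) if n >= (1 << 31) else n
```

The special cases (non-finite inputs, wrap-around instead of saturation, reinterpretation as signed) are what turn a single conversion into a branchy instruction sequence when no dedicated operation exists.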
## Bitmanip LUT2/3

These LUT2/3 operations are high cost high reward. Outlined in [[sv/bitmanip]],
the simplest ones already exist in PackedSIMD VSX: `xxeval`.
The same reasoning applies as to fclass: SFFS needs to be stand-alone on its
own merits and not "punished" should an implementor choose not to implement
any aspect of PackedSIMD VSX.

With Predication being such a high priority in GPUs and HPC, CR Field variants
of Ternary and Binary LUT instructions were considered high priority, and again
just like in the CRweird group the opportunity was taken to work on *all*
bits of a CR Field rather than just one bit as is done with the existing CR operations
crand, cror etc.

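A Ternary LUT operation can be sketched in Python as a per-bit table lookup driven by an 8-bit immediate. The function name `lut3` and the index ordering (`a` most significant, `c` least) are assumptions made for this sketch; the actual operand and immediate ordering is defined in [[sv/bitmanip]]. The `width` parameter is likewise hypothetical, included so that the same sketch covers a 64-bit GPR or a 4-bit CR Field.

```python
def lut3(a, b, c, imm8, width=64):
    """Bitwise ternary lookup: each result bit is bit `index` of imm8, where
    the 3-bit index is formed from the corresponding bits of a, b and c.

    The index ordering (a most significant, c least) is an assumption
    made for this sketch.
    """
    result = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> idx) & 1) << i
    return result


XOR3 = 0x96      # set for every index with an odd number of 1 bits
MAJORITY = 0xE8  # set for every index with at least two 1 bits

# One immediate selects any of the 256 three-input boolean functions.
x = lut3(0b1100, 0b1010, 0b0110, XOR3, width=4)      # same as a ^ b ^ c
m = lut3(0b1100, 0b1010, 0b0110, MAJORITY, width=4)  # bitwise majority vote
```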
The other high strategic value instruction is `grevlut` (and `grevluti` which can
generate a remarkably large number of regular-patterned magic constants).
The grevlut set requires of the order of 20,000 gates but provides an astonishing
plethora of innovative bit-permuting instructions never seen in any other ISA.

The downside of all of these instructions is their extremely short XO:
only 2-3 bits of XO, due to the large immediates *and* the number of operands required,
which makes each one expensive in opcode space.
The LUT3 instructions are already compacted down to "Overwrite" variants.
(By contrast the Float-Load-Immediate instructions have a much larger XO because,
despite having a 16-bit immediate, only one Register Operand is needed).

Realistically these high-value instructions should be proposed in EXT2xx where
their XO cost does not overwhelm EXT0xx.

## (f)mv.swizzle

[[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value as a Scalar
instruction is limited *except* if combined with `cmpi` and SVP64Single
Predication, whereupon the end result is the RISC-synthesis of Compare-and-Swap,
in two instructions.

Where this instruction comes into its full value is when Vectorised.  3D GPU
and HPC numerical workloads astonishingly contain between 10 and 15% swizzle
operations: access to YYZ or XY of an XYZW Quaternion, or performing balancing
of ARGB pixel data. The usage is so high that 3D GPU ISAs make Swizzle a first-class
priority in their VLIW words. Even 64-bit Embedded GPU ISAs have a staggering
24-bits dedicated to 2-operand Swizzle.

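For readers who have not met the term, a swizzle simply selects and reorders vector elements by name. A minimal Python sketch follows; the helper `swizzle` and its string-based pattern are purely illustrative and do not reflect the instruction's encoding.

```python
def swizzle(vec, pattern):
    """Pick and reorder elements of an XYZW vector by name, e.g. "YYZ"."""
    return tuple(vec["XYZW".index(ch)] for ch in pattern)


q = (1.0, 2.0, 3.0, 4.0)   # X, Y, Z, W
yyz = swizzle(q, "YYZ")    # elements may repeat
xy = swizzle(q, "XY")      # and the result may be shorter than the source
```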
So as not to radicalise the Power ISA the Libre-SOC team decided to introduce
mv Swizzle operations, which can always be Macro-op fused in exactly the same
way that ARM SVE predicated-move extends 3-operand "overwrite" opcodes to full
independent 3-in 1-out.

## BMI (bitmanipulation) group

Whilst the [[sv/vector_ops]] instructions are only two in number, in reality the
`bmask` instruction has a Mode field allowing it to cover **24** instructions,
more than have been added to any other CPUs by ARM, Intel or AMD.  Analysis of
the BMI sets of these CPUs shows simple patterns that can greatly simplify both
Decode and implementation. These are sufficiently commonly used, saving instruction
count regularly, that they justify going into EXT0xx.

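The "simple patterns" are of the kind found in x86 BMI1, where each operation combines `x` with `x - 1` or `-x` through a single logical operation. The Python sketch here shows three of them (corresponding to BLSR, BLSI and BLSMSK); the function names are made up for this sketch, and it does not reproduce the Mode field encoding of `bmask`, which is defined in [[sv/vector_ops]].

```python
MASK64 = (1 << 64) - 1


def clear_lowest_set(x):
    """x & (x - 1): clear the lowest set bit."""
    return x & (x - 1) & MASK64


def isolate_lowest_set(x):
    """x & -x: keep only the lowest set bit."""
    return x & -x & MASK64


def mask_up_to_lowest_set(x):
    """x ^ (x - 1): set all bits up to and including the lowest set bit."""
    return (x ^ (x - 1)) & MASK64
```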
The other instruction is `cprop` - Carry-Propagation - which takes the P and Q
from carry-propagation algorithms and generates carry look-ahead. It greatly
increases the efficiency of arbitrary-precision integer arithmetic by combining
what would otherwise be half a dozen instructions into one. However, unlike `bmask`,
it is still not a huge priority, so is probably best placed in EXT2xx.

## Float-Load-Immediate

Very easily justified.  As explained in [[ls002]] these
always save one LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
FP value being in the I-Cache side.  It is such a high priority that these instructions
are easily justifiable adding into EXT0xx, despite requiring a 16-bit immediate.
By designing the second-half instruction as a Read-Modify-Write it saves on XO
bitlength (only 5 bits), and can be macro-op fused with its first-half to store a
full IEEE754 FP32 immediate into a register.

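The two-instruction scheme can be modelled in Python on the FP32 bit pattern. This sketch assumes that the first instruction supplies the upper 16 bits and that the Read-Modify-Write second instruction fills in the lower 16 bits; `first_half` and `second_half` are hypothetical names, and [[ls002]] remains the reference for the real instructions and their encodings.

```python
import struct


def fp32_bits(x):
    """Return the IEEE754 binary32 bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_fp32(bits):
    """Inverse of fp32_bits: interpret a 32-bit integer as a binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def first_half(imm16):
    """Hypothetical first instruction: imm16 becomes the upper 16 bits."""
    return imm16 << 16


def second_half(reg, imm16):
    """Hypothetical second instruction: read-modify-write of the lower 16 bits."""
    return (reg & 0xFFFF0000) | imm16


bits = fp32_bits(3.14159)
reg = first_half(bits >> 16)           # on its own: a reduced-precision value
reg = second_half(reg, bits & 0xFFFF)  # together: the full FP32 immediate
```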
There is little point in putting these instructions into EXT2xx. Their very benefit
and inherent value *is* as 32-bit instructions, not 64-bit ones. Likewise there is
less value in taking up EXT1xx Encoding space because EXT1xx only brings an additional
16 bits (approx) to the table, and that is provided already by the second-half
instruction.

Thus they qualify as both high priority and also EXT0xx candidates.

## FPR/GPR LD/ST-PostIncrement-Update

These instructions, outlined in [[ls011]], save hugely in hot-loops.  Early ISAs
such as the PDP-8 and PDP-11, which inspired the iconic Motorola 68000 and 88100 and
Mitch Alsup's MyISA 66000, and whose lineage can even be traced back to the iconic
ultra-RISC CDC 6600, all had both pre- and post-increment Addressing Modes.

The reason is very simple: it is a direct recognition of the practice in C to
frequently utilise both `*p++` and `*++p`, which itself stems from common need in
Computer Science algorithms.

The problem for the Power ISA is - was - that the opcode space needed to support both
was far too great, and the decision was made to go with pre-increment, on the basis
that outside the loop a "pre-subtraction" may be performed.

Whilst this is a "solution" it is less than ideal, and the opportunity exists now
with the EXT2xx Primary Opcodes to correct this and bring Power ISA up a level.

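The difference between the two addressing styles, and the "pre-subtraction" that a pre-increment-only ISA forces outside the loop, can be sketched in Python with a list standing in for memory. Both function names are hypothetical and the sketch models only the order of address update and access, not any instruction from [[ls011]].

```python
def sum_pre_increment(mem, base, count, step=1):
    """Pre-increment (update-form) loop: the address is bumped before each load."""
    p = base - step            # the "pre-subtraction" done outside the loop
    total = 0
    for _ in range(count):
        p += step              # address update happens first
        total += mem[p]        # then the load uses the updated address
    return total


def sum_post_increment(mem, base, count, step=1):
    """Post-increment loop: load from the current address, then bump it."""
    p = base
    total = 0
    for _ in range(count):
        total += mem[p]        # load uses the address as-is
        p += step              # address update happens afterwards
    return total
```

Both loops visit the same elements; the post-increment form simply does not need the extra adjustment before the loop starts.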
## Shift-and-add

Shift-and-Add are proposed in [[ls004]].  They mitigate the need to
add LD-ST-Shift instructions which are a high-priority aspect of both
x86 and ARM.  On those ISAs LD-ST-Shift is normally just the one instruction;
Power ISA presently requires three, and Shift-and-add brings that down to two.
Cryptography e.g. twofish also makes use of Integer double-and-add, so the value
of these instructions is not limited to Effective Address computation.
They will also have value in Audio DSP.

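As a minimal sketch of the operation itself, assuming a plain `base + (index << shift)` form truncated to 64 bits (the operand order and the range of permitted shift amounts are defined in [[ls004]], not here), with a hypothetical function name:

```python
MASK64 = (1 << 64) - 1


def shift_and_add(base, index, shift):
    """Return (base + (index << shift)) truncated to 64 bits."""
    return (base + (index << shift)) & MASK64


# Effective Address of element 5 in an array of 8-byte items at 0x1000.
ea = shift_and_add(0x1000, 5, 3)

# Integer double-and-add, as used in some cryptographic routines: 2*a + b.
d = shift_and_add(7, 10, 1)
```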
Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx when their
whole purpose and value is to reduce binary size in Address offset computation,
thus they are best placed in EXT0xx.

[[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]

[[!tag opf_rfc]]