openpower/sv/rfc/ls012.mdwn

   1 # External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops
   2
   3 **Date: 2023apr10. v1**
   4
   5 * Funded by NLnet Grants under EU Horizon 2020 and 2023
   6 * <https://git.openpower.foundation/isa/PowerISA/issues/121>
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=1052>
   9
  10 The purpose of this RFC is:
  11
  12 * to give a full list of the upcoming Scalar opcodes developed by Libre-SOC
  13   (respecting and being cognisant that *all* of them are Vectorisable)
  14 * to give OPF Members and non-Members alike the opportunity to comment and get
  15   involved early in RFC submission
  16 * formally agree a priority order on an iterative basis with new versions
  17   of this RFC,
  18 * which ones should be EXT022 Sandbox, which in EXT0xx, which in EXT2xx, which
  19   not proposed at all,
  20 * keep readers summarily informed of ongoing RFC submissions, with new versions
  21   of this RFC,
  22 * and for IBM (in their capacity as Allocator of Opcode resources)
  23   to get a clear overall advance picture of the Opcode Allocation needs
  24   *prior* to actual RFC submission
  25
  26 As this is a Formal ISA RFC the evaluation shall ultimately define
  27 (in advance of the actual submission of the instructions themselves)
  28 which instructions will be submitted over the next 8-18 months.
  29
  30 *It is expected that readers visit and interact with the Libre-SOC
  31 resources in order to do due-diligence on the prioritisation
  32 evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs
  33 that may turn out not to be useful, against a background of having
  34 no guiding overview or pre-filtering, and everybody's precious time
  35 is wasted.  Also note that the Libre-SOC Team, being funded by NLnet
  36 under Privacy and Enhanced Trust Grants, are **prohibited** from signing
  37 Commercial-Confidentiality NDAs, as doing so is a direct conflict of
  38 interest with their funding body's Charitable Foundation Status and
  39 remit, and therefore the **entire** set of almost 150 new SFFS instructions
  40 can only go via the External RFC Process.  Also be advised and aware
  41 that "Libre-SOC" != "RED Semiconductor Ltd". The two are completely **separate**
  42 organisations*.
  43
  44 Worth bearing in mind during evaluation that every "Defined Word" may
  45 or may not be Vectoriseable, but that every "Defined Word" should have
  46 merits on its own, not just when Vectorised.  An example of a borderline
  47 Vectoriseable Defined Word is `mv.swizzle` which only really becomes
  48 high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
  49 less merit as a Scalar-only operation.
  50
  51 Although one of the top world-class ISAs,
  52 Power ISA Scalar (SFFS) has not been significantly advanced in 12
  53 years: IBM's primary focus has understandably been on PackedSIMD VSX.
  54 Unfortunately, with VSX being 914 instructions and 128-bit it is far too
  55 much for any new team to consider (10 years development effort) and far
  56 outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
  57 Power Scalar up-to-date to modern standards *and on its own merits*
  58 is a reasonable goal, and the advantage of the reduced focus is that
  59 SFFS remains RISC-paradigm, and that lessons can be learned from other
  60 ISAs from the intervening years.  Good examples here include `bmask`.
  61
  62 SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
  63 as well as "True-Scalable-Vector Prefixing" - also literally brings new
  64 dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
  65 their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed,
  66 unavoidably has to be taken into consideration at the same time.
  67
  68 **Target areas**
  69
  70 Whilst entirely general-purpose there are some categories that these
  71 instructions are targeting: Bitmanipulation, Big-integer, cryptography,
  72 Audio/Visual, High-Performance Compute, GPU workloads and DSP.
  73
  74 **Instruction count guide and approximate priority order**
  75
  76 * 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
  77 * 5 - CR weirds [[sv/cr_int_predication]]
  78 * 4 - INT<->FP mv [[ls006]]
  79 * 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
  80 * ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
  81 * 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
  82 * 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
  83 * 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
  84 * 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
  85 * 5 - Audio-Video [[sv/av_opcodes]]
  86 * 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish) [[ls004]]
  87 * 2 - BMI group [[sv/vector_ops]]
  88 * 2 - GPU swizzle [[sv/mv.swizzle]]
  89 * 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
  90 * ~9 - Integer DCT/FFT Butterfly <https://bugs.libre-soc.org/show_bug.cgi?id=1028>
  91 * 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
  92 * 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
  93 * 25 - Transcendentals (2-arg) [[openpower/transcendentals]]
  94
  95 Summary tables are created below by different sort categories. Additional
  96 columns (and tables) as necessary can be requested to be added as part of update revisions
  97 to this RFC.
  98
  99 # Target Area summaries
 100
 101 Please note that there are some instructions developed thanks to NLnet
 102 funding that have not been included here for assessment. Examples
 103 include `pcdec` and the Galois Field arithmetic operations. From a purely
 104 practical perspective due to the quantity the lower-priority instructions
 105 were simply left out. However they remain in the Libre-SOC resources.
 106
 107 Some of these SFFS instructions appear to be duplicates of VSX.
 108 A frequent argument comes up that if instructions
 109 are in VSX already they should not be added to SFFS, especially if
 110 they are nominally the same.  The logic that this effectively damages
 111 performance of an SFFS-only implementation was raised earlier, however
 112 there is a more subtle reason why the instructions are needed.
 113
 114 Future versions of SVP64 and SVP64Single are expected to be developed
 115 by future Power ISA Stakeholders on top of VSX.  The decisions made
 116 there about the meaning of Prefixed Vectorised VSX may be **completely**
 117 different from those made for Prefixed SFFS instructions.  At which
 118 point the lack of SFFS equivalents would penalise SFFS implementors
 119 in a much more severe way, effectively expecting them and SFFS programmers
 120 to work with a non-orthogonal paradigm, to their detriment.
 121 The solution is to give the SFFS Subset the space and respect that it deserves
 122 and allow it to be stand-alone on its own merits.
 123
 124 ## SVP64 Management instructions
 125
 126 These without question have to go in EXT0xx.  Future extended variants,
 127 bringing even more powerful capabilities, can be followed up later with
 128 EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
 129 *Only `svstep` is actually Vectoriseable*, all other Management
 130 instructions are UnVectoriseable.  PO1-Prefixed examples include adding
 131 psvshape in order to support both Inner and Outer Product Matrix
 132 Schedules, by providing the option to directly reverse the order of the
 133 triple loops.  Outer is used for standard Matrix Multiply (on top
 134 of a standard MAC or FMAC instruction), but Inner is
 135 required for Warshall Transitive Closure (on top of a cumulatively-applied
 136 max instruction).
 137
 138 The Management Instructions themselves are all Scalar Operations, so
 139 PO1-Prefixing is perfectly reasonable.  SVP64 Management instructions, of
 140 which there are only 6, are all 5 or 6 bit XO, meaning that the opcode
 141 space they take up in EXT0xx is not alarmingly high for their intrinsic
 142 strategic value.
 143
 144 ## Transcendentals
 145
 146 Found at [[openpower/transcendentals]] these subdivide into high
 147 priority for accelerating general-purpose and High-Performance Compute,
 148 specialist 3D GPU operations suited to 3D visualisation, and low-priority
 149 less common instructions where IEEE754 full bit-accuracy is paramount.
 150 In 3D GPU scenarios for example even 12-bit accuracy can be overkill,
 151 but for HPC Scientific scenarios 12-bit would be disastrous.
 152
 153 There are a **lot** of operations here, and they also bring Power
 154 ISA up-to-date to IEEE754-2019.  Fortunately the number of critical
 155 instructions is quite low, but the caveat is that if those operations
 156 are utilised to synthesise other IEEE754 operations (divide by `pi` for
 157 example) full bitlevel accuracy (a hard requirement for IEEE754) is lost.
 158
 159 Also worth noting that the Khronos Group defines minimum acceptable
 160 bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
 161 accuracy demanded by IEEE754. The reason for the Khronos definitions is
 162 a massive reduction, often four-fold, in power consumption and gate count
 163 when 3D Graphics simply has no need for full accuracy.
 164
 165 *For 3D GPU markets this definitely needs addressing*
 166
 167 ## Audio/Video
 168
 169 Found at [[sv/av_opcodes]] these do not require Saturated variants
 170 because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
 171 [[sv/svp64_single]] Scalar Prefixing. This is important to note for
 172 Opcode Allocation because placing these operations in the UnVectoriseable
 173 areas would irredeemably damage their value.  Unlike PackedSIMD ISAs
 174 the actual number of AV Opcodes is remarkably small once the usual
 175 cascading-option-multipliers (SIMD width, bitwidth, saturation,
 176 HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
 177 absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".
 178
 179 ## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV
 180
 181 The number of uses in Computer Science for DCT, NTT, FFT and DFT,
 182 is astonishing.  The wikipedia page lists over a hundred separate and
 183 distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon
 184 Error Correction, the list goes on and on.  ARM has special dedicated
 185 Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT
 186 Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
 187 DSP can do full FFT triple loops in one VLIW group.
 188
 189 It should be pretty clear this is high priority.
 190
 191 With SVP64  [[sv/remap]] providing the Loop Schedules it falls to
 192 the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
 193 operations, typically performing for example one multiply but in-place
 194 subtracting that product from one operand and adding it to the other.
 195 The *in-place* aspect is strategically extremely important for significant
 196 reductions in Vectorised register usage, particularly for DCT.
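
As a rough illustration of the "one multiply, two results" idea described
above, the following is a minimal sketch of a radix-2 butterfly in Python.
The function names and the list-based register model are hypothetical, and
the exact operand ordering, rounding and encoding of the proposed
instructions are not reproduced here.

```python
def twin_butterfly(a, b, w):
    """Return (a + b*w, a - b*w) using a single multiply.

    Hypothetical model only: real instructions define their own
    operand order and rounding behaviour.
    """
    product = b * w              # the one and only multiply
    return a + product, a - product


def butterfly_in_place(x, i, j, w):
    """Apply the butterfly to x[i] and x[j], overwriting both.

    The in-place update means no additional result elements are needed.
    """
    product = x[j] * w
    # The right-hand side is evaluated before either element is written.
    x[i], x[j] = x[i] + product, x[i] - product


regs = [5, 3]
butterfly_in_place(regs, 0, 1, 2)
print(regs)
```

Both results overwrite their source operands, which is the property the
text identifies as reducing Vectorised register usage.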
 197
 198 ## CR Weird group
 199
 200 Outlined in [[sv/cr_int_predication]] these instructions massively save
 201 on CR-Field instruction count.  Multi-bit to single-bit and vice-versa
 202 normally requiring several CR-ops (crand, crxor) are done in one single
 203 instruction.  The reason for their addition is down to SVP64 overloading
 204 CR Fields as Vector Predicate Masks.  Reducing instruction count in
 205 hot-loops is considered high priority.
 206
 207 An additional need is to do popcount on CR Field bit vectors but adding
 208 such instructions to the *Condition Register* side was deemed to be far
 209 too much. Therefore, priority was given instead to transferring several
 210 CR Field bits into GPRs, whereupon the full set of standard Scalar GPR
 211 Logical Operations may be used. This strategy has the side-effect of
 212 keeping the CRweird group down to only five instructions.
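
The strategy of moving CR Field bits into a GPR and then using ordinary
integer operations can be sketched as below. This is an illustrative model
only: the helper names, the choice of which bit is taken from each 4-bit
field, and the packing order are all assumptions, not the encoding defined
in [[sv/cr_int_predication]].

```python
def cr_fields_to_gpr(cr_fields, bit):
    """Pack one chosen bit from each 4-bit CR Field into an integer.

    Assumption: field n lands in bit n of the result (LSB first).
    """
    result = 0
    for n, field in enumerate(cr_fields):
        result |= ((field >> bit) & 1) << n
    return result


def popcount(value):
    """Count set bits using ordinary integer operations."""
    return bin(value).count("1")


fields = [0b0010, 0b0000, 0b0110, 0b0010]   # four example CR Fields
mask = cr_fields_to_gpr(fields, bit=1)       # one predicate bit per field
print(bin(mask), popcount(mask))
```

Once the bits are in an integer register, popcount and every other logical
operation comes for free from the existing Scalar instruction set.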
 213
 214 ## Big-integer Math
 215
 216 [[sv/biginteger]]  has always been a high priority area for commercial
 217 applications, privacy, Banking, as well as HPC Numerical Accuracy:
 218 libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
 219 and ec25519 are finding their way into everyday use via OpenSSL.
 220
 221 A very early variant of the Power ISA had a 32-bit Carry-in Carry-out
 222 SPR. Its removal from subsequent revisions is regrettable.  An alternative
 223 concept is to add six explicit 3-in 2-out operations that, on close
 224 inspection, always turn out to be supersets of *existing Scalar
 225 operations* that discard upper or lower DWords, or parts thereof.
 226
 227 *Thus it is critical to note that not one single one of these operations
 228 expands the bitwidth of any existing Scalar pipelines*.
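
To make the "3-in 2-out" idea concrete, here is a minimal sketch of a
multiply-and-add with a 64-bit carry, written in Python. The function names
are hypothetical and the real instruction definitions are in
[[sv/biginteger]]. The sketch only shows why the two outputs are the low and
high halves of a result that existing multipliers already compute internally.

```python
MASK64 = (1 << 64) - 1


def mul_add_carry(a, b, carry_in):
    """Hypothetical 3-in 2-out model: returns (low 64 bits, high 64 bits).

    a*b + carry_in always fits in 128 bits because
    (2**64 - 1)**2 + (2**64 - 1) < 2**128.
    """
    wide = a * b + carry_in
    return wide & MASK64, wide >> 64


def bigint_mul_scalar(limbs, b):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar."""
    out, carry = [], 0
    for limb in limbs:
        low, carry = mul_add_carry(limb, b, carry)   # carry chains onward
        out.append(low)
    out.append(carry)
    return out


print([hex(v) for v in bigint_mul_scalar([MASK64, MASK64], MASK64)])
```

With `carry_in` set to zero the two outputs are the halves that a low
multiply and a high multiply would each return separately, which is the
sense in which such an operation is a superset of existing ones.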
 229
 230 The `dsld` instruction for example merely places additional LSBs into the
 231 64-bit shift (64-bit carry-in), and then places the (normally discarded)
 232 MSBs into the second output register (64-bit carry-out). It does **not**
 233 require a 128-bit shifter to replace the existing Scalar Power ISA
 234 64-bit shifters.
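
The carry-in and carry-out behaviour described for `dsld` can be modelled
roughly as follows. This is a sketch under stated assumptions: the function
name, the restriction to shift amounts below 64, and the way the carry-in
bits are selected are illustrative and are not taken from the instruction
definition.

```python
MASK64 = (1 << 64) - 1


def shift_left_carry(value, shift, carry_in):
    """Hypothetical model of a 64-bit left shift with 64-bit carry.

    Returns (result, carry_out): the low bits vacated by the shift are
    filled from carry_in, and the bits shifted out of the top (normally
    discarded) are returned as carry_out.
    """
    assert 0 <= shift < 64
    wide = value << shift
    result = (wide & MASK64) | (carry_in & ((1 << shift) - 1))
    carry_out = wide >> 64
    return result, carry_out


def bigint_shift_left(limbs, shift):
    """Shift a little-endian list of 64-bit limbs left by `shift` bits."""
    out, carry = [], 0
    for limb in limbs:
        low, carry = shift_left_carry(limb, shift, carry)
        out.append(low)
    out.append(carry)
    return out


print([hex(v) for v in bigint_shift_left([0x8000000000000001, 0x1], 4)])
```

Each step only ever shifts one 64-bit value, so the chain handles numbers of
any length without a wider shifter.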
 235
 236 The reduction in instruction count these operations bring, in critical
 237 hotloops, is remarkably high, to the extent where a Scalar-to-Vector
 238 operation of *arbitrary length* becomes just the one Vector-Prefixed
 239 instruction.
 240
 241 Whilst these are 5-6 bit XO their utility is considered high strategic
 242 value and as such are strongly advocated to be in EXT04. The alternative
 243 is to bring back a 64-bit Carry SPR but how it is retrospectively
 244 applicable to pre-existing Scalar Power ISA multiply, divide, and shift
 245 operations at this late stage of maturity of the Power ISA is an entire
 246 area of research on its own deemed unlikely to be achievable.
 247
 248 ## fclass and GPR-FPR moves
 249
 250 [[sv/fclass]] - just one instruction.  With SFFS being locked down to
 251 exclude VSX, and there being no desire within the nascent OpenPOWER
 252 ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it
 253 becomes necessary to upgrade SFFS such that it is stand-alone capable. One
 254 omission based on the assumption that VSX would always be present is an
 255 equivalent to `xvtstdcsp`.
 256
 257 Similar arguments apply to the GPR-FPR move operations, proposed in
 258 [[ls006]], with the opportunity taken to add rounding modes present
 259 in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
 260 rounding, one of the worst offenders of Computer Science, requires a
 261 phenomenal 35 instructions with *six branches* to emulate in Power
 262 ISA! For desktop as well as Server HTML/JS back-end execution of
 263 javascript this becomes an obvious priority, recognised already by ARM
 264 as just one example.
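
For readers unfamiliar with why this conversion is awkward, the sketch below
models it in Python, assuming the conversion meant here is the ECMAScript
`ToInt32` operation (the one ARM's `FJCVTZS` instruction targets). The
function name is hypothetical, and the sketch shows the required semantics,
not the Power ISA emulation sequence.

```python
import math


def js_to_int32(x: float) -> int:
    """Model of ECMAScript ToInt32 applied to a floating-point number."""
    if math.isnan(x) or math.isinf(x):
        return 0                          # NaN and infinities become 0
    n = math.trunc(x) % (1 << 32)         # truncate toward zero, wrap mod 2**32
    # Reinterpret the unsigned 32-bit value as signed.
    return n - (1 << 32) if n >= (1 << 31) else n


for value in (3.7, -3.7, 2.0 ** 32 + 5.0, 2.0 ** 31, float("nan")):
    print(value, js_to_int32(value))
```

The special cases and the modulo wrap (rather than saturation) are what
force a branch-heavy sequence on an ISA without a dedicated conversion mode.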
 265
 266 ## Bitmanip LUT2/3
 267
 268 These LUT2/3 operations are high cost high reward. Outlined in
 269 [[sv/bitmanip]], the simplest ones already exist in PackedSIMD VSX:
 270 `xxeval`.  The same reasoning applies as to fclass: SFFS needs to be
 271 stand-alone on its own merits and should an implementor
 272 choose not to implement any aspect of PackedSIMD VSX the performance
 273 of their product should not be penalised for making that decision.
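
A ternary lookup-table operation of the kind discussed here can be sketched
bit-by-bit in Python. The function name and the ordering of the three inputs
within the table index are assumptions for illustration, and the actual
definitions are in [[sv/bitmanip]].

```python
def ternary_lut(a, b, c, imm8, width=64):
    """Apply an 8-entry truth table to each bit position of a, b and c.

    Assumption: the table index is formed as (a_bit, b_bit, c_bit) with
    a_bit as the most significant of the three.
    """
    result = 0
    for i in range(width):
        index = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> index) & 1) << i
    return result


a, b, c = 0xF0, 0xCC, 0xAA
print(hex(ternary_lut(a, b, c, 0x80, width=8)))   # three-way AND
print(hex(ternary_lut(a, b, c, 0x96, width=8)))   # three-way XOR
print(hex(ternary_lut(a, b, c, 0xCA, width=8)))   # bitwise select: a ? b : c
```

One 8-bit immediate therefore selects any of the 256 possible three-input
bitwise functions, which is where both the cost (immediate bits) and the
reward (instruction count) come from.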
 274
 275 With Predication being such a high priority in GPUs and HPC, CR Field
 276 variants of Ternary and Binary LUT instructions were considered high
 277 priority, and again just like in the CRweird group the opportunity was
 278 taken to work on *all* bits of a CR Field rather than just one bit as
 279 is done with the existing CR operations crand, cror etc.
 280
 281 The other high strategic value instruction is `grevlut` (and `grevluti`
 282 which can generate a remarkably large number of regular-patterned magic
 283 constants).  The grevlut set requires of the order of 20,000 gates but
 284 provide an astonishing plethora of innovative bit-permuting instructions
 285 never seen in any other ISA.
 286
 287 The downside of all of these instructions is the extremely low XO bit
 288 requirements: 2-3 bit XO due to the large immediates *and* the number of
 289 operands required.  The LUT3 instructions are already compacted down to
 290 "Overwrite" variants.  (By contrast the Float-Load-Immediate instructions
 291 are a much larger XO because despite having 16-bit immediate only one
 292 Register Operand is needed).
 293
 294 Realistically these high-value instructions should be proposed in EXT2xx
 295 where their XO cost does not overwhelm EXT0xx.
 296
 297
 298 ## (f)mv.swizzle
 299
 300 [[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value
 301 as a Scalar instruction is limited *except* if combined with `cmpi` and
 302 SVP64Single Predication, whereupon the end result is the RISC-synthesis
 303 of Compare-and-Swap, in two instructions.
 304
 305 Where this instruction comes into its full value is when Vectorised.
 306 3D GPU and HPC numerical workloads astonishingly contain between 10 and 15%
 307 swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
 308 balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
 309 Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded
 310 GPU ISAs have a staggering 24-bits dedicated to 2-operand Swizzle.
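
As a reminder of what a swizzle does, here is a minimal Python sketch. The
function name and the use of a letter string as the selector are purely
illustrative, and the immediate encoding used by the proposed instructions
is not modelled.

```python
LANES = {"X": 0, "Y": 1, "Z": 2, "W": 3}


def swizzle(vec, pattern):
    """Select and reorder elements of a 4-element vector by lane name."""
    return [vec[LANES[lane]] for lane in pattern]


xyzw = [1.0, 2.0, 3.0, 4.0]
print(swizzle(xyzw, "YYZ"))
print(swizzle(xyzw, "XY"))
```

Lanes may be repeated, dropped or reordered freely, which is why a general
swizzle needs so many selector bits per operand.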
 311
 312 So as not to radicalise the Power ISA the Libre-SOC team decided to
 313 introduce mv Swizzle operations, which can always be Macro-op fused
 314 in exactly the same way that ARM SVE predicated-move extends 3-operand
 315 "overwrite" opcodes to full independent 3-in 1-out.
 316
 317 ## BMI (bitmanipulation) group
 318
 319 Whilst the [[sv/vector_ops]] instructions are only two in number, in
 320 reality the `bmask` instruction has a Mode field allowing it to cover
 321 **24** instructions, more than have been added to any other CPUs by
 322 ARM, Intel or AMD.  Analysis of the BMI sets of these CPUs shows simple
 323 patterns that can greatly simplify both Decode and implementation. These
 324 are sufficiently commonly used, saving instruction count regularly,
 325 that they justify going into EXT0xx.
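
The "simple patterns" can be illustrated with three well-known bit
manipulation idioms, each combining a value with itself plus or minus one.
This Python sketch uses hypothetical function names and does not reproduce
how the `bmask` Mode field selects between operations.

```python
MASK64 = (1 << 64) - 1


def clear_lowest_set(x):
    """x & (x - 1): clear the lowest set bit."""
    return x & (x - 1) & MASK64


def isolate_lowest_set(x):
    """x & -x: keep only the lowest set bit."""
    return x & -x & MASK64


def mask_up_to_lowest_set(x):
    """x ^ (x - 1): set all bits up to and including the lowest set bit."""
    return (x ^ (x - 1)) & MASK64


value = 0b10100
print(bin(clear_lowest_set(value)),
      bin(isolate_lowest_set(value)),
      bin(mask_up_to_lowest_set(value)))
```

Because every variant shares the same shape (an add or subtract of one, an
optional inversion, and one logical operation), a single instruction with a
mode selector can plausibly cover the whole family.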
 326
 327 The other instruction is `cprop` - Carry-Propagation - which takes
 328 the P and Q from carry-propagation algorithms and generates carry
 329 look-ahead. Greatly increases the efficiency of arbitrary-precision
 330 integer arithmetic by combining what would otherwise be half a dozen
 331 instructions into one. However it is still not a huge priority unlike
 332 `bmask` so is probably best placed in EXT2xx.
 333
 334 ## Float-Load-Immediate
 335
 336 Very easily justified.  As explained in [[ls002]] these always save one
 337 LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
 338 FP value being in the I-Cache side.  It is such a high priority that
 339 these instructions are easily justifiable adding into EXT0xx, despite
 340 requiring a 16-bit immediate.  By designing the second-half instruction
 341 as a Read-Modify-Write it saves on XO bitlength (only 5 bits), and can be
 342 macro-op fused with its first-half to store a full IEEE754 FP32 immediate
 343 into a register.
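
The two-instruction construction of an FP32 immediate can be sketched as
below. This is a simplified model with hypothetical function names: it
assumes the first instruction supplies the upper 16 bits of the FP32 pattern
and the second merges in the lower 16 bits, and it models the register as a
raw 32-bit pattern rather than as a Power ISA FPR.

```python
import struct


def load_upper(imm16):
    """First half: place a 16-bit immediate in the top of an FP32 pattern."""
    return (imm16 & 0xFFFF) << 16


def merge_lower(reg_bits, imm16):
    """Second half (Read-Modify-Write): fill in the low 16 bits."""
    return (reg_bits & 0xFFFF0000) | (imm16 & 0xFFFF)


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


reg = load_upper(0x4049)
print(bits_to_float(reg))                 # coarse value from first half alone
reg = merge_lower(reg, 0x0FDB)
print(bits_to_float(reg))                 # full FP32 approximation of pi
```

Constants that need only the upper half take a single 32-bit instruction.
The second half is needed only when the full FP32 precision matters.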
 344
 345 There is little point in putting these instructions into EXT2xx. Their
 346 very benefit and inherent value *is* as 32-bit instructions, not 64-bit
 347 ones. Likewise there is less value in taking up EXT1xx Encoding space
 348 because EXT1xx only brings an additional 16 bits (approx) to the table,
 349 and that is provided already by the second-half instruction.
 350
 351 Thus they qualify as both high priority and also EXT0xx candidates.
 352
 353 ## FPR/GPR LD/ST-PostIncrement-Update
 354
 355 These instructions, outlined in [[ls011]], save hugely in hot-loops.
 356 Early ISAs such as PDP-8, PDP-11, which inspired the iconic Motorola
 357 68000, 88100, Mitch Alsup's MyISA 66000, and can even be traced back to
 358 the iconic ultra-RISC CDC 6600, all had both pre- and post- increment
 359 Addressing Modes.
 360
 361 The reason is very simple: it is a direct recognition of the practice
 362 in C to frequently utilise both `*p++` and `*++p` which itself stems
 363 from common need in Computer Science algorithms.
 364
 365 The problem for the Power ISA is - was - that the opcode space needed
 366 to support both was far too great, and the decision was made to go with
 367 pre-increment, on the basis that outside the loop a "pre-subtraction"
 368 may be performed.
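
The difference between the two addressing styles, and the "pre-subtraction"
workaround, can be shown with a small Python model in which a list stands in
for memory and an integer index stands in for the pointer register. The
function names are hypothetical.

```python
def sum_pre_increment(mem, start, count):
    """Model of *++p: update the pointer first, then access."""
    p = start - 1            # the "pre-subtraction" done outside the loop
    total = 0
    for _ in range(count):
        p += 1               # update ...
        total += mem[p]      # ... then load
    return total


def sum_post_increment(mem, start, count):
    """Model of *p++: access first, then update the pointer."""
    p = start
    total = 0
    for _ in range(count):
        total += mem[p]      # load ...
        p += 1               # ... then update
    return total


memory = [10, 20, 30, 40, 50]
print(sum_pre_increment(memory, 1, 3), sum_post_increment(memory, 1, 3))
```

Both loops visit the same elements. The pre-increment form simply needs the
extra adjustment before the loop starts.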
 369
 370 Whilst this is a "solution" it is less than ideal, and the opportunity
 371 exists now with the EXT2xx Primary Opcodes to correct this and bring
 372 Power ISA up a level.
 373
 374 ## Shift-and-add
 375
 376 Shift-and-Add are proposed in [[ls004]].  They mitigate the need to add
 377 LD-ST-Shift instructions which are a high-priority aspect of both x86
 378 and ARM.  LD-ST-Shift is normally just the one instruction: Shift-and-add
 379 brings that down to two, where Power ISA presently requires three.
 380 Cryptography e.g. twofish also makes use of Integer double-and-add,
 381 so the value of these instructions is not limited to Effective Address
 382 computation.  They will also have value in Audio DSP.
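
The address-computation use can be sketched as a single shift followed by an
add. The Python function below is a hypothetical model: the operand order,
the permitted shift amounts and the 64-bit wrap are assumptions, and the
actual instruction definitions are in [[ls004]].

```python
MASK64 = (1 << 64) - 1


def shift_add(index, base, shift):
    """Compute (index << shift) + base, wrapped to 64 bits."""
    return ((index << shift) + base) & MASK64


# Effective Address of element 5 in an array of 8-byte items at 0x1000.
print(hex(shift_add(5, 0x1000, 3)))
```

The result can feed an ordinary load or store directly, giving the
two-instruction sequence the text describes.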
 383
 384 Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
 385 when their whole purpose and value is to reduce binary size in Address
 386 offset computation, thus they are best placed in EXT0xx.
 387
 388
 389 # Tables
 390
 391 The original tables are available publicly as a CSV file at
 392 <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls012/optable.csv;hb=HEAD>.
 393 A python program auto-generates the tables in the following sections
 394 by sorting into different useful priorities.
 395
 396 The key to headings and sections are as follows:
 397
 398 * **Area** - Target Area as described in above sections
 399 * **XO Cost** - the number of bits required in the XO Field. Whilst not
 400   the full picture it is a good indicator as to how costly in terms
 401   of Opcode Allocation a given instruction will be.  Lower number is
 402   a higher cost for the Power ISA's precious remaining Opcode space.
 403   "PO" indicates that an entire Primary Opcode is required.
 404 * **rfc** - the Libre-SOC External RFC resource,
 405   <https://libre-soc.org/openpower/sv/rfc/> where advance notice of
 406   upcoming RFCs in development may be found.
 407   *Reading advance Draft RFCs and providing feedback strongly advised*,
 408   it saves time and effort for the OPF ISA Workgroup.
 409 * **SVP64** - Vectoriseable (SVP64-Prefixable) - also implies that
 410   SVP64Single is also permitted (required).
 411 * **page** - Libre-SOC wiki page at which further information can
 412   be found.  Again: **advance reading strongly advised due to the
 413   sheer volume of information**.
 414 * **PO1** - the instruction is capable of being PO1-Prefixed
 415   (given an EXT1xx Opcode Allocation). Bear in mind that this option
 416   is **mutually exclusively incompatible** with Vectorisation.
 417 * **group** - the Primary Opcode Group recommended for this instruction.
 418   Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx.  A third area
 419   (UnVectoriseable),
 420   EXT3xx, was available in an early Draft RFC but has been made "RESERVED"
 421   instead.  See [[sv/po9_encoding]].
 422 * **regs** - a guide to register usage, to how costly Hazard Management
 423   will be, in hardware:
 424   - 1R: reads one GPR/FPR/SPR/CR.
 425   - 1W: writes one GPR/FPR/SPR/CR.
 426   - 1r: reads one CR *Field* (not necessarily the entire CR)
 427   - 1w: writes one CR *Field* (not necessarily the entire CR)
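
As a rough guide to reading the **XO Cost** column, the sketch below treats
an instruction with an n-bit XO as occupying one part in 2^n of a Primary
Opcode. This is a simplifying assumption for illustration only (real
encodings overlap in more complicated ways), and the function name is
hypothetical.

```python
from fractions import Fraction


def share_of_primary_opcode(xo_bits):
    """Approximate fraction of one Primary Opcode used by an n-bit XO."""
    return Fraction(1, 2 ** xo_bits)


for bits in (2, 3, 5, 10):
    print(bits, share_of_primary_opcode(bits))
```

Under this approximation a 3-bit XO instruction consumes as much space as
128 separate 10-bit XO instructions, which is why a lower number indicates a
higher cost.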
 428
 429 [[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
 430 [[!inline pages="openpower/sv/rfc/ls012/xo_cost.mdwn" raw=yes ]]
 431
 432 [[!tag opf_rfc]]