1 # External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops
2
3 **Date: 2023apr10. v2 released: TODO**
4
5 * Funded by NLnet Grants under EU Horizon Grants 101069594 and 825310
6 * <https://git.openpower.foundation/isa/PowerISA/issues/121>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=1052>
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=1054>
10
11 The purpose of this RFC is:
12
13 * to give a full list of upcoming **Scalar** opcodes developed by Libre-SOC
14 (being cognisant that *all* of them are Vectorizeable)
15 * to give OPF Members and non-Members alike the opportunity to comment and get
16 involved early in RFC submission
17 * formally agree a priority order on an iterative basis with new versions
18 of this RFC,
* which ones should be in the EXT022 Sandbox, which in EXT0xx, which in EXT2xx, and which
20 not proposed at all,
21 * keep readers summarily informed of ongoing Detailed RFC submissions, with new versions
22 of this RFC,
23 * for IBM (in their capacity as Allocator of Opcodes)
24 to get a clear advance picture of Opcode Allocation
25 *prior* to submission (as Detailed RFCs)
26
27 As this is a Formal ISA RFC the evaluation shall ultimately define
28 (in advance of the actual submission of the instructions themselves)
29 which instructions will be submitted over the next 1-18 months.
30
31 *It is expected that readers visit and interact with the Libre-SOC
32 resources in order to do due-diligence on the prioritisation
33 evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs
34 that may turn out not to be useful, against a background of having
35 no guiding overview or pre-filtering, and everybody's precious time
36 is wasted. Also note that the Libre-SOC Team, being funded by NLnet
37 under Privacy and Enhanced Trust Grants, are **prohibited** from signing
38 Commercial-Confidentiality NDAs, as doing so is a direct conflict of
39 interest with their funding body's Charitable Foundation Status and remit,
40 and therefore the **entire** set of almost 150 new SFFS instructions
41 can only go via the External RFC Process. Also be advised and aware
42 that "Libre-SOC" != "RED Semiconductor Ltd". The two are completely
43 **separate** organisations*.
44
45 Worth bearing in mind during evaluation that every "Defined Word-instruction" may
46 or may not be Vectorizeable, but that every "Defined Word-instruction" should have
47 merits on its own, not just when Vectorized, precisely because the
48 instructions are Scalar. An example of a borderline
49 Vectorizeable Defined Word-instruction is `mv.swizzle` which only really becomes
50 high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
51 less merit as a Scalar-only operation, yet when SVP64Single-Prefixed
52 can be part of an atomic Compare-and-Swap sequence.
53
54 Although one of the top world-class ISAs, Power ISA Scalar (SFFS) has
55 not been significantly advanced in 12 years: IBM's primary focus has
56 understandably been on PackedSIMD VSX. Unfortunately, with VSX being
57 914 instructions and 128-bit it is far too much for any new team to
58 consider (10+ years development effort) and far outside of Embedded or
59 Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar up-to-date
60 to modern standards *and on its own merits* is a reasonable goal, and
the advantage of the reduced focus is that SFFS remains RISC-paradigm,
with lessons learned from other ISAs over the intervening years.
63 Good examples here include `bmask`.
64
65 SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
66 as well as "True-Scalable-Vector Prefixing" - also literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined
Word-instructions", their value when Vector-Prefixed, *as well as* when
SVP64Single-Prefixed, unavoidably and simultaneously has to be taken
into consideration.
70
71 **Target areas**
72
73 Whilst entirely general-purpose there are some categories that these
instructions are targeting: Bit-manipulation, Big-integer, Cryptography,
75 Audio/Visual, High-Performance Compute, GPU workloads and DSP.
76
77 \newpage{}
78
79 **Instruction count guide and approximate priority order**
80
81 |qty| description | RFC | URL |
82 |-|---------------------------------------|---------------|-------------------|
83 | 6 | SVP64 Management |[[ls008]] <br> [[ls009]] <br> [[ls010]] | |
84 | 5 | CR weirds | [[ls015]] | [[sv/cr_int_predication]] |
85 | 4 | INT<->FP mv | [[ls006.fpintmv]] | |
86 | 19 | GPR LD/ST-PostInc-Update (saves hugely in hot-loops) | [[ls011]] | |
87 | ~12 | FPR LD/ST-PostInc-Update (ditto) | [[ls011]] | |
88 | 11 | GPR LD/ST-Shifted-PostInc-Update (saves in hot-loops) | [[ls011]] | |
89 | 4 | FPR LD/ST-Shifted-PostInc-Update (ditto) | [[ls011]] | |
90 | 26 | GPR LD/ST-Shifted (again saves hugely in hot-loops) | [[ls004]] | |
91 | 11 | FPR LD/ST-Shifted (ditto) | [[ls004]] | |
92 | 2 | Float-Load-Immediate (always saves D-Cache) | [[ls002.fmi]] | |
93 | 5 | Big-Integer Chained 3-in 2-out (64-bit Carry) | [[ls003.bignum]] | [[sv/biginteger]] |
94 | 6 | Bitmanip LUT2/3 ops. high cost high reward | [[ls007]] | [[sv/bitmanip]] |
95 | 1 | fclass (Scalar variant of xvtstdcsp) |TBD| [[sv/fclass]] |
96 | 5 | Audio-Video |TBD| [[sv/av_opcodes]] |
97 | 2 | Shift-and-Add (mitigates LD-ST-Shift; Crypto twofish) | [[ls004]] | |
98 | 2 | BMI group | [[ls014]] | [[sv/vector_ops]] |
99 | 2 | GPU swizzle |TBD| [[sv/mv.swizzle]] |
100 | 9 | FP DCT/FFT Butterfly (2/3-in 2-out) | [[ls016]] | [[sv/twin_butterfly]] |
101 | ~2? | Integer DCT/FFT Butterfly | [[ls016]] | [[sv/twin_butterfly]] |
102 | 18 | Trigonometric (1-arg) |?| [[openpower/transcendentals]] |
103 | 15 | Transcendentals (1-arg) |?| [[openpower/transcendentals]] |
104 | 25 | Transcendentals (2-arg) |?| [[openpower/transcendentals]] |
105
106 Summary tables are created below by different sort categories. Additional
107 columns (and tables) as necessary can be requested to be added as part
108 of update revisions to this RFC.
109
110 \newpage{}
111
112 # Target Area summaries
113
114 Please note that there are some instructions developed thanks to
115 NLnet funding that have not been included here for assessment. Examples
include `pcdec` and the Galois Field arithmetic operations. From a purely
practical perspective, due to the sheer quantity, the lower-priority instructions
were simply left out. However they remain in the Libre-SOC resources.
119
120 Some of these SFFS instructions appear to be duplicates of VSX.
121 A frequent argument comes up that if instructions are in VSX already they
122 should not be added to SFFS, especially if they are nominally the same.
123 The logic that this effectively damages performance of an SFFS-only
implementation was raised earlier; however, there is a more subtle reason
125 why the instructions are needed.
126
127 Future versions of SVP64 and SVP64Single are expected to be developed
128 by future Power ISA Stakeholders on top of VSX. The decisions made
129 there about the meaning of Prefixed Vectorized VSX may be *completely
130 different* from those made for Prefixed SFFS instructions. At which
131 point the lack of SFFS equivalents would penalise SFFS implementors in a
132 much more severe way, effectively expecting them and SFFS programmers to
133 work with a non-orthogonal paradigm, to their detriment. The solution
134 is to give the SFFS Subset the space and respect that it deserves and
135 allow it to be stand-alone on its own merits.
136
137 ## SVP64 Management instructions
138
139 These without question have to go in EXT0xx. Future extended variants,
140 bringing even more powerful capabilities, can be followed up later with
141 EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
142 *Only `svstep` is actually Vectorizeable*, all other Management
143 instructions are Unvectorizeable. PO1-Prefixed examples include
144 adding psvshape in order to support both Inner and Outer Product Matrix
145 Schedules, by providing the option to directly reverse the order of the
146 triple loops. Outer is used for standard Matrix Multiply (on top of a
147 standard MAC or FMAC instruction), but Inner is required for Warshall
148 Transitive Closure (on top of a cumulatively-applied max instruction).
149
Except for `svstep`, which is Vectorizeable, the Management Instructions
151 themselves are all 32-bit Defined Word-instructions (Scalar Operations), so
PO1-Prefixing is perfectly reasonable. The SVP64 Management instructions,
of which there are only 6, are all 5 or 6 bit XO, meaning that the opcode
154 space they take up in EXT0xx is not alarmingly high for their intrinsic
155 strategic value.
156
157 ## Transcendentals
158
159 Found at [[openpower/transcendentals]] these subdivide into high
160 priority for accelerating general-purpose and High-Performance Compute,
161 specialist 3D GPU operations suited to 3D visualisation, and low-priority
162 less common instructions where IEEE754 full bit-accuracy is paramount.
163 In 3D GPU scenarios for example even 12-bit accuracy can be overkill,
164 but for HPC Scientific scenarios 12-bit would be disastrous.
165
166 There are a **lot** of operations here, and they also bring Power
167 ISA up-to-date to IEEE754-2019. Fortunately the number of critical
168 instructions is quite low, but the caveat is that if those operations
169 are utilised to synthesise other IEEE754 operations (divide by `pi` for
170 example) full bit-level accuracy (a hard requirement for IEEE754) is lost.
171
172 Also worth noting that the Khronos Group defines minimum acceptable
173 bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
accuracy demanded by IEEE754. The reason for the Khronos definitions is
a massive reduction, often four-fold, in power consumption and gate count,
because 3D Graphics simply has no need for full accuracy.
177
178 *For 3D GPU markets this definitely needs addressing*
179
180 These instructions are therefore only likely to be proposed if a Stakeholder
181 comes forward and needs them. If for example RED Semiconductor Ltd had a
182 customer requiring a GPS/GNSS Correlator DSP then the SIN/COS Transcendentals
183 would become a high priority but still be optional, as DSP (and 3D) is still
184 specialist.
185
186 ## Audio/Video
187
188 Found at [[sv/av_opcodes]] these do not require Saturated variants
189 because Saturation is added via [[sv/svp64]] (Vector Prefixing) and
190 via [[sv/svp64-single]] Scalar Prefixing. This is important to note for
191 Opcode Allocation because placing these operations in the Unvectorizeable
192 areas would irredeemably damage their value. Unlike PackedSIMD ISAs
193 the actual number of AV Opcodes is remarkably small once the usual
194 cascading-option-multipliers (SIMD width, bitwidth, saturation,
195 HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
196 absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".
197
The min/max set are under their own RFC, [[ls013]]. They are sufficiently
high priority: fmax requires an astounding 32 SFFS instructions.
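
To illustrate where that count comes from, here is a minimal C sketch of
one plausible reading of IEEE754-2019 maximum semantics (NaN propagation
plus signed-zero ordering); emulating this behaviour with plain SFFS
instructions is what drives the count up:

```c
#include <math.h>

/* Hedged sketch only: one plausible reading of IEEE754-2019 maximum
 * semantics (propagate NaN, treat -0 as less than +0). */
static double fmax_sketch(double a, double b)
{
    if (isnan(a) || isnan(b))
        return a + b;                    /* propagate a quiet NaN */
    if (a == 0.0 && b == 0.0)
        return signbit(a) ? b : a;       /* -0 < +0 */
    return (a > b) ? a : b;
}
```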
200
201 ## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV
202
203 The number of uses in Computer Science for DCT, NTT, FFT and DFT,
is astonishing. The Wikipedia page lists over a hundred separate and
distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon
Error Correction; the list goes on and on. ARM has special dedicated
207 Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT
208 Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
209 DSP can do full FFT triple loops in one VLIW group.
210
211 It should be pretty clear this is high priority.
212
213 With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
214 the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
215 operations, typically performing for example one multiply but in-place
216 subtracting that product from one operand and adding it to the other.
217 The *in-place* aspect is strategically extremely important for significant
218 reductions in Vectorized register usage, particularly for DCT.
219 Further: even without Simple-V the number of instructions saved is huge: 8 for
220 integer and 4 for floating-point vs one.
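
A minimal sketch (illustrative only, not the draft pseudocode) of the kind
of in-place floating-point twin operation described above, in the familiar
radix-2 butterfly form:

```c
/* Illustrative sketch of an in-place FP "twin butterfly": one multiply
 * whose product is added to one result and subtracted from the other,
 * both overwriting the inputs (radix-2 form with a real twiddle factor).
 * The actual draft instructions differ in detail; see [[sv/twin_butterfly]]. */
static void fp_butterfly_sketch(double *a, double *b, double c)
{
    double prod = *b * c;
    double hi = *a + prod;
    double lo = *a - prod;
    *a = hi;
    *b = lo;
}
```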
221
222 ## CR Weird group
223
224 Outlined in [[sv/cr_int_predication]] these instructions massively save
225 on CR-Field instruction count. Multi-bit to single-bit and vice-versa
226 normally requiring several CR-ops (crand, crxor) are done in one single
227 instruction. The reason for their addition is down to SVP64 overloading
228 CR Fields as Vector Predicate Masks. Reducing instruction count in
229 hot-loops is considered high priority.
230
231 An additional need is to do popcount on CR Field bit vectors but adding
232 such instructions to the *Condition Register* side was deemed to be far
233 too much. Therefore, priority was given instead to transferring several
234 CR Field bits into GPRs, whereupon the full set of Standard Scalar GPR
235 Logical Operations may be used. This strategy has the side-effect of
236 keeping the CRweird group down to only five instructions.
237
238 ## Big-integer Math
239
240 [[sv/biginteger]] has always been a high priority area for commercial
241 applications, privacy, Banking, as well as HPC Numerical Accuracy:
libgmp, and cryptographic uses in Asymmetric Ciphers. poly1305
243 and ec25519 are finding their way into everyday use via OpenSSL.
244
245 A very early variant of the Power ISA had a 32-bit Carry-in Carry-out
246 SPR. Its removal from subsequent revisions is regrettable. An alternative
247 concept is to add six explicit 3-in 2-out operations that, on close
248 inspection, always turn out to be supersets of *existing Scalar
249 operations* that discard upper or lower DWords, or parts thereof.
250
251 *Thus it is critical to note that not one single one of these operations
252 expands the bitwidth of any existing Scalar pipelines*.
253
254 The `dsld` instruction for example merely places additional LSBs into the
255 64-bit shift (64-bit carry-in), and then places the (normally discarded)
256 MSBs into the second output register (64-bit carry-out). It does **not**
257 require a 128-bit shifter to replace the existing Scalar Power ISA
258 64-bit shifters.
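
A conceptual C sketch of that behaviour (not the exact draft-RFC
pseudocode) shows that only existing 64-bit shift paths are needed:

```c
#include <stdint.h>

/* Conceptual sketch only (not the draft-RFC pseudocode): a 64-bit left
 * shift where the incoming LSBs are supplied by a second source register
 * ("64-bit carry-in") and the MSBs that would normally be discarded are
 * captured in a second result ("64-bit carry-out").  No 128-bit shifter
 * is required. */
static uint64_t dsld_sketch(uint64_t ra, uint64_t carry_in, unsigned sh,
                            uint64_t *carry_out)
{
    sh &= 63;
    if (sh == 0) {
        *carry_out = 0;
        return ra;
    }
    *carry_out = ra >> (64 - sh);                 /* normally-discarded MSBs */
    return (ra << sh) | (carry_in >> (64 - sh));  /* LSBs come from carry-in */
}
```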
259
260 The reduction in instruction count these operations bring, in critical
261 hot loops, is remarkably high, to the extent where a Scalar-to-Vector
262 operation of *arbitrary length* becomes just the one Vector-Prefixed
263 instruction.
264
Whilst these are 5-6 bit XO, their utility is considered of high strategic
value, and as such they are strongly advocated to be in EXT04. The alternative
267 is to bring back a 64-bit Carry SPR but how it is retrospectively
268 applicable to pre-existing Scalar Power ISA multiply, divide, and shift
269 operations at this late stage of maturity of the Power ISA is an entire
270 area of research on its own deemed unlikely to be achievable.
271
272 Note: none of these instructions are in VSX. They are a different paradigm
and have more in common with their x86 equivalents.
274
275 **Critical to note regarding 2-out instructions**:
276
277 <https://groups.google.com/g/comp.arch/c/_-dp_ZU6TN0/m/hVuZt86_BgAJ>
278
279 ```
280 >For example, having instructions with 2 dest registers changes
281 >the cost for a multi-lane OoO renamer from BigO(n^2) to BigO((2n)^2)
282 >so a 4-lane 2-dest renamer costs 16 times as much.
283 >And this is for a feature that would be rarely used and is redundant.
284 ```
285
Further down the thread another author observes that Operand-Forwarding mitigates this
problem, but that sufficient "advance notice" is needed. An inline-assembler chained
288 sequence such as those **required** for bigint would be considered such and
289 thus the high cost may be avoided.
290
291 ## fclass and GPR-FPR moves
292
293 [[sv/fclass]] - just one instruction. With SFFS being locked down to
294 exclude VSX, and there being no desire within the nascent OpenPOWER
295 ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it
296 becomes necessary to upgrade SFFS such that it is stand-alone capable. One
297 omission based on the assumption that VSX would always be present is an
298 equivalent to `xvtstdcsp`.
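
For readers unfamiliar with data-class testing, a hedged sketch of the
general idea follows (the category bit assignments are illustrative, not
the actual DCMX encoding of `xvtstdcsp` nor the proposed scalar `fclass`):

```c
#include <math.h>

/* Illustrative only: test a value against a mask of IEEE754 data classes.
 * Bit assignments are invented for this sketch. */
enum {
    CLS_NAN = 1,         CLS_POS_INF = 2,     CLS_NEG_INF = 4,
    CLS_POS_ZERO = 8,    CLS_NEG_ZERO = 16,
    CLS_POS_DENORM = 32, CLS_NEG_DENORM = 64
};

static int fclass_match(double x, int mask)
{
    int cls = 0;
    switch (fpclassify(x)) {
    case FP_NAN:       cls = CLS_NAN; break;
    case FP_INFINITE:  cls = signbit(x) ? CLS_NEG_INF : CLS_POS_INF; break;
    case FP_ZERO:      cls = signbit(x) ? CLS_NEG_ZERO : CLS_POS_ZERO; break;
    case FP_SUBNORMAL: cls = signbit(x) ? CLS_NEG_DENORM : CLS_POS_DENORM; break;
    default:           cls = 0; break;    /* FP_NORMAL: no class bit here */
    }
    return (cls & mask) != 0;
}
```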
299
Similar arguments apply to the GPR-FPR (INT<->FP) move operations, proposed in
301 [[ls006.fpintmv]], with the opportunity taken to add rounding modes present
302 in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
303 rounding, one of the worst offenders of Computer Science, requires a
304 phenomenal 35 instructions with *six branches* to emulate in Power
305 ISA! For desktop as well as Server HTML/JS back-end execution of
306 javascript this becomes an obvious priority, recognised already by ARM
307 as just one example.
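
To make the cost concrete, here is a hedged C sketch of the
ECMAScript-style double-to-int32 conversion ("ToInt32": truncate toward
zero, wrap modulo 2^32, NaN/Infinity to zero) that such a
rounding-mode-equipped FP-to-INT move would perform in one instruction:

```c
#include <stdint.h>
#include <math.h>

/* Hedged sketch of ECMAScript-style ToInt32: truncate toward zero, wrap
 * modulo 2^32, NaN and Infinity map to zero.  Without hardware support
 * every one of these steps becomes explicit instructions and branches. */
static int32_t js_toint32_sketch(double x)
{
    if (isnan(x) || isinf(x))
        return 0;
    double t = trunc(x);                  /* round toward zero */
    double m = fmod(t, 4294967296.0);     /* wrap modulo 2^32  */
    if (m < 0.0)
        m += 4294967296.0;
    return (int32_t)(uint32_t)m;
}
```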
308
309 Whilst some of these instructions have VSX equivalents they must not
310 be excluded on that basis. SVP64/VSX may have a different meaning from
SVP64/SFFS, i.e. the two *Vectorized* instructions may not be equivalent.
312
313 ## Bitmanip LUT2/3
314
315 These LUT2/3 operations are high cost high reward. Outlined in
316 [[sv/bitmanip]], the simplest ones already exist in PackedSIMD VSX:
317 `xxeval`. The same reasoning applies as to fclass: SFFS needs to be
318 stand-alone on its own merits and should an implementor choose not to
319 implement any aspect of PackedSIMD VSX the performance of their product
320 should not be penalised for making that decision.
321
322 With Predication being such a high priority in GPUs and HPC, CR Field
323 variants of Ternary and Binary LUT instructions were considered high
324 priority, and again just like in the CRweird group the opportunity was
325 taken to work on *all* bits of a CR Field rather than just one bit as
326 is done with the existing CR operations crand, cror etc.
327
328 The other high strategic value instruction is `grevlut` (and `grevluti`
329 which can generate a remarkably large number of regular-patterned magic
330 constants). The grevlut set require of the order of 20,000 gates but
331 provide an astonishing plethora of innovative bit-permuting instructions
332 never seen in any other ISA.
333
334 The downside of all of these instructions is the extremely low XO bit
335 requirements: 2-3 bit XO due to the large immediates *and* the number of
336 operands required. The LUT3 instructions are already compacted down to
337 "Overwrite" variants. (By contrast the Float-Load-Immediate instructions
338 are a much larger XO because despite having 16-bit immediate only one
339 Register Operand is needed).
340
341 Realistically these high-value instructions should be proposed in EXT2xx
342 where their XO cost does not overwhelm EXT0xx.
343
344 ## (f)mv.swizzle
345
346 [[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value
347 as a Scalar instruction is limited *except* if combined with `cmpi` and
348 SVP64Single Predication, whereupon the end result is the RISC-synthesis
349 of Compare-and-Swap, in two instructions.
350
351 Where this instruction comes into its full value is when Vectorized.
3D GPU and HPC numerical workloads astonishingly contain between 10 and 15%
swizzle operations: accessing YYZ or XY of an XYZW Quaternion, or performing
354 balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
355 Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded
356 GPU ISAs have a staggering 24-bits dedicated to 2-operand Swizzle.
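
For readers unfamiliar with the term, a minimal sketch of the data-movement
pattern follows (the real `mv.swizzle` encoding and element widths differ;
this only shows what a "swizzle" does):

```c
#include <stdint.h>

/* Illustrative only: a 4-lane swizzle, where every destination lane
 * selects an arbitrary source lane (e.g. ".yyzx" on an XYZW vector).
 * The actual mv.swizzle semantics are in [[sv/mv.swizzle]]. */
typedef struct { float v[4]; } vec4;

static vec4 swizzle_sketch(vec4 src, const uint8_t sel[4]) /* sel[i] in 0..3 */
{
    vec4 dst;
    for (int i = 0; i < 4; i++)
        dst.v[i] = src.v[sel[i] & 3];
    return dst;
}
```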
357
So as not to radically alter the Power ISA, the Libre-SOC team decided to
359 introduce mv Swizzle operations, which can always be Macro-op fused
360 in exactly the same way that ARM SVE predicated-move extends 3-operand
361 "overwrite" opcodes to full independent 3-in 1-out.
362
## BMI (bit-manipulation) group
364
365 Whilst the [[sv/vector_ops]] instructions are only two in number, in
366 reality the `bmask` instruction has a Mode field allowing it to cover
367 **24** instructions, more than have been added to any other CPUs by
368 ARM, Intel or AMD. Analysis of the BMI sets of these CPUs shows simple
369 patterns that can greatly simplify both Decode and implementation. These
370 are sufficiently commonly used, saving instruction count regularly,
371 that they justify going into EXT0xx.
372
373 The other instruction is `cprop` - Carry-Propagation - which takes
374 the P and Q from carry-propagation algorithms and generates carry
look-ahead. It greatly increases the efficiency of arbitrary-precision
376 integer arithmetic by combining what would otherwise be half a dozen
377 instructions into one. However it is still not a huge priority unlike
378 `bmask` so is probably best placed in EXT2xx.
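
As a rough illustration (a serial definition, not the actual `cprop`
pseudocode), the work the instruction replaces looks like this: given
per-limb propagate and generate bitmasks, compute which limbs receive an
incoming carry:

```c
#include <stdint.h>

/* Serial reference sketch only (not the cprop pseudocode): given per-limb
 * "generate" and "propagate" bitmasks from a multi-limb addition, compute
 * the bitmask of limbs that receive an incoming carry.  A carry-propagation
 * instruction is intended to produce this look-ahead result in one go. */
static uint64_t carry_lookahead_sketch(uint64_t gen, uint64_t prop, int carry_in)
{
    uint64_t carries = 0;
    int c = carry_in;
    for (int i = 0; i < 64; i++) {
        if (c)
            carries |= 1ULL << i;                        /* carry arrives at limb i */
        c = ((gen >> i) & 1) | (((prop >> i) & 1) & c);  /* carry into limb i+1     */
    }
    return carries;
}
```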
379
380 ## Float-Load-Immediate
381
Very easily justified. As explained in [[ls002.fmi]] these always save one
383 LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
384 FP value being in the I-Cache side. It is such a high priority that
385 these instructions are easily justifiable adding into EXT0xx, despite
386 requiring a 16-bit immediate. By designing the second-half instruction
387 as a Read-Modify-Write it saves on XO bit-length (only 5 bits), and
388 can be macro-op fused with its first-half to store a full IEEE754 FP32
389 immediate into a register.
390
391 There is little point in putting these instructions into EXT2xx. Their
392 very benefit and inherent value *is* as 32-bit instructions, not 64-bit
393 ones. Likewise there is less value in taking up EXT1xx Encoding space
394 because EXT1xx only brings an additional 16 bits (approx) to the table,
395 and that is provided already by the second-half instruction.
396
397 Thus they qualify as both high priority and also EXT0xx candidates.
398
399 ## FPR/GPR LD/ST-PostIncrement-Update
400
These instructions, outlined in [[ls011]], save hugely in hot-loops.
Early ISAs such as the PDP-8 and PDP-11 - which inspired the iconic
Motorola 68000 and 88100, and Mitch Alsup's My 66000 - along with the
iconic ultra-RISC CDC 6600, all had both pre- and post-increment
Addressing Modes.
406
407 The reason is very simple: it is a direct recognition of the practice
in C of frequently utilising both `*p++` and `*++p`, which itself stems
409 from common need in Computer Science algorithms.
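
A typical example of the pattern (ordinary C, nothing Power-specific) is
shown below; with LD/ST-PostInc-Update each load and store also bumps its
address register, removing a separate add per pointer per iteration:

```c
#include <stddef.h>
#include <stdint.h>

/* A common hot-loop shape: every iteration performs a load, a store and
 * two pointer bumps.  Post-increment-update forms fold each bump into
 * the corresponding memory operation. */
static uint64_t copy_accumulate(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint64_t acc = 0;
    while (n--) {
        uint64_t v = *src++;   /* load, then post-increment the pointer  */
        *dst++ = v;            /* store, then post-increment the pointer */
        acc += v;
    }
    return acc;
}
```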
410
411 The problem for the Power ISA is - was - that the opcode space needed
412 to support both was far too great, and the decision was made to go with
413 pre-increment, on the basis that outside the loop a "pre-subtraction"
414 may be performed.
415
416 Whilst this is a "solution" it is less than ideal, and the opportunity
417 exists now with the EXT2xx Primary Opcodes to correct this and bring
418 Power ISA up a level.
419
420 Where things begin to get more than a little hairy is if both
421 Post-Increment *and* Shifted are included. If SVP64 keeps one
422 single bit (/pi) dedicated in the `RM.Mode` field then this
problem goes away, at the cost of reducing SVP64's effectiveness.
424 However again, given that even the Shifted-Post-Increment
425 instructions are all 9-bit XO it is not outside the realm of
426 possibility to include them in EXT2xx.
427
428 ## Shift-and-add (and LD/ST Indexed-Shift)
429
430 Shift-and-Add are proposed in [[ls004]]. They mitigate the need to add
431 LD-ST-Shift instructions which are a high-priority aspect of both x86
432 and ARM. LD-ST-Shift is normally just the one instruction: Shift-and-add
433 brings that down to two, where Power ISA presently requires three.
Cryptography (e.g. twofish) also makes use of Integer double-and-add,
435 so the value of these instructions is not limited to Effective Address
436 computation. They will also have value in Audio DSP.
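
The Effective-Address case is easiest to see with a plain array access
(the quoted mnemonics in the comments are illustrative placeholders, not
proposed assembler syntax):

```c
#include <stddef.h>
#include <stdint.h>

/* a[i] with 8-byte elements.  On present SFFS this is three instructions
 * (shift, add, load); with a Shift-and-Add instruction it becomes two;
 * with LD-Indexed-Shifted it becomes one.
 *
 *   today:            sldi t,i,3 ; add t,a,t ; ld r,0(t)
 *   shift-and-add:    "shadd" t,a,i,3        ; ld r,0(t)
 *   ld-index-shifted: "ldxsh" r,a,i,3
 */
static uint64_t load_elem(const uint64_t *a, size_t i)
{
    return a[i];
}
```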
437
438 Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
439 when their whole purpose and value is to reduce binary size in Address
440 offset computation, thus they are best placed in EXT0xx.
441
442 The upside as far as adding them is concerned is that existing hardware
443 will already have amalgamated pipelines with very few actual back-end
444 (Micro-Coded) internal operations (likely just two: one load, one store).
445 Passing a 2-bit additional immediate field down to those pipelines really
446 is not hard.
447
448 *(Readers unfamiliar with Micro-coding should look at the Microwatt VHDL
449 source code)*
450
451 Also included because it is important to see the quantity of instructions:
452 LD/ST-Indexed-Shifted. Across Update variants, Byte-reverse variants,
453 Arithmetic and FP, the total is a slightly-eye-watering **37**
454 instructions, only ameliorated by the fact that they are all 9-bit XO.
455 Even when adding the Post-Increment-Shifted group it is still only
456 52 9-bit XO instructions, which is not unreasonable to consider (in
457 EXT2xx).
458
459 \newpage{}
460
461 # Vectorization: SVP64 and SVP64Single
462
463 To be submitted as part of [[ls001]], [[ls008]], [[ls009]] and [[ls010]],
464 with SVP64Single to follow in a subsequent RFC, SVP64 is conceptually
identical to the decades-old 8086 `REP` prefix and the Zilog
466 Z80 `CPIR` and `LDIR` instructions. Parallelism is best achieved
467 by exploiting a Multi-Issue Out-of-Order Micro-architecture. It is
468 extremely important to bear in mind that at no time does SVP64 add even
469 one single actual Vector instruction. It is a *pure* RISC-paradigm
470 Prefixing concept only.
471
472 This has some implications which need unpacking. Firstly: in the future,
473 the Prefixing may be applied to VSX. The only reason it was not included
in the initial proposal of SVP64 is that, due to the number of VSX
instructions, the Due Diligence required is obviously five times higher
476 than the 3+ years work done so far on the SFFS Subset.
477
478 Secondly: **any** Scalar instruction involving registers **automatically**
479 becomes a candidate for Vector-Prefixing. This in turn means that when
480 a new instruction is proposed, it becomes a hard requirement to consider
481 not only the implications of its inclusion as a Scalar-only instruction,
482 but how it will best be utilised as a Vectorized instruction **as well**.
483 Extreme examples of this are the Big-Integer 3-in 2-out instructions
484 that use one 64-bit register effectively as a Carry-in and Carry-out. The
485 instructions were designed in a *Scalar* context to be inline-efficient
486 in hardware (use of Operand-Forwarding to reduce the chain down to 2-in
487 1-out), but in a *Vector* context it is extremely straightforward to
488 Micro-code an entire batch onto 128-bit SIMD pipelines, 256-bit SIMD
489 pipelines, and to perform a large internal Forward-Carry-Propagation on
490 for example the Vectorized-Multiply instruction.
491
492 Thirdly: as far as Opcode Allocation is concerned, SVP64 needs to be
493 considered as an independent stand-alone instruction (just like `REP`).
494 In other words, the Suffix **never** gets decoded as a completely
495 different instruction just because of the Prefix. The cost of doing so
496 is simply too high in hardware.
497
498 --------
499
500 # Guidance for evaluation
501
502 Deciding which instructions go into an ISA is extremely complex, costly,
503 and a huge responsibility. In public standards mistakes are irrevocable,
504 and in the case of an ISA the Opcode Allocation is a finite resource,
505 meaning that mistakes punish future instructions as well. This section
506 therefore provides some Evaluation Guidance on the decision process,
507 particularly for people new to ISA development, given that this RFC is
508 circulated widely and publicly. Constructive feedback from experienced
ISA Architects is welcomed to improve this section.
510
511 **Does anyone want it?**
512
513 Sounds like an obvious question but if there is no driving need (no
514 "Stakeholder") then why is the instruction being proposed? If it is
515 purely out of curiosity or part of a Research effort not intended for
516 production then it's probably best left in the EXT022 Sandbox.
517
518 **Common, Frequent, Rare**
519
This question is worth asking not just of an instruction but even of a
register file: is the instruction (or other feature)
intended to be:
523
524 * Common (used all of the time, typically built-in to toolchain)
525 * Frequent (specialised tasks but time or resource critical)
526 * Rare (when you need them you need them)
527
528 A good example would be the addition of 128-bit operations, or even
529 (for Elliptic Curve Cryptography - ec25519) 512-bit ALUs.
530
531 **How many registers does it need?**
532
533 The basic RISC Paradigm is not only to make instruction encoding simple
534 (often "wasting" encoding space compared to highly-compacted ISAs such
535 as x86), but also to keep the number of registers used down to a minimum.
536
Counter-examples include FMAC, which had to be added to IEEE754 because the
538 *internal* product requires more accuracy than can fit into a register
539 (it is well-known that FMUL followed by FADD performs an additional
540 rounding on the intermediate register which loses accuracy compared to
541 FMAC). Another would be a dot-product instruction, which again requires
542 an accumulator of at least double the width of the two vector inputs.
543 And in the AMDGPU ISA, there are Texture-mapping instructions taking up
544 to an astounding *twelve* input operands!
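
The FMAC point can be demonstrated directly: a fused multiply-add rounds
once, whereas separate multiply-then-add rounds twice, and for suitably
chosen inputs the results differ (a small self-contained check, assuming a
libm with a correctly-rounded `fma`):

```c
#include <math.h>
#include <stdio.h>

/* Double-rounding demonstration: a*b is exactly 1 - 2^-104, which rounds
 * to 1.0 as a double, so multiply-then-add gives 0.0, while the fused
 * multiply-add keeps the exact product and gives -2^-104. */
int main(void)
{
    double a = 1.0 + 0x1p-52;
    double b = 1.0 - 0x1p-52;
    double c = -1.0;
    double separate = a * b + c;     /* two roundings */
    double fused    = fma(a, b, c);  /* one rounding  */
    printf("separate = %a\nfused    = %a\n", separate, fused);
    return 0;
}
```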
545
546 The downside of going too far however has to be a trade-off with the
547 next question. Both MIPS and RISC-V lack Condition Codes, which means
548 that emulating x86 Branch-Conditional requires *ten* MIPS instructions.
549
550 The downside of creating too complex instructions is that the Dependency
551 Hazard Management in high-performance multi-issue out-of-order
552 microarchitectures becomes infeasibly large, and even simple in-order
553 systems may have performance severely compromised by an overabundance
554 of stalls. Also worth remembering is that register file ports are
insanely costly: not just to design, but they also use considerable power.
556
That said, there do exist genuine reasons why more registers are better than
fewer: Compare-and-Swap has huge benefits but is costly to implement,
559 and DCT/FFT Twin-Butterfly instructions allow creation of in-place
560 in-register algorithms reducing the number of registers needed and
561 thus saving power due to making the *overall* algorithm more efficient,
562 as opposed to micro-focussing on a localised power increase.
563
564 As a general rule of thumb, though:
565
566 * going beyond 3-in 2-out is an extremely bad idea
567 * 3-in 2-out is extreme borderline (including Condition Codes)
568 * 3-in 1-out needs really good justification
569 * 2-in 1-out (or 2-in 2-out if one is a Condition Code or Status Register)
570 is acceptable
571
572 Remember to include all Register Files (Status Registers,
573 Condition Fields) in the assessment: each register will
574 need its own Hazard Protection, and in an Out-of-Order
575 system that results in significant resource utilisation
576 in silicon.
577
578 **How many register files does it use?**
579
580 Complex instructions pulling in data from multiple register files can
581 create unnecessary issues surrounding Dependency Hazard Management in
582 Out-of-Order systems. As a general rule it is better to keep complex
583 instructions reading and writing to the same register file, relying
584 on much simpler (1-in 1-out) instructions to transfer data between
585 register files. This rule-of-thumb allows the Dependency Matrices
586 to be made sparse or significantly reduced in both row and column entries.
587
**Can other existing instructions (plural) do the same job?**
589
590 The general rule being: if two or more instructions can do the
591 same job, leave it out... *unless* the number of occurrences of
592 that instruction being missing is causing huge increases in binary
593 size. RISC-V has gone too far in this regard, as explained here:
594 <https://news.ycombinator.com/item?id=24459314>
595
Good examples are LD-ST-Indexed-shifted (multiply RB by 2, 4, 8 or 16)
597 which are high-priority instructions in x86 and ARM, but lacking in
598 Power ISA, MIPS, and RISC-V. With many critical hot-loops in Computer
599 Science having to perform shift and add as explicit instructions,
600 adding LD/ST-shifted should be considered high priority, except that
601 the sheer *number* of such instructions needing to be added takes us
into the next question.
603
604 **How costly is the encoding?**
605
606 This can either be a single instruction that is costly (several operands
607 or a few long ones) or it could be a group of simpler ones that purely
due to their number increases overall encoding cost. An example of
extremely costly instructions would be those with their own Primary Opcode:
610 addi is a good candidate. However the sheer overwhelming number of
611 times that instruction is used easily makes a case for its inclusion.
612
613 Mentioned above was Load-Store-Indexed-Shifted, which only needs 2
614 bits to specify how much to shift: x2 x4 x8 or x16. And they are all
615 a 10-bit XO Field, so not that costly for any one given instruction.
616 Unfortunately there are *around 30* Load-Store-Indexed Instructions in the
Power ISA, which means an extra *five* bits of precious XO space taken up
(ceil(log2(30)) = 5). Then let us not forget the two bits needed for the
Shift amount: 5 + 2 = 7 bits, so out of the original 10-bit XO the group
as a whole occupies the equivalent of a *three*-bit XO.
620
621 Is this a worthwhile tradeoff? Honestly it could well be. And that's
622 the decision process that the OpenPOWER ISA Working Group could use some
623 assistance on, to make the evaluation easier.
624
625 **How many gates does it need?**
626
627 `grevlut` comes in at an astonishing 20,000 gates, where for comparison
an FP64 Multiply typically takes between 12,000 and 15,000. Not counting
629 the cost in hardware terms is just asking for trouble.
630
631 If the number of gates gets too large it has an unintended side-effect:
632 power consumption goes up but so does the distance between functions
633 on-chip. A good illustration here is the CDC6600 and Cray Supercomputers
634 where speed was limited by the size of the *room*. In other words larger
635 functions cause communication delays, and communication delays reduce
636 top speed.
637
638 **How long will it take to complete?**
639
640 In the case of divide or Transcendentals the algorithms needed are so
641 complex that simple implementations can often take an astounding 128
clock cycles to complete (Goldschmidt reduces that significantly).
643 Other instructions waiting for the results
644 will back up and eventually stall, where in-order systems pretty much
645 just stall straight away.
646
647 Less extreme examples include instructions that take only a few cycles
to complete. If these are commonly used in tight loops with Conditional Branches, an
649 Out-of-Order system with Speculative capability may need significantly
650 more Reservation Stations to hold in-flight data for *all* instructions when
651 some take longer, so even a single clock cycle reduction
652 could become important.
653
A rule of thumb is that in Hardware, at 4.8 GHz the budget for what is called
655 "gate propagation delay" is only around 16 to 19 gates chained one after
656 the other. Anything beyond that budget will need to be stored in DFFs
657 (Flip-flops) and another set of 16-19 gates continues on the next clock
658 cycle. Thus for example with `grevlut` above it is almost certainly the
659 case that high-performance high-clock-rate systems would need at least
660 two clock cycles (two pipeline stages) to produce a valid result.
661 This in turn brings us to the next question as it is common to consider
662 subdividing complex instructions into smaller parts.
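
As a rough worked figure (the per-gate delay is purely illustrative):
1 / 4.8 GHz is approximately 208 ps per clock cycle, and spreading 208 ps
across 16 to 19 gates gives roughly 11 to 13 ps per gate, including local
wiring.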
663
664 **Can one instruction do the job of many?**
665
Large numbers of disparate instructions adversely affect resource
667 utilisation in In-Order systems. However it is not always that simple:
668 every one of the Power ISA "add" and "subtract" instructions, as shown by
669 the Microwatt source code, may be micro-coded as one single instruction
670 where RA may optionally be inverted, output likewise, and Carry-In set to
671 1, 0 or XER.CA. From these options the *entire* suite of add/subtract
may be synthesised (subtraction is done by inverting RA and adding an extra 1,
producing the 2s-complement of RA).
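
A minimal sketch of that single micro-coded unit (names are illustrative,
not Microwatt's actual signal names):

```c
#include <stdint.h>

/* Illustrative sketch of the single add/subtract unit described above:
 * an optional invert of RA plus a selectable carry-in (0, 1 or XER.CA)
 * covers the whole add/subtract suite.  For example subf RT,RA,RB is
 * ~RA + RB + 1 = RB - RA. */
static uint64_t addsub_unit(uint64_t ra, uint64_t rb,
                            int invert_ra, unsigned carry_in /* 0, 1 or XER.CA */)
{
    uint64_t a = invert_ra ? ~ra : ra;
    return a + rb + (uint64_t)carry_in;
}
```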
674
675 `bmask` for example is to be proposed as a single instruction with
676 a 5-bit "Mode" operand, greatly simplifying some micro-architectural
677 implementations. Likewise the FP-INT conversion instructions are grouped
678 as a set of four, instead of over 30 separate instructions. Aside from
anything else, this strategy makes the ISA Working Group's evaluation task
680 easier, as well as reducing the work of writing a Compliance Test Suite.
681
682 In the case of the MIPS 3D ASE Extension, a Reciprocal-Square-Root
683 instruction was proposed that was split into two halves: 12-14 bit
684 accuracy completing in 7 cycles and "Carry On And Get Better Accuracy"
685 for the second instruction! With 3D only needing reduced accuracy
686 the saving in power consumption and time was definitely worthwhile,
687 and it neatly illustrates a counter-example to trying to make one
688 instruction do too much.
689
690 Another good example is the Integer Twin-butterfly instructions,
691 `((a +/- b) * c) >> sh` which require **eight** instructions and
692 temporary registers. Although expensive they save so many other
693 instructions - and registers - that it is hard to disregard them
694 even if their internal implementation is Micro-coded.
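
Written out with plain scalar operations the pair looks like this (a
sketch of the arithmetic only, not the proposed instruction's exact
operand layout):

```c
#include <stdint.h>

/* The integer twin-butterfly pair ((a +/- b) * c) >> sh, computed with
 * ordinary scalar operations.  Done this way it needs temporaries plus
 * separate adds, subtracts, multiplies and shifts - roughly the eight
 * instructions mentioned above - where the fused form produces both
 * results in-place. */
static void int_butterfly_sketch(int64_t *a, int64_t *b, int64_t c, unsigned sh)
{
    int64_t t1 = ((*a + *b) * c) >> sh;
    int64_t t2 = ((*a - *b) * c) >> sh;
    *a = t1;
    *b = t2;
}
```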
695
**Is it general-purpose, or does it have a compelling use-case?**
697
698 The more specialised the instruction the less power used but the less
699 opportunity it has for being used elsewhere. Good examples of bad
instructions are illustrated by an MSc thesis proposing a chacha20 SIMD add-xor-rotate-by-7
instruction, when chacha20 has nowhere near the decades-established use of Rijndael
702 (AES) or SHA. Although the instruction halved the number of inline-unrolled
703 instructions in chacha20 it is clearly so specific as to be useless for any other purpose.
704
705 Good examples of good specialist instructions are the
706 AES and SHA round-acceleration instructions in VSX, because these algorithms
707 are so heavily used that nearly all ISAs have them.
708
709 Perhaps this point should be placed first but it is a different angle on
710 the cost-benefit analysis that starts with "Does anyone want it": that
711 alone is not quite enough, because although a given Stakeholder might want
712 a particular instruction to accelerate *their* application, the expression
713 of need is only where the evaluation process *begins*.
714
715 **Summary**
716
There are many tradeoffs here; it is a huge list of considerations. If any
others are known about, please do submit feedback so that they may be included
here. Then the evaluation process may take place: again, constructive
720 feedback on that as to which instructions are a priority also appreciated.
721 The above helps explain the columns in the tables that follow.
722
723 \newpage{}
724
725 # Tables
726
The original tables are available publicly as a CSV file at
728 <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls012/optable.csv;hb=HEAD>.
729 A python program auto-generates the tables in the following sections by
730 sorting into different useful priorities.
731
732 The key to headings and sections are as follows:
733
734 * **Area** - Target Area as described in above sections
* **XO Cost** - the number of bits required in the XO Field. Whilst not
736 the full picture it is a good indicator as to how costly in terms
737 of Opcode Allocation a given instruction will be. Lower number is
738 a higher cost for the Power ISA's precious remaining Opcode space.
739 "PO" indicates that an entire Primary Opcode is required.
* **rfc** - the Libre-SOC External RFC resource,
741 <https://libre-soc.org/openpower/sv/rfc/> where advance notice of
742 upcoming RFCs in development may be found.
743 *Reading advance Draft RFCs and providing feedback strongly advised*,
744 it saves time and effort for the OPF ISA Workgroup.
745 * **SVP64** - Vectorizeable (SVP64-Prefixable) - also implies that
SVP64Single is permitted (required).
747 * **page** - Libre-SOC wiki page at which further information can
748 be found. Again: **advance reading strongly advised due to the
749 sheer volume of information**.
750 * **PO1** - the instruction is capable of being PO1-Prefixed
751 (given an EXT1xx Opcode Allocation). Bear in mind that this option
752 is **mutually exclusively incompatible** with Vectorization.
753 * **group** - the Primary Opcode Group recommended for this instruction.
754 Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx. A third area
755 (Unvectorizeable),
756 EXT3xx, was available in an early Draft RFC but has been made "RESERVED"
instead. See [[sv/po9_encoding]].
758 * **Level** - Compliancy Subset and Simple-V Level. `SFFS` indicates "mandatory"
in SFFS. All else is "optional"; however some instructions are further Subsetted
760 within Simple-V: SV/Embedded, SV/DSP and SV/Supercomputing.
761 * **regs** - a guide to register usage, to how costly Hazard Management
762 will be, in hardware:
763
764 ```
765 - 1R: reads one GPR/FPR/SPR/CR.
766 - 1W: writes one GPR/FPR/SPR/CR.
767 - 1r: reads one CR *Field* (not necessarily the entire CR)
768 - 1w: writes one CR *Field* (not necessarily the entire CR)
769 ```
770
771 [[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
772 [[!inline pages="openpower/sv/rfc/ls012/xo_cost.mdwn" raw=yes ]]
773 [[!inline pages="openpower/sv/rfc/ls012/level.mdwn" raw=yes ]]
774
775 [[!tag opf_rfc]]