1 # External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops
2
3 **Date: 2023apr10. v2 released: TODO**
4
5 * Funded by NLnet Grants under EU Horizon Grants 101069594 825310
6 * <https://git.openpower.foundation/isa/PowerISA/issues/121>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=1052>
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=1054>
10
11 The purpose of this RFC is:
12
13 * to give a full list of upcoming Scalar opcodes developed by Libre-SOC
14 (being cognisant that *all* of them are Vectoriseable)
15 * to give OPF Members and non-Members alike the opportunity to comment and get
16 involved early in RFC submission
17 * formally agree a priority order on an iterative basis with new versions
18 of this RFC,
* which ones should go in the EXT022 Sandbox, which in EXT0xx, which in EXT2xx,
and which should not be proposed at all,
21 * keep readers summarily informed of ongoing RFC submissions, with new versions
22 of this RFC,
23 * for IBM (in their capacity as Allocator of Opcodes)
24 to get a clear advance picture of Opcode Allocation
25 *prior* to submission
26
27 As this is a Formal ISA RFC the evaluation shall ultimately define
28 (in advance of the actual submission of the instructions themselves)
29 which instructions will be submitted over the next 1-18 months.
30
31 *It is expected that readers visit and interact with the Libre-SOC
32 resources in order to do due-diligence on the prioritisation
33 evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs
34 that may turn out not to be useful, against a background of having
35 no guiding overview or pre-filtering, and everybody's precious time
36 is wasted. Also note that the Libre-SOC Team, being funded by NLnet
37 under Privacy and Enhanced Trust Grants, are **prohibited** from signing
38 Commercial-Confidentiality NDAs, as doing so is a direct conflict of
39 interest with their funding body's Charitable Foundation Status and remit,
40 and therefore the **entire** set of almost 150 new SFFS instructions
41 can only go via the External RFC Process. Also be advised and aware
42 that "Libre-SOC" != "RED Semiconductor Ltd". The two are completely
43 **separate** organisations*.
44
Worth bearing in mind during evaluation is that every "Defined Word" may
or may not be Vectoriseable, but that every "Defined Word" should have
47 merits on its own, not just when Vectorised. An example of a borderline
48 Vectoriseable Defined Word is `mv.swizzle` which only really becomes
49 high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
50 less merit as a Scalar-only operation, yet when SVP64Single-Prefixed
51 can be part of an atomic Compare-and-Swap sequence.
52
53 Although one of the top world-class ISAs, Power ISA Scalar (SFFS) has
54 not been significantly advanced in 12 years: IBM's primary focus has
understandably been on PackedSIMD VSX. Unfortunately, with VSX being
914 instructions and 128-bit wide, it is far too much for any new team to
consider (10+ years of development effort) and far outside of Embedded or
Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar up-to-date
to modern standards *and on its own merits* is a reasonable goal, and
the advantage of the reduced focus is that SFFS remains RISC-paradigm,
with lessons learned from other ISAs in the intervening years.
62 Good examples here include `bmask`.
63
64 SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
65 as well as "True-Scalable-Vector Prefixing" - also literally brings new
66 dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
67 it has to unavoidably and simultaneously be taken into consideration
68 their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed.
69
70 **Target areas**
71
Whilst entirely general-purpose, there are some categories that these
instructions are targeting: Bit-manipulation, Big-integer, Cryptography,
Audio/Visual, High-Performance Compute, GPU workloads and DSP.
75
76 **Instruction count guide and approximate priority order**
77
78 * 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
79 * 5 - CR weirds [[sv/cr_int_predication]]
80 * 4 - INT<->FP mv [[ls006]]
81 * 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
82 * ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
83 * 19 - GPR LD/ST-Shifted-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
84 * ~12 - FPR LD/ST-Shifted-PostIncrement-Update (ditto) [[ls011]]
85 * 26 - GPR LD/ST-Shifted (again saves hugely in hot-loops) [[ls004]]
86 * 11 - FPR LD/ST-Shifted (ditto) [[ls004]]
87 * 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
88 * 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
89 * 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
90 * 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
91 * 5 - Audio-Video [[sv/av_opcodes]]
92 * 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish) [[ls004]]
93 * 2 - BMI group [[sv/vector_ops]]
94 * 2 - GPU swizzle [[sv/mv.swizzle]]
95 * 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly <https://bugs.libre-soc.org/show_bug.cgi?id=1028>
97 * 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
98 * 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
99 * 25 - Transcendentals (2-arg) [[openpower/transcendentals]]
100
Summary tables, sorted by different categories, are provided below. Additional
columns (and tables) can be requested as necessary, to be added as part
of update revisions to this RFC.
104
105 \newpage{}
106
107 # Target Area summaries
108
109 Please note that there are some instructions developed thanks to
110 NLnet funding that have not been included here for assessment. Examples
include `pcdec` and the Galois Field arithmetic operations. From a purely
practical perspective, due to the sheer quantity, the lower-priority instructions
were simply left out. However, they remain in the Libre-SOC resources.
114
115 Some of these SFFS instructions appear to be duplicates of VSX.
A frequent argument is that if instructions are already in VSX then they
should not be added to SFFS, especially if they are nominally the same.
The point that this effectively damages the performance of an SFFS-only
implementation was raised earlier; however, there is a more subtle reason
why the instructions are needed.
121
122 Future versions of SVP64 and SVP64Single are expected to be developed
123 by future Power ISA Stakeholders on top of VSX. The decisions made
124 there about the meaning of Prefixed Vectorised VSX may be *completely
125 different* from those made for Prefixed SFFS instructions. At which
126 point the lack of SFFS equivalents would penalise SFFS implementors in a
127 much more severe way, effectively expecting them and SFFS programmers to
128 work with a non-orthogonal paradigm, to their detriment. The solution
129 is to give the SFFS Subset the space and respect that it deserves and
130 allow it to be stand-alone on its own merits.
131
132 ## SVP64 Management instructions
133
134 These without question have to go in EXT0xx. Future extended variants,
135 bringing even more powerful capabilities, can be followed up later with
136 EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
*Only `svstep` is actually Vectoriseable*; all other Management
instructions are UnVectoriseable. PO1-Prefixed examples include
adding `psvshape` in order to support both Inner and Outer Product Matrix
Schedules, by providing the option to directly reverse the order of the
triple loops. Outer is used for standard Matrix Multiply (on top of a
standard MAC or FMAC instruction), but Inner is required for Warshall
Transitive Closure (on top of a cumulatively-applied max instruction).
144
145 The Management Instructions themselves are all Scalar Operations, so
PO1-Prefixing is perfectly reasonable. There are only six SVP64 Management
instructions, all 5- or 6-bit XO, meaning that the opcode
space they take up in EXT0xx is not alarmingly high given their intrinsic
strategic value.
150
151 ## Transcendentals
152
153 Found at [[openpower/transcendentals]] these subdivide into high
154 priority for accelerating general-purpose and High-Performance Compute,
155 specialist 3D GPU operations suited to 3D visualisation, and low-priority
156 less common instructions where IEEE754 full bit-accuracy is paramount.
157 In 3D GPU scenarios for example even 12-bit accuracy can be overkill,
158 but for HPC Scientific scenarios 12-bit would be disastrous.
159
160 There are a **lot** of operations here, and they also bring Power
161 ISA up-to-date to IEEE754-2019. Fortunately the number of critical
162 instructions is quite low, but the caveat is that if those operations
163 are utilised to synthesise other IEEE754 operations (divide by `pi` for
164 example) full bit-level accuracy (a hard requirement for IEEE754) is lost.
165
Also worth noting is that the Khronos Group defines minimum acceptable
bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
accuracy demanded by IEEE754. The reason for the Khronos definitions is
a massive (often four-fold) reduction in power consumption and gate count,
because 3D Graphics simply has no need for full accuracy.
171
172 *For 3D GPU markets this definitely needs addressing*
173
174 ## Audio/Video
175
176 Found at [[sv/av_opcodes]] these do not require Saturated variants
because Saturation is added via [[sv/svp64]] (Vector Prefixing) and
via [[sv/svp64_single]] (Scalar Prefixing). This is important to note for
179 Opcode Allocation because placing these operations in the UnVectoriseable
180 areas would irredeemably damage their value. Unlike PackedSIMD ISAs
181 the actual number of AV Opcodes is remarkably small once the usual
182 cascading-option-multipliers (SIMD width, bitwidth, saturation,
183 HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
184 absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".
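
As an illustration of how small these primitives are, the following Python
sketch (purely illustrative, not the Draft pseudocode in [[sv/av_opcodes]])
shows two of them; when SVP64-Prefixed the same scalar operations loop over
whole vectors of samples, with saturation supplied by the Prefix rather than
by the opcode.

```
def abs_diff_accumulate(acc, a, b):
    # the inner operation of Sum-of-Absolute-Differences (motion estimation)
    return acc + abs(a - b)

def average_add(a, b):
    # one common definition of a rounding average-add
    return (a + b + 1) >> 1

print(abs_diff_accumulate(0, 120, 97), average_add(3, 6))   # 23 5
```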
185
186 ## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV
187
The number of uses in Computer Science for DCT, NTT, FFT and DFT
is astonishing. The Wikipedia page lists over a hundred separate and
distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon
Error Correction; the list goes on and on. ARM has special dedicated
192 Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT
193 Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
194 DSP can do full FFT triple loops in one VLIW group.
195
196 It should be pretty clear this is high priority.
197
198 With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
199 the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
operations, typically performing, for example, one multiply and then, in-place,
subtracting that product from one operand and adding it to the other.
202 The *in-place* aspect is strategically extremely important for significant
203 reductions in Vectorised register usage, particularly for DCT.
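
A minimal Python sketch of the radix-2 butterfly concept (illustrative only;
the actual proposed operations and their exact pseudocode are in the Draft
RFCs): one multiply, with the product added to one operand and subtracted
from the other, both results overwriting the inputs.

```
def twin_butterfly(a, b, w):
    # one multiply ...
    t = b * w
    # ... and the two in-place results: no scratch registers needed
    return a + t, a - t

# with SVP64 REMAP supplying the loop schedule, this single operation
# is all the inner loop of an FFT/DCT requires
a, b = twin_butterfly(1.0, 2.0, 0.5)
print(a, b)    # 2.0 0.0
```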
204
205 ## CR Weird group
206
207 Outlined in [[sv/cr_int_predication]] these instructions massively save
on CR-Field instruction count. Multi-bit to single-bit conversions and
vice-versa, normally requiring several CR-ops (`crand`, `crxor`), are done
in one single instruction. The reason for their addition is down to SVP64 overloading
211 CR Fields as Vector Predicate Masks. Reducing instruction count in
212 hot-loops is considered high priority.
213
214 An additional need is to do popcount on CR Field bit vectors but adding
215 such instructions to the *Condition Register* side was deemed to be far
216 too much. Therefore, priority was given instead to transferring several
217 CR Field bits into GPRs, whereupon the full set of Standard Scalar GPR
218 Logical Operations may be used. This strategy has the side-effect of
219 keeping the CRweird group down to only five instructions.
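
A sketch of that strategy in Python, with illustrative bit numbering (the
actual instructions and their encodings are in [[sv/cr_int_predication]]):
gather one bit from each CR Field into a GPR, then reuse ordinary GPR
operations such as popcount.

```
def crfields_to_gpr(cr_fields, bit=2):
    # CR Field bits in MSB0 order: LT=0, GT=1, EQ=2, SO=3 (illustrative)
    gpr = 0
    for i, field in enumerate(cr_fields):     # each field is a 4-bit value
        gpr |= ((field >> (3 - bit)) & 1) << i
    return gpr

mask = crfields_to_gpr([0b0010, 0b1000, 0b0010, 0b0001])   # gather the EQ bits
print(bin(mask), bin(mask).count("1"))        # 0b101 -> popcount 2
```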
220
221 ## Big-integer Math
222
223 [[sv/biginteger]] has always been a high priority area for commercial
224 applications, privacy, Banking, as well as HPC Numerical Accuracy:
225 libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
226 and ec25519 are finding their way into everyday use via OpenSSL.
227
228 A very early variant of the Power ISA had a 32-bit Carry-in Carry-out
229 SPR. Its removal from subsequent revisions is regrettable. An alternative
230 concept is to add six explicit 3-in 2-out operations that, on close
231 inspection, always turn out to be supersets of *existing Scalar
232 operations* that discard upper or lower DWords, or parts thereof.
233
*Thus it is critical to note that not a single one of these operations
expands the bitwidth of any existing Scalar pipelines*.
236
237 The `dsld` instruction for example merely places additional LSBs into the
238 64-bit shift (64-bit carry-in), and then places the (normally discarded)
239 MSBs into the second output register (64-bit carry-out). It does **not**
240 require a 128-bit shifter to replace the existing Scalar Power ISA
241 64-bit shifters.
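
The following Python model is illustrative only (the Draft pseudocode in
[[sv/biginteger]] differs in detail) but shows the principle: the vacated
low bits come from a 64-bit carry-in, the shifted-out high bits become a
64-bit carry-out, and chaining word-by-word performs an arbitrary-length
shift using nothing wider than the existing 64-bit shifter.

```
MASK64 = (1 << 64) - 1

def dsld_model(ra, carry_in, n):
    result = ((ra << n) & MASK64) | carry_in    # carry-in fills the vacated LSBs
    carry_out = ra >> (64 - n) if n else 0      # the normally-discarded MSBs
    return result, carry_out

def bigshift_left(words, n):
    # words is a little-endian list of 64-bit limbs; with SVP64 this whole
    # loop collapses into a single Vector-Prefixed instruction
    carry, out = 0, []
    for w in words:
        r, carry = dsld_model(w, carry, n)
        out.append(r)
    return out

print([hex(w) for w in bigshift_left([0xFFFFFFFFFFFFFFFF, 0x1], 4)])
# ['0xfffffffffffffff0', '0x1f']
```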
242
243 The reduction in instruction count these operations bring, in critical
244 hot loops, is remarkably high, to the extent where a Scalar-to-Vector
245 operation of *arbitrary length* becomes just the one Vector-Prefixed
246 instruction.
247
248 Whilst these are 5-6 bit XO their utility is considered high strategic
249 value and as such are strongly advocated to be in EXT04. The alternative
250 is to bring back a 64-bit Carry SPR but how it is retrospectively
251 applicable to pre-existing Scalar Power ISA multiply, divide, and shift
252 operations at this late stage of maturity of the Power ISA is an entire
253 area of research on its own deemed unlikely to be achievable.
254
255 ## fclass and GPR-FPR moves
256
257 [[sv/fclass]] - just one instruction. With SFFS being locked down to
258 exclude VSX, and there being no desire within the nascent OpenPOWER
259 ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it
260 becomes necessary to upgrade SFFS such that it is stand-alone capable. One
261 omission based on the assumption that VSX would always be present is an
262 equivalent to `xvtstdcsp`.
263
264 Similar arguments apply to the GPR-INT move operations, proposed in
265 [[ls006]], with the opportunity taken to add rounding modes present
266 in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
267 rounding, one of the worst offenders of Computer Science, requires a
268 phenomenal 35 instructions with *six branches* to emulate in Power
269 ISA! For desktop as well as Server HTML/JS back-end execution of
270 javascript this becomes an obvious priority, recognised already by ARM
271 as just one example.
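
For readers unfamiliar with why this is so expensive: "Javascript rounding"
here refers to ECMAScript `ToInt32` semantics (the same conversion that ARM
added `FJCVTZS` for). A small Python model of what the single proposed
instruction must do:

```
import math

def js_to_int32(x):
    # ECMAScript ToInt32: truncate toward zero, wrap modulo 2^32,
    # map NaN/Infinity to zero, interpret the result as signed 32-bit
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) % 2**32
    return n - 2**32 if n >= 2**31 else n

assert js_to_int32(4294967296.5) == 0
assert js_to_int32(-2147483649.0) == 2147483647
assert js_to_int32(float("nan")) == 0
```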
272
273 ## Bitmanip LUT2/3
274
These LUT2/3 operations are high-cost but high-reward. Outlined in
276 [[sv/bitmanip]], the simplest ones already exist in PackedSIMD VSX:
277 `xxeval`. The same reasoning applies as to fclass: SFFS needs to be
278 stand-alone on its own merits and should an implementor choose not to
279 implement any aspect of PackedSIMD VSX the performance of their product
280 should not be penalised for making that decision.
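
To clarify what a ternary (LUT3) operation is, here is a bit-serial Python
model (LSB0 numbering for brevity; the Draft instructions follow Power ISA
conventions): every bit position uses the three input bits to index an
8-bit immediate truth table.

```
def lut3(imm8, ra, rb, rc, width=64):
    result = 0
    for i in range(width):
        # three input bits form a 3-bit index into the truth table
        idx = (((ra >> i) & 1) << 2) | (((rb >> i) & 1) << 1) | ((rc >> i) & 1)
        result |= ((imm8 >> idx) & 1) << i
    return result

# imm8 = 0b11101000 implements the majority function (carry of a full adder)
assert lut3(0b11101000, 0b1100, 0b1010, 0b0110) == 0b1110
```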
281
282 With Predication being such a high priority in GPUs and HPC, CR Field
283 variants of Ternary and Binary LUT instructions were considered high
284 priority, and again just like in the CRweird group the opportunity was
285 taken to work on *all* bits of a CR Field rather than just one bit as
286 is done with the existing CR operations crand, cror etc.
287
288 The other high strategic value instruction is `grevlut` (and `grevluti`
289 which can generate a remarkably large number of regular-patterned magic
constants). The grevlut set requires of the order of 20,000 gates but
provides an astonishing plethora of innovative bit-permuting instructions
never seen in any other ISA.
293
The downside of all of these instructions is their extremely low XO bit
requirement - 2-3 bit XO - due to the large immediates *and* the number of
operands required, which makes them expensive in Opcode space. The LUT3
instructions are already compacted down to "Overwrite" variants. (By contrast
the Float-Load-Immediate instructions can afford a much larger XO because,
despite having a 16-bit immediate, only one Register Operand is needed.)
300
301 Realistically these high-value instructions should be proposed in EXT2xx
302 where their XO cost does not overwhelm EXT0xx.
303
304
305 ## (f)mv.swizzle
306
307 [[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value
308 as a Scalar instruction is limited *except* if combined with `cmpi` and
309 SVP64Single Predication, whereupon the end result is the RISC-synthesis
310 of Compare-and-Swap, in two instructions.
311
312 Where this instruction comes into its full value is when Vectorised.
3D GPU and HPC numerical workloads astonishingly contain between 10 and 15%
swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
315 balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
316 Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded
317 GPU ISAs have a staggering 24-bits dedicated to 2-operand Swizzle.
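
A Python sketch of the swizzle concept itself (illustrative of the semantics
only, not of the proposed encoding): each destination lane freely selects
any source lane, with repetition permitted.

```
def swizzle(src, selector):
    lane = {"X": 0, "Y": 1, "Z": 2, "W": 3}
    return [src[lane[s]] for s in selector]

xyzw = [1.0, 2.0, 3.0, 4.0]
print(swizzle(xyzw, "YYZ"))    # [2.0, 2.0, 3.0]
print(swizzle(xyzw, "WZYX"))   # full reversal, common in ARGB<->BGRA pixel handling
```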
318
319 So as not to radicalise the Power ISA the Libre-SOC team decided to
320 introduce mv Swizzle operations, which can always be Macro-op fused
321 in exactly the same way that ARM SVE predicated-move extends 3-operand
322 "overwrite" opcodes to full independent 3-in 1-out.
323
## BMI (bit-manipulation) group
325
326 Whilst the [[sv/vector_ops]] instructions are only two in number, in
327 reality the `bmask` instruction has a Mode field allowing it to cover
328 **24** instructions, more than have been added to any other CPUs by
329 ARM, Intel or AMD. Analysis of the BMI sets of these CPUs shows simple
330 patterns that can greatly simplify both Decode and implementation. These
331 are sufficiently commonly used, saving instruction count regularly,
332 that they justify going into EXT0xx.
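
For context, these are the kinds of single-instruction mask operations
(shown here with their x86 BMI names) that individual settings of the
`bmask` Mode field cover; the actual Mode encodings are documented in
[[sv/bitmanip]].

```
def blsi(x):   return x & -x        # isolate lowest set bit    (x86 BLSI)
def blsmsk(x): return x ^ (x - 1)   # mask up to lowest set bit (x86 BLSMSK)
def blsr(x):   return x & (x - 1)   # clear lowest set bit      (x86 BLSR)

x = 0b1011000
print(bin(blsi(x)), bin(blsmsk(x)), bin(blsr(x)))
# 0b1000 0b1111 0b1010000
```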
333
The other instruction is `cprop` - Carry-Propagation - which takes
the P and Q from carry-propagation algorithms and generates carry
look-ahead. It greatly increases the efficiency of arbitrary-precision
integer arithmetic by combining what would otherwise be half a dozen
instructions into one. However it is still not a huge priority, unlike
`bmask`, so is probably best placed in EXT2xx.
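
As background: given per-bit "propagate" and "generate" vectors (the P and Q
referred to above), the classical carry-lookahead recurrence is
`c[i+1] = g[i] OR (p[i] AND c[i])`. The Python below is a plain bit-serial
model of that recurrence; the exact `cprop` semantics are defined in
[[sv/vector_ops]].

```
def carry_lookahead(g, p, cin=0, width=64):
    # g = "generate" bits, p = "propagate" bits;
    # bit i+1 of the result is the carry out of bit position i
    c = cin & 1
    for i in range(width):
        ci = (c >> i) & 1
        c |= (((g >> i) & 1) | (((p >> i) & 1) & ci)) << (i + 1)
    return c

a, b = 0b1011, 0b0110
g, p = a & b, a ^ b
c = carry_lookahead(g, p, width=8)
print(bin((p ^ c) & 0xFF))    # 0b10001 == a + b
```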
340
341 ## Float-Load-Immediate
342
Very easily justified. As explained in [[ls002]] these always save one
344 LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
345 FP value being in the I-Cache side. It is such a high priority that
346 these instructions are easily justifiable adding into EXT0xx, despite
347 requiring a 16-bit immediate. By designing the second-half instruction
348 as a Read-Modify-Write it saves on XO bit-length (only 5 bits), and
349 can be macro-op fused with its first-half to store a full IEEE754 FP32
350 immediate into a register.
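
A sketch of the two-half idea in Python, with hypothetical helper names (the
actual instruction names, immediate split and pseudocode are specified in
[[ls002]]): the first half writes the upper 16 bits of an FP32 pattern, the
second half is the Read-Modify-Write that merges in the lower 16 bits, and
macro-op fusion makes the pair behave as one FP32-immediate load.

```
import struct

def fp32_bits_to_float(bits32):
    return struct.unpack(">f", bits32.to_bytes(4, "big"))[0]

def first_half(hi16):
    # writes the upper 16 bits of the FP32 pattern into the FPR
    return fp32_bits_to_float(hi16 << 16)

def second_half(frt, lo16):
    # Read-Modify-Write: merge the lower 16 bits into the existing value
    bits = struct.unpack(">I", struct.pack(">f", frt))[0]
    return fp32_bits_to_float(bits | lo16)

frt = second_half(first_half(0x4049), 0x0FDB)
print(frt)    # 3.1415927410125732 (FP32 pi, held in the FPR as FP64)
```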
351
352 There is little point in putting these instructions into EXT2xx. Their
353 very benefit and inherent value *is* as 32-bit instructions, not 64-bit
354 ones. Likewise there is less value in taking up EXT1xx Encoding space
355 because EXT1xx only brings an additional 16 bits (approx) to the table,
356 and that is provided already by the second-half instruction.
357
358 Thus they qualify as both high priority and also EXT0xx candidates.
359
360 ## FPR/GPR LD/ST-PostIncrement-Update
361
These instructions, outlined in [[ls011]], save hugely in hot-loops.
Early ISAs such as the PDP-8 and PDP-11 - which inspired the iconic Motorola
68000 and 88100 as well as Mitch Alsup's MyISA 66000, and whose lineage can
even be traced back to the iconic ultra-RISC CDC 6600 - all had both pre-
and post-increment Addressing Modes.
367
The reason is very simple: it is a direct recognition of the common practice
in C of using both `*p++` and `*++p`, which itself stems from the needs
of Computer Science algorithms.
371
372 The problem for the Power ISA is - was - that the opcode space needed
373 to support both was far too great, and the decision was made to go with
374 pre-increment, on the basis that outside the loop a "pre-subtraction"
375 may be performed.
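
A Python sketch contrasting the two Addressing Modes for a load (a model of
the behaviour only, not of the Draft pseudocode in [[ls011]]): the existing
"update" form is effectively pre-increment, whereas the proposed form
advances the address *after* the access, exactly like C's `*p++`.

```
def ld_update(regs, mem, rt, ra, imm):        # existing ldu-style behaviour
    ea = regs[ra] + imm                       # EA computed first (pre-increment)
    regs[rt] = mem[ea]
    regs[ra] = ea

def ld_postinc(regs, mem, rt, ra, imm):       # proposed post-increment behaviour
    ea = regs[ra]                             # use the address as-is...
    regs[rt] = mem[ea]
    regs[ra] = ea + imm                       # ...then advance it

regs, mem = {1: 0x1000, 3: 0}, {0x1000: 7, 0x1008: 9}
ld_postinc(regs, mem, rt=3, ra=1, imm=8)
print(regs[3], hex(regs[1]))   # 7 0x1008  (value loaded, pointer already advanced)
```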
376
377 Whilst this is a "solution" it is less than ideal, and the opportunity
378 exists now with the EXT2xx Primary Opcodes to correct this and bring
379 Power ISA up a level.
380
Where things begin to get more than a little hairy is if both
Post-Increment *and* Shifted are included. If SVP64 keeps one
single bit (/pi) dedicated in the `RM.Mode` field then this
problem goes away, at the cost of reducing SVP64's effectiveness,
but at least a stunning **24** Primary Opcodes (there are only
32 in EXT2xx) would not disappear overnight.
Mostly the Post-Increment-and-Shifted set is included to illustrate
the options and to have a formal record of the evaluation, for Due Diligence.
389
390 ## Shift-and-add (and LD/ST Indexed-Shift)
391
392 Shift-and-Add are proposed in [[ls004]]. They mitigate the need to add
393 LD-ST-Shift instructions which are a high-priority aspect of both x86
394 and ARM. LD-ST-Shift is normally just the one instruction: Shift-and-add
395 brings that down to two, where Power ISA presently requires three.
396 Cryptography e.g. twofish also makes use of Integer double-and-add,
397 so the value of these instructions is not limited to Effective Address
398 computation. They will also have value in Audio DSP.
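
A minimal sketch of the instruction-count argument (addresses only; the
Draft pseudocode is in [[ls004]]): the effective-address computation that
currently needs a separate shift and then an add collapses into one
shift-and-add, leaving just that plus the load itself.

```
def ea_today(ra, rb, scale_log2):
    tmp = rb << scale_log2          # sldi  (instruction 1)
    return ra + tmp                 # add   (instruction 2) ... then ldx (3 total)

def ea_shadd(ra, rb, scale_log2):
    return ra + (rb << scale_log2)  # one shift-and-add     ... then ldx (2 total)

print(ea_today(0x1000, 5, 3), ea_shadd(0x1000, 5, 3))   # 4136 4136
```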
399
400 Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
401 when their whole purpose and value is to reduce binary size in Address
402 offset computation, thus they are best placed in EXT0xx.
403
Also included, because it is important to see the quantity of instructions,
is LD/ST-Indexed-Shifted. Across Update variants, Byte-reverse variants,
Arithmetic and FP, the total is a slightly-eye-watering **37**
instructions, only ameliorated by the fact that they are all 9-bit XO.
When it comes to Shifted-Postincrement the number of Primary Opcodes
needed in EXT2xx comes to 24, which is most of them.
410 The upside as far as adding them is concerned is that existing hardware
411 will already have amalgamated pipelines with very few actual back-end
412 (Micro-Coded) internal operations (likely just two: one load, one store).
413 Passing a 2-bit additional immediate field down to those pipelines really
414 is not hard.
415
416 *(Readers unfamiliar with Micro-coding should look at the Microwatt VHDL
417 source code)*
418
419 When it comes to LD/ST-Shifted-Postincrement the sheer number particularly
420 Primary Opcodes needed in EXT2xx makes for a compelling case to prioritise
421 Shift-and-Add.
422
423 \newpage{}
424
425 # Vectorisation: SVP64 and SVP64Single
426
427 To be submitted as part of [[ls001]], [[ls008]], [[ls009]] and [[ls010]],
428 with SVP64Single to follow in a subsequent RFC, SVP64 is conceptually
identical to the decades-old x86 `REP` prefix and the Zilog
Z80 `CPIR` and `LDIR` instructions. Parallelism is best achieved
431 by exploiting a Multi-Issue Out-of-Order Micro-architecture. It is
432 extremely important to bear in mind that at no time does SVP64 add even
433 one single actual Vector instruction. It is a *pure* RISC-paradigm
434 Prefixing concept only.
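
A deliberately over-simplified Python model of the point being made (it
ignores element width, predication, REMAP and much else): the Prefix is
nothing more than a hardware for-loop wrapped around one *existing* scalar
operation.

```
def svp64(scalar_op, regs, RT, RA, RB, VL):
    # the Zero-Overhead Loop: no new Vector opcode, the scalar op is unchanged
    for i in range(VL):
        regs[RT + i] = scalar_op(regs[RA + i], regs[RB + i])

regs = list(range(32)) + [0] * 96             # toy 128-entry register file
svp64(lambda a, b: a + b, regs, 64, 0, 8, VL=4)   # "sv.add" over 4 elements
print(regs[64:68])                            # [8, 10, 12, 14]
```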
435
436 This has some implications which need unpacking. Firstly: in the future,
the Prefixing may be applied to VSX. The only reason it was not included
in the initial proposal of SVP64 is that, due to the sheer number of VSX
instructions, the Due Diligence required is roughly five times greater
than the 3+ years of work done so far on the SFFS Subset.
441
442 Secondly: **any** Scalar instruction involving registers **automatically**
443 becomes a candidate for Vector-Prefixing. This in turn means that when
444 a new instruction is proposed, it becomes a hard requirement to consider
445 not only the implications of its inclusion as a Scalar-only instruction,
446 but how it will best be utilised as a Vectorised instruction **as well**.
447 Extreme examples of this are the Big-Integer 3-in 2-out instructions
448 that use one 64-bit register effectively as a Carry-in and Carry-out. The
449 instructions were designed in a *Scalar* context to be inline-efficient
450 in hardware (use of Operand-Forwarding to reduce the chain down to 2-in
451 1-out), but in a *Vector* context it is extremely straightforward to
452 Micro-code an entire batch onto 128-bit SIMD pipelines, 256-bit SIMD
453 pipelines, and to perform a large internal Forward-Carry-Propagation on
454 for example the Vectorised-Multiply instruction.
455
456 Thirdly: as far as Opcode Allocation is concerned, SVP64 needs to be
457 considered as an independent stand-alone instruction (just like `REP`).
458 In other words, the Suffix **never** gets decoded as a completely
459 different instruction just because of the Prefix. The cost of doing so
460 is simply too high in hardware.
461
462 --------
463
464 # Guidance for evaluation
465
466 Deciding which instructions go into an ISA is extremely complex, costly,
467 and a huge responsibility. In public standards mistakes are irrevocable,
468 and in the case of an ISA the Opcode Allocation is a finite resource,
469 meaning that mistakes punish future instructions as well. This section
470 therefore provides some Evaluation Guidance on the decision process,
471 particularly for people new to ISA development, given that this RFC is
circulated widely and publicly. Constructive feedback from experienced
ISA Architects is welcomed to improve this section.
474
475 **Does anyone want it?**
476
477 Sounds like an obvious question but if there is no driving need (no
478 "Stakeholder") then why is the instruction being proposed? If it is
479 purely out of curiosity or part of a Research effort not intended for
480 production then it's probably best left in the EXT022 Sandbox.
481
482 **How many registers does it need?**
483
484 The basic RISC Paradigm is not only to make instruction encoding simple
485 (often "wasting" encoding space compared to highly-compacted ISAs such
486 as x86), but also to keep the number of registers used down to a minimum.
487
488 Counter-examples are FMAC which had to be added to IEEE754 because the
489 *internal* product requires more accuracy than can fit into a register
490 (it is well-known that FMUL followed by FADD performs an additional
491 rounding on the intermediate register which loses accuracy compared to
492 FMAC). Another would be a dot-product instruction, which again requires
493 an accumulator of at least double the width of the two vector inputs.
494 And in the AMDGPU ISA, there are Texture-mapping instructions taking up
495 to an astounding *twelve* input operands!
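
A small worked example of the FMAC point, using exact rational arithmetic to
stand in for the fused operation: rounding the intermediate FMUL result
throws away exactly the information that FMAC's wider internal product
preserves.

```
from fractions import Fraction

a = 1.0 + 2**-27
b = 1.0 + 2**-27
c = -(1.0 + 2**-26)

fused = float(Fraction(a) * Fraction(b) + Fraction(c))   # rounded once, as FMAC does
split = (a * b) + c                                      # product rounded before the add
print(fused, split)   # 5.551115123125783e-17  0.0
```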
496
497 The downside of going too far however has to be a trade-off with the
498 next question. Both MIPS and RISC-V lack Condition Codes, which means
499 that emulating x86 Branch-Conditional requires *ten* MIPS instructions.
500
501 The downside of creating too complex instructions is that the Dependency
502 Hazard Management in high-performance multi-issue out-of-order
503 microarchitectures becomes infeasibly large, and even simple in-order
504 systems may have performance severely compromised by an overabundance
of stalls. Also worth remembering is that register file ports are
insanely costly, not just to design: they also consume considerable power.
507
That said, there do exist genuine reasons why more registers are better than
fewer: Compare-and-Swap has huge benefits but is costly to implement,
510 and DCT/FFT Twin-Butterfly instructions allow creation of in-place
511 in-register algorithms reducing the number of registers needed and
512 thus saving power due to making the *overall* algorithm more efficient,
513 as opposed to micro-focussing on a localised power increase.
514
515 **How many register files does it use?**
516
517 Complex instructions pulling in data from multiple register files can
518 create unnecessary issues surrounding Dependency Hazard Management in
519 Out-of-Order systems. As a general rule it is better to keep complex
520 instructions reading and writing to the same register file, relying
521 on much simpler (1-in 1-out) instructions to transfer data between
522 register files.
523
**Can other existing instructions (plural) do the same job?**
525
The general rule being: if two or more existing instructions can do the
same job, leave the new one out... *unless* the absence of that
instruction causes huge increases in binary
size. RISC-V has gone too far in this regard, as explained here:
530 <https://news.ycombinator.com/item?id=24459314>
531
Good examples are LD-ST-Indexed-shifted (multiply RB by 2, 4, 8 or 16)
which are high-priority instructions in x86 and ARM, but lacking in
Power ISA, MIPS, and RISC-V. With many critical hot-loops in Computer
Science having to perform shift and add as explicit instructions,
adding LD/ST-shifted should be considered high priority, except that
the sheer *number* of such instructions needing to be added takes us
into the next question.
539
540 **How costly is the encoding?**
541
542 This can either be a single instruction that is costly (several operands
543 or a few long ones) or it could be a group of simpler ones that purely
544 due to their number increases overall encoding cost. An example of an
545 extreme costly instruction would be those with their own Primary Opcode:
`addi` is a good candidate. However, the sheer overwhelming number of
times that instruction is used easily makes the case for its inclusion.
548
Mentioned above was Load-Store-Indexed-Shifted, which only needs 2
bits to specify how much to shift: x2, x4, x8 or x16. Each one is a
10-bit XO Field, so no single instruction is that costly.
Unfortunately there are *around 30* Load-Store-Indexed Instructions in the
Power ISA, which means around an extra *five* bits of precious XO space taken
up just to enumerate them. Then let us not forget the two bits needed for the
Shift amount. Taken together, the whole group effectively occupies the
equivalent of a *three*-bit XO.
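
The arithmetic behind that paragraph, spelled out:

```
import math

ldst_indexed_ops = 30                                        # approximate count
enumeration_bits = math.ceil(math.log2(ldst_indexed_ops))    # 5
shift_amount_bits = 2                                        # x2 x4 x8 x16
effective_group_xo = 10 - enumeration_bits - shift_amount_bits
print(enumeration_bits, effective_group_xo)                  # 5 3
```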
556
557 Is this a worthwhile tradeoff? Honestly it could well be. And that's
558 the decision process that the OpenPOWER ISA Working Group could use some
559 assistance on, to make the evaluation easier.
560
561 **How many gates does it need?**
562
`grevlut` comes in at an astonishing 20,000 gates, where for comparison
an FP64 Multiply typically takes between 12,000 and 15,000. Not counting
the cost in hardware terms is just asking for trouble.
566
567 **How long will it take to complete?**
568
569 In the case of divide or Transcendentals the algorithms needed are so
570 complex that simple implementations can often take an astounding 128
571 clock cycles to complete. Other instructions waiting for the results
572 will back up and eventually stall, where in-order systems pretty much
573 just stall straight away.
574
575 Less extreme examples include instructions that take only a few cycles
576 to complete, but if commonly used in tight loops with Conditional Branches, an
577 Out-of-Order system with Speculative capability may need significantly
578 more Reservation Stations to hold in-flight data for instructions which
579 take longer than those which do not, so even a single clock cycle reduction
580 could become important.
581
A rule of thumb is that in Hardware, at 4.8 GHz the budget for what is called
583 "gate propagation delay" is only around 16 to 19 gates chained one after
584 the other. Anything beyond that budget will need to be stored in DFFs
585 (Flip-flops) and another set of 16-19 gates continues on the next clock
586 cycle. Thus for example with `grevlut` above it is almost certainly the
587 case that high-performance high-clock-rate systems would need at least
588 two clock cycles (two pipeline stages) to produce a valid result.
589 This in turn brings us to the next question as it is common to consider
590 subdividing complex instructions into smaller parts.
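
A rough back-of-the-envelope version of that rule of thumb (the per-gate
figure is an assumed ballpark including local wiring, not a process
datasheet value):

```
clock_hz = 4.8e9
cycle_ps = 1e12 / clock_hz        # ~208 ps per clock cycle
gate_delay_ps = 11                # assumed: typical gate plus wire delay
print(round(cycle_ps), int(cycle_ps // gate_delay_ps))   # 208 18
```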
591
592 **Can one instruction do the job of many?**
593
594 Large numbers of disparate instructions adversely affects resource
595 utilisation in In-Order systems. However it is not always that simple:
596 every one of the Power ISA "add" and "subtract" instructions, as shown by
597 the Microwatt source code, may be micro-coded as one single instruction
598 where RA may optionally be inverted, output likewise, and Carry-In set to
1, 0 or XER.CA. From these options the *entire* suite of add/subtract
may be synthesised (subtraction is performed by inverting RA and adding
an extra 1, producing the 2s-complement of RA).
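
A Python model of that single micro-coded operation (output inversion
omitted for brevity); `subf RT,RA,RB` then becomes nothing more than
invert-RA with a Carry-In of 1.

```
MASK64 = (1 << 64) - 1

def unified_addsub(ra, rb, invert_in=False, carry_in=0):
    a = (~ra & MASK64) if invert_in else ra
    total = a + rb + carry_in
    return total & MASK64, (total >> 64) & 1       # result, carry-out (XER.CA-style)

r, ca = unified_addsub(5, 12, invert_in=True, carry_in=1)   # 12 - 5
print(r, ca)    # 7 1
```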
602
603 `bmask` for example is to be proposed as a single instruction with
604 a 5-bit "Mode" operand, greatly simplifying some micro-architectural
605 implementations. Likewise the FP-INT conversion instructions are grouped
606 as a set of four, instead of over 30 separate instructions. Aside from
607 anything this strategy makes the ISA Working Group's evaluation task
608 easier, as well as reducing the work of writing a Compliance Test Suite.
609
610 In the case of the MIPS 3D ASE Extension, a Reciprocal-Square-Root
611 instruction was proposed that was split into two halves: 12-14 bit
612 accuracy completing in 7 cycles and "Carry On And Get Better Accuracy"
613 for the second instruction! With 3D only needing reduced accuracy
614 the saving in power consumption and time was definitely worthwhile,
615 and it neatly illustrates a counter-example to trying to make one
616 instruction do too much.
617
618 **Summary**
619
There are many tradeoffs here; it is a huge list of considerations. If any
others are known about, please do submit feedback so they may be included
here. Then the evaluation process may take place: again, constructive
feedback as to which instructions are a priority is also appreciated.
The above helps explain the columns in the tables that follow.
625
626 \newpage{}
627
628 # Tables
629
The original tables are available publicly as a CSV file at
631 <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls012/optable.csv;hb=HEAD>.
632 A python program auto-generates the tables in the following sections by
633 sorting into different useful priorities.
634
The key to headings and sections is as follows:
636
637 * **Area** - Target Area as described in above sections
* **XO Cost** - the number of bits required in the XO Field. Whilst not
  the full picture, it is a good indicator as to how costly, in terms
  of Opcode Allocation, a given instruction will be. A lower number means
  a higher cost in the Power ISA's precious remaining Opcode space.
  "PO" indicates that an entire Primary Opcode is required.
* **rfc** - the Libre-SOC External RFC resource,
644 <https://libre-soc.org/openpower/sv/rfc/> where advance notice of
645 upcoming RFCs in development may be found.
646 *Reading advance Draft RFCs and providing feedback strongly advised*,
647 it saves time and effort for the OPF ISA Workgroup.
* **SVP64** - Vectoriseable (SVP64-Prefixable), which implies that
  SVP64Single is also permitted (required).
650 * **page** - Libre-SOC wiki page at which further information can
651 be found. Again: **advance reading strongly advised due to the
652 sheer volume of information**.
* **PO1** - the instruction is capable of being PO1-Prefixed
  (given an EXT1xx Opcode Allocation). Bear in mind that this option
  is **mutually exclusive with Vectorisation**.
656 * **group** - the Primary Opcode Group recommended for this instruction.
657 Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx. A third area
658 (UnVectoriseable),
659 EXT3xx, was available in an early Draft RFC but has been made "RESERVED"
instead. See [[sv/po9_encoding]].
* **regs** - a guide to register usage, i.e. to how costly Hazard Management
  will be in hardware:
663
664 ```
665 - 1R: reads one GPR/FPR/SPR/CR.
666 - 1W: writes one GPR/FPR/SPR/CR.
667 - 1r: reads one CR *Field* (not necessarily the entire CR)
668 - 1w: writes one CR *Field* (not necessarily the entire CR)
669 ```
670
671 [[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
672 [[!inline pages="openpower/sv/rfc/ls012/xo_cost.mdwn" raw=yes ]]
673
674 [[!tag opf_rfc]]