# External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

**Date: 2023apr10. v1**

* Funded by NLnet Grants under EU Horizon Grants 101069594 825310
* <https://git.openpower.foundation/isa/PowerISA/issues/121>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1052>

The purpose of this RFC is:

* to give a full list of upcoming Scalar opcodes developed by Libre-SOC
  (being cognisant that *all* of them are Vectoriseable)
* to give OPF Members and non-Members alike the opportunity to comment and get
  involved early in RFC submission
* to formally agree a priority order on an iterative basis with new versions
  of this RFC,
* to decide which ones should be EXT022 Sandbox, which in EXT0xx, which in
  EXT2xx, and which not proposed at all,
* to keep readers summarily informed of ongoing RFC submissions, with new
  versions of this RFC,
* for IBM (in their capacity as Allocator of Opcodes)
  to get a clear advance picture of Opcode Allocation
  *prior* to submission

As this is a Formal ISA RFC the evaluation shall ultimately define
(in advance of the actual submission of the instructions themselves)
which instructions will be submitted over the next 1-18 months.

*It is expected that readers visit and interact with the Libre-SOC
resources in order to do due-diligence on the prioritisation
evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs
that may turn out not to be useful, against a background of having
no guiding overview or pre-filtering, and everybody's precious time
is wasted.  Also note that the Libre-SOC Team, being funded by NLnet
under Privacy and Enhanced Trust Grants, are **prohibited** from signing
Commercial-Confidentiality NDAs, as doing so is a direct conflict of
interest with their funding body's Charitable Foundation Status and
remit, and therefore the **entire** set of almost 150 new SFFS instructions
can only go via the External RFC Process.  Also be advised and aware
that "Libre-SOC" != "RED Semiconductor Ltd". The two are completely **separate**
organisations*.

Worth bearing in mind during evaluation that every "Defined Word" may
or may not be Vectoriseable, but that every "Defined Word" should have
merits on its own, not just when Vectorised.  An example of a borderline
Vectoriseable Defined Word is `mv.swizzle` which only really becomes
high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
less merit as a Scalar-only operation, yet when SVP64Single-Prefixed
can be part of an atomic Compare-and-Swap sequence.

Although one of the top world-class ISAs,
Power ISA Scalar (SFFS) has not been significantly advanced in 12
years: IBM's primary focus has understandably been on PackedSIMD VSX.
Unfortunately, with VSX being 914 instructions and 128-bit it is far too
much for any new team to consider (10+ years development effort) and far
outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
Power Scalar up-to-date to modern standards *and on its own merits*
is a reasonable goal, and the advantage of the reduced focus is that
SFFS remains RISC-paradigm, with lessons being learned from other
ISAs from the intervening years.  Good examples here include `bmask`.

SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
as well as "True-Scalable-Vector Prefixing" - also literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
their value when Vector-Prefixed, *as well as* when SVP64Single-Prefixed,
unavoidably has to be taken into consideration at the same time.

**Target areas**

Whilst entirely general-purpose there are some categories that these
instructions are targeting: Bit-manipulation, Big-integer, cryptography,
Audio/Visual, High-Performance Compute, GPU workloads and DSP.

**Instruction count guide and approximate priority order**

* 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
* 5 - CR weirds [[sv/cr_int_predication]]
* 4 - INT<->FP mv [[ls006]]
* 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
* ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
* 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
* 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
* 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
* 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
* 5 - Audio-Video [[sv/av_opcodes]]
* 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish) [[ls004]]
* 2 - BMI group [[sv/vector_ops]]
* 2 - GPU swizzle [[sv/mv.swizzle]]
* 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly <https://bugs.libre-soc.org/show_bug.cgi?id=1028>
* 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
* 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
* 25 - Transcendentals (2-arg) [[openpower/transcendentals]]

Summary tables are created below by different sort categories. Additional
columns (and tables) as necessary can be requested to be added as part of update revisions
to this RFC.

\newpage{}

# Target Area summaries

Please note that there are some instructions developed thanks to NLnet
funding that have not been included here for assessment. Examples
include `pcdec` and the Galois Field arithmetic operations. From a purely
practical perspective due to the quantity the lower-priority instructions
were simply left out. However they remain in the Libre-SOC resources.

Some of these SFFS instructions appear to be duplicates of VSX.
A frequent argument comes up that if instructions
are in VSX already they should not be added to SFFS, especially if
they are nominally the same.  The logic that this effectively damages
performance of an SFFS-only implementation was raised earlier, however
there is a more subtle reason why the instructions are needed.

Future versions of SVP64 and SVP64Single are expected to be developed
by future Power ISA Stakeholders on top of VSX.  The decisions made
there about the meaning of Prefixed Vectorised VSX may be **completely**
different from those made for Prefixed SFFS instructions.  At which
point the lack of SFFS equivalents would penalise SFFS implementors
in a much more severe way, effectively expecting them and SFFS programmers
to work with a non-orthogonal paradigm, to their detriment.
The solution is to give the SFFS Subset the space and respect that it deserves
and allow it to be stand-alone on its own merits.

## SVP64 Management instructions

These without question have to go in EXT0xx.  Future extended variants,
bringing even more powerful capabilities, can be followed up later with
EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
*Only `svstep` is actually Vectoriseable*, all other Management
instructions are UnVectoriseable.  PO1-Prefixed examples include adding
psvshape in order to support both Inner and Outer Product Matrix
Schedules, by providing the option to directly reverse the order of the
triple loops.  Outer is used for standard Matrix Multiply (on top
of a standard MAC or FMAC instruction), but Inner is
required for Warshall Transitive Closure (on top of a cumulatively-applied
max instruction).

The Management Instructions themselves are all Scalar Operations, so
PO1-Prefixing is perfectly reasonable.  SVP64 Management instructions, of
which there are only 6, are all 5 or 6 bit XO, meaning that the opcode
space they take up in EXT0xx is not alarmingly high for their intrinsic
strategic value.

## Transcendentals

Found at [[openpower/transcendentals]] these subdivide into high
priority for accelerating general-purpose and High-Performance Compute,
specialist 3D GPU operations suited to 3D visualisation, and low-priority
less common instructions where IEEE754 full bit-accuracy is paramount.
In 3D GPU scenarios for example even 12-bit accuracy can be overkill,
but for HPC Scientific scenarios 12-bit would be disastrous.

There are a **lot** of operations here, and they also bring Power
ISA up-to-date to IEEE754-2019.  Fortunately the number of critical
instructions is quite low, but the caveat is that if those operations
are utilised to synthesise other IEEE754 operations (divide by `pi` for
example) full bit-level accuracy (a hard requirement for IEEE754) is lost.

Also worth noting that the Khronos Group defines minimum acceptable
bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
accuracy demanded by IEEE754.  The reason for the Khronos definitions is
a massive reduction, often four-fold, in power consumption and gate count
when 3D Graphics simply has no need for full accuracy.

*For 3D GPU markets this definitely needs addressing*

## Audio/Video

Found at [[sv/av_opcodes]] these do not require Saturated variants
because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
[[sv/svp64_single]] Scalar Prefixing. This is important to note for
Opcode Allocation because placing these operations in the UnVectoriseable
areas would irredeemably damage their value.  Unlike PackedSIMD ISAs
the actual number of AV Opcodes is remarkably small once the usual
cascading-option-multipliers (SIMD width, bitwidth, saturation,
HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".

## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT,
is astonishing.  The wikipedia page lists over a hundred separate and
distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon
Error Correction, the list goes on and on.  ARM has special dedicated
Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT
Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear this is high priority.

With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
operations, typically performing for example one multiply but in-place
subtracting that product from one operand and adding it to the other.
The *in-place* aspect is strategically extremely important for significant
reductions in Vectorised register usage, particularly for DCT.
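
As a rough illustration of the in-place idea, the classic textbook radix-2
butterfly can be sketched in Python. This is a generic butterfly only, not
the semantics of any proposed instruction: the function name
`twin_butterfly` and the list `regs` (standing in for a register file) are
assumptions made purely for this sketch. One multiply feeds both an add and
a subtract, and both results overwrite their inputs.

```python
def twin_butterfly(regs, i, j, w):
    """Generic radix-2 butterfly, updating regs[i] and regs[j] in place.

    One multiply (w * regs[j]) feeds both an add and a subtract, so no
    extra destination registers are needed.  Illustrative only: not the
    exact definition of any proposed instruction.
    """
    t = w * regs[j]          # the single multiply
    regs[i], regs[j] = regs[i] + t, regs[i] - t


# A 2-point DFT is exactly one butterfly with twiddle factor w = 1.
regs = [3.0, 5.0]
twin_butterfly(regs, 0, 1, 1.0)
```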

## CR Weird group

Outlined in [[sv/cr_int_predication]] these instructions massively save
on CR-Field instruction count.  Multi-bit to single-bit and vice-versa
normally requiring several CR-ops (crand, crxor) are done in one single
instruction.  The reason for their addition is down to SVP64 overloading
CR Fields as Vector Predicate Masks.  Reducing instruction count in
hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors but adding
such instructions to the *Condition Register* side was deemed to be far
too much. Therefore, priority was given instead to transferring several
CR Field bits into GPRs, whereupon the full set of Standard Scalar GPR
Logical Operations may be used. This strategy has the side-effect of
keeping the CRweird group down to only five instructions.
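
The strategy of moving CR Field bits into a GPR and then using ordinary
GPR operations can be modelled roughly as below. This is a behavioural
sketch only: the helper `gather_cr_bit` and its packing order (element 0
into bit 0) are assumptions for illustration, and do not reproduce the
definition of any instruction in [[sv/cr_int_predication]]. Each CR Field
is modelled as a 4-bit value in the order LT, GT, EQ, SO.

```python
# CR Field bit weights, most-significant first: LT, GT, EQ, SO.
LT, GT, EQ, SO = 8, 4, 2, 1


def gather_cr_bit(cr_fields, bit):
    """Pack one chosen bit from each 4-bit CR Field into an integer mask.

    Element 0 lands in bit 0 of the result.  This packing order is an
    assumption of the sketch, not taken from the specification.
    """
    mask = 0
    for idx, field in enumerate(cr_fields):
        if field & bit:
            mask |= 1 << idx
    return mask


# Four CR Fields used as a predicate: count how many have EQ set.
fields = [EQ, GT, EQ | SO, LT]
mask = gather_cr_bit(fields, EQ)
count = bin(mask).count("1")     # ordinary GPR-side popcount
```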

## Big-integer Math

[[sv/biginteger]] has always been a high priority area for commercial
applications, privacy, Banking, as well as HPC Numerical Accuracy:
libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
and ec25519 are finding their way into everyday use via OpenSSL.

A very early variant of the Power ISA had a 32-bit Carry-in Carry-out
SPR. Its removal from subsequent revisions is regrettable.  An alternative
concept is to add six explicit 3-in 2-out operations that, on close
inspection, always turn out to be supersets of *existing Scalar
operations* that discard upper or lower DWords, or parts thereof.

*Thus it is critical to note that not one single one of these operations
expands the bitwidth of any existing Scalar pipelines*.

The `dsld` instruction for example merely places additional LSBs into the
64-bit shift (64-bit carry-in), and then places the (normally discarded)
MSBs into the second output register (64-bit carry-out). It does **not**
require a 128-bit shifter to replace the existing Scalar Power ISA
64-bit shifters.
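
The carry-in and carry-out behaviour described above can be modelled with
a short Python sketch. It is a simplified model under stated assumptions,
not the specified pseudocode of `dsld`: the names `dsld_sketch` and
`shift_limbs_left` are hypothetical, and the carry-in is assumed to supply
exactly the low `sh` bits. Chaining the carry from one 64-bit limb to the
next gives an arbitrary-length shift.

```python
MASK64 = (1 << 64) - 1


def dsld_sketch(ra, sh, carry_in):
    """Shift a 64-bit value left by sh (0..63) with a 64-bit carry.

    Returns (result, carry_out).  The low sh bits of carry_in fill the
    vacated LSBs; the bits shifted out of the top become carry_out.
    A simplified model for illustration, not the specified pseudocode.
    """
    wide = (ra << sh) | (carry_in & ((1 << sh) - 1))
    return wide & MASK64, wide >> 64


def shift_limbs_left(limbs, sh):
    """Shift a little-endian list of 64-bit limbs left by sh bits."""
    carry, out = 0, []
    for limb in limbs:
        res, carry = dsld_sketch(limb, sh, carry)
        out.append(res)
    return out, carry
```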

The reduction in instruction count these operations bring, in critical
hot loops, is remarkably high, to the extent where a Scalar-to-Vector
operation of *arbitrary length* becomes just the one Vector-Prefixed
instruction.
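
To illustrate why a 64-bit carry makes arbitrary-length operations a
simple chain, the following Python sketch multiplies a multi-limb integer
by a 64-bit scalar. The function names are hypothetical and the 3-in 2-out
operation shown (product plus carry, split into low and high halves) is an
assumed model rather than the definition of any specific proposed
instruction. Each loop iteration corresponds to one element of what would
be a single Vector-Prefixed instruction.

```python
MASK64 = (1 << 64) - 1


def mul_add_carry_sketch(ra, rb, rc):
    """3-in 2-out multiply-add: returns (low, high) halves of ra*rb + rc.

    With ra, rb, rc all below 2**64 the sum always fits in 128 bits, so
    the high half can be fed straight back in as the next carry.
    """
    wide = ra * rb + rc
    return wide & MASK64, wide >> 64


def bigint_mul_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar."""
    carry, out = 0, []
    for limb in limbs:
        low, carry = mul_add_carry_sketch(limb, scalar, carry)
        out.append(low)
    return out + [carry]
```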

Whilst these are 5-6 bit XO their utility is considered to be of high
strategic value, and as such they are strongly advocated to be in EXT04.
The alternative is to bring back a 64-bit Carry SPR, but how it would be
retrospectively applicable to pre-existing Scalar Power ISA multiply,
divide, and shift operations at this late stage of maturity of the Power
ISA is an entire area of research on its own, deemed unlikely to be
achievable.

## fclass and GPR-FPR moves

[[sv/fclass]] - just one instruction.  With SFFS being locked down to
exclude VSX, and there being no desire within the nascent OpenPOWER
ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it
becomes necessary to upgrade SFFS such that it is stand-alone capable. One
omission based on the assumption that VSX would always be present is an
equivalent to `xvtstdcsp`.

Similar arguments apply to the GPR-FPR move operations, proposed in
[[ls006]], with the opportunity taken to add rounding modes present
in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
rounding, one of the worst offenders of Computer Science, requires a
phenomenal 35 instructions with *six branches* to emulate in Power
ISA! For desktop as well as Server HTML/JS back-end execution of
javascript this becomes an obvious priority, recognised already by ARM
as just one example.

## Bitmanip LUT2/3

These LUT2/3 operations are high cost high reward. Outlined in
[[sv/bitmanip]], the simplest ones already exist in PackedSIMD VSX:
`xxeval`.  The same reasoning applies as to fclass: SFFS needs to be
stand-alone on its own merits and should an implementor
choose not to implement any aspect of PackedSIMD VSX the performance
of their product should not be penalised for making that decision.
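
For readers unfamiliar with ternary lookup operations, the general idea
can be sketched in Python: an 8-bit immediate acts as a truth table, and
each result bit is looked up from the corresponding bits of three inputs.
The name `lut3_sketch` and the selector ordering (first operand as the
most significant index bit) are assumptions of this sketch, not taken from
[[sv/bitmanip]].

```python
def lut3_sketch(a, b, c, imm8, width=64):
    """Bitwise ternary lookup: each result bit is imm8[(a_i, b_i, c_i)].

    The index ordering (a as the most significant selector bit) is an
    assumption of this sketch.
    """
    result = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> idx) & 1) << i
    return result


# Two symmetric functions, so the selector ordering does not matter:
# 0x96 is three-way XOR, 0xE8 is bitwise majority.
a, b, c = 0b11001010, 0b10101100, 0b11110000
xor3 = lut3_sketch(a, b, c, 0x96)
maj3 = lut3_sketch(a, b, c, 0xE8)
```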

With Predication being such a high priority in GPUs and HPC, CR Field
variants of Ternary and Binary LUT instructions were considered high
priority, and again just like in the CRweird group the opportunity was
taken to work on *all* bits of a CR Field rather than just one bit as
is done with the existing CR operations crand, cror etc.

The other high strategic value instruction is `grevlut` (and `grevluti`
which can generate a remarkably large number of regular-patterned magic
constants).  The grevlut set requires of the order of 20,000 gates but
provides an astonishing plethora of innovative bit-permuting instructions
never seen in any other ISA.

The downside of all of these instructions is the extremely low XO bit
requirements: 2-3 bit XO due to the large immediates *and* the number of
operands required.  The LUT3 instructions are already compacted down to
"Overwrite" variants.  (By contrast the Float-Load-Immediate instructions
are a much larger XO because despite having a 16-bit immediate only one
Register Operand is needed).

Realistically these high-value instructions should be proposed in EXT2xx
where their XO cost does not overwhelm EXT0xx.


## (f)mv.swizzle

[[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value
as a Scalar instruction is limited *except* if combined with `cmpi` and
SVP64Single Predication, whereupon the end result is the RISC-synthesis
of Compare-and-Swap, in two instructions.

Where this instruction comes into its full value is when Vectorised.
3D GPU and HPC numerical workloads astonishingly contain between 10 and 15%
swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded
GPU ISAs have a staggering 24-bits dedicated to 2-operand Swizzle.
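
For those not familiar with the term, a swizzle simply selects, reorders
and optionally repeats elements of a short vector. A minimal Python
sketch follows; the helper `swizzle` and its string-based selector are
purely illustrative and are not the encoding used by [[sv/mv.swizzle]].

```python
def swizzle(vec, pattern, names="XYZW"):
    """Return the elements of vec selected (and possibly repeated) by pattern."""
    return [vec[names.index(ch)] for ch in pattern]


xyzw = [1.0, 2.0, 3.0, 4.0]
yyz = swizzle(xyzw, "YYZ")       # pick Y twice, then Z
xy = swizzle(xyzw, "XY")         # the first two elements only

# Reordering ARGB pixel components into BGRA order.
bgra = swizzle([0xFF, 0x10, 0x20, 0x30], "BGRA", names="ARGB")
```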

So as not to radicalise the Power ISA the Libre-SOC team decided to
introduce mv Swizzle operations, which can always be Macro-op fused
in exactly the same way that ARM SVE predicated-move extends 3-operand
"overwrite" opcodes to full independent 3-in 1-out.

## BMI (bit-manipulation) group.

Whilst the [[sv/vector_ops]] instructions are only two in number, in
reality the `bmask` instruction has a Mode field allowing it to cover
**24** instructions, more than have been added to any other CPUs by
ARM, Intel or AMD.  Analysis of the BMI sets of these CPUs shows simple
patterns that can greatly simplify both Decode and implementation. These
are sufficiently commonly used, saving instruction count regularly,
that they justify going into EXT0xx.
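
The "simple patterns" referred to can be seen in three well-known x86 BMI1
operations, sketched below in Python. They are shown only as examples of
the kind of operation a Mode field could select between: the actual Mode
encoding and operand selection of `bmask` are defined in [[sv/vector_ops]]
and are not reproduced here. The Python function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def clear_lowest_set(x):
    """x & (x - 1): clear the lowest set bit (x86 BMI1 `blsr`)."""
    return x & (x - 1) & MASK64


def isolate_lowest_set(x):
    """x & -x: keep only the lowest set bit (x86 BMI1 `blsi`)."""
    return x & -x & MASK64


def mask_up_to_lowest_set(x):
    """x ^ (x - 1): ones up to and including the lowest set bit (`blsmsk`)."""
    return (x ^ (x - 1)) & MASK64
```

All three combine `x` with `x - 1` or `-x` through a single logical
operation, which is why one shared datapath can plausibly cover many such
variants.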

The other instruction is `cprop` - Carry-Propagation - which takes
the P and Q from carry-propagation algorithms and generates carry
look-ahead. It greatly increases the efficiency of arbitrary-precision
integer arithmetic by combining what would otherwise be half a dozen
instructions into one. However it is still not a huge priority unlike
`bmask`, so is probably best placed in EXT2xx.
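
One way to picture carry look-ahead from two masks is the following Python
sketch, where each bit position stands for one limb of a larger addition.
The masks are named `p` (propagate) and `g` (generate) here, and the
identity used, `((p | g) + g) ^ p`, is an assumed formulation chosen for
illustration rather than the definition of `cprop` itself. It is checked
against a bit-by-bit reference, and is only valid when no position has
both `p` and `g` set.

```python
def carry_lookahead_sketch(p, g, width=64):
    """Carry-in bit for every position, from propagate (p) and generate (g).

    Uses the identity ((p | g) + g) ^ p, valid when no position has both
    p and g set.  An assumed formulation for illustration.
    """
    mask = (1 << width) - 1
    return (((p | g) + g) ^ p) & mask


def carry_lookahead_reference(p, g, width=64):
    """Bit-by-bit reference: c[i+1] = g[i] | (p[i] & c[i]), with c[0] = 0."""
    carries, c = 0, 0
    for i in range(width):
        carries |= c << i
        c = ((g >> i) & 1) | (((p >> i) & 1) & c)
    return carries
```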

## Float-Load-Immediate

Very easily justified.  As explained in [[ls002]] these always save one
LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
FP value being in the I-Cache side.  It is such a high priority that
these instructions are easily justified for addition into EXT0xx, despite
requiring a 16-bit immediate.  By designing the second-half instruction
as a Read-Modify-Write it saves on XO bit-length (only 5 bits), and can be
macro-op fused with its first-half to store a full IEEE754 FP32 immediate
into a register.
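
The two-halves idea can be sketched in Python on raw FP32 bit patterns.
The assumption made here, for illustration only, is that the first
instruction supplies the upper 16 bits and the Read-Modify-Write second
instruction fills in the lower 16 bits; the function names are
hypothetical and the precise definitions are in [[ls002]].

```python
import struct


def fp32_bits(value):
    """IEEE754 FP32 bit pattern of value, as a 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def load_upper_sketch(imm16):
    """First half: place a 16-bit immediate in the upper half (low half zero)."""
    return (imm16 & 0xFFFF) << 16


def merge_lower_sketch(reg, imm16):
    """Second half (Read-Modify-Write): fill in the lower 16 bits of reg."""
    return (reg & 0xFFFF0000) | (imm16 & 0xFFFF)


bits = fp32_bits(3.14159)
reg = load_upper_sketch(bits >> 16)           # already a usable approximation
reg = merge_lower_sketch(reg, bits & 0xFFFF)  # now the full FP32 value
```

Values whose lower 16 bits are zero, such as 1.0 (`0x3F800000`), need only
the first half.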

There is little point in putting these instructions into EXT2xx. Their
very benefit and inherent value *is* as 32-bit instructions, not 64-bit
ones. Likewise there is less value in taking up EXT1xx Encoding space
because EXT1xx only brings an additional 16 bits (approx) to the table,
and that is provided already by the second-half instruction.

Thus they qualify as both high priority and also EXT0xx candidates.

## FPR/GPR LD/ST-PostIncrement-Update

These instructions, outlined in [[ls011]], save hugely in hot-loops.
Early ISAs such as the PDP-8 and PDP-11, which inspired the iconic Motorola
68000, 88100 and Mitch Alsup's MyISA 66000, and whose lineage can even be
traced back to the iconic ultra-RISC CDC 6600, all had both pre- and
post-increment Addressing Modes.

The reason is very simple: it is a direct recognition of the practice
in C to frequently utilise both `*p++` and `*++p`, which itself stems
from common need in Computer Science algorithms.

The problem for the Power ISA is - was - that the opcode space needed
to support both was far too great, and the decision was made to go with
pre-increment, on the basis that outside the loop a "pre-subtraction"
may be performed.
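
The difference between the two addressing styles, and the
"pre-subtraction" workaround, can be modelled in a few lines of Python.
The list `mem` stands in for memory and the function names are
hypothetical; this is a behavioural sketch, not a model of any specific
Load-with-Update instruction.

```python
def sum_post_increment(mem, base, count):
    """Model of `*p++`: use the address, then advance it."""
    p, total = base, 0
    for _ in range(count):
        total += mem[p]
        p += 1
    return total


def sum_pre_increment(mem, base, count):
    """Model of `*++p`: advance the address, then use it."""
    p, total = base - 1, 0        # the "pre-subtraction" outside the loop
    for _ in range(count):
        p += 1
        total += mem[p]
    return total
```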

Whilst this is a "solution" it is less than ideal, and the opportunity
exists now with the EXT2xx Primary Opcodes to correct this and bring
Power ISA up a level.

## Shift-and-add

Shift-and-Add are proposed in [[ls004]].  They mitigate the need to add
LD-ST-Shift instructions which are a high-priority aspect of both x86
and ARM.  LD-ST-Shift is normally just the one instruction: Shift-and-add
brings that down to two, where Power ISA presently requires three.
Cryptography e.g. twofish also makes use of Integer double-and-add,
so the value of these instructions is not limited to Effective Address
computation.  They will also have value in Audio DSP.
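
As a small worked example of the Effective Address case, the Python sketch
below computes the address of an array element as a single combined
shift-and-add. The function name and operand order are assumptions for
illustration; the actual instruction definitions are in [[ls004]].

```python
MASK64 = (1 << 64) - 1


def shift_and_add_sketch(base, index, shift):
    """Effective Address of element `index`: base + (index << shift)."""
    return (base + (index << shift)) & MASK64


# Address of element 5 in an array of 8-byte entries starting at 0x1000.
# Separate shift and add instructions would need two steps before the load;
# a combined shift-and-add needs one.
ea = shift_and_add_sketch(0x1000, 5, 3)
```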

Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
when their whole purpose and value is to reduce binary size in Address
offset computation, thus they are best placed in EXT0xx.

\newpage{}

# Vectorisation: SVP64 and SVP64Single

To be submitted as part of [[ls001]], [[ls008]], [[ls009]] and [[ls010]],
with SVP64Single to follow in a subsequent RFC, SVP64 is conceptually
identical to the 40+ year old 8086 `REP` instruction and the Zilog Z80
`CPIR` and `LDIR` instructions.  Parallelism is best achieved by exploiting
a Multi-Issue Out-of-Order Micro-architecture.  It is extremely important
to bear in mind that at no time does SVP64 add even one single actual
Vector instruction.  It is a *pure* RISC-paradigm Prefixing concept only.

This has some implications which need unpacking.  Firstly: in the future,
the Prefixing may be applied to VSX.  The only reason it was not included
in the initial proposal of SVP64 is that, due to the number of VSX
instructions, the Due Diligence required is obviously five times higher
than the 3+ years work done so far on the SFFS Subset.

Secondly: **any** Scalar instruction involving registers **automatically**
becomes a candidate for Vector-Prefixing.  This in turn means that when
a new instruction is proposed, it becomes a hard requirement to consider
not only the implications of its inclusion as a Scalar-only instruction,
but how it will best be utilised as a Vectorised instruction **as well**.
Extreme examples of this are the Big-Integer 3-in 2-out instructions that
use one 64-bit register effectively as a Carry-in and Carry-out. The
instructions were designed in a *Scalar* context to be inline-efficient
in hardware (use of Operand-Forwarding to reduce the chain down to 2-in 1-out),
but in a *Vector* context it is extremely straightforward to Micro-code
an entire batch onto 128-bit SIMD pipelines, 256-bit SIMD pipelines, and
to perform a large internal Forward-Carry-Propagation on for example the
Vectorised-Multiply instruction.

Thirdly: as far as Opcode Allocation is concerned, SVP64 needs to be
considered as an independent stand-alone instruction (just like `REP`).
In other words, the Suffix **never** gets decoded as a completely different
instruction just because of the Prefix.  The cost of doing so is simply
too high in hardware.

--------

# Guidance for evaluation

Deciding which instructions go into an ISA is extremely complex, costly,
and a huge responsibility. In public standards mistakes are irrevocable,
and in the case of an ISA the Opcode Allocation is a finite resource,
meaning that mistakes punish future instructions as well.  This section
therefore provides some Evaluation Guidance on the decision process,
particularly for people new to ISA development, given that this RFC
is circulated widely and publicly.  Constructive feedback from experienced
ISA Architects is welcomed to improve this section.

**Does anyone want it?**

Sounds like an obvious question but if there is no driving need (no
"Stakeholder") then why is the instruction being proposed? If it is
purely out of curiosity or part of a Research effort not intended for
production then it's probably best left in the EXT022 Sandbox.

**How many registers does it need?**

The basic RISC Paradigm is not only to make instruction encoding simple
(often "wasting" encoding space compared to highly-compacted ISAs such
as x86), but also to keep the number of registers used down to a minimum.

Counter-examples are FMAC which had to be added to IEEE754 because the
*internal* product requires more accuracy than can fit into a register
(it is well-known that FMUL followed by FADD performs an additional
rounding on the intermediate register which loses accuracy compared to
FMAC).  Another would be a dot-product instruction, which again requires
an accumulator of at least double the width of the two vector inputs.
And in the AMDGPU ISA, there are Texture-mapping instructions taking up
to an astounding *twelve* input operands!
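
The FMAC point, that FMUL followed by FADD rounds twice, can be
demonstrated directly in Python using exact rational arithmetic to stand
in for the fused operation. The operand values below are chosen for this
sketch so that the exact product lies exactly halfway between two FP64
values; the sketch assumes the usual round-to-nearest-even FP64 arithmetic
used by Python floats.

```python
from fractions import Fraction

a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

# FMUL then FADD: the product 1 - 2**-54 is rounded to 1.0 before the add,
# so the small term is lost entirely.
unfused = a * b + c

# Fused: compute a*b + c exactly, then round once to FP64.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))
```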

The downside of going too far, however, has to be traded off against the
next question. Both MIPS and RISC-V lack Condition Codes, which means
that emulating x86 Branch-Conditional requires *ten* MIPS instructions.

The downside of creating too complex instructions is that the Dependency
Hazard Management in high-performance multi-issue out-of-order
microarchitectures becomes infeasibly large, and even simple in-order
systems may have performance severely compromised by an overabundance
of stalls.  Also worth remembering is that register file ports are
insanely costly, not just to design: they also use considerable power.

That said there do exist genuine reasons why more registers are better than
fewer: Compare-and-Swap has huge benefits but is costly to implement,
and DCT/FFT Twin-Butterfly instructions allow creation of in-place
in-register algorithms reducing the number of registers needed and
thus saving power due to making the *overall* algorithm more efficient,
as opposed to micro-focussing on a localised power increase.

**How many register files does it use?**

Complex instructions pulling in data from multiple register files can
create unnecessary issues surrounding Dependency Hazard Management in
Out-of-Order systems.  As a general rule it is better to keep complex
instructions reading and writing to the same register file, relying
on much simpler (1-in 1-out) instructions to transfer data between
register files.

**Can other existing instructions (plural) do the same job?**

The general rule being: if two or more instructions can do the
same job, leave it out...  *unless* the number of occurrences of
that instruction being missing is causing huge increases in binary
size.  RISC-V has gone too far in this regard, as explained here:
<https://news.ycombinator.com/item?id=24459314>

Good examples are LD-ST-Indexed-shifted (multiply RB by 2, 4, 8 or 16)
which are high-priority instructions in x86 and ARM, but lacking in
Power ISA, MIPS, and RISC-V. With many critical hot-loops in Computer
Science having to perform shift and add as explicit instructions,
adding LD/ST-shifted should be considered high priority, except that
the sheer *number* of such instructions needing to be added takes us
into the next question.

**How costly is the encoding?**

This can either be a single instruction that is costly (several operands
or a few long ones) or it could be a group of simpler ones that purely
due to their number increases overall encoding cost.  An example of an
extreme costly instruction would be those with their own Primary Opcode:
addi is a good candidate.  However the sheer overwhelming number of
times that instruction is used easily makes a case for its inclusion.

Mentioned above was Load-Store-Indexed-Shifted, which only needs 2
bits to specify how much to shift: x2 x4 x8 or x16. And they are all
a 10-bit XO Field, so not that costly for any one given instruction.
Unfortunately there are *around 30* Load-Store-Indexed Instructions in the
Power ISA, which means an extra *five* bits taken up of precious XO space.
Then let us not forget the two needed for the Shift amount. Now we are
down to the equivalent of a *three* bit XO for the group.
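
The arithmetic behind that figure is simple enough to write down as a
small Python helper. The function name and parameters are hypothetical;
the inputs (10-bit XO, around 30 instructions, 2 shift bits) are the
numbers quoted above.

```python
import math


def remaining_xo_bits(xo_bits, instruction_count, immediate_bits):
    """XO width left for the whole group after paying for variants."""
    # Distinguishing around 30 instructions needs ceil(log2(30)) = 5 bits.
    variant_bits = math.ceil(math.log2(instruction_count))
    return xo_bits - variant_bits - immediate_bits


group_xo = remaining_xo_bits(xo_bits=10, instruction_count=30, immediate_bits=2)
```

The lower the resulting number, the larger the share of a Primary Opcode
that the group as a whole consumes.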

Is this a worthwhile tradeoff? Honestly it could well be.  And that's
the decision process that the OpenPOWER ISA Working Group could use some
assistance on, to make the evaluation easier.

**How many gates does it need?**

`grevlut` comes in at an astonishing 20,000 gates, where for comparison
an FP64 Multiply typically takes between 12,000 and 15,000.  Not counting
the cost in hardware terms is just asking for trouble.

**How long will it take to complete?**

In the case of divide or Transcendentals the algorithms needed are so
complex that simple implementations can often take an astounding 128
clock cycles to complete.  Other instructions waiting for the results
will back up and eventually stall, where in-order systems pretty much
just stall straight away.

Less extreme examples include instructions that take only a few cycles
to complete, but if used in tight loops with Conditional Branches, an
Out-of-Order system with Speculative capability may need significantly
more Reservation Stations to hold in-flight data for instructions which
take longer than those which do not.

**Can one instruction do the job of many?**

Large numbers of disparate instructions adversely affect resource
utilisation in In-Order systems.  However it is not always that simple:
every one of the Power ISA "add" and "subtract" instructions, as shown by
the Microwatt source code, may be micro-coded as one single instruction
where RA may optionally be inverted, output likewise, and Carry-In set to
1, 0 or XER.CA.  From these options the *entire* suite of add/subtract
may be synthesised (subtract by inverting RA and adding an extra 1: this
produces the 2s-complement of RA).
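
The single shared adder described above can be sketched in Python as
follows. This is a simplified model: it covers only the RA-invert and
Carry-In options (the output-invert option and XER.CA as a carry source
are left out), and the function `adder_sketch` is a hypothetical name,
not taken from the Microwatt source.

```python
MASK64 = (1 << 64) - 1


def adder_sketch(ra, rb, invert_ra=False, carry_in=0):
    """One shared 64-bit adder: (RA or ~RA) + RB + carry_in."""
    a = (~ra & MASK64) if invert_ra else ra
    return (a + rb + carry_in) & MASK64


def add(ra, rb):
    """Plain add: RA + RB + 0."""
    return adder_sketch(ra, rb)


def subf(ra, rb):
    """Subtract-from: RB - RA, computed as ~RA + RB + 1."""
    return adder_sketch(ra, rb, invert_ra=True, carry_in=1)


def neg(ra):
    """Negate: ~RA + 0 + 1."""
    return adder_sketch(ra, 0, invert_ra=True, carry_in=1)
```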

`bmask` for example is to be proposed as a single instruction with
a 5-bit "Mode" operand, greatly simplifying some micro-architectural
implementations. Likewise the FP-INT conversion instructions are grouped
as a set of four, instead of over 30 separate instructions.  Aside from
anything this strategy makes the ISA Working Group's evaluation task
easier, as well as reducing the work of writing a Compliance Test Suite.

**Summary**

There are many tradeoffs here, and it is a huge list of considerations.
If any others are known about, please do submit feedback so that they may
be included here.  Then the evaluation process may take place: again,
constructive feedback on which instructions are a priority is also
appreciated.  The above helps explain the columns in the tables that follow.

# Tables

The original tables are available publicly as a CSV file at
<https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls012/optable.csv;hb=HEAD>.
A python program auto-generates the tables in the following sections
by sorting into different useful priorities.

The key to headings and sections is as follows:

* **Area** - Target Area as described in above sections
* **XO Cost** - the number of bits required in the XO Field. Whilst not
  the full picture it is a good indicator as to how costly in terms
  of Opcode Allocation a given instruction will be.  Lower number is
  a higher cost for the Power ISA's precious remaining Opcode space.
  "PO" indicates that an entire Primary Opcode is required.
* **rfc** - the Libre-SOC External RFC resource,
  <https://libre-soc.org/openpower/sv/rfc/> where advance notice of
  upcoming RFCs in development may be found.
  *Reading advance Draft RFCs and providing feedback strongly advised*,
  it saves time and effort for the OPF ISA Workgroup.
* **SVP64** - Vectoriseable (SVP64-Prefixable) - also implies that
  SVP64Single is also permitted (required).
* **page** - Libre-SOC wiki page at which further information can
  be found.  Again: **advance reading strongly advised due to the
  sheer volume of information**.
* **PO1** - the instruction is capable of being PO1-Prefixed
  (given an EXT1xx Opcode Allocation). Bear in mind that this option
  is **mutually exclusively incompatible** with Vectorisation.
* **group** - the Primary Opcode Group recommended for this instruction.
  Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx.  A third area
  (UnVectoriseable),
  EXT3xx, was available in an early Draft RFC but has been made "RESERVED"
  instead.  See [[sv/po9_encoding]].
* **regs** - a guide to register usage, to how costly Hazard Management
  will be, in hardware:

```
    - 1R: reads one GPR/FPR/SPR/CR.
    - 1W: writes one GPR/FPR/SPR/CR.
    - 1r: reads one CR *Field* (not necessarily the entire CR)
    - 1w: writes one CR *Field* (not necessarily the entire CR)
```

[[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
[[!inline pages="openpower/sv/rfc/ls012/xo_cost.mdwn" raw=yes ]]

[[!tag opf_rfc]]