# External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

* <https://git.openpower.foundation/isa/PowerISA/issues/121>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1052>

The purpose of this RFC is to give a full list of the upcoming Scalar
opcodes developed by Libre-SOC, to formally agree a priority order, to
decide which ones should go in the EXT022 Sandbox, and to give IBM a clear
picture of the Opcode Allocation needs.  As this is a Formal ISA RFC, the
evaluation shall define (in advance of the actual submission of the
instructions themselves) which instructions should be submitted over the
next 18 months.

*It is expected that readers visit and interact with the Libre-SOC resources
in order to do due-diligence on the prioritisation evaluation*.

It is worth bearing in mind during evaluation that every "Defined
Word" may or may not be Vectoriseable, but that every "Defined Word"
should have merit on its own, not just when Vectorised.  An example
of a borderline Vectoriseable Defined Word is `mv.swizzle`, which
only really becomes high-priority for Vector GPU and HPC Workloads
but has less merit as a Scalar-only operation.

Power ISA Scalar (SFFS) has not been significantly advanced in 12 years.
With 914 instructions and a 128-bit width, VSX is far too much for any
new team to consider (10 years of development effort) and far outside of
Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar
up-to-date to modern standards is a reasonable goal, and the advantage is
that lessons can be learned from other ISAs.

SVP64 Prefixing, also known by the terms "Zero-Overhead-Loop-Prefixing"
and "True-Scalable-Vector Prefixing", quite literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined Words",
their value when Vectorised unavoidably has to be taken into consideration
at the same time.

**Target areas**

Whilst entirely general-purpose, there are some categories that
these instructions are targeting: Bit-manipulation, Big-integer,
Cryptography, Audio/Visual, High-Performance Compute, GPU workloads
and DSP.

**Instruction count guide and approximate priority order**

* 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
* 5 - CR weirds [[sv/cr_int_predication]]
* 4 - INT<->FP mv [[ls006]]
* 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
* ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
* 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
* 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
* 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
* 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
* 5 - Audio-Video [[sv/av_opcodes]]
* 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish)
* 2 - BMI group [[sv/vector_ops]]
* 2 - GPU swizzle [[sv/mv.swizzle]]
* 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly
* 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
* 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
* 25 - Transcendentals (2-arg) [[openpower/transcendentals]]

Summary tables are created below by different sort categories. Additional
columns can be requested as necessary, to be added as part of update
revisions to this RFC.

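As an illustration of why a "3-in 2-out (64-bit Carry)" primitive from the
priority list matters for Big-Integer work, the following is a minimal Python
sketch. It assumes the primitive computes `a * b + c` on 64-bit inputs and
returns the 128-bit result as two 64-bit halves. The names `mul_add_carry`
and `bigint_mul_word` are hypothetical and are not taken from
[[sv/biginteger]], whose actual instructions may differ in detail.

```python
MASK64 = (1 << 64) - 1


def mul_add_carry(a, b, c):
    """Hypothetical 3-in 2-out primitive: (lo, hi) halves of a * b + c.

    With 64-bit inputs the result always fits in 128 bits, because
    (2**64 - 1)**2 + (2**64 - 1) == 2**128 - 2**64.
    """
    total = a * b + c
    return total & MASK64, total >> 64


def bigint_mul_word(digits, word):
    """Multiply a little-endian list of 64-bit digits by one 64-bit word.

    The high half of each step is chained into the next step as the
    64-bit carry, so the inner loop is one primitive per digit.
    """
    carry = 0
    result = []
    for digit in digits:
        lo, carry = mul_add_carry(digit, word, carry)
        result.append(lo)
    result.append(carry)  # final carry becomes the top digit
    return result


def to_digits(value, count):
    """Split a non-negative integer into `count` little-endian 64-bit digits."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_digits(digits):
    """Recombine little-endian 64-bit digits into one integer."""
    return sum(d << (64 * i) for i, d in enumerate(digits))
```

The chained carry is the point of the sketch: without a second output, each
digit would need separate low-multiply, high-multiply and add-with-carry
steps.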
# Target Area summaries

## Transcendentals

Found at [[openpower/transcendentals]], these subdivide into high priority for
accelerating general-purpose and High-Performance Compute, specialist 3D GPU
operations suited to 3D visualisation, and low-priority less common instructions
where IEEE754 full bit-accuracy is paramount.  In 3D GPU scenarios, for example,
even 12-bit accuracy can be overkill, but for HPC Scientific scenarios 12-bit
would be disastrous.

## Audio/Video

Found at [[sv/av_opcodes]], these do not require Saturated variants because
Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
[[sv/svp64_single]] (Scalar Prefixing). This is important to note for Opcode
Allocation because placing these operations in the UnVectoriseable areas would
irredeemably damage their value. Unlike PackedSIMD ISAs, the actual number of AV
Opcodes is remarkably small once the usual cascading-option-multipliers (SIMD
width, bitwidth, saturation, HI/LO) are abstracted out to RISC-paradigm
Prefixing, leaving just absolute-diff-accumulate, min-max, average-add etc. as
"basic primitives".

## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT is astonishing.
The Wikipedia page lists over a hundred separate and distinct areas: Audio, Video,
Radar, Baseband processing, AI, Reed-Solomon Error Correction, and the list goes on.
ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs
have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear that this is high priority.

With SVP64 [[sv/remap]] providing the Loop Schedules, it falls to the Scalar side of
the ISA to add the prerequisite "Twin Butterfly" operations. These typically perform,
for example, one multiply, then in-place subtract that product from one operand and
add it to the other.  The *in-place* aspect is strategically extremely important
for significant reductions in Vectorised register usage, particularly for DCT.

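The idea can be sketched in Python as a conventional radix-2
decimation-in-time butterfly: one multiply, whose product is then both added
and subtracted, with the two results overwriting the two inputs. This is a
textbook form given under stated assumptions, not the definition of the
proposed instructions, whose operand roles may differ. The names
`twin_butterfly` and `fft_in_place` are hypothetical. The three nested loops
stand in for the Loop Schedules that the text assigns to [[sv/remap]].

```python
import cmath


def twin_butterfly(a, b, w):
    """One multiply, then the sum and difference replace both inputs."""
    t = w * b
    return a + t, a - t


def fft_in_place(x):
    """Iterative radix-2 FFT over a list whose length is a power of two."""
    n = len(x)
    # Bit-reversal permutation, so the butterflies can then run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Triple loop: stage size, block start, butterfly index within block.
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                lo, hi = start + k, start + k + half
                # Both results overwrite their inputs: no extra storage.
                x[lo], x[hi] = twin_butterfly(x[lo], x[hi], w)
        size *= 2
    return x
```

Because each butterfly writes back to the two elements it read, the whole
transform needs no second buffer, which is the register-usage saving the
paragraph above refers to.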
## CR Weird group

Outlined in [[sv/cr_int_predication]], these instructions massively save on CR-Field
instruction count.  Multi-bit to single-bit transfers and vice-versa, which normally
require several CR-ops (crand, crxor), are done in one single instruction.  The reason
for their addition is down to SVP64 overloading CR Fields as Vector Predicate Masks.
Reducing instruction count in hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors, but adding such
instructions to the *Condition Register* side was deemed to be far too much.
Therefore, priority was given instead to transferring several CR Field bits into
GPRs, whereupon the full set of standard Scalar GPR Logical Operations may be used.
This strategy has the side-effect of keeping the CRweird group down to only five
instructions.

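The strategy of moving CR Field bits into a GPR and then using ordinary
integer operations can be sketched as follows. This is a simplified Python
model, not the semantics or encoding of any instruction in
[[sv/cr_int_predication]]. The helper `gather_cr_bits`, the list-of-fields
representation and the choice of mapping field `i` to result bit `i` are all
assumptions made for illustration.

```python
# One 4-bit CR Field, most significant bit first: LT, GT, EQ, SO.
LT, GT, EQ, SO = 0b1000, 0b0100, 0b0010, 0b0001


def gather_cr_bits(cr_fields, mask):
    """Hypothetical transfer: one selected bit per CR Field into an integer.

    Bit i of the result is set when field i has the masked bit set.
    """
    gpr = 0
    for i, field in enumerate(cr_fields):
        if field & mask:
            gpr |= 1 << i
    return gpr


def popcount(value):
    """Ordinary integer popcount, standing in for a Scalar GPR operation."""
    return bin(value).count("1")


# Eight CR Fields acting as a predicate mask: count the elements whose
# comparison result was "equal".
cr_fields = [EQ, LT, EQ, GT, EQ | SO, LT, GT, EQ]
eq_mask = gather_cr_bits(cr_fields, EQ)   # fields 0, 2, 4 and 7 are set
active = popcount(eq_mask)
```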
[[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]

[[!tag opf_rfc]]