# External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

* <https://git.openpower.foundation/isa/PowerISA/issues/121>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1052>

The purpose of this RFC is to give a full list of the upcoming Scalar
opcodes developed by Libre-SOC, to formally agree a priority order, to
decide which ones should be in EXT022 Sandbox, and to give IBM a clear
picture of the Opcode Allocation needs. As this is a Formal ISA RFC, the
evaluation shall define (in advance of the actual submission of the
instructions themselves) which instructions should be submitted over the
next 18 months.

*It is expected that readers visit and interact with the Libre-SOC resources
in order to do due-diligence on the prioritisation evaluation. Otherwise
the ISA WG risks being overwhelmed by piecemeal RFCs that may turn out not
to be useful, against a background of having no guiding overview*.

It is worth bearing in mind during evaluation that every "Defined
Word" may or may not be Vectoriseable, but that every "Defined Word"
should have merits on its own, not just when Vectorised. An example
of a borderline Vectoriseable Defined Word is `mv.swizzle`, which
only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads,
and has less merit as a Scalar-only operation.

Power ISA Scalar (SFFS) has not been significantly advanced in 12 years:
IBM's primary focus has understandably been on PackedSIMD VSX.
Unfortunately, with VSX being 914 instructions and 128-bit, it is far too much for any
new team to consider (10 years development effort) and far outside of
Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing Power Scalar
up to date with modern standards is a reasonable goal, and the advantage is
that lessons can be learned from other ISAs.

SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
as well as "True-Scalable-Vector Prefixing" - also literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined Words",
their value when Vector-Prefixed, *as well as* when SVP64Single-Prefixed,
unavoidably has to be taken into consideration at the same time.

**Target areas**

Whilst entirely general-purpose, there are some categories that
these instructions are targeting: Bit-manipulation, Big-integer,
Cryptography, Audio/Visual, High-Performance Compute, GPU workloads
and DSP.

**Instruction count guide and approximate priority order**

* 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
* 5 - CR weirds [[sv/cr_int_predication]]
* 4 - INT<->FP mv [[ls006]]
* 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
* ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
* 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
* 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
* 6 - Bitmanip LUT2/3 operations: high cost, high reward [[sv/bitmanip]]
* 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
* 5 - Audio-Video [[sv/av_opcodes]]
* 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish)
* 2 - BMI group [[sv/vector_ops]]
* 2 - GPU swizzle [[sv/mv.swizzle]]
* 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly
* 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
* 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
* 25 - Transcendentals (2-arg) [[openpower/transcendentals]]

Summary tables are provided below, sorted by different categories. Additional
columns can be requested as necessary, to be added in update revisions
to this RFC.

# Target Area summaries

## Transcendentals

Found at [[openpower/transcendentals]], these subdivide into three groups:
high-priority operations for accelerating general-purpose and High-Performance
Compute, specialist 3D GPU operations suited to 3D visualisation, and
low-priority, less common instructions where IEEE754 full bit-accuracy is
paramount. In 3D GPU scenarios, for example, even 12-bit accuracy can be
overkill, but for HPC Scientific scenarios 12-bit would be disastrous.

## Audio/Video

Found at [[sv/av_opcodes]], these do not require Saturated variants because Saturation
is added via [[sv/svp64]] (Vector Prefixing) and via [[sv/svp64_single]] (Scalar
Prefixing). This is important to note for Opcode Allocation because placing these
operations in the UnVectoriseable areas would irredeemably damage their value.
Unlike PackedSIMD ISAs, the actual number of AV Opcodes is remarkably small once
the usual cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO)
are abstracted out to RISC-paradigm Prefixing, leaving just absolute-diff-accumulate,
min-max, average-add etc. as "basic primitives".

## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT is astonishing.
The Wikipedia page lists over a hundred separate and distinct areas: Audio, Video,
Radar, Baseband processing, AI, Reed-Solomon Error Correction, the list goes on and on.
ARM has special dedicated Integer Twin-butterfly instructions. TI's MSP Series DSPs
have had FFT Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear this is high priority.

With SVP64 [[sv/remap]] providing the Loop Schedules, it falls to the Scalar side of
the ISA to add the prerequisite "Twin Butterfly" operations, typically performing,
for example, one multiply, then in-place subtracting that product from one operand and
adding it to the other. The *in-place* aspect is strategically extremely important
for significant reductions in Vectorised register usage, particularly for DCT.

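The in-place update described above can be illustrated with a minimal Python sketch. It uses the textbook radix-2 Cooley-Tukey butterfly rather than the exact definition of any proposed instruction, and the names `butterfly` and `fft_in_place` are hypothetical. The three nested loops stand in for the Loop Schedule that the text attributes to REMAP, and `butterfly` stands in for the Scalar operation.

```python
import cmath

def butterfly(x, i, j, w):
    """One multiply, then sum and difference overwrite the two inputs in place."""
    t = w * x[j]                          # the single multiply
    x[i], x[j] = x[i] + t, x[i] - t       # no extra result storage is needed

def fft_in_place(x):
    """Iterative radix-2 FFT over a list whose length is a power of two."""
    n = len(x)
    # Bit-reversal permutation so that the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Triple loop: stage size, block start, position within the block.
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                butterfly(x, start + k, start + k + half, w)
        size *= 2
    return x
```

Because each butterfly overwrites its own two inputs, the whole transform needs no working storage beyond the data array itself, which is the register-usage saving that the text highlights.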
## CR Weird group

Outlined in [[sv/cr_int_predication]], these instructions massively save on CR-Field
instruction count. Multi-bit to single-bit conversions and vice-versa, which normally
require several CR-ops (crand, crxor), are done in one single instruction. The reason
for their addition is down to SVP64 overloading CR Fields as Vector Predicate Masks.
Reducing instruction count in hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors, but adding such instructions
to the *Condition Register* side was deemed to be far too much. Therefore, priority
was given instead to transferring several CR Field bits into GPRs, whereupon
the full set of standard Scalar GPR Logical Operations may be used. This strategy
has the side-effect of keeping the CRweird group down to only five instructions.

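The transfer-then-popcount strategy can be sketched in Python as follows. This is an illustration of the idea only and does not reproduce the semantics of any particular CRweird instruction. The helper name `cr_field_bits_to_mask` is hypothetical, and placing CR Field 0 in bit 0 of the result is an assumption made for the example. The only architectural fact relied on is that the 32-bit CR holds eight 4-bit Fields, each ordered LT, GT, EQ, SO from most to least significant bit.

```python
def cr_field_bits_to_mask(cr, bit_in_field):
    """Gather one bit (0=LT, 1=GT, 2=EQ, 3=SO) from each of the eight CR Fields.

    Returns an 8-bit integer; bit f of the result comes from CR Field f
    (an ordering assumed here purely for illustration).
    """
    mask = 0
    for field in range(8):
        nibble = (cr >> (28 - 4 * field)) & 0xF      # CR0 is the top nibble
        bit = (nibble >> (3 - bit_in_field)) & 1     # LT is the nibble's MSB
        mask |= bit << field
    return mask

def popcount(value):
    """Count set bits, standing in for an ordinary Scalar GPR popcount."""
    return bin(value).count("1")

# EQ is set in CR0, CR2 and CR7 only.
eq_mask = cr_field_bits_to_mask(0x20200002, 2)
active_elements = popcount(eq_mask)
```

Once the predicate bits sit in a GPR-like integer, counting or combining them needs nothing beyond ordinary integer operations.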
## Big-integer Math

[[sv/biginteger]] has always been a high-priority area for commercial applications, privacy
and Banking, as well as HPC Numerical Accuracy: libgmp, as well as cryptographic uses
in Asymmetric Ciphers. poly1305 and ec25519 are finding their way into everyday
use via OpenSSL.

A very early variant of the Power ISA had a 32-bit Carry-in Carry-out SPR. Its
removal from subsequent revisions is regrettable. An alternative concept is
to add six explicit 3-in 2-out operations that, on close inspection, always
turn out to be supersets of *existing Scalar operations* that discard upper
or lower DWords, or parts thereof.

*Thus it is critical to note that not one single one of these operations
expands the bitwidth of any existing Scalar pipelines*.

The `dsld` instruction, for example, merely places additional LSBs into the 64-bit
shift (64-bit carry-in), and then places the (normally discarded) MSBs into the second
output register (64-bit carry-out). It does **not** require a 128-bit shifter to
replace the existing Scalar Power ISA 64-bit shifters.

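A minimal Python sketch of the behaviour just described is given below. It is written from the prose above, not from the normative instruction definition, and the function names and argument order are hypothetical. It uses only a 64-bit rotate and two masks, to show why no 128-bit shifter is needed. Chaining the carry-out of one limb into the carry-in of the next gives a multi-limb left shift.

```python
MASK64 = (1 << 64) - 1

def dsld_sketch(ra, sh, carry_in):
    """Shift ra left by sh (0..63); return (result, carry_out), both 64-bit."""
    ra &= MASK64
    sh &= 63
    rot = ((ra << sh) | (ra >> (64 - sh))) & MASK64   # plain 64-bit rotate left
    low = (1 << sh) - 1                               # selects the sh lowest bits
    result = (rot & ~low & MASK64) | (carry_in & low) # carry-in fills the LSBs
    carry_out = rot & low                             # MSBs normally discarded
    return result, carry_out

def shift_left_limbs(limbs, sh):
    """Left-shift a big integer held as 64-bit limbs, least significant first."""
    carry = 0
    out = []
    for limb in limbs:
        shifted, carry = dsld_sketch(limb, sh, carry)
        out.append(shifted)
    out.append(carry)      # the final carry-out becomes a new top limb
    return out
```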
The reduction in instruction count that these operations bring, in critical hot-loops,
is remarkably high, to the extent that a Scalar-to-Vector operation of
*arbitrary length* becomes just the one Vector-Prefixed instruction.

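As an illustration of such a Scalar-to-Vector operation, the sketch below multiplies a big integer of arbitrary length by one 64-bit scalar, using a generic 3-in 2-out multiply-add. The helper `muladd_3in_2out` is hypothetical and is not claimed to match the definition of any specific proposed instruction. It only shows the pattern of keeping the upper DWord, which existing Scalar multiplies discard, as a 64-bit carry. The Python loop stands in for what the text says would be a single Vector-Prefixed instruction.

```python
MASK64 = (1 << 64) - 1

def muladd_3in_2out(ra, rb, rc):
    """Return (low, high) 64-bit halves of ra * rb + rc for 64-bit inputs.

    The sum cannot overflow 128 bits: (2**64 - 1)**2 + (2**64 - 1) < 2**128.
    """
    total = (ra & MASK64) * (rb & MASK64) + (rc & MASK64)
    return total & MASK64, total >> 64

def bigint_times_scalar(limbs, scalar):
    """Multiply 64-bit limbs (least significant first) by a 64-bit scalar."""
    carry = 0
    out = []
    for limb in limbs:
        low, carry = muladd_3in_2out(limb, scalar, carry)  # carry chains onward
        out.append(low)
    out.append(carry)      # the last carry is the most significant limb
    return out
```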
Whilst these are 5-6 bit XO, their utility is considered to be of high strategic value,
and as such they are strongly advocated to be in EXT04. The alternative is to bring
back a 64-bit Carry SPR, but how it could be retrospectively applied to pre-existing Scalar
Power ISA multiply, divide, and shift operations at this late stage of maturity of
the Power ISA is an entire area of research on its own, deemed unlikely to be
achievable.

[[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]

[[!tag opf_rfc]]