# External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops

* <https://git.openpower.foundation/isa/PowerISA/issues/121>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1052>

The purpose of this RFC is:

* to give a full list of the upcoming Scalar opcodes developed by Libre-SOC
  (respecting that *all* of them are Vectoriseable)
* to formally agree a priority order on an iterative basis with new versions
  of this RFC,
* to decide which ones should be in EXT022 Sandbox, which in EXT0xx, which
  in EXT2xx,
* and for IBM to get a clear picture of the Opcode Allocation needs.

As this is a Formal ISA RFC the evaluation shall ultimately define
(in advance of the actual submission of the instructions themselves)
which instructions will be submitted over the next 8-18 months.

*It is expected that readers visit and interact with the Libre-SOC
resources in order to do due-diligence on the prioritisation
evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs
that may turn out not to be useful, against a background of having
no guiding overview or pre-filtering, and everybody's precious time
is wasted. Also note that the Libre-SOC Team, being funded by NLnet
under Privacy and Enhanced Trust Grants, are **prohibited** from signing
Commercial-Confidentiality NDAs, as doing so is a direct conflict of
interest with their funding body's Charitable Foundation Status and
remit*.

It is worth bearing in mind during evaluation that every "Defined Word"
may or may not be Vectoriseable, but that every "Defined Word" should have
merits on its own, not just when Vectorised. An example of a borderline
Vectoriseable Defined Word is `mv.swizzle` which only really becomes
high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
less merit as a Scalar-only operation.

Power ISA Scalar (SFFS) has not been significantly advanced in 12
years: IBM's primary focus has understandably been on PackedSIMD VSX.
Unfortunately, with VSX being 914 instructions and 128-bit it is far too
much for any new team to consider (10 years development effort) and far
outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
Power Scalar up-to-date to modern standards *and on its own merits*
is a reasonable goal, and the advantages of the reduced focus are that
SFFS remains RISC-paradigm, and that lessons can be learned from other
ISAs from the intervening years. Good examples here include `bmask`.

SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
as well as "True-Scalable-Vector Prefixing" - also literally brings new
dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed,
unavoidably has to be taken into consideration at the same time.

**Target areas**

Whilst entirely general-purpose there are some categories that these
instructions are targeting: Bit-manipulation, Big-integer, Cryptography,
Audio/Visual, High-Performance Compute, GPU workloads and DSP.

**Instruction count guide and approximate priority order**

* 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
* 5 - CR weirds [[sv/cr_int_predication]]
* 4 - INT<->FP mv [[ls006]]
* 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
* ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
* 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
* 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
* 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
* 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
* 5 - Audio-Video [[sv/av_opcodes]]
* 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish) [[ls004]]
* 2 - BMI group [[sv/vector_ops]]
* 2 - GPU swizzle [[sv/mv.swizzle]]
* 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
* ~9 - Integer DCT/FFT Butterfly <https://bugs.libre-soc.org/show_bug.cgi?id=1028>
* 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
* 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
* 25 - Transcendentals (2-arg) [[openpower/transcendentals]]

Summary tables are created below by different sort categories. Additional
columns can be requested as part of update revisions to this RFC.

# Target Area summaries

## SVP64 Management instructions

These without question have to go in EXT0xx. Future extended variants,
bringing even more powerful capabilities, can be followed up later with
EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
*Only `svstep` is actually Vectoriseable*, all other Management
instructions are UnVectoriseable. PO1-Prefixed examples include adding
psvshape in order to support both Inner and Outer Product Matrix
Schedules, by providing the option to directly reverse the order of the
triple loops. Outer is used for standard Matrix Multiply, but Inner is
required for Warshall Transitive Closure (on top of a cumulatively-applied
max instruction).

The Management Instructions themselves are all Scalar Operations, so
PO1-Prefixing is perfectly reasonable. SVP64 Management instructions, of
which there are only 6, are all 5 or 6 bit XO, meaning that the opcode
space they take up in EXT0xx is not alarmingly high for their intrinsic
strategic value.

## Transcendentals

Found at [[openpower/transcendentals]] these subdivide into high
priority for accelerating general-purpose and High-Performance Compute,
specialist 3D GPU operations suited to 3D visualisation, and low-priority
less common instructions where IEEE754 full bit-accuracy is paramount.
In 3D GPU scenarios for example even 12-bit accuracy can be overkill,
but for HPC Scientific scenarios 12-bit would be disastrous.

There are a **lot** of operations here, and they also bring Power
ISA up-to-date with IEEE754-2019. Fortunately the number of critical
instructions is quite low, but the caveat is that if those operations
are utilised to synthesise other IEEE754 operations (divide by `pi` for
example) full bit-level accuracy (a hard requirement for IEEE754) is lost.
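
The accuracy loss from synthesising one operation out of another can be
seen with a minimal Python sketch. The function names are hypothetical
and `sinpi_reduced` is only an illustration of exact argument reduction,
not a correctly-rounded IEEE754 `sinPi`: the point is merely that scaling
by an already-rounded `pi` cannot recover the exact zero at integer
arguments.

```python
import math


def sinpi_naive(x: float) -> float:
    """sin(pi*x) synthesised from plain sin: pi is rounded before use."""
    return math.sin(math.pi * x)


def sinpi_reduced(x: float) -> float:
    """Illustrative only: reduce the argument exactly, then scale by pi."""
    r = math.fmod(x, 2.0)      # fmod is exact in IEEE754 arithmetic
    if r == math.floor(r):     # integer arguments give an exact zero
        return 0.0
    return math.sin(math.pi * r)


print(sinpi_naive(1.0))    # small but non-zero
print(sinpi_reduced(1.0))  # 0.0
```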

Also worth noting is that the Khronos Group defines minimum acceptable
bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
accuracy demanded by IEEE754. The reason for the Khronos definitions is
a massive reduction, often four-fold, in power consumption and gate count
when 3D Graphics simply has no need for full accuracy.

*For 3D GPU markets this definitely needs addressing*

## Audio/Video

Found at [[sv/av_opcodes]] these do not require Saturated variants
because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
[[sv/svp64_single]] Scalar Prefixing. This is important to note for
Opcode Allocation because placing these operations in the UnVectoriseable
areas would irredeemably damage their value. Unlike PackedSIMD ISAs
the actual number of AV Opcodes is remarkably small once the usual
cascading-option-multipliers (SIMD width, bitwidth, saturation,
HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".
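
The separation of "basic primitive" from Saturation and element width
can be sketched as follows. This is a Python model under stated
assumptions: the function names are hypothetical, the rounding form
`(a + b + 1) >> 1` for average-add is assumed rather than taken from
this RFC, and `saturate` merely stands in for what the Prefix would
supply.

```python
def absdiff_acc(acc: int, a: int, b: int) -> int:
    """Primitive: accumulate the absolute difference of a and b."""
    return acc + abs(a - b)


def avg_add(a: int, b: int) -> int:
    """Primitive: rounded average (assumed form, see lead-in)."""
    return (a + b + 1) >> 1


def saturate(value: int, bits: int, signed: bool = False) -> int:
    """Stand-in for Prefix-supplied saturation at a chosen element width."""
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    return max(lo, min(hi, value))


def sum_abs_diff(row_a, row_b, bits: int = 8) -> int:
    """Saturating sum-of-absolute-differences built from the primitive."""
    acc = 0
    for a, b in zip(row_a, row_b):
        acc = saturate(absdiff_acc(acc, a, b), bits)
    return acc


print(sum_abs_diff([10, 200, 30], [12, 100, 35]))  # 107
print(sum_abs_diff([0, 0], [200, 200]))            # 255 (saturated)
```

Width and saturation are parameters of the wrapper, not of the
primitive, which is the reason one opcode can replace a whole family of
PackedSIMD variants.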

## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV

The number of uses in Computer Science for DCT, NTT, FFT and DFT
is astonishing. The Wikipedia page lists over a hundred separate and
distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon
Error Correction, the list goes on and on. ARM has special dedicated
Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT
Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
DSP can do full FFT triple loops in one VLIW group.

It should be pretty clear this is high priority.

With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
operations, typically performing for example one multiply but in-place
subtracting that product from one operand and adding it to the other.
The *in-place* aspect is strategically extremely important for significant
reductions in Vectorised register usage, particularly for DCT.
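
As an illustration of why one multiply with two in-place results is the
key primitive, here is a textbook radix-2 Cooley-Tukey FFT in Python.
It is a sketch only: `twin_butterfly` is a hypothetical name for the
standard butterfly and is not claimed to match the exact operand
semantics of the proposed instructions, and the triple loop stands in
for what a REMAP Schedule would generate.

```python
import cmath


def twin_butterfly(a: complex, b: complex, w: complex):
    """One multiply, two results: (a + w*b, a - w*b)."""
    p = w * b
    return a + p, a - p


def fft_inplace(x: list) -> None:
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    j = 0
    for i in range(1, n):          # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    size = 2
    while size <= n:               # triple loop: stage, block, butterfly
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                lo, hi = start + k, start + k + half
                # both results overwrite their inputs: no extra storage
                x[lo], x[hi] = twin_butterfly(x[lo], x[hi], w)
        size *= 2
```

Every element is read once and written once per stage, so the working
set never grows beyond the input vector itself.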

## CR Weird group

Outlined in [[sv/cr_int_predication]] these instructions massively save
on CR-Field instruction count. Multi-bit to single-bit and vice-versa,
normally requiring several CR-ops (crand, crxor), are done in one single
instruction. The reason for their addition is down to SVP64 overloading
CR Fields as Vector Predicate Masks. Reducing instruction count in
hot-loops is considered high priority.

An additional need is to do popcount on CR Field bit vectors but adding
such instructions to the *Condition Register* side was deemed to be far
too much. Therefore, priority was given instead to transferring several
CR Field bits into GPRs, whereupon the full set of standard Scalar GPR
Logical Operations may be used. This strategy has the side-effect of
keeping the CRweird group down to only five instructions.
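
The strategy of moving CR Field bits into a GPR and then using ordinary
integer operations can be modelled in Python as below. The helper names
are hypothetical and the sketch does not reproduce any specific CRweird
instruction. It only assumes the standard Power ISA layout of eight
4-bit CR Fields, each holding LT, GT, EQ, SO from most to least
significant bit, with CR0 in the most significant nibble.

```python
CR_BIT = {"lt": 0, "gt": 1, "eq": 2, "so": 3}  # MSB-first position in a field


def cr_field(cr: int, n: int) -> int:
    """Return 4-bit CR Field n of a 32-bit CR (CR0 is the top nibble)."""
    return (cr >> (4 * (7 - n))) & 0xF


def gather_cr_bit(cr: int, name: str) -> int:
    """Collect one named bit from each of the 8 CR Fields into an 8-bit
    integer, CR0 ending up in the most significant position."""
    sel = 3 - CR_BIT[name]
    mask = 0
    for n in range(8):
        mask = (mask << 1) | ((cr_field(cr, n) >> sel) & 1)
    return mask


def popcount(x: int) -> int:
    """Ordinary GPR-side population count."""
    return bin(x).count("1")


cr = 0x28420002                      # CR0=EQ CR1=LT CR2=GT CR3=EQ CR7=EQ
eq_mask = gather_cr_bit(cr, "eq")
print(bin(eq_mask), popcount(eq_mask))  # 0b10010001 3
```

When the CR Fields are used as a Vector Predicate Mask, the population
count of the gathered bits is the number of active elements.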

## Big-integer Math

[[sv/biginteger]] has always been a high priority area for commercial
applications, privacy, Banking, as well as HPC Numerical Accuracy:
libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
and ec25519 are finding their way into everyday use via OpenSSL.

A very early variant of the Power ISA had a 32-bit Carry-in Carry-out
SPR. Its removal from subsequent revisions is regrettable. An alternative
concept is to add six explicit 3-in 2-out operations that, on close
inspection, always turn out to be supersets of *existing Scalar
operations* that discard upper or lower DWords, or parts thereof.

*Thus it is critical to note that not one single one of these operations
expands the bitwidth of any existing Scalar pipelines*.

The `dsld` instruction for example merely places additional LSBs into the
64-bit shift (64-bit carry-in), and then places the (normally discarded)
MSBs into the second output register (64-bit carry-out). It does **not**
require a 128-bit shifter to replace the existing Scalar Power ISA
64-bit shifters.
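
The carry-in and carry-out behaviour described for `dsld` can be
sketched in Python. This is a model written from the description above,
not from the instruction's formal pseudocode: the function names are
hypothetical, and it is an assumption that only the low `n` bits of the
carry-in are merged.

```python
MASK64 = (1 << 64) - 1


def shift_left_carry(a: int, n: int, carry_in: int):
    """64-bit left shift by n (0..63) with carry-in and carry-out.

    Returns (result, carry_out). No intermediate exceeds 64 bits.
    """
    a &= MASK64
    result = ((a << n) & MASK64) | (carry_in & ((1 << n) - 1))
    carry_out = a >> (64 - n)      # the MSBs a plain shift would discard
    return result, carry_out


def bigint_shl(limbs, n: int):
    """Shift a little-endian list of 64-bit limbs left by n bits."""
    out, carry = [], 0
    for limb in limbs:
        shifted, carry = shift_left_carry(limb, n, carry)
        out.append(shifted)
    out.append(carry)
    return out


def from_limbs(limbs) -> int:
    """Recombine little-endian 64-bit limbs into one Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

Each limb's carry-out feeds the next limb's carry-in, so a shift of any
length is a simple chain of the same two-output operation.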

The reduction in instruction count these operations bring, in critical
hotloops, is remarkably high, to the extent where a Scalar-to-Vector
operation of *arbitrary length* becomes just the one Vector-Prefixed
instruction.
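
A Scalar-to-Vector operation of this kind can be illustrated with a
chained multiply-and-add in Python. This is a sketch with hypothetical
function names, assuming a 3-in 2-out operation that returns both halves
of `a*b + carry_in`, where an existing Scalar multiply would keep only
one half. The loop body is what a single Vector-Prefixed instruction
would repeat.

```python
MASK64 = (1 << 64) - 1


def mul_add_carry(a: int, b: int, carry_in: int):
    """3-in 2-out: return (low, high) 64-bit halves of a*b + carry_in.

    The sum always fits in 128 bits: (2**64-1)**2 + (2**64-1) < 2**128.
    """
    wide = a * b + carry_in
    return wide & MASK64, wide >> 64


def bigint_mul_scalar(limbs, scalar: int):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar."""
    out, carry = [], 0
    for limb in limbs:
        low, carry = mul_add_carry(limb, scalar, carry)
        out.append(low)
    out.append(carry)
    return out


def from_limbs(limbs) -> int:
    """Recombine little-endian 64-bit limbs into one Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```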

Whilst these are 5-6 bit XO, they are considered to be of high strategic
value and as such are strongly advocated to be in EXT04. The alternative
is to bring back a 64-bit Carry SPR, but how it is retrospectively
applicable to pre-existing Scalar Power ISA multiply, divide, and shift
operations at this late stage of maturity of the Power ISA is an entire
area of research on its own, deemed unlikely to be achievable.

## fclass and GPR-FPR moves

[[sv/fclass]] - just one instruction. With SFFS being locked down to
exclude VSX, and there being no desire within the nascent OpenPOWER
ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it
becomes necessary to upgrade SFFS such that it is stand-alone capable. One
omission based on the assumption that VSX would always be present is an
equivalent to `xvtstdcsp`.

Similar arguments apply to the GPR-FPR move operations, proposed in
[[ls006]], with the opportunity taken to add rounding modes present
in other ISAs that Power ISA VSX PackedSIMD does not have. JavaScript
rounding, one of the worst offenders of Computer Science, requires a
phenomenal 35 instructions with *six branches* to emulate in Power
ISA! For desktop as well as Server HTML/JS back-end execution of
JavaScript this becomes an obvious priority, recognised already by ARM
as just one example.
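
To show why this conversion is awkward to emulate, here is a Python
sketch that assumes "JavaScript rounding" refers to the ECMAScript
`ToInt32` conversion (truncate toward zero, wrap modulo 2^32, NaN and
infinities become zero). The function name is hypothetical and the code
models the arithmetic only, not the proposed instruction.

```python
import math


def js_to_int32(x: float) -> int:
    """Model of ECMAScript ToInt32 applied to a double."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) % (1 << 32)        # truncate, then wrap modulo 2**32
    return n - (1 << 32) if n >= (1 << 31) else n


print(js_to_int32(-1.5))           # -1
print(js_to_int32(2.0 ** 32 + 5))  # 5
print(js_to_int32(2.0 ** 31))      # -2147483648
```

The special cases for NaN, infinity, out-of-range wrapping and sign
correction are the kind of thing that turns into branches when no
single instruction provides the behaviour.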

## Bitmanip LUT2/3

These LUT2/3 operations are high cost high reward. Outlined in
[[sv/bitmanip]], the simplest ones already exist in PackedSIMD VSX:
`xxeval`. The same reasoning applies as to fclass: SFFS needs to be
stand-alone on its own merits and not "punished" should an implementor
choose not to implement any aspect of PackedSIMD VSX.
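
A ternary lookup-table operation of the `xxeval` kind can be modelled
bit by bit in Python. This is a sketch: the function name and the
convention that input `a` forms the most significant bit of the table
index are assumptions, chosen only for illustration.

```python
def lut3(a: int, b: int, c: int, imm8: int, width: int = 64) -> int:
    """Per-bit 3-input lookup: the 8-bit immediate is the truth table."""
    result = 0
    for i in range(width):
        index = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> index) & 1) << i
    return result


a, b, c = 0b1100, 0b1010, 0b0110
print(bin(lut3(a, b, c, 0x96)))  # three-way XOR: 0b0
print(bin(lut3(a, b, c, 0xE8)))  # majority:      0b1110
print(bin(lut3(a, b, c, 0xCA)))  # a ? b : c:     0b1010
```

One immediate selects any of the 256 possible three-input boolean
functions, which is where the "high reward" comes from.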

With Predication being such a high priority in GPUs and HPC, CR Field
variants of Ternary and Binary LUT instructions were considered high
priority, and again just like in the CRweird group the opportunity was
taken to work on *all* bits of a CR Field rather than just one bit as
is done with the existing CR operations crand, cror etc.

The other high strategic value instruction is `grevlut` (and `grevluti`
which can generate a remarkably large number of regular-patterned magic
constants). The grevlut set requires of the order of 20,000 gates but
provides an astonishing plethora of innovative bit-permuting instructions
never seen in any other ISA.

The downside of all of these instructions is the extremely low XO bit
requirements: 2-3 bit XO due to the large immediates *and* the number of
operands required. The LUT3 instructions are already compacted down to
"Overwrite" variants. (By contrast the Float-Load-Immediate instructions
are a much larger XO because despite having a 16-bit immediate only one
Register Operand is needed).

Realistically these high-value instructions should be proposed in EXT2xx
where their XO cost does not overwhelm EXT0xx.
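
The XO cost argument is simple arithmetic: an instruction whose XO field
is `k` bits wide occupies `1/2**k` of one Primary Opcode. A small Python
sketch follows. The function name is hypothetical, and the XO widths are
the ones quoted in this RFC.

```python
from fractions import Fraction


def po_fraction(xo_bits: int) -> Fraction:
    """Share of one Primary Opcode used by one instruction whose
    XO field is xo_bits wide."""
    return Fraction(1, 2 ** xo_bits)


for label, bits in [("2-bit XO", 2), ("3-bit XO", 3),
                    ("5-bit XO", 5), ("6-bit XO", 6), ("10-bit XO", 10)]:
    print(f"{label}: {po_fraction(bits)} of a Primary Opcode")
```

A single 2-bit XO instruction therefore costs as much encoding space as
256 instructions with a 10-bit XO.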

## (f)mv.swizzle

[[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value
as a Scalar instruction is limited *except* if combined with `cmpi` and
SVP64Single Predication, whereupon the end result is the RISC-synthesis
of Compare-and-Swap, in two instructions.

Where this instruction comes into its full value is when Vectorised.
3D GPU and HPC numerical workloads astonishingly contain between 10 and 15%
swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded
GPU ISAs have a staggering 24-bits dedicated to 2-operand Swizzle.
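
For readers unfamiliar with the term, a swizzle is a selection and
reordering of the components of a short vector. A minimal Python
illustration with hypothetical names, and no claim about how
`mv.swizzle` encodes its selector:

```python
LANES = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec, pattern: str):
    """Return the components of vec selected by pattern, e.g. "YYZ"."""
    return [vec[LANES[ch]] for ch in pattern.lower()]


xyzw = [1.0, 2.0, 3.0, 4.0]
print(swizzle(xyzw, "YYZ"))   # [2.0, 2.0, 3.0]
print(swizzle(xyzw, "XY"))    # [1.0, 2.0]
print(swizzle(xyzw, "WZYX"))  # [4.0, 3.0, 2.0, 1.0]
```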

So as not to radicalise the Power ISA the Libre-SOC team decided to
introduce mv Swizzle operations, which can always be Macro-op fused
in exactly the same way that ARM SVE predicated-move extends 3-operand
"overwrite" opcodes to full independent 3-in 1-out.

## BMI (bitmanipulation) group

Whilst the [[sv/vector_ops]] instructions are only two in number, in
reality the `bmask` instruction has a Mode field allowing it to cover
**24** instructions, more than have been added to any other CPUs by
ARM, Intel or AMD. Analysis of the BMI sets of these CPUs shows simple
patterns that can greatly simplify both Decode and implementation. These
are sufficiently commonly used, saving instruction count regularly,
that they justify going into EXT0xx.
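
The "simple patterns" can be seen by writing several well-known x86
BMI1/TBM operations as one parameterised function. This Python sketch
is an illustration of the pattern only: `bit_pattern` and its arguments
are hypothetical and do not reproduce the actual `bmask` Mode encoding.

```python
import operator

MASK64 = (1 << 64) - 1


def bit_pattern(x: int, delta: int, op, invert_x=False, invert_y=False) -> int:
    """Combine x with x+delta (delta is +1 or -1), each optionally
    inverted, using AND, OR or XOR."""
    y = x + delta
    if invert_x:
        x = ~x
    if invert_y:
        y = ~y
    return op(x, y) & MASK64


x = 0b10100
print(bin(bit_pattern(x, -1, operator.and_)))                 # BLSR:   0b10000
print(bin(bit_pattern(x, -1, operator.and_, invert_y=True)))  # BLSI:   0b100
print(bin(bit_pattern(x, -1, operator.xor)))                  # BLSMSK: 0b111
print(bin(bit_pattern(x, -1, operator.and_, invert_x=True)))  # TZMSK:  0b11
print(bin(bit_pattern(0b10011, +1, operator.or_)))            # BLCS:   0b10111
```

Two choices of `delta`, two optional inversions and three logical
operations already give 24 combinations from one small datapath.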

The other instruction is `cprop` - Carry-Propagation - which takes
the P and Q from carry-propagation algorithms and generates carry
look-ahead. It greatly increases the efficiency of arbitrary-precision
integer arithmetic by combining what would otherwise be half a dozen
instructions into one. However it is still not a huge priority, unlike
`bmask`, so is probably best placed in EXT2xx.

## Float-Load-Immediate

Very easily justified. As explained in [[ls002]] these always save one
LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
FP value being in the I-Cache side. It is such a high priority that
these instructions are easily justifiable adding into EXT0xx, despite
requiring a 16-bit immediate. By designing the second-half instruction
as a Read-Modify-Write it saves on XO bitlength (only 5 bits), and can be
macro-op fused with its first-half to store a full IEEE754 FP32 immediate
into a register.
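
The two-halves scheme can be modelled in Python with `struct`. This is
a sketch under assumptions: the function names are hypothetical, and it
is assumed that the first half supplies the upper 16 bits of the FP32
pattern (with the lower 16 zeroed) while the Read-Modify-Write second
half fills in the lower 16 bits.

```python
import struct


def fp32_bits(value: float) -> int:
    """Bit pattern of value when stored as IEEE754 FP32."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def bits_fp32(bits: int) -> float:
    """Value of an IEEE754 FP32 bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def load_upper(imm16: int) -> float:
    """First half: imm16 becomes the upper 16 bits, lower 16 are zero."""
    return bits_fp32((imm16 & 0xFFFF) << 16)


def insert_lower(reg: float, imm16: int) -> float:
    """Second half (Read-Modify-Write): keep the upper 16 bits of the
    register and replace the lower 16 with imm16."""
    return bits_fp32((fp32_bits(reg) & 0xFFFF0000) | (imm16 & 0xFFFF))


reg = load_upper(0x4049)         # 3.140625: already a usable constant
reg = insert_lower(reg, 0x0FDB)  # now the FP32 nearest to pi
print(reg, hex(fp32_bits(reg)))
```

Many common constants (1.0, 0.5, small integers) need only the first
half, which is why a single 32-bit instruction is often enough.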

There is little point in putting these instructions into EXT2xx. Their
very benefit and inherent value *is* as 32-bit instructions, not 64-bit
ones. Likewise there is less value in taking up EXT1xx Encoding space
because EXT1xx only brings an additional 16 bits (approx) to the table,
and that is provided already by the second-half instruction.

Thus they qualify as both high priority and also EXT0xx candidates.

## FPR/GPR LD/ST-PostIncrement-Update

These instructions, outlined in [[ls011]], save hugely in hot-loops.
Early ISAs such as the PDP-8 and PDP-11, which inspired the iconic
Motorola 68000, 88100 and Mitch Alsup's MyISA 66000, and which can even
be traced back to the iconic ultra-RISC CDC 6600, all had both pre- and
post-increment Addressing Modes.

The reason is very simple: it is a direct recognition of the practice
in C to frequently utilise both `*p++` and `*++p` which itself stems
from common need in Computer Science algorithms.

The problem for the Power ISA is - was - that the opcode space needed
to support both was far too great, and the decision was made to go with
pre-increment, on the basis that outside the loop a "pre-subtraction"
may be performed.

Whilst this is a "solution" it is less than ideal, and the opportunity
exists now with the EXT2xx Primary Opcodes to correct this and bring
Power ISA up a level.
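
The difference between the two addressing modes, and the
"pre-subtraction" workaround, can be modelled in Python. This is a
sketch with hypothetical function names and a word-addressed toy
memory. The pre-increment form follows the existing Power ISA
Load-with-Update behaviour (EA = RA + D, then RA = EA), and the
post-increment form is the `*p++` behaviour discussed above.

```python
def load_update_pre(mem, regs, ra: str, d: int):
    """Pre-increment (*++p): EA = RA + D, load from EA, then RA = EA."""
    ea = regs[ra] + d
    regs[ra] = ea
    return mem[ea]


def load_update_post(mem, regs, ra: str, d: int):
    """Post-increment (*p++): load from RA, then RA = RA + D."""
    ea = regs[ra]
    regs[ra] = ea + d
    return mem[ea]


mem = [10, 20, 30, 40]

regs = {"p": 0}
regs["p"] -= 1                      # the "pre-subtraction" outside the loop
pre = [load_update_pre(mem, regs, "p", 1) for _ in mem]
pre_final = regs["p"]               # points AT the last element

regs = {"p": 0}                     # no adjustment needed
post = [load_update_post(mem, regs, "p", 1) for _ in mem]
post_final = regs["p"]              # points one PAST the last element

print(pre, pre_final)    # [10, 20, 30, 40] 3
print(post, post_final)  # [10, 20, 30, 40] 4
```

Both loops read the same data, but the pre-increment version costs an
extra instruction before the loop and leaves the pointer off by one
element compared with the natural `*p++` idiom.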

## Shift-and-add

Shift-and-Add are proposed in [[ls004]]. They mitigate the need to add
LD-ST-Shift instructions which are a high-priority aspect of both x86
and ARM. LD-ST-Shift is normally just the one instruction: Shift-and-add
brings that down to two, where Power ISA presently requires three.
Cryptography e.g. twofish also makes use of Integer double-and-add,
so the value of these instructions is not limited to Effective Address
computation. They will also have value in Audio DSP.

Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
when their whole purpose and value is to reduce binary size in Address
offset computation, thus they are best placed in EXT0xx.
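
The two uses mentioned above, Effective Address computation and
double-and-add, reduce to the same arithmetic. A Python sketch follows.
The function name and operand order are hypothetical, and the exact
encoding of the shift amount in [[ls004]] is not modelled.

```python
MASK64 = (1 << 64) - 1


def shift_add(base: int, index: int, shift: int) -> int:
    """Return base + (index << shift), wrapped to 64 bits."""
    return (base + (index << shift)) & MASK64


# Effective Address of element 5 in an array of 8-byte items at 0x1000
print(hex(shift_add(0x1000, 5, 3)))  # 0x1028

# Integer double-and-add: 2*a + b
print(shift_add(7, 10, 1))           # 27
```

With the address formed in one instruction, an indexed load needs only
that instruction plus the load itself.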

[[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
[[!inline pages="openpower/sv/rfc/ls012/xo_cost.mdwn" raw=yes ]]

[[!tag opf_rfc]]