1 # External RFC ls012: Discuss priorities of Libre-SOC Scalar(Vector) ops
2
3 * <https://git.openpower.foundation/isa/PowerISA/issues/121>
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=1052>
6
7 The purpose of this RFC is:
8
9 * to give a full list of the upcoming Scalar opcodes developed by Libre-SOC
10 (respecting and being cognisant that *all* of them are Vectorisable)
11 * to give OPF Members and non-Members alike the opportunity to comment and get
12 involved early in RFC submission
13 * formally agree a priority order on an iterative basis with new versions
14 of this RFC,
15 * which ones should be EXT022 Sandbox, which in EXT0xx, which in EXT2xx,
16 * keep readers summarily informed of ongoing RFC submissions, with new versions
17 of this RFC,
18 * and for IBM (in their capacity as Allocator of Opcode resources)
19 to get a clear overall advance picture of the Opcode Allocation needs
20 *prior* to actual RFC submission
21
22 As this is a Formal ISA RFC the evaluation shall ultimately define
23 (in advance of the actual submission of the instructions themselves)
24 which instructions will be submitted over the next 8-18 months.
25
26 *It is expected that readers visit and interact with the Libre-SOC
27 resources in order to do due-diligence on the prioritisation
28 evaluation. Otherwise the ISA WG is overwhelmed by "drip-fed" RFCs
29 that may turn out not to be useful, against a background of having
30 no guiding overview or pre-filtering, and everybody's precious time
31 is wasted. Also note that the Libre-SOC Team, being funded by NLnet
32 under Privacy and Enhanced Trust Grants, are **prohibited** from signing
33 Commercial-Confidentiality NDAs, as doing so is a direct conflict of
34 interest with their funding body's Charitable Foundation Status and
35 remit, and therefore the **entire** set of almost 150 new SFFS instructions
36 can only go via the External RFC Process. Also be advised and aware
37 that "Libre-SOC" != "RED Semiconductor Ltd". The two are completely **separate**
38 organisations*.
39
40 Worth bearing in mind during evaluation that every "Defined Word" may
41 or may not be Vectoriseable, but that every "Defined Word" should have
42 merits on its own, not just when Vectorised. An example of a borderline
43 Vectoriseable Defined Word is `mv.swizzle` which only really becomes
44 high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
45 less merit as a Scalar-only operation.
46
47 Although one of the top world-class ISAs,
48 Power ISA Scalar (SFFS) has not been significantly advanced in 12
49 years: IBM's primary focus has understandably been on PackedSIMD VSX.
50 Unfortunately, with VSX being 914 instructions and 128-bit it is far too
51 much for any new team to consider (10 years development effort) and far
52 outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
53 Power Scalar up-to-date to modern standards *and on its own merits*
54 is a reasonable goal, and the advantage of the reduced focus is that
55 SFFS remains RISC-paradigm, and that lessons can be learned from other
56 ISAs from the intervening years. Good examples here include `bmask`.
57
58 SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
59 as well as "True-Scalable-Vector Prefixing" - also literally brings new
60 dimensions to the Power ISA. Thus when adding new Scalar "Defined Words"
61 their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed,
62 unavoidably has to be taken into consideration at the same time.
63
64 **Target areas**
65
66 Whilst entirely general-purpose there are some categories that these
67 instructions are targeting: Bitmanipulation, Big-integer, cryptography,
68 Audio/Visual, High-Performance Compute, GPU workloads and DSP.
69
70 **Instruction count guide and approximate priority order**
71
72 * 6 - SVP64 Management [[ls008]] [[ls009]] [[ls010]]
73 * 5 - CR weirds [[sv/cr_int_predication]]
74 * 4 - INT<->FP mv [[ls006]]
75 * 19 - GPR LD/ST-PostIncrement-Update (saves hugely in hot-loops) [[ls011]]
76 * ~12 - FPR LD/ST-PostIncrement-Update (ditto) [[ls011]]
77 * 2 - Float-Load-Immediate (always saves one LD L1/2/3 D-Cache op) [[ls002]]
78 * 5 - Big-Integer Chained 3-in 2-out (64-bit Carry) [[sv/biginteger]]
79 * 6 - Bitmanip LUT2/3 operations. high cost high reward [[sv/bitmanip]]
80 * 1 - fclass (Scalar variant of xvtstdcsp) [[sv/fclass]]
81 * 5 - Audio-Video [[sv/av_opcodes]]
82 * 2 - Shift-and-Add (mitigates LD-ST-Shift; Cryptography e.g. twofish) [[ls004]]
83 * 2 - BMI group [[sv/vector_ops]]
84 * 2 - GPU swizzle [[sv/mv.swizzle]]
85 * 9 - FP DCT/FFT Butterfly (2/3-in 2-out)
86 * ~9 - Integer DCT/FFT Butterfly <https://bugs.libre-soc.org/show_bug.cgi?id=1028>
87 * 18 - Trigonometric (1-arg) [[openpower/transcendentals]]
88 * 15 - Transcendentals (1-arg) [[openpower/transcendentals]]
89 * 25 - Transcendentals (2-arg) [[openpower/transcendentals]]
90
91 Summary tables are created below by different sort categories. Additional
92 columns as necessary can be requested to be added as part of update revisions
93 to this RFC.
94
95 # Target Area summaries
96
97 ## SVP64 Management instructions
98
99 These without question have to go in EXT0xx. Future extended variants,
100 bringing even more powerful capabilities, can be followed up later with
101 EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
102 *Only `svstep` is actually Vectoriseable*, all other Management
103 instructions are UnVectoriseable. PO1-Prefixed examples include adding
104 psvshape in order to support both Inner and Outer Product Matrix
105 Schedules, by providing the option to directly reverse the order of the
106 triple loops. Outer is used for standard Matrix Multiply, but Inner is
107 required for Warshall Transitive Closure (on top of a cumulatively-applied
108 max instruction).
109
110 The Management Instructions themselves are all Scalar Operations, so
111 PO1-Prefixing is perfectly reasonable. SVP64 Management instructions, of
112 which there are only 6, are all 5 or 6 bit XO, meaning that the opcode
113 space they take up in EXT0xx is not alarmingly high for their intrinsic
114 strategic value.
115
116 ## Transcendentals
117
118 Found at [[openpower/transcendentals]] these subdivide into high
119 priority for accelerating general-purpose and High-Performance Compute,
120 specialist 3D GPU operations suited to 3D visualisation, and low-priority
121 less common instructions where IEEE754 full bit-accuracy is paramount.
122 In 3D GPU scenarios for example even 12-bit accuracy can be overkill,
123 but for HPC Scientific scenarios 12-bit would be disastrous.
124
125 There are a **lot** of operations here, and they also bring Power
126 ISA up-to-date to IEEE754-2019. Fortunately the number of critical
127 instructions is quite low, but the caveat is that if those operations
128 are utilised to synthesise other IEEE754 operations (divide by `pi` for
129 example) full bitlevel accuracy (a hard requirement for IEEE754) is lost.
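
The loss of accuracy through synthesis can be illustrated with a minimal Python sketch. It assumes, purely for illustration, an operation of the `sinPi` kind (sine of pi times x), whose exact result for x = 1 is zero. The function name `sinpi_synthesised` is hypothetical and is not taken from the RFC.

```python
import math

def sinpi_synthesised(x: float) -> float:
    """Synthesise sin(pi * x) from a plain sin and a rounded pi.

    math.pi is only the nearest FP64 value to pi, so the rounding error
    in the argument leaks into the result.
    """
    return math.sin(math.pi * x)

exact = 0.0                        # mathematically exact sin(pi * 1)
approx = sinpi_synthesised(1.0)    # small, but not zero
print(approx, approx == exact)
```

A dedicated instruction can return the correctly rounded answer directly, which a sequence built from other operations generally cannot guarantee.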
130
131 Also worth noting that the Khronos Group defines minimum acceptable
132 bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
133 accuracy demanded by IEEE754. The reason for the Khronos definitions is
134 a massive reduction, often four-fold, in power consumption and gate count
135 when 3D Graphics simply has no need for full accuracy.
136
137 *For 3D GPU markets this definitely needs addressing*
138
139 ## Audio/Video
140
141 Found at [[sv/av_opcodes]] these do not require Saturated variants
142 because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
143 [[sv/svp64_single]] Scalar Prefixing. This is important to note for
144 Opcode Allocation because placing these operations in the UnVectoriseable
145 areas would irredeemably damage their value. Unlike PackedSIMD ISAs
146 the actual number of AV Opcodes is remarkably small once the usual
147 cascading-option-multipliers (SIMD width, bitwidth, saturation,
148 HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
149 absolute-diff-accumulate, min-max, average-add etc. as "basic primitives".
150
151 ## Twin-Butterfly FFT/DCT/DFT for DSP/HPC/AI/AV
152
153 The number of uses in Computer Science for DCT, NTT, FFT and DFT,
154 is astonishing. The Wikipedia page lists over a hundred separate and
155 distinct areas: Audio, Video, Radar, Baseband processing, AI, Reed-Solomon
156 Error Correction, the list goes on and on. ARM has special dedicated
157 Integer Twin-butterfly instructions. TI's MSP Series DSPs have had FFT
158 Inner loop support for over 30 years. Qualcomm's Hexagon VLIW Baseband
159 DSP can do full FFT triple loops in one VLIW group.
160
161 It should be pretty clear this is high priority.
162
163 With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
164 the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
165 operations, typically performing for example one multiply but in-place
166 subtracting that product from one operand and adding it to the other.
167 The *in-place* aspect is strategically extremely important for significant
168 reductions in Vectorised register usage, particularly for DCT.
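
As a rough illustration of the in-place idea, the following Python sketch uses the classic radix-2 decimation-in-time butterfly inside the usual FFT triple loop. It is not the proposed instruction semantics: the function names, the choice of which operand is multiplied, and the use of Python complex numbers are all assumptions made for the sketch.

```python
import cmath

def butterfly(a, b, w):
    """One multiply (w * b), then add to and subtract from the other operand."""
    t = w * b
    return a + t, a - t

def fft_inplace(x):
    """Radix-2 FFT over a list whose length is a power of two, done in place."""
    n = len(x)
    # Bit-reversal permutation so that the butterflies can work in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # The triple loop: stage size, block start, position within the block.
    size = 2
    while size <= n:
        half = size // 2
        wstep = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                lo, hi = start + k, start + k + half
                # Both results overwrite their inputs: no extra storage needed.
                x[lo], x[hi] = butterfly(x[lo], x[hi], w)
                w *= wstep
        size *= 2
    return x
```

Because each butterfly writes its two results back over its two inputs, the whole transform needs no temporary vector, which is the register-usage saving referred to above.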
169
170 ## CR Weird group
171
172 Outlined in [[sv/cr_int_predication]] these instructions massively save
173 on CR-Field instruction count. Multi-bit to single-bit and vice-versa
174 normally requiring several CR-ops (crand, crxor) are done in one single
175 instruction. The reason for their addition is down to SVP64 overloading
176 CR Fields as Vector Predicate Masks. Reducing instruction count in
177 hot-loops is considered high priority.
178
179 An additional need is to do popcount on CR Field bit vectors but adding
180 such instructions to the *Condition Register* side was deemed to be far
181 too much. Therefore, priority was given instead to transferring several
182 CR Field bits into GPRs, whereupon the full set of standard Scalar GPR
183 Logical Operations may be used. This strategy has the side-effect of
184 keeping the CRweird group down to only five instructions.
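
The strategy of moving CR Field bits into a GPR and then using ordinary integer operations can be sketched as follows. This is a hypothetical Python model only: the layout (a list of eight 4-bit fields), the bit numbering and the helper names are assumptions, and none of them correspond to actual CRweird mnemonics.

```python
# Assumed model: eight 4-bit CR Fields, bits numbered LSB0 within a field.
LT, GT, EQ, SO = 3, 2, 1, 0

def gather_cr_bit(cr_fields, bit):
    """Collect the chosen bit of every CR Field into one integer mask."""
    mask = 0
    for idx, field in enumerate(cr_fields):
        mask |= ((field >> bit) & 1) << idx
    return mask

def popcount(value):
    """Ordinary integer popcount, standing in for a GPR popcount."""
    return bin(value).count("1")

cr = [0b0010, 0b1000, 0b0010, 0b0100, 0b0000, 0b0000, 0b0010, 0b0000]
eq_mask = gather_cr_bit(cr, EQ)    # one bit per field whose EQ bit is set
print(bin(eq_mask), popcount(eq_mask))
```

Once the bits are in an integer register, counting them needs nothing new on the Condition Register side.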
185
186 ## Big-integer Math
187
188 [[sv/biginteger]] has always been a high priority area for commercial
189 applications, privacy, Banking, as well as HPC Numerical Accuracy:
190 libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
191 and ec25519 are finding their way into everyday use via OpenSSL.
192
193 A very early variant of the Power ISA had a 32-bit Carry-in Carry-out
194 SPR. Its removal from subsequent revisions is regrettable. An alternative
195 concept is to add six explicit 3-in 2-out operations that, on close
196 inspection, always turn out to be supersets of *existing Scalar
197 operations* that discard upper or lower DWords, or parts thereof.
198
199 *Thus it is critical to note that not one single one of these operations
200 expands the bitwidth of any existing Scalar pipelines*.
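
A minimal Python sketch of the 3-in 2-out idea, using multiply-and-add as the example. The names `mul_add_carry` and `bigint_mul_scalar` are hypothetical, not mnemonics from the RFC. The point is that the full result of a 64-bit multiply plus a 64-bit carry-in always fits in two 64-bit outputs.

```python
MASK64 = (1 << 64) - 1

def mul_add_carry(a, b, carry_in):
    """3-in 2-out: return (low, high) DWords of a * b + carry_in.

    With 64-bit inputs the sum never exceeds 2**128 - 2**64, so it
    always fits in two 64-bit outputs.
    """
    wide = a * b + carry_in
    return wide & MASK64, wide >> 64

def bigint_mul_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar."""
    out, carry = [], 0
    for limb in limbs:
        low, carry = mul_add_carry(limb, scalar, carry)   # carry chains along
        out.append(low)
    out.append(carry)
    return out
```

Each loop iteration is one operation with the high DWord fed forward as the next carry-in, which is the pattern a Vector-Prefixed form would repeat over the whole array of limbs.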
201
202 The `dsld` instruction for example merely places additional LSBs into the
203 64-bit shift (64-bit carry-in), and then places the (normally discarded)
204 MSBs into the second output register (64-bit carry-out). It does **not**
205 require a 128-bit shifter to replace the existing Scalar Power ISA
206 64-bit shifters.
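
The carry-in and carry-out behaviour described above can be modelled in a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the shift amount is restricted to 0..63, and the exact operand roles of `dsld` are not reproduced here.

```python
MASK64 = (1 << 64) - 1

def shift_left_with_carry(value, carry_in, sh):
    """Model a 64-bit left shift with 64-bit carry-in and carry-out.

    Python's unbounded integers are used only for brevity; the same
    result comes from a 64-bit shifter plus an OR of the carry-in.
    Assumes 0 <= sh < 64 and carry_in < 2**sh.
    """
    wide = (value << sh) | carry_in      # carry_in fills the vacated LSBs
    return wide & MASK64, wide >> 64     # (result, normally discarded MSBs)

def bigint_shl(limbs, sh):
    """Shift a little-endian list of 64-bit limbs left by sh bits."""
    out, carry = [], 0
    for limb in limbs:
        res, carry = shift_left_with_carry(limb, carry, sh)
        out.append(res)
    out.append(carry)
    return out
```

The carry-out of one limb becomes the carry-in of the next, so an arbitrary-length shift is a simple chain of identical operations.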
207
208 The reduction in instruction count these operations bring, in critical
209 hotloops, is remarkably high, to the extent where a Scalar-to-Vector
210 operation of *arbitrary length* becomes just the one Vector-Prefixed
211 instruction.
212
213 Whilst these are 5-6 bit XO their utility is considered to be of high strategic
214 value and as such they are strongly advocated to be in EXT04. The alternative
215 is to bring back a 64-bit Carry SPR but how it is retrospectively
216 applicable to pre-existing Scalar Power ISA multiply, divide, and shift
217 operations at this late stage of maturity of the Power ISA is an entire
218 area of research on its own deemed unlikely to be achievable.
219
220 ## fclass and GPR-FPR moves
221
222 [[sv/fclass]] - just one instruction. With SFFS being locked down to
223 exclude VSX, and there being no desire within the nascent OpenPOWER
224 ecosystem outside of IBM to implement the VSX PackedSIMD paradigm, it
225 becomes necessary to upgrade SFFS such that it is stand-alone capable. One
226 omission based on the assumption that VSX would always be present is an
227 equivalent to `xvtstdcsp`.
228
229 Similar arguments apply to the GPR-FPR move operations, proposed in
230 [[ls006]], with the opportunity taken to add rounding modes present
231 in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
232 rounding, one of the worst offenders of Computer Science, requires a
233 phenomenal 35 instructions with *six branches* to emulate in Power
234 ISA! For desktop as well as Server HTML/JS back-end execution of
235 javascript this becomes an obvious priority, recognised already by ARM
236 as just one example.
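
For reference, the conversion that has to be emulated is the ECMAScript `ToInt32` abstract operation. The Python sketch below models its result. It is an illustration of the required semantics only, and whether the instructions in [[ls006]] match it in every corner case is an assumption.

```python
import math

def js_to_int32(x: float) -> int:
    """Model of ECMAScript ToInt32: truncate, wrap modulo 2**32, sign-extend."""
    if math.isnan(x) or math.isinf(x):
        return 0                    # NaN and infinities convert to zero
    n = math.trunc(x)               # round towards zero
    n &= 0xFFFFFFFF                 # wrap modulo 2**32 (no saturation)
    return n - (1 << 32) if n >= (1 << 31) else n

print(js_to_int32(-1.5), js_to_int32(2.0 ** 32 + 5.7), js_to_int32(2.0 ** 31))
```

The wrap-around for out-of-range values, instead of saturation, is what makes emulation with conventional convert instructions so costly.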
237
238 ## Bitmanip LUT2/3
239
240 These LUT2/3 operations are high cost high reward. Outlined in
241 [[sv/bitmanip]], the simplest ones already exist in PackedSIMD VSX:
242 `xxeval`. The same reasoning applies as to fclass: SFFS needs to be
243 stand-alone on its own merits and not "punished" should an implementor
244 choose not to implement any aspect of PackedSIMD VSX.
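
The general idea of a ternary LUT operation can be sketched in Python as below: an 8-bit immediate acts as a truth table that is applied independently at every bit position of three inputs. The function name `lut3` and the bit-ordering of the index are assumptions for this sketch and may differ from the encodings in [[sv/bitmanip]].

```python
def lut3(a, b, c, imm8, width=64):
    """Bitwise ternary lookup: imm8 is an 8-entry truth table.

    Assumed indexing: the bit of a is the MSB of the index, the bit
    of c is the LSB, and imm8 is read LSB0.
    """
    result = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> idx) & 1) << i
    return result

a, b, c = 0x0F0F, 0x3333, 0x5555
print(hex(lut3(a, b, c, 0x96)))   # truth table of three-way XOR
print(hex(lut3(a, b, c, 0xE8)))   # truth table of majority
```

One instruction with a different immediate therefore stands in for any of the 256 possible three-input bitwise functions.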
245
246 With Predication being such a high priority in GPUs and HPC, CR Field
247 variants of Ternary and Binary LUT instructions were considered high
248 priority, and again just like in the CRweird group the opportunity was
249 taken to work on *all* bits of a CR Field rather than just one bit as
250 is done with the existing CR operations crand, cror etc.
251
252 The other high strategic value instruction is `grevlut` (and `grevluti`
253 which can generate a remarkably large number of regular-patterned magic
254 constants). The grevlut set requires on the order of 20,000 gates but
255 provide an astonishing plethora of innovative bit-permuting instructions
256 never seen in any other ISA.
257
258 The downside of all of these instructions is the extremely low XO bit
259 requirements: 2-3 bit XO due to the large immediates *and* the number of
260 operands required. The LUT3 instructions are already compacted down to
261 "Overwrite" variants. (By contrast the Float-Load-Immediate instructions
262 are a much larger XO because despite having 16-bit immediate only one
263 Register Operand is needed).
264
265 Realistically these high-value instructions should be proposed in EXT2xx
266 where their XO cost does not overwhelm EXT0xx.
267
268
269 ## (f)mv.swizzle
270
271 [[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value
272 as a Scalar instruction is limited *except* if combined with `cmpi` and
273 SVP64Single Predication, whereupon the end result is the RISC-synthesis
274 of Compare-and-Swap, in two instructions.
275
276 Where this instruction comes into its full value is when Vectorised.
277 3D GPU and HPC numerical workloads astonishingly contain between 10 and 15%
278 swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
279 balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
280 Swizzle a first-class priority in their VLIW words. Even 64-bit Embedded
281 GPU ISAs have a staggering 24-bits dedicated to 2-operand Swizzle.
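
For readers unfamiliar with the term, a swizzle selects and reorders the lanes of a small vector by name. A minimal Python sketch follows. The helper `swizzle` and its string-based pattern are illustrative assumptions and say nothing about how `mv.swizzle` encodes its selectors.

```python
def swizzle(vec, pattern):
    """Pick lanes of an XYZW vector by name, in any order, with repeats."""
    lanes = {"X": 0, "Y": 1, "Z": 2, "W": 3}
    return [vec[lanes[ch]] for ch in pattern]

xyzw = [1.0, 2.0, 3.0, 4.0]
print(swizzle(xyzw, "YYZ"))    # repeated and reordered lanes
print(swizzle(xyzw, "XY"))     # a simple sub-vector
```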
282
283 So as not to radicalise the Power ISA the Libre-SOC team decided to
284 introduce mv Swizzle operations, which can always be Macro-op fused
285 in exactly the same way that ARM SVE predicated-move extends 3-operand
286 "overwrite" opcodes to full independent 3-in 1-out.
287
288 ## BMI (bitmanipulation) group
289
290 Whilst the [[sv/vector_ops]] instructions are only two in number, in
291 reality the `bmask` instruction has a Mode field allowing it to cover
292 **24** instructions, more than have been added to any other CPUs by
293 ARM, Intel or AMD. Analysis of the BMI sets of these CPUs shows simple
294 patterns that can greatly simplify both Decode and implementation. These
295 are sufficiently commonly used, saving instruction count regularly,
296 that they justify going into EXT0xx.
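
The kind of pattern referred to can be seen in the following Python sketch, in which several well-known x86 BMI1 and AMD TBM operations fall out of one parameterised function. The parameterisation (`op`, `delta` and the two invert flags) is an illustration only and is not the `bmask` Mode field encoding.

```python
MASK64 = (1 << 64) - 1

def bit_pattern_op(x, op, delta, invert_x=False, invert_y=False):
    """Combine x with x+1 or x-1, optionally inverting either side."""
    y = (x + delta) & MASK64
    a = (~x & MASK64) if invert_x else x
    b = (~y & MASK64) if invert_y else y
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "xor":
        return a ^ b
    raise ValueError(op)

def blsr(x):     # x86 BMI1: reset lowest set bit, x & (x - 1)
    return bit_pattern_op(x, "and", -1)

def blsi(x):     # x86 BMI1: isolate lowest set bit, x & -x
    return bit_pattern_op(x, "and", -1, invert_y=True)

def blsmsk(x):   # x86 BMI1: mask up to lowest set bit, x ^ (x - 1)
    return bit_pattern_op(x, "xor", -1)

def tzmsk(x):    # AMD TBM: mask of trailing zeros, ~x & (x - 1)
    return bit_pattern_op(x, "and", -1, invert_x=True)

def blcfill(x):  # AMD TBM: clear trailing ones, x & (x + 1)
    return bit_pattern_op(x, "and", +1)
```

A decoder only has to select the operator, the increment direction and the inversions, which is why one mode-driven instruction can cover so many separately named operations.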
297
298 The other instruction is `cprop` - Carry-Propagation - which takes
299 the P and Q from carry-propagation algorithms and generates carry
300 look-ahead. Greatly increases the efficiency of arbitrary-precision
301 integer arithmetic by combining what would otherwise be half a dozen
302 instructions into one. However it is still not a huge priority unlike
303 `bmask` so is probably best placed in EXT2xx.
304
305 ## Float-Load-Immediate
306
307 Very easily justified. As explained in [[ls002]] these always save one
308 LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
309 FP value being in the I-Cache side. It is such a high priority that
310 these instructions are easily justified for addition into EXT0xx, despite
311 requiring a 16-bit immediate. By designing the second-half instruction
312 as a Read-Modify-Write it saves on XO bitlength (only 5 bits), and can be
313 macro-op fused with its first-half to store a full IEEE754 FP32 immediate
314 into a register.
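
The two-half scheme can be modelled at the bit level as in the Python sketch below. It assumes that the first instruction supplies the upper 16 bits of an FP32 pattern and that the second merges in the lower 16 bits. That split, and the helper names, are assumptions made for illustration and [[ls002]] remains the reference for the real definitions.

```python
import struct

def fp32_bits(value: float) -> int:
    """Return the IEEE754 FP32 bit pattern of a Python float."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE754 FP32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def load_upper(imm16: int) -> int:
    """First half (assumed): immediate goes in the top 16 bits, rest zero."""
    return (imm16 & 0xFFFF) << 16

def merge_lower(reg_bits: int, imm16: int) -> int:
    """Second half (read-modify-write): fill in the low 16 bits."""
    return (reg_bits & 0xFFFF0000) | (imm16 & 0xFFFF)

target = fp32_bits(3.14159)
reg = load_upper(target >> 16)             # usable alone at reduced precision
reg = merge_lower(reg, target & 0xFFFF)    # now the full FP32 pattern
print(hex(reg), bits_to_fp32(reg))
```

Under these assumptions many common constants need only the first instruction, and the pair together never touches the D-Cache.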
315
316 There is little point in putting these instructions into EXT2xx. Their
317 very benefit and inherent value *is* as 32-bit instructions, not 64-bit
318 ones. Likewise there is less value in taking up EXT1xx Encoding space
319 because EXT1xx only brings an additional 16 bits (approx) to the table,
320 and that is provided already by the second-half instruction.
321
322 Thus they qualify as both high priority and also EXT0xx candidates.
323
324 ## FPR/GPR LD/ST-PostIncrement-Update
325
326 These instructions, outlined in [[ls011]], save hugely in hot-loops.
327 Early ISAs such as PDP-8, PDP-11, which inspired the iconic Motorola
328 68000, 88100, Mitch Alsup's MyISA 66000, and can even be traced back to
329 the iconic ultra-RISC CDC 6600, all had both pre- and post- increment
330 Addressing Modes.
331
332 The reason is very simple: it is a direct recognition of the practice
333 in C to frequently utilise both `*p++` and `*++p`, which itself stems
334 from a common need in Computer Science algorithms.
335
336 The problem for the Power ISA is - was - that the opcode space needed
337 to support both was far too great, and the decision was made to go with
338 pre-increment, on the basis that outside the loop a "pre-subtraction"
339 may be performed.
340
341 Whilst this is a "solution" it is less than ideal, and the opportunity
342 exists now with the EXT2xx Primary Opcodes to correct this and bring
343 Power ISA up a level.
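
The difference between the two addressing modes, and the "pre-subtraction" workaround, can be shown with a small Python model in which memory is a list and a pointer is an index. The function names are hypothetical and the model ignores element size and displacement fields.

```python
def load_pre_increment(mem, p):
    """Like C's *++p: update the address first, then load."""
    p += 1
    return mem[p], p

def load_post_increment(mem, p):
    """Like C's *p++: load first, then update the address."""
    return mem[p], p + 1

def sum_with_pre_increment(mem, start, count):
    p = start - 1                  # the "pre-subtraction" outside the loop
    total = 0
    for _ in range(count):
        value, p = load_pre_increment(mem, p)
        total += value
    return total

def sum_with_post_increment(mem, start, count):
    p = start                      # no adjustment needed
    total = 0
    for _ in range(count):
        value, p = load_post_increment(mem, p)
        total += value
    return total
```

Both loops read the same elements. The pre-increment form simply needs its pointer biased by one element beforehand, which is the extra step that post-increment removes.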
344
345 ## Shift-and-add
346
347 Shift-and-Add are proposed in [[ls004]]. They mitigate the need to add
348 LD-ST-Shift instructions which are a high-priority aspect of both x86
349 and ARM. LD-ST-Shift is normally just the one instruction: Shift-and-add
350 brings the Power ISA equivalent down to two, where it presently requires three.
351 Cryptography e.g. twofish also makes use of Integer double-and-add,
352 so the value of these instructions is not limited to Effective Address
353 computation. They will also have value in Audio DSP.
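
A minimal Python sketch of the address computation that Shift-and-Add folds into one step. The function name and operand order are assumptions and [[ls004]] defines the actual instructions.

```python
MASK64 = (1 << 64) - 1

def shift_and_add(base, index, sh):
    """Return base + (index << sh), truncated to 64 bits."""
    return (base + (index << sh)) & MASK64

# Address of element 5 in an array of 8-byte items starting at 0x1000.
# Without shift-and-add: shift, then add, then load (three instructions).
# With it: shift-and-add, then load (two instructions).
addr = shift_and_add(0x1000, 5, 3)
print(hex(addr))
```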
354
355 Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
356 when their whole purpose and value is to reduce binary size in Address
357 offset computation, thus they are best placed in EXT0xx.
358
359
360 # Tables
361
362 The original tables are available publicly as a CSV file at
363 <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls012/optable.csv;hb=HEAD>.
364 A Python program auto-generates the tables in the following sections
365 by sorting into different useful priorities.
366
367 The key to headings and sections are as follows:
368
369 * **Area** - Target Area as described in above sections
370 * **XO Cost** - the number of bits required in the XO Field. Whilst not
371 the full picture it is a good indicator as to how costly in terms
372 of Opcode Allocation a given instruction will be. Lower number is
373 a higher cost for the Power ISA's precious remaining Opcode space
374 * **rfc** the Libre-SOC External RFC resource,
375 <https://libre-soc.org/openpower/sv/rfc/> where advance notice of
376 upcoming RFCs in development may be found.
377 *Reading advance Draft RFCs and providing feedback strongly advised*,
378 it saves time and effort for the OPF ISA Workgroup.
379 * **SVP64** - Vectoriseable (SVP64-Prefixable) - also implies that
380 SVP64Single is also permitted (required).
381 * **page** - Libre-SOC wiki page at which further information can
382 be found. Again: **advance reading strongly advised due to the
383 sheer volume of information**.
384 * **PO1** - the instruction is capable of being PO1-Prefixed
385 (given an EXT1xx Opcode Allocation). Bear in mind that this option
386 is **mutually exclusively incompatible** with Vectorisation.
387 * **group** - the Primary Opcode Group recommended for this instruction.
388 Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx. A third area
389 (UnVectoriseable),
390 EXT3xx, was available in an early Draft RFC but has been made "RESERVED"
391 instead. See [[sv/po9_encoding]].
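
To give a feel for the table generation, here is a minimal Python sketch that sorts rows by XO cost and emits a Markdown table. It is not the actual generator script. The column names (`area`, `xo_cost`, `group`) and the three sample rows are placeholders, loosely echoing figures quoted in the sections above, and do not reflect the real layout of `optable.csv`.

```python
import csv
import io

# Placeholder data only: not the contents of the real optable.csv.
SAMPLE = """area,xo_cost,group
Shift-and-add,10,EXT0xx
Float-Load-Immediate,5,EXT0xx
Bitmanip LUT2/3,3,EXT2xx
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def sort_by_xo_cost(rows):
    """Fewest XO bits first, i.e. most expensive in opcode space first."""
    return sorted(rows, key=lambda row: int(row["xo_cost"]))

def to_markdown(rows, columns):
    """Render the chosen columns as a Markdown table."""
    lines = ["| " + " | ".join(columns) + " |",
             "|" + "|".join(" --- " for _ in columns) + "|"]
    for row in rows:
        lines.append("| " + " | ".join(row[col] for col in columns) + " |")
    return "\n".join(lines)

print(to_markdown(sort_by_xo_cost(load_rows(SAMPLE)), ["area", "xo_cost", "group"]))
```

Sorting on a different key, such as the target area or the recommended group, gives the other views.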
392
393 [[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
394 [[!inline pages="openpower/sv/rfc/ls012/xo_cost.mdwn" raw=yes ]]
395
396 [[!tag opf_rfc]]