1 # The Rules
2
3 [[!toc]]
4
5 SVP64 is designed around fundamental and inviolate RISC principles.
6 This gives a uniformity and regularity to the ISA, making implementation
7 straightforward, which was why RISC
8 as a concept became popular.
9
10 1. There are no actual Vector instructions: Scalar instructions
11 are the sole exclusive bedrock.
12 2. No scalar instruction ever deviates in its encoding or meaning
13 just because it is prefixed (semantic caveats below)
14 3. A hardware-level for-loop (the prefix) makes vector elements
15 100% synonymous with scalar instructions (the suffix)
16 4. Exactly as with Scalar RISC ISAs, the uniformity does produce
17 "holes" in the encoding or some strange combinations.
18
19 How can a Vector ISA even exist when no actual Vector instructions
20 are permitted to be added? It comes down to the strict RISC abstraction.
21 First you start from a **scalar** instruction (32-bit). Second, the
22 Prefixing is applied *in the abstract* to give the *appearance*
23 and ultimately the same effect as if an explicit Vector instruction
24 had also been added. Looking at the pseudocode of any Vector ISA
25 (RVV, NEC SX Aurora, Cray)
26 they always comprise (a) a for-loop around (b) element-based operations.
27 It is perfectly reasonable and rational to separate (a) from (b)
28 then find a powerful pre-existing
29 Supercomputing-class ISA that qualifies for (b).
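
The separation of (a) from (b) can be sketched in a few lines of Python. This is a minimal illustration only: the names `scalar_add` and `sv_loop`, and the flat list standing in for the register file, are assumptions made for the example and do not come from the specification.

```python
def scalar_add(a, b):
    """(b) the element operation: an ordinary 64-bit scalar add."""
    return (a + b) & 0xFFFF_FFFF_FFFF_FFFF


def sv_loop(op, gpr, rt, ra, rb, vl):
    """(a) the for-loop: apply the unmodified scalar op to VL registers."""
    for i in range(vl):
        gpr[rt + i] = op(gpr[ra + i], gpr[rb + i])


# hypothetical 32-entry register file, modelled as a plain list
gpr = [0] * 32
gpr[8:12] = [1, 2, 3, 4]
gpr[16:20] = [10, 20, 30, 40]
sv_loop(scalar_add, gpr, rt=24, ra=8, rb=16, vl=4)
```

Note that `scalar_add` knows nothing about vectors: all of the "Vector" behaviour lives in the loop.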
30
31 There are a few exceptional places where these rules get
32 bent, and others where the rules take some explaining,
33 and this page tracks them all.
34
35 The modification caveat in (2) above semantically
36 exempts element width overrides,
37 which still do not actually modify the meaning of the instruction:
38 an add remains an add, even if its override makes it an 8-bit add rather than
39 a 64-bit add. Even add-with-carry remains an add-with-carry: it's just
40 that when elwidth=8 in the Prefix it's an *8-bit* add-with-carry
41 where the 9th bit becomes Carry-out (not the 65th bit).
42 In other words, elwidth overrides **definitely** do not fundamentally
43 alter the actual
44 Scalar v3.0 ISA encoding itself. Consequently we can still, in
45 the strictest semantic sense, not be breaking rule (2).
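
As a rough sketch of the point about Carry-out, the following Python models an add-with-carry at a given element width. The function name `adde_elwidth` and its signature are hypothetical, chosen for illustration.

```python
def adde_elwidth(a, b, carry_in, elwidth=64):
    """Add-with-carry at the given element width.

    Returns (result, carry_out). Carry-out is bit number `elwidth`
    of the exact sum: the 9th bit for elwidth=8, the 65th for 64.
    """
    mask = (1 << elwidth) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, (total >> elwidth) & 1
```

The operation is the same add-with-carry in both cases; only the position at which the carry is taken changes.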
46
47 Likewise, other "modifications" such as saturation or Data-dependent
48 Fail-First are actually post-augmentation or post-analysis, and do
49 not fundamentally change an add operation into a subtract
50 for example, and under absolutely no circumstances do the actual 32-bit
51 Scalar v3.0 operand field bits change or the number of operands change.
52
53 In an early Draft of SVP64,
54 an experiment was attempted, to modify LD-immediate instructions
55 to include a
56 third RC register i.e. reinterpret the normal
57 v3.0 32-bit instruction as a completely
58 different encoding if SVP64-prefixed. It did not go well.
59 The complexity that resulted
60 in the decode phase was too great. The lesson was learned, the
61 hard way: it would be infinitely preferable
62 to add a 32-bit Scalar Load-with-Shift
63 instruction *first*, which then inherently becomes Vectorised.
64 Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
65 both ARM and x86 have it, because it saves greatly on instruction count in
66 hot-loops.
67
68 The other reason for not adding an SVP64-Prefixed instruction without
69 also having it as a Scalar un-prefixed instruction is that if the
70 32-bit encoding is ever allocated in a future revision
71 of the Power ISA
72 to a completely unrelated operation
73 then how can a Vectorised version of that new instruction ever be added?
74 The uniformity and RISC Abstraction is irreparably damaged.
75 Bottom line here is that the fundamental RISC Principle is strictly adhered
76 to, even though these are Advanced 64-bit Vector instructions.
77 Advocates of the RISC Principle will appreciate the uniformity of
78 SVP64 and the level of systematic abstraction kept between Prefix and Suffix.
79
80 # Instruction Groups
81
82 The basic principle of SVP64 is the prefix, which contains mode
83 as well as register augmentation and predicates. When thinking of
84 instructions and Vectorising them, it is natural for arithmetic
85 operations (ADD, OR) to be the first to spring to mind.
86 Arithmetic instructions have registers, therefore augmentation
87 applies, end of story, right?
88
89 Except that Load and Store also deal with Memory, not just registers.
90 Power ISA has Condition Register Fields: how can element widths
91 apply there? And branches: how can you have Saturation on something
92 that does not return an arithmetic result? In short: there are actually
93 four different categories (five including those for which Vectorisation
94 makes no sense at all, such as `sc` or `mtmsr`). The categories are:
95
96 * arithmetic/logical including floating-point
97 * Load/Store
98 * Condition Register Field operations
99 * branch
100
101 **Arithmetic**
102
103 Arithmetic (known as "normal" mode) is where Scalar and Parallel
104 Reduction can be done: Saturation as well, and two new innovative
105 modes for Vector ISAs: data-dependent fail-first and predicate result.
106 Reduction and Saturation are common to see in Vector ISAs: it is just
107 that they are usually added as explicit instructions,
108 and NEC SX Aurora has even more iterative instructions. In SVP64 these
109 concepts are applied in the abstract general form, which takes some
110 getting used to.
111
112 Reduction may, when applied to non-commutative
113 instructions incorrectly, result in invalid results, but ultimately
114 it is critical to think in terms of the "rules", that everything is
115 Scalar instructions in strict Program Order. Reduction on non-commutative
116 Scalar Operations is not *prohibited*: the strict Program Order allows
117 the programmer to think through what would happen and thus potentially
118 actually come up with legitimate use.
119
120 **Branches**
121
122 Branch is the one and only place where the Scalar
123 (non-prefixed) operations differ from the Vector (element)
124 instructions (as explained in a separate section). A case could
125 be made for the perspective that they are identical, in that
126 the defaults for the new parameters in the Scalar case make branch
127 identical to Power ISA v3.1 Scalar branches.
128
129 The
130 RM bits can be used for other purposes because the Arithmetic modes
131 make no sense at all for a Branch.
132 Almost the entire
133 SVP64 RM Field is interpreted differently from other Modes, in
134 order to support a wide range of parallel boolean condition options
135 which are expected of a Vector / GPU ISA. These save a considerable
136 number of instructions in tight inner loop situations.
137
138 **CR Field Ops**
139
140 Condition Register Fields are 4-bit wide and consequently element-width
141 overrides make absolutely no sense whatsoever. Therefore the elwidth
142 override field bits can be used for other purposes when Vectorising
143 CR Field instructions. Moreover, Rc=1 is completely invalid for
144 CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
145 a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
146 such as predicate-result make no sense, and neither does Saturation.
147 All of these differences, which require quite a lot of logical
148 reasoning and deduction, help explain why there is an entirely different
149 CR ops Vectorisation Category.
150
151 A particularly strange quirk of CR-based Vector Operations is that the
152 Scalar Power ISA CR Register is 32-bits, but actually comprises eight
153 CR Fields, CR0-CR7. With each CR Field being four bits (LT, GT, EQ, SO)
154 this makes up 32 bits, and therefore a CR operand referring to one bit
155 of the CR will be 5 bits in length (BA, BT).
156 *However*, some instructions refer
157 to a *CR Field* (CR0-CR7) and consequently these operands
158 (BF, BFA etc) are only 3-bits.
159
160 (*It helps here to think of the top 3 bits of BA as referring
161 to a CR Field, like BFA does, and the bottom 2 bits of BA
162 referring to
163 LT/GT/EQ/SO within that Field*)
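
The split of a 5-bit CR bit operand can be sketched as below. The helper name `split_ba` is hypothetical; the bit names follow the Power ISA numbering within a CR Field (bit 0 is LT, 1 is GT, 2 is EQ, 3 is SO).

```python
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")


def split_ba(ba):
    """Split a 5-bit CR bit operand into (CR Field number, bit name)."""
    if not 0 <= ba < 32:
        raise ValueError("BA is a 5-bit operand")
    field = ba >> 2        # top 3 bits: which CR Field (CR0-CR7)
    bit = ba & 0b11        # bottom 2 bits: which bit within the Field
    return field, CR_BIT_NAMES[bit]
```

For example, operand value 2 refers to the EQ bit of CR0, and 31 refers to the SO bit of CR7.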
164
165 With SVP64 extending the number of CR *Fields* to 128, the number of
166 32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
167 (8 per CR Register). Then, it gets even more strange, when it comes
168 to Vectorisation, which applies to the CR Field *numbers*. The
169 hardware-for-loop for Rc=1 for example starts at CR0 for element 0,
170 and moves to CR1 for element 1, and so on. The reason here is quite
171 simple: each element result has to have its own CR Field co-result.
172
173 In other words, the
174 element is the 4-bit CR *Field*, not the bits *of* the 32-bit
175 CR Register, and not the CR *Register* (of which there are now 16).
176 All quite logical, but a little mind-bending.
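
A simplified model of the Rc=1 co-result loop is sketched below. It assumes Python integers as stand-ins for signed register values (no 64-bit wrap-around) and ignores overflow, so SO is always 0 here; `cr_field_for` and `sv_add_rc1` are illustrative names only.

```python
def cr_field_for(result, so=0):
    """Return a CR Field as (LT, GT, EQ, SO) for a signed result."""
    return (int(result < 0), int(result > 0), int(result == 0), so)


def sv_add_rc1(gpr, cr_fields, rt, ra, rb, vl):
    """Vectorised add with Rc=1: element i writes its own CR Field i."""
    for i in range(vl):
        result = gpr[ra + i] + gpr[rb + i]
        gpr[rt + i] = result
        cr_fields[i] = cr_field_for(result)  # the element is the 4-bit Field


gpr = [0] * 32
gpr[4:7] = [1, -1, 5]
gpr[8:11] = [-1, -1, -5]
cr_fields = [None] * 128     # 128 CR Fields, as described above
sv_add_rc1(gpr, cr_fields, rt=12, ra=4, rb=8, vl=3)
```

The loop index selects a CR *Field*, never a single bit and never a whole 32-bit CR Register.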
177
178 **Load/Store**
179
180 LOAD/STORE is another area that has different needs: this time it is
181 down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
182 which simply make no sense in a RISC Scalar ISA: element-stride and
183 unit-stride and the entire concept of a stride itself (a spacing
184 between elements) has no place at all in a Scalar ISA. The problems
185 come when trying to *retrofit* the concept of "Vector Elements" onto
186 a Scalar ISA. Consequently it required a couple of bits (Modes) in the SVP64
187 RM Prefix to convey the stride mode, changing the Effective Address
188 computation as a result. Interestingly, worth noting for Hardware
189 designers: it did turn out to be possible to perform pre-multiplication
190 of the D/DS Immediate by the stride amount, making it possible to avoid
191 actually modifying the LD/ST Pipeline itself.
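
The effect of a stride mode on the Effective Address can be illustrated as follows. This is a sketch under stated assumptions, not the normative computation: it assumes that element-stride multiplies the immediate by the element index, and that unit-stride steps by the width of the operation in bytes. The function names are hypothetical, and the LD/ST specification page remains the authority.

```python
def element_strided_addresses(base, imm, vl):
    """Assumed element-stride: the immediate is pre-multiplied by the index."""
    return [base + imm * i for i in range(vl)]


def unit_strided_addresses(base, imm, width_bytes, vl):
    """Assumed unit-stride: consecutive elements, contiguous in memory."""
    return [base + imm + width_bytes * i for i in range(vl)]
```

In both cases each element still sees an ordinary "base plus offset" computation, which is why the pre-multiplication can be done ahead of the LD/ST Pipeline rather than inside it.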
192
193 Other areas where LD/ST went quirky: element-width overrides especially
194 when combined with Saturation, given that LD/ST operations have byte,
195 halfword, word, dword and quad variants. The interaction between these
196 widths as part of the actual operation, and the source and destination
197 elwidth overrides, was particularly obtuse and hard to derive: some care
198 and attention is advised, here, when reading the specification,
199 especially on arithmetic (sign-extending) loads (lha, lwa etc.)
200
201 **Non-vectorised**
202
203 The concept of a Vectorised halt (`attn`) makes no sense. There is never
204 going to be a Vector of global MSRs (Machine State Register). `mtcr`
205 on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
206 It even makes a strange kind of sense to permit `td` and `tdi` to be
207 Vectorised, because a sequence of comparisons could be Vectorised.
208 Vectorised System Calls (`sc`), `tlbie` and other Cache or Virtual
209 Memory Management
210 instructions make no sense to Vectorise.
211
212 However, it is really quite important to not be tempted to conclude that
213 just because these instructions are un-vectoriseable, the Prefix opcode space
214 must be free for reinterpretation and use for other purposes. This would
215 be a serious mistake because a future revision of the specification
216 might *retire* the Scalar instruction, and, worse, replace it with another.
217 Again this comes down to being quite strict about the rules: only Scalar
218 instructions get Vectorised: there are *no* actual explicit Vector
219 instructions.
220
221 **Summary**
222
223 Where a traditional Vector ISA effectively duplicates the entirety
224 of a Scalar ISA and then adds additional instructions which only
225 make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
226 considerable lengths to keep strictly to augmentation and embedding
227 of an entire Scalar ISA's instructions into an abstract Vectorisation
228 Context. That abstraction subdivides down into Categories appropriate
229 for the type of operation (Branch, CRs, Memory, Arithmetic),
230 and each Category has its own relevant but
231 ultimately rational quirks.
232
233 # Abstraction between Prefix and Suffix
234
235 In the introduction paragraph, a great fuss was made emphasising that
236 the Prefix is kept separate from the Suffix. The whole idea there is
237 that a Multi-issue Decoder and subsequent pipelines would in no way have
238 "back-propagation" of state that can only be determined far too late.
239 This *has* been preserved, however there is a hiccup.
240
241 In Power ISA 3.1 a 64-bit Prefix, EXT001, was introduced.
242 The encoding of the prefix has 6 bits that are dedicated to letting
243 the hardware know what the remainder of the Prefix bits mean: how they
244 are formatted, even without having to examine the Suffix to which
245 they are applied.
246
247 SVP64 has such pressure on its 24-bit encoding that it was simply
248 not possible to perform the same trick used by Power ISA 3.1 Prefixing.
249 Therefore, rather unfortunately, it becomes necessary to perform
250 a *partial decoding* of the v3.0 Suffix before the 24-bit SVP64 RM
251 Fields may be identified. Fortunately this is straightforward, and
252 does not rely on any outside state, and even more fortunately
253 for a Multi-Issue Execution decoder, the length 32/64 is also
254 easy to identify by looking for the EXT001 pattern. Once identified
255 the 32/64 bits may be passed independently to multiple Decoders in
256 parallel.
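
Length identification can be sketched as a check on the Primary Opcode, which occupies the most significant 6 bits of the first 32-bit word. The helper names below are hypothetical, and the sketch only distinguishes 32-bit from 64-bit: it does not attempt the further step of telling SVP64 apart from other EXT001 prefixed instructions.

```python
def primary_opcode(word):
    """Top 6 bits of a 32-bit instruction word (bits 0-5, MSB0 numbering)."""
    return (word >> 26) & 0x3F


def instruction_length_bits(first_word):
    """64 if the word is an EXT001 prefix, otherwise 32."""
    return 64 if primary_opcode(first_word) == 1 else 32
```

Because the check needs nothing but the first word, it can run ahead of, and independently from, the Decoders themselves.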
257
258 # Predication
259
260 Predication is entirely missing from the Power ISA.
261 Adding it would be a costly mistake because it cannot be retrofitted
262 to an ISA without literally duplicating all instructions. Prefixing
263 is about the only sane way to go.
264
265 CR Fields as predicate masks could be spread across multiple register
266 file entries, making them costly to read in one hit. Therefore the
267 possibility exists that an instruction element writing to a CR Field
268 could *overwrite* the Predicate mask CR Vector during the middle of
269 a for-loop.
270
271 Clearly this is bad, so don't do it. If there are potential issues
272 they can be avoided by using the crweird instructions to get CR Field
273 bits into an Integer GPR (r3, r10 or r30) and use that GPR as a
274 Predicate mask instead.
275
276 Even in Vertical-First Mode, which is a single Scalar instruction executed
277 with "offset" registers (in effect), the rule still applies: don't write
278 to the same register being used as the predicate, it's `UNDEFINED`
279 behaviour.
280
281 ## Single Predication
282
283 So named because there is a Twin Predication concept as well, Single
284 Predication is also unlike other Vector ISAs because it allows zeroing
285 on both the source and destination. This takes some explaining.
286
287 In Vector ISAs, there is a Predicate Mask which applies to the
288 destination only, and there
289 is a choice of actions when a Predicate Mask bit
290 is zero:
291
292 * set the destination element to zero
293 * skip that element operation entirely, leaving the destination unmodified
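
The two choices listed above can be sketched as follows. This models destination predication only, as found in a conventional Vector ISA; the function name and the use of Python lists for vectors are assumptions for illustration.

```python
def predicated_op(op, src, dst, mask, vl, zeroing):
    """Apply op element-wise under a destination Predicate Mask.

    Where the mask bit is zero, the destination element is either
    set to zero (zeroing) or left unmodified (skipping).
    """
    out = list(dst)
    for i in range(vl):
        if (mask >> i) & 1:
            out[i] = op(src[i])
        elif zeroing:
            out[i] = 0
    return out
```

With mask `0b0101`, elements 0 and 2 are computed; elements 1 and 3 either keep their old values or become zero, depending on `zeroing`.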
294
295 The problem comes if the underlying register file SRAM is say 64-bit wide
296 write granularity but the Vector elements are say 8-bit wide.
297 Some Vector ISAs strongly advocate Zeroing because to leave one single
298 element at a small bitwidth in amongst other elements where the register
299 file does not have the prerequisite access granularity is very expensive,
300 requiring a Read-Modify-Write cycle to preserve the untouched elements.
301 Putting zero into the destination avoids that Read.
302
303 This is technically very easy to solve: use a Register File that does
304 in fact have the smallest element-level write-enable granularity.
305 If the elements are 8 bit then allow 8-bit writes!
306
307 With that technical issue solved there is nothing in the way of choosing
308 to support both zeroing and non-zeroing (skipping) at the ISA level:
309 SV chooses to further support both *on both the source and destination*.
310 This can result in the source and destination
311 element indices getting "out-of-sync" even though the Predicate Mask
312 is the same because the behaviour is different when zeros in the
313 Predicate are encountered.
314
315 ## Twin Predication
316
317 Twin Predication is an entirely new concept not present in any commercial
318 Vector ISA of the past forty years. To explain how normal Single-predication
319 is applied in a standard Vector ISA:
320
321 * Predication on the **source** of a LOAD instruction creates something
322 called "Vector Compressed Load" (VCOMPRESS).
323 * Predication on the **destination** of a STORE instruction creates something
324 called "Vector Expanded Store" (VEXPAND).
325 * SVP64 allows the two to be put back-to-back: one on source, one on
326 destination.
327
328 The above allows a reader familiar with VCOMPRESS and VEXPAND to
329 conceptualise what the effect of Twin Predication is, but it actually
330 goes much further: in *any* twin-predicated instruction (extsw, fmv)
331 it is possible to apply one predicate to the source register (compressing
332 the source element array) and another *completely separate* predicate
333 to the destination register, not just on Load/Stores but on *arithmetic*
334 operations.
335
336 No other Vector ISA in the world has this back-to-back
337 capability. All true Vector
338 ISAs have Predicate Masks: it is an absolutely essential characteristic.
339 However none of them have abstracted dual predicates out to the extent
340 where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
341 wide range of arithmetic
342 instructions, as well as Load/Store.
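
The back-to-back effect can be sketched with two independent element indices, each skipping over the zero bits of its own mask. This is a simplified model in non-zeroing (skipping) mode only; the function name and list-based vectors are assumptions for illustration.

```python
def twin_predicated(op, src, dst, src_mask, dst_mask, vl):
    """Apply op with separate source and destination predicates."""
    out = list(dst)
    i = j = 0
    while True:
        # skip masked-out source elements (the "compress" half)
        while i < vl and not (src_mask >> i) & 1:
            i += 1
        # skip masked-out destination elements (the "expand" half)
        while j < vl and not (dst_mask >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        out[j] = op(src[i])
        i += 1
        j += 1
    return out
```

With an all-ones destination mask this behaves like a compress; with an all-ones source mask it behaves like an expand; with both masks set, the two happen in a single operation.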
343
344 It is however important to note that not all instructions can be Twin
345 Predicated (2P): some remain only Single Predicated (1P), as is normally found
346 in other Vector ISAs. Arithmetic operations with
347 four registers (3-in, 1-out, VA-Form for example) are Single. The reason
348 is that there just wasn't enough space in the 24-bits of the SVP64 Prefix.
349 Consequently, when using a given instruction, it is necessary to look
350 up in the ISA Tables whether it is 1P or 2P. Caveat emptor!
351
352 Also worth a special mention: all Load/Store operations are Twin-Predicated.
353 The underlying key to understanding:
354
355 * one Predicate effectively applies to the Array of Memory *Addresses*,
356 * the other Predicate effectively applies to the Array of Memory *Data*.
357
358 # CR weird instructions
359
360 [[sv/cr_int_predication]] is by far the biggest violator of the SVP64
361 rules, for good reasons. Transfers between Vectors of CR Fields and Integers
362 for use as predicates is very awkward without them.
363
364 Normally, element width overrides allow the element width to be specified
365 as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
366 consuming either 1 bit or 4 bit elements (in effect) some adaptation was
367 required. When this perspective is taken (that results or sources are
368 1 or 4 bits) the weirdness starts to make sense, because the "elements",
369 such as they are, are still packed sequentially.
370
371 From a hardware implementation perspective however they will need special
372 handling as far as Hazard Dependencies are concerned, due to nonconformance
373 (bit-level management).
374
375 # mv.x (vector permute)
376
377 [[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
378 terms of Register Hazard Management that its addition to any Scalar
379 ISA is anathema. In a Traditional Vector ISA however, where the
380 indices are isolated behind a single Vector Hazard, there is no
381 problem at all. `sv.mv.x` is also fraught, precisely because it
382 sits on top of a Standard Scalar register paradigm, not a Vector
383 ISA with separate and distinct Vector registers.
384
385 To help partly solve this, `sv.mv.x` would have had to have
386 been made relative:
387
388 ```
389 for i in range(VL):
390     GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
391 ```
392
393 The reason for doing so is that MAXVL or VL may be used to limit
394 the number of Register Hazards that need to be raised to a fixed
395 quantity, at Issue time.
396
397 `mv.x` itself would still have to be added as a Scalar instruction,
398 but the behaviour of `sv.mv.x` would have to be different from that
399 Scalar version.
400
401 Normally, Scalar Instructions have a good justification for being
402 added as Scalar instructions on their own merit. `mv.x` is the
403 polar opposite, and in the end, the idea was thrown out, and Indexed
404 REMAP added in its place. Indexed REMAP comes with its own quirks,
405 solving the Hazard problem, described in a later section.
406
407 # REMAP and other reordering
408
409 There are several places in Simple-V which apply some sort of reordering
410 schedule to elements. srcstep and dststep do not themselves reorder:
411 they continue to march in sequence (VL-1 downto 0 in the case of reverse-gear).
412
413 It is perfectly legal to apply Parallel-Reduction on top of any type
414 of REMAP, for example, and it is possible to apply Pack/Unpack on a
415 REMAP as well.
416
417 The order of application of REMAP combined with Parallel-Reduction
418 should be logically obvious: REMAP has to come first because otherwise
419 how can the Parallel-Reduction perform a tree-walk?
420
421 Pack/Unpack on the other hand is best implemented as applying first,
422 because it is applied
423 as the inversion of the for-loops which generate the steps and substeps.
424 REMAP then applies to the src/dst-step indices (never to the subvl
425 step indices: that is SWIZZLE's job).
426
427 It's all perfectly logical, just a lot going on.
428
429 # Branch-Conditional
430
431 [[sv/branches]] are a very special exception to the rule that there
432 shall be no deviation from the corresponding
433 Scalar instruction. This is because of the tight
434 integration with looping and the application of Boolean Logic
435 manipulation needed for Parallel operations (predicate mask usage).
436 This results in an extremely important observation that `scalar identity
437 behaviour` is violated: the SV Prefixed variant of branch is **not** the same
438 operation as the unprefixed 32-bit scalar version.
439
440 One key difference is that LR is only updated if certain additional
441 conditions are met, whereas Scalar `bclrl` for example unconditionally
442 overwrites LR.
443
444 Another is that the Vectorised Branch-Conditional instructions are the
445 only ones where there are side-effects on predication when skipping
446 is enabled. This is so as to be able to use CTR to count down
447 *masked-out* elements.
448
449 Well over 500 Vectorised branch instructions exist in SVP64 due to the
450 number of options available: close integration and interaction with
451 the base Scalar Branch was unavoidable in order to create Conditional
452 Branching suitable for parallel 3D / CUDA GPU workloads.
453
454 # Saturation
455
456 The application of Saturation as a retro-fit to a Scalar ISA is challenging.
457 It does help that within the SFFS Compliancy subset there are no Saturated
458 operations at all: they are only added in VSX.
459
460 Saturation does not inherently change the instruction itself: it does however
461 come with some fundamental implications, when applied. For example:
462 a Floating-Point operation that would normally raise an exception will
463 no longer do so, instead setting the CR1.SO Flag. Another quirky
464 example: signed operations which produce a negative result will be
465 truncated to zero if Unsigned Saturation is requested.
466
467 One very important aspect for implementors is that the operation in
468 effect has to be considered to be performed at infinite precision,
469 followed by saturation detection. In practice this does not actually
470 require infinite precision hardware! Two 8-bit integers being
471 added can only ever overflow into a 9-bit result.
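
The "exact result first, then clamp" model can be sketched with Python's arbitrary-precision integers standing in for the infinite-precision intermediate. The function name and parameters are hypothetical.

```python
def saturating_add(a, b, elwidth=8, signed=True):
    """Compute a + b exactly, then clamp to the element's range."""
    exact = a + b                      # Python ints never overflow
    if signed:
        lo = -(1 << (elwidth - 1))
        hi = (1 << (elwidth - 1)) - 1
    else:
        lo = 0
        hi = (1 << elwidth) - 1
    return max(lo, min(hi, exact))
```

The third case in the usage below is the quirk mentioned above: a negative result under Unsigned Saturation clamps to zero.

```python
saturating_add(100, 100)                # clamps to 127
saturating_add(-100, -100)              # clamps to -128
saturating_add(5, -9, signed=False)     # clamps to 0
```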
472
473 Overall some care and consideration needs to be applied.
474
475 # Fail-First
476
477 Fail-First (both the Load/Store and Data-Dependent variants)
478 is worthy of a special mention in its own right. Where VL is
479 normally forward-looking and may be part of a pre-decode phase
480 in a (simplified) pipelined architecture with no Read-after-Write Hazards,
481 Fail-First changes that because at any point during the execution
482 of the element-level instructions, one of those elements may not only
483 terminate further continuation of the hardware-for-looping but also
484 effect a change of VL:
485
486 ```
487 for i in range(VL):
488     result = element_operation(GPR(RA+i), GPR(RB+i))
489     if test(result):
490         VL = i
491         break
492 ```
493
494 This is not exactly a violation of SVP64 Rules, more of a breakage
495 of user expectations, particularly for LD/ST, where exceptions
496 would normally be expected to be raised: Fail-First provides for
497 avoidance of those exceptions.
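
A runnable rendering of the Data-Dependent Fail-First loop is sketched below. It follows the pseudocode in truncating VL to the index of the first failing element. Writing the passing results back to `RT+i` is an assumption added to make the effect visible, and all names are illustrative.

```python
def ddffirst(op, test, gpr, rt, ra, rb, vl):
    """Run op over up to vl elements; return the (possibly truncated) VL."""
    for i in range(vl):
        result = op(gpr[ra + i], gpr[rb + i])
        if test(result):
            return i          # VL is truncated; later elements never run
        gpr[rt + i] = result  # assumed write-back of passing elements
    return vl


gpr = [0] * 32
gpr[4:8] = [5, 3, 1, 7]
gpr[8:12] = [1, 1, 1, 1]
new_vl = ddffirst(lambda a, b: a - b, lambda r: r == 0,
                  gpr, rt=16, ra=4, rb=8, vl=4)
```

Subsequent instructions that use VL then operate only on the elements before the point of failure.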
498
499 For Hardware implementers, a standard Out-of-Order micro-architecture
500 allows for Cancellation of speculatively-executed elements that extended
501 beyond the Vector Truncation point. In-order systems will have a slightly
502 harder time and may choose to execute one element only at a time, reducing
503 performance as a result.
504
505 # OE=1
506
507 The hardware cost of Sticky Overflow in a parallel environment is immense.
508 The SFFS Compliancy Level is permitted optionally to support XER.SO.
509 Therefore the decision is made to make it mandatory **not** to
510 support XER.SO. However, CR.SO *is* supported such that when Rc=1
511 is set the CR.SO flag will contain only the overflow of
512 the current instruction, rather than being actually "sticky".
513 Hardware Out-of-Order designers will recognise and appreciate
514 that the Hazards are
515 reduced to Read-After-Write (RAW) and that the WAR Hazard is removed.
516
517 This is sort-of a quirk and sort-of not, because the option to support
518 XER.SO is already optional from the SFFS Compliancy Level.
519
520 # Indexed REMAP and CR Field Predication Hazards
521
522 Normal Vector ISAs and those Packed SIMD ISAs inspired by them have
523 Vector "Permute" or "Shuffle" instructions. These provide a Vector of
524 indices whereby another Vector is reordered (permuted, shuffled) according
525 to the indices. Register Hazard Management here is trivial because there
526 are three registers: indices source vector, elements source vector to
527 be shuffled, result vector.
528
529 For SVP64 which is based on top of a Scalar Register File paradigm,
530 combined with the hard requirement to respect full Register Hazard
531 Management as if element instructions were actual Scalar instructions,
532 the addition of a Vector permute instruction under these strict
533 conditions would result in a catastrophic
534 reduction in performance, due to having to consider Read-after-Write
535 and Write-after-Read Hazards *at the element level*.
536
537 A little leniency and rule-bending is therefore required.
538
539 Rather than add explicit Vector permute instructions, the "Indexing"
540 has been separated out into a REMAP Schedule. When an Indexed
541 REMAP is requested, it is assumed (required, of software) that
542 subsequent instructions intending to use those indices *will not*
543 attempt to modify the indices. It is *Software* that must consider them
544 to be read-only.
545
546 This simple relaxation of the rules releases Hardware from having the
547 horrendous job of dynamically detecting Write-after-Read Hazards on a
548 huge range of registers.
549
550 A similar Hazard problem exists for CR Field Predicates, in Vertical-First
551 Mode. Instructions could modify CR Fields currently being used as Predicate
552 Masks: detecting this is so horrendous for hardware resource utilisation
553 and hardware complexity that, again, the decision is made to relax these
554 constraints and for Software to take that into account.
555
556 # Floating-Point "Single" becomes "Half"
557
558 In several places in the Power ISA there are operations that are on
559 32-bit quantities in 64-bit registers. The best example is FP which
560 has 64-bit operations (`fadd`) and 32-bit operations (`fadds` or
561 FP Add "single"). Element-width overrides would seem to
562 be unnecessary, under these circumstances.
563
564 However, it is not possible for `fadds` to fit two elements into
565 64-bit: that breaks the simplicity of SVP64.
566 Bear in mind that the FP32 bits are spread out across a 64
567 bit register in FP64 format. The solution here was to consider the
568 "s" at the end of each instruction
569 to mean "half of the element's width". Thus, `sv.fadds/ew=32`
570 actually stores an FP16 spread out across the 32 bits of an
571 element, in FP32 format, where `sv.fadd/ew=32` stores a full
572 FP32 result into the full 32 bits.
573
574 Where this breaks down is when attempting to do half-width on
575 BF16 or FP16 operations: there does not exist a BF8 or an IEEE754 FP8
576 format, so these (`sv.fadds/ew=8`) should be avoided.
577
578 # Vertical-First and Subvectors
579
580 Documented in the [[sv/setvl]] page, Vertical-First goes through
581 instructions first and elements second, and requires an explicit
582 [[sv/svstep]] instruction to move to the next element
583 (whereas Horizontal-First
584 loops through elements in full first before moving on to
585 the next instruction). *Subvectors are considered "elements"*
586 in Vertical-First Mode.
587
588 It is conceptually quite easy to keep in mind that a Vertical-First
589 instruction does one element at a time, and when SUBVL is set,
590 that "element" in essence becomes a vec2/3/4.
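
The difference in execution order between the two Modes can be sketched by listing the (instruction, element) pairs in the order they would execute. This is a deliberately simplified model: the explicit `svstep` of Vertical-First is represented by the outer loop, and the function names are hypothetical.

```python
def horizontal_first(instructions, vl):
    """Each instruction runs over all elements before the next begins."""
    return [(insn, i) for insn in instructions for i in range(vl)]


def vertical_first(instructions, vl):
    """All instructions run on one element, then the step moves on."""
    return [(insn, i) for i in range(vl) for insn in instructions]
```

When SUBVL is set, each `i` in the Vertical-First case stands for a whole vec2/3/4 rather than a single scalar element.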
591
592 # Swizzle and Pack/Unpack
593
594 These are both so weird it's best to just read the pages in full
595 and pay attention: [[sv/mv.swizzle]] and [[sv/mv.vec]].
596 Swizzle Moves only engage with vec2/3/4, *reordering* the copying
597 of the sub-vector elements (including allowing repeats and skips)
598 based on an immediate supplied by the instruction. The fun
599 comes when Pack/Unpack are enabled, and it is really important
600 to be aware how the Arrays of vec2/3/4 become re-ordered
601 *and swizzled at the same time*.
602
603 Pack/Unpack started out as
604 [[sv/mv.vec]] but became its own distinct Mode over time.
605 The main thing to keep in mind about Pack/Unpack
606 is that it engages a swap of the ordering of the VL-SUBVL
607 nested for-loops, in exactly the same way that Matrix REMAP
608 can do.
609 When Pack or Unpack is enabled it is the SUBVL for-loop
610 that becomes outermost. A bit of thought shows that this is
611 a 2D "Transpose" where Dimension X is VL and Dimension Y is SUBVL.
612 However *both* source *and* destination may be independently
613 "Transposed", which makes no sense at all until the fact that
614 Swizzle can have a *different SUBVL* is taken into account.
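
The swap of the nested for-loops can be sketched by generating the (element, sub-element) visiting order in each case. The function names are illustrative; only the loop nesting is taken from the description above.

```python
def normal_order(vl, subvl):
    """Default nesting: VL loop outermost, SUBVL loop innermost."""
    return [(i, s) for i in range(vl) for s in range(subvl)]


def packed_order(vl, subvl):
    """Pack/Unpack enabled: the SUBVL loop becomes outermost."""
    return [(i, s) for s in range(subvl) for i in range(vl)]
```

For VL=2 and SUBVL=3 the first order walks each vec3 in full before moving on, whilst the second walks all the first components, then all the second, and so on: a 2D transpose of the visiting order.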
615
616 Basically Pack/Unpack covers everything that VSX `vpkpx` and
617 other ops can do, and then some: Saturation included, for arithmetic ops.
618
619 # LD/ST with zero-immediate vs mapreduce mode
620
621 LD/ST operations with a zero immediate effectively mean that on a
622 Vector operation the element index to offset the memory location is
623 multiplied by zero. Thus, a sequence of LD operations will load from
624 the exact same address, and likewise STs to the exact same address.
625
626 Ordinarily this would make absolutely no sense whatsoever, except
627 that Power ISA has cache-inhibited LD/STs (Power ISA v.1, Book III,
628 1.6.1, p1033), for accessing memory-mapped
629 peripherals and other crucial uses. Thus, *despite not being a mapreduce mode*,
630 zero-immediates cause multiple hits on the same element.
631
Mapreduce mode is not actually mapreduce at all: it is a relaxation of
the normal rule that, if the destination is a Scalar, the Vector
for-looping terminates on first write to the destination. In this mode
the looping continues. The developer is expected to exploit the strict
Program Order and make one of the sources the same as that Scalar
destination, effectively making that Scalar register an "Accumulator".
This creates the *appearance* (effect) of Simple-V having a mapreduce
capability, when in fact it is more of an artefact.
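
A minimal sketch of this behaviour, assuming a simplified register file held in a Python list and a hypothetical `sv_add` helper (the parameter names are illustrative, not taken from the specification):

```python
def sv_add(regs, rt, ra, rb, VL, rt_vec, ra_vec, rb_vec, mapreduce=False):
    """Element-by-element add in strict Program Order (simplified sketch)."""
    for i in range(VL):
        d = rt + i if rt_vec else rt  # Vector registers step, Scalars do not
        a = ra + i if ra_vec else ra
        b = rb + i if rb_vec else rb
        regs[d] = regs[a] + regs[b]
        if not rt_vec and not mapreduce:
            break  # normal rule: stop after the first write to a Scalar dest
    return regs

def fresh_regs():
    regs = [0] * 16
    regs[8:12] = [1, 2, 3, 4]  # Vector source in r8..r11
    return regs

# Scalar r3 is both destination and source: it acts as an "Accumulator"
normal = sv_add(fresh_regs(), 3, 3, 8, 4, False, False, True)
reduced = sv_add(fresh_regs(), 3, 3, 8, 4, False, False, True, mapreduce=True)

print(normal[3])   # 1
print(reduced[3])  # 10
```

Nothing here performs a reduction as such: the sum only appears because the loop is allowed to keep overwriting the same Scalar register.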
640
641 LD/ST zero-immediate has similar quirky overwriting as the "mapreduce"
642 mode, but actually requires the registers to be Vectors. It is simply
643 a mathematical artefact of multiplying by zero, which happens to be
644 useful for cache-inhibited operations.
645
646 # Limited space in LD/ST Mode
647
648 As pointed out in the [[sv/ldst]] page there is limited space in only
649 5 mode bits to fully express all potential modes of operation.
650
651 * LD/ST Immediate has no individual control over src/dest zeroing,
652 whereas LD/ST Indexed does.
653 * LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
654 * LD/ST Indexed has no Pack/Unpack, whereas LD/ST Immediate does.
655
These are not insurmountable problems: workarounds do exist.
For example it is possible to set up Matrix REMAP to perform the same
job as Pack/Unpack, at which point the LD/ST "Saturation" mode may
be used, saving on the costly intermediary registers *at double the LD
width* that would be needed if a Saturated MV had to be involved. For
Store, on the other hand, it is extremely likely that an arithmetic
operation has already computed a Saturated Vector of results, so Store
is less of a problem than Load.
663
Also, the LD/ST Indexed Mode can be element-strided (RB as
a Scalar, multiplied by the element index). If that is not enough then,
although potentially costly, it is possible to
use `svstep` to compute a Vector RB sequence of
Indices, then activate either `sz` or `dz` as required, as a workaround
for LD/ST Immediate only having `zz`.
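
The two Indexed addressing options can be contrasted with a short sketch. The function `indexed_addresses` is hypothetical, the address arithmetic is deliberately simplified, and the Vector of indices is simply written out by hand rather than computed with `svstep`.

```python
def indexed_addresses(ra, rb, VL):
    """Per-element addresses for a simplified LD/ST Indexed model."""
    if isinstance(rb, int):
        # RB as a Scalar: element-strided, RB times the element index
        return [ra + rb * i for i in range(VL)]
    # RB as a Vector: one arbitrary index per element
    return [ra + rb[i] for i in range(VL)]

strided = indexed_addresses(0x2000, 16, 4)
gathered = indexed_addresses(0x2000, [0, 48, 8, 32], 4)

print([hex(a) for a in strided])   # ['0x2000', '0x2010', '0x2020', '0x2030']
print([hex(a) for a in gathered])  # ['0x2000', '0x2030', '0x2008', '0x2020']
```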
671
672 Simple-V is powerful but it cannot do everything! There is just not
673 enough space and so some compromises had to be made.
674
675 # sv.mtcr on entire 64-bit Condition Register
676
677 Normally, CR operations are either bit-based (where the element numbering actually
678 applies to the CR Field) or field-based in which case the elements are still
679 fields. The `sv.mtcr` and other instructions are actually full 64-bit Condition
680 *Register* operations and are therefore qualified as Normal/Arithmetic not
681 CRops.
682
This is to save on both Vector Length (a VL of 16 is sufficient) and
complexity in the Hazard Management when context-switching CR fields, as the
entire batch of 128 CR Fields may be transferred to 8 GPRs with a VL of 16
and elwidth overriding of 32. Truncation is sufficient, dropping the top 32 bits
of the Condition Register(s), which are always zero anyway.
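
The arithmetic behind those numbers can be checked with a short sketch. The helper `pack_into_gprs` is hypothetical, and the placement of the lower-numbered element in the low half of each GPR is an assumption made purely for illustration.

```python
CR_FIELDS = 128     # total CR Fields, from the text
BITS_PER_FIELD = 4  # each CR Field is 4 bits
ELWIDTH = 32        # element width override, from the text
GPR_WIDTH = 64

fields_per_element = ELWIDTH // BITS_PER_FIELD  # 8 CR Fields per element
VL = CR_FIELDS // fields_per_element            # 16 elements
gprs_needed = (VL * ELWIDTH) // GPR_WIDTH       # 8 GPRs

def pack_into_gprs(cr_values):
    """Truncate each 64-bit CR value to ELWIDTH bits and pack two per GPR.

    Assumption for this sketch: the lower-numbered element occupies
    the low half of the GPR.
    """
    mask = (1 << ELWIDTH) - 1
    per_gpr = GPR_WIDTH // ELWIDTH
    gprs = []
    for g in range(0, len(cr_values), per_gpr):
        word = 0
        for k, value in enumerate(cr_values[g:g + per_gpr]):
            word |= (value & mask) << (k * ELWIDTH)
        gprs.append(word)
    return gprs

# Sixteen 64-bit Condition Register values whose top 32 bits are zero
crs = [0x11111111 * (n % 15 + 1) for n in range(VL)]
gprs = pack_into_gprs(crs)

print(VL, gprs_needed, len(gprs))  # 16 8 8
print(hex(gprs[0]))                # 0x2222222211111111
```

Because the top 32 bits of each value are zero, the truncation loses nothing: 16 elements of 32 bits account for all 512 bits of the 128 CR Fields.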
688
689 # Separate Scalar and Vector Condition Register files
690
As explained in the introduction [[sv/svp64]] and in [[sv/cr_ops]],
Scalar Power ISA lacks the "Conditional Execution" that has been present
in the ARM Scalar ISA for several decades. When Vectorised, the fact that
Rc=1 Vector results can immediately be used as a Predicate Mask
back into the following instruction can result in large latency
unless "Vector Chaining" is used in the Micro-Architecture.
697
That is an aside, however, and not the main problem faced by the
introduction of Simple-V to the Power ISA: the main problem is that the
existing implementations (IBM) do not have "Conditional Execution", and
adding it to their existing designs would be too disruptive a first step.
702
A compromise is to blank out certain entries in the Register Dependency
Matrices by prohibiting some operations involving the two groups
of CR Fields: those that fall into the existing Scalar 32-bit CR
(fields CR0-CR7) and those that fall into the newly-introduced
CR Fields, CR8-CR127.
708
709 This will drive compiler writers nuts, and give assembler writers headaches,
710 but it gives IBM the opportunity to implement SVP64 without massive
711 disruption. They can add an entirely new Vector CR register file,
712 new pipelines etc safe in the knowledge that existing Scalar HDL
713 needs no modification.