# The Rules
2
3 [[!toc]]
4
5 SVP64 is designed around fundamental and inviolate RISC principles.
6 This gives a uniformity and regularity to the ISA, making implementation
7 straightforward, which was why RISC
8 as a concept became popular.
9
10 1. There are no actual Vector instructions: Scalar instructions
11 are the sole exclusive bedrock.
12 2. No scalar instruction ever deviates in its encoding or meaning
13 just because it is prefixed (semantic caveats below)
14 3. A hardware-level for-loop (the prefix) makes vector elements
15 100% synonymous with scalar instructions (the suffix)
16
17 How can a Vector ISA even exist when no actual Vector instructions
18 are permitted to be added? It comes down to the strict RISC abstraction.
19 First you start from a **scalar** instruction (32-bit). Second, the
20 Prefixing is applied *in the abstract* to give the *appearance*
21 and ultimately the same effect as if an explicit Vector instruction
had also been added. The pseudocode of any Vector ISA
(RVV, NEC SX Aurora, Cray)
always comprises (a) a for-loop around (b) element-based operations.
It is perfectly reasonable and rational to separate (a) from (b),
then find a powerful pre-existing
Supercomputing-class ISA that qualifies for (b).
28
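The separation of (a) from (b) can be sketched in a few lines of Python. This is a minimal illustration only, not the specification's pseudocode: the names `scalar_add`, `sv_loop` and the flat `regs` list are hypothetical, and predication, element-width and all modes are ignored.

```python
def scalar_add(a, b):
    """(b) the unmodified scalar operation (the suffix), on 64-bit values."""
    return (a + b) & 0xFFFF_FFFF_FFFF_FFFF


def sv_loop(op, regs, rt, ra, rb, vl):
    """(a) the prefix: a for-loop issuing the scalar op once per element."""
    for i in range(vl):
        regs[rt + i] = op(regs[ra + i], regs[rb + i])


# hypothetical register file: a plain list standing in for the GPRs
regs = [0] * 32
regs[8:12] = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]
sv_loop(scalar_add, regs, rt=24, ra=8, rb=16, vl=4)
```
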
29 There are a few exceptional places where these rules get
30 bent, and others where the rules take some explaining,
31 and this page tracks them all.
32
The caveat in (2) above concerns element-width overrides. These are
semantically exempt because they
still do not actually modify the meaning of the instruction:
an add remains an add, even if its override makes it an 8-bit add rather than
a 64-bit add. Even add-with-carry remains an add-with-carry: it's just
that when elwidth=8 in the Prefix it's an *8-bit* add-with-carry
where the 9th bit becomes Carry-out (not the 65th bit).
In other words, elwidth overrides **definitely** do not fundamentally
alter the actual
Scalar v3.0 ISA encoding itself. Consequently, in
the strictest semantic sense, rule (2) is still not broken.
44
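The carry-out position can be illustrated with a small sketch. The function name and the `elwidth` parameter are hypothetical: the point is only that the same add-with-carry is performed, with the carry taken from bit 8 rather than bit 64 when the element width is 8.

```python
def add_with_carry(a, b, carry_in, elwidth=64):
    """Same operation at any width: only the carry-out position moves."""
    mask = (1 << elwidth) - 1
    total = (a & mask) + (b & mask) + carry_in
    # the carry-out is the bit immediately above the element width
    return total & mask, (total >> elwidth) & 1


result8, carry8 = add_with_carry(0xFF, 0x01, 0, elwidth=8)
result64, carry64 = add_with_carry(0xFF, 0x01, 0, elwidth=64)
```
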
Likewise, other "modifications" such as saturation or Data-dependent
Fail-First are actually post-augmentation or post-analysis, and do
not fundamentally change an add operation into a subtract,
for example. Under absolutely no circumstances do the actual 32-bit
Scalar v3.0 operand field bits change, nor does the number of operands.
50
51 In an early Draft of SVP64,
52 an experiment was attempted, to modify LD-immediate instructions
53 to include a
54 third RC register i.e. reinterpret the normal
55 v3.0 32-bit instruction as a completely
56 different encoding if SVP64-prefixed. It did not go well.
57 The complexity that resulted
58 in the decode phase was too great. The lesson was learned, the
59 hard way: it would be infinitely preferable
60 to add a 32-bit Scalar Load-with-Shift
61 instruction *first*, which then inherently becomes Vectorised.
62 Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
63 both ARM and x86 have it, because it saves greatly on instruction count in
64 hot-loops.
65
66 The other reason for not adding an SVP64-Prefixed instruction without
67 also having it as a Scalar un-prefixed instruction is that if the
68 32-bit encoding is ever allocated in a future revision
69 of the Power ISA
70 to a completely unrelated operation
71 then how can a Vectorised version of that new instruction ever be added?
72 The uniformity and RISC Abstraction is irreparably damaged.
73 Bottom line here is that the fundamental RISC Principle is strictly adhered
74 to, even though these are Advanced 64-bit Vector instructions.
75 Advocates of the RISC Principle will appreciate the uniformity of
76 SVP64 and the level of systematic abstraction kept between Prefix and Suffix.
77
# Instruction Groups
79
80 The basic principle of SVP64 is the prefix, which contains mode
81 as well as register augmentation and predicates. When thinking of
82 instructions and Vectorising them, it is natural for arithmetic
83 operations (ADD, OR) to be the first to spring to mind.
84 Arithmetic instructions have registers, therefore augmentation
85 applies, end of story, right?
86
Except, Load and Store also deal with Memory, not just registers.
88 Power ISA has Condition Register Fields: how can element widths
89 apply there? And branches: how can you have Saturation on something
90 that does not return an arithmetic result? In short: there are actually
91 four different categories (five including those for which Vectorisation
92 makes no sense at all, such as `sc` or `mtmsr`). The categories are:
93
94 * arithmetic/logical including floating-point
95 * Load/Store
96 * Condition Register Field operations
97 * branch
98
99 **Arithmetic**
100
101 Arithmetic (known as "normal" mode) is where Scalar and Parallel
102 Reduction can be done: Saturation as well, and two new innovative
103 modes for Vector ISAs: data-dependent fail-first and predicate result.
104 Reduction and Saturation are common to see in Vector ISAs: it is just
105 that they are usually added as explicit instructions,
106 and NEC SX Aurora has even more iterative instructions. In SVP64 these
107 concepts are applied in the abstract general form, which takes some
108 getting used to.
109
Reduction may, when applied incorrectly to non-commutative
instructions, produce invalid results, but ultimately
it is critical to think in terms of the "rules", that everything is
Scalar instructions in strict Program Order. Reduction on non-commutative
Scalar Operations is not *prohibited*: the strict Program Order allows
the programmer to think through what would happen and thus potentially
actually come up with a legitimate use.
117
118 **Branches**
119
120 Branch is the one and only place where the Scalar
121 (non-prefixed) operations differ from the Vector (element)
122 instructions, as explained in a separate section.
123 The
124 RM bits can be used for other purposes because the Arithmetic modes
125 make no sense at all for a Branch.
126 Almost the entire
127 SVP64 RM Field is interpreted differently from other Modes, in
128 order to support a wide range of parallel boolean condition options
129 which are expected of a Vector / GPU ISA. These save a considerable
130 number of instructions in tight inner loop situations.
131
132 **CR Field Ops**
133
Condition Register Fields are 4 bits wide and consequently element-width
135 overrides make absolutely no sense whatsoever. Therefore the elwidth
136 override field bits can be used for other purposes when Vectorising
137 CR Field instructions. Moreover, Rc=1 is completely invalid for
138 CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
139 a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
140 such as predicate-result make no sense, and neither does Saturation.
141 All of these differences, which require quite a lot of logical
142 reasoning and deduction, help explain why there is an entirely different
143 CR ops Vectorisation Category.
144
A particularly strange quirk of CR-based Vector Operations is that the
Scalar Power ISA CR Register is 32 bits, but actually comprises eight
CR Fields, CR0-CR7. With each CR Field being four bits (LT, GT, EQ, SO)
this makes up 32 bits, and therefore a CR operand referring to one bit
of the CR will be 5 bits in length (BA, BT).
*However*, some instructions refer
to a *CR Field* (CR0-CR7) and consequently these operands
(BF, BFA etc) are only 3 bits.
153
(*It helps here to think of the top 3 bits of BA as referring
to a CR Field, like BFA does, and the bottom 2 bits of BA
referring to
LT/GT/EQ/SO within that Field*)
158
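The split of a 5-bit CR bit operand into a Field number and a bit within that Field can be sketched as follows. The helper name is hypothetical. The bit order used (LT, GT, EQ, SO, most significant first) is that of the Scalar Power ISA.

```python
# bit order within a CR Field in the Scalar Power ISA, most significant first
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")


def split_cr_bit(ba):
    """Split a 5-bit CR bit number into (CR Field number, bit name)."""
    if not 0 <= ba < 32:
        raise ValueError("a scalar CR bit operand is 5 bits")
    # top 3 bits select the Field (as BFA would), bottom 2 select the bit
    return ba >> 2, CR_BIT_NAMES[ba & 0b11]
```
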
159 With SVP64 extending the number of CR *Fields* to 128, the number of
160 32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
161 (8 per CR Register). Then, it gets even more strange, when it comes
162 to Vectorisation, which applies to the CR Field *numbers*. The
163 hardware-for-loop for Rc=1 for example starts at CR0 for element 0,
164 and moves to CR1 for element 1, and so on. The reason here is quite
165 simple: each element result has to have its own CR Field co-result.
166
167 In other words, the
168 element is the 4-bit CR *Field*, not the bits *of* the 32-bit
169 CR Register, and not the CR *Register* (of which there are now 16).
170 All quite logical, but a little mind-bending.
171
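A small sketch may help keep the three levels apart. It assumes, purely for illustration, that the 16 CR Registers are numbered 0 to 15 and hold their 8 Fields in order. Both helper names are hypothetical.

```python
FIELDS_PER_CR_REGISTER = 8
NUM_CR_FIELDS = 128


def cr_field_location(field):
    """Return (CR Register index, slot within it) for a CR Field number."""
    if not 0 <= field < NUM_CR_FIELDS:
        raise ValueError("SVP64 has 128 CR Fields")
    return divmod(field, FIELDS_PER_CR_REGISTER)


def rc1_co_result_fields(vl, first_field=0):
    """Element i of an Rc=1 operation gets its own CR Field: first_field+i."""
    return [first_field + i for i in range(vl)]
```
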
172 **Load/Store**
173
174 LOAD/STORE is another area that has different needs: this time it is
175 down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
176 which simply make no sense in a RISC Scalar ISA: element-stride and
177 unit-stride and the entire concept of a stride itself (a spacing
178 between elements) has no place at all in a Scalar ISA. The problems
179 come when trying to *retrofit* the concept of "Vector Elements" onto
a Scalar ISA. Consequently a couple of bits (Modes) were required in the SVP64
RM Prefix to convey the stride mode, changing the Effective Address
computation as a result. Worth noting for Hardware
designers: it did turn out to be possible to perform pre-multiplication
of the D/DS Immediate by the stride amount, making it possible to avoid
actually modifying the LD/ST Pipeline itself.
186
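The effect of the stride mode on the Effective Address can be sketched as below. The two formulae are assumptions made for illustration and should be checked against the Load/Store specification. The function name, the `mode` strings and `op_width` are hypothetical. In the element-strided case the immediate is pre-multiplied by the element index, so that what reaches the pipeline is still an ordinary base-plus-offset.

```python
def effective_addresses(base, imm, vl, op_width, mode):
    """Per-element Effective Addresses for an immediate-offset LD/ST.

    Assumed formulae (illustrative only):
      unit-stride:    EA = base + imm + i * op_width
      element-stride: EA = base + imm * i
    """
    if mode == "unit":
        return [base + imm + i * op_width for i in range(vl)]
    if mode == "element":
        # pre-multiplied immediate: the pipeline still sees base + offset
        return [base + imm * i for i in range(vl)]
    raise ValueError(f"unknown stride mode: {mode}")


unit = effective_addresses(0x1000, 8, vl=4, op_width=8, mode="unit")
strided = effective_addresses(0x1000, 16, vl=4, op_width=8, mode="element")
```
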
Other areas where LD/ST went quirky: element-width overrides, especially
when combined with Saturation, given that LD/ST operations have byte,
halfword, word, dword and quad variants. The interaction between these
widths as part of the actual operation, and the source and destination
elwidth overrides, was particularly obscure and hard to derive: some care
and attention is advised here when reading the specification,
especially on arithmetic loads (lbarx, lharx etc.).
194
195 **Non-vectorised**
196
The concept of a Vectorised halt (`attn`) makes no sense. There is never
going to be a Vector of global MSRs (Machine State Register). `mtcr`
on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
Even `td` and `tdi` make a strange type of sense to
Vectorise, because a sequence of comparisons could be Vectorised.
Vectorised System Calls (`sc`), `tlbie` and other Cache or Virtual
Memory Management
instructions make no sense to Vectorise.
205
However, it is really quite important to not be tempted to conclude that
just because these instructions are un-vectoriseable, the opcode space
must be free for reinterpretation and use for other purposes. This would
be a serious mistake because a future revision of the specification
might *retire* the Scalar instruction and replace it with another.
211 Again this comes down to being quite strict about the rules: only Scalar
212 instructions get Vectorised: there are *no* actual explicit Vector
213 instructions.
214
215 **Summary**
216
217 Where a traditional Vector ISA effectively duplicates the entirety
218 of a Scalar ISA and then adds additional instructions which only
219 make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
220 considerable lengths to keep strictly to augmentation and embedding
221 of an entire Scalar ISA's instructions into an abstract Vectorisation
222 Context. That abstraction subdivides down into Categories appropriate
223 for the type of operation (Branch, CRs, Memory, Arithmetic),
224 and each Category has its own relevant but
225 ultimately rational quirks.
226
# Abstraction between Prefix and Suffix
228
229 In the introduction paragraph, a great fuss was made emphasising that
230 the Prefix is kept separate from the Suffix. The whole idea there is
231 that a Multi-issue Decoder and subsequent pipelines would in no way have
232 "back-propagation" of state that can only be determined far too late.
233 This *has* been preserved, however there is a hiccup.
234
Power ISA 3.1 introduced a 64-bit Prefix, EXT001.
236 The encoding of the prefix has 6 bits that are dedicated to letting
237 the hardware know what the remainder of the Prefix bits mean: how they
238 are formatted, even without having to examine the Suffix to which
239 they are applied.
240
241 SVP64 has such pressure on its 24-bit encoding that it was simply
242 not possible to perform the same trick used by Power ISA 3.1 Prefixing.
243 Therefore, rather unfortunately, it becomes necessary to perform
244 a *partial decoding* of the v3.0 Suffix before the 24-bit SVP64 RM
245 Fields may be identified. Fortunately this is straightforward, and
246 does not rely on any outside state, and even more fortunately
247 for a Multi-Issue Execution decoder, the length 32/64 is also
248 easy to identify by looking for the EXT001 pattern. Once identified
249 the 32/64 bits may be passed independently to multiple Decoders in
250 parallel.
251
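Length identification can be sketched as a check on the primary opcode, which is the top six bits of the 32-bit word and has the value 1 for EXT001. This is a simplified illustration: the function names are hypothetical, a real decoder may need to examine further Prefix bits, and the splitting is shown serially here purely for clarity.

```python
def primary_opcode(word):
    """Top 6 bits of a 32-bit instruction word."""
    return (word >> 26) & 0x3F


def split_stream(words):
    """Group 32-bit words into 32-bit or 64-bit (EXT001-prefixed) units."""
    groups = []
    i = 0
    while i < len(words):
        if primary_opcode(words[i]) == 1 and i + 1 < len(words):
            groups.append((words[i], words[i + 1]))  # prefix + suffix
            i += 2
        else:
            groups.append((words[i],))
            i += 1
    return groups


# add r1,r2,r3 / an EXT001 word followed by its suffix / addi r3,0,1
stream = [0x7C221A14, (1 << 26) | 0x123, 0x7C221A14, 0x38600001]
groups = split_stream(stream)
```
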
# Predication
253
Predication is entirely missing from the Power ISA.
Adding it directly would be a costly mistake because it cannot be retrofitted
to an ISA without literally duplicating all instructions. Prefixing
is about the only sane way to go.
258
CR Fields as predicate masks could be spread across multiple register
file entries, making them costly to read in one hit. Hardware may therefore
read them incrementally as the for-loop progresses, so the
possibility exists that an instruction element writing to a CR Field
could *overwrite* the Predicate mask CR Vector during the middle of
a for-loop.
264
265 Clearly this is bad, so don't do it. If there are potential issues
266 they can be avoided by using the crweird instructions to get CR Field
267 bits into an Integer GPR (r3, r10 or r30) and use that GPR as a
268 Predicate mask instead.
269
270 Even in Vertical-First Mode, which is a single Scalar instruction executed
271 with "offset" registers (in effect), the rule still applies: don't write
272 to the same register being used as the predicate, it's `UNDEFINED`
273 behaviour.
274
## Single Predication
276
277 So named because there is a Twin Predication concept as well, Single
278 Predication is also unlike other Vector ISAs because it allows zeroing
279 on both the source and destination. This takes some explaining.
280
In Vector ISAs, there is a Predicate Mask which applies to the
destination only, and there
is a choice of actions when a Predicate Mask bit
is zero:
285
286 * set the destination element to zero
287 * skip that element operation entirely, leaving the destination unmodified
288
The problem comes if the underlying register file SRAM has, say, 64-bit
write granularity but the Vector elements are, say, 8 bits wide.
291 Some Vector ISAs strongly advocate Zeroing because to leave one single
292 element at a small bitwidth in amongst other elements where the register
293 file does not have the prerequisite access granularity is very expensive,
294 requiring a Read-Modify-Write cycle to preserve the untouched elements.
295 Putting zero into the destination avoids that Read.
296
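The two choices (zeroing and skipping) on a destination-only Predicate Mask, as found in conventional Vector ISAs, can be sketched as follows. The function name and the integer bitmask representation are hypothetical, chosen only for illustration.

```python
def predicated_op(op, src, dest, mask, zeroing):
    """Apply op element-wise under a destination predicate mask.

    Bit i of mask set:  dest[i] = op(src[i])
    Bit i clear:        dest[i] = 0 if zeroing, otherwise left untouched.
    """
    out = list(dest)
    for i in range(len(src)):
        if (mask >> i) & 1:
            out[i] = op(src[i])
        elif zeroing:
            out[i] = 0
    return out


src = [1, 2, 3, 4]
dest = [9, 9, 9, 9]
zeroed = predicated_op(lambda x: x * 10, src, dest, 0b0101, zeroing=True)
skipped = predicated_op(lambda x: x * 10, src, dest, 0b0101, zeroing=False)
```
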
297 This is technically very easy to solve: use a Register File that does
298 in fact have the smallest element-level write-enable granularity.
299 If the elements are 8 bit then allow 8-bit writes!
300
301 With that technical issue solved there is nothing in the way of choosing
302 to support both zeroing and non-zeroing (skipping) at the ISA level:
303 SV chooses to further support both *on both the source and destination*.
304 This can result in the source and destination
305 element indices getting "out-of-sync" even though the Predicate Mask
306 is the same because the behaviour is different when zeros in the
307 Predicate are encountered.
308
## Twin Predication
310
311 Twin Predication is an entirely new concept not present in any commercial
312 Vector ISA of the past forty years. To explain how normal Single-predication
313 is applied in a standard Vector ISA:
314
315 * Predication on the **source** of a LOAD instruction creates something
316 called "Vector Compressed Load" (VCOMPRESS).
317 * Predication on the **destination** of a STORE instruction creates something
318 called "Vector Expanded Store" (VEXPAND).
319 * SVP64 allows the two to be put back-to-back: one on source, one on
320 destination.
321
322 The above allows a reader familiar with VCOMPRESS and VEXPAND to
323 conceptualise what the effect of Twin Predication is, but it actually
324 goes much further: in *any* twin-predicated instruction (extsw, fmv)
325 it is possible to apply one predicate to the source register (compressing
326 the source element array) and another *completely separate* predicate
327 to the destination register, not just on Load/Stores but on *arithmetic*
328 operations.
329
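A minimal sketch of the non-zeroing case follows, in which the source and destination indices advance independently, each skipping its own masked-out elements. The function name and the bitmask representation are hypothetical, and zeroing is deliberately not modelled.

```python
def twin_predicated(op, src, dest, srcmask, dstmask, vl):
    """Twin Predication (skipping only): separate source and dest masks."""
    out = list(dest)
    i = j = 0
    while i < vl and j < vl:
        # skip masked-out source elements
        while i < vl and not ((srcmask >> i) & 1):
            i += 1
        # skip masked-out destination elements
        while j < vl and not ((dstmask >> j) & 1):
            j += 1
        if i == vl or j == vl:
            break
        out[j] = op(src[i])
        i += 1
        j += 1
    return out


def identity(x):
    return x


src = [10, 20, 30, 40]
# sparse source mask, dense destination mask: the VCOMPRESS effect
compressed = twin_predicated(identity, src, [0] * 4, 0b1010, 0b0011, 4)
# dense source mask, sparse destination mask: the VEXPAND effect
expanded = twin_predicated(identity, src, [0] * 4, 0b0011, 0b1010, 4)
```
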
330 No other Vector ISA in the world has this back-to-back
331 capability. All true Vector
332 ISAs have Predicate Masks: it is an absolutely essential characteristic.
333 However none of them have abstracted dual predicates out to the extent
334 where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
335 wide range of arithmetic
336 instructions, as well as Load/Store.
337
338 It is however important to note that not all instructions can be Twin
339 Predicated (2P): some remain only Single Predicated (1P), as is normally found
340 in other Vector ISAs. Arithmetic operations with
341 four registers (3-in, 1-out, VA-Form for example) are Single. The reason
342 is that there just wasn't enough space in the 24-bits of the SVP64 Prefix.
343 Consequently, when using a given instruction, it is necessary to look
344 up in the ISA Tables whether it is 1P or 2P. caveat emptor!
345
346 Also worth a special mention: all Load/Store operations are Twin-Predicated.
347 The underlying key to understanding:
348
349 * one Predicate effectively applies to the Array of Memory *Addresses*,
350 * the other Predicate effectively applies to the Array of Memory *Data*.
351
# CR weird instructions
353
[[sv/cr_int_predication]] is by far the biggest violator of the SVP64
rules, for good reasons. Transfers between Vectors of CR Fields and Integers
for use as predicates are very awkward without them.
357
358 Normally, element width overrides allow the element width to be specified
359 as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
360 consuming either 1 bit or 4 bit elements (in effect) some adaptation was
361 required. When this perspective is taken (that results or sources are
362 1 or 4 bits) the weirdness starts to make sense, because the "elements",
363 such as they are, are still packed sequentially.
364
From a hardware implementation perspective however they will need special
handling as far as Hazard Dependencies are concerned, due to nonconformance
(bit-level management).
368
# mv.x (vector permute)
370
[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
terms of Register Hazard Management that its addition to any Scalar
ISA is anathema. In a Traditional Vector ISA however, where the
374 indices are isolated behind a single Vector Hazard, there is no
375 problem at all. `sv.mv.x` is also fraught, precisely because it
376 sits on top of a Standard Scalar register paradigm, not a Vector
377 ISA with separate and distinct Vector registers.
378
379 To help partly solve this, `sv.mv.x` would have had to have
380 been made relative:
381
```
for i in range(VL):
    GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
```
386
387 The reason for doing so is that MAXVL or VL may be used to limit
388 the number of Register Hazards that need to be raised to a fixed
389 quantity, at Issue time.
390
391 `mv.x` itself would still have to be added as a Scalar instruction,
392 but the behaviour of `sv.mv.x` would have to be different from that
393 Scalar version.
394
395 Normally, Scalar Instructions have a good justification for being
396 added as Scalar instructions on their own merit. `mv.x` is the
397 polar opposite, and in the end, the idea was thrown out, and Indexed
398 REMAP added in its place. Indexed REMAP comes with its own quirks,
399 solving the Hazard problem, described in a later section.
400
# Branch-Conditional
402
[[sv/branches]] are a very special exception to the rule that there
shall be no deviation from the corresponding
Scalar instruction. This is because of the tight
integration with looping and the application of Boolean Logic
manipulation needed for Parallel operations (predicate mask usage).
This results in an extremely important observation: "scalar identity
behaviour" is violated. The SV Prefixed variant of branch is **not** the same
operation as the unprefixed 32-bit scalar version.
411
412 One key difference is that LR is only updated if certain additional
413 conditions are met, whereas Scalar `bclrl` for example unconditionally
414 overwrites LR.
415
416 Another is that the Vectorised Branch-Conditional instructions are the
417 only ones where there are side-effects on predication when skipping
418 is enabled. This is so as to be able to use CTR to count down
419 *masked-out* elements.
420
421 Well over 500 Vectorised branch instructions exist in SVP64 due to the
422 number of options available: close integration and interaction with
423 the base Scalar Branch was unavoidable in order to create Conditional
424 Branching suitable for parallel 3D / CUDA GPU workloads.
425
# Saturation
427
428 The application of Saturation as a retro-fit to a Scalar ISA is challenging.
429 It does help that within the SFFS Compliancy subset there are no Saturated
430 operations at all: they are only added in VSX.
431
432 Saturation does not inherently change the instruction itself: it does however
433 come with some fundamental implications, when applied. For example:
434 a Floating-Point operation that would normally raise an exception will
435 no longer do so, instead setting the CR1.SO Flag. Another quirky
436 example: signed operations which produce a negative result will be
437 truncated to zero if Unsigned Saturation is requested.
438
439 One very important aspect for implementors is that the operation in
440 effect has to be considered to be performed at infinite precision,
441 followed by saturation detection. In practice this does not actually
442 require infinite precision hardware! Two 8-bit integers being
443 added can only ever overflow into a 9-bit result.
444
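The "infinite precision, then detect" rule can be sketched using Python's arbitrary-precision integers. The function name and its parameters are hypothetical. The second return value indicates whether clamping took place.

```python
def saturating_add(a, b, width=8, signed=False):
    """Add at unlimited precision, then clamp to the element's range."""
    exact = a + b  # Python integers never overflow
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    clamped = min(max(exact, lo), hi)
    return clamped, clamped != exact
```

Note how a negative result under Unsigned Saturation clamps to zero, matching the quirk described earlier in this section.
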
445 Overall some care and consideration needs to be applied.
446
# Fail-First
448
449 Fail-First (both the Load/Store and Data-Dependent variants)
450 is worthy of a special mention in its own right. Where VL is
451 normally forward-looking and may be part of a pre-decode phase
452 in a (simplified) pipelined architecture with no Read-after-Write Hazards,
453 Fail-First changes that because at any point during the execution
454 of the element-level instructions, one of those elements may not only
455 terminate further continuation of the hardware-for-looping but also
456 effect a change of VL:
457
```
for i in range(VL):
    result = element_operation(GPR(RA+i), GPR(RB+i))
    if test(result):
        VL = i
        break
```
465
This is not exactly a violation of SVP64 Rules, more of a breakage
of user expectations, particularly for LD/ST, where exceptions
would normally be expected to be raised: Fail-First provides for
avoidance of those exceptions.
470
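A runnable rendition of the pseudocode above is sketched here, with plain lists standing in for the register file. The function name and its arguments are hypothetical. As in the pseudocode, the failing element is excluded and VL is truncated to its index.

```python
def fail_first(op, test, a, b, vl):
    """Data-dependent Fail-First: stop at the first element whose result
    satisfies test(), returning the results so far and the truncated VL."""
    results = []
    for i in range(vl):
        result = op(a[i], b[i])
        if test(result):
            return results, i  # VL = i, as in the pseudocode
        results.append(result)
    return results, vl


a = [5, 6, 7, 8]
b = [1, 2, 7, 3]
results, new_vl = fail_first(lambda x, y: x - y, lambda r: r == 0, a, b, 4)
```
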
471 For Hardware implementers, a standard Out-of-Order micro-architecture
472 allows for Cancellation of speculatively-executed elements that extended
473 beyond the Vector Truncation point. In-order systems will have a slightly
474 harder time and may choose to execute one element only at a time, reducing
475 performance as a result.
476
# OE=1
478
479 The hardware cost of Sticky Overflow in a parallel environment is immense.
480 The SFFS Compliancy Level is permitted optionally to support XER.SO.
481 Therefore the decision is made to make it mandatory **not** to
482 support XER.SO. However, CR.SO *is* supported such that when Rc=1
483 is set the CR.SO flag will contain only the overflow of
484 the current instruction, rather than being actually "sticky".
485 Hardware Out-of-Order designers will recognise and appreciate
486 that the Hazards are
487 reduced to Read-After-Write (RAW) and that the WAR Hazard is removed.
488
This is sort-of a quirk and sort-of not, because support for
XER.SO is already optional in the SFFS Compliancy Level.
491
# Indexed REMAP and CR Field Predication Hazards
493
494 Normal Vector ISAs and those Packed SIMD ISAs inspired by them have
495 Vector "Permute" or "Shuffle" instructions. These provide a Vector of
496 indices whereby another Vector is reordered (permuted, shuffled) according
to the indices. Register Hazard Management here is trivial because there
498 are three registers: indices source vector, elements source vector to
499 be shuffled, result vector.
500
501 For SVP64 which is based on top of a Scalar Register File paradigm,
502 combined with the hard requirement to respect full Register Hazard
503 Management as if element instructions were actual Scalar instructions,
504 the addition of a Vector permute instruction under these strict
505 conditions would result in a catastrophic
506 reduction in performance, due to having to consider Read-after-Write
507 and Write-after-Read Hazards *at the element level*.
508
509 A little leniency and rule-bending is therefore required.
510
511 Rather than add explicit Vector permute instructions, the "Indexing"
512 has been separated out into a REMAP Schedule. When an Indexed
513 REMAP is requested, it is assumed (required, of software) that
514 subsequent instructions intending to use those indices *will not*
515 attempt to modify the indices. It is *Software* that must consider them
516 to be read-only.
517
518 This simple relaxation of the rules releases Hardware from having the
519 horrendous job of dynamically detecting Write-after-Read Hazards on a
520 huge range of registers.
521
522 A similar Hazard problem exists for CR Field Predicates, in Vertical-First
523 Mode. Instructions could modify CR Fields currently being used as Predicate
524 Masks: detecting this is so horrendous for hardware resource utilisation
525 and hardware complexity that, again, the decision is made to relax these
526 constraints and for Software to take that into account.
527
# Floating-Point "Single" becomes "Half"
529
530 In several places in the Power ISA there are operations that are on
531 32-bit quantities in 64-bit registers. The best example is FP which
532 has 64-bit operations (`fadd`) and 32-bit operations (`fadds` or
FP Add "single"). Element-width overrides would seem to
be unnecessary under these circumstances.
535
536 However, it is not possible for `fadds` to fit two elements into
537 64-bit: that breaks the simplicity of SVP64.
538 Bear in mind that the FP32 bits are spread out across a 64
539 bit register in FP64 format. The solution here was to consider the
540 "s" at the end of each instruction
541 to mean "half of the element's width". Thus, `sv.fadds/ew=32`
542 actually stores an FP16 spread out across the 32 bits of an
543 element, in FP32 format, where `sv.fadd/ew=32` stores a full
544 FP32 result into the full 32 bits.
545
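The idea of an FP16 value "spread out" in FP32 format can be sketched with Python's `struct` module, whose `e` format is IEEE 754 half precision. This only illustrates the storage concept: the helper names are hypothetical and nothing here models the actual `sv.fadds` arithmetic.

```python
import struct


def to_fp16_precision(x):
    """Round a Python float to the nearest FP16-representable value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp32_bits(x):
    """Bit pattern of x when stored in FP32 format."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


half_value = to_fp16_precision(0.1)
bits = fp32_bits(half_value)
# FP16 has a 10-bit mantissa, FP32 has 23: the low 13 bits stay clear
low_mantissa_bits = bits & 0x1FFF
```
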
Where this breaks down is when attempting to do half-width on
BF16 or FP16 operations: there does not exist a BF8 or an IEEE 754 FP8
format, so these should be avoided.
549
# Vertical-First and Subvectors
551
Documented in the [[sv/setvl]] page, Vertical-First goes through
instructions first and elements second, and requires an explicit
[[sv/svstep]] instruction to move to the next element
(whereas Horizontal-First
loops through elements in full first before moving on to
the next instruction). *Subvectors are considered "elements"*
in Vertical-First Mode.
559
Conceptually this is quite easy to keep in mind: a Vertical-First
instruction does one element at a time, and when SUBVL is set,
that "element" in essence becomes a vec2/3/4.
563
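The difference between the two modes is purely one of loop nesting, which can be sketched as the order in which (instruction, element) pairs are executed. Both function names are hypothetical. In Vertical-First the advance of the outer loop corresponds to the explicit step instruction.

```python
def horizontal_first(instructions, vl):
    """Each instruction loops through all elements before the next one."""
    return [(insn, i) for insn in instructions for i in range(vl)]


def vertical_first(instructions, vl):
    """All instructions run on element i, then an explicit step moves on."""
    return [(insn, i) for i in range(vl) for insn in instructions]
```
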
# Swizzle and Pack/Unpack
565
566 These are both so weird it's best to just read the pages in full
567 and pay attention: [[sv/mv.swizzle]] and [[sv/mv.vec]].
568 Swizzle Moves only engage with vec2/3/4, *reordering* the copying
569 of the sub-vector elements (including allowing repeats and skips)
570 based on an immediate supplied by the instruction. The fun
571 comes when Pack/Unpack are enabled, and it is really important
572 to be aware how the Arrays of vec2/3/4 become re-ordered
573 *and swizzled at the same time*.
574
Pack/Unpack applies to
[[sv/mv.vec]] as well. However, the uniform relationship and
577 the fact that the source and destination subvector length
578 must be the same (vec2/3/4) makes things slightly easier to
579 understand. The main thing to keep in mind about Pack/Unpack
580 is that it engages a swap of the ordering of the VL-SUBVL
581 nested for-loops, in exactly the same way that Matrix REMAP
582 can do. When Pack or Unpack is enabled it is the SUBVL for-loop
583 that becomes outermost.
584
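The swap of loop nesting can be sketched as the sequence of flat element offsets visited. This shows only the reordering itself, under the assumption that a flat offset is `i * SUBVL + j`. Which of source or destination is affected by Pack as opposed to Unpack is not modelled, and both function names are hypothetical.

```python
def normal_order(vl, subvl):
    """VL outermost, SUBVL innermost: subvectors are visited whole."""
    return [i * subvl + j for i in range(vl) for j in range(subvl)]


def swapped_order(vl, subvl):
    """SUBVL outermost: all X components first, then all Y, and so on."""
    return [i * subvl + j for j in range(subvl) for i in range(vl)]
```
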
# No Scalar GPR Move
586
587 Perhaps unsurprisingly the Scalar Power ISA does not have
588 a Scalar GPR Move instruction: instead, there are a series
589 of pseudo-op opportunities such as `addi RT,RA,0` or `ori RT,RA,0`
590 and many more.
591
592 Strictly speaking these may orthogonally be Vectorised and achieve
593 the same effect as a Vector Move. However these instructions
594 are marked as `RM-2P-1S1D` and have EXTRA3 Augmentation. In other
595 words it is not possible to use them in Pack/Unpack Mode.
There is however a trick: [[sv/mv.swizzle]] with a straight linear
mapping (X to X, Y to Y...).
By applying a straight linear swizzle map, the `RM-2P-1S1D-PU` mode
of `sv.mv.swizzle`
is available.