# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
2
3 Credits and acknowledgements:
4
5 * Luke Leighton
6 * Jacob Lifshay
7 * Hendrik Boom
8 * Richard Wilbur
9 * Alexandre Oliva
10 * Cesar Strauss
11 * NLnet Foundation, for funding
12 * OpenPOWER Foundation
13 * Paul Mackerras
14 * Toshaan Bharvani
15 * IBM for the Power ISA itself
16
17 Links:
18
19 * <https://bugs.libre-soc.org/show_bug.cgi?id=1045>
20
21 # Introduction
22
23 Simple-V is a type of Vectorisation best described as a "Prefix Loop Subsystem"
24 similar to the Z80 `LDIR` instruction and to the x86 `REP` Prefix instruction.
25 More advanced features are similar to the Z80 `CPIR` instruction. If viewed
26 as an actual Vector ISA it introduces over 1.5 million 64-bit Vector instructions.
27 SVP64, the instruction format, is therefore best viewed as an orthogonal
28 RISC-style "Prefixing" subsystem instead.
29
30 Except where explicitly stated all bit numbers remain as in the rest of the Power ISA:
31 in MSB0 form (the bits are numbered from 0 at the MSB on the left
32 and counting up as you move rightwards to the LSB end). All bit ranges are inclusive
33 (so `4:6` means bits 4, 5, and 6, in MSB0 order). **All register numbering and
34 element numbering however is LSB0 ordering** which is a different convention from that used
35 elsewhere in the Power ISA.
36
37 The SVP64 prefix always comes before the suffix in PC order and must be considered
38 an independent "Defined word" that augments the behaviour of the following instruction,
39 but does **not** change the actual Decoding of that following instruction.
40 **All prefixed instructions retain their non-prefixed encoding and definition**.
41
42 *Architectural Resource Allocation note: it is prohibited to accept RFCs which
43 fundamentally violate this hard requirement. Under no circumstances must the
44 Suffix space have an alternate instruction encoding allocated within SVP64 that is
45 entirely different from the non-prefixed Defined Word. Hardware Implementors
46 critically rely on this inviolate guarantee to implement High-Performance Multi-Issue
47 micro-architectures that can sustain 100% throughput*
48
49 | 0:5 | 6:31 | 32:63 |
50 |--------|--------------|--------------|
51 | EXT09 | v3.1 Prefix | v3.0/1 Suffix |
52
53 Subset implementations in hardware are permitted, as long as certain
54 rules are followed, allowing for full soft-emulation including future
55 revisions. Compliancy Subsets exist to ensure minimum levels of binary
56 interoperability expectations within certain environments.
57
58 ## Register files, elements, and Element-width Overrides
59
In the Upper Compliancy Levels the GPR and FPR register files are expanded
from 32 to 128 entries, and the number of CR Fields is expanded from CR0-CR7 to CR0-CR127.
62
Memory access remains exactly the same: the effects of `MSR.LE` remain exactly the same,
affecting, as they already do, **only** the byte-order of Load and Store memory-register
operations, and having nothing to do with the
ordering of the contents of register files or with register-register operations.
67
Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered and
sequentially numbered, the element offset numbering is naturally
**LSB0-sequentially-incrementing from zero, not MSB0-incrementing.** Expressed exclusively in
MSB0 numbering, SVP64 is unnecessarily complex to understand: the required
subtractions from 63, 31, 15 and 7 unfortunately become a hostile minefield.
Therefore, for the purposes of this section, the more natural
**LSB0 numbering is assumed**, and it is up to the reader to translate to MSB0 numbering.
75
The canonical specification for how element-sequential numbering and element-width
overrides are defined is expressed in the following C structure, assuming a Little-Endian
system, and naturally using LSB0 numbering everywhere because the ANSI C specification
is inherently LSB0:
80
```
#include <stdint.h>

#pragma pack
typedef union {
    // zero-length arrays (GNU extension): element indexing deliberately
    // carries over into the next sequentially-numbered register
    uint8_t  b[0]; // elwidth 8
    uint16_t s[0]; // elwidth 16
    uint32_t i[0]; // elwidth 32
    uint64_t l[0]; // elwidth 64
    uint8_t  actual_bytes[8];
} el_reg_t;

el_reg_t int_regfile[128];

void get_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: el->l[0] = int_regfile[gpr].l[element]; break;
        case 32: el->i[0] = int_regfile[gpr].i[element]; break;
        case 16: el->s[0] = int_regfile[gpr].s[element]; break;
        case 8 : el->b[0] = int_regfile[gpr].b[element]; break;
    }
}
void set_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: int_regfile[gpr].l[element] = el->l[0]; break;
        case 32: int_regfile[gpr].i[element] = el->i[0]; break;
        case 16: int_regfile[gpr].s[element] = el->s[0]; break;
        case 8 : int_regfile[gpr].b[element] = el->b[0]; break;
    }
}
```
110
111 Example Vector-looped add operation implementation when elwidths are 64-bit:
112
113 ```
114 # add RT, RA,RB using the "uint64_t" union member, "l"
115 for i in range(VL):
116 int_regfile[RT].l[i] = int_regfile[RA].l[i] + int_regfile[RB].l[i]
117 ```
118
119 However if elwidth overrides are set to 16 for both source and destination:
120
121 ```
# add RT, RA, RB using the "uint16_t" union member "s"
123 for i in range(VL):
124 int_regfile[RT].s[i] = int_regfile[RA].s[i] + int_regfile[RB].s[i]
125 ```
126
127 Hardware Architectural note: to avoid a Read-Modify-Write at the register file it is
128 strongly recommended to implement byte-level write-enable lines exactly as has been
implemented in DRAM ICs for many decades. Additionally, the predicate mask bit is advised
to be associated with the element operation and, alongside the result, ultimately
passed to the register file.
132 When element-width is set to 64-bit the relevant predicate mask bit may be repeated
133 eight times and pull all eight write-port byte-level lines HIGH. Clearly when element-width
134 is set to 8-bit the relevant predicate mask bit corresponds directly with one single
135 byte-level write-enable line. It is up to the Hardware Architect to then amortise (merge)
elements together into both Predicated-SIMD Pipelines as well as simultaneous non-overlapping
137 Register File writes, to achieve High Performance designs.
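
*Implementor's sketch (non-normative): the following illustrates how a single predicate mask bit may be expanded into the eight byte-level write-enable lines of a 64-bit write port, for a given element width and byte offset. The function name and enable-line encoding are purely illustrative.*

```
# illustrative only: expand one predicate mask bit into byte-level
# write-enable lines for a 64-bit-wide register file write port.
def byte_write_enables(pred_bit, elwidth_bits, element_offset_bytes):
    if not pred_bit:
        return 0b00000000                    # predicate clear: write nothing
    bytes_per_element = elwidth_bits // 8
    mask = (1 << bytes_per_element) - 1      # repeat the bit across the element
    return mask << element_offset_bytes

# a 64-bit element pulls all eight lines HIGH...
assert byte_write_enables(1, 64, 0) == 0b11111111
# ...whereas an 8-bit element at byte offset 3 pulls exactly one line HIGH
assert byte_write_enables(1, 8, 3) == 0b00001000
```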
138
139 ## SVP64 encoding features
140
141 A number of features need to be compacted into a very small space of only 24 bits:
142
143 * Independent per-register Scalar/Vector tagging and range extension on every register
144 * Element width overrides on both source and destination
145 * Predication on both source and destination
146 * Two different sources of predication: INT and CR Fields
147 * SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and
148 predicate-result mode.
149
Different classes of operations require different formats. The following sections cover
the common formats, and the four separate modes are then covered: CR operations (crops),
Arithmetic/Logical (termed "normal"), Load/Store and Branch-Conditional.
153
154 ## Definition of Reserved in this spec.
155
156 For the new fields added in SVP64, instructions that have any of their
157 fields set to a reserved value must cause an illegal instruction trap,
158 to allow emulation of future instruction sets, or for subsets of SVP64
159 to be implemented in hardware and the rest emulated.
160 This includes SVP64 SPRs: reading or writing values which are not
161 supported in hardware must also raise illegal instruction traps
162 in order to allow emulation.
163 Unless otherwise stated, reserved values are always all zeros.
164
This is unlike the Power ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero. Where the standard Power ISA definition
is intended, the red keyword `RESERVED` is used.
167
168 ## Definition of "UnVectoriseable"
169
170 Any operation that inherently makes no sense if repeated is termed "UnVectoriseable"
171 or "UnVectorised". Examples include `sc` or `sync` which have no registers. `mtmsr` is
172 also classed as UnVectoriseable because there is only one `MSR`.
173
174 ## Scalar Identity Behaviour
175
SVP64 is designed so that when the prefix is all zeros, and VL=1, no effect or
influence occurs (no augmentation) such that all standard Power ISA
v3.0/v3.1 instructions covered by the prefix are "unaltered". This is termed
`scalar identity behaviour` (based on the mathematical definition for "identity",
as in, "identity matrix" or, better, "identity transformation").
180
181 Note that this is completely different from when VL=0. VL=0 turns all operations under its influence into `nops` (regardless of the prefix)
182 whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction (an "identity transformation").
183
184 ## Register Naming and size
185
186 As previously mentioned SV Registers are simply the INT, FP and CR register files extended
187 linearly to larger sizes; SV Vectorisation iterates sequentially through these registers
188 (LSB0 sequential ordering from 0 to VL-1).
189
Where the integer regfile in standard scalar
Power ISA v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127.
Likewise FP registers are extended to 128 (fp0 to fp127), and CR Fields
are extended to 128 entries, CR0 through CR127.
195
196 The names of the registers therefore reflects a simple linear extension
197 of the Power ISA v3.0B / v3.1B register naming, and in hardware this
198 would be reflected by a linear increase in the size of the underlying
199 SRAM used for the regfiles.
200
201 Note: when an EXTRA field (defined below) is zero, SV is deliberately designed
202 so that the register fields are identical to as if SV was not in effect
203 i.e. under these circumstances (EXTRA=0) the register field names RA,
204 RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. This is part of
205 `scalar identity behaviour` described above.
206
207 ## Future expansion.
208
209 With the way that EXTRA fields are defined and applied to register fields,
future versions of SV may involve 256 or more registers. Backwards binary compatibility may be achieved with a PCR bit (Processor Compatibility Register). Further discussion is out of scope for this version of SVP64.
211
212 # Remapped Encoding (`RM[0:23]`)
213
214 To allow relatively easy remapping of which portions of the Prefix Opcode
215 Map are used for SVP64 without needing to rewrite a large portion of the
216 SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to
217 a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]`
218 at the LSB.
219
220 The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
221 is defined in the Prefix Fields section.
222
223 ## Prefix Fields
224
225 TODO incorporate EXT09
226
To "activate" SVP64 (in a way that does not conflict with v3.1B 64-bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
228 (see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
229 This is achieved by setting bits 7 and 9 to 1:
230
231 | Name | Bits | Value | Description |
232 |------------|---------|-------|--------------------------------|
233 | EXT01 | `0:5` | `1` | Indicates Prefixed 64-bit |
234 | `RM[0]` | `6` | | Bit 0 of Remapped Encoding |
235 | SVP64_7 | `7` | `1` | Indicates this is SVP64 |
236 | `RM[1]` | `8` | | Bit 1 of Remapped Encoding |
237 | SVP64_9 | `9` | `1` | Indicates this is SVP64 |
238 | `RM[2:23]` | `10:31` | | Bits 2-23 of Remapped Encoding |
239
240 Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
241 are constructed:
242
243 | 0:5 | 6 | 7 | 8 | 9 | 10:31 |
244 |--------|-------|---|-------|---|----------|
245 | EXT01 | RM | 1 | RM | 1 | RM |
246 | 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
247
248 Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1
249 instruction. That instruction becomes "prefixed" with the SVP context: the
250 Remapped Encoding field (RM).
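
*Illustrative (non-normative) sketch: assembling the 32-bit prefix word from the 24-bit Remapped Encoding, expressed here in LSB0 arithmetic for convenience (MSB0 bit 0 is LSB0 bit 31). The function name is purely for illustration.*

```
# illustrative only: build the 32-bit SVP64 prefix from RM[0:23],
# where rm is a 24-bit integer whose most significant bit is RM[0].
def svp64_prefix(rm):
    assert 0 <= rm < (1 << 24)
    rm0    = (rm >> 23) & 1          # RM[0]
    rm1    = (rm >> 22) & 1          # RM[1]
    rm2_23 = rm & ((1 << 22) - 1)    # RM[2:23]
    prefix  = 0b000001 << 26         # EXT01, MSB0 bits 0:5
    prefix |= rm0 << 25              # MSB0 bit 6
    prefix |= 1   << 24              # SVP64_7 identifier bit
    prefix |= rm1 << 23              # MSB0 bit 8
    prefix |= 1   << 22              # SVP64_9 identifier bit
    prefix |= rm2_23                 # MSB0 bits 10:31
    return prefix

# an all-zeros RM gives only the EXT01 opcode and the two identifier bits
assert svp64_prefix(0) == 0x05400000
```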
251
252 It is important to note that unlike v3.1 64-bit prefixed instructions
253 there is insufficient space in `RM` to provide identification of
254 any SVP64 Fields without first partially decoding the
255 32-bit suffix. Similar to the "Forms" (X-Form, D-Form) the
256 `RM` format is individually associated with every instruction.
257
258 Extreme caution and care must therefore be taken
259 when extending SVP64 in future, to not create unnecessary relationships
260 between prefix and suffix that could complicate decoding, adding latency.
261
262 # Common RM fields
263
264 The following fields are common to all Remapped Encodings:
265
266 | Field Name | Field bits | Description |
267 |------------|------------|----------------------------------------|
268 | MASKMODE | `0` | Execution (predication) Mask Kind |
269 | MASK | `1:3` | Execution Mask |
270 | SUBVL | `8:9` | Sub-vector length |
271
272 The following fields are optional or encoded differently depending
273 on context after decoding of the Scalar suffix:
274
275 | Field Name | Field bits | Description |
276 |------------|------------|----------------------------------------|
277 | ELWIDTH | `4:5` | Element Width |
278 | ELWIDTH_SRC | `6:7` | Element Width for Source |
279 | EXTRA | `10:18` | Register Extra encoding |
280 | MODE | `19:23` | changes Vector behaviour |
281
282 * MODE changes the behaviour of the SV operation (result saturation, mapreduce)
283 * SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work
284 * ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and source operand width
285 * MASK (and MASK_SRC) and MASKMODE provide predication (two types of sources: scalar INT and Vector CR).
286 * Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix.
287
288 Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag.
289
290 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
291
292 # Mode
293
Mode is an augmentation of SV behaviour. Different types of
instructions have different needs, similar to how the Power ISA
v3.1 64-bit prefix 8LS and MRR formats apply to different
instruction types. Modes include Reduction, Iteration, arithmetic
saturation, and Fail-First. More specific details are given in each
section and in the [[svp64/appendix]].
300
301 * For condition register operations see [[sv/cr_ops]]
302 * For LD/ST Modes, see [[sv/ldst]].
303 * For Branch modes, see [[sv/branches]]
304 * For arithmetic and logical, see [[sv/normal]]
305
306 # ELWIDTH Encoding
307
308 Default behaviour is set to 0b00 so that zeros follow the convention of
309 `scalar identity behaviour`. In this case it means that elwidth overrides
are not applicable. Thus if a 32-bit instruction operates on 32-bit data,
`elwidth=0b00` specifies that this behaviour is unmodified. Likewise
when a processor is switched from 64-bit to 32-bit mode, `elwidth=0b00`
states that, again, the behaviour is not to be modified.
314
315 Only when elwidth is nonzero is the element width overridden to the
316 explicitly required value.
317
318 ## Elwidth for Integers:
319
320 | Value | Mnemonic | Description |
321 |-------|----------------|------------------------------------|
322 | 00 | DEFAULT | default behaviour for operation |
323 | 01 | `ELWIDTH=w` | Word: 32-bit integer |
324 | 10 | `ELWIDTH=h` | Halfword: 16-bit integer |
325 | 11 | `ELWIDTH=b` | Byte: 8-bit integer |
326
This encoding is chosen such that the element width in bits may be computed as
`8<<(3-ew)`.
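
*A trivial (non-normative) check of that formula:*

```
# element width in bits for each 2-bit integer ELWIDTH encoding
assert [8 << (3 - ew) for ew in range(4)] == [64, 32, 16, 8]
```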
329
330 ## Elwidth for FP Registers:
331
332 | Value | Mnemonic | Description |
333 |-------|----------------|------------------------------------|
334 | 00 | DEFAULT | default behaviour for FP operation |
335 | 01 | `ELWIDTH=f32` | 32-bit IEEE 754 Single floating-point |
336 | 10 | `ELWIDTH=f16` | 16-bit IEEE 754 Half floating-point |
337 | 11 | `ELWIDTH=bf16` | Reserved for `bf16` |
338
339 Note:
340 [`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
341 is reserved for a future implementation of SV
342
Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall
perform its operation at **half** the ELWIDTH, with the result then padded back out
to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
346 clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
347 then padded back out to fit in IEEE754 FP64, exactly as for Scalar
348 v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16
349 or ELWIDTH=bf16 is reserved and must raise an illegal instruction
350 (IEEE754 FP8 or BF8 are not defined).
351
352 ## Elwidth for CRs:
353
Element-width overrides for CR Fields have no meaning. The bits
355 are therefore used for other purposes, or when Rc=1, the Elwidth
356 applies to the result being tested (a GPR or FPR), but not to the
357 Vector of CR Fields.
358
359 # SUBVL Encoding
360
The default for SUBVL is 1 and its encoding is 0b00, to indicate that
SUBVL is effectively disabled (a SUBVL for-loop of only one element). This
lines up in combination with all other "default is all zeros" behaviour.
364
365 | Value | Mnemonic | Subvec | Description |
366 |-------|-----------|---------|------------------------|
367 | 00 | `SUBVL=1` | single | Sub-vector length of 1 |
368 | 01 | `SUBVL=2` | vec2 | Sub-vector length of 2 |
369 | 10 | `SUBVL=3` | vec3 | Sub-vector length of 3 |
370 | 11 | `SUBVL=4` | vec4 | Sub-vector length of 4 |
371
372 The SUBVL encoding value may be thought of as an inclusive range of a
373 sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
374 this may be considered to be elements 0b00 to 0b01 inclusive.
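
*Illustrative (non-normative) sketch of SUBVL grouping, ignoring predication, elwidth overrides, Pack/Unpack and REMAP: the inner SUBVL loop simply steps through sequentially-numbered elements, so a vec3 operand advances three elements per outer iteration. Register numbers and the operation are arbitrary.*

```
# illustrative only: element traversal with a sub-vector length.
VL, SUBVL = 2, 3                 # two vec3 elements
RT, RA = 8, 16
regs = list(range(64))           # stand-in integer register file, elwidth=64
for i in range(VL):              # hardware for-loop over VL
    for j in range(SUBVL):       # inner loop over the sub-vector
        e = i * SUBVL + j        # sequential element (and register) number
        regs[RT + e] = regs[RA + e] + 1
assert regs[8:14] == [17, 18, 19, 20, 21, 22]
```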
375
376 # MASK/MASK_SRC & MASKMODE Encoding
377
378 TODO: rename MASK_KIND to MASKMODE
379
380 One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
381 types may not be mixed.
382
383 Special note: to disable predication this field must
384 be set to zero in combination with Integer Predication also being set
to 0b000. This has the effect of enabling "all 1s" in the predicate
386 mask, which is equivalent to "not having any predication at all"
387 and consequently, in combination with all other default zeros, fully
388 disables SV (`scalar identity behaviour`).
389
390 `MASKMODE` may be set to one of 2 values:
391
392 | Value | Description |
393 |-----------|------------------------------------------------------|
394 | 0 | MASK/MASK_SRC are encoded using Integer Predication |
395 | 1 | MASK/MASK_SRC are encoded using CR-based Predication |
396
397 Integer Twin predication has a second set of 3 bits that uses the same
encoding, thus allowing either the same register (r3, r10 or r30) to be used
399 for both src and dest, or different regs (one for src, one for dest).
400
401 Likewise CR based twin predication has a second set of 3 bits, allowing
402 a different test to be applied.
403
404 Note that it is assumed that Predicate Masks (whether INT or CR)
405 are read *before* the operations proceed. In practice (for CR Fields)
406 this creates an unnecessary block on parallelism. Therefore,
407 it is up to the programmer to ensure that the CR fields used as
408 Predicate Masks are not being written to by any parallel Vector Loop.
409 Doing so results in **UNDEFINED** behaviour, according to the definition
410 outlined in the Power ISA v3.0B Specification.
411
412 Hardware Implementations are therefore free and clear to delay reading
413 of individual CR fields until the actual predicated element operation
414 needs to take place, safe in the knowledge that no programmer will
415 have issued a Vector Instruction where previous elements could have
416 overwritten (destroyed) not-yet-executed CR-Predicated element operations.
417
418 ## Integer Predication (MASKMODE=0)
419
420 When the predicate mode bit is zero the 3 bits are interpreted as below.
421 Twin predication has an identical 3 bit field similarly encoded.
422
423 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
424
425 | Value | Mnemonic | Element `i` enabled if: |
426 |-------|----------|------------------------------|
427 | 000 | ALWAYS | predicate effectively all 1s |
428 | 001 | 1 << R3 | `i == R3` |
429 | 010 | R3 | `R3 & (1 << i)` is non-zero |
430 | 011 | ~R3 | `R3 & (1 << i)` is zero |
431 | 100 | R10 | `R10 & (1 << i)` is non-zero |
432 | 101 | ~R10 | `R10 & (1 << i)` is zero |
433 | 110 | R30 | `R30 & (1 << i)` is non-zero |
434 | 111 | ~R30 | `R30 & (1 << i)` is zero |
435
436 r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs.
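
*The table above may be summarised by the following illustrative (non-normative) sketch, returning whether element `i` is enabled for a given 3-bit MASK value; `regs` stands in for the integer register file.*

```
# illustrative only: Integer Predication (MASKMODE=0) element test
def int_pred_enabled(mask, i, regs):
    if mask == 0b000: return True                         # ALWAYS: all 1s
    if mask == 0b001: return i == regs[3]                 # 1 << R3
    if mask == 0b010: return ((regs[3]  >> i) & 1) == 1   # R3
    if mask == 0b011: return ((regs[3]  >> i) & 1) == 0   # ~R3
    if mask == 0b100: return ((regs[10] >> i) & 1) == 1   # R10
    if mask == 0b101: return ((regs[10] >> i) & 1) == 0   # ~R10
    if mask == 0b110: return ((regs[30] >> i) & 1) == 1   # R30
    if mask == 0b111: return ((regs[30] >> i) & 1) == 0   # ~R30
```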
437
438 ## CR-based Predication (MASKMODE=1)
439
440 When the predicate mode bit is one the 3 bits are interpreted as below.
441 Twin predication has an identical 3 bit field similarly encoded.
442
443 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
444
445 | Value | Mnemonic | Element `i` is enabled if |
446 |-------|----------|--------------------------|
447 | 000 | lt | `CR[offs+i].LT` is set |
448 | 001 | nl/ge | `CR[offs+i].LT` is clear |
449 | 010 | gt | `CR[offs+i].GT` is set |
450 | 011 | ng/le | `CR[offs+i].GT` is clear |
451 | 100 | eq | `CR[offs+i].EQ` is set |
452 | 101 | ne | `CR[offs+i].EQ` is clear |
453 | 110 | so/un | `CR[offs+i].FU` is set |
454 | 111 | ns/nu | `CR[offs+i].FU` is clear |
455
456 CR based predication. TODO: select alternate CR for twin predication? see
457 [[discussion]] Overlap of the two CR based predicates must be taken
458 into account, so the starting point for one of them must be suitably
459 high, or accept that for twin predication VL must not exceed the range
460 where overlap will occur, *or* that they use the same starting point
461 but select different *bits* of the same CRs
462
463 `offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
464
465 The CR Predicates chosen must start on a boundary that Vectorised
466 CR operations can access cleanly, in full.
467 With EXTRA2 restricting starting points
468 to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and CR Predicate
469 Masks have to be adapted to fit on these boundaries as well.
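
*Illustrative (non-normative) sketch of the CR-based element test, derived from the table above: the upper two bits of the encoding select which of LT/GT/EQ/SO to test and the lowest bit selects set versus clear. `cr_fields` stands in for the Vectorised CR Field regfile and the bit positions are simply list indices for this sketch.*

```
# illustrative only: CR-based Predication (MASKMODE=1) element test
LT, GT, EQ, SO = 0, 1, 2, 3          # bit-within-CR-Field, for this sketch

def cr_pred_enabled(mask, i, cr_fields, offs=32):
    bit = (LT, LT, GT, GT, EQ, EQ, SO, SO)[mask]    # which CR bit to test
    want_set = (mask & 1) == 0                      # even encodings test "is set"
    is_set = ((cr_fields[offs + i] >> bit) & 1) == 1
    return is_set == want_set
```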
470
471 # Extra Remapped Encoding <a name="extra_remap"> </a>
472
473 Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants. Note that due to the very tight space, the encoding mode is *not* included in the prefix itself. The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*).
474
475 These mappings are part of the SVP64 Specification in exactly the same
476 way as X-Form, D-Form. New Scalar instructions added to the Power ISA
477 will need a corresponding SVP64 Mapping, which can be derived by-rote
478 from examining the Register "Profile" of the instruction.
479
480 There are two categories: Single and Twin Predication.
481 Due to space considerations further subdivision of Single Predication
482 is based on whether the number of src operands is 2 or 3. With only
483 9 bits available some compromises have to be made.
484
485 * `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
486 * `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest)
487 * `RM-2P-1S1D` Twin Predication (src=1, dest=1)
488 * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
489 * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
490
491 ## RM-1P-3S1D
492
493 | Field Name | Field bits | Description |
494 |------------|------------|----------------------------------------|
495 | Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
496 | Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
497 | Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
498 | Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
499 | EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS |
500
These are for instructions with 3 source operands and either 1 or 2 destinations.
502 3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
503 such as `maddedu` have an implicit second destination, RS, the
504 selection of which is determined by bit 18.
505
506 ## RM-1P-2S1D
507
508 | Field Name | Field bits | Description |
509 |------------|------------|-------------------------------------------|
510 | Rdest\_EXTRA3 | `10:12` | extends Rdest |
511 | Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1 |
| Rsrc2\_EXTRA3 | `16:18` | extends Rsrc2 |
513
514 These are for 2 operand 1 dest instructions, such as `add RT, RA,
515 RB`. However also included are unusual instructions with an implicit dest
that is identical to its src reg, such as `rlwimi`.
517
Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would not have sufficient bit fields to allow
519 an alternative destination. With SV however this becomes possible.
520 Therefore, the fact that the dest is implicitly also a src should not
521 mislead: due to the *prefix* they are different SV regs.
522
523 * `rlwimi RA, RS, ...`
524 * Rsrc1_EXTRA3 applies to RS as the first src
* Rsrc2_EXTRA3 applies to RA as the second src
526 * Rdest_EXTRA3 applies to RA to create an **independent** dest.
527
528 With the addition of the EXTRA bits, the three registers
529 each may be *independently* made vector or scalar, and be independently
530 augmented to 7 bits in length.
531
532 ## RM-2P-1S1D/2S
533
534 | Field Name | Field bits | Description |
535 |------------|------------|----------------------------|
536 | Rdest_EXTRA3 | `10:12` | extends Rdest |
537 | Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
538 | MASK_SRC | `16:18` | Execution Mask for Source |
539
540 `RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2.
541
552 ## RM-2P-2S1D/1S2D/3S
553
554 The primary purpose for this encoding is for Twin Predication on LOAD
and STORE operations. See [[sv/ldst]] for detailed analysis.
556
557 RM-2P-2S1D:
558
559 | Field Name | Field bits | Description |
560 |------------|------------|----------------------------|
561 | Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
562 | Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
563 | Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
564 | MASK_SRC | `16:18` | Execution Mask for Source |
565
Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
is in bits 10:11, Rdest1_EXTRA2 in 12:13).
568
569 Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
570
571 Note also that LD with update indexed, which takes 2 src and 2 dest
572 (e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
Twin Predication. Therefore these are treated as RM-2P-2S1D and the
574 src spec for RA is also used for the same RA as a dest.
575
576 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
577
578 # R\*\_EXTRA2/3
579
580 EXTRA is the means by which two things are achieved:
581
582 1. Registers are marked as either Vector *or Scalar*
583 2. Register field numbers (limited typically to 5 bit)
584 are extended in range, both for Scalar and Vector.
585
586 The register files are therefore extended:
587
588 * INT is extended from r0-31 to r0-127
* FP is extended from fp0-fp31 to fp0-fp127
590 * CR Fields are extended from CR0-7 to CR0-127
591
592 However due to pressure in `RM.EXTRA` not all these registers
593 are accessible by all instructions, particularly those with
594 a large number of operands (`madd`, `isel`).
595
596 In the following tables register numbers are constructed from the
597 standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2
598 or EXTRA3 field from the SV Prefix, determined by the specific
599 RM-xx-yyyy designation for a given instruction.
600 The prefixing is arranged so that
601 interoperability between prefixing and nonprefixing of scalar registers
602 is direct and convenient (when the EXTRA field is all zeros).
603
604 A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs)
605
```
if extra3_mode:
    spec = EXTRA3
elif EXTRA2[0]: # vector mode
    spec = EXTRA2 << 1 # same as EXTRA3, shifted
else: # scalar mode
    spec = EXTRA2 # not shifted: see the EXTRA2 table below
if spec[0]: # vector
    return (RA << 2) | spec[1:2]
else: # scalar
    return (spec[1:2] << 5) | RA
```
616
617 Future versions may extend to 256 by shifting Vector numbering up.
618 Scalar will not be altered.
619
620 Note that in some cases the range of starting points for Vectors
621 is limited.
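
*The pseudocode above may be expanded into the following illustrative (non-normative) sketch, which returns whether the operand is a Vector plus its 7-bit register number, and which can be cross-checked against the EXTRA3/EXTRA2 tables in the next two sections.*

```
# illustrative only: INT/FP EXTRA2/EXTRA3 decode.
# RA is the standard 5-bit register field from the 32-bit suffix.
def decode_extra(RA, extra, extra3_mode):
    if extra3_mode:
        spec = extra                    # 3-bit EXTRA3
    elif (extra >> 1) & 1:              # EXTRA2 MSB set: vector mode
        spec = extra << 1               # same as EXTRA3, shifted
    else:                               # EXTRA2 scalar mode
        spec = extra                    # not shifted
    vector = (spec >> 2) & 1            # spec[0] in MSB0 terms
    lo2    = spec & 0b11                # spec[1:2] in MSB0 terms
    if vector:
        return True,  (RA << 2) | lo2   # Vectors: starting points step by 4
    return False, (lo2 << 5) | RA       # Scalars: r0-r127, step of 1

assert decode_extra(1, 0b101, True)  == (True, 5)    # EXTRA3: r1-r125/4
assert decode_extra(1, 0b011, True)  == (False, 97)  # EXTRA3: r96-r127
assert decode_extra(1, 0b01,  False) == (False, 33)  # EXTRA2: r32-r63
assert decode_extra(1, 0b11,  False) == (True, 6)    # EXTRA2: r2-r126/4
```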
622
623 ## INT/FP EXTRA3
624
If EXTRA3 is zero, it maps to
626 "scalar identity" (scalar Power ISA field naming).
627
628 Fields are as follows:
629
630 * Value: R_EXTRA3
631 * Mode: register is tagged as scalar or vector
632 * Range/Inc: the range of registers accessible from this EXTRA
633 encoding, and the "increment" (accessibility). "/4" means
634 that this EXTRA encoding may only give access (starting point)
635 every 4th register.
636 * MSB..LSB: the bit field showing how the register opcode field
637 combines with EXTRA to give (extend) the register number (GPR)
638
639 | Value | Mode | Range/Inc | 6..0 |
640 |-----------|-------|---------------|---------------------|
641 | 000 | Scalar | `r0-r31`/1 | `0b00 RA` |
642 | 001 | Scalar | `r32-r63`/1 | `0b01 RA` |
643 | 010 | Scalar | `r64-r95`/1 | `0b10 RA` |
644 | 011 | Scalar | `r96-r127`/1 | `0b11 RA` |
645 | 100 | Vector | `r0-r124`/4 | `RA 0b00` |
646 | 101 | Vector | `r1-r125`/4 | `RA 0b01` |
647 | 110 | Vector | `r2-r126`/4 | `RA 0b10` |
648 | 111 | Vector | `r3-r127`/4 | `RA 0b11` |
649
650 ## INT/FP EXTRA2
651
If EXTRA2 is zero, it will map to
"scalar identity behaviour", i.e. Scalar Power ISA register naming:
654
655 | Value | Mode | Range/inc | 6..0 |
656 |-----------|-------|---------------|-----------|
657 | 00 | Scalar | `r0-r31`/1 | `0b00 RA` |
658 | 01 | Scalar | `r32-r63`/1 | `0b01 RA` |
659 | 10 | Vector | `r0-r124`/4 | `RA 0b00` |
660 | 11 | Vector | `r2-r126`/4 | `RA 0b10` |
661
662 **Note that unlike in EXTRA3, in EXTRA2**:
663
664 * the GPR Vectors may only start from
665 `r0, r2, r4, r6, r8` and likewise FPR Vectors.
666 * the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.
667
as there are insufficient bits to cover the full range.
669
670 ## CR Field EXTRA3
671
672 CR Field encoding is essentially the same but made more complex due to CRs being bit-based. See [[svp64/appendix]] for explanation and pseudocode.
673 Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
674 and Scalars may only go from `CR0, CR1, ... CR31`
675
676 Encoding shown MSB down to LSB
677
678 For a 5-bit operand (BA, BB, BT):
679
680 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
681 |-------|------|---------------|-----------| --------|---------|
682 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
683 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
684 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[4:2] | BA[1:0] |
685 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[4:2] | BA[1:0] |
686 | 100 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
687 | 101 | Vector | `CR4-CR116`/16 | BA[4:2] 0 | 0b100 | BA[1:0] |
688 | 110 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
689 | 111 | Vector | `CR12-CR124`/16 | BA[4:2] 1 | 0b100 | BA[1:0] |
690
691 For a 3-bit operand (e.g. BFA):
692
693 | Value | Mode | Range/Inc | 6..3 | 2..0 |
694 |-------|------|---------------|-----------| --------|
695 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
696 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
697 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BFA |
698 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BFA |
699 | 100 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
700 | 101 | Vector | `CR4-CR116`/16 | BFA 0 | 0b100 |
701 | 110 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
702 | 111 | Vector | `CR12-CR124`/16 | BFA 1 | 0b100 |
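
*A non-normative sketch of the 5-bit-operand table above (the authoritative pseudocode is in [[svp64/appendix]]), returning the extended CR Field number plus the unchanged bit-within-field selector:*

```
# illustrative only: CR Field EXTRA3 decode for a 5-bit operand (BA/BB/BT).
# Returns (vector?, CR Field number 0-127, bit within the CR Field 0-3).
def decode_cr_extra3(BA, extra3):
    ba_field = BA >> 2                 # BA[4:2]: scalar CR Field CR0-CR7
    ba_bit   = BA & 0b11               # BA[1:0]: bit within the CR Field
    if (extra3 >> 2) & 1:              # vector
        cr = (ba_field << 4) | ((extra3 & 0b11) << 2)   # a multiple of 4
        return True, cr, ba_bit
    cr = ((extra3 & 0b11) << 3) | ba_field              # CR0-CR31
    return False, cr, ba_bit

assert decode_cr_extra3(0b00111, 0b110) == (True, 24, 3)   # CR8-CR120/16 row
assert decode_cr_extra3(0b00111, 0b011) == (False, 25, 3)  # CR24-CR31 row
```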
703
704 ## CR EXTRA2
705
706 CR encoding is essentially the same but made more complex due to CRs being bit-based. See separate section for explanation and pseudocode.
707 Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
708
709
710 Encoding shown MSB down to LSB
711
712 For a 5-bit operand (BA, BB, BC):
713
714 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
715 |-------|--------|----------------|---------|---------|---------|
716 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
717 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
718 | 10 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
719 | 11 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
720
721 For a 3-bit operand (e.g. BFA):
722
723 | Value | Mode | Range/Inc | 6..3 | 2..0 |
724 |-------|------|---------------|-----------| --------|
725 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
726 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
727 | 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
728 | 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
729
730
731 # Normal SVP64 Modes, for Arithmetic and Logical Operations
732
733 Normal SVP64 Mode covers Arithmetic and Logical operations
734 to provide suitable additional behaviour. The Mode
735 field is bits 19-23 of the [[svp64]] RM Field.
736
737 ## Mode
738
739 Mode is an augmentation of SV behaviour, providing additional
740 functionality. Some of these alterations are element-based (saturation), others involve post-analysis (predicate result) and others are Vector-based (mapreduce, fail-on-first).
741
742 [[sv/ldst]],
743 [[sv/cr_ops]] and [[sv/branches]] are covered separately: the following
744 Modes apply to Arithmetic and Logical SVP64 operations:
745
* **simple** mode is straight vectorisation, with no augmentations: the vector comprises an array of independently created results.
* **ffirst** or data-dependent fail-on-first: see separate section. The vector may be truncated depending on certain criteria.
*VL is altered as a result*.
* **sat mode** or saturation: clamps each element result to a min/max rather than overflowing/wrapping. Allows signed and unsigned clamping for both INT
and FP.
* **reduce mode**. If used correctly, a mapreduce (or a prefix sum)
is performed. See [[svp64/appendix]];
note that there are comprehensive caveats when using this mode.
* **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch conditional testing) and if the test fails it
is as if the
*destination* predicate bit was zero even before starting the operation.
When Rc=1 the CR element however is still stored in the CR regfile, even if the test failed. See appendix for details.
758
759 Note that ffirst and reduce modes are not anticipated to be high-performance in some implementations. ffirst due to interactions with VL, and reduce due to it requiring additional operations to produce a result. simple, saturate and pred-result are however inter-element independent and may easily be parallelised to give high performance, regardless of the value of VL.
760
761 The Mode table for Arithmetic and Logical operations
762 is laid out as follows:
763
764 | 0-1 | 2 | 3 4 | description |
765 | --- | --- |---------|-------------------------- |
766 | 00 | 0 | dz sz | simple mode |
767 | 00 | 1 | 0 RG | scalar reduce mode (mapreduce) |
768 | 00 | 1 | 1 / | reserved |
769 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
770 | 01 | inv | VLi RC1 | Rc=0: ffirst z/nonz |
771 | 10 | N | dz sz | sat mode: N=0/1 u/s |
772 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
773 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
774
775 Fields:
776
* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
778 * **zz**: both sz and dz are set equal to this flag
779 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
780 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
781 than the normal 0..VL-1
782 * **N** sets signed/unsigned saturation.
783 * **RC1** as if Rc=1, enables access to `VLi`.
784 * **VLi** VL inclusive: in fail-first mode, the truncation of
785 VL *includes* the current element at the failure point rather
786 than excludes it from the count.
787
788 For LD/ST Modes, see [[sv/ldst]]. For Condition Registers
789 see [[sv/cr_ops]].
790 For Branch modes, see [[sv/branches]].
791
792 # Rounding, clamp and saturate
793
794 See [[av_opcodes]] for relevant opcodes and use-cases.
795
796 To help ensure that audio quality is not compromised by overflow,
797 "saturation" is provided, as well as a way to detect when saturation
798 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
799 one CR per element in the result (Note: this is different from VSX which
800 has a single CR per block).
801
802 When N=0 the result is saturated to within the maximum range of an
803 unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
804 logic applies to FP operations, with the result being saturated to
805 maximum rather than returning INF, and the minimum to +0.0
806
807 When N=1 the same occurs except that the result is saturated to the min
808 or max of a signed result, and for FP to the min and max value rather
809 than returning +/- INF.
810
811 When Rc=1, the CR "overflow" bit is set on the CR associated with the
812 element, to indicate whether saturation occurred. Note that due to
813 the hugely detrimental effect it has on parallel processing, XER.SO is
814 **ignored** completely and is **not** brought into play here. The CR
815 overflow bit is therefore simply set to zero if saturation did not occur,
816 and to one if it did.
817
818 Note also that saturate on operations that set OE=1 must raise an
819 Illegal Instruction due to the conflicting use of the CR.so bit for
820 storing if
821 saturation occurred. Integer Operations that produce a Carry-Out (CA, CA32):
822 these two bits will be `UNDEFINED` if saturation is also requested.
823
824 Note that the operation takes place at the maximum bitwidth (max of
825 src and dest elwidth) and that truncation occurs to the range of the
826 dest elwidth.
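
*Illustrative (non-normative) sketch of the clamping itself for integer operations, returning both the clamped value and the flag that would be placed in the CR Field's SO position when Rc=1:*

```
# illustrative only: saturate a result to the destination element width
def saturate(value, elwidth_bits, signed):
    if signed:                                  # N=1
        lo = -(1 << (elwidth_bits - 1))
        hi = (1 << (elwidth_bits - 1)) - 1
    else:                                       # N=0
        lo, hi = 0, (1 << elwidth_bits) - 1
    clamped = min(max(value, lo), hi)
    return clamped, clamped != value            # (result, saturation occurred)

assert saturate(300, 8, signed=False) == (255, True)
assert saturate(-5,  8, signed=False) == (0, True)
assert saturate(130, 8, signed=True)  == (127, True)
```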
827
828 *Programmer's Note: Post-analysis of the Vector of CRs to find out if any given element hit
829 saturation may be done using a mapreduced CR op (cror), or by using the
830 new crrweird instruction with Rc=1, which will transfer the required
831 CR bits to a scalar integer and update CR0, which will allow testing
832 the scalar integer for nonzero. see [[sv/cr_int_predication]]*
833
834 ## Reduce mode
835
836 Reduction in SVP64 is similar in essence to other Vector Processing
837 ISAs, but leverages the underlying scalar Base v3.0B operations.
838 Thus it is more a convention that the programmer may utilise to give
839 the appearance and effect of a Horizontal Vector Reduction. Due
840 to the unusual decoupling it is also possible to perform
841 prefix-sum (Fibonacci Series) in certain circumstances. Details are in the [[svp64/appendix]]
842
843 Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
844 As explained in the [[sv/appendix]] Reduce Mode switches off the check
845 which would normally stop looping if the result register is scalar.
846 Thus, the result scalar register, if also used as a source scalar,
847 may be used to perform sequential accumulation. This *deliberately*
848 sets up a chain
849 of Register Hazard Dependencies, whereas Parallel Reduce [[sv/remap]]
850 deliberately issues a Tree-Schedule of operations that may be parallelised.
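
*Illustrative (non-normative) sketch of the effect: with both the destination and one source being the same Scalar register, Reduce Mode keeps the loop running over all VL elements, so the scalar acts as a sequential accumulator. Register numbers are arbitrary.*

```
# illustrative only: sv.add with RT and RA Scalar (the same register)
# and RB a Vector, under Reduce Mode: a running sum into r3.
VL, RT, RA, RB = 4, 3, 3, 8
iregs = [0] * 32
iregs[RB:RB + VL] = [10, 20, 30, 40]
for i in range(VL):                          # Reduce Mode: loop is NOT cut short
    iregs[RT] = iregs[RA] + iregs[RB + i]    # deliberate hazard chain on r3
assert iregs[3] == 100
```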
851
852 ## Fail-on-first
853
854 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
855 the other for arithmetic operations (actually, CR-driven). Note in each
856 case the assumption is that vector elements are required to appear to be
857 executed in sequential Program Order. When REMAP is not active,
858 element 0 would be the first.
859
Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
861 CR-creating operation produces a result (including cmp). Similar to
862 branch, an analysis of the CR is performed and if the test fails, the
863 vector operation terminates and discards all element operations **at and
864 above the current one**, and VL is truncated to either
865 the *previous* element or the current one, depending on whether
866 VLi (VL "inclusive") is clear or set, respectively.
867
868 Thus the new VL comprises a contiguous vector of results,
869 all of which pass the testing criteria (equal to zero, less than zero etc
870 as defined by the CR-bit test).
871
872 *Note: when VLi is clear, the behaviour at first seems counter-intuitive.
873 A result is calculated but if the test fails it is prohibited from being
874 actually written. This becomes intuitive again when it is remembered
875 that the length that VL is set to is the number of *written* elements,
876 and only when VLI is set will the current element be included in that
877 count.*
878
879 The CR-based data-driven fail-on-first is "new" and not found in ARM
880 SVE or RVV. At the same time it is "old" because it is almost
881 identical to a generalised form of Z80's `CPIR` instruction.
882 It is extremely useful for reducing instruction count,
883 however requires speculative execution involving modifications of VL
884 to get high performance implementations. An additional mode (RC1=1)
885 effectively turns what would otherwise be an arithmetic operation
886 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
887 against the `inv` field).
888 If the CR.eq bit is equal to `inv` then the Vector is truncated and
889 the loop ends.
890
891 VLi is only available as an option when `Rc=0` (or for instructions
892 which do not have Rc). When set, the current element is always
893 also included in the count (the new length that VL will be set to).
894 This may be useful in combination with "inv" to truncate the Vector
895 to *exclude* elements that fail a test, or, in the case of implementations
896 of strncpy, to include the terminating zero.
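
*Illustrative (non-normative) sketch of the VL truncation rule, excluding predication and element-width overrides; `cr_test_bit` stands in for the BO-style selection of one bit of each element's CR Field.*

```
# illustrative only: CR-driven data-dependent fail-first.
# Truncates VL at the first element whose tested CR bit equals "inv";
# VLi selects whether that failing element is included in the new VL.
def ffirst_truncate(VL, VLi, inv, cr_test_bit):
    for i in range(VL):
        if cr_test_bit(i) == inv:            # test failed at element i
            return i + 1 if VLi else i       # may legitimately become zero
    return VL                                # no failure: VL unchanged
```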
897
898 In CR-based data-driven fail-on-first there is only the option to select
899 and test one bit of each CR (just as with branch BO). For more complex
900 tests this may be insufficient. If that is the case, a vectorised crop
901 such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
902 and ffirst applied to the crop instead of to
903 the arithmetic vector. Note that crops are covered by
904 the [[sv/cr_ops]] Mode format.
905
906 *Programmer's note: `VLi` is only accessible in normal operations
907 which in turn limits the CR field bit-testing to only `EQ/NE`.
908 [[sv/cr_ops]] are not so limited. Thus it is possible to use for
909 example `sv.cror/ff=gt/vli *0,*0,*0`, which is not a `nop` because
910 it allows Fail-First Mode to perform a test and truncate VL.*
911
912 Two extremely important aspects of ffirst are:
913
* LDST ffirst may never set VL equal to zero. This is because on the first
915 element an exception must be raised "as normal".
916 * CR-based data-dependent ffirst on the other hand **can** set VL equal
917 to zero. This is the only means in the entirety of SV that VL may be set
918 to zero (with the exception of via the SV.STATE SPR). When VL is set
919 zero due to the first element failing the CR bit-test, all subsequent
920 vectorised operations are effectively `nops` which is
921 *precisely the desired and intended behaviour*.
922
923 The second crucial aspect, compared to LDST Ffirst:
924
925 * LD/ST Failfirst may (beyond the initial first element
926 conditions) truncate VL for any architecturally
927 suitable reason. Beyond the first element LD/ST Failfirst is
928 arbitrarily speculative and 100% non-deterministic.
* CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise Deterministic test on which algorithms
can and will rely.
934
935 **Floating-point Exceptions**
936
937 When Floating-point exceptions are enabled VL must be truncated at
938 the point where the Exception appears not to have occurred. If `VLi`
939 is set then VL must include the faulting element, and thus the
940 faulting element will always raise its exception. If however `VLi`
941 is clear then VL **excludes** the faulting element and thus the
942 exception will **never** be raised.
943
Although very strongly
discouraged, the Exception Mode that permits Floating Point Exception
notification to arrive too late to unwind is permitted
(under protest, due to it violating
the otherwise 100% Deterministic nature of Data-dependent Fail-first).
949
**Use of lax FP Exception Notification Mode could result in parallel
computations proceeding with invalid results that have to be explicitly
detected, whereas with the strict FP Exception Mode enabled, FFirst
truncates VL, allowing subsequent parallel computation to avoid
the exceptions entirely.**
955
956 ## Data-dependent fail-first on CR operations (crand etc)
957
958 Operations that actually produce or alter CR Field as a result
959 have their own SVP64 Mode, described
960 in [[sv/cr_ops]].
961
962 ## pred-result mode
963
964 This mode merges common CR testing with predication, saving on instruction
965 count. Below is the pseudocode excluding predicate zeroing and elwidth
966 overrides. Note that the pseudocode for SVP64 CR-ops is slightly different.
967
```
for i in range(VL):
    # predication test, skip all masked out elements.
    if predicate_masked_out(i):
        continue
    result = op(iregs[RA+i], iregs[RB+i])
    CRnew = analyse(result) # calculates eq/lt/gt
    # Rc=1 always stores the CR field
    if Rc == 1 or RC1:
        CR.field[offs+i] = CRnew
    # now test CR, similar to branch
    if RC1 or CRnew[BO[0:1]] != BO[2]:
        continue # test failed: cancel store
    # result optionally stored but CR always is
    iregs[RT+i] = result
```
984
985 The reason for allowing the CR element to be stored is so that
986 post-analysis of the CR Vector may be carried out. For example:
987 Saturation may have occurred (and been prevented from updating, by the
988 test) but it is desirable to know *which* elements fail saturation.
989
990 Note that RC1 Mode basically turns all operations into `cmp`. The
991 calculation is performed but it is only the CR that is written. The
992 element result is *always* discarded, never written (just like `cmp`).
993
994 Note that predication is still respected: predicate zeroing is slightly
995 different: elements that fail the CR test *or* are masked out are zero'd.
996
997 # SV Load and Store
998
999 **Rationale**
1000
1001 All Vector ISAs dating back fifty years have extensive and comprehensive
1002 Load and Store operations that go far beyond the capabilities of Scalar
1003 RISC and most CISC processors, yet at their heart on an individual element
1004 basis may be found to be no different from RISC Scalar equivalents.
1005
1006 The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of) element-level Memory accesses.
1009
1010 Additionally, and simply: if the Arithmetic side of an ISA supports
1011 Vector Operations, then in order to keep the ALUs 100% occupied the
1012 Memory infrastructure (and the ISA itself) correspondingly needs Vector
1013 Memory Operations as well.
1014
1015 Vectorised Load and Store also presents an extra dimension (literally)
1016 which creates scenarios unique to Vector applications, that a Scalar
1017 (and even a SIMD) ISA simply never encounters. SVP64 endeavours to
1018 add the modes typically found in *all* Scalable Vector ISAs,
1019 without changing the behaviour of the underlying Base
1020 (Scalar) v3.0B operations in any way.
1021
1022 ## Modes overview
1023
Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different modes:
1026
1027 * **fixed aka "unit" stride** - contiguous sequence with no gaps
1028 * **element strided** - sequential but regularly offset, with gaps
1029 * **vector indexed** - vector of base addresses and vector of offsets
1030 * **Speculative fail-first** - where it makes sense to do so
1031 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
1032
1033 *Despite being constructed from Scalar LD/ST none of these Modes
1034 exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
1035
1036 Also included in SVP64 LD/ST is both signed and unsigned Saturation,
1037 as well as Element-width overrides and Twin-Predication.
1038
1039 Note also that Indexed [[sv/remap]] mode may be applied to both
1040 v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
1041 LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
1042 is provided below.
1043
1044 **Determining the LD/ST Modes**
1045
1046 A minor complication (caused by the retro-fitting of modern Vector
1047 features to a Scalar ISA) is that certain features do not exactly make
1048 sense or are considered a security risk. Fail-first on Vector Indexed
1049 would allow attackers to probe large numbers of pages from userspace, where
1050 strided fail-first (by creating contiguous sequential LDs) does not.
1051
1052 In addition, reduce mode makes no sense.
1053 Realistically we need
1054 an alternative table definition for [[sv/svp64]] `RM.MODE`.
1055 The following modes make sense:
1056
1057 * saturation
1058 * predicate-result (mostly for cache-inhibited LD/ST)
1059 * simple (no augmentation)
1060 * fail-first (where Vector Indexed is banned)
1061 * Signed Effective Address computation (Vector Indexed only)
1062 * Pack/Unpack (on LD/ST immediate operations only)
1063
1064 More than that however it is necessary to fit the usual Vector ISA
1065 capabilities onto both Power ISA LD/ST with immediate and to
1066 LD/ST Indexed. They present subtly different Mode tables, which, due
1067 to lack of space, have the following quirks:
1068
1069 * LD/ST Immediate has no individual control over src/dest zeroing,
1070 whereas LD/ST Indexed does.
1071 * LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
1072 * LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
1073
1074 # Format and fields
1075
1076 Fields used in tables below:
1077
* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
1079 * **zz**: both sz and dz are set equal to this flag.
1080 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1081 * **N** sets signed/unsigned saturation.
1082 * **RC1** as if Rc=1, stores CRs *but not the result*
1083 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
1084 registers that have been reduced due to elwidth overrides
1085
1086 **LD/ST immediate**
1087
1088 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
1089 (bits 19:23 of `RM`) is:
1090
1091 | 0-1 | 2 | 3 4 | description |
1092 | --- | --- |---------|--------------------------- |
1093 | 00 | 0 | zz els | simple mode |
1094 | 00 | 1 | PI LF | post-increment and Fault-First |
1095 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
1096 | 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
1097 | 10 | N | zz els | sat mode: N=0/1 u/s |
1098 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1099 | 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
1100
1101 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
1102 whether stride is unit or element:
1103
1104 ```
1105 if RA.isvec:
1106 svctx.ldstmode = indexed
1107 elif els == 0:
1108 svctx.ldstmode = unitstride
1109 elif immediate != 0:
1110 svctx.ldstmode = elementstride
1111 ```
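
*A brief (non-normative) worked example of the resulting Effective Addresses for a hypothetical doubleword load with scalar RA, an immediate of 8 and VL=4, comparing the two stride interpretations:*

```
# illustrative only: EAs for VL=4, immed=8, op_width=8 (doubleword),
# scalar RA holding 0x1000, unit stride versus element stride.
srcbase, immed, op_width, VL = 0x1000, 8, 8, 4
unit    = [srcbase + immed + i * op_width for i in range(VL)]
element = [srcbase + i * immed            for i in range(VL)]
assert unit    == [0x1008, 0x1010, 0x1018, 0x1020]
assert element == [0x1000, 0x1008, 0x1010, 0x1018]
```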
1112
1113 An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
1114 in effect the multiplication of the immediate-offset by zero results
1115 in reading from the exact same memory location, *even with a Vector
1116 register*. (Normally this type of behaviour is reserved for the
1117 mapreduce modes)
1118
1119 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
1120 just the once and be copied, rather than hitting the Data Cache
1121 multiple times with the same memory read at the same location.
1122 The benefit of Cache-inhibited LD-splats is that it allows
1123 for memory-mapped peripherals to have multiple
1124 data values read in quick succession and stored in sequentially
1125 numbered registers (but, see Note below).
1126
1127 For non-cache-inhibited ST from a vector source onto a scalar
1128 destination: with the Vector
1129 loop effectively creating multiple memory writes to the same location,
1130 we can deduce that the last of these will be the "successful" one. Thus,
1131 implementations are free and clear to optimise out the overwriting STs,
1132 leaving just the last one as the "winner". Bear in mind that predicate
1133 masks will skip some elements (in source non-zeroing mode).
1134 Cache-inhibited ST operations on the other hand **MUST** write out
1135 a Vector source multiple successive times to the exact same Scalar
1136 destination. Just like Cache-inhibited LDs, multiple values may be
1137 written out in quick succession to a memory-mapped peripheral from
1138 sequentially-numbered registers.
1139
1140 Note that any memory location may be Cache-inhibited
1141 (Power ISA v3.1, Book III, 1.6.1, p1033)
1142
1143 *Programmer's Note: an immediate also with a Scalar source as
1144 a "VSPLAT" mode is simply not possible: there are not enough
1145 Mode bits. One single Scalar Load operation may be used instead, followed
1146 by any arithmetic operation (including a simple mv) in "Splat"
1147 mode.*
1148
1149 **LD/ST Indexed**
1150
1151 The modes for `RA+RB` indexed version are slightly different
1152 but are the same `RM.MODE` bits (19:23 of `RM`):
1153
1154 | 0-1 | 2 | 3 4 | description |
1155 | --- | --- |---------|-------------------------- |
1156 | 00 | SEA | dz sz | simple mode |
1157 | 01 | SEA | dz sz | Strided (scalar only source) |
1158 | 10 | N | dz sz | sat mode: N=0/1 u/s |
1159 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1160 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
1161
1162 Vector Indexed Strided Mode is qualified as follows:
1163
1164 if mode = 0b01 and !RA.isvec and !RB.isvec:
1165 svctx.ldstmode = elementstride
1166
1167 A summary of the effect of Vectorisation of src or dest:
1168
1169 imm(RA) RT.v RA.v no stride allowed
1170 imm(RA) RT.s RA.v no stride allowed
1171 imm(RA) RT.v RA.s stride-select allowed
1172 imm(RA) RT.s RA.s not vectorised
1173 RA,RB RT.v {RA|RB}.v Standard Indexed
1174 RA,RB RT.s {RA|RB}.v Indexed but single LD (no VSPLAT)
1175 RA,RB RT.v {RA&RB}.s VSPLAT possible. stride selectable
1176 RA,RB RT.s {RA&RB}.s not vectorised (scalar identity)
1177
Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is
set, RB is sign-extended from elwidth bits to the full 64
bits before being added to RA in order to calculate the Effective
Address. For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
1185
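As an informal illustration of the SEA rule (a sketch only, not the normative pseudocode given later), the RB offset is read at the source element width and, when SEA is set, sign-extended to 64 bits before the addition with RA:

```
# Illustrative model of Signed Effective Address (SEA) for Vector Indexed mode.
def sext(value, bits):
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def indexed_ea(ra, rb_elem, src_elwidth, sea):
    offs = rb_elem & ((1 << src_elwidth) - 1)   # RB element at overridden width
    if sea:
        offs = sext(offs, src_elwidth)          # sign-extend to 64 bits
    return (ra + offs) & ((1 << 64) - 1)

print(hex(indexed_ea(0x1000, 0xFFFE, 16, sea=True)))   # 0xffe   (RA - 2)
print(hex(indexed_ea(0x1000, 0xFFFE, 16, sea=False)))  # 0x10ffe (RA + 0xFFFE)
```
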
Note that cache-inhibited LD/ST when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially. Even with a scalar src a
Cache-inhibited LD will read the same memory location *multiple times*, storing the result in successive Vector destination registers. This is because the cache-inhibit instructions are typically used to read and write memory-mapped peripherals.
1188 If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
1189 cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
1190 copying the one *scalar* value into multiple register destinations.
1191
Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This makes it possible, for example, to issue a massive batch of memory-mapped
peripheral reads, stopping at the first NUL (zero) byte and
truncating VL to that point. No branch is needed to issue that large burst
of LDs, which may be valuable in Embedded scenarios.
1197
1198 ## Vectorisation of Scalar Power ISA v3.0B
1199
1200 Scalar Power ISA Load/Store operations may be seen from their
1201 pseudocode to be of the form:
1202
    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)
1206
1207 and for immediate variants:
1208
    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)
1212
1213 Thus in the first example, the source registers may each be independently
1214 marked as scalar or vector, and likewise the destination; in the second
1215 example only the one source and one dest may be marked as scalar or
1216 vector.
1217
1218 Thus we can see that Vector Indexed may be covered, and, as demonstrated
1219 with the pseudocode below, the immediate can be used to give unit
1220 stride or element stride. With there being no way to tell which from
1221 the Power v3.0B Scalar opcode alone, the choice is provided instead by
1222 the SV Context.
1223
1224 ```
1225 # LD not VLD! format - ldop RT, immed(RA)
1226 # op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip non-predicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
1270 ```
1271
1272 Indexed LD is:
1273
1274 ```
1275 # format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
1297 ```
1298
1299 Note that Element-Strided uses the Destination Step because with both
1300 sources being Scalar as a prerequisite condition of activation of
1301 Element-Stride Mode, the source step (being Scalar) would never advance.
1302
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source. This is because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.
1304
1305 *Programmer's note: being able to set RA-as-a-source
1306 as separate from RA-as-a-destination as Scalar is **extremely valuable**
1307 once it is remembered that Simple-V element operations must
1308 be in Program Order, especially in loops, for saving on
1309 multiple address computations. Care does have
1310 to be taken however that RA-as-src is not overwritten by
1311 RA-as-dest unless intentionally desired, especially in element-strided Mode.*
1312
1313 ## LD/ST Indexed vs Indexed REMAP
1314
1315 Unfortunately the word "Indexed" is used twice in completely different
1316 contexts, potentially causing confusion.
1317
* Instructions such as `ld RT,RA,RB` have existed in the Power ISA since
  its creation: these are called "LD/ST Indexed" instructions and their
  name and meaning are well-established.
1321 * There now exists, in Simple-V, a REMAP mode called "Indexed"
1322 Mode that can be applied to *any* instruction **including those
1323 named LD/ST Indexed**.
1324
Whilst it may be costly in terms of register reads to allow REMAP
Indexed Mode to be applied to any Vectorised LD/ST Indexed operation such as
`sv.ld *RT,RA,*RB`, and whilst the combination may even be misleadingly
labelled as redundant, firstly the strict
application of the RISC Paradigm that Simple-V follows makes it awkward
to consider *preventing* the application of Indexed REMAP to such
operations, and secondly the two are not actually the same at all.
1332
1333 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
1334 effectively performs an *in-place* re-ordering of the offsets, RB.
1335 To achieve the same effect without Indexed REMAP would require taking
1336 a *copy* of the Vector of offsets starting at RB, manually explicitly
1337 reordering them, and finally using the copy of re-ordered offsets in
a non-REMAP'ed `sv.ld`. Using non-strided LD as an example,
the pseudocode below shows what actually occurs,
where the pseudocode for `indexed_remap` may be found in [[sv/remap]]:
1341
1342 ```
1343 # sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
1351 ```
1352
1353 Thus it can be seen that the use of Indexed REMAP saves copying
1354 and manual reordering of the Vector of RB offsets.
1355
1356 ## LD/ST ffirst
1357
1358 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
1362 1 and above, if an exception would occur, then VL is **truncated**
1363 to the previous element: the exception is **not** then raised because
1364 the LD/ST that would otherwise have caused an exception is *required*
1365 to be cancelled. Additionally an implementor may choose to truncate VL
1366 for any arbitrary reason *except for the very first*.
1367
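A minimal Python sketch of this Fail-First rule (illustrative only: `would_fault` stands in for whatever exception condition an implementation detects, and `do_ldst` for the actual memory operation):

```
# Illustrative model: element 0 faults exactly like a Scalar LD/ST; for
# elements 1 and above a would-be fault instead truncates VL and stops.
def ldst_ffirst(VL, would_fault, do_ldst):
    for i in range(VL):
        if would_fault(i):
            if i == 0:
                raise MemoryError("element 0 must fault like scalar LD/ST")
            return i        # VL truncated: elements 0..i-1 completed
        do_ldst(i)
    return VL               # no truncation

new_VL = ldst_ffirst(8, would_fault=lambda i: i == 5, do_ldst=lambda i: None)
print(new_VL)               # 5: elements 0..4 completed, element 5 cancelled
```
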
1368 ffirst LD/ST to multiple pages via a Vectorised Index base is
1369 considered a security risk due to the abuse of probing multiple
1370 pages in rapid succession and getting speculative feedback on which
1371 pages would fail. Therefore Vector Indexed LD/ST is prohibited
entirely, and the Mode bit is instead used for element-strided LD/ST.
1373 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
1374
1375 ```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
1378 ```
1379
1380 High security implementations where any kind of speculative probing
1381 of memory pages is considered a risk should take advantage of the fact that
1382 implementations may truncate VL at any point, without requiring software
1383 to be rewritten and made non-portable. Such implementations may choose
1384 to *always* set VL=1 which will have the effect of terminating any
1385 speculative probing (and also adversely affect performance), but will
1386 at least not require applications to be rewritten.
1387
Low-performance simpler hardware implementations may also
choose to always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
1391 the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
1392 **MUST** raise exceptions exactly like an ordinary LD/ST.
1393
For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any
implementation-specific reason. For example: it is perfectly reasonable for
implementations to alter VL when ffirst LD or ST operations are initiated on a
non-aligned boundary, such that within a loop the subsequent iteration of that
loop begins the following ffirst LD/ST operations on an aligned boundary
such as the beginning of a cache line, or the beginning of a Virtual Memory
page. Likewise, VL may be truncated to reduce workloads or to balance resources.
1397
1398 Vertical-First Mode is slightly strange in that only one element
1399 at a time is ever executed anyway. Given that programmers may
1400 legitimately choose to alter srcstep and dststep in non-sequential
1401 order as part of explicit loops, it is neither possible nor
1402 safe to make speculative assumptions about future LD/STs.
1403 Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
1404 This is very different from Arithmetic (Data-dependent) FFirst
1405 where Vertical-First Mode is fully deterministic, not speculative.
1406
1407 ## LOAD/STORE Elwidths <a name="elwidth"></a>
1408
1409 Loads and Stores are almost unique in that the Power Scalar ISA
1410 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
1411 others like it provide an explicit operation width. There are therefore
1412 *three* widths involved:
1413
1414 * operation width (lb=8, lh=16, lw=32, ld=64)
1415 * src element width override (8/16/32/default)
1416 * destination element width override (8/16/32/default)
1417
1418 Some care is therefore needed to express and make clear the transformations,
1419 which are expressly in this order:
1420
1421 * Calculate the Effective Address from RA at full width
1422 but (on Indexed Load) allow srcwidth overrides on RB
1423 * Load at the operation width (lb/lh/lw/ld) as usual
1424 * byte-reversal as usual
1425 * Non-saturated mode:
1426 - zero-extension or truncation from operation width to dest elwidth
1427 - place result in destination at dest elwidth
1428 * Saturated mode:
1429 - Sign-extension or truncation from operation width to dest width
1430 - signed/unsigned saturation down to dest elwidth
1431
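The ordering above may be illustrated with a small Python sketch of the final destination-elwidth stage (informal: the helper name and the unsigned-load assumption are for the example only, and do not replace the pseudocode given later):

```
# Illustrative model of the dest-elwidth stage for a load: the memory
# access itself has already been performed at the operation width.
def to_dest_elwidth(memread, dest_bits, saturated, signed_sat):
    if not saturated:
        # Non-saturated: zero-extend or truncate to dest elwidth
        return memread & ((1 << dest_bits) - 1)
    # Saturated: clamp into the destination range
    if signed_sat:
        lo, hi = -(1 << (dest_bits - 1)), (1 << (dest_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_bits) - 1
    return min(max(memread, lo), hi)

print(hex(to_dest_elwidth(0x1234, 8, saturated=False, signed_sat=False)))  # 0x34
print(hex(to_dest_elwidth(0x1234, 8, saturated=True,  signed_sat=False)))  # 0xff
```
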
1432 In order to respect Power v3.0B Scalar behaviour the memory side
1433 is treated effectively as completely separate and distinct from SV
1434 augmentation. This is primarily down to quirks surrounding LE/BE and
1435 byte-reversal.
1436
1437 It is rather unfortunately possible to request an elwidth override
1438 on the memory side which
1439 does not mesh with the overridden operation width: these result in
1440 `UNDEFINED`
1441 behaviour. The reason is that the effect of attempting a 64-bit `sv.ld`
1442 operation with a source elwidth override of 8/16/32 would result in
1443 overlapping memory requests, particularly on unit and element strided
1444 operations. Thus it is `UNDEFINED` when the elwidth is smaller than
1445 the memory operation width. Examples include `sv.lw/sw=16/els` which
1446 requests (overlapping) 4-byte memory reads offset from
1447 each other at 2-byte intervals. Store likewise is also `UNDEFINED`
1448 where the dest elwidth override is less than the operation width.
1449
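The `sv.lw/sw=16/els` example may be made concrete with a few lines of illustrative arithmetic (assuming, as stated above, 4-byte reads advancing at 2-byte intervals):

```
# Illustrative arithmetic only: sv.lw (4-byte reads) with a 16-bit source
# elwidth override in element-strided mode produces overlapping requests.
op_width = 4   # lw reads 4 bytes
stride   = 2   # sw=16 override: 2-byte intervals
for i in range(4):
    offs = i * stride
    print(f"element {i}: reads bytes {offs}..{offs + op_width - 1}")
# element 0: bytes 0..3, element 1: bytes 2..5, ... hence UNDEFINED
```
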
1450 Note the following regarding the pseudocode to follow:
1451
1452 * `scalar identity behaviour` SV Context parameter conditions turn this
1453 into a straight absolute fully-compliant Scalar v3.0B LD operation
1454 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
1455 rather than `ld`)
1456 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
1457 a "normal" part of Scalar v3.0B LD
1458 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
1459 as a "normal" part of Scalar v3.0B LD
1460 * `svctx` specifies the SV Context and includes VL as well as
1461 source and destination elwidth overrides.
1462
1463 Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in
1464 both Immediate and Indexed LD/ST,
1465 does not have element-width overriding applied to it.
1466
1467 Note that predication, predication-zeroing,
1468 and other modes except saturation have all been removed,
1469 for clarity and simplicity:
1470
1471 ```
1472 # LD not VLD!
1473 # this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1506 ```
1507
1508 Note above that the source elwidth is *not used at all* in LD-immediate.
1509
1510 For LD/Indexed, the key is that in the calculation of the Effective Address,
1511 RA has no elwidth override but RB does. Pseudocode below is simplified
1512 for clarity: predication and all modes except saturation are removed:
1513
1514 ```
1515 # LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1555 ```
1556
1557 # Remapped LD/ST
1558
1559 In the [[sv/remap]] page the concept of "Remapping" is described.
1560 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
1561 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
1562 elements worth of LDs or STs. The usual interest in such re-mapping
1563 is for example in separating out 24-bit RGB channel data into separate
1564 contiguous registers.
1565
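As a purely conceptual sketch (the actual REMAP algorithms are defined in [[sv/remap]], and the index function here is invented for the example), separating interleaved RGB bytes into three contiguous per-channel groups amounts to re-ordering element indices:

```
# Illustrative only: a 1D "structure packing" re-ordering of element
# indices, of the kind that REMAP can apply to a vector of byte LDs.
def rgb_split_indices(num_pixels):
    # interleaved memory layout: R0 G0 B0 R1 G1 B1 ...
    # desired register layout:   all R, then all G, then all B
    order = []
    for channel in range(3):
        for pixel in range(num_pixels):
            order.append(pixel * 3 + channel)
    return order

print(rgb_split_indices(4))  # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```
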
1566 REMAP easily covers this capability, and with dest
1567 elwidth overrides and saturation may do so with built-in conversion that
1568 would normally require additional width-extension, sign-extension and
1569 min/max Vectorised instructions as post-processing stages.
1570
1571 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
1572 because the generic abstracted concept of "Remapping", when applied to
1573 LD/ST, will give that same capability, with far more flexibility.
1574
1575 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
1576 established through `svstep`, are also an easy way to perform regular
1577 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond
1578 that, REMAP will need to be used.
1579
1580 # Condition Register SVP64 Operations
1581
1582 Condition Register Fields are only 4 bits wide: this presents some
1583 interesting conceptual challenges for SVP64, which was designed
1584 primarily for vectors of arithmetic and logical operations. However
1585 if predicates may be bits of CR Fields it makes sense to extend
1586 Simple-V to cover CR Operations, especially given that Vectorised Rc=1
may be processed by Vectorised CR Operations that usefully in turn
1588 may become Predicate Masks to yet more Vector operations, like so:
1589
1590 ```
sv.cmpi/ew=8 *B,*ra,0     # compare bytes against zero
sv.cmpi/ew=8 *B2,*ra,13   # and against newline
sv.cror PM.EQ,B.EQ,B2.EQ  # OR compares to create mask
sv.stb/sm=EQ ...          # store only nonzero/newline
1595 ```
1596
1597 Element width however is clearly meaningless for a 4-bit collation of
Conditions, LT GT EQ SO. Likewise, arithmetic saturation (an important
1599 part of Arithmetic SVP64) has no meaning. An alternative Mode Format is
1600 required, and given that elwidths are meaningless for CR Fields the bits
1601 in SVP64 `RM` may be used for other purposes.
1602
1603 This alternative mapping **only** applies to instructions that **only**
1604 reference a CR Field or CR bit as the sole exclusive result. This section
1605 **does not** apply to instructions which primarily produce arithmetic
1606 results that also, as an aside, produce a corresponding
1607 CR Field (such as when Rc=1).
1608 Instructions that involve Rc=1 are definitively arithmetic in nature,
1609 where the corresponding Condition Register Field can be considered to
be a "co-result". Such CR Field "co-result" arithmetic operations
1611 are firmly out of scope for
1612 this section, being covered fully by [[sv/normal]].
1613
* Examples of v3.0B instructions to which this section does
  apply are:
1616 - `mfcr` and `cmpi` (3 bit operands) and
1617 - `crnor` and `crand` (5 bit operands).
1618 * Examples to which this section does **not** apply include
1619 `fadds.` and `subf.` which both produce arithmetic results
1620 (and a CR Field co-result).
1621
1622 The CR Mode Format still applies to `sv.cmpi` because despite
1623 taking a GPR as input, the output from the Base Scalar v3.0B `cmpi`
1624 instruction is purely to a Condition Register Field.
1625
1626 Other modes are still applicable and include:
1627
1628 * **Data-dependent fail-first**.
1629 useful to truncate VL based on
1630 analysis of a Condition Register result bit.
1631 * **Reduction**.
1632 Reduction is useful
1633 for analysing a Vector of Condition Register Fields
1634 and reducing it to one
1635 single Condition Register Field.
1636
1637 Predicate-result does not make any sense because
1638 when Rc=1 a co-result is created (a CR Field). Testing the co-result
1639 allows the decision to be made to store or not store the main
1640 result, and for CR Ops the CR Field result *is*
1641 the main result.
1642
1643 ## Format
1644
1645 SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations:
1646
1647 |6 | 7 |19-20| 21 | 22 23 | description |
1648 |--|---|-----| --- |---------|----------------- |
1649 |/ | / |0 RG | 0 | dz sz | simple mode |
1650 |/ | / |0 RG | 1 | dz sz | scalar reduce mode (mapreduce) |
1651 |zz|SNZ|1 VLI| inv | CR-bit | Ffirst 3-bit mode |
1652 |/ |SNZ|1 VLI| inv | dz sz | Ffirst 5-bit mode (implies CR-bit from result) |
1653
1654 Fields:
1655
* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
1657 * **zz** set both sz and dz equal to this flag
1658 * **SNZ** In fail-first mode, on the bit being tested, when sz=1 and SNZ=1 a value "1" is put in place of "0".
1659 * **inv CR-bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1660 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
1661 than the normal 0..VL-1
1662 * **SVM** sets "subvector" reduce mode
1663 * **VLi** VL inclusive: in fail-first mode, the truncation of
1664 VL *includes* the current element at the failure point rather
1665 than excludes it from the count.
1666
1667 ## Data-dependent fail-first on CR operations
1668
1669 The principle of data-dependent fail-first is that if, during
1670 the course of sequentially evaluating an element's Condition Test,
1671 one such test is encountered which fails,
1672 then VL (Vector Length) is truncated (set) at that point. In the case
1673 of Arithmetic SVP64 Operations the Condition Register Field generated from
1674 Rc=1 is used as the basis for the truncation decision.
1675 However with CR-based operations that CR Field result to be
1676 tested is provided
1677 *by the operation itself*.
1678
1679 Data-dependent SVP64 Vectorised Operations involving the creation or
1680 modification of a CR can require an extra two bits, which are not available
1681 in the compact space of the SVP64 RM `MODE` Field. With the concept of element
1682 width overrides being meaningless for CR Fields it is possible to use the
1683 `ELWIDTH` field for alternative purposes.
1684
1685 Condition Register based operations such as `sv.mfcr` and `sv.crand` can thus
1686 be made more flexible. However the rules that apply in this section
1687 also apply to future CR-based instructions.
1688
1689 There are two primary different types of CR operations:
1690
1691 * Those which have a 3-bit operand field (referring to a CR Field)
1692 * Those which have a 5-bit operand (referring to a bit within the
1693 whole 32-bit CR)
1694
Examining these two types it is observed that the key difference
is that the 5-bit variant *already* provides the
prerequisite information about which CR Field bit (LT, GT, EQ, SO) is to
be operated on by the instruction.
1700 Thus, logically, we may set the following rule:
1701
1702 * When a 5-bit CR Result field is used in an instruction, the
1703 5-bit variant of Data-Dependent Fail-First
1704 must be used. i.e. the bit of the CR field to be tested is
1705 the one that has just been modified (created) by the operation.
1706 * When a 3-bit CR Result field is used the 3-bit variant
1707 must be used, providing as it does the missing `CRbit` field
1708 in order to select which CR Field bit of the result shall
be tested (LT, GT, EQ, SO).
1710
1711 The reason why the 3-bit CR variant needs the additional CR-bit
1712 field should be obvious from the fact that the 3-bit CR Field
1713 from the base Power ISA v3.0B operation clearly does not contain
1714 and is missing the two CR Field Selector bits. Thus, these two
bits (to select LT, GT, EQ or SO) must be provided in another
1716 way.
1717
Examples of the two types:
1719
* crand, cror, crnor. These are all 5-bit (BA, BB, BT). The bit
1721 to be tested against `inv` is the one selected by `BT`
1722 * mcrf. This has only 3-bit (BF, BFA). In order to select the
1723 bit to be tested, the alternative encoding must be used.
1724 With `CRbit` coming from the SVP64 RM bits 22-23 the bit
1725 of BF to be tested is identified.
1726
1727 Just as with SVP64 [[sv/branches]] there is the option to truncate
1728 VL to include the element being tested (`VLi=1`) and to exclude it
1729 (`VLi=0`).
1730
1731 Also exactly as with [[sv/normal]] fail-first, VL cannot, unlike
1732 [[sv/ldst]], be set to an arbitrary value. Deterministic behaviour
1733 is *required*.
1734
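An illustrative Python sketch of the CR-based Data-Dependent Fail-First rule (informal: `tested_bits` stands for the selected CR Field bit of each element's result, and the function is not the normative pseudocode):

```
# Illustrative model: walk the Vector of CR Field results, truncating
# VL at the first element whose tested bit fails against `inv`.
def cr_ffirst(VL, tested_bits, inv, VLi):
    for i in range(VL):
        success = tested_bits[i] != inv      # inv=0: bit must be set to pass
        if not success:
            return i + 1 if VLi else i       # inclusive / exclusive truncation
    return VL

bits = [1, 1, 0, 1]
print(cr_ffirst(4, bits, inv=0, VLi=0))      # 2: element 2 fails, excluded
print(cr_ffirst(4, bits, inv=0, VLi=1))      # 3: element 2 fails, included
```
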
1735 ## Reduction and Iteration
1736
Bearing in mind that, as described in the SVP64 Appendix, SVP64 Horizontal
Reduction is a deterministic schedule on top of base Scalar v3.0 operations,
the same rules apply to CR Operations, i.e. programmers must
follow certain conventions in order for an *end result* of a
reduction to be achieved. Unlike
1742 other Vector ISAs *there are no explicit reduction opcodes*
1743 in SVP64: Schedules however achieve the same effect.
1744
Due to these conventions only reduction on operations such as `crand`
and `cror` is meaningful because these have Condition Register Fields
1747 as both input and output.
1748 Meaningless operations are not prohibited because the cost in hardware
1749 of doing so is prohibitive, but neither are they `UNDEFINED`. Implementations
1750 are still required to execute them but are at liberty to optimise out
1751 any operations that would ultimately be overwritten, as long as Strict
Program Order is still observable by the programmer.
1753
1754 Also bear in mind that 'Reverse Gear' may be enabled, which can be
1755 used in combination with overlapping CR operations to iteratively accumulate
1756 results. Issuing a `sv.crand` operation for example with `BA`
1757 differing from `BB` by one Condition Register Field would
result in a cascade effect, where the first zero encountered in a
CR Field would clear the result, and consequently all subsequent CR Field
elements thereafter:
1761
1762 ```
1763 # sv.crand/mr/rg CR4.ge.v, CR5.ge.v, CR4.ge.v
for i in VL-1 downto 0 # reverse gear
    CR.field[4+i].ge &= CR.field[5+i].ge
1766 ```
1767
1768 `sv.crxor` with reduction would be particularly useful for parity calculation
1769 for example, although there are many ways in which the same calculation
1770 could be carried out after transferring a vector of CR Fields to a GPR
1771 using crweird operations.
1772
1773 Implementations are free and clear to optimise these reductions in any
1774 way they see fit, as long as the end-result is compatible with Strict Program
1775 Order being observed, and Interrupt latency is not adversely impacted.
1776
1777 ## Unusual and quirky CR operations
1778
1779 **cmp and other compare ops**
1780
1781 `cmp` and `cmpi` etc take GPRs as sources and create a CR Field as a result.
1782
    cmpli BF,L,RA,UI
    cmpeqb BF,RA,RB
1785
1786 With `ELWIDTH` applying to the source GPR operands this is perfectly fine.
1787
1788 **crweird operations**
1789
1790 There are 4 weird CR-GPR operations and one reasonable one in
1791 the [[cr_int_predication]] set:
1792
1793 * crrweird
1794 * mtcrweird
1795 * crweirder
1796 * crweird
1797 * mcrfm - reasonably normal and referring to CR Fields for src and dest.
1798
1799 The "weird" operations have a non-standard behaviour, being able to
1800 treat *individual bits* of a GPR effectively as elements. They are
1801 expected to be Micro-coded by most Hardware implementations.
1802
1803
# SVP64 Branch Conditional behaviour
1805
1806 Please note: although similar, SVP64 Branch instructions should be
1807 considered completely separate and distinct from
1808 standard scalar OpenPOWER-approved v3.0B branches.
1809 **v3.0B branches are in no way impacted, altered,
1810 changed or modified in any way, shape or form by
1811 the SVP64 Vectorised Variants**.
1812
1813 It is also
1814 extremely important to note that Branches are the
1815 sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`.
1816 SVP64 Branches contain additional modes that are useful
1817 for scalar operations (i.e. even when VL=1 or when
1818 using single-bit predication).
1819
1820 **Rationale**
1821
1822 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
1823 Condition Register. However for parallel processing it is simply impossible
1824 to perform multiple independent branches: the Program Counter simply
1825 cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a decision for a *single* branch,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.
1830
1831 In 3D Shader
1832 binaries, which are inherently parallelised and predicated, testing all or
1833 some results and branching based on multiple tests is extremely common,
1834 and a fundamental part of Shader Compilers. Example:
1835 without such multi-condition
1836 test-and-branch, if a predicate mask is all zeros a large batch of
1837 instructions may be masked out to `nop`, and it would waste
1838 CPU cycles to run them. 3D GPU ISAs can test for this scenario
1839 and, with the appropriate predicate-analysis instruction,
1840 jump over fully-masked-out operations, by spotting that
1841 *all* Conditions are false.
1842
1843 Unless Branches are aware and capable of such analysis, additional
1844 instructions would be required which perform Horizontal Cumulative
1845 analysis of Vectorised Condition Register Fields, in order to
1846 reduce the Vector of CR Fields down to one single yes or no
1847 decision that a Scalar-only v3.0B Branch-Conditional could cope with.
1848 Such instructions would be unavoidable, required, and costly
1849 by comparison to a single Vector-aware Branch.
1850 Therefore, in order to be commercially competitive, `sv.bc` and
1851 other Vector-aware Branch Conditional instructions are a high priority
1852 for 3D GPU (and OpenCL-style) workloads.
1853
1854 Given that Power ISA v3.0B is already quite powerful, particularly
1855 the Condition Registers and their interaction with Branches, there
1856 are opportunities to create extremely flexible and compact
1857 Vectorised Branch behaviour. In addition, the side-effects (updating
1858 of CTR, truncation of VL, described below) make it a useful instruction
1859 even if the branch points to the next instruction (no actual branch).
1860
1861 ## Overview
1862
1863 When considering an "array" of branch-tests, there are four
1864 primarily-useful modes:
1865 AND, OR, NAND and NOR of all Conditions.
1866 NAND and NOR may be synthesised from AND and OR by
1867 inverting `BO[1]` which just leaves two modes:
1868
1869 * Branch takes place on the **first** CR Field test to succeed
1870 (a Great Big OR of all condition tests). Exit occurs
1871 on the first **successful** test.
1872 * Branch takes place only if **all** CR field tests succeed:
1873 a Great Big AND of all condition tests. Exit occurs
1874 on the first **failed** test.
1875
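Conceptually (an informal Python analogy rather than the normative pseudocode given later), the two modes behave like short-circuiting `all()` and `any()` over the per-element condition tests:

```
# Illustrative analogy: ALL mode is a short-circuit AND over the element
# tests; non-ALL ("ANY") mode is a short-circuit OR.
def branch_taken(tests, ALL):
    if ALL:
        return all(tests)   # exits at the first failed test
    return any(tests)       # exits at the first successful test

print(branch_taken(iter([True, True, False, True]), ALL=True))   # False
print(branch_taken(iter([False, False, True, True]), ALL=False)) # True
```
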
1876 Early-exit is enacted such that the Vectorised Branch does not
1877 perform needless extra tests, which will help reduce reads on
1878 the Condition Register file.
1879
1880 *Note: Early-exit is **MANDATORY** (required) behaviour.
1881 Branches **MUST** exit at the first sequentially-encountered
1882 failure point, for
1883 exactly the same reasons for which it is mandatory in
1884 programming languages doing early-exit: to avoid
1885 damaging side-effects and to provide deterministic
1886 behaviour. Speculative testing of Condition
1887 Register Fields is permitted, as is speculative calculation
1888 of CTR, as long as, as usual in any Out-of-Order microarchitecture,
1889 that speculative testing is cancelled should an early-exit occur.
1890 i.e. the speculation must be "precise": Program Order must be preserved*
1891
1892 Also note that when early-exit occurs in Horizontal-first Mode,
1893 srcstep, dststep etc. are all reset, ready to begin looping from the
1894 beginning for the next instruction. However for Vertical-first
1895 Mode srcstep etc. are incremented "as usual" i.e. an early-exit
1896 has no special impact, regardless of whether the branch
1897 occurred or not. This can leave srcstep etc. in what may be
1898 considered an unusual
1899 state on exit from a loop and it is up to the programmer to
1900 reset srcstep, dststep etc. to known-good values
1901 *(easily achieved with `setvl`)*.
1902
1903 Additional useful behaviour involves two primary Modes (both of
1904 which may be enabled and combined):
1905
1906 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
1907 for Arithmetic SVP64 operations, with more
1908 flexibility and a close interaction and integration into the
1909 underlying base Scalar v3.0B Branch instruction.
1910 Truncation of VL takes place around the early-exit point.
1911 * **CTR-test Mode**: gives much more flexibility over when and why
1912 CTR is decremented, including options to decrement if a Condition
1913 test succeeds *or if it fails*.
1914
1915 With these side-effects, basic Boolean Logic Analysis advises that
1916 it is important to provide a means
1917 to enact them each based on whether testing succeeds *or fails*. This
1918 results in a not-insignificant number of additional Mode Augmentation bits,
1919 accompanying VLSET and CTR-test Modes respectively.
1920
1921 Predicate skipping or zeroing may, as usual with SVP64, be controlled
1922 by `sz`.
1923 Where the predicate is masked out and
1924 zeroing is enabled, then in such circumstances
1925 the same Boolean Logic Analysis dictates that
1926 rather than testing only against zero, the option to test
1927 against one is also prudent. This introduces a new
1928 immediate field, `SNZ`, which works in conjunction with
1929 `sz`.
1930
1931
1932 Vectorised Branches can be used
1933 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
1934 at an element level, the behaviour is identical in both Modes,
1935 although the `ALL` bit is meaningless in Vertical-First Mode.
1936
1937 It is also important
1938 to bear in mind that, fundamentally, Vectorised Branch-Conditional
1939 is still extremely close to the Scalar v3.0B Branch-Conditional
1940 instructions, and that the same v3.0B Scalar Branch-Conditional
1941 instructions are still
1942 *completely separate and independent*, being unaltered and
1943 unaffected by their SVP64 variants in every conceivable way.
1944
*Programming note: One important point is that SVP64 instructions are 64-bit
(8 bytes not 4). This needs to be taken into consideration when computing
1947 branch offsets: the offset is relative to the start of the instruction,
1948 which **includes** the SVP64 Prefix*
1949
1950 ## Format and fields
1951
1952 With element-width overrides being meaningless for Condition
1953 Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
1954 Mode bits.
1955
1956 SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5,
1957 and `ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch
1958 Conditional:
1959
1960 | 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
1961 | - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
1962 |ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | simple mode |
1963 |ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode |
1964 |ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode |
1965 |ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
1966
1967 Brief description of fields:
1968
1969 * **sz=1** if predication is enabled and `sz=1` and a predicate
1970 element bit is zero, `SNZ` will
1971 be substituted in place of the CR bit selected by `BI`,
1972 as the Condition tested.
1973 Contrast this with
1974 normal SVP64 `sz=1` behaviour, where *only* a zero is put in
1975 place of masked-out predicate bits.
1976 * **sz=0** When `sz=0` skipping occurs as usual on
1977 masked-out elements, but unlike all
1978 other SVP64 behaviour which entirely skips an element with
1979 no related side-effects at all, there are certain
1980 special circumstances where CTR
1981 may be decremented. See CTR-test Mode, below.
1982 * **ALL** when set, all branch conditional tests must pass in order for
1983 the branch to succeed. When clear, it is the first sequentially
1984 encountered successful test that causes the branch to succeed.
1985 This is identical behaviour to how programming languages perform
1986 early-exit on Boolean Logic chains.
1987 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
1988 In VLSET mode, VL *may* (depending on `VSb`) be truncated.
1989 If VLI (Vector Length Inclusive) is clear,
1990 VL is truncated to *exclude* the current element, otherwise it is
1991 included. SVSTATE.MVL is not altered: only VL.
1992 * **SL** identical to `LR` except applicable to SVSTATE. If `SL`
1993 is set, SVSTATE is transferred to SVLR (conditionally on
1994 whether `SLu` is set).
1995 * **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
1996 * **LRu**: Link Register Update, used in conjunction with LK=1
1997 to make LR update conditional
1998 * **VSb** In VLSET Mode, after testing,
1999 if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
2000 VL is truncated if a test *fails*. Masked-out (skipped)
2001 bits are not considered
2002 part of testing when `sz=0`
2003 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
2004 tested. CTR inversion decrements if a test *fails*. Only relevant
2005 in CTR-test Mode.
2006
2007 LRu and CTR-test modes are where SVP64 Branches subtly differ from
2008 Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
`sv.bcl/lru` will only update LR if the Branch Condition fails.
2010
2011 Of special interest is that when using ALL Mode (Great Big AND
2012 of all Condition Tests), if `VL=0`,
2013 which is rare but can occur in Data-Dependent Modes, the Branch
2014 will always take place because there will be no failing Condition
2015 Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
2016 of all Condition Tests) and `VL=0` the Branch is guaranteed not
2017 to occur because there will be no *successful* Condition Tests
2018 to make it happen.
2019
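This mirrors the usual vacuous-truth convention (informally, in Python terms):

```
# VL=0 gives an empty set of Condition Tests:
print(all([]))   # True  -> ALL mode: the Branch is always taken
print(any([]))   # False -> non-ALL mode: the Branch is never taken
```
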
2020 ## Vectorised CR Field numbering, and Scalar behaviour
2021
2022 It is important to keep in mind that just like all SVP64 instructions,
2023 the `BI` field of the base v3.0B Branch Conditional instruction
2024 may be extended by SVP64 EXTRA augmentation, as well as be marked
2025 as either Scalar or Vector. It is also crucially important to keep in mind
2026 that for CRs, SVP64 sequentially increments the CR *Field* numbers.
2027 CR *Fields* are treated as elements, not bit-numbers of the CR *register*.
2028
The `BI` operand of Branch Conditional operations is five bits: in scalar
v3.0B this selects one bit of the 32-bit CR,
comprising eight CR Fields of 4 bits each. In SVP64 there are
sixteen 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT GT EQ SO), and the top 3 bits
2034 are extended to either scalar or vector and to select CR Fields 0..127
2035 as specified in SVP64 [[sv/svp64/appendix]].
2036
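Informally, the decode may be sketched as follows (the `augment` helper here is a hypothetical, simplified stand-in for the EXTRA augmentation rules defined in the appendix):

```
# Illustrative decode of the 5-bit BI operand under SVP64: the 2 LSBs
# select the bit within the CR Field, the top 3 bits (plus EXTRA) select
# one of CR Fields 0..127.
def augment(base_field, extra):
    # hypothetical stand-in: simply widens the 3-bit field number
    return (extra << 3) | base_field

def decode_BI(BI, extra):
    bit_in_field = BI & 0b11             # LT, GT, EQ or SO within the Field
    cr_field = augment(BI >> 2, extra)   # CR Field number 0..127
    return cr_field, bit_in_field

print(decode_BI(0b00110, extra=0))       # (1, 2): CR Field 1, EQ bit
```
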
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply:
2039 the Vector loop ends at the first element tested
2040 (the first CR *Field*), after taking
2041 predication into consideration. Thus, also as usual, when a predicate mask is
2042 given, and `BI` marked as scalar, and `sz` is zero, srcstep
2043 skips forward to the first non-zero predicated element, and only that
2044 one element is tested.
2045
2046 In other words, the fact that this is a Branch
2047 Operation (instead of an arithmetic one) does not result, ultimately,
2048 in significant changes as to
2049 how SVP64 is fundamentally applied, except with respect to:
2050
2051 * the unique properties associated with conditionally
2052 changing the Program
2053 Counter (aka "a Branch"), resulting in early-out
2054 opportunities
2055 * CTR-testing
2056
2057 Both are outlined below, in later sections.
2058
2059 ## Horizontal-First and Vertical-First Modes
2060
2061 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
2062 AND) results in early exit: no more updates to CTR occur (if requested);
2063 no branch occurs, and LR is not updated (if requested). Likewise for
2064 non-ALL mode (Great Big Or) on first success early exit also occurs,
2065 however this time with the Branch proceeding. In both cases the testing
2066 of the Vector of CRs should be done in linear sequential order (or in
2067 REMAP re-sequenced order): such that tests that are sequentially beyond
2068 the exit point are *not* carried out. (*Note: it is standard practice in
2069 Programming languages to exit early from conditional tests, however
2070 a little unusual to consider in an ISA that is designed for Parallel
2071 Vector Processing. The reason is to have strictly-defined guaranteed
2072 behaviour*)
2073
2074 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
2075 behaviour. Given that only one element is being tested at a time
2076 in Vertical-First Mode, a test designed to be done on multiple
2077 bits is meaningless.
2078
2079 ## Description and Modes
2080
2081 Predication in both INT and CR modes may be applied to `sv.bc` and other
2082 SVP64 Branch Conditional operations, exactly as they may be applied to
2083 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
2084 operations are not included in condition testing, exactly like all other
2085 SVP64 operations, *including* side-effects such as potentially updating
2086 LR or CTR, which will also be skipped. There is *one* exception here,
2087 which is when
2088 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
2089 predicate mask bit is also zero:
2090 under these special circumstances CTR will also decrement.
2091
2092 When `sz` is non-zero, this normally requests insertion of a zero
2093 in place of the input data, when the relevant predicate mask bit is zero.
2094 This would mean that a zero is inserted in place of `CR[BI+32]` for
2095 testing against `BO`, which may not be desirable in all circumstances.
2096 Therefore, an extra field is provided `SNZ`, which, if set, will insert
2097 a **one** in place of a masked-out element, instead of a zero.
2098
2099 (*Note: Both options are provided because it is useful to deliberately
2100 cause the Branch-Conditional Vector testing to fail at a specific point,
2101 controlled by the Predicate mask. This is particularly useful in `VLSET`
2102 mode, which will truncate SVSTATE.VL at the point of the first failed
2103 test.*)
2104
Normally, CTR mode will decrement once per Condition Test, with the
result that, under normal circumstances, CTR reduces by up to VL in Horizontal-First
Mode. Just as v3.0B Branch-Conditional saves at
2108 least one instruction on tight inner loops through auto-decrementation
2109 of CTR, likewise it is also possible to save instruction count for
2110 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
2111 in circumstances where there is conditional interaction between the
2112 element computation and testing, and the continuation (or otherwise)
2113 of a given loop. The potential combinations of interactions is why CTR
2114 testing options have been added.
2115
2116 Also, the unconditional bit `BO[0]` is still relevant when Predication
2117 is applied to the Branch because in `ALL` mode all nonmasked bits have
2118 to be tested, and when `sz=0` skipping occurs.
2119 Even when VLSET mode is not used, CTR
2120 may still be decremented by the total number of nonmasked elements,
2121 acting in effect as either a popcount or cntlz depending on which
2122 mode bits are set.
2123 In short, Vectorised Branch becomes an extremely powerful tool.
2124
2125 **Micro-Architectural Implementation Note**: *when implemented on
2126 top of a Multi-Issue Out-of-Order Engine it is possible to pass
2127 a copy of the predicate and the prerequisite CR Fields to all
2128 Branch Units, as well as the current value of CTR at the time of
2129 multi-issue, and for each Branch Unit to compute how many times
2130 CTR would be subtracted, in a fully-deterministic and parallel
2131 fashion. A SIMD-based Branch Unit, receiving and processing
2132 multiple CR Fields covered by multiple predicate bits, would
2133 do the exact same thing. Obviously, however, if CTR is modified
2134 within any given loop (mtctr) the behaviour of CTR is no longer
2135 deterministic.*
2136
2137 ### Link Register Update
2138
2139 For a Scalar Branch, unconditional updating of the Link Register
2140 LR is useful and practical. However, if a loop of CR Fields is
2141 tested, unconditional updating of LR becomes problematic.
2142
For example when using `bclrl` with `LRu=0,LK=1` in Horizontal-First Mode,
2144 LR's value will be unconditionally overwritten after the first element,
2145 such that for execution (testing) of the second element, LR
2146 has the value `CIA+8`. This is covered in the `bclrl` example, in
2147 a later section.
2148
2149 The addition of a LRu bit modifies behaviour in conjunction
2150 with LK, as follows:
2151
2152 * `sv.bc` When LRu=0,LK=0, Link Register is not updated
2153 * `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
2154 * `sv.bcl/lru` When LRu=1,LK=1, Link Register will
2155 only be updated if the Branch Condition fails.
2156 * `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
2157 the Branch Condition succeeds.
2158
2159 This avoids
2160 destruction of LR during loops (particularly Vertical-First
2161 ones).
2162
2163 **SVLR and SVSTATE**
2164
2165 For precisely the reasons why `LK=1` was added originally to the Power
2166 ISA, with SVSTATE being a peer of the Program Counter it becomes
2167 necessary to also add an SVLR (SVSTATE Link Register)
2168 and corresponding control bits `SL` and `SLu`.
2169
2170 ### CTR-test
2171
2172 Where a standard Scalar v3.0B branch unconditionally decrements
2173 CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
2174 which allows CTR to be used for many more types of Vector loops
2175 constructs.
2176
2177 CTR-test mode and CTi interaction is as follows: note that
2178 `BO[2]` is still required to be clear for CTR decrements to be
2179 considered, exactly as is the case in Scalar Power ISA v3.0B
2180
2181 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
2182 if `BO[2]` is zero. Masked-out elements when `sz=0` are
2183 skipped (i.e. CTR is *not* decremented when the predicate
2184 bit is zero and `sz=0`).
2185 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
2186 if `BO[2]` is zero and a masked-out element is skipped
2187 (`sz=0` and predicate bit is zero). This one special case is the
2188 **opposite** of other combinations, as well as being
completely different from normal SVP64 `sz=0` behaviour.
2190 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
2191 if `BO[2]` is zero and the Condition Test succeeds.
2192 Masked-out elements when `sz=0` are skipped (including
2193 not decrementing CTR)
2194 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
2195 if `BO[2]` is zero and the Condition Test *fails*.
2196 Masked-out elements when `sz=0` are skipped (including
2197 not decrementing CTR)
2198
2199 `CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
2200 only time in the entirety of SVP64 that has side-effects when
2201 a predicate mask bit is clear. **All** other SVP64 operations
2202 entirely skip an element when sz=0 and a predicate mask bit is zero.
2203 It is also critical to emphasise that in this unusual mode,
2204 no other side-effects occur: **only** CTR is decremented, i.e. the
2205 rest of the Branch operation is skipped.
2206
2207 ### VLSET Mode
2208
2209 VLSET Mode truncates the Vector Length so that subsequent instructions
2210 operate on a reduced Vector Length. This is similar to
2211 Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
2212 truncation occurs at the Branch decision-point.
2213
2214 Interestingly, due to the side-effects of `VLSET` mode
2215 it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
2217 after the branch. Truncation of VL would thus conditionally occur yet control
2218 flow alteration would not.
2219
2220 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
2221 is designed to be used for explicit looping, where an explicit call to
2222 `svstep` is required to move both srcstep and dststep on to
2223 the next element, until VL (or other condition) is reached.
2224 Vertical-First Looping is expected (required) to terminate if the end
2225 of the Vector, VL, is reached. If however that loop is terminated early
2226 because VL is truncated, VLSET with Vertical-First becomes meaningless.
2227 Resolving this would require two branches: one Conditional, the other
2228 branching unconditionally to create the loop, where the Conditional
2229 one jumps over it.
2230
2231 Therefore, with `VSb`, the option to decide whether truncation should occur if the
2232 branch succeeds *or* if the branch condition fails allows for the flexibility
2233 required. This allows a Vertical-First Branch to *either* be used as
2234 a branch-back (loop) *or* as part of a conditional exit or function
2235 call from *inside* a loop, and for VLSET to be integrated into both
2236 types of decision-making.
2237
2238 In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
2239 place if success conditions are met, but on exit from that loop
2240 (branch condition fails), VL will be truncated. This is extremely
2241 useful.
2242
2243 `VLSET` mode with Horizontal-First when `VSb=0` is still
2244 useful, because it can be used to truncate VL to the first predicated
2245 (non-masked-out) element.
2246
2247 The truncation point for VL, when VLi is clear, must not include skipped
2248 elements that preceded the current element being tested.
2249 Example: `sz=0, VLi=0, predicate mask = 0b110010` and the Condition
2250 Register failure point is at CR Field element 4.
2251
2252 * Testing at element 0 is skipped because its predicate bit is zero
2253 * Testing at element 1 passed
2254 * Testing elements 2 and 3 are skipped because their
2255 respective predicate mask bits are zero
2256 * Testing element 4 fails therefore VL is truncated to **2**
2257 not 4 due to elements 2 and 3 being skipped.
2258
2259 If `sz=1` in the above example *then* VL would have been set to 4 because
2260 in non-zeroing mode the zero'd elements are still effectively part of the
2261 Vector (with their respective elements set to `SNZ`)
2262
2263 If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
2264 of the element actually being tested.
2265
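The worked example above can be checked with a few lines of illustrative Python (a sketch only; element 0 is the least significant bit of the predicate mask):

```
# Illustrative model of the VL truncation point: mask = 0b110010
# (elements 1, 4 and 5 active), with the test failing at element 4.
def vlset_truncate(mask_bits, fail_at, sz, VLi):
    last_ok = -1                        # highest element index kept inside VL
    for i, active in enumerate(mask_bits):
        if not active and not sz:
            continue                    # sz=0: skipped, never counted
        if i == fail_at:
            return fail_at + 1 if VLi else last_ok + 1
        last_ok = i                     # element tested (assumed to pass here)
    return len(mask_bits)

mask = [0, 1, 0, 0, 1, 1]               # element 0 first
print(vlset_truncate(mask, fail_at=4, sz=0, VLi=0))  # 2
print(vlset_truncate(mask, fail_at=4, sz=1, VLi=0))  # 4
print(vlset_truncate(mask, fail_at=4, sz=0, VLi=1))  # 5
```
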
2266 ### VLSET and CTR-test combined
2267
2268 If both CTR-test and VLSET Modes are requested, it's important to
2269 observe the correct order. What occurs depends on whether VLi
2270 is enabled, because VLi affects the length, VL.
2271
2272 If VLi (VL truncate inclusive) is set:
2273
2274 1. compute the test including whether CTR triggers
2275 2. (optionally) decrement CTR
2276 3. (optionally) truncate VL (VSb inverts the decision)
2277 4. decide (based on step 1) whether to terminate looping
2278 (including not executing step 5)
2279 5. decide whether to branch.
2280
2281 If VLi is clear, then when a test fails that element
2282 and any following it
2283 should **not** be considered part of the Vector. Consequently:
2284
2285 1. compute the branch test including whether CTR triggers
2286 2. if the test fails against VSb, truncate VL to the *previous*
2287 element, and terminate looping. No further steps executed.
2288 3. (optionally) decrement CTR
2289 4. decide whether to branch.
2290
2291 ## Boolean Logic combinations
2292
2293 In a Scalar ISA, Branch-Conditional testing even of vector
2294 results may be performed through inversion of tests. NOR of
2295 all tests may be performed by inversion of the scalar condition
2296 and branching *out* from the scalar loop around elements,
2297 using scalar operations.
2298
2299 In a parallel (Vector) ISA it is the ISA itself which must perform
2300 the prerequisite logic manipulation.
Thus for SVP64 there is an extraordinary number of necessary combinations
2302 which provide completely different and useful behaviour.
2303 Available options to combine:
2304
2305 * `BO[0]` to make an unconditional branch would seem irrelevant if
2306 it were not for predication and for side-effects (CTR Mode
2307 for example)
2308 * Enabling CTR-test Mode and setting `BO[2]` can still result in the
2309 Branch
2310 taking place, not because the Condition Test itself failed, but
2311 because CTR reached zero **because**, as required by CTR-test mode,
2312 CTR was decremented as a **result** of Condition Tests failing.
2313 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
2314 * `R30` and `~R30` and other predicate mask options including CR and
2315 inverted CR bit testing
2316 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
2317 predicate bits
2318 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
2319 `OR` of all tests, respectively.
2320 * Predicate Mask bits, which combine in effect with the CR being
2321 tested.
2322 * Inversion of Predicate Masks (`~r3` instead of `r3`, or using
2323 `NE` rather than `EQ`) which results in an additional
2324 level of possible ANDing, ORing etc. that would otherwise
2325 need explicit instructions.
2326
2327 The most obviously useful combinations here are to set `BO[1]` to zero
2328 in order to turn `ALL` into Great-Big-NAND and `ANY` into
2329 Great-Big-NOR. Other Mode bits which perform behavioural inversion then
2330 have to work round the fact that the Condition Testing is NOR or NAND.
2331 The alternative to not having additional behavioural inversion
2332 (`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
2333 branch directly after the first, which the first branch jumps over.
2334 This contrivance is avoided by the behavioural inversion bits.
2335
2336 ## Pseudocode and examples
2337
2338 Please see the SVP64 appendix regarding CR bit ordering and for
2339 the definition of `CR{n}`
2340
2341 For comparative purposes this is a copy of the v3.0B `bc` pseudocode
2342
2343 ```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
2353 ```
2354
2355 Simplified pseudocode including LRu and CTR skipping, which illustrates
2356 clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
2357 v3.0B Scalar Branches. The key areas where differences occur are
2358 the inclusion of predication (which can still be used when VL=1), in
2359 when and why CTR is decremented (CTRtest Mode) and whether LR is
2360 updated (which is unconditional in v3.0B when LK=1, and conditional
2361 in SVP64 when LRu=1).
2362
2363 Inline comments highlight the fact that the Scalar Branch behaviour
2364 and pseudocode is still clearly visible and embedded within the
2365 Vectorised variant:
2366
2367 ```
2368 if (mode_is_64bit) then M <- 0
2369 else M <- 32
2370 # the bit of CR to test, if the predicate bit is zero,
2371 # is overridden
2372 testbit = CR[BI+32]
2373 if ¬predicate_bit then testbit = SVRMmode.SNZ
2374 # otherwise apart from the override ctr_ok and cond_ok
2375 # are exactly the same
2376 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2377 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2378 if ¬predicate_bit & ¬SVRMmode.sz then
2379 # this is entirely new: CTR-test mode still decrements CTR
2380 # even when predicate-bits are zero
2381 if ¬BO[2] & CTRtest & ¬CTi then
2382 CTR = CTR - 1
2383 # instruction finishes here
2384 else
2385 # usual BO[2] CTR-mode now under CTR-test mode as well
2386 if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
2387 # new VLset mode, conditional test truncates VL
2388 if VLSET and VSb = (cond_ok & ctr_ok) then
2389 if SVRMmode.VLI then SVSTATE.VL = srcstep+1
2390 else SVSTATE.VL = srcstep
2391 # usual LR is now conditional, but also joined by SVLR
2392 lr_ok <- LK
2393 svlr_ok <- SVRMmode.SL
2394 if ctr_ok & cond_ok then
2395 if AA then NIA <-iea EXTS(BD || 0b00)
2396 else NIA <-iea CIA + EXTS(BD || 0b00)
2397 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2398 if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
2399 if lr_ok then LR <-iea CIA + 4
2400 if svlr_ok then SVLR <- SVSTATE
2401 ```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but functionally identical to the above. The lack of
obviousness is down to the early-exit opportunities.
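
A tiny Python sketch of the early-exit rule used in that loop: in
`ALL` (AND-reduction) mode the first failing element already decides
the result, and in `ANY` (OR-reduction) mode the first passing
element does, so the loop may stop early without changing the
outcome.

```
def reduce_with_early_exit(element_tests, ALL):
    cond_ok = ALL          # identity element: True for AND, False for OR
    for t in element_tests:
        cond_ok = (cond_ok and t) if ALL else (cond_ok or t)
        if ALL != t:       # mirrors: if SVRMmode.ALL != (el_cond_ok & ctr_ok)
            break          # the outcome can no longer change
    return cond_ok
```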

Effective pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
svlr_ok <- SVRMmode.SL
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
if lr_ok then LR <-iea CIA + 4
if svlr_ok then SVLR <- SVSTATE
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI then
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

### Example Shader code

```
// assume f(), g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any elements of a are greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5               # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2               # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT              # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```

### LRu example

This example shows why LRu would be useful in a loop. Imagine the
following C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances exiting from the loop is not only
based on CTR: it has also become conditional on a CR result.
Thus it is desirable that NIA *and* LR only be modified
if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```

The latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason why should be clear from this being a Vector loop:
unconditional destruction of LR when LK=1 makes `sv.bclrl`
ineffective, because the intention going into the loop is
that the branch should be to the copy of LR set at the *start*
of the loop, not halfway through it.
However, if the change to LR only occurs when
the branch is taken, it becomes a useful instruction.
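
An informal Python sketch of that difference (illustrative only,
covering one element's worth of the loop above): with `LK=0` and
`LRu=1`, LR is only overwritten on the element whose branch is
actually taken, whereas with plain `LK=1` (v3.0B behaviour) LR is
clobbered on every element, destroying the copy of LR saved at loop
entry.

```
def sv_bclrl_element(LR, CIA, LK, LRu, branch_taken):
    lr_ok = LK
    NIA = None
    if branch_taken:
        NIA = LR              # branch to the previously-saved LR
        if LRu:
            lr_ok = not lr_ok # LRu inverts the LR-update decision
    if lr_ok:
        LR = CIA + 4          # LR update (possibly clobbering)
    return NIA, LR
```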

The following pseudocode should **not** be implemented because
it violates the fundamental principle of SVP64, which is that
SVP64 looping is a thin wrapper around Scalar Instructions.
The pseudocode below is more an actual Vector ISA Branch and
as such is not at all appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```