1 # RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
2
3 Credits and acknowledgements:
4
5 * Luke Leighton
6 * Jacob Lifshay
7 * Hendrik Boom
8 * Richard Wilbur
9 * Alexandre Oliva
10 * Cesar Strauss
11 * NLnet Foundation, for funding
12 * OpenPOWER Foundation
13 * Paul Mackerras
14 * Toshaan Bharvani
15 * IBM for the Power ISA itself
16
17 Links:
18
19 * <https://bugs.libre-soc.org/show_bug.cgi?id=1045>
20
21 # Introduction
22
23 Simple-V is a type of Vectorisation best described as a "Prefix Loop Subsystem"
24 similar to the Z80 `LDIR` instruction and to the x86 `REP` Prefix instruction.
25 More advanced features are similar to the Z80 `CPIR` instruction. If viewed
26 as an actual Vector ISA it introduces over 1.5 million 64-bit Vector instructions.
27 SVP64, the instruction format, is therefore best viewed as an orthogonal
28 RISC-style "Prefixing" subsystem instead.
29
30 Except where explicitly stated all bit numbers remain as in the rest of the Power ISA:
31 in MSB0 form (the bits are numbered from 0 at the MSB on the left
32 and counting up as you move rightwards to the LSB end). All bit ranges are inclusive
33 (so `4:6` means bits 4, 5, and 6, in MSB0 order). **All register numbering and
34 element numbering however is LSB0 ordering** which is a different convention from that used
35 elsewhere in the Power ISA.
36
37 The SVP64 prefix always comes before the suffix in PC order and must be considered
38 an independent "Defined word" that augments the behaviour of the following instruction,
39 but does **not** change the actual Decoding of that following instruction.
40 **All prefixed instructions retain their non-prefixed encoding and definition**.
41
42 *Architectural Resource Allocation note: it is prohibited to accept RFCs which
43 fundamentally violate this hard requirement. Under no circumstances must the
44 Suffix space have an alternate instruction encoding allocated within SVP64 that is
45 entirely different from the non-prefixed Defined Word. Hardware Implementors
46 critically rely on this inviolate guarantee to implement High-Performance Multi-Issue
47 micro-architectures that can sustain 100% throughput*
48
49 | 0:5 | 6:31 | 32:63 |
50 |--------|--------------|--------------|
51 | EXT09 | v3.1 Prefix | v3.0/1 Suffix |
52
53 Subset implementations in hardware are permitted, as long as certain
54 rules are followed, allowing for full soft-emulation including future
55 revisions. Compliancy Subsets exist to ensure minimum levels of binary
56 interoperability expectations within certain environments.
57
58 ## Register files, elements, and Element-width Overrides
59
60 In the Upper Compliancy Levels the GPR and FPR Register files are expanded
61 from 32 to 128 entries, and the number of CR Fields is expanded from CR0-CR7 to CR0-CR127.
62
63 Memory access remains exactly the same: the effects of `MSR.LE` remain exactly the same,
64 applying as they already do **only** to the byte-order of Load and Store
65 memory-register operations, and having nothing to do with the
66 ordering of the contents of register files or register-register operations.
67
68 Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered and
69 sequentially numbered, element offset numbering is naturally
70 **LSB0-sequentially-incrementing from zero, not MSB0-incrementing.** Expressed exclusively in
71 MSB0-numbering, SVP64 is unnecessarily complex to understand: the required
72 subtractions from 63, 31, 15 and 7 unfortunately become a hostile minefield.
73 Therefore for the purposes of this section the more natural
74 **LSB0 numbering is assumed** and it is up to the reader to translate to MSB0 numbering.
75
76 The Canonical specification for how element-sequential numbering and element-width
77 overrides are defined is expressed in the following C structure, assuming a Little-Endian
78 system, and naturally using LSB0 numbering everywhere because the ANSI C specification
79 is inherently LSB0:
80
```
#include <stdint.h>

#pragma pack
typedef union {
    uint8_t  b[8]; // elwidth 8
    uint16_t s[4]; // elwidth 16
    uint32_t i[2]; // elwidth 32
    uint64_t l[1]; // elwidth 64
    uint8_t  actual_bytes[8];
} el_reg_t;

// the regfile is contiguous: element numbering deliberately carries on
// into subsequent registers when the element index exceeds one entry
el_reg_t int_regfile[128];

void get_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: el->l[0] = int_regfile[gpr].l[element]; break;
        case 32: el->i[0] = int_regfile[gpr].i[element]; break;
        case 16: el->s[0] = int_regfile[gpr].s[element]; break;
        case 8 : el->b[0] = int_regfile[gpr].b[element]; break;
    }
}
void set_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: int_regfile[gpr].l[element] = el->l[0]; break;
        case 32: int_regfile[gpr].i[element] = el->i[0]; break;
        case 16: int_regfile[gpr].s[element] = el->s[0]; break;
        case 8 : int_regfile[gpr].b[element] = el->b[0]; break;
    }
}
```
110
111 Example Vector-looped add operation implementation when elwidths are 64-bit:
112
```
# add RT, RA, RB using the "uint64_t" union member, "l"
for i in range(VL):
    int_regfile[RT].l[i] = int_regfile[RA].l[i] + int_regfile[RB].l[i]
```
118
119 However if elwidth overrides are set to 16 for both source and destination:
120
```
# add RT, RA, RB using the "uint16_t" union member, "s"
for i in range(VL):
    int_regfile[RT].s[i] = int_regfile[RA].s[i] + int_regfile[RB].s[i]
```
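
As an illustration only (not part of the specification), the following short sketch shows where a given element lands in the underlying 64-bit registers under the union model above:

```
# Illustrative only: compute where element 'n' of a GPR vector lives,
# given the union model above (contiguous regfile, Little-Endian canonical layout).
def element_location(gpr, n, elwidth_bits):
    byte_index = n * (elwidth_bits // 8)    # byte offset from the start of gpr
    underlying_reg = gpr + byte_index // 8  # elements spill into the next register
    byte_offset = byte_index % 8            # position within that 64-bit register
    return underlying_reg, byte_offset

# example: with elwidth=16, element 5 of a vector starting at r4 sits in r5
# (bytes 2 and 3), because four 16-bit elements fill r4 first.
print(element_location(4, 5, 16))  # -> (5, 2)
```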
126
127 Hardware Architectural note: to avoid a Read-Modify-Write at the register file it is
128 strongly recommended to implement byte-level write-enable lines exactly as has been
129 implemented in DRAM ICs for many decades. Additionally the predicate mask bit is advised
130 to be associated with the element operation and alongside the result ultimately
131 passed to the register file.
132 When element-width is set to 64-bit the relevant predicate mask bit may be repeated
133 eight times and pull all eight write-port byte-level lines HIGH. Clearly when element-width
134 is set to 8-bit the relevant predicate mask bit corresponds directly with one single
135 byte-level write-enable line. It is up to the Hardware Architect to then amortise (merge)
136 elements together into both PredicatedSIMD Pipelines as well as simultaneous non-overlapping
137 Register File writes, to achieve High Performance designs.
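
As an illustration of the byte-level write-enable scheme described above (a sketch only, not a normative requirement, assuming LSB0 byte numbering within the 64-bit write port):

```
# Sketch: expand one predicate mask bit into per-byte write-enable lines
# for a 64-bit-wide register file write port.
def byte_write_enables(pred_bit, elwidth_bits, byte_offset=0):
    nbytes = elwidth_bits // 8                # 8, 4, 2 or 1 bytes per element
    lanes = (1 << nbytes) - 1 if pred_bit else 0
    return lanes << byte_offset               # e.g. elwidth=64: 0b11111111

print(bin(byte_write_enables(1, 64)))                 # 0b11111111: all eight lanes HIGH
print(bin(byte_write_enables(1, 8, byte_offset=3)))   # 0b1000: one single lane
```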
138
139 ## SVP64 encoding features
140
141 A number of features need to be compacted into a very small space of only 24 bits:
142
143 * Independent per-register Scalar/Vector tagging and range extension on every register
144 * Element width overrides on both source and destination
145 * Predication on both source and destination
146 * Two different sources of predication: INT and CR Fields
147 * SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and
148 predicate-result mode.
149
150 Different classes of operations require different formats. The following sections cover
151 the common formats and the four separate modes: CR operations (crops),
152 Arithmetic/Logical (termed "normal"), Load/Store and Branch-Conditional.
153
154 ## Definition of Reserved in this spec.
155
156 For the new fields added in SVP64, instructions that have any of their
157 fields set to a reserved value must cause an illegal instruction trap,
158 to allow emulation of future instruction sets, or for subsets of SVP64
159 to be implemented in hardware and the rest emulated.
160 This includes SVP64 SPRs: reading or writing values which are not
161 supported in hardware must also raise illegal instruction traps
162 in order to allow emulation.
163 Unless otherwise stated, reserved values are always all zeros.
164
165 This is unlike OpenPower ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero. Where the standard Power ISA definition
166 is intended the red keyword `RESERVED` is used.
167
168 ## Definition of "UnVectoriseable"
169
170 Any operation that inherently makes no sense if repeated is termed "UnVectoriseable"
171 or "UnVectorised". Examples include `sc` or `sync` which have no registers. `mtmsr` is
172 also classed as UnVectoriseable because there is only one `MSR`.
173
174 ## Scalar Identity Behaviour
175
176 SVP64 is designed so that when the prefix is all zeros, and
177 VL=1, no effect or
178 influence occurs (no augmentation) such that all standard Power ISA
179 v3.0/v3.1 instructions covered by the prefix are "unaltered". This is termed `scalar identity behaviour` (based on the mathematical definition for "identity", as in, "identity matrix" or better "identity transformation").
180
181 Note that this is completely different from when VL=0. VL=0 turns all operations under its influence into `nops` (regardless of the prefix)
182 whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction (an "identity transformation").
183
184 ## Register Naming and size
185
186 As previously mentioned SV Registers are simply the INT, FP and CR register files extended
187 linearly to larger sizes; SV Vectorisation iterates sequentially through these registers
188 (LSB0 sequential ordering from 0 to VL-1).
189
190 Where the integer regfile in standard scalar
191 Power ISA v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127.
192 Likewise FP registers are extended to 128 (fp0 to fp127), and CR Fields
193 are
194 extended to 128 entries, CR0 thru CR127.
195
196 The names of the registers therefore reflect a simple linear extension
197 of the Power ISA v3.0B / v3.1B register naming, and in hardware this
198 would be reflected by a linear increase in the size of the underlying
199 SRAM used for the regfiles.
200
201 Note: when an EXTRA field (defined below) is zero, SV is deliberately designed
202 so that the register fields are identical to as if SV was not in effect
203 i.e. under these circumstances (EXTRA=0) the register field names RA,
204 RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. This is part of
205 `scalar identity behaviour` described above.
206
207 ## Future expansion.
208
209 With the way that EXTRA fields are defined and applied to register fields,
210 future versions of SV may involve 256 or greater registers. Backwards binary compatibility may be achieved with a PCR bit (Program Compatibility Register). Further discussion is out of scope for this version of SVP64.
211
212 --------
213
214 \newpage{}
215
216 # Remapped Encoding (`RM[0:23]`)
217
218 To allow relatively easy remapping of which portions of the Prefix Opcode
219 Map are used for SVP64 without needing to rewrite a large portion of the
220 SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to
221 a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]`
222 at the LSB.
223
224 The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
225 is defined in the Prefix Fields section.
226
227 ## Prefix Fields
228
229 TODO incorporate EXT09
230
231 To "activate" svp64 (in a way that does not conflict with v3.1B 64 bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
232 (see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
233 This is achieved by setting bits 7 and 9 to 1:
234
235 | Name | Bits | Value | Description |
236 |------------|---------|-------|--------------------------------|
237 | EXT01 | `0:5` | `1` | Indicates Prefixed 64-bit |
238 | `RM[0]` | `6` | | Bit 0 of Remapped Encoding |
239 | SVP64_7 | `7` | `1` | Indicates this is SVP64 |
240 | `RM[1]` | `8` | | Bit 1 of Remapped Encoding |
241 | SVP64_9 | `9` | `1` | Indicates this is SVP64 |
242 | `RM[2:23]` | `10:31` | | Bits 2-23 of Remapped Encoding |
243
244 Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
245 are constructed:
246
247 | 0:5 | 6 | 7 | 8 | 9 | 10:31 |
248 |--------|-------|---|-------|---|----------|
249 | EXT01 | RM | 1 | RM | 1 | RM |
250 | 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
251
252 Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1
253 instruction. That instruction becomes "prefixed" with the SVP context: the
254 Remapped Encoding field (RM).
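
As an illustration (not normative), the 32-bit prefix word may be assembled from the 24-bit Remapped Encoding as follows, using conventional LSB0 shifts on a plain 32-bit integer (MSB0 bit 0 corresponds to LSB0 bit 31):

```
# Illustrative sketch: build the 32-bit SVP64 prefix from RM[0:23] (MSB0 names),
# using LSB0 arithmetic on a plain 32-bit integer.
def svp64_prefix(rm):                  # rm: 24-bit Remapped Encoding, RM[0] is its MSB
    assert 0 <= rm < (1 << 24)
    rm0    = (rm >> 23) & 1            # RM[0]    -> prefix bit 6 (MSB0)
    rm1    = (rm >> 22) & 1            # RM[1]    -> prefix bit 8 (MSB0)
    rm2_23 = rm & ((1 << 22) - 1)      # RM[2:23] -> prefix bits 10:31 (MSB0)
    word  = 0b000001 << 26             # EXT01, MSB0 bits 0:5
    word |= rm0 << 25                  # MSB0 bit 6
    word |= 1 << 24                    # SVP64_7 = 1, MSB0 bit 7
    word |= rm1 << 23                  # MSB0 bit 8
    word |= 1 << 22                    # SVP64_9 = 1, MSB0 bit 9
    word |= rm2_23                     # MSB0 bits 10:31
    return word

print(hex(svp64_prefix(0)))  # 0x5400000: all-zeros RM ("scalar identity" prefix)
```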
255
256 It is important to note that unlike v3.1 64-bit prefixed instructions
257 there is insufficient space in `RM` to provide identification of
258 any SVP64 Fields without first partially decoding the
259 32-bit suffix. Similar to the "Forms" (X-Form, D-Form) the
260 `RM` format is individually associated with every instruction.
261
262 Extreme caution and care must therefore be taken
263 when extending SVP64 in future, to not create unnecessary relationships
264 between prefix and suffix that could complicate decoding, adding latency.
265
266 # Common RM fields
267
268 The following fields are common to all Remapped Encodings:
269
270 | Field Name | Field bits | Description |
271 |------------|------------|----------------------------------------|
272 | MASKMODE | `0` | Execution (predication) Mask Kind |
273 | MASK | `1:3` | Execution Mask |
274 | SUBVL | `8:9` | Sub-vector length |
275
276 The following fields are optional or encoded differently depending
277 on context after decoding of the Scalar suffix:
278
279 | Field Name | Field bits | Description |
280 |------------|------------|----------------------------------------|
281 | ELWIDTH | `4:5` | Element Width |
282 | ELWIDTH_SRC | `6:7` | Element Width for Source |
283 | EXTRA | `10:18` | Register Extra encoding |
284 | MODE | `19:23` | changes Vector behaviour |
285
286 * MODE changes the behaviour of the SV operation (result saturation, mapreduce)
287 * SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work
288 * ELWIDTH and ELWIDTH_SRC override the instruction's destination and source operand widths
289 * MASK (and MASK_SRC) and MASKMODE provide predication (two types of sources: scalar INT and Vector CR).
290 * Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix.
291
292 Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag.
293
294 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
295
296 # Mode
297
298 Mode is an augmentation of SV behaviour. Different types of
299 instructions have different needs, similar to how the Power ISA
300 v3.1 64-bit prefix 8LS and MTRR formats apply to different
301 instruction types. Modes include Reduction, Iteration, arithmetic
302 saturation, and Fail-First. More specific details are given in each
303 section and in the [[svp64/appendix]].
304
305 * For condition register operations see [[sv/cr_ops]]
306 * For LD/ST Modes, see [[sv/ldst]].
307 * For Branch modes, see [[sv/branches]]
308 * For arithmetic and logical, see [[sv/normal]]
309
310 # ELWIDTH Encoding
311
312 Default behaviour is set to 0b00 so that zeros follow the convention of
313 `scalar identity behaviour`. In this case it means that elwidth overrides
314 are not applicable. Thus if a 32-bit instruction operates on 32-bit values,
315 `elwidth=0b00` specifies that this behaviour is unmodified. Likewise
316 when a processor is switched from 64 bit to 32 bit mode, `elwidth=0b00`
317 states that, again, the behaviour is not to be modified.
318
319 Only when elwidth is nonzero is the element width overridden to the
320 explicitly required value.
321
322 ## Elwidth for Integers:
323
324 | Value | Mnemonic | Description |
325 |-------|----------------|------------------------------------|
326 | 00 | DEFAULT | default behaviour for operation |
327 | 01 | `ELWIDTH=w` | Word: 32-bit integer |
328 | 10 | `ELWIDTH=h` | Halfword: 16-bit integer |
329 | 11 | `ELWIDTH=b` | Byte: 8-bit integer |
330
331 This encoding is chosen such that the element width in bits may be computed as
332 `8<<(3-ew)` (and the width in bytes as `1<<(3-ew)`).
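
A small illustrative sketch of that relationship (`ew=0b00` is "default", which for integer operations works out to the full 64-bit register width):

```
# Illustrative: element width in bits for the 2-bit integer ELWIDTH field
def int_elwidth_bits(ew):
    return 8 << (3 - ew)          # ew=0 -> 64, 1 -> 32, 2 -> 16, 3 -> 8

assert [int_elwidth_bits(ew) for ew in range(4)] == [64, 32, 16, 8]
```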
333
334 ## Elwidth for FP Registers:
335
336 | Value | Mnemonic | Description |
337 |-------|----------------|------------------------------------|
338 | 00 | DEFAULT | default behaviour for FP operation |
339 | 01 | `ELWIDTH=f32` | 32-bit IEEE 754 Single floating-point |
340 | 10 | `ELWIDTH=f16` | 16-bit IEEE 754 Half floating-point |
341 | 11 | `ELWIDTH=bf16` | Reserved for `bf16` |
342
343 Note:
344 [`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
345 is reserved for a future implementation of SV
346
347 Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall
348 perform its operation at **half** the ELWIDTH, with the result then padded back out
349 to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
350 clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
351 then padded back out to fit in IEEE754 FP64, exactly as for Scalar
352 v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16
353 or ELWIDTH=bf16 is reserved and must raise an illegal instruction
354 (IEEE754 FP8 or BF8 are not defined).
355
356 ## Elwidth for CRs:
357
358 Element-width overrides for CR Fields have no meaning. The bits
359 are therefore used for other purposes, or when Rc=1, the Elwidth
360 applies to the result being tested (a GPR or FPR), but not to the
361 Vector of CR Fields.
362
363 # SUBVL Encoding
364
365 The default for SUBVL is 1 and its encoding is 0b00, to indicate that
366 SUBVL is effectively disabled (a SUBVL for-loop of only one element). This
367 lines up in combination with all other "default is all zeros" behaviour.
368
369 | Value | Mnemonic | Subvec | Description |
370 |-------|-----------|---------|------------------------|
371 | 00 | `SUBVL=1` | single | Sub-vector length of 1 |
372 | 01 | `SUBVL=2` | vec2 | Sub-vector length of 2 |
373 | 10 | `SUBVL=3` | vec3 | Sub-vector length of 3 |
374 | 11 | `SUBVL=4` | vec4 | Sub-vector length of 4 |
375
376 The SUBVL encoding value may be thought of as an inclusive range of a
377 sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
378 this may be considered to be elements 0b00 to 0b01 inclusive.
379
380 # MASK/MASK_SRC & MASKMODE Encoding
381
382 TODO: rename MASK_KIND to MASKMODE
383
384 One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
385 types may not be mixed.
386
387 Special note: to disable predication this field must
388 be set to zero in combination with Integer Predication also being set
389 to 0b000. This has the effect of enabling "all 1s" in the predicate
390 mask, which is equivalent to "not having any predication at all"
391 and consequently, in combination with all other default zeros, fully
392 disables SV (`scalar identity behaviour`).
393
394 `MASKMODE` may be set to one of 2 values:
395
396 | Value | Description |
397 |-----------|------------------------------------------------------|
398 | 0 | MASK/MASK_SRC are encoded using Integer Predication |
399 | 1 | MASK/MASK_SRC are encoded using CR-based Predication |
400
401 Integer Twin predication has a second set of 3 bits that uses the same
402 encoding, thus allowing either the same register (r3, r10 or r30) to be used
403 for both src and dest, or different regs (one for src, one for dest).
404
405 Likewise CR based twin predication has a second set of 3 bits, allowing
406 a different test to be applied.
407
408 Note that it is assumed that Predicate Masks (whether INT or CR)
409 are read *before* the operations proceed. In practice (for CR Fields)
410 this creates an unnecessary block on parallelism. Therefore,
411 it is up to the programmer to ensure that the CR fields used as
412 Predicate Masks are not being written to by any parallel Vector Loop.
413 Doing so results in **UNDEFINED** behaviour, according to the definition
414 outlined in the Power ISA v3.0B Specification.
415
416 Hardware Implementations are therefore free and clear to delay reading
417 of individual CR fields until the actual predicated element operation
418 needs to take place, safe in the knowledge that no programmer will
419 have issued a Vector Instruction where previous elements could have
420 overwritten (destroyed) not-yet-executed CR-Predicated element operations.
421
422 ## Integer Predication (MASKMODE=0)
423
424 When the predicate mode bit is zero the 3 bits are interpreted as below.
425 Twin predication has an identical 3 bit field similarly encoded.
426
427 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
428
429 | Value | Mnemonic | Element `i` enabled if: |
430 |-------|----------|------------------------------|
431 | 000 | ALWAYS | predicate effectively all 1s |
432 | 001 | 1 << R3 | `i == R3` |
433 | 010 | R3 | `R3 & (1 << i)` is non-zero |
434 | 011 | ~R3 | `R3 & (1 << i)` is zero |
435 | 100 | R10 | `R10 & (1 << i)` is non-zero |
436 | 101 | ~R10 | `R10 & (1 << i)` is zero |
437 | 110 | R30 | `R30 & (1 << i)` is non-zero |
438 | 111 | ~R30 | `R30 & (1 << i)` is zero |
439
440 r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs.
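
The following sketch (illustrative only: the hypothetical `regs` mapping stands in for reading the named GPRs) evaluates the table above for a given element `i`:

```
# Illustrative: is element i enabled under the 3-bit integer MASK encoding?
def int_pred_enabled(mask, i, regs):
    if mask == 0b000: return True                     # ALWAYS (predicate all 1s)
    if mask == 0b001: return i == regs[3]             # 1 << R3
    reg = {0b01: 3, 0b10: 10, 0b11: 30}[mask >> 1]    # R3, R10 or R30
    bit = (regs[reg] >> i) & 1
    return bool(bit) ^ bool(mask & 1)                 # odd encodings invert (~R3 etc.)

regs = {3: 0b1010, 10: 0, 30: 0}                      # placeholder register values
print([int_pred_enabled(0b010, i, regs) for i in range(4)])  # [False, True, False, True]
```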
441
442 ## CR-based Predication (MASKMODE=1)
443
444 When the predicate mode bit is one the 3 bits are interpreted as below.
445 Twin predication has an identical 3 bit field similarly encoded.
446
447 `MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
448
449 | Value | Mnemonic | Element `i` is enabled if |
450 |-------|----------|--------------------------|
451 | 000 | lt | `CR[offs+i].LT` is set |
452 | 001 | nl/ge | `CR[offs+i].LT` is clear |
453 | 010 | gt | `CR[offs+i].GT` is set |
454 | 011 | ng/le | `CR[offs+i].GT` is clear |
455 | 100 | eq | `CR[offs+i].EQ` is set |
456 | 101 | ne | `CR[offs+i].EQ` is clear |
457 | 110 | so/un | `CR[offs+i].FU` is set |
458 | 111 | ns/nu | `CR[offs+i].FU` is clear |
459
460 CR-based predication. TODO: select alternate CR for twin predication? see
461 [[discussion]]. Overlap of the two CR-based predicates must be taken
462 into account, so the starting point for one of them must be suitably
463 high, or accept that for twin predication VL must not exceed the range
464 where overlap will occur, *or* that they use the same starting point
465 but select different *bits* of the same CRs.
466
467 `offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
468
469 The CR Predicates chosen must start on a boundary that Vectorised
470 CR operations can access cleanly, in full.
471 With EXTRA2 restricting starting points
472 to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and CR Predicate
473 Masks have to be adapted to fit on these boundaries as well.
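
As an illustration of the table above (not the canonical pseudocode), the following sketch returns which CR Field, which bit, and which polarity is tested for element `i`, using `offs` = 32 as defined above:

```
# Illustrative: which CR Field and bit the 3-bit CR MASK encoding tests for element i
CR_BITS = {0b00: "LT", 0b01: "GT", 0b10: "EQ", 0b11: "FU"}

def cr_pred_test(mask, i, offs=32):
    field = offs + i                 # CR Field number holding the predicate bit
    bit   = CR_BITS[mask >> 1]       # 00x: LT, 01x: GT, 10x: EQ, 11x: FU (SO)
    want_set = (mask & 1) == 0       # even encodings test "is set", odd "is clear"
    return field, bit, want_set

print(cr_pred_test(0b101, 3))  # (35, 'EQ', False): element 3 enabled if CR35.EQ is clear
```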
474
475 # Extra Remapped Encoding <a name="extra_remap"> </a>
476
477 Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants. Note that due to the very tight space, the encoding mode is *not* included in the prefix itself. The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*).
478
479 These mappings are part of the SVP64 Specification in exactly the same
480 way as X-Form, D-Form. New Scalar instructions added to the Power ISA
481 will need a corresponding SVP64 Mapping, which can be derived by-rote
482 from examining the Register "Profile" of the instruction.
483
484 There are two categories: Single and Twin Predication.
485 Due to space considerations further subdivision of Single Predication
486 is based on whether the number of src operands is 2 or 3. With only
487 9 bits available some compromises have to be made.
488
489 * `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
490 * `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest)
491 * `RM-2P-1S1D` Twin Predication (src=1, dest=1)
492 * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
493 * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
494
495 ## RM-1P-3S1D
496
497 | Field Name | Field bits | Description |
498 |------------|------------|----------------------------------------|
499 | Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
500 | Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
501 | Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
502 | Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
503 | EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS |
504
505 These are for 3 operand in and either 1 or 2 out instructions.
506 3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
507 such as `maddedu` have an implicit second destination, RS, the
508 selection of which is determined by bit 18.
509
510 ## RM-1P-2S1D
511
512 | Field Name | Field bits | Description |
513 |------------|------------|-------------------------------------------|
514 | Rdest\_EXTRA3 | `10:12` | extends Rdest |
515 | Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1 |
516 | Rsrc2\_EXTRA3 | `16:18`    | extends Rsrc2 |
517
518 These are for 2 operand 1 dest instructions, such as `add RT, RA,
519 RB`. However also included are unusual instructions with an implicit dest
520 that is identical to its src reg, such as `rlwimi`.
521
522 Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would not have sufficient bit fields to allow
523 an alternative destination. With SV however this becomes possible.
524 Therefore, the fact that the dest is implicitly also a src should not
525 mislead: due to the *prefix* they are different SV regs.
526
527 * `rlwimi RA, RS, ...`
528 * Rsrc1_EXTRA3 applies to RS as the first src
529 * Rsrc2_EXTRA3 applies to RA as the second src
530 * Rdest_EXTRA3 applies to RA to create an **independent** dest.
531
532 With the addition of the EXTRA bits, the three registers
533 each may be *independently* made vector or scalar, and be independently
534 augmented to 7 bits in length.
535
536 ## RM-2P-1S1D/2S
537
538 | Field Name | Field bits | Description |
539 |------------|------------|----------------------------|
540 | Rdest_EXTRA3 | `10:12` | extends Rdest |
541 | Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
542 | MASK_SRC | `16:18` | Execution Mask for Source |
543
544 `RM-2P-2S` is for `stw` etc., where both register operands are sources (Rsrc1, Rsrc2).
545
546 ## RM-1P-2S1D
547
548 single-predicate, three registers (2 read, 1 write)
549
550 | Field Name | Field bits | Description |
551 |------------|------------|----------------------------|
552 | Rdest_EXTRA3 | `10:12` | extends Rdest |
553 | Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
554 | Rsrc2_EXTRA3 | `16:18` | extends Rsrc2 |
555
556 ## RM-2P-2S1D/1S2D/3S
557
558 The primary purpose for this encoding is for Twin Predication on LOAD
559 and STORE operations. See [[sv/ldst]] for detailed analysis.
560
561 RM-2P-2S1D:
562
563 | Field Name | Field bits | Description |
564 |------------|------------|----------------------------|
565 | Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
566 | Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
567 | Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
568 | MASK_SRC | `16:18` | Execution Mask for Source |
569
570 Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
571 is in bits 10:11, Rdest1_EXTRA2 in 12:13).
572
573 Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
574
575 Note also that LD with update indexed, which takes 2 src and 2 dest
576 (e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
577 Twin Predication. Therefore these are treated as RM-2P-2S1D and the
578 src spec for RA is also used for the same RA as a dest.
579
580 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
581
582 # R\*\_EXTRA2/3
583
584 EXTRA is the means by which two things are achieved:
585
586 1. Registers are marked as either Vector *or Scalar*
587 2. Register field numbers (limited typically to 5 bit)
588 are extended in range, both for Scalar and Vector.
589
590 The register files are therefore extended:
591
592 * INT is extended from r0-31 to r0-127
593 * FP is extended from fp0-fp31 to fp0-fp127
594 * CR Fields are extended from CR0-7 to CR0-127
595
596 However due to pressure in `RM.EXTRA` not all these registers
597 are accessible by all instructions, particularly those with
598 a large number of operands (`madd`, `isel`).
599
600 In the following tables register numbers are constructed from the
601 standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2
602 or EXTRA3 field from the SV Prefix, determined by the specific
603 RM-xx-yyyy designation for a given instruction.
604 The prefixing is arranged so that
605 interoperability between prefixing and nonprefixing of scalar registers
606 is direct and convenient (when the EXTRA field is all zeros).
607
608 A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs):
609
```
if extra3_mode:
    spec = EXTRA3
elif EXTRA2[0]:          # EXTRA2, Vector bit set
    spec = EXTRA2 << 1   # same as EXTRA3, shifted
else:                    # EXTRA2, Scalar
    spec = EXTRA2
if spec[0]: # vector
    return (RA << 2) | spec[1:2]
else: # scalar
    return (spec[1:2] << 5) | RA
```
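
For illustration only (the pseudocode above and the tables below are definitive), here is a directly-executable rendering with two worked examples:

```
# Illustrative, directly-executable version of the pseudocode above
# (INT/FP registers only; CR Fields are decoded differently, see the appendix).
def decode_extra_reg(RA, extra, extra3_mode):
    if extra3_mode:
        spec = extra                   # 3-bit EXTRA3
    elif extra & 0b10:                 # EXTRA2, Vector bit set
        spec = extra << 1              # same as EXTRA3, shifted
    else:                              # EXTRA2, Scalar
        spec = extra
    if spec & 0b100:                   # Vector
        return "vector", (RA << 2) | (spec & 0b11)
    return "scalar", ((spec & 0b11) << 5) | RA

print(decode_extra_reg(5, 0b110, True))   # ('vector', 22): EXTRA3=0b110, RA=5 -> r22
print(decode_extra_reg(5, 0b01, False))   # ('scalar', 37): EXTRA2=0b01,  RA=5 -> r37
```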
620
621 Future versions may extend to 256 by shifting Vector numbering up.
622 Scalar will not be altered.
623
624 Note that in some cases the range of starting points for Vectors
625 is limited.
626
627 ## INT/FP EXTRA3
628
629 If EXTRA3 is zero, the register maps to
630 "scalar identity" (scalar Power ISA field naming).
631
632 Fields are as follows:
633
634 * Value: R_EXTRA3
635 * Mode: register is tagged as scalar or vector
636 * Range/Inc: the range of registers accessible from this EXTRA
637 encoding, and the "increment" (accessibility). "/4" means
638 that this EXTRA encoding may only give access (starting point)
639 every 4th register.
640 * MSB..LSB: the bit field showing how the register opcode field
641 combines with EXTRA to give (extend) the register number (GPR)
642
643 | Value | Mode | Range/Inc | 6..0 |
644 |-----------|-------|---------------|---------------------|
645 | 000 | Scalar | `r0-r31`/1 | `0b00 RA` |
646 | 001 | Scalar | `r32-r63`/1 | `0b01 RA` |
647 | 010 | Scalar | `r64-r95`/1 | `0b10 RA` |
648 | 011 | Scalar | `r96-r127`/1 | `0b11 RA` |
649 | 100 | Vector | `r0-r124`/4 | `RA 0b00` |
650 | 101 | Vector | `r1-r125`/4 | `RA 0b01` |
651 | 110 | Vector | `r2-r126`/4 | `RA 0b10` |
652 | 111 | Vector | `r3-r127`/4 | `RA 0b11` |
653
654 ## INT/FP EXTRA2
655
656 If EXTRA2 is zero it will map to
657 "scalar identity behaviour", i.e. Scalar Power ISA register naming:
658
659 | Value | Mode | Range/inc | 6..0 |
660 |-----------|-------|---------------|-----------|
661 | 00 | Scalar | `r0-r31`/1 | `0b00 RA` |
662 | 01 | Scalar | `r32-r63`/1 | `0b01 RA` |
663 | 10 | Vector | `r0-r124`/4 | `RA 0b00` |
664 | 11 | Vector | `r2-r126`/4 | `RA 0b10` |
665
666 **Note that unlike in EXTRA3, in EXTRA2**:
667
668 * the GPR Vectors may only start from
669 `r0, r2, r4, r6, r8` and likewise FPR Vectors.
670 * the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.
671
672 as there are insufficient bits to cover the full range.
673
674 ## CR Field EXTRA3
675
676 CR Field encoding is essentially the same but made more complex due to CRs being bit-based. See [[svp64/appendix]] for explanation and pseudocode.
677 Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
678 and Scalars may only go from `CR0, CR1, ... CR31`
679
680 Encoding shown MSB down to LSB
681
682 For a 5-bit operand (BA, BB, BT):
683
684 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
685 |-------|------|---------------|-----------| --------|---------|
686 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
687 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
688 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[4:2] | BA[1:0] |
689 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[4:2] | BA[1:0] |
690 | 100 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
691 | 101 | Vector | `CR4-CR116`/16 | BA[4:2] 0 | 0b100 | BA[1:0] |
692 | 110 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
693 | 111 | Vector | `CR12-CR124`/16 | BA[4:2] 1 | 0b100 | BA[1:0] |
694
695 For a 3-bit operand (e.g. BFA):
696
697 | Value | Mode | Range/Inc | 6..3 | 2..0 |
698 |-------|------|---------------|-----------| --------|
699 | 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
700 | 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
701 | 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BFA |
702 | 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BFA |
703 | 100 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
704 | 101 | Vector | `CR4-CR116`/16 | BFA 0 | 0b100 |
705 | 110 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
706 | 111 | Vector | `CR12-CR124`/16 | BFA 1 | 0b100 |
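
As an illustration derived directly from the 5-bit-operand table above (the canonical pseudocode is in [[svp64/appendix]]), the extended CR Field number and the bit within it may be computed as:

```
# Illustrative decode of the CR Field EXTRA3 table above, for a 5-bit operand BA
# (LSB0: BA[4:2] = CR field 0-7, BA[1:0] = bit within the field).
def cr_extra3_5bit(BA, extra3):
    field, bit = BA >> 2, BA & 0b11
    if extra3 & 0b100:                               # Vector
        cr = (field << 4) | ((extra3 & 0b11) << 2)   # starts at CR0/CR4/CR8/CR12, /16
    else:                                            # Scalar
        cr = ((extra3 & 0b11) << 3) | field          # CR0-CR31
    return cr, bit

print(cr_extra3_5bit(0b01010, 0b110))  # (40, 2): Vector, CR40 (CR8 + 2*16), bit 2
```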
707
708 ## CR EXTRA2
709
710 CR encoding is essentially the same but made more complex due to CRs being bit-based. See separate section for explanation and pseudocode.
711 Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
712
713
714 Encoding shown MSB down to LSB
715
716 For a 5-bit operand (BA, BB, BC):
717
718 | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
719 |-------|--------|----------------|---------|---------|---------|
720 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
721 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
722 | 10 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
723 | 11 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
724
725 For a 3-bit operand (e.g. BFA):
726
727 | Value | Mode | Range/Inc | 6..3 | 2..0 |
728 |-------|------|---------------|-----------| --------|
729 | 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
730 | 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
731 | 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
732 | 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
733
734 --------
735
736 \newpage{}
737
738
739 # Normal SVP64 Modes, for Arithmetic and Logical Operations
740
741 Normal SVP64 Mode covers Arithmetic and Logical operations
742 to provide suitable additional behaviour. The Mode
743 field is bits 19-23 of the [[svp64]] RM Field.
744
745 ## Mode
746
747 Mode is an augmentation of SV behaviour, providing additional
748 functionality. Some of these alterations are element-based (saturation), others involve post-analysis (predicate result) and others are Vector-based (mapreduce, fail-on-first).
749
750 [[sv/ldst]],
751 [[sv/cr_ops]] and [[sv/branches]] are covered separately: the following
752 Modes apply to Arithmetic and Logical SVP64 operations:
753
754 * **simple** mode is straight vectorisation. no augmentations: the vector comprises an array of independently created results.
755 * **ffirst** or data-dependent fail-on-first: see separate section. the vector may be truncated depending on certain criteria.
756 *VL is altered as a result*.
757 * **sat mode** or saturation: clamps each element result to a min/max rather than overflows / wraps. allows signed and unsigned clamping for both INT
758 and FP.
759 * **reduce mode**. if used correctly, a mapreduce (or a prefix sum)
760 is performed. see [[svp64/appendix]].
761 note that there are comprehensive caveats when using this mode.
762 * **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch conditional testing) and if the test fails it
763 is as if the
764 *destination* predicate bit was zero even before starting the operation.
765 When Rc=1 the CR element however is still stored in the CR regfile, even if the test failed. See appendix for details.
766
767 Note that ffirst and reduce modes are not anticipated to be high-performance in some implementations. ffirst due to interactions with VL, and reduce due to it requiring additional operations to produce a result. simple, saturate and pred-result are however inter-element independent and may easily be parallelised to give high performance, regardless of the value of VL.
768
769 The Mode table for Arithmetic and Logical operations
770 is laid out as follows:
771
772 | 0-1 | 2 | 3 4 | description |
773 | --- | --- |---------|-------------------------- |
774 | 00 | 0 | dz sz | simple mode |
775 | 00 | 1 | 0 RG | scalar reduce mode (mapreduce) |
776 | 00 | 1 | 1 / | reserved |
777 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
778 | 01 | inv | VLi RC1 | Rc=0: ffirst z/nonz |
779 | 10 | N | dz sz | sat mode: N=0/1 u/s |
780 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
781 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
782
783 Fields:
784
785 * **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context.
786 * **zz**: both sz and dz are set equal to this flag
787 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
788 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
789 than the normal 0..VL-1
790 * **N** sets signed/unsigned saturation.
791 * **RC1** as if Rc=1, enables access to `VLi`.
792 * **VLi** VL inclusive: in fail-first mode, the truncation of
793 VL *includes* the current element at the failure point rather
794 than excludes it from the count.
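
The following decoder is a sketch of the table above (illustrative only, not normative):

```
# Illustrative decode of the 5-bit MODE field for Normal (Arithmetic/Logical)
# operations, following the table above (MODE bits shown MSB0, as in the table).
def decode_normal_mode(mode, Rc):
    m01 = (mode >> 3) & 0b11
    m2  = (mode >> 2) & 1
    m34 = mode & 0b11
    if m01 == 0b00:
        if m2 == 0:
            return "simple", {"dz": m34 >> 1, "sz": m34 & 1}
        if (m34 >> 1) == 0:
            return "mapreduce", {"RG": m34 & 1}
        return "reserved", {}
    if m01 == 0b10:
        return "saturate", {"N": m2, "dz": m34 >> 1, "sz": m34 & 1}
    kind = "ffirst" if m01 == 0b01 else "pred-result"
    if Rc:
        return kind, {"inv": m2, "CR-bit": m34}
    sub = {"VLi": m34 >> 1} if kind == "ffirst" else {"zz": m34 >> 1}
    sub.update({"inv": m2, "RC1": m34 & 1})
    return kind, sub

print(decode_normal_mode(0b10100, Rc=0))  # ('saturate', {'N': 1, 'dz': 0, 'sz': 0})
```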
795
796 For LD/ST Modes, see [[sv/ldst]]. For Condition Registers
797 see [[sv/cr_ops]].
798 For Branch modes, see [[sv/branches]].
799
800 ## Rounding, clamp and saturate
801
802 To help ensure for example that audio quality is not compromised by overflow,
803 "saturation" is provided, as well as a way to detect when saturation
804 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
805 one CR per element in the result (Note: this is different from VSX which
806 has a single CR per block).
807
808 When N=0 the result is saturated to within the maximum range of an
809 unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
810 logic applies to FP operations, with the result being saturated to
811 maximum rather than returning INF, and the minimum to +0.0
812
813 When N=1 the same occurs except that the result is saturated to the min
814 or max of a signed result, and for FP to the min and max value rather
815 than returning +/- INF.
816
817 When Rc=1, the CR "overflow" bit is set on the CR associated with the
818 element, to indicate whether saturation occurred. Note that due to
819 the hugely detrimental effect it has on parallel processing, XER.SO is
820 **ignored** completely and is **not** brought into play here. The CR
821 overflow bit is therefore simply set to zero if saturation did not occur,
822 and to one if it did.
823
824 Note also that saturate on operations that set OE=1 must raise an
825 Illegal Instruction due to the conflicting use of the CR.so bit for
826 storing whether
827 saturation occurred. Integer Operations that produce a Carry-Out (CA, CA32):
828 these two bits will be `UNDEFINED` if saturation is also requested.
829
830 Note that the operation takes place at the maximum bitwidth (max of
831 src and dest elwidth) and that truncation occurs to the range of the
832 dest elwidth.
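
A minimal sketch of the clamping described above; the second return value corresponds to the per-element CR "overflow" bit stored when Rc=1:

```
# Illustrative saturation helper: clamp a result to the destination elwidth,
# unsigned (N=0) or signed (N=1).
def saturate(value, elwidth, N):
    if N == 0:                                   # unsigned: 0 .. 2^elwidth - 1
        lo, hi = 0, (1 << elwidth) - 1
    else:                                        # signed: -2^(ew-1) .. 2^(ew-1) - 1
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    clamped = min(max(value, lo), hi)
    return clamped, clamped != value             # (result, saturation occurred)

print(saturate(300, 8, N=0))    # (255, True): clamped, per-element "overflow" set
print(saturate(-200, 8, N=1))   # (-128, True)
```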
833
834 *Programmer's Note: Post-analysis of the Vector of CRs to find out if any given element hit
835 saturation may be done using a mapreduced CR op (cror), or by using the
836 new crweird instruction with Rc=1, which will transfer the required
837 CR bits to a scalar integer and update CR0, which will allow testing
838 the scalar integer for nonzero. see [[sv/cr_int_predication]]*
839
840 ## Reduce mode
841
842 Reduction in SVP64 is similar in essence to other Vector Processing
843 ISAs, but leverages the underlying scalar Base v3.0B operations.
844 Thus it is more a convention that the programmer may utilise to give
845 the appearance and effect of a Horizontal Vector Reduction. Due
846 to the unusual decoupling it is also possible to perform
847 prefix-sum (Fibonacci Series) in certain circumstances. Details are in the [[svp64/appendix]]
848
849 Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
850 As explained in the [[sv/appendix]] Reduce Mode switches off the check
851 which would normally stop looping if the result register is scalar.
852 Thus, the result scalar register, if also used as a source scalar,
853 may be used to perform sequential accumulation. This *deliberately*
854 sets up a chain
855 of Register Hazard Dependencies, whereas Parallel Reduce [[sv/remap]]
856 deliberately issues a Tree-Schedule of operations that may be parallelised.
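
A minimal sketch (not the specification's pseudocode; the `regs` list and names are placeholders) of what scalar-destination Reduce Mode effectively performs, with the stop-on-scalar-result check switched off:

```
# Illustrative: scalar RT accumulates sequentially over a vector starting at RA
# (conceptually, an sv.add in mapreduce mode with a scalar destination).
def reduce_add(regs, RT, RA, VL):
    for i in range(VL):
        regs[RT] = regs[RT] + regs[RA + i]   # deliberate chain of register hazards
    return regs[RT]

regs = [0] * 32
regs[10:14] = [1, 2, 3, 4]                   # vector of 4 elements starting at r10
print(reduce_add(regs, RT=3, RA=10, VL=4))   # 10: sequential accumulation into r3
```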
857
858 ## Fail-on-first
859
860 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
861 the other for arithmetic operations (actually, CR-driven). Note in each
862 case the assumption is that vector elements are required to appear to be
863 executed in sequential Program Order. When REMAP is not active,
864 element 0 would be the first.
865
866 Data-driven (CR-driven) fail-on-first activates when Rc=1 or when another
867 CR-creating operation produces a result (including cmp). Similar to
868 branch, an analysis of the CR is performed and if the test fails, the
869 vector operation terminates and discards all element operations **at and
870 above the current one**, and VL is truncated to either
871 the *previous* element or the current one, depending on whether
872 VLi (VL "inclusive") is clear or set, respectively.
873
874 Thus the new VL comprises a contiguous vector of results,
875 all of which pass the testing criteria (equal to zero, less than zero etc
876 as defined by the CR-bit test).
877
878 *Note: when VLi is clear, the behaviour at first seems counter-intuitive.
879 A result is calculated but if the test fails it is prohibited from being
880 actually written. This becomes intuitive again when it is remembered
881 that the length that VL is set to is the number of *written* elements,
882 and only when VLi is set will the current element be included in that
883 count.*
884
885 The CR-based data-driven fail-on-first is "new" and not found in ARM
886 SVE or RVV. At the same time it is "old" because it is almost
887 identical to a generalised form of Z80's `CPIR` instruction.
888 It is extremely useful for reducing instruction count,
889 however requires speculative execution involving modifications of VL
890 to get high performance implementations. An additional mode (RC1=1)
891 effectively turns what would otherwise be an arithmetic operation
892 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
893 against the `inv` field).
894 If the CR.eq bit is equal to `inv` then the Vector is truncated and
895 the loop ends.
896
897 VLi is only available as an option when `Rc=0` (or for instructions
898 which do not have Rc). When set, the current element is always
899 also included in the count (the new length that VL will be set to).
900 This may be useful in combination with "inv" to truncate the Vector
901 to *exclude* elements that fail a test, or, in the case of implementations
902 of strncpy, to include the terminating zero.
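
A minimal model of the truncation behaviour (illustrative only; `test` stands in for the CR bit-test selected by `inv` and the CR-bit field):

```
# Illustrative model of CR-based data-dependent fail-first: elements execute in
# program order; on the first failed test VL is truncated, and VLi decides
# whether the failing element itself is kept.
def ffirst_truncate(values, test, VL, VLi=False):
    results, new_VL = [], VL
    for i in range(VL):
        r = values[i]                     # stand-in for the element operation
        if not test(r):                   # CR bit-test failed
            new_VL = i + 1 if VLi else i
            if VLi:
                results.append(r)
            break
        results.append(r)
    return results, new_VL

# e.g. keep elements while non-zero (strncpy-style; VLi includes the terminating 0)
print(ffirst_truncate([5, 7, 0, 9], lambda x: x != 0, VL=4, VLi=True))  # ([5, 7, 0], 3)
```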
903
904 In CR-based data-driven fail-on-first there is only the option to select
905 and test one bit of each CR (just as with branch BO). For more complex
906 tests this may be insufficient. If that is the case, a vectorised crop
907 such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
908 and ffirst applied to the crop instead of to
909 the arithmetic vector. Note that crops are covered by
910 the [[sv/cr_ops]] Mode format.
911
912 *Programmer's note: `VLi` is only accessible in normal operations
913 which in turn limits the CR field bit-testing to only `EQ/NE`.
914 [[sv/cr_ops]] are not so limited. Thus it is possible to use for
915 example `sv.cror/ff=gt/vli *0,*0,*0`, which is not a `nop` because
916 it allows Fail-First Mode to perform a test and truncate VL.*
917
918 Two extremely important aspects of ffirst are:
919
920 * LDST ffirst may never set VL equal to zero. This is because on the first
921 element an exception must be raised "as normal".
922 * CR-based data-dependent ffirst on the other hand **can** set VL equal
923 to zero. This is the only means in the entirety of SV that VL may be set
924 to zero (with the exception of via the SV.STATE SPR). When VL is set
925 zero due to the first element failing the CR bit-test, all subsequent
926 vectorised operations are effectively `nops` which is
927 *precisely the desired and intended behaviour*.
928
929 The second crucial aspect, compared to LDST Ffirst:
930
931 * LD/ST Failfirst may (beyond the initial first element
932 conditions) truncate VL for any architecturally
933 suitable reason. Beyond the first element LD/ST Failfirst is
934 arbitrarily speculative and 100% non-deterministic.
935 * CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
936 arbitrarily to a length decided by the hardware: VL MUST only be
937 truncated based explicitly on whether a test fails.
938 This is because it is a precise Deterministic test on which algorithms
939 can and will rely.
940
941 **Floating-point Exceptions**
942
943 When Floating-point exceptions are enabled VL must be truncated at
944 the point where the Exception appears not to have occurred. If `VLi`
945 is set then VL must include the faulting element, and thus the
946 faulting element will always raise its exception. If however `VLi`
947 is clear then VL **excludes** the faulting element and thus the
948 exception will **never** be raised.
949
950 Although very strongly
951 discouraged, the Exception Mode that permits Floating Point Exception
952 notification to arrive too late to unwind is permitted
953 (under protest, due to it violating
954 the otherwise 100% Deterministic nature of Data-dependent Fail-first).
955
956 **Use of lax FP Exception Notification Mode could result in parallel
957 computations proceeding with invalid results that have to be explicitly
958 detected, whereas with the strict FP Exception Mode enabled, FFirst
959 truncates VL, allowing subsequent parallel computation to avoid
960 the exceptions entirely.**
961
962 ## Data-dependent fail-first on CR operations (crand etc)
963
964 Operations that actually produce or alter CR Field as a result
965 have their own SVP64 Mode, described
966 in [[sv/cr_ops]].
967
968 ## pred-result mode
969
970 This mode merges common CR testing with predication, saving on instruction
971 count. Below is the pseudocode excluding predicate zeroing and elwidth
972 overrides. Note that the pseudocode for SVP64 CR-ops is slightly different.
973
974 ```
975 for i in range(VL):
976 # predication test, skip all masked out elements.
977 if predicate_masked_out(i):
978 continue
979 result = op(iregs[RA+i], iregs[RB+i])
980 CRnew = analyse(result) # calculates eq/lt/gt
981 # Rc=1 always stores the CR field
982 if Rc=1 or RC1:
983 CR.field[offs+i] = CRnew
984 # now test CR, similar to branch
985 if RC1 or CR.field[BO[0:1]] != BO[2]:
986 continue # test failed: cancel store
987 # result optionally stored but CR always is
988 iregs[RT+i] = result
989 ```
990
991 The reason for allowing the CR element to be stored is so that
992 post-analysis of the CR Vector may be carried out. For example:
993 Saturation may have occurred (and been prevented from updating, by the
994 test) but it is desirable to know *which* elements fail saturation.
995
996 Note that RC1 Mode basically turns all operations into `cmp`. The
997 calculation is performed but it is only the CR that is written. The
998 element result is *always* discarded, never written (just like `cmp`).
999
1000 Note that predication is still respected: predicate zeroing is slightly
1001 different: elements that fail the CR test *or* are masked out are zero'd.
1002
1003 --------
1004
1005 \newpage{}
1006
1007 # SV Load and Store
1008
1009 **Rationale**
1010
1011 All Vector ISAs dating back fifty years have extensive and comprehensive
1012 Load and Store operations that go far beyond the capabilities of Scalar
1013 RISC and most CISC processors, yet at their heart on an individual element
1014 basis may be found to be no different from RISC Scalar equivalents.
1015
1016 The resource savings from Vector LD/ST are significant and stem from
1017 the fact that one single instruction can trigger a dozen (or, in some
1018 microarchitectures such as Cray or NEC SX Aurora, hundreds of) element-level Memory accesses.
1019
1020 Additionally, and simply: if the Arithmetic side of an ISA supports
1021 Vector Operations, then in order to keep the ALUs 100% occupied the
1022 Memory infrastructure (and the ISA itself) correspondingly needs Vector
1023 Memory Operations as well.
1024
1025 Vectorised Load and Store also presents an extra dimension (literally)
1026 which creates scenarios unique to Vector applications, that a Scalar
1027 (and even a SIMD) ISA simply never encounters. SVP64 endeavours to
1028 add the modes typically found in *all* Scalable Vector ISAs,
1029 without changing the behaviour of the underlying Base
1030 (Scalar) v3.0B operations in any way.
1031
1032 ## Modes overview
1033
1034 Vectorisation of Load and Store requires the creation, from scalar operations,
1035 of a number of different modes:
1036
1037 * **fixed aka "unit" stride** - contiguous sequence with no gaps
1038 * **element strided** - sequential but regularly offset, with gaps
1039 * **vector indexed** - vector of base addresses and vector of offsets
1040 * **Speculative fail-first** - where it makes sense to do so
1041 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
1042
1043 *Despite being constructed from Scalar LD/ST none of these Modes
1044 exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
1045
1046 Also included in SVP64 LD/ST is both signed and unsigned Saturation,
1047 as well as Element-width overrides and Twin-Predication.
1048
1049 Note also that Indexed [[sv/remap]] mode may be applied to both
1050 v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
1051 LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
1052 is provided below.
1053
1054 **Determining the LD/ST Modes**
1055
1056 A minor complication (caused by the retro-fitting of modern Vector
1057 features to a Scalar ISA) is that certain features do not exactly make
1058 sense or are considered a security risk. Fail-first on Vector Indexed
1059 would allow attackers to probe large numbers of pages from userspace, where
1060 strided fail-first (by creating contiguous sequential LDs) does not.
1061
1062 In addition, reduce mode makes no sense.
1063 Realistically we need
1064 an alternative table definition for [[sv/svp64]] `RM.MODE`.
1065 The following modes make sense:
1066
1067 * saturation
1068 * predicate-result (mostly for cache-inhibited LD/ST)
1069 * simple (no augmentation)
1070 * fail-first (where Vector Indexed is banned)
1071 * Signed Effective Address computation (Vector Indexed only)
1072 * Pack/Unpack (on LD/ST immediate operations only)
1073
1074 More than that however it is necessary to fit the usual Vector ISA
1075 capabilities onto both Power ISA LD/ST with immediate and to
1076 LD/ST Indexed. They present subtly different Mode tables, which, due
1077 to lack of space, have the following quirks:
1078
1079 * LD/ST Immediate has no individual control over src/dest zeroing,
1080 whereas LD/ST Indexed does.
1081 * LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
1082 * LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
1083
1084 ## Format and fields
1085
1086 Fields used in tables below:
1087
1088 * **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context.
1089 * **zz**: both sz and dz are set equal to this flag.
1090 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1091 * **N** sets signed/unsigned saturation.
1092 * **RC1** as if Rc=1, stores CRs *but not the result*
1093 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
1094 registers that have been reduced due to elwidth overrides
1095
1096 **LD/ST immediate**
1097
1098 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
1099 (bits 19:23 of `RM`) is:
1100
1101 | 0-1 | 2 | 3 4 | description |
1102 | --- | --- |---------|--------------------------- |
1103 | 00 | 0 | zz els | simple mode |
1104 | 00 | 1 | PI LF | post-increment and Fault-First |
1105 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
1106 | 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
1107 | 10 | N | zz els | sat mode: N=0/1 u/s |
1108 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1109 | 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
1110
1111 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
1112 whether stride is unit or element:
1113
1114 ```
1115 if RA.isvec:
1116 svctx.ldstmode = indexed
1117 elif els == 0:
1118 svctx.ldstmode = unitstride
1119 elif immediate != 0:
1120 svctx.ldstmode = elementstride
1121 ```
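
As an illustration (derived from the `op_load` pseudocode later in this section), these are the Effective Addresses generated for each element in the two strided sub-modes:

```
# Illustrative: the sequence of Effective Addresses each LD/ST-immediate mode
# generates, for a scalar base in RA.
def ea_sequence(ldstmode, RA_base, immed, op_width, VL):
    if ldstmode == "unitstride":
        return [RA_base + immed + i * op_width for i in range(VL)]
    if ldstmode == "elementstride":
        return [RA_base + i * immed for i in range(VL)]
    raise ValueError("indexed mode takes its addresses from a vector of registers")

print([hex(a) for a in ea_sequence("unitstride", 0x1000, 8, 4, 4)])
# ['0x1008', '0x100c', '0x1010', '0x1014']
print([hex(a) for a in ea_sequence("elementstride", 0x1000, 16, 4, 4)])
# ['0x1000', '0x1010', '0x1020', '0x1030']
```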
1122
1123 An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
1124 in effect the multiplication of the element index by a zero immediate results
1125 in reading from the exact same memory location, *even with a Vector
1126 register*. (Normally this type of behaviour is reserved for the
1127 mapreduce modes)
1128
1129 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
1130 just the once and be copied, rather than hitting the Data Cache
1131 multiple times with the same memory read at the same location.
1132 The benefit of Cache-inhibited LD-splats is that it allows
1133 for memory-mapped peripherals to have multiple
1134 data values read in quick succession and stored in sequentially
1135 numbered registers (but, see Note below).
1136
1137 For non-cache-inhibited ST from a vector source onto a scalar
1138 destination: with the Vector
1139 loop effectively creating multiple memory writes to the same location,
1140 we can deduce that the last of these will be the "successful" one. Thus,
1141 implementations are free and clear to optimise out the overwriting STs,
1142 leaving just the last one as the "winner". Bear in mind that predicate
1143 masks will skip some elements (in source non-zeroing mode).
1144 Cache-inhibited ST operations on the other hand **MUST** write out
1145 a Vector source multiple successive times to the exact same Scalar
1146 destination. Just like Cache-inhibited LDs, multiple values may be
1147 written out in quick succession to a memory-mapped peripheral from
1148 sequentially-numbered registers.
1149
1150 Note that any memory location may be Cache-inhibited
1151 (Power ISA v3.1, Book III, 1.6.1, p1033)
1152
1153 *Programmer's Note: an immediate also with a Scalar source as
1154 a "VSPLAT" mode is simply not possible: there are not enough
1155 Mode bits. One single Scalar Load operation may be used instead, followed
1156 by any arithmetic operation (including a simple mv) in "Splat"
1157 mode.*
1158
1159 **LD/ST Indexed**
1160
1161 The modes for `RA+RB` indexed version are slightly different
1162 but are the same `RM.MODE` bits (19:23 of `RM`):
1163
1164 | 0-1 | 2 | 3 4 | description |
1165 | --- | --- |---------|-------------------------- |
1166 | 00 | SEA | dz sz | simple mode |
1167 | 01 | SEA | dz sz | Strided (scalar only source) |
1168 | 10 | N | dz sz | sat mode: N=0/1 u/s |
1169 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
1170 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
1171
1172 Vector Indexed Strided Mode is qualified as follows:
1173
1174 if mode = 0b01 and !RA.isvec and !RB.isvec:
1175 svctx.ldstmode = elementstride
1176
1177 A summary of the effect of Vectorisation of src or dest:
1178
    imm(RA)  RT.v   RA.v      no stride allowed
    imm(RA)  RT.s   RA.v      no stride allowed
    imm(RA)  RT.v   RA.s      stride-select allowed
    imm(RA)  RT.s   RA.s      not vectorised
    RA,RB    RT.v   {RA|RB}.v Standard Indexed
    RA,RB    RT.s   {RA|RB}.v Indexed but single LD (no VSPLAT)
    RA,RB    RT.v   {RA&RB}.s VSPLAT possible. stride selectable
    RA,RB    RT.s   {RA&RB}.s not vectorised (scalar identity)
1187
Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is
sign-extended from elwidth bits to the full 64 bits before
being added to RA to calculate the Effective Address.
For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
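
A minimal Python sketch of the SEA rule described above (the function
names and the chosen widths are illustrative assumptions, not part of
the specification): the RB element is sign-extended from the source
elwidth to 64 bits before the addition to RA.

```
# sign-extend a value of 'bits' width to a Python integer
def sext(value, bits):
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

# EA for one element of Vector Indexed mode, with or without SEA
def indexed_ea(ra, rb_elem, src_elwidth, sea):
    offs = sext(rb_elem, src_elwidth) if sea else rb_elem
    return (ra + offs) & ((1 << 64) - 1)

print(hex(indexed_ea(0x2000, 0xFFFE, 16, sea=True)))   # 0x1ffe  (offset -2)
print(hex(indexed_ea(0x2000, 0xFFFE, 16, sea=False)))  # 0x11ffe (offset +65534)
```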
1195
Note that a cache-inhibited LD/ST when VSPLAT is activated will perform
**multiple** LD/ST operations, sequentially. Even with a scalar src a
Cache-inhibited LD will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are typically used to read and
write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
copying the one *scalar* value into multiple register destinations.
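
The recommended sequence can be modelled in a few lines of Python (an
illustrative sketch only): one scalar read of the peripheral register,
then a pure register-to-register splat, so memory is touched exactly
once.

```
# illustrative: one cache-inhibited scalar LD, then splat via mv
def scalar_ld_then_splat(mem, addr, vl):
    value = mem[addr]        # single memory access
    return [value] * vl      # VSPLAT-style mv into VL destination registers

print(scalar_ld_then_splat({0x4000_0000: 0x5A}, 0x4000_0000, 4))
# [90, 90, 90, 90]
```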
1201
Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This allows, for example, a massive batch of memory-mapped
peripheral reads to be issued, stopping at the first NULL character and
truncating VL to that point. No branch is needed to issue that large burst
of LDs, which may be valuable in Embedded scenarios.
1207
1208 ## Vectorisation of Scalar Power ISA v3.0B
1209
1210 Scalar Power ISA Load/Store operations may be seen from their
1211 pseudocode to be of the form:
1212
    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)
1216
1217 and for immediate variants:
1218
    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)
1222
1223 Thus in the first example, the source registers may each be independently
1224 marked as scalar or vector, and likewise the destination; in the second
1225 example only the one source and one dest may be marked as scalar or
1226 vector.
1227
1228 Thus we can see that Vector Indexed may be covered, and, as demonstrated
1229 with the pseudocode below, the immediate can be used to give unit
1230 stride or element stride. With there being no way to tell which from
1231 the Power v3.0B Scalar opcode alone, the choice is provided instead by
1232 the SV Context.
1233
1234 ```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
1280 ```
1281
1282 Indexed LD is:
1283
1284 ```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
1307 ```
1308
1309 Note that Element-Strided uses the Destination Step because with both
1310 sources being Scalar as a prerequisite condition of activation of
1311 Element-Stride Mode, the source step (being Scalar) would never advance.
1312
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.
1314
1315 *Programmer's note: being able to set RA-as-a-source
1316 as separate from RA-as-a-destination as Scalar is **extremely valuable**
1317 once it is remembered that Simple-V element operations must
1318 be in Program Order, especially in loops, for saving on
1319 multiple address computations. Care does have
1320 to be taken however that RA-as-src is not overwritten by
1321 RA-as-dest unless intentionally desired, especially in element-strided Mode.*
1322
1323 ## LD/ST Indexed vs Indexed REMAP
1324
1325 Unfortunately the word "Indexed" is used twice in completely different
1326 contexts, potentially causing confusion.
1327
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA
  since its creation: these are called "LD/ST Indexed" instructions and
  their name and meaning are well-established.
1331 * There now exists, in Simple-V, a REMAP mode called "Indexed"
1332 Mode that can be applied to *any* instruction **including those
1333 named LD/ST Indexed**.
1334
Whilst allowing REMAP Indexed Mode to be applied to any Vectorised LD/ST
Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in terms of
register reads, or even misleadingly labelled as redundant, firstly the
strict application of the RISC Paradigm that Simple-V follows makes it
awkward to consider *preventing* the application of Indexed REMAP to such
operations, and secondly the two are not actually the same at all.
1342
1343 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
1344 effectively performs an *in-place* re-ordering of the offsets, RB.
1345 To achieve the same effect without Indexed REMAP would require taking
1346 a *copy* of the Vector of offsets starting at RB, manually explicitly
1347 reordering them, and finally using the copy of re-ordered offsets in
a non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the
pseudocode below shows what actually occurs (the pseudocode for
`indexed_remap` may be found in [[sv/remap]]):
1351
1352 ```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
1361 ```
1362
1363 Thus it can be seen that the use of Indexed REMAP saves copying
1364 and manual reordering of the Vector of RB offsets.
1365
1366 ## LD/ST ffirst
1367
1368 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
1369 is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
1371 in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
1372 1 and above, if an exception would occur, then VL is **truncated**
1373 to the previous element: the exception is **not** then raised because
1374 the LD/ST that would otherwise have caused an exception is *required*
1375 to be cancelled. Additionally an implementor may choose to truncate VL
1376 for any arbitrary reason *except for the very first*.
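
The truncation behaviour can be illustrated with a small Python model
(a sketch, not the normative pseudocode): element 0 faults normally,
whereas a fault on any later element simply truncates VL and raises no
exception.

```
# illustrative Fail-First model: 'mem' is a dict, and a missing key
# plays the role of a page fault
def ffirst_load(mem, addrs, vl):
    result = []
    for i in range(vl):
        if addrs[i] not in mem:
            if i == 0:
                raise MemoryError("element 0 must fault as a normal LD")
            return result, i          # VL truncated to i, no exception
        result.append(mem[addrs[i]])
    return result, vl

mem = {0x1000: 1, 0x1008: 2}          # 0x1010 onwards unmapped
print(ffirst_load(mem, [0x1000, 0x1008, 0x1010, 0x1018], 4))
# ([1, 2], 2)  <- VL truncated to 2
```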
1377
1378 ffirst LD/ST to multiple pages via a Vectorised Index base is
1379 considered a security risk due to the abuse of probing multiple
1380 pages in rapid succession and getting speculative feedback on which
1381 pages would fail. Therefore Vector Indexed LD/ST is prohibited
1382 entirely, and the Mode bit instead used for element-strided LD/ST.
1383 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
1384
1385 ```
1386 for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
1388 ```
1389
1390 High security implementations where any kind of speculative probing
1391 of memory pages is considered a risk should take advantage of the fact that
1392 implementations may truncate VL at any point, without requiring software
1393 to be rewritten and made non-portable. Such implementations may choose
1394 to *always* set VL=1 which will have the effect of terminating any
1395 speculative probing (and also adversely affect performance), but will
1396 at least not require applications to be rewritten.
1397
Low-performance simpler hardware implementations may also
choose to always set VL=1 as the bare minimum compliant implementation of
1400 LD/ST Fail-First. It is however critically important to remember that
1401 the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
1402 **MUST** raise exceptions exactly like an ordinary LD/ST.
1403
For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a nonaligned boundary, such that within a
loop the subsequent iteration of that loop begins the following ffirst
LD/ST operations on an aligned boundary such as the beginning of a cache
line, or beginning of a Virtual Memory page. Likewise, VL may be
truncated to reduce workloads or balance resources.
1407
1408 Vertical-First Mode is slightly strange in that only one element
1409 at a time is ever executed anyway. Given that programmers may
1410 legitimately choose to alter srcstep and dststep in non-sequential
1411 order as part of explicit loops, it is neither possible nor
1412 safe to make speculative assumptions about future LD/STs.
1413 Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
1414 This is very different from Arithmetic (Data-dependent) FFirst
1415 where Vertical-First Mode is fully deterministic, not speculative.
1416
1417 ## LOAD/STORE Elwidths <a name="elwidth"></a>
1418
1419 Loads and Stores are almost unique in that the Power Scalar ISA
1420 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
1421 others like it provide an explicit operation width. There are therefore
1422 *three* widths involved:
1423
1424 * operation width (lb=8, lh=16, lw=32, ld=64)
1425 * src element width override (8/16/32/default)
1426 * destination element width override (8/16/32/default)
1427
1428 Some care is therefore needed to express and make clear the transformations,
1429 which are expressly in this order:
1430
1431 * Calculate the Effective Address from RA at full width
1432 but (on Indexed Load) allow srcwidth overrides on RB
1433 * Load at the operation width (lb/lh/lw/ld) as usual
1434 * byte-reversal as usual
1435 * Non-saturated mode:
1436 - zero-extension or truncation from operation width to dest elwidth
1437 - place result in destination at dest elwidth
1438 * Saturated mode:
1439 - Sign-extension or truncation from operation width to dest width
1440 - signed/unsigned saturation down to dest elwidth
1441
1442 In order to respect Power v3.0B Scalar behaviour the memory side
1443 is treated effectively as completely separate and distinct from SV
1444 augmentation. This is primarily down to quirks surrounding LE/BE and
1445 byte-reversal.
1446
1447 It is rather unfortunately possible to request an elwidth override
1448 on the memory side which
1449 does not mesh with the overridden operation width: these result in
1450 `UNDEFINED`
1451 behaviour. The reason is that the effect of attempting a 64-bit `sv.ld`
1452 operation with a source elwidth override of 8/16/32 would result in
1453 overlapping memory requests, particularly on unit and element strided
1454 operations. Thus it is `UNDEFINED` when the elwidth is smaller than
1455 the memory operation width. Examples include `sv.lw/sw=16/els` which
1456 requests (overlapping) 4-byte memory reads offset from
1457 each other at 2-byte intervals. Store likewise is also `UNDEFINED`
1458 where the dest elwidth override is less than the operation width.
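
The overlap in the `sv.lw/sw=16/els` example can be seen with a line of
Python (illustrative arithmetic only, following the numbers given in the
text): each element requests a 4-byte read, but successive reads start
only 2 bytes apart.

```
# illustrative: 4-byte operation width, 2-byte (elwidth=16) stride
op_width, stride, base = 4, 2, 0x1000
print([(hex(base + i * stride), op_width) for i in range(4)])
# [('0x1000', 4), ('0x1002', 4), ('0x1004', 4), ('0x1006', 4)]  <- overlapping
```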
1459
1460 Note the following regarding the pseudocode to follow:
1461
1462 * `scalar identity behaviour` SV Context parameter conditions turn this
1463 into a straight absolute fully-compliant Scalar v3.0B LD operation
1464 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
1465 rather than `ld`)
1466 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
1467 a "normal" part of Scalar v3.0B LD
1468 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
1469 as a "normal" part of Scalar v3.0B LD
1470 * `svctx` specifies the SV Context and includes VL as well as
1471 source and destination elwidth overrides.
1472
Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Observe in particular that RA, as the base address in
both Immediate and Indexed LD/ST,
does not have element-width overriding applied to it.
1476
1477 Note that predication, predication-zeroing,
1478 and other modes except saturation have all been removed,
1479 for clarity and simplicity:
1480
1481 ```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1516 ```
1517
1518 Note above that the source elwidth is *not used at all* in LD-immediate.
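
The helper functions `clamp` and `adjust_wid` referenced in the
pseudocode are not defined in this section; the following Python
sketches (illustrative assumptions only, with widths expressed in bytes
and the unsigned case shown for `clamp`) convey the intent: plain
truncation/zero-extension versus saturation down to the destination
element width.

```
# zero-extend or truncate to the destination element width (bytes)
def adjust_wid(value, op_width, dest_elwidth):
    return value & ((1 << (dest_elwidth * 8)) - 1)

# unsigned saturation down to the destination element width (bytes)
def clamp(value, op_width, dest_elwidth):
    maxval = (1 << (dest_elwidth * 8)) - 1
    return min(value, maxval)

print(hex(adjust_wid(0x12345678, 4, 2)))  # 0x5678 (truncated)
print(hex(clamp(0x12345678, 4, 2)))       # 0xffff (saturated)
```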
1519
1520 For LD/Indexed, the key is that in the calculation of the Effective Address,
1521 RA has no elwidth override but RB does. Pseudocode below is simplified
1522 for clarity: predication and all modes except saturation are removed:
1523
1524 ```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
1565 ```
1566
1567 ## Remapped LD/ST
1568
1569 In the [[sv/remap]] page the concept of "Remapping" is described.
1570 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
1571 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
1572 elements worth of LDs or STs. The usual interest in such re-mapping
1573 is for example in separating out 24-bit RGB channel data into separate
1574 contiguous registers.
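
As a motivating illustration (a Python sketch of the index arithmetic
only, not the REMAP pseudocode), de-interleaving packed R,G,B bytes into
channel-contiguous registers amounts to a simple 2D re-ordering of
element indices:

```
packed = [10, 20, 30, 11, 21, 31, 12, 22, 32]   # interleaved R,G,B triples
n = len(packed) // 3
regs = [0] * len(packed)
for i, byte in enumerate(packed):
    # transpose a 3 x n layout: channel-major destination ordering
    regs[(i % 3) * n + i // 3] = byte
print(regs)  # [10, 11, 12, 20, 21, 22, 30, 31, 32] -> R..., G..., B...
```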
1575
1576 REMAP easily covers this capability, and with dest
1577 elwidth overrides and saturation may do so with built-in conversion that
1578 would normally require additional width-extension, sign-extension and
1579 min/max Vectorised instructions as post-processing stages.
1580
1581 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
1582 because the generic abstracted concept of "Remapping", when applied to
1583 LD/ST, will give that same capability, with far more flexibility.
1584
1585 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
1586 established through `svstep`, are also an easy way to perform regular
1587 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond
1588 that, REMAP will need to be used.
1589
1590 --------
1591
1592 \newpage{}
1593
1594 # Condition Register SVP64 Operations
1595
1596 Condition Register Fields are only 4 bits wide: this presents some
1597 interesting conceptual challenges for SVP64, which was designed
1598 primarily for vectors of arithmetic and logical operations. However
1599 if predicates may be bits of CR Fields it makes sense to extend
1600 Simple-V to cover CR Operations, especially given that Vectorised Rc=1
may be processed by Vectorised CR Operations that usefully in turn
1602 may become Predicate Masks to yet more Vector operations, like so:
1603
1604 ```
sv.cmpi/ew=8 *B,*ra,0      # compare bytes against zero
sv.cmpi/ew=8 *B2,*ra,13    # and against newline
sv.cror PM.EQ,B.EQ,B2.EQ   # OR compares to create mask
sv.stb/sm=EQ ...           # store only nonzero/newline
1609 ```
1610
Element width however is clearly meaningless for a 4-bit collation of
Conditions, LT GT EQ SO. Likewise, arithmetic saturation (an important
1613 part of Arithmetic SVP64) has no meaning. An alternative Mode Format is
1614 required, and given that elwidths are meaningless for CR Fields the bits
1615 in SVP64 `RM` may be used for other purposes.
1616
1617 This alternative mapping **only** applies to instructions that **only**
1618 reference a CR Field or CR bit as the sole exclusive result. This section
1619 **does not** apply to instructions which primarily produce arithmetic
1620 results that also, as an aside, produce a corresponding
1621 CR Field (such as when Rc=1).
1622 Instructions that involve Rc=1 are definitively arithmetic in nature,
1623 where the corresponding Condition Register Field can be considered to
be a "co-result". Such CR Field "co-result" arithmetic operations
1625 are firmly out of scope for
1626 this section, being covered fully by [[sv/normal]].
1627
* Examples of v3.0B instructions to which this section does
  apply are
1630 - `mfcr` and `cmpi` (3 bit operands) and
1631 - `crnor` and `crand` (5 bit operands).
1632 * Examples to which this section does **not** apply include
1633 `fadds.` and `subf.` which both produce arithmetic results
1634 (and a CR Field co-result).
1635
1636 The CR Mode Format still applies to `sv.cmpi` because despite
1637 taking a GPR as input, the output from the Base Scalar v3.0B `cmpi`
1638 instruction is purely to a Condition Register Field.
1639
1640 Other modes are still applicable and include:
1641
1642 * **Data-dependent fail-first**.
1643 useful to truncate VL based on
1644 analysis of a Condition Register result bit.
1645 * **Reduction**.
1646 Reduction is useful
1647 for analysing a Vector of Condition Register Fields
1648 and reducing it to one
1649 single Condition Register Field.
1650
1651 Predicate-result does not make any sense because
1652 when Rc=1 a co-result is created (a CR Field). Testing the co-result
1653 allows the decision to be made to store or not store the main
1654 result, and for CR Ops the CR Field result *is*
1655 the main result.
1656
1657 ## Format
1658
1659 SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations:
1660
1661 |6 | 7 |19-20| 21 | 22 23 | description |
1662 |--|---|-----| --- |---------|----------------- |
1663 |/ | / |0 RG | 0 | dz sz | simple mode |
1664 |/ | / |0 RG | 1 | dz sz | scalar reduce mode (mapreduce) |
1665 |zz|SNZ|1 VLI| inv | CR-bit | Ffirst 3-bit mode |
1666 |/ |SNZ|1 VLI| inv | dz sz | Ffirst 5-bit mode (implies CR-bit from result) |
1667
1668 Fields:
1669
* **sz / dz** if predication is enabled will put zeros into the dest
  (or as src in the case of twin pred) when the predicate bit is zero.
  Otherwise the element is ignored or skipped, depending on context.
1671 * **zz** set both sz and dz equal to this flag
1672 * **SNZ** In fail-first mode, on the bit being tested, when sz=1 and SNZ=1 a value "1" is put in place of "0".
1673 * **inv CR-bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
1674 * **RG** inverts the Vector Loop order (VL-1 downto 0) rather
1675 than the normal 0..VL-1
1676 * **SVM** sets "subvector" reduce mode
1677 * **VLi** VL inclusive: in fail-first mode, the truncation of
1678 VL *includes* the current element at the failure point rather
1679 than excludes it from the count.
1680
1681 ## Data-dependent fail-first on CR operations
1682
1683 The principle of data-dependent fail-first is that if, during
1684 the course of sequentially evaluating an element's Condition Test,
1685 one such test is encountered which fails,
1686 then VL (Vector Length) is truncated (set) at that point. In the case
1687 of Arithmetic SVP64 Operations the Condition Register Field generated from
1688 Rc=1 is used as the basis for the truncation decision.
1689 However with CR-based operations that CR Field result to be
1690 tested is provided
1691 *by the operation itself*.
1692
1693 Data-dependent SVP64 Vectorised Operations involving the creation or
1694 modification of a CR can require an extra two bits, which are not available
1695 in the compact space of the SVP64 RM `MODE` Field. With the concept of element
1696 width overrides being meaningless for CR Fields it is possible to use the
1697 `ELWIDTH` field for alternative purposes.
1698
1699 Condition Register based operations such as `sv.mfcr` and `sv.crand` can thus
1700 be made more flexible. However the rules that apply in this section
1701 also apply to future CR-based instructions.
1702
1703 There are two primary different types of CR operations:
1704
1705 * Those which have a 3-bit operand field (referring to a CR Field)
1706 * Those which have a 5-bit operand (referring to a bit within the
1707 whole 32-bit CR)
1708
1709 Examining these two types it is observed that the
1710 difference may be considered to be that the 5-bit variant
1711 *already* provides the
prerequisite information about which CR Field bit (LT, GT, EQ, SO) is to
1713 be operated on by the instruction.
1714 Thus, logically, we may set the following rule:
1715
1716 * When a 5-bit CR Result field is used in an instruction, the
1717 5-bit variant of Data-Dependent Fail-First
1718 must be used. i.e. the bit of the CR field to be tested is
1719 the one that has just been modified (created) by the operation.
1720 * When a 3-bit CR Result field is used the 3-bit variant
1721 must be used, providing as it does the missing `CRbit` field
1722 in order to select which CR Field bit of the result shall
  be tested (LT, GT, EQ, SO)
1724
The reason why the 3-bit CR variant needs the additional CR-bit
field should be obvious from the fact that the 3-bit CR Field
from the base Power ISA v3.0B operation clearly does not contain
the two CR Field Selector bits. Thus, these two
bits (to select LT, GT, EQ or SO) must be provided in another
way.
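
The distinction can be shown with a few lines of Python (illustrative
only; the helper names are not part of the specification): a 5-bit
operand already carries both the CR Field number and the bit within it,
whereas a 3-bit operand carries only the Field number, leaving the bit
selector to be supplied by the `CRbit` field.

```
CR_BITS = ("LT", "GT", "EQ", "SO")

# 5-bit operand (e.g. BT in crand): field and bit are both encoded
def decode_5bit(bt):
    return bt >> 2, CR_BITS[bt & 0b11]

# 3-bit operand (e.g. BF in mcrf): only the field; the bit must come
# from the SVP64 RM CR-bit selector
def decode_3bit(bf, crbit):
    return bf, CR_BITS[crbit]

print(decode_5bit(0b00110))     # (1, 'EQ')  -> CR1.EQ
print(decode_3bit(0b001, 0b10)) # (1, 'EQ')  -> CR1.EQ via CR-bit selector
```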
1731
Examples of both types:
1733
1734 * crand, cror, crnor. These all are 5-bit (BA, BB, BT). The bit
1735 to be tested against `inv` is the one selected by `BT`
1736 * mcrf. This has only 3-bit (BF, BFA). In order to select the
1737 bit to be tested, the alternative encoding must be used.
1738 With `CRbit` coming from the SVP64 RM bits 22-23 the bit
1739 of BF to be tested is identified.
1740
1741 Just as with SVP64 [[sv/branches]] there is the option to truncate
1742 VL to include the element being tested (`VLi=1`) and to exclude it
1743 (`VLi=0`).
1744
1745 Also exactly as with [[sv/normal]] fail-first, VL cannot, unlike
1746 [[sv/ldst]], be set to an arbitrary value. Deterministic behaviour
1747 is *required*.
1748
1749 ## Reduction and Iteration
1750
Bearing in mind, as described in the svp64 Appendix, that SVP64 Horizontal
Reduction is a deterministic schedule on top of base Scalar v3.0 operations,
the same rules apply to CR Operations, i.e. that programmers must
1754 follow certain conventions in order for an *end result* of a
1755 reduction to be achieved. Unlike
1756 other Vector ISAs *there are no explicit reduction opcodes*
1757 in SVP64: Schedules however achieve the same effect.
1758
1759 Due to these conventions only reduction on operations such as `crand`
1760 and `cror` are meaningful because these have Condition Register Fields
1761 as both input and output.
1762 Meaningless operations are not prohibited because the cost in hardware
1763 of doing so is prohibitive, but neither are they `UNDEFINED`. Implementations
1764 are still required to execute them but are at liberty to optimise out
1765 any operations that would ultimately be overwritten, as long as Strict
Program Order is still observable by the programmer.
1767
1768 Also bear in mind that 'Reverse Gear' may be enabled, which can be
1769 used in combination with overlapping CR operations to iteratively accumulate
1770 results. Issuing a `sv.crand` operation for example with `BA`
1771 differing from `BB` by one Condition Register Field would
1772 result in a cascade effect, where the first-encountered CR Field
1773 would set the result to zero, and also all subsequent CR Field
1774 elements thereafter:
1775
1776 ```
1777 # sv.crand/mr/rg CR4.ge.v, CR5.ge.v, CR4.ge.v
1778 for i in VL-1 downto 0 # reverse gear
    CR.field[4+i].ge &= CR.field[5+i].ge
1780 ```
1781
1782 `sv.crxor` with reduction would be particularly useful for parity calculation
1783 for example, although there are many ways in which the same calculation
1784 could be carried out after transferring a vector of CR Fields to a GPR
1785 using crweird operations.
1786
1787 Implementations are free and clear to optimise these reductions in any
1788 way they see fit, as long as the end-result is compatible with Strict Program
1789 Order being observed, and Interrupt latency is not adversely impacted.
1790
1791 ## Unusual and quirky CR operations
1792
1793 **cmp and other compare ops**
1794
1795 `cmp` and `cmpi` etc take GPRs as sources and create a CR Field as a result.
1796
    cmpli BF,L,RA,UI
    cmpeqb BF,RA,RB
1799
1800 With `ELWIDTH` applying to the source GPR operands this is perfectly fine.
1801
1802 **crweird operations**
1803
1804 There are 4 weird CR-GPR operations and one reasonable one in
1805 the [[cr_int_predication]] set:
1806
1807 * crrweird
1808 * mtcrweird
1809 * crweirder
1810 * crweird
1811 * mcrfm - reasonably normal and referring to CR Fields for src and dest.
1812
1813 The "weird" operations have a non-standard behaviour, being able to
1814 treat *individual bits* of a GPR effectively as elements. They are
1815 expected to be Micro-coded by most Hardware implementations.
1816
1817
1818 --------
1819
1820 \newpage{}
1821
1822 # SVP64 Branch Conditional behaviour
1823
1824 Please note: although similar, SVP64 Branch instructions should be
1825 considered completely separate and distinct from
1826 standard scalar OpenPOWER-approved v3.0B branches.
1827 **v3.0B branches are in no way impacted, altered,
1828 changed or modified in any way, shape or form by
1829 the SVP64 Vectorised Variants**.
1830
1831 It is also
1832 extremely important to note that Branches are the
1833 sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`.
1834 SVP64 Branches contain additional modes that are useful
1835 for scalar operations (i.e. even when VL=1 or when
1836 using single-bit predication).
1837
1838 **Rationale**
1839
1840 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
1841 Condition Register. However for parallel processing it is simply impossible
1842 to perform multiple independent branches: the Program Counter simply
1843 cannot branch to multiple destinations based on multiple conditions.
1844 The best that can be done is
1845 to test multiple Conditions and make a decision of a *single* branch,
1846 based on analysis of a *Vector* of CR Fields
1847 which have just been calculated from a *Vector* of results.
1848
1849 In 3D Shader
1850 binaries, which are inherently parallelised and predicated, testing all or
1851 some results and branching based on multiple tests is extremely common,
1852 and a fundamental part of Shader Compilers. Example:
1853 without such multi-condition
1854 test-and-branch, if a predicate mask is all zeros a large batch of
1855 instructions may be masked out to `nop`, and it would waste
1856 CPU cycles to run them. 3D GPU ISAs can test for this scenario
1857 and, with the appropriate predicate-analysis instruction,
1858 jump over fully-masked-out operations, by spotting that
1859 *all* Conditions are false.
1860
1861 Unless Branches are aware and capable of such analysis, additional
1862 instructions would be required which perform Horizontal Cumulative
1863 analysis of Vectorised Condition Register Fields, in order to
1864 reduce the Vector of CR Fields down to one single yes or no
1865 decision that a Scalar-only v3.0B Branch-Conditional could cope with.
1866 Such instructions would be unavoidable, required, and costly
1867 by comparison to a single Vector-aware Branch.
1868 Therefore, in order to be commercially competitive, `sv.bc` and
1869 other Vector-aware Branch Conditional instructions are a high priority
1870 for 3D GPU (and OpenCL-style) workloads.
1871
1872 Given that Power ISA v3.0B is already quite powerful, particularly
1873 the Condition Registers and their interaction with Branches, there
1874 are opportunities to create extremely flexible and compact
1875 Vectorised Branch behaviour. In addition, the side-effects (updating
1876 of CTR, truncation of VL, described below) make it a useful instruction
1877 even if the branch points to the next instruction (no actual branch).
1878
1879 ## Overview
1880
1881 When considering an "array" of branch-tests, there are four
1882 primarily-useful modes:
1883 AND, OR, NAND and NOR of all Conditions.
1884 NAND and NOR may be synthesised from AND and OR by
1885 inverting `BO[1]` which just leaves two modes:
1886
1887 * Branch takes place on the **first** CR Field test to succeed
1888 (a Great Big OR of all condition tests). Exit occurs
1889 on the first **successful** test.
1890 * Branch takes place only if **all** CR field tests succeed:
1891 a Great Big AND of all condition tests. Exit occurs
1892 on the first **failed** test.
1893
1894 Early-exit is enacted such that the Vectorised Branch does not
1895 perform needless extra tests, which will help reduce reads on
1896 the Condition Register file.
1897
1898 *Note: Early-exit is **MANDATORY** (required) behaviour.
1899 Branches **MUST** exit at the first sequentially-encountered
1900 failure point, for
1901 exactly the same reasons for which it is mandatory in
1902 programming languages doing early-exit: to avoid
1903 damaging side-effects and to provide deterministic
1904 behaviour. Speculative testing of Condition
1905 Register Fields is permitted, as is speculative calculation
1906 of CTR, as long as, as usual in any Out-of-Order microarchitecture,
1907 that speculative testing is cancelled should an early-exit occur.
1908 i.e. the speculation must be "precise": Program Order must be preserved*
1909
1910 Also note that when early-exit occurs in Horizontal-first Mode,
1911 srcstep, dststep etc. are all reset, ready to begin looping from the
1912 beginning for the next instruction. However for Vertical-first
1913 Mode srcstep etc. are incremented "as usual" i.e. an early-exit
1914 has no special impact, regardless of whether the branch
1915 occurred or not. This can leave srcstep etc. in what may be
1916 considered an unusual
1917 state on exit from a loop and it is up to the programmer to
1918 reset srcstep, dststep etc. to known-good values
1919 *(easily achieved with `setvl`)*.
1920
1921 Additional useful behaviour involves two primary Modes (both of
1922 which may be enabled and combined):
1923
1924 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
1925 for Arithmetic SVP64 operations, with more
1926 flexibility and a close interaction and integration into the
1927 underlying base Scalar v3.0B Branch instruction.
1928 Truncation of VL takes place around the early-exit point.
1929 * **CTR-test Mode**: gives much more flexibility over when and why
1930 CTR is decremented, including options to decrement if a Condition
1931 test succeeds *or if it fails*.
1932
1933 With these side-effects, basic Boolean Logic Analysis advises that
1934 it is important to provide a means
1935 to enact them each based on whether testing succeeds *or fails*. This
1936 results in a not-insignificant number of additional Mode Augmentation bits,
1937 accompanying VLSET and CTR-test Modes respectively.
1938
1939 Predicate skipping or zeroing may, as usual with SVP64, be controlled
1940 by `sz`.
1941 Where the predicate is masked out and
1942 zeroing is enabled, then in such circumstances
1943 the same Boolean Logic Analysis dictates that
1944 rather than testing only against zero, the option to test
1945 against one is also prudent. This introduces a new
1946 immediate field, `SNZ`, which works in conjunction with
1947 `sz`.
1948
1949
1950 Vectorised Branches can be used
1951 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
1952 at an element level, the behaviour is identical in both Modes,
1953 although the `ALL` bit is meaningless in Vertical-First Mode.
1954
1955 It is also important
1956 to bear in mind that, fundamentally, Vectorised Branch-Conditional
1957 is still extremely close to the Scalar v3.0B Branch-Conditional
1958 instructions, and that the same v3.0B Scalar Branch-Conditional
1959 instructions are still
1960 *completely separate and independent*, being unaltered and
1961 unaffected by their SVP64 variants in every conceivable way.
1962
1963 *Programming note: One important point is that SVP64 instructions are 64 bit.
1964 (8 bytes not 4). This needs to be taken into consideration when computing
1965 branch offsets: the offset is relative to the start of the instruction,
1966 which **includes** the SVP64 Prefix*
1967
1968 ## Format and fields
1969
1970 With element-width overrides being meaningless for Condition
1971 Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
1972 Mode bits.
1973
1974 SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5,
1975 and `ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch
1976 Conditional:
1977
1978 | 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
1979 | - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
1980 |ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | simple mode |
1981 |ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode |
1982 |ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode |
1983 |ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
1984
1985 Brief description of fields:
1986
1987 * **sz=1** if predication is enabled and `sz=1` and a predicate
1988 element bit is zero, `SNZ` will
1989 be substituted in place of the CR bit selected by `BI`,
1990 as the Condition tested.
1991 Contrast this with
1992 normal SVP64 `sz=1` behaviour, where *only* a zero is put in
1993 place of masked-out predicate bits.
1994 * **sz=0** When `sz=0` skipping occurs as usual on
1995 masked-out elements, but unlike all
1996 other SVP64 behaviour which entirely skips an element with
1997 no related side-effects at all, there are certain
1998 special circumstances where CTR
1999 may be decremented. See CTR-test Mode, below.
2000 * **ALL** when set, all branch conditional tests must pass in order for
2001 the branch to succeed. When clear, it is the first sequentially
2002 encountered successful test that causes the branch to succeed.
2003 This is identical behaviour to how programming languages perform
2004 early-exit on Boolean Logic chains.
2005 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
2006 In VLSET mode, VL *may* (depending on `VSb`) be truncated.
2007 If VLI (Vector Length Inclusive) is clear,
2008 VL is truncated to *exclude* the current element, otherwise it is
2009 included. SVSTATE.MVL is not altered: only VL.
2010 * **SL** identical to `LR` except applicable to SVSTATE. If `SL`
2011 is set, SVSTATE is transferred to SVLR (conditionally on
2012 whether `SLu` is set).
2013 * **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
2014 * **LRu**: Link Register Update, used in conjunction with LK=1
2015 to make LR update conditional
2016 * **VSb** In VLSET Mode, after testing,
2017 if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
2018 VL is truncated if a test *fails*. Masked-out (skipped)
2019 bits are not considered
2020 part of testing when `sz=0`
2021 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
2022 tested. CTR inversion decrements if a test *fails*. Only relevant
2023 in CTR-test Mode.
2024
2025 LRu and CTR-test modes are where SVP64 Branches subtly differ from
2026 Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
2027 `sv.bcl/lru` will only update LR if the branch succeeds.
2028
2029 Of special interest is that when using ALL Mode (Great Big AND
2030 of all Condition Tests), if `VL=0`,
2031 which is rare but can occur in Data-Dependent Modes, the Branch
2032 will always take place because there will be no failing Condition
2033 Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
2034 of all Condition Tests) and `VL=0` the Branch is guaranteed not
2035 to occur because there will be no *successful* Condition Tests
2036 to make it happen.
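
This VL=0 corner case mirrors the conventional definition of AND/OR over
an empty set, as a short Python analogy shows:

```
print(all([]))  # True  -> ALL-mode Branch with VL=0 always succeeds
print(any([]))  # False -> non-ALL (OR) mode with VL=0 never succeeds
```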
2037
2038 ## Vectorised CR Field numbering, and Scalar behaviour
2039
2040 It is important to keep in mind that just like all SVP64 instructions,
2041 the `BI` field of the base v3.0B Branch Conditional instruction
2042 may be extended by SVP64 EXTRA augmentation, as well as be marked
2043 as either Scalar or Vector. It is also crucially important to keep in mind
2044 that for CRs, SVP64 sequentially increments the CR *Field* numbers.
2045 CR *Fields* are treated as elements, not bit-numbers of the CR *register*.
2046
2047 The `BI` operand of Branch Conditional operations is five bits, in scalar
2048 v3.0B this would select one bit of the 32 bit CR,
2049 comprising eight CR Fields of 4 bits each. In SVP64 there are
16 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT GT EQ SO), and the top 3 bits
2052 are extended to either scalar or vector and to select CR Fields 0..127
2053 as specified in SVP64 [[sv/svp64/appendix]].
2054
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply:
2057 the Vector loop ends at the first element tested
2058 (the first CR *Field*), after taking
2059 predication into consideration. Thus, also as usual, when a predicate mask is
2060 given, and `BI` marked as scalar, and `sz` is zero, srcstep
2061 skips forward to the first non-zero predicated element, and only that
2062 one element is tested.
2063
2064 In other words, the fact that this is a Branch
2065 Operation (instead of an arithmetic one) does not result, ultimately,
2066 in significant changes as to
2067 how SVP64 is fundamentally applied, except with respect to:
2068
2069 * the unique properties associated with conditionally
2070 changing the Program
2071 Counter (aka "a Branch"), resulting in early-out
2072 opportunities
2073 * CTR-testing
2074
2075 Both are outlined below, in later sections.
2076
2077 ## Horizontal-First and Vertical-First Modes
2078
2079 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
2080 AND) results in early exit: no more updates to CTR occur (if requested);
2081 no branch occurs, and LR is not updated (if requested). Likewise for
2082 non-ALL mode (Great Big Or) on first success early exit also occurs,
2083 however this time with the Branch proceeding. In both cases the testing
2084 of the Vector of CRs should be done in linear sequential order (or in
2085 REMAP re-sequenced order): such that tests that are sequentially beyond
2086 the exit point are *not* carried out. (*Note: it is standard practice in
2087 Programming languages to exit early from conditional tests, however
2088 a little unusual to consider in an ISA that is designed for Parallel
2089 Vector Processing. The reason is to have strictly-defined guaranteed
2090 behaviour*)
2091
2092 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
2093 behaviour. Given that only one element is being tested at a time
2094 in Vertical-First Mode, a test designed to be done on multiple
2095 bits is meaningless.
2096
2097 ## Description and Modes
2098
2099 Predication in both INT and CR modes may be applied to `sv.bc` and other
2100 SVP64 Branch Conditional operations, exactly as they may be applied to
2101 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
2102 operations are not included in condition testing, exactly like all other
2103 SVP64 operations, *including* side-effects such as potentially updating
2104 LR or CTR, which will also be skipped. There is *one* exception here,
2105 which is when
2106 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
2107 predicate mask bit is also zero:
2108 under these special circumstances CTR will also decrement.
2109
2110 When `sz` is non-zero, this normally requests insertion of a zero
2111 in place of the input data, when the relevant predicate mask bit is zero.
2112 This would mean that a zero is inserted in place of `CR[BI+32]` for
2113 testing against `BO`, which may not be desirable in all circumstances.
2114 Therefore, an extra field is provided `SNZ`, which, if set, will insert
2115 a **one** in place of a masked-out element, instead of a zero.
2116
2117 (*Note: Both options are provided because it is useful to deliberately
2118 cause the Branch-Conditional Vector testing to fail at a specific point,
2119 controlled by the Predicate mask. This is particularly useful in `VLSET`
2120 mode, which will truncate SVSTATE.VL at the point of the first failed
2121 test.*)
2122
Normally, CTR mode will decrement once per Condition Test, with the result
that, under normal circumstances, CTR reduces by up to VL in Horizontal-First
2125 Mode. Just as when v3.0B Branch-Conditional saves at
2126 least one instruction on tight inner loops through auto-decrementation
2127 of CTR, likewise it is also possible to save instruction count for
2128 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
2129 in circumstances where there is conditional interaction between the
2130 element computation and testing, and the continuation (or otherwise)
2131 of a given loop. The potential combinations of interactions is why CTR
2132 testing options have been added.
2133
2134 Also, the unconditional bit `BO[0]` is still relevant when Predication
2135 is applied to the Branch because in `ALL` mode all nonmasked bits have
2136 to be tested, and when `sz=0` skipping occurs.
2137 Even when VLSET mode is not used, CTR
2138 may still be decremented by the total number of nonmasked elements,
2139 acting in effect as either a popcount or cntlz depending on which
2140 mode bits are set.
2141 In short, Vectorised Branch becomes an extremely powerful tool.
2142
2143 **Micro-Architectural Implementation Note**: *when implemented on
2144 top of a Multi-Issue Out-of-Order Engine it is possible to pass
2145 a copy of the predicate and the prerequisite CR Fields to all
2146 Branch Units, as well as the current value of CTR at the time of
2147 multi-issue, and for each Branch Unit to compute how many times
2148 CTR would be subtracted, in a fully-deterministic and parallel
2149 fashion. A SIMD-based Branch Unit, receiving and processing
2150 multiple CR Fields covered by multiple predicate bits, would
2151 do the exact same thing. Obviously, however, if CTR is modified
2152 within any given loop (mtctr) the behaviour of CTR is no longer
2153 deterministic.*
2154
2155 ### Link Register Update
2156
2157 For a Scalar Branch, unconditional updating of the Link Register
2158 LR is useful and practical. However, if a loop of CR Fields is
2159 tested, unconditional updating of LR becomes problematic.
2160
2161 For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
2162 LR's value will be unconditionally overwritten after the first element,
2163 such that for execution (testing) of the second element, LR
2164 has the value `CIA+8`. This is covered in the `bclrl` example, in
2165 a later section.
2166
2167 The addition of a LRu bit modifies behaviour in conjunction
2168 with LK, as follows:
2169
2170 * `sv.bc` When LRu=0,LK=0, Link Register is not updated
2171 * `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
2172 * `sv.bcl/lru` When LRu=1,LK=1, Link Register will
2173 only be updated if the Branch Condition fails.
2174 * `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
2175 the Branch Condition succeeds.
2176
2177 This avoids
2178 destruction of LR during loops (particularly Vertical-First
2179 ones).
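
A small Python truth-table (an illustrative model of the `lr_ok` logic
in the pseudocode later in this section) shows how LK and LRu combine
with the per-element branch outcome:

```
def lr_updated(lk, lru, branch_taken):
    lr_ok = lk
    if branch_taken and lru:
        lr_ok = not lr_ok    # LRu inverts the decision on a taken branch
    return bool(lr_ok)

for lk in (0, 1):
    for lru in (0, 1):
        print("LK", lk, "LRu", lru,
              "taken:", lr_updated(lk, lru, True),
              "not-taken:", lr_updated(lk, lru, False))
```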
2180
2181 **SVLR and SVSTATE**
2182
2183 For precisely the reasons why `LK=1` was added originally to the Power
2184 ISA, with SVSTATE being a peer of the Program Counter it becomes
2185 necessary to also add an SVLR (SVSTATE Link Register)
2186 and corresponding control bits `SL` and `SLu`.
2187
2188 ### CTR-test
2189
2190 Where a standard Scalar v3.0B branch unconditionally decrements
2191 CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
2192 which allows CTR to be used for many more types of Vector loops
2193 constructs.
2194
2195 CTR-test mode and CTi interaction is as follows: note that
2196 `BO[2]` is still required to be clear for CTR decrements to be
2197 considered, exactly as is the case in Scalar Power ISA v3.0B
2198
2199 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
2200 if `BO[2]` is zero. Masked-out elements when `sz=0` are
2201 skipped (i.e. CTR is *not* decremented when the predicate
2202 bit is zero and `sz=0`).
2203 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
2204 if `BO[2]` is zero and a masked-out element is skipped
(`sz=0` and predicate bit is zero). This one special case is the
**opposite** of other combinations, as well as being
completely different from normal SVP64 `sz=0` behaviour.
2208 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
2209 if `BO[2]` is zero and the Condition Test succeeds.
2210 Masked-out elements when `sz=0` are skipped (including
2211 not decrementing CTR)
2212 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
2213 if `BO[2]` is zero and the Condition Test *fails*.
2214 Masked-out elements when `sz=0` are skipped (including
2215 not decrementing CTR)
2216
2217 `CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
2218 only time in the entirety of SVP64 that has side-effects when
2219 a predicate mask bit is clear. **All** other SVP64 operations
2220 entirely skip an element when sz=0 and a predicate mask bit is zero.
2221 It is also critical to emphasise that in this unusual mode,
2222 no other side-effects occur: **only** CTR is decremented, i.e. the
2223 rest of the Branch operation is skipped.
2224
2225 ### VLSET Mode
2226
2227 VLSET Mode truncates the Vector Length so that subsequent instructions
2228 operate on a reduced Vector Length. This is similar to
2229 Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
2230 truncation occurs at the Branch decision-point.
2231
2232 Interestingly, due to the side-effects of `VLSET` mode
2233 it is actually useful to use Branch Conditional even
2234 to perform no actual branch operation, i.e to point to the instruction
2235 after the branch. Truncation of VL would thus conditionally occur yet control
2236 flow alteration would not.
2237
2238 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
2239 is designed to be used for explicit looping, where an explicit call to
2240 `svstep` is required to move both srcstep and dststep on to
2241 the next element, until VL (or other condition) is reached.
2242 Vertical-First Looping is expected (required) to terminate if the end
2243 of the Vector, VL, is reached. If however that loop is terminated early
2244 because VL is truncated, VLSET with Vertical-First becomes meaningless.
2245 Resolving this would require two branches: one Conditional, the other
2246 branching unconditionally to create the loop, where the Conditional
2247 one jumps over it.
2248
2249 Therefore, with `VSb`, the option to decide whether truncation should occur if the
2250 branch succeeds *or* if the branch condition fails allows for the flexibility
2251 required. This allows a Vertical-First Branch to *either* be used as
2252 a branch-back (loop) *or* as part of a conditional exit or function
2253 call from *inside* a loop, and for VLSET to be integrated into both
2254 types of decision-making.
2255
2256 In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
2257 place if success conditions are met, but on exit from that loop
2258 (branch condition fails), VL will be truncated. This is extremely
2259 useful.
2260
2261 `VLSET` mode with Horizontal-First when `VSb=0` is still
2262 useful, because it can be used to truncate VL to the first predicated
2263 (non-masked-out) element.
2264
2265 The truncation point for VL, when VLi is clear, must not include skipped
2266 elements that preceded the current element being tested.
2267 Example: `sz=0, VLi=0, predicate mask = 0b110010` and the Condition
2268 Register failure point is at CR Field element 4.
2269
2270 * Testing at element 0 is skipped because its predicate bit is zero
2271 * Testing at element 1 passed
2272 * Testing elements 2 and 3 are skipped because their
2273 respective predicate mask bits are zero
2274 * Testing element 4 fails therefore VL is truncated to **2**
2275 not 4 due to elements 2 and 3 being skipped.
2276
If `sz=1` in the above example *then* VL would have been set to 4 because
with zeroing enabled the masked-out elements are still effectively part of
the Vector (with their respective test bits set to `SNZ`)
2280
2281 If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
2282 of the element actually being tested.
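
The worked example above can be modelled directly in Python (an
illustrative sketch using LSB0 predicate bit numbering; not normative):

```
def truncate_vl(pred, fail_elem, sz, vli):
    if vli:
        return fail_elem + 1      # VLi=1: include the failing element
    if sz:
        return fail_elem          # sz=1: zeroed/SNZ elements still count
    # sz=0, VLi=0: skip back over masked-out elements
    i = fail_elem - 1
    while i >= 0 and not (pred >> i) & 1:
        i -= 1
    return i + 1

pred = 0b110010                   # elements 1, 4 and 5 are predicated-in
print(truncate_vl(pred, 4, sz=0, vli=0))  # 2
print(truncate_vl(pred, 4, sz=1, vli=0))  # 4
print(truncate_vl(pred, 4, sz=0, vli=1))  # 5
```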
2283
2284 ### VLSET and CTR-test combined
2285
2286 If both CTR-test and VLSET Modes are requested, it's important to
2287 observe the correct order. What occurs depends on whether VLi
2288 is enabled, because VLi affects the length, VL.
2289
2290 If VLi (VL truncate inclusive) is set:
2291
2292 1. compute the test including whether CTR triggers
2293 2. (optionally) decrement CTR
2294 3. (optionally) truncate VL (VSb inverts the decision)
2295 4. decide (based on step 1) whether to terminate looping
2296 (including not executing step 5)
2297 5. decide whether to branch.
2298
2299 If VLi is clear, then when a test fails that element
2300 and any following it
2301 should **not** be considered part of the Vector. Consequently:
2302
2303 1. compute the branch test including whether CTR triggers
2304 2. if the test fails against VSb, truncate VL to the *previous*
2305 element, and terminate looping. No further steps executed.
2306 3. (optionally) decrement CTR
2307 4. decide whether to branch.
2308
2309 ## Boolean Logic combinations
2310
2311 In a Scalar ISA, Branch-Conditional testing even of vector
2312 results may be performed through inversion of tests. NOR of
2313 all tests may be performed by inversion of the scalar condition
2314 and branching *out* from the scalar loop around elements,
2315 using scalar operations.
2316
2317 In a parallel (Vector) ISA it is the ISA itself which must perform
2318 the prerequisite logic manipulation.
Thus for SVP64 there are an extraordinary number of necessary combinations
2320 which provide completely different and useful behaviour.
2321 Available options to combine:
2322
2323 * `BO[0]` to make an unconditional branch would seem irrelevant if
2324 it were not for predication and for side-effects (CTR Mode
2325 for example)
2326 * Enabling CTR-test Mode and setting `BO[2]` can still result in the
2327 Branch
2328 taking place, not because the Condition Test itself failed, but
2329 because CTR reached zero **because**, as required by CTR-test mode,
2330 CTR was decremented as a **result** of Condition Tests failing.
2331 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
2332 * `R30` and `~R30` and other predicate mask options including CR and
2333 inverted CR bit testing
2334 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
2335 predicate bits
2336 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
2337 `OR` of all tests, respectively.
2338 * Predicate Mask bits, which combine in effect with the CR being
2339 tested.
2340 * Inversion of Predicate Masks (`~r3` instead of `r3`, or using
2341 `NE` rather than `EQ`) which results in an additional
2342 level of possible ANDing, ORing etc. that would otherwise
2343 need explicit instructions.
2344
2345 The most obviously useful combinations here are to set `BO[1]` to zero
2346 in order to turn `ALL` into Great-Big-NAND and `ANY` into
2347 Great-Big-NOR. Other Mode bits which perform behavioural inversion then
2348 have to work round the fact that the Condition Testing is NOR or NAND.
2349 The alternative to not having additional behavioural inversion
2350 (`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
2351 branch directly after the first, which the first branch jumps over.
2352 This contrivance is avoided by the behavioural inversion bits.
2353
2354 ## Pseudocode and examples
2355
2356 Please see the SVP64 appendix regarding CR bit ordering and for
2357 the definition of `CR{n}`
2358
2359 For comparative purposes this is a copy of the v3.0B `bc` pseudocode
2360
2361 ```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
2371 ```
2372
2373 Below is simplified pseudocode including LRu and CTR skipping, which
2374 illustrates clearly that SVP64 Scalar Branches (VL=1) are **not**
2375 identical to v3.0B Scalar Branches. The key areas where differences
2376 occur are the inclusion of predication (which can still be used when
2377 VL=1), when and why CTR is decremented (CTRtest Mode), and whether LR
2378 is updated (which is unconditional in v3.0B when LK=1, and conditional
2379 in SVP64 when LRu=1).
2380
2381 Inline comments highlight the fact that the Scalar Branch behaviour
2382 and pseudocode are still clearly visible and embedded within the
2383 Vectorised variant:
2384
2385 ```
2386 if (mode_is_64bit) then M <- 0
2387 else M <- 32
2388 # the bit of CR to test, if the predicate bit is zero,
2389 # is overridden
2390 testbit = CR[BI+32]
2391 if ¬predicate_bit then testbit = SVRMmode.SNZ
2392 # otherwise apart from the override ctr_ok and cond_ok
2393 # are exactly the same
2394 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2395 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2396 if ¬predicate_bit & ¬SVRMmode.sz then
2397 # this is entirely new: CTR-test mode still decrements CTR
2398 # even when predicate-bits are zero
2399 if ¬BO[2] & CTRtest & ¬CTi then
2400 CTR = CTR - 1
2401 # instruction finishes here
2402 else
2403 # usual BO[2] CTR-mode now under CTR-test mode as well
2404 if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
2405 # new VLset mode, conditional test truncates VL
2406 if VLSET and VSb = (cond_ok & ctr_ok) then
2407 if SVRMmode.VLI then SVSTATE.VL = srcstep+1
2408 else SVSTATE.VL = srcstep
2409 # usual LR is now conditional, but also joined by SVLR
2410 lr_ok <- LK
2411 svlr_ok <- SVRMmode.SL
2412 if ctr_ok & cond_ok then
2413 if AA then NIA <-iea EXTS(BD || 0b00)
2414 else NIA <-iea CIA + EXTS(BD || 0b00)
2415 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2416 if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
2417 if lr_ok then LR <-iea CIA + 4
2418 if svlr_ok then SVLR <- SVSTATE
2419 ```
2420
2421 Below is the pseudocode for SVP64 Branches, which is a little less
2422 obvious but functionally identical to the above. The lack of
2423 obviousness is down to the early-exit opportunities.
2424
2425 Effective pseudocode for Horizontal-First Mode:
2426
2427 ```
2428 if (mode_is_64bit) then M <- 0
2429 else M <- 32
2430 cond_ok = SVRMmode.ALL  # AND-identity for ALL, OR-identity for ANY
2431 for srcstep in range(VL):
2432 # select predicate bit or zero/one
2433 if predicate[srcstep]:
2434 # get SVP64 extended CR field 0..127
2435 SVCRf = SVP64EXTRA(BI>>2)
2436 CRbits = CR{SVCRf}
2437 testbit = CRbits[BI & 0b11]
2438 # testbit = CR[BI+32+srcstep*4]
2439 else if not SVRMmode.sz:
2440 # inverted CTR test skip mode
2441         if ¬BO[2] & CTRtest & ¬CTi then
2442 CTR = CTR - 1
2443 continue # skip to next element
2444 else
2445 testbit = SVRMmode.SNZ
2446 # actual element test here
2447 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2448 el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2449 # check if CTR dec should occur
2450 ctrdec = ¬BO[2]
2451 if CTRtest & (el_cond_ok ^ CTi) then
2452 ctrdec = 0b0
2453 if ctrdec then CTR <- CTR - 1
2454 # merge in the test
2455 if SVRMmode.ALL:
2456 cond_ok &= (el_cond_ok & ctr_ok)
2457 else
2458 cond_ok |= (el_cond_ok & ctr_ok)
2459 # test for VL to be set (and exit)
2460 if VLSET and VSb = (el_cond_ok & ctr_ok) then
2461 if SVRMmode.VLI then SVSTATE.VL = srcstep+1
2462 else SVSTATE.VL = srcstep
2463 break
2464 # early exit?
2465 if SVRMmode.ALL != (el_cond_ok & ctr_ok):
2466 break
2467 # SVP64 rules about Scalar registers still apply!
2468 if SVCRf.scalar:
2469 break
2470 # loop finally done, now test if branch (and update LR)
2471 lr_ok <- LK
2472 svlr_ok <- SVRMmode.SL
2473 if cond_ok then
2474 if AA then NIA <-iea EXTS(BD || 0b00)
2475 else NIA <-iea CIA + EXTS(BD || 0b00)
2476 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2477 if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
2478 if lr_ok then LR <-iea CIA + 4
2479 if svlr_ok then SVLR <- SVSTATE
2480 ```
2481
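The early exit in the Horizontal-First loop above never changes the outcome
of the `ALL`/`ANY` reduction itself. The following Python sketch
(non-normative; the names are assumptions for illustration) mirrors the
accumulation and early-exit above and checks the equivalence exhaustively for
short vectors of element-test results:

```
from itertools import product


def reduce_with_early_exit(tests, all_mode):
    """Sketch: the Horizontal-First accumulation plus early exit gives
    the same branch decision as a plain ALL/ANY reduction."""
    cond_ok = all_mode                 # AND identity (1) or OR identity (0)
    for el_ok in tests:
        if all_mode:
            cond_ok = cond_ok and el_ok
        else:
            cond_ok = cond_ok or el_ok
        if all_mode != el_ok:          # result can no longer change
            break                      # early exit
    return cond_ok


# exhaustive check over short vectors of element-test results
for n in range(4):
    for tests in product([False, True], repeat=n):
        for all_mode in (False, True):
            plain = all(tests) if all_mode else any(tests)
            assert reduce_with_early_exit(list(tests), all_mode) == plain
```
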
2482 Pseudocode for Vertical-First Mode:
2483
2484 ```
2485 # get SVP64 extended CR field 0..127
2486 SVCRf = SVP64EXTRA(BI>>2)
2487 CRbits = CR{SVCRf}
2488 # select predicate bit or zero/one
2489 if predicate[srcstep]:
2490 if BRc = 1 then # CR0 vectorised
2491 CR{SVCRf+srcstep} = CRbits
2492 testbit = CRbits[BI & 0b11]
2493 else if not SVRMmode.sz:
2494 # inverted CTR test skip mode
2495     if ¬BO[2] & CTRtest & ¬CTi then
2496 CTR = CTR - 1
2497 SVSTATE.srcstep = new_srcstep
2498 exit # no branch testing
2499 else
2500 testbit = SVRMmode.SNZ
2501 # actual element test here
2502 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
2503 # test for VL to be set (and exit)
2504 if VLSET and cond_ok = VSb then
2505 if SVRMmode.VLI
2506 SVSTATE.VL = new_srcstep+1
2507 else
2508 SVSTATE.VL = new_srcstep
2509 ```
2510
2511 ### Example Shader code
2512
2513 ```
2514 // assume f() g() or h() modify a and/or b
2515 while(a > 2) {
2516 if(b < 5)
2517 f();
2518 else
2519 g();
2520 h();
2521 }
2522 ```
2523
2524 which compiles to something like:
2525
2526 ```
2527 vec<i32> a, b;
2528 // ...
2529 pred loop_pred = a > 2;
2530 // loop continues while any of a elements greater than 2
2531 while(loop_pred.any()) {
2532 // vector of predicate bits
2533 pred if_pred = loop_pred & (b < 5);
2534 // only call f() if at least 1 bit set
2535 if(if_pred.any()) {
2536 f(if_pred);
2537 }
2538 label1:
2539 // loop mask ANDs with inverted if-test
2540 pred else_pred = loop_pred & ~if_pred;
2541 // only call g() if at least 1 bit set
2542 if(else_pred.any()) {
2543 g(else_pred);
2544 }
2545 h(loop_pred);
2546 }
2547 ```
2548
2549 which will end up as:
2550
2551 ```
2552 # start from while loop test point
2553 b looptest
2554 while_loop:
2555 sv.cmpi CR80.v, b.v, 5 # vector compare b into CR80 Vector
2556 sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
2557 # only calculate loop_pred & pred_b because needed in f()
2558 sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
2559 f(CR80.v.SO)
2560 skip_f:
2561 # illustrate inversion of pred_b: invert r30 and test ALL
2562 # rather than ANY. A masked-out zero test would FAIL, so
2563 # masked-out elements are instead tested against 1 (SNZ) not 0
2564 sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
2565 # else = loop & ~pred_b, need this because used in g()
2566 sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
2567 g(CR80.v.SO)
2568 skip_g:
2569 # conditionally call h(r30) if any loop pred set
2570 sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
2571 looptest:
2572 sv.cmpi CR60.v, a.v, 2 # vector compare a into CR60 vector
2573 sv.crweird r30, CR60.GT # transfer GT vector to r30
2574 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
2575 end:
2576 ```
2577
2578 ### LRu example
2579
2580 This example shows why `LRu` is useful in a loop. Imagine the
2581 following C code:
2582
2583 ```
2584 for (int i = 0; i < 8; i++) {
2585 if (x < y) break;
2586 }
2587 ```
2588
2589 Under these circumstances exiting from the loop is not based
2590 solely on CTR: it has become conditional on a CR result.
2591 Thus it is desirable that NIA *and* LR be modified only
2592 if the conditions are met.
2593
2594
2595 v3.0 pseudocode for `bclrl`:
2596
2597 ```
2598 if (mode_is_64bit) then M <- 0
2599 else M <- 32
2600 if ¬BO[2] then CTR <- CTR - 1
2601 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
2602 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
2603 if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
2604 if LK then LR <-iea CIA + 4
2605 ```
2606
2607 The latter part for SVP64 `bclrl` becomes:
2608
2609 ```
2610 for i in 0 to VL-1:
2611 ...
2612 ...
2613 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
2614 lr_ok <- LK
2615 if ctr_ok & cond_ok then
2616 NIA <-iea LR[0:61] || 0b00
2617 if SVRMmode.LRu then lr_ok <- ¬lr_ok
2618 if lr_ok then LR <-iea CIA + 4
2619 # if NIA modified exit loop
2620 ```
2621
2622 The reason should be clear from this being a Vector loop:
2623 unconditional destruction of LR when LK=1 makes `sv.bclrl`
2624 ineffective, because the intention going into the loop is
2625 that the branch should be to the copy of LR set at the *start*
2626 of the loop, not halfway through it.
2627 If, however, the change to LR occurs only when
2628 the branch is taken, then it becomes a useful instruction.
2629
2630 The following pseudocode should **not** be implemented because
2631 it violates the fundamental principle of SVP64, which is that
2632 SVP64 looping is a thin wrapper around Scalar Instructions.
2633 The pseudocode below is more like an actual Vector ISA Branch and
2634 as such is not at all appropriate:
2635
2636 ```
2637 for i in 0 to VL-1:
2638 ...
2639 ...
2640 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
2641 if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
2642 # only at the end of looping is LK checked.
2643 # this completely violates the design principle of SVP64
2644 # and would actually need to be a separate (scalar)
2645 # instruction "set LR to CIA+4 but retrospectively"
2646 # which is clearly impossible
2647 if LK then LR <-iea CIA + 4
2648 ```
2649
2650 [[!tag standards]]