[[!tag standards]]
# SVP64 Branch Conditional behaviour

**DRAFT STATUS**

Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
the SVP64 Vectorised Variants**.

It is also
extremely important to note that Branches are the
sole semi-exception in SVP64 to `Scalar Identity Behaviour`.
SVP64 Branches contain additional modes that are useful
for scalar operations (i.e. even when VL=1 or when
using single-bit predication).

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004678.html>
* [[openpower/isa/branch]]
* [[sv/cr_int_predication]]

# Rationale

Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
Condition Register. However for parallel processing it is impossible
to perform multiple independent branches: the Program Counter
cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a decision on a *single* branch,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.

In 3D Shader
binaries, which are inherently parallelised and predicated, testing all or
some results and branching based on multiple tests is extremely common,
and a fundamental part of Shader Compilers. Example:
without such multi-condition
test-and-branch, if a predicate mask is all zeros a large batch of
instructions may be masked out to `nop`, and it would waste
CPU cycles to run them. 3D GPU ISAs can test for this scenario
and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that
*all* Conditions are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorised Condition Register Fields, in order to
reduce the Vector of CR Fields down to one single yes or no
decision that a Scalar-only v3.0B Branch-Conditional could cope with.
Such instructions would be unavoidable, and costly
by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU (and OpenCL-style) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there
are opportunities to create extremely flexible and compact
Vectorised Branch behaviour. In addition, the side-effects (updating
of CTR, truncation of VL, described below) make it a useful instruction
even if the branch points to the next instruction (no actual branch).

# Overview

When considering an "array" of branch-tests, there are four
primarily-useful modes:
AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by
inverting `BO[1]` which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests)
* Branch takes place only if **all** CR field tests succeed
  (a Great Big AND of all condition tests)

Early-exit is enacted such that the Vectorised Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

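As an illustration only, the two modes and their early-exit can be
modelled in a few lines of Python. This is a sketch, not the
specification: the function name `vector_branch_decision` and its
arguments are hypothetical, and each entry of `tests` stands for the
already-computed outcome of one CR Field test, in element order.

```python
def vector_branch_decision(tests, all_mode):
    """Reduce per-element test outcomes to a single branch decision.

    tests    -- iterable of booleans, one per CR Field test (hypothetical input)
    all_mode -- True for a Great Big AND, False for a Great Big OR
    Returns (branch_taken, number_of_tests_performed).
    """
    performed = 0
    for passed in tests:
        performed += 1
        if all_mode and not passed:
            return False, performed   # first failure decides a Great Big AND
        if not all_mode and passed:
            return True, performed    # first success decides a Great Big OR
    # no early exit: AND saw no failure, OR saw no success
    return all_mode, performed
```
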
*Note: Early-exit is **MANDATORY** (required) behaviour.
Branches **MUST** exit at the first sequentially-encountered
failure point, for
exactly the same reasons for which it is mandatory in
programming languages doing early-exit: to avoid
damaging side-effects and to provide deterministic
behaviour. Speculative testing of Condition
Register Fields is permitted, as is speculative calculation
of CTR, as long as, as usual in any Out-of-Order microarchitecture,
that speculative testing is cancelled should an early-exit occur,
i.e. the speculation must be "precise": Program Order must be preserved.*

Also note that when early-exit occurs in Horizontal-first Mode,
srcstep, dststep etc. are all reset, ready to begin looping from the
beginning for the next instruction. However for Vertical-first
Mode srcstep etc. are incremented "as usual" i.e. an early-exit
has no special impact, regardless of whether the branch
occurred or not. This can leave srcstep etc. in what may be
considered an unusual
state on exit from a loop and it is up to the programmer to
reset srcstep, dststep etc. to known-good values.

Additional useful behaviour involves two primary Modes (both of
which may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that
it is important to provide a means
to enact each of them based on whether testing succeeds *or fails*. This
results in a not-insignificant number of additional Mode Augmentation bits,
accompanying VLSET and CTR-test Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled
by `sz`.
Where the predicate is masked out and
zeroing is enabled,
the same Boolean Logic Analysis dictates that
rather than testing only against zero, the option to test
against one is also prudent. This introduces a new
immediate field, `SNZ`, which works in conjunction with
`sz`.

Vectorised Branches can be used
in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
at an element level, the behaviour is identical in both Modes,
although the `ALL` bit is meaningless in Vertical-First Mode.

It is also important
to bear in mind that, fundamentally, Vectorised Branch-Conditional
is still extremely close to the Scalar v3.0B Branch-Conditional
instructions, and that the same v3.0B Scalar Branch-Conditional
instructions are still
*completely separate and independent*, being unaltered and
unaffected by their SVP64 variants in every conceivable way.

*Programming note: SVP64 instructions are 64 bits long
(8 bytes, not 4). This needs to be taken into consideration when computing
branch offsets: the offset is relative to the start of the instruction,
which **includes** the SVP64 Prefix.*

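The arithmetic behind the programming note can be sketched as follows.
This is a hedged illustration, assuming the 14-bit `BD` field of the
v3.0B `bc` encoding and assuming that `cia` is the address of the SVP64
Prefix. The helper name `bd_field` and the constant name are hypothetical.

```python
SVP64_INSN_BYTES = 8  # SVP64 Prefix plus the 32-bit instruction that follows it

def bd_field(cia, target):
    """Return the 14-bit BD field for a relative branch from cia to target.

    cia is taken to be the address of the SVP64 Prefix (an assumption
    of this sketch, following the programming note above).
    """
    offset = target - cia
    if offset % 4:
        raise ValueError("branch targets must be word-aligned")
    if not -(1 << 15) <= offset < (1 << 15):
        raise ValueError("offset does not fit in EXTS(BD || 0b00)")
    return (offset >> 2) & 0x3FFF

# a branch to "the next instruction" must skip all 8 bytes, not 4
fall_through = bd_field(0x1000, 0x1000 + SVP64_INSN_BYTES)
```
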
# Format and fields

With element-width overrides being meaningless for Condition
Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
Mode bits.

SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:

| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21  | 22 23  | description         |
| - | - | - | - | -- | -- | -- | -- | --- | ------ | ------------------- |
|ALL|SNZ| / | / | SL |SLu | 0  | 0  | /   | LRu sz | normal mode         |
|ALL|SNZ| / |VSb| SL |SLu | 0  | 1  | VLI | LRu sz | VLSET mode          |
|ALL|SNZ|CTi| / | SL |SLu | 1  | 0  | /   | LRu sz | CTR-test mode       |
|ALL|SNZ|CTi|VSb| SL |SLu | 1  | 1  | VLI | LRu sz | CTR-test+VLSET mode |

Brief description of fields:

* **sz=1** if predication is enabled and `sz=1` and a predicate
  element bit is zero, `SNZ` will
  be substituted in place of the CR bit selected by `BI`,
  as the Condition tested.
  Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0** When `sz=0` skipping occurs as usual on
  masked-out elements, but unlike all
  other SVP64 behaviour which entirely skips an element with
  no related side-effects at all, there are certain
  special circumstances where CTR
  may be decremented. See CTR-test Mode, below.
* **ALL** when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI** VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **SL** identical to `LR` except applicable to SVSTATE. If `SL`
  is set, SVSTATE is transferred to SVLR (conditionally on
  whether `SLu` is set).
* **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
* **LRu**: Link Register Update, used in conjunction with LK=1
  to make LR update conditional
* **VSb** In VLSET Mode, after testing,
  if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
  VL is truncated if a test *fails*. Masked-out (skipped)
  bits are not considered
  part of testing when `sz=0`
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
with `LRu` set the update of LR becomes conditional on the outcome of
the branch (see Link Register Update, below).

Of special interest is that when using ALL Mode (Great Big AND
of all Condition Tests), if `VL=0`,
which is rare but can occur in Data-Dependent Modes, the Branch
will always take place because there will be no failing Condition
Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests
to make it happen.

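The `VL=0` outcome follows the same convention as the empty-sequence
cases of Python's built-in `all()` and `any()`. The sketch below is
illustrative only, and the function name `branch_taken` is hypothetical.

```python
def branch_taken(element_results, all_mode):
    """Great Big AND when all_mode is set, Great Big OR otherwise."""
    return all(element_results) if all_mode else any(element_results)

# VL=0 means there are no element results at all
vl0_all = branch_taken([], all_mode=True)    # no failing test: branch taken
vl0_any = branch_taken([], all_mode=False)   # no successful test: not taken
```
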
# Vectorised CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction
may be extended by SVP64 EXTRA augmentation, as well as be marked
as either Scalar or Vector. It is also crucially important to keep in mind
that for CRs, SVP64 sequentially increments the CR *Field* numbers.
CR *Fields* are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits: in scalar
v3.0B this selects one bit of the 32-bit CR,
comprising eight CR Fields of 4 bits each. In SVP64 there are
sixteen 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT, GT, EQ or SO), and the top 3 bits
are extended to either scalar or vector and to select CR Fields 0..127
as specified in SVP64 [[sv/svp64/appendix]].

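The split of the unaugmented 5-bit `BI` can be sketched as below. This
is an illustration under stated assumptions: it covers only the scalar
v3.0B interpretation (CR Fields 0..7), it does not model the SVP64 EXTRA
extension to Fields 0..127, and the names `split_bi` and `CR_BIT_NAMES`
are hypothetical.

```python
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")  # bit order within one v3.0B CR Field

def split_bi(bi):
    """Split a 5-bit BI into (CR Field number, name of the bit within it)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    cr_field = bi >> 2        # top 3 bits: the part that SVP64 EXTRA extends
    bit_in_field = bi & 0b11  # 2 LSBs: which bit of the CR Field is tested
    return cr_field, CR_BIT_NAMES[bit_in_field]
```
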
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply:
the Vector loop ends at the first element tested
(the first CR *Field*), after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.

In other words, the fact that this is a Branch
Operation (instead of an arithmetic one) does not result, ultimately,
in significant changes as to
how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program
  Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below, in later sections.

# Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order): such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice in
Programming languages to exit early from conditional tests, however
a little unusual to consider in an ISA that is designed for Parallel
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time
in Vertical-First Mode, a test designed to be done on multiple
bits is meaningless.

# Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when
`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero:
under these special circumstances CTR will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)

Normally, CTR mode will decrement once per Condition Test, with the
result that, under normal circumstances, CTR reduces by up to VL in
Horizontal-First Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
in circumstances where there is conditional interaction between the
element computation and testing, and the continuation (or otherwise)
of a given loop. The potential combinations of interactions are why CTR
testing options have been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs.
Even when VLSET mode is not used, CTR
may still be decremented by the total number of nonmasked elements,
acting in effect as either a popcount or cntlz depending on which
mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.

**Micro-Architectural Implementation Note**: *when implemented on
top of a Multi-Issue Out-of-Order Engine it is possible to pass
a copy of the predicate and the prerequisite CR Fields to all
Branch Units, as well as the current value of CTR at the time of
multi-issue, and for each Branch Unit to compute how many times
CTR would be subtracted, in a fully-deterministic and parallel
fashion. A SIMD-based Branch Unit, receiving and processing
multiple CR Fields covered by multiple predicate bits, would
do the exact same thing. Obviously, however, if CTR is modified
within any given loop (mtctr) the behaviour of CTR is no longer
deterministic.*

## Link Register Update

For a Scalar Branch, unconditional updating of the Link Register
LR is useful and practical. However, if a loop of CR Fields is
tested, unconditional updating of LR becomes problematic.

For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
LR's value will be unconditionally overwritten after the first element,
such that for execution (testing) of the second element, LR
has the value `CIA+8`. This is covered in the `bclrl` example, in
a later section.

The addition of a LRu bit modifies behaviour in conjunction
with LK, as follows:

* `sv.bc` When LRu=0,LK=0, Link Register is not updated
* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
  only be updated if the Branch Condition fails.
* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
  the Branch Condition succeeds.

This avoids
destruction of LR during loops (particularly Vertical-First
ones).

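The four combinations listed above reduce to a single inversion, which
the following sketch models. It is illustrative only: the function name
`lr_is_updated` is hypothetical, and the logic simply mirrors the
`lr_ok` handling of the pseudocode later in this page.

```python
def lr_is_updated(lk, lru, branch_taken):
    """Return True if LR is written, given LK, LRu and the branch outcome."""
    lr_ok = bool(lk)
    # LRu inverts the LK decision, but only when the branch is taken
    if branch_taken and lru:
        lr_ok = not lr_ok
    return lr_ok
```
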
## CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements
CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
which allows CTR to be used for many more types of Vector loop
constructs.

CTR-test mode and CTi interaction is as follows: note that
`BO[2]` is still required to be clear for CTR decrements to be
considered, exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR)
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR)

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
only time in the entirety of SVP64 that has side-effects when
a predicate mask bit is clear. **All** other SVP64 operations
entirely skip an element when sz=0 and a predicate mask bit is zero.
It is also critical to emphasise that in this unusual mode,
no other side-effects occur: **only** CTR is decremented, i.e. the
rest of the Branch operation is skipped.

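As a reading aid, the per-element decision described by the four bullet
points above can be written as one function. This is a sketch of one
interpretation of the prose only (in particular, it assumes that
`CTR-test=0, CTi=1` decrements *only* for skipped elements); it is not
derived from the pseudocode later in this page. The name
`ctr_decrements` and its arguments are hypothetical.

```python
def ctr_decrements(bo2, ctr_test, cti, sz, pred_bit, cond_ok):
    """Return True if CTR is decremented for this one element."""
    if bo2:
        return False                      # BO[2] set: CTR is never touched
    skipped = (not pred_bit) and (not sz)  # masked-out and not zeroing
    if not ctr_test:
        # CTi=0: every tested element; CTi=1: only the skipped ones
        return skipped if cti else not skipped
    if skipped:
        return False                      # CTR-test=1: skipped means no decrement
    # CTR-test=1: decrement on success (CTi=0) or on failure (CTi=1)
    return (not cond_ok) if cti else bool(cond_ok)
```
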
## VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to
Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
truncation occurs at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
after the branch. Truncation of VL would thus conditionally occur yet control
flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to
the next element, until VL (or other condition) is reached.
Vertical-First Looping is expected (required) to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
Resolving this would require two branches: one Conditional, the other
branching unconditionally to create the loop, where the Conditional
one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails allows for the flexibility
required. This allows a Vertical-First Branch to *either* be used as
a branch-back (loop) *or* as part of a conditional exit or function
call from *inside* a loop, and for VLSET to be integrated into both
types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
place if success conditions are met, but on exit from that loop
(branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.

The truncation point for VL, when VLI is clear, must not include skipped
elements that preceded the current element being tested.
Example: `sz=0, VLI=0, predicate mask = 0b110010` and the Condition
Register failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing elements 2 and 3 are skipped because their
  respective predicate mask bits are zero
* Testing element 4 fails therefore VL is truncated to **2**
  not 4 due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of the
Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
of the element actually being tested.

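The worked example can be reproduced with a short sketch. This encodes
one reading of the rule above (with `sz=0` and VLI clear, the run of
skipped elements immediately before the failure point is dropped), with
element 0 taken to be the least-significant bit of the predicate mask.
The function name `truncated_vl` is hypothetical and the code is
illustrative, not normative.

```python
def truncated_vl(mask, fail_index, sz, vli):
    """Return the new VL when the test at element fail_index triggers VLSET."""
    if vli:
        return fail_index + 1     # inclusive of the element being tested
    if sz:
        return fail_index         # zeroed elements still count as tested
    # sz=0: exclude skipped elements directly preceding the failure point
    vl = fail_index
    while vl > 0 and not ((mask >> (vl - 1)) & 1):
        vl -= 1
    return vl
```
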
## VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it is important to
observe the correct order. What occurs depends on whether VLI
is enabled, because VLI affects the length, VL.

If VLI (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If VLI is clear, then when a test fails that element
and any following it
should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

# Boolean Logic combinations

In a Scalar ISA, Branch-Conditional testing even of vector
results may be performed through inversion of tests. NOR of
all tests may be performed by inversion of the scalar condition
and branching *out* from the scalar loop around elements,
using scalar operations.

In a parallel (Vector) ISA it is the ISA itself which must perform
the prerequisite logic manipulation.
Thus for SVP64 there are an extraordinary number of necessary combinations
which provide completely different and useful behaviour.
Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode and setting `BO[2]` can still result in the
  Branch
  taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combinations here are to set `BO[1]` to zero
in order to turn `ALL` into Great-Big-NOR and `ANY` into
Great-Big-NAND of the CR bits being tested.
Other Mode bits which perform behavioural inversion then
have to work round the fact that the Condition Testing is NOR or NAND.
The alternative to having additional behavioural inversion
(`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
branch directly after the first, which the first branch jumps over.
This contrivance is avoided by the behavioural inversion bits.

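Which inverted form each mode yields follows from De Morgan's laws, and
can be checked exhaustively for small vectors. The sketch below is
illustrative only: it ignores `BO[0]`, predication and CTR, and the names
`cond_ok` and `reduce_tests` are hypothetical.

```python
from itertools import product

def cond_ok(testbit, bo1):
    """v3.0B-style element test with BO[0] ignored: not (CR bit xor BO[1])."""
    return not (testbit ^ bo1)

def reduce_tests(bits, bo1, all_mode):
    """Great Big AND (all_mode) or Great Big OR of the element tests."""
    results = [cond_ok(b, bo1) for b in bits]
    return all(results) if all_mode else any(results)

# check every 3-element vector of raw CR bits
checks = []
for bits in product((0, 1), repeat=3):
    checks.append(reduce_tests(bits, 1, True) == all(bits))         # AND
    checks.append(reduce_tests(bits, 1, False) == any(bits))        # OR
    checks.append(reduce_tests(bits, 0, True) == (not any(bits)))   # NOR
    checks.append(reduce_tests(bits, 0, False) == (not all(bits)))  # NAND
```
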
# Pseudocode and examples

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`

For comparative purposes this is a copy of the v3.0B `bc` pseudocode

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```

Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are
the inclusion of predication (which can still be used when VL=1), in
when and why CTR is decremented (CTRtest Mode) and whether LR is
updated (which is unconditional in v3.0B when LK=1, and conditional
in SVP64 when LRu=1).

```
if (mode_is_64bit) then M <- 0
else M <- 32
testbit = CR[BI+32]
if ¬predicate_bit then testbit = SVRMmode.SNZ
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
if ¬predicate_bit & ¬SVRMmode.sz then
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    # instruction finishes here
else
    if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
    if VLSET and VSb = (cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
    lr_ok <- LK
    svlr_ok <- SVRMmode.SL
    if ctr_ok & cond_ok then
        if AA then NIA <-iea EXTS(BD || 0b00)
        else NIA <-iea CIA + EXTS(BD || 0b00)
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
    if lr_ok then LR <-iea CIA + 4
    if svlr_ok then SVLR <- SVSTATE
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down
to the early-exit opportunities.

Effective pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
# Great Big AND starts out true, Great Big OR starts out false
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
if lr_ok then LR <-iea CIA + 4
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

# Example Shader code

```
// assume f() g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any of a elements greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5     # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2     # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```

# TODO LRu example

Show why LRu would be useful in a loop. Imagine the following
C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances exiting from the loop is not only
based on CTR: it has also become conditional on a CR result.
Thus it is desirable that NIA *and* LR only be modified
if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```

The latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason why should be clear from this being a Vector loop:
unconditional destruction of LR when LK=1 makes `sv.bclrl`
ineffective, because the intention going into the loop is
that the branch should be to the copy of LR set at the *start*
of the loop, not half way through it.
However if the change to LR only occurs if
the branch is taken then it becomes a useful instruction.

The following pseudocode should **not** be implemented because
it violates the fundamental principle of SVP64 which is that
SVP64 looping is a thin wrapper around Scalar Instructions.
The pseudocode below is more an actual Vector ISA Branch and
as such is not at all appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```
809 if LK then LR <-iea CIA + 4
810 ```