[[!tag standards]]

# SVP64 Branch Conditional behaviour

**DRAFT STATUS**

Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
the SVP64 Vectorised Variants**.

It is also
extremely important to note that Branches are the
sole semi-exception in SVP64 to `Scalar Identity Behaviour`.
SVP64 Branches contain additional modes that are useful
for scalar operations (i.e. even when VL=1 or when
using single-bit predication).

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004678.html>
* [[openpower/isa/branch]]
* [[sv/cr_int_predication]]

# Rationale

Scalar v3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
Condition Register. However, for parallel processing it is impossible
to perform multiple independent branches: the Program Counter
cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a *single* branch decision,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.

In 3D Shader
binaries, which are inherently parallelised and predicated, testing all or
some results and branching based on multiple tests is extremely common,
and a fundamental part of Shader Compilers. For example,
without such multi-condition
test-and-branch, if a predicate mask is all zeros a large batch of
instructions may be masked out to `nop`, and it would waste
CPU cycles to run them. 3D GPU ISAs can test for this scenario
and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that
*all* Conditions are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorised Condition Register Fields, in order to
reduce the Vector of CR Fields down to one single yes or no
decision that a Scalar-only v3.0B Branch-Conditional could cope with.
Such instructions would be unavoidable and costly
by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU (and OpenCL-style) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there
are opportunities to create extremely flexible and compact
Vectorised Branch behaviour. In addition, the side-effects (updating
of CTR, truncation of VL, described below) make it a useful instruction
even if the branch points to the next instruction (no actual branch).

# Overview

When considering an "array" of branch-tests, there are four
primarily-useful modes:
AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by
inverting `BO[1]`, which leaves just two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests)
* Branch takes place only if **all** CR Field tests succeed
  (a Great Big AND of all condition tests)

Early-exit is enacted such that the Vectorised Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour.
Branches **MUST** exit at the first sequentially-encountered
failure point, for
exactly the same reasons for which it is mandatory in
programming languages doing early-exit: to avoid
damaging side-effects and to provide deterministic
behaviour. Speculative testing of Condition
Register Fields is permitted, as is speculative calculation
of CTR, as long as, as usual in any Out-of-Order microarchitecture,
that speculative testing is cancelled should an early-exit occur,
i.e. the speculation must be "precise": Program Order must be preserved.*
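The two modes and the mandatory early-exit can be modelled in a few lines of Python. This is only an illustrative sketch: `vector_branch` and its arguments are hypothetical names, and the list of booleans stands in for the results of testing each CR Field in element order.

```python
def vector_branch(tests, all_mode):
    """Reduce per-element test results to one branch decision.

    tests:    sequence of booleans, one per CR Field, in element order
    all_mode: True for the Great Big AND, False for the Great Big OR
    Returns (branch_taken, number_of_elements_tested).
    """
    taken = all_mode   # identity value: AND starts true, OR starts false
    tested = 0
    for ok in tests:
        tested += 1
        taken = ok
        # AND exits on the first failure, OR exits on the first success
        if ok != all_mode:
            break
    return taken, tested
```

The second value returned shows how many CR Fields had to be read: elements sequentially beyond the exit point are never examined.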

Also note that when early-exit occurs in Horizontal-first Mode,
srcstep, dststep etc. are all reset, ready to begin looping from the
beginning for the next instruction. However for Vertical-first
Mode srcstep etc. are incremented "as usual", i.e. an early-exit
has no special impact, regardless of whether the branch
occurred or not. This can leave srcstep etc. in what may be
considered an unusual
state on exit from a loop and it is up to the programmer to
reset srcstep, dststep etc. to known-good values.

Additional useful behaviour involves two primary Modes (both of
which may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that
it is important to provide a means
to enact each of them based on whether testing succeeds *or fails*. This
results in a not-insignificant number of additional Mode Augmentation bits,
accompanying VLSET and CTR-test Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled
by `sz`.
Where the predicate is masked out and
zeroing is enabled, the same Boolean Logic Analysis dictates that,
rather than testing only against zero, the option to test
against one is also prudent. This introduces a new
immediate field, `SNZ`, which works in conjunction with
`sz`.

Vectorised Branches can be used
in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
at an element level, the behaviour is identical in both Modes,
although the `ALL` bit is meaningless in Vertical-First Mode.

It is also important
to bear in mind that, fundamentally, Vectorised Branch-Conditional
is still extremely close to the Scalar v3.0B Branch-Conditional
instructions, and that the same v3.0B Scalar Branch-Conditional
instructions are still
*completely separate and independent*, being unaltered and
unaffected by their SVP64 variants in every conceivable way.

*Programming note: SVP64 instructions are 64 bits long
(8 bytes, not 4). This needs to be taken into consideration when computing
branch offsets: the offset is relative to the start of the instruction,
which **includes** the SVP64 Prefix.*
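As a hedged illustration of the programming note, the sketch below computes the `BD` field of a relative `sv.bc` from the address of the SVP64 Prefix. The helper names `bd_field` and `fall_through` are hypothetical. The 14-bit width of `BD` and the `EXTS(BD || 0b00)` displacement are taken from the v3.0B `bc` definition.

```python
SVP64_INSN_BYTES = 8   # 4-byte prefix followed by the 4-byte suffix

def bd_field(cia, target):
    """Return the 14-bit BD field for a relative sv.bc located at cia.

    cia is the address of the SVP64 Prefix, i.e. the start of the whole
    64-bit instruction. The displacement is EXTS(BD || 0b00), so it must
    be a multiple of 4 within the signed 16-bit range.
    """
    disp = target - cia
    if disp % 4 != 0:
        raise ValueError("branch target must be word-aligned")
    if not -0x8000 <= disp <= 0x7FFC:
        raise ValueError("displacement out of range for BD")
    return (disp >> 2) & 0x3FFF

def fall_through(cia):
    """Address of the next sequential instruction after an SVP64 branch."""
    return cia + SVP64_INSN_BYTES
```

A branch that "points to the next instruction" therefore needs a displacement of 8 bytes, not 4.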

# Format and fields

With element-width overrides being meaningless for Condition
Register Fields, bits 4 to 7 of SVP64 RM may be used for additional
Mode bits.

SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:

| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
|ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | normal mode |
|ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode |
|ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode |
|ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |

TODO bits 17,18 for SVSTATE-variant of LR and LRu.

Brief description of fields:

* **sz=1** if predication is enabled and `sz=1` and a predicate
  element bit is zero, `SNZ` will
  be substituted in place of the CR bit selected by `BI`,
  as the Condition tested.
  Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0** When `sz=0` skipping occurs as usual on
  masked-out elements, but unlike all
  other SVP64 behaviour which entirely skips an element with
  no related side-effects at all, there are certain
  special circumstances where CTR
  may be decremented. See CTR-test Mode, below.
* **ALL** when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI** VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **LRu**: Link Register Update, used in conjunction with LK=1
  to make LR update conditional
* **VSb** In VLSET Mode, after testing,
  if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
  VL is truncated if a test *fails*. Masked-out (skipped)
  bits are not considered
  part of testing when `sz=0`
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
`sv.bc/lru` will only update LR if the branch succeeds, and
`sv.bcl/lru` only if it fails.

Of special interest is that when using ALL Mode (Great Big AND
of all Condition Tests), if `VL=0`,
which is rare but can occur in Data-Dependent Modes, the Branch
will always take place because there will be no failing Condition
Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests
to make it happen.
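The `VL=0` outcomes are the usual identity values of AND and OR. Python's built-in `all()` and `any()` behave the same way on an empty sequence, which makes a convenient sanity check. This is an analogy only, not part of the specification, and the variable names are illustrative.

```python
# an empty Vector of Condition Tests (VL=0)
no_tests = []

branch_taken_all_mode = all(no_tests)   # Great Big AND: nothing failed
branch_taken_any_mode = any(no_tests)   # Great Big OR: nothing succeeded
```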

# Vectorised CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction
may be extended by SVP64 EXTRA augmentation, as well as be marked
as either Scalar or Vector. It is also crucially important to keep in mind
that for CRs, SVP64 sequentially increments the CR *Field* numbers.
CR *Fields* are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits. In scalar
v3.0B this selects one bit of the 32-bit CR,
comprising eight CR Fields of 4 bits each. In SVP64 there are
16 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT, GT, EQ, SO), and the top 3 bits
are extended to either scalar or vector and to select CR Fields 0..127
as specified in SVP64 [[sv/svp64/appendix]].
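A minimal sketch of that split follows. The function names are hypothetical, and the way EXTRA augmentation turns the top 3 bits into a CR Field number 0..127 is deliberately not modelled here: `first_field` is simply assumed to be the already-extended result.

```python
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")   # bits 0..3 within a CR Field

def split_bi(bi):
    """Split the 5-bit scalar BI operand into (CR Field, bit name)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    return bi >> 2, CR_BIT_NAMES[bi & 0b11]

def element_bit(first_field, bit_name, srcstep):
    """Locate the bit tested for element srcstep of a Vector of CR Fields.

    first_field is the CR Field number (0..127) after SVP64 EXTRA
    augmentation (assumed to be computed elsewhere).
    """
    field = first_field + srcstep   # elements are whole CR Fields
    if not 0 <= field < 128:
        raise ValueError("CR Field out of range")
    return field, bit_name
```

Stepping to the next element moves on by a whole CR Field (four bits of CR), while the bit selected within each Field stays the same.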

When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
the usual SVP64 rules apply:
the Vector loop ends at the first element tested
(the first CR *Field*), after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` is marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.

In other words, the fact that this is a Branch
Operation (instead of an arithmetic one) does not result, ultimately,
in significant changes as to
how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program
  Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below, in later sections.

# Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order): such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice in
Programming languages to exit early from conditional tests, however
a little unusual to consider in an ISA that is designed for Parallel
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour.*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time
in Vertical-First Mode, a test designed to be done on multiple
bits is meaningless.

# Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as it may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when
`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero:
under these special circumstances CTR will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field `SNZ` is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)
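The selection of the bit that is actually tested can be sketched as follows. This is a hedged illustration with a hypothetical function name; it reuses the v3.0B expression `cond_ok = BO[0] | ¬(testbit ^ BO[1])` and returns `None` to represent an element that is skipped rather than tested.

```python
def element_test(cr_bit, pred_bit, sz, snz, bo0, bo1):
    """Return the per-element cond_ok, or None if the element is skipped.

    cr_bit:   the CR bit selected by BI for this element (0 or 1)
    pred_bit: predicate mask bit for this element (0 = masked out)
    """
    if pred_bit:
        testbit = cr_bit
    elif sz:
        testbit = snz   # zeroing: substitute SNZ rather than always zero
    else:
        return None     # sz=0: element is skipped, not tested
    # same test as v3.0B: cond_ok = BO[0] | ¬(testbit ^ BO[1])
    return bool(bo0 or not (testbit ^ bo1))
```

With `BO[1]=1` (test for a one) a masked-out element fails when `SNZ=0` and passes when `SNZ=1`, which is how the predicate mask can be used to force, or to avoid, a failure at a chosen element.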

Normally, CTR mode will decrement once per Condition Test, so that
under normal circumstances CTR reduces by up to VL in Horizontal-First
Mode. Just as v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, it is likewise possible to save instruction count for
SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
in circumstances where there is conditional interaction between the
element computation and testing, and the continuation (or otherwise)
of a given loop. The potential combinations of interactions are why CTR
testing options have been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs.
Even when VLSET mode is not used, CTR
may still be decremented by the total number of nonmasked elements,
acting in effect as either a popcount or cntlz depending on which
mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.
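The popcount effect can be illustrated with a deliberately simplified model. The assumptions are: Horizontal-First Mode, `BO[0]=1` and `BO[2]=0`, `sz=0`, neither CTR-test nor VLSET enabled, and a CTR that starts well above the number of elements so that the CTR test never ends the loop early. The function name is hypothetical.

```python
def ctr_after_branch(ctr, vl, predicate):
    """CTR after one predicated sv.bc under the assumptions listed above."""
    for srcstep in range(vl):
        if (predicate >> srcstep) & 1:
            ctr -= 1   # one decrement per nonmasked element
        # masked-out elements are skipped (sz=0): no decrement
    return ctr
```

Under these assumptions CTR falls by exactly the number of predicate bits that are set within the first VL elements.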

**Micro-Architectural Implementation Note**: *when implemented on
top of a Multi-Issue Out-of-Order Engine it is possible to pass
a copy of the predicate and the prerequisite CR Fields to all
Branch Units, as well as the current value of CTR at the time of
multi-issue, and for each Branch Unit to compute how many times
CTR would be subtracted, in a fully-deterministic and parallel
fashion. A SIMD-based Branch Unit, receiving and processing
multiple CR Fields covered by multiple predicate bits, would
do the exact same thing. Obviously, however, if CTR is modified
within any given loop (mtctr) the behaviour of CTR is no longer
deterministic.*

## Link Register Update

For a Scalar Branch, unconditional updating of the Link Register
LR is useful and practical. However, if a loop of CR Fields is
tested, unconditional updating of LR becomes problematic.

For example when using `sv.bclrl` with `LRu=0,LK=1` in Horizontal-First Mode,
LR's value will be unconditionally overwritten after the first element,
such that for execution (testing) of the second element, LR
has the value `CIA+8`. This is covered in the `bclrl` example, in
a later section.

The addition of a LRu bit modifies behaviour in conjunction
with LK, as follows:

* `sv.bc` When LRu=0,LK=0, Link Register is not updated
* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
  only be updated if the Branch Condition fails.
* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
  the Branch Condition succeeds.

This avoids
destruction of LR during loops (particularly Vertical-First
ones).
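The four combinations reduce to a single rule: LRu inverts the effect of LK, but only on the path where the branch is taken. The sketch below is a hypothetical helper that follows the `lr_ok` logic of the pseudocode in the "Pseudocode and examples" section.

```python
def lr_is_updated(lk, lru, branch_taken):
    """Return True if LR is written, given LK, LRu and the branch outcome."""
    lr_ok = bool(lk)
    if branch_taken and lru:
        lr_ok = not lr_ok   # LRu flips the decision only when branching
    return lr_ok

# (LK, LRu) for each mnemonic listed above
MNEMONICS = {"sv.bc": (0, 0), "sv.bcl": (1, 0),
             "sv.bcl/lru": (1, 1), "sv.bc/lru": (0, 1)}
```

Evaluating `lr_is_updated` for each entry of `MNEMONICS`, with the branch both taken and not taken, reproduces the list above.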

## CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements
CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
which allows CTR to be used for many more types of Vector loop
constructs.

CTR-test mode and CTi interaction is as follows. Note that
`BO[2]` is still required to be clear for CTR decrements to be
considered, exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR)
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR)

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
only time in the entirety of SVP64 that has side-effects when
a predicate mask bit is clear. **All** other SVP64 operations
entirely skip an element when sz=0 and a predicate mask bit is zero.
It is also critical to emphasise that in this unusual mode,
no other side-effects occur: **only** CTR is decremented, i.e. the
rest of the Branch operation is skipped.

## VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to
Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
truncation occurs at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
after the branch. Truncation of VL would thus conditionally occur yet control
flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to
the next element, until VL (or other condition) is reached.
Vertical-First Looping is expected (required) to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
Resolving this would require two branches: one Conditional, the other
branching unconditionally to create the loop, where the Conditional
one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails allows for the flexibility
required. This allows a Vertical-First Branch to *either* be used as
a branch-back (loop) *or* as part of a conditional exit or function
call from *inside* a loop, and for VLSET to be integrated into both
types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
place if success conditions are met, but on exit from that loop
(branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.

The truncation point for VL, when VLI is clear, must not include skipped
elements that preceded the current element being tested.
Example: `sz=0, VLI=0, predicate mask = 0b110010` and the Condition
Register failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing elements 2 and 3 are skipped because their
  respective predicate mask bits are zero
* Testing element 4 fails therefore VL is truncated to **2**
  not 4 due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of the
Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
of the element actually being tested.
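The worked example above can be reproduced with a short model. This sketch follows the prose rules of this section for `VSb=0` (truncate at the first failing test). The function and argument names are hypothetical, and `masked_passes` stands in for the combined effect of `SNZ` and `BO[1]` on masked-out elements when `sz=1`.

```python
def vlset_truncation(vl, predicate, passes, sz, vli, masked_passes=True):
    """Return the new VL after a VLSET branch with VSb=0.

    predicate: integer mask, bit n covers element n
    passes:    per-element results of the Condition Test
    """
    last_tested = -1   # most recent element that was actually tested
    for step in range(vl):
        masked_out = not ((predicate >> step) & 1)
        if masked_out and not sz:
            continue   # sz=0: skipped, not part of testing
        ok = masked_passes if masked_out else passes[step]
        if not ok:
            # VLI=1 includes the failing element. VLI=0 excludes it and
            # also any skipped elements immediately before it.
            return step + 1 if vli else last_tested + 1
        last_tested = step
    return vl          # no failure: VL is left unchanged
```

With the mask `0b110010` and a failure at element 4, this gives 2 for `sz=0, VLI=0`, 4 for `sz=1, VLI=0`, and 5 whenever `VLI=1`, matching the text.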

## VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it is important to
observe the correct order. What occurs depends on whether VLI
is enabled, because VLI affects the length, VL.

If VLI (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If VLI is clear, then when a test fails that element
and any following it
should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

# Boolean Logic combinations

In a Scalar ISA, Branch-Conditional testing even of vector
results may be performed through inversion of tests. NOR of
all tests may be performed by inversion of the scalar condition
and branching *out* from the scalar loop around elements,
using scalar operations.

In a parallel (Vector) ISA it is the ISA itself which must perform
the prerequisite logic manipulation.
Thus for SVP64 there are an extraordinary number of necessary combinations
which provide completely different and useful behaviour.
Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode and setting `BO[2]` can still result in the
  Branch
  taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combinations here are to set `BO[1]` to zero
in order to turn `ALL` into Great-Big-NAND and `ANY` into
Great-Big-NOR. Other Mode bits which perform behavioural inversion then
have to work round the fact that the Condition Testing is NOR or NAND.
Without the additional behavioural inversion bits
(`SNZ`, `VSb`, `CTi`) the alternative would be to have a second (unconditional)
branch directly after the first, which the first branch jumps over.
This contrivance is avoided by the behavioural inversion bits.

# Pseudocode and examples

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`.

For comparative purposes this is a copy of the v3.0B `bc` pseudocode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```

The simplified pseudocode below includes LRu and CTR skipping, and illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are
the inclusion of predication (which can still be used when VL=1),
when and why CTR is decremented (CTRtest Mode) and whether LR is
updated (which is unconditional in v3.0B when LK=1, and conditional
in SVP64 when LRu=1).

```
if (mode_is_64bit) then M <- 0
else M <- 32
testbit = CR[BI+32]
if ¬predicate_bit then testbit = SVRMmode.SNZ
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
if ¬predicate_bit & ¬SVRMmode.sz then
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    # instruction finishes here
else
    if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
    if VLSET and VSb = (cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
    lr_ok <- LK
    svlr_ok <- SVRMmode.SL
    if ctr_ok & cond_ok then
        if AA then NIA <-iea EXTS(BD || 0b00)
        else NIA <-iea CIA + EXTS(BD || 0b00)
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
    if lr_ok then LR <-iea CIA + 4
    if svlr_ok then SVLR <- SVSTATE
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down
to the early-exit opportunities.

Effective pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
# AND starts from 1, OR starts from 0 (also gives the VL=0 result)
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
if lr_ok then LR <-iea CIA + 4
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI then
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

# Example Shader code

```
// assume f() g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any of a elements greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
    // re-test the loop condition (the looptest label in the assembly)
    loop_pred = a > 2;
}
```

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5     # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2     # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT    # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```
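The predicated form of the shader loop (the `loop_pred` / `if_pred` / `else_pred` version) can be checked against the original scalar loop with a small Python model. Everything here is an assumption made for illustration: the bodies of `f`, `g` and `h` are invented so that the loop terminates, lanes are plain list elements, and `any()` plays the role of the multi-condition test-and-branch that skips a fully masked-out call.

```python
def f(a, b): return a - 1, b + 2   # hypothetical loop bodies, chosen
def g(a, b): return a - 2, b       # only so that every lane terminates
def h(a, b): return a, b + 1

def scalar_loop(a, b):
    """Reference: the original while-loop, run for a single lane."""
    while a > 2:
        a, b = f(a, b) if b < 5 else g(a, b)
        a, b = h(a, b)
    return a, b

def apply_masked(fn, a, b, pred):
    """Call fn on the lanes whose predicate bit is set; leave the rest."""
    for i, active in enumerate(pred):
        if active:
            a[i], b[i] = fn(a[i], b[i])

def vector_loop(a, b):
    a, b = list(a), list(b)
    loop_pred = [x > 2 for x in a]
    while any(loop_pred):
        if_pred = [lp and y < 5 for lp, y in zip(loop_pred, b)]
        if any(if_pred):               # skip f() when fully masked out
            apply_masked(f, a, b, if_pred)
        else_pred = [lp and not ip for lp, ip in zip(loop_pred, if_pred)]
        if any(else_pred):             # skip g() when fully masked out
            apply_masked(g, a, b, else_pred)
        apply_masked(h, a, b, loop_pred)
        loop_pred = [x > 2 for x in a]  # re-test, as at the looptest label
    return a, b
```

Running `vector_loop` on a few lanes and comparing each lane with `scalar_loop` is a quick way to confirm that the predicated version performs the same per-lane computation.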

# TODO LRu example

Show why LRu would be useful in a loop. Imagine the following
C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances exiting from the loop is not only
based on CTR: it has become conditional on a CR result.
Thus it is desirable that NIA *and* LR only be modified
if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```

the latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason why should be clear from this being a Vector loop:
unconditional destruction of LR when LK=1 makes `sv.bclrl`
ineffective, because the intention going into the loop is
that the branch should be to the copy of LR set at the *start*
of the loop, not half way through it.
However if the change to LR only occurs if
the branch is taken then it becomes a useful instruction.

The following pseudocode should **not** be implemented because
it violates the fundamental principle of SVP64 which is that
SVP64 looping is a thin wrapper around Scalar Instructions.
The pseudocode below is more an actual Vector ISA Branch and
as such is not at all appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```