openpower/sv/svp64/appendix.mdwn

   1 # Appendix
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
   5
   6 This is the appendix to [[sv/svp64]], providing explanations of modes
   7 etc. leaving the main svp64 page's primary purpose as outlining the
   8 instruction format.
   9
  10 Table of contents:
  11
  12 [[!toc]]
  13
  14 # XER, SO and other global flags
  15
  16 Vector systems are expected to be high performance.  This is achieved
  17 through parallelism, which requires that elements in the vector be
  18 independent.  XER SO and other global "accumulation" flags (CR.OV) cause
  19 Read-Write Hazards on single-bit global resources, having a significant
  20 detrimental effect.
  21
  22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
  23 in `cmp` instructions).  XER is simply neither read nor written.
  24 This includes when `scalar identity behaviour` occurs.  If precise
  25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
  26 instructions should be used without an SV Prefix.
  27
  28 An interesting side-effect of this decision is that the OE flag is now
  29 free for other uses when SV Prefixing is used.
  30
  31 Regarding XER.CA: this does not fit either: it was designed for a scalar
  32 ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
  33 Vector element.  This provides a means to perform large parallel batches
  34 of Vectorised carry-capable additions.  crweird instructions can be used
  35 to transfer the CRs in and out of an integer, where bitmanipulation
  36 may be performed to analyse the carry bits (including carry lookahead
  37 propagation) before continuing with further parallel additions.
  38
  39 # v3.0B/v3.1 relevant instructions
  40
  41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
  42 CPU ISA.
  43
  44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
  45 alternative uses.  Additionally, Vectorisation of the VSX SIMD system
  46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
  47 at the very minimum, predication (which VSX was designed without).
  48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
  49 illegal instruction exceptions in SV Prefix Mode.
  50
  51 Likewise, `lq` (Load Quad), and Load/Store Multiple make no sense to
  52 have because they are not only provided by SV, the SV alternatives may
  53 be predicated as well, making them far better suited to use in function
  54 calls and context-switching.
  55
  56 Additionally, some v3.0/1 instructions simply make no sense at all in a
  57 Vector context: `rfid` falls into this category,
  58 as well as `sc` and `scv`.  Here there is simply no point
  59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
  60 should be called instead.
  61
  62 Fortuitously this leaves several Major Opcodes free for use by SV
  63 to fit alternative future instructions.  In a 3D context this means
  64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
  65 operations, and others critical to an efficient, effective 3D GPU and
  66 VPU ISA. With such instructions being included as standard in other
  67 commercially-successful GPU ISAs it is likewise critical that a 3D
  68 GPU/VPU based on svp64 also have such instructions.
  69
  70 Note however that svp64 is stand-alone and is in no way
  71 critically dependent on the existence or provision of 3D GPU or VPU
  72 instructions. These should be considered extensions, and their discussion
  73 and specification is out of scope for this document.
  74
  75 Note, again: this is *only* under svp64 prefixing.  Standard v3.0B /
  76 v3.1B is *not* altered by svp64 in any way.
  77
  78 ## Major opcode map (v3.0B)
  79
  80 This table is taken from v3.0B.
  81 Table 9: Primary Opcode Map (opcode bits 0:5)
  82
  83         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
  84     000 |        |       |  tdi  | twi   | EXT04 |        |       | mulli | 000
  85     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
  86     010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
  87     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
  88     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
  89     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   | lmw   | stmw  | 101
  90     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
  91     111 |  lq    | EXT57 | EXT58 | EXT59 | EXT60 | EXT61  | EXT62 | EXT63 | 111
  92         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
  93
  94 ## Suitable for svp64-only
  95
  96 This is the same table containing v3.0B Primary Opcodes except those that
  97 make no sense in a Vectorisation Context have been removed.  These removed
  98 POs can, *in the SV Vector Context only*, be assigned to alternative
  99 (Vectorised-only) instructions, including future extensions.
 100
 101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
 102 change.  When not prefixed with svp64 these opcodes **specifically**
 103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
 104
 105         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
 106     000 |        |       |       |       |       |        |       | mulli | 000
 107     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
 108     010 |        |       |       | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
 109     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
 110     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
 111     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   |       |       | 101
 112     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
 113     111 |        |       | EXT58 | EXT59 |       | EXT61  |       | EXT63 | 111
 114         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
 115
 116 It is important to note that having a different v3.0B Scalar opcode
 117 that is different from an SVP64 one is highly undesirable: the complexity
 118 in the decoder is greatly increased.
 119
 120 # Single Predication
 121
 122 This is a standard mode normally found in Vector ISAs.  every element in every source Vector and in the destination uses the same bit of one single predicate mask.
 123
 124 In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep: unlike Twin-Predication the two must be equal at all times.
 125
 126 # Twin Predication
 127
 128 This is a novel concept that allows predication to be applied to a single
 129 source and a single dest register.  The following types of traditional
 130 Vector operations may be encoded with it, *without requiring explicit
 131 opcodes to do so*
 132
 133 * VSPLAT (a single scalar distributed across a vector)
 134 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
 135 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
 136 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
 137 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
 138
 139 Those patterns (and more) may be applied to:
 140
 141 * mv (the usual way that V\* ISA operations are created)
 142 * exts\* sign-extension
 143 * rwlinm and other RS-RA shift operations (**note**: excluding
 144   those that take RA as both a src and dest. These are not
 145   1-src 1-dest, they are 2-src, 1-dest)
 146 * LD and ST (treating AGEN as one source)
 147 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
 148 * Condition Register ops mfcr, mtcr and other similar
 149
 150 This is a huge list that creates extremely powerful combinations,
 151 particularly given that one of the predicate options is `(1<<r3)`
 152
 153 Additional unusual capabilities of Twin Predication include a back-to-back
 154 version of VCOMPRESS-VEXPAND which is effectively the ability to do
 155 sequentially ordered multiple VINSERTs.  The source predicate selects a
 156 sequentially ordered subset of elements to be inserted; the destination
 157 predicate specifies the sequentially ordered recipient locations.
 158 This is equivalent to
 159 `llvm.masked.compressstore.*`
 160 followed by
 161 `llvm.masked.expandload.*`
 162
 163 # Reduce modes
 164
 165 Reduction in SVP64 is deterministic and somewhat of a misnomer.  A normal
 166 Vector ISA would have explicit Reduce opcodes with defined characteristics
 167 per operation: in SX Aurora there is even an additional scalar argument
 168 containing the initial reduction value. SVP64 fundamentally has to
 169 utilise *existing* Scalar Power ISA v3.0B operations, which presents some
 170 unique challenges.
 171
 172 The solution turns out to be to simply define reduction as permitting
 173 deterministic element-based schedules to be issued using the base Scalar
 174 operations, and to rely on the underlying microarchitecture to resolve
 175 Register Hazards at the element level.  This goes back to
 176 the fundamental principle that SV is nothing more than a Sub-Program-Counter
 177 sitting between Decode and Issue phases.
 178
 179 Microarchitectures *may* take opportunities to parallelise the reduction
 180 but only if in doing so they preserve Program Order at the Element Level.
 181 Opportunities where this is possible include an `OR` operation
 182 or a MIN/MAX operation: it may be possible to parallelise the reduction,
 183 but for Floating Point it is not permitted due to different results
 184 being obtained if the reduction is not executed in strict sequential
 185 order.
 186
 187 ## Scalar result reduce mode
 188
 189 Scalar Reduction per se does not exist, instead is implemented in SVP64
 190 as a simple and natural relaxation of the usual restriction on the Vector
 191 Looping which would terminate if the destination was marked as a Scalar.
 192 Scalar Reduction by contrast *keeps issuing Vector Element Operations*
 193 even though the destination register is marked as scalar.
 194 Thus it is up to the programmer to be aware of this and observe some
 195 conventions.  It is also important to appreciate that there is no
 196 actual imposition or restriction on how this mode is utilised: there
 197 will therefore be several valuable uses (including Vector Iteration)
 198 and it is up to the programmer to make best use of the capability
 199 provided.
 200
 201 In this mode, which is suited to operations involving carry or overflow,
 202 one register must be identified by the programmer as being the "accumulator".
 203 Scalar reduction is thus categorised by:
 204
 205 * One of the sources is a Vector
 206 * the destination is a scalar
 207 * optionally but most usefully when one source register is also the destination
 208 * That the source register type is the same as the destination register
 209   type identified as the "accumulator".  scalar reduction on `cmp`,
 210   `setb` or `isel` makes no sense for example because of the mixture
 211   between CRs and GPRs.
 212
 213 *Note that issuing instructions in Scalar reduce mode such as `setb`
 214 are neither `UNDEFINED` nor prohibited, despite them not making much
 215 sense at first glance.
 216 Scalar reduce is strictly defined behaviour, and the cost in
 217 hardware terms of prohibition of seemingly non-sensical operations is too great.
 218 Therefore it is permitted and required to be executed successfully.
 219 Implementors **MAY** choose to optimise such instructions in instances
 220 where their use results in "extraneous execution", i.e. where it is clear
 221 that the sequence of operations, comprising multiple overwrites to
 222 a scalar destination **without** cumulative, iterative, or reductive
 223 behaviour, may discard all but the last element operation.  Identification
 224 of such is trivial to do for `setb` and `cmp`: the source register type is
 225 a completely different register file from the destination*
 226
 227 Typical applications include simple operations such as `ADD r3, r10.v,
 228 r3` where, clearly, r3 is being used to accumulate the addition of all
 229 elements is the vector starting at r10.
 230
 231      # add RT, RA,RB but when RT==RA
 232      for i in range(VL):
 233           iregs[RA] += iregs[RB+i] # RT==RA
 234
 235 However, *unless* the operation is marked as "mapreduce", SV ordinarily
 236 **terminates** at the first scalar operation.  Only by marking the
 237 operation as "mapreduce" will it continue to issue multiple sub-looped
 238 (element) instructions in `Program Order`.
 239
 240 To.perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set.  This is useful for leaving a cumulative suffix sum in reverse order:
 241
 242     for i in (VL-1 downto 0):
 243         # RT-1 = RA gives a suffix sum
 244         iregs[RT+i] = iregs[RA+i] - iregs[RB+i]
 245
 246 Other examples include shift-mask operations where a Vector of inserts
 247 into a single destination register is required, as a way to construct
 248 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
 249 Using the same register as both the source and destination, with Vectors
 250 of different offsets masks and values to be inserted has multiple
 251 applications including Video, cryptography and JIT compilation.
 252
 253 Subtract and Divide are still permitted to be executed in this mode,
 254 although from an algorithmic perspective it is strongly discouraged.
 255 It would be better to use addition followed by one final subtract,
 256 or in the case of divide, to get better accuracy, to perform a multiply
 257 cascade followed by a final divide.
 258
 259 Note that single-operand or three-operand scalar-dest reduce is perfectly
 260 well permitted: the programmer may still declare one register, used as
 261 both a Vector source and Scalar destination, to be utilised as
 262 the "accumulator".  In the case of `sv.fmadds` and `sv.maddhw` etc
 263 this naturally fits well with the normal expected usage of these
 264 operations.
 265
 266 If an interrupt or exception occurs in the middle of the scalar mapreduce,
 267 the scalar destination register **MUST** be updated with the current
 268 (intermediate) result, because this is how ```Program Order``` is
 269 preserved (Vector Loops are to be considered to be just another way of issuing instructions
 270 in Program Order).  In this way, after return from interrupt,
 271 the scalar mapreduce may continue where it left off.  This provides
 272 "precise" exception behaviour.
 273
 274 Note that hardware is perfectly permitted to perform multi-issue
 275 parallel optimisation of the scalar reduce operation: it's just that
 276 as far as the user is concerned, all exceptions and interrupts **MUST**
 277 be precise.
 278
 279 ## Vector result reduce mode
 280
 281 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
 282 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
 283 *appearance* and *effect* of Reduction.
 284
 285 Given that the tree-reduction schedule is deterministic,
 286 Interrupts and exceptions
 287 can therefore also be precise.  The final result will be in the first
 288 non-predicate-masked-out destination element, but due again to
 289 the deterministic schedule programmers may find uses for the intermediate
 290 results.
 291
 292 When Rc=1 a corresponding Vector of co-resultant CRs is also
 293 created.  No special action is taken: the result and its CR Field
 294 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
 295
 296 ## Sub-Vector Horizontal Reduction
 297
 298 Note that when SVM is clear and SUBVL!=1 the sub-elements are
 299 *independent*, i.e. they are mapreduced per *sub-element* as a result.
 300 illustration with a vec2:
 301
 302     result.x = op(iregs[RA].x, iregs[RA+1].x)
 303     result.y = op(iregs[RA].y, iregs[RA+1].y)
 304     for i in range(2, VL):
 305         result.x = op(result.x, iregs[RA+i].x)
 306         result.y = op(result.y, iregs[RA+i].y)
 307
 308 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
 309
 310 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
 311 subvector mode.  Example for a vec3:
 312
 313     for i in range(VL):
 314         result = iregs[RA+i].x
 315         result = op(result, iregs[RA+i].y)
 316         result = op(result, iregs[RA+i].z)
 317         iregs[RT+i] = result
 318
 319 In this mode, when Rc=1 the Vector of CRs is as normal: each result
 320 element creates a corresponding CR element (for the final, reduced, result).
 321
 322 # Fail-on-first
 323
 324 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
 325 the other for arithmetic operations (actually, CR-driven).  Note in each
 326 case the assumption is that vector elements are required appear to be
 327 executed in sequential Program Order, element 0 being the first.
 328
 329 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
 330   ordinary one.  Exceptions occur "as normal".  However for elements 1
 331   and above, if an exception would occur, then VL is **truncated** to the
 332   previous element.
 333 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
 334   CR-creating operation produces a result (including cmp).  Similar to
 335   branch, an analysis of the CR is performed and if the test fails, the
 336   vector operation terminates and discards all element operations at and
 337   above the current one, and VL is truncated to either
 338   the *previous* element or the current one, depending on whether
 339   VLi (VL "inclusive") is set.
 340
 341 Thus the new VL comprises a contiguous vector of results,
 342 all of which pass the testing criteria (equal to zero, less than zero).
 343
 344 The CR-based data-driven fail-on-first is new and not found in ARM
 345 SVE or RVV. It is extremely useful for reducing instruction count,
 346 however requires speculative execution involving modifications of VL
 347 to get high performance implementations.  An additional mode (RC1=1)
 348 effectively turns what would otherwise be an arithmetic operation
 349 into a type of `cmp`.  The CR is stored (and the CR.eq bit tested
 350 against the `inv` field).
 351 If the CR.eq bit is equal to `inv` then the Vector is truncated and
 352 the loop ends.
 353 Note that when RC1=1 the result elements are never stored, only the CRs.
 354
 355 VLi is only available as an option when `Rc=0` (or for instructions
 356 which do not have Rc). When set, the current element is always
 357 also included in the count (the new length that VL will be set to).
 358 This may be useful in combination with "inv" to truncate the Vector
 359 to `exclude` elements that fail a test, or, in the case of implementations
 360 of strncpy, to include the terminating zero.
 361
 362 In CR-based data-driven fail-on-first there is only the option to select
 363 and test one bit of each CR (just as with branch BO).  For more complex
 364 tests this may be insufficient.  If that is the case, a vectorised crops
 365 (crand, cror) may be used, and ffirst applied to the crop instead of to
 366 the arithmetic vector.
 367
 368 One extremely important aspect of ffirst is:
 369
 370 * LDST ffirst may never set VL equal to zero.  This because on the first
 371   element an exception must be raised "as normal".
 372 * CR-based data-dependent ffirst on the other hand **can** set VL equal
 373   to zero. This is the only means in the entirety of SV that VL may be set
 374   to zero (with the exception of via the SV.STATE SPR).  When VL is set
 375   zero due to the first element failing the CR bit-test, all subsequent
 376   vectorised operations are effectively `nops` which is
 377   *precisely the desired and intended behaviour*.
 378
 379 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
 380 to a nonzero value for any implementation-specific reason.  For example:
 381 it is perfectly reasonable for implementations to alter VL when ffirst
 382 LD or ST operations are initiated on a nonaligned boundary, such that
 383 within a loop the subsequent iteration of that loop begins subsequent
 384 ffirst LD/ST operations on an aligned boundary.  Likewise, to reduce
 385 workloads or balance resources.
 386
 387 CR-based data-dependent first on the other hand MUST not truncate VL
 388 arbitrarily to a length decided by the hardware: VL MUST only be
 389 truncated based explicitly on whether a test fails.
 390 This because it is a precise test on which algorithms
 391 will rely.
 392
 393 ## Data-dependent fail-first on CR operations (crand etc)
 394
 395 Operations that actually produce or alter CR Field as a result
 396 do not also in turn have an Rc=1 mode.  However it makes no
 397 sense to try to test the 4 bits of a CR Field for being equal
 398 or not equal to zero. Moreover, the result is already in the
 399 form that is desired: it is a CR field.  Therefore,
 400 CR-based operations have their own SVP64 Mode, described
 401 in [[sv/cr_ops]]
 402
 403 There are two primary different types of CR operations:
 404
 405 * Those which have a 3-bit operand field (referring to a CR Field)
 406 * Those which have a 5-bit operand (referring to a bit within the
 407    whole 32-bit CR)
 408
 409 More details can be found in [[sv/cr_ops]].
 410
 411 # pred-result mode
 412
 413 This mode merges common CR testing with predication, saving on instruction
 414 count. Below is the pseudocode excluding predicate zeroing and elwidth
 415 overrides. Note that the paeudocode for [[sv/cr_ops]] is slightly different.
 416
 417     for i in range(VL):
 418         # predication test, skip all masked out elements.
 419         if predicate_masked_out(i):
 420              continue
 421         result = op(iregs[RA+i], iregs[RB+i])
 422         CRnew = analyse(result) # calculates eq/lt/gt
 423         # Rc=1 always stores the CR
 424         if Rc=1 or RC1:
 425             crregs[offs+i] = CRnew
 426         # now test CR, similar to branch
 427         if RC1 or CRnew[BO[0:1]] != BO[2]:
 428             continue # test failed: cancel store
 429         # result optionally stored but CR always is
 430         iregs[RT+i] = result
 431
 432 The reason for allowing the CR element to be stored is so that
 433 post-analysis of the CR Vector may be carried out.  For example:
 434 Saturation may have occurred (and been prevented from updating, by the
 435 test) but it is desirable to know *which* elements fail saturation.
 436
 437 Note that RC1 Mode basically turns all operations into `cmp`.  The
 438 calculation is performed but it is only the CR that is written. The
 439 element result is *always* discarded, never written (just like `cmp`).
 440
 441 Note that predication is still respected: predicate zeroing is slightly
 442 different: elements that fail the CR test *or* are masked out are zero'd.
 443
 444 ## pred-result mode on CR ops
 445
 446 CR operations (mtcr, crand, cror) may be Vectorised,
 447 predicated, and also pred-result mode applied to it.
 448 Vectorisation applies to 4-bit CR Fields which are treated as
 449 elements, not the individual bits of the 32-bit CR.
 450 CR ops and how to identify them is described in [[sv/cr_ops]]
 451
 452 # CR Operations
 453
 454 CRs are slightly more involved than INT or FP registers due to the
 455 possibility for indexing individual bits (crops BA/BB/BT).  Again however
 456 the access pattern needs to be understandable in relation to v3.0B / v3.1B
 457 numbering, with a clear linear relationship and mapping existing when
 458 SV is applied.
 459
 460 ## CR EXTRA mapping table and algorithm
 461
 462 Numbering relationships for CR fields are already complex due to being
 463 in BE format (*the relationship is not clearly explained in the v3.0B
 464 or v3.1B specification*).  However with some care and consideration
 465 the exact same mapping used for INT and FP regfiles may be applied,
 466 just to the upper bits, as explained below.  The notation
 467 `CR{field number}` is used to indicate access to a particular
 468 Condition Register Field (as opposed to the notation `CR[bit]`
 469 which accesses one bit of the 32 bit Power ISA v3.0B
 470 Condition Register)
 471
 472 In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits.  The top 3 bits (0:2)
 473 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
 474 *in* that CR.  The numbering was determined (after 4 months of
 475 analysis and research) to be as follows:
 476
 477     CR_index = 7-(BA>>2)      # top 3 bits but BE
 478     bit_index = 3-(BA & 0b11) # low 2 bits but BE
 479     CR_reg = CR{CR_index}     # get the CR
 480     # finally get the bit from the CR.
 481     CR_bit = (CR_reg & (1<<bit_index)) != 0
 482
 483 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
 484 applies, **not** the CR\_bit portion (bits 3:4):
 485
 486     if extra3_mode:
 487         spec = EXTRA3
 488     else:
 489         spec = EXTRA2<<1 | 0b0
 490     if spec[0]:
 491        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
 492        return ((BA >> 2)<<6) | # hi 3 bits shifted up
 493               (spec[1:2]<<4) | # to make room for these
 494               (BA & 0b11)      # CR_bit on the end
 495     else:
 496        # scalar constructs "00 spec[1:2] BA[0:4]"
 497        return (spec[1:2] << 5) | BA
 498
 499 Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
 500 algorithm to determin CR\_reg is modified to as follows:
 501
 502     CR_index = 7-(BA>>2)      # top 3 bits but BE
 503     if spec[0]:
 504         # vector mode, 0-124 increments of 4
 505         CR_index = (CR_index<<4) | (spec[1:2] << 2)
 506     else:
 507         # scalar mode, 0-32 increments of 1
 508         CR_index = (spec[1:2]<<3) | CR_index
 509     # same as for v3.0/v3.1 from this point onwards
 510     bit_index = 3-(BA & 0b11) # low 2 bits but BE
 511     CR_reg = CR{CR_index}     # get the CR
 512     # finally get the bit from the CR.
 513     CR_bit = (CR_reg & (1<<bit_index)) != 0
 514
 515 Note here that the decoding pattern to determine CR\_bit does not change.
 516
 517 Note: high-performance implementations may read/write Vectors of CRs in
 518 batches of aligned 32-bit chunks (CR0-7, CR7-15).  This is to greatly
 519 simplify internal design.  If instructions are issued where CR Vectors
 520 do not start on a 32-bit aligned boundary, performance may be affected.
 521
 522 ## CR fields as inputs/outputs of vector operations
 523
 524 CRs (or, the arithmetic operations associated with them)
 525 may be marked as Vectorised or Scalar.  When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised.  Likewise if the destination is scalar then so is the CR.
 526
 527 When vectorized, the CR inputs/outputs are sequentially read/written
 528 to 4-bit CR fields.  Vectorised Integer results, when Rc=1, will begin
 529 writing to CR8 (TBD evaluate) and increase sequentially from there.
 530 This is so that:
 531
 532 * implementations may rely on the Vector CRs being aligned to 8. This
 533   means that CRs may be read or written in aligned batches of 32 bits
 534   (8 CRs per batch), for high performance implementations.
 535 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
 536   overwritten by vector Rc=1 operations except for very large VL
 537 * CR-based predication, from CR32, is also not interfered with
 538   (except by large VL).
 539
 540 However when the SV result (destination) is marked as a scalar by the
 541 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
 542 CR when Rc=1 is written to.  This is CR0 for integer operations and CR1
 543 for FP operations.
 544
 545 Note that yes, the CR Fields are genuinely Vectorised.  Unlike in SIMD VSX which
 546 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
 547 v3.0B scalar operations produce a **tuple** of element results: the
 548 result of the operation as one part of that element *and a corresponding
 549 CR element*.  Greatly simplified pseudocode:
 550
 551     for i in range(VL):
 552          # calculate the vector result of an add iregs[RT+i] = iregs[RA+i]
 553          + iregs[RB+i] # now calculate CR bits CRs{8+i}.eq = iregs[RT+i]
 554          == 0 CRs{8+i}.gt = iregs[RT+i] > 0 ... etc
 555
 556 If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
 557 then a followup instruction must be performed, setting "reduce" mode on
 558 the Vector of CRs, using cr ops (crand, crnor) to do so.  This provides far
 559 more flexibility in analysing vectors than standard Vector ISAs.  Normal
 560 Vector ISAs are typically restricted to "were all results nonzero" and
 561 "were some results nonzero". The application of mapreduce to Vectorised
 562 cr operations allows far more sophisticated analysis, particularly in
 563 conjunction with the new crweird operations see [[sv/cr_int_predication]].
 564
 565 Note in particular that the use of a separate instruction in this way
 566 ensures that high performance multi-issue OoO inplementations do not
 567 have the computation of the cumulative analysis CR as a bottleneck and
 568 hindrance, regardless of the length of VL.
 569
 570 Additionally,
 571 SVP64 [[sv/branches]] may be used, even when the branch itself is to
 572 the following instruction.  The combined side-effects of CTR reduction
 573 and VL truncation provide several benefits.
 574
 575 (see [[discussion]].  some alternative schemes are described there)
 576
 577 ## Rc=1 when SUBVL!=1
 578
 579 sub-vectors are effectively a form of Packed SIMD (length 2 to 4). Only 1 bit of
 580 predicate is allocated per subvector; likewise only one CR is allocated
 581 per subvector.
 582
 583 This leaves a conundrum as to how to apply CR computation per subvector,
 584 when normally Rc=1 is exclusively applied to scalar elements.  A solution
 585 is to perform a bitwise OR or AND of the subvector tests.  Given that
 586 OE is ignored in SVP64, this field may (when available) be used to select OR or
 587 AND behavior.
 588
 589 ### Table of CR fields
 590
 591 CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
 592 so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
 593
 594 CRs are not stored in SPRs: they are registers in their own right.
 595 Therefore context-switching the full set of CRs involves a Vectorised
 596 mfcr or mtcr, using VL=64, elwidth=8 to do so.  This is exactly as how
 597 scalar OpenPOWER context-switches CRs: it is just that there are now
 598 more of them.
 599
 600 The 64 SV CRs are arranged similarly to the way the 128 integer registers
 601 are arranged.  TODO a python program that auto-generates a CSV file
 602 which can be included in a table, which is in a new page (so as not to
 603 overwhelm this one). [[svp64/cr_names]]
 604
 605 # Register Profiles
 606
 607 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
 608 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
 609
 610 Instructions are broken down by Register Profiles as listed in the
 611 following auto-generated page: [[opcode_regs_deduped]].  "Non-SV"
 612 indicates that the operations with this Register Profile cannot be
 613 Vectorised (mtspr, bc, dcbz, twi)
 614
 615 TODO generate table which will be here [[svp64/reg_profiles]]
 616
 617 # SV pseudocode illilustration
 618
 619 ## Single-predicated Instruction
 620
 621 illustration of normal mode add operation: zeroing not included, elwidth
 622 overrides not included.  if there is no predicate, it is set to all 1s
 623
 624     function op_add(rd, rs1, rs2) # add not VADD!
 625       int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd);
 626       for (i = 0; i < VL; i++)
 627         STATE.srcoffs = i # save context if (predval & 1<<i) # predication
 628         uses intregs
 629            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2]; if (!int_vec[rd
 630            ].isvec) break;
 631         if (rd.isvec)  { id += 1; } if (rs1.isvec)  { irs1 += 1; } if
 632         (rs2.isvec)  { irs2 += 1; } if (id == VL or irs1 == VL or irs2 ==
 633         VL) {
 634           # end VL hardware loop STATE.srcoffs = 0; # reset return;
 635         }
 636
 637 This has several modes:
 638
 639 * RT.v = RA.v RB.v * RT.v = RA.v RB.s (and RA.s RB.v) * RT.v = RA.s RB.s *
 640 RT.s = RA.v RB.v * RT.s = RA.v RB.s (and RA.s RB.v) * RT.s = RA.s RB.s
 641
 642 All of these may be predicated.  Vector-Vector is straightfoward.
 643 When one of source is a Vector and the other a Scalar, it is clear that
 644 each element of the Vector source should be added to the Scalar source,
 645 each result placed into the Vector (or, if the destination is a scalar,
 646 only the first nonpredicated result).
 647
 648 The one that is not obvious is RT=vector but both RA/RB=scalar.
 649 Here this acts as a "splat scalar result", copying the same result into
 650 all nonpredicated result elements.  If a fixed destination scalar was
 651 intended, then an all-Scalar operation should be used.
 652
 653 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
 654
 655 # Assembly Annotation
 656
 657 Assembly code annotation is required for SV to be able to successfully
 658 mark instructions as "prefixed".
 659
 660 A reasonable (prototype) starting point:
 661
 662     svp64 [field=value]*
 663
 664 Fields:
 665
 666 * ew=8/16/32 - element width
 667 * sew=8/16/32 - source element width
 668 * vec=2/3/4 - SUBVL
 669 * mode=reduce/satu/sats/crpred
 670 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
 671 * spred={reg spec}
 672
 673 similar to x86 "rex" prefix.
 674
 675 For actual assembler:
 676
 677     sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
 678
 679 Qualifiers:
 680
 681 * m={pred}: predicate mask mode
 682 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
 683 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
 684 * ew={N}: ew=8/16/32 - sets elwidth override
 685 * sw={N}: sw=8/16/32 - sets source elwidth override
 686 * ff={xx}: see fail-first mode
 687 * pr={xx}: see predicate-result mode
 688 * sat{x}: satu / sats - see saturation mode
 689 * mr: see map-reduce mode
 690 * mr.svm see map-reduce with sub-vector mode
 691 * crm: see map-reduce CR mode
 692 * crm.svm see map-reduce CR with sub-vector mode
 693 * sz: predication with source-zeroing
 694 * dz: predication with dest-zeroing
 695
 696 For modes:
 697
 698 * pred-result:
 699   - pm=lt/gt/le/ge/eq/ne/so/ns OR
 700   - pm=RC1 OR pm=~RC1
 701 * fail-first
 702   - ff=lt/gt/le/ge/eq/ne/so/ns OR
 703   - ff=RC1 OR ff=~RC1
 704 * saturation:
 705   - sats
 706   - satu
 707 * map-reduce:
 708   - mr OR crm: "normal" map-reduce mode or CR-mode.
 709   - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
 710
 711 # Proposed Parallel-reduction algorithm
 712
 713 ```
 714 /// reference implementation of proposed SimpleV reduction semantics.
 715 ///
 716                 // reduction operation -- we still use this algorithm even
 717                 // if the reduction operation isn't associative or
 718                 // commutative.
 719 /// `temp_pred` is a user-visible Vector Condition register
 720 ///
 721 /// all input arrays have length `vl`
 722 def reduce(  vl,  vec, pred, pred,):
 723     step = 1;
 724     while step < vl
 725         step *= 2;
 726         for i in (0..vl).step_by(step)
 727             other = i + step / 2;
 728             other_pred = other < vl && pred[other];
 729             if pred[i] && other_pred
 730                 vec[i] += vec[other];
 731             else if other_pred
 732                 vec[i] = vec[other];
 733             pred[i] |= other_pred;
 734
 735 def reduce(  vl,  vec, pred, pred,):
 736     j = 0
 737     vi = [] # array of lookup indices to skip nonpredicated
 738     for i, pbit in enumerate(pred):
 739        if pbit:
 740            vi[j] = i
 741            j += 1
 742     step = 2
 743     while step <= vl
 744         halfstep = step // 2
 745         for i in (0..vl).step_by(step)
 746             other = vi[i + halfstep]
 747             i = vi[i]
 748             other_pred = other < vl && pred[other]
 749             if pred[i] && other_pred
 750                 vec[i] += vec[other]
 751             pred[i] |= other_pred
 752          step *= 2
 753
 754 ```