openpower/sv/svp64/appendix.mdwn

   1 # Appendix
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
   5
   6 This is the appendix to [[sv/svp64]], providing explanations of modes
   7 etc. leaving the main svp64 page's primary purpose as outlining the
   8 instruction format.
   9
  10 Table of contents:
  11
  12 [[!toc]]
  13
  14 # XER, SO and other global flags
  15
  16 Vector systems are expected to be high performance.  This is achieved
  17 through parallelism, which requires that elements in the vector be
  18 independent.  XER SO and other global "accumulation" flags (CR.OV) cause
  19 Read-Write Hazards on single-bit global resources, having a significant
  20 detrimental effect.
  21
  22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
  23 in `cmp` instructions).  XER is simply neither read nor written.
  24 This includes when `scalar identity behaviour` occurs.  If precise
  25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
  26 instructions should be used without an SV Prefix.
  27
  28 An interesting side-effect of this decision is that the OE flag is now
  29 free for other uses when SV Prefixing is used.
  30
  31 Regarding XER.CA: this does not fit either: it was designed for a scalar
  32 ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
  33 Vector element.  This provides a means to perform large parallel batches
  34 of Vectorised carry-capable additions.  crweird instructions can be used
  35 to transfer the CRs in and out of an integer, where bitmanipulation
  36 may be performed to analyse the carry bits (including carry lookahead
  37 propagation) before continuing with further parallel additions.
  38
  39 # v3.0B/v3.1B relevant instructions
  40
  41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
  42 CPU ISA.
  43
  44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
  45 alternative uses.  Additionally, Vectorisation of the VSX SIMD system
  46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
  47 at the very minimum, predication (which VSX was designed without).
  48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
  49 illegal instruction exceptions in SV Prefix Mode.
  50
  51 Likewise, `lq` (Load Quad), and Load/Store Multiple make no sense to
  52 have because they are not only provided by SV, the SV alternatives may
  53 be predicated as well, making them far better suited to use in function
  54 calls and context-switching.
  55
  56 Additionally, some v3.0/1 instructions simply make no sense at all in a
  57 Vector context: `twi` and `tdi` fall into this category, as do branch
  58 operations as well as `sc` and `scv`.  Here there is simply no point
  59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
  60 should be called instead.
  61
  62 Fortuitously this leaves several Major Opcodes free for use by SV
  63 to fit alternative future instructions.  In a 3D context this means
  64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
  65 operations, and others critical to an efficient, effective 3D GPU and
  66 VPU ISA. With such instructions being included as standard in other
  67 commercially-successful GPU ISAs it is likewise critical that a 3D
  68 GPU/VPU based on svp64 also have such instructions.
  69
  70 Note however that svp64 is stand-alone and is in no way
  71 critically dependent on the existence or provision of 3D GPU or VPU
  72 instructions. These should be considered extensions, and their discussion
  73 and specification is out of scope for this document.
  74
  75 Note, again: this is *only* under svp64 prefixing.  Standard v3.0B /
  76 v3.1B is *not* altered by svp64 in any way.
  77
  78 ## Major opcode map (v3.0B)
  79
  80 This table is taken from v3.0B.
  81 Table 9: Primary Opcode Map (opcode bits 0:5)
  82
  83         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
  84     000 |        |       |  tdi  | twi   | EXT04 |        |       | mulli | 000
  85     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
  86     010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
  87     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
  88     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
  89     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   | lmw   | stmw  | 101
  90     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
  91     111 |  lq    | EXT57 | EXT58 | EXT59 | EXT60 | EXT61  | EXT62 | EXT63 | 111
  92         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
  93
  94 ## Suitable for svp64
  95
  96 This is the same table containing v3.0B Primary Opcodes except those that
  97 make no sense in a Vectorisation Context have been removed.  These removed
  98 POs can, *in the SV Vector Context only*, be assigned to alternative
  99 (Vectorised-only) instructions, including future extensions.
 100
 101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
 102 change.  When not prefixed with svp64 these opcodes **specifically**
 103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
 104
 105         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
 106     000 |        |       |       |       |       |        |       | mulli | 000
 107     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
 108     010 |        |       |       | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
 109     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
 110     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
 111     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   |       |       | 101
 112     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
 113     111 |        |       | EXT58 | EXT59 |       | EXT61  |       | EXT63 | 111
 114         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
 115
 116 # Twin Predication
 117
 118 This is a novel concept that allows predication to be applied to a single
 119 source and a single dest register.  The following types of traditional
 120 Vector operations may be encoded with it, *without requiring explicit
 121 opcodes to do so*
 122
 123 * VSPLAT (a single scalar distributed across a vector)
 124 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
 125 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
 126 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
 127 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
 128
 129 Those patterns (and more) may be applied to:
 130
 131 * mv (the usual way that V\* ISA operations are created)
 132 * exts\* sign-extension
 133 * rwlinm and other RS-RA shift operations (**note**: excluding
 134   those that take RA as both a src and dest. These are not
 135   1-src 1-dest, they are 2-src, 1-dest)
 136 * LD and ST (treating AGEN as one source)
 137 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
 138 * Condition Register ops mfcr, mtcr and other similar
 139
 140 This is a huge list that creates extremely powerful combinations,
 141 particularly given that one of the predicate options is `(1<<r3)`
 142
 143 Additional unusual capabilities of Twin Predication include a back-to-back
 144 version of VCOMPRESS-VEXPAND which is effectively the ability to do
 145 sequentially ordered multiple VINSERTs.  The source predicate selects a
 146 sequentially ordered subset of elements to be inserted; the destination
 147 predicate specifies the sequentially ordered recipient locations.
 148 This is equivalent to
 149 `llvm.masked.compressstore.*`
 150 followed by
 151 `llvm.masked.expandload.*`
 152
 153
 154 # Rounding, clamp and saturate
 155
 156 see  [[av_opcodes]].
 157
 158 To help ensure that audio quality is not compromised by overflow,
 159 "saturation" is provided, as well as a way to detect when saturation
 160 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
 161 one CR per element in the result (Note: this is different from VSX which
 162 has a single CR per block).
 163
 164 When N=0 the result is saturated to within the maximum range of an
 165 unsigned value.  For integer ops this will be 0 to 2^elwidth-1. Similar
 166 logic applies to FP operations, with the result being saturated to
 167 maximum rather than returning INF, and the minimum to +0.0
 168
 169 When N=1 the same occurs except that the result is saturated to the min
 170 or max of a signed result, and for FP to the min and max value rather
 171 than returning +/- INF.
 172
 173 When Rc=1, the CR "overflow" bit is set on the CR associated with the
 174 element, to indicate whether saturation occurred.  Note that due to
 175 the hugely detrimental effect it has on parallel processing, XER.SO is
 176 **ignored** completely and is **not** brought into play here.  The CR
 177 overflow bit is therefore simply set to zero if saturation did not occur,
 178 and to one if it did.
 179
 180 Note also that saturate on operations that produce a carry output are
 181 prohibited due to the conflicting use of the CR.so bit for storing if
 182 saturation occurred.
 183
 184 Post-analysis of the Vector of CRs to find out if any given element hit
 185 saturation may be done using a mapreduced CR op (cror), or by using the
 186 new crweird instruction, transferring the relevant CR bits to a scalar
 187 integer and testing it for nonzero.  see [[sv/cr_int_predication]]
 188
 189 Note that the operation takes place at the maximum bitwidth (max of
 190 src and dest elwidth) and that truncation occurs to the range of the
 191 dest elwidth.
 192
 193 # Reduce mode
 194
 195 There are two variants here.  The first is when the destination is scalar
 196 and at least one of the sources is Vector.  The second is more complex
 197 and involves map-reduction on vectors.
 198
 199 The first defining characteristic distinguishing Scalar-dest reduce mode
 200 from Vector reduce mode is that Scalar-dest reduce issues VL element
 201 operations, whereas Vector reduce mode performs an actual map-reduce
 202 (tree reduction): typically `O(VL log VL)` actual computations.
 203
 204 The second defining characteristic of scalar-dest reduce mode is that it
 205 is, in simplistic and shallow terms *serial and sequential in nature*,
 206 whereas the Vector reduce mode is definitely inherently paralleliseable.
 207
 208 The reason why scalar-dest reduce mode is "simplistically" serial and
 209 sequential is that in certain circumstances (such as an `OR` operation
 210 or a MIN/MAX operation) it may be possible to parallelise the reduction.
 211
 212 ## Scalar result reduce mode
 213
 214 In this mode, one register is identified as being the "accumulator".
 215 Scalar reduction is thus categorised by:
 216
 217 * One of the sources is a Vector
 218 * the destination is a scalar
 219 * optionally but most usefully when one source register is also the destination
 220 * That the source register type is the same as the destination register
 221   type identified as the "accumulator".  scalar reduction on `cmp`,
 222   `setb` or `isel` is not possible for example because of the mixture
 223   between CRs and GPRs.
 224
 225 Typical applications include simple operations such as `ADD r3, r10.v,
 226 r3` where, clearly, r3 is being used to accumulate the addition of all
 227 elements is the vector starting at r10.
 228
 229      # add RT, RA,RB but when RT==RA
 230      for i in range(VL):
 231           iregs[RA] += iregs[RB+i] # RT==RA
 232
 233 However, *unless* the operation is marked as "mapreduce", SV ordinarily
 234 **terminates** at the first scalar operation.  Only by marking the
 235 operation as "mapreduce" will it continue to issue multiple sub-looped
 236 (element) instructions in `Program Order`.
 237
 238 Other examples include shift-mask operations where a Vector of inserts
 239 into a single destination register is required, as a way to construct
 240 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
 241 Using the same register as both the source and destination, with Vectors
 242 of different offsets masks and values to be inserted has multiple
 243 applications including Video, cryptography and JIT compilation.
 244
 245 Subtract and Divide are still permitted to be executed in this mode,
 246 although from an algorithmic perspective it is strongly discouraged.
 247 It would be better to use addition followed by one final subtract,
 248 or in the case of divide, to get better accuracy, to perform a multiply
 249 cascade followed by a final divide.
 250
 251 Note that single-operand or three-operand scalar-dest reduce is perfectly
 252 well permitted: both still meet the qualifying characteristics that one
 253 source operand can also be the destination, which allows the "accumulator"
 254 to be identified.
 255
 256 ## Vector result reduce mode
 257
 258 1. limited to single predicated dual src operations (add RT, RA, RB).
 259    triple source operations are prohibited (fma).
 260 2. limited to operations that make sense.  divide is excluded, as is
 261    subtract (X - Y - Z produces different answers depending on the order)
 262    and asymmetric CRops (crandc, crorc). sane  operations:
 263    multiply, min/max, add, logical bitwise OR, most other CR ops.
 264    operations that do have the same source and dest register type are
 265    also excluded (isel, cmp). operations involving carry or overflow
 266    (XER.CA / OV) are also prohibited.
 267 3. the destination is a vector but the result is stored, ultimately,
 268    in the first nonzero predicated element.  all other nonzero predicated
 269    elements are undefined. *this includes the CR vector* when Rc=1
 270 4. implementations may use any ordering and any algorithm to reduce
 271    down to a single result.  However it must be equivalent to a straight
 272    application of mapreduce.  The destination vector (except masked out
 273    elements) may be used for storing any intermediate results. these may
 274    be left in the vector (undefined).
 275 5. CRM applies when Rc=1.  When CRM is zero, the CR associated with
 276    the result is regarded as a "some results met standard CR result
 277    criteria". When CRM is one, this changes to "all results met standard
 278    CR criteria".
 279 6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
 280    in order to store sufficient state to resume operation should an
 281    interrupt occur. this is also why implementations are permitted to use
 282    the destination vector to store intermediary computations
 283 7. *Predication may be applied*.  zeroing mode is not an option.  masked-out
 284    inputs are ignored; masked-out elements in the destination vector are
 285    unaltered (not used for the purposes of intermediary storage); the
 286    scalar result is placed in the first available unmasked element.
 287
 288 Pseudocode for the case where RA==RB:
 289
 290     result = op(iregs[RA], iregs[RA+1])
 291     CR = analyse(result)
 292     for i in range(2, VL):
 293         result = op(result, iregs[RA+i])
 294         CRnew = analyse(result)
 295         if Rc=1
 296             if CRM:
 297                  CR = CR bitwise or CRnew
 298             else:
 299                  CR = CR bitwise AND CRnew
 300
 301 TODO: case where RA!=RB which involves first a vector of 2-operand
 302 results followed by a mapreduce on the intermediates.
 303
 304 Note that when SVM is clear and SUBVL!=1 the sub-elements are
 305 *independent*, i.e. they are mapreduced per *sub-element* as a result.
 306 illustration with a vec2:
 307
 308     result.x = op(iregs[RA].x, iregs[RA+1].x)
 309     result.y = op(iregs[RA].y, iregs[RA+1].y)
 310     for i in range(2, VL):
 311         result.x = op(result.x, iregs[RA+i].x)
 312         result.y = op(result.y, iregs[RA+i].y)
 313
 314 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
 315
 316 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
 317 subvector mode.  Example for a vec3:
 318
 319     for i in range(VL):
 320         result = op(iregs[RA+i].x, iregs[RA+i].x)
 321         result = op(result, iregs[RA+i].y)
 322         result = op(result, iregs[RA+i].z)
 323         iregs[RT+i] = result
 324
 325 In this mode, when Rc=1 the Vector of CRs is as normal: each result
 326 element creates a corresponding CR element.
 327
 328 # Fail-on-first
 329
 330 Data-dependent fail-on-first has two distinct variants: one for LD/ST,
 331 the other for arithmetic operations (actually, CR-driven).  Note in each
 332 case the assumption is that vector elements are required appear to be
 333 executed in sequential Program Order, element 0 being the first.
 334
 335 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
 336   ordinary one.  Exceptions occur "as normal".  However for elements 1
 337   and above, if an exception would occur, then VL is **truncated** to the
 338   previous element.
 339 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
 340   CR-creating operation produces a result (including cmp).  Similar to
 341   branch, an analysis of the CR is performed and if the test fails, the
 342   vector operation terminates and discards all element operations at and
 343   above the current one, and VL is truncated to the *previous* element.
 344   Thus the new VL comprises a contiguous vector of results, all of which
 345   pass the testing criteria (equal to zero, less than zero).
 346
 347 The CR-based data-driven fail-on-first is new and not found in ARM
 348 SVE or RVV. It is extremely useful for reducing instruction count,
 349 however requires speculative execution involving modifications of VL
 350 to get high performance implementations.  An additional mode (RC1=1)
 351 effectively turns what would otherwise be an arithmetic operation
 352 into a type of `cmp`.  The CR is stored (and the CR.eq bit tested).
 353 If the CR.eq bit fails then the Vector is truncated and the loop ends.
 354 Note that when RC1=1 the result elements arw never stored, only the CRs.
 355
 356 In CR-based data-driven fail-on-first there is only the option to select
 357 and test one bit of each CR (just as with branch BO).  For more complex
 358 tests this may be insufficient.  If that is the case, a vectorised crops
 359 (crand, cror) may be used, and ffirst applied to the crop instead of to
 360 the arithmetic vector.
 361
 362 One extremely important aspect of ffirst is:
 363
 364 * LDST ffirst may never set VL equal to zero.  This because on the first
 365   element an exception must be raised "as normal".
 366 * CR-based data-dependent ffirst on the other hand **can** set VL equal
 367   to zero. This is the only means in the entirety of SV that VL may be set
 368   to zero (with the exception of via the SV.STATE SPR).  When VL is set
 369   zero due to the first element failing the CR bit-test, all subsequent
 370   vectorised operations are effectively `nops` which is
 371   *precisely the desired and intended behaviour*.
 372
 373 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
 374 to a nonzero value for any implementation-specific reason.  For example:
 375 it is perfectly reasonable for implementations to alter VL when ffirst
 376 LD or ST operations are initiated on a nonaligned boundary, such that
 377 within a loop the subsequent iteration of that loop begins subsequent
 378 ffirst LD/ST operations on an aligned boundary.  Likewise, to reduce
 379 workloads or balance resources.
 380
 381 CR-based data-dependent first on the other hand MUST not truncate VL
 382 arbitrarily.  This because it is a precise test on which algorithms
 383 will rely.
 384
 385 # pred-result mode
 386
 387 This mode merges common CR testing with predication, saving on instruction
 388 count. Below is the pseudocode excluding predicate zeroing and elwidth
 389 overrides.
 390
 391     for i in range(VL):
 392         # predication test, skip all masked out elements.
 393         if predicate_masked_out(i):
 394              continue
 395         result = op(iregs[RA+i], iregs[RB+i])
 396         CRnew = analyse(result) # calculates eq/lt/gt
 397         # Rc=1 always stores the CR
 398         if Rc=1 or RC1:
 399             crregs[offs+i] = CRnew
 400         # now test CR, similar to branch
 401         if RC1 or CRnew[BO[0:1]] != BO[2]:
 402             continue # test failed: cancel store
 403         # result optionally stored but CR always is
 404         iregs[RT+i] = result
 405
 406 The reason for allowing the CR element to be stored is so that
 407 post-analysis of the CR Vector may be carried out.  For example:
 408 Saturation may have occurred (and been prevented from updating, by the
 409 test) but it is desirable to know *which* elements fail saturation.
 410
 411 Note that RC1 Mode basically turns all operations into `cmp`.  The
 412 calculation is performed but it is only the CR that is written. The
 413 element result is *always* discarded, never written (just like `cmp`).
 414
 415 Note that predication is still respected: predicate zeroing is slightly
 416 different: elements that fail the CR test *or* are masked out are zero'd.
 417
 418 ## pred-result mode on CR ops
 419
 420 Yes, really: CR operations (mtcr, crand, cror) may be Vectorised,
 421 predicated, and also pred-result mode applied to it.  In this case,
 422 the Vectorisation applies to the batch of 4 bits, i.e. it is not the CR
 423 individual bits that are treated as the Vector, but the CRs themselves
 424 (CR0, CR8, CR9...)
 425
 426 Thus after each Vectorised operation (crand) a test of the CR result
 427 can in fact be performed.
 428
 429 # CR Operations
 430
 431 CRs are slightly more involved than INT or FP registers due to the
 432 possibility for indexing individual bits (crops BA/BB/BT).  Again however
 433 the access pattern needs to be understandable in relation to v3.0B / v3.1B
 434 numbering, with a clear linear relationship and mapping existing when
 435 SV is applied.
 436
 437 ## CR EXTRA mapping table and algorithm
 438
 439 Numbering relationships for CR fields are already complex due to being
 440 in BE format (*the relationship is not clearly explained in the v3.0B
 441 or v3.1B specification*).  However with some care and consideration
 442 the exact same mapping used for INT and FP regfiles may be applied,
 443 just to the upper bits, as explained below.
 444
 445 In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits.  The top 3 bits (0:2)
 446 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
 447 *in* that CR.  The numbering was determined (after 4 months of
 448 analysis and research) to be as follows:
 449
 450     CR_index = 7-(BA>>2)      # top 3 bits but BE
 451     bit_index = 3-(BA & 0b11) # low 2 bits but BE
 452     CR_reg = CR{CR_index}     # get the CR
 453     # finally get the bit from the CR.
 454     CR_bit = (CR_reg & (1<<bit_index)) != 0
 455
 456 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
 457 applies, **not** the CR\_bit portion (bits 3:4):
 458
 459     if extra3_mode:
 460         spec = EXTRA3
 461     else:
 462         spec = EXTRA2<<1 | 0b0
 463     if spec[0]:
 464        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
 465        return ((BA >> 2)<<6) | # hi 3 bits shifted up
 466               (spec[1:2]<<4) | # to make room for these
 467               (BA & 0b11)      # CR_bit on the end
 468     else:
 469        # scalar constructs "00 spec[1:2] BA[0:4]"
 470        return (spec[1:2] << 5) | BA
 471
 472 Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
 473 algorithm to determin CR\_reg is modified to as follows:
 474
 475     CR_index = 7-(BA>>2)      # top 3 bits but BE
 476     if spec[0]:
 477         # vector mode, 0-124 increments of 4
 478         CR_index = (CR_index<<4) | (spec[1:2] << 2)
 479     else:
 480         # scalar mode, 0-32 increments of 1
 481         CR_index = (spec[1:2]<<3) | CR_index
 482     # same as for v3.0/v3.1 from this point onwards
 483     bit_index = 3-(BA & 0b11) # low 2 bits but BE
 484     CR_reg = CR{CR_index}     # get the CR
 485     # finally get the bit from the CR.
 486     CR_bit = (CR_reg & (1<<bit_index)) != 0
 487
 488 Note here that the decoding pattern to determine CR\_bit does not change.
 489
 490 Note: high-performance implementations may read/write Vectors of CRs in
 491 batches of aligned 32-bit chunks (CR0-7, CR7-15).  This is to greatly
 492 simplify internal design.  If instructions are issued where CR Vectors
 493 do not start on a 32-bit aligned boundary, performance may be affected.
 494
 495 ## CR fields as inputs/outputs of vector operations
 496
 497 CRs (or, the arithmetic operations associated with them)
 498 may be marked as Vectorised or Scalar.  When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised.  Likewise if the destination is scalar then so is the CR.
 499
 500 When vectorized, the CR inputs/outputs are sequentially read/written
 501 to 4-bit CR fields.  Vectorised Integer results, when Rc=1, will begin
 502 writing to CR8 (TBD evaluate) and increase sequentially from there.
 503 This is so that:
 504
 505 * implementations may rely on the Vector CRs being aligned to 8. This
 506   means that CRs may be read or written in aligned batches of 32 bits
 507   (8 CRs per batch), for high performance implementations.
 508 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
 509   overwritten by vector Rc=1 operations except for very large VL
 510 * CR-based predication, from CR32, is also not interfered with
 511   (except by large VL).
 512
 513 However when the SV result (destination) is marked as a scalar by the
 514 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
 515 CR when Rc=1 is written to.  This is CR0 for integer operations and CR1
 516 for FP operations.
 517
 518 Note that yes, the CRs are genuinely Vectorised.  Unlike in SIMD VSX which
 519 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
 520 v3.0B scalar operations produce a **tuple** of element results: the
 521 result of the operation as one part of that element *and a corresponding
 522 CR element*.  Greatly simplified pseudocode:
 523
 524     for i in range(VL):
 525          # calculate the vector result of an add iregs[RT+i] = iregs[RA+i]
 526          + iregs[RB+i] # now calculate CR bits CRs{8+i}.eq = iregs[RT+i]
 527          == 0 CRs{8+i}.gt = iregs[RT+i] > 0 ... etc
 528
 529 If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
 530 then a followup instruction must be performed, setting "reduce" mode on
 531 the Vector of CRs, using cr ops (crand, crnor)to do so.  This provides far
 532 more flexibility in analysing vectors than standard Vector ISAs.  Normal
 533 Vector ISAs are typically restricted to "were all results nonzero" and
 534 "were some results nonzero". The application of mapreduce to Vectorised
 535 cr operations allows far more sophisticated analysis, particularly in
 536 conjunction with the new crweird operations see [[sv/cr_int_predication]].
 537
 538 Note in particular that the use of a separate instruction in this way
 539 ensures that high performance multi-issue OoO inplementations do not
 540 have the computation of the cumulative analysis CR as a bottleneck and
 541 hindrance, regardless of the length of VL.
 542
 543 (see [[discussion]].  some alternative schemes are described there)
 544
 545 ## Rc=1 when SUBVL!=1
 546
 547 sub-vectors are effectively a form of SIMD (length 2 to 4). Only 1 bit of
 548 predicate is allocated per subvector; likewise only one CR is allocated
 549 per subvector.
 550
 551 This leaves a conundrum as to how to apply CR computation per subvector,
 552 when normally Rc=1 is exclusively applied to scalar elements.  A solution
 553 is to perform a bitwise OR or AND of the subvector tests.  Given that
 554 OE is ignored, rhis field may (when available) be used to select OR or
 555 AND behavior.
 556
 557 ### Table of CR fields
 558
 559 CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
 560 so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
 561
 562 CRs are not stored in SPRs: they are registers in their own right.
 563 Therefore context-switching the full set of CRs involves a Vectorised
 564 mfcr or mtcr, using VL=64, elwidth=8 to do so.  This is exactly as how
 565 scalar OpenPOWER context-switches CRs: it is just that there are now
 566 more of them.
 567
 568 The 64 SV CRs are arranged similarly to the way the 128 integer registers
 569 are arranged.  TODO a python program that auto-generates a CSV file
 570 which can be included in a table, which is in a new page (so as not to
 571 overwhelm this one). [[svp64/cr_names]]
 572
 573 # Register Profiles
 574
 575 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
 576 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
 577
 578 Instructions are broken down by Register Profiles as listed in the
 579 following auto-generated page: [[opcode_regs_deduped]].  "Non-SV"
 580 indicates that the operations with this Register Profile cannot be
 581 Vectorised (mtspr, bc, dcbz, twi)
 582
 583 TODO generate table which will be here [[svp64/reg_profiles]]
 584
 585 # SV pseudocode illilustration
 586
 587 ## Single-predicated Instruction
 588
 589 illustration of normal mode add operation: zeroing not included, elwidth
 590 overrides not included.  if there is no predicate, it is set to all 1s
 591
 592     function op_add(rd, rs1, rs2) # add not VADD!
 593       int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd);
 594       for (i = 0; i < VL; i++)
 595         STATE.srcoffs = i # save context if (predval & 1<<i) # predication
 596         uses intregs
 597            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2]; if (!int_vec[rd
 598            ].isvec) break;
 599         if (rd.isvec)  { id += 1; } if (rs1.isvec)  { irs1 += 1; } if
 600         (rs2.isvec)  { irs2 += 1; } if (id == VL or irs1 == VL or irs2 ==
 601         VL) {
 602           # end VL hardware loop STATE.srcoffs = 0; # reset return;
 603         }
 604
 605 This has several modes:
 606
 607 * RT.v = RA.v RB.v * RT.v = RA.v RB.s (and RA.s RB.v) * RT.v = RA.s RB.s *
 608 RT.s = RA.v RB.v * RT.s = RA.v RB.s (and RA.s RB.v) * RT.s = RA.s RB.s
 609
 610 All of these may be predicated.  Vector-Vector is straightfoward.
 611 When one of source is a Vector and the other a Scalar, it is clear that
 612 each element of the Vector source should be added to the Scalar source,
 613 each result placed into the Vector (or, if the destination is a scalar,
 614 only the first nonpredicated result).
 615
 616 The one that is not obvious is RT=vector but both RA/RB=scalar.
 617 Here this acts as a "splat scalar result", copying the same result into
 618 all nonpredicated result elements.  If a fixed destination scalar was
 619 intended, then an all-Scalar operation should be used.
 620
 621 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
 622
 623 # Assembly Annotation
 624
 625 Assembly code annotation is required for SV to be able to successfully
 626 mark instructions as "prefixed".
 627
 628 A reasonable (prototype) starting point:
 629
 630     svp64 [field=value]*
 631
 632 Fields:
 633
 634 * ew=8/16/32 - element width
 635 * sew=8/16/32 - source element width
 636 * vec=2/3/4 - SUBVL
 637 * mode=reduce/satu/sats/crpred
 638 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
 639 * spred={reg spec}
 640
 641 similar to x86 "rex" prefix.
 642
 643 For actual assembler:
 644
 645     sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
 646
 647 Qualifiers:
 648
 649 * m={pred}: predicate mask mode
 650 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
 651 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
 652 * ew={N}: ew=8/16/32 - sets elwidth override
 653 * sw={N}: sw=8/16/32 - sets source elwidth override
 654 * ff={xx}: see fail-first mode
 655 * pr={xx}: see predicate-result mode
 656 * sat{x}: satu / sats - see saturation mode
 657 * mr: see map-reduce mode
 658 * mr.svm see map-reduce with sub-vector mode
 659 * crm: see map-reduce CR mode
 660 * crm.svm see map-reduce CR with sub-vector mode
 661 * sz: predication with source-zeroing
 662 * dz: predication with dest-zeroing
 663
 664 For modes:
 665
 666 * pred-result:
 667   - pm=lt/gt/le/ge/eq/ne/so/ns OR
 668   - pm=RC1 OR pm=~RC1
 669 * fail-first
 670   - ff=lt/gt/le/ge/eq/ne/so/ns OR
 671   - ff=RC1 OR ff=~RC1
 672 * saturation:
 673   - sats
 674   - satu
 675 * map-reduce:
 676   - mr OR crm: "normal" map-reduce mode or CR-mode.
 677   - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
 678