openpower/sv/remap.mdwn

   1 # REMAP <a name="remap" />
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=143> matrix multiply
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=867> add svindex
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=885> svindex in simulator
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=911> offset svshape option
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=864> parallel reduction
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=930> DCT/FFT "strides"
   9 * see [[sv/remap/appendix]] for examples and usage
  10 * see [[sv/propagation]] for a future way to apply REMAP
  11 * [[remap/discussion]]
  12
  13 REMAP is an advanced form of Vector "Structure Packing" that provides
  14 hardware-level support for commonly-used *nested* loop patterns that would
  15 otherwise require full inline loop unrolling.  For more general reordering
  16 an Indexed REMAP mode is available (an abstracted analog to `xxperm`).
  17
REMAP allows the usual sequential vector loop `0..VL-1` to be "reshaped"
(re-mapped) from a linear form to a 2D or 3D transposed form, or "offset"
to permit arbitrary access to elements (when elwidth overrides are
used), independently on each Vector src or dest register. Aside from
Indexed REMAP this reordering is entirely Hardware-accelerated and
consequently not costly in terms of register access. It does however
place a burden on Multi-Issue systems, but no more than if the equivalent
Scalar instructions were explicitly loop-unrolled without SVP64, and
some advanced implementations may even find the Deterministic nature of
the Scheduling to be easier on resources.
  28
The initial primary motivation of REMAP was Matrix Multiplication and the
in-place reordering of sequential data. In-place DCT and FFT were then
easily justified given their exceptionally high usage in Computer Science.
  32 Four SPRs are provided which may be applied to any GPR, FPR or CR Field so
  33 that for example a single FMAC may be used in a single hardware-controlled
  34 100% Deterministic loop to perform 5x3 times 3x4 Matrix multiplication,
  35 generating 60 FMACs *without needing explicit assembler unrolling*.
  36 Additional uses include regular "Structure Packing" such as RGB pixel
  37 data extraction and reforming (although less costly vec2/3/4 reshaping
  38 is achievable with `PACK/UNPACK`).
  39
  40 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
  41 Vector ISAs which would typically only have a limited set of instructions
  42 that can be structure-packed (LD/ST and Move operations
  43 being the most common), REMAP may be applied to
  44 literally any instruction: CRs, Arithmetic, Logical, LD/ST, even
  45 Vectorised Branch-Conditional.
  46
  47 When SUBVL is greater than 1 a given group of Subvector
  48 elements are kept together: effectively the group becomes the
  49 element, and with REMAP applying to elements
  50 (not sub-elements) each group is REMAPed together.
  51 Swizzle *can* however be applied to the same
  52 instruction as REMAP, providing re-sequencing of
  53 Subvector elements which REMAP cannot. Also as explained in [[sv/mv.swizzle]], [[sv/mv.vec]] and the [[svp64/appendix]], Pack and Unpack Mode bits
  54 can extend down into Sub-vector elements to influence vec2/vec3/vec4
  55 sequential reordering, but even here, REMAP is not *individually*
  56 extended down to the actual sub-vector elements themselves.
  57
In its general form, REMAP is quite expensive to set up, and on some
implementations may introduce latency, so should realistically be used
only where it is worthwhile.  However, given that up to 127 operations
can be requested to be issued from a single instruction, REMAP should
not be dismissed for *possible* latency alone.  Commonly-used patterns
such as Matrix Multiply, DCT and FFT have helper instruction options
which make REMAP easier to use.
  66
  67 There are four types of REMAP:
  68
  69 * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
  70   Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
  71   Matrix Multiply.
  72 * **FFT/DCT**, with full triple-loop in-place support: limited to
  73   Power-2 RADIX
  74 * **Indexing**, for any general-purpose reordering, also includes
  75   limited 2D reshaping.
  76 * **Parallel Reduction**, for scheduling a sequence of operations
  77   in a Deterministic fashion, in a way that may be parallelised,
  78   to reduce a Vector down to a single value.
  79
  80 Best implemented on top of a Multi-Issue Out-of-Order Micro-architecture,
  81 REMAP Schedules are 100% Deterministic **including Indexing** and are
  82 designed to be incorporated in between the Decode and Issue phases,
  83 directly into Register Hazard Management.
  84
  85 As long as the SVSHAPE SPRs
  86 are not written to directly, Hardware may treat REMAP as 100%
  87 Deterministic: all REMAP Management instructions take static
  88 operands (no dynamic register operands)
  89 with the exception of Indexed Mode, and even then
  90 Architectural State is permitted to assume that the Indices
  91 are cacheable from the point at which the `svindex` instruction
  92 is executed.
  93
  94 Parallel Reduction is unusual in that it requires a full vector array
  95 of results (not a scalar) and uses the rest of the result Vector for
  96 the purposes of storing intermediary calculations.  As these intermediary
  97 results are Deterministically computed they may be useful.
  98 Additionally, because the intermediate results are always written out
  99 it is possible to service Precise Interrupts without affecting latency
 100 (a common limitation of Vector ISAs implementing explicit
 101 Parallel Reduction instructions).
 102
 103 ## Basic principle
 104
* normal vector element read/write of operands would be sequential
  (0 1 2 3 ....)
* this is not appropriate for (e.g.) Matrix multiply which requires
  accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this.  Both are expensive (copy large vectors, spill through memory)
  and very few Packed SIMD ISAs cope with non-Power-2.
* REMAP **redefines** the order of access according to set
  (Deterministic) "Schedules".
* Matrix Schedules are not at all restricted to power-of-two boundaries,
  making it unnecessary to have, for example, the specialised 3x4 transpose
  instructions found in other Vector ISAs.
 117
 118 Only the most commonly-used algorithms in computer science have REMAP
 119 support, due to the high cost in both the ISA and in hardware.  For
 120 arbitrary remapping the `Indexed` REMAP may be used.
 121
 122 ## Example Usage
 123
 124 * `svshape` to set the type of reordering to be applied to an
 125   otherwise usual `0..VL-1` hardware for-loop
 126 * `svremap` to set which registers a given reordering is to apply to
 127   (RA, RT etc)
 128 * `sv.{instruction}` where any Vectorised register marked by `svremap`
 129   will have its ordering REMAPPED according to the schedule set
 130   by `svshape`.
 131
 132 The following illustrative example multiplies a 3x4 and a 5x3
 133 matrix to create
 134 a 5x4 result:
 135
```
    svshape 5, 4, 3, 0, 0            # Outer Product
    svremap 15, 1, 2, 3, 0, 0, 0     # SVme,mi0,mi1,mi2,mo0,mo1,pst
    sv.fmadds *0, *32, *64, *0
```
 141
* svshape sets up the four SVSHAPE SPRs for a Matrix Schedule
* svremap activates four out of five registers RA RB RC RT RS (15)
* svremap requests:
  - RA to use SVSHAPE1
  - RB to use SVSHAPE2
  - RC to use SVSHAPE3
  - RT to use SVSHAPE0
  - RS Remapping to not be activated
* sv.fmadds has vectors RT=`*0`, RA=`*32`, RB=`*64`, RC=`*0`
* With REMAP being active each register's element index is
  *independently* transformed using the specified SHAPEs.
 153
 154 Thus the Vector Loop is arranged such that the use of
 155 the multiply-and-accumulate instruction executes precisely the required
 156 Schedule to perform an in-place in-registers Outer Product
 157 Matrix Multiply with no
 158 need to perform additional Transpose or register copy instructions.
 159 The example above may be executed as a unit test and demo,
 160 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
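
As a rough illustration of what the three instructions above achieve, the following Python sketch models a single flat loop of 60 steps in which the element offset applied to each operand is computed independently. The helper names `matmul_schedule` and `matmul`, the row-major flattening of the matrices and the choice of loop nesting are all assumptions made purely for illustration: the order that `svshape` actually produces is defined by the algorithms in [[sv/remap/appendix]].

```python
def matmul_schedule(rows, cols, inner):
    """Yield one (rt, ra, rb) triple of element offsets per multiply-add.

    Hypothetical helper: models three independent index sequences
    driven by one flat loop.  Matrices are assumed to be row-major.
    """
    for k in range(inner):          # assumed outermost loop
        for r in range(rows):
            for c in range(cols):
                yield (r * cols + c,    # offset into the result / accumulator
                       r * inner + k,   # offset into the first operand
                       k * cols + c)    # offset into the second operand


def matmul(a, b, rows, cols, inner):
    """Multiply flat row-major matrices using only scalar multiply-adds."""
    result = [0.0] * (rows * cols)
    for rt, ra, rb in matmul_schedule(rows, cols, inner):
        result[rt] += a[ra] * b[rb]     # the one multiply-add of the loop
    return result


a = [float(i) for i in range(5 * 3)]    # 5x3 matrix
b = [float(i) for i in range(3 * 4)]    # 3x4 matrix
product = matmul(a, b, rows=5, cols=4, inner=3)
steps = len(list(matmul_schedule(5, 4, 3)))
print(steps, product[:4])
```

The point of the sketch is that no element is copied or transposed: only the three offsets change from one step to the next, and the accumulator offsets repeat, which is why Register Hazards between steps must be respected.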
 161
 162 ## REMAP types
 163
 164 This section summarises the motivation for each REMAP Schedule
 165 and briefly goes over their characteristics and limitations.
 166 Further details on the Deterministic Precise-Interruptible algorithms
 167 used in these Schedules is found in the [[sv/remap/appendix]].
 168
 169 ### Matrix (1D/2D/3D shaping)
 170
Matrix Multiplication is a huge part of High-Performance Compute,
and 3D.
In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
cope with for example 3x4 Matrices, recommend rolling data-repetition
and loop-unrolling.
Aside from the cost of the load on the L1 I-Cache, the trick only
works if one of the dimensions X or Y is a power of two. Prime Numbers
(5x7, 3x5) become deeply problematic to unroll.
 179
 180 Even traditional Scalable Vector ISAs have issues with Matrices, often
 181 having to perform data Transpose by pushing out through Memory and back,
 182 or computing Transposition Indices (costly) then copying to another
 183 Vector (costly).
 184
 185 Matrix REMAP was thus designed to solve these issues by providing Hardware
 186 Assisted
 187 "Schedules" that can view what would otherwise be limited to a strictly
 188 linear Vector as instead being 2D (even 3D) *in-place* reordered.
 189 With both Transposition and non-power-two being supported the issues
 190 faced by other ISAs are mitigated.
 191
 192 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
 193 restricted to 127: up to 127 FMAs (or other operation)
 194 may be performed in total.
 195 Also given that it is in-registers only at present some care has to be
 196 taken on regfile resource utilisation. However it is perfectly possible
 197 to utilise Matrix REMAP to perform the three inner-most "kernel"
 198 ("Tiling") loops of
 199 the usual 6-level large Matrix Multiply, without the usual difficulties
 200 associated with SIMD.
 201
Also the `svshape` instruction only provides access to part of the
Matrix REMAP capability. Rotation and mirroring need to be done by
programming the SVSHAPE SPRs directly, which can take a lot more
instructions. Future versions of SVP64 will include EXT1xx prefixed
variants (`psvshape`) which provide more comprehensive capacity and
mitigate the need to write directly to the SVSHAPE SPRs.
 208
 209 ### FFT/DCT Triple Loop
 210
DCT and FFT are among the most widely used algorithms in
Computer Science: Radar, Audio, Video, R.F. Baseband and dozens more
applications rely on them.  At least two DSPs, TMS320 and Hexagon,
have VLIW instructions specially tailored to FFT.
 215
 216 An in-depth analysis showed that it is possible to do in-place in-register
 217 DCT and FFT as long as twin-result "butterfly" instructions are provided.
 218 These can be found in the [[openpower/isa/svfparith]] page if performing
 219 IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
 220 integer operations would be required)*. These "butterfly" instructions
 221 avoid the need for a temporary register because the two array positions
 222 being overwritten will be "in-flight" in any In-Order or Out-of-Order
 223 micro-architecture.
 224
DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
accept predicate masks.  Given that it is common to perform recursive
convolutions combining smaller Power-2 DCT/FFTs to create larger DCT/FFTs,
in practice the RADIX2 limit is not a problem.  A Bluestein convolution
to compute arbitrary lengths is demonstrated by
[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py).
 231
 232 ### Indexed
 233
 234 The purpose of Indexing is to provide a generalised version of
 235 Vector ISA "Permute" instructions, such as VSX `vperm`.  The
 236 Indexing is abstracted out and may be applied to much more
 237 than an element move/copy, and is not limited for example
 238 to the number of bytes that can fit into a VSX register.
 239 Indexing may be applied to LD/ST (even on Indexed LD/ST
 240 instructions such as `sv.lbzx`), arithmetic operations,
 241 extsw: there is no artificial limit.
 242
The only major caveat is that the registers to be used as
Indices must not be modified by any instruction after Indexed Mode
is established, and neither must MAXVL be altered. Additionally,
no Index value held in those registers may exceed MAXVL-1.
 247
 248 Failure to observe
 249 these conditions results in `UNDEFINED` behaviour.
 250 These conditions allow a Read-After-Write (RAW) Hazard to be created on
 251 the entire range of Indices to be subsequently used, but a corresponding
 252 Write-After-Read Hazard by any instruction that modifies the Indices
 253 **does not have to be created**. Given the large number of registers
 254 involved in Indexing this is a huge resource saving and reduction
 255 in micro-architectural complexity. MAXVL is likewise
 256 included in the RAW Hazards because it is involved in calculating
 257 how many registers are to be considered Indices.
 258
 259 With these Hazard Mitigations in place, high-performance implementations
 260 may read-cache the Indices at the point where a given `svindex` instruction
 261 is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
 262 background GPR register file reads whilst other instructions are being
 263 issued and executed.
 264
The original motivation for Indexed REMAP was to mitigate the need to add
an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as
a stand-alone instruction.  Usually a Vector ISA would add a non-conflicting
variant (as in VSX `vperm`) but it is common to need to permute by source,
which carries a risk of conflict that has to be resolved, for example
in AVX-512 with `vpconflictd`.
 271
 272 Indexed REMAP on the other hand **does not prevent conflicts** (overlapping
 273 destinations), which on a superficial analysis may be perceived to be a
 274 problem, until it is recalled that, firstly, Simple-V is designed specifically
 275 to require Program Order to be respected, and that Matrix, DCT and FFT
 276 all *already* critically depend on overlapping Reads/Writes: Matrix
 277 uses overlapping registers as accumulators.  Thus the Register Hazard
 278 Management needed by Indexed REMAP *has* to be in place anyway.
 279
The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
clearly that of the additional reading of the GPRs to be used as Indices,
plus the setup cost associated with creating those same Indices.
If any Deterministic REMAP can cover the required task, clearly it
is advisable to use it instead.
 285
 286 *Programmer's note: some algorithms may require skipping of Indices exceeding
 287 VL-1, not MAXVL-1. This may be achieved programmatically by performing
 288 an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
 289 and RB contains the value of VL returned from `setvl`. The resultant
 290 CR Fields may then be used as Predicate Masks to exclude those operations
 291 with an Index exceeding VL-1.*
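
The Programmer's note above can be modelled in a few lines of Python. This is a minimal sketch, not a description of the hardware: the helper name `indexed_gather` is hypothetical, the compare is reduced to a plain less-than test against VL, and the selection of a particular CR Field bit as the predicate is not modelled.

```python
def indexed_gather(src, indices, vl):
    """Model dest[i] = src[indices[i]] for i in 0..VL-1, skipping any
    element whose Index exceeds VL-1 (hypothetical helper)."""
    # stands in for the compare against VL: one predicate bit per element
    mask = [idx < vl for idx in indices[:vl]]
    dest = [None] * vl              # None marks "not written"
    for i in range(vl):
        if mask[i]:                 # masked-out elements are left untouched
            dest[i] = src[indices[i]]
    return dest, mask


src = [10, 11, 12, 13, 14, 15, 16, 17]      # MAXVL = 8 elements
indices = [3, 6, 0, 5, 1, 2, 7, 4]          # every Index is below MAXVL
dest, mask = indexed_gather(src, indices, vl=4)
print(dest, mask)
```

Here the Indices 6 and 5 are legal (below MAXVL) but exceed VL-1, so the corresponding operations are excluded by the mask rather than by altering the Indices themselves.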
 292
 293 ### Parallel Reduction
 294
 295 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
 296 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
 297 *appearance* and *effect* of Reduction.
 298
 299 In Horizontal-First Mode, Vector-result reduction **requires**
 300 the destination to be a Vector, which will be used to store
 301 intermediary results.
 302
 303 Given that the tree-reduction schedule is deterministic,
 304 Interrupts and exceptions
 305 can therefore also be precise.  The final result will be in the first
 306 non-predicate-masked-out destination element, but due again to
 307 the deterministic schedule programmers may find uses for the intermediate
 308 results.
 309
 310 When Rc=1 a corresponding Vector of co-resultant CRs is also
 311 created.  No special action is taken: the result *and its CR Field*
 312 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
 313
 314 Note that the Schedule only makes sense on top of certain instructions:
 315 X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
 316 and the destination are all the same type.  Like Scalar
 317 Reduction, nothing is prohibited:
 318 the results of execution on an unsuitable instruction may simply
 319 not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
 320 may be used, and whilst it is down to the Programmer to walk through the
 321 process the Programmer can be confident that the Parallel-Reduction is
 322 guaranteed 100% Deterministic.
 323
 324 Critical to note regarding use of Parallel-Reduction REMAP is that,
 325 exactly as with all REMAP Modes, the `svshape` instruction *requests*
 326 a certain Vector Length (number of elements to reduce) and then
 327 sets VL and MAXVL at the number of **operations** needed to be
 328 carried out.  Thus, equally as importantly, like Matrix REMAP
 329 the total number of operations
 330 is restricted to 127.  Any Parallel-Reduction requiring more operations
 331 will need to be done manually in batches (hierarchical
 332 recursive Reduction).
 333
Also important to note is that the Deterministic Schedule is arranged
so that some implementations *may* parallelise it (as long as doing so
respects Program Order and Register Hazards).  Performance (speed)
of any given
implementation is neither strictly defined nor guaranteed.  As with
the Vulkan(tm) Specification, strict compliance is paramount whilst
performance is at the discretion of Implementors.
 341
 342 **Parallel-Reduction with Predication**
 343
 344 To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
 345 completely separate from the actual element-level (scalar) operations,
 346 Move operations are **not** included in the Schedule.  This means that
 347 the Schedule leaves the final (scalar) result in the first-non-masked
 348 element of the Vector used.  With the predicate mask being dynamic
 349 (but deterministic) this result could be anywhere.
 350
If that result needs to be moved to a (single) scalar register
then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
needed to get it, where the predicate is the exact same predicate used
in the prior Parallel-Reduction instruction.
 355
 356 * If there was only a single
 357   bit in the predicate then the result will not have moved or been altered
 358   from the source vector prior to the Reduction
 359 * If there was more than one bit the result will be in the
 360   first element with a predicate bit set.
 361
 362 In either case the result is in the element with the first bit set in
 363 the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
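
The behaviour described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the normative algorithm (which is in [[sv/remap/appendix]]): the function name `predicated_reduce` is hypothetical, and the use of a redirection table `ix` to skip masked-out elements without copying any data is an assumption about how such a Schedule may be built.

```python
def predicated_reduce(values, pred, op):
    """Tree-reduce `values` in place under mask `pred` (hypothetical model).

    No element is ever moved: masked-out positions are bypassed by
    redirecting indices, so the result stays in the first enabled element.
    """
    n = len(values)
    vec = list(values)
    ix = list(range(n))             # redirection table, initially identity
    step = 1
    while step < n:
        step *= 2
        for i in range(0, n, step):
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < n else None
            other_pred = other < n and pred[oi]
            if pred[ci] and other_pred:
                vec[ci] = op(vec[ci], vec[oi])  # both enabled: combine
            elif other_pred:
                ix[i] = oi          # left is masked out: refer to the right
    return vec


values = [1, 2, 3, 4, 5, 6]
pred = [False, True, False, True, True, False]
result = predicated_reduce(values, pred, lambda x, y: x + y)
print(result)
```

With this mask the enabled elements are 2, 4 and 5, and their sum ends up in element 1, the first element with its predicate bit set. Masked-out elements are neither read as operands nor overwritten.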
 364
 365 Programmer's Note: For *some* hardware implementations
 366 the vector-to-scalar copy may be a slow operation, as may the Predicated
 367 Parallel Reduction itself.
 368 It may be better to perform a pre-copy
 369 of the values, compressing them (VREDUCE-style) into a contiguous block,
 370 which will guarantee that the result goes into the very first element
 371 of the destination vector, in which case clearly no follow-up
 372 predicated vector-to-scalar MV operation is needed.
 373
 374 **Usage conditions**
 375
 376 The simplest usage is to perform an overwrite, specifying all three
 377 register operands the same.
 378
```
    svshape parallelreduce, 6
    sv.add *8, *8, *8
```
 383
 384 The Reduction Schedule will issue the Parallel Tree Reduction spanning
 385 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
 386 necessary (see "Parallel Reduction algorithm" in a later section).
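
For the unpredicated case, a Schedule of this kind can be sketched in Python as below. The function name `reduction_schedule` is hypothetical and the sketch assumes the default settings (step starting at 2 and doubling, result accumulating into the Left operand); the authoritative algorithm is the one given in the later section.

```python
def reduction_schedule(elements):
    """Return a list of (left, right) element offsets, one per operation.

    Hypothetical model of an unpredicated tree reduction: the distance
    between partners doubles on each pass over the Vector.
    """
    schedule = []
    step = 1
    while step < elements:
        step *= 2
        for left in range(0, elements, step):
            right = left + step // 2
            if right < elements:
                schedule.append((left, right))
    return schedule


schedule = reduction_schedule(6)
regs = [1, 2, 3, 4, 5, 6]               # stand-in for GPRs 8 through 13
for left, right in schedule:
    regs[left] = regs[left] + regs[right]   # models sv.add *8, *8, *8
print(len(schedule), schedule, regs)
```

Six elements need five operations, which illustrates why VL counts *operations* rather than elements. The elements other than the first hold either their original value or an intermediary result, and are not zeroed.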
 387
 388 A non-overwrite is possible as well but just as with the overwrite
 389 version, only those destination elements necessary for storing
 390 intermediary computations will be written to: the remaining elements
 391 will **not** be overwritten and will **not** be zero'd.
 392
```
    svshape parallelreduce, 6
    sv.add *0, *8, *8
```
 397
 398 However it is critical to note that if the source and destination are
 399 not the same then the trick of using a follow-up vector-scalar MV will
 400 not work.
 401
 402 ### Sub-Vector Horizontal Reduction
 403
 404 To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
 405 which will turn the Schedule around such that issuing of the Scalar
 406 Defined Words is done with SUBVL looping as the inner loop not the
 407 outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
 408
 409 ## Determining Register Hazards
 410
 411 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
 412 to be able to statically determine the extent of Vectors in order to
 413 allocate pre-emptive Hazard protection.  The next task is to eliminate
 414 masked-out elements using predicate bits, freeing up the associated
 415 Hazards.
 416
 417 For non-REMAP situations `VL` is sufficient to ascertain early
 418 Hazard coverage, and with SVSTATE being a high priority cached
 419 quantity at the same level of MSR and PC this is not a problem.
 420
 421 The problems come when REMAP is enabled.  Indexed REMAP must instead
 422 use `MAXVL` as the earliest (simplest)
 423 batch-level Hazard Reservation indicator (after taking element-width
 424 overriding on the Index source into consideration),
 425 but Matrix, FFT and Parallel Reduction must all use completely different
 426 schemes.  The reason is that VL is used to step through the total
 427 number of *operations*, not the number of registers.
 428 The "Saving Grace" is that all of the REMAP Schedules are 100% Deterministic.
 429
 430 Advance-notice Parallel computation and subsequent cacheing
 431 of all of these complex Deterministic REMAP Schedules is
 432 *strongly recommended*, thus allowing clear and precise multi-issue
 433 batched Hazard coverage to be deployed, *even for Indexed Mode*.
 434 This is only possible for Indexed due to the strict guidelines
 435 given to Programmers.
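
Because the Schedules are Deterministic, the set of registers that a REMAPed instruction will touch can be computed ahead of time. The Python sketch below shows the idea; the helper `hazard_extents`, the dictionary-based schedule format and the hand-written five-operation tree reduction are all illustrative assumptions, not part of the specification.

```python
def hazard_extents(bases, schedule):
    """Return, per operand, the sorted register numbers a schedule touches.

    bases:    dict of operand name -> first register of its Vector
    schedule: list of dicts of operand name -> element offset,
              one dict per operation (hypothetical format)
    """
    touched = {name: set() for name in bases}
    for offsets in schedule:
        for name, offset in offsets.items():
            touched[name].add(bases[name] + offset)
    return {name: sorted(regs) for name, regs in touched.items()}


# hand-written (left, right) pairs of a six-element tree reduction,
# assuming the result of each operation overwrites the left operand
pairs = [(0, 1), (2, 3), (4, 5), (0, 2), (0, 4)]
schedule = [{"RT": left, "RA": left, "RB": right} for left, right in pairs]
extents = hazard_extents({"RT": 8, "RA": 8, "RB": 8}, schedule)
print(len(schedule), extents)
```

In this example VL would be 5 (the number of operations), yet six registers are involved, and the Write-Hazards fall on only three of them. A reservation based on VL alone would therefore be both too small and too coarse.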
 436
In short, there exist solutions to the problem of Hazard Management,
with varying degrees of refinement possible at correspondingly
increasing levels of complexity in hardware.
 440
A reminder: when Rc=1 each result register (element) has an associated
co-result CR Field (one per result element).  Thus when determining
the Write-Hazards for result registers as above, the Write-Hazards for
the associated co-result CR Fields must not be forgotten, *including* when
Predication is used.
 446
 447 ## REMAP area of SVSTATE SPR
 448
 449 The following bits of the SVSTATE SPR are used for REMAP:
 450
|32.33|34.35|36.37|38.39|40.41| 42.46 | 62     |
| --  | --  | --  | --  | --  | ----- | ------ |
|mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst  |
 454
 455 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
 456 mi0-2 apply to RA, RB, RC respectively, as input registers, and
 457 likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
 458 SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
 459 SVSHAPE is actively applied or not.
 460
 461 * bit 0 of SVme indicates if mi0 is applied to RA / FRA / BA / BFA
 462 * bit 1 of SVme indicates if mi1 is applied to RB / FRB / BB
 463 * bit 2 of SVme indicates if mi2 is applied to RC / FRC / BC
 464 * bit 3 of SVme indicates if mo0 is applied to RT / FRT / BT / BF
 465 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
 466   (LD/ST-with-update has an implicit 2nd write register, RA)
 467
 468 The "persistence" bit if set will result in all Active REMAPs being applied
 469 indefinitely.
 470
 471 ----------------
 472
 473 \newpage{}
 474
 475 # svremap instruction <a name="svremap"> </a>
 476
SVRM-Form:

    svremap SVme,mi0,mi1,mi2,mo0,mo1,pst

|0     |6     |11  |13   |15   |17   |19   |21    | 22:25 |26:31  |
| --   | --   | -- | --  | --  | --  | --  | --   | ----  | ----- |
| PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd  | XO    |
 488
 489 Pseudo-code:
 490
```
    # registers RA RB RC RT EA/FRS SVSHAPE0-3 indices
    SVSTATE[32:33] <- mi0
    SVSTATE[34:35] <- mi1
    SVSTATE[36:37] <- mi2
    SVSTATE[38:39] <- mo0
    SVSTATE[40:41] <- mo1
    # enable bit for RA RB RC RT EA/FRS
    SVSTATE[42:46] <- SVme
    # persistence bit (applies to more than one instruction)
    SVSTATE[62] <- pst
```

Special Registers Altered:

```
    SVSTATE
```
 509
`svremap` determines the relationship between registers and SVSHAPE SPRs.
The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1
determine which shape is applied to an activated register.  The `pst` bit, if
cleared, indicates that the REMAP operation shall only apply to the immediately-following
instruction.  If set then REMAP remains permanently enabled until such time as it is
explicitly disabled, either by `setvl` setting a new MAXVL, or with another
`svremap` instruction. `svindex` and `svshape2` are also capable of setting or
clearing persistence, as well as partially covering a subset of the capability of
`svremap` to set register-to-SVSHAPE relationships.
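
The pseudo-code can be mirrored in Python to see which operand ends up associated with which SVSHAPE. This is a sketch under assumptions that should be checked against the SVSTATE definition: SVSTATE is modelled as a 64-bit value with Power ISA style numbering (bit 0 is the most significant bit), while bit 0 of `SVme` is taken to be its least significant bit, since that is the reading under which `SVme=15` enables RA, RB, RC and RT. The helper names are hypothetical.

```python
WIDTH = 64  # assumed width of SVSTATE


def set_field(reg, start, end, value):
    """Write `value` into bits start..end, counting bit 0 as the MSB."""
    nbits = end - start + 1
    shift = WIDTH - 1 - end
    mask = ((1 << nbits) - 1) << shift
    return (reg & ~mask) | ((value << shift) & mask)


def get_field(reg, start, end):
    """Read bits start..end, counting bit 0 as the MSB."""
    nbits = end - start + 1
    return (reg >> (WIDTH - 1 - end)) & ((1 << nbits) - 1)


def svremap(svstate, svme, mi0, mi1, mi2, mo0, mo1, pst):
    """Apply the field updates listed in the pseudo-code."""
    for start, end, value in ((32, 33, mi0), (34, 35, mi1), (36, 37, mi2),
                              (38, 39, mo0), (40, 41, mo1),
                              (42, 46, svme), (62, 62, pst)):
        svstate = set_field(svstate, start, end, value)
    return svstate


def active_shapes(svstate):
    """Map each enabled operand to the SVSHAPE number selected for it."""
    operands = (("RA", 32), ("RB", 34), ("RC", 36), ("RT", 38), ("RS/EA", 40))
    svme = get_field(svstate, 42, 46)
    return {name: get_field(svstate, start, start + 1)
            for bit, (name, start) in enumerate(operands)
            if (svme >> bit) & 1}


state = svremap(0, 15, 1, 2, 3, 0, 0, 0)
print(active_shapes(state))
```

Under these assumptions the example `svremap 15, 1, 2, 3, 0, 0, 0` associates RA with SVSHAPE1, RB with SVSHAPE2, RC with SVSHAPE3 and RT with SVSHAPE0, leaving RS inactive.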
 519
 520 Programmer's Note: applying non-persistent `svremap` to an instruction that has
 521 no REMAP enabled or is a Scalar operation will obviously have no effect but
 522 the bits 32 to 46 will at least have been set in SVSTATE. This may prove useful
 523 when using `svindex` or `svshape2`.
 524
 525 Hardware Architectural Note: when persistence is not set it is critically important
 526 to treat the `svremap` and the following SVP64 instruction as an indivisible fused operation.
 527 *No state* is stored in the SVSTATE SPR in order to allow continuation should an
 528 Interrupt occur between the two instructions. Thus, Interrupts must be prohibited
 529 from occurring or other workaround deployed.  When persistence is set this issue
 530 is moot.
 531
 532 It is critical to note that if persistence is clear then `svremap` is the *only* way
 533 to activate REMAP on any given (following) instruction.  If persistence is set however then
 534 **all** SVP64 instructions go through REMAP as long as `SVme` is non-zero.
 535
 536 -------------
 537
 538 \newpage{}
 539
 540 # SHAPE Remapping SPRs
 541
There are four "shape" SPRs, SVSHAPE0-3, each 32 bits wide and all
sharing the same format.

When an SVSHAPE is set entirely to zeros, remapping is
disabled: the register's elements are a linear (1D) vector.
 547
|31..30|29..28 |27..24| 23..21  | 20..18    | 17..12  |11..6 |5..0  | Mode  |
|----- |------ |------| ------- | --------- | ------- |----- |----- | ----- |
|0b00  |skip   |offset| invxyz  | permute   | zdimsz  |ydimsz|xdimsz|Matrix |
|0b00  |elwidth|offset|sk1/invxy|0b110/0b111| SVGPR   |ydimsz|xdimsz|Indexed|
|0b01  |submode|offset| invxyz  | submode2  | zdimsz  |mode  |xdimsz|DCT/FFT|
|0b10  |submode|offset| invxyz  | rsvd      | rsvd    |rsvd  |xdimsz|Preduce|
|0b11  |       |      |         |           |         |      |      |rsvd   |
 555
 556 mode sets different behaviours (straight matrix multiply, FFT, DCT).
 557
 558 * **mode=0b00** sets straight Matrix Mode
 559 * **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
 560 * **mode=0b01** sets "FFT/DCT" mode and activates submodes
 561 * **mode=0b10** sets "Parallel Reduction" Schedules.
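
The mode selection can be expressed as a short decoder. The Python below is a sketch only: it takes the bit positions in the table headings at face value, with bit 0 as the least significant bit of the 32-bit SPR, which is an assumption that should be verified against the simulator. The helper names are hypothetical.

```python
def bits(value, hi, lo):
    """Extract bits hi..lo inclusive (bit 0 assumed to be the LSB)."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def classify(svshape):
    """Return the REMAP mode name selected by a 32-bit SVSHAPE value."""
    mode = bits(svshape, 31, 30)
    permute = bits(svshape, 20, 18)
    if mode == 0b00:
        # Indexed Mode shares mode=0b00 and is selected by permute
        return "Indexed" if permute in (0b110, 0b111) else "Matrix"
    return {0b01: "DCT/FFT", 0b10: "Parallel Reduction", 0b11: "reserved"}[mode]


def matrix_dims(svshape):
    """Return (x, y, z) sizes: each stored field is one less than the size."""
    return (bits(svshape, 5, 0) + 1,
            bits(svshape, 11, 6) + 1,
            bits(svshape, 17, 12) + 1)


shape = 2 | (1 << 6)                    # xdimsz=2, ydimsz=1, zdimsz=0
print(classify(shape), matrix_dims(shape))
print(classify(shape | (0b110 << 18)))
```

An all-zeros SVSHAPE decodes as Matrix Mode with every dimension equal to 1, which is consistent with remapping being disabled in that case.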
 562
 563 ## Parallel Reduction Mode
 564
 565 Creates the Schedules for Parallel Tree Reduction.
 566
 567 * **submode=0b00** selects the left operand index
 568 * **submode=0b01** selects the right operand index
 569
* When bit 0 of `invxyz` is set, the order of the indices
  in the inner for-loop is reversed. This has the side-effect
  of placing the final reduced result in the last-predicated element.
  It also has the indirect side-effect of swapping the source
  registers: Left-operand index numbers will always exceed
  Right-operand indices.
  When clear, the reduced result will be in the first-predicated
  element, and Left-operand indices will always be *less* than
  Right-operand ones.
* When bit 1 of `invxyz` is set, the order of the outer loop
  step is inverted: stepping begins at the nearest power-of-two
  to half of the vector length and reduces by half each time.
  When clear the step will begin at 2 and double on each
  inner loop.
 584
 585 ## FFT/DCT mode
 586
submode2=0 is for FFT. For FFT submode the following schedules may be
selected:

* **submode=0b00** selects the ``j`` offset of the innermost for-loop
  of Cooley-Tukey
* **submode=0b10** selects the ``j+halfsize`` offset of the innermost for-loop
  of Cooley-Tukey
* **submode=0b11** selects the ``k`` of exptable (which coefficient)
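
The three quantities named above are the loop variables of the standard iterative radix-2 Cooley-Tukey transform. The Python sketch below generates them as three parallel sequences and then uses them to drive a butterfly, in the style of the Project Nayuki reference code. It is an illustration of the textbook algorithm only: the function names are hypothetical, the bit-reversed input ordering is done in software here, and nothing is implied about the order in which the hardware Schedule emits the indices.

```python
import cmath


def fft_schedule(n):
    """Yield (j, j + halfsize, k) for every butterfly of a size-n transform."""
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                yield j, j + halfsize, k
                k += tablestep
        size *= 2


def bit_reverse(x, nbits):
    """Reverse the lowest `nbits` bits of x."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def fft(values):
    """In-place style radix-2 FFT driven entirely by fft_schedule."""
    n = len(values)
    nbits = n.bit_length() - 1
    vec = [values[bit_reverse(i, nbits)] for i in range(n)]
    exptable = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    for lo, hi, k in fft_schedule(n):
        temp = vec[hi] * exptable[k]    # twin-result butterfly:
        vec[hi] = vec[lo] - temp        # both outputs depend on both inputs
        vec[lo] = vec[lo] + temp
    return vec


print(list(fft_schedule(4)))
```

Each butterfly reads and writes the same two positions, which is why twin-result instructions are needed for the transform to be performed in-place.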
 595
When submode2 is 1 or 2, for DCT inner butterfly submode the following
schedules may be selected.  When submode2 is 1, additional bit-reversing
is also performed.

* **submode=0b00** selects the ``j`` offset of the innermost for-loop,
  in-place
* **submode=0b01** selects the ``j+halfsize`` offset of the innermost for-loop,
  in reverse-order, in-place
* **submode=0b10** selects the ``ci`` count of the innermost for-loop,
  useful for calculating the cosine coefficient
* **submode=0b11** selects the ``size`` offset of the outermost for-loop,
  useful for the cosine coefficient ``cos((ci + 0.5) * pi / size)``
 608
When submode2 is 3 or 4, for DCT outer butterfly submode the following
schedules may be selected.  When submode2 is 3, additional bit-reversing
is also performed.

* **submode=0b00** selects the ``j`` offset of the innermost for-loop
* **submode=0b01** selects the ``j+1`` offset of the innermost for-loop
 615
 616 `zdimsz` is used as an in-place "Stride", particularly useful for
 617 column-based in-place DCT/FFT.
 618
 619 ## Matrix Mode
 620
In Matrix Mode, skip allows dimensions to be skipped from being included
in the resultant output index.  This allows sequences to be
repeated: ```0 0 0 1 1 1 2 2 2 ...``` or in the case of skip=0b11 this
results in modulo ```0 1 2 0 1 2 ...```
 625
 626 * **skip=0b00** indicates no dimensions to be skipped
 627 * **skip=0b01** sets "skip 1st dimension"
 628 * **skip=0b10** sets "skip 2nd dimension"
 629 * **skip=0b11** sets "skip 3rd dimension"
 630
 631 invxyz will invert the start index of each of x, y or z. If invxyz[0] is
 632 zero then x-dimensional counting begins from 0 and increments, otherwise
 633 it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.
 634
 635 offset will have the effect of offsetting the result by ```offset``` elements:
 636
 637 ```
 638     for i in 0..VL-1:
 639         GPR(RT + remap(i) + SVSHAPE.offset) = ....
 640 ```
 641
This appears redundant, because the register RT could simply be changed
by a compiler, until element width overrides are introduced.  Also
bear in mind that, unlike a value fixed by a static compiler,
SVSHAPE.offset may be set dynamically at runtime.
 645
xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the array dimensionality for that dimension is 1. Any dimension
not intended to be used must have its value set to 0 (dimensionality
of 1).  A value of xdimsz=2 would indicate that in the first dimension
there are 3 elements in the array.  For example, to create a 2D array
X,Y of dimensionality X=3 and Y=2, set xdimsz=2, ydimsz=1 and zdimsz=0.
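
The off-by-one encoding can be sketched in a few lines of Python. The helper
names `encode_dims` and `decode_dims` are hypothetical and only model the
arithmetic described above; they do not check the field widths.

```python
def encode_dims(x, y=1, z=1):
    """Return the (xdimsz, ydimsz, zdimsz) field values for an x*y*z array."""
    for size in (x, y, z):
        if size < 1:
            raise ValueError("each dimension must be at least 1")
    # each field stores "size minus one"; unused dimensions encode as 0
    return (x - 1, y - 1, z - 1)


def decode_dims(xdimsz, ydimsz, zdimsz):
    """Recover the actual array sizes from the stored field values."""
    return (xdimsz + 1, ydimsz + 1, zdimsz + 1)


print(encode_dims(3, 2))     # the X=3, Y=2 example from the text
print(decode_dims(2, 1, 0))
```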
 652
 653 The format of the array is therefore as follows:
 654
 655 ```
 656     array[xdimsz+1][ydimsz+1][zdimsz+1]
 657 ```
 658
 659 However whilst illustrative of the dimensionality, that does not take the
 660 "permute" setting into account.  "permute" may be any one of six values
 661 (0-5, with values of 6 and 7 indicating "Indexed" Mode).  The table
 662 below shows how the permutation dimensionality order works:
 663
 664 | permute | order | array format             |
 665 | ------- | ----- | ------------------------ |
 666 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 667 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 668 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 669 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 670 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 671 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 672 | 110     | 0,1   | Indexed (xdim+1)(ydim+1) |
 673 | 111     | 1,0   | Indexed (ydim+1)(xdim+1) |
 674
 675 In other words, the "permute" option changes the order in which
 676 nested for-loops over the array would be done.  See executable
 677 python reference code for further details.
 678
 679 *Note: permute=0b110 and permute=0b111 enable Indexed REMAP Mode,
 680 described below*
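
The permute table can be expressed directly as a lookup. The sketch below is
illustrative only: `PERMUTE_ORDER` and `array_format` are hypothetical names,
and the Indexed values (0b110, 0b111) are deliberately left out.

```python
# permute value -> dimension order, copied from the table above
PERMUTE_ORDER = {
    0b000: (0, 1, 2),
    0b001: (0, 2, 1),
    0b010: (1, 0, 2),
    0b011: (1, 2, 0),
    0b100: (2, 0, 1),
    0b101: (2, 1, 0),
}


def array_format(permute, xdimsz, ydimsz, zdimsz):
    """Return the dimension sizes in the order selected by permute."""
    sizes = (xdimsz + 1, ydimsz + 1, zdimsz + 1)  # fields are off-by-one
    return tuple(sizes[dim] for dim in PERMUTE_ORDER[permute])


# X=3, Y=2 array (xdimsz=2, ydimsz=1, zdimsz=0) with order 1,0,2
print(array_format(0b010, 2, 1, 0))
```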
 681
 682 With all these options it is possible to support in-place transpose,
 683 in-place rotate, Matrix Multiply and Convolutions, without being
 684 limited to Power-of-Two dimension sizes.
 685
 686 ## Indexed Mode
 687
 688 Indexed Mode activates reading of the element indices from the GPR
 689 and includes optional limited 2D reordering.
 690 In its simplest form (without elwidth overrides or other modes):
 691
 692 ```
 693     def index_remap(i):
 694         return GPR((SVSHAPE.SVGPR<<1)+i) + SVSHAPE.offset
 695
 696     for i in 0..VL-1:
 697         element_result = ....
        GPR(RT + index_remap(i)) = element_result
 699 ```
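
A runnable Python rendering of the simplest form is sketched below. It
assumes the register file can be modelled as a plain list of 128 integers
and ignores element-width overrides; the parameter names are illustrative.

```python
def index_remap(gpr, svgpr, offset, i):
    """Element index for step i: read it from the GPR file, add the offset."""
    return gpr[(svgpr << 1) + i] + offset


gpr = [0] * 128              # modelled register file (an assumption)
svgpr = 8                    # SVSHAPE.SVGPR: indices start at GPR 8<<1 = 16
gpr[16:20] = [3, 0, 2, 1]    # the Indices themselves
rt = 32                      # destination vector starts at GPR 32
results = [100, 101, 102, 103]

for i, element_result in enumerate(results):
    gpr[rt + index_remap(gpr, svgpr, 0, i)] = element_result

print(gpr[32:36])
```

Each result lands in the register selected by its Index, so the first result
(100) is written to `RT+3` and the second (101) to `RT+0`.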
 700
 701 With element-width overrides included, and using the pseudocode
 702 from the SVP64 [[sv/svp64/appendix#elwidth]] elwidth section
 703 this becomes:
 704
 705 ```
 706     def index_remap(i):
 707         svreg = SVSHAPE.SVGPR << 1
 708         srcwid = elwid_to_bitwidth(SVSHAPE.elwid)
 709         offs = SVSHAPE.offset
 710         return get_polymorphed_reg(svreg, srcwid, i) + offs
 711
 712     for i in 0..VL-1:
 713         element_result = ....
        rt_idx = index_remap(i)
 715         set_polymorphed_reg(RT, destwid, rt_idx, element_result)
 716 ```
 717
Matrix-style reordering still applies to the indices, except limited
to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
(Y,X) for in-place Transposition.
Only one dimension may optionally be skipped. Inversion of either
X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode
(including elwidth overrides) may be written in terms of Matrix Mode,
specifically purposed to ensure that the 3rd dimension (Z) has no effect:
 725
 726 ```
 727     def index_remap(ISHAPE, i):
 728         MSHAPE.skip   = 0b0 || ISHAPE.sk1
 729         MSHAPE.invxyz = 0b0 || ISHAPE.invxy
 730         MSHAPE.xdimsz = ISHAPE.xdimsz
 731         MSHAPE.ydimsz = ISHAPE.ydimsz
 732         MSHAPE.zdimsz = 0 # disabled
 733         if ISHAPE.permute = 0b110 # 0,1
 734            MSHAPE.permute = 0b000 # 0,1,2
 735         if ISHAPE.permute = 0b111 # 1,0
 736            MSHAPE.permute = 0b010 # 1,0,2
 737         el_idx = remap_matrix(MSHAPE, i)
 738         svreg = ISHAPE.SVGPR << 1
 739         srcwid = elwid_to_bitwidth(ISHAPE.elwid)
 740         offs = ISHAPE.offset
 741         return get_polymorphed_reg(svreg, srcwid, el_idx) + offs
 742 ```
 743
The most important observation above is that the Matrix-style
remapping occurs first and the Index lookup second.  Thus it
becomes possible to perform in-place Transpose of Indices which
may have been costly to set up or costly to duplicate
(wasting register file space).
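
The "Matrix first, Index second" ordering can be sketched in Python as
follows. This is a simplified stand-in, not the reference implementation:
`transposed_order` is a hypothetical helper that *assumes* the (Y,X)
ordering visits a row-major array of width `xdim` column by column, and
skip, inversion and elwidth overrides are all omitted.

```python
def transposed_order(xdim, ydim, i):
    """Assumed (Y,X) visiting order over a row-major array of width xdim."""
    i %= xdim * ydim
    x, y = divmod(i, ydim)       # y varies fastest in the transposed order
    return y * xdim + x


def index_remap_2d(indices, xdim, ydim, transposed, offset, i):
    # step 1: Matrix-style remapping of the loop counter
    if transposed:
        el_idx = transposed_order(xdim, ydim, i)
    else:
        el_idx = i % (xdim * ydim)
    # step 2: Index lookup using the remapped position
    return indices[el_idx] + offset


indices = [10, 11, 12, 13, 14, 15]   # one set of Indices, never duplicated
print([index_remap_2d(indices, 3, 2, True, 0, i) for i in range(6)])
```

The same six Indices are read in a different order without a second,
transposed copy having to be stored in the register file.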
 749
 750 -------------
 751
 752 \newpage{}
 753
 754 # svshape instruction  <a name="svshape"> </a>
 755
 756 SVM-Form
 757
 758     svshape SVxd,SVyd,SVzd,SVRM,vf
 759
 760 | 0:5|6:10  |11:15  |16:20  | 21:24  | 25 | 26:31 |  name    |
 761 | -- | --   | ---   | ----- | ------ | -- | ------| -------- |
 762 |PO  | SVxd | SVyd  | SVzd  | SVRM   | vf | XO    | svshape  |
 763
 764 ```
 765     # for convenience, VL to be calculated and stored in SVSTATE
 766     vlen <- [0] * 7
 767     mscale[0:5] <- 0b000001 # for scaling MAXVL
 768     itercount[0:6] <- [0] * 7
 769     SVSTATE[0:31] <- [0] * 32
 770     # only overwrite REMAP if "persistence" is zero
 771     if (SVSTATE[62] = 0b0) then
 772         SVSTATE[32:33] <- 0b00
 773         SVSTATE[34:35] <- 0b00
 774         SVSTATE[36:37] <- 0b00
 775         SVSTATE[38:39] <- 0b00
 776         SVSTATE[40:41] <- 0b00
 777         SVSTATE[42:46] <- 0b00000
 778         SVSTATE[62] <- 0b0
 779         SVSTATE[63] <- 0b0
 780     # clear out all SVSHAPEs
 781     SVSHAPE0[0:31] <- [0] * 32
 782     SVSHAPE1[0:31] <- [0] * 32
 783     SVSHAPE2[0:31] <- [0] * 32
 784     SVSHAPE3[0:31] <- [0] * 32
 785
 786     # set schedule up for multiply
 787     if (SVrm = 0b0000) then
 788         # VL in Matrix Multiply is xd*yd*zd
 789         xd <- (0b00 || SVxd) + 1
 790         yd <- (0b00 || SVyd) + 1
 791         zd <- (0b00 || SVzd) + 1
 792         n <- xd * yd * zd
 793         vlen[0:6] <- n[14:20]
 794         # set up template in SVSHAPE0, then copy to 1-3
 795         SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
 796         SVSHAPE0[6:11] <- (0b0 || SVyd)   # ydim
 797         SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim
 798         SVSHAPE0[28:29] <- 0b11           # skip z
 799         # copy
 800         SVSHAPE1[0:31] <- SVSHAPE0[0:31]
 801         SVSHAPE2[0:31] <- SVSHAPE0[0:31]
 802         SVSHAPE3[0:31] <- SVSHAPE0[0:31]
 803         # set up FRA
 804         SVSHAPE1[18:20] <- 0b001          # permute x,z,y
 805         SVSHAPE1[28:29] <- 0b01           # skip z
 806         # FRC
 807         SVSHAPE2[18:20] <- 0b001          # permute x,z,y
 808         SVSHAPE2[28:29] <- 0b11           # skip y
 809
 810     # set schedule up for FFT butterfly
 811     if (SVrm = 0b0001) then
 812         # calculate O(N log2 N)
 813         n <- [0] * 3
 814         do while n < 5
 815            if SVxd[4-n] = 0 then
 816                leave
 817            n <- n + 1
 818         n <- ((0b0 || SVxd) + 1) * n
 819         vlen[0:6] <- n[1:7]
 820         # set up template in SVSHAPE0, then copy to 1-3
 821         # for FRA and FRT
 822         SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
 823         SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D FFT)
 824         mscale <- (0b0 || SVzd) + 1
 825         SVSHAPE0[30:31] <- 0b01          # Butterfly mode
 826         # copy
 827         SVSHAPE1[0:31] <- SVSHAPE0[0:31]
 828         SVSHAPE2[0:31] <- SVSHAPE0[0:31]
 829         # set up FRB and FRS
 830         SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
 831         # FRC (coefficients)
 832         SVSHAPE2[28:29] <- 0b10           # k schedule
 833
 834     # set schedule up for (i)DCT Inner butterfly
 835     # SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
 836     if ((SVrm = 0b0100) |
 837         (SVrm = 0b1100)) then
 838         # calculate O(N log2 N)
 839         n <- [0] * 3
 840         do while n < 5
 841            if SVxd[4-n] = 0 then
 842                leave
 843            n <- n + 1
 844         n <- ((0b0 || SVxd) + 1) * n
 845         vlen[0:6] <- n[1:7]
 846         # set up template in SVSHAPE0, then copy to 1-3
 847         # set up FRB and FRS
 848         SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
 849         SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
 850         mscale <- (0b0 || SVzd) + 1
 851         if (SVrm = 0b1100) then
 852             SVSHAPE0[30:31] <- 0b11          # iDCT mode
 853             SVSHAPE0[18:20] <- 0b011         # iDCT Inner Butterfly sub-mode
 854         else
 855             SVSHAPE0[30:31] <- 0b01          # DCT mode
 856             SVSHAPE0[18:20] <- 0b001         # DCT Inner Butterfly sub-mode
 857             SVSHAPE0[21:23] <- 0b001         # "inverse" on outer loop
 858         SVSHAPE0[6:11] <- 0b000011       # (i)DCT Inner Butterfly mode 4
 859         # copy
 860         SVSHAPE1[0:31] <- SVSHAPE0[0:31]
 861         SVSHAPE2[0:31] <- SVSHAPE0[0:31]
 862         if (SVrm != 0b0100) & (SVrm != 0b1100) then
 863             SVSHAPE3[0:31] <- SVSHAPE0[0:31]
 864         # for FRA and FRT
 865         SVSHAPE0[28:29] <- 0b01           # j+halfstep schedule
 866         # for cos coefficient
 867         SVSHAPE2[28:29] <- 0b10           # ci (k for mode 4) schedule
 868         SVSHAPE2[12:17] <- 0b000000       # reset costable "striding" to 1
 869         if (SVrm != 0b0100) & (SVrm != 0b1100) then
 870             SVSHAPE3[28:29] <- 0b11           # size schedule
 871
 872     # set schedule up for (i)DCT Outer butterfly
 873     if (SVrm = 0b0011) | (SVrm = 0b1011) then
 874         # calculate O(N log2 N) number of outer butterfly overlapping adds
 875         vlen[0:6] <- [0] * 7
 876         n <- 0b000
 877         size <- 0b0000001
 878         itercount[0:6] <- (0b00 || SVxd) + 0b0000001
 879         itercount[0:6] <- (0b0 || itercount[0:5])
 880         do while n < 5
 881            if SVxd[4-n] = 0 then
 882                leave
 883            n <- n + 1
 884            count <- (itercount - 0b0000001) * size
 885            vlen[0:6] <- vlen + count[7:13]
 886            size[0:6] <- (size[1:6] || 0b0)
 887            itercount[0:6] <- (0b0 || itercount[0:5])
 888         # set up template in SVSHAPE0, then copy to 1-3
 889         # set up FRB and FRS
 890         SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
 891         SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
 892         mscale <- (0b0 || SVzd) + 1
 893         if (SVrm = 0b1011) then
 894             SVSHAPE0[30:31] <- 0b11      # iDCT mode
 895             SVSHAPE0[18:20] <- 0b011     # iDCT Outer Butterfly sub-mode
 896             SVSHAPE0[21:23] <- 0b101     # "inverse" on outer and inner loop
 897         else
 898             SVSHAPE0[30:31] <- 0b01      # DCT mode
 899             SVSHAPE0[18:20] <- 0b100     # DCT Outer Butterfly sub-mode
 900         SVSHAPE0[6:11] <- 0b000010       # DCT Butterfly mode
 901         # copy
 902         SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
 903         SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
 904         # for FRA and FRT
 905         SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
 906         # reset costable "striding" to 1
 907         SVSHAPE2[12:17] <- 0b000000
 908
 909     # set schedule up for DCT COS table generation
 910     if (SVrm = 0b0101) | (SVrm = 0b1101) then
 911         # calculate O(N log2 N)
 912         vlen[0:6] <- [0] * 7
 913         itercount[0:6] <- (0b00 || SVxd) + 0b0000001
 914         itercount[0:6] <- (0b0 || itercount[0:5])
 915         n <- [0] * 3
 916         do while n < 5
 917            if SVxd[4-n] = 0 then
 918                leave
 919            n <- n + 1
 920            vlen[0:6] <- vlen + itercount
 921            itercount[0:6] <- (0b0 || itercount[0:5])
 922         # set up template in SVSHAPE0, then copy to 1-3
 923         # set up FRB and FRS
 924         SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
 925         SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
 926         mscale <- (0b0 || SVzd) + 1
 927         SVSHAPE0[30:31] <- 0b01          # DCT/FFT mode
 928         SVSHAPE0[6:11] <- 0b000100       # DCT Inner Butterfly COS-gen mode
 929         if (SVrm = 0b0101) then
 930             SVSHAPE0[21:23] <- 0b001     # "inverse" on outer loop for DCT
 931         # copy
 932         SVSHAPE1[0:31] <- SVSHAPE0[0:31]
 933         SVSHAPE2[0:31] <- SVSHAPE0[0:31]
 934         # for cos coefficient
 935         SVSHAPE1[28:29] <- 0b10           # ci schedule
 936         SVSHAPE2[28:29] <- 0b11           # size schedule
 937
 938     # set schedule up for iDCT / DCT inverse of half-swapped ordering
 939     if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
 940         vlen[0:6] <- (0b00 || SVxd) + 0b0000001
 941         # set up template in SVSHAPE0
 942         SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
 943         SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
 944         mscale <- (0b0 || SVzd) + 1
 945         if (SVrm = 0b1110) then
 946             SVSHAPE0[18:20] <- 0b001     # DCT opposite half-swap
 947         if (SVrm = 0b1111) then
 948             SVSHAPE0[30:31] <- 0b01          # FFT mode
 949         else
 950             SVSHAPE0[30:31] <- 0b11          # DCT mode
 951         SVSHAPE0[6:11] <- 0b000101       # DCT "half-swap" mode
 952
 953     # set schedule up for parallel reduction
 954     if (SVrm = 0b0111) then
 955         # calculate the total number of operations (brute-force)
 956         vlen[0:6] <- [0] * 7
 957         itercount[0:6] <- (0b00 || SVxd) + 0b0000001
 958         step[0:6] <- 0b0000001
 959         i[0:6] <- 0b0000000
 960         do while step <u itercount
 961             newstep <- step[1:6] || 0b0
 962             j[0:6] <- 0b0000000
 963             do while (j+step <u itercount)
 964                 j <- j + newstep
 965                 i <- i + 1
 966             step <- newstep
 967         # VL in Parallel-Reduce is the number of operations
 968         vlen[0:6] <- i
 969         # set up template in SVSHAPE0, then copy to 1. only 2 needed
 970         SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
 971         SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
 972         mscale <- (0b0 || SVzd) + 1
 973         SVSHAPE0[30:31] <- 0b10          # parallel reduce submode
 974         # copy
 975         SVSHAPE1[0:31] <- SVSHAPE0[0:31]
 976         # set up right operand (left operand 28:29 is zero)
 977         SVSHAPE1[28:29] <- 0b01           # right operand
 978
 979     # set VL, MVL and Vertical-First
 980     m[0:12] <- vlen * mscale
 981     maxvl[0:6] <- m[6:12]
    SVSTATE[0:6] <- maxvl  # MAXVL
 983     SVSTATE[7:13] <- vlen  # VL
 984     SVSTATE[63] <- vf
 985 ```
 986
 987 Special Registers Altered:
 988
 989 ```
 990     SVSTATE, SVSHAPE0-3
 991 ```
 992
 993 `svshape` is a convenience instruction that reduces instruction
 994 count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
 995 (overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
 996 including VL and MAXVL. Using `svshape` therefore does not also
 997 require `setvl`.
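
The VL calculations for Matrix and FFT Butterfly modes can be modelled in
Python as below. This is a sketch under stated assumptions: the function
names are hypothetical, arguments are the raw (off-by-one) field values, and
the bit-slices `n[14:20]` and `n[1:7]` are assumed to select, respectively,
the low 7 bits of a 21-bit product and a 9-bit product with its least
significant bit dropped.

```python
def matrix_vl(svxd, svyd, svzd):
    """VL for SVRM=0b0000: the product of the three dimensions."""
    n = (svxd + 1) * (svyd + 1) * (svzd + 1)
    return n & 0x7F                      # vlen is a 7-bit quantity


def trailing_ones(svxd, limit=5):
    """Count consecutive 1 bits of SVxd, starting from the LSB."""
    n = 0
    while n < limit and (svxd >> n) & 1:
        n += 1
    return n


def fft_vl(svxd):
    """VL for SVRM=0b0001: (xdim * log2(xdim)) / 2 for power-of-two xdim."""
    n = (svxd + 1) * trailing_ones(svxd)
    return (n >> 1) & 0x7F               # dropping the LSB halves the product


print(matrix_vl(2, 1, 0))   # 3 x 2 x 1 array
print(fft_vl(7))            # 8-point FFT: 3 stages of 4 butterflies
```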
 998
 999 Fields:
1000
1001 * **SVxd** - SV REMAP "xdim"
1002 * **SVyd** - SV REMAP "ydim"
1003 * **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT etc.)
1005 * **vf** - sets "Vertical-First" mode
1006 * **XO** - standard 6-bit XO field
1007
*Note: SVxd, SVyd and SVzd are all stored "off-by-one".  In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
1010
1011 There are 12 REMAP Modes (2 Modes are RESERVED for `svshape2`, 2 Modes
1012 are RESERVED)
1013
1014 | SVRM   | Remap Mode description |
1015 | --     | --              |
1016 | 0b0000 | Matrix 1/2/3D    |
1017 | 0b0001 | FFT Butterfly   |
1018 | 0b0010 | reserved |
1019 | 0b0011 | DCT Outer butterfly  |
1020 | 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
1021 | 0b0101 | DCT COS table index generation |
1022 | 0b0110 | DCT half-swap   |
1023 | 0b0111 | Parallel Reduction |
1024 | 0b1000 | reserved for svshape2 |
1025 | 0b1001 | reserved for svshape2 |
1026 | 0b1010 | reserved |
1027 | 0b1011 | iDCT Outer butterfly  |
1028 | 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
1029 | 0b1101 | iDCT COS table index generation |
1030 | 0b1110 | iDCT half-swap   |
1031 | 0b1111 | FFT half-swap   |
1032
Examples showing how all of these Modes operate exist in the online
[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD).
Explaining these Modes further in detail is beyond the scope of this document.
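
As one small illustration, the brute-force operation count used for Parallel
Reduction (SVRM=0b0111) in the pseudocode above can be transcribed into
Python. The function name is hypothetical and the argument is the raw
(off-by-one) SVxd field value.

```python
def parallel_reduce_vl(svxd):
    """Count the operations in a tree reduction of svxd+1 elements."""
    itercount = svxd + 1
    step = 1
    ops = 0
    while step < itercount:
        newstep = step * 2
        j = 0
        # one operation combines element j with element j+step
        while j + step < itercount:
            j += newstep
            ops += 1
        step = newstep
    return ops


print(parallel_reduce_vl(3))   # reducing 4 elements
```

Because every operation combines two partial results into one, reducing N
elements always takes N-1 operations, which is the value stored in VL.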
1036
1037 In Indexed Mode, there are only 5 bits available to specify the GPR
1038 to use, out of 128 GPRs (7 bit numbering).  Therefore, only the top
1039 5 bits are given in the `SVxd` field: the bottom two implicit bits
1040 will be zero (`SVxd || 0b00`).
1041
1042 `svshape` has *limited applicability* due to being a 32-bit instruction.
1043 The full capability of SVSHAPE SPRs may be accessed by directly writing
1044 to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
1045 larger than 32, and in-place Transpose.  Potentially a future v3.1 Prefixed
1046 instruction, `psvshape`, may extend the capability here.
1047
1048 *Architectural Resource Allocation note: the SVRM field is carefully
1049 crafted to allocate two Modes, corresponding to bits 21-23 within the
1050 instruction being set to the value `0b100`, to `svshape2` (not
1051 `svshape`). These two Modes are
1052 considered "RESERVED" within the context of `svshape` but it is
1053 absolutely critical to allocate the exact same pattern in XO for
1054 both instructions in bits 26-31.*
1055
1056 -------------
1057
1058 \newpage{}
1059
1060
1061 # svindex instruction  <a name="svindex"> </a>
1062
1063 SVI-Form
1064
1065 | 0:5|6:10 |11:15  |16:20 | 21:25       | 26:31 |  Form    |
1066 | -- | --  | ---   | ---- | ----------- | ------| -------- |
1067 | PO | SVG | rmm   | SVd  | ew/yx/mm/sk | XO    | SVI-Form |
1068
1069 * svindex SVG,rmm,SVd,ew,SVyx,mm,sk
1070
1071 Pseudo-code:
1072
1073 ```
1074     # based on nearest MAXVL compute other dimension
1075     MVL <- SVSTATE[0:6]
1076     d <- [0] * 6
1077     dim <- SVd+1
1078     do while d*dim <u ([0]*4 || MVL)
1079        d <- d + 1
1080
1081     # set up template, then copy once location identified
1082     shape <- [0]*32
1083     shape[30:31] <- 0b00            # mode
1084     if SVyx = 0 then
1085         shape[18:20] <- 0b110       # indexed xd/yd
1086         shape[0:5] <- (0b0 || SVd)  # xdim
1087         if sk = 0 then shape[6:11] <- 0 # ydim
1088         else           shape[6:11] <- 0b111111 # ydim max
1089     else
1090         shape[18:20] <- 0b111       # indexed yd/xd
1091         if sk = 1 then shape[6:11] <- 0 # ydim
1092         else           shape[6:11] <- d-1 # ydim max
1093         shape[0:5] <- (0b0 || SVd) # ydim
1094     shape[12:17] <- (0b0 || SVG)        # SVGPR
1095     shape[28:29] <- ew                  # element-width override
1096     shape[21] <- sk                     # skip 1st dimension
1097
1098     # select the mode for updating SVSHAPEs
1099     SVSTATE[62] <- mm # set or clear persistence
1100     if mm = 0 then
1101         # clear out all SVSHAPEs first
1102         SVSHAPE0[0:31] <- [0] * 32
1103         SVSHAPE1[0:31] <- [0] * 32
1104         SVSHAPE2[0:31] <- [0] * 32
1105         SVSHAPE3[0:31] <- [0] * 32
1106         SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
1107         SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
1108         idx <- 0
1109         for bit = 0 to 4
1110             if rmm[4-bit] then
1111                 # activate requested shape
1112                 if idx = 0 then SVSHAPE0 <- shape
1113                 if idx = 1 then SVSHAPE1 <- shape
1114                 if idx = 2 then SVSHAPE2 <- shape
1115                 if idx = 3 then SVSHAPE3 <- shape
1116                 SVSTATE[bit*2+32:bit*2+33] <- idx
1117                 # increment shape index, modulo 4
1118                 if idx = 3 then idx <- 0
1119                 else            idx <- idx + 1
1120     else
1121         # refined SVSHAPE/REMAP update mode
1122         bit <- rmm[0:2]
1123         idx <- rmm[3:4]
1124         if idx = 0 then SVSHAPE0 <- shape
1125         if idx = 1 then SVSHAPE1 <- shape
1126         if idx = 2 then SVSHAPE2 <- shape
1127         if idx = 3 then SVSHAPE3 <- shape
1128         SVSTATE[bit*2+32:bit*2+33] <- idx
1129         SVSTATE[46-bit] <- 1
1130 ```
1131
1132 Special Registers Altered:
1133
1134 ```
1135     SVSTATE, SVSHAPE0-3
1136 ```
1137
1138 `svindex` is a convenience instruction that reduces instruction count
1139 for Indexed REMAP Mode. It sets up (overwrites) all required SVSHAPE
1140 SPRs and **unlike** `svshape` can modify the REMAP area of the SVSTATE
1141 SPR as well, including setting persistence.  The relevant SPRs *may*
1142 be directly programmed with `mtspr` however it is laborious to do so:
1143 svindex saves instructions covering much of Indexed REMAP capability.
1144
1145 Fields:
1146
1147 * **SVd** - SV REMAP x/y dim
1148 * **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs,
1149   controlled by mm
1150 * **ew** - sets element width override on the Indices
1151 * **SVG** - GPR SVG<<2 to be used for Indexing
1152 * **yx** - 2D reordering to be used if yx=1
1153 * **mm** - mask mode. determines how `rmm` is interpreted.
1154 * **sk** - Dimension skipping enabled
1155
*Note: SVd, like SVxd, SVyd and SVzd of `svshape`, is stored
"off-by-one".  In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*.
1159
1160 *Note: when `yx=1,sk=0` the second dimension is calculated as
1161 `CEIL(MAXVL/SVd)`*.
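
The `do while` loop in the pseudocode is a multiplication-only way of
computing that ceiling. A minimal Python equivalent, with the hypothetical
name `second_dimension` and `dim` given as the actual dimension size (the
field value plus one), is:

```python
def second_dimension(maxvl, dim):
    """Smallest d such that d*dim >= maxvl, i.e. CEIL(maxvl/dim)."""
    d = 0
    while d * dim < maxvl:
        d += 1
    return d


print(second_dimension(10, 4))   # 10 elements in rows of 4 need 3 rows
```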
1162
1163 When `mm=0`:
1164
* `rmm`, like REMAP.SVme, has bit 0
  correspond to mi0, bit 1 to mi1, bit 2 to mi2,
  bit 3 to mo0 and bit 4 to mo1
* all SVSHAPEs and the REMAP parts of SVSTATE are first reset (initialised to zero)
1169 * for each bit set in the 5-bit `rmm`, in order, the first
1170   as-yet-unset SVSHAPE will be updated
1171   with the other operands in the instruction, and the REMAP
1172   SPR set.
1173 * If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0.
1174 * SVSTATE persistence bit is cleared
1175 * No other alterations to SVSTATE are carried out
1176
Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
uses SVSHAPE1.  REMAP.SVme is also set to 0b00110, REMAP.mi1=0
(SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)
1181
1182 Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1
1183 are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
1184 and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1
1185
1186 Rough algorithmic form:
1187
1188 ```
1189     marray = [mi0, mi1, mi2, mo0, mo1]
1190     idx = 0
1191     for bit = 0 to 4:
1192         if not rmm[bit]: continue
1193         setup(SVSHAPE[idx])
1194         SVSTATE{marray[bit]} = idx
1195         idx = (idx+1) modulo 4
1196 ```
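
The same allocation can be written as runnable Python, which makes the two
examples above easy to check. The sketch only models which SVSHAPE each
REMAP slot is given; `svindex_mm0` is a hypothetical name, and bit 0 of
`rmm` is taken to be the least significant bit, as in the examples.

```python
MARRAY = ["mi0", "mi1", "mi2", "mo0", "mo1"]


def svindex_mm0(rmm):
    """Map each enabled REMAP slot to an SVSHAPE number (mm=0 behaviour)."""
    assignment = {}
    idx = 0
    for bit, name in enumerate(MARRAY):
        if not (rmm >> bit) & 1:
            continue
        assignment[name] = idx       # this slot uses SVSHAPE<idx>
        idx = (idx + 1) % 4          # only four SVSHAPEs exist
    return assignment


print(svindex_mm0(0b00110))   # Example 1
print(svindex_mm0(0b10001))   # Example 2
```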
1197
1198 When `mm=1`:
1199
1200 * bits 0-2 (MSB0 numbering) of `rmm` indicate an index selecting mi0-mo1
1201 * bits 3-4 (MSB0 numbering) of `rmm` indicate which SVSHAPE 0-3 shall
1202   be updated
1203 * only the selected SVSHAPE is overwritten
1204 * only the relevant bits in the REMAP area of SVSTATE are updated
1205 * REMAP persistence bit is set.
1206
1207 Example 1: if `rmm`=0b01110 then bits 0-2 (MSB0) are 0b011 and
1208 bits 3-4 are 0b10. thus, mo0 is selected and SVSHAPE2
1209 to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
1210 set to 2 (SVSHAPE2).
1211
1212 Example 2: if `rmm`=0b10011 then bits 0-2 (MSB0) are 0b100 and
1213 bits 3-4 are 0b11.  thus, mo1 is selected and SVSHAPE3
1214 to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
1215 set to 3 (SVSHAPE3).
1216
1217 Rough algorithmic form:
1218
1219 ```
1220     marray = [mi0, mi1, mi2, mo0, mo1]
1221     bit = rmm[0:2]
1222     idx = rmm[3:4]
1223     setup(SVSHAPE[idx])
1224     SVSTATE{marray[bit]} = idx
1225     SVSTATE.pst = 1
1226 ```
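
A Python sketch of the `mm=1` field split follows. The name `svindex_mm1` is
hypothetical, and raising an error for selector values 5-7 is purely a
choice made for this sketch, since only mi0-mo1 are described here.

```python
MARRAY = ["mi0", "mi1", "mi2", "mo0", "mo1"]


def svindex_mm1(rmm):
    """Return (REMAP slot name, SVSHAPE number) selected by rmm when mm=1."""
    bit = (rmm >> 2) & 0b111     # bits 0-2 in MSB0 numbering
    idx = rmm & 0b11             # bits 3-4 in MSB0 numbering
    if bit >= len(MARRAY):
        raise ValueError("selector does not name one of mi0-mo1")
    return MARRAY[bit], idx


print(svindex_mm1(0b01110))   # Example 1
print(svindex_mm1(0b10011))   # Example 2
```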
1227
1228 In essence, `mm=0` is intended for use to set as much of the
1229 REMAP State SPRs as practical with a single instruction,
1230 whilst `mm=1` is intended to be a little more refined.
1231
1232 **Usage guidelines**
1233
* **Disable 2D mapping**: to only perform Indexing without
 reordering use `SVd=1,sk=0,yx=0` (or set SVd to a value greater
 than or equal to VL)
1237 * **Modulo 1D mapping**: to perform Indexing cycling through the
1238  first N Indices use `SVd=N,sk=0,yx=0` where `VL>N`. There is
1239  no requirement to set VL equal to a multiple of N.
1240 * **Modulo 2D transposed**: `SVd=M,sk=0,yx=1`, sets
1241  `xdim=M,ydim=CEIL(MAXVL/M)`.
1242
1243 Beyond these mappings it becomes necessary to write directly to
1244 the SVSTATE SPRs manually.
1245
1246 -------------
1247
1248 \newpage{}
1249
1250
1251 # svshape2 (offset-priority) <a name="svshape2"> </a>
1252
1253 SVM2-Form
1254
1255 | 0:5|6:9 |10|11:15  |16:20  | 21:24  | 25 | 26:31 |  Form      |
1256 | -- |----|--| ---   | ----- | ------ | -- | ------| --------   |
1257 | PO |offs|yx| rmm   | SVd   | 100/mm | sk | XO    | SVM2-Form  |
1258
1259 * svshape2 offs,yx,rmm,SVd,sk,mm
1260
1261 Pseudo-code:
1262
1263 ```
1264     # based on nearest MAXVL compute other dimension
1265     MVL <- SVSTATE[0:6]
1266     d <- [0] * 6
1267     dim <- SVd+1
1268     do while d*dim <u ([0]*4 || MVL)
1269        d <- d + 1
1270     # set up template, then copy once location identified
1271     shape <- [0]*32
1272     shape[30:31] <- 0b00            # mode
1273     shape[0:5] <- (0b0 || SVd)      # x/ydim
1274     if SVyx = 0 then
1275         shape[18:20] <- 0b000       # ordering xd/yd(/zd)
1276         if sk = 0 then shape[6:11] <- 0 # ydim
1277         else           shape[6:11] <- 0b111111 # ydim max
1278     else
1279         shape[18:20] <- 0b010       # ordering yd/xd(/zd)
1280         if sk = 1 then shape[6:11] <- 0 # ydim
1281         else           shape[6:11] <- d-1 # ydim max
1282     # offset (the prime purpose of this instruction)
1283     shape[24:27] <- SVo         # offset
1284     if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
1285     else           shape[28:29] <- 0b00 # no skipping
1286     # select the mode for updating SVSHAPEs
1287     SVSTATE[62] <- mm # set or clear persistence
1288     if mm = 0 then
1289         # clear out all SVSHAPEs first
1290         SVSHAPE0[0:31] <- [0] * 32
1291         SVSHAPE1[0:31] <- [0] * 32
1292         SVSHAPE2[0:31] <- [0] * 32
1293         SVSHAPE3[0:31] <- [0] * 32
1294         SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
1295         SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
1296         idx <- 0
1297         for bit = 0 to 4
1298             if rmm[4-bit] then
1299                 # activate requested shape
1300                 if idx = 0 then SVSHAPE0 <- shape
1301                 if idx = 1 then SVSHAPE1 <- shape
1302                 if idx = 2 then SVSHAPE2 <- shape
1303                 if idx = 3 then SVSHAPE3 <- shape
1304                 SVSTATE[bit*2+32:bit*2+33] <- idx
1305                 # increment shape index, modulo 4
1306                 if idx = 3 then idx <- 0
1307                 else            idx <- idx + 1
1308     else
1309         # refined SVSHAPE/REMAP update mode
1310         bit <- rmm[0:2]
1311         idx <- rmm[3:4]
1312         if idx = 0 then SVSHAPE0 <- shape
1313         if idx = 1 then SVSHAPE1 <- shape
1314         if idx = 2 then SVSHAPE2 <- shape
1315         if idx = 3 then SVSHAPE3 <- shape
1316         SVSTATE[bit*2+32:bit*2+33] <- idx
1317         SVSTATE[46-bit] <- 1
1318 ```
1319
1320 Special Registers Altered:
1321
1322 ```
1323     SVSTATE, SVSHAPE0-3
1324 ```
1325
`svshape2` is an additional convenience instruction that prioritises
setting `SVSHAPE.offset`. Its primary purpose is for use when
element-width overrides are used. It has identical capabilities to
`svindex` in terms of both options (skip, etc.) and ability to activate
REMAP (rmm, mask mode) but unlike `svindex` it does not set GPR REMAP,
only a 1D or 2D `svshape`, and
unlike `svshape` it can set an arbitrary `SVSHAPE.offset` immediate.
1333
One of the limitations of Simple-V is that Vector elements start on the boundary
of the Scalar regfile, which is fine when element-width overrides are not
needed. If the starting point of a Vector with smaller elwidths must begin
in the middle of a register, normally there would be no way to do so except
through LD/ST.  `SVSHAPE.offset` caters for this scenario and `svshape2`
makes it easier to use.
1340
1341 **Operand Fields**:
1342
1343 * **offs** (4 bits) - unsigned offset
1344 * **yx** (1 bit) - swap XY to YX
1345 * **SVd** dimension size
1346 * **rmm** REMAP mask
1347 * **mm** mask mode
1348 * **sk** (1 bit) skips 1st dimension if set
1349
1350 Dimensions are calculated exactly as `svindex`. `rmm` and
1351 `mm` are as per `svindex`.
1352
*Programmer's Note: offsets for `svshape2` may be specified in the range
0-15. Given that the principle of Simple-V is to fit on top of
byte-addressable register files and that GPR and FPR are 64-bit (8 bytes)
it should be clear that the offset may, when `elwidth=8`, begin an
element-level operation starting element zero at any arbitrary byte.
On cursory examination attempting to go beyond the range 0-7 seems
unnecessary given that the **next GPR or FPR** is an
alias for an offset in the range 8-15.  Thus by simply increasing
the starting Vector point of the operation to the next register it
can be seen that the offset of 0-7 would be sufficient.  Unfortunately,
however, some operations are EXTRA2-encoded, in which case it is
**not possible** to increase the GPR/FPR register number by one, because
EXTRA2-encoding of GPR/FPR Vector numbers is restricted to even numbering.
For CR Fields the EXTRA2 encoding is even more sparse.
The additional offset range (8-15) helps overcome these limitations.*
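
The aliasing described in the note can be made concrete with a short Python
sketch. It assumes that elements are packed contiguously from byte 0 of the
starting register, that the offset counts elements, and that `elwidth=8`
unless stated otherwise; `element_location` is a hypothetical helper, not
part of the specification.

```python
REG_BYTES = 8   # GPRs and FPRs are 64-bit


def element_location(rt, offset, i, elwidth_bits=8):
    """Return (register number, byte within register) of element i."""
    el_bytes = elwidth_bits // 8
    byte_addr = rt * REG_BYTES + (offset + i) * el_bytes
    return divmod(byte_addr, REG_BYTES)


# offset 9 from r4 reaches the same byte as offset 1 from r5,
# which matters when r5 itself cannot be encoded as a Vector start
print(element_location(4, 9, 0))
print(element_location(5, 1, 0))
```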
1368
1369 *Hardware Implementor's note: with the offsets only being immediates
1370 and with register numbering being entirely immediate as well it is
1371 possible to correctly compute Register Hazards without requiring
1372 reading the contents of any SPRs.  If however there are
1373 instructions that have directly written to the SVSTATE or SVSHAPE
1374 SPRs and those instructions are still in-flight then this position
1375 is clearly **invalid**. This is why Programmers are strongly
1376 discouraged from directly writing to these SPRs.*
1377
*Architectural Resource Allocation note: this instruction shares
the space of `svshape`. Therefore it is critical that the two
instructions, `svshape` and `svshape2`, have the exact same XO
in bits 26 thru 31.  It is also critical that for `svshape2`,
bit 21 of the instruction is a 1, bit 22 is a 0, and bit 23 is a 0.*
1383
1384 [[!tag standards]]
1385
1386 -------------
1387
1388 \newpage{}
1389