# RFC ls009 Simple-V REMAP Subsystem
2
3 **URLs**:
4
5 * <https://libre-soc.org/openpower/sv/>
6 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
8 * <https://git.openpower.foundation/isa/PowerISA/issues/87>
9
10 **Severity**: Major
11
12 **Status**: New
13
14 **Date**: 26 Mar 2023
15
16 **Target**: v3.2B
17
18 **Source**: v3.0B
19
20 **Books and Section affected**:
21
22 ```
23 Book I, new Zero-Overhead-Loop Chapter.
24 Appendix E Power ISA sorted by opcode
25 Appendix F Power ISA sorted by version
26 Appendix G Power ISA sorted by Compliancy Subset
27 Appendix H Power ISA sorted by mnemonic
28 ```
29
30 **Summary**
31
32 ```
33 svremap - Re-Mapping of Register Element Offsets
34 svindex - General-purpose setting of SHAPEs to be re-mapped
35 svshape - Hardware-level setting of SHAPEs for element re-mapping
36 svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
37 ```
38
39 **Submitter**: Luke Leighton (Libre-SOC)
40
41 **Requester**: Libre-SOC
42
43 **Impact on processor**:
44
45 ```
Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector
Management Instructions which provide advanced features such as Matrix,
FFT and DCT Hardware-Assist Schedules and general-purpose Index reordering.
49 ```
50
51 **Impact on software**:
52
53 ```
54 Requires support for new instructions in assembler, debuggers,
55 and related tools.
56 ```
57
58 **Keywords**:
59
60 ```
61 Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
62 Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
63 Digital Signal Processing (DSP)
64 ```
65
66 **Motivation**
67
These REMAP Management instructions provide state-of-the-art advanced capabilities
to dramatically decrease instruction count and power consumption whilst retaining
unprecedented general-purpose capability and a standard Sequential Execution Model.
71
72 **Notes and Observations**:
73
74 1. TODO
75
76 **Changes**
77
78 Add the following entries to:
79
80 * the Appendices of Book I
81 * Instructions of Book I as a new Section
82 * TODO-Form of Book I Section 1.6.1.6 and 1.6.2
83
84 ----------------
85
86 \newpage{}
87
88 # REMAP <a name="remap" />
89
90 REMAP is an advanced form of Vector "Structure Packing" that
91 provides hardware-level support for commonly-used *nested* loop patterns
92 that would otherwise require full inline loop unrolling.
93 For more general reordering an Indexed REMAP mode is available
94 (an abstracted analog to `xxperm`).
95
96 REMAP allows the usual sequential vector loop `0..VL-1` to be "reshaped" (re-mapped)
97 from a linear form to a 2D or 3D transposed form, or "offset" to permit
98 arbitrary access to elements (when elwidth overrides are used),
99 independently on each Vector src or dest
100 register. Aside from Indexed REMAP this is entirely Hardware-accelerated
101 reordering and consequently not costly in terms of register access. It
102 will however place a burden on Multi-Issue systems but no more than if
103 the equivalent Scalar instructions were explicitly
104 loop-unrolled without SVP64, and some advanced implementations may even find
105 the Deterministic nature of the Scheduling to be easier on resources.
106
The initial primary motivation of REMAP was Matrix Multiplication and the in-place
reordering of sequential data: in-place DCT and FFT were easily justified given their
exceptionally high usage in Computer Science.
110 Four SPRs are provided which may be applied to any GPR, FPR or CR Field
111 so that for example a single FMAC may be
112 used in a single hardware-controlled 100% Deterministic loop to
113 perform 5x3 times 3x4 Matrix multiplication,
114 generating 60 FMACs *without needing explicit assembler unrolling*.
115 Additional uses include regular "Structure Packing"
116 such as RGB pixel data extraction and reforming.
117
118 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
119 Vector ISAs which would typically only have a limited set of instructions
120 that can be structure-packed (LD/ST typically), REMAP may be applied to
121 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
122
123 Note that REMAP does not *directly* apply to sub-vector elements but
124 only to the group: that
125 is what swizzle is for. Swizzle *can* however be applied to the same
126 instruction as REMAP. As explained in [[sv/mv.swizzle]]
127 and the [[svp64/appendix]], Pack and Unpack EXTRA Mode bits
128 can extend down into Sub-vector elements to perform vec2/vec3/vec4
129 sequential reordering, but even here, REMAP is not *individually*
130 extended down to the actual sub-vector elements themselves.
131
132 In its general form, REMAP is quite expensive to set up, and on some
133 implementations may introduce
134 latency, so should realistically be used only where it is worthwhile.
Given that up to 127 operations can be requested to be issued
from a single instruction, it should be clear that REMAP should
not be dismissed for *possible* latency alone.
138 Commonly-used patterns such as Matrix Multiply, DCT and FFT have
139 helper instruction options which make REMAP easier to use.
140
141 There are four types of REMAP:
142
143 * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
144 Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
145 Matrix Multiply.
146 * **FFT/DCT**, with full triple-loop in-place support: limited to
147 Power-2 RADIX
148 * **Indexing**, for any general-purpose reordering, also includes
149 limited 2D reshaping.
150 * **Parallel Reduction**, for scheduling a sequence of operations
151 in a Deterministic fashion, in a way that may be parallelised,
152 to reduce a Vector down to a single value.
153
154 Best implemented on top of a Multi-Issue Out-of-Order Micro-architecture,
155 REMAP Schedules are 100% Deterministic **including Indexing** and are
156 designed to be incorporated in between the Decode and Issue phases,
157 directly into Register Hazard Management.
158
159 Parallel Reduction is unusual in that it requires a full vector array
160 of results (not a scalar) and uses the rest of the result Vector for
161 the purposes of storing intermediary calculations. As these intermediary
162 results are Deterministically computed they may be useful.
163 Additionally, because the intermediate results are always written out
164 it is possible to service Precise Interrupts without affecting latency
165 (a common limitation of Vector ISAs implementing explicit
166 Parallel Reduction instructions).
167
168 ## Basic principle
169
170 * normal vector element read/write of operands would be sequential
171 (0 1 2 3 ....)
172 * this is not appropriate for (e.g.) Matrix multiply which requires
173 accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copy large vectors, spill through memory)
  and very few Packed SIMD ISAs cope with non-Power-2.
* REMAP **redefines** the order of access according to set
  (Deterministic) "Schedules" (a minimal sketch is given below).
* The Schedules are not at all restricted to power-of-two boundaries,
  making it unnecessary to have, for example, the specialised 3x4 transpose
  instructions of other Vector ISAs.
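
Below is a minimal Python sketch of that principle (illustrative only, not the
normative Schedule generator): a REMAP Schedule substitutes a computed element
offset for the usual linear `0..VL-1` sequence. Traversing a 3-wide row-major
matrix column-by-column reproduces the "0 3 6 1 4 7 ..." ordering mentioned above.

```
xdim, ydim = 3, 3
linear   = list(range(xdim * ydim))              # 0 1 2 3 4 5 6 7 8
remapped = [y * xdim + x for x in range(xdim)
                         for y in range(ydim)]   # 0 3 6 1 4 7 2 5 8
```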
182
183 Only the most commonly-used algorithms in computer science have REMAP
184 support, due to the high cost in both the ISA and in hardware. For
185 arbitrary remapping the `Indexed` REMAP may be used.
186
187 ## Example Usage
188
189 * `svshape` to set the type of reordering to be applied to an
190 otherwise usual `0..VL-1` hardware for-loop
191 * `svremap` to set which registers a given reordering is to apply to
192 (RA, RT etc)
193 * `sv.{instruction}` where any Vectorised register marked by `svremap`
194 will have its ordering REMAPPED according to the schedule set
195 by `svshape`.
196
197 The following illustrative example multiplies a 3x4 and a 5x3
198 matrix to create
199 a 5x4 result:
200
201 ```
202 svshape 5, 4, 3, 0, 0
203 svremap 15, 1, 2, 3, 0, 0, 0, 0
204 sv.fmadds *0, *8, *16, *0
205 ```
206
207 * svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
208 * svremap activates four out of five registers RA RB RC RT RS (15)
209 * svremap requests:
210 - RA to use SVSHAPE1
211 - RB to use SVSHAPE2
212 - RC to use SVSHAPE3
213 - RT to use SVSHAPE0
214 - RS Remapping to not be activated
215 * sv.fmadds has RT=0.v, RA=8.v, RB=16.v, RC=0.v
216 * With REMAP being active each register's element index is
217 *independently* transformed using the specified SHAPEs.
218
219 Thus the Vector Loop is arranged such that the use of
220 the multiply-and-accumulate instruction executes precisely the required
221 Schedule to perform an in-place in-registers Matrix Multiply with no
222 need to perform additional Transpose or register copy instructions.
223 The example above may be executed as a unit test and demo,
224 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
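
Conceptually (a non-normative sketch: the actual element-to-loop-index assignment
is governed by the SVSHAPE permute and skip settings), the 60 scheduled FMACs
correspond to the familiar triple loop below, with `y`, `x` and `z` ranging over
the 4, 5 and 3 set up by `svshape 5, 4, 3, 0, 0`:

```
def matmul_triple_loop(A, B):
    # A has 4 rows of 3, B has 3 rows of 5, the result has 4 rows of 5
    result = [[0.0] * 5 for _ in range(4)]
    for y in range(4):           # ydim
        for x in range(5):       # xdim
            for z in range(3):   # zdim
                # one FMAC per iteration: 4*5*3 = 60 in total
                result[y][x] += A[y][z] * B[z][x]
    return result
```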
225
226 ## REMAP types
227
228 This section summarises the motivation for each REMAP Schedule
229 and briefly goes over their characteristics and limitations.
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].
232
233 ### Matrix (1D/2D/3D shaping)
234
Matrix Multiplication is a huge part of High-Performance Compute
and 3D graphics.
In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
cope with for example 3x4 Matrices, recommend data-repetition and loop-unrolling.
Aside from the cost of the load on the L1 I-Cache, the trick only
works if one of the dimensions X or Y is a power of two. Prime Numbers
(5x7, 3x5) become deeply problematic to unroll.
243
244 Even traditional Scalable Vector ISAs have issues with Matrices, often
245 having to perform data Transpose by pushing out through Memory and back,
246 or computing Transposition Indices (costly) then copying to another
247 Vector (costly).
248
249 Matrix REMAP was thus designed to solve these issues by providing Hardware
250 Assisted
251 "Schedules" that can view what would otherwise be limited to a strictly
252 linear Vector as instead being 2D (even 3D) *in-place* reordered.
253 With both Transposition and non-power-two being supported the issues
254 faced by other ISAs are mitigated.
255
256 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
257 restricted to 127: up to 127 FMAs (or other operation)
258 may be performed in total.
259 Also given that it is in-registers only at present some care has to be
260 taken on regfile resource utilisation. However it is perfectly possible
261 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
262 the usual 6-level large Matrix Multiply, without the usual difficulties
263 associated with SIMD.
264
265 Also the `svshape` instruction only provides access to part of the
266 Matrix REMAP capability. Rotation and mirroring need to be done by
267 programming the SVSHAPE SPRs directly, which can take a lot more
instructions. Future versions of SVP64 will include EXT1xx prefixed
variants (`psvshape`) which provide more comprehensive capability and
mitigate the need to write directly to the SVSHAPE SPRs.
271
272 ### FFT/DCT Triple Loop
273
DCT and FFT are among the most widely-used algorithms in
Computer Science: Radar, Audio, Video, R.F. Baseband and dozens more. At least
two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored
to FFT.
278
279 An in-depth analysis showed that it is possible to do in-place in-register
280 DCT and FFT as long as twin-result "butterfly" instructions are provided.
281 These can be found in the [[openpower/isa/svfparith]] page if performing
282 IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
283 integer operations would be required)*. These "butterfly" instructions
284 avoid the need for a temporary register because the two array positions
285 being overwritten will be "in-flight" in any In-Order or Out-of-Order
286 micro-architecture.
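
By way of illustration, here is the standard iterative in-place radix-2
Cooley-Tukey loop structure (patterned after the Project Nayuki reference
linked in the next paragraph; the bit-reversal pre-pass is omitted). The
three FFT REMAP submodes described later walk exactly these `j`, `j+halfsize`
and `k` offsets, and the twin-result butterfly overwrites both array
positions with no architecturally-visible temporary:

```
import cmath

def fft_inplace(vec):
    n = len(vec)   # must be a power of two; bit-reversal pass omitted
    exptable = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    size = 2
    while size <= n:                           # outer loop
        halfsize, tablestep = size // 2, n // size
        for i in range(0, n, size):            # middle loop
            k = 0
            for j in range(i, i + halfsize):   # inner loop
                # twin-result butterfly: vec[j] and vec[j+halfsize] are both
                # read then both overwritten; in hardware the two source
                # values are already "in-flight", so no temporary register
                # is required
                t = vec[j + halfsize] * exptable[k]
                vec[j + halfsize] = vec[j] - t
                vec[j] = vec[j] + t
                k += tablestep
        size *= 2
```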
287
288 DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
accept predicate masks. Given that it is common to perform recursive
convolutions, combining smaller Power-2 DCT/FFTs to create larger DCT/FFTs,
in practice the RADIX2 limit is not a problem. A Bluestein convolution
292 to compute arbitrary length is demonstrated by
293 [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
294
295 ### Indexed
296
297 The purpose of Indexing is to provide a generalised version of
298 Vector ISA "Permute" instructions, such as VSX `vperm`. The
299 Indexing is abstracted out and may be applied to much more
300 than an element move/copy, and is not limited for example
301 to the number of bytes that can fit into a VSX register.
302 Indexing may be applied to LD/ST (even on Indexed LD/ST
303 instructions such as `sv.lbzx`), arithmetic operations,
304 extsw: there is no artificial limit.
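
A conceptual sketch (ignoring element-width overrides, 2D reordering and the
offset, all of which are covered later) of Indexed REMAP applied to the
destination of an `sv.add`: the element index of RT is looked up from an
Index Vector held in GPRs, turning any operation into a scatter-style write
without needing a dedicated permute instruction. `gpr` and `index_base` here
are illustrative names, not part of the specification.

```
def indexed_dest_add(gpr, RT, RA, RB, index_base, VL):
    for i in range(VL):
        idx = gpr[index_base + i]    # Index Vector element (must be < MAXVL)
        gpr[RT + idx] = gpr[RA + i] + gpr[RB + i]
```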
305
306 The only major caveat is that the registers to be used as
307 Indices must not be modified by any instruction after Indexed Mode
308 is established, and neither must MAXVL be altered. Additionally,
309 no register used as an Index may exceed MAXVL-1.
310
311 Failure to observe
312 these conditions results in `UNDEFINED` behaviour.
313 These conditions allow a Read-After-Write (RAW) Hazard to be created on
314 the entire range of Indices to be subsequently used, but a corresponding
315 Write-After-Read Hazard by any instruction that modifies the Indices
316 **does not have to be created**. Given the large number of registers
317 involved in Indexing this is a huge resource saving and reduction
318 in micro-architectural complexity. MAXVL is likewise
319 included in the RAW Hazards because it is involved in calculating
320 how many registers are to be considered Indices.
321
322 With these Hazard Mitigations in place, high-performance implementations
323 may read-cache the Indices at the point where a given `svindex` instruction
324 is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
325 background GPR register file reads whilst other instructions are being
326 issued and executed.
327
328 The original motivation for Indexed REMAP was to mitigate the need to add
329 an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as
330 a stand-alone instruction. Usually a Vector ISA would add a non-conflicting
variant (as in VSX `vperm`) but it is common to need to permute by source,
with a risk of conflict that has to be resolved, for example, in AVX-512
with `vpconflictd`.
334
335 Indexed REMAP on the other hand **does not prevent conflicts** (overlapping
336 destinations), which on a superficial analysis may be perceived to be a
337 problem, until it is recalled that, firstly, Simple-V is designed specifically
338 to require Program Order to be respected, and that Matrix, DCT and FFT
339 all *already* critically depend on overlapping Reads/Writes: Matrix
340 uses overlapping registers as accumulators. Thus the Register Hazard
341 Management needed by Indexed REMAP *has* to be in place anyway.
342
343 The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
344 clearly that of the additional reading of the GPRs to be used as Indices,
345 plus the setup cost associated with creating those same Indices.
If any Deterministic REMAP can cover the required task, clearly it
is advisable to use it instead.
348
349 *Programmer's note: some algorithms may require skipping of Indices exceeding
350 VL-1, not MAXVL-1. This may be achieved programmatically by performing
351 an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
352 and RB contains the value of VL returned from `setvl`. The resultant
353 CR Fields may then be used as Predicate Masks to exclude those operations
354 with an Index exceeding VL-1.*
355
356 ### Parallel Reduction
357
358 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
359 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
360 *appearance* and *effect* of Reduction.
361
362 In Horizontal-First Mode, Vector-result reduction **requires**
363 the destination to be a Vector, which will be used to store
364 intermediary results.
365
366 Given that the tree-reduction schedule is deterministic,
367 Interrupts and exceptions
368 can therefore also be precise. The final result will be in the first
369 non-predicate-masked-out destination element, but due again to
370 the deterministic schedule programmers may find uses for the intermediate
371 results.
372
373 When Rc=1 a corresponding Vector of co-resultant CRs is also
374 created. No special action is taken: the result *and its CR Field*
375 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
376
377 Note that the Schedule only makes sense on top of certain instructions:
378 X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
379 and the destination are all the same type. Like Scalar
380 Reduction, nothing is prohibited:
381 the results of execution on an unsuitable instruction may simply
382 not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
383 may be used, and whilst it is down to the Programmer to walk through the
384 process the Programmer can be confident that the Parallel-Reduction is
385 guaranteed 100% Deterministic.
386
387 Critical to note regarding use of Parallel-Reduction REMAP is that,
388 exactly as with all REMAP Modes, the `svshape` instruction *requests*
389 a certain Vector Length (number of elements to reduce) and then
390 sets VL and MAXVL at the number of **operations** needed to be
391 carried out. Thus, equally as importantly, like Matrix REMAP
392 the total number of operations
393 is restricted to 127. Any Parallel-Reduction requiring more operations
394 will need to be done manually in batches (hierarchical
395 recursive Reduction).
396
397 Also important to note is that the Deterministic Schedule is arranged
398 so that some implementations *may* parallelise it (as long as doing so
399 respects Program Order and Register Hazards). Performance (speed)
400 of any given
implementation is neither strictly defined nor guaranteed. As with
402 the Vulkan(tm) Specification, strict compliance is paramount whilst
403 performance is at the discretion of Implementors.
404
405 **Parallel-Reduction with Predication**
406
407 To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
408 completely separate from the actual element-level (scalar) operations,
409 Move operations are **not** included in the Schedule. This means that
410 the Schedule leaves the final (scalar) result in the first-non-masked
411 element of the Vector used. With the predicate mask being dynamic
412 (but deterministic) this result could be anywhere.
413
414 If that result is needed to be moved to a (single) scalar register
415 then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
416 needed to get it, where the predicate is the exact same predicate used
417 in the prior Parallel-Reduction instruction.
418
419 * If there was only a single
420 bit in the predicate then the result will not have moved or been altered
421 from the source vector prior to the Reduction
422 * If there was more than one bit the result will be in the
423 first element with a predicate bit set.
424
425 In either case the result is in the element with the first bit set in
426 the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
427
428 Programmer's Note: For *some* hardware implementations
429 the vector-to-scalar copy may be a slow operation, as may the Predicated
430 Parallel Reduction itself.
431 It may be better to perform a pre-copy
432 of the values, compressing them (VREDUCE-style) into a contiguous block,
433 which will guarantee that the result goes into the very first element
434 of the destination vector, in which case clearly no follow-up
435 predicated vector-to-scalar MV operation is needed.
436
437 **Usage conditions**
438
439 The simplest usage is to perform an overwrite, specifying all three
440 register operands the same.
441
442 ```
443 svshape parallelreduce, 6
444 sv.add *8, *8, *8
445 ```
446
447 The Reduction Schedule will issue the Parallel Tree Reduction spanning
448 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
449 necessary (see "Parallel Reduction algorithm" in a later section).
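
A non-normative sketch of the resulting tree schedule for the overwrite
example above (six elements in registers 8 through 13, no predication): the
exact pairing and ordering are defined by the Parallel Reduction pseudocode,
but the shape is a deterministic binary tree of five scalar adds, leaving
the final sum in r8 (the first element).

```
regs, step, ops = list(range(8, 14)), 1, []
while step < len(regs):
    for i in range(0, len(regs), step * 2):
        if i + step < len(regs):
            ops.append((regs[i], regs[i], regs[i + step]))   # (RT, RA, RB)
    step *= 2
# ops == [(8,8,9), (10,10,11), (12,12,13), (8,8,10), (8,8,12)]
```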
450
451 A non-overwrite is possible as well but just as with the overwrite
452 version, only those destination elements necessary for storing
453 intermediary computations will be written to: the remaining elements
454 will **not** be overwritten and will **not** be zero'd.
455
456 ```
457 svshape parallelreduce, 6
458 sv.add *0, *8, *8
459 ```
460
461 However it is critical to note that if the source and destination are
462 not the same then the trick of using a follow-up vector-scalar MV will
463 not work.
464
465 ### Sub-Vector Horizontal Reduction
466
467 To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
468 which will turn the Schedule around such that issuing of the Scalar
469 Defined Words is done with SUBVL looping as the inner loop not the
470 outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
471
472 ## Determining Register Hazards
473
474 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
475 to be able to statically determine the extent of Vectors in order to
476 allocate pre-emptive Hazard protection. The next task is to eliminate
477 masked-out elements using predicate bits, freeing up the associated
478 Hazards.
479
480 For non-REMAP situations `VL` is sufficient to ascertain early
Hazard coverage, and with SVSTATE being a high priority cached
quantity at the same level as MSR and PC this is not a problem.
483
484 The problems come when REMAP is enabled. Indexed REMAP must instead
485 use `MAXVL` as the earliest (simplest)
486 batch-level Hazard Reservation indicator (after taking element-width
487 overriding on the Index source into consideration),
488 but Matrix, FFT and Parallel Reduction must all use completely different
489 schemes. The reason is that VL is used to step through the total
490 number of *operations*, not the number of registers.
491 The "Saving Grace" is that all of the REMAP Schedules are 100% Deterministic.
492
Advance-notice Parallel computation and subsequent caching
494 of all of these complex Deterministic REMAP Schedules is
495 *strongly recommended*, thus allowing clear and precise multi-issue
496 batched Hazard coverage to be deployed, *even for Indexed Mode*.
497 This is only possible for Indexed due to the strict guidelines
498 given to Programmers.
499
In short, there exist solutions to the problem of Hazard Management,
with varying degrees of refinement possible at correspondingly
increasing levels of complexity in hardware.
503
A reminder: when Rc=1 each result register (element) has an associated
co-result CR Field (one per result element). Thus, when determining
the Write-Hazards for result registers, the Write-Hazards for the
associated co-result CR Fields must not be forgotten, *including* when
Predication is used.
509
510 ## REMAP area of SVSTATE SPR
511
512 The following bits of the SVSTATE SPR are used for REMAP:
513
514 |32.33|34.35|36.37|38.39|40.41| 42.46 | 62 |
515 | -- | -- | -- | -- | -- | ----- | ------ |
516 |mi0 |mi1 |mi2 |mo0 |mo1 | SVme | RMpst |
517
518 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
519 mi0-2 apply to RA, RB, RC respectively, as input registers, and
520 likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
521 SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
522 SVSHAPE is actively applied or not.
523
524 * bit 0 of SVme indicates if mi0 is applied to RA / FRA / BA / BFA
525 * bit 1 of SVme indicates if mi1 is applied to RB / FRB / BB
526 * bit 2 of SVme indicates if mi2 is applied to RC / FRC / BC
527 * bit 3 of SVme indicates if mo0 is applied to RT / FRT / BT / BF
528 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
529 (LD/ST-with-update has an implicit 2nd write register, RA)
530
The "persistence" bit, if set, will result in all Active REMAPs being applied
indefinitely.
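
A small helper sketch (not part of the specification) showing how the REMAP
fields above, which are given in MSB0 bit numbering of the 64-bit SVSTATE
SPR, map onto conventional LSB0 shifts:

```
def decode_svstate_remap(svstate):
    # MSB0 bit b of a 64-bit SPR is LSB0 position 63-b
    fld = lambda lo, hi: (svstate >> (63 - hi)) & ((1 << (hi - lo + 1)) - 1)
    return {
        "mi0":   fld(32, 33),   # SVSHAPE index for RA / FRA / BA / BFA
        "mi1":   fld(34, 35),   # SVSHAPE index for RB / FRB / BB
        "mi2":   fld(36, 37),   # SVSHAPE index for RC / FRC / BC
        "mo0":   fld(38, 39),   # SVSHAPE index for RT / FRT / BT / BF
        "mo1":   fld(40, 41),   # SVSHAPE index for EA / FRS / RS
        "SVme":  fld(42, 46),   # 5-bit enable mask
        "RMpst": fld(62, 62),   # persistence bit
    }
```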
533
534 ----------------
535
536 \newpage{}
537
538 # svremap instruction <a name="svremap"> </a>
539
540 SVRM-Form:
541
    svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
543
544 |0 |6 |11 |13 |15 |17 |19 |21 | 22.25 |26..31 |
545 | -- | -- | -- | -- | -- | -- | -- | -- | ---- | ----- |
546 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO |
547
552 Pseudo-code:
553
554 ```
555 # registers RA RB RC RT EA/FRS SVSHAPE0-3 indices
556 SVSTATE[32:33] <- mi0
557 SVSTATE[34:35] <- mi1
558 SVSTATE[36:37] <- mi2
559 SVSTATE[38:39] <- mo0
560 SVSTATE[40:41] <- mo1
561 # enable bit for RA RB RC RT EA/FRS
562 SVSTATE[42:46] <- SVme
563 # persistence bit (applies to more than one instruction)
564 SVSTATE[62] <- pst
565 ```
566
567 Special Registers Altered:
568
569 ```
570 SVSTATE
571 ```
572
`svremap` determines the relationship between registers and SVSHAPE SPRs.
The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1
determine which SHAPE is applied to an activated register. The `pst` bit, if
cleared, indicates that the REMAP operation shall only apply to the immediately-following
instruction. If set then REMAP remains permanently enabled until such time as it is
explicitly disabled, either by `setvl` setting a new MAXVL, or with another
`svremap` instruction. `svindex` and `svshape2` are also capable of setting or
clearing persistence, as well as partially covering a subset of the capability of
`svremap` to set register-to-SVSHAPE relationships.
582
583 Programmer's Note: applying non-persistent `svremap` to an instruction that has
584 no REMAP enabled or is a Scalar operation will obviously have no effect but
585 the bits 32 to 46 will at least have been set in SVSTATE. This may prove useful
586 when using `svindex` or `svshape2`.
587
588 Hardware Architectural Note: when persistence is not set it is critically important
589 to treat the `svremap` and the following SVP64 instruction as an indivisible fused operation.
590 *No state* is stored in the SVSTATE SPR in order to allow continuation should an
591 Interrupt occur between the two instructions. Thus, Interrupts must be prohibited
592 from occurring or other workaround deployed. When persistence is set this issue
593 is moot.
594
595 It is critical to note that if persistence is clear then `svremap` is the *only* way
596 to activate REMAP on any given (following) instruction. If persistence is set however then
597 **all** SVP64 instructions go through REMAP as long as `SVme` is non-zero.
598
599 -------------
600
601 \newpage{}
602
603 # SHAPE Remapping SPRs
604
There are four "shape" SPRs, SVSHAPE0-3, 32 bits in each,
all of which have the same format.
607
608 Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is
609 disabled: the register's elements are a linear (1D) vector.
610
611 |31.30|29..28 |27..24| 23..21 | 20..18 | 17..12 |11..6 |5..0 | Mode |
612 |---- |------ |------| ------ | ------- | ------- |----- |----- | ----- |
613 |0b00 |skip |offset| invxyz | permute | zdimsz |ydimsz|xdimsz|Matrix |
614 |0b00 |elwidth|offset|sk1/invxy|0b110/0b111|SVGPR|ydimsz|xdimsz|Indexed|
615 |0b01 |submode|offset| invxyz | submode2| zdimsz |mode |xdimsz|DCT/FFT|
616 |0b10 |submode|offset| invxyz | rsvd | rsvd |rsvd |xdimsz|Preduce|
617 |0b11 | | | | | | | |rsvd |
618
619 mode sets different behaviours (straight matrix multiply, FFT, DCT).
620
621 * **mode=0b00** sets straight Matrix Mode
622 * **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
623 * **mode=0b01** sets "FFT/DCT" mode and activates submodes
624 * **mode=0b10** sets "Parallel Reduction" Schedules.
625
626 ## Parallel Reduction Mode
627
628 Creates the Schedules for Parallel Tree Reduction.
629
630 * **submode=0b00** selects the left operand index
631 * **submode=0b01** selects the right operand index
632
633 * When bit 0 of `invxyz` is set, the order of the indices
634 in the inner for-loop are reversed. This has the side-effect
635 of placing the final reduced result in the last-predicated element.
636 It also has the indirect side-effect of swapping the source
637 registers: Left-operand index numbers will always exceed
638 Right-operand indices.
639 When clear, the reduced result will be in the first-predicated
640 element, and Left-operand indices will always be *less* than
641 Right-operand ones.
642 * When bit 1 of `invxyz` is set, the order of the outer loop
643 step is inverted: stepping begins at the nearest power-of two
644 to half of the vector length and reduces by half each time.
645 When clear the step will begin at 2 and double on each
646 inner loop.
647
648 ## FFT/DCT mode
649
650 submode2=0 is for FFT. For FFT submode the following schedules may be
651 selected:
652
653 * **submode=0b00** selects the ``j`` offset of the innermost for-loop
654 of Tukey-Cooley
655 * **submode=0b10** selects the ``j+halfsize`` offset of the innermost for-loop
656 of Tukey-Cooley
657 * **submode=0b11** selects the ``k`` of exptable (which coefficient)
658
659 When submode2 is 1 or 2, for DCT inner butterfly submode the following
660 schedules may be selected. When submode2 is 1, additional bit-reversing
661 is also performed.
662
663 * **submode=0b00** selects the ``j`` offset of the innermost for-loop,
664 in-place
* **submode=0b01** selects the ``j+halfsize`` offset of the innermost for-loop,
  in reverse-order, in-place
667 * **submode=0b10** selects the ``ci`` count of the innermost for-loop,
668 useful for calculating the cosine coefficient
669 * **submode=0b11** selects the ``size`` offset of the outermost for-loop,
670 useful for the cosine coefficient ``cos(ci + 0.5) * pi / size``
671
When submode2 is 3 or 4, for DCT outer butterfly submode the following
schedules may be selected. When submode2 is 3, additional bit-reversing
is also performed.
675
676 * **submode=0b00** selects the ``j`` offset of the innermost for-loop,
677 * **submode=0b01** selects the ``j+1`` offset of the innermost for-loop,
678
679 `zdimsz` is used as an in-place "Stride", particularly useful for
680 column-based in-place DCT/FFT.
681
682 ## Matrix Mode
683
In Matrix Mode, skip allows dimensions to be skipped from being included
in the resultant output index. This allows sequences to be repeated:
```0 0 0 1 1 1 2 2 2 ...``` or in the case of skip=0b11 this results in
modulo ```0 1 2 0 1 2 ...```
688
689 * **skip=0b00** indicates no dimensions to be skipped
690 * **skip=0b01** sets "skip 1st dimension"
691 * **skip=0b10** sets "skip 2nd dimension"
692 * **skip=0b11** sets "skip 3rd dimension"
693
694 invxyz will invert the start index of each of x, y or z. If invxyz[0] is
695 zero then x-dimensional counting begins from 0 and increments, otherwise
696 it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.
697
698 offset will have the effect of offsetting the result by ```offset``` elements:
699
700 ```
701 for i in 0..VL-1:
702 GPR(RT + remap(i) + SVSHAPE.offset) = ....
703 ```
704
This may appear redundant, because the register RT could simply be changed
by a compiler, until element-width overrides are introduced. Also
bear in mind that, unlike a static compiler, SVSHAPE.offset may
be set dynamically at runtime.
708
xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the array dimensionality for that dimension is 1. Any dimension
not intended to be used must have its value set to 0 (dimensionality
of 1). A value of xdimsz=2 would indicate that in the first dimension
there are 3 elements in the array. For example, to create a 2D array
X,Y of dimensionality X=3 and Y=2, set xdimsz=2, ydimsz=1 and zdimsz=0.
715
716 The format of the array is therefore as follows:
717
718 ```
719 array[xdimsz+1][ydimsz+1][zdimsz+1]
720 ```
721
722 However whilst illustrative of the dimensionality, that does not take the
723 "permute" setting into account. "permute" may be any one of six values
724 (0-5, with values of 6 and 7 indicating "Indexed" Mode). The table
725 below shows how the permutation dimensionality order works:
726
727 | permute | order | array format |
728 | ------- | ----- | ------------------------ |
729 | 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
730 | 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
731 | 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
732 | 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
733 | 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
734 | 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
735 | 110 | 0,1 | Indexed (xdim+1)(ydim+1) |
736 | 111 | 1,0 | Indexed (ydim+1)(xdim+1) |
737
738 In other words, the "permute" option changes the order in which
739 nested for-loops over the array would be done. See executable
740 python reference code for further details.
741
742 *Note: permute=0b110 and permute=0b111 enable Indexed REMAP Mode,
743 described below*
744
745 With all these options it is possible to support in-place transpose,
746 in-place rotate, Matrix Multiply and Convolutions, without being
747 limited to Power-of-Two dimension sizes.
748
749 ## Indexed Mode
750
751 Indexed Mode activates reading of the element indices from the GPR
752 and includes optional limited 2D reordering.
753 In its simplest form (without elwidth overrides or other modes):
754
755 ```
def index_remap(i):
    return GPR((SVSHAPE.SVGPR<<1)+i) + SVSHAPE.offset

for i in 0..VL-1:
    element_result = ....
    GPR(RT + index_remap(i)) = element_result
762 ```
763
764 With element-width overrides included, and using the pseudocode
765 from the SVP64 [[sv/svp64/appendix#elwidth]] elwidth section
766 this becomes:
767
768 ```
def index_remap(i):
    svreg = SVSHAPE.SVGPR << 1
    srcwid = elwid_to_bitwidth(SVSHAPE.elwid)
    offs = SVSHAPE.offset
    return get_polymorphed_reg(svreg, srcwid, i) + offs

for i in 0..VL-1:
    element_result = ....
    rt_idx = index_remap(i)
    set_polymorphed_reg(RT, destwid, rt_idx, element_result)
779 ```
780
781 Matrix-style reordering still applies to the indices, except limited
782 to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
783 (Y,X) for in-place Transposition.
784 Only one dimension may optionally be skipped. Inversion of either
785 X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode (including elwidth
786 overrides) may be written in terms of Matrix Mode, specifically
787 purposed to ensure that the 3rd dimension (Z) has no effect:
788
789 ```
790 def index_remap(ISHAPE, i):
791 MSHAPE.skip = 0b0 || ISHAPE.sk1
792 MSHAPE.invxyz = 0b0 || ISHAPE.invxy
793 MSHAPE.xdimsz = ISHAPE.xdimsz
794 MSHAPE.ydimsz = ISHAPE.ydimsz
795 MSHAPE.zdimsz = 0 # disabled
796 if ISHAPE.permute = 0b110 # 0,1
797 MSHAPE.permute = 0b000 # 0,1,2
798 if ISHAPE.permute = 0b111 # 1,0
799 MSHAPE.permute = 0b010 # 1,0,2
800 el_idx = remap_matrix(MSHAPE, i)
801 svreg = ISHAPE.SVGPR << 1
802 srcwid = elwid_to_bitwidth(ISHAPE.elwid)
803 offs = ISHAPE.offset
804 return get_polymorphed_reg(svreg, srcwid, el_idx) + offs
805 ```
806
807 The most important observation above is that the Matrix-style
808 remapping occurs first and the Index lookup second. Thus it
809 becomes possible to perform in-place Transpose of Indices which
810 may have been costly to set up or costly to duplicate
811 (waste register file space).
812
813 -------------
814
815 \newpage{}
816
817 # svshape instruction <a name="svshape"> </a>
818
819 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
820
821 svshape SVxd,SVyd,SVzd,SVRM,vf
822
823 | 0.5|6.10 |11.15 |16..20 | 21..24 | 25 | 26..31| name |
824 | -- | -- | --- | ----- | ------ | -- | ------| -------- |
825 |OPCD| SVxd | SVyd | SVzd | SVRM | vf | XO | svshape |
826
827 ```
828 # for convenience, VL to be calculated and stored in SVSTATE
829 vlen <- [0] * 7
830 mscale[0:5] <- 0b000001 # for scaling MAXVL
831 itercount[0:6] <- [0] * 7
832 SVSTATE[0:31] <- [0] * 32
833 # only overwrite REMAP if "persistence" is zero
834 if (SVSTATE[62] = 0b0) then
835 SVSTATE[32:33] <- 0b00
836 SVSTATE[34:35] <- 0b00
837 SVSTATE[36:37] <- 0b00
838 SVSTATE[38:39] <- 0b00
839 SVSTATE[40:41] <- 0b00
840 SVSTATE[42:46] <- 0b00000
841 SVSTATE[62] <- 0b0
842 SVSTATE[63] <- 0b0
843 # clear out all SVSHAPEs
844 SVSHAPE0[0:31] <- [0] * 32
845 SVSHAPE1[0:31] <- [0] * 32
846 SVSHAPE2[0:31] <- [0] * 32
847 SVSHAPE3[0:31] <- [0] * 32
848
849 # set schedule up for multiply
850 if (SVrm = 0b0000) then
851 # VL in Matrix Multiply is xd*yd*zd
852 xd <- (0b00 || SVxd) + 1
853 yd <- (0b00 || SVyd) + 1
854 zd <- (0b00 || SVzd) + 1
855 n <- xd * yd * zd
856 vlen[0:6] <- n[14:20]
857 # set up template in SVSHAPE0, then copy to 1-3
858 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
859 SVSHAPE0[6:11] <- (0b0 || SVyd) # ydim
860 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim
861 SVSHAPE0[28:29] <- 0b11 # skip z
862 # copy
863 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
864 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
865 SVSHAPE3[0:31] <- SVSHAPE0[0:31]
866 # set up FRA
867 SVSHAPE1[18:20] <- 0b001 # permute x,z,y
868 SVSHAPE1[28:29] <- 0b01 # skip z
869 # FRC
870 SVSHAPE2[18:20] <- 0b001 # permute x,z,y
871 SVSHAPE2[28:29] <- 0b11 # skip y
872
873 # set schedule up for FFT butterfly
874 if (SVrm = 0b0001) then
875 # calculate O(N log2 N)
876 n <- [0] * 3
877 do while n < 5
878 if SVxd[4-n] = 0 then
879 leave
880 n <- n + 1
881 n <- ((0b0 || SVxd) + 1) * n
882 vlen[0:6] <- n[1:7]
883 # set up template in SVSHAPE0, then copy to 1-3
884 # for FRA and FRT
885 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
886 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D FFT)
887 mscale <- (0b0 || SVzd) + 1
888 SVSHAPE0[30:31] <- 0b01 # Butterfly mode
889 # copy
890 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
891 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
892 # set up FRB and FRS
893 SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
894 # FRC (coefficients)
895 SVSHAPE2[28:29] <- 0b10 # k schedule
896
897 # set schedule up for (i)DCT Inner butterfly
898 # SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
899 if ((SVrm = 0b0100) |
900 (SVrm = 0b1100)) then
901 # calculate O(N log2 N)
902 n <- [0] * 3
903 do while n < 5
904 if SVxd[4-n] = 0 then
905 leave
906 n <- n + 1
907 n <- ((0b0 || SVxd) + 1) * n
908 vlen[0:6] <- n[1:7]
909 # set up template in SVSHAPE0, then copy to 1-3
910 # set up FRB and FRS
911 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
912 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
913 mscale <- (0b0 || SVzd) + 1
914 if (SVrm = 0b1100) then
915 SVSHAPE0[30:31] <- 0b11 # iDCT mode
916 SVSHAPE0[18:20] <- 0b011 # iDCT Inner Butterfly sub-mode
917 else
918 SVSHAPE0[30:31] <- 0b01 # DCT mode
919 SVSHAPE0[18:20] <- 0b001 # DCT Inner Butterfly sub-mode
920 SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop
921 SVSHAPE0[6:11] <- 0b000011 # (i)DCT Inner Butterfly mode 4
922 # copy
923 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
924 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
925 if (SVrm != 0b0100) & (SVrm != 0b1100) then
926 SVSHAPE3[0:31] <- SVSHAPE0[0:31]
927 # for FRA and FRT
928 SVSHAPE0[28:29] <- 0b01 # j+halfstep schedule
929 # for cos coefficient
930 SVSHAPE2[28:29] <- 0b10 # ci (k for mode 4) schedule
931 SVSHAPE2[12:17] <- 0b000000 # reset costable "striding" to 1
932 if (SVrm != 0b0100) & (SVrm != 0b1100) then
933 SVSHAPE3[28:29] <- 0b11 # size schedule
934
935 # set schedule up for (i)DCT Outer butterfly
936 if (SVrm = 0b0011) | (SVrm = 0b1011) then
937 # calculate O(N log2 N) number of outer butterfly overlapping adds
938 vlen[0:6] <- [0] * 7
939 n <- 0b000
940 size <- 0b0000001
941 itercount[0:6] <- (0b00 || SVxd) + 0b0000001
942 itercount[0:6] <- (0b0 || itercount[0:5])
943 do while n < 5
944 if SVxd[4-n] = 0 then
945 leave
946 n <- n + 1
947 count <- (itercount - 0b0000001) * size
948 vlen[0:6] <- vlen + count[7:13]
949 size[0:6] <- (size[1:6] || 0b0)
950 itercount[0:6] <- (0b0 || itercount[0:5])
951 # set up template in SVSHAPE0, then copy to 1-3
952 # set up FRB and FRS
953 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
954 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
955 mscale <- (0b0 || SVzd) + 1
956 if (SVrm = 0b1011) then
957 SVSHAPE0[30:31] <- 0b11 # iDCT mode
958 SVSHAPE0[18:20] <- 0b011 # iDCT Outer Butterfly sub-mode
959 SVSHAPE0[21:23] <- 0b101 # "inverse" on outer and inner loop
960 else
961 SVSHAPE0[30:31] <- 0b01 # DCT mode
962 SVSHAPE0[18:20] <- 0b100 # DCT Outer Butterfly sub-mode
963 SVSHAPE0[6:11] <- 0b000010 # DCT Butterfly mode
964 # copy
965 SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
966 SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
967 # for FRA and FRT
968 SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
969 # reset costable "striding" to 1
970 SVSHAPE2[12:17] <- 0b000000
971
972 # set schedule up for DCT COS table generation
973 if (SVrm = 0b0101) | (SVrm = 0b1101) then
974 # calculate O(N log2 N)
975 vlen[0:6] <- [0] * 7
976 itercount[0:6] <- (0b00 || SVxd) + 0b0000001
977 itercount[0:6] <- (0b0 || itercount[0:5])
978 n <- [0] * 3
979 do while n < 5
980 if SVxd[4-n] = 0 then
981 leave
982 n <- n + 1
983 vlen[0:6] <- vlen + itercount
984 itercount[0:6] <- (0b0 || itercount[0:5])
985 # set up template in SVSHAPE0, then copy to 1-3
986 # set up FRB and FRS
987 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
988 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
989 mscale <- (0b0 || SVzd) + 1
990 SVSHAPE0[30:31] <- 0b01 # DCT/FFT mode
991 SVSHAPE0[6:11] <- 0b000100 # DCT Inner Butterfly COS-gen mode
992 if (SVrm = 0b0101) then
993 SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop for DCT
994 # copy
995 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
996 SVSHAPE2[0:31] <- SVSHAPE0[0:31]
997 # for cos coefficient
998 SVSHAPE1[28:29] <- 0b10 # ci schedule
999 SVSHAPE2[28:29] <- 0b11 # size schedule
1000
1001 # set schedule up for iDCT / DCT inverse of half-swapped ordering
1002 if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
1003 vlen[0:6] <- (0b00 || SVxd) + 0b0000001
1004 # set up template in SVSHAPE0
1005 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
1006 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
1007 mscale <- (0b0 || SVzd) + 1
1008 if (SVrm = 0b1110) then
1009 SVSHAPE0[18:20] <- 0b001 # DCT opposite half-swap
1010 if (SVrm = 0b1111) then
1011 SVSHAPE0[30:31] <- 0b01 # FFT mode
1012 else
1013 SVSHAPE0[30:31] <- 0b11 # DCT mode
1014 SVSHAPE0[6:11] <- 0b000101 # DCT "half-swap" mode
1015
1016 # set schedule up for parallel reduction
1017 if (SVrm = 0b0111) then
1018 # calculate the total number of operations (brute-force)
1019 vlen[0:6] <- [0] * 7
1020 itercount[0:6] <- (0b00 || SVxd) + 0b0000001
1021 step[0:6] <- 0b0000001
1022 i[0:6] <- 0b0000000
1023 do while step <u itercount
1024 newstep <- step[1:6] || 0b0
1025 j[0:6] <- 0b0000000
1026 do while (j+step <u itercount)
1027 j <- j + newstep
1028 i <- i + 1
1029 step <- newstep
1030 # VL in Parallel-Reduce is the number of operations
1031 vlen[0:6] <- i
1032 # set up template in SVSHAPE0, then copy to 1. only 2 needed
1033 SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
1034 SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
1035 mscale <- (0b0 || SVzd) + 1
1036 SVSHAPE0[30:31] <- 0b10 # parallel reduce submode
1037 # copy
1038 SVSHAPE1[0:31] <- SVSHAPE0[0:31]
1039 # set up right operand (left operand 28:29 is zero)
1040 SVSHAPE1[28:29] <- 0b01 # right operand
1041
1042 # set VL, MVL and Vertical-First
1043 m[0:12] <- vlen * mscale
1044 maxvl[0:6] <- m[6:12]
SVSTATE[0:6] <- maxvl # MAXVL
1046 SVSTATE[7:13] <- vlen # VL
1047 SVSTATE[63] <- vf
1048 ```
1049
1050 Special Registers Altered:
1051
1052 ```
1053 SVSTATE, SVSHAPE0-3
1054 ```
1055
1056 `svshape` is a convenience instruction that reduces instruction
1057 count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
1058 (overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
1059 including VL and MAXVL. Using `svshape` therefore does not also
1060 require `setvl`.
1061
1062 Fields:
1063
1064 * **SVxd** - SV REMAP "xdim"
1065 * **SVyd** - SV REMAP "ydim"
1066 * **SVzd** - SV REMAP "zdim"
1067 * **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)
1068 * **vf** - sets "Vertical-First" mode
1069
*Note: SVxd, SVyd and SVzd are all stored "off-by-one". In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
1072
1073 There are 12 REMAP Modes (2 Modes are RESERVED for `svshape2`, 2 Modes
1074 are RESERVED)
1075
1076 | SVRM | Remap Mode description |
1077 | -- | -- |
1078 | 0b0000 | Matrix 1/2/3D |
1079 | 0b0001 | FFT Butterfly |
1080 | 0b0010 | reserved |
1081 | 0b0011 | DCT Outer butterfly |
1082 | 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
1083 | 0b0101 | DCT COS table index generation |
1084 | 0b0110 | DCT half-swap |
1085 | 0b0111 | Parallel Reduction |
1086 | 0b1000 | reserved for svshape2 |
1087 | 0b1001 | reserved for svshape2 |
1088 | 0b1010 | reserved |
1089 | 0b1011 | iDCT Outer butterfly |
1090 | 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
1091 | 0b1101 | iDCT COS table index generation |
1092 | 0b1110 | iDCT half-swap |
1093 | 0b1111 | FFT half-swap |
1094
Examples showing how all of these Modes operate exist in the online
1096 [SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD). Explaining
1097 these Modes further in detail is beyond the scope of this document.
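
As a quick, non-normative cross-check of the `vlen` (VL) computation in the
pseudocode above, for the two most commonly-used Modes (SVxd/SVyd/SVzd here
are the *field* values, i.e. the assembler operands minus one; the FFT case
assumes a power-of-two size):

```
def matrix_vlen(SVxd, SVyd, SVzd):      # SVRM=0b0000
    return (SVxd + 1) * (SVyd + 1) * (SVzd + 1)

def fft_vlen(SVxd):                     # SVRM=0b0001, N = SVxd+1 = 2**k
    N = SVxd + 1
    return N * (N.bit_length() - 1)     # the "O(N log2 N)" of the pseudocode

assert matrix_vlen(4, 3, 2) == 60       # the svshape 5,4,3 Matrix example
assert fft_vlen(7) == 24                # an 8-point FFT Schedule
```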
1098
1099 In Indexed Mode, there are only 5 bits available to specify the GPR
1100 to use, out of 128 GPRs (7 bit numbering). Therefore, only the top
1101 5 bits are given in the `SVxd` field: the bottom two implicit bits
1102 will be zero (`SVxd || 0b00`).
1103
1104 `svshape` has *limited applicability* due to being a 32-bit instruction.
1105 The full capability of SVSHAPE SPRs may be accessed by directly writing
1106 to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
1107 larger than 32, and in-place Transpose. Potentially a future v3.1 Prefixed
1108 instruction, `psvshape`, may extend the capability here.
1109
1110 *Architectural Resource Allocation note: the SVRM field is carefully
1111 crafted to allocate two Modes, corresponding to bits 21-23 within the
1112 instruction being set to the value `0b100`, to `svshape2` (not
1113 `svshape`). These two Modes are
1114 considered "RESERVED" within the context of `svshape` but it is
1115 absolutely critical to allocate the exact same pattern in XO for
1116 both instructions in bits 26-31.*
1117
1118 -------------
1119
1120 \newpage{}
1121
1122
1123 # svindex instruction <a name="svindex"> </a>
1124
1125 SVI-Form
1126
1127 | 0-5|6-10 |11-15 |16-20 | 21-25 | 26-31 | Form |
1128 | -- | -- | --- | ---- | ----------- | ------| -------- |
1129 | PO | SVG | rmm | SVd | ew/yx/mm/sk | XO | SVI-Form |
1130
1131 * svindex SVG,rmm,SVd,ew,SVyx,mm,sk
1132
1133 Pseudo-code:
1134
1135 ```
1136 # based on nearest MAXVL compute other dimension
1137 MVL <- SVSTATE[0:6]
1138 d <- [0] * 6
1139 dim <- SVd+1
1140 do while d*dim <u ([0]*4 || MVL)
1141 d <- d + 1
1142
1143 # set up template, then copy once location identified
1144 shape <- [0]*32
1145 shape[30:31] <- 0b00 # mode
1146 if SVyx = 0 then
1147 shape[18:20] <- 0b110 # indexed xd/yd
1148 shape[0:5] <- (0b0 || SVd) # xdim
1149 if sk = 0 then shape[6:11] <- 0 # ydim
1150 else shape[6:11] <- 0b111111 # ydim max
1151 else
1152 shape[18:20] <- 0b111 # indexed yd/xd
1153 if sk = 1 then shape[6:11] <- 0 # ydim
1154 else shape[6:11] <- d-1 # ydim max
1155 shape[0:5] <- (0b0 || SVd) # ydim
1156 shape[12:17] <- (0b0 || SVG) # SVGPR
1157 shape[28:29] <- ew # element-width override
1158 shape[21] <- sk # skip 1st dimension
1159
1160 # select the mode for updating SVSHAPEs
1161 SVSTATE[62] <- mm # set or clear persistence
1162 if mm = 0 then
1163 # clear out all SVSHAPEs first
1164 SVSHAPE0[0:31] <- [0] * 32
1165 SVSHAPE1[0:31] <- [0] * 32
1166 SVSHAPE2[0:31] <- [0] * 32
1167 SVSHAPE3[0:31] <- [0] * 32
1168 SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
1169 SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
1170 idx <- 0
1171 for bit = 0 to 4
1172 if rmm[4-bit] then
1173 # activate requested shape
1174 if idx = 0 then SVSHAPE0 <- shape
1175 if idx = 1 then SVSHAPE1 <- shape
1176 if idx = 2 then SVSHAPE2 <- shape
1177 if idx = 3 then SVSHAPE3 <- shape
1178 SVSTATE[bit*2+32:bit*2+33] <- idx
1179 # increment shape index, modulo 4
1180 if idx = 3 then idx <- 0
1181 else idx <- idx + 1
1182 else
1183 # refined SVSHAPE/REMAP update mode
1184 bit <- rmm[0:2]
1185 idx <- rmm[3:4]
1186 if idx = 0 then SVSHAPE0 <- shape
1187 if idx = 1 then SVSHAPE1 <- shape
1188 if idx = 2 then SVSHAPE2 <- shape
1189 if idx = 3 then SVSHAPE3 <- shape
1190 SVSTATE[bit*2+32:bit*2+33] <- idx
1191 SVSTATE[46-bit] <- 1
1192 ```
1193
1194 Special Registers Altered:
1195
1196 ```
1197 SVSTATE, SVSHAPE0-3
1198 ```
1199
1200 `svindex` is a convenience instruction that reduces instruction count
1201 for Indexed REMAP Mode. It sets up (overwrites) all required SVSHAPE
1202 SPRs and **unlike** `svshape` can modify the REMAP area of the SVSTATE
1203 SPR as well, including setting persistence. The relevant SPRs *may*
1204 be directly programmed with `mtspr` however it is laborious to do so:
1205 svindex saves instructions covering much of Indexed REMAP capability.
1206
1207 Fields:
1208
1209 * **SVd** - SV REMAP x/y dim
1210 * **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs,
1211 controlled by mm
1212 * **ew** - sets element width override on the Indices
1213 * **SVG** - GPR SVG<<2 to be used for Indexing
1214 * **yx** - 2D reordering to be used if yx=1
* **mm** - mask mode: determines how `rmm` is interpreted.
1216 * **sk** - Dimension skipping enabled
1217
*Note: SVd, like SVxd, SVyd and SVzd of `svshape`, is stored
"off-by-one". In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*.
1221
1222 *Note: when `yx=1,sk=0` the second dimension is calculated as
1223 `CEIL(MAXVL/SVd)`*.
1224
1225 When `mm=0`:
1226
* `rmm`, like REMAP.SVme, has bit 0
  correspond to mi0, bit 1 to mi1, bit 2 to mi2,
  bit 3 to mo0 and bit 4 to mo1
* all SVSHAPEs and the REMAP area of SVSTATE are first reset (initialised to zero)
1231 * for each bit set in the 5-bit `rmm`, in order, the first
1232 as-yet-unset SVSHAPE will be updated
1233 with the other operands in the instruction, and the REMAP
1234 SPR set.
1235 * If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0.
1236 * SVSTATE persistence bit is cleared
1237 * No other alterations to SVSTATE are carried out
1238
Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
uses SVSHAPE1. REMAP.SVme is also set to 0b00110, REMAP.mi1=0
(SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)
1243
1244 Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1
1245 are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
1246 and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1
1247
1248 Rough algorithmic form:
1249
1250 ```
1251 marray = [mi0, mi1, mi2, mo0, mo1]
1252 idx = 0
1253 for bit = 0 to 4:
1254 if not rmm[bit]: continue
1255 setup(SVSHAPE[idx])
1256 SVSTATE{marray[bit]} = idx
1257 idx = (idx+1) modulo 4
1258 ```
1259
1260 When `mm=1`:
1261
1262 * bits 0-2 (MSB0 numbering) of `rmm` indicate an index selecting mi0-mo1
1263 * bits 3-4 (MSB0 numbering) of `rmm` indicate which SVSHAPE 0-3 shall
1264 be updated
1265 * only the selected SVSHAPE is overwritten
1266 * only the relevant bits in the REMAP area of SVSTATE are updated
1267 * REMAP persistence bit is set.
1268
1269 Example 1: if `rmm`=0b01110 then bits 0-2 (MSB0) are 0b011 and
1270 bits 3-4 are 0b10. thus, mo0 is selected and SVSHAPE2
1271 to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
1272 set to 2 (SVSHAPE2).
1273
1274 Example 2: if `rmm`=0b10011 then bits 0-2 (MSB0) are 0b100 and
1275 bits 3-4 are 0b11. thus, mo1 is selected and SVSHAPE3
1276 to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
1277 set to 3 (SVSHAPE3).
1278
1279 Rough algorithmic form:
1280
1281 ```
1282 marray = [mi0, mi1, mi2, mo0, mo1]
1283 bit = rmm[0:2]
1284 idx = rmm[3:4]
1285 setup(SVSHAPE[idx])
1286 SVSTATE{marray[bit]} = idx
1287 SVSTATE.pst = 1
1288 ```
1289
1290 In essence, `mm=0` is intended for use to set as much of the
1291 REMAP State SPRs as practical with a single instruction,
1292 whilst `mm=1` is intended to be a little more refined.
1293
1294 **Usage guidelines**
1295
1296 * **Disable 2D mapping**: to only perform Indexing without
1297 reordering use `SVd=1,sk=0,yx=0` (or set SVd to a value larger
1298 or equal to VL)
1299 * **Modulo 1D mapping**: to perform Indexing cycling through the
1300 first N Indices use `SVd=N,sk=0,yx=0` where `VL>N`. There is
1301 no requirement to set VL equal to a multiple of N.
1302 * **Modulo 2D transposed**: `SVd=M,sk=0,yx=1`, sets
1303 `xdim=M,ydim=CEIL(MAXVL/M)`.
1304
1305 Beyond these mappings it becomes necessary to write directly to
1306 the SVSTATE SPRs manually.
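
An illustrative sketch of the "Modulo 1D mapping" guideline above
(hypothetical helper name, not part of the specification): with
`SVd=N, sk=0, yx=0` and `VL>N`, the Index Vector is simply walked modulo N.

```
def modulo_1d_indices(index_vector, N, VL):
    # the first N entries of the Index Vector are cycled through repeatedly
    return [index_vector[i % N] for i in range(VL)]

# e.g. N=3, VL=7: positions 0,1,2,0,1,2,0 of the Index Vector are selected
```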
1307
1308 -------------
1309
1310 \newpage{}
1311
1312
1313 # svshape2 (offset-priority) <a name="svshape2"> </a>
1314
1315 SVM2-Form
1316
1317 | 0-5|6-9 |10|11-15 |16-20 | 21-24 | 25 | 26-31 | Form |
1318 | -- |----|--| --- | ----- | ------ | -- | ------| -------- |
1319 | PO |offs|yx| rmm | SVd | 100/mm | sk | XO | SVM2-Form |
1320
1321 * svshape2 offs,yx,rmm,SVd,sk,mm
1322
1323 Pseudo-code:
1324
1325 ```
1326 # based on nearest MAXVL compute other dimension
1327 MVL <- SVSTATE[0:6]
1328 d <- [0] * 6
1329 dim <- SVd+1
1330 do while d*dim <u ([0]*4 || MVL)
1331 d <- d + 1
1332 # set up template, then copy once location identified
1333 shape <- [0]*32
1334 shape[30:31] <- 0b00 # mode
1335 shape[0:5] <- (0b0 || SVd) # x/ydim
1336 if SVyx = 0 then
1337 shape[18:20] <- 0b000 # ordering xd/yd(/zd)
1338 if sk = 0 then shape[6:11] <- 0 # ydim
1339 else shape[6:11] <- 0b111111 # ydim max
1340 else
1341 shape[18:20] <- 0b010 # ordering yd/xd(/zd)
1342 if sk = 1 then shape[6:11] <- 0 # ydim
1343 else shape[6:11] <- d-1 # ydim max
1344 # offset (the prime purpose of this instruction)
1345 shape[24:27] <- SVo # offset
1346 if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
1347 else shape[28:29] <- 0b00 # no skipping
1348 # select the mode for updating SVSHAPEs
1349 SVSTATE[62] <- mm # set or clear persistence
1350 if mm = 0 then
1351 # clear out all SVSHAPEs first
1352 SVSHAPE0[0:31] <- [0] * 32
1353 SVSHAPE1[0:31] <- [0] * 32
1354 SVSHAPE2[0:31] <- [0] * 32
1355 SVSHAPE3[0:31] <- [0] * 32
1356 SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
1357 SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
1358 idx <- 0
1359 for bit = 0 to 4
1360 if rmm[4-bit] then
1361 # activate requested shape
1362 if idx = 0 then SVSHAPE0 <- shape
1363 if idx = 1 then SVSHAPE1 <- shape
1364 if idx = 2 then SVSHAPE2 <- shape
1365 if idx = 3 then SVSHAPE3 <- shape
1366 SVSTATE[bit*2+32:bit*2+33] <- idx
1367 # increment shape index, modulo 4
1368 if idx = 3 then idx <- 0
1369 else idx <- idx + 1
1370 else
1371 # refined SVSHAPE/REMAP update mode
1372 bit <- rmm[0:2]
1373 idx <- rmm[3:4]
1374 if idx = 0 then SVSHAPE0 <- shape
1375 if idx = 1 then SVSHAPE1 <- shape
1376 if idx = 2 then SVSHAPE2 <- shape
1377 if idx = 3 then SVSHAPE3 <- shape
1378 SVSTATE[bit*2+32:bit*2+33] <- idx
1379 SVSTATE[46-bit] <- 1
1380 ```
1381
1382 Special Registers Altered:
1383
1384 ```
1385 SVSTATE, SVSHAPE0-3
1386 ```
1387
`svshape2` is an additional convenience instruction that prioritises
setting `SVSHAPE.offset`. Its primary purpose is for use when
element-width overrides are used. It has identical capabilities to `svindex`
in terms of both options (skip, etc.) and ability to activate REMAP
(rmm, mask mode) but unlike `svindex` it does not set GPR REMAP,
only a 1D or 2D `svshape`, and
unlike `svshape` it can set an arbitrary `SVSHAPE.offset` immediate.
1395
One of the limitations of Simple-V is that Vector elements start on the boundary
of the Scalar regfile, which is fine when element-width overrides are not
needed. If a Vector with smaller elwidths must begin
in the middle of a register, normally there would be no way to do so except
through LD/ST. `SVSHAPE.offset` caters for this scenario and `svshape2`
makes it easy to set.
1402
1403 **Operand Fields**:
1404
1405 * **offs** (4 bits) - unsigned offset
1406 * **yx** (1 bit) - swap XY to YX
1407 * **SVd** dimension size
1408 * **rmm** REMAP mask
1409 * **mm** mask mode
1410 * **sk** (1 bit) skips 1st dimension if set
1411
Dimensions are calculated exactly as for `svindex`; `rmm` and
`mm` likewise behave exactly as they do for `svindex`.
1414
1415 *Programmer's Note: offsets for `svshape2` may be specified in the range
1416 0-15. Given that the principle of Simple-V is to fit on top of
1417 byte-addressable register files and that GPR and FPR are 64-bit (8 bytes)
1418 it should be clear that the offset may, when `elwidth=8`, begin an
1419 element-level operation starting element zero at any arbitrary byte.
On cursory examination, attempting to go beyond the range 0-7 seems
unnecessary, given that the **next GPR or FPR** is an
alias for an offset in the range 8-15: by simply increasing
the starting Vector point of the operation to the next register,
an offset range of 0-7 would appear to be sufficient. Unfortunately,
however, some operations are EXTRA2-encoded, and there it is **not possible**
to increase the GPR/FPR register number by one, because EXTRA2 encoding
of GPR/FPR Vector numbers is restricted to even numbering.
1428 For CR Fields the EXTRA2 encoding is even more sparse.
1429 The additional offset range (8-15) helps overcome these limitations.*
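
A minimal non-normative sketch of the byte-level model described in the
note above, for `elwidth=8` (one-byte elements). The helper name and
formula below are illustrative assumptions, not part of the specification.

```
def element_byte_address(RT, elt, offset):
    # each 64-bit GPR occupies 8 bytes of the byte-addressable regfile
    return RT * 8 + offset + elt

# offset=3 starts element zero of a vector "at" r10 on byte 3 of r10:
assert element_byte_address(10, 0, 3) == 10 * 8 + 3
# an offset in the range 8-15 is an alias for the next register:
assert element_byte_address(10, 0, 9) == element_byte_address(11, 0, 1)
```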
1430
*Hardware Implementor's note: with the offsets being immediates,
and with register numbering being entirely immediate as well, it is
possible to correctly compute Register Hazards without
reading the contents of any SPRs. If however there are
instructions that have directly written to the SVSTATE or SVSHAPE
SPRs and those instructions are still in-flight, then this assumption
clearly no longer holds. This is why Programmers are strongly
discouraged from directly writing to these SPRs.*
1439
*Architectural Resource Allocation note: this instruction shares
the space of `svshape`. Therefore it is critical that the two
instructions, `svshape` and `svshape2`, have exactly the same XO
in bits 26 through 31. It is also critical that for `svshape2`
bits 21-23 (the upper portion of the split XO field) be 0b100.*
1445
1446 -------------
1447
1448 \newpage{}
1449
1450 # Forms
1451
1452 Add `SVI, SVM, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
1453
1454 Add the following to Book I, 1.6.1, SVI-Form
1455
1456 ```
1457 |0 |6 |11 |16 |21 |23 |24|25|26 31|
1458 | PO | SVG|rmm | SVd |ew |SVyx|mm|sk| XO |
1459 ```
1460
1461 Add the following to Book I, 1.6.1, SVM-Form
1462
1463 ```
1464 |0 |6 |11 |16 |21 |25 |26 |31 |
1465 | PO | SVxd | SVyd | SVzd | SVrm |vf | XO |
1466 ```
1467
1468 Add the following to Book I, 1.6.1, SVM2-Form
1469
1470 ```
1471 |0 |6 |10 |11 |16 |21 |24|25 |26 |31 |
1472 | PO | SVo |SVyx| rmm | SVd |XO |mm|sk | XO |
1473 ```
1474
1475 Add the following to Book I, 1.6.1, SVRM-Form
1476
1477 ```
1478 |0 |6 |11 |13 |15 |17 |19 |21 |22 |26 |31 |
1479 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |/// | XO |
1480 ```
1481
1482 Add the following to Book I, 1.6.2
1483
1484 ```
1485 mi0 (11:12)
1486 Field used in REMAP to select the SVSHAPE for 1st input register
1487 Formats: SVRM
1488 mi1 (13:14)
1489 Field used in REMAP to select the SVSHAPE for 2nd input register
1490 Formats: SVRM
1491 mi2 (15:16)
1492 Field used in REMAP to select the SVSHAPE for 3rd input register
1493 Formats: SVRM
1494 mm (24)
1495 Field used to specify the meaning of the rmm field for SVI-Form
1496 and SVM2-Form
1497 Formats: SVI, SVM2
1498 mo0 (17:18)
1499 Field used in REMAP to select the SVSHAPE for 1st output register
1500 Formats: SVRM
1501 mo1 (19:20)
1502 Field used in REMAP to select the SVSHAPE for 2nd output register
1503 Formats: SVRM
1504 pst (21)
1505 Field used in REMAP to indicate "persistence" mode (REMAP
1506 continues to apply to multiple instructions)
1507 Formats: SVRM
1508 rmm (11:15)
1509 REMAP Mode field for SVI-Form and SVM2-Form
1510 Formats: SVI, SVM2
1511 sk (25)
1512 Field used to specify dimensional skipping in svindex
1513 Formats: SVI, SVM2
1514 SVd (16:20)
1515 Immediate field used to specify the size of the REMAP dimension
1516 in the svindex and svshape2 instructions
1517 Formats: SVI, SVM2
1518 SVDS (16:29)
    Immediate field used to specify a 14-bit signed
1520 two's complement integer which is concatenated
1521 on the right with 0b00 and sign-extended to 64 bits.
1522 Formats: SVDS
1523 SVG (6:10)
1524 Field used to specify a GPR to be used as a
1525 source for indexing.
1526 Formats: SVI
1527 SVi (16:22)
1528 Simple-V immediate field for setting VL or MVL
1529 Formats: SVL
1530 SVme (6:10)
1531 Simple-V "REMAP" map-enable bits (0-4)
1532 Formats: SVRM
1533 SVo (6:9)
1534 Field used by the svshape2 instruction as an offset
1535 Formats: SVM2
1536 SVrm (21:24)
1537 Simple-V "REMAP" Mode
1538 Formats: SVM
1539 SVxd (6:10)
1540 Simple-V "REMAP" x-dimension size
1541 Formats: SVM
1542 SVyd (11:15)
1543 Simple-V "REMAP" y-dimension size
1544 Formats: SVM
1545 SVzd (16:20)
1546 Simple-V "REMAP" z-dimension size
1547 Formats: SVM
1548 XO (21:23,26:31)
1549 Extended opcode field. Note that bit 21 must be 1, 22 and 23
1550 must be zero, and bits 26-31 must be exactly the same as
1551 used for svshape.
1552 Formats: SVM2
1553 ```
1554
1555 # Appendices
1556
* Appendix E Power ISA sorted by opcode
* Appendix F Power ISA sorted by version
* Appendix G Power ISA sorted by Compliancy Subset
* Appendix H Power ISA sorted by mnemonic
1561
1562 | Form | Book | Page | Version | mnemonic | Description |
1563 |------|------|------|---------|----------|-------------|
1564 | SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
1565 | SVM | I | # | 3.0B | svshape | REMAP shape instruction |
1566 | SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
1567 | SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |
1568
1569 ## REMAP pseudocode
1570
Written in python3, the following stand-alone executable source code is the Canonical
Specification for each REMAP. Vectors of "loopends" are returned, when Rc=1,
in Vectors of CR Fields on `sv.svstep.`, or, in Vertical-First Mode,
in a single CR Field (CR0) on `svstep.`. The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
offset is put through each algorithm to determine the actual Element Offset.
Alternative implementations producing different orderings
are prohibited, because software will be critically relying on these Deterministic Schedules.
1578
1579 ### REMAP 2D/3D Matrix
1580
1581 The following stand-alone executable source code is the Canonical
1582 Specification for Matrix (2D/3D) REMAP.
1583 Hardware implementations are achievable with simple cascading counter-and-compares.
1584
1585 ```
1586 # python "yield" can be iterated. use this to make it clear how
1587 # the indices are generated by using natural-looking nested loops
1588 def iterate_indices(SVSHAPE):
1589 # get indices to iterate over, in the required order
1590 xd = SVSHAPE.lims[0]
1591 yd = SVSHAPE.lims[1]
1592 zd = SVSHAPE.lims[2]
1593 # create lists of indices to iterate over in each dimension
1594 x_r = list(range(xd))
1595 y_r = list(range(yd))
1596 z_r = list(range(zd))
1597 # invert the indices if needed
1598 if SVSHAPE.invxyz[0]: x_r.reverse()
1599 if SVSHAPE.invxyz[1]: y_r.reverse()
1600 if SVSHAPE.invxyz[2]: z_r.reverse()
1601 # start an infinite (wrapping) loop
1602 step = 0 # track src/dst step
1603 while True:
1604 for z in z_r: # loop over 1st order dimension
1605 z_end = z == z_r[-1]
1606 for y in y_r: # loop over 2nd order dimension
1607 y_end = y == y_r[-1]
1608 for x in x_r: # loop over 3rd order dimension
1609 x_end = x == x_r[-1]
1610 # ok work out which order to construct things in.
1611 # start by creating a list of tuples of the dimension
1612 # and its limit
1613 vals = [(SVSHAPE.lims[0], x, "x"),
1614 (SVSHAPE.lims[1], y, "y"),
1615 (SVSHAPE.lims[2], z, "z")
1616 ]
1617 # now select those by order. this allows us to
1618 # create schedules for [z][x], [x][y], or [y][z]
1619 # for matrix multiply.
1620 vals = [vals[SVSHAPE.order[0]],
1621 vals[SVSHAPE.order[1]],
1622 vals[SVSHAPE.order[2]]
1623 ]
1624 # ok now we can construct the result, using bits of
1625 # "order" to say which ones get stacked on
1626 result = 0
1627 mult = 1
1628 for i in range(3):
1629 lim, idx, dbg = vals[i]
1630 # some of the dimensions can be "skipped". the order
1631 # was actually selected above on all 3 dimensions,
1632 # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
1633 # those to be knocked out
1634 if SVSHAPE.skip == i+1: continue
1635 idx *= mult # shifts up by previous dimension(s)
1636 result += idx # adds on this dimension
1637 mult *= lim # for the next dimension
1638
1639 loopends = (x_end |
1640 ((y_end and x_end)<<1) |
1641 ((y_end and x_end and z_end)<<2))
1642
1643 yield result + SVSHAPE.offset, loopends
1644 step += 1
1645
1646 def demo():
1647 # set the dimension sizes here
1648 xdim = 3
1649 ydim = 2
1650 zdim = 4
1651
1652 # set total (can repeat, e.g. VL=x*y*z*4)
1653 VL = xdim * ydim * zdim
1654
1655 # set up an SVSHAPE
1656 class SVSHAPE:
1657 pass
1658 SVSHAPE0 = SVSHAPE()
1659 SVSHAPE0.lims = [xdim, ydim, zdim]
1660 SVSHAPE0.order = [1,0,2] # experiment with different permutations, here
1661 SVSHAPE0.mode = 0b00
1662 SVSHAPE0.skip = 0b00
1663 SVSHAPE0.offset = 0 # experiment with different offset, here
1664 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
1665
1666 # enumerate over the iterator function, getting new indices
1667 for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
1668 if idx >= VL:
1669 break
1670 print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
1671
1672 # run the demo
1673 if __name__ == '__main__':
1674 demo()
1675 ```
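
The "cascading counter-and-compare" remark above can be illustrated with a
small non-normative sketch (names illustrative; skip and inversion omitted
for brevity): each loop variable becomes a counter compared against its
dimension limit, the compare generating a carry into the next counter,
while the Element Offset is formed by weighting the counters according to
the selected ordering.

```
def matrix_remap_counters(xdim, ydim, zdim, order=(0, 1, 2), offset=0):
    counters = [0, 0, 0]                 # x (fastest), y, z
    lims     = [xdim, ydim, zdim]
    while True:                          # wraps, like the canonical generator
        # weight each dimension by the product of those preceding it
        # in the selected ordering
        result, mult = 0, 1
        for dim in order:
            result += counters[dim] * mult
            mult   *= lims[dim]
        yield result + offset
        # cascade: increment x; on compare-match, carry into y, then z
        for i in range(3):
            counters[i] += 1
            if counters[i] < lims[i]:
                break                    # no carry needed
            counters[i] = 0              # limit reached: wrap and carry

# for order=[1,0,2], xdim=3, ydim=2, zdim=4 this produces the same
# offsets as the canonical generator above: 0 2 4 1 3 5 6 8 10 7 ...
```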
1676
1677 ### REMAP Parallel Reduction pseudocode
1678
The python3 program below is stand-alone executable and is the Canonical Specification
for Parallel Reduction REMAP.
The Algorithm is not limited to RADIX2 sizes, and, unlike in Matrix REMAP, Predicate
sources apply to the Element Indices **after** REMAP
has been applied, not before. MV operations are not required: the algorithm
tracks the positions of elements that would normally be moved, and when applying
an Element Reduction Operation it sources the operands from their last-known (tracked)
positions.
1687
1688 ```
1689 # a "yield" version of the Parallel Reduction REMAP algorithm.
1690 # the algorithm is in-place. it does not perform "MV" operations.
1691 # instead, where a masked-out value *should* be read from is tracked
1692
1693 def iterate_indices(SVSHAPE, pred=None):
1694 # get indices to iterate over, in the required order
1695 xd = SVSHAPE.lims[0]
1696 # create lists of indices to iterate over in each dimension
1697 ix = list(range(xd))
1698 # invert the indices if needed
1699 if SVSHAPE.invxyz[0]: ix.reverse()
1700 # start a loop from the lowest step
1701 step = 1
1702 steps = []
1703 while step < xd:
1704 step *= 2
1705 steps.append(step)
1706 # invert the indices if needed
1707 if SVSHAPE.invxyz[1]: steps.reverse()
1708 for step in steps:
1709 stepend = (step == steps[-1]) # note end of steps
1710 idxs = list(range(0, xd, step))
1711 results = []
1712 for i in idxs:
1713 other = i + step // 2
1714 ci = ix[i]
1715 oi = ix[other] if other < xd else None
1716 other_pred = other < xd and (pred is None or pred[oi])
1717 if (pred is None or pred[ci]) and other_pred:
1718 if SVSHAPE.skip == 0b00: # submode 00
1719 result = ci
1720 elif SVSHAPE.skip == 0b01: # submode 01
1721 result = oi
1722 results.append([result + SVSHAPE.offset, 0])
1723 elif other_pred:
1724 ix[i] = oi
1725 if results:
1726 results[-1][1] = (stepend<<1) | 1 # notify end of loops
1727 yield from results
1728
1729 def demo():
1730 # set the dimension sizes here
1731 xdim = 9
1732
1733 # set up an SVSHAPE
1734 class SVSHAPE:
1735 pass
1736 SVSHAPE0 = SVSHAPE()
1737 SVSHAPE0.lims = [xdim, 0, 0]
1738 SVSHAPE0.order = [0,1,2]
1739 SVSHAPE0.mode = 0b10
1740 SVSHAPE0.skip = 0b00
1741 SVSHAPE0.offset = 0 # experiment with different offset, here
1742 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
1743
1744 SVSHAPE1 = SVSHAPE()
1745 SVSHAPE1.lims = [xdim, 0, 0]
1746 SVSHAPE1.order = [0,1,2]
1747 SVSHAPE1.mode = 0b10
1748 SVSHAPE1.skip = 0b01
1749 SVSHAPE1.offset = 0 # experiment with different offset, here
1750 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
1751
1752 # enumerate over the iterator function, getting new indices
1753 shapes = list(iterate_indices(SVSHAPE0)), \
1754 list(iterate_indices(SVSHAPE1))
1755 for idx in range(len(shapes[0])):
1756 l = shapes[0][idx]
1757 r = shapes[1][idx]
1758 (l_idx, lend) = l
1759 (r_idx, rend) = r
1760 print ("%d->%d:%d" % (idx, l_idx, r_idx),
1761 "end", bin(lend)[2:], bin(rend)[2:])
1762
1763 # run the demo
1764 if __name__ == '__main__':
1765 demo()
1766 ```
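
As a non-normative usage sketch (assuming the `iterate_indices` generator
from the listing above), two schedules, one with `skip=0b00` (the
accumulating element) and one with `skip=0b01` (the source element), can be
zipped together to perform an in-place sum-reduction. The helper name and
the use of Python addition in place of a real Element Reduction Operation
are illustrative.

```
def parallel_reduce_sum(vec):
    class SVSHAPE: pass                # same ad-hoc container as the demo
    shapes = []
    for skip in (0b00, 0b01):          # accumulator index, source index
        sh = SVSHAPE()
        sh.lims = [len(vec), 0, 0]
        sh.skip = skip
        sh.offset = 0
        sh.invxyz = [0, 0, 0]
        shapes.append(sh)
    for (dst, _), (src, _) in zip(iterate_indices(shapes[0]),
                                  iterate_indices(shapes[1])):
        vec[dst] = vec[dst] + vec[src]  # stand-in for e.g. sv.add under REMAP
    # with the default (non-inverted) ordering the result lands in element 0
    return vec[0]

assert parallel_reduce_sum(list(range(1, 10))) == 45   # non-RADIX2 length (9)
```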
1767
1768 ### REMAP FFT pseudocode
1769
1770 The FFT REMAP is RADIX2 only.
1771
1772 ```
1773 # a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
1774 # original code for the FFT Tukey-Cooley schedule:
1775 # https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
1776 """
1777 # Radix-2 decimation-in-time FFT (real, not complex)
1778 size = 2
1779 while size <= n:
1780 halfsize = size // 2
1781 tablestep = n // size
1782 for i in range(0, n, size):
1783 k = 0
1784 for j in range(i, i + halfsize):
1785 jh = j+halfsize
1786 jl = j
1787 temp1 = vec[jh] * exptable[k]
1788 temp2 = vec[jl]
1789 vec[jh] = temp2 - temp1
1790 vec[jl] = temp2 + temp1
1791 k += tablestep
1792 size *= 2
1793 """
1794
1795 # python "yield" can be iterated. use this to make it clear how
1796 # the indices are generated by using natural-looking nested loops
1797 def iterate_butterfly_indices(SVSHAPE):
1798 # get indices to iterate over, in the required order
1799 n = SVSHAPE.lims[0]
1800 stride = SVSHAPE.lims[2] # stride-multiplier on reg access
1801 # creating lists of indices to iterate over in each dimension
1802 # has to be done dynamically, because it depends on the size
1803 # first, the size-based loop (which can be done statically)
1804 x_r = []
1805 size = 2
1806 while size <= n:
1807 x_r.append(size)
1808 size *= 2
1809 # invert order if requested
1810 if SVSHAPE.invxyz[0]: x_r.reverse()
1811
1812 if len(x_r) == 0:
1813 return
1814
1815 # start an infinite (wrapping) loop
1816 skip = 0
1817 while True:
1818 for size in x_r: # loop over 3rd order dimension (size)
1819 x_end = size == x_r[-1]
1820 # y_r schedule depends on size
1821 halfsize = size // 2
1822 tablestep = n // size
1823 y_r = []
1824 for i in range(0, n, size):
1825 y_r.append(i)
1826 # invert if requested
1827 if SVSHAPE.invxyz[1]: y_r.reverse()
1828 for i in y_r: # loop over 2nd order dimension
1829 y_end = i == y_r[-1]
1830 k_r = []
1831 j_r = []
1832 k = 0
1833 for j in range(i, i+halfsize):
1834 k_r.append(k)
1835 j_r.append(j)
1836 k += tablestep
1837 # invert if requested
1838 if SVSHAPE.invxyz[2]: k_r.reverse()
1839 if SVSHAPE.invxyz[2]: j_r.reverse()
1840 for j, k in zip(j_r, k_r): # loop over 1st order dimension
1841 z_end = j == j_r[-1]
1842 # now depending on MODE return the index
1843 if SVSHAPE.skip == 0b00:
1844 result = j # for vec[j]
1845 elif SVSHAPE.skip == 0b01:
1846 result = j + halfsize # for vec[j+halfsize]
1847 elif SVSHAPE.skip == 0b10:
1848 result = k # for exptable[k]
1849
1850 loopends = (z_end |
1851 ((y_end and z_end)<<1) |
1852 ((y_end and x_end and z_end)<<2))
1853
1854 yield (result * stride) + SVSHAPE.offset, loopends
1855
1856 def demo():
1857 # set the dimension sizes here
1858 xdim = 8
1859 ydim = 0 # not needed
1860 zdim = 1 # stride must be set to 1
1861
    # set total: count the number of butterflies by replicating
    # the reference loop structure
1864 VL = 0
1865 size = 2
1866 n = xdim
1867 while size <= n:
1868 halfsize = size // 2
1869 tablestep = n // size
1870 for i in range(0, n, size):
1871 for j in range(i, i + halfsize):
1872 VL += 1
1873 size *= 2
1874
1875 # set up an SVSHAPE
1876 class SVSHAPE:
1877 pass
1878 # j schedule
1879 SVSHAPE0 = SVSHAPE()
1880 SVSHAPE0.lims = [xdim, ydim, zdim]
1881 SVSHAPE0.order = [0,1,2] # experiment with different permutations, here
1882 SVSHAPE0.mode = 0b01
1883 SVSHAPE0.skip = 0b00
1884 SVSHAPE0.offset = 0 # experiment with different offset, here
1885 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
    # j+halfsize schedule
    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, ydim, zdim]
    SVSHAPE1.order = [0,1,2] # experiment with different permutations, here
    SVSHAPE1.mode = 0b01
1891 SVSHAPE1.skip = 0b01
1892 SVSHAPE1.offset = 0 # experiment with different offset, here
1893 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
1894 # k schedule
1895 SVSHAPE2 = SVSHAPE()
1896 SVSHAPE2.lims = [xdim, ydim, zdim]
1897 SVSHAPE2.order = [0,1,2] # experiment with different permutations, here
    SVSHAPE2.mode = 0b01
1899 SVSHAPE2.skip = 0b10
1900 SVSHAPE2.offset = 0 # experiment with different offset, here
1901 SVSHAPE2.invxyz = [0,0,0] # inversion if desired
1902
1903 # enumerate over the iterator function, getting new indices
1904 schedule = []
1905 for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
1906 iterate_butterfly_indices(SVSHAPE1),
1907 iterate_butterfly_indices(SVSHAPE2))):
1908 if idx >= VL:
1909 break
1910 schedule.append((jl, jh, k))
1911
1912 # ok now pretty-print the results, with some debug output
1913 size = 2
1914 idx = 0
1915 while size <= n:
1916 halfsize = size // 2
1917 tablestep = n // size
1918 print ("size %d halfsize %d tablestep %d" % \
1919 (size, halfsize, tablestep))
1920 for i in range(0, n, size):
1921 prefix = "i %d\t" % i
1922 k = 0
1923 for j in range(i, i + halfsize):
1924 (jl, je), (jh, he), (ks, ke) = schedule[idx]
1925 print (" %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
1926 "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
1927 (idx, prefix, j, j+halfsize, k,
1928 jl, jh, ks,
1929 ),
                   "end", bin(je)[2:], bin(he)[2:], bin(ke)[2:])
1931 k += tablestep
1932 idx += 1
1933 size *= 2
1934
1935 # run the demo
1936 if __name__ == '__main__':
1937 demo()
1938 ```
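
As a non-normative usage sketch (assuming the `iterate_butterfly_indices`
generator from the listing above, and an `exptable` of `n//2` entries
prepared exactly as in the reference code quoted in the docstring), the
three schedules can be zipped together to drive the reference butterflies
in-place, replacing the nested loops:

```
from itertools import islice

def fft_with_remap(vec, exptable):
    n = len(vec)                        # must be a power of two
    class SVSHAPE: pass                 # same ad-hoc container as the demo
    shapes = []
    for skip in (0b00, 0b01, 0b10):     # vec[j], vec[j+halfsize], exptable[k]
        sh = SVSHAPE()
        sh.lims = [n, 0, 1]             # [xdim, unused, stride]
        sh.skip = skip
        sh.offset = 0
        sh.invxyz = [0, 0, 0]
        shapes.append(sh)
    gens = [iterate_butterfly_indices(sh) for sh in shapes]
    VL = (n // 2) * (n.bit_length() - 1)    # butterflies: (n/2) * log2(n)
    for (jl, _), (jh, _), (k, _) in islice(zip(*gens), VL):
        temp1 = vec[jh] * exptable[k]   # the reference butterfly from the
        temp2 = vec[jl]                 # docstring, now driven by the three
        vec[jh] = temp2 - temp1         # REMAP schedules
        vec[jl] = temp2 + temp1
    return vec
```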
1939
1940 ### DCT REMAP
1941
DCT REMAP is RADIX2 only. Convolutions may be applied as usual
to create non-RADIX2 DCTs. Combined with appropriate Twin-butterfly
instructions, the algorithms below (written in python3) become part
of an in-place, in-registers Vectorised DCT. They work
by loading data such that, as the nested loops progress, the results
are sorted into the correct sequential order.
1948
1949 ```
1950 # DCT "REMAP" scheduler to create an in-place iterative DCT.
1951 #
1952
1953 # bits of the integer 'val' of width 'width' are reversed
1954 def reverse_bits(val, width):
1955 result = 0
1956 for _ in range(width):
1957 result = (result << 1) | (val & 1)
1958 val >>= 1
1959 return result
1960
1961
1962 # iterative version of [recursively-applied] half-reversing
1963 # turns out this is Gray-Encoding.
1964 def halfrev2(vec, pre_rev=True):
1965 res = []
1966 for i in range(len(vec)):
1967 if pre_rev:
1968 res.append(vec[i ^ (i>>1)])
1969 else:
1970 ri = i
1971 bl = i.bit_length()
1972 for ji in range(1, bl):
1973 ri ^= (i >> ji)
1974 res.append(vec[ri])
1975 return res
1976
1977
1978 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
1979 # get indices to iterate over, in the required order
1980 n = SVSHAPE.lims[0]
1981 mode = SVSHAPE.lims[1]
1982 stride = SVSHAPE.lims[2]
1983
1984 # reference list for not needing to do data-swaps, just swap what
1985 # *indices* are referenced (two levels of indirection at the moment)
1986 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
1987 ji = list(range(n))
1988
1989 levels = n.bit_length() - 1
1990 ri = [reverse_bits(i, levels) for i in range(n)]
1991
1992 if SVSHAPE.mode == 0b01: # FFT, bitrev only
1993 ji = [ji[ri[i]] for i in range(n)]
1994 elif SVSHAPE.submode2 == 0b001:
1995 ji = [ji[ri[i]] for i in range(n)]
1996 ji = halfrev2(ji, True)
1997 else:
1998 ji = halfrev2(ji, False)
1999 ji = [ji[ri[i]] for i in range(n)]
2000
2001 # invert order if requested
2002 if SVSHAPE.invxyz[0]:
2003 ji.reverse()
2004
2005 for i, jl in enumerate(ji):
2006 y_end = jl == ji[-1]
2007 yield jl * stride, (0b111 if y_end else 0b000)
2008
2009 def iterate_dct_inner_costable_indices(SVSHAPE):
2010 # get indices to iterate over, in the required order
2011 n = SVSHAPE.lims[0]
2012 mode = SVSHAPE.lims[1]
2013 stride = SVSHAPE.lims[2]
2014 # creating lists of indices to iterate over in each dimension
2015 # has to be done dynamically, because it depends on the size
2016 # first, the size-based loop (which can be done statically)
2017 x_r = []
2018 size = 2
2019 while size <= n:
2020 x_r.append(size)
2021 size *= 2
2022 # invert order if requested
2023 if SVSHAPE.invxyz[0]:
2024 x_r.reverse()
2025
2026 if len(x_r) == 0:
2027 return
2028
2029 # start an infinite (wrapping) loop
2030 skip = 0
2031 z_end = 1 # doesn't exist in this, only 2 loops
2032 k = 0
2033 while True:
2034 for size in x_r: # loop over 3rd order dimension (size)
2035 x_end = size == x_r[-1]
2036 # y_r schedule depends on size
2037 halfsize = size // 2
2038 y_r = []
2039 for i in range(0, n, size):
2040 y_r.append(i)
2041 # invert if requested
2042 if SVSHAPE.invxyz[1]: y_r.reverse()
            # half-range indices, e.g. j 0123
            j = list(range(0, halfsize))
            # invert if requested
            if SVSHAPE.invxyz[2]: j.reverse()
2047 # loop over 1st order dimension
2048 for ci, jl in enumerate(j):
2049 y_end = jl == j[-1]
2050 # now depending on MODE return the index. inner butterfly
2051 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
2052 result = k # offset into COS table
2053 elif SVSHAPE.skip == 0b10: #
2054 result = ci # coefficient helper
2055 elif SVSHAPE.skip == 0b11: #
2056 result = size # coefficient helper
2057 loopends = (z_end |
2058 ((y_end and z_end)<<1) |
2059 ((y_end and x_end and z_end)<<2))
2060
2061 yield (result * stride) + SVSHAPE.offset, loopends
2062 k += 1
2063
2064 def iterate_dct_inner_butterfly_indices(SVSHAPE):
2065 # get indices to iterate over, in the required order
2066 n = SVSHAPE.lims[0]
2067 mode = SVSHAPE.lims[1]
2068 stride = SVSHAPE.lims[2]
2069 # creating lists of indices to iterate over in each dimension
2070 # has to be done dynamically, because it depends on the size
2071 # first, the size-based loop (which can be done statically)
2072 x_r = []
2073 size = 2
2074 while size <= n:
2075 x_r.append(size)
2076 size *= 2
2077 # invert order if requested
2078 if SVSHAPE.invxyz[0]:
2079 x_r.reverse()
2080
2081 if len(x_r) == 0:
2082 return
2083
2084 # reference (read/write) the in-place data in *reverse-bit-order*
2085 ri = list(range(n))
2086 if SVSHAPE.submode2 == 0b01:
2087 levels = n.bit_length() - 1
2088 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
2089
2090 # reference list for not needing to do data-swaps, just swap what
2091 # *indices* are referenced (two levels of indirection at the moment)
2092 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
2093 ji = list(range(n))
2094 inplace_mode = True
2095 if inplace_mode and SVSHAPE.submode2 == 0b01:
2096 ji = halfrev2(ji, True)
2097 if inplace_mode and SVSHAPE.submode2 == 0b11:
2098 ji = halfrev2(ji, False)
2099
2100 # start an infinite (wrapping) loop
2101 while True:
2102 k = 0
2103 k_start = 0
2104 for size in x_r: # loop over 3rd order dimension (size)
2105 x_end = size == x_r[-1]
2106 # y_r schedule depends on size
2107 halfsize = size // 2
2108 y_r = []
2109 for i in range(0, n, size):
2110 y_r.append(i)
2111 # invert if requested
2112 if SVSHAPE.invxyz[1]: y_r.reverse()
2113 for i in y_r: # loop over 2nd order dimension
2114 y_end = i == y_r[-1]
2115 # two lists of half-range indices, e.g. j 0123, jr 7654
2116 j = list(range(i, i + halfsize))
2117 jr = list(range(i+halfsize, i + size))
2118 jr.reverse()
2119 # invert if requested
2120 if SVSHAPE.invxyz[2]:
2121 j.reverse()
2122 jr.reverse()
2123 hz2 = halfsize // 2 # zero stops reversing 1-item lists
2124 # loop over 1st order dimension
2125 k = k_start
2126 for ci, (jl, jh) in enumerate(zip(j, jr)):
2127 z_end = jl == j[-1]
2128 # now depending on MODE return the index. inner butterfly
2129 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
2130 if SVSHAPE.submode2 == 0b11: # iDCT
2131 result = ji[ri[jl]] # lower half
2132 else:
2133 result = ri[ji[jl]] # lower half
2134 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
2135 if SVSHAPE.submode2 == 0b11: # iDCT
2136 result = ji[ri[jl+halfsize]] # upper half
2137 else:
2138 result = ri[ji[jh]] # upper half
2139 elif mode == 4:
2140 # COS table pre-generated mode
2141 if SVSHAPE.skip == 0b10: #
2142 result = k # cos table offset
2143 else: # mode 2
2144 # COS table generated on-demand ("Vertical-First") mode
2145 if SVSHAPE.skip == 0b10: #
2146 result = ci # coefficient helper
2147 elif SVSHAPE.skip == 0b11: #
2148 result = size # coefficient helper
2149 loopends = (z_end |
2150 ((y_end and z_end)<<1) |
2151 ((y_end and x_end and z_end)<<2))
2152
2153 yield (result * stride) + SVSHAPE.offset, loopends
2154 k += 1
2155
2156 # now in-place swap
2157 if inplace_mode:
2158 for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
2159 jlh = jl+halfsize
2160 tmp1, tmp2 = ji[jlh], ji[jh]
2161 ji[jlh], ji[jh] = tmp2, tmp1
2162
            # new k_start point for cos tables (runs inside the x_r loop, NOT the i loop)
2164 k_start += halfsize
2165
2166
2167 # python "yield" can be iterated. use this to make it clear how
2168 # the indices are generated by using natural-looking nested loops
2169 def iterate_dct_outer_butterfly_indices(SVSHAPE):
2170 # get indices to iterate over, in the required order
2171 n = SVSHAPE.lims[0]
2172 mode = SVSHAPE.lims[1]
2173 stride = SVSHAPE.lims[2]
2174 # creating lists of indices to iterate over in each dimension
2175 # has to be done dynamically, because it depends on the size
2176 # first, the size-based loop (which can be done statically)
2177 x_r = []
2178 size = n // 2
2179 while size >= 2:
2180 x_r.append(size)
2181 size //= 2
2182 # invert order if requested
2183 if SVSHAPE.invxyz[0]:
2184 x_r.reverse()
2185
2186 if len(x_r) == 0:
2187 return
2188
2189 # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
2190 ri = list(range(n))
2191 if SVSHAPE.submode2 in [0b11, 0b01]:
2192 levels = n.bit_length() - 1
2193 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
2194
2195 # reference list for not needing to do data-swaps, just swap what
2196 # *indices* are referenced (two levels of indirection at the moment)
2197 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
2198 ji = list(range(n))
2199 inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
2200 if SVSHAPE.submode2 == 0b11:
2201 ji = halfrev2(ji, False)
2202
2203 # start an infinite (wrapping) loop
2204 while True:
2205 k = 0
2206 k_start = 0
2207 for size in x_r: # loop over 3rd order dimension (size)
2208 halfsize = size//2
2209 x_end = size == x_r[-1]
2210 y_r = list(range(0, halfsize))
2211 # invert if requested
2212 if SVSHAPE.invxyz[1]: y_r.reverse()
2213 for i in y_r: # loop over 2nd order dimension
2214 y_end = i == y_r[-1]
2215 # one list to create iterative-sum schedule
2216 jr = list(range(i+halfsize, i+n-halfsize, size))
2217 # invert if requested
2218 if SVSHAPE.invxyz[2]: jr.reverse()
2219 hz2 = halfsize // 2 # zero stops reversing 1-item lists
2220 k = k_start
2221 for ci, jh in enumerate(jr): # loop over 1st order dimension
2222 z_end = jh == jr[-1]
2223 if mode == 4:
2224 # COS table pre-generated mode
2225 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
2226 if SVSHAPE.submode2 == 0b11: # iDCT
2227 result = ji[ri[jh]] # upper half
2228 else:
2229 result = ri[ji[jh]] # lower half
2230 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
2231 if SVSHAPE.submode2 == 0b11: # iDCT
2232 result = ji[ri[jh+size]] # upper half
2233 else:
2234 result = ri[ji[jh+size]] # upper half
2235 elif SVSHAPE.skip == 0b10: #
2236 result = k # cos table offset
2237 else:
2238 # COS table generated on-demand ("Vertical-First") mode
2239 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
2240 if SVSHAPE.submode2 == 0b11: # iDCT
2241 result = ji[ri[jh]] # lower half
2242 else:
2243 result = ri[ji[jh]] # lower half
2244 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
2245 if SVSHAPE.submode2 == 0b11: # iDCT
2246 result = ji[ri[jh+size]] # upper half
2247 else:
2248 result = ri[ji[jh+size]] # upper half
2249 elif SVSHAPE.skip == 0b10: #
2250 result = ci # coefficient helper
2251 elif SVSHAPE.skip == 0b11: #
2252 result = size # coefficient helper
2253 loopends = (z_end |
2254 ((y_end and z_end)<<1) |
2255 ((y_end and x_end and z_end)<<2))
2256
2257 yield (result * stride) + SVSHAPE.offset, loopends
2258 k += 1
2259
            # new k_start point for cos tables (runs inside the x_r loop, NOT the i loop)
2261 k_start += halfsize
2262
2263 ```
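
As a small non-normative check of the two index helpers in the listing
above (for n=8): `reverse_bits` produces the bit-reversed ordering, and
`halfrev2` produces the Gray-code (and inverse-Gray-code) orderings that
the load/store and butterfly schedules combine.

```
print([reverse_bits(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
print(halfrev2(list(range(8)), True))           # [0, 1, 3, 2, 6, 7, 5, 4]
print(halfrev2(list(range(8)), False))          # [0, 1, 3, 2, 7, 6, 4, 5]
```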
2264
2265 ## REMAP selector
2266
The pseudocode below shows how the REMAP Schedule for each SVSHAPE is selected.
Each SVSHAPE (0-3) goes through this selection process.
2269
2270 ```
2271 if SVSHAPEn.mode == 0b00:
2272 iterate_fn = iterate_indices
2273 elif SVSHAPEn.mode == 0b10:
2274 iterate_fn = iterate_preduce_indices
2275 elif SVSHAPEn.mode in [0b01, 0b11]:
2276 # further sub-selection
2277 if SVSHAPEn.ydimsz == 1:
2278 iterate_fn = iterate_butterfly_indices
2279 elif SVSHAPEn.ydimsz == 2:
2280 iterate_fn = iterate_dct_inner_butterfly_indices
2281 elif SVSHAPEn.ydimsz == 3:
2282 iterate_fn = iterate_dct_outer_butterfly_indices
2283 elif SVSHAPEn.ydimsz in [5, 13]:
2284 iterate_fn = iterate_dct_inner_costable_indices
2285 elif SVSHAPEn.ydimsz in [6, 14, 15]:
2286 iterate_fn = iterate_dct_inner_halfswap_loadstore
2287 ```
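
A non-normative sketch of how a simulator might consume the selection
above: once `iterate_fn` has been chosen for a given SVSHAPE, the
(potentially wrapping) generator is truncated to VL entries to obtain that
shape's complete Deterministic Schedule. The helper name is illustrative.

```
from itertools import islice

def build_schedule(iterate_fn, SVSHAPEn, VL):
    # each entry is an (element_offset, loopends) pair
    return list(islice(iterate_fn(SVSHAPEn), VL))
```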
2288
2289
2290 [[!tag opf_rfc]]