# RFC ls009 Simple-V REMAP Subsystem

* Funded by NLnet under the Privacy and Enhanced Trust Programme, EU
  Horizon2020 Grant 825310, and NGI0 Entrust No 101069594
* <https://libre-soc.org/openpower/sv/rfc/ls009/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1060>
* <https://git.openpower.foundation/isa/PowerISA/issues/124>

**Severity**: Major

**Status**: New

**Date**: 26 Mar 2023. v2 TODO

**Target**: v3.2B

**Source**: v3.0B

**Books and Sections affected**:

```
Book I, new Zero-Overhead-Loop Chapter.
Appendix E Power ISA sorted by opcode
Appendix F Power ISA sorted by version
Appendix G Power ISA sorted by Compliancy Subset
Appendix H Power ISA sorted by mnemonic
```

**Summary**

```
svremap  - Re-Mapping of Register Element Offsets
svindex  - General-purpose setting of SHAPEs to be re-mapped
svshape  - Hardware-level setting of SHAPEs for element re-mapping
svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
Management Instructions which provide advanced features such as Matrix,
FFT and DCT Hardware-Assist Schedules and general-purpose Index reordering.
```

**Impact on software**:

```
Requires support for new instructions in assembler, debuggers,
and related tools.
```

**Keywords**:

```
Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC),
Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
Digital Signal Processing (DSP)
```

**Motivation**

These REMAP Management instructions provide state-of-the-art advanced
capabilities that dramatically decrease instruction count and power
consumption whilst retaining unprecedented general-purpose capability
and a standard Sequential Execution Model.

**Notes and Observations**:

1. The alternatives within the SIMD paradigm, although also costly, result
   in huge algorithmic complexity and associated increases in power
   consumption. Loop-unrolling compilers are prevalent, as are kernels of
   thousands to tens of thousands of instructions.
2. The core inner kernels of Matrix Multiply, DCT, DFT, FFT and NTT are
   dramatically reduced to a handful of instructions.
3. With the exception of Indexed, no REMAP instruction relies on registers
   for the establishment of REMAP capability.
4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
   and flexibility of REMAP.
5. The Simple-V Compliancy Subsets make REMAP optional except in the
   Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
   hardware to implement REMAP.

**Changes**

Add the following entries to:

* the Appendices of SV Book
* Instructions of SV Book as a new Section
* SVI, SVM, SVM2, SVRM Forms of Book I Section 1.6.1.6 and 1.6.2

----------------

\newpage{}
# Rationale and background

In certain common algorithms there are clear patterns of behaviour which
typically require inline loop-unrolled instructions comprising interleaved
permute and arithmetic operations. REMAP was invented to split the permuting
away from the arithmetic, and to allow the Indexing to be done as a hardware
Schedule.

A normal Vector Add:

```
for i in range(VL):
    GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
```

A Hardware-assisted REMAP Vector Add:

```
for i in range(VL):
    GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];
```

The result is a huge saving on register file accesses (no need to calculate
Indices then use Permutation instructions), on instruction count (a Matrix
Multiply of up to 127 FMACs is 3 instructions), and on programmer sanity.
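
The difference between the two loops above can be modelled directly in
Python (an illustrative sketch only, not the Canonical Specification;
`remap1`/`remap2`/`remap3` are placeholder index functions):

```python
# Sketch: a Vector Add with optional per-operand index remapping.
# When no remap functions are given this degenerates to the normal
# sequential Vector Add.
def vector_add(rt, ra, rb, VL, remap1=None, remap2=None, remap3=None):
    ident = lambda i: i  # identity remap: a normal Vector Add
    r1 = remap1 or ident
    r2 = remap2 or ident
    r3 = remap3 or ident
    for i in range(VL):
        rt[r1(i)] = ra[r2(i)] + rb[r3(i)]

ra = [1, 2, 3]
rb = [10, 20, 30]
rt = [0, 0, 0]
vector_add(rt, ra, rb, 3)                           # rt: [11, 22, 33]
rt2 = [0, 0, 0]
vector_add(rt2, ra, rb, 3, remap1=lambda i: 2 - i)  # rt2: [33, 22, 11]
```

Note that only the element *offsets* change: the add itself is untouched,
which is exactly the principle REMAP applies to Scalar Base operations.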

**Basic principle**

The following illustrates why REMAP was added.

* Normal vector element read/write of operands would be sequential
  (0 1 2 3 ...).
* This is not appropriate for (e.g.) Matrix multiply, which requires
  accessing elements in alternative sequences (0 3 6 1 4 7 ...).
* Normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copy large vectors, spill through memory)
  and very few Packed SIMD ISAs cope with non-power-of-two sizes
  (duplicate-data inline loop-unrolling is the costly solution).
* REMAP **redefines** the order of access according to set
  (Deterministic) "Schedules".
* Matrix Schedules are not at all restricted to power-of-two boundaries,
  making it unnecessary to have, for example, the specialised 3x4 transpose
  instructions of other Vector ISAs.
* DCT and FFT REMAP are RADIX-2-limited, but this is the case in existing
  Packed/Predicated SIMD ISAs anyway (and Bluestein Convolution is
  typically deployed to solve that).
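
The alternative-sequence example above (0 3 6 1 4 7 ...) is simply a
column-major walk of a row-major matrix. A hedged sketch of such a
transpose Schedule (`transpose_schedule` is an illustrative name, not
part of the specification):

```python
# Sketch: the element offsets a transpose Schedule would generate for a
# row-major matrix read column-by-column.  Works equally for
# non-power-of-two shapes such as 3x4.
def transpose_schedule(rows, cols):
    return [r * cols + c for c in range(cols) for r in range(rows)]

print(transpose_schedule(3, 3))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
print(transpose_schedule(3, 4))  # non-power-of-two: no padding required
```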

Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and in hardware. For
arbitrary remapping the `Indexed` REMAP may be used.

# REMAP types

This section summarises the motivation for each REMAP Schedule
and briefly goes over their characteristics and limitations.
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].

## Matrix (1D/2D/3D shaping)

Matrix Multiplication is a huge part of High-Performance Compute and 3D.
In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
cope with for example 3x4 Matrices, recommend rolling data-repetition and
loop-unrolling. Aside from the cost of the load on the L1 I-Cache, the
trick only works if one of the dimensions X or Y is a power of two. Prime
numbers (5x7, 3x5) become deeply problematic to unroll.

Even traditional Scalable Vector ISAs have issues with Matrices, often
having to perform data Transpose by pushing out through Memory and back
(costly), or computing Transposition Indices (costly) then copying to
another Vector (costly).

Matrix REMAP was thus designed to solve these issues by providing
Hardware-Assisted "Schedules" that can view what would otherwise be
limited to a strictly linear Vector as instead being 2D (even 3D)
*in-place* reordered. With both Transposition and non-power-of-two being
supported, the issues faced by other ISAs are mitigated.

Limitations of Matrix REMAP are that the Vector Length (VL) is currently
restricted to 127: up to 127 FMAs (or other operations)
may be performed in total.
Also, given that it is in-registers only at present, some care has to be
taken over regfile resource utilisation. However it is perfectly possible
to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
the usual 6-level "Tiled" large Matrix Multiply, without the usual
difficulties associated with SIMD.
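
As an illustration of the principle (not the Canonical Specification),
the three inner kernel loops of a small Matrix Multiply can be expressed
as a single fused schedule of element offsets, with one scalar
multiply-add per schedule step; all names here are hypothetical:

```python
# Sketch: three nested loops fused into one stream of (C, A, B) register
# element offsets -- the job that three SVSHAPEs do in hardware, leaving
# a single scalar FMA as the only arithmetic instruction.
def matmul_schedules(X, Y, Z):
    # X = rows of A, Y = cols of B, Z = shared dimension
    for i in range(X):
        for j in range(Y):
            for k in range(Z):
                yield i * Y + j, i * Z + k, k * Y + j

def matmul(A, B, X, Y, Z):
    C = [0] * (X * Y)
    for ci, ai, bi in matmul_schedules(X, Y, Z):
        C[ci] += A[ai] * B[bi]   # one multiply-add per schedule step
    return C

A = [1, 2, 3,
     4, 5, 6]        # 2x3, row-major
B = [7, 8,
     9, 10,
     11, 12]         # 3x2, row-major
print(matmul(A, B, 2, 2, 3))  # [58, 64, 139, 154]
```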

Also the `svshape` instruction only provides access to part of the
Matrix REMAP capability. Rotation and mirroring need to be done by
programming the SVSHAPE SPRs directly, which can take a lot more
instructions. Future versions of SVP64 will include EXT1xx prefixed
variants (`psvshape`) which provide more comprehensive capacity and
mitigate the need to write directly to the SVSHAPE SPRs.

## FFT/DCT Triple Loop

DCT and FFT are some of the most ubiquitously used algorithms in
Computer Science: Radar, Audio, Video, R.F. Baseband and dozens more. At
least two DSPs, TMS320 and Hexagon, have VLIW instructions specially
tailored to FFT.

An in-depth analysis showed that it is possible to do in-place in-register
DCT and FFT as long as twin-result "butterfly" instructions are provided.
These can be found in the [[openpower/isa/svfparith]] page if performing
IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
integer operations would be required, currently under development)*. These
"butterfly" instructions avoid the need for a temporary register because
the two array positions being overwritten will be "in-flight" in any
In-Order or Out-of-Order micro-architecture.
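
A minimal sketch of one such twin-result butterfly (assuming real-valued
data, and written as a plain Python function rather than an instruction),
showing why no temporary vector is needed: both results land back in the
two positions that were just read:

```python
# Sketch: one radix-2 butterfly step.  Two elements are read, two
# results are written back to the same two positions -- in-place,
# with no temporary register required.
def butterfly(vec, jl, jh, coeff):
    t1 = vec[jh] * coeff
    t2 = vec[jl]
    vec[jl] = t2 + t1   # "twin" result 1
    vec[jh] = t2 - t1   # "twin" result 2

v = [3.0, 5.0]
butterfly(v, 0, 1, 1.0)
print(v)  # [8.0, -2.0]
```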

DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
accept predicate masks. Given that it is common to perform recursive
convolutions, combining smaller Power-2 DCT/FFT to create larger DCT/FFTs,
in practice the RADIX2 limit is not a problem. A Bluestein convolution
to compute arbitrary lengths is demonstrated by
[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)

## Indexed

The purpose of Indexing is to provide a generalised version of
Vector ISA "Permute" instructions, such as VSX `vperm`. The
Indexing is abstracted out and may be applied to much more
than an element move/copy, and is not limited for example
to the number of bytes that can fit into a VSX register.
Indexing may be applied to LD/ST (even on Indexed LD/ST
instructions such as `sv.lbzx`), to arithmetic operations,
to `extsw`: there is no artificial limit.

The original motivation for Indexed REMAP was to mitigate the need to add
an expensive `mv.x` to the Scalar ISA, which was highly likely to be
rejected as a stand-alone instruction (`GPR(RT) <- GPR(GPR(RA))`). Usually
a Vector ISA would add a non-conflicting variant (as in VSX `vperm`) but
it is common to need to permute by source, with the risk of a conflict
that has to be resolved, for example, in AVX-512 with `vpconflictd`.

Indexed REMAP is the "Get out of Jail Free" card which (for a price)
allows any general-purpose arbitrary Element Permutation not covered by
the other Hardware Schedules.
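
A hedged sketch of Indexed REMAP semantics applied to a Vector Add, with
the index vector modelled as a plain list (`idx` remaps the first source
operand only, for illustration; the function name is hypothetical):

```python
# Sketch: RT[i] = RA[idx[i]] + RB[i] -- an arbitrary source
# permutation, the generalised equivalent of vperm / mv.x, but
# applied to an arithmetic operation rather than a move.
def indexed_vector_add(ra, rb, idx):
    return [ra[idx[i]] + rb[i] for i in range(len(idx))]

print(indexed_vector_add([10, 20, 30], [1, 2, 3], [2, 0, 1]))  # [31, 12, 23]
```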

## Parallel Reduction

Vector Reduce Mode issues a deterministic tree-reduction schedule to the
underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
*appearance* and *effect* of Reduction. Parallel Reduction is not limited
to power-of-two sizes but is limited as usual by the total number of
element operations (127) as well as available register file size.

Parallel Reduction is normally done explicitly as a loop-unrolled
operation, taking up a significant number of instructions. With REMAP it
is just three instructions: two for setup and one Scalar Base.
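
The tree-reduction schedule can be sketched as follows (illustrative
only, not the Canonical Specification; note that it copes with a
non-power-of-two element count):

```python
# Sketch: a deterministic binary-tree reduction.  Each pass combines
# pairs "step" apart; all combines within one pass are independent and
# may therefore execute in parallel.  The Scalar Base operation (here,
# the op callable) is applied unmodified.
def tree_reduce(vec, op):
    vec = list(vec)
    n = len(vec)
    step = 1
    while step < n:
        step *= 2
        for i in range(0, n, step):
            other = i + step // 2
            if other < n:
                vec[i] = op(vec[i], vec[other])
    return vec[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7], lambda a, b: a + b))  # 28
```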

## Parallel Prefix Sum

This is a work-efficient Parallel Schedule that, for example, produces
Triangular or Factorial number sequences. Half of the Prefix Sum Schedule
is near-identical to Parallel Reduction. Whilst the Arithmetic mapreduce
Mode (`/mr`) may achieve the same end-result, implementations may only
implement Mapreduce in serial form (or give the appearance to Programmers
of the same). The Parallel Prefix Schedule is *required* to be implemented
in such a way that its Deterministic Schedule may be parallelised. Like
the Reduction Schedule it is 100% Deterministic and consequently may be
used with non-commutative operations.

Parallel Reduction and Parallel Prefix Sum may be combined to help
perform AI non-linear interpolation in around twelve to fifteen
instructions.
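
A hedged sketch of a parallelisable inclusive prefix sum by step-doubling
(illustrative only, and *not* the SV Schedule itself: unlike the actual
work-efficient Schedule, this variant performs more total operations, but
every combine within a pass is independent and order-preserving, so it is
safe for non-commutative operations):

```python
# Sketch: inclusive prefix sum by step-doubling.  All combines in one
# pass read a snapshot taken before the pass, so they may run in
# parallel.  Applied with "+" it yields running totals (Triangular
# numbers for input 1,2,3,...).
def prefix_sum(vec, op):
    vec = list(vec)
    step = 1
    while step < len(vec):
        prev = list(vec)  # snapshot: combines in this pass are independent
        for i in range(step, len(vec)):
            vec[i] = op(prev[i - step], prev[i])  # earlier operand on left
        step *= 2
    return vec

print(prefix_sum([1, 2, 3, 4, 5], lambda a, b: a + b))  # [1, 3, 6, 10, 15]
```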

----------------

\newpage{}

Add the following to SV Book

[[!inline pages="openpower/sv/remap" raw=yes ]]

# Forms

Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2

Add the following to Book I, 1.6.1, SVI-Form

```
|0     |6    |11    |16   |21 |23  |24|25|26    31|
| PO   |  SVG|rmm   | SVd |ew |SVyx|mm|sk|   XO   |
```

Add the following to Book I, 1.6.1, SVM-Form

```
|0     |6      |11      |16    |21    |25 |26    |31  |
| PO   |  SVxd |   SVyd | SVzd | SVrm |vf |   XO      |
```

Add the following to Book I, 1.6.1, SVM2-Form

```
|0     |6   |10  |11      |16    |21 |24|25 |26    |31  |
| PO   | SVo|SVyx| rmm    | SVd  |XO |mm|sk|   XO      |
```

Add the following to Book I, 1.6.1, SVRM-Form

```
|0     |6     |11  |13   |15   |17   |19   |21  |22   |26     |31 |
| PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |///  | XO        |
```

Add the following to Book I, 1.6.2

```
mi0 (11:12)
    Field used in REMAP to select the SVSHAPE for 1st input register
    Formats: SVRM
mi1 (13:14)
    Field used in REMAP to select the SVSHAPE for 2nd input register
    Formats: SVRM
mi2 (15:16)
    Field used in REMAP to select the SVSHAPE for 3rd input register
    Formats: SVRM
mm (24)
    Field used to specify the meaning of the rmm field for SVI-Form
    and SVM2-Form
    Formats: SVI, SVM2
mo0 (17:18)
    Field used in REMAP to select the SVSHAPE for 1st output register
    Formats: SVRM
mo1 (19:20)
    Field used in REMAP to select the SVSHAPE for 2nd output register
    Formats: SVRM
pst (21)
    Field used in REMAP to indicate "persistence" mode (REMAP
    continues to apply to multiple instructions)
    Formats: SVRM
rmm (11:15)
    REMAP Mode field for SVI-Form and SVM2-Form
    Formats: SVI, SVM2
sk (25)
    Field used to specify dimensional skipping in svindex
    Formats: SVI, SVM2
SVd (16:20)
    Immediate field used to specify the size of the REMAP dimension
    in the svindex and svshape2 instructions
    Formats: SVI, SVM2
SVDS (16:29)
    Immediate field used to specify a 9-bit signed
    two's complement integer which is concatenated
    on the right with 0b00 and sign-extended to 64 bits.
    Formats: SVDS
SVG (6:10)
    Field used to specify a GPR to be used as a
    source for indexing.
    Formats: SVI
SVi (16:22)
    Simple-V immediate field for setting VL or MVL
    Formats: SVL
SVme (6:10)
    Simple-V "REMAP" map-enable bits (0-4)
    Formats: SVRM
SVo (6:9)
    Field used by the svshape2 instruction as an offset
    Formats: SVM2
SVrm (21:24)
    Simple-V "REMAP" Mode
    Formats: SVM
SVxd (6:10)
    Simple-V "REMAP" x-dimension size
    Formats: SVM
SVyd (11:15)
    Simple-V "REMAP" y-dimension size
    Formats: SVM
SVzd (16:20)
    Simple-V "REMAP" z-dimension size
    Formats: SVM
XO (21:23,26:31)
    Extended opcode field.  Note that bit 21 must be 1, 22 and 23
    must be zero, and bits 26-31 must be exactly the same as
    used for svshape.
    Formats: SVM2
```

# Appendices

* Appendix E Power ISA sorted by opcode
* Appendix F Power ISA sorted by version
* Appendix G Power ISA sorted by Compliancy Subset
* Appendix H Power ISA sorted by mnemonic

| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
| SVM | I | # | 3.0B | svshape | REMAP shape instruction |
| SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
| SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |

[[!inline pages="openpower/sv/remap/appendix" raw=yes ]]

## REMAP pseudocode

Written in python3, the following stand-alone executable source code is
the Canonical Specification for each REMAP. Vectors of "loopends" are
returned when Rc=1, in Vectors of CR Fields on `sv.svstep.`, or, in
Vertical-First Mode, in a single CR Field (CR0) on `svstep.`. The
`SVSTATE.srcstep` or `SVSTATE.dststep` sequential offset is put through
each algorithm to determine the actual Element Offset. Alternative
implementations producing different orderings are prohibited, as software
will be critically relying on these Deterministic Schedules.

### REMAP 2D/3D Matrix

The following stand-alone executable source code is the Canonical
Specification for Matrix (2D/3D) REMAP. Hardware implementations are
achievable with simple cascading counter-and-compares.

```
# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    yd = SVSHAPE.lims[1]
    zd = SVSHAPE.lims[2]
    # create lists of indices to iterate over in each dimension
    x_r = list(range(xd))
    y_r = list(range(yd))
    z_r = list(range(zd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: x_r.reverse()
    if SVSHAPE.invxyz[1]: y_r.reverse()
    if SVSHAPE.invxyz[2]: z_r.reverse()
    # start an infinite (wrapping) loop
    step = 0 # track src/dst step
    while True:
        for z in z_r:       # loop over 1st order dimension
            z_end = z == z_r[-1]
            for y in y_r:   # loop over 2nd order dimension
                y_end = y == y_r[-1]
                for x in x_r:  # loop over 3rd order dimension
                    x_end = x == x_r[-1]
                    # ok work out which order to construct things in.
                    # start by creating a list of tuples of the dimension
                    # and its limit
                    vals = [(SVSHAPE.lims[0], x, "x"),
                            (SVSHAPE.lims[1], y, "y"),
                            (SVSHAPE.lims[2], z, "z")
                           ]
                    # now select those by order. this allows us to
                    # create schedules for [z][x], [x][y], or [y][z]
                    # for matrix multiply.
                    vals = [vals[SVSHAPE.order[0]],
                            vals[SVSHAPE.order[1]],
                            vals[SVSHAPE.order[2]]
                           ]
                    # ok now we can construct the result, using bits of
                    # "order" to say which ones get stacked on
                    result = 0
                    mult = 1
                    for i in range(3):
                        lim, idx, dbg = vals[i]
                        # some of the dimensions can be "skipped". the order
                        # was actually selected above on all 3 dimensions,
                        # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
                        # those to be knocked out
                        if SVSHAPE.skip == i+1: continue
                        idx *= mult   # shifts up by previous dimension(s)
                        result += idx # adds on this dimension
                        mult *= lim   # for the next dimension

                    loopends = (x_end |
                               ((y_end and x_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield result + SVSHAPE.offset, loopends
                    step += 1

def demo():
    # set the dimension sizes here
    xdim = 3
    ydim = 2
    zdim = 4

    # set total (can repeat, e.g. VL=x*y*z*4)
    VL = xdim * ydim * zdim

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b00
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
        if idx >= VL:
            break
        print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```

### REMAP Parallel Reduction pseudocode

The python3 program below is stand-alone executable and is the Canonical
Specification for Parallel Reduction REMAP.
The Algorithm below is not limited to RADIX2 sizes, and Predicate
sources, unlike in Matrix REMAP, apply to the Element Indices **after**
REMAP has been applied, not before. MV operations are not required: the
algorithm tracks positions of elements that would normally be moved, and
when applying an Element Reduction Operation sources the operands from
their last-known (tracked) position.

```
# a "yield" version of the Parallel Reduction REMAP algorithm.
# the algorithm is in-place. it does not perform "MV" operations.
# instead, where a masked-out value *should* be read from is tracked

def iterate_indices(SVSHAPE, pred=None):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    # create lists of indices to iterate over in each dimension
    ix = list(range(xd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: ix.reverse()
    # start a loop from the lowest step
    step = 1
    steps = []
    while step < xd:
        step *= 2
        steps.append(step)
    # invert the indices if needed
    if SVSHAPE.invxyz[1]: steps.reverse()
    for step in steps:
        stepend = (step == steps[-1]) # note end of steps
        idxs = list(range(0, xd, step))
        results = []
        for i in idxs:
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < xd else None
            other_pred = other < xd and (pred is None or pred[oi])
            if (pred is None or pred[ci]) and other_pred:
                if SVSHAPE.skip == 0b00:   # submode 00
                    result = ci
                elif SVSHAPE.skip == 0b01: # submode 01
                    result = oi
                results.append([result + SVSHAPE.offset, 0])
            elif other_pred:
                ix[i] = oi
        if results:
            results[-1][1] = (stepend<<1) | 1 # notify end of loops
        yield from results

def demo():
    # set the dimension sizes here
    xdim = 9

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, 0, 0]
    SVSHAPE0.order = [0,1,2]
    SVSHAPE0.mode = 0b10
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, 0, 0]
    SVSHAPE1.order = [0,1,2]
    SVSHAPE1.mode = 0b10
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    shapes = list(iterate_indices(SVSHAPE0)), \
             list(iterate_indices(SVSHAPE1))
    for idx in range(len(shapes[0])):
        l = shapes[0][idx]
        r = shapes[1][idx]
        (l_idx, lend) = l
        (r_idx, rend) = r
        print ("%d->%d:%d" % (idx, l_idx, r_idx),
               "end", bin(lend)[2:], bin(rend)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```

### REMAP FFT pseudocode

The FFT REMAP is RADIX2 only.

```
# a "yield" version of the REMAP algorithm, for FFT Cooley-Tukey schedules
# original code for the FFT Cooley-Tukey schedule:
# https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
"""
    # Radix-2 decimation-in-time FFT (real, not complex)
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                jh = j+halfsize
                jl = j
                temp1 = vec[jh] * exptable[k]
                temp2 = vec[jl]
                vec[jh] = temp2 - temp1
                vec[jl] = temp2 + temp1
                k += tablestep
        size *= 2
"""

# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    stride = SVSHAPE.lims[2] # stride-multiplier on reg access
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]: x_r.reverse()

    if len(x_r) == 0:
        return

    # start an infinite (wrapping) loop
    skip = 0
    while True:
        for size in x_r:    # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            tablestep = n // size
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:   # loop over 2nd order dimension
                y_end = i == y_r[-1]
                k_r = []
                j_r = []
                k = 0
                for j in range(i, i+halfsize):
                    k_r.append(k)
                    j_r.append(j)
                    k += tablestep
                # invert if requested
                if SVSHAPE.invxyz[2]: k_r.reverse()
                if SVSHAPE.invxyz[2]: j_r.reverse()
                for j, k in zip(j_r, k_r):  # loop over 1st order dimension
                    z_end = j == j_r[-1]
                    # now depending on MODE return the index
                    if SVSHAPE.skip == 0b00:
                        result = j              # for vec[j]
                    elif SVSHAPE.skip == 0b01:
                        result = j + halfsize   # for vec[j+halfsize]
                    elif SVSHAPE.skip == 0b10:
                        result = k              # for exptable[k]

                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends

def demo():
    # set the dimension sizes here
    xdim = 8
    ydim = 0 # not needed
    zdim = 1 # stride must be set to 1

    # set total. err don't know how to calculate how many there are...
    # do it manually for now
    VL = 0
    size = 2
    n = xdim
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            for j in range(i, i + halfsize):
                VL += 1
        size *= 2

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    # j schedule
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b01
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
    # j+halfstep schedule
    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, ydim, zdim]
    SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE1.mode = 0b01
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired
    # k schedule
    SVSHAPE2 = SVSHAPE()
    SVSHAPE2.lims = [xdim, ydim, zdim]
    SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE2.mode = 0b01
    SVSHAPE2.skip = 0b10
    SVSHAPE2.offset = 0       # experiment with different offset, here
    SVSHAPE2.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    schedule = []
    for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
                                          iterate_butterfly_indices(SVSHAPE1),
                                          iterate_butterfly_indices(SVSHAPE2))):
        if idx >= VL:
            break
        schedule.append((jl, jh, k))

    # ok now pretty-print the results, with some debug output
    size = 2
    idx = 0
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        print ("size %d halfsize %d tablestep %d" % \
               (size, halfsize, tablestep))
        for i in range(0, n, size):
            prefix = "i %d\t" % i
            k = 0
            for j in range(i, i + halfsize):
                (jl, je), (jh, he), (ks, ke) = schedule[idx]
                print ("  %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
                       "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
                       (idx, prefix, j, j+halfsize, k,
                        jl, jh, ks,
                       ),
                       "end", bin(je)[2:], bin(he)[2:], bin(ke)[2:])
                k += tablestep
                idx += 1
        size *= 2

# run the demo
if __name__ == '__main__':
    demo()
```

### DCT REMAP

DCT REMAP is RADIX2 only. Convolutions may be applied as usual
to create non-RADIX2 DCTs. Combined with appropriate Twin-butterfly
instructions the algorithm below (written in python3) becomes part
of an in-place in-registers Vectorized DCT. The algorithms work
by loading data such that, as the nested loops progress, the result
is sorted into correct sequential order.

778 ```
779 # DCT "REMAP" scheduler to create an in-place iterative DCT.
780 #
781
782 # bits of the integer 'val' of width 'width' are reversed
783 def reverse_bits(val, width):
784 result = 0
785 for _ in range(width):
786 result = (result << 1) | (val & 1)
787 val >>= 1
788 return result
789
790
791 # iterative version of [recursively-applied] half-reversing
792 # turns out this is Gray-Encoding.
793 def halfrev2(vec, pre_rev=True):
794 res = []
795 for i in range(len(vec)):
796 if pre_rev:
797 res.append(vec[i ^ (i>>1)])
798 else:
799 ri = i
800 bl = i.bit_length()
801 for ji in range(1, bl):
802 ri ^= (i >> ji)
803 res.append(vec[ri])
804 return res
805
806
807 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
808 # get indices to iterate over, in the required order
809 n = SVSHAPE.lims[0]
810 mode = SVSHAPE.lims[1]
811 stride = SVSHAPE.lims[2]
812
813 # reference list for not needing to do data-swaps, just swap what
814 # *indices* are referenced (two levels of indirection at the moment)
815 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
816 ji = list(range(n))
817
818 levels = n.bit_length() - 1
819 ri = [reverse_bits(i, levels) for i in range(n)]
820
821 if SVSHAPE.mode == 0b01: # FFT, bitrev only
822 ji = [ji[ri[i]] for i in range(n)]
823 elif SVSHAPE.submode2 == 0b001:
824 ji = [ji[ri[i]] for i in range(n)]
825 ji = halfrev2(ji, True)
826 else:
827 ji = halfrev2(ji, False)
828 ji = [ji[ri[i]] for i in range(n)]
829
830 # invert order if requested
831 if SVSHAPE.invxyz[0]:
832 ji.reverse()
833
834 for i, jl in enumerate(ji):
835 y_end = jl == ji[-1]
836 yield jl * stride, (0b111 if y_end else 0b000)
837
838 def iterate_dct_inner_costable_indices(SVSHAPE):
839 # get indices to iterate over, in the required order
840 n = SVSHAPE.lims[0]
841 mode = SVSHAPE.lims[1]
842 stride = SVSHAPE.lims[2]
843 # creating lists of indices to iterate over in each dimension
844 # has to be done dynamically, because it depends on the size
845 # first, the size-based loop (which can be done statically)
846 x_r = []
847 size = 2
848 while size <= n:
849 x_r.append(size)
850 size *= 2
851 # invert order if requested
852 if SVSHAPE.invxyz[0]:
853 x_r.reverse()
854
855 if len(x_r) == 0:
856 return
857
858 # start an infinite (wrapping) loop
859 skip = 0
860 z_end = 1 # doesn't exist in this, only 2 loops
861 k = 0
862 while True:
863 for size in x_r: # loop over 3rd order dimension (size)
864 x_end = size == x_r[-1]
865 # y_r schedule depends on size
866 halfsize = size // 2
867 y_r = []
868 for i in range(0, n, size):
869 y_r.append(i)
870 # invert if requested
871 if SVSHAPE.invxyz[1]: y_r.reverse()
872 # two lists of half-range indices, e.g. j 0123, jr 7654
873 j = list(range(0, halfsize))
874 # invert if requested
875 if SVSHAPE.invxyz[2]: j_r.reverse()
876 # loop over 1st order dimension
877 for ci, jl in enumerate(j):
878 y_end = jl == j[-1]
879 # now depending on MODE return the index. inner butterfly
880 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
881 result = k # offset into COS table
882 elif SVSHAPE.skip == 0b10: #
883 result = ci # coefficient helper
884 elif SVSHAPE.skip == 0b11: #
885 result = size # coefficient helper
886 loopends = (z_end |
887 ((y_end and z_end)<<1) |
888 ((y_end and x_end and z_end)<<2))
889
890 yield (result * stride) + SVSHAPE.offset, loopends
891 k += 1
892
def iterate_dct_inner_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    if len(x_r) == 0:
        return

    # reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 == 0b01:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = True
    if inplace_mode and SVSHAPE.submode2 == 0b01:
        ji = halfrev2(ji, True)
    if inplace_mode and SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k = 0
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:          # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # two lists of half-range indices, e.g. j 0123, jr 7654
                j = list(range(i, i + halfsize))
                jr = list(range(i + halfsize, i + size))
                jr.reverse()
                # invert if requested
                if SVSHAPE.invxyz[2]:
                    j.reverse()
                    jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                # loop over 1st order dimension
                k = k_start
                for ci, (jl, jh) in enumerate(zip(j, jr)):
                    z_end = jl == j[-1]
                    # now depending on MODE return the index. inner butterfly
                    if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl]]            # lower half
                        else:
                            result = ri[ji[jl]]            # lower half
                    elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl + halfsize]] # upper half
                        else:
                            result = ri[ji[jh]]            # upper half
                    elif mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else: # mode 2
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                                ((y_end and z_end) << 1) |
                                ((y_end and x_end and z_end) << 2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

                # now in-place swap
                if inplace_mode:
                    for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
                        jlh = jl + halfsize
                        tmp1, tmp2 = ji[jlh], ji[jh]
                        ji[jlh], ji[jh] = tmp2, tmp1

            # new k_start point for cos tables (runs inside x_r loop, NOT the i loop)
            k_start += halfsize

# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_dct_outer_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = n // 2
    while size >= 2:
        x_r.append(size)
        size //= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    if len(x_r) == 0:
        return

    # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 in [0b11, 0b01]:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
    if SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k = 0
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            halfsize = size // 2
            x_end = size == x_r[-1]
            y_r = list(range(0, halfsize))
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:          # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # one list to create iterative-sum schedule
                jr = list(range(i + halfsize, i + n - halfsize, size))
                # invert if requested
                if SVSHAPE.invxyz[2]: jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                k = k_start
                for ci, jh in enumerate(jr): # loop over 1st order dimension
                    z_end = jh == jr[-1]
                    if mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]]        # lower half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh + size]] # upper half
                            else:
                                result = ri[ji[jh + size]] # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else:
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]]        # lower half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh + size]] # upper half
                            else:
                                result = ri[ji[jh + size]] # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                                ((y_end and z_end) << 1) |
                                ((y_end and x_end and z_end) << 2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

            # new k_start point for cos tables (runs inside x_r loop, NOT the i loop)
            k_start += halfsize

```

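The *reverse-bit-order* referencing performed via the `ri` list in the schedules above can be illustrated in isolation. The sketch below is illustrative only: `reverse_bits` is reproduced here so the snippet is self-contained, and `n = 8` is an arbitrary example size.

```python
# Sketch of the reverse-bit-order referencing used by the DCT REMAP
# schedules: element i is accessed via ri[i], where ri is the
# bit-reversed permutation of 0..n-1.
def reverse_bits(x, n):
    """reverse the lowest n bits of x"""
    result = 0
    for i in range(n):
        if (x >> i) & 1:
            result |= 1 << (n - 1 - i)
    return result

n = 8                        # example size (a power of two)
levels = n.bit_length() - 1  # log2(n), as in the pseudocode above
ri = [reverse_bits(i, levels) for i in range(n)]
print(ri)                    # [0, 4, 2, 6, 1, 5, 3, 7]
```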
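Each generator above yields a `(element_offset, loopends)` pair, where `loopends` packs "last iteration" flags for the three nested loops. A hedged sketch of decoding that bitfield (the helper name `decode_loopends` is illustrative, not part of the specification):

```python
# Sketch: decode the 3-bit loopends value yielded alongside each element
# offset. bit 0: innermost (z) loop on its last step; bit 1: y loop also
# on its last step; bit 2: x loop also on its last step.
def decode_loopends(loopends):
    return {"z_end": bool(loopends & 1),
            "y_end": bool(loopends & 2),
            "x_end": bool(loopends & 4)}

# pack exactly as in the pseudocode, with all three loops at their ends
z_end, y_end, x_end = True, True, True
loopends = (z_end |
            ((y_end and z_end) << 1) |
            ((y_end and x_end and z_end) << 2))
print(loopends, decode_loopends(loopends))
```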
## REMAP selector

Selecting which REMAP Schedule to use is shown by the pseudocode below.
Each of SVSHAPE0-3 goes through this selection process independently.
```
if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
if SVSHAPEn.mode in [0b01, 0b11]:
    # further sub-selection
    if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
    if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
    if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
    if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
    if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore
```
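The decode above can be expressed as a small dispatch function. This is a hedged sketch: `select_iterate_fn` is a hypothetical helper that returns the *name* of the generator the selector would choose, so the snippet runs without the generators themselves being defined.

```python
# Illustrative dispatch mirroring the REMAP-selector pseudocode above.
# Returns the name of the chosen schedule generator for one SVSHAPE,
# or None for encodings the pseudocode leaves unallocated.
def select_iterate_fn(mode, ydimsz):
    if mode == 0b00:
        return "iterate_indices"
    if mode == 0b10:
        return "iterate_preduce_indices"
    if mode in [0b01, 0b11]:
        # further sub-selection on ydimsz
        if ydimsz == 1: return "iterate_butterfly_indices"
        if ydimsz == 2: return "iterate_dct_inner_butterfly_indices"
        if ydimsz == 3: return "iterate_dct_outer_butterfly_indices"
        if ydimsz in [5, 13]: return "iterate_dct_inner_costable_indices"
        if ydimsz in [6, 14, 15]: return "iterate_dct_inner_halfswap_loadstore"
    return None

print(select_iterate_fn(0b01, 2))  # -> iterate_dct_inner_butterfly_indices
```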


[[!tag opf_rfc]]