openpower/sv/rfc/ls009.mdwn

   1 # RFC ls009 Simple-V REMAP Subsystem
   2
   3 **URLs**:
   4
   5 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
   7 * <https://git.openpower.foundation/isa/PowerISA/issues/124>
   8
   9 **Severity**: Major
  10
  11 **Status**: New
  12
  13 **Date**: 26 Mar 2023
  14
  15 **Target**: v3.2B
  16
  17 **Source**: v3.0B
  18
  19 **Books and Section affected**:
  20
  21 ```
  22     Book I, new Zero-Overhead-Loop Chapter.
  23     Appendix E Power ISA sorted by opcode
  24     Appendix F Power ISA sorted by version
  25     Appendix G Power ISA sorted by Compliancy Subset
  26     Appendix H Power ISA sorted by mnemonic
  27 ```
  28
  29 **Summary**
  30
  31 ```
  32     svremap  - Re-Mapping of Register Element Offsets
  33     svindex  - General-purpose setting of SHAPEs to be re-mapped
  34     svshape  - Hardware-level setting of SHAPEs for element re-mapping
  35     svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
  36 ```
  37
  38 **Submitter**: Luke Leighton (Libre-SOC)
  39
  40 **Requester**: Libre-SOC
  41
  42 **Impact on processor**:
  43
  44 ```
  45     Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
  46     Management Instructions which provide advanced features such as Matrix
  47     FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
  48 ```
  49
  50 **Impact on software**:
  51
  52 ```
  53     Requires support for new instructions in assembler, debuggers,
  54     and related tools.
  55 ```
  56
  57 **Keywords**:
  58
  59 ```
  60     Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
  61     Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
  62     Digital Signal Processing (DSP)
  63 ```
  64
  65 **Motivation**
  66
  67 These REMAP Management instructions provide state-of-the-art advanced capabilities
  68 to dramatically decrease instruction count and power reduction whilst retaining
  69 unprecedented general-purpose capability and a standard Sequential Execution Model.
  70
  71 **Notes and Observations**:
  72
  73 1. TODO
  74
  75 **Changes**
  76
  77 Add the following entries to:
  78
  79 * the Appendices of SV Book
  80 * Instructions of SV Book as a new Section
  81 * SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
  82
  83 ----------------
  84
  85 \newpage{}
  86
  87 [[!inline pages="openpower/sv/remap" raw=yes ]]
  88
  89 # Forms
  90
  91 Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
  92
  93 Add the following to Book I, 1.6.1, SVI-Form
  94
  95 ```
  96     |0     |6    |11    |16   |21 |23  |24|25|26    31|
  97     | PO   |  SVG|rmm   | SVd |ew |SVyx|mm|sk|   XO   |
  98 ```
  99
 100 Add the following to Book I, 1.6.1, SVM-Form
 101
 102 ```
 103     |0     |6        |11      |16    |21    |25 |26    |31  |
 104     | PO   |  SVxd   |   SVyd | SVzd | SVrm |vf |   XO      |
 105 ```
 106
 107 Add the following to Book I, 1.6.1, SVM2-Form
 108
 109 ```
 110     |0     |6     |10  |11      |16    |21 |24|25 |26    |31  |
 111     | PO   | SVo  |SVyx|   rmm  | SVd  |XO |mm|sk |   XO      |
 112 ```
 113
 114 Add the following to Book I, 1.6.1, SVRM-Form
 115
 116 ```
 117     |0     |6     |11  |13   |15   |17   |19   |21  |22   |26     |31 |
 118     | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |///  | XO        |
 119 ```
 120
 121 Add the following to Book I, 1.6.2
 122
 123 ```
 124     mi0 (11:12)
 125         Field used in REMAP to select the SVSHAPE for 1st input register
 126         Formats: SVRM
 127     mi1 (13:14)
 128         Field used in REMAP to select the SVSHAPE for 2nd input register
 129         Formats: SVRM
 130     mi2 (15:16)
 131         Field used in REMAP to select the SVSHAPE for 3rd input register
 132         Formats: SVRM
 133     mm (24)
 134         Field used to specify the meaning of the rmm field for SVI-Form
 135         and SVM2-Form
 136         Formats: SVI, SVM2
 137     mo0 (17:18)
 138         Field used in REMAP to select the SVSHAPE for 1st output register
 139         Formats: SVRM
 140     mo1 (19:20)
 141         Field used in REMAP to select the SVSHAPE for 2nd output register
 142         Formats: SVRM
 143     pst (21)
 144         Field used in REMAP to indicate "persistence" mode (REMAP
 145         continues to apply to multiple instructions)
 146         Formats: SVRM
 147     rmm (11:15)
 148         REMAP Mode field for SVI-Form and SVM2-Form
 149         Formats: SVI, SVM2
 150     sk (25)
 151         Field used to specify dimensional skipping in svindex
 152         Formats: SVI, SVM2
 153     SVd (16:20)
 154         Immediate field used to specify the size of the REMAP dimension
 155         in the svindex and svshape2 instructions
 156         Formats: SVI, SVM2
 157     SVDS (16:29)
 158         Immediate field used to specify a 9-bit signed
 159         two's complement integer which is concatenated
 160         on the right with 0b00 and sign-extended to 64 bits.
 161         Formats: SVDS
 162     SVG (6:10)
 163         Field used to specify a GPR to be used as a
 164         source for indexing.
 165         Formats: SVI
 166     SVi (16:22)
 167          Simple-V immediate field for setting VL or MVL
 168          Formats: SVL
 169     SVme (6:10)
 170          Simple-V "REMAP" map-enable bits (0-4)
 171          Formats: SVRM
 172     SVo (6:9)
 173         Field used by the svshape2 instruction as an offset
 174         Formats: SVM2
 175     SVrm (21:24)
 176          Simple-V "REMAP" Mode
 177          Formats: SVM
 178     SVxd (6:10)
 179          Simple-V "REMAP" x-dimension size
 180          Formats: SVM
 181     SVyd (11:15)
 182          Simple-V "REMAP" y-dimension size
 183          Formats: SVM
 184     SVzd (16:20)
 185          Simple-V "REMAP" z-dimension size
 186          Formats: SVM
 187     XO (21:23,26:31)
 188         Extended opcode field.  Note that bit 21 must be 1, 22 and 23
 189         must be zero, and bits 26-31 must be exactly the same as
 190         used for svshape.
 191         Formats: SVM2
 192 ```
 193
 194 # Appendices
 195
 196     Appendix E Power ISA sorted by opcode
 197     Appendix F Power ISA sorted by version
 198     Appendix G Power ISA sorted by Compliancy Subset
 199     Appendix H Power ISA sorted by mnemonic
 200
 201 | Form | Book | Page | Version | mnemonic | Description |
 202 |------|------|------|---------|----------|-------------|
 203 | SVRM | I    | #    | 3.0B    | svremap    | REMAP enabling instruction |
 204 | SVM  | I    | #    | 3.0B    | svshape  | REMAP shape instruction |
 205 | SVM2 | I    | #    | 3.0B    | svshape2 | REMAP shape instruction (2) |
 206 | SVI  | I    | #    | 3.0B    | svindex | REMAP General-purpose Indexing |
 207
 208 [[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
 209
 210 ## REMAP pseudocode
 211
 212 Written in python3 the following stand-alone executable source code is the Canonical
 213 Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
 214 in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
 215 a single CR Field (CR0) on `svstep.`.  The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
 216 offset is put through each algorithm to determine the actual Element Offset.
 217 Alternative implementations producing different ordering
 218 is prohibited as software will be critically relying on these Deterministic Schedules.
 219
 220 ### REMAP 2D/3D Matrix
 221
 222 The following stand-alone executable source code is the Canonical
 223 Specification for Matrix (2D/3D) REMAP.
 224 Hardware implementations are achievable with simple cascading counter-and-compares.
 225
 226 ```
 227 # python "yield" can be iterated. use this to make it clear how
 228 # the indices are generated by using natural-looking nested loops
 229 def iterate_indices(SVSHAPE):
 230     # get indices to iterate over, in the required order
 231     xd = SVSHAPE.lims[0]
 232     yd = SVSHAPE.lims[1]
 233     zd = SVSHAPE.lims[2]
 234     # create lists of indices to iterate over in each dimension
 235     x_r = list(range(xd))
 236     y_r = list(range(yd))
 237     z_r = list(range(zd))
 238     # invert the indices if needed
 239     if SVSHAPE.invxyz[0]: x_r.reverse()
 240     if SVSHAPE.invxyz[1]: y_r.reverse()
 241     if SVSHAPE.invxyz[2]: z_r.reverse()
 242     # start an infinite (wrapping) loop
 243     step = 0 # track src/dst step
 244     while True:
 245         for z in z_r:   # loop over 1st order dimension
 246             z_end = z == z_r[-1]
 247             for y in y_r:       # loop over 2nd order dimension
 248                 y_end = y == y_r[-1]
 249                 for x in x_r:           # loop over 3rd order dimension
 250                     x_end = x == x_r[-1]
 251                     # ok work out which order to construct things in.
 252                     # start by creating a list of tuples of the dimension
 253                     # and its limit
 254                     vals = [(SVSHAPE.lims[0], x, "x"),
 255                             (SVSHAPE.lims[1], y, "y"),
 256                             (SVSHAPE.lims[2], z, "z")
 257                            ]
 258                     # now select those by order.  this allows us to
 259                     # create schedules for [z][x], [x][y], or [y][z]
 260                     # for matrix multiply.
 261                     vals = [vals[SVSHAPE.order[0]],
 262                             vals[SVSHAPE.order[1]],
 263                             vals[SVSHAPE.order[2]]
 264                            ]
 265                     # ok now we can construct the result, using bits of
 266                     # "order" to say which ones get stacked on
 267                     result = 0
 268                     mult = 1
 269                     for i in range(3):
 270                         lim, idx, dbg = vals[i]
 271                         # some of the dimensions can be "skipped".  the order
 272                         # was actually selected above on all 3 dimensions,
 273                         # e.g. [z][x][y] or [y][z][x].  "skip" allows one of
 274                         # those to be knocked out
 275                         if SVSHAPE.skip == i+1: continue
 276                         idx *= mult   # shifts up by previous dimension(s)
 277                         result += idx # adds on this dimension
 278                         mult *= lim   # for the next dimension
 279
 280                     loopends = (x_end |
 281                                ((y_end and x_end)<<1) |
 282                                 ((y_end and x_end and z_end)<<2))
 283
 284                     yield result + SVSHAPE.offset, loopends
 285                     step += 1
 286
 287 def demo():
 288     # set the dimension sizes here
 289     xdim = 3
 290     ydim = 2
 291     zdim = 4
 292
 293     # set total (can repeat, e.g. VL=x*y*z*4)
 294     VL = xdim * ydim * zdim
 295
 296     # set up an SVSHAPE
 297     class SVSHAPE:
 298         pass
 299     SVSHAPE0 = SVSHAPE()
 300     SVSHAPE0.lims = [xdim, ydim, zdim]
 301     SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
 302     SVSHAPE0.mode = 0b00
 303     SVSHAPE0.skip = 0b00
 304     SVSHAPE0.offset = 0       # experiment with different offset, here
 305     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 306
 307     # enumerate over the iterator function, getting new indices
 308     for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
 309         if idx >= VL:
 310             break
 311         print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
 312
 313 # run the demo
 314 if __name__ == '__main__':
 315     demo()
 316 ```
 317
 318 ### REMAP Parallel Reduction pseudocode
 319
 320 The python3 program below is stand-alone executable and is the Canonical Specification
 321 for Parallel Reduction REMAP.
 322 The Algorithm below is not limited to RADIX2 sizes, and Predicate
 323 sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
 324 has been applied, not before.  MV operations are not required: the algorithm
 325 tracks positions of elements that would normally be moved and when applying
 326 an Element Reduction Operation sources the operands from their last-known (tracked)
 327 position.
 328
 329 ```
 330 # a "yield" version of the Parallel Reduction REMAP algorithm.
 331 # the algorithm is in-place. it does not perform "MV" operations.
 332 # instead, where a masked-out value *should* be read from is tracked
 333
 334 def iterate_indices(SVSHAPE, pred=None):
 335     # get indices to iterate over, in the required order
 336     xd = SVSHAPE.lims[0]
 337     # create lists of indices to iterate over in each dimension
 338     ix = list(range(xd))
 339     # invert the indices if needed
 340     if SVSHAPE.invxyz[0]: ix.reverse()
 341     # start a loop from the lowest step
 342     step = 1
 343     steps = []
 344     while step < xd:
 345         step *= 2
 346         steps.append(step)
 347     # invert the indices if needed
 348     if SVSHAPE.invxyz[1]: steps.reverse()
 349     for step in steps:
 350         stepend = (step == steps[-1]) # note end of steps
 351         idxs = list(range(0, xd, step))
 352         results = []
 353         for i in idxs:
 354             other = i + step // 2
 355             ci = ix[i]
 356             oi = ix[other] if other < xd else None
 357             other_pred = other < xd and (pred is None or pred[oi])
 358             if (pred is None or pred[ci]) and other_pred:
 359                 if SVSHAPE.skip == 0b00: # submode 00
 360                     result = ci
 361                 elif SVSHAPE.skip == 0b01: # submode 01
 362                     result = oi
 363                 results.append([result + SVSHAPE.offset, 0])
 364             elif other_pred:
 365                 ix[i] = oi
 366         if results:
 367             results[-1][1] = (stepend<<1) | 1  # notify end of loops
 368         yield from results
 369
 370 def demo():
 371     # set the dimension sizes here
 372     xdim = 9
 373
 374     # set up an SVSHAPE
 375     class SVSHAPE:
 376         pass
 377     SVSHAPE0 = SVSHAPE()
 378     SVSHAPE0.lims = [xdim, 0, 0]
 379     SVSHAPE0.order = [0,1,2]
 380     SVSHAPE0.mode = 0b10
 381     SVSHAPE0.skip = 0b00
 382     SVSHAPE0.offset = 0       # experiment with different offset, here
 383     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 384
 385     SVSHAPE1 = SVSHAPE()
 386     SVSHAPE1.lims = [xdim, 0, 0]
 387     SVSHAPE1.order = [0,1,2]
 388     SVSHAPE1.mode = 0b10
 389     SVSHAPE1.skip = 0b01
 390     SVSHAPE1.offset = 0       # experiment with different offset, here
 391     SVSHAPE1.invxyz = [0,0,0] # inversion if desired
 392
 393     # enumerate over the iterator function, getting new indices
 394     shapes = list(iterate_indices(SVSHAPE0)), \
 395               list(iterate_indices(SVSHAPE1))
 396     for idx in range(len(shapes[0])):
 397         l = shapes[0][idx]
 398         r = shapes[1][idx]
 399         (l_idx, lend) = l
 400         (r_idx, rend) = r
 401         print ("%d->%d:%d" % (idx, l_idx, r_idx),
 402                "end", bin(lend)[2:], bin(rend)[2:])
 403
 404 # run the demo
 405 if __name__ == '__main__':
 406     demo()
 407 ```
 408
 409 ### REMAP FFT pseudocode
 410
 411 The FFT REMAP is RADIX2 only.
 412
 413 ```
 414 # a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
 415 # original code for the FFT Tukey-Cooley schedule:
 416 # https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
 417 """
 418     # Radix-2 decimation-in-time FFT (real, not complex)
 419     size = 2
 420     while size <= n:
 421         halfsize = size // 2
 422         tablestep = n // size
 423         for i in range(0, n, size):
 424             k = 0
 425             for j in range(i, i + halfsize):
 426                 jh = j+halfsize
 427                 jl = j
 428                 temp1 = vec[jh] * exptable[k]
 429                 temp2 = vec[jl]
 430                 vec[jh] = temp2 - temp1
 431                 vec[jl] = temp2 + temp1
 432                 k += tablestep
 433         size *= 2
 434 """
 435
 436 # python "yield" can be iterated. use this to make it clear how
 437 # the indices are generated by using natural-looking nested loops
 438 def iterate_butterfly_indices(SVSHAPE):
 439     # get indices to iterate over, in the required order
 440     n = SVSHAPE.lims[0]
 441     stride = SVSHAPE.lims[2] # stride-multiplier on reg access
 442     # creating lists of indices to iterate over in each dimension
 443     # has to be done dynamically, because it depends on the size
 444     # first, the size-based loop (which can be done statically)
 445     x_r = []
 446     size = 2
 447     while size <= n:
 448         x_r.append(size)
 449         size *= 2
 450     # invert order if requested
 451     if SVSHAPE.invxyz[0]: x_r.reverse()
 452
 453     if len(x_r) == 0:
 454         return
 455
 456     # start an infinite (wrapping) loop
 457     skip = 0
 458     while True:
 459         for size in x_r:           # loop over 3rd order dimension (size)
 460             x_end = size == x_r[-1]
 461             # y_r schedule depends on size
 462             halfsize = size // 2
 463             tablestep = n // size
 464             y_r = []
 465             for i in range(0, n, size):
 466                 y_r.append(i)
 467             # invert if requested
 468             if SVSHAPE.invxyz[1]: y_r.reverse()
 469             for i in y_r:       # loop over 2nd order dimension
 470                 y_end = i == y_r[-1]
 471                 k_r = []
 472                 j_r = []
 473                 k = 0
 474                 for j in range(i, i+halfsize):
 475                     k_r.append(k)
 476                     j_r.append(j)
 477                     k += tablestep
 478                 # invert if requested
 479                 if SVSHAPE.invxyz[2]: k_r.reverse()
 480                 if SVSHAPE.invxyz[2]: j_r.reverse()
 481                 for j, k in zip(j_r, k_r):   # loop over 1st order dimension
 482                     z_end = j == j_r[-1]
 483                     # now depending on MODE return the index
 484                     if SVSHAPE.skip == 0b00:
 485                         result = j              # for vec[j]
 486                     elif SVSHAPE.skip == 0b01:
 487                         result = j + halfsize   # for vec[j+halfsize]
 488                     elif SVSHAPE.skip == 0b10:
 489                         result = k              # for exptable[k]
 490
 491                     loopends = (z_end |
 492                                ((y_end and z_end)<<1) |
 493                                 ((y_end and x_end and z_end)<<2))
 494
 495                     yield (result * stride) + SVSHAPE.offset, loopends
 496
 497 def demo():
 498     # set the dimension sizes here
 499     xdim = 8
 500     ydim = 0 # not needed
 501     zdim = 1 # stride must be set to 1
 502
 503     # set total. err don't know how to calculate how many there are...
 504     # do it manually for now
 505     VL = 0
 506     size = 2
 507     n = xdim
 508     while size <= n:
 509         halfsize = size // 2
 510         tablestep = n // size
 511         for i in range(0, n, size):
 512             for j in range(i, i + halfsize):
 513                 VL += 1
 514         size *= 2
 515
 516     # set up an SVSHAPE
 517     class SVSHAPE:
 518         pass
 519     # j schedule
 520     SVSHAPE0 = SVSHAPE()
 521     SVSHAPE0.lims = [xdim, ydim, zdim]
 522     SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
 523     SVSHAPE0.mode = 0b01
 524     SVSHAPE0.skip = 0b00
 525     SVSHAPE0.offset = 0       # experiment with different offset, here
 526     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 527     # j+halfstep schedule
 528     SVSHAPE1 = SVSHAPE()
 529     SVSHAPE1.lims = [xdim, ydim, zdim]
 530     SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
 531     SVSHAPE0.mode = 0b01
 532     SVSHAPE1.skip = 0b01
 533     SVSHAPE1.offset = 0       # experiment with different offset, here
 534     SVSHAPE1.invxyz = [0,0,0] # inversion if desired
 535     # k schedule
 536     SVSHAPE2 = SVSHAPE()
 537     SVSHAPE2.lims = [xdim, ydim, zdim]
 538     SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
 539     SVSHAPE0.mode = 0b01
 540     SVSHAPE2.skip = 0b10
 541     SVSHAPE2.offset = 0       # experiment with different offset, here
 542     SVSHAPE2.invxyz = [0,0,0] # inversion if desired
 543
 544     # enumerate over the iterator function, getting new indices
 545     schedule = []
 546     for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
 547                                           iterate_butterfly_indices(SVSHAPE1),
 548                                           iterate_butterfly_indices(SVSHAPE2))):
 549         if idx >= VL:
 550             break
 551         schedule.append((jl, jh, k))
 552
 553     # ok now pretty-print the results, with some debug output
 554     size = 2
 555     idx = 0
 556     while size <= n:
 557         halfsize = size // 2
 558         tablestep = n // size
 559         print ("size %d halfsize %d tablestep %d" % \
 560                 (size, halfsize, tablestep))
 561         for i in range(0, n, size):
 562             prefix = "i %d\t" % i
 563             k = 0
 564             for j in range(i, i + halfsize):
 565                 (jl, je), (jh, he), (ks, ke) = schedule[idx]
 566                 print ("  %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
 567                         "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
 568                                 (idx, prefix, j, j+halfsize, k,
 569                                       jl, jh, ks,
 570                                 ),
 571                                 "end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
 572                 k += tablestep
 573                 idx += 1
 574         size *= 2
 575
 576 # run the demo
 577 if __name__ == '__main__':
 578     demo()
 579 ```
 580
 581 ### DCT REMAP
 582
 583 DCT REMAP is RADIX2 only.  Convolutions may be applied as usual
 584 to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
 585 instructions the algorithm below (written in python3), becomes part
 586 of an in-place in-registers Vectorised DCT.  The algorithms work
 587 by loading data such that as the nested loops progress the result
 588 is sorted into correct sequential order.
 589
 590 ```
 591 # DCT "REMAP" scheduler to create an in-place iterative DCT.
 592 #
 593
 594 # bits of the integer 'val' of width 'width' are reversed
 595 def reverse_bits(val, width):
 596     result = 0
 597     for _ in range(width):
 598         result = (result << 1) | (val & 1)
 599         val >>= 1
 600     return result
 601
 602
 603 # iterative version of [recursively-applied] half-reversing
 604 # turns out this is Gray-Encoding.
 605 def halfrev2(vec, pre_rev=True):
 606     res = []
 607     for i in range(len(vec)):
 608         if pre_rev:
 609             res.append(vec[i ^ (i>>1)])
 610         else:
 611             ri = i
 612             bl = i.bit_length()
 613             for ji in range(1, bl):
 614                 ri ^= (i >> ji)
 615             res.append(vec[ri])
 616     return res
 617
 618
 619 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
 620     # get indices to iterate over, in the required order
 621     n = SVSHAPE.lims[0]
 622     mode = SVSHAPE.lims[1]
 623     stride = SVSHAPE.lims[2]
 624
 625     # reference list for not needing to do data-swaps, just swap what
 626     # *indices* are referenced (two levels of indirection at the moment)
 627     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 628     ji = list(range(n))
 629
 630     levels = n.bit_length() - 1
 631     ri = [reverse_bits(i, levels) for i in range(n)]
 632
 633     if SVSHAPE.mode == 0b01: # FFT, bitrev only
 634         ji = [ji[ri[i]] for i in range(n)]
 635     elif SVSHAPE.submode2 == 0b001:
 636         ji = [ji[ri[i]] for i in range(n)]
 637         ji = halfrev2(ji, True)
 638     else:
 639         ji = halfrev2(ji, False)
 640         ji = [ji[ri[i]] for i in range(n)]
 641
 642     # invert order if requested
 643     if SVSHAPE.invxyz[0]:
 644         ji.reverse()
 645
 646     for i, jl in enumerate(ji):
 647         y_end = jl == ji[-1]
 648         yield jl * stride, (0b111 if y_end else 0b000)
 649
 650 def iterate_dct_inner_costable_indices(SVSHAPE):
 651     # get indices to iterate over, in the required order
 652     n = SVSHAPE.lims[0]
 653     mode = SVSHAPE.lims[1]
 654     stride = SVSHAPE.lims[2]
 655     # creating lists of indices to iterate over in each dimension
 656     # has to be done dynamically, because it depends on the size
 657     # first, the size-based loop (which can be done statically)
 658     x_r = []
 659     size = 2
 660     while size <= n:
 661         x_r.append(size)
 662         size *= 2
 663     # invert order if requested
 664     if SVSHAPE.invxyz[0]:
 665         x_r.reverse()
 666
 667     if len(x_r) == 0:
 668         return
 669
 670     # start an infinite (wrapping) loop
 671     skip = 0
 672     z_end = 1 # doesn't exist in this, only 2 loops
 673     k = 0
 674     while True:
 675         for size in x_r:           # loop over 3rd order dimension (size)
 676             x_end = size == x_r[-1]
 677             # y_r schedule depends on size
 678             halfsize = size // 2
 679             y_r = []
 680             for i in range(0, n, size):
 681                 y_r.append(i)
 682             # invert if requested
 683             if SVSHAPE.invxyz[1]: y_r.reverse()
 684             # two lists of half-range indices, e.g. j 0123, jr 7654
 685             j = list(range(0, halfsize))
 686             # invert if requested
 687             if SVSHAPE.invxyz[2]: j_r.reverse()
 688             # loop over 1st order dimension
 689             for ci, jl in enumerate(j):
 690                 y_end = jl == j[-1]
 691                 # now depending on MODE return the index.  inner butterfly
 692                 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 693                     result = k  # offset into COS table
 694                 elif SVSHAPE.skip == 0b10: #
 695                     result = ci # coefficient helper
 696                 elif SVSHAPE.skip == 0b11: #
 697                     result = size # coefficient helper
 698                 loopends = (z_end |
 699                            ((y_end and z_end)<<1) |
 700                             ((y_end and x_end and z_end)<<2))
 701
 702                 yield (result * stride) + SVSHAPE.offset, loopends
 703                 k += 1
 704
 705 def iterate_dct_inner_butterfly_indices(SVSHAPE):
 706     # get indices to iterate over, in the required order
 707     n = SVSHAPE.lims[0]
 708     mode = SVSHAPE.lims[1]
 709     stride = SVSHAPE.lims[2]
 710     # creating lists of indices to iterate over in each dimension
 711     # has to be done dynamically, because it depends on the size
 712     # first, the size-based loop (which can be done statically)
 713     x_r = []
 714     size = 2
 715     while size <= n:
 716         x_r.append(size)
 717         size *= 2
 718     # invert order if requested
 719     if SVSHAPE.invxyz[0]:
 720         x_r.reverse()
 721
 722     if len(x_r) == 0:
 723         return
 724
 725     # reference (read/write) the in-place data in *reverse-bit-order*
 726     ri = list(range(n))
 727     if SVSHAPE.submode2 == 0b01:
 728         levels = n.bit_length() - 1
 729         ri = [ri[reverse_bits(i, levels)] for i in range(n)]
 730
 731     # reference list for not needing to do data-swaps, just swap what
 732     # *indices* are referenced (two levels of indirection at the moment)
 733     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 734     ji = list(range(n))
 735     inplace_mode = True
 736     if inplace_mode and SVSHAPE.submode2 == 0b01:
 737         ji = halfrev2(ji, True)
 738     if inplace_mode and SVSHAPE.submode2 == 0b11:
 739         ji = halfrev2(ji, False)
 740
 741     # start an infinite (wrapping) loop
 742     while True:
 743         k = 0
 744         k_start = 0
 745         for size in x_r:           # loop over 3rd order dimension (size)
 746             x_end = size == x_r[-1]
 747             # y_r schedule depends on size
 748             halfsize = size // 2
 749             y_r = []
 750             for i in range(0, n, size):
 751                 y_r.append(i)
 752             # invert if requested
 753             if SVSHAPE.invxyz[1]: y_r.reverse()
 754             for i in y_r:       # loop over 2nd order dimension
 755                 y_end = i == y_r[-1]
 756                 # two lists of half-range indices, e.g. j 0123, jr 7654
 757                 j = list(range(i, i + halfsize))
 758                 jr = list(range(i+halfsize, i + size))
 759                 jr.reverse()
 760                 # invert if requested
 761                 if SVSHAPE.invxyz[2]:
 762                     j.reverse()
 763                     jr.reverse()
 764                 hz2 = halfsize // 2 # zero stops reversing 1-item lists
 765                 # loop over 1st order dimension
 766                 k = k_start
 767                 for ci, (jl, jh) in enumerate(zip(j, jr)):
 768                     z_end = jl == j[-1]
 769                     # now depending on MODE return the index.  inner butterfly
 770                     if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 771                         if SVSHAPE.submode2 == 0b11: # iDCT
 772                             result = ji[ri[jl]]        # lower half
 773                         else:
 774                             result = ri[ji[jl]]        # lower half
 775                     elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
 776                         if SVSHAPE.submode2 == 0b11: # iDCT
 777                             result = ji[ri[jl+halfsize]] # upper half
 778                         else:
 779                             result = ri[ji[jh]] # upper half
 780                     elif mode == 4:
 781                         # COS table pre-generated mode
 782                         if SVSHAPE.skip == 0b10: #
 783                             result = k # cos table offset
 784                     else: # mode 2
 785                         # COS table generated on-demand ("Vertical-First") mode
 786                         if SVSHAPE.skip == 0b10: #
 787                             result = ci # coefficient helper
 788                         elif SVSHAPE.skip == 0b11: #
 789                             result = size # coefficient helper
 790                     loopends = (z_end |
 791                                ((y_end and z_end)<<1) |
 792                                 ((y_end and x_end and z_end)<<2))
 793
 794                     yield (result * stride) + SVSHAPE.offset, loopends
 795                     k += 1
 796
 797                 # now in-place swap
 798                 if inplace_mode:
 799                     for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
 800                         jlh = jl+halfsize
 801                         tmp1, tmp2 = ji[jlh], ji[jh]
 802                         ji[jlh], ji[jh] = tmp2, tmp1
 803
 804             # new k_start point for cos tables( runs inside x_r loop NOT i loop)
 805             k_start += halfsize
 806
 807
 808 # python "yield" can be iterated. use this to make it clear how
 809 # the indices are generated by using natural-looking nested loops
 810 def iterate_dct_outer_butterfly_indices(SVSHAPE):
 811     # get indices to iterate over, in the required order
 812     n = SVSHAPE.lims[0]
 813     mode = SVSHAPE.lims[1]
 814     stride = SVSHAPE.lims[2]
 815     # creating lists of indices to iterate over in each dimension
 816     # has to be done dynamically, because it depends on the size
 817     # first, the size-based loop (which can be done statically)
 818     x_r = []
 819     size = n // 2
 820     while size >= 2:
 821         x_r.append(size)
 822         size //= 2
 823     # invert order if requested
 824     if SVSHAPE.invxyz[0]:
 825         x_r.reverse()
 826
 827     if len(x_r) == 0:
 828         return
 829
 830     # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
 831     ri = list(range(n))
 832     if SVSHAPE.submode2 in [0b11, 0b01]:
 833         levels = n.bit_length() - 1
 834         ri = [ri[reverse_bits(i, levels)] for i in range(n)]
 835
 836     # reference list for not needing to do data-swaps, just swap what
 837     # *indices* are referenced (two levels of indirection at the moment)
 838     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 839     ji = list(range(n))
 840     inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
 841     if SVSHAPE.submode2 == 0b11:
 842         ji = halfrev2(ji, False)
 843
 844     # start an infinite (wrapping) loop
 845     while True:
 846         k = 0
 847         k_start = 0
 848         for size in x_r:           # loop over 3rd order dimension (size)
 849             halfsize = size//2
 850             x_end = size == x_r[-1]
 851             y_r = list(range(0, halfsize))
 852             # invert if requested
 853             if SVSHAPE.invxyz[1]: y_r.reverse()
 854             for i in y_r:       # loop over 2nd order dimension
 855                 y_end = i == y_r[-1]
 856                 # one list to create iterative-sum schedule
 857                 jr = list(range(i+halfsize, i+n-halfsize, size))
 858                 # invert if requested
 859                 if SVSHAPE.invxyz[2]: jr.reverse()
 860                 hz2 = halfsize // 2 # zero stops reversing 1-item lists
 861                 k = k_start
 862                 for ci, jh in enumerate(jr):   # loop over 1st order dimension
 863                     z_end = jh == jr[-1]
 864                     if mode == 4:
 865                         # COS table pre-generated mode
 866                         if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 867                             if SVSHAPE.submode2 == 0b11: # iDCT
 868                                 result = ji[ri[jh]] # upper half
 869                             else:
 870                                 result = ri[ji[jh]]        # lower half
 871                         elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
 872                             if SVSHAPE.submode2 == 0b11: # iDCT
 873                                 result = ji[ri[jh+size]] # upper half
 874                             else:
 875                                 result = ri[ji[jh+size]] # upper half
 876                         elif SVSHAPE.skip == 0b10: #
 877                             result = k # cos table offset
 878                     else:
 879                         # COS table generated on-demand ("Vertical-First") mode
 880                         if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 881                             if SVSHAPE.submode2 == 0b11: # iDCT
 882                                 result = ji[ri[jh]]        # lower half
 883                             else:
 884                                 result = ri[ji[jh]]        # lower half
 885                         elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
 886                             if SVSHAPE.submode2 == 0b11: # iDCT
 887                                 result = ji[ri[jh+size]] # upper half
 888                             else:
 889                                 result = ri[ji[jh+size]] # upper half
 890                         elif SVSHAPE.skip == 0b10: #
 891                             result = ci # coefficient helper
 892                         elif SVSHAPE.skip == 0b11: #
 893                             result = size # coefficient helper
 894                     loopends = (z_end |
 895                                ((y_end and z_end)<<1) |
 896                                 ((y_end and x_end and z_end)<<2))
 897
 898                     yield (result * stride) + SVSHAPE.offset, loopends
 899                     k += 1
 900
 901             # new k_start point for cos tables( runs inside x_r loop NOT i loop)
 902             k_start += halfsize
 903
 904 ```
 905
 906 ## REMAP selector
 907
 908 Selecting which REMAP Schedule to use is shown by the pseudocode below.
 909 Each SVSHAPE 0-3 goes through this selection process.
 910
 911 ```
 912   if SVSHAPEn.mode == 0b00:             iterate_fn = iterate_indices
 913   if SVSHAPEn.mode == 0b10:             iterate_fn = iterate_preduce_indices
 914   if SVSHAPEn.mode in [0b01, 0b11]:
 915     # further sub-selection
 916     if SVSHAPEn.ydimsz == 1:            iterate_fn = iterate_butterfly_indices
 917     if SVSHAPEn.ydimsz == 2:            iterate_fn = iterate_dct_inner_butterfly_indices
 918     if SVSHAPEn.ydimsz == 3:            iterate_fn = iterate_dct_outer_butterfly_indices
 919     if SVSHAPEn.ydimsz in [5, 13]:      iterate_fn = iterate_dct_inner_costable_indices
 920     if SVSHAPEn.ydimsz in [6, 14, 15]:  iterate_fn = iterate_dct_inner_halfswap_loadstore
 921 ```
 922
 923
 924 [[!tag opf_rfc]]