openpower/sv/rfc/ls009.mdwn

   1 # RFC ls009 Simple-V REMAP Subsystem
   2
   3 **URLs**:
   4
   5 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
   7 * <https://git.openpower.foundation/isa/PowerISA/issues/124>
   8
   9 **Severity**: Major
  10
  11 **Status**: New
  12
  13 **Date**: 26 Mar 2023
  14
  15 **Target**: v3.2B
  16
  17 **Source**: v3.0B
  18
  19 **Books and Section affected**:
  20
  21 ```
  22     Book I, new Zero-Overhead-Loop Chapter.
  23     Appendix E Power ISA sorted by opcode
  24     Appendix F Power ISA sorted by version
  25     Appendix G Power ISA sorted by Compliancy Subset
  26     Appendix H Power ISA sorted by mnemonic
  27 ```
  28
  29 **Summary**
  30
  31 ```
  32     svremap  - Re-Mapping of Register Element Offsets
  33     svindex  - General-purpose setting of SHAPEs to be re-mapped
  34     svshape  - Hardware-level setting of SHAPEs for element re-mapping
  35     svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
  36 ```
  37
  38 **Submitter**: Luke Leighton (Libre-SOC)
  39
  40 **Requester**: Libre-SOC
  41
  42 **Impact on processor**:
  43
  44 ```
  45     Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
  46     Management Instructions which provide advanced features such as Matrix
  47     FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
  48 ```
  49
  50 **Impact on software**:
  51
  52 ```
  53     Requires support for new instructions in assembler, debuggers,
  54     and related tools.
  55 ```
  56
  57 **Keywords**:
  58
  59 ```
  60     Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
  61     Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
  62     Digital Signal Processing (DSP)
  63 ```
  64
  65 **Motivation**
  66
  67 These REMAP Management instructions provide state-of-the-art advanced capabilities
  68 to dramatically decrease instruction count and power reduction whilst retaining
  69 unprecedented general-purpose capability and a standard Sequential Execution Model.
  70
  71 **Notes and Observations**:
  72
  73 1. Although costly the alternatives in SIMD-paradigm software result in
  74    huge algorithmic complexity and associated power consumption increases.
  75    Loop-unrolling compilers are prevalent as is thousands to tens of
  76    thousands of instructions.
  77 2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
  78    to a handful of instructions.
  79 3. No REMAP instructions with the exception of Indexed rely on registers
  80    for the establishment of REMAP capability.
  81 4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
  82    and flexibility of REMAP.
  83 5. The Simple-V Compliancy Subsets make REMAP optional except in the
  84    Advanced Levels.  Like v3.1 MMA it is **not** necessary for **all**
  85    hardware to implement REMAP.
  86
  87 **Changes**
  88
  89 Add the following entries to:
  90
  91 * the Appendices of SV Book
  92 * Instructions of SV Book as a new Section
  93 * SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
  94
  95 ----------------
  96
  97 \newpage{}
  98
  99 [[!inline pages="openpower/sv/remap" raw=yes ]]
 100
 101 # Forms
 102
 103 Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
 104
 105 Add the following to Book I, 1.6.1, SVI-Form
 106
 107 ```
 108     |0     |6    |11    |16   |21 |23  |24|25|26    31|
 109     | PO   |  SVG|rmm   | SVd |ew |SVyx|mm|sk|   XO   |
 110 ```
 111
 112 Add the following to Book I, 1.6.1, SVM-Form
 113
 114 ```
 115     |0     |6        |11      |16    |21    |25 |26    |31  |
 116     | PO   |  SVxd   |   SVyd | SVzd | SVrm |vf |   XO      |
 117 ```
 118
 119 Add the following to Book I, 1.6.1, SVM2-Form
 120
 121 ```
 122     |0     |6     |10  |11      |16    |21 |24|25 |26    |31  |
 123     | PO   | SVo  |SVyx|   rmm  | SVd  |XO |mm|sk |   XO      |
 124 ```
 125
 126 Add the following to Book I, 1.6.1, SVRM-Form
 127
 128 ```
 129     |0     |6     |11  |13   |15   |17   |19   |21  |22   |26     |31 |
 130     | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |///  | XO        |
 131 ```
 132
 133 Add the following to Book I, 1.6.2
 134
 135 ```
 136     mi0 (11:12)
 137         Field used in REMAP to select the SVSHAPE for 1st input register
 138         Formats: SVRM
 139     mi1 (13:14)
 140         Field used in REMAP to select the SVSHAPE for 2nd input register
 141         Formats: SVRM
 142     mi2 (15:16)
 143         Field used in REMAP to select the SVSHAPE for 3rd input register
 144         Formats: SVRM
 145     mm (24)
 146         Field used to specify the meaning of the rmm field for SVI-Form
 147         and SVM2-Form
 148         Formats: SVI, SVM2
 149     mo0 (17:18)
 150         Field used in REMAP to select the SVSHAPE for 1st output register
 151         Formats: SVRM
 152     mo1 (19:20)
 153         Field used in REMAP to select the SVSHAPE for 2nd output register
 154         Formats: SVRM
 155     pst (21)
 156         Field used in REMAP to indicate "persistence" mode (REMAP
 157         continues to apply to multiple instructions)
 158         Formats: SVRM
 159     rmm (11:15)
 160         REMAP Mode field for SVI-Form and SVM2-Form
 161         Formats: SVI, SVM2
 162     sk (25)
 163         Field used to specify dimensional skipping in svindex
 164         Formats: SVI, SVM2
 165     SVd (16:20)
 166         Immediate field used to specify the size of the REMAP dimension
 167         in the svindex and svshape2 instructions
 168         Formats: SVI, SVM2
 169     SVDS (16:29)
 170         Immediate field used to specify a 9-bit signed
 171         two's complement integer which is concatenated
 172         on the right with 0b00 and sign-extended to 64 bits.
 173         Formats: SVDS
 174     SVG (6:10)
 175         Field used to specify a GPR to be used as a
 176         source for indexing.
 177         Formats: SVI
 178     SVi (16:22)
 179          Simple-V immediate field for setting VL or MVL
 180          Formats: SVL
 181     SVme (6:10)
 182          Simple-V "REMAP" map-enable bits (0-4)
 183          Formats: SVRM
 184     SVo (6:9)
 185         Field used by the svshape2 instruction as an offset
 186         Formats: SVM2
 187     SVrm (21:24)
 188          Simple-V "REMAP" Mode
 189          Formats: SVM
 190     SVxd (6:10)
 191          Simple-V "REMAP" x-dimension size
 192          Formats: SVM
 193     SVyd (11:15)
 194          Simple-V "REMAP" y-dimension size
 195          Formats: SVM
 196     SVzd (16:20)
 197          Simple-V "REMAP" z-dimension size
 198          Formats: SVM
 199     XO (21:23,26:31)
 200         Extended opcode field.  Note that bit 21 must be 1, 22 and 23
 201         must be zero, and bits 26-31 must be exactly the same as
 202         used for svshape.
 203         Formats: SVM2
 204 ```
 205
 206 # Appendices
 207
 208     Appendix E Power ISA sorted by opcode
 209     Appendix F Power ISA sorted by version
 210     Appendix G Power ISA sorted by Compliancy Subset
 211     Appendix H Power ISA sorted by mnemonic
 212
 213 | Form | Book | Page | Version | mnemonic | Description |
 214 |------|------|------|---------|----------|-------------|
 215 | SVRM | I    | #    | 3.0B    | svremap    | REMAP enabling instruction |
 216 | SVM  | I    | #    | 3.0B    | svshape  | REMAP shape instruction |
 217 | SVM2 | I    | #    | 3.0B    | svshape2 | REMAP shape instruction (2) |
 218 | SVI  | I    | #    | 3.0B    | svindex | REMAP General-purpose Indexing |
 219
 220 [[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
 221
 222 ## REMAP pseudocode
 223
 224 Written in python3 the following stand-alone executable source code is the Canonical
 225 Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
 226 in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
 227 a single CR Field (CR0) on `svstep.`.  The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
 228 offset is put through each algorithm to determine the actual Element Offset.
 229 Alternative implementations producing different ordering
 230 is prohibited as software will be critically relying on these Deterministic Schedules.
 231
 232 ### REMAP 2D/3D Matrix
 233
 234 The following stand-alone executable source code is the Canonical
 235 Specification for Matrix (2D/3D) REMAP.
 236 Hardware implementations are achievable with simple cascading counter-and-compares.
 237
 238 ```
 239 # python "yield" can be iterated. use this to make it clear how
 240 # the indices are generated by using natural-looking nested loops
 241 def iterate_indices(SVSHAPE):
 242     # get indices to iterate over, in the required order
 243     xd = SVSHAPE.lims[0]
 244     yd = SVSHAPE.lims[1]
 245     zd = SVSHAPE.lims[2]
 246     # create lists of indices to iterate over in each dimension
 247     x_r = list(range(xd))
 248     y_r = list(range(yd))
 249     z_r = list(range(zd))
 250     # invert the indices if needed
 251     if SVSHAPE.invxyz[0]: x_r.reverse()
 252     if SVSHAPE.invxyz[1]: y_r.reverse()
 253     if SVSHAPE.invxyz[2]: z_r.reverse()
 254     # start an infinite (wrapping) loop
 255     step = 0 # track src/dst step
 256     while True:
 257         for z in z_r:   # loop over 1st order dimension
 258             z_end = z == z_r[-1]
 259             for y in y_r:       # loop over 2nd order dimension
 260                 y_end = y == y_r[-1]
 261                 for x in x_r:           # loop over 3rd order dimension
 262                     x_end = x == x_r[-1]
 263                     # ok work out which order to construct things in.
 264                     # start by creating a list of tuples of the dimension
 265                     # and its limit
 266                     vals = [(SVSHAPE.lims[0], x, "x"),
 267                             (SVSHAPE.lims[1], y, "y"),
 268                             (SVSHAPE.lims[2], z, "z")
 269                            ]
 270                     # now select those by order.  this allows us to
 271                     # create schedules for [z][x], [x][y], or [y][z]
 272                     # for matrix multiply.
 273                     vals = [vals[SVSHAPE.order[0]],
 274                             vals[SVSHAPE.order[1]],
 275                             vals[SVSHAPE.order[2]]
 276                            ]
 277                     # ok now we can construct the result, using bits of
 278                     # "order" to say which ones get stacked on
 279                     result = 0
 280                     mult = 1
 281                     for i in range(3):
 282                         lim, idx, dbg = vals[i]
 283                         # some of the dimensions can be "skipped".  the order
 284                         # was actually selected above on all 3 dimensions,
 285                         # e.g. [z][x][y] or [y][z][x].  "skip" allows one of
 286                         # those to be knocked out
 287                         if SVSHAPE.skip == i+1: continue
 288                         idx *= mult   # shifts up by previous dimension(s)
 289                         result += idx # adds on this dimension
 290                         mult *= lim   # for the next dimension
 291
 292                     loopends = (x_end |
 293                                ((y_end and x_end)<<1) |
 294                                 ((y_end and x_end and z_end)<<2))
 295
 296                     yield result + SVSHAPE.offset, loopends
 297                     step += 1
 298
 299 def demo():
 300     # set the dimension sizes here
 301     xdim = 3
 302     ydim = 2
 303     zdim = 4
 304
 305     # set total (can repeat, e.g. VL=x*y*z*4)
 306     VL = xdim * ydim * zdim
 307
 308     # set up an SVSHAPE
 309     class SVSHAPE:
 310         pass
 311     SVSHAPE0 = SVSHAPE()
 312     SVSHAPE0.lims = [xdim, ydim, zdim]
 313     SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
 314     SVSHAPE0.mode = 0b00
 315     SVSHAPE0.skip = 0b00
 316     SVSHAPE0.offset = 0       # experiment with different offset, here
 317     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 318
 319     # enumerate over the iterator function, getting new indices
 320     for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
 321         if idx >= VL:
 322             break
 323         print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
 324
 325 # run the demo
 326 if __name__ == '__main__':
 327     demo()
 328 ```
 329
 330 ### REMAP Parallel Reduction pseudocode
 331
 332 The python3 program below is stand-alone executable and is the Canonical Specification
 333 for Parallel Reduction REMAP.
 334 The Algorithm below is not limited to RADIX2 sizes, and Predicate
 335 sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
 336 has been applied, not before.  MV operations are not required: the algorithm
 337 tracks positions of elements that would normally be moved and when applying
 338 an Element Reduction Operation sources the operands from their last-known (tracked)
 339 position.
 340
 341 ```
 342 # a "yield" version of the Parallel Reduction REMAP algorithm.
 343 # the algorithm is in-place. it does not perform "MV" operations.
 344 # instead, where a masked-out value *should* be read from is tracked
 345
 346 def iterate_indices(SVSHAPE, pred=None):
 347     # get indices to iterate over, in the required order
 348     xd = SVSHAPE.lims[0]
 349     # create lists of indices to iterate over in each dimension
 350     ix = list(range(xd))
 351     # invert the indices if needed
 352     if SVSHAPE.invxyz[0]: ix.reverse()
 353     # start a loop from the lowest step
 354     step = 1
 355     steps = []
 356     while step < xd:
 357         step *= 2
 358         steps.append(step)
 359     # invert the indices if needed
 360     if SVSHAPE.invxyz[1]: steps.reverse()
 361     for step in steps:
 362         stepend = (step == steps[-1]) # note end of steps
 363         idxs = list(range(0, xd, step))
 364         results = []
 365         for i in idxs:
 366             other = i + step // 2
 367             ci = ix[i]
 368             oi = ix[other] if other < xd else None
 369             other_pred = other < xd and (pred is None or pred[oi])
 370             if (pred is None or pred[ci]) and other_pred:
 371                 if SVSHAPE.skip == 0b00: # submode 00
 372                     result = ci
 373                 elif SVSHAPE.skip == 0b01: # submode 01
 374                     result = oi
 375                 results.append([result + SVSHAPE.offset, 0])
 376             elif other_pred:
 377                 ix[i] = oi
 378         if results:
 379             results[-1][1] = (stepend<<1) | 1  # notify end of loops
 380         yield from results
 381
 382 def demo():
 383     # set the dimension sizes here
 384     xdim = 9
 385
 386     # set up an SVSHAPE
 387     class SVSHAPE:
 388         pass
 389     SVSHAPE0 = SVSHAPE()
 390     SVSHAPE0.lims = [xdim, 0, 0]
 391     SVSHAPE0.order = [0,1,2]
 392     SVSHAPE0.mode = 0b10
 393     SVSHAPE0.skip = 0b00
 394     SVSHAPE0.offset = 0       # experiment with different offset, here
 395     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 396
 397     SVSHAPE1 = SVSHAPE()
 398     SVSHAPE1.lims = [xdim, 0, 0]
 399     SVSHAPE1.order = [0,1,2]
 400     SVSHAPE1.mode = 0b10
 401     SVSHAPE1.skip = 0b01
 402     SVSHAPE1.offset = 0       # experiment with different offset, here
 403     SVSHAPE1.invxyz = [0,0,0] # inversion if desired
 404
 405     # enumerate over the iterator function, getting new indices
 406     shapes = list(iterate_indices(SVSHAPE0)), \
 407               list(iterate_indices(SVSHAPE1))
 408     for idx in range(len(shapes[0])):
 409         l = shapes[0][idx]
 410         r = shapes[1][idx]
 411         (l_idx, lend) = l
 412         (r_idx, rend) = r
 413         print ("%d->%d:%d" % (idx, l_idx, r_idx),
 414                "end", bin(lend)[2:], bin(rend)[2:])
 415
 416 # run the demo
 417 if __name__ == '__main__':
 418     demo()
 419 ```
 420
 421 ### REMAP FFT pseudocode
 422
 423 The FFT REMAP is RADIX2 only.
 424
 425 ```
 426 # a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
 427 # original code for the FFT Tukey-Cooley schedule:
 428 # https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
 429 """
 430     # Radix-2 decimation-in-time FFT (real, not complex)
 431     size = 2
 432     while size <= n:
 433         halfsize = size // 2
 434         tablestep = n // size
 435         for i in range(0, n, size):
 436             k = 0
 437             for j in range(i, i + halfsize):
 438                 jh = j+halfsize
 439                 jl = j
 440                 temp1 = vec[jh] * exptable[k]
 441                 temp2 = vec[jl]
 442                 vec[jh] = temp2 - temp1
 443                 vec[jl] = temp2 + temp1
 444                 k += tablestep
 445         size *= 2
 446 """
 447
 448 # python "yield" can be iterated. use this to make it clear how
 449 # the indices are generated by using natural-looking nested loops
 450 def iterate_butterfly_indices(SVSHAPE):
 451     # get indices to iterate over, in the required order
 452     n = SVSHAPE.lims[0]
 453     stride = SVSHAPE.lims[2] # stride-multiplier on reg access
 454     # creating lists of indices to iterate over in each dimension
 455     # has to be done dynamically, because it depends on the size
 456     # first, the size-based loop (which can be done statically)
 457     x_r = []
 458     size = 2
 459     while size <= n:
 460         x_r.append(size)
 461         size *= 2
 462     # invert order if requested
 463     if SVSHAPE.invxyz[0]: x_r.reverse()
 464
 465     if len(x_r) == 0:
 466         return
 467
 468     # start an infinite (wrapping) loop
 469     skip = 0
 470     while True:
 471         for size in x_r:           # loop over 3rd order dimension (size)
 472             x_end = size == x_r[-1]
 473             # y_r schedule depends on size
 474             halfsize = size // 2
 475             tablestep = n // size
 476             y_r = []
 477             for i in range(0, n, size):
 478                 y_r.append(i)
 479             # invert if requested
 480             if SVSHAPE.invxyz[1]: y_r.reverse()
 481             for i in y_r:       # loop over 2nd order dimension
 482                 y_end = i == y_r[-1]
 483                 k_r = []
 484                 j_r = []
 485                 k = 0
 486                 for j in range(i, i+halfsize):
 487                     k_r.append(k)
 488                     j_r.append(j)
 489                     k += tablestep
 490                 # invert if requested
 491                 if SVSHAPE.invxyz[2]: k_r.reverse()
 492                 if SVSHAPE.invxyz[2]: j_r.reverse()
 493                 for j, k in zip(j_r, k_r):   # loop over 1st order dimension
 494                     z_end = j == j_r[-1]
 495                     # now depending on MODE return the index
 496                     if SVSHAPE.skip == 0b00:
 497                         result = j              # for vec[j]
 498                     elif SVSHAPE.skip == 0b01:
 499                         result = j + halfsize   # for vec[j+halfsize]
 500                     elif SVSHAPE.skip == 0b10:
 501                         result = k              # for exptable[k]
 502
 503                     loopends = (z_end |
 504                                ((y_end and z_end)<<1) |
 505                                 ((y_end and x_end and z_end)<<2))
 506
 507                     yield (result * stride) + SVSHAPE.offset, loopends
 508
 509 def demo():
 510     # set the dimension sizes here
 511     xdim = 8
 512     ydim = 0 # not needed
 513     zdim = 1 # stride must be set to 1
 514
 515     # set total. err don't know how to calculate how many there are...
 516     # do it manually for now
 517     VL = 0
 518     size = 2
 519     n = xdim
 520     while size <= n:
 521         halfsize = size // 2
 522         tablestep = n // size
 523         for i in range(0, n, size):
 524             for j in range(i, i + halfsize):
 525                 VL += 1
 526         size *= 2
 527
 528     # set up an SVSHAPE
 529     class SVSHAPE:
 530         pass
 531     # j schedule
 532     SVSHAPE0 = SVSHAPE()
 533     SVSHAPE0.lims = [xdim, ydim, zdim]
 534     SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
 535     SVSHAPE0.mode = 0b01
 536     SVSHAPE0.skip = 0b00
 537     SVSHAPE0.offset = 0       # experiment with different offset, here
 538     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 539     # j+halfstep schedule
 540     SVSHAPE1 = SVSHAPE()
 541     SVSHAPE1.lims = [xdim, ydim, zdim]
 542     SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
 543     SVSHAPE0.mode = 0b01
 544     SVSHAPE1.skip = 0b01
 545     SVSHAPE1.offset = 0       # experiment with different offset, here
 546     SVSHAPE1.invxyz = [0,0,0] # inversion if desired
 547     # k schedule
 548     SVSHAPE2 = SVSHAPE()
 549     SVSHAPE2.lims = [xdim, ydim, zdim]
 550     SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
 551     SVSHAPE0.mode = 0b01
 552     SVSHAPE2.skip = 0b10
 553     SVSHAPE2.offset = 0       # experiment with different offset, here
 554     SVSHAPE2.invxyz = [0,0,0] # inversion if desired
 555
 556     # enumerate over the iterator function, getting new indices
 557     schedule = []
 558     for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
 559                                           iterate_butterfly_indices(SVSHAPE1),
 560                                           iterate_butterfly_indices(SVSHAPE2))):
 561         if idx >= VL:
 562             break
 563         schedule.append((jl, jh, k))
 564
 565     # ok now pretty-print the results, with some debug output
 566     size = 2
 567     idx = 0
 568     while size <= n:
 569         halfsize = size // 2
 570         tablestep = n // size
 571         print ("size %d halfsize %d tablestep %d" % \
 572                 (size, halfsize, tablestep))
 573         for i in range(0, n, size):
 574             prefix = "i %d\t" % i
 575             k = 0
 576             for j in range(i, i + halfsize):
 577                 (jl, je), (jh, he), (ks, ke) = schedule[idx]
 578                 print ("  %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
 579                         "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
 580                                 (idx, prefix, j, j+halfsize, k,
 581                                       jl, jh, ks,
 582                                 ),
 583                                 "end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
 584                 k += tablestep
 585                 idx += 1
 586         size *= 2
 587
 588 # run the demo
 589 if __name__ == '__main__':
 590     demo()
 591 ```
 592
 593 ### DCT REMAP
 594
 595 DCT REMAP is RADIX2 only.  Convolutions may be applied as usual
 596 to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
 597 instructions the algorithm below (written in python3), becomes part
 598 of an in-place in-registers Vectorised DCT.  The algorithms work
 599 by loading data such that as the nested loops progress the result
 600 is sorted into correct sequential order.
 601
 602 ```
 603 # DCT "REMAP" scheduler to create an in-place iterative DCT.
 604 #
 605
 606 # bits of the integer 'val' of width 'width' are reversed
 607 def reverse_bits(val, width):
 608     result = 0
 609     for _ in range(width):
 610         result = (result << 1) | (val & 1)
 611         val >>= 1
 612     return result
 613
 614
 615 # iterative version of [recursively-applied] half-reversing
 616 # turns out this is Gray-Encoding.
 617 def halfrev2(vec, pre_rev=True):
 618     res = []
 619     for i in range(len(vec)):
 620         if pre_rev:
 621             res.append(vec[i ^ (i>>1)])
 622         else:
 623             ri = i
 624             bl = i.bit_length()
 625             for ji in range(1, bl):
 626                 ri ^= (i >> ji)
 627             res.append(vec[ri])
 628     return res
 629
 630
 631 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
 632     # get indices to iterate over, in the required order
 633     n = SVSHAPE.lims[0]
 634     mode = SVSHAPE.lims[1]
 635     stride = SVSHAPE.lims[2]
 636
 637     # reference list for not needing to do data-swaps, just swap what
 638     # *indices* are referenced (two levels of indirection at the moment)
 639     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 640     ji = list(range(n))
 641
 642     levels = n.bit_length() - 1
 643     ri = [reverse_bits(i, levels) for i in range(n)]
 644
 645     if SVSHAPE.mode == 0b01: # FFT, bitrev only
 646         ji = [ji[ri[i]] for i in range(n)]
 647     elif SVSHAPE.submode2 == 0b001:
 648         ji = [ji[ri[i]] for i in range(n)]
 649         ji = halfrev2(ji, True)
 650     else:
 651         ji = halfrev2(ji, False)
 652         ji = [ji[ri[i]] for i in range(n)]
 653
 654     # invert order if requested
 655     if SVSHAPE.invxyz[0]:
 656         ji.reverse()
 657
 658     for i, jl in enumerate(ji):
 659         y_end = jl == ji[-1]
 660         yield jl * stride, (0b111 if y_end else 0b000)
 661
 662 def iterate_dct_inner_costable_indices(SVSHAPE):
 663     # get indices to iterate over, in the required order
 664     n = SVSHAPE.lims[0]
 665     mode = SVSHAPE.lims[1]
 666     stride = SVSHAPE.lims[2]
 667     # creating lists of indices to iterate over in each dimension
 668     # has to be done dynamically, because it depends on the size
 669     # first, the size-based loop (which can be done statically)
 670     x_r = []
 671     size = 2
 672     while size <= n:
 673         x_r.append(size)
 674         size *= 2
 675     # invert order if requested
 676     if SVSHAPE.invxyz[0]:
 677         x_r.reverse()
 678
 679     if len(x_r) == 0:
 680         return
 681
 682     # start an infinite (wrapping) loop
 683     skip = 0
 684     z_end = 1 # doesn't exist in this, only 2 loops
 685     k = 0
 686     while True:
 687         for size in x_r:           # loop over 3rd order dimension (size)
 688             x_end = size == x_r[-1]
 689             # y_r schedule depends on size
 690             halfsize = size // 2
 691             y_r = []
 692             for i in range(0, n, size):
 693                 y_r.append(i)
 694             # invert if requested
 695             if SVSHAPE.invxyz[1]: y_r.reverse()
 696             # two lists of half-range indices, e.g. j 0123, jr 7654
 697             j = list(range(0, halfsize))
 698             # invert if requested
 699             if SVSHAPE.invxyz[2]: j_r.reverse()
 700             # loop over 1st order dimension
 701             for ci, jl in enumerate(j):
 702                 y_end = jl == j[-1]
 703                 # now depending on MODE return the index.  inner butterfly
 704                 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 705                     result = k  # offset into COS table
 706                 elif SVSHAPE.skip == 0b10: #
 707                     result = ci # coefficient helper
 708                 elif SVSHAPE.skip == 0b11: #
 709                     result = size # coefficient helper
 710                 loopends = (z_end |
 711                            ((y_end and z_end)<<1) |
 712                             ((y_end and x_end and z_end)<<2))
 713
 714                 yield (result * stride) + SVSHAPE.offset, loopends
 715                 k += 1
 716
 717 def iterate_dct_inner_butterfly_indices(SVSHAPE):
 718     # get indices to iterate over, in the required order
 719     n = SVSHAPE.lims[0]
 720     mode = SVSHAPE.lims[1]
 721     stride = SVSHAPE.lims[2]
 722     # creating lists of indices to iterate over in each dimension
 723     # has to be done dynamically, because it depends on the size
 724     # first, the size-based loop (which can be done statically)
 725     x_r = []
 726     size = 2
 727     while size <= n:
 728         x_r.append(size)
 729         size *= 2
 730     # invert order if requested
 731     if SVSHAPE.invxyz[0]:
 732         x_r.reverse()
 733
 734     if len(x_r) == 0:
 735         return
 736
 737     # reference (read/write) the in-place data in *reverse-bit-order*
 738     ri = list(range(n))
 739     if SVSHAPE.submode2 == 0b01:
 740         levels = n.bit_length() - 1
 741         ri = [ri[reverse_bits(i, levels)] for i in range(n)]
 742
 743     # reference list for not needing to do data-swaps, just swap what
 744     # *indices* are referenced (two levels of indirection at the moment)
 745     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 746     ji = list(range(n))
 747     inplace_mode = True
 748     if inplace_mode and SVSHAPE.submode2 == 0b01:
 749         ji = halfrev2(ji, True)
 750     if inplace_mode and SVSHAPE.submode2 == 0b11:
 751         ji = halfrev2(ji, False)
 752
 753     # start an infinite (wrapping) loop
 754     while True:
 755         k = 0
 756         k_start = 0
 757         for size in x_r:           # loop over 3rd order dimension (size)
 758             x_end = size == x_r[-1]
 759             # y_r schedule depends on size
 760             halfsize = size // 2
 761             y_r = []
 762             for i in range(0, n, size):
 763                 y_r.append(i)
 764             # invert if requested
 765             if SVSHAPE.invxyz[1]: y_r.reverse()
 766             for i in y_r:       # loop over 2nd order dimension
 767                 y_end = i == y_r[-1]
 768                 # two lists of half-range indices, e.g. j 0123, jr 7654
 769                 j = list(range(i, i + halfsize))
 770                 jr = list(range(i+halfsize, i + size))
 771                 jr.reverse()
 772                 # invert if requested
 773                 if SVSHAPE.invxyz[2]:
 774                     j.reverse()
 775                     jr.reverse()
 776                 hz2 = halfsize // 2 # zero stops reversing 1-item lists
 777                 # loop over 1st order dimension
 778                 k = k_start
 779                 for ci, (jl, jh) in enumerate(zip(j, jr)):
 780                     z_end = jl == j[-1]
 781                     # now depending on MODE return the index.  inner butterfly
 782                     if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 783                         if SVSHAPE.submode2 == 0b11: # iDCT
 784                             result = ji[ri[jl]]        # lower half
 785                         else:
 786                             result = ri[ji[jl]]        # lower half
 787                     elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
 788                         if SVSHAPE.submode2 == 0b11: # iDCT
 789                             result = ji[ri[jl+halfsize]] # upper half
 790                         else:
 791                             result = ri[ji[jh]] # upper half
 792                     elif mode == 4:
 793                         # COS table pre-generated mode
 794                         if SVSHAPE.skip == 0b10: #
 795                             result = k # cos table offset
 796                     else: # mode 2
 797                         # COS table generated on-demand ("Vertical-First") mode
 798                         if SVSHAPE.skip == 0b10: #
 799                             result = ci # coefficient helper
 800                         elif SVSHAPE.skip == 0b11: #
 801                             result = size # coefficient helper
 802                     loopends = (z_end |
 803                                ((y_end and z_end)<<1) |
 804                                 ((y_end and x_end and z_end)<<2))
 805
 806                     yield (result * stride) + SVSHAPE.offset, loopends
 807                     k += 1
 808
 809                 # now in-place swap
 810                 if inplace_mode:
 811                     for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
 812                         jlh = jl+halfsize
 813                         tmp1, tmp2 = ji[jlh], ji[jh]
 814                         ji[jlh], ji[jh] = tmp2, tmp1
 815
 816             # new k_start point for cos tables( runs inside x_r loop NOT i loop)
 817             k_start += halfsize
 818
 819
 820 # python "yield" can be iterated. use this to make it clear how
 821 # the indices are generated by using natural-looking nested loops
 822 def iterate_dct_outer_butterfly_indices(SVSHAPE):
 823     # get indices to iterate over, in the required order
 824     n = SVSHAPE.lims[0]
 825     mode = SVSHAPE.lims[1]
 826     stride = SVSHAPE.lims[2]
 827     # creating lists of indices to iterate over in each dimension
 828     # has to be done dynamically, because it depends on the size
 829     # first, the size-based loop (which can be done statically)
 830     x_r = []
 831     size = n // 2
 832     while size >= 2:
 833         x_r.append(size)
 834         size //= 2
 835     # invert order if requested
 836     if SVSHAPE.invxyz[0]:
 837         x_r.reverse()
 838
 839     if len(x_r) == 0:
 840         return
 841
 842     # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
 843     ri = list(range(n))
 844     if SVSHAPE.submode2 in [0b11, 0b01]:
 845         levels = n.bit_length() - 1
 846         ri = [ri[reverse_bits(i, levels)] for i in range(n)]
 847
 848     # reference list for not needing to do data-swaps, just swap what
 849     # *indices* are referenced (two levels of indirection at the moment)
 850     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 851     ji = list(range(n))
 852     inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
 853     if SVSHAPE.submode2 == 0b11:
 854         ji = halfrev2(ji, False)
 855
 856     # start an infinite (wrapping) loop
 857     while True:
 858         k = 0
 859         k_start = 0
 860         for size in x_r:           # loop over 3rd order dimension (size)
 861             halfsize = size//2
 862             x_end = size == x_r[-1]
 863             y_r = list(range(0, halfsize))
 864             # invert if requested
 865             if SVSHAPE.invxyz[1]: y_r.reverse()
 866             for i in y_r:       # loop over 2nd order dimension
 867                 y_end = i == y_r[-1]
 868                 # one list to create iterative-sum schedule
 869                 jr = list(range(i+halfsize, i+n-halfsize, size))
 870                 # invert if requested
 871                 if SVSHAPE.invxyz[2]: jr.reverse()
 872                 hz2 = halfsize // 2 # zero stops reversing 1-item lists
 873                 k = k_start
 874                 for ci, jh in enumerate(jr):   # loop over 1st order dimension
 875                     z_end = jh == jr[-1]
 876                     if mode == 4:
 877                         # COS table pre-generated mode
 878                         if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 879                             if SVSHAPE.submode2 == 0b11: # iDCT
 880                                 result = ji[ri[jh]] # upper half
 881                             else:
 882                                 result = ri[ji[jh]]        # lower half
 883                         elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
 884                             if SVSHAPE.submode2 == 0b11: # iDCT
 885                                 result = ji[ri[jh+size]] # upper half
 886                             else:
 887                                 result = ri[ji[jh+size]] # upper half
 888                         elif SVSHAPE.skip == 0b10: #
 889                             result = k # cos table offset
 890                     else:
 891                         # COS table generated on-demand ("Vertical-First") mode
 892                         if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 893                             if SVSHAPE.submode2 == 0b11: # iDCT
 894                                 result = ji[ri[jh]]        # lower half
 895                             else:
 896                                 result = ri[ji[jh]]        # lower half
 897                         elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
 898                             if SVSHAPE.submode2 == 0b11: # iDCT
 899                                 result = ji[ri[jh+size]] # upper half
 900                             else:
 901                                 result = ri[ji[jh+size]] # upper half
 902                         elif SVSHAPE.skip == 0b10: #
 903                             result = ci # coefficient helper
 904                         elif SVSHAPE.skip == 0b11: #
 905                             result = size # coefficient helper
 906                     loopends = (z_end |
 907                                ((y_end and z_end)<<1) |
 908                                 ((y_end and x_end and z_end)<<2))
 909
 910                     yield (result * stride) + SVSHAPE.offset, loopends
 911                     k += 1
 912
 913             # new k_start point for cos tables( runs inside x_r loop NOT i loop)
 914             k_start += halfsize
 915
 916 ```
 917
 918 ## REMAP selector
 919
 920 Selecting which REMAP Schedule to use is shown by the pseudocode below.
 921 Each SVSHAPE 0-3 goes through this selection process.
 922
 923 ```
 924   if SVSHAPEn.mode == 0b00:             iterate_fn = iterate_indices
 925   if SVSHAPEn.mode == 0b10:             iterate_fn = iterate_preduce_indices
 926   if SVSHAPEn.mode in [0b01, 0b11]:
 927     # further sub-selection
 928     if SVSHAPEn.ydimsz == 1:            iterate_fn = iterate_butterfly_indices
 929     if SVSHAPEn.ydimsz == 2:            iterate_fn = iterate_dct_inner_butterfly_indices
 930     if SVSHAPEn.ydimsz == 3:            iterate_fn = iterate_dct_outer_butterfly_indices
 931     if SVSHAPEn.ydimsz in [5, 13]:      iterate_fn = iterate_dct_inner_costable_indices
 932     if SVSHAPEn.ydimsz in [6, 14, 15]:  iterate_fn = iterate_dct_inner_halfswap_loadstore
 933 ```
 934
 935
 936 [[!tag opf_rfc]]