add ls009 Notes
[libreriscv.git] / openpower / sv / rfc / ls009.mdwn
index d4d2cbc0393398925fd531fb4f8d75dce9ab4508..059b441d095fa68bc1b7a7eb1255212c21582698 100644 (file)
@@ -1,17 +1,16 @@
-# RFC ls009 SVP64 REMAP instructions
+# RFC ls009 Simple-V REMAP Subsystem
 
 **URLs**:
 
-* <https://libre-soc.org/openpower/sv/>
 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
-* <https://bugs.libre-soc.org/show_bug.cgi?id=1040>
-* <https://git.openpower.foundation/isa/PowerISA/issues/87>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
+* <https://git.openpower.foundation/isa/PowerISA/issues/124>
 
 **Severity**: Major
 
 **Status**: New
 
-**Date**: 24 Mar 2023
+**Date**: 26 Mar 2023
 
 **Target**: v3.2B
 
@@ -71,95 +70,137 @@ unprecedented general-purpose capability and a standard Sequential Execution Mod
 
 **Notes and Observations**:
 
-1. TODO
+1. Although REMAP is costly to implement, the SIMD-paradigm software
+   alternatives result in huge algorithmic complexity and associated
+   increases in power consumption.  Loop-unrolling compilers are
+   prevalent, as are routines of thousands to tens of thousands of
+   instructions.
+2. Core inner kernels of Matrix, DCT, DFT, FFT and NTT are dramatically
+   reduced to a handful of instructions.
+3. No REMAP instruction, with the exception of Indexed, relies on
+   registers for the establishment of REMAP capability.
+4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
+   and flexibility of REMAP.
+5. The Simple-V Compliancy Subsets make REMAP optional except in the
+   Advanced Levels.  Like v3.1 MMA it is **not** necessary for **all**
+   hardware to implement REMAP.
 
 **Changes**
 
 Add the following entries to:
 
-* the Appendices of Book I
-* Instructions of Book I as a new Section
-* TODO-Form of Book I Section 1.6.1.6 and 1.6.2
+* the Appendices of SV Book
+* Instructions of SV Book as a new Section
+* SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
 
 ----------------
 
 \newpage{}
 
-# svstep: Vertical-First Stepping and status reporting
+[[!inline pages="openpower/sv/remap" raw=yes ]]
 
-SVL-Form
+# Forms
 
-* svstep RT,SVi,vf (Rc=0)
-* svstep. RT,SVi,vf (Rc=1)
+Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
 
-| 0-5|6-10|11.15|16..22| 23-25    | 26-30 |31|   Form   |
-|----|----|-----|------|----------|-------|--|--------- |
-|PO  | RT | /   | SVi  |  / / vf  | XO    |Rc| SVL-Form |
-
-Pseudo-code:
+Add the following to Book I, 1.6.1, SVI-Form
 
 ```
-    if SVi[3:4] = 0b11 then
-        # store pack and unpack in SVSTATE
-        SVSTATE[53] <- SVi[5]
-        SVSTATE[54] <- SVi[6]
-        RT <- [0]*62 || SVSTATE[53:54]
-    else
-        # Vertical-First explicit stepping.
-        step <- SVSTATE_NEXT(SVi, vf)
-        RT <- [0]*57 || step
+    |0     |6    |11    |16   |21 |23  |24|25|26    31|
+    | PO   |  SVG|rmm   | SVd |ew |SVyx|mm|sk|   XO   |
 ```
 
-Special Registers Altered:
-
-    CR0                     (if Rc=1)
-
-**Description**
+Add the following to Book I, 1.6.1, SVM-Form
 
+```
+    |0     |6        |11      |16    |21    |25 |26    |31  |
+    | PO   |  SVxd   |   SVyd | SVzd | SVrm |vf |   XO      |
+```
 
--------------
-
-\newpage{}
-
-
--------------
-
-\newpage{}
+Add the following to Book I, 1.6.1, SVM2-Form
 
-# SVL-Form
+```
+    |0     |6     |10  |11      |16    |21 |24|25 |26    |31  |
+    | PO   | SVo  |SVyx|   rmm  | SVd  |XO |mm|sk |   XO      |
+```
 
-Add the following to Book I, 1.6.1, SVL-Form
+Add the following to Book I, 1.6.1, SVRM-Form
 
 ```
-    |0     |6    |11    |16   |23 |24 |25 |26    |31 |
-    | PO   |  RT |   RA | SVi |ms |vs |vf |   XO |Rc |
-    | PO   |  RT | /    | SVi |/  |/  |vf |   XO |Rc |
+    |0     |6     |11  |13   |15   |17   |19   |21  |22   |26     |31 |
+    | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |///  | XO        |
 ```
 
-* Add `SVL` to `RA (11:15)` Field in Book I, 1.6.2
-* Add `SVL` to `RT (6:10)` Field in Book I, 1.6.2
-* Add `SVL` to `Rc (31)` Field in Book I, 1.6.2
-* Add `SVL` to `XO (26:31)` Field in Book I, 1.6.2
-
 Add the following to Book I, 1.6.2
 
 ```
-    ms (23)
-        Field used in Simple-V to specify whether MVL (maxvl in the SVSTATE SPR)
-        is to be set
-        Formats: SVL
-    vf (25)
-        Field used in Simple-V to specify whether "Vertical" Mode is set
-        (vfirst in the SVSTATE SPR)
-        Formats: SVL
-    vs (24)
-        Field used in Simple-V to specify whether VL (vl in the SVSTATE SPR) is to be set
-        Formats: SVL
+    mi0 (11:12)
+        Field used in REMAP to select the SVSHAPE for 1st input register
+        Formats: SVRM
+    mi1 (13:14)
+        Field used in REMAP to select the SVSHAPE for 2nd input register
+        Formats: SVRM
+    mi2 (15:16)
+        Field used in REMAP to select the SVSHAPE for 3rd input register
+        Formats: SVRM
+    mm (24)
+        Field used to specify the meaning of the rmm field for SVI-Form
+        and SVM2-Form
+        Formats: SVI, SVM2
+    mo0 (17:18)
+        Field used in REMAP to select the SVSHAPE for 1st output register
+        Formats: SVRM
+    mo1 (19:20)
+        Field used in REMAP to select the SVSHAPE for 2nd output register
+        Formats: SVRM
+    pst (21)
+        Field used in REMAP to indicate "persistence" mode (REMAP
+        continues to apply to multiple instructions)
+        Formats: SVRM
+    rmm (11:15)
+        REMAP Mode field for SVI-Form and SVM2-Form
+        Formats: SVI, SVM2
+    sk (25)
+        Field used to specify dimensional skipping in svindex
+        Formats: SVI, SVM2
+    SVd (16:20)
+        Immediate field used to specify the size of the REMAP dimension
+        in the svindex and svshape2 instructions
+        Formats: SVI, SVM2
+    SVDS (16:29)
+        Immediate field used to specify a 9-bit signed
+        two's complement integer which is concatenated
+        on the right with 0b00 and sign-extended to 64 bits.
+        Formats: SVDS
+    SVG (6:10)
+        Field used to specify a GPR to be used as a
+        source for indexing.
+        Formats: SVI
     SVi (16:22)
-         Simple-V immediate field used by setvl for setting VL or MVL
-         (vl, maxvl in the SVSTATE SPR)
-         and used as a "Mode of Operation" selector in svstep
+         Simple-V immediate field for setting VL or MVL
          Formats: SVL
+    SVme (6:10)
+         Simple-V "REMAP" map-enable bits (0-4)
+         Formats: SVRM
+    SVo (6:9)
+        Field used by the svshape2 instruction as an offset
+        Formats: SVM2
+    SVrm (21:24)
+         Simple-V "REMAP" Mode
+         Formats: SVM
+    SVxd (6:10)
+         Simple-V "REMAP" x-dimension size
+         Formats: SVM
+    SVyd (11:15)
+         Simple-V "REMAP" y-dimension size
+         Formats: SVM
+    SVzd (16:20)
+         Simple-V "REMAP" z-dimension size
+         Formats: SVM
+    XO (21:23,26:31)
+        Extended opcode field.  Note that bit 21 must be 1, 22 and 23
+        must be zero, and bits 26-31 must be exactly the same as
+        used for svshape.
+        Formats: SVM2
 ```
 
 # Appendices
@@ -171,7 +212,725 @@ Add the following to Book I, 1.6.2
 
 | Form | Book | Page | Version | mnemonic | Description |
 |------|------|------|---------|----------|-------------|
-| SVL  | I    | #    | 3.0B    | svstep   | Vertical-First Stepping and status reporting |
-| SVL  | I    | #    | 3.0B    | setvl    | Cray-like establishment of Looping (Vector) context |
+| SVRM | I    | #    | 3.0B    | svremap    | REMAP enabling instruction |
+| SVM  | I    | #    | 3.0B    | svshape  | REMAP shape instruction |
+| SVM2 | I    | #    | 3.0B    | svshape2 | REMAP shape instruction (2) |
+| SVI  | I    | #    | 3.0B    | svindex | REMAP General-purpose Indexing |
+
+[[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
+
+## REMAP pseudocode
+
+Written in python3, the following stand-alone executable source code is
+the Canonical Specification for each REMAP.  Vectors of "loopends" are
+returned in Vectors of CR Fields when Rc=1 on `sv.svstep.`, or, in
+Vertical-First Mode, in a single CR Field (CR0) on `svstep.`.  The
+`SVSTATE.srcstep` or `SVSTATE.dststep` sequential offset is put through
+each algorithm to determine the actual Element Offset.  Alternative
+implementations producing different orderings are prohibited, as software
+will critically rely on these Deterministic Schedules.
+
+### REMAP 2D/3D Matrix
+
+The following stand-alone executable source code is the Canonical
+Specification for Matrix (2D/3D) REMAP.  Hardware implementations are
+achievable with simple cascading counter-and-compares.
+
+```
+# python "yield" can be iterated. use this to make it clear how
+# the indices are generated by using natural-looking nested loops
+def iterate_indices(SVSHAPE):
+    # get indices to iterate over, in the required order
+    xd = SVSHAPE.lims[0]
+    yd = SVSHAPE.lims[1]
+    zd = SVSHAPE.lims[2]
+    # create lists of indices to iterate over in each dimension
+    x_r = list(range(xd))
+    y_r = list(range(yd))
+    z_r = list(range(zd))
+    # invert the indices if needed
+    if SVSHAPE.invxyz[0]: x_r.reverse()
+    if SVSHAPE.invxyz[1]: y_r.reverse()
+    if SVSHAPE.invxyz[2]: z_r.reverse()
+    # start an infinite (wrapping) loop
+    step = 0 # track src/dst step
+    while True:
+        for z in z_r:   # loop over 1st order dimension
+            z_end = z == z_r[-1]
+            for y in y_r:       # loop over 2nd order dimension
+                y_end = y == y_r[-1]
+                for x in x_r:           # loop over 3rd order dimension
+                    x_end = x == x_r[-1]
+                    # ok work out which order to construct things in.
+                    # start by creating a list of tuples of the dimension
+                    # and its limit
+                    vals = [(SVSHAPE.lims[0], x, "x"),
+                            (SVSHAPE.lims[1], y, "y"),
+                            (SVSHAPE.lims[2], z, "z")
+                           ]
+                    # now select those by order.  this allows us to
+                    # create schedules for [z][x], [x][y], or [y][z]
+                    # for matrix multiply.
+                    vals = [vals[SVSHAPE.order[0]],
+                            vals[SVSHAPE.order[1]],
+                            vals[SVSHAPE.order[2]]
+                           ]
+                    # ok now we can construct the result, using bits of
+                    # "order" to say which ones get stacked on
+                    result = 0
+                    mult = 1
+                    for i in range(3):
+                        lim, idx, dbg = vals[i]
+                        # some of the dimensions can be "skipped".  the order
+                        # was actually selected above on all 3 dimensions,
+                        # e.g. [z][x][y] or [y][z][x].  "skip" allows one of
+                        # those to be knocked out
+                        if SVSHAPE.skip == i+1: continue
+                        idx *= mult   # shifts up by previous dimension(s)
+                        result += idx # adds on this dimension
+                        mult *= lim   # for the next dimension
+
+                    loopends = (x_end |
+                               ((y_end and x_end)<<1) |
+                                ((y_end and x_end and z_end)<<2))
+
+                    yield result + SVSHAPE.offset, loopends
+                    step += 1
+
+def demo():
+    # set the dimension sizes here
+    xdim = 3
+    ydim = 2
+    zdim = 4
+
+    # set total (can repeat, e.g. VL=x*y*z*4)
+    VL = xdim * ydim * zdim
+
+    # set up an SVSHAPE
+    class SVSHAPE:
+        pass
+    SVSHAPE0 = SVSHAPE()
+    SVSHAPE0.lims = [xdim, ydim, zdim]
+    SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
+    SVSHAPE0.mode = 0b00
+    SVSHAPE0.skip = 0b00
+    SVSHAPE0.offset = 0       # experiment with different offset, here
+    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
+
+    # enumerate over the iterator function, getting new indices
+    for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
+        if idx >= VL:
+            break
+        print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
+
+# run the demo
+if __name__ == '__main__':
+    demo()
+```
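
A small worked example (illustrative, not part of the specification) helps
confirm the behaviour: with `lims=[3,2,1]` and `order=[1,0,2]` the schedule
above computes offsets `y + x*ydim`, i.e. a 3x2 transpose.  The standalone
sketch below reproduces that schedule directly:

```python
# standalone cross-check (illustrative, not part of the specification):
# a 2D Matrix REMAP with lims=[3,2,1] and order=[1,0,2] swaps the
# significance of the x and y dimensions, producing a 3x2 transpose.
xdim, ydim = 3, 2

schedule = []
for y in range(ydim):          # 2nd-order loop
    for x in range(xdim):      # innermost loop
        # offset is y + x*ydim instead of the row-major x + y*xdim
        schedule.append(y + x * ydim)

print(schedule)  # [0, 2, 4, 1, 3, 5]
```

Reading sequential elements 0..5 through this schedule visits the matrix
column by column, which is what allows Vectorised matrix operations to
avoid a separate transpose pass.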
+
+### REMAP Parallel Reduction pseudocode
+
+The python3 program below is stand-alone executable and is the Canonical
+Specification for Parallel Reduction REMAP.  The algorithm is not limited
+to RADIX2 sizes.  Unlike in Matrix REMAP, Predicate sources apply to the
+Element Indices **after** REMAP has been applied, not before.  MV
+operations are not required: the algorithm tracks the positions of
+elements that would normally be moved, and when applying an Element
+Reduction Operation it sources the operands from their last-known
+(tracked) positions.
+
+```
+# a "yield" version of the Parallel Reduction REMAP algorithm. 
+# the algorithm is in-place. it does not perform "MV" operations.
+# instead, where a masked-out value *should* be read from is tracked
+
+def iterate_indices(SVSHAPE, pred=None):
+    # get indices to iterate over, in the required order
+    xd = SVSHAPE.lims[0]
+    # create lists of indices to iterate over in each dimension
+    ix = list(range(xd))
+    # invert the indices if needed
+    if SVSHAPE.invxyz[0]: ix.reverse()
+    # start a loop from the lowest step
+    step = 1 
+    steps = []
+    while step < xd:
+        step *= 2
+        steps.append(step)
+    # invert the indices if needed
+    if SVSHAPE.invxyz[1]: steps.reverse()
+    for step in steps:
+        stepend = (step == steps[-1]) # note end of steps
+        idxs = list(range(0, xd, step))
+        results = []
+        for i in idxs:
+            other = i + step // 2
+            ci = ix[i]
+            oi = ix[other] if other < xd else None
+            other_pred = other < xd and (pred is None or pred[oi])
+            if (pred is None or pred[ci]) and other_pred:
+                if SVSHAPE.skip == 0b00: # submode 00
+                    result = ci
+                elif SVSHAPE.skip == 0b01: # submode 01
+                    result = oi
+                results.append([result + SVSHAPE.offset, 0])
+            elif other_pred:
+                ix[i] = oi
+        if results:
+            results[-1][1] = (stepend<<1) | 1  # notify end of loops
+        yield from results
+
+def demo():
+    # set the dimension sizes here
+    xdim = 9
+
+    # set up an SVSHAPE
+    class SVSHAPE:
+        pass
+    SVSHAPE0 = SVSHAPE()
+    SVSHAPE0.lims = [xdim, 0, 0]
+    SVSHAPE0.order = [0,1,2]
+    SVSHAPE0.mode = 0b10
+    SVSHAPE0.skip = 0b00
+    SVSHAPE0.offset = 0       # experiment with different offset, here
+    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
+
+    SVSHAPE1 = SVSHAPE()
+    SVSHAPE1.lims = [xdim, 0, 0]
+    SVSHAPE1.order = [0,1,2]
+    SVSHAPE1.mode = 0b10
+    SVSHAPE1.skip = 0b01
+    SVSHAPE1.offset = 0       # experiment with different offset, here
+    SVSHAPE1.invxyz = [0,0,0] # inversion if desired
+
+    # enumerate over the iterator function, getting new indices
+    shapes = list(iterate_indices(SVSHAPE0)), \
+              list(iterate_indices(SVSHAPE1))
+    for idx in range(len(shapes[0])):
+        l = shapes[0][idx]
+        r = shapes[1][idx]
+        (l_idx, lend) = l
+        (r_idx, rend) = r
+        print ("%d->%d:%d" % (idx, l_idx, r_idx),
+               "end", bin(lend)[2:], bin(rend)[2:])
+
+# run the demo
+if __name__ == '__main__':
+    demo()
+```
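
The effect of the Schedule may be illustrated by applying the same
step-doubling pattern directly to data.  The sketch below is illustrative
only (it is not the canonical form above): it reduces a vector in place,
and non-power-of-two lengths work because a missing partner simply leaves
the element untouched.  The operator is assumed associative.

```python
# standalone illustration (not part of the specification): the
# step-doubling pattern used by Parallel Reduction REMAP, applied
# directly to data rather than to element indices.
def tree_reduce(vec, op):
    vec = list(vec)
    xd = len(vec)
    step = 1
    while step < xd:
        step *= 2
        for i in range(0, xd, step):
            other = i + step // 2
            if other < xd:          # partner exists: combine in place
                vec[i] = op(vec[i], vec[other])
    return vec[0]

print(tree_reduce(range(9), lambda a, b: a + b))  # 36
```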
+
+### REMAP FFT pseudocode
+
+The FFT REMAP is RADIX2 only.
+
+```
+# a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
+# original code for the FFT Tukey-Cooley schedule:
+# https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
+"""
+    # Radix-2 decimation-in-time FFT (real, not complex)
+    size = 2
+    while size <= n:
+        halfsize = size // 2
+        tablestep = n // size
+        for i in range(0, n, size):
+            k = 0
+            for j in range(i, i + halfsize):
+                jh = j+halfsize
+                jl = j
+                temp1 = vec[jh] * exptable[k]
+                temp2 = vec[jl]
+                vec[jh] = temp2 - temp1
+                vec[jl] = temp2 + temp1
+                k += tablestep
+        size *= 2
+"""
+
+# python "yield" can be iterated. use this to make it clear how
+# the indices are generated by using natural-looking nested loops
+def iterate_butterfly_indices(SVSHAPE):
+    # get indices to iterate over, in the required order
+    n = SVSHAPE.lims[0]
+    stride = SVSHAPE.lims[2] # stride-multiplier on reg access
+    # creating lists of indices to iterate over in each dimension
+    # has to be done dynamically, because it depends on the size
+    # first, the size-based loop (which can be done statically)
+    x_r = []
+    size = 2
+    while size <= n:
+        x_r.append(size)
+        size *= 2
+    # invert order if requested
+    if SVSHAPE.invxyz[0]: x_r.reverse()
+
+    if len(x_r) == 0:
+        return
+
+    # start an infinite (wrapping) loop
+    skip = 0
+    while True:
+        for size in x_r:           # loop over 3rd order dimension (size)
+            x_end = size == x_r[-1]
+            # y_r schedule depends on size
+            halfsize = size // 2
+            tablestep = n // size
+            y_r = []
+            for i in range(0, n, size):
+                y_r.append(i)
+            # invert if requested
+            if SVSHAPE.invxyz[1]: y_r.reverse()
+            for i in y_r:       # loop over 2nd order dimension
+                y_end = i == y_r[-1]
+                k_r = []
+                j_r = []
+                k = 0
+                for j in range(i, i+halfsize):
+                    k_r.append(k)
+                    j_r.append(j)
+                    k += tablestep
+                # invert if requested
+                if SVSHAPE.invxyz[2]: k_r.reverse()
+                if SVSHAPE.invxyz[2]: j_r.reverse()
+                for j, k in zip(j_r, k_r):   # loop over 1st order dimension
+                    z_end = j == j_r[-1]
+                    # now depending on MODE return the index
+                    if SVSHAPE.skip == 0b00:
+                        result = j              # for vec[j]
+                    elif SVSHAPE.skip == 0b01:
+                        result = j + halfsize   # for vec[j+halfsize]
+                    elif SVSHAPE.skip == 0b10:
+                        result = k              # for exptable[k]
+
+                    loopends = (z_end |
+                               ((y_end and z_end)<<1) |
+                                ((y_end and x_end and z_end)<<2))
+
+                    yield (result * stride) + SVSHAPE.offset, loopends
+
+def demo():
+    # set the dimension sizes here
+    xdim = 8
+    ydim = 0 # not needed
+    zdim = 1 # stride must be set to 1
+
+    # set total (VL).  for RADIX2 this is (n/2)*log2(n) butterflies,
+    # counted here by replicating the loop structure
+    VL = 0
+    size = 2
+    n = xdim
+    while size <= n:
+        halfsize = size // 2
+        tablestep = n // size
+        for i in range(0, n, size):
+            for j in range(i, i + halfsize):
+                VL += 1
+        size *= 2
+
+    # set up an SVSHAPE
+    class SVSHAPE:
+        pass
+    # j schedule
+    SVSHAPE0 = SVSHAPE()
+    SVSHAPE0.lims = [xdim, ydim, zdim]
+    SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
+    SVSHAPE0.mode = 0b01
+    SVSHAPE0.skip = 0b00
+    SVSHAPE0.offset = 0       # experiment with different offset, here
+    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
+    # j+halfstep schedule
+    SVSHAPE1 = SVSHAPE()
+    SVSHAPE1.lims = [xdim, ydim, zdim]
+    SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
+    SVSHAPE1.mode = 0b01
+    SVSHAPE1.skip = 0b01
+    SVSHAPE1.offset = 0       # experiment with different offset, here
+    SVSHAPE1.invxyz = [0,0,0] # inversion if desired
+    # k schedule
+    SVSHAPE2 = SVSHAPE()
+    SVSHAPE2.lims = [xdim, ydim, zdim]
+    SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
+    SVSHAPE2.mode = 0b01
+    SVSHAPE2.skip = 0b10
+    SVSHAPE2.offset = 0       # experiment with different offset, here
+    SVSHAPE2.invxyz = [0,0,0] # inversion if desired
+
+    # enumerate over the iterator function, getting new indices
+    schedule = []
+    for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
+                                          iterate_butterfly_indices(SVSHAPE1),
+                                          iterate_butterfly_indices(SVSHAPE2))):
+        if idx >= VL:
+            break
+        schedule.append((jl, jh, k))
+
+    # ok now pretty-print the results, with some debug output
+    size = 2
+    idx = 0
+    while size <= n:
+        halfsize = size // 2
+        tablestep = n // size
+        print ("size %d halfsize %d tablestep %d" % \
+                (size, halfsize, tablestep))
+        for i in range(0, n, size):
+            prefix = "i %d\t" % i
+            k = 0
+            for j in range(i, i + halfsize):
+                (jl, je), (jh, he), (ks, ke) = schedule[idx]
+                print ("  %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
+                        "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
+                                (idx, prefix, j, j+halfsize, k,
+                                      jl, jh, ks,
+                                ),
+                                "end", bin(je)[2:], bin(he)[2:], bin(ke)[2:])
+                k += tablestep
+                idx += 1
+        size *= 2
+
+# run the demo
+if __name__ == '__main__':
+    demo()
+```
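
The demo above counts VL by replicating the loop structure.  For a RADIX2
schedule the count also has a closed form, (n/2)*log2(n), which may be
used as a quick cross-check (illustrative only):

```python
# illustrative cross-check: the number of (j, j+halfsize, k) butterfly
# triples generated for n points is (n/2) * log2(n)
def count_butterflies(n):
    vl = 0
    size = 2
    while size <= n:
        halfsize = size // 2
        for i in range(0, n, size):
            vl += halfsize      # one butterfly per j in range(i, i+halfsize)
        size *= 2
    return vl

for n in [2, 4, 8, 16]:
    print(n, count_butterflies(n), (n // 2) * (n.bit_length() - 1))
```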
+
+### DCT REMAP
+
+DCT REMAP is RADIX2 only.  Convolutions may be applied as usual to
+create a non-RADIX2 DCT.  Combined with appropriate Twin-butterfly
+instructions, the algorithm below (written in python3) becomes part of
+an in-place, in-registers Vectorised DCT.  The algorithms work by
+loading data such that, as the nested loops progress, the result is
+sorted into correct sequential order.
+
+```
+# DCT "REMAP" scheduler to create an in-place iterative DCT.
+#
+
+# bits of the integer 'val' of width 'width' are reversed
+def reverse_bits(val, width):
+    result = 0
+    for _ in range(width):
+        result = (result << 1) | (val & 1)
+        val >>= 1
+    return result
+
+
+# iterative version of [recursively-applied] half-reversing
+# turns out this is Gray-Encoding.
+def halfrev2(vec, pre_rev=True):
+    res = []
+    for i in range(len(vec)):
+        if pre_rev:
+            res.append(vec[i ^ (i>>1)])
+        else:
+            ri = i
+            bl = i.bit_length()
+            for ji in range(1, bl):
+                ri ^= (i >> ji)
+            res.append(vec[ri])
+    return res
+
+
+def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
+    # get indices to iterate over, in the required order
+    n = SVSHAPE.lims[0]
+    mode = SVSHAPE.lims[1]
+    stride = SVSHAPE.lims[2]
+
+    # reference list for not needing to do data-swaps, just swap what
+    # *indices* are referenced (two levels of indirection at the moment)
+    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
+    ji = list(range(n))
+
+    levels = n.bit_length() - 1
+    ri = [reverse_bits(i, levels) for i in range(n)]
+
+    if SVSHAPE.mode == 0b01: # FFT, bitrev only
+        ji = [ji[ri[i]] for i in range(n)]
+    elif SVSHAPE.submode2 == 0b001:
+        ji = [ji[ri[i]] for i in range(n)]
+        ji = halfrev2(ji, True)
+    else:
+        ji = halfrev2(ji, False)
+        ji = [ji[ri[i]] for i in range(n)]
+
+    # invert order if requested
+    if SVSHAPE.invxyz[0]:
+        ji.reverse()
+
+    for i, jl in enumerate(ji):
+        y_end = jl == ji[-1]
+        yield jl * stride, (0b111 if y_end else 0b000)
+
+def iterate_dct_inner_costable_indices(SVSHAPE):
+    # get indices to iterate over, in the required order
+    n = SVSHAPE.lims[0]
+    mode = SVSHAPE.lims[1]
+    stride = SVSHAPE.lims[2]
+    # creating lists of indices to iterate over in each dimension
+    # has to be done dynamically, because it depends on the size
+    # first, the size-based loop (which can be done statically)
+    x_r = []
+    size = 2
+    while size <= n:
+        x_r.append(size)
+        size *= 2
+    # invert order if requested
+    if SVSHAPE.invxyz[0]:
+        x_r.reverse()
+
+    if len(x_r) == 0:
+        return
+
+    # start an infinite (wrapping) loop
+    skip = 0
+    z_end = 1 # doesn't exist in this, only 2 loops
+    k = 0
+    while True:
+        for size in x_r:           # loop over 3rd order dimension (size)
+            x_end = size == x_r[-1]
+            # y_r schedule depends on size
+            halfsize = size // 2
+            y_r = []
+            for i in range(0, n, size):
+                y_r.append(i)
+            # invert if requested
+            if SVSHAPE.invxyz[1]: y_r.reverse()
+            # two lists of half-range indices, e.g. j 0123, jr 7654
+            j = list(range(0, halfsize))
+            # invert if requested
+            if SVSHAPE.invxyz[2]: j.reverse()
+            # loop over 1st order dimension
+            for ci, jl in enumerate(j):
+                y_end = jl == j[-1]
+                # now depending on MODE return the index.  inner butterfly
+                if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+                    result = k  # offset into COS table
+                elif SVSHAPE.skip == 0b10: #
+                    result = ci # coefficient helper
+                elif SVSHAPE.skip == 0b11: #
+                    result = size # coefficient helper
+                loopends = (z_end |
+                           ((y_end and z_end)<<1) |
+                            ((y_end and x_end and z_end)<<2))
+
+                yield (result * stride) + SVSHAPE.offset, loopends
+                k += 1
+
+def iterate_dct_inner_butterfly_indices(SVSHAPE):
+    # get indices to iterate over, in the required order
+    n = SVSHAPE.lims[0]
+    mode = SVSHAPE.lims[1]
+    stride = SVSHAPE.lims[2]
+    # creating lists of indices to iterate over in each dimension
+    # has to be done dynamically, because it depends on the size
+    # first, the size-based loop (which can be done statically)
+    x_r = []
+    size = 2
+    while size <= n:
+        x_r.append(size)
+        size *= 2
+    # invert order if requested
+    if SVSHAPE.invxyz[0]:
+        x_r.reverse()
+
+    if len(x_r) == 0:
+        return
+
+    # reference (read/write) the in-place data in *reverse-bit-order*
+    ri = list(range(n))
+    if SVSHAPE.submode2 == 0b01:
+        levels = n.bit_length() - 1
+        ri = [ri[reverse_bits(i, levels)] for i in range(n)]
+
+    # reference list for not needing to do data-swaps, just swap what
+    # *indices* are referenced (two levels of indirection at the moment)
+    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
+    ji = list(range(n))
+    inplace_mode = True
+    if inplace_mode and SVSHAPE.submode2 == 0b01:
+        ji = halfrev2(ji, True)
+    if inplace_mode and SVSHAPE.submode2 == 0b11:
+        ji = halfrev2(ji, False)
+
+    # start an infinite (wrapping) loop
+    while True:
+        k = 0
+        k_start = 0
+        for size in x_r:           # loop over 3rd order dimension (size)
+            x_end = size == x_r[-1]
+            # y_r schedule depends on size
+            halfsize = size // 2
+            y_r = []
+            for i in range(0, n, size):
+                y_r.append(i)
+            # invert if requested
+            if SVSHAPE.invxyz[1]: y_r.reverse()
+            for i in y_r:       # loop over 2nd order dimension
+                y_end = i == y_r[-1]
+                # two lists of half-range indices, e.g. j 0123, jr 7654
+                j = list(range(i, i + halfsize))
+                jr = list(range(i+halfsize, i + size))
+                jr.reverse()
+                # invert if requested
+                if SVSHAPE.invxyz[2]:
+                    j.reverse()
+                    jr.reverse()
+                hz2 = halfsize // 2 # zero stops reversing 1-item lists
+                # loop over 1st order dimension
+                k = k_start
+                for ci, (jl, jh) in enumerate(zip(j, jr)):
+                    z_end = jl == j[-1]
+                    # now depending on MODE return the index.  inner butterfly
+                    if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+                        if SVSHAPE.submode2 == 0b11: # iDCT
+                            result = ji[ri[jl]]        # lower half
+                        else:
+                            result = ri[ji[jl]]        # lower half
+                    elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
+                        if SVSHAPE.submode2 == 0b11: # iDCT
+                            result = ji[ri[jl+halfsize]] # upper half
+                        else:
+                            result = ri[ji[jh]] # upper half
+                    elif mode == 4:
+                        # COS table pre-generated mode
+                        if SVSHAPE.skip == 0b10: #
+                            result = k # cos table offset
+                    else: # mode 2
+                        # COS table generated on-demand ("Vertical-First") mode
+                        if SVSHAPE.skip == 0b10: #
+                            result = ci # coefficient helper
+                        elif SVSHAPE.skip == 0b11: #
+                            result = size # coefficient helper
+                    loopends = (z_end |
+                               ((y_end and z_end)<<1) |
+                                ((y_end and x_end and z_end)<<2))
+
+                    yield (result * stride) + SVSHAPE.offset, loopends
+                    k += 1
+
+                # now in-place swap
+                if inplace_mode:
+                    for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
+                        jlh = jl+halfsize
+                        tmp1, tmp2 = ji[jlh], ji[jh]
+                        ji[jlh], ji[jh] = tmp2, tmp1
+
+            # new k_start point for cos tables (runs inside x_r loop NOT i loop)
+            k_start += halfsize
+
+
+# python "yield" can be iterated. use this to make it clear how
+# the indices are generated by using natural-looking nested loops
+def iterate_dct_outer_butterfly_indices(SVSHAPE):
+    # get indices to iterate over, in the required order
+    n = SVSHAPE.lims[0]
+    mode = SVSHAPE.lims[1]
+    stride = SVSHAPE.lims[2]
+    # creating lists of indices to iterate over in each dimension
+    # has to be done dynamically, because it depends on the size
+    # first, the size-based loop (which can be done statically)
+    x_r = []
+    size = n // 2
+    while size >= 2:
+        x_r.append(size)
+        size //= 2
+    # invert order if requested
+    if SVSHAPE.invxyz[0]:
+        x_r.reverse()
+
+    if len(x_r) == 0:
+        return
+
+    # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
+    ri = list(range(n))
+    if SVSHAPE.submode2 in [0b11, 0b01]:
+        levels = n.bit_length() - 1
+        ri = [ri[reverse_bits(i, levels)] for i in range(n)]
+
+    # reference list for not needing to do data-swaps, just swap what
+    # *indices* are referenced (two levels of indirection at the moment)
+    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
+    ji = list(range(n))
+    inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
+    if SVSHAPE.submode2 == 0b11:
+        ji = halfrev2(ji, False)
+
+    # start an infinite (wrapping) loop
+    while True:
+        k = 0
+        k_start = 0
+        for size in x_r:           # loop over 3rd order dimension (size)
+            halfsize = size//2
+            x_end = size == x_r[-1]
+            y_r = list(range(0, halfsize))
+            # invert if requested
+            if SVSHAPE.invxyz[1]: y_r.reverse()
+            for i in y_r:       # loop over 2nd order dimension
+                y_end = i == y_r[-1]
+                # one list to create iterative-sum schedule
+                jr = list(range(i+halfsize, i+n-halfsize, size))
+                # invert if requested
+                if SVSHAPE.invxyz[2]: jr.reverse()
+                hz2 = halfsize // 2 # zero stops reversing 1-item lists
+                k = k_start
+                for ci, jh in enumerate(jr):   # loop over 1st order dimension
+                    z_end = jh == jr[-1]
+                    if mode == 4:
+                        # COS table pre-generated mode
+                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+                            if SVSHAPE.submode2 == 0b11: # iDCT
+                                result = ji[ri[jh]] # upper half
+                            else:
+                                result = ri[ji[jh]]        # lower half
+                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
+                            if SVSHAPE.submode2 == 0b11: # iDCT
+                                result = ji[ri[jh+size]] # upper half
+                            else:
+                                result = ri[ji[jh+size]] # upper half
+                        elif SVSHAPE.skip == 0b10: #
+                            result = k # cos table offset
+                    else:
+                        # COS table generated on-demand ("Vertical-First") mode
+                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+                            if SVSHAPE.submode2 == 0b11: # iDCT
+                                result = ji[ri[jh]]        # lower half
+                            else:
+                                result = ri[ji[jh]]        # lower half
+                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
+                            if SVSHAPE.submode2 == 0b11: # iDCT
+                                result = ji[ri[jh+size]] # upper half
+                            else:
+                                result = ri[ji[jh+size]] # upper half
+                        elif SVSHAPE.skip == 0b10: #
+                            result = ci # coefficient helper
+                        elif SVSHAPE.skip == 0b11: #
+                            result = size # coefficient helper
+                    loopends = (z_end |
+                               ((y_end and z_end)<<1) |
+                                ((y_end and x_end and z_end)<<2))
+
+                    yield (result * stride) + SVSHAPE.offset, loopends
+                    k += 1
+
+            # new k_start point for cos tables (runs inside x_r loop NOT i loop)
+            k_start += halfsize
+
+```
+
+## REMAP selector
+
+The pseudocode below shows how the REMAP Schedule is selected.  Each
+SVSHAPE (0-3) goes through this selection process.
+
+```
+  if SVSHAPEn.mode == 0b00:             iterate_fn = iterate_indices
+  if SVSHAPEn.mode == 0b10:             iterate_fn = iterate_preduce_indices
+  if SVSHAPEn.mode in [0b01, 0b11]:
+    # further sub-selection
+    if SVSHAPEn.ydimsz == 1:            iterate_fn = iterate_butterfly_indices
+    if SVSHAPEn.ydimsz == 2:            iterate_fn = iterate_dct_inner_butterfly_indices
+    if SVSHAPEn.ydimsz == 3:            iterate_fn = iterate_dct_outer_butterfly_indices
+    if SVSHAPEn.ydimsz in [5, 13]:      iterate_fn = iterate_dct_inner_costable_indices
+    if SVSHAPEn.ydimsz in [6, 14, 15]:  iterate_fn = iterate_dct_inner_halfswap_loadstore
+```
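
For clarity the same selection may also be expressed as an executable
function.  The sketch below is illustrative only: it returns the *name* of
the generator that the pseudocode above would select (the `iterate_*`
functions themselves are the canonical generators given earlier).

```python
# illustrative, executable version of the REMAP Schedule selector.
# returns the name of the canonical generator for a given SVSHAPE
# mode and y-dimension size, or None if no Schedule matches.
def select_iterate_fn(mode, ydimsz):
    if mode == 0b00:
        return "iterate_indices"              # Matrix
    if mode == 0b10:
        return "iterate_preduce_indices"      # Parallel Reduction
    if mode in [0b01, 0b11]:                  # FFT/DCT family
        if ydimsz == 1:
            return "iterate_butterfly_indices"
        if ydimsz == 2:
            return "iterate_dct_inner_butterfly_indices"
        if ydimsz == 3:
            return "iterate_dct_outer_butterfly_indices"
        if ydimsz in [5, 13]:
            return "iterate_dct_inner_costable_indices"
        if ydimsz in [6, 14, 15]:
            return "iterate_dct_inner_halfswap_loadstore"
    return None
```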
+
 
 [[!tag opf_rfc]]