-# RFC ls009 SVP64 REMAP instructions
+# RFC ls009 Simple-V REMAP Subsystem
**URLs**:
-* <https://libre-soc.org/openpower/sv/>
* <https://libre-soc.org/openpower/sv/rfc/ls009/>
-* <https://bugs.libre-soc.org/show_bug.cgi?id=1040>
-* <https://git.openpower.foundation/isa/PowerISA/issues/87>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
+* <https://git.openpower.foundation/isa/PowerISA/issues/124>
**Severity**: Major
**Status**: New
-**Date**: 24 Mar 2023
+**Date**: 26 Mar 2023
**Target**: v3.2B
**Notes and Observations**:
-1. TODO
+1. Although costly, the alternatives in the SIMD paradigm result in
+   huge algorithmic complexity in software and in associated increases
+   in power consumption. Loop-unrolling compilers are prevalent, as are
+   routines of thousands to tens of thousands of instructions.
+2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
+ to a handful of instructions.
+3. With the exception of Indexed REMAP, no REMAP instruction relies on
+   registers for the establishment of REMAP capability.
+4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
+ and flexibility of REMAP.
+5. The Simple-V Compliancy Subsets make REMAP optional except in the
+ Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
+ hardware to implement REMAP.
**Changes**
Add the following entries to:
-* the Appendices of Book I
-* Instructions of Book I as a new Section
-* TODO-Form of Book I Section 1.6.1.6 and 1.6.2
+* the Appendices of SV Book
+* Instructions of SV Book as a new Section
+* SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
----------------
\newpage{}
-# svstep: Vertical-First Stepping and status reporting
+[[!inline pages="openpower/sv/remap" raw=yes ]]
-SVL-Form
+# Forms
-* svstep RT,SVi,vf (Rc=0)
-* svstep. RT,SVi,vf (Rc=1)
+Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
-| 0-5|6-10|11.15|16..22| 23-25 | 26-30 |31| Form |
-|----|----|-----|------|----------|-------|--|--------- |
-|PO | RT | / | SVi | / / vf | XO |Rc| SVL-Form |
-
-Pseudo-code:
+Add the following to Book I, 1.6.1, SVI-Form
```
- if SVi[3:4] = 0b11 then
- # store pack and unpack in SVSTATE
- SVSTATE[53] <- SVi[5]
- SVSTATE[54] <- SVi[6]
- RT <- [0]*62 || SVSTATE[53:54]
- else
- # Vertical-First explicit stepping.
- step <- SVSTATE_NEXT(SVi, vf)
- RT <- [0]*57 || step
+ |0 |6 |11 |16 |21 |23 |24|25|26 |31 |
+ | PO | SVG|rmm | SVd |ew |SVyx|mm|sk| XO |
```
-Special Registers Altered:
-
- CR0 (if Rc=1)
-
-**Description**
+Add the following to Book I, 1.6.1, SVM-Form
+```
+ |0 |6 |11 |16 |21 |25 |26 |31 |
+ | PO | SVxd | SVyd | SVzd | SVrm |vf | XO |
+```
--------------
-
-\newpage{}
-
-
--------------
-
-\newpage{}
+Add the following to Book I, 1.6.1, SVM2-Form
-# SVL-Form
+```
+ |0 |6 |10 |11 |16 |21 |24|25 |26 |31 |
+ | PO | SVo |SVyx| rmm | SVd |XO |mm|sk | XO |
+```
-Add the following to Book I, 1.6.1, SVL-Form
+Add the following to Book I, 1.6.1, SVRM-Form
```
- |0 |6 |11 |16 |23 |24 |25 |26 |31 |
- | PO | RT | RA | SVi |ms |vs |vf | XO |Rc |
- | PO | RT | / | SVi |/ |/ |vf | XO |Rc |
+ |0 |6 |11 |13 |15 |17 |19 |21 |22 |26 |31 |
+ | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |/// | XO |
```
-* Add `SVL` to `RA (11:15)` Field in Book I, 1.6.2
-* Add `SVL` to `RT (6:10)` Field in Book I, 1.6.2
-* Add `SVL` to `Rc (31)` Field in Book I, 1.6.2
-* Add `SVL` to `XO (26:31)` Field in Book I, 1.6.2
-
Add the following to Book I, 1.6.2
```
- ms (23)
- Field used in Simple-V to specify whether MVL (maxvl in the SVSTATE SPR)
- is to be set
- Formats: SVL
- vf (25)
- Field used in Simple-V to specify whether "Vertical" Mode is set
- (vfirst in the SVSTATE SPR)
- Formats: SVL
- vs (24)
- Field used in Simple-V to specify whether VL (vl in the SVSTATE SPR) is to be set
- Formats: SVL
+ mi0 (11:12)
+ Field used in REMAP to select the SVSHAPE for 1st input register
+ Formats: SVRM
+ mi1 (13:14)
+ Field used in REMAP to select the SVSHAPE for 2nd input register
+ Formats: SVRM
+ mi2 (15:16)
+ Field used in REMAP to select the SVSHAPE for 3rd input register
+ Formats: SVRM
+ mm (24)
+ Field used to specify the meaning of the rmm field for SVI-Form
+ and SVM2-Form
+ Formats: SVI, SVM2
+ mo0 (17:18)
+ Field used in REMAP to select the SVSHAPE for 1st output register
+ Formats: SVRM
+ mo1 (19:20)
+ Field used in REMAP to select the SVSHAPE for 2nd output register
+ Formats: SVRM
+ pst (21)
+ Field used in REMAP to indicate "persistence" mode (REMAP
+ continues to apply to multiple instructions)
+ Formats: SVRM
+ rmm (11:15)
+ REMAP Mode field for SVI-Form and SVM2-Form
+ Formats: SVI, SVM2
+ sk (25)
+ Field used to specify dimensional skipping in svindex
+ Formats: SVI, SVM2
+ SVd (16:20)
+ Immediate field used to specify the size of the REMAP dimension
+ in the svindex and svshape2 instructions
+ Formats: SVI, SVM2
+    SVDS (16:29)
+        Immediate field used to specify a 14-bit signed
+        two's complement integer which is concatenated
+        on the right with 0b00 and sign-extended to 64 bits.
+        Formats: SVDS
+ SVG (6:10)
+ Field used to specify a GPR to be used as a
+ source for indexing.
+ Formats: SVI
SVi (16:22)
- Simple-V immediate field used by setvl for setting VL or MVL
- (vl, maxvl in the SVSTATE SPR)
- and used as a "Mode of Operation" selector in svstep
+ Simple-V immediate field for setting VL or MVL
Formats: SVL
+ SVme (6:10)
+ Simple-V "REMAP" map-enable bits (0-4)
+ Formats: SVRM
+ SVo (6:9)
+ Field used by the svshape2 instruction as an offset
+ Formats: SVM2
+ SVrm (21:24)
+ Simple-V "REMAP" Mode
+ Formats: SVM
+ SVxd (6:10)
+ Simple-V "REMAP" x-dimension size
+ Formats: SVM
+ SVyd (11:15)
+ Simple-V "REMAP" y-dimension size
+ Formats: SVM
+ SVzd (16:20)
+ Simple-V "REMAP" z-dimension size
+ Formats: SVM
+    XO (21:23,26:31)
+        Extended opcode field. Note that bit 21 must be 1, bits 22
+        and 23 must be zero, and bits 26-31 must be identical to those
+        used for svshape.
+        Formats: SVM2
```
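
The MSB0 bit numbering used by these field definitions is easy to get wrong in software models. As a sanity check, here is a minimal sketch (the `field` helper and the test word are illustrative, not part of the specification) that extracts SVRM-Form fields from a 32-bit instruction word:

```python
# Power ISA uses MSB0 numbering: bit 0 is the most significant bit
# of the 32-bit instruction word.
def field(insn, start, end):
    """extract MSB0 field start..end (inclusive) from a 32-bit word"""
    width = end - start + 1
    shift = 32 - 1 - end  # distance of the field's LSB from the word's LSB
    return (insn >> shift) & ((1 << width) - 1)

# build a hypothetical SVRM-Form word: SVme=0b11111, mi0=0b01, mi1=0b10,
# mi2=0b11, mo0=0b01, mo1=0b10, pst=1 (values chosen only for this test)
insn = 0
for start, end, value in [(6, 10, 0b11111), (11, 12, 0b01), (13, 14, 0b10),
                          (15, 16, 0b11), (17, 18, 0b01), (19, 20, 0b10),
                          (21, 21, 1)]:
    insn |= value << (32 - 1 - end)

assert field(insn, 6, 10) == 0b11111  # SVme
assert field(insn, 11, 12) == 0b01    # mi0
assert field(insn, 15, 16) == 0b11    # mi2
assert field(insn, 21, 21) == 1       # pst
```
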
# Appendices
| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
-| SVL | I | # | 3.0B | svstep | Vertical-First Stepping and status reporting |
-| SVL | I | # | 3.0B | setvl | Cray-like establishment of Looping (Vector) context |
+| SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
+| SVM | I | # | 3.0B | svshape | REMAP shape instruction |
+| SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
+| SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |
+
+[[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
+
+## REMAP pseudocode
+
+Written in python3, the following stand-alone executable source code is
+the Canonical Specification for each REMAP. Vectors of "loopends" are
+returned in Vectors of CR Fields when Rc=1 on `sv.svstep.`, or, in
+Vertical-First Mode, in a single CR Field (CR0) on `svstep.`. The
+`SVSTATE.srcstep` or `SVSTATE.dststep` sequential offset is put through
+each algorithm to determine the actual Element Offset. Alternative
+implementations producing different orderings are prohibited, as
+software will critically rely on these Deterministic Schedules.
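
The "loopends" packing (bit 0 for the innermost loop, each higher bit set only when all inner loops have also ended) can be illustrated in isolation. The helper below is purely illustrative, following the Matrix schedule's x/y/z naming with x innermost:

```python
# loopends packs one bit per loop level: bit 0 = innermost (x) loop
# ended, bit 1 = middle (y) loop also ended, bit 2 = outer (z) too
def loopends(x_end, y_end, z_end):
    return (x_end |
            ((y_end and x_end) << 1) |
            ((y_end and x_end and z_end) << 2))

assert loopends(0, 0, 0) == 0b000  # mid-row: no loop ended
assert loopends(1, 0, 0) == 0b001  # end of inner row only
assert loopends(1, 1, 0) == 0b011  # inner and middle loops ended
assert loopends(1, 1, 1) == 0b111  # all three dimensions complete
```
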
+
+### REMAP 2D/3D Matrix
+
+The following stand-alone executable source code is the Canonical
+Specification for Matrix (2D/3D) REMAP.
+Hardware implementations are achievable with simple cascading counter-and-compares.
+
+```
+# python "yield" can be iterated. use this to make it clear how
+# the indices are generated by using natural-looking nested loops
+def iterate_indices(SVSHAPE):
+ # get indices to iterate over, in the required order
+ xd = SVSHAPE.lims[0]
+ yd = SVSHAPE.lims[1]
+ zd = SVSHAPE.lims[2]
+ # create lists of indices to iterate over in each dimension
+ x_r = list(range(xd))
+ y_r = list(range(yd))
+ z_r = list(range(zd))
+ # invert the indices if needed
+ if SVSHAPE.invxyz[0]: x_r.reverse()
+ if SVSHAPE.invxyz[1]: y_r.reverse()
+ if SVSHAPE.invxyz[2]: z_r.reverse()
+ # start an infinite (wrapping) loop
+ step = 0 # track src/dst step
+ while True:
+ for z in z_r: # loop over 1st order dimension
+ z_end = z == z_r[-1]
+ for y in y_r: # loop over 2nd order dimension
+ y_end = y == y_r[-1]
+ for x in x_r: # loop over 3rd order dimension
+ x_end = x == x_r[-1]
+ # ok work out which order to construct things in.
+ # start by creating a list of tuples of the dimension
+ # and its limit
+ vals = [(SVSHAPE.lims[0], x, "x"),
+ (SVSHAPE.lims[1], y, "y"),
+ (SVSHAPE.lims[2], z, "z")
+ ]
+ # now select those by order. this allows us to
+ # create schedules for [z][x], [x][y], or [y][z]
+ # for matrix multiply.
+ vals = [vals[SVSHAPE.order[0]],
+ vals[SVSHAPE.order[1]],
+ vals[SVSHAPE.order[2]]
+ ]
+ # ok now we can construct the result, using bits of
+ # "order" to say which ones get stacked on
+ result = 0
+ mult = 1
+ for i in range(3):
+ lim, idx, dbg = vals[i]
+ # some of the dimensions can be "skipped". the order
+ # was actually selected above on all 3 dimensions,
+ # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
+ # those to be knocked out
+ if SVSHAPE.skip == i+1: continue
+ idx *= mult # shifts up by previous dimension(s)
+ result += idx # adds on this dimension
+ mult *= lim # for the next dimension
+
+ loopends = (x_end |
+ ((y_end and x_end)<<1) |
+ ((y_end and x_end and z_end)<<2))
+
+ yield result + SVSHAPE.offset, loopends
+ step += 1
+
+def demo():
+ # set the dimension sizes here
+ xdim = 3
+ ydim = 2
+ zdim = 4
+
+ # set total (can repeat, e.g. VL=x*y*z*4)
+ VL = xdim * ydim * zdim
+
+ # set up an SVSHAPE
+ class SVSHAPE:
+ pass
+ SVSHAPE0 = SVSHAPE()
+ SVSHAPE0.lims = [xdim, ydim, zdim]
+ SVSHAPE0.order = [1,0,2] # experiment with different permutations, here
+ SVSHAPE0.mode = 0b00
+ SVSHAPE0.skip = 0b00
+ SVSHAPE0.offset = 0 # experiment with different offset, here
+ SVSHAPE0.invxyz = [0,0,0] # inversion if desired
+
+ # enumerate over the iterator function, getting new indices
+ for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
+ if idx >= VL:
+ break
+ print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
+
+# run the demo
+if __name__ == '__main__':
+ demo()
+```
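
As a usage note: with `order=[1,0,2]` (as in the demo) and a single two-dimensional shape, the generated offsets are exactly a matrix transpose of the sequential element order. A minimal 2D-only restatement of the offset arithmetic (illustrative, not the canonical generator above):

```python
# sequential step i walks x fastest; with order [1,0,2] the REMAP
# offset accumulates y first (mult 1) then x (mult ydim), i.e. a
# 2D transpose of the sequential element order
def remap_transpose(i, xdim, ydim):
    x = i % xdim            # innermost (fastest-moving) index
    y = (i // xdim) % ydim  # next-outer index
    return x * ydim + y     # x now strides by ydim: transposed order

xdim, ydim = 3, 2
remapped = [remap_transpose(i, xdim, ydim) for i in range(xdim * ydim)]
# element offsets now follow column-major (transposed) order
assert remapped == [0, 2, 4, 1, 3, 5]
```
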
+
+### REMAP Parallel Reduction pseudocode
+
+The python3 program below is stand-alone and executable, and is the
+Canonical Specification for Parallel Reduction REMAP.
+The Algorithm is not limited to RADIX2 sizes. Predicate
+sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
+has been applied, not before. MV operations are not required: the algorithm
+tracks the positions of elements that would normally be moved and, when
+applying an Element Reduction Operation, sources the operands from their
+last-known (tracked) positions.
+
+```
+# a "yield" version of the Parallel Reduction REMAP algorithm.
+# the algorithm is in-place. it does not perform "MV" operations.
+# instead, where a masked-out value *should* be read from is tracked
+
+def iterate_indices(SVSHAPE, pred=None):
+ # get indices to iterate over, in the required order
+ xd = SVSHAPE.lims[0]
+ # create lists of indices to iterate over in each dimension
+ ix = list(range(xd))
+ # invert the indices if needed
+ if SVSHAPE.invxyz[0]: ix.reverse()
+ # start a loop from the lowest step
+ step = 1
+ steps = []
+ while step < xd:
+ step *= 2
+ steps.append(step)
+ # invert the indices if needed
+ if SVSHAPE.invxyz[1]: steps.reverse()
+ for step in steps:
+ stepend = (step == steps[-1]) # note end of steps
+ idxs = list(range(0, xd, step))
+ results = []
+ for i in idxs:
+ other = i + step // 2
+ ci = ix[i]
+ oi = ix[other] if other < xd else None
+ other_pred = other < xd and (pred is None or pred[oi])
+ if (pred is None or pred[ci]) and other_pred:
+ if SVSHAPE.skip == 0b00: # submode 00
+ result = ci
+ elif SVSHAPE.skip == 0b01: # submode 01
+ result = oi
+ results.append([result + SVSHAPE.offset, 0])
+ elif other_pred:
+ ix[i] = oi
+ if results:
+ results[-1][1] = (stepend<<1) | 1 # notify end of loops
+ yield from results
+
+def demo():
+ # set the dimension sizes here
+ xdim = 9
+
+ # set up an SVSHAPE
+ class SVSHAPE:
+ pass
+ SVSHAPE0 = SVSHAPE()
+ SVSHAPE0.lims = [xdim, 0, 0]
+ SVSHAPE0.order = [0,1,2]
+ SVSHAPE0.mode = 0b10
+ SVSHAPE0.skip = 0b00
+ SVSHAPE0.offset = 0 # experiment with different offset, here
+ SVSHAPE0.invxyz = [0,0,0] # inversion if desired
+
+ SVSHAPE1 = SVSHAPE()
+ SVSHAPE1.lims = [xdim, 0, 0]
+ SVSHAPE1.order = [0,1,2]
+ SVSHAPE1.mode = 0b10
+ SVSHAPE1.skip = 0b01
+ SVSHAPE1.offset = 0 # experiment with different offset, here
+ SVSHAPE1.invxyz = [0,0,0] # inversion if desired
+
+ # enumerate over the iterator function, getting new indices
+ shapes = list(iterate_indices(SVSHAPE0)), \
+ list(iterate_indices(SVSHAPE1))
+ for idx in range(len(shapes[0])):
+ l = shapes[0][idx]
+ r = shapes[1][idx]
+ (l_idx, lend) = l
+ (r_idx, rend) = r
+ print ("%d->%d:%d" % (idx, l_idx, r_idx),
+ "end", bin(lend)[2:], bin(rend)[2:])
+
+# run the demo
+if __name__ == '__main__':
+ demo()
+```
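
The tracked-position behaviour can be checked against an ordinary predicated sum. Below is a simplified restatement of the schedule applied directly as a reduction; `parallel_reduce` is an illustrative helper name, not an instruction, and the operator is applied in-place exactly where the schedule would place it:

```python
def parallel_reduce(vec, pred, op):
    # in-place tree reduction: masked-out elements are never moved;
    # their *positions* are redirected instead (no MV operations)
    n = len(vec)
    ix = list(range(n))  # tracked element positions
    step = 1
    while step < n:
        step *= 2
        for i in range(0, n, step):
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < n else None
            other_ok = other < n and (pred is None or pred[oi])
            if (pred is None or pred[ci]) and other_ok:
                vec[ci] = op(vec[ci], vec[oi])  # reduce into lower position
            elif other_ok:
                ix[i] = oi                      # redirect, no data move
    return vec[ix[0]]

data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
mask = [1, 0, 1, 1, 0, 1, 1, 1, 1]
expect = sum(d for d, m in zip(data, mask) if m)
assert parallel_reduce(data[:], mask, lambda a, b: a + b) == expect
```
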
+
+### REMAP FFT pseudocode
+
+The FFT REMAP is RADIX2 only.
+
+```
+# a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
+# original code for the FFT Tukey-Cooley schedule:
+# https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
+"""
+ # Radix-2 decimation-in-time FFT (real, not complex)
+ size = 2
+ while size <= n:
+ halfsize = size // 2
+ tablestep = n // size
+ for i in range(0, n, size):
+ k = 0
+ for j in range(i, i + halfsize):
+ jh = j+halfsize
+ jl = j
+ temp1 = vec[jh] * exptable[k]
+ temp2 = vec[jl]
+ vec[jh] = temp2 - temp1
+ vec[jl] = temp2 + temp1
+ k += tablestep
+ size *= 2
+"""
+
+# python "yield" can be iterated. use this to make it clear how
+# the indices are generated by using natural-looking nested loops
+def iterate_butterfly_indices(SVSHAPE):
+ # get indices to iterate over, in the required order
+ n = SVSHAPE.lims[0]
+ stride = SVSHAPE.lims[2] # stride-multiplier on reg access
+ # creating lists of indices to iterate over in each dimension
+ # has to be done dynamically, because it depends on the size
+ # first, the size-based loop (which can be done statically)
+ x_r = []
+ size = 2
+ while size <= n:
+ x_r.append(size)
+ size *= 2
+ # invert order if requested
+ if SVSHAPE.invxyz[0]: x_r.reverse()
+
+ if len(x_r) == 0:
+ return
+
+ # start an infinite (wrapping) loop
+ skip = 0
+ while True:
+ for size in x_r: # loop over 3rd order dimension (size)
+ x_end = size == x_r[-1]
+ # y_r schedule depends on size
+ halfsize = size // 2
+ tablestep = n // size
+ y_r = []
+ for i in range(0, n, size):
+ y_r.append(i)
+ # invert if requested
+ if SVSHAPE.invxyz[1]: y_r.reverse()
+ for i in y_r: # loop over 2nd order dimension
+ y_end = i == y_r[-1]
+ k_r = []
+ j_r = []
+ k = 0
+ for j in range(i, i+halfsize):
+ k_r.append(k)
+ j_r.append(j)
+ k += tablestep
+ # invert if requested
+ if SVSHAPE.invxyz[2]: k_r.reverse()
+ if SVSHAPE.invxyz[2]: j_r.reverse()
+ for j, k in zip(j_r, k_r): # loop over 1st order dimension
+ z_end = j == j_r[-1]
+ # now depending on MODE return the index
+ if SVSHAPE.skip == 0b00:
+ result = j # for vec[j]
+ elif SVSHAPE.skip == 0b01:
+ result = j + halfsize # for vec[j+halfsize]
+ elif SVSHAPE.skip == 0b10:
+ result = k # for exptable[k]
+
+ loopends = (z_end |
+ ((y_end and z_end)<<1) |
+ ((y_end and x_end and z_end)<<2))
+
+ yield (result * stride) + SVSHAPE.offset, loopends
+
+def demo():
+ # set the dimension sizes here
+ xdim = 8
+ ydim = 0 # not needed
+ zdim = 1 # stride must be set to 1
+
+ # set total. err don't know how to calculate how many there are...
+ # do it manually for now
+ VL = 0
+ size = 2
+ n = xdim
+ while size <= n:
+ halfsize = size // 2
+ tablestep = n // size
+ for i in range(0, n, size):
+ for j in range(i, i + halfsize):
+ VL += 1
+ size *= 2
+
+ # set up an SVSHAPE
+ class SVSHAPE:
+ pass
+ # j schedule
+ SVSHAPE0 = SVSHAPE()
+ SVSHAPE0.lims = [xdim, ydim, zdim]
+ SVSHAPE0.order = [0,1,2] # experiment with different permutations, here
+ SVSHAPE0.mode = 0b01
+ SVSHAPE0.skip = 0b00
+ SVSHAPE0.offset = 0 # experiment with different offset, here
+ SVSHAPE0.invxyz = [0,0,0] # inversion if desired
+ # j+halfstep schedule
+ SVSHAPE1 = SVSHAPE()
+ SVSHAPE1.lims = [xdim, ydim, zdim]
+ SVSHAPE1.order = [0,1,2] # experiment with different permutations, here
+    SVSHAPE1.mode = 0b01
+ SVSHAPE1.skip = 0b01
+ SVSHAPE1.offset = 0 # experiment with different offset, here
+ SVSHAPE1.invxyz = [0,0,0] # inversion if desired
+ # k schedule
+ SVSHAPE2 = SVSHAPE()
+ SVSHAPE2.lims = [xdim, ydim, zdim]
+ SVSHAPE2.order = [0,1,2] # experiment with different permutations, here
+    SVSHAPE2.mode = 0b01
+ SVSHAPE2.skip = 0b10
+ SVSHAPE2.offset = 0 # experiment with different offset, here
+ SVSHAPE2.invxyz = [0,0,0] # inversion if desired
+
+ # enumerate over the iterator function, getting new indices
+ schedule = []
+ for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
+ iterate_butterfly_indices(SVSHAPE1),
+ iterate_butterfly_indices(SVSHAPE2))):
+ if idx >= VL:
+ break
+ schedule.append((jl, jh, k))
+
+ # ok now pretty-print the results, with some debug output
+ size = 2
+ idx = 0
+ while size <= n:
+ halfsize = size // 2
+ tablestep = n // size
+ print ("size %d halfsize %d tablestep %d" % \
+ (size, halfsize, tablestep))
+ for i in range(0, n, size):
+ prefix = "i %d\t" % i
+ k = 0
+ for j in range(i, i + halfsize):
+                (jl, je), (jh, he), (ks, ke) = schedule[idx]
+                print (" %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
+                       "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
+                       (idx, prefix, j, j+halfsize, k,
+                        jl, jh, ks,
+                       ),
+                       "end", bin(je)[2:], bin(he)[2:], bin(ke)[2:])
+ k += tablestep
+ idx += 1
+ size *= 2
+
+# run the demo
+if __name__ == '__main__':
+ demo()
+```
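
Driving real butterflies from the flat (jl, jh, k) stream can be cross-checked against a naive DFT. The sketch below is purely illustrative: it re-derives the schedule with the reference Cooley-Tukey triple loop (rather than the generator above), assumes a power-of-two n, and uses Python complex arithmetic:

```python
import cmath

def bit_reverse(val, width):
    # reverse the low 'width' bits of val
    r = 0
    for _ in range(width):
        r = (r << 1) | (val & 1)
        val >>= 1
    return r

def fft_schedule(n):
    # flat (jl, jh, k) stream: the same three nested loops as the
    # reference Cooley-Tukey code, i.e. what the three SVSHAPEs generate
    sched = []
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                sched.append((j, j + halfsize, k))
                k += tablestep
        size *= 2
    return sched

n = 8
exptable = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
vec = [complex(i + 1) for i in range(n)]  # test data 1..8
# decimation-in-time: operate on bit-reversed input order
data = [vec[bit_reverse(i, n.bit_length() - 1)] for i in range(n)]
# drive every butterfly purely from the flat schedule
for jl, jh, k in fft_schedule(n):
    temp1 = data[jh] * exptable[k]
    temp2 = data[jl]
    data[jh] = temp2 - temp1
    data[jl] = temp2 + temp1
# cross-check against a naive O(n^2) DFT
for f in range(n):
    ref = sum(vec[t] * cmath.exp(-2j * cmath.pi * f * t / n)
              for t in range(n))
    assert abs(data[f] - ref) < 1e-9
```
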
+
+### DCT REMAP
+
+DCT REMAP is RADIX2-only. Convolutions may be applied as usual
+to create a non-RADIX2 DCT. Combined with appropriate Twin-Butterfly
+instructions, the algorithm below (written in python3) becomes part
+of an in-place, in-registers Vectorised DCT. The algorithms work
+by loading data such that, as the nested loops progress, the result
+is sorted into correct sequential order.
+
+```
+# DCT "REMAP" scheduler to create an in-place iterative DCT.
+#
+
+# bits of the integer 'val' of width 'width' are reversed
+def reverse_bits(val, width):
+ result = 0
+ for _ in range(width):
+ result = (result << 1) | (val & 1)
+ val >>= 1
+ return result
+
+
+# iterative version of [recursively-applied] half-reversing
+# turns out this is Gray-Encoding.
+def halfrev2(vec, pre_rev=True):
+ res = []
+ for i in range(len(vec)):
+ if pre_rev:
+ res.append(vec[i ^ (i>>1)])
+ else:
+ ri = i
+ bl = i.bit_length()
+ for ji in range(1, bl):
+ ri ^= (i >> ji)
+ res.append(vec[ri])
+ return res
+
+
+def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
+ # get indices to iterate over, in the required order
+ n = SVSHAPE.lims[0]
+ mode = SVSHAPE.lims[1]
+ stride = SVSHAPE.lims[2]
+
+ # reference list for not needing to do data-swaps, just swap what
+ # *indices* are referenced (two levels of indirection at the moment)
+ # pre-reverse the data-swap list so that it *ends up* in the order 0123..
+ ji = list(range(n))
+
+ levels = n.bit_length() - 1
+ ri = [reverse_bits(i, levels) for i in range(n)]
+
+ if SVSHAPE.mode == 0b01: # FFT, bitrev only
+ ji = [ji[ri[i]] for i in range(n)]
+ elif SVSHAPE.submode2 == 0b001:
+ ji = [ji[ri[i]] for i in range(n)]
+ ji = halfrev2(ji, True)
+ else:
+ ji = halfrev2(ji, False)
+ ji = [ji[ri[i]] for i in range(n)]
+
+ # invert order if requested
+ if SVSHAPE.invxyz[0]:
+ ji.reverse()
+
+ for i, jl in enumerate(ji):
+ y_end = jl == ji[-1]
+ yield jl * stride, (0b111 if y_end else 0b000)
+
+def iterate_dct_inner_costable_indices(SVSHAPE):
+ # get indices to iterate over, in the required order
+ n = SVSHAPE.lims[0]
+ mode = SVSHAPE.lims[1]
+ stride = SVSHAPE.lims[2]
+ # creating lists of indices to iterate over in each dimension
+ # has to be done dynamically, because it depends on the size
+ # first, the size-based loop (which can be done statically)
+ x_r = []
+ size = 2
+ while size <= n:
+ x_r.append(size)
+ size *= 2
+ # invert order if requested
+ if SVSHAPE.invxyz[0]:
+ x_r.reverse()
+
+ if len(x_r) == 0:
+ return
+
+ # start an infinite (wrapping) loop
+ skip = 0
+ z_end = 1 # doesn't exist in this, only 2 loops
+ k = 0
+ while True:
+ for size in x_r: # loop over 3rd order dimension (size)
+ x_end = size == x_r[-1]
+ # y_r schedule depends on size
+ halfsize = size // 2
+ y_r = []
+ for i in range(0, n, size):
+ y_r.append(i)
+ # invert if requested
+ if SVSHAPE.invxyz[1]: y_r.reverse()
+ # two lists of half-range indices, e.g. j 0123, jr 7654
+ j = list(range(0, halfsize))
+ # invert if requested
+            if SVSHAPE.invxyz[2]: j.reverse()
+ # loop over 1st order dimension
+ for ci, jl in enumerate(j):
+ y_end = jl == j[-1]
+ # now depending on MODE return the index. inner butterfly
+ if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+ result = k # offset into COS table
+ elif SVSHAPE.skip == 0b10: #
+ result = ci # coefficient helper
+ elif SVSHAPE.skip == 0b11: #
+ result = size # coefficient helper
+ loopends = (z_end |
+ ((y_end and z_end)<<1) |
+ ((y_end and x_end and z_end)<<2))
+
+ yield (result * stride) + SVSHAPE.offset, loopends
+ k += 1
+
+def iterate_dct_inner_butterfly_indices(SVSHAPE):
+ # get indices to iterate over, in the required order
+ n = SVSHAPE.lims[0]
+ mode = SVSHAPE.lims[1]
+ stride = SVSHAPE.lims[2]
+ # creating lists of indices to iterate over in each dimension
+ # has to be done dynamically, because it depends on the size
+ # first, the size-based loop (which can be done statically)
+ x_r = []
+ size = 2
+ while size <= n:
+ x_r.append(size)
+ size *= 2
+ # invert order if requested
+ if SVSHAPE.invxyz[0]:
+ x_r.reverse()
+
+ if len(x_r) == 0:
+ return
+
+ # reference (read/write) the in-place data in *reverse-bit-order*
+ ri = list(range(n))
+ if SVSHAPE.submode2 == 0b01:
+ levels = n.bit_length() - 1
+ ri = [ri[reverse_bits(i, levels)] for i in range(n)]
+
+ # reference list for not needing to do data-swaps, just swap what
+ # *indices* are referenced (two levels of indirection at the moment)
+ # pre-reverse the data-swap list so that it *ends up* in the order 0123..
+ ji = list(range(n))
+ inplace_mode = True
+ if inplace_mode and SVSHAPE.submode2 == 0b01:
+ ji = halfrev2(ji, True)
+ if inplace_mode and SVSHAPE.submode2 == 0b11:
+ ji = halfrev2(ji, False)
+
+ # start an infinite (wrapping) loop
+ while True:
+ k = 0
+ k_start = 0
+ for size in x_r: # loop over 3rd order dimension (size)
+ x_end = size == x_r[-1]
+ # y_r schedule depends on size
+ halfsize = size // 2
+ y_r = []
+ for i in range(0, n, size):
+ y_r.append(i)
+ # invert if requested
+ if SVSHAPE.invxyz[1]: y_r.reverse()
+ for i in y_r: # loop over 2nd order dimension
+ y_end = i == y_r[-1]
+ # two lists of half-range indices, e.g. j 0123, jr 7654
+ j = list(range(i, i + halfsize))
+ jr = list(range(i+halfsize, i + size))
+ jr.reverse()
+ # invert if requested
+ if SVSHAPE.invxyz[2]:
+ j.reverse()
+ jr.reverse()
+ hz2 = halfsize // 2 # zero stops reversing 1-item lists
+ # loop over 1st order dimension
+ k = k_start
+ for ci, (jl, jh) in enumerate(zip(j, jr)):
+ z_end = jl == j[-1]
+ # now depending on MODE return the index. inner butterfly
+ if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+ if SVSHAPE.submode2 == 0b11: # iDCT
+ result = ji[ri[jl]] # lower half
+ else:
+ result = ri[ji[jl]] # lower half
+ elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
+ if SVSHAPE.submode2 == 0b11: # iDCT
+ result = ji[ri[jl+halfsize]] # upper half
+ else:
+ result = ri[ji[jh]] # upper half
+ elif mode == 4:
+ # COS table pre-generated mode
+ if SVSHAPE.skip == 0b10: #
+ result = k # cos table offset
+ else: # mode 2
+ # COS table generated on-demand ("Vertical-First") mode
+ if SVSHAPE.skip == 0b10: #
+ result = ci # coefficient helper
+ elif SVSHAPE.skip == 0b11: #
+ result = size # coefficient helper
+ loopends = (z_end |
+ ((y_end and z_end)<<1) |
+ ((y_end and x_end and z_end)<<2))
+
+ yield (result * stride) + SVSHAPE.offset, loopends
+ k += 1
+
+ # now in-place swap
+ if inplace_mode:
+ for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
+ jlh = jl+halfsize
+ tmp1, tmp2 = ji[jlh], ji[jh]
+ ji[jlh], ji[jh] = tmp2, tmp1
+
+ # new k_start point for cos tables( runs inside x_r loop NOT i loop)
+ k_start += halfsize
+
+
+# python "yield" can be iterated. use this to make it clear how
+# the indices are generated by using natural-looking nested loops
+def iterate_dct_outer_butterfly_indices(SVSHAPE):
+ # get indices to iterate over, in the required order
+ n = SVSHAPE.lims[0]
+ mode = SVSHAPE.lims[1]
+ stride = SVSHAPE.lims[2]
+ # creating lists of indices to iterate over in each dimension
+ # has to be done dynamically, because it depends on the size
+ # first, the size-based loop (which can be done statically)
+ x_r = []
+ size = n // 2
+ while size >= 2:
+ x_r.append(size)
+ size //= 2
+ # invert order if requested
+ if SVSHAPE.invxyz[0]:
+ x_r.reverse()
+
+ if len(x_r) == 0:
+ return
+
+ # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
+ ri = list(range(n))
+ if SVSHAPE.submode2 in [0b11, 0b01]:
+ levels = n.bit_length() - 1
+ ri = [ri[reverse_bits(i, levels)] for i in range(n)]
+
+ # reference list for not needing to do data-swaps, just swap what
+ # *indices* are referenced (two levels of indirection at the moment)
+ # pre-reverse the data-swap list so that it *ends up* in the order 0123..
+ ji = list(range(n))
+ inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
+ if SVSHAPE.submode2 == 0b11:
+ ji = halfrev2(ji, False)
+
+ # start an infinite (wrapping) loop
+ while True:
+ k = 0
+ k_start = 0
+ for size in x_r: # loop over 3rd order dimension (size)
+ halfsize = size//2
+ x_end = size == x_r[-1]
+ y_r = list(range(0, halfsize))
+ # invert if requested
+ if SVSHAPE.invxyz[1]: y_r.reverse()
+ for i in y_r: # loop over 2nd order dimension
+ y_end = i == y_r[-1]
+ # one list to create iterative-sum schedule
+ jr = list(range(i+halfsize, i+n-halfsize, size))
+ # invert if requested
+ if SVSHAPE.invxyz[2]: jr.reverse()
+ hz2 = halfsize // 2 # zero stops reversing 1-item lists
+ k = k_start
+ for ci, jh in enumerate(jr): # loop over 1st order dimension
+ z_end = jh == jr[-1]
+ if mode == 4:
+ # COS table pre-generated mode
+ if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+ if SVSHAPE.submode2 == 0b11: # iDCT
+ result = ji[ri[jh]] # upper half
+ else:
+ result = ri[ji[jh]] # lower half
+ elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
+ if SVSHAPE.submode2 == 0b11: # iDCT
+ result = ji[ri[jh+size]] # upper half
+ else:
+ result = ri[ji[jh+size]] # upper half
+ elif SVSHAPE.skip == 0b10: #
+ result = k # cos table offset
+ else:
+ # COS table generated on-demand ("Vertical-First") mode
+ if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
+ if SVSHAPE.submode2 == 0b11: # iDCT
+ result = ji[ri[jh]] # lower half
+ else:
+ result = ri[ji[jh]] # lower half
+ elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
+ if SVSHAPE.submode2 == 0b11: # iDCT
+ result = ji[ri[jh+size]] # upper half
+ else:
+ result = ri[ji[jh+size]] # upper half
+ elif SVSHAPE.skip == 0b10: #
+ result = ci # coefficient helper
+ elif SVSHAPE.skip == 0b11: #
+ result = size # coefficient helper
+ loopends = (z_end |
+ ((y_end and z_end)<<1) |
+ ((y_end and x_end and z_end)<<2))
+
+ yield (result * stride) + SVSHAPE.offset, loopends
+ k += 1
+
+ # new k_start point for cos tables( runs inside x_r loop NOT i loop)
+ k_start += halfsize
+
+```
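
As a sanity check on the load/store reordering: composing bit-reversal with `halfrev2` must always produce a permutation (every element offset generated exactly once), and the two directions of `halfrev2` must be mutual inverses. A small self-contained check, using copies of the two helpers defined above:

```python
# copies of the two DCT REMAP helpers defined in the specification
def reverse_bits(val, width):
    result = 0
    for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
    return result

def halfrev2(vec, pre_rev=True):
    res = []
    for i in range(len(vec)):
        if pre_rev:
            res.append(vec[i ^ (i >> 1)])  # Gray encoding
        else:
            ri = i
            for ji in range(1, i.bit_length()):
                ri ^= (i >> ji)            # inverse Gray encoding
            res.append(vec[ri])
    return res

for n in [2, 4, 8, 16, 32]:
    levels = n.bit_length() - 1
    ri = [reverse_bits(i, levels) for i in range(n)]
    for pre in (True, False):
        ji = halfrev2(list(range(n)), pre)
        ji = [ji[ri[i]] for i in range(n)]
        assert sorted(ji) == list(range(n))  # a true permutation
    # the two halfrev2 directions are mutual inverses
    fwd = halfrev2(list(range(n)), True)
    inv = halfrev2(list(range(n)), False)
    assert [fwd[inv[i]] for i in range(n)] == list(range(n))
```
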
+
+## REMAP selector
+
+The pseudocode below shows how the REMAP Schedule to apply is selected.
+Each of SVSHAPE 0-3 goes through this selection process.
+
+```
+ if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
+ if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
+ if SVSHAPEn.mode in [0b01, 0b11]:
+ # further sub-selection
+ if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
+ if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
+ if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
+ if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
+ if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore
+```
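
In a software model the selection is naturally a dispatch table. A minimal illustrative sketch (the dictionary keys and the `select_iterate_fn` name are assumptions, not part of the specification; the entries stand in for the `iterate_*` generators given in the sections above):

```python
# illustrative dispatch for the REMAP selector pseudocode
def select_iterate_fn(mode, ydimsz, fns):
    if mode == 0b00:
        return fns["matrix"]
    if mode == 0b10:
        return fns["preduce"]
    # modes 0b01 and 0b11: FFT/DCT family, sub-selected on ydimsz
    if ydimsz == 1:
        return fns["fft_butterfly"]
    if ydimsz == 2:
        return fns["dct_inner_butterfly"]
    if ydimsz == 3:
        return fns["dct_outer_butterfly"]
    if ydimsz in (5, 13):
        return fns["dct_costable"]
    if ydimsz in (6, 14, 15):
        return fns["dct_halfswap_loadstore"]
    raise ValueError("reserved SVSHAPE encoding")

# stand-in table: names instead of the real generator functions
fns = {k: k for k in ["matrix", "preduce", "fft_butterfly",
                      "dct_inner_butterfly", "dct_outer_butterfly",
                      "dct_costable", "dct_halfswap_loadstore"]}
assert select_iterate_fn(0b00, 0, fns) == "matrix"
assert select_iterate_fn(0b01, 3, fns) == "dct_outer_butterfly"
assert select_iterate_fn(0b11, 14, fns) == "dct_halfswap_loadstore"
```
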
+
[[!tag opf_rfc]]