# RFC ls009 Simple-V REMAP Subsystem

* Funded by NLnet under the Privacy and Enhanced Trust Programme, EU
  Horizon2020 Grant 825310, and NGI0 Entrust No 101069594
* <https://libre-soc.org/openpower/sv/rfc/ls009/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1060>
* <https://git.openpower.foundation/isa/PowerISA/issues/124>
**Date**: 26 Mar 2023. v2 TODO
**Books and Section affected**:

    Book I, new Zero-Overhead-Loop Chapter.
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic
**Summary**:

    svremap - Re-Mapping of Register Element Offsets
    svindex - General-purpose setting of SHAPEs to be re-mapped
    svshape - Hardware-level setting of SHAPEs for element re-mapping
    svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC
**Impact on processor**:

* Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
  Management Instructions which provide advanced features such as Matrix,
  FFT and DCT Hardware-Assist Schedules and general-purpose Index reordering.
**Impact on software**:

* Requires support for the new instructions in assemblers, debuggers,
  and related tools.
**Keywords**:

Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC),
Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
Digital Signal Processing (DSP)
These REMAP Management instructions provide state-of-the-art advanced
capabilities that dramatically decrease instruction count and reduce power
consumption, whilst retaining unprecedented general-purpose capability and a
standard Sequential Execution Model.
**Notes and Observations**:

1. Although REMAP is costly to implement, the alternatives in SIMD-paradigm
   software result in huge algorithmic complexity and associated increases
   in power consumption. Loop-unrolling compilers are prevalent, as are
   routines of thousands to tens of thousands of instructions.
2. Core inner kernels of Matrix, DCT, DFT, FFT and NTT are dramatically
   reduced to a handful of instructions.
3. No REMAP instructions, with the exception of Indexed, rely on registers
   for the establishment of REMAP capability.
4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
   and flexibility of REMAP.
5. The Simple-V Compliancy Subsets make REMAP optional except in the
   Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
   hardware to implement REMAP.
Add the following entries to:

* the Appendices of SV Book
* Instructions of SV Book as a new Section
* SVI, SVM, SVM2, SVRM Forms of Book I Section 1.6.1.6 and 1.6.2
# Rationale and background

In certain common algorithms there are clear patterns of behaviour which
typically require inline loop-unrolled instructions comprising interleaved
permute and arithmetic operations. REMAP was invented to split the permuting
from the arithmetic, and to allow the Indexing to be done as a hardware
Schedule.

A normal Vector Add:

```
for i in range(VL):
    GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
```

A Hardware-assisted REMAP Vector Add:

```
for i in range(VL):
    GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];
```
The result is a huge saving on register file accesses (no need to calculate
Indices then use Permutation instructions), on instruction count (a Matrix
Multiply of up to 127 FMACs is 3 instructions), and on programmer sanity.
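The saving can be seen in a small python3 model of the two loops above. This is an illustrative sketch only, not the canonical Specification: `regs` stands in for the register file, and `transpose_2x3` is an invented example Schedule (a 2-row by 3-column block read column-major) applied to the RA source operand only.

```python
# illustrative sketch only: REMAP modelled as a function from the
# sequential element index to a re-ordered element offset. "regs"
# stands in for the register file; "transpose_2x3" is an invented
# example Schedule, applied here to the RA source operand.

def identity(i):
    return i

def transpose_2x3(i):
    # element i reads a 6-element vector (2 rows x 3 cols) column-major
    return (i % 2) * 3 + (i // 2)

def remapped_vector_add(regs, RT, RA, RB, VL, remap1, remap2, remap3):
    # one pass: the permute and the arithmetic are fused, with no
    # Index-calculation or Permutation instructions needed
    for i in range(VL):
        regs[RT + remap1(i)] = (regs[RA + remap2(i)] +
                                regs[RB + remap3(i)])

regs = list(range(100))  # register file model, GPR[i] = i
remapped_vector_add(regs, 32, 0, 8, 6, identity, transpose_2x3, identity)
```

A single fused loop replaces what would otherwise be an index-computation loop, a permute loop, and an add loop.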
The following illustrates why REMAP was added:

* normal vector element read/write of operands would be sequential
  (0 1 2 3 ...)
* this is not appropriate for (e.g.) Matrix multiply which requires
  accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copy large vectors, spill through memory)
  and very few Packed SIMD ISAs cope with non-power-of-two sizes
  (Duplicate-data inline-loop-unrolling is the costly solution)
* REMAP **redefines** the order of access according to set
  (Deterministic) "Schedules".
* Matrix Schedules are not at all restricted to power-of-two boundaries,
  making it unnecessary to have, for example, the specialised 3x4 transpose
  instructions of other Vector ISAs.
* DCT and FFT REMAP are RADIX-2 limited but this is the case in existing
  Packed/Predicated SIMD ISAs anyway (and Bluestein Convolution is typically
  deployed to cope with non-power-of-two sizes)
Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and the hardware. For
arbitrary remapping the `Indexed` REMAP may be used.
# REMAP types

This section summarises the motivation for each REMAP Schedule
and briefly goes over their characteristics and limitations.
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].
## Matrix (1D/2D/3D shaping)

Matrix Multiplication is a huge part of High-Performance Compute.
In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
cope with for example 3x4 Matrices, recommend rolling data-repetition
and loop-unrolling. Aside from the cost of the load on the L1 I-Cache,
the trick only works if one of the dimensions X or Y is a power of two.
Prime Numbers (5x7, 3x5) become deeply problematic to unroll.
Even traditional Scalable Vector ISAs have issues with Matrices, often
having to perform data Transpose by pushing out through Memory and back,
or computing Transposition Indices (costly) then copying to another
Vector.
Matrix REMAP was thus designed to solve these issues by providing Hardware
"Schedules" that can view what would otherwise be limited to a strictly
linear Vector as instead being 2D (even 3D) *in-place* reordered.
With both Transposition and non-power-of-two sizes being supported, the
issues faced by other ISAs are mitigated.
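As a sketch of the idea (the helper below is invented for illustration and is not part of the Specification), the element offsets for an in-place 3x4 transpose Schedule can simply be enumerated:

```python
# illustrative sketch (invented helper, not part of the Specification):
# element offsets for a 3x4 transpose Schedule. a strictly linear
# 12-element vector is viewed as 3 rows x 4 columns and read
# column-major -- note there is no power-of-two restriction on
# either dimension.
def transpose_indices(rows, cols):
    return [r * cols + c for c in range(cols) for r in range(rows)]

idxs = transpose_indices(3, 4)
```

Feeding such offsets to the operands of an ordinary Vector instruction performs the transpose in-place, with no data movement through memory.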
Limitations of Matrix REMAP are that the Vector Length (VL) is currently
restricted to 127: up to 127 FMAs (or other operations)
may be performed in total.
Also, given that it is in-registers only at present, some care has to be
taken over regfile resource utilisation. However it is perfectly possible
to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
the usual 6-level "Tiled" large Matrix Multiply, without the usual
difficulties associated with SIMD.
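The 6-level structure referred to above can be sketched in python3 (an illustrative model only; `tiled_matmul` and its `tile` parameter are invented here). The three inner-most loops form the kernel that Matrix REMAP can schedule in-registers; the three outer loops merely select tiles:

```python
# illustrative sketch (invented helper): the usual 6-loop "Tiled"
# Matrix Multiply. the three inner-most loops (ii, jj, kk) are the
# kernel that Matrix REMAP can schedule in-registers.
def tiled_matmul(A, B, n, tile):
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # inner kernel: candidate for Matrix REMAP
                for ii in range(i0, min(i0 + tile, n)):
                    for jj in range(j0, min(j0 + tile, n)):
                        for kk in range(k0, min(k0 + tile, n)):
                            C[ii][jj] += A[ii][kk] * B[kk][jj]
    return C
```

With REMAP the kernel collapses to a setup instruction pair plus a single FMAC, repeated by the hardware Schedule.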
Also, the `svshape` instruction only provides access to part of the
Matrix REMAP capability. Rotation and mirroring need to be done by
programming the SVSHAPE SPRs directly, which can take a lot more
instructions. Future versions of SVP64 will include EXT1xx prefixed
variants (`psvshape`) which provide more comprehensive capacity and
mitigate the need to write directly to the SVSHAPE SPRs.
## FFT/DCT Triple Loop

DCT and FFT are some of the most astonishingly widely-used algorithms in
Computer Science: Radar, Audio, Video, R.F. Baseband and dozens more. At
least two DSPs, TMS320 and Hexagon, have VLIW instructions specially
tailored to FFT.
An in-depth analysis showed that it is possible to do in-place in-register
DCT and FFT as long as twin-result "butterfly" instructions are provided.
These can be found in the [[openpower/isa/svfparith]] page if performing
IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
integer operations would be required, currently under development)*.
These "butterfly" instructions avoid the need for a temporary register
because the two array positions being overwritten will be "in-flight" in
any In-Order or Out-of-Order micro-architecture.
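A minimal sketch of the twin-result idea (a hypothetical python3 helper, not the actual Power ISA butterfly instructions): both results are computed from the operands before either location is written, so the two in-place writes need no temporary register.

```python
# illustrative sketch only (hypothetical helper, not the actual
# Power ISA twin-result butterfly instructions): both results are
# computed before either array position is overwritten, so no
# temporary register is needed.
def fft_butterfly(vec, jl, jh, coeff):
    # twin-result: reads vec[jl] and vec[jh], writes vec[jl] and vec[jh]
    a, b = vec[jl], vec[jh] * coeff
    vec[jl], vec[jh] = a + b, a - b

vec = [5.0, 3.0]
fft_butterfly(vec, 0, 1, 2.0)  # vec becomes [11.0, -1.0]
```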
DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
accept predicate masks. Given that it is common to perform recursive
convolutions, combining smaller Power-2 DCT/FFT to create larger DCT/FFTs,
in practice the RADIX2 limit is not a problem. A Bluestein convolution
to compute arbitrary lengths is demonstrated by
[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
## Indexed

The purpose of Indexing is to provide a generalised version of
Vector ISA "Permute" instructions, such as VSX `vperm`. The
Indexing is abstracted out and may be applied to much more
than an element move/copy, and is not limited for example
to the number of bytes that can fit into a VSX register.
Indexing may be applied to LD/ST (even to Indexed LD/ST
instructions such as `sv.lbzx`), to arithmetic operations,
to `extsw`: there is no artificial limit.
The original motivation for Indexed REMAP was to mitigate the need to add
an expensive `mv.x` (`GPR(RT) <- GPR(GPR(RA))`) to the Scalar ISA, which
was highly likely to be rejected as a stand-alone instruction. Usually a
Vector ISA would add a non-conflicting variant (as in VSX `vperm`), but it
is common to need to permute by source, with a risk of conflict that has
to be resolved, as for example in AVX-512.
Indexed REMAP is the "Get out of Jail Free" card which (for a price)
allows any general-purpose arbitrary Element Permutation not covered by
the other Hardware Schedules.
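The generality can be sketched in python3 (the helper below is invented for illustration): Indexed REMAP sources element offsets from a vector of registers, modelled here as the list `index_gprs`, and applies an arbitrary operation through them, not merely a move/copy.

```python
# illustrative sketch (invented helper): Indexed REMAP sources element
# offsets from a vector of registers (modelled by the python list
# "index_gprs") and applies an *arbitrary* operation through them --
# not merely a move/copy as with a Vector ISA "permute" instruction.
def indexed_remap_op(regs, RT, RA, index_gprs, VL, op):
    for i in range(VL):
        regs[RT + i] = op(regs[RA + index_gprs[i]])

regs = [10, 20, 30, 40, 0, 0, 0, 0]
indexed_remap_op(regs, 4, 0, [2, 0, 3, 1], 4, lambda x: x + 1)
```

The same index vector may be reused across loads, stores and arithmetic, which is what removes the need for a separate `mv.x`.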
## Parallel Reduction

Vector Reduce Mode issues a deterministic tree-reduction schedule to the
underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
*appearance* and *effect* of Reduction. Parallel Reduction is not limited
to Power-of-two but is limited as usual by the total number of
element operations (127) as well as available register file size.
Parallel Reduction is normally done explicitly as a loop-unrolled operation,
taking up a significant number of instructions. With REMAP it is just three
instructions: two for setup and one Scalar Base.
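The shape of the tree-reduction Schedule can be sketched in python3 (invented helpers, simplified: no predication and no tracking of masked-out elements, unlike the canonical Specification below):

```python
# illustrative sketch (invented helpers, simplified: no predication):
# a deterministic binary-tree reduction schedule of (dst, src) element
# pairs. applying the unmodified Scalar Base operation over the
# schedule gives the appearance and effect of a reduction, and
# non-power-of-two lengths work.
def tree_reduce_schedule(n):
    sched, step = [], 1
    while step < n:
        step *= 2
        sched += [(i, i + step // 2)
                  for i in range(0, n, step) if i + step // 2 < n]
    return sched

def tree_reduce(vec, op):
    for dst, src in tree_reduce_schedule(len(vec)):
        vec[dst] = op(vec[dst], vec[src])
    return vec[0]  # reduction result lands in element 0
```

Pairs within each tree level are independent, so a multi-issue implementation may execute them in parallel.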
## Parallel Prefix Sum

This is a work-efficient Parallel Schedule that for example produces
Triangular or Factorial number sequences. Half of the Prefix Sum Schedule
is near-identical to Parallel Reduction. Whilst the Arithmetic mapreduce
Mode (`/mr`) may achieve the same end-result, implementations may only
implement Mapreduce in serial form (or give the appearance to Programmers
of the same). The Parallel Prefix Schedule is *required* to be implemented
in such a way that its Deterministic Schedule may be parallelised. Like the
Reduction Schedule it is 100% Deterministic and consequently may be used
with non-commutative operations.
Parallel Reduction and Parallel Prefix Sum may be combined to help
perform AI non-linear interpolation in around twelve to fifteen
instructions.
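A work-efficient prefix-sum network can be sketched in python3 (an invented helper, not the canonical Specification): the up-sweep is the same shape as the Parallel Reduction Schedule, and a down-sweep then fills in the remaining prefixes. Prefix-sum of 1, 2, 3, ... yields the Triangular numbers; prefix-product yields Factorials.

```python
# illustrative sketch (invented helper): a work-efficient parallel
# prefix-sum network. the up-sweep is identical in shape to Parallel
# Reduction; the down-sweep fills in the remaining prefixes. pairs
# at each distance "d" are independent and may run in parallel.
def parallel_prefix_sum(a, op):
    n = len(a)
    d = 1
    while d < n:   # up-sweep: same shape as the Reduction Schedule
        for i in range(2*d - 1, n, 2*d):
            a[i] = op(a[i], a[i - d])
        d *= 2
    d //= 2
    while d >= 1:  # down-sweep: fill in the remaining prefixes
        for i in range(2*d - 1, n - d, 2*d):
            a[i + d] = op(a[i + d], a[i])
        d //= 2
    return a
```

Because the order of applying `op` within the network is fixed and Deterministic, non-commutative operations are safe.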
# Specification

Add the following to SV Book:

[[!inline pages="openpower/sv/remap" raw=yes ]]
Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2

Add the following to Book I, 1.6.1, SVI-Form

|0 |6 |11 |16 |21 |23 |24|25|26 31|
| PO | SVG|rmm | SVd |ew |SVyx|mm|sk| XO |

Add the following to Book I, 1.6.1, SVM-Form

|0 |6 |11 |16 |21 |25 |26 |31 |
| PO | SVxd | SVyd | SVzd | SVrm |vf | XO |

Add the following to Book I, 1.6.1, SVM2-Form

|0 |6 |10 |11 |16 |21 |24|25 |26 |31 |
| PO | SVo |SVyx| rmm | SVd |XO |mm|sk | XO |

Add the following to Book I, 1.6.1, SVRM-Form

|0 |6 |11 |13 |15 |17 |19 |21 |22 |26 |31 |
| PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |/// | XO |

Add the following to Book I, 1.6.2
Field used in REMAP to select the SVSHAPE for the 1st input register

Field used in REMAP to select the SVSHAPE for the 2nd input register

Field used in REMAP to select the SVSHAPE for the 3rd input register

Field used to specify the meaning of the rmm field for SVI-Form

Field used in REMAP to select the SVSHAPE for the 1st output register

Field used in REMAP to select the SVSHAPE for the 2nd output register

Field used in REMAP to indicate "persistence" mode (REMAP
continues to apply to multiple instructions)

REMAP Mode field for SVI-Form and SVM2-Form

Field used to specify dimensional skipping in svindex

Immediate field used to specify the size of the REMAP dimension
in the svindex and svshape2 instructions

Immediate field used to specify a 9-bit signed
two's complement integer which is concatenated
on the right with 0b00 and sign-extended to 64 bits.
Field used to specify a GPR to be used as a
source for Indexing
Simple-V immediate field for setting VL or MVL

Simple-V "REMAP" map-enable bits (0-4)

Field used by the svshape2 instruction as an offset

Simple-V "REMAP" Mode

Simple-V "REMAP" x-dimension size

Simple-V "REMAP" y-dimension size

Simple-V "REMAP" z-dimension size
Extended opcode field. Note that bit 21 must be 1, 22 and 23
must be zero, and bits 26-31 must be exactly the same as
used by `svshape`.
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic
| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
| SVM | I | # | 3.0B | svshape | REMAP shape instruction |
| SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
| SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |
[[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
Written in python3, the following stand-alone executable source code is the
Canonical Specification for each REMAP. Vectors of "loopends" are returned,
when Rc=1, in Vectors of CR Fields on `sv.svstep.`, or, in Vertical-First
Mode, in a single CR Field (CR0) on `svstep.`. The `SVSTATE.srcstep` or
`SVSTATE.dststep` sequential offset is put through each algorithm to
determine the actual Element Offset. Alternative implementations producing
different orderings are prohibited, as software will be critically relying
on these Deterministic Schedules.
### REMAP 2D/3D Matrix

The following stand-alone executable source code is the Canonical
Specification for Matrix (2D/3D) REMAP.
Hardware implementations are achievable with simple cascading
counter-and-compares.
```python
# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    yd = SVSHAPE.lims[1]
    zd = SVSHAPE.lims[2]
    # create lists of indices to iterate over in each dimension
    x_r = list(range(xd))
    y_r = list(range(yd))
    z_r = list(range(zd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: x_r.reverse()
    if SVSHAPE.invxyz[1]: y_r.reverse()
    if SVSHAPE.invxyz[2]: z_r.reverse()
    # start an infinite (wrapping) loop
    step = 0 # track src/dst step
    while True:
        for z in z_r:   # loop over 1st order dimension
            z_end = z == z_r[-1]
            for y in y_r:       # loop over 2nd order dimension
                y_end = y == y_r[-1]
                for x in x_r:           # loop over 3rd order dimension
                    x_end = x == x_r[-1]
                    # ok work out which order to construct things in.
                    # start by creating a list of tuples of the dimension
                    # limit, index and a debug string
                    vals = [(SVSHAPE.lims[0], x, "x"),
                            (SVSHAPE.lims[1], y, "y"),
                            (SVSHAPE.lims[2], z, "z")
                           ]
                    # now select those by order. this allows us to
                    # create schedules for [z][x], [x][y], or [y][z]
                    # for matrix multiply.
                    vals = [vals[SVSHAPE.order[0]],
                            vals[SVSHAPE.order[1]],
                            vals[SVSHAPE.order[2]]
                           ]
                    # ok now we can construct the result, using bits of
                    # "order" to say which ones get stacked on
                    result = 0
                    mult = 1
                    for i in range(3):
                        lim, idx, dbg = vals[i]
                        # some of the dimensions can be "skipped". the order
                        # was actually selected above on all 3 dimensions,
                        # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
                        # those to be knocked out
                        if SVSHAPE.skip == i+1: continue
                        idx *= mult   # shifts up by previous dimension(s)
                        result += idx # adds on this dimension
                        mult *= lim   # for the next dimension

                    loopends = (x_end |
                               ((y_end and x_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield result + SVSHAPE.offset, loopends

def demo():
    # set the dimension sizes here
    xdim = 3
    ydim = 2
    zdim = 1

    # set total (can repeat, e.g. VL=x*y*z*4)
    VL = xdim * ydim * zdim

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b00
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
        if idx >= VL:
            break
        print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```
### REMAP Parallel Reduction pseudocode

The python3 program below is stand-alone executable and is the Canonical
Specification for Parallel Reduction REMAP.
The Algorithm below is not limited to RADIX2 sizes, and Predicate
sources, unlike in Matrix REMAP, apply to the Element Indices **after**
REMAP has been applied, not before. MV operations are not required: the
algorithm tracks positions of elements that would normally be moved, and,
when applying an Element Reduction Operation, sources the operands from
their last-known (tracked) position.
```python
# a "yield" version of the Parallel Reduction REMAP algorithm.
# the algorithm is in-place. it does not perform "MV" operations.
# instead, where a masked-out value *should* be read from is tracked

def iterate_indices(SVSHAPE, pred=None):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    # create lists of indices to iterate over in each dimension
    ix = list(range(xd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: ix.reverse()
    # start a loop from the lowest step
    step = 1
    steps = []
    while step < xd:
        step *= 2
        steps.append(step)
    # invert the indices if needed
    if SVSHAPE.invxyz[1]: steps.reverse()
    for step in steps:
        stepend = (step == steps[-1]) # note end of steps
        idxs = list(range(0, xd, step))
        results = []
        for i in idxs:
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < xd else None
            other_pred = other < xd and (pred is None or pred[oi])
            if (pred is None or pred[ci]) and other_pred:
                if SVSHAPE.skip == 0b00:   # submode 00
                    result = ci
                elif SVSHAPE.skip == 0b01: # submode 01
                    result = oi
                results.append([result + SVSHAPE.offset, 0])
            elif other_pred:
                ix[i] = oi
        if results:
            results[-1][1] = (stepend<<1) | 1 # notify end of loops
        yield from results

def demo():
    # set the dimension sizes here
    xdim = 9

    # set up two SVSHAPEs
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, 0, 0]
    SVSHAPE0.order = [0,1,2]
    SVSHAPE0.mode = 0b10
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, 0, 0]
    SVSHAPE1.order = [0,1,2]
    SVSHAPE1.mode = 0b10
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    shapes = list(iterate_indices(SVSHAPE0)), \
             list(iterate_indices(SVSHAPE1))
    for idx in range(len(shapes[0])):
        (l_idx, lend) = shapes[0][idx]
        (r_idx, rend) = shapes[1][idx]
        print ("%d->%d:%d" % (idx, l_idx, r_idx),
               "end", bin(lend)[2:], bin(rend)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```
### REMAP FFT pseudocode

The FFT REMAP is RADIX2 only.
```python
# a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
# original code for the FFT Tukey-Cooley schedule:
# https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py

# Radix-2 decimation-in-time FFT (real, not complex)
def transform_radix2(vec, exptable):
    n = len(vec)
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                jl, jh = j, j + halfsize
                temp1 = vec[jh] * exptable[k]
                temp2 = vec[jl]
                vec[jh] = temp2 - temp1
                vec[jl] = temp2 + temp1
                k += tablestep
        size *= 2
    return vec

# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    stride = SVSHAPE.lims[2] # stride-multiplier on reg access
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]: x_r.reverse()

    # start an infinite (wrapping) loop
    while True:
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            tablestep = n // size
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:       # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # two lists of indices, k and j
                j_r = []
                k_r = []
                k = 0
                for j in range(i, i+halfsize):
                    j_r.append(j)
                    k_r.append(k)
                    k += tablestep
                # invert if requested
                if SVSHAPE.invxyz[2]: k_r.reverse()
                if SVSHAPE.invxyz[2]: j_r.reverse()
                for j, k in zip(j_r, k_r):   # loop over 1st order dimension
                    z_end = j == j_r[-1]
                    # now depending on MODE return the index
                    if SVSHAPE.skip == 0b00:
                        result = j              # for vec[j]
                    elif SVSHAPE.skip == 0b01:
                        result = j + halfsize   # for vec[j+halfsize]
                    elif SVSHAPE.skip == 0b10:
                        result = k              # for exptable[k]
                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends

def demo():
    # set the dimension sizes here
    n = 8
    xdim = n
    ydim = 0 # not needed
    zdim = 1 # stride must be set to 1

    # set total. err don't know how to calculate how many there are...
    # do it manually for now
    VL = 0
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            for j in range(i, i + halfsize):
                VL += 1
        size *= 2

    # set up the three SVSHAPEs
    class SVSHAPE:
        pass
    # j schedule
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b01
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
    # j+halfstep schedule
    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, ydim, zdim]
    SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE1.mode = 0b01
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired
    # k schedule
    SVSHAPE2 = SVSHAPE()
    SVSHAPE2.lims = [xdim, ydim, zdim]
    SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE2.mode = 0b01
    SVSHAPE2.skip = 0b10
    SVSHAPE2.offset = 0       # experiment with different offset, here
    SVSHAPE2.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    schedule = []
    for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
                                          iterate_butterfly_indices(SVSHAPE1),
                                          iterate_butterfly_indices(SVSHAPE2))):
        if idx >= VL:
            break
        schedule.append((jl, jh, k))

    # ok now pretty-print the results, with some debug output
    size = 2
    idx = 0
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        print ("size %d halfsize %d tablestep %d" % \
                (size, halfsize, tablestep))
        for i in range(0, n, size):
            prefix = "i %d\t" % i
            k = 0
            for j in range(i, i + halfsize):
                (jl, je), (jh, he), (ks, ke) = schedule[idx]
                print (" %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
                       "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
                            (idx, prefix, j, j+halfsize, k,
                             jl, jh, ks),
                       "end", bin(je)[2:], bin(he)[2:], bin(ke)[2:])
                k += tablestep
                idx += 1
        size *= 2

# run the demo
if __name__ == '__main__':
    demo()
```
### REMAP DCT pseudocode

DCT REMAP is RADIX2 only. Convolutions may be applied as usual
to create non-RADIX2 DCTs. Combined with appropriate Twin-butterfly
instructions, the algorithms below (written in python3) become part
of an in-place in-registers Vectorized DCT. The algorithms work
by loading data such that, as the nested loops progress, the result
is sorted into correct sequential order.
```python
# DCT "REMAP" scheduler to create an in-place iterative DCT.

# bits of the integer 'val' of width 'width' are reversed
def reverse_bits(val, width):
    result = 0
    for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
    return result


# iterative version of [recursively-applied] half-reversing
# turns out this is Gray-Encoding.
def halfrev2(vec, pre_rev=True):
    res = []
    for i in range(len(vec)):
        if pre_rev:
            res.append(vec[i ^ (i>>1)])
        else:
            ri = i
            bl = i.bit_length()
            for ji in range(1, bl):
                ri ^= (i >> ji)
            res.append(vec[ri])
    return res


def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))

    levels = n.bit_length() - 1
    ri = [reverse_bits(i, levels) for i in range(n)]

    if SVSHAPE.mode == 0b01: # FFT, bitrev only
        ji = [ji[ri[i]] for i in range(n)]
    elif SVSHAPE.submode2 == 0b001:
        ji = [ji[ri[i]] for i in range(n)]
        ji = halfrev2(ji, True)
    else:
        ji = halfrev2(ji, False)
        ji = [ji[ri[i]] for i in range(n)]

    # invert order if requested
    if SVSHAPE.invxyz[0]:
        ji.reverse()

    for i, jl in enumerate(ji):
        y_end = jl == ji[-1]
        yield jl * stride, (0b111 if y_end else 0b000)


def iterate_dct_inner_costable_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    # start an infinite (wrapping) loop
    z_end = 1 # doesn't exist in this, only 2 loops
    while True:
        k = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            # two lists of half-range indices, e.g. j 0123, jr 7654
            j = list(range(0, halfsize))
            # invert if requested
            if SVSHAPE.invxyz[2]: j.reverse()
            # loop over 1st order dimension
            for ci, jl in enumerate(j):
                y_end = jl == j[-1]
                # now depending on MODE return the index. inner butterfly
                if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                    result = k # offset into COS table
                elif SVSHAPE.skip == 0b10: #
                    result = ci # coefficient helper
                elif SVSHAPE.skip == 0b11: #
                    result = size # coefficient helper
                loopends = (z_end |
                           ((y_end and z_end)<<1) |
                           ((y_end and x_end and z_end)<<2))

                yield (result * stride) + SVSHAPE.offset, loopends
                k += 1


def iterate_dct_inner_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    # reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 == 0b01:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = SVSHAPE.skip in [0b00, 0b01]
    if inplace_mode and SVSHAPE.submode2 == 0b01:
        ji = halfrev2(ji, True)
    if inplace_mode and SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:       # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # two lists of half-range indices, e.g. j 0123, jr 7654
                j = list(range(i, i + halfsize))
                jr = list(range(i+halfsize, i + size))
                jr.reverse()
                # invert if requested
                if SVSHAPE.invxyz[2]:
                    j.reverse()
                    jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                # loop over 1st order dimension
                k = k_start
                for ci, (jl, jh) in enumerate(zip(j, jr)):
                    z_end = jl == j[-1]
                    # now depending on MODE return the index. inner butterfly
                    if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl]]          # lower half
                        else:
                            result = ri[ji[jl]]          # lower half
                    elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl+halfsize]] # upper half
                        else:
                            result = ri[ji[jh]]          # upper half
                    elif mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else:
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

                # now in-place swap the (indirected) upper-half indices
                if inplace_mode:
                    for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
                        jlh = jl+halfsize
                        tmp1, tmp2 = ji[jlh], ji[jh]
                        ji[jlh], ji[jh] = tmp2, tmp1

            # new k_start point for cos tables( runs inside x_r loop NOT i loop)
            k_start += halfsize


# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_dct_outer_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = n // 2
    while size >= 2:
        x_r.append(size)
        size //= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 in [0b11, 0b01]:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
    if SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            halfsize = size // 2
            x_end = size == x_r[-1]
            y_r = list(range(0, halfsize))
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:       # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # one list to create iterative-sum schedule
                jr = list(range(i+halfsize, i+n-halfsize, size))
                # invert if requested
                if SVSHAPE.invxyz[2]: jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                k = k_start
                for ci, jh in enumerate(jr): # loop over 1st order dimension
                    z_end = jh == jr[-1]
                    if mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]]        # lower half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh+size]]   # upper half
                            else:
                                result = ri[ji[jh+size]]   # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else:
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]]        # lower half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh+size]]   # upper half
                            else:
                                result = ri[ji[jh+size]]   # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

            # new k_start point for cos tables( runs inside x_r loop NOT i loop)
            k_start += halfsize
```
Selecting which REMAP Schedule to use is shown by the pseudocode below.
Each SVSHAPE 0-3 goes through this selection process.

```
if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
if SVSHAPEn.mode in [0b01, 0b11]:
    # further sub-selection
    if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
    if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
    if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
    if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
    if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore
```