# RFC ls009 Simple-V REMAP Subsystem

* Funded by NLnet under the Privacy and Enhanced Trust Programme, EU
  Horizon2020 Grant 825310, and NGI0 Entrust No 101069594
* <https://libre-soc.org/openpower/sv/rfc/ls009/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1060>
* <https://git.openpower.foundation/isa/PowerISA/issues/124>

**Severity**: Major

**Status**: New

**Date**: 26 Mar 2023. v2 TODO

**Target**: v3.2B

**Source**: v3.0B

**Books and Sections affected**:

```
Book I, new Zero-Overhead-Loop Chapter.
Appendix E Power ISA sorted by opcode
Appendix F Power ISA sorted by version
Appendix G Power ISA sorted by Compliancy Subset
Appendix H Power ISA sorted by mnemonic
```

**Summary**

```
svremap  - Re-Mapping of Register Element Offsets
svindex  - General-purpose setting of SHAPEs to be re-mapped
svshape  - Hardware-level setting of SHAPEs for element re-mapping
svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
Management Instructions which provide advanced features such as Matrix,
FFT and DCT Hardware-Assist Schedules and general-purpose Index reordering.
```

**Impact on software**:

```
Requires support for new instructions in assembler, debuggers,
and related tools.
```

**Keywords**:

```
Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC),
Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
Digital Signal Processing (DSP)
```

**Motivation**

These REMAP Management instructions provide state-of-the-art advanced
capabilities that dramatically decrease instruction count and power
consumption whilst retaining unprecedented general-purpose capability
and a standard Sequential Execution Model.

**Notes and Observations**:

1. The alternatives within the SIMD paradigm, although also costly, result
   in huge algorithmic complexity and associated increases in power
   consumption. Loop-unrolling compilers are prevalent, as are kernels of
   thousands to tens of thousands of instructions.
2. The core inner kernels of Matrix Multiply, DCT, DFT, FFT and NTT are
   dramatically reduced to a handful of instructions.
3. With the exception of Indexed, no REMAP instruction relies on registers
   for the establishment of REMAP capability.
4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
   and flexibility of REMAP.
5. The Simple-V Compliancy Subsets make REMAP optional except in the
   Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
   hardware to implement REMAP.

**Changes**

Add the following entries to:

* the Appendices of SV Book
* Instructions of SV Book as a new Section
* SVI, SVM, SVM2, SVRM Forms of Book I Section 1.6.1.6 and 1.6.2

----------------

\newpage{}
# Rationale and background

In certain common algorithms there are clear patterns of behaviour which
typically require inline loop-unrolled instructions comprising interleaved
permute and arithmetic operations. REMAP was invented to split the permuting
away from the arithmetic, and to allow the Indexing to be done as a hardware
Schedule.

A normal Vector Add:

```
for i in range(VL):
    GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
```

A Hardware-assisted REMAP Vector Add:

```
for i in range(VL):
    GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];
```

The result is a huge saving on register file accesses (no need to calculate
Indices then use Permutation instructions), on instruction count (a Matrix
Multiply of up to 127 FMACs is 3 instructions), and on programmer sanity.
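
The difference between the two loops above can be modelled directly in
Python (an illustrative sketch only, not the Canonical Specification;
`remap1`/`remap2`/`remap3` are placeholder index functions):

```python
# Sketch: a Vector Add with optional per-operand index remapping.
# When no remap functions are given this degenerates to the normal
# sequential Vector Add.
def vector_add(rt, ra, rb, VL, remap1=None, remap2=None, remap3=None):
    ident = lambda i: i  # identity remap: a normal Vector Add
    r1 = remap1 or ident
    r2 = remap2 or ident
    r3 = remap3 or ident
    for i in range(VL):
        rt[r1(i)] = ra[r2(i)] + rb[r3(i)]

ra = [1, 2, 3]
rb = [10, 20, 30]
rt = [0, 0, 0]
vector_add(rt, ra, rb, 3)                           # rt: [11, 22, 33]
rt2 = [0, 0, 0]
vector_add(rt2, ra, rb, 3, remap1=lambda i: 2 - i)  # rt2: [33, 22, 11]
```

Note that only the element *offsets* change: the add itself is untouched,
which is exactly the principle REMAP applies to Scalar Base operations.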

**Basic principle**

The following illustrates why REMAP was added.

* Normal vector element read/write of operands would be sequential
  (0 1 2 3 ...).
* This is not appropriate for (e.g.) Matrix multiply, which requires
  accessing elements in alternative sequences (0 3 6 1 4 7 ...).
* Normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copy large vectors, spill through memory)
  and very few Packed SIMD ISAs cope with non-power-of-two sizes
  (duplicate-data inline loop-unrolling is the costly solution).
* REMAP **redefines** the order of access according to set
  (Deterministic) "Schedules".
* Matrix Schedules are not at all restricted to power-of-two boundaries,
  making it unnecessary to have, for example, the specialised 3x4 transpose
  instructions of other Vector ISAs.
* DCT and FFT REMAP are RADIX-2-limited, but this is the case in existing
  Packed/Predicated SIMD ISAs anyway (and Bluestein Convolution is
  typically deployed to solve that).
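
The alternative-sequence example above (0 3 6 1 4 7 ...) is simply a
column-major walk of a row-major matrix. A hedged sketch of such a
transpose Schedule (`transpose_schedule` is an illustrative name, not
part of the specification):

```python
# Sketch: the element offsets a transpose Schedule would generate for a
# row-major matrix read column-by-column.  Works equally for
# non-power-of-two shapes such as 3x4.
def transpose_schedule(rows, cols):
    return [r * cols + c for c in range(cols) for r in range(rows)]

print(transpose_schedule(3, 3))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
print(transpose_schedule(3, 4))  # non-power-of-two: no padding required
```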

Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and in hardware. For
arbitrary remapping the `Indexed` REMAP may be used.

# REMAP types

This section summarises the motivation for each REMAP Schedule
and briefly goes over their characteristics and limitations.
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].

## Matrix (1D/2D/3D shaping)

Matrix Multiplication is a huge part of High-Performance Compute and 3D.
In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
cope with for example 3x4 Matrices, recommend rolling data-repetition and
loop-unrolling. Aside from the cost of the load on the L1 I-Cache, the
trick only works if one of the dimensions X or Y is a power of two. Prime
numbers (5x7, 3x5) become deeply problematic to unroll.

Even traditional Scalable Vector ISAs have issues with Matrices, often
having to perform data Transpose by pushing out through Memory and back
(costly), or computing Transposition Indices (costly) then copying to
another Vector (costly).

Matrix REMAP was thus designed to solve these issues by providing
Hardware-Assisted "Schedules" that can view what would otherwise be
limited to a strictly linear Vector as instead being 2D (even 3D)
*in-place* reordered. With both Transposition and non-power-of-two being
supported, the issues faced by other ISAs are mitigated.

Limitations of Matrix REMAP are that the Vector Length (VL) is currently
restricted to 127: up to 127 FMAs (or other operations)
may be performed in total.
Also, given that it is in-registers only at present, some care has to be
taken over regfile resource utilisation. However it is perfectly possible
to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
the usual 6-level "Tiled" large Matrix Multiply, without the usual
difficulties associated with SIMD.
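
As an illustration of the principle (not the Canonical Specification),
the three inner kernel loops of a small Matrix Multiply can be expressed
as a single fused schedule of element offsets, with one scalar
multiply-add per schedule step; all names here are hypothetical:

```python
# Sketch: three nested loops fused into one stream of (C, A, B) register
# element offsets -- the job that three SVSHAPEs do in hardware, leaving
# a single scalar FMA as the only arithmetic instruction.
def matmul_schedules(X, Y, Z):
    # X = rows of A, Y = cols of B, Z = shared dimension
    for i in range(X):
        for j in range(Y):
            for k in range(Z):
                yield i * Y + j, i * Z + k, k * Y + j

def matmul(A, B, X, Y, Z):
    C = [0] * (X * Y)
    for ci, ai, bi in matmul_schedules(X, Y, Z):
        C[ci] += A[ai] * B[bi]   # one multiply-add per schedule step
    return C

A = [1, 2, 3,
     4, 5, 6]        # 2x3, row-major
B = [7, 8,
     9, 10,
     11, 12]         # 3x2, row-major
print(matmul(A, B, 2, 2, 3))  # [58, 64, 139, 154]
```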

Also the `svshape` instruction only provides access to part of the
Matrix REMAP capability. Rotation and mirroring need to be done by
programming the SVSHAPE SPRs directly, which can take a lot more
instructions. Future versions of SVP64 will include EXT1xx prefixed
variants (`psvshape`) which provide more comprehensive capacity and
mitigate the need to write directly to the SVSHAPE SPRs.

## FFT/DCT Triple Loop

DCT and FFT are some of the most ubiquitously used algorithms in
Computer Science: Radar, Audio, Video, R.F. Baseband and dozens more. At
least two DSPs, TMS320 and Hexagon, have VLIW instructions specially
tailored to FFT.

An in-depth analysis showed that it is possible to do in-place in-register
DCT and FFT as long as twin-result "butterfly" instructions are provided.
These can be found in the [[openpower/isa/svfparith]] page if performing
IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
integer operations would be required, currently under development)*. These
"butterfly" instructions avoid the need for a temporary register because
the two array positions being overwritten will be "in-flight" in any
In-Order or Out-of-Order micro-architecture.
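
A minimal sketch of one such twin-result butterfly (assuming real-valued
data, and written as a plain Python function rather than an instruction),
showing why no temporary vector is needed: both results land back in the
two positions that were just read:

```python
# Sketch: one radix-2 butterfly step.  Two elements are read, two
# results are written back to the same two positions -- in-place,
# with no temporary register required.
def butterfly(vec, jl, jh, coeff):
    t1 = vec[jh] * coeff
    t2 = vec[jl]
    vec[jl] = t2 + t1   # "twin" result 1
    vec[jh] = t2 - t1   # "twin" result 2

v = [3.0, 5.0]
butterfly(v, 0, 1, 1.0)
print(v)  # [8.0, -2.0]
```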

DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
accept predicate masks. Given that it is common to perform recursive
convolutions, combining smaller Power-2 DCT/FFT to create larger DCT/FFTs,
in practice the RADIX2 limit is not a problem. A Bluestein convolution
to compute arbitrary lengths is demonstrated by
[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)

## Indexed

The purpose of Indexing is to provide a generalised version of
Vector ISA "Permute" instructions, such as VSX `vperm`. The
Indexing is abstracted out and may be applied to much more
than an element move/copy, and is not limited for example
to the number of bytes that can fit into a VSX register.
Indexing may be applied to LD/ST (even on Indexed LD/ST
instructions such as `sv.lbzx`), to arithmetic operations,
to `extsw`: there is no artificial limit.

The original motivation for Indexed REMAP was to mitigate the need to add
an expensive `mv.x` to the Scalar ISA, which was highly likely to be
rejected as a stand-alone instruction (`GPR(RT) <- GPR(GPR(RA))`). Usually
a Vector ISA would add a non-conflicting variant (as in VSX `vperm`) but
it is common to need to permute by source, with the risk of a conflict
that has to be resolved, for example, in AVX-512 with `vpconflictd`.

Indexed REMAP is the "Get out of Jail Free" card which (for a price)
allows any general-purpose arbitrary Element Permutation not covered by
the other Hardware Schedules.
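
A hedged sketch of Indexed REMAP semantics applied to a Vector Add, with
the index vector modelled as a plain list (`idx` remaps the first source
operand only, for illustration; the function name is hypothetical):

```python
# Sketch: RT[i] = RA[idx[i]] + RB[i] -- an arbitrary source
# permutation, the generalised equivalent of vperm / mv.x, but
# applied to an arithmetic operation rather than a move.
def indexed_vector_add(ra, rb, idx):
    return [ra[idx[i]] + rb[i] for i in range(len(idx))]

print(indexed_vector_add([10, 20, 30], [1, 2, 3], [2, 0, 1]))  # [31, 12, 23]
```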

## Parallel Reduction

Vector Reduce Mode issues a deterministic tree-reduction schedule to the
underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
*appearance* and *effect* of Reduction. Parallel Reduction is not limited
to power-of-two sizes but is limited as usual by the total number of
element operations (127) as well as available register file size.

Parallel Reduction is normally done explicitly as a loop-unrolled
operation, taking up a significant number of instructions. With REMAP it
is just three instructions: two for setup and one Scalar Base.
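
The tree-reduction schedule can be sketched as follows (illustrative
only, not the Canonical Specification; note that it copes with a
non-power-of-two element count):

```python
# Sketch: a deterministic binary-tree reduction.  Each pass combines
# pairs "step" apart; all combines within one pass are independent and
# may therefore execute in parallel.  The Scalar Base operation (here,
# the op callable) is applied unmodified.
def tree_reduce(vec, op):
    vec = list(vec)
    n = len(vec)
    step = 1
    while step < n:
        step *= 2
        for i in range(0, n, step):
            other = i + step // 2
            if other < n:
                vec[i] = op(vec[i], vec[other])
    return vec[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7], lambda a, b: a + b))  # 28
```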

## Parallel Prefix Sum

This is a work-efficient Parallel Schedule that, for example, produces
Triangular or Factorial number sequences. Half of the Prefix Sum Schedule
is near-identical to Parallel Reduction. Whilst the Arithmetic mapreduce
Mode (`/mr`) may achieve the same end-result, implementations may only
implement Mapreduce in serial form (or give the appearance to Programmers
of the same). The Parallel Prefix Schedule is *required* to be implemented
in such a way that its Deterministic Schedule may be parallelised. Like
the Reduction Schedule it is 100% Deterministic and consequently may be
used with non-commutative operations.

Parallel Reduction and Parallel Prefix Sum may be combined to help
perform AI non-linear interpolation in around twelve to fifteen
instructions.
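
A hedged sketch of a parallelisable inclusive prefix sum by step-doubling
(illustrative only, and *not* the SV Schedule itself: unlike the actual
work-efficient Schedule, this variant performs more total operations, but
every combine within a pass is independent and order-preserving, so it is
safe for non-commutative operations):

```python
# Sketch: inclusive prefix sum by step-doubling.  All combines in one
# pass read a snapshot taken before the pass, so they may run in
# parallel.  Applied with "+" it yields running totals (Triangular
# numbers for input 1,2,3,...).
def prefix_sum(vec, op):
    vec = list(vec)
    step = 1
    while step < len(vec):
        prev = list(vec)  # snapshot: combines in this pass are independent
        for i in range(step, len(vec)):
            vec[i] = op(prev[i - step], prev[i])  # earlier operand on left
        step *= 2
    return vec

print(prefix_sum([1, 2, 3, 4, 5], lambda a, b: a + b))  # [1, 3, 6, 10, 15]
```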

----------------

\newpage{}

Add the following to SV Book

[[!inline pages="openpower/sv/remap" raw=yes ]]

# Forms

Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2

Add the following to Book I, 1.6.1, SVI-Form

```
|0     |6    |11    |16   |21 |23  |24|25|26    31|
| PO   |  SVG|rmm   | SVd |ew |SVyx|mm|sk|   XO   |
```

Add the following to Book I, 1.6.1, SVM-Form

```
|0     |6      |11      |16    |21    |25 |26    |31  |
| PO   |  SVxd |   SVyd | SVzd | SVrm |vf |   XO      |
```

Add the following to Book I, 1.6.1, SVM2-Form

```
|0     |6   |10  |11      |16    |21 |24|25 |26    |31  |
| PO   | SVo|SVyx| rmm    | SVd  |XO |mm|sk|   XO      |
```

Add the following to Book I, 1.6.1, SVRM-Form

```
|0     |6     |11  |13   |15   |17   |19   |21  |22   |26     |31 |
| PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |///  | XO        |
```

Add the following to Book I, 1.6.2

```
mi0 (11:12)
    Field used in REMAP to select the SVSHAPE for 1st input register
    Formats: SVRM
mi1 (13:14)
    Field used in REMAP to select the SVSHAPE for 2nd input register
    Formats: SVRM
mi2 (15:16)
    Field used in REMAP to select the SVSHAPE for 3rd input register
    Formats: SVRM
mm (24)
    Field used to specify the meaning of the rmm field for SVI-Form
    and SVM2-Form
    Formats: SVI, SVM2
mo0 (17:18)
    Field used in REMAP to select the SVSHAPE for 1st output register
    Formats: SVRM
mo1 (19:20)
    Field used in REMAP to select the SVSHAPE for 2nd output register
    Formats: SVRM
pst (21)
    Field used in REMAP to indicate "persistence" mode (REMAP
    continues to apply to multiple instructions)
    Formats: SVRM
rmm (11:15)
    REMAP Mode field for SVI-Form and SVM2-Form
    Formats: SVI, SVM2
sk (25)
    Field used to specify dimensional skipping in svindex
    Formats: SVI, SVM2
SVd (16:20)
    Immediate field used to specify the size of the REMAP dimension
    in the svindex and svshape2 instructions
    Formats: SVI, SVM2
SVDS (16:29)
    Immediate field used to specify a 9-bit signed
    two's complement integer which is concatenated
    on the right with 0b00 and sign-extended to 64 bits.
    Formats: SVDS
SVG (6:10)
    Field used to specify a GPR to be used as a
    source for indexing.
    Formats: SVI
SVi (16:22)
    Simple-V immediate field for setting VL or MVL
    Formats: SVL
SVme (6:10)
    Simple-V "REMAP" map-enable bits (0-4)
    Formats: SVRM
SVo (6:9)
    Field used by the svshape2 instruction as an offset
    Formats: SVM2
SVrm (21:24)
    Simple-V "REMAP" Mode
    Formats: SVM
SVxd (6:10)
    Simple-V "REMAP" x-dimension size
    Formats: SVM
SVyd (11:15)
    Simple-V "REMAP" y-dimension size
    Formats: SVM
SVzd (16:20)
    Simple-V "REMAP" z-dimension size
    Formats: SVM
XO (21:23,26:31)
    Extended opcode field.  Note that bit 21 must be 1, 22 and 23
    must be zero, and bits 26-31 must be exactly the same as
    used for svshape.
    Formats: SVM2
```

# Appendices

* Appendix E Power ISA sorted by opcode
* Appendix F Power ISA sorted by version
* Appendix G Power ISA sorted by Compliancy Subset
* Appendix H Power ISA sorted by mnemonic

| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
| SVM | I | # | 3.0B | svshape | REMAP shape instruction |
| SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
| SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |

[[!inline pages="openpower/sv/remap/appendix" raw=yes ]]

## REMAP pseudocode

Written in python3, the following stand-alone executable source code is
the Canonical Specification for each REMAP. Vectors of "loopends" are
returned when Rc=1, in Vectors of CR Fields on `sv.svstep.`, or, in
Vertical-First Mode, in a single CR Field (CR0) on `svstep.`. The
`SVSTATE.srcstep` or `SVSTATE.dststep` sequential offset is put through
each algorithm to determine the actual Element Offset. Alternative
implementations producing different orderings are prohibited, as software
will be critically relying on these Deterministic Schedules.

### REMAP 2D/3D Matrix

The following stand-alone executable source code is the Canonical
Specification for Matrix (2D/3D) REMAP. Hardware implementations are
achievable with simple cascading counter-and-compares.

```
# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    yd = SVSHAPE.lims[1]
    zd = SVSHAPE.lims[2]
    # create lists of indices to iterate over in each dimension
    x_r = list(range(xd))
    y_r = list(range(yd))
    z_r = list(range(zd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: x_r.reverse()
    if SVSHAPE.invxyz[1]: y_r.reverse()
    if SVSHAPE.invxyz[2]: z_r.reverse()
    # start an infinite (wrapping) loop
    step = 0 # track src/dst step
    while True:
        for z in z_r:       # loop over 1st order dimension
            z_end = z == z_r[-1]
            for y in y_r:   # loop over 2nd order dimension
                y_end = y == y_r[-1]
                for x in x_r:  # loop over 3rd order dimension
                    x_end = x == x_r[-1]
                    # ok work out which order to construct things in.
                    # start by creating a list of tuples of the dimension
                    # and its limit
                    vals = [(SVSHAPE.lims[0], x, "x"),
                            (SVSHAPE.lims[1], y, "y"),
                            (SVSHAPE.lims[2], z, "z")
                           ]
                    # now select those by order. this allows us to
                    # create schedules for [z][x], [x][y], or [y][z]
                    # for matrix multiply.
                    vals = [vals[SVSHAPE.order[0]],
                            vals[SVSHAPE.order[1]],
                            vals[SVSHAPE.order[2]]
                           ]
                    # ok now we can construct the result, using bits of
                    # "order" to say which ones get stacked on
                    result = 0
                    mult = 1
                    for i in range(3):
                        lim, idx, dbg = vals[i]
                        # some of the dimensions can be "skipped". the order
                        # was actually selected above on all 3 dimensions,
                        # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
                        # those to be knocked out
                        if SVSHAPE.skip == i+1: continue
                        idx *= mult   # shifts up by previous dimension(s)
                        result += idx # adds on this dimension
                        mult *= lim   # for the next dimension

                    loopends = (x_end |
                               ((y_end and x_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield result + SVSHAPE.offset, loopends
                    step += 1

def demo():
    # set the dimension sizes here
    xdim = 3
    ydim = 2
    zdim = 4

    # set total (can repeat, e.g. VL=x*y*z*4)
    VL = xdim * ydim * zdim

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b00
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
        if idx >= VL:
            break
        print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```

### REMAP Parallel Reduction pseudocode

The python3 program below is stand-alone executable and is the Canonical
Specification for Parallel Reduction REMAP.
The Algorithm below is not limited to RADIX2 sizes, and Predicate
sources, unlike in Matrix REMAP, apply to the Element Indices **after**
REMAP has been applied, not before. MV operations are not required: the
algorithm tracks positions of elements that would normally be moved, and
when applying an Element Reduction Operation sources the operands from
their last-known (tracked) position.

```
# a "yield" version of the Parallel Reduction REMAP algorithm.
# the algorithm is in-place. it does not perform "MV" operations.
# instead, where a masked-out value *should* be read from is tracked

def iterate_indices(SVSHAPE, pred=None):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    # create lists of indices to iterate over in each dimension
    ix = list(range(xd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: ix.reverse()
    # start a loop from the lowest step
    step = 1
    steps = []
    while step < xd:
        step *= 2
        steps.append(step)
    # invert the indices if needed
    if SVSHAPE.invxyz[1]: steps.reverse()
    for step in steps:
        stepend = (step == steps[-1]) # note end of steps
        idxs = list(range(0, xd, step))
        results = []
        for i in idxs:
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < xd else None
            other_pred = other < xd and (pred is None or pred[oi])
            if (pred is None or pred[ci]) and other_pred:
                if SVSHAPE.skip == 0b00:   # submode 00
                    result = ci
                elif SVSHAPE.skip == 0b01: # submode 01
                    result = oi
                results.append([result + SVSHAPE.offset, 0])
            elif other_pred:
                ix[i] = oi
        if results:
            results[-1][1] = (stepend<<1) | 1 # notify end of loops
        yield from results

def demo():
    # set the dimension sizes here
    xdim = 9

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, 0, 0]
    SVSHAPE0.order = [0,1,2]
    SVSHAPE0.mode = 0b10
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, 0, 0]
    SVSHAPE1.order = [0,1,2]
    SVSHAPE1.mode = 0b10
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    shapes = list(iterate_indices(SVSHAPE0)), \
             list(iterate_indices(SVSHAPE1))
    for idx in range(len(shapes[0])):
        l = shapes[0][idx]
        r = shapes[1][idx]
        (l_idx, lend) = l
        (r_idx, rend) = r
        print ("%d->%d:%d" % (idx, l_idx, r_idx),
               "end", bin(lend)[2:], bin(rend)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```

### REMAP FFT pseudocode

The FFT REMAP is RADIX2 only.

```
# a "yield" version of the REMAP algorithm, for FFT Cooley-Tukey schedules
# original code for the FFT Cooley-Tukey schedule:
# https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
"""
    # Radix-2 decimation-in-time FFT (real, not complex)
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                jh = j+halfsize
                jl = j
                temp1 = vec[jh] * exptable[k]
                temp2 = vec[jl]
                vec[jh] = temp2 - temp1
                vec[jl] = temp2 + temp1
                k += tablestep
        size *= 2
"""

# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    stride = SVSHAPE.lims[2] # stride-multiplier on reg access
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]: x_r.reverse()

    if len(x_r) == 0:
        return

    # start an infinite (wrapping) loop
    skip = 0
    while True:
        for size in x_r:    # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            tablestep = n // size
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:   # loop over 2nd order dimension
                y_end = i == y_r[-1]
                k_r = []
                j_r = []
                k = 0
                for j in range(i, i+halfsize):
                    k_r.append(k)
                    j_r.append(j)
                    k += tablestep
                # invert if requested
                if SVSHAPE.invxyz[2]: k_r.reverse()
                if SVSHAPE.invxyz[2]: j_r.reverse()
                for j, k in zip(j_r, k_r):  # loop over 1st order dimension
                    z_end = j == j_r[-1]
                    # now depending on MODE return the index
                    if SVSHAPE.skip == 0b00:
                        result = j              # for vec[j]
                    elif SVSHAPE.skip == 0b01:
                        result = j + halfsize   # for vec[j+halfsize]
                    elif SVSHAPE.skip == 0b10:
                        result = k              # for exptable[k]

                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                               ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends

def demo():
    # set the dimension sizes here
    xdim = 8
    ydim = 0 # not needed
    zdim = 1 # stride must be set to 1

    # set total. err don't know how to calculate how many there are...
    # do it manually for now
    VL = 0
    size = 2
    n = xdim
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            for j in range(i, i + halfsize):
                VL += 1
        size *= 2

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    # j schedule
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b01
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
    # j+halfstep schedule
    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, ydim, zdim]
    SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE1.mode = 0b01
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired
    # k schedule
    SVSHAPE2 = SVSHAPE()
    SVSHAPE2.lims = [xdim, ydim, zdim]
    SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE2.mode = 0b01
    SVSHAPE2.skip = 0b10
    SVSHAPE2.offset = 0       # experiment with different offset, here
    SVSHAPE2.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    schedule = []
    for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
                                          iterate_butterfly_indices(SVSHAPE1),
                                          iterate_butterfly_indices(SVSHAPE2))):
        if idx >= VL:
            break
        schedule.append((jl, jh, k))

    # ok now pretty-print the results, with some debug output
    size = 2
    idx = 0
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        print ("size %d halfsize %d tablestep %d" % \
               (size, halfsize, tablestep))
        for i in range(0, n, size):
            prefix = "i %d\t" % i
            k = 0
            for j in range(i, i + halfsize):
                (jl, je), (jh, he), (ks, ke) = schedule[idx]
                print ("  %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
                       "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
                       (idx, prefix, j, j+halfsize, k,
                        jl, jh, ks,
                       ),
                       "end", bin(je)[2:], bin(he)[2:], bin(ke)[2:])
                k += tablestep
                idx += 1
        size *= 2

# run the demo
if __name__ == '__main__':
    demo()
```

### DCT REMAP

DCT REMAP is RADIX2 only. Convolutions may be applied as usual
to create non-RADIX2 DCTs. Combined with appropriate Twin-butterfly
instructions the algorithm below (written in python3) becomes part
of an in-place in-registers Vectorized DCT. The algorithms work
by loading data such that, as the nested loops progress, the result
is sorted into correct sequential order.

778 ```
779 # DCT "REMAP" scheduler to create an in-place iterative DCT.
780 #
781
782 # bits of the integer 'val' of width 'width' are reversed
783 def reverse_bits(val, width):
784 result = 0
785 for _ in range(width):
786 result = (result << 1) | (val & 1)
787 val >>= 1
788 return result
789
790
791 # iterative version of [recursively-applied] half-reversing
792 # turns out this is Gray-Encoding.
793 def halfrev2(vec, pre_rev=True):
794 res = []
795 for i in range(len(vec)):
796 if pre_rev:
797 res.append(vec[i ^ (i>>1)])
798 else:
799 ri = i
800 bl = i.bit_length()
801 for ji in range(1, bl):
802 ri ^= (i >> ji)
803 res.append(vec[ri])
804 return res
805
806
807 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
808 # get indices to iterate over, in the required order
809 n = SVSHAPE.lims[0]
810 mode = SVSHAPE.lims[1]
811 stride = SVSHAPE.lims[2]
812
813 # reference list for not needing to do data-swaps, just swap what
814 # *indices* are referenced (two levels of indirection at the moment)
815 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
816 ji = list(range(n))
817
818 levels = n.bit_length() - 1
819 ri = [reverse_bits(i, levels) for i in range(n)]
820
821 if SVSHAPE.mode == 0b01: # FFT, bitrev only
822 ji = [ji[ri[i]] for i in range(n)]
823 elif SVSHAPE.submode2 == 0b001:
824 ji = [ji[ri[i]] for i in range(n)]
825 ji = halfrev2(ji, True)
826 else:
827 ji = halfrev2(ji, False)
828 ji = [ji[ri[i]] for i in range(n)]
829
830 # invert order if requested
831 if SVSHAPE.invxyz[0]:
832 ji.reverse()
833
834 for i, jl in enumerate(ji):
835 y_end = jl == ji[-1]
836 yield jl * stride, (0b111 if y_end else 0b000)
837
838 def iterate_dct_inner_costable_indices(SVSHAPE):
839 # get indices to iterate over, in the required order
840 n = SVSHAPE.lims[0]
841 mode = SVSHAPE.lims[1]
842 stride = SVSHAPE.lims[2]
843 # creating lists of indices to iterate over in each dimension
844 # has to be done dynamically, because it depends on the size
845 # first, the size-based loop (which can be done statically)
846 x_r = []
847 size = 2
848 while size <= n:
849 x_r.append(size)
850 size *= 2
851 # invert order if requested
852 if SVSHAPE.invxyz[0]:
853 x_r.reverse()
854
855 if len(x_r) == 0:
856 return
857
858 # start an infinite (wrapping) loop
859 skip = 0
860 z_end = 1 # doesn't exist in this, only 2 loops
861 k = 0
862 while True:
863 for size in x_r: # loop over 3rd order dimension (size)
864 x_end = size == x_r[-1]
865 # y_r schedule depends on size
866 halfsize = size // 2
867 y_r = []
868 for i in range(0, n, size):
869 y_r.append(i)
870 # invert if requested
871 if SVSHAPE.invxyz[1]: y_r.reverse()
872 # two lists of half-range indices, e.g. j 0123, jr 7654
873 j = list(range(0, halfsize))
874 # invert if requested
875 if SVSHAPE.invxyz[2]: j_r.reverse()
876 # loop over 1st order dimension
877 for ci, jl in enumerate(j):
878 y_end = jl == j[-1]
879 # now depending on MODE return the index. inner butterfly
880 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
881 result = k # offset into COS table
882 elif SVSHAPE.skip == 0b10: #
883 result = ci # coefficient helper
884 elif SVSHAPE.skip == 0b11: #
885 result = size # coefficient helper
886 loopends = (z_end |
887 ((y_end and z_end)<<1) |
888 ((y_end and x_end and z_end)<<2))
889
890 yield (result * stride) + SVSHAPE.offset, loopends
891 k += 1
892
def iterate_dct_inner_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    if len(x_r) == 0:
        return

    # reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 == 0b01:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = True
    if inplace_mode and SVSHAPE.submode2 == 0b01:
        ji = halfrev2(ji, True)
    if inplace_mode and SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k = 0
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:          # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # two lists of half-range indices, e.g. j 0123, jr 7654
                j = list(range(i, i + halfsize))
                jr = list(range(i + halfsize, i + size))
                jr.reverse()
                # invert if requested
                if SVSHAPE.invxyz[2]:
                    j.reverse()
                    jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                # loop over 1st order dimension
                k = k_start
                for ci, (jl, jh) in enumerate(zip(j, jr)):
                    z_end = jl == j[-1]
                    # now depending on MODE return the index. inner butterfly
                    if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl]]            # lower half
                        else:
                            result = ri[ji[jl]]            # lower half
                    elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl + halfsize]] # upper half
                        else:
                            result = ri[ji[jh]]            # upper half
                    elif mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else: # mode 2
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                                ((y_end and z_end) << 1) |
                                ((y_end and x_end and z_end) << 2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

                # now in-place swap
                if inplace_mode:
                    for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
                        jlh = jl + halfsize
                        tmp1, tmp2 = ji[jlh], ji[jh]
                        ji[jlh], ji[jh] = tmp2, tmp1

            # new k_start point for cos tables (runs inside x_r loop, NOT the i loop)
            k_start += halfsize

# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_dct_outer_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = n // 2
    while size >= 2:
        x_r.append(size)
        size //= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    if len(x_r) == 0:
        return

    # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 in [0b11, 0b01]:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
    if SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k = 0
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            halfsize = size // 2
            x_end = size == x_r[-1]
            y_r = list(range(0, halfsize))
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:          # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # one list to create iterative-sum schedule
                jr = list(range(i + halfsize, i + n - halfsize, size))
                # invert if requested
                if SVSHAPE.invxyz[2]: jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                k = k_start
                for ci, jh in enumerate(jr): # loop over 1st order dimension
                    z_end = jh == jr[-1]
                    if mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]]        # lower half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh + size]] # upper half
                            else:
                                result = ri[ji[jh + size]] # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else:
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]]        # lower half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh + size]] # upper half
                            else:
                                result = ri[ji[jh + size]] # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                                ((y_end and z_end) << 1) |
                                ((y_end and x_end and z_end) << 2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

            # new k_start point for cos tables (runs inside x_r loop, NOT the i loop)
            k_start += halfsize

```

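The *reverse-bit-order* referencing performed via the `ri` list in the schedules above can be illustrated in isolation. The sketch below is illustrative only: `reverse_bits` is reproduced here so the snippet is self-contained, and `n = 8` is an arbitrary example size.

```python
# Sketch of the reverse-bit-order referencing used by the DCT REMAP
# schedules: element i is accessed via ri[i], where ri is the
# bit-reversed permutation of 0..n-1.
def reverse_bits(x, n):
    """reverse the lowest n bits of x"""
    result = 0
    for i in range(n):
        if (x >> i) & 1:
            result |= 1 << (n - 1 - i)
    return result

n = 8                        # example size (a power of two)
levels = n.bit_length() - 1  # log2(n), as in the pseudocode above
ri = [reverse_bits(i, levels) for i in range(n)]
print(ri)                    # [0, 4, 2, 6, 1, 5, 3, 7]
```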
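Each generator above yields a `(element_offset, loopends)` pair, where `loopends` packs "last iteration" flags for the three nested loops. A hedged sketch of decoding that bitfield (the helper name `decode_loopends` is illustrative, not part of the specification):

```python
# Sketch: decode the 3-bit loopends value yielded alongside each element
# offset. bit 0: innermost (z) loop on its last step; bit 1: y loop also
# on its last step; bit 2: x loop also on its last step.
def decode_loopends(loopends):
    return {"z_end": bool(loopends & 1),
            "y_end": bool(loopends & 2),
            "x_end": bool(loopends & 4)}

# pack exactly as in the pseudocode, with all three loops at their ends
z_end, y_end, x_end = True, True, True
loopends = (z_end |
            ((y_end and z_end) << 1) |
            ((y_end and x_end and z_end) << 2))
print(loopends, decode_loopends(loopends))
```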
## REMAP selector

Selecting which REMAP Schedule to use is shown by the pseudocode below.
Each of SVSHAPE0-3 goes through this selection process independently.
```
if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
if SVSHAPEn.mode in [0b01, 0b11]:
    # further sub-selection
    if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
    if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
    if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
    if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
    if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore
```
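The decode above can be expressed as a small dispatch function. This is a hedged sketch: `select_iterate_fn` is a hypothetical helper that returns the *name* of the generator the selector would choose, so the snippet runs without the generators themselves being defined.

```python
# Illustrative dispatch mirroring the REMAP-selector pseudocode above.
# Returns the name of the chosen schedule generator for one SVSHAPE,
# or None for encodings the pseudocode leaves unallocated.
def select_iterate_fn(mode, ydimsz):
    if mode == 0b00:
        return "iterate_indices"
    if mode == 0b10:
        return "iterate_preduce_indices"
    if mode in [0b01, 0b11]:
        # further sub-selection on ydimsz
        if ydimsz == 1: return "iterate_butterfly_indices"
        if ydimsz == 2: return "iterate_dct_inner_butterfly_indices"
        if ydimsz == 3: return "iterate_dct_outer_butterfly_indices"
        if ydimsz in [5, 13]: return "iterate_dct_inner_costable_indices"
        if ydimsz in [6, 14, 15]: return "iterate_dct_inner_halfswap_loadstore"
    return None

print(select_iterate_fn(0b01, 2))  # -> iterate_dct_inner_butterfly_indices
```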


[[!tag opf_rfc]]