add ls009 Notes
[libreriscv.git] / openpower / sv / rfc / ls009.mdwn
1 # RFC ls009 Simple-V REMAP Subsystem
2
3 **URLs**:
4
5 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
7 * <https://git.openpower.foundation/isa/PowerISA/issues/124>
8
9 **Severity**: Major
10
11 **Status**: New
12
13 **Date**: 26 Mar 2023
14
15 **Target**: v3.2B
16
17 **Source**: v3.0B
18
19 **Books and Section affected**:
20
21 ```
22 Book I, new Zero-Overhead-Loop Chapter.
23 Appendix E Power ISA sorted by opcode
24 Appendix F Power ISA sorted by version
25 Appendix G Power ISA sorted by Compliancy Subset
26 Appendix H Power ISA sorted by mnemonic
27 ```
28
29 **Summary**
30
31 ```
32 svremap - Re-Mapping of Register Element Offsets
33 svindex - General-purpose setting of SHAPEs to be re-mapped
34 svshape - Hardware-level setting of SHAPEs for element re-mapping
35 svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
36 ```
37
38 **Submitter**: Luke Leighton (Libre-SOC)
39
40 **Requester**: Libre-SOC
41
42 **Impact on processor**:
43
44 ```
45 Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
46 Management Instructions which provide advanced features such as Matrix
47 FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
48 ```
49
50 **Impact on software**:
51
52 ```
53 Requires support for new instructions in assembler, debuggers,
54 and related tools.
55 ```
56
57 **Keywords**:
58
59 ```
60 Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
61 Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
62 Digital Signal Processing (DSP)
63 ```
64
65 **Motivation**
66
67 These REMAP Management instructions provide state-of-the-art advanced capabilities
68 to dramatically decrease instruction count and power reduction whilst retaining
69 unprecedented general-purpose capability and a standard Sequential Execution Model.
70
71 **Notes and Observations**:
72
73 1. Although costly the alternatives in SIMD-paradigm software result in
74 huge algorithmic complexity and associated power consumption increases.
75 Loop-unrolling compilers are prevalent as is thousands to tens of
76 thousands of instructions.
77 2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
78 to a handful of instructions.
79 3. No REMAP instructions with the exception of Indexed rely on registers
80 for the establishment of REMAP capability.
81 4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
82 and flexibility of REMAP.
83 5. The Simple-V Compliancy Subsets make REMAP optional except in the
84 Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
85 hardware to implement REMAP.
86
87 **Changes**
88
89 Add the following entries to:
90
91 * the Appendices of SV Book
92 * Instructions of SV Book as a new Section
93 * SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
94
95 ----------------
96
97 \newpage{}
98
99 [[!inline pages="openpower/sv/remap" raw=yes ]]
100
101 # Forms
102
103 Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
104
105 Add the following to Book I, 1.6.1, SVI-Form
106
107 ```
108 |0 |6 |11 |16 |21 |23 |24|25|26 31|
109 | PO | SVG|rmm | SVd |ew |SVyx|mm|sk| XO |
110 ```
111
112 Add the following to Book I, 1.6.1, SVM-Form
113
114 ```
115 |0 |6 |11 |16 |21 |25 |26 |31 |
116 | PO | SVxd | SVyd | SVzd | SVrm |vf | XO |
117 ```
118
119 Add the following to Book I, 1.6.1, SVM2-Form
120
121 ```
122 |0 |6 |10 |11 |16 |21 |24|25 |26 |31 |
123 | PO | SVo |SVyx| rmm | SVd |XO |mm|sk | XO |
124 ```
125
126 Add the following to Book I, 1.6.1, SVRM-Form
127
128 ```
129 |0 |6 |11 |13 |15 |17 |19 |21 |22 |26 |31 |
130 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |/// | XO |
131 ```
132
133 Add the following to Book I, 1.6.2
134
135 ```
136 mi0 (11:12)
137 Field used in REMAP to select the SVSHAPE for 1st input register
138 Formats: SVRM
139 mi1 (13:14)
140 Field used in REMAP to select the SVSHAPE for 2nd input register
141 Formats: SVRM
142 mi2 (15:16)
143 Field used in REMAP to select the SVSHAPE for 3rd input register
144 Formats: SVRM
145 mm (24)
146 Field used to specify the meaning of the rmm field for SVI-Form
147 and SVM2-Form
148 Formats: SVI, SVM2
149 mo0 (17:18)
150 Field used in REMAP to select the SVSHAPE for 1st output register
151 Formats: SVRM
152 mo1 (19:20)
153 Field used in REMAP to select the SVSHAPE for 2nd output register
154 Formats: SVRM
155 pst (21)
156 Field used in REMAP to indicate "persistence" mode (REMAP
157 continues to apply to multiple instructions)
158 Formats: SVRM
159 rmm (11:15)
160 REMAP Mode field for SVI-Form and SVM2-Form
161 Formats: SVI, SVM2
162 sk (25)
163 Field used to specify dimensional skipping in svindex
164 Formats: SVI, SVM2
165 SVd (16:20)
166 Immediate field used to specify the size of the REMAP dimension
167 in the svindex and svshape2 instructions
168 Formats: SVI, SVM2
169 SVDS (16:29)
170 Immediate field used to specify a 9-bit signed
171 two's complement integer which is concatenated
172 on the right with 0b00 and sign-extended to 64 bits.
173 Formats: SVDS
174 SVG (6:10)
175 Field used to specify a GPR to be used as a
176 source for indexing.
177 Formats: SVI
178 SVi (16:22)
179 Simple-V immediate field for setting VL or MVL
180 Formats: SVL
181 SVme (6:10)
182 Simple-V "REMAP" map-enable bits (0-4)
183 Formats: SVRM
184 SVo (6:9)
185 Field used by the svshape2 instruction as an offset
186 Formats: SVM2
187 SVrm (21:24)
188 Simple-V "REMAP" Mode
189 Formats: SVM
190 SVxd (6:10)
191 Simple-V "REMAP" x-dimension size
192 Formats: SVM
193 SVyd (11:15)
194 Simple-V "REMAP" y-dimension size
195 Formats: SVM
196 SVzd (16:20)
197 Simple-V "REMAP" z-dimension size
198 Formats: SVM
199 XO (21:23,26:31)
200 Extended opcode field. Note that bit 21 must be 1, 22 and 23
201 must be zero, and bits 26-31 must be exactly the same as
202 used for svshape.
203 Formats: SVM2
204 ```
205
206 # Appendices
207
208 Appendix E Power ISA sorted by opcode
209 Appendix F Power ISA sorted by version
210 Appendix G Power ISA sorted by Compliancy Subset
211 Appendix H Power ISA sorted by mnemonic
212
213 | Form | Book | Page | Version | mnemonic | Description |
214 |------|------|------|---------|----------|-------------|
215 | SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
216 | SVM | I | # | 3.0B | svshape | REMAP shape instruction |
217 | SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
218 | SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |
219
220 [[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
221
222 ## REMAP pseudocode
223
224 Written in python3 the following stand-alone executable source code is the Canonical
225 Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
226 in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
227 a single CR Field (CR0) on `svstep.`. The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
228 offset is put through each algorithm to determine the actual Element Offset.
229 Alternative implementations producing different ordering
230 is prohibited as software will be critically relying on these Deterministic Schedules.
231
232 ### REMAP 2D/3D Matrix
233
234 The following stand-alone executable source code is the Canonical
235 Specification for Matrix (2D/3D) REMAP.
236 Hardware implementations are achievable with simple cascading counter-and-compares.
237
238 ```
239 # python "yield" can be iterated. use this to make it clear how
240 # the indices are generated by using natural-looking nested loops
241 def iterate_indices(SVSHAPE):
242 # get indices to iterate over, in the required order
243 xd = SVSHAPE.lims[0]
244 yd = SVSHAPE.lims[1]
245 zd = SVSHAPE.lims[2]
246 # create lists of indices to iterate over in each dimension
247 x_r = list(range(xd))
248 y_r = list(range(yd))
249 z_r = list(range(zd))
250 # invert the indices if needed
251 if SVSHAPE.invxyz[0]: x_r.reverse()
252 if SVSHAPE.invxyz[1]: y_r.reverse()
253 if SVSHAPE.invxyz[2]: z_r.reverse()
254 # start an infinite (wrapping) loop
255 step = 0 # track src/dst step
256 while True:
257 for z in z_r: # loop over 1st order dimension
258 z_end = z == z_r[-1]
259 for y in y_r: # loop over 2nd order dimension
260 y_end = y == y_r[-1]
261 for x in x_r: # loop over 3rd order dimension
262 x_end = x == x_r[-1]
263 # ok work out which order to construct things in.
264 # start by creating a list of tuples of the dimension
265 # and its limit
266 vals = [(SVSHAPE.lims[0], x, "x"),
267 (SVSHAPE.lims[1], y, "y"),
268 (SVSHAPE.lims[2], z, "z")
269 ]
270 # now select those by order. this allows us to
271 # create schedules for [z][x], [x][y], or [y][z]
272 # for matrix multiply.
273 vals = [vals[SVSHAPE.order[0]],
274 vals[SVSHAPE.order[1]],
275 vals[SVSHAPE.order[2]]
276 ]
277 # ok now we can construct the result, using bits of
278 # "order" to say which ones get stacked on
279 result = 0
280 mult = 1
281 for i in range(3):
282 lim, idx, dbg = vals[i]
283 # some of the dimensions can be "skipped". the order
284 # was actually selected above on all 3 dimensions,
285 # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
286 # those to be knocked out
287 if SVSHAPE.skip == i+1: continue
288 idx *= mult # shifts up by previous dimension(s)
289 result += idx # adds on this dimension
290 mult *= lim # for the next dimension
291
292 loopends = (x_end |
293 ((y_end and x_end)<<1) |
294 ((y_end and x_end and z_end)<<2))
295
296 yield result + SVSHAPE.offset, loopends
297 step += 1
298
299 def demo():
300 # set the dimension sizes here
301 xdim = 3
302 ydim = 2
303 zdim = 4
304
305 # set total (can repeat, e.g. VL=x*y*z*4)
306 VL = xdim * ydim * zdim
307
308 # set up an SVSHAPE
309 class SVSHAPE:
310 pass
311 SVSHAPE0 = SVSHAPE()
312 SVSHAPE0.lims = [xdim, ydim, zdim]
313 SVSHAPE0.order = [1,0,2] # experiment with different permutations, here
314 SVSHAPE0.mode = 0b00
315 SVSHAPE0.skip = 0b00
316 SVSHAPE0.offset = 0 # experiment with different offset, here
317 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
318
319 # enumerate over the iterator function, getting new indices
320 for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
321 if idx >= VL:
322 break
323 print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
324
325 # run the demo
326 if __name__ == '__main__':
327 demo()
328 ```
329
330 ### REMAP Parallel Reduction pseudocode
331
332 The python3 program below is stand-alone executable and is the Canonical Specification
333 for Parallel Reduction REMAP.
334 The Algorithm below is not limited to RADIX2 sizes, and Predicate
335 sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
336 has been applied, not before. MV operations are not required: the algorithm
337 tracks positions of elements that would normally be moved and when applying
338 an Element Reduction Operation sources the operands from their last-known (tracked)
339 position.
340
341 ```
342 # a "yield" version of the Parallel Reduction REMAP algorithm.
343 # the algorithm is in-place. it does not perform "MV" operations.
344 # instead, where a masked-out value *should* be read from is tracked
345
346 def iterate_indices(SVSHAPE, pred=None):
347 # get indices to iterate over, in the required order
348 xd = SVSHAPE.lims[0]
349 # create lists of indices to iterate over in each dimension
350 ix = list(range(xd))
351 # invert the indices if needed
352 if SVSHAPE.invxyz[0]: ix.reverse()
353 # start a loop from the lowest step
354 step = 1
355 steps = []
356 while step < xd:
357 step *= 2
358 steps.append(step)
359 # invert the indices if needed
360 if SVSHAPE.invxyz[1]: steps.reverse()
361 for step in steps:
362 stepend = (step == steps[-1]) # note end of steps
363 idxs = list(range(0, xd, step))
364 results = []
365 for i in idxs:
366 other = i + step // 2
367 ci = ix[i]
368 oi = ix[other] if other < xd else None
369 other_pred = other < xd and (pred is None or pred[oi])
370 if (pred is None or pred[ci]) and other_pred:
371 if SVSHAPE.skip == 0b00: # submode 00
372 result = ci
373 elif SVSHAPE.skip == 0b01: # submode 01
374 result = oi
375 results.append([result + SVSHAPE.offset, 0])
376 elif other_pred:
377 ix[i] = oi
378 if results:
379 results[-1][1] = (stepend<<1) | 1 # notify end of loops
380 yield from results
381
382 def demo():
383 # set the dimension sizes here
384 xdim = 9
385
386 # set up an SVSHAPE
387 class SVSHAPE:
388 pass
389 SVSHAPE0 = SVSHAPE()
390 SVSHAPE0.lims = [xdim, 0, 0]
391 SVSHAPE0.order = [0,1,2]
392 SVSHAPE0.mode = 0b10
393 SVSHAPE0.skip = 0b00
394 SVSHAPE0.offset = 0 # experiment with different offset, here
395 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
396
397 SVSHAPE1 = SVSHAPE()
398 SVSHAPE1.lims = [xdim, 0, 0]
399 SVSHAPE1.order = [0,1,2]
400 SVSHAPE1.mode = 0b10
401 SVSHAPE1.skip = 0b01
402 SVSHAPE1.offset = 0 # experiment with different offset, here
403 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
404
405 # enumerate over the iterator function, getting new indices
406 shapes = list(iterate_indices(SVSHAPE0)), \
407 list(iterate_indices(SVSHAPE1))
408 for idx in range(len(shapes[0])):
409 l = shapes[0][idx]
410 r = shapes[1][idx]
411 (l_idx, lend) = l
412 (r_idx, rend) = r
413 print ("%d->%d:%d" % (idx, l_idx, r_idx),
414 "end", bin(lend)[2:], bin(rend)[2:])
415
416 # run the demo
417 if __name__ == '__main__':
418 demo()
419 ```
420
421 ### REMAP FFT pseudocode
422
423 The FFT REMAP is RADIX2 only.
424
425 ```
426 # a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
427 # original code for the FFT Tukey-Cooley schedule:
428 # https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
429 """
430 # Radix-2 decimation-in-time FFT (real, not complex)
431 size = 2
432 while size <= n:
433 halfsize = size // 2
434 tablestep = n // size
435 for i in range(0, n, size):
436 k = 0
437 for j in range(i, i + halfsize):
438 jh = j+halfsize
439 jl = j
440 temp1 = vec[jh] * exptable[k]
441 temp2 = vec[jl]
442 vec[jh] = temp2 - temp1
443 vec[jl] = temp2 + temp1
444 k += tablestep
445 size *= 2
446 """
447
448 # python "yield" can be iterated. use this to make it clear how
449 # the indices are generated by using natural-looking nested loops
450 def iterate_butterfly_indices(SVSHAPE):
451 # get indices to iterate over, in the required order
452 n = SVSHAPE.lims[0]
453 stride = SVSHAPE.lims[2] # stride-multiplier on reg access
454 # creating lists of indices to iterate over in each dimension
455 # has to be done dynamically, because it depends on the size
456 # first, the size-based loop (which can be done statically)
457 x_r = []
458 size = 2
459 while size <= n:
460 x_r.append(size)
461 size *= 2
462 # invert order if requested
463 if SVSHAPE.invxyz[0]: x_r.reverse()
464
465 if len(x_r) == 0:
466 return
467
468 # start an infinite (wrapping) loop
469 skip = 0
470 while True:
471 for size in x_r: # loop over 3rd order dimension (size)
472 x_end = size == x_r[-1]
473 # y_r schedule depends on size
474 halfsize = size // 2
475 tablestep = n // size
476 y_r = []
477 for i in range(0, n, size):
478 y_r.append(i)
479 # invert if requested
480 if SVSHAPE.invxyz[1]: y_r.reverse()
481 for i in y_r: # loop over 2nd order dimension
482 y_end = i == y_r[-1]
483 k_r = []
484 j_r = []
485 k = 0
486 for j in range(i, i+halfsize):
487 k_r.append(k)
488 j_r.append(j)
489 k += tablestep
490 # invert if requested
491 if SVSHAPE.invxyz[2]: k_r.reverse()
492 if SVSHAPE.invxyz[2]: j_r.reverse()
493 for j, k in zip(j_r, k_r): # loop over 1st order dimension
494 z_end = j == j_r[-1]
495 # now depending on MODE return the index
496 if SVSHAPE.skip == 0b00:
497 result = j # for vec[j]
498 elif SVSHAPE.skip == 0b01:
499 result = j + halfsize # for vec[j+halfsize]
500 elif SVSHAPE.skip == 0b10:
501 result = k # for exptable[k]
502
503 loopends = (z_end |
504 ((y_end and z_end)<<1) |
505 ((y_end and x_end and z_end)<<2))
506
507 yield (result * stride) + SVSHAPE.offset, loopends
508
509 def demo():
510 # set the dimension sizes here
511 xdim = 8
512 ydim = 0 # not needed
513 zdim = 1 # stride must be set to 1
514
515 # set total. err don't know how to calculate how many there are...
516 # do it manually for now
517 VL = 0
518 size = 2
519 n = xdim
520 while size <= n:
521 halfsize = size // 2
522 tablestep = n // size
523 for i in range(0, n, size):
524 for j in range(i, i + halfsize):
525 VL += 1
526 size *= 2
527
528 # set up an SVSHAPE
529 class SVSHAPE:
530 pass
531 # j schedule
532 SVSHAPE0 = SVSHAPE()
533 SVSHAPE0.lims = [xdim, ydim, zdim]
534 SVSHAPE0.order = [0,1,2] # experiment with different permutations, here
535 SVSHAPE0.mode = 0b01
536 SVSHAPE0.skip = 0b00
537 SVSHAPE0.offset = 0 # experiment with different offset, here
538 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
539 # j+halfstep schedule
540 SVSHAPE1 = SVSHAPE()
541 SVSHAPE1.lims = [xdim, ydim, zdim]
542 SVSHAPE1.order = [0,1,2] # experiment with different permutations, here
543 SVSHAPE0.mode = 0b01
544 SVSHAPE1.skip = 0b01
545 SVSHAPE1.offset = 0 # experiment with different offset, here
546 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
547 # k schedule
548 SVSHAPE2 = SVSHAPE()
549 SVSHAPE2.lims = [xdim, ydim, zdim]
550 SVSHAPE2.order = [0,1,2] # experiment with different permutations, here
551 SVSHAPE0.mode = 0b01
552 SVSHAPE2.skip = 0b10
553 SVSHAPE2.offset = 0 # experiment with different offset, here
554 SVSHAPE2.invxyz = [0,0,0] # inversion if desired
555
556 # enumerate over the iterator function, getting new indices
557 schedule = []
558 for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
559 iterate_butterfly_indices(SVSHAPE1),
560 iterate_butterfly_indices(SVSHAPE2))):
561 if idx >= VL:
562 break
563 schedule.append((jl, jh, k))
564
565 # ok now pretty-print the results, with some debug output
566 size = 2
567 idx = 0
568 while size <= n:
569 halfsize = size // 2
570 tablestep = n // size
571 print ("size %d halfsize %d tablestep %d" % \
572 (size, halfsize, tablestep))
573 for i in range(0, n, size):
574 prefix = "i %d\t" % i
575 k = 0
576 for j in range(i, i + halfsize):
577 (jl, je), (jh, he), (ks, ke) = schedule[idx]
578 print (" %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
579 "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
580 (idx, prefix, j, j+halfsize, k,
581 jl, jh, ks,
582 ),
583 "end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
584 k += tablestep
585 idx += 1
586 size *= 2
587
588 # run the demo
589 if __name__ == '__main__':
590 demo()
591 ```
592
593 ### DCT REMAP
594
595 DCT REMAP is RADIX2 only. Convolutions may be applied as usual
596 to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
597 instructions the algorithm below (written in python3), becomes part
598 of an in-place in-registers Vectorised DCT. The algorithms work
599 by loading data such that as the nested loops progress the result
600 is sorted into correct sequential order.
601
602 ```
603 # DCT "REMAP" scheduler to create an in-place iterative DCT.
604 #
605
606 # bits of the integer 'val' of width 'width' are reversed
607 def reverse_bits(val, width):
608 result = 0
609 for _ in range(width):
610 result = (result << 1) | (val & 1)
611 val >>= 1
612 return result
613
614
615 # iterative version of [recursively-applied] half-reversing
616 # turns out this is Gray-Encoding.
617 def halfrev2(vec, pre_rev=True):
618 res = []
619 for i in range(len(vec)):
620 if pre_rev:
621 res.append(vec[i ^ (i>>1)])
622 else:
623 ri = i
624 bl = i.bit_length()
625 for ji in range(1, bl):
626 ri ^= (i >> ji)
627 res.append(vec[ri])
628 return res
629
630
631 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
632 # get indices to iterate over, in the required order
633 n = SVSHAPE.lims[0]
634 mode = SVSHAPE.lims[1]
635 stride = SVSHAPE.lims[2]
636
637 # reference list for not needing to do data-swaps, just swap what
638 # *indices* are referenced (two levels of indirection at the moment)
639 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
640 ji = list(range(n))
641
642 levels = n.bit_length() - 1
643 ri = [reverse_bits(i, levels) for i in range(n)]
644
645 if SVSHAPE.mode == 0b01: # FFT, bitrev only
646 ji = [ji[ri[i]] for i in range(n)]
647 elif SVSHAPE.submode2 == 0b001:
648 ji = [ji[ri[i]] for i in range(n)]
649 ji = halfrev2(ji, True)
650 else:
651 ji = halfrev2(ji, False)
652 ji = [ji[ri[i]] for i in range(n)]
653
654 # invert order if requested
655 if SVSHAPE.invxyz[0]:
656 ji.reverse()
657
658 for i, jl in enumerate(ji):
659 y_end = jl == ji[-1]
660 yield jl * stride, (0b111 if y_end else 0b000)
661
662 def iterate_dct_inner_costable_indices(SVSHAPE):
663 # get indices to iterate over, in the required order
664 n = SVSHAPE.lims[0]
665 mode = SVSHAPE.lims[1]
666 stride = SVSHAPE.lims[2]
667 # creating lists of indices to iterate over in each dimension
668 # has to be done dynamically, because it depends on the size
669 # first, the size-based loop (which can be done statically)
670 x_r = []
671 size = 2
672 while size <= n:
673 x_r.append(size)
674 size *= 2
675 # invert order if requested
676 if SVSHAPE.invxyz[0]:
677 x_r.reverse()
678
679 if len(x_r) == 0:
680 return
681
682 # start an infinite (wrapping) loop
683 skip = 0
684 z_end = 1 # doesn't exist in this, only 2 loops
685 k = 0
686 while True:
687 for size in x_r: # loop over 3rd order dimension (size)
688 x_end = size == x_r[-1]
689 # y_r schedule depends on size
690 halfsize = size // 2
691 y_r = []
692 for i in range(0, n, size):
693 y_r.append(i)
694 # invert if requested
695 if SVSHAPE.invxyz[1]: y_r.reverse()
696 # two lists of half-range indices, e.g. j 0123, jr 7654
697 j = list(range(0, halfsize))
698 # invert if requested
699 if SVSHAPE.invxyz[2]: j_r.reverse()
700 # loop over 1st order dimension
701 for ci, jl in enumerate(j):
702 y_end = jl == j[-1]
703 # now depending on MODE return the index. inner butterfly
704 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
705 result = k # offset into COS table
706 elif SVSHAPE.skip == 0b10: #
707 result = ci # coefficient helper
708 elif SVSHAPE.skip == 0b11: #
709 result = size # coefficient helper
710 loopends = (z_end |
711 ((y_end and z_end)<<1) |
712 ((y_end and x_end and z_end)<<2))
713
714 yield (result * stride) + SVSHAPE.offset, loopends
715 k += 1
716
717 def iterate_dct_inner_butterfly_indices(SVSHAPE):
718 # get indices to iterate over, in the required order
719 n = SVSHAPE.lims[0]
720 mode = SVSHAPE.lims[1]
721 stride = SVSHAPE.lims[2]
722 # creating lists of indices to iterate over in each dimension
723 # has to be done dynamically, because it depends on the size
724 # first, the size-based loop (which can be done statically)
725 x_r = []
726 size = 2
727 while size <= n:
728 x_r.append(size)
729 size *= 2
730 # invert order if requested
731 if SVSHAPE.invxyz[0]:
732 x_r.reverse()
733
734 if len(x_r) == 0:
735 return
736
737 # reference (read/write) the in-place data in *reverse-bit-order*
738 ri = list(range(n))
739 if SVSHAPE.submode2 == 0b01:
740 levels = n.bit_length() - 1
741 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
742
743 # reference list for not needing to do data-swaps, just swap what
744 # *indices* are referenced (two levels of indirection at the moment)
745 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
746 ji = list(range(n))
747 inplace_mode = True
748 if inplace_mode and SVSHAPE.submode2 == 0b01:
749 ji = halfrev2(ji, True)
750 if inplace_mode and SVSHAPE.submode2 == 0b11:
751 ji = halfrev2(ji, False)
752
753 # start an infinite (wrapping) loop
754 while True:
755 k = 0
756 k_start = 0
757 for size in x_r: # loop over 3rd order dimension (size)
758 x_end = size == x_r[-1]
759 # y_r schedule depends on size
760 halfsize = size // 2
761 y_r = []
762 for i in range(0, n, size):
763 y_r.append(i)
764 # invert if requested
765 if SVSHAPE.invxyz[1]: y_r.reverse()
766 for i in y_r: # loop over 2nd order dimension
767 y_end = i == y_r[-1]
768 # two lists of half-range indices, e.g. j 0123, jr 7654
769 j = list(range(i, i + halfsize))
770 jr = list(range(i+halfsize, i + size))
771 jr.reverse()
772 # invert if requested
773 if SVSHAPE.invxyz[2]:
774 j.reverse()
775 jr.reverse()
776 hz2 = halfsize // 2 # zero stops reversing 1-item lists
777 # loop over 1st order dimension
778 k = k_start
779 for ci, (jl, jh) in enumerate(zip(j, jr)):
780 z_end = jl == j[-1]
781 # now depending on MODE return the index. inner butterfly
782 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
783 if SVSHAPE.submode2 == 0b11: # iDCT
784 result = ji[ri[jl]] # lower half
785 else:
786 result = ri[ji[jl]] # lower half
787 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
788 if SVSHAPE.submode2 == 0b11: # iDCT
789 result = ji[ri[jl+halfsize]] # upper half
790 else:
791 result = ri[ji[jh]] # upper half
792 elif mode == 4:
793 # COS table pre-generated mode
794 if SVSHAPE.skip == 0b10: #
795 result = k # cos table offset
796 else: # mode 2
797 # COS table generated on-demand ("Vertical-First") mode
798 if SVSHAPE.skip == 0b10: #
799 result = ci # coefficient helper
800 elif SVSHAPE.skip == 0b11: #
801 result = size # coefficient helper
802 loopends = (z_end |
803 ((y_end and z_end)<<1) |
804 ((y_end and x_end and z_end)<<2))
805
806 yield (result * stride) + SVSHAPE.offset, loopends
807 k += 1
808
809 # now in-place swap
810 if inplace_mode:
811 for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
812 jlh = jl+halfsize
813 tmp1, tmp2 = ji[jlh], ji[jh]
814 ji[jlh], ji[jh] = tmp2, tmp1
815
816 # new k_start point for cos tables( runs inside x_r loop NOT i loop)
817 k_start += halfsize
818
819
820 # python "yield" can be iterated. use this to make it clear how
821 # the indices are generated by using natural-looking nested loops
822 def iterate_dct_outer_butterfly_indices(SVSHAPE):
823 # get indices to iterate over, in the required order
824 n = SVSHAPE.lims[0]
825 mode = SVSHAPE.lims[1]
826 stride = SVSHAPE.lims[2]
827 # creating lists of indices to iterate over in each dimension
828 # has to be done dynamically, because it depends on the size
829 # first, the size-based loop (which can be done statically)
830 x_r = []
831 size = n // 2
832 while size >= 2:
833 x_r.append(size)
834 size //= 2
835 # invert order if requested
836 if SVSHAPE.invxyz[0]:
837 x_r.reverse()
838
839 if len(x_r) == 0:
840 return
841
842 # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
843 ri = list(range(n))
844 if SVSHAPE.submode2 in [0b11, 0b01]:
845 levels = n.bit_length() - 1
846 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
847
848 # reference list for not needing to do data-swaps, just swap what
849 # *indices* are referenced (two levels of indirection at the moment)
850 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
851 ji = list(range(n))
852 inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
853 if SVSHAPE.submode2 == 0b11:
854 ji = halfrev2(ji, False)
855
856 # start an infinite (wrapping) loop
857 while True:
858 k = 0
859 k_start = 0
860 for size in x_r: # loop over 3rd order dimension (size)
861 halfsize = size//2
862 x_end = size == x_r[-1]
863 y_r = list(range(0, halfsize))
864 # invert if requested
865 if SVSHAPE.invxyz[1]: y_r.reverse()
866 for i in y_r: # loop over 2nd order dimension
867 y_end = i == y_r[-1]
868 # one list to create iterative-sum schedule
869 jr = list(range(i+halfsize, i+n-halfsize, size))
870 # invert if requested
871 if SVSHAPE.invxyz[2]: jr.reverse()
872 hz2 = halfsize // 2 # zero stops reversing 1-item lists
873 k = k_start
874 for ci, jh in enumerate(jr): # loop over 1st order dimension
875 z_end = jh == jr[-1]
876 if mode == 4:
877 # COS table pre-generated mode
878 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
879 if SVSHAPE.submode2 == 0b11: # iDCT
880 result = ji[ri[jh]] # upper half
881 else:
882 result = ri[ji[jh]] # lower half
883 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
884 if SVSHAPE.submode2 == 0b11: # iDCT
885 result = ji[ri[jh+size]] # upper half
886 else:
887 result = ri[ji[jh+size]] # upper half
888 elif SVSHAPE.skip == 0b10: #
889 result = k # cos table offset
890 else:
891 # COS table generated on-demand ("Vertical-First") mode
892 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
893 if SVSHAPE.submode2 == 0b11: # iDCT
894 result = ji[ri[jh]] # lower half
895 else:
896 result = ri[ji[jh]] # lower half
897 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
898 if SVSHAPE.submode2 == 0b11: # iDCT
899 result = ji[ri[jh+size]] # upper half
900 else:
901 result = ri[ji[jh+size]] # upper half
902 elif SVSHAPE.skip == 0b10: #
903 result = ci # coefficient helper
904 elif SVSHAPE.skip == 0b11: #
905 result = size # coefficient helper
906 loopends = (z_end |
907 ((y_end and z_end)<<1) |
908 ((y_end and x_end and z_end)<<2))
909
910 yield (result * stride) + SVSHAPE.offset, loopends
911 k += 1
912
913 # new k_start point for cos tables( runs inside x_r loop NOT i loop)
914 k_start += halfsize
915
916 ```
917
918 ## REMAP selector
919
920 Selecting which REMAP Schedule to use is shown by the pseudocode below.
921 Each SVSHAPE 0-3 goes through this selection process.
922
923 ```
924 if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
925 if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
926 if SVSHAPEn.mode in [0b01, 0b11]:
927 # further sub-selection
928 if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
929 if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
930 if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
931 if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
932 if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore
933 ```
934
935
936 [[!tag opf_rfc]]