c5bbb56294b4c2635f1a95cfa0e49703d41d38af
[libreriscv.git] / openpower / sv / remap.mdwn
1 [[!tag standards]]
2
3 # REMAP <a name="remap" />
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=143> matrix multiply
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=867> add svindex
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=885> svindex in simulator
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=911> offset svshape option
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=864> parallel reduction
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=930> DCT/FFT "strides"
11 * see [[sv/remap/appendix]] for examples and usage
12 * see [[sv/propagation]] for a future way to apply REMAP
13 * [[remap/discussion]]
14
15 REMAP is an advanced form of Vector "Structure Packing" that
16 provides hardware-level support for commonly-used *nested* loop patterns.
17 For more general reordering an Indexed REMAP mode is available.
18
19 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
20 from a linear form to a 2D or 3D transposed form, or "offset" to permit
21 arbitrary access to elements (when elwidth overrides are used),
22 independently on each Vector src or dest
23 register.
24
25 The initial primary motivation of REMAP was for Matrix Multiplication, reordering of sequential
26 data in-place: in-place DCT and FFT were easily justified given the
27 high usage in Computer Science.
28 Four SPRs are provided which may be applied to any GPR, FPR or CR Field
29 so that for example a single FMAC may be
30 used in a single loop to perform 5x3 times 3x4 Matrix multiplication,
31 generating 60 FMACs *without needing explicit assembler unrolling*.
32 Additional uses include regular "Structure Packing"
33 such as RGB pixel data extraction and reforming.
34
35 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
36 Vector ISAs which would typically only have a limited set of instructions
37 that can be structure-packed (LD/ST typically), REMAP may be applied to
38 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
39
40 Note that REMAP does not *directly* apply to sub-vector elements: that
41 is what swizzle is for. Swizzle *can* however be applied to the same
42 instruction as REMAP. As explained in [[sv/mv.swizzle]], [[sv/mv.vec]] and the [[svp64/appendix]], Pack and Unpack EXTRA Mode bits
43 can extend down into Sub-vector elements to perform vec2/vec3/vec4
44 sequential reordering, but even here, REMAP is not extended down to
45 the actual sub-vector elements themselves.
46
47 In its general form, REMAP is quite expensive to set up, and on some
48 implementations may introduce
49 latency, so should realistically be used only where it is worthwhile.
50 Commonly-used patterns such as Matrix Multiply, DCT and FFT have
51 helper instruction options which make REMAP easier to use.
52
53 There are four types of REMAP:
54
55 * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
56 Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
57 Matrix Multiply.
58 * **FFT/DCT**, with full triple-loop in-place support: limited to
59 Power-2 RADIX
60 * **Indexing**, for any general-purpose reordering, also includes
61 limited 2D reshaping.
62 * **Parallel Reduction**, for scheduling a sequence of operations
63 in a Deterministic fashion, in a way that may be parallelised,
64 to reduce a Vector down to a single value.
65
66 Best implemented on top of a Multi-Issue Out-of-Order Micro-architecture,
67 REMAP Schedules are 100% Deterministic **including Indexing** and are
68 designed to be incorporated in between the Decode and Issue phases,
69 directly into Register Hazard Management.
70
71 Parallel Reduction is unusual in that it requires a full vector array
72 of results (not a scalar) and uses the rest of the result Vector for
73 the purposes of storing intermediary calculations. As these intermediary
74 results are Deterministically computed they may be useful.
75 Additionally, because the intermediate results are always written out
76 it is possible to service Precise Interrupts without affecting latency
77 (a common limitation of Vector ISAs).
78
79 # Basic principle
80
81 * normal vector element read/write of operands would be sequential
82 (0 1 2 3 ....)
83 * this is not appropriate for (e.g.) Matrix multiply which requires
84 accessing elements in alternative sequences (0 3 6 1 4 7 ...)
85 * normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
86 with this. both are expensive (copy large vectors, spill through memory)
87 and very few Packed SIMD ISAs cope with non-Power-2.
88 * REMAP **redefines** the order of access according to set
89 (Deterministic) "Schedules".
90 * The Schedules are not at all restricted to power-of-two boundaries
91 making it unnecessary to have for example specialised 3x4 transpose
92 instructions of other Vector ISAs.
93
94 Only the most commonly-used algorithms in computer science have REMAP
95 support, due to the high cost in both the ISA and in hardware. For
96 arbitrary remapping the `Indexed` REMAP may be used.
97
98 # Example Usage
99
100 * `svshape` to set the type of reordering to be applied to an
101 otherwise usual `0..VL-1` hardware for-loop
102 * `svremap` to set which registers a given reordering is to apply to
103 (RA, RT etc)
104 * `sv.{instruction}` where any Vectorised register marked by `svremap`
105 will have its ordering REMAPPED according to the schedule set
106 by `svshape`.
107
108 The following illustrative example multiplies a 3x4 and a 5x3
109 matrix to create
110 a 5x4 result:
111
112 svshape 5, 4, 3, 0, 0
113 svremap 15, 1, 2, 3, 0, 0, 0, 0
114 sv.fmadds *0, *8, *16, *0
115
116 * svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
117 * svremap activates four out of five registers RA RB RC RT RS (15)
118 * svremap requests:
119 - RA to use SVSHAPE1
120 - RB to use SVSHAPE2
121 - RC to use SVSHAPE3
122 - RT to use SVSHAPE0
123 - RS Remapping to not be activated
124 * sv.fmadds has RT=0.v, RA=8.v, RB=16.v, RC=0.v
125 * With REMAP being active each register's element index is
126 *independently* transformed using the specified SHAPEs.
127
128 Thus the Vector Loop is arranged such that the use of
129 the multiply-and-accumulate instruction executes precisely the required
130 Schedule to perform an in-place in-registers Matrix Multiply with no
131 need to perform additional Transpose or register copy instructions.
132 The example above may be executed as a unit test and demo,
133 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
134
135 # REMAP types
136
137 This section summarises the motivation for each REMAP Schedule
138 and briefly goes over their characteristics and limitations.
139 Further details on the Deterministic Precise-Interruptible algorithms
140 used in these Schedules is found in the [[sv/remap/appendix]].
141
142 ## Matrix (1D/2D/3D shaping)
143
144 Matrix Multiplication is a huge part of High-Performance Compute,
145 and 3D.
146 In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
147 Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
148 cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
149 Aside from the cost of the load on the L1 I-Cache, the trick only
150 works if one of the dimensions X or Y are power-two. Prime Numbers
151 (5x7, 3x5) become deeply problematic to unroll.
152
153 Even traditional Scalable Vector ISAs have issues with Matrices, often
154 having to perform data Transpose by pushing out through Memory and back,
155 or computing Transposition Indices (costly) then copying to another
156 Vector (costly).
157
158 Matrix REMAP was thus designed to solve these issues by providing Hardware
159 Assisted
160 "Schedules" that can view what would otherwise be limited to a strictly
161 linear Vector as instead being 2D (even 3D) *in-place* reordered.
162 With both Transposition and non-power-two being supported the issues
163 faced by other ISAs are mitigated.
164
165 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
166 restricted to 127: up to 127 FMAs (or other operation)
167 may be performed in total.
168 Also given that it is in-registers only at present some care has to be
169 taken on regfile resource utilisation. However it is perfectly possible
170 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
171 the usual 6-level large Matrix Multiply, without the usual difficulties
172 associated with SIMD.
173
174 Also the `svshape` instruction only provides access to part of the
175 Matrix REMAP capability. Rotation and mirroring need to be done by
176 programming the SVSHAPE SPRs directly, which can take a lot more
177 instructions.
178
179 ## FFT/DCT Triple Loop
180
181 DCT and FFT are some of the most astonishingly used algorithms in
182 Computer Science. Radar, Audio, Video, R.F. Baseband and dozens more. At least
183 two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored
184 to FFT.
185
186 An in-depth analysis showed that it is possible to do in-place in-register
187 DCT and FFT as long as twin-result "butterfly" instructions are provided.
188 These can be found in the [[openpower/isa/svfparith]] page if performing
189 IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
190 integer operations would be required)*. These "butterfly" instructions
191 avoid the need for a temporary register because the two array positions
192 being overwritten will be "in-flight" in any In-Order or Out-of-Order
193 micro-architecture.
194
195 DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
196 accept predicate masks. Given that it is common to perform recursive
197 convolutions combining smaller Power-2 DCT/FFT to create larger DCT/FFTs
198 in practice the RADIX2 limit is not a problem. A Bluestein convolution
199 to compute arbitrary length is demonstrated by
200 [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
201
202 ## Indexed
203
204 The purpose of Indexing is to provide a generalised version of
205 Vector ISA "Permute" instructions, such as VSX `vperm`. The
206 Indexing is abstracted out and may be applied to much more
207 than an element move/copy, and is not limited for example
208 to the number of bytes that can fit into a VSX register.
209 Indexing may be applied to LD/ST (even on Indexed LD/ST
210 instructions such as `sv.lbzx`), arithmetic operations,
211 extsw: there is no artificial limit.
212
213 The only major caveat is that the registers to be used as
214 Indices must not be modified by any instruction after Indexed Mode
215 is established, and neither must MAXVL be altered. Additionally,
216 no register used as an Index may exceed MAXVL-1.
217
218 Failure to observe
219 these conditions results in `UNDEFINED` behaviour.
220 These conditions allow a Read-After-Write (RAW) Hazard to be created on
221 the entire range of Indices to be subsequently used, but a corresponding
222 Write-After-Read Hazard by any instruction that modifies the Indices
223 **does not have to be created**. Given the large number of registers
224 involved in Indexing this is a huge resource saving and reduction
225 in micro-architectural complexity. MAXVL is likewise
226 included in the RAW Hazards because it is involved in calculating
227 how many registers are to be considered Indices.
228
229 With these Hazard Mitigations in place, high-performance implementations
230 may read-cache the Indices from the point where a given `svindex` instruction
231 is called (or SVSHAPE SPRs - and MAXVL- directly altered).
232
233 The original motivation for Indexed REMAP was to mitigate the need to add
234 an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as
235 a stand-alone instruction. Usually a Vector ISA would add a non-conflicting
236 variant (as in VSX `vperm`) but it is common to need to permute by source,
237 with the risk of conflict, that has to be resolved, for example, in AVX-512
238 with `conflictd`.
239
240 Indexed REMAP on the other hand **does not prevent conflicts** (overlapping
241 destinations), which on a superficial analysis may be perceived to be a
242 problem, until it is recalled that, firstly, Simple-V is designed specifically
243 to require Program Order to be respected, and that Matrix, DCT and FFT
244 all *already* critically depend on overlapping Reads/Writes: Matrix
245 uses overlapping registers as accumulators. Thus the Register Hazard
246 Management needed by Indexed REMAP *has* to be in place anyway.
247
248 The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
249 clearly that of the additional reading of the GPRs to be used as Indices,
250 plus the setup cost associated with creating those same Indices.
251 If any Deterministic REMAP can cover the required task, clearly it
252 is adviseable to use it instead.
253
254 *Programmer's note: some algorithms may require skipping of Indices exceeding
255 VL-1, not MAXVL-1. This may be achieved programmatically by performing
256 an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
257 and RB contains the value of VL returned from `setvl`. The resultant
258 CR Fields may then be used as Predicate Masks to exclude those operations
259 with an Index exceeding VL-1.*
260
261 ## Parallel Reduction
262
263 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
264 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
265 *appearance* and *effect* of Reduction.
266
267 In Horizontal-First Mode, Vector-result reduction **requires**
268 the destination to be a Vector, which will be used to store
269 intermediary results.
270
271 Given that the tree-reduction schedule is deterministic,
272 Interrupts and exceptions
273 can therefore also be precise. The final result will be in the first
274 non-predicate-masked-out destination element, but due again to
275 the deterministic schedule programmers may find uses for the intermediate
276 results.
277
278 When Rc=1 a corresponding Vector of co-resultant CRs is also
279 created. No special action is taken: the result and its CR Field
280 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
281
282 Note that the Schedule only makes sense on top of certain instructions:
283 X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
284 and the destination are all the same type. Like Scalar
285 Reduction, nothing is prohibited:
286 the results of execution on an unsuitable instruction may simply
287 not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
288 may be used.
289
290 Critical to note regarding use of Parallel-Reduction REMAP is that,
291 exactly as with all REMAP Modes, the `svshape` instruction *requests*
292 a certain Vector Length (number of elements to reduce) and then
293 sets VL and MAXVL at the number of **operations** needed to be
294 carried out. Thus, equally as importantly, like Matrix REMAP
295 the total number of operations
296 is restricted to 127. Any Parallel-Reduction requiring more operations
297 will need to be done manually in batches (hierarchical
298 recursive Reduction).
299
300 Also important to note is that the Deterministic Schedule is arranged
301 so that some implementations *may* parallelise it (as long as doing so
302 respects Program Order and Register Hazards). Performance (speed)
303 of any given
304 implementation is neither strictly defined or guaranteed. As with
305 the Vulkan(tm) Specification, strict compliance is paramount whilst
306 performance is at the discretion of Implementors.
307
308 **Parallel-Reduction with Predication**
309
310 To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
311 completely separate from the actual element-level (scalar) operations,
312 Move operations are **not** included in the Schedule. This means that
313 the Schedule leaves the final (scalar) result in the first-non-masked
314 element of the Vector used. With the predicate mask being dynamic
315 (but deterministic) this result could be anywhere.
316
317 If that result is needed to be moved to a (single) scalar register
318 then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
319 needed to get it, where the predicate is the exact same predicate used
320 in the prior Parallel-Reduction instruction.
321
322 * If there was only a single
323 bit in the predicate then the result will not have moved or been altered
324 from the source vector prior to the Reduction
325 * If there was more than one bit the result will be in the
326 first element with a predicate bit set.
327
328 In either case the result is in the element with the first bit set in
329 the predicate mask.
330
331 For *some* implementations
332 the vector-to-scalar copy may be a slow operation, as may the Predicated
333 Parallel Reduction itself.
334 It may be better to perform a pre-copy
335 of the values, compressing them (VREDUCE-style) into a contiguous block,
336 which will guarantee that the result goes into the very first element
337 of the destination vector, in which case clearly no follow-up
338 vector-to-scalar MV operation is needed.
339
340 **Usage conditions**
341
342 The simplest usage is to perform an overwrite, specifying all three
343 register operands the same.
344
345 svshape parallelreduce, 6
346 sv.add *8, *8, *8
347
348 The Reduction Schedule will issue the Parallel Tree Reduction spanning
349 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
350 necessary (see "Parallel Reduction algorithm" in a later section).
351
352 A non-overwrite is possible as well but just as with the overwrite
353 version, only those destination elements necessary for storing
354 intermediary computations will be written to: the remaining elements
355 will **not** be overwritten and will **not** be zero'd.
356
357 svshape parallelreduce, 6
358 sv.add *0, *8, *8
359
360 However it is critical to note that if the source and destination are
361 not the same then the trick of using a follow-up vector-scalar MV will
362 not work.
363
364 ## Sub-Vector Horizontal Reduction
365
366 Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed
367 on all first Subvector elements, followed by another separate independent
368 Parallel Reduction on all the second Subvector elements and so on.
369
370 for selectsubelement in (x,y,z,w):
371 parallelreduce(0..VL-1, selectsubelement)
372
373 By contrast, when SVM is set and SUBVL!=1, a Horizontal
374 Subvector mode is enabled, applying the Parallel Reduction
375 Algorithm to the Subvector Elements. The Parallel Reduction
376 is independently applied VL times, to each group of Subvector
377 elements. Bear in mind that predication is never applied down
378 into individual Subvector elements, but will be applied
379 to select whether the *entire* Parallel Reduction on each
380 group is performed or not.
381
382  for (i = 0; i < VL; i++)
383 if (predval & 1<<i) # predication
384 el = element[i]
385 parallelreduction([el.x, el.y, el.z, el.w])
386
387 Note that as this is a Parallel Reduction, for best results
388 it should be an overwrite operation, where the result for
389 the Horizontal Reduction of each Subvector will be in the
390 first Subvector element.
391 Also note that use of Rc=1 is `UNDEFINED` behaviour.
392
393 In essence what is happening here is that Structure Packing is being
394 combined with Parallel Reduction. If the Subvector elements may be
395 laid out as a 2D matrix, with the Subvector elements on rows,
396 and Parallel Reduction is applied per row, then if `SVM` is **clear**
397 the Matrix is transposed (like Pack/Unpack)
398 before still applying the Parallel Reduction to the **row**.
399
400 # Determining Register Hazards
401
402 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
403 to be able to statically determine the extent of Vectors in order to
404 allocate pre-emptive Hazard protection. The next task is to eliminate
405 masked-out elements using predicate bits, freeing up the associated
406 Hazards.
407
408 For non-REMAP situations `VL` is sufficient to ascertain early
409 Hazard coverage, and with SVSTATE being a high priority cached
410 quantity at the same level of MSR and PC this is not a problem.
411
412 The problems come when REMAP is enabled. Indexed REMAP must instead
413 use `MAXVL` as the earliest (simplest)
414 batch-level Hazard Reservation indicator,
415 but Matrix, FFT and Parallel Reduction must all use completely different
416 schemes. The reason is that VL is used to step through the total
417 number of *operations*, not the number of registers. The "Saving Grace"
418 is that all of the REMAP Schedules are Deterministic.
419
420 Advance-notice Parallel computation and subsequent cacheing
421 of all of these complex Deterministic REMAP Schedules is
422 *strongly recommended*, thus allowing clear and precise multi-issue
423 batched Hazard coverage to be deployed, *even for Indexed Mode*.
424 This is only possible for Indexed due to the strict guidelines
425 given to Programmers.
426
427 In short, there exists solutions to the problem of Hazard Management,
428 with varying degrees of refinement possible at correspondingly
429 increasing levels of complexity in hardware.
430
431 # REMAP area of SVSTATE
432
433 The following bits of the SVSTATE SPR are used for REMAP:
434
435 |32.33|34.35|36.37|38.39|40.41| 42.46 | 62 |
436 | -- | -- | -- | -- | -- | ----- | ------ |
437 |mi0 |mi1 |mi2 |mo0 |mo1 | SVme | RMpst |
438
439 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
440 mi0-2 apply to RA, RB, RC respectively, as input registers, and
441 likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
442 SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
443 SVSHAPE is actively applied or not.
444
445 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
446 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
447 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
448 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
449 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
450 (LD/ST-with-update has an implicit 2nd write register, RA)
451
452 # svremap instruction <a name="svremap"> </a>
453
454 There is also a corresponding SVRM-Form for the svremap
455 instruction which matches the above SPR:
456
457 svremap SVme,mi0,mi1,mi2,mo0,mo2,pst
458
459 |0 |6 |11 |13 |15 |17 |19 |21 | 22.25 |26..31 |
460 | -- | -- | -- | -- | -- | -- | -- | -- | ---- | ----- |
461 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO |
462
463 # SHAPE Remapping SPRs
464
465 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
466 which have the same format.
467
468 Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is
469 disabled: the register's elements are a linear (1D) vector.
470
471 |31.30|29..28 |27..24| 23..21 | 20..18 | 17..12 |11..6 |5..0 | Mode |
472 |---- |------ |------| ------ | ------- | ------- |----- |----- | ----- |
473 |0b00 |skip |offset| invxyz | permute | zdimsz |ydimsz|xdimsz|Matrix |
474 |0b00 |elwidth|offset|sk1/invxy|0b110/0b111|SVGPR|ydimsz|xdimsz|Indexed|
475 |0b01 |submode|offset| invxyz | submode2| zdimsz |mode |xdimsz|DCT/FFT|
476 |0b10 |submode|offset| invxyz | rsvd | rsvd |rsvd |xdimsz|Preduce|
477 |0b11 | | | | | | | |rsvd |
478
479 mode sets different behaviours (straight matrix multiply, FFT, DCT).
480
481 * **mode=0b00** sets straight Matrix Mode
482 * **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
483 * **mode=0b01** sets "FFT/DCT" mode and activates submodes
484 * **mode=0b10** sets "Parallel Reduction" Schedules.
485
486 ## Parallel Reduction Mode
487
488 Creates the Schedules for Parallel Tree Reduction.
489
490 * **submode=0b00** selects the left operand index
491 * **submode=0b01** selects the right operand index
492
493 * When bit 0 of `invxyz` is set, the order of the indices
494 in the inner for-loop are reversed. This has the side-effect
495 of placing the final reduced result in the last-predicated element.
496 It also has the indirect side-effect of swapping the source
497 registers: Left-operand index numbers will always exceed
498 Right-operand indices.
499 When clear, the reduced result will be in the first-predicated
500 element, and Left-operand indices will always be *less* than
501 Right-operand ones.
502 * When bit 1 of `invxyz` is set, the order of the outer loop
503 step is inverted: stepping begins at the nearest power-of two
504 to half of the vector length and reduces by half each time.
505 When clear the step will begin at 2 and double on each
506 inner loop.
507
508 ## FFT/DCT mode
509
510 submode2=0 is for FFT. For FFT submode the following schedules may be
511 selected:
512
513 * **submode=0b00** selects the ``j`` offset of the innermost for-loop
514 of Tukey-Cooley
515 * **submode=0b10** selects the ``j+halfsize`` offset of the innermost for-loop
516 of Tukey-Cooley
517 * **submode=0b11** selects the ``k`` of exptable (which coefficient)
518
519 When submode2 is 1 or 2, for DCT inner butterfly submode the following
520 schedules may be selected. When submode2 is 1, additional bit-reversing
521 is also performed.
522
523 * **submode=0b00** selects the ``j`` offset of the innermost for-loop,
524 in-place
525 * **submode=0b010** selects the ``j+halfsize`` offset of the innermost for-loop,
526 in reverse-order, in-place
527 * **submode=0b10** selects the ``ci`` count of the innermost for-loop,
528 useful for calculating the cosine coefficient
529 * **submode=0b11** selects the ``size`` offset of the outermost for-loop,
530 useful for the cosine coefficient ``cos(ci + 0.5) * pi / size``
531
532 When submode2 is 3 or 4, for DCT outer butterfly submode the following
533 schedules may be selected. When submode is 3, additional bit-reversing
534 is also performed.
535
536 * **submode=0b00** selects the ``j`` offset of the innermost for-loop,
537 * **submode=0b01** selects the ``j+1`` offset of the innermost for-loop,
538
539 `zdimsz` is used as an in-place "Stride", particularly useful for
540 column-based in-place DCT/FFT.
541
542 ## Matrix Mode
543
544 In Matrix Mode, skip allows dimensions to be skipped from being included
545 in the resultant output index. this allows sequences to be repeated:
546 ```0 0 0 1 1 1 2 2 2 ...``` or in the case of skip=0b11 this results in
547 modulo ```0 1 2 0 1 2 ...```
548
549 * **skip=0b00** indicates no dimensions to be skipped
550 * **skip=0b01** sets "skip 1st dimension"
551 * **skip=0b10** sets "skip 2nd dimension"
552 * **skip=0b11** sets "skip 3rd dimension"
553
554 invxyz will invert the start index of each of x, y or z. If invxyz[0] is
555 zero then x-dimensional counting begins from 0 and increments, otherwise
556 it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.
557
558 offset will have the effect of offsetting the result by ```offset``` elements:
559
560 for i in 0..VL-1:
561 GPR(RT + remap(i) + SVSHAPE.offset) = ....
562
563 this appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced. also
564 bear in mind that unlike a static compiler SVSHAPE.offset may
565 be set dynamically at runtime.
566
567 xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
568 that the array dimensionality for that dimension is 1. any dimension
569 not intended to be used must have its value set to 0 (dimensionality
570 of 1). A value of xdimsz=2 would indicate that in the first dimension
571 there are 3 elements in the array. For example, to create a 2D array
572 X,Y of dimensionality X=3 and Y=2, set xdimsz=2, ydimsz=1 and zdimsz=0
573
574 The format of the array is therefore as follows:
575
576 array[xdimsz+1][ydimsz+1][zdimsz+1]
577
578 However whilst illustrative of the dimensionality, that does not take the
579 "permute" setting into account. "permute" may be any one of six values
580 (0-5, with values of 6 and 7 indicating "Indexed" Mode). The table
581 below shows how the permutation dimensionality order works:
582
583 | permute | order | array format |
584 | ------- | ----- | ------------------------ |
585 | 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
586 | 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
587 | 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
588 | 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
589 | 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
590 | 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
591 | 110 | 0,1 | Indexed (xdim+1)(ydim+1) |
592 | 111 | 1,0 | Indexed (ydim+1)(xdim+1) |
593
594 In other words, the "permute" option changes the order in which
595 nested for-loops over the array would be done. See executable
596 python reference code for further details.
597
598 *Note: permute=0b110 and permute=0b111 enable Indexed REMAP Mode,
599 described below*
600
601 With all these options it is possible to support in-place transpose,
602 in-place rotate, Matrix Multiply and Convolutions, without being
603 limited to Power-of-Two dimension sizes.
604
605 ## Indexed Mode
606
607 Indexed Mode activates reading of the element indices from the GPR
608 and includes optional limited 2D reordering.
609 In its simplest form (without elwidth overrides or other modes):
610
611 ```
612 def index_remap(i):
613 return GPR((SVSHAPE.SVGPR<<1)+i) + SVSHAPE.offset
614
615 for i in 0..VL-1:
616 element_result = ....
617 GPR(RT + indexed_remap(i)) = element_result
618 ```
619
620 With element-width overrides included, and using the pseudocode
621 from the SVP64 [[sv/svp64/appendix#elwidth]] elwidth section
622 this becomes:
623
624 ```
625 def index_remap(i):
626 svreg = SVSHAPE.SVGPR << 1
627 srcwid = elwid_to_bitwidth(SVSHAPE.elwid)
628 offs = SVSHAPE.offset
629 return get_polymorphed_reg(svreg, srcwid, i) + offs
630
631 for i in 0..VL-1:
632 element_result = ....
633 rt_idx = indexed_remap(i)
634 set_polymorphed_reg(RT, destwid, rt_idx, element_result)
635 ```
636
637 Matrix-style reordering still applies to the indices, except limited
638 to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
639 (Y,X). Only one dimension may optionally be skipped. Inversion of either
640 X or Y or both is possible. Pseudocode for Indexed Mode (including elwidth
641 overrides) may be written in terms of Matrix Mode, specifically
642 purposed to ensure that the 3rd dimension (Z) has no effect:
643
644 ```
645 def index_remap(ISHAPE, i):
646 MSHAPE.skip = 0b0 || ISHAPE.sk1
647 MSHAPE.invxyz = 0b0 || ISHAPE.invxy
648 MSHAPE.xdimsz = ISHAPE.xdimsz
649 MSHAPE.ydimsz = ISHAPE.ydimsz
650 MSHAPE.zdimsz = 0 # disabled
651 if ISHAPE.permute = 0b110 # 0,1
652 MSHAPE.permute = 0b000 # 0,1,2
653 if ISHAPE.permute = 0b111 # 1,0
654 MSHAPE.permute = 0b010 # 1,0,2
655 el_idx = remap_matrix(MSHAPE, i)
656 svreg = ISHAPE.SVGPR << 1
657 srcwid = elwid_to_bitwidth(ISHAPE.elwid)
658 offs = ISHAPE.offset
659 return get_polymorphed_reg(svreg, srcwid, el_idx) + offs
660 ```
661
662 The most important observation above is that the Matrix-style
663 remapping occurs first and the Index lookup second. Thus it
664 becomes possible to perform in-place Transpose of Indices which
665 may have been costly to set up or costly to duplicate
666 (waste register file space).
667
668 # svshape instruction <a name="svshape"> </a>
669
670 `svshape` is a convenience instruction that reduces instruction
671 count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
672 (overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
673 including VL and MAXVL. Using `svshape` therefore does not also
674 require `setvl`.
675
676 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
677
678 svshape SVxd,SVyd,SVzd,SVRM,vf
679
680 | 0.5|6.10 |11.15 |16..20 | 21..24 | 25 | 26..31| name |
681 | -- | -- | --- | ----- | ------ | -- | ------| -------- |
682 |OPCD| SVxd | SVyd | SVzd | SVRM | vf | XO | svshape |
683
684 Fields:
685
686 * **SVxd** - SV REMAP "xdim"
687 * **SVyd** - SV REMAP "ydim"
688 * **SVzd** - SV REMAP "zdim"
689 * **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)
690 * **vf** - sets "Vertical-First" mode
691 * **XO** - standard 6-bit XO field
692
693 *Note: SVxd, SVyz and SVzd are all stored "off-by-one". In the assembler
694 mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
695
696 | SVRM | Remap Mode description |
697 | -- | -- |
698 | 0b0000 | Matrix 1/2/3D |
699 | 0b0001 | FFT Butterfly |
700 | 0b0010 | DCT Inner butterfly, pre-calculated coefficients - deprecate |
701 | 0b0011 | DCT Outer butterfly |
702 | 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
703 | 0b0101 | DCT COS table index generation |
704 | 0b0110 | DCT half-swap |
705 | 0b0111 | Parallel Reduction |
706 | 0b1000 | reserved for svshape2 |
707 | 0b1001 | reserved for svshape2 |
708 | 0b1010 | iDCT Inner butterfly, pre-calculated coefficients - deprecate|
709 | 0b1011 | iDCT Outer butterfly |
710 | 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
711 | 0b1101 | iDCT COS table index generation |
712 | 0b1110 | iDCT half-swap |
713 | 0b1111 | FFT half-swap |
714
715 Examples showing how all of these Modes operate exists in the online
716 [SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
717 and the full pseudocode setting up all SPRs
718 is in the [[openpower/isa/simplev]] page.
719
720 In Indexed Mode, there are only 5 bits available to specify the GPR
721 to use, out of 128 GPRs (7 bit numbering). Therefore, only the top
722 5 bits are given in the `SVxd` field: the bottom two implicit bits
723 will be zero (`SVxd || 0b00`).
724
725 `svshape` has *limited applicability* due to being a 32-bit instruction.
726 The full capability of SVSHAPE SPRs may be accessed by directly writing
727 to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
728 larger than 32, and in-place Transpose. Potentially a future v3.1 Prefixed
729 instruction, `psvshape`, may extend the capability here.
730
731 # svindex instruction <a name="svindex"> </a>
732
733 `svindex` is a convenience instruction that reduces instruction
734 count for Indexed REMAP Mode. It sets up
735 (overwrites) all required SVSHAPE SPRs and can modify the REMAP
736 SPR as well. The relevant SPRs *may* be directly programmed with
737 `mtspr` however it is laborious to do so: svindex saves instructions
738 covering much of Indexed REMAP capability.
739
740 Form: SVI-Form SV "Indexed" Form (see [[isatables/fields.text]])
741
742 svindex SVG,rmm,SVd,ew,yx,mr,sk
743
744 | 0.5|6.10 |11.15 |16.20 | 21..25 | 26..31| name | Form |
745 | -- | -- | --- | ---- | ----------- | ------| -------- | ---- |
746 |OPCD| SVG | rmm | SVd | ew/yx/mm/sk | XO | svindex | SVI-Form |
747
748 Fields:
749
750 * **SVd** - SV REMAP x/y dim
751 * **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs,
752 controlled by mm
753 * **ew** - sets element width override on the Indices
754 * **SVG** - GPR SVG<<2 to be used for Indexing
755 * **yx** - 2D reordering to be used if yx=1
756 * **mm** - mask mode. determines how `rmm` is interpreted.
757 * **sk** - Dimension skipping enabled
758 * **XO** - standard 6-bit XO field
759
760 *Note: SVd, like SVxd, SVyz and SVzd of `svshape`, are all stored
761 "off-by-one". In the assembler
762 mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*.
763
764 *Note: when `yx=1,sk=0` the second dimension is calculated as
765 `CEIL(MAXVL/SVd)`*.
766
767 When `mm=0`:
768
769 * `rmm`, like REMAP.SVme, has bit 0
770 correspond to mi0, bit 1 to mi1, bit 2 to mi2,
771 bit 3 to mo0 and bit 4 to mi1
772 * all SVSHAPEs and the REMAP parts of SVSHAPE are first reset (initialised to zero)
773 * for each bit set in the 5-bit `rmm`, in order, the first
774 as-yet-unset SVSHAPE will be updated
775 with the other operands in the instruction, and the REMAP
776 SPR set.
777 * If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0.
778 * SVSTATE persistence bit is cleared
779 * No other alterations to SVSTATE are carried out
780
781 Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
782 and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
783 uses mi2. REMAP.SVme is also set to 0b00110, REMAP.mi1=0
784 (SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)
785
786 Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1
787 are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
788 and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1
789
790 Rough algorithmic form:
791
792 marray = [mi0, mi1, mi2, mo0, mo1]
793 idx = 0
794 for bit = 0 to 4:
795 if not rmm[bit]: continue
796 setup(SVSHAPE[idx])
797 SVSTATE{marray[bit]} = idx
798 idx = (idx+1) modulo 4
799
800 When `mm=1`:
801
802 * bits 0-2 (MSB0 numbering) of `rmm` indicate an index selecting mi0-mo1
803 * bits 3-4 (MSB0 numbering) of `rmm` indicate which SVSHAPE 0-3 shall
804 be updated
805 * only the selected SVSHAPE is overwritten
806 * only the relevant bits in the REMAP area of SVSTATE are updated
807 * REMAP persistence bit is set.
808
809 Example 1: if `rmm`=0b01110 then bits 0-2 (MSB0) are 0b011 and
810 bits 3-4 are 0b10. thus, mo0 is selected and SVSHAPE2
811 to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
812 set to 2 (SVSHAPE2).
813
814 Example 2: if `rmm`=0b10011 then bits 0-2 (MSB0) are 0b100 and
815 bits 3-4 are 0b11. thus, mo1 is selected and SVSHAPE3
816 to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
817 set to 3 (SVSHAPE3).
818
819 Rough algorithmic form:
820
821 marray = [mi0, mi1, mi2, mo0, mo1]
822 bit = rmm[0:2]
823 idx = rmm[3:4]
824 setup(SVSHAPE[idx])
825 SVSTATE{marray[bit]} = idx
826 SVSTATE.pst = 1
827
828 In essence, `mm=0` is intended for use to set as much of the
829 REMAP State SPRs as practical with a single instruction,
830 whilst `mm=1` is intended to be a little more refined.
831
832 **Usage guidelines**
833
834 * **Disable 2D mapping**: to only perform Indexing without
835 reordering use `SVd=1,sk=0,yx=0` (or set SVd to a value larger
836 or equal to VL)
837 * **Modulo 1D mapping**: to perform Indexing cycling through the
838 first N Indices use `SVd=N,sk=0,yx=0` where `VL>N`. There is
839 no requirement to set VL equal to a multiple of N.
840 * **Modulo 2D transposed**: `SVd=M,sk=0,yx=1`, sets
841 `xdim=M,ydim=CEIL(MAXVL/M)`.
842
843 Beyond these mappings it becomes necessary to write directly to
844 the SVSTATE SPRs manually.
845
846 # svshape2 (offset) <a name="svshape2"> </a>
847
848 `svshape2` is an additional convenience instruction that prioritises
849 setting `SVSHAPE.offset`. Its primary purpose is for use when
850 element-width overrides are used. It has identical capabilities to `svindex` and
851 in terms of both options (skip, etc.) and ability to activate REMAP
852 (rmm, mask mode) but unlike `svindex` it does not set GPR REMAP,
853 only a 1D or 2D `svshape`, and
854 unlike `svshape` it can set an arbirrary `SVSHAPE.offset` immediate.
855
856 One of the limitations of Simple-V is that Vector elements start on the boundary
857 of the Scalar regfile, which is fine when element-width overrides are not
858 needed. If the starting point of a Vector with smaller elwidths must begin
859 in the middle of a register, normally there would be no way to do so except
860 through LD/ST. `SVSHAPE.offset` caters for this scenario and `svshape2`is
861 makes it easier.
862
863 svshape2 offs,yx,rmm,SVd,sk,mm
864
865 | 0.5|6..9|10|11.15 |16..20 | 21..25 | 25 | 26..31| name |
866 | -- |----|--| --- | ----- | ------ | -- | ------| -------- |
867 |OPCD|offs|yx| rmm | SVd | 100/mm | sk | XO | svshape |
868
869 * **offs** (4 bits) - unsigned offset
870 * **yx** (1 bit) - swap XY to YX
871 * **SVd** dimension size
872 * **rmm** REMAP mask
873 * **mm** mask mode
874 * **sk** (1 bit) skips 1st dimension if set
875
876 Dimensions are calculated exactly as `svindex`. `rmm` and
877 `mm` are as per `svindex`.
878
879 *Programmer's Note: offsets for `svshape2` may be specified in the range
880 0-15. Given that the principle of Simple-V is to fit on top of
881 byte-addressable register files and that GPR and FPR are 64-bit (8 bytes)
882 it should be clear that the offset may, when `elwidth=8`, begin an
883 element-level operation starting element zero at any arbitrary byte.
884 On cursory examination attempting to go beyond the range 0-7 seems
885 unnecessary given that the **next GPR or FPR** is an
886 alias for an offset in the range 8-15. Thus by simply increasing
887 the starting Vector point of the operation to the next register it
888 can be seen that the offset of 0-7 would be sufficient. Unfortunately
889 however some operations are EXTRA2-encoded it is **not possible**
890 to increase the GPR/FPR register number by one, because EXTRA2-encoding
891 of GPR/FPR Vector numbers are restricted to even numbering.
892 For CR Fields the EXTRA2 encoding is even more sparse.
893 The additional offset range (8-15) helps overcome these limitations.*
894
895 *Hardware Implementor's note: with the offsets only being immediates
896 and with register numbering being entirely immediate as well it is
897 possible to correctly compute Register Hazards without requiring
898 reading the contents of any SPRs. If however there are
899 instructions that have directly written to the SVSTATE or SVSHAPE
900 SPRs and those instructions are still in-flight then this position
901 is clearly **invalid**.*
902
903 # TODO
904
905 * investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
906 in https://bugs.libre-soc.org/show_bug.cgi?id=653
907 * UTF-8 <https://bugs.libre-soc.org/show_bug.cgi?id=794>
908 * Triangular REMAP
909 * Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
910 * Convolution REMAP