From 922f68f0cc0e413a6fbff04680435bb4523bb9aa Mon Sep 17 00:00:00 2001 From: lkcl Date: Sun, 26 Mar 2023 20:56:00 +0100 Subject: [PATCH] --- openpower/sv/rfc/ls009.mdwn | 901 ++++++++++++++++++++++++++++++++++++ 1 file changed, 901 insertions(+) diff --git a/openpower/sv/rfc/ls009.mdwn b/openpower/sv/rfc/ls009.mdwn index 451e11ba1..114008922 100644 --- a/openpower/sv/rfc/ls009.mdwn +++ b/openpower/sv/rfc/ls009.mdwn @@ -85,6 +85,907 @@ Add the following entries to: \newpage{} + +[[!tag standards]] + +# REMAP + +* see [[sv/propagation]] for a future way to apply REMAP + +REMAP is an advanced form of Vector "Structure Packing" that +provides hardware-level support for commonly-used *nested* loop patterns. +For more general reordering an Indexed REMAP mode is available. + +REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped) +from a linear form to a 2D or 3D transposed form, or "offset" to permit +arbitrary access to elements (when elwidth overrides are used), +independently on each Vector src or dest +register. + +The initial primary motivation of REMAP was for Matrix Multiplication, reordering of sequential +data in-place: in-place DCT and FFT were easily justified given the +high usage in Computer Science. +Four SPRs are provided which may be applied to any GPR, FPR or CR Field +so that for example a single FMAC may be +used in a single loop to perform 5x3 times 3x4 Matrix multiplication, +generating 60 FMACs *without needing explicit assembler unrolling*. +Additional uses include regular "Structure Packing" +such as RGB pixel data extraction and reforming. + +REMAP, like all of SV, is abstracted out, meaning that unlike traditional +Vector ISAs which would typically only have a limited set of instructions +that can be structure-packed (LD/ST typically), REMAP may be applied to +literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything. + +Note that REMAP does not *directly* apply to sub-vector elements: that +is what swizzle is for. Swizzle *can* however be applied to the same +instruction as REMAP. As explained in [[sv/mv.swizzle]], [[sv/mv.vec]] and the [[svp64/appendix]], Pack and Unpack EXTRA Mode bits +can extend down into Sub-vector elements to perform vec2/vec3/vec4 +sequential reordering, but even here, REMAP is not extended down to +the actual sub-vector elements themselves. + +In its general form, REMAP is quite expensive to set up, and on some +implementations may introduce +latency, so should realistically be used only where it is worthwhile. +Commonly-used patterns such as Matrix Multiply, DCT and FFT have +helper instruction options which make REMAP easier to use. + +There are four types of REMAP: + +* **Matrix**, also known as 2D and 3D reshaping, can perform in-place + Matrix transpose and rotate. The Shapes are set up for an "Outer Product" + Matrix Multiply. +* **FFT/DCT**, with full triple-loop in-place support: limited to + Power-2 RADIX +* **Indexing**, for any general-purpose reordering, also includes + limited 2D reshaping. +* **Parallel Reduction**, for scheduling a sequence of operations + in a Deterministic fashion, in a way that may be parallelised, + to reduce a Vector down to a single value. + +Best implemented on top of a Multi-Issue Out-of-Order Micro-architecture, +REMAP Schedules are 100% Deterministic **including Indexing** and are +designed to be incorporated in between the Decode and Issue phases, +directly into Register Hazard Management. + +Parallel Reduction is unusual in that it requires a full vector array +of results (not a scalar) and uses the rest of the result Vector for +the purposes of storing intermediary calculations. As these intermediary +results are Deterministically computed they may be useful. +Additionally, because the intermediate results are always written out +it is possible to service Precise Interrupts without affecting latency +(a common limitation of Vector ISAs). + +# Basic principle + +* normal vector element read/write of operands would be sequential + (0 1 2 3 ....) +* this is not appropriate for (e.g.) Matrix multiply which requires + accessing elements in alternative sequences (0 3 6 1 4 7 ...) +* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope" + with this. both are expensive (copy large vectors, spill through memory) + and very few Packed SIMD ISAs cope with non-Power-2. +* REMAP **redefines** the order of access according to set + (Deterministic) "Schedules". +* The Schedules are not at all restricted to power-of-two boundaries + making it unnecessary to have for example specialised 3x4 transpose + instructions of other Vector ISAs. + +Only the most commonly-used algorithms in computer science have REMAP +support, due to the high cost in both the ISA and in hardware. For +arbitrary remapping the `Indexed` REMAP may be used. + +# Example Usage + +* `svshape` to set the type of reordering to be applied to an + otherwise usual `0..VL-1` hardware for-loop +* `svremap` to set which registers a given reordering is to apply to + (RA, RT etc) +* `sv.{instruction}` where any Vectorised register marked by `svremap` + will have its ordering REMAPPED according to the schedule set + by `svshape`. + +The following illustrative example multiplies a 3x4 and a 5x3 +matrix to create +a 5x4 result: + + svshape 5, 4, 3, 0, 0 + svremap 15, 1, 2, 3, 0, 0, 0, 0 + sv.fmadds *0, *8, *16, *0 + +* svshape sets up the four SVSHAPE SPRS for a Matrix Schedule +* svremap activates four out of five registers RA RB RC RT RS (15) +* svremap requests: + - RA to use SVSHAPE1 + - RB to use SVSHAPE2 + - RC to use SVSHAPE3 + - RT to use SVSHAPE0 + - RS Remapping to not be activated +* sv.fmadds has RT=0.v, RA=8.v, RB=16.v, RC=0.v +* With REMAP being active each register's element index is + *independently* transformed using the specified SHAPEs. + +Thus the Vector Loop is arranged such that the use of +the multiply-and-accumulate instruction executes precisely the required +Schedule to perform an in-place in-registers Matrix Multiply with no +need to perform additional Transpose or register copy instructions. +The example above may be executed as a unit test and demo, +[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94) + +# REMAP types + +This section summarises the motivation for each REMAP Schedule +and briefly goes over their characteristics and limitations. +Further details on the Deterministic Precise-Interruptible algorithms +used in these Schedules is found in the [[sv/remap/appendix]]. + +## Matrix (1D/2D/3D shaping) + +Matrix Multiplication is a huge part of High-Performance Compute, +and 3D. +In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two +Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to +cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling. +Aside from the cost of the load on the L1 I-Cache, the trick only +works if one of the dimensions X or Y are power-two. Prime Numbers +(5x7, 3x5) become deeply problematic to unroll. + +Even traditional Scalable Vector ISAs have issues with Matrices, often +having to perform data Transpose by pushing out through Memory and back, +or computing Transposition Indices (costly) then copying to another +Vector (costly). + +Matrix REMAP was thus designed to solve these issues by providing Hardware +Assisted +"Schedules" that can view what would otherwise be limited to a strictly +linear Vector as instead being 2D (even 3D) *in-place* reordered. +With both Transposition and non-power-two being supported the issues +faced by other ISAs are mitigated. + +Limitations of Matrix REMAP are that the Vector Length (VL) is currently +restricted to 127: up to 127 FMAs (or other operation) +may be performed in total. +Also given that it is in-registers only at present some care has to be +taken on regfile resource utilisation. However it is perfectly possible +to utilise Matrix REMAP to perform the three inner-most "kernel" loops of +the usual 6-level large Matrix Multiply, without the usual difficulties +associated with SIMD. + +Also the `svshape` instruction only provides access to part of the +Matrix REMAP capability. Rotation and mirroring need to be done by +programming the SVSHAPE SPRs directly, which can take a lot more +instructions. + +## FFT/DCT Triple Loop + +DCT and FFT are some of the most astonishingly used algorithms in +Computer Science. Radar, Audio, Video, R.F. Baseband and dozens more. At least +two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored +to FFT. + +An in-depth analysis showed that it is possible to do in-place in-register +DCT and FFT as long as twin-result "butterfly" instructions are provided. +These can be found in the [[openpower/isa/svfparith]] page if performing +IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out +integer operations would be required)*. These "butterfly" instructions +avoid the need for a temporary register because the two array positions +being overwritten will be "in-flight" in any In-Order or Out-of-Order +micro-architecture. + +DCT and FFT Schedules are currently limited to RADIX2 sizes and do not +accept predicate masks. Given that it is common to perform recursive +convolutions combining smaller Power-2 DCT/FFT to create larger DCT/FFTs +in practice the RADIX2 limit is not a problem. A Bluestein convolution +to compute arbitrary length is demonstrated by +[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py) + +## Indexed + +The purpose of Indexing is to provide a generalised version of +Vector ISA "Permute" instructions, such as VSX `vperm`. The +Indexing is abstracted out and may be applied to much more +than an element move/copy, and is not limited for example +to the number of bytes that can fit into a VSX register. +Indexing may be applied to LD/ST (even on Indexed LD/ST +instructions such as `sv.lbzx`), arithmetic operations, +extsw: there is no artificial limit. + +The only major caveat is that the registers to be used as +Indices must not be modified by any instruction after Indexed Mode +is established, and neither must MAXVL be altered. Additionally, +no register used as an Index may exceed MAXVL-1. + +Failure to observe +these conditions results in `UNDEFINED` behaviour. +These conditions allow a Read-After-Write (RAW) Hazard to be created on +the entire range of Indices to be subsequently used, but a corresponding +Write-After-Read Hazard by any instruction that modifies the Indices +**does not have to be created**. Given the large number of registers +involved in Indexing this is a huge resource saving and reduction +in micro-architectural complexity. MAXVL is likewise +included in the RAW Hazards because it is involved in calculating +how many registers are to be considered Indices. + +With these Hazard Mitigations in place, high-performance implementations +may read-cache the Indices from the point where a given `svindex` instruction +is called (or SVSHAPE SPRs - and MAXVL- directly altered). + +The original motivation for Indexed REMAP was to mitigate the need to add +an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as +a stand-alone instruction. Usually a Vector ISA would add a non-conflicting +variant (as in VSX `vperm`) but it is common to need to permute by source, +with the risk of conflict, that has to be resolved, for example, in AVX-512 +with `conflictd`. + +Indexed REMAP on the other hand **does not prevent conflicts** (overlapping +destinations), which on a superficial analysis may be perceived to be a +problem, until it is recalled that, firstly, Simple-V is designed specifically +to require Program Order to be respected, and that Matrix, DCT and FFT +all *already* critically depend on overlapping Reads/Writes: Matrix +uses overlapping registers as accumulators. Thus the Register Hazard +Management needed by Indexed REMAP *has* to be in place anyway. + +The cost compared to Matrix and other REMAPs (and Pack/Unpack) is +clearly that of the additional reading of the GPRs to be used as Indices, +plus the setup cost associated with creating those same Indices. +If any Deterministic REMAP can cover the required task, clearly it +is adviseable to use it instead. + +*Programmer's note: some algorithms may require skipping of Indices exceeding +VL-1, not MAXVL-1. This may be achieved programmatically by performing +an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP, +and RB contains the value of VL returned from `setvl`. The resultant +CR Fields may then be used as Predicate Masks to exclude those operations +with an Index exceeding VL-1.* + +## Parallel Reduction + +Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base" +(Power ISA v3.0B) operation is leveraged, unmodified, to give the +*appearance* and *effect* of Reduction. + +In Horizontal-First Mode, Vector-result reduction **requires** +the destination to be a Vector, which will be used to store +intermediary results. + +Given that the tree-reduction schedule is deterministic, +Interrupts and exceptions +can therefore also be precise. The final result will be in the first +non-predicate-masked-out destination element, but due again to +the deterministic schedule programmers may find uses for the intermediate +results. + +When Rc=1 a corresponding Vector of co-resultant CRs is also +created. No special action is taken: the result and its CR Field +are stored "as usual" exactly as all other SVP64 Rc=1 operations. + +Note that the Schedule only makes sense on top of certain instructions: +X-Form with a Register Profile of `RT,RA,RB` is fine because two sources +and the destination are all the same type. Like Scalar +Reduction, nothing is prohibited: +the results of execution on an unsuitable instruction may simply +not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi) +may be used. + +Critical to note regarding use of Parallel-Reduction REMAP is that, +exactly as with all REMAP Modes, the `svshape` instruction *requests* +a certain Vector Length (number of elements to reduce) and then +sets VL and MAXVL at the number of **operations** needed to be +carried out. Thus, equally as importantly, like Matrix REMAP +the total number of operations +is restricted to 127. Any Parallel-Reduction requiring more operations +will need to be done manually in batches (hierarchical +recursive Reduction). + +Also important to note is that the Deterministic Schedule is arranged +so that some implementations *may* parallelise it (as long as doing so +respects Program Order and Register Hazards). Performance (speed) +of any given +implementation is neither strictly defined or guaranteed. As with +the Vulkan(tm) Specification, strict compliance is paramount whilst +performance is at the discretion of Implementors. + +**Parallel-Reduction with Predication** + +To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule +completely separate from the actual element-level (scalar) operations, +Move operations are **not** included in the Schedule. This means that +the Schedule leaves the final (scalar) result in the first-non-masked +element of the Vector used. With the predicate mask being dynamic +(but deterministic) this result could be anywhere. + +If that result is needed to be moved to a (single) scalar register +then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be +needed to get it, where the predicate is the exact same predicate used +in the prior Parallel-Reduction instruction. + +* If there was only a single + bit in the predicate then the result will not have moved or been altered + from the source vector prior to the Reduction +* If there was more than one bit the result will be in the + first element with a predicate bit set. + +In either case the result is in the element with the first bit set in +the predicate mask. + +For *some* implementations +the vector-to-scalar copy may be a slow operation, as may the Predicated +Parallel Reduction itself. +It may be better to perform a pre-copy +of the values, compressing them (VREDUCE-style) into a contiguous block, +which will guarantee that the result goes into the very first element +of the destination vector, in which case clearly no follow-up +vector-to-scalar MV operation is needed. + +**Usage conditions** + +The simplest usage is to perform an overwrite, specifying all three +register operands the same. + + svshape parallelreduce, 6 + sv.add *8, *8, *8 + +The Reduction Schedule will issue the Parallel Tree Reduction spanning +registers 8 through 13, by adjusting the offsets to RT, RA and RB as +necessary (see "Parallel Reduction algorithm" in a later section). + +A non-overwrite is possible as well but just as with the overwrite +version, only those destination elements necessary for storing +intermediary computations will be written to: the remaining elements +will **not** be overwritten and will **not** be zero'd. + + svshape parallelreduce, 6 + sv.add *0, *8, *8 + +However it is critical to note that if the source and destination are +not the same then the trick of using a follow-up vector-scalar MV will +not work. + +## Sub-Vector Horizontal Reduction + +Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed +on all first Subvector elements, followed by another separate independent +Parallel Reduction on all the second Subvector elements and so on. + + for selectsubelement in (x,y,z,w): + parallelreduce(0..VL-1, selectsubelement) + +By contrast, when SVM is set and SUBVL!=1, a Horizontal +Subvector mode is enabled, applying the Parallel Reduction +Algorithm to the Subvector Elements. The Parallel Reduction +is independently applied VL times, to each group of Subvector +elements. Bear in mind that predication is never applied down +into individual Subvector elements, but will be applied +to select whether the *entire* Parallel Reduction on each +group is performed or not. + +  for (i = 0; i < VL; i++) + if (predval & 1< + +There is also a corresponding SVRM-Form for the svremap +instruction which matches the above SPR: + + svremap SVme,mi0,mi1,mi2,mo0,mo2,pst + +|0 |6 |11 |13 |15 |17 |19 |21 | 22.25 |26..31 | +| -- | -- | -- | -- | -- | -- | -- | -- | ---- | ----- | +| PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO | + +# SHAPE Remapping SPRs + +There are four "shape" SPRs, SHAPE0-3, 32-bits in each, +which have the same format. + +Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is +disabled: the register's elements are a linear (1D) vector. + +|31.30|29..28 |27..24| 23..21 | 20..18 | 17..12 |11..6 |5..0 | Mode | +|---- |------ |------| ------ | ------- | ------- |----- |----- | ----- | +|0b00 |skip |offset| invxyz | permute | zdimsz |ydimsz|xdimsz|Matrix | +|0b00 |elwidth|offset|sk1/invxy|0b110/0b111|SVGPR|ydimsz|xdimsz|Indexed| +|0b01 |submode|offset| invxyz | submode2| zdimsz |mode |xdimsz|DCT/FFT| +|0b10 |submode|offset| invxyz | rsvd | rsvd |rsvd |xdimsz|Preduce| +|0b11 | | | | | | | |rsvd | + +mode sets different behaviours (straight matrix multiply, FFT, DCT). + +* **mode=0b00** sets straight Matrix Mode +* **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode +* **mode=0b01** sets "FFT/DCT" mode and activates submodes +* **mode=0b10** sets "Parallel Reduction" Schedules. + +## Parallel Reduction Mode + +Creates the Schedules for Parallel Tree Reduction. + +* **submode=0b00** selects the left operand index +* **submode=0b01** selects the right operand index + +* When bit 0 of `invxyz` is set, the order of the indices + in the inner for-loop are reversed. This has the side-effect + of placing the final reduced result in the last-predicated element. + It also has the indirect side-effect of swapping the source + registers: Left-operand index numbers will always exceed + Right-operand indices. + When clear, the reduced result will be in the first-predicated + element, and Left-operand indices will always be *less* than + Right-operand ones. +* When bit 1 of `invxyz` is set, the order of the outer loop + step is inverted: stepping begins at the nearest power-of two + to half of the vector length and reduces by half each time. + When clear the step will begin at 2 and double on each + inner loop. + +## FFT/DCT mode + +submode2=0 is for FFT. For FFT submode the following schedules may be +selected: + +* **submode=0b00** selects the ``j`` offset of the innermost for-loop + of Tukey-Cooley +* **submode=0b10** selects the ``j+halfsize`` offset of the innermost for-loop + of Tukey-Cooley +* **submode=0b11** selects the ``k`` of exptable (which coefficient) + +When submode2 is 1 or 2, for DCT inner butterfly submode the following +schedules may be selected. When submode2 is 1, additional bit-reversing +is also performed. + +* **submode=0b00** selects the ``j`` offset of the innermost for-loop, + in-place +* **submode=0b010** selects the ``j+halfsize`` offset of the innermost for-loop, + in reverse-order, in-place +* **submode=0b10** selects the ``ci`` count of the innermost for-loop, + useful for calculating the cosine coefficient +* **submode=0b11** selects the ``size`` offset of the outermost for-loop, + useful for the cosine coefficient ``cos(ci + 0.5) * pi / size`` + +When submode2 is 3 or 4, for DCT outer butterfly submode the following +schedules may be selected. When submode is 3, additional bit-reversing +is also performed. + +* **submode=0b00** selects the ``j`` offset of the innermost for-loop, +* **submode=0b01** selects the ``j+1`` offset of the innermost for-loop, + +`zdimsz` is used as an in-place "Stride", particularly useful for +column-based in-place DCT/FFT. + +## Matrix Mode + +In Matrix Mode, skip allows dimensions to be skipped from being included +in the resultant output index. this allows sequences to be repeated: +```0 0 0 1 1 1 2 2 2 ...``` or in the case of skip=0b11 this results in +modulo ```0 1 2 0 1 2 ...``` + +* **skip=0b00** indicates no dimensions to be skipped +* **skip=0b01** sets "skip 1st dimension" +* **skip=0b10** sets "skip 2nd dimension" +* **skip=0b11** sets "skip 3rd dimension" + +invxyz will invert the start index of each of x, y or z. If invxyz[0] is +zero then x-dimensional counting begins from 0 and increments, otherwise +it begins from xdimsz-1 and iterates down to zero. Likewise for y and z. + +offset will have the effect of offsetting the result by ```offset``` elements: + + for i in 0..VL-1: + GPR(RT + remap(i) + SVSHAPE.offset) = .... + +this appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced. also +bear in mind that unlike a static compiler SVSHAPE.offset may +be set dynamically at runtime. + +xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates +that the array dimensionality for that dimension is 1. any dimension +not intended to be used must have its value set to 0 (dimensionality +of 1). A value of xdimsz=2 would indicate that in the first dimension +there are 3 elements in the array. For example, to create a 2D array +X,Y of dimensionality X=3 and Y=2, set xdimsz=2, ydimsz=1 and zdimsz=0 + +The format of the array is therefore as follows: + + array[xdimsz+1][ydimsz+1][zdimsz+1] + +However whilst illustrative of the dimensionality, that does not take the +"permute" setting into account. "permute" may be any one of six values +(0-5, with values of 6 and 7 indicating "Indexed" Mode). The table +below shows how the permutation dimensionality order works: + +| permute | order | array format | +| ------- | ----- | ------------------------ | +| 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) | +| 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) | +| 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) | +| 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) | +| 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) | +| 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) | +| 110 | 0,1 | Indexed (xdim+1)(ydim+1) | +| 111 | 1,0 | Indexed (ydim+1)(xdim+1) | + +In other words, the "permute" option changes the order in which +nested for-loops over the array would be done. See executable +python reference code for further details. + +*Note: permute=0b110 and permute=0b111 enable Indexed REMAP Mode, +described below* + +With all these options it is possible to support in-place transpose, +in-place rotate, Matrix Multiply and Convolutions, without being +limited to Power-of-Two dimension sizes. + +## Indexed Mode + +Indexed Mode activates reading of the element indices from the GPR +and includes optional limited 2D reordering. +In its simplest form (without elwidth overrides or other modes): + +``` +def index_remap(i): + return GPR((SVSHAPE.SVGPR<<1)+i) + SVSHAPE.offset + +for i in 0..VL-1: + element_result = .... + GPR(RT + indexed_remap(i)) = element_result +``` + +With element-width overrides included, and using the pseudocode +from the SVP64 [[sv/svp64/appendix#elwidth]] elwidth section +this becomes: + +``` +def index_remap(i): + svreg = SVSHAPE.SVGPR << 1 + srcwid = elwid_to_bitwidth(SVSHAPE.elwid) + offs = SVSHAPE.offset + return get_polymorphed_reg(svreg, srcwid, i) + offs + +for i in 0..VL-1: + element_result = .... + rt_idx = indexed_remap(i) + set_polymorphed_reg(RT, destwid, rt_idx, element_result) +``` + +Matrix-style reordering still applies to the indices, except limited +to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or +(Y,X). Only one dimension may optionally be skipped. Inversion of either +X or Y or both is possible. Pseudocode for Indexed Mode (including elwidth +overrides) may be written in terms of Matrix Mode, specifically +purposed to ensure that the 3rd dimension (Z) has no effect: + +``` +def index_remap(ISHAPE, i): + MSHAPE.skip = 0b0 || ISHAPE.sk1 + MSHAPE.invxyz = 0b0 || ISHAPE.invxy + MSHAPE.xdimsz = ISHAPE.xdimsz + MSHAPE.ydimsz = ISHAPE.ydimsz + MSHAPE.zdimsz = 0 # disabled + if ISHAPE.permute = 0b110 # 0,1 + MSHAPE.permute = 0b000 # 0,1,2 + if ISHAPE.permute = 0b111 # 1,0 + MSHAPE.permute = 0b010 # 1,0,2 + el_idx = remap_matrix(MSHAPE, i) + svreg = ISHAPE.SVGPR << 1 + srcwid = elwid_to_bitwidth(ISHAPE.elwid) + offs = ISHAPE.offset + return get_polymorphed_reg(svreg, srcwid, el_idx) + offs +``` + +The most important observation above is that the Matrix-style +remapping occurs first and the Index lookup second. Thus it +becomes possible to perform in-place Transpose of Indices which +may have been costly to set up or costly to duplicate +(waste register file space). + +# svshape instruction + +`svshape` is a convenience instruction that reduces instruction +count for common usage patterns, particularly Matrix, DCT and FFT. It sets up +(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE +including VL and MAXVL. Using `svshape` therefore does not also +require `setvl`. + +Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]]) + + svshape SVxd,SVyd,SVzd,SVRM,vf + +| 0.5|6.10 |11.15 |16..20 | 21..24 | 25 | 26..31| name | +| -- | -- | --- | ----- | ------ | -- | ------| -------- | +|OPCD| SVxd | SVyd | SVzd | SVRM | vf | XO | svshape | + +Fields: + +* **SVxd** - SV REMAP "xdim" +* **SVyd** - SV REMAP "ydim" +* **SVzd** - SV REMAP "zdim" +* **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.) +* **vf** - sets "Vertical-First" mode +* **XO** - standard 6-bit XO field + +*Note: SVxd, SVyz and SVzd are all stored "off-by-one". In the assembler +mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`* + +| SVRM | Remap Mode description | +| -- | -- | +| 0b0000 | Matrix 1/2/3D | +| 0b0001 | FFT Butterfly | +| 0b0010 | DCT Inner butterfly, pre-calculated coefficients | +| 0b0011 | DCT Outer butterfly | +| 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) | +| 0b0101 | DCT COS table index generation | +| 0b0110 | DCT half-swap | +| 0b0111 | Parallel Reduction | +| 0b1000 | reserved for svshape2 | +| 0b1001 | reserved for svshape2 | +| 0b1010 | iDCT Inner butterfly, pre-calculated coefficients | +| 0b1011 | iDCT Outer butterfly | +| 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) | +| 0b1101 | iDCT COS table index generation | +| 0b1110 | iDCT half-swap | +| 0b1111 | FFT half-swap | + +Examples showing how all of these Modes operate exists in the online +[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD) +and the full pseudocode setting up all SPRs +is in the [[openpower/isa/simplev]] page. + +In Indexed Mode, there are only 5 bits available to specify the GPR +to use, out of 128 GPRs (7 bit numbering). Therefore, only the top +5 bits are given in the `SVxd` field: the bottom two implicit bits +will be zero (`SVxd || 0b00`). + +`svshape` has *limited applicability* due to being a 32-bit instruction. +The full capability of SVSHAPE SPRs may be accessed by directly writing +to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions +larger than 32, and in-place Transpose. Potentially a future v3.1 Prefixed +instruction, `psvshape`, may extend the capability here. + +# svindex instruction + +`svindex` is a convenience instruction that reduces instruction +count for Indexed REMAP Mode. It sets up +(overwrites) all required SVSHAPE SPRs and can modify the REMAP +SPR as well. The relevant SPRs *may* be directly programmed with +`mtspr` however it is laborious to do so: svindex saves instructions +covering much of Indexed REMAP capability. + +Form: SVI-Form SV "Indexed" Form (see [[isatables/fields.text]]) + + svindex SVG,rmm,SVd,ew,yx,mr,sk + +| 0.5|6.10 |11.15 |16.20 | 21..25 | 26..31| name | Form | +| -- | -- | --- | ---- | ----------- | ------| -------- | ---- | +|OPCD| SVG | rmm | SVd | ew/yx/mm/sk | XO | svindex | SVI-Form | + +Fields: + +* **SVd** - SV REMAP x/y dim +* **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs, + controlled by mm +* **ew** - sets element width override on the Indices +* **SVG** - GPR SVG<<2 to be used for Indexing +* **yx** - 2D reordering to be used if yx=1 +* **mm** - mask mode. determines how `rmm` is interpreted. +* **sk** - Dimension skipping enabled +* **XO** - standard 6-bit XO field + +*Note: SVd, like SVxd, SVyz and SVzd of `svshape`, are all stored +"off-by-one". In the assembler +mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*. + +*Note: when `yx=1,sk=0` the second dimension is calculated as +`CEIL(MAXVL/SVd)`*. + +When `mm=0`: + +* `rmm`, like REMAP.SVme, has bit 0 + correspond to mi0, bit 1 to mi1, bit 2 to mi2, + bit 3 to mo0 and bit 4 to mi1 +* all SVSHAPEs and the REMAP parts of SVSHAPE are first reset (initialised to zero) +* for each bit set in the 5-bit `rmm`, in order, the first + as-yet-unset SVSHAPE will be updated + with the other operands in the instruction, and the REMAP + SPR set. +* If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0. +* SVSTATE persistence bit is cleared +* No other alterations to SVSTATE are carried out + +Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up, +and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2 +uses mi2. REMAP.SVme is also set to 0b00110, REMAP.mi1=0 +(SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1) + +Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1 +are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0 +and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1 + +Rough algorithmic form: + + marray = [mi0, mi1, mi2, mo0, mo1] + idx = 0 + for bit = 0 to 4: + if not rmm[bit]: continue + setup(SVSHAPE[idx]) + SVSTATE{marray[bit]} = idx + idx = (idx+1) modulo 4 + +When `mm=1`: + +* bits 0-2 (MSB0 numbering) of `rmm` indicate an index selecting mi0-mo1 +* bits 3-4 (MSB0 numbering) of `rmm` indicate which SVSHAPE 0-3 shall + be updated +* only the selected SVSHAPE is overwritten +* only the relevant bits in the REMAP area of SVSTATE are updated +* REMAP persistence bit is set. + +Example 1: if `rmm`=0b01110 then bits 0-2 (MSB0) are 0b011 and +bits 3-4 are 0b10. thus, mo0 is selected and SVSHAPE2 +to be updated. REMAP.SVme[3] will be set high and REMAP.mo0 +set to 2 (SVSHAPE2). + +Example 2: if `rmm`=0b10011 then bits 0-2 (MSB0) are 0b100 and +bits 3-4 are 0b11. thus, mo1 is selected and SVSHAPE3 +to be updated. REMAP.SVme[4] will be set high and REMAP.mo1 +set to 3 (SVSHAPE3). + +Rough algorithmic form: + + marray = [mi0, mi1, mi2, mo0, mo1] + bit = rmm[0:2] + idx = rmm[3:4] + setup(SVSHAPE[idx]) + SVSTATE{marray[bit]} = idx + SVSTATE.pst = 1 + +In essence, `mm=0` is intended for use to set as much of the +REMAP State SPRs as practical with a single instruction, +whilst `mm=1` is intended to be a little more refined. + +**Usage guidelines** + +* **Disable 2D mapping**: to only perform Indexing without + reordering use `SVd=1,sk=0,yx=0` (or set SVd to a value larger + or equal to VL) +* **Modulo 1D mapping**: to perform Indexing cycling through the + first N Indices use `SVd=N,sk=0,yx=0` where `VL>N`. There is + no requirement to set VL equal to a multiple of N. +* **Modulo 2D transposed**: `SVd=M,sk=0,yx=1`, sets + `xdim=M,ydim=CEIL(MAXVL/M)`. + +Beyond these mappings it becomes necessary to write directly to +the SVSTATE SPRs manually. + +# svshape2 (offset) + +`svshape2` is an additional convenience instruction that prioritises +setting `SVSHAPE.offset`. Its primary purpose is for use when +element-width overrides are used. It has identical capabilities to `svindex` and +in terms of both options (skip, etc.) and ability to activate REMAP +(rmm, mask mode) but unlike `svindex` it does not set GPR REMAP, +only a 1D or 2D `svshape`, and +unlike `svshape` it can set an arbirrary `SVSHAPE.offset` immediate. + +One of the limitations of Simple-V is that Vector elements start on the boundary +of the Scalar regfile, which is fine when element-width overrides are not +needed. If the starting point of a Vector with smaller elwidths must begin +in the middle of a register, normally there would be no way to do so except +through LD/ST. `SVSHAPE.offset` caters for this scenario and `svshape2`is +makes it easier. + + svshape2 offs,yx,rmm,SVd,sk,mm + +| 0.5|6..9|10|11.15 |16..20 | 21..25 | 25 | 26..31| name | +| -- |----|--| --- | ----- | ------ | -- | ------| -------- | +|OPCD|offs|yx| rmm | SVd | 100/mm | sk | XO | svshape | + +* **offs** (4 bits) - unsigned offset +* **yx** (1 bit) - swap XY to YX +* **SVd** dimension size +* **rmm** REMAP mask +* **mm** mask mode +* **sk** (1 bit) skips 1st dimension if set + +Dimensions are calculated exactly as `svindex`. `rmm` and +`mm` are as per `svindex`. + +*Programmer's Note: offsets for `svshape2` may be specified in the range +0-15. Given that the principle of Simple-V is to fit on top of +byte-addressable register files and that GPR and FPR are 64-bit (8 bytes) +it should be clear that the offset may, when `elwidth=8`, begin an +element-level operation starting element zero at any arbitrary byte. +On cursory examination attempting to go beyond the range 0-7 seems +unnecessary given that the **next GPR or FPR** is an +alias for an offset in the range 8-15. Thus by simply increasing +the starting Vector point of the operation to the next register it +can be seen that the offset of 0-7 would be sufficient. Unfortunately +however some operations are EXTRA2-encoded it is **not possible** +to increase the GPR/FPR register number by one, because EXTRA2-encoding +of GPR/FPR Vector numbers are restricted to even numbering. +For CR Fields the EXTRA2 encoding is even more sparse. +The additional offset range (8-15) helps overcome these limitations.* + +*Hardware Implementor's note: with the offsets only being immediates +and with register numbering being entirely immediate as well it is +possible to correctly compute Register Hazards without requiring +reading the contents of any SPRs. If however there are +instructions that have directly written to the SVSTATE or SVSHAPE +SPRs and those instructions are still in-flight then this position +is clearly **invalid**.* + + + + + + + # svstep: Vertical-First Stepping and status reporting SVL-Form -- 2.30.2