(no commit message)
[libreriscv.git] / openpower / sv / remap / appendix.mdwn
1 ## REMAP Matrix pseudocode
2
3 The algorithm below shows how REMAP works more clearly, and may be
4 executed as a python program:
5
6 ```
7 [[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
8 ```
9
10 An easier-to-read version (using python iterators) is given in
11 a separate section of this Appendix.
12
13 Each element index from the for-loop `0..VL-1`
14 is run through the above algorithm to work out the **actual** element
15 index, instead. Given that there are four possible SHAPE entries, up to
16 four separate registers in any given operation may be simultaneously
17 remapped:
18
19 ```
20 function op_add(RT, RA, RB) # add not VADD!
21  for (i=0,id=0,irs1=0,irs2=0; i < VL; i++)
22 SVSTATE.srcstep = i # save context
23 if (predval & 1<<i) # predication mask
24    GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
25 GPR[RB+remap3(irs2)];
26 if (!RT.isvector) break;
27 if (RT.isvector)  { id += 1; }
28 if (RA.isvector)  { irs1 += 1; }
29 if (RB.isvector)  { irs2 += 1; }
30 ```
31
32 By changing remappings, 2D matrices may be transposed "in-place" for one
33 operation, followed by setting a different permutation order without
34 having to move the values in the registers to or from memory.
35
36 Note that:
37
38 * Over-running the register file clearly has to be detected and
39 an illegal instruction exception thrown
40 * When non-default elwidths are set, the exact same algorithm still
41 applies (i.e. it offsets *polymorphic* elements *within* registers rather
42 than entire registers).
43 * If permute option 000 is utilised, the actual order of the
44 reindexing does not change. However, modulo MVL still occurs
45 which will result in repeated operations (use with caution).
46 * If two or more dimensions are set to zero, the actual order does not change!
47 * The above algorithm is pseudo-code **only**. Actual implementations
48 will need to take into account the fact that the element for-looping
49 must be **re-entrant**, due to the possibility of exceptions occurring.
50 See SVSTATE SPR, which records the current element index.
51 Continuing after return from an interrupt may introduce latency
52 due to re-computation of the remapped offsets.
53 * Twin-predicated operations require **two** separate and distinct
54 element offsets. The above pseudo-code algorithm will be applied
55 separately and independently to each, should each of the two
56 operands be remapped. *This even includes unit-strided LD/ST*
57 and other operations
58 in that category, where in that case it will be the **offset** that is
59 remapped.
60 * Offset is especially useful, on its own, for accessing elements
61 within the middle of a register. Without offsets, it is necessary
62 to either use a predicated MV, skipping the first elements, or
63 performing a LOAD/STORE cycle to memory.
64 With offsets, the data does not have to be moved.
65 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
66 less than MVL is **perfectly legal**, albeit very obscure. It permits
67 entries to be regularly presented to operands **more than once**, thus
68 allowing the same underlying registers to act as an accumulator of
69 multiple vector or matrix operations, for example.
70 * Note especially that Program Order **must** still be respected
71 even when overlaps occur that read or write the same register
72 elements *including polymorphic ones*
73
74 Clearly here some considerable care needs to be taken as the remapping
75 could hypothetically create arithmetic operations that target the
76 exact same underlying registers, resulting in data corruption due to
77 pipeline overlaps. Out-of-order / Superscalar micro-architectures with
78 register-renaming will have an easier time dealing with this than
79 DSP-style SIMD micro-architectures.
80
81 ## REMAP FFT, DFT, NTT
82
83 The algorithm from a later section of this Appendix shows how FFT REMAP works,
84 and it may be executed as a standalone python3 program.
85 The executable code is designed to illustrate how a hardware
86 implementation may generate Indices which are completely
87 independent of the Execution of element-level operations,
88 even for something as complex as a Triple-loop Tukey-Cooley
89 Schedule. A comprehensive demo and test suite may be found
90 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD)
91 including Complex Number FFT which deploys Vertical-First Mode
92 on top of the REMAP Schedules.
93
94 Other uses include more than DFT and NTT: as abstracted RISC-paradigm
95 the Schedules are not
96 restricted in any way or tied to any particular instructtion.
97 If the programmer can find any algorithm
98 which has identical triple nesting then the FFT Schedule may be
99 used even there.
100
101 # 4x4 Matrix to vec4 Multiply Example
102
103 The following settings will allow a 4x4 matrix (starting at f8), expressed
104 as a sequence of 16 numbers first by row then by column, to be multiplied
105 by a vector of length 4 (starting at f0), using a single FMAC instruction.
106
107 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
108 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
109 * VL=16, f4=vec, f0=vec, f8=vec
110 * FMAC f4, f0, f8, f4
111
112 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
113 four iterations through the hardware loop, the REMAPed index will not
114 increment. On the second four, the index will increase by one. Likewise
115 on each subsequent group of four.
116
117 The permutation on SHAPE1 will increment f4 continuously cycling through
118 f4-f7 every four iterations of the hardware loop.
119
120 At the same time, VL will, because there is no SHAPE on f8, increment
121 straight sequentially through the 16 values f8-f23 in the Matrix. The
122 equivalent sequence thus is issued:
123
124 ```
125 fmac f4, f0, f8, f4
126 fmac f5, f0, f9, f5
127 fmac f6, f0, f10, f6
128 fmac f7, f0, f11, f7
129 fmac f4, f1, f12, f4
130 fmac f5, f1, f13, f5
131 fmac f6, f1, f14, f6
132 fmac f7, f1, f15, f7
133 fmac f4, f2, f16, f4
134 fmac f5, f2, f17, f5
135 fmac f6, f2, f18, f6
136 fmac f7, f2, f19, f7
137 fmac f4, f3, f20, f4
138 fmac f5, f3, f21, f5
139 fmac f6, f3, f22, f6
140 fmac f7, f3, f23, f7
141 ```
142
143 The only other instruction required is to ensure that f4-f7 are
144 initialised (usually to zero).
145
146 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
147 the same technique applied to four independent vectors, can be done by
148 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
149 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
150 it to apply four times to compute the four columns worth of vectors.
151
152 [[!tag standards]]
153
154 ---------
155
156 \newpage{}
157