0929c941cbe9594c1e052fa4e433b44c9c4aa913
[libreriscv.git] / openpower / sv / remap / appendix.mdwn
1 ## REMAP Matrix pseudocode
2
3 The algorithm below shows how REMAP works more clearly, and may be
4 executed as a python program:
5
6 ```
7 [[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
8 ```
9
10 An easier-to-read version (using python iterators) shows the loop nesting:
11
12 ```
13 [[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
14 ```
15
16 Each element index from the for-loop `0..VL-1`
17 is run through the above algorithm to work out the **actual** element
18 index, instead. Given that there are four possible SHAPE entries, up to
19 four separate registers in any given operation may be simultaneously
20 remapped:
21
22 ```
23 function op_add(RT, RA, RB) # add not VADD!
24 ...
25 ...
26  for (i=0,id=0,irs1=0,irs2=0; i < VL; i++)
27 SVSTATE.srcstep = i # save context
28 if (predval & 1<<i) # predication mask
29    GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
30 GPR[RB+remap3(irs2)];
31 if (!int_vec[RT ].isvector) break;
32 if (int_vec[RT].isvector)  { id += 1; }
33 if (int_vec[RA].isvector)  { irs1 += 1; }
34 if (int_vec[RB].isvector)  { irs2 += 1; }
35 ```
36
37 By changing remappings, 2D matrices may be transposed "in-place" for one
38 operation, followed by setting a different permutation order without
39 having to move the values in the registers to or from memory.
40
41 Note that:
42
43 * Over-running the register file clearly has to be detected and
44 an illegal instruction exception thrown
45 * When non-default elwidths are set, the exact same algorithm still
46 applies (i.e. it offsets *polymorphic* elements *within* registers rather
47 than entire registers).
48 * If permute option 000 is utilised, the actual order of the
49 reindexing does not change. However, modulo MVL still occurs
50 which will result in repeated operations (use with caution).
51 * If two or more dimensions are set to zero, the actual order does not change!
52 * The above algorithm is pseudo-code **only**. Actual implementations
53 will need to take into account the fact that the element for-looping
54 must be **re-entrant**, due to the possibility of exceptions occurring.
55 See SVSTATE SPR, which records the current element index.
56 Continuing after return from an interrupt may introduce latency
57 due to re-computation of the remapped offsets.
58 * Twin-predicated operations require **two** separate and distinct
59 element offsets. The above pseudo-code algorithm will be applied
60 separately and independently to each, should each of the two
61 operands be remapped. *This even includes unit-strided LD/ST*
62 and other operations
63 in that category, where in that case it will be the **offset** that is
64 remapped.
65 * Offset is especially useful, on its own, for accessing elements
66 within the middle of a register. Without offsets, it is necessary
67 to either use a predicated MV, skipping the first elements, or
68 performing a LOAD/STORE cycle to memory.
69 With offsets, the data does not have to be moved.
70 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
71 less than MVL is **perfectly legal**, albeit very obscure. It permits
72 entries to be regularly presented to operands **more than once**, thus
73 allowing the same underlying registers to act as an accumulator of
74 multiple vector or matrix operations, for example.
75 * Note especially that Program Order **must** still be respected
76 even when overlaps occur that read or write the same register
77 elements *including polymorphic ones*
78
79 Clearly here some considerable care needs to be taken as the remapping
80 could hypothetically create arithmetic operations that target the
81 exact same underlying registers, resulting in data corruption due to
82 pipeline overlaps. Out-of-order / Superscalar micro-architectures with
83 register-renaming will have an easier time dealing with this than
84 DSP-style SIMD micro-architectures.
85
86 ## REMAP FFT pseudocode
87
88 The algorithm below shows how FFT REMAP works, and may be
89 executed as a python program:
90
91 ```
92 [[!inline pages="openpower/sv/remap_fft_yield.py" quick="yes" raw="yes" ]]
93 ```
94
95 The executable code above is designed to show how a hardware
96 implementation may generate Indices which are completely
97 independent of the Execution of element-level operations,
98 even for something as complex as a Triple-loop Tukey-Cooley
99 Schedule. A comprehensive demo and test suite may be found
100 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD)
101
102 Other uses include more than DFT and NTT: the Schedules are not
103 restricted in any way and if the programmer can find any algorithm
104 which has identical triple nesting then the FFT Schedule may be
105 used.
106
107 # 4x4 Matrix to vec4 Multiply Example
108
109 The following settings will allow a 4x4 matrix (starting at f8), expressed
110 as a sequence of 16 numbers first by row then by column, to be multiplied
111 by a vector of length 4 (starting at f0), using a single FMAC instruction.
112
113 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
114 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
115 * VL=16, f4=vec, f0=vec, f8=vec
116 * FMAC f4, f0, f8, f4
117
118 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
119 four iterations through the hardware loop, the REMAPed index will not
120 increment. On the second four, the index will increase by one. Likewise
121 on each subsequent group of four.
122
123 The permutation on SHAPE1 will increment f4 continuously cycling through
124 f4-f7 every four iterations of the hardware loop.
125
126 At the same time, VL will, because there is no SHAPE on f8, increment
127 straight sequentially through the 16 values f8-f23 in the Matrix. The
128 equivalent sequence thus is issued:
129
130 ```
131 fmac f4, f0, f8, f4
132 fmac f5, f0, f9, f5
133 fmac f6, f0, f10, f6
134 fmac f7, f0, f11, f7
135 fmac f4, f1, f12, f4
136 fmac f5, f1, f13, f5
137 fmac f6, f1, f14, f6
138 fmac f7, f1, f15, f7
139 fmac f4, f2, f16, f4
140 fmac f5, f2, f17, f5
141 fmac f6, f2, f18, f6
142 fmac f7, f2, f19, f7
143 fmac f4, f3, f20, f4
144 fmac f5, f3, f21, f5
145 fmac f6, f3, f22, f6
146 fmac f7, f3, f23, f7
147 ```
148
149 The only other instruction required is to ensure that f4-f7 are
150 initialised (usually to zero).
151
152 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
153 the same technique applied to four independent vectors, can be done by
154 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
155 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
156 it to apply four times to compute the four columns worth of vectors.
157
158 [[!tag standards]]
159
160 ---------
161
162 \newpage{}
163