[[!tag standards]]

# NOTE

This section is under revision (and is optional).

# REMAP CSR <a name="remap" />

There is one 32-bit CSR which may be used to indicate which registers,
if used in any operation, must be "reshaped" (re-mapped) from a linear
form to a 2D or 3D transposed form, or "offset" to permit arbitrary
access to elements within a register.

The primary use of REMAP and the SHAPE CSRs is Matrix Multiplication and in-place reordering of sequential data. Three SHAPE CSRs are provided so that a single FMAC may be used in a single loop to perform a 4x4 by 4x4 Matrix Multiplication, generating 64 FMACs.

The 32-bit REMAP CSR may reshape up to 3 registers:

| 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
| shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |

regidx0-2 refer not to the Register CSR CAM entry but to the underlying
*real* register (see regidx, the value) and consequently are 7 bits wide.
A value of zero would refer to x0; since reshaping x0 is pointless,
zero is used to indicate "disabled".
shape0-2 each refer to one of the three SHAPE CSRs. A value of 0x3 is reserved.
Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
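The field layout above can be illustrated with a small decoder. This is a minimal sketch only: the function name `decode_remap` and the dictionary keys are hypothetical, chosen for illustration, and are not part of the specification.

```python
def decode_remap(csr):
    """Split a 32-bit REMAP value into three (regidx, shape) entries."""
    # Reserved bits 7, 15, 23, 30 and 31 must be zero.
    reserved = (1 << 7) | (1 << 15) | (1 << 23) | (0b11 << 30)
    if csr & reserved:
        raise ValueError("reserved REMAP bits must be zero")
    entries = []
    for n in range(3):
        regidx = (csr >> (8 * n)) & 0x7F       # 7-bit real register number
        shape = (csr >> (24 + 2 * n)) & 0x3    # selects SHAPE0..SHAPE2
        if shape == 3:
            raise ValueError("shape value 0x3 is reserved")
        # regidx == 0 (x0) marks the entry as disabled
        entries.append({"regidx": regidx, "shape": shape,
                        "enabled": regidx != 0})
    return entries

# regidx0 = 8 using SHAPE1, regidx1 = 16 using SHAPE0, entry 2 disabled
csr = (1 << 24) | 8 | (16 << 8)
for entry in decode_remap(csr):
    print(entry)
```

Treating a zero regidx as "disabled" follows the text above; rejecting reserved bits with an exception is a choice made for the sketch, not a statement about how hardware must respond.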

It is anticipated that these specialist CSRs will not be used very often.
Unlike the CSR Register and Predication tables, the REMAP CSRs use
the full 7-bit regidx so that they can be set once and left alone,
whilst the CSR Register entries pointing to them are disabled, instead.

# SHAPE 1D/2D/3D vector-matrix remapping CSRs

There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
which have the same format. When a SHAPE CSR is set entirely to zeros,
remapping is disabled: the register's elements are a linear (1D) vector.

| 31..30   | 29..24 | 23..21  | 20..18  | 17..12  | 11..6    | 5..0    |
| -------- | ------ | ------- | ------- | ------- | -------- | ------- |
| applydim | modulo | invxyz  | permute | zdimsz  | ydimsz   | xdimsz  |

applydim will set to zero the dimensions less than this. applydim=0 applies all three. applydim=1 applies y and z. applydim=2 applies only z. applydim=3 is reserved.

invxyz will invert the start index of each of x, y or z. If invxyz[0] is zero then x-dimensional counting begins from 0 and increments, otherwise it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.

modulo will cause the output to wrap and remain within the range 0 to modulo-1. The value zero disables modulus application. Note that modulo arithmetic is applied after all other remapping calculations.
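As a rough illustration of the SHAPE layout, the sketch below unpacks a 32-bit value into its fields. The function name `decode_shape` and the dictionary keys are hypothetical. It assumes that invxyz[0] is the least significant bit of the invxyz field, and it reports the three dimensions as element counts (field value plus one), following the offset-by-one encoding of xdimsz, ydimsz and zdimsz.

```python
def decode_shape(csr):
    """Unpack a 32-bit SHAPE value into named fields (illustrative only)."""
    fields = {
        "xdim": (csr & 0x3F) + 1,             # bits 5..0, stored as size-1
        "ydim": ((csr >> 6) & 0x3F) + 1,      # bits 11..6, stored as size-1
        "zdim": ((csr >> 12) & 0x3F) + 1,     # bits 17..12, stored as size-1
        "permute": (csr >> 18) & 0x7,         # bits 20..18
        # bits 23..21; assumes invxyz[0] (x) is the lowest bit of the field
        "invxyz": [(csr >> (21 + i)) & 1 for i in range(3)],
        "modulo": (csr >> 24) & 0x3F,         # bits 29..24, 0 = disabled
        "applydim": (csr >> 30) & 0x3,        # bits 31..30
    }
    if fields["permute"] > 5:
        raise ValueError("permute values 6 and 7 are reserved")
    if fields["applydim"] == 3:
        raise ValueError("applydim=3 is reserved")
    return fields

# A 4x4 (x by y) shape with permute=010 and everything else zero
print(decode_shape(3 | (3 << 6) | (0b010 << 18)))
# An all-zero SHAPE decodes to a 1x1x1 array, i.e. a plain linear vector
print(decode_shape(0))
```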

xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the array dimensionality for that dimension is 1. A value of xdimsz=2
would indicate that in the first dimension there are 3 elements in the
array. The format of the array is therefore as follows:

    array[xdim+1][ydim+1][zdim+1]

However, whilst illustrative of the dimensionality, that does not take the
"permute" setting into account. "permute" may be any one of six values
(0-5, with values of 6 and 7 being reserved, and not legal). The table
below shows how the permutation dimensionality order works:

| permute | order | array format             |
| ------- | ----- | ------------------------ |
| 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
| 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
| 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
| 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
| 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
| 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |

In other words, the "permute" option changes the order in which
nested for-loops over the array would be done. The algorithm below
shows this more clearly, and may be executed as a Python 3 program:

    # mapidx = REMAP.shape2
    xdim = 3 # SHAPE[mapidx].xdimsz+1
    ydim = 4 # SHAPE[mapidx].ydimsz+1
    zdim = 5 # SHAPE[mapidx].zdimsz+1

    lims = [xdim, ydim, zdim]
    idxs = [0,0,0] # starting indices
    order = [1,0,2] # experiment with different permutations, here
    modulo = 64 # experiment with different modulus, here (0 disables)
    applydim = 0
    invxyz = [0,0,0]

    for idx in range(xdim * ydim * zdim):
        ix = [0] * 3
        for i in range(3):
            if i >= applydim:
                ix[i] = idxs[i]
            if invxyz[i]:
                ix[i] = lims[i] - 1 - ix[i] # count down from lims[i]-1
        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
        print(new_idx % modulo if modulo else new_idx)
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            print()
            idxs[order[i]] = 0
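For experimentation it can be convenient to wrap the same loop in a function that returns the index sequence rather than printing it. The sketch below does that. The name `remap_indices` and its parameters are hypothetical. Inversion is taken as `dims[i] - 1 - ix[i]`, matching the statement that inverted counting begins from xdimsz-1, and a modulo of zero is treated as "disabled" as described earlier.

```python
def remap_indices(dims, order, invxyz=(0, 0, 0), applydim=0, modulo=0):
    """Return the remapped element indices for one full pass over dims."""
    idxs = [0, 0, 0]
    out = []
    for _ in range(dims[0] * dims[1] * dims[2]):
        ix = [0, 0, 0]
        for i in range(3):
            if i >= applydim:
                ix[i] = idxs[i]
            if invxyz[i]:
                ix[i] = dims[i] - 1 - ix[i]
        new_idx = ix[0] + ix[1] * dims[0] + ix[2] * dims[0] * dims[1]
        out.append(new_idx % modulo if modulo else new_idx)
        # advance the innermost dimension first, carrying into the next
        for i in range(3):
            idxs[order[i]] += 1
            if idxs[order[i]] != dims[order[i]]:
                break
            idxs[order[i]] = 0
    return out

# A 3-wide, 2-high array: x counted first gives the linear order,
# y counted first gives the transposed order.
print(remap_indices((3, 2, 1), order=(0, 1, 2)))  # [0, 1, 2, 3, 4, 5]
print(remap_indices((3, 2, 1), order=(1, 0, 2)))  # [0, 3, 1, 4, 2, 5]
```

The second call shows the transposition effect: consecutive loop steps walk down a column of the stored array instead of along a row.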

Here, it is assumed that this algorithm is run within all pseudo-code
throughout this document where a (parallelism) for-loop would normally
run from 0 to VL-1 to refer to contiguous register
elements; instead, where REMAP indicates to do so, the element index
is run through the above algorithm to work out the **actual** element
index. Given that there are three possible SHAPE entries, up to
three separate registers in any given operation may be simultaneously
remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
                                 ireg[rs2+remap(irs2)];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
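The effect of remapping one operand can be modelled in a few lines of Python. This is a deliberately simplified sketch, not the pseudo-code above: it omits predication, the scalar early exit and re-entrancy, and it represents each remapping as a precomputed list of element offsets. The name `vector_add` and the `remap_*` parameters are hypothetical.

```python
def vector_add(regs, rd, rs1, rs2, vl,
               remap_rd=None, remap_rs1=None, remap_rs2=None):
    """Element-wise add over a flat register list, with optional remapping.

    Each remap_* argument is a list giving, for loop step i, the element
    offset to use in place of i (None means no remapping).
    """
    linear = list(range(vl))
    m_rd = remap_rd if remap_rd is not None else linear
    m_rs1 = remap_rs1 if remap_rs1 is not None else linear
    m_rs2 = remap_rs2 if remap_rs2 is not None else linear
    for i in range(vl):
        regs[rd + m_rd[i]] = regs[rs1 + m_rs1[i]] + regs[rs2 + m_rs2[i]]

regs = [0] * 32
regs[0:6] = [10, 11, 12, 13, 14, 15]   # 3-wide, 2-high matrix at register 0
# Registers 8-13 stay zero, so the add simply copies the remapped source.
vector_add(regs, rd=16, rs1=0, rs2=8, vl=6, remap_rs1=[0, 3, 1, 4, 2, 5])
print(regs[16:22])  # [10, 13, 11, 14, 12, 15]
```

Only rs1 is remapped here, so the source matrix is read in transposed order while the destination is written linearly, and the source registers themselves are left untouched.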

By changing remappings, 2D matrices may be transposed "in-place" for one
operation, followed by setting a different permutation order without
having to move the values in the registers to or from memory. Also,
the reason for having REMAP separate from the three SHAPE CSRs is so
that in a chain of matrix multiplications and additions, for example,
the SHAPE CSRs need only be set up once; only the REMAP CSR need be
changed to target different registers.

Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets elements *within* registers rather than
  entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change!
* If two or more dimensions are set to zero, the actual order does not change!
* The above algorithm is pseudo-code **only**. Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See MSTATE CSR, which records the current element index.
* Twin-predicated operations require **two** separate and distinct
  element offsets. The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped. *This even includes C.LDSP* and other operations
  in that category, where in that case it will be the **offset** that is
  remapped (see Compressed Stack LOAD/STORE section).
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register. Without offsets, it is necessary
  to either use a predicated MV, skipping the first elements, or
  perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  less than MVL is **perfectly legal**, albeit very obscure. It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.
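The last point can be illustrated with a short sketch. It assumes that once every dimension has been exhausted the counters simply start again from zero, so that a shape covering fewer elements than the loop length is walked repeatedly. The function name `wrapped_offsets` is hypothetical and only two dimensions are modelled.

```python
def wrapped_offsets(xdim, ydim, vl):
    """Element offsets for a 2D shape stepped vl times, x counted first."""
    out = []
    x = y = 0
    for _ in range(vl):
        out.append(x + y * xdim)
        x += 1
        if x == xdim:
            x = 0
            y = (y + 1) % ydim   # assumed: wrap to the start when exhausted
    return out

# A 4-element shape stepped 16 times presents each element four times,
# so four destination elements can accumulate sixteen source values.
offsets = wrapped_offsets(4, 1, 16)
data = list(range(16))
acc = [0] * 4
for i, off in enumerate(offsets):
    acc[off] += data[i]
print(offsets)  # [0, 1, 2, 3] repeated four times
print(acc)      # [24, 28, 32, 36]
```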

Clearly here some considerable care needs to be taken as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps. Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.

# 4x4 Matrix to vec4 Multiply Example

The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.

* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
* VL=16, f4=vec, f0=vec, f8=vec
* FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.

The permutation on SHAPE1 will increment f4 continuously, cycling through f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, the hardware loop will step sequentially through the 16 values f8-f23 of the Matrix. The equivalent sequence issued is thus:

    fmac f4, f0, f8, f4
    fmac f5, f0, f9, f5
    fmac f6, f0, f10, f6
    fmac f7, f0, f11, f7
    fmac f4, f1, f12, f4
    fmac f5, f1, f13, f5
    fmac f6, f1, f14, f6
    fmac f7, f1, f15, f7
    fmac f4, f2, f16, f4
    fmac f5, f2, f17, f5
    fmac f6, f2, f18, f6
    fmac f7, f2, f19, f7
    fmac f4, f3, f20, f4
    fmac f5, f3, f21, f5
    fmac f6, f3, f22, f6
    fmac f7, f3, f23, f7
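The sequence above can be reproduced and checked numerically with a short model. This sketch takes the three register patterns directly from the listed sequence (destination cycling every step, vector source advancing every four steps, matrix source sequential) rather than deriving them from the SHAPE settings. The name `issue_fmacs` and the flat `fregs` list are hypothetical.

```python
def issue_fmacs(vl=16):
    """Yield (rd, rs1, rs2) float register numbers for each element step."""
    for i in range(vl):
        rd = 4 + i % 4     # f4-f7, cycling every four steps
        rs1 = 0 + i // 4   # f0-f3, advancing once every four steps
        rs2 = 8 + i        # f8-f23, plain sequential access
        yield rd, rs1, rs2

fregs = [0.0] * 32                             # f4-f7 start at zero
fregs[0:4] = [1.0, 2.0, 3.0, 4.0]              # vec4 in f0-f3
fregs[8:24] = [float(n) for n in range(16)]    # 16 matrix values in f8-f23
for rd, rs1, rs2 in issue_fmacs():
    fregs[rd] = fregs[rs1] * fregs[rs2] + fregs[rd]   # fmac rd, rs1, rs2, rd

print(fregs[4:8])  # [80.0, 90.0, 100.0, 110.0]
```

Each result element j ends up holding the sum over k of `vec[k] * matrix[4*k + j]`, which is the accumulation pattern the sixteen FMACs implement.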

The only other instruction required is one to ensure that f4-f7 are initialised (usually to zero).

It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 CSRs, and applying a rotating 1D SHAPE CSR of xdim=16 to f8, so that the matrix is applied four times to compute the four columns' worth of vectors.
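One possible reading of this can be sketched as three index sequences over 64 steps. The exact sequences are an assumption made for illustration: the text does not spell out the SHAPE values, so the formulas below (and the name `matmul_steps`) are hypothetical. The matrix index wraps every 16 steps, while the vector and result indices follow the vec4 pattern within each group of 16.

```python
def matmul_steps(vl=64):
    """Yield (result, vector, matrix) element offsets for each loop step."""
    for i in range(vl):
        col = i // 16                       # which of the four vectors
        yield (4 * col + i % 4,             # result element (assumed layout)
               4 * col + (i // 4) % 4,      # source vector element
               i % 16)                      # matrix element, wraps every 16

mat = [float(n) for n in range(16)]
vecs = [float(n % 5) for n in range(16)]    # four vec4s stored back to back
result = [0.0] * 16                         # accumulators initialised to zero
for rd, rs1, rs2 in matmul_steps():
    result[rd] += vecs[rs1] * mat[rs2]      # one FMAC per step, 64 in total

# Reference: each vec4 multiplied independently, as in the vec4 example
expected = [sum(vecs[4 * c + k] * mat[4 * k + j] for k in range(4))
            for c in range(4) for j in range(4)]
print(result == expected)
```

Under these assumed sequences the 64 steps are simply the 16-step vec4 pattern repeated for each of the four vectors, with only the matrix operand wrapping back to its start.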