[[!tag standards]]

# NOTE

This section is under revision (and is optional).

# REMAP CSR <a name="remap" />

There is one 32-bit CSR which may be used to indicate which registers,
if used in any operation, must be "reshaped" (re-mapped) from a linear
form to a 2D or 3D transposed form, or "offset" to permit arbitrary
access to elements within a register.

The primary use of remapping is matrix multiplication and in-place
reordering of sequential data. Three SHAPE CSRs are provided so that
a single FMAC may be used in a single loop to perform a 4x4 by 4x4
matrix multiplication, generating 64 FMAC operations.

The 32-bit REMAP CSR may reshape up to 3 registers:

| 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
| shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |

regidx0-2 refer not to the Register CSR CAM entry but to the underlying
*real* register (see regidx, the value), and are consequently 7 bits wide.
Reshaping x0 is clearly pointless, so a regidx of zero (which would refer
to x0) is used to indicate "disabled".
shape0-2 each refer to one of the three SHAPE CSRs. A value of 0x3 is reserved.
Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.

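The field layout above can be illustrated with a small decoder. This is
only a sketch: the function name `decode_remap` is hypothetical, and
raising an exception on reserved values is an assumption made for the
example rather than specified behaviour.

```python
def decode_remap(csr):
    """Split a 32-bit REMAP value into three (regidx, shape) pairs."""
    # Bits 7, 15, 23, 30 and 31 are reserved and must be zero.
    reserved_mask = (1 << 7) | (1 << 15) | (1 << 23) | (0b11 << 30)
    if csr & reserved_mask:
        raise ValueError("reserved REMAP bits must be zero")
    entries = []
    for n in range(3):
        regidx = (csr >> (8 * n)) & 0x7F      # bits 6..0, 14..8, 22..16
        shape = (csr >> (24 + 2 * n)) & 0x3   # bits 25..24, 27..26, 29..28
        if shape == 0x3:
            raise ValueError("shape value 0x3 is reserved")
        entries.append((regidx, shape))       # regidx == 0 means "disabled"
    return entries

# Remap real register 8 using SHAPE1; the other two entries stay disabled.
print(decode_remap((1 << 24) | 8))   # [(8, 1), (0, 0), (0, 0)]
```
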
It is anticipated that these specialist CSRs will not be used very often.
Unlike the CSR Register and Predication tables, the REMAP CSRs use
the full 7-bit regidx so that they can be set once and left alone,
whilst the CSR Register entries pointing to them are disabled, instead.

# SHAPE 1D/2D/3D vector-matrix remapping CSRs

There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
which all have the same format. When a SHAPE CSR is set entirely to zeros,
remapping is disabled: the register's elements are a linear (1D) vector.

| 31..25 | 24..22  | 21..18 | 17..12 | 11..6  | 5..0   |
| ------ | ------- | ------ | ------ | ------ | ------ |
| modulo | permute | offs   | zdimsz | ydimsz | xdimsz |

modulo is applied to the output, causing it to cycle within the range
0..modulo-1. Note that zero indicates "unlimited". modulo occupies
7 bits (31..25), which is enough to hold the maximum VL of 64.
Modulo is applied after dimensional remapping.

offs is a 4-bit field (bits 21..18) which is added to the element index
during the loop calculation. It is added prior to the dimensional remapping.

xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the size of that dimension is 1. A value of xdimsz=2
would indicate that in the first dimension there are 3 elements in the
array. The format of the array is therefore as follows:

    array[xdim+1][ydim+1][zdim+1]

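As an illustration of the offset-by-1 encoding, the sketch below packs and
unpacks a SHAPE value following the bit positions in the table above. The
names `encode_shape` and `decode_shape` are hypothetical, and no range
checking is performed.

```python
def encode_shape(xdim, ydim, zdim, offs=0, permute=0, modulo=0):
    """Pack SHAPE fields; dimensions are given as real sizes (1 or more)."""
    return ((xdim - 1)              # xdimsz, bits 5..0, stored as size-1
            | (ydim - 1) << 6       # ydimsz, bits 11..6
            | (zdim - 1) << 12      # zdimsz, bits 17..12
            | offs << 18            # offs, bits 21..18
            | permute << 22         # permute, bits 24..22
            | modulo << 25)         # modulo, bits 31..25 (0 = unlimited)

def decode_shape(csr):
    """Unpack a 32-bit SHAPE value into real dimension sizes and options."""
    return {
        "xdim": (csr & 0x3F) + 1,
        "ydim": ((csr >> 6) & 0x3F) + 1,
        "zdim": ((csr >> 12) & 0x3F) + 1,
        "offs": (csr >> 18) & 0xF,
        "permute": (csr >> 22) & 0x7,
        "modulo": (csr >> 25) & 0x7F,
    }

csr = encode_shape(3, 4, 5)
print(csr)                 # 16578
print(decode_shape(csr))   # xdim=3, ydim=4, zdim=5, all other fields 0
```

An all-zero SHAPE value decodes to a 1x1x1 array with no offset, permute
or modulo, which corresponds to the "remapping disabled" case.
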
However, whilst illustrative of the dimensionality, that does not take the
"permute" setting into account. "permute" may be any one of six values
(0-5; values 6 and 7 are reserved and not legal). The table
below shows how the permutation dimensionality order works:

| permute | order | array format             |
| ------- | ----- | ------------------------ |
| 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
| 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
| 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
| 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
| 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
| 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |

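The table can be expressed as a lookup from the 3-bit permute value to a
dimension order. The sketch below is illustrative only: `PERMUTE_ORDER`,
`loop_order` and `array_format` are hypothetical names, and rejecting the
reserved values with an exception is an assumption.

```python
# permute value -> order in which dimensions 0 (x), 1 (y), 2 (z) are walked
PERMUTE_ORDER = {
    0b000: (0, 1, 2),
    0b001: (0, 2, 1),
    0b010: (1, 0, 2),
    0b011: (1, 2, 0),
    0b100: (2, 0, 1),
    0b101: (2, 1, 0),
}

def loop_order(permute):
    """Return the dimension order for a permute value; 6 and 7 are reserved."""
    if permute not in PERMUTE_ORDER:
        raise ValueError("permute values 6 and 7 are reserved")
    return PERMUTE_ORDER[permute]

def array_format(permute):
    """Render the 'array format' column of the table for a permute value."""
    names = ("xdim", "ydim", "zdim")
    return "".join("(%s+1)" % names[d] for d in loop_order(permute))

print(loop_order(0b011))     # (1, 2, 0)
print(array_format(0b011))   # (ydim+1)(zdim+1)(xdim+1)
```
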
In other words, the "permute" option changes the order in which
nested for-loops over the array would be done. The algorithm below
shows this more clearly, and may be executed as a python program:

    # mapidx = REMAP.shape2
    xdim = 3 # SHAPE[mapidx].xdim_sz+1
    ydim = 4 # SHAPE[mapidx].ydim_sz+1
    zdim = 5 # SHAPE[mapidx].zdim_sz+1

    lims = [xdim, ydim, zdim]
    idxs = [0,0,0]  # starting indices
    order = [1,0,2] # experiment with different permutations, here
    offs = 0        # experiment with different offsets, here
    modulo = 64     # set different modulus, here (zero means unlimited)

    for idx in range(xdim * ydim * zdim):
        new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
        print(new_idx % modulo if modulo else new_idx)
        # advance the indices like an odometer, in the permuted order
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            print()  # blank line each time a dimension wraps
            idxs[order[i]] = 0

Here, it is assumed that this algorithm is run within all pseudo-code
throughout this document where a (parallelism) for-loop would normally
run from 0 to VL-1 to refer to contiguous register
elements. Where REMAP indicates to do so, the element index
is instead run through the above algorithm to work out the **actual**
element index. Given that there are three possible SHAPE entries, up to
three separate registers in any given operation may be simultaneously
remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
                                 ireg[rs2+remap(irs2)];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

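A much-simplified Python model of the loop above may help to show the
effect of per-operand remapping. It is a sketch under several assumptions:
all three operands are treated as vectors, the register file is a flat
list, and each remapped register is given a precomputed list of element
indices (the `remap` dictionary is a hypothetical stand-in for the
REMAP and SHAPE CSR lookup).

```python
def op_add(ireg, rd, rs1, rs2, vl, remap=None, predval=-1):
    """Element-wise add; each operand may have its own remapped index list."""
    remap = remap or {}
    linear = list(range(vl))           # default: contiguous elements 0..VL-1
    d_idx = remap.get(rd, linear)
    s1_idx = remap.get(rs1, linear)
    s2_idx = remap.get(rs2, linear)
    for i in range(vl):
        if predval & (1 << i):         # skip masked-out elements
            ireg[rd + d_idx[i]] = ireg[rs1 + s1_idx[i]] + ireg[rs2 + s2_idx[i]]

ireg = [0] * 32
ireg[8:14] = [1, 2, 3, 4, 5, 6]        # rs1: 2x3 matrix, row-major
ireg[16:22] = [10] * 6                 # rs2: not remapped
# Index list for xdim=3, ydim=2, zdim=1 with order [1,0,2]
op_add(ireg, rd=24, rs1=8, rs2=16, vl=6, remap={8: [0, 3, 1, 4, 2, 5]})
print(ireg[24:30])                     # [11, 14, 12, 15, 13, 16]
```

Only rs1 is remapped here, so its elements are read in transposed order
while rs2 and rd are accessed linearly.
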
By changing remappings, 2D matrices may be transposed "in-place" for one
operation, and a different permutation order may then be set for the
next, without having to move the values in the registers to or from
memory. Also, the reason for having REMAP separate from the three SHAPE
CSRs is so that in a chain of matrix multiplications and additions, for
example, the SHAPE CSRs need only be set up once; only the REMAP CSR
need be changed to target different registers.

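The in-place transposition can be seen by running the remapping algorithm
on a small matrix with two different orders. The sketch below wraps the
algorithm in a hypothetical helper, `remap_indices`, and assumes a 2-row
by 3-column matrix stored row-major (xdim=3, ydim=2, zdim=1) with no
offset or modulo.

```python
def remap_indices(dims, order):
    """Return the element indices visited for the given sizes and loop order."""
    lims = list(dims)
    idxs = [0, 0, 0]
    out = []
    for _ in range(lims[0] * lims[1] * lims[2]):
        out.append(idxs[0] + idxs[1] * lims[0] + idxs[2] * lims[0] * lims[1])
        for d in order:                # advance like an odometer
            idxs[d] += 1
            if idxs[d] != lims[d]:
                break
            idxs[d] = 0
    return out

matrix = ["a", "b", "c", "d", "e", "f"]   # rows: [a b c] and [d e f]
linear = remap_indices((3, 2, 1), (0, 1, 2))
transposed = remap_indices((3, 2, 1), (1, 0, 2))
print(linear)                             # [0, 1, 2, 3, 4, 5]
print(transposed)                         # [0, 3, 1, 4, 2, 5]
print([matrix[i] for i in transposed])    # ['a', 'd', 'b', 'e', 'c', 'f']
```

The register contents never move: only the order in which elements are
presented to the operation changes.
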
Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets elements *within* registers rather than
  entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change!
* If two or more of xdimsz, ydimsz and zdimsz are set to zero (a size
  of 1), the actual order does not change!
* The above algorithm is pseudo-code **only**. Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See MSTATE CSR, which records the current element index.
* Twin-predicated operations require **two** separate and distinct
  element offsets. The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped. *This even includes C.LDSP* and other operations
  in that category, where in that case it will be the **offset** that is
  remapped (see Compressed Stack LOAD/STORE section).
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register. Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or to
  perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  less than MVL is **perfectly legal**, albeit very obscure. It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.

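Two of the notes above, offset-only access and shapes with fewer total
elements than the loop length, can be sketched as follows. The helper
`remap_indices` and its `count` parameter are hypothetical; the
wrap-around simply falls out of leaving the algorithm's index counters
running past the last element, which is assumed here to be the intended
behaviour.

```python
def remap_indices(dims, order, count, offs=0, modulo=0):
    """Run the remapping algorithm for `count` elements."""
    lims = list(dims)
    idxs = [0, 0, 0]
    out = []
    for _ in range(count):
        new_idx = offs + idxs[0] + idxs[1] * lims[0] + idxs[2] * lims[0] * lims[1]
        out.append(new_idx % modulo if modulo else new_idx)  # 0 = unlimited
        for d in order:                # counters wrap back to zero together
            idxs[d] += 1
            if idxs[d] != lims[d]:
                break
            idxs[d] = 0
    return out

# Offset only: start two elements into the register, no data movement.
print(remap_indices((4, 1, 1), (0, 1, 2), count=4, offs=2))            # [2, 3, 4, 5]
# Same offset with modulo 4: indices cycle within 0..3.
print(remap_indices((4, 1, 1), (0, 1, 2), count=4, offs=2, modulo=4))  # [2, 3, 0, 1]
# A 2x2 shape run for 8 elements presents each entry twice.
print(remap_indices((2, 2, 1), (0, 1, 2), count=8))   # [0, 1, 2, 3, 0, 1, 2, 3]
```
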
Clearly, considerable care needs to be taken here, as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps. Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.
