simple_v_extension/remap.mdwn

   1 [[!tag standards]]
   2
   3 # NOTE
   4
   5 This section is under revision (and is optional)
   6
   7 # REMAP CSR <a name="remap" />
   8
   9 There is one 32-bit CSR which may be used to indicate which registers,
  10 if used in any operation, must be "reshaped" (re-mapped) from a linear
  11 form to a 2D or 3D transposed form, or "offset" to permit arbitrary
  12 access to elements within a register.
  13
  14 Their primary use is for Matrix Multiplication, reordering of sequential data in-place.  Three CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.
  15
  16 The 32-bit REMAP CSR may reshape up to 3 registers:
  17
  18 | 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
  19 | ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
  20 | shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |
  21
  22 regidx0-2 refer not to the Register CSR CAM entry but to the underlying
  23 *real* register (see regidx, the value) and consequently is 7-bits wide.
  24 When set to zero (referring to x0), clearly reshaping x0 is pointless,
  25 so is used to indicate "disabled".
  26 shape0-2 refers to one of three SHAPE CSRs.  A value of 0x3 is reserved.
  27 Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
  28
  29 It is anticipated that these specialist CSRs not be very often used.
  30 Unlike the CSR Register and Predication tables, the REMAP CSRs use
  31 the full 7-bit regidx so that they can be set once and left alone,
  32 whilst the CSR Register entries pointing to them are disabled, instead.
  33
  34 # SHAPE 1D/2D/3D vector-matrix remapping CSRs
  35
  36 There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
  37 which have the same format.  When each SHAPE CSR is set entirely to zeros,
  38 remapping is disabled: the register's elements are a linear (1D) vector.
  39
  40 | 29..24 | 23..21  | 20..18  | 17..12  | 11..6   | 5..0    |
  41 | ------ | ------- | ------- | ------- | -------- | ------- |
  42 | modulo | invxyz | permute | zdimsz  | ydimsz  | xdimsz  |
  43
  44 modulo will cause the output to wrap and remain within the range 0 to modulo. The value zero disables modulus application.
  45
  46 invxyz will invert the start index of each of x, y or z. If invxyz[0] is zero then x-dimensional counting begins from 0 and increments, otherwise it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.
  47
  48 offs is a 4-bit field, spread out across bits 7, 15 and 23, which
  49 is added to the element index during the loop calculation. It is added prior to the dimensional remapping.
  50
  51 xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
  52 that the array dimensionality for that dimension is 1.  A value of xdimsz=2
  53 would indicate that in the first dimension there are 3 elements in the
  54 array.  The format of the array is therefore as follows:
  55
  56     array[xdim+1][ydim+1][zdim+1]
  57
  58 However whilst illustrative of the dimensionality, that does not take the
  59 "permute" setting into account.  "permute" may be any one of six values
  60 (0-5, with values of 6 and 7 being reserved, and not legal).  The table
  61 below shows how the permutation dimensionality order works:
  62
  63 | permute | order | array format             |
  64 | ------- | ----- | ------------------------ |
  65 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
  66 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
  67 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
  68 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
  69 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
  70 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
  71
  72 In other words, the "permute" option changes the order in which
  73 nested for-loops over the array would be done.  The algorithm below
  74 shows this more clearly, and may be executed as a python program:
  75
  76     # mapidx = REMAP.shape2
  77     xdim = 3 # SHAPE[mapidx].xdim_sz+1
  78     ydim = 4 # SHAPE[mapidx].ydim_sz+1
  79     zdim = 5 # SHAPE[mapidx].zdim_sz+1
  80
  81     lims = [xdim, ydim, zdim]
  82     idxs = [0,0,0] # starting indices
  83     order = [1,0,2] # experiment with different permutations, here
  84     modulo = 64     # experiment with different modulus, here
  85     invxyz = [0,0,0]
  86
  87     for idx in range(xdim * ydim * zdim):
  88         ix = [0] * 3
  89         for i in range(3):
  90             ix[i] = idxs[i]
  91             if invxyz[i]:
  92                 ix[i] = lims[i] - ix[i]
  93         new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
  94         print new_idx % modulo
  95         for i in range(3):
  96             idxs[order[i]] = idxs[order[i]] + 1
  97             if (idxs[order[i]] != lims[order[i]]):
  98                 break
  99             print
 100             idxs[order[i]] = 0
 101
 102 Here, it is assumed that this algorithm be run within all pseudo-code
 103 throughout this document where a (parallelism) for-loop would normally
 104 run from 0 to VL-1 to refer to contiguous register
 105 elements; instead, where REMAP indicates to do so, the element index
 106 is run through the above algorithm to work out the **actual** element
 107 index, instead.  Given that there are three possible SHAPE entries, up to
 108 three separate registers in any given operation may be simultaneously
 109 remapped:
 110
 111     function op_add(rd, rs1, rs2) # add not VADD!
 112       ...
 113       ...
 114       for (i = 0; i < VL; i++)
 115         xSTATE.srcoffs = i # save context
 116         if (predval & 1<<i) # predication uses intregs
 117            ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
 118                                  ireg[rs2+remap(irs2)];
 119            if (!int_vec[rd ].isvector) break;
 120         if (int_vec[rd ].isvector)  { id += 1; }
 121         if (int_vec[rs1].isvector)  { irs1 += 1; }
 122         if (int_vec[rs2].isvector)  { irs2 += 1; }
 123
 124 By changing remappings, 2D matrices may be transposed "in-place" for one
 125 operation, followed by setting a different permutation order without
 126 having to move the values in the registers to or from memory.  Also,
 127 the reason for having REMAP separate from the three SHAPE CSRs is so
 128 that in a chain of matrix multiplications and additions, for example,
 129 the SHAPE CSRs need only be set up once; only the REMAP CSR need be
 130 changed to target different registers.
 131
 132 Note that:
 133
 134 * Over-running the register file clearly has to be detected and
 135   an illegal instruction exception thrown
 136 * When non-default elwidths are set, the exact same algorithm still
 137   applies (i.e. it offsets elements *within* registers rather than
 138   entire registers).
 139 * If permute option 000 is utilised, the actual order of the
 140   reindexing does not change!
 141 * If two or more dimensions are set to zero, the actual order does not change!
 142 * The above algorithm is pseudo-code **only**.  Actual implementations
 143   will need to take into account the fact that the element for-looping
 144   must be **re-entrant**, due to the possibility of exceptions occurring.
 145   See MSTATE CSR, which records the current element index.
 146 * Twin-predicated operations require **two** separate and distinct
 147   element offsets.  The above pseudo-code algorithm will be applied
 148   separately and independently to each, should each of the two
 149   operands be remapped.  *This even includes C.LDSP* and other operations
 150   in that category, where in that case it will be the **offset** that is
 151   remapped (see Compressed Stack LOAD/STORE section).
 152 * Offset is especially useful, on its own, for accessing elements
 153   within the middle of a register.  Without offsets, it is necessary
 154   to either use a predicated MV, skipping the first elements, or
 155   performing a LOAD/STORE cycle to memory.
 156   With offsets, the data does not have to be moved.
 157 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 158   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 159   entries to be regularly presented to operands **more than once**, thus
 160   allowing the same underlying registers to act as an accumulator of
 161   multiple vector or matrix operations, for example.
 162
 163 Clearly here some considerable care needs to be taken as the remapping
 164 could hypothetically create arithmetic operations that target the
 165 exact same underlying registers, resulting in data corruption due to
 166 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 167 register-renaming will have an easier time dealing with this than
 168 DSP-style SIMD micro-architectures.
 169
 170 # 4x4 Matrix to vec4 Multiply Example
 171
 172 The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
 173
 174 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 175 * VL=16.
 176 * FMAC f4, f0, f8, f4
 177
 178 The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
 179
 180 At the same time, VL will increment through the 16 values in the Matrix. The equivalent sequence thus is issued:
 181
 182     fmac f4, f0, f8, f4
 183     fmac f5, f0, f9, f5
 184     fmac f6, f0, f10, f6
 185     fmac f7, f0, f11, f7
 186     fmac f4, f1, f12, f4
 187     fmac f5, f1, f13, f5
 188     fmac f6, f1, f14, f6
 189     fmac f7, f1, f15, f7
 190     fmac f4, f2, f16, f4
 191     fmac f5, f2, f17, f5
 192     fmac f6, f2, f18, f6
 193     fmac f7, f2, f19, f7
 194     fmac f4, f3, f20, f4
 195     fmac f5, f3, f21, f5
 196     fmac f6, f3, f22, f6
 197     fmac f7, f3, f23, f7
 198