simple_v_extension/remap.mdwn

   1 [[!tag standards]]
   2
   3 # NOTE
   4
   5 This section is under revision (and is optional)
   6
   7 # REMAP CSR <a name="remap" />
   8
   9 There is one 32-bit CSR which may be used to indicate which registers,
  10 if used in any operation, must be "reshaped" (re-mapped) from a linear
  11 form to a 2D or 3D transposed form, or "offset" to permit arbitrary
  12 access to elements within a register.
  13
  14 Their primary use is for Matrix Multiplication, reordering of sequential data in-place.  Three CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.
  15
  16 The 32-bit REMAP CSR may reshape up to 3 registers:
  17
  18 | 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
  19 | ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
  20 | shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |
  21
  22 regidx0-2 refer not to the Register CSR CAM entry but to the underlying
  23 *real* register (see regidx, the value) and consequently is 7-bits wide.
  24 When set to zero (referring to x0), clearly reshaping x0 is pointless,
  25 so is used to indicate "disabled".
  26 shape0-2 refers to one of three SHAPE CSRs.  A value of 0x3 is reserved.
  27 Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
  28
  29 It is anticipated that these specialist CSRs not be very often used.
  30 Unlike the CSR Register and Predication tables, the REMAP CSRs use
  31 the full 7-bit regidx so that they can be set once and left alone,
  32 whilst the CSR Register entries pointing to them are disabled, instead.
  33
  34 # SHAPE 1D/2D/3D vector-matrix remapping CSRs
  35
  36 There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
  37 which have the same format.
  38
  39 [[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]
  40
  41 The algorithm below shows how REMAP works more clearly, and may be
  42 executed as a python program:
  43
  44     xdim = 3
  45     ydim = 4
  46     zdim = 1
  47
  48     lims = [xdim, ydim, zdim]
  49     idxs = [0,0,0] # starting indices
  50     order = [0,1,2] # experiment with different permutations, here
  51     offset = 2     # experiment with different offset, here
  52     VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
  53     applydim = 0
  54     invxyz = [0,0,0]
  55
  56     # run for offset iterations before actually starting
  57     for idx in range(offset):
  58         for i in range(3):
  59             idxs[order[i]] = idxs[order[i]] + 1
  60             if (idxs[order[i]] != lims[order[i]]):
  61                 break
  62             idxs[order[i]] = 0
  63
  64     break_count = 0
  65
  66     for idx in range(VL):
  67         ix = [0] * 3
  68         for i in range(3):
  69             if i >= applydim:
  70                 ix[i] = idxs[i]
  71             if invxyz[i]:
  72                 ix[i] = lims[i] - ix[i]
  73         new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
  74         print new_idx,
  75         break_count += 1
  76         if break_count == lims[order[0]]:
  77             print
  78             break_count = 0
  79         for i in range(3):
  80             idxs[order[i]] = idxs[order[i]] + 1
  81             if (idxs[order[i]] != lims[order[i]]):
  82                 break
  83             idxs[order[i]] = 0
  84
  85 Here, it is assumed that this algorithm be run within all pseudo-code
  86 throughout this document where a (parallelism) for-loop would normally
  87 run from 0 to VL-1 to refer to contiguous register
  88 elements; instead, where REMAP indicates to do so, the element index
  89 is run through the above algorithm to work out the **actual** element
  90 index, instead.  Given that there are three possible SHAPE entries, up to
  91 three separate registers in any given operation may be simultaneously
  92 remapped:
  93
  94     function op_add(rd, rs1, rs2) # add not VADD!
  95       ...
  96       ...
  97       for (i = 0; i < VL; i++)
  98         xSTATE.srcoffs = i # save context
  99         if (predval & 1<<i) # predication uses intregs
 100            ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
 101                                  ireg[rs2+remap(irs2)];
 102            if (!int_vec[rd ].isvector) break;
 103         if (int_vec[rd ].isvector)  { id += 1; }
 104         if (int_vec[rs1].isvector)  { irs1 += 1; }
 105         if (int_vec[rs2].isvector)  { irs2 += 1; }
 106
 107 By changing remappings, 2D matrices may be transposed "in-place" for one
 108 operation, followed by setting a different permutation order without
 109 having to move the values in the registers to or from memory.  Also,
 110 the reason for having REMAP separate from the three SHAPE CSRs is so
 111 that in a chain of matrix multiplications and additions, for example,
 112 the SHAPE CSRs need only be set up once; only the REMAP CSR need be
 113 changed to target different registers.
 114
 115 Note that:
 116
 117 * Over-running the register file clearly has to be detected and
 118   an illegal instruction exception thrown
 119 * When non-default elwidths are set, the exact same algorithm still
 120   applies (i.e. it offsets elements *within* registers rather than
 121   entire registers).
 122 * If permute option 000 is utilised, the actual order of the
 123   reindexing does not change!
 124 * If two or more dimensions are set to zero, the actual order does not change!
 125 * The above algorithm is pseudo-code **only**.  Actual implementations
 126   will need to take into account the fact that the element for-looping
 127   must be **re-entrant**, due to the possibility of exceptions occurring.
 128   See MSTATE CSR, which records the current element index.
 129 * Twin-predicated operations require **two** separate and distinct
 130   element offsets.  The above pseudo-code algorithm will be applied
 131   separately and independently to each, should each of the two
 132   operands be remapped.  *This even includes C.LDSP* and other operations
 133   in that category, where in that case it will be the **offset** that is
 134   remapped (see Compressed Stack LOAD/STORE section).
 135 * Offset is especially useful, on its own, for accessing elements
 136   within the middle of a register.  Without offsets, it is necessary
 137   to either use a predicated MV, skipping the first elements, or
 138   performing a LOAD/STORE cycle to memory.
 139   With offsets, the data does not have to be moved.
 140 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 141   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 142   entries to be regularly presented to operands **more than once**, thus
 143   allowing the same underlying registers to act as an accumulator of
 144   multiple vector or matrix operations, for example.
 145
 146 Clearly here some considerable care needs to be taken as the remapping
 147 could hypothetically create arithmetic operations that target the
 148 exact same underlying registers, resulting in data corruption due to
 149 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 150 register-renaming will have an easier time dealing with this than
 151 DSP-style SIMD micro-architectures.
 152
 153 # 4x4 Matrix to vec4 Multiply Example
 154
 155 The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
 156
 157 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 158 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
 159 * VL=16, f4=vec, f0=vec, f8=vec
 160 * FMAC f4, f0, f8, f4
 161
 162 The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
 163
 164 The permutation on SHAPE1 will increment f4 continuously cycling through f4-f7 every four iterations of the hardware loop.
 165
 166 At the same time, VL will, because there is no SHAPE on f8, increment straight sequentially through the 16 values f8-f23 in the Matrix. The equivalent sequence thus is issued:
 167
 168     fmac f4, f0, f8, f4
 169     fmac f5, f0, f9, f5
 170     fmac f6, f0, f10, f6
 171     fmac f7, f0, f11, f7
 172     fmac f4, f1, f12, f4
 173     fmac f5, f1, f13, f5
 174     fmac f6, f1, f14, f6
 175     fmac f7, f1, f15, f7
 176     fmac f4, f2, f16, f4
 177     fmac f5, f2, f17, f5
 178     fmac f6, f2, f18, f6
 179     fmac f7, f2, f19, f7
 180     fmac f4, f3, f20, f4
 181     fmac f5, f3, f21, f5
 182     fmac f6, f3, f22, f6
 183     fmac f7, f3, f23, f7
 184
 185 The only other instruction required is to ensure that f4-f7 are initialised (usually to zero).
 186
 187 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 CSRs, and applying a rotating 1D SHAPE CSR of xdim=16 to f8 in order to get it to apply four times to compute the four columns worth of vectors.