# NOTE

This section is under revision (and is optional).

# REMAP CSR <a name="remap" />

(Note: both the REMAP and SHAPE sections are best read after the
 rest of the document has been read)

There is one 32-bit CSR which may be used to indicate which registers,
if used in any operation, must be "reshaped" (re-mapped) from a linear
form to a 2D or 3D transposed form, or "offset" to permit arbitrary
access to elements within a register.

The 32-bit REMAP CSR may reshape up to 3 registers:

| 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
| shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |

regidx0-2 refer not to the Register CSR CAM entry but to the underlying
*real* register (see regidx, the value), and are consequently 7 bits wide.
Reshaping x0 is clearly pointless, so a regidx of zero (referring to x0)
is used to indicate "disabled".
shape0-2 each select one of the three SHAPE CSRs.  A value of 0x3 is reserved.
Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.

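The field layout above can be unpacked with a few shifts and masks. The following is a minimal sketch only: the function name `decode_remap`, the use of `None` for a disabled entry, and the choice to raise `ValueError` on reserved values are all assumptions made for illustration, not behaviour defined by this document.

```python
def decode_remap(csr):
    """Split a 32-bit REMAP CSR value into three (regidx, shape) entries.

    Hypothetical helper: a disabled entry (regidx == 0) is returned as None.
    """
    # Bits 7, 15, 23, 30 and 31 are reserved and must be zero.
    reserved = (1 << 7) | (1 << 15) | (1 << 23) | (0b11 << 30)
    if csr & reserved:
        raise ValueError("reserved REMAP bits set")
    entries = []
    for n in range(3):
        regidx = (csr >> (8 * n)) & 0x7F       # 7-bit real register number
        shape = (csr >> (24 + 2 * n)) & 0x3    # selects SHAPE0, SHAPE1 or SHAPE2
        if shape == 0x3:
            raise ValueError("shape value 0x3 is reserved")
        # regidx == 0 refers to x0, which is used to mean "disabled"
        entries.append((regidx, shape) if regidx != 0 else None)
    return entries

# entry 0: register 10 using SHAPE0; entry 1: register 20 using SHAPE1
csr = (1 << 26) | (20 << 8) | 10
print(decode_remap(csr))
```

Here the third entry is left at zero, so it decodes as disabled.
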
It is anticipated that these specialist CSRs will not be used very often.
Unlike the CSR Register and Predication tables, the REMAP CSR entries use
the full 7-bit regidx so that they can be set once and left alone;
it is the CSR Register entries pointing to them that are disabled, instead.

# SHAPE 1D/2D/3D vector-matrix remapping CSRs

(Note: both the REMAP and SHAPE sections are best read after the
 rest of the document has been read)

There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
which all have the same format.  When a SHAPE CSR is set entirely to zeros,
remapping is disabled: the register's elements are a linear (1D) vector.

| 26..24  | 23      | 22..16  | 15      | 14..8   | 7       | 6..0    |
| ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| permute | offs[2] | zdimsz  | offs[1] | ydimsz  | offs[0] | xdimsz  |

offs is a 3-bit field, spread out across bits 7, 15 and 23, which
is added to the element index during the loop calculation.

xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the size of the array in that dimension is 1.  A value of xdimsz=2
would indicate that in the first dimension there are 3 elements in the
array.  The format of the array is therefore as follows:

    array[xdimsz+1][ydimsz+1][zdimsz+1]

However, whilst illustrative of the dimensionality, that does not take the
"permute" setting into account.  "permute" may be any one of six values
(0-5; values 6 and 7 are reserved, and not legal).  The table
below shows how the permutation dimensionality order works:

| permute | order | array format                   |
| ------- | ----- | ------------------------------ |
| 000     | 0,1,2 | (xdimsz+1)(ydimsz+1)(zdimsz+1) |
| 001     | 0,2,1 | (xdimsz+1)(zdimsz+1)(ydimsz+1) |
| 010     | 1,0,2 | (ydimsz+1)(xdimsz+1)(zdimsz+1) |
| 011     | 1,2,0 | (ydimsz+1)(zdimsz+1)(xdimsz+1) |
| 100     | 2,0,1 | (zdimsz+1)(xdimsz+1)(ydimsz+1) |
| 101     | 2,1,0 | (zdimsz+1)(ydimsz+1)(xdimsz+1) |

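The SHAPE fields and the permute table can be combined into a small decoder. This is a sketch under stated assumptions: `decode_shape`, `PERMUTE_ORDER` and the returned dictionary layout are hypothetical names chosen for illustration, and raising `ValueError` for the reserved permute values is likewise an assumption. Any bits above bit 26 are simply ignored here, since the table does not describe them.

```python
# permute field value -> loop order (0 = x, 1 = y, 2 = z), as in the table
PERMUTE_ORDER = {
    0b000: (0, 1, 2),
    0b001: (0, 2, 1),
    0b010: (1, 0, 2),
    0b011: (1, 2, 0),
    0b100: (2, 0, 1),
    0b101: (2, 1, 0),
}

def decode_shape(csr):
    """Unpack a SHAPE CSR value into dimension sizes, offset and loop order."""
    dims = []
    offs = 0
    for n in range(3):
        dimsz = (csr >> (8 * n)) & 0x7F           # xdimsz, ydimsz, zdimsz
        dims.append(dimsz + 1)                    # fields are offset by 1
        offs |= ((csr >> (8 * n + 7)) & 1) << n   # offs[n] is bit 7, 15 or 23
    permute = (csr >> 24) & 0x7
    if permute not in PERMUTE_ORDER:
        raise ValueError("permute values 6 and 7 are reserved")
    return {"dims": tuple(dims), "offs": offs, "order": PERMUTE_ORDER[permute]}

# xdimsz=2, ydimsz=3, zdimsz=4 (a 3 x 4 x 5 array), offs=0b001, permute=0b010
csr = (0b010 << 24) | (4 << 16) | (3 << 8) | (1 << 7) | 2
shape = decode_shape(csr)
print(shape)
```

An all-zero SHAPE value decodes to dimensions of 1, 1 and 1 with no offset, matching the "remapping disabled" case described earlier.
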
In other words, the "permute" option changes the order in which
nested for-loops over the array would be done.  The algorithm below
shows this more clearly, and may be executed as a python program:

    # mapidx = REMAP.shape2
    xdim = 3 # SHAPE[mapidx].xdimsz+1
    ydim = 4 # SHAPE[mapidx].ydimsz+1
    zdim = 5 # SHAPE[mapidx].zdimsz+1

    lims = [xdim, ydim, zdim]
    idxs = [0,0,0] # starting indices
    order = [1,0,2] # experiment with different permutations, here
    offs = 0        # experiment with different offsets, here

    for idx in range(xdim * ydim * zdim):
        # linear element index for the current (x, y, z) counters
        new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
        print(new_idx, end=" ")
        # advance the counters in "order", carrying into the next on wrap
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            print()  # a counter wrapped: start a new line of output
            idxs[order[i]] = 0

It is assumed that this algorithm is run within all pseudo-code
throughout this document wherever a (parallelism) for-loop would normally
run from 0 to VL-1 to refer to contiguous register elements:
where REMAP indicates to do so, the element index
is instead run through the above algorithm to work out the **actual** element
index.  Given that there are three possible SHAPE entries, up to
three separate registers in any given operation may be simultaneously
remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
                                 ireg[rs2+remap(irs2)];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

By changing remappings, 2D matrices may be transposed "in-place" for one
operation, and a different permutation order then set for the next, without
having to move the values in the registers to or from memory.  Also,
the reason for having REMAP separate from the three SHAPE CSRs is so
that in a chain of matrix multiplications and additions, for example,
the SHAPE CSRs need only be set up once; only the REMAP CSR need be
changed to target different registers.

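As an illustration of the "in-place" transpose, the index algorithm can be wrapped in a generator and applied to a small 2D matrix. This is a sketch only: the generator name `remap_indices`, its `count` parameter and the sample data are hypothetical, and a plain Python list stands in for consecutive register elements.

```python
def remap_indices(lims, order, offs=0, count=None):
    """Yield remapped element indices, following the algorithm shown earlier."""
    xdim, ydim, zdim = lims
    idxs = [0, 0, 0]
    if count is None:
        count = xdim * ydim * zdim
    for _ in range(count):
        yield offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
        for d in order:            # advance counters, carrying on wrap
            idxs[d] += 1
            if idxs[d] != lims[d]:
                break
            idxs[d] = 0

# A matrix of 2 rows by 3 columns, stored row by row in six elements:
#   [[a, b, c],
#    [d, e, f]]
elements = ["a", "b", "c", "d", "e", "f"]
lims = (3, 2, 1)   # xdim=3 (columns), ydim=2 (rows), zdim=1

linear = [elements[i] for i in remap_indices(lims, (0, 1, 2))]
transposed = [elements[i] for i in remap_indices(lims, (1, 0, 2))]
print(linear)       # ['a', 'b', 'c', 'd', 'e', 'f']
print(transposed)   # ['a', 'd', 'b', 'e', 'c', 'f']
```

Only the order changes between the two reads; the stored elements are never moved, which is the point of switching the permutation rather than copying data.
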
Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets elements *within* registers rather than
  entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change!
* If two or more of xdimsz, ydimsz and zdimsz are zero (so that at most
  one dimension is larger than 1), the actual order does not change!
* The above algorithm is pseudo-code **only**.  Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See MSTATE CSR, which records the current element index.
* Twin-predicated operations require **two** separate and distinct
  element offsets.  The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped.  *This even includes C.LDSP* and other operations
  in that category, where in that case it will be the **offset** that is
  remapped (see Compressed Stack LOAD/STORE section).
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register.  Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or to
  perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdimsz+1) times (ydimsz+1) times (zdimsz+1) to
  less than MVL is **perfectly legal**, albeit very obscure.  It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.

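Two of the notes above, re-entrancy and a total element count smaller than the loop length, can be illustrated by splitting the algorithm into a pure "advance the counters" step. This is a sketch under assumptions: `next_state` and `element_index` are hypothetical names, and treating the three loop counters as the saved context is an illustrative choice rather than a description of what the MSTATE CSR actually stores.

```python
def next_state(idxs, lims, order):
    """Advance the three loop counters by one element; return the new state."""
    idxs = list(idxs)
    for d in order:
        idxs[d] += 1
        if idxs[d] != lims[d]:
            break
        idxs[d] = 0        # wrapped: carry into the next counter in "order"
    return tuple(idxs)

def element_index(idxs, lims, offs=0):
    """Linear element index for a given counter state."""
    xdim, ydim, _ = lims
    return offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim

lims, order = (2, 2, 1), (1, 0, 2)   # only 4 elements in total
VL = 6                               # deliberately larger than 4

# Uninterrupted run over VL elements.
state, full = (0, 0, 0), []
for _ in range(VL):
    full.append(element_index(state, lims))
    state = next_state(state, lims, order)

# The same run, "interrupted" after 3 elements and resumed from saved state.
state, first = (0, 0, 0), []
for _ in range(3):
    first.append(element_index(state, lims))
    state = next_state(state, lims, order)
saved = state                        # hypothetical saved context
resumed = []
for _ in range(VL - 3):
    resumed.append(element_index(saved, lims))
    saved = next_state(saved, lims, order)

print(full)             # [0, 2, 1, 3, 0, 2]
print(first + resumed)  # identical to the uninterrupted run
```

Because all three counters wrap back to zero after the fourth element, the fifth and sixth iterations revisit elements 0 and 2, which is the "presented more than once" behaviour described in the last note.
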
Considerable care clearly needs to be taken here, as the remapping
could create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.