simple_v_extension/remap.mdwn

   1 [[!tag standards]]
   2
   3 # NOTE
   4
   5 This section is under revision (and is optional)
   6
   7 # REMAP <a name="remap" />
   8
   9 REMAP allows the usual vector loop `0 to VL-1` to be "reshaped" (re-mapped) from a linear
  10 form to a 2D or 3D transposed form, or "offset" to permit arbitrary
  11 access to elements, independently on each Vector src or dest register.
  12
  13 Their primary use is for Matrix Multiplication, reordering of sequential data in-place.  Four CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.  Adfitionwl uses include regular "Structure Packing" such as RGB pixel data extraction and reforming.
  14
  15 # SHAPE 1D/2D/3D vector-matrix remapping CSRs
  16
  17 There are four "shape" CSRs, SHAPE0-3, 32-bits in each,
  18 which have the same format.
  19
  20 [[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]
  21
  22 The algorithm below shows how REMAP works more clearly, and may be
  23 executed as a python program:
  24
  25     xdim = 3
  26     ydim = 4
  27     zdim = 1
  28
  29     lims = [xdim, ydim, zdim]
  30     idxs = [0,0,0] # starting indices
  31     order = [0,1,2] # experiment with different permutations, here
  32     offset = 2     # experiment with different offset, here
  33     VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
  34     applydim = 0
  35     invxyz = [0,0,0]
  36
  37     # run for offset iterations before actually starting
  38     for idx in range(offset):
  39         for i in range(3):
  40             idxs[order[i]] = idxs[order[i]] + 1
  41             if (idxs[order[i]] != lims[order[i]]):
  42                 break
  43             idxs[order[i]] = 0
  44
  45     break_count = 0
  46
  47     for idx in range(VL):
  48         ix = [0] * 3
  49         for i in range(3):
  50             if i >= applydim:
  51                 ix[i] = idxs[i]
  52             if invxyz[i]:
  53                 ix[i] = lims[i] - 1 - ix[i]
  54         new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
  55         print new_idx,
  56         break_count += 1
  57         if break_count == lims[order[0]]:
  58             print
  59             break_count = 0
  60         for i in range(3):
  61             idxs[order[i]] = idxs[order[i]] + 1
  62             if (idxs[order[i]] != lims[order[i]]):
  63                 break
  64             idxs[order[i]] = 0
  65
  66 Here, it is assumed that this algorithm be run within all pseudo-code
  67 throughout this document where a (parallelism) for-loop would normally
  68 run from 0 to VL-1 to refer to contiguous register
  69 elements; instead, where REMAP indicates to do so, the element index
  70 is run through the above algorithm to work out the **actual** element
  71 index, instead.  Given that there are four possible SHAPE entries, up to
  72 four separate registers in any given operation may be simultaneously
  73 remapped:
  74
  75     function op_add(rd, rs1, rs2) # add not VADD!
  76       ...
  77       ...
  78       for (i = 0; i < VL; i++)
  79         xSTATE.srcoffs = i # save context
  80         if (predval & 1<<i) # predication uses intregs
  81            ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
  82                                  ireg[rs2+remap(irs2)];
  83            if (!int_vec[rd ].isvector) break;
  84         if (int_vec[rd ].isvector)  { id += 1; }
  85         if (int_vec[rs1].isvector)  { irs1 += 1; }
  86         if (int_vec[rs2].isvector)  { irs2 += 1; }
  87
  88 By changing remappings, 2D matrices may be transposed "in-place" for one
  89 operation, followed by setting a different permutation order without
  90 having to move the values in the registers to or from memory.
  91
  92 Note that:
  93
  94 * Over-running the register file clearly has to be detected and
  95   an illegal instruction exception thrown
  96 * When non-default elwidths are set, the exact same algorithm still
  97   applies (i.e. it offsets elements *within* registers rather than
  98   entire registers).
  99 * If permute option 000 is utilised, the actual order of the
 100   reindexing does not change.  However, modulo MVL still occurs
 101   which will result in repeated operations (use with caution).
 102 * If two or more dimensions are set to zero, the actual order does not change!
 103 * The above algorithm is pseudo-code **only**.  Actual implementations
 104   will need to take into account the fact that the element for-looping
 105   must be **re-entrant**, due to the possibility of exceptions occurring.
 106   See MSTATE CSR, which records the current element index.
 107 * Twin-predicated operations require **two** separate and distinct
 108   element offsets.  The above pseudo-code algorithm will be applied
 109   separately and independently to each, should each of the two
 110   operands be remapped.  *This even includes C.LDSP* and other operations
 111   in that category, where in that case it will be the **offset** that is
 112   remapped (see Compressed Stack LOAD/STORE section).
 113 * Offset is especially useful, on its own, for accessing elements
 114   within the middle of a register.  Without offsets, it is necessary
 115   to either use a predicated MV, skipping the first elements, or
 116   performing a LOAD/STORE cycle to memory.
 117   With offsets, the data does not have to be moved.
 118 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 119   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 120   entries to be regularly presented to operands **more than once**, thus
 121   allowing the same underlying registers to act as an accumulator of
 122   multiple vector or matrix operations, for example.
 123
 124 Clearly here some considerable care needs to be taken as the remapping
 125 could hypothetically create arithmetic operations that target the
 126 exact same underlying registers, resulting in data corruption due to
 127 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 128 register-renaming will have an easier time dealing with this than
 129 DSP-style SIMD micro-architectures.
 130
 131 # 4x4 Matrix to vec4 Multiply Example
 132
 133 The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
 134
 135 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 136 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
 137 * VL=16, f4=vec, f0=vec, f8=vec
 138 * FMAC f4, f0, f8, f4
 139
 140 The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
 141
 142 The permutation on SHAPE1 will increment f4 continuously cycling through f4-f7 every four iterations of the hardware loop.
 143
 144 At the same time, VL will, because there is no SHAPE on f8, increment straight sequentially through the 16 values f8-f23 in the Matrix. The equivalent sequence thus is issued:
 145
 146     fmac f4, f0, f8, f4
 147     fmac f5, f0, f9, f5
 148     fmac f6, f0, f10, f6
 149     fmac f7, f0, f11, f7
 150     fmac f4, f1, f12, f4
 151     fmac f5, f1, f13, f5
 152     fmac f6, f1, f14, f6
 153     fmac f7, f1, f15, f7
 154     fmac f4, f2, f16, f4
 155     fmac f5, f2, f17, f5
 156     fmac f6, f2, f18, f6
 157     fmac f7, f2, f19, f7
 158     fmac f4, f3, f20, f4
 159     fmac f5, f3, f21, f5
 160     fmac f6, f3, f22, f6
 161     fmac f7, f3, f23, f7
 162
 163 The only other instruction required is to ensure that f4-f7 are initialised (usually to zero).
 164
 165 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 CSRs, and applying a rotating 1D SHAPE CSR of xdim=16 to f8 in order to get it to apply four times to compute the four columns worth of vectors.