openpower/sv/remap/appendix.mdwn

   1 ## REMAP Matrix pseudocode
   2
   3 The algorithm below shows how REMAP works more clearly, and may be
   4 executed as a python program:
   5
   6 ```
   7 [[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
   8 ```
   9
  10 An easier-to-read version (using python iterators) is given in
  11 a later section of this Appendix.
  12
  13 Each element index from the for-loop `0..VL-1`
  14 is run through the above algorithm to work out the **actual** element
  15 index, instead.  Given that there are four possible SHAPE entries, up to
  16 four separate registers in any given operation may be simultaneously
  17 remapped:
  18
  19 ```
  20     function op_add(RT, RA, RB) # add not VADD!
  21       for (i=0,id=0,irs1=0,irs2=0; i < VL; i++)
  22         SVSTATE.srcstep = i # save context
  23         if (predval & 1<<i) # predication mask
  24            GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
  25                                  GPR[RB+remap3(irs2)];
  26            if (!RT.isvector) break;
  27         if (RT.isvector)  { id += 1; }
  28         if (RA.isvector)  { irs1 += 1; }
  29         if (RB.isvector)  { irs2 += 1; }
  30 ```
  31
  32 By changing remappings, 2D matrices may be transposed "in-place" for one
  33 operation, followed by setting a different permutation order without
  34 having to move the values in the registers to or from memory.
  35
  36 Note that:
  37
  38 * Over-running the register file clearly has to be detected and
  39   an illegal instruction exception thrown
  40 * When non-default elwidths are set, the exact same algorithm still
  41   applies (i.e. it offsets *polymorphic* elements *within* registers rather
  42   than entire registers).
  43 * If permute option 000 is utilised, the actual order of the
  44   reindexing does not change.  However, modulo MVL still occurs
  45   which will result in repeated operations (use with caution).
  46 * If two or more dimensions are set to zero, the actual order does not change!
  47 * The above algorithm is pseudo-code **only**.  Actual implementations
  48   will need to take into account the fact that the element for-looping
  49   must be **re-entrant**, due to the possibility of exceptions occurring.
  50   See SVSTATE SPR, which records the current element index.
  51   Continuing after return from an interrupt may introduce latency
  52   due to re-computation of the remapped offsets.
  53 * Twin-predicated operations require **two** separate and distinct
  54   element offsets.  The above pseudo-code algorithm will be applied
  55   separately and independently to each, should each of the two
  56   operands be remapped.  *This even includes unit-strided LD/ST*
  57   and other operations
  58   in that category, where in that case it will be the **offset** that is
  59   remapped.
  60 * Offset is especially useful, on its own, for accessing elements
  61   within the middle of a register.  Without offsets, it is necessary
  62   to either use a predicated MV, skipping the first elements, or
  63   performing a LOAD/STORE cycle to memory.
  64   With offsets, the data does not have to be moved.
  65 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  66   less than MVL is **perfectly legal**, albeit very obscure.  It permits
  67   entries to be regularly presented to operands **more than once**, thus
  68   allowing the same underlying registers to act as an accumulator of
  69   multiple vector or matrix operations, for example.
  70 * Note especially that Program Order **must** still be respected
  71   even when overlaps occur that read or write the same register
  72   elements *including polymorphic ones*
  73
  74 Clearly here some considerable care needs to be taken as the remapping
  75 could hypothetically create arithmetic operations that target the
  76 exact same underlying registers, resulting in data corruption due to
  77 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
  78 register-renaming will have an easier time dealing with this than
  79 DSP-style SIMD micro-architectures.
  80
  81
  82 ### 4x4 Matrix to vec4 Multiply (4x4 by 1x4)
  83
  84 The following settings will allow a 4x4 matrix (starting at f8), expressed
  85 as a sequence of 16 numbers first by row then by column, to be multiplied
  86 by a vector of length 4 (starting at f0), using a single FMAC instruction.
  87
  88 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
  89 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
  90 * VL=16, f4=vec, f0=vec, f8=vec
  91 * FMAC f4, f0, f8, f4
  92
  93 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
  94 four iterations through the hardware loop, the REMAPed index will not
  95 increment. On the second four, the index will increase by one. Likewise
  96 on each subsequent group of four.
  97
  98 The permutation on SHAPE1 will increment f4 continuously cycling through
  99 f4-f7 every four iterations of the hardware loop.
 100
 101 At the same time, VL will, because there is no SHAPE on f8, increment
 102 straight sequentially through the 16 values f8-f23 in the Matrix. The
 103 equivalent sequence thus is issued:
 104
 105 ```
 106     fmac f4, f0, f8, f4
 107     fmac f5, f0, f9, f5
 108     fmac f6, f0, f10, f6
 109     fmac f7, f0, f11, f7
 110     fmac f4, f1, f12, f4
 111     fmac f5, f1, f13, f5
 112     fmac f6, f1, f14, f6
 113     fmac f7, f1, f15, f7
 114     fmac f4, f2, f16, f4
 115     fmac f5, f2, f17, f5
 116     fmac f6, f2, f18, f6
 117     fmac f7, f2, f19, f7
 118     fmac f4, f3, f20, f4
 119     fmac f5, f3, f21, f5
 120     fmac f6, f3, f22, f6
 121     fmac f7, f3, f23, f7
 122 ```
 123
 124 Hardware should easily pipeline the above FMACs and as long as each FMAC
 125 completes in 4 cycles or less there should be 100% sustained throughput,
 126 from the one single Vector FMAC.
 127
 128 The only other instruction required is to ensure that f4-f7 are
 129 initialised (usually to zero) however obviously if used as part
 130 of some other computation, which is frequently the case, then
 131 clearly the zeroing is not needed.
 132
 133 ## REMAP FFT, DFT, NTT
 134
 135 The algorithm from a later section of this Appendix shows how FFT REMAP works,
 136 and it may be executed as a standalone python3 program.
 137 The executable code is designed to illustrate how a hardware
 138 implementation may generate Indices which are completely
 139 independent of the Execution of element-level operations,
 140 even for something as complex as a Triple-loop Tukey-Cooley
 141 Schedule. A comprehensive demo and test suite may be found
 142 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD)
 143 including Complex Number FFT which deploys Vertical-First Mode
 144 on top of the REMAP Schedules.
 145
 146 Other uses include more than DFT and NTT: as abstracted RISC-paradigm
 147 the Schedules are not
 148 restricted in any way or tied to any particular instructtion.
 149 If the programmer can find any algorithm
 150 which has identical triple nesting then the FFT Schedule may be
 151 used even there.
 152
 153 [[!tag standards]]
 154
 155 ---------
 156
 157 \newpage{}
 158