[[!tag standards]]

# REMAP Matrix pseudocode

The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:

```
[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
```

An easier-to-read version (using python iterators) shows the loop nesting:

```
[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
```

Each element index from the for-loop `0..VL-1`
is run through the above algorithm to work out the **actual** element
index, instead. Given that there are four possible SHAPE entries, up to
four separate registers in any given operation may be simultaneously
remapped:

```
function op_add(RT, RA, RB) # add not VADD!
  ...
  ...
  for (i=0, id=0, irs1=0, irs2=0; i < VL; i++)
    SVSTATE.srcstep = i # save context
    if (predval & 1<<i) # predication mask
      GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
                            GPR[RB+remap3(irs2)];
      if (!int_vec[RT].isvector) break;
    if (int_vec[RT].isvector) { id += 1; }
    if (int_vec[RA].isvector) { irs1 += 1; }
    if (int_vec[RB].isvector) { irs2 += 1; }
```

By changing remappings, 2D matrices may be transposed "in-place" for one
operation, followed by setting a different permutation order without
having to move the values in the registers to or from memory.

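As an illustration of the transpose access pattern (a sketch only, not the normative remap.py above; the function name is invented), a "yx" permute walks a row-major matrix column-first:

```python
def remap_2d(xdim, ydim, permute="yx"):
    """Yield row-major element indices in the order selected by the
    axis permutation: "xy" is the identity order, "yx" walks columns
    first, giving an in-place transpose access pattern."""
    if permute == "xy":
        for y in range(ydim):
            for x in range(xdim):
                yield y * xdim + x
    else:  # "yx"
        for x in range(xdim):
            for y in range(ydim):
                yield y * xdim + x

# a 2x3 row-major matrix [[0,1,2],[3,4,5]] visited column-first:
print(list(remap_2d(3, 2, "yx")))  # [0, 3, 1, 4, 2, 5]
```
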
Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets *polymorphic* elements *within* registers rather
  than entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change. However, modulo MVL still occurs,
  which will result in repeated operations (use with caution).
* If two or more dimensions are set to zero, the actual order does not change!
* The above algorithm is pseudo-code **only**. Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See SVSTATE SPR, which records the current element index.
  Continuing after return from an interrupt may introduce latency
  due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets. The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped. *This even includes unit-strided LD/ST*
  and other operations in that category, where in that case it will
  be the **offset** that is remapped.
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register. Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or
  to perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  less than MVL is **perfectly legal**, albeit very obscure. It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements, *including polymorphic ones*.

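The wrap-around behaviour noted in the list above can be sketched as follows (a hypothetical illustration of the index arithmetic only, not the exact SPR semantics):

```python
def wrapped_indices(vl, xdim, ydim, zdim=1):
    """When (xdim * ydim * zdim) is less than VL the remapped index
    wraps modulo the total, so the same elements are presented to
    the operation more than once."""
    total = xdim * ydim * zdim
    return [i % total for i in range(vl)]

# VL=8 over a 2x2 shape: each of the 4 elements is used twice,
# allowing the target registers to act as accumulators.
print(wrapped_indices(8, 2, 2))  # [0, 1, 2, 3, 0, 1, 2, 3]
```
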
Clearly, considerable care needs to be taken here, as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps. Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.

# REMAP FFT pseudocode

The algorithm below shows how FFT REMAP works, and may be
executed as a python program:

```
[[!inline pages="openpower/sv/remap_fft_yield.py" quick="yes" raw="yes" ]]
```

The executable code above is designed to show how a hardware
implementation may generate Indices which are completely
independent of the Execution of element-level operations,
even for something as complex as a Triple-loop Cooley-Tukey
Schedule. A comprehensive demo and test suite may be found
[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD).

Uses extend beyond DFT and NTT: the Schedules are not
restricted in any way, and if the programmer can find any algorithm
which has identical triple nesting then the FFT Schedule may be
used.
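
As an illustration of such a triple nesting (a sketch only: the remap_fft_yield.py listing above is the reference), a plain radix-2 Cooley-Tukey butterfly schedule generates every (lower, upper) index pair without performing any arithmetic on the data itself:

```python
def butterfly_schedule(n):
    """Yield (lower, upper) element-index pairs for an in-place
    radix-2 FFT of size n: the index generation is a pure triple
    loop, fully independent of the butterfly arithmetic."""
    size = 2
    while size <= n:                 # outer: butterfly span
        half = size // 2
        for start in range(0, n, size):           # middle: block
            for i in range(start, start + half):  # inner: element
                yield i, i + half
        size *= 2

print(list(butterfly_schedule(4)))
# [(0, 1), (2, 3), (0, 2), (1, 3)]
```
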

# 4x4 Matrix to vec4 Multiply Example

The following settings will allow a 4x4 matrix (starting at f8), expressed
as a sequence of 16 numbers first by row then by column, to be multiplied
by a vector of length 4 (starting at f0), using a single FMAC instruction.

* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
* VL=16, f4=vec, f0=vec, f8=vec
* FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first
four iterations through the hardware loop, the REMAPed index will not
increment. On the second four, the index will increase by one. Likewise
on each subsequent group of four.

The permutation on SHAPE1 will increment f4 continuously, cycling through
f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, its element index
simply increments sequentially through the 16 values f8-f23 in the Matrix.
The equivalent sequence thus issued is:

```
fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f0, f10, f6
fmac f7, f0, f11, f7
fmac f4, f1, f12, f4
fmac f5, f1, f13, f5
fmac f6, f1, f14, f6
fmac f7, f1, f15, f7
fmac f4, f2, f16, f4
fmac f5, f2, f17, f5
fmac f6, f2, f18, f6
fmac f7, f2, f19, f7
fmac f4, f3, f20, f4
fmac f5, f3, f21, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
```

The only other instruction required is to ensure that f4-f7 are
initialised (usually to zero).
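
The sequence can be cross-checked with a small python model (illustrative only: the list `f` stands in for the FP register file, with arbitrary test values):

```python
# Model of the 16 remapped FMACs: each iteration performs
#   f[dst] = f[src] * f[8 + k] + f[dst]
f = [float(i) for i in range(24)]  # f0-f23 with dummy contents
f[4:8] = [0.0] * 4                 # accumulators f4-f7 zeroed
vec = f[0:4]                       # the vec4 at f0-f3
mat = f[8:24]                      # 4x4 matrix at f8-f23, row-major

for k in range(16):
    dst = 4 + (k % 4)   # SHAPE1: cycles through f4-f7
    src = 0 + (k // 4)  # SHAPE0: f0-f3, advancing every 4 iterations
    f[dst] += f[src] * f[8 + k]

# compare against a straightforward vector-times-matrix product
expected = [sum(vec[r] * mat[r * 4 + c] for r in range(4))
            for c in range(4)]
assert f[4:8] == expected
print(f[4:8])  # [104.0, 110.0, 116.0, 122.0]
```
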

It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
the same technique applied to four independent vectors, can be done by
setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
it to apply four times to compute the four columns worth of vectors.

# Warshall transitive closure algorithm

TODO move to [[sv/remap/discussion]] page, copied from here
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html

with thanks to Hendrik.

<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>

> Just a note: interpreting + as 'or', and * as 'and',
> operating on Boolean matrices,
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.

this can be done with the ternary instruction, which has
an in-place triple boolean input:

```
RT = RT | (RA & RB)
```

and also has a CR Field variant of the same.

notes from conversations:

> > for y in y_r:
> >   for x in x_r:
> >     for z in z_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]

> This nesting of loops works for matrix multiply, but not for transitive
> closure.

> > it can be done:
> >
> > for z in z_r:
> >   for y in y_r:
> >     for x in x_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]
>
> And this ordering of loops *does* work for transitive closure, when a,
> b, and result are the very same matrix, updated while being used.
>
> By the way, I believe there is a graph algorithm that does the
> transitive closure thing, but instead of using boolean, "and", and "or",
> they use real numbers, addition, and minimum.  I think that one computes
> shortest paths between vertices.
>
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).

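A direct python rendering of that z-outermost nesting, with result, a and b all being the very same matrix (matching the in-place `RT = RT | (RA & RB)` ternary), might look like:

```python
def transitive_closure(m):
    """In-place Warshall transitive closure of a boolean adjacency
    matrix: + is 'or', * is 'and', and the z loop is outermost."""
    n = len(m)
    for z in range(n):
        for y in range(n):
            for x in range(n):
                # same form as the ternary: RT = RT | (RA & RB)
                m[y][x] |= m[y][z] & m[z][x]
    return m

# edges 0->1 and 1->2 imply the path 0->2
g = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(transitive_closure(g))  # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```
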
# SUBVL Remap

Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
itself must be considered to be the "element". To perform REMAP
on the elements of a vec2/3/4 there are three options: use
[[sv/mv.swizzle]]; since the sub-elements themselves are contiguous,
treat them as such and use Indexing; or add one
extra dimension to Matrix REMAP, the inner dimension being the size
of the Subvector (2, 3, or 4).

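The extra-inner-dimension option can be sketched as follows (an illustration only; the function name is invented): the sub-element index becomes the innermost loop, so each vec2/3/4 stays contiguous while the outer REMAP reorders whole subvectors:

```python
def subvl_remap(outer_indices, subvl):
    """Expand element indices into sub-element indices: the whole
    vec2/3/4 is the "element", so its subvl sub-elements remain
    contiguous while the outer REMAP reorders the subvectors."""
    for el in outer_indices:
        for sub in range(subvl):
            yield el * subvl + sub

# an outer REMAP order [2, 0, 1] applied to vec2 elements:
print(list(subvl_remap([2, 0, 1], 2)))  # [4, 5, 0, 1, 2, 3]
```
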
Note that Swizzle on Sub-vectors may be applied on top of REMAP.
One place where this is appropriate is the Rijndael MixColumns
stage:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />

Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`,
a 2D REMAP allows:

* the column bytes (as a vec4) to be iterated over as an inner loop,
  progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.

This is achieved entirely in-place, without special 128-bit opcodes.
Below is the pseudocode for [[!wikipedia Rijndael MixColumns]]:

```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but vec4 byte-level
    // elwidth overrides can be done though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // These may then each be 4x 8bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```

The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations. Given wide enough SIMD backends in hardware, these 32-bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.
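
The swizzle formulation in the comments above can be checked with a small python model (illustrative only: plain list rotations stand in for the WXYZ-style vec4 swizzles):

```python
def rot(v, k):
    """Left-rotate a 4-element list by k: element i of the result is
    v[(i + k) % 4], standing in for the WXYZ/ZWXY/YZWX swizzles."""
    return v[k:] + v[:k]

def xtime(c):
    """GF(256) multiply-by-2, as computed for b[] in gmix_column."""
    return ((c << 1) ^ (0x1B if c & 0x80 else 0)) & 0xFF

def mixcolumn_bytes(r):
    """Byte-level reference, transcribed from gmix_column above."""
    a, b = list(r), [xtime(c) for c in r]
    return [b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1],
            b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2],
            b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3],
            b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]]

def mixcolumn_swizzle(r):
    """Same result via whole-vec4 XORs of rotated (swizzled) copies:
    out = b ^ a.WXYZ ^ a.ZWXY ^ (b ^ a).YZWX."""
    a, b = list(r), [xtime(c) for c in r]
    out = b[:]
    for src, k in ((a, 3), (a, 2), (b, 1), (a, 1)):
        out = [o ^ s for o, s in zip(out, rot(src, k))]
    return out

col = [0xDB, 0x13, 0x53, 0x45]  # well-known MixColumns test column
assert mixcolumn_swizzle(col) == mixcolumn_bytes(col)
print([hex(x) for x in mixcolumn_swizzle(col)])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```

Each rotation corresponds one-to-one with a commented swizzle, so the whole column update is four 32-bit XORs rather than sixteen byte operations.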

The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.
