openpower/sv/remap/discussion.mdwn

   1 # TODO svshape offset concept
   2
   3     svshape2 offs,inv,yx,rmm,SVd,sk,mm
   4
   5 | 0.5|6..8|9  |10|11.15  |16..20 | 21..25 | 25 | 26..31|  name    |
   6 | -- |----|---|--| ---   | ----- | ------ | -- | ------| -------- |
   7 |OPCD|offs|inv|yx| rmm   | SVd   | 100/mm | sk | XO    | svshape  |
   8
   9 * **offs** (3 bits) - unsigned offset
  10 * **yx** (1 bit) - swap XY to YX
  11 * **inv** (1 bit) inverts X if yx=0, Y if yx=1
  12 * **SVd** dimension size
  13 * **rmm** REMAP mask
  14 * **mm** mask mode
  15 * **sk** (1 bit) skips 1st dimension if set
  16
  17 Dimensions are calculated exactly as `svindex`. `rmm` and
  18 `mm` are as per `svindex`.
  19
  20 # TODO
  21
  22 * investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
  23 in https://bugs.libre-soc.org/show_bug.cgi?id=653
  24 * UTF-8 <https://bugs.libre-soc.org/show_bug.cgi?id=794>
  25 * Triangular REMAP
  26 * Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
  27 * Convolution REMAP
  28
  29 # Warshall transitive closure algorithm
  30
  31 TODO move to [[sv/remap/discussion]] page, copied from here
  32 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
  33
  34 with thanks to Hendrik.
  35
  36 <https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
  37
  38 > Just a note:  interpreting + as 'or', and * as 'and',
  39 > operating on Boolean matrices,
  40 > and having result, X, and Y be the exact same matrix,
  41 > updated while being used,
  42 > gives the traditional Warshall transitive-closure
  43 > algorithm, if the loops are nested exactly in thie order.
  44
  45 this can be done with the ternary instruction which has
  46 an in-place triple boolean input:
  47
  48 ```
  49     RT = RT | (RA & RB)
  50 ```
  51
  52 and also has a CR Field variant of the same
  53
  54 notes from conversations:
  55
  56 > > for y in y_r:
  57 > >  for x in x_r:
  58 > >    for z in z_r:
  59 > >      result[y][x] +=
  60 > >         a[y][z] *
  61 > >         b[z][x]
  62
  63 > This nesting of loops works for matrix multiply, but not for transitive
  64 > closure.
  65
  66 > > it can be done:
  67 > >
  68 > >   for z in z_r:
  69 > >    for y in y_r:
  70 > >     for x in x_r:
  71 > >       result[y][x] +=
  72 > >          a[y][z] *
  73 > >          b[z][x]
  74 >
  75 > And this ordering of loops *does* work for transitive closure, when a,
  76 > b, and result are the very same matrix, updated while being used.
  77 >
  78 > By the way, I believe there is a graph algorithm that does the
  79 > transitive closure thing, but instead of using boolean, "and", and "or",
  80 > they use real numbers, addition, and minimum.  I think that one computes
  81 > shortest paths between vertices.
  82 >
  83 > By the time the z'th iteration of the z loop begins, the algorithm has
  84 > already peocessed paths that go through vertices numbered < z, and it
  85 > adds paths that go through vertices numbered z.
  86 >
  87 > For this to work, the outer loop has to be the one on the subscript that
  88 > bridges a and b (which in this case are teh same matrix, of course).
  89
  90 # SUBVL Remap
  91
  92 Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
  93 itself must be considered to be the "element".  To perform REMAP
  94 on the elements of a vec2/3/4, either use [[sv/mv.swizzle]], or,
  95 due to the sub-elements themselves being contiguous, treat them as
  96 such and use Indexing, or add one
  97 extra dimension to Matrix REMAP, the inner dimension being the size
  98 of the Subvector (2, 3, or 4).
  99
 100 Note that Swizzle on Sub-vectors may be applied on top of REMAP.
 101 Where this is appropriate is the Rijndael MixColumns
 102 stage:
 103
 104 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
 105
 106 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
 107 a 2D REMAP allows:
 108
 109 * the column bytes (as a vec4) to be iterated over as an inner loop,
 110   progressing vertically (`a00 a10 a20 a30`)
 111 * the columns themselves to be iterated as an outer loop
 112 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
 113
 114 This entirely in-place without special 128-bit opcodes.  Below is
 115 the pseudocode for [[!wikipedia Rijndael MixColumns]]
 116
 117 ```
 118 void gmix_column(unsigned char *r) {
 119     unsigned char a[4];
 120     unsigned char b[4];
 121     unsigned char c;
 122     unsigned char h;
 123     // no swizzle here but vec4 byte-level
 124     // elwidth overrides can be done though.
 125     for (c = 0; c < 4; c++) {
 126         a[c] = r[c];
 127         h = (unsigned char)((signed char)r[c] >> 7);
 128         b[c] = r[c] << 1;
 129         b[c] ^= 0x1B & h; /* Rijndael's Galois field */
 130     }
 131     // These may then each be 4x 8bit Swizzled
 132     // r0.vec4 = b.vec4
 133     // r0.vec4 ^= a.vec4.WXYZ
 134     // r0.vec4 ^= a.vec4.ZWXY
 135     // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
 136     r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
 137     r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
 138     r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
 139     r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
 140 }
 141 ```
 142
 143 The application of the swizzles allows the remapped vec4 a, b and r
 144 variables to perform four straight linear 32 bit XOR operations where a
 145 scalar processor would be required to perform 16 byte-level individual
 146 operations.  Given wide enough SIMD backends in hardware these 3 bit
 147 XORs may be done as single-cycle operations across the entire 128 bit
 148 Rijndael Matrix.
 149
 150 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
 151 Multiply using the MDS Matrix.
 152