# Fortran MAXLOC SVP64 demo

<https://bugs.libre-soc.org/show_bug.cgi?id=676>

MAXLOC is a notoriously difficult function for SIMD to cope with.
Typical approaches are to perform leaf-node (depth-first) parallel
operations, merging the results mapreduce-style to gauge a final
index.

SVP64 however has similar capabilities to the Z80 CPIR and LDIR
instructions, and therefore hardware may transparently implement
back-end parallel operations whilst the developer programs a simple
sequential algorithm.

A clear reference implementation of MAXLOC is as follows:

```
#include <limits.h>

int maxloc(int * const restrict a, int n) {
    int m = INT_MIN, nm = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > m) {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
```

**AVX-512**

An informative article by Vamsi Sripathi of Intel shows the extent of
the problem faced by SIMD ISAs (in the case below, AVX-512). Significant
loop-unrolling is performed which leaves blocks that need to be merged:
this is carried out with "blending" instructions.

Article:
<https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html#gs.12t5y0>

<img src="https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png" alt="AVX-512 MAXLOC example code from the Intel article" width="100%" />

**ARM NEON**

In answer to a Stack Overflow question on ARM NEON intrinsics, one
developer (Pavel P) wrote the subroutine below, explaining that it finds
the index of the minimum value within a group of eight unsigned bytes.
It is necessary to use a second outer loop to perform many of these
searches in parallel, followed by conditionally offsetting each of the
block-results.

<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>

```
#include <arm_neon.h>

/* note: __lzcnt is an MSVC intrinsic; on GCC/Clang use __builtin_clz */
#define VMIN8(x, index, value)                                   \
do {                                                             \
    uint8x8_t m = vpmin_u8(x, x); /* pairwise min, three times */\
    m = vpmin_u8(m, m);                                          \
    m = vpmin_u8(m, m);   /* all 8 lanes now hold the minimum */ \
    uint8x8_t r = vceq_u8(x, m); /* lanes equal to the minimum */\
                                                                 \
    uint8x8_t z = vand_u8(vmask, r); /* bit per matching lane */ \
                                                                 \
    z = vpadd_u8(z, z);   /* pairwise-add to merge the bits */   \
    z = vpadd_u8(z, z);                                          \
    z = vpadd_u8(z, z);                                          \
                                                                 \
    unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0);     \
    index = __lzcnt(u32);                                        \
    value = vget_lane_u8(m, 0);                                  \
} while (0)


uint8_t v[8] = { ... };

static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
uint8x8_t vmask = vld1_u8(mask);

uint8x8_t v8 = vld1_u8(v);
int ret;
int ret_pos;
VMIN8(v8, ret_pos, ret);
```

**Rust Assembler Intrinsics**

An approach by jvdd shows that the two-stage approach of "blending"
arrays of results in a type of parallelised "leaf node depth-first"
search seems to be a common technique.

<https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>

**Python implementation**

A variant on the C reference implementation allows updating of m (the
largest value found so far) to be skipped whilst values are smaller,
then updated continuously whilst a batch of numbers is found that is
(in sequence order) always greater than all previous numbers. The
algorithm therefore flips back and forth between "skip whilst smaller"
and "update whilst bigger", only updating the index during the "bigger"
sequences. This is key later when doing the SVP64 assembler.

```
def m2(a): # array a
    m, nm, i, n = 0, 0, 0, len(a)
    while i < n:
        while i < n and a[i] <= m: i += 1                   # skip whilst smaller/equal
        while i < n and a[i] > m: m, nm, i = a[i], i, i+1   # only whilst bigger
    return nm
```

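The skip/update behaviour of `m2` can be sanity-checked against a plain
linear scan. The sketch below is a quick randomised test (note that `m`
starting at 0 assumes positive input values, just as the C reference
starts from `INT_MIN`):

```python
import random

def m2(a):  # batch-oriented variant from above
    m, nm, i, n = 0, 0, 0, len(a)
    while i < n:
        while i < n and a[i] <= m: i += 1                   # skip whilst smaller/equal
        while i < n and a[i] > m: m, nm, i = a[i], i, i+1   # only whilst bigger
    return nm

def maxloc_ref(a):  # simple sequential reference scan
    best = 0
    for i in range(1, len(a)):
        if a[i] > a[best]:
            best = i
    return best

random.seed(1)
for _ in range(1000):
    a = [random.randint(1, 100) for _ in range(random.randint(1, 20))]
    assert m2(a) == maxloc_ref(a)
```

Both versions return the index of the *first* occurrence of the maximum,
because both use a strict greater-than test.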
# Implementation in SVP64 Assembler

The core algorithm (the inner, in-register part) is below: 11 instructions.
Loading the data and offsetting the "core" below is relatively
straightforward: an estimated further 6 instructions, needing one
more branch (for the outer loop).

```
# while (i<n)
setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
# while (i<n and a[i]<=m) : i += 1
sv.cmp/ff=gt/m=ge *0,0,*10,4       # truncates VL to min
sv.creqv *16,*16,*16               # set mask on already-tested
setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
mtcrf 128, 0                       # clear CR0 (in case VL=0?)
# while (i<n and a[i]>m):
sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # uses r4 as accumulator
sv.crnand/m=lt/zz *19,*16,0        # SO=~LT, if CR0.eq=0
# nm = i (count masked bits. could use crweirds here TODO)
sv.svstep/mr/m=so 1, 0, 6, 1       # svstep: get vector dststep
sv.creqv *16,*16,*16               # set mask on already-tested
bc 12,0, -0x3c                     # CR0 lt bit clear, branch back
```

`sv.cmp` can be used in the first while loop because m (r4, the current
largest-found value so far) does not change.
However `sv.minmax.` has to be used in the key while loop
because r4 is sequentially replaced each time, and mapreduce (`/mr`)
is used to request this rolling-chain (accumulator) mode.

Also note that `i` (the `"while i<n"` part) is represented as a
unary bitmask (in CR bits 16,20,24,28), which turns out to be easier to
use than a binary index, as it can be used directly as a Predicate Mask
(`/m=ge`).

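The unary representation can be modelled in plain Python (a toy sketch,
not SVP64 semantics: the list `tested` stands in for the CR field bits,
and filtering on a clear bit stands in for the `/m=ge` predicate):

```python
# Toy model: `i` held as a unary "already-tested" mask rather than a
# binary index.  A predicate that enables elements whose bit is CLEAR
# (as /m=ge does on the CR fields here) automatically resumes the
# search after the previously-tested prefix.
def not_yet_tested(tested, a):
    return [(i, v) for i, (v, t) in enumerate(zip(a, tested)) if not t]

tested = [True, True, False, False]   # first two elements already tested
a      = [3, 1, 4, 1]
print(not_yet_tested(tested, a))      # [(2, 4), (3, 1)]
```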
The algorithm works by excluding previous operations using `i`-in-unary,
combined with VL being truncated due to use of Data-Dependent Fail-First.
What therefore happens, for example on the `sv.cmp/ff=gt/m=ge` operation,
is that it is *VL* (the Vector Length) that gets truncated to contain
only those elements that are smaller than the current largest value
found (`m` aka `r4`). Calling `sv.creqv` then sets **only** the
CR field bits up to length `VL`, which on the next loop will exclude
them, because the Predicate Mask is `m=ge` (element enabled only if the
CR field bit is **zero**).

This is the way that Data-Dependent Fail-First works: it attempts
*up to* the current Vector Length, and on detecting the first failure
will truncate at that point. In effect this is speculative sequential
execution of `while (i<n and a[i]<=m) : i += 1`.

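A toy scalar model of this truncation behaviour (a sketch only, not the
actual SVP64 semantics) might look like this:

```python
# Toy model of Data-Dependent Fail-First on a compare: elements are
# tested in sequence, and VL is truncated at the first element that
# fails the condition (here a[i] <= m, the "skip whilst smaller" test).
def ffirst_truncate_vl(a, start, vl, m):
    for step in range(vl):
        i = start + step
        if i >= len(a) or not (a[i] <= m):
            return step        # first failure: truncate VL here
    return vl                  # no failure: VL unchanged

a = [3, 1, 4, 1, 5]
# with m=4 starting at i=2: a[2]=4 passes, a[3]=1 passes, a[4]=5 fails
print(ffirst_truncate_vl(a, 2, 4, 4))   # → 2
```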
Next comes the `sv.minmax.` which covers the `while (i<n and a[i]>m)`
part, again in a single instruction, but this time it is a little more
involved. Firstly, mapreduce mode is used, with `r4` as both source and
destination: `r4` acts as the sequential accumulator. Secondly, it is
again masked (`m=ge`), which again excludes testing of
previously-tested elements. The next few instructions extract the
information provided by the Vector Length (VL) being truncated -
potentially even to zero!
(Note that `mtcrf 128,0` takes care of the possibility of VL=0: if
that happens then CR0 would otherwise be left in its previous state, a
very much undesirable behaviour!)

`sv.crnand/m=lt/zz` is quite sophisticated - a lot is going on behind
the scenes. The effect is (through the NAND) to invert the Less-than
to give us a Greater-than-or-equal, including under the circumstance
where VL=0, but only when CR0.EQ is set. Note that the third operand
`BB` (CR0.EQ) is a *scalar*, but that zeroing is used here. Therefore
whenever the Vector of `LT` bits is zero, a zero is put into the
Vector `SO` result. In effect, the predication is being exploited
as a way to combine a third operand into what would otherwise be a
2-in 1-out Condition Register operation, making it effectively 3-in 1-out.
Obscure but effective!

```
CR4.SO = CR4.EQ AND CR0.EQ  (if VL >= 1)
CR5.SO = CR5.EQ AND CR0.EQ  (if VL >= 2)
CR6.SO = CR6.EQ AND CR0.EQ  (if VL >= 3)
CR7.SO = CR7.EQ AND CR0.EQ  (if VL  = 4)
```

**Converting the unary mask to binary**

Now that the `CR4/5/6/7.SO` bits have been set, it is necessary to
count them, i.e. to convert a unary mask into a binary number. There
are several ways to do this, one of which is
the `crweird` suite of instructions combined with `popcnt`. However
there is a very straightforward way to do it: use `sv.svstep`.

```
sv.crnand/m=lt/zz *19,*16,0   # Vector SO = Vector ~LT, if CR0.eq=0

 i ---->    0        1        2        3
         CR4.EQ   CR5.EQ   CR6.EQ   CR7.EQ
         & CR0    & CR0    & CR0    & CR0
           |        |        |        |
           v        v        v        v
         CR4.SO   CR5.SO   CR6.SO   CR7.SO

sv.svstep/mr/m=so 1, 0, 6, 1

           |        |        |        |
count <----+--------+--------+--------+
```

In reality what is happening is that svstep is requested to return
the current `dststep` (destination step index) into a scalar
destination (`RT=r1`), but in "mapreduce" mode. Mapreduce mode
will keep running as if the destination were a Vector, overwriting
previously-written results. Normally one of the *sources* would
be the same register (`RA=r1` for example), which would turn r1 into
an accumulator; however in this particular case we simply repeatedly
overwrite `r1` and end up with the last `dststep`.

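This trick can be modelled in a few lines of Python (a toy sketch of the
behaviour described above, not the real instruction semantics):

```python
# Toy model of sv.svstep/mr/m=so writing to a scalar destination:
# every masked-in step overwrites r1 with its dststep, so what is
# left afterwards is the index of the LAST set bit in the SO mask.
def last_set_index(so_bits):
    r1 = 0
    for dststep, bit in enumerate(so_bits):
        if bit:            # predicate m=so: only these steps execute
            r1 = dststep   # scalar destination overwritten each time
    return r1

print(last_set_index([1, 1, 1, 0]))   # → 2
```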
There *is* an alternative to this approach: `getvl` followed by
subtracting 1. `VL`, being the total Vector Length as set by
Data-Dependent Fail-First, is the *length* of the Vector, whereas
what we actually want is the index of the last element: hence
using `sv.svstep`. Given that `getvl` followed by `addi -1`
is two instructions rather than one, we use `sv.svstep` instead,
although they occupy the same number of words (SVP64 instructions
are 64-bit).

Lastly, the `creqv` sets up the mask (based on the current
Vector Length), and the loop-back (`bc`) jumps to reset the
Vector Length back to the maximum (4). Thus the mask (CR4/5/6/7 EQ
bit), which began as empty, will at first exclude nothing, but on
subsequent loops will exclude previously-tested elements (up to the
previous value of VL, before it was reset back to the maximum).

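The accumulating exclusion can be loosely sketched as follows (a
deliberate simplification: `mark_tested` stands in for `creqv` marking
elements up to VL, and the mask persisting across outer-loop
iterations):

```python
# Loose sketch: the "already-tested" mask persists across outer-loop
# iterations, so each pass excludes everything covered by earlier passes.
def mark_tested(mask, vl):
    for i in range(vl):      # creqv-like: mark elements 0..VL-1
        mask[i] = True

mask = [False] * 4
mark_tested(mask, 2)         # first pass truncated VL to 2
print(mask)                  # [True, True, False, False]
mark_tested(mask, 3)         # a later pass reached VL=3
print(mask)                  # [True, True, True, False]
```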
This is extremely important to note here, because the
Data-Dependent Fail-First testing starts at the **first non-masked**
element, not the first element. This fact is exploited here, and
the only thing to contend with is if VL is ever set to zero
(the first element fails). If that happens then CR0 will never get set
(because no element succeeded and the Vector Length is truncated to
zero), hence the need to clear CR0 prior to starting that instruction.

Overall this algorithm is extremely short but quite complex in its
intricate use of SVP64 capabilities, yet it still remains parallelisable
in hardware. It could potentially be shorter: there may be opportunities
to use predicate masking with the CR field manipulation. However
the number of instructions is already at the point where, even when the
LOAD outer loop is added, it will still remain short enough to fit in
a single line of L1 Cache.

[[!tag svp64_cookbook ]]