# Fortran MAXLOC SVP64 demo

<https://bugs.libre-soc.org/show_bug.cgi?id=676>

MAXLOC is a notoriously difficult function for SIMD to cope with.
Typical approaches are to perform leaf-node (depth-first) parallel
operations, merging the results mapreduce-style to gauge a final
index.

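The leaf-node strategy can be sketched in scalar Python (a hypothetical
illustration only; the chunk size of 4 and the function name are
arbitrary choices, not from any of the implementations referenced below):
each chunk yields a local (value, index) pair, and the pairs are then
merged mapreduce-style into a final index.

```python
# Hypothetical sketch of the SIMD-style strategy: per-chunk (leaf-node)
# max/index pairs, merged mapreduce-style into a single final index.
def maxloc_mapreduce(a, chunk=4):
    # leaf nodes: (value, absolute index) for each chunk
    leaves = []
    for base in range(0, len(a), chunk):
        block = a[base:base+chunk]
        m = max(block)
        leaves.append((m, base + block.index(m)))
    # merge ("blend") stage: keep the earliest index on ties
    best_v, best_i = leaves[0]
    for v, i in leaves[1:]:
        if v > best_v:
            best_v, best_i = v, i
    return best_i
```

Note that strict `>` in the merge preserves the first-occurrence
semantics of the sequential reference implementation.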
SVP64 however has similar capabilities to the Z80 CPIR and LDIR
instructions, and therefore hardware may transparently implement
back-end parallel operations whilst the developer writes a simple
sequential algorithm.

A clear reference implementation of MAXLOC is as follows:

```
#include <limits.h>

int maxloc(int * const restrict a, int n) {
    int m = INT_MIN, nm = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > m) {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
```

**AVX-512**

An informative article by Vamsi Sripathi of Intel shows the extent of
the problem faced by SIMD ISAs (in the case below, AVX-512). Significant
loop-unrolling is performed, which leaves blocks that need to be merged:
this is carried out with "blending" instructions.

Article:
<https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html#gs.12t5y0>

<img src="https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png" alt="AVX-512 MAXLOC code excerpt" width="100%" />

**ARM NEON**

In a Stack Overflow answer on ARM NEON intrinsics, one developer
(Pavel P) wrote the subroutine below, explaining that it finds the index
of a minimum value within a group of eight unsigned bytes. It is
necessary to use a second outer loop to perform many of these searches
in parallel, followed by conditionally offsetting each of the
block-results.

<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>

```
#define VMIN8(x, index, value) \
do { \
    uint8x8_t m = vpmin_u8(x, x); \
    m = vpmin_u8(m, m); \
    m = vpmin_u8(m, m); \
    uint8x8_t r = vceq_u8(x, m); \
    \
    uint8x8_t z = vand_u8(vmask, r); \
    \
    z = vpadd_u8(z, z); \
    z = vpadd_u8(z, z); \
    z = vpadd_u8(z, z); \
    \
    unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0); \
    index = __lzcnt(u32); \
    value = vget_lane_u8(m, 0); \
} while (0)


uint8_t v[8] = { ... };

static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
uint8x8_t vmask = vld1_u8(mask);

uint8x8_t v8 = vld1_u8(v);
int ret;
int ret_pos;
VMIN8(v8, ret_pos, ret);
```

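The outer-loop blocking described above can be modelled in plain Python
(a hypothetical scalar sketch of the strategy, not a transcription of
the NEON code; the function name is invented): find the minimum and its
position within each 8-element block, then offset each block's result
by its base index to obtain the global position.

```python
# Hypothetical scalar model of the NEON approach: per-block minimum and
# in-block position, with each block result offset by its base index.
def minloc_blocked(a, block=8):
    best_v, best_pos = None, None
    for base in range(0, len(a), block):
        chunk = a[base:base+block]
        v = min(chunk)               # VMIN8's "value"
        pos = base + chunk.index(v)  # in-block index plus block offset
        if best_v is None or v < best_v:
            best_v, best_pos = v, pos
    return best_pos
```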
**Rust Assembler Intrinsics**

An implementation in Rust by jvdd shows that the two-stage approach of
"blending" arrays of results, in a form of parallelised "leaf-node
depth-first" search, is a common technique.

<https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>

**Python implementation**

A variant on the C reference implementation allows for skipping of
updates to m (the largest value found so far) whilst smaller numbers
are encountered, then always updating m whilst a batch of numbers is
found that are (in their order of sequence) continuously greater than
all previous numbers. The algorithm therefore flips back and forth
between "skip whilst smaller or equal" and "update whilst bigger",
only updating the index during the "bigger" sequences. This is key
later when doing the SVP64 assembler.

```
def m2(a): # array a
    m, nm, i, n = 0, 0, 0, len(a)
    while i < n:
        while i < n and a[i] <= m: i += 1                  # skip whilst smaller/equal
        while i < n and a[i] > m: m, nm, i = a[i], i, i+1  # only whilst bigger
    return nm
```

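The flip-flop behaviour can be checked against a naive scan (a small
self-contained sketch: `m2` is reproduced so that the snippet runs on
its own, and `maxloc_naive` is an invented name). Note that, like the
listing above, both start with `m = 0` and therefore assume the array
contains at least one positive element.

```python
# Self-contained check: the skip/update variant (m2, reproduced from
# above) agrees with a naive maxloc scan.
def m2(a):
    m, nm, i, n = 0, 0, 0, len(a)
    while i < n:
        while i < n and a[i] <= m: i += 1                  # skip whilst smaller/equal
        while i < n and a[i] > m: m, nm, i = a[i], i, i+1  # only whilst bigger
    return nm

def maxloc_naive(a):
    m, nm = 0, 0
    for i, v in enumerate(a):
        if v > m:
            m, nm = v, i
    return nm

for a in ([3, 1, 4, 1, 5, 9, 2, 6], [1, 2, 3], [3, 2, 1], [5, 5, 5]):
    assert m2(a) == maxloc_naive(a)
```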
# Implementation in SVP64 Assembler

The core algorithm (the inner, in-register part) is below: 11
instructions. Loading of data and offsetting the "core" below is
relatively straightforward: an estimated further 6 instructions,
needing one more branch (the outer loop).

```
# while (i<n)
setvl 2,0,4,0,1,1                # set MVL=4, VL=MIN(MVL,CTR)
# while (i<n and a[i]<=m) : i += 1
sv.cmp/ff=gt/m=ge *0,0,*10,4     # truncates VL to min
sv.creqv *16,*16,*16             # set mask on already-tested
setvl 2,0,4,0,1,1                # set MVL=4, VL=MIN(MVL,CTR)
mtcrf 128, 0                     # clear CR0 (in case VL=0?)
# while (i<n and a[i]>m):
sv.minmax./ff=le/m=ge 4,*10,4,1  # uses r4 as accumulator
crternlogi 0,1,2,127             # test greater/equal or VL=0
sv.crand *19,*16,0               # clear if CR0.eq=0
# nm = i (count masked bits. could use crweirds here TODO)
sv.svstep/mr/m=so 1, 0, 6, 1     # svstep: get vector dststep
sv.creqv *16,*16,*16             # set mask on already-tested
bc 12,0, -0x40                   # CR0 lt bit clear, branch back
```

`sv.cmp` can be used in the first while loop because m (r4, the
largest value found so far) does not change. However `sv.minmax.` has
to be used in the key while loop because r4 is sequentially replaced
each time, and mapreduce (`/mr`) mode is used to request this
rolling-chain (accumulator) behaviour.

Also note that `i` (the `"while i<n"` part) is represented as a
unary bitmask (in CR bits 16,20,24,28), which turns out to be easier
to use than a binary index, as it can be used directly as a Predicate
Mask (`/m=ge`).

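A small Python sketch of why i-in-unary is convenient (an illustration
only, with an invented helper name, not the hardware behaviour): the
"already tested" set is a bitmask that can be applied directly as a
predicate, whereas a binary index would first have to be expanded.

```python
# Illustration: progress kept as a unary "already-tested" mask can be
# used directly as a predicate that skips tested elements.
def predicated_elements(a, tested_mask):
    """Return (index, value) pairs for elements whose mask bit is clear."""
    return [(i, v) for i, v in enumerate(a)
            if not (tested_mask >> i) & 1]

# after testing elements 0 and 1, tested_mask = 0b011,
# so the remaining candidates are elements 2 and 3
```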
The algorithm works by excluding previous operations using
`i`-in-unary, combined with VL being truncated due to the use of
Data-Dependent Fail-First. What happens for example on the
`sv.cmp/ff=gt/m=ge` operation is that it is *VL* (the Vector Length)
that gets truncated to contain only those elements that are smaller
than the current largest value found (`m` aka `r4`). Calling
`sv.creqv` then sets **only** the CR field bits up to length `VL`,
which on the next loop will exclude them, because the Predicate Mask
is `m=ge` (an element is enabled only if its CR field bit is
**zero**).

Data-Dependent Fail-First therefore attempts *up to* the current
Vector Length, and on detecting the first failure truncates at that
point. In effect this is speculative sequential execution of
`while (i<n and a[i]<=m) : i += 1`.

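This truncation can be modelled in a few lines of Python (a behavioural
sketch with an invented function name, not the SVP64 specification):

```python
# Behavioural sketch: speculatively test up to VL elements, truncating
# VL at the first element that fails the condition (here a[i] <= m).
def ff_truncate_le(a, start, VL, m):
    new_VL = 0
    for j in range(VL):
        if start + j >= len(a) or a[start + j] > m:
            break             # first failure: truncate here
        new_VL += 1
    return new_VL             # elements 0..new_VL-1 all passed

# e.g. a = [2, 3, 9, 1], m = 5, VL = 4:
# 2 and 3 pass (<= 5), then 9 fails, so VL is truncated to 2
```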
Next comes `sv.minmax.`, which covers `while (i<n and a[i]>m)`,
again in a single instruction, but this time it is a little more
involved. Firstly: mapreduce mode is used, with `r4` as both source
and destination; `r4` acts as the sequential accumulator. Secondly,
it is again masked (`m=ge`), which again excludes testing of
previously-tested elements. The next few instructions extract the
information provided by the Vector Length (VL) being truncated -
potentially even to zero! (Note that `mtcrf 128,0` takes care of the
possibility of VL=0, which, if it happened, would leave CR0 in its
previous state: very much undesirable behaviour!)

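The rolling accumulator behaviour of `sv.minmax./mr` combined with
fail-first can likewise be sketched (a behavioural model with an
invented name, not the instruction's formal definition):

```python
# Behavioural sketch: r4 acts as a rolling-max accumulator; VL truncates
# at the first element that is not strictly greater than the accumulator.
def ff_max_accumulate(a, start, VL, acc):
    new_VL = 0
    for j in range(VL):
        if start + j >= len(a) or a[start + j] <= acc:
            break                  # fail-first: truncate VL here
        acc = a[start + j]         # mapreduce mode: accumulator updated
        new_VL += 1
    return acc, new_VL

# e.g. a = [6, 8, 7], acc = 5, VL = 3:
# 6 > 5 (acc=6), 8 > 6 (acc=8), 7 <= 8 fails: returns (8, 2)
```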
`crternlogi 0,1,2,127` will combine the setting of CR0.EQ and CR0.LT
to give us a true greater-than-or-equal, including under the
circumstance where VL=0. `sv.crand` will then take a copy of the
`i`-in-unary mask, but only when CR0.EQ is set. This is why the third
operand `BB` is a Scalar not a Vector (BT=16/Vector, BA=19/Vector,
BB=0/Scalar), which effectively performs a broadcast-splat-ANDing, as
follows:

```
CR4.SO = CR4.EQ AND CR0.EQ (if VL >= 1)
CR5.SO = CR5.EQ AND CR0.EQ (if VL >= 2)
CR6.SO = CR6.EQ AND CR0.EQ (if VL >= 3)
CR7.SO = CR7.EQ AND CR0.EQ (if VL = 4)
```

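The broadcast-splat-ANDing can be modelled directly (a sketch with an
invented name; the CR field bits are represented as a simple list):

```python
# Sketch: AND a scalar condition bit into the first VL vector CR bits,
# mirroring sv.crand with a Scalar BB operand (broadcast-splat-AND).
def crand_splat(vec_eq_bits, cr0_eq, VL):
    out = list(vec_eq_bits)
    for k in range(VL):
        out[k] = vec_eq_bits[k] & cr0_eq
    return out

# e.g. vec = [1, 0, 1, 1] with CR0.EQ = 1, VL = 4 -> [1, 0, 1, 1]
#      the same vec with CR0.EQ = 0              -> [0, 0, 0, 0]
```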
[[!tag svp64_cookbook ]]