# Fortran MAXLOC SVP64 demo

MAXLOC is a notoriously difficult function for SIMD ISAs to cope with.
SVP64, however, has capabilities similar to the Z80 CPIR and LDIR
block instructions.

<https://bugs.libre-soc.org/show_bug.cgi?id=676>

```
#include <limits.h>

/* Scalar reference: returns the index of the first occurrence
   of the maximum element, or -1 for an empty array. */
int m2(int * const restrict a, int n)
{
    int m, nm;
    int i;

    m = INT_MIN;
    nm = -1;
    for (i = 0; i < n; i++)
    {
        if (a[i] > m)
        {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
```
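
A quick scalar sanity check (a minimal sketch, not part of the original
demo, to be appended to the function above) shows the semantics that any
vectorised version has to preserve: because the comparison is strictly
greater-than, the index of the *first* occurrence of the maximum is
returned, matching Fortran MAXLOC's default behaviour (adjusted for C's
zero-based indexing).

```
#include <stdio.h>

int main(void)
{
    int a[] = {3, 7, 2, 7, 1};
    /* the maximum (7) first occurs at index 1 */
    printf("maxloc = %d\n", m2(a, 5));   /* prints "maxloc = 1" */
    return 0;
}
```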

**AVX-512**

An informative article by Vamsi Sripathi of Intel shows the extent of
the problem faced by SIMD ISAs (in the case below, AVX-512). Significant
loop-unrolling is performed, which leaves blocks that need to be merged:
this is carried out with "blending" instructions.

Article:
<https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html#gs.12t5y0>

<img src="https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png" alt="AVX-512 MAXLOC code from the Intel article" width="100%" />

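The blend-based idea can be sketched with AVX-512 intrinsics (a hedged
illustration of the technique the article describes, not the article's
actual code; the function name and the scalar tail-merge are illustrative
assumptions). A vector of running maxima and a vector of their indices are
maintained, and a compare mask selects, lane by lane, which candidate to
keep:

```
#include <immintrin.h>
#include <limits.h>

/* Sketch only: assumes AVX-512F and that n is a multiple of 16. */
int maxloc_avx512(const int *a, int n)
{
    __m512i vmax = _mm512_set1_epi32(INT_MIN);   /* running maxima */
    __m512i vidx = _mm512_set1_epi32(-1);        /* their indices  */
    __m512i icur = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
    const __m512i istep = _mm512_set1_epi32(16);

    for (int i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512((const void *)&a[i]);
        /* lanes where the new element is strictly greater than the max */
        __mmask16 gt = _mm512_cmpgt_epi32_mask(v, vmax);
        /* "blending": keep the old lane where gt=0, take the new where gt=1 */
        vmax = _mm512_mask_blend_epi32(gt, vmax, v);
        vidx = _mm512_mask_blend_epi32(gt, vidx, icur);
        icur = _mm512_add_epi32(icur, istep);
    }

    /* merge the 16 per-lane results; done scalar here for clarity,
       with ties broken toward the smaller index so the first maximum wins */
    int maxv[16], maxi[16];
    _mm512_storeu_si512((void *)maxv, vmax);
    _mm512_storeu_si512((void *)maxi, vidx);
    int m = INT_MIN, nm = -1;
    for (int l = 0; l < 16; l++)
        if (maxv[l] > m || (maxv[l] == m && maxi[l] < nm)) {
            m = maxv[l];
            nm = maxi[l];
        }
    return nm;
}
```

Even in this simplified form, the final cross-lane merge and the index
tie-breaking illustrate the bookkeeping that the article's unrolled,
blend-heavy version has to perform.
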
**ARM NEON**

In a Stack Overflow answer on ARM NEON intrinsics, one developer (Pavel P)
wrote the macro below, explaining that it finds the index of the minimum
value within a group of eight unsigned bytes. It is necessary to use a
second outer loop to perform many of these searches in parallel, followed
by conditionally offsetting each of the block-results (a sketch of such an
outer loop is given after the macro).

<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>

```
#include <arm_neon.h>

#define VMIN8(x, index, value)                                   \
do {                                                             \
    /* three pairwise mins broadcast the minimum to all lanes */ \
    uint8x8_t m = vpmin_u8(x, x);                                \
    m = vpmin_u8(m, m);                                          \
    m = vpmin_u8(m, m);                                          \
    /* lanes holding the minimum compare equal (0xFF) */         \
    uint8x8_t r = vceq_u8(x, m);                                 \
                                                                 \
    /* keep one distinct mask bit per matching lane */           \
    uint8x8_t z = vand_u8(vmask, r);                             \
                                                                 \
    /* horizontal adds collapse those bits into one byte */      \
    z = vpadd_u8(z, z);                                          \
    z = vpadd_u8(z, z);                                          \
    z = vpadd_u8(z, z);                                          \
                                                                 \
    unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0);     \
    index = __lzcnt(u32);  /* leading zeros give the lane index */ \
    value = vget_lane_u8(m, 0);                                  \
} while (0)


uint8_t v[8] = { ... };

static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
uint8x8_t vmask = vld1_u8(mask);

uint8x8_t v8 = vld1_u8(v);
int ret;
int ret_pos;
VMIN8(v8, ret_pos, ret);
```
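
The "second outer loop" mentioned above could look something like the
following (an illustrative sketch only, assuming the `VMIN8` macro and
`vmask` definition from the answer are in scope, and that `buf` and `n`
name the input array and its length): each block of eight bytes is
searched with `VMIN8`, and the block-local index is offset by the block's
base position whenever a new minimum is found.

```
/* Sketch: position of the minimum byte in buf[0..n), n a multiple of 8.
   Note __lzcnt in VMIN8 is an MSVC intrinsic; GCC/Clang would use
   __builtin_clz instead. */
int min_pos = -1;
unsigned min_val = 256;                  /* larger than any uint8_t value */
for (int base = 0; base < n; base += 8) {
    uint8x8_t blk = vld1_u8(&buf[base]);
    int pos, val;
    VMIN8(blk, pos, val);
    if ((unsigned)val < min_val) {       /* conditionally offset the block result */
        min_val = (unsigned)val;
        min_pos = base + pos;
    }
}
```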

**Rust Assembler Intrinsics**

An implementation by jvdd using Rust assembler intrinsics shows that this
two-stage approach, "blending" arrays of partial results in a kind of
parallelised "leaf node depth-first" search, appears to be a common
technique.

<https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>

[[!tag svp64_cookbook ]]