# Fortran MAXLOC SVP64 demo

MAXLOC, which returns the position of the largest element of an array, is
notoriously difficult for SIMD ISAs to handle efficiently.
SVP64, however, has capabilities similar to the Z80 CPIR and LDIR instructions.

<https://bugs.libre-soc.org/show_bug.cgi?id=676>

```
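#include <limits.h> /* INT_MIN */

/* Return the index of the first maximum of a[0..n-1],
 * or -1 if no element exceeds INT_MIN (including n <= 0). */
int m2(int * const restrict a, int n)
{
    int m, nm;
    int i;

    m = INT_MIN;  /* running maximum */
    nm = -1;      /* index of the running maximum */
    for (i = 0; i < n; i++)
    {
        if (a[i] > m)  /* strict compare: ties keep the earliest index */
        {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}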
```

**AVX-512**

An informative article by Vamsi Sripathi of Intel shows the extent of the problem
faced by SIMD ISAs (in the case below, AVX-512). Significant loop unrolling is
performed, which leaves blocks of partial results that need to be merged; this is
carried out with "blending" instructions.
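
The unroll-then-merge pattern can be modelled in plain scalar C. The sketch below is not the code from the Intel article: `LANES` and `maxloc_lanes` are hypothetical names, and the per-lane arrays merely stand in for vector registers. Each lane keeps its own running maximum and index, updated with a compare followed by a select (the "blend"), and the lanes are merged only after the main loop. It is written to return the same result as `m2` above.

```c
#include <limits.h>

#define LANES 8  /* hypothetical vector width, in elements */

/* Scalar model of a lane-wise MAXLOC using compare + blend. */
int maxloc_lanes(const int *a, int n)
{
    int best_val[LANES];  /* stands in for a vector of running maxima */
    int best_idx[LANES];  /* stands in for a vector of their indices  */
    int i, l;

    for (l = 0; l < LANES; l++) {
        best_val[l] = INT_MIN;
        best_idx[l] = -1;
    }

    /* Stage 1: every lane tracks its own maximum, LANES elements per step. */
    for (i = 0; i + LANES <= n; i += LANES) {
        for (l = 0; l < LANES; l++) {
            int gt = a[i + l] > best_val[l];             /* per-lane mask */
            best_val[l] = gt ? a[i + l] : best_val[l];   /* blend values  */
            best_idx[l] = gt ? i + l : best_idx[l];      /* blend indices */
        }
    }

    /* Stage 2: merge the lanes; on equal values the lowest index wins. */
    int m = INT_MIN, nm = -1;
    for (l = 0; l < LANES; l++) {
        if (best_idx[l] < 0)
            continue;  /* this lane never saw a value above INT_MIN */
        if (best_val[l] > m || (best_val[l] == m && best_idx[l] < nm)) {
            m = best_val[l];
            nm = best_idx[l];
        }
    }

    /* Tail: leftover elements that did not fill a whole block. */
    for (; i < n; i++) {
        if (a[i] > m) {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
```

The tie-break in stage 2 is the part that a scalar loop gets for free: two lanes can hold the same maximum, and only the index comparison recovers the first occurrence.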

Article:
<https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html#gs.12t5y0>

<img src="https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png" alt="AVX-512 MAXLOC code listing from the Intel article" width="100%" />

**ARM NEON**

On Stack Overflow, one developer (Pavel P) posted the ARM NEON intrinsics
routine below, explaining that it finds the index of the minimum value within
a group of eight unsigned bytes. A second, outer loop is needed to perform
many of these searches in parallel, followed by conditionally offsetting each
of the block results.

<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>

```
#define VMIN8(x, index, value)                               \
do {                                                         \
    uint8x8_t m = vpmin_u8(x, x);                            \
    m = vpmin_u8(m, m);                                      \
    m = vpmin_u8(m, m);                                      \
    uint8x8_t r = vceq_u8(x, m);                             \
                                                             \
    uint8x8_t z = vand_u8(vmask, r);                         \
                                                             \
    z = vpadd_u8(z, z);                                      \
    z = vpadd_u8(z, z);                                      \
    z = vpadd_u8(z, z);                                      \
                                                             \
    unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0); \
    index = __lzcnt(u32);                                    \
    value = vget_lane_u8(m, 0);                              \
} while (0)


uint8_t v[8] = { ... };

static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
uint8x8_t vmask = vld1_u8(mask);

uint8x8_t v8 = vld1_u8(v);
int ret;
int ret_pos;
VMIN8(v8, ret_pos, ret);
```
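
The outer loop that such an eight-byte search needs can be sketched in scalar C. This is an illustration under stated assumptions, not part of the Stack Overflow answer: `block_argmin8` and `argmin_u8` are hypothetical names, the inner function is a plain-C stand-in for the NEON routine, and the blocks are visited one after another rather than in parallel. Each block result is a position from 0 to 7, which is offset by the block's base index only when that block improves on the best value so far.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for the eight-byte search: returns the position (0..7)
 * of the first minimum in x[0..7] and stores the minimum in *value. */
static int block_argmin8(const uint8_t *x, uint8_t *value)
{
    int idx = 0;
    for (int k = 1; k < 8; k++) {
        if (x[k] < x[idx])
            idx = k;
    }
    *value = x[idx];
    return idx;
}

/* Outer loop: index of the first minimum of a[0..n-1], or -1 if n == 0. */
long argmin_u8(const uint8_t *a, size_t n)
{
    long best_idx = -1;
    uint8_t best_val = UINT8_MAX;
    size_t base;

    for (base = 0; base + 8 <= n; base += 8) {
        uint8_t v;
        int pos = block_argmin8(a + base, &v);
        /* Conditionally offset the block-local position by the block base. */
        if (best_idx < 0 || v < best_val) {
            best_val = v;
            best_idx = (long)(base + (size_t)pos);
        }
    }

    /* Tail: fewer than eight elements remain. */
    for (size_t i = base; i < n; i++) {
        if (best_idx < 0 || a[i] < best_val) {
            best_val = a[i];
            best_idx = (long)i;
        }
    }
    return best_idx;
}
```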

**Rust Assembler Intrinsics**

An approach by jvdd suggests that the two-stage technique of "blending" arrays
of results, a kind of parallelised "leaf node depth first" search, is a
common one.

<https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>

[[!tag svp64_cookbook ]]