# Fortran MAXLOC SVP64 demo

MAXLOC is a notoriously difficult function for SIMD to cope with.
SVP64, however, has similar capabilities to the Z80 CPIR and LDIR instructions.

<https://bugs.libre-soc.org/show_bug.cgi?id=676>

    int m2(int * const restrict a, int n)
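
For reference, a minimal sketch of a scalar MAXLOC in C. The function name `maxloc_scalar`, the 0-based result and the `-1` return for an empty array are assumptions made for illustration, and are not necessarily what `m2` does.

```c
/* Hypothetical scalar reference: returns the 0-based index of the first
 * maximum element of a[0..n-1], or -1 when n <= 0 (assumed convention). */
int maxloc_scalar(const int *restrict a, int n)
{
    int best = 0;        /* running maximum, valid once best_idx >= 0 */
    int best_idx = -1;   /* index of the running maximum */

    for (int i = 0; i < n; i++) {
        /* strict ">" keeps the first occurrence when values tie */
        if (best_idx < 0 || a[i] > best) {
            best = a[i];
            best_idx = i;
        }
    }
    return best_idx;
}
```

The loop carries a dependency on both the running maximum and its index from one iteration to the next, which is what makes the function awkward to spread across SIMD lanes.
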

An informative article by Vamsi Sripathi of Intel shows the extent of the problem
faced by SIMD ISAs (in the case below, AVX-512). Significant loop-unrolling is performed,
which leaves blocks that need to be merged: this is carried out with "blending".

<https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html#gs.12t5y0>

<img src="https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png" alt="AVX-512 MAXLOC code from the Intel article" width="100%" />
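
The unroll-then-merge pattern described above can be modelled in plain C, without AVX-512 intrinsics. This sketch is not taken from the Intel article: `LANES` independent running maxima stand in for the lanes of a vector register, the conditional selects stand in for compare-and-blend, and a final pass merges the per-lane results. The name `maxloc_blend` and the lane count of 4 are assumptions for illustration.

```c
#define LANES 4   /* assumed vector width, for illustration only */

/* Hypothetical model of a lane-wise MAXLOC: returns the 0-based index of
 * the first maximum of a[0..n-1], or -1 when n <= 0. */
int maxloc_blend(const int *a, int n)
{
    int vmax[LANES];   /* per-lane running maximum */
    int vidx[LANES];   /* per-lane index of that maximum */
    int used = n < LANES ? n : LANES;
    int i, l, best;

    if (n <= 0)
        return -1;

    /* seed each lane from the first block */
    for (l = 0; l < used; l++) {
        vmax[l] = a[l];
        vidx[l] = l;
    }
    /* main loop: one pass models one vector compare plus two blends */
    for (i = LANES; i + LANES <= n; i += LANES) {
        for (l = 0; l < LANES; l++) {
            int gt = a[i + l] > vmax[l];          /* per-lane mask bit */
            vmax[l] = gt ? a[i + l] : vmax[l];    /* blend the values  */
            vidx[l] = gt ? i + l : vidx[l];       /* blend the indices */
        }
    }
    /* tail elements that do not fill a whole block */
    for (; i < n; i++) {
        l = i % LANES;
        if (a[i] > vmax[l]) {
            vmax[l] = a[i];
            vidx[l] = i;
        }
    }
    /* merge the lanes: on equal values the lowest index wins */
    best = 0;
    for (l = 1; l < used; l++) {
        if (vmax[l] > vmax[best] ||
            (vmax[l] == vmax[best] && vidx[l] < vidx[best]))
            best = l;
    }
    return vidx[best];
}
```

Ties are resolved towards the lowest index in the final merge, so the result matches a sequential first-occurrence search.
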

On Stack Overflow, one developer (Pavel P) wrote the subroutine below using
ARM NEON intrinsics, explaining that it finds the index of a minimum value within
a group of eight unsigned bytes. A second, outer loop is needed
to perform many of these searches in parallel, followed by conditionally
offsetting each of the block results.

<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>
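
    /* Minimum of eight unsigned bytes, and the index of its first occurrence. */
    #define VMIN8(x, index, value) \
    do { \
        uint8x8_t m = vpmin_u8(x, x);    /* pairwise min: 8 -> 4 candidates */ \
        m = vpmin_u8(m, m);              /* 4 -> 2 */ \
        m = vpmin_u8(m, m);              /* 2 -> 1: every lane holds the minimum */ \
        uint8x8_t r = vceq_u8(x, m);     /* 0xFF in each lane equal to the minimum */ \
        uint8x8_t z = vand_u8(vmask, r); /* one distinct bit per matching lane */ \
        z = vpadd_u8(z, z);              /* three pairwise adds leave the */ \
        z = vpadd_u8(z, z);              /* combined 8-bit match mask */ \
        z = vpadd_u8(z, z);              /* in every lane */ \
        unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0); \
        index = __lzcnt(u32);            /* leading zeros = first matching lane */ \
        value = vget_lane_u8(m, 0); \
    } while (0)

    uint8_t v[8] = { ... };

    static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    uint8x8_t vmask = vld1_u8(mask);

    uint8x8_t v8 = vld1_u8(v);
    int ret, ret_pos;
    VMIN8(v8, ret_pos, ret);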
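
The block search and the outer loop described above can be modelled in portable C. This is a scalar sketch rather than NEON code: `min8_block` and `minloc_u8` are hypothetical names, and the array length is assumed to be a non-zero multiple of eight.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for one 8-lane block search: finds the minimum of
 * eight bytes and the position of its first occurrence. */
void min8_block(const uint8_t x[8], unsigned *index, uint8_t *value)
{
    uint8_t m = x[0];
    uint8_t bits = 0;
    unsigned pos = 0;
    int l;

    for (l = 1; l < 8; l++)        /* models the pairwise-min reduction */
        if (x[l] < m)
            m = x[l];

    for (l = 0; l < 8; l++)        /* models compare-equal, then AND with mask */
        if (x[l] == m)
            bits |= (uint8_t)(0x80u >> l);

    /* count leading zeros of the 8-bit match mask; bits is never zero
     * because at least one lane equals the minimum */
    while (!(bits & 0x80u)) {
        bits = (uint8_t)(bits << 1);
        pos++;
    }
    *index = pos;
    *value = m;
}

/* Hypothetical outer loop: n is assumed to be a non-zero multiple of 8.
 * Returns the index of the first minimum and stores its value. */
size_t minloc_u8(const uint8_t *v, size_t n, uint8_t *min_out)
{
    unsigned idx;
    uint8_t val;
    size_t best_idx, base;
    uint8_t best;

    min8_block(v, &idx, &val);     /* the first block seeds the result */
    best_idx = idx;
    best = val;

    for (base = 8; base + 8 <= n; base += 8) {
        min8_block(v + base, &idx, &val);
        if (val < best) {          /* strict "<": earlier blocks win ties */
            best = val;
            best_idx = base + idx; /* offset the block-local index */
        }
    }
    *min_out = best;
    return best_idx;
}
```

Each call to `min8_block` is independent of the others, so the block searches could run in parallel. Only the final comparison and index offsetting is sequential.
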

**Rust Assembler Intrinsics**

The approach taken by jvdd suggests that the two-stage technique of "blending"
arrays of results, a type of parallelised "leaf node depth first" search,
seems to be common practice.

<https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>

[[!tag svp64_cookbook ]]