# Fortran MAXLOC SVP64 demo MAXLOC is a notoriously difficult function for SIMD to cope with. Typical approaches are to perform leaf-node (depth-first) parallel operations, merging the results mapreduce-style to guage a final index. SVP64 however has similar capabilities to Z80 CPIR and LDIR and therefore hardware may transparently implement back-end parallel operations whilst the developer programs in a simple sequential algorithm. A clear reference implementation of MAXLOC is as follows: ``` int maxloc(int * const restrict a, int n) { int m, nm = INT_MIN, 0; for (int i=0; i m) { m = a[i]; nm = i; } } return nm; } ``` **AVX-512** An informative article by Vamsi Sripathi of Intel shows the extent of the problem faced by SIMD ISAs (in the case below, AVX-512). Significant loop-unrolling is performed which leaves blocks that need to be merged: this is carried out with "blending" instructions. Article: NLnet foundation logo **ARM NEON** From stackexchange in ARM NEON intrinsics, one developer (Pavel P) wrote the subroutine below, explaining that it finds the index of a minimum value within a group of eight unsigned bytes. It is necessary to use a second outer loop to perform many of these searches in parallel, followed by conditionally offsetting each of the block-results. ``` #define VMIN8(x, index, value) \ do { \ uint8x8_t m = vpmin_u8(x, x); \ m = vpmin_u8(m, m); \ m = vpmin_u8(m, m); \ uint8x8_t r = vceq_u8(x, m); \ \ uint8x8_t z = vand_u8(vmask, r); \ \ z = vpadd_u8(z, z); \ z = vpadd_u8(z, z); \ z = vpadd_u8(z, z); \ \ unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0); \ index = __lzcnt(u32); \ value = vget_lane_u8(m, 0); \ } while (0) uint8_t v[8] = { ... }; static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 }; uint8x8_t vmask = vld1_u8(mask); uint8x8_t v8 = vld1_u8(v); int ret; int ret_pos; VMIN8(v8, ret_pos, ret); ``` **Rust Assembler Intrinsics** An approach by jvdd shows that the two stage approach of "blending" arrays of results in a type of parallelised "leaf node depth first" search seems to be a common technique. **Python implementation** A variant on the c reference implementation allows for skipping of updating m (the largest value found so far), followed by always updating m whilst a batch of numbers is found that are (in their order of sequence) always continuously greater than all previous numbers. The algorithm therefore flips back and forth between "skip whilst smaller" and "update whilst bigger", only updating the index during "bigger" sequences. This is key later when doing SVP64 assembler. ``` def m2(a): # array a m, nm, i, n = 0, 0, 0, len(a) while i m: m, nm, i = a[i], i, i+1 return nm; ``` # Implementation in SVP64 Assembler The core algorithm (inner part, in-register) is below: 11 instructions. Loading of data, and offsetting the "core" below is relatively straightforward: estimated another 6 instructions and needing one more branch (outer loop). ``` # while (im): sv.minmax./ff=le/m=ge 4,*10,4,1 # uses r4 as accumulator crternlogi 0,1,2,127 # test greater/equal or VL=0 sv.crand *19,*16,0 # clear if CR0.eq=0 # nm = i (count masked bits. could use crweirds here TODO) sv.svstep/mr/m=so 1, 0, 6, 1 # svstep: get vector dststep sv.creqv *16,*16,*16 # set mask on already-tested bc 12,0, -0x40 # CR0 lt bit clear, branch back ``` `sv.cmp` can be used in the first while loop because m (r4, the current largest-found value so far) does not change. However `sv.minmax.` has to be used in the key while loop because r4 is sequentially replaced each time, and mapreduce (`/mr`) is used to request this rolling-chain (accumulator) mode. Also note that `i` (the `"while i