# Fortran MAXLOC SVP64 demo

MAXLOC is a notoriously difficult function for SIMD to cope with. Typical approaches are to perform leaf-node (depth-first) parallel operations, merging the results mapreduce-style to determine a final index. SVP64 however has capabilities similar to the Z80 CPIR and LDIR instructions, and therefore hardware may transparently implement back-end parallel operations whilst the developer programs a simple sequential algorithm.

A clear reference implementation of MAXLOC is as follows:

```
#include <limits.h>

int maxloc(int * const restrict a, int n) {
    int m = INT_MIN, nm = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > m) {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
```

**AVX-512**

An informative article by Vamsi Sripathi of Intel shows the extent of the problem faced by SIMD ISAs (in the case below, AVX-512). Significant loop-unrolling is performed which leaves blocks that need to be merged: this is carried out with "blending" instructions.

**ARM NEON**

In a Stack Exchange question on ARM NEON intrinsics, one developer (Pavel P) wrote the subroutine below, explaining that it finds the index of a minimum value within a group of eight unsigned bytes. A second outer loop is needed to perform many of these searches in parallel, followed by conditional offsetting of each of the block results.

```
#define VMIN8(x, index, value)                                  \
do {                                                            \
    uint8x8_t m = vpmin_u8(x, x);                               \
    m = vpmin_u8(m, m);                                         \
    m = vpmin_u8(m, m);                                         \
    uint8x8_t r = vceq_u8(x, m);                                \
                                                                \
    uint8x8_t z = vand_u8(vmask, r);                            \
                                                                \
    z = vpadd_u8(z, z);                                         \
    z = vpadd_u8(z, z);                                         \
    z = vpadd_u8(z, z);                                         \
                                                                \
    unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0);    \
    index = __lzcnt(u32);                                       \
    value = vget_lane_u8(m, 0);                                 \
} while (0)

uint8_t v[8] = { ... };
static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
uint8x8_t vmask = vld1_u8(mask);
uint8x8_t v8 = vld1_u8(v);
int ret;
int ret_pos;
VMIN8(v8, ret_pos, ret);
```

**Rust Assembler Intrinsics**

An approach by jvdd shows that this two-stage technique, "blending" arrays of results in a type of parallelised "leaf node depth-first" search, seems to be a common one.

**Python implementation**

A variant on the C reference implementation skips the updating of m (the largest value found so far) whilst elements are smaller than or equal to it, then continuously updates m whilst a batch of elements is found that is (in sequence order) greater than all previous elements. The algorithm therefore flips back and forth between "skip whilst smaller" and "update whilst bigger", only updating the index during "bigger" sequences. This is key later when doing the SVP64 assembler.

```
def m2(a): # array a
    m, nm, i, n = 0, 0, 0, len(a)
    while i < n:                                      # outer loop
        while i < n and a[i] <= m: i += 1             # skip whilst smaller
        while i < n and a[i] > m: m,nm,i = a[i],i,i+1 # only whilst bigger
    return nm
```

# Implementation in SVP64 Assembler

The core algorithm (the inner, in-register part) is below: 11 instructions. Loading of the data, and offsetting of the "core" results, is relatively straightforward: an estimated further six instructions, needing one more branch (the outer loop).

```
# while (i<n)
setvl 2,0,4,0,1,1                # set MVL=4, VL=MIN(MVL,CTR)
# while (a[i] <= m): skip whilst smaller or equal
sv.cmp/ff=gt/m=ge *0,0,*10,4     # truncates VL to the "skip" run
sv.creqv *16,*16,*16             # set mask on already-tested
setvl 2,0,4,0,1,1                # set MVL=4, VL=MIN(MVL,CTR)
mtcrf 128, 0                     # clear CR0 (in case VL=0)
# while (a[i] > m): update m whilst bigger
sv.minmax./ff=le/m=ge 4,*10,4,1  # uses r4 as accumulator
crternlogi 0,1,2,127             # test greater/equal or VL=0
sv.crand *19,*16,0               # clear if CR0.eq=0
# nm = i (count masked bits. could use crweirds here TODO)
sv.svstep/mr/m=so 1, 0, 6, 1     # svstep: get vector dststep
sv.creqv *16,*16,*16             # set mask on already-tested
bc 12,0, -0x40                   # CR0 lt bit clear, branch back
```

`sv.cmp` can be used in the first while loop because m (r4, the current largest-found value so far) does not change.
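Before continuing with the instruction-by-instruction walkthrough, it may help to see the intended control flow in plain Python. The sketch below is illustrative only: the helper name `m2_chunked` and the hard-coded `mvl=4` are assumptions made here to mirror the `setvl` above, and are not taken from the original code or the Simple-V specification. Each pass of the outer loop models one hardware iteration, with the batch length (VL) truncated by data-dependent fail-first.

```
def m2_chunked(a, mvl=4):
    # illustrative model (not from the original source) of the SVP64
    # core: process up to mvl elements per pass, truncating the
    # batch length (VL) according to the data encountered
    m, nm, i, n = 0, 0, 0, len(a)
    while i < n:                           # outer loop (the bc)
        # models sv.cmp/ff=gt: skip whilst smaller or equal; VL
        # truncates at the first element greater than m
        vl = min(mvl, n - i)
        skip = 0
        while skip < vl and a[i + skip] <= m:
            skip += 1
        i += skip
        if skip == vl:
            continue                       # whole batch skipped
        # models sv.minmax./ff=le/mr: rolling accumulator, update
        # whilst bigger; VL truncates at the first non-greater element
        vl = min(mvl, n - i)
        take = 0
        while take < vl and a[i + take] > m:
            m = a[i + take]
            take += 1
        nm = i + take - 1                  # nm = i, counted from the mask
        i += take
    return nm
```

The two inner loops correspond one-to-one with the two fail-first instructions: `sv.cmp/ff=gt` truncates VL to the length of the "skip" run, and `sv.minmax./ff=le` truncates VL to the length of the "bigger" run whilst accumulating into r4.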
However `sv.minmax.` has to be used in the key while loop because r4 is sequentially replaced each time, and mapreduce (`/mr`) is used to request this rolling-chain (accumulator) mode.

Also note that `i` (the `"while i<n"` part of each loop) is never held as a scalar binary index: it is maintained in unary form, as a mask in the CR fields, and only converted (by counting the mask bits) when nm needs updating.

`sv.minmax.` performs the `while (a[i] > m)` part, again in a single instruction, but this time it is a little more involved. Firstly: mapreduce mode is used, with `r4` as both source and destination; `r4` acts as the sequential accumulator. Secondly: it is again masked (`m=ge`), which again excludes the testing of previously-tested elements.

The next few instructions extract the information provided by Vector Length (VL) having been truncated, potentially even to zero! (Note that `mtcrf 128,0` takes care of the possibility of VL=0: without it, CR0 would be left in its previous state, a very much undesirable behaviour!) `crternlogi 0,1,2,127` will combine the settings of CR0.EQ and CR0.LT to give a true Greater-than-or-equal, including under the circumstance where VL=0.

The `sv.crand` will then take a copy of the `i-in-unary` mask, but only when CR0.EQ is set. This is why the third operand `BB` is a Scalar not a Vector (BT=19/Vector, BA=16/Vector, BB=0/Scalar), which effectively performs a broadcast-splat-ANDing, as follows:

```
CR4.SO = CR4.EQ AND CR0.EQ (if VL >= 1)
CR5.SO = CR5.EQ AND CR0.EQ (if VL >= 2)
CR6.SO = CR6.EQ AND CR0.EQ (if VL >= 3)
CR7.SO = CR7.EQ AND CR0.EQ (if VL = 4)
```
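To make that broadcast-splat-ANDing concrete, here is a throwaway Python model. The function name and list-of-bits representation are assumptions for illustration only (this is not the SVP64 simulator's API); it simply reproduces the table above.

```
def crand_broadcast(eq_bits, cr0_eq, vl):
    # eq_bits[e] models CR(4+e).EQ (the i-in-unary mask);
    # the result models CR(4+e).SO, for VL elements;
    # cr0_eq is the single scalar gating bit (operand BB)
    return [eq_bits[e] & cr0_eq for e in range(vl)]

# CR0.EQ clear: the copied mask is all zeros, so the masked
# sv.svstep that follows leaves nm untouched
assert crand_broadcast([1, 1, 0, 0], 0, 2) == [0, 0]
# CR0.EQ set: the i-in-unary mask is copied through to the
# SO bits, ready for the mask-bit count that produces nm
assert crand_broadcast([1, 1, 0, 0], 1, 2) == [1, 1]
```

[[!tag svp64_cookbook ]]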