# Fortran MAXLOC SVP64 demo

MAXLOC is a notoriously difficult function for SIMD ISAs to cope with.
SVP64, however, has capabilities similar to the Z80 CPIR and LDIR
block instructions.

<https://bugs.libre-soc.org/show_bug.cgi?id=676>

```
#include <limits.h>

/* Scalar reference: returns the index of the first occurrence
   of the maximum element, or -1 for an empty array. */
int m2(int * const restrict a, int n)
{
    int m, nm;
    int i;

    m = INT_MIN;
    nm = -1;
    for (i = 0; i < n; i++)
    {
        if (a[i] > m)
        {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
```
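
A quick scalar sanity check (a minimal sketch, not part of the original
demo, to be appended to the function above) shows the semantics that any
vectorised version has to preserve: because the comparison is strictly
greater-than, the index of the *first* occurrence of the maximum is
returned, matching Fortran MAXLOC's default behaviour (adjusted for C's
zero-based indexing).

```
#include <stdio.h>

int main(void)
{
    int a[] = {3, 7, 2, 7, 1};
    /* the maximum (7) first occurs at index 1 */
    printf("maxloc = %d\n", m2(a, 5));   /* prints "maxloc = 1" */
    return 0;
}
```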

**AVX-512**

An informative article by Vamsi Sripathi of Intel shows the extent of
the problem faced by SIMD ISAs (in the case below, AVX-512). Significant
loop-unrolling is performed, which leaves blocks that need to be merged:
this is carried out with "blending" instructions.

Article:
<https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html#gs.12t5y0>

<img src="https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png" alt="AVX-512 MAXLOC code from the Intel article" width="100%" />

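The blend-based idea can be sketched with AVX-512 intrinsics (a hedged
illustration of the technique the article describes, not the article's
actual code; the function name and the scalar tail-merge are illustrative
assumptions). A vector of running maxima and a vector of their indices are
maintained, and a compare mask selects, lane by lane, which candidate to
keep:

```
#include <immintrin.h>
#include <limits.h>

/* Sketch only: assumes AVX-512F and that n is a multiple of 16. */
int maxloc_avx512(const int *a, int n)
{
    __m512i vmax = _mm512_set1_epi32(INT_MIN);   /* running maxima */
    __m512i vidx = _mm512_set1_epi32(-1);        /* their indices  */
    __m512i icur = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
    const __m512i istep = _mm512_set1_epi32(16);

    for (int i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512((const void *)&a[i]);
        /* lanes where the new element is strictly greater than the max */
        __mmask16 gt = _mm512_cmpgt_epi32_mask(v, vmax);
        /* "blending": keep the old lane where gt=0, take the new where gt=1 */
        vmax = _mm512_mask_blend_epi32(gt, vmax, v);
        vidx = _mm512_mask_blend_epi32(gt, vidx, icur);
        icur = _mm512_add_epi32(icur, istep);
    }

    /* merge the 16 per-lane results; done scalar here for clarity,
       with ties broken toward the smaller index so the first maximum wins */
    int maxv[16], maxi[16];
    _mm512_storeu_si512((void *)maxv, vmax);
    _mm512_storeu_si512((void *)maxi, vidx);
    int m = INT_MIN, nm = -1;
    for (int l = 0; l < 16; l++)
        if (maxv[l] > m || (maxv[l] == m && maxi[l] < nm)) {
            m = maxv[l];
            nm = maxi[l];
        }
    return nm;
}
```

Even in this simplified form, the final cross-lane merge and the index
tie-breaking illustrate the bookkeeping that the article's unrolled,
blend-heavy version has to perform.
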
**ARM NEON**

In a Stack Overflow answer on ARM NEON intrinsics, one developer (Pavel P)
wrote the macro below, explaining that it finds the index of the minimum
value within a group of eight unsigned bytes. It is necessary to use a
second outer loop to perform many of these searches in parallel, followed
by conditionally offsetting each of the block-results (a sketch of such an
outer loop is given after the macro).

<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>

```
#include <arm_neon.h>

#define VMIN8(x, index, value)                                   \
do {                                                             \
    /* three pairwise mins broadcast the minimum to all lanes */ \
    uint8x8_t m = vpmin_u8(x, x);                                \
    m = vpmin_u8(m, m);                                          \
    m = vpmin_u8(m, m);                                          \
    /* lanes holding the minimum compare equal (0xFF) */         \
    uint8x8_t r = vceq_u8(x, m);                                 \
                                                                 \
    /* keep one distinct mask bit per matching lane */           \
    uint8x8_t z = vand_u8(vmask, r);                             \
                                                                 \
    /* horizontal adds collapse those bits into one byte */      \
    z = vpadd_u8(z, z);                                          \
    z = vpadd_u8(z, z);                                          \
    z = vpadd_u8(z, z);                                          \
                                                                 \
    unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0);     \
    index = __lzcnt(u32);  /* leading zeros give the lane index */ \
    value = vget_lane_u8(m, 0);                                  \
} while (0)


uint8_t v[8] = { ... };

static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
uint8x8_t vmask = vld1_u8(mask);

uint8x8_t v8 = vld1_u8(v);
int ret;
int ret_pos;
VMIN8(v8, ret_pos, ret);
```
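
The "second outer loop" mentioned above could look something like the
following (an illustrative sketch only, assuming the `VMIN8` macro and
`vmask` definition from the answer are in scope, and that `buf` and `n`
name the input array and its length): each block of eight bytes is
searched with `VMIN8`, and the block-local index is offset by the block's
base position whenever a new minimum is found.

```
/* Sketch: position of the minimum byte in buf[0..n), n a multiple of 8.
   Note __lzcnt in VMIN8 is an MSVC intrinsic; GCC/Clang would use
   __builtin_clz instead. */
int min_pos = -1;
unsigned min_val = 256;                  /* larger than any uint8_t value */
for (int base = 0; base < n; base += 8) {
    uint8x8_t blk = vld1_u8(&buf[base]);
    int pos, val;
    VMIN8(blk, pos, val);
    if ((unsigned)val < min_val) {       /* conditionally offset the block result */
        min_val = (unsigned)val;
        min_pos = base + pos;
    }
}
```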

**Rust Assembler Intrinsics**

An implementation by jvdd using Rust assembler intrinsics shows that this
two-stage approach, "blending" arrays of partial results in a kind of
parallelised "leaf node depth-first" search, appears to be a common
technique.

<https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>

[[!tag svp64_cookbook ]]