# Fortran MAXLOC SVP64 demo

<https://bugs.libre-soc.org/show_bug.cgi?id=676>

MAXLOC is a notoriously difficult function for SIMD to cope with.
Typical approaches are to perform leaf-node (depth-first) parallel
operations, merging the results mapreduce-style to gauge a final
index.

SVP64 however has similar capabilities to Z80 CPIR and LDIR and
therefore hardware may transparently implement back-end parallel
operations whilst the developer programs in a simple sequential
algorithm.

A clear reference implementation of MAXLOC is as follows:

```
#include <limits.h>

int maxloc(int * const restrict a, int n) {
    int m = INT_MIN, nm = 0;
    for (int i=0; i<n; i++) {
        if (a[i] > m) {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
```

**AVX-512**

An informative article by Vamsi Sripathi of Intel shows the extent of
the problem faced by SIMD ISAs (in the case below, AVX-512). Significant
loop-unrolling is performed which leaves blocks that need to be merged:
this is carried out with "blending" instructions.

Article:
<https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions.html#gs.12t5y0>

<img src="https://www.intel.com/content/dam/developer/articles/technical/optimizing-maxloc-operation-using-avx-512-vector-instructions/optimizing-maxloc-operation-using-avx-512-vector-instructions-code2.png" alt="AVX-512 MAXLOC code from the Intel article" width="100%" />

**ARM NEON**

On Stack Overflow, using ARM NEON intrinsics, one developer (Pavel P) wrote the
subroutine below, explaining that it finds the index of a minimum value within
a group of eight unsigned bytes. It is necessary to use a second outer loop
to perform many of these searches in parallel, followed by conditionally
offsetting each of the block-results.

<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>

```
#define VMIN8(x, index, value) \
do { \
    uint8x8_t m = vpmin_u8(x, x); \
    m = vpmin_u8(m, m); \
    m = vpmin_u8(m, m); \
    uint8x8_t r = vceq_u8(x, m); \
    \
    uint8x8_t z = vand_u8(vmask, r); \
    \
    z = vpadd_u8(z, z); \
    z = vpadd_u8(z, z); \
    z = vpadd_u8(z, z); \
    \
    unsigned u32 = vget_lane_u32(vreinterpret_u32_u8(z), 0); \
    index = __lzcnt(u32); \
    value = vget_lane_u8(m, 0); \
} while (0)


uint8_t v[8] = { ... };

static const uint8_t mask[] = { 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
uint8x8_t vmask = vld1_u8(mask);

uint8x8_t v8 = vld1_u8(v);
int ret;
int ret_pos;
VMIN8(v8, ret_pos, ret);
```
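
The net effect of this trick can be modelled in plain Python. The sketch below
(with a hypothetical name `vmin8_model`, and assuming the three
`vpmin_u8`/`vpadd_u8` rounds behave as described) shows why the leading-zero
count of the replicated lane gives the index of the *first* minimum:

```
# Hypothetical pure-Python model of the VMIN8 macro above, not the NEON
# intrinsics themselves: mask bits 0x80>>i are kept where v[i] equals the
# minimum, summed into every lane, and the leading-zero count of the
# replicated 32-bit lane recovers the index of the first minimum.
def vmin8_model(v):                     # v: list of 8 unsigned bytes
    m = min(v)                          # three vpmin_u8 rounds reduce to the min
    mask = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]
    # vceq_u8 + vand_u8: keep the mask bit wherever the byte equals the min
    z = [mask[i] if v[i] == m else 0 for i in range(8)]
    s = sum(z)                          # three vpadd_u8 rounds sum all 8 lanes
    u32 = s | (s << 8) | (s << 16) | (s << 24)   # lane 0 after reinterpret
    index = 32 - u32.bit_length()       # __lzcnt equivalent (u32 is nonzero)
    return index, m

print(vmin8_model([9, 4, 7, 4, 8, 6, 5, 9]))    # → (1, 4): first minimum wins
```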

**Rust Assembler Intrinsics**

An approach by jvdd shows that the two-stage approach of "blending"
arrays of results, in a type of parallelised "leaf-node depth-first"
search, seems to be a common technique.

<https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>
**Python implementation**

A variant on the C reference implementation allows for skipping
the update of m (the largest value found so far) whilst numbers are
smaller, then always updating m whilst a batch of numbers is found
that are (in their order of sequence) continuously greater than
all previous numbers. The algorithm therefore flips back and
forth between "skip whilst smaller" and "update whilst bigger",
only updating the index during "bigger" sequences. This is key
later when doing SVP64 assembler.

```
def m2(a): # array a
    m, nm, i, n = 0, 0, 0, len(a)
    while i<n:
        while i<n and a[i]<=m: i += 1              # skip whilst smaller/equal
        while i<n and a[i]> m: m,nm,i = a[i],i,i+1 # only whilst bigger
    return nm
```
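
As a quick sanity check (a hypothetical test harness, not part of the original
demo), the variant can be compared against Python's built-in first-maximum
index. Note that because `m` starts at 0 the inputs are assumed positive,
unlike the `INT_MIN`-initialised C reference:

```
import random

# m2 repeated here so the check is self-contained
def m2(a): # array a
    m, nm, i, n = 0, 0, 0, len(a)
    while i<n:
        while i<n and a[i]<=m: i += 1              # skip whilst smaller/equal
        while i<n and a[i]> m: m,nm,i = a[i],i,i+1 # only whilst bigger
    return nm

# a.index(max(a)) is the index of the *first* maximum, which is also what
# the strict a[i] > m comparison in the reference selects
for _ in range(1000):
    a = [random.randint(1, 50) for _ in range(random.randint(1, 20))]
    assert m2(a) == a.index(max(a))
```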

# Implementation in SVP64 Assembler

The core algorithm (inner part, in-register) is below: 11 instructions.
Loading of data, and offsetting the "core" below, is relatively
straightforward: an estimated further 6 instructions, needing one
more branch (outer loop).

```
# while (i<n)
setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
# while (i<n and a[i]<=m) : i += 1
sv.cmp/ff=gt/m=ge *0,0,*10,4       # truncates VL to min
sv.creqv *16,*16,*16               # set mask on already-tested
setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
mtcrf 128, 0                       # clear CR0 (in case VL=0?)
# while (i<n and a[i]>m):
sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # uses r4 as accumulator
crternlogi 0,1,2,127               # test greater/equal or VL=0
sv.crand *19,*16,0                 # clear if CR0.eq=0
# nm = i (count masked bits. could use crweirds here TODO)
sv.svstep/mr/m=so 1, 0, 6, 1       # svstep: get vector dststep
sv.creqv *16,*16,*16               # set mask on already-tested
bc 12,0, -0x40                     # CR0 lt bit clear, branch back
```

`sv.cmp` can be used in the first while loop because m (r4, the current
largest-found value so far) does not change.
However `sv.minmax.` has to be used in the key while loop
because r4 is sequentially replaced each time, and mapreduce (`/mr`)
is used to request this rolling-chain (accumulator) mode.

Also note that `i` (the `"while i<n"` part) is represented as a
unary bitmask (in CR bits 16,20,24,28) which turns out to be easier to
use than a binary index, as it can be used directly as a Predicate Mask
(`/m=ge`).

The algorithm works by excluding previous operations using `i-in-unary`,
combined with VL being truncated due to use of Data-Dependent Fail-First.
What therefore happens, for example on the `sv.cmp/ff=gt/m=ge` operation,
is that it is *VL* (the Vector Length) that gets truncated to only
contain those elements that are smaller than the current largest value
found (`m` aka `r4`). Calling `sv.creqv` then sets **only** the
CR field bits up to length `VL`, which on the next loop will exclude
them because the Predicate Mask is `m=ge` (ok if the CR field bit is
**zero**).

Therefore, the way that Data-Dependent Fail-First works, it attempts
*up to* the current Vector Length, and on detecting the first failure
will truncate at that point. In effect this is speculative sequential
execution of `while (i<n and a[i]<=m) : i += 1`.
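
This truncation behaviour can be sketched in Python (a simplified model of the
net effect, an assumption rather than the exact SVP64 semantics; the helper
name is hypothetical):

```
# Simplified model of Data-Dependent Fail-First on sv.cmp/ff=gt: elements
# are tested sequentially up to VL, and VL is truncated at the first
# element that fails the "smaller than or equal to m" condition.
def ddff_truncate(a, start, VL, m):
    """Return new VL: length of the leading run of a[start:start+VL] <= m."""
    new_VL = 0
    for i in range(VL):
        if start + i >= len(a) or not (a[start + i] <= m):
            break                       # first failure truncates VL here
        new_VL += 1
    return new_VL

# a[0..1] are <= m=5 but a[2]=7 is not: VL truncates from 4 to 2
print(ddff_truncate([3, 5, 7, 2], 0, 4, 5))     # → 2
```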

Next comes the `sv.minmax.` which covers the `while (i<n and a[i]>m)`
part, again in a single instruction, but this time it is a little more
involved. Firstly: mapreduce mode is used, with `r4` as both source
and destination; `r4` acts as the sequential accumulator. Secondly,
again it is masked (`m=ge`), which again excludes testing of previously-tested
elements. The next few instructions extract the information provided
by Vector Length (VL) being truncated - potentially even to zero!
(Note that `mtcrf 128,0` takes care of the possibility of VL=0: if
that happens then CR0 would otherwise be left in its previous state, a
very much undesirable behaviour!)

`crternlogi 0,1,2,127` will combine the setting of CR0.EQ and CR0.LT
to give us a true Greater-than-or-equal, including under the circumstance
where VL=0. The `sv.crand` will then take a copy of the `i-in-unary`
mask, but only when CR0.EQ is set. This is why the third operand `BB`
is a Scalar not a Vector (BT=16/Vector, BA=19/Vector, BB=0/Scalar),
which effectively performs a broadcast-splat-ANDing, as follows:

```
CR4.SO = CR4.EQ AND CR0.EQ (if VL >= 1)
CR5.SO = CR5.EQ AND CR0.EQ (if VL >= 2)
CR6.SO = CR6.EQ AND CR0.EQ (if VL >= 3)
CR7.SO = CR7.EQ AND CR0.EQ (if VL = 4)
```
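
A small Python sketch of this broadcast-splat-ANDing (a simplified CR-field
model with a hypothetical helper name, not real Power ISA semantics):

```
# Model of the sv.crand with a Scalar BB operand: the single CR0.EQ bit
# gates each Vector CRn.EQ bit, for the first VL CR fields only.
def crand_splat(cr_eq_vec, cr0_eq, VL):
    """CRn.SO = CRn.EQ AND CR0.EQ for the first VL fields, 0 beyond VL."""
    return [cr_eq_vec[i] & cr0_eq if i < VL else 0
            for i in range(len(cr_eq_vec))]

print(crand_splat([1, 1, 0, 1], 1, 3))  # → [1, 1, 0, 0]
print(crand_splat([1, 1, 0, 1], 0, 3))  # CR0.EQ clear masks everything
```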

**Converting the unary mask to binary**

Now that the `CR4/5/6/7.SO` bits have been set, it is necessary to
count them, i.e. convert a unary mask into a binary number. There
are several ways to do this, one of which is
the `crweird` suite of instructions, combined with `popcnt`. However
there is a very straightforward way to do it: use `sv.svstep`.

```
crternlogi 0,1,2,127

i ---->   0      1      2      3
        CR4.EQ CR5.EQ CR6.EQ CR7.EQ
        & CR0  & CR0  & CR0  & CR0
          |      |      |      |
          v      v      v      v
        CR4.SO CR5.SO CR6.SO CR7.SO

sv.svstep/mr/m=so 1, 0, 6, 1

          |      |      |      |
count <---+------+------+------+
```
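
The net effect of `sv.svstep/mr/m=so` can be sketched in Python (a simplified
model, assuming that each active element writes its `dststep` and the scalar
destination keeps the last write; the helper name is hypothetical):

```
# Model of sv.svstep/mr/m=so: with the SO bits as a predicate, dststep is
# written for each active element, and because the scalar destination (r1)
# is overwritten each time in mapreduce mode, it retains the *last* one.
def svstep_last_index(so_bits):
    r1 = None                           # r1 untouched if no element is active
    for dststep, so in enumerate(so_bits):
        if so:                          # predicated: only active elements write
            r1 = dststep                # mapreduce: scalar overwritten each time
    return r1

print(svstep_last_index([1, 1, 1, 0])) # → 2: index of the last set SO bit
```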

In reality what is happening is that svstep is requested to return
the current `dststep` (destination step index) into a scalar
destination (`RT=r1`), but in "mapreduce" mode. Mapreduce mode
will keep running as if the destination was a Vector, overwriting
previously-written results. Normally, one of the *sources* would
be the same register (`RA=r1` for example), which would turn r1 into
an accumulator; however in this particular case we simply consistently
overwrite `r1` and end up with the last `dststep`.

There *is* an alternative to this approach: `getvl` followed by
subtracting 1. `VL`, being the total Vector Length as set by
Data-Dependent Fail-First, is the *length* of the Vector, whereas
what we actually want is the index of the last element: hence
using `sv.svstep`. Given that `getvl` followed by `addi -1`
is two instructions rather than one, we use `sv.svstep` instead,
although they are the same number of words (SVP64 instructions are 64-bit).

Lastly, the `creqv` sets up the mask (based on the current
Vector Length), and the loop-back (`bc`) jumps to reset the
Vector Length back to the maximum (4). Thus, the mask (CR4/5/6/7 EQ
bit), which began as empty, will exclude nothing, but on subsequent
loops will exclude previously-tested elements (up to the previous
value of VL before it was reset back to the maximum).

This is actually extremely important to note here, because the
Data-Dependent Fail-First testing starts at the **first non-masked**
element, not the first element. This fact is exploited here, and
the only thing to contend with is if VL is ever set to zero
(first element fails). If that happens then CR0 will never get set
(because no element succeeded and the Vector Length is truncated to zero),
hence the need to clear CR0 prior to starting that instruction.
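
Tying the pieces together, the whole core loop can be modelled in Python (a
structural sketch assuming the simplified VL-truncation semantics described
above, not a cycle-accurate SVP64 simulation; `maxloc_model` is a hypothetical
name, and inputs are assumed positive since `m` starts at 0):

```
# Structural model of the SVP64 core loop: alternate "skip whilst
# smaller/equal" and "update whilst bigger" phases, each bounded by a
# Vector Length of at most MVL, with the loop bound standing in for
# Data-Dependent Fail-First truncation.
def maxloc_model(a, MVL=4):
    m, nm, i = 0, 0, 0                  # m: best value, nm: its index
    n = len(a)
    while i < n:
        VL = min(MVL, n - i)
        # sv.cmp/ff=gt phase: skip the run of elements <= m
        k = 0
        while k < VL and a[i + k] <= m:
            k += 1
        i += k
        VL = min(MVL, n - i)
        # sv.minmax./ff=le/mr phase: r4-style rolling accumulator
        k = 0
        while k < VL and a[i + k] > m:
            m, nm = a[i + k], i + k     # only updated during "bigger" runs
            k += 1
        i += k
    return nm

print(maxloc_model([3, 1, 4, 1, 5, 9, 2, 6]))   # → 5 (the value 9)
```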

Overall this algorithm is extremely short but quite complex in its
intricate use of SVP64 capabilities, yet it still remains parallelisable
in hardware. It could potentially be shorter: there may be opportunities
to use predicate masking with the CR field manipulation. However
the number of instructions is already at the point where, even when the
LOAD outer loop is added, it will still remain short enough to fit in
a single line of L1 Cache.

[[!tag svp64_cookbook ]]