mention SIMD pospopcount really hard to comprehend
[libreriscv.git] / openpower / sv / cookbook / pospopcnt.mdwn
1 # Positional popcount SVP64
2
3 **Links**
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
6 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
7 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
8 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
9
10 **Funding**
11
12 This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA).
13
14 [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
15 [<img src="https://nlnet.nl/image/logos/NGIAssure_tag.svg" alt="NGI Assure Logo" width="20%" />](https://nlnet.nl/assure)
16
17 **Introduction**
18
19 Positional popcount in optimised assembler is typically done on SIMD
20 ISAs in around 500 lines. The implementations are extraordinarily
21 complex and very hard to understand. Power ISA thanks to `bpermd` can be much
22 more efficient: with SVP64 even more so. The reference implementation
23 showing the concept is below.
24
25 ```
26 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
27 // count8 reference implementation for tests. Do not alter.
28 func count8safe(counts *[8]int, buf []uint8) {
29 for i := range buf {
30 for j := 0; j < 8; j++ {
31 counts[j] += int(buf[i] >> j & 1)
32 }
33 }
34 }
35 ```
36
37 A simple but still hardware-paralleliseable SVP64 assembler for
38 8-bit input values (`count8safe`) is as follows:
39
40 ```
41 mtspr 9, 3 # move r3 to CTR
42 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
43 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
44 # load VL bytes (update r4 addr) but compressed (dw=8)
45 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
46 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
47 # gather performs the transpose (which gets us to positional..)
48 gbbd 8,6
49 # now those bits have been turned around, popcount and sum them
50 setvl 0,0,8,0,1,1 # set MVL=VL=8
51 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
52 sv.add *16,*16,*24 # and accumulate in results
53 # branch back if CTR still non-zero. works even though VL=8
54 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
55 ```
56
57
58 Array popcount is just standard popcount function
59 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
60 however positional popcount is different (vertical). Refer to Fig.1
61
62
63 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
64 alt="pospopcnt" width="70%" />
65
66 Positional popcount adds up the totals of each bit set to 1 in each
67 bit-position, of an array of input values. Refer to Fig.2
68
69 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
70 alt="pospopcnt" width="60%" />
71
72
73 # Visual representation of the pospopcount algorithm
74
75 In order to perform positional popcount we need to go through series of
76 steps shown below in figures 3, 4, 5 & 6. The limitation to overcome is
77 that the CPU can only work on registers (horizontally) but the requirement
78 of pospopcount is to add *vertically*. Part of the challenge is therefore
79 to perform an appropriate transpose of the data, in blocks that suit
80 the processor and the ISA capacity.
81
82 Fig.3 depicts how the data is divided into blocks of 8*8-bit.
83
84 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
85 alt="pospopcnt" width="70%" />
86
87 Fig.4 shows that a transpose is needed on each block. The gbbd
88 instruction is used for implementing the transpose function, preparing
89 the data for using the standard popcount instruction.
90
91 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
92 alt="pospopcnt" width="60%" />
93
94 Now on each 8*8 block, 8 popcount instructions can be run each of
95 which is independent and therefore can be parallelised even by In-Order
96 multi-issue hardware(Fig.5).
97
98 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
99 alt="pospopcnt" width="70%" />
100
101 Fig.6 depicts how each of the intermediate results are accumulated. It
102 is worth noting that each intermediate result is independent of the
103 other intermediate results and also parallel reduction can be applied
104 to all of them individually. This gives *two* opportunities for hardware
105 parallelism rather than one.
106
107 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
108 alt="pospopcnt" width="100%" />
109
110 In short this algorithm is very straightforward to implement thanks to the
111 two crucial instructions, `gbbd` and `popcntd`. Below is a walkthrough
112 of the assembler, keeping it very simple, and exploiting only one of the
113 opportunities for parallelism (by not including the Parallel Reduction
114 opportunity mentioned above).
115
116 # Walkthrough of the assembler
117
118 Firstly the CTR (Counter) SPR is set up, and is key to looping
119 as outlined further, below
120
121 ```
122 mtspr 9, 3 # move r3 to CTR
123 ```
124
125 The Vector Length, which is limited to 8 (MVL - Maximum Vector Length)
126 is set up. A special "CTR" Mode is used which automatically uses the CTR
127 SPR rather than register RA. (*Note that RA is set to zero to indicate
128 this, because there is limited encoding space. See [[openpower/sv/setvl]]
129 instruction specification for details)*.
130
131 The result of this instruction is that if CTR is greater than 8, VL is set
132 to 8. If however CTR is less than or equal to 8, then VL is set to CTR.
133 Additionally, a copy of VL is placed into RT (r3 in this case), which
134 again is necessary as part of the limited encoding space but in some cases
135 (not here) this is desirable, and avoids a `mfspr` instruction to take
136 a copy of VL into a GPR.
137
138 ```
139 # VL = r3 = MIN(CTR,MAXVL=8)
140 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
141 ```
142
143 Next the data must be loaded in batches (of up to QTY 8of 8-bit
144 values). We must also not over-run the end of the array (which a normal
145 SIMD ISA would do, potentially causing a pagefault) and this is achieved
146 by when CTR <= 8: VL (Vector Length) in such circumstances being set to
147 the remaining elements.
148
149 The next instruction is `sv.lbzu/pi` - a Post-Increment variant of `lbzu`
150 which updates the Effective Address (in RA) **after** it has been used,
151 not before. Also of note is the element-width override of `dw=8` which
152 writes to *individual bytes* of r6, not to r6/7/8...13.
153
154 Of note here however is the precursor `addi` instruction, which clears
155 out the contents of r6 to zero before beginning the Vector/Looped Load
156 sequence. This is *important* because SV does **not** zero-out parts
157 of a register when element-width overrides are in use.
158
159 ```
160 # load VL bytes (update r4 addr) but compressed (dw=8)
161 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
162 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
163 ```
164
165 The next instruction is quite straightforward, and has the effect of
166 turning (transposing) 8 values. Each bit zero of each input byte is
167 placed into the first byte; each bit one of each input byte is placed
168 into the second byte and so on.
169
170 (*This instruction in VSX is called `vgbbd`, and it is present in many
171 other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`, x86 calls it
172 GF2P7AFFINEQB. Power ISA, Cray, and many others all have this extremely
173 powerful and useful instruction*)
174
175 ```
176 # gather performs the transpose (which gets us to positional..)
177 gbbd 8,6
178 ```
179
180 Now the bits have been transposed in bytes, it is plain sailing to
181 perform QTY8 parallel popcounts. However there is a trick going on:
182 we have set VL=MVL=8. To explain this: it covers the last case where
183 CTR may be between 1 and 8.
184
185 Remember what happened back at the Vector-Load, where r6 was cleared to
186 zero before-hand? This filled out the 8x8 transposed grid (`gbbd`) fully
187 with zeros prior to the actual transpose. Now when we do the popcount,
188 there will be upper-numbered columns that are *not part of the result*
189 that contain *zero* and *consequently have no impact on the result*.
190
191 This elegant trick extends even to the accumulation of the
192 results. However before we get there it is necessary to describe why
193 `sw=8` has been added to `sv.popcntd`. What this is doing is treating
194 each **byte** of its input (starting at the first byte of r8) as an
195 independent quantity to be popcounted, but the result is *zero-extended*
196 to 64-bit and stored in r24, r25, r26 ... r31.
197
198 Therefore:
199
200 * r24 contains the popcount of the first byte of r8
201 * r25 contains the popcount of the second byte of r8
202 * ...
203 * r31 contains the popcount of the last (7th) byte of r8
204
205 ```
206 # now those bits have been turned around, popcount and sum them
207 setvl 0,0,8,0,1,1 # set MVL=VL=8
208 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
209 ```
210
211 With VL being hard-set to 8, we can now Vector-Parallel sum the partial
212 results r24-31 into the array of accumulators r16-r23. There was no need
213 to set VL=8 again because it was set already by the `setvl` above.
214
215 ```
216 sv.add *16,*16,*24 # and accumulate in results
217 ```
218
219 Finally, `sv.bc` loops back, subtracting VL from CTR. Being based on
220 Power ISA Scalar Branch-Conditional, if CTR goes negative (which it will
221 because VL was set hard-coded to 8, above) it does not matter, the loop
222 will still be correctly exited. In other words we complete the correct
223 number of *blocks* (of length 8).
224
225 ```
226 # branch back if CTR still non-zero. works even though VL=8
227 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
228 ```
229
230 # Improvements
231
232 There exist many opportunities for parallelism that simpler hardware would
233 need to have in order to maximise performance. On Out-of-Order hardware
234 the above extremely simple and very clear algorithm will achieve extreme
235 high levels of performance simply by exploiting standard Multi-Issue
236 Register Hazard Management.
237
238 However simpler hardware - in-order - will need a little bit of
239 assistance, and that can come in the form of expanding to QTY4 or QTY8
240 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), `gbbd`
241 becomes an `sv.gbbd` but VL being set to the block count (QTY4 or QTY8),
242 and the SV REMAP Parallel Reduction Schedule being applied to each
243 intermediary result rather than using an array of straight accumulators
244 `r16-r23`. However this starts to push the boundaries of the number of
245 registers needed, so as an exercise is left for another time.
246
247 # Conclusion
248
249 Where a normal SIMD ISA requires explicit hand-crafted optimisation in
250 order to achieve full utilisation of the underlying hardware, Simple-V
251 instead can rely to a large extent on standard Multi-Issue hardware
252 to achieve similar performance, whilst crucially keeping the algorithm
253 implementation down to a shockingly-simple degree that makes it easy to
254 understand an easy to review. Again also as with many other algorithms
255 when implemented in Simple-V SVP54, by keeping to a LOAD-COMPUTE-STORE
256 paradigm the L1 Data Cache usage is minimised, and in this case just
257 as with chacha20 the entire algorithm, being only 9 lines of assembler
258 fitting into 13 4-byte words it can fit into a single L1 I-Cache Line
259 without triggering Virtual Memory TLB misses.
260
261 Further performance improvements are achievable by using REMAP Parallel
262 Reduction, still fitting into a single L1 Cache line, but beginning to
263 approach the limit of the 128-long register file.
264
265 [[!tag svp64_cookbook ]]