add funding link to nlnet bug #672
[libreriscv.git] / openpower / sv / cookbook / pospopcnt.mdwn
1 # Positional popcount SVP64
2
3 **Links**
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
6 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
7 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
8 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
9
10 **Funding**
11
12 This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA).
13
14 [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
15 [<img src="https://nlnet.nl/image/logos/NGIAssure_tag.svg" alt="NGI Assure Logo" width="20%" />](https://nlnet.nl/assure)
16
17 **Introduction**
18
19 Positional popcount in optimised assembler is typically done on SIMD ISAs in
20 around 500 lines. Power ISA thanks to `bpermd` can be much more efficient:
21 with SVP64 even more so. The reference implementation showing the concept
22 is below.
23
24 ```
25 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
26 // count8 reference implementation for tests. Do not alter.
27 func count8safe(counts *[8]int, buf []uint8) {
28 for i := range buf {
29 for j := 0; j < 8; j++ {
30 counts[j] += int(buf[i] >> j & 1)
31 }
32 }
33 }
34 ```
35
36 A simple but still hardware-paralleliseable SVP64 assembler for
37 8-bit input values (`count8safe`) is as follows:
38
39 ```
40 mtspr 9, 3 # move r3 to CTR
41 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
42 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
43 # load VL bytes (update r4 addr) but compressed (dw=8)
44 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
45 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
46 # gather performs the transpose (which gets us to positional..)
47 gbbd 8,6
48 # now those bits have been turned around, popcount and sum them
49 setvl 0,0,8,0,1,1 # set MVL=VL=8
50 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
51 sv.add *16,*16,*24 # and accumulate in results
52 # branch back if CTR still non-zero. works even though VL=8
53 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
54 ```
55
56
57
58 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
59 alt="pospopcnt" width="70%" />
60
61 <br />
62
63 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
64 alt="pospopcnt" width="60%" />
65
66
67
68 Array popcount is just standard popcount function
69 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
70 however positional popcount is different (vertical). Refer to Fig.1
71
72
73
74 Positional popcount adds up the totals of each bit set to 1 in each
75 bit-position, of an array of input values. Refer to Fig.2
76
77 <br />
78
79
80 # Visual representation of the pospopcount algorithm
81
82 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
83 alt="pospopcnt" width="70%" />
84
85 <br />
86
87 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
88 alt="pospopcnt" width="60%" />
89
90
91 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
92 alt="pospopcnt" width="70%" />
93
94 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
95 alt="pospopcnt" width="100%" />
96
97
98
99 # Walkthrough of the assembler
100
101 Firstly the CTR (Counter) SPR is set up, and is key to looping
102 as outlined further, below
103
104 ```
105 mtspr 9, 3 # move r3 to CTR
106 ```
107
108 The Vector Length, which is limited to 8 (MVL - Maximum
109 Vector Length) is set up. A special "CTR" Mode is used
110 which automatically uses the CTR SPR rather than register
111 RA. (*Note that RA is set to zero to indicate this, because there is
112 limited encoding space. See [[openpower/sv/setvl]] instruction
113 specification for details)*.
114
115 The result of this instruction is that if CTR is greater than
116 8, VL is set to 8. If however CTR is less than or equal to 8,
117 then VL is set to CTR. Additionally, a copy of VL is placed
118 into RT (r3 in this case), which again is necessary as part
119 of the limited encoding space but in some cases (not here)
120 this is desirable, and avoids a `mfspr` instruction to take
121 a copy of VL into a GPR.
122
123 ```
124 # VL = r3 = MIN(CTR,MAXVL=8)
125 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
126 ```
127
128 Next the data must be loaded in batches (of up to QTY 8of
129 8-bit values). We must also not over-run the end of the
130 array (which a normal SIMD ISA would do, potentially causing
131 a pagefault) and this is achieved by when CTR <= 8: VL
132 (Vector Length) in such circumstances being set to the
133 remaining elements.
134
135 The next instruction is `sv.lbzu/pi` - a Post-Increment
136 variant of `lbzu` which updates the Effective Address (in RA)
137 **after** it has been used, not before. Also of note is the
138 element-width override of `dw=8` which writes to *individual
139 bytes* of r6, not to r6/7/8...13.
140
141 Of note here however is the precursor `addi` instruction, which
142 clears out the contents of r6 to zero before beginning the
143 Vector/Looped Load sequence. This is *important* because
144 SV does **not** zero-out parts of a register when element-width
145 overrides are in use.
146
147 ```
148 # load VL bytes (update r4 addr) but compressed (dw=8)
149 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
150 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
151 ```
152
153 The next instruction is quite straightforward, and has the
154 effect of turning (transposing) 8 values. Each bit zero
155 of each input byte is placed into the first byte; each bit one
156 of each input byte is placed into the second byte and so on.
157
158 (*This instruction in VSX is called `vgbbd`, and it is present
159 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
160 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
161 all have this extremely powerful and useful instruction*)
162
163 ```
164 # gather performs the transpose (which gets us to positional..)
165 gbbd 8,6
166 ```
167
168 Now the bits have been transposed in bytes, it is plain sailing
169 to perform QTY8 parallel popcounts. However there is a trick
170 going on: we have set VL=MVL=8. To explain this: it covers the
171 last case where CTR may be between 1 and 8.
172
173 Remember what happened back at the Vector-Load, where r6
174 was cleared to zero before-hand? This filled out the 8x8 transposed
175 grid (`gbbd`) fully with zeros prior to the actual transpose.
176 Now when we do the popcount, there will be upper-numbered
177 columns that are *not part of the result* that contain *zero*
178 and *consequently have no impact on the result*.
179
180 This elegant trick extends even to the accumulation of the
181 results. However before we get there it is necessary to
182 describe why `sw=8` has been added to `sv.popcntd`. What
183 this is doing is treating each **byte** of its input
184 (starting at the first byte of r8) as an independent
185 quantity to be popcounted, but the result is *zero-extended*
186 to 64-bit and stored in r24, r25, r26 ... r31.
187
188 Therefore:
189
190 * r24 contains the popcount of the first byte of r8
191 * r25 contains the popcount of the second byte of r8
192 * ...
193 * r31 contains the popcount of the last (7th) byte of r8
194
195 ```
196 # now those bits have been turned around, popcount and sum them
197 setvl 0,0,8,0,1,1 # set MVL=VL=8
198 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
199 ```
200
201 With VL being hard-set to 8, we can now Vector-Parallel
202 sum the partial results r24-31 into the array of accumulators
203 r16-r23. There was no need to set VL=8 again because it was
204 set already by the `setvl` above.
205
206 ```
207 sv.add *16,*16,*24 # and accumulate in results
208 ```
209
210 Finally, `sv.bc` loops back, subtracting VL from CTR.
211 Being based on Power ISA Scalar Branch-Conditional, if
212 CTR goes negative (which it will because VL was set
213 hard-coded to 8, above) it does not matter, the loop
214 will still be correctly exited. In other words we complete
215 the correct number of *blocks* (of length 8).
216
217 ```
218 # branch back if CTR still non-zero. works even though VL=8
219 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
220 ```
221
222 # Improvements
223
224 There exist many opportunities for parallelism that simpler hardware
225 would need to have in order to maximise performance. On Out-of-Order
226 hardware the above extremely simple and very clear algorithm will
227 achieve extreme high levels of performance simply by exploiting
228 standard Multi-Issue Register Hazard Management.
229
230 However simpler hardware - in-order - will need a little bit of
231 assistance, and that can come in the form of expanding to QTY4 or
232 QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64),
233 `gbbd` becomes an `sv.gbbd` but VL being set to the block count
234 (QTY4 or QTY8), and the SV REMAP Parallel Reduction Schedule being
235 applied to each intermediary result rather than using an array
236 of straight accumulators `r16-r23`. However this starts to push
237 the boundaries of the number of registers needed, so as an
238 exercise is left for another time.
239
240 # Conclusion
241
242 Where a normal SIMD ISA requires explicit hand-crafted optimisation
243 in order to achieve full utilisation of the underlying hardware,
244 Simple-V instead can rely to a large extent on standard Multi-Issue
245 hardware to achieve similar performance, whilst crucially keeping the
246 algorithm implementation down to a shockingly-simple degree that makes
247 it easy to understand an easy to review. Again also as with many
248 other algorithms when implemented in Simple-V SVP54, by keeping to
249 a LOAD-COMPUTE-STORE paradigm the L1 Data Cache usage is minimised,
250 and in this case just as with chacha20 the entire algorithm, being
251 only 9 lines of assembler fitting into 13 4-byte words it can fit
252 into a single L1 I-Cache Line without triggering Virtual Memory TLB
253 misses.
254
255 Further performance improvements are achievable by using REMAP
256 Parallel Reduction, still fitting into a single L1 Cache line,
257 but beginning to approach the limit of the 128-long register file.
258
259 [[!tag svp64_cookbook ]]