516db6d2ca93b3f761a7cf60354dcfbf050ab29b
[libreriscv.git] / openpower / sv / cookbook / pospopcnt.mdwn
1 # Positional popcount SVP64
2
3 **Links**
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
6 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
7 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
8 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
9
10 **Funding**
11
12 This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA).
13
14 [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
15 [<img src="https://nlnet.nl/image/logos/NGIAssure_tag.svg" alt="NGI Assure Logo" width="20%" />](https://nlnet.nl/assure)
16
17 **Introduction**
18
19 Positional popcount in optimised assembler is typically done on SIMD ISAs in
20 around 500 lines. Power ISA thanks to `bpermd` can be much more efficient:
21 with SVP64 even more so. The reference implementation showing the concept
22 is below.
23
24 ```
25 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
26 // count8 reference implementation for tests. Do not alter.
27 func count8safe(counts *[8]int, buf []uint8) {
28 for i := range buf {
29 for j := 0; j < 8; j++ {
30 counts[j] += int(buf[i] >> j & 1)
31 }
32 }
33 }
34 ```
35
36 A simple but still hardware-paralleliseable SVP64 assembler for
37 8-bit input values (`count8safe`) is as follows:
38
39 ```
40 mtspr 9, 3 # move r3 to CTR
41 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
42 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
43 # load VL bytes (update r4 addr) but compressed (dw=8)
44 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
45 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
46 # gather performs the transpose (which gets us to positional..)
47 gbbd 8,6
48 # now those bits have been turned around, popcount and sum them
49 setvl 0,0,8,0,1,1 # set MVL=VL=8
50 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
51 sv.add *16,*16,*24 # and accumulate in results
52 # branch back if CTR still non-zero. works even though VL=8
53 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
54 ```
55
56
57 Array popcount is just standard popcount function
58 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
59 however positional popcount is different (vertical). Refer to Fig.1
60
61
62 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
63 alt="pospopcnt" width="70%" />
64
65 Positional popcount adds up the totals of each bit set to 1 in each
66 bit-position, of an array of input values. Refer to Fig.2
67
68 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
69 alt="pospopcnt" width="60%" />
70
71
72 # Visual representation of the pospopcount algorithm
73 In order to perform positional popcount we need to go
74 through series of steps shown below in figures 3, 4, 5 & 6.
75 The limitation to overcome is that the CPU can only work
76 on registers (horizontally) but the requirement of pospopcount
77 is to add *vertically*. Part of the challenge is therefore
78 to perform an appropriate transpose of the data, in blocks
79 that suit the processor and the ISA capacity.
80
81 Fig.3 depicts how the data is divided into blocks of 8*8-bit.
82
83 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
84 alt="pospopcnt" width="70%" />
85
86 Fig.4 shows that a transpose is needed on each block.
87 The gbbd instruction is used for implementing the transpose
88 function, preparing the data for using the standard popcount
89 instruction.
90
91 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
92 alt="pospopcnt" width="60%" />
93
94 Now on each 8*8 block, 8 popcount instructions
95 can be run each of which is independent and therefore can
96 be parallelised even by In-Order multi-issue hardware(Fig.5).
97
98 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
99 alt="pospopcnt" width="70%" />
100
101 Fig.6 depicts how each of the intermediate results are
102 accumulated. It is worth noting that each intermediate result
103 is independent of the other intermediate results and also
104 parallel reduction can be applied to all of them
105 individually. This gives *two* opportunities for
106 hardware parallelism rather than one.
107
108 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
109 alt="pospopcnt" width="100%" />
110
111 In short this algorithm is very straightforward to implement thanks to the two
112 crucial instructions, `gbbd` and `popcntd`. Below is a walkthrough of the
113 assembler, keeping it very simple, and exploiting only one of the opportunities
114 for parallelism (by not including the Parallel Reduction opportunity mentioned
115 above).
116
117 # Walkthrough of the assembler
118
119 Firstly the CTR (Counter) SPR is set up, and is key to looping
120 as outlined further, below
121
122 ```
123 mtspr 9, 3 # move r3 to CTR
124 ```
125
126 The Vector Length, which is limited to 8 (MVL - Maximum
127 Vector Length) is set up. A special "CTR" Mode is used
128 which automatically uses the CTR SPR rather than register
129 RA. (*Note that RA is set to zero to indicate this, because there is
130 limited encoding space. See [[openpower/sv/setvl]] instruction
131 specification for details)*.
132
133 The result of this instruction is that if CTR is greater than
134 8, VL is set to 8. If however CTR is less than or equal to 8,
135 then VL is set to CTR. Additionally, a copy of VL is placed
136 into RT (r3 in this case), which again is necessary as part
137 of the limited encoding space but in some cases (not here)
138 this is desirable, and avoids a `mfspr` instruction to take
139 a copy of VL into a GPR.
140
141 ```
142 # VL = r3 = MIN(CTR,MAXVL=8)
143 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
144 ```
145
146 Next the data must be loaded in batches (of up to QTY 8of
147 8-bit values). We must also not over-run the end of the
148 array (which a normal SIMD ISA would do, potentially causing
149 a pagefault) and this is achieved by when CTR <= 8: VL
150 (Vector Length) in such circumstances being set to the
151 remaining elements.
152
153 The next instruction is `sv.lbzu/pi` - a Post-Increment
154 variant of `lbzu` which updates the Effective Address (in RA)
155 **after** it has been used, not before. Also of note is the
156 element-width override of `dw=8` which writes to *individual
157 bytes* of r6, not to r6/7/8...13.
158
159 Of note here however is the precursor `addi` instruction, which
160 clears out the contents of r6 to zero before beginning the
161 Vector/Looped Load sequence. This is *important* because
162 SV does **not** zero-out parts of a register when element-width
163 overrides are in use.
164
165 ```
166 # load VL bytes (update r4 addr) but compressed (dw=8)
167 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
168 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
169 ```
170
171 The next instruction is quite straightforward, and has the
172 effect of turning (transposing) 8 values. Each bit zero
173 of each input byte is placed into the first byte; each bit one
174 of each input byte is placed into the second byte and so on.
175
176 (*This instruction in VSX is called `vgbbd`, and it is present
177 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
178 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
179 all have this extremely powerful and useful instruction*)
180
181 ```
182 # gather performs the transpose (which gets us to positional..)
183 gbbd 8,6
184 ```
185
186 Now the bits have been transposed in bytes, it is plain sailing
187 to perform QTY8 parallel popcounts. However there is a trick
188 going on: we have set VL=MVL=8. To explain this: it covers the
189 last case where CTR may be between 1 and 8.
190
191 Remember what happened back at the Vector-Load, where r6
192 was cleared to zero before-hand? This filled out the 8x8 transposed
193 grid (`gbbd`) fully with zeros prior to the actual transpose.
194 Now when we do the popcount, there will be upper-numbered
195 columns that are *not part of the result* that contain *zero*
196 and *consequently have no impact on the result*.
197
198 This elegant trick extends even to the accumulation of the
199 results. However before we get there it is necessary to
200 describe why `sw=8` has been added to `sv.popcntd`. What
201 this is doing is treating each **byte** of its input
202 (starting at the first byte of r8) as an independent
203 quantity to be popcounted, but the result is *zero-extended*
204 to 64-bit and stored in r24, r25, r26 ... r31.
205
206 Therefore:
207
208 * r24 contains the popcount of the first byte of r8
209 * r25 contains the popcount of the second byte of r8
210 * ...
211 * r31 contains the popcount of the last (7th) byte of r8
212
213 ```
214 # now those bits have been turned around, popcount and sum them
215 setvl 0,0,8,0,1,1 # set MVL=VL=8
216 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
217 ```
218
219 With VL being hard-set to 8, we can now Vector-Parallel
220 sum the partial results r24-31 into the array of accumulators
221 r16-r23. There was no need to set VL=8 again because it was
222 set already by the `setvl` above.
223
224 ```
225 sv.add *16,*16,*24 # and accumulate in results
226 ```
227
228 Finally, `sv.bc` loops back, subtracting VL from CTR.
229 Being based on Power ISA Scalar Branch-Conditional, if
230 CTR goes negative (which it will because VL was set
231 hard-coded to 8, above) it does not matter, the loop
232 will still be correctly exited. In other words we complete
233 the correct number of *blocks* (of length 8).
234
235 ```
236 # branch back if CTR still non-zero. works even though VL=8
237 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
238 ```
239
240 # Improvements
241
242 There exist many opportunities for parallelism that simpler hardware
243 would need to have in order to maximise performance. On Out-of-Order
244 hardware the above extremely simple and very clear algorithm will
245 achieve extreme high levels of performance simply by exploiting
246 standard Multi-Issue Register Hazard Management.
247
248 However simpler hardware - in-order - will need a little bit of
249 assistance, and that can come in the form of expanding to QTY4 or
250 QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64),
251 `gbbd` becomes an `sv.gbbd` but VL being set to the block count
252 (QTY4 or QTY8), and the SV REMAP Parallel Reduction Schedule being
253 applied to each intermediary result rather than using an array
254 of straight accumulators `r16-r23`. However this starts to push
255 the boundaries of the number of registers needed, so as an
256 exercise is left for another time.
257
258 # Conclusion
259
260 Where a normal SIMD ISA requires explicit hand-crafted optimisation
261 in order to achieve full utilisation of the underlying hardware,
262 Simple-V instead can rely to a large extent on standard Multi-Issue
263 hardware to achieve similar performance, whilst crucially keeping the
264 algorithm implementation down to a shockingly-simple degree that makes
265 it easy to understand an easy to review. Again also as with many
266 other algorithms when implemented in Simple-V SVP54, by keeping to
267 a LOAD-COMPUTE-STORE paradigm the L1 Data Cache usage is minimised,
268 and in this case just as with chacha20 the entire algorithm, being
269 only 9 lines of assembler fitting into 13 4-byte words it can fit
270 into a single L1 I-Cache Line without triggering Virtual Memory TLB
271 misses.
272
273 Further performance improvements are achievable by using REMAP
274 Parallel Reduction, still fitting into a single L1 Cache line,
275 but beginning to approach the limit of the 128-long register file.
276
277 [[!tag svp64_cookbook ]]