whitespace cleanup
[libreriscv.git] / openpower / sv / cookbook / pospopcnt.mdwn
1 # Positional popcount SVP64
2
3 **Links**
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
6 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
7 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
8 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
9
10 **Funding**
11
12 This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA).
13
14 [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
15 [<img src="https://nlnet.nl/image/logos/NGIAssure_tag.svg" alt="NGI Assure Logo" width="20%" />](https://nlnet.nl/assure)
16
17 **Introduction**
18
19 Positional popcount in optimised assembler is typically done on SIMD
20 ISAs in around 500 lines. Power ISA thanks to `bpermd` can be much
21 more efficient: with SVP64 even more so. The reference implementation
22 showing the concept is below.
23
24 ```
25 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
26 // count8 reference implementation for tests. Do not alter.
27 func count8safe(counts *[8]int, buf []uint8) {
28 for i := range buf {
29 for j := 0; j < 8; j++ {
30 counts[j] += int(buf[i] >> j & 1)
31 }
32 }
33 }
34 ```
35
36 A simple but still hardware-paralleliseable SVP64 assembler for
37 8-bit input values (`count8safe`) is as follows:
38
39 ```
40 mtspr 9, 3 # move r3 to CTR
41 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
42 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
43 # load VL bytes (update r4 addr) but compressed (dw=8)
44 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
45 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
46 # gather performs the transpose (which gets us to positional..)
47 gbbd 8,6
48 # now those bits have been turned around, popcount and sum them
49 setvl 0,0,8,0,1,1 # set MVL=VL=8
50 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
51 sv.add *16,*16,*24 # and accumulate in results
52 # branch back if CTR still non-zero. works even though VL=8
53 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
54 ```
55
56
57 Array popcount is just standard popcount function
58 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
59 however positional popcount is different (vertical). Refer to Fig.1
60
61
62 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
63 alt="pospopcnt" width="70%" />
64
65 Positional popcount adds up the totals of each bit set to 1 in each
66 bit-position, of an array of input values. Refer to Fig.2
67
68 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
69 alt="pospopcnt" width="60%" />
70
71
72 # Visual representation of the pospopcount algorithm
73
74 In order to perform positional popcount we need to go through series of
75 steps shown below in figures 3, 4, 5 & 6. The limitation to overcome is
76 that the CPU can only work on registers (horizontally) but the requirement
77 of pospopcount is to add *vertically*. Part of the challenge is therefore
78 to perform an appropriate transpose of the data, in blocks that suit
79 the processor and the ISA capacity.
80
81 Fig.3 depicts how the data is divided into blocks of 8*8-bit.
82
83 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
84 alt="pospopcnt" width="70%" />
85
86 Fig.4 shows that a transpose is needed on each block. The gbbd
87 instruction is used for implementing the transpose function, preparing
88 the data for using the standard popcount instruction.
89
90 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
91 alt="pospopcnt" width="60%" />
92
93 Now on each 8*8 block, 8 popcount instructions can be run each of
94 which is independent and therefore can be parallelised even by In-Order
95 multi-issue hardware(Fig.5).
96
97 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
98 alt="pospopcnt" width="70%" />
99
100 Fig.6 depicts how each of the intermediate results are accumulated. It
101 is worth noting that each intermediate result is independent of the
102 other intermediate results and also parallel reduction can be applied
103 to all of them individually. This gives *two* opportunities for hardware
104 parallelism rather than one.
105
106 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
107 alt="pospopcnt" width="100%" />
108
109 In short this algorithm is very straightforward to implement thanks to the
110 two crucial instructions, `gbbd` and `popcntd`. Below is a walkthrough
111 of the assembler, keeping it very simple, and exploiting only one of the
112 opportunities for parallelism (by not including the Parallel Reduction
113 opportunity mentioned above).
114
115 # Walkthrough of the assembler
116
117 Firstly the CTR (Counter) SPR is set up, and is key to looping
118 as outlined further, below
119
120 ```
121 mtspr 9, 3 # move r3 to CTR
122 ```
123
124 The Vector Length, which is limited to 8 (MVL - Maximum Vector Length)
125 is set up. A special "CTR" Mode is used which automatically uses the CTR
126 SPR rather than register RA. (*Note that RA is set to zero to indicate
127 this, because there is limited encoding space. See [[openpower/sv/setvl]]
128 instruction specification for details)*.
129
130 The result of this instruction is that if CTR is greater than 8, VL is set
131 to 8. If however CTR is less than or equal to 8, then VL is set to CTR.
132 Additionally, a copy of VL is placed into RT (r3 in this case), which
133 again is necessary as part of the limited encoding space but in some cases
134 (not here) this is desirable, and avoids a `mfspr` instruction to take
135 a copy of VL into a GPR.
136
137 ```
138 # VL = r3 = MIN(CTR,MAXVL=8)
139 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
140 ```
141
142 Next the data must be loaded in batches (of up to QTY 8of 8-bit
143 values). We must also not over-run the end of the array (which a normal
144 SIMD ISA would do, potentially causing a pagefault) and this is achieved
145 by when CTR <= 8: VL (Vector Length) in such circumstances being set to
146 the remaining elements.
147
148 The next instruction is `sv.lbzu/pi` - a Post-Increment variant of `lbzu`
149 which updates the Effective Address (in RA) **after** it has been used,
150 not before. Also of note is the element-width override of `dw=8` which
151 writes to *individual bytes* of r6, not to r6/7/8...13.
152
153 Of note here however is the precursor `addi` instruction, which clears
154 out the contents of r6 to zero before beginning the Vector/Looped Load
155 sequence. This is *important* because SV does **not** zero-out parts
156 of a register when element-width overrides are in use.
157
158 ```
159 # load VL bytes (update r4 addr) but compressed (dw=8)
160 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
161 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
162 ```
163
164 The next instruction is quite straightforward, and has the effect of
165 turning (transposing) 8 values. Each bit zero of each input byte is
166 placed into the first byte; each bit one of each input byte is placed
167 into the second byte and so on.
168
169 (*This instruction in VSX is called `vgbbd`, and it is present in many
170 other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`, x86 calls it
171 GF2P7AFFINEQB. Power ISA, Cray, and many others all have this extremely
172 powerful and useful instruction*)
173
174 ```
175 # gather performs the transpose (which gets us to positional..)
176 gbbd 8,6
177 ```
178
179 Now the bits have been transposed in bytes, it is plain sailing to
180 perform QTY8 parallel popcounts. However there is a trick going on:
181 we have set VL=MVL=8. To explain this: it covers the last case where
182 CTR may be between 1 and 8.
183
184 Remember what happened back at the Vector-Load, where r6 was cleared to
185 zero before-hand? This filled out the 8x8 transposed grid (`gbbd`) fully
186 with zeros prior to the actual transpose. Now when we do the popcount,
187 there will be upper-numbered columns that are *not part of the result*
188 that contain *zero* and *consequently have no impact on the result*.
189
190 This elegant trick extends even to the accumulation of the
191 results. However before we get there it is necessary to describe why
192 `sw=8` has been added to `sv.popcntd`. What this is doing is treating
193 each **byte** of its input (starting at the first byte of r8) as an
194 independent quantity to be popcounted, but the result is *zero-extended*
195 to 64-bit and stored in r24, r25, r26 ... r31.
196
197 Therefore:
198
199 * r24 contains the popcount of the first byte of r8
200 * r25 contains the popcount of the second byte of r8
201 * ...
202 * r31 contains the popcount of the last (7th) byte of r8
203
204 ```
205 # now those bits have been turned around, popcount and sum them
206 setvl 0,0,8,0,1,1 # set MVL=VL=8
207 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
208 ```
209
210 With VL being hard-set to 8, we can now Vector-Parallel sum the partial
211 results r24-31 into the array of accumulators r16-r23. There was no need
212 to set VL=8 again because it was set already by the `setvl` above.
213
214 ```
215 sv.add *16,*16,*24 # and accumulate in results
216 ```
217
218 Finally, `sv.bc` loops back, subtracting VL from CTR. Being based on
219 Power ISA Scalar Branch-Conditional, if CTR goes negative (which it will
220 because VL was set hard-coded to 8, above) it does not matter, the loop
221 will still be correctly exited. In other words we complete the correct
222 number of *blocks* (of length 8).
223
224 ```
225 # branch back if CTR still non-zero. works even though VL=8
226 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
227 ```
228
229 # Improvements
230
231 There exist many opportunities for parallelism that simpler hardware would
232 need to have in order to maximise performance. On Out-of-Order hardware
233 the above extremely simple and very clear algorithm will achieve extreme
234 high levels of performance simply by exploiting standard Multi-Issue
235 Register Hazard Management.
236
237 However simpler hardware - in-order - will need a little bit of
238 assistance, and that can come in the form of expanding to QTY4 or QTY8
239 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), `gbbd`
240 becomes an `sv.gbbd` but VL being set to the block count (QTY4 or QTY8),
241 and the SV REMAP Parallel Reduction Schedule being applied to each
242 intermediary result rather than using an array of straight accumulators
243 `r16-r23`. However this starts to push the boundaries of the number of
244 registers needed, so as an exercise is left for another time.
245
246 # Conclusion
247
248 Where a normal SIMD ISA requires explicit hand-crafted optimisation in
249 order to achieve full utilisation of the underlying hardware, Simple-V
250 instead can rely to a large extent on standard Multi-Issue hardware
251 to achieve similar performance, whilst crucially keeping the algorithm
252 implementation down to a shockingly-simple degree that makes it easy to
253 understand an easy to review. Again also as with many other algorithms
254 when implemented in Simple-V SVP54, by keeping to a LOAD-COMPUTE-STORE
255 paradigm the L1 Data Cache usage is minimised, and in this case just
256 as with chacha20 the entire algorithm, being only 9 lines of assembler
257 fitting into 13 4-byte words it can fit into a single L1 I-Cache Line
258 without triggering Virtual Memory TLB misses.
259
260 Further performance improvements are achievable by using REMAP Parallel
261 Reduction, still fitting into a single L1 Cache line, but beginning to
262 approach the limit of the 128-long register file.
263
264 [[!tag svp64_cookbook ]]