d3db8bb422a8b56b322cbb3a6708994a526740cb
[libreriscv.git] / openpower / sv / cookbook / pospopcnt.mdwn
1 # Positional popcount SVP64
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
4 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
5 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
6 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
7
8 Positional popcount in optimised assembler is typically done on SIMD ISAs in
9 around 500 lines. Power ISA thanks to `bpermd` can be much more efficient:
10 with SVP64 even more so. The reference implementation showing the concept
11 is below.
12
13 ```
14 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
15 // count8 reference implementation for tests. Do not alter.
16 func count8safe(counts *[8]int, buf []uint8) {
17 for i := range buf {
18 for j := 0; j < 8; j++ {
19 counts[j] += int(buf[i] >> j & 1)
20 }
21 }
22 }
23 ```
24
25 A simple but still hardware-paralleliseable SVP64 assembler for
26 8-bit input values (`count8safe`) is as follows:
27
28 ```
29 mtspr 9, 3 # move r3 to CTR
30 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
31 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
32 # load VL bytes (update r4 addr) but compressed (dw=8)
33 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
34 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
35 # gather performs the transpose (which gets us to positional..)
36 gbbd 8,6
37 # now those bits have been turned around, popcount and sum them
38 setvl 0,0,8,0,1,1 # set MVL=VL=8
39 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
40 sv.add *16,*16,*24 # and accumulate in results
41 # branch back if CTR still non-zero. works even though VL=8
42 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
43 ```
44
45
46
47 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
48 alt="pospopcnt" width="70%" />
49
50 <br />
51
52 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
53 alt="pospopcnt" width="60%" />
54
55
56
57 Array popcount is just standard popcount function
58 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
59 however positional popcount is different (vertical). Refer to Fig.1
60
61
62
63 Positional popcount adds up the totals of each bit set to 1 in each
64 bit-position, of an array of input values. Refer to Fig.2
65
66 <br />
67
68
69 # Visual representation of the pospopcount algorithm
70
71 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
72 alt="pospopcnt" width="70%" />
73
74 <br />
75
76 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
77 alt="pospopcnt" width="60%" />
78
79
80 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
81 alt="pospopcnt" width="70%" />
82
83 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
84 alt="pospopcnt" width="100%" />
85
86
87
88 # Walkthrough of the assembler
89
90 Firstly the CTR (Counter) SPR is set up, and is key to looping
91 as outlined further, below
92
93 ```
94 mtspr 9, 3 # move r3 to CTR
95 ```
96
97 The Vector Length, which is limited to 8 (MVL - Maximum
98 Vector Length) is set up. A special "CTR" Mode is used
99 which automatically uses the CTR SPR rather than register
100 RA. (*Note that RA is set to zero to indicate this, because there is
101 limited encoding space. See [[openpower/sv/setvl]] instruction
102 specification for details)*.
103
104 The result of this instruction is that if CTR is greater than
105 8, VL is set to 8. If however CTR is less than or equal to 8,
106 then VL is set to CTR. Additionally, a copy of VL is placed
107 into RT (r3 in this case), which again is necessary as part
108 of the limited encoding space but in some cases (not here)
109 this is desirable, and avoids a `mfspr` instruction to take
110 a copy of VL into a GPR.
111
112 ```
113 # VL = r3 = MIN(CTR,MAXVL=8)
114 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
115 ```
116
117 Next the data must be loaded in batches (of up to QTY 8of
118 8-bit values). We must also not over-run the end of the
119 array (which a normal SIMD ISA would do, potentially causing
120 a pagefault) and this is achieved by when CTR <= 8: VL
121 (Vector Length) in such circumstances being set to the
122 remaining elements.
123
124 The next instruction is `sv.lbzu/pi` - a Post-Increment
125 variant of `lbzu` which updates the Effective Address (in RA)
126 **after** it has been used, not before. Also of note is the
127 element-width override of `dw=8` which writes to *individual
128 bytes* of r6, not to r6/7/8...13.
129
130 Of note here however is the precursor `addi` instruction, which
131 clears out the contents of r6 to zero before beginning the
132 Vector/Looped Load sequence. This is *important* because
133 SV does **not** zero-out parts of a register when element-width
134 overrides are in use.
135
136 ```
137 # load VL bytes (update r4 addr) but compressed (dw=8)
138 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
139 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
140 ```
141
142 The next instruction is quite straightforward, and has the
143 effect of turning (transposing) 8 values. Each bit zero
144 of each input byte is placed into the first byte; each bit one
145 of each input byte is placed into the second byte and so on.
146
147 (*This instruction in VSX is called `vgbbd`, and it is present
148 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
149 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
150 all have this extremely powerful and useful instruction*)
151
152 ```
153 # gather performs the transpose (which gets us to positional..)
154 gbbd 8,6
155 ```
156
157 Now the bits have been transposed in bytes, it is plain sailing
158 to perform QTY8 parallel popcounts. However there is a trick
159 going on: we have set VL=MVL=8. To explain this: it covers the
160 last case where CTR may be between 1 and 8.
161
162 Remember what happened back at the Vector-Load, where r6
163 was cleared to zero before-hand? This filled out the 8x8 transposed
164 grid (`gbbd`) fully with zeros prior to the actual transpose.
165 Now when we do the popcount, there will be upper-numbered
166 columns that are *not part of the result* that contain *zero*
167 and *consequently have no impact on the result*.
168
169 This elegant trick extends even to the accumulation of the
170 results. However before we get there it is necessary to
171 describe why `sw=8` has been added to `sv.popcntd`. What
172 this is doing is treating each **byte** of its input
173 (starting at the first byte of r8) as an independent
174 quantity to be popcounted, but the result is *zero-extended*
175 to 64-bit and stored in r24, r25, r26 ... r31.
176
177 Therefore:
178
179 * r24 contains the popcount of the first byte of r8
180 * r25 contains the popcount of the second byte of r8
181 * ...
182 * r31 contains the popcount of the last (7th) byte of r8
183
184 ```
185 # now those bits have been turned around, popcount and sum them
186 setvl 0,0,8,0,1,1 # set MVL=VL=8
187 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
188 ```
189
190 With VL being hard-set to 8, we can now Vector-Parallel
191 sum the partial results r24-31 into the array of accumulators
192 r16-r23. There was no need to set VL=8 again because it was
193 set already by the `setvl` above.
194
195 ```
196 sv.add *16,*16,*24 # and accumulate in results
197 ```
198
199 Finally, `sv.bc` loops back, subtracting VL from CTR.
200 Being based on Power ISA Scalar Branch-Conditional, if
201 CTR goes negative (which it will because VL was set
202 hard-coded to 8, above) it does not matter, the loop
203 will still be correctly exited. In other words we complete
204 the correct number of *blocks* (of length 8).
205
206 ```
207 # branch back if CTR still non-zero. works even though VL=8
208 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
209 ```
210
211 # Improvements
212
213 There exist many opportunities for parallelism that simpler hardware
214 would need to have in order to maximise performance. On Out-of-Order
215 hardware the above extremely simple and very clear algorithm will
216 achieve extreme high levels of performance simply by exploiting
217 standard Multi-Issue Register Hazard Management.
218
219 However simpler hardware - in-order - will need a little bit of
220 assistance, and that can come in the form of expanding to QTY4 or
221 QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64),
222 `gbbd` becomes an `sv.gbbd` but VL being set to the block count
223 (QTY4 or QTY8), and the SV REMAP Parallel Reduction Schedule being
224 applied to each intermediary result rather than using an array
225 of straight accumulators `r16-r23`. However this starts to push
226 the boundaries of the number of registers needed, so as an
227 exercise is left for another time.
228
229 # Conclusion
230
231 Where a normal SIMD ISA requires explicit hand-crafted optimisation
232 in order to achieve full utilisation of the underlying hardware,
233 Simple-V instead can rely to a large extent on standard Multi-Issue
234 hardware to achieve similar performance, whilst crucially keeping the
235 algorithm implementation down to a shockingly-simple degree that makes
236 it easy to understand an easy to review. Again also as with many
237 other algorithms when implemented in Simple-V SVP54, by keeping to
238 a LOAD-COMPUTE-STORE paradigm the L1 Data Cache usage is minimised,
239 and in this case just as with chacha20 the entire algorithm, being
240 only 9 lines of assembler fitting into 13 4-byte words it can fit
241 into a single L1 I-Cache Line without triggering Virtual Memory TLB
242 misses.
243
244 Further performance improvements are achievable by using REMAP
245 Parallel Reduction, still fitting into a single L1 Cache line,
246 but beginning to approach the limit of the 128-long register file.
247
248 [[!tag svp64_cookbook ]]