(no commit message)
[libreriscv.git] / openpower / sv / cookbook / pospopcnt.mdwn
1 # Positional popcount SVP64
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
4 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
5 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
6 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
7
8 Positional popcount in optimised assembler is typically done on SIMD ISAs in
9 around 500 lines. Power ISA thanks to `bpermd` can be much more efficient:
10 with SVP64 even more so. The reference implementation showing the concept
11 is below.
12
13 ```
14 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
15 // count8 reference implementation for tests. Do not alter.
16 func count8safe(counts *[8]int, buf []uint8) {
17 for i := range buf {
18 for j := 0; j < 8; j++ {
19 counts[j] += int(buf[i] >> j & 1)
20 }
21 }
22 }
23 ```
24
25 A simple but still hardware-paralleliseable SVP64 assembler for
26 8-bit input values (`count8safe`) is as follows:
27
28 ```
29 mtspr 9, 3 # move r3 to CTR
30 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
31 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
32 # load VL bytes (update r4 addr) but compressed (dw=8)
33 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
34 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
35 # gather performs the transpose (which gets us to positional..)
36 gbbd 8,6
37 # now those bits have been turned around, popcount and sum them
38 setvl 0,0,8,0,1,1 # set MVL=VL=8
39 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
40 sv.add *16,*16,*24 # and accumulate in results
41 # branch back if CTR still non-zero. works even though VL=8
42 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
43 ```
44
45 Array popcount is just standard popcount function
46 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
47 however positional popcount is different (vertical)
48
49
50
51 Positional popcount adds up the totals of each bit set to 1 in each
52 bit-position, of an array of input values.
53
54
55 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
56 alt="pospopcnt" width="70%" />
57
58 <br />
59
60 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
61 alt="pospopcnt" width="60%" />
62
63 <br />
64
65
66 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
67 alt="pospopcnt" width="70%" />
68
69 <br />
70
71 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
72 alt="pospopcnt" width="60%" />
73
74
75 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
76 alt="pospopcnt" width="70%" />
77
78 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
79 alt="pospopcnt" width="100%" />
80
81
82 # Visual representation of the pospopcount algorithm
83
84 # Walkthrough of the assembler
85
86 Firstly the CTR (Counter) SPR is set up, and is key to looping
87 as outlined further, below
88
89 ```
90 mtspr 9, 3 # move r3 to CTR
91 ```
92
93 The Vector Length, which is limited to 8 (MVL - Maximum
94 Vector Length) is set up. A special "CTR" Mode is used
95 which automatically uses the CTR SPR rather than register
96 RA. (*Note that RA is set to zero to indicate this, because there is
97 limited encoding space. See [[openpower/sv/setvl]] instruction
98 specification for details)*.
99
100 The result of this instruction is that if CTR is greater than
101 8, VL is set to 8. If however CTR is less than or equal to 8,
102 then VL is set to CTR. Additionally, a copy of VL is placed
103 into RT (r3 in this case), which again is necessary as part
104 of the limited encoding space but in some cases (not here)
105 this is desirable, and avoids a `mfspr` instruction to take
106 a copy of VL into a GPR.
107
108 ```
109 # VL = r3 = MIN(CTR,MAXVL=8)
110 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
111 ```
112
113 Next the data must be loaded in batches (of up to QTY 8of
114 8-bit values). We must also not over-run the end of the
115 array (which a normal SIMD ISA would do, potentially causing
116 a pagefault) and this is achieved by when CTR <= 8: VL
117 (Vector Length) in such circumstances being set to the
118 remaining elements.
119
120 The next instruction is `sv.lbzu/pi` - a Post-Increment
121 variant of `lbzu` which updates the Effective Address (in RA)
122 **after** it has been used, not before. Also of note is the
123 element-width override of `dw=8` which writes to *individual
124 bytes* of r6, not to r6/7/8...13.
125
126 Of note here however is the precursor `addi` instruction, which
127 clears out the contents of r6 to zero before beginning the
128 Vector/Looped Load sequence. This is *important* because
129 SV does **not** zero-out parts of a register when element-width
130 overrides are in use.
131
132 ```
133 # load VL bytes (update r4 addr) but compressed (dw=8)
134 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
135 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
136 ```
137
138 The next instruction is quite straightforward, and has the
139 effect of turning (transposing) 8 values. Each bit zero
140 of each input byte is placed into the first byte; each bit one
141 of each input byte is placed into the second byte and so on.
142
143 (*This instruction in VSX is called `vgbbd`, and it is present
144 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
145 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
146 all have this extremely powerful and useful instruction*)
147
148 ```
149 # gather performs the transpose (which gets us to positional..)
150 gbbd 8,6
151 ```
152
153 Now the bits have been transposed in bytes, it is plain sailing
154 to perform QTY8 parallel popcounts. However there is a trick
155 going on: we have set VL=MVL=8. To explain this: it covers the
156 last case where CTR may be between 1 and 8.
157
158 Remember what happened back at the Vector-Load, where r6
159 was cleared to zero before-hand? This filled out the 8x8 transposed
160 grid (`gbbd`) fully with zeros prior to the actual transpose.
161 Now when we do the popcount, there will be upper-numbered
162 columns that are *not part of the result* that contain *zero*
163 and *consequently have no impact on the result*.
164
165 This elegant trick extends even to the accumulation of the
166 results. However before we get there it is necessary to
167 describe why `sw=8` has been added to `sv.popcntd`. What
168 this is doing is treating each **byte** of its input
169 (starting at the first byte of r8) as an independent
170 quantity to be popcounted, but the result is *zero-extended*
171 to 64-bit and stored in r24, r25, r26 ... r31.
172
173 Therefore:
174
175 * r24 contains the popcount of the first byte of r8
176 * r25 contains the popcount of the second byte of r8
177 * ...
178 * r31 contains the popcount of the last (7th) byte of r8
179
180 ```
181 # now those bits have been turned around, popcount and sum them
182 setvl 0,0,8,0,1,1 # set MVL=VL=8
183 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
184 ```
185
186 With VL being hard-set to 8, we can now Vector-Parallel
187 sum the partial results r24-31 into the array of accumulators
188 r16-r23. There was no need to set VL=8 again because it was
189 set already by the `setvl` above.
190
191 ```
192 sv.add *16,*16,*24 # and accumulate in results
193 ```
194
195 Finally, `sv.bc` loops back, subtracting VL from CTR.
196 Being based on Power ISA Scalar Branch-Conditional, if
197 CTR goes negative (which it will because VL was set
198 hard-coded to 8, above) it does not matter, the loop
199 will still be correctly exited. In other words we complete
200 the correct number of *blocks* (of length 8).
201
202 ```
203 # branch back if CTR still non-zero. works even though VL=8
204 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
205 ```
206
207 # Improvements
208
209 There exist many opportunities for parallelism that simpler hardware
210 would need to have in order to maximise performance. On Out-of-Order
211 hardware the above extremely simple and very clear algorithm will
212 achieve extreme high levels of performance simply by exploiting
213 standard Multi-Issue Register Hazard Management.
214
215 However simpler hardware - in-order - will need a little bit of
216 assistance, and that can come in the form of expanding to QTY4 or
217 QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64),
218 `gbbd` becomes an `sv.gbbd` but VL being set to the block count
219 (QTY4 or QTY8), and the SV REMAP Parallel Reduction Schedule being
220 applied to each intermediary result rather than using an array
221 of straight accumulators `r16-r23`. However this starts to push
222 the boundaries of the number of registers needed, so as an
223 exercise is left for another time.
224
225 [[!tag svp64_cookbook ]]