add sv.bc warm words for pospopcount bug #672
[libreriscv.git] / openpower / sv / cookbook / pospopcnt.mdwn
1 # Positional popcount SVP64
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
4 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
5 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
6 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
7
8 Positional popcount in optimised assembler is typically done on SIMD ISAs in
9 around 500 lines. Power ISA thanks to `bpermd` can be much more efficient:
10 with SVP64 even more so. The reference implementation showing the concept
11 is below.
12
13 ```
14 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
15 // count8 reference implementation for tests. Do not alter.
16 func count8safe(counts *[8]int, buf []uint8) {
17 for i := range buf {
18 for j := 0; j < 8; j++ {
19 counts[j] += int(buf[i] >> j & 1)
20 }
21 }
22 }
23 ```
24
25 A simple but still hardware-paralleliseable SVP64 assembler for
26 8-bit input values (`count8safe`) is as follows:
27
28 ```
29 mtspr 9, 3 # move r3 to CTR
30 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
31 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
32 # load VL bytes (update r4 addr) but compressed (dw=8)
33 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
34 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
35 # gather performs the transpose (which gets us to positional..)
36 gbbd 8,6
37 # now those bits have been turned around, popcount and sum them
38 setvl 0,0,8,0,1,1 # set MVL=VL=8
39 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
40 sv.add *16,*16,*24 # and accumulate in results
41 # branch back if CTR still non-zero. works even though VL=8
42 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
43 ```
44
45 Array popcount is just standard popcount function
46 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
47 however positional popcount is different (vertical)
48
49 <img src="/openpower/sv/cookbook/1_popcount.svg" alt="pospopcnt" width="70%" />
50
51 Positional popcount adds up the totals of each bit set to 1 in each
52 bit-position, of an array of input values.
53
54 <img src="/openpower/sv/cookbook/2_popcount.svg" alt="pospopcnt" width="70%" />
55
56 # Visual representation of the pospopcount algorithm
57
58 # Walkthrough of the assembler
59
60 Firstly the CTR (Counter) SPR is set up, and is key to looping
61 as outlined further, below
62
63 ```
64 mtspr 9, 3 # move r3 to CTR
65 ```
66
67 The Vector Length, which is limited to 8 (MVL - Maximum
68 Vector Length) is set up. A special "CTR" Mode is used
69 which automatically uses the CTR SPR rather than register
70 RA. (*Note that RA is set to zero to indicate this, because there is
71 limited encoding space. See [[openpower/sv/setvl]] instruction
72 specification for details)*.
73
74 The result of this instruction is that if CTR is greater than
75 8, VL is set to 8. If however CTR is less than or equal to 8,
76 then VL is set to CTR. Additionally, a copy of VL is placed
77 into RT (r3 in this case), which again is necessary as part
78 of the limited encoding space but in some cases (not here)
79 this is desirable, and avoids a `mfspr` instruction to take
80 a copy of VL into a GPR.
81
82 ```
83 # VL = r3 = MIN(CTR,MAXVL=8)
84 setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
85 ```
86
87 Next the data must be loaded in batches (of up to QTY 8of
88 8-bit values). We must also not over-run the end of the
89 array (which a normal SIMD ISA would do, potentially causing
90 a pagefault) and this is achieved by when CTR <= 8: VL
91 (Vector Length) in such circumstances being set to the
92 remaining elements.
93
94 The next instruction is `sv.lbzu/pi` - a Post-Increment
95 variant of `lbzu` which updates the Effective Address (in RA)
96 **after** it has been used, not before. Also of note is the
97 element-width override of `dw=8` which writes to *individual
98 bytes* of r6, not to r6/7/8...13.
99
100 Of note here however is the precursor `addi` instruction, which
101 clears out the contents of r6 to zero before beginning the
102 Vector/Looped Load sequence. This is *important* because
103 SV does **not** zero-out parts of a register when element-width
104 overrides are in use.
105
106 ```
107 # load VL bytes (update r4 addr) but compressed (dw=8)
108 addi 6, 0, 0 # initialise all 64-bits of r6 to zero
109 sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
110 ```
111
112 The next instruction is quite straightforward, and has the
113 effect of turning (transposing) 8 values. Each bit zero
114 of each input byte is placed into the first byte; each bit one
115 of each input byte is placed into the second byte and so on.
116
117 (*This instruction in VSX is called `vgbbd`, and it is present
118 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
119 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
120 all have this extremely powerful and useful instruction*)
121
122 ```
123 # gather performs the transpose (which gets us to positional..)
124 gbbd 8,6
125 ```
126
127 Now the bits have been transposed in bytes, it is plain sailing
128 to perform QTY8 parallel popcounts. However there is a trick
129 going on: we have set VL=MVL=8. To explain this: it covers the
130 last case where CTR may be between 1 and 8.
131
132 Remember what happened back at the Vector-Load, where r6
133 was cleared to zero before-hand? This filled out the 8x8 transposed
134 grid (`gbbd`) fully with zeros prior to the actual transpose.
135 Now when we do the popcount, there will be upper-numbered
136 columns that are *not part of the result* that contain *zero*
137 and *consequently have no impact on the result*.
138
139 This elegant trick extends even to the accumulation of the
140 results. However before we get there it is necessary to
141 describe why `sw=8` has been added to `sv.popcntd`. What
142 this is doing is treating each **byte** of its input
143 (starting at the first byte of r8) as an independent
144 quantity to be popcounted, but the result is *zero-extended*
145 to 64-bit and stored in r24, r25, r26 ... r31.
146
147 Therefore:
148
149 * r24 contains the popcount of the first byte of r8
150 * r25 contains the popcount of the second byte of r8
151 * ...
152 * r31 contains the popcount of the last (7th) byte of r8
153
154 ```
155 # now those bits have been turned around, popcount and sum them
156 setvl 0,0,8,0,1,1 # set MVL=VL=8
157 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
158 ```
159
160 With VL being hard-set to 8, we can now Vector-Parallel
161 sum the partial results r24-31 into the array of accumulators
162 r16-r23. There was no need to set VL=8 again because it was
163 set already by the `setvl` above.
164
165 ```
166 sv.add *16,*16,*24 # and accumulate in results
167 ```
168
169 Finally, `sv.bc` loops back, subtracting VL from CTR.
170 Being based on Power ISA Scalar Branch-Conditional, if
171 CTR goes negative (which it will because VL was set
172 hard-coded to 8, above) it does not matter, the loop
173 will still be correctly exited. In other words we complete
174 the correct number of *blocks* (of length 8).
175
176 ```
177 # branch back if CTR still non-zero. works even though VL=8
178 sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve
179 ```
180
181 [[!tag svp64_cookbook ]]