openpower/sv/cookbook/pospopcnt.mdwn

   1 # Positional popcount SVP64
   2
   3 **Links**
   4
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
   6 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
   7 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
   8 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
   9
  10 **Funding**
  11
  12 This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA).
  13
  14 [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
  15 [<img src="https://nlnet.nl/image/logos/NGIAssure_tag.svg" alt="NGI Assure Logo" width="20%" />](https://nlnet.nl/assure)
  16
  17 **Introduction**
  18
  19 Positional popcount in optimised assembler is typically done on SIMD
  20 ISAs in around 500 lines.  The implementations are extraordinarily
  21 complex and very hard to understand. Power ISA thanks to `bpermd` can be much
  22 more efficient: with SVP64 even more so.  The reference implementation
  23 showing the concept is below.
  24
  25 ```
  26 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
  27 // count8 reference implementation for tests.  Do not alter.
  28 func count8safe(counts *[8]int, buf []uint8) {
  29         for i := range buf {
  30                 for j := 0; j < 8; j++ {
  31                         counts[j] += int(buf[i] >> j & 1)
  32                 }
  33         }
  34 }
  35 ```
  36
  37 A simple but still hardware-paralleliseable SVP64 assembler for
  38 8-bit input values (`count8safe`) is as follows:
  39
  40 ```
  41 mtspr 9, 3                # move r3 to CTR
  42 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
  43 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
  44 # load VL bytes (update r4 addr) but compressed (dw=8)
  45 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
  46 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
  47 # gather performs the transpose (which gets us to positional..)
  48 gbbd 8,6
  49 # now those bits have been turned around, popcount and sum them
  50 setvl 0,0,8,0,1,1          # set MVL=VL=8
  51 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
  52 sv.add *16,*16,*24         # and accumulate in results
  53 # branch back if CTR still non-zero. works even though VL=8
  54 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
  55 ```
  56
  57
  58 Array popcount is just standard popcount function
  59 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
  60 however positional popcount is different (vertical). Refer to Fig.1
  61
  62
  63 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
  64      alt="pospopcnt" width="70%" />
  65
  66 Positional popcount adds up the totals of each bit set to 1 in each
  67 bit-position, of an array of input values. Refer to Fig.2
  68
  69 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
  70      alt="pospopcnt" width="60%" />
  71
  72
  73 # Visual representation of the pospopcount algorithm
  74
  75 In order to perform positional popcount we need to go through series of
  76 steps shown below in figures 3, 4, 5 & 6.  The limitation to overcome is
  77 that the CPU can only work on registers (horizontally) but the requirement
  78 of pospopcount is to add *vertically*.  Part of the challenge is therefore
  79 to perform an appropriate transpose of the data, in blocks that suit
  80 the processor and the ISA capacity.
  81
  82 Fig.3 depicts how the data is divided into blocks of 8*8-bit.
  83
  84 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
  85      alt="pospopcnt" width="70%" />
  86
  87 Fig.4 shows that a transpose is needed on each block.  The gbbd
  88 instruction is used for implementing the transpose function, preparing
  89 the data for using the standard popcount instruction.
  90
  91 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
  92      alt="pospopcnt" width="60%" />
  93
  94 Now on each 8*8 block, 8 popcount instructions can be run each of
  95 which is independent and therefore can be parallelised even by In-Order
  96 multi-issue hardware(Fig.5).
  97
  98 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
  99      alt="pospopcnt" width="70%" />
 100
 101 Fig.6 depicts how each of the intermediate results are accumulated. It
 102 is worth noting that each intermediate result is independent of the
 103 other intermediate results and also parallel reduction can be applied
 104 to all of them individually. This gives *two* opportunities for hardware
 105 parallelism rather than one.
 106
 107 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
 108      alt="pospopcnt" width="100%" />
 109
 110 In short this algorithm is very straightforward to implement thanks to the
 111 two crucial instructions, `gbbd` and `popcntd`.  Below is a walkthrough
 112 of the assembler, keeping it very simple, and exploiting only one of the
 113 opportunities for parallelism (by not including the Parallel Reduction
 114 opportunity mentioned above).
 115
 116 # Walkthrough of the assembler
 117
 118 Firstly the CTR (Counter) SPR is set up, and is key to looping
 119 as outlined further, below
 120
 121 ```
 122 mtspr 9, 3                # move r3 to CTR
 123 ```
 124
 125 The Vector Length, which is limited to 8 (MVL - Maximum Vector Length)
 126 is set up. A special "CTR" Mode is used which automatically uses the CTR
 127 SPR rather than register RA. (*Note that RA is set to zero to indicate
 128 this, because there is limited encoding space. See [[openpower/sv/setvl]]
 129 instruction specification for details)*.
 130
 131 The result of this instruction is that if CTR is greater than 8, VL is set
 132 to 8. If however CTR is less than or equal to 8, then VL is set to CTR.
 133 Additionally, a copy of VL is placed into RT (r3 in this case), which
 134 again is necessary as part of the limited encoding space but in some cases
 135 (not here) this is desirable, and avoids a `mfspr` instruction to take
 136 a copy of VL into a GPR.
 137
 138 ```
 139 # VL = r3 = MIN(CTR,MAXVL=8)
 140 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
 141 ```
 142
 143 Next the data must be loaded in batches (of up to QTY 8of 8-bit
 144 values). We must also not over-run the end of the array (which a normal
 145 SIMD ISA would do, potentially causing a pagefault) and this is achieved
 146 by when CTR <= 8: VL (Vector Length) in such circumstances being set to
 147 the remaining elements.
 148
 149 The next instruction is `sv.lbzu/pi` - a Post-Increment variant of `lbzu`
 150 which updates the Effective Address (in RA) **after** it has been used,
 151 not before. Also of note is the element-width override of `dw=8` which
 152 writes to *individual bytes* of r6, not to r6/7/8...13.
 153
 154 Of note here however is the precursor `addi` instruction, which clears
 155 out the contents of r6 to zero before beginning the Vector/Looped Load
 156 sequence.  This is *important* because SV does **not** zero-out parts
 157 of a register when element-width overrides are in use.
 158
 159 ```
 160 # load VL bytes (update r4 addr) but compressed (dw=8)
 161 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
 162 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
 163 ```
 164
 165 The next instruction is quite straightforward, and has the effect of
 166 turning (transposing) 8 values.  Each bit zero of each input byte is
 167 placed into the first byte; each bit one of each input byte is placed
 168 into the second byte and so on.
 169
 170 (*This instruction in VSX is called `vgbbd`, and it is present in many
 171 other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`, x86 calls it
 172 GF2P7AFFINEQB. Power ISA, Cray, and many others all have this extremely
 173 powerful and useful instruction*)
 174
 175 ```
 176 # gather performs the transpose (which gets us to positional..)
 177 gbbd 8,6
 178 ```
 179
 180 Now the bits have been transposed in bytes, it is plain sailing to
 181 perform QTY8 parallel popcounts.  However there is a trick going on:
 182 we have set VL=MVL=8. To explain this: it covers the last case where
 183 CTR may be between 1 and 8.
 184
 185 Remember what happened back at the Vector-Load, where r6 was cleared to
 186 zero before-hand? This filled out the 8x8 transposed grid (`gbbd`) fully
 187 with zeros prior to the actual transpose.  Now when we do the popcount,
 188 there will be upper-numbered columns that are *not part of the result*
 189 that contain *zero* and *consequently have no impact on the result*.
 190
 191 This elegant trick extends even to the accumulation of the
 192 results. However before we get there it is necessary to describe why
 193 `sw=8` has been added to `sv.popcntd`. What this is doing is treating
 194 each **byte** of its input (starting at the first byte of r8) as an
 195 independent quantity to be popcounted, but the result is *zero-extended*
 196 to 64-bit and stored in r24, r25, r26 ... r31.
 197
 198 Therefore:
 199
 200 * r24 contains the popcount of the first byte of r8
 201 * r25 contains the popcount of the second byte of r8
 202 * ...
 203 * r31 contains the popcount of the last (7th) byte of r8
 204
 205 ```
 206 # now those bits have been turned around, popcount and sum them
 207 setvl 0,0,8,0,1,1          # set MVL=VL=8
 208 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
 209 ```
 210
 211 With VL being hard-set to 8, we can now Vector-Parallel sum the partial
 212 results r24-31 into the array of accumulators r16-r23. There was no need
 213 to set VL=8 again because it was set already by the `setvl` above.
 214
 215 ```
 216 sv.add *16,*16,*24         # and accumulate in results
 217 ```
 218
 219 Finally, `sv.bc` loops back, subtracting VL from CTR.  Being based on
 220 Power ISA Scalar Branch-Conditional, if CTR goes negative (which it will
 221 because VL was set hard-coded to 8, above) it does not matter, the loop
 222 will still be correctly exited. In other words we complete the correct
 223 number of *blocks* (of length 8).
 224
 225 ```
 226 # branch back if CTR still non-zero. works even though VL=8
 227 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
 228 ```
 229
 230 # Improvements
 231
 232 There exist many opportunities for parallelism that simpler hardware would
 233 need to have in order to maximise performance.  On Out-of-Order hardware
 234 the above extremely simple and very clear algorithm will achieve extreme
 235 high levels of performance simply by exploiting standard Multi-Issue
 236 Register Hazard Management.
 237
 238 However simpler hardware - in-order - will need a little bit of
 239 assistance, and that can come in the form of expanding to QTY4 or QTY8
 240 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), `gbbd`
 241 becomes an `sv.gbbd` but VL being set to the block count (QTY4 or QTY8),
 242 and the SV REMAP Parallel Reduction Schedule being applied to each
 243 intermediary result rather than using an array of straight accumulators
 244 `r16-r23`.  However this starts to push the boundaries of the number of
 245 registers needed, so as an exercise is left for another time.
 246
 247 # Conclusion
 248
 249 Where a normal SIMD ISA requires explicit hand-crafted optimisation in
 250 order to achieve full utilisation of the underlying hardware, Simple-V
 251 instead can rely to a large extent on standard Multi-Issue hardware
 252 to achieve similar performance, whilst crucially keeping the algorithm
 253 implementation down to a shockingly-simple degree that makes it easy to
 254 understand an easy to review.  Again also as with many other algorithms
 255 when implemented in Simple-V SVP54, by keeping to a LOAD-COMPUTE-STORE
 256 paradigm the L1 Data Cache usage is minimised, and in this case just
 257 as with chacha20 the entire algorithm, being only 9 lines of assembler
 258 fitting into 13 4-byte words it can fit into a single L1 I-Cache Line
 259 without triggering Virtual Memory TLB misses.
 260
 261 Further performance improvements are achievable by using REMAP Parallel
 262 Reduction, still fitting into a single L1 Cache line, but beginning to
 263 approach the limit of the 128-long register file.
 264
 265 [[!tag svp64_cookbook ]]