openpower/sv/cookbook/pospopcnt.mdwn

   1 # Positional popcount SVP64
   2
   3 **Links**
   4
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
   6 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
   7 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
   8 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
   9
  10 **Funding**
  11
  12 This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA).
  13
  14 [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
  15 [<img src="https://nlnet.nl/image/logos/NGIAssure_tag.svg" alt="NGI Assure Logo" width="20%" />](https://nlnet.nl/assure)
  16
  17 **Introduction**
  18
  19 Positional popcount in optimised assembler is typically done on SIMD
  20 ISAs in around 500 lines.  Power ISA thanks to `bpermd` can be much
  21 more efficient: with SVP64 even more so.  The reference implementation
  22 showing the concept is below.
  23
  24 ```
  25 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
  26 // count8 reference implementation for tests.  Do not alter.
  27 func count8safe(counts *[8]int, buf []uint8) {
  28         for i := range buf {
  29                 for j := 0; j < 8; j++ {
  30                         counts[j] += int(buf[i] >> j & 1)
  31                 }
  32         }
  33 }
  34 ```
  35
  36 A simple but still hardware-paralleliseable SVP64 assembler for
  37 8-bit input values (`count8safe`) is as follows:
  38
  39 ```
  40 mtspr 9, 3                # move r3 to CTR
  41 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
  42 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
  43 # load VL bytes (update r4 addr) but compressed (dw=8)
  44 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
  45 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
  46 # gather performs the transpose (which gets us to positional..)
  47 gbbd 8,6
  48 # now those bits have been turned around, popcount and sum them
  49 setvl 0,0,8,0,1,1          # set MVL=VL=8
  50 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
  51 sv.add *16,*16,*24         # and accumulate in results
  52 # branch back if CTR still non-zero. works even though VL=8
  53 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
  54 ```
  55
  56
  57 Array popcount is just standard popcount function
  58 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
  59 however positional popcount is different (vertical). Refer to Fig.1
  60
  61
  62 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
  63      alt="pospopcnt" width="70%" />
  64
  65 Positional popcount adds up the totals of each bit set to 1 in each
  66 bit-position, of an array of input values. Refer to Fig.2
  67
  68 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
  69      alt="pospopcnt" width="60%" />
  70
  71
  72 # Visual representation of the pospopcount algorithm
  73
  74 In order to perform positional popcount we need to go through series of
  75 steps shown below in figures 3, 4, 5 & 6.  The limitation to overcome is
  76 that the CPU can only work on registers (horizontally) but the requirement
  77 of pospopcount is to add *vertically*.  Part of the challenge is therefore
  78 to perform an appropriate transpose of the data, in blocks that suit
  79 the processor and the ISA capacity.
  80
  81 Fig.3 depicts how the data is divided into blocks of 8*8-bit.
  82
  83 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
  84      alt="pospopcnt" width="70%" />
  85
  86 Fig.4 shows that a transpose is needed on each block.  The gbbd
  87 instruction is used for implementing the transpose function, preparing
  88 the data for using the standard popcount instruction.
  89
  90 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
  91      alt="pospopcnt" width="60%" />
  92
  93 Now on each 8*8 block, 8 popcount instructions can be run each of
  94 which is independent and therefore can be parallelised even by In-Order
  95 multi-issue hardware(Fig.5).
  96
  97 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
  98      alt="pospopcnt" width="70%" />
  99
 100 Fig.6 depicts how each of the intermediate results are accumulated. It
 101 is worth noting that each intermediate result is independent of the
 102 other intermediate results and also parallel reduction can be applied
 103 to all of them individually. This gives *two* opportunities for hardware
 104 parallelism rather than one.
 105
 106 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
 107      alt="pospopcnt" width="100%" />
 108
 109 In short this algorithm is very straightforward to implement thanks to the
 110 two crucial instructions, `gbbd` and `popcntd`.  Below is a walkthrough
 111 of the assembler, keeping it very simple, and exploiting only one of the
 112 opportunities for parallelism (by not including the Parallel Reduction
 113 opportunity mentioned above).
 114
 115 # Walkthrough of the assembler
 116
 117 Firstly the CTR (Counter) SPR is set up, and is key to looping
 118 as outlined further, below
 119
 120 ```
 121 mtspr 9, 3                # move r3 to CTR
 122 ```
 123
 124 The Vector Length, which is limited to 8 (MVL - Maximum Vector Length)
 125 is set up. A special "CTR" Mode is used which automatically uses the CTR
 126 SPR rather than register RA. (*Note that RA is set to zero to indicate
 127 this, because there is limited encoding space. See [[openpower/sv/setvl]]
 128 instruction specification for details)*.
 129
 130 The result of this instruction is that if CTR is greater than 8, VL is set
 131 to 8. If however CTR is less than or equal to 8, then VL is set to CTR.
 132 Additionally, a copy of VL is placed into RT (r3 in this case), which
 133 again is necessary as part of the limited encoding space but in some cases
 134 (not here) this is desirable, and avoids a `mfspr` instruction to take
 135 a copy of VL into a GPR.
 136
 137 ```
 138 # VL = r3 = MIN(CTR,MAXVL=8)
 139 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
 140 ```
 141
 142 Next the data must be loaded in batches (of up to QTY 8of 8-bit
 143 values). We must also not over-run the end of the array (which a normal
 144 SIMD ISA would do, potentially causing a pagefault) and this is achieved
 145 by when CTR <= 8: VL (Vector Length) in such circumstances being set to
 146 the remaining elements.
 147
 148 The next instruction is `sv.lbzu/pi` - a Post-Increment variant of `lbzu`
 149 which updates the Effective Address (in RA) **after** it has been used,
 150 not before. Also of note is the element-width override of `dw=8` which
 151 writes to *individual bytes* of r6, not to r6/7/8...13.
 152
 153 Of note here however is the precursor `addi` instruction, which clears
 154 out the contents of r6 to zero before beginning the Vector/Looped Load
 155 sequence.  This is *important* because SV does **not** zero-out parts
 156 of a register when element-width overrides are in use.
 157
 158 ```
 159 # load VL bytes (update r4 addr) but compressed (dw=8)
 160 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
 161 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
 162 ```
 163
 164 The next instruction is quite straightforward, and has the effect of
 165 turning (transposing) 8 values.  Each bit zero of each input byte is
 166 placed into the first byte; each bit one of each input byte is placed
 167 into the second byte and so on.
 168
 169 (*This instruction in VSX is called `vgbbd`, and it is present in many
 170 other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`, x86 calls it
 171 GF2P7AFFINEQB. Power ISA, Cray, and many others all have this extremely
 172 powerful and useful instruction*)
 173
 174 ```
 175 # gather performs the transpose (which gets us to positional..)
 176 gbbd 8,6
 177 ```
 178
 179 Now the bits have been transposed in bytes, it is plain sailing to
 180 perform QTY8 parallel popcounts.  However there is a trick going on:
 181 we have set VL=MVL=8. To explain this: it covers the last case where
 182 CTR may be between 1 and 8.
 183
 184 Remember what happened back at the Vector-Load, where r6 was cleared to
 185 zero before-hand? This filled out the 8x8 transposed grid (`gbbd`) fully
 186 with zeros prior to the actual transpose.  Now when we do the popcount,
 187 there will be upper-numbered columns that are *not part of the result*
 188 that contain *zero* and *consequently have no impact on the result*.
 189
 190 This elegant trick extends even to the accumulation of the
 191 results. However before we get there it is necessary to describe why
 192 `sw=8` has been added to `sv.popcntd`. What this is doing is treating
 193 each **byte** of its input (starting at the first byte of r8) as an
 194 independent quantity to be popcounted, but the result is *zero-extended*
 195 to 64-bit and stored in r24, r25, r26 ... r31.
 196
 197 Therefore:
 198
 199 * r24 contains the popcount of the first byte of r8
 200 * r25 contains the popcount of the second byte of r8
 201 * ...
 202 * r31 contains the popcount of the last (7th) byte of r8
 203
 204 ```
 205 # now those bits have been turned around, popcount and sum them
 206 setvl 0,0,8,0,1,1          # set MVL=VL=8
 207 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
 208 ```
 209
 210 With VL being hard-set to 8, we can now Vector-Parallel sum the partial
 211 results r24-31 into the array of accumulators r16-r23. There was no need
 212 to set VL=8 again because it was set already by the `setvl` above.
 213
 214 ```
 215 sv.add *16,*16,*24         # and accumulate in results
 216 ```
 217
 218 Finally, `sv.bc` loops back, subtracting VL from CTR.  Being based on
 219 Power ISA Scalar Branch-Conditional, if CTR goes negative (which it will
 220 because VL was set hard-coded to 8, above) it does not matter, the loop
 221 will still be correctly exited. In other words we complete the correct
 222 number of *blocks* (of length 8).
 223
 224 ```
 225 # branch back if CTR still non-zero. works even though VL=8
 226 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
 227 ```
 228
 229 # Improvements
 230
 231 There exist many opportunities for parallelism that simpler hardware would
 232 need to have in order to maximise performance.  On Out-of-Order hardware
 233 the above extremely simple and very clear algorithm will achieve extreme
 234 high levels of performance simply by exploiting standard Multi-Issue
 235 Register Hazard Management.
 236
 237 However simpler hardware - in-order - will need a little bit of
 238 assistance, and that can come in the form of expanding to QTY4 or QTY8
 239 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), `gbbd`
 240 becomes an `sv.gbbd` but VL being set to the block count (QTY4 or QTY8),
 241 and the SV REMAP Parallel Reduction Schedule being applied to each
 242 intermediary result rather than using an array of straight accumulators
 243 `r16-r23`.  However this starts to push the boundaries of the number of
 244 registers needed, so as an exercise is left for another time.
 245
 246 # Conclusion
 247
 248 Where a normal SIMD ISA requires explicit hand-crafted optimisation in
 249 order to achieve full utilisation of the underlying hardware, Simple-V
 250 instead can rely to a large extent on standard Multi-Issue hardware
 251 to achieve similar performance, whilst crucially keeping the algorithm
 252 implementation down to a shockingly-simple degree that makes it easy to
 253 understand an easy to review.  Again also as with many other algorithms
 254 when implemented in Simple-V SVP54, by keeping to a LOAD-COMPUTE-STORE
 255 paradigm the L1 Data Cache usage is minimised, and in this case just
 256 as with chacha20 the entire algorithm, being only 9 lines of assembler
 257 fitting into 13 4-byte words it can fit into a single L1 I-Cache Line
 258 without triggering Virtual Memory TLB misses.
 259
 260 Further performance improvements are achievable by using REMAP Parallel
 261 Reduction, still fitting into a single L1 Cache line, but beginning to
 262 approach the limit of the 128-long register file.
 263
 264 [[!tag svp64_cookbook ]]