openpower/sv/cookbook/pospopcnt.mdwn

   1 # Positional popcount SVP64
   2
   3 **Links**
   4
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
   6 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
   7 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
   8 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
   9
  10 **Funding**
  11
  12 This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA).
  13
  14 [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
  15 [<img src="https://nlnet.nl/image/logos/NGIAssure_tag.svg" alt="NGI Assure Logo" width="20%" />](https://nlnet.nl/assure)
  16
  17 **Introduction**
  18
  19 Positional popcount in optimised assembler is typically done on SIMD ISAs in
  20 around 500 lines.  Power ISA thanks to `bpermd` can be much more efficient:
  21 with SVP64 even more so.  The reference implementation showing the concept
  22 is below.
  23
  24 ```
  25 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
  26 // count8 reference implementation for tests.  Do not alter.
  27 func count8safe(counts *[8]int, buf []uint8) {
  28         for i := range buf {
  29                 for j := 0; j < 8; j++ {
  30                         counts[j] += int(buf[i] >> j & 1)
  31                 }
  32         }
  33 }
  34 ```
  35
  36 A simple but still hardware-paralleliseable SVP64 assembler for
  37 8-bit input values (`count8safe`) is as follows:
  38
  39 ```
  40 mtspr 9, 3                # move r3 to CTR
  41 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
  42 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
  43 # load VL bytes (update r4 addr) but compressed (dw=8)
  44 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
  45 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
  46 # gather performs the transpose (which gets us to positional..)
  47 gbbd 8,6
  48 # now those bits have been turned around, popcount and sum them
  49 setvl 0,0,8,0,1,1          # set MVL=VL=8
  50 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
  51 sv.add *16,*16,*24         # and accumulate in results
  52 # branch back if CTR still non-zero. works even though VL=8
  53 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
  54 ```
  55
  56
  57 Array popcount is just standard popcount function
  58 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
  59 however positional popcount is different (vertical). Refer to Fig.1
  60
  61
  62 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
  63      alt="pospopcnt" width="70%" />
  64
  65 Positional popcount adds up the totals of each bit set to 1 in each
  66 bit-position, of an array of input values. Refer to Fig.2
  67
  68 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
  69      alt="pospopcnt" width="60%" />
  70
  71
  72
  73
  74 <br />
  75
  76
  77 # Visual representation of the pospopcount algorithm
  78 In order to perform positional popcount we need to go
  79 through series of steps shown below in figures 3, 4, 5 & 6.
  80 The limitation to overcome is that the CPU can only work
  81 on registers (horizontally) but the requirement of pospopcount
  82 is to add *vertically*.  Part of the challenge is therefore
  83 to perform an appropriate transpose of the data, in blocks
  84 that suit the processor and the ISA capacity.
  85
  86 Fig.3 depicts how the data is divided into blocks of 8*8-bit.
  87
  88 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
  89      alt="pospopcnt" width="70%" />
  90
  91 Fig.4 shows that a transpose is needed on each block.
  92 The gbbd instruction is used for implementing the transpose
  93 function, preparing the data for using the standard popcount
  94 instruction.
  95
  96 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
  97      alt="pospopcnt" width="60%" />
  98
  99
 100
 101 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
 102      alt="pospopcnt" width="70%" />
 103
 104 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
 105      alt="pospopcnt" width="100%" />
 106
 107
 108
 109 # Walkthrough of the assembler
 110
 111 Firstly the CTR (Counter) SPR is set up, and is key to looping
 112 as outlined further, below
 113
 114 ```
 115 mtspr 9, 3                # move r3 to CTR
 116 ```
 117
 118 The Vector Length, which is limited to 8 (MVL - Maximum
 119 Vector Length) is set up. A special "CTR" Mode is used
 120 which automatically uses the CTR SPR rather than register
 121 RA. (*Note that RA is set to zero to indicate this, because there is
 122 limited encoding space. See [[openpower/sv/setvl]] instruction
 123 specification for details)*.
 124
 125 The result of this instruction is that if CTR is greater than
 126 8, VL is set to 8. If however CTR is less than or equal to 8,
 127 then VL is set to CTR.  Additionally, a copy of VL is placed
 128 into RT (r3 in this case), which again is necessary as part
 129 of the limited encoding space but in some cases (not here)
 130 this is desirable, and avoids a `mfspr` instruction to take
 131 a copy of VL into a GPR.
 132
 133 ```
 134 # VL = r3 = MIN(CTR,MAXVL=8)
 135 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
 136 ```
 137
 138 Next the data must be loaded in batches (of up to QTY 8of
 139 8-bit values). We must also not over-run the end of the
 140 array (which a normal SIMD ISA would do, potentially causing
 141 a pagefault) and this is achieved by when CTR <= 8: VL
 142 (Vector Length) in such circumstances being set to the
 143 remaining elements.
 144
 145 The next instruction is `sv.lbzu/pi` - a Post-Increment
 146 variant of `lbzu` which updates the Effective Address (in RA)
 147 **after** it has been used, not before. Also of note is the
 148 element-width override of `dw=8` which writes to *individual
 149 bytes* of r6, not to r6/7/8...13.
 150
 151 Of note here however is the precursor `addi` instruction, which
 152 clears out the contents of r6 to zero before beginning the
 153 Vector/Looped Load sequence.  This is *important* because
 154 SV does **not** zero-out parts of a register when element-width
 155 overrides are in use.
 156
 157 ```
 158 # load VL bytes (update r4 addr) but compressed (dw=8)
 159 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
 160 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
 161 ```
 162
 163 The next instruction is quite straightforward, and has the
 164 effect of turning (transposing) 8 values.  Each bit zero
 165 of each input byte is placed into the first byte; each bit one
 166 of each input byte is placed into the second byte and so on.
 167
 168 (*This instruction in VSX is called `vgbbd`, and it is present
 169 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
 170 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
 171 all have this extremely powerful and useful instruction*)
 172
 173 ```
 174 # gather performs the transpose (which gets us to positional..)
 175 gbbd 8,6
 176 ```
 177
 178 Now the bits have been transposed in bytes, it is plain sailing
 179 to perform QTY8 parallel popcounts.  However there is a trick
 180 going on: we have set VL=MVL=8. To explain this: it covers the
 181 last case where CTR may be between 1 and 8.
 182
 183 Remember what happened back at the Vector-Load, where r6
 184 was cleared to zero before-hand? This filled out the 8x8 transposed
 185 grid (`gbbd`) fully with zeros prior to the actual transpose.
 186 Now when we do the popcount, there will be upper-numbered
 187 columns that are *not part of the result* that contain *zero*
 188 and *consequently have no impact on the result*.
 189
 190 This elegant trick extends even to the accumulation of the
 191 results. However before we get there it is necessary to
 192 describe why `sw=8` has been added to `sv.popcntd`. What
 193 this is doing is treating each **byte** of its input
 194 (starting at the first byte of r8) as an independent
 195 quantity to be popcounted, but the result is *zero-extended*
 196 to 64-bit and stored in r24, r25, r26 ... r31.
 197
 198 Therefore:
 199
 200 * r24 contains the popcount of the first byte of r8
 201 * r25 contains the popcount of the second byte of r8
 202 * ...
 203 * r31 contains the popcount of the last (7th) byte of r8
 204
 205 ```
 206 # now those bits have been turned around, popcount and sum them
 207 setvl 0,0,8,0,1,1          # set MVL=VL=8
 208 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
 209 ```
 210
 211 With VL being hard-set to 8, we can now Vector-Parallel
 212 sum the partial results r24-31 into the array of accumulators
 213 r16-r23. There was no need to set VL=8 again because it was
 214 set already by the `setvl` above.
 215
 216 ```
 217 sv.add *16,*16,*24         # and accumulate in results
 218 ```
 219
 220 Finally, `sv.bc` loops back, subtracting VL from CTR.
 221 Being based on Power ISA Scalar Branch-Conditional, if
 222 CTR goes negative (which it will because VL was set
 223 hard-coded to 8, above) it does not matter, the loop
 224 will still be correctly exited. In other words we complete
 225 the correct number of *blocks* (of length 8).
 226
 227 ```
 228 # branch back if CTR still non-zero. works even though VL=8
 229 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
 230 ```
 231
 232 # Improvements
 233
 234 There exist many opportunities for parallelism that simpler hardware
 235 would need to have in order to maximise performance.  On Out-of-Order
 236 hardware the above extremely simple and very clear algorithm will
 237 achieve extreme high levels of performance simply by exploiting
 238 standard Multi-Issue Register Hazard Management.
 239
 240 However simpler hardware - in-order - will need a little bit of
 241 assistance, and that can come in the form of expanding to QTY4 or
 242 QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64),
 243 `gbbd` becomes an `sv.gbbd` but VL being set to the block count
 244 (QTY4 or QTY8), and the SV REMAP Parallel Reduction Schedule being
 245 applied to each intermediary result rather than using an array
 246 of straight accumulators `r16-r23`.  However this starts to push
 247 the boundaries of the number of registers needed, so as an
 248 exercise is left for another time.
 249
 250 # Conclusion
 251
 252 Where a normal SIMD ISA requires explicit hand-crafted optimisation
 253 in order to achieve full utilisation of the underlying hardware,
 254 Simple-V instead can rely to a large extent on standard Multi-Issue
 255 hardware to achieve similar performance, whilst crucially keeping the
 256 algorithm implementation down to a shockingly-simple degree that makes
 257 it easy to understand an easy to review.  Again also as with many
 258 other algorithms when implemented in Simple-V SVP54, by keeping to
 259 a LOAD-COMPUTE-STORE paradigm the L1 Data Cache usage is minimised,
 260 and in this case just as with chacha20 the entire algorithm, being
 261 only 9 lines of assembler fitting into 13 4-byte words it can fit
 262 into a single L1 I-Cache Line without triggering Virtual Memory TLB
 263 misses.
 264
 265 Further performance improvements are achievable by using REMAP
 266 Parallel Reduction, still fitting into a single L1 Cache line,
 267 but beginning to approach the limit of the 128-long register file.
 268
 269 [[!tag svp64_cookbook ]]