openpower/sv/cookbook/pospopcnt.mdwn

   1 # Positional popcount SVP64
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
   4 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
   5 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
   6 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
   7
   8 Positional popcount in optimised assembler is typically done on SIMD ISAs in
   9 around 500 lines.  Power ISA thanks to `bpermd` can be much more efficient:
  10 with SVP64 even more so.  The reference implementation showing the concept
  11 is below.
  12
  13 ```
  14 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
  15 // count8 reference implementation for tests.  Do not alter.
  16 func count8safe(counts *[8]int, buf []uint8) {
  17         for i := range buf {
  18                 for j := 0; j < 8; j++ {
  19                         counts[j] += int(buf[i] >> j & 1)
  20                 }
  21         }
  22 }
  23 ```
  24
  25 A simple but still hardware-paralleliseable SVP64 assembler for
  26 8-bit input values (`count8safe`) is as follows:
  27
  28 ```
  29 mtspr 9, 3                # move r3 to CTR
  30 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
  31 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
  32 # load VL bytes (update r4 addr) but compressed (dw=8)
  33 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
  34 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
  35 # gather performs the transpose (which gets us to positional..)
  36 gbbd 8,6
  37 # now those bits have been turned around, popcount and sum them
  38 setvl 0,0,8,0,1,1          # set MVL=VL=8
  39 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
  40 sv.add *16,*16,*24         # and accumulate in results
  41 # branch back if CTR still non-zero. works even though VL=8
  42 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
  43 ```
  44
  45 Array popcount is just standard popcount function
  46 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
  47 however positional popcount is different (vertical)
  48
  49
  50
  51 Positional popcount adds up the totals of each bit set to 1 in each
  52 bit-position, of an array of input values.
  53
  54
  55 <img src="/openpower/sv/cookbook/ArrayPopcnt.drawio.svg"
  56      alt="pospopcnt" width="70%" />
  57
  58 <br />
  59
  60 <img src="/openpower/sv/cookbook/PositionalPopcnt.drawio.svg"
  61      alt="pospopcnt" width="60%" />
  62
  63 <br />
  64
  65
  66 <img src="/openpower/sv/cookbook/BlockDivision.drawio.svg"
  67      alt="pospopcnt" width="70%" />
  68
  69 <br />
  70
  71 <img src="/openpower/sv/cookbook/Transpose.drawio.svg"
  72      alt="pospopcnt" width="60%" />
  73
  74
  75 <img src="/openpower/sv/cookbook/PopcountBlocks.drawio.svg"
  76      alt="pospopcnt" width="70%" />
  77
  78 <img src="/openpower/sv/cookbook/ParallelAccumulate.drawio.svg"
  79      alt="pospopcnt" width="100%" />
  80
  81
  82 # Visual representation of the pospopcount algorithm
  83
  84 # Walkthrough of the assembler
  85
  86 Firstly the CTR (Counter) SPR is set up, and is key to looping
  87 as outlined further, below
  88
  89 ```
  90 mtspr 9, 3                # move r3 to CTR
  91 ```
  92
  93 The Vector Length, which is limited to 8 (MVL - Maximum
  94 Vector Length) is set up. A special "CTR" Mode is used
  95 which automatically uses the CTR SPR rather than register
  96 RA. (*Note that RA is set to zero to indicate this, because there is
  97 limited encoding space. See [[openpower/sv/setvl]] instruction
  98 specification for details)*.
  99
 100 The result of this instruction is that if CTR is greater than
 101 8, VL is set to 8. If however CTR is less than or equal to 8,
 102 then VL is set to CTR.  Additionally, a copy of VL is placed
 103 into RT (r3 in this case), which again is necessary as part
 104 of the limited encoding space but in some cases (not here)
 105 this is desirable, and avoids a `mfspr` instruction to take
 106 a copy of VL into a GPR.
 107
 108 ```
 109 # VL = r3 = MIN(CTR,MAXVL=8)
 110 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
 111 ```
 112
 113 Next the data must be loaded in batches (of up to QTY 8of
 114 8-bit values). We must also not over-run the end of the
 115 array (which a normal SIMD ISA would do, potentially causing
 116 a pagefault) and this is achieved by when CTR <= 8: VL
 117 (Vector Length) in such circumstances being set to the
 118 remaining elements.
 119
 120 The next instruction is `sv.lbzu/pi` - a Post-Increment
 121 variant of `lbzu` which updates the Effective Address (in RA)
 122 **after** it has been used, not before. Also of note is the
 123 element-width override of `dw=8` which writes to *individual
 124 bytes* of r6, not to r6/7/8...13.
 125
 126 Of note here however is the precursor `addi` instruction, which
 127 clears out the contents of r6 to zero before beginning the
 128 Vector/Looped Load sequence.  This is *important* because
 129 SV does **not** zero-out parts of a register when element-width
 130 overrides are in use.
 131
 132 ```
 133 # load VL bytes (update r4 addr) but compressed (dw=8)
 134 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
 135 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
 136 ```
 137
 138 The next instruction is quite straightforward, and has the
 139 effect of turning (transposing) 8 values.  Each bit zero
 140 of each input byte is placed into the first byte; each bit one
 141 of each input byte is placed into the second byte and so on.
 142
 143 (*This instruction in VSX is called `vgbbd`, and it is present
 144 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
 145 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
 146 all have this extremely powerful and useful instruction*)
 147
 148 ```
 149 # gather performs the transpose (which gets us to positional..)
 150 gbbd 8,6
 151 ```
 152
 153 Now the bits have been transposed in bytes, it is plain sailing
 154 to perform QTY8 parallel popcounts.  However there is a trick
 155 going on: we have set VL=MVL=8. To explain this: it covers the
 156 last case where CTR may be between 1 and 8.
 157
 158 Remember what happened back at the Vector-Load, where r6
 159 was cleared to zero before-hand? This filled out the 8x8 transposed
 160 grid (`gbbd`) fully with zeros prior to the actual transpose.
 161 Now when we do the popcount, there will be upper-numbered
 162 columns that are *not part of the result* that contain *zero*
 163 and *consequently have no impact on the result*.
 164
 165 This elegant trick extends even to the accumulation of the
 166 results. However before we get there it is necessary to
 167 describe why `sw=8` has been added to `sv.popcntd`. What
 168 this is doing is treating each **byte** of its input
 169 (starting at the first byte of r8) as an independent
 170 quantity to be popcounted, but the result is *zero-extended*
 171 to 64-bit and stored in r24, r25, r26 ... r31.
 172
 173 Therefore:
 174
 175 * r24 contains the popcount of the first byte of r8
 176 * r25 contains the popcount of the second byte of r8
 177 * ...
 178 * r31 contains the popcount of the last (7th) byte of r8
 179
 180 ```
 181 # now those bits have been turned around, popcount and sum them
 182 setvl 0,0,8,0,1,1          # set MVL=VL=8
 183 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
 184 ```
 185
 186 With VL being hard-set to 8, we can now Vector-Parallel
 187 sum the partial results r24-31 into the array of accumulators
 188 r16-r23. There was no need to set VL=8 again because it was
 189 set already by the `setvl` above.
 190
 191 ```
 192 sv.add *16,*16,*24         # and accumulate in results
 193 ```
 194
 195 Finally, `sv.bc` loops back, subtracting VL from CTR.
 196 Being based on Power ISA Scalar Branch-Conditional, if
 197 CTR goes negative (which it will because VL was set
 198 hard-coded to 8, above) it does not matter, the loop
 199 will still be correctly exited. In other words we complete
 200 the correct number of *blocks* (of length 8).
 201
 202 ```
 203 # branch back if CTR still non-zero. works even though VL=8
 204 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
 205 ```
 206
 207 # Improvements
 208
 209 There exist many opportunities for parallelism that simpler hardware
 210 would need to have in order to maximise performance.  On Out-of-Order
 211 hardware the above extremely simple and very clear algorithm will
 212 achieve extreme high levels of performance simply by exploiting
 213 standard Multi-Issue Register Hazard Management.
 214
 215 However simpler hardware - in-order - will need a little bit of
 216 assistance, and that can come in the form of expanding to QTY4 or
 217 QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64),
 218 `gbbd` becomes an `sv.gbbd` but VL being set to the block count
 219 (QTY4 or QTY8), and the SV REMAP Parallel Reduction Schedule being
 220 applied to each intermediary result rather than using an array
 221 of straight accumulators `r16-r23`.  However this starts to push
 222 the boundaries of the number of registers needed, so as an
 223 exercise is left for another time.
 224
 225 [[!tag svp64_cookbook ]]