openpower/sv/cookbook/pospopcnt.mdwn

   1 # Positional popcount SVP64
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=672>
   4 * <https://github.com/clausecker/pospop/blob/master/countsse2_amd64.s>
   5 * RISC-V Bitmanip Extension Document Version 0.94-draft Editor: Claire Wolf Symbiotic GmbH
   6 <https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf>
   7
   8 Positional popcount in optimised assembler is typically done on SIMD ISAs in
   9 around 500 lines.  Power ISA thanks to `bpermd` can be much more efficient:
  10 with SVP64 even more so.  The reference implementation showing the concept
  11 is below.
  12
  13 ```
  14 // Copyright (c) 2020 Robert Clausecker <fuz@fuz.su>
  15 // count8 reference implementation for tests.  Do not alter.
  16 func count8safe(counts *[8]int, buf []uint8) {
  17         for i := range buf {
  18                 for j := 0; j < 8; j++ {
  19                         counts[j] += int(buf[i] >> j & 1)
  20                 }
  21         }
  22 }
  23 ```
  24
  25 A simple but still hardware-paralleliseable SVP64 assembler for
  26 8-bit input values (`count8safe`) is as follows:
  27
  28 ```
  29 mtspr 9, 3                # move r3 to CTR
  30 # VL = MIN(CTR,MAXVL=8), Rc=1 (CR0 set if CTR ends)
  31 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
  32 # load VL bytes (update r4 addr) but compressed (dw=8)
  33 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
  34 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
  35 # gather performs the transpose (which gets us to positional..)
  36 gbbd 8,6
  37 # now those bits have been turned around, popcount and sum them
  38 setvl 0,0,8,0,1,1          # set MVL=VL=8
  39 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
  40 sv.add *16,*16,*24         # and accumulate in results
  41 # branch back if CTR still non-zero. works even though VL=8
  42 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
  43 ```
  44
  45 Array popcount is just standard popcount function
  46 ([[!wikipedia Hamming weight]]) on an array of values, horizontally,
  47 however positional popcount is different (vertical)
  48
  49 <img src="/openpower/sv/cookbook/1_popcount.svg" alt="pospopcnt" width="70%" />
  50
  51 Positional popcount adds up the totals of each bit set to 1 in each
  52 bit-position, of an array of input values.
  53
  54 <img src="/openpower/sv/cookbook/2_popcount.svg" alt="pospopcnt" width="70%" />
  55
  56 # Visual representation of the pospopcount algorithm
  57
  58 # Walkthrough of the assembler
  59
  60 Firstly the CTR (Counter) SPR is set up, and is key to looping
  61 as outlined further, below
  62
  63 ```
  64 mtspr 9, 3                # move r3 to CTR
  65 ```
  66
  67 The Vector Length, which is limited to 8 (MVL - Maximum
  68 Vector Length) is set up. A special "CTR" Mode is used
  69 which automatically uses the CTR SPR rather than register
  70 RA. (*Note that RA is set to zero to indicate this, because there is
  71 limited encoding space. See [[openpower/sv/setvl]] instruction
  72 specification for details)*.
  73
  74 The result of this instruction is that if CTR is greater than
  75 8, VL is set to 8. If however CTR is less than or equal to 8,
  76 then VL is set to CTR.  Additionally, a copy of VL is placed
  77 into RT (r3 in this case), which again is necessary as part
  78 of the limited encoding space but in some cases (not here)
  79 this is desirable, and avoids a `mfspr` instruction to take
  80 a copy of VL into a GPR.
  81
  82 ```
  83 # VL = r3 = MIN(CTR,MAXVL=8)
  84 setvl 3,0,8,0,1,1         # set MVL=8, VL=MIN(MVL,CTR)
  85 ```
  86
  87 Next the data must be loaded in batches (of up to QTY 8of
  88 8-bit values). We must also not over-run the end of the
  89 array (which a normal SIMD ISA would do, potentially causing
  90 a pagefault) and this is achieved by when CTR <= 8: VL
  91 (Vector Length) in such circumstances being set to the
  92 remaining elements.
  93
  94 The next instruction is `sv.lbzu/pi` - a Post-Increment
  95 variant of `lbzu` which updates the Effective Address (in RA)
  96 **after** it has been used, not before. Also of note is the
  97 element-width override of `dw=8` which writes to *individual
  98 bytes* of r6, not to r6/7/8...13.
  99
 100 Of note here however is the precursor `addi` instruction, which
 101 clears out the contents of r6 to zero before beginning the
 102 Vector/Looped Load sequence.  This is *important* because
 103 SV does **not** zero-out parts of a register when element-width
 104 overrides are in use.
 105
 106 ```
 107 # load VL bytes (update r4 addr) but compressed (dw=8)
 108 addi 6, 0, 0               # initialise all 64-bits of r6 to zero
 109 sv.lbzu/pi/dw=8 *6, 1(4)   # should be /lf here as well
 110 ```
 111
 112 The next instruction is quite straightforward, and has the
 113 effect of turning (transposing) 8 values.  Each bit zero
 114 of each input byte is placed into the first byte; each bit one
 115 of each input byte is placed into the second byte and so on.
 116
 117 (*This instruction in VSX is called `vgbbd`, and it is present
 118 in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
 119 x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
 120 all have this extremely powerful and useful instruction*)
 121
 122 ```
 123 # gather performs the transpose (which gets us to positional..)
 124 gbbd 8,6
 125 ```
 126
 127 Now the bits have been transposed in bytes, it is plain sailing
 128 to perform QTY8 parallel popcounts.  However there is a trick
 129 going on: we have set VL=MVL=8. To explain this: it covers the
 130 last case where CTR may be between 1 and 8.
 131
 132 Remember what happened back at the Vector-Load, where r6
 133 was cleared to zero before-hand? This filled out the 8x8 transposed
 134 grid (`gbbd`) fully with zeros prior to the actual transpose.
 135 Now when we do the popcount, there will be upper-numbered
 136 columns that are *not part of the result* that contain *zero*
 137 and *consequently have no impact on the result*.
 138
 139 This elegant trick extends even to the accumulation of the
 140 results. However before we get there it is necessary to
 141 describe why `sw=8` has been added to `sv.popcntd`. What
 142 this is doing is treating each **byte** of its input
 143 (starting at the first byte of r8) as an independent
 144 quantity to be popcounted, but the result is *zero-extended*
 145 to 64-bit and stored in r24, r25, r26 ... r31.
 146
 147 Therefore:
 148
 149 * r24 contains the popcount of the first byte of r8
 150 * r25 contains the popcount of the second byte of r8
 151 * ...
 152 * r31 contains the popcount of the last (7th) byte of r8
 153
 154 ```
 155 # now those bits have been turned around, popcount and sum them
 156 setvl 0,0,8,0,1,1          # set MVL=VL=8
 157 sv.popcntd/sw=8 *24,*8     # do the (now transposed) popcount
 158 ```
 159
 160 With VL being hard-set to 8, we can now Vector-Parallel
 161 sum the partial results r24-31 into the array of accumulators
 162 r16-r23. There was no need to set VL=8 again because it was
 163 set already by the `setvl` above.
 164
 165 ```
 166 sv.add *16,*16,*24         # and accumulate in results
 167 ```
 168
 169 Finally, `sv.bc` loops back, subtracting VL from CTR.
 170 Being based on Power ISA Scalar Branch-Conditional, if
 171 CTR goes negative (which it will because VL was set
 172 hard-coded to 8, above) it does not matter, the loop
 173 will still be correctly exited. In other words we complete
 174 the correct number of *blocks* (of length 8).
 175
 176 ```
 177 # branch back if CTR still non-zero. works even though VL=8
 178 sv.bc/all 16, *0, -0x28   # reduce CTR by VL and stop if -ve
 179 ```
 180
 181 [[!tag svp64_cookbook ]]