From 4a3ebb8156b53109db6cae6847088906c54ed55a Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Tue, 5 Dec 2023 15:29:52 +0000 Subject: [PATCH] more instruction explanation on pospopcount, bug #672 --- openpower/sv/cookbook/pospopcnt.mdwn | 63 +++++++++++++++++++++++++++- 1 file changed, 62 insertions(+), 1 deletion(-) diff --git a/openpower/sv/cookbook/pospopcnt.mdwn b/openpower/sv/cookbook/pospopcnt.mdwn index 9e0046463..bf3a6880c 100644 --- a/openpower/sv/cookbook/pospopcnt.mdwn +++ b/openpower/sv/cookbook/pospopcnt.mdwn @@ -80,27 +80,88 @@ this is desirable, and avoids a `mfspr` instruction to take a copy of VL into a GPR. ``` -# VL = MIN(CTR,MAXVL=8) +# VL = r3 = MIN(CTR,MAXVL=8) setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR) ``` +Next the data must be loaded in batches (of up to QTY 8of +8-bit values). We must also not over-run the end of the +array (which a normal SIMD ISA would do, potentially causing +a pagefault) and this is achieved by when CTR <= 8: VL +(Vector Length) in such circumstances being set to the +remaining elements. + +The next instruction is `sv.lbzu/pi` - a Post-Increment +variant of `lbzu` which updates the Effective Address (in RA) +**after** it has been used, not before. Also of note is the +element-width override of `dw=8` which writes to *individual +bytes* of r6, not to r6/7/8...13. + +Of note here however is the precursor `addi` instruction, which +clears out the contents of r6 to zero before beginning the +Vector/Looped Load sequence. This is *important* because +SV does **not** zero-out parts of a register when element-width +overrides are in use. + ``` # load VL bytes (update r4 addr) but compressed (dw=8) addi 6, 0, 0 # initialise all 64-bits of r6 to zero sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well ``` +The next instruction is quite straightforward, and has the +effect of turning (transposing) 8 values. Each bit zero +of each input byte is placed into the first byte; each bit one +of each input byte is placed into the second byte and so on. + +(*This instruction in VSX is called `vgbbd`, and it is present +in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`, +x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others +all have this extremely powerful and useful instruction*) + ``` # gather performs the transpose (which gets us to positional..) gbbd 8,6 ``` +Now the bits have been transposed in bytes, it is plain sailing +to perform QTY8 parallel popcounts. However there is a trick +going on: we have set VL=MVL=8. To explain this: it covers the +last case where CTR may be between 1 and 8. + +Remember what happened back at the Vector-Load, where r6 +was cleared to zero before-hand? This filled out the 8x8 transposed +grid (`gbbd`) fully with zeros prior to the actual transpose. +Now when we do the popcount, there will be upper-numbered +columns that are *not part of the result* that contain *zero* +and *consequently have no impact on the result*. + +This elegant trick extends even to the accumulation of the +results. However before we get there it is necessary to +describe why `sw=8` has been added to `sv.popcntd`. What +this is doing is treating each **byte** of its input +(starting at the first byte of r8) as an independent +quantity to be popcounted, but the result is *zero-extended* +to 64-bit and stored in r24, r25, r26 ... r31. + +Therefore: + +* r24 contains the popcount of the first byte of r8 +* r25 contains the popcount of the second byte of r8 +* ... +* r31 contains the popcount of the last (7th) byte of r8 + ``` # now those bits have been turned around, popcount and sum them setvl 0,0,8,0,1,1 # set MVL=VL=8 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount ``` +With VL being hard-set to 8, we can now Vector-Parallel +sum the partial results r24-31 into the array of accumulators +r16-r23. There was no need to set VL=8 again because it was +set already by the `setvl` above. + ``` sv.add *16,*16,*24 # and accumulate in results ``` -- 2.30.2