+Now that the bits have been transposed into bytes, it is plain sailing
+to perform eight parallel popcounts. However there is a trick
+going on: VL=MVL=8 is set regardless of how many elements remain.
+This covers the last case, where CTR may be anywhere between 1 and 8.
+
+Remember what happened back at the Vector-Load, where r6
+was cleared to zero beforehand? This filled out the 8x8 transposed
+grid (`gbbd`) fully with zeros prior to the actual transpose.
+Now when we do the popcount, there will be upper-numbered
+columns that are *not part of the result* that contain *zero*
+and *consequently have no impact on the result*.
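+
+The effect of the zero padding can be checked with plain arithmetic.
+The following is only a minimal Python sketch, not a simulation of
+`gbbd` or of the Vector-Load: the helper names `popcount8` and
+`padded_popcounts` are hypothetical, and the grid is modelled simply
+as a list of eight byte values.

```python
def popcount8(byte):
    """Count the set bits in one 8-bit value."""
    return bin(byte & 0xFF).count("1")


def padded_popcounts(data, width=8):
    """Popcount `width` bytes, padding missing entries with zero.

    Models the case where fewer than `width` elements were loaded
    but all `width` popcounts are still performed (VL fixed at 8).
    """
    padded = list(data) + [0] * (width - len(data))
    return [popcount8(b) for b in padded]


# Only three valid bytes; the remaining five entries are zero padding.
counts = padded_popcounts([0xFF, 0x0F, 0x01])
print(counts)       # [8, 4, 1, 0, 0, 0, 0, 0]
print(sum(counts))  # 13, identical to the sum over the 3 valid bytes
```

+The padded entries each yield a popcount of zero, so performing all
+eight operations gives the same total as performing only the valid ones.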
+
+This elegant trick extends even to the accumulation of the
+results. However before we get there it is necessary to
+describe why `sw=8` has been added to `sv.popcntd`. It makes the
+instruction treat each **byte** of its input
+(starting at the first byte of r8) as an independent
+quantity to be popcounted, with each result *zero-extended*
+to 64-bit and stored in r24, r25, r26 ... r31.
+
+Therefore:
+
+* r24 contains the popcount of the first byte of r8
+* r25 contains the popcount of the second byte of r8
+* ...
+* r31 contains the popcount of the last (8th) byte of r8
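+
+The register assignment listed above can be sketched in Python.
+This is an illustrative model under stated assumptions, not the
+reference behaviour of `sv.popcntd`: the function name `popcntd_sw8`
+is hypothetical, and "first byte" is assumed to mean the
+least-significant byte of r8.

```python
MASK64 = (1 << 64) - 1  # destination elements are 64 bits wide


def popcntd_sw8(r8, vl=8):
    """Model of a byte-wise popcount with 64-bit zero-extended results.

    Assumption: source element i is byte i of r8, counting from the
    least-significant byte. Returns a list of vl values standing in
    for the destination registers r24, r25, ... r31.
    """
    results = []
    for i in range(vl):
        byte = (r8 >> (8 * i)) & 0xFF        # select source byte i
        count = bin(byte).count("1")         # popcount of that byte only
        results.append(count & MASK64)       # zero-extended 64-bit result
    return results


regs = popcntd_sw8(0x8000_0000_0000_FF03)
for regnum, value in zip(range(24, 32), regs):
    print(f"r{regnum} = {value}")
# r24 = 2, r25 = 8, r26..r30 = 0, r31 = 1
```

+Each destination holds a value between 0 and 8, so the upper 56 bits
+of every result register are always zero.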
+