+**Converting the unary mask to binary**
+
+Now that the `CR4/5/6/7.SO` bits have been set, it is necessary to
+count them, i.e. convert a unary mask into a binary number. There
+are several ways to do this, one of which is
+the `crweird` suite of instructions, combined with `popcnt`. However
+there is a very straightforward way to do it: use `sv.svstep`.
+
+```
+crternlogi 0,1,2,127
+
+ i ----> 0 1 2 3
+ CR4.EQ CR5.EQ CR6.EQ CR7.EQ
+ & CR0 & CR0 & CR0 & CR0
+ | | | |
+ v v v v
+ CR4.SO CR5.SO CR6.SO CR7.SO
+
+sv.svstep/mr/m=so 1, 0, 6, 1
+
+ | | | |
+ count <--+---------+--------+---------+
+```
+
+What is actually happening is that `sv.svstep` is asked to return
+the current `dststep` (destination step index) into a scalar
+destination (`RT=r1`), but in "mapreduce" mode. Mapreduce mode
+keeps running as if the destination were a Vector, overwriting
+previously-written results. Normally one of the *sources* would
+be the same register (`RA=r1` for example), which would turn `r1` into
+an accumulator. In this particular case, however, `r1` is simply
+overwritten on every step, and it ends up holding the last `dststep`.
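+
+The difference between overwriting and accumulating can be sketched in
+Python. This is a simplified model rather than the SVP64 specification:
+`run_scalar_dest`, `update` and the example mask contents are
+hypothetical names and values chosen for illustration, and the model
+assumes that an element is only executed when its mask bit is set.
+
```python
# Simplified Python model (not the SVP64 specification) of a scalar
# destination that is rewritten on every unmasked element step.
def run_scalar_dest(vl, mask, update, r1=0):
    """Apply update(r1, step) for each step < vl whose mask bit is set."""
    for step in range(vl):
        if mask[step]:
            r1 = update(r1, step)
    return r1


# Hypothetical contents of CR4/5/6/7.SO: a unary mask with three bits set.
unary_mask = [1, 1, 1, 0]

# Overwrite: the previous r1 is ignored, so the last step index survives.
last_step = run_scalar_dest(4, unary_mask, lambda r1, step: step)

# Accumulate: r1 is also a source, so it sums across the elements.
count = run_scalar_dest(4, unary_mask, lambda r1, step: r1 + 1)

print(last_step, count)  # prints "2 3"
```
+
+In this model the overwriting form yields the index of the last
+unmasked element, whereas the accumulating form yields the number of
+bits set.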
+
+There *is* an alternative to this approach: `getvl` followed by
+subtracting 1. `VL`, being the total Vector Length as set by
+Data-Dependent Fail-First, is the *length* of the Vector, whereas
+what we actually want is the index of the last element. Since
+`getvl` followed by `addi -1` is two instructions rather than one,
+we use `sv.svstep` instead, although both take the same number of
+words (an SVP64 instruction is 64 bits).
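+
+As a quick sanity check, the two approaches can be compared in a
+small Python model. The function names are hypothetical, the model
+assumes every element up to `VL` is unmasked, and the `VL=0` case is
+deliberately left out because it is not covered by this sketch.
+
```python
# Simplified comparison (an illustrative model, not hardware behaviour).
def last_index_via_step(vl):
    """Keep overwriting r1 with the step index, as the svstep form does."""
    r1 = 0
    for dststep in range(vl):
        r1 = dststep
    return r1


def last_index_via_vl(vl):
    """Read the Vector Length, then subtract one (getvl, then addi -1)."""
    return vl - 1


# Compare the two for every non-zero Vector Length up to the maximum of 4.
results = [(vl, last_index_via_step(vl), last_index_via_vl(vl))
           for vl in range(1, 5)]
for vl, via_step, via_vl in results:
    print(vl, via_step, via_vl)
```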
+
+Lastly the `creqv` sets up the mask (based on the current
+Vector Length), and the loop-back (`bc`) jumps back to where the
+Vector Length is reset to the maximum (4). Thus the mask (the CR4/5/6/7
+EQ bits), which began empty, excludes nothing on the first iteration,
+but on subsequent loops excludes previously-tested elements (up to
+the value that VL had before it was reset back to the maximum).
+
+This is extremely important to note, because
+Data-Dependent Fail-First testing starts at the **first non-masked**
+element, not the first element. This fact is exploited here, and
+the only thing to contend with is VL being set to zero
+(the first element fails). If that happens then CR0 will never get set
+(because no element succeeded and the Vector Length is truncated to zero),
+hence the need to clear CR0 prior to starting that instruction.
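+
+The interaction between masking and fail-first can be illustrated
+with a simplified Python model. It is a sketch under stated
+assumptions, not the SVP64 specification: `fail_first`, `passes` and
+the example data are hypothetical, and the model assumes that the
+first failing unmasked element truncates VL to its own index
+(exclusive truncation).
+
```python
# Simplified model of Data-Dependent Fail-First with a predicate mask.
# Assumption: the failing element itself is excluded from the new VL.
def fail_first(values, mask, passes, max_vl=4):
    """Return (new_vl, passed) where passed lists the elements that succeeded."""
    passed = []
    for i in range(max_vl):
        if not mask[i]:
            continue  # masked-out elements are never tested
        if not passes(values[i]):
            return i, passed  # truncate VL at the first failure
        passed.append(i)
    return max_vl, passed


def nonzero(x):
    return x != 0


data = [5, 0, 7, 0]  # hypothetical element values

# Nothing masked out: testing starts at element 0 and stops at element 1.
first_pass = fail_first(data, [1, 1, 1, 1], nonzero)

# Elements 0 and 1 masked out: testing starts at element 2 instead.
second_pass = fail_first(data, [0, 0, 1, 1], nonzero)

# The very first element fails: VL becomes 0 and nothing succeeds, so in
# this model no result is recorded at all.
zero_case = fail_first([0, 5, 7, 9], [1, 1, 1, 1], nonzero)

print(first_pass, second_pass, zero_case)
```
+
+In the last case the list of successful elements is empty, which is
+the situation in which a flag left over from an earlier instruction
+would otherwise go unnoticed.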
+
+Overall this algorithm is extremely short but quite complex in its
+intricate use of SVP64 capabilities, yet it still remains parallelisable
+in hardware. It could potentially be shorter: there may be opportunities
+to use predicate masking with the CR field manipulation. However
+the number of instructions is already at the point where, even when the
+LOAD outer loop is added, it will still remain short enough to fit in
+a single line of L1 Cache.
+