From 0bd5a7b00eca0dbd33345f21c9ff151ca77cf134 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Mon, 12 Feb 2024 10:29:25 +0000
Subject: [PATCH] bug 676: complete description of maxloc algorithm

---
 openpower/sv/cookbook/fortran_maxloc.mdwn | 90 +++++++++++++++++++----
 1 file changed, 77 insertions(+), 13 deletions(-)
diff --git a/openpower/sv/cookbook/fortran_maxloc.mdwn b/openpower/sv/cookbook/fortran_maxloc.mdwn
index c3ffca6c1..5bfb537e6 100644
--- a/openpower/sv/cookbook/fortran_maxloc.mdwn
+++ b/openpower/sv/cookbook/fortran_maxloc.mdwn
@@ -23,7 +23,7 @@ int maxloc(int * const restrict a, int n) {
             nm = i;
         }
     }
-    return nm; 
+    return nm;
 }
 ```
 
@@ -46,7 +46,7 @@ From stackexchange in ARM NEON intrinsics, one developer (Pavel P) wrote the
 subroutine below, explaining that it finds the index of a minimum value within
 a group of eight unsigned bytes. It is necessary to use a second outer loop
 to perform many of these searches in parallel, followed by conditionally
-offsetting each of the block-results. 
+offsetting each of the block-results.
 
 <https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>
 
@@ -118,20 +118,20 @@ more branch (outer loop).
 
 ```
 # while (i<n)
-setvl 2,0,4,0,1,1            # set MVL=4, VL=MIN(MVL,CTR)
+setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
 #    while (i<n and a[i]<=m) : i += 1
-sv.cmp/ff=gt/m=ge *0,0,*10,4 # truncates VL to min
-sv.creqv *16,*16,*16         # set mask on already-tested
-setvl 2,0,4,0,1,1            # set MVL=4, VL=MIN(MVL,CTR)
-mtcrf 128, 0                 # clear CR0 (in case VL=0?)
+sv.cmp/ff=gt/m=ge *0,0,*10,4       # truncates VL to min
+sv.creqv *16,*16,*16               # set mask on already-tested
+setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
+mtcrf 128, 0                       # clear CR0 (in case VL=0?)
 #    while (i<n and a[i]>m):
-sv.minmax./ff=le/m=ge 4,*10,4,1 # uses r4 as accumulator
-crternlogi 0,1,2,127         # test greater/equal or VL=0
-sv.crand *19,*16,0           # clear if CR0.eq=0
+sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # uses r4 as accumulator
+crternlogi 0,1,2,127               # test greater/equal or VL=0
+sv.crand *19,*16,0                 # clear if CR0.eq=0
 #      nm = i (count masked bits. could use crweirds here TODO)
-sv.svstep/mr/m=so 1, 0, 6, 1 # svstep: get vector dststep
-sv.creqv *16,*16,*16         # set mask on already-tested
-bc 12,0, -0x40               # CR0 lt bit clear, branch back
+sv.svstep/mr/m=so 1, 0, 6, 1       # svstep: get vector dststep
+sv.creqv *16,*16,*16               # set mask on already-tested
+bc 12,0, -0x40                     # CR0 lt bit clear, branch back
 ```
 
 `sv.cmp` can be used in the first while loop because m (r4, the current
@@ -185,6 +185,70 @@ which effectively performs a broadcast-splat-ANDing, as follows:
     CR7.SO = CR7.EQ AND CR0.EQ (if VL  = 4)
 ```
 
+**Converting the unary mask to binary**
+
+Now that the `CR4/5/6/7.SO` bits have been set, it is necessary to
+count them, i.e. convert an unary mask into a binary number. There
+are several ways to do this, one of which is
+the `crweird` suite of instructions, combined with `popcnt`. However
+there is a very straightforward way to it: use `sv.svstep`.
+
+```
+crternlogi 0,1,2,127
+
+    i ----> 0        1        2        3
+            CR4.EQ   CR5.EQ   CR6.EQ   CR7.EQ
+             & CR0     & CR0    & CR0     & CR0
+             |         |        |         |
+             v         v        v         v
+            CR4.SO   CR5.SO   CR6.SO   CR7.SO
+
+sv.svstep/mr/m=so 1, 0, 6, 1
+
+             |         |        |         |
+    count <--+---------+--------+---------+
+```
+
+In reality what is happening is that svstep is requested to return
+the current `dststep` (destination step index), into a scalar 
+destination (`RT=r1`), but in "mapreduce" mode.  Mapreduce mode
+will keep running as if the destination was a Vector, overwriting
+previously-written results. Normally, one of the *sources* would
+be the same register (`RA=r1` for example) which would turn r1 into
+an accumulator, however in this particular case we simply consistently
+overwrite `r1` and end up with the last `dststep`.
+
+There `is` an alternative to this approach: `getvl` followed by
+subtracting 1.  `VL` being the total Vector Length as set by 
+Data-Dependent Fail-First, is the *length* of the Vector, whereas
+what we actually want is the index of the last element: hence
+using `sv.svstep`. Given that using `getvl` followed by `addi -1`
+is two instructions rather than one, we use `sv.svstep` instead,
+although they are the same amount of words (SVP64 is 64-bit sized).
+
+Lastly the `creqv` sets up the mask (based on the current
+Vector Length), and the loop-back (`bc`) jumps to reset the
+Vector Length back to the maximum (4). Thus, the mask (CR4/5/6/7 EQ
+bit) which began as empty will exclude nothing but on subsequent
+loops will exclude previously-tested elements (up to the previous
+value of VL) before it was reset back to the maximum.
+
+This is actually extremely important to note, here, because the
+Data-Dependent Fail-First testing starts at the **first non-masked**
+element, not the first element.  This fact is exploited here, and
+the only thing to contend with is if VL is ever set to zero
+(first element fails). If that happens then CR0 will never get set,
+(because no element succeeded and the Vector Length is truncated to zero)
+hence the need to clear CR0 prior to starting that instruction.
+
+Overall this algorithm is extremely short but quite complex in its
+intricate use of SVP64 capabilities, yet still remains parallelliseable
+in hardware. It could potentially be shorter: there may be opportunities
+to use predicate masking with the CR field manipulation. However
+the number of instructions is already at the point where, even when the
+LOAD outer loop is added it will still remain short enough to fit in 
+a single line of L1 Cache.
+
 [[!tag svp64_cookbook ]]
 
 
-- 
2.30.2