bug 676: complete description of maxloc algorithm

author Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Mon, 12 Feb 2024 10:29:25 +0000 (10:29 +0000)

committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Mon, 12 Feb 2024 10:29:25 +0000 (10:29 +0000)
author Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Mon, 12 Feb 2024 10:29:25 +0000 (10:29 +0000)
committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Mon, 12 Feb 2024 10:29:25 +0000 (10:29 +0000)
diff --git a/openpower/sv/cookbook/fortran_maxloc.mdwn b/openpower/sv/cookbook/fortran_maxloc.mdwn

index c3ffca6c169d838f4acef93e37957506e9d0af0b..5bfb537e6f4a9fecea99ff8c578e112dbdbe170d 100644 (file)
--- a/openpower/sv/cookbook/fortran_maxloc.mdwn
+++ b/openpower/sv/cookbook/fortran_maxloc.mdwn
@@ -23,7 +23,7 @@ int maxloc(int * const restrict a, int n) {
              nm = i;
          }
      }
-    return nm; 
+    return nm;
  }
  ```
  
@@ -46,7 +46,7 @@ From stackexchange in ARM NEON intrinsics, one developer (Pavel P) wrote the
  subroutine below, explaining that it finds the index of a minimum value within
  a group of eight unsigned bytes. It is necessary to use a second outer loop
  to perform many of these searches in parallel, followed by conditionally
-offsetting each of the block-results. 
+offsetting each of the block-results.
  
  <https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>
  
@@ -118,20 +118,20 @@ more branch (outer loop).
  
  ```
  # while (i<n)
-setvl 2,0,4,0,1,1            # set MVL=4, VL=MIN(MVL,CTR)
+setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
  #    while (i<n and a[i]<=m) : i += 1
-sv.cmp/ff=gt/m=ge *0,0,*10,4 # truncates VL to min
-sv.creqv *16,*16,*16         # set mask on already-tested
-setvl 2,0,4,0,1,1            # set MVL=4, VL=MIN(MVL,CTR)
-mtcrf 128, 0                 # clear CR0 (in case VL=0?)
+sv.cmp/ff=gt/m=ge *0,0,*10,4       # truncates VL to min
+sv.creqv *16,*16,*16               # set mask on already-tested
+setvl 2,0,4,0,1,1                  # set MVL=4, VL=MIN(MVL,CTR)
+mtcrf 128, 0                       # clear CR0 (in case VL=0?)
  #    while (i<n and a[i]>m):
-sv.minmax./ff=le/m=ge 4,*10,4,1 # uses r4 as accumulator
-crternlogi 0,1,2,127         # test greater/equal or VL=0
-sv.crand *19,*16,0           # clear if CR0.eq=0
+sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # uses r4 as accumulator
+crternlogi 0,1,2,127               # test greater/equal or VL=0
+sv.crand *19,*16,0                 # clear if CR0.eq=0
  #      nm = i (count masked bits. could use crweirds here TODO)
-sv.svstep/mr/m=so 1, 0, 6, 1 # svstep: get vector dststep
-sv.creqv *16,*16,*16         # set mask on already-tested
-bc 12,0, -0x40               # CR0 lt bit clear, branch back
+sv.svstep/mr/m=so 1, 0, 6, 1       # svstep: get vector dststep
+sv.creqv *16,*16,*16               # set mask on already-tested
+bc 12,0, -0x40                     # CR0 lt bit clear, branch back
  ```
  
  `sv.cmp` can be used in the first while loop because m (r4, the current
@@ -185,6 +185,70 @@ which effectively performs a broadcast-splat-ANDing, as follows:
      CR7.SO = CR7.EQ AND CR0.EQ (if VL  = 4)
  ```
  
+**Converting the unary mask to binary**
+
+Now that the `CR4/5/6/7.SO` bits have been set, it is necessary to
+count them, i.e. convert an unary mask into a binary number. There
+are several ways to do this, one of which is
+the `crweird` suite of instructions, combined with `popcnt`. However
+there is a very straightforward way to it: use `sv.svstep`.
+
+```
+crternlogi 0,1,2,127
+
+    i ----> 0        1        2        3
+            CR4.EQ   CR5.EQ   CR6.EQ   CR7.EQ
+             & CR0     & CR0    & CR0     & CR0
+             |         |        |         |
+             v         v        v         v
+            CR4.SO   CR5.SO   CR6.SO   CR7.SO
+
+sv.svstep/mr/m=so 1, 0, 6, 1
+
+             |         |        |         |
+    count <--+---------+--------+---------+
+```
+
+In reality what is happening is that svstep is requested to return
+the current `dststep` (destination step index), into a scalar 
+destination (`RT=r1`), but in "mapreduce" mode.  Mapreduce mode
+will keep running as if the destination was a Vector, overwriting
+previously-written results. Normally, one of the *sources* would
+be the same register (`RA=r1` for example) which would turn r1 into
+an accumulator, however in this particular case we simply consistently
+overwrite `r1` and end up with the last `dststep`.
+
+There `is` an alternative to this approach: `getvl` followed by
+subtracting 1.  `VL` being the total Vector Length as set by 
+Data-Dependent Fail-First, is the *length* of the Vector, whereas
+what we actually want is the index of the last element: hence
+using `sv.svstep`. Given that using `getvl` followed by `addi -1`
+is two instructions rather than one, we use `sv.svstep` instead,
+although they are the same amount of words (SVP64 is 64-bit sized).
+
+Lastly the `creqv` sets up the mask (based on the current
+Vector Length), and the loop-back (`bc`) jumps to reset the
+Vector Length back to the maximum (4). Thus, the mask (CR4/5/6/7 EQ
+bit) which began as empty will exclude nothing but on subsequent
+loops will exclude previously-tested elements (up to the previous
+value of VL) before it was reset back to the maximum.
+
+This is actually extremely important to note, here, because the
+Data-Dependent Fail-First testing starts at the **first non-masked**
+element, not the first element.  This fact is exploited here, and
+the only thing to contend with is if VL is ever set to zero
+(first element fails). If that happens then CR0 will never get set,
+(because no element succeeded and the Vector Length is truncated to zero)
+hence the need to clear CR0 prior to starting that instruction.
+
+Overall this algorithm is extremely short but quite complex in its
+intricate use of SVP64 capabilities, yet still remains parallelliseable
+in hardware. It could potentially be shorter: there may be opportunities
+to use predicate masking with the CR field manipulation. However
+the number of instructions is already at the point where, even when the
+LOAD outer loop is added it will still remain short enough to fit in 
+a single line of L1 Cache.
+
  [[!tag svp64_cookbook ]]
author	Luke Kenneth Casson Leighton <lkcl@lkcl.net>
	Mon, 12 Feb 2024 10:29:25 +0000 (10:29 +0000)
committer	Luke Kenneth Casson Leighton <lkcl@lkcl.net>
	Mon, 12 Feb 2024 10:29:25 +0000 (10:29 +0000)