From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Tue, 6 Feb 2024 15:19:36 +0000 (+0000)
Subject: bug 676 more on maxloc
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=9cfcf69bf157ddfed7a9f209d4532e9f54a4117c;p=libreriscv.git

bug 676 more on maxloc
---
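
As a rough aid to reading the hunks below, here is a minimal Python sketch
of three ideas the patched text relies on: the `m2` reference loop, the
truncation of VL by Data-Dependent Fail-First, and the Vector-AND-Scalar
"broadcast" performed by `sv.crand`. The helper names `ddff_truncate` and
`broadcast_and` are hypothetical, and the fail-first model assumes the
failing element is excluded from the new VL; neither is taken from the
SVP64 specification.

```python
# Simplified, illustrative models only: the helper names and exact
# semantics are assumptions, not SVP64 definitions.

def m2(a):
    """maxloc as in the cookbook text: index of the first maximum.
    m starts at 0, so an array with no positive values returns 0."""
    m, nm, i, n = 0, 0, 0, len(a)
    while i < n:
        while i < n and a[i] <= m:   # skip whilst smaller/equal
            i += 1
        while i < n and a[i] > m:    # only whilst bigger
            m, nm, i = a[i], i, i + 1
    return nm


def ddff_truncate(elements, test):
    """Model of Data-Dependent Fail-First (hypothetical helper).
    Elements are tried in order, up to the current VL (len(elements)).
    Returns the truncated VL: the number of elements that passed before
    the first failure, which may be zero.
    Assumption: the failing element is NOT included in the new VL."""
    for count, value in enumerate(elements):
        if not test(value):
            return count
    return len(elements)


def broadcast_and(vec_bits, scalar_bit, vl):
    """Model of a Vector AND Scalar (hypothetical helper): each of the
    first vl vector bits is ANDed with the same single scalar bit."""
    return [bool(bit) and bool(scalar_bit) for bit in vec_bits[:vl]]
```

For example, `ddff_truncate(a, lambda x: x <= m)` plays the role of the
first inner `while` loop of `m2` over one batch of elements, and
`broadcast_and(eq_bits, cr0_eq, vl)` mirrors the four-line
`CRn.SO = CRn.EQ AND CR0.EQ` listing in the last hunk.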

diff --git a/openpower/sv/cookbook/fortran_maxloc.mdwn b/openpower/sv/cookbook/fortran_maxloc.mdwn
index e9bb553d2..c3ffca6c1 100644
--- a/openpower/sv/cookbook/fortran_maxloc.mdwn
+++ b/openpower/sv/cookbook/fortran_maxloc.mdwn
@@ -104,9 +104,9 @@ later when doing SVP64 assembler.
 def m2(a): # array a
     m, nm, i, n = 0, 0, 0, len(a)
     while i<n:
-        while i<n and a[i]<=m: i += 1 # skip whilst smaller
-        while i<n and a[i]> m: m, nm, i = a[i], i, i+1
-    return nm;
+        while i<n and a[i]<=m: i += 1              # skip whilst smaller/equal
+        while i<n and a[i]> m: m,nm,i = a[i],i,i+1 # only whilst bigger
+    return nm
 ```
 
 # Implementation in SVP64 Assembler
@@ -147,7 +147,7 @@ use than a binary index, as it can be used directly as a Predicate Mask
 
 The algorithm works by excluding previous operations using `i-in-unary`,
 combined with VL being truncated due to use of Data-Dependent Fail-First.
-What therefore happens for example on the `sv.com/ff=gt/m=ge` operation
+What therefore happens for example on the `sv.cmp/ff=gt/m=ge` operation
 is that it is *VL* (the Vector Length) that gets truncated to only
 contain those elements that are smaller than the current largest value
 found (`m` aka `r4`). Calling `sv.creqv` then sets **only** the
@@ -155,6 +155,36 @@ CR field bits up to length `VL`, which on the next loop will exclude
 them because the Predicate Mask is `m=ge` (ok if the CR field bit is
 **zero**).
 
+Data-Dependent Fail-First therefore attempts *up to* the current
+Vector Length of elements and, on detecting the first failure,
+truncates VL at that point. In effect this is speculative sequential
+execution of `while (i<n and a[i]<=m) : i += 1`.
+
+Next comes the `sv.minmax.` which covers the `while (i<n and a[i]>m)`
+loop, again in a single instruction, but this time it is a little more
+involved.  Firstly: mapreduce mode is used, with `r4` as both source
+and destination, so that `r4` acts as the sequential accumulator. Secondly,
+again it is masked (`m=ge`), which again excludes testing of previously-tested
+elements.  The next few instructions extract the information provided
+by Vector Length (VL) being truncated - potentially even to zero!
+(Note that `mtcrf 128,0` takes care of the possibility of VL=0: if
+that happens then CR0 would otherwise be left in its previous state, a
+very much undesirable behaviour!)
+
+`crternlogi 0,1,2,127` will combine the setting of CR0.EQ and CR0.LT
+to give us a true Greater-than-or-equal, including under the circumstance
+where VL=0. The `sv.crand` will then take a copy of the `i-in-unary`
+mask, but only when CR0.EQ is set. This is why the third operand `BB`
+is a Scalar not a Vector (BT=16/Vector, BA=19/Vector, BB=0/Scalar)
+which effectively performs a broadcast-splat-ANDing, as follows:
+
+```
+    CR4.SO = CR4.EQ AND CR0.EQ (if VL >= 1)
+    CR5.SO = CR5.EQ AND CR0.EQ (if VL >= 2)
+    CR6.SO = CR6.EQ AND CR0.EQ (if VL >= 3)
+    CR7.SO = CR7.EQ AND CR0.EQ (if VL  = 4)
+```
+
 [[!tag svp64_cookbook ]]