bug 676: notes on maxloc algorithm, add python version for clarity

author Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Tue, 6 Feb 2024 14:15:56 +0000 (14:15 +0000)

committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Tue, 6 Feb 2024 14:15:56 +0000 (14:15 +0000)
author Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Tue, 6 Feb 2024 14:15:56 +0000 (14:15 +0000)
committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Tue, 6 Feb 2024 14:15:56 +0000 (14:15 +0000)
diff --git a/openpower/sv/cookbook/fortran_maxloc.mdwn b/openpower/sv/cookbook/fortran_maxloc.mdwn

index 367d0d12a2ec3173fbc46fdfb08c4ef9bb7aef4e..e9bb553d232f2275791d990532b5424126a11a11 100644 (file)
--- a/openpower/sv/cookbook/fortran_maxloc.mdwn
+++ b/openpower/sv/cookbook/fortran_maxloc.mdwn
@@ -89,6 +89,26 @@ search seems to be a common technique.
  
  <https://github.com/jvdd/argminmax/blob/main/src/simd/simd_u64.rs>
  
+**Python implementation**
+
+A variant on the c reference implementation allows for skipping
+of updating m (the largest value found so far), followed by
+always updating m whilst a batch of numbers is found that are
+(in their order of sequence) always continuously greater than
+all previous numbers.  The algorithm therefore flips back and
+forth between "skip whilst smaller" and "update whilst bigger",
+only updating the index during "bigger" sequences. This is key
+later when doing SVP64 assembler.
+
+```
+def m2(a): # array a
+    m, nm, i, n = 0, 0, 0, len(a)
+    while i<n:
+        while i<n and a[i]<=m: i += 1 # skip whilst smaller
+        while i<n and a[i]> m: m, nm, i = a[i], i, i+1
+    return nm;
+```
+
  # Implementation in SVP64 Assembler
  
  The core algorithm (inner part, in-register) is below: 11 instructions.
@@ -114,6 +134,27 @@ sv.creqv *16,*16,*16         # set mask on already-tested
  bc 12,0, -0x40               # CR0 lt bit clear, branch back
  ```
  
+`sv.cmp` can be used in the first while loop because m (r4, the current
+largest-found value so far) does not change.
+However `sv.minmax.` has to be used in the key while loop
+because r4 is sequentially replaced each time, and mapreduce (`/mr`)
+is used to request this rolling-chain (accumulator) mode.
+
+Also note that `i` (the `"while i<n"` part) is represented as an
+unary bitmask (in CR bits 16,20,24,28) which turns out to be easier to
+use than a binary index, as it can be used directly as a Predicate Mask
+(`/m=ge`).
+
+The algorithm works by excluding previous operations using `i-in-unary`,
+combined with VL being truncated due to use of Data-Dependent Fail-First.
+What therefore happens for example on the `sv.com/ff=gt/m=ge` operation
+is that it is *VL* (the Vector Length) that gets truncated to only
+contain those elements that are smaller than the current largest value
+found (`m` aka `r4`). Calling `sv.creqv` then sets **only** the
+CR field bits up to length `VL`, which on the next loop will exclude
+them because the Predicate Mask is `m=ge` (ok if the CR field bit is
+**zero**).
+
  [[!tag svp64_cookbook ]]
author	Luke Kenneth Casson Leighton <lkcl@lkcl.net>
	Tue, 6 Feb 2024 14:15:56 +0000 (14:15 +0000)
committer	Luke Kenneth Casson Leighton <lkcl@lkcl.net>
	Tue, 6 Feb 2024 14:15:56 +0000 (14:15 +0000)