From 0bd5a7b00eca0dbd33345f21c9ff151ca77cf134 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Mon, 12 Feb 2024 10:29:25 +0000
Subject: [PATCH] bug 676: complete description of maxloc algorithm

---
 openpower/sv/cookbook/fortran_maxloc.mdwn | 90 +++++++++++++++++++----
 1 file changed, 77 insertions(+), 13 deletions(-)

diff --git a/openpower/sv/cookbook/fortran_maxloc.mdwn b/openpower/sv/cookbook/fortran_maxloc.mdwn
index c3ffca6c1..5bfb537e6 100644
--- a/openpower/sv/cookbook/fortran_maxloc.mdwn
+++ b/openpower/sv/cookbook/fortran_maxloc.mdwn
@@ -23,7 +23,7 @@ int maxloc(int * const restrict a, int n) {
             nm = i;
         }
     }
-    return nm;
+    return nm;
 }
 ```
@@ -46,7 +46,7 @@ From stackexchange in ARM NEON intrinsics, one developer (Pavel P) wrote the
 subroutine below, explaining that it finds the index of a minimum value within
 a group of eight unsigned bytes. It is necessary to use a second outer loop
 to perform many of these searches in parallel, followed by conditionally
-offsetting each of the block-results.
+offsetting each of the block-results.
@@ -118,20 +118,20 @@ more branch (outer loop).
 ```
 # while (im):
-sv.minmax./ff=le/m=ge 4,*10,4,1 # uses r4 as accumulator
-crternlogi 0,1,2,127            # test greater/equal or VL=0
-sv.crand *19,*16,0              # clear if CR0.eq=0
+sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # uses r4 as accumulator
+crternlogi 0,1,2,127               # test greater/equal or VL=0
+sv.crand *19,*16,0                 # clear if CR0.eq=0
 # nm = i (count masked bits. could use crweirds here TODO)
-sv.svstep/mr/m=so 1, 0, 6, 1 # svstep: get vector dststep
-sv.creqv *16,*16,*16         # set mask on already-tested
-bc 12,0, -0x40               # CR0 lt bit clear, branch back
+sv.svstep/mr/m=so 1, 0, 6, 1       # svstep: get vector dststep
+sv.creqv *16,*16,*16               # set mask on already-tested
+bc 12,0, -0x40                     # CR0 lt bit clear, branch back
 ```
 `sv.cmp` can be used in the first while loop because m (r4, the current
@@ -185,6 +185,70 @@ which effectively performs a broadcast-splat-ANDing, as follows:
     CR7.SO = CR7.EQ AND CR0.EQ (if VL = 4)
 ```
+**Converting the unary mask to binary**
+
+Now that the `CR4/5/6/7.SO` bits have been set, it is necessary to
+count them, i.e. convert a unary mask into a binary number. There
+are several ways to do this, one of which is
+the `crweird` suite of instructions, combined with `popcnt`. However
+there is a very straightforward way to do it: use `sv.svstep`.
+
+```
+crternlogi 0,1,2,127
+
+  i ---->  0         1        2         3
+         CR4.EQ    CR5.EQ   CR6.EQ    CR7.EQ
+         & CR0     & CR0    & CR0     & CR0
+           |         |        |         |
+           v         v        v         v
+         CR4.SO    CR5.SO   CR6.SO    CR7.SO
+
+sv.svstep/mr/m=so 1, 0, 6, 1
+
+           |         |        |         |
+  count <--+---------+--------+---------+
+```
+
+In reality what is happening is that svstep is requested to return
+the current `dststep` (destination step index), into a scalar
+destination (`RT=r1`), but in "mapreduce" mode. Mapreduce mode
+will keep running as if the destination was a Vector, overwriting
+previously-written results. Normally, one of the *sources* would
+be the same register (`RA=r1` for example) which would turn r1 into
+an accumulator, however in this particular case we simply keep
+overwriting `r1` and end up with the last `dststep`.
+
+There *is* an alternative to this approach: `getvl` followed by
+subtracting 1. `VL`, being the total Vector Length as set by
+Data-Dependent Fail-First, is the *length* of the Vector, whereas
+what we actually want is the index of the last element: hence
+using `sv.svstep`. Given that using `getvl` followed by `addi -1`
+is two instructions rather than one, we use `sv.svstep` instead,
+although both take the same number of words (SVP64 is 64-bit sized).
+
+Lastly the `creqv` sets up the mask (based on the current
+Vector Length), and the loop-back (`bc`) jumps to reset the
+Vector Length back to the maximum (4). Thus, the mask (CR4/5/6/7 EQ
+bit) which began as empty will exclude nothing, but on subsequent
+loops will exclude previously-tested elements (up to the previous
+value of VL) before it was reset back to the maximum.
+
+This is extremely important to note, because the
+Data-Dependent Fail-First testing starts at the **first non-masked**
+element, not the first element. This fact is exploited here, and
+the only thing to contend with is if VL is ever set to zero
+(first element fails). If that happens then CR0 will never get set
+(because no element succeeded and the Vector Length is truncated to zero),
+hence the need to clear CR0 prior to starting that instruction.
+
+Overall this algorithm is extremely short but quite complex in its
+intricate use of SVP64 capabilities, yet still remains parallelisable
+in hardware. It could potentially be shorter: there may be opportunities
+to use predicate masking with the CR field manipulation. However
+the number of instructions is already at the point where, even when the
+LOAD outer loop is added, it will still remain short enough to fit in
+a single line of L1 Cache.
+
 [[!tag svp64_cookbook ]]
-- 
2.30.2
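
Reviewer note on the "Converting the unary mask to binary" text added by this patch: the effect described for `sv.svstep/mr/m=so` can be sketched in plain C as below. This is only an illustrative model under stated assumptions, not Libre-SOC simulator code. The helper names `last_masked_step` and `vl_minus_one` are hypothetical, the SO bits of the CR fields are modelled as bits of an integer, and it is assumed that masked-out elements are skipped and leave `r1` untouched.

```c
#include <assert.h>

/* Illustrative model only (hypothetical helper, not Libre-SOC code).
 * Bit i of so_mask stands for the SO bit of the CR field that
 * predicates element i. r1 is the scalar destination register. */
int last_masked_step(unsigned so_mask, int vl, int r1)
{
    assert(vl >= 0 && vl <= 32);
    for (int dststep = 0; dststep < vl; dststep++) {
        if (so_mask & (1u << dststep)) {
            /* Scalar destination in mapreduce mode: every enabled
             * element overwrites the previous result. */
            r1 = dststep;
        }
    }
    return r1; /* index of the last enabled element, if any */
}

/* The alternative named in the text: getvl, then subtract 1. */
int vl_minus_one(int vl)
{
    return vl - 1;
}
```

With all elements up to `VL` enabled, both helpers give the same index. The first one costs a single instruction in the SVP64 version, which is the reason the patch text gives for preferring `sv.svstep`.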