From: lkcl
Date: Sun, 15 Oct 2023 17:11:06 +0000 (+0100)
Subject: (no commit message)
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=d4bd4fb2fb40943a71f59f77c8c5178992a6075e;p=libreriscv.git

---

diff --git a/simple_v_extension/daxpy_example.mdwn b/simple_v_extension/daxpy_example.mdwn
index 5acea7e9f..2c455efa2 100644
--- a/simple_v_extension/daxpy_example.mdwn
+++ b/simple_v_extension/daxpy_example.mdwn
@@ -11,33 +11,79 @@ Summary
 
 | ISA | total | loop | words | notes |
 |-----|-------|------|-------|-------|
-| SVP64 | 9 | 7 | 14 | 5 64-bit, 4 32-bit |
+| SVP64 | 8 | 6 | 13 | 5 64-bit, 3 32-bit |
 | RVV | 13 | 11 | 9.5 | 7 32-bit, 5 16-bit |
 | SVE | 12 | 7 | 12 | all 32-bit |
 
 # SVP64 Power ISA version
 
-Relies on post-increment, relies on no overlap between x and y
+The first instruction is simple: the plan is to use CTR for looping.
+Therefore, copy n (r5) into CTR. What happens next, at the start of
+the loop (.L2), is not so obvious: MAXVL is being set to 32
+elements, but at the same time VL is being set to MIN(MAXVL,CTR).
+
+This algorithm
+relies on post-increment, relies on no overlap between x and y
 in memory, and critically relies on y overwrite. x is post-incremented
-when read, but y is post-incremented on write. Element-Strided
-ensures the Immediate (8) results in a contiguous LD (or store)
-despite RA being marked Scalar (*without* modifying RA, on `sv.lfd/els`).
-For `sv.lfdup`, RA is Scalar so that only one
-LD/ST Update "wins": the last write to RA is the address for
-the next block.
+when read, but y is post-incremented on write. Load/Store Post-Increment
+is a new Draft set of instructions for the Scalar Subsets, which save
+having to pre-subtract an offset before running the loop.
+
+For `sv.lfdup`, RA is Scalar, so it continuously accumulates
+additions of the immediate (8):
+the last write to RA is the address for
+the next block (the next time round the CTR loop).
+To understand this it is necessary to appreciate that
+SVP64 behaves as if a sequence of loop-unrolled scalar instructions were
+issued. With every instruction in that sequence writing the new version of RA
+before the next element-instruction, the end result is identical
+in effect to Element-Strided, except that RA points to the start
+of the next batch.
+
+Use of Element-Strided on `sv.lfd/els`
+ensures the Immediate (8) results in a contiguous LD
+*without* modifying RA.
+The first element is loaded from RA, the second from RA+8, the third
+from RA+16 and so on. However, unlike with `sv.lfdup`, RA remains
+pointing at the current block of the y array being processed.
+
+With a part of both x and y loaded into batches of FPR
+registers starting at 32 and 64 respectively, a multiply-and-accumulate
+can be carried out. The scalar constant `a` is in fp1.
+
+Because the current pointer to y was not updated by the `sv.lfd/els`
+instruction, y (r7) is already pointing to the
+right place to store the results. However, given that we want r7
+to point to the start of the next batch, *now* we can use
+`sv.stfdup`, which will post-increment RA repeatedly by 8.
+
+Now that the batch of length `VL` has been done, the next step
+is to decrement CTR by VL. Due to the setvl instruction we know
+that VL and CTR will be equal or, if CTR is greater than MAXVL,
+that VL will be *equal* to MAXVL. Therefore, when `sv.bc/ctr`
+performs a decrement of CTR by VL, we can be confident that CTR
+will only reach zero if there is no more of the array to process.
+
+Thus, in an elegant way, each RISC instruction is actually quite sophisticated,
+yet not a huge CISC-like departure from the original Power ISA.
+Scalar Power ISA already has LD/ST-Update (pre-increment on RA):
+we propose adding Post-Increment (Motorola 68000 and 8086 have had
+both for decades). Power ISA branch-conditional has had Decrement-CTR
+since its inception: we propose in SVP64 to add "Decrement CTR by VL".
+The end result is an exceptionally compact daxpy that is easy to read
+and understand.
 
 ```
 # r5: n count; r6: x ptr; r7: y ptr; fp1: a
-  1 addi r3,r7,0             # return result
-  2 mtctr 5                  # move n to CTR
-  3 .L2
-  4 setvl MAXVL=32,VL=CTR    # actually VL=MIN(MAXVL,CTR)
-  5 sv.lfdup/els *32,8(6)    # load x into fp32-63, incr x
-  6 sv.lfd/els *64,8(7)      # load y into fp64-95, NO INC
-  7 sv.fmadd *64,*64,1,*32   # (*y) = (*y) * (*x) + a
-  8 sv.stfdup/els *64,8(7)   # store at y, incr y
-  9 sv.bc/ctr .L2            # decr CTR by VL, jump !zero
- 10 blr                      # return
+  1 mtctr 5                  # move n to CTR
+  2 .L2
+  3 setvl MAXVL=32,VL=CTR    # actually VL=MIN(MAXVL,CTR)
+  4 sv.lfdup *32,8(6)        # load x into fp32-63, incr x
+  5 sv.lfd/els *64,8(7)      # load y into fp64-95, NO INC
+  6 sv.fmadd *64,*64,1,*32   # (*y) = (*y) * (*x) + a
+  7 sv.stfdup *64,8(7)       # store at y, post-incr y
+  8 sv.bc/ctr .L2            # decr CTR by VL, jump !zero
+  9 blr                      # return
 ```
 
 # RVV version