# daxpy

This is a standard textbook algorithm demonstration for Vector and SIMD ISAs.

* <https://bugs.libre-soc.org/show_bug.cgi?id=1117>

Summary

| ISA   | total | loop | words | notes              |
|-------|-------|------|-------|--------------------|
| SVP64 | 8     | 6    | 13    | 5 64-bit, 3 32-bit |
| RVV   | 13    | 11   | 10    | 7 32-bit, 6 16-bit |
| SVE   | 12    | 7    | 12    | all 32-bit         |

# c code

```
void daxpy(size_t n, double a, const double x[], double y[]) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

# SVP64 Power ISA version

The first instruction is simple: the plan is to use CTR for looping.
Therefore, copy n (r5) into CTR. The next instruction however, at the
start of the loop (.L2), is not so obvious: MAXVL is being set to 32
elements, but at the same time VL is being set to MIN(MAXVL,CTR).
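
A rough model of that `setvl` in C (a sketch only: `MAXVL` is the
32-element limit set by the instruction, and `ctr` stands in for the
CTR register):

```
#include <stddef.h>

#define MAXVL 32

/* sketch: setvl MAXVL=32,VL=CTR computes VL = MIN(MAXVL, CTR) */
static inline size_t setvl(size_t ctr) {
    return ctr < MAXVL ? ctr : MAXVL;
}
```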

This algorithm
relies on post-increment, relies on there being no overlap between x
and y in memory, and critically relies on y being overwritten. x is
post-incremented when read, but y is post-incremented on write.
Load/Store Post-Increment is a new Draft set of instructions for the
Scalar Subsets, which saves having to pre-subtract an offset before
running the loop.

For `sv.lfdup`, RA is Scalar and so continuously accumulates
additions of the immediate (8), but only *after* RA has been used
as the Effective Address.
The last write to RA is the address for
the next block (the next time round the CTR loop).
To understand this it is necessary to appreciate that
SVP64 behaves as if a sequence of loop-unrolled scalar instructions
were issued. With each instruction in that sequence writing the new
version of RA before the next element-instruction runs, the end result
is identical in effect to Element-Strided mode, except that RA is left
pointing at the start of the next batch.
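
Those unrolled element-level semantics can be sketched in C (hypothetical
names: `fprs` models the FP register file, `ra` the scalar RA register,
`vl` the current VL):

```
/* sketch: sv.lfdup *32,8(6) -- contiguous load, post-incrementing RA */
void sv_lfdup(double fprs[128], const unsigned char **ra, int vl) {
    for (int i = 0; i < vl; i++) {
        fprs[32 + i] = *(const double *)*ra; /* use RA as the EA...    */
        *ra += 8;                            /* ...then post-increment */
    }
    /* RA is left pointing at the start of the next batch */
}
```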

Use of Element-Strided mode on `sv.lfd/els`
ensures the Immediate (8) results in a contiguous LD
*without* modifying RA.
The first element is loaded from RA, the second from RA+8, the third
from RA+16 and so on. However, unlike with `sv.lfdup`, RA remains
pointing at the block of the y array currently being processed.
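
By contrast, a C sketch of `sv.lfd/els` (same hypothetical names as
above) computes every Effective Address from an unchanging RA:

```
/* sketch: sv.lfd/els *64,8(7) -- element-strided load, RA unmodified */
void sv_lfd_els(double fprs[128], const unsigned char *ra, int vl) {
    for (int i = 0; i < vl; i++)
        fprs[64 + i] = *(const double *)(ra + i * 8); /* EA = RA + i*8 */
    /* RA itself is never written back */
}
```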

With parts of both x and y loaded into batches of FP
registers starting at fp32 and fp64 respectively, a multiply-and-accumulate
can be carried out. The scalar constant `a` is in fp1.
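
In the same sketch style, and given the Power ISA `fmadd` semantics
FRT = (FRA * FRC) + FRB, the element-level effect of the `sv.fmadd` is:

```
/* sketch: sv.fmadd *64,*32,1,*64 -- (*y) = a * (*x) + (*y) per element */
void sv_fmadd(double fprs[128], int vl) {
    for (int i = 0; i < vl; i++)
        fprs[64 + i] = fprs[32 + i] * fprs[1] + fprs[64 + i];
}
```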

Because the pointer to y (r7) was not updated by the `sv.lfd/els`
instruction, it is already pointing to the
right place to store the results. However, given that we want r7
to point to the start of the next batch, *now* we can use
`sv.stfdup`, which will post-increment RA repeatedly by 8.
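
A matching C sketch of the store (same hypothetical names again):

```
/* sketch: sv.stfdup *64,8(7) -- contiguous store, post-incrementing RA */
void sv_stfdup(const double fprs[128], unsigned char **ra, int vl) {
    for (int i = 0; i < vl; i++) {
        *(double *)*ra = fprs[64 + i]; /* store the result to y...  */
        *ra += 8;                      /* ...then post-increment RA */
    }
    /* r7 now points at the start of the next batch of y */
}
```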

Now that a batch of length `VL` has been processed, the next step
is to decrement CTR by VL. From the setvl instruction we know
that either VL and CTR are equal, or, if CTR is greater than MAXVL,
that VL is *equal* to MAXVL. Therefore, when `sv.bc/ctr`
performs a decrement of CTR by VL, we can be confident that CTR
will only reach zero when there is no more of the array to process.
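
Putting the pieces together, the control flow of the whole SVP64 loop
is equivalent to the following strip-mined C (a sketch, under the same
assumptions as above):

```
#include <stddef.h>

/* sketch: the SVP64 daxpy loop, strip-mined in plain C */
void daxpy_stripmine(size_t n, double a, const double *x, double *y) {
    size_t ctr = n;                      /* mtctr 5                   */
    while (ctr != 0) {
        size_t vl = ctr < 32 ? ctr : 32; /* setvl: VL = MIN(32, CTR)  */
        for (size_t i = 0; i < vl; i++)  /* lfdup/lfd/fmadd/stfdup    */
            y[i] = a * x[i] + y[i];
        x += vl;                         /* post-incremented on read  */
        y += vl;                         /* post-incremented on write */
        ctr -= vl;                       /* sv.bc/ctr: CTR -= VL      */
    }
}
```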

Thus, in an elegant way, each RISC instruction is actually quite
sophisticated, yet not a huge CISC-like departure from the original
Power ISA.
Scalar Power ISA already has LD/ST-Update (pre-increment on RA):
we propose adding Post-Increment (the Motorola 68000 and the 8086 have
had both for decades). Power ISA branch-conditional has had
Decrement-CTR since its inception: we propose in SVP64 to add
"Decrement CTR by VL".
The end result is an exceptionally compact daxpy that is easy to read
and understand.

```
# r5: n count; r6: x ptr; r7: y ptr; fp1: a
    mtctr 5                 # move n to CTR
.L2:
    setvl MAXVL=32,VL=CTR   # actually VL=MIN(MAXVL,CTR)
    sv.lfdup *32,8(6)       # load x into fp32-63, incr x
    sv.lfd/els *64,8(7)     # load y into fp64-95, NO INC
    sv.fmadd *64,*32,1,*64  # (*y) = a * (*x) + (*y)
    sv.stfdup *64,8(7)      # store at y, post-incr y
    sv.bc/ctr .L2           # decr CTR by VL, jump !zero
    blr                     # return
```

# RVV version

```
# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a
    li t0, 2<<25
    vsetdcfg t0            # enable 2 64b Fl.Pt. registers
loop:
    setvl t0, a0           # vl = t0 = min(mvl, n)
    vld v0, a1             # load vector x
    c.slli t1, t0, 3       # t1 = vl * 8 (in bytes)
    vld v1, a2             # load vector y
    c.add a1, a1, t1       # increment pointer to x by vl*8
    vfmadd v1, v0, fa0, v1 # v1 += v0 * fa0 (y = a * x + y)
    c.sub a0, a0, t0       # n -= vl (t0)
    vst v1, a2             # store Y
    c.add a2, a2, t1       # increment pointer to y by vl*8
    c.bnez a0, loop        # repeat if n != 0
    c.ret                  # return
```

# SVE Version

```
// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
daxpy_:
    ldrsw x3, [x3]                    // x3=*n
    mov x4, #0                        // x4=i=0
    whilelt p0.d, x4, x3              // p0=while(i++<n)
    ld1rd z0.d, p0/z, [x2]            // p0:z0=bcast(*a)
.loop:
    ld1d z1.d, p0/z, [x0, x4, lsl #3] // p0:z1=x[i]
    ld1d z2.d, p0/z, [x1, x4, lsl #3] // p0:z2=y[i]
    fmla z2.d, p0/m, z1.d, z0.d       // p0?z2+=x[i]*a
    st1d z2.d, p0, [x1, x4, lsl #3]   // p0?y[i]=z2
    incd x4                           // i+=(VL/64)
.latch:
    whilelt p0.d, x4, x3              // p0=while(i++<n)
    b.first .loop                     // more to do?
    ret
```

[[!tag svp64_cookbook ]]