# c code

```
void daxpy(size_t n, double a, const double x[], double y[]) {
    for (size_t i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
```

Summary

| ISA   | total instructions | loop instructions | code size (32-bit words) | notes              |
|-------|--------------------|-------------------|--------------------------|--------------------|
| SVP64 | 8                  | 6                 | 13                       | 5 64-bit, 4 32-bit |
| RVV   | 13                 | 11                | 9.5                      | 7 32-bit, 5 16-bit |
| SVE   | 12                 | 7                 | 12                       | all 32-bit         |

# SVP64 Power ISA version

The first instruction is simple: the plan is to use CTR for looping.
Therefore, copy n (r5) into CTR. The next instruction, however, at the
start of the loop (`.L2`), is not so obvious: `setvl` sets MAXVL to 32
elements and at the same time sets VL to MIN(MAXVL, CTR).

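As a rough C sketch (the function name is hypothetical, not part of the
spec), the VL-setting rule is just a saturating minimum:

```
#include <stdint.h>

/* hypothetical model of `setvl MAXVL=32,VL=CTR`: VL is the number of
   elements this pass will process, capped at MAXVL so the batch fits
   in the register file (32 elements here). */
static inline uint64_t setvl_model(uint64_t ctr) {
    const uint64_t MAXVL = 32;
    return ctr < MAXVL ? ctr : MAXVL;
}
```
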
This algorithm relies on post-increment, on there being no overlap
between x and y in memory, and, critically, on y being overwritten.
x is post-incremented when read, but y is post-incremented on write.
Load/Store Post-Increment is a new Draft set of instructions for the
Scalar Subsets, which saves having to pre-subtract an offset before
running the loop.

For `sv.lfdup`, RA is Scalar and so continuously accumulates
additions of the immediate (8):
the last write to RA is the address for
the next block (the next time round the CTR loop).
To understand this it is necessary to appreciate that
SVP64 behaves as if a sequence of loop-unrolled scalar instructions
had been issued. With each instruction in that sequence writing the
new version of RA before the next element-instruction, the end result
is identical in effect to Element-Strided, except that RA is left
pointing at the start of the next batch.

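A minimal C sketch of that element-level behaviour (the names and the
flat register model are hypothetical): each element load writes the
incremented address back to the scalar RA, so after VL elements RA has
advanced by VL*8.

```
#include <stdint.h>

/* hypothetical model of `sv.lfdup *32,8(6)`: the scalar RA (r6)
   accumulates the immediate once per element, so on exit it points
   at the start of the next batch of x. */
static void lfdup_model(double fp[], uint64_t *ra, uint64_t vl) {
    for (uint64_t i = 0; i < vl; i++) {
        fp[32 + i] = *(const double *)(uintptr_t)*ra; /* load from RA */
        *ra += 8;                                     /* post-increment */
    }
}
```
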
Use of Element-Strided mode on `sv.lfd/els`
ensures that the Immediate (8) results in a contiguous LD
*without* modifying RA.
The first element is loaded from RA, the second from RA+8, the third
from RA+16 and so on. However, unlike `sv.lfdup`, RA remains
pointing at the current block of the y array being processed.

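By contrast, a C sketch of the Element-Strided load (names again
hypothetical) scales the immediate by the element index and leaves
RA untouched:

```
#include <stdint.h>

/* hypothetical model of `sv.lfd/els *64,8(7)`: the immediate is
   multiplied by the element index, so r7 still points at the
   current y block on exit. */
static void lfd_els_model(double fp[], uint64_t ra, uint64_t vl) {
    for (uint64_t i = 0; i < vl; i++)
        fp[64 + i] = *(const double *)(uintptr_t)(ra + i * 8);
}
```
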
With parts of both x and y loaded into a batch of FP
registers starting at fp32 and fp64 respectively, a multiply-and-accumulate
can be carried out. The scalar constant `a` is in fp1.

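Per element this is a plain fused multiply-add; a C sketch (names
hypothetical), with the scalar `a` held in fp1 throughout:

```
#include <stdint.h>

/* hypothetical model of the vector multiply-and-accumulate:
   per element, y = a*x + y, reading x from fp32-63 and
   updating y in place in fp64-95. */
static void fmadd_model(double fp[], uint64_t vl) {
    for (uint64_t i = 0; i < vl; i++)
        fp[64 + i] = fp[32 + i] * fp[1] + fp[64 + i];
}
```
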
Because the pointer to y was not updated by the `sv.lfd/els`
instruction, y (r7) is already pointing to the
right place to store the results. However, given that we want r7
to point to the start of the next batch, *now* we can use
`sv.stfdup`, which will post-increment RA repeatedly by 8.

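Again a C sketch (names hypothetical): the store mirrors `sv.lfdup`,
leaving r7 at the start of the next y batch.

```
#include <stdint.h>

/* hypothetical model of `sv.stfdup *64,8(7)`: like sv.lfdup but
   for stores; on exit r7 points at the next batch of y. */
static void stfdup_model(const double fp[], uint64_t *ra, uint64_t vl) {
    for (uint64_t i = 0; i < vl; i++) {
        *(double *)(uintptr_t)*ra = fp[64 + i]; /* store to RA */
        *ra += 8;                               /* post-increment */
    }
}
```
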
Now that the batch of length `VL` has been done, the next step
is to decrement CTR by VL. Thanks to the setvl instruction we know
that either VL and CTR are equal, or, if CTR is greater than MAXVL,
that VL is *equal* to MAXVL. Therefore, when `sv.bc/ctr`
performs a decrement of CTR by VL, we can be confident that CTR
will only reach zero if there is no more of the array to process.

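A C sketch of the loop-control step (names hypothetical): because
setvl guarantees VL <= CTR, the subtraction reaches exactly zero on
the final batch.

```
#include <stdint.h>

/* hypothetical model of `sv.bc/ctr .L2`: CTR is decremented by VL
   rather than by 1; a nonzero result means branch back to .L2. */
static int bc_ctr_model(uint64_t *ctr, uint64_t vl) {
    *ctr -= vl;
    return *ctr != 0;
}
```
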
Thus in an elegant way each RISC instruction is actually quite sophisticated,
but not a huge CISC-like departure from the original Power ISA.
Scalar Power ISA already has LD/ST-Update (pre-increment on RA):
we propose adding Post-Increment (Motorola 68000 and 8086 have had
both for decades). Power ISA branch-conditional has had Decrement-CTR
since its inception: we propose in SVP64 to add "Decrement CTR by VL".
The end result is an exceptionally compact daxpy that is easy to read
and understand.

```
# r5: n count; r6: x ptr; r7: y ptr; fp1: a
    mtctr 5                  # move n to CTR
.L2
    setvl MAXVL=32,VL=CTR    # actually VL=MIN(MAXVL,CTR)
    sv.lfdup   *32,8(6)      # load x into fp32-63, post-incr x
    sv.lfd/els *64,8(7)      # load y into fp64-95, NO INC
    sv.fmadd   *64,*32,1,*64 # (*y) = a * (*x) + (*y)
    sv.stfdup  *64,8(7)      # store at y, post-incr y
    sv.bc/ctr  .L2           # decr CTR by VL, jump !zero
    blr                      # return
```

# RVV version

```
# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a
    li t0, 2<<25
    vsetdcfg t0            # enable 2 64b Fl.Pt. registers
loop:
    setvl t0, a0           # vl = t0 = min(mvl, n)
    vld v0, a1             # load vector x
    c.slli t1, t0, 3       # t1 = vl * 8 (in bytes)
    vld v1, a2             # load vector y
    c.add a1, a1, t1       # increment pointer to x by vl*8
    vfmadd v1, v0, fa0, v1 # v1 += v0 * fa0 (y = a * x + y)
    c.sub a0, a0, t0       # n -= vl (t0)
    vst v1, a2             # store y
    c.add a2, a2, t1       # increment pointer to y by vl*8
    c.bnez a0, loop        # repeat if n != 0
    c.ret                  # return
```

# SVE Version

```
// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
daxpy_:
    ldrsw x3, [x3]                    // x3=*n
    mov x4, #0                        // x4=i=0
    whilelt p0.d, x4, x3              // p0=while(i++<n)
    ld1rd z0.d, p0/z, [x2]            // p0:z0=bcast(*a)
.loop:
    ld1d z1.d, p0/z, [x0, x4, lsl #3] // p0:z1=x[i]
    ld1d z2.d, p0/z, [x1, x4, lsl #3] // p0:z2=y[i]
    fmla z2.d, p0/m, z1.d, z0.d       // p0?z2+=x[i]*a
    st1d z2.d, p0, [x1, x4, lsl #3]   // p0?y[i]=z2
    incd x4                           // i+=(VL/64)
.latch:
    whilelt p0.d, x4, x3              // p0=while(i++<n)
    b.first .loop                     // more to do?
    ret
```