# Parallelism using Bitmaps

Seen this way, setvl, predication, and indeed vector length can all be
combined by always working with bitmaps.

So: you have 32 WARL CSRs, called X0, ..., X31 (or perhaps 2 banks of
32 CSRs, with an additional set FX0, ..., FX31).

Each contains a bitmap of length 32 (assuming we only have the standard
registers).

By default, X0 contains 1<<0, X1 contains 1<<1, X2 contains 1<<2, and so
on, so each CSR initially selects only the register of the same number.
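
As a minimal sketch of that reset state (the names `XB`, `reset_bitmaps` and `read_bitmap` are illustrative only; `XB` is borrowed from the pseudocode further down), every CSR starts out selecting exactly its own register, so unmodified code behaves as ordinary scalar code:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 32

/* Hypothetical model of the 32 bitmap CSRs X0..X31. */
uint32_t XB[NUM_REGS];

/* Reset state: Xn = 1 << n, so Xn selects only register xn. */
void reset_bitmaps(void)
{
    for (int n = 0; n < NUM_REGS; n++)
        XB[n] = UINT32_C(1) << n;
}

/* Read back the bitmap held in CSR Xn. */
uint32_t read_bitmap(int n)
{
    assert(n >= 0 && n < NUM_REGS);
    return XB[n];
}
```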

Now an instruction like

    add x1 x2 x3

is reinterpreted as referring to the CSRs rather than to individual
registers, i.e. under Simple-V it means

    add X1, X2, X3

and it has the following semantics:

    let rds  = registers in bitmap X1
    let rs1s = registers in bitmap X2, repeated periodically in order of register number to the length of rds
    let rs2s = registers in bitmap X3, repeated periodically in order of register number to the length of rds

    parallelfor (rd, rs1, rs2) in (rds[i], rs1s[i], rs2s[i]) where i = 0 to length(rds) - 1
        add rd rs1 rs2
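
The selection rule above can be sketched in C as follows. This is only an illustration under stated assumptions: bit n of a bitmap selects register xn (matching the 1<<n defaults), and the helper names `expand_bitmap` and `repeat_to_length` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* List the registers selected by a bitmap in ascending register order.
 * Assumes bit n selects register xn. Returns how many were stored in out[]. */
int expand_bitmap(uint32_t bitmap, int out[32])
{
    int count = 0;
    for (int n = 0; n < 32; n++)
        if (bitmap & (UINT32_C(1) << n))
            out[count++] = n;
    return count;
}

/* Fill dst[0..len-1] by cycling through the registers of a source bitmap.
 * The source bitmap must select at least one register. */
void repeat_to_length(uint32_t src_bitmap, int len, int dst[])
{
    int src[32];
    int nsrc = expand_bitmap(src_bitmap, src);
    assert(nsrc > 0);
    for (int i = 0; i < len; i++)
        dst[i] = src[i % nsrc];
}
```

For instance, a destination bitmap of 0x3E selects x1 to x5 (five registers), and `repeat_to_length(0x0D, 5, rs1s)` cycles the source set x0, x2, x3 to give x0, x2, x3, x0, x2.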

Example:

    X1 <- 0b011111
    X2 <- 0b1011
    X3 <- 0b00010

> Anyways my point was, for me it would have been more intuitive
> and easier to grasp if it showed:
> X1 -> b011111 (meaning x4,x3,x2,x1,x0)
> X2 -> b001011 (meaning x3,x1,x0)
> X3 -> b000010 (meaning x1)

then, reading each literal from left to right (the leftmost digit is
register 0, the opposite of the 1<<n convention used for the defaults
above)

    rds  = [x1, x2, x3, x4, x5]
    rs1s = [x0, x2, x3, x0, x2]
    rs2s = [x3, x3, x3, x3, x3]

and

    add X1, X2, X3

is interpreted as

    parallel {
        add x1, x0, x3
        add x2, x2, x3
        add x3, x3, x3
        add x4, x0, x3 # x3 has its original value!
        add x5, x2, x3 # x2 and x3 have their original values!
    }
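
The comments about original values matter because the group behaves as if every source is read before any destination is written. A rough C model of that behaviour, assuming a plain 32-entry register array in which x0 is not treated specially (the function name `parallel_add` is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Apply n adds as one parallel group: read all sources first, then write
 * all destinations, so no element observes another element's result. */
void parallel_add(uint32_t regs[32], int n,
                  const int rd[], const int rs1[], const int rs2[])
{
    uint32_t result[32];
    assert(n >= 0 && n <= 32);
    for (int i = 0; i < n; i++)      /* phase 1: read and compute */
        result[i] = regs[rs1[i]] + regs[rs2[i]];
    for (int i = 0; i < n; i++)      /* phase 2: commit the results */
        regs[rd[i]] = result[i];
}
```

With x0 = 0, x2 = 20 and x3 = 30, the five adds above leave x4 = 30 and x5 = 50. Executing the same adds strictly one after another would instead give x4 = 60 and x5 = 110, because x2 and x3 would already have been overwritten.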

This means that the analogue of setvl is simply the "write any" half of
WARL, i.e. setting the bitmap, and the analogue of the return value of
setvl is the "read legal" half, i.e. reading the CSR back. Moreover popc
(population count) of the bitmap tells you how many operations are
scheduled in parallel, so you know how often you have to repeat a
sequential loop.
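
As a hedged sketch of how software might use this, assume a hypothetical implementation in which only some registers may appear in a bitmap. The `supported_mask` parameter and all three function names below are invented for illustration and are not part of any specification; masking is just one possible way of producing a legal value. Writing a bitmap and reading it back plays the role of setvl, and the population count of the value read back gives the number of elements handled per instruction:

```c
#include <assert.h>
#include <stdint.h>

/* WARL behaviour: any value may be written, a legal value is read back.
 * Here "legal" is modelled as masking with what the hardware supports. */
uint32_t warl_write(uint32_t requested, uint32_t supported_mask)
{
    return requested & supported_mask;
}

/* Portable population count: number of set bits in x. */
int popcount32(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Number of passes a sequential loop needs to cover `total` elements
 * when each pass handles popcount32(bitmap) of them. */
int loop_passes(int total, uint32_t bitmap)
{
    int per_pass = popcount32(bitmap);
    assert(per_pass > 0);
    return (total + per_pass - 1) / per_pass;
}
```

For example, if the bitmap read back is 0x3E (five registers), a loop over 12 elements needs three passes.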

Notes:

> > Thinking about it more, a bitset for X0 seems a bad idea, or equivalently X0
> > should be the immutable bitset {x0}. That suggests FX0, ... FX31 _is_ a good idea.

> what would it mean, to do ops with x0? it would mean "always add 0"
> and so on. it sounds kinda useful. like MV being add r1, r2, x0.
> it would be completely pointless to *have* anything other than "all 1s"
> in it though i think :)

# pseudocode for decoding ops

    #include <stdint.h>

    uint32_t XB[32];   // global, assume RV32 for now: CSRs for bitmapping
    uint32_t regs[32]; // global, actual (integer) register file

    // gets current ACTUAL register to be used: searches bitmap XB[rn]
    // starting at *offs and leaves *offs just past the register found.
    // XB[rn] had better not be empty, otherwise this loops forever...
    int regdecode(int rn, int *offs)
    {
        uint32_t bmap = XB[rn];
        int _offs = *offs;
        while (1)
        {
            int _newoffs = (_offs + 1) & 0x1f; // 32 regs, modulo
            if (bmap & (UINT32_C(1) << _offs)) // unsigned, so bit 31 is safe
            {
                *offs = _newoffs;
                return _offs;
            }
            _offs = _newoffs;
        }
    }
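
The loop above never terminates if the bitmap is empty. One possible guard, shown here as a standalone sketch with invented names (`next_reg`, and a return value of -1 for an empty bitmap), bounds the scan to 32 steps while keeping the same wrap-around behaviour:

```c
#include <assert.h>
#include <stdint.h>

/* Return the next register selected by bitmap at or after *cursor,
 * wrapping modulo 32, and advance *cursor just past it.
 * Returns -1 (leaving *cursor untouched) if no register is selected. */
int next_reg(uint32_t bitmap, int *cursor)
{
    assert(*cursor >= 0 && *cursor < 32);
    for (int step = 0; step < 32; step++) {
        int candidate = (*cursor + step) & 0x1f;
        if (bitmap & (UINT32_C(1) << candidate)) {
            *cursor = (candidate + 1) & 0x1f;
            return candidate;
        }
    }
    return -1;
}
```

Calling it repeatedly on the bitmap 0x0D with a cursor starting at 0 yields 0, 2, 3 and then wraps around to 0 again, which is the periodic repetition used for the source operands.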

example usage (pseudo-implementation of add):

    // pcnt() is a population count: the number of bits set in its argument.
    // Elements are processed one after another here, so a later element sees
    // registers written by an earlier one (unlike the parallel { } reading above).
    void op_add(int rd, int rs1, int rs2)
    {
        int id = 0, irs1 = 0, irs2 = 0;
        int VL = pcnt(XB[rd]);
        for (int i = 0; i < VL; i++)
        {
            int actualrd  = regdecode(rd , &id);
            int actualrs1 = regdecode(rs1, &irs1);
            int actualrs2 = regdecode(rs2, &irs2);
            regs[actualrd] = regs[actualrs1] + regs[actualrs2];
        }
    }