# Draft proposal for improved atomic operations for the Power ISA

**NOTE THIS PROPOSAL IS NOT BEING SUBMITTED DUE TO
DISCOVERY DURING INVESTIGATION THAT ATOMICS ARE DESIGNED
FOR MASSIVE DISTRIBUTED CLUSTERS. SIGNIFICANT ADDITIONAL
RESEARCH IS REQUIRED SO THIS PROPOSAL IS PUT ON HOLD
UNTIL BUDGET IS AVAILABLE**

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]
* <http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html>

TODO:

* investigate Power ISA 3.1 p1077 EH hint

# Motivation

The Power ISA currently has some issues with its support for atomic
operations. These are exacerbated by 3D data structure processing in
3D shader binaries, which needs on the order of 10^5 or more atomic
locks per second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that the Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

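`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
sync                # full barrier before the atomic update
loop:
ldarx 3, 0, 4       # load doubleword and set a reservation
add 6, 5, 3         # r6 = addend + loaded value
stdcx. 6, 0, 4      # store only if the reservation still holds
bne 0, loop         # retry if the store failed
lwsync              # order the update before later accesses
# output in r3
```
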
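`atomic_load_seq_cst` is 5 instructions, including a branch and an
unnecessarily strong memory fence:

```
# address in r3
sync                # full barrier before the load
ld 3, 0(3)          # load the value
cmpw 0, 3, 3        # always-equal compare, depends on the loaded value
bne- 0, skip        # never-taken branch that depends on the compare
isync               # later instructions wait for the branch to resolve
skip:
# output in r3
```
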
`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches and an unnecessarily strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync                # full barrier before the atomic update
loop:
ldarx 3, 0, 4       # load current value and set a reservation
cmpd 0, 3, 5        # compare with the expected value
bne 0, not_eq       # mismatch: skip the store
stdcx. 6, 0, 4      # store replacement if the reservation still holds
bne 0, loop         # retry if the store failed
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

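`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily strong memory fence:

```
# address in r3
ld 3, 0(3)          # load the value
cmpw 0, 3, 3        # always-equal compare, depends on the loaded value
bne- 0, skip        # never-taken branch that depends on the compare
isync               # later instructions wait for the branch to resolve
skip:
# output in r3
```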
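
For reference, the four sequences above correspond to single C++11 source-level operations. The following is a minimal sketch only: the wrapper function names are hypothetical, chosen to mirror the labels used above, and the exact instructions a compiler emits for them depend on the compiler and target.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical wrappers named after the sequences listed above.
// Each one is a single C++11 atomic operation on a 64-bit value.

std::uint64_t atomic_fetch_add_seq_cst(std::atomic<std::uint64_t>& a,
                                       std::uint64_t addend) {
    // Returns the value held before the addition.
    return a.fetch_add(addend, std::memory_order_seq_cst);
}

std::uint64_t atomic_load_seq_cst(const std::atomic<std::uint64_t>& a) {
    return a.load(std::memory_order_seq_cst);
}

bool atomic_compare_exchange_strong_seq_cst(std::atomic<std::uint64_t>& a,
                                            std::uint64_t& expected,
                                            std::uint64_t desired) {
    // On failure, `expected` is overwritten with the value found in memory.
    return a.compare_exchange_strong(expected, desired,
                                     std::memory_order_seq_cst);
}

std::uint64_t atomic_load_acquire(const std::atomic<std::uint64_t>& a) {
    return a.load(std::memory_order_acquire);
}
```

Each wrapper is one operation at the source level, which is the granularity a single-instruction atomic would match.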

Single-instruction atomic operations are useful for implementations that want to send atomic operations to a shared cache, since executing them there can be more efficient than moving a whole cache block. Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

The Atomic Memory Operations (AMOs) are defined in Power ISA v3.1
Book II section 4.5.

They are still missing better fences, combined operation/fence
instructions, and operations on 8-bit and 16-bit values. They also have
unnecessary restrictions:

* only 32-bit and 64-bit atomic operations are provided.

See [[discussion]] for proposed operations and thoughts. TODO:
remove this sentence.

# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical
to `lwat/ldat` and `stwat/stdat`, except that they add guaranteed
acquire and release ordering semantics as well as 8-bit and
16-bit memory widths.

AT-Form (TODO)

* lat. RT,RA,FC,aq,rl,ew
* stat. RS,RA,FC,aq,rl,ew

**DRAFT** EXT031 and XO. These encodings are near those of the existing
atomic memory operations.

|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name| Form      |
|---|----|-----|-----|--|--|-----|------|--|----|-----------|
|31 | RT | RA  | FC  |aq|rl|ew   |000101|Rc|lat | TODO-Form |
|31 | RS | RA  | FC  |aq|rl|ew   |100101|/ |stat| TODO-Form |

* `ew` specifies the memory operation width: values 0/1/2/3 select
  8/16/32/64 bits.
* If the `aq` bit is set,
  then no later atomic memory operations can be observed
  to take place before the AMO in this or other cores.
  (A global Write-after-Read Memory Hazard is created.)
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
  (A global Read-after-Write Memory Hazard is created.)
* Setting both the `aq` and the `rl` bits makes the sequence
  sequentially consistent, meaning that
  it cannot be reordered with respect to earlier or later atomic
  memory operations. (Both a RaW and a WaR hazard are created simultaneously.)
* `FC` is identical to the Function tables used in Power ISA v3 for `lwat`
  and `stwat`.
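
The draft field layout can be sketched as a small encoder. This is illustrative only, not a finalised encoding: the names `encode_amo`, `ew_width_bits`, `XO_LAT` and `XO_STAT` are hypothetical, and the sketch assumes that bits 21 and 22 carry `aq` and `rl` in that order, matching the operand order given above.

```cpp
#include <cstdint>

// Draft XO values for bits 25..30 (binary 000101 and 100101).
constexpr std::uint32_t XO_LAT  = 0x05;
constexpr std::uint32_t XO_STAT = 0x25;

// Memory width in bits selected by the 2-bit ew field: 0/1/2/3 -> 8/16/32/64.
constexpr unsigned ew_width_bits(std::uint32_t ew) {
    return 8u << (ew & 3u);
}

// Pack the draft fields into a 32-bit instruction word.
// Power ISA numbers bits from the most significant end (bit 0 = MSB),
// so a field whose last bit is e is shifted left by 31 - e.
// For stat, bit 31 is reserved ("/"), so rc should be false.
constexpr std::uint32_t encode_amo(std::uint32_t xo,
                                   std::uint32_t rt_or_rs,
                                   std::uint32_t ra,
                                   std::uint32_t fc,
                                   bool aq, bool rl,
                                   std::uint32_t ew,
                                   bool rc) {
    return (31u << 26)                              // bits 0..5: primary opcode 31
         | ((rt_or_rs & 31u) << 21)                 // bits 6..10: RT or RS
         | ((ra & 31u) << 16)                       // bits 11..15: RA
         | ((fc & 31u) << 11)                       // bits 16..20: FC
         | (static_cast<std::uint32_t>(aq) << 10)   // bit 21: aq (assumed)
         | (static_cast<std::uint32_t>(rl) << 9)    // bit 22: rl (assumed)
         | ((ew & 3u) << 7)                         // bits 23..24: ew
         | ((xo & 63u) << 1)                        // bits 25..30: XO
         | static_cast<std::uint32_t>(rc);          // bit 31: Rc
}
```

For example, `encode_amo(XO_LAT, 3, 4, 0, true, true, 3, false)` packs a 64-bit Fetch and Add with both ordering bits set, RT=3 and RA=4.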

Read functions (Power ISA v3.1 Book II section 4.5.1, p1071):

|FC    | regs           | memory                 | description                 |
|------|----------------|------------------------|-----------------------------|
|00000 | RT, RT+1       | mem(EA,s)              | Fetch and Add               |
|00001 | RT, RT+1       | mem(EA,s)              | Fetch and XOR               |
|00010 | RT, RT+1       | mem(EA,s)              | Fetch and OR                |
|00011 | RT, RT+1       | mem(EA,s)              | Fetch and AND               |
|00100 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Unsigned  |
|00101 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Signed    |
|00110 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Unsigned  |
|00111 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Signed    |
|01000 | RT, RT+1       | mem(EA,s)              | Swap                        |
|10000 | RT, RT+1, RT+2 | mem(EA,s)              | Compare and Swap Not Equal  |
|11000 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
|11001 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal   |
|11100 | RT             | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |
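
As a rough software model, the single-location functions in the table above (codes 00000 to 01000) can be emulated with a compare-and-swap loop. This sketch is not how hardware would implement them. The names `FetchOp`, `apply_op` and `fetch_and_op` are hypothetical, and the operand that the instruction takes from RT+1 is passed here as a plain argument.

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// Function codes from the read-function table (single-location cases only).
enum class FetchOp : std::uint8_t {
    Add = 0x00, Xor = 0x01, Or = 0x02, And = 0x03,
    MaxU = 0x04, MaxS = 0x05, MinU = 0x06, MinS = 0x07, Swap = 0x08
};

// Compute the new memory value from the old value and the operand.
// U is an unsigned integer type of the chosen width (8/16/32/64 bits).
template <typename U>
U apply_op(FetchOp op, U old, U operand) {
    // Signed view of the same bits (assumes two's complement).
    using S = typename std::make_signed<U>::type;
    switch (op) {
    case FetchOp::Add:  return static_cast<U>(old + operand);
    case FetchOp::Xor:  return static_cast<U>(old ^ operand);
    case FetchOp::Or:   return static_cast<U>(old | operand);
    case FetchOp::And:  return static_cast<U>(old & operand);
    case FetchOp::MaxU: return old > operand ? old : operand;
    case FetchOp::MinU: return old < operand ? old : operand;
    case FetchOp::MaxS:
        return static_cast<S>(old) > static_cast<S>(operand) ? old : operand;
    case FetchOp::MinS:
        return static_cast<S>(old) < static_cast<S>(operand) ? old : operand;
    case FetchOp::Swap: return operand;
    }
    return old;  // not reached for the codes listed above
}

// Atomically replace *mem with apply_op(op, *mem, operand) and return the
// previous value, retrying until the compare-and-swap succeeds.
template <typename U>
U fetch_and_op(std::atomic<U>& mem, FetchOp op, U operand,
               std::memory_order order = std::memory_order_seq_cst) {
    U old = mem.load(std::memory_order_relaxed);
    while (!mem.compare_exchange_weak(old, apply_op(op, old, operand),
                                      order, std::memory_order_relaxed)) {
        // `old` now holds the value currently in memory; try again.
    }
    return old;
}
```

Instantiating the template with `std::uint8_t` or `std::uint16_t` models the 8-bit and 16-bit widths that the draft `ew` field adds.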

Store functions:

|FC    | regs | memory    | description                 |
|------|------|-----------|-----------------------------|
|00000 | RS   | mem(EA,s) | Store Add                   |
|00001 | RS   | mem(EA,s) | Store XOR                   |
|00010 | RS   | mem(EA,s) | Store OR                    |
|00011 | RS   | mem(EA,s) | Store AND                   |
|00100 | RS   | mem(EA,s) | Store Maximum Unsigned      |
|00101 | RS   | mem(EA,s) | Store Maximum Signed        |
|00110 | RS   | mem(EA,s) | Store Minimum Unsigned      |
|00111 | RS   | mem(EA,s) | Store Minimum Signed        |
|11000 | RS   | mem(EA,s) | Store Twin                  |

These functions are also recognised as being part of the
OpenCAPI Specification.