# Draft proposal for improved atomic operations for the Power ISA

**NOTE THIS PROPOSAL IS NOT BEING SUBMITTED DUE TO
DISCOVERY DURING INVESTIGATION THAT ATOMICS ARE DESIGNED
FOR MASSIVE DISTRIBUTED CLUSTERS. SIGNIFICANT ADDITIONAL
RESEARCH IS REQUIRED SO THIS PROPOSAL IS PUT ON HOLD
UNTIL BUDGET IS AVAILABLE**

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]
* <http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html>

TODO:

* investigate the `EH` hint in Power ISA v3.1 (p1077)


# Motivation

Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D data structure processing in 3D
shader binaries, which needs on the order of 10^5 or more
atomic locks per second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

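`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
    sync                # full barrier before the read-modify-write
loop:
    ldarx 3, 0, 4       # load doubleword and take a reservation
    add 6, 5, 3         # r6 = old value + addend
    stdcx. 6, 0, 4      # store only if the reservation still holds
    bne 0, loop         # reservation lost: retry
    lwsync              # lightweight barrier after the update
# output in r3
```
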
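The reason the sequence above needs a retry loop can be illustrated with a toy model of the reservation mechanism. This is only a sketch under simplifying assumptions: a single memory location, a reservation tracked per location rather than per reservation granule, and no modelling of the `sync`/`lwsync` fences. The names `ReservedMemory`, `fetch_add` and `interfere` are hypothetical and not part of any Power ISA tooling.

```python
class ReservedMemory:
    """Toy single-location model of ldarx/stdcx. reservations (illustrative only)."""

    def __init__(self, value=0):
        self.value = value
        self.reserved_by = None  # hypothetical: id of the core holding the reservation

    def load_reserve(self, core):
        # models ldarx: read the value and take the reservation
        self.reserved_by = core
        return self.value

    def store_conditional(self, core, new_value):
        # models stdcx.: succeeds only if this core still holds the reservation
        if self.reserved_by != core:
            return False
        self.value = new_value
        self.reserved_by = None
        return True

    def plain_store(self, new_value):
        # a store from another core cancels any outstanding reservation
        self.value = new_value
        self.reserved_by = None


def fetch_add(mem, core, addend, interfere=None):
    """Retry loop mirroring ldarx/add/stdcx./bne; returns (old value, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_reserve(core)
        if interfere is not None and attempts == 1:
            interfere(mem)  # simulate another core writing between ldarx and stdcx.
        if mem.store_conditional(core, (old + addend) % 2**64):
            return old, attempts
```

With no interference the loop completes in one attempt. If another core stores to the location between the load-reserve and the store-conditional, the first attempt fails and the whole sequence is re-executed, which is the behaviour a single fused atomic instruction would avoid.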
`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync                # full barrier before the load
    ld 3, 0(3)
    cmpw 0, 3, 3        # always-equal compare, creates a dependency on the load
    bne- 0, skip        # never taken
    isync               # later instructions wait for the load to complete
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4       # load current value and take a reservation
    cmpd 0, 3, 5        # compare with the expected value
    bne 0, not_eq       # mismatch: skip the store
    stdcx. 6, 0, 4      # store replacement if the reservation still holds
    bne 0, loop         # reservation lost: retry
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3        # always-equal compare, creates a dependency on the load
    bne- 0, skip        # never taken
    isync               # later instructions wait for the load to complete
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be more
efficient to execute there, rather than having to move a whole cache
block. Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

See Power ISA v3.1 Book II section 4.5: Atomic Memory Operations.

The existing Atomic Memory Operations still lack better fences,
combined operation/fence instructions, and operations on 8-bit and
16-bit values. They also have unnecessary restrictions:

* only 32-bit and 64-bit atomic operations are provided.

See [[discussion]] for proposed operations and thoughts (TODO
remove this sentence).


# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical
to `lwat/ldat` and `stwat/stdat` except that they add guaranteed
acquire and release ordering semantics as well as 8-bit and
16-bit memory widths.

AT-Form (TODO)

* lat. RT,RA,FC,aq,rl,ew
* stat. RS,RA,FC,aq,rl,ew

**DRAFT** EXT031 and XO encodings, chosen to be close to the existing
atomic memory operations:

|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name| Form      |
|-- | -- | --- | --- |--|--|---- |------|--|----|-----------|
|31 | RT | RA  | FC  |aq|rl|ew   |000101|Rc|lat | TODO-Form |
|31 | RS | RA  | FC  |aq|rl|ew   |100101|/ |stat| TODO-Form |

* `ew` specifies the memory operation width: 0, 1, 2 and 3 select
  8, 16, 32 and 64 bits respectively
* If the `aq` bit is set,
  then no later atomic memory operations can be observed
  to take place before the AMO in this or other cores.
  (A global Write-after-Read Memory Hazard is created)
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
  (A global Read-after-Write Memory Hazard is created)
* Setting both the `aq` and the `rl` bit makes the operation
  sequentially consistent, meaning that
  it cannot be reordered with respect to earlier or later atomic
  memory operations. (Both a RaW and WaR are simultaneously created)
* `FC` uses the same Function Code tables as `lwat` and `stwat`
  in Power ISA v3

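The draft field layout can be sanity-checked with a small encoder/decoder. This is a minimal sketch, assuming Power ISA MSB-0 bit numbering (bit 0 is the most significant bit of the 32-bit word) and assuming bits 21 and 22 carry `aq` and `rl` in operand order. The helper names `FIELDS`, `encode` and `decode` are hypothetical, and bit 31 is labelled `Rc` for both instructions even though the draft marks it reserved for `stat`.

```python
# Draft field layout: (name, first bit, last bit) in MSB-0 numbering.
FIELDS = [
    ("PO", 0, 5), ("RT", 6, 10), ("RA", 11, 15), ("FC", 16, 20),
    ("aq", 21, 21), ("rl", 22, 22), ("ew", 23, 24), ("XO", 25, 30),
    ("Rc", 31, 31),
]
XO_LAT = 0b000101   # draft value from the encoding table
XO_STAT = 0b100101  # draft value from the encoding table


def encode(**values):
    """Pack named field values into a 32-bit instruction word."""
    word = 0
    for name, first, last in FIELDS:
        width = last - first + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << (31 - last)
    return word


def decode(word):
    """Unpack a 32-bit instruction word into a dict of field values."""
    return {
        name: (word >> (31 - last)) & ((1 << (last - first + 1)) - 1)
        for name, first, last in FIELDS
    }
```

For example, `encode(PO=31, RT=3, RA=4, FC=0, aq=1, rl=1, ew=3, XO=XO_LAT)` would describe a 64-bit sequentially consistent Fetch and Add under this draft, and `8 << decode(word)["ew"]` recovers the operation width in bits.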
Read functions (Power ISA v3.1 Book II section 4.5.1 p1071):

|FC    | regs           | memory                  | description                 |
|------|----------------|-------------------------|-----------------------------|
|00000 | RT, RT+1       | mem(EA,s)               | Fetch and Add               |
|00001 | RT, RT+1       | mem(EA,s)               | Fetch and XOR               |
|00010 | RT, RT+1       | mem(EA,s)               | Fetch and OR                |
|00011 | RT, RT+1       | mem(EA,s)               | Fetch and AND               |
|00100 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Unsigned  |
|00101 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Signed    |
|00110 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Unsigned  |
|00111 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Signed    |
|01000 | RT, RT+1       | mem(EA,s)               | Swap                        |
|10000 | RT, RT+1, RT+2 | mem(EA,s)               | Compare and Swap Not Equal  |
|11000 | RT             | mem(EA,s), mem(EA+s,s)  | Fetch and Increment Bounded |
|11001 | RT             | mem(EA,s), mem(EA+s,s)  | Fetch and Increment Equal   |
|11100 | RT             | mem(EA-s,s), mem(EA,s)  | Fetch and Decrement Bounded |

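The value semantics of the simpler function codes, combined with the draft `ew` width field, can be modelled as follows. This is a sketch only: it covers function codes 00000 to 01000, takes the operand that the table places in RT+1 as a plain argument, assumes big-endian byte order, and does not model atomicity or the `aq`/`rl` ordering bits. The names `to_signed`, `FETCH_OPS` and `fetch_op` are hypothetical.

```python
def to_signed(value, bits):
    """Reinterpret an unsigned `bits`-wide value as two's complement."""
    return value - (1 << bits) if value >> (bits - 1) else value


# Function codes 00000-01000; each maps (old, operand, bits) to the new value.
FETCH_OPS = {
    0b00000: lambda old, x, bits: old + x,     # Fetch and Add
    0b00001: lambda old, x, bits: old ^ x,     # Fetch and XOR
    0b00010: lambda old, x, bits: old | x,     # Fetch and OR
    0b00011: lambda old, x, bits: old & x,     # Fetch and AND
    0b00100: lambda old, x, bits: max(old, x), # Fetch and Maximum Unsigned
    0b00101: lambda old, x, bits: max(old, x, key=lambda v: to_signed(v, bits)),
    0b00110: lambda old, x, bits: min(old, x), # Fetch and Minimum Unsigned
    0b00111: lambda old, x, bits: min(old, x, key=lambda v: to_signed(v, bits)),
    0b01000: lambda old, x, bits: x,           # Swap
}


def fetch_op(memory, ea, fc, operand, ew):
    """Apply function code `fc` to the (8 << ew)-bit value at byte address `ea`.

    `memory` is a bytearray. Returns the original value, which is what
    RT would receive.
    """
    size = 1 << ew            # operand size in bytes
    bits = 8 * size
    mask = (1 << bits) - 1
    old = int.from_bytes(memory[ea:ea + size], "big")
    new = FETCH_OPS[fc](old, operand & mask, bits) & mask  # wrap to the width
    memory[ea:ea + size] = new.to_bytes(size, "big")
    return old
```

Calling `fetch_op(mem, 0, 0b00000, 1, ew=0)` on a byte holding `0xFF` returns `0xFF` and leaves `0x00` in memory, showing the wrap-around that an 8-bit Fetch and Add would have under this model.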
Store functions:

|FC    | regs | memory    | description                 |
|------|------|-----------|-----------------------------|
|00000 | RS   | mem(EA,s) | Store Add                   |
|00001 | RS   | mem(EA,s) | Store XOR                   |
|00010 | RS   | mem(EA,s) | Store OR                    |
|00011 | RS   | mem(EA,s) | Store AND                   |
|00100 | RS   | mem(EA,s) | Store Maximum Unsigned      |
|00101 | RS   | mem(EA,s) | Store Maximum Signed        |
|00110 | RS   | mem(EA,s) | Store Minimum Unsigned      |
|00111 | RS   | mem(EA,s) | Store Minimum Signed        |
|11000 | RS   | mem(EA,s) | Store Twin                  |

These functions are also recognised as being part of the
OpenCAPI Specification.