[[!tag standards]]

# SV Context Propagation

[[!toc]]

TODO: add setvl context propagation.

Context Propagation is for a future version of SV.

[[sv/svp64]] context is 24 bits long, and Swizzle is 12.  These
are large and not sustainable as far as power consumption is
concerned.  In addition, the same contexts are repeated across many
different instructions.  The idea therefore is to add a level of
indirection that allows one context to be applied to multiple instructions.

The basic principle is to have a suite of 40 indices in a shift register,
each indicating which of seven Contexts is to be applied to an upcoming
32 bit v3.0B instruction.  The Least Significant Index in the shift register
is the one that is applied.  An index of 0b000 indicates
"no prefix applied".

A special instruction in an svp64 context takes a copy of the `RM[0..23]`
bits, alongside a 21 bit suite that indicates which of up to 20 upcoming
32 bit instructions will have that `RM` applied to them, as well as an index
to associate with the `RM`.  If there are already indices set within the
shift register then the new entries are placed after the end of the
highest-indexed one.

| 0.5| 6.8 | 9.10|11.31|  name   |
| -- | --- | --- | --- | ------- |
| OP |     | MMM |     | ?-Form  |
| OP | idx | 000 | imm |         |
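
As an illustration of the field layout, the following minimal Python sketch splits a 32 bit word according to the column headers of the table above.  It assumes the MSB0 bit numbering of the Power ISA (bit 0 is the most significant bit).  The names `msb0_field` and `decode_rm_propagate`, and the opcode value used in the example, are hypothetical and are not part of any allocated encoding.

```python
def msb0_field(word, start, end, width=32):
    """Extract bits start..end (inclusive, MSB0 numbering) of a width-bit word."""
    length = end - start + 1
    shift = width - 1 - end
    return (word >> shift) & ((1 << length) - 1)


def decode_rm_propagate(insn):
    """Split a 32 bit word into the fields named by the table's column headers."""
    return {
        "OP": msb0_field(insn, 0, 5),
        "idx": msb0_field(insn, 6, 8),
        "mode": msb0_field(insn, 9, 10),
        "imm": msb0_field(insn, 11, 31),
    }


# Placeholder values only: 0b000001 is NOT a real opcode assignment.
word = (0b000001 << 26) | (0b011 << 23) | (0b00 << 21) | 0b101
fields = decode_rm_propagate(word)
```

Here `fields["idx"]` selects which of the seven Contexts is written, and `fields["imm"]` carries the 21 bit suite.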

Four different types of contexts are available so far: svp64 RM, setvl, Remap and
swizzle. Their format is as follows when stored in SPRs:

| 0..3 | 4..7   | 8........31   | name                    |
| ---- | ------ | ------------- | ----------------------- |
| 0000 | 0000   | `RM[0:23]`    | [[sv/svp64]] RM         |
| 0000 | 0001   | `setvl[0:23]` | [[sv/setvl]] VL         |
| 0001 | 0 mask | swiz1 swiz2   | swizzle                 |
| 0010 | brev   | sh0-3 ms0-3   | [Remap](sv/remap)       |
| 0011 | brev   | sh0-3 ms0-3   | [SubVL Remap](sv/remap) |
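
A stored Context can be classified from its top eight bits.  The sketch below is one possible reading of the table, again assuming MSB0 numbering; `classify_context` and the kind names are hypothetical labels chosen for this example, and unlisted type codes are simply rejected.

```python
# Type codes taken from bits 0..3 of the table; names are illustrative labels.
KINDS = {0b0001: "swizzle", 0b0010: "remap", 0b0011: "subvl-remap"}


def classify_context(ctx):
    """Return (kind, aux, payload) for a 32 bit stored Context.

    aux is bits 4..7 (the mask or brev nibble) where the table defines one,
    otherwise None.  payload is bits 8..31.
    """
    top = (ctx >> 28) & 0xF       # bits 0..3
    sub = (ctx >> 24) & 0xF       # bits 4..7
    payload = ctx & 0xFFFFFF      # bits 8..31
    if top == 0b0000:
        if sub == 0b0000:
            return ("svp64-rm", None, payload)
        if sub == 0b0001:
            return ("setvl", None, payload)
        raise ValueError("type code not listed in the table")
    if top in KINDS:
        return (KINDS[top], sub, payload)
    raise ValueError("type code not listed in the table")
```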

There are four 64 bit SPRs used for storing Context, and the data is stored
as follows:

* Seven 32 bit contexts are stored, each indexed from 0b001 to 0b111,
  two per 64 bit SPR and one in the 4th.
* Starting from bit 32 of the 4th SPR, in batches of 40 bits the Shift
  Registers are stored.

Whenever the LSB of one or more of the seven Shift Registers is nonzero,
the corresponding Contexts are looked up and merged (ORed) together.
Contexts for different purposes however may not be mixed: an illegal
instruction exception is raised if this occurs.

The reason for merging the contexts is so that different aspects may be
applied.  For example one `RM` context may indicate that predication
is to be applied to an instruction whilst another context may contain
the svp64 Mode.  Combining the two allows the predication aspect to be
merged and shared, making for better packing.
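
The merge rule can be sketched as follows.  This is a minimal model, not a definitive implementation: Contexts are represented as hypothetical `(kind, payload)` pairs, `IllegalInstruction` stands in for the illegal instruction exception, and the two example payloads use placeholder bit positions rather than the real svp64 `RM` layout.

```python
class IllegalInstruction(Exception):
    """Stand-in for the illegal instruction exception."""


def merge_contexts(active):
    """OR together the payloads of all active Contexts.

    active is a list of (kind, payload) pairs.  All entries must be of the
    same kind; mixing kinds raises IllegalInstruction.
    """
    if not active:
        return None  # nothing active: "no prefix applied"
    kinds = {kind for kind, _ in active}
    if len(kinds) > 1:
        raise IllegalInstruction("contexts for different purposes mixed")
    merged = 0
    for _, payload in active:
        merged |= payload
    return (kinds.pop(), merged)


# Placeholder payloads: bit positions are illustrative only.
PREDICATION_ONLY = 0b111 << 21
MODE_ONLY = 0b10110

combined = merge_contexts([("rm", PREDICATION_ONLY), ("rm", MODE_ONLY)])
```

One Context holding only the predication bits can then be shared by several instructions that each combine it with a different Mode Context.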

These changes occur on a precise schedule: compilers should not have
difficulties statically allocating the Context Propagation, as long
as certain conventions are followed, such as not allowing the
context to propagate through branch targets reachable by more than one
incoming path, or through variable-length loops.

Loops matter because, if the setup of the shift registers does
not precisely match the number of instructions, the meaning of those
instructions will change as the bits in the shift registers run out!
However if the loops are of fixed static size, with no conditional early
exit, and small enough (40 instructions maximum) then it is perfectly
reasonable to insert repeated patterns into the shift registers, enough
to cover all the loops.  Ordinarily however the use of the Context
Propagation instructions should be inside the loop and it is the
responsibility of the compiler and assembler writer to ensure that the
shift registers reach zero before any loop jump-back point.
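
A compiler or assembler could check both conventions with something like the sketch below.  The function names are hypothetical, and a pattern is modelled simply as a list of 0/1 flags, one per instruction in program order.

```python
SREG_BITS = 40  # length of one shift register, from the text above


def unrolled_pattern(body, iterations):
    """Repeat a per-instruction pattern to cover a fixed-size loop.

    body holds one 0/1 flag per instruction of the loop body.  The result
    must fit in a single shift register, otherwise ValueError is raised.
    """
    pattern = list(body) * iterations
    if len(pattern) > SREG_BITS:
        raise ValueError("pattern does not fit in a 40 bit shift register")
    return pattern


def drained_at_jump_back(pattern, instructions_before_jump):
    """True if no set bits remain once the jump-back point is reached."""
    return not any(pattern[instructions_before_jump:])
```

For example, a four-instruction body repeated three times needs 12 bits, whilst an eight-instruction body repeated six times would need 48 and is rejected.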

## Pseudocode

The internal data structures need not precisely match the SPRs.  Here are
some internal data structures:

    bit sreg[7][40]    # seven 40 bit shift registers
    bit context[7][24] # seven contexts
    int sregoffs[7]    # indicator where last bits were placed

The Context Propagation instruction then inserts bits into the selected
stream:

    count = 20-count_trailing_zeros(imm) # number of schedule bits supplied
    context[idx] = new_context
    start = sregoffs[idx]                # first free position in the stream
    sreg[idx][start:start+count] = imm[0:count]
    sregoffs[idx] += count

With each shift register being maintained independently, the new bits are
dropped in where the last ones end.  The context to be applied to the
next instruction is then obtained as follows:

    apply_context = 0
    for i in range(7):
        if sreg[i][0]:
            apply_context |= context[i]
        sreg[i] = sreg[i] >> 1
        if sregoffs[i] > 0:
            sregoffs[i] -= 1

Note that it is the LSB that says which context is to be applied.
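
The pseudocode can be exercised with a small runnable model.  The sketch below is one interpretation, with several assumptions: each shift register is a Python list whose element 0 is the LSB; `imm[0:count]` is read in MSB0 order, which implies that the schedule bits are followed by a single 1 marker bit and then zeros; an all-zero `imm` inserts nothing; overflowing the 40 bits raises an error; and `sregoffs` is not allowed to go below zero.  Streams are numbered 0 to 6 as in the pseudocode.  The class and method names are hypothetical.

```python
IMM_BITS = 21
SREG_BITS = 40
NUM_CONTEXTS = 7


class ContextPropagator:
    def __init__(self):
        self.sreg = [[0] * SREG_BITS for _ in range(NUM_CONTEXTS)]
        self.context = [0] * NUM_CONTEXTS
        self.sregoffs = [0] * NUM_CONTEXTS

    def insert(self, idx, new_context, imm):
        """Model of the Context Propagation instruction for stream idx."""
        self.context[idx] = new_context
        if imm == 0:
            return  # assumption: no marker bit, so nothing is scheduled
        trailing_zeros = (imm & -imm).bit_length() - 1
        count = (IMM_BITS - 1) - trailing_zeros
        # imm[0:count] in MSB0 order: the bits above the marker bit
        bits = [(imm >> (IMM_BITS - 1 - n)) & 1 for n in range(count)]
        start = self.sregoffs[idx]
        if start + count > SREG_BITS:
            raise ValueError("shift register overflow")  # assumption
        self.sreg[idx][start:start + count] = bits
        self.sregoffs[idx] += count

    def step(self):
        """Return the merged context for the next instruction, then shift."""
        apply_context = 0
        for i in range(NUM_CONTEXTS):
            if self.sreg[i][0]:
                apply_context |= self.context[i]
            self.sreg[i] = self.sreg[i][1:] + [0]  # shift down by one
            if self.sregoffs[i] > 0:
                self.sregoffs[i] -= 1
        return apply_context


cp = ContextPropagator()
cp.insert(0, 0x00AA00, 0b1011 << 17)  # schedule 1,0,1 then marker
cp.insert(1, 0x000055, 0b111 << 18)   # schedule 1,1 then marker
results = [cp.step() for _ in range(4)]
```

In this run the first instruction receives both contexts ORed together, the second only stream 1, the third only stream 0, and the fourth none.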

# Swizzle Propagation

Swizzle Contexts follow the same schedule except that there is a mask
for specifying to which registers the swizzle is to be applied, and
there is only a 17 bit suite to indicate the instructions to which the
swizzle applies.

The bits in the svp64 `RM` field are interpreted as a pair of 12 bit
swizzles.

| 0.5| 6.8 | 9.11| 12.14 | 15.31 |  name   |
| -- | --- | --- | ----- | ----- | ------- |
| OP |     | MMM | mask  |       | ?-Form  |
| OP | idx | 001 | mask  |  imm  |         |

Note however that it is only svp64 encoded instructions to which swizzle
applies, so Swizzle Shift Registers only activate (and shift down)
on svp64 instructions. *This includes Context-propagated ones!*

The mask is encoded as follows:

* bit 0 indicates that src1 is swizzled
* bit 1 indicates that src2 is swizzled
* bit 2 indicates that src3 is swizzled

When the compiler creates Swizzle Contexts it is important to recall
that the Contexts will be ORed together. Thus one Context may specify
a mask whilst the other Context specifies the swizzles: ORing different
mask contexts with different swizzle Contexts allows more combinations
than would normally fit into seven Contexts.

More than one bit is permitted to be set in the mask: swiz1 is applied
to the first src operand specified by the mask, and swiz2 is applied to
the second.
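
The allocation of swiz1 and swiz2 to operands can be sketched as below.  The mask is passed as three flags in the order bit 0, bit 1, bit 2 to avoid committing to a bit ordering, and the swizzles are treated as opaque 12 bit values.  The function name is hypothetical, and raising an error when all three mask bits are set is an assumption: the text above does not define that case.

```python
def assign_swizzles(mask, swiz1, swiz2):
    """Map swiz1/swiz2 onto the source operands selected by the mask.

    mask is (src1, src2, src3): mask bits 0, 1 and 2 in that order.
    Returns a dict from operand name to the swizzle applied to it.
    """
    names = ("src1", "src2", "src3")
    selected = [name for name, flag in zip(names, mask) if flag]
    if len(selected) > 2:
        # assumption: only two swizzles exist, so three operands is an error
        raise ValueError("only two swizzles are available")
    return dict(zip(selected, (swiz1, swiz2)))
```

With only one mask bit set, only swiz1 is used.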

# 2D/3D Matrix Remap

[[sv/remap]] allows up to four Vectors (all four arguments of `fma` for example)
to be arbitrarily remapped algorithmically via 1D, 2D or 3D reshaping.
The amount of information needed to do so is however quite large:
consequently it is only practical to apply indirectly, via Context
propagation.

Vectors may be remapped such that Matrix multiply of any arbitrary size
is performed in one Vectorised `fma` instruction as long as the total
number of elements is less than 64 (maximum for VL).

Additionally, in a fashion known as "Structure Packing" in NEON and RVV,
it may be used to perform "zipping" and "unzipping" of
elements in a regular fashion of any arbitrary size and depth: RGB
or Audio channel data may be split into separate contiguous lanes of
registers, for example.
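
To illustrate what "unzipping" means in terms of element indices, the sketch below expresses it as a plain index permutation, equivalent to treating the data as a 2D array and transposing it.  This is not the [[sv/remap]] algorithm itself, only a picture of the kind of reordering it produces; `unzip_indices` is a hypothetical helper.

```python
def unzip_indices(num_items, num_channels):
    """Source index of each destination element when de-interleaving.

    Destination order is channel-major: all of channel 0, then channel 1...
    """
    return [item * num_channels + channel
            for channel in range(num_channels)
            for item in range(num_items)]


# Four interleaved RGB pixels, as they might sit in consecutive elements.
pixels = ["r0", "g0", "b0", "r1", "g1", "b1",
          "r2", "g2", "b2", "r3", "g3", "b3"]
order = unzip_indices(4, 3)
planar = [pixels[i] for i in order]
```

After the permutation the red, green and blue values each occupy a contiguous run of elements.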

There are four possible Shapes.  Unlike swizzle contexts this one requires
the external remap Shape SPRs because the state information is too large
to fit into the Context itself.  Thus the Remap Context says which Shapes
apply to which registers.

The instruction format is the same as `RM` and thus uses 21 bits of
immediate, up to 20 of which are dropped into the indexed Shift Register.

| 0.5| 6.8 | 9.10| 11.14 | 15.31|  name       |
| -- | --- | --- | ----- | ---- | ----------- |
| OP |     | MM  |       |      | ?-Form      |
| OP | idx | 10  | brev  | imm  | Remap       |
| OP | idx | 11  | brev  | imm  | SUBVL Remap |

SUBVL Remap applies the remapping even into the SUBVL Elements, for a
total of `VL*SUBVL` Elements.  **swizzle may be applied on top as a second
phase** after SUBVL Remap.

The brev field is also applied down to individual SUBVL elements (not to
the whole vec2/3/4: that would be handled by swizzle reordering), and is
encoded as follows:

* bit 0 indicates that dest elements are byte-reversed
* bit 1 indicates that src1 elements are byte-reversed
* bit 2 indicates that src2 elements are byte-reversed
* bit 3 indicates that src3 elements are byte-reversed

Again it is the 24 bit `RM` that is interpreted differently:

| 0...7 | 8....23 |
| ----- | ------- |
| sh0-3 | mask0-3 |

The shape indices sh0-3 each select one of the Shapes 0-3, whilst the
masks are bitmasks that indicate the src or dest to which the associated
shape is to be applied.  A zero mask indicates that the Shape is not to be
applied.  Note that whilst the masks are unary encoded the Shape indices
sh0-3 are not: this must be taken into consideration when ORing occurs.

The mask is encoded as follows:

* bit 0 indicates that the first svp64 EXTRA field is reshaped
* bit 1 indicates that the second svp64 EXTRA field is reshaped
* bit 2 indicates that the third svp64 EXTRA field is reshaped
* bit 3 indicates that the fourth svp64 EXTRA field is reshaped

This allows even instructions that have 2 destination registers to be reshaped.
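
A Remap Context could be unpacked as in the sketch below.  It assumes MSB0 numbering, that sh0 occupies bits 0..1 and mask0 bits 8..11 (the ordering of the sub-fields within `sh0-3` and `mask0-3` is not stated above), and that mask bit 0 is the most significant bit of each 4 bit mask.  Both function names are hypothetical.

```python
def decode_remap_context(rm):
    """Split a 24 bit Remap context into four (shape_index, mask) pairs."""
    pairs = []
    for n in range(4):
        shape = (rm >> (22 - 2 * n)) & 0b11    # bits 2n..2n+1 (MSB0)
        mask = (rm >> (12 - 4 * n)) & 0b1111   # bits 8+4n..11+4n (MSB0)
        pairs.append((shape, mask))
    return pairs


def reshaped_fields(mask):
    """List which EXTRA fields (1..4) a 4 bit mask selects."""
    return [n + 1 for n in range(4) if (mask >> (3 - n)) & 1]


# sh0-3 = 0,1,2,3 and mask0-3 = 0b1000, 0b0100, 0b0000, 0b0011
example = 0x1B8403
decoded = decode_remap_context(example)
```

The ORing hazard is easy to demonstrate with this model: merging one Context whose sh0 is 1 with another whose sh0 is 2 yields sh0 equal to 3, which is a different Shape rather than a combination of the two.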

# setvl

The setvl context fits into 22 bits, 2 of which are for future
expansion of the SV Vector Length.  With a further 2 reserved bits the
total is 24 bits, which is exactly the same size as the SVP64 RM.

| 0.5|6.10| 11..18 | 19..20 |21| 22.23 |
| -- | -- | ------ | ------ |--| ----- |
| RT | RA | SVi // | vs ms  |Rc| rsvd  |
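
The field boundaries in the table can be checked with a short sketch.  It assumes MSB0 numbering within the 24 bit context and treats `SVi //` and `vs ms` each as a single opaque field, since their internal layout is not given here; `SETVL_FIELDS` and `decode_setvl_context` are hypothetical names.

```python
# (first bit, last bit) of each field, inclusive, MSB0 within 24 bits
SETVL_FIELDS = {
    "RT": (0, 5),
    "RA": (6, 10),
    "SVi_field": (11, 18),
    "vs_ms": (19, 20),
    "Rc": (21, 21),
    "rsvd": (22, 23),
}


def decode_setvl_context(ctx):
    """Split a 24 bit setvl context into the fields of the table."""
    out = {}
    for name, (start, end) in SETVL_FIELDS.items():
        length = end - start + 1
        out[name] = (ctx >> (23 - end)) & ((1 << length) - 1)
    return out


# Sum of all field widths, as a sanity check on the table.
total_bits = sum(end - start + 1 for start, end in SETVL_FIELDS.values())

example = (3 << 18) | (5 << 13) | (0x40 << 5) | (0b10 << 3) | (1 << 2)
decoded_setvl = decode_setvl_context(example)
```

The widths sum to 24, matching the size of the svp64 `RM` field.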