simple_v_extension/specification/mv.x.rst

[[!tag standards]]

MV.X and MV.swizzle
===================

Swizzle needs an MV instruction (there are two of them: swizzle and swizzle2).
See below for a potential way to use funct7 to indicate a swizzle in rs2.

+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   | imm[11:0]                      | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   | fn4[3:0]    | swizzle[7:0]     | rs1[4:0] | 0b000  | rd[4:0]  | OP-V   | 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV: 0b000 for FP, 0b001 for INT
* OP-V = 0b1010111
* fn4 = 4-bit function field
* fn4 = 0b0000 - MV-SWIZZLE
* fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
* fn4 = 0bNN11 - MV-X.SUBVL, NN=elwidth (default/8/16/32)

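As a rough illustration of the fn4 list above, the following sketch decodes a 4-bit fn4 value into an operation name and an element width. The function name ``decode_fn4`` and the choice to return ``None`` for encodings not listed above are assumptions made for this example only.

```python
# Hypothetical decoder for the fn4 field; names are illustrative only.
ELWIDTH_NAMES = {0: "default", 1: "8", 2: "16", 3: "32"}

def decode_fn4(fn4):
    """Return (operation, elwidth) for a 4-bit fn4 value, or None if unlisted."""
    if fn4 == 0b0000:
        return ("MV-SWIZZLE", None)
    elwidth = ELWIDTH_NAMES[(fn4 >> 2) & 0b11]  # NN field, bits 3:2
    low = fn4 & 0b11                            # bits 1:0 select the MV.X mode
    if low == 0b01:
        return ("MV-X", elwidth)
    if low == 0b11:
        return ("MV-X.SUBVL", elwidth)
    return None  # assumption: encodings not listed in the text are unassigned
```

For example, ``decode_fn4(0b0101)`` gives MV-X with an element width of 8.
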
swizzle field (only active on SV or P48/P64 when SUBVL != 0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
|   w |   z |   y |   x |
+-----+-----+-----+-----+

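A minimal sketch of how the 8-bit swizzle field could be applied to a 4-element sub-vector. It assumes that each 2-bit field holds the *source* lane for the destination lane it is named after (x in bits 1:0 up to w in bits 7:6); the text does not state this explicitly, and ``apply_swizzle`` is a name made up for the example.

```python
def apply_swizzle(src, imm):
    """Permute a 4-element list according to an 8-bit swizzle immediate.

    Assumption: field k (bits 2k+1:2k) gives the source lane that is
    copied into destination lane k, with lane 0 = x and lane 3 = w.
    """
    return [src[(imm >> (2 * lane)) & 0b11] for lane in range(4)]

vec = ["x", "y", "z", "w"]
identity = 0b11_10_01_00   # w<-w, z<-z, y<-y, x<-x
reverse = 0b00_01_10_11    # w<-x, z<-y, y<-z, x<-w
broadcast_x = 0b00_00_00_00
```

Under that assumption, ``apply_swizzle(vec, reverse)`` yields the sub-vector in reverse order and ``broadcast_x`` copies x into all four lanes.
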
MV.X has two modes. SUBVL mode applies the element offsets only within the SUBVL inner loop, so the same offsets are reused for every sub-vector. This can be used for transposition.

::

  for i in range(VL):
     for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+j]]

Normal mode applies the element offsets incrementally across the whole vector:

::

  k = 0
  for i in range(VL):
     for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+k]]
        k += 1


Pseudocode for the element width part of MV.X:

::

  def mv_x(rd, rs1, funct4):
      elwidth = (funct4>>2) & 0x3
      bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
      bytewidth = bitwidth // 8 # get bytes per el
      for i in range(VL):
          addr = rs1 * (XLEN // 8) # byte address of regs[rs1]
          offset = addr + i * bytewidth # get offset within regfile as SRAM
          # TODO, actually, needs to respect rd and rs1 element width,
          # here, as well.  this pseudocode just illustrates that the
          # MV.X operation contains a way to compact the indices into
          # less space.
          regs[rd] = ((unsigned char*)regs)[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, such that rather than doing this:

.. parsed-literal::

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:

.. parsed-literal::

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

This compacts four indices into one register.  The element widths of x3
and x8 are *independent* of the MV.X elwidth, which allows both the source
and destination element widths of the *elements* being moved to be overridden,
whilst *at the same time* allowing the *indices* to be compacted as well.

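The compaction above can be checked with a small sketch. It assumes the indices are packed little-endian, lowest-numbered element in the least significant bits, which is what the ``0x00020301`` constant implies for the indices 1, 3, 2, 0. The helper names are hypothetical.

```python
def pack_indices(indices, bits):
    """Pack a list of small indices into one integer, element 0 lowest."""
    value = 0
    for n, idx in enumerate(indices):
        if not 0 <= idx < (1 << bits):
            raise ValueError("index does not fit in the element width")
        value |= idx << (n * bits)
    return value

def unpack_indices(value, bits, count):
    """Recover `count` indices of `bits` width from a packed integer."""
    mask = (1 << bits) - 1
    return [(value >> (n * bits)) & mask for n in range(count)]

packed = pack_indices([1, 3, 2, 0], 8)
```

Here ``packed`` equals ``0x00020301``, matching the ``ldimm x8`` constant used with ``elwidth=8``.
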
----

Potential MV.X?  A register version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | funct7        | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | 0b0000000     | rs2[4:0] | rs1[4:0] | 0b001  | rd[4:0]  | OP-V   | 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV.X
* OP-V = 0b1010111
* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?

Question: do we need a swizzle MV.X as well?

MV.X with 3 operands
====================

::

  regs[rd] = regs[rs1 + regs[rs2]]

Similar to LD/ST, with the same twin predication rules.

macro-op fusion
===============

There is the potential for macro-op fusion of mv-swizzle with the following instruction and/or the preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

VBLOCK context?
===============

Additional idea: a VBLOCK context that says that if a given register is used, it indicates that the
register is to be "swizzled", with the VBLOCK swizzle context containing the swizzling to be carried out.

mm_shuffle_ps?
==============

::

  __m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
         _MM_SHUFFLE(hi3,hi2,lo1,lo0))

Selects two floats from lo for the low 2 floats of the output and two floats
from hi for the high 2 floats of the output. Basically:

::

  out[0]=lo[lo0];
  out[1]=lo[lo1];
  out[2]=hi[hi2];
  out[3]=hi[hi3];

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
a[i] into all 4 output floats.

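The selection rule above can be modelled in a few lines. This is only a behavioural sketch of the intrinsic on Python lists, not how it would be used in C; ``mm_shuffle`` mirrors the ``_MM_SHUFFLE`` macro, which packs its four 2-bit arguments from most to least significant.

```python
def mm_shuffle(hi3, hi2, lo1, lo0):
    """Model of _MM_SHUFFLE: pack four 2-bit lane selectors into 8 bits."""
    return (hi3 << 6) | (hi2 << 4) | (lo1 << 2) | lo0

def mm_shuffle_ps(lo, hi, imm):
    """Model of _mm_shuffle_ps on two 4-element lists."""
    return [
        lo[imm & 3],         # out[0] = lo[lo0]
        lo[(imm >> 2) & 3],  # out[1] = lo[lo1]
        hi[(imm >> 4) & 3],  # out[2] = hi[hi2]
        hi[(imm >> 6) & 3],  # out[3] = hi[hi3]
    ]

a = [10.0, 11.0, 12.0, 13.0]
splat2 = mm_shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2))
```

``splat2`` holds ``a[2]`` in all four positions, which is the broadcast case mentioned above.
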
Transpose
=========

Assume a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using).
Using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, giving 8 instructions in total:

input:

| m00 m10 m20 m30 |
| m01 m11 m21 m31 |
| m02 m12 m22 m32 |
| m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate:

| m00 m01 m20 m21 |
| m10 m11 m30 m31 |
| m02 m03 m22 m23 |
| m12 m13 m32 m33 |

finish transpose

output:

| m00 m01 m02 m03 |
| m10 m11 m12 m13 |
| m20 m21 m22 m23 |
| m30 m31 m32 m33 |

<http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>


::

   __m128i T0 = _mm_unpacklo_epi32(I0, I1);
   __m128i T1 = _mm_unpacklo_epi32(I2, I3);
   __m128i T2 = _mm_unpackhi_epi32(I0, I1);
   __m128i T3 = _mm_unpackhi_epi32(I2, I3);

   /* Assigning transposed values back into I[0-3] */
   I0 = _mm_unpacklo_epi64(T0, T1);
   I1 = _mm_unpackhi_epi64(T0, T1);
   I2 = _mm_unpacklo_epi64(T2, T3);
   I3 = _mm_unpackhi_epi64(T2, T3);

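The two-step, eight-instruction transpose described in this section can be sketched with a generic (4+4) -> 4 selection helper. The helper ``select8`` is hypothetical: it picks each output element by an index 0-7 into the concatenation of its two inputs, and is not meant to reflect the actual swizzle2 selector encoding.

```python
def select8(a, b, sel):
    """(4+4) -> 4 selection: index 0-3 picks from a, 4-7 picks from b."""
    both = a + b
    return [both[s] for s in sel]

def transpose4x4(rows):
    """Transpose a 4x4 matrix held as four 4-element rows, in 8 selections."""
    r0, r1, r2, r3 = rows
    # Step 1: transpose the four corner 2x2 matrices.
    t0 = select8(r0, r1, [0, 4, 2, 6])
    t1 = select8(r0, r1, [1, 5, 3, 7])
    t2 = select8(r2, r3, [0, 4, 2, 6])
    t3 = select8(r2, r3, [1, 5, 3, 7])
    # Step 2: swap the top-right and bottom-left 2x2 blocks.
    return [
        select8(t0, t2, [0, 1, 4, 5]),
        select8(t1, t3, [0, 1, 4, 5]),
        select8(t0, t2, [2, 3, 6, 7]),
        select8(t1, t3, [2, 3, 6, 7]),
    ]

# Same labelling as the diagrams: row r holds m0r m1r m2r m3r.
m = [[f"m{c}{r}" for c in range(4)] for r in range(4)]
out = transpose4x4(m)
```

With the input labelled as in the diagrams, the first row of ``out`` is ``m00 m01 m02 m03``.
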
Transforms for DCT
==================

<https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>

Table to evaluate
=================

swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector).

+-----------+-------+-------+-------+-------+-------+------+
|           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+=======+=======+=======+=======+======+
| swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzlei  | imm                   | rs1   | 001   | rd   |
+-----------+                       +-------+-------+------+
| fswizzlei |                       | rs1   | 010   | rd   |
+-----------+-------+-------+-------+-------+-------+------+

More:

swizzlei would still need the 12-bit format, because there are otherwise not enough immediate bits. We can get away with only 3 I-type funct3 values for [f]swizzlei by sharing one funct3 between the int and fp versions for destsubvl 1 through 3 (bit 31 distinguishes them), and using a separate funct3 each for int and fp when destsubvl = 4:

+--------+-----------+----+-----------+----------+-------+-------+------+
| int/fp | DESTSUBVL | 31 | 30:29     | 28:20    | 19:15 | 14:12 | 11:7 |
+========+===========+====+===========+==========+=======+=======+======+
| int    | 1 to 3    | 0  | DESTSUBVL | selector | rs    | 000   | rd   |
+--------+-----------+----+-----------+----------+-------+-------+------+
| fp     | 1 to 3    | 1  | DESTSUBVL | selector | rs    | 000   | rd   |
+--------+-----------+----+-----------+----------+-------+-------+------+
| int    | 4         | selector[11:0]            | rs    | 001   | rd   |
+--------+-----------+---------------------------+-------+-------+------+
| fp     | 4         | selector[11:0]            | rs    | 010   | rd   |
+--------+-----------+---------------------------+-------+-------+------+

The rest could be encoded as follows:

+-----------+-------+-----------+-------+-------+-------+------+
|           | 31:27 | 26:25     | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+===========+=======+=======+=======+======+
| swizzle2  | rs3   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| swizzle   | rs1   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| fswizzle2 | rs3   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| fswizzle  | rs1   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+

Note how, for [f]swizzle, rs3 == rs1.

So it uses 5 funct3 values overall, which is appropriate, since swizzle is probably second only to muladd in frequency of use in graphics shaders.

Alternative:

+--------+-----------+----------+-------+-------+------+
| int/fp | 31:28     | 27:20    | 19:15 | 14:12 | 11:7 |
+========+===========+==========+=======+=======+======+
| int    | DESTMASK  | selector | rs    | 000   | rd   |
+--------+-----------+----------+-------+-------+------+
| fp     | DESTMASK  | selector | rs    | 001   | rd   |
+--------+-----------+----------+-------+-------+------+
| int    | DESTMASK  | constsel | rs    | 010   | rd   |
+--------+-----------+----------+-------+-------+------+
| fp     | DESTMASK  | constsel | rs    | 011   | rd   |
+--------+-----------+----------+-------+-------+------+

Matrix 4x4 Vector mul
=====================

::

    pfscale,3 F2, F1, F10
    pfscaleadd,2 F2, F1, F11, F2
    pfscaleadd,1 F2, F1, F12, F2
    pfscaleadd,0 F2, F1, F13, F2

pfscale is a 4 vec mv.shuffle followed by an fmul. pfscaleadd is a 4 vec mv.shuffle followed by an fmac.

In effect what this is doing is:

::

    fmul f2, f1.xxxx, f10
    fmac f2, f1.yyyy, f11, f2
    fmac f2, f1.zzzz, f12, f2
    fmac f2, f1.wwww, f13, f2

Here f2, f1 and f10-13 are all vec4. The selected element of f1 (x, y, z or w) is copied at a fixed index, while the indices of the other vec4 operands progress.

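The fmul/fmac sequence above can be sketched on plain Python lists. The sketch assumes that f10 to f13 hold the four columns of the matrix, which is what makes the sum a matrix-vector product; the helper names are made up for the example.

```python
def broadcast(v, lane):
    """f1.xxxx style swizzle: copy one element of v into all four lanes."""
    return [v[lane]] * 4

def fmul(a, b):
    """Element-wise multiply of two vec4 values."""
    return [x * y for x, y in zip(a, b)]

def fmac(a, b, acc):
    """Element-wise multiply-accumulate: a * b + acc."""
    return [x * y + z for x, y, z in zip(a, b, acc)]

def mat4_mul_vec4(cols, v):
    """Multiply a 4x4 matrix (given as four columns) by a vec4."""
    acc = fmul(broadcast(v, 0), cols[0])              # fmul f2, f1.xxxx, f10
    for lane in range(1, 4):
        acc = fmac(broadcast(v, lane), cols[lane], acc)  # fmac with yyyy, zzzz, wwww
    return acc

cols = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = mat4_mul_vec4(cols, [1, 0, 0, 1])
```

With ``v = [1, 0, 0, 1]`` the result is simply the sum of the first and last columns.
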
Pseudocode
==========

Swizzle:

::

    pub trait SwizzleConstants: Copy + 'static {
        const CONSTANTS: &'static [Self; 4];
    }

    impl SwizzleConstants for u8 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
    }

    impl SwizzleConstants for u16 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
    }

    impl SwizzleConstants for f32 {
        const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
    }

    // impl for other types too...

    pub fn swizzle<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        Elm: SwizzleConstants,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 clear: read from rs1; bit 2 set: read a constant
                let src: &[Elm] = if (sel_field & 0b100) == 0 {
                    &rs1[(vindex * srcsubvl)..]
                } else {
                    Elm::CONSTANTS
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[sel_field as usize];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }

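To make the 3-bit selector fields concrete, here is a small Python model of the swizzle pseudocode above for a single sub-vector. It uses the f32 constant table from the trait implementation and raises ``ValueError`` where the pseudocode would trap; the function name and the use of exceptions are assumptions made for this example.

```python
CONSTANTS_F32 = [0.0, 1.0, -1.0, 0.5]  # same table as the f32 impl above

def swizzle_one(src, selector, destsubvl, constants=CONSTANTS_F32):
    """Apply a packed 3-bit-per-element selector to one source sub-vector."""
    srcsubvl = len(src)
    if selector >> (3 * destsubvl):
        raise ValueError("selector has bits set beyond destsubvl fields")
    out = []
    for i in range(destsubvl):
        field = (selector >> (3 * i)) & 0b111
        pool = constants if field & 0b100 else src  # bit 2 picks constants
        index = field & 0b11
        if index >= srcsubvl:  # same check as the pseudocode, for both pools
            raise ValueError("selector index out of range for srcsubvl")
        out.append(pool[index])
    return out

# dest.x = src[2], dest.y = constant 1.0, dest.z = src[0], dest.w = constant 0.0
selector = 0b100_000_101_010
vec = swizzle_one([1.5, 2.5, 3.5], selector, destsubvl=4)
```

This turns a 3-element source into a 4-element result by filling two of the lanes from the constant table.
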
Swizzle2:

::

    fn swizzle2<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        rs3: &[Elm],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        // Elm is a copyable type
        Elm: Copy,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 set: read from rs1; bit 2 clear: read from rs3
                let src = if (sel_field & 0b100) != 0 {
                    rs1
                } else {
                    rs3
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[vindex * srcsubvl + (sel_field as usize)];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }
