simple_v_extension/specification/mv.x.rst

[[!tag standards]]

MV.X and MV.swizzle
===================

Swizzle needs a MV instruction (there are two of them: swizzle and swizzle2).
See below for a potential way to use funct7 to indicate a swizzle in rs2.

+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   | imm[11:0]                      | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   | fn4[3:0]    | swizzle[7:0]     | rs1[4:0] | 0b000  | rd[4:0]  | OP-V   | 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV: 0b000 for FP, 0b001 for INT
* OP-V = 0b1010111
* fn4 = 4 bit function.
* fn4 = 0b0000 - MV-SWIZZLE
* fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
* fn4 = 0bNN11 - MV-X.SUBVL NN=elwidth (default/8/16/32)

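As a minimal sketch, the fn4 patterns listed above could be decoded as follows. The function name ``decode_fn4`` and the choice to report the "default" element width as ``None`` are assumptions made for illustration only, and patterns that are not listed are simply rejected.

```python
def decode_fn4(fn4):
    """Decode the 4-bit fn4 field into (operation, elwidth_bits).

    elwidth_bits is None for the "default" element width and for
    MV-SWIZZLE (hypothetical convention, not part of the draft).
    """
    if not 0 <= fn4 <= 0b1111:
        raise ValueError("fn4 is a 4-bit field")
    if fn4 == 0b0000:
        return ("MV-SWIZZLE", None)
    # NN (bits 3:2) selects the element width: default/8/16/32
    elwidth = {0: None, 1: 8, 2: 16, 3: 32}[(fn4 >> 2) & 0b11]
    low = fn4 & 0b11
    if low == 0b01:
        return ("MV-X", elwidth)
    if low == 0b11:
        return ("MV-X.SUBVL", elwidth)
    raise ValueError("fn4 pattern not listed in the table")
```
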
swizzle (only active on SV or P48/P64 when SUBVL!=0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
|   w |   z |   y |   x |
+-----+-----+-----+-----+

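The field layout in the table above can be illustrated with a small helper that splits the 8-bit swizzle immediate into its four 2-bit selectors. The name ``decode_swizzle`` is hypothetical, and the sketch deliberately does not interpret what each selector value means, since only the bit positions are given here.

```python
def decode_swizzle(imm8):
    """Split an 8-bit swizzle immediate into four 2-bit selectors."""
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("swizzle is an 8-bit field")
    return {
        "x": imm8 & 0b11,         # bits 1:0
        "y": (imm8 >> 2) & 0b11,  # bits 3:2
        "z": (imm8 >> 4) & 0b11,  # bits 5:4
        "w": (imm8 >> 6) & 0b11,  # bits 7:6
    }
```
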
MV.X has two modes. SUBVL mode applies the element offsets only within
a SUBVL inner loop, so the same SUBVL offsets are re-used on every outer
iteration. This can be used for transposition.

::

  for i in range(VL):
     for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+j]]

Normal mode will apply the element offsets incrementally:

::

  k = 0
  for i in range(VL):
     for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+k]]
        k += 1


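To make the difference between the two modes concrete, the sketch below only lists which registers the offsets are read from, in loop order. The helper name ``offset_registers`` and its ``subvl_mode`` flag are assumptions for illustration; the actual data movement is left out.

```python
def offset_registers(rs, vl, subvl, subvl_mode):
    """Return the register numbers the offsets are read from, in loop order."""
    regs_read = []
    k = 0
    for i in range(vl):
        for j in range(subvl):
            # SUBVL mode restarts at rs for every i; normal mode keeps counting
            regs_read.append(rs + (j if subvl_mode else k))
            k += 1
    return regs_read
```
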
Pseudocode for the element width part of MV.X:

::

  def mv_x(rd, rs1, funct4):
      elwidth = (funct4>>2) & 0x3
      bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
      bytewidth = bitwidth // 8 # get bytes per el
      for i in range(VL):
          addr = (unsigned char *)&regs[rs1]
          offset = addr + i*bytewidth # get offset within regfile as SRAM
          # TODO, actually, needs to respect rd and rs1 element width,
          # here, as well.  this pseudocode just illustrates that the
          # MV.X operation contains a way to compact the indices into
          # less space.
          regs[rd] = (unsigned char*)(regs)[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, so that rather than doing this:

.. parsed-literal::

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

the same indices can be loaded with a single instruction:

.. parsed-literal::

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

This compacts four indices into the one register.  x3 and x8's element
widths are *independent* of the MV.X elwidth, thus allowing both the source
and destination element widths of the *elements* being moved to be
overridden, whilst *at the same time* allowing the *indices* to be
compacted as well.

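The packing used in the example above can be checked with a short sketch. The helper ``unpack_indices`` is hypothetical, and it assumes that index 0 sits in the least significant bits of the register, which is what the ``0x00020301`` example implies.

```python
def unpack_indices(value, elwidth_bits, vl):
    """Extract vl indices of elwidth_bits each, least significant first."""
    mask = (1 << elwidth_bits) - 1
    return [(value >> (i * elwidth_bits)) & mask for i in range(vl)]
```
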
----

potential MV.X?  register-version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | funct7        | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | 0b0000000     | rs2[4:0] | rs1[4:0] | 0b001  | rd[4:0]  | OP-V   | 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV.X
* OP-V = 0b1010111
* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?

question: do we need a swizzle MV.X as well?

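One way to read the tentative funct7 patterns above is sketched below: bit 0 distinguishes MV.swizzle from MV.X, bit 1 selects FP, and bits 3:2 carry NN. The function name ``decode_funct7`` and the tuple it returns are assumptions for illustration, and any pattern not in the list is rejected rather than guessed at.

```python
def decode_funct7(funct7):
    """Decode funct7 into (operation, is_fp, elwidth_bits).

    elwidth_bits is None for "default" and for MV.swizzle
    (hypothetical convention, not part of the draft).
    """
    if not 0 <= funct7 <= 0x7F:
        raise ValueError("funct7 is a 7-bit field")
    if funct7 >> 4:
        raise ValueError("top three bits must be zero in the listed patterns")
    is_fp = bool(funct7 & 0b10)
    if funct7 & 0b01:
        if funct7 & 0b1100:
            raise ValueError("MV.swizzle is only listed with NN=00")
        return ("MV.swizzle", is_fp, None)
    elwidth = {0: None, 1: 8, 2: 16, 3: 32}[(funct7 >> 2) & 0b11]
    return ("MV.X", is_fp, elwidth)
```
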
MV.X with 3 operands
====================

::

  regs[rd] = regs[rs1 + regs[rs2]]

Similar to LD/ST, with the same twin predication rules.

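A minimal sketch of how the 3-operand form might behave when vectorised is given below. It assumes that rd and rs2 step by one register per element and that the register file is a flat list; neither is stated above, predication is left out entirely, and the name ``mv_x3`` is hypothetical.

```python
def mv_x3(regs, rd, rs1, rs2, vl):
    """Indexed move: regs[rd+i] = regs[rs1 + regs[rs2+i]] for each element i."""
    for i in range(vl):
        index = regs[rs2 + i]           # per-element index taken from rs2
        regs[rd + i] = regs[rs1 + index]
    return regs
```
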
macro-op fusion
===============

There is the potential for macro-op fusion of mv-swizzle with the following
and/or preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

VBLOCK context?
===============

Additional idea: a VBLOCK context that says that if a given register is
used, that register is to be "swizzled", and the VBLOCK swizzle context
contains the swizzling to be carried out.

mm_shuffle_ps?
==============

::

    __m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
           _MM_SHUFFLE(hi3,hi2,lo1,lo0))

Interleave inputs into low 2 floats and high 2 floats of output. Basically:

::

    out[0]=lo[lo0];
    out[1]=lo[lo1];
    out[2]=hi[hi2];
    out[3]=hi[hi3];

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
a[i] into all 4 output floats.

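The behaviour described above can be modelled in a few lines, with plain lists standing in for ``__m128`` values. The helper names ``mm_shuffle`` and ``shuffle_ps`` are illustrative stand-ins for the ``_MM_SHUFFLE`` macro and the intrinsic, not real APIs.

```python
def mm_shuffle(z, y, x, w):
    """Model of _MM_SHUFFLE: pack four 2-bit selectors into 8 bits."""
    return (z << 6) | (y << 4) | (x << 2) | w


def shuffle_ps(lo, hi, imm8):
    """Model of _mm_shuffle_ps on 4-element lists."""
    return [
        lo[imm8 & 3],         # out[0] = lo[lo0]
        lo[(imm8 >> 2) & 3],  # out[1] = lo[lo1]
        hi[(imm8 >> 4) & 3],  # out[2] = hi[hi2]
        hi[(imm8 >> 6) & 3],  # out[3] = hi[hi3]
    ]
```
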
Transpose
=========

Assuming a vector of 4x4 matrices is stored as 4 separate vectors with
subvl=4 in struct-of-array-of-struct form (the form I've been planning on
using): using standard (4+4) -> 4 swizzle instructions with 2 input vectors
with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose
operation can be done in 2 steps with 4 instructions per step, giving 8
instructions in total.

input::

    | m00 m10 m20 m30 |
    | m01 m11 m21 m31 |
    | m02 m12 m22 m32 |
    | m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate::

    | m00 m01 m20 m21 |
    | m10 m11 m30 m31 |
    | m02 m03 m22 m23 |
    | m12 m13 m32 m33 |

finish transpose

output::

    | m00 m01 m02 m03 |
    | m10 m11 m12 m13 |
    | m20 m21 m22 m23 |
    | m30 m31 m32 m33 |

<http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>


::

   /* SSE2 intrinsics (emmintrin.h); I0..I3 hold the four rows */
   /* Interleave 32-bit elements of each pair of rows */
   __m128i T0 = _mm_unpacklo_epi32(I0, I1);
   __m128i T1 = _mm_unpacklo_epi32(I2, I3);
   __m128i T2 = _mm_unpackhi_epi32(I0, I1);
   __m128i T3 = _mm_unpackhi_epi32(I2, I3);

   /* Assigning transposed values back into I[0-3] */
   I0 = _mm_unpacklo_epi64(T0, T1);
   I1 = _mm_unpackhi_epi64(T0, T1);
   I2 = _mm_unpacklo_epi64(T2, T3);
   I3 = _mm_unpackhi_epi64(T2, T3);

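The two-step unpack sequence above can be checked with a small model in which each ``__m128i`` is a list of four 32-bit lanes. The helper names are illustrative stand-ins for the intrinsics, and the sketch only models lane movement.

```python
def unpacklo32(a, b):
    """Model of _mm_unpacklo_epi32: interleave the low two lanes."""
    return [a[0], b[0], a[1], b[1]]


def unpackhi32(a, b):
    """Model of _mm_unpackhi_epi32: interleave the high two lanes."""
    return [a[2], b[2], a[3], b[3]]


def unpacklo64(a, b):
    """Model of _mm_unpacklo_epi64: low 64 bits of a, then of b."""
    return [a[0], a[1], b[0], b[1]]


def unpackhi64(a, b):
    """Model of _mm_unpackhi_epi64: high 64 bits of a, then of b."""
    return [a[2], a[3], b[2], b[3]]


def transpose4x4(rows):
    """Transpose a 4x4 matrix using the same 8 operations as the C code."""
    i0, i1, i2, i3 = rows
    t0 = unpacklo32(i0, i1)
    t1 = unpacklo32(i2, i3)
    t2 = unpackhi32(i0, i1)
    t3 = unpackhi32(i2, i3)
    return [unpacklo64(t0, t1), unpackhi64(t0, t1),
            unpacklo64(t2, t3), unpackhi64(t2, t3)]
```
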
Transforms for DCT
==================

<https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>

Table to evaluate
=================

swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector).

+-----------+-------+-------+-------+-------+-------+------+
|           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+=======+=======+=======+=======+======+
| swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzlei  | imm                   | rs1   | 001   | rd   |
+-----------+                       +-------+-------+------+
| fswizzlei |                       | rs1   | 010   | rd   |
+-----------+-------+-------+-------+-------+-------+------+

Matrix 4x4 Vector mul
=====================

::

    pfscale,3 F2, F1, F10
    pfscaleadd,2 F2, F1, F11, F2
    pfscaleadd,1 F2, F1, F12, F2
    pfscaleadd,0 F2, F1, F13, F2

pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.

In effect what this is doing is:

::

    fmul f2, f1.xxxx, f10
    fmac f2, f1.yyyy, f11, f2
    fmac f2, f1.zzzz, f12, f2
    fmac f2, f1.wwww, f13, f2

Where all of f2, f1, and f10-13 are vec4.

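The fmul/fmac sequence above can be modelled as follows. This is a sketch under the assumption that f1 holds the input vector and f10 to f13 hold the four columns of the matrix; the text only says they are all vec4, and the name ``mat4_vec4_mul`` is hypothetical.

```python
def mat4_vec4_mul(columns, v):
    """Compute sum(v[k] * columns[k]) with one broadcast multiply per step."""
    x, y, z, w = v
    acc = [x * c for c in columns[0]]                   # fmul f2, f1.xxxx, f10
    acc = [a + y * c for a, c in zip(acc, columns[1])]  # fmac f2, f1.yyyy, f11, f2
    acc = [a + z * c for a, c in zip(acc, columns[2])]  # fmac f2, f1.zzzz, f12, f2
    acc = [a + w * c for a, c in zip(acc, columns[3])]  # fmac f2, f1.wwww, f13, f2
    return acc
```
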
Pseudocode
==========

Swizzle:

::

    pub trait SwizzleConstants: Copy + 'static {
        const CONSTANTS: &'static [Self; 4];
    }

    impl SwizzleConstants for u8 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
    }

    impl SwizzleConstants for u16 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
    }

    impl SwizzleConstants for f32 {
        const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
    }

    // impl for other types too...

    pub fn swizzle<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        Elm: SwizzleConstants,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            // the annotation is needed for the .into() call to type-check
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 clear: take from rs1; bit 2 set: take a constant
                let src = if (sel_field & 0b100) == 0 {
                    &rs1[(vindex * srcsubvl)..]
                } else {
                    &Elm::CONSTANTS[..]
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[sel_field as usize];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }

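For experimenting with selector values, the single-source swizzle above can be mirrored in a few lines of Python. This sketch hard-codes the f32 constant table, raises ``ValueError`` where the Rust comments mention an illegal instruction trap, and adds a hypothetical ``pack_selector`` helper that is not part of the draft.

```python
CONSTANTS = (0.0, 1.0, -1.0, 0.5)  # f32 constants from the Rust sketch
FIELD_SIZE = 3


def pack_selector(fields):
    """Pack 3-bit selector fields, field 0 in the least significant bits."""
    selector = 0
    for i, field in enumerate(fields):
        if not 0 <= field <= 0b111:
            raise ValueError("each selector field is 3 bits")
        selector |= field << (FIELD_SIZE * i)
    return selector


def swizzle(rs1, selectors, vl, destsubvl, srcsubvl):
    """Return the destination elements as a flat list."""
    rd = []
    for vindex in range(vl):
        selector = selectors[vindex]
        if selector >> (FIELD_SIZE * destsubvl):
            raise ValueError("selector has bits set beyond destsubvl fields")
        for i in range(destsubvl):
            field = (selector >> (FIELD_SIZE * i)) & 0b111
            index = field & 0b11
            if index >= srcsubvl:
                raise ValueError("selector index out of range")
            if field & 0b100:
                rd.append(CONSTANTS[index])                 # constant table
            else:
                rd.append(rs1[vindex * srcsubvl + index])   # source element
    return rd
```
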
Swizzle2:

::

    fn swizzle2<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        rs3: &[Elm],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize)
    where
        // Elm is a copyable type
        Elm: Copy,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            // the annotation is needed for the .into() call to type-check
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 selects which of the two source vectors to read
                let src = if (sel_field & 0b100) != 0 {
                    rs1
                } else {
                    rs3
                };
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[vindex * srcsubvl + (sel_field as usize)];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }
