[[!tag standards]]

# NOTE

This section is under revision (and is optional)

# REMAP CSR

There is one 32-bit CSR which may be used to indicate which registers,
if used in any operation, must be "reshaped" (re-mapped) from a linear
form to a 2D or 3D transposed form, or "offset" to permit arbitrary
access to elements within a register.

Its primary use is for Matrix Multiplication and for reordering of
sequential data in-place. Three SHAPE CSRs (see the following section)
are provided so that a single FMAC may be used in a single loop to
perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.

The 32-bit REMAP CSR may reshape up to 3 registers:

| 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
| shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |

regidx0-2 refer not to the Register CSR CAM entry but to the underlying
*real* register (see regidx, the value) and consequently are 7 bits wide.
Reshaping x0 is clearly pointless, so a regidx of zero (which would
refer to x0) is used to indicate "disabled".

shape0-2 each refer to one of the three SHAPE CSRs. A value of 0x3 is
reserved. Bits 7, 15, 23, 30 and 31 are also reserved, and must be set
to zero.

It is anticipated that these specialist CSRs will not be used very often.
Unlike the CSR Register and Predication tables, the REMAP CSR uses the
full 7-bit regidx so that it can be set once and left alone, whilst the
CSR Register entries pointing to the remapped registers are disabled
instead.

# SHAPE 1D/2D/3D vector-matrix remapping CSRs

There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits
wide, all of which have the same format.
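As a brief aside before the SHAPE format itself: the REMAP CSR layout tabulated earlier can be unpacked with a few shifts and masks. The following is a minimal Python sketch, not a normative definition. The name `decode_remap`, the dictionary keys and the choice to raise `ValueError` on reserved encodings are all hypothetical. It also assumes that shape values 0 to 2 select SHAPE0 to SHAPE2 respectively, which the text implies but does not state outright.

```python
# Hypothetical helper (not part of the specification): unpack a 32-bit
# REMAP CSR value into its three (regidx, shape) entries.

# Bits 7, 15, 23, 30 and 31 are reserved and must be zero.
RESERVED_MASK = (1 << 7) | (1 << 15) | (1 << 23) | (0b11 << 30)


def decode_remap(csr):
    """Return a list of three dicts, one per regidxN/shapeN pair."""
    if csr < 0 or csr > 0xFFFFFFFF:
        raise ValueError("REMAP CSR is 32 bits wide")
    if csr & RESERVED_MASK:
        raise ValueError("reserved bits 7, 15, 23, 30, 31 must be zero")

    entries = []
    for n in range(3):
        # regidxN occupies 7 bits at bit offsets 0, 8 and 16.
        regidx = (csr >> (8 * n)) & 0x7F
        # shapeN occupies 2 bits at bit offsets 24, 26 and 28.
        shape = (csr >> (24 + 2 * n)) & 0x3
        if shape == 0x3:
            # Assumption: treat the reserved encoding as an error.
            raise ValueError("shape value 0x3 is reserved")
        entries.append({
            "regidx": regidx,        # underlying real register number
            "shape": shape,          # assumed to select SHAPE0..SHAPE2
            "enabled": regidx != 0,  # regidx of zero (x0) means disabled
        })
    return entries


# Example: remap x5 using SHAPE1 and x10 using SHAPE2; third entry disabled.
example = (2 << 26) | (1 << 24) | (10 << 8) | 5
decoded = decode_remap(example)
```

In this example `decoded[0]` describes register 5 with shape selector 1, `decoded[1]` describes register 10 with shape selector 2, and `decoded[2]` has `enabled` set to `False` because its regidx is zero.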
[[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]

The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:

    xdim = 3
    ydim = 4
    zdim = 1

    lims = [xdim, ydim, zdim]
    idxs = [0,0,0]   # starting indices
    order = [0,1,2]  # experiment with different permutations, here
    offset = 2       # experiment with different offset, here
    VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
    applydim = 0
    invxyz = [0,0,0]

    # run for offset iterations before actually starting
    for idx in range(offset):
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            idxs[order[i]] = 0

    break_count = 0  # only used to print one row per innermost dimension

    for idx in range(VL):
        ix = [0] * 3
        for i in range(3):
            if i >= applydim:
                ix[i] = idxs[i]
            if invxyz[i]:
                ix[i] = lims[i] - 1 - ix[i]
        # linearise the (possibly inverted) 3D index into an element index
        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
        print(new_idx, end=" ")
        break_count += 1
        if break_count == lims[order[0]]:
            print()
            break_count = 0
        # advance the indices in the chosen order, carrying on wrap-around
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            idxs[order[i]] = 0

It is assumed that this algorithm is run within all pseudo-code
throughout this document wherever a (parallelism) for-loop would normally
run from 0 to VL-1 to refer to contiguous register elements. Where REMAP
indicates to do so, the element index is instead run through the above
algorithm to work out the **actual** element index.

Given that there are three possible SHAPE entries, up to three separate
registers in any given operation may be simultaneously remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<