[[!tag standards]] # NOTE This section is under revision (and is optional) # REMAP REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped) from a linear form to a 2D or 3D transposed form, or "offset" to permit arbitrary access to elements, independently on each Vector src or dest register. Their primary use is for Matrix Multiplication, reordering of sequential data in-place. Four CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs. Additional uses include regular "Structure Packing" such as RGB pixel data extraction and reforming. # SHAPE 1D/2D/3D vector-matrix remapping CSRs There are four "shape" CSRs, SHAPE0-3, 32-bits in each, which have the same format. [[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]] The algorithm below shows how REMAP works more clearly, and may be executed as a python program: xdim = 3 ydim = 4 zdim = 1 lims = [xdim, ydim, zdim] idxs = [0,0,0] # starting indices order = [0,1,2] # experiment with different permutations, here offset = 2 # experiment with different offset, here VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling" applydim = 0 invxyz = [0,0,0] # run for offset iterations before actually starting for idx in range(offset): for i in range(3): idxs[order[i]] = idxs[order[i]] + 1 if (idxs[order[i]] != lims[order[i]]): break idxs[order[i]] = 0 break_count = 0 for idx in range(VL): ix = [0] * 3 for i in range(3): if i >= applydim: ix[i] = idxs[i] if invxyz[i]: ix[i] = lims[i] - 1 - ix[i] new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim print new_idx, break_count += 1 if break_count == lims[order[0]]: print break_count = 0 for i in range(3): idxs[order[i]] = idxs[order[i]] + 1 if (idxs[order[i]] != lims[order[i]]): break idxs[order[i]] = 0 Here, it is assumed that this algorithm be run within all pseudo-code throughout this document where a (parallelism) for-loop would normally run from 0 to VL-1 to refer to contiguous register elements; instead, where REMAP indicates to do so, the element index is run through the above algorithm to work out the **actual** element index, instead. Given that there are four possible SHAPE entries, up to four separate registers in any given operation may be simultaneously remapped: function op_add(rd, rs1, rs2) # add not VADD! ... ...  for (i = 0; i < VL; i++) xSTATE.srcoffs = i # save context if (predval & 1<