# Comments

## Enabling/disabling individual 8 and 16-bit operations in SIMD blocks

* At the end of a loop, how are (up to) the last three operations of a 4-wide 8-bit group to be disabled (to avoid "SIMD considered harmful")?
* Likewise at the beginning of a loop, how are (up to) the first three operations to be disabled?
* Likewise the last (and first) operation of a 2-wide 16-bit group?
* What about predication within a 4-wide 8-bit group?
* Likewise, what about predication within a 2-wide 16-bit group?

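To make the questions above concrete, here is a minimal sketch of what per-lane enabling could look like for a 4-wide 8-bit group packed into a 32-bit word. It is an illustration only, not a proposal from this page: the names `lane_mask`, `masked_add8` and `loop_masks` are hypothetical, and the "disabled lanes keep the old destination value" behaviour is an assumption.

```python
def lane_mask(start, stop, lanes=4):
    """Enable bits for lanes in [start, stop) of one SIMD group (bit i = lane i)."""
    return sum(1 << i for i in range(lanes) if start <= i < stop)

def masked_add8(dest, a, b, mask):
    """4-wide 8-bit add on 32-bit words; lanes whose mask bit is 0 keep dest."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        if mask & (1 << lane):
            # lane enabled: wrap-around 8-bit add, no carry into the next lane
            byte = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        else:
            # lane disabled: ASSUMPTION, the old destination byte is preserved
            byte = (dest >> shift) & 0xFF
        result |= byte << shift
    return result

def loop_masks(first, count, lanes=4):
    """Yield (group_index, mask) pairs covering elements [first, first + count)."""
    end = first + count
    group = first // lanes
    while group * lanes < end:
        base = group * lanes
        yield group, lane_mask(first - base, end - base, lanes)
        group += 1
```

For a loop over elements 3 to 8 inclusive, `loop_masks(3, 6)` produces a partial mask for the first group, a full mask for the middle group and a partial mask for the last group, which covers both the "beginning of loop" and "end of loop" cases. The same mask could also carry arbitrary predicate bits within a group.
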
## Providing "cross-over" between elements in a group

What do you think of the "CSR cross[32][6]" idea? Sorry, the below may
not be exactly clear: it is basically a way to generalise all
cross-operations. Even SUNPKD810 rt, ra and ZUNPKD810 rt, ra would
reduce down to one instruction, as opposed to the 8 needed right now.

    def butterfly_remap(remap_me):
        # TODO: the details are still hazy here.
        # Do a crossover using the 6 bits from the CSR cross map:
        # * the first 2 bits swap elements in index positions 0,1 and 2,3
        # * the second 2 bits swap elements in positions 0,2 and 1,3
        # * the third 2 bits swap 0,1 and 2,3 a second time.
        # This gives the full set of all permutations.
        return something, something  # placeholder: (remap_src1, remap_src2)

    def crossover(elidx, destreg):
        base = elidx & ~0x7  # start of the group containing elidx (not yet used)
        return butterfly_remap(CSR_cross[destreg][elidx & 0x7])

    def op(v1, v2, v3):
        for i in range(vlen):
            remap_src1, remap_src2 = crossover(i, v1)
            # open question: do remap_srcN reference byte offsets?
            GPR[v1][i] = scalar_op(GPR[v2][remap_src1],
                                   GPR[v3][remap_src2])

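The three-stage swap described in the comments can be tried out on its own. The following is a minimal sketch that applies it to a plain 4-element list rather than to a CSR or register file. The name `butterfly_permute` and the assignment of control bits to swaps (least-significant bit first) are assumptions made for illustration; the notes above do not fix a bit order.

```python
# One (i, j) swap per control bit, least-significant bit first.
# ASSUMPTION: this bit order is illustrative only.
SWAPS = [(0, 1), (2, 3),   # bits 0-1: first stage
         (0, 2), (1, 3),   # bits 2-3: second stage
         (0, 1), (2, 3)]   # bits 4-5: third stage

def butterfly_permute(elements, cross_bits):
    """Apply the 6-bit, three-stage swap network to a 4-element group."""
    out = list(elements)
    for bit, (i, j) in enumerate(SWAPS):
        if (cross_bits >> bit) & 1:
            out[i], out[j] = out[j], out[i]
    return out

def reachable_permutations():
    """Every distinct ordering of [0, 1, 2, 3] produced by some 6-bit setting."""
    return {tuple(butterfly_permute([0, 1, 2, 3], bits)) for bits in range(64)}
```

Enumerating all 64 settings with `reachable_permutations()` gives 24 distinct orderings, i.e. every permutation of four elements, so several bit settings necessarily map to the same permutation. This arrangement of swaps is the same shape as a 4-input Beneš network.
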
Otherwise, VSHUFFLE and so on (and possibly xBitManip) would
need to be used. xBitManip would not be a bad idea, except that
consideration needs to be given to VLIW-like DSP architectures
(TI C67*), which do not do register-renaming and have fixed
pipeline phases with no stalling on register dependencies.