# Comments ## enabling/disabling individual 8 and 16-bit operations in SIMD blocks * At the end of a loop, how are the three end operations of 4-wide 8-bit operations to be disabled (to avoid "SIMD considered harmful"?) * Likewise at the beginning of a loop, how are (up to) the first three operations to be disabled? * Likewise the last (and first) of 2-wide 16-bit operations? * What about predication within a 4-wide 8-bit group? * Likewise what about predication within a 2-wide 16-bit group? ## Providing "cross-over" between elements in a group what do you think of the "CSR cross[32][6]" idea? sorry below may not be exactly clear, it's basically a way to generalise all cross-operations, even the SUNPKD810 rt, ra and ZUNPKD810 rt, ra would reduce down to one instruction as opposed to 8 right now. def butterfly_remap(remap_me): # hmmm a little hazy on the details here.... # help, help! logic-dyslexia kicking in! # erm do some crossover using the 6 bits from # the CSR cross map. first 2 bits swap # elements in index positions 0,1 and 2,3 # second 2 bits swap elements in positions 0,2 and 1,3 # then swap 0,1 and 2,3 a second time. # gives full set of all permutations. return something, something def crossover(elidx, destreg): base = elidx & ~0x7 return butterfly_remap(CSR_cross[destreg][elidx & 0x7]) def op(v1, v2, v3): for l in vlen: remap_src1, remap_src2 = crossover(i, v1) # remap_srcN references byte offsets? erm.... :) GPR[v1] = scalar_op(GPR[v2][remap_src1], GPR[v3][remap_src2]) Otherwise, VSHUFFLE and so on (and possibly xBitManip) would need to be used. xBitManip would not be a bad idea, except consideration of VLIW-like DSP (TI C67*) architectures needs to be given, which do not do register-renaming and have fixed pipeline phases with no stalling on register-dependencies.