3d_gpu/architecture/dynamic_simd/logicops.mdwn

   1 # Partitioned Logical boolean operations
   2
   3 Links
   4
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=176>
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=549>
   7
   8 These are **not** the same as bitwise logical operations equivalent to:
   9
  10      for i in range(64):
  11          result[i] = a[i] xor b[i] # 2 operands
  12
  13 they are instead SIMD versions of:
  14
  15      result = 0 # initial value (single bit)
  16      for i in range(64):
  17          result = result xor a[i] # one operand
  18
  19 more specifically (8x8 version):
  20
  21      result = 0 # i bit, clear all bits
  22      for j in range(8)
  23          partial = 0 # initial value (single bit)
  24          for i in range(8):
  25              partial = partial xor a[i+j*8]
  26          result[j] = partial
  27
  28 # Requirements
  29
  30 Given a signal width (typically 64) and given an array of "Partition Points" (typically 7) that break the signal down into an arbitrary permutaion of 8 bit to 64 bit independent SIMD results, compute the following:
  31
  32 * xor of all bits in each partitioned group, regardless of the partitioning
  33 * "are some bits set" in each partitioned group
  34 * "are all bits set" in each partitioned group
  35
  36 note that "are some bits set" is equivalent to "is a != 0" whilst "are all bits set" is equivalent to "is a == all 1s" or "is (~a) == 0"
  37
  38 # xor operator as an example
  39
  40 instead of the above single 64 bit xor result, dynamic partitioned SIMD must return a batch of results.   if the subdivision is 2x32 it is:
  41
  42      result[0] = 0 # initial value for low word
  43      result[1] = 0 # initial value for hi word
  44      for i in range(32):
  45          result[0] = result[0] xor a[i]
  46          result[1] = result[1] xor a[i+32]
  47
  48 and likewise by the time 8x8 is reached:
  49
  50      for j in range(8):
  51          result[j] = 0 # initial value for each byte
  52          for i in range(8):
  53              result[j] = result[j] xor a[i+j*8]
  54
  55 now the question becomes: what to do when the Signal is dynamically partitionable? how do we merge all of the combinations, 1x64 2x32 4x16 8x8 into the same statically-allocated hardware?
  56
  57 the first thing is to define some conventions, that the answer (result) will always be 8 bit (not 1 bit) and that, rather than just one bit being set if some are set, all 8 bits are clear or all 8 bits are set.
  58
  59      if result[0]:             # bit 0 true?
  60          result[1:7] = 1111111 # set the remaining 7
  61
  62 likewise, when configured as 2x32 the result is subdivided into two 4 bit halves: the first half is all zero if all the first 32 bits are zero, and all ones if any one bit in the first 32 bits are set.
  63
  64      result[0] = 0 # initial value for low word
  65      result[4] = 0 # initial value for hi word
  66      for i in range(32):
  67          result[0] = result[0] xor a[i]
  68          result[4] = result[4] xor a[i+32]
  69      if result[0]:
  70          result[1:3] = 111
  71      if result[4]:
  72          result[5:7] = 111
  73
  74 Thus we have a convention where the result is *also a partitioned signal*, and can be reconfigured to return 1x xor yes/no, 2x xor yes/no, 4x xor yes/no or up to 8 independent yes/no boolean values.
  75
  76 The second observation then is that, actually, just like the other partitioned operations, it may be possible to "construct" the longer results from the 8x8 ones, based on whether the partition gates are open or closed.  This is further explored below.
  77
  78 # Boolean truth table for Partitioned XOR
  79
  80 Exactly the same as for eq, instead the "xor" operator for example is the amalgamation of 4 partial results, x0 to x3.
  81
  82     partition:     P    P    P     (3 partition bits)
  83     a        : .... .... .... .... (32 input bits)
  84     xor-a    : x0  P x1 P x2 P x3  (4+3 bits, P=1 "breaks")
  85
  86 the partial results x0-x3 are computed as follows:
  87
  88     x0 = input[0:7].xor()
  89     x1 = input[8:15].xor()
  90     x2 = input[16:23].xor()
  91     x3 = input[24:31].xor()
  92
  93 here is the table showing how to combine `x0-3` based on partitions `p0-2`, to produce results `o0-3`
  94
  95 [[!table  data="""
  96 p2p1p0 | o0       | o1       | o2       | o3       |
  97 ------ | -------- | -------- | -------- | -------- |
  98 0 0 0  | ^(x0-3)  | ^(x0-3)  | ^(x0-3)  | ^(x0-3)  |
  99 0 0 1  | x0       | ^(x1-3)  | ^(x1-3)  | ^(x1-3)  |
 100 0 1 0  | ^(x0-1)  | ^(x0-1)  | ^(x2-3)  | ^(x2-3)  |
 101 0 1 1  | x0       | x1       | ^(x2-3)  | ^(x2-3)  |
 102 1 0 0  | ^(x0-2)  | ^(x0-2)  | ^(x0-2)  | x3       |
 103 1 0 1  | x0       | ^(x1-2)  | ^(x1-2)  | x3       |
 104 1 1 0  | ^(x0-1)  | ^(x0-1)  | x2       | x3       |
 105 1 1 1  | x0       | x1       | x2       | x3       |
 106 """]]
 107
 108 Example:
 109
 110 when p2p1p0 == 101 this indicates
 111
 112 * that the output is to contain:
 113   -  an XOR of the top 8 bits
 114   - the middle 16 bits
 115   - and the low 8 bits.
 116 * this in a 4 bit result (`o3o2o1o0`) divided into `o3`, `o2o1` and `o0`
 117   - the top bit `o3` of the 4-bit answer contains x3
 118   - the middle 2 bits `o2o1`  contain the XOR of x1 and x2
 119   - the first bit `o0` contains x0.
 120 * therefore, the final result:
 121   - the top bit contains the XOR of the input bits 24 to 31
 122   - the middle 2 bits contains the XOR of bits 8 to 15
 123   - the lowest bit contains the XOR of bits 0 to 7.
 124
 125 A different partition creates a completely different SIMD subdivision.
 126 This is *entirely dynamic*.