# PartitionedSignal nmigen-aware eq (assign)

* <https://bugs.libre-soc.org/show_bug.cgi?id=709>

When a PartitionedSignal is copied (assigned) to a PartitionedSignal
of equal width, there is no issue. However, if the source is wider
than the target, *partition-aware* truncation must occur; if it is
narrower, sign or zero extension must occur. Finally, a scalar Signal
or Const source must be duplicated across all partitions, again
following the sign- or zero-extension rules.

Take two 32-bit PartitionedSignals (source a, destination b):

    partition:      p    p    p     (3 bits)
    a        : AAA3 AAA2 AAA1 AAA0  (32 bits)
    b        : BBB3 BBB2 BBB1 BBB0  (32 bits)

For all partition settings this copies verbatim. Also,
when A is longer than B, a truncated version of A is always
copied verbatim, regardless of partition settings.
However, if A is shorter than B:

    partition:      p    p    p     (3 bits)
    a        : A7A6 A5A4 A3A2 A1A0  (8 bits)
    b        : BBB3 BBB2 BBB1 BBB0  (16 bits)

then the partition settings matter:

| partition | o3 | o2 | o1 | o0 |
| --------- | -- | -- | -- | -- |
| 000 | [A7A7A7A7] | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 |
| 001 | [A7A7A7A7] | [A7A7]A7A6 | A5A4A3A2 | [A1A1]A1A0 |
| 010 | [A7A7A7A7] | A7A6A5A4 | [A3A3A3A3] | A3A2A1A0 |
| 011 | [A7A7A7A7] | A7A6A5A4 | [A3A3]A3A2 | [A1A1]A1A0 |
| 100 | [A7A7]A7A6 | [A5A5A5A5] | [A5A5]A5A4 | A3A2A1A0 |
| 101 | [A7A7]A7A6 | [A5A5A5A5] | A5A4A3A2 | [A1A1]A1A0 |
| 110 | [A7A7]A7A6 | [A5A5]A5A4 | [A3A3A3A3] | A3A2A1A0 |
| 111 | [A7A7]A7A6 | [A5A5]A5A4 | [A3A3]A3A2 | [A1A1]A1A0 |

The bits in square brackets are zero if A is unsigned, and are the
indicated copies of the sign bit if A is signed. Each partition of
the smaller value (A) is copied into the corresponding larger
partition of B and then, depending on whether A is signed or
unsigned, sign-extended or zero-extended *on a per-partition basis*.

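The table above can be reproduced with a small bit-level reference model.
This is only a sketch in plain Python, not the nmigen implementation: the
helper names `elements` and `partitioned_assign` are hypothetical, and it
assumes (as the table suggests) that a set partition bit closes the boundary
between two adjacent partitions, with bit 0 lying between o0 and o1. It
covers only the case where A is no wider than B.

```python
def elements(mask, nparts=4):
    """Group partitions into elements (hypothetical helper).

    Assumed convention: mask bit i set means the boundary between
    partition i and partition i+1 is closed, so 0b000 is 1x and
    0b111 is 4x. Returns (first, last_exclusive) partition pairs.
    """
    groups, start = [], 0
    for i in range(nparts - 1):
        if mask & (1 << i):
            groups.append((start, i + 1))
            start = i + 1
    groups.append((start, nparts))
    return groups


def partitioned_assign(a, a_width, b_width, mask, signed, nparts=4):
    """Model of assigning a partitioned A to a wider (or equal) partitioned B."""
    assert a_width <= b_width, "sketch covers the extension case only"
    aw, bw = a_width // nparts, b_width // nparts  # per-partition widths
    out = 0
    for lo, hi in elements(mask, nparts):
        src_bits = (hi - lo) * aw
        dst_bits = (hi - lo) * bw
        # extract this element's slice of A
        val = (a >> (lo * aw)) & ((1 << src_bits) - 1)
        # sign-extend using Python's unbounded ints, if requested
        if signed and (val >> (src_bits - 1)) & 1:
            val |= ~((1 << src_bits) - 1)
        # clip to the destination element width and place it in B
        val &= (1 << dst_bits) - 1
        out |= val << (lo * bw)
    return out


# A = 0xB6 (A7..A0 = 1011 0110), mask 0b100: row "100" of the table
print(hex(partitioned_assign(0xB6, 8, 16, 0b100, signed=True)))   # 0xeff6
print(hex(partitioned_assign(0xB6, 8, 16, 0b100, signed=False)))  # 0x2036
```

With mask `0b100` the low three partitions form one 12-bit element extended
from A5, while o3 is extended separately from A7.
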
# Scalar source

When the source A is scalar and is as wide as or wider than
the destination, it must be copied (broadcast) across multiple
partitions:

    partition:      p    p    p     (3 bits)
    a        : AAAA AAAA AAAA AAAA  (16 bits)
    b        : B7B6 B5B4 B3B2 B1B0  (8 bits)

The partition options are:

| partition | o3 | o2 | o1 | o0 |
| --------- | -- | -- | -- | -- |
| 000 | A7A6 | A5A4 | A3A2 | A1A0 |
| 001 | A5A4 | A3A2 | A1A0 | A1A0 |
| 010 | A3A2 | A1A0 | A3A2 | A1A0 |
| 011 | A3A2 | A1A0 | A1A0 | A1A0 |
| 100 | A1A0 | A5A4 | A3A2 | A1A0 |
| 101 | A1A0 | A3A2 | A1A0 | A1A0 |
| 110 | A1A0 | A1A0 | A3A2 | A1A0 |
| 111 | A1A0 | A1A0 | A1A0 | A1A0 |

When the partitions are all open (1x), only the bits that fit across
the whole of the target are copied. In this example B is 8 bits wide, so
only the low 8 bits of A are copied.

When the partitions are all closed (4x SIMD), each partition of B is
2 bits wide, so only the *first two* bits of A (A1A0) are copied into
*each* of the four 2-bit partitions of B.

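The truncating broadcast described above can be modelled in a few lines of
plain Python. This is a sketch only: `elements` and `scalar_assign_trunc`
are hypothetical names rather than part of PartitionedSignal, and the mask
convention (a set bit closes a boundary, bit 0 between o0 and o1) is an
assumption read off the table.

```python
def elements(mask, nparts=4):
    """Group partitions into elements (hypothetical helper).

    Assumed convention: mask bit i set closes the boundary between
    partition i and partition i+1 (0b000 is 1x, 0b111 is 4x).
    """
    groups, start = [], 0
    for i in range(nparts - 1):
        if mask & (1 << i):
            groups.append((start, i + 1))
            start = i + 1
    groups.append((start, nparts))
    return groups


def scalar_assign_trunc(a, b_width, mask, nparts=4):
    """Broadcast scalar A (at least as wide as B) into every element of B."""
    bw = b_width // nparts  # width of one partition of B
    out = 0
    for lo, hi in elements(mask, nparts):
        dst_bits = (hi - lo) * bw
        # keep only as many low bits of A as fit in this element
        out |= (a & ((1 << dst_bits) - 1)) << (lo * bw)
    return out


a = 0xABCD  # 16-bit scalar source; B is 8 bits wide
print(hex(scalar_assign_trunc(a, 8, 0b000)))  # 0xcd: low 8 bits of A
print(hex(scalar_assign_trunc(a, 8, 0b111)))  # 0x55: A1A0 in all four partitions
print(hex(scalar_assign_trunc(a, 8, 0b001)))  # 0x35: 6-bit element above a 2-bit one
```
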
When A is narrower than B, sign or zero extension is required.
Here we assume A is 8 bits and B is 16 bits.
This is similar to the parallel case, except that A is repeated
(broadcast) across all of B.

| partition | o3 | o2 | o1 | o0 |
| --------- | -- | -- | -- | -- |
| 000 | [A7A7A7A7] | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 |
| 001 | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 | A3A2A1A0 |
| 010 | A7A6A5A4 | A3A2A1A0 | A7A6A5A4 | A3A2A1A0 |
| 011 | A7A6A5A4 | A3A2A1A0 | A3A2A1A0 | A3A2A1A0 |
| 100 | A3A2A1A0 | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 |
| 101 | A3A2A1A0 | A7A6A5A4 | A3A2A1A0 | A3A2A1A0 |
| 110 | A3A2A1A0 | A3A2A1A0 | A7A6A5A4 | A3A2A1A0 |
| 111 | A3A2A1A0 | A3A2A1A0 | A3A2A1A0 | A3A2A1A0 |

Note that when the entire partition set is open (1x 16-bit output),
all of A is copied out and either zero- or sign-extended
into the top half of the output. At the other extreme are the
4x 4-bit output partitions, each of which holds a copy of the
first 4 bits of A (A3A2A1A0).

Unlike the parallel case, A is not itself partitioned, so as much of it
as will fit is copied into each element. In some cases, such as
`1x 4-bit, 1x 12-bit` (partition mask = `0b100`, above), the 8-bit
scalar source needs sign or zero extending.
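
The broadcast-with-extension behaviour can be checked against the table with
a short plain-Python model. As before this is a sketch under assumptions, not
the nmigen code: `elements` and `scalar_assign_ext` are hypothetical names,
and a set mask bit is assumed to close the boundary between adjacent
partitions (bit 0 between o0 and o1).

```python
def elements(mask, nparts=4):
    """Group partitions into elements (hypothetical helper).

    Assumed convention: mask bit i set closes the boundary between
    partition i and partition i+1 (0b000 is 1x, 0b111 is 4x).
    """
    groups, start = [], 0
    for i in range(nparts - 1):
        if mask & (1 << i):
            groups.append((start, i + 1))
            start = i + 1
    groups.append((start, nparts))
    return groups


def scalar_assign_ext(a, a_width, b_width, mask, signed, nparts=4):
    """Broadcast scalar A into each element of B, extending where A is short."""
    bw = b_width // nparts  # width of one partition of B
    a &= (1 << a_width) - 1
    # reinterpret A as a negative Python int so masking sign-extends it
    if signed and (a >> (a_width - 1)) & 1:
        a -= 1 << a_width
    out = 0
    for lo, hi in elements(mask, nparts):
        dst_bits = (hi - lo) * bw
        # masking truncates narrow elements and extends wide ones
        out |= (a & ((1 << dst_bits) - 1)) << (lo * bw)
    return out


# A = 0xB6 (8-bit), B is 16 bits; mask 0b100 is the 1x 4-bit, 1x 12-bit case
print(hex(scalar_assign_ext(0xB6, 8, 16, 0b100, signed=True)))   # 0x6fb6
print(hex(scalar_assign_ext(0xB6, 8, 16, 0b100, signed=False)))  # 0x60b6
print(hex(scalar_assign_ext(0xB6, 8, 16, 0b111, signed=True)))   # 0x6666
```

In the `0b100` case only the 12-bit element needs extending; the 4-bit
element simply receives the low four bits of A.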