# Comparative analysis of Andes Packed ISA proposal vs Harmonised RVP

Harmonised RVP is a proposal to provide SIMD functionality comparable to the Andes Packed SIMD ISA, but in a manner that is forwards compatible ("harmonised") with the RV Vector specification.

An example use case is a string copy operation: using Harmonised RVP, code can use integer register SIMD instructions to copy a string. This code can then also execute (unchanged) on a full RV Vector processor and use the dedicated vector unit to copy the string. Harmonised RVP also provides upwards compatibility between RV32 and RV64 SIMD using this same approach.

## Register file comparison

The default Harmonised RVP GPR register file is divided into a lower bank of Vector[INT8] and an upper bank of Vector[INT16].
In contrast, the Andes Packed SIMD ISA permits any GPR to be used for either INT8 or INT16 vector operations.

| Register | Andes ISA | Harmonised RVP ISA |
| ------------------ | ------------------------- | ------------------- |
| v0 | Hardwired zero | Hardwired zero |
| v1 | 32bit GPR or Vector[4xINT8 or 2xINT16] | Predicate mask |
| | | |
| v2 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
| v3 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
| v4 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
| v5 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
| v6 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
| v7 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
| v8 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| v9 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| v10 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| v11 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| v12 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| v13 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| v14 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| v15 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
| | | |
| v16 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v17 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v18 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v19 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v20 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v21 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v22 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v23 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
| v24 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
| v25 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
| v26 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
| v27 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
| v28 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
| v29 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
| | | |
| v30 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[1xSINT32] |
| v31 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[1xSINT32] |

However, programmers may reconfigure the Harmonised RVP register file if the default configuration is unsuitable.
To keep implementations simple and focused on within-register SIMD only, there is a strict 1:1 mapping between vectors (v0-v31) and integer registers (r0-r31).
Programmers needing forwards compatibility with RV Vector implementations should use VLD and VST to load and store vector registers (even though these are then mapped onto integer registers).
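The default bank layout can be captured as a small lookup table. The following is only an illustrative sketch: the names `DEFAULT_BANKS` and `default_vector_type` are hypothetical, the ranges are transcribed from the register table above, and the lane counts assume RV32 (32-bit registers).

```python
# Default Harmonised RVP bank layout, transcribed from the register table.
# Each entry: (first register, last register, lanes on RV32, element type).
# The names used here are illustrative and not part of the proposal.
DEFAULT_BANKS = [
    (2, 7, 4, "SINT8"),
    (8, 15, 4, "UINT8"),
    (16, 23, 2, "SINT16"),
    (24, 29, 2, "UINT16"),
    (30, 31, 1, "SINT32"),
]


def default_vector_type(reg):
    """Return (lanes, element_type) for v<reg>, or a description for v0/v1."""
    if not 0 <= reg <= 31:
        raise ValueError("register number must be in 0..31")
    if reg == 0:
        return "hardwired zero"
    if reg == 1:
        return "predicate mask"
    for first, last, lanes, elem in DEFAULT_BANKS:
        if first <= reg <= last:
            return (lanes, elem)
    raise AssertionError("unreachable: the table covers v2..v31")
```

For example, `default_vector_type(20)` yields `(2, "SINT16")`, which is why the signed 16-bit Andes instructions map onto operands in v16-v23.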

## Proposed Harmonised RVP vector op instruction encoding

Register x 2 -> register operations:

| 31 30 29 28 27 26 | 25 | 24 23 22 21 20 | 19 18 17 16 15 | 14 | 13 12 | 11 10 9 8 7 | 6 5 4 3 2 1 0 |
| ----------------- | -- | -------------- | -------------- | -- | ----- | ----------- | ------------- |
| func_6 | 0 | rs2 | rs1 | 0 | mm | rd1 | VOP opcode |

Immediate + register -> register operations:

| 31 30 29 | 28 27 26 | 25 | 24 23 22 21 20 | 19 18 17 16 15 | 14 | 13 12 | 11 10 9 8 7 | 6 5 4 3 2 1 0 |
| -------- | -------- | -- | -------------- | -------------- | -- | ----- | ----------- | ------------- |
| func_3 | imm[7:5] | 1 | imm[4:0] | rs1 | 0 | mm | rd1 | VOP opcode |

Register x 3 -> register operations:

| 31 30 29 28 27 | 26 25 | 24 23 22 21 20 | 19 18 17 16 15 | 14 | 13 12 | 11 10 9 8 7 | 6 5 4 3 2 1 0 |
| -------------- | ----- | -------------- | -------------- | -- | ----- | ----------- | ------------- |
| rs3 | func_2 | rs2 | rs1 | 1 | mm | rd1 | VOP opcode |
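As a rough illustration of the "Register x 2 -> register" layout, the fields can be packed into a 32-bit word from the most significant field down. This is a sketch only: `encode_vop_rr` is a hypothetical helper, and because the numeric value of the VOP opcode (and of each func_6 code) is not given here, both are taken as caller-supplied parameters.

```python
def encode_vop_rr(func6, rs2, rs1, mm, rd, opcode):
    """Pack the two-source register format into a 32-bit instruction word.

    Field widths follow the first encoding table: func_6 (6 bits), a fixed 0
    at bit 25, rs2 (5), rs1 (5), a fixed 0 at bit 14, mm (2), rd1 (5) and the
    7-bit VOP opcode. The opcode value itself is an assumption of the caller.
    """
    fields = [
        (func6, 6),   # bits 31:26
        (0, 1),       # bit 25 = 0 selects the register x 2 form
        (rs2, 5),     # bits 24:20
        (rs1, 5),     # bits 19:15
        (0, 1),       # bit 14 = 0 (1 selects the three-source form)
        (mm, 2),      # bits 13:12
        (rd, 5),      # bits 11:7
        (opcode, 7),  # bits 6:0
    ]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word
```

Building the word by shifting in each field keeps the widths next to the values, so a field that overflows its slot is rejected rather than silently corrupting its neighbour.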

Values for mm field (bits 13:12 above):

* mm = 00 -> no predicate mask, and use current global saturation / rounding settings
* mm = 01 -> no predicate mask, and force saturation or rounding for this instruction only
* mm = 10 -> use v1 as predicate mask, and use global saturation / rounding settings
* mm = 11 -> use ~v1 as predicate mask, and use global saturation / rounding settings
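The four mm settings can be decoded with a simple table. This sketch takes the "force saturation or rounding" entry to be mm = 01, which is the value the instruction mapping tables use for the saturating and rounding forms; `decode_mm` and its return shape are hypothetical.

```python
def decode_mm(mm):
    """Return (predicate, force_sat_or_round) for a 2-bit mm field.

    predicate is None (unmasked), "v1", or "~v1" (inverted v1 mask).
    force_sat_or_round is True only when this instruction overrides the
    global saturation / rounding settings.
    """
    table = {
        0b00: (None, False),
        0b01: (None, True),
        0b10: ("v1", False),
        0b11: ("~v1", False),
    }
    if mm not in table:
        raise ValueError("mm is a 2-bit field")
    return table[mm]
```

Note that in this scheme predication and per-instruction saturation cannot be combined: a predicated instruction always follows the global settings.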

## 16-bit Arithmetic

| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| ADD16 rt, ra, rb | Add | VADD (v16 <= rt,ra,rb <= v29), mm=00 |
| RADD16 rt, ra, rb | Signed Halving add | RADD (v16 <= rt,ra,rb <= v23), mm=00 |
| URADD16 rt, ra, rb | Unsigned Halving add | RADD (v24 <= rt,ra,rb <= v29), mm=00 |
| KADD16 rt, ra, rb | Signed Saturating add | VADD (v16 <= rt,ra,rb <= v23), mm=01 |
| UKADD16 rt, ra, rb | Unsigned Saturating add | VADD (v24 <= rt,ra,rb <= v29), mm=01 |
| SUB16 rt, ra, rb | Subtract | VSUB (v16 <= rt,ra,rb <= v29), mm=00 |
| RSUB16 rt, ra, rb | Signed Halving sub | RSUB (v16 <= rt,ra,rb <= v23), mm=00 |
| URSUB16 rt, ra, rb | Unsigned Halving sub | RSUB (v24 <= rt,ra,rb <= v29), mm=00 |
| KSUB16 rt, ra, rb | Signed Saturating sub | VSUB (v16 <= rt,ra,rb <= v23), mm=01 |
| UKSUB16 rt, ra, rb | Unsigned Saturating sub | VSUB (v24 <= rt,ra,rb <= v29), mm=01 |
| CRAS16 rt, ra, rb | Cross Add & Sub | |
| RCRAS16 rt, ra, rb | Signed Halving Cross Add & Sub | |
| URCRAS16 rt, ra, rb | Unsigned Halving Cross Add & Sub | |
| KCRAS16 rt, ra, rb | Signed Saturating Cross Add & Sub | |
| UKCRAS16 rt, ra, rb | Unsigned Saturating Cross Add & Sub | |
| CRSA16 rt, ra, rb | Cross Sub & Add | |
| RCRSA16 rt, ra, rb | Signed Halving Cross Sub & Add | |
| URCRSA16 rt, ra, rb | Unsigned Halving Cross Sub & Add | |
| KCRSA16 rt, ra, rb | Signed Saturating Cross Sub & Add | |
| UKCRSA16 rt, ra, rb | Unsigned Saturating Cross Sub & Add | |
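To make the table concrete, here is a small reference model of the halving and saturating adds on a 32-bit register holding two 16-bit lanes. It is a sketch under assumptions: the helper names are hypothetical, and the `signed` flag stands in for the register bank (v16-v23 signed, v24-v29 unsigned) that selects the interpretation in Harmonised RVP.

```python
def lanes16(word, signed):
    """Split a 32-bit word into two 16-bit lanes, low lane first."""
    out = []
    for i in range(2):
        v = (word >> (16 * i)) & 0xFFFF
        if signed and v & 0x8000:
            v -= 0x10000  # reinterpret as two's complement
        out.append(v)
    return out


def pack16(values):
    """Pack two lane values (low lane first) back into a 32-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFFFF) << (16 * i)
    return word


def halving_add16(a, b, signed):
    """RADD16 / URADD16 style: per-lane (x + y) >> 1 with no overflow."""
    pairs = zip(lanes16(a, signed), lanes16(b, signed))
    return pack16([(x + y) >> 1 for x, y in pairs])


def saturating_add16(a, b, signed):
    """KADD16 / UKADD16 style: per-lane add clamped to the lane's range."""
    lo, hi = (-0x8000, 0x7FFF) if signed else (0, 0xFFFF)
    pairs = zip(lanes16(a, signed), lanes16(b, signed))
    return pack16([min(hi, max(lo, x + y)) for x, y in pairs])
```

The same bit pattern gives different results depending on the bank: `halving_add16(0x00038000, 0x00020000, True)` treats the low lane as -32768, whereas the unsigned form treats it as 32768.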

## 8-bit Arithmetic

| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| ADD8 rt, ra, rb | Add | VADD (v2 <= rt,ra,rb <= v15), mm=00 |
| RADD8 rt, ra, rb | Signed Halving add | RADD (v2 <= rt,ra,rb <= v7), mm=00 |
| URADD8 rt, ra, rb | Unsigned Halving add | RADD (v8 <= rt,ra,rb <= v15), mm=00 |
| KADD8 rt, ra, rb | Signed Saturating add | VADD (v2 <= rt,ra,rb <= v7), mm=01 |
| UKADD8 rt, ra, rb | Unsigned Saturating add | VADD (v8 <= rt,ra,rb <= v15), mm=01 |
| SUB8 rt, ra, rb | Subtract | VSUB (v2 <= rt,ra,rb <= v15), mm=00 |
| RSUB8 rt, ra, rb | Signed Halving sub | RSUB (v2 <= rt,ra,rb <= v7), mm=00 |
| URSUB8 rt, ra, rb | Unsigned Halving sub | RSUB (v8 <= rt,ra,rb <= v15), mm=00 |
| KSUB8 rt, ra, rb | Signed Saturating sub | VSUB (v2 <= rt,ra,rb <= v7), mm=01 |
| UKSUB8 rt, ra, rb | Unsigned Saturating sub | VSUB (v8 <= rt,ra,rb <= v15), mm=01 |

## 16-bit Shifts

SRA[I]16/SRL[I]16/SLL[I]16 are mapped to VOP shift instructions in the same manner as ADD16/SUB16.

The “K” (Saturation) and “u” (Rounding) variants could be encoded using VOP’s mm field (mm=01 is a saturated or rounded shift, mm=00 is a standard VOP shift).

| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| SRA16 rt, ra, rb | Shift right arithmetic | VSRA (v16 <= rt,ra,rb <= v29), mm=00 |
| SRAI16 rt, ra, im | Shift right arithmetic imm | VSRAI (v16 <= rt,ra <= v29), mm=00 |
| SRA16.u rt, ra, rb | Rounding Shift right arithmetic | VSRA (v16 <= rt,ra,rb <= v29), mm=01 |
| SRAI16.u rt, ra, im | Rounding Shift right arithmetic imm | VSRAI (v16 <= rt,ra <= v29), mm=01 |
| SRL16 rt, ra, rb | Shift right logical | VSRL (v16 <= rt,ra,rb <= v29), mm=00 |
| SRLI16 rt, ra, im | Shift right logical imm | VSRLI (v16 <= rt,ra <= v29), mm=00 |
| SRL16.u rt, ra, rb | Rounding Shift right logical | VSRL (v16 <= rt,ra,rb <= v29), mm=01 |
| SRLI16.u rt, ra, im | Rounding Shift right logical imm | VSRLI (v16 <= rt,ra <= v29), mm=01 |
| SLL16 rt, ra, rb | Shift left logical | VSLL (v16 <= rt,ra,rb <= v29), mm=00 |
| SLLI16 rt, ra, im | Shift left logical imm | VSLLI (v16 <= rt,ra <= v29), mm=00 |
| KSLL16 rt, ra, rb | Saturating Shift left logical | VSLL (v16 <= rt,ra,rb <= v29), mm=01 |
| KSLLI16 rt, ra, im | Saturating Shift left logical imm | VSLLI (v16 <= rt,ra <= v29), mm=01 |
| KSLRA16 rt, ra, rb | Saturating Shift left logical or Shift right arithmetic | |
| KSLRA16.u rt, ra, rb | Saturating Shift left logical or Rounding Shift right arithmetic | |
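The difference between the plain (mm=00) and rounding (mm=01) arithmetic right shift can be shown on a single signed 16-bit lane. This sketch assumes the rounding form adds 1 at the most significant discarded bit before the final shift, which is how the Andes ".u" variants are understood to behave; `sra16_lane` is a hypothetical helper, not part of either proposal.

```python
def sra16_lane(x, sa, rounding=False):
    """Arithmetic right shift of one signed 16-bit lane value.

    x is a Python int in -32768..32767 and sa is the shift amount (0..15).
    With rounding=True, 1 is added at the most significant discarded bit
    (an assumption about the ".u" semantics) before the last shift step.
    """
    if not -0x8000 <= x <= 0x7FFF:
        raise ValueError("x must fit in a signed 16-bit lane")
    if not 0 <= sa <= 15:
        raise ValueError("shift amount must be in 0..15")
    if sa == 0:
        return x
    if rounding:
        return ((x >> (sa - 1)) + 1) >> 1
    return x >> sa
```

With `sa = 2`, the value 7 (that is, 1.75 after scaling) truncates to 1 in the plain form and rounds to 2 in the rounding form, while 5 (1.25) gives 1 either way.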

## 8-bit Shifts

The Andes Packed SIMD ISA omits 8-bit shifts, but these can be encoded in Harmonised RVP as follows:

| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| n/a | Shift right arithmetic | VSRA (v2 <= rt,ra,rb <= v15), mm=00 |
| n/a | Shift right arithmetic imm | VSRAI (v2 <= rt,ra <= v15), mm=00 |
| n/a | Rounding Shift right arithmetic | VSRA (v2 <= rt,ra,rb <= v15), mm=01 |
| n/a | Rounding Shift right arithmetic imm | VSRAI (v2 <= rt,ra <= v15), mm=01 |
| n/a | Shift right logical | VSRL (v2 <= rt,ra,rb <= v15), mm=00 |
| n/a | Shift right logical imm | VSRLI (v2 <= rt,ra <= v15), mm=00 |
| n/a | Rounding Shift right logical | VSRL (v2 <= rt,ra,rb <= v15), mm=01 |
| n/a | Rounding Shift right logical imm | VSRLI (v2 <= rt,ra <= v15), mm=01 |
| n/a | Shift left logical | VSLL (v2 <= rt,ra,rb <= v15), mm=00 |
| n/a | Shift left logical imm | VSLLI (v2 <= rt,ra <= v15), mm=00 |
| n/a | Saturating Shift left logical | VSLL (v2 <= rt,ra,rb <= v15), mm=01 |
| n/a | Saturating Shift left logical imm | VSLLI (v2 <= rt,ra <= v15), mm=01 |

## 16-bit Comparison instructions

| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| CMPEQ16 rt, ra, rb | Compare equal | VSEQ (v16 <= rt,ra,rb <= v29), mm=00 |
| SCMPLT16 rt, ra, rb | Signed Compare less than | !VSGT (v16 <= rt,ra,rb <= v23), mm=00 |
| SCMPLE16 rt, ra, rb | Signed Compare less or equal | VSLE (v16 <= rt,ra,rb <= v23), mm=00 |
| UCMPLT16 rt, ra, rb | Unsigned Compare less than | !VSGT (v24 <= rt,ra,rb <= v29), mm=00 |
| UCMPLE16 rt, ra, rb | Unsigned Compare less or equal | VSLE (v24 <= rt,ra,rb <= v29), mm=00 |

## 8-bit Comparison instructions

| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| CMPEQ8 rt, ra, rb | Compare equal | VSEQ (v2 <= rt,ra,rb <= v15), mm=00 |
| SCMPLT8 rt, ra, rb | Signed Compare less than | !VSGT (v2 <= rt,ra,rb <= v7), mm=00 |
| SCMPLE8 rt, ra, rb | Signed Compare less or equal | VSLE (v2 <= rt,ra,rb <= v7), mm=00 |
| UCMPLT8 rt, ra, rb | Unsigned Compare less than | !VSGT (v8 <= rt,ra,rb <= v15), mm=00 |
| UCMPLE8 rt, ra, rb | Unsigned Compare less or equal | VSLE (v8 <= rt,ra,rb <= v15), mm=00 |
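The signed and unsigned compares differ only in how each lane is interpreted, which is what the register bank selects in Harmonised RVP. Below is a minimal model of an 8-bit "less than" compare, assuming the Andes convention that a true lane is written as all ones (0xFF) and a false lane as zero; `lanes8` and `cmp_lt8` are hypothetical names.

```python
def lanes8(word, signed):
    """Split a 32-bit word into four 8-bit lanes, lowest lane first."""
    out = []
    for i in range(4):
        v = (word >> (8 * i)) & 0xFF
        if signed and v & 0x80:
            v -= 0x100  # reinterpret as two's complement
        out.append(v)
    return out


def cmp_lt8(a, b, signed):
    """Per-lane a < b; true lanes become 0xFF, false lanes 0x00 (assumed)."""
    result = 0
    pairs = zip(lanes8(a, signed), lanes8(b, signed))
    for i, (x, y) in enumerate(pairs):
        if x < y:
            result |= 0xFF << (8 * i)
    return result
```

For the inputs `0x80017F00` and `0x01010001`, the top lane (0x80 against 0x01) compares true when signed (-128 < 1) and false when unsigned (128 < 1), so the two interpretations produce different masks.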

## 16-bit Miscellaneous instructions

| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------ | ------------------- |
| SMIN16 rt, ra, rb | Signed minimum | VMIN (v16 <= rt,ra,rb <= v23), mm=00 |
| UMIN16 rt, ra, rb | Unsigned minimum | VMIN (v24 <= rt,ra,rb <= v29), mm=00 |
| SMAX16 rt, ra, rb | Signed maximum | VMAX (v16 <= rt,ra,rb <= v23), mm=00 |
| UMAX16 rt, ra, rb | Unsigned maximum | VMAX (v24 <= rt,ra,rb <= v29), mm=00 |
| SCLIP16 rt, ra, im | Signed clip | ?VCLIP (v16 <= rt,ra,rb <= v23), mm=01 |
| UCLIP16 rt, ra, im | Unsigned clip | ?VCLIP (v24 <= rt,ra,rb <= v29), mm=01 |
| KMUL16 rt, ra, rb | Signed multiply 16x16->16 | VMUL (v16 <= rt,ra,rb <= v23), mm=01 |
| KMULX16 rt, ra, rb | Signed crossed multiply 16x16->16 | |
| SMUL16 rt, ra, rb | Signed multiply 16x16->32 | VMUL (v30 <= rt <= v31, v16 <= ra,rb <= v23), mm=00 |
| SMULX16 rt, ra, rb | Signed crossed multiply 16x16->32 | |
| UMUL16 rt, ra, rb | Unsigned multiply 16x16->32 | VMUL (v30 <= rt <= v31, v24 <= ra,rb <= v29), mm=00 |
| UMULX16 rt, ra, rb | Unsigned crossed multiply 16x16->32 | |
| KABS16 rt, ra | Saturated absolute value | VSGNX (v16 <= rt <= v29, v16 <= ra,rb <= v23), mm=01 |

## 8-bit Miscellaneous instructions

| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| SMIN8 rt, ra, rb | Signed minimum | VMIN (v2 <= rt,ra,rb <= v7), mm=00 |
| UMIN8 rt, ra, rb | Unsigned minimum | VMIN (v8 <= rt,ra,rb <= v15), mm=00 |
| SMAX8 rt, ra, rb | Signed maximum | VMAX (v2 <= rt,ra,rb <= v7), mm=00 |
| UMAX8 rt, ra, rb | Unsigned maximum | VMAX (v8 <= rt,ra,rb <= v15), mm=00 |
| KABS8 rt, ra | Saturated absolute value | VSGNX (v2 <= rt <= v15, v2 <= ra,rb <= v7), mm=01 |

## 8-bit Unpacking instructions

| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
| ------------------ | ------------------------- | ------------------- |
| SUNPKD810 rt, ra | Signed unpack bytes 1 & 0 | VMV (v16 <= rt <= v23, v2 <= ra <= v7), mm=00 |
| SUNPKD820 rt, ra | Signed unpack bytes 2 & 0 | |
| SUNPKD830 rt, ra | Signed unpack bytes 3 & 0 | |
| SUNPKD831 rt, ra | Signed unpack bytes 3 & 1 | |
| ZUNPKD810 rt, ra | Unsigned unpack bytes 1 & 0 | VMV (v24 <= rt <= v29, v8 <= ra <= v15), mm=00 |
| ZUNPKD820 rt, ra | Unsigned unpack bytes 2 & 0 | |
| ZUNPKD830 rt, ra | Unsigned unpack bytes 3 & 0 | |
| ZUNPKD831 rt, ra | Unsigned unpack bytes 3 & 1 | |
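The "unpack bytes 1 & 0" operations widen the two low bytes of the source into the two 16-bit halves of the destination, with sign extension for SUNPKD810 and zero extension for ZUNPKD810. The sketch below models that on a 32-bit register; `unpkd810` is a hypothetical helper, and the `signed` flag again stands in for the choice of source bank that a Harmonised RVP VMV between an 8-bit bank and a 16-bit bank would imply.

```python
def unpkd810(word, signed):
    """Widen bytes 1 and 0 of a 32-bit word into two 16-bit halves.

    Byte 0 goes to the low half and byte 1 to the high half. With
    signed=True each byte is sign-extended, otherwise zero-extended.
    """
    halves = []
    for i in (0, 1):
        b = (word >> (8 * i)) & 0xFF
        if signed and b & 0x80:
            b -= 0x100  # sign-extend the byte
        halves.append(b & 0xFFFF)
    return halves[0] | (halves[1] << 16)
```

For the source `0x1234807F`, the signed form gives `0xFF80007F` and the unsigned form gives `0x0080007F`; the upper two source bytes are ignored in both cases.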