1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 25 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes
11
12 Fail-on-first data dependency has different behaviour for traps than
13 for conditional testing. "Conditional" is taken to mean "anything
14 that is zero", however with traps, the first element has to
15 be given the opportunity to throw the exact same trap that would
16 be thrown if this were a scalar operation (when VL=1).
17
18 ## Fail-on-first traps
19
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and all
following elements are instead ignored (or cancelled in out-of-order
designs), and VL is set to point at the *last* element that did not
take the trap.
26
Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur). However, note that
the loop still has to test the predicate bit: thus, on return,
VL is set to include both the elements that did not take the trap *and*
the elements that were predicated (masked) out (which were never tested),
up to the point where the trap occurred.
33
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
is permitted to take a trap as normal (as if ffirst were not set); in
subsequent *sub-groups*, however, the trap must not be allowed to occur
(processing stops instead). SUBVL will **NOT** be modified.
38
39 Given that predication bits apply to SUBVL groups, the same rules apply
40 to predicated-out (masked-out) sub-groups in calculating the value that VL
41 is set to.
42
43 ## Fail-on-first conditional tests
44
45 ffirst stops sequential element conditional testing on the first element result
46 being zero. VL is set to the number of elements that were processed before
47 the fail-condition was encountered.
48
Note that just as with traps, if SUBVL!=1, the first failing element of any
*sub-group* will cause processing to end, and, even if there were elements
within that *sub-group* that passed the test, the sub-group is still (entirely)
excluded from the count (from setting VL). i.e. VL is set to the total
number of *sub-groups* that had no fail-condition up until execution was
stopped.
55
56 Note again that, just as with traps, predicated-out (masked-out) elements
57 are included in the count leading up to the fail-condition, even though they
58 were not tested.
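
By way of illustration, here is a minimal python sketch (not part of the
specification; names are invented) of how VL would be truncated by
fail-on-first conditional testing when SUBVL groups are involved:

    # ffirst conditional-test sketch: VL is truncated to the number of
    # whole sub-groups processed before the first zero ("fail") result.
    def ffirst_truncate_vl(results, pred, VL, SUBVL):
        new_vl = 0
        for i in range(VL):
            if (pred >> i) & 1:              # masked-out groups are counted
                group = results[i*SUBVL:(i+1)*SUBVL]   # but never tested
                if any(r == 0 for r in group):
                    return new_vl            # fail: this group is excluded
            new_vl = i + 1
        return new_vl

    # VL=4, SUBVL=2: the second sub-group contains a zero, so VL becomes 1
    print(ffirst_truncate_vl([5, 3, 7, 0, 1, 1, 2, 2], 0b1111, 4, 2))  # -> 1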
59
60 # Instructions <a name="instructions" />
61
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
*All* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all explicit vector opcodes,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*, with the exception of CLIP and VSELECT.X.
Note that if RV64G ever gained a MV.X as well as an FCLIP, the full
functionality of RVV-Base would be obtained in SV.
72
73 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
74 equivalents, so are left out of Simple-V. VSELECT could be included if
75 there existed a MV.X instruction in RV (MV.X is a hypothetical
76 non-immediate variant of MV that would allow another register to
77 specify which register was to be copied). Note that if any of these three
78 instructions are added to any given RV extension, their functionality
79 will be inherently parallelised.
80
81 With some exceptions, where it does not make sense or is simply too
82 challenging, all RV-Base instructions are parallelised:
83
84 * CSR instructions, whilst a case could be made for fast-polling of
85 a CSR into multiple registers, or for being able to copy multiple
86 contiguously addressed CSRs into contiguous registers, and so on,
87 are the fundamental core basis of SV. If parallelised, extreme
88 care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
90 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
91 left as scalar.
92 * LR/SC could hypothetically be parallelised however their purpose is
93 single (complex) atomic memory operations where the LR must be followed
94 up by a matching SC. A sequence of parallel LR instructions followed
95 by a sequence of parallel SC instructions therefore is guaranteed to
96 not be useful. Not least: the guarantees of a Multi-LR/SC
97 would be impossible to provide if emulated in a trap.
98 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
99 paralleliseable anyway.
100
101 All other operations using registers are automatically parallelised.
102 This includes AMOMAX, AMOSWAP and so on, where particular care and
103 attention must be paid.
104
105 Example pseudo-code for an integer ADD operation (including scalar
106 operations). Floating-point uses the FP Register Table.
107
108 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
109
110 Note that for simplicity there is quite a lot missing from the above
111 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
112 reshaping and offsets and so on. However it demonstrates the basic
113 principle. Augmentations that produce the full pseudo-code are covered in
114 other sections.
115
116 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
117
118 Adding in support for SUBVL is a matter of adding in an extra inner
119 for-loop, where register src and dest are still incremented inside the
120 inner part. Note that the predication is still taken from the VL index.
121
122 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
123 indexed by "(i)"
124
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
148
149
150 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
151 elwidth handling etc. all left out.
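
For clarity, the same indexing rule (elements at "i*SUBVL + s", predicate
bit at "i") may be modelled in ordinary python. This is purely an
illustrative sketch, with a flat register file and invented helper names,
not a normative definition:

    # illustrative model of the SUBVL loop: element index is i*SUBVL+s,
    # predicate bit index is i (one bit per *group*, not per element).
    def sv_add_subvl(regs, rd, rs1, rs2, VL, SUBVL, predval,
                     rd_vec=True, rs1_vec=True, rs2_vec=True):
        for i in range(VL):
            if not ((predval >> i) & 1):
                continue                      # whole sub-group masked out
            for s in range(SUBVL):
                el = i * SUBVL + s            # element offset within the group
                d = rd  + el if rd_vec  else rd
                a = rs1 + el if rs1_vec else rs1
                b = rs2 + el if rs2_vec else rs2
                regs[d] = regs[a] + regs[b]
                if not rd_vec:
                    return                    # scalar destination: stop early

    regs = list(range(32)) + [0]*32           # toy 64-entry register file
    sv_add_subvl(regs, rd=40, rs1=2, rs2=10, VL=2, SUBVL=3, predval=0b11)
    print(regs[40:46])   # -> [12, 14, 16, 18, 20, 22]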
152
153 ## Instruction Format
154
155 It is critical to appreciate that there are
156 **no operations added to SV, at all**.
157
158 Instead, by using CSRs to tag registers as an indication of "changed
159 behaviour", SV *overloads* pre-existing branch operations into predicated
160 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
161 LOAD/STORE depending on CSR configurations for bitwidth and predication.
162 **Everything** becomes parallelised. *This includes Compressed
163 instructions* as well as any future instructions and Custom Extensions.
164
Note: using CSR tags to change the behaviour of instructions is nothing new, including
166 in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
167 FRM changes the behaviour of the floating-point unit, to alter the rounding
168 mode. Other architectures change the LOAD/STORE byte-order from big-endian
169 to little-endian on a per-instruction basis. SV is just a little more...
170 comprehensive in its effect on instructions.
171
172 ## Branch Instructions
173
174 Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
176 of multiple comparisons into a register (taken indirectly from the predicate
177 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
178 See ffirst mode in the Predication Table section.
179
180 ### Standard Branch <a name="standard_branch"></a>
181
182 Branch operations use standard RV opcodes that are reinterpreted to
183 be "predicate variants" in the instance where either of the two src
184 registers are marked as vectors (active=1, vector=1).
185
186 Note that the predication register to use (if one is enabled) is taken from
187 the *first* src register, and that this is used, just as with predicated
188 arithmetic operations, to mask whether the comparison operations take
189 place or not. The target (destination) predication register
190 to use (if one is enabled) is taken from the *second* src register.
191
192 If either of src1 or src2 are scalars (whether by there being no
193 CSR register entry or whether by the CSR entry specifically marking
194 the register as "scalar") the comparison goes ahead as vector-scalar
195 or scalar-vector.
196
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
202
203 Note that when zero-predication is enabled (from source rs1),
204 a cleared bit in the predicate indicates that the result
205 of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
207 when zeroing is not set: bits in the destination predicate are
208 only *set*; they are **not** cleared. This is important to appreciate,
209 as there may be an expectation that, going into the hardware-loop,
210 the destination predicate is always expected to be set to zero:
211 this is **not** the case. The destination predicate is only set
212 to zero if **zeroing** is enabled.
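
An informal python sketch of the difference (invented names; "dest" is the
pre-existing value of the destination predicate register):

    # predicated-compare destination-predicate update: with zeroing,
    # masked-out bits are cleared; without zeroing they are left untouched.
    def update_dest_pred(dest, ps, results, VL, zeroing):
        for i in range(VL):
            if not ((ps >> i) & 1):          # element is predicated out
                if zeroing:
                    dest &= ~(1 << i)        # zeroing: explicitly clear
                continue                     # non-zeroing: bit left as-is
            if results[i]:
                dest |= 1 << i               # test passed: set
            else:
                dest &= ~(1 << i)            # active-but-failed: cleared, as
            # in the compare pseudo-code earlier in this section
        return dest

    # dest starts as 0b0010; element 1 is masked out; all active tests pass
    print(bin(update_dest_pred(0b0010, 0b1101, [1, 1, 1, 1], 4, False)))  # 0b1111
    print(bin(update_dest_pred(0b0010, 0b1101, [1, 1, 1, 1], 4, True)))   # 0b1101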
213
214 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
216 src1 and src2.
217
218 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
219 for predicated compare operations of function "cmp":
220
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
225
226 With associated predication, vector-length adjustments and so on,
227 and temporarily ignoring bitwidth (which makes the comparisons more
228 complex), this becomes:
229
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
267
268 Notes:
269
270 * Predicated SIMD comparisons would break src1 and src2 further down
271 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
272 Reordering") setting Vector-Length times (number of SIMD elements) bits
273 in Predicate Register rd, as opposed to just Vector-Length bits.
274 * The execution of "parallelised" instructions **must** be implemented
275 as "re-entrant" (to use a term from software). If an exception (trap)
276 occurs during the middle of a vectorised
277 Branch (now a SV predicated compare) operation, the partial results
278 of any comparisons must be written out to the destination
279 register before the trap is permitted to begin. If however there
280 is no predicate, the **entire** set of comparisons must be **restarted**,
281 with the offset loop indices set back to zero. This is because
282 there is no place to store the temporary result during the handling
283 of traps.
284
285 TODO: predication now taken from src2. also branch goes ahead
286 if all compares are successful.
287
Note also that, where normally predication requires that there also
be a CSR register entry for the register being used in order for
the **predication** CSR register entry to be active, for branches
this is **not** the case: src2 does **not** have to have its CSR
register entry marked as active in order for predication on src2
to be active.
294
295 Also note: SV Branch operations are **not** twin-predicated
296 (see Twin Predication section). This would require three
297 element offsets: one to track src1, one to track src2 and a third
298 to track where to store the accumulation of the results. Given
299 that the element offsets need to be exposed via CSRs so that
300 the parallel hardware looping may be made re-entrant on traps
301 and exceptions, the decision was made not to make SV Branches
302 twin-predicated.
303
304 ### Floating-point Comparisons
305
There are no floating-point branch operations, only compares.
307 Interestingly no change is needed to the instruction format because
308 FP Compare already stores a 1 or a zero in its "rd" integer register
309 target, i.e. it's not actually a Branch at all: it's a compare.
310
311 In RV (scalar) Base, a branch on a floating-point compare is
312 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
313 This does extend to SV, as long as x1 (in the example sequence given)
314 is vectorised. When that is the case, x1..x(1+VL-1) will also be
315 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
316 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
317 so on. Consequently, unlike integer-branch, FP Compare needs no
318 modification in its behaviour.
319
320 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
321 and whilst in ordinary branch code this is fine because the standard
322 RVF compare can always be followed up with an integer BEQ or a BNE (or
323 a compressed comparison to zero or non-zero), in predication terms that
has more of an impact. To deal with this, SV's predication has
325 had "invert" added to it.
326
327 Also: note that FP Compare may be predicated, using the destination
328 integer register (rd) to determine the predicate. FP Compare is **not**
329 a twin-predication operation, as, again, just as with SV Branches,
330 there are three registers involved: FP src1, FP src2 and INT rd.
331
332 Also: note that ffirst (fail first mode) applies directly to this operation.
333
334 ### Compressed Branch Instruction
335
336 Compressed Branch instructions are, just like standard Branch instructions,
337 reinterpreted to be vectorised and predicated based on the source register
338 (rs1s) CSR entries. As however there is only the one source register,
given that c.beqz a1 is equivalent to beq a1, x0, the optional target
to store the results of the comparisons is taken from the CSR predication
table entries for **x0**.
342
343 The specific required use of x0 is, with a little thought, quite obvious,
but at first sight counterintuitive. Clearly it is **not** recommended to redirect
345 x0 with a CSR register entry, however as a means to opaquely obtain
346 a predication target it is the only sensible option that does not involve
347 additional special CSRs (or, worse, additional special opcodes).
348
349 Note also that, just as with standard branches, the 2nd source
350 (in this case x0 rather than src2) does **not** have to have its CSR
351 register table marked as "active" in order for predication to work.
352
353 ## Vectorised Dual-operand instructions
354
355 There is a series of 2-operand instructions involving copying (and
356 sometimes alteration):
357
358 * C.MV
359 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
360 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
361 * LOAD(-FP) and STORE(-FP)
362
363 All of these operations follow the same two-operand pattern, so it is
364 *both* the source *and* destination predication masks that are taken into
365 account. This is different from
366 the three-operand arithmetic instructions, where the predication mask
367 is taken from the *destination* register, and applied uniformly to the
368 elements of the source register(s), element-for-element.
369
370 The pseudo-code pattern for twin-predicated operations is as
371 follows:
372
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
386
387 This pattern covers scalar-scalar, scalar-vector, vector-scalar
388 and vector-vector, and predicated variants of all of those.
389 Zeroing is not presently included (TODO). As such, when compared
390 to RVV, the twin-predicated variants of C.MV and FMV cover
391 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
392 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
393
394 Note that:
395
396 * elwidth (SIMD) is not covered in the pseudo-code above
397 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
398 not covered
399 * zero predication is also not shown (TODO).
400
401 ### C.MV Instruction <a name="c_mv"></a>
402
403 There is no MV instruction in RV however there is a C.MV instruction.
404 It is used for copying integer-to-integer registers (vectorised FMV
405 is used for copying floating-point).
406
407 If either the source or the destination register are marked as vectors
408 C.MV is reinterpreted to be a vectorised (multi-register) predicated
409 move operation. The actual instruction's format does not change:
410
411 [[!table data="""
412 15 12 | 11 7 | 6 2 | 1 0 |
413 funct4 | rd | rs | op |
414 4 | 5 | 5 | 2 |
415 C.MV | dest | src | C0 |
416 """]]
417
418 A simplified version of the pseudocode for this operation is as follows:
419
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
433
434 There are several different instructions from RVV that are covered by
435 this one opcode:
436
437 [[!table data="""
438 src | dest | predication | op |
439 scalar | vector | none | VSPLAT |
440 scalar | vector | destination | sparse VSPLAT |
441 scalar | vector | 1-bit dest | VINSERT |
442 vector | scalar | 1-bit? src | VEXTRACT |
443 vector | vector | none | VCOPY |
444 vector | vector | src | Vector Gather |
445 vector | vector | dest | Vector Scatter |
446 vector | vector | src & dest | Gather/Scatter |
447 vector | vector | src == dest | sparse VCOPY |
448 """]]
449
450 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
451 operations with zeroing off, and inversion on the src and dest predication
452 for one of the two C.MV operations. The non-inverted C.MV will place
453 one set of registers into the destination, and the inverted one the other
454 set. With predicate-inversion, copying and inversion of the predicate mask
455 need not be done as a separate (scalar) instruction.
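
As an informal illustration (python, invented names, zeroing off), the two
back-to-back inverted-predicate moves described above combine two source
vectors like this:

    # VMERGE modelled as two predicated moves: the first C.MV copies "a"
    # elements where the predicate bit is set, the second (predicate
    # inverted) copies "b" elements where it is clear.
    def predicated_mv(dest, src, pred, invert=False):
        want = 0 if invert else 1
        for i in range(len(dest)):
            if ((pred >> i) & 1) != want:
                continue                        # masked out: dest untouched
            dest[i] = src[i]

    a, b = [10, 11, 12, 13], [20, 21, 22, 23]
    rd = [0, 0, 0, 0]
    predicated_mv(rd, a, 0b0101)                # take a[0], a[2]
    predicated_mv(rd, b, 0b0101, invert=True)   # take b[1], b[3]
    print(rd)   # -> [10, 21, 12, 23]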
456
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
462
463 ### FMV, FNEG and FABS Instructions
464
465 These are identical in form to C.MV, except covering floating-point
466 register copying. The same double-predication rules also apply.
467 However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
469 operation of the appropriate size covering the source and destination
470 register bitwidths.
471
472 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
473
### FCVT Instructions
475
476 These are again identical in form to C.MV, except that they cover
477 floating-point to integer and integer to floating-point. When element
478 width in each vector is set to default, the instructions behave exactly
479 as they are defined for standard RV (scalar) operations, except vectorised
480 in exactly the same fashion as outlined in C.MV.
481
482 However when the source or destination element width is not set to default,
483 the opcode's explicit element widths are *over-ridden* to new definitions,
484 and the opcode's element width is taken as indicative of the SIMD width
485 (if applicable i.e. if packed SIMD is requested) instead.
486
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 32-bit (single-precision) floating-point number in rd.
489 If however the source rs1 is set to be a vector, where elwidth is set to
490 default/2 and "packed SIMD" is enabled, then the first 32 bits of
491 rs1 are converted to a floating-point number to be stored in rd's
492 first element and the higher 32-bits *also* converted to floating-point
493 and stored in the second. The 32 bit size comes from the fact that
494 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
495 divide that by two it means that rs1 element width is to be taken as 32.
496
497 Similar rules apply to the destination register.
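
An illustrative python sketch of the packed-SIMD halving described above
(assumptions: rs1's elwidth override makes each source element 32 bits wide;
rounding details are simplified):

    import struct

    # FCVT.S.L with rs1 elwidth = default/2 on RV64: the 64-bit source
    # register is re-interpreted as two 32-bit integers, each converted
    # to single-precision and stored in consecutive destination elements.
    def fcvt_s_l_packed(rs1_value_64bit):
        lo = rs1_value_64bit & 0xFFFFFFFF
        hi = (rs1_value_64bit >> 32) & 0xFFFFFFFF
        # round-tripping through struct 'f' models rounding to single precision
        to_single = lambda x: struct.unpack('f', struct.pack('f', float(x)))[0]
        return [to_single(lo), to_single(hi)]

    print(fcvt_s_l_packed((7 << 32) | 3))   # -> [3.0, 7.0]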
498
499 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
500
501 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
502 the interpretation of the instruction fields). This
503 actually undermined the fundamental principle of SV, namely that there
504 be no modifications to the scalar behaviour (except where absolutely
505 necessary), in order to simplify an implementor's task if considering
506 converting a pre-existing scalar design to support parallelism.
507
508 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
509 do not change in SV, however just as with C.MV it is important to note
510 that dual-predication is possible.
511
512 In vectorised architectures there are usually at least two different modes
513 for LOAD/STORE:
514
515 * Read (or write for STORE) from sequential locations, where one
516 register specifies the address, and the one address is incremented
517 by a fixed amount. This is usually known as "Unit Stride" mode.
518 * Read (or write) from multiple indirected addresses, where the
519 vector elements each specify separate and distinct addresses.
520
521 To support these different addressing modes, the CSR Register "isvector"
522 bit is used. So, for a LOAD, when the src register is set to
523 scalar, the LOADs are sequentially incremented by the src register
524 element width, and when the src register is set to "vector", the
525 elements are treated as indirection addresses. Simplified
526 pseudo-code would look like this:
527
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
547
548 Notes:
549
550 * For simplicity, zeroing and elwidth is not included in the above:
551 the key focus here is the decision-making for srcbase; vectorised
552 rs means use sequentially-numbered registers as the indirection
553 address, and scalar rs is "offset" mode.
554 * The test towards the end for whether both source and destination are
555 scalar is what makes the above pseudo-code provide the "standard" RV
556 Base behaviour for LD operations.
557 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
559 (8 bytes), and also whether the element width is over-ridden
560 (see special element width section).
561
562 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
563
564 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
565 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
566 It is therefore possible to use predicated C.LWSP to efficiently
567 pop registers off the stack (by predicating x2 as the source), cherry-picking
568 which registers to store to (by predicating the destination). Likewise
569 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
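
A brief python sketch of the idea (illustrative only, with invented names:
"memory" holds the stack words, "sp" stands in for x2, and the destination
predicate cherry-picks which registers receive values):

    # predicated C.LWSP used as LOAD-multiple: sequential words are popped
    # from the stack (addressed by x2) into whichever destination registers
    # the predicate selects; other registers are skipped entirely.
    def predicated_lwsp(regs, memory, sp, dest_base, dest_pred, VL):
        i = 0                                   # source (stack word) index
        for j in range(VL):                     # destination element index
            if not ((dest_pred >> j) & 1):
                continue                        # register not selected
            regs[dest_base + j] = memory[sp + i]
            i += 1                              # next sequential stack word

    regs = [0] * 32
    memory = {100: 0xAAAA, 101: 0xBBBB, 102: 0xCCCC}
    predicated_lwsp(regs, memory, sp=100, dest_base=8, dest_pred=0b1011, VL=4)
    print(regs[8:12])   # -> [43690, 48059, 0, 52428]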
570
571 The two modes ("unit stride" and multi-indirection) are still supported,
572 as with standard LD/ST. Essentially, the only difference is that the
573 use of x2 is hard-coded into the instruction.
574
575 **Note**: it is still possible to redirect x2 to an alternative target
576 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
577 general-purpose LOAD/STORE operations.
578
579 ## Compressed LOAD / STORE Instructions
580
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting vector mode (rather than scalar)
on the src for LOAD, and on the dest for STORE, switches the mode from
"Unit Stride" to "Multi-indirection".
586
587 # Element bitwidth polymorphism <a name="elwidth"></a>
588
589 Element bitwidth is best covered as its own special section, as it
590 is quite involved and applies uniformly across-the-board. SV restricts
591 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
592
593 The effect of setting an element bitwidth is to re-cast each entry
594 in the register table, and for all memory operations involving
595 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, each register effectively
now looks like this:
598
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
608
609 where the CSR Register table entry (not the instruction alone) determines
610 which of those union entries is to be used on each operation, and the
611 VL element offset in the hardware-loop specifies the index into each array.
612
However a naive interpretation of the data structure above masks the
fact that, when the bitwidth is 8 and VL is set greater than 8 (for example),
accessing one specific register "spills over" into the following parts of
the register file in a sequential fashion. So a much more accurate way
to reflect this would be:
618
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
629
630 where when accessing any individual regfile[n].b entry it is permitted
631 (in c) to arbitrarily over-run the *declared* length of the array (zero),
632 and thus "overspill" to consecutive register file entries in a fashion
633 that is completely transparent to a greatly-simplified software / pseudo-code
634 representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register bytes
is ever attempted.
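
The "overspill" behaviour can be modelled in a few lines of python, treating
the whole integer register file as one flat byte array (an illustrative
sketch only; RV64, i.e. XLEN=64, assumed):

    # model the register file as one flat byte array so that elements of
    # any width "spill over" naturally into the next numbered register.
    XLEN_BYTES = 8
    regfile = bytearray(128 * XLEN_BYTES)     # 128 registers, RV64

    def set_element(reg, elwidth_bytes, offset, value):
        addr = reg * XLEN_BYTES + offset * elwidth_bytes
        regfile[addr:addr + elwidth_bytes] = value.to_bytes(elwidth_bytes, 'little')

    def get_element(reg, elwidth_bytes, offset):
        addr = reg * XLEN_BYTES + offset * elwidth_bytes
        return int.from_bytes(regfile[addr:addr + elwidth_bytes], 'little')

    # with elwidth=8-bit and VL=10, elements 8 and 9 of "x3" land in x4
    for i in range(10):
        set_element(3, 1, i, 0x10 + i)
    print(hex(get_element(4, 8, 0)))   # -> 0x1918 (x4 holds elements 8 and 9)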
639
Now we may modify the pseudo-code for an operation where all element bitwidths
have been set to the same size; this pseudo-code is otherwise identical
to its "non"-polymorphic version (above):
643
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
666
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each, respectively) are "type-cast"
to 8-bit; for 16-bit entries likewise, and so on.
670
671 However that only covers the case where the element widths are the same.
672 Where the element widths are different, the following algorithm applies:
673
674 * Analyse the bitwidth of all source operands and work out the
675 maximum. Record this as "maxsrcbitwidth"
676 * If any given source operand requires sign-extension or zero-extension
677 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
678 sign-extension / zero-extension or whatever is specified in the standard
679 RV specification, **change** that to sign-extending from the respective
680 individual source operand's bitwidth from the CSR table out to
681 "maxsrcbitwidth" (previously calculated), instead.
682 * Following separate and distinct (optional) sign/zero-extension of all
683 source operands as specifically required for that operation, carry out the
684 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
685 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
687 into a copy).
688 * If the destination operand requires sign-extension or zero-extension,
689 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
691 etc.), overload the RV specification with the bitwidth from the
692 destination register's elwidth entry.
693 * Finally, store the (optionally) sign/zero-extended value into its
694 destination: memory for sb/sw etc., or an offset section of the register
695 file for an arithmetic operation.
696
697 In this way, polymorphic bitwidths are achieved without requiring a
698 massive 64-way permutation of calculations **per opcode**, for example
699 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
700 rd bitwidths). The pseudo-code is therefore as follows:
701
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, ird, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
767
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
771
772 * the source operands are extended out to the maximum bitwidth of all
773 source operands
774 * the operation takes place at that maximum source bitwidth (the
775 destination bitwidth is not involved at this point, at all)
776 * the result is extended (or potentially even, truncated) before being
777 stored in the destination. i.e. truncation (if required) to the
778 destination width occurs **after** the operation **not** before.
779 * when the destination is not marked as "vectorised", the **full**
780 (standard, scalar) register file entry is taken up, i.e. the
781 element is either sign-extended or zero-extended to cover the
782 full register bitwidth (XLEN) if it is not already XLEN bits long.
783
784 Implementors are entirely free to optimise the above, particularly
785 if it is specifically known that any given operation will complete
786 accurately in less bits, as long as the results produced are
787 directly equivalent and equal, for all inputs and all outputs,
788 to those produced by the above algorithm.
789
790 ## Polymorphic floating-point operation exceptions and error-handling
791
For floating-point operations, conversion takes place without
raising any kind of exception. Exactly as specified in the standard
RV specification, NaN (or appropriate) is stored if the result
is beyond the range of the destination, and, again exactly as
with standard scalar RV operations, the floating-point flag is raised
(FCSR). And, again, just as with scalar operations, it is software's
responsibility to check this flag.
Given that the FCSR flags are "accrued", the fact that multiple element
operations could have occurred is not a problem.
801
802 Note that it is perfectly legitimate for floating-point bitwidths of
803 only 8 to be specified. However whilst it is possible to apply IEEE 754
804 principles, no actual standard yet exists. Implementors wishing to
805 provide hardware-level 8-bit support rather than throw a trap to emulate
806 in software should contact the author of this specification before
807 proceeding.
808
809 ## Polymorphic shift operators
810
811 A special note is needed for changing the element width of left and right
812 shift operators, particularly right-shift. Even for standard RV base,
813 in order for correct results to be returned, the second operand RS2 must
814 be truncated to be within the range of RS1's bitwidth. spike's implementation
815 of sll for example is as follows:
816
817 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
818
819 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
820 range 0..31 so that RS1 will only be left-shifted by the amount that
821 is possible to fit into a 32-bit register. Whilst this appears not
822 to matter for hardware, it matters greatly in software implementations,
823 and it also matters where an RV64 system is set to "RV32" mode, such
824 that the underlying registers RS1 and RS2 comprise 64 hardware bits
825 each.
826
827 For SV, where each operand's element bitwidth may be over-ridden, the
828 rule about determining the operation's bitwidth *still applies*, being
829 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
830 **also applies to the truncation of RS2**. In other words, *after*
831 determining the maximum bitwidth, RS2's range must **also be truncated**
832 to ensure a correct answer. Example:
833
834 * RS1 is over-ridden to a 16-bit width
835 * RS2 is over-ridden to an 8-bit width
836 * RD is over-ridden to a 64-bit width
837 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
838 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
839
840 Pseudocode (in spike) for this example would therefore be:
841
842 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
843
844 This example illustrates that considerable care therefore needs to be
845 taken to ensure that left and right shift operations are implemented
846 correctly. The key is that
847
848 * The operation bitwidth is determined by the maximum bitwidth
849 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate, as sketched below.
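
A small python sketch of the 16-bit/8-bit example above (illustrative only;
operand values are invented):

    # polymorphic SLL: the operation width is the maximum of the *source*
    # element widths (here max(16, 8) = 16), and RS2 is masked to that
    # width's shift range *before* the shift is performed.
    def sv_sll(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        shamt = rs2_val & (opwidth - 1)             # e.g. RS2 & (16-1)
        result = (rs1_val << shamt) & ((1 << opwidth) - 1)
        # zero-extension to the destination width shown here; some opcodes
        # would require sign-extension instead
        return result & ((1 << rd_width) - 1)

    # RS1 elwidth=16, RS2 elwidth=8, RD elwidth=64
    print(hex(sv_sll(0x0001, 16, 0x11, 8, 64)))     # shamt = 0x11 & 15 = 1 -> 0x2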
851
852 ## Polymorphic MULH/MULHU/MULHSU
853
854 MULH is designed to take the top half MSBs of a multiply that
855 does not fit within the range of the source operands, such that
856 smaller width operations may produce a full double-width multiply
857 in two cycles. The issue is: SV allows the source operands to
858 have variable bitwidth.
859
860 Here again special attention has to be paid to the rules regarding
861 bitwidth, which, again, are that the operation is performed at
862 the maximum bitwidth of the **source** registers. Therefore:
863
864 * An 8-bit x 8-bit multiply will create a 16-bit result that must
865 be shifted down by 8 bits
866 * A 16-bit x 8-bit multiply will create a 24-bit result that must
867 be shifted down by 16 bits (top 8 bits being zero)
868 * A 16-bit x 16-bit multiply will create a 32-bit result that must
869 be shifted down by 16 bits
870 * A 32-bit x 16-bit multiply will create a 48-bit result that must
871 be shifted down by 32 bits
872 * A 32-bit x 8-bit multiply will create a 40-bit result that must
873 be shifted down by 32 bits
874
875 So again, just as with shift-left and shift-right, the result
876 is shifted down by the maximum of the two source register bitwidths.
877 And, exactly again, truncation or sign-extension is performed on the
878 result. If sign-extension is to be carried out, it is performed
879 from the same maximum of the two source register bitwidths out
880 to the result element's bitwidth.
881
882 If truncation occurs, i.e. the top MSBs of the result are lost,
883 this is "Officially Not Our Problem", i.e. it is assumed that the
884 programmer actually desires the result to be truncated. i.e. if the
885 programmer wanted all of the bits, they would have set the destination
886 elwidth to accommodate them.
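
The rule can be captured in a few lines of python (an illustrative sketch;
the unsigned MULHU case is shown, and the operand values are invented):

    # polymorphic MULHU: multiply at full precision, then shift down by the
    # maximum of the two *source* element widths to obtain the "high half".
    def sv_mulhu(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        high = (rs1_val * rs2_val) >> opwidth      # discard the low "half"
        return high & ((1 << rd_width) - 1)        # truncate/extend to dest

    # 16-bit x 8-bit: 24-bit-capable result, shifted down by 16
    print(hex(sv_mulhu(0xFFFF, 16, 0xFF, 8, 16)))  # -> 0xfe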
887
888 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
889
890 Polymorphic element widths in vectorised form means that the data
891 being loaded (or stored) across multiple registers needs to be treated
892 (reinterpreted) as a contiguous stream of elwidth-wide items, where
893 the source register's element width is **independent** from the destination's.
894
895 This makes for a slightly more complex algorithm when using indirection
896 on the "addressed" register (source for LOAD and destination for STORE),
897 particularly given that the LOAD/STORE instruction provides important
898 information about the width of the data to be reinterpreted.
899
900 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
901 was as follows, and i is the loop from 0 to VL-1:
902
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
905
906 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
907 chunks are taken from the source memory location addressed by the current
908 indexed source address register, and only when a full 32-bits-worth
909 are taken will the index be moved on to the next contiguous source
910 address register:
911
    bitwidth = bw(elwidth); // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock; // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
917
918 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
919 and 128 for LQ.
920
921 The principle is basically exactly the same as if the srcbase were pointing
922 at the memory of the *register* file: memory is re-interpreted as containing
923 groups of elwidth-wide discrete elements.
924
925 When storing the result from a load, it's important to respect the fact
926 that the destination register has its *own separate element width*. Thus,
927 when each element is loaded (at the source element width), any sign-extension
928 or zero-extension (or truncation) needs to be done to the *destination*
929 bitwidth. Also, the storing has the exact same analogous algorithm as
930 above, where in fact it is just the set\_polymorphed\_reg pseudocode
931 (completely unchanged) used above.
932
933 One issue remains: when the source element width is **greater** than
934 the width of the operation, it is obvious that a single LB for example
935 cannot possibly obtain 16-bit-wide data. This condition may be detected
936 where, when using integer divide, elsperblock (the width of the LOAD
937 divided by the bitwidth of the element) is zero.
938
939 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
940
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
942
943 The elements, if the element bitwidth is larger than the LD operation's
944 size, will then be sign/zero-extended to the full LD operation size, as
945 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
946 being passed on to the second phase.
947
948 As LOAD/STORE may be twin-predicated, it is important to note that
949 the rules on twin predication still apply, except where in previous
950 pseudo-code (elwidth=default for both source and target) it was
951 the *registers* that the predication was applied to, it is now the
952 **elements** that the predication is applied to.
953
954 Thus the full pseudocode for all LD operations may be written out
955 as follows:
956
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, bitwidth, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
994
995 Note:
996
997 * when comparing against for example the twin-predicated c.mv
998 pseudo-code, the pattern of independent incrementing of rd and rs
999 is preserved unchanged.
1000 * just as with the c.mv pseudocode, zeroing is not included and must be
1001 taken into account (TODO).
1002 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1003 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1004 VSCATTER characteristics.
1005 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1006 a destination that is not vectorised (marked as scalar) will
1007 result in the element being fully sign-extended or zero-extended
1008 out to the full register file bitwidth (XLEN). When the source
1009 is also marked as scalar, this is how the compatibility with
1010 standard RV LOAD/STORE is preserved by this algorithm.
1011
1012 ### Example Tables showing LOAD elements
1013
1014 This section contains examples of vectorised LOAD operations, showing
1015 how the two stage process works (three if zero/sign-extension is included).
1016
1017
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1019
1020 This is:
1021
1022 * a 64-bit load, with an offset of zero
1023 * with a source-address elwidth of 16-bit
1024 * into a destination-register with an elwidth of 32-bit
1025 * where VL=7
1026 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1027 * RV64, where XLEN=64 is assumed.
1028
First, the memory table: because the element width is 16 and the operation
is LD (64), the 64 bits loaded from memory are subdivided into groups of
**four** elements. And, with VL being 7 (deliberately, to illustrate that
this is reasonable and possible), the first four are sourced from the offset
addresses pointed to by x5, and the next three from the offset addresses
pointed to by the next contiguous register, x6:
1036
1037 [[!table data="""
1038 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1039 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1040 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1041 """]]
1042
1043 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1044 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1045
[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0 | 0x0 | elem0 ||
0x0 | 0x0 | elem1 ||
0x0 | 0x0 | elem2 ||
0x0 | 0x0 | elem3 ||
0x0 | 0x0 | elem4 ||
0x0 | 0x0 | elem5 ||
0x0 | 0x0 | elem6 ||
"""]]
1057
1058 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1059 byte-addressable "memory". That "memory" happens to cover registers
1060 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1061
1062 [[!table data="""
1063 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1064 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1065 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1066 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1067 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1068 """]]
1069
1070 Thus we have data that is loaded from the **addresses** pointed to by
1071 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1072 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
1074 shifted up 32 bits, and so on, until finally element 6 is in the
1075 LSBs of x11.
1076
1077 Note that whilst the memory addressing table is shown left-to-right byte order,
1078 the registers are shown in right-to-left (MSB) order. This does **not**
1079 imply that bit or byte-reversal is carried out: it's just easier to visualise
1080 memory as being contiguous bytes, and emphasises that registers are not
1081 really actually "memory" as such.
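
For those who prefer to check the tables programmatically, here is a small
python sketch reproducing the two-stage process for this example
(assumptions: little-endian layout, and the addresses held in x5/x6 simply
point at seven consecutive 16-bit values 0x101..0x107):

    # LD x8, 0(x5) with src elwidth=16, dest elwidth=32, VL=7 (RV64)
    src_elems = [0x101 + i for i in range(7)]        # elements fetched via x5/x6
    dest = bytearray(4 * 8)                          # registers x8..x11 (zeroed)

    for j, el in enumerate(src_elems):
        el32 = el & 0xFFFFFFFF                       # zero-extend 16 -> 32 bits
        dest[j*4:(j+1)*4] = el32.to_bytes(4, 'little')

    for n in range(4):
        word = int.from_bytes(dest[n*8:(n+1)*8], 'little')
        print("x%d = 0x%016x" % (8 + n, word))
    # x8  = 0x0000010200000101   (elements 1 and 0)
    # ...
    # x11 = 0x0000000000000107   (element 6; in hardware the top 32 bits of
    #                             x11 would in fact remain UNMODIFIED)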
1082
1083 ## Why SV bitwidth specification is restricted to 4 entries
1084
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit
1090
This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer is
that it gets too complex, no RV128 implementation yet exists, and RV64's
default is in any case 64 bit, so the 4 major element widths are covered anyway.
1095
There is an absolutely crucial aspect of SV here that explicitly
1097 needs spelling out, and it's whether the "vectorised" bit is set in
1098 the Register's CSR entry.
1099
1100 If "vectorised" is clear (not set), this indicates that the operation
1101 is "scalar". Under these circumstances, when set on a destination (RD),
1102 then sign-extension and zero-extension, whilst changed to match the
1103 override bitwidth (if set), will erase the **full** register entry
1104 (64-bit if RV64).
1105
1106 When vectorised is *set*, this indicates that the operation now treats
1107 **elements** as if they were independent registers, so regardless of
1108 the length, any parts of a given actual register that are not involved
1109 in the operation are **NOT** modified, but are **PRESERVED**.
1110
1111 For example:
1112
1113 * when the vector bit is clear and elwidth set to 16 on the destination
1114 register, operations are truncated to 16 bit and then sign or zero
1115 extended to the *FULL* XLEN register width.
1116 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1117 groups of elwidth sized elements do not fill an entire XLEN register),
1118 the "top" bits of the destination register do *NOT* get modified, zero'd
1119 or otherwise overwritten.
1120
1121 SIMD micro-architectures may implement this by using predication on
1122 any elements in a given actual register that are beyond the end of
1123 multi-element operation.
1124
1125 Other microarchitectures may choose to provide byte-level write-enable
1126 lines on the register file, such that each 64 bit register in an RV64
1127 system requires 8 WE lines. Scalar RV64 operations would require
1128 activation of all 8 lines, where SV elwidth based operations would
1129 activate the required subset of those byte-level write lines.
1130
1131 Example:
1132
1133 * rs1, rs2 and rd are all set to 8-bit
1134 * VL is set to 3
1135 * RV64 architecture is set (UXL=64)
1136 * add operation is carried out
1137 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1138 concatenated with similar add operations on bits 15..8 and 7..0
1139 * bits 24 through 63 **remain as they originally were**.
1140
1141 Example SIMD micro-architectural implementation:
1142
1143 * SIMD architecture works out the nearest round number of elements
1144 that would fit into a full RV64 register (in this case: 8)
1145 * SIMD architecture creates a hidden predicate, binary 0b00000111
1146 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1147 * SIMD architecture goes ahead with the add operation as if it
1148 was a full 8-wide batch of 8 adds
1149 * SIMD architecture passes top 5 elements through the adders
1150 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them in rd.
1153
1154 This requires a read on rd, however this is required anyway in order
1155 to support non-zeroing mode.
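
A python sketch of that hidden-predicate trick (illustrative only; 8-bit
elements, VL=3, operand values invented):

    # SIMD-style implementation of an 8-bit elwidth add with VL=3 on a
    # 64-bit register: a hidden predicate masks off the top 5 lanes so
    # that their original bytes are written back unmodified.
    def simd_add8(rd_old, rs1, rs2, VL):
        lanes = 8                                    # 64-bit / 8-bit
        hidden_pred = (1 << VL) - 1                  # 0b00000111 for VL=3
        out = 0
        for lane in range(lanes):
            a = (rs1 >> (lane*8)) & 0xFF
            b = (rs2 >> (lane*8)) & 0xFF
            old = (rd_old >> (lane*8)) & 0xFF
            val = (a + b) & 0xFF if (hidden_pred >> lane) & 1 else old
            out |= val << (lane*8)
        return out

    print(hex(simd_add8(0xAAAAAAAAAAAAAAAA, 0x0000000000000102,
                        0x0000000000000304, VL=3)))  # -> 0xaaaaaaaaaa000406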
1156
1157 ## Polymorphic floating-point
1158
1159 Standard scalar RV integer operations base the register width on XLEN,
1160 which may be changed (UXL in USTATUS, and the corresponding MXL and
1161 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1162 arithmetic operations are therefore restricted to an active XLEN bits,
1163 with sign or zero extension to pad out the upper bits when XLEN has
1164 been dynamically set to less than the actual register size.
1165
1166 For scalar floating-point, the active (used / changed) bits are
1167 specified exclusively by the operation: ADD.S specifies an active
1168 32-bits, with the upper bits of the source registers needing to
1169 be all 1s ("NaN-boxed"), and the destination upper bits being
1170 *set* to all 1s (including on LOAD/STOREs).
1171
1172 Where elwidth is set to default (on any source or the destination)
1173 it is obvious that this NaN-boxing behaviour can and should be
1174 preserved. When elwidth is non-default things are less obvious,
1175 so need to be thought through. Here is a normal (scalar) sequence,
1176 assuming an RV64 which supports Quad (128-bit) FLEN:
1177
1178 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1179 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1180 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1181 top 64 MSBs ignored.
1182
1183 Therefore it makes sense to mirror this behaviour when, for example,
1184 elwidth is set to 32. Assume elwidth set to 32 on all source and
1185 destination registers:
1186
1187 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1188 floating-point numbers.
1189 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1190 in bits 0-31 and the second in bits 32-63.
1191 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1192
1193 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1194 of the registers either during the FLD **or** the ADD.D. The reason
1195 is that, effectively, the top 64 MSBs actually represent a completely
1196 independent 64-bit register, so overwriting it is not only gratuitous
1197 but may actually be harmful for a future extension to SV which may
1198 have a way to directly access those top 64 bits.
1199
1200 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1202 when "isvec" is false in a given register's CSR entry. Only when the
1203 elwidth is set to default **and** isvec is false will the standard
1204 RV behaviour be followed, namely that the upper bits be modified.
1205
1206 Ultimately if elwidth is default and isvec false on *all* source
1207 and destination registers, a SimpleV instruction defaults completely
1208 to standard RV scalar behaviour (this holds true for **all** operations,
1209 right across the board).
1210
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set to
non-default values, are effectively all the same: they all still perform
1213 multiple ADD operations, just at different widths. A future extension
1214 to SimpleV may actually allow ADD.S to access the upper bits of the
1215 register, effectively breaking down a 128-bit register into a bank
1216 of 4 independently-accesible 32-bit registers.
1217
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether FADD.S, FADD.D or FADD.Q is used,
using FADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar out-of-order
architecture there may be absolutely no difference; however, simpler
SIMD-style microarchitectures may not have the infrastructure in place
to know the difference, such that when VL=8 and an FADD.D instruction
is issued it completes in two cycles (or more) rather than one, where
an FADD.Q issued instead would complete in one.
1228
1229 ## Specific instruction walk-throughs
1230
1231 This section covers walk-throughs of the above-outlined procedure
1232 for converting standard RISC-V scalar arithmetic operations to
1233 polymorphic widths, to ensure that it is correct.
1234
1235 ### add
1236
1237 Standard Scalar RV32/RV64 (xlen):
1238
1239 * RS1 @ xlen bits
1240 * RS2 @ xlen bits
1241 * add @ xlen bits
1242 * RD @ xlen bits
1243
1244 Polymorphic variant:
1245
1246 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1247 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1248 * add @ max(rs1, rs2) bits
1249 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1250
Note here that polymorphic add zero-extends its source operands,
whereas addw sign-extends.
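
A minimal Python sketch of the polymorphic add rules listed above
(helper and function names are invented for illustration; widths are
in bits):

    def zero_extend(value, width):
        """Keep only the lowest 'width' bits (zero-extension is then implicit)."""
        return value & ((1 << width) - 1)

    def poly_add(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        src1 = zero_extend(rs1_val, rs1_width)        # RS1, zero-extended
        src2 = zero_extend(rs2_val, rs2_width)        # RS2, zero-extended
        result = zero_extend(src1 + src2, opwidth)    # add @ max(rs1, rs2) bits
        # RD @ rd bits: zero-extend if rd > opwidth (a no-op for an
        # unsigned value), otherwise truncate down to rd bits.
        return zero_extend(result, rd_width) if rd_width < opwidth else result

    # example: 8-bit sources into a 16-bit destination
    assert poly_add(0xFF, 8, 0x01, 8, 16) == 0x00   # the 8-bit add wraps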
1253
1254 ### addw
1255
The RV Specification specifically states that "W" variants of arithmetic
operations always produce 32-bit signed values. In a polymorphic
environment it is reasonable to assume that the signed aspect is
preserved, whilst the length of the operands and the result
may be changed.
1261
1262 Standard Scalar RV64 (xlen):
1263
1264 * RS1 @ xlen bits
1265 * RS2 @ xlen bits
1266 * add @ xlen bits
1267 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1268
1269 Polymorphic variant:
1270
1271 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1272 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1273 * add @ max(rs1, rs2) bits
1274 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1275
Note here that polymorphic addw sign-extends its source operands,
whereas add zero-extends.
1278
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1283
1284 Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
1285 where for add they are both zero-extended. This holds true for all arithmetic
1286 operations ending with "W".
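
The same sketch, adjusted for the signed "W" behaviour described above:
the source operands are sign- rather than zero-extended, which only
matters for the narrower of the two (again, names are illustrative only):

    def sign_extend(value, width):
        value &= (1 << width) - 1
        return value - (1 << width) if value & (1 << (width - 1)) else value

    def poly_addw(rs1_val, rs1_width, rs2_val, rs2_width, rd_width):
        opwidth = max(rs1_width, rs2_width)
        src1 = sign_extend(rs1_val, rs1_width)          # sign-extended
        src2 = sign_extend(rs2_val, rs2_width)          # sign-extended
        result = (src1 + src2) & ((1 << opwidth) - 1)   # add @ opwidth bits
        if rd_width >= opwidth:                         # sign-extend up to rd
            return sign_extend(result, opwidth) & ((1 << rd_width) - 1)
        return result & ((1 << rd_width) - 1)           # otherwise truncate

    # 8-bit -1 plus 16-bit 0: the 8-bit operand is sign-extended, result -1
    assert poly_addw(0xFF, 8, 0x0000, 16, 16) == 0xFFFF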
1287
1288 ### addiw
1289
1290 Standard Scalar RV64I:
1291
1292 * RS1 @ xlen bits, truncated to 32-bit
1293 * immed @ 12 bits, sign-extended to 32-bit
1294 * add @ 32 bits
* RD @ xlen bits. sign-extend the 32-bit result to xlen.
1296
1297 Polymorphic variant:
1298
1299 * RS1 @ rs1 bits
1300 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1301 * add @ max(rs1, 12) bits
1302 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
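
A corresponding illustrative sketch for addiw, where the 12-bit immediate
is sign-extended to max(rs1, 12) bits before the add (names invented for
illustration):

    def sign_extend(value, width):
        value &= (1 << width) - 1
        return value - (1 << width) if value & (1 << (width - 1)) else value

    def poly_addiw(rs1_val, rs1_width, imm12, rd_width):
        opwidth = max(rs1_width, 12)
        src1 = rs1_val & ((1 << rs1_width) - 1)         # RS1 @ rs1 bits
        imm = sign_extend(imm12, 12)                    # immed sign-extended
        result = (src1 + imm) & ((1 << opwidth) - 1)    # add @ max(rs1, 12) bits
        if rd_width >= opwidth:                         # sign-extend to rd bits
            return sign_extend(result, opwidth) & ((1 << rd_width) - 1)
        return result & ((1 << rd_width) - 1)           # otherwise truncate

    assert poly_addiw(0x10, 8, 0xFFF, 16) == 0x000F     # 0x10 + (-1)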
1303
1304 # Predication Element Zeroing
1305
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to be able to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1316
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1323
1324 ## Single-predication (based on destination register)
1325
1326 Zeroing on predication for arithmetic operations is taken from
1327 the destination register's predicate. i.e. the predication *and*
1328 zeroing settings to be applied to the whole operation come from the
1329 CSR Predication table entry for the destination register.
1330 Thus when zeroing is set on predication of a destination element,
1331 if the predication bit is clear, then the destination element is *set*
1332 to zero (twin-predication is slightly different, and will be covered
1333 next).
1334
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1337
    for (i = 0; i < VL; i++)
      if not zeroing: # an optimisation
        while (!(predval & 1<<i) && i < VL)
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
        if i == VL:
          return
      if (predval & 1<<i)
        src1 = ....
        src2 = ...
        result = src1 + src2 # actual add (or other op) here
        set_polymorphed_reg(rd, destwid, ird, result)
        if int_vec[rd].ffirst and result == 0:
          VL = i # result was zero, end loop early, return VL
          return
        if (!int_vec[rd].isvector) return
      else if zeroing:
        result = 0
        set_polymorphed_reg(rd, destwid, ird, result)
      if (int_vec[rd ].isvector)  { id += 1; }
      else if (predval & 1<<i) return
      if (int_vec[rs1].isvector)  { irs1 += 1; }
      if (int_vec[rs2].isvector)  { irs2 += 1; }
      if (id == VL or irs1 == VL or irs2 == VL): return
1364
1365 The optimisation to skip elements entirely is only possible for certain
1366 micro-architectures when zeroing is not set. However for lane-based
1367 micro-architectures this optimisation may not be practical, as it
1368 implies that elements end up in different "lanes". Under these
1369 circumstances it is perfectly fine to simply have the lanes
1370 "inactive" for predicated elements, even though it results in
1371 less than 100% ALU utilisation.
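
To summarise the difference between zeroing and non-zeroing (merging) on a
single destination predicate, here is a deliberately simplified executable
Python model; element widths, scalar/vector stepping and fail-first are
omitted, and all names are illustrative:

    # dest/src1/src2 are lists of element values; predval is a bitmask.
    def predicated_op(dest, src1, src2, predval, VL, zeroing,
                      op=lambda a, b: a + b):
        for i in range(VL):
            if predval & (1 << i):
                dest[i] = op(src1[i], src2[i])   # active element: do the op
            elif zeroing:
                dest[i] = 0                      # masked out + zeroing: store 0
            # masked out + non-zeroing: dest[i] is left untouched (merging)
        return dest

    old = [9, 9, 9, 9]
    print(predicated_op(old[:], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, 4, False))
    # -> [11, 9, 33, 9]   (non-zeroing: masked elements keep their old value)
    print(predicated_op(old[:], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, 4, True))
    # -> [11, 0, 33, 0]   (zeroing: masked elements are set to zero)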
1372
1373 ## Twin-predication (based on source and destination register)
1374
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1379
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1386
1387 When zeroing is set on the destination and not the source, then just
1388 as with single-predicated operations, a zero is stored into the destination
1389 element (or target memory address for a STORE).
1390
Zeroing on both source and destination effectively results in a bitwise
NAND of the source and destination predicates: only where *both* predicate
bits are set does actual data reach the destination, so wherever either the
source predicate OR the destination predicate is zero, a zero element will
ultimately end up in the destination register.
1395
1396 However: this may not necessarily be the case for all operations;
1397 implementors, particularly of custom instructions, clearly need to
1398 think through the implications in each and every case.
1399
1400 Here is pseudo-code for a twin zero-predicated operation:
1401
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL):
        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
        if ((pd & 1<<j))
          if ((ps & 1<<i))
            sourcedata = ireg[rs+i];
          else
            sourcedata = 0
          ireg[rd+j] <= sourcedata
        else if (zerodst)
          ireg[rd+j] <= 0
        if (int_csr[rs].isvec)
          i++;
        if (int_csr[rd].isvec)
          j++;
        else
          if ((pd & 1<<j))
            break;
1425
1426 Note that in the instance where the destination is a scalar, the hardware
1427 loop is ended the moment a value *or a zero* is placed into the destination
1428 register/element. Also note that, for clarity, variable element widths
1429 have been left out of the above.
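
The following simplified executable Python model of the above (assuming
both rs and rd are marked as vectors, and omitting element widths) may help
illustrate the combinations of source and destination zeroing; all names
are illustrative only:

    def twin_pred_mv(src, ps, zerosrc, dest, pd, zerodst, VL):
        i = j = 0
        while i < VL and j < VL:
            if not zerosrc:                   # skip masked-out source elements
                while i < VL and not (ps & (1 << i)):
                    i += 1
            if not zerodst:                   # skip masked-out dest elements
                while j < VL and not (pd & (1 << j)):
                    j += 1
            if i >= VL or j >= VL:
                break
            if pd & (1 << j):
                # active destination: copy data, or zero if the source
                # element is masked out (only possible when zerosrc is set)
                dest[j] = src[i] if (ps & (1 << i)) else 0
            elif zerodst:
                dest[j] = 0                   # destination masked out + zeroing
            i += 1
            j += 1
        return dest

    # zeroing on both: a zero lands wherever ps AND pd is clear (the NAND
    # of the two predicates, as described above)
    print(twin_pred_mv([5, 6, 7, 8], 0b0011, True, [1, 1, 1, 1], 0b0101, True, 4))
    # -> [5, 0, 0, 0]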
1430
1431 # Subsets of RV functionality
1432
1433 This section describes the differences when SV is implemented on top of
1434 different subsets of RV.
1435
1436 ## Common options
1437
1438 It is permitted to only implement SVprefix and not the VBLOCK instruction
1439 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1440 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1441 traps may emulate the format.
1442
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1447
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However, reducing
them below the mandatory limits set in the RV standard will result in
non-compliance with the SV Specification.
1452
1453 ## RV32 / RV32F
1454
1455 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1456 maximum limit for predication is also restricted to 32 bits. Whilst not
1457 actually specifically an "option" it is worth noting.
1458
1459 ## RV32G
1460
Normally in standard RV32 it does not make much sense to have
RV32G. The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1465
1466 In an earlier draft of SV, it was possible to specify an elwidth
1467 of double the standard register size: this had to be dropped,
1468 and may be reintroduced in future revisions.
1469
1470 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1471
1472 When floating-point is not implemented, the size of the User Register and
1473 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1474 per table).
1475
1476 ## RV32E
1477
1478 In embedded scenarios the User Register and Predication CSRs may be
1479 dropped entirely, or optionally limited to 1 CSR, such that the combined
1480 number of entries from the M-Mode CSR Register table plus U-Mode
1481 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1482 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1483 the Predication CSR tables.
1484
1485 RV32E is the most likely candidate for simply detecting that registers
1486 are marked as "vectorised", and generating an appropriate exception
1487 for the VL loop to be implemented in software.
1488
1489 ## RV128
1490
RV128 has not been especially considered here; however it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1495
1496 # Example usage
1497
1498 TODO evaluate strncpy and strlen
1499 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1500
1501 ## strncpy
1502
RVV version: <a name="strncpy"></a>
1504
1505 strncpy:
1506 mv a3, a0 # Copy dst
1507 loop:
1508 setvli x0, a2, vint8 # Vectors of bytes.
1509 vlbff.v v1, (a1) # Get src bytes
1510 vseq.vi v0, v1, 0 # Flag zero bytes
1511 vmfirst a4, v0 # Zero found?
vmsif.v v0, v0 # Set mask up to and including zero byte.
1513 vsb.v v1, (a3), v0.t # Write out bytes
1514 bgez a4, exit # Done
1515 csrr t1, vl # Get number of bytes fetched
1516 add a1, a1, t1 # Bump src pointer
1517 sub a2, a2, t1 # Decrement count.
1518 add a3, a3, t1 # Bump dst pointer
1519 bnez a2, loop # Anymore?
1520
1521 exit:
1522 ret
1523
1524 SV version (WIP):
1525
1526 strncpy:
1527 mv a3, a0
1528 SETMVLI 8 # set max vector to 8
1529 RegCSR[a3] = 8bit, a3, scalar
1530 RegCSR[a1] = 8bit, a1, scalar
1531 RegCSR[t0] = 8bit, t0, vector
1532 PredTb[t0] = ffirst, x0, inv
1533 loop:
1534 SETVLI a2, t4 # t4 and VL now 1..8
1535 ldb t0, (a1) # t0 fail first mode
1536 bne t0, x0, allnonzero # still ff
1537 # VL points to last nonzero
1538 GETVL t4 # from bne tests
1539 addi t4, t4, 1 # include zero
1540 SETVL t4 # set exactly to t4
1541 stb t0, (a3) # store incl zero
1542 ret # end subroutine
1543 allnonzero:
1544 stb t0, (a3) # VL legal range
1545 GETVL t4 # from bne tests
1546 add a1, a1, t4 # Bump src pointer
1547 sub a2, a2, t4 # Decrement count.
1548 add a3, a3, t4 # Bump dst pointer
1549 bnez a2, loop # Anymore?
1550 exit:
1551 ret
1552
1553 Notes:
1554
1555 * Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
1556 * obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
1557 * with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
1558 * RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
1559 * RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1560 * with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
1561 * setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
1562 * ldb and bne are both using t0, both in ffirst mode
* ldb will end on illegal mem, reduce VL, but will have copied all sorts of stuff into t0
1564 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
1566 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
1568 * SETVL sets *exactly* the requested amount into VL.
1569 * the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
1570 * this would cause the stb to copy up to the end of the legal memory
1571 * of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
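
For reference, here is a behavioural Python sketch of what the loop above
achieves; it models the outcome only, not the SV semantics, and note that,
unlike ISO C strncpy, the routine stops after the terminating zero byte
rather than zero-padding the remainder:

    def strncpy_model(dst: bytearray, src: bytes, n: int) -> bytearray:
        i = 0
        while i < n:
            dst[i] = src[i]          # copy one byte (including the zero)
            if src[i] == 0:
                break                # stop after the terminating zero
            i += 1
        return dst

    buf = bytearray(8)
    print(strncpy_model(buf, b"hi\x00junk", 8))
    # -> bytearray(b'hi\x00\x00\x00\x00\x00\x00')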
1572
1573 ## strcpy
1574
1575 RVV version:
1576
1577 mv a3, a0 # Save start
1578 loop:
1579 setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
1580 vldbff.v v1, (a3) # Get bytes
1581 csrr a1, vl # Get bytes actually read e.g. if fault
1582 vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
1583 add a3, a3, a1 # Bump pointer
1584 vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
1585 bltz a2, loop # Not found?
1586 add a0, a0, a1 # Sum start + bump
1587 add a3, a3, a2 # Add index of zero byte
1588 sub a0, a3, a0 # Subtract start address+bump
1589 ret
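
As a scalar cross-check, the chunked scan performed by the RVV sequence
above can be sketched in Python as follows; CHUNK stands in for the
hardware-selected vector length and is an arbitrary illustrative value:

    CHUNK = 16  # stands in for the vector length chosen by setvli

    def scan_for_zero(mem: bytes, start: int) -> int:
        """Return the offset of the first zero byte from 'start', scanning
        up to CHUNK bytes at a time (mirrors the loop structure above)."""
        a3 = start
        while True:
            chunk = mem[a3:a3 + CHUNK]        # vldbff.v: get up to CHUNK bytes
            if not chunk:
                raise ValueError("no terminating zero byte")  # sketch guard
            for idx, b in enumerate(chunk):   # vseq.vi + vmfirst
                if b == 0:
                    return (a3 + idx) - start # offset of the zero byte
            a3 += len(chunk)                  # bump pointer, scan next chunk

    print(scan_for_zero(b"hello world\x00 trailing", 0))   # -> 11
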