1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 25 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes
11
Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
that is zero"; with traps, however, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (with VL=1).
17
18 ## Fail-on-first traps
19
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
were clear). Should any subsequent element raise a trap, that element
and all subsequently-indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the number of elements up to and
including the *last* element that did not take the trap.
26
Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur). However, the
loop still had to test the predicate bit: thus on return, VL is set to
include both the elements that did not take the trap *and* the elements
that were predicated (masked) out, up to the point where the trap
occurred.
33
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst were not set); in subsequent
*sub-groups*, however, the trap must not occur. SUBVL will **NOT**
be modified.
38
39 Given that predication bits apply to SUBVL groups, the same rules apply
40 to predicated-out (masked-out) sub-groups in calculating the value that VL
41 is set to.
42
43 ## Fail-on-first conditional tests
44
45 ffirst stops sequential element conditional testing on the first element result
46 being zero. VL is set to the number of elements that were processed before
47 the fail-condition was encountered.
48
Note that, just as with traps, if SUBVL!=1, a fail-condition on any element
of a *sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped.
55
56 Note again that, just as with traps, predicated-out (masked-out) elements
57 are included in the count leading up to the fail-condition, even though they
58 were not tested.
59
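The following minimal C sketch (purely illustrative: the `element_test`
callback and the flat predicate layout are assumptions, not part of the
specification) shows how VL would be truncated under fail-on-first
conditional testing, with predicated-out sub-groups counted but never
tested:

    #include <stdbool.h>
    #include <stdint.h>

    /* Software model of ffirst conditional testing.  Elements are grouped
     * in SUBVL-wide sub-groups; one predicate bit covers each sub-group.
     * VL becomes the number of sub-groups processed before the first
     * failing (zero) test; masked-out sub-groups are counted but never
     * tested. */
    int ffirst_conditional(int VL, int SUBVL, uint64_t pred,
                           bool (*element_test)(int i, int s))
    {
        for (int i = 0; i < VL; i++) {
            if (!(pred & (1ULL << i)))
                continue;              /* masked out: counted, not tested */
            for (int s = 0; s < SUBVL; s++)
                if (!element_test(i, s))
                    return i;          /* new VL: sub-groups before the fail */
        }
        return VL;                     /* no fail-condition encountered */
    }
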
60 # Instructions <a name="instructions" />
61
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped; however, xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all dedicated vector
opcodes, with the exception of CLIP and VSELECT.X
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever gained
an MV.X instruction as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
72
73 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
74 equivalents, so are left out of Simple-V. VSELECT could be included if
75 there existed a MV.X instruction in RV (MV.X is a hypothetical
76 non-immediate variant of MV that would allow another register to
77 specify which register was to be copied). Note that if any of these three
78 instructions are added to any given RV extension, their functionality
79 will be inherently parallelised.
80
81 With some exceptions, where it does not make sense or is simply too
82 challenging, all RV-Base instructions are parallelised:
83
84 * CSR instructions, whilst a case could be made for fast-polling of
85 a CSR into multiple registers, or for being able to copy multiple
86 contiguously addressed CSRs into contiguous registers, and so on,
87 are the fundamental core basis of SV. If parallelised, extreme
88 care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
90 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
91 left as scalar.
92 * LR/SC could hypothetically be parallelised however their purpose is
93 single (complex) atomic memory operations where the LR must be followed
94 up by a matching SC. A sequence of parallel LR instructions followed
95 by a sequence of parallel SC instructions therefore is guaranteed to
96 not be useful. Not least: the guarantees of a Multi-LR/SC
97 would be impossible to provide if emulated in a trap.
98 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
99 paralleliseable anyway.
100
101 All other operations using registers are automatically parallelised.
102 This includes AMOMAX, AMOSWAP and so on, where particular care and
103 attention must be paid.
104
105 Example pseudo-code for an integer ADD operation (including scalar
106 operations). Floating-point uses the FP Register Table.
107
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
122
123 Note that for simplicity there is quite a lot missing from the above
124 pseudo-code: element widths, zeroing on predication, dimensional
125 reshaping and offsets and so on. However it demonstrates the basic
126 principle. Augmentations that produce the full pseudo-code are covered in
127 other sections.
128
129 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
130
131 Adding in support for SUBVL is a matter of adding in an extra inner
132 for-loop, where register src and dest are still incremented inside the
133 inner part. Note that the predication is still taken from the VL index.
134
135 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
136 indexed by "(i)"
137
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
161
162
163 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
164 elwidth handling etc. all left out.
165
166 ## Instruction Format
167
168 It is critical to appreciate that there are
169 **no operations added to SV, at all**.
170
171 Instead, by using CSRs to tag registers as an indication of "changed
172 behaviour", SV *overloads* pre-existing branch operations into predicated
173 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
174 LOAD/STORE depending on CSR configurations for bitwidth and predication.
175 **Everything** becomes parallelised. *This includes Compressed
176 instructions* as well as any future instructions and Custom Extensions.
177
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
180 FRM changes the behaviour of the floating-point unit, to alter the rounding
181 mode. Other architectures change the LOAD/STORE byte-order from big-endian
182 to little-endian on a per-instruction basis. SV is just a little more...
183 comprehensive in its effect on instructions.
184
185 ## Branch Instructions
186
187 Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
189 of multiple comparisons into a register (taken indirectly from the predicate
190 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
191 See ffirst mode in the Predication Table section.
192
193 ### Standard Branch <a name="standard_branch"></a>
194
195 Branch operations use standard RV opcodes that are reinterpreted to
196 be "predicate variants" in the instance where either of the two src
197 registers are marked as vectors (active=1, vector=1).
198
199 Note that the predication register to use (if one is enabled) is taken from
200 the *first* src register, and that this is used, just as with predicated
201 arithmetic operations, to mask whether the comparison operations take
202 place or not. The target (destination) predication register
203 to use (if one is enabled) is taken from the *second* src register.
204
205 If either of src1 or src2 are scalars (whether by there being no
206 CSR register entry or whether by the CSR entry specifically marking
207 the register as "scalar") the comparison goes ahead as vector-scalar
208 or scalar-vector.
209
In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
215
216 Note that when zero-predication is enabled (from source rs1),
217 a cleared bit in the predicate indicates that the result
218 of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
220 when zeroing is not set: bits in the destination predicate are
221 only *set*; they are **not** cleared. This is important to appreciate,
222 as there may be an expectation that, going into the hardware-loop,
223 the destination predicate is always expected to be set to zero:
224 this is **not** the case. The destination predicate is only set
225 to zero if **zeroing** is enabled.
226
227 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
229 src1 and src2.
230
231 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
232 for predicated compare operations of function "cmp":
233
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
238
239 With associated predication, vector-length adjustments and so on,
240 and temporarily ignoring bitwidth (which makes the comparisons more
241 complex), this becomes:
242
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
280
281 Notes:
282
283 * Predicated SIMD comparisons would break src1 and src2 further down
284 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
285 Reordering") setting Vector-Length times (number of SIMD elements) bits
286 in Predicate Register rd, as opposed to just Vector-Length bits.
287 * The execution of "parallelised" instructions **must** be implemented
288 as "re-entrant" (to use a term from software). If an exception (trap)
289 occurs during the middle of a vectorised
290 Branch (now a SV predicated compare) operation, the partial results
291 of any comparisons must be written out to the destination
292 register before the trap is permitted to begin. If however there
293 is no predicate, the **entire** set of comparisons must be **restarted**,
294 with the offset loop indices set back to zero. This is because
295 there is no place to store the temporary result during the handling
296 of traps.
297
298 TODO: predication now taken from src2. also branch goes ahead
299 if all compares are successful.
300
301 Note also that where normally, predication requires that there must
302 also be a CSR register entry for the register being used in order
303 for the **predication** CSR register entry to also be active,
304 for branches this is **not** the case. src2 does **not** have
305 to have its CSR register entry marked as active in order for
306 predication on src2 to be active.
307
308 Also note: SV Branch operations are **not** twin-predicated
309 (see Twin Predication section). This would require three
310 element offsets: one to track src1, one to track src2 and a third
311 to track where to store the accumulation of the results. Given
312 that the element offsets need to be exposed via CSRs so that
313 the parallel hardware looping may be made re-entrant on traps
314 and exceptions, the decision was made not to make SV Branches
315 twin-predicated.
316
317 ### Floating-point Comparisons
318
There are no floating-point branch operations, only compares.
320 Interestingly no change is needed to the instruction format because
321 FP Compare already stores a 1 or a zero in its "rd" integer register
322 target, i.e. it's not actually a Branch at all: it's a compare.
323
324 In RV (scalar) Base, a branch on a floating-point compare is
325 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
326 This does extend to SV, as long as x1 (in the example sequence given)
327 is vectorised. When that is the case, x1..x(1+VL-1) will also be
328 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
329 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
330 so on. Consequently, unlike integer-branch, FP Compare needs no
331 modification in its behaviour.
332
333 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
334 and whilst in ordinary branch code this is fine because the standard
335 RVF compare can always be followed up with an integer BEQ or a BNE (or
336 a compressed comparison to zero or non-zero), in predication terms that
337 becomes more of an impact. To deal with this, SV's predication has
338 had "invert" added to it.
339
340 Also: note that FP Compare may be predicated, using the destination
341 integer register (rd) to determine the predicate. FP Compare is **not**
342 a twin-predication operation, as, again, just as with SV Branches,
343 there are three registers involved: FP src1, FP src2 and INT rd.
344
345 Also: note that ffirst (fail first mode) applies directly to this operation.
346
347 ### Compressed Branch Instruction
348
349 Compressed Branch instructions are, just like standard Branch instructions,
350 reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz rs1 is equivalent to beq rs1, x0, the optional target
to store the results of the comparisons is taken from the CSR predication
table entries for **x0**.
355
356 The specific required use of x0 is, with a little thought, quite obvious,
357 but is counterintuitive. Clearly it is **not** recommended to redirect
358 x0 with a CSR register entry, however as a means to opaquely obtain
359 a predication target it is the only sensible option that does not involve
360 additional special CSRs (or, worse, additional special opcodes).
361
362 Note also that, just as with standard branches, the 2nd source
363 (in this case x0 rather than src2) does **not** have to have its CSR
364 register table marked as "active" in order for predication to work.
365
366 ## Vectorised Dual-operand instructions
367
368 There is a series of 2-operand instructions involving copying (and
369 sometimes alteration):
370
371 * C.MV
372 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
373 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
374 * LOAD(-FP) and STORE(-FP)
375
376 All of these operations follow the same two-operand pattern, so it is
377 *both* the source *and* destination predication masks that are taken into
378 account. This is different from
379 the three-operand arithmetic instructions, where the predication mask
380 is taken from the *destination* register, and applied uniformly to the
381 elements of the source register(s), element-for-element.
382
383 The pseudo-code pattern for twin-predicated operations is as
384 follows:
385
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
399
400 This pattern covers scalar-scalar, scalar-vector, vector-scalar
401 and vector-vector, and predicated variants of all of those.
402 Zeroing is not presently included (TODO). As such, when compared
403 to RVV, the twin-predicated variants of C.MV and FMV cover
404 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
405 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
406
407 Note that:
408
409 * elwidth (SIMD) is not covered in the pseudo-code above
410 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
411 not covered
412 * zero predication is also not shown (TODO).
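
To make the pattern above concrete, here is a small, self-contained C model
of the twin-predicated copy loop (the register-file array and `isvec` flags
are simplifications for illustration only, and the predicates are assumed
to have enough set bits, as in the pseudo-code). With a scalar source and
vector destination it behaves as VSPLAT; with a vector source and scalar
destination it extracts the first non-masked element (VEXTRACT):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define VL 4

    static void twin_pred_mv(uint64_t *reg, int rd, int rs,
                             bool rd_isvec, bool rs_isvec,
                             uint64_t ps, uint64_t pd)
    {
        for (int i = 0, j = 0; i < VL && j < VL;) {
            if (rs_isvec) while (!(ps & (1ULL << i))) i++;
            if (rd_isvec) while (!(pd & (1ULL << j))) j++;
            reg[rd + j] = reg[rs + i];
            if (rs_isvec) i++;
            if (rd_isvec) j++; else break;
        }
    }

    int main(void)
    {
        uint64_t reg[32] = {0};
        reg[5] = 0x99;                                       /* scalar source */
        twin_pred_mv(reg, 10, 5, true, false, ~0ULL, ~0ULL); /* VSPLAT        */
        for (int k = 0; k < VL; k++)
            printf("x%d = 0x%llx\n", 10 + k, (unsigned long long)reg[10 + k]);
        return 0;
    }
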
413
414 ### C.MV Instruction <a name="c_mv"></a>
415
416 There is no MV instruction in RV however there is a C.MV instruction.
417 It is used for copying integer-to-integer registers (vectorised FMV
418 is used for copying floating-point).
419
420 If either the source or the destination register are marked as vectors
421 C.MV is reinterpreted to be a vectorised (multi-register) predicated
422 move operation. The actual instruction's format does not change:
423
424 [[!table data="""
425 15 12 | 11 7 | 6 2 | 1 0 |
426 funct4 | rd | rs | op |
427 4 | 5 | 5 | 2 |
428 C.MV | dest | src | C0 |
429 """]]
430
431 A simplified version of the pseudocode for this operation is as follows:
432
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
446
447 There are several different instructions from RVV that are covered by
448 this one opcode:
449
450 [[!table data="""
451 src | dest | predication | op |
452 scalar | vector | none | VSPLAT |
453 scalar | vector | destination | sparse VSPLAT |
454 scalar | vector | 1-bit dest | VINSERT |
455 vector | scalar | 1-bit? src | VEXTRACT |
456 vector | vector | none | VCOPY |
457 vector | vector | src | Vector Gather |
458 vector | vector | dest | Vector Scatter |
459 vector | vector | src & dest | Gather/Scatter |
460 vector | vector | src == dest | sparse VCOPY |
461 """]]
462
463 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
464 operations with zeroing off, and inversion on the src and dest predication
465 for one of the two C.MV operations. The non-inverted C.MV will place
466 one set of registers into the destination, and the inverted one the other
467 set. With predicate-inversion, copying and inversion of the predicate mask
468 need not be done as a separate (scalar) instruction.
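
As an illustration only (simplified to a single predicate per copy rather
than full twin-predication), the VMERGE effect of the two back-to-back
predicated moves can be modelled in C as:

    #include <stdio.h>
    #include <stdint.h>

    /* VMERGE modelled as two non-zeroing predicated copies, the second
     * with the predicate inverted: elements with the predicate bit set
     * come from `a`, the rest from `b`. */
    #define VL 4

    static void pred_copy(uint64_t *dst, const uint64_t *src,
                          uint64_t pred, int invert)
    {
        for (int i = 0; i < VL; i++) {
            uint64_t bit = (pred >> i) & 1;
            if (bit ^ (uint64_t)invert)     /* zeroing off: skip, don't clear */
                dst[i] = src[i];
        }
    }

    int main(void)
    {
        uint64_t a[VL] = {1, 2, 3, 4}, b[VL] = {10, 20, 30, 40}, d[VL] = {0};
        uint64_t pred = 0x5;        /* 0b0101 */
        pred_copy(d, a, pred, 0);   /* first C.MV: predicate as-is      */
        pred_copy(d, b, pred, 1);   /* second C.MV: predicate inverted  */
        for (int i = 0; i < VL; i++)
            printf("d[%d] = %llu\n", i, (unsigned long long)d[i]);
        return 0;
    }
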
469
Note that in the instance where the Compressed Extension is not implemented,
MV may be used instead, but note that it is a pseudo-operation mapping to
addi rd, rs, 0. The behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
475
476 ### FMV, FNEG and FABS Instructions
477
478 These are identical in form to C.MV, except covering floating-point
479 register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
482 operation of the appropriate size covering the source and destination
483 register bitwidths.
484
485 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
486
### FCVT Instructions
488
489 These are again identical in form to C.MV, except that they cover
490 floating-point to integer and integer to floating-point. When element
491 width in each vector is set to default, the instructions behave exactly
492 as they are defined for standard RV (scalar) operations, except vectorised
493 in exactly the same fashion as outlined in C.MV.
494
495 However when the source or destination element width is not set to default,
496 the opcode's explicit element widths are *over-ridden* to new definitions,
497 and the opcode's element width is taken as indicative of the SIMD width
498 (if applicable i.e. if packed SIMD is requested) instead.
499
500 For example FCVT.S.L would normally be used to convert a 64-bit
501 integer in register rs1 to a 64-bit floating-point number in rd.
502 If however the source rs1 is set to be a vector, where elwidth is set to
503 default/2 and "packed SIMD" is enabled, then the first 32 bits of
504 rs1 are converted to a floating-point number to be stored in rd's
505 first element and the higher 32-bits *also* converted to floating-point
506 and stored in the second. The 32 bit size comes from the fact that
507 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
508 divide that by two it means that rs1 element width is to be taken as 32.
509
510 Similar rules apply to the destination register.
511
512 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
513
514 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
515 the interpretation of the instruction fields). This
516 actually undermined the fundamental principle of SV, namely that there
517 be no modifications to the scalar behaviour (except where absolutely
518 necessary), in order to simplify an implementor's task if considering
519 converting a pre-existing scalar design to support parallelism.
520
521 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
522 do not change in SV, however just as with C.MV it is important to note
523 that dual-predication is possible.
524
525 In vectorised architectures there are usually at least two different modes
526 for LOAD/STORE:
527
528 * Read (or write for STORE) from sequential locations, where one
529 register specifies the address, and the one address is incremented
530 by a fixed amount. This is usually known as "Unit Stride" mode.
531 * Read (or write) from multiple indirected addresses, where the
532 vector elements each specify separate and distinct addresses.
533
534 To support these different addressing modes, the CSR Register "isvector"
535 bit is used. So, for a LOAD, when the src register is set to
536 scalar, the LOADs are sequentially incremented by the src register
537 element width, and when the src register is set to "vector", the
538 elements are treated as indirection addresses. Simplified
539 pseudo-code would look like this:
540
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
560
561 Notes:
562
563 * For simplicity, zeroing and elwidth is not included in the above:
564 the key focus here is the decision-making for srcbase; vectorised
565 rs means use sequentially-numbered registers as the indirection
566 address, and scalar rs is "offset" mode.
567 * The test towards the end for whether both source and destination are
568 scalar is what makes the above pseudo-code provide the "standard" RV
569 Base behaviour for LD operations.
570 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
572 (8 bytes), and also whether the element width is over-ridden
573 (see special element width section).
574
575 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
576
577 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
578 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
579 It is therefore possible to use predicated C.LWSP to efficiently
580 pop registers off the stack (by predicating x2 as the source), cherry-picking
581 which registers to store to (by predicating the destination). Likewise
582 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
583
584 The two modes ("unit stride" and multi-indirection) are still supported,
585 as with standard LD/ST. Essentially, the only difference is that the
586 use of x2 is hard-coded into the instruction.
587
588 **Note**: it is still possible to redirect x2 to an alternative target
589 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
590 general-purpose LOAD/STORE operations.
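
A hypothetical software model of the "pop multiple" effect (register
numbers, the predicate value and the unit-stride source are all
illustrative assumptions, not a mandated sequence):

    #include <stdio.h>
    #include <stdint.h>

    /* Predicated C.LWSP as "pop multiple": x2 provides a unit-stride
     * source, and the destination predicate cherry-picks which registers
     * receive values. */
    #define VL 4

    int main(void)
    {
        uint32_t stack[VL] = {0x11, 0x22, 0x33, 0x44}; /* memory at [x2] */
        uint32_t reg[32]   = {0};
        uint32_t dest_pred = 0xb;                      /* 0b1011: x8, x9, x11 */
        int      rd_base   = 8;

        for (int i = 0, j = 0; i < VL && j < VL; i++, j++) {
            while (j < VL && !((dest_pred >> j) & 1))
                j++;                      /* skip masked-out destinations */
            if (j >= VL)
                break;
            reg[rd_base + j] = stack[i];  /* next stack slot, next chosen reg */
        }
        for (int k = 0; k < VL; k++)
            printf("x%d = 0x%x\n", rd_base + k, reg[rd_base + k]);
        return 0;
    }
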
591
592 ## Compressed LOAD / STORE Instructions
593
594 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
597 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
598 to "Multi-indirection", respectively.
599
600 # Element bitwidth polymorphism <a name="elwidth"></a>
601
602 Element bitwidth is best covered as its own special section, as it
603 is quite involved and applies uniformly across-the-board. SV restricts
604 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
605
606 The effect of setting an element bitwidth is to re-cast each entry
607 in the register table, and for all memory operations involving
608 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
610 now looks like this:
611
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
621
622 where the CSR Register table entry (not the instruction alone) determines
623 which of those union entries is to be used on each operation, and the
624 VL element offset in the hardware-loop specifies the index into each array.
625
However a naive interpretation of the data structure above masks the
fact that, when the bitwidth is 8 and VL is set greater than 8 (for example),
accessing one specific register "spills over" to the following parts of
the register file in a sequential fashion. So a much more accurate way
630 to reflect this would be:
631
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
642
where when accessing any individual regfile[n].b entry it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" to consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if any attempt is made to access beyond the
"real" register bytes.
652
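A small C sketch (helper name and layout assumed purely for illustration)
of how an element index at a given elwidth maps onto the byte-addressed
register file, spilling over into the following register:

    #include <stdio.h>

    /* RV64 assumed: 8 bytes per register. */
    #define REGWIDTH_BYTES 8

    static void locate(int reg, int elwidth_bits, int element)
    {
        int bytes_per_el = elwidth_bits / 8;
        int byteoffs     = element * bytes_per_el;        /* from start of reg */
        int actual_reg   = reg + byteoffs / REGWIDTH_BYTES;
        int offs_in_reg  = byteoffs % REGWIDTH_BYTES;
        printf("reg x%d elwidth=%d element %2d -> x%d, byte offset %d\n",
               reg, elwidth_bits, element, actual_reg, offs_in_reg);
    }

    int main(void)
    {
        /* with elwidth=8, elements 8..11 "spill over" into the next register */
        for (int e = 0; e < 12; e++)
            locate(5, 8, e);
        locate(5, 16, 5);   /* 16-bit element 5 lands in x6, bytes 2..3 */
        return 0;
    }
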
Now we may modify the pseudo-code for an operation where all element bitwidths
have been set to the same size, where this pseudo-code is otherwise identical
to its "non"-polymorphic version (above):
656
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
679
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each of them, respectively) are
"type-cast" to 8-bit; likewise for 16-bit entries, and so on.
683
684 However that only covers the case where the element widths are the same.
685 Where the element widths are different, the following algorithm applies:
686
687 * Analyse the bitwidth of all source operands and work out the
688 maximum. Record this as "maxsrcbitwidth"
689 * If any given source operand requires sign-extension or zero-extension
690 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
691 sign-extension / zero-extension or whatever is specified in the standard
692 RV specification, **change** that to sign-extending from the respective
693 individual source operand's bitwidth from the CSR table out to
694 "maxsrcbitwidth" (previously calculated), instead.
695 * Following separate and distinct (optional) sign/zero-extension of all
696 source operands as specifically required for that operation, carry out the
697 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
698 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
700 into a copy).
701 * If the destination operand requires sign-extension or zero-extension,
702 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
704 etc.), overload the RV specification with the bitwidth from the
705 destination register's elwidth entry.
706 * Finally, store the (optionally) sign/zero-extended value into its
707 destination: memory for sb/sw etc., or an offset section of the register
708 file for an arithmetic operation.
709
710 In this way, polymorphic bitwidths are achieved without requiring a
711 massive 64-way permutation of calculations **per opcode**, for example
712 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
713 rd bitwidths). The pseudo-code is therefore as follows:
714
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2)  # source element width(s)
    destwid = bw(int_csr[rd].elwidth)      # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
780
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
784
785 * the source operands are extended out to the maximum bitwidth of all
786 source operands
787 * the operation takes place at that maximum source bitwidth (the
788 destination bitwidth is not involved at this point, at all)
789 * the result is extended (or potentially even, truncated) before being
790 stored in the destination. i.e. truncation (if required) to the
791 destination width occurs **after** the operation **not** before.
792 * when the destination is not marked as "vectorised", the **full**
793 (standard, scalar) register file entry is taken up, i.e. the
794 element is either sign-extended or zero-extended to cover the
795 full register bitwidth (XLEN) if it is not already XLEN bits long.
796
797 Implementors are entirely free to optimise the above, particularly
798 if it is specifically known that any given operation will complete
799 accurately in less bits, as long as the results produced are
800 directly equivalent and equal, for all inputs and all outputs,
801 to those produced by the above algorithm.
802
803 ## Polymorphic floating-point operation exceptions and error-handling
804
805 For floating-point operations, conversion takes place without
806 raising any kind of exception. Exactly as specified in the standard
807 RV specification, NAN (or appropriate) is stored if the result
808 is beyond the range of the destination, and, again, exactly as
809 with the standard RV specification just as with scalar
810 operations, the floating-point flag is raised (FCSR). And, again, just as
811 with scalar operations, it is software's responsibility to check this flag.
812 Given that the FCSR flags are "accrued", the fact that multiple element
813 operations could have occurred is not a problem.
814
815 Note that it is perfectly legitimate for floating-point bitwidths of
816 only 8 to be specified. However whilst it is possible to apply IEEE 754
817 principles, no actual standard yet exists. Implementors wishing to
818 provide hardware-level 8-bit support rather than throw a trap to emulate
819 in software should contact the author of this specification before
820 proceeding.
821
822 ## Polymorphic shift operators
823
824 A special note is needed for changing the element width of left and right
825 shift operators, particularly right-shift. Even for standard RV base,
826 in order for correct results to be returned, the second operand RS2 must
827 be truncated to be within the range of RS1's bitwidth. spike's implementation
828 of sll for example is as follows:
829
    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
831
832 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
833 range 0..31 so that RS1 will only be left-shifted by the amount that
834 is possible to fit into a 32-bit register. Whilst this appears not
835 to matter for hardware, it matters greatly in software implementations,
836 and it also matters where an RV64 system is set to "RV32" mode, such
837 that the underlying registers RS1 and RS2 comprise 64 hardware bits
838 each.
839
840 For SV, where each operand's element bitwidth may be over-ridden, the
841 rule about determining the operation's bitwidth *still applies*, being
842 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
843 **also applies to the truncation of RS2**. In other words, *after*
844 determining the maximum bitwidth, RS2's range must **also be truncated**
845 to ensure a correct answer. Example:
846
847 * RS1 is over-ridden to a 16-bit width
848 * RS2 is over-ridden to an 8-bit width
849 * RD is over-ridden to a 64-bit width
850 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
851 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
852
853 Pseudocode (in spike) for this example would therefore be:
854
    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
856
857 This example illustrates that considerable care therefore needs to be
858 taken to ensure that left and right shift operations are implemented
859 correctly. The key is that
860
861 * The operation bitwidth is determined by the maximum bitwidth
862 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
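
A minimal C sketch of this rule, assuming illustrative helper names and
zero-extension/truncation of the result (the actual extension depends on
the opcode):

    #include <stdio.h>
    #include <stdint.h>

    /* Operation width is max(source elwidths); RS2 is masked to that
     * width's range *before* shifting; the result is then truncated
     * (or extended) to the destination elwidth. */
    static uint64_t sv_sll(uint64_t rs1, uint64_t rs2,
                           int rs1_wid, int rs2_wid, int rd_wid)
    {
        int opwid = rs1_wid > rs2_wid ? rs1_wid : rs2_wid;   /* max(8,16)=16 */
        uint64_t opmask = (opwid == 64) ? ~0ULL : ((1ULL << opwid) - 1);
        uint64_t shamt  = rs2 & (uint64_t)(opwid - 1);       /* RS2 truncated */
        uint64_t result = ((rs1 & opmask) << shamt) & opmask;
        uint64_t rdmask = (rd_wid == 64) ? ~0ULL : ((1ULL << rd_wid) - 1);
        return result & rdmask;       /* truncate/extend to dest width */
    }

    int main(void)
    {
        /* RS1 elwidth 16, RS2 elwidth 8, RD elwidth 64: shift happens at 16 bits */
        printf("0x%llx\n", (unsigned long long)sv_sll(0x00ff, 20, 16, 8, 64));
        return 0;
    }
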
864
865 ## Polymorphic MULH/MULHU/MULHSU
866
867 MULH is designed to take the top half MSBs of a multiply that
868 does not fit within the range of the source operands, such that
869 smaller width operations may produce a full double-width multiply
870 in two cycles. The issue is: SV allows the source operands to
871 have variable bitwidth.
872
873 Here again special attention has to be paid to the rules regarding
874 bitwidth, which, again, are that the operation is performed at
875 the maximum bitwidth of the **source** registers. Therefore:
876
877 * An 8-bit x 8-bit multiply will create a 16-bit result that must
878 be shifted down by 8 bits
879 * A 16-bit x 8-bit multiply will create a 24-bit result that must
880 be shifted down by 16 bits (top 8 bits being zero)
881 * A 16-bit x 16-bit multiply will create a 32-bit result that must
882 be shifted down by 16 bits
883 * A 32-bit x 16-bit multiply will create a 48-bit result that must
884 be shifted down by 32 bits
885 * A 32-bit x 8-bit multiply will create a 40-bit result that must
886 be shifted down by 32 bits
887
888 So again, just as with shift-left and shift-right, the result
889 is shifted down by the maximum of the two source register bitwidths.
890 And, exactly again, truncation or sign-extension is performed on the
891 result. If sign-extension is to be carried out, it is performed
892 from the same maximum of the two source register bitwidths out
893 to the result element's bitwidth.
894
895 If truncation occurs, i.e. the top MSBs of the result are lost,
896 this is "Officially Not Our Problem", i.e. it is assumed that the
897 programmer actually desires the result to be truncated. i.e. if the
898 programmer wanted all of the bits, they would have set the destination
899 elwidth to accommodate them.
900
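A short C sketch of the MULHU case under these rules (widths and helper
names are illustrative assumptions; `unsigned __int128` is a common
compiler extension used here only to hold the full product):

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply at max(source elwidths), take the top half by shifting
     * down by that same maximum, then truncate to the destination width. */
    static uint64_t sv_mulhu(uint64_t rs1, uint64_t rs2,
                             int rs1_wid, int rs2_wid, int rd_wid)
    {
        int opwid = rs1_wid > rs2_wid ? rs1_wid : rs2_wid;
        uint64_t m1 = (rs1_wid == 64) ? rs1 : (rs1 & ((1ULL << rs1_wid) - 1));
        uint64_t m2 = (rs2_wid == 64) ? rs2 : (rs2 & ((1ULL << rs2_wid) - 1));
        unsigned __int128 prod = (unsigned __int128)m1 * m2;
        uint64_t hi = (uint64_t)(prod >> opwid);  /* shift down by max source width */
        uint64_t rdmask = (rd_wid == 64) ? ~0ULL : ((1ULL << rd_wid) - 1);
        return hi & rdmask;
    }

    int main(void)
    {
        /* 16-bit x 8-bit: 24-bit product, top half obtained by shifting down 16 */
        printf("0x%llx\n", (unsigned long long)sv_mulhu(0x1234, 0xff, 16, 8, 16));
        return 0;
    }
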
901 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
902
903 Polymorphic element widths in vectorised form means that the data
904 being loaded (or stored) across multiple registers needs to be treated
905 (reinterpreted) as a contiguous stream of elwidth-wide items, where
906 the source register's element width is **independent** from the destination's.
907
908 This makes for a slightly more complex algorithm when using indirection
909 on the "addressed" register (source for LOAD and destination for STORE),
910 particularly given that the LOAD/STORE instruction provides important
911 information about the width of the data to be reinterpreted.
912
913 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
914 was as follows, and i is the loop from 0 to VL-1:
915
    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
918
919 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
920 chunks are taken from the source memory location addressed by the current
921 indexed source address register, and only when a full 32-bits-worth
922 are taken will the index be moved on to the next contiguous source
923 address register:
924
    bitwidth = bw(elwidth); // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock; // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
930
931 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
932 and 128 for LQ.
933
934 The principle is basically exactly the same as if the srcbase were pointing
935 at the memory of the *register* file: memory is re-interpreted as containing
936 groups of elwidth-wide discrete elements.
937
938 When storing the result from a load, it's important to respect the fact
939 that the destination register has its *own separate element width*. Thus,
940 when each element is loaded (at the source element width), any sign-extension
941 or zero-extension (or truncation) needs to be done to the *destination*
942 bitwidth. Also, the storing has the exact same analogous algorithm as
943 above, where in fact it is just the set\_polymorphed\_reg pseudocode
944 (completely unchanged) used above.
945
946 One issue remains: when the source element width is **greater** than
947 the width of the operation, it is obvious that a single LB for example
948 cannot possibly obtain 16-bit-wide data. This condition may be detected
949 where, when using integer divide, elsperblock (the width of the LOAD
950 divided by the bitwidth of the element) is zero.
951
952 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
953
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
955
956 The elements, if the element bitwidth is larger than the LD operation's
957 size, will then be sign/zero-extended to the full LD operation size, as
958 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
959 being passed on to the second phase.
960
961 As LOAD/STORE may be twin-predicated, it is important to note that
962 the rules on twin predication still apply, except where in previous
963 pseudo-code (elwidth=default for both source and target) it was
964 the *registers* that the predication was applied to, it is now the
965 **elements** that the predication is applied to.
966
967 Thus the full pseudocode for all LD operations may be written out
968 as follows:
969
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = bw(int_csr[rd].elwidth)  # destination element width
        srcwid = bw(int_csr[rs].elwidth)   # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, srcwid))
            else:
                val = sign_extend(val, min(opwidth, srcwid))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1007
1008 Note:
1009
1010 * when comparing against for example the twin-predicated c.mv
1011 pseudo-code, the pattern of independent incrementing of rd and rs
1012 is preserved unchanged.
1013 * just as with the c.mv pseudocode, zeroing is not included and must be
1014 taken into account (TODO).
1015 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1016 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1017 VSCATTER characteristics.
1018 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1019 a destination that is not vectorised (marked as scalar) will
1020 result in the element being fully sign-extended or zero-extended
1021 out to the full register file bitwidth (XLEN). When the source
1022 is also marked as scalar, this is how the compatibility with
1023 standard RV LOAD/STORE is preserved by this algorithm.
1024
1025 ### Example Tables showing LOAD elements
1026
1027 This section contains examples of vectorised LOAD operations, showing
1028 how the two stage process works (three if zero/sign-extension is included).
1029
1030
1031 #### Example: LD x8, x5(0), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1032
1033 This is:
1034
1035 * a 64-bit load, with an offset of zero
1036 * with a source-address elwidth of 16-bit
1037 * into a destination-register with an elwidth of 32-bit
1038 * where VL=7
1039 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1040 * RV64, where XLEN=64 is assumed.
1041
First, the memory table: because the element width is 16 and the operation
is LD (64-bit), the 64 bits loaded from memory are subdivided into groups
of **four** elements. And, with VL being 7 (deliberately, to illustrate that
this is reasonable and possible), the first four are sourced from the offset
addresses pointed to by x5, and the next three from the offset addresses
pointed to by the next contiguous register, x6:
1049
1050 [[!table data="""
1051 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1052 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1053 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1054 """]]
1055
1056 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1057 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1058
1059 [[!table data="""
1060 byte 3 | byte 2 | byte 1 | byte 0 |
1061 0x0 | 0x0 | elem0 ||
1062 0x0 | 0x0 | elem1 ||
1063 0x0 | 0x0 | elem2 ||
1064 0x0 | 0x0 | elem3 ||
1065 0x0 | 0x0 | elem4 ||
1066 0x0 | 0x0 | elem5 ||
1067 0x0 | 0x0 | elem6 ||
1069 """]]
1070
1071 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1072 byte-addressable "memory". That "memory" happens to cover registers
1073 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1074
1075 [[!table data="""
1076 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1077 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1078 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1079 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1080 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1081 """]]
1082
1083 Thus we have data that is loaded from the **addresses** pointed to by
1084 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1085 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1089
1090 Note that whilst the memory addressing table is shown left-to-right byte order,
1091 the registers are shown in right-to-left (MSB) order. This does **not**
1092 imply that bit or byte-reversal is carried out: it's just easier to visualise
1093 memory as being contiguous bytes, and emphasises that registers are not
1094 really actually "memory" as such.
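
The placement in the example above can be recomputed with the following
short C sketch (names are local to this illustration and not part of the
specification):

    #include <stdio.h>

    /* LD, source elwidth 16, dest elwidth 32, VL=7, base registers x5
     * (source addresses) and x8 (destination), RV64. */
    int main(void)
    {
        int VL = 7, opwidth = 64, src_elwidth = 16, dest_elwidth = 32;
        int src_base = 5, dest_base = 8, xlen_bytes = 8;

        int elsperblock  = opwidth / src_elwidth;            /* 4 elements per LD */
        int dest_per_reg = xlen_bytes / (dest_elwidth / 8);   /* 2 per register   */

        for (int i = 0; i < VL; i++) {
            int src_reg   = src_base + i / elsperblock;
            int src_offs  = (i % elsperblock) * (src_elwidth / 8);
            int dest_reg  = dest_base + i / dest_per_reg;
            int dest_offs = (i % dest_per_reg) * (dest_elwidth / 8);
            printf("elem %d: loaded from @x%d + %d bytes -> x%d bytes %d..%d\n",
                   i, src_reg, src_offs, dest_reg, dest_offs,
                   dest_offs + dest_elwidth / 8 - 1);
        }
        return 0;
    }
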
1095
1096 ## Why SV bitwidth specification is restricted to 4 entries
1097
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit
1103
This would seem inadequate: surely it would be better to have 3 bits or
more and also allow 64, 128 and other options besides. The answer is
that it gets too complex, that no RV128 implementation yet exists, and that
RV64's default is already 64 bit, so the four major element widths are
covered anyway.
1108
There is an absolutely crucial aspect of SV here that explicitly
1110 needs spelling out, and it's whether the "vectorised" bit is set in
1111 the Register's CSR entry.
1112
If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when an elwidth override is set
on a destination (RD), sign-extension and zero-extension, whilst changed
to match the override bitwidth, will overwrite the **full** register entry
(64-bit if RV64).
1118
1119 When vectorised is *set*, this indicates that the operation now treats
1120 **elements** as if they were independent registers, so regardless of
1121 the length, any parts of a given actual register that are not involved
1122 in the operation are **NOT** modified, but are **PRESERVED**.
1123
1124 For example:
1125
1126 * when the vector bit is clear and elwidth set to 16 on the destination
1127 register, operations are truncated to 16 bit and then sign or zero
1128 extended to the *FULL* XLEN register width.
1129 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1130 groups of elwidth sized elements do not fill an entire XLEN register),
1131 the "top" bits of the destination register do *NOT* get modified, zero'd
1132 or otherwise overwritten.
1133
1134 SIMD micro-architectures may implement this by using predication on
1135 any elements in a given actual register that are beyond the end of
a multi-element operation.
1137
1138 Other microarchitectures may choose to provide byte-level write-enable
1139 lines on the register file, such that each 64 bit register in an RV64
1140 system requires 8 WE lines. Scalar RV64 operations would require
1141 activation of all 8 lines, where SV elwidth based operations would
1142 activate the required subset of those byte-level write lines.
1143
1144 Example:
1145
1146 * rs1, rs2 and rd are all set to 8-bit
1147 * VL is set to 3
1148 * RV64 architecture is set (UXL=64)
1149 * add operation is carried out
1150 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1151 concatenated with similar add operations on bits 15..8 and 7..0
1152 * bits 24 through 63 **remain as they originally were**.
1153
1154 Example SIMD micro-architectural implementation:
1155
1156 * SIMD architecture works out the nearest round number of elements
1157 that would fit into a full RV64 register (in this case: 8)
1158 * SIMD architecture creates a hidden predicate, binary 0b00000111
1159 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1160 * SIMD architecture goes ahead with the add operation as if it
1161 was a full 8-wide batch of 8 adds
1162 * SIMD architecture passes top 5 elements through the adders
1163 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
and stores them in rd.
1166
1167 This requires a read on rd, however this is required anyway in order
1168 to support non-zeroing mode.
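
A brief C sketch of the byte-level write-enable approach, using illustrative
values (elwidth=8, VL=3, RV64); the mask computation is one possible
implementation, not a requirement:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int VL = 3, elwidth_bytes = 1;              /* 8-bit elements */
        uint64_t we_bytes = (1ULL << (VL * elwidth_bytes)) - 1;  /* 0b111 */

        /* expand the per-byte enables into a per-bit mask */
        uint64_t bitmask = 0;
        for (int b = 0; b < 8; b++)
            if ((we_bytes >> b) & 1)
                bitmask |= 0xffULL << (8 * b);

        uint64_t old_rd = 0xdeadbeefcafef00dULL;
        uint64_t alu    = 0x0000000000112233ULL;    /* three 8-bit add results */
        /* bytes 3..7 of rd are preserved; only the enabled bytes change */
        uint64_t new_rd = (old_rd & ~bitmask) | (alu & bitmask);

        printf("new rd = 0x%016llx\n", (unsigned long long)new_rd);
        return 0;
    }
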
1169
1170 ## Polymorphic floating-point
1171
1172 Standard scalar RV integer operations base the register width on XLEN,
1173 which may be changed (UXL in USTATUS, and the corresponding MXL and
1174 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1175 arithmetic operations are therefore restricted to an active XLEN bits,
1176 with sign or zero extension to pad out the upper bits when XLEN has
1177 been dynamically set to less than the actual register size.
1178
1179 For scalar floating-point, the active (used / changed) bits are
1180 specified exclusively by the operation: ADD.S specifies an active
1181 32-bits, with the upper bits of the source registers needing to
1182 be all 1s ("NaN-boxed"), and the destination upper bits being
1183 *set* to all 1s (including on LOAD/STOREs).
1184
1185 Where elwidth is set to default (on any source or the destination)
1186 it is obvious that this NaN-boxing behaviour can and should be
1187 preserved. When elwidth is non-default things are less obvious,
1188 so need to be thought through. Here is a normal (scalar) sequence,
1189 assuming an RV64 which supports Quad (128-bit) FLEN:
1190
1191 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1192 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1193 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1194 top 64 MSBs ignored.
1195
1196 Therefore it makes sense to mirror this behaviour when, for example,
1197 elwidth is set to 32. Assume elwidth set to 32 on all source and
1198 destination registers:
1199
1200 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1201 floating-point numbers.
1202 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1203 in bits 0-31 and the second in bits 32-63.
1204 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1205
1206 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1207 of the registers either during the FLD **or** the ADD.D. The reason
1208 is that, effectively, the top 64 MSBs actually represent a completely
1209 independent 64-bit register, so overwriting it is not only gratuitous
1210 but may actually be harmful for a future extension to SV which may
1211 have a way to directly access those top 64 bits.
1212
1213 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1215 when "isvec" is false in a given register's CSR entry. Only when the
1216 elwidth is set to default **and** isvec is false will the standard
1217 RV behaviour be followed, namely that the upper bits be modified.
1218
1219 Ultimately if elwidth is default and isvec false on *all* source
1220 and destination registers, a SimpleV instruction defaults completely
1221 to standard RV scalar behaviour (this holds true for **all** operations,
1222 right across the board).
1223
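The write-back rule can be sketched in Python (illustrative only: the
register-file model, helper name and parameters are assumptions, and only
the lowest element of the register is shown for simplicity):

    FLEN = 128   # assume a Quad-capable register file, as in the example above

    def fp_writeback(old_reg, value, elwidth_is_default, isvec, opwidth):
        # returns the new FLEN-bit register contents after writing one
        # element of opwidth bits into the bottom of the register
        lowmask = (1 << opwidth) - 1
        if elwidth_is_default and not isvec:
            # standard scalar RV behaviour: NaN-box (upper bits set to all 1s)
            upper = ((1 << (FLEN - opwidth)) - 1) << opwidth
            return upper | (value & lowmask)
        # SV polymorphic behaviour: bits above the element are left untouched
        return (old_reg & ~lowmask) | (value & lowmask)
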
The nice thing here is that when elwidth is set to a non-default value,
ADD.S, ADD.D and ADD.Q are effectively all the same: they all still perform
multiple ADD operations, just at the overridden element width rather than
at their nominal widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1230
In the meantime, although with e.g. VL set to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is about to receive a higher VL value. On a superscalar OoO architecture
there may be absolutely no difference; however simpler SIMD-style
microarchitectures may not have the infrastructure in place to know the
difference, such that when VL=8 an ADD.D instruction
completes in 2 cycles (or more) rather than one, where
an ADD.Q issued instead on such simpler microarchitectures
would complete in one.
1241
1242 ## Specific instruction walk-throughs
1243
1244 This section covers walk-throughs of the above-outlined procedure
1245 for converting standard RISC-V scalar arithmetic operations to
1246 polymorphic widths, to ensure that it is correct.
1247
1248 ### add
1249
1250 Standard Scalar RV32/RV64 (xlen):
1251
1252 * RS1 @ xlen bits
1253 * RS2 @ xlen bits
1254 * add @ xlen bits
1255 * RD @ xlen bits
1256
1257 Polymorphic variant:
1258
1259 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1260 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1261 * add @ max(rs1, rs2) bits
1262 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1263
1264 Note here that polymorphic add zero-extends its source operands,
1265 where addw sign-extends.
1266
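The zero-extension rule above can be sketched in Python (illustrative
only; element indexing and the polymorphed-register read/write helpers
used elsewhere in this appendix are omitted for brevity):

    def polymorphic_add(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        src1 = rs1_val & ((1 << rs1_bits) - 1)    # zero-extend to opwidth
        src2 = rs2_val & ((1 << rs2_bits) - 1)    # zero-extend to opwidth
        result = (src1 + src2) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return result                         # zero-extend into rd
        return result & ((1 << rd_bits) - 1)      # otherwise truncate
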
1267 ### addw
1268
1269 The RV Specification specifically states that "W" variants of arithmetic
1270 operations always produce 32-bit signed values. In a polymorphic
1271 environment it is reasonable to assume that the signed aspect is
1272 preserved, where it is the length of the operands and the result
1273 that may be changed.
1274
1275 Standard Scalar RV64 (xlen):
1276
1277 * RS1 @ xlen bits
1278 * RS2 @ xlen bits
1279 * add @ xlen bits
1280 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1281
1282 Polymorphic variant:
1283
1284 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1285 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1286 * add @ max(rs1, rs2) bits
1287 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1288
1289 Note here that polymorphic addw sign-extends its source operands,
1290 where add zero-extends.
1291
1292 This requires a little more in-depth analysis. Where the bitwidth of
1293 rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.
1296
1297 Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
1298 where for add they are both zero-extended. This holds true for all arithmetic
1299 operations ending with "W".
1300
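For comparison, a sketch of the sign-extending addw rule, using a
hypothetical sign_extend helper (illustrative only):

    def sign_extend(value, frombits, tobits):
        value &= (1 << frombits) - 1
        if value & (1 << (frombits - 1)):         # negative: set upper bits
            value |= ((1 << tobits) - 1) & ~((1 << frombits) - 1)
        return value

    def polymorphic_addw(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        src1 = sign_extend(rs1_val, rs1_bits, opwidth)
        src2 = sign_extend(rs2_val, rs2_bits, opwidth)
        result = (src1 + src2) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return sign_extend(result, opwidth, rd_bits)
        return result & ((1 << rd_bits) - 1)      # otherwise truncate
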
1301 ### addiw
1302
1303 Standard Scalar RV64I:
1304
1305 * RS1 @ xlen bits, truncated to 32-bit
1306 * immed @ 12 bits, sign-extended to 32-bit
1307 * add @ 32 bits
* RD @ xlen bits, sign-extend the 32-bit result to xlen.
1309
1310 Polymorphic variant:
1311
1312 * RS1 @ rs1 bits
1313 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1314 * add @ max(rs1, 12) bits
1315 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
1316
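The corresponding sketch for addiw, reusing the hypothetical sign_extend
helper from the addw sketch above (illustrative only):

    def polymorphic_addiw(rs1_val, rs1_bits, imm12, rd_bits):
        opwidth = max(rs1_bits, 12)
        src1 = rs1_val & ((1 << rs1_bits) - 1)
        imm = sign_extend(imm12, 12, opwidth)     # immediate sign-extended
        result = (src1 + imm) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return sign_extend(result, opwidth, rd_bits)
        return result & ((1 << rd_bits) - 1)      # otherwise truncate
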
1317 # Predication Element Zeroing
1318
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1329
1330 SimpleV's design principle is not based on or influenced by
1331 microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1336
1337 ## Single-predication (based on destination register)
1338
1339 Zeroing on predication for arithmetic operations is taken from
1340 the destination register's predicate. i.e. the predication *and*
1341 zeroing settings to be applied to the whole operation come from the
1342 CSR Predication table entry for the destination register.
1343 Thus when zeroing is set on predication of a destination element,
1344 if the predication bit is clear, then the destination element is *set*
1345 to zero (twin-predication is slightly different, and will be covered
1346 next).
1347
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1350
    for (i = 0; i < VL; i++)
       if not zeroing: # an optimisation
          # skip ahead over elements whose predicate bit is zero
          while (!(predval & 1<<i) && i < VL)
             if (int_vec[rd ].isvector)  { id += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
          if i == VL:
             return
       if (predval & 1<<i)
          src1 = ....
          src2 = ...
          result = src1 + src2 # actual add (or other op) here
          set_polymorphed_reg(rd, destwid, ird, result)
          if int_vec[rd].ffirst and result == 0:
             VL = i # result was zero, end loop early, return VL
             return
          if (!int_vec[rd].isvector) return
       else if zeroing:
          result = 0
          set_polymorphed_reg(rd, destwid, ird, result)
       if (int_vec[rd ].isvector)  { id += 1; }
       else if (predval & 1<<i) return
       if (int_vec[rs1].isvector)  { irs1 += 1; }
       if (int_vec[rs2].isvector)  { irs2 += 1; }
       if (rd == VL or rs1 == VL or rs2 == VL): return
1377
1378 The optimisation to skip elements entirely is only possible for certain
1379 micro-architectures when zeroing is not set. However for lane-based
1380 micro-architectures this optimisation may not be practical, as it
1381 implies that elements end up in different "lanes". Under these
1382 circumstances it is perfectly fine to simply have the lanes
1383 "inactive" for predicated elements, even though it results in
1384 less than 100% ALU utilisation.
1385
1386 ## Twin-predication (based on source and destination register)
1387
Twin-predication is not that much different, except that
1389 the source is independently zero-predicated from the destination.
1390 This means that the source may be zero-predicated *or* the
1391 destination zero-predicated *or both*, or neither.
1392
When, with twin-predication, zeroing is set on the source and not
the destination, a clear source predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1399
1400 When zeroing is set on the destination and not the source, then just
1401 as with single-predicated operations, a zero is stored into the destination
1402 element (or target memory address for a STORE).
1403
Zeroing on both source and destination effectively results in an AND
of the source and destination predicates: where either the source
predicate OR the destination predicate is set to 0,
a zero element will ultimately end up in the destination register.
1408
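A small worked example (illustrative only; with both zeroing flags set,
no element-skipping occurs, so source and destination indices advance in
lockstep):

    VL = 4
    ps = 0b1011     # source predicate (source zeroing set)
    pd = 0b1101     # destination predicate (destination zeroing set)
    for j in range(VL):
        if (pd >> j) & 1 and (ps >> j) & 1:
            print("element", j, "receives real source data")
        else:
            print("element", j, "receives zero")  # one of the bits was clear
    # elements 0 and 3 receive data; elements 1 and 2 receive zero
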
1409 However: this may not necessarily be the case for all operations;
1410 implementors, particularly of custom instructions, clearly need to
1411 think through the implications in each and every case.
1412
1413 Here is pseudo-code for a twin zero-predicated operation:
1414
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          # with zeroing not set, skip past clear predicate bits
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i))
                sourcedata = ireg[rs+i];
             else
                sourcedata = 0 # zerosrc: a zero element is passed through
             ireg[rd+j] <= sourcedata
          else if (zerodst)
             ireg[rd+j] <= 0
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break;
1438
1439 Note that in the instance where the destination is a scalar, the hardware
1440 loop is ended the moment a value *or a zero* is placed into the destination
1441 register/element. Also note that, for clarity, variable element widths
1442 have been left out of the above.
1443
1444 # Subsets of RV functionality
1445
1446 This section describes the differences when SV is implemented on top of
1447 different subsets of RV.
1448
1449 ## Common options
1450
1451 It is permitted to only implement SVprefix and not the VBLOCK instruction
1452 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1453 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1454 traps may emulate the format.
1455
1456 It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
1458 *MUST* raise illegal instruction on implementations that do not support
1459 VL or SUBVL.
1460
1461 It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However, going
below the mandatory limits set in the RV standard will result in non-compliance
1464 with the SV Specification.
1465
1466 ## RV32 / RV32F
1467
1468 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1469 maximum limit for predication is also restricted to 32 bits. Whilst not
strictly an "option", it is worth noting.
1471
1472 ## RV32G
1473
1474 Normally in standard RV32 it does not make much sense to have
RV32G. The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1478
1479 In an earlier draft of SV, it was possible to specify an elwidth
1480 of double the standard register size: this had to be dropped,
1481 and may be reintroduced in future revisions.
1482
1483 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1484
1485 When floating-point is not implemented, the size of the User Register and
1486 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1487 per table).
1488
1489 ## RV32E
1490
1491 In embedded scenarios the User Register and Predication CSRs may be
1492 dropped entirely, or optionally limited to 1 CSR, such that the combined
1493 number of entries from the M-Mode CSR Register table plus U-Mode
1494 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1495 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1496 the Predication CSR tables.
1497
1498 RV32E is the most likely candidate for simply detecting that registers
1499 are marked as "vectorised", and generating an appropriate exception
1500 for the VL loop to be implemented in software.
1501
1502 ## RV128
1503
RV128 has not been especially considered here; however it has some
1505 extremely large possibilities: double the element width implies
1506 256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1508
1509 # Example usage
1510
1511 TODO evaluate strncpy and strlen
1512 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1513
1514 ## strncpy
1515
RVV version: <a name="strncpy"></a>
1517
    strncpy:
        mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8    # Vectors of bytes.
        vlbff.v v1, (a1)        # Get src bytes
        vseq.vi v0, v1, 0       # Flag zero bytes
        vmfirst a4, v0          # Zero found?
        vmsif.v v0, v0          # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t    # Write out bytes
        bgez a4, exit           # Done
        csrr t1, vl             # Get number of bytes fetched
        add a1, a1, t1          # Bump src pointer
        sub a2, a2, t1          # Decrement count.
        add a3, a3, t1          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
1536
1537 SV version (WIP):
1538
    strncpy:
        mv a3, a0
        SETMVLI 8               # set max vector to 8
        RegCSR[a3] = 8bit, a3, scalar
        RegCSR[a1] = 8bit, a1, scalar
        RegCSR[t0] = 8bit, t0, vector
        PredTb[t0] = ffirst, x0, inv
    loop:
        SETVLI a2, t4           # t4 and VL now 1..8
        ldb t0, (a1)            # t0 fail first mode
        bne t0, x0, allnonzero  # still ff
        # VL points to last nonzero
        GETVL t4                # from bne tests
        addi t4, t4, 1          # include zero
        SETVL t4                # set exactly to t4
        stb t0, (a3)            # store incl zero
        ret                     # end subroutine
    allnonzero:
        stb t0, (a3)            # VL legal range
        GETVL t4                # from bne tests
        add a1, a1, t4          # Bump src pointer
        sub a2, a2, t4          # Decrement count.
        add a3, a3, t4          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
1565
1566 Notes:
1567
1568 * Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
1569 * obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
1570 * with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
1571 * RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
1572 * RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1573 * with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
1574 * setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
1575 * ldb and bne are both using t0, both in ffirst mode
1576 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
1577 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
* SETVL sets *exactly* the requested amount into VL.
* the SETVL just after the allnonzero label is needed in case the ldb ffirst activates but the bne's fail-first does not.
1583 * this would cause the stb to copy up to the end of the legal memory
1584 * of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
1585
1586 ## strcpy
1587
1588 RVV version:
1589
        mv a3, a0               # Save start
    loop:
        setvli a1, x0, vint8    # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)       # Get bytes
        csrr a1, vl             # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0       # Set v0[i] where v1[i] = 0
        add a3, a3, a1          # Bump pointer
        vmfirst a2, v0          # Find first set bit in mask, returns -1 if none
        bltz a2, loop           # Not found?
        add a0, a0, a1          # Sum start + bump
        add a3, a3, a2          # Add index of zero byte
        sub a0, a3, a0          # Subtract start address+bump
        ret