1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 30 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes <a name="ffirst"></a>
11
12 Fail-on-first data dependency has different behaviour for traps than
13 for conditional testing. "Conditional" is taken to mean "anything
14 that is zero", however with traps, the first element has to
15 be given the opportunity to throw the exact same trap that would
16 be thrown if this were a scalar operation (when VL=1).
17
Note that implementors are required to choose one mode or the other, mutually
exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies to
custom opcode writers as well as to future extension writers.
22
23 ## Fail-on-first traps
24
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
were clear). Should any subsequent element require a trap, that element
and all subsequent elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the index of the *last* in-sequence
element that did not take the trap.
31
Note that predicated-out elements (where the predicate mask bit is
zero) are clearly excluded (i.e. the trap will not occur). However,
note that the loop still had to test the predicate bit: thus on return,
VL is set to include both the elements that did not take the trap *and*
the elements that were predicated (masked) out (and therefore not tested),
up to the point where the trap occurred.
38
39 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
40 unaltered by setting zero or non-zero predication mode.
41
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst were not set); for subsequent
*sub-groups*, the trap must not occur. SUBVL will **NOT**
be modified. Trap handlers must analyse (x)eSTATE (the subvl offset indices)
to determine which element caused the trap.
47
48 Given that predication bits apply to SUBVL groups, the same rules apply
49 to predicated-out (masked-out) sub-groups in calculating the value that
50 VL is set to.
51
52 ## Fail-on-first conditional tests
53
54 ffirst stops sequential (or sequentially-appearing in the case of
55 out-of-order designs) element conditional testing on the first element
56 result being zero (or other "fail" condition). VL is set to the number
57 of elements that were (sequentially) processed before the fail-condition
58 was encountered.
59
60 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
61 responds to changes in the zero or non-zero predication mode. Whilst
62 in non-zeroing mode, masked-out elements are simply not tested (and
63 thus considered "never to fail"), in zeroing mode, masked-out elements
64 may be viewed as *always* (unconditionally) failing. This effectively
65 turns VL into something akin to a software-controlled loop.
66
Note that just as with traps, if SUBVL!=1, the first fail-condition in a
*sub-group* will cause processing to end, and, even if there were
elements within that *sub-group* that passed the test, the sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
handlers must analyse (x)eSTATE (the subvl offset indices) to determine
which element caused the test to fail.
75
76 Note again that, just as with traps, predicated-out (masked-out) elements
77 are included in the (sequential) count leading up to the fail-condition,
78 even though they were not tested.
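
The VL truncation rule for data-dependent conditional fail-on-first, including
the zeroing / non-zeroing distinction for masked-out elements, can be
summarised in a short software sketch (Python, with a hypothetical per-element
`test` callback; SUBVL handling omitted):

    def ffirst_conditional_vl(VL, pred, zeroing, test):
        """Return the new VL after a fail-on-first conditional sweep."""
        new_VL = 0
        for i in range(VL):
            if not (pred & (1 << i)):      # masked-out element
                if zeroing:
                    break                  # zeroing: treated as an unconditional fail
                new_VL = i + 1             # non-zeroing: counted, "never fails"
                continue
            if not test(i):                # data-dependent fail condition (zero result)
                break
            new_VL = i + 1
        return new_VL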
79
80 # Instructions <a name="instructions" />
81
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all RVV opcodes,
with the exception of CLIP and VSELECT.X,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
92
93 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
94 equivalents, so are left out of Simple-V. VSELECT could be included if
95 there existed a MV.X instruction in RV (MV.X is a hypothetical
96 non-immediate variant of MV that would allow another register to
97 specify which register was to be copied). Note that if any of these three
98 instructions are added to any given RV extension, their functionality
99 will be inherently parallelised.
100
101 With some exceptions, where it does not make sense or is simply too
102 challenging, all RV-Base instructions are parallelised:
103
* CSR instructions, whilst a case could be made for fast-polling of
  a CSR into multiple registers, or for being able to copy multiple
  contiguously-addressed CSRs into contiguous registers, and so on,
  are the fundamental core basis of SV. If parallelised, extreme
  care would need to be taken. Additionally, CSR reads are done
  using x0, and it is *really* inadvisable to tag x0.
110 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
111 left as scalar.
112 * LR/SC could hypothetically be parallelised however their purpose is
113 single (complex) atomic memory operations where the LR must be followed
114 up by a matching SC. A sequence of parallel LR instructions followed
115 by a sequence of parallel SC instructions therefore is guaranteed to
116 not be useful. Not least: the guarantees of a Multi-LR/SC
117 would be impossible to provide if emulated in a trap.
118 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
119 paralleliseable anyway.
120
121 All other operations using registers are automatically parallelised.
122 This includes AMOMAX, AMOSWAP and so on, where particular care and
123 attention must be paid.
124
125 Example pseudo-code for an integer ADD operation (including scalar
126 operations). Floating-point uses the FP Register Table.
127
128 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
129
130 Note that for simplicity there is quite a lot missing from the above
131 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
132 reshaping and offsets and so on. However it demonstrates the basic
133 principle. Augmentations that produce the full pseudo-code are covered in
134 other sections.
135
136 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
137
138 Adding in support for SUBVL is a matter of adding in an extra inner
139 for-loop, where register src and dest are still incremented inside the
140 inner part. Note that the predication is still taken from the VL index.
141
142 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
143 indexed by "(i)"
144
    function op_add(rd, rs1, rs2) # add not VADD!
        int i, id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
        rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
        rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
        for (i = 0; i < VL; i++)
            xSTATE.srcoffs = i # save context
            for (s = 0; s < SUBVL; s++)
                xSTATE.ssvoffs = s # save context
                if (predval & 1<<i) # predication uses intregs
                    # actual add is here (at last)
                    ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                    if (!int_vec[rd ].isvector) break;
                if (int_vec[rd ].isvector)  { id += 1; }
                if (int_vec[rs1].isvector)  { irs1 += 1; }
                if (int_vec[rs2].isvector)  { irs2 += 1; }
                if (id == VL or irs1 == VL or irs2 == VL) {
                    # end VL hardware loop
                    xSTATE.srcoffs = 0; # reset
                    xSTATE.ssvoffs = 0; # reset
                    return;
                }
168
169
170 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
171 elwidth handling etc. all left out.
172
173 ## Instruction Format
174
175 It is critical to appreciate that there are
176 **no operations added to SV, at all**.
177
178 Instead, by using CSRs to tag registers as an indication of "changed
179 behaviour", SV *overloads* pre-existing branch operations into predicated
180 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
181 LOAD/STORE depending on CSR configurations for bitwidth and predication.
182 **Everything** becomes parallelised. *This includes Compressed
183 instructions* as well as any future instructions and Custom Extensions.
184
Note: using CSR tags to change the behaviour of instructions is nothing new, including
186 in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
187 FRM changes the behaviour of the floating-point unit, to alter the rounding
188 mode. Other architectures change the LOAD/STORE byte-order from big-endian
189 to little-endian on a per-instruction basis. SV is just a little more...
190 comprehensive in its effect on instructions.
191
192 ## Branch Instructions
193
194 Branch operations are augmented slightly to be a little more like FP
195 Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
196 of multiple comparisons into a register (taken indirectly from the predicate
197 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
198 See ffirst mode in the Predication Table section.
199
200 There are two registers for the comparison operation, therefore there
201 is the opportunity to associate two predicate registers (note: not in
202 the same way as twin-predication). The first is a "normal" predicate
203 register, which acts just as it does on any other single-predicated
204 operation: masks out elements where a bit is zero, applies an inversion
205 to the predicate mask, and enables zeroing / non-zeroing mode.
206
207 The second (not to be confused with a twin-predication 2nd register)
208 is utilised to indicate where the results of each comparison are to
209 be stored, as a bitmask. Additionally, the behaviour of the branch -
210 when it occurs - may also be modified depending on whether the 2nd predicate's
211 "invert" and "zeroing" bits are set. These four combinations result
212 in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
213 cbranch.ifall (AND), cbranch.ifnotall (NAND).
214
| invert | zeroing | description          | operation | cbranch  |
| ------ | ------- | -------------------- | --------- | -------- |
| 0      | 0       | branch if all pass   | AND       | ifall    |
| 1      | 0       | branch if one fails  | NAND      | ifnotall |
| 0      | 1       | branch if one passes | OR        | ifany    |
| 1      | 1       | branch if all fail   | NOR       | ifnone   |
221
222 This inversion capability covers AND, OR, NAND and NOR branching
223 based on multiple element comparisons. Without the full set of four,
224 it is necessary to have two-sequence branch operations: one conditional, one
225 unconditional.
226
Note that, unlike the early-termination of chained AND or OR conditional
tests in normal computer programming, the chain here does *not* terminate
early except if fail-on-first is set, and even then ffirst ends on the first
data-dependent zero. When ffirst mode is not set, *all* conditional
element tests must be performed (and the result optionally stored in
the result mask), with a "post-analysis" phase carried out which checks
whether to branch.
234
235 ### Standard Branch <a name="standard_branch"></a>
236
237 Branch operations use standard RV opcodes that are reinterpreted to
238 be "predicate variants" in the instance where either of the two src
239 registers are marked as vectors (active=1, vector=1).
240
241 Note that the predication register to use (if one is enabled) is taken from
242 the *first* src register, and that this is used, just as with predicated
243 arithmetic operations, to mask whether the comparison operations take
244 place or not. The target (destination) predication register
245 to use (if one is enabled) is taken from the *second* src register.
246
247 If either of src1 or src2 are scalars (whether by there being no
248 CSR register entry or whether by the CSR entry specifically marking
249 the register as "scalar") the comparison goes ahead as vector-scalar
250 or scalar-vector.
251
252 In instances where no vectorisation is detected on either src registers
253 the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
257
258 Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
262 when zeroing is not set: bits in the destination predicate are
263 only *set*; they are **not** cleared. This is important to appreciate,
264 as there may be an expectation that, going into the hardware-loop,
265 the destination predicate is always expected to be set to zero:
266 this is **not** the case. The destination predicate is only set
267 to zero if **zeroing** is enabled.
268
269 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
271 src1 and src2, however note that in doing so, the predicate table
272 setup must also be correspondingly adjusted.
273
274 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
275 for predicated compare operations of function "cmp":
276
    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);
281
282 With associated predication, vector-length adjustments and so on,
283 and temporarily ignoring bitwidth (which makes the comparisons more
284 complex), this becomes:
285
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion, pred_zeroing = get_pred_flags(rs2)
    else
        pred_inversion, pred_zeroing = False, False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i]:reg[src1],
                    s2 ? reg[src2+i]:reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);
                if ffirst_mode:
                    break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if pred_zeroing:
            # NOR
            if result == 0:
                goto branch
        else:
            # NAND
            if (result & ps) != result:
                goto branch
    else:
        if pred_zeroing:
            # OR
            if result != 0:
                goto branch
        else:
            # AND
            if (result & ps) == result:
                goto branch
345
346 Notes:
347
348 * Predicated SIMD comparisons would break src1 and src2 further down
349 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
350 Reordering") setting Vector-Length times (number of SIMD elements) bits
351 in Predicate Register rd, as opposed to just Vector-Length bits.
352 * The execution of "parallelised" instructions **must** be implemented
353 as "re-entrant" (to use a term from software). If an exception (trap)
354 occurs during the middle of a vectorised
355 Branch (now a SV predicated compare) operation, the partial results
356 of any comparisons must be written out to the destination
357 register before the trap is permitted to begin. If however there
358 is no predicate, the **entire** set of comparisons must be **restarted**,
359 with the offset loop indices set back to zero. This is because
360 there is no place to store the temporary result during the handling
361 of traps.
362
363 TODO: predication now taken from src2. also branch goes ahead
364 if all compares are successful.
365
366 Note also that where normally, predication requires that there must
367 also be a CSR register entry for the register being used in order
368 for the **predication** CSR register entry to also be active,
369 for branches this is **not** the case. src2 does **not** have
370 to have its CSR register entry marked as active in order for
371 predication on src2 to be active.
372
373 Also note: SV Branch operations are **not** twin-predicated
374 (see Twin Predication section). This would require three
375 element offsets: one to track src1, one to track src2 and a third
376 to track where to store the accumulation of the results. Given
377 that the element offsets need to be exposed via CSRs so that
378 the parallel hardware looping may be made re-entrant on traps
379 and exceptions, the decision was made not to make SV Branches
380 twin-predicated.
381
382 ### Floating-point Comparisons
383
There are no floating-point branch operations, only compares.
Interestingly, no change is needed to the instruction format, because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it is not actually a Branch at all: it is a compare.
388
389 In RV (scalar) Base, a branch on a floating-point compare is
390 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
391 This does extend to SV, as long as x1 (in the example sequence given)
392 is vectorised. When that is the case, x1..x(1+VL-1) will also be
393 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
394 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
395 so on. Consequently, unlike integer-branch, FP Compare needs no
396 modification in its behaviour.
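
For illustration only (predication and element widths omitted), the effect of
that vectorised FEQ/BEQ sequence can be modelled as follows, assuming x1 has
been marked as a vector of length VL:

    def feq_then_beq(freg, ireg, VL):
        # FEQ x1, f0, f5 (x1 vectorised): x1..x(1+VL-1) = (f0==f5, f1==f6, ...)
        for i in range(VL):
            ireg[1 + i] = 1 if freg[0 + i] == freg[5 + i] else 0
        # BEQ x1, x0: the branch goes ahead only if *all* element compares succeed
        return all(ireg[1 + i] == ireg[0] for i in range(VL))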
397
In addition, it is noted that an "FNE" instruction (the opposite of FEQ) is
missing, and whilst in ordinary branch code this is fine, because the
standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero), in predication
terms the lack of FNE has more of an impact. To deal with this, SV's
predication has had "invert" added to it.
404
405 Also: note that FP Compare may be predicated, using the destination
406 integer register (rd) to determine the predicate. FP Compare is **not**
407 a twin-predication operation, as, again, just as with SV Branches,
408 there are three registers involved: FP src1, FP src2 and INT rd.
409
410 Also: note that ffirst (fail first mode) applies directly to this operation.
411
412 ### Compressed Branch Instruction
413
414 Compressed Branch instructions are, just like standard Branch instructions,
415 reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. However, as there is only the one source register,
given that c.beqz a0 is equivalent to beqz a0,x0, the optional target
in which to store the results of the comparisons is taken from the CSR
predication table entries for **x0**.
420
The specific required use of x0 is, with a little thought, quite obvious,
although counterintuitive at first. Clearly it is **not** recommended to redirect
423 x0 with a CSR register entry, however as a means to opaquely obtain
424 a predication target it is the only sensible option that does not involve
425 additional special CSRs (or, worse, additional special opcodes).
426
427 Note also that, just as with standard branches, the 2nd source
428 (in this case x0 rather than src2) does **not** have to have its CSR
429 register table marked as "active" in order for predication to work.
430
431 ## Vectorised Dual-operand instructions
432
433 There is a series of 2-operand instructions involving copying (and
434 sometimes alteration):
435
436 * C.MV
437 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
438 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
439 * LOAD(-FP) and STORE(-FP)
440
441 All of these operations follow the same two-operand pattern, so it is
442 *both* the source *and* destination predication masks that are taken into
443 account. This is different from
444 the three-operand arithmetic instructions, where the predication mask
445 is taken from the *destination* register, and applied uniformly to the
446 elements of the source register(s), element-for-element.
447
448 The pseudo-code pattern for twin-predicated operations is as
449 follows:
450
    function op(rd, rs):
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
464
465 This pattern covers scalar-scalar, scalar-vector, vector-scalar
466 and vector-vector, and predicated variants of all of those.
467 Zeroing is not presently included (TODO). As such, when compared
468 to RVV, the twin-predicated variants of C.MV and FMV cover
469 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
470 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
471
472 Note that:
473
474 * elwidth (SIMD) is not covered in the pseudo-code above
475 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
476 not covered
477 * zero predication is also not shown (TODO).
478
479 ### C.MV Instruction <a name="c_mv"></a>
480
481 There is no MV instruction in RV however there is a C.MV instruction.
482 It is used for copying integer-to-integer registers (vectorised FMV
483 is used for copying floating-point).
484
485 If either the source or the destination register are marked as vectors
486 C.MV is reinterpreted to be a vectorised (multi-register) predicated
487 move operation. The actual instruction's format does not change:
488
489 [[!table data="""
490 15 12 | 11 7 | 6 2 | 1 0 |
491 funct4 | rd | rs | op |
492 4 | 5 | 5 | 2 |
493 C.MV | dest | src | C0 |
494 """]]
495
496 A simplified version of the pseudocode for this operation is as follows:
497
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            xSTATE.srcoffs = i # save context
            xSTATE.destoffs = j # save context
            ireg[rd+j] <= ireg[rs+i];
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break
511
512 There are several different instructions from RVV that are covered by
513 this one opcode:
514
515 [[!table data="""
516 src | dest | predication | op |
517 scalar | vector | none | VSPLAT |
518 scalar | vector | destination | sparse VSPLAT |
519 scalar | vector | 1-bit dest | VINSERT |
520 vector | scalar | 1-bit? src | VEXTRACT |
521 vector | vector | none | VCOPY |
522 vector | vector | src | Vector Gather |
523 vector | vector | dest | Vector Scatter |
524 vector | vector | src & dest | Gather/Scatter |
525 vector | vector | src == dest | sparse VCOPY |
526 """]]
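
To make the table concrete, here is a small Python model of the element
stepping in the op_mv pseudocode above (zeroing and elwidth omitted, and with
bounds checks added so the sketch always terminates); the example at the end
shows how a scalar source, a vector destination and a sparse destination
predicate yield a sparse VSPLAT:

    def mv_element_pairs(VL, src_isvec, dst_isvec, ps, pd):
        pairs = []
        i = j = 0
        while i < VL and j < VL:
            if src_isvec:
                while i < VL and not (ps & (1 << i)): i += 1
            if dst_isvec:
                while j < VL and not (pd & (1 << j)): j += 1
            if i >= VL or j >= VL:
                break
            pairs.append((i, j))        # element i of rs is copied to element j of rd
            if src_isvec: i += 1
            if dst_isvec: j += 1
            else: break
        return pairs

    # scalar src, vector dest, dest predicate 0b0101: rs[0] lands in rd[0] and rd[2]
    print(mv_element_pairs(4, False, True, 0b1111, 0b0101))   # [(0, 0), (0, 2)]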
527
528 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
529 operations with zeroing off, and inversion on the src and dest predication
530 for one of the two C.MV operations. The non-inverted C.MV will place
531 one set of registers into the destination, and the inverted one the other
532 set. With predicate-inversion, copying and inversion of the predicate mask
533 need not be done as a separate (scalar) instruction.
534
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
540
541 ### FMV, FNEG and FABS Instructions
542
543 These are identical in form to C.MV, except covering floating-point
544 register copying. The same double-predication rules also apply.
However when elwidth is not set to default, the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
547 operation of the appropriate size covering the source and destination
548 register bitwidths.
549
550 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
551
### FCVT Instructions
553
554 These are again identical in form to C.MV, except that they cover
555 floating-point to integer and integer to floating-point. When element
556 width in each vector is set to default, the instructions behave exactly
557 as they are defined for standard RV (scalar) operations, except vectorised
558 in exactly the same fashion as outlined in C.MV.
559
560 However when the source or destination element width is not set to default,
561 the opcode's explicit element widths are *over-ridden* to new definitions,
562 and the opcode's element width is taken as indicative of the SIMD width
563 (if applicable i.e. if packed SIMD is requested) instead.
564
565 For example FCVT.S.L would normally be used to convert a 64-bit
566 integer in register rs1 to a 64-bit floating-point number in rd.
567 If however the source rs1 is set to be a vector, where elwidth is set to
568 default/2 and "packed SIMD" is enabled, then the first 32 bits of
569 rs1 are converted to a floating-point number to be stored in rd's
570 first element and the higher 32-bits *also* converted to floating-point
571 and stored in the second. The 32 bit size comes from the fact that
572 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
573 divide that by two it means that rs1 element width is to be taken as 32.
574
575 Similar rules apply to the destination register.
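
A software illustration (not normative; signedness of the two halves is
assumed here) of that packed-SIMD FCVT.S.L example, converting the two 32-bit
halves of a 64-bit rs1 into two single-precision results:

    import struct

    def fcvt_s_l_packed(rs1_64):
        results = []
        for half in (rs1_64 & 0xFFFFFFFF, (rs1_64 >> 32) & 0xFFFFFFFF):
            if half & 0x80000000:            # treat each half as a signed 32-bit integer
                half -= 1 << 32
            # round to single precision, as FCVT.S.* would
            f32 = struct.unpack('<f', struct.pack('<f', float(half)))[0]
            results.append(f32)
        return results                        # [rd element 0, rd element 1]

    print(fcvt_s_l_packed((7 << 32) | 0xFFFFFFFE))   # [-2.0, 7.0]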
576
577 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
578
579 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
580 the interpretation of the instruction fields). This
581 actually undermined the fundamental principle of SV, namely that there
582 be no modifications to the scalar behaviour (except where absolutely
583 necessary), in order to simplify an implementor's task if considering
584 converting a pre-existing scalar design to support parallelism.
585
586 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
587 do not change in SV, however just as with C.MV it is important to note
588 that dual-predication is possible.
589
590 In vectorised architectures there are usually at least two different modes
591 for LOAD/STORE:
592
593 * Read (or write for STORE) from sequential locations, where one
594 register specifies the address, and the one address is incremented
595 by a fixed amount. This is usually known as "Unit Stride" mode.
596 * Read (or write) from multiple indirected addresses, where the
597 vector elements each specify separate and distinct addresses.
598
599 To support these different addressing modes, the CSR Register "isvector"
600 bit is used. So, for a LOAD, when the src register is set to
601 scalar, the LOADs are sequentially incremented by the src register
602 element width, and when the src register is set to "vector", the
603 elements are treated as indirection addresses. Simplified
604 pseudo-code would look like this:
605
    function op_ld(rd, rs) # LD not VLD!
        rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            if (int_csr[rs].isvec)
                # indirect mode (multi mode)
                srcbase = ireg[rsv+i];
            else
                # unit stride mode
                srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
            ireg[rdv+j] <= mem[srcbase + imm_offs];
            if (!int_csr[rs].isvec &&
                !int_csr[rd].isvec) break # scalar-scalar LD
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++;
625
626 Notes:
627
628 * For simplicity, zeroing and elwidth is not included in the above:
629 the key focus here is the decision-making for srcbase; vectorised
630 rs means use sequentially-numbered registers as the indirection
631 address, and scalar rs is "offset" mode.
632 * The test towards the end for whether both source and destination are
633 scalar is what makes the above pseudo-code provide the "standard" RV
634 Base behaviour for LD operations.
635 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
637 (8 bytes), and also whether the element width is over-ridden
638 (see special element width section).
639
640 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
641
642 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
643 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
644 It is therefore possible to use predicated C.LWSP to efficiently
645 pop registers off the stack (by predicating x2 as the source), cherry-picking
646 which registers to store to (by predicating the destination). Likewise
647 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
648
649 The two modes ("unit stride" and multi-indirection) are still supported,
650 as with standard LD/ST. Essentially, the only difference is that the
651 use of x2 is hard-coded into the instruction.
652
653 **Note**: it is still possible to redirect x2 to an alternative target
654 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
655 general-purpose LOAD/STORE operations.
656
657 ## Compressed LOAD / STORE Instructions
658
659 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
662 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
663 to "Multi-indirection", respectively.
664
665 # Element bitwidth polymorphism <a name="elwidth"></a>
666
667 Element bitwidth is best covered as its own special section, as it
668 is quite involved and applies uniformly across-the-board. SV restricts
669 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
670
671 The effect of setting an element bitwidth is to re-cast each entry
672 in the register table, and for all memory operations involving
673 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, each register effectively
now looks like this:
676
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
686
687 where the CSR Register table entry (not the instruction alone) determines
688 which of those union entries is to be used on each operation, and the
689 VL element offset in the hardware-loop specifies the index into each array.
690
691 However a naive interpretation of the data structure above masks the
692 fact that setting VL greater than 8, for example, when the bitwidth is 8,
693 accessing one specific register "spills over" to the following parts of
694 the register file in a sequential fashion. So a much more accurate way
695 to reflect this would be:
696
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
707
708 where when accessing any individual regfile[n].b entry it is permitted
709 (in c) to arbitrarily over-run the *declared* length of the array (zero),
710 and thus "overspill" to consecutive register file entries in a fashion
711 that is completely transparent to a greatly-simplified software / pseudo-code
712 representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register bytes is
ever attempted.
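
A byte-level software model of that overspill behaviour (hypothetical, RV64
assumed: 128 registers of 8 bytes each) treats the whole register file as one
contiguous byte array, so element accesses that run past the end of one
register fall naturally into the next, and accesses past the end of the file
raise the exception described above:

    REGFILE = bytearray(128 * 8)            # RV64: 128 registers x 8 bytes

    def elem_addr(reg, elwidth_bits, offset):
        nbytes = elwidth_bits // 8
        addr = reg * 8 + offset * nbytes
        if addr + nbytes > len(REGFILE):
            raise IndexError("access beyond the real register file")  # implementor's trap
        return addr, nbytes

    def read_elem(reg, elwidth_bits, offset):
        addr, nbytes = elem_addr(reg, elwidth_bits, offset)
        return int.from_bytes(REGFILE[addr:addr + nbytes], 'little')

    def write_elem(reg, elwidth_bits, offset, val):
        addr, nbytes = elem_addr(reg, elwidth_bits, offset)
        REGFILE[addr:addr + nbytes] = (val & ((1 << elwidth_bits) - 1)).to_bytes(nbytes, 'little')

    # with elwidth=8, element 9 of "x5" spills over into the second byte of x6
    write_elem(5, 8, 9, 0xAB)
    assert read_elem(6, 8, 1) == 0xAB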
717
Now we may modify the pseudo-code for an operation where all element bitwidths
have been set to the same size, where this pseudo-code is otherwise identical
to its "non-polymorphic" version (above):
721
    function op_add(rd, rs1, rs2) # add not VADD!
        ...
        ...
        for (i = 0; i < VL; i++)
            ...
            ...
            // TODO, calculate if over-run occurs, for each elwidth
            if (elwidth == 8) {
                int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                         int_regfile[rs2].b[irs2];
            } else if elwidth == 16 {
                int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                         int_regfile[rs2].s[irs2];
            } else if elwidth == 32 {
                int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                         int_regfile[rs2].i[irs2];
            } else { // elwidth == 64
                int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                         int_regfile[rs2].l[irs2];
            }
            ...
            ...
744
745 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
746 following sequentially on respectively from the same) are "type-cast"
747 to 8-bit; for 16-bit entries likewise and so on.
748
749 However that only covers the case where the element widths are the same.
750 Where the element widths are different, the following algorithm applies:
751
752 * Analyse the bitwidth of all source operands and work out the
753 maximum. Record this as "maxsrcbitwidth"
754 * If any given source operand requires sign-extension or zero-extension
755 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
756 sign-extension / zero-extension or whatever is specified in the standard
757 RV specification, **change** that to sign-extending from the respective
758 individual source operand's bitwidth from the CSR table out to
759 "maxsrcbitwidth" (previously calculated), instead.
760 * Following separate and distinct (optional) sign/zero-extension of all
761 source operands as specifically required for that operation, carry out the
762 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
763 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
765 into a copy).
766 * If the destination operand requires sign-extension or zero-extension,
767 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
769 etc.), overload the RV specification with the bitwidth from the
770 destination register's elwidth entry.
771 * Finally, store the (optionally) sign/zero-extended value into its
772 destination: memory for sb/sw etc., or an offset section of the register
773 file for an arithmetic operation.
774
775 In this way, polymorphic bitwidths are achieved without requiring a
776 massive 64-way permutation of calculations **per opcode**, for example
777 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
778 rd bitwidths). The pseudo-code is therefore as follows:
779
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2)  # source element width(s)
    destwid = bw(int_csr[rd].elwidth)      # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, ird, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
845
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
849
850 * the source operands are extended out to the maximum bitwidth of all
851 source operands
852 * the operation takes place at that maximum source bitwidth (the
853 destination bitwidth is not involved at this point, at all)
854 * the result is extended (or potentially even, truncated) before being
855 stored in the destination. i.e. truncation (if required) to the
856 destination width occurs **after** the operation **not** before.
857 * when the destination is not marked as "vectorised", the **full**
858 (standard, scalar) register file entry is taken up, i.e. the
859 element is either sign-extended or zero-extended to cover the
860 full register bitwidth (XLEN) if it is not already XLEN bits long.
861
862 Implementors are entirely free to optimise the above, particularly
863 if it is specifically known that any given operation will complete
864 accurately in less bits, as long as the results produced are
865 directly equivalent and equal, for all inputs and all outputs,
866 to those produced by the above algorithm.
867
868 ## Polymorphic floating-point operation exceptions and error-handling
869
870 For floating-point operations, conversion takes place without raising any
871 kind of exception. Exactly as specified in the standard RV specification,
872 NAN (or appropriate) is stored if the result is beyond the range of the
873 destination, and, again, exactly as with the standard RV specification
874 just as with scalar operations, the floating-point flag is raised
875 (FCSR). And, again, just as with scalar operations, it is software's
876 responsibility to check this flag. Given that the FCSR flags are
877 "accrued", the fact that multiple element operations could have occurred
878 is not a problem.
879
880 Note that it is perfectly legitimate for floating-point bitwidths of
881 only 8 to be specified. However whilst it is possible to apply IEEE 754
882 principles, no actual standard yet exists. Implementors wishing to
883 provide hardware-level 8-bit support rather than throw a trap to emulate
884 in software should contact the author of this specification before
885 proceeding.
886
887 ## Polymorphic shift operators
888
889 A special note is needed for changing the element width of left and
890 right shift operators, particularly right-shift. Even for standard RV
891 base, in order for correct results to be returned, the second operand
892 RS2 must be truncated to be within the range of RS1's bitwidth.
893 spike's implementation of sll for example is as follows:
894
895 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
896
897 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
898 range 0..31 so that RS1 will only be left-shifted by the amount that
899 is possible to fit into a 32-bit register. Whilst this appears not
900 to matter for hardware, it matters greatly in software implementations,
901 and it also matters where an RV64 system is set to "RV32" mode, such
902 that the underlying registers RS1 and RS2 comprise 64 hardware bits
903 each.
904
905 For SV, where each operand's element bitwidth may be over-ridden, the
906 rule about determining the operation's bitwidth *still applies*, being
907 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
908 **also applies to the truncation of RS2**. In other words, *after*
909 determining the maximum bitwidth, RS2's range must **also be truncated**
910 to ensure a correct answer. Example:
911
912 * RS1 is over-ridden to a 16-bit width
913 * RS2 is over-ridden to an 8-bit width
914 * RD is over-ridden to a 64-bit width
915 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
916 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
917
918 Pseudocode (in spike) for this example would therefore be:
919
920 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
921
922 This example illustrates that considerable care therefore needs to be
923 taken to ensure that left and right shift operations are implemented
924 correctly. The key is that
925
926 * The operation bitwidth is determined by the maximum bitwidth
927 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
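
A single-element sketch of a polymorphic SLL following those two rules
(operation width taken from the maximum source elwidth, RS2 truncated to that
width's range, result then fitted to the destination elwidth):

    def sext(val, bits):
        val &= (1 << bits) - 1
        return val - (1 << bits) if val & (1 << (bits - 1)) else val

    def poly_sll(rs1, rs1_wid, rs2, rs2_wid, rd_wid):
        opwid = max(rs1_wid, rs2_wid)           # operation width from the *sources*
        shamt = rs2 & (opwid - 1)               # RS2 truncated to 0..opwid-1
        result = (rs1 << shamt) & ((1 << opwid) - 1)
        return sext(result, opwid) & ((1 << rd_wid) - 1)   # sign-extend/truncate to dest

    # RS1 16-bit, RS2 8-bit, RD 64-bit: the shift amount is masked to 0..15
    print(hex(poly_sll(0x00F0, 16, 20, 8, 64)))  # 20 & 15 == 4, giving 0xf00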
929
930 ## Polymorphic MULH/MULHU/MULHSU
931
932 MULH is designed to take the top half MSBs of a multiply that
933 does not fit within the range of the source operands, such that
934 smaller width operations may produce a full double-width multiply
935 in two cycles. The issue is: SV allows the source operands to
936 have variable bitwidth.
937
938 Here again special attention has to be paid to the rules regarding
939 bitwidth, which, again, are that the operation is performed at
940 the maximum bitwidth of the **source** registers. Therefore:
941
942 * An 8-bit x 8-bit multiply will create a 16-bit result that must
943 be shifted down by 8 bits
944 * A 16-bit x 8-bit multiply will create a 24-bit result that must
945 be shifted down by 16 bits (top 8 bits being zero)
946 * A 16-bit x 16-bit multiply will create a 32-bit result that must
947 be shifted down by 16 bits
948 * A 32-bit x 16-bit multiply will create a 48-bit result that must
949 be shifted down by 32 bits
950 * A 32-bit x 8-bit multiply will create a 40-bit result that must
951 be shifted down by 32 bits
952
953 So again, just as with shift-left and shift-right, the result
954 is shifted down by the maximum of the two source register bitwidths.
955 And, exactly again, truncation or sign-extension is performed on the
956 result. If sign-extension is to be carried out, it is performed
957 from the same maximum of the two source register bitwidths out
958 to the result element's bitwidth.
959
960 If truncation occurs, i.e. the top MSBs of the result are lost,
961 this is "Officially Not Our Problem", i.e. it is assumed that the
962 programmer actually desires the result to be truncated. i.e. if the
963 programmer wanted all of the bits, they would have set the destination
964 elwidth to accommodate them.
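
The same rule expressed as a sketch for an unsigned polymorphic MULH (the
signed variants differ only in how the sources and result are sign-extended):

    def poly_mulhu(rs1, rs1_wid, rs2, rs2_wid, rd_wid):
        opwid = max(rs1_wid, rs2_wid)
        full = (rs1 & ((1 << rs1_wid) - 1)) * (rs2 & ((1 << rs2_wid) - 1))
        high = full >> opwid                    # shift down by the max source width
        return high & ((1 << rd_wid) - 1)       # truncate (or extend) to the dest elwidth

    # 16-bit x 8-bit: a 24-bit result, shifted down by 16 (top 8 bits zero)
    print(hex(poly_mulhu(0xFFFF, 16, 0xFF, 8, 16)))   # (0xffff * 0xff) >> 16 == 0xfe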
965
966 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
967
968 Polymorphic element widths in vectorised form means that the data
969 being loaded (or stored) across multiple registers needs to be treated
970 (reinterpreted) as a contiguous stream of elwidth-wide items, where
971 the source register's element width is **independent** from the destination's.
972
973 This makes for a slightly more complex algorithm when using indirection
974 on the "addressed" register (source for LOAD and destination for STORE),
975 particularly given that the LOAD/STORE instruction provides important
976 information about the width of the data to be reinterpreted.
977
978 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
979 was as follows, and i is the loop from 0 to VL-1:
980
981 srcbase = ireg[rs+i];
982 return mem[srcbase + imm]; // returns XLEN bits
983
984 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
985 chunks are taken from the source memory location addressed by the current
986 indexed source address register, and only when a full 32-bits-worth
987 are taken will the index be moved on to the next contiguous source
988 address register:
989
990 bitwidth = bw(elwidth); // source elwidth from CSR reg entry
991 elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
992 srcbase = ireg[rs+i/(elsperblock)]; // integer divide
993 offs = i % elsperblock; // modulo
994 return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
995
996 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
997 and 128 for LQ.
998
999 The principle is basically exactly the same as if the srcbase were pointing
1000 at the memory of the *register* file: memory is re-interpreted as containing
1001 groups of elwidth-wide discrete elements.
1002
1003 When storing the result from a load, it's important to respect the fact
1004 that the destination register has its *own separate element width*. Thus,
1005 when each element is loaded (at the source element width), any sign-extension
1006 or zero-extension (or truncation) needs to be done to the *destination*
1007 bitwidth. Also, the storing has the exact same analogous algorithm as
1008 above, where in fact it is just the set\_polymorphed\_reg pseudocode
1009 (completely unchanged) used above.
1010
1011 One issue remains: when the source element width is **greater** than
1012 the width of the operation, it is obvious that a single LB for example
1013 cannot possibly obtain 16-bit-wide data. This condition may be detected
1014 where, when using integer divide, elsperblock (the width of the LOAD
1015 divided by the bitwidth of the element) is zero.
1016
The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
1020
1021 The elements, if the element bitwidth is larger than the LD operation's
1022 size, will then be sign/zero-extended to the full LD operation size, as
1023 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1024 being passed on to the second phase.
1025
1026 As LOAD/STORE may be twin-predicated, it is important to note that
1027 the rules on twin predication still apply, except where in previous
1028 pseudo-code (elwidth=default for both source and target) it was
1029 the *registers* that the predication was applied to, it is now the
1030 **elements** that the predication is applied to.
1031
1032 Thus the full pseudocode for all LD operations may be written out
1033 as follows:
1034
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = bw(int_csr[rd].elwidth) # destination element width
        srcwid = bw(int_csr[rs].elwidth)  # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, srcwid))
            else:
                val = sign_extend(val, min(opwidth, srcwid))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1072
1073 Note:
1074
1075 * when comparing against for example the twin-predicated c.mv
1076 pseudo-code, the pattern of independent incrementing of rd and rs
1077 is preserved unchanged.
1078 * just as with the c.mv pseudocode, zeroing is not included and must be
1079 taken into account (TODO).
1080 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1081 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1082 VSCATTER characteristics.
1083 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1084 a destination that is not vectorised (marked as scalar) will
1085 result in the element being fully sign-extended or zero-extended
1086 out to the full register file bitwidth (XLEN). When the source
1087 is also marked as scalar, this is how the compatibility with
1088 standard RV LOAD/STORE is preserved by this algorithm.
1089
1090 ### Example Tables showing LOAD elements
1091
1092 This section contains examples of vectorised LOAD operations, showing
1093 how the two stage process works (three if zero/sign-extension is included).
1094
1095
#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1097
1098 This is:
1099
1100 * a 64-bit load, with an offset of zero
1101 * with a source-address elwidth of 16-bit
1102 * into a destination-register with an elwidth of 32-bit
1103 * where VL=7
1104 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1105 * RV64, where XLEN=64 is assumed.
1106
1107 First, the memory table, which, due to the element width being 16 and the
1108 operation being LD (64), the 64-bits loaded from memory are subdivided
1109 into groups of **four** elements. And, with VL being 7 (deliberately
1110 to illustrate that this is reasonable and possible), the first four are
sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1113
1114 [[!table data="""
1115 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1116 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1117 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1118 """]]
1119
1120 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1121 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1122
[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0 | 0x0 | elem0 ||
0x0 | 0x0 | elem1 ||
0x0 | 0x0 | elem2 ||
0x0 | 0x0 | elem3 ||
0x0 | 0x0 | elem4 ||
0x0 | 0x0 | elem5 ||
0x0 | 0x0 | elem6 ||
"""]]
1134
1135 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1136 byte-addressable "memory". That "memory" happens to cover registers
1137 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1138
1139 [[!table data="""
1140 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1141 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1142 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1143 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1144 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1145 """]]
1146
1147 Thus we have data that is loaded from the **addresses** pointed to by
1148 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1149 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1
shifted up by 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1153
1154 Note that whilst the memory addressing table is shown left-to-right byte order,
1155 the registers are shown in right-to-left (MSB) order. This does **not**
1156 imply that bit or byte-reversal is carried out: it's just easier to visualise
1157 memory as being contiguous bytes, and emphasises that registers are not
1158 really actually "memory" as such.
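
The element placement shown in the three tables above can be cross-checked
with a few lines of arithmetic (the parameters of this particular example:
LD, source elwidth 16, destination elwidth 32, VL=7):

    opwidth, src_elwidth, dst_elwidth, VL = 64, 16, 32, 7
    elsperblock = opwidth // src_elwidth            # 4 source elements per 64-bit LD block
    for i in range(VL):
        src_reg  = 5 + i // elsperblock             # x5 for elements 0-3, x6 for 4-6
        src_offs = i % elsperblock
        dst_reg  = 8 + (i * dst_elwidth) // 64      # x8, x8, x9, x9, x10, x10, x11
        dst_half = ((i * dst_elwidth) % 64) // 32   # lower (0) or upper (1) 32-bit half
        print(f"elem {i}: @x{src_reg} offset {src_offs} -> x{dst_reg}, half {dst_half}")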
1159
1160 ## Why SV bitwidth specification is restricted to 4 entries
1161
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit
1167
This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex, no RV128 implementation yet exists, and
RV64's default is in any case 64 bit, so the 4 major element widths are
covered anyway.
1172
There is an absolutely crucial aspect of SV here that explicitly
1174 needs spelling out, and it's whether the "vectorised" bit is set in
1175 the Register's CSR entry.
1176
If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when an elwidth override is set
on a destination (RD), sign-extension and zero-extension, whilst changed
to match the override bitwidth, will erase the **full** register entry
(64-bit if RV64).
1182
1183 When vectorised is *set*, this indicates that the operation now treats
1184 **elements** as if they were independent registers, so regardless of
1185 the length, any parts of a given actual register that are not involved
1186 in the operation are **NOT** modified, but are **PRESERVED**.
1187
1188 For example:
1189
1190 * when the vector bit is clear and elwidth set to 16 on the destination
1191 register, operations are truncated to 16 bit and then sign or zero
1192 extended to the *FULL* XLEN register width.
1193 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1194 groups of elwidth sized elements do not fill an entire XLEN register),
1195 the "top" bits of the destination register do *NOT* get modified, zero'd
1196 or otherwise overwritten.
1197
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of the
multi-element operation.
1201
1202 Other microarchitectures may choose to provide byte-level write-enable
1203 lines on the register file, such that each 64 bit register in an RV64
1204 system requires 8 WE lines. Scalar RV64 operations would require
1205 activation of all 8 lines, where SV elwidth based operations would
1206 activate the required subset of those byte-level write lines.
1207
1208 Example:
1209
1210 * rs1, rs2 and rd are all set to 8-bit
1211 * VL is set to 3
1212 * RV64 architecture is set (UXL=64)
1213 * add operation is carried out
1214 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1215 concatenated with similar add operations on bits 15..8 and 7..0
1216 * bits 24 through 63 **remain as they originally were**.
1217
1218 Example SIMD micro-architectural implementation:
1219
1220 * SIMD architecture works out the nearest round number of elements
1221 that would fit into a full RV64 register (in this case: 8)
1222 * SIMD architecture creates a hidden predicate, binary 0b00000111
1223 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
* SIMD architecture goes ahead with the add operation as if it
were a full batch of 8 adds
1226 * SIMD architecture passes top 5 elements through the adders
1227 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
and stores them in rd.
1230
1231 This requires a read on rd, however this is required anyway in order
1232 to support non-zeroing mode.
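
A non-normative sketch of that approach: the hidden predicate depends only on VL, and the same information can drive byte-level write-enable lines (function names are illustrative):

    def hidden_predicate(vl):
        # per-element mask for the SIMD batch: bottom VL bits set
        return (1 << vl) - 1

    def byte_write_enables(vl, ew_bytes=1, reg_bytes=8):
        # byte-level WE lines for one RV64 register (8 lines)
        return [(i // ew_bytes) < vl for i in range(reg_bytes)]

    assert hidden_predicate(3) == 0b00000111
    assert byte_write_enables(3) == [True] * 3 + [False] * 5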
1233
1234 ## Polymorphic floating-point
1235
1236 Standard scalar RV integer operations base the register width on XLEN,
1237 which may be changed (UXL in USTATUS, and the corresponding MXL and
1238 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1239 arithmetic operations are therefore restricted to an active XLEN bits,
1240 with sign or zero extension to pad out the upper bits when XLEN has
1241 been dynamically set to less than the actual register size.
1242
1243 For scalar floating-point, the active (used / changed) bits are
1244 specified exclusively by the operation: ADD.S specifies an active
1245 32-bits, with the upper bits of the source registers needing to
1246 be all 1s ("NaN-boxed"), and the destination upper bits being
1247 *set* to all 1s (including on LOAD/STOREs).
1248
1249 Where elwidth is set to default (on any source or the destination)
1250 it is obvious that this NaN-boxing behaviour can and should be
preserved. When elwidth is non-default, things are less obvious and need
to be thought through. Here is a normal (scalar) sequence, assuming an
RV64 which supports Quad (128-bit) FLEN (a sketch of the register updates
follows the list):
1254
1255 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1256 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1257 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1258 top 64 MSBs ignored.
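
A Python sketch of those register updates, modelling the 128-bit FP register as an integer (the helper name `nanbox_write` is illustrative only):

    FLEN = 128

    def nanbox_write(value, width):
        # write a `width`-bit scalar FP value into a FLEN-bit register,
        # setting all upper bits to 1s (NaN-boxing)
        upper_ones = ((1 << (FLEN - width)) - 1) << width
        return upper_ones | (value & ((1 << width) - 1))

    f1 = nanbox_write(0x4000_0000_0000_0000, 64)  # FLD: 64-bit load
    assert f1 >> 64 == (1 << 64) - 1              # top 64 MSBs all 1s
    # ADD.D writes its 64-bit result the same way; FSD stores only bits 0-63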
1259
1260 Therefore it makes sense to mirror this behaviour when, for example,
1261 elwidth is set to 32. Assume elwidth set to 32 on all source and
1262 destination registers:
1263
1264 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1265 floating-point numbers.
1266 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1267 in bits 0-31 and the second in bits 32-63.
1268 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1269
1270 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1271 of the registers either during the FLD **or** the ADD.D. The reason
1272 is that, effectively, the top 64 MSBs actually represent a completely
1273 independent 64-bit register, so overwriting it is not only gratuitous
1274 but may actually be harmful for a future extension to SV which may
1275 have a way to directly access those top 64 bits.
1276
1277 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1279 when "isvec" is false in a given register's CSR entry. Only when the
1280 elwidth is set to default **and** isvec is false will the standard
1281 RV behaviour be followed, namely that the upper bits be modified.
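
That decision reduces to a single predicate (a sketch: `elwidth_default` and `isvec` stand for the per-register CSR fields discussed above):

    def fp_upper_bits_may_be_modified(elwidth_default, isvec):
        # standard RV NaN-boxing of the upper bits applies only in the
        # purely-scalar, default-width case; otherwise the upper parts
        # of the FP register are left untouched
        return elwidth_default and not isvec

    assert fp_upper_bits_may_be_modified(True, False) is True    # plain scalar RV
    assert fp_upper_bits_may_be_modified(False, False) is False  # elwidth override, scalar
    assert fp_upper_bits_may_be_modified(True, True) is False    # vectorised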
1282
1283 Ultimately if elwidth is default and isvec false on *all* source
1284 and destination registers, a SimpleV instruction defaults completely
1285 to standard RV scalar behaviour (this holds true for **all** operations,
1286 right across the board).
1287
1288 The nice thing here is that ADD.S, ADD.D and ADD.Q when elwidth are
1289 non-default values are effectively all the same: they all still perform
1290 multiple ADD operations, just at different widths. A future extension
1291 to SimpleV may actually allow ADD.S to access the upper bits of the
1292 register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1294
In the meantime, when for example VL is set to 8, it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used;
however, using ADD.Q may be an easy way to signal to the microarchitecture
that it is about to receive a higher VL value. On a superscalar
out-of-order architecture there may be no difference at all, but simpler
SIMD-style microarchitectures may not have the infrastructure in place
to tell the difference, such that when VL=8 and an ADD.D instruction is
issued, it completes in 2 cycles (or more) rather than one, whereas an
ADD.Q issued on such a simpler microarchitecture would complete in one.
1305
1306 ## Specific instruction walk-throughs
1307
1308 This section covers walk-throughs of the above-outlined procedure
1309 for converting standard RISC-V scalar arithmetic operations to
1310 polymorphic widths, to ensure that it is correct.
1311
1312 ### add
1313
1314 Standard Scalar RV32/RV64 (xlen):
1315
1316 * RS1 @ xlen bits
1317 * RS2 @ xlen bits
1318 * add @ xlen bits
1319 * RD @ xlen bits
1320
1321 Polymorphic variant:
1322
1323 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1324 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1325 * add @ max(rs1, rs2) bits
1326 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1327
1328 Note here that polymorphic add zero-extends its source operands,
1329 where addw sign-extends.
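
The rules above can be sketched in Python (a non-normative model: `extend` and `poly_add` are hypothetical helper names, widths are in bits):

    def extend(value, from_w, to_w, signed):
        # sign- or zero-extend a from_w-bit value to to_w bits
        value &= (1 << from_w) - 1
        if signed and value >> (from_w - 1):
            value -= 1 << from_w              # interpret as negative
        return value & ((1 << to_w) - 1)

    def poly_add(rs1_val, rs1_w, rs2_val, rs2_w, rd_w, signed=False):
        opw = max(rs1_w, rs2_w)
        a = extend(rs1_val, rs1_w, opw, signed)
        b = extend(rs2_val, rs2_w, opw, signed)
        result = (a + b) & ((1 << opw) - 1)   # add at the operation width
        if rd_w > opw:                        # widen to the destination...
            return extend(result, opw, rd_w, signed)
        return result & ((1 << rd_w) - 1)     # ...or truncate

    # 8-bit plus 16-bit sources, 32-bit destination, zero-extension (add)
    assert poly_add(0xFF, 8, 0x0001, 16, 32) == 0x100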
1330
1331 ### addw
1332
1333 The RV Specification specifically states that "W" variants of arithmetic
1334 operations always produce 32-bit signed values. In a polymorphic
1335 environment it is reasonable to assume that the signed aspect is
1336 preserved, where it is the length of the operands and the result
1337 that may be changed.
1338
1339 Standard Scalar RV64 (xlen):
1340
1341 * RS1 @ xlen bits
1342 * RS2 @ xlen bits
1343 * add @ xlen bits
1344 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1345
1346 Polymorphic variant:
1347
1348 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1349 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1350 * add @ max(rs1, rs2) bits
1351 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1352
1353 Note here that polymorphic addw sign-extends its source operands,
1354 where add zero-extends.
1355
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1360
1361 Effectively however, both rs1 and rs2 are being sign-extended (or
1362 truncated), where for add they are both zero-extended. This holds true
1363 for all arithmetic operations ending with "W".
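
Continuing the hypothetical `poly_add` sketch from the add walk-through, the W-variant simply selects sign-extension of the narrower operand:

    # addw: 8-bit 0xFF (i.e. -1) added to 16-bit 0x0001, 32-bit destination
    assert poly_add(0xFF, 8, 0x0001, 16, 32, signed=True) == 0
    # the narrower operand was sign-extended, so the result is (-1) + 1 = 0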
1364
1365 ### addiw
1366
1367 Standard Scalar RV64I:
1368
1369 * RS1 @ xlen bits, truncated to 32-bit
1370 * immed @ 12 bits, sign-extended to 32-bit
1371 * add @ 32 bits
1372 * RD @ rd bits. sign-extend to rd if rd > 32, otherwise truncate.
1373
Polymorphic variant (exercised in the code sketch below):
1375
1376 * RS1 @ rs1 bits
1377 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1378 * add @ max(rs1, 12) bits
1379 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
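
In the same hypothetical `poly_add` sketch, addiw has the 12-bit sign-extended immediate take the place of rs2:

    imm = 0xFFF                               # 12-bit encoding of -1
    assert poly_add(0x01, 8, imm, 12, 32, signed=True) == 0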
1380
1381 # Predication Element Zeroing
1382
1383 The introduction of zeroing on traditional vector predication is usually
1384 intended as an optimisation for lane-based microarchitectures with register
1385 renaming to be able to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
1387 do not have this issue: they simply do not pass the element through to
1388 the ALU at all, and therefore do not store it back in the destination.
1389 More complex non-lane-based micro-architectures can, when zeroing is
1390 not set, use the predication bits to simply avoid sending element-based
1391 operations to the ALUs, entirely: thus, over the long term, potentially
1392 keeping all ALUs 100% occupied even when elements are predicated out.
1393
1394 SimpleV's design principle is not based on or influenced by
1395 microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1400
1401 ## Single-predication (based on destination register)
1402
1403 Zeroing on predication for arithmetic operations is taken from
1404 the destination register's predicate. i.e. the predication *and*
1405 zeroing settings to be applied to the whole operation come from the
1406 CSR Predication table entry for the destination register.
1407 Thus when zeroing is set on predication of a destination element,
1408 if the predication bit is clear, then the destination element is *set*
1409 to zero (twin-predication is slightly different, and will be covered
1410 next).
1411
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1414
    for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
            # skip masked-out elements: bump element indices past them
            while (!(predval & 1<<i) && i < VL)
                if (int_vec[rd ].isvector)  { id += 1; }
                if (int_vec[rs1].isvector)  { irs1 += 1; }
                if (int_vec[rs2].isvector)  { irs2 += 1; }
            if i == VL:
                return
        if (predval & 1<<i)
            src1 = ....
            src2 = ...
            result = src1 + src2 # actual add (or other op) here
            set_polymorphed_reg(rd, destwid, ird, result)
            if int_vec[rd].ffirst and result == 0:
                VL = i # result was zero, end loop early, return VL
                return
            if (!int_vec[rd].isvector) return
        else if zeroing:
            result = 0 # predicate bit clear, zeroing: destination element set to zero
            set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { id += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (rd == VL or rs1 == VL or rs2 == VL): return
1441
1442 The optimisation to skip elements entirely is only possible for certain
1443 micro-architectures when zeroing is not set. However for lane-based
1444 micro-architectures this optimisation may not be practical, as it
1445 implies that elements end up in different "lanes". Under these
1446 circumstances it is perfectly fine to simply have the lanes
1447 "inactive" for predicated elements, even though it results in
1448 less than 100% ALU utilisation.
1449
1450 ## Twin-predication (based on source and destination register)
1451
Twin-predication is not that much different, except that the source
is independently zero-predicated from the destination.
1454 This means that the source may be zero-predicated *or* the
1455 destination zero-predicated *or both*, or neither.
1456
With twin-predication, when zeroing is set on the source and not
the destination, a predicate bit that is *not* set indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1463
1464 When zeroing is set on the destination and not the source, then just
1465 as with single-predicated operations, a zero is stored into the destination
1466 element (or target memory address for a STORE).
1467
Zeroing on both source and destination effectively results in the source
and destination predicates being ANDed together: wherever either the
source predicate *or* the destination predicate is set to 0,
a zero element will ultimately end up in the destination register.
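
A tiny per-element sketch of that rule (both zeroing modes enabled; purely illustrative):

    def twin_zeroing_element(src_val, ps_bit, pd_bit):
        # actual data passes only where both predicate bits are set;
        # a zero is written everywhere else
        return src_val if (ps_bit and pd_bit) else 0

    assert twin_zeroing_element(42, 1, 1) == 42
    assert twin_zeroing_element(42, 0, 1) == 0  # source bit clear: zero passed through
    assert twin_zeroing_element(42, 1, 0) == 0  # dest bit clear: zero stored
    assert twin_zeroing_element(42, 0, 0) == 0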
1472
1473 However: this may not necessarily be the case for all operations;
1474 implementors, particularly of custom instructions, clearly need to
1475 think through the implications in each and every case.
1476
1477 Here is pseudo-code for a twin zero-predicated operation:
1478
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
        pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL):
            # in non-zeroing mode, skip masked-out source/destination elements
            if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
            if ((pd & 1<<j))
                if ((ps & 1<<i))
                    sourcedata = ireg[rs+i];
                else
                    sourcedata = 0 # zeroing on src: pass a zero element through
                ireg[rd+j] <= sourcedata
            else if (zerodst)
                ireg[rd+j] <= 0
            if (int_csr[rs].isvec)
                i++;
            if (int_csr[rd].isvec)
                j++;
            else
                if ((pd & 1<<j))
                    break;
1502
1503 Note that in the instance where the destination is a scalar, the hardware
1504 loop is ended the moment a value *or a zero* is placed into the destination
1505 register/element. Also note that, for clarity, variable element widths
1506 have been left out of the above.
1507
1508 # Subsets of RV functionality
1509
1510 This section describes the differences when SV is implemented on top of
1511 different subsets of RV.
1512
1513 ## Common options
1514
1515 It is permitted to only implement SVprefix and not the VBLOCK instruction
1516 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1517 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1518 traps may emulate the format.
1519
1520 It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
1522 *MUST* raise illegal instruction on implementations that do not support
1523 VL or SUBVL.
1524
1525 It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However, going
below the mandatory limits set in the RV standard will result in non-compliance
1528 with the SV Specification.
1529
1530 ## RV32 / RV32F
1531
1532 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1533 maximum limit for predication is also restricted to 32 bits. Whilst not
1534 actually specifically an "option" it is worth noting.
1535
1536 ## RV32G
1537
Normally in standard RV32 it does not make much sense to have
RV32G. The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1542
1543 In an earlier draft of SV, it was possible to specify an elwidth
1544 of double the standard register size: this had to be dropped,
1545 and may be reintroduced in future revisions.
1546
1547 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1548
1549 When floating-point is not implemented, the size of the User Register and
1550 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1551 per table).
1552
1553 ## RV32E
1554
1555 In embedded scenarios the User Register and Predication CSRs may be
1556 dropped entirely, or optionally limited to 1 CSR, such that the combined
1557 number of entries from the M-Mode CSR Register table plus U-Mode
1558 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1559 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1560 the Predication CSR tables.
1561
1562 RV32E is the most likely candidate for simply detecting that registers
1563 are marked as "vectorised", and generating an appropriate exception
1564 for the VL loop to be implemented in software.
1565
1566 ## RV128
1567
RV128 has not been especially considered here; however, it has some
extremely large possibilities: double the element width implies
1570 256-bit operands, spanning 2 128-bit registers each, and predication
1571 of total length 128 bit given that XLEN is now 128.
1572
1573 # Example usage
1574
1575 TODO evaluate strncpy and strlen
1576 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1577
## strncpy <a name="strncpy"></a>
1579
1580 RVV version:
1581
    strncpy:
        c.mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8      # Vectors of bytes.
        vlbff.v v1, (a1)          # Get src bytes
        vseq.vi v0, v1, 0         # Flag zero bytes
        vmfirst a4, v0            # Zero found?
        vmsif.v v0, v0            # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t      # Write out bytes
        c.bgez a4, exit           # Done
        csrr t1, vl               # Get number of bytes fetched
        c.add a1, a1, t1          # Bump src pointer
        c.sub a2, a2, t1          # Decrement count.
        c.add a3, a3, t1          # Bump dst pointer
        c.bnez a2, loop           # Anymore?

    exit:
        c.ret
1600
1601 SV version (WIP):
1602
    strncpy:
        c.mv a3, a0
        VBLK.RegCSR[t0] = 8bit, t0, vector
        VBLK.PredTb[t0] = ffirst, x0, inv
    loop:
        VBLK.SETVLI a2, t4, 8     # t4 and VL now 1..8 (MVL=8)
        c.ldb t0, (a1)            # t0 fail first mode
        c.bne t0, x0, allnonzero  # still ff
        # VL (t4) points to last nonzero
        c.addi t4, t4, 1          # include zero
        c.stb t0, (a3)            # store incl zero
        c.ret                     # end subroutine
    allnonzero:
        c.stb t0, (a3)            # VL legal range
        c.add a1, a1, t4          # Bump src pointer
        c.sub a2, a2, t4          # Decrement count.
        c.add a3, a3, t4          # Bump dst pointer
        c.bnez a2, loop           # Anymore?
    exit:
        c.ret
1623
1624 Notes:
1625
1626 * Setting MVL to 8 is just an example. If enough registers are spare it
1627 may be set to XLEN which will require a bank of 8 scalar registers for
1628 a1, a3 and t0.
1629 * obviously if that is done, t0 is not separated by 8 full registers, and
1630 would overwrite t1 thru t7. x80 would work well, as an example, instead.
1631 * with the exception of the GETVL (a pseudo code alias for csrr), every
1632 single instruction above may use RVC.
1633 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1634 registers through redirection
1635 * RVC C.LW and C.SW may be used because the W format may be overridden by
1636 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1637 * with the exception of the GETVL, all Vector Context may be done in
1638 VBLOCK form.
1639 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1640 just ffirst on t0
1641 * ldb and bne are both using t0, both in ffirst mode
1642 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1643 vectorised, no (un)sign-extension or truncation" mode.
1644 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff
1645 into t0 (could contain zeros).
1646 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1647 scalar x0
1648 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1649 compares, and reduce VL as well
1650 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
1652 the zero.
1653 * SETVL sets *exactly* the requested amount into VL.
1654 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1655 activates but the bne allzeros does not.
1656 * this would cause the stb to copy up to the end of the legal memory
1657 * of course, on the next loop the ldb would throw a trap, as a1 now
1658 points to the first illegal mem location.
1659
1660 ## strcpy
1661
1662 RVV version:
1663
        mv a3, a0                 # Save start
    loop:
        setvli a1, x0, vint8      # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)         # Get bytes
        csrr a1, vl               # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0         # Set v0[i] where v1[i] = 0
        add a3, a3, a1            # Bump pointer
        vmfirst a2, v0            # Find first set bit in mask, returns -1 if none
        bltz a2, loop             # Not found?
        add a0, a0, a1            # Sum start + bump
        add a3, a3, a2            # Add index of zero byte
        sub a0, a3, a0            # Subtract start address+bump
        ret
1677
1678 ## DAXPY <a name="daxpy"></a>
1679
1680 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1681
1682 Notes:
1683
1684 * Setting MVL to 4 is just an example. With enough space between the
1685 FP regs, MVL may be set to larger values
1686 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1687 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1688 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1689 * All instructions except fmadd may use Compressed variants. Total
1690 number of 16-bit instruction words: 11.
1691 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1692
1693 ## BigInt add <a name="bigadd"></a>
1694
1695 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]