1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 30 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes <a name="ffirst"></a>
11
12 Fail-on-first data dependency has different behaviour for traps than
13 for conditional testing. "Conditional" is taken to mean "anything
14 that is zero", however with traps, the first element has to
15 be given the opportunity to throw the exact same trap that would
16 be thrown if this were a scalar operation (when VL=1).
17
Note that implementors are required to choose one mode or the other,
mutually exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies
to custom opcode writers as well as to future extension writers.
22
23 ## Fail-on-first traps
24
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, that element
and all subsequent elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence element
that did not take the trap.
31
Note that predicated-out elements (where the predicate mask bit is
zero) are clearly excluded (i.e. the trap will not occur). However,
note that the loop still had to test the predicate bit: thus, on return,
VL is set to include both the elements that did not take the trap *and*
the elements that were predicated (masked) out, up to the point where
the trap occurred.
38
39 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
40 unaltered by setting zero or non-zero predication mode.
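
A minimal sketch of the behaviour described above (non-SUBVL case; the
helper names element_would_trap and execute_element are illustrative
only, not part of the specification):

    # sketch: ffirst trap semantics within one vectorised instruction
    for (i = 0; i < VL; i++):
        if not (predval & 1<<i):
            continue                 # masked-out: cannot trap, still counted
        if element_would_trap(i):
            if i == 0:
                take_trap()          # first element traps exactly as a scalar op
            else:
                VL = i               # truncate VL to the elements before the trap
                break                # remaining elements ignored / cancelled
        else:
            execute_element(i)
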
41
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst is not set); in subsequent
*sub-groups* the trap must not occur: VL is truncated instead. SUBVL
will **NOT** be modified. Traps must analyse (x)eSTATE (subvl offset
indices) to determine the element that caused the trap.
47
48 Given that predication bits apply to SUBVL groups, the same rules apply
49 to predicated-out (masked-out) sub-groups in calculating the value that
50 VL is set to.
51
52 ## Fail-on-first conditional tests
53
54 ffirst stops sequential (or sequentially-appearing in the case of
55 out-of-order designs) element conditional testing on the first element
56 result being zero (or other "fail" condition). VL is set to the number
57 of elements that were (sequentially) processed before the fail-condition
58 was encountered.
59
60 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
61 responds to changes in the zero or non-zero predication mode. Whilst
62 in non-zeroing mode, masked-out elements are simply not tested (and
63 thus considered "never to fail"), in zeroing mode, masked-out elements
64 may be viewed as *always* (unconditionally) failing. This effectively
65 turns VL into something akin to a software-controlled loop.
66
Note that, just as with traps, if SUBVL!=1, the first fail-condition in
the *sub-group* will cause processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
(x)eSTATE (subvl offset indices) must be analysed to determine the
element that failed the test.
75
76 Note again that, just as with traps, predicated-out (masked-out) elements
77 are included in the (sequential) count leading up to the fail-condition,
78 even though they were not tested.
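
To illustrate the "software-controlled loop" effect, here is a hedged
sketch of what a single ffirst conditional-test instruction does to VL
(non-zeroing predication assumed; SUBVL omitted; test() stands for
whatever the instruction's pass/fail condition happens to be):

    # sketch: effect of one ffirst conditional-test instruction on VL
    passed = 0
    for (i = 0; i < VL; i++):
        if not (predval & 1<<i):
            passed += 1              # masked-out: "never fails", still counted
            continue
        if test(element[i]) == 0:    # first fail-condition encountered
            break
        passed += 1
    VL = passed
    # software may now process VL elements, advance its pointers by VL,
    # and re-issue the same vector sequence: a data-dependent loop
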
79
80 # Instructions <a name="instructions" />
81
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, SV requires no new instructions.
Compared to RVV: *All* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite none of RVV's opcodes being added,
with the exception of CLIP and VSELECT.X
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
92
93 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
94 equivalents, so are left out of Simple-V. VSELECT could be included if
95 there existed a MV.X instruction in RV (MV.X is a hypothetical
96 non-immediate variant of MV that would allow another register to
97 specify which register was to be copied). Note that if any of these three
98 instructions are added to any given RV extension, their functionality
99 will be inherently parallelised.
100
101 With some exceptions, where it does not make sense or is simply too
102 challenging, all RV-Base instructions are parallelised:
103
* CSR instructions are the fundamental core basis of SV. Whilst a case
could be made for fast-polling of a CSR into multiple registers, or for
being able to copy multiple contiguously addressed CSRs into contiguous
registers, and so on, extreme care would need to be taken if they were
parallelised. Additionally, CSR reads are done using x0, and it is
*really* inadvisable to tag x0.
110 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
111 left as scalar.
112 * LR/SC could hypothetically be parallelised however their purpose is
113 single (complex) atomic memory operations where the LR must be followed
114 up by a matching SC. A sequence of parallel LR instructions followed
115 by a sequence of parallel SC instructions therefore is guaranteed to
116 not be useful. Not least: the guarantees of a Multi-LR/SC
117 would be impossible to provide if emulated in a trap.
118 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
119 paralleliseable anyway.
120
121 All other operations using registers are automatically parallelised.
122 This includes AMOMAX, AMOSWAP and so on, where particular care and
123 attention must be paid.
124
125 Example pseudo-code for an integer ADD operation (including scalar
126 operations). Floating-point uses the FP Register Table.
127
128 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
129
130 Note that for simplicity there is quite a lot missing from the above
131 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
132 reshaping and offsets and so on. However it demonstrates the basic
133 principle. Augmentations that produce the full pseudo-code are covered in
134 other sections.
135
136 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
137
138 Adding in support for SUBVL is a matter of adding in an extra inner
139 for-loop, where register src and dest are still incremented inside the
140 inner part. Note that the predication is still taken from the VL index.
141
142 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
143 indexed by "(i)"
144
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
168
169
170 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
171 elwidth handling etc. all left out.
172
173 ## Instruction Format
174
175 It is critical to appreciate that there are
176 **no operations added to SV, at all**.
177
178 Instead, by using CSRs to tag registers as an indication of "changed
179 behaviour", SV *overloads* pre-existing branch operations into predicated
180 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
181 LOAD/STORE depending on CSR configurations for bitwidth and predication.
182 **Everything** becomes parallelised. *This includes Compressed
183 instructions* as well as any future instructions and Custom Extensions.
184
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
187 FRM changes the behaviour of the floating-point unit, to alter the rounding
188 mode. Other architectures change the LOAD/STORE byte-order from big-endian
189 to little-endian on a per-instruction basis. SV is just a little more...
190 comprehensive in its effect on instructions.
191
192 ## Branch Instructions
193
194 Branch operations are augmented slightly to be a little more like FP
195 Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
196 of multiple comparisons into a register (taken indirectly from the predicate
197 table) and enhancing them to branch "consensually" depending on *multiple*
198 tests. "ffirst" - fail-on-first - condition mode can also be enabled,
199 to terminate the comparisons early.
200 See ffirst mode in the Predication Table section.
201
202 There are two registers for the comparison operation, therefore there
203 is the opportunity to associate two predicate registers (note: not in
204 the same way as twin-predication). The first is a "normal" predicate
205 register, which acts just as it does on any other single-predicated
206 operation: masks out elements where a bit is zero, applies an inversion
207 to the predicate mask, and enables zeroing / non-zeroing mode.
208
209 The second (not to be confused with a twin-predication 2nd register)
210 is utilised to indicate where the results of each comparison are to
211 be stored, as a bitmask. Additionally, the behaviour of the branch -
212 when it occurs - may also be modified depending on whether the 2nd predicate's
213 "invert" and "zeroing" bits are set. These four combinations result
214 in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
215 cbranch.ifall (AND), cbranch.ifnotall (NAND).
216
| invert | zeroing | description                 | operation | cbranch  |
| ------ | ------- | --------------------------- | --------- | -------- |
| 0      | 0       | branch if all pass          | AND       | ifall    |
| 1      | 0       | branch if one fails         | NAND      | ifnotall |
| 0      | 1       | branch if one passes        | OR        | ifany    |
| 1      | 1       | branch if all fail          | NOR       | ifnone   |
223
224 This inversion capability covers AND, OR, NAND and NOR branching
225 based on multiple element comparisons. Without the full set of four,
226 it is necessary to have two-sequence branch operations: one conditional, one
227 unconditional.
228
Note that, unlike in normal computer programming where chains of AND or
OR conditional tests may terminate early, the chain here does *not*
terminate early except if fail-on-first is set, and even then ffirst ends
on the first data-dependent zero. When ffirst mode is not set, *all*
conditional element tests must be performed (and the result optionally
stored in the result mask), with a "post-analysis" phase carried out
which checks whether to branch.
236
237 Note also that whilst it may seem excessive to have all four (because
238 conditional comparisons may be inverted by swapping src1 and src2),
239 data-dependent fail-on-first is *not* invertible and *only* terminates
240 on first zero-condition encountered. Additionally it may be inconvenient
241 to have to swap the predicate registers associated with src1 and src2,
242 because this involves a new VBLOCK Context.
243
244 ### Standard Branch <a name="standard_branch"></a>
245
246 Branch operations use standard RV opcodes that are reinterpreted to
247 be "predicate variants" in the instance where either of the two src
248 registers are marked as vectors (active=1, vector=1).
249
250 Note that the predication register to use (if one is enabled) is taken from
251 the *first* src register, and that this is used, just as with predicated
252 arithmetic operations, to mask whether the comparison operations take
253 place or not. The target (destination) predication register
254 to use (if one is enabled) is taken from the *second* src register.
255
256 If either of src1 or src2 are scalars (whether by there being no
257 CSR register entry or whether by the CSR entry specifically marking
258 the register as "scalar") the comparison goes ahead as vector-scalar
259 or scalar-vector.
260
In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
266
Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared. This is important to appreciate,
as there may be an expectation that, going into the hardware-loop,
the destination predicate will be set to zero: this is **not** the
case. The destination predicate is only set to zero if **zeroing**
is enabled.
277
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by swapping
src1 and src2; however note that in doing so, the predicate table
setup must also be correspondingly adjusted.
282
283 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
284 for predicated compare operations of function "cmp":
285
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
290
291 With associated predication, vector-length adjustments and so on,
292 and temporarily ignoring bitwidth (which makes the comparisons more
293 complex), this becomes:
294
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion, pred_zeroing = get_pred_flags(rs2)
    else
        pred_inversion, pred_zeroing = False, False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);
                if ffirst_mode:
                    break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if pred_zeroing:
            # NOR
            if result == 0:
                goto branch
        else:
            # NAND
            if (result & ps) != result:
                goto branch
    else:
        if pred_zeroing:
            # OR
            if result != 0:
                goto branch
        else:
            # AND
            if (result & ps) == result:
                goto branch
354
355 Notes:
356
357 * Predicated SIMD comparisons would break src1 and src2 further down
358 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
359 Reordering") setting Vector-Length times (number of SIMD elements) bits
360 in Predicate Register rd, as opposed to just Vector-Length bits.
361 * The execution of "parallelised" instructions **must** be implemented
362 as "re-entrant" (to use a term from software). If an exception (trap)
363 occurs during the middle of a vectorised
364 Branch (now a SV predicated compare) operation, the partial results
365 of any comparisons must be written out to the destination
366 register before the trap is permitted to begin. If however there
367 is no predicate, the **entire** set of comparisons must be **restarted**,
368 with the offset loop indices set back to zero. This is because
369 there is no place to store the temporary result during the handling
370 of traps.
371
372 TODO: predication now taken from src2. also branch goes ahead
373 if all compares are successful.
374
375 Note also that where normally, predication requires that there must
376 also be a CSR register entry for the register being used in order
377 for the **predication** CSR register entry to also be active,
378 for branches this is **not** the case. src2 does **not** have
379 to have its CSR register entry marked as active in order for
380 predication on src2 to be active.
381
382 Also note: SV Branch operations are **not** twin-predicated
383 (see Twin Predication section). This would require three
384 element offsets: one to track src1, one to track src2 and a third
385 to track where to store the accumulation of the results. Given
386 that the element offsets need to be exposed via CSRs so that
387 the parallel hardware looping may be made re-entrant on traps
388 and exceptions, the decision was made not to make SV Branches
389 twin-predicated.
390
391 ### Floating-point Comparisons
392
Floating-point branch operations do not exist: there are only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
397
398 In RV (scalar) Base, a branch on a floating-point compare is
399 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
400 This does extend to SV, as long as x1 (in the example sequence given)
401 is vectorised. When that is the case, x1..x(1+VL-1) will also be
402 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
403 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
404 so on. Consequently, unlike integer-branch, FP Compare needs no
405 modification in its behaviour.
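
A hedged sketch of the element-level effect of that sequence, with f0, f5
and x1 all marked as vectors and VL=4 (register numbers purely
illustrative):

    # FEQ x1, f0, f5 (vectorised)
    for (i = 0; i < VL; i++):
        ireg[x1+i] = (freg[f0+i] == freg[f5+i]) ? 1 : 0
    # BEQ x1, x0, #jumploc (x1 vectorised, x0 scalar)
    # performs the element tests ireg[x1+i] == 0 for i = 0..VL-1 and
    # branches only when the consensual condition across all tests holds
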
406
In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
missing, and whilst in ordinary branch code this is fine because the
standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero), in predication
terms the omission has more of an impact. To deal with this, SV's
predication has had "invert" added to it.
413
414 Also: note that FP Compare may be predicated, using the destination
415 integer register (rd) to determine the predicate. FP Compare is **not**
416 a twin-predication operation, as, again, just as with SV Branches,
417 there are three registers involved: FP src1, FP src2 and INT rd.
418
419 Also: note that ffirst (fail first mode) applies directly to this operation.
420
421 ### Compressed Branch Instruction
422
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz x10 is equivalent to beq x10, x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
429
The specific required use of x0 is, with a little thought, quite logical,
albeit initially counterintuitive. Clearly it is **not** recommended to
redirect x0 with a CSR register entry, however as a means to opaquely obtain
a predication target it is the only sensible option that does not involve
additional special CSRs (or, worse, additional special opcodes).
435
436 Note also that, just as with standard branches, the 2nd source
437 (in this case x0 rather than src2) does **not** have to have its CSR
438 register table marked as "active" in order for predication to work.
439
440 ## Vectorised Dual-operand instructions
441
442 There is a series of 2-operand instructions involving copying (and
443 sometimes alteration):
444
445 * C.MV
446 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
447 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
448 * LOAD(-FP) and STORE(-FP)
449
450 All of these operations follow the same two-operand pattern, so it is
451 *both* the source *and* destination predication masks that are taken into
452 account. This is different from
453 the three-operand arithmetic instructions, where the predication mask
454 is taken from the *destination* register, and applied uniformly to the
455 elements of the source register(s), element-for-element.
456
457 The pseudo-code pattern for twin-predicated operations is as
458 follows:
459
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
473
474 This pattern covers scalar-scalar, scalar-vector, vector-scalar
475 and vector-vector, and predicated variants of all of those.
476 Zeroing is not presently included (TODO). As such, when compared
477 to RVV, the twin-predicated variants of C.MV and FMV cover
478 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
479 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
480
481 Note that:
482
483 * elwidth (SIMD) is not covered in the pseudo-code above
484 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
485 not covered
486 * zero predication is also not shown (TODO).
487
488 ### C.MV Instruction <a name="c_mv"></a>
489
490 There is no MV instruction in RV however there is a C.MV instruction.
491 It is used for copying integer-to-integer registers (vectorised FMV
492 is used for copying floating-point).
493
494 If either the source or the destination register are marked as vectors
495 C.MV is reinterpreted to be a vectorised (multi-register) predicated
496 move operation. The actual instruction's format does not change:
497
498 [[!table data="""
499 15 12 | 11 7 | 6 2 | 1 0 |
500 funct4 | rd | rs | op |
501 4 | 5 | 5 | 2 |
502 C.MV | dest | src | C0 |
503 """]]
504
505 A simplified version of the pseudocode for this operation is as follows:
506
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
520
521 There are several different instructions from RVV that are covered by
522 this one opcode:
523
524 [[!table data="""
525 src | dest | predication | op |
526 scalar | vector | none | VSPLAT |
527 scalar | vector | destination | sparse VSPLAT |
528 scalar | vector | 1-bit dest | VINSERT |
529 vector | scalar | 1-bit? src | VEXTRACT |
530 vector | vector | none | VCOPY |
531 vector | vector | src | Vector Gather |
532 vector | vector | dest | Vector Scatter |
533 vector | vector | src & dest | Gather/Scatter |
534 vector | vector | src == dest | sparse VCOPY |
535 """]]
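
As an illustration of the first row of the table, here is a hedged sketch
of a VSPLAT carried out with C.MV. The CSR / register-table setup is
shown as pseudocode only: the actual CSR encodings are covered in the
main specification.

    # sketch: VSPLAT of scalar x3 into x16..x19
    VL = 4
    int_csr[x16] = {active: 1, isvec: 1, regidx: 16}  # destination tagged as vector
    # x3 deliberately left untagged (scalar); no predication
    C.MV x16, x3    # hardware loop copies x3 into x16, x17, x18 and x19
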
536
537 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
538 operations with zeroing off, and inversion on the src and dest predication
539 for one of the two C.MV operations. The non-inverted C.MV will place
540 one set of registers into the destination, and the inverted one the other
541 set. With predicate-inversion, copying and inversion of the predicate mask
542 need not be done as a separate (scalar) instruction.
543
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
549
550 ### FMV, FNEG and FABS Instructions
551
552 These are identical in form to C.MV, except covering floating-point
553 register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size, covering the source and destination
register bitwidths.
558
559 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
560
### FCVT Instructions
562
563 These are again identical in form to C.MV, except that they cover
564 floating-point to integer and integer to floating-point. When element
565 width in each vector is set to default, the instructions behave exactly
566 as they are defined for standard RV (scalar) operations, except vectorised
567 in exactly the same fashion as outlined in C.MV.
568
569 However when the source or destination element width is not set to default,
570 the opcode's explicit element widths are *over-ridden* to new definitions,
571 and the opcode's element width is taken as indicative of the SIMD width
572 (if applicable i.e. if packed SIMD is requested) instead.
573
574 For example FCVT.S.L would normally be used to convert a 64-bit
575 integer in register rs1 to a 64-bit floating-point number in rd.
576 If however the source rs1 is set to be a vector, where elwidth is set to
577 default/2 and "packed SIMD" is enabled, then the first 32 bits of
578 rs1 are converted to a floating-point number to be stored in rd's
579 first element and the higher 32-bits *also* converted to floating-point
580 and stored in the second. The 32 bit size comes from the fact that
581 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
582 divide that by two it means that rs1 element width is to be taken as 32.
583
584 Similar rules apply to the destination register.
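
A hedged sketch of the FCVT.S.L example above, assuming both rs1 and rd
have elwidth overridden to 32 with packed SIMD, and using the union-style
register file notation introduced in the element-bitwidth section
(fp_regfile and fcvt_s_from_w are illustrative names only):

    # sketch: one iteration of vectorised FCVT.S.L, rs1/rd elwidth = 32
    lo = int_regfile[rs1].i[i*2]                    # low 32 bits of the 64-bit source
    hi = int_regfile[rs1].i[i*2 + 1]                # high 32 bits
    fp_regfile[rd].i[j*2]     = fcvt_s_from_w(lo)   # first single-precision result
    fp_regfile[rd].i[j*2 + 1] = fcvt_s_from_w(hi)   # second result
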
585
586 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
587
588 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
589 the interpretation of the instruction fields). This
590 actually undermined the fundamental principle of SV, namely that there
591 be no modifications to the scalar behaviour (except where absolutely
592 necessary), in order to simplify an implementor's task if considering
593 converting a pre-existing scalar design to support parallelism.
594
595 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
596 do not change in SV, however just as with C.MV it is important to note
597 that dual-predication is possible.
598
599 In vectorised architectures there are usually at least two different modes
600 for LOAD/STORE:
601
602 * Read (or write for STORE) from sequential locations, where one
603 register specifies the address, and the one address is incremented
604 by a fixed amount. This is usually known as "Unit Stride" mode.
605 * Read (or write) from multiple indirected addresses, where the
606 vector elements each specify separate and distinct addresses.
607
608 To support these different addressing modes, the CSR Register "isvector"
609 bit is used. So, for a LOAD, when the src register is set to
610 scalar, the LOADs are sequentially incremented by the src register
611 element width, and when the src register is set to "vector", the
612 elements are treated as indirection addresses. Simplified
613 pseudo-code would look like this:
614
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
634
635 Notes:
636
637 * For simplicity, zeroing and elwidth is not included in the above:
638 the key focus here is the decision-making for srcbase; vectorised
639 rs means use sequentially-numbered registers as the indirection
640 address, and scalar rs is "offset" mode.
641 * The test towards the end for whether both source and destination are
642 scalar is what makes the above pseudo-code provide the "standard" RV
643 Base behaviour for LD operations.
644 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
646 (8 bytes), and also whether the element width is over-ridden
647 (see special element width section).
648
649 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
650
651 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
652 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
653 It is therefore possible to use predicated C.LWSP to efficiently
654 pop registers off the stack (by predicating x2 as the source), cherry-picking
655 which registers to store to (by predicating the destination). Likewise
656 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
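
A hedged sketch of such a cherry-picked "POP": the predicate value and the
CSR setup below are illustrative pseudocode only, not actual encodings.

    # sketch: load x8, x10 and x11 from the stack, skipping x9
    VL = 4
    int_csr[x8] = {active: 1, isvec: 1, regidx: 8}  # x8..x11 as destination vector
    pred(x8)    = 0b1101                            # mask out element 1 (x9)
    C.LWSP x8, 0     # x2 is the implicit (scalar) source: unit-stride stack slots
    # result: x8, x10 and x11 are written; x9 is left untouched (non-zeroing)
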
657
658 The two modes ("unit stride" and multi-indirection) are still supported,
659 as with standard LD/ST. Essentially, the only difference is that the
660 use of x2 is hard-coded into the instruction.
661
662 **Note**: it is still possible to redirect x2 to an alternative target
663 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
664 general-purpose LOAD/STORE operations.
665
666 ## Compressed LOAD / STORE Instructions
667
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
on the src for LOAD and dest for STORE switches mode from "Unit Stride"
to "Multi-indirection", respectively.
673
674 # Element bitwidth polymorphism <a name="elwidth"></a>
675
676 Element bitwidth is best covered as its own special section, as it
677 is quite involved and applies uniformly across-the-board. SV restricts
678 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
679
680 The effect of setting an element bitwidth is to re-cast each entry
681 in the register table, and for all memory operations involving
682 load/stores of certain specific sizes, to a completely different width.
Thus, in c-style terms, on an RV64 architecture, effectively each register
now looks like this:
685
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
695
696 where the CSR Register table entry (not the instruction alone) determines
697 which of those union entries is to be used on each operation, and the
698 VL element offset in the hardware-loop specifies the index into each array.
699
However a naive interpretation of the data structure above masks the
fact that, when VL is set greater than 8 (for example) and the bitwidth
is 8, accessing one specific register "spills over" into the following
entries of the register file in a sequential fashion. So a much more
accurate way to reflect this would be:
705
    typedef union {
        uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t  b[0]; // array of type uint8_t
        uint16_t s[0];
        uint32_t i[0];
        uint64_t l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
716
717 where when accessing any individual regfile[n].b entry it is permitted
718 (in c) to arbitrarily over-run the *declared* length of the array (zero),
719 and thus "overspill" to consecutive register file entries in a fashion
720 that is completely transparent to a greatly-simplified software / pseudo-code
721 representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an attempt is ever made to access beyond the
"real" register bytes.
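
A hedged sketch of such a bounds check, assuming byte-granular indexing
into a 128-entry integer register file (names illustrative only):

    # sketch: trap if an elwidth "overspill" access walks off the regfile
    check_reg_access(regidx, offset, bitwidth):
        byte_start = regidx * (XLEN/8) + offset * (bitwidth/8)
        if byte_start + (bitwidth/8) > 128 * (XLEN/8):
            raise_illegal_instruction()   # beyond the "real" register bytes
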
726
Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size, where this pseudo-code is
otherwise identical to its "non"-polymorphic version (above):
730
    function op_add(rd, rs1, rs2) # add not VADD!
    ...
    ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
753
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each of them, respectively) are
"type-cast" to 8-bit; for 16-bit entries likewise, and so on.
757
758 However that only covers the case where the element widths are the same.
759 Where the element widths are different, the following algorithm applies:
760
761 * Analyse the bitwidth of all source operands and work out the
762 maximum. Record this as "maxsrcbitwidth"
763 * If any given source operand requires sign-extension or zero-extension
764 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
765 sign-extension / zero-extension or whatever is specified in the standard
766 RV specification, **change** that to sign-extending from the respective
767 individual source operand's bitwidth from the CSR table out to
768 "maxsrcbitwidth" (previously calculated), instead.
769 * Following separate and distinct (optional) sign/zero-extension of all
770 source operands as specifically required for that operation, carry out the
771 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
772 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
774 into a copy).
* If the destination operand requires sign-extension or zero-extension,
instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
etc.), overload the RV specification with the bitwidth from the
destination register's elwidth entry.
780 * Finally, store the (optionally) sign/zero-extended value into its
781 destination: memory for sb/sw etc., or an offset section of the register
782 file for an arithmetic operation.
783
784 In this way, polymorphic bitwidths are achieved without requiring a
785 massive 64-way permutation of calculations **per opcode**, for example
786 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
787 rd bitwidths). The pseudo-code is therefore as follows:
788
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth         # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
854
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
858
859 * the source operands are extended out to the maximum bitwidth of all
860 source operands
861 * the operation takes place at that maximum source bitwidth (the
862 destination bitwidth is not involved at this point, at all)
863 * the result is extended (or potentially even, truncated) before being
864 stored in the destination. i.e. truncation (if required) to the
865 destination width occurs **after** the operation **not** before.
866 * when the destination is not marked as "vectorised", the **full**
867 (standard, scalar) register file entry is taken up, i.e. the
868 element is either sign-extended or zero-extended to cover the
869 full register bitwidth (XLEN) if it is not already XLEN bits long.
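
A short worked example of the above rules (sketch only): an add where rs1
has elwidth set to 8, rs2 to 16 and the destination rd to 8, all three
marked as vectors, using the helpers from the earlier pseudocode:

    maxsrcwid = max(8, 16)                            # operation width: 16
    src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)  # 8-bit element fetched
    src1 = sign_extend(src1, 16)                      # extended from 8 to 16 bits
    src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)  # already 16-bit
    result = (src1 + src2) & 0xffff                   # add carried out at 16 bits
    set_polymorphed_reg(rd, 8, id, result)            # truncated to 8 bits on store
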
870
871 Implementors are entirely free to optimise the above, particularly
872 if it is specifically known that any given operation will complete
873 accurately in less bits, as long as the results produced are
874 directly equivalent and equal, for all inputs and all outputs,
875 to those produced by the above algorithm.
876
877 ## Polymorphic floating-point operation exceptions and error-handling
878
879 For floating-point operations, conversion takes place without raising any
880 kind of exception. Exactly as specified in the standard RV specification,
881 NAN (or appropriate) is stored if the result is beyond the range of the
882 destination, and, again, exactly as with the standard RV specification
883 just as with scalar operations, the floating-point flag is raised
884 (FCSR). And, again, just as with scalar operations, it is software's
885 responsibility to check this flag. Given that the FCSR flags are
886 "accrued", the fact that multiple element operations could have occurred
887 is not a problem.
888
889 Note that it is perfectly legitimate for floating-point bitwidths of
890 only 8 to be specified. However whilst it is possible to apply IEEE 754
891 principles, no actual standard yet exists. Implementors wishing to
892 provide hardware-level 8-bit support rather than throw a trap to emulate
893 in software should contact the author of this specification before
894 proceeding.
895
896 ## Polymorphic shift operators
897
898 A special note is needed for changing the element width of left and
899 right shift operators, particularly right-shift. Even for standard RV
900 base, in order for correct results to be returned, the second operand
901 RS2 must be truncated to be within the range of RS1's bitwidth.
902 spike's implementation of sll for example is as follows:
903
904 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
905
906 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
907 range 0..31 so that RS1 will only be left-shifted by the amount that
908 is possible to fit into a 32-bit register. Whilst this appears not
909 to matter for hardware, it matters greatly in software implementations,
910 and it also matters where an RV64 system is set to "RV32" mode, such
911 that the underlying registers RS1 and RS2 comprise 64 hardware bits
912 each.
913
914 For SV, where each operand's element bitwidth may be over-ridden, the
915 rule about determining the operation's bitwidth *still applies*, being
916 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
917 **also applies to the truncation of RS2**. In other words, *after*
918 determining the maximum bitwidth, RS2's range must **also be truncated**
919 to ensure a correct answer. Example:
920
921 * RS1 is over-ridden to a 16-bit width
922 * RS2 is over-ridden to an 8-bit width
923 * RD is over-ridden to a 64-bit width
924 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
925 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
926
927 Pseudocode (in spike) for this example would therefore be:
928
929 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
930
931 This example illustrates that considerable care therefore needs to be
932 taken to ensure that left and right shift operations are implemented
933 correctly. The key is that
934
935 * The operation bitwidth is determined by the maximum bitwidth
936 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
938
939 ## Polymorphic MULH/MULHU/MULHSU
940
MULH is designed to take the top half MSBs of a multiply that
does not fit within the range of the source operands, such that
smaller width operations may produce a full double-width multiply
in two instructions. The issue is: SV allows the source operands to
have variable bitwidth.
946
947 Here again special attention has to be paid to the rules regarding
948 bitwidth, which, again, are that the operation is performed at
949 the maximum bitwidth of the **source** registers. Therefore:
950
951 * An 8-bit x 8-bit multiply will create a 16-bit result that must
952 be shifted down by 8 bits
953 * A 16-bit x 8-bit multiply will create a 24-bit result that must
954 be shifted down by 16 bits (top 8 bits being zero)
955 * A 16-bit x 16-bit multiply will create a 32-bit result that must
956 be shifted down by 16 bits
957 * A 32-bit x 16-bit multiply will create a 48-bit result that must
958 be shifted down by 32 bits
959 * A 32-bit x 8-bit multiply will create a 40-bit result that must
960 be shifted down by 32 bits
961
962 So again, just as with shift-left and shift-right, the result
963 is shifted down by the maximum of the two source register bitwidths.
964 And, exactly again, truncation or sign-extension is performed on the
965 result. If sign-extension is to be carried out, it is performed
966 from the same maximum of the two source register bitwidths out
967 to the result element's bitwidth.
968
969 If truncation occurs, i.e. the top MSBs of the result are lost,
970 this is "Officially Not Our Problem", i.e. it is assumed that the
971 programmer actually desires the result to be truncated. i.e. if the
972 programmer wanted all of the bits, they would have set the destination
973 elwidth to accommodate them.
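
A hedged sketch of the rule for MULH with rs1 elwidth=16, rs2 elwidth=8
and rd elwidth=16 (signed variant; helper names as in the earlier
pseudocode):

    opwidth = max(16, 8)                     # operation width: 16
    src1 = sign_extend(src1, 2*opwidth)      # widen both operands to 32 bits
    src2 = sign_extend(src2, 2*opwidth)
    product = src1 * src2                    # at most 24 significant bits
    result = product >> opwidth              # MULH: top half, shifted down by 16
    set_polymorphed_reg(rd, 16, id, result)  # sign-extend/truncate to rd's elwidth
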
974
975 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
976
977 Polymorphic element widths in vectorised form means that the data
978 being loaded (or stored) across multiple registers needs to be treated
979 (reinterpreted) as a contiguous stream of elwidth-wide items, where
980 the source register's element width is **independent** from the destination's.
981
982 This makes for a slightly more complex algorithm when using indirection
983 on the "addressed" register (source for LOAD and destination for STORE),
984 particularly given that the LOAD/STORE instruction provides important
985 information about the width of the data to be reinterpreted.
986
987 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
988 was as follows, and i is the loop from 0 to VL-1:
989
990 srcbase = ireg[rs+i];
991 return mem[srcbase + imm]; // returns XLEN bits
992
993 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
994 chunks are taken from the source memory location addressed by the current
995 indexed source address register, and only when a full 32-bits-worth
996 are taken will the index be moved on to the next contiguous source
997 address register:
998
999 bitwidth = bw(elwidth); // source elwidth from CSR reg entry
1000 elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
1001 srcbase = ireg[rs+i/(elsperblock)]; // integer divide
1002 offs = i % elsperblock; // modulo
1003 return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
1004
1005 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
1006 and 128 for LQ.
1007
1008 The principle is basically exactly the same as if the srcbase were pointing
1009 at the memory of the *register* file: memory is re-interpreted as containing
1010 groups of elwidth-wide discrete elements.
1011
1012 When storing the result from a load, it's important to respect the fact
1013 that the destination register has its *own separate element width*. Thus,
1014 when each element is loaded (at the source element width), any sign-extension
1015 or zero-extension (or truncation) needs to be done to the *destination*
1016 bitwidth. Also, the storing has the exact same analogous algorithm as
1017 above, where in fact it is just the set\_polymorphed\_reg pseudocode
1018 (completely unchanged) used above.
1019
1020 One issue remains: when the source element width is **greater** than
1021 the width of the operation, it is obvious that a single LB for example
1022 cannot possibly obtain 16-bit-wide data. This condition may be detected
1023 where, when using integer divide, elsperblock (the width of the LOAD
1024 divided by the bitwidth of the element) is zero.
1025
The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
1029
1030 The elements, if the element bitwidth is larger than the LD operation's
1031 size, will then be sign/zero-extended to the full LD operation size, as
1032 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1033 being passed on to the second phase.
1034
1035 As LOAD/STORE may be twin-predicated, it is important to note that
1036 the rules on twin predication still apply, except where in previous
1037 pseudo-code (elwidth=default for both source and target) it was
1038 the *registers* that the predication was applied to, it is now the
1039 **elements** that the predication is applied to.
1040
1041 Thus the full pseudocode for all LD operations may be written out
1042 as follows:
1043
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs + i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        bitwidth = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, bitwidth, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1081
1082 Note:
1083
1084 * when comparing against for example the twin-predicated c.mv
1085 pseudo-code, the pattern of independent incrementing of rd and rs
1086 is preserved unchanged.
1087 * just as with the c.mv pseudocode, zeroing is not included and must be
1088 taken into account (TODO).
1089 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1090 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1091 VSCATTER characteristics.
1092 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1093 a destination that is not vectorised (marked as scalar) will
1094 result in the element being fully sign-extended or zero-extended
1095 out to the full register file bitwidth (XLEN). When the source
1096 is also marked as scalar, this is how the compatibility with
1097 standard RV LOAD/STORE is preserved by this algorithm.
1098
1099 ### Example Tables showing LOAD elements
1100
1101 This section contains examples of vectorised LOAD operations, showing
1102 how the two stage process works (three if zero/sign-extension is included).
1103
1104
1105 #### Example: LD x8, x5(0), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1106
1107 This is:
1108
1109 * a 64-bit load, with an offset of zero
1110 * with a source-address elwidth of 16-bit
1111 * into a destination-register with an elwidth of 32-bit
1112 * where VL=7
1113 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1114 * RV64, where XLEN=64 is assumed.
1115
1116 First, the memory table, which, due to the element width being 16 and the
1117 operation being LD (64), the 64-bits loaded from memory are subdivided
1118 into groups of **four** elements. And, with VL being 7 (deliberately
1119 to illustrate that this is reasonable and possible), the first four are
1120 sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1122
1123 [[!table data="""
1124 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1125 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1126 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1127 """]]
1128
1129 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1130 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1131
[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0    | 0x0    | elem0  ||
0x0    | 0x0    | elem1  ||
0x0    | 0x0    | elem2  ||
0x0    | 0x0    | elem3  ||
0x0    | 0x0    | elem4  ||
0x0    | 0x0    | elem5  ||
0x0    | 0x0    | elem6  ||
"""]]
1143
1144 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1145 byte-addressable "memory". That "memory" happens to cover registers
1146 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1147
1148 [[!table data="""
1149 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1150 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1151 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1152 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1153 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1154 """]]
1155
1156 Thus we have data that is loaded from the **addresses** pointed to by
1157 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1158 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1162
1163 Note that whilst the memory addressing table is shown left-to-right byte order,
1164 the registers are shown in right-to-left (MSB) order. This does **not**
1165 imply that bit or byte-reversal is carried out: it's just easier to visualise
1166 memory as being contiguous bytes, and emphasises that registers are not
1167 really actually "memory" as such.
1168
1169 ## Why SV bitwidth specification is restricted to 4 entries
1170
1171 The four entries for SV element bitwidths only allows three over-rides:
1172
1173 * 8 bit
* 16 bit
1175 * 32 bit
1176
This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex, no RV128 implementation yet exists, and
RV64's default is already 64 bit, so the four major element widths are
covered anyway.
1181
There is an absolutely crucial aspect of SV here that explicitly
needs spelling out, and it's whether the "vectorised" bit is set in
the Register's CSR entry.
1185
If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when set on a destination (RD),
sign-extension and zero-extension, whilst changed to match the
override bitwidth (if set), will overwrite the **full** register entry
(64-bit if RV64).
1191
1192 When vectorised is *set*, this indicates that the operation now treats
1193 **elements** as if they were independent registers, so regardless of
1194 the length, any parts of a given actual register that are not involved
1195 in the operation are **NOT** modified, but are **PRESERVED**.
1196
1197 For example:
1198
1199 * when the vector bit is clear and elwidth set to 16 on the destination
1200 register, operations are truncated to 16 bit and then sign or zero
1201 extended to the *FULL* XLEN register width.
1202 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1203 groups of elwidth sized elements do not fill an entire XLEN register),
1204 the "top" bits of the destination register do *NOT* get modified, zero'd
1205 or otherwise overwritten.
1206
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of the
multi-element operation.
1210
1211 Other microarchitectures may choose to provide byte-level write-enable
1212 lines on the register file, such that each 64 bit register in an RV64
1213 system requires 8 WE lines. Scalar RV64 operations would require
1214 activation of all 8 lines, where SV elwidth based operations would
1215 activate the required subset of those byte-level write lines.
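
A hedged sketch of how such byte-level write-enables might be derived for
one destination element on RV64 (purely illustrative, not a normative
micro-architecture):

    # sketch: which of the 8 WE lines to activate for destination element j
    bytes_per_el = dest_elwidth / 8               # e.g. 1 when elwidth is 8
    start = (j * bytes_per_el) % 8                # byte offset within the register
    for b in (0..7):
        WE[b] = (b >= start) and (b < start + bytes_per_el)
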
1216
1217 Example:
1218
1219 * rs1, rs2 and rd are all set to 8-bit
1220 * VL is set to 3
1221 * RV64 architecture is set (UXL=64)
1222 * add operation is carried out
1223 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1224 concatenated with similar add operations on bits 15..8 and 7..0
1225 * bits 24 through 63 **remain as they originally were**.
1226
1227 Example SIMD micro-architectural implementation:
1228
1229 * SIMD architecture works out the nearest round number of elements
1230 that would fit into a full RV64 register (in this case: 8)
1231 * SIMD architecture creates a hidden predicate, binary 0b00000111
1232 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1233 * SIMD architecture goes ahead with the add operation as if it
1234 was a full 8-wide batch of 8 adds
1235 * SIMD architecture passes top 5 elements through the adders
1236 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them in rd.
1239
1240 This requires a read on rd, however this is required anyway in order
1241 to support non-zeroing mode.
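
A minimal python model of that SIMD approach, applied to the VL=3, 8-bit
add example above (illustrative only; the function name and the values
passed in are assumptions):

    # read rd, perform a full batch of 8 byte-wide adds, then merge under
    # the hidden predicate 0b00000111 so bytes 3..7 of rd come back unmodified
    def simd_byte_add(rd_val, rs1_val, rs2_val, vl=3):
        hidden_pred = (1 << vl) - 1            # 0b00000111 for VL=3
        result = 0
        for lane in range(8):                  # 8 x 8-bit lanes in an RV64 register
            shift = lane * 8
            if hidden_pred & (1 << lane):
                byte = ((rs1_val >> shift) + (rs2_val >> shift)) & 0xFF
            else:
                byte = (rd_val >> shift) & 0xFF  # predicated-out: rd passes through
            result |= byte << shift
        return result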
1242
1243 ## Polymorphic floating-point
1244
1245 Standard scalar RV integer operations base the register width on XLEN,
1246 which may be changed (UXL in USTATUS, and the corresponding MXL and
1247 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1248 arithmetic operations are therefore restricted to an active XLEN bits,
1249 with sign or zero extension to pad out the upper bits when XLEN has
1250 been dynamically set to less than the actual register size.
1251
1252 For scalar floating-point, the active (used / changed) bits are
1253 specified exclusively by the operation: ADD.S specifies an active
1254 32-bits, with the upper bits of the source registers needing to
1255 be all 1s ("NaN-boxed"), and the destination upper bits being
1256 *set* to all 1s (including on LOAD/STOREs).
1257
1258 Where elwidth is set to default (on any source or the destination)
1259 it is obvious that this NaN-boxing behaviour can and should be
1260 preserved. When elwidth is non-default things are less obvious,
1261 so need to be thought through. Here is a normal (scalar) sequence,
1262 assuming an RV64 which supports Quad (128-bit) FLEN:
1263
1264 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1265 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1266 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1267 top 64 MSBs ignored.
1268
1269 Therefore it makes sense to mirror this behaviour when, for example,
1270 elwidth is set to 32. Assume elwidth set to 32 on all source and
1271 destination registers:
1272
1273 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1274 floating-point numbers.
1275 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1276 in bits 0-31 and the second in bits 32-63.
1277 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1278
1279 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1280 of the registers either during the FLD **or** the ADD.D. The reason
1281 is that, effectively, the top 64 MSBs actually represent a completely
1282 independent 64-bit register, so overwriting it is not only gratuitous
1283 but may actually be harmful for a future extension to SV which may
1284 have a way to directly access those top 64 bits.
1285
1286 The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
1288 when "isvec" is false in a given register's CSR entry. Only when the
1289 elwidth is set to default **and** isvec is false will the standard
1290 RV behaviour be followed, namely that the upper bits be modified.
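
A hedged python sketch of that write rule for the floating-point register
file (illustrative only: `write_fpr`, the argument names and the RV64 with
Quad FLEN assumption are not taken from the spec):

    def write_fpr(fregs, rd, value, value_bits, flen=128,
                  elwidth_default=True, isvec=False):
        mask = (1 << value_bits) - 1
        if elwidth_default and not isvec:
            # standard RV scalar behaviour: NaN-box, i.e. all bits above
            # the written value are set to 1s
            boxed_upper = ((1 << flen) - 1) & ~mask
            fregs[rd] = boxed_upper | (value & mask)
        else:
            # non-default elwidth (or isvec set): the upper bits of the
            # register are left completely untouched
            fregs[rd] = (fregs[rd] & ~mask) | (value & mask)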
1291
1292 Ultimately if elwidth is default and isvec false on *all* source
1293 and destination registers, a SimpleV instruction defaults completely
1294 to standard RV scalar behaviour (this holds true for **all** operations,
1295 right across the board).
1296
The nice thing here is that ADD.S, ADD.D and ADD.Q with non-default
elwidth are effectively all the same: they all still perform
multiple ADD operations, just at different widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1303
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar OoO architecture
there may be absolutely no difference; however, simpler SIMD-style
microarchitectures may not have the infrastructure in place to know the
difference, such that when VL=8 an ADD.D instruction completes in 2
cycles (or more) rather than one, where an ADD.Q issued instead on such
simpler microarchitectures would complete in one.
1314
1315 ## Specific instruction walk-throughs
1316
1317 This section covers walk-throughs of the above-outlined procedure
1318 for converting standard RISC-V scalar arithmetic operations to
1319 polymorphic widths, to ensure that it is correct.
1320
1321 ### add
1322
1323 Standard Scalar RV32/RV64 (xlen):
1324
1325 * RS1 @ xlen bits
1326 * RS2 @ xlen bits
1327 * add @ xlen bits
1328 * RD @ xlen bits
1329
1330 Polymorphic variant:
1331
1332 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1333 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1334 * add @ max(rs1, rs2) bits
1335 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1336
1337 Note here that polymorphic add zero-extends its source operands,
1338 where addw sign-extends.
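
A hedged python sketch of the polymorphic add rules listed above
(illustrative only: the function and argument names are assumptions,
not spec pseudocode):

    # sources are zero-extended to the wider of the two source widths,
    # the add is performed at that width, and the result is truncated
    # (or zero-extended) to the destination's width
    def poly_add(rs1_val, rs2_val, rs1_bits, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        src1 = rs1_val & ((1 << rs1_bits) - 1)   # zero-extends rs1 to opwidth
        src2 = rs2_val & ((1 << rs2_bits) - 1)   # zero-extends rs2 to opwidth
        result = (src1 + src2) & ((1 << opwidth) - 1)
        return result & ((1 << rd_bits) - 1)     # truncate; zero-extend if rd wider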
1339
1340 ### addw
1341
1342 The RV Specification specifically states that "W" variants of arithmetic
1343 operations always produce 32-bit signed values. In a polymorphic
1344 environment it is reasonable to assume that the signed aspect is
1345 preserved, where it is the length of the operands and the result
1346 that may be changed.
1347
1348 Standard Scalar RV64 (xlen):
1349
1350 * RS1 @ xlen bits
1351 * RS2 @ xlen bits
1352 * add @ xlen bits
1353 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1354
1355 Polymorphic variant:
1356
1357 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1358 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1359 * add @ max(rs1, rs2) bits
1360 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1361
1362 Note here that polymorphic addw sign-extends its source operands,
1363 where add zero-extends.
1364
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.
1369
1370 Effectively however, both rs1 and rs2 are being sign-extended (or
1371 truncated), where for add they are both zero-extended. This holds true
1372 for all arithmetic operations ending with "W".
1373
1374 ### addiw
1375
1376 Standard Scalar RV64I:
1377
1378 * RS1 @ xlen bits, truncated to 32-bit
1379 * immed @ 12 bits, sign-extended to 32-bit
1380 * add @ 32 bits
1381 * RD @ rd bits. sign-extend to rd if rd > 32, otherwise truncate.
1382
1383 Polymorphic variant:
1384
1385 * RS1 @ rs1 bits
1386 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1387 * add @ max(rs1, 12) bits
1388 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
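
A hedged python sketch of the polymorphic addiw rules above (illustrative
only: `sign_ext` and `poly_addiw` are assumed names, and rs1 is simply
masked to its own width, following the bullets literally):

    def sign_ext(val, frombits, tobits):
        val &= (1 << frombits) - 1
        if val & (1 << (frombits - 1)):              # MSB set: extend with 1s
            val |= ((1 << tobits) - 1) & ~((1 << frombits) - 1)
        return val

    # the 12-bit immediate is sign-extended to max(rs1, 12) bits, the add is
    # done at that width, and the result is sign-extended or truncated to rd
    def poly_addiw(rs1_val, imm12, rs1_bits, rd_bits):
        opwidth = max(rs1_bits, 12)
        src1 = rs1_val & ((1 << rs1_bits) - 1)
        imm  = sign_ext(imm12, 12, opwidth)
        result = (src1 + imm) & ((1 << opwidth) - 1)
        if rd_bits > opwidth:
            return sign_ext(result, opwidth, rd_bits)  # sign-extend up to rd
        return result & ((1 << rd_bits) - 1)           # otherwise truncate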
1389
1390 # Predication Element Zeroing
1391
1392 The introduction of zeroing on traditional vector predication is usually
1393 intended as an optimisation for lane-based microarchitectures with register
1394 renaming to be able to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
1396 do not have this issue: they simply do not pass the element through to
1397 the ALU at all, and therefore do not store it back in the destination.
1398 More complex non-lane-based micro-architectures can, when zeroing is
1399 not set, use the predication bits to simply avoid sending element-based
1400 operations to the ALUs, entirely: thus, over the long term, potentially
1401 keeping all ALUs 100% occupied even when elements are predicated out.
1402
1403 SimpleV's design principle is not based on or influenced by
1404 microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(i.e. whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1409
1410 ## Single-predication (based on destination register)
1411
1412 Zeroing on predication for arithmetic operations is taken from
1413 the destination register's predicate. i.e. the predication *and*
1414 zeroing settings to be applied to the whole operation come from the
1415 CSR Predication table entry for the destination register.
1416 Thus when zeroing is set on predication of a destination element,
1417 if the predication bit is clear, then the destination element is *set*
1418 to zero (twin-predication is slightly different, and will be covered
1419 next).
1420
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1423
    for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
            while (!(predval & 1<<i) && i < VL)
                if (int_vec[rd ].isvector)  { ird += 1; }
                if (int_vec[rs1].isvector)  { irs1 += 1; }
                if (int_vec[rs2].isvector)  { irs2 += 1; }
            if i == VL:
                return
        if (predval & 1<<i)
            src1 = ....
            src2 = ...
            result = src1 + src2 # actual add (or other op) here
            set_polymorphed_reg(rd, destwid, ird, result)
            if int_vec[rd].ffirst and result == 0:
                VL = i # result was zero, end loop early, return VL
                return
            if (!int_vec[rd].isvector) return
        else if zeroing:
            result = 0
            set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { ird += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (ird == VL or irs1 == VL or irs2 == VL): return
1450
1451 The optimisation to skip elements entirely is only possible for certain
1452 micro-architectures when zeroing is not set. However for lane-based
1453 micro-architectures this optimisation may not be practical, as it
1454 implies that elements end up in different "lanes". Under these
1455 circumstances it is perfectly fine to simply have the lanes
1456 "inactive" for predicated elements, even though it results in
1457 less than 100% ALU utilisation.
1458
1459 ## Twin-predication (based on source and destination register)
1460
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated, *or both*, or neither.
1465
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1472
1473 When zeroing is set on the destination and not the source, then just
1474 as with single-predicated operations, a zero is stored into the destination
1475 element (or target memory address for a STORE).
1476
Zeroing on both source and destination effectively results in a bitwise
AND of the source and destination predicates determining where real data
is transferred: where either the source predicate OR the destination
predicate is set to 0, a zero element will ultimately end up in the
destination register.
1481
1482 However: this may not necessarily be the case for all operations;
1483 implementors, particularly of custom instructions, clearly need to
1484 think through the implications in each and every case.
1485
1486 Here is pseudo-code for a twin zero-predicated operation:
1487
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
        pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL):
            if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
            if ((pd & 1<<j))
                if ((ps & 1<<i))        # source predicate set: real data
                    sourcedata = ireg[rs+i];
                else                    # source zero-predicated
                    sourcedata = 0
                ireg[rd+j] <= sourcedata
            else if (zerodst)
                ireg[rd+j] <= 0
            if (int_csr[rs].isvec)
                i++;
            if (int_csr[rd].isvec)
                j++;
            else
                if ((pd & 1<<j))
                    break;
1511
1512 Note that in the instance where the destination is a scalar, the hardware
1513 loop is ended the moment a value *or a zero* is placed into the destination
1514 register/element. Also note that, for clarity, variable element widths
1515 have been left out of the above.
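
For reference, here is a simplified executable python model of the same
twin zero-predicated MV. It is an assumption-laden sketch, not spec
pseudocode: both registers are treated as vectorised, the scalar-destination
early exit is simplified, and element widths are ignored:

    def twin_pred_mv(src, dst, ps, pd, zerosrc, zerodst, VL):
        i = j = 0
        while i < VL and j < VL:
            if not zerosrc:                 # non-zeroing src: skip masked-out elements
                while i < VL and not (ps >> i) & 1: i += 1
            if not zerodst:                 # non-zeroing dst: skip masked-out elements
                while j < VL and not (pd >> j) & 1: j += 1
            if i >= VL or j >= VL:
                break
            if (pd >> j) & 1:
                # source zero-predicated: a zero is passed through instead of data
                dst[j] = src[i] if (ps >> i) & 1 else 0
            elif zerodst:
                dst[j] = 0                  # dest zero-predicated: store zero
            i += 1
            j += 1
        return dst

    # example: source zeroing active, so masked-out source elements become zeros
    print(twin_pred_mv([1, 2, 3, 4], [9, 9, 9, 9],
                       0b1010, 0b1111, True, False, 4))  # [0, 2, 0, 4]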
1516
1517 # Subsets of RV functionality
1518
1519 This section describes the differences when SV is implemented on top of
1520 different subsets of RV.
1521
1522 ## Common options
1523
1524 It is permitted to only implement SVprefix and not the VBLOCK instruction
1525 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1526 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1527 traps may emulate the format.
1528
1529 It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
1531 *MUST* raise illegal instruction on implementations that do not support
1532 VL or SUBVL.
1533
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will result
in non-compliance with the SV Specification.
1538
1539 ## RV32 / RV32F
1540
1541 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1542 maximum limit for predication is also restricted to 32 bits. Whilst not
1543 actually specifically an "option" it is worth noting.
1544
1545 ## RV32G
1546
Normally, in standard RV32, it does not make much sense to have RV32G.
The critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1551
1552 In an earlier draft of SV, it was possible to specify an elwidth
1553 of double the standard register size: this had to be dropped,
1554 and may be reintroduced in future revisions.
1555
1556 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1557
1558 When floating-point is not implemented, the size of the User Register and
1559 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1560 per table).
1561
1562 ## RV32E
1563
1564 In embedded scenarios the User Register and Predication CSRs may be
1565 dropped entirely, or optionally limited to 1 CSR, such that the combined
1566 number of entries from the M-Mode CSR Register table plus U-Mode
1567 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1568 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1569 the Predication CSR tables.
1570
1571 RV32E is the most likely candidate for simply detecting that registers
1572 are marked as "vectorised", and generating an appropriate exception
1573 for the VL loop to be implemented in software.
1574
1575 ## RV128
1576
1577 RV128 has not been especially considered, here, however it has some
1578 extremely large possibilities: double the element width implies
1579 256-bit operands, spanning 2 128-bit registers each, and predication
1580 of total length 128 bit given that XLEN is now 128.
1581
1582 # Example usage
1583
1584 TODO evaluate strncpy and strlen
1585 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1586
## strncpy <a name="strncpy"></a>
1588
1589 RVV version:
1590
1591 strncpy:
1592 c.mv a3, a0 # Copy dst
1593 loop:
1594 setvli x0, a2, vint8 # Vectors of bytes.
1595 vlbff.v v1, (a1) # Get src bytes
1596 vseq.vi v0, v1, 0 # Flag zero bytes
1597 vmfirst a4, v0 # Zero found?
1598 vmsif.v v0, v0 # Set mask up to and including zero byte.
1599 vsb.v v1, (a3), v0.t # Write out bytes
1600 c.bgez a4, exit # Done
1601 csrr t1, vl # Get number of bytes fetched
1602 c.add a1, a1, t1 # Bump src pointer
1603 c.sub a2, a2, t1 # Decrement count.
1604 c.add a3, a3, t1 # Bump dst pointer
1605 c.bnez a2, loop # Anymore?
1606
1607 exit:
1608 c.ret
1609
1610 SV version (WIP):
1611
1612 strncpy:
1613 c.mv a3, a0
1614 VBLK.RegCSR[t0] = 8bit, t0, vector
1615 VBLK.PredTb[t0] = ffirst, x0, inv
1616 loop:
1617 VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8)
1618 c.ldb t0, (a1) # t0 fail first mode
1619 c.bne t0, x0, allnonzero # still ff
1620 # VL (t4) points to last nonzero
1621 c.addi t4, t4, 1 # include zero
1622 c.stb t0, (a3) # store incl zero
1623 c.ret # end subroutine
1624 allnonzero:
1625 c.stb t0, (a3) # VL legal range
1626 c.add a1, a1, t4 # Bump src pointer
1627 c.sub a2, a2, t4 # Decrement count.
1628 c.add a3, a3, t4 # Bump dst pointer
1629 c.bnez a2, loop # Anymore?
1630 exit:
1631 c.ret
1632
1633 Notes:
1634
1635 * Setting MVL to 8 is just an example. If enough registers are spare it
1636 may be set to XLEN which will require a bank of 8 scalar registers for
1637 a1, a3 and t0.
1638 * obviously if that is done, t0 is not separated by 8 full registers, and
1639 would overwrite t1 thru t7. x80 would work well, as an example, instead.
1640 * with the exception of the GETVL (a pseudo code alias for csrr), every
1641 single instruction above may use RVC.
1642 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1643 registers through redirection
1644 * RVC C.LW and C.SW may be used because the W format may be overridden by
1645 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1646 * with the exception of the GETVL, all Vector Context may be done in
1647 VBLOCK form.
1648 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1649 just ffirst on t0
1650 * ldb and bne are both using t0, both in ffirst mode
1651 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1652 vectorised, no (un)sign-extension or truncation" mode.
1653 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff
1654 into t0 (could contain zeros).
1655 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1656 scalar x0
1657 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1658 compares, and reduce VL as well
1659 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1662 * SETVL sets *exactly* the requested amount into VL.
1663 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1664 activates but the bne allzeros does not.
1665 * this would cause the stb to copy up to the end of the legal memory
1666 * of course, on the next loop the ldb would throw a trap, as a1 now
1667 points to the first illegal mem location.
1668
1669 ## strcpy
1670
1671 RVV version:
1672
1673 mv a3, a0 # Save start
1674 loop:
1675 setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
1676 vldbff.v v1, (a3) # Get bytes
1677 csrr a1, vl # Get bytes actually read e.g. if fault
1678 vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
1679 add a3, a3, a1 # Bump pointer
1680 vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
1681 bltz a2, loop # Not found?
1682 add a0, a0, a1 # Sum start + bump
1683 add a3, a3, a2 # Add index of zero byte
1684 sub a0, a3, a0 # Subtract start address+bump
1685 ret
1686
1687 ## DAXPY <a name="daxpy"></a>
1688
1689 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1690
1691 Notes:
1692
1693 * Setting MVL to 4 is just an example. With enough space between the
1694 FP regs, MVL may be set to larger values
1695 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1696 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1697 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1698 * All instructions except fmadd may use Compressed variants. Total
1699 number of 16-bit instruction words: 11.
1700 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1701
1702 ## BigInt add <a name="bigadd"></a>
1703
1704 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]