1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 30 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes <a name="ffirst"></a>
11
12 Fail-on-first data dependency has different behaviour for traps than
13 for conditional testing. "Conditional" is taken to mean "anything
14 that is zero", however with traps, the first element has to
15 be given the opportunity to throw the exact same trap that would
16 be thrown if this were a scalar operation (when VL=1).
17
Note that implementors are required to choose one mode or the other,
mutually exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies
to custom opcode writers as well as to future extension writers.
22
23 ## Fail-on-first traps
24
25 Except for the first element, ffirst stops sequential element processing
26 when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and all
subsequent indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence element
that did not take the trap.
31
32 Note that predicated-out elements (where the predicate mask bit is
33 zero) are clearly excluded (i.e. the trap will not occur). However,
34 note that the loop still had to test the predicate bit: thus on return,
35 VL is set to include elements that did not take the trap *and* includes
36 the elements that were predicated (masked) out (not tested up to the
37 point where the trap occurred).
38
39 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
40 unaltered by setting zero or non-zero predication mode.
41
If SUBVL is in use (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst were not set); in subsequent
*sub-groups*, the trap must not occur: processing stops instead. SUBVL
will **NOT** be modified. Trap handlers must analyse (x)eSTATE (the subvl
offset indices) to determine which element caused the trap.
47
48 Given that predication bits apply to SUBVL groups, the same rules apply
49 to predicated-out (masked-out) sub-groups in calculating the value that
50 VL is set to.
51
52 ## Fail-on-first conditional tests
53
54 ffirst stops sequential (or sequentially-appearing in the case of
55 out-of-order designs) element conditional testing on the first element
56 result being zero (or other "fail" condition). VL is set to the number
57 of elements that were (sequentially) processed before the fail-condition
58 was encountered.
59
60 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
61 responds to changes in the zero or non-zero predication mode. Whilst
62 in non-zeroing mode, masked-out elements are simply not tested (and
63 thus considered "never to fail"), in zeroing mode, masked-out elements
64 may be viewed as *always* (unconditionally) failing. This effectively
65 turns VL into something akin to a software-controlled loop.
66
Note that just as with traps, if SUBVL!=1, the first fail-condition within
a *sub-group* will cause processing to end, and, even if there were
elements within that *sub-group* that passed the test, the sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
software must analyse (x)eSTATE (the subvl offset indices) to determine
which element caused processing to stop.
75
76 Note again that, just as with traps, predicated-out (masked-out) elements
77 are included in the (sequential) count leading up to the fail-condition,
78 even though they were not tested.
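
To illustrate the rules above, here is a minimal Python sketch (a
hypothetical helper, not part of the specification; SUBVL and sub-groups
are omitted) showing how VL would be truncated by a data-dependent
fail-on-first conditional sweep, including the treatment of masked-out
elements in zeroing and non-zeroing modes:

    def ffirst_conditional_vl(results, pred, zeroing, VL):
        # results[i]: element test result (zero means "fail")
        # pred: predicate bitmask; zeroing: True selects zeroing mode
        for i in range(VL):
            if not (pred & (1 << i)):
                if zeroing:
                    return i   # masked-out element: unconditionally "fails"
                continue       # non-zeroing: not tested, "never fails"
            if results[i] == 0:
                return i       # first real fail: VL excludes this element
        return VL              # no fail encountered: VL is left unchanged

    # elements 0..2 pass, element 3 fails: VL becomes 3
    assert ffirst_conditional_vl([5, 1, 7, 0, 9], 0b11111, False, 5) == 3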
79
80 # Instructions <a name="instructions" />
81
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
*All* RVV instructions can be re-mapped; however, xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Although all RVV opcodes are removed (as explicit
instructions), *all instructions from RVV Base, with the exception of CLIP
and VSELECT.X, are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever gained
a MV.X as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
92
93 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
94 equivalents, so are left out of Simple-V. VSELECT could be included if
95 there existed a MV.X instruction in RV (MV.X is a hypothetical
96 non-immediate variant of MV that would allow another register to
97 specify which register was to be copied). Note that if any of these three
98 instructions are added to any given RV extension, their functionality
99 will be inherently parallelised.
100
101 With some exceptions, where it does not make sense or is simply too
102 challenging, all RV-Base instructions are parallelised:
103
104 * CSR instructions, whilst a case could be made for fast-polling of
105 a CSR into multiple registers, or for being able to copy multiple
106 contiguously addressed CSRs into contiguous registers, and so on,
107 are the fundamental core basis of SV. If parallelised, extreme
108 care would need to be taken. Additionally, CSR reads are done
using x0, and it is *really* inadvisable to tag x0.
110 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
111 left as scalar.
112 * LR/SC could hypothetically be parallelised however their purpose is
113 single (complex) atomic memory operations where the LR must be followed
114 up by a matching SC. A sequence of parallel LR instructions followed
115 by a sequence of parallel SC instructions therefore is guaranteed to
116 not be useful. Not least: the guarantees of a Multi-LR/SC
117 would be impossible to provide if emulated in a trap.
118 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
119 paralleliseable anyway.
120
121 All other operations using registers are automatically parallelised.
122 This includes AMOMAX, AMOSWAP and so on, where particular care and
123 attention must be paid.
124
125 Example pseudo-code for an integer ADD operation (including scalar
126 operations). Floating-point uses the FP Register Table.
127
128 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
129
130 Note that for simplicity there is quite a lot missing from the above
131 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
132 reshaping and offsets and so on. However it demonstrates the basic
133 principle. Augmentations that produce the full pseudo-code are covered in
134 other sections.
135
136 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
137
138 Adding in support for SUBVL is a matter of adding in an extra inner
139 for-loop, where register src and dest are still incremented inside the
140 inner part. Note that the predication is still taken from the VL index.
141
142 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
143 indexed by "(i)"
144
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
168
169
170 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
171 elwidth handling etc. all left out.
172
173 ## Instruction Format
174
175 It is critical to appreciate that there are
176 **no operations added to SV, at all**.
177
178 Instead, by using CSRs to tag registers as an indication of "changed
179 behaviour", SV *overloads* pre-existing branch operations into predicated
180 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
181 LOAD/STORE depending on CSR configurations for bitwidth and predication.
182 **Everything** becomes parallelised. *This includes Compressed
183 instructions* as well as any future instructions and Custom Extensions.
184
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that XLEN=32/64/128.
187 FRM changes the behaviour of the floating-point unit, to alter the rounding
188 mode. Other architectures change the LOAD/STORE byte-order from big-endian
189 to little-endian on a per-instruction basis. SV is just a little more...
190 comprehensive in its effect on instructions.
191
192 ## Branch Instructions
193
194 Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
196 of multiple comparisons into a register (taken indirectly from the predicate
197 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
198 See ffirst mode in the Predication Table section.
199
200 There are two registers for the comparison operation, therefore there is
201 the opportunity to associate two predicate registers. The first is a
202 "normal" predicate register, which acts just as it does on any other
203 single-predicated operation: masks out elements where a bit is zero,
204 applies an inversion to the predicate mask, and enables zeroing / non-zeroing
205 mode.
206
207 The second is utilised to indicate where the results of each comparison
208 are to be stored, as a bitmask. Additionally, the behaviour of the branch
209 - when it occurs - may also be modified depending on whether the predicate
210 "invert" and "zeroing" bits are set.
211 These four combinations result in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR), cbranch.ifall (AND), cbranch.ifnotall (NAND).
212
213 | invert | zeroing | description | operation | cbranch |
214 | ------ | ------- | --------------------------- | --------- | ------- |
215 | 0 | 0 | branch if all pass | AND | ifall |
216 | 1 | 0 | branch if one fails | NAND | ifnall |
217 | 0 | 1 | branch if one passes | OR | ifany |
218 | 1 | 1 | branch if all fail | NOR | ifnone |
219
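Purely as an illustration of the table (this is not normative pseudocode),
the decision of whether the consensual branch is taken could be written as
follows, where result is the accumulated bitmask of element comparisons:

    def consensual_branch_taken(invert, zeroing, result, nelems):
        # maps the (invert, zeroing) predicate bits onto the four
        # "consensual branch" conditions listed in the table above
        all_ones = (1 << nelems) - 1
        if not invert and not zeroing:
            return result == all_ones   # ifall:  branch if all pass (AND)
        if invert and not zeroing:
            return result != all_ones   # ifnall: branch if one fails (NAND)
        if not invert and zeroing:
            return result != 0          # ifany:  branch if one passes (OR)
        return result == 0              # ifnone: branch if all fail (NOR)
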
220 This inversion capability covers AND, OR, NAND and NOR branching
221 based on multiple element comparisons. Without the full set of four,
222 it is necessary to have two-sequence branch operations: one conditional, one
223 unconditional.
224
Note that, unlike the early termination (short-circuiting) of chains of
AND or OR conditional tests familiar from normal computer programming,
the chain does *not* terminate early unless fail-on-first is set,
and even then ffirst ends on the first data-dependent zero. When ffirst
mode is not set, *all* conditional element tests must be performed (and
the result optionally stored in the result mask), with a "post-analysis"
phase carried out which checks whether to branch.
232
233 ### Standard Branch <a name="standard_branch"></a>
234
235 Branch operations use standard RV opcodes that are reinterpreted to
236 be "predicate variants" in the instance where either of the two src
237 registers are marked as vectors (active=1, vector=1).
238
239 Note that the predication register to use (if one is enabled) is taken from
240 the *first* src register, and that this is used, just as with predicated
241 arithmetic operations, to mask whether the comparison operations take
242 place or not. The target (destination) predication register
243 to use (if one is enabled) is taken from the *second* src register.
244
245 If either of src1 or src2 are scalars (whether by there being no
246 CSR register entry or whether by the CSR entry specifically marking
247 the register as "scalar") the comparison goes ahead as vector-scalar
248 or scalar-vector.
249
250 In instances where no vectorisation is detected on either src registers
251 the operation is treated as an absolutely standard scalar branch operation.
252 Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
255
256 Note that when zero-predication is enabled (from source rs1),
257 a cleared bit in the predicate indicates that the result
258 of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
260 when zeroing is not set: bits in the destination predicate are
261 only *set*; they are **not** cleared. This is important to appreciate,
262 as there may be an expectation that, going into the hardware-loop,
263 the destination predicate is always expected to be set to zero:
264 this is **not** the case. The destination predicate is only set
265 to zero if **zeroing** is enabled.
266
267 Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
269 src1 and src2, however note that in doing so, the predicate table
270 setup must also be correspondingly adjusted.
271
272 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
273 for predicated compare operations of function "cmp":
274
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
279
280 With associated predication, vector-length adjustments and so on,
281 and temporarily ignoring bitwidth (which makes the comparisons more
282 complex), this becomes:
283
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion, pred_zeroing = get_pred_flags(rs2)
    else
        pred_inversion, pred_zeroing = False, False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);
                if ffirst_mode:
                    break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if pred_zeroing:
            if result != 0:
                goto branch
        else:
            if result == 0:
                goto branch
    else:
        if pred_zeroing:
            if (result & ps) != result:
                goto branch
        else:
            if (result & ps) == result:
                goto branch
339
340 Notes:
341
342 * Predicated SIMD comparisons would break src1 and src2 further down
343 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
344 Reordering") setting Vector-Length times (number of SIMD elements) bits
345 in Predicate Register rd, as opposed to just Vector-Length bits.
346 * The execution of "parallelised" instructions **must** be implemented
347 as "re-entrant" (to use a term from software). If an exception (trap)
348 occurs during the middle of a vectorised
349 Branch (now a SV predicated compare) operation, the partial results
350 of any comparisons must be written out to the destination
351 register before the trap is permitted to begin. If however there
352 is no predicate, the **entire** set of comparisons must be **restarted**,
353 with the offset loop indices set back to zero. This is because
354 there is no place to store the temporary result during the handling
355 of traps.
356
357 TODO: predication now taken from src2. also branch goes ahead
358 if all compares are successful.
359
Note also that, whereas normally predication requires that there
also be a CSR register entry for the register being used in order
for the **predication** CSR register entry to be active,
363 for branches this is **not** the case. src2 does **not** have
364 to have its CSR register entry marked as active in order for
365 predication on src2 to be active.
366
367 Also note: SV Branch operations are **not** twin-predicated
368 (see Twin Predication section). This would require three
369 element offsets: one to track src1, one to track src2 and a third
370 to track where to store the accumulation of the results. Given
371 that the element offsets need to be exposed via CSRs so that
372 the parallel hardware looping may be made re-entrant on traps
373 and exceptions, the decision was made not to make SV Branches
374 twin-predicated.
375
376 ### Floating-point Comparisons
377
There are no floating-point branch operations, only compares.
379 Interestingly no change is needed to the instruction format because
380 FP Compare already stores a 1 or a zero in its "rd" integer register
381 target, i.e. it's not actually a Branch at all: it's a compare.
382
383 In RV (scalar) Base, a branch on a floating-point compare is
384 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
385 This does extend to SV, as long as x1 (in the example sequence given)
386 is vectorised. When that is the case, x1..x(1+VL-1) will also be
387 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
388 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
389 so on. Consequently, unlike integer-branch, FP Compare needs no
390 modification in its behaviour.
391
392 In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
393 missing, and whilst in ordinary branch code this is fine because the
394 standard RVF compare can always be followed up with an integer BEQ or
395 a BNE (or a compressed comparison to zero or non-zero), in predication
terms the lack of an FNE has more of an impact. To deal with this, SV's predication
397 has had "invert" added to it.
398
399 Also: note that FP Compare may be predicated, using the destination
400 integer register (rd) to determine the predicate. FP Compare is **not**
401 a twin-predication operation, as, again, just as with SV Branches,
402 there are three registers involved: FP src1, FP src2 and INT rd.
403
404 Also: note that ffirst (fail first mode) applies directly to this operation.
405
406 ### Compressed Branch Instruction
407
408 Compressed Branch instructions are, just like standard Branch instructions,
409 reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. However, as there is only one source register,
given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
413 table entries for **x0**.
414
The specific required use of x0 is, with a little thought, quite logical,
although initially counterintuitive. Clearly it is **not** recommended to redirect
417 x0 with a CSR register entry, however as a means to opaquely obtain
418 a predication target it is the only sensible option that does not involve
419 additional special CSRs (or, worse, additional special opcodes).
420
421 Note also that, just as with standard branches, the 2nd source
422 (in this case x0 rather than src2) does **not** have to have its CSR
423 register table marked as "active" in order for predication to work.
424
425 ## Vectorised Dual-operand instructions
426
427 There is a series of 2-operand instructions involving copying (and
428 sometimes alteration):
429
430 * C.MV
431 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
432 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
433 * LOAD(-FP) and STORE(-FP)
434
435 All of these operations follow the same two-operand pattern, so it is
436 *both* the source *and* destination predication masks that are taken into
437 account. This is different from
438 the three-operand arithmetic instructions, where the predication mask
439 is taken from the *destination* register, and applied uniformly to the
440 elements of the source register(s), element-for-element.
441
442 The pseudo-code pattern for twin-predicated operations is as
443 follows:
444
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
458
459 This pattern covers scalar-scalar, scalar-vector, vector-scalar
460 and vector-vector, and predicated variants of all of those.
461 Zeroing is not presently included (TODO). As such, when compared
462 to RVV, the twin-predicated variants of C.MV and FMV cover
463 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
464 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
465
466 Note that:
467
468 * elwidth (SIMD) is not covered in the pseudo-code above
469 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
470 not covered
471 * zero predication is also not shown (TODO).
472
473 ### C.MV Instruction <a name="c_mv"></a>
474
475 There is no MV instruction in RV however there is a C.MV instruction.
476 It is used for copying integer-to-integer registers (vectorised FMV
477 is used for copying floating-point).
478
479 If either the source or the destination register are marked as vectors
480 C.MV is reinterpreted to be a vectorised (multi-register) predicated
481 move operation. The actual instruction's format does not change:
482
483 [[!table data="""
484 15 12 | 11 7 | 6 2 | 1 0 |
485 funct4 | rd | rs | op |
486 4 | 5 | 5 | 2 |
487 C.MV | dest | src | C0 |
488 """]]
489
490 A simplified version of the pseudocode for this operation is as follows:
491
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
505
506 There are several different instructions from RVV that are covered by
507 this one opcode:
508
509 [[!table data="""
510 src | dest | predication | op |
511 scalar | vector | none | VSPLAT |
512 scalar | vector | destination | sparse VSPLAT |
513 scalar | vector | 1-bit dest | VINSERT |
514 vector | scalar | 1-bit? src | VEXTRACT |
515 vector | vector | none | VCOPY |
516 vector | vector | src | Vector Gather |
517 vector | vector | dest | Vector Scatter |
518 vector | vector | src & dest | Gather/Scatter |
519 vector | vector | src == dest | sparse VCOPY |
520 """]]
521
522 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
523 operations with zeroing off, and inversion on the src and dest predication
524 for one of the two C.MV operations. The non-inverted C.MV will place
525 one set of registers into the destination, and the inverted one the other
526 set. With predicate-inversion, copying and inversion of the predicate mask
527 need not be done as a separate (scalar) instruction.
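
A short Python sketch of that idea (illustrative only; twin predication
and zeroing are simplified away, and the function name is invented):

    def vmerge_via_two_cmv(vec_a, vec_b, pred, VL):
        # merge vec_a/vec_b into a destination under a predicate, as two
        # back-to-back predicated (non-zeroing) moves: the first C.MV uses
        # pred as-is, the second uses the inverted predicate
        dest = [None] * VL
        for i in range(VL):               # first C.MV: predicate as-is
            if pred & (1 << i):
                dest[i] = vec_a[i]
        for i in range(VL):               # second C.MV: predicate inverted
            if not (pred & (1 << i)):
                dest[i] = vec_b[i]
        return dest

    # pred=0b0101 selects vec_a for elements 0 and 2, vec_b for 1 and 3
    assert vmerge_via_two_cmv([1, 2, 3, 4], [9, 8, 7, 6], 0b0101, 4) == [1, 8, 3, 6]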
528
529 Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
534
535 ### FMV, FNEG and FABS Instructions
536
537 These are identical in form to C.MV, except covering floating-point
538 register copying. The same double-predication rules also apply.
However, when elwidth is not set to default, the instruction is implicitly
and automatically converted to a (vectorised) floating-point type-conversion
541 operation of the appropriate size covering the source and destination
542 register bitwidths.
543
544 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
545
### FCVT Instructions
547
548 These are again identical in form to C.MV, except that they cover
549 floating-point to integer and integer to floating-point. When element
550 width in each vector is set to default, the instructions behave exactly
551 as they are defined for standard RV (scalar) operations, except vectorised
552 in exactly the same fashion as outlined in C.MV.
553
554 However when the source or destination element width is not set to default,
555 the opcode's explicit element widths are *over-ridden* to new definitions,
556 and the opcode's element width is taken as indicative of the SIMD width
557 (if applicable i.e. if packed SIMD is requested) instead.
558
559 For example FCVT.S.L would normally be used to convert a 64-bit
560 integer in register rs1 to a 64-bit floating-point number in rd.
561 If however the source rs1 is set to be a vector, where elwidth is set to
562 default/2 and "packed SIMD" is enabled, then the first 32 bits of
563 rs1 are converted to a floating-point number to be stored in rd's
564 first element and the higher 32-bits *also* converted to floating-point
565 and stored in the second. The 32 bit size comes from the fact that
566 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
567 divide that by two it means that rs1 element width is to be taken as 32.
568
569 Similar rules apply to the destination register.
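
As a rough illustration of the FCVT.S.L example above (a sketch only,
assuming rs1's elwidth is overridden to 32 with packed SIMD; the function
name is invented), each 32-bit half of rs1 is converted to a
single-precision result in its own destination element:

    import struct

    def fcvt_s_l_packed(rs1_64bit):
        # split the 64-bit source into two 32-bit elements (elwidth = 32)
        halves = [(rs1_64bit >> sh) & 0xFFFFFFFF for sh in (0, 32)]
        # reinterpret each as a signed 32-bit integer
        halves = [h - (1 << 32) if h & (1 << 31) else h for h in halves]
        # convert each to a single-precision float bit-pattern: these become
        # elements 0 and 1 of the destination register
        return [struct.unpack("<I", struct.pack("<f", float(h)))[0]
                for h in halves]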
570
571 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
572
573 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
574 the interpretation of the instruction fields). This
575 actually undermined the fundamental principle of SV, namely that there
576 be no modifications to the scalar behaviour (except where absolutely
577 necessary), in order to simplify an implementor's task if considering
578 converting a pre-existing scalar design to support parallelism.
579
580 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
581 do not change in SV, however just as with C.MV it is important to note
582 that dual-predication is possible.
583
584 In vectorised architectures there are usually at least two different modes
585 for LOAD/STORE:
586
587 * Read (or write for STORE) from sequential locations, where one
588 register specifies the address, and the one address is incremented
589 by a fixed amount. This is usually known as "Unit Stride" mode.
590 * Read (or write) from multiple indirected addresses, where the
591 vector elements each specify separate and distinct addresses.
592
593 To support these different addressing modes, the CSR Register "isvector"
594 bit is used. So, for a LOAD, when the src register is set to
595 scalar, the LOADs are sequentially incremented by the src register
596 element width, and when the src register is set to "vector", the
597 elements are treated as indirection addresses. Simplified
598 pseudo-code would look like this:
599
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
619
620 Notes:
621
622 * For simplicity, zeroing and elwidth is not included in the above:
623 the key focus here is the decision-making for srcbase; vectorised
624 rs means use sequentially-numbered registers as the indirection
625 address, and scalar rs is "offset" mode.
626 * The test towards the end for whether both source and destination are
627 scalar is what makes the above pseudo-code provide the "standard" RV
628 Base behaviour for LD operations.
629 * The offset in bytes (XLEN/8) changes depending on whether the
operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
631 (8 bytes), and also whether the element width is over-ridden
632 (see special element width section).
633
634 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
635
636 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
637 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
638 It is therefore possible to use predicated C.LWSP to efficiently
639 pop registers off the stack (by predicating x2 as the source), cherry-picking
640 which registers to store to (by predicating the destination). Likewise
641 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
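
A Python sketch of the "pop multiple" idea (illustrative only, with
invented names; elwidth and zeroing are omitted): the source predicate
selects which stack words, relative to x2, are read, and the destination
predicate cherry-picks which registers receive them:

    def predicated_lwsp_pop(stack_words, sp, src_pred, dest_pred, VL):
        # twin-predicated C.LWSP-style pop: x2 (sp) is the implicit base,
        # addresses are unit-strided words starting at sp
        regs = {}
        i = j = 0
        while i < VL and j < VL:
            while i < VL and not (src_pred & (1 << i)):
                i += 1                     # skip masked-out stack slots
            while j < VL and not (dest_pred & (1 << j)):
                j += 1                     # skip masked-out destination regs
            if i >= VL or j >= VL:
                break
            regs[j] = stack_words[sp // 4 + i]
            i += 1
            j += 1
        return regs                        # {dest element index: loaded word}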
642
643 The two modes ("unit stride" and multi-indirection) are still supported,
644 as with standard LD/ST. Essentially, the only difference is that the
645 use of x2 is hard-coded into the instruction.
646
647 **Note**: it is still possible to redirect x2 to an alternative target
648 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
649 general-purpose LOAD/STORE operations.
650
651 ## Compressed LOAD / STORE Instructions
652
653 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
654 where the same rules apply and the same pseudo-code apply as for
655 non-compressed LOAD/STORE. Again: setting scalar or vector mode
656 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
657 to "Multi-indirection", respectively.
658
659 # Element bitwidth polymorphism <a name="elwidth"></a>
660
661 Element bitwidth is best covered as its own special section, as it
662 is quite involved and applies uniformly across-the-board. SV restricts
663 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
664
665 The effect of setting an element bitwidth is to re-cast each entry
666 in the register table, and for all memory operations involving
667 load/stores of certain specific sizes, to a completely different width.
Thus, in c-style terms, on an RV64 architecture, effectively each register
669 now looks like this:
670
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
680
681 where the CSR Register table entry (not the instruction alone) determines
682 which of those union entries is to be used on each operation, and the
683 VL element offset in the hardware-loop specifies the index into each array.
684
685 However a naive interpretation of the data structure above masks the
686 fact that setting VL greater than 8, for example, when the bitwidth is 8,
687 accessing one specific register "spills over" to the following parts of
688 the register file in a sequential fashion. So a much more accurate way
689 to reflect this would be:
690
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
701
702 where when accessing any individual regfile[n].b entry it is permitted
703 (in c) to arbitrarily over-run the *declared* length of the array (zero),
704 and thus "overspill" to consecutive register file entries in a fashion
705 that is completely transparent to a greatly-simplified software / pseudo-code
706 representation.
707 It is however critical to note that it is clearly the responsibility of
708 the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register
bytes is ever attempted.
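
The same "overspill" behaviour can be modelled with a short Python sketch
(illustrative only), treating the register file as a flat byte array so
that element accesses naturally spill into the following register:

    import struct

    XLEN_BYTES = 8                           # RV64
    regfile = bytearray(128 * XLEN_BYTES)    # SV 7-bit regfile, as raw bytes

    def set_element(reg, elwidth_bytes, offset, value):
        # write element `offset` of width `elwidth_bytes`, starting at
        # register `reg`; large offsets spill into following registers
        addr = reg * XLEN_BYTES + offset * elwidth_bytes
        if addr + elwidth_bytes > len(regfile):
            raise IndexError("access beyond the real register file")
        fmt = {1: "<B", 2: "<H", 4: "<I", 8: "<Q"}[elwidth_bytes]
        regfile[addr:addr + elwidth_bytes] = struct.pack(fmt, value)

    # elwidth=8: element 9 of "register 5" actually lands in register 6
    set_element(5, 1, 9, 0xFF)
    assert regfile[6 * XLEN_BYTES + 1] == 0xFF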
711
Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size, this pseudo-code being otherwise
identical to its non-polymorphic version (above):
715
    function op_add(rd, rs1, rs2) # add not VADD!
    ...
    ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
738
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each of them) are "type-cast"
to 8-bit; for 16-bit entries likewise, and so on.
742
743 However that only covers the case where the element widths are the same.
744 Where the element widths are different, the following algorithm applies:
745
746 * Analyse the bitwidth of all source operands and work out the
747 maximum. Record this as "maxsrcbitwidth"
748 * If any given source operand requires sign-extension or zero-extension
749 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
750 sign-extension / zero-extension or whatever is specified in the standard
751 RV specification, **change** that to sign-extending from the respective
752 individual source operand's bitwidth from the CSR table out to
753 "maxsrcbitwidth" (previously calculated), instead.
754 * Following separate and distinct (optional) sign/zero-extension of all
755 source operands as specifically required for that operation, carry out the
756 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
757 this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
759 into a copy).
760 * If the destination operand requires sign-extension or zero-extension,
761 instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
763 etc.), overload the RV specification with the bitwidth from the
764 destination register's elwidth entry.
765 * Finally, store the (optionally) sign/zero-extended value into its
766 destination: memory for sb/sw etc., or an offset section of the register
767 file for an arithmetic operation.
768
769 In this way, polymorphic bitwidths are achieved without requiring a
770 massive 64-way permutation of calculations **per opcode**, for example
771 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
772 rd bitwidths). The pseudo-code is therefore as follows:
773
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2)   # source element width(s)
    destwid = bw(int_csr[rd].elwidth)       # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
839
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
843
844 * the source operands are extended out to the maximum bitwidth of all
845 source operands
846 * the operation takes place at that maximum source bitwidth (the
847 destination bitwidth is not involved at this point, at all)
848 * the result is extended (or potentially even, truncated) before being
849 stored in the destination. i.e. truncation (if required) to the
850 destination width occurs **after** the operation **not** before.
851 * when the destination is not marked as "vectorised", the **full**
852 (standard, scalar) register file entry is taken up, i.e. the
853 element is either sign-extended or zero-extended to cover the
854 full register bitwidth (XLEN) if it is not already XLEN bits long.
855
856 Implementors are entirely free to optimise the above, particularly
857 if it is specifically known that any given operation will complete
858 accurately in less bits, as long as the results produced are
859 directly equivalent and equal, for all inputs and all outputs,
860 to those produced by the above algorithm.
861
862 ## Polymorphic floating-point operation exceptions and error-handling
863
864 For floating-point operations, conversion takes place without raising any
865 kind of exception. Exactly as specified in the standard RV specification,
866 NAN (or appropriate) is stored if the result is beyond the range of the
867 destination, and, again, exactly as with the standard RV specification
868 just as with scalar operations, the floating-point flag is raised
869 (FCSR). And, again, just as with scalar operations, it is software's
870 responsibility to check this flag. Given that the FCSR flags are
871 "accrued", the fact that multiple element operations could have occurred
872 is not a problem.
873
874 Note that it is perfectly legitimate for floating-point bitwidths of
875 only 8 to be specified. However whilst it is possible to apply IEEE 754
876 principles, no actual standard yet exists. Implementors wishing to
877 provide hardware-level 8-bit support rather than throw a trap to emulate
878 in software should contact the author of this specification before
879 proceeding.
880
881 ## Polymorphic shift operators
882
883 A special note is needed for changing the element width of left and
884 right shift operators, particularly right-shift. Even for standard RV
885 base, in order for correct results to be returned, the second operand
886 RS2 must be truncated to be within the range of RS1's bitwidth.
887 spike's implementation of sll for example is as follows:
888
889 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
890
891 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
892 range 0..31 so that RS1 will only be left-shifted by the amount that
893 is possible to fit into a 32-bit register. Whilst this appears not
894 to matter for hardware, it matters greatly in software implementations,
895 and it also matters where an RV64 system is set to "RV32" mode, such
896 that the underlying registers RS1 and RS2 comprise 64 hardware bits
897 each.
898
899 For SV, where each operand's element bitwidth may be over-ridden, the
900 rule about determining the operation's bitwidth *still applies*, being
901 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
902 **also applies to the truncation of RS2**. In other words, *after*
903 determining the maximum bitwidth, RS2's range must **also be truncated**
904 to ensure a correct answer. Example:
905
906 * RS1 is over-ridden to a 16-bit width
907 * RS2 is over-ridden to an 8-bit width
908 * RD is over-ridden to a 64-bit width
909 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
910 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
911
912 Pseudocode (in spike) for this example would therefore be:
913
914 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
915
916 This example illustrates that considerable care therefore needs to be
917 taken to ensure that left and right shift operations are implemented
918 correctly. The key is that
919
920 * The operation bitwidth is determined by the maximum bitwidth
921 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate,
as sketched below.
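
Here is a Python sketch of that rule (names and signature are invented for
illustration; sign-extension of the result is omitted):

    def sv_poly_sll(rs1, rs2, rs1_width, rs2_width, rd_width):
        # polymorphic shift-left: operate at the maximum *source* width,
        # truncate RS2 to that width's range, then fit the result to rd
        opwidth = max(rs1_width, rs2_width)      # e.g. max(16, 8) == 16
        shamt = rs2 & (opwidth - 1)              # RS2 truncated to 0..opwidth-1
        result = (rs1 & ((1 << opwidth) - 1)) << shamt
        return result & ((1 << rd_width) - 1)    # truncate/fit to destination

    # RS1 elwidth=16, RS2 elwidth=8, RD elwidth=64: shift amount taken mod 16
    assert sv_poly_sll(0x0001, 17, 16, 8, 64) == (1 << (17 & 15))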
923
924 ## Polymorphic MULH/MULHU/MULHSU
925
926 MULH is designed to take the top half MSBs of a multiply that
927 does not fit within the range of the source operands, such that
928 smaller width operations may produce a full double-width multiply
929 in two cycles. The issue is: SV allows the source operands to
930 have variable bitwidth.
931
932 Here again special attention has to be paid to the rules regarding
933 bitwidth, which, again, are that the operation is performed at
934 the maximum bitwidth of the **source** registers. Therefore:
935
936 * An 8-bit x 8-bit multiply will create a 16-bit result that must
937 be shifted down by 8 bits
938 * A 16-bit x 8-bit multiply will create a 24-bit result that must
939 be shifted down by 16 bits (top 8 bits being zero)
940 * A 16-bit x 16-bit multiply will create a 32-bit result that must
941 be shifted down by 16 bits
942 * A 32-bit x 16-bit multiply will create a 48-bit result that must
943 be shifted down by 32 bits
944 * A 32-bit x 8-bit multiply will create a 40-bit result that must
945 be shifted down by 32 bits
946
947 So again, just as with shift-left and shift-right, the result
948 is shifted down by the maximum of the two source register bitwidths.
949 And, exactly again, truncation or sign-extension is performed on the
950 result. If sign-extension is to be carried out, it is performed
951 from the same maximum of the two source register bitwidths out
952 to the result element's bitwidth.
953
954 If truncation occurs, i.e. the top MSBs of the result are lost,
955 this is "Officially Not Our Problem", i.e. it is assumed that the
956 programmer actually desires the result to be truncated. i.e. if the
957 programmer wanted all of the bits, they would have set the destination
958 elwidth to accommodate them.
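
A Python sketch of the unsigned case (illustrative only; names invented,
sign-extension elided):

    def sv_poly_mulhu(rs1, rs2, rs1_width, rs2_width, rd_width):
        # unsigned MULH with element-width overrides: multiply at the
        # maximum source width, then take the top half by shifting down
        # by that same maximum source width
        opwidth = max(rs1_width, rs2_width)
        a = rs1 & ((1 << opwidth) - 1)
        b = rs2 & ((1 << opwidth) - 1)
        full = a * b                         # up to 2*opwidth bits
        high = full >> opwidth               # "top half" of the multiply
        return high & ((1 << rd_width) - 1)  # truncate to destination elwidth

    # 16-bit x 8-bit: result shifted down by 16 (the max of the two widths)
    assert sv_poly_mulhu(0xFFFF, 0xFF, 16, 8, 16) == (0xFFFF * 0xFF) >> 16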
959
960 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
961
962 Polymorphic element widths in vectorised form means that the data
963 being loaded (or stored) across multiple registers needs to be treated
964 (reinterpreted) as a contiguous stream of elwidth-wide items, where
965 the source register's element width is **independent** from the destination's.
966
967 This makes for a slightly more complex algorithm when using indirection
968 on the "addressed" register (source for LOAD and destination for STORE),
969 particularly given that the LOAD/STORE instruction provides important
970 information about the width of the data to be reinterpreted.
971
972 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
973 was as follows, and i is the loop from 0 to VL-1:
974
975 srcbase = ireg[rs+i];
976 return mem[srcbase + imm]; // returns XLEN bits
977
978 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
979 chunks are taken from the source memory location addressed by the current
980 indexed source address register, and only when a full 32-bits-worth
981 are taken will the index be moved on to the next contiguous source
982 address register:
983
984 bitwidth = bw(elwidth); // source elwidth from CSR reg entry
985 elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
986 srcbase = ireg[rs+i/(elsperblock)]; // integer divide
987 offs = i % elsperblock; // modulo
988 return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
989
990 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
991 and 128 for LQ.
992
993 The principle is basically exactly the same as if the srcbase were pointing
994 at the memory of the *register* file: memory is re-interpreted as containing
995 groups of elwidth-wide discrete elements.
996
997 When storing the result from a load, it's important to respect the fact
998 that the destination register has its *own separate element width*. Thus,
999 when each element is loaded (at the source element width), any sign-extension
1000 or zero-extension (or truncation) needs to be done to the *destination*
1001 bitwidth. Also, the storing has the exact same analogous algorithm as
1002 above, where in fact it is just the set\_polymorphed\_reg pseudocode
1003 (completely unchanged) used above.
1004
1005 One issue remains: when the source element width is **greater** than
1006 the width of the operation, it is obvious that a single LB for example
1007 cannot possibly obtain 16-bit-wide data. This condition may be detected
1008 where, when using integer divide, elsperblock (the width of the LOAD
1009 divided by the bitwidth of the element) is zero.
1010
1011 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
1012
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
1014
1015 The elements, if the element bitwidth is larger than the LD operation's
1016 size, will then be sign/zero-extended to the full LD operation size, as
1017 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1018 being passed on to the second phase.
1019
1020 As LOAD/STORE may be twin-predicated, it is important to note that
1021 the rules on twin predication still apply, except where in previous
1022 pseudo-code (elwidth=default for both source and target) it was
1023 the *registers* that the predication was applied to, it is now the
1024 **elements** that the predication is applied to.
1025
1026 Thus the full pseudocode for all LD operations may be written out
1027 as follows:
1028
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = bw(int_csr[rd].elwidth)   # destination element width
        bitwidth = bw(int_csr[rs].elwidth)  # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1066
1067 Note:
1068
1069 * when comparing against for example the twin-predicated c.mv
1070 pseudo-code, the pattern of independent incrementing of rd and rs
1071 is preserved unchanged.
1072 * just as with the c.mv pseudocode, zeroing is not included and must be
1073 taken into account (TODO).
1074 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1075 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1076 VSCATTER characteristics.
1077 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1078 a destination that is not vectorised (marked as scalar) will
1079 result in the element being fully sign-extended or zero-extended
1080 out to the full register file bitwidth (XLEN). When the source
1081 is also marked as scalar, this is how the compatibility with
1082 standard RV LOAD/STORE is preserved by this algorithm.
1083
1084 ### Example Tables showing LOAD elements
1085
1086 This section contains examples of vectorised LOAD operations, showing
1087 how the two stage process works (three if zero/sign-extension is included).
1088
1089
1090 #### Example: LD x8, x5(0), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1091
1092 This is:
1093
1094 * a 64-bit load, with an offset of zero
1095 * with a source-address elwidth of 16-bit
1096 * into a destination-register with an elwidth of 32-bit
1097 * where VL=7
1098 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1099 * RV64, where XLEN=64 is assumed.
1100
1101 First, the memory table, which, due to the element width being 16 and the
1102 operation being LD (64), the 64-bits loaded from memory are subdivided
1103 into groups of **four** elements. And, with VL being 7 (deliberately
1104 to illustrate that this is reasonable and possible), the first four are
1105 sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1107
1108 [[!table data="""
1109 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1110 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1111 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1112 """]]
1113
1114 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1115 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1116
1117 [[!table data="""
1118 byte 3 | byte 2 | byte 1 | byte 0 |
1119 0x0 | 0x0 | elem0 ||
1120 0x0 | 0x0 | elem1 ||
1121 0x0 | 0x0 | elem2 ||
1122 0x0 | 0x0 | elem3 ||
1123 0x0 | 0x0 | elem4 ||
1124 0x0 | 0x0 | elem5 ||
1125 0x0 | 0x0 | elem6 ||
1127 """]]
1128
1129 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1130 byte-addressable "memory". That "memory" happens to cover registers
1131 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1132
1133 [[!table data="""
1134 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1135 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1136 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1137 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1138 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1139 """]]
1140
1141 Thus we have data that is loaded from the **addresses** pointed to by
1142 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1143 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
1146 LSBs of x11.
1147
1148 Note that whilst the memory addressing table is shown left-to-right byte order,
1149 the registers are shown in right-to-left (MSB) order. This does **not**
1150 imply that bit or byte-reversal is carried out: it's just easier to visualise
1151 memory as being contiguous bytes, and emphasises that registers are not
1152 really actually "memory" as such.
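
The two-stage process of this example can be reproduced with a short
Python sketch (illustrative only: the names, memory layout and the flat
list standing in for x8-x11 are all invented for the example; in real SV
the untouched top half of x11 is preserved rather than being zero):

    def ld_elwidthed_example(mem, x5, x6, VL=7, src_ew=16, dest_ew=32,
                             opwidth=64):
        # stage 1: gather VL 16-bit elements, four per 64-bit LD block,
        # from the addresses held in x5 then (contiguously) x6
        srcregs = [x5, x6]
        elsperblock = opwidth // src_ew          # 4 elements per LD
        dest = [0, 0, 0, 0]                      # stands in for x8..x11
        for i in range(VL):
            base = srcregs[i // elsperblock]     # which address register
            offs = (i % elsperblock) * (src_ew // 8)
            elem = int.from_bytes(mem[base + offs:base + offs + src_ew // 8],
                                  "little")
            # stage 2: zero-extend to 32 bits and pack two elements per
            # destination register
            reg, half = divmod(i, 64 // dest_ew)
            dest[reg] |= (elem & 0xFFFFFFFF) << (half * dest_ew)
        return dest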
1153
1154 ## Why SV bitwidth specification is restricted to 4 entries
1155
The four entries for SV element bitwidths allow only three over-rides:

* 8 bit
* 16 bit
* 32 bit
1161
This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer is that
it gets too complex, that no RV128 implementation yet exists, and that
RV64's default is 64 bit anyway, so the 4 major element widths are covered
regardless.
1166
There is an absolutely crucial aspect of SV here that explicitly
1168 needs spelling out, and it's whether the "vectorised" bit is set in
1169 the Register's CSR entry.
1170
1171 If "vectorised" is clear (not set), this indicates that the operation
1172 is "scalar". Under these circumstances, when set on a destination (RD),
1173 then sign-extension and zero-extension, whilst changed to match the
1174 override bitwidth (if set), will erase the **full** register entry
1175 (64-bit if RV64).
1176
1177 When vectorised is *set*, this indicates that the operation now treats
1178 **elements** as if they were independent registers, so regardless of
1179 the length, any parts of a given actual register that are not involved
1180 in the operation are **NOT** modified, but are **PRESERVED**.
1181
1182 For example:
1183
1184 * when the vector bit is clear and elwidth set to 16 on the destination
1185 register, operations are truncated to 16 bit and then sign or zero
1186 extended to the *FULL* XLEN register width.
1187 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1188 groups of elwidth sized elements do not fill an entire XLEN register),
1189 the "top" bits of the destination register do *NOT* get modified, zero'd
1190 or otherwise overwritten.
1191
1192 SIMD micro-architectures may implement this by using predication on
1193 any elements in a given actual register that are beyond the end of
the multi-element operation.
1195
1196 Other microarchitectures may choose to provide byte-level write-enable
1197 lines on the register file, such that each 64 bit register in an RV64
1198 system requires 8 WE lines. Scalar RV64 operations would require
1199 activation of all 8 lines, where SV elwidth based operations would
1200 activate the required subset of those byte-level write lines.
1201
1202 Example:
1203
1204 * rs1, rs2 and rd are all set to 8-bit
1205 * VL is set to 3
1206 * RV64 architecture is set (UXL=64)
1207 * add operation is carried out
1208 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1209 concatenated with similar add operations on bits 15..8 and 7..0
1210 * bits 24 through 63 **remain as they originally were**.
1211
1212 Example SIMD micro-architectural implementation:
1213
1214 * SIMD architecture works out the nearest round number of elements
1215 that would fit into a full RV64 register (in this case: 8)
1216 * SIMD architecture creates a hidden predicate, binary 0b00000111
1217 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
* SIMD architecture goes ahead with the add operation as if it
  were a full 8-wide batch of 8 adds
* SIMD architecture passes the top 5 elements through the adders
  (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them back in rd.
1224
This requires a read of rd; however a read is required anyway in order
to support non-zeroing predication mode.
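
To make those steps concrete, here is a hypothetical sketch (function and
variable names invented for illustration) of the hidden-predicate approach
for the elwidth=8, VL=3 example above:

    def simd_masked_add(rd_val, rs1_val, rs2_val, VL=3, elwidth=8, nelems=8):
        """Sketch only: perform a full 8-wide batch of 8-bit adds, then
        keep the first VL results and pass the remaining elements of rd
        through unmodified (this is what requires the read of rd)."""
        hidden_pred = (1 << VL) - 1          # 0b00000111 for VL=3
        elmask = (1 << elwidth) - 1
        result = 0
        for i in range(nelems):
            shift = i * elwidth
            if hidden_pred & (1 << i):
                s = ((rs1_val >> shift) & elmask) + ((rs2_val >> shift) & elmask)
                result |= (s & elmask) << shift
            else:
                result |= rd_val & (elmask << shift)   # element passed through
        return result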
1227
1228 ## Polymorphic floating-point
1229
Standard scalar RV integer operations base the register width on XLEN,
which may be changed dynamically (via the MXL, SXL and UXL fields).
Integer LOAD, STORE and arithmetic operations are therefore restricted
to an active XLEN bits, with sign or zero extension to pad out the upper
bits when XLEN has been dynamically set to less than the actual register
size.
1236
1237 For scalar floating-point, the active (used / changed) bits are
1238 specified exclusively by the operation: ADD.S specifies an active
1239 32-bits, with the upper bits of the source registers needing to
1240 be all 1s ("NaN-boxed"), and the destination upper bits being
1241 *set* to all 1s (including on LOAD/STOREs).
1242
Where elwidth is set to default (on any source or the destination)
it is obvious that this NaN-boxing behaviour can and should be
preserved. When elwidth is non-default, things are less obvious,
so they need to be thought through. Here is a normal (scalar) sequence,
assuming an RV64 which supports Quad (128-bit) FLEN:
1248
1249 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1250 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1251 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1252 top 64 MSBs ignored.
1253
1254 Therefore it makes sense to mirror this behaviour when, for example,
1255 elwidth is set to 32. Assume elwidth set to 32 on all source and
1256 destination registers:
1257
1258 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1259 floating-point numbers.
1260 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1261 in bits 0-31 and the second in bits 32-63.
1262 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1263
1264 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1265 of the registers either during the FLD **or** the ADD.D. The reason
1266 is that, effectively, the top 64 MSBs actually represent a completely
1267 independent 64-bit register, so overwriting it is not only gratuitous
1268 but may actually be harmful for a future extension to SV which may
1269 have a way to directly access those top 64 bits.
1270
The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
when "isvec" is false in a given register's CSR entry. Only when the
elwidth is set to default **and** isvec is false will the standard
RV behaviour be followed, namely that the upper bits be modified.
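
A (non-normative) sketch of this rule, assuming a 128-bit FLEN register
file modelled as plain integers and invented helper names:

    FLEN = 128

    def fp_write(fregs, fd, value, opwidth, elwidth_is_default, isvec):
        """Sketch only: NaN-boxing (upper bits set to all 1s) is performed
        only when elwidth is default AND the register is scalar; in all
        other cases the bits above the written element are preserved."""
        mask = (1 << opwidth) - 1
        if elwidth_is_default and not isvec:
            # standard RV behaviour: set all upper bits to 1 (NaN-box)
            fregs[fd] = (((1 << (FLEN - opwidth)) - 1) << opwidth) | (value & mask)
        else:
            # non-default elwidth (or vectorised): upper bits untouched
            fregs[fd] = (fregs[fd] & ~mask) | (value & mask)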
1276
1277 Ultimately if elwidth is default and isvec false on *all* source
1278 and destination registers, a SimpleV instruction defaults completely
1279 to standard RV scalar behaviour (this holds true for **all** operations,
1280 right across the board).
1281
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
to a non-default value, are effectively all the same: they all still perform
multiple ADD operations, just at different widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1288
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar out-of-order
architecture there may be absolutely no difference; however simpler
SIMD-style microarchitectures may not have the infrastructure in place
to know the difference, such that when VL=8 and an ADD.D instruction is
issued, it completes in 2 cycles (or more) rather than one, whereas if
an ADD.Q had been issued instead on such simpler microarchitectures it
would complete in one.
1299
1300 ## Specific instruction walk-throughs
1301
1302 This section covers walk-throughs of the above-outlined procedure
1303 for converting standard RISC-V scalar arithmetic operations to
1304 polymorphic widths, to ensure that it is correct.
1305
1306 ### add
1307
1308 Standard Scalar RV32/RV64 (xlen):
1309
1310 * RS1 @ xlen bits
1311 * RS2 @ xlen bits
1312 * add @ xlen bits
1313 * RD @ xlen bits
1314
1315 Polymorphic variant:
1316
1317 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1318 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1319 * add @ max(rs1, rs2) bits
1320 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1321
1322 Note here that polymorphic add zero-extends its source operands,
1323 where addw sign-extends.
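
A small Python sketch of the rule above (names invented; the per-register
widths rs1_bits, rs2_bits and rd_bits come from the elwidth settings):

    def zext(value, from_bits, to_bits):
        """Zero-extend (or truncate) a from_bits-wide value to to_bits."""
        return value & ((1 << min(from_bits, to_bits)) - 1)

    def poly_add(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        """Sketch: zero-extend operands to max(rs1, rs2) bits, add at
        that width, then truncate or zero-extend the result to rd bits."""
        opwidth = max(rs1_bits, rs2_bits)
        result = (zext(rs1_val, rs1_bits, opwidth) +
                  zext(rs2_val, rs2_bits, opwidth)) & ((1 << opwidth) - 1)
        return result & ((1 << rd_bits) - 1)   # zero-extension is implicit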
1324
1325 ### addw
1326
1327 The RV Specification specifically states that "W" variants of arithmetic
1328 operations always produce 32-bit signed values. In a polymorphic
1329 environment it is reasonable to assume that the signed aspect is
1330 preserved, where it is the length of the operands and the result
1331 that may be changed.
1332
1333 Standard Scalar RV64 (xlen):
1334
1335 * RS1 @ xlen bits
1336 * RS2 @ xlen bits
1337 * add @ xlen bits
1338 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1339
1340 Polymorphic variant:
1341
1342 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1343 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1344 * add @ max(rs1, rs2) bits
1345 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1346
1347 Note here that polymorphic addw sign-extends its source operands,
1348 where add zero-extends.
1349
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1354
Effectively, however, both rs1 and rs2 are being sign-extended (or
truncated), whereas for add they are both zero-extended. This holds true
for all arithmetic operations ending with "W".
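
And the sign-extending counterpart for the "W" variants, under the same
illustrative assumptions as the add sketch above:

    def sext(value, from_bits, to_bits):
        """Sign-extend a from_bits-wide two's-complement value to to_bits
        (or truncate it if to_bits is smaller)."""
        value &= (1 << from_bits) - 1
        if value >> (from_bits - 1):               # negative
            value -= (1 << from_bits)
        return value & ((1 << to_bits) - 1)

    def poly_addw(rs1_val, rs1_bits, rs2_val, rs2_bits, rd_bits):
        """Sketch: sign-extend operands to max(rs1, rs2) bits, add at
        that width, then sign-extend or truncate the result to rd bits."""
        opwidth = max(rs1_bits, rs2_bits)
        result = (sext(rs1_val, rs1_bits, opwidth) +
                  sext(rs2_val, rs2_bits, opwidth)) & ((1 << opwidth) - 1)
        return sext(result, opwidth, rd_bits)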
1358
1359 ### addiw
1360
1361 Standard Scalar RV64I:
1362
1363 * RS1 @ xlen bits, truncated to 32-bit
1364 * immed @ 12 bits, sign-extended to 32-bit
1365 * add @ 32 bits
* RD @ xlen bits. sign-extend the 32-bit add result to xlen.
1367
1368 Polymorphic variant:
1369
1370 * RS1 @ rs1 bits
1371 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1372 * add @ max(rs1, 12) bits
1373 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
1374
1375 # Predication Element Zeroing
1376
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to be able to save power by avoiding a register read on elements
that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1387
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1394
1395 ## Single-predication (based on destination register)
1396
Zeroing on predication for arithmetic operations is taken from
the destination register's predicate, i.e. the predication *and*
zeroing settings to be applied to the whole operation come from the
CSR Predication table entry for the destination register.
1401 Thus when zeroing is set on predication of a destination element,
1402 if the predication bit is clear, then the destination element is *set*
1403 to zero (twin-predication is slightly different, and will be covered
1404 next).
1405
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1408
    for (i = 0; i < VL; i++)
       if not zeroing: # an optimisation
          while (!(predval & 1<<i) && i < VL)
             if (int_vec[rd ].isvector)  { ird += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
          if i == VL:
             return
       if (predval & 1<<i)
          src1 = ....
          src2 = ...
          result = src1 + src2 # actual add (or other op) here
          set_polymorphed_reg(rd, destwid, ird, result)
          if int_vec[rd].ffirst and result == 0:
             VL = i # result was zero, end loop early, return VL
             return
          if (!int_vec[rd].isvector) return
       else if zeroing:
          result = 0
          set_polymorphed_reg(rd, destwid, ird, result)
       if (int_vec[rd ].isvector)  { ird += 1; }
       else if (predval & 1<<i) return
       if (int_vec[rs1].isvector)  { irs1 += 1; }
       if (int_vec[rs2].isvector)  { irs2 += 1; }
       if (rd == VL or rs1 == VL or rs2 == VL): return
1435
1436 The optimisation to skip elements entirely is only possible for certain
1437 micro-architectures when zeroing is not set. However for lane-based
1438 micro-architectures this optimisation may not be practical, as it
1439 implies that elements end up in different "lanes". Under these
1440 circumstances it is perfectly fine to simply have the lanes
1441 "inactive" for predicated elements, even though it results in
1442 less than 100% ALU utilisation.
1443
1444 ## Twin-predication (based on source and destination register)
1445
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1450
When, in twin-predication, zeroing is set on the source and not
the destination, a *clear* source predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1457
1458 When zeroing is set on the destination and not the source, then just
1459 as with single-predicated operations, a zero is stored into the destination
1460 element (or target memory address for a STORE).
1461
Zeroing on both source and destination effectively results in a bitwise
AND of the source and destination predicates: only where both predicate
bits are set does actual source data reach the destination. Where either
the source predicate OR the destination predicate is zero, a zero element
will ultimately end up in the destination register.
1466
1467 However: this may not necessarily be the case for all operations;
1468 implementors, particularly of custom instructions, clearly need to
1469 think through the implications in each and every case.
1470
1471 Here is pseudo-code for a twin zero-predicated operation:
1472
    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
       pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL):
          if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
          if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
          if ((pd & 1<<j))
             if ((ps & 1<<i))
                sourcedata = ireg[rs+i];
             else
                sourcedata = 0
             ireg[rd+j] <= sourcedata
          else if (zerodst)
             ireg[rd+j] <= 0
          if (int_csr[rs].isvec)
             i++;
          if (int_csr[rd].isvec)
             j++;
          else
             if ((pd & 1<<j))
                break;
1496
1497 Note that in the instance where the destination is a scalar, the hardware
1498 loop is ended the moment a value *or a zero* is placed into the destination
1499 register/element. Also note that, for clarity, variable element widths
1500 have been left out of the above.
1501
1502 # Subsets of RV functionality
1503
1504 This section describes the differences when SV is implemented on top of
1505 different subsets of RV.
1506
1507 ## Common options
1508
1509 It is permitted to only implement SVprefix and not the VBLOCK instruction
1510 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1511 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1512 traps may emulate the format.
1513
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise an illegal instruction exception on implementations that do
not support VL or SUBVL.
1518
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.
1523
1524 ## RV32 / RV32F
1525
1526 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1527 maximum limit for predication is also restricted to 32 bits. Whilst not
1528 actually specifically an "option" it is worth noting.
1529
1530 ## RV32G
1531
Normally, in standard RV32, it does not make much sense to have
RV32G: the critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1536
1537 In an earlier draft of SV, it was possible to specify an elwidth
1538 of double the standard register size: this had to be dropped,
1539 and may be reintroduced in future revisions.
1540
1541 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1542
1543 When floating-point is not implemented, the size of the User Register and
1544 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1545 per table).
1546
1547 ## RV32E
1548
1549 In embedded scenarios the User Register and Predication CSRs may be
1550 dropped entirely, or optionally limited to 1 CSR, such that the combined
1551 number of entries from the M-Mode CSR Register table plus U-Mode
1552 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1553 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1554 the Predication CSR tables.
1555
1556 RV32E is the most likely candidate for simply detecting that registers
1557 are marked as "vectorised", and generating an appropriate exception
1558 for the VL loop to be implemented in software.
1559
1560 ## RV128
1561
RV128 has not been especially considered here, however it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1566
1567 # Example usage
1568
1569 TODO evaluate strncpy and strlen
1570 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1571
## strncpy <a name="strncpy"></a>
1573
1574 RVV version:
1575
    strncpy:
        c.mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8      # Vectors of bytes.
        vlbff.v v1, (a1)          # Get src bytes
        vseq.vi v0, v1, 0         # Flag zero bytes
        vmfirst a4, v0            # Zero found?
        vmsif.v v0, v0            # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t      # Write out bytes
        c.bgez a4, exit           # Done
        csrr t1, vl               # Get number of bytes fetched
        c.add a1, a1, t1          # Bump src pointer
        c.sub a2, a2, t1          # Decrement count.
        c.add a3, a3, t1          # Bump dst pointer
        c.bnez a2, loop           # Anymore?
    exit:
        c.ret
1594
1595 SV version (WIP):
1596
    strncpy:
        c.mv a3, a0
        VBLK.RegCSR[t0] = 8bit, t0, vector
        VBLK.PredTb[t0] = ffirst, x0, inv
    loop:
        VBLK.SETVLI a2, t4, 8     # t4 and VL now 1..8 (MVL=8)
        c.ldb t0, (a1)            # t0 fail-first mode
        c.bne t0, x0, allnonzero  # still ff
        # VL (t4) points to last nonzero
        c.addi t4, t4, 1          # include zero
        c.stb t0, (a3)            # store incl zero
        c.ret                     # end subroutine
    allnonzero:
        c.stb t0, (a3)            # VL legal range
        c.add a1, a1, t4          # Bump src pointer
        c.sub a2, a2, t4          # Decrement count.
        c.add a3, a3, t4          # Bump dst pointer
        c.bnez a2, loop           # Anymore?
    exit:
        c.ret
1617
1618 Notes:
1619
1620 * Setting MVL to 8 is just an example. If enough registers are spare it
1621 may be set to XLEN which will require a bank of 8 scalar registers for
1622 a1, a3 and t0.
1623 * obviously if that is done, t0 is not separated by 8 full registers, and
1624 would overwrite t1 thru t7. x80 would work well, as an example, instead.
1625 * with the exception of the GETVL (a pseudo code alias for csrr), every
1626 single instruction above may use RVC.
1627 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1628 registers through redirection
1629 * RVC C.LW and C.SW may be used because the W format may be overridden by
1630 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1631 * with the exception of the GETVL, all Vector Context may be done in
1632 VBLOCK form.
1633 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1634 just ffirst on t0
1635 * ldb and bne are both using t0, both in ffirst mode
1636 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1637 vectorised, no (un)sign-extension or truncation" mode.
* ldb will end on illegal mem, reduce VL, but will have copied all sorts
  of stuff into t0 (which could contain zeros).
1640 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1641 scalar x0
1642 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1643 compares, and reduce VL as well
1644 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1647 * SETVL sets *exactly* the requested amount into VL.
1648 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1649 activates but the bne allzeros does not.
1650 * this would cause the stb to copy up to the end of the legal memory
1651 * of course, on the next loop the ldb would throw a trap, as a1 now
1652 points to the first illegal mem location.
1653
1654 ## strcpy
1655
1656 RVV version:
1657
        mv a3, a0                 # Save start
    loop:
        setvli a1, x0, vint8      # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)         # Get bytes
        csrr a1, vl               # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0         # Set v0[i] where v1[i] = 0
        add a3, a3, a1            # Bump pointer
        vmfirst a2, v0            # Find first set bit in mask, returns -1 if none
        bltz a2, loop             # Not found?
        add a0, a0, a1            # Sum start + bump
        add a3, a3, a2            # Add index of zero byte
        sub a0, a3, a0            # Subtract start address+bump
        ret
1671
1672 ## DAXPY <a name="daxpy"></a>
1673
1674 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1675
1676 Notes:
1677
1678 * Setting MVL to 4 is just an example. With enough space between the
1679 FP regs, MVL may be set to larger values
1680 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1681 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1682 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1683 * All instructions except fmadd may use Compressed variants. Total
1684 number of 16-bit instruction words: 11.
1685 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1686
1687 ## BigInt add <a name="bigadd"></a>
1688
1689 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]