-# Simple-V (Parallelism Extension Proposal) Appendix
+[[!oldstandards]]
+
+# Simple-V (Parallelism Extension Proposal) Appendix (OBSOLETE)
+
+**OBSOLETE**
* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
-* Last edited: 25 jun 2019
+* Last edited: 30 jun 2019
* main spec [[specification]]
[[!toc ]]
+# Fail-on-first modes <a name="ffirst"></a>
+
+Fail-on-first data dependency has different behaviour for traps than
+for conditional testing. "Conditional" is taken to mean "anything
+that is zero", however with traps, the first element has to
+be given the opportunity to throw the exact same trap that would
+be thrown if this were a scalar operation (when VL=1).
+
+Note that implementors are required to choose one mode or the other,
+mutually exclusively: an instruction is **not** permitted to fail on a
+trap *and* fail a conditional test at the same time. This advice applies
+to custom opcode writers as well as future extension writers.
+
+## Fail-on-first traps
+
+Except for the first element, ffirst stops sequential element processing
+when a trap occurs. The first element is treated normally (as if ffirst
+is clear). Should any subsequent element require a trap, it and all
+subsequent indexed elements are instead ignored (or cancelled in
+out-of-order designs), and VL is truncated to cover only the in-sequence
+elements before the one that took the trap.
+
+Note that predicated-out elements (where the predicate mask bit is
+zero) are clearly excluded (i.e. the trap will not occur). However,
+note that the loop still had to test the predicate bit: thus on return,
+VL is set to include elements that did not take the trap *and* includes
+the elements that were predicated (masked) out (not tested up to the
+point where the trap occurred).
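As an illustration only (not normative), the trap form of fail-on-first,
including the rule that masked-out elements still count towards the
truncated VL, may be sketched in Python (names are invented for the
sketch; the doubling operation is a stand-in for any element op):

```python
def ffirst_trap_loop(values, pred, would_trap):
    """Model of fail-on-first trap semantics: returns (results, new_VL).

    The first element traps normally (exactly as a scalar op would);
    a trap on any later element truncates VL to the number of element
    slots already passed, including predicated-out ones.
    """
    results = []
    for i, v in enumerate(values):
        if not (pred >> i) & 1:
            results.append(None)   # masked out: not tested, cannot trap
            continue
        if would_trap(v):
            if i == 0:
                # scalar-identical behaviour for the first element
                raise ValueError("trap on first element")
            return results, i      # VL = element slots before the trap
        results.append(v * 2)      # stand-in for the actual operation
    return results, len(values)
```

Note how, with element 1 masked out and element 2 trapping, VL is still
set to 2: the masked-out slot is included in the count.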
+
+Unlike conditional tests, "fail-on-first trap" instruction behaviour is
+unaltered by setting zero or non-zero predication mode.
+
+If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
+will cause a trap as normal (as if ffirst were not set); in subsequent
+*sub-groups*, however, a trap must not occur: processing ends instead.
+SUBVL itself will **NOT** be modified. Trap handlers must analyse
+(x)eSTATE (subvl offset indices) to determine the element that caused
+the trap.
+
+Given that predication bits apply to SUBVL groups, the same rules apply
+to predicated-out (masked-out) sub-groups in calculating the value that
+VL is set to.
+
+## Fail-on-first conditional tests
+
+ffirst stops sequential (or sequentially-appearing in the case of
+out-of-order designs) element conditional testing on the first element
+result being zero (or other "fail" condition). VL is set to the number
+of elements that were (sequentially) processed before the fail-condition
+was encountered.
+
+Unlike trap fail-on-first, fail-on-first conditional testing behaviour
+responds to changes in the zero or non-zero predication mode. Whilst
+in non-zeroing mode, masked-out elements are simply not tested (and
+thus considered "never to fail"), in zeroing mode, masked-out elements
+may be viewed as *always* (unconditionally) failing. This effectively
+turns VL into something akin to a software-controlled loop.
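The VL truncation rule, and the difference between zeroing and
non-zeroing predication, may be modelled with a short Python sketch
(illustrative only; names are invented):

```python
def ffirst_cond_vl(values, pred, zeroing, test):
    """Return the new VL under conditional fail-on-first.

    Non-zeroing: masked-out elements are skipped ("never fail").
    Zeroing: masked-out elements count as unconditional fails.
    """
    for i, v in enumerate(values):
        if not (pred >> i) & 1:
            if zeroing:
                return i       # masked-out element "fails" immediately
            continue           # not tested at all
        if not test(v):
            return i           # first data-dependent fail truncates VL
    return len(values)
```

With the same data and mask, non-zeroing mode skips over a masked-out
element while zeroing mode stops at it, giving different VL results.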
+
+Note that just as with traps, if SUBVL!=1, the first fail-condition in a
+*sub-group* will cause the processing to end, and, even if there were
+elements within the *sub-group* that passed the test, that sub-group is
+still (entirely) excluded from the count (from setting VL). i.e. VL is
+set to the total number of *sub-groups* that had no fail-condition up
+until execution was stopped. However, again: SUBVL must not be modified:
+software must analyse (x)eSTATE (subvl offset indices) to determine the
+element that caused the fail.
+
+Note again that, just as with traps, predicated-out (masked-out) elements
+are included in the (sequential) count leading up to the fail-condition,
+even though they were not tested.
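A minimal Python sketch of the sub-group counting rule (illustrative
only, names invented): a fail anywhere in a sub-group excludes that whole
sub-group from VL, while masked-out sub-groups can never fail.

```python
def ffirst_cond_subvl_vl(vl, subvl, pred, test, elements):
    """New VL (in whole sub-groups) under conditional fail-first with
    SUBVL!=1. One predicate bit covers each sub-group of SUBVL elements."""
    for i in range(vl):
        if (pred >> i) & 1:               # masked-out sub-groups never fail
            group = elements[i * subvl:(i + 1) * subvl]
            if any(not test(e) for e in group):
                return i                  # whole sub-group excluded
    return vl
```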
+
# Instructions <a name="instructions" />
Despite being a 98% complete and accurate topological remap of RVV
Example pseudo-code for an integer ADD operation (including scalar
operations). Floating-point uses the FP Register Table.
- function op_add(rd, rs1, rs2) # add not VADD!
- int i, id=0, irs1=0, irs2=0;
- predval = get_pred_val(FALSE, rd);
- rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
- rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
- rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
- for (i = 0; i < VL; i++)
- xSTATE.srcoffs = i # save context
- if (predval & 1<<i) # predication uses intregs
- ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
- if (!int_vec[rd ].isvector) break;
- if (int_vec[rd ].isvector) { id += 1; }
- if (int_vec[rs1].isvector) { irs1 += 1; }
- if (int_vec[rs2].isvector) { irs2 += 1; }
+[[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
Note that for simplicity there is quite a lot missing from the above
-pseudo-code: element widths, zeroing on predication, dimensional
+pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
reshaping and offsets and so on. However it demonstrates the basic
principle. Augmentations that produce the full pseudo-code are covered in
other sections.
Adding in support for SUBVL is a matter of adding in an extra inner
for-loop, where register src and dest are still incremented inside the
-inner part. Not that the predication is still taken from the VL index.
+inner part. Note that the predication is still taken from the VL index.
So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)"
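The SUBVL inner-loop indexing described above can be sketched in Python
(non-zeroing predication only, for simplicity; names are invented):

```python
def subvl_add(vl, subvl, pred, src1, src2):
    """Element indexing with sub-vectors: data is indexed by i*SUBVL+s,
    while one predicate bit (indexed by i) covers the whole sub-group."""
    dest = [0] * (vl * subvl)
    for i in range(vl):
        if not (pred >> i) & 1:
            continue                 # whole sub-group masked out
        for s in range(subvl):
            j = i * subvl + s
            dest[j] = src1[j] + src2[j]
    return dest
```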
Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
-table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
+table) and enhancing them to branch "consensually" depending on *multiple*
+tests. "ffirst" - fail-on-first - condition mode can also be enabled,
+to terminate the comparisons early.
See ffirst mode in the Predication Table section.
+There are two registers for the comparison operation, therefore there
+is the opportunity to associate two predicate registers (note: not in
+the same way as twin-predication). The first is a "normal" predicate
+register, which acts just as it does on any other single-predicated
+operation: masks out elements where a bit is zero, applies an inversion
+to the predicate mask, and enables zeroing / non-zeroing mode.
+
+The second (not to be confused with a twin-predication 2nd register)
+is utilised to indicate where the results of each comparison are to
+be stored, as a bitmask. Additionally, the behaviour of the branch -
+when it occurs - may also be modified depending on whether the 2nd predicate's
+"invert" and "zeroing" bits are set. These four combinations result
+in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
+cbranch.ifall (AND), cbranch.ifnotall (NAND).
+
+| invert | zeroing | description | operation | cbranch |
+| ------ | ------- | --------------------------- | --------- | ------- |
+| 0 | 0 | branch if all pass | AND | ifall |
+| 1      | 0       | branch if one fails         | NAND      | ifnotall |
+| 0 | 1 | branch if one passes | OR | ifany |
+| 1 | 1 | branch if all fail | NOR | ifnone |
+
+This inversion capability covers AND, OR, NAND and NOR branching
+based on multiple element comparisons. Without the full set of four,
+it is necessary to have two-sequence branch operations: one conditional, one
+unconditional.
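The four invert/zeroing combinations in the table reduce to a simple
decision function, sketched here in Python (illustrative only), where
`result` is the accumulated comparison bitmask and `ps` is the first
("normal") predicate:

```python
def consensual_branch(result, ps, invert, zeroing):
    """True if the consensual branch is taken, per the invert/zeroing
    table: AND (ifall), NAND (ifnotall), OR (ifany), NOR (ifnone)."""
    if invert:
        if zeroing:
            return result == 0                 # NOR: branch if all fail
        return (result & ps) != result         # NAND: branch if one fails
    if zeroing:
        return result != 0                     # OR: branch if one passes
    return (result & ps) == result             # AND: branch if all pass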
+
+Note that, unlike the short-circuit early-termination of chains of AND
+or OR conditional tests in normal computer programming, the chain here
+does *not* terminate early unless fail-on-first is set, and even then
+ffirst ends on the first data-dependent zero. When ffirst mode is not
+set, *all* conditional element tests must be performed (and the result
+optionally stored in the result mask), with a "post-analysis" phase
+carried out which checks whether to branch.
+
+Note also that whilst it may seem excessive to have all four (because
+conditional comparisons may be inverted by swapping src1 and src2),
+data-dependent fail-on-first is *not* invertible and *only* terminates
+on first zero-condition encountered. Additionally it may be inconvenient
+to have to swap the predicate registers associated with src1 and src2,
+because this involves a new VBLOCK Context.
+
### Standard Branch <a name="standard_branch"></a>
Branch operations use standard RV opcodes that are reinterpreted to
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BTGU may be synthesised by inverting
-src1 and src2.
+src1 and src2; note, however, that in doing so the predicate table
+setup must also be correspondingly adjusted.
In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":
ps = get_pred_val(I/F==INT, rs1);
rd = get_pred_val(I/F==INT, rs2); # this may not exist
+ ffirst_mode, zeroing = get_pred_flags(rs1)
+ if exists(rd):
+ pred_inversion, pred_zeroing = get_pred_flags(rs2)
+ else
+ pred_inversion, pred_zeroing = False, False
+
if not exists(rd) or zeroing:
- result = 0
+ result = (1<<VL)-1 # all 1s
else
result = preg[rd]
result |= 1<<i;
else
result &= ~(1<<i);
+ if ffirst_mode:
+ break
- if not exists(rd)
- if result == ps
- goto branch
- else
+ if exists(rd):
preg[rd] = result # store in destination
- if preg[rd] == ps
- goto branch
+
+ if pred_inversion:
+ if pred_zeroing:
+ # NOR
+ if result == 0:
+ goto branch
+ else:
+ # NAND
+ if (result & ps) != result:
+ goto branch
+ else:
+ if pred_zeroing:
+ # OR
+ if result != 0:
+ goto branch
+ else:
+ # AND
+ if (result & ps) == result:
+ goto branch
Notes:
so on. Consequently, unlike integer-branch, FP Compare needs no
modification in its behaviour.
-In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
-and whilst in ordinary branch code this is fine because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero), in predication terms that
-becomes more of an impact. To deal with this, SV's predication has
-had "invert" added to it.
+In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
+missing, and whilst in ordinary branch code this is fine because the
+standard RVF compare can always be followed up with an integer BEQ or
+a BNE (or a compressed comparison to zero or non-zero), in predication
+terms that becomes more of an impact. To deal with this, SV's predication
+has had "invert" added to it.
Also: note that FP Compare may be predicated, using the destination
integer register (rd) to determine the predicate. FP Compare is **not**
## Polymorphic floating-point operation exceptions and error-handling
-For floating-point operations, conversion takes place without
-raising any kind of exception. Exactly as specified in the standard
-RV specification, NAN (or appropriate) is stored if the result
-is beyond the range of the destination, and, again, exactly as
-with the standard RV specification just as with scalar
-operations, the floating-point flag is raised (FCSR). And, again, just as
-with scalar operations, it is software's responsibility to check this flag.
-Given that the FCSR flags are "accrued", the fact that multiple element
-operations could have occurred is not a problem.
+For floating-point operations, conversion takes place without raising any
+kind of exception. Exactly as specified in the standard RV specification,
+NAN (or appropriate) is stored if the result is beyond the range of the
+destination, and, again, exactly as with the standard RV specification
+just as with scalar operations, the floating-point flag is raised
+(FCSR). And, again, just as with scalar operations, it is software's
+responsibility to check this flag. Given that the FCSR flags are
+"accrued", the fact that multiple element operations could have occurred
+is not a problem.
Note that it is perfectly legitimate for floating-point bitwidths of
only 8 to be specified. However whilst it is possible to apply IEEE 754
## Polymorphic shift operators
-A special note is needed for changing the element width of left and right
-shift operators, particularly right-shift. Even for standard RV base,
-in order for correct results to be returned, the second operand RS2 must
-be truncated to be within the range of RS1's bitwidth. spike's implementation
-of sll for example is as follows:
+A special note is needed for changing the element width of left and
+right shift operators, particularly right-shift. Even for standard RV
+base, in order for correct results to be returned, the second operand
+RS2 must be truncated to be within the range of RS1's bitwidth.
+spike's implementation of sll for example is as follows:
WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
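The same truncation must be applied per element width. A hedged Python
sketch (names invented) shows why: without masking the shift amount to
the element width, a shift of 9 on an 8-bit element would produce zero
rather than wrapping as the narrower hardware would:

```python
def sll_elwidth(rs1, rs2, elwidth):
    """Shift-left at a given element width: the shift amount is truncated
    to the element width (mirroring spike's `RS2 & (xlen-1)`), and the
    result is masked back into the element's range."""
    mask = (1 << elwidth) - 1
    return (rs1 << (rs2 & (elwidth - 1))) & mask
```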
is also marked as scalar, this is how the compatibility with
standard RV LOAD/STORE is preserved by this algorithm.
-### Example Tables showing LOAD elements
+### Example Tables showing LOAD elements <a name="load_example"></a>
This section contains examples of vectorised LOAD operations, showing
how the two stage process works (three if zero/sign-extension is included).
* from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
* RV64, where XLEN=64 is assumed.
-First, the memory table, which, due to the
-element width being 16 and the operation being LD (64), the 64-bits
-loaded from memory are subdivided into groups of **four** elements.
-And, with VL being 7 (deliberately to illustrate that this is reasonable
-and possible), the first four are sourced from the offset addresses pointed
-to by x5, and the next three from the ofset addresses pointed to by
-the next contiguous register, x6:
+First, the memory table, which, due to the element width being 16 and the
+operation being LD (64), the 64-bits loaded from memory are subdivided
+into groups of **four** elements. And, with VL being 7 (deliberately
+to illustrate that this is reasonable and possible), the first four are
+sourced from the offset addresses pointed to by x5, and the next three
+from the offset addresses pointed to by the next contiguous register, x6:
[[!table data="""
addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
only where the bitwidth of either rs1 or rs2 are different, will the
lesser-width operand be sign-extended.
-Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
-where for add they are both zero-extended. This holds true for all arithmetic
-operations ending with "W".
+Effectively however, both rs1 and rs2 are being sign-extended (or
+truncated), where for add they are both zero-extended. This holds true
+for all arithmetic operations ending with "W".
### addiw
"inactive" for predicated elements, even though it results in
less than 100% ALU utilisation.
-## Twin-predication (based on source and destination register)
+## Twin-predication (based on source and destination register) <a name="tpred"></a>
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
TODO evaluate strncpy and strlen
<https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
-## strncpy
-
-RVV version: <a name="strncpy"></>
-
- strncpy:
- mv a3, a0 # Copy dst
- loop:
- setvli x0, a2, vint8 # Vectors of bytes.
- vlbff.v v1, (a1) # Get src bytes
- vseq.vi v0, v1, 0 # Flag zero bytes
- vmfirst a4, v0 # Zero found?
- vmsif.v v0, v0 # Set mask up to and including zero byte. Ppplio
- vsb.v v1, (a3), v0.t # Write out bytes
- bgez a4, exit # Done
- csrr t1, vl # Get number of bytes fetched
- add a1, a1, t1 # Bump src pointer
- sub a2, a2, t1 # Decrement count.
- add a3, a3, t1 # Bump dst pointer
- bnez a2, loop # Anymore?
-
- exit:
- ret
+## strncpy <a name="strncpy"></a>
+
+RVV version:
+
+ strncpy:
+ c.mv a3, a0 # Copy dst
+ loop:
+ setvli x0, a2, vint8 # Vectors of bytes.
+ vlbff.v v1, (a1) # Get src bytes
+ vseq.vi v0, v1, 0 # Flag zero bytes
+ vmfirst a4, v0 # Zero found?
+ vmsif.v v0, v0 # Set mask up to and including zero byte.
+ vsb.v v1, (a3), v0.t # Write out bytes
+ c.bgez a4, exit # Done
+ csrr t1, vl # Get number of bytes fetched
+ c.add a1, a1, t1 # Bump src pointer
+ c.sub a2, a2, t1 # Decrement count.
+ c.add a3, a3, t1 # Bump dst pointer
+ c.bnez a2, loop # Anymore?
+
+ exit:
+ c.ret
SV version (WIP):
strncpy:
- mv a3, a0
- SETMVLI 8 # set max vector to 8
- RegCSR[a3] = 8bit, a3, scalar
- RegCSR[a1] = 8bit, a1, scalar
- RegCSR[t0] = 8bit, t0, vector
- PredTb[t0] = ffirst, x0, inv
+ c.mv a3, a0
+ VBLK.RegCSR[t0] = 8bit, t0, vector
+ VBLK.PredTb[t0] = ffirst, x0, inv
loop:
- SETVLI a2, t4 # t4 and VL now 1..8
- ldb t0, (a1) # t0 fail first mode
- bne t0, x0, allnonzero # still ff
- # VL points to last nonzero
- GETVL t4 # from bne tests
- addi t4, t4, 1 # include zero
- SETVL t4 # set exactly to t4
- stb t0, (a3) # store incl zero
- ret # end subroutine
+ VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8)
+ c.ldb t0, (a1) # t0 fail first mode
+ c.bne t0, x0, allnonzero # still ff
+ # VL (t4) points to last nonzero
+ c.addi t4, t4, 1 # include zero
+ c.stb t0, (a3) # store incl zero
+ c.ret # end subroutine
allnonzero:
- stb t0, (a3) # VL legal range
- GETVL t4 # from bne tests
- add a1, a1, t4 # Bump src pointer
- sub a2, a2, t4 # Decrement count.
- add a3, a3, t4 # Bump dst pointer
- bnez a2, loop # Anymore?
+ c.stb t0, (a3) # VL legal range
+ c.add a1, a1, t4 # Bump src pointer
+ c.sub a2, a2, t4 # Decrement count.
+ c.add a3, a3, t4 # Bump dst pointer
+ c.bnez a2, loop # Anymore?
exit:
- ret
+ c.ret
Notes:
-* Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
-* obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
-* with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
-* RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
-* RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
-* with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
-* setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
+* Setting MVL to 8 is just an example. If enough registers are spare it
+ may be set to XLEN which will require a bank of 8 scalar registers for
+ a1, a3 and t0.
+* obviously if that is done, t0 is not separated by 8 full registers, and
+ would overwrite t1 thru t7. x80 would work well, as an example, instead.
+* with the exception of the GETVL (a pseudo code alias for csrr), every
+ single instruction above may use RVC.
+* RVC C.BNEZ can be used because rs1' may be extended to the full 128
+ registers through redirection
+* RVC C.LW and C.SW may be used because the W format may be overridden by
+ the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
+* with the exception of the GETVL, all Vector Context may be done in
+ VBLOCK form.
+* setting predication to x0 (zero) and invert on t0 is a trick to enable
+ just ffirst on t0
* ldb and bne are both using t0, both in ffirst mode
-* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
-* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
-* however as t0 is in ffirst mode, the first fail wil ALSO stop the compares, and reduce VL as well
+* t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
+ vectorised, no (un)sign-extension or truncation" mode.
+* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff
+ into t0 (could contain zeros).
+* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
+ scalar x0
+* however as t0 is in ffirst mode, the first fail will ALSO stop the
+ compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
-* if it did not, we can safely increment VL by 1 (using a4) to include the zero.
+* if it did not, we can safely increment VL by 1 (using a4) to include
+ the zero.
* SETVL sets *exactly* the requested amount into VL.
-* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
+* the SETVL just after allnonzero label is needed in case the ldb ffirst
+ activates but the bne allzeros does not.
* this would cause the stb to copy up to the end of the legal memory
-* of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
+* of course, on the next loop the ldb would throw a trap, as a1 now
+ points to the first illegal mem location.
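The per-iteration behaviour described in the notes — copy up to MVL
bytes, and if a zero is found bump VL by one so the terminating zero is
included in the store — may be sketched in Python (illustrative model
only, not the instruction-level semantics):

```python
def sv_strncpy_step(src, max_vl):
    """One loop iteration of the ffirst strncpy: scan up to max_vl bytes;
    on finding a zero, the copied slice is extended by one to include it.
    Returns (bytes_to_store, found_zero)."""
    chunk = src[:max_vl]
    for i, b in enumerate(chunk):
        if b == 0:
            return chunk[:i + 1], True   # copy includes terminating zero
    return chunk, False
```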
## strcpy
RVV version:
- mv a3, a0 # Save start
- loop:
+ mv a3, a0 # Save start
+ loop:
setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
vldbff.v v1, (a3) # Get bytes
csrr a1, vl # Get bytes actually read e.g. if fault
- vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
+ vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
add a3, a3, a1 # Bump pointer
vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
bltz a2, loop # Not found?
add a0, a0, a1 # Sum start + bump
add a3, a3, a2 # Add index of zero byte
sub a0, a3, a0 # Subtract start address+bump
- ret
+ ret
+
+## DAXPY <a name="daxpy"></a>
+
+[[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
+
+Notes:
+
+* Setting MVL to 4 is just an example. With enough space between the
+ FP regs, MVL may be set to larger values
+* VBLOCK header takes 16 bits; 8-bit mode may be used on the registers,
+  taking only another 16 bits; VBLOCK.SETVL requires 16 bits. Total
+  overhead for use of VBLOCK: 48 bits (3 16-bit words).
+* All instructions except fmadd may use Compressed variants. Total
+ number of 16-bit instruction words: 11.
+* Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
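For reference, the operation the inlined example implements is plain
DAXPY, shown here as a scalar Python model (the vectorised SV/RVV code
computes the same result in strip-mined chunks):

```python
def daxpy(n, a, x, y):
    """Reference scalar DAXPY: y[i] = a*x[i] + y[i] for i in 0..n-1."""
    for i in range(n):
        y[i] = a * x[i] + y[i]
    return y
```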
+
+## BigInt add <a name="bigadd"></a>
+
+[[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]
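The essence of the big-integer add — a vector of XLEN-wide limbs with the
carry propagated from one element to the next — can be modelled in Python
(an illustrative sketch, not the inlined example itself):

```python
def bigint_add(a, b, xlen=64):
    """Big-integer add as a vector of XLEN-wide limbs, least-significant
    limb first, with carry chained between elements."""
    mask = (1 << xlen) - 1
    out, carry = [], 0
    for ai, bi in zip(a, b):
        s = ai + bi + carry
        out.append(s & mask)
        carry = s >> xlen
    return out, carry
```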