-# Simple-V (Parallelism Extension Proposal) Appendix
+[[!oldstandards]]
+
+# Simple-V (Parallelism Extension Proposal) Appendix (OBSOLETE)
+
+**OBSOLETE**
* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
-* Last edited: 28 jun 2019
+* Last edited: 30 jun 2019
* main spec [[specification]]
[[!toc ]]
-# Fail-on-first modes
+# Fail-on-first modes <a name="ffirst"></a>
Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (when VL=1).
-Note that implementors are required to mutually exclusively choose one or
-the other modes: an instruction is **not** permitted to fail on a trap
-*and* fail a conditional test. This advice to custom opcode writers as
-well as future extension writers.
+Note that implementors are required to mutually exclusively choose one
+or the other modes: an instruction is **not** permitted to fail on a
+trap *and* fail a conditional test at the same time. This advice to
+custom opcode writers as well as future extension writers.
## Fail-on-first traps
the elements that were predicated (masked) out (not tested up to the
point where the trap occurred).
+Unlike conditional tests, "fail-on-first trap" instruction behaviour is
+unaltered by setting zero or non-zero predication mode.
+
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst is not set); subsequently, the
trap must not occur in the *sub-group* of elements. SUBVL will **NOT**
of elements that were (sequentially) processed before the fail-condition
was encountered.
+Unlike trap fail-on-first, fail-on-first conditional testing behaviour
+responds to changes in the zero or non-zero predication mode. Whilst
+in non-zeroing mode, masked-out elements are simply not tested (and
+thus considered "never to fail"), in zeroing mode, masked-out elements
+may be viewed as *always* (unconditionally) failing. This effectively
+turns VL into something akin to a software-controlled loop.
+
Note that just as with traps, if SUBVL!=1, the first trap in the
*sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
-table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
+table) and enhancing them to branch "consensually" depending on *multiple*
+tests. "ffirst" - fail-on-first - condition mode can also be enabled,
+to terminate the comparisons early.
See ffirst mode in the Predication Table section.
+There are two registers for the comparison operation, therefore there
+is the opportunity to associate two predicate registers (note: not in
+the same way as twin-predication). The first is a "normal" predicate
+register, which acts just as it does on any other single-predicated
+operation: masks out elements where a bit is zero, applies an inversion
+to the predicate mask, and enables zeroing / non-zeroing mode.
+
+The second (not to be confused with a twin-predication 2nd register)
+is utilised to indicate where the results of each comparison are to
+be stored, as a bitmask. Additionally, the behaviour of the branch -
+when it occurs - may also be modified depending on whether the 2nd predicate's
+"invert" and "zeroing" bits are set. These four combinations result
+in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
+cbranch.ifall (AND), cbranch.ifnotall (NAND).
+
+| invert | zeroing | description | operation | cbranch |
+| ------ | ------- | --------------------------- | --------- | ------- |
+| 0 | 0 | branch if all pass | AND | ifall |
+| 1 | 0 | branch if one fails | NAND | ifnall |
+| 0 | 1 | branch if one passes | OR | ifany |
+| 1 | 1 | branch if all fail | NOR | ifnone |
+
+This inversion capability covers AND, OR, NAND and NOR branching
+based on multiple element comparisons. Without the full set of four,
+it is necessary to have two-sequence branch operations: one conditional, one
+unconditional.
+
+Note that unlike normal computer programming, early-termination of chains
+of AND or OR conditional tests, the chain does *not* terminate early
+except if fail-on-first is set, and even then ffirst ends on the first
+data-dependent zero. When ffirst mode is not set, *all* conditional
+element tests must be performed (and the result optionally stored in
+the result mask), with a "post-analysis" phase carried out which checks
+whether to branch.
+
+Note also that whilst it may seem excessive to have all four (because
+conditional comparisons may be inverted by swapping src1 and src2),
+data-dependent fail-on-first is *not* invertible and *only* terminates
+on first zero-condition encountered. Additionally it may be inconvenient
+to have to swap the predicate registers associated with src1 and src2,
+because this involves a new VBLOCK Context.
+
### Standard Branch <a name="standard_branch"></a>
Branch operations use standard RV opcodes that are reinterpreted to
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BTGU may be synthesised by inverting
-src1 and src2.
+src1 and src2, however note that in doing so, the predicate table
+setup must also be correspondingly adjusted.
In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":
ps = get_pred_val(I/F==INT, rs1);
rd = get_pred_val(I/F==INT, rs2); # this may not exist
+ ffirst_mode, zeroing = get_pred_flags(rs1)
+ if exists(rd):
+ pred_inversion, pred_zeroing = get_pred_flags(rs2)
+ else
+ pred_inversion, pred_zeroing = False, False
+
if not exists(rd) or zeroing:
- result = 0
+ result = (1<<VL)-1 # all 1s
else
result = preg[rd]
result |= 1<<i;
else
result &= ~(1<<i);
+ if ffirst_mode:
+ break
- if not exists(rd)
- if result == ps
- goto branch
- else
+ if exists(rd):
preg[rd] = result # store in destination
- if preg[rd] == ps
- goto branch
+
+ if pred_inversion:
+ if pred_zeroing:
+ # NOR
+ if result == 0:
+ goto branch
+ else:
+ # NAND
+ if (result & ps) != result:
+ goto branch
+ else:
+ if pred_zeroing:
+ # OR
+ if result != 0:
+ goto branch
+ else:
+ # AND
+ if (result & ps) == result:
+ goto branch
Notes:
is also marked as scalar, this is how the compatibility with
standard RV LOAD/STORE is preserved by this algorithm.
-### Example Tables showing LOAD elements
+### Example Tables showing LOAD elements <a name="load_example"></a>
This section contains examples of vectorised LOAD operations, showing
how the two stage process works (three if zero/sign-extension is included).
"inactive" for predicated elements, even though it results in
less than 100% ALU utilisation.
-## Twin-predication (based on source and destination register)
+## Twin-predication (based on source and destination register) <a name="tpred"></a>
Twin-predication is not that much different, except that that
the source is independently zero-predicated from the destination.
TODO evaluate strncpy and strlen
<https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
-## strncpy
+## strncpy <a name="strncpy"></>
-RVV version: <a name="strncpy"></>
+RVV version:
- strncpy:
- mv a3, a0 # Copy dst
- loop:
- setvli x0, a2, vint8 # Vectors of bytes.
- vlbff.v v1, (a1) # Get src bytes
- vseq.vi v0, v1, 0 # Flag zero bytes
- vmfirst a4, v0 # Zero found?
+ strncpy:
+ c.mv a3, a0 # Copy dst
+ loop:
+ setvli x0, a2, vint8 # Vectors of bytes.
+ vlbff.v v1, (a1) # Get src bytes
+ vseq.vi v0, v1, 0 # Flag zero bytes
+ vmfirst a4, v0 # Zero found?
vmsif.v v0, v0 # Set mask up to and including zero byte.
- vsb.v v1, (a3), v0.t # Write out bytes
- bgez a4, exit # Done
- csrr t1, vl # Get number of bytes fetched
- add a1, a1, t1 # Bump src pointer
- sub a2, a2, t1 # Decrement count.
- add a3, a3, t1 # Bump dst pointer
- bnez a2, loop # Anymore?
+ vsb.v v1, (a3), v0.t # Write out bytes
+ c.bgez a4, exit # Done
+ csrr t1, vl # Get number of bytes fetched
+ c.add a1, a1, t1 # Bump src pointer
+ c.sub a2, a2, t1 # Decrement count.
+ c.add a3, a3, t1 # Bump dst pointer
+ c.bnez a2, loop # Anymore?
- exit:
- ret
+ exit:
+ c.ret
SV version (WIP):
strncpy:
- mv a3, a0
- SETMVLI 8 # set max vector to 8
- RegCSR[a3] = 8bit, a3, scalar
- RegCSR[a1] = 8bit, a1, scalar
- RegCSR[t0] = 8bit, t0, vector
- PredTb[t0] = ffirst, x0, inv
+ c.mv a3, a0
+ VBLK.RegCSR[t0] = 8bit, t0, vector
+ VBLK.PredTb[t0] = ffirst, x0, inv
loop:
- SETVLI a2, t4 # t4 and VL now 1..8
- ldb t0, (a1) # t0 fail first mode
- bne t0, x0, allnonzero # still ff
- # VL points to last nonzero
- GETVL t4 # from bne tests
- addi t4, t4, 1 # include zero
- SETVL t4 # set exactly to t4
- stb t0, (a3) # store incl zero
- ret # end subroutine
+ VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8)
+ c.ldb t0, (a1) # t0 fail first mode
+ c.bne t0, x0, allnonzero # still ff
+ # VL (t4) points to last nonzero
+ c.addi t4, t4, 1 # include zero
+ c.stb t0, (a3) # store incl zero
+ c.ret # end subroutine
allnonzero:
- stb t0, (a3) # VL legal range
- GETVL t4 # from bne tests
- add a1, a1, t4 # Bump src pointer
- sub a2, a2, t4 # Decrement count.
- add a3, a3, t4 # Bump dst pointer
- bnez a2, loop # Anymore?
+ c.stb t0, (a3) # VL legal range
+ c.add a1, a1, t4 # Bump src pointer
+ c.sub a2, a2, t4 # Decrement count.
+ c.add a3, a3, t4 # Bump dst pointer
+ c.bnez a2, loop # Anymore?
exit:
- ret
+ c.ret
Notes:
into t0 (could contain zeros).
* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
scalar x0
-* however as t0 is in ffirst mode, the first fail wil ALSO stop the
+* however as t0 is in ffirst mode, the first fail will ALSO stop the
compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using a4) to include
RVV version:
- mv a3, a0 # Save start
- loop:
+ mv a3, a0 # Save start
+ loop:
setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
vldbff.v v1, (a3) # Get bytes
csrr a1, vl # Get bytes actually read e.g. if fault
- vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
+ vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
add a3, a3, a1 # Bump pointer
vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
bltz a2, loop # Not found?
add a0, a0, a1 # Sum start + bump
add a3, a3, a2 # Add index of zero byte
sub a0, a3, a0 # Subtract start address+bump
- ret
+ ret
-## DAXPY
+## DAXPY <a name="daxpy"></a>
[[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
+
+Notes:
+
+* Setting MVL to 4 is just an example. With enough space between the
+ FP regs, MVL may be set to larger values
+* VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
+ taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
+ overhead for use of VBLOCK: 48 bits (3 16-bit words).
+* All instructions except fmadd may use Compressed variants. Total
+ number of 16-bit instruction words: 11.
+* Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
+
+## BigInt add <a name="bigadd"></a>
+
+[[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]