(no commit message)

[libreriscv.git] / simple_v_extension / appendix.mdwn
diff --git a/simple_v_extension/appendix.mdwn b/simple_v_extension/appendix.mdwn

index 65a573dd48b7445d3148dabbfe6cd4689d92ff0d..c29044cfea6b9772be22c43d9b8dc3d968f819ee 100644
--- a/simple_v_extension/appendix.mdwn
+++ b/simple_v_extension/appendix.mdwn
@@ -1,13 +1,17 @@
-# Simple-V (Parallelism Extension Proposal) Appendix
+[[!oldstandards]]
+
+# Simple-V (Parallelism Extension Proposal) Appendix (OBSOLETE)
+
+**OBSOLETE**
  
  * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
  * Status: DRAFTv0.6
-* Last edited: 25 jun 2019
+* Last edited: 30 jun 2019
  * main spec [[specification]]
  
  [[!toc ]]
  
-# Fail-on-first modes
+# Fail-on-first modes <a name="ffirst"></a>
  
  Fail-on-first data dependency has different behaviour for traps than
  for conditional testing.  "Conditional" is taken to mean "anything
@@ -15,47 +19,67 @@ that is zero", however with traps, the first element has to
  be given the opportunity to throw the exact same trap that would
  be thrown if this were a scalar operation (when VL=1).
  
+Note that implementors are required to choose exactly one of the two
+modes: an instruction is **not** permitted to fail on a
+trap *and* fail a conditional test at the same time.  This advice applies
+to custom opcode writers as well as future extension writers.
+
  ## Fail-on-first traps
  
  Except for the first element, ffirst stops sequential element processing
  when a trap occurs.  The first element is treated normally (as if ffirst
  is clear).  Should any subsequent element instruction require a trap,
  instead it and subsequent indexed elements are ignored (or cancelled in
-out-of-order designs), and VL is set to the *last* instruction that did
-not take the trap.
+out-of-order designs), and VL is set to the *last* in-sequence instruction
+that did not take the trap.
  
-Note that predicated-out elements (where the predicate mask bit is zero)
-are clearly excluded (i.e. the trap will not occur).  However, note that
-the loop still had to test the predicate bit: thus on return,
+Note that predicated-out elements (where the predicate mask bit is
+zero) are clearly excluded (i.e. the trap will not occur).  However,
+note that the loop still had to test the predicate bit: thus on return,
  VL is set to include elements that did not take the trap *and* includes
  the elements that were predicated (masked) out (not tested up to the
  point where the trap occurred).
  
+Unlike conditional tests, "fail-on-first trap" instruction behaviour is
+unaltered by setting zero or non-zero predication mode.
+
  If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
-will cause a trap as normal (as if ffirst is not set); subsequently,
-the trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
-be modified.
+will cause a trap as normal (as if ffirst is not set); subsequently, the
+trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
+be modified.  Traps must analyse (x)eSTATE (subvl offset indices) to
+determine the element that caused the trap.
  
  Given that predication bits apply to SUBVL groups, the same rules apply
-to predicated-out (masked-out) sub-groups in calculating the value that VL
-is set to.
+to predicated-out (masked-out) sub-groups in calculating the value that
+VL is set to.
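
The trap rules above can be sketched in a few lines of Python. This is only an illustrative model, not part of the specification: `ElementTrap`, `ffirst_trap_vl` and the `would_trap` list are hypothetical names, SUBVL is taken to be 1, and "the first element" is assumed to mean element index 0.

```python
class ElementTrap(Exception):
    """Stands in for a hardware trap (hypothetical name, illustration only)."""


def ffirst_trap_vl(would_trap, predicate, vl):
    """Return the VL left behind by a fail-on-first (trap) operation.

    would_trap: one bool per element, True if that element would trap
    predicate:  integer bitmask, bit i set means element i is active
    vl:         the vector length before the operation
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            # Masked-out element: never traps, but is still counted in VL.
            continue
        if would_trap[i]:
            if i == 0:
                # Element 0 behaves exactly like the scalar operation.
                raise ElementTrap("element 0 traps as a scalar op would")
            # Later elements do not trap: VL is truncated instead.
            return i
    return vl
```

For example, `ffirst_trap_vl([False, False, False, True, False], 0b11111, 5)` gives 3, whereas masking out element 3 (predicate `0b10111`) leaves VL at 5.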
  
  ## Fail-on-first conditional tests
  
-ffirst stops sequential element conditional testing on the first element result
-being zero.  VL is set to the number of elements that were processed before
-the fail-condition was encountered.
-
-Note that just as with traps, if SUBVL!=1, the first of any of the *sub-group*
-will cause the processing to end, and, even if there were elements within
-the *sub-group* that passed the test, that sub-group is still (entirely)
-excluded from the count (from setting VL).  i.e. VL is set to the total
-number of *sub-groups* that had no fail-condition up until execution was
-stopped.
+ffirst stops sequential (or sequentially-appearing in the case of
+out-of-order designs) element conditional testing on the first element
+result being zero (or other "fail" condition).  VL is set to the number
+of elements that were (sequentially) processed before the fail-condition
+was encountered.
+
+Unlike trap fail-on-first, fail-on-first conditional testing behaviour
+responds to changes in the zero or non-zero predication mode.  Whilst
+in non-zeroing mode, masked-out elements are simply not tested (and
+thus considered "never to fail"), in zeroing mode, masked-out elements
+may be viewed as *always* (unconditionally) failing.  This effectively
+turns VL into something akin to a software-controlled loop.
+
+Note that just as with traps, if SUBVL!=1, the first trap in the
+*sub-group* will cause the processing to end, and, even if there were
+elements within the *sub-group* that passed the test, that sub-group is
+still (entirely) excluded from the count (from setting VL).  i.e. VL is
+set to the total number of *sub-groups* that had no fail-condition up
+until execution was stopped.  However, again: SUBVL must not be modified:
+traps must analyse (x)eSTATE (subvl offset indices) to determine the
+element that caused the trap.
  
  Note again that, just as with traps, predicated-out (masked-out) elements
-are included in the count leading up to the fail-condition, even though they
-were not tested.
+are included in the (sequential) count leading up to the fail-condition,
+even though they were not tested.
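
As a rough illustration of the difference between the two predication modes, the following Python sketch computes the VL that results from a fail-on-first conditional test. The function name and arguments are hypothetical, SUBVL is assumed to be 1, and "fail" is taken to mean a zero result.

```python
def ffirst_conditional_vl(results, predicate, vl, zeroing=False):
    """Return the VL after a fail-on-first conditional test.

    results:   per-element test results (zero means "fail")
    predicate: integer bitmask, bit i set means element i is active
    zeroing:   True for zeroing predication mode, False for non-zeroing
    """
    for i in range(vl):
        active = (predicate >> i) & 1
        if not active:
            if zeroing:
                # Zeroing mode: a masked-out element counts as a failure.
                return i
            # Non-zeroing mode: not tested, so it can never fail,
            # but it is still included in the sequential count.
            continue
        if results[i] == 0:
            # First failing element: VL covers only the elements before it.
            return i
    return vl
```

With `results = [5, 3, 0, 7]` and every element active, VL becomes 2. If element 2 is masked out (predicate `0b1011`), non-zeroing mode leaves VL at 4 while zeroing mode again stops at 2.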
  
  # Instructions <a name="instructions" />
  
@@ -105,23 +129,10 @@ attention must be paid.
  Example pseudo-code for an integer ADD operation (including scalar
  operations).  Floating-point uses the FP Register Table.
  
-    function op_add(rd, rs1, rs2) # add not VADD!
-      int i, id=0, irs1=0, irs2=0;
-      predval = get_pred_val(FALSE, rd);
-      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
-      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
-      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
-      for (i = 0; i < VL; i++)
-        xSTATE.srcoffs = i # save context
-        if (predval & 1<<i) # predication uses intregs
-           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
-           if (!int_vec[rd ].isvector) break;
-        if (int_vec[rd ].isvector)  { id += 1; }
-        if (int_vec[rs1].isvector)  { irs1 += 1; }
-        if (int_vec[rs2].isvector)  { irs2 += 1; }
+[[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
  
  Note that for simplicity there is quite a lot missing from the above
-pseudo-code: element widths, zeroing on predication, dimensional
+pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
  reshaping and offsets and so on.  However it demonstrates the basic
  principle.  Augmentations that produce the full pseudo-code are covered in
  other sections.
@@ -130,7 +141,7 @@ other sections.
  
  Adding in support for SUBVL is a matter of adding in an extra inner
  for-loop, where register src and dest are still incremented inside the
-inner part. Not that the predication is still taken from the VL index.
+inner part. Note that the predication is still taken from the VL index.
  
  So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
  indexed by "(i)"
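
A small Python sketch may make the two index spaces easier to see. It only enumerates the indices; `subvl_indices` is a hypothetical helper, not something defined by the specification.

```python
def subvl_indices(vl, subvl, predicate):
    """Yield (i, s, element_index) for every active element.

    The predicate is consulted once per outer index i, while the
    element index advances as i * subvl + s.
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            # The whole sub-group is masked out by a single predicate bit.
            continue
        for s in range(subvl):
            yield i, s, i * subvl + s
```

For instance, `list(subvl_indices(2, 3, 0b10))` yields `[(1, 0, 3), (1, 1, 4), (1, 2, 5)]`: predicate bit 0 removes elements 0 to 2 as a group, and bit 1 enables elements 3 to 5.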
@@ -187,9 +198,53 @@ comprehensive in its effect on instructions.
  Branch operations are augmented slightly to be a little more like FP
  Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
  of multiple comparisons into a register (taken indirectly from the predicate
-table).  As such, "ffirst" - fail-on-first - condition mode can be enabled.
+table) and enhancing them to branch "consensually" depending on *multiple*
+tests.  "ffirst" - fail-on-first - condition mode can also be enabled,
+to terminate the comparisons early.
  See ffirst mode in the Predication Table section.
  
+There are two registers for the comparison operation, therefore there
+is the opportunity to associate two predicate registers (note: not in
+the same way as twin-predication).  The first is a "normal" predicate
+register, which acts just as it does on any other single-predicated
+operation: masks out elements where a bit is zero, applies an inversion
+to the predicate mask, and enables zeroing / non-zeroing mode.
+
+The second (not to be confused with a twin-predication 2nd register)
+is utilised to indicate where the results of each comparison are to
+be stored, as a bitmask.  Additionally, the behaviour of the branch -
+when it occurs - may also be modified depending on whether the 2nd predicate's
+"invert" and "zeroing" bits are set.  These four combinations result
+in "consensual branches", cbranch.ifnone (NOR), cbranch.ifany (OR),
+cbranch.ifall (AND), cbranch.ifnotall (NAND).
+
+| invert | zeroing | description                 | operation | cbranch  |
+| ------ | ------- | --------------------------- | --------- | -------- |
+| 0      | 0       | branch if all pass          | AND       | ifall    |
+| 1      | 0       | branch if any fail          | NAND      | ifnotall |
+| 0      | 1       | branch if any pass          | OR        | ifany    |
+| 1      | 1       | branch if all fail          | NOR       | ifnone   |
+
+This inversion capability covers AND, OR, NAND and NOR branching
+based on multiple element comparisons. Without the full set of four,
+it is necessary to have two-sequence branch operations: one conditional, one
+unconditional.
+
+Note that, unlike normal computer programming, where chains of AND or
+OR conditional tests terminate early, the chain here does *not* terminate
+early except if fail-on-first is set, and even then ffirst ends on the first
+data-dependent zero.  When ffirst mode is not set, *all* conditional
+element tests must be performed (and the result optionally stored in
+the result mask), with a "post-analysis" phase carried out which checks
+whether to branch.
+
+Note also that whilst it may seem excessive to have all four (because
+conditional comparisons may be inverted by swapping src1 and src2),
+data-dependent fail-on-first is *not* invertible and *only* terminates
+on first zero-condition encountered.  Additionally it may be inconvenient
+to have to swap the predicate registers associated with src1 and src2,
+because this involves a new VBLOCK Context.
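
The four combinations can be modelled with a short Python sketch. It is a simplification under stated assumptions: `passed` holds the comparison outcomes of the active (unmasked) elements only, and the function name is hypothetical.

```python
def consensual_branch(passed, invert, zeroing):
    """Decide whether a consensual branch is taken.

    passed:  list of bools, one per active element comparison
    invert:  the 2nd predicate's "invert" bit
    zeroing: the 2nd predicate's "zeroing" bit
    """
    all_pass = all(passed)
    any_pass = any(passed)
    if invert:
        # zeroing=1 -> NOR (ifnone), zeroing=0 -> NAND (ifnotall)
        return (not any_pass) if zeroing else (not all_pass)
    # zeroing=1 -> OR (ifany), zeroing=0 -> AND (ifall)
    return any_pass if zeroing else all_pass
```

With `passed = [True, False, True]`, the AND and NOR forms do not branch, while the NAND and OR forms do.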
+
  ### Standard Branch <a name="standard_branch"></a>
  
  Branch operations use standard RV opcodes that are reinterpreted to
@@ -226,7 +281,8 @@ to zero if **zeroing** is enabled.
  
  Note that just as with the standard (scalar, non-predicated) branch
  operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
-src1 and src2.
+src1 and src2, however note that in doing so, the predicate table
+setup must also be correspondingly adjusted.
  
  In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
  for predicated compare operations of function "cmp":
@@ -254,8 +310,14 @@ complex), this becomes:
      ps = get_pred_val(I/F==INT, rs1);
      rd = get_pred_val(I/F==INT, rs2); # this may not exist
  
+    ffirst_mode, zeroing = get_pred_flags(rs1)
+    if exists(rd):
+        pred_inversion, pred_zeroing = get_pred_flags(rs2)
+    else:
+        pred_inversion, pred_zeroing = False, False
+
      if not exists(rd) or zeroing:
-        result = 0
+        result = (1<<VL)-1 # all 1s
      else
          result = preg[rd]
  
@@ -269,14 +331,30 @@ complex), this becomes:
                result |= 1<<i;
            else
                result &= ~(1<<i);
+              if ffirst_mode:
+                break
  
-     if not exists(rd)
-        if result == ps
-            goto branch
-     else
+    if exists(rd):
          preg[rd] = result # store in destination
-        if preg[rd] == ps
-            goto branch
+
+    if pred_inversion:
+        if pred_zeroing:
+            # NOR
+            if result == 0:
+                goto branch
+        else:
+            # NAND
+            if (result & ps) != result:
+                goto branch
+    else:
+        if pred_zeroing:
+            # OR
+            if result != 0:
+                goto branch
+        else:
+            # AND
+            if (result & ps) == result:
+                goto branch
  
  Notes:
  
@@ -330,12 +408,12 @@ The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
  so on.  Consequently, unlike integer-branch, FP Compare needs no
  modification in its behaviour.
  
-In addition, it is noted that an entry "FNE" (the opposite of FEQ) is missing,
-and whilst in ordinary branch code this is fine because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero), in predication terms that
-becomes more of an impact.  To deal with this, SV's predication has
-had "invert" added to it.
+In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
+missing, and whilst in ordinary branch code this is fine because the
+standard RVF compare can always be followed up with an integer BEQ or
+a BNE (or a compressed comparison to zero or non-zero), in predication
+terms that becomes more of an impact.  To deal with this, SV's predication
+has had "invert" added to it.
  
  Also: note that FP Compare may be predicated, using the destination
  integer register (rd) to determine the predicate.  FP Compare is **not**
@@ -802,15 +880,15 @@ to those produced by the above algorithm.
  
  ## Polymorphic floating-point operation exceptions and error-handling
  
-For floating-point operations, conversion takes place without
-raising any kind of exception.  Exactly as specified in the standard
-RV specification, NAN (or appropriate) is stored if the result
-is beyond the range of the destination, and, again, exactly as
-with the standard RV specification just as with scalar
-operations, the floating-point flag is raised (FCSR).  And, again, just as
-with scalar operations, it is software's responsibility to check this flag.
-Given that the FCSR flags are "accrued", the fact that multiple element
-operations could have occurred is not a problem.
+For floating-point operations, conversion takes place without raising any
+kind of exception.  Exactly as specified in the standard RV specification,
+NAN (or appropriate) is stored if the result is beyond the range of the
+destination, and, again, exactly as with the standard RV specification
+just as with scalar operations, the floating-point flag is raised
+(FCSR).  And, again, just as with scalar operations, it is software's
+responsibility to check this flag.  Given that the FCSR flags are
+"accrued", the fact that multiple element operations could have occurred
+is not a problem.
  
  Note that it is perfectly legitimate for floating-point bitwidths of
  only 8 to be specified.  However whilst it is possible to apply IEEE 754
@@ -821,11 +899,11 @@ proceeding.
  
  ## Polymorphic shift operators
  
-A special note is needed for changing the element width of left and right
-shift operators, particularly right-shift.  Even for standard RV base,
-in order for correct results to be returned, the second operand RS2 must
-be truncated to be within the range of RS1's bitwidth.  spike's implementation
-of sll for example is as follows:
+A special note is needed for changing the element width of left and
+right shift operators, particularly right-shift.  Even for standard RV
+base, in order for correct results to be returned, the second operand
+RS2 must be truncated to be within the range of RS1's bitwidth.
+spike's implementation of sll for example is as follows:
  
      WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
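
The effect of the `RS2 & (xlen-1)` truncation in the spike expression above can be reproduced in Python. This is a sketch only: the `xlen` parameter is reused here to stand in for a narrower element width, which is an assumption made for illustration and holds only for power-of-two widths.

```python
def sll(rs1, rs2, xlen=64):
    """Model of sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1)))."""
    mask = (1 << xlen) - 1
    shamt = rs2 & (xlen - 1)            # truncate the shift amount
    result = ((rs1 & mask) << shamt) & mask
    if result >> (xlen - 1):            # sign-extend from bit xlen-1
        result -= 1 << xlen
    return result
```

`sll(1, 65)` returns 2 because the shift amount 65 is truncated to 1, and `sll(1, 9, xlen=8)` likewise returns 2 when the width is treated as 8 bits.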
  
@@ -1022,7 +1100,7 @@ Note:
    is also marked as scalar, this is how the compatibility with
    standard RV LOAD/STORE is preserved by this algorithm.
  
-### Example Tables showing LOAD elements
+### Example Tables showing LOAD elements <a name="load_example"></a>
  
  This section contains examples of vectorised LOAD operations, showing
  how the two stage process works (three if zero/sign-extension is included).
@@ -1039,13 +1117,12 @@ This is:
  * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
  * RV64, where XLEN=64 is assumed.
  
-First, the memory table, which, due to the
-element width being 16 and the operation being LD (64), the 64-bits
-loaded from memory are subdivided into groups of **four** elements.
-And, with VL being 7 (deliberately to illustrate that this is reasonable
-and possible), the first four are sourced from the offset addresses pointed
-to by x5, and the next three from the ofset addresses pointed to by
-the next contiguous register, x6:
+First, the memory table, which, due to the element width being 16 and the
+operation being LD (64), the 64-bits loaded from memory are subdivided
+into groups of **four** elements.  And, with VL being 7 (deliberately
+to illustrate that this is reasonable and possible), the first four are
+sourced from the offset addresses pointed to by x5, and the next three
+from the offset addresses pointed to by the next contiguous register, x6:
  
  [[!table  data="""
  addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
@@ -1294,9 +1371,9 @@ rs1 equals the bitwidth of rs2, no sign-extending will occur.  It is
  only where the bitwidth of either rs1 or rs2 are different, will the
  lesser-width operand be sign-extended.
  
-Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
-where for add they are both zero-extended.  This holds true for all arithmetic
-operations ending with "W".
+Effectively however, both rs1 and rs2 are being sign-extended (or
+truncated), where for add they are both zero-extended.  This holds true
+for all arithmetic operations ending with "W".
  
  ### addiw
  
@@ -1383,7 +1460,7 @@ circumstances it is perfectly fine to simply have the lanes
  "inactive" for predicated elements, even though it results in
  less than 100% ALU utilisation.
  
-## Twin-predication (based on source and destination register)
+## Twin-predication (based on source and destination register) <a name="tpred"></a>
  
  Twin-predication is not that much different, except that
  the source is independently zero-predicated from the destination.
@@ -1511,92 +1588,121 @@ of total length 128 bit given that XLEN is now 128.
  TODO evaluate strncpy and strlen
  <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
  
-## strncpy
-
-RVV version: <a name="strncpy"></>
-
-    strncpy: 
-        mv a3, a0               # Copy dst 
-    loop: 
-        setvli x0, a2, vint8    # Vectors of bytes. 
-        vlbff.v v1, (a1)        # Get src bytes 
-        vseq.vi v0, v1, 0       # Flag zero bytes 
-        vmfirst a4, v0          # Zero found? 
-        vmsif.v v0, v0          # Set mask up to and including zero byte. Ppplio
-        vsb.v v1, (a3), v0.t    # Write out bytes 
-        bgez a4, exit           # Done 
-        csrr t1, vl             # Get number of bytes fetched 
-        add a1, a1, t1          # Bump src pointer 
-        sub a2, a2, t1          # Decrement count. 
-        add a3, a3, t1          # Bump dst pointer 
-        bnez a2, loop           # Anymore? 
-
-    exit: 
-        ret 
+## strncpy <a name="strncpy"></a>
+
+RVV version:
+
+    strncpy:
+        c.mv a3, a0             # Copy dst
+    loop:
+        setvli x0, a2, vint8    # Vectors of bytes.
+        vlbff.v v1, (a1)        # Get src bytes
+        vseq.vi v0, v1, 0       # Flag zero bytes
+        vmfirst a4, v0          # Zero found?
+        vmsif.v v0, v0          # Set mask up to and including zero byte.
+        vsb.v v1, (a3), v0.t    # Write out bytes
+        c.bgez a4, exit         # Done
+        csrr t1, vl             # Get number of bytes fetched
+        c.add a1, a1, t1        # Bump src pointer
+        c.sub a2, a2, t1        # Decrement count.
+        c.add a3, a3, t1        # Bump dst pointer
+        c.bnez a2, loop         # Anymore?
+
+    exit:
+        c.ret
  
  SV version (WIP):
  
      strncpy:
-        mv a3, a0
-        SETMVLI 8 # set max vector to 8
-        RegCSR[a3] = 8bit, a3, scalar
-        RegCSR[a1] = 8bit, a1, scalar
-        RegCSR[t0] = 8bit, t0, vector
-        PredTb[t0] = ffirst, x0, inv
+        c.mv a3, a0
+        VBLK.RegCSR[t0] = 8bit, t0, vector
+        VBLK.PredTb[t0] = ffirst, x0, inv
      loop:
-        SETVLI a2, t4 # t4 and VL now 1..8
-        ldb t0, (a1) # t0 fail first mode
-        bne t0, x0, allnonzero # still ff
-        # VL points to last nonzero
-        GETVL t4       # from bne tests
-        addi t4, t4, 1 # include zero
-        SETVL t4       # set exactly to t4
-        stb t0, (a3)   # store incl zero
-        ret            # end subroutine
+        VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8)
+        c.ldb t0, (a1) # t0 fail first mode
+        c.bne t0, x0, allnonzero # still ff
+        # VL (t4) points to last nonzero
+        c.addi t4, t4, 1 # include zero
+        c.stb t0, (a3)   # store incl zero
+        c.ret            # end subroutine
      allnonzero:
-        stb t0, (a3)    # VL legal range
-        GETVL t4        # from bne tests
-        add a1, a1, t4  # Bump src pointer 
-        sub a2, a2, t4  # Decrement count. 
-        add a3, a3, t4  # Bump dst pointer 
-        bnez a2, loop   # Anymore? 
+        c.stb t0, (a3)    # VL legal range
+        c.add a1, a1, t4  # Bump src pointer
+        c.sub a2, a2, t4  # Decrement count.
+        c.add a3, a3, t4  # Bump dst pointer
+        c.bnez a2, loop   # Anymore?
      exit:
-        ret
+        c.ret
  
  Notes:
  
-* Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
-* obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
-* with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
-* RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
-* RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
-* with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
-* setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
+* Setting MVL to 8 is just an example. If enough registers are spare it
+  may be set to XLEN which will require a bank of 8 scalar registers for
+  a1, a3 and t0.
+* obviously if that is done, t0 is not separated by 8 full registers, and
+  would overwrite t1 thru t7. x80 would work well, as an example, instead.
+* with the exception of the GETVL (a pseudo code alias for csrr), every
+  single instruction above may use RVC.
+* RVC C.BNEZ can be used because rs1' may be extended to the full 128
+  registers through redirection
+* RVC C.LW and C.SW may be used because the W format may be overridden by
+  the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
+* with the exception of the GETVL, all Vector Context may be done in
+  VBLOCK form.
+* setting predication to x0 (zero) and invert on t0 is a trick to enable
+  just ffirst on t0
  * ldb and bne are both using t0, both in ffirst mode
-* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
-* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
-* however as t0 is in ffirst mode, the first fail wil ALSO stop the compares, and reduce VL as well
+* t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
+  vectorised, no (un)sign-extension or truncation" mode.
+* ldb will end on illegal mem and reduce VL, but will have copied all
+  sorts of stuff into t0 (which could contain zeros).
+* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
+  scalar x0
+* however as t0 is in ffirst mode, the first fail will ALSO stop the
+  compares, and reduce VL as well
  * the branch only goes to allnonzero if all tests succeed
-* if it did not, we can safely increment VL by 1 (using a4) to include the zero.
+* if it did not, we can safely increment VL by 1 (using t4) to include
+  the zero.
  * SETVL sets *exactly* the requested amount into VL.
-* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
+* the SETVL just after allnonzero label is needed in case the ldb ffirst
+  activates but the bne allzeros does not.
  * this would cause the stb to copy up to the end of the legal memory
-* of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
+* of course, on the next loop the ldb would throw a trap, as a1 now
+  points to the first illegal mem location.
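
The control flow described in these notes can be approximated by a behavioural Python model. It is a sketch under several assumptions rather than a description of the hardware: the end of the `src` bytes object stands in for the first illegal memory location, `MemoryError` stands in for the trap, and the function name is hypothetical.

```python
def sv_strncpy_model(src: bytes, n: int, mvl: int = 8) -> bytes:
    """Copy up to n bytes in chunks of at most mvl, stopping after a zero."""
    dst = bytearray()
    pos = 0
    while n > 0:
        vl = min(n, mvl)                  # SETVLI: VL in 1..MVL
        chunk = src[pos:pos + vl]         # ffirst load: may return fewer bytes
        if not chunk:
            # The very first element faulted, so the trap is taken as normal.
            raise MemoryError("load from illegal memory")
        vl = len(chunk)                   # VL reduced by the ffirst load
        nonzero = 0
        for byte in chunk:                # ffirst compare against zero
            if byte == 0:
                break
            nonzero += 1
        if nonzero < vl:
            # A zero byte was found: store it as well, then finish.
            dst += chunk[:nonzero + 1]
            return bytes(dst)
        dst += chunk                      # all nonzero: store the whole chunk
        pos += vl
        n -= vl
    return bytes(dst)
```

As an example, `sv_strncpy_model(b"hello world\0junk", 32)` copies two chunks and returns `b"hello world\0"`.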
  
  ## strcpy
  
  RVV version:
  
-        mv a3, a0             # Save start 
-    loop: 
+        mv a3, a0             # Save start
+    loop:
          setvli a1, x0, vint8  # byte vec, x0 (Zero reg) => use max hardware len
          vldbff.v v1, (a3)     # Get bytes
          csrr a1, vl           # Get bytes actually read e.g. if fault
-        vseq.vi v0, v1, 0     # Set v0[i] where v1[i] = 0 
+        vseq.vi v0, v1, 0     # Set v0[i] where v1[i] = 0
          add a3, a3, a1        # Bump pointer
          vmfirst a2, v0        # Find first set bit in mask, returns -1 if none
          bltz a2, loop         # Not found?
          add a0, a0, a1        # Sum start + bump
          add a3, a3, a2        # Add index of zero byte
          sub a0, a3, a0        # Subtract start address+bump
-        ret 
+        ret
+
+## DAXPY <a name="daxpy"></a>
+
+[[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
+
+Notes:
+
+* Setting MVL to 4 is just an example.  With enough space between the
+  FP regs, MVL may be set to larger values
+* VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
+  taking only another 16 bits, VBLOCK.SETVL requires 16 bits.  Total
+  overhead for use of VBLOCK: 48 bits (3 16-bit words).
+* All instructions except fmadd may use Compressed variants.  Total
+  number of 16-bit instruction words: 11.
+* Total: 14 16-bit words.  By contrast, RVV requires around 18 16-bit words.
+
+## BigInt add <a name="bigadd"></a>
+
+[[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]