as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow, so does not require
setting an additional CA flag. We first cover the chain of
-RA*RB+RC as follows:
+`RA*RB+RC` as follows:
RT0, RC0 = RA0 * RB0 + 0
|
li r16, 0 # zero accumulator
addic r16, r16, 0 # CA to zero as well
- sv.madde r0.v, r8.v, r17, r16 # mul vector
+ sv.madde r0.v, r8.v, r17, r16 # mul vector
sv.adde r24.v, r24.v, r0.v # big-add row to result
-
+
Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
afterwards. Essentially there are three phases:
* Calculation of the quotient estimate. This uses a single
- Scalar divide, which is covered separately in a later section
+ Scalar divide, which is covered separately in a later section
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
phase 1 was wrong by one digit.
However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor. Such an operation
+from a *128* bit dividend and a 64-bit divisor. Such an operation
simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
comes close with `divdeu`, by placing one operand in the upper half
of a 128-bit dividend, the lower half is zero. Again Power ISA
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorised
would instead
-lock up CPU time performing a 128/64 scalar division. With the Vector
+lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is an input into each of the vector digit
multiples, as a Dependency Hazard it would cause *all* Parallel
Here, just as with `madded` which can put the hi-half of the 128 bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of a 128/64-bit divide.
+
+ RT0 = (( 0<<64) | RA0) / RB0
+ RC0 = (( 0<<64) | RA0) % RB0
+ |
+ +-------+
+ |
+ RT1 = ((RC0<<64) | RA1) / RB1
+ RC1 = ((RC0<<64) | RA1) % RB1
+ |
+ +-------+
+ |
+ RT2 = ((RC1<<64) | RA2) / RB2
+ RC2 = ((RC1<<64) | RA2) % RB2
+
By a nice coincidence this is exactly the same 128/64-bit operation
-needed for the `qhat` estimate if it may produce both the quotient and
-the remainder.
+needed (once, rather than chained) for the `qhat` estimate if it may
+produce both the quotient and the remainder.
+The pseudocode cleanly covering both scenarios (leaving out
+overflow for clarity) can be written as:
`divrem2du RT,RA,RB,RC`
- dividend = (RC) || (RB)
- divisor = EXTZ128(RA)
+ dividend = (RC) || (RA)
+ divisor = EXTZ128(RB)
RT = UDIV(dividend, divisor)
RS = UREM(dividend, divisor)
or act in loop-back mode for big-int division by a scalar,
or for a single scalar 128/64 div/mod.
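
As an illustration only, the pseudocode above can be modelled in a few lines of Python. This is a sketch, not the specification: the helper names `divrem2du` (as a Python function), `bigint_div_scalar`, `to_digits` and `from_digits` are hypothetical, overflow and divide-by-zero handling are left out just as in the pseudocode, and the digits are assumed to be stored most-significant first so that element 0 is processed first, as in the diagram.

```python
MASK64 = (1 << 64) - 1

def divrem2du(ra, rb, rc):
    """Model of: dividend = (RC) || (RA), divisor = RB.

    Returns (RT, RS) = (quotient, remainder). Overflow and
    divide-by-zero are deliberately not handled in this sketch.
    """
    dividend = (rc << 64) | ra
    return dividend // rb, dividend % rb

def bigint_div_scalar(digits, divisor):
    """Divide a big integer (64-bit digits, most-significant first)
    by a 64-bit scalar, feeding each remainder back in as the
    hi-half of the next 128/64 divide (the loop-back mode)."""
    quotient = []
    rc = 0  # the chain starts with a zero hi-half
    for ra in digits:
        rt, rc = divrem2du(ra, divisor, rc)
        quotient.append(rt)
    return quotient, rc

def to_digits(n, count):
    """Split n into `count` 64-bit digits, most-significant first."""
    return [(n >> (64 * i)) & MASK64 for i in reversed(range(count))]

def from_digits(digits):
    """Inverse of to_digits."""
    n = 0
    for d in digits:
        n = (n << 64) | d
    return n
```

Calling `divrem2du` once with a non-zero `rc` models the single scalar 128/64 div/mod needed for the `qhat` estimate, while `bigint_div_scalar` models the chained case.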
+Again, just as with `sv.madded` and `sv.adde`, adventurous implementors
+may perform massively-wide DIV/MOD by transparently merging (fusing)
+the Vector element operations together, only inputting a single RC and
+outputting the last RC. Where efficient algorithms such as Goldschmidt
+are deployed internally this could dramatically reduce the cycle completion
+time for massive Vector DIV/MOD. Thus, just as with the other operations,
+the apparent limitation of creating chains is overcome: SVP64 is,
+by design, an "expression of intent" where the implementor is free to
+achieve that intent in any way they see fit
+as long as strict precise-aware Program Order is
+preserved (even on the VL for-loops).
+
Just as with `divdeu`, on which this instruction is based, overflow
detection is required. When the divisor is too small compared to
the dividend then the result may not fit into 64 bits. Knuth's
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VA-Form EXT04 is extremely limited.
-Looking closely at the loop however we can see that overflow
-will not occur. The initial value k is zero, and on subsequent iterations
-new k, being the modulo, is always less than the divisor. Thus the
-condition (the loop invariant) `RC < RA` is preserved, as long as RC
-starts at zero.
+Looking closely at the loop however we can see that overflow
+will not occur. The initial value k is zero: as long as a divide-by-zero
+is not requested this always fulfils the condition `RC < RB`, and on
+subsequent iterations the new k, being the modulo, is always less than the
+divisor as well. Thus the condition (the loop invariant) `RC < RB`
+is preserved, as long as RC starts at zero.
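
The invariant can be checked exhaustively on a scaled-down model. The sketch below is an assumption-laden illustration rather than part of the specification: it uses a hypothetical digit width of 4 bits instead of 64 so that every combination can be enumerated, and the function names are invented for this example. It confirms that whenever the carried-in hi-half is smaller than the divisor, the quotient fits in one digit and the new remainder is again smaller than the divisor.

```python
def invariant_holds(width):
    """Exhaustively check, for digits of `width` bits, that a hi-half
    smaller than the divisor never overflows the quotient digit and
    always yields a remainder smaller than the divisor."""
    limit = 1 << width
    for divisor in range(1, limit):      # divide-by-zero excluded
        for hi in range(divisor):        # invariant: hi < divisor
            for lo in range(limit):
                dividend = (hi << width) | lo
                q, r = divmod(dividend, divisor)
                if q >= limit or r >= divisor:
                    return False
    return True

def overflows(hi, lo, divisor, width):
    """True if the quotient does not fit in a single digit."""
    return ((hi << width) | lo) // divisor >= (1 << width)
```

With 4-bit digits, `invariant_holds(4)` returns `True`, whereas breaking the invariant, for example `overflows(3, 0, 3, 4)` where the hi-half equals the divisor, reports an overflow.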
+
+# Conclusion
+
+TODO