Merge branch 'master' of ssh://git.libre-riscv.org:922/libreriscv

author Tobias Platen <tplaten@posteo.de>

Fri, 29 Apr 2022 15:04:17 +0000 (17:04 +0200)

committer Tobias Platen <tplaten@posteo.de>

Fri, 29 Apr 2022 15:04:17 +0000 (17:04 +0200)
diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn

index 7e0c72aa655756a039c6e5ca9f0b60a5592ced45..8ddce0c3a91dc595df647e1b7b45cad11754af46 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -247,7 +247,7 @@ Successive iterations thus effectively use RC as a 64-bit carry, and
  as noted by Intel in their notes on mulx,
  `RA*RB+RC+RD` cannot overflow, so does not require
  setting an additional CA flag. We first cover the chain of
-RA*RB+RC as follows:
+`RA*RB+RC` as follows:
  
      RT0, RC0 = RA0 * RB0 + 0
            |
@@ -267,9 +267,9 @@ which are scalar initialisation:
  
      li r16, 0                     # zero accumulator
      addic r16, r16, 0             # CA to zero as well
-    sv.madde r0.v, r8.v, r17, r16 # mul vector 
+    sv.madde r0.v, r8.v, r17, r16 # mul vector
      sv.adde r24.v, r24.v, r0.v   # big-add row to result
-    
+
  Normally, in a Scalar ISA, the use of a register as both a source
  and destination like this would create costly Dependency Hazards, so
  such an instruction would never be proposed.  However: it turns out
@@ -317,7 +317,7 @@ Algorithm D performs estimates which, if wrong, are compensated for
  afterwards.  Essentially there are three phases:
  
  * Calculation of the quotient estimate. This uses a single
-  Scalar divide, which is covered separately in a later section 
+  Scalar divide, which is covered separately in a later section
  * Big Integer multiply and subtract.
  * Carry-Correction with a big integer add, if the estimate from
    phase 1 was wrong by one digit.
@@ -400,7 +400,7 @@ the digits are 32 bit and, special-casing the overflow, a 64/32 divide is suffic
  
  However when moving to 64-bit digits (desirable because the algorithm
  is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor.  Such an operation 
+from a *128* bit dividend and a 64-bit divisor.  Such an operation
  simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
  comes close with `divdeu`, by placing one operand in the upper half
  of a 128-bit dividend, the lower half is zero.  Again Power ISA
@@ -414,7 +414,7 @@ The irony is, therefore, that attempting to
  improve big-integer divide by moving to 64-bit digits in order to take
  advantage of the efficiency of 64-bit scalar multiply when Vectorised
  would instead
-lock up CPU time performing a 128/64 scalar division.  With the Vector 
+lock up CPU time performing a 128/64 scalar division.  With the Vector
  Multiply operations being critically dependent on that `qhat` estimate, and
  because that scalar is as an input into each of the vector digit
  multiples, as a Dependency Hazard it would cause *all* Parallel
@@ -458,14 +458,30 @@ Look closely at Algorithm D when the divisor is only a scalar
  Here, just as with `madded` which can put the hi-half of the 128 bit product
  back in as a form of 64-bit carry, a scalar divisor of a vector dividend
  puts the modulo back in as the hi-half of a 128/64-bit divide.
+
+    RT0      = ((  0<<64) | RA0) / RB0
+         RC0 = ((  0<<64) | RA0) % RB0
+          |
+          +-------+
+                  |
+    RT1      = ((RC0<<64) | RA1) / RB1
+         RC1 = ((RC0<<64) | RA1) % RB1
+          |
+          +-------+
+                  |
+    RT2      = ((RC1<<64) | RA2) / RB2
+         RC2 = ((RC1<<64) | RA2) % RB2
+
  By a nice coincidence this is exactly the same 128/64-bit operation
-needed for the `qhat` estimate if it may produce both the quotient and
-the remainder.
+needed (once, rather than chained) for the `qhat` estimate if it may
+produce both the quotient and the remainder.
+The pseudocode cleanly covering both scenarios (leaving out
+overflow for clarity) can be written as:
  
  `divrem2du RT,RA,RB,RC`
  
-     dividend = (RC) || (RB)
-     divisor = EXTZ128(RA) 
+     dividend = (RC) || (RA)
+     divisor = EXTZ128(RB)
       RT = UDIV(dividend, divisor)
       RS = UREM(dividend, divisor)
  
@@ -476,6 +492,18 @@ allows the instruction to perform full parallel vector div/mod,
  or act in loop-back mode for big-int division by a scalar,
  or for a single scalar 128/64 div/mod.
  
+Again, just as with `sv.madded` and `sv.adde`, adventurous implementors
+may perform massively-wide DIV/MOD by transparently merging (fusing)
+the Vector element operations together, only inputting a single RC and
+outputting the last RC. Where efficient algorithms such as Goldschmidt
+are deployed internally this could dramatically reduce the cycle completion
+time for massive Vector DIV/MOD.  Thus, just as with the other operations
+the apparent limitation of creating chains is overcome: SVP64 is,
+by design, an "expression of intent" where the implementor is free to
+achieve that intent in any way they see fit
+as long as strict precise-aware Program Order is
+preserved (even on the VL for-loops).
+
  Just as with `divdeu` on which this instruction is based an overflow
  detection is required.  When the divisor is too small compared to
  the dividend then the result may not fit into 64 bit.  Knuth's
@@ -485,8 +513,13 @@ in  `divrem2du`  a  `cmpl` instruction can be used instead to detect
  the overflow. This saves having to add an Rc=1 or OE=1 mode when
  the available space in VA-Form EXT04 is extremely limited.
  
-Looking closely at the loop however we can see that overflow 
-will not occur. The initial value k is zero, and on subsequent iterations
-new k, being the modulo, is always less than the divisor. Thus the
-condition (the loop invariant) `RC < RA` is preserved, as long as RC
-starts at zero.
+Looking closely at the loop however we can see that overflow
+will not occur. The initial value k is zero: as long as a divide-by-zero
+is not requested this always fulfils the condition `RC < RA`, and on
+subsequent iterations the new k, being the modulo, is always less than the
+divisor as well. Thus the condition (the loop invariant) `RC < RA`
+is preserved, as long as RC starts at zero.
+
+# Conclusion
+
+TODO
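
Note on the `divrem2du` hunks: the chained 128/64-bit divide and the no-overflow argument can be checked with a small model. The following is a minimal Python sketch, not the project's reference implementation. It follows the revised pseudocode in this diff (`dividend = (RC) || (RA)`, `divisor = EXTZ128(RB)`). The helper name `bigdiv_by_scalar`, the most-significant-digit-first ordering, and the choice to raise an exception on overflow are assumptions made purely for illustration.

```python
MASK64 = (1 << 64) - 1


def divrem2du(ra, rb, rc):
    """Model of the pseudocode: dividend = (RC) || (RA), divisor = RB.

    Returns (RT, RS), i.e. (quotient, remainder).
    Assumption: overflow is reported by raising an exception; the real
    instruction's overflow behaviour is not modelled here.
    """
    # The quotient fits in 64 bits exactly when RC < RB (and RB != 0).
    if rb == 0 or rc >= rb:
        raise OverflowError("quotient would not fit in 64 bits")
    dividend = (rc << 64) | ra      # 128-bit dividend, RC is the hi half
    return dividend // rb, dividend % rb


def bigdiv_by_scalar(digits, divisor):
    """Divide a big integer by a 64-bit scalar using chained divrem2du.

    digits: list of 64-bit digits, most significant first (assumed order).
    Returns (quotient digits, final remainder).
    """
    rc = 0                          # initial "carry" is zero
    quotient = []
    for ra in digits:
        rt, rc = divrem2du(ra, divisor, rc)
        # Loop invariant: the remainder is always less than the divisor,
        # so the next iteration cannot overflow.
        assert rc < divisor
        quotient.append(rt)
    return quotient, rc
```

Because each remainder is fed back as the hi half of the next dividend, the overflow check in `divrem2du` never fires inside the loop. Note that the sketch compares the remainder against the divisor, which is `RB` under the operand order used in the revised pseudocode.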