\documentclass[slidestop]{beamer}
\usepackage{beamerthemesplit}

\title{Big Integer Arithmetic Instruction design}
\author{Luke Kenneth Casson Leighton}
\huge{Big Integer Arithmetic Instruction design}\\
\Large{An analysis of big-integer arithmetic instructions}\\
\Large{(why not to put all eggs in Custom Silicon basket)}\\
\Large{Silicon Salon 2023}\\
\large{Sponsored by NLnet's Assure Programme}\\
\frame{\frametitle{Who are we?}
\item Libre-SOC: a fully Libre Project with the goal of creating
      a Hybrid 3D CPU-VPU-GPU, including designing a powerful Vector
      Extension (for the Power ISA). https://libre-soc.org
\item RED Semiconductor Ltd: a commercial realisation of Libre-SOC
      designs. https://redsemiconductor.com
\item Libre-SOC researches and designs instructions that are then
      proposed to the OpenPOWER Foundation ISA Technical Workgroup;
      RED Semiconductor (as an OPF ISA WG Voting Member) then keeps
\item RED Semiconductor Ltd seeks VC funding and commercial business
      propositions; Libre-SOC covers Research.
\frame{\frametitle{What are the challenges faced by Big-Integer?}
\item Algorithms, especially post-quantum, are now fast-moving. This
      does not go down well! It typically takes 5-10 years for an
      algorithm to become "trustable".
\item Custom Cryptographic Hardware will typically take 3 years from
      design concept to first production silicon; Certification takes
      even longer. If a fault is found in the algorithm, the entire
      investment is wasted.
\item Performance on 32-bit and 64-bit Embedded Hardware sucks: algorithms
      are roughly O(N\textsuperscript{2}), which wreaks havoc. The
      temptation therefore is to add SIMD instructions or dedicated
      "custom" instructions, which makes the problem worse.
\item So how can these polar opposites be reconciled?
\begin{frame}[fragile]\frametitle{Go back to the algorithms.}
\item https://libre-soc.org/openpower/sv/biginteger/analysis/
\item Starting with Knuth's Algorithms D and M: if a True-Scalable
      Vector ISA can cope with those, chances are good it will cope
      with more (Karatsuba, and so on).
\item SVP64 has "looping" as a primary construct: \\
      loop i 0..VL-1: GPR(RT+i) = ADD(GPR(RA+i), GPR(RB+i))\\
\item If however Carry-in and Carry-out are included in that, we
      have arbitrary-length Big-Integer Vector Add!
\item For all other operations, as long as Scalar-Vector is ok,
      it turns out to be possible to do 64-bit carry-in and
      64-bit carry-out, without significant hardware disruption.
\item Irony: all relevant Scalar instructions (shift, mul, div)
      usually drop 1/2 the result on the floor!
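The looping construct can be sketched in C (a minimal model of mine, not SVP64 reference code; the GPR file is modelled as a plain array):

```c
#include <stdint.h>

/* Minimal C model (illustrative, not spec code) of:
 *   loop i 0..VL-1: GPR(RT+i) = ADD(GPR(RA+i), GPR(RB+i))
 * rt/ra/rb are register numbers, vl is the Vector Length. */
void sv_add(uint64_t gpr[], int rt, int ra, int rb, int vl)
{
    for (int i = 0; i < vl; i++)
        gpr[rt + i] = gpr[ra + i] + gpr[rb + i];
}
```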
\begin{frame}[fragile]\frametitle{Turning add-with-carry into Vector-Add}
\item Add-with-Carry is the building-block of larger operations
\item Let's simply chain them together.
\item sv.adde (Add-Carry with Vector loop) creates chains

R0,CA = A0+B0+CA    adde r0,a0,b0
R1,CA = A1+B1+CA    adde r1,a1,b1
R2,CA = A2+B2+CA    adde r2,a2,b2
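What the sv.adde chain computes is an arbitrary-length big-integer add. A C sketch of that computation (function name mine; the 1-bit CA is modelled as a variable flowing from digit to digit):

```c
#include <stdint.h>

/* What a chained sv.adde loop computes: add two n-digit big integers
 * (least-significant 64-bit digit first), with the 1-bit carry CA
 * chained from one element to the next. */
void bigadd(uint64_t r[], const uint64_t a[], const uint64_t b[], int n)
{
    uint64_t ca = 0;               /* CA: carry chained between adds */
    for (int i = 0; i < n; i++) {
        uint64_t t = a[i] + ca;    /* add the incoming carry first   */
        ca = (t < ca);             /* did a[i] + ca wrap around?     */
        r[i] = t + b[i];
        ca += (r[i] < t);          /* did the main add wrap around?  */
    }
}
```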
\begin{frame}[fragile]\frametitle{Vector-Scalar Shift}
\item Shift by 64 bits is just "pick a register"
\item Add a 2nd input register with what needs to be shifted IN\\
\item Add a 2nd output saving what normally gets thrown away\\
\item Again: a chain of these performs Vector-by-Scalar shift

void brs(uint64_t s, uint64_t r[], uint64_t un[], int n)
{
    // right-shift an n-digit big integer by s bits (1 <= s <= 63)
    for (int i = 0; i < n - 1; i++)
        r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    r[n - 1] = un[n - 1] >> s;
}
\begin{frame}[fragile]\frametitle{Vector-Scalar Multiply}
\item Normally in MAC the top 64 bits are thrown away.
\item What if we stored those 64 bits in a 2nd register?\\
\item And what if the next MAC added that "digit" on?\\
\item Again: a chain of these performs Vector-by-Scalar Multiply

RT0, RC0 = RA0 * RB0 + 0
RT1, RC1 = RA1 * RB1 + RC0
RT2, RC2 = RA2 * RB2 + RC1
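A C sketch of the chained MAC (mine, not from the slides; it uses the GCC/Clang unsigned __int128 extension for the 128-bit product, and takes the multiplier b as the scalar operand):

```c
#include <stdint.h>

/* Chained 64x64+64 MAC: multiply an n-digit big integer un[] by a
 * 64-bit scalar b.  Each step's high half (the RC "digit") becomes
 * the next step's addend, exactly as in the RT/RC chain above.
 * rt[] must have room for n+1 digits. */
void bigmul(uint64_t rt[], const uint64_t un[], uint64_t b, int n)
{
    uint64_t rc = 0;                    /* carry "digit", starts at 0   */
    for (int i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)un[i] * b + rc;
        rt[i] = (uint64_t)p;            /* low 64 bits: result digit    */
        rc = (uint64_t)(p >> 64);       /* high 64 bits: next carry     */
    }
    rt[n] = rc;                         /* final carry is the top digit */
}
```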
\begin{frame}[fragile]\frametitle{Vector-Scalar Divide}
\item Same story: a special-case is needed for overflow.

RT0 = ((0<<64) | RA0) / RB0
RC0 = ((0<<64) | RA0) % RB0
RT1 = ((RC0<<64) | RA1) / RB1
RC1 = ((RC0<<64) | RA1) % RB1
RT2 = ((RC1<<64) | RA2) / RB2
RC2 = ((RC1<<64) | RA2) % RB2
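The chained divide can be sketched the same way (mine; most-significant digit first, matching the RT0..RT2 order above; since each RC is a remainder it is always less than the divisor, so every quotient digit fits in 64 bits):

```c
#include <stdint.h>

/* Chained 128/64 divide: divide an n-digit big integer un[]
 * (most-significant digit first) by the 64-bit scalar d.  The
 * remainder RC chains into the high half of the next dividend.
 * Uses the GCC/Clang unsigned __int128 extension. */
uint64_t bigdiv(uint64_t q[], const uint64_t un[], uint64_t d, int n)
{
    uint64_t rc = 0;                    /* running remainder */
    for (int i = 0; i < n; i++) {
        unsigned __int128 num = ((unsigned __int128)rc << 64) | un[i];
        q[i] = (uint64_t)(num / d);
        rc   = (uint64_t)(num % d);
    }
    return rc;                          /* final remainder */
}
```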
\frame{\frametitle{Summary so far}
\item Extending the usual 1-bit Carry-in/Carry-out to 64-bit and
      adding a loop-construct inherently turns Scalar operations
      into arbitrary-length Vectorised ones
\item Irony: 30 years ago the Power ISA actually had a "Carry SPR", where
      the normally-discarded upper half of a multiply would be placed in
      that SPR (it was deprecated).
\item Hardware is NOT made more complex: in all shift, multiply
      and divide operations these bits are computed anyway. Other ISAs
      discard them, which is why you end up with complex carry
      workarounds, giving ISAs a "bad rep" for doing Big-int.
\item The "complication" is that you need 3-in 2-out instructions,
      but in Micro-code you can do operand-forwarding:
      1st op: 3-in 1-out; chain: 2-in 1-out; last: 2-in 2-out.
\frame{\frametitle{OpenTITAN}
\item https://opentitan.org/book/hw/ip/otbn/index.html
\item 256-bit-wide data path with 32 256-bit-wide registers
\item Zero-Overhead Loop Control would have been better\\
      https://ieeexplore.ieee.org/abstract/document/1692906/
\item Formal verification completion time is a function of the operation
      bit-width: 256-bit is unlikely to complete in reasonable time.
\item 256-bit is great for Curve25519, but for RSA (etc.) you run
      into exactly the same problem as a Scalar ISA, made worse.
\item Opportunities to optimise algorithms are lost (efficient
      power-optimised Karatsuba, etc.)
\begin{frame}[fragile]\frametitle{OpenTITAN shift}
\item Immediate-only: what about shift-by-register?
\item Merges 2 operands, still not chainable.
\item Needs a copy of the vector input (doubles the number of regs)
\item Needs a massive 256-bit shifter: 8 layers of muxes!

result = (((a << 256) | b) >> imm) & ((1 << 256) - 1)
\begin{frame}[fragile]\frametitle{Draft Double-Shift}
\item Remarkably similar to x86 shld
\item Does not need a 128-bit ROT: a simple mod to existing hardware
\item Hardware may macro-op fuse Vector-shift for better efficiency
\item Chainable and in-place (no copy of the vector needed).

n <- (RB)[58:63] # Power ISA MSB0 numbering. sigh
v <- ROTL64((RA), n)
mask <- MASK(0, 63-n)
RT <- (v[0:63] & mask) | ((RC) & ~mask)
RS <- v[0:63] & ~mask
\frame{\frametitle{Conclusion}
\item We went back to the algorithms (Knuth's D and M) and examined
      what they are trying to achieve.
\item It turns out they need a 64-bit carry-in and carry-out.
\item Keeping to 64-bit maximum hardware means Formal Proofs complete
      in reasonable time (less than the heat-death of the universe)
\item Reasonably straightforward: creates and uses partial results
      normally thrown away (whose recomputation needs more instructions)
\item Freaks out pure-RISC proponents (3-in 2-out), but look at the
      number of instructions (and temporary registers) needed
      otherwise, and at the overall algorithm efficiency, and the
      case for these instructions is clear.
\item They also speed up \textbf{general-purpose} code
Questions?\\
\vspace{5pt}
\item https://redsemiconductor.com
\item Discussion: http://lists.libre-soc.org
\item Libera.Chat IRC \#libre-soc
\item http://libre-soc.org/
\item http://nlnet.nl/assure
\item https://libre-soc.org/nlnet/\#faq