[[!tag svp64_cookbook]]

# Funding and Links

* Funded by NLnet NGI-ASSURE under EU grant agreement No 957073.
* <https://bugs.libre-soc.org/show_bug.cgi?id=965>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1007>
* <https://github.com/pts/chacha20>

# XChaCha20 SVP64 Implementation Analysis

This document shows how xchacha20's core loop - all 20 rounds - was
implemented in just 11 Vector instructions. An additional 9 instructions
are involved in establishing the REMAP Schedules (explained below);
when multiple blocks are processed, these 9 instructions do not need
to be issued again.

Firstly, we analyse the xchacha20 algorithm, showing which operations
are performed and in what order. Secondly, two innovative features
of SVP64 are described which are crucial to understanding Simple-V
Vectorization: Vertical-First Mode and Indexed REMAP. Then we show
how Indexed REMAP entirely eliminates the need for inline loop-unrolling,
noting that in this particular algorithm REMAP is only useful
in Vertical-First Mode.

## Description of the XChaCha20 Algorithm

We will first analyze the XChaCha20 algorithm as found in:

https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c

The function under inspection is `xchacha_hchacha20`. The main part
of the computation - the core algorithm - is just a for loop, which is
also the case in the `xchacha_encrypt_bytes` function.

Main loop for `xchacha_hchacha20`:

```
for (i = 0; i < 10; i++){
    QUARTERROUND(x0, x4, x8, x12);
    QUARTERROUND(x1, x5, x9, x13);
    QUARTERROUND(x2, x6, x10, x14);
    QUARTERROUND(x3, x7, x11, x15);
    QUARTERROUND(x0, x5, x10, x15);
    QUARTERROUND(x1, x6, x11, x12);
    QUARTERROUND(x2, x7, x8, x13);
    QUARTERROUND(x3, x4, x9, x14);
}

#define QUARTERROUND(a,b,c,d) \
    a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \
    a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);
```
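
For reference, here is a short Python model of the loop and macro above
(a sketch written for this article, not taken from the C source); it can
be used to sanity-check the unrolled listings that follow:

```
MASK32 = 0xffffffff

def rotl32(v, n):
    # 32-bit rotate-left
    return ((v << n) | (v >> (32 - n))) & MASK32

def quarterround(x, a, b, c, d):
    # x is a list of 16 32-bit words; a, b, c, d are indices into it
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 7)

def hchacha_core(x):
    # the 10-iteration double-round loop shown above (20 rounds total)
    for _ in range(10):
        quarterround(x, 0, 4, 8, 12)
        quarterround(x, 1, 5, 9, 13)
        quarterround(x, 2, 6, 10, 14)
        quarterround(x, 3, 7, 11, 15)
        quarterround(x, 0, 5, 10, 15)
        quarterround(x, 1, 6, 11, 12)
        quarterround(x, 2, 7, 8, 13)
        quarterround(x, 3, 4, 9, 14)
```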

We see that the loop is split into two groups of `QUARTERROUND` calls,
one with `step=4`:

```
QUARTERROUND(x0, x4, x8, x12);
QUARTERROUND(x1, x5, x9, x13);
QUARTERROUND(x2, x6, x10, x14);
QUARTERROUND(x3, x7, x11, x15);
```

and another with `step=5`:

```
QUARTERROUND(x0, x5, x10, x15);
QUARTERROUND(x1, x6, x11, x12);
QUARTERROUND(x2, x7, x8, x13);
QUARTERROUND(x3, x4, x9, x14);
```
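
These two groups follow the classic ChaCha20 "column then diagonal"
pattern. As a quick illustration (a sketch written for this article,
with a hypothetical helper name), the argument tuples can be generated
programmatically:

```
def quarterround_args():
    groups = []
    # group 1, "step=4": columns (i, i+4, i+8, i+12)
    groups.append([(i, i + 4, i + 8, i + 12) for i in range(4)])
    # group 2, "step=5": diagonals, where b, c, d wrap modulo 4 within their row
    groups.append([(i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4)
                   for i in range(4)])
    return groups

print(quarterround_args())
# [[(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)],
#  [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]]
```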

Let's start with the first group of `QUARTERROUND`s. Unrolling it
results in the following operations:

```
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 16);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 12);
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 8);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 7);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 16);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 12);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 8);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 7);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 16);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 12);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 8);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 7);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 16);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 12);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 8);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 7);
```

The second group of `QUARTERROUND`s, unrolled:

```
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 16);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 12);
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 8);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 7);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 16);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 12);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 8);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 7);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 16);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 12);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 8);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 7);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 16);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 12);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 8);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 7);
```

Let's list the additions only:

```
x0 = x0 + x4
x8 = x8 + x12
x0 = x0 + x4
x8 = x8 + x12
x1 = x1 + x5
x9 = x9 + x13
x1 = x1 + x5
x9 = x9 + x13
x2 = x2 + x6
x10 = x10 + x14
x2 = x2 + x6
x10 = x10 + x14
x3 = x3 + x7
x11 = x11 + x15
x3 = x3 + x7
x11 = x11 + x15
x0 = x0 + x5
x10 = x10 + x15
x0 = x0 + x5
x10 = x10 + x15
x1 = x1 + x6
x11 = x11 + x12
x1 = x1 + x6
x11 = x11 + x12
x2 = x2 + x7
x8 = x8 + x13
x2 = x2 + x7
x8 = x8 + x13
x3 = x3 + x4
x9 = x9 + x14
x3 = x3 + x4
x9 = x9 + x14
```

These are clearly regular patterns, and if there were a
system of "register offsets" available, it would be possible
to put a loop around one single add instead of unrolling 32
operations. This is where REMAP Indexing steps in.
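
To make the pattern explicit, here is a short sketch (written for this
article, with hypothetical helper names) that derives the 32 destination
offsets and 32 source offsets of the additions directly from the
`QUARTERROUND` structure; these are exactly the index tables used later
for the REMAP Schedules:

```
def add_schedules():
    cols  = [(i, i + 4, i + 8, i + 12) for i in range(4)]
    diags = [(i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4)
             for i in range(4)]
    rt_idx, ra_idx = [], []
    for (a, b, c, d) in cols + diags:
        # each QUARTERROUND contains four adds: a+=b, c+=d, a+=b, c+=d
        rt_idx += [a, c, a, c]
        ra_idx += [b, d, b, d]
    return rt_idx, ra_idx

rt_idx, ra_idx = add_schedules()
print(rt_idx[:8])   # [0, 8, 0, 8, 1, 9, 1, 9]
print(ra_idx[:8])   # [4, 12, 4, 12, 5, 13, 5, 13]
```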

## Introduction to REMAP Indexing

REMAP Indexing performs arbitrary re-positioning of elements.
In effect it makes it possible to perform "arbitrary addressing"
of registers at runtime.
Where normally any other Vector Processor would only be able to
perform a sequential element-level series of operations, and would
need a special re-ordering instruction if re-ordering of the elements
is required, SVP64 can *in-place* reorder elements on *any* instruction,
using the REMAP subsystem.

Most of the REMAP systems are simple fixed-hardware Deterministic
Schedules, but there is one that is general-purpose: Indexing. It
requires specifying a group of GPRs (or indices packed into GPRs)
that are to be used as the offsets.

This is a normal Simple-V operation:

```
for i in range(VL):
    GPR[RT+i] = OPERATION(GPR[RA+i])
```

This is what happens when REMAP is enabled with Indexing:

```
def REMAP(SVSHAPE, i):
    return GPR(SVSHAPE.GPR + i)

for i in range(VL):
    idx_rt = REMAP(SVSHAPE0, i)
    idx_ra = REMAP(SVSHAPE1, i)
    GPR[RT+idx_rt] = OPERATION(GPR[RA+idx_ra])
```

In this way we can literally jump about, pretty much anywhere in
the register file, according to a Schedule that is determined by
the programmer. Therefore, if we can put all of the chacha20
round intermediary data into an array of registers, and can
analyse the *order* in which add-operations, xor-operations
and rotate-operations occur, it might just be possible to
eliminate **all** loop-unrolled inline assembler, replacing it
with three instructions and appropriate Indexed REMAP Schedules!
It turns out that this is indeed possible.
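
As a tiny worked example (a sketch for this article, with made-up
index values), this is the effect Indexed REMAP has on a single
Vectorized operation:

```
VL = 4
GPR = {i: i * 10 for i in range(32)}   # dummy register file contents
RT, RA = 8, 8                          # both operands based at GPR 8
idx_rt = [3, 0, 2, 1]                  # SVSHAPE0: destination offsets
idx_ra = [1, 2, 0, 3]                  # SVSHAPE1: source offsets

for i in range(VL):
    GPR[RT + idx_rt[i]] = GPR[RA + idx_ra[i]] + 1
    print("GPR[%d] = GPR[%d] + 1" % (RT + idx_rt[i], RA + idx_ra[i]))
# GPR[11] = GPR[9] + 1
# GPR[8] = GPR[10] + 1
# GPR[10] = GPR[8] + 1
# GPR[9] = GPR[11] + 1
```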

## Introduction to Vertical-First Mode

We're going to use Vertical-First mode (VF) to implement this, so we
will first give a short introduction to VF. In normal Horizontal Vector
mode, or even in traditional SIMD mode, the instruction is executed
across a (Horizontal) Vector. So, if we have, for example `VL=8`, then
the instruction

```
sv.add *RT, *RA, *RB  # RT = RA + RB
```

will be executed on all elements of the vector, **before** moving to
the next assembly instruction. This behaviour changes in Vertical-First
mode. In VF mode, the instruction is executed on a **single** element of
the vector and then moves to the next assembly instruction. It continues
to do so until execution reaches the `svstep` instruction, at which point
processing moves to the next element/register in the vector. Branching to
the beginning of the loop does not occur automatically though; a branch
instruction has to be added manually.
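
In pseudocode (a sketch written for this article, not the formal
specification), the difference between the two modes is simply the
nesting order of the two loops:

```
# Horizontal-First: each instruction completes across all VL elements
# before the next instruction starts.
def horizontal_first(instructions, VL):
    for op in instructions:
        for i in range(VL):
            op(i)

# Vertical-First: each instruction acts on a *single* element; svstep
# advances the element index, and a branch closes the loop manually.
def vertical_first(instructions, VL):
    i = 0                       # element counter (held in SVSTATE)
    while True:
        for op in instructions:
            op(i)               # each op touches only element i
        i += 1                  # svstep.
        if i == VL:             # svstep. Rc=1 signals loop-end in CR0
            break               # fall out instead of branching back
```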

The reason why Vertical-First is needed should be clear
from xchacha20: there are ordering dependencies between the three
operations `add, xor, rotate`. It is not okay to perform the entire
suite of Vector-adds then move on to the Vector-xors then the Vector-rotates:
they have to be interleaved so as to respect the element-level ordering.
This is exactly what Vertical-First allows us to do:
element 0 add, element 0 xor, element 0 rotate, then element **1**
add, element **1** xor, element **1** rotate and so on. With Vertical-First
*combined* with Indexed REMAP we can literally jump those operations around,
anywhere within the Vector.
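
Concretely for xchacha20 (a sketch for this article, using the first few
entries of the index Schedules derived in the next section), the
Vertical-First issue order per element is:

```
shape0 = [0, 8, 0, 8]      # add destination (and one source)
shape1 = [4, 12, 4, 12]    # add second source
shape2 = [12, 4, 12, 4]    # xor/rotate destination and source
shifts = [16, 12, 8, 7]    # rotate amounts, cycling modulo 4

for i in range(4):
    print("x%d += x%d" % (shape0[i], shape1[i]))
    print("x%d ^= x%d" % (shape2[i], shape0[i]))
    print("x%d = ROTATE(x%d, %d)" % (shape2[i], shape2[i], shifts[i % 4]))
# element 0 prints: x0 += x4;  x12 ^= x0;  x12 = ROTATE(x12, 16)
# element 1 prints: x8 += x12; x4 ^= x8;   x4 = ROTATE(x4, 12)
# ... exactly the first lines of QUARTERROUND(x0, x4, x8, x12)
```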

## Application of VF mode in the XChaCha20 loop

Let's assume the values `x` are stored in the registers 24-36:

| GPR # | | | | |
|--------|-----|-----|-----|-----|
| 24 | x0 | x1 | x2 | x3 |
| 28 | x4 | x5 | x6 | x7 |
| 32 | x8 | x9 | x10 | x11 |
| 36 | x12 | x13 | x14 | x15 |

So for the addition in Vertical-First mode, the `RT` (and `RB`, as they
are the same) indices are (in terms of x):

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 0 | 8 | 0 | 8 | 1 | 9 | 1 | 9 |
| 2 | 10 | 2 | 10 | 3 | 11 | 3 | 11 |
| 0 | 10 | 0 | 10 | 1 | 11 | 1 | 11 |
| 2 | 8 | 2 | 8 | 3 | 9 | 3 | 9 |

However, since the indices are small values, using a whole 64-bit
register for a single index value is a waste, so we will compress them,
8 indices to a 64-bit register.
The `RT` indices will therefore fit inside these 4 registers (in Little
Endian format):

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |
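
These values can be reproduced with a short packing sketch (written for
this article):

```
# pack 8 byte-sized indices into one 64-bit value, Little-Endian,
# byte 0 holding the first index
def pack8(indices):
    assert len(indices) == 8 and all(0 <= v < 256 for v in indices)
    val = 0
    for byte, idx in enumerate(indices):
        val |= idx << (8 * byte)
    return val

print(hex(pack8([0, 8, 0, 8, 1, 9, 1, 9])))      # 0x901090108000800
print(hex(pack8([2, 10, 2, 10, 3, 11, 3, 11])))  # 0xb030b030a020a02
```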

Similarly we find the `RA` indices:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 4 | 12 | 4 | 12 | 5 | 13 | 5 | 13 |
| 6 | 14 | 6 | 14 | 7 | 15 | 7 | 15 |
| 5 | 15 | 5 | 15 | 6 | 12 | 6 | 12 |
| 7 | 13 | 7 | 13 | 4 | 14 | 4 | 14 |

Using a similar method, we find the final 4 registers with the `RA` indices:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |

Now, we can construct the Vertical-First loop:

```
svindex 4, 0, 1, 3, 0, 1, 0     # SVSHAPE0, add RT/RB indices
svindex 6, 1, 1, 3, 0, 1, 0     # SVSHAPE1, add RA indices
setvl 0, 0, 32, 1, 1, 1         # MAXVL=VL=32, VF=1
svremap 31, 1, 0, 0, 0, 0, 0    # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x          # RT, RB: SHAPE0. RA: SHAPE1
svstep. 16, 1, 0                # step to next in-regs element
```

What this code snippet does is the following:

The first instruction

```
svindex 4, 0, 1, 3, 0, 1, 0
```

loads the add `RT` indices into `SVSHAPE0`, starting at GPR #8. You will
note that 4 is listed, but that is because the field only holds even
register numbers (to save a bit in the encoding), so the listed number
has to be doubled to get the actual starting register. So, the `SVSHAPE0`
indices will be located in GPRs 8-11. The number 3 indicates that the
elements will be 8-bit wide (0=64-bit, 1=32-bit, 2=16-bit, 3=8-bit).
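
The arithmetic behind that register footprint (a sketch for this
article, with a hypothetical helper name): 32 indices of 8 bits each
occupy 4 64-bit GPRs, so starting at GPR #8 they live in GPRs 8-11,
leaving GPR #12 free for the next set of indices.

```
def index_gprs(start_gpr, num_indices, elwidth_bits):
    # how many 64-bit GPRs are needed to hold the packed indices
    total_bytes = num_indices * elwidth_bits // 8
    num_gprs = (total_bytes + 7) // 8
    return list(range(start_gpr, start_gpr + num_gprs))

print(index_gprs(8, 32, 8))    # [8, 9, 10, 11]   SVSHAPE0
print(index_gprs(12, 32, 8))   # [12, 13, 14, 15] SVSHAPE1
```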

The next instruction

```
svindex 6, 1, 1, 3, 0, 1, 0
```

loads the add `RA` indices into `SVSHAPE1`. Again, even though we list 6,
the indices will actually be loaded starting at GPR #12, and again the
use of 8-bit elements is denoted.
Next, the `setvl` instruction:

```
setvl 0, 0, 32, 1, 1, 1
```

We have to call `setvl` to set `MAXVL` and `VL` to 32 and also configure
Vertical-First mode. Afterwards, we have to specify how we intend
to use the indices, and we do this using `svremap`.

```
svremap 31, 1, 0, 0, 0, 0, 0
```

`svremap` basically instructs the scheduler to use `SVSHAPE0` for `RT` and `RB`,
and `SVSHAPE1` for `RA`. The next instruction performs the **actual** addition:

```
sv.add/w=32 *x, *x, *x
```

Note the `/w=32` suffix. This instructs the adder to perform the operation
on elements of `w=32` bits. Since the Power CPU is a 64-bit CPU, this means
that two 32-bit elements are packed in each register. Also note that all
three operands are given as `*x`. This instructs the scheduler to act on
the registers as a vector, or a sequence of elements. But even though the
operands look identical, the element indices will be taken from the
`SVSHAPE0`/`SVSHAPE1` Schedules as defined previously. Also
note that the indices are relative to the actual register used. So,
if `*x` starts at GPR 24 for example, in essence this instruction will
issue the following sequence of operations:

```
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
...
```

Finally, the `svstep.` instruction steps to the next set of indices.

We have shown how to do the additions in Vertical-First mode. Now
let's add the rest of the instructions in the `QUARTERROUND`s. The
`XOR` operations of both `QUARTERROUND` groups (of the form
`d = XOR(d, a)` and `b = XOR(b, c)`) are:

```
x12 = x12 ^ x0
x4 = x4 ^ x8
x12 = x12 ^ x0
x4 = x4 ^ x8
x13 = x13 ^ x1
x5 = x5 ^ x9
x13 = x13 ^ x1
x5 = x5 ^ x9
x14 = x14 ^ x2
x6 = x6 ^ x10
x14 = x14 ^ x2
x6 = x6 ^ x10
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x0
x5 = x5 ^ x10
x15 = x15 ^ x0
x5 = x5 ^ x10
x12 = x12 ^ x1
x6 = x6 ^ x11
x12 = x12 ^ x1
x6 = x6 ^ x11
x13 = x13 ^ x2
x7 = x7 ^ x8
x13 = x13 ^ x2
x7 = x7 ^ x8
x14 = x14 ^ x3
x4 = x4 ^ x9
x14 = x14 ^ x3
x4 = x4 ^ x9
```

We will need to create another set of indices for the `XOR` instructions.
We will only need one set, as the other set of indices is the same as the
`RT` set for `sv.add` (`SHAPE0`). So, the `XOR` destination/source
(`RA`/`RS`) indices are the following:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 12 | 4 | 12 | 4 | 13 | 5 | 13 | 5 |
| 14 | 6 | 14 | 6 | 15 | 7 | 15 | 7 |
| 15 | 5 | 15 | 5 | 12 | 6 | 12 | 6 |
| 13 | 7 | 13 | 7 | 14 | 4 | 14 | 4 |

Again, we find the packed index values:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |

The next operation is the `ROTATE`, which takes as operands the result of
the `XOR` and a shift argument. You can easily see that the indices used in
this case are the same as those of the `XOR`. However, the shift values
cycle every 4 elements: 16, 12, 8, 7. For the indices we can again use
`svindex`, like this:

```
svindex 8, 2, 1, 3, 0, 1, 0
```

This again means `SVSHAPE2`, operating on 8-bit elements, starting
from GPR #16 (`8*2`). For the shift values cycling every 4 elements,
the `svshape2` instruction will be used:

```
svshape2 0, 0, 3, 4, 0, 1
```

This will create `SVSHAPE3`, which applies a modulo 4 to all of its
element indices. Now we can list both the `XOR` and `ROTATE` instructions
in assembly, together with the respective `svremap` instructions:

```
svremap 31, 2, 0, 2, 2, 0, 0  # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0  # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
```

So, in a similar fashion, we instruct `XOR` (`sv.xor`) to use `SVSHAPE2` for
`RA` and `RS` and `SVSHAPE0` for `RB`, again on 32-bit elements, while `ROTATE`
(`sv.rldcl`) will also use `SVSHAPE2` for `RA` and `RS`, but `SVSHAPE3` for `RB`
(the shift values, which cycle every 4 elements). Note that the actual
shift values indexed through `SVSHAPE3` have to be stored as 32-bit elements:

| | | |
|---------|--------------------|--------------------|
| SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |
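
Per element, the effect of the remapped `sv.rldcl/w=32` is therefore
(a sketch for this article; `shape2` holds the `SVSHAPE2` indices
listed earlier):

```
MASK32 = 0xffffffff
SHIFTS = [16, 12, 8, 7]         # stored as 32-bit elements, as above

def rotl32(v, n):
    return ((v << n) | (v >> (32 - n))) & MASK32

def remapped_rotate(x, shape2, i):
    # RA and RS both use SVSHAPE2; RB uses SVSHAPE3 (i modulo 4)
    x[shape2[i]] = rotl32(x[shape2[i]], SHIFTS[i % 4])
```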

The complete algorithm for a loop with 10 iterations is as follows:

```
li 7, 10                        # Load value 10 into GPR #7
mtctr 7                         # Set up counter on GPR #7

# set up VL=32 vertical-first, and SVSHAPEs 0-2
setvl 0, 0, 32, 1, 1, 1
# SHAPE0, used by sv.add starts at GPR #8
svindex 8/2, 0, 1, 3, 0, 1, 0   # SVSHAPE0, a
# SHAPE1, used by sv.xor starts at GPR #12
svindex 12/2, 1, 1, 3, 0, 1, 0  # SVSHAPE1, b
# SHAPE2, used by sv.rldcl starts at GPR #16
svindex 16/2, 2, 1, 3, 0, 1, 0  # SVSHAPE2, c
# SHAPE3, used by sv.rldcl to index the shift values, which start at GPR #20
# The inner loop will do 32 iterations, but there are only
# 4 shift values, so we mod by 4, and can cycle through them
svshape2 0, 0, 3, 4, 0, 1       # SVSHAPE3, shift amount, mod4
.outer:
# outer loop begins here (standard CTR loop)
setvl 0, 0, 32, 1, 1, 1         # MAXVL=VL=32, VF=1
# inner loop begins here. add-xor-rotl32 with remap, step, branch
.inner:
svremap 31, 1, 0, 0, 0, 0, 0    # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x
svremap 31, 2, 0, 2, 2, 0, 0    # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0    # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
# GPR #7 holds the result of svstep.: if GPR #16 were used instead it
# would overlap with SHAPE2 (which also starts at 16) and the first 8
# indices would get corrupted.
svstep. 7, 1, 0                 # step to next in-regs element
bc 6, 3, .inner                 # svstep. Rc=1 loop-end-condition?
# inner-loop done: outer loop standard CTR-decrement to setvl again
bdnz .outer                     # Loop until CTR is zero
```
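
As a closing cross-check (a Python sketch written for this article, not
part of the assembler listing), the three remapped operations can be
simulated with the index Schedules above and compared against a direct
`QUARTERROUND` reference; after all 10 outer iterations both produce
identical state:

```
MASK32 = 0xffffffff

def rotl32(v, n):
    return ((v << n) | (v >> (32 - n))) & MASK32

SHAPE0 = [0, 8, 0, 8, 1, 9, 1, 9, 2, 10, 2, 10, 3, 11, 3, 11,
          0, 10, 0, 10, 1, 11, 1, 11, 2, 8, 2, 8, 3, 9, 3, 9]
SHAPE1 = [4, 12, 4, 12, 5, 13, 5, 13, 6, 14, 6, 14, 7, 15, 7, 15,
          5, 15, 5, 15, 6, 12, 6, 12, 7, 13, 7, 13, 4, 14, 4, 14]
SHAPE2 = [12, 4, 12, 4, 13, 5, 13, 5, 14, 6, 14, 6, 15, 7, 15, 7,
          15, 5, 15, 5, 12, 6, 12, 6, 13, 7, 13, 7, 14, 4, 14, 4]
SHIFTS = [16, 12, 8, 7]

def remapped_rounds(x):
    for _ in range(10):                      # outer CTR loop
        for i in range(32):                  # inner Vertical-First loop
            x[SHAPE0[i]] = (x[SHAPE0[i]] + x[SHAPE1[i]]) & MASK32  # sv.add
            x[SHAPE2[i]] ^= x[SHAPE0[i]]                           # sv.xor
            x[SHAPE2[i]] = rotl32(x[SHAPE2[i]], SHIFTS[i % 4])     # sv.rldcl

def quarterround(x, a, b, c, d):
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 7)

def reference_rounds(x):
    for _ in range(10):
        for a, b, c, d in [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14),
                           (3, 7, 11, 15), (0, 5, 10, 15), (1, 6, 11, 12),
                           (2, 7, 8, 13), (3, 4, 9, 14)]:
            quarterround(x, a, b, c, d)

state = [(0x12345678 * (i + 1)) & MASK32 for i in range(16)]
st1, st2 = list(state), list(state)
remapped_rounds(st1)
reference_rounds(st2)
assert st1 == st2
```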