1 [[!tag svp64_cookbook]]
2
3 # XChaCha20 SVP64 Implementation Analysis
4
5 This document shows how xchacha20's core loop - all 20 rounds - was
6 implemented in just 11 Vector instructions. An additional 9 instructions
7 are involved in establishing the REMAP Schedules (explained below);
8 when multiple blocks are processed, these 9 instructions do not
9 need to be executed again.
10
11 Firstly, we analyse the xchacha20 algorithm, showing what operations
12 are performed and in what order. Secondly, two innovative features
13 of SVP64 are described which are crucial to an understanding of Simple-V
14 Vectorization: Vertical-First Mode and Indexed REMAP. Then we show
15 how Indexed REMAP entirely eliminates the need for inline loop-unrolling,
16 but note that in this particular algorithm REMAP is only useful for
17 us in Vertical-First Mode.
18
19 ## Description of XChacha20 Algorithm
20
21 We will first try to analyze the XChacha20 algorithm as found in:
22
23 https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c
24
25 The function under inspection is `xchacha_hchacha20`. We notice
26 that the main part of the computation, the core algorithm, is
27 just a for loop - which is also the case in the `xchacha_encrypt_bytes`
28 function.
29
30 Main loop for `xchacha_hchacha20`:
31
32 ```
33 for (i = 0; i < 10; i++){
34 QUARTERROUND(x0, x4, x8, x12);
35 QUARTERROUND(x1, x5, x9, x13);
36 QUARTERROUND(x2, x6, x10, x14);
37 QUARTERROUND(x3, x7, x11, x15);
38 QUARTERROUND(x0, x5, x10, x15);
39 QUARTERROUND(x1, x6, x11, x12);
40 QUARTERROUND(x2, x7, x8, x13);
41 QUARTERROUND(x3, x4, x9, x14);
42 }
43
44 #define QUARTERROUND(a,b,c,d) \
45 a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \
46 c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \
47 a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \
48 c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);
49 ```
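
To make the quarter-round concrete, here is a minimal Python sketch of it
(names such as `rotl32` and `quarterround` are illustrative, not from the
original source; `ROTATE` is assumed to be a 32-bit left-rotate):

```
def rotl32(v, n):
    # 32-bit left rotation
    return ((v << n) | (v >> (32 - n))) & 0xffffffff

def quarterround(x, a, b, c, d):
    # operates in-place on a 16-entry state list x
    x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 7)
```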
50
51 We see that the loop is split into two groups of `QUARTERROUND` calls,
52 one with `step=4`:
53
54 ```
55 QUARTERROUND(x0, x4, x8, x12);
56 QUARTERROUND(x1, x5, x9, x13);
57 QUARTERROUND(x2, x6, x10, x14);
58 QUARTERROUND(x3, x7, x11, x15);
59 ```
60
61 and another with `step=5`:
62
63 ```
64 QUARTERROUND(x0, x5, x10, x15);
65 QUARTERROUND(x1, x6, x11, x12);
66 QUARTERROUND(x2, x7, x8, x13);
67 QUARTERROUND(x3, x4, x9, x14);
68 ```
69
70 Let's start with the first group of `QUARTERROUND`s. Unrolling it
71 essentially results in the following operations:
72
73 ```
74 x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 16);
75 x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 12);
76 x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 8);
77 x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 7);
78 x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 16);
79 x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 12);
80 x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 8);
81 x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 7);
82 x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 16);
83 x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 12);
84 x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 8);
85 x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 7);
86 x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 16);
87 x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 12);
88 x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 8);
89 x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 7);
90 ```
91
92 Second group of `QUARTERROUND`s, unrolled:
93
94 ```
95 x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 16);
96 x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 12);
97 x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 8);
98 x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 7);
99 x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 16);
100 x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 12);
101 x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 8);
102 x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 7);
103 x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 16);
104 x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 12);
105 x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 8);
106 x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 7);
107 x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 16);
108 x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 12);
109 x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 8);
110 x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 7);
111 ```
112
113 Let's list the additions only:
114
115 ```
116 x0 = x0 + x4
117 x8 = x8 + x12
118 x0 = x0 + x4
119 x8 = x8 + x12
120 x1 = x1 + x5
121 x9 = x9 + x13
122 x1 = x1 + x5
123 x9 = x9 + x13
124 x2 = x2 + x6
125 x10 = x10 + x14
126 x2 = x2 + x6
127 x10 = x10 + x14
128 x3 = x3 + x7
129 x11 = x11 + x15
130 x3 = x3 + x7
131 x11 = x11 + x15
132 x0 = x0 + x5
133 x10 = x10 + x15
134 x0 = x0 + x5
135 x10 = x10 + x15
136 x1 = x1 + x6
137 x11 = x11 + x12
138 x1 = x1 + x6
139 x11 = x11 + x12
140 x2 = x2 + x7
141 x8 = x8 + x13
142 x2 = x2 + x7
143 x8 = x8 + x13
144 x3 = x3 + x4
145 x9 = x9 + x14
146 x3 = x3 + x4
147 x9 = x9 + x14
148 ```
149
150 ## Introduction to REMAP Indexing
151
152 REMAP Indexing performs any arbitrary re-positioning of elements.
153 Where any other Vector Processor would normally only be able to do a
154 sequential element-level series of operations, and would need a special
155 re-ordering instruction if re-ordering of the elements is required,
156 SVP64 can reorder elements *in-place* on *any* instruction, using
157 the REMAP subsystem.
158
159 Most of the REMAP systems are simple fixed-hardware Deterministic
160 Schedules, but there is one that is general-purpose: Indexing. It
161 requires specifying a group of GPRs (or indices packed into GPRs)
162 that are to be used as the offsets.
163
164 This is a normal Simple-V operation:
165
166 ```
167 for i in range(VL):
168 GPR[RT+i] = OPERATION(GPR[RA+i])
169 ```
170
171 This is what happens when REMAP is enabled with Indexing:
172
173 ```
174 def REMAP(SVSHAPE, i):
175 return GPR(SVSHAPE.GPR + i)
176 for i in range(VL):
177 idx_rt = REMAP(SVSHAPE0, i)
178 idx_ra = REMAP(SVSHAPE1, i)
179 GPR[RT+idx_rt] = OPERATION(GPR[RA+idx_ra])
180 ```
181
182 In this way we can literally jump about, pretty much anywhere in
183 the register file, according to a Schedule that is determined by
184 the programmer. Therefore, if we can put all of the chacha20
185 round intermediary data into an array of registers, and can
186 analyse the *order* in which add-operations, xor-operations
187 and rotate-operations occur, it might just be possible to
188 eliminate **all** loop-unrolled inline assembler, replacing it
189 with three instructions and appropriate Indexed REMAP Schedules!
190 It turns out that this is indeed possible.
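
As a toy illustration of the principle (Python, not the actual hardware
pseudocode; the register contents and index values here are made up), an
operation can be applied across a register file in whatever order two
programmer-supplied index lists dictate:

```
# toy register file: 8 registers; sources at 0-3, destinations at 4-7
regs = [10, 20, 30, 40, 0, 0, 0, 0]

# programmer-chosen Schedules (in SVP64 these would be packed into GPRs)
SVSHAPE0 = [3, 1, 2, 0]   # destination offsets, applied to RT
SVSHAPE1 = [0, 2, 1, 3]   # source offsets, applied to RA

RT, RA, VL = 4, 0, 4
for i in range(VL):
    idx_rt = SVSHAPE0[i]
    idx_ra = SVSHAPE1[i]
    regs[RT + idx_rt] = -regs[RA + idx_ra]   # OPERATION here is "negate"

print(regs)   # [10, 20, 30, 40, -40, -30, -20, -10]
```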
191
192 ## Introduction to Vertical-First Mode
193
194 We're going to use Vertical-First mode (VF) to implement this, so we
195 will first do a short introduction to VF. In normal Horizontal Vector
196 mode, or even in traditional SIMD mode, the instruction is executed
197 across a (Horizontal) Vector. So, if we have, for example `VL=8`, then
198 the instruction
199
200 ```
201 sv.add *RT, *RA, *RB # RT = RA + RB
202 ```
203
204 will be executed on all elements of the vector, **before** moving to
205 the next assembly instruction. This behaviour changes in Vertical-First
206 mode. In VF mode, the instruction is executed on a **single** element of
207 the vector and execution then moves to the next assembly instruction. It
208 continues to do so until it reaches the `svstep` instruction, at which point
209 processing moves on to the next element/register in the vector. Branching
210 back to the beginning of the loop does not occur automatically though; a
211 branch instruction has to be added manually.
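
The difference in execution order can be sketched as follows (a simplified
Python model, ignoring REMAP; the function and variable names are
illustrative only):

```
VL = 4

# Horizontal (normal) Vector mode: each instruction completes across
# all VL elements before the next instruction starts
def horizontal(program):
    for op in program:
        for i in range(VL):
            op(i)

# Vertical-First mode: every instruction acts on one element, then an
# explicit svstep (modelled by the outer loop) moves to the next element
def vertical_first(program):
    for i in range(VL):
        for op in program:
            op(i)

program = [lambda i: print("add  element", i),
           lambda i: print("xor  element", i),
           lambda i: print("rotl element", i)]

print("Horizontal order:")
horizontal(program)       # all adds, then all xors, then all rotates
print("Vertical-First order:")
vertical_first(program)   # add/xor/rotl on element 0, then element 1, ...
```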
212
213 Vertical-First is needed because it should be clear
214 from xchacha20 that there are ordering dependencies between the three
215 operations `add, xor, rotate`. It is not okay to perform the entire
216 suite of Vector-adds then move on to the Vector-xors then the Vector-rotates:
217 they have to be interleaved so as to respect the element-level ordering.
218 This is exactly what Vertical-First allows us to do:
219 element 0 add, element 0 xor, element 0 rotate, then element **1**
220 add, element **1** xor, element **1** rotate, and so on. With Vertical-First
221 *combined* with Indexed REMAP we can literally jump those operations around,
222 anywhere within the Vector.
223
224 ## Application of VF mode in the Xchacha20 loop
225
226 Let's assume the values `x` are stored in registers starting at GPR 24:
227
228 | GPR # | | | | |
229 |--------|-----|-----|-----|-----|
230 | 24 | x0 | x1 | x2 | x3 |
231 | 28 | x4 | x5 | x6 | x7 |
232 | 32 | x8 | x9 | x10 | x11 |
233 | 36 | x12 | x13 | x14 | x15 |
234
235 So, for the additions in Vertical-First mode, the `RT` (and `RA`, as they are
236 the same) indices are (in terms of x):
237
238 | | | | | | | | |
239 |----|----|----|----|----|----|----|----|
240 | 0 | 8 | 0 | 8 | 1 | 9 | 1 | 9 |
241 | 2 | 10 | 2 | 10 | 3 | 11 | 3 | 11 |
242 | 0 | 10 | 0 | 10 | 1 | 11 | 1 | 11 |
243 | 2 | 8 | 2 | 8 | 3 | 9 | 3 | 9 |
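
These index tables can be derived mechanically from the `QUARTERROUND` call
order. A small Python sketch (list names are illustrative) reproduces both
the `RT`/`RA` sequence above and the `RB` sequence given further below:

```
# the eight QUARTERROUND calls of one double-round, as (a, b, c, d)
quarterrounds = [(0, 4,  8, 12), (1, 5,  9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                 (0, 5, 10, 15), (1, 6, 11, 12), (2, 7,  8, 13), (3, 4,  9, 14)]

rt_idx = []   # targets of the four adds in each quarter-round: a, c, a, c
rb_idx = []   # the corresponding added-in values:              b, d, b, d
for a, b, c, d in quarterrounds:
    rt_idx += [a, c, a, c]
    rb_idx += [b, d, b, d]

print(rt_idx)   # 0, 8, 0, 8, 1, 9, 1, 9, 2, 10, ... (matches the table above)
print(rb_idx)   # 4, 12, 4, 12, 5, 13, 5, 13, ...
```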
244
245 However, since the indices are small values, using a single 64-bit
246 register for a single index value is a waste, so we will compress them,
247 packing 8 indices into each 64-bit register.
248 The `RT` indices then fit inside these 4 registers (in Little-Endian format):
249
250 | | | | | |
251 |-----------|-------------------|-------------------|-------------------|-------------------|
252 | SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |
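
The packing itself is simple: 8 byte-sized indices per 64-bit register, first
index in the least-significant byte. A short Python check (illustrative names)
reproduces the `SVSHAPE0` constants above:

```
rt_idx = [0, 8, 0, 8, 1, 9, 1, 9,    2, 10, 2, 10, 3, 11, 3, 11,
          0, 10, 0, 10, 1, 11, 1, 11,    2, 8, 2, 8, 3, 9, 3, 9]

packed = []
for i in range(0, len(rt_idx), 8):
    chunk = rt_idx[i:i + 8]
    # least-significant byte holds the first index (Little Endian)
    packed.append(sum(b << (8 * j) for j, b in enumerate(chunk)))

print([hex(v) for v in packed])
# ['0x901090108000800', '0xb030b030a020a02',
#  '0xb010b010a000a00', '0x903090308020802']
```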
253
254 Similarly we find the RB indices:
255
256 | | | | | | | | |
257 |----|----|----|----|----|----|----|----|
258 | 4 | 12 | 4 | 12 | 5 | 13 | 5 | 13 |
259 | 6 | 14 | 6 | 14 | 7 | 15 | 7 | 15 |
260 | 5 | 15 | 5 | 15 | 6 | 12 | 6 | 12 |
261 | 7 | 13 | 7 | 13 | 4 | 14 | 4 | 14 |
262
263 Using a similar method, we find the final 4 registers with the `RB` indices:
264
265 | | | | | |
266 |-----------|-------------------|-------------------|-------------------|-------------------|
267 | SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |
268
269 Now, we can construct the Vertical-First loop:
270
271 ```
272 svindex 4, 0, 1, 3, 0, 1, 0 # SVSHAPE0, add RA/RT indices
273 svindex 6, 1, 1, 3, 0, 1, 0 # SVSHAPE1, add RB indices
274 setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
275 svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
276 sv.add/w=32 *x, *x, *x # RT, RB: SHAPE0. RA: SHAPE1
277 svstep. 16, 1, 0 # step to next in-regs element
278 ```
279
280 What this code snippet does is the following:
281
282 The first instruction
283
284 ```
285 svindex 4, 0, 1, 3, 0, 1, 0
286 ```
287
288 loads the add `RT` indices for `SVSHAPE0`, starting at register 8. You will
289 note that 4 is listed, but that's because the instruction only works on even
290 registers, so in order to save a bit we have to double that number to get the
291 actual register. So, the `SVSHAPE0` indices occupy GPRs 8-11. The value
292 3 indicates that the elements will be 8 bits long: 0=64-bit, 1=32-bit,
293 2=16-bit, 3=8-bit.
294
295 The next instruction
296
297 ```
298 svindex 6, 1, 1, 3, 0, 1, 0
299 ```
300
301 loads the add `RB` indices into `SVSHAPE1`. Again, even though we list 6,
302 the actual indices are loaded starting at GPR #12 (occupying GPRs 12-15),
303 and again the use of 8-bit elements is denoted.
304 Next, the `setvl` instruction:
305
306 ```
307 setvl 0, 0, 32, 1, 1, 1
308 ```
309
310 We have to call `setvl` to set `MAXVL` and `VL` to 32 and also configure
311 Vertical-First mode. Afterwards, we have to specify how we intend
312 to use the indices, and we do this using `svremap`.
313
314 ```
315 svremap 31, 1, 0, 0, 0, 0, 0
316 ```
317
318 `svremap` basically instructs the scheduler to use `SVSHAPE0` for `RT` and `RB`,
319 `SVSHAPE1` for `RA`. The next instruction performs the **actual** addition:
320
321 ```
322 sv.add/w=32 *x, *x, *x
323 ```
324
325 Note the `/w=32` suffix. This instructs the adder to perform the operation
326 on elements of `w=32` bits. Since the Power CPU is a 64-bit CPU, this means
327 that we need to have two 32-bit elements loaded in each register. Also,
328 note that for all operands we use `*x` as the argument. This instructs
329 the scheduler to act on the registers as a vector, or a sequence of
330 elements. But even though the operands are all the same, their indices will
331 be taken from the `SVSHAPE0`/`SVSHAPE1` indices as defined previously. Also
332 note that the indices are relative to the actual register used. So,
333 if `*x` starts at GPR 24 for example, in essence this instruction will
334 issue the following sequence of operations:
335
336 ```
337 add/w=32 24 + 0, 24 + 4, 24 + 0
338 add/w=32 24 + 8, 24 + 12, 24 + 8
339 add/w=32 24 + 0, 24 + 4, 24 + 0
340 add/w=32 24 + 8, 24 + 12, 24 + 8
341 add/w=32 24 + 1, 24 + 5, 24 + 1
342 add/w=32 24 + 9, 24 + 13, 24 + 9
343 add/w=32 24 + 1, 24 + 5, 24 + 1
344 add/w=32 24 + 9, 24 + 13, 24 + 9
345 ...
346 ```
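
Since `w=32` packs two 32-bit elements into each 64-bit GPR, an element offset
in the sequence above addresses half a register. A Python sketch of that
addressing (assuming element `i` lives in GPR `24 + i//2`, low half first;
the helper names are illustrative):

```
def read_elem(gprs, base, i):
    # w=32: two 32-bit elements per 64-bit register, low half first
    word = gprs[base + i // 2]
    return (word >> (32 * (i % 2))) & 0xffffffff

def write_elem(gprs, base, i, value):
    shift = 32 * (i % 2)
    mask = 0xffffffff << shift
    gprs[base + i // 2] = (gprs[base + i // 2] & ~mask) | ((value & 0xffffffff) << shift)

# e.g. "add/w=32 24 + 0, 24 + 4, 24 + 0" then corresponds to:
#   write_elem(gprs, 24, 0,
#              read_elem(gprs, 24, 4) + read_elem(gprs, 24, 0))
```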
347
348 Finally, the `svstep.` instruction steps to the next set of indices.
349
350 We have shown how to do the additions in Vertical-First mode. Now
351 let's add the rest of the operations in the `QUARTERROUND`s. The
352 `XOR` operations of both `QUARTERROUND` groups, each of the form `d =
353 XOR(d, a)`, are:
354
355 ```
356 x12 = x12 ^ x0
357 x4 = x4 ^ x8
358 x12 = x12 ^ x0
359 x4 = x4 ^ x8
360 x13 = x13 ^ x1
361 x5 = x5 ^ x9
362 x13 = x13 ^ x1
363 x5 = x5 ^ x9
364 x14 = x14 ^ x2
365 x6 = x6 ^ x10
366 x14 = x14 ^ x2
367 x6 = x6 ^ x10
368 x15 = x15 ^ x3
369 x7 = x7 ^ x11
370 x15 = x15 ^ x3
371 x7 = x7 ^ x11
372 x15 = x15 ^ x0
373 x5 = x5 ^ x10
374 x15 = x15 ^ x0
375 x5 = x5 ^ x10
376 x12 = x12 ^ x1
377 x6 = x6 ^ x11
378 x12 = x12 ^ x1
379 x6 = x6 ^ x11
380 x13 = x13 ^ x2
381 x7 = x7 ^ x8
382 x13 = x13 ^ x2
383 x7 = x7 ^ x8
384 x14 = x14 ^ x3
385 x4 = x4 ^ x9
386 x14 = x14 ^ x3
387 x4 = x4 ^ x9
388 ```
389
390 We will need to create another set of indices for the `XOR` operations. We
391 will only need one set, as the other set of indices is the same as the `RT`
392 set for `sv.add` (`SHAPE0`). The `XOR` target indices (in terms of x) are:
393
394 | | | | | | | | |
395 |----|----|----|----|----|----|----|----|
396 | 12 | 4 | 12 | 4 | 13 | 5 | 13 | 5 |
397 | 14 | 6 | 14 | 6 | 15 | 7 | 15 | 7 |
398 | 15 | 5 | 15 | 5 | 12 | 6 | 12 | 6 |
399 | 13 | 7 | 13 | 7 | 14 | 4 | 14 | 4 |
400
401 Again, packing 8 indices per 64-bit register, we find:
402
403 | | | | | |
404 |-----------|-------------------|-------------------|-------------------|-------------------|
405 | SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |
406
407 The next operation is the `ROTATE`, which takes as operands the result of the
408 `XOR` and a shift argument. You can easily see that the indices used in this
409 case are the same as those of the `XOR`. However, the shift values cycle every
410 4 elements: 16, 12, 8, 7. For the indices we can again use `svindex`, like this:
411
412 ```
413 svindex 8, 2, 1, 3, 0, 1, 0
414 ```
415
416 This again means `SVSHAPE2`, operating on 8-bit elements, starting
417 from GPR #16 (`8*2`). For the shift values, which cycle every 4 elements,
418 the `svshape2` instruction will be used:
419
420 ```
421 svshape2 0, 0, 3, 4, 0, 1
422 ```
423
424 This will create an `SVSHAPE3`, which will use a modulo-4 schedule for its
425 elements. Now we can list both the `XOR` and `ROTATE` instructions in assembly,
426 together with their respective `svremap` instructions:
427
428 ```
429 svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
430 sv.xor/w=32 *x, *x, *x
431 svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
432 sv.rldcl/w=32 *x, *x, *SHIFTS, 0
433 ```
434
435 So, in a similar fashion, we instruct `XOR` (`sv.xor`) to use `SVSHAPE2` for
436 `RA` and `RS` and `SVSHAPE0` for `RB`, again for 32-bit elements, while `ROTATE`
437 (`sv.rldcl`) will also use `SVSHAPE2` for `RA` and `RS`, but `SVSHAPE3` for `RB`
438 (the shift values, which cycle every 4 elements). Note that the shift values
439 addressed via `SVSHAPE3` have to be stored as 32-bit elements:
440
441 | | | |
442 |---------|--------------------|--------------------|
443 | SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |
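
As with the byte-packed index registers, a quick Python check confirms that
these two 64-bit values hold the four shift amounts 16, 12, 8, 7 as 32-bit
elements in Little-Endian order:

```
shifts = [16, 12, 8, 7]

packed = []
for i in range(0, len(shifts), 2):
    lo, hi = shifts[i], shifts[i + 1]
    packed.append(lo | (hi << 32))   # low 32-bit element first

print([f"{v:#018x}" for v in packed])
# ['0x0000000c00000010', '0x0000000700000008']
```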
444
445 The complete algorithm for a loop with 10 iterations is as follows:
446
447 ```
448 li 7, 10 # Load value 10 into GPR #7
449 mtctr 7 # Set up counter on GPR #7
450
451 # set up VL=32 vertical-first, and SVSHAPEs 0-3
452 setvl 0, 0, 32, 1, 1, 1
453 # SHAPE0, used by sv.add starts at GPR #8
454 svindex 8/2, 0, 1, 3, 0, 1, 0 # SVSHAPE0, a
455 # SHAPE1, used by sv.xor starts at GPR #12
456 svindex 12/2, 1, 1, 3, 0, 1, 0 # SVSHAPE1, b
457 # SHAPE2, used by sv.rldcl starts at GPR #16
458 svindex 16/2, 2, 1, 3, 0, 1, 0 # SVSHAPE2, c
459 # SHAPE3, used also by sv.rldcl to hold the shift values, starts at GPR #20
460 # The inner loop will do 32 iterations, but there are only
461 # 4 shift values, so we mod by 4, and can cycle through them
462 svshape2 0, 0, 3, 4, 0, 1 # SVSHAPE3, shift amount, mod4
463 .outer:
464 # outer loop begins here (standard CTR loop)
465 setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
466 # inner loop begins here. add-xor-rotl32 with remap, step, branch
467 .inner:
468 svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
469 sv.add/w=32 *x, *x, *x
470 svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
471 sv.xor/w=32 *x, *x, *x
472 svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
473 sv.rldcl/w=32 *x, *x, *SHIFTS, 0
474 # Note: using GPR #16 as the destination of svstep. would overlap
475 # with SHAPE2 (which starts at GPR #16) and corrupt its first 8
476 # indices, so GPR #7 (free after mtctr) is used instead.
477 svstep. 7, 1, 0 # step to next in-regs element
478 bc 6, 3, .inner # svstep. Rc=1 loop-end-condition?
479 # inner-loop done: outer loop standard CTR-decrement to setvl again
480 bdnz .outer # Loop until CTR is zero
481 ```
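
As a cross-check of the Schedules (rather than of the assembler itself), here
is a small Python reference model - a sketch, with illustrative names - that
walks the three index lists exactly as the Vertical-First inner loop does:
one add, one xor, one rotate per step, 32 steps per pass, 10 outer passes,
and compares the result against a directly unrolled double-round. The index
lists are the ones derived earlier in this document.

```
def rotl32(v, n):
    return ((v << n) | (v >> (32 - n))) & 0xffffffff

quarterrounds = [(0, 4,  8, 12), (1, 5,  9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                 (0, 5, 10, 15), (1, 6, 11, 12), (2, 7,  8, 13), (3, 4,  9, 14)]

# build the Schedules: add targets (SVSHAPE0), add sources (SVSHAPE1),
# xor/rotate targets (SVSHAPE2), and the mod-4 shift table (via SVSHAPE3)
add_rt, add_rb, xor_rt = [], [], []
for a, b, c, d in quarterrounds:
    add_rt += [a, c, a, c]
    add_rb += [b, d, b, d]
    xor_rt += [d, b, d, b]
shifts = [16, 12, 8, 7]

def rounds_via_schedules(x):
    x = list(x)
    for _ in range(10):                  # outer CTR loop
        for i in range(32):              # inner Vertical-First loop, VL=32
            x[add_rt[i]] = (x[add_rt[i]] + x[add_rb[i]]) & 0xffffffff
            x[xor_rt[i]] ^= x[add_rt[i]]
            x[xor_rt[i]] = rotl32(x[xor_rt[i]], shifts[i % 4])
    return x

def rounds_unrolled(x):
    x = list(x)
    for _ in range(10):
        for a, b, c, d in quarterrounds:
            x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 16)
            x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 12)
            x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 8)
            x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 7)
    return x

state = list(range(1, 17))               # arbitrary 16-word test state
assert rounds_via_schedules(state) == rounds_unrolled(state)
```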
482
483