[[!tag svp64_cookbook]]

# Funding and Links

* Funded by NLnet NGI-ASSURE under EU grant agreement No 957073.
* <https://bugs.libre-soc.org/show_bug.cgi?id=965>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1007>
* <https://github.com/pts/chacha20>

# XChaCha20 SVP64 Implementation Analysis

This document shows how xchacha20's core loop - all 20 rounds - was
implemented in just 11 Vector instructions. An additional 9 instructions
are involved in establishing the REMAP Schedules (explained below);
when multiple blocks are processed, these 9 instructions do not need
to be issued again.

Firstly, we analyse the xchacha20 algorithm, showing which operations
are performed and in what order. Secondly, two innovative features
of SVP64 are described which are crucial to understanding Simple-V
Vectorization: Vertical-First Mode and Indexed REMAP. Then we show
how Indexed REMAP entirely eliminates the need for inline loop-unrolling,
noting that in this particular algorithm REMAP is only useful
in Vertical-First Mode.

## Description of the XChaCha20 Algorithm

We will first analyze the XChaCha20 algorithm as found in:

https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c

The function under inspection is `xchacha_hchacha20`. The main part
of the computation - the core algorithm - is just a for loop, which is
also the case in the `xchacha_encrypt_bytes` function.

Main loop for `xchacha_hchacha20`:

```
for (i = 0; i < 10; i++){
    QUARTERROUND(x0, x4, x8, x12);
    QUARTERROUND(x1, x5, x9, x13);
    QUARTERROUND(x2, x6, x10, x14);
    QUARTERROUND(x3, x7, x11, x15);
    QUARTERROUND(x0, x5, x10, x15);
    QUARTERROUND(x1, x6, x11, x12);
    QUARTERROUND(x2, x7, x8, x13);
    QUARTERROUND(x3, x4, x9, x14);
}

#define QUARTERROUND(a,b,c,d) \
    a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \
    a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);
```
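
For reference, here is a short Python model of the loop and macro above
(a sketch written for this article, not taken from the C source); it can
be used to sanity-check the unrolled listings that follow:

```
MASK32 = 0xffffffff

def rotl32(v, n):
    # 32-bit rotate-left
    return ((v << n) | (v >> (32 - n))) & MASK32

def quarterround(x, a, b, c, d):
    # x is a list of 16 32-bit words; a, b, c, d are indices into it
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 7)

def hchacha_core(x):
    # the 10-iteration double-round loop shown above (20 rounds total)
    for _ in range(10):
        quarterround(x, 0, 4, 8, 12)
        quarterround(x, 1, 5, 9, 13)
        quarterround(x, 2, 6, 10, 14)
        quarterround(x, 3, 7, 11, 15)
        quarterround(x, 0, 5, 10, 15)
        quarterround(x, 1, 6, 11, 12)
        quarterround(x, 2, 7, 8, 13)
        quarterround(x, 3, 4, 9, 14)
```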

We see that the loop is split into two groups of `QUARTERROUND` calls,
one with `step=4`:

```
QUARTERROUND(x0, x4, x8, x12);
QUARTERROUND(x1, x5, x9, x13);
QUARTERROUND(x2, x6, x10, x14);
QUARTERROUND(x3, x7, x11, x15);
```

and another with `step=5`:

```
QUARTERROUND(x0, x5, x10, x15);
QUARTERROUND(x1, x6, x11, x12);
QUARTERROUND(x2, x7, x8, x13);
QUARTERROUND(x3, x4, x9, x14);
```
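
These two groups follow the classic ChaCha20 "column then diagonal"
pattern. As a quick illustration (a sketch written for this article,
with a hypothetical helper name), the argument tuples can be generated
programmatically:

```
def quarterround_args():
    groups = []
    # group 1, "step=4": columns (i, i+4, i+8, i+12)
    groups.append([(i, i + 4, i + 8, i + 12) for i in range(4)])
    # group 2, "step=5": diagonals, where b, c, d wrap modulo 4 within their row
    groups.append([(i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4)
                   for i in range(4)])
    return groups

print(quarterround_args())
# [[(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)],
#  [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]]
```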

Let's start with the first group of `QUARTERROUND`s. Unrolling it
results in the following operations:

```
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 16);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 12);
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 8);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 7);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 16);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 12);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 8);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 7);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 16);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 12);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 8);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 7);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 16);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 12);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 8);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 7);
```

The second group of `QUARTERROUND`s, unrolled:

```
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 16);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 12);
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 8);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 7);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 16);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 12);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 8);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 7);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 16);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 12);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 8);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 7);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 16);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 12);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 8);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 7);
```

Let's list the additions only:

```
x0 = x0 + x4
x8 = x8 + x12
x0 = x0 + x4
x8 = x8 + x12
x1 = x1 + x5
x9 = x9 + x13
x1 = x1 + x5
x9 = x9 + x13
x2 = x2 + x6
x10 = x10 + x14
x2 = x2 + x6
x10 = x10 + x14
x3 = x3 + x7
x11 = x11 + x15
x3 = x3 + x7
x11 = x11 + x15
x0 = x0 + x5
x10 = x10 + x15
x0 = x0 + x5
x10 = x10 + x15
x1 = x1 + x6
x11 = x11 + x12
x1 = x1 + x6
x11 = x11 + x12
x2 = x2 + x7
x8 = x8 + x13
x2 = x2 + x7
x8 = x8 + x13
x3 = x3 + x4
x9 = x9 + x14
x3 = x3 + x4
x9 = x9 + x14
```

These are clearly regular patterns, and if there were a
system of "register offsets" available, it would be possible
to put a loop around one single add instead of unrolling 32
operations. This is where REMAP Indexing steps in.
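
To make the pattern explicit, here is a short sketch (written for this
article, with hypothetical helper names) that derives the 32 destination
offsets and 32 source offsets of the additions directly from the
`QUARTERROUND` structure; these are exactly the index tables used later
for the REMAP Schedules:

```
def add_schedules():
    cols  = [(i, i + 4, i + 8, i + 12) for i in range(4)]
    diags = [(i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4)
             for i in range(4)]
    rt_idx, ra_idx = [], []
    for (a, b, c, d) in cols + diags:
        # each QUARTERROUND contains four adds: a+=b, c+=d, a+=b, c+=d
        rt_idx += [a, c, a, c]
        ra_idx += [b, d, b, d]
    return rt_idx, ra_idx

rt_idx, ra_idx = add_schedules()
print(rt_idx[:8])   # [0, 8, 0, 8, 1, 9, 1, 9]
print(ra_idx[:8])   # [4, 12, 4, 12, 5, 13, 5, 13]
```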

## Introduction to REMAP Indexing

REMAP Indexing performs arbitrary re-positioning of elements.
In effect it makes it possible to perform "arbitrary addressing"
of registers at runtime.
Where normally any other Vector Processor would only be able to
perform a sequential element-level series of operations, and would
need a special re-ordering instruction if re-ordering of the elements
is required, SVP64 can *in-place* reorder elements on *any* instruction,
using the REMAP subsystem.

Most of the REMAP systems are simple fixed-hardware Deterministic
Schedules, but there is one that is general-purpose: Indexing. It
requires specifying a group of GPRs (or indices packed into GPRs)
that are to be used as the offsets.

This is a normal Simple-V operation:

```
for i in range(VL):
    GPR[RT+i] = OPERATION(GPR[RA+i])
```

This is what happens when REMAP is enabled with Indexing:

```
def REMAP(SVSHAPE, i):
    return GPR(SVSHAPE.GPR + i)

for i in range(VL):
    idx_rt = REMAP(SVSHAPE0, i)
    idx_ra = REMAP(SVSHAPE1, i)
    GPR[RT+idx_rt] = OPERATION(GPR[RA+idx_ra])
```

In this way we can literally jump about, pretty much anywhere in
the register file, according to a Schedule that is determined by
the programmer. Therefore, if we can put all of the chacha20
round intermediary data into an array of registers, and can
analyse the *order* in which add-operations, xor-operations
and rotate-operations occur, it might just be possible to
eliminate **all** loop-unrolled inline assembler, replacing it
with three instructions and appropriate Indexed REMAP Schedules!
It turns out that this is indeed possible.
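
As a tiny worked example (a sketch for this article, with made-up
index values), this is the effect Indexed REMAP has on a single
Vectorized operation:

```
VL = 4
GPR = {i: i * 10 for i in range(32)}   # dummy register file contents
RT, RA = 8, 8                          # both operands based at GPR 8
idx_rt = [3, 0, 2, 1]                  # SVSHAPE0: destination offsets
idx_ra = [1, 2, 0, 3]                  # SVSHAPE1: source offsets

for i in range(VL):
    GPR[RT + idx_rt[i]] = GPR[RA + idx_ra[i]] + 1
    print("GPR[%d] = GPR[%d] + 1" % (RT + idx_rt[i], RA + idx_ra[i]))
# GPR[11] = GPR[9] + 1
# GPR[8] = GPR[10] + 1
# GPR[10] = GPR[8] + 1
# GPR[9] = GPR[11] + 1
```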

## Introduction to Vertical-First Mode

We're going to use Vertical-First mode (VF) to implement this, so we
will first give a short introduction to VF. In normal Horizontal Vector
mode, or even in traditional SIMD mode, the instruction is executed
across a (Horizontal) Vector. So, if we have, for example `VL=8`, then
the instruction

```
sv.add *RT, *RA, *RB  # RT = RA + RB
```

will be executed on all elements of the vector, **before** moving to
the next assembly instruction. This behaviour changes in Vertical-First
mode. In VF mode, the instruction is executed on a **single** element of
the vector and then moves to the next assembly instruction. It continues
to do so until execution reaches the `svstep` instruction, at which point
processing moves to the next element/register in the vector. Branching to
the beginning of the loop does not occur automatically though; a branch
instruction has to be added manually.
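
In pseudocode (a sketch written for this article, not the formal
specification), the difference between the two modes is simply the
nesting order of the two loops:

```
# Horizontal-First: each instruction completes across all VL elements
# before the next instruction starts.
def horizontal_first(instructions, VL):
    for op in instructions:
        for i in range(VL):
            op(i)

# Vertical-First: each instruction acts on a *single* element; svstep
# advances the element index, and a branch closes the loop manually.
def vertical_first(instructions, VL):
    i = 0                       # element counter (held in SVSTATE)
    while True:
        for op in instructions:
            op(i)               # each op touches only element i
        i += 1                  # svstep.
        if i == VL:             # svstep. Rc=1 signals loop-end in CR0
            break               # fall out instead of branching back
```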

The reason why Vertical-First is needed should be clear
from xchacha20: there are ordering dependencies between the three
operations `add, xor, rotate`. It is not okay to perform the entire
suite of Vector-adds then move on to the Vector-xors then the Vector-rotates:
they have to be interleaved so as to respect the element-level ordering.
This is exactly what Vertical-First allows us to do:
element 0 add, element 0 xor, element 0 rotate, then element **1**
add, element **1** xor, element **1** rotate and so on. With Vertical-First
*combined* with Indexed REMAP we can literally jump those operations around,
anywhere within the Vector.
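
Concretely for xchacha20 (a sketch for this article, using the first few
entries of the index Schedules derived in the next section), the
Vertical-First issue order per element is:

```
shape0 = [0, 8, 0, 8]      # add destination (and one source)
shape1 = [4, 12, 4, 12]    # add second source
shape2 = [12, 4, 12, 4]    # xor/rotate destination and source
shifts = [16, 12, 8, 7]    # rotate amounts, cycling modulo 4

for i in range(4):
    print("x%d += x%d" % (shape0[i], shape1[i]))
    print("x%d ^= x%d" % (shape2[i], shape0[i]))
    print("x%d = ROTATE(x%d, %d)" % (shape2[i], shape2[i], shifts[i % 4]))
# element 0 prints: x0 += x4;  x12 ^= x0;  x12 = ROTATE(x12, 16)
# element 1 prints: x8 += x12; x4 ^= x8;   x4 = ROTATE(x4, 12)
# ... exactly the first lines of QUARTERROUND(x0, x4, x8, x12)
```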

## Application of VF mode in the XChaCha20 loop

Let's assume the values `x` are stored in the registers 24-36:

| GPR # | | | | |
|--------|-----|-----|-----|-----|
| 24 | x0 | x1 | x2 | x3 |
| 28 | x4 | x5 | x6 | x7 |
| 32 | x8 | x9 | x10 | x11 |
| 36 | x12 | x13 | x14 | x15 |

So for the addition in Vertical-First mode, the `RT` (and `RB`, as they
are the same) indices are (in terms of x):

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 0 | 8 | 0 | 8 | 1 | 9 | 1 | 9 |
| 2 | 10 | 2 | 10 | 3 | 11 | 3 | 11 |
| 0 | 10 | 0 | 10 | 1 | 11 | 1 | 11 |
| 2 | 8 | 2 | 8 | 3 | 9 | 3 | 9 |

However, since the indices are small values, using a whole 64-bit
register for a single index value is a waste, so we will compress them,
8 indices to a 64-bit register.
The `RT` indices will therefore fit inside these 4 registers (in Little
Endian format):

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |
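
These values can be reproduced with a short packing sketch (written for
this article):

```
# pack 8 byte-sized indices into one 64-bit value, Little-Endian,
# byte 0 holding the first index
def pack8(indices):
    assert len(indices) == 8 and all(0 <= v < 256 for v in indices)
    val = 0
    for byte, idx in enumerate(indices):
        val |= idx << (8 * byte)
    return val

print(hex(pack8([0, 8, 0, 8, 1, 9, 1, 9])))      # 0x901090108000800
print(hex(pack8([2, 10, 2, 10, 3, 11, 3, 11])))  # 0xb030b030a020a02
```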

Similarly we find the `RA` indices:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 4 | 12 | 4 | 12 | 5 | 13 | 5 | 13 |
| 6 | 14 | 6 | 14 | 7 | 15 | 7 | 15 |
| 5 | 15 | 5 | 15 | 6 | 12 | 6 | 12 |
| 7 | 13 | 7 | 13 | 4 | 14 | 4 | 14 |

Using a similar method, we find the final 4 registers with the `RA` indices:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |

Now, we can construct the Vertical-First loop:

```
svindex 4, 0, 1, 3, 0, 1, 0     # SVSHAPE0, add RT/RB indices
svindex 6, 1, 1, 3, 0, 1, 0     # SVSHAPE1, add RA indices
setvl 0, 0, 32, 1, 1, 1         # MAXVL=VL=32, VF=1
svremap 31, 1, 0, 0, 0, 0, 0    # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x          # RT, RB: SHAPE0. RA: SHAPE1
svstep. 16, 1, 0                # step to next in-regs element
```

What this code snippet does is the following:

The first instruction

```
svindex 4, 0, 1, 3, 0, 1, 0
```

loads the add `RT` indices into `SVSHAPE0`, starting at GPR #8. You will
note that 4 is listed, but that is because the field only holds even
register numbers (to save a bit in the encoding), so the listed number
has to be doubled to get the actual starting register. So, the `SVSHAPE0`
indices will be located in GPRs 8-11. The number 3 indicates that the
elements will be 8-bit wide (0=64-bit, 1=32-bit, 2=16-bit, 3=8-bit).
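
The arithmetic behind that register footprint (a sketch for this
article, with a hypothetical helper name): 32 indices of 8 bits each
occupy 4 64-bit GPRs, so starting at GPR #8 they live in GPRs 8-11,
leaving GPR #12 free for the next set of indices.

```
def index_gprs(start_gpr, num_indices, elwidth_bits):
    # how many 64-bit GPRs are needed to hold the packed indices
    total_bytes = num_indices * elwidth_bits // 8
    num_gprs = (total_bytes + 7) // 8
    return list(range(start_gpr, start_gpr + num_gprs))

print(index_gprs(8, 32, 8))    # [8, 9, 10, 11]   SVSHAPE0
print(index_gprs(12, 32, 8))   # [12, 13, 14, 15] SVSHAPE1
```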

The next instruction

```
svindex 6, 1, 1, 3, 0, 1, 0
```

loads the add `RA` indices into `SVSHAPE1`. Again, even though we list 6,
the indices will actually be loaded starting at GPR #12, and again the
use of 8-bit elements is denoted.
Next, the `setvl` instruction:

```
setvl 0, 0, 32, 1, 1, 1
```

We have to call `setvl` to set `MAXVL` and `VL` to 32 and also configure
Vertical-First mode. Afterwards, we have to specify how we intend
to use the indices, and we do this using `svremap`.

```
svremap 31, 1, 0, 0, 0, 0, 0
```

`svremap` basically instructs the scheduler to use `SVSHAPE0` for `RT` and `RB`,
and `SVSHAPE1` for `RA`. The next instruction performs the **actual** addition:

```
sv.add/w=32 *x, *x, *x
```

Note the `/w=32` suffix. This instructs the adder to perform the operation
on elements of `w=32` bits. Since the Power CPU is a 64-bit CPU, this means
that two 32-bit elements are packed in each register. Also note that all
three operands are given as `*x`. This instructs the scheduler to act on
the registers as a vector, or a sequence of elements. But even though the
operands look identical, the element indices will be taken from the
`SVSHAPE0`/`SVSHAPE1` Schedules as defined previously. Also
note that the indices are relative to the actual register used. So,
if `*x` starts at GPR 24 for example, in essence this instruction will
issue the following sequence of operations:

```
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
...
```

Finally, the `svstep.` instruction steps to the next set of indices.

We have shown how to do the additions in Vertical-First mode. Now
let's add the rest of the instructions in the `QUARTERROUND`s. The
`XOR` operations of both `QUARTERROUND` groups (of the form
`d = XOR(d, a)` and `b = XOR(b, c)`) are:

```
x12 = x12 ^ x0
x4 = x4 ^ x8
x12 = x12 ^ x0
x4 = x4 ^ x8
x13 = x13 ^ x1
x5 = x5 ^ x9
x13 = x13 ^ x1
x5 = x5 ^ x9
x14 = x14 ^ x2
x6 = x6 ^ x10
x14 = x14 ^ x2
x6 = x6 ^ x10
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x0
x5 = x5 ^ x10
x15 = x15 ^ x0
x5 = x5 ^ x10
x12 = x12 ^ x1
x6 = x6 ^ x11
x12 = x12 ^ x1
x6 = x6 ^ x11
x13 = x13 ^ x2
x7 = x7 ^ x8
x13 = x13 ^ x2
x7 = x7 ^ x8
x14 = x14 ^ x3
x4 = x4 ^ x9
x14 = x14 ^ x3
x4 = x4 ^ x9
```

We will need to create another set of indices for the `XOR` instructions.
We will only need one set, as the other set of indices is the same as the
`RT` set for `sv.add` (`SHAPE0`). So, the `XOR` destination/source
(`RA`/`RS`) indices are the following:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 12 | 4 | 12 | 4 | 13 | 5 | 13 | 5 |
| 14 | 6 | 14 | 6 | 15 | 7 | 15 | 7 |
| 15 | 5 | 15 | 5 | 12 | 6 | 12 | 6 |
| 13 | 7 | 13 | 7 | 14 | 4 | 14 | 4 |

Again, we find the packed index values:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |

The next operation is the `ROTATE`, which takes as operands the result of
the `XOR` and a shift argument. You can easily see that the indices used in
this case are the same as those of the `XOR`. However, the shift values
cycle every 4 elements: 16, 12, 8, 7. For the indices we can again use
`svindex`, like this:

```
svindex 8, 2, 1, 3, 0, 1, 0
```

This again means `SVSHAPE2`, operating on 8-bit elements, starting
from GPR #16 (`8*2`). For the shift values cycling every 4 elements,
the `svshape2` instruction will be used:

```
svshape2 0, 0, 3, 4, 0, 1
```

This will create `SVSHAPE3`, which applies a modulo 4 to all of its
element indices. Now we can list both the `XOR` and `ROTATE` instructions
in assembly, together with the respective `svremap` instructions:

```
svremap 31, 2, 0, 2, 2, 0, 0  # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0  # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
```

So, in a similar fashion, we instruct `XOR` (`sv.xor`) to use `SVSHAPE2` for
`RA` and `RS` and `SVSHAPE0` for `RB`, again on 32-bit elements, while `ROTATE`
(`sv.rldcl`) will also use `SVSHAPE2` for `RA` and `RS`, but `SVSHAPE3` for `RB`
(the shift values, which cycle every 4 elements). Note that the actual
shift values indexed through `SVSHAPE3` have to be stored as 32-bit elements:

| | | |
|---------|--------------------|--------------------|
| SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |
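
Per element, the effect of the remapped `sv.rldcl/w=32` is therefore
(a sketch for this article; `shape2` holds the `SVSHAPE2` indices
listed earlier):

```
MASK32 = 0xffffffff
SHIFTS = [16, 12, 8, 7]         # stored as 32-bit elements, as above

def rotl32(v, n):
    return ((v << n) | (v >> (32 - n))) & MASK32

def remapped_rotate(x, shape2, i):
    # RA and RS both use SVSHAPE2; RB uses SVSHAPE3 (i modulo 4)
    x[shape2[i]] = rotl32(x[shape2[i]], SHIFTS[i % 4])
```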

The complete algorithm for a loop with 10 iterations is as follows:

```
li 7, 10                        # Load value 10 into GPR #7
mtctr 7                         # Set up counter on GPR #7

# set up VL=32 vertical-first, and SVSHAPEs 0-2
setvl 0, 0, 32, 1, 1, 1
# SHAPE0, used by sv.add starts at GPR #8
svindex 8/2, 0, 1, 3, 0, 1, 0   # SVSHAPE0, a
# SHAPE1, used by sv.xor starts at GPR #12
svindex 12/2, 1, 1, 3, 0, 1, 0  # SVSHAPE1, b
# SHAPE2, used by sv.rldcl starts at GPR #16
svindex 16/2, 2, 1, 3, 0, 1, 0  # SVSHAPE2, c
# SHAPE3, used by sv.rldcl to index the shift values, which start at GPR #20
# The inner loop will do 32 iterations, but there are only
# 4 shift values, so we mod by 4, and can cycle through them
svshape2 0, 0, 3, 4, 0, 1       # SVSHAPE3, shift amount, mod4
.outer:
# outer loop begins here (standard CTR loop)
setvl 0, 0, 32, 1, 1, 1         # MAXVL=VL=32, VF=1
# inner loop begins here. add-xor-rotl32 with remap, step, branch
.inner:
svremap 31, 1, 0, 0, 0, 0, 0    # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x
svremap 31, 2, 0, 2, 2, 0, 0    # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0    # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
# GPR #7 holds the result of svstep.: if GPR #16 were used instead it
# would overlap with SHAPE2 (which also starts at 16) and the first 8
# indices would get corrupted.
svstep. 7, 1, 0                 # step to next in-regs element
bc 6, 3, .inner                 # svstep. Rc=1 loop-end-condition?
# inner-loop done: outer loop standard CTR-decrement to setvl again
bdnz .outer                     # Loop until CTR is zero
```
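
As a closing cross-check (a Python sketch written for this article, not
part of the assembler listing), the three remapped operations can be
simulated with the index Schedules above and compared against a direct
`QUARTERROUND` reference; after all 10 outer iterations both produce
identical state:

```
MASK32 = 0xffffffff

def rotl32(v, n):
    return ((v << n) | (v >> (32 - n))) & MASK32

SHAPE0 = [0, 8, 0, 8, 1, 9, 1, 9, 2, 10, 2, 10, 3, 11, 3, 11,
          0, 10, 0, 10, 1, 11, 1, 11, 2, 8, 2, 8, 3, 9, 3, 9]
SHAPE1 = [4, 12, 4, 12, 5, 13, 5, 13, 6, 14, 6, 14, 7, 15, 7, 15,
          5, 15, 5, 15, 6, 12, 6, 12, 7, 13, 7, 13, 4, 14, 4, 14]
SHAPE2 = [12, 4, 12, 4, 13, 5, 13, 5, 14, 6, 14, 6, 15, 7, 15, 7,
          15, 5, 15, 5, 12, 6, 12, 6, 13, 7, 13, 7, 14, 4, 14, 4]
SHIFTS = [16, 12, 8, 7]

def remapped_rounds(x):
    for _ in range(10):                      # outer CTR loop
        for i in range(32):                  # inner Vertical-First loop
            x[SHAPE0[i]] = (x[SHAPE0[i]] + x[SHAPE1[i]]) & MASK32  # sv.add
            x[SHAPE2[i]] ^= x[SHAPE0[i]]                           # sv.xor
            x[SHAPE2[i]] = rotl32(x[SHAPE2[i]], SHIFTS[i % 4])     # sv.rldcl

def quarterround(x, a, b, c, d):
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 7)

def reference_rounds(x):
    for _ in range(10):
        for a, b, c, d in [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14),
                           (3, 7, 11, 15), (0, 5, 10, 15), (1, 6, 11, 12),
                           (2, 7, 8, 13), (3, 4, 9, 14)]:
            quarterround(x, a, b, c, d)

state = [(0x12345678 * (i + 1)) & MASK32 for i in range(16)]
st1, st2 = list(state), list(state)
remapped_rounds(st1)
reference_rounds(st2)
assert st1 == st2
```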