1 [[!tag svp64_cookbook]]
2
3 # XChaCha20 SVP64 Implementation Analysis
4
5 This document shows how xchacha20's core loop - all 20 rounds - was
6 implemented in just 11 Vector instructions. An additional 9 instructions
7 are involved in establishing the REMAP Schedules (explained below);
8 when multiple blocks are processed, these 9 instructions do not
9 need to be executed again.
10
11 Firstly, we analyse the xchacha20 algorithm, showing what operations
12 are performed and in what order. Secondly, two innovative features
13 of SVP64 are described which are crucial to an understanding of Simple-V
14 Vectorization: Vertical-First Mode and Indexed REMAP. Then we show
15 how Indexed REMAP entirely eliminates the need for inline loop-unrolling,
16 but note that in this particular algorithm REMAP is only useful for
17 us in Vertical-First Mode.
18
19 ## Description of XChacha20 Algorithm
20
21 We will first try to analyze the XChacha20 algorithm as found in:
22
23 https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c
24
25 The function under inspection is `xchacha_hchacha20`. We notice
26 that the main part of the computation, the core algorithm, is
27 just a for loop - which is also the case in the `xchacha_encrypt_bytes`
28 function.
29
30 Main loop for `xchacha_hchacha20`:
31
32 ```
33 for (i = 0; i < 10; i++){
34 QUARTERROUND(x0, x4, x8, x12);
35 QUARTERROUND(x1, x5, x9, x13);
36 QUARTERROUND(x2, x6, x10, x14);
37 QUARTERROUND(x3, x7, x11, x15);
38 QUARTERROUND(x0, x5, x10, x15);
39 QUARTERROUND(x1, x6, x11, x12);
40 QUARTERROUND(x2, x7, x8, x13);
41 QUARTERROUND(x3, x4, x9, x14);
42 }
43
44 #define QUARTERROUND(a,b,c,d) \
45 a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \
46 c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \
47 a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \
48 c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);
49 ```
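
To make the quarter-round concrete, here is a minimal Python sketch of it
(names such as `rotl32` and `quarterround` are illustrative, not from the
original source; `ROTATE` is assumed to be a 32-bit left-rotate):

```
def rotl32(v, n):
    # 32-bit left rotation
    return ((v << n) | (v >> (32 - n))) & 0xffffffff

def quarterround(x, a, b, c, d):
    # operates in-place on a 16-entry state list x
    x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 7)
```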
50
51 We see that the loop is split into two groups of `QUARTERROUND` calls,
52 one with `step=4`:
53
54 ```
55 QUARTERROUND(x0, x4, x8, x12);
56 QUARTERROUND(x1, x5, x9, x13);
57 QUARTERROUND(x2, x6, x10, x14);
58 QUARTERROUND(x3, x7, x11, x15);
59 ```
60
61 and another with `step=5`:
62
63 ```
64 QUARTERROUND(x0, x5, x10, x15);
65 QUARTERROUND(x1, x6, x11, x12);
66 QUARTERROUND(x2, x7, x8, x13);
67 QUARTERROUND(x3, x4, x9, x14);
68 ```
69
70 Let's start with the first group of `QUARTERROUND`s. Unrolling it
71 essentially results in the following operations:
72
73 ```
74 x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 16);
75 x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 12);
76 x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 8);
77 x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 7);
78 x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 16);
79 x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 12);
80 x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 8);
81 x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 7);
82 x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 16);
83 x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 12);
84 x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 8);
85 x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 7);
86 x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 16);
87 x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 12);
88 x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 8);
89 x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 7);
90 ```
91
92 Second group of `QUARTERROUND`s, unrolled:
93
94 ```
95 x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 16);
96 x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 12);
97 x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 8);
98 x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 7);
99 x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 16);
100 x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 12);
101 x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 8);
102 x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 7);
103 x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 16);
104 x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 12);
105 x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 8);
106 x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 7);
107 x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 16);
108 x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 12);
109 x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 8);
110 x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 7);
111 ```
112
113 Let's list the additions only:
114
115 ```
116 x0 = x0 + x4
117 x8 = x8 + x12
118 x0 = x0 + x4
119 x8 = x8 + x12
120 x1 = x1 + x5
121 x9 = x9 + x13
122 x1 = x1 + x5
123 x9 = x9 + x13
124 x2 = x2 + x6
125 x10 = x10 + x14
126 x2 = x2 + x6
127 x10 = x10 + x14
128 x3 = x3 + x7
129 x11 = x11 + x15
130 x3 = x3 + x7
131 x11 = x11 + x15
132 x0 = x0 + x5
133 x10 = x10 + x15
134 x0 = x0 + x5
135 x10 = x10 + x15
136 x1 = x1 + x6
137 x11 = x11 + x12
138 x1 = x1 + x6
139 x11 = x11 + x12
140 x2 = x2 + x7
141 x8 = x8 + x13
142 x2 = x2 + x7
143 x8 = x8 + x13
144 x3 = x3 + x4
145 x9 = x9 + x14
146 x3 = x3 + x4
147 x9 = x9 + x14
148 ```
149
150 ## Introduction to REMAP Indexing
151
152 REMAP Indexing performs any arbitrary re-positioning of elements.
153 Where any other Vector Processor would normally only be able to do a
154 sequential element-level series of operations, and would need a special
155 re-ordering instruction if re-ordering of the elements is required,
156 SVP64 can reorder elements *in-place* on *any* instruction, using
157 the REMAP subsystem.
158
159 Most of the REMAP systems are simple fixed-hardware Deterministic
160 Schedules, but there is one that is general-purpose: Indexing. It
161 requires specifying a group of GPRs (or indices packed into GPRs)
162 that are to be used as the offsets.
163
164 This is a normal Simple-V operation:
165
166 ```
167 for i in range(VL):
168 GPR[RT+i] = OPERATION(GPR[RA+i])
169 ```
170
171 This is what happens when REMAP is enabled with Indexing:
172
173 ```
174 def REMAP(SVSHAPE, i):
175 return GPR(SVSHAPE.GPR + i)
176 for i in range(VL):
177 idx_rt = REMAP(SVSHAPE0, i)
178 idx_ra = REMAP(SVSHAPE1, i)
179 GPR[RT+idx_rt] = OPERATION(GPR[RA+idx_ra])
180 ```
181
182 In this way we can literally jump about, pretty much anywhere in
183 the register file, according to a Schedule that is determined by
184 the programmer. Therefore, if we can put all of the chacha20
185 round intermediary data into an array of registers, and can
186 analyse the *order* in which add-operations, xor-operations
187 and rotate-operations occur, it might just be possible to
188 eliminate **all** loop-unrolled inline assembler, replacing it
189 with three instructions and appropriate Indexed REMAP Schedules!
190 It turns out that this is indeed possible.
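
As a toy illustration of the principle (Python, not the actual hardware
pseudocode; the register contents and index values here are made up), an
operation can be applied across a register file in whatever order two
programmer-supplied index lists dictate:

```
# toy register file: 8 registers; sources at 0-3, destinations at 4-7
regs = [10, 20, 30, 40, 0, 0, 0, 0]

# programmer-chosen Schedules (in SVP64 these would be packed into GPRs)
SVSHAPE0 = [3, 1, 2, 0]   # destination offsets, applied to RT
SVSHAPE1 = [0, 2, 1, 3]   # source offsets, applied to RA

RT, RA, VL = 4, 0, 4
for i in range(VL):
    idx_rt = SVSHAPE0[i]
    idx_ra = SVSHAPE1[i]
    regs[RT + idx_rt] = -regs[RA + idx_ra]   # OPERATION here is "negate"

print(regs)   # [10, 20, 30, 40, -40, -30, -20, -10]
```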
191
192 ## Introduction to Vertical-First Mode
193
194 We're going to use Vertical-First mode (VF) to implement this, so we
195 will first do a short introduction to VF. In normal Horizontal Vector
196 mode, or even in traditional SIMD mode, the instruction is executed
197 across a (Horizontal) Vector. So, if we have, for example `VL=8`, then
198 the instruction
199
200 ```
201 sv.add *RT, *RA, *RB # RT = RA + RB
202 ```
203
204 will be executed on all elements of the vector, **before** moving to
205 the next assembly instruction. This behaviour changes in Vertical-First
206 mode. In VF mode, the instruction is executed on a **single** element of
207 the vector and execution then moves to the next assembly instruction. It
208 continues to do so until it reaches the `svstep` instruction, at which point
209 processing moves on to the next element/register in the vector. Branching
210 back to the beginning of the loop does not occur automatically though; a
211 branch instruction has to be added manually.
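
The difference in execution order can be sketched as follows (a simplified
Python model, ignoring REMAP; the function and variable names are
illustrative only):

```
VL = 4

# Horizontal (normal) Vector mode: each instruction completes across
# all VL elements before the next instruction starts
def horizontal(program):
    for op in program:
        for i in range(VL):
            op(i)

# Vertical-First mode: every instruction acts on one element, then an
# explicit svstep (modelled by the outer loop) moves to the next element
def vertical_first(program):
    for i in range(VL):
        for op in program:
            op(i)

program = [lambda i: print("add  element", i),
           lambda i: print("xor  element", i),
           lambda i: print("rotl element", i)]

print("Horizontal order:")
horizontal(program)       # all adds, then all xors, then all rotates
print("Vertical-First order:")
vertical_first(program)   # add/xor/rotl on element 0, then element 1, ...
```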
212
213 Vertical-First is needed because it should be clear
214 from xchacha20 that there are ordering dependencies between the three
215 operations `add, xor, rotate`. It is not okay to perform the entire
216 suite of Vector-adds then move on to the Vector-xors then the Vector-rotates:
217 they have to be interleaved so as to respect the element-level ordering.
218 This is exactly what Vertical-First allows us to do:
219 element 0 add, element 0 xor, element 0 rotate, then element **1**
220 add, element **1** xor, element **1** rotate, and so on. With Vertical-First
221 *combined* with Indexed REMAP we can literally jump those operations around,
222 anywhere within the Vector.
223
224 ## Application of VF mode in the Xchacha20 loop
225
226 Let's assume the values `x` are stored in registers starting at GPR 24:
227
228 | GPR # | | | | |
229 |--------|-----|-----|-----|-----|
230 | 24 | x0 | x1 | x2 | x3 |
231 | 28 | x4 | x5 | x6 | x7 |
232 | 32 | x8 | x9 | x10 | x11 |
233 | 36 | x12 | x13 | x14 | x15 |
234
235 So, for the additions in Vertical-First mode, the `RT` (and `RA`, as they are
236 the same) indices are (in terms of x):
237
238 | | | | | | | | |
239 |----|----|----|----|----|----|----|----|
240 | 0 | 8 | 0 | 8 | 1 | 9 | 1 | 9 |
241 | 2 | 10 | 2 | 10 | 3 | 11 | 3 | 11 |
242 | 0 | 10 | 0 | 10 | 1 | 11 | 1 | 11 |
243 | 2 | 8 | 2 | 8 | 3 | 9 | 3 | 9 |
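
These index tables can be derived mechanically from the `QUARTERROUND` call
order. A small Python sketch (list names are illustrative) reproduces both
the `RT`/`RA` sequence above and the `RB` sequence given further below:

```
# the eight QUARTERROUND calls of one double-round, as (a, b, c, d)
quarterrounds = [(0, 4,  8, 12), (1, 5,  9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                 (0, 5, 10, 15), (1, 6, 11, 12), (2, 7,  8, 13), (3, 4,  9, 14)]

rt_idx = []   # targets of the four adds in each quarter-round: a, c, a, c
rb_idx = []   # the corresponding added-in values:              b, d, b, d
for a, b, c, d in quarterrounds:
    rt_idx += [a, c, a, c]
    rb_idx += [b, d, b, d]

print(rt_idx)   # 0, 8, 0, 8, 1, 9, 1, 9, 2, 10, ... (matches the table above)
print(rb_idx)   # 4, 12, 4, 12, 5, 13, 5, 13, ...
```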
244
245 However, since the indices are small values, using a single 64-bit
246 register for a single index value is a waste, so we will compress them,
247 packing 8 indices into each 64-bit register.
248 The `RT` indices then fit inside these 4 registers (in Little-Endian format):
249
250 | | | | | |
251 |-----------|-------------------|-------------------|-------------------|-------------------|
252 | SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |
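
The packing itself is simple: 8 byte-sized indices per 64-bit register, first
index in the least-significant byte. A short Python check (illustrative names)
reproduces the `SVSHAPE0` constants above:

```
rt_idx = [0, 8, 0, 8, 1, 9, 1, 9,    2, 10, 2, 10, 3, 11, 3, 11,
          0, 10, 0, 10, 1, 11, 1, 11,    2, 8, 2, 8, 3, 9, 3, 9]

packed = []
for i in range(0, len(rt_idx), 8):
    chunk = rt_idx[i:i + 8]
    # least-significant byte holds the first index (Little Endian)
    packed.append(sum(b << (8 * j) for j, b in enumerate(chunk)))

print([hex(v) for v in packed])
# ['0x901090108000800', '0xb030b030a020a02',
#  '0xb010b010a000a00', '0x903090308020802']
```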
253
254 Similarly we find the RB indices:
255
256 | | | | | | | | |
257 |----|----|----|----|----|----|----|----|
258 | 4 | 12 | 4 | 12 | 5 | 13 | 5 | 13 |
259 | 6 | 14 | 6 | 14 | 7 | 15 | 7 | 15 |
260 | 5 | 15 | 5 | 15 | 6 | 12 | 6 | 12 |
261 | 7 | 13 | 7 | 13 | 4 | 14 | 4 | 14 |
262
263 Using a similar method, we find the final 4 registers with the `RB` indices:
264
265 | | | | | |
266 |-----------|-------------------|-------------------|-------------------|-------------------|
267 | SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |
268
269 Now, we can construct the Vertical-First loop:
270
271 ```
272 svindex 4, 0, 1, 3, 0, 1, 0 # SVSHAPE0, add RA/RT indices
273 svindex 6, 1, 1, 3, 0, 1, 0 # SVSHAPE1, add RB indices
274 setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
275 svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
276 sv.add/w=32 *x, *x, *x # RT, RB: SHAPE0. RA: SHAPE1
277 svstep. 16, 1, 0 # step to next in-regs element
278 ```
279
280 What this code snippet does is the following:
281
282 The first instruction
283
284 ```
285 svindex 4, 0, 1, 3, 0, 1, 0
286 ```
287
288 loads the add `RT` indices for `SVSHAPE0`, starting at register 8. You will
289 note that 4 is listed, but that's because the instruction only works on even
290 registers, so in order to save a bit we have to double that number to get the
291 actual register. So, the `SVSHAPE0` indices occupy GPRs 8-11. The value
292 3 indicates that the elements will be 8 bits long: 0=64-bit, 1=32-bit,
293 2=16-bit, 3=8-bit.
294
295 The next instruction
296
297 ```
298 svindex 6, 1, 1, 3, 0, 1, 0
299 ```
300
301 loads the add `RB` indices into `SVSHAPE1`. Again, even though we list 6,
302 the actual indices are loaded starting at GPR #12 (occupying GPRs 12-15),
303 and again the use of 8-bit elements is denoted.
304 Next, the `setvl` instruction:
305
306 ```
307 setvl 0, 0, 32, 1, 1, 1
308 ```
309
310 We have to call `setvl` to set `MAXVL` and `VL` to 32 and also configure
311 Vertical-First mode. Afterwards, we have to specify how we intend
312 to use the indices, and we do this using `svremap`.
313
314 ```
315 svremap 31, 1, 0, 0, 0, 0, 0
316 ```
317
318 `svremap` basically instructs the scheduler to use `SVSHAPE0` for `RT` and `RB`,
319 `SVSHAPE1` for `RA`. The next instruction performs the **actual** addition:
320
321 ```
322 sv.add/w=32 *x, *x, *x
323 ```
324
325 Note the `/w=32` suffix. This instructs the adder to perform the operation
326 on elements of `w=32` bits. Since the Power CPU is a 64-bit CPU, this means
327 that we need to have two 32-bit elements loaded in each register. Also,
328 note that for all operands we use `*x` as the argument. This instructs
329 the scheduler to act on the registers as a vector, or a sequence of
330 elements. But even though the operands are all the same, their indices will
331 be taken from the `SVSHAPE0`/`SVSHAPE1` indices as defined previously. Also
332 note that the indices are relative to the actual register used. So,
333 if `*x` starts at GPR 24 for example, in essence this instruction will
334 issue the following sequence of operations:
335
336 ```
337 add/w=32 24 + 0, 24 + 4, 24 + 0
338 add/w=32 24 + 8, 24 + 12, 24 + 8
339 add/w=32 24 + 0, 24 + 4, 24 + 0
340 add/w=32 24 + 8, 24 + 12, 24 + 8
341 add/w=32 24 + 1, 24 + 5, 24 + 1
342 add/w=32 24 + 9, 24 + 13, 24 + 9
343 add/w=32 24 + 1, 24 + 5, 24 + 1
344 add/w=32 24 + 9, 24 + 13, 24 + 9
345 ...
346 ```
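
Since `w=32` packs two 32-bit elements into each 64-bit GPR, an element offset
in the sequence above addresses half a register. A Python sketch of that
addressing (assuming element `i` lives in GPR `24 + i//2`, low half first;
the helper names are illustrative):

```
def read_elem(gprs, base, i):
    # w=32: two 32-bit elements per 64-bit register, low half first
    word = gprs[base + i // 2]
    return (word >> (32 * (i % 2))) & 0xffffffff

def write_elem(gprs, base, i, value):
    shift = 32 * (i % 2)
    mask = 0xffffffff << shift
    gprs[base + i // 2] = (gprs[base + i // 2] & ~mask) | ((value & 0xffffffff) << shift)

# e.g. "add/w=32 24 + 0, 24 + 4, 24 + 0" then corresponds to:
#   write_elem(gprs, 24, 0,
#              read_elem(gprs, 24, 4) + read_elem(gprs, 24, 0))
```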
347
348 Finally, the `svstep.` instruction steps to the next set of indices.
349
350 We have shown how to do the additions in Vertical-First mode. Now
351 let's add the rest of the operations in the `QUARTERROUND`s. The
352 `XOR` operations of both `QUARTERROUND` groups, each of the form `d =
353 XOR(d, a)`, are:
354
355 ```
356 x12 = x12 ^ x0
357 x4 = x4 ^ x8
358 x12 = x12 ^ x0
359 x4 = x4 ^ x8
360 x13 = x13 ^ x1
361 x5 = x5 ^ x9
362 x13 = x13 ^ x1
363 x5 = x5 ^ x9
364 x14 = x14 ^ x2
365 x6 = x6 ^ x10
366 x14 = x14 ^ x2
367 x6 = x6 ^ x10
368 x15 = x15 ^ x3
369 x7 = x7 ^ x11
370 x15 = x15 ^ x3
371 x7 = x7 ^ x11
372 x15 = x15 ^ x0
373 x5 = x5 ^ x10
374 x15 = x15 ^ x0
375 x5 = x5 ^ x10
376 x12 = x12 ^ x1
377 x6 = x6 ^ x11
378 x12 = x12 ^ x1
379 x6 = x6 ^ x11
380 x13 = x13 ^ x2
381 x7 = x7 ^ x8
382 x13 = x13 ^ x2
383 x7 = x7 ^ x8
384 x14 = x14 ^ x3
385 x4 = x4 ^ x9
386 x14 = x14 ^ x3
387 x4 = x4 ^ x9
388 ```
389
390 We will need to create another set of indices for the `XOR` operations. We
391 will only need one set, as the other set of indices is the same as the `RT`
392 set for `sv.add` (`SHAPE0`). The `XOR` target indices (in terms of x) are:
393
394 | | | | | | | | |
395 |----|----|----|----|----|----|----|----|
396 | 12 | 4 | 12 | 4 | 13 | 5 | 13 | 5 |
397 | 14 | 6 | 14 | 6 | 15 | 7 | 15 | 7 |
398 | 15 | 5 | 15 | 5 | 12 | 6 | 12 | 6 |
399 | 13 | 7 | 13 | 7 | 14 | 4 | 14 | 4 |
400
401 Again, packing 8 indices per 64-bit register, we find:
402
403 | | | | | |
404 |-----------|-------------------|-------------------|-------------------|-------------------|
405 | SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |
406
407 The next operation is the `ROTATE`, which takes as operands the result of the
408 `XOR` and a shift argument. You can easily see that the indices used in this
409 case are the same as those of the `XOR`. However, the shift values cycle every
410 4 elements: 16, 12, 8, 7. For the indices we can again use `svindex`, like this:
411
412 ```
413 svindex 8, 2, 1, 3, 0, 1, 0
414 ```
415
416 This again means `SVSHAPE2`, operating on 8-bit elements, starting
417 from GPR #16 (`8*2`). For the shift values, which cycle every 4 elements,
418 the `svshape2` instruction will be used:
419
420 ```
421 svshape2 0, 0, 3, 4, 0, 1
422 ```
423
424 This will create an `SVSHAPE3`, which will use a modulo-4 schedule for its
425 elements. Now we can list both the `XOR` and `ROTATE` instructions in assembly,
426 together with their respective `svremap` instructions:
427
428 ```
429 svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
430 sv.xor/w=32 *x, *x, *x
431 svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
432 sv.rldcl/w=32 *x, *x, *SHIFTS, 0
433 ```
434
435 So, in a similar fashion, we instruct `XOR` (`sv.xor`) to use `SVSHAPE2` for
436 `RA` and `RS` and `SVSHAPE0` for `RB`, again for 32-bit elements, while `ROTATE`
437 (`sv.rldcl`) will also use `SVSHAPE2` for `RA` and `RS`, but `SVSHAPE3` for `RB`
438 (the shift values, which cycle every 4 elements). Note that the shift values
439 addressed via `SVSHAPE3` have to be stored as 32-bit elements:
440
441 | | | |
442 |---------|--------------------|--------------------|
443 | SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |
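
As with the byte-packed index registers, a quick Python check confirms that
these two 64-bit values hold the four shift amounts 16, 12, 8, 7 as 32-bit
elements in Little-Endian order:

```
shifts = [16, 12, 8, 7]

packed = []
for i in range(0, len(shifts), 2):
    lo, hi = shifts[i], shifts[i + 1]
    packed.append(lo | (hi << 32))   # low 32-bit element first

print([f"{v:#018x}" for v in packed])
# ['0x0000000c00000010', '0x0000000700000008']
```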
444
445 The complete algorithm for a loop with 10 iterations is as follows:
446
447 ```
448 li 7, 10 # Load value 10 into GPR #7
449 mtctr 7 # Set up counter on GPR #7
450
451 # set up VL=32 vertical-first, and SVSHAPEs 0-3
452 setvl 0, 0, 32, 1, 1, 1
453 # SHAPE0, used by sv.add starts at GPR #8
454 svindex 8/2, 0, 1, 3, 0, 1, 0 # SVSHAPE0, a
455 # SHAPE1, used by sv.xor starts at GPR #12
456 svindex 12/2, 1, 1, 3, 0, 1, 0 # SVSHAPE1, b
457 # SHAPE2, used by sv.rldcl starts at GPR #16
458 svindex 16/2, 2, 1, 3, 0, 1, 0 # SVSHAPE2, c
459 # SHAPE3, used also by sv.rldcl to hold the shift values, starts at GPR #20
460 # The inner loop will do 32 iterations, but there are only
461 # 4 shift values, so we mod by 4, and can cycle through them
462 svshape2 0, 0, 3, 4, 0, 1 # SVSHAPE3, shift amount, mod4
463 .outer:
464 # outer loop begins here (standard CTR loop)
465 setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
466 # inner loop begins here. add-xor-rotl32 with remap, step, branch
467 .inner:
468 svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
469 sv.add/w=32 *x, *x, *x
470 svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
471 sv.xor/w=32 *x, *x, *x
472 svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
473 sv.rldcl/w=32 *x, *x, *SHIFTS, 0
474 # Note: using GPR #16 as the destination of svstep. would overlap
475 # with SHAPE2 (which starts at GPR #16) and corrupt its first 8
476 # indices, so GPR #7 (free after mtctr) is used instead.
477 svstep. 7, 1, 0 # step to next in-regs element
478 bc 6, 3, .inner # svstep. Rc=1 loop-end-condition?
479 # inner-loop done: outer loop standard CTR-decrement to setvl again
480 bdnz .outer # Loop until CTR is zero
481 ```
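
As a cross-check of the Schedules (rather than of the assembler itself), here
is a small Python reference model - a sketch, with illustrative names - that
walks the three index lists exactly as the Vertical-First inner loop does:
one add, one xor, one rotate per step, 32 steps per pass, 10 outer passes,
and compares the result against a directly unrolled double-round. The index
lists are the ones derived earlier in this document.

```
def rotl32(v, n):
    return ((v << n) | (v >> (32 - n))) & 0xffffffff

quarterrounds = [(0, 4,  8, 12), (1, 5,  9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                 (0, 5, 10, 15), (1, 6, 11, 12), (2, 7,  8, 13), (3, 4,  9, 14)]

# build the Schedules: add targets (SVSHAPE0), add sources (SVSHAPE1),
# xor/rotate targets (SVSHAPE2), and the mod-4 shift table (via SVSHAPE3)
add_rt, add_rb, xor_rt = [], [], []
for a, b, c, d in quarterrounds:
    add_rt += [a, c, a, c]
    add_rb += [b, d, b, d]
    xor_rt += [d, b, d, b]
shifts = [16, 12, 8, 7]

def rounds_via_schedules(x):
    x = list(x)
    for _ in range(10):                  # outer CTR loop
        for i in range(32):              # inner Vertical-First loop, VL=32
            x[add_rt[i]] = (x[add_rt[i]] + x[add_rb[i]]) & 0xffffffff
            x[xor_rt[i]] ^= x[add_rt[i]]
            x[xor_rt[i]] = rotl32(x[xor_rt[i]], shifts[i % 4])
    return x

def rounds_unrolled(x):
    x = list(x)
    for _ in range(10):
        for a, b, c, d in quarterrounds:
            x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 16)
            x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 12)
            x[a] = (x[a] + x[b]) & 0xffffffff; x[d] = rotl32(x[d] ^ x[a], 8)
            x[c] = (x[c] + x[d]) & 0xffffffff; x[b] = rotl32(x[b] ^ x[c], 7)
    return x

state = list(range(1, 17))               # arbitrary 16-word test state
assert rounds_via_schedules(state) == rounds_unrolled(state)
```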
482
483