# Spectre Plan

As the previous update explained, we have a massive spanner in the
works. It hits not just this design but absolutely every single
out-of-order processor: the problems associated with timing attacks
that probe resource contention stem from the out-of-order paradigm
itself, not from a particular vendor or one particular processor. It's
**all** out-of-order processors, period.

To illustrate: if a vendor decides to have a single divide ALU shared
across multiple cores, arbitrary untrusted processes can issue divide
operations, and time them, to find out if **other** cores are trying to
use the (shared) divide ALU resource.

Likewise, if there is limited bandwidth on operand forwarding, an
arbitrary untrusted process may issue a series of instructions that are
specifically designed to be chained together so as to trigger operand
forwarding and use up all the available bandwidth of the Operand
Forwarding Bus. If the completion time is then not as expected, the
attacker knows that another process tried to use the same Bus.
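
To make the contention argument concrete, here is a minimal Python sketch
of the shared-divider case. It is a toy model, not a description of any
real core: the 4-cycle latency, the round-robin arbitration and all the
names are assumptions made purely for illustration.

```python
from collections import deque

# Hypothetical latency, in cycles, of one divide on the shared unit.
DIVIDE_LATENCY = 4


def attacker_completion_time(attacker_ops, victim_ops):
    """Cycles until the attacker's last divide retires on a shared divider.

    Assumes a non-pipelined unit with round-robin arbitration in which
    the victim is served first whenever both have a divide pending.
    """
    queues = {
        "victim": deque(range(victim_ops)),
        "attacker": deque(range(attacker_ops)),
    }
    cycle = 0
    while queues["attacker"]:
        for owner in ("victim", "attacker"):
            if queues[owner]:
                queues[owner].popleft()
                cycle += DIVIDE_LATENCY
    return cycle


def victim_was_active(observed_cycles, attacker_ops):
    """The attacker's inference: anything slower than the uncontended time."""
    return observed_cycles > attacker_ops * DIVIDE_LATENCY


quiet = attacker_completion_time(attacker_ops=8, victim_ops=0)
busy = attacker_completion_time(attacker_ops=8, victim_ops=3)
print(quiet, victim_was_active(quiet, 8))
print(busy, victim_was_active(busy, 8))
```

The attacker never reads the victim's data directly. The only thing it
observes is its own completion time, which is exactly why this class of
leak is so hard to close off.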

We think we have a solution to this: a "Speculation Fence" instruction
(or "hint", as they are known). The idea is that, before an arbitrary
untrusted process is permitted to run, a special instruction is called
that *clears the decks*, resetting the Out-of-Order execution engine
back to a known, quiescent state. Thus, there *is* no information
to leak to the attacker.

We will also need all system calls, traps and interrupts to automatically
act as speculation fence points. We can also look at doing a "graded"
shutdown of speculation and resource allocation, on the basis that
if it is known in advance that a system call is coming up, there is
no point issuing speculative instructions or using out-of-order resources
if they are about to be cancelled within 5-10 instructions!
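
As a rough behavioural sketch of the fence idea (not the hardware design,
which does not exist yet), the Python below models an engine whose
speculative state is wiped by a fence, whose traps fence automatically,
and which stops issuing speculatively when a system call is known to be
close. The class, its methods and the lookahead of 10 instructions (taken
from the top of the 5-10 range above) are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical lookahead: stop speculating this close to a known system call.
FENCE_LOOKAHEAD = 10


@dataclass
class ToyOoOEngine:
    """Toy stand-in for the out-of-order engine's speculative state."""

    in_flight: list = field(default_factory=list)

    def issue(self, op, instrs_until_syscall=None):
        """Try to issue `op` speculatively; return True if it was issued."""
        # "Graded" shutdown: no point speculating if a fence is imminent.
        if (instrs_until_syscall is not None
                and instrs_until_syscall <= FENCE_LOOKAHEAD):
            return False
        self.in_flight.append(op)
        return True

    def speculation_fence(self):
        """Clear the decks: drop all speculative state."""
        self.in_flight.clear()

    def trap(self, cause):
        """System calls, traps and interrupts are implicit fence points."""
        self.speculation_fence()
        return cause

    def is_quiescent(self):
        return not self.in_flight


engine = ToyOoOEngine()
engine.issue("mul")
engine.issue("div", instrs_until_syscall=3)  # refused: syscall is too close
engine.trap("ecall")
print(engine.is_quiescent())
```

After the trap there is nothing left in the engine for the next process
to probe, which is the whole point of the fence.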

The alternatives... well, they don't work. A software-only solution
("fixing" Spectre in the Linux kernel) has got so complicated and has
so badly affected performance that Linus Torvalds recently put his foot
down and refused to allow "yet another Spectre patch". A hardware-only
solution *also* isn't good enough, as it basically involves degrading
performance back to that of a **single-issue in-order** machine.

We feel that the "cooperative" approach is a reasonable compromise that
is also simple and straightforward to implement in both hardware and
software. It will be a lot of work; however, at least we can put the
underpinnings in place (in the hardware).

# 48-bit Instruction Extension

Jacob raised an idea to do
[extension prefixes](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000316.html) on Simple-V.
It's a really good idea, one that I was hoping would not be necessary. It
comes down to the fact that it takes a bit more than was anticipated to
do the setup and teardown of the Vectorisation Engine.

So the plan is to have a couple of prefixes, one 16-bit and one 32-bit, that
"extend" both Compressed (16-bit) and standard (32-bit) instructions, turning
them into "one-off" Vector Instructions. There are two problems. Firstly,
that extends the instruction encoding, which in turn complicates the
instruction decode phase. Secondly, we may have to use the 48-bit encoding
space, which takes up a whopping six of the available 16 bits, which in
turn puts a huge amount of pressure on what can actually be extended.
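
To show where those six bits go, here is a small Python sketch of
instruction-length detection. It assumes the standard RISC-V
variable-length encoding scheme, in which the low bits of the first
16-bit parcel mark the length. The function and constant names are
ours, purely for illustration.

```python
# Number of low bits that mark an instruction as 48-bit (pattern 011111).
MARKER_BITS_48 = 6


def instruction_length_bits(first_halfword):
    """Instruction length implied by the first 16-bit parcel, or None."""
    if (first_halfword & 0b11) != 0b11:
        return 16  # Compressed: low two bits are not 11
    if (first_halfword & 0b11100) != 0b11100:
        return 32  # Standard: bits 4:2 are not 111
    if (first_halfword & 0b111111) == 0b011111:
        return 48
    if (first_halfword & 0b1111111) == 0b0111111:
        return 64
    return None  # longer or reserved encodings are not modelled here


# Bits of the first parcel left over once the 48-bit marker is paid for.
free_bits = 16 - MARKER_BITS_48
print(instruction_length_bits(0x001F), free_bits)
```

The decoder has to look at up to seven bits before it even knows how
long the instruction is, which is one reason that extending the encoding
complicates the decode phase.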

For example: if 2 bits are allocated to extend 5-bit register numbers
out to 7 bits, that allows us to access the full range of 128 integer
and FP registers needed for a GPU and VPU. Unfortunately, we need 2 bits
for rs1, 2 bits for rs2, 2 bits for rs3 and 2 bits for rd. That's 8 bits
already, and we haven't gotten to VL (Vector Length), the element
width (setting 8/16/32/64 bit), or predication.
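
The arithmetic is easy to check with a short Python sketch. Placing the
two extra bits above the existing 5-bit field is an assumption made here
for illustration. Nothing has been decided about the actual layout, and
the names are hypothetical.

```python
BASE_REG_BITS = 5           # standard register field width
EXTRA_BITS_PER_OPERAND = 2  # proposed extension per register field
OPERANDS = ("rs1", "rs2", "rs3", "rd")


def extended_register(extra_bits, base_reg):
    """Combine 2 prefix bits with a 5-bit register number (assumed layout)."""
    if not 0 <= extra_bits < (1 << EXTRA_BITS_PER_OPERAND):
        raise ValueError("extra_bits must fit in 2 bits")
    if not 0 <= base_reg < (1 << BASE_REG_BITS):
        raise ValueError("base_reg must fit in 5 bits")
    return (extra_bits << BASE_REG_BITS) | base_reg


addressable = 1 << (BASE_REG_BITS + EXTRA_BITS_PER_OPERAND)
prefix_bits_needed = EXTRA_BITS_PER_OPERAND * len(OPERANDS)
print(addressable, prefix_bits_needed, extended_register(0b11, 31))
```

Seven bits reach 128 registers, but paying for that independently on all
four operands costs 8 prefix bits before anything else is encoded.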

If doing a 32-bit prefix, the result actually needs to be either a 48-bit
encoding or a 64-bit encoding, depending on whether a 16-bit "Compressed"
instruction or a 32-bit standard instruction is to be prefixed.

There is an alternative: for the 16-bit prefix, there happens to be
a Compressed major opcode that is not being used (bits 13-15 equal to 100,
bits 0-1 equal to 00). This gives 11 bits spare (where a 48-bit encoding
can only squeeze out 8, maybe 9). It also has one significant advantage:
as it is actually a standard "C" opcode, it can be done as macro-op fusion.
That in turn means that modifications to the compiler toolchain are a lot
less significant.
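
A decoder check for that unused opcode might look like the following
Python sketch. The bit positions come straight from the text above. The
function names, and the idea of treating bits 2-12 as one opaque payload,
are assumptions for illustration only.

```python
PAYLOAD_BITS = 16 - 3 - 2  # 16 bits minus bits 13-15 and bits 0-1 = 11


def is_prefix_candidate(halfword):
    """True if bits 13-15 are 100 and bits 0-1 are 00."""
    return ((halfword >> 13) & 0b111) == 0b100 and (halfword & 0b11) == 0b00


def prefix_payload(halfword):
    """Return the 11 spare bits (bits 2-12) of a candidate prefix."""
    if not is_prefix_candidate(halfword):
        raise ValueError("not in the unused Compressed opcode")
    return (halfword >> 2) & ((1 << PAYLOAD_BITS) - 1)


print(is_prefix_candidate(0x9FFC), hex(prefix_payload(0x9FFC)))
```

Because the prefix is an ordinary 16-bit parcel, a decoder that does not
fuse it with the following instruction still sees two well-formed
instructions, which is what makes the macro-op fusion route attractive.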

With 11 available bits, things start to look a lot better. For 32-bit
opcodes, 2 bits can be prepended to a 5-bit destination, with 2 more bits
for all source registers. Add 2 bits for Vector Length (VL=1/2/3/4), and
2 bits for the element width (8/16/32/64). That leaves 3 spare bits for
specifying predication, *or*, if prefixing 16-bit "Compressed"
instructions, they could be used to extend some of the operations that
only have 3-bit registers, by another 2 bits.
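
The budget can be tallied with a short Python sketch, parameterised on
the number of bits available so that different starting points can be
compared. The field order, the field names, the VL-minus-one encoding and
the reading that one 2-bit field is shared by all source registers are
assumptions for illustration, not a proposed layout.

```python
# (field name, width in bits), most significant first. Hypothetical layout.
FIELDS = (
    ("rd_extension", 2),
    ("source_extension", 2),  # assumed shared by all source registers
    ("vector_length", 2),     # VL = 1..4, stored as VL - 1
    ("element_width", 2),     # index into ELEMENT_WIDTHS
)
ELEMENT_WIDTHS = (8, 16, 32, 64)


def spare_bits(available_bits):
    """Bits left over for predication or 3-bit register extension."""
    used = sum(width for _, width in FIELDS)
    if used > available_bits:
        raise ValueError("fields do not fit")
    return available_bits - used


def pack_fields(rd_ext, src_ext, vl, element_width):
    """Pack the four 2-bit fields into one 8-bit value."""
    if rd_ext not in range(4) or src_ext not in range(4):
        raise ValueError("register extensions must fit in 2 bits")
    if not 1 <= vl <= 4:
        raise ValueError("VL must be 1, 2, 3 or 4")
    width_code = ELEMENT_WIDTHS.index(element_width)
    return (rd_ext << 6) | (src_ext << 4) | ((vl - 1) << 2) | width_code


print(spare_bits(11), spare_bits(12), hex(pack_fields(0b11, 0b01, 4, 64)))
```

The four fields take 8 bits between them, so `spare_bits(11)` gives 3 and
`spare_bits(12)` gives 4. Either way, predication has to fit into very
little room.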

It's quite complex and is going to need a lot of thought. Some compromises
need to be made, the issue being that we won't know what the best choices
are until we have a better handle on things, through simulations and
comprehensive analysis.

Designing processors is tricky!