updates/011_2019jan16_spectre_plan.mdwn

# Spectre Plan

The previous update left us with a massive spanner in the works, and it
hits not just this design but every single out-of-order processor.
Timing attacks that probe resource congestion are a consequence of the
out-of-order paradigm itself, not of a particular vendor or one
particular processor: it's **all** out-of-order processors, period.

To illustrate: if a vendor decides to have a single divide ALU shared
across multiple cores, arbitrary untrusted processes can issue divide
operations to find out if **other** cores are trying to use the (shared)
divide ALU resource.

Similarly, if there is limited bandwidth on operand forwarding, an
arbitrary untrusted process may issue a series of instructions that are
specifically designed to be chained together so as to trigger operand
forwarding and use up all the available bandwidth of the Operand
Forwarding Bus.  If the completion time is then not as expected, the
attacker knows that another process tried to use the same Bus.

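The principle behind both examples can be sketched with a toy model.
This is not a model of any real pipeline: the function name, the
one-operation-per-cycle unit and the "victim always wins arbitration"
rule are all assumptions made purely for illustration.

```python
def probe_time(attacker_ops, victim_busy_cycles):
    """Cycles needed for the attacker to complete `attacker_ops` operations
    on a shared unit that serves one operation per cycle.

    `victim_busy_cycles` is the set of cycle numbers in which the victim
    holds the unit (hypothetical rule: the victim always wins arbitration).
    """
    cycle = 0
    completed = 0
    while completed < attacker_ops:
        if cycle not in victim_busy_cycles:
            completed += 1          # the attacker gets the unit this cycle
        cycle += 1                  # otherwise the attacker simply stalls
    return cycle


def victim_was_active(measured, attacker_ops):
    """The attacker only needs to compare against the uncontended time."""
    return measured > attacker_ops


if __name__ == "__main__":
    quiet = probe_time(8, set())
    contended = probe_time(8, {2, 3, 5})
    print(quiet, victim_was_active(quiet, 8))
    print(contended, victim_was_active(contended, 8))
```

The attacker never reads the victim's data directly: the extra cycles
alone reveal that somebody else was using the shared resource.
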
We think we have a solution to this: a "Speculation Fence" instruction
(or "hint", as they are known).  The idea is that, before an arbitrary
untrusted process is permitted to run, a special instruction is called
that *clears the decks*, resetting the Out-of-Order execution engine
back to a known, quiescent state.  Thus, there *is* no information
to leak to the attacker.

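As a rough sketch of what "clearing the decks" means, here is a toy
model in Python.  The class, its fields and the method names are all
hypothetical stand-ins for microarchitectural state; real hardware
would have to drain or cancel in-flight work rather than just zero a
few counters.

```python
from dataclasses import dataclass, field


@dataclass
class OoOEngine:
    """Hypothetical stand-in for state that timing could reveal."""
    in_flight: list = field(default_factory=list)  # issued, not committed
    forwarding_slots_used: int = 0
    divider_busy: bool = False

    def issue(self, op):
        self.in_flight.append(op)
        self.forwarding_slots_used += 1
        if op == "div":
            self.divider_busy = True

    def speculation_fence(self):
        # Assumption: the fence returns everything to a quiescent state.
        self.in_flight.clear()
        self.forwarding_slots_used = 0
        self.divider_busy = False

    def observable_state(self):
        return (len(self.in_flight), self.forwarding_slots_used,
                self.divider_busy)


def switch_to_untrusted(engine):
    """Fence first, then hand the engine over to the untrusted process."""
    engine.speculation_fence()
    return engine.observable_state()


if __name__ == "__main__":
    engine = OoOEngine()
    engine.issue("div")
    engine.issue("add")
    print(engine.observable_state())      # state left by the trusted code
    print(switch_to_untrusted(engine))    # what the untrusted code can see
```

Whatever the trusted code was doing, the untrusted process always
starts from the same state, so there is nothing for it to measure.
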
We will also need all system calls, traps and interrupts to automatically
act as speculation fence points.  We can also look at doing a "graded"
shutdown of speculation and resource allocation, on the basis that
if it is known in advance that a system call is coming up, there is
no point issuing speculative instructions or using out-of-order resources
if they are about to be cancelled within 5-10 instructions!

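One possible shape for a "graded" shutdown, sketched in Python.  The
policy here (shrink the speculation window linearly once a fence point
is known to be close) is an assumption for illustration only, as are
the names and the default numbers; nothing has been decided yet.

```python
# Events that would automatically count as speculation fence points.
FENCE_EVENTS = {"fence", "syscall", "trap", "interrupt"}


def is_fence_point(event):
    return event in FENCE_EVENTS


def speculation_window(distance_to_fence, full_window=16, cutoff=10):
    """Maximum number of speculative instructions to allow in flight.

    `distance_to_fence` is the number of instructions until the next
    known fence point, or None if no fence point is known in advance.
    `full_window` and `cutoff` are hypothetical tuning parameters.
    """
    if distance_to_fence is None or distance_to_fence > cutoff:
        return full_window
    # Close to the fence: never speculate past it.
    return min(full_window, max(distance_to_fence, 0))


if __name__ == "__main__":
    for distance in (None, 50, 10, 3, 0):
        print(distance, speculation_window(distance))
```
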
The alternatives... well, they don't work.  A software-only solution
("fixing" Spectre in the Linux kernel) has got so complicated and has
so badly affected performance that Linus Torvalds recently put his foot
down and refused to allow "yet another Spectre patch".  A hardware-only
solution *also* isn't good enough, as it basically involves degrading
performance back to that of a **single-issue in-order** machine.

The "cooperative" approach, we feel, is a reasonable compromise that is
also simple and straightforward to implement in both hardware and software.
It will be a lot of work; however, at least we can put the underpinnings
in place (in the hardware).

# 48-bit Instruction Extension

Jacob raised an idea to do
[extension prefixes](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000316.html) on Simple-V.
It's a really good idea, one that I was hoping would not be necessary.  It
comes down to the fact that the setup and teardown of the Vectorisation
Engine takes a bit more than was anticipated.

So the plan is to have a couple of prefixes, one 16-bit and one 32-bit, that
"extend" both Compressed (16-bit) and standard (32-bit) instructions, turning
them into "one-off" Vector Instructions.  There are two problems.  Firstly,
that extends the instruction encoding, which in turn complicates the
instruction decode phase.  Secondly, we may have to use the 48-bit encoding
space, which takes up a whopping six of the available 16 bits,
which in turn puts a huge amount of pressure on what can actually be
extended.

For example: if 2 bits are allocated to extend 5-bit register numbers
out to 7 bits, that allows us to access the full range of 128 integer
and FP registers needed for a GPU and VPU.  Unfortunately, we need 2 bits
for rs1, 2 bits for rs2, 2 bits for rs3 and 2 bits for rd.  That's 8 bits
already, and we haven't gotten to VL (Vector Length), the element
width (setting 8/16/32/64 bit), or predication.

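The arithmetic is simple enough to write down.  In this sketch the
constant and function names are made up, and treating the "available
16 bits" as a single 16-bit prefix parcel is an assumption; the
resulting figure is only an upper bound on what is left.

```python
PREFIX_BITS = 16          # assumption: one 16-bit prefix parcel
LENGTH_MARKER_BITS = 6    # taken up by the 48-bit encoding space
BASE_REG_BITS = 5         # standard register numbers
EXTRA_REG_BITS = 2        # extension bits per register operand
OPERANDS = ("rs1", "rs2", "rs3", "rd")


def register_count(base_bits=BASE_REG_BITS, extra_bits=EXTRA_REG_BITS):
    """Number of registers addressable with the extended field."""
    return 2 ** (base_bits + extra_bits)


def operand_extension_bits(operands=OPERANDS, extra_bits=EXTRA_REG_BITS):
    """Prefix bits consumed if every operand gets its own extension."""
    return len(operands) * extra_bits


def upper_bound_free_bits():
    """Bits left in the prefix parcel after the length marker, at most."""
    return PREFIX_BITS - LENGTH_MARKER_BITS


if __name__ == "__main__":
    print("registers:", register_count())
    print("operand extension bits:", operand_extension_bits())
    print("free bits (upper bound):", upper_bound_free_bits())
    print("left for VL, elwidth, predication (upper bound):",
          upper_bound_free_bits() - operand_extension_bits())
```
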
A 32-bit prefix actually needs to be either a 48-bit encoding or a
64-bit encoding, depending on whether a 16-bit "Compressed" instruction
or a 32-bit standard instruction is to be prefixed.

There is an alternative: for the 16-bit prefix, there happens to be
a Compressed major opcode that is not being used (bits 13-15 equal to 100,
bits 0-1 equal to 00).  This gives 11 bits spare (where a 48-bit encoding
can only squeeze out 8, maybe 9).  It also has one significant advantage:
as it is actually a standard "C" opcode, it can be done as macro-op fusion.
That in turn means that modifications to the compiler toolchain are a lot
less significant.

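Recognising such a prefix is cheap.  A minimal sketch in Python,
assuming only the bit positions given above; the function names are
hypothetical, and placing the 11 spare bits in bits 2-12 simply
follows from the other five bits being fixed.

```python
def is_prefix_parcel(parcel):
    """True if a 16-bit parcel uses the unused "C" major opcode:
    bits 13-15 equal to 100 and bits 0-1 equal to 00."""
    return ((parcel >> 13) & 0b111) == 0b100 and (parcel & 0b11) == 0b00


def prefix_payload(parcel):
    """The 11 spare bits (bits 2-12) of a prefix parcel."""
    return (parcel >> 2) & 0x7FF


def make_prefix(payload):
    """Build a prefix parcel carrying an 11-bit payload."""
    if not 0 <= payload <= 0x7FF:
        raise ValueError("payload must fit in 11 bits")
    return (0b100 << 13) | (payload << 2)


if __name__ == "__main__":
    parcel = make_prefix(0x155)
    print(hex(parcel), is_prefix_parcel(parcel), hex(prefix_payload(parcel)))
```
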
With 11 available bits, things start to look a lot better.  For 32-bit
opcodes, 2 bits can be prepended to a 5-bit destination, with 2 more bits
shared across all source registers.  Add 2 bits for Vector Length
(VL=1/2/3/4) and 2 bits for the element width (8/16/32/64).  That leaves
3 spare bits for specifying predication, *or*, if prefixing 16-bit
"Compressed" instructions, they could be used to extend some of the
operations that only have 3-bit registers, by another 2 bits.

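To make that concrete, here is one way the four 2-bit fields could be
packed and a register number extended.  The field order, the names and
the choice to store VL as VL-1 are all hypothetical: no layout has
been settled.

```python
ELWIDTHS = (8, 16, 32, 64)   # element widths selectable with 2 bits


def pack_fields(rd_hi, rs_hi, vl, elwidth):
    """Pack the four 2-bit fields into 8 bits (hypothetical layout):
    bits 7-6 rd extension, bits 5-4 source extension,
    bits 3-2 VL-1, bits 1-0 element width index."""
    if not (0 <= rd_hi < 4 and 0 <= rs_hi < 4):
        raise ValueError("register extensions are 2 bits each")
    if not 1 <= vl <= 4:
        raise ValueError("VL must be 1, 2, 3 or 4")
    return (rd_hi << 6) | (rs_hi << 4) | ((vl - 1) << 2) \
        | ELWIDTHS.index(elwidth)


def unpack_fields(bits):
    """Inverse of pack_fields."""
    return ((bits >> 6) & 3, (bits >> 4) & 3,
            ((bits >> 2) & 3) + 1, ELWIDTHS[bits & 3])


def extend_reg(hi2, reg5):
    """Prepend 2 extension bits to a 5-bit register number."""
    return (hi2 << 5) | reg5


if __name__ == "__main__":
    bits = pack_fields(rd_hi=0b11, rs_hi=0b01, vl=4, elwidth=32)
    print(bin(bits), unpack_fields(bits))
    print(extend_reg(0b11, 31))   # highest of the 128 registers
```
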
It's quite complex and is going to need a lot of thought.  Some compromises
need to be made; the issue is that we won't know what the best choices
are until we have a better handle on things, through simulations and
comprehensive analysis.

Designing processors is tricky!