# Spectre Plan

So from the previous update, we had a massive spanner in the works, and it is hitting not just this design: timing attacks that probe resource contention are inherent to the out-of-order paradigm itself, not to one particular vendor or one particular processor. It's **all** out-of-order processors, period.

To illustrate: if a vendor decides to have a single divide ALU shared across multiple cores, arbitrary untrusted processes can issue divide operations to find out whether **other** cores are trying to use the (shared) divide ALU resource. Likewise, if there is limited bandwidth on operand forwarding, an arbitrary untrusted process may issue a series of instructions specifically designed to be chained together so as to trigger operand forwarding and use up all the available bandwidth of the Operand Forwarding Bus; if the completion time is not as expected, the attacker knows that another process tried to use the same Bus.

We think we have a solution to this: a "Speculation Fence" instruction (or "hint", as they are known). The idea is, before an arbitrary untrusted process is permitted to run, to call a special instruction that *clears the decks*, resetting the Out-of-Order execution engine back to a known, quiescent state. Thus, there *is* no information to leak to the attacker. We will also need all system calls, traps and interrupts to automatically be speculation fence points. We can also look at doing a "graded" shutdown of speculation and resource allocation, on the basis that if it is known in advance that a system call is coming up, there is no point issuing speculative instructions or tying up out-of-order resources if they are about to be cancelled within 5-10 instructions!

The alternatives... well, they don't work. A software-only solution ("fixing" Spectre in the Linux kernel) has become so complicated and has so badly affected performance that Linus Torvalds recently put his foot down and refused to allow "yet another Spectre patch". A hardware-only solution *also* isn't good enough, as it basically involves degrading performance back to that of a **single-issue in-order** machine. The "cooperative" approach, we feel, is a reasonable compromise that is also simple and straightforward to implement in both hardware and software. It will be a lot of work; however, at least we can put the underpinnings in place (in the hardware).

# 48-bit Instruction Extension

Jacob raised an idea to do [extension prefixes](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000316.html) on Simple-V. It's a really good idea, one that I was hoping would not be necessary. It comes down to the fact that the setup and teardown of the Vectorisation Engine takes a bit more than was anticipated. So the plan is to have a couple of prefixes, one 16-bit and one 32-bit, that "extend" both Compressed (16-bit) and standard (32-bit) instructions, turning them into "one-off" Vector Instructions.

There are two problems. Firstly, this extends the instruction encoding, which in turn complicates the instruction decode phase. Secondly, we may have to use the 48-bit encoding space, which takes up a whopping six of the available 16 bits, and that in turn puts a huge amount of pressure on what can actually be extended. For example: if 2 bits are allocated to extend each 5-bit register number out to 7 bits, that allows us to access the full 128-register integer and FP range needed for a GPU and VPU (a rough sketch of the idea follows below).
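As a minimal sketch of that register-number extension (the actual prefix field positions are not yet decided, so everything about the layout here is purely hypothetical), 2 prefix bits simply become the top bits of a 7-bit register number:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch only: combine a 2-bit prefix field with the
 * standard 5-bit register field to form a 7-bit register number
 * (0-127).  Where the 2 prefix bits actually live is undecided. */
static uint8_t extend_regnum(uint8_t prefix2, uint8_t reg5)
{
    return (uint8_t)(((prefix2 & 0x3u) << 5) | (reg5 & 0x1fu));
}

int main(void)
{
    /* e.g. prefix bits 0b10 on top of base register x17 -> register 81 */
    printf("extended register = %u\n", extend_regnum(0x2, 17));
    return 0;
}
```

The extension itself is simple enough; the trouble, as the bit-counting below shows, is how many of these fields have to fit into the prefix at once.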
Unfortunately, we need 2 bits for rs1, 2 bits for rs2, 2 bits for rs3 and 2 bits for rd. That's 8 bits already, and we haven't yet gotten to VL (Vector Length), the element width (setting 8/16/32/64 bit), or predication. Also, if doing a 32-bit prefix, the result actually needs to be either a 48-bit encoding or a 64-bit encoding, depending on whether a 16-bit "Compressed" instruction or a 32-bit standard instruction is being prefixed.

There is an alternative: for the 16-bit prefix, there happens to be a Compressed major opcode that is not being used (bits 13-15 equal to 100, bits 0-1 equal to 00). This gives 11 bits spare (where a 48-bit encoding can only squeeze out 8, maybe 9). It also has one significant advantage: as it is actually a standard "C" opcode, it can be done as macro-op fusion, which in turn means that the modifications to the compiler toolchain are a lot less significant.

With 12 available bits, things start to look a lot better. For 32-bit opcodes, 2 bits can be prepended to the 5-bit destination, 2 more bits covering all source registers, 2 bits for Vector Length (VL=1/2/3/4), and 2 bits for the element width (8/16/32/64). That leaves 4 spare bits for specifying predication, *or*, when prefixing 16-bit "Compressed" instructions, those bits could instead be used to extend some of the operations that only have 3-bit register fields by another 2 bits.

It's quite complex and is going to need a lot of thought. Some compromises need to be made, the issue being that we won't know what the best choices are until we have a better handle on things, through simulations and comprehensive analysis. Designing processors is tricky!
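To make the bit-budget juggling a little more concrete, here is a rough decode sketch for a 16-bit prefix sitting in the unused Compressed major opcode. Only the opcode check (bits 13-15 = 100, bits 0-1 = 00) comes from the discussion above; every field position in the remaining 11 bits is a placeholder guess, not a decided layout:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fields carried by the 16-bit "vector prefix".
 * None of these positions or widths have been decided yet. */
typedef struct {
    uint8_t rd_ext;   /* 2 bits prepended to the 5-bit destination */
    uint8_t rs_ext;   /* 2 bits covering the source registers      */
    uint8_t vl;       /* 2 bits: Vector Length 1/2/3/4             */
    uint8_t elwidth;  /* 2 bits: element width 8/16/32/64          */
    uint8_t pred;     /* whatever bits are left: predication       */
} vec_prefix_t;

/* Does this 16-bit parcel sit in the unused C major opcode
 * (bits 15-13 == 100, bits 1-0 == 00)? */
static bool is_vec_prefix(uint16_t parcel)
{
    return ((parcel >> 13) & 0x7u) == 0x4u && (parcel & 0x3u) == 0x0u;
}

/* Pull the hypothetical fields out of the 11 free bits (bits 2-12). */
static vec_prefix_t decode_vec_prefix(uint16_t parcel)
{
    vec_prefix_t p;
    p.rd_ext  = (parcel >> 2)  & 0x3u;
    p.rs_ext  = (parcel >> 4)  & 0x3u;
    p.vl      = (parcel >> 6)  & 0x3u;
    p.elwidth = (parcel >> 8)  & 0x3u;
    p.pred    = (parcel >> 10) & 0x7u;  /* only 3 bits remain of the 11 */
    return p;
}

int main(void)
{
    uint16_t parcel = 0x808c;  /* example: opcode 100/00, rd_ext=3, vl=2 */
    if (is_vec_prefix(parcel)) {
        vec_prefix_t p = decode_vec_prefix(parcel);
        printf("rd_ext=%u rs_ext=%u vl=%u elwidth=%u pred=%u\n",
               p.rd_ext, p.rs_ext, p.vl, p.elwidth, p.pred);
    }
    return 0;
}
```

Note that with only 11 free bits in this opcode, giving 2 bits each to the destination extension, the source extension, VL and the element width leaves 3 bits rather than 4 for predication; exactly where that sort of compromise lands is precisely what the simulations and analysis will have to tell us.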