# Spectre Plan

So from the previous update, we had a massive spanner in the works, and it is hitting not just this design: timing attacks that probe resource contention are inherent to the out-of-order paradigm itself, not to one particular vendor or one particular processor. It's **all** out-of-order processors, period.

To illustrate: if a vendor decides to have a single divide ALU shared across multiple cores, arbitrary untrusted processes can issue divide operations to find out whether **other** cores are trying to use the (shared) divide ALU resource. Likewise, if there is limited bandwidth on operand forwarding, an arbitrary untrusted process may issue a series of instructions specifically designed to be chained together so as to trigger operand forwarding and use up all the available bandwidth of the Operand Forwarding Bus; if the completion time is not as expected, the attacker knows that another process tried to use the same Bus.

We think we have a solution to this: a "Speculation Fence" instruction (or "hint", as they are known). The idea is, before an arbitrary untrusted process is permitted to run, to call a special instruction that *clears the decks*, resetting the Out-of-Order execution engine back to a known, quiescent state. Thus, there *is* no information to leak to the attacker. We will also need all system calls, traps and interrupts to automatically be speculation fence points. We can also look at doing a "graded" shutdown of speculation and resource allocation, on the basis that if it is known in advance that a system call is coming up, there is no point issuing speculative instructions or tying up out-of-order resources if they are about to be cancelled within 5-10 instructions!

The alternatives... well, they don't work. A software-only solution ("fixing" Spectre in the Linux kernel) has become so complicated and has so badly affected performance that Linus Torvalds recently put his foot down and refused to allow "yet another Spectre patch". A hardware-only solution *also* isn't good enough, as it basically involves degrading performance back to that of a **single-issue in-order** machine. The "cooperative" approach, we feel, is a reasonable compromise that is also simple and straightforward to implement in both hardware and software. It will be a lot of work; however, at least we can put the underpinnings in place (in the hardware).

# 48-bit Instruction Extension

Jacob raised an idea to do [extension prefixes](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000316.html) on Simple-V. It's a really good idea, one that I was hoping would not be necessary. It comes down to the fact that the setup and teardown of the Vectorisation Engine takes a bit more than was anticipated. So the plan is to have a couple of prefixes, one 16-bit and one 32-bit, that "extend" both Compressed (16-bit) and standard (32-bit) instructions, turning them into "one-off" Vector Instructions.

There are two problems. Firstly, this extends the instruction encoding, which in turn complicates the instruction decode phase. Secondly, we may have to use the 48-bit encoding space, which takes up a whopping six of the available 16 bits, and that in turn puts a huge amount of pressure on what can actually be extended. For example: if 2 bits are allocated to extend each 5-bit register number out to 7 bits, that allows us to access the full 128-register integer and FP range needed for a GPU and VPU (a rough sketch of the idea follows below).
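As a minimal sketch of that register-number extension (the actual prefix field positions are not yet decided, so everything about the layout here is purely hypothetical), 2 prefix bits simply become the top bits of a 7-bit register number:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch only: combine a 2-bit prefix field with the
 * standard 5-bit register field to form a 7-bit register number
 * (0-127).  Where the 2 prefix bits actually live is undecided. */
static uint8_t extend_regnum(uint8_t prefix2, uint8_t reg5)
{
    return (uint8_t)(((prefix2 & 0x3u) << 5) | (reg5 & 0x1fu));
}

int main(void)
{
    /* e.g. prefix bits 0b10 on top of base register x17 -> register 81 */
    printf("extended register = %u\n", extend_regnum(0x2, 17));
    return 0;
}
```

The extension itself is simple enough; the trouble, as the bit-counting below shows, is how many of these fields have to fit into the prefix at once.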
Unfortunately, we need 2 bits for rs1, 2 bits for rs2, 2 bits for rs3 and 2 bits for rd. That's 8 bits already, and we haven't yet gotten to VL (Vector Length), the element width (setting 8/16/32/64 bit), or predication. Also, if doing a 32-bit prefix, the result actually needs to be either a 48-bit encoding or a 64-bit encoding, depending on whether a 16-bit "Compressed" instruction or a 32-bit standard instruction is being prefixed.

There is an alternative: for the 16-bit prefix, there happens to be a Compressed major opcode that is not being used (bits 13-15 equal to 100, bits 0-1 equal to 00). This gives 11 bits spare (where a 48-bit encoding can only squeeze out 8, maybe 9). It also has one significant advantage: as it is actually a standard "C" opcode, it can be done as macro-op fusion, which in turn means that the modifications to the compiler toolchain are a lot less significant.

With 12 available bits, things start to look a lot better. For 32-bit opcodes, 2 bits can be prepended to the 5-bit destination, 2 more bits covering all source registers, 2 bits for Vector Length (VL=1/2/3/4), and 2 bits for the element width (8/16/32/64). That leaves 4 spare bits for specifying predication, *or*, when prefixing 16-bit "Compressed" instructions, those bits could instead be used to extend some of the operations that only have 3-bit register fields by another 2 bits.

It's quite complex and is going to need a lot of thought. Some compromises need to be made, the issue being that we won't know what the best choices are until we have a better handle on things, through simulations and comprehensive analysis. Designing processors is tricky!
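To make the bit-budget juggling a little more concrete, here is a rough decode sketch for a 16-bit prefix sitting in the unused Compressed major opcode. Only the opcode check (bits 13-15 = 100, bits 0-1 = 00) comes from the discussion above; every field position in the remaining 11 bits is a placeholder guess, not a decided layout:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fields carried by the 16-bit "vector prefix".
 * None of these positions or widths have been decided yet. */
typedef struct {
    uint8_t rd_ext;   /* 2 bits prepended to the 5-bit destination */
    uint8_t rs_ext;   /* 2 bits covering the source registers      */
    uint8_t vl;       /* 2 bits: Vector Length 1/2/3/4             */
    uint8_t elwidth;  /* 2 bits: element width 8/16/32/64          */
    uint8_t pred;     /* whatever bits are left: predication       */
} vec_prefix_t;

/* Does this 16-bit parcel sit in the unused C major opcode
 * (bits 15-13 == 100, bits 1-0 == 00)? */
static bool is_vec_prefix(uint16_t parcel)
{
    return ((parcel >> 13) & 0x7u) == 0x4u && (parcel & 0x3u) == 0x0u;
}

/* Pull the hypothetical fields out of the 11 free bits (bits 2-12). */
static vec_prefix_t decode_vec_prefix(uint16_t parcel)
{
    vec_prefix_t p;
    p.rd_ext  = (parcel >> 2)  & 0x3u;
    p.rs_ext  = (parcel >> 4)  & 0x3u;
    p.vl      = (parcel >> 6)  & 0x3u;
    p.elwidth = (parcel >> 8)  & 0x3u;
    p.pred    = (parcel >> 10) & 0x7u;  /* only 3 bits remain of the 11 */
    return p;
}

int main(void)
{
    uint16_t parcel = 0x808c;  /* example: opcode 100/00, rd_ext=3, vl=2 */
    if (is_vec_prefix(parcel)) {
        vec_prefix_t p = decode_vec_prefix(parcel);
        printf("rd_ext=%u rs_ext=%u vl=%u elwidth=%u pred=%u\n",
               p.rd_ext, p.rs_ext, p.vl, p.elwidth, p.pred);
    }
    return 0;
}
```

Note that with only 11 free bits in this opcode, giving 2 bits each to the destination extension, the source extension, VL and the element width leaves 3 bits rather than 4 for predication; exactly where that sort of compromise lands is precisely what the simulations and analysis will have to tell us.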