# Out-of-out-of-Order: Unpredictable Predication Predicament

So, early in the development of the Not-So-Simple-V specification it was
identified that it meshed well with a multi-issue microarchitecture.
Recall that the basic principle is that registers are "tagged" through
a CSR table as "vectorised", and, if the Vector Length is greater than 1,
*multiple* (sequential) instructions are issued to the pipeline (where
one would normally be sent), *without* increasing the Program Counter,
the difference between these otherwise identical instructions being that
the source (and/or) destination registers are incremented continuously by
one on each loop.
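
As a rough illustration of that principle, here is a minimal Python sketch. It is not taken from the specification: the `ElementOp` record, the `expand_vectorised` function and the `vectorised` set (standing in for the CSR table of tagged registers) are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementOp:
    opcode: str
    rd: int
    rs1: int
    rs2: int
    pc: int  # identical for every element: the Program Counter is not increased


def expand_vectorised(opcode, rd, rs1, rs2, pc, vl, vectorised):
    """Expand one instruction into element ops (hypothetical helper).

    `vectorised` is an assumed stand-in for the CSR table: the set of
    register numbers that have been tagged as vectors.
    """
    # Nothing tagged, or VL of 1: just an ordinary scalar instruction.
    if vl <= 1 or not any(r in vectorised for r in (rd, rs1, rs2)):
        return [ElementOp(opcode, rd, rs1, rs2, pc)]

    def step(reg, i):
        # Only tagged registers are incremented; scalar operands stay put.
        return reg + i if reg in vectorised else reg

    return [
        ElementOp(opcode, step(rd, i), step(rs1, i), step(rs2, i), pc)
        for i in range(vl)
    ]


ops = expand_vectorised("add", rd=8, rs1=16, rs2=5, pc=0x100, vl=4,
                        vectorised={8, 16})
for op in ops:
    print(op)
```

Here x8 and x16 are tagged as vectors and x5 is left scalar, so four `add` operations come out, each with the same PC.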

The nice thing about a multi-issue microarchitecture is that it is very
simple to drop these element-based otherwise-identical instructions directly
into the instruction FIFO.  What is even nicer is: when predication is
introduced, all that needs to be done is that when the relevant element
predicate bit is clear, the associated element-based instruction is
**not** placed into the multi-issue instruction FIFO.
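
A minimal sketch of that filtering step, assuming (and this is the big assumption) that the predicate value is already known at the moment of issue. The function name `issue_predicated` and the string labels for the element operations are invented for illustration only.

```python
from collections import deque


def issue_predicated(element_ops, predicate, fifo):
    """Append only the element ops whose predicate bit is set.

    Bit i of `predicate` corresponds to element i. Elements whose bit is
    clear never enter the FIFO at all, so they cost nothing downstream.
    """
    for i, op in enumerate(element_ops):
        if (predicate >> i) & 1:
            fifo.append(op)
    return fifo


fifo = issue_predicated(["add.e0", "add.e1", "add.e2", "add.e3"],
                        predicate=0b1010, fifo=deque())
print(list(fifo))
```

With predicate `0b1010`, only elements 1 and 3 are queued.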

Simple, right?  Couldn't be easier.

The problem is: the predicate, the source and the destination registers
can all come from the *same register file*.  So, one instruction may modify
an integer register that the *next instruction* uses as its predicate.
That creates a read-after-write hazard that has to be dealt with, as the
predicated (Vectorised) instruction simply cannot be allowed to proceed
until the instruction that is calculating its predicate has actually
completed.  That means that in this particular out-of-order architecture,
the instruction issue phase **itself has to become a Function Unit**.

Let me repeat that: the instruction issue phase that deals with
predication **itself** has to have its own scoreboard Dependency Matrix entry.
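
To make the hazard concrete, here is a deliberately over-simplified sketch. The real scoreboard uses FU-to-FU and FU-to-Register matrices of single-bit cells; the `Scoreboard` class below, with its dictionary of pending writes, is purely a hypothetical model for illustration. The point is only that the predicated-issue stage has to ask the same "is my source register ready?" question as any other Function Unit.

```python
class Scoreboard:
    """Tracks which Function Unit has an outstanding write to each register."""

    def __init__(self):
        self.pending_write = {}  # register number -> name of the FU writing it

    def issue(self, fu, dest):
        # Record that `fu` will, at some point, write to register `dest`.
        self.pending_write[dest] = fu

    def complete(self, fu):
        # The FU has written its result: drop all of its pending writes.
        self.pending_write = {
            reg: owner for reg, owner in self.pending_write.items()
            if owner != fu
        }

    def can_read(self, reg):
        # A reader (including the predicated-issue FU) must wait while any
        # earlier instruction still has a write to `reg` outstanding.
        return reg not in self.pending_write


sb = Scoreboard()
sb.issue("ALU0", dest=3)       # earlier instruction computes the predicate in x3
blocked = not sb.can_read(3)   # predicated issue wants x3 as its predicate
sb.complete("ALU0")
ready = sb.can_read(3)
print(blocked, ready)
```

The issue stage is blocked until ALU0 completes, which is exactly the behaviour of a Function Unit waiting on a source register.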

This brings some quite fascinating (read: scary) challenges and opportunities.
If handled incorrectly, it means that the entire idea of using a multi-issue
instruction FIFO is toast, as there will be guaranteed stalling whenever
a predicated vectorised instruction is encountered.

Normally, a multi-issue engine has a guaranteed regular number of instructions
to process and place in the queue.  Even branches do not stop the flow
of placement into the FIFO, as branch prediction (speculative execution) can
guess with high accuracy where the branch will go.  Predicated vectorised
instruction issue is completely different: we have *no idea* - in advance -
if the issued element-based instruction is actually going to be executed
or not.  We do not have the predicate source register (yet) because it
hasn't been calculated, because the prior instruction (which is being
executed out-of-order, and is **itself** dependent on prior instruction
completion) hasn't even been started yet.

Perhaps - thinking out loud - it would be okay to have a place-holder,
waiting for the predicate bits to arrive.  Perhaps it is as simple as
adding an extra source register (predicate source) to every single Function
Unit.  So instead of each Function Unit having src1 and src2, it has 
src1, src2, predicate "bit".  Given that it is just a single bit that each
Function Unit would be waiting for, it does seem somewhat gratuitous,
and a huge complication of an otherwise extremely simple scoreboard
(at present, there are no CAMs and no multi-wire I/Os in any of the
cells of either the FU-to-FU Matrix or the FU-to-Register Dependency Matrix).
Therefore, having **separate** Function Unit(s) which wait for the
predication register to be available, that are themselves plumbed in to
the actual Scoreboard system, decoding and issuing further instructions only
once the predicate register is ready, seems to be a reasonable avenue to
explore.

However, the last thing that we need is to stall execution completely,
so a lot more thought is going to be needed.  The nice thing about having
a predicated vectorisation "Issue" Function Unit is: some of the more
complex decoding (particularly REMAP) can hypothetically be pipelined.
Unfortunately, that is **guaranteed** to result in stalled execution, as
the out-of-order system critically depends on knowing what the
dependencies **are**!  Perhaps it may be possible to put in temporary
"blank" entries that are filled in later?  That is: issue place-holder
instructions into the Dependency Matrix, where the registers on which
the instruction depends only become known at a later date?

The reason this needs to be considered: remember that the whole basis of
Simple-V is that you issue multiple *sequential* instructions.  Unfortunately,
REMAP jumbles up the word "sequential" using a 1D/2D/3D/offset algorithm,
such that a calculation has to be performed in order to determine which
actual register (or part-register in the case of 8/16/32-bit element
widths) is to be used.  And, secondly, predication can entirely
skip some of those element-based instructions!
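
To give a flavour of the kind of calculation involved, here is a sketch. It is emphatically **not** the actual REMAP algorithm: `remap_index` is a made-up 2D example (a transposed walk over a row-major array, plus an offset), and `locate` assumes 64-bit registers. Both names and the register width are assumptions for illustration.

```python
def remap_index(i, xdim, ydim, offset=0, transpose=False):
    """Map sequential element number i to an element index (hypothetical).

    The data is assumed to be laid out row-major with `xdim` columns.
    With transpose=True the elements are visited column by column.
    """
    if transpose:
        col, row = divmod(i, ydim)
        idx = row * xdim + col
    else:
        idx = i
    return offset + idx


def locate(base_reg, idx, elwidth, regwidth=64):
    """Turn an element index into (register number, lane within register).

    regwidth=64 is an assumption; with elwidth below regwidth, several
    elements share one register (the "part-register" case).
    """
    per_reg = regwidth // elwidth
    return base_reg + idx // per_reg, idx % per_reg


order = [remap_index(i, xdim=3, ydim=2, transpose=True) for i in range(6)]
targets = [locate(8, idx, elwidth=16) for idx in order]
print(order)
print(targets)
```

Even in this toy version, the register that element 3 lands in cannot be read off from "base plus 3": it has to be computed, and that computation sits on the critical path of working out dependencies.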

Talk about complex!  Simple-V is supposed to be simple!  No wonder
chip designers go for SIMD and let the software sort out the mess...

# Placeholder instructions: predication shadow

Recall from earlier updates that Mitch Alsup describes, in two unpublished
book chapters, some augmentations and modernisations to the 6600 Scoreboard
system, providing speculative branch execution as well as precise exceptions.
Both are based on the same idea of adding a "schroedinger" wire that may
be used to kill off future instructions, alongside an additional
**non-register-based** Write Hazard dependency that prevents register
writes from committing, **without** preventing the instruction from actually
calculating the result that is to be written (once or if permitted).

Mentioned above is the idea of issuing "place-holder" instructions.  These
are basically instructions which are waiting for their relevant predicate
bit to become *available*.  They could hypothetically actually still be
executed (or at least begin execution).  They would however **not** be
permitted to commit the results to the register file, and they would be
"shadowed" by the above-proposed "Predication Calculating Function Unit".

This ineptly-named Function Unit would have the relevant predication register
as its src, just like any other Function Unit with dependent source registers.
It would similarly have a "schroedinger" wire, and it would similarly
cast a write-block shadow over the Vectorised instructions that were waiting
for predication bits.

Once the predicate register is available, the Predicate-computing FU would
begin "farming out" individual bits of the predicate.  For those Vectorised
instructions whose associated predicate bit is zero, it would raise "Go\_Die"
schroedinger signals (or, when zeroing is enabled, turn them into
"zero result" instructions); for those whose predicate bit is set, it
would cancel the write-block shadow.
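
The decision made for each place-holder can be sketched as follows. The `State` names and the `resolve_predicate` function are invented for this illustration; in hardware these would be wires (Go\_Die, shadow release), not an enum.

```python
from enum import Enum


class State(Enum):
    SHADOWED = "shadowed"    # may execute, but may not commit its result
    COMMIT_OK = "commit_ok"  # shadow cancelled: the write may go ahead
    KILLED = "killed"        # Go_Die raised: the result is thrown away
    ZEROED = "zeroed"        # zeroing enabled: becomes a "zero result" op


def resolve_predicate(placeholders, predicate, zeroing=False):
    """Resolve every shadowed place-holder once the predicate is known."""
    resolved = []
    for i, state in enumerate(placeholders):
        if state is not State.SHADOWED:
            resolved.append(state)  # already resolved: leave untouched
        elif (predicate >> i) & 1:
            resolved.append(State.COMMIT_OK)
        elif zeroing:
            resolved.append(State.ZEROED)
        else:
            resolved.append(State.KILLED)
    return resolved


placeholders = [State.SHADOWED] * 4
killed = resolve_predicate(placeholders, predicate=0b0110)
zeroed = resolve_predicate(placeholders, predicate=0b0110, zeroing=True)
print([s.value for s in killed])
print([s.value for s in zeroed])
```

With predicate `0b0110`, elements 1 and 2 are released to commit, and elements 0 and 3 are either killed or turned into zero-result operations.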

Whether this is a wise utilisation of resources is another matter.  If
routinely 50% or fewer of the predicate bits are set, a significant portion
of the Vectorised Function Units could hypothetically be calculating results
that are *known* to be discarded almost immediately.  Also, the whole point
of the exercise of using a multi-issue execution engine was to save
resources, by not allocating instructions *at all* where the predication bit
for that Vectorised operation is zero.
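
The cost is easy to put a number on. A small sketch (the function name `shadow_cost` is made up for this example) comparing how many Function Units are occupied under the shadow scheme with how many would be occupied if masked-out elements were never issued:

```python
def shadow_cost(predicate, vl):
    """Return (occupied, useful, wasted) element slots for one vector op.

    Under the shadow scheme all `vl` elements are allocated up front;
    only those whose predicate bit is set produce a result that is kept.
    """
    mask = (1 << vl) - 1
    useful = bin(predicate & mask).count("1")
    return vl, useful, vl - useful


occupied, useful, wasted = shadow_cost(0b01010001, vl=8)
print(occupied, useful, wasted)
```

For that example predicate, eight Function Units are tied up in order to produce three results that are kept: five of them do work that is discarded.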

However, it is better than the alternatives, and it makes it possible to
keep to a multi-issue micro-architecture, which is important in
order to achieve the target performance.  Ultimately, simulations will tell
us, better than guessing will, whether the GPU and VPU workloads involve
significant predication: we'll just have to see how it goes.