\section{Summary}
The proposed \acs{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that
ONLY uses scalar instructions}.

\begin{itemize}
\itemsep 0em
\item The Power \acs{ISA} v3.1 Specification is not altered in any way:
v3.1 Code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Does not require adding duplicates of instructions
(popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd).
\item Fully abstracted: does not create Micro-architectural dependencies
(no fixed "Lane" size); one binary works across all existing
\textit{and future} implementations.
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level requires only five instructions
(SVE2 requires approximately 9,000, \acs{AVX-512} around 10,000, and
\acs{RVV} around 300).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full \acs{DCT} and \acs{FFT} RADIX2 Triple-loops are achieved with
dramatically reduced instruction count, and power consumption is expected
to be greatly reduced. Such capability is normally found only in high-end
\acs{VLIW} \acs{DSP}s (TI MSP, Qualcomm Hexagon).
\item Fail-First Load/Store allows a Vectorised high-performance
strncpy to be implemented in around 14
instructions (hand-optimised \acs{VSX} assembler is 240).
\item The inner loop of MP3 decoding is implemented in under 100
instructions (gcc produces 450 for the same function on POWER9).
\end{itemize}

All areas investigated so far consistently showed reductions in executable
size which, as outlined in \cite{SIMD_HARM}, indirectly reduce
power consumption through lower I-Cache/TLB pressure and through the
Issue stage otherwise remaining idle for long periods.
Simple-V has been specifically and carefully crafted to respect
the Power ISA's Supercomputing pedigree, and very specifically crafted
to fit on top of both simple single-issue and complex multi-issue
Superscalar Micro-Architectures.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{power_pipelines.png}
\caption{Showing how SV fits in between Decode and Issue}
\label{fig:power_pipelines}
\end{figure}

\pagebreak

\subsection{What is SIMD?}

\acs{SIMD} is a way of partitioning existing \acs{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}.
These partitions can then be operated on simultaneously, with the initial
values and results stored in entire 64-bit registers (\acs{SWAR}).
The SIMD instruction opcode
includes the data width and the operation to perform.
\par

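The \acs{SWAR} idea can be sketched in plain C: eight independent 8-bit
additions performed inside one 64-bit register, with the carries masked
off so that they cannot spill between partitions. This is an illustrative
sketch of the technique only, not Power ISA code:

```c
#include <stdint.h>

/* SWAR: add eight 8-bit lanes packed into a 64-bit word.
   The top bit of each lane is masked off before the add so
   that no carry can propagate into the neighbouring lane;
   it is then restored with XOR. */
uint64_t swar_add8(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8080808080808080ULL;  /* lane top bits */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}
```

Note how each lane wraps around independently: adding 1 to a lane holding
0xFF yields 0x00 in that lane without disturbing its neighbour, which a
plain 64-bit add would not guarantee.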
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{simd_axb.png}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}

This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.) \cite{SIMD_WASM},
and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
\par

\subsection{Shortfalls of SIMD}
SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit, etc.) to gain more partitions.\par Additionally,
binary compatibility is an important feature, and thus each doubling
of SIMD register width also expands the instruction set. The number of
instructions quickly balloons, as can be seen, for example, in
IA-32 expanding from 80 to about 1,400 instructions since
the 1970s \cite{SIMD_HARM}.\par

Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
\itemsep 0em
\item Hardware design, ASIC routing, etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
Multi-issue decoding
\end{itemize}

\subsection{Scalable Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par

A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}.
Vector ISAs are specifically designed to deal in hardware with fringe
cases where an algorithm's element count is not a multiple of the
underlying hardware "Lane" width. The element data width
is variable (8- to 64-bit, just as in SIMD),
but it is the \textit{number} of elements being
variable, under the control of a "setvl" instruction, that specifically
makes Vector ISAs "Scalable".
\par

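The "setvl" mechanism can be modelled as a strip-mined loop: each pass,
the hardware clamps the requested element count to what it can process,
so one binary handles any element count, including the fringe cases.
A minimal C sketch, where \texttt{MAX\_VL} and the function names are
illustrative rather than taken from any specification:

```c
#include <stddef.h>

/* Illustrative maximum elements per pass; real hardware
   reports its own implementation-specific limit. */
enum { MAX_VL = 8 };

/* Model of "setvl": clamp the remaining count to MAX_VL. */
static size_t setvl(size_t remaining) {
    return remaining < MAX_VL ? remaining : MAX_VL;
}

/* Strip-mined vector add: handles any n, including fringe
   cases where n is not a multiple of the hardware width. */
void vec_add(const int *a, const int *b, int *out, size_t n) {
    while (n > 0) {
        size_t vl = setvl(n);            /* elements this pass */
        for (size_t i = 0; i < vl; i++)  /* one "vector" op    */
            out[i] = a[i] + b[i];
        a += vl; b += vl; out += vl; n -= vl;
    }
}
```

With n = 10 and MAX\_VL = 8, the loop runs two passes (8 elements, then 2)
with no scalar clean-up code required.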
\acs{RVV} supports a VL of up to $2^{16}$ ($65536$) bits,
which can fit 1024 64-bit words \cite{riscv-v-spec}.
The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each).
An early Draft of RVV supported overlaying the Vector Registers onto the
Floating Point registers, similar to \acs{MMX}.

\begin{figure}[ht]
\centering
\includegraphics[width=0.6\linewidth]{cray_vector_regs.png}
\caption{Cray Vector registers: 8 registers, 64 elements each}
\label{fig:cray_vector_regs}
\end{figure}

Simple-V's "Vector" Registers (a misnomer) are specifically designed to fit
on top of the Scalar (GPR, FPR) register files, which are extended from the
default of 32 entries to 128 at the high-end Compliancy Levels. This is a
primary reason why Simple-V can be added on top of an existing Scalar ISA,
and \textit{in particular} why there is no need to add explicit Vector
Registers or Vector instructions. The diagram below shows \textit{conceptually}
how a Vector's elements are sequentially and linearly mapped onto the
\textit{Scalar} register file:

\begin{figure}[ht]
\centering
\includegraphics[width=0.6\linewidth]{svp64_regs.png}
\caption{Three instructions, same vector length, different element widths}
\label{fig:svp64_regs}
\end{figure}

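This linear mapping can be modelled directly in software: a Vectorised add
with vector length $n$ simply operates on $n$ consecutive entries of the
ordinary Scalar register file. A conceptual C sketch, where the register-file
size and function name are illustrative:

```c
#include <stdint.h>

#define NREGS 128  /* GPR count at the high-end Compliancy Levels */

uint64_t gpr[NREGS];  /* the ordinary Scalar register file */

/* A Vectorised "add RT,RA,RB" with vector length vl behaves as
   vl scalar adds on consecutive Scalar registers: no separate
   Vector register file exists. */
void sv_add(unsigned rt, unsigned ra, unsigned rb, unsigned vl) {
    for (unsigned i = 0; i < vl; i++)
        gpr[rt + i] = gpr[ra + i] + gpr[rb + i];
}
```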
\pagebreak

\subsection{Simple Vectorisation}
\acs{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style
Supercomputers (Cray-1, NEC SX-Aurora) and GPUs, yet keeps to a strict
uniform RISC paradigm, leveraging a scalar ISA by means of "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.
SVP64 uses 25\% of the Power ISA v3.1 64-bit Prefix space (EXT001) to create
the SV Vectorisation Context for the 32-bit Scalar Suffix.

\vspace{10pt}
Main design principles:
\begin{itemize}
\itemsep 0em
\item Introduced by implementing on top of the existing Power ISA
\item Effectively a \textbf{hardware for-loop}: pauses the main PC and
issues multiple scalar operations
\item Strictly preserves (leverages) the underlying scalar execution
dependencies, as if
the for-loop had been expanded into actual scalar instructions
("preserving Program Order")
\item Augments existing instructions by adding "tags", providing
Vectorisation "context" rather than adding new opcodes
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance multi-issue superscalar microarchitectures} to be
leveraged
\end{itemize}

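The "hardware for-loop" principle can be pictured as a decode-stage
expansion: the Program Counter is held while the vector length's worth of
element operations are issued in strict Program Order, exactly as if the
loop had been written out as scalar instructions. A hypothetical sketch,
with the structure and field names invented purely for illustration:

```c
/* Hypothetical element operation: one scalar instruction with
   its register numbers resolved for a given element index. */
typedef struct { unsigned rt, ra, rb; } ScalarOp;

/* Expand one prefixed (Vectorised) instruction into vl element
   operations on consecutive registers, in strict Program Order.
   Conceptually, the PC advances only after all vl operations
   have been issued, so scalar dependency rules apply unchanged.
   Returns the number of element operations emitted. */
unsigned expand(ScalarOp op, unsigned vl, ScalarOp *out) {
    for (unsigned i = 0; i < vl; i++) {
        out[i].rt = op.rt + i;
        out[i].ra = op.ra + i;
        out[i].rb = op.rb + i;
    }
    return vl;
}
```

Because each emitted operation is an ordinary scalar instruction, an
existing multi-issue scheduler can track its hazards with no new machinery.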
Advantages include:
\begin{itemize}
\itemsep 0em
\item Easy to create a first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers
\item Obliterates SIMD opcode proliferation
($O(N^6)$) as well as dedicated Vectorisation
ISAs: no more separate vector instructions
\item Reduces maintenance overhead (no separate Vector instructions);
adding any new Scalar instruction
\textit{automatically adds a Vectorised version of the same}
\item Easier for compilers, coders, and documentation
\end{itemize}